An OpenStack Crime Story solved by tcpdump, sysdig, and iostat – Episode 1

14.9.2014 | 5 minutes reading time

This is the story of how the tiniest things can sometimes be the biggest culprits. Because first and foremost, this is a detective story. So come and follow me on a little crime scene investigation that uncovered an incredibly counterintuitive and almost criminal default setting in Ubuntu 14.04 crippling the virtual network performance and especially network latency.

And just in case you don’t like detective stories – and this goes out to all you nerds out there – I promise it will be worth your while, because this is also an introduction to keepalived, tcpdump, and sysdig. So sit back and enjoy!

The problems came out of the blue. We are running an OpenStack installation for the next generation CenterDevice cloud.

CenterDevice is a cloud-based document solution and our clients rely on its 24/7 availability. In order to achieve this availability, all our components are multiply redundant. For example, our load balancers are two separate virtual machines running HAProxy. Both instances manage a highly available IP address via keepalived.

This system worked very well. Until both load balancers became the victims of an evil crime. Both virtual machines started flapping their highly available virtual IP address and we were receiving alerting e-mails by the dozens, but there was no obvious change to the system that could have explained this behavior.

And this is how the story begins.

When I was called to investigate this weird behavior and find out what was happening to the poor little load balancers, I started by taking a good close look at the crime scene: keepalived. The following listing shows our master configuration of keepalived on virtual host loadbalancer01 [1].

% sudo cat /etc/keepalived/keepalived.conf
global_defs {
        notification_email {
                some-e-mail-address@veryimportant.people
        }
        notification_email_from loadbalancer01@protecting.the.innocent.local
        smtp_server 10.10.9.8
        smtp_connect_timeout 30
        router_id loadbalancer01
}

vrrp_script chk_haproxy {
        script "killall -0 haproxy"
        interval 2
        weight 2
}

vrrp_instance loadbalancer {
        state MASTER
        interface eth0
        virtual_router_id 205
        priority 101
        smtp_alert
        advert_int 1
        authentication {
                auth_type PASS
                auth_pass totally_secret
        }
        virtual_ipaddress {
                192.168.205.7
        }
        track_script {
                chk_haproxy
        }
}

The configuration on our second load balancer loadbalancer02 looks exactly the same, with notification email and router id changed accordingly as well as a lower priority. This all looked fine to me and it was immediately clear that keepalived.conf was not the one to blame. I needed to figure out a different reason why the two keepalived instances were continuously flapping the virtual IP address.

Now, it is important to understand how VRRP, the protocol keepalived uses to check the availability of its partners, works. All partners continuously exchange keep alive packets via the multicast address vrrp.mcast.net, which resolves to 224.0.0.18. These packets use IP protocol number 112. Only if the backup stops receiving these keep alive packets from the current master does it assume the partner is dead, murdered, or otherwise gone AWOL and take over the virtual IP address, now acting as the new master. If the old master decides to check back in, the virtual IP address is exchanged again.
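How long does a backup wait before it declares the master dead? A minimal sketch, assuming VRRPv2 as specified in RFC 3768 (the auth_type PASS block in the configuration above suggests version 2). The helper name master_down_interval and the backup priority of 100 are assumptions made for illustration. The story only says that loadbalancer02 has a priority lower than 101, and the track_script weight may shift the effective priority on a running system.

```shell
# master_down_interval ADVERT_INT PRIORITY   (hypothetical helper, not part of keepalived)
# VRRPv2, RFC 3768:
#   skew_time            = (256 - priority) / 256
#   master_down_interval = 3 * advert_int + skew_time
master_down_interval() {
    # LC_ALL=C keeps the decimal point independent of the locale
    LC_ALL=C awk -v adv="$1" -v prio="$2" 'BEGIN {
        skew = (256 - prio) / 256
        printf "%.3f\n", 3 * adv + skew
    }'
}

# advert_int 1 as in the listing above, backup priority 100 is an assumption
master_down_interval 1 100   # prints 3.609
```

Under these assumptions, the backup takes over once it has heard nothing from the master for roughly 3.6 seconds, which is why a few delayed advertisements in a row are enough to trigger a switch.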

Since we were observing this exchange back and forth, I suspected the virtual network of unlawfully neglecting its responsibility. I immediately rushed to the terminal and started to monitor the traffic. This might seem easy, but it is far more complex than you think. The figure below, taken from openstack.redhat.com, shows an overview of how Neutron creates virtual networks.

OpenStack Neutron Architecture [https://openstack.redhat.com/Networking_in_too_much_detail]

Since both load balancers were running on different nodes, the VRRP packets traveled from A to J and back to Q. So where to look for the missing packets?

I decided to tail the packets at A, E, and J. Maybe that was where the culprit was hiding. So I started tcpdump on loadbalancer01, the compute node node01, and the network controller control01 looking for missing packets. And there were packets, I tell you. Lots of packets. But I could not see missing packets until I stopped tcpdump:

$ tcpdump -l host vrrp.mcast.net
...
45389 packets captured
699 packets received by filter
127 packets dropped by kernel

Dropped packets by the kernel? Are these our victims? Unfortunately not.

tcpdump uses a little buffer in the kernel to store captured packets. If too many new packets arrive before the user process tcpdump can decode them, the kernel drops them to make room for freshly arriving packets.

Not very helpful when you are trying to find where packets get, well, dropped. But tcpdump has the handy parameter -B, which sets the size of this buffer in units of KiB. I tried again:

$ tcpdump -l -B 10000 host vrrp.mcast.net

Much better, no more dropped packets by the kernel. And now, I saw something. While the VRRP packets were dumped on my screen, I noticed that sometimes there was a lag of more than one second between VRRP keep alives. This struck me as odd. This should not happen, because advert_int 1 in keepalived.conf sets the interval for VRRP packets to one second. I felt that I was on to something, but needed to dig deeper.

I let tcpdump show me the time differences between succeeding packets and hid every line whose delta starts with the expected one second, so that only the irregular gaps remained.

$ tcpdump -i any -l -B 10000 -ttt host vrrp.mcast.net | grep -v '^00:00:01'
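The grep filter above only hides lines whose delta starts with exactly 00:00:01, so deltas just below one second still show up. A slightly stricter variant can be sketched with awk. It assumes that each line begins with the HH:MM:SS.ffffff delta that -ttt prints. The function name gaps_over and the default threshold of 1.5 seconds are made up for this example.

```shell
# gaps_over [THRESHOLD_SECONDS]   (hypothetical helper)
# Reads tcpdump -ttt output on stdin and prints only the lines whose
# leading delta is larger than the threshold (default: 1.5 seconds).
gaps_over() {
    awk -v limit="${1:-1.5}" '
        {
            # the first field is expected to look like 00:00:03.512345
            n = split($1, t, ":")
            if (n != 3) next                 # ignore lines without a delta
            secs = t[1] * 3600 + t[2] * 60 + t[3]
            if (secs > limit) print
        }'
}

# Usage, with the same capture options as above:
#   tcpdump -i any -l -B 10000 -ttt host vrrp.mcast.net | gaps_over 1.5
```

Because awk splits on whitespace, the filter keeps working if a tcpdump version indents the delta, which would defeat the anchored grep pattern.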

Oh my god. Look at the screenshot above [2]. There is a delay of more than 3.5 seconds. Of course loadbalancer02 assumes its partner went MIA.

But wait, the delay already starts on loadbalancer01? I have been fooled! It is not the virtual network. It had to be the master keepalived host! But why? Why should the virtual machine hold packets back? There must be some evil hiding in this machine and I will find and face it…

Stay tuned for our all new next episode of OpenStack Crime Investigation tomorrow on the codecentric Blog.

Footnotes

1. All personal details like IP and e-mail addresses have been changed to protect the innocent.
2. Yes, you see correctly. I am running screen inside tmux. This allows me to keep the screen tabs alive even across logouts.

Blog author

Lukas Pustina
