XFS: Possible Memory Allocation Deadlock in kmem_alloc

10.4.2017 | 10 minutes reading time

A few weeks ago we were surprised by seemingly random I/O hangs on several virtual machines. Any attempt to write to their data volumes blocked, sending the load average into the stratosphere and, slightly more consequentially, making Elasticsearch and MongoDB freak out.

Looking at the hypervisor’s logs I noticed lots of these messages in dmesg and syslog:


...
Mar  5 22:42:57 node06 kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
Mar  5 22:42:59 node06 kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
...

This post describes the analysis, cause and remedy of the problem.

First Things First: A Quick Fix

The highest priority was to get the databases in the VMs running again. Google turned up a recommendation to drop the page cache on the hypervisor, because somehow the problem seemed to be memory fragmentation related and could be resolved this way at least temporarily:

% echo 1 > /proc/sys/vm/drop_caches

Indeed, doing that immediately made the VMs responsive again. In one case, Elasticsearch needed to be restarted inside the VM, but the cluster quickly recovered. However, in the following days the problem kept recurring, with increasing frequency and on several hypervisors.

Kernel Update to the Rescue (?)

I noticed we were running a kernel with a known XFS problem that made it more wasteful with memory than necessary for certain operations (see this 2013 XFS Mailing List Post). According to this Ubuntu Launchpad Issue, that particular problem was fixed in a later Ubuntu kernel, so we installed the most recent one on the hypervisors. For a few days this seemed to fix the issue, because we did not see the hangs again. Unfortunately, they still came back, just a little later.

Intermediate measure: the xfs-guard

Doing more research, I consistently came across the topic of memory fragmentation, without understanding why it could become a problem all of a sudden: the machines had run with unchanged configurations, both inside the VMs and on the hypervisor, for several months. (It actually made sense after all, but at this point I had not yet understood the underlying mechanisms.) It turned out that dropping the page cache was not really necessary; dropping the slab cache (echo 2 > /proc/sys/vm/drop_caches) was enough to get the VMs responsive. I think dropping the page cache also worked via a side effect of freeing enough (unrelated) memory for more slab allocations to succeed.

Knowing that this worked, and to buy myself some time, I quickly threw together a small “xfs-guard” daemon and installed it on the hypervisors. Basically, all it does is tail the syslog constantly, look for the error message, and drop the slab cache repeatedly until the message stops appearing. You can find it on GitHub and on Ansible Galaxy.
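The watchdog idea described above can be sketched in a few lines of shell. This is not the actual xfs-guard code, just a minimal illustration under stated assumptions: the function names (`is_xfs_deadlock_line`, `guard_loop`, `drop_slab_cache`) are made up for this example, and the log path in the usage comment is assumed.

```shell
#!/bin/sh
# Hypothetical sketch of the watchdog idea; NOT the actual xfs-guard code.

# Succeeds (exit status 0) if the given log line is the XFS deadlock warning.
is_xfs_deadlock_line() {
    case "$1" in
        *"XFS: possible memory allocation deadlock"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads log lines from stdin and runs the given command once per matching line.
# The command is passed in by name so the loop can be tried without root.
guard_loop() {
    drop_cmd="$1"
    while IFS= read -r line; do
        if is_xfs_deadlock_line "$line"; then
            "$drop_cmd"
        fi
    done
}

# Frees reclaimable slab objects (dentries and inodes); requires root.
drop_slab_cache() {
    echo 2 > /proc/sys/vm/drop_caches
}

# Example invocation (assumed log path, must run as root):
#   tail -F /var/log/syslog | guard_loop drop_slab_cache
```

Because the kernel keeps logging the warning while the condition persists, running the drop once per matching line has the effect of repeating it until the message stops. A real daemon would probably want some rate limiting on top.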

Once that was deployed (and not a moment too soon: it kicked in for the first time just a few hours after rolling it out) I had a little more time to find the root cause.

Digging deeper

Reading further, everything kept pointing towards file system fragmentation and, in turn, (slab) memory fragmentation inside the kernel. It reminded me of the 90s, when running something like Norton Speed Disk to defragment FAT file systems was a pretty common task. I was reluctant to believe that this could still be a problem almost 30 years later, especially because XFS has a reputation for being pretty good at keeping fragmentation under control.

Remember: “Norton Disk Doctor and Norton Speed Disk aren't the same thing” pic.twitter.com/TSNgCpxobN (Anatoly Shashkin, @dosnostalgic, August 8, 2016)

XFS will, for example, speculatively pre-allocate more space on disk than is actually requested, assuming that more related data will follow soon afterwards. This way it can place that data right next to the first chunk, instead of potentially having to put it elsewhere on disk. This, and the fact that the file system in question only held very few, very large image files, provided to the VMs as block devices, made the fragmentation theory seem even more unlikely. I needed more data.

File systems must keep track of the allocated and free parts of a disk. High disk fragmentation results in lots of small pieces of occupied and free space that need to be tracked. Because file systems are part of the kernel, the necessary metadata structures are kept in kernel memory, which is managed by the slab allocator. Even with the most efficient data structures, at some point it becomes impossible to accommodate the metadata for an ever-increasing number of disk space fragments. When that happens, slab memory allocation fails, and the file system blocks writes altogether.

Fortunately, XFS provides some powerful tools to analyse and diagnose what it is doing under the hood. My first step was to run the XFS debugger “xfs_db” and tell it to report some high level information on overall fragmentation (frag -f). Notice that all commands mentioned in this post can run with the file system mounted and operating normally, which is pretty neat. They will, however, create additional I/O and at least temporarily consume some memory.

% xfs_db -r -c "frag -f" /dev/md0p1
actual 42561387, ideal 1089, fragmentation factor 100.00%

After about 30s, it came back with the above output. 100% looks pretty scary, but according to the XFS documentation, the “fragmentation factor” can be misleading, as it tends to approach 100% quickly, without necessarily indicating a serious problem. There is a nice graph in this XFS FAQ entry explaining the details. The output shows the number of “extents” (individual contiguous ranges of data belonging to files); both the actual number and what XFS considers ideal. Even if seeing the fragmentation factor alone was not strictly enough, having more than 40 million pieces of data seemed pretty hefty, especially compared to the much smaller ideal value.
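To see why the factor saturates so quickly, it helps to redo the arithmetic. As far as I can tell, xfs_db derives it as (actual - ideal) / actual, expressed in percent; treat that formula as an assumption rather than a quote from the documentation. The helper name `frag_factor` is made up for this sketch.

```shell
# Computes the fragmentation factor in percent from the actual and the ideal
# extent count, assuming the formula (actual - ideal) / actual * 100.
frag_factor() {
    awk -v actual="$1" -v ideal="$2" \
        'BEGIN { printf "%.2f%%\n", (actual - ideal) * 100 / actual }'
}

frag_factor 42561387 1089   # values from the output above, prints 100.00%
frag_factor 2000 1000       # twice the ideal extent count, prints 50.00%
frag_factor 10000 1000      # ten times the ideal extent count, prints 90.00%
```

Ten times the ideal number of extents already yields 90%, so the percentage says little about how bad things really are. The absolute extent count is the more telling number.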

So apparently something was wrong, but still more information was needed. Next up, I wanted to figure out how fragmented the free space on the volume was, assuming this might cause writing new data to stall if somehow it could not be made to fit in the free areas.

Allocation Groups and Free Space

I learned that XFS splits up a big volume into “Allocation Groups” which can be considered their own (mostly) separate “sub file systems”. The number of allocation groups depends on how large the total XFS volume is. Their purpose is to allow parallelising file operations. This Novell Knowledge Base page has a nice summary and some helpful scripts to gather information about the free space and its fragmentation per allocation group. The XFS documentation goes into even more technical detail.

I ran a loop across all the hypervisors’ XFS file systems and came up with nothing particularly helpful. There were vast amounts of free and contiguous space in all allocation groups. For completeness, I ran the same tests inside the VMs, which also showed no significant free space fragmentation.
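A loop like the one described above could look roughly like this. It is a sketch, not the script that was actually used: the function names `parse_agcount` and `freesp_per_ag` are invented here, and it assumes that xfs_db prints the superblock field as a line of the form `agcount = N`.

```shell
# Extracts the number from an xfs_db output line such as "agcount = 32".
parse_agcount() {
    awk '$1 == "agcount" { print $3 }'
}

# Prints a free space summary for every allocation group of the given device.
# Opens the device read-only (-r), so it can run on a mounted file system.
freesp_per_ag() {
    dev="$1"
    agcount=$(xfs_db -r -c "sb 0" -c "print agcount" "$dev" | parse_agcount)
    ag=0
    while [ "$ag" -lt "$agcount" ]; do
        echo "=== allocation group $ag ==="
        xfs_db -r -c "freesp -s -a $ag" "$dev"
        ag=$((ag + 1))
    done
}

# Example invocation (requires root):
#   freesp_per_ag /dev/md0p1
```

The `freesp` histogram groups free extents by size, so a healthy allocation group shows most of its free blocks in the largest buckets.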

File fragmentation

Next up was the analysis of the fragmentation of the disk image files themselves. With them having been created with a fixed size and shortly after the creation of the underlying XFS file system, I assumed there could not really be much fragmentation here, either. But as they say…

[Embedded YouTube video]

This is the command used to find the individual extents and their sizes allocated for a particular file on XFS:

% xfs_bmap -v ffc63a70.disk > ffc63a70-bmap-before-defrag.txt

It took a few minutes, despite being run on a pretty fast SSD RAID. While it was going, its resident memory usage ballooned up to about 3 GB. Turns out, it was a good idea to redirect the output to a file:

% wc -l ffc63a70-bmap-before-defrag.txt
34694400 ffc63a70-bmap-before-defrag.txt

Close to 35 million extents just for this single disk image. The .txt file alone was larger than 3 GB! The same general picture turned up for the other VM images across all servers. In total, each hypervisor had around 50 million extents, for just a handful of large files. These are the first few lines of one of the text files:


EXT: FILE-OFFSET     BLOCK-RANGE            AG AG-OFFSET       TOTAL
  0: [0..255]:       2339629568..2339629823 24 (512..767)        256
  1: [256..359]:     2339629208..2339629311 24 (152..255)        104
  2: [360..5135]:    2339715400..2339720175 24 (86344..91119)   4776
  3: [5136..5447]:   2339720256..2339720567 24 (91200..91511)    312
  4: [5448..6143]:   2339721496..2339722191 24 (92440..93135)    696
  5: [6144..7495]:   2339723200..2339724551 24 (94144..95495)   1352
  6: [7496..8159]:   2339725560..2339726223 24 (96504..97167)    664
  7: [8160..9167]:   2339727232..2339728239 24 (98176..99183)   1008
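To get a feeling for how small these extents are, the listing can be condensed into a few numbers. The following sketch assumes the column layout shown above, with TOTAL as the sixth field and counted in 512-byte blocks; the function name `summarize_bmap` is made up for this example.

```shell
# Summarizes `xfs_bmap -v` output read from stdin: number of extents,
# total size and average extent size in KiB. Holes are skipped.
summarize_bmap() {
    awk '
        # Extent lines start with an index such as "0:"; headers do not.
        $1 ~ /^[0-9]+:$/ && $3 != "hole" {
            extents++
            blocks += $6           # TOTAL column, in 512-byte blocks
        }
        END {
            if (extents == 0) { print "no extents found"; exit }
            printf "extents=%d total_kib=%d avg_kib=%.1f\n",
                   extents, blocks / 2, blocks / 2 / extents
        }
    '
}

# Example invocation:
#   xfs_bmap -v ffc63a70.disk | summarize_bmap
```

For the eight extents listed above, the TOTAL column adds up to 9168 blocks, so this would report 4584 KiB in total and an average of 573 KiB per extent.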

Apparently this heavy fragmentation into an immense number of tiny extents is a side effect of how qemu writes to these images. It appears similar to what is described in this 2014 XFS mailing list entry.

Aggressive flushing or direct writes defeat XFS’s fragmentation prevention features. The fact that the image files were allocated en bloc right after the file system was created does not matter in this case, because XFS apparently always uses sparse allocations. That means it does not claim all the space immediately, but only dynamically when actual writes happen. Whether we can do anything about this without sacrificing consistency in case of machine crashes remains a problem to be solved another day.

With this in mind it’s understandable why the problems began only quite some time after the creation of the disk images. Up until then, XFS had dutifully managed the ever-growing number of extents per file, up to a tipping point where the memory needed to do so started to run out regularly. So even with xfs-guard in place, the constant write activity inside the virtual machines would have just delayed the inevitable.

Defragmenting the disk images

XFS comes with a file system reorganizer tool, “xfs_fsr”, which can work on individual files (but also on whole volumes). It tries to create a (less fragmented) copy of a file and replaces the original with that copy once it is done. This operation temporarily requires enough free space, ideally largely contiguous, to create the duplicate. Luckily, as determined earlier, we had large areas of unfragmented free space available. Moreover, a file cannot be in use while the reorganization takes place. So on a weekend I shut down the relevant VMs, one at a time, making their disk images available to xfs_fsr. This is a sample run for one disk image:

% time xfs_fsr -v ffc63a70.disk | tee ffc63a70.disk-xfs_fsr-output.txt
ffc63a70.disk extents before:34694398 after:81
xfs_fsr -v ffc63a70.disk   0.13s user 732.62s system 55% cpu 22:01.67 total

For this particular image, of the originally 34 million extents, only 81 remained after the defragmentation. I can see how that is easier to track. The numbers for all other VM images were similar: every time, the number of extents dropped by several orders of magnitude. Note that due to the way the reorganizer works, it will cause heavy I/O, so you might want to run this during off-hours, if your usage patterns allow it.

Conclusion and next steps

The memory allocation deadlocks, which had started to happen almost daily, have not occurred again. In addition to the xfs-guard watchdog, we are now preparing a script to report the number of extents per disk image to our monitoring system regularly. So even if we can’t find a way to prevent qemu from causing the fragmentation to grow, we can keep an eye on it and schedule defrags before it reaches critical levels again.
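Such a monitoring check could be as simple as the following sketch. It is not the script we ended up with. The function names and the thresholds in the usage comment are purely hypothetical, and the Nagios-style exit codes are an assumption about the monitoring system. It also assumes that plain `xfs_bmap` output consists of one header line with the file name, followed by one line per extent or hole.

```shell
# Counts the extents of a file, skipping the file name header and holes.
count_extents() {
    xfs_bmap "$1" | awk 'NR > 1 && $3 != "hole" { n++ } END { print n + 0 }'
}

# Prints a status line for one disk image and returns a Nagios-style
# exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
check_extents() {
    image="$1"; count="$2"; warn="$3"; crit="$4"
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $image has $count extents"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $image has $count extents"
        return 1
    fi
    echo "OK - $image has $count extents"
    return 0
}

# Example invocation with made-up thresholds:
#   check_extents ffc63a70.disk "$(count_extents ffc63a70.disk)" 1000000 10000000
```

Keep in mind that walking the extent list of a heavily fragmented image takes minutes and a lot of memory, as seen earlier, so a check like this should not run too frequently.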

Interestingly, shortly after the VMs were restarted, I re-checked the fragmentation manually. One of the images had already reached over 40,000 extents again. However, a few more samples taken after a couple of days showed that after the rather steep initial increase, the growth seemed to taper off. Still, all the more reason to watch it closely.


Blog author

Daniel Schneller

