Elasticsearch Zero Downtime Reindexing – Problems and Solutions

17.9.2014 | 8 minutes reading time

Reindexing Elasticsearch could be so easy. In fact, we shouldn’t have to reindex at all. Why would you? There is dynamic mapping! In this post I will explain why dynamic mapping won’t do you much good, how you can deal with the inevitable errors in your static mapping, what zero downtime reindexing is, and finally how you can deal with the drawbacks of that approach.

Basics: In the end, everyone maps statically anyway.

So what happens when you throw random JSON at Elasticsearch and call it a day? Once Elasticsearch finds out that the given index does not provide a mapping for that kind of data, it will try to derive a new mapping from the data supplied.

So if we throw an article at a fresh Elasticsearch instance with dynamic mapping turned on:

POST blog/articles/1
{
  "author": "Chris",
  "title": "useful Cat facts (III)"
}

Elasticsearch will index it without complaints, because these are obviously string fields:

GET blog/articles/_mapping

{
  "blog": {
    "mappings": {
      "articles": {
        "properties": {
          "author": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}

But we are in the era of Big Data, where input arrives chaotically and without much normalization. Let’s imagine someone comes along and posts a new blog post:

POST blog/articles/2
{
  "Author": "The Dude",
  "Title": "thats just like, your opinion man!"
}

This will be indexed just fine, but our new mapping will look like:

{
  "blog": {
    "mappings": {
      "articles": {
        "properties": {
          "Author": {
            "type": "string"
          },
          "Title": {
            "type": "string"
          },
          "author": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}

Yikes! That’s not what we wanted. Elasticsearch can’t determine whether these fields are “legitimately” different or whether we’ve just been sloppy. So sooner or later (and hopefully sooner) you will start to define a mapping for your data.
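To make this concrete, here is a minimal sketch of what such a static mapping could look like, written as a Python dict that would be sent as the body when creating the index. It assumes the same era of Elasticsearch as the examples above (mapping types and the "string" field type), and it sets "dynamic" to "strict" so that unknown fields are rejected instead of silently added. The helper unknown_fields is purely hypothetical: a client-side pre-check for illustration, not something Elasticsearch provides.

```python
from typing import Any, Dict, List

# Body for creating the index "blog" with an explicit mapping for "articles".
# "dynamic": "strict" tells Elasticsearch to reject documents with unmapped fields.
STATIC_MAPPING: Dict[str, Any] = {
    "mappings": {
        "articles": {
            "dynamic": "strict",
            "properties": {
                "author": {"type": "string"},
                "title": {"type": "string"},
            },
        }
    }
}


def unknown_fields(doc: Dict[str, Any], mapping: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-check: list top-level fields the mapping does not know."""
    known = mapping["mappings"]["articles"]["properties"]
    return sorted(field for field in doc if field not in known)


if __name__ == "__main__":
    sloppy = {"Author": "The Dude", "Title": "thats just like, your opinion man!"}
    print(unknown_fields(sloppy, STATIC_MAPPING))
```

With a mapping like this, the second article from above would no longer add "Author" and "Title" as new fields; the sloppy input surfaces as an error you can handle.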

Your Mapping is most likely wrong

Okay, now that we’ve got the basics out of the way, we get to the more sophisticated problems: what happens when your mapping is wrong? Generally, mapping changes are additive: when you add a field, the newly created Lucene segments will just be bigger from now on, and the old ones are left as they were. Searches for the new field are still applied to the old segments, but will not produce a hit there. Since Lucene never edits a written segment, this bubbles up to Elasticsearch: we cannot change the type of a field after data has been indexed.

We all know that our first guesses when setting things up are most likely not the be-all and end-all, and will need to be revised later on. The very same happens when your Elasticsearch cluster is already in production.

The simplest way to tackle this would be to drop your current index, apply a new mapping and reindex everything again. This approach is fine while you’re still in your dev (or maybe staging) environment. But in production, your reindex can easily take a couple of hours, maybe days. Good luck telling your customers you’re offline during that period. Also, this only works if you have your old data available somewhere else to feed the reindex. Otherwise the old index is your only source, and you need to figure out how to reindex from it without downtime.

Zero Downtime Reindexing

There is already a great entry in the Elasticsearch Guide, derived from a post on the official blog, that you should read, too. Here is the short TL;DR:

Elasticsearch provides us with the fantastic and helpful concept of aliases. So to get to a seamless reindexing you do the following:

create an alias that points to the index with the old mapping
point your application to your alias instead of your index
create a new index with the updated mapping
move data from old to new
atomically move the alias from the old to the new index
delete the old stuff

-> The cluster stays fully operational during the whole operation and you experience no downtime!
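The atomic alias switch from the list above is a single request to the _aliases endpoint that carries a "remove" and an "add" action. A minimal sketch that only builds the request body (sending it is left to whatever client you use); the alias and index names blog, blog_v1 and blog_v2 are made up for illustration:

```python
import json
from typing import Any, Dict


def build_alias_swap(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Body for POST /_aliases; both actions are applied together in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: the application only ever talks to the alias "blog".
    print(json.dumps(build_alias_swap("blog", "blog_v1", "blog_v2"), indent=2))
```

Because both actions travel in one request, there is no moment in which the alias points to no index at all.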

1. Where do the WRITE operations go in the meanwhile?

Unfortunately, the official documentation does not discuss how to handle incoming writes to your cluster during the reindexing period. Such an operation might take a lot of time, depending on factors like your hardware, the size of your dataset, your analyzers and so on. Aliases do not allow us to write to both the old and the new index at the same time, so we need to take care of that ourselves. Currently I’d suggest two approaches:

1 a) Duplicate Writes yourself

The most straightforward solution is to change your application in a way so it will write the same data to both of your indices simultaneously.

Obviously, duplicated writes will have a performance impact when both indices live on the same machine. But it might be worth it: if your reindexing process dies halfway through and you have no recovery mechanism implemented, your old index is still in a valid state.
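A rough sketch of such a duplicated write, assuming your application already funnels all writes through one place. The index_fn callable is a hypothetical wrapper around whatever client call you use to index a single document; it is not part of any Elasticsearch client API, and the index names are made up.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: index_fn(index_name, doc_id, body) writes one document.
IndexFn = Callable[[str, str, Dict[str, Any]], None]


class DualWriter:
    """Writes every document to the old and the new index."""

    def __init__(self, index_fn: IndexFn, old_index: str, new_index: str) -> None:
        self.index_fn = index_fn
        self.old_index = old_index
        self.new_index = new_index

    def index(self, doc_id: str, body: Dict[str, Any]) -> None:
        # Write to the old index first: if the second write fails,
        # the index that currently serves reads is still up to date.
        self.index_fn(self.old_index, doc_id, body)
        self.index_fn(self.new_index, doc_id, body)


if __name__ == "__main__":
    store: Dict[tuple, Dict[str, Any]] = {}  # in-memory stand-in for the cluster
    writer = DualWriter(lambda i, d, b: store.__setitem__((i, d), b), "blog_v1", "blog_v2")
    writer.index("3", {"author": "Chris", "title": "useful Cat facts (IV)"})
    print(sorted(store))
```

What should happen when only one of the two writes succeeds is deliberately left open here; that decision depends on your application.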

1 b) Write to new index and read from both

The Guide states:

A search request can target multiple indices, so having the search alias point to tweets_1 and tweets_2 is perfectly valid. However, indexing requests can only target a single index. For this reason, we have to switch the index alias to point only to the new index.

If you are not in control of the software writing towards your application, or the first approach is not feasible because of other environmental constraints, you can alternatively switch the write alias towards the new index and read from both at the same time. Please note that you will get duplicates in your queries, so it is your responsibility to deal with them application-wise. Also concepts like pagination will provide additional hurdles.
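Dealing with the duplicates could look roughly like the sketch below. It works on the hits of a search response, relies only on the _index and _id fields that every hit carries, and prefers the copy from the new index when a document shows up twice. The index names are made up, and the sketch does nothing about scoring or pagination.

```python
from typing import Any, Dict, List

Hit = Dict[str, Any]


def merge_hits(hits: List[Hit], preferred_index: str) -> List[Hit]:
    """Keep one hit per _id; a copy from preferred_index replaces any other copy."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        seen = best.get(hit["_id"])
        if seen is None:
            best[hit["_id"]] = hit
        elif hit["_index"] == preferred_index and seen["_index"] != preferred_index:
            best[hit["_id"]] = hit
    # Dicts keep insertion order, so results stay in first-seen order.
    return list(best.values())


if __name__ == "__main__":
    hits = [
        {"_index": "blog_v1", "_id": "1", "_source": {"title": "old"}},
        {"_index": "blog_v2", "_id": "1", "_source": {"title": "new"}},
        {"_index": "blog_v1", "_id": "2", "_source": {"title": "only old"}},
    ]
    for hit in merge_hits(hits, preferred_index="blog_v2"):
        print(hit["_index"], hit["_id"])
```

Note that removing duplicates after the fact shrinks a result page, which is one of the pagination hurdles mentioned above.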

In conclusion, your application has to be aware of the reindexing process and behave according to your chosen strategy. Either you write to both indices or you deal with duplicated results. It depends on your application which way is acceptable. But besides this point, the concept has another weakness:

2. Lost Updates and Deletes!

When we’re in the middle of a lengthy reindexing process, all incoming writes are written to the new index. This is unproblematic for indexing new documents: they are just appended to the index and have no relation to the old one.

But what about an UPDATE or DELETE of a document? If the document has already been transferred to the new index, there is no problem. If it has not, the external operation fails with an error, and later on the reindexing job copies the outdated version of the document into the new index.

This outcome is not desirable and should be avoided! If your application supports updates and deletions, we have to include additional steps in our reindexing process. The basic idea is that you do not delete documents, but mark them as “deleted” instead and exclude them from queries. Here are some proposals to get you started:
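The mark-as-deleted idea can be sketched as two small request bodies: a partial document for the Update API that sets a flag, and a bool query that wraps your actual query and filters flagged documents out. The field name "deleted" is an assumption of this sketch and would need to be part of your mapping.

```python
from typing import Any, Dict

DELETED_FIELD = "deleted"  # hypothetical field name, pick whatever fits your mapping


def soft_delete_body() -> Dict[str, Any]:
    """Partial document for the Update API: flag the document instead of deleting it."""
    return {"doc": {DELETED_FIELD: True}}


def exclude_deleted(query: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a query so that documents flagged as deleted never match."""
    return {
        "query": {
            "bool": {
                "must": [query],
                "must_not": [{"term": {DELETED_FIELD: True}}],
            }
        }
    }


if __name__ == "__main__":
    print(soft_delete_body())
    print(exclude_deleted({"match": {"title": "cat facts"}}))
```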

2 a) Incremental Reindexing

For this approach to work, your whole infrastructure needs to adopt the following two concepts:

Every modification updates a timestamp field of the document
Instead of writing your critical updates and deletions to the new index, we still apply them to the old one. Our reindexing job moves all documents that are older than its own start timestamp to the new index. Every update that happens during this time updates the document’s timestamp. Note that Elasticsearch already provides a _timestamp field that can be activated in the mapping.
When the reindexing job has finished successfully, it starts again and transfers all modifications that were made during its previous run. When it reaches an iteration where it has nothing to do, we consider it done and continue with the wrap-up as in the regular process.
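The loop described above can be simulated with plain dictionaries standing in for the two indices. This is only a sketch of the control flow: the field name "timestamp", the injected clock and the after_pass hook (which lets the demo simulate writes arriving while a pass is running) are all assumptions made for illustration. A real job would use scan/scroll and bulk requests instead of iterating a dict.

```python
from typing import Any, Callable, Dict, Optional

Index = Dict[str, Dict[str, Any]]  # doc id -> document with a "timestamp" field


def reindex_pass(old: Index, new: Index, since: float, until: float) -> int:
    """Copy every document modified in the window (since, until] to the new index."""
    moved = 0
    for doc_id, doc in old.items():
        if since < doc["timestamp"] <= until:
            new[doc_id] = dict(doc)
            moved += 1
    return moved


def incremental_reindex(
    old: Index,
    new: Index,
    clock: Callable[[], float],
    after_pass: Optional[Callable[[int], None]] = None,
) -> int:
    """Repeat passes until one of them finds nothing to move; return the pass count."""
    since = float("-inf")
    passes = 0
    while True:
        until = clock()  # start timestamp of this pass
        moved = reindex_pass(old, new, since, until)
        passes += 1
        if moved == 0:
            return passes
        if after_pass is not None:
            after_pass(passes)  # demo hook: modifications that arrived meanwhile
        since = until


if __name__ == "__main__":
    ticks = iter(range(1, 100))
    clock = lambda: next(ticks)  # logical clock keeps the demo deterministic
    old = {"a": {"title": "x", "timestamp": clock()}}
    new: Index = {}
    print(incremental_reindex(old, new, clock), new)
```

The pass that moves nothing is exactly the point where the “Stop-the-world” phase from the drawbacks below has to begin, otherwise a late modification can still slip through.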

Drawbacks:

If you have a lot of deletions you will artificially bloat your index. This can be improved by cleaning all marked-as-deleted documents after your reindex. Still, since a DELETE in Elasticsearch will just be a mark-as-delete in Lucene, there will be bloat.
The logical delete implemented as an UPDATE is more expensive than a regular DELETE, so watch out for performance hits.
After the last reindexing iteration, there must be a “Stop-the-world” phase to prevent any modifications from sneaking in. Our suggested approach would be to include that into your deployment process if you can.

2 b) Modification Buffering

If your reindexing is expected to last only a short amount of time there might be another solution to be considered:

Elasticsearch has a simple version control mechanism built around the special _version field. When your application keeps this information during the GET -> modify -> UPDATE / DELETE cycle and sends it back, Elasticsearch will check whether the version matches.

Example: If your document has version 1 and you send the UPDATE to Elasticsearch with this version as a parameter, and the document has not been transferred yet, you will get a VersionConflictEngineException. In this case, hold the update in your application and retry later (how much “later” is acceptable depends on your application and can ultimately only be answered by you).

The same drawback as in 2 a) applies: you cannot truly delete your documents anymore, you have to mark them as deleted as well.
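A small in-memory sketch of the buffering idea. FakeIndex and VersionConflict are stand-ins that only imitate the version check (the real exception is the VersionConflictEngineException mentioned above), and the sketch assumes that the reindexing job copies documents together with their version number. ModificationBuffer and its method names are hypothetical.

```python
from collections import deque
from typing import Any, Deque, Dict, Tuple


class VersionConflict(Exception):
    """Stand-in for Elasticsearch's VersionConflictEngineException."""


class FakeIndex:
    """In-memory stand-in for the new index, with a simple version check."""

    def __init__(self) -> None:
        self.docs: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def put(self, doc_id: str, version: int, body: Dict[str, Any]) -> None:
        # Used by the (simulated) reindexing job; keeps the original version.
        self.docs[doc_id] = (version, body)

    def update(self, doc_id: str, expected_version: int, body: Dict[str, Any]) -> None:
        current = self.docs.get(doc_id)
        if current is None or current[0] != expected_version:
            raise VersionConflict(doc_id)
        self.docs[doc_id] = (expected_version + 1, body)


class ModificationBuffer:
    """Holds updates that hit a version conflict and retries them later."""

    def __init__(self, index: FakeIndex) -> None:
        self.index = index
        self.pending: Deque[Tuple[str, int, Dict[str, Any]]] = deque()

    def update(self, doc_id: str, expected_version: int, body: Dict[str, Any]) -> bool:
        try:
            self.index.update(doc_id, expected_version, body)
            return True
        except VersionConflict:
            self.pending.append((doc_id, expected_version, body))
            return False

    def retry_pending(self) -> int:
        """Retry each buffered update once; return how many were applied."""
        applied = 0
        for _ in range(len(self.pending)):
            doc_id, version, body = self.pending.popleft()
            if self.update(doc_id, version, body):
                applied += 1
        return applied
```

Keep in mind that this sketch cannot tell a “not transferred yet” conflict from a genuine concurrent modification. A real implementation would need a retry limit or some other way to give up on updates that will never succeed.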

Conclusion

Which solution you take from this article is not important. The most important point is to be aware of the drawbacks of the “official” reindexing procedure. You’ll have to figure out how to work around these limitations depending on your business needs.

Blog author

Christian Uhl
