Scaling an Elasticsearch Index – Introduction

30.3.2015 | 7 minutes reading time

A well-known design decision of Elasticsearch is that a fixed number of shards has to be specified when creating an index. It is not possible to start out with just one or only a few shards and add more shards later as the data increases.

So what can we do if the capacity of an index is exhausted and we really need to extend that index? We will try to find the answer in this blog series.

First of all, let’s try to understand why Elasticsearch doesn’t provide anything like dynamic shard splitting. Quite a few other database and search engine products offer shard splitting so that extending a database or an index can be done conveniently via a single API call.

The most obvious advantage of the approach taken by Elasticsearch is that a particular shard key will always point to the same shard. There is no need to implement a complicated shard splitting algorithm, with heavyweight behind-the-scenes data migrations, high disk space requirements, and all kinds of unexpected error situations. Further reasoning, shedding some light on why shard splitting is considered a bad idea, is given here.

The recommendation for Elasticsearch users goes like this: First, estimate the capacity of a single shard by performing measurements with realistic amounts of data. Next, based on the results of those measurements, calculate the total number of shards needed to hold the expected data. Finally, overallocate a little to have some headroom in case something unexpected happens.
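
The arithmetic behind that recommendation can be sketched in a few lines of Python. This is only an illustration: the function name, the per-shard capacity, and the 20 % headroom are assumptions made up for the example, and the real numbers have to come from your own measurements.

```python
import math


def shards_needed(expected_docs, docs_per_shard, headroom=0.2):
    """Estimate the number of primary shards for an index.

    expected_docs  -- number of documents we expect to hold in total
    docs_per_shard -- measured capacity of a single shard (from load tests)
    headroom       -- fraction to overallocate (0.2 means 20 % extra)
    """
    if docs_per_shard <= 0:
        raise ValueError("docs_per_shard must be positive")
    base = expected_docs / docs_per_shard
    # Round up and always allocate at least one shard.
    return max(1, math.ceil(base * (1 + headroom)))


# Hypothetical figures: 120 million documents, 25 million per shard.
print(shards_needed(expected_docs=120_000_000, docs_per_shard=25_000_000))  # 6
```

The same calculation works with bytes instead of document counts, as long as both inputs use the same unit.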

Following that recommendation will probably handle many practical scenarios satisfactorily. Also, when designing a system, we should spend some time thinking about the expected amounts of data anyway, even without any sharding involved. On the other hand, there is no guarantee that our expectations or measurements are correct; the unexpected success of a startup is a typical example. We have to accept that even with careful planning an Elasticsearch index may reach its capacity. And if it happens, it is guaranteed to happen at the worst possible point in time, so we had better be prepared. The following two approaches are available:

1. Extend the index with a second index and have queries cover both of them.
2. Replace the index with a new, bigger index. Once all data are migrated, use the new index only.

Note that, regarding search performance, both approaches are equivalent because it doesn’t make any difference if we query a single index with, say, 50 shards or 50 indexes with one shard each. In both cases, 50 Lucene indexes are searched.

In this article, we will concentrate on the first approach, extending an index. Parts 2 and 3 of this series will cover the second approach, migrating to a new index.

Extending an index

Suppose we have an “index1” that we want to extend by a second index “index2”. This can be achieved with the following steps:

1. Create a second index “index2” with the same mapping as “index1”.
2. Direct all new documents to “index2”. If the application already uses an alias for indexing, that alias only needs to be switched from “index1” to “index2”. If there is no such alias yet, it can be created in a preparatory step (which might involve a new deployment or restart of the application). Of course, we can do without an alias just by hardcoding “index2” into the application, but the use of an alias is recommended for increased flexibility.
3. Direct search queries to both indexes. Ideally, the application already uses a separate alias for queries and we only need to update that alias definition to point to both “index1” and “index2”. Once again, this can also be done without an alias by specifying both indexes in search requests, but the use of an alias is recommended.
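
As a rough sketch, the alias changes from the steps above could be prepared like this. The functions only build the JSON bodies for the Elasticsearch aliases endpoint (POST /_aliases) and the path for a multi-index search; sending them is left to whatever HTTP client the application uses. The alias names “documents_write” and “documents_search” are hypothetical.

```python
import json


def switch_write_alias(alias, old_index, new_index):
    """Body for POST /_aliases that moves the indexing alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def extend_search_alias(alias, new_index):
    """Body for POST /_aliases that lets the search alias also cover new_index."""
    return {"actions": [{"add": {"index": new_index, "alias": alias}}]}


def search_path(indexes):
    """Search path without an alias: index names are comma-separated."""
    return "/" + ",".join(indexes) + "/_search"


# Hypothetical alias names, for illustration only.
print(json.dumps(switch_write_alias("documents_write", "index1", "index2")))
print(json.dumps(extend_search_alias("documents_search", "index2")))
print(search_path(["index1", "index2"]))
```

Because the remove and add actions travel in a single request, there is no moment at which the indexing alias points to no index at all.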

Readers familiar with the ELK stack may notice that this is exactly how Elasticsearch indexes are usually managed for scenarios involving time-based data, e.g., log data. With ELK, the steps outlined above are applied regularly, e.g., once per day or week. Extending an index in this way is a perfect fit for use cases where documents are indexed once and never changed later, where basically all you do is index and search.
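
For time-based data, the “current” index is typically derived from the date. The following sketch assumes a daily naming scheme of the form prefix-YYYY.MM.DD; the prefix “logs” and both function names are made up for the example.

```python
from datetime import date


def daily_index_name(day, prefix="logs"):
    """Name of the index that receives documents for the given day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def search_pattern(prefix="logs"):
    """Wildcard expression that covers all daily indexes in a search."""
    return f"{prefix}-*"


print(daily_index_name(date(2015, 3, 30)))  # logs-2015.03.30
print(search_pattern())                     # logs-*
```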

Unfortunately, things are not so simple in the general case where the index needs to support requests by document ID other than indexing, e.g., retrieving, updating or deleting a document. With multiple indexes, for each such operation we need to know which index to address for that particular document. This in turn means that we have to implement an additional sharding-like mechanism in the application. Furthermore, the mechanism has to be designed such that it identifies all already existing documents as stored in “index1”. How can we achieve that? Let’s take a look at two possible options.

Option 1: Use information already available in documents

While it’s a good shard key for fresh indexes, the document ID doesn’t qualify as a shard key in our scenario at hand: Documents with all kinds of IDs have already been stored in “index1”, so we cannot find a clear rule to distinguish them from documents directed to “index2”. Instead, a good candidate for our shard key is something like the creation date of a document. Assuming that all documents to be indexed carry a creation date field with them, we could modify our application to send all documents with a creation date later than some point in time to “index2”. All documents with an earlier creation date will be sent to “index1”. When accessing a document by ID, its creation date can be used to identify the index to use. Let’s discuss this idea.
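
A minimal sketch of such a rule, assuming a fixed cutover timestamp that the application knows about. The constant CUTOVER, its value, and the function name are hypothetical; sending documents created exactly at the cutover to “index2” is an arbitrary choice for the example.

```python
from datetime import datetime, timezone

# Hypothetical point in time at which indexing moved to "index2".
CUTOVER = datetime(2015, 4, 1, tzinfo=timezone.utc)


def index_for(creation_date, cutover=CUTOVER):
    """Pick the index for a document based on its creation date.

    Documents created at or after the cutover go to "index2",
    everything older stays in "index1".
    """
    return "index2" if creation_date >= cutover else "index1"


print(index_for(datetime(2015, 3, 30, tzinfo=timezone.utc)))  # index1
print(index_for(datetime(2015, 4, 2, tzinfo=timezone.utc)))   # index2
```

The same function has to be used for indexing and for every later get, update or delete, otherwise a document could be looked up in the wrong index.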

Advantages

Already existing information is reused, so there is no need to modify any other part of the application beyond the search engine clients.

Disadvantages

The search engine clients of the application need to be modified to perform their own sharding via the document creation date. Unfortunately, it’s not possible to use an alias for that purpose. Thus, in contrast to ELK-type scenarios, clients cannot just index into a single “current” index (alias) but have to be aware of different indexes.
When reading, updating or deleting a document, the creation date of the document needs to be available. Most likely this will involve reading it from a primary database (by document ID) first, which results in an additional database call. Note that without some primary database the outlined approach is not possible at all because then there is no place to fetch the document creation date from.

Option 2: Store additional indexing information in the primary database

Given that option 1 already requires the presence of a primary database, we can also consider making slightly bigger changes to the application in order to achieve a simpler result. How about just storing the respective index name directly with documents in the primary DB? Whenever needed, it can be read from there and provided to the search engine client. For entries that don’t have an index name stored in the primary database, we can just assume that the original index, “index1”, is the correct index. Let’s discuss this idea:
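
The lookup with its fallback might look roughly like this. A plain dictionary stands in for the primary database, and the field name “search_index” as well as the function name are assumptions made for the example.

```python
DEFAULT_INDEX = "index1"  # index that holds all documents stored before the extension


def index_for_document(doc_id, primary_db):
    """Return the index name stored with the document, or the original index.

    primary_db is a stand-in for the real primary database: a mapping
    from document ID to a record (dict) that may carry "search_index".
    """
    record = primary_db.get(doc_id)
    if record is None:
        raise KeyError(f"unknown document: {doc_id}")
    # Older records have no index name stored; they live in the original index.
    return record.get("search_index") or DEFAULT_INDEX


# Hypothetical records for illustration.
db = {
    "a1": {"title": "old document"},
    "b2": {"title": "new document", "search_index": "index2"},
}
print(index_for_document("a1", db))  # index1
print(index_for_document("b2", db))  # index2
```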

Advantages

Search clients can work with index names directly. They don’t have to know about artificial rules involving, e.g., creation dates.

Disadvantages

The index name for each document has to be stored in the primary database. This requires changing all parts of the application that store new documents to the primary database. Also, we might even have to update the database schema.
The primary database will be directly coupled to the search index. This coupling might create additional complexity for any future changes.

Conclusion

When there is no need to retrieve, update or delete single documents by ID, extending an index by another one is a fairly simple and promising approach. However, if additional operations by document ID are required, things are more difficult. We have discussed two possible options for extending an index, but neither is particularly attractive or straightforward to implement. One more option we didn’t discuss is to simply duplicate all requests by document ID and send them to both indexes, as only one of them will be able to sensibly work with them. And while it’s indeed a valid approach, let’s not consider it a serious attempt at a solution. Unfortunately, there is not much more that comes to (my) mind.

But no need to despair! In the next parts of this series we will delve deep into the details of the second approach, migrating to a new index.

Blog author

Patrick Peschlow

