Elasticsearch Indexing Performance Cheatsheet

8.5.2014 | 8 minutes reading time

Planning to index large amounts of data in Elasticsearch? Or are you already trying to do so, only to find that throughput is too low? Here is a collection of tips and ideas for increasing indexing throughput with Elasticsearch. Some of them I have successfully tried myself; others I have only read about and found reasonable. In any case, I hope you will find them useful.

In order to fit all this into a single article, I have kept the suggestions rather brief. For some of them, you may feel that you need to learn more before putting them into practice. To ease your task a little, I have included links to the relevant sections of the Elasticsearch documentation which you may use as a starting point for further research.

General Performance

Before doing anything more specific, it makes sense to follow the advice given in the Elasticsearch documentation on configuration. In a nutshell:

Set the maximum number of open file descriptors for the user running Elasticsearch to 32k or even 64k.
If possible, consider disabling swapping for the Elasticsearch process memory. Note, however, that in a virtualized environment this may not behave as expected.
Set -Xms to the same value as -Xmx (the same result can be achieved by setting the ES_HEAP_SIZE environment variable).
Leave some amount of physical memory unassigned so that the OS file system cache is free to use it for Lucene’s benefit. A rule of thumb is to have the Elasticsearch JVM use no more than half of the available memory.
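The heap sizing rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name heap_flags and the idea of passing the machine's physical memory in megabytes are assumptions made for this example, not part of Elasticsearch.

```python
def heap_flags(physical_memory_mb):
    """Return JVM heap flags following the 'half of physical memory' rule of thumb.

    Hypothetical helper for illustration; it only formats flags and does not
    inspect the machine it runs on.
    """
    if physical_memory_mb <= 0:
        raise ValueError("physical_memory_mb must be positive")
    # Leave the other half to the OS file system cache.
    heap_mb = physical_memory_mb // 2
    # -Xms and -Xmx get the same value, as recommended above.
    return ["-Xms%dm" % heap_mb, "-Xmx%dm" % heap_mb]


flags = heap_flags(16384)  # example: a machine with 16 GB of RAM
```

Treat the result as a starting point; the right heap size still depends on your data and query load.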

Mapping

If your search requirements allow it, there is some room for optimization in the mapping definition of your index:

By default, Elasticsearch stores the original data in a special _source field. If you do not need it, disable it.
By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you do not need it, disable it.
If you are using the _source field, there is no additional value in marking any other field as stored (via the store attribute in the mapping).
If you are not using the _source field, mark only those fields as stored that you really need to retrieve. Note, however, that using _source brings certain advantages, such as the ability to use the update API.
For analyzed fields, do you need norms? If not, disable them by setting norms.enabled to false.
Do you need to store term frequencies and positions, as is done by default, or can you do with less – maybe only doc numbers? Set index_options to what you really need, as outlined in the string core type description.
For analyzed fields, use the simplest analyzer that satisfies the requirements for the field. Or maybe you can even go with not_analyzed?
Do not analyze, store, or even send data to Elasticsearch that you do not need for answering search requests. In particular, double-check the content of mappings that you do not define yourself (e.g., because a tool like Logstash generates them for you).
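As a rough sketch, a mapping that applies several of the points above might look as follows. It is written as a Python dictionary that serializes to the JSON body of an index creation request. The syntax shown is that of the Elasticsearch 1.x releases current when this article was written; later versions changed it considerably (for example, the string type and the _all field no longer exist). The type name logline and the field names are made up for this example.

```python
import json


def build_index_body():
    """Build a hypothetical index creation body with a lean mapping (ES 1.x syntax)."""
    return {
        "mappings": {
            "logline": {  # hypothetical type name
                "_source": {"enabled": False},  # only if the original JSON is never needed
                "_all": {"enabled": False},     # only if no query relies on the _all field
                "properties": {
                    # Analyzed full-text field without norms, indexing doc numbers only.
                    "message": {
                        "type": "string",
                        "norms": {"enabled": False},
                        "index_options": "docs",
                    },
                    # Exact-match field that skips analysis entirely.
                    "status": {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }


body_json = json.dumps(build_index_body(), indent=2)
```

Keep in mind that disabling norms removes length normalization and index-time boosting from scoring, and that setting index_options to docs rules out phrase queries on that field, so both only make sense if your search requirements allow it.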

Requests and Clients

You can also gain a lot from optimizing the way in which you transfer indexing requests to Elasticsearch:

Do you have to send a separate request for each document? Or can you buffer documents in order to use the bulk API for indexing multiple documents with a single request?
When using bulk requests, optimize the bulk size, i.e., how many documents you bundle in a single request. Usually an appropriate bulk size has to be discovered empirically by trying out different sizes under realistic load conditions.
If your business can afford it, you can even consider trading some reliability for performance using the bulk UDP API for certain data. This is particularly interesting if the client and server participating in the request reside on the same host.
If you are using an HTTP client, consider using long-lived HTTP connections. Also, make sure that HTTP chunking is not hampering throughput.
Consider using one of the various existing clients as they may contain performance advantages over using plain HTTP.
If your client speaks Java, consider using the NodeClient. A NodeClient joins the cluster and knows which nodes to address for certain requests, possibly saving one hop when compared to other clients. If you cannot use the NodeClient, e.g., due to security restrictions, see if you can use TransportClient before considering something else.
Can you parallelize indexing by using multiple clients? It may well be that a single client turns out to be the indexing bottleneck and that the Elasticsearch server is able to handle a much higher load.
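To illustrate buffering documents for the bulk API, here is a minimal Python sketch that groups documents into batches and renders each batch in the newline-delimited format the bulk endpoint expects (one action line followed by one source line per document). It only builds the request bodies; sending them, for example with an HTTP POST to the _bulk endpoint over a long-lived connection, is left out. The helper names, the index name logs, and the type name event are assumptions for this example, and the _type entry reflects the 1.x-era API.

```python
import json
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(index, doc_type, docs):
    """Render documents as a bulk request body: action line, then source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


docs = [{"id": i, "msg": "event %d" % i} for i in range(5)]
# A batch size of 2 is only for illustration; find a real value empirically.
bodies = [bulk_body("logs", "event", batch) for batch in batches(docs, 2)]
```

Making the batch size a parameter keeps it easy to try out different sizes under realistic load, as suggested above.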

Sharding and Replication

Elasticsearch provides sharding and replication as the recommended way for scaling and increasing availability of an index. There are a few things to consider:

If a single Elasticsearch server is not enough to provide your desired indexing throughput, you may need to scale out. Multiple cluster nodes enable parallel work on an index by sharding it. Note: The number of shards of an index needs to be set on index creation and cannot be changed later. In case you do not know exactly how much data to expect, you may consider overallocating a few shards (but not too many, they are not free!) to have some spare capacity available. Other than that, index aliases may provide a way (albeit with limitations) of scaling out an index at a later point in time.
Replication is an important feature for being able to cope with failure, but the more replicas you have the longer indexing will take. Thus, for raw indexing throughput it would be best to have no replicas at all. Luckily, in contrast to the number of shards, you may change the number of replicas of an index at any time, which gives us some additional options. In certain situations, such as populating a new index initially, or migrating data from one index to another, it may prove beneficial to start without replication and only add replicas later, once the time-critical initial indexing has been completed.
Consider separating data nodes (that actually store and index data) from “aggregator nodes” (used only for querying). When aggregator nodes handle search queries and only contact data nodes as needed, they take load off the data nodes which will then have more capacity for handling indexing requests.
By default, an indexing request is completed once the data has been safely received (i.e., stored in the transaction log) by all replicas. If you set the query parameter replication to async, the request completes as soon as the data has been acknowledged on the primary shard.
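The idea of populating an index without replicas first and adding them later can be sketched as a pair of REST requests. The Python snippet below only describes the requests as (method, path, body) tuples and does not send anything; the helper name, the index name, and the shard and replica counts are placeholders for this example.

```python
import json


def initial_load_requests(index, shards, final_replicas):
    """Describe the requests for 'index first, replicate later' (hypothetical helper)."""
    # 1. Create the index with a fixed shard count and no replicas.
    create = ("PUT", "/%s" % index, json.dumps({
        "settings": {"number_of_shards": shards, "number_of_replicas": 0}
    }))
    # 2. After the initial indexing run, raise the replica count.
    #    Unlike the shard count, this setting can be changed at any time.
    add_replicas = ("PUT", "/%s/_settings" % index, json.dumps({
        "index": {"number_of_replicas": final_replicas}
    }))
    return create, add_replicas


create, add_replicas = initial_load_requests("products_v1", shards=4, final_replicas=1)
```

Until the second request has been applied and the replicas have been built, the index has no redundancy, so this only suits data that can be re-indexed from its source if a node fails.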

Index Settings

There are several index level settings that you may tune to improve indexing throughput:

By default, an index shard uses a refresh interval of one second, i.e., new documents become available for search after one second. Even though refreshing is a more lightweight operation than one may think, it comes at a cost. Thus, depending on your search requirements, you may consider setting the refresh interval to something higher than one second. It can even make sense to temporarily turn off refreshing completely for an index (by setting the interval to -1), e.g., during a bulk indexing run, and trigger it manually at the end.
Compared to refreshing an index shard, the really expensive operation is flushing its transaction log (which involves a Lucene commit). Elasticsearch performs flushes based on a number of triggers that may be changed at run time. By delaying flushes, or disabling them completely, you can increase indexing throughput. Just be aware that nothing comes for free, and the delayed flush will of course take longer when it eventually happens.
The default segment merge policy, “tiered”, supports a compound format where data is stored in fewer files to reduce the number of open file handles needed. However, the compound format comes along with a performance penalty. There are two settings, index.compound_on_flush and index.compound_format, that specify whether the compound format should be used for new segments and merged segments, respectively. Making sure that both are set to false may improve indexing performance, at the cost of more file handles.
Segment merging is done in the background but requires I/O from which indexing performance may suffer. Therefore, it is possible to throttle merging to a maximum number of bytes per second, on the node or index level. Note that throttling is already done by default, but maybe you want to adjust the predefined limit according to your needs.
The setting indices.memory.index_buffer_size defines the percentage of available heap memory that may be used for indexing operations (the remaining heap memory will mainly be used for search operations). The default of 10% may be too low if you have lots of data to index, and it may make sense to set it to a higher value.
Index warmup is a useful concept to speed up search queries, but when indexing large amounts of data (in particular, bulk indexing) it may make sense to temporarily disable it.
Consider increasing the node level thread pool size for indexing and bulk operations (and measure if it really brings an improvement).
The setting index.index_concurrency limits the number of threads that may concurrently perform indexing operations on a single shard. Consider increasing the value, especially when there are no other shards on the node (and measure if it pays off).
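For the refresh interval tip, a bulk indexing run could be wrapped in settings updates like the ones below. This is a Python sketch that only lists the requests as (method, path, body) tuples, assuming a hypothetical index called logs; the interval to restore afterwards (here the default of one second) depends on your search requirements.

```python
import json


def refresh_toggle_requests(index, restore_interval="1s"):
    """Describe requests that disable refreshing before a bulk run and restore it after."""
    settings_path = "/%s/_settings" % index
    before = [
        # An interval of -1 turns periodic refreshing off for this index.
        ("PUT", settings_path, json.dumps({"index": {"refresh_interval": "-1"}})),
    ]
    after = [
        # Restore periodic refreshing ...
        ("PUT", settings_path, json.dumps({"index": {"refresh_interval": restore_interval}})),
        # ... and trigger one refresh manually so the new documents become searchable.
        ("POST", "/%s/_refresh" % index, ""),
    ]
    return before, after


before, after = refresh_toggle_requests("logs")
```

Whatever client sends these requests should make sure the "after" requests also run when the bulk run fails; otherwise the index stays without periodic refreshes.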

Conclusion

I hope some of these suggestions will help you resolve any indexing performance problems you might have. Keep in mind, however, that the most important aspect of a search engine is, well, the search. Do not make the mistake of tuning your search engine for maximum indexing throughput only to discover that all of a sudden its query performance suffers or it no longer fulfills the functional requirements. Always make sure that your users get a quality search experience and really find what they are looking for.

Blog author

Patrick Peschlow
