Elasticsearch tips: inserting vs. updating your index

12.12.2014 | 6 minutes reading time

Transforming an update-heavy Elasticsearch use case into an insert-heavy one.

Just recently I had the opportunity to set up an Elasticsearch installation at a customer with a rather unique use case, and I’d like to share my approach with you. This post shows why an update-heavy use of Elasticsearch is a bad idea and how you can transform it into an insert-heavy one, which is much faster.

Prerequisites

The requirements involved tracking the lifecycle of documents that enter the company via various input channels and are processed by a number of automated systems. Sometimes one of these documents gets lost between steps, or is misanalyzed and therefore ends up lost in the system. If someone inquired about the status of such a lost document, no one could really give a good answer or attempt to fix it. That’s not a desirable state.

Fortunately, the “metadata” of such a document contains the OCR full text, so what we need is a “storage engine” with full-text search capabilities, and that really sounds like a job for Elasticsearch! It’s especially easy because we were able to hook custom code into each of these processing steps. Another great coincidence is that we can print a barcode on each document, so every process step can be truly independent of the others. This will influence my conclusion later on.

As for the general usage of this system, I expect a lot of write operations (lots of documents processed, most of them without errors) and only a few read operations (you only check when something went wrong, if at all). This leads to some conclusions you would not expect in a more traditional use case.

The ‘naive’ NoSQL approach

As with every Elasticsearch project I’m involved in, I like to step back first and give the data model a good thought. Sure, Elasticsearch is schemaless, but that does not mean you can skip thinking about your data, especially not if you want acceptable performance later on. Naturally I was inclined to think of a document as a flat structure that contains its various events and the respective results and timestamps. They could be thought of as relations, sure, but since they are tightly bound to each other (in classical terms a 1-1 relationship, if you will), a flat structure saves you the awkwardness of joining things together.

Implemented, that would mean that the first operation on a document would create (or upsert) it, and each following step would update the document accordingly. I’m not quite happy with this approach, since updating lots of documents all the time has the following drawbacks:

Downsides to frequent updates

Cost for get_then_reindex

Any updates you would do during the lifecycle would mostly be “partial updates”, where you only send the things that have changed to the Elasticsearch cluster. In fact, to avoid coupling, each of the independent software systems should be unaware of the state updates the other systems made. Elasticsearch allows us to do partial updates, but internally these are “get_then_update” operations, where the whole document is fetched, the changes are applied and then the document is indexed again. Even without disk hits one can imagine the potential performance implications if this is your main use case.
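To make the cost of such a partial update more tangible, here is a minimal sketch in Python. It is a toy in-memory model, not the Elasticsearch API: the class `TinyIndex`, its methods and the field names are made up for illustration, and the merge is a shallow one for brevity.

```python
import copy


class TinyIndex:
    """Toy in-memory stand-in for an index (hypothetical, not the real API)."""

    def __init__(self):
        self._docs = {}        # doc_id -> (version, source)
        self.full_writes = 0   # how often a complete document was (re)indexed

    def index(self, doc_id, source):
        """Store the complete document and bump its version."""
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, copy.deepcopy(source))
        self.full_writes += 1
        return version

    def get(self, doc_id):
        """Fetch the complete document together with its version."""
        version, source = self._docs[doc_id]
        return version, copy.deepcopy(source)

    def partial_update(self, doc_id, changes):
        """The client only sends `changes`, but the index still has to
        fetch the whole document, merge, and index the whole result."""
        _, source = self.get(doc_id)        # 1. get the full document
        source.update(changes)              # 2. apply the changes (shallow merge)
        return self.index(doc_id, source)   # 3. index the full document again


idx = TinyIndex()
idx.index("4711", {"ocr_text": "Dear Sir or Madam", "scanned_at": "2014-12-01T08:00:00"})
idx.partial_update("4711", {"classified_at": "2014-12-01T08:05:00"})
```

Even though the second call only carries one small field, the counter `full_writes` ends up at 2: the complete document, including the OCR text, was written twice.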

Potential version conflicts

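The “get_then_update” operations are not atomic, and Elasticsearch uses implicit versioning of its documents, so version conflicts are to be expected when several systems update the same document at around the same time. By default a conflicting update is rejected with a version conflict error; with the retry_on_conflict parameter you can let Elasticsearch retry it for you, which works well for partial updates that touch different fields. Either way, every retry is another full “get_then_update” cycle, so it’s another performance impact you have to be aware of.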
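The following sketch simulates such a conflict with optimistic versioning and a simple retry loop, roughly what the update API’s `retry_on_conflict` parameter does on your behalf. Everything here is a toy model: `VersionedStore`, `update_with_retry` and the `before_write` hook (which only exists to simulate a concurrent writer) are hypothetical names, not part of Elasticsearch.

```python
class VersionConflict(Exception):
    """Raised when a document changed between the get and the reindex."""


class VersionedStore:
    """Toy document store with implicit versioning (not the real API)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)

    def index(self, doc_id, source, expected_version=None):
        current = self._docs.get(doc_id, (0, {}))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self._docs[doc_id] = (current + 1, dict(source))
        return current + 1


def update_with_retry(store, doc_id, changes, retries=3, before_write=None):
    """Get, merge and reindex; start over if someone else wrote in between.

    Returns the number of retries that were needed.
    """
    for attempt in range(retries + 1):
        version, source = store.get(doc_id)
        if before_write is not None:
            before_write(attempt)  # hook to simulate a concurrent writer
        source.update(changes)
        try:
            store.index(doc_id, source, expected_version=version)
            return attempt
        except VersionConflict:
            if attempt == retries:
                raise


store = VersionedStore()
store.index("4711", {"scanned": True})


def concurrent_ocr_step(attempt):
    # Another system sneaks in its own update during our first attempt only.
    if attempt == 0:
        version, source = store.get("4711")
        source["ocr_done"] = True
        store.index("4711", source, expected_version=version)


retries_needed = update_with_retry(
    store, "4711", {"classified": True}, before_write=concurrent_ocr_step
)
```

The first attempt fails because the document moved from version 1 to 2 in the meantime; the second attempt re-reads the document and succeeds, so both changes survive, at the price of one extra get-and-reindex round.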

Need to store _source

Another peculiarity of the “get_then_update” operation is that Elasticsearch cannot use the indexed document itself but needs the original instead. This forces you to keep the _source field enabled. In my case that was not an issue, but it’s something to be aware of.

Lucene “soft deletes” and merging cost

On the Lucene layer, an update is actually not an update but an (atomic) “insert and delete” operation. But alas, this is still not the full truth: deletes are soft, which means the old documents are only marked with a tombstone flag and remain in the segment. Only a merge operation will eventually clean them up. Like garbage collection for instantiated and dereferenced objects, this can lead to additional pressure on your system.
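A toy model of this behavior might look like the following sketch. It is deliberately simplified and does not use Lucene: the classes `Segment` and `ToyLuceneIndex` are hypothetical, and writing one segment per added document is an assumption made purely to keep the example short.

```python
class Segment:
    """Write-once list of documents plus a set of tombstoned positions."""

    def __init__(self, docs):
        self.docs = list(docs)   # (doc_id, source) pairs, never rewritten
        self.deleted = set()     # positions that are marked as deleted

    def live_docs(self):
        return [doc for pos, doc in enumerate(self.docs) if pos not in self.deleted]


class ToyLuceneIndex:
    def __init__(self):
        self.segments = []

    def add(self, doc_id, source):
        # Simplification: every added document gets its own segment.
        self.segments.append(Segment([(doc_id, source)]))

    def update(self, doc_id, source):
        """An update is a soft delete of the old copy plus an insert."""
        for segment in self.segments:
            for pos, (existing_id, _) in enumerate(segment.docs):
                if existing_id == doc_id:
                    segment.deleted.add(pos)   # tombstone, nothing is removed yet
        self.add(doc_id, source)

    def stored_count(self):
        return sum(len(segment.docs) for segment in self.segments)

    def live_count(self):
        return sum(len(segment.live_docs()) for segment in self.segments)

    def merge(self):
        """Rewrite all live documents into one segment, dropping tombstones."""
        live = [doc for segment in self.segments for doc in segment.live_docs()]
        self.segments = [Segment(live)] if live else []


lucene = ToyLuceneIndex()
lucene.add("4711", {"step": "scanned"})
for step in ("ocr", "classified", "archived"):
    lucene.update("4711", {"step": step})

stored_before_merge = lucene.stored_count()   # old copies are still on disk
lucene.merge()
stored_after_merge = lucene.stored_count()    # only the live copy remains
```

After three updates the index holds four stored copies of one logical document, and only the merge brings that back down to one. That cleanup work is the extra pressure mentioned above.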

In conclusion, update operations can be considered rather expensive. Since our application will ultimately consist of (almost) nothing but update operations, this seems like a bad idea. Let’s try changing that.

Index instead of Update

To use a different operation we need to split the document into its individual events, which gives us a relation between a document and its events. To keep things a little denormalized, we can reduce the events to a single type that contains all the possible fields, and only fill the fields relevant to the respective event.
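As a rough illustration of such a single event type, here is a sketch in Python. The class name `ProcessingEvent` and all field names are assumptions for the sake of the example; the actual fields of the project are not part of this post.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProcessingEvent:
    """One event type for all process steps (field names are hypothetical)."""

    document_id: str                      # the ID printed on the document
    step: str                             # which system produced the event
    timestamp: str                        # ISO 8601 timestamp of the event
    ocr_text: Optional[str] = None        # only filled by the OCR step
    classification: Optional[str] = None  # only filled by the classifier
    error: Optional[str] = None           # only filled when a step failed

    def to_source(self):
        """Return the JSON-ready body, leaving out fields that are not set."""
        return {key: value for key, value in asdict(self).items() if value is not None}


ocr_event = ProcessingEvent(
    document_id="4711",
    step="ocr",
    timestamp="2014-12-01T08:00:00",
    ocr_text="Dear Sir or Madam",
)
body = ocr_event.to_source()
```

Each system only fills the fields it knows about, so the OCR event above carries no classification or error field at all.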

Relations in Elasticsearch

To handle relations, Elasticsearch provides us with two different mechanisms that both have their individual pros and cons: nested documents and parent-child relations. For an in-depth introduction to both concepts, I’d recommend reading the Elasticsearch Guide’s chapter on modeling your data.

I don’t want to poorly replicate that description here, so in a nutshell: nested documents live inside their parent document, while parent-child documents live separately in their own type and are joined at query time. You need to be aware that parents and their children necessarily have to live on the same shard, and that the parent-child ID map is held in memory.
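The same-shard requirement follows from routing: a child is routed by its parent’s ID instead of its own. The sketch below illustrates the idea only. The function names are hypothetical, and `zlib.crc32` is merely a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib
from typing import Optional


def shard_for(routing_key: str, number_of_shards: int) -> int:
    """Map a routing key to a shard number (crc32 is a stand-in hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


def route(doc_id: str, number_of_shards: int, parent_id: Optional[str] = None) -> int:
    """Documents are routed by their own ID, children by their parent's ID."""
    routing_key = parent_id if parent_id is not None else doc_id
    return shard_for(routing_key, number_of_shards)


parent_shard = route("4711", number_of_shards=5)
child_shards = [
    route(event_id, number_of_shards=5, parent_id="4711")
    for event_id in ("evt-1", "evt-2", "evt-3")
]
```

All entries of `child_shards` equal `parent_shard`, no matter what the event IDs look like, which is what allows the join at query time to stay local to one shard.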

For our specific use case (where the plentiful updates are our performance concern, and search performance is actually negligible), we chose a parent-child relation as the better fit: we can truly insert a new event without touching the original document or any of the other events. This is possible in this case because every step in the process chain already knows the ID of the document without asking Elasticsearch. It’s an ID printed on the document that we can reuse as the ID for our Dokument type.
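A single insert of a child event could be prepared roughly like this. The sketch only builds the request and does not send anything. It assumes the API of the Elasticsearch 1.x line that was current when this post was written, where a child is indexed into its own type with a `parent` query parameter and the child type declares its parent in the mapping; later versions replaced this mechanism with a join field. The index name `documents` and the type name `event` are made up.

```python
import json
from urllib.parse import quote, urlencode


def child_index_request(index, child_type, parent_id, event):
    """Build method, URL path and JSON body for indexing one child event.

    No ID is given for the event itself, so Elasticsearch generates one;
    the printed document ID is only used as the parent reference.
    """
    path = f"/{quote(index)}/{quote(child_type)}"
    query = urlencode({"parent": parent_id})
    return "POST", f"{path}?{query}", json.dumps(event, sort_keys=True)


method, url, body = child_index_request(
    index="documents",      # hypothetical index name
    child_type="event",     # hypothetical child type
    parent_id="4711",       # the ID printed on the document
    event={"step": "ocr", "timestamp": "2014-12-01T08:00:00"},
)
```

Note that the request needs nothing but the printed ID and the event itself: there is no get, no merge and no version to check, which is exactly what makes each process step independent of the others.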

In the end, the performance numbers on the hardware we had at our disposal proved us right: we were able to process a day’s worth of data in about 2 seconds!

Conclusion

While this was a relatively rare use case that you probably won’t encounter in the wild, it contains an interesting essence: sometimes the “natural” or obvious data model goes against the inner workings of Elasticsearch, and it’s useful to remodel your data to better fit your system. Afterwards (and by that I mean the rest of the week, talking about how insanely fast one can accomplish results with Elasticsearch) we were able to develop a small web app where users can search the generated data, and we were pleasantly surprised that search operations were still way faster than we anticipated!

Blog author

Christian Uhl
