MongoDB: Close Encounters of the Third Kind

11.11.2012 | 15 minutes reading time

With the first keystrokes of this post I am entering the third week of my MongoDB developers course at 10gen. Luckily I found another film title that fits in here, well, one way or the other. Writing this blog series while following the lectures is currently taking up a good amount of my time, and to keep up the writing I have had to sacrifice the Python homework so far (what a pity with respect to Python). That most likely means there is hardly any chance of passing the final certification. But writing about this is helping me a great deal to get a better understanding of MongoDB and, most importantly, it is a great deal of fun at the same time. And as I am confident that some alter ego of mine from a parallel universe will skip the blogging and do the certification instead, nothing is lost in the end :). I wonder what all my other selves are doing right now, but never mind.

Let’s jump to the topic of this week’s lecture from the class and thus the topic of this blog post, which is:

Schema Design

I am looking forward to this one, especially as I understood in the beginning that MongoDB is schema-less. But it has already turned out during the class that this only means MongoDB does not require a strict, pre-defined schema. Browsing the topics of the lectures, this week will also be about transactions (or probably the absence of them), atomic operations and joins (the Mongo way).

“If you are coming from the world of relational databases you know that there is a best ideal way to design your schema, which is the third normal form.” – Quotes from the course

Everyone developing software with relational databases knows how much time is spent on designing the table schema. You might have some real “hardliners” who follow the Third Normal Form to the letter. Others accept some redundant data if it eases access to that data, which might also have well-considered performance reasons. Overall one can, and normally does, spend a good amount of time discussing the “optimal” schema. There is nothing bad about this; as with other design decisions in application development, these things need to be discussed. The only bad feeling that sneaks up on me every now and then is that all this is mainly considered from the viewpoint of the SQL database. It is a kind of science in itself that is mostly independent of how the data is later accessed from any application (which we still have to program in the end). And that can be quite tough, despite the fact that we nowadays have object-relational frameworks that should help us close that gap. To be honest, I sometimes have the feeling things become worse, as it gets harder and harder to understand what is really happening.

And these thoughts fit perfectly with the following quote from the course.

“When you are not biased towards any particular access pattern you are equally bad at all of them.” – Quotes from the course

This already came to my mind during last week’s class, but now it is getting much clearer and is also addressed more directly in the lectures: with the approach offered by MongoDB, the database schema is tailored much more to the needs of application development, and not the other way around. In the class this is called Application-Driven Schema Design. I have to curb my enthusiasm a little, as I still lack any real-life programming experience with MongoDB (especially in a bigger project), but it sounds exactly like the right thing to do. And it also means I urgently have to do some Java programming on this. Days (and nights) are simply too short.

It should be made clear here that this does not mean that everything will be put into one big document. Inside our application the data is certainly not held in one big object without references either. But it probably means that we will do more “pre-joins” by putting data into one document that would certainly be spread across different tables in a relational database. In the course the example of a blog application is used, and the comments are stored along with the blog post inside one document structure. That is something that would never be done in SQL, but it seems quite natural on second thought.

“Occasionally – for performance reasons – we’re gonna decide that we want to duplicate the data within the document. But that’s not gonna be the default.” – Quotes from the course

As there is no construct to join data in MongoDB the way we know it from relational databases, it is the application’s responsibility to perform any “joins”. It is a rather big mind shift, and that is probably the reason I am repeating myself on this, but we should check more carefully for possibilities to embed data in one document. This might be more natural from the application’s point of view anyway. Of course we have to put aside our “training” in normalising database structures. Another advantage is the potential performance gain. Reading data from three different documents would require MongoDB to load data from three different data files, requiring more disk seeks. Reading one document will be faster and probably most of the time also easier to implement. One thing to keep in mind is of course the 16 MB limit for documents. (And maybe it is not a good idea to go from one extreme to the other anyway.)
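
To make the difference between linking and embedding a bit more tangible, here is a minimal sketch in plain JavaScript. It does not talk to MongoDB at all: simple arrays stand in for collections, and the names posts, comments and loadPostWithComments are made up for this illustration and do not come from the course.

```javascript
// Linked design: posts and comments live in separate "collections"
// (plain arrays here, standing in for MongoDB collections).
const posts = [{ _id: 1, title: 'Schema Design' }];
const comments = [
  { _id: 10, postId: 1, text: 'Nice post' },
  { _id: 11, postId: 1, text: 'More Star Trek, please' },
  { _id: 12, postId: 2, text: 'Belongs to another post' },
];

// The application performs the "join": one lookup per collection.
function loadPostWithComments(postId) {
  const post = posts.find((p) => p._id === postId);
  if (!post) return null;
  const postComments = comments
    .filter((c) => c.postId === postId)
    .map((c) => ({ text: c.text }));
  return { ...post, comments: postComments };
}

// Embedded ("pre-joined") design: one document already holds everything,
// so a single read is enough and no application-side join is needed.
const embeddedPost = {
  _id: 1,
  title: 'Schema Design',
  comments: [{ text: 'Nice post' }, { text: 'More Star Trek, please' }],
};

// Both designs lead to the same object inside the application.
console.log(JSON.stringify(loadPostWithComments(1)) === JSON.stringify(embeddedPost));
```

In the linked variant the application needs two lookups and has to know how the pieces fit together. In the embedded variant that knowledge is already part of the document.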

Pre-Joins are the Constraints of the MongoDB world

Database constraints are another interesting topic, and I would like to start with a short anecdote from a project I worked on quite some time back. You know constraints in relational databases. They are a great way to keep your data consistent by “gluing” things together: a record in table X with a foreign key to table Z can only exist if there is a record in table Z with a matching id value. OK, back in that project we had a really complex and complicated database design with dozens of foreign key constraints. One of the problems was: if you wanted to delete something, you needed to know the exact order of all these constraints, as otherwise the deletion would fail. In the end there were scripts that disabled all the constraints, did the deletion and then enabled them again, while we kept our fingers crossed that everything was still intact. My feelings towards the heavy use of constraints have been a bit divided since then, even though they can still come in quite handy to let the (relational) database solve certain problems for you.

“One good rule of thumb with MongoDB schema design is that if you see yourself doing it exactly the same way you would do it in a relational design, you are probably not taking the best approach.” – Quotes from the course

How is it then with constraints and MongoDB? Well, the simple answer is that the concept of constraints does not exist in MongoDB. Either you embed the data that belongs together, or the constraints have to be enforced programmatically in the application. My takeaway so far is: if data can be embedded in a meaningful way inside one document, things will get much easier than in the relational world. If for whatever reason this pre-joining of data is not possible, the danger of getting inconsistent data increases and the implementation will probably get more complex to ensure this is avoided.

No Transactions

Not really anything new here by now. MongoDB does not support transactions. But it does support atomic operations on documents, which means that one will always see either all changes to a document or none of them. Again it helps a lot to consider the differences between MongoDB and relational databases to put the missing transactions into perspective. In relational databases transactions are often required to ensure that updates of one data set, which is spread over several tables, are done in a consistent way. Writing the next sentence I have more and more the feeling that I have already been completely assimilated into the MongoDB world. (Mental note: scan for MongoDB nanoprobes in my bloodstream!) Back to the topic :-): if all data fits well into one document (meaning it makes sense and does not come close to the 16 MB barrier), this and the fact that operations on documents are atomic will very well replace any need for additional transaction support.

So basically there are the following proposed approaches:

Structure the application in a way that it can work with a single document. This is pretty much what I mentioned before.
Implement transactions in software if updates to different collections are required. This then depends highly on the application as such, but as the word semaphore was used in the class, I had to repeat it here.
Tolerate a little bit of inconsistency.

The last point from the list leads me to this quote from the class and some more discussion on it.

“Just tolerate a little bit of inconsistency (…) It does not matter if everyone sees your wall update (in Facebook) simultaneously.” – Quotes from the course

As Facebook was used here as an example, it is easy for me to agree. But for a lot of other systems this might not be a real option, for different reasons. Some systems might really require consistency all the time, and for a lot of other systems it is at least believed that it is required all the time. I am not sure I would like to be the one to start a discussion about “a little bit of (temporary) inconsistency” ;-). Anyway, as there are two other possibilities to achieve transactional behaviour with MongoDB, this is not a real problem either.

Data Relation

I am still looking at this through the glasses of someone used to relational databases, and as such I have my experience in how to model the different relations that can occur between entities: one-to-one, one-to-many and many-to-many. With MongoDB a new learning process is obviously needed.

One-To-One Relations

Let’s take a look at an example for a one-to-one relation.


captain {
    _id  : 'JamesT.Kirk',
    name : 'James T. Kirk',
    age  : 38,
    ...
    ship : 'ussenterprise'
}

starship {
    _id     : 'ussenterprise',
    name    : 'USS Enterprise',
    class   : 'Galaxy',
    ...
    captain : 'JamesT.Kirk'
}

This is obviously an example of a one-to-one relation. A starship can only have one captain at a time, and a captain is captain of exactly one starship. In the above example both entities are put in different collections. To be able to “join” them (inside our program, not using MongoDB) we have created references from the captain collection to the starship collection and vice versa. This is perfectly OK and might make sense depending on the additional data that is stored in the different documents. The following aspects should be considered when modelling one-to-one relations:

Frequency of access of the different documents.
Size of the items, keeping the 16 MB limit in mind.
Atomicity of the data and thus data consistency.

“You should pre-join your data, you should embed the data in ways that make sense for your application. For lots of different reasons and one is that it helps keeping your data intact and consistent.” – Quotes from the course

In the above example it would probably be OK to embed the captain in his starship as follows:


starship {
    _id     : 'ussenterprise',
    name    : 'USS Enterprise',
    class   : 'Galaxy',
    ...
    captain : {
        name : 'James T. Kirk',
        age  : 38,
        ...
    }
}

Looks quite natural by now, doesn’t it? Well, it probably simply requires some experience to get a good understanding of when to embed.

One-To-Many Relations

One starship has many crew members. That might be a good example to start with for the one-to-many relation. Does it make sense to embed the list of an entire crew inside the starship document? Well, in principle yes, I would say, but the problem could be that when extending our example to a Borg cube we might run into the 16 MB limit. That thing can have a crew complement of up to 130,000. (It just comes to my mind that here, instead of a name attribute, a designation attribute could be used in the document thanks to MongoDB’s flexibility.)
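
How close would such a cube get to the limit? The following back-of-the-envelope sketch for Node.js may give a feeling for it. Two loud assumptions: the drone fields (designation, role, unimatrix) are invented for this example, and the length of the JSON text is only a stand-in for the real BSON size, which will differ. By my count one such drone takes about 75 bytes as JSON, so 130,000 of them end up somewhere between 9 and 10 MB.

```javascript
// Rough size estimate for embedding a whole crew in one starship document.
// Assumption: the JSON byte length is used as a stand-in for the BSON size;
// the two differ, so treat the result as an order of magnitude only.
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024; // the 16 MB document limit

function buildCube(crewSize) {
  const crew = [];
  for (let i = 1; i <= crewSize; i++) {
    // Hypothetical drone fields, invented for this sketch.
    crew.push({ designation: `Drone ${i} of ${crewSize}`, role: 'maintenance', unimatrix: 1 });
  }
  return { _id: 'borgcube', name: 'Borg Cube', crew };
}

function estimateBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), 'utf8');
}

const cube = buildCube(130000);
const bytes = estimateBytes(cube);
const toMB = (n) => (n / (1024 * 1024)).toFixed(1);
console.log(`approx. ${toMB(bytes)} MB of ${toMB(MAX_DOCUMENT_BYTES)} MB allowed`);
```

Even with these tiny drone records a good part of the limit is used up, and a few more attributes per drone would be enough to exceed it.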

“When it requires two collections then it requires two collections.” – Quotes from the course

In the case of one-to-many relations there will often be a need for real linking between collections. It is advisable to link from the collection storing the many values to the collection storing the one. Thus every Borg drone knows which cube it belongs to. What I found a very good rule of thumb for deciding on the data structure here is the question: is this really one-to-many, or is it just one-to-few? In the latter case an array inside the document is most likely the better choice.
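
A small sketch of both variants in plain JavaScript, again with arrays standing in for collections. All names and values (cubes, drones, dronesOfCube, the shuttle document) are made up for illustration only.

```javascript
// One-to-many with links from the "many" side: every drone stores the _id
// of the cube it belongs to.
const cubes = [{ _id: 'cube-1', name: 'Borg Cube 1' }];
const drones = [
  { _id: 1, designation: 'One of Three', cube: 'cube-1' },
  { _id: 2, designation: 'Two of Three', cube: 'cube-1' },
  { _id: 3, designation: 'Three of Three', cube: 'cube-2' },
];

// The "join" is done by the application: collect all drones pointing at a cube.
function dronesOfCube(cubeId) {
  return drones.filter((d) => d.cube === cubeId);
}

// One-to-few alternative: a short list simply lives inside the parent document.
const shuttle = { _id: 'shuttle-1', crew: ['Pilot', 'Engineer', 'Doctor'] };

console.log(dronesOfCube('cube-1').map((d) => d.designation));
console.log(shuttle.crew.length);
```

The cube document itself stays small no matter how many drones exist, which is the point of linking from the many side.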

Many-To-Many Relations

Another example to start with. At Starfleet Academy there are candidates and instructors. Several candidates will be assigned to one instructor and one instructor will have several candidates to train.


candidates {
    _id  : 99,
    name : 'Harry Kim',
    ...
    instructors : [1, 2]
}

instructors {
    _id  : 1,
    name : 'Tuvok',
    ...
    candidates : [99, 100]
}

In the above example we have two different collections. We have a link in both directions by having for each candidate a list of instructors and for each instructor a list of candidates.

This is MongoDB, so no one but you and your program is responsible for ensuring the consistency of the documents, e.g. that a candidate with an _id of 100 really exists.

Again it helps to ask the question whether this is really many-to-many or just few-to-few in order to decide if embedding is an option. And one should consider whether the entities should be able to exist independently of each other. Embedding, for example, the candidates into the instructors collection (also a bad idea for other reasons) would require creating an instructor document to be able to create a candidate. They could no longer exist independently of each other. Keeping them separate and only linking them together is probably the better idea here.
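
What such a consistency check inside the program could look like is sketched below in plain JavaScript. The arrays stand in for the two collections, the helper findDanglingReferences is a hypothetical name, and the reference to candidate 101 is wrong on purpose to have something to find.

```javascript
// In-memory stand-ins for the two collections.
const candidates = [
  { _id: 99, name: 'Harry Kim', instructors: [1, 2] },
  { _id: 100, name: 'Tom Paris', instructors: [1] },
];
const instructors = [
  { _id: 1, name: 'Tuvok', candidates: [99, 100] },
  { _id: 2, name: 'Spock', candidates: [99, 101] }, // 101 does not exist
];

// Returns every reference stored in sources[field] that has no matching
// _id in targets. The database will not do this for us.
function findDanglingReferences(sources, field, targets) {
  const knownIds = new Set(targets.map((t) => t._id));
  const dangling = [];
  for (const source of sources) {
    for (const ref of source[field] || []) {
      if (!knownIds.has(ref)) dangling.push({ from: source._id, missing: ref });
    }
  }
  return dangling;
}

console.log(findDanglingReferences(instructors, 'candidates', candidates));
console.log(findDanglingReferences(candidates, 'instructors', instructors));
```

A check like this could run before writing a document or as a periodic clean-up job. Which of the two fits better depends on the application.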

I guess we agree that this blog post was now far too long without any real hacking, wasn’t it? So let’s take a look at the Mongo shell and something called “Multikey Indexes”:


> db.instructors.find()
{ "_id" : 1, "name" : "Tuvok", "candidates" : [ 99, 100 ] }
{ "_id" : 2, "name" : "Spock", "candidates" : [ 99, 100 ] }
{ "_id" : 3, "name" : "Pike", "candidates" : [ 100, 101 ] }
> db.candidates.find()
{ "_id" : 99, "name" : "Harry Kim" }
{ "_id" : 100, "name" : "Tom Paris" }
{ "_id" : 101, "name" : "Seven of Nine" }
> db.instructors.ensureIndex({candidates : 1})
> db.instructors.find({candidates : {$all : [99,100]}})
{ "_id" : 1, "name" : "Tuvok", "candidates" : [ 99, 100 ] }
{ "_id" : 2, "name" : "Spock", "candidates" : [ 99, 100 ] }

What have I done here? Obviously I created two collections and added some data. Let’s look at it from the perspective of the instructors, where Tuvok and Spock both have Harry and Tom as candidates. Christopher Pike has Tom and Seven of Nine. Now we can create an index on the candidates array in the instructors collection by issuing: db.instructors.ensureIndex({candidates : 1})

“The ability to structure and express rich data is one of the things that makes MongoDB so interesting.” – Quotes from the course

The following query would certainly have worked without the index, but let’s assume we really have a lot of data. Then let’s find all the instructors having both Harry and Tom as candidates. It is good to get a little refresher on the query syntax that was covered quite extensively last week (and in the corresponding blog post). More details on indexing will follow later in the class (that was at least promised), but I cannot resist showing you the explain() command right away.


> db.instructors.find({candidates : {$all : [99,100]}}).explain()
{
    "cursor" : "BtreeCursor candidates_1",
    "isMultiKey" : true,
    "n" : 2,
    "nscannedObjects" : 2,
    "nscanned" : 2,
    "nscannedObjectsAllPlans" : 2,
    "nscannedAllPlans" : 2,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "candidates" : [
            [
                99,
                99
            ]
        ]
    },
    "server" : "Thomass-MacBook-Pro.local:27017"
}

Well, it should be visible from the "cursor" : "BtreeCursor candidates_1" entry that the index we created has been used. OK, we probably have to wait for next week’s lessons to learn more about this :-).
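
Until then, here is a toy model in plain JavaScript of what a multikey index does conceptually: it holds one entry per array element, each pointing back at the documents containing that value. This is a sketch under my own assumptions and certainly not how MongoDB implements it internally. The functions buildMultikeyIndex and findAll are invented names, and the strategy of looking up only the first value and filtering the rest is merely my reading of the indexBounds in the explain() output above.

```javascript
const instructors = [
  { _id: 1, name: 'Tuvok', candidates: [99, 100] },
  { _id: 2, name: 'Spock', candidates: [99, 100] },
  { _id: 3, name: 'Pike', candidates: [100, 101] },
];

// Toy "multikey index": one entry per array element, pointing back at
// the documents that contain this value.
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of doc[field]) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc);
    }
  }
  return index;
}

// Toy version of {field : {$all : values}}: use the index for the first
// value only, then check the remaining values document by document.
function findAll(index, field, values) {
  const [first, ...rest] = values;
  const hits = index.get(first) || [];
  return hits.filter((doc) => rest.every((v) => doc[field].includes(v)));
}

const index = buildMultikeyIndex(instructors, 'candidates');
console.log(findAll(index, 'candidates', [99, 100]).map((d) => d.name));
```

Only the two documents found under key 99 have to be inspected here, which would match the nscanned value of 2 shown by explain().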

To be continued …

The MongoDB class series

Part 1 – MongoDB: First Contact
Part 2 – MongoDB: Second Round
Part 3 – MongoDB: Close Encounters of the Third Kind
Part 4 – MongoDB: I am Number Four
Part 5 – MongoDB: The Fith Element
Part 6 – MongoDB: The Sixth Sense
Part 7 – MongoDB: Tutorial Overview and Ref-Card

Java Supplemental Series

Part 1 – MongoDB: Supplemental – GRIDFS Example in Java
Part 2 – MongoDB: Supplemental – A complete Java Project – Part 1
Part 3 – MongoDB: Supplemental – A complete Java Project – Part 2

Blog author

Thomas Jaspers

Senior Software Engineer & AI Enthusiast
