
MongoDB Text Search Tutorial

10.1.2013 | 7 minutes of reading time

In my introduction to text search in MongoDB, we had a look at the basic features. Today we’ll have a closer look at the details.

API

You may have noticed that a text search is not executed with a find() command. Instead you call

db.foo.runCommand( "text", {search: "bar"} )

Remember that this is still an experimental feature. Adding it to the implementation of the find() command would have mixed critical production code with the new text search feature. When executed via a runCommand() call, text search can be run and tested in isolation.

I expect to see a new query operator like $text or $textsearch as soon as text search is integrated with the standard find() command.

Text Query Syntax

In the previous examples we just searched for a single word. We can do more than that. Let’s have a look at the following example:

db.foo.drop()
db.foo.ensureIndex( {txt: "text"} )
db.foo.insert( {txt: "Robots are superior to humans"} )
db.foo.insert( {txt: "Humans are weak"} )
db.foo.insert( {txt: "I, Robot - by Isaac Asimov"} )

A search for “robot” will find two documents; the same is true for “human”:

> db.foo.runCommand("text", {search: "robot"}).results.length
2
> db.foo.runCommand("text", {search: "human"}).results.length
2

When searching for multiple terms, an OR search is performed, yielding three documents in our example:

> db.foo.runCommand("text", {search: "human robot"}).results.length
3

I would have expected the given search words to be AND-ed, not OR-ed.
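
To make the difference between OR and AND semantics concrete, here is a minimal sketch in plain JavaScript (for a current runtime such as Node.js) that mimics the counts seen above on the three sample documents. The tokenizer and the naiveStem function, which only strips a trailing "s", are simplifying assumptions for illustration. They are not MongoDB's stemmer, and all helper names are hypothetical.

```javascript
// Toy model of OR vs. AND matching; NOT MongoDB's implementation.
const docs = [
  "Robots are superior to humans",
  "Humans are weak",
  "I, Robot - by Isaac Asimov",
];

// Assumption: a crude stemmer that lowercases and strips one trailing "s".
function naiveStem(word) {
  return word.toLowerCase().replace(/s$/, "");
}

// Split on anything that is not a letter and stem every token.
function stems(text) {
  return text.split(/[^A-Za-z]+/).filter(Boolean).map(naiveStem);
}

// mode "or": at least one term matches; mode "and": all terms must match.
function count(search, mode) {
  const terms = stems(search);
  return docs.filter((doc) => {
    const keys = new Set(stems(doc));
    return mode === "and"
      ? terms.every((t) => keys.has(t))
      : terms.some((t) => keys.has(t));
  }).length;
}

console.log(count("robot", "or"));        // 2
console.log(count("human robot", "or"));  // 3, like the text search above
console.log(count("human robot", "and")); // 1, what an AND search would yield
```

Only the first document contains both stems, so an AND search would return a single hit here.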

Negation

By adding a leading minus sign to a search word, you can exclude documents containing that word. Let’s say we want all documents on “robot” but none on “humans”.

> db.foo.runCommand("text", {search: "robot -humans"})
{
        "queryDebugString" : "robot||human||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc484214a1e88aaa4ada0"),
                                "txt" : "I, Robot - by Isaac Asimov"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 212
        },
        "ok" : 1
}

Phrase Search

By enclosing multiple words inside quotes (“foo bar”) you perform a phrase search. Inside a phrase, order is important and stop words are also taken into account:

> db.foo.runCommand("text", {search: '"robots are"'})
{
        "queryDebugString" : "robot||||robots are||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc482214a1e88aaa4ad9e"),
                                "txt" : "Robots are superior to humans"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}

Please have a look at the “queryDebugString” field:

"queryDebugString" : "robot||||robots are||"

It tells us that our search string contains one stem, “robot”, and also the phrase “robots are”. That’s the reason we have only one hit. Compare that to these searches:

> // order matters inside phrase
> db.foo.runCommand("text", {search: '"are robots"'}).results.length
0
> // no phrase search --> OR query
> db.foo.runCommand("text", {search: 'are robots'}).results.length
2
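
Judging from the outputs in this article, the queryDebugString seems to consist of four sections separated by "||": search terms, negated terms, phrases and negated phrases. The following sketch splits the string on that assumption. The section layout is inferred from the examples shown here, not taken from documentation, and the function name parseQueryDebugString is hypothetical.

```javascript
// Assumption: four "||"-separated sections, inferred from the sample outputs.
function parseQueryDebugString(debug) {
  const [terms, negatedTerms, phrases, negatedPhrases] = debug
    .split("||")
    // Assumption: several entries in one section are joined by a single "|".
    .map((section) => section.split("|").filter(Boolean));
  return { terms, negatedTerms, phrases, negatedPhrases };
}

// Negation example: one term, one negated term.
console.log(parseQueryDebugString("robot||human||||"));

// Phrase example: one term plus the phrase "robots are".
console.log(parseQueryDebugString("robot||||robots are||"));
```

Read this way, "robot||human||||" contains the term “robot” and the negated term “human”, while "robot||||robots are||" contains the term “robot” and the phrase “robots are”.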

Multi Language Support

Stemming and stop word filtering are both language dependent. So if you want to use a language other than the default, which is English, you have to tell MongoDB what language to use for indexing and searching. MongoDB uses the open source Snowball stemmer, which supports a range of languages.

To use another language for indexing and searching, specify it when creating the index:

db.de.ensureIndex( {txt: "text"}, {default_language: "german"} )

With this setting, MongoDB assumes that all text in the field “txt” and all text searches on that collection are in German. Let’s see if it works:

> db.de.insert( {txt: "Ich bin Dein Vater, Luke." } )
> db.de.validate().keysPerIndex["text.de.$txt_text"]
2

As you can see, there are only two index keys, so stop word filtering did occur, this time with a German stop word list. (Vater is the German word for father, not a typo for Vader.) Let’s try some searches:

> db.de.runCommand("text", {search: "ich"}).results.length
0
> db.de.runCommand("text", {search: "Vater"}).results.length
1
> db.de.runCommand("text", {search: "Luke"}).results.length
1

Please note that we don’t have to give the language we are searching for because it is derived from the index. We have hits for the meaningful words “Vater” and “Luke”, but not for the stop word “ich” (which means “I”).

It is also possible to mix multiple languages in the same index. Each single document can have its own language:

db.de.insert( {language:"english", txt: "Ich bin ein Berliner" } )

If a field “language” is present, its content defines the language for stemming and stop word filtering for the indexed field(s) of that document. The word “ich” is not a stop word in English, so it is indexed now.

// default language: german -> no hits
> db.de.runCommand("text", {search: "ich"})
{
        "queryDebugString" : "||||||",
        "language" : "german",
        "results" : [ ],
        "stats" : {
                "nscanned" : 0,
                "nscannedObjects" : 0,
                "n" : 0,
                "timeMicros" : 96
        },
        "ok" : 1
}

// search for English -> one hit
> db.de.runCommand("text", {search: "ich", language: "english"})
{
        "queryDebugString" : "ich||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.625,
                        "obj" : {
                                "_id" : ObjectId("50ed163b1e27d5e73741fafb"),
                                "language" : "english",
                                "txt" : "Ich bin ein Berliner"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 161
        },
        "ok" : 1
}

What happened here? The default language for searching is German. So the first search has no result (as before). In the second search we say to search for English text (to be more precise: for index keys that were generated with an English stemmer and stop words). That’s why we find the famous sentence from JFK.
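
The effect can be reproduced with a small toy model in plain JavaScript (for a current runtime such as Node.js). It is only a sketch: the stop word lists are tiny, hand-picked assumptions, there is no stemming at all, and none of the names below belong to MongoDB.

```javascript
// Toy model of language-dependent stop word filtering; NOT MongoDB's code.
// Assumption: tiny, hand-picked stop word lists and no stemming.
const STOP_WORDS = {
  german: new Set(["ich", "bin", "dein", "ein"]),
  english: new Set(["i", "am", "your", "a"]),
};

// Lowercase, split on non-letters, drop stop words of the given language.
function tokens(text, language) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w && !STOP_WORDS[language].has(w));
}

const DEFAULT_LANGUAGE = "german"; // mirrors default_language of the index
const collection = [
  { txt: "Ich bin Dein Vater, Luke." },
  { language: "english", txt: "Ich bin ein Berliner" },
];

// Each document is indexed with its own language, or with the default one.
const index = collection.map(
  (doc) => new Set(tokens(doc.txt, doc.language || DEFAULT_LANGUAGE))
);

// The search string is tokenized with the search language,
// then compared with the stored index keys.
function search(text, language = DEFAULT_LANGUAGE) {
  const terms = tokens(text, language);
  return index.filter((keys) => terms.some((t) => keys.has(t))).length;
}

console.log(index[0].size);            // 2 keys: "vater" and "luke"
console.log(search("ich"));            // 0, "ich" is a German stop word
console.log(search("ich", "english")); // 1, the Berliner document
```

In this model the German search for “ich” is reduced to an empty query, which matches the empty queryDebugString in the first output above.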

What does that mean? Well, you have a real multi-language text search at hand. You can store text messages from around the world in one collection and still search them depending on the language.

Multiple Fields

A text index can span more than one field. If you are using more than one field, each field can have its own weight. That lets you index different parts of your document and give each a different importance.

> db.mail.ensureIndex( {subject: "text", body: "text"}, {weights: {subject: 10} } )
> db.mail.getIndices()
[
        ...
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "de.mail",
                "name" : "subject_text_body_text",
                "weights" : {
                        "body" : 1,
                        "subject" : 10
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]

We created a text index spanning the fields “subject” and “body”, where the first got a weight of 10 and the latter the standard weight 1. Let’s see what impact these weights have:

> db.mail.insert( {subject: "Robot leader to minions", body: "Humans suck", prio: 0 } )
> db.mail.insert( {subject: "Human leader to minions", body: "Robots suck", prio: 1 } )
> db.mail.runCommand("text", {search: "robot"})
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed1be71e27d5e73741fafe"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50ed1bfd1e27d5e73741faff"),
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck",
                                "prio" : 1
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 148
        },
        "ok" : 1
}

The document with “robot” in the “subject” field has a much higher score because the weight of 10 is used as a multiplier.
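
The scores in this article are consistent with a simple rule: for a term that occurs once in a field, the score equals the field weight times (0.5 / n + 0.5), where n is the number of tokens left in that field after stop word removal. This is an observation that happens to reproduce the numbers shown here, not a documented formula. The function name estimateScore is hypothetical, and the token counts assume that “to”, “I” and “by” are filtered as English stop words.

```javascript
// Heuristic that reproduces the scores shown in this article for a term
// occurring exactly once in a field. NOT taken from MongoDB documentation.
function estimateScore(weight, tokensInField) {
  return weight * (0.5 / tokensInField + 0.5);
}

// "Robot leader to minions": 3 tokens (assuming "to" is a stop word), weight 10
console.log(estimateScore(10, 3)); // roughly 6.67

// "Robots suck": 2 tokens, weight 1
console.log(estimateScore(1, 2)); // 0.75

// "I, Robot - by Isaac Asimov": 3 tokens (assuming "I" and "by" are stop words)
console.log(estimateScore(1, 3)); // roughly 0.67

// "Ich bin ein Berliner" indexed as English: 4 tokens, weight 1
console.log(estimateScore(1, 4)); // 0.625
```

Under this reading, a match in a short field scores higher than a match in a long one, and the weight simply scales the result.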

Filtering and Projection

You can apply additional search criteria via filtering:

> db.mail.runCommand("text", {search: "robot", filter: {prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed22621e27d5e73741fb04"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 2,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}

Please note that filtering does not use an index.

If you are interested only in a subset of fields, you can use projection (similar to the aggregation framework):

> db.mail.runCommand("text", {search: "robot", project: {_id:0, prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 127
        },
        "ok" : 1
}

Filtering and projection can be combined, of course.
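
As a sketch of such a combination, the helper below assembles the options document for the text command from the option names used in this article (search, filter, project, language). The function name buildTextOptions is hypothetical, and the code is written for a current JavaScript runtime such as Node.js rather than for the mongo shell of that time. I have not run the combined command against a server.

```javascript
// Builds the options document for the "text" command.
// Only the option names that appear in this article are handled.
function buildTextOptions(search, { filter, project, language } = {}) {
  const options = { search };
  if (filter !== undefined) options.filter = filter;
  if (project !== undefined) options.project = project;
  if (language !== undefined) options.language = language;
  return options;
}

const options = buildTextOptions("robot", {
  filter: { prio: 0 },
  project: { _id: 0, prio: 0 },
});

console.log(JSON.stringify(options));
// {"search":"robot","filter":{"prio":0},"project":{"_id":0,"prio":0}}

// In the mongo shell the document would be passed on like this:
// db.mail.runCommand("text", options)
```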

Examples

All examples can be found on GitHub. Try them yourself.

Summary

With this second part on MongoDB text search we had a look at the more interesting features of the text search capability. For a start, that’s quite a good toolbox for implementing your own search engine. I’m looking forward to your feedback.
