Move n-gram extraction into your Keras model!

18.7.2019 | 7 minutes reading time


In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it bigrams and trigrams instead of single characters. For his experiments, he could modify the preprocessing and the model as he wished, but for production it was much preferable to replace only the model being served by TensorFlow and leave all other code unchanged. And that is what we did: we moved the bigram and trigram extraction into our neural network. In this blog post, I’ll show you the basic idea, the implementation, an application and the limitations of our approach.

The idea: n-gram extraction via convolution

Suppose we want to process the quote

“I’d far rather be happy than right any day”

by Douglas Adams. Instead of looking at the text as a sequence of characters (where ␣ denotes a space)

I, ', d, ␣, f, a, r, ␣, r, a, t, h, e, r, …

a neural network may profit from looking at pairs of adjacent characters, that is, at the sequence of bigrams

I', 'd, d␣, ␣f, fa, ar, r␣, ␣r, ra, at, th, he, er, …

or even at the sequence of trigrams or n-grams for n larger than 3. To feed the neural network, we need to convert characters into numbers, for example, using the ASCII or UTF-8 codes. Our bigrams then become sequences of pairs of numbers:

(73, 39), (39, 100), (100, 32), (32, 102), (102, 97), …

If we encode these bigrams using the rule (a, b) ↦ N · a + b, where N is the size of our alphabet, we obtain a sequence of numbers again: in case N = 256, this would be

73*256+39 = 18727, 39*256+100 = 10084, 100*256+32 = 25632, 32*256+102 = 8294, …
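The arithmetic above can be checked with a few lines of plain Python. This is only a small sketch: it assumes an ASCII apostrophe (code 39) in the quote, and the names text, codes and bigram_codes are chosen for illustration.

```python
# Encode the bigrams of a short ASCII string with the rule (a, b) -> N * a + b.
N = 256  # alphabet size: one byte per character

text = "I'd far rather"
codes = [ord(char) for char in text]  # starts with 73, 39, 100, 32, 102

# Pair each character code with the code of its right neighbour.
bigram_codes = [N * a + b for a, b in zip(codes, codes[1:])]

print(bigram_codes[:4])  # [18727, 10084, 25632, 8294]
```

A text of length L yields L - 1 bigrams, so the encoded sequence is one element shorter than the input.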

More generally, we can encode n-grams for arbitrary n using the rule

(a_0, …, a_{n-1}) ↦ N^{n-1} · a_0 + N^{n-2} · a_1 + … + N · a_{n-2} + a_{n-1}.
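One way to read this rule is that an n-gram is written as an n-digit number in base N. The following sketch illustrates that reading and shows that the encoding can be inverted as long as every character code is smaller than N. The helper names encode_ngram and decode_ngram are hypothetical and not part of the project code.

```python
def encode_ngram(codes, N):
    """Map (a_0, ..., a_{n-1}) to N^(n-1)*a_0 + ... + N*a_{n-2} + a_{n-1}."""
    value = 0
    for a in codes:
        value = value * N + a  # Horner's scheme: shift by one base-N digit
    return value


def decode_ngram(value, n, N):
    """Recover the n character codes from an encoded n-gram."""
    codes = []
    for _ in range(n):
        value, a = divmod(value, N)  # split off the last base-N digit
        codes.append(a)
    return codes[::-1]


trigram = [ord(char) for char in "I'd"]  # [73, 39, 100]
code = encode_ngram(trigram, 256)
print(code)                        # 4794212
print(decode_ngram(code, 3, 256))  # [73, 39, 100]
```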

Here comes the key observation: with this encoding rule,

extracting n-grams becomes a convolution of the sequence of character codes with the kernel (1, N, …, N^{n-1}).

And this preprocessing step can easily be inserted as a first step into any character-level text-processing neural network.

The implementation

As a warm-up, let us implement the n-gram extraction as a convolution with NumPy. Given a NumPy array of character codes, the n-gram length n and the size of the alphabet N, the following function returns the sequence of encoded n-grams as an array:

import numpy as np

def ngrams_numpy(array, n, alphabet_size):
    # Kernel (1, N, ..., N^(n-1)); np.convolve flips it, so the first
    # character of each n-gram gets the highest power of N.
    kernel = np.power(alphabet_size, range(0, n))
    return np.convolve(array, kernel, mode='valid')
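As a quick sanity check, the function can be applied to the beginning of the quote and compared with the bigram rule (a, b) ↦ N · a + b. The function is repeated here so that the sketch runs on its own; the ASCII apostrophe and the variable names are assumptions made for the example.

```python
import numpy as np


def ngrams_numpy(array, n, alphabet_size):
    # Same function as above, repeated so that the example is self-contained.
    kernel = np.power(alphabet_size, range(0, n))
    return np.convolve(array, kernel, mode='valid')


codes = np.array([ord(char) for char in "I'd far"])
bigrams = ngrams_numpy(codes, 2, 256)

# Direct application of the rule (a, b) -> 256 * a + b for comparison.
expected = 256 * codes[:-1] + codes[1:]

print(bigrams[:4].tolist())               # [18727, 10084, 25632, 8294]
print(np.array_equal(bigrams, expected))  # True
print(len(codes), len(bigrams))           # 7 6
```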

Next, how about the deep learning library Keras? Suppose we already have a working text-processing model whose inputs are (batches of) sequences of character codes. Then we can add bigram or n-gram extraction as a first layer, using a lambda layer, in one line. Indeed, the following function returns a layer that maps a batch of samples, given as a tensor of shape (batch_size, sample_length), to a batch of encoded bigrams of shape (batch_size, sample_length - 1):

from keras.layers import Lambda

def bigrams_lambda_layer(alphabet_size):
    # Encodes the bigram (a, b) as a + alphabet_size * b.
    return Lambda(lambda x: x[:,:-1] + x[:,1:] * alphabet_size)
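The slicing expression inside the lambda layer can be tried out on a plain NumPy array, without Keras. The sketch below uses a made-up toy batch. Note that this expression weights the second character of each bigram with N, so the digits appear in the opposite order compared to the rule (a, b) ↦ N · a + b. Both variants are one-to-one as long as all codes are smaller than N.

```python
import numpy as np

N = 256
# A toy batch with two samples of length 4, that is, shape (2, 4).
x = np.array([[73, 39, 100, 32],
              [102, 97, 114, 32]])

# The same expression as inside the lambda layer.
bigrams = x[:, :-1] + x[:, 1:] * N
print(bigrams.shape)        # (2, 3)
print(bigrams[0].tolist())  # [10057, 25639, 8292]

# Decoding: the remainder is the first character, the quotient the second.
second, first = divmod(int(bigrams[0, 0]), N)
print(first, second)        # 73 39
```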

However, lambda layers in Keras may cause problems when saving, loading or checkpointing the model.

For further deployment of a model, for example with TensorFlow Serving, it might be better to avoid a lambda layer and to use a 1d-convolutional layer with fixed weights as follows:

import numpy as np
from keras import layers, backend

def ngram_block(n, alphabet_size):
    def wrapped(inputs):
        # Frozen convolution: one filter, window size n, no bias.
        layer = layers.Conv1D(1, n, use_bias=False, trainable=False)
        # (batch_size, sample_length) -> (batch_size, sample_length, 1)
        x = layers.Reshape((-1, 1))(inputs)
        # Calling the layer builds it, so its weights can be set afterwards.
        x = layer(x)
        kernel = np.power(alphabet_size, range(0, n),
                          dtype=backend.floatx())
        layer.set_weights([kernel.reshape(n, 1, 1)])
        # Drop the trailing channel axis again.
        return layers.Reshape((-1,))(x)

    return wrapped

This function can be used like a layer:

bigrams_tensor = ngram_block(2, alphabet_size)(input_tensor)

See also the source code for the experiment below. The inner function wrapped performs the following steps:

create a 1d-convolutional layer, named layer, with one feature map, window size n, no bias vector and frozen weights that are not changed during training,
reshape the input inputs, which is a tensor of shape (batch_size, sample_length), to a tensor x with shape (batch_size, sample_length, 1) (necessary because convolutional layers operate on sequences of vectors and not on sequences of scalars),
apply the convolutional layer to the reshaped input, which also creates the weights of the layer,
set the kernel of the convolutional layer to (1, N, …, N^{n-1}) and
reshape the output of the convolutional layer from (batch_size, sample_length - n + 1, 1) to (batch_size, sample_length - n + 1).
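One detail is worth keeping in mind when comparing the Keras block with the NumPy warm-up: convolutional layers in Keras compute a cross-correlation, that is, they do not flip the kernel, whereas np.convolve does. The following NumPy-only sketch, with a toy alphabet size of 10 chosen so that the digits are easy to read, shows the effect on the encoding and the length of the output. It only emulates the arithmetic of the layer and does not use Keras itself.

```python
import numpy as np

N, n = 10, 3                       # toy alphabet size and n-gram length
codes = np.array([1, 2, 3, 4, 5])
kernel = np.power(N, range(0, n))  # (1, 10, 100)

# Cross-correlation (no kernel flip), the operation a Conv1D layer computes.
as_conv1d = np.correlate(codes, kernel, mode='valid')
# True convolution (kernel flipped), as computed by np.convolve.
as_convolve = np.convolve(codes, kernel, mode='valid')

print(as_conv1d.tolist())    # [321, 432, 543]
print(as_convolve.tolist())  # [123, 234, 345]
print(len(codes) - n + 1)    # 3
```

Both outputs encode the same trigrams, only with the digits in reverse order, so either is a valid input for a subsequent embedding layer. In both cases the output has sample_length - n + 1 entries.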

An experiment

Let us finally see how this idea works out for a classical test case, the 20 newsgroups dataset, where the task is to guess the topic of a given post from its text. We shall use a simple character-level convolutional network and see how n-gram extraction inside the model affects the classification accuracy and training time.

To load the data, we use the datasets module of scikit-learn:

from sklearn.datasets import fetch_20newsgroups as fetch

# Strip headers, footers and quotes so that the model sees only the post body.
data = fetch(subset="train", remove=("headers", "footers", "quotes"))
posts, topics = data["data"], data["target"]

Now posts is a list of newsgroup posts as strings, and topics is an array of numbers representing the respective newsgroup topics. For each topic, we have 350 to 600 samples.

Note that this is way too little data for a character-level model to perform well. But let us try nevertheless.

We apply some minimal preprocessing and

convert the characters to lower case,
filter out all characters that are not contained in our chosen ALPHABET,
replace the remaining characters by their index in the ALPHABET,
trim the sequence of indices to a fixed length MAX_LEN (shorter sequences are filled up by repetition, because np.resize repeats its input),
stack all those sequences in one large NumPy array:

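import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz1234567890 !$#()-=+:;,.?/"
MAX_LEN = 1000

def encode_sample(sample, index):
    # Keep only characters from the alphabet and map them to their codes.
    indices = [index[char] for char in sample if char in index]
    # np.resize cuts long sequences and fills short ones by repetition.
    return np.resize(np.array(indices), MAX_LEN)

# Codes start at 1, so the characters of ALPHABET get the codes 1 to 51.
index = {char: i + 1 for i, char in enumerate(ALPHABET)}
X = np.stack([encode_sample(x.lower(), index) for x in posts])
y = np.eye(20)[topics]  # one-hot encoded topics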

Now X is an array of shape (len(posts), MAX_LEN), and y is an array of shape (len(posts), 20) containing the one-hot encoded topics.
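To see what the preprocessing does to a single sample, here is a small sketch on a made-up string. It uses the same alphabet and index as above, but the target length is passed as a parameter (max_len, a hypothetical variation of encode_sample) so that the output stays readable.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz1234567890 !$#()-=+:;,.?/"
index = {char: i + 1 for i, char in enumerate(ALPHABET)}


def encode_sample(sample, index, max_len):
    # Same logic as above, with the target length as a parameter.
    indices = [index[char] for char in sample if char in index]
    return np.resize(np.array(indices), max_len)


# "A" is lowercased by the caller; "~" is not in the alphabet and is dropped.
print(encode_sample("Ab!~".lower(), index, 8).tolist())
# [1, 2, 38, 1, 2, 38, 1, 2]

# One-hot encoding via rows of the identity matrix, as used for y.
print(np.eye(3)[[0, 2]].tolist())  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

The first output shows that a short sample is repeated until the target length is reached and is not padded with zeros.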

As a baseline, we train a simple convolutional model:

from keras import layers, models, optimizers

LAYER_PARAMS = [[64, 3, 3], [128, 3, 3]]  # filters, kernel size, pool size
EMBEDDING_DIM = 16

def build_model():
    inputs = layers.Input(shape=(MAX_LEN,))
    # Character codes run up to len(ALPHABET), hence len(ALPHABET) + 1 rows.
    x = layers.Embedding(len(ALPHABET) + 1, EMBEDDING_DIM)(inputs)
    for filters, kernel_size, pool_size in LAYER_PARAMS:
        x = layers.Conv1D(filters, kernel_size, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.SpatialDropout1D(0.15)(x)
        x = layers.MaxPooling1D(pool_size)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(20, activation="softmax")(x)
    model = models.Model(inputs=inputs, outputs=x)
    model.compile(optimizer=optimizers.Adadelta(),
                  loss="categorical_crossentropy", metrics=["acc"])
    return model

model = build_model()
history = model.fit(X, y, epochs=60, batch_size=20,
                    validation_split=0.2)

The results are quite poor: the validation accuracy reaches just 60 percent.

By careful tuning of hyperparameters, things certainly could be improved a bit.
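For orientation, the length of the sequence after each convolution and pooling block of the baseline model can be worked out by hand. The sketch below does this in plain Python, assuming the Keras defaults used above (no padding in Conv1D, and a pooling stride equal to the pool size). The helper sequence_lengths is hypothetical and only illustrates the arithmetic.

```python
MAX_LEN = 1000
LAYER_PARAMS = [[64, 3, 3], [128, 3, 3]]  # filters, kernel size, pool size


def sequence_lengths(input_length):
    """Sequence length after each Conv1D/MaxPooling1D block (no padding)."""
    lengths = [input_length]
    for _filters, kernel_size, pool_size in LAYER_PARAMS:
        after_conv = lengths[-1] - kernel_size + 1  # Conv1D without padding
        lengths.append(after_conv // pool_size)     # pooling with stride = pool size
    return lengths


print(sequence_lengths(MAX_LEN))  # [1000, 332, 110]
```

So the global average pooling layer averages over 110 positions with 128 channels each before the final dense layer.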

Now let us see how bigram and trigram extraction will affect performance of the model. Using the function ngram_block, we only need to insert the line x = ngram_block(n, size)(inputs) between the Input and Embedding layers in build_model as follows:

def build_ngram_model(n):
    # Character codes run up to len(ALPHABET), so each character
    # can take len(ALPHABET) + 1 distinct values.
    size = len(ALPHABET) + 1
    inputs = layers.Input(shape=(MAX_LEN,))
    x = ngram_block(n, size)(inputs)
    x = layers.Embedding(pow(size, n), n * EMBEDDING_DIM)(x)
    for filters, kernel_size, pool_size in LAYER_PARAMS:
        x = layers.Conv1D(filters, kernel_size, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.SpatialDropout1D(0.05 + 0.1 * n)(x)
        x = layers.MaxPooling1D(pool_size)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(20, activation="softmax")(x)
    model = models.Model(inputs=inputs, outputs=x)
    model.compile(optimizer=optimizers.Adadelta(),
                  loss="categorical_crossentropy", metrics=["acc"])
    return model

We also raise the embedding dimension (because now we want to embed bigrams and trigrams instead of single characters) and use a spatial dropout rate that grows with n. Let us see how the n-gram model performs:

for n in range(1, 4):
    build_ngram_model(n).fit(X, y, epochs=40,
                             batch_size=20, validation_split=0.2)

The training histories show that n-gram extraction yields a significant improvement.

Indeed, the mean validation accuracy over the last 5 training epochs increases by more than 12 percentage points from n = 1 to n = 3:

n	1	2	3
mean validation accuracy	0.5796	0.6401	0.7064

One limitation of the technique

Why did we stop at trigrams in the experiment above? The reason is that we do not only encode the n-grams that occur in our samples, but reserve codes for all n-grams that could possibly occur. And that makes a huge difference as n grows:

n	1	2	3	4	5
#(occurring n-grams)	52	2,596	47,203	214,362	551,904
#(potential n-grams)	51	2,601	132,651	6,765,201	345,025,251

Therefore, the memory needed by the embedding layer grows exponentially with n. This is why we stick to bigrams or trigrams. By the way, the numbers above were extracted as follows:

import pandas as pd

def all_ngrams(n):
    length = MAX_LEN - n + 1
    def ngrams(x):
        # Zip n shifted copies of the sample to obtain its n-grams as tuples.
        return set(zip(*[x[i:length + i] for i in range(0, n)]))

    return set().union(*[ngrams(x) for x in X])

ns = range(1, 6)
alphabet_size = len(ALPHABET)
cts = {'#(occurring n-grams)': [len(all_ngrams(n)) for n in ns],
       '#(potential n-grams)': [pow(alphabet_size, n) for n in ns]}
pd.DataFrame(cts, index=pd.Index(ns, name='n')).transpose()
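To make the memory argument more concrete, one can estimate the size of the embedding matrix for each n. The sketch below assumes the alphabet size 51 from the table, the embedding dimension n * 16 from the experiment, and 4 bytes per weight (float32, which is an assumption about the configuration). The helper embedding_params is hypothetical.

```python
ALPHABET_SIZE = 51   # len(ALPHABET), as in the table above
EMBEDDING_DIM = 16   # per-character embedding dimension from the experiment
BYTES_PER_FLOAT = 4  # assumption: float32 weights


def embedding_params(n):
    rows = ALPHABET_SIZE ** n        # one row per potential n-gram
    return rows * n * EMBEDDING_DIM  # each row has n * EMBEDDING_DIM entries


for n in range(1, 6):
    params = embedding_params(n)
    megabytes = params * BYTES_PER_FLOAT / 1e6
    print(f"n={n}: {params:,} weights, about {megabytes:,.1f} MB")
```

Under these assumptions, the trigram embedding has about 6.4 million weights (roughly 25 MB), whereas n = 5 would already require about 27.6 billion weights (roughly 110 GB).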


Blog author

Thomas Timmermann
