Take control of named entity recognition with your own Keras model!

13.11.2020 | 9 minutes reading time

This post shows how to extract information from text documents with the high-level deep learning library Keras: we build, train and evaluate a bidirectional LSTM model by hand for a custom named entity recognition (NER) task on legal texts.

In a previous post, we solved the same NER task on the command line with the NLP library spaCy. The present approach requires some work and knowledge, but yields a much more flexible solution which we can tune, scale and modify to our needs.

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in "Fine-grained Named Entity Recognition in Legal Documents". It consists of decisions from several German federal courts with annotations of named entities referring to legal norms, court decisions, legal literature and others of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge ” ( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98 , juris Rn. 2 RS ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 LIT mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 LIT mwN ) hat der Strafausspruch Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist ( § 354 Abs. 1a Satz 1 StPO GS ) . ‘

The task will be to build, train and evaluate a model that, given sample sentences, annotates each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature and so on.

NER with bi-LSTM for dummies

We implement a standard deep-learning architecture for NER — a bi-directional recurrent neural network — which works as follows:

1. Each sentence is split into a sequence of tokens, and each token is represented by a word vector. These word vectors or embeddings are usually pre-trained on a huge corpus of documents so that they encode semantic information. We thus apply general language proficiency to our special task, a technique known as transfer learning. Common methods for pre-training are word2vec, GloVe or fastText; we use the word vectors provided by spaCy.
2. The model processes the input sequence step by step and maintains an internal memory along the way,
   - reading the corresponding input vector,
   - combining this input with the internal memory,
   - producing an output vector and
   - updating the internal memory
   at each step. This magic is carried out by a long short-term memory (LSTM) cell. As a result, we obtain an output sequence of the same length as the input sequence, and an internal memory state.
3. Going backwards, the model reads the input again and produces a second output sequence.
4. At each position, the outputs of steps 2 and 3 are combined and fed into a classifier which outputs, for the input word at this position, the probability that it should be annotated with the first tag, second tag, and so on.
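The four steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the weights are random and untrained, the sizes (6 tokens, 4-dimensional word vectors, 3 LSTM units, 5 tags) are made up, and the helper names `lstm_pass` and `random_lstm_weights` are hypothetical. Keras implements the same idea far more efficiently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def lstm_pass(inputs, W, U, b):
    """Run one LSTM over a sequence and return the output vector of every step."""
    units = U.shape[0]
    h = np.zeros(units)  # output of the previous step
    c = np.zeros(units)  # internal memory (cell state)
    outputs = []
    for x in inputs:                      # read the input vector of this step
        z = x @ W + h @ U + b             # combine it with the previous output
        i, f, g, o = np.split(z, 4)       # input, forget, candidate and output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the internal memory
        h = sigmoid(o) * np.tanh(c)       # produce the output vector
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq_len, emb_dim, units, n_tags = 6, 4, 3, 5  # made-up sizes

def random_lstm_weights():
    # Untrained weights, for illustration only.
    return (rng.normal(size=(emb_dim, 4 * units)),
            rng.normal(size=(units, 4 * units)),
            np.zeros(4 * units))

sentence = rng.normal(size=(seq_len, emb_dim))                      # step 1: word vectors
forward = lstm_pass(sentence, *random_lstm_weights())               # step 2: left to right
backward = lstm_pass(sentence[::-1], *random_lstm_weights())[::-1]  # step 3: right to left, re-aligned
combined = np.concatenate([forward, backward], axis=-1)             # step 4: stack both outputs
W_out = rng.normal(size=(2 * units, n_tags))
tag_probs = softmax(combined @ W_out)                               # one distribution per token
print(tag_probs.shape)  # (6, 5)
```

Each row of `tag_probs` sums to one, so it can be read as a probability distribution over the tags for the token at that position.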

To improve performance, one can replace the last feed-forward layer by a conditional random field (CRF) model. The resulting architecture is called a bi-LSTM-CRF model.

Setting up the environment

First, set up a virtual environment as described in the preceding blog post, and install the required dependencies:

mkdir keras_ner_project
cd keras_ner_project
python3 -m venv .venv
source .venv/bin/activate
pip install spacy
python -m spacy download de_core_news_md
pip install tensorflow

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Next, download the data as in the preceding blog post (in case you are inside a Jupyter notebook, put an exclamation mark ! in front of each command to have it executed by the shell):

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Step 1: Preprocessing for NER

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
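To see what the BIO annotation encodes, the following sketch groups tagged tokens back into entity spans: a B- tag opens an entity, the I- tags that follow with the same category continue it, and O marks tokens outside any entity. The function `bio_to_spans` is a hypothetical helper written for this illustration; it is not part of the dataset tooling or of the model built in this post.

```python
def bio_to_spans(tagged_tokens):
    """Group (token, BIO tag) pairs into (category, text) entity spans."""
    spans, current_tokens, current_cat = [], [], None
    for token, tag in tagged_tokens:
        prefix, _, category = tag.partition('-')
        # A token continues the current entity only if it is tagged I- with the same category.
        if prefix == 'I' and category == current_cat:
            current_tokens.append(token)
            continue
        # Otherwise close the entity collected so far, if any ...
        if current_tokens:
            spans.append((current_cat, ' '.join(current_tokens)))
        # ... and either start a new one or reset on an O tag.
        if prefix in ('B', 'I'):
            current_tokens, current_cat = [token], category
        else:
            current_tokens, current_cat = [], None
    if current_tokens:
        spans.append((current_cat, ' '.join(current_tokens)))
    return spans

sample = [('an', 'O'), ('Kapitalgesellschaften', 'O'), ('(', 'O'),
          ('§', 'B-GS'), ('17', 'I-GS'), ('Abs.', 'I-GS'), ('1', 'I-GS'),
          ('und', 'I-GS'), ('2', 'I-GS'), ('EStG', 'I-GS'), (')', 'O')]
print(bio_to_spans(sample))  # [('GS', '§ 17 Abs. 1 und 2 EStG')]
```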

We read such a data file line-by-line and store the sentences as lists of token-tag pairs:

def load_data(filename: str):
    # One "token tag" pair per line; blank lines separate the sentences.
    with open(filename, 'r', encoding='utf-8') as file:
        lines = [line.split() for line in file]
    samples, sample = [], []
    for parts in lines + [[]]:  # the empty sentinel flushes the last sentence
        if parts:
            token, tag = parts
            # Drop the B-/I- prefix and keep only the tag category.
            sample.append((token, tag.split('-')[-1]))
        elif sample:
            samples.append(sample)
            sample = []
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
all_samples = train_samples + val_samples

For simplicity, we’ll truncate the sentences to a maximum length and pad shorter input sequences. But first, let us determine the set of all tags in the data and add an extra tag for the padding:

schema = ['_'] + sorted({tag for sentence in all_samples for _, tag in sentence})
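To make truncation and padding concrete, here is a small sketch on toy data. The names `toy_schema`, `toy_max_len` and `encode_tags` are made up for this illustration; as in the schema above, index 0 is reserved for the padding tag '_'.

```python
toy_schema = ['_', 'GS', 'O']  # padding tag first, then the sorted real tags
toy_max_len = 5
toy_tag_index = {tag: index for index, tag in enumerate(toy_schema)}

def encode_tags(tags, max_len=toy_max_len):
    """Map tags to indices, cut off after max_len and fill up with the padding index 0."""
    indices = [toy_tag_index[tag] for tag in tags[:max_len]]
    return indices + [0] * (max_len - len(indices))

print(encode_tags(['O', 'GS', 'GS']))                         # [2, 1, 1, 0, 0]
print(encode_tags(['O', 'O', 'GS', 'GS', 'GS', 'GS', 'O']))   # [2, 2, 1, 1, 1]
```

The short sentence is filled up with the padding index, while the long one loses everything after the fifth token.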

Next, we represent each token by a word vector, using a pre-trained German language model of the NLP library spaCy:

import spacy
import numpy as np

nlp = spacy.load('de_core_news_md')
EMB_DIM = nlp.vocab.vectors_length
MAX_LEN = 50

def preprocess(samples):
    tag_index = {tag: index for index, tag in enumerate(schema)}
    # Zero-initialized arrays: unused positions keep a zero vector and the padding tag (index 0).
    X = np.zeros((len(samples), MAX_LEN, EMB_DIM), dtype=np.float32)
    y = np.zeros((len(samples), MAX_LEN), dtype=np.uint8)
    vocab = nlp.vocab
    for i, sentence in enumerate(samples):
        # Sentences longer than MAX_LEN are truncated.
        for j, (token, tag) in enumerate(sentence[:MAX_LEN]):
            X[i, j] = vocab.get_vector(token)
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)

Now we have the data ready for NER and can assemble our model!

Step 2: Build the bi-LSTM model

With the wide range of layers offered by Keras, we can construct a bi-directional LSTM model as a sequence of two compound layers:

- The bidirectional LSTM layer encapsulates a forward and a backward pass of an LSTM layer, followed by the stacking of the sequences returned by both passes.
- The second layer applies a dense classification layer to every position of the stacked sequences. Here, the softmax activation function scales the output so that we obtain sequences of probability distributions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_model(nr_filters=256):
    input_shape = (MAX_LEN, EMB_DIM)
    # return_sequences=True yields one output vector per token instead of only the last one.
    lstm = LSTM(nr_filters, return_sequences=True)
    bi_lstm = Bidirectional(lstm, input_shape=input_shape)
    # The same dense softmax classifier is applied at every position of the sequence.
    tag_classifier = Dense(len(schema), activation='softmax')
    sequence_labeller = TimeDistributed(tag_classifier)
    return Sequential([bi_lstm, sequence_labeller])

model = build_model()
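To get a feeling for the size of this model, we can count its trainable weights by hand. The sketch below assumes a standard LSTM layer with bias terms; the 300-dimensional word vectors and the 20 tags are assumed example values, so check EMB_DIM and len(schema) in your own setup. The function names are hypothetical. If these assumptions hold, the result should agree with what model.summary() reports.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def bi_lstm_tagger_params(emb_dim, units, n_tags):
    bi_lstm = 2 * lstm_params(emb_dim, units)  # forward and backward pass
    classifier = (2 * units + 1) * n_tags      # dense layer on the stacked outputs, plus bias
    return bi_lstm + classifier

# Assumed values for illustration: 300-dimensional vectors, 256 filters, 20 tags.
print(bi_lstm_tagger_params(emb_dim=300, units=256, n_tags=20))
```

Almost all weights sit in the two LSTM passes, which is why the number of filters is the main lever for the capacity of the model.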

For more complex architectures involving multiple inputs or outputs, residual connections or the like, Keras offers a more flexible functional API. With this, we can create directed acyclic graphs of tensors connected by applications of layers, and specify a model in terms of its input and output tensors.

Step 3: Train the model

To train a model means to optimize its weights or parameters on data so that the model's predictions approximate the truth. For Keras to perform this optimization, we need to specify

- how to measure the distance of the prediction to the truth, that is, a loss function,
- the optimization strategy, which is a variant of batch-wise gradient descent.

Additionally, we can specify metrics to monitor the training progress. Once this has been done using the compile method, we can call the fit method for training:

def train(model, epochs=10, batch_size=32):
    model.compile(optimizer='Adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # Hold out 20% of the training data to monitor the validation loss during training.
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=epochs,
                        batch_size=batch_size)
    return history.history

history = train(model)

Keras provides implementations of all the standard optimizers, loss functions and metrics, and also allows us to supply our own.
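To illustrate what the loss used above measures, here is a simplified pure-Python sketch of sparse categorical cross-entropy: the mean negative log-probability that the model assigns to the true tag. The function name `mean_cross_entropy` and the toy numbers are made up, and the Keras implementation additionally takes care of numerical stability.

```python
import math

def mean_cross_entropy(true_indices, predicted_probs):
    """Mean negative log-probability assigned to the true tag at each position."""
    losses = [-math.log(probs[index])
              for index, probs in zip(true_indices, predicted_probs)]
    return sum(losses) / len(losses)

y_true = [2, 1]                                         # true tag index per token
confident = [[0.05, 0.05, 0.90], [0.10, 0.80, 0.10]]    # mass on the true tags
unsure = [[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]]       # mass spread out
print(mean_cross_entropy(y_true, confident))
print(mean_cross_entropy(y_true, unsure))
```

The confident and correct prediction yields the smaller loss, which is the quantity that gradient descent pushes down during training.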

The training history contains the losses and metrics achieved on the training and validation data after each epoch. Here, I got the following result:

Note the scale on the y-axis, but don’t get excited by accuracies of 99%: almost all tokens are labelled with the trivial tag O, and hence accuracy does not tell us much about the detection of the non-trivial tags.
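A toy calculation shows why accuracy is misleading here. The numbers are made up: 19 of 20 tokens carry the tag O, and the "model" never detects anything.

```python
# Toy tag sequence: 19 of 20 tokens carry the trivial tag 'O' (made-up numbers).
true_tags = ['O'] * 19 + ['GS']
pred_tags = ['O'] * 20  # a "model" that never detects anything

accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
found = sum(t == p == 'GS' for t, p in zip(true_tags, pred_tags))
recall_gs = found / true_tags.count('GS')
print(accuracy, recall_gs)  # 0.95 0.0
```

The accuracy looks respectable although not a single entity token was found.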

Step 4: Evaluate the model

To assess the performance of the model, we apply it to the preprocessed validation data and obtain a tensor of the shape (len(val_samples), MAX_LEN, len(schema)). This tensor contains, for each sample sentence and each token in this sentence, a predicted probability distribution over the tags. We choose the tag with highest probability and return, for each sentence and each token, the true and the predicted tag:

def predict(model):
    y_probs = model.predict(X_val)
    # Pick the most probable tag index at every position.
    y_pred = np.argmax(y_probs, axis=-1)
    # zip stops at the shorter sequence, so padded positions are left out.
    return [
        [(token, tag, schema[index]) for (token, tag), index in zip(sentence, tag_pred)]
        for sentence, tag_pred in zip(val_samples, y_pred)
    ]

predictions = predict(model)

Finally, we compute precision, recall and f1-score on the level of tag categories using scikit-learn’s classification_report:

import pandas as pd
from sklearn.metrics import classification_report

def evaluate(predictions):
    y_t = [pos[1] for sentence in predictions for pos in sentence]
    y_p = [pos[2] for sentence in predictions for pos in sentence]
    report = classification_report(y_t, y_p, output_dict=True)
    return pd.DataFrame.from_dict(report).transpose().reset_index()

evaluate(predictions)
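For readers who want to see what the report computes, here is a minimal pure-Python sketch of precision, recall and f1-score for a single tag. It works on (token, true tag, predicted tag) triples like those in the flattened predictions; the function `tag_scores` and the toy triples are made up for illustration and do not replace classification_report.

```python
from collections import Counter

def tag_scores(triples, tag):
    """Precision, recall and F1 for one tag from (token, true_tag, predicted_tag) triples."""
    counts = Counter()
    for _, true_tag, pred_tag in triples:
        if pred_tag == tag and true_tag == tag:
            counts['tp'] += 1   # correctly detected
        elif pred_tag == tag:
            counts['fp'] += 1   # predicted, but wrong
        elif true_tag == tag:
            counts['fn'] += 1   # missed
    tp, fp, fn = counts['tp'], counts['fp'], counts['fn']
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy = [('§', 'GS', 'GS'), ('17', 'GS', 'GS'), ('EStG', 'GS', 'O'),
       ('BGH', 'RS', 'GS'), (')', 'O', 'O')]
print(tag_scores(toy, 'GS'))
```

In the toy data, two of the three tokens predicted as GS are correct and two of the three true GS tokens are found, so precision, recall and f1-score all equal 2/3.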

Training a model with 1024 filters for 10 epochs, we reach the following scores (f1-score, precision and recall in percent):

tag	f1-score	precision	recall	support
EUN	56.9	67.0	49.5	398
GRT	65.9	91.0	51.6	643
GS	94.5	96.1	92.9	6774
INN	41.3	88.9	26.9	119
LD	74.0	67.0	82.6	86
LDS	0.0	0.0	0.0	9
LIT	79.5	74.3	85.4	1681
MRK	0.0	0.0	0.0	49
ORG	25.3	32.4	20.8	159
PER	0.0	0.0	0.0	473
RR	92.0	94.4	89.8	560
RS	90.7	97.1	85.0	8380
ST	71.9	93.9	58.2	79
STR	0.0	0.0	0.0	35
UN	32.7	64.9	21.8	110
VO	2.2	4.0	1.5	66
VS	0.0	0.0	0.0	10
VT	18.0	11.7	38.9	144

Let’s see how this compares to the results achieved with spaCy:

It seems that our hand-built NER model does very well! But beware that these experiments do not show a winner: neither of the two approaches has been optimized, and we compared neither training time nor the compute resources used. The main differentiating factor is that

- spaCy can be used out-of-the-box with no understanding of deep learning,
- the approach presented here is much more flexible and tuneable (see below).

What next?

With the deep learning library Keras, building and training our custom NER model took just a few lines, but setting up the data and the training required much more understanding than the command-line approach with spaCy.

To improve performance, we could try to tune the model and

- increase the number of filters, that is, the size of the LSTM cell,
- stack several bidirectional layers on top of each other,
- replace the time-distributed classification layer with a conditional random field (CRF) model or
- address the imbalance of the tag distribution with a focal loss instead of categorical cross-entropy.
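To illustrate the last idea: a focal loss scales the cross-entropy of each token by a factor that shrinks as the predicted probability of the true tag grows, so that the many easy O tokens contribute less. The sketch below shows the per-token formula in its basic form without class weights; the function name `focal_loss` is made up, and gamma=2.0 is just an example value for the focusing parameter, not a tuned setting.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one token, given the probability predicted for its true tag."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# Compare plain cross-entropy and focal loss for an easy, a medium and a hard token.
for p in (0.99, 0.6, 0.1):
    print(p, -math.log(p), focal_loss(p))
```

With gamma set to 0, the factor equals one and the expression reduces to the ordinary cross-entropy.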

But to achieve a significant boost, we need to provide our model with more input by

- labeling more task-specific training data or
- applying more task-independent language proficiency to our task.

In an upcoming blog post, we shall fine-tune a pre-trained NLP transformer model to our NER task and get state-of-the-art performance.

Stay tuned!

Blog author

Thomas Timmermann
