Take control of named entity recognition with your own Keras model!

13.11.2020 | 7 minutes of reading time

This post shows how to extract information from text documents with the high-level deep learning library Keras: we build, train and evaluate a bidirectional LSTM model by hand for a custom named entity recognition (NER) task on legal texts.

In a previous post, we solved the same NER task on the command line with the NLP library spaCy. The present approach requires some work and knowledge, but yields a much more flexible solution that we can tune, scale and modify to our needs.

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in

Fine-grained Named Entity Recognition in Legal Documents.

It consists of decisions from several German federal courts with annotations of named entities referring to legal norms, court decisions, legal literature and others, of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge ” ( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98
, juris Rn. 2 RS
; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 LIT mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 LIT mwN ) hat der Strafausspruch
Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist ( § 354 Abs. 1a Satz 1 StPO GS ) . ‘

The task will be to build, train and evaluate a model that, given sample sentences, annotates each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature and so on.

NER with bi-LSTM for dummies

We implement a standard deep-learning architecture for NER — a bi-directional recurrent neural network — which works as follows:

  1. Each sentence is split into a sequence of tokens and each token is represented by a word vector. These word vectors or embeddings are usually pre-trained on a huge corpus of documents so that they encode semantic information. We thus apply general language proficiency to our special task, a technique known as transfer learning. Common methods for pre-training are word2vec, GloVe or fastText; we use the word vectors provided by spaCy (see the short sketch after this list).
  2. The model processes the input sequence step by step and maintains an internal memory along the way,
    • reading the corresponding input vector,
    • combining this input with the internal memory,
    • producing an output vector and
    • updating the internal memory

    at each step. This magic is carried out by a long short-term memory (LSTM) cell. As a result, we obtain an output sequence of the same length as the input sequence, and an internal memory state.

  3. Going backwards, the model reads the input again and produces a second output sequence.
  4. At each position, the outputs of steps 2 and 3 are combined and fed into a classifier which outputs, for the input word at this position, the probability that it should be annotated with the first tag, the second tag, and so on.
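To get a feeling for such pre-trained embeddings, here is a minimal sketch using the German spaCy model de_core_news_md that we install in the next section; the example tokens 'Gericht' and 'Urteil' are arbitrary and chosen only for illustration:

import numpy as np
import spacy

# Load the pre-trained German model; its vocabulary ships with word vectors.
nlp = spacy.load('de_core_news_md')

vec_gericht = nlp.vocab.get_vector('Gericht')   # zero vector if the token is out of vocabulary
vec_urteil = nlp.vocab.get_vector('Urteil')
print(vec_gericht.shape)                        # dimensionality of the embeddings

# Cosine similarity: semantically related words tend to score higher.
similarity = np.dot(vec_gericht, vec_urteil) / (
    np.linalg.norm(vec_gericht) * np.linalg.norm(vec_urteil))
print(similarity)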

To improve performance, one can replace the last feed-forward layer by a conditional random field (CRF) model. The resulting architecture is called a bi-LSTM-CRF model.

Setting up the environment

First, set up a virtual environment as described in the preceding blog post, and install the required dependencies:

mkdir keras_ner_project
cd keras_ner_project
python3 -m venv .venv
source .venv/bin/activate
pip install spacy
python -m spacy download de_core_news_md
pip install tensorflow

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Next, download the data as in the preceding blog post (if you are inside a Jupyter notebook, put an exclamation mark ! in front of each command to have it executed by the shell):

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Step 1: Preprocessing for NER

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O

We read such a data file line by line and store the sentences as lists of token-tag pairs:

def load_data(filename: str):
    with open(filename, 'r') as file:
        lines = [line[:-1].split() for line in file]
    samples, start = [], 0
    for end, parts in enumerate(lines):
        if not parts:
            # A blank line ends a sentence; keep only the coarse tag (drop the B-/I- prefix).
            sample = [(token, tag.split('-')[-1]) for token, tag in lines[start:end]]
            samples.append(sample)
            start = end + 1
    if start < len(lines):
        # Handle a trailing sentence that is not followed by a blank line.
        samples.append([(token, tag.split('-')[-1]) for token, tag in lines[start:]])
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
all_samples = train_samples + val_samples
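As a quick sanity check (just a sketch; the exact counts depend on the downloaded files), we can look at the number of sentences and the structure of the first training sample:

print(len(train_samples), len(val_samples))
print(train_samples[0][:5])   # first five (token, tag) pairs of the first sentence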

For simplicity, we’ll truncate the sentences to a maximum length and pad shorter input sequences. But first, let us determine the set of all tags in the data and add an extra tag for the padding:

schema = ['_'] + sorted({tag for sentence in all_samples for _, tag in sentence})
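The resulting schema is a sorted list of coarse tag names with the padding tag '_' at position 0. A small sketch of how it maps tags to the integer labels used below (the exact tag set depends on the data):

print(schema)                 # e.g. ['_', 'EUN', 'GRT', 'GS', ...]
tag_index = {tag: index for index, tag in enumerate(schema)}
print(tag_index['GS'])        # integer label for references to statutes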

Next, we represent each token by a word vector, using a pre-trained German language model of the NLP library spaCy:

import spacy
import numpy as np

nlp = spacy.load('de_core_news_md')
EMB_DIM = nlp.vocab.vectors_length
MAX_LEN = 50

def preprocess(samples):
    tag_index = {tag: index for index, tag in enumerate(schema)}
    X = np.zeros((len(samples), MAX_LEN, EMB_DIM), dtype=np.float32)
    y = np.zeros((len(samples), MAX_LEN), dtype=np.uint8)
    vocab = nlp.vocab
    for i, sentence in enumerate(samples):
        for j, (token, tag) in enumerate(sentence[:MAX_LEN]):
            X[i, j] = vocab.get_vector(token)
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
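A quick shape check (sketch) shows what the model will consume: one matrix of MAX_LEN word vectors per sentence and one integer tag sequence per sentence:

print(X_train.shape)   # (number of training sentences, MAX_LEN, EMB_DIM)
print(y_train.shape)   # (number of training sentences, MAX_LEN)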

Now we have the data ready for NER and can assemble our model!

Step 2: Build the bi-LSTM model

With the wide range of layers offered by Keras, we can construct a bi-directional LSTM model as a sequence of two compound layers:

  1. The bidirectional LSTM layer encapsulates a forward and a backward pass of an LSTM layer, followed by the stacking of the sequences returned by both passes.
  2. The second layer applies a dense classification layer to every position of the stacked sequences. Here, the softmax activation function scales the output so that we obtain sequences of probability distributions:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_model(nr_filters=256):
    input_shape = (MAX_LEN, EMB_DIM)
    lstm = LSTM(nr_filters, return_sequences=True)
    bi_lstm = Bidirectional(lstm, input_shape=input_shape)
    tag_classifier = Dense(len(schema), activation='softmax')
    sequence_labeller = TimeDistributed(tag_classifier)
    return Sequential([bi_lstm, sequence_labeller])

model = build_model()
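To double-check the architecture before training, we can print the Keras summary (a quick sketch; the parameter counts depend on nr_filters and the embedding dimension):

model.summary()   # one Bidirectional layer with output shape (None, MAX_LEN, 2 * nr_filters)
                  # and one TimeDistributed classifier with output shape (None, MAX_LEN, len(schema))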
Step 3: Train the model

Next, we compile the model with the sparse categorical cross-entropy loss (our targets are integer tag indices) and train it on the preprocessed data:

def train(model, epochs=10, batch_size=32):
    model.compile(optimizer='Adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=epochs,
                        batch_size=batch_size)
    return history.history

history = train(model)
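To see whether the model over- or underfits, we can plot the accuracy curves from the returned history dictionary. A minimal sketch using matplotlib (not among the dependencies installed above, so pip install matplotlib first):

import matplotlib.pyplot as plt

def plot_history(history):
    # 'accuracy' and 'val_accuracy' are the keys Keras records for the metric configured above.
    plt.plot(history['accuracy'], label='training accuracy')
    plt.plot(history['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

plot_history(history)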
Step 4: Predict and evaluate

After training, we let the model predict tag probabilities for the validation sentences, pick the most probable tag at each position, and pair it with the token and its true tag:

def predict(model):
    y_probs = model.predict(X_val)
    y_pred = np.argmax(y_probs, axis=-1)
    return [
        [(token, tag, schema[index]) for (token, tag), index in zip(sentence, tag_pred)]
        for sentence, tag_pred in zip(val_samples, y_pred)
    ]

predictions = predict(model)
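Each entry of predictions is one validation sentence as a list of (token, true tag, predicted tag) triples, so we can eyeball individual sentences. A small sketch:

for token, true_tag, predicted_tag in predictions[0]:
    print(f'{token:20} {true_tag:5} {predicted_tag}')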

Finally, we compute precision, recall and f1-score on the level of tag categories using scikit-learn's classification_report:

import pandas as pd
from sklearn.metrics import classification_report

def evaluate(predictions):
    y_t = [pos[1] for sentence in predictions for pos in sentence]
    y_p = [pos[2] for sentence in predictions for pos in sentence]
    report = classification_report(y_t, y_p, output_dict=True)
    return pd.DataFrame.from_dict(report).transpose().reset_index()

evaluate(predictions)
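Since the report also contains the majority class O and the aggregate rows, a small sketch that keeps only the entity tags and sorts them by f1-score can make the numbers easier to compare (the column names are those produced by classification_report):

report = evaluate(predictions)
entity_rows = report[~report['index'].isin(['O', 'accuracy', 'macro avg', 'weighted avg'])]
print(entity_rows.sort_values('f1-score', ascending=False))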

Training a model with 1024 filters for 10 epochs, we reach the following scores:

tag   f1-score  precision  recall  support
EUN       56.9       67.0    49.5      398
GRT       65.9       91.0    51.6      643
GS        94.5       96.1    92.9     6774
INN       41.3       88.9    26.9      119
LD        74.0       67.0    82.6       86
LDS        0.0        0.0     0.0        9
LIT       79.5       74.3    85.4     1681
MRK        0.0        0.0     0.0       49
ORG       25.3       32.4    20.8      159
PER        0.0        0.0     0.0      473
RR        92.0       94.4    89.8      560
RS        90.7       97.1    85.0     8380
ST        71.9       93.9    58.2       79
STR        0.0        0.0     0.0       35
UN        32.7       64.9    21.8      110
VO         2.2        4.0     1.5       66
VS         0.0        0.0     0.0       10
VT        18.0       11.7    38.9      144

Let’s see how this compares to the results achieved with spaCy:

It seems that our hand-built NER model does very well! But beware that these experiments do not show a winner: neither of the two approaches has been optimized, and we compared neither training time nor compute resources. The main differentiating factors are that

  • spaCy can be used out of the box with no understanding of deep learning, whereas
  • the approach presented here is much more flexible and tuneable (see below).

What next?

With the deep learning library Keras, building and training our custom NER model took just a few lines of code, but setting up the data and the training required much more understanding than the command-line approach with spaCy.

To improve performance, we could try to tune the model and

  • increase the number of filters, that is, the size of the LSTM cell,
  • stack several bidirectional layers on top of each other (see the sketch after this list),
  • replace the time-distributed classification layer with a conditional random field (CRF) model or
  • address the imbalance of the tag distribution with a focal loss instead of categorical cross-entropy.
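As an illustration of the first two points, here is a hedged sketch of a deeper variant of build_model with two stacked bidirectional layers and more filters; the hyperparameters are illustrative and untuned:

def build_deeper_model(nr_filters=1024):
    # Both bidirectional layers must return full sequences so that the
    # time-distributed classifier still sees one vector per token.
    return Sequential([
        Bidirectional(LSTM(nr_filters, return_sequences=True),
                      input_shape=(MAX_LEN, EMB_DIM)),
        Bidirectional(LSTM(nr_filters, return_sequences=True)),
        TimeDistributed(Dense(len(schema), activation='softmax')),
    ])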

But to achieve a significant boost, we need to provide our model with more input by

  1. labeling more task-specific training data or
  2. applying more task-independent language proficiency to our task.

In the next blog post, we will fine-tune a pre-trained NLP transformer model for our NER task and get state-of-the-art performance.

Stay tuned!
