NER @ CLI: Custom-named entity recognition with spaCy in four lines

6.11.2020 | 8 minutes reading time

Named entity recognition is a technical term for a solution to a key automation problem: extraction of information from text. Applications include

automation of business processes involving documents
distillation of data from the web by scraping websites
indexing document collections for scientific, investigative, or economic purposes

Some cases can be treated by classical approaches, for example:

forms with a fixed structure can be handled by layout-based rules
entities with a fixed pattern, such as phone numbers, can be extracted using regular expressions
occurrences of known entities like invoice numbers or customer names can be detected by matching against a database
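
As a small illustration of the regular-expression approach, the following sketch extracts a court file number from a sentence fragment. The pattern is an assumption made for this example and covers only this one simple shape (senate number, register abbreviation, number/two-digit year).

```python
import re

# Illustrative pattern (an assumption, not part of the dataset tooling):
# a senate number, a register abbreviation, and "number/two-digit year".
CASE_NUMBER = re.compile(r"\b\d+\s+[A-Z][A-Za-z]+\s+\d+/\d{2}\b")

text = "vgl. BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2"
matches = CASE_NUMBER.findall(text)
print(matches)  # ['5 StR 705/98']
```

The file number itself is easy to catch, but the complete reference around it (court, type of decision, date, citation) varies too much for a single pattern.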

But when more flexibility is needed, named entity recognition (NER) may be just the right tool for the task. In a sequence of blog posts, we will explain and compare three approaches to extract references to laws and verdicts from court decisions:

First, we use the popular NLP library spaCy and train a custom NER model on the command line with no fuss.
Next, we build a bidirectional word-level LSTM model by hand with TensorFlow & Keras.
Finally, we fine-tune a pre-trained BERT model using huggingface transformers for state-of-the-art performance on the task.

This post introduces the dataset and task and covers the command line approach using spaCy.

Our dataset and task

The dataset for our task was presented by E. Leitner, G. Rehm and J. Moreno-Schneider in the paper "Fine-grained Named Entity Recognition in Legal Documents" and can be found on GitHub. It consists of decisions from several German federal courts with annotations of entities referring to legal norms, court decisions, legal literature, and others of the following form:

The entire dataset comprises 66,723 sentences. We pick

court decisions of the Federal Labour Court (BAG) for training and
court decisions of the Federal Court of Justice (BGH) for validation.

The following histograms show the distribution of sentence lengths and token annotations for this slice, where ‘O’ denotes the “empty” annotation:

The NER task we want to solve is, given sample sentences, to annotate each token of each sentence with a tag that indicates whether the token is part of a reference to a legal norm, court decision, legal literature, and so on. Put differently, this is a sequence-labeling task in which we classify each token as belonging to exactly one annotation class or to none.

Enter the NLP library spaCy

The Python library spaCy provides “industrial-strength natural language processing” covering

15 languages with small-, medium- or large-scale language models
the full NLP pipeline, from tokenization through word embeddings to part-of-speech tagging and parsing
many NLP tasks like classification, similarity estimation or named entity recognition

We now show how to use it for our NER task with no knowledge of deep learning or NLP.

Get your keyboard ready!

Step 0: Setup

To experiment along, you need Python 3. Fire up a terminal to work on the command line, create a folder for this experiment, switch to this folder and create and activate a virtual environment with

python3 -m venv .venv
source .venv/bin/activate

If you are on Windows, switch to the Subsystem for Linux or replace the last line with

.venv\Scripts\activate.bat

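Next, install spaCy and download the medium-sized German language model. The commands in this post use the command line interface of spaCy 2.x, which changed substantially in spaCy 3, so we pin the version accordingly:

pip install "spacy<3"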
python -m spacy download de_core_news_md

Step 1: Get the NER data ready

The dataset is hosted on GitHub and contained in one zip file which we download and unzip:

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Each of the unzipped files contains sample sentences from one court. The sentences come as paragraphs separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
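
To make the BIO scheme concrete: a B- tag opens an entity, I- tags of the same class continue it, and O marks tokens outside any entity. The following minimal sketch groups tagged tokens into entity spans. The function name bio_to_spans is our own choice for this example, and stray I- tags without a preceding B- tag are simply dropped.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == current_label
        if current_label is not None and not continues:
            # The running entity ends on "O", on a new "B-" tag,
            # or on an "I-" tag of a different class.
            spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if starts_new:
            current_label, current_tokens = tag[2:], [token]
        elif continues:
            current_tokens.append(token)
    if current_label is not None:
        # Flush an entity that runs until the end of the sentence.
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# The sample sentence fragment shown above, one "token tag" pair per line.
sample = """an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O"""

tokens, tags = zip(*(line.split() for line in sample.splitlines()))
spans = bio_to_spans(tokens, tags)
print(spans)  # [('GS', '§ 17 Abs. 1 und 2 EStG')]
```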

We simply use

file data/01_raw/bag.conll for training
file data/01_raw/bgh.conll for validation,

and convert these files into the format required by spaCy:

python -m spacy convert --converter ner data/01_raw/bag.conll data/02_train
python -m spacy convert --converter ner data/01_raw/bgh.conll data/03_val

Along the way, we obtain some status information:

To check for potential problems before training, we check the data with spaCy’s debug-data tool:

python -m spacy debug-data de data/02_train data/03_val -p ner -b de_core_news_md

which produces the following output:

As we have seen before, some tags occur extremely rarely, so we can’t expect the model to learn them very well. Moreover, we see that the language model knows almost all words occurring in the dataset, which may come as a surprise.
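
The rarity of some tags can also be checked directly on the raw files. The following is a minimal sketch, assuming the one-token-per-line format with blank lines between sentences described above. The function name conll_stats and the inline toy sample are illustrative and not taken from the dataset.

```python
from collections import Counter


def conll_stats(text):
    """Return (sentence lengths, number of entities per class)."""
    lengths, entity_counts = [], Counter()
    n_tokens = 0
    # The extra "" at the end flushes the last sentence.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if n_tokens:
                lengths.append(n_tokens)
            n_tokens = 0
            continue
        tag = line.split()[-1]  # the annotation is the last column
        n_tokens += 1
        if tag.startswith("B-"):
            # Count each entity once, at its first token.
            entity_counts[tag[2:]] += 1
    return lengths, entity_counts


# Toy sample with two short "sentences" (made up for illustration).
toy = """an O
( O
§ B-GS
17 I-GS
) O

BGH B-RS
Weber B-LIT
, I-LIT
BtMG I-LIT
"""

lengths, entity_counts = conll_stats(toy)
print(lengths)
print(entity_counts.most_common())
```

For the real data, one would pass the contents of data/01_raw/bag.conll instead of the toy string, for example via open("data/01_raw/bag.conll", encoding="utf-8").read().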

Step 2: Train the NER model

To obtain a custom model for our NER task, we use spaCy’s train tool as follows:

python -m spacy train de data/04_models/md data/02_train data/03_val \
    --base-model de_core_news_md --pipeline 'ner' -R -n 20

which tells spaCy to train a new model

for the German language whose code is de
saving the trained model in data/04_models/md
using the training and validation data in data/02_train and data/03_val, respectively,
starting from the base model de_core_news_md
where the task to be trained is ner — named entity recognition
replacing the standard named entity recognition component via -R
using 20 epochs, that is, 20 runs over the entire training data.

Depending on your system, training may take anywhere from several minutes to a few hours. If you have an NVIDIA GPU with CUDA set up, you can try to speed up the training; see spaCy’s installation and training instructions.

To track the progress, spaCy displays a table showing the loss (NER loss), precision (NER P), recall (NER R) and F1-score (NER F) reached after each epoch:

Itn	NER Loss	NER P	NER R	NER F	Token %	CPU WPS
1	26507.803	64.209	51.197	56.970	100.000	34947
2	14681.514	67.480	57.931	62.342	100.000	39232
3	10907.758	68.239	59.384	63.504	100.000	42043

At the end, spaCy tells you that it stored the last and the best model version in data/04_models/md/model-final and data/04_models/md/model-best, respectively. To check the performance of the model after training, we evaluate it on the validation data:

python -m spacy evaluate data/04_models/md/model-best data/03_val

This outputs the precision, recall and F1-score for the NER task again (NER P, NER R, NER F):

Time	Words	Words/s	TOK	POS	UAS	LAS	NER P	NER R	NER F	Textcat
4.37	177835	40663	100.00	0.00	0.00	0.00	70.15	60.09	64.73	0.00
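
As a quick sanity check on these numbers, the F1-score is the harmonic mean of precision and recall. A minimal sketch, with the helper name f1_score chosen for this example:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (here both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NER P and NER R as reported by the evaluate command above.
f1 = f1_score(70.15, 60.09)
print(round(f1, 2))  # 64.73
```

The result matches the NER F column, so the reported score is indeed derived from the overall precision and recall.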

The overall performance looks moderate. For better results, one could use

the large language model de_core_news_lg
more training steps
more training data (we only used a subset of the dataset).

As an example, training the large model for 40 epochs yields the following scores:

Time	Words	Words/s	TOK	POS	UAS	LAS	NER P	NER R	NER F	Textcat
4.52	177835	39339	100.00	0.00	0.00	0.00	73.72	64.39	68.74	0.00

Apparently, the problem is not the model, but the data: some tag categories appear very rarely, so it’s hard for the model to learn them. For a more thorough evaluation, we need to see the scores for each tag category.

Step 3: Use the model for named entity recognition

To use our new model and to see how it performs on each annotation class, we need to use the Python API of spaCy. To experiment along, activate the virtual environment again, install Jupyter and start a notebook with

pip install jupyter
jupyter notebook spacy_ner.ipynb

If it did not open by itself, open a web browser pointing to the URL output by the last command, and enter the following Python code blocks in code cells to work along.

Let us load the best-trained model version:

import spacy
MODEL_PATH = 'data/04_models/md/model-best'
nlp = spacy.load(MODEL_PATH)

It can be applied to detect entities in new text as follows:

sample = """Trotz der zweifelhaften Bewertung von MDMA als "harte Droge"
( vgl. BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 ,
juris Rn. 2 ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer
, BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 mwN ; Weber , BtMG ,
5. Aufl. , § 1 Rn. 364 mwN ) hat der Strafausspruch Bestand ,
da die verhängte Rechtsfolge jedenfalls angemessen ist 
(§ 354 Abs. 1a Satz 1 StPO) ."""

doc = nlp(sample)

for ent in doc.ents:
    print(ent.label_, ':', ent.text)

The output looks as follows:

RS : BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2
LIT : Patzak in Körner / Patzak / Volkmer , BtMG , 8.
GS : § 29 ff
GS : Rn
LIT : Weber , BtMG , 5.
GS : § 1 Rn
GS : § 354 Abs. 1a Satz 1 StPO
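
For downstream use, it is often handy to collect the detected entities per label. The following sketch does this for any sequence of objects with label_ and text attributes. The Ent namedtuple is a stand-in for spaCy's entity spans so that the snippet runs without the trained model, and group_by_label is a name chosen for this example.

```python
from collections import defaultdict, namedtuple

# Stand-in for spaCy's entity spans (assumption: only label_ and text are used).
Ent = namedtuple("Ent", ["label_", "text"])


def group_by_label(ents):
    """Collect entity texts per label, keeping the order of appearance."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Three of the entities from the output above.
ents = [
    Ent("RS", "BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2"),
    Ent("GS", "§ 29 ff"),
    Ent("GS", "§ 354 Abs. 1a Satz 1 StPO"),
]
grouped = group_by_label(ents)
for label, texts in grouped.items():
    print(label, texts)
```

In the notebook, calling group_by_label(doc.ents) should work in the same way.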

Step 4: Evaluate the model

To obtain scores for the model on the level of annotation classes, we continue to work in the Jupyter notebook and load the validation data:

from spacy.gold import GoldCorpus

VAL_FILENAME = 'data/03_val/bgh.json'

val_corpus = GoldCorpus(VAL_FILENAME, VAL_FILENAME)
docs_golds = list(val_corpus.train_docs(nlp))
docs, golds = zip(*docs_golds)

To apply our model to these documents, we need to use only the NER component of the model’s NLP pipeline:

# nlp.pipeline is a list of (name, component) pairs; the NER component comes first here
ner = nlp.pipeline[0][1]
predictions = list(ner.pipe(docs))

Finally, we can evaluate the performance using the Scorer class. Along the way, we count how often each tag occurred:

from spacy.scorer import Scorer
from collections import Counter

tag_counts = Counter()
scorer = Scorer()
for y_p, y_t in zip(predictions, golds):
    scorer.score(y_p, y_t)
    for tag in y_t.ner:
        tag_counts[tag.split('-')[-1]] += 1
print(scorer.ents_p, scorer.ents_r, scorer.ents_f)

These are the same scores that we obtained by validating on the command line. Additionally, the ents_per_type attribute of scorer gives us access to the tag-level scores. With pandas installed (pip install pandas), we can put these scores in a table as follows:

import pandas as pd

scores = (pd.DataFrame.from_dict(scorer.ents_per_type, orient='index')
                      .join(pd.Series(tag_counts, name='support'))
                      .sort_values(by='support', ascending=False))
scores
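
To make the columns p, r and f of such a table concrete, here is a sketch of exact-match scoring per entity type in plain Python. It reflects our understanding of entity-level evaluation (a predicted entity counts as correct only if start, end and label all agree with a gold entity). spaCy's Scorer may differ in details, and the function name and toy spans are illustrative.

```python
from collections import Counter


def per_type_scores(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 (in percent) per entity label.

    Each span is a (start, end, label) tuple within one document.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = Counter(label for _, _, label in gold & pred)
    n_gold = Counter(label for _, _, label in gold)
    n_pred = Counter(label for _, _, label in pred)
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = 100 * true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        r = 100 * true_pos[label] / n_gold[label] if n_gold[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores


# Toy example: one GS entity found exactly, one RS entity with a wrong
# boundary, one LIT entity missed, and one spurious GS prediction.
gold = [(3, 10, "GS"), (12, 15, "RS"), (20, 22, "LIT")]
pred = [(3, 10, "GS"), (12, 14, "RS"), (30, 31, "GS")]
toy_scores = per_type_scores(gold, pred)
for label, s in toy_scores.items():
    print(label, round(s["p"], 2), round(s["r"], 2), round(s["f"], 2))
```

Note how a boundary error counts against both precision and recall, which is one reason why long references are hard to score well on.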

For the medium model trained over 20 epochs, we obtain the following result:

tag	p	r	f	support
RS	62.77	63.34	63.06	18615
GS	84.93	84.93	84.93	7640
LIT	73.70	83.82	78.44	4685
GRT	67.88	32.40	43.86	662
RR	94.37	81.03	87.19	560
EUN	14.28	7.81	10.10	540
PER	25.00	1.62	3.05	494
ORG	32.25	28.57	30.30	176
VT	4.86	29.16	8.33	150
INN	33.33	8.00	12.90	124
UN	47.61	16.39	24.39	122
LD	36.87	65.82	47.27	95
ST	28.12	11.25	16.07	85
VO	0.00	0.00	0.00	81
MRK	0.00	0.00	0.00	58
AN	50.00	1.92	3.70	57
STR	0.00	0.00	0.00	35
LDS	33.33	10.00	15.38	10
VS	0.00	0.00	0.00	10

This gives a much clearer picture. Plotting the F1-Score (f) versus the number of tokens with this tag shows a correlation between poor performance and shortage of training data:
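
As a rough numeric check of this relationship, we can compute the Pearson correlation between the support and the F1-score from the table above. Using the logarithm of the support is our own choice here, because the counts span several orders of magnitude.

```python
import math

# (support, F1) pairs copied from the table above.
rows = [
    (18615, 63.06), (7640, 84.93), (4685, 78.44), (662, 43.86),
    (560, 87.19), (540, 10.10), (494, 3.05), (176, 30.30),
    (150, 8.33), (124, 12.90), (122, 24.39), (95, 47.27),
    (85, 16.07), (81, 0.00), (58, 0.00), (57, 3.70),
    (35, 0.00), (10, 15.38), (10, 0.00),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


log_support = [math.log10(support) for support, _ in rows]
f_scores = [f for _, f in rows]
r = pearson(log_support, f_scores)
print(f"Pearson correlation of log10(support) and F1: {r:.2f}")
```

A clearly positive value supports the reading that tags with little data are the ones the model handles worst, although with only 19 classes this is an indication rather than a proof.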

What next?

We’ve seen that spaCy allows us to train a model for extracting information from text with just a few commands on the command line and with no knowledge of deep learning or NLP. The options to improve performance and to adjust the model to our needs are, however, limited. In the two following posts, we shall do better and

train a standard bi-directional LSTM model by hand, using TensorFlow & Keras
train state-of-the-art transformer models using huggingface transformers.

Stay tuned!

Blog author

Thomas Timmermann
