NER with little data? Transformers to the rescue!

14.12.2020 | 8 minutes reading time

How do you solve deep learning problems with too little labelled data? The answer, of course, is transfer learning. In this post, we will apply this concept to named entity recognition (NER) and

fine-tune a pre-trained BERT to extract information from legal texts,
encounter a token misalignment problem due to BERT’s preference for sub-word tokens, and
observe tremendous improvements on difficult classes compared to the hand-made bi-LSTM model of our previous posts.

Let’s get started!

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in "Fine-grained Named Entity Recognition in Legal Documents". It consists of German court decisions with annotations of entities referring to legal norms, court decisions, legal literature and so on, of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge “( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98 , juris Rn. 2 [RS]; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 [LIT] mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 [LIT] mwN ) hat der Strafausspruch Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist (§ 354 Abs. 1a Satz 1 StPO [GS]) . ‘

The task for our model is to annotate each word of a given sample sentence with a tag that indicates whether the word is part of a reference to a legal norm, a court decision and so on. For more details, see the first post of this series.

The transformer revolution

In case you haven’t read about transformers, here’s a summary. For details on the original transformer architecture, see the original paper or one of the many blog posts on the topic.

Transformers transformed natural language processing (NLP) with

a revolutionary attention mechanism that replaces convolutional or recurrent architectures,
a shift in transfer learning from pre-training (word vectors) for feature extraction to training generic language models plus fine-tuning on downstream tasks, and
an exponential growth of model size that brought us performance on par with humans on a number of NLP tasks, but also exploding resource consumption with diminishing returns.

To leverage transformers for our custom NER task, we’ll use the Python library huggingface transformers, which provides

a model repository including BERT, GPT-2 and others, pre-trained in a variety of languages,
wrappers for downstream tasks like classification, named entity recognition, summarization, et cetera, and
convenient ways to fine-tune on downstream tasks, e.g. in end-to-end pipelines or via TensorFlow or PyTorch.

Get your keyboard ready, or just follow along by reading!

Setting up the environment

Set up a virtual environment, install the required dependencies and download the dataset, similarly to the preceding blog posts:

mkdir transformers_ner_project && cd transformers_ner_project
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas tqdm scikit-learn "transformers[tf-cpu]"
mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Step 1: Loading a pre-trained BERT

With huggingface transformers, it’s super-easy to get a state-of-the-art pre-trained transformer model nicely packaged for our NER task: we choose a pre-trained German BERT model from the model repository and request a wrapped variant with an additional token classification layer for NER with just a few lines:

from transformers import AutoConfig, TFAutoModelForTokenClassification

MODEL_NAME = 'bert-base-german-cased'

# `schema` is the list of tags that we build in Step 2 below.
config = AutoConfig.from_pretrained(MODEL_NAME, num_labels=len(schema))
model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                          config=config)
model.summary()

The result is a TensorFlow model consisting of the pre-trained BERT transformer, followed by a drop-out and a dense classifier layer which predicts the tag of each token:

Model: "tf_bert_for_token_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBERTMainLayer)       multiple                  109081344 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  16149     
=================================================================
Total params: 109,097,493
Trainable params: 109,097,493
Non-trainable params: 0
_________________________________________________________________

Step 2: Preprocessing

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
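
To make the BIO format concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The function name `bio_to_spans` is made up for illustration and is not part of the dataset tooling; the sample is the excerpt shown above.

```python
def bio_to_spans(tagged_tokens):
    """Group (token, BIO tag) pairs into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []
    # The sentinel ('', 'O') closes an entity that runs to the end of the sentence.
    for token, tag in list(tagged_tokens) + [('', 'O')]:
        prefix, _, label = tag.partition('-')
        if prefix == 'I' and label == current_label:
            current_tokens.append(token)  # the current entity continues
            continue
        if current_tokens:  # the previous entity has ended: store it
            spans.append((current_label, ' '.join(current_tokens)))
        if prefix in ('B', 'I'):  # a new entity starts here
            current_label, current_tokens = label, [token]
        else:  # 'O': outside of any entity
            current_label, current_tokens = None, []
    return spans


sample = [('an', 'O'), ('Kapitalgesellschaften', 'O'), ('(', 'O'),
          ('§', 'B-GS'), ('17', 'I-GS'), ('Abs.', 'I-GS'), ('1', 'I-GS'),
          ('und', 'I-GS'), ('2', 'I-GS'), ('EStG', 'I-GS'), (')', 'O')]
print(bio_to_spans(sample))
```

The B- prefix marks the first token of an entity and the I- prefix every following token, so the seven tagged lines above form a single GS entity.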

We read two data files line by line, store the sentences as lists of token-tag pairs, and determine the annotation schema, just as we did when training our bi-LSTM model:

def load_data(filename: str):
    with open(filename, 'r', encoding='utf-8') as file:
        lines = [line.split() for line in file]
    samples, start = [], 0
    # The trailing empty entry makes sure the last sentence is stored as well,
    # even if the file does not end with a blank line.
    for end, parts in enumerate(lines + [[]]):
        if not parts:
            # Drop the B-/I- prefix and keep only the entity class.
            sample = [(token, tag.split('-')[-1])
                      for token, tag in lines[start:end]]
            if sample:
                samples.append(sample)
            start = end + 1
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
samples = train_samples + val_samples
# Index 0 ('_') is reserved for padding.
schema = ['_'] + sorted({tag for sentence in samples
                             for _, tag in sentence})
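
Since the argument of this post hinges on how many examples each tag class has, it can help to count the tags right after loading. This is a minimal sketch: the helper name `tag_counts` and the toy samples are made up for illustration, and the real counts would come from passing `train_samples` instead.

```python
from collections import Counter


def tag_counts(samples):
    """Count how often each tag occurs over all sentences."""
    return Counter(tag for sentence in samples for _, tag in sentence)


# Toy data in the same shape that load_data returns (invented for this sketch).
toy_samples = [
    [('an', 'O'), ('§', 'GS'), ('17', 'GS'), ('EStG', 'GS'), (')', 'O')],
    [('vgl.', 'O'), ('BGH', 'RS')],
]
counts = tag_counts(toy_samples)
for tag, count in counts.most_common():
    print(f'{tag}: {count}')
```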

Gotcha! Sub-word tokenization?

But how do we feed the data into our transformer? The answer depends on the model that we chose because it has been pre-trained with a custom sub-word tokenizer. This tokenizer splits an input sentence into a sequence of sub-word tokens instead of words, using an algorithm like WordPiece (in the case of BERT), byte-pair encoding or unigram language models. Let’s get hold of the tokenizer that was used to pre-train our model,

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

and apply it to some samples. The results are dictionaries where we’re mainly interested in the component input_ids:

`sample`	`tokenizer(sample)['input_ids']`
`'Das ist'`	`[3, 295, 127, 4]`
`'eine Frage'`	`[3, 155, 1685, 4]`
`'eine hochinteressante Frage'`	`[3, 155, 2426, 21477, 5004, 1685, 4]`

What do we see?

The tokenizer marks the beginning and the end of a sample with the special token ids 3 and 4 (BERT’s [CLS] and [SEP] tokens), respectively.
Common words like 'Das', 'ist', 'eine', 'Frage' are treated as single tokens.
Less frequent words like 'hochinteressante' are split up into a sequence of sub-word tokens.
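
To get a feeling for why rare words are split while common ones are not, here is a toy greedy longest-match splitter in the style of WordPiece. It is only a sketch: the vocabulary `TOY_VOCAB` is made up, the function name `toy_wordpiece` is hypothetical, and the real tokenizer uses its own vocabulary and additional normalization, so its splits will generally differ.

```python
def toy_wordpiece(word, vocab):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a piece inside a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return ['[UNK]']  # no piece matches: the word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Invented vocabulary: common words are present as a whole, rare ones are not.
TOY_VOCAB = {'eine', 'Frage', 'hoch', '##interess', '##ante'}
print(toy_wordpiece('Frage', TOY_VOCAB))
print(toy_wordpiece('hochinteressante', TOY_VOCAB))
```

A word that is in the vocabulary stays in one piece, while a word that is not falls apart into the longest pieces that are.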

So we need to

apply the sub-word tokenizer to every word in our input samples, and
whenever it does split up a word, tag each sub-word like the entire word.

This can be done as follows:

import numpy as np
from tqdm import tqdm

def tokenize_sample(sample):
    # Tokenize word by word so that every sub-word inherits the tag of its word.
    seq = [
               (subtoken, tag)
               for token, tag in sample
               # [1:-1] strips the start and end markers added by the tokenizer.
               for subtoken in tokenizer(token)['input_ids'][1:-1]
           ]
    # Re-add the start (3) and end (4) markers once per sample, tagged as 'O'.
    return [(3, 'O')] + seq + [(4, 'O')]

def preprocess(samples):
    tag_index = {tag: i for i, tag in enumerate(schema)}
    tokenized_samples = list(tqdm(map(tokenize_sample, samples),
                                  total=len(samples)))
    max_len = max(map(len, tokenized_samples))
    # Positions beyond the end of a sentence stay 0, i.e. the padding tag '_'.
    X = np.zeros((len(samples), max_len), dtype=np.int32)
    y = np.zeros((len(samples), max_len), dtype=np.int32)
    for i, sentence in enumerate(tokenized_samples):
        for j, (subtoken_id, tag) in enumerate(sentence):
            X[i, j] = subtoken_id
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
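
To see what these two functions do without downloading a model, here is a dependency-free sketch of the same two ideas, tag propagation and padding. The splitter `split_every_four` is a made-up stand-in for the real tokenizer (it simply cuts a word into chunks of four characters), so its pieces are not real sub-word tokens; the other names are hypothetical as well.

```python
def split_every_four(word):
    """Hypothetical stand-in for a sub-word tokenizer: chunks of 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


def propagate_tags(sample, split_word):
    """Give every piece of a word the tag of the whole word."""
    return [(piece, tag) for word, tag in sample for piece in split_word(word)]


def pad(rows, pad_value):
    """Right-pad all rows to the length of the longest one."""
    max_len = max(len(row) for row in rows)
    return [row + [pad_value] * (max_len - len(row)) for row in rows]


sample = [('Kapitalgesellschaften', 'O'), ('§', 'GS'), ('EStG', 'GS')]
pieces = propagate_tags(sample, split_every_four)
tags = [tag for _, tag in pieces]
padded = pad([tags, ['O']], '_')
print(pieces)
print(padded)
```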

Step 3: Fine-tuning BERT on our custom NER task

Training the model is now more or less the same as in the preceding post with our bi-LSTM model:

import pandas as pd
import tensorflow as tf

NR_EPOCHS = 10
BATCH_SIZE = 16

optimizer = tf.keras.optimizers.Adam(learning_rate=0.00001)
# The classification layer outputs raw logits, hence from_logits=True.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
history = model.fit(tf.constant(X_train), tf.constant(y_train),
                    validation_split=0.2, epochs=NR_EPOCHS,
                    batch_size=BATCH_SIZE)

Well, except that the model now has considerably more parameters, and training for just one epoch might take some hours, depending on your hardware. Here’s the validation accuracy:

[Figure: validation accuracy history]

Note the lower bound of the accuracy axis and that the x-axis measures the training time in seconds.
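
One caveat when reading this accuracy: the label array is zero-padded, and a plain accuracy also counts the padding positions, so the number can look better than the tagging of real tokens actually is. The following sketch computes an accuracy that ignores them. The function name `masked_accuracy` and the toy arrays are made up for illustration, and the sketch assumes, as in the preprocessing above, that tag index 0 is reserved for padding.

```python
def masked_accuracy(y_true, y_pred, pad_index=0):
    """Share of correctly tagged positions, ignoring padded positions."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_row, pred_row):
            if true_tag == pad_index:
                continue  # padding does not count
            total += 1
            correct += int(true_tag == pred_tag)
    return correct / total if total else 0.0


# Toy tag indices, invented for this sketch; 0 marks padding.
y_true = [[1, 2, 0, 0], [1, 1, 3, 0]]
y_pred = [[1, 1, 0, 0], [1, 1, 3, 0]]
print(masked_accuracy(y_true, y_pred))
```

In this toy example, 7 of 8 positions match if padding is included, but only 4 of the 5 real positions do.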

Step 4: Evaluation — gotcha again!

Now that we have trained our custom-NER-BERT, we want to apply it and … face another problem: the model predicts tag annotations on the sub-word level, not on the word level. To obtain word-level annotations, we need to aggregate the sub-word level predictions for each word. Two obvious solutions come to mind:

for each sub-word, choose the tag with highest probability, and then use a majority vote, or
average the predicted probabilities over all sub-words of a word, and then take the tag with highest average probability.

Given predictions pred for a sequence seq of sub-words of shape (len(seq), len(schema)), this would amount to taking the tag indexed by

scipy.stats.mode(np.argmax(pred, axis=-1)), using the package SciPy, or
np.argmax(np.mean(pred, axis=0)),

respectively, or, in the picture below, to going (1) first right, then down or (2) first down, then right:

[Figure: sub-word prediction aggregation]
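
The two variants do not always agree. Here is a small sketch in plain Python (lists instead of NumPy arrays, to keep it dependency-free) with made-up probabilities for one word that was split into three sub-words. The function names are hypothetical, and ties are broken in favour of the first candidate, which may differ from what SciPy does.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(pred):
    """Variant 1: pick the best tag per sub-word, then the most frequent one."""
    votes = [argmax(row) for row in pred]
    return Counter(votes).most_common(1)[0][0]


def mean_probability(pred):
    """Variant 2: average the probabilities per tag, then pick the best tag."""
    means = [sum(column) / len(pred) for column in zip(*pred)]
    return argmax(means)


# One row per sub-word, one column per tag (invented numbers).
pred = [
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
    [0.90, 0.10, 0.00],
]
print(majority_vote(pred), mean_probability(pred))
```

Here two lukewarm votes for tag 1 win the majority vote, while one confident prediction for tag 0 wins the average.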

We choose variant 2 and apply it to the model’s predictions as follows:

def aggregate(sample, predictions):
    results = []
    i = 1  # skip the prediction for the start marker
    for token, y_true in sample:
        nr_subtoken = len(tokenizer(token)['input_ids']) - 2
        pred = predictions[i:i+nr_subtoken]
        i += nr_subtoken
        # Summing instead of averaging yields the same argmax.
        y_pred = schema[np.argmax(np.sum(pred, axis=0))]
        results.append((token, y_true, y_pred))
    return results

# The first element of the model output holds the per-token scores (logits).
y_probs = model.predict(X_val)[0]
predictions = [aggregate(sample, predictions)
               for sample, predictions in zip(val_samples, y_probs)]

Finally, we can evaluate the predictions on the level of tokens as a multi-class classification problem using scikit-learn, as in the preceding blog post. Here is the scatterplot of the resulting F1 scores versus the support for each tag class:

[Figure: NER F1 score vs. support per tag class]
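
For readers who want to see what goes into such a plot, here is a dependency-free sketch that computes the F1 score and the support per tag from the (token, true tag, predicted tag) triples produced by the aggregation step. It is a stand-in written for illustration, not the scikit-learn call used for the figure; the function name `f1_per_tag` and the toy predictions are made up.

```python
from collections import Counter


def f1_per_tag(predictions):
    """Map each tag to (F1 score, support) over all sentences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for sentence in predictions:
        for _token, y_true, y_pred in sentence:
            if y_true == y_pred:
                tp[y_true] += 1
            else:
                fp[y_pred] += 1  # predicted tag was wrongly assigned
                fn[y_true] += 1  # true tag was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        denominator = 2 * tp[tag] + fp[tag] + fn[tag]
        f1 = 2 * tp[tag] / denominator if denominator else 0.0
        scores[tag] = (f1, tp[tag] + fn[tag])  # support = number of true tags
    return scores


# Toy predictions in the shape returned by aggregate (invented values).
toy_predictions = [[('§', 'GS', 'GS'), ('17', 'GS', 'GS'),
                    ('EStG', 'GS', 'O'), (')', 'O', 'O')]]
scores = f1_per_tag(toy_predictions)
for tag, (f1, support) in sorted(scores.items()):
    print(f'{tag}: f1={f1:.2f} support={support}')
```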

Conclusion

Let’s see how our new results compare to those of the previous post, and note that I’ve let BERT train 50 times as long as the bi-LSTM:

[Figure: comparison of NER F1 scores per tag]

We see that BERT significantly outperforms the bi-LSTM on difficult classes in our task. Is this only because of the more powerful network architecture and more training time? No! The scatterplot above shows a significant correlation between the F1 score and the amount of training data per class, and points us to the key advantage of the present approach, namely how it uses transfer learning:

Before (bi-LSTM), we used transfer learning in the form of pre-trained word embeddings.
Now (BERT), we start from a fully trained language model that embodies much more knowledge.

The upshot is:

The less data we have, the more important transfer learning becomes.

Blog author

Thomas Timmermann
