Take control of named entity recognition with your own Keras model!

13.11.2020 | 7 minutes of reading time

This post shows how to extract information from text documents with the high-level deep learning library Keras: we build, train and evaluate a bidirectional LSTM model by hand for a custom named entity recognition (NER) task on legal texts.

In a previous post, we solved the same NER task on the command line with the NLP library spaCy. The present approach requires some work and knowledge, but yields a much more flexible solution that we can tune, scale and modify to our needs.

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in

Fine-grained Named Entity Recognition in Legal Documents.

It consists of decisions from several German federal courts with annotations of named entities referring to legal norms, court decisions, legal literature and others, of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge ” ( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98
, juris Rn. 2 RS
; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 LIT mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 LIT mwN ) hat der Strafausspruch
Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist ( § 354 Abs. 1a Satz 1 StPO GS ) . ‘

The task will be to build, train and evaluate a model that, given sample sentences, annotates each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature and so on.

NER with bi-LSTM for dummies

We implement a standard deep-learning architecture for NER — a bi-directional recurrent neural network — which works as follows:

  1. Each sentence is split into a sequence of tokens and each token is represented by a word vector. These word vectors or embeddings are usually pre-trained on a huge corpus of documents so that they encode semantic information. We thus apply general language proficiency to our special task, a technique known as transfer learning. Common methods for pre-training are word2vec, GloVe or fastText; we use the word vectors provided by spaCy (see the short sketch after this list).
  2. The model processes the input sequence step by step and maintains an internal memory along the way,
    • reading the corresponding input vector,
    • combining this input with the internal memory,
    • producing an output vector and
    • updating the internal memory

    at each step. This magic is carried out by a long short-term memory (LSTM) cell. As a result, we obtain an output sequence of the same length as the input sequence, and an internal memory state.

  3. Going backwards, the model reads the input again and produces a second output sequence.
  4. At each position, the outputs of steps 2 and 3 are combined and fed into a classifier which outputs, for the input word at this position, the probability that it should be annotated with the first tag, the second tag, and so on.
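To get a feeling for such pre-trained embeddings, here is a minimal sketch using the German spaCy model de_core_news_md that we install in the next section; the example tokens 'Gericht' and 'Urteil' are arbitrary and chosen only for illustration:

import numpy as np
import spacy

# Load the pre-trained German model; its vocabulary ships with word vectors.
nlp = spacy.load('de_core_news_md')

vec_gericht = nlp.vocab.get_vector('Gericht')   # zero vector if the token is out of vocabulary
vec_urteil = nlp.vocab.get_vector('Urteil')
print(vec_gericht.shape)                        # dimensionality of the embeddings

# Cosine similarity: semantically related words tend to score higher.
similarity = np.dot(vec_gericht, vec_urteil) / (
    np.linalg.norm(vec_gericht) * np.linalg.norm(vec_urteil))
print(similarity)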

To improve performance, one can replace the last feed-forward layer by a conditional random field (CRF) model. The resulting architecture is called a bi-LSTM-CRF model.

Setting up the environment

First, set up a virtual environment as described in the preceding blog post, and install the required dependencies:

mkdir keras_ner_project
cd keras_ner_project
python3 -m venv .venv
source .venv/bin/activate
pip install spacy
python -m spacy download de_core_news_md
pip install tensorflow

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Next, download the data as in the preceding blog post (if you are inside a Jupyter notebook, put an exclamation mark ! in front of each command to have it executed by the shell):

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Step 1: Preprocessing for NER

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O

We read such a data file line by line and store the sentences as lists of token-tag pairs:

def load_data(filename: str):
    with open(filename, 'r') as file:
        lines = [line[:-1].split() for line in file]
    samples, start = [], 0
    for end, parts in enumerate(lines):
        if not parts:
            # A blank line ends a sentence; keep only the coarse tag (drop the B-/I- prefix).
            sample = [(token, tag.split('-')[-1]) for token, tag in lines[start:end]]
            samples.append(sample)
            start = end + 1
    if start < len(lines):
        # Handle a trailing sentence that is not followed by a blank line.
        samples.append([(token, tag.split('-')[-1]) for token, tag in lines[start:]])
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
all_samples = train_samples + val_samples
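As a quick sanity check (just a sketch; the exact counts depend on the downloaded files), we can look at the number of sentences and the structure of the first training sample:

print(len(train_samples), len(val_samples))
print(train_samples[0][:5])   # first five (token, tag) pairs of the first sentence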

For simplicity, we’ll truncate the sentences to a maximum length and pad shorter input sequences. But first, let us determine the set of all tags in the data and add an extra tag for the padding:

schema = ['_'] + sorted({tag for sentence in all_samples for _, tag in sentence})
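The resulting schema is a sorted list of coarse tag names with the padding tag '_' at position 0. A small sketch of how it maps tags to the integer labels used below (the exact tag set depends on the data):

print(schema)                 # e.g. ['_', 'EUN', 'GRT', 'GS', ...]
tag_index = {tag: index for index, tag in enumerate(schema)}
print(tag_index['GS'])        # integer label for references to statutes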

Next, we represent each token by a word vector, using a pre-trained German language model of the NLP library spaCy:

import spacy
import numpy as np

nlp = spacy.load('de_core_news_md')
EMB_DIM = nlp.vocab.vectors_length
MAX_LEN = 50

def preprocess(samples):
    tag_index = {tag: index for index, tag in enumerate(schema)}
    X = np.zeros((len(samples), MAX_LEN, EMB_DIM), dtype=np.float32)
    y = np.zeros((len(samples), MAX_LEN), dtype=np.uint8)
    vocab = nlp.vocab
    for i, sentence in enumerate(samples):
        for j, (token, tag) in enumerate(sentence[:MAX_LEN]):
            X[i, j] = vocab.get_vector(token)
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
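A quick shape check (sketch) shows what the model will consume: one matrix of MAX_LEN word vectors per sentence and one integer tag sequence per sentence:

print(X_train.shape)   # (number of training sentences, MAX_LEN, EMB_DIM)
print(y_train.shape)   # (number of training sentences, MAX_LEN)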

Now we have the data ready for NER and can assemble our model!

Step 2: Build the bi-LSTM model

With the wide range of layers offered by Keras, we can construct a bi-directional LSTM model as a sequence of two compound layers:

  1. The bidirectional LSTM layer encapsulates a forward and a backward pass of an LSTM layer, followed by the stacking of the sequences returned by both passes.
  2. The second layer applies a dense classification layer to every position of the stacked sequences. Here, the softmax activation function scales the output so that we obtain sequences of probability distributions:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_model(nr_filters=256):
    input_shape = (MAX_LEN, EMB_DIM)
    lstm = LSTM(nr_filters, return_sequences=True)
    bi_lstm = Bidirectional(lstm, input_shape=input_shape)
    tag_classifier = Dense(len(schema), activation='softmax')
    sequence_labeller = TimeDistributed(tag_classifier)
    return Sequential([bi_lstm, sequence_labeller])

model = build_model()
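To double-check the architecture before training, we can print the Keras summary (a quick sketch; the parameter counts depend on nr_filters and the embedding dimension):

model.summary()   # one Bidirectional layer with output shape (None, MAX_LEN, 2 * nr_filters)
                  # and one TimeDistributed classifier with output shape (None, MAX_LEN, len(schema))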
Step 3: Train the model

Next, we compile the model with the sparse categorical cross-entropy loss (our targets are integer tag indices) and train it on the preprocessed data:

def train(model, epochs=10, batch_size=32):
    model.compile(optimizer='Adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=epochs,
                        batch_size=batch_size)
    return history.history

history = train(model)
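To see whether the model over- or underfits, we can plot the accuracy curves from the returned history dictionary. A minimal sketch using matplotlib (not among the dependencies installed above, so pip install matplotlib first):

import matplotlib.pyplot as plt

def plot_history(history):
    # 'accuracy' and 'val_accuracy' are the keys Keras records for the metric configured above.
    plt.plot(history['accuracy'], label='training accuracy')
    plt.plot(history['val_accuracy'], label='validation accuracy')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

plot_history(history)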
Step 4: Predict and evaluate

After training, we let the model predict tag probabilities for the validation sentences, pick the most probable tag at each position, and pair it with the token and its true tag:

def predict(model):
    y_probs = model.predict(X_val)
    y_pred = np.argmax(y_probs, axis=-1)
    return [
        [(token, tag, schema[index]) for (token, tag), index in zip(sentence, tag_pred)]
        for sentence, tag_pred in zip(val_samples, y_pred)
    ]

predictions = predict(model)
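Each entry of predictions is one validation sentence as a list of (token, true tag, predicted tag) triples, so we can eyeball individual sentences. A small sketch:

for token, true_tag, predicted_tag in predictions[0]:
    print(f'{token:20} {true_tag:5} {predicted_tag}')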

Finally, we compute precision, recall and f1-score on the level of tag categories using scikit-learn's classification_report:

import pandas as pd
from sklearn.metrics import classification_report

def evaluate(predictions):
    y_t = [pos[1] for sentence in predictions for pos in sentence]
    y_p = [pos[2] for sentence in predictions for pos in sentence]
    report = classification_report(y_t, y_p, output_dict=True)
    return pd.DataFrame.from_dict(report).transpose().reset_index()

evaluate(predictions)
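Since the report also contains the majority class O and the aggregate rows, a small sketch that keeps only the entity tags and sorts them by f1-score can make the numbers easier to compare (the column names are those produced by classification_report):

report = evaluate(predictions)
entity_rows = report[~report['index'].isin(['O', 'accuracy', 'macro avg', 'weighted avg'])]
print(entity_rows.sort_values('f1-score', ascending=False))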

Training a model with 1024 filters for 10 epochs, we reach the following scores:

tag   f1-score  precision  recall  support
EUN       56.9       67.0    49.5      398
GRT       65.9       91.0    51.6      643
GS        94.5       96.1    92.9     6774
INN       41.3       88.9    26.9      119
LD        74.0       67.0    82.6       86
LDS        0.0        0.0     0.0        9
LIT       79.5       74.3    85.4     1681
MRK        0.0        0.0     0.0       49
ORG       25.3       32.4    20.8      159
PER        0.0        0.0     0.0      473
RR        92.0       94.4    89.8      560
RS        90.7       97.1    85.0     8380
ST        71.9       93.9    58.2       79
STR        0.0        0.0     0.0       35
UN        32.7       64.9    21.8      110
VO         2.2        4.0     1.5       66
VS         0.0        0.0     0.0       10
VT        18.0       11.7    38.9      144

Let’s see how this compares to the results achieved with spaCy:

It seems that our hand-built NER model does very well! But beware that these experiments do not show a winner: neither of the two approaches has been optimized, and we compared neither training time nor compute resources. The main differentiating factors are that

  • spaCy can be used out of the box with no understanding of deep learning, whereas
  • the approach presented here is much more flexible and tuneable (see below).

What next?

With the deep learning library Keras, building and training our custom NER model took just a few lines of code, but setting up the data and the training required much more understanding than the command-line approach with spaCy.

To improve performance, we could try to tune the model and

  • increase the number of filters, that is, the size of the LSTM cell,
  • stack several bidirectional layers on top of each other (see the sketch after this list),
  • replace the time-distributed classification layer with a conditional random field (CRF) model or
  • address the imbalance of the tag distribution with a focal loss instead of categorical cross-entropy.
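As an illustration of the first two points, here is a hedged sketch of a deeper variant of build_model with two stacked bidirectional layers and more filters; the hyperparameters are illustrative and untuned:

def build_deeper_model(nr_filters=1024):
    # Both bidirectional layers must return full sequences so that the
    # time-distributed classifier still sees one vector per token.
    return Sequential([
        Bidirectional(LSTM(nr_filters, return_sequences=True),
                      input_shape=(MAX_LEN, EMB_DIM)),
        Bidirectional(LSTM(nr_filters, return_sequences=True)),
        TimeDistributed(Dense(len(schema), activation='softmax')),
    ])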

But to achieve a significant boost, we need to provide our model with more input by

  1. labeling more task-specific training data or
  2. applying more task-independent language proficiency to our task.

In the next blog post, we will fine-tune a pre-trained NLP transformer model for our NER task and get state-of-the-art performance.

Stay tuned!
