Digital processing of incoming mail with full-text indexing

Provinzial NordWest can process over 100,000 pages of correspondence and email documents in 24 hours and make them available for full-text searches. AI and machine learning make it possible.

The Provinzial NordWest Group is part of the Sparkassen-Finanzgruppe and one of the largest public insurance groups in Germany.

The project at a glance

230,000 pages processed per day at peak times
Recognition rate above 70% for 90% of documents

Open source tools reduce operating costs
High quality of results as the starting point for further AI projects

Initial situation

The Provinzial NordWest insurance group operates locally for its customers in Schleswig-Holstein, Mecklenburg-Western Pomerania, Hamburg and Westphalia. Provinzial Nord's network of 220 insurance agencies extends from Westerland to Rügen and from Viöl to Hamburg-Harburg. In Westphalia, the group is represented by 438 offices between Bocholt and Höxter.

Provinzial NordWest (PNW) has been digitizing incoming correspondence (paper, fax, email) for years and stores the scanned documents in the TIFF image format. In the past, only the first page of a document folder was run through OCR and used for classification, because of the cost and computing power involved. The SHERLOQ project (formerly “Sherlock”) was launched at the end of 2018. Initially developed as a proof of concept, SHERLOQ aimed to store all incoming correspondence as searchable full text.

PNW processes well over 100,000 pages of letters and email documents every day. The basic requirement was to process this volume within 24 hours, so that the CRM system has a daily updated data basis for full-text searches across all incoming correspondence.

Solution

The project presented a whole range of new challenges for internal software development and IT operations. The use of modern but heterogeneous technologies such as OpenCV, Tesseract, TensorFlow and Keras required a high degree of flexibility in development, build, and deployment. To create a common standard, especially for build and deployment, the individual SHERLOQ services run in Docker containers.

At present, SHERLOQ consists of nine services that are loosely coupled by means of queues and can be scaled individually by adjusting the number of containers. This is particularly important because of the heavy load at certain core times, such as early morning or evening. Each service keeps a journal of its current processing times. For example, a Tesseract service takes an average of ten seconds per page, while pre-processing, such as cleaning and upscaling, takes less than a second. The microservice architecture allows SHERLOQ to address this imbalance.
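The effect of this imbalance on scaling can be illustrated with a rough capacity estimate. The following is a minimal sketch, not part of SHERLOQ: the function name `containers_needed` is hypothetical, the page volumes and per-page times are the figures quoted in this text, and the load is assumed to be spread evenly over 24 hours, which understates what is needed during the core times.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def containers_needed(pages_per_day: int, seconds_per_page: float,
                      window_seconds: int = SECONDS_PER_DAY) -> int:
    """Smallest number of parallel containers that clears the volume in the window.

    Assumes one page at a time per container and an evenly spread load
    (a simplification; real load peaks in the morning and evening).
    """
    total_work_seconds = pages_per_day * seconds_per_page
    return max(1, math.ceil(total_work_seconds / window_seconds))


# Figures from the text: about 10 s per page for OCR, under 1 s for pre-processing.
ocr_containers = containers_needed(100_000, 10.0)       # 12
preprocessing_containers = containers_needed(100_000, 1.0)  # 2 (upper bound)
peak_ocr_containers = containers_needed(230_000, 10.0)  # 27

print(ocr_containers, preprocessing_containers, peak_ocr_containers)
```

Under these assumptions, the OCR stage needs roughly six times as many containers as pre-processing, which is why scaling each service independently matters.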

To reduce the load on text recognition, a trained deep neural network classifies pages as text or image documents. This allows the system to filter out larger TIFF files that contain no text at all. The recognition rate is measured at run time by matching the recognized words against a large dictionary in Elasticsearch. Elasticsearch also provides a mechanism for suggesting words, which SHERLOQ uses to correct recognition errors. The full texts are then stored and made available in Elasticsearch as well.
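The idea of a dictionary-based recognition rate with suggestion-based correction can be sketched as follows. This is an illustration only: a small in-memory set stands in for the Elasticsearch dictionary, Python's `difflib.get_close_matches` stands in for the Elasticsearch suggestion mechanism, and all names (`DICTIONARY`, `recognition_rate`, `correct`) are hypothetical rather than taken from SHERLOQ.

```python
import difflib
import re

# Hypothetical stand-in for the large dictionary that SHERLOQ keeps in Elasticsearch.
DICTIONARY = {"your", "insurance", "policy", "has", "been", "renewed"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep runs of letters (umlauts included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def recognition_rate(text: str, dictionary: set[str]) -> float:
    """Share of tokens found in the dictionary; 0.0 for text without tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


def correct(text: str, dictionary: set[str], cutoff: float = 0.8) -> str:
    """Replace unknown tokens with the closest dictionary word, if one is similar enough."""
    vocabulary = sorted(dictionary)
    corrected = []
    for token in tokenize(text):
        if token in dictionary:
            corrected.append(token)
            continue
        match = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


ocr_text = "Your insurence policy has been renewed"
print(f"{recognition_rate(ocr_text, DICTIONARY):.2f}")  # 0.83 (5 of 6 tokens known)
print(correct(ocr_text, DICTIONARY))  # your insurance policy has been renewed
```

In a measure of this kind, correctly recognized proper names that are missing from the dictionary count as misses, so the measured rate is a conservative estimate.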

Result

SHERLOQ has processed 230,000 pages per day at peak times. Recognition rates were over 70 percent for 90 percent of the incoming documents. Correctly recognized proper names that are not included in the dictionary come on top of this. The system has been in production since September and has already stored over 12 million pages, which are available to the CRM system through full-text search.

New projects based on the data are already in progress. They range from new methods for document classification with machine learning models to the recognition of intent in customer correspondence. In addition to the results of the project and the follow-up projects in AI and data science, the team gained experience in operating Docker and heterogeneous architectures. Thanks to containers, operating the application requires little effort, which paves the way for a heterogeneous application landscape and thus for new tools and possibilities.

Illustration of the detective Sherloq using a magnifying glass to examine data, images, diagrams, zip files and letters.

All SHERLOQ functions at a glance

SHERLOQ is the solution for automating your customer communication. Thanks to the combination of AI framework and workflow platform, you can

process documents faster and more efficiently.
observe compliance requirements.
take the pressure off your business and IT teams.
increase customer satisfaction.

The SHERLOQ process integrates very flexibly and is highly scalable in our application landscape! Docker supports this perfectly. The potential benefits of the full-text database are very high, which is noticeable both in the processing of individual documents and in the comprehensive analysis of documents.
Matthias Kortbus
ITK activities/documents, Provinzial

Any questions about the project?

Would you like to use AI and SHERLOQ in your projects? Are you interested in a custom solution for your company? Let's talk, with no obligation.

Kaan Soyyigit

Market Lead Insurance

Further projects of codecentric AG

Find out about other successful projects that we have completed with our customers. Perhaps you will find inspiration for a use case in your company here.