Document classification, data extraction and everything

20.8.2019 | 6 minutes reading time

Over time, many posts about document classification and data extraction, using Kofax among other products, have been published on the codecentric blog. This post puts them into context and points out what has changed since the older ones were written.

The codecentric division ‘Digital Integration’ uses the products Kofax Capture / Kofax Transformation Modules and Kofax Total Agility, among others, in customer projects. A large part of the posts therefore refers to these products.

The listed posts were published independently over the last few years. I have grouped them into different areas to bring some clarity:

Best practices
Experience reports
Tips and tricks
Latest trends
The basis of everything

Best practices in document classification and data extraction

Regardless of specific projects, these posts are about best practices in customer projects that use document classification and data extraction.

The basis of mailroom automation is the classification of incoming documents into different document types, because the data to be extracted differs from type to type. The following article explains the classification tools available in Kofax Transformation Modules.

Document classification with Kofax Transformation Modules (KTM)

Artificial intelligence, neural networks and machine learning play a role in this environment. Kofax Transformation Modules has offered AI mechanisms for years and provides new tools with every release:

Kofax Transformation Modules (KTM), AI and Machine Learning

Experience reports

The following blog posts are based on customer projects. Topics range from details of processing SEPA mandates to document process automation in insurance companies.

At one of our customers, incoming SEPA mandates are processed automatically or manually, depending on whether handwritten notes occur in a certain area of the SEPA form. This post explains how this can be done with KTM tools:

Kofax Transformation Modules: SEPA Mandates and handwritten additional information – or: who scribbled on my form?
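To picture the routing decision described above, here is a minimal Python sketch. It is not KTM code and not necessarily the approach taken in the linked post: it simply assumes that the form has already been binarized and that "handwritten notes in a certain area" can be approximated by the share of dark pixels in a rectangular zone. The names `ink_ratio`, `route_mandate` and the threshold of 5 % are hypothetical.

```python
def ink_ratio(image, zone):
    """Share of dark pixels inside a rectangular zone of a binarized image.

    `image` is a list of rows with 1 = dark pixel and 0 = white pixel;
    `zone` is (top, left, bottom, right) with exclusive bottom/right.
    """
    top, left, bottom, right = zone
    pixels = [px for row in image[top:bottom] for px in row[left:right]]
    return sum(pixels) / len(pixels) if pixels else 0.0


def route_mandate(image, zone, threshold=0.05):
    """Return 'manual' if the zone looks written on, otherwise 'automatic'."""
    # Assumption: an empty zone contains (almost) no dark pixels.
    return "manual" if ink_ratio(image, zone) > threshold else "automatic"


# Toy data: a blank 10x10 form and a copy with a stroke in the notes zone.
clean_form = [[0] * 10 for _ in range(10)]
scribbled_form = [row[:] for row in clean_form]
for col in range(2, 8):
    scribbled_form[5][col] = 1  # a short handwritten stroke inside the zone

notes_zone = (4, 0, 8, 10)

print(route_mandate(clean_form, notes_zone))
print(route_mandate(scribbled_form, notes_zone))
```

In a real project the threshold would have to be tuned on sample documents, since scan noise and pre-printed lines inside the zone also count as dark pixels.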

Document process automation is discussed at the beginning of every project. Project members often start out with different ideas of what it means and have to develop a common understanding before the actual work begins. These different views are presented in the following blog post:

KTM and insurance companies: Document Process Automation

The aim of a recognition process is, wherever possible, the fully automatic processing of incoming documents. Contract terminations offer potential for automation, as the termination date is often mentioned in the document. Practical problems may occur, but these can be solved with KTM tools:

Automatic termination of an insurance contract – how Kofax KTM may help

Tips and tricks

In every customer project, “small” problems occur that cannot be solved with standard tools. Finding a solution without resorting to external tools requires some creativity. Here are some tips and tricks that arose from our projects:

Scan and recognition products often try to rotate captured documents into the “correct” orientation, so that people can read them without rotating the page manually. Sometimes this automation fails, especially with faxes, which may contain lines printed at 90° or 180° to the main text. The following article explains how to rotate these problem documents correctly and automatically:

Orientation problems with document processing (Kofax Transformation Modules)

KTM offers so-called ‘dictionaries’. For example, you may use regular expressions to extract a date that can appear in different formats: 01.09.2019, 01. September 2019, etc. A dictionary (a plain text file) can contain the names of the months and their abbreviations, and the regular expression can reference this dictionary. This saves a lot of typing when defining the regular expressions, and it also lets you change the dictionary without modifying your KTM project. All this is standard KTM functionality. But sometimes you would like to search for entries in the dictionary by script. This can be done this way:

Kofax Transformation Modules (KTM) – Dictionaries: Search by script
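The dictionary idea can also be illustrated outside of KTM. The following minimal Python sketch is not KTM script code: it assumes a dictionary file with one month name or abbreviation per line, and the names `load_dictionary`, `build_date_pattern`, `in_dictionary` and `months.txt` are hypothetical. It builds one regular expression from the dictionary entries and shows a simple scripted lookup.

```python
import re
from pathlib import Path


def load_dictionary(path):
    """Read a plain-text dictionary: one entry per line, blank lines ignored."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_date_pattern(month_names):
    """Build a regex for dates such as 01.09.2019 or 01. September 2019."""
    # Longest entries first so that "Sept" is not cut short to "Sep".
    alternatives = "|".join(
        re.escape(name) for name in sorted(month_names, key=len, reverse=True)
    )
    return re.compile(
        r"\b(\d{1,2})\.\s*(?:(\d{1,2})\.\s*|(" + alternatives + r")\.?\s+)(\d{4})\b",
        re.IGNORECASE,
    )


def in_dictionary(word, entries):
    """Case-insensitive lookup, i.e. searching the dictionary by script."""
    return word.casefold() in {entry.casefold() for entry in entries}


# In a real setup the entries would come from load_dictionary("months.txt").
months = ["January", "Jan", "September", "Sept", "Sep", "December", "Dec"]
pattern = build_date_pattern(months)

text = "Received 01.09.2019, terminated as of 01. September 2019."
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

Because the month names live in a separate list (or file), new spellings can be added without touching the regular expression itself, which mirrors the benefit described above.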

The next piece of advice is no longer necessary and is only relevant for KTM version 5 or lower. Machine-written data can easily be extracted by freeform recognition. This was not possible with handwritten data, as full-page OCR engines were optimized for machine-written characters. The following post describes how to recognize handwritten data with freeform recognition tools. Kofax KTM 5.5 and higher offers a new full-text OCR engine that extracts both machine-written and handwritten data from a page.

Kofax Transformation Modules (KTM): ‘free-form recognition’ for handwritten numbers

The all-purpose tool for data extraction with KTM is the so-called format locator. The following two blog posts are an introduction to using this freeform recognition tool:

Kofax Transformation Modules – format locators and dynamic regular expressions

Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2

Latest trends

For years, Kofax Capture and Kofax Transformation Modules have been the basis of many capture projects, and Kofax is the market leader in this area. To be prepared for advanced requirements, Kofax offers a product called Kofax Total Agility (KTA). In simple terms: KTA contains the products Kofax Capture, Kofax Transformation Modules and Kofax Import Connector, embedded in a flexible workflow engine. Daniel Brodka explains the extensive capabilities of KTA in this post:

Introduction of and first steps in Kofax Total Agility

A growing part of our business is Robotic Process Automation (RPA). Kofax provides the product Kapow as a platform to process data from structured or unstructured databases, files, email systems, websites, portals and even legacy mainframe systems or terminal emulations. Kapow fits well with the other Kofax products. Kofax Kapow was recently renamed and is now called Kofax RPA. Stefan Blank has summarized the capabilities of Kofax RPA/Kapow by building an example robot:

Robotic Process Automation with Kofax Kapow™

The basis of everything

The successful capture platform offered by Kofax is Kofax Capture. With Kofax Capture you can build good solutions without even using KTM. Stefan Blank shows how to do this, and how to create your own extensions to the platform, in this post about extended customizing of Kofax Capture:

Kofax Capture – Customisation beyond standard features

Stefan Blank wrote another blog post about a project-specific extension to Kofax Capture. This post is about adjustments of the scanning module to meet specific project requirements:

Kofax Capture Advanced Scan Api: A first approach

Kofax Capture includes a module to validate data that has been recognized or to enter further data manually. This module is called ‘Validation’. Within ‘Validation’, there is a scripting language which can be used to customize the behavior of the module to project requirements. For years, this language was ‘SBL’ Basic, which is compatible with the ‘old’ Visual Basic of the 90s. But for some years it has also been possible to use .NET (VB, C#) as the development environment. The next post explains what you need to consider when switching from SBL to .NET:

Kofax Capture Validation Scripting – from SBL to VB.NET for Dummies

Barcodes are a popular mechanism for document separation. They may be attached as labels to the first page of a document, or printed on a separate sheet inserted before the first page. Document separation works well in general. But sometimes other barcodes that happen to be printed on the documents are mistaken for separation barcodes and split the documents in the wrong places, which destroys the document structure. But there is a remedy for this:

Kofax Capture – Document Separation and Barcodes
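The separation problem can be pictured with a small Python sketch. It is not Kofax Capture code, and the remedy shown here (accepting only barcodes that match an expected pattern) is an assumption for illustration, not necessarily the solution from the linked post. The separator format `SEP-` plus six digits and the function name `split_into_documents` are hypothetical.

```python
import re

# Assumption: separator barcodes follow a known pattern, e.g. "SEP-" plus six digits.
SEPARATOR_PATTERN = re.compile(r"SEP-\d{6}")


def split_into_documents(pages, separator_pattern=SEPARATOR_PATTERN):
    """Group pages into documents; a new document starts at a separator barcode.

    `pages` is a list of (page_id, barcodes) tuples, where `barcodes` is the
    list of barcode values recognized on that page.
    """
    documents = []
    for page_id, barcodes in pages:
        is_separator = any(separator_pattern.fullmatch(code) for code in barcodes)
        # The very first page always opens a document, even without a barcode.
        if is_separator or not documents:
            documents.append([])
        documents[-1].append(page_id)
    return documents


batch = [
    ("p1", ["SEP-000001"]),
    ("p2", []),
    ("p3", ["4006381333931"]),  # a foreign barcode printed on the document itself
    ("p4", ["SEP-000002"]),
    ("p5", []),
]

documents = split_into_documents(batch)
# Treating every barcode as a separator reproduces the wrong split.
naive = split_into_documents(batch, re.compile(r".+"))

print(documents)
print(naive)
```

The sketch models barcode labels on the first page of each document; with dedicated separator sheets, the separator page itself would be dropped instead of being kept as the first page.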

Hopefully, this summary and grouping of the various blog posts has made the topics of data capture, document classification and data extraction clearer and more transparent. For any questions and remarks, please use the comment section below. We appreciate your feedback!

Blog author

Jürgen Voss
