Kofax Transformation Modules (KTM), AI and Machine Learning

5.6.2017 | 5 minutes reading time

The topics AI, machine learning and deep learning are on everyone’s lips, and the media regularly publishes articles on them. What many do not know is that Kofax Transformation Modules (KTM) also provides machine learning mechanisms. KTM is a system for the automatic classification of documents and the extraction of data fields (see also: Document classification with Kofax Transformation Modules).

KTM has always included machine learning tools, which can be used alone or together with the rule-based free-form recognition. These neural network-based methods are briefly described here.

A KTM project consists of the following phases:

Project preparation: Document types, data fields, clustering
Project implementation: Classification and extraction design
Production: Capturing, classification, extraction, manual validation

Prior to extraction, the classification of the document has to be done because different types of documents normally have different extraction fields. Once the classification has been successfully carried out, the document type-specific field extraction can be started.
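The order described above (classify first, then extract the fields of that document type) can be sketched in a few lines of Python. This is only an illustration, not KTM code: the type names, field names and the `classify`/`extract` callables are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical document types and their type-specific extraction fields.
FIELDS_BY_TYPE: Dict[str, List[str]] = {
    "invoice": ["invoice_number", "invoice_date", "total_amount"],
    "delivery_note": ["delivery_number", "delivery_date"],
}


def process(
    text: str,
    classify: Callable[[str], Optional[str]],
    extract: Callable[[str, str], str],
) -> dict:
    """Classify a document first, then extract only the fields of its type."""
    doc_type = classify(text)
    if doc_type is None or doc_type not in FIELDS_BY_TYPE:
        # Without a document type there is no field list to extract.
        return {"document_type": None, "fields": {}}
    fields = {name: extract(text, name) for name in FIELDS_BY_TYPE[doc_type]}
    return {"document_type": doc_type, "fields": fields}
```

A document that cannot be classified gets no fields at all, which is why classification quality matters so much for everything that follows.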

KTM provides tools from the field of machine learning for the project preparation as well as for the implementation of the project and the production phase in order to train the system and improve the quality of the results successively.

Through training, learning systems recognize the context and store it for future use. KTM does not memorize the absolute position of a field, but saves the environment in which the field is located. This can include words located nearby (and their distances to the field), the position relative to other fields, but also lines or similar objects. This newly learned context is immediately available when the next document is processed, and the field value can be extracted directly from a similar document – hopefully! “Hopefully” was inserted because such systems are not deterministic and some document types must be trained several times.
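To make the idea of "context instead of absolute position" more tangible, here is a minimal Python sketch. It is an assumption-laden toy, not KTM's actual algorithm: the `Word` class, the `radius` parameter and both functions are invented for illustration, and only nearby words (no lines or graphics) are used as context.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    x: float
    y: float


# One context entry: (anchor word text, dx to field, dy to field).
Context = List[Tuple[str, float, float]]


def learn_context(words: List[Word], field_word: Word, radius: float = 200.0) -> Context:
    """Store nearby words and their offsets to the field, not the field position."""
    context: Context = []
    for w in words:
        if w is field_word:
            continue
        dx, dy = field_word.x - w.x, field_word.y - w.y
        if math.hypot(dx, dy) <= radius:
            context.append((w.text.lower(), dx, dy))
    return context


def locate_field(words: List[Word], context: Context) -> Optional[Word]:
    """Predict the field position from known anchor words on a new document."""
    predictions = [
        (w.x + dx, w.y + dy)
        for anchor, dx, dy in context
        for w in words
        if w.text.lower() == anchor
    ]
    if not predictions:
        return None  # no learned anchor found on this document
    px = sum(p[0] for p in predictions) / len(predictions)
    py = sum(p[1] for p in predictions) / len(predictions)
    anchors = {anchor for anchor, _, _ in context}
    candidates = [w for w in words if w.text.lower() not in anchors]
    if not candidates:
        return None
    # The word closest to the predicted position is taken as the field value.
    return min(candidates, key=lambda w: math.hypot(w.x - px, w.y - py))
```

Because only relative offsets are stored, the value is still found when the whole block has moved to a different place on the page, as long as the anchor words are present.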

The KTM toolbox for machine learning consists of the following elements:

Clustering Tool: Provides basic information about the document types. Which are the main (most important) types?
Administrative training with examples of the main document types: Document type classification
Administrative training with examples of the main document types: Extraction of the field data
Production cycle: System learns by manual assignment of the document type
Production cycle: System learns by manual field correction/data entry

The Clustering Tool

At the beginning of a recognition project, it should first be clarified which documents promise the most “profit”. Which document types are worthwhile for training – and which should not be considered initially?

KTM includes a tool (clustering tool) that analyzes unsorted batches of documents and divides them into batches with similar characteristics. This sorting can be done according to graphical criteria as well as according to content. After using this tool, you usually have a very good impression of which document types are the main ones in a project and should be trained first.

This example shows that one should first concentrate on processing the generated batches 1, 5 and 4. Batch 4 contains 36 “CAR Parts Co-Delivery Note” documents.
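How KTM clusters internally is not described here. As a rough illustration of content-based sorting, the following Python sketch groups texts greedily by the overlap of their word sets (Jaccard similarity). The function names and the `threshold` value are assumptions chosen for the example.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lower-cased set of alphanumeric words in a document text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of common words relative to all words of both documents."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(texts: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose first document is similar."""
    clusters: List[Tuple[Set[str], List[int]]] = []
    for index, text in enumerate(texts):
        words = tokens(text)
        for representative, members in clusters:
            if jaccard(representative, words) >= threshold:
                members.append(index)
                break
        else:
            # No similar cluster found: this document starts a new one.
            clusters.append((words, [index]))
    # Largest clusters first, since they promise the most "profit".
    return sorted((members for _, members in clusters), key=len, reverse=True)
```

The largest resulting groups are the candidates for the main document types; small groups can be left for later.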

Administrative training of document types

For this, you use the main document types that have been determined by the clustering tool. Within the KTM development environment, the document types are created manually, or they can be created automatically from the batches. For each document type, an administrator assigns a number of sample documents to the system for learning. This number is project-dependent, but in real projects about 20 documents per document type has proven effective. The training of the document types can take place via the layout and/or the text content of the documents.

The success of this training can be immediately checked using the non-trained examples of the document type batches.
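Checking against non-trained examples amounts to measuring accuracy on held-out documents. A minimal Python sketch, assuming a hypothetical `classify` callable and labelled `(text, expected_type)` pairs, could look like this:

```python
from typing import Callable, List, Tuple


def accuracy(classify: Callable[[str], str], held_out: List[Tuple[str, str]]) -> float:
    """Share of held-out (text, expected_type) pairs that are classified correctly."""
    if not held_out:
        return 0.0
    correct = sum(1 for text, expected in held_out if classify(text) == expected)
    return correct / len(held_out)
```

It is important that the documents used for this check were not part of the training set, otherwise the result looks better than it is.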

In real-life projects, I only trust the classification result obtained by learning when a certain confidence level has been reached (e.g. 80%). At lower values, additional document-type-specific rules are used to determine the document type.
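The combination of a confidence threshold and fallback rules can be sketched as follows. This is plain Python for illustration, not KTM script: the function name, the keyword-based rule format and the returned labels are assumptions.

```python
from typing import Dict, List, Optional, Tuple


def decide_document_type(
    learned_type: str,
    confidence: float,
    text: str,
    rules: Dict[str, List[str]],
    threshold: float = 0.80,
) -> Tuple[Optional[str], str]:
    """Trust the learned result above the threshold, otherwise try keyword rules."""
    if confidence >= threshold:
        return learned_type, "learned"
    lowered = text.lower()
    for doc_type, keywords in rules.items():
        # A rule matches when all of its keywords occur in the text.
        if all(keyword.lower() in lowered for keyword in keywords):
            return doc_type, "rule"
    # Neither learning nor rules were sure: leave it to manual validation.
    return None, "manual"
```

For example, a learned result with 55% confidence is ignored, and a rule such as `{"delivery_note": ["delivery note"]}` decides instead if its keywords occur in the text.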

Administrative training of field extraction

After the training of the document types, the extraction of the document type-specific data fields can be done in the next step. Similar to the training of the document types, a certain number of documents is taken per document type. Training is done by just showing the system the position on the document where the data for a field should be extracted. This is simply done by using mouse clicks. KTM does not memorize the absolute positions, but stores features (graphics, words, lines etc.) near the extraction position.

Again, the success of this training can be immediately checked using the non-trained examples of the document type batches.

Online Learning during production

After a pre-trained system has been put into production, KTM offers the possibility to further improve the classification and extraction during daily processing. This includes the optimization of main document types already trained in the preparation, but also the basic training and optimization of the previously neglected other document types.

The KTM validation module presents for validation all documents where the classification was uncertain or data fields were uncertain or empty after extraction. A user can manually correct the classification and/or the data fields, and the document may be marked for online learning if desired.

After that, the original document continues to further processing and a copy is sent to the KTM learning mechanism. Depending on the configuration of the KTM system, the system either learns the changes directly, making them available for the next processed batch, or an administrator must first check and release the new learning document.

The following diagram shows the flow of KTM processing and the integration of online learning:

However, direct online learning – without the control of an administrator – entails the risk that the system will learn incorrectly, since the person at the validation workplace directly releases a document for learning. Neural networks cannot be debugged like programs in classical development – there must be other ways to find the error (the wrongly trained document) and make corrections.

KTM provides a view of all trained documents per document type as well as the possibility to remove or reconfigure documents from a learning set. Nevertheless, one should not underestimate the effort for such a correction. Therefore, new learning documents should be released by an administrator or a specialist, despite the delay in getting the new training set into production.
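The two configurations (direct learning versus release by an administrator) and the later removal of a wrongly trained document can be modelled in a small Python sketch. The class and method names are hypothetical and only mirror the workflow described in this section, not a KTM interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LearningSet:
    """Learning documents of one document type, with optional administrator review."""

    require_review: bool = True
    trained: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)

    def submit(self, document: str) -> None:
        """Called when a validation user marks a document for online learning."""
        if self.require_review:
            self.pending.append(document)  # waits for an administrator
        else:
            self.trained.append(document)  # learned directly

    def release(self, document: str) -> None:
        """Administrator approves a pending document for learning."""
        self.pending.remove(document)
        self.trained.append(document)

    def reject(self, document: str) -> None:
        """Administrator discards a pending document."""
        self.pending.remove(document)

    def remove(self, document: str) -> None:
        """Take a wrongly trained document out of the learning set again."""
        self.trained.remove(document)
```

With `require_review=True`, a wrong correction at the validation workplace stays in `pending` and never influences production until someone has checked it.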

Blog author

Jürgen Voss
