Document classification with Kofax Transformation Modules (KTM)

22.3.2013 | 6 minutes reading time

Many of our customers use systems for automatic document classification and data extraction. ‘Kofax Transformation Modules’ (KTM) is one of these systems. Such data capture systems extract metadata from electronic images (the scanned pages of documents, faxes or emails) and release the data and the document to business applications.

In this article I will explain the different ways of document classification within KTM.

So far, two other articles about KTM have been published in the codecentric blog:

Kofax Transformation Modules – format locators and dynamic regular expressions – Part 1
Kofax Transformation Modules – format locators and dynamic regular expressions – Part 2

Before data can be extracted from a document, KTM needs to know the type of the document. Invoices have to be treated differently from, for example, insurance contracts. You want to extract the invoice number, invoice date and amounts from an invoice, but the insurance number and the insurance class from a contract.

So the document type has to be determined before data extraction can take place. In KTM this is done by the classification mechanism, which runs before extraction. As soon as a document has been classified, the metadata can be extracted.

Document classification in KTM can be done with various methods, which differ in complexity and in the effort required for document preparation:

1. Classification by layout

This classification method tries to determine the document type from the graphical structure of the document. It is the fastest way of classification, as no Optical Character Recognition (OCR) is needed. It can only be used where the documents can be clearly distinguished visually. Examples are application types that can be told apart by their design (structure, company logo, …). Forms in financial or insurance services are a poor fit, as these forms look very similar.

You have to train KTM on the document types of a customer before layout classification can be used, but KTM keeps the manual effort very low. You collect some samples for each document type and show KTM which samples are representative of which document type. KTM then learns the characteristic layout structures of each document type. This training can be done easily in the graphical user interface of KTM Project Builder.

2. Classification by content

The approach of (automatic) classification by content is similar to layout classification. The difference is that the content of the documents is used for classification instead of the layout. For this, an OCR read of the documents must be done first.

And what’s really nice: you don’t have to care about the meaning of the content. Just as with layout classification, you only have to prepare batches of samples for the document types. After the OCR read has occurred, you show KTM which samples are representative of each document type. KTM then learns automatically which words, phrases or word combinations are characteristic of a document type. This training takes place in the KTM Project Builder, just as with layout classification.

3. Classification by instructions

When using layout or content classification you ‘only’ have to provide sufficient samples to KTM. The work of learning and evaluation will be done by KTM itself. If classification by instructions is used, the developer has to know the content of the documents and must be able to evaluate it in the business context. For each document type you manually define the words, phrases and word combinations that are characteristic of that type. So you will need some subject-specific knowledge about the documents.

Classification by instructions is often used with general correspondence documents. For example: if ‘dunning’ and ‘dunning charge’ are both found on a document, this document must be classified as ‘dunning procedure’.

Classification by instructions requires that an OCR read has been done beforehand. The instructions (words, phrases, word combinations) are once again entered via the KTM Project Builder.

4. Classification by script

4.1 Barcode

Sometimes a barcode on the document is sufficient to classify it. This is also possible with KTM, but you have to use the internal script language of KTM, which is comparable to Visual Basic.

You start with a barcode locator (BCode), which recognizes the barcode value and which has to be defined at project level. A short script at project level then classifies the document into the desired document type:

' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
  If pXDoc.Locators.ItemByName("BCode").Alternatives.Count > 0 Then
     If pXDoc.Locators.ItemByName("BCode").Alternatives(0).Confidence > 0.95 Then
       pXDoc.Reclassify "Barcodeantrag"
       Exit Sub 'only one reclassify
     End If
  End If
End Sub

The script will be called in the event Document_BeforeClassifyXDoc, which is executed before all other classification mechanisms of KTM.

The script first checks whether the barcode locator has found anything at all and whether the confidence is greater than 95%. If so, the Reclassify method is used to classify the document as document type ‘Barcodeantrag’ (barcode claim). After ‘Reclassify’ you have to exit the sub, so that no further ‘Reclassify’ can happen in the following code. Multiple ‘Reclassify’ calls are possible with KTM, but you should be careful with them, as you might create infinite loops.

4.2 Format Locators, Advanced Zone Locators and Everything…

The ‘classification by script’ principle shown in 4.1 with barcode locators can be used with any other locator too. The important point is that the locator must identify a document type unambiguously. Furthermore, the locator has to be defined at project level, as otherwise the script in Document_BeforeClassifyXDoc will not be executed. The primary goal of these locators is not data extraction. They are just resources for classification.

A format locator defined at project level may identify the type of an insurance application and classify the document. The following image snippet shows part of an insurance application for general liability insurance (‘Haftpflichtversicherung’).

A format locator (‘Antrag_Haft’) can search for the word ‘Haftpflichtversicherung’ (general liability insurance) above the word ‘Antrag’ (application) in a region in the upper left corner of the document. If the format locator scores, the document can be classified as document type ‘Antrag_Haftpflicht’.

The scripting looks like this (equivalent to the barcode example):

' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
  If pXDoc.Locators.ItemByName("Antrag_Haft").Alternatives.Count > 0 Then
     If pXDoc.Locators.ItemByName("Antrag_Haft").Alternatives(0).Confidence > 0.95 Then
       pXDoc.Reclassify "Antrag_Haftpflicht"
       Exit Sub 'only one reclassify
     End If
  End If
End Sub

If you want to use an ‘Advanced Zone Locator’ (Antrag_Haft_EZL) for the classification of the above insurance application, you have to adjust the script to the subfields of the zone locator:

' Class script: Project
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
  If pXDoc.Locators.ItemByName("Antrag_Haft_EZL").Alternatives.Count > 0 Then
     If pXDoc.Locators.ItemByName("Antrag_Haft_EZL").Alternatives(0).SubFields.ItemByName("UF_Zone0").Confidence > 0.95 Then
       pXDoc.Reclassify "Antrag_Haftpflicht"
       Exit Sub 'only one reclassify
     End If
  End If
End Sub
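
The scripts above repeat the same two-step check (is there an alternative at all, and is its confidence high enough?). As a rough sketch, this check could be moved into a small helper so that several locators can be tested one after another. The function name LocatorIsConfident and its parameters are my own illustration and not part of KTM; the sketch only uses the object model calls already shown above. The subfield check for an Advanced Zone Locator would still have to be written separately.

```vb
' Class script: Project
' Sketch only: LocatorIsConfident is a hypothetical helper, not a KTM function.
Private Function LocatorIsConfident(pXDoc As CASCADELib.CscXDocument, LocatorName As String, MinConfidence As Double) As Boolean
   LocatorIsConfident = False
   With pXDoc.Locators.ItemByName(LocatorName)
      ' The locator must have found something before the confidence can be read
      If .Alternatives.Count > 0 Then
         If .Alternatives(0).Confidence > MinConfidence Then LocatorIsConfident = True
      End If
   End With
End Function

Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   ' The first confident locator decides the document type
   If LocatorIsConfident(pXDoc, "BCode", 0.95) Then
      pXDoc.Reclassify "Barcodeantrag"
      Exit Sub 'only one reclassify
   End If
   If LocatorIsConfident(pXDoc, "Antrag_Haft", 0.95) Then
      pXDoc.Reclassify "Antrag_Haftpflicht"
      Exit Sub 'only one reclassify
   End If
End Sub
```

The order of the checks defines the priority of the locators, so the most reliable one (here the barcode) should come first.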

As you can see, scripting opens up a lot of possibilities for classification. For example, you could use a database locator to identify the sender of a document (if an appropriate master data file exists) and classify the document according to the sender.

Forms often have a unique form number printed in the lower left corner, but rotated by 90°. An ‘Advanced Zone Locator’ can read the rotated number, and the document can then be classified by script using this unique number.
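
Such a form-number classification could look roughly like the following sketch. The locator name ‘FormNo_EZL’, the form numbers and their mapping to document types are hypothetical placeholders, and the sketch assumes that the subfield exposes its recognized value through a Text property.

```vb
' Class script: Project
' Sketch only: "FormNo_EZL", "F-1001" and "F-1002" are hypothetical placeholders.
Private Sub Document_BeforeClassifyXDoc(pXDoc As CASCADELib.CscXDocument, bSkip As Boolean)
   ' Nothing to do if the zone locator found no alternative
   If pXDoc.Locators.ItemByName("FormNo_EZL").Alternatives.Count = 0 Then Exit Sub

   With pXDoc.Locators.ItemByName("FormNo_EZL").Alternatives(0).SubFields.ItemByName("UF_Zone0")
      If .Confidence > 0.95 Then
         ' Select Case runs at most one branch, so there is only one reclassify
         Select Case Trim(.Text)
            Case "F-1001"
               pXDoc.Reclassify "Antrag_Haftpflicht"
            Case "F-1002"
               pXDoc.Reclassify "Barcodeantrag"
         End Select
      End If
   End With
End Sub
```

If the form number is unknown, the script does nothing and the document falls through to the other classification mechanisms.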

Maybe this article gave you some motivation to experiment with KTM’s classification methods and scripting. Have fun 🙂

One more hint for the developers among our readers: my colleague Frank Engelen (he works in our Agile Software Factory) has just published an interesting article about data and document classification with the tool ‘RapidMiner’. With a little Java knowledge you can develop your own classification mechanism!

You can find his article here: Taking a look at Java-based Machine Learning by Classification

New: KTM and insurance companies: Document Process Automation

Blog author

Jürgen Voss
