
From PDF data sheets to shared understanding with serverless SHACL

1.4.2020 | 12 minutes of reading time

Knowledge contained in PDF files

When crawling the web for information about products of a specific category, be it industrial machine parts, chemical components, or even household goods, manufacturers often provide the desired information as PDF data sheets. These documents are designed for printers and human readers rather than for processing by software agents. Because the structure and semantics of the information contained in PDF data sheets vary widely, comparing information about hundreds or thousands of products regularly becomes a challenge.

Inspired by the data-information-knowledge-wisdom hierarchy (DIKW pyramid) from information science, we propose a three-step approach to gain a shared understanding of the concepts that are referenced by PDF data sheets, as shown in Figure 1.

Figure 1: Proposed three-step approach inspired by the DIKW-hierarchy.

The first step is to (1) parse data of PDF files in order to extract the information contained in these files. Such information could be unstructured text as well as key/value pairs describing the properties of a subject of interest.

For the second step, the extracted key/value pairs have to be (2) mapped to explicit semantics, which represent formalized knowledge. The gained knowledge can concern the subject of interest or related subjects as well as the document itself. The mapping process therefore has to consider not only the key and value of an extracted property, but also the corresponding subject.

Finally, the third step employs mechanisms to (3) validate and enrich formalized knowledge about subjects in relation to e.g. shared concepts of linked data. The resulting shared knowledge graph builds the foundation for the wisdom that empowers smart agents, chatbots or other AI tools.

In the following, we explain the implementation of the proposed three-step approach. An overview of the implementation architecture is shown in Figure 2.

Figure 2: Architecture of the implementation.

Starting from the PDF file on the left, we (1) parse PDF data sheets, (2) map the key/value pairs to explicit semantics, and (3) judge the gained knowledge using SHACL (Shapes Constraint Language).

Step 1: Parsing PDF data sheets

The problem of retrieving textual information from PDF files is probably as old as the PDF itself, and there are numerous libraries that address it. For Python, for example, PDFMiner or PyPDF easily extract all textual information from PDF files without any preparation. However, when it comes to data sheets, a lot of important information about the subject of interest is contained within tables that cannot be processed properly using such libraries. Those cases are addressed by dedicated libraries such as tabula-py. The latter is a Python wrapper around tabula-java, a Java library for extracting tables from PDF files.

Although this approach works well for extracting information from tables contained in PDF files, it requires some manual preparation of PDF files that have a complex structure – for example where tables cannot be detected automatically. This is the case for most data sheets. For the preparation, Tabula offers a web view that allows users to define areas of interest within PDF files. To demonstrate the proposed three-step approach, we have gathered publicly available PDF data sheets for vacuum cleaners as provided by their respective manufacturers. One of these data sheets and the areas of interest as identified using Tabula are shown in Figure 3.

Figure 3: Defining templates for Tabula.

These areas can be exported as templates for tabula-py and applied to all PDF files with a similar structure, which is typically the case for data sheets provided by the same manufacturer:

[
    {
        "page": 2,
        "extraction_method": "guess",
        "x1": 53.94688758850098,
        "x2": 237.73835289001465,
        "y1": 230.29740287780763,
        "y2": 385.8132581329346,
        "width": 183.79146530151368,
        "height": 155.51585525512695
    },
    {
        "page": 2,
        "extraction_method": "guess",
        "x1": 312.8919480133057,
        "x2": 511.5653133392334,
        "y1": 231.04149787902833,
        "y2": 369.4431681060791,
        "width": 198.67336532592773,
        "height": 138.4016702270508
    }
]
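
Such a template can then be applied programmatically. A minimal sketch using tabula-py is shown below; the file names and the assumption that each defined area yields a two-column key/value table are illustrations, not part of the original pipeline:

import tabula

# Apply the template exported from the Tabula web view to a data sheet.
# File names are placeholders for illustration.
tables = tabula.read_pdf_with_template(
    "datasheet_vacuum_cleaner.pdf",
    "datasheet_template.json",
)

# Assuming each defined area yields a two-column table (key, value),
# merge the extracted rows into one dictionary per subject.
properties = {}
for table in tables:
    for row in table.itertuples(index=False):
        if len(row) >= 2 and row[0] is not None:
            properties[str(row[0]).strip()] = str(row[1]).strip()

print(properties)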

The result of the tabula-py export is a list of key/value pairs for each subject:

{
    "Staubbehälter Volumen": "0,6 l",
    "Filtersystem": "EPA Filter Klasse 11 (Permanent)",
    "Akkuleistung": "2.330 mAh",
    "Laufzeit": "100 min",
    "Ladezeit": "3 Stunden",
    "Flächenleistung": "150 m2",
    "Befahrbare Teppichhöhe": "15 mm",
    "Geräuschpegel": "60 dB",
    "Maße (H x Ø)": "89 x 340 mm",
    "Maße incl. Verpackung (H x B x T)": "163 x 443 x 545 mm",
    "Gewicht netto / brutto": "3 kg / 5,6 kg",
    "Besondere Ausstattung": "Intelligente Raumerkennung mit Kamera, […]",
    "Reinigungsmodi": "Zick Zack, My Space, Smart Turbo, Turbo, […]",
    "Bedienung": "Bedientasten auf der Oberfläche des Gerätes, […]",
    "Zubehör (im Lieferumfang enthalten)": "Fernbedienung, Ladestation, […]",
    "Korpus": "Ocean Black"
}

In this example, we see keys such as “Akkuleistung” or “Gewicht netto / brutto” which are not explicitly linked to concrete concepts. From the perspective of a machine, these keys are just strings. The values are interpreted as plain strings as well, for example “2.330 mAh” or “3 kg / 5,6 kg”. These strings specify neither a data type nor a language. Therefore, the numbers are not automatically interpreted as numeric values, and the unit symbols are just arbitrary characters.

Step 2: Mapping to explicit semantics

What we have after extracting table data out of PDFs is information in the form of a list of key/value pairs for each subject. But how can we recognize the knowledge that is contained within this information? For this task, we have to map ambiguous strings to explicit concepts which can be modeled using RDF as defined in the W3C recommendation.

In the case of physical quantities, we suggest employing the semantically rich and shared concepts of the QUDT project. Originally developed for the NASA Exploration Initiatives Ontology Models project, QUDT is nowadays maintained by the not-for-profit organization QUDT.org and provides detailed semantics for hundreds of quantities and thousands of associated units, including labels, descriptions, relationships with each other, and conversion information as shown in Figure 4.

Figure 4: Knowledge about the kilogram as provided by QUDT.

The importance of such explicit modelling of quantities and units is demonstrated not least by the Mars Climate Orbiter mission, which had a budget of more than 300 million USD. The mission failed in 1999 with the loss of the spacecraft because one software component expected quantities in a different unit than another component actually provided.

In order to provide explicit semantics for each subject, we have to state both the assertions about an instance (in our case the properties of a concrete vacuum cleaner) and the terminology that is used to describe that instance.

Instance (assertions)

@prefix cc: <http://example.org/cc/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

cc:HOMBOT%20L77BK a cc:VacuumCleaner ;
    cc:type "Vacuum Cleaner" ;
    rdfs:label "HOMBOT L77BK" ;
    cc:mass "3 kg" ;
.

Schema (terminology)

@prefix cc: <http://example.org/cc/> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

cc:mass
  rdf:type owl:DatatypeProperty ;
  cc:rangeQuantity qudt:MassUnit ;
  rdfs:label "weight as string with any unit, e.g. \"50 kg\""@en ;
  rdfs:range xsd:string ;
.

In this example, we have the assertions that the subject is an instance of the class “cc:VacuumCleaner” and has a string value for the property “cc:mass”. Both terms are not only strings, but representations that symbolize concrete concepts. These concepts are contained in the schema that explicitly describes the intended terminology.

A remarkable part here is that the schema describes the property “cc:mass” as a reference to quantities of type “qudt:MassUnit”. This reference to the QUDT model is the precondition for exploiting schema knowledge as required in step 3 of the proposed approach. As shown here for the property “cc:mass”, the mapping has to be repeated in the same way for each property of the subject. We achieve this by comparing the strings of concept labels derived from the schema with the strings of keys and values as retrieved in step 1. Although this comparison works well in most cases, we have to bear in mind that it can also produce mismatches that later lead to incorrect results. It is therefore worth checking the results of the mapping process manually in case of doubt.
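
A minimal sketch of such a label-based mapping with rdflib is shown below; the file name, the exact-match strategy, and the helper function map_to_rdf are simplifications for illustration, not the complete mapping logic:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CC = Namespace("http://example.org/cc/")

# Load the terminology (schema) that carries an rdfs:label for each property.
schema = Graph().parse("cc-schema.ttl", format="turtle")

# Build a lookup table from label string to property IRI.
label_to_property = {
    str(label).strip().lower(): prop
    for prop, label in schema.subject_objects(RDFS.label)
}

def map_to_rdf(subject_id, pairs):
    # Map extracted key/value pairs to RDF by comparing the extracted keys
    # with the schema labels; a naive exact match, real data needs fuzzier rules.
    graph = Graph()
    subject = CC[subject_id]
    graph.add((subject, RDF.type, CC.VacuumCleaner))
    for key, value in pairs.items():
        prop = label_to_property.get(key.strip().lower())
        if prop is not None:
            graph.add((subject, prop, Literal(value)))
    return graph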

Step 3: Judging gained knowledge using SHACL

For the third step, we require tools to judge the previously gained knowledge with respect to completeness and consistency. In the case of RDF graphs, it makes sense to employ the Shapes Constraint Language (SHACL) for this purpose. SHACL was first published in 2015 and has been a W3C recommendation since 2017. It makes use of so-called node shapes that are applied to instance data in order to produce reports on whether a subject is validly described within a certain domain. In the example of vacuum cleaners, such a node shape definition could include property shapes that demand exactly one value for the property “rdfs:label” and exactly one value for the property “cc:mass”. Each instance of the class “cc:VacuumCleaner” has to fulfill these constraints in order to be considered a valid instance of this class.

cc:VacuumCleaner
  rdf:type rdfs:Class ;
  rdf:type sh:NodeShape ;
  rdfs:label "Staubsauger"@de ;
  rdfs:label "vacuum cleaner"@en ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path rdfs:label ;
      sh:datatype xsd:string ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:name "label" ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path cc:mass ;
      sh:datatype xsd:string ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:name "mass" ;
    ] ;
.

This shape can now be applied to all instances of “cc:VacuumCleaner” to test whether a description is a valid vacuum cleaner description or not. Various tools have been developed that support such validations using SHACL, e.g. TopBraid Composer by TopQuadrant. An exemplary validation report as created by TopBraid Composer is shown in Figure 5.

Figure 5: SHACL report created by TopBraid Composer.

Similar to the SHACL validation functionality of TopBraid Composer, the same SHACL shapes file can be applied to RDF graphs using programming libraries that implement SHACL functionality, such as the TopBraid SHACL API for Java or the pySHACL validator for Python. If the subject description passed to such a library complies with the SHACL shape definition that is associated with the class of the subject, the API returns a report in RDF which states that the constraints of the shape are fulfilled.
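
A minimal validation call with pySHACL could look like the following sketch; the file names are placeholders for the instance and shape graphs shown above:

from rdflib import Graph
from pyshacl import validate

# Load the subject description and the SHACL shapes (file names are placeholders).
data_graph = Graph().parse("hombot_l77bk.ttl", format="turtle")
shapes_graph = Graph().parse("cc-shapes.ttl", format="turtle")

# validate() returns a boolean, the report as an RDF graph, and a text summary.
conforms, report_graph, report_text = validate(data_graph, shacl_graph=shapes_graph)

print(conforms)
print(report_text)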

[] a sh:ValidationReport ;
    sh:conforms true .

In case of a constraint violation, the generated report contains the exact parts of the shape definition which are not fulfilled. This detailed report allows us to automatically identify any constraint violation of the subject description.

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [ a sh:ValidationResult ;
            sh:focusNode cc:113723_origin_3 ;
            sh:resultPath cc:mass ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
            sh:sourceShape [ a sh:PropertyShape ;
                    sh:datatype xsd:string ;
                    sh:maxCount 1 ;
                    sh:minCount 1 ;
                    sh:name "mass" ;
                    sh:path cc:mass ] ] .

In addition to validating RDF graphs, SHACL also provides advanced features to apply rules to RDF models. These rules can be defined as standard SPARQL queries, SPARQL being a W3C recommendation itself. Again, TopBraid Composer is capable of executing such rules and enriching subject descriptions based on them. Examples of such rules are shown in Figure 6 and Figure 7.

Figure 6: SHACL rule to transform a string quantity to a float quantity with explicit unit.
Figure 7: SHACL rule to transform a QUDT node to a string quantity with unit symbol.

These exemplary rules can be employed to transform statements about quantities to statements that represent the same meaning using different units or a different representation as needed for varying use cases. The result of applying those rules to the example of a vacuum cleaner description is shown in Figure 8.

Figure 8: Equivalent string units inferred by the QUDT model.

In this example, the string “25 kg” is transformed to derived strings with an equivalent meaning such as “0.025 mT” or “55.12 lbm” just by exploiting the semantics provided by QUDT.

Thanks to this model-driven approach, it is possible to reuse the same programming code for a wide range of use cases in varying domains, such as industrial machine parts, chemical components, or even household goods. All that has to be maintained is the model itself, which could also be done by people without programming skills thanks to the availability of suitable RDF and SHACL modelling software.

Serverless implementation

The proposed approach is not limited to processing pipelines that have to be installed and prepared on dedicated hardware for each use case. In fact, the generic functionality of SHACL allows for a reusable and even serverless implementation. For example, the SHACL shapes can be maintained and shared using a hosted version control system such as GitHub which allows you to keep track of changes within the schema definition as shown in Figure 9.

Figure 9: Keep track of schema changes with GitHub.

In order to validate a subject description, the RDF graph of that subject can be serialized as Turtle and sent to a serverless cloud function, such as AWS Lambda or Azure Functions, that evaluates the RDF graph on demand using the SHACL shape maintained on the hosted version control system. The functionality of such an implementation can easily be tested using any web API client such as Postman, as shown in Figure 10.

Figure 10: Using Postman to send Turtle as POST request to an AWS Lambda function and receive a SHACL report as boolean, text, and JSON-LD.

In this example, the RDF graph describing the subject of interest is serialized as Turtle and sent as a POST request to an AWS Lambda function which is associated with the schema definition in GitHub. The response contains a Boolean value that states whether the described subject conforms to the SHACL constraints defined in the schema for instances of the associated class. In case of a constraint violation, the response also contains a textual description of the exact constraint violation and an RDF graph (in this example serialized as JSON-LD) that provides a machine-readable SHACL report. In the given example, the subject is described as an instance of the class “cc:VacuumCleaner”. According to the SHACL schema definition for this class, each instance has to state a value for the property “cc:mass”, which is not the case for the submitted subject description. Therefore, the validation function returns a report that states this constraint violation.
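
What such a Lambda function might look like is sketched below. This is an assumption-laden illustration: the shapes URL is a placeholder, API Gateway proxy integration is assumed, and pySHACL stands in as the validator rather than the exact implementation used here:

import json
import urllib.request

from rdflib import Graph
from pyshacl import validate

# Placeholder: raw URL of the SHACL shapes maintained on GitHub.
SHAPES_URL = "https://raw.githubusercontent.com/example/cc-schema/main/shapes.ttl"

def handler(event, context):
    # The subject description arrives as Turtle in the POST body
    # (API Gateway proxy integration assumed).
    data_graph = Graph().parse(data=event["body"], format="turtle")

    # Fetch the current shapes from the hosted version control system.
    with urllib.request.urlopen(SHAPES_URL) as response:
        shapes_graph = Graph().parse(data=response.read().decode("utf-8"),
                                     format="turtle")

    conforms, report_graph, report_text = validate(data_graph,
                                                   shacl_graph=shapes_graph)

    # Return the SHACL report as boolean, text, and JSON-LD
    # (rdflib 6+ serializes to a string and includes a JSON-LD serializer).
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "conforms": conforms,
            "text": report_text,
            "report": json.loads(report_graph.serialize(format="json-ld")),
        }),
    }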

Conclusion

In this article we have implemented a generic approach to prepare formalized knowledge for a shared understanding by extracting the information contained in arbitrary PDF data sheets. By employing mostly W3C recommendations such as RDF, SPARQL, and SHACL, we ensure that the approach is future-proof and in accordance with the efforts of a global community. We also employ freely available schema knowledge as provided by QUDT as well as standardized tools to reduce the risk of untested software. In addition, we have introduced a reusable and serverless implementation for a SHACL validation pipeline using GitHub and AWS Lambda to reduce setup and maintenance costs in varying environments.
