Talks

Meet us at a conference!

Get an overview of the events where our codecentric colleagues are speaking. We would be delighted to welcome you in person at a conference.

Scalable OCR pipelines using Python, Tensorflow and Tesseract

12.06.2018

Berlin Buzzwords

KulturBrauerei, Schönhauser Allee, Berlin, Germany

In this talk we take a tour of text recognition with free software and walk step by step through the components of a flexible, scalable OCR application. A live demo shows how Tesseract is used for text recognition and how a little pre-processing with OpenCV can significantly improve the quality of the results. The documents are then stored and indexed in Elasticsearch to enable full-text search. All of this takes just a few lines of code, written interactively in Jupyter.
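The steps named in the abstract (pre-process, recognize, index) could be sketched roughly as follows. This is a minimal illustration, not the speakers' actual code. It assumes the `opencv-python`, `pytesseract` and `elasticsearch` (8.x) packages, a local Tesseract installation, and an Elasticsearch node at `http://localhost:9200`. The function names, the median-blur and Otsu pre-processing steps, the index name `scanned-docs` and the document fields are all hypothetical choices made for this example.

```python
from pathlib import Path


def preprocess(image_path):
    """Load a scan as grayscale, denoise it, and binarize it with Otsu's method."""
    import cv2  # opencv-python; imported lazily so the module loads without it

    gray = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    if gray is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(f"could not read image: {image_path}")
    denoised = cv2.medianBlur(gray, 3)
    # With THRESH_OTSU the threshold value 0 is ignored and chosen automatically.
    _, binary = cv2.threshold(
        denoised, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU
    )
    return binary


def recognize(image, lang="eng"):
    """Run Tesseract on an image array and return the recognized text."""
    import pytesseract  # thin wrapper around the tesseract command-line tool

    return pytesseract.image_to_string(image, lang=lang)


def build_document(image_path, text):
    """Build the JSON-serializable record to index (hypothetical field names)."""
    return {"filename": Path(image_path).name, "text": text.strip()}


def index_document(doc, index="scanned-docs", url="http://localhost:9200"):
    """Store one document in Elasticsearch so it becomes full-text searchable."""
    from elasticsearch import Elasticsearch

    es = Elasticsearch(url)
    # The 8.x client takes `document=`; older 7.x clients expect `body=` instead.
    return es.index(index=index, document=doc)


def process_scan(image_path):
    """Chain the three steps for a single scanned page."""
    text = recognize(preprocess(image_path))
    return index_document(build_document(image_path, text))
```

Keeping each stage in its own function makes it easy to swap the pre-processing step and compare recognition quality, or to distribute `process_scan` over many files when scaling out.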

Agenda

  • Quirks and pitfalls in text recognition of scanned documents
  • Potential of pre-processing with OpenCV
  • Use Tesseract at scale
  • Quantify, compare and re-evaluate results
  • Use of TensorFlow in a production-ready application
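
One common way to quantify and compare OCR results is the character error rate (CER): the edit distance between the recognized text and a hand-checked reference, divided by the length of the reference. The listing does not say which metric the talk uses, so the following is only an illustrative sketch in plain Python, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Computing this value for the same page with and without a pre-processing step gives a simple number to compare. Note that the CER can exceed 1.0 when the OCR output is much longer than the reference.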