Optical Character Recognition (OCR) is the process of converting image-based representations of characters into machine-encoded text.
Some popular use cases include:
To use OCR, the Apryse SDK requires the OCR Module, an optional add-on that is downloaded separately. It is currently available on Windows, Linux, and macOS.
The default OCR Module is powered by the Tesseract 4 engine. If this engine does not suit your needs, Apryse also offers the IRIS OCR Module, based on the IRIS iDRS engine. This module may provide better results in some cases, especially for pages containing multiple disconnected text snippets, as might occur in documents such as magazine covers or CAD documents. The IRIS module is currently available on Windows and Linux.
Using an OCR module, the SDK can create searchable and selectable text from images or PDFs. It can either produce a PDF with selectable text or output just the text position data in reusable JSON or XML form.
The module uses pdftron.PDF.Convert.ToPdf internally and accepts multiple image formats, as well as PDFs that contain only raster images. Result quality depends on the supplied image. The ideal image is greyscale with a resolution in the vicinity of 300 DPI.
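Because the ideal input is greyscale at roughly 300 DPI, it can help to check and prepare an image before handing it to the OCR module. The following is a minimal sketch using only standard Java classes; it does not call the Apryse SDK. The class name `OcrImagePrep`, its method names, and the choice of the common Rec. 601 luminance weights are assumptions made for illustration, and preprocessing of this kind is optional rather than something the SDK requires.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper (not part of the Apryse SDK): prepares images for OCR.
final class OcrImagePrep {

    private OcrImagePrep() {}

    /** Returns an 8-bit greyscale copy of the given image. */
    static BufferedImage toGreyscale(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage grey = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = grey.getRaster();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Rec. 601 luminance weights (an assumption; other weightings work too).
                long lum = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                raster.setSample(x, y, 0, (int) Math.max(0, Math.min(255, lum)));
            }
        }
        return grey;
    }

    /** Estimates the effective DPI of a scan from its pixel width and physical width. */
    static double effectiveDpi(int pixelWidth, double physicalWidthInches) {
        if (physicalWidthInches <= 0) {
            throw new IllegalArgumentException("physicalWidthInches must be positive");
        }
        return pixelWidth / physicalWidthInches;
    }

    /** Returns true if the DPI is within the given tolerance of the 300 DPI target. */
    static boolean isNearTargetDpi(double dpi, double tolerance) {
        return Math.abs(dpi - 300.0) <= tolerance;
    }
}
```

For example, a US Letter page (8.5 inches wide) scanned at 2550 pixels wide gives `OcrImagePrep.effectiveDpi(2550, 8.5)` of 300, which is in the recommended range. The greyscale image returned by `toGreyscale` can be written to disk with `javax.imageio.ImageIO.write` and then supplied to the OCR module.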
This section walks through a typical OCR workflow.