Set OCR workflows: output, language & quality on Server/Desktop
IRIS OCR module
If only one OCR module is present (either the IRIS or the default OCR module), Apryse SDK will use that module automatically (license permitting). When multiple OCR modules are present, the IRIS module can be selected using the OCR options object: `OCROptions.SetOCREngine("iris")`.
Make a searchable PDF by adding invisible text to an image using OCR.
Convert images to PDF with searchable/selectable text: a full code sample that shows how to use the Apryse OCR module on scanned documents in multiple languages. The OCR module can make searchable PDFs and extract scanned text for further indexing.
Process a scanned document
Make a searchable PDF by adding invisible text to an image-based PDF, such as a scanned document, using OCR.
If we want to apply raw OCR output to the input document, we can call either OCRModule::ImageToPDF (if the input file is an image) or OCRModule::ProcessPDF (for a PDF). However, some post-processing is likely to be beneficial, e.g., comparing results against white/black lists. For this purpose we can first extract the text and corresponding metadata as either JSON or XML, and then re-apply the processed results to the input document.
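The post-processing step can be sketched in a few lines of Python. This is a minimal illustration, assuming the extracted JSON has a top-level `Page` array whose entries each hold a `Word` array with a `text` field. The sample JSON, the `BLACKLIST` set, and the `filter_words` helper are hypothetical and not part of the Apryse SDK. Extracting the JSON and re-applying it to the document are left to the SDK and are not shown.

```python
import json

# Hypothetical OCR output: the field names follow the page/word structure
# described in this section, but the values are made up for illustration.
SAMPLE_OCR_JSON = """
{
  "Page": [
    {
      "Word": [
        {"text": "Invoice", "font-size": 12, "x": 321, "y": 141, "length": 43},
        {"text": "l1l|", "font-size": 12, "x": 370, "y": 141, "length": 20},
        {"text": "Total", "font-size": 12, "x": 400, "y": 141, "length": 30}
      ]
    }
  ]
}
"""

# Hypothetical blacklist of tokens known to be OCR noise for this document type.
BLACKLIST = {"l1l|"}


def filter_words(ocr_json: str, blacklist: set) -> str:
    """Return the OCR JSON with blacklisted words removed from every page."""
    data = json.loads(ocr_json)
    for page in data.get("Page", []):
        page["Word"] = [w for w in page.get("Word", []) if w["text"] not in blacklist]
    return json.dumps(data)


cleaned = json.loads(filter_words(SAMPLE_OCR_JSON, BLACKLIST))
kept = [w["text"] for w in cleaned["Page"][0]["Word"]]
print(kept)  # ['Invoice', 'Total']
```

A whitelist works the same way with the membership test inverted. Each word's position values are left untouched, so the filtered JSON still describes where the remaining words belong on the page.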
Note that the OCR structure is simplified: we expect an array of Page elements, each consisting of an array of Word elements. Each Word is described by its text content and four typographic point values (i.e., font-size="12" x="321" y="141" length="43" in the example above) needed to construct the bounding box for placing the text on a page.
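As a rough illustration of how the four values can yield a bounding box, the sketch below assumes that (x, y) is the lower-left corner of the word, that `length` is its horizontal extent, and that `font-size` approximates its height, all in points. These geometric assumptions, the `Rect` type, and the `word_bbox` helper are hypothetical and should be checked against the actual SDK output.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def word_bbox(word: dict) -> Rect:
    """Build a bounding box from the four values stored with a word.

    Assumption: (x, y) is the lower-left corner, "length" is the width,
    and "font-size" stands in for the height of the box.
    """
    x, y = word["x"], word["y"]
    return Rect(x, y, x + word["length"], y + word["font-size"])


box = word_bbox({"text": "Hello", "font-size": 12, "x": 321, "y": 141, "length": 43})
print(box)  # Rect(x1=321, y1=141, x2=364, y2=153)
```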
Language options
We use the pdftron.PDF.OCROptions convenience class to pass OCR parameters. We can call pdftron.PDF.OCROptions.AddLang to pick a target language. If no language option is set, English is assumed.
The OCR Module binary currently contains 6 built-in languages:
English: eng
French: fra
Spanish: spa
Italian: ita
German: deu
Russian: rus
The IRIS OCR module extends the set of built-in languages with the following.
Only one of Chinese (traditional), Chinese (simplified), Japanese, and Korean can be selected at a time:
Chinese (traditional): chi_tra
Chinese (simplified): chi_sim
Japanese: jpn
Korean: kor
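The restriction on Chinese, Japanese, and Korean can be checked before the codes are passed to AddLang. The helper below is a hypothetical sketch, not an SDK function. It only encodes the rule stated above, using the four codes from the list.

```python
# Language codes from the list above that cannot be combined with one another.
CJK_CODES = {"chi_tra", "chi_sim", "jpn", "kor"}


def check_cjk_exclusive(langs):
    """Return the languages unchanged, or raise if more than one CJK code is present."""
    selected = sorted(set(langs) & CJK_CODES)
    if len(selected) > 1:
        raise ValueError(
            "only one CJK language may be selected, got: " + ", ".join(selected)
        )
    return list(langs)


print(check_cjk_exclusive(["eng", "jpn"]))  # ['eng', 'jpn']

try:
    check_cjk_exclusive(["jpn", "kor"])
except ValueError as err:
    print(err)  # only one CJK language may be selected, got: jpn, kor
```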
Adding languages to the default OCR module
Additional trained language files can be placed in the resource search path (which can be registered using PDFNet::AddResourceSearchPath). Afterwards, they can be referred to via their file prefix.
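To see which prefixes a resource directory would make available, one can list the file names and strip the extension. The sketch below is hypothetical. It assumes trained files are named `<prefix>.traineddata`, which is an assumption about the file naming and not something this page states, and `language_prefixes` is not an SDK function.

```python
import tempfile
from pathlib import Path


def language_prefixes(resource_dir, suffix=".traineddata"):
    """List the file-name prefixes of trained language files in resource_dir.

    The ".traineddata" suffix is an assumption made for this sketch.
    """
    return sorted(
        p.name[: -len(suffix)]
        for p in Path(resource_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as d:
    for name in ("deu.traineddata", "por.traineddata", "readme.txt"):
        (Path(d) / name).touch()
    found = language_prefixes(d)

print(found)  # ['deu', 'por']
```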
Multiple languages
Multiple languages can be specified, although it is not recommended to use more than 3 languages.
C++
// Add French, Spanish and default English to target languages
OCROptions opts;
opts.AddLang("fra");
opts.AddLang("spa");

C#
// Add French, Spanish and default English to target languages
OCROptions opts = new OCROptions();
opts.AddLang("fra");
opts.AddLang("spa");

Go
// Add French, Spanish and default English to target languages
opts := NewOCROptions()
opts.AddLang("fra")
opts.AddLang("spa")

VB
' Add French, Spanish and default English to target languages
Dim opts As OCROptions = New OCROptions()
opts.AddLang("fra")
opts.AddLang("spa")

Java
// Add French, Spanish and default English to target languages
OCROptions opts = new OCROptions();
opts.addLang("fra");
opts.addLang("spa");

JavaScript (Node.js)
async function main() {
  // Add French, Spanish and default English to target languages
  const opts = new PDFNet.OCRModule.OCROptions();
  opts.addLang("fra");
  opts.addLang("spa");
}
PDFNet.runWithCleanup(main);

Python
# Add French, Spanish and default English to target languages
opts = OCROptions()
opts.AddLang("fra")
opts.AddLang("spa")

PHP
// Add French, Spanish and default English to target languages
$opts = new OCROptions();
$opts->AddLang("fra");
$opts->AddLang("spa");

Ruby
# Add French, Spanish and default English to target languages
opts = OCROptions.new
opts.AddLang("fra")
opts.AddLang("spa")
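
The language rules in this section (English as the fallback when nothing is set, and the recommendation to stay at three languages or fewer) can be captured in a small helper that prepares the list before each code is passed to AddLang. The `resolve_languages` function and the `MAX_RECOMMENDED_LANGS` constant are hypothetical names for this sketch and are not part of the SDK.

```python
import warnings

MAX_RECOMMENDED_LANGS = 3  # recommendation stated in this section


def resolve_languages(requested=None):
    """Return the de-duplicated language list, falling back to English."""
    langs = list(dict.fromkeys(requested or []))  # keep order, drop repeats
    if not langs:
        return ["eng"]  # no language set: English is assumed
    if len(langs) > MAX_RECOMMENDED_LANGS:
        warnings.warn(
            f"{len(langs)} languages requested; more than "
            f"{MAX_RECOMMENDED_LANGS} is not recommended"
        )
    return langs


print(resolve_languages())                       # ['eng']
print(resolve_languages(["fra", "spa", "fra"]))  # ['fra', 'spa']
```

The helper only warns when the list is long and does not truncate it, since the limit is described as a recommendation and not a hard constraint.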
Output quality options
When processing documents whose layouts are known in advance, we can enhance output quality either by specifying regions that OCR should ignore via OCROptions::AddIgnoreZonesForPage, or by listing the exclusive regions to process via OCROptions::AddTextZonesForPage. Both zone options act as stencils: for ignore zones, the area inside the supplied rectangular regions is whited out before processing; for text zones, the area outside the supplied regions is whited out. The options store an array of RectCollection, where the index into the array corresponds to the relevant page number. OCROptions::AddIgnoreZonesForPage can also be used to skip a page by setting an ignore zone equal to the page's media box.
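The stencil behaviour can be pictured with a toy model in which a page is a small grid of "ink" cells. This is only a conceptual sketch of the two modes described above, not how the SDK is implemented. The `apply_zones` function, the grid representation, and the exclusive upper bounds of the rectangles are all assumptions made for illustration.

```python
def apply_zones(width, height, rects, mode, blank=".", ink="#"):
    """Return the rows of a fully inked page after whiting out cells.

    rects are (x1, y1, x2, y2) in grid cells, with x2 and y2 exclusive.
    mode "ignore": cells inside any rect are blanked.
    mode "text":   cells outside every rect are blanked.
    """
    if mode not in ("ignore", "text"):
        raise ValueError("mode must be 'ignore' or 'text'")

    def inside(x, y):
        return any(x1 <= x < x2 and y1 <= y < y2 for x1, y1, x2, y2 in rects)

    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            keep = inside(x, y) if mode == "text" else not inside(x, y)
            row += ink if keep else blank
        rows.append(row)
    return rows


zone = [(1, 0, 3, 2)]
print(apply_zones(6, 3, zone, "ignore"))  # ['#..###', '#..###', '######']
print(apply_zones(6, 3, zone, "text"))    # ['.##...', '.##...', '......']

# An ignore zone covering the whole page leaves nothing to recognize,
# which is how a page can be skipped.
print(apply_zones(6, 3, [(0, 0, 6, 3)], "ignore"))  # ['......', '......', '......']
```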
C++
// Optionally specify page zones for OCR extraction in a multipage document
RectCollection page_zones;

page_zones.AddRect(900, 2384, 1236, 2480);
page_zones.AddRect(948, 1288, 1672, 1476);

// OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones.Clear();

page_zones.AddRect(428, 1484, 1784, 2344);

// OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2);

C#
// Optionally specify page zones for OCR extraction in a multipage document
RectCollection page_zones = new RectCollection();

page_zones.AddRect(900, 2384, 1236, 2480);
page_zones.AddRect(948, 1288, 1672, 1476);

// OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones.Clear();

page_zones.AddRect(428, 1484, 1784, 2344);

// OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2);

// Optionally specify page zones for OCR extraction in a multipage document