Set OCR workflows: output, language & quality on Server/Desktop

IRIS OCR module

If only one OCR module is present (either the IRIS or default OCR module), Apryse SDK will use that module automatically (license permitting). When multiple OCR modules are present, the IRIS module can be selected using the OCR options object: `OCROptions.setEngine("iris")`.

The following snippets make a searchable PDF by using OCR to add invisible text to an image.

This requires the OCR module add-on.

C++

PDFDoc doc;

// Run OCR on the image without options
OCRModule::ImageToPDF(doc, image_path, NULL);

C#

PDFDoc doc = new PDFDoc();

// Run OCR on the image without options
OCRModule.ImageToPDF(doc, image_path, null);

Go

doc := NewPDFDoc()
// Run OCR on the image with default options
ocrOpts := NewOCROptions()
OCRModuleImageToPDF(doc, image_path, ocrOpts)

VB

Using doc As PDFDoc = New PDFDoc()

   ' Run OCR on the image without options
   OCRModule.ImageToPDF(doc, image_path, Nothing)

End Using

Java

PDFDoc doc = new PDFDoc();

// Run OCR on the image without options
OCRModule.imageToPDF(doc, image_path, null);

JavaScript

async function main() {
   const doc = await PDFNet.PDFDoc.create();

   // Run OCR on the image without options
   await PDFNet.OCRModule.imageToPDF(doc, image_path);
}
PDFNet.runWithCleanup(main);

Obj-C

PTPDFDoc * doc = [[PTPDFDoc alloc] init];

// Run OCR on the image without options
[PTOCRModule ImageToPDF: doc src: image_path options: nil];

Python

doc = PDFDoc()

# Run OCR on the image without options
OCRModule.ImageToPDF(doc, image_path, None)

PHP

$doc = new PDFDoc();

// Run OCR on the image without options
OCRModule::ImageToPDF($doc, $image_path, NULL);

Ruby

doc = PDFDoc.new

# Run OCR on the image without options
OCRModule.ImageToPDF(doc, image_path, nil)

Convert images to PDF with searchable/selectable text
Full code sample which shows how to use the Apryse OCR module on scanned documents in multiple languages. The OCR module can make searchable PDFs and extract scanned text for further indexing. Samples available in Python, C# (.Net), C++, Go, Java, Node.js (JavaScript), PHP, Ruby, VB.

Process a scanned document

The following snippets make a searchable PDF by using OCR to add invisible text to an image-based PDF, such as a scanned document.

C++

PDFDoc doc(filename);

// Set English as the language of choice
OCROptions opts;
opts.AddLang("eng");

// Run OCR on the PDF with options
OCRModule::ProcessPDF(doc, &opts);

C#

PDFDoc doc = new PDFDoc(filename);

// Set English as the language of choice
OCROptions opts = new OCROptions();
opts.AddLang("eng");

// Run OCR on the PDF with options
OCRModule.ProcessPDF(doc, opts);

Go

doc := NewPDFDoc(filename)
// Set English as the language of choice
opts := NewOCROptions()
opts.AddLang("eng")
// Run OCR on the PDF with options
OCRModuleProcessPDF(doc, opts)

VB

Using doc As PDFDoc = New PDFDoc(filename)

   ' Set English as the language of choice
   Dim opts As OCROptions = New OCROptions()
   opts.AddLang("eng")

   ' Run OCR on the PDF with options
   OCRModule.ProcessPDF(doc, opts)

End Using

Java

PDFDoc doc = new PDFDoc(filename);

// Set English as the language of choice
OCROptions options = new OCROptions();
options.addLang("eng");

// Run OCR on the PDF with options
OCRModule.processPDF(doc, options);

JavaScript

async function main() {
   const doc = await PDFNet.PDFDoc.createFromFilePath(filename);

   // Set English as the language of choice
   const opts = new PDFNet.OCRModule.OCROptions();
   opts.addLang("eng");

   // Run OCR on the PDF with options
   await PDFNet.OCRModule.processPDF(doc, opts);
}
PDFNet.runWithCleanup(main);

Obj-C

PTPDFDoc * doc = [[PTPDFDoc alloc] initWithFilepath: filename];

// Set English as the language of choice
PTObjSet * set = [[PTObjSet alloc] init];
PTObj * options = [set CreateDict];
PTObj * lang_array = [options PutArray: @"Langs"];
[lang_array PushBackString: @"eng"];

// Run OCR on the PDF with options
[PTOCRModule ProcessPDF: doc options: options];

Python

doc = PDFDoc(filename)

# Set English as the language of choice
opts = OCROptions()
opts.AddLang("eng")

# Run OCR on the PDF with options
OCRModule.ProcessPDF(doc, opts)

PHP

$doc = new PDFDoc($filename);

// Set English as the language of choice
$opts = new OCROptions();
$opts->AddLang("eng");

// Run OCR on the PDF with options
OCRModule::ProcessPDF($doc, $opts);

Ruby

doc = PDFDoc.new(filename)

# Set English as the language of choice
opts = OCROptions.new
opts.AddLang("eng")

# Run OCR on the PDF with options
OCRModule.ProcessPDF(doc, opts)

Add searchable/selectable text to an image based PDF like a scanned document
Full code sample which shows how to use the Apryse OCR module on scanned documents in multiple languages. The OCR module can make searchable PDFs and extract scanned text for further indexing. Samples available in Python, C# (.Net), C++, Go, Java, Node.js (JavaScript), PHP, Ruby, VB

Get metadata as JSON

If we want to apply raw OCR output to the input document, we can call either OCRModule::ImageToPDF (if the input file is an image) or OCRModule::ProcessPDF (for a PDF). However, some post-processing is often beneficial, e.g., comparing results against allow or deny lists. For this purpose, we can first extract the text and corresponding metadata as either JSON or XML, then re-apply the processed results to the input document.

C++

// Setup empty destination doc
PDFDoc doc;

std::string image_path = "path/to/image";

// Extract OCR results as JSON
UString json = OCRModule::GetOCRJsonFromImage(doc, image_path, opts);

// Post-processing step (whatever it might be)

// Re-apply results.
OCRModule::ApplyOCRJsonToPDF(doc, json);

C#

// Setup empty destination doc
PDFDoc doc = new PDFDoc();
string image_path = "path/to/image";

// Extract OCR results as JSON
string json = OCRModule.GetOCRJsonFromImage(doc, image_path, opts);

// Post-processing step (whatever it might be)

// Re-apply results.
OCRModule.ApplyOCRJsonToPDF(doc, json);

Go

// Setup empty destination doc
doc := NewPDFDoc()
image_path := "path/to/image"
opts := NewOCROptions()
// Extract OCR results as JSON
json := OCRModuleGetOCRJsonFromImage(doc, image_path, opts)
// Post-processing step (whatever it might be)
// Re-apply results.
OCRModuleApplyOCRJsonToPDF(doc, json)

VB

' Setup empty destination doc
Using doc As PDFDoc = New PDFDoc()
   Dim image_path As String = "path/to/image"

   ' Extract OCR results as JSON
   Dim json As String = OCRModule.GetOCRJsonFromImage(doc, image_path, opts)

   ' Post-processing step (whatever it might be)

   ' Re-apply results.
   OCRModule.ApplyOCRJsonToPDF(doc, json)

End Using

Java

// Setup empty destination doc
PDFDoc doc = new PDFDoc();
String image_path = "path/to/image";

// Extract OCR results as JSON
String json = OCRModule.getOCRJsonFromImage(doc, image_path, opts);

// Post-processing step (whatever it might be)

// Re-apply results.
OCRModule.applyOCRJsonToPDF(doc, json);

JavaScript

async function main() {
   // Setup empty destination doc
   const doc = await PDFNet.PDFDoc.create();
   const image_path = "path/to/image";

   // Extract OCR results as JSON
   const json = await PDFNet.OCRModule.getOCRJsonFromImage(doc, image_path, opts);

   // Post-processing step (whatever it might be)

   // Re-apply results.
   await PDFNet.OCRModule.applyOCRJsonToPDF(doc, json);
}
PDFNet.runWithCleanup(main);

Obj-C

// Setup empty destination doc
PTPDFDoc * doc = [[PTPDFDoc alloc] init];

NSString * image_path = @"path/to/image";

// Extract OCR results as JSON
NSString * json = [PTOCRModule GetOCRJsonFromImage: doc src: image_path options: nil];

// Post-processing step (whatever it might be)

// Re-apply results.
[PTOCRModule ApplyOCRJsonToPDF: doc json: json];

Python

# Setup empty destination doc
doc = PDFDoc()
image_path = "path/to/image"

# Extract OCR results as JSON
json = OCRModule.GetOCRJsonFromImage(doc, image_path, opts)

# Post-processing step (whatever it might be)

# Re-apply results.
OCRModule.ApplyOCRJsonToPDF(doc, json)

PHP

// Setup empty destination doc
$doc = new PDFDoc();
$image_path = "path/to/image";

// Extract OCR results as JSON
$json = OCRModule::GetOCRJsonFromImage($doc, $image_path, $opts);

// Post-processing step (whatever it might be)

// Re-apply results.
OCRModule::ApplyOCRJsonToPDF($doc, $json);

Ruby

# Setup empty destination doc
doc = PDFDoc.new

image_path = "path/to/image"

# Extract OCR results as JSON
json = OCRModule.GetOCRJsonFromImage(doc, image_path, opts)

# Post-processing step (whatever it might be)

# Re-apply results.
OCRModule.ApplyOCRJsonToPDF(doc, json)

Output Attributes

OCR output consists of nested arrays: an array of pages, each containing an array of paragraphs, each containing an array of lines, each containing an array of words. Pages have additional metadata:

Attribute	Value	Description
num		page number
dpi		document resolution (needed to correctly scale the coordinates from points to pixels)
origin	TopLeft	coordinate system has origin at the top left corner (default)
	BottomLeft	coordinate system has origin at the bottom left corner (i.e., PDF page coordinate system)

Each word in the OCR output has the following attributes:

Attribute	Value	Description
x		x coordinate of the bounding box's lower left corner
y		y coordinate of the bounding box's lower left corner
length		length of the bounding box
font-size		font size of the text
text		text output
orientation	L	270 degrees clockwise rotation
	R	90 degrees clockwise rotation
	D	180 degrees clockwise rotation
	U	0 degrees clockwise rotation

Finally, each line has an optional `box` property consisting of 4 values with the same interpretation as `pdftron::PDF::Rect`.
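
The `dpi` and `origin` page attributes determine how word coordinates relate to PDF page space. The following is a minimal sketch of that mapping, under the assumption that `x` and `y` are expressed in pixels at the page's `dpi` and that the page height in points is known. The function `to_pdf_points` and its parameters are hypothetical names chosen for illustration and are not part of the Apryse API.

```python
def to_pdf_points(x, y, dpi, origin, page_height_pt):
    """Convert a coordinate to PDF points measured from the bottom-left corner.

    Assumption: x and y are pixels at `dpi`; one PDF point is 1/72 inch.
    """
    scale = 72.0 / dpi
    x_pt = x * scale
    y_pt = y * scale
    if origin == "TopLeft":
        # Flip the vertical axis so y is measured from the bottom of the page.
        y_pt = page_height_pt - y_pt
    elif origin != "BottomLeft":
        raise ValueError(f"unknown origin: {origin!r}")
    return x_pt, y_pt


# A 96 dpi page that is 792 points (11 inches) tall.
print(to_pdf_points(273, 265, 96, "BottomLeft", 792.0))  # (204.75, 198.75)
print(to_pdf_points(321, 141, 96, "TopLeft", 792.0))     # (240.75, 686.25)
```

If your OCR source already reports coordinates in points, the scale step would not apply; only the origin flip would.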

Sample JSON output

Below is a sample of the JSON that the OCR module produces.

JSON

{
   "Page":[
      {
         "Para":[
            {
               "Line":[
                  {
                     "Word":[
                        {
                           "font-size": 27,
                           "length": 64,
                           "orientation": "U",
                           "text": "Hello",
                           "x": 273,
                           "y": 265
                        }
                     ],
                     "box":[
                        273,
                        265,
                        64,
                        29
                     ]
                  }
               ]
            }
         ],
         "num": 1,
         "dpi": 96,
         "origin": "BottomLeft"
      }
   ]
}
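
As one illustration of a post-processing step on this structure, the sketch below walks the nested Page/Para/Line/Word arrays and drops every word found in a deny list. It uses only the Python standard library. The function name `filter_words` and the extra "DRAFT" word in the sample data are made up for this example.

```python
import json


def filter_words(ocr_json, deny_list):
    """Return OCR JSON with every word whose text is in deny_list removed."""
    data = json.loads(ocr_json)
    denied = {text.lower() for text in deny_list}
    for page in data.get("Page", []):
        for para in page.get("Para", []):
            for line in para.get("Line", []):
                line["Word"] = [
                    word for word in line.get("Word", [])
                    if word.get("text", "").lower() not in denied
                ]
    return json.dumps(data)


# Illustrative input: the sample above plus one made-up word to filter out.
sample = json.dumps({
    "Page": [{
        "Para": [{
            "Line": [{
                "Word": [
                    {"font-size": 27, "length": 64, "orientation": "U",
                     "text": "Hello", "x": 273, "y": 265},
                    {"font-size": 27, "length": 70, "orientation": "U",
                     "text": "DRAFT", "x": 345, "y": 265},
                ],
            }],
        }],
        "num": 1, "dpi": 96, "origin": "BottomLeft",
    }],
})

cleaned = json.loads(filter_words(sample, ["draft"]))
remaining = [w["text"] for w in cleaned["Page"][0]["Para"][0]["Line"][0]["Word"]]
print(remaining)  # ['Hello']
```

The filtered string could then be passed to ApplyOCRJsonToPDF as in the snippets above. This sketch does not recompute a line's optional `box` after removing words.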

External OCR results

The API can also be used to apply OCR XML/JSON generated by other OCR engines. The expected structures for input JSON and XML, respectively, are:

JSON

{
   "Page":[
      {
         "Word":[
            {
               "font-size": 12,
               "length": 43,
               "text": "ABC",
               "x": 321,
               "y": 141
            }
         ],
         "num": 1,
         "dpi": 96,
         "origin": "TopLeft"
      }
   ]
}

XML

<Doc>
	<Page num="1" origin="TopLeft" dpi="96">
		<Word font-size="12" x="321" y="141" length="43">ABC</Word>
	</Page>
</Doc>

Note that this structure is simplified: we expect an array of Page, with each page containing an array of Word. Each Word is described by its text content and 4 typographic point values (i.e., font-size="12" x="321" y="141" length="43" in the example above), which are needed to construct the bounding box for placing the text on a page.
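
When the words come from another OCR engine, the simplified structure can be assembled with standard-library tools. The sketch below builds both the JSON and XML forms from a list of per-page word records. The function `build_external_ocr` and the shape of its input (keys such as `words` and `font_size`) are assumptions made for this example, not part of any SDK.

```python
import json
import xml.etree.ElementTree as ET


def build_external_ocr(pages):
    """Build the simplified OCR JSON and XML strings from per-page word lists.

    `pages` is a list of dicts with keys num, dpi, origin and words; each word
    is a dict with keys text, x, y, length and font_size (hypothetical shape).
    """
    json_pages = []
    doc = ET.Element("Doc")
    for page in pages:
        page_el = ET.SubElement(doc, "Page", {
            "num": str(page["num"]),
            "origin": page["origin"],
            "dpi": str(page["dpi"]),
        })
        json_words = []
        for w in page["words"]:
            json_words.append({
                "font-size": w["font_size"],
                "length": w["length"],
                "text": w["text"],
                "x": w["x"],
                "y": w["y"],
            })
            word_el = ET.SubElement(page_el, "Word", {
                "font-size": str(w["font_size"]),
                "x": str(w["x"]),
                "y": str(w["y"]),
                "length": str(w["length"]),
            })
            word_el.text = w["text"]
        json_pages.append({
            "Word": json_words,
            "num": page["num"],
            "dpi": page["dpi"],
            "origin": page["origin"],
        })
    return json.dumps({"Page": json_pages}), ET.tostring(doc, encoding="unicode")


ocr_json, ocr_xml = build_external_ocr([
    {"num": 1, "dpi": 96, "origin": "TopLeft",
     "words": [{"text": "ABC", "x": 321, "y": 141, "length": 43, "font_size": 12}]},
])
print(ocr_json)
print(ocr_xml)
```

Using `ElementTree` rather than string concatenation means word text containing characters such as `<` or `&` is escaped correctly in the XML output.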

Language options

We use the pdftron.PDF.OCROptions convenience class to pass OCR parameters. Call pdftron.PDF.OCROptions.AddLang to pick a target language. If no language option is set, English is assumed.

The OCR module binary currently includes 6 built-in languages:

English: eng
French: fra
Spanish: spa
Italian: ita
German: deu
Russian: rus

The IRIS OCR module extends the set of built-in languages with:

Chinese (traditional): chi_tra
Chinese (simplified): chi_sim
Japanese: jpn
Korean: kor

Only one of Chinese (traditional), Chinese (simplified), Japanese and Korean can be selected at a time.

Adding languages to the default OCR module

Additional trained language files can be placed in a resource search path (which can be registered using PDFNet::AddResourceSearchPath). Afterwards, they can be referred to by their file prefix.

Multiple languages

Multiple languages can be specified, although it is not recommended to use more than 3 languages.

C++

// Add French, Spanish and default English to target languages
OCROptions opts;
opts.AddLang("fra");
opts.AddLang("spa");

C#

// Add French, Spanish and default English to target languages
OCROptions opts = new OCROptions();
opts.AddLang("fra");
opts.AddLang("spa");

Go

// Add French, Spanish and default English to target languages
opts := NewOCROptions()
opts.AddLang("fra")
opts.AddLang("spa")

VB

' Add French, Spanish and default English to target languages
Dim opts As OCROptions = New OCROptions()
opts.AddLang("fra")
opts.AddLang("spa")

Java

// Add French, Spanish and default English to target languages
OCROptions opts = new OCROptions();
opts.addLang("fra");
opts.addLang("spa");

JavaScript

async function main() {
   // Add French, Spanish and default English to target languages
   const opts = new PDFNet.OCRModule.OCROptions();
   opts.addLang("fra");
   opts.addLang("spa");
}
PDFNet.runWithCleanup(main);

Obj-C

// Add French, Spanish and default English to target languages
PTObjSet * set = [[PTObjSet alloc] init];
PTObj * options = [set CreateDict];
PTObj * lang_array = [options PutArray: @"Langs"];
[lang_array PushBackString: @"fra"];
[lang_array PushBackString: @"spa"];

Python

# Add French, Spanish and default English to target languages
opts = OCROptions()
opts.AddLang("fra")
opts.AddLang("spa")

PHP

// Add French, Spanish and default English to target languages
$opts = new OCROptions();
$opts->AddLang("fra");
$opts->AddLang("spa");

Ruby

# Add French, Spanish and default English to target languages
opts = OCROptions.new
opts.AddLang("fra")
opts.AddLang("spa")
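
The language constraints described in this section (at most one of the Chinese, Japanese and Korean codes at a time, and no more than 3 languages recommended) can be checked before any language is added to the options. The following is a small sketch of such a pre-check. The helper `check_languages` and its constants are hypothetical and are not provided by the SDK.

```python
# Language codes of which only one may be selected at a time (IRIS module).
CJK_LANGS = {"chi_tra", "chi_sim", "jpn", "kor"}
# Using more languages than this is not recommended.
MAX_RECOMMENDED_LANGS = 3


def check_languages(langs):
    """Return a list of warnings for a proposed OCR language selection."""
    warnings = []
    cjk = [lang for lang in langs if lang in CJK_LANGS]
    if len(cjk) > 1:
        warnings.append(
            "only one of chi_tra, chi_sim, jpn, kor can be selected: "
            + ", ".join(cjk)
        )
    if len(langs) > MAX_RECOMMENDED_LANGS:
        warnings.append(
            f"{len(langs)} languages selected; "
            f"more than {MAX_RECOMMENDED_LANGS} is not recommended"
        )
    return warnings


print(check_languages(["eng", "fra", "spa"]))  # []
print(check_languages(["eng", "jpn", "kor"]))
```

A caller could run this check first and only call AddLang for each code when the returned list is empty.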

Output quality options

When processing documents whose layouts are known in advance, we can improve output quality by either specifying regions for OCR to ignore via OCROptions::AddIgnoreZonesForPage, or listing the only regions to process via OCROptions::AddTextZonesForPage. Both zone options act as stencils: for ignore zones, the area inside the supplied rectangular regions is whited out before processing; for text zones, the area outside the supplied regions is whited out. The options store an array of RectCollection, where the index into the array corresponds to the relevant page number. OCROptions::AddIgnoreZonesForPage can also be used to skip a page by setting an ignore zone equal to the page's media box.

C++

// Optionally specify page zones for OCR extraction in a multipage document
RectCollection page_zones;

page_zones.AddRect(900, 2384, 1236, 2480);
page_zones.AddRect(948, 1288, 1672, 1476);

// OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones.Clear();

page_zones.AddRect(428, 1484, 1784, 2344);

// OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2);

C#

// Optionally specify page zones for OCR extraction in a multipage document
RectCollection page_zones = new RectCollection();

page_zones.AddRect(900, 2384, 1236, 2480);
page_zones.AddRect(948, 1288, 1672, 1476);

// OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones.Clear();

page_zones.AddRect(428, 1484, 1784, 2344);

// OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2);

Go

// Optionally specify page zones for OCR extraction in a multipage document
textZones := NewRectCollection()
// select horizontal BUFFER ZONE sign
textZones.AddRect(NewRect(900.0, 2384.0, 1236.0, 2480.0))
textZones.AddRect(NewRect(948.0, 1288.0, 1672.0, 1476.0))
// OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(textZones, 1)
// Reset zone container
textZones.Clear()
textZones.AddRect(NewRect(428.0, 1484.0, 1784.0, 2344.0))
// OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(textZones, 2)

VB

' Optionally specify page zones for OCR extraction in a multipage document
Dim page_zones As RectCollection = New RectCollection()

page_zones.AddRect(900, 2384, 1236, 2480)
page_zones.AddRect(948, 1288, 1672, 1476)

' OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1)

' Reset zone container
page_zones.Clear()

page_zones.AddRect(428, 1484, 1784, 2344)

' OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2)

Java

// Optionally specify page zones for OCR extraction in a multipage document
RectCollection page_zones = new RectCollection();

page_zones.addRect(900, 2384, 1236, 2480);
page_zones.addRect(948, 1288, 1672, 1476);

// OCR will only process the two specified zones on the first page
opts.addTextZonesForPage(page_zones, 1);

// Reset zone container
page_zones.clear();

page_zones.addRect(428, 1484, 1784, 2344);

// OCR will only process one specified zone on the second page
opts.addTextZonesForPage(page_zones, 2);

JavaScript

async function main() {
   // Optionally specify page zones for OCR extraction in a multipage document
   let page_zones = [];

   page_zones.push(new PDFNet.Rect(900, 2384, 1236, 2480));
   page_zones.push(new PDFNet.Rect(948, 1288, 1672, 1476));

   // OCR will only process the two specified zones on the first page
   opts.addTextZonesForPage(page_zones, 1);

   // Reset zone container
   page_zones = [];

   page_zones.push(new PDFNet.Rect(428, 1484, 1784, 2344));

   // OCR will only process one specified zone on the second page
   opts.addTextZonesForPage(page_zones, 2);
}
PDFNet.runWithCleanup(main);

Obj-C

// Optionally specify page zones for OCR extraction in a multipage document
PTPDFRectCollection * page_zones = [[PTPDFRectCollection alloc] init];

[page_zones AddRect: [[PTPDFRect alloc] initWithX1:900 y1: 2384 x2: 1236 y2: 2480 ]];
[page_zones AddRect: [[PTPDFRect alloc] initWithX1:948 y1: 1288 x2: 1672 y2: 1476 ]];

// OCR will only process the two specified zones on the first page
[opts AddTextZonesForPage: page_zones page_num:1];

// Reset zone container
[page_zones Clear];

[page_zones AddRect: [[PTPDFRect alloc] initWithX1:428 y1: 1484 x2: 1784 y2: 2344 ]];

// OCR will only process one specified zone on the second page
[opts AddTextZonesForPage: page_zones page_num:2];

Python

# Optionally specify page zones for OCR extraction in a multipage document
page_zones = RectCollection()

page_zones.AddRect(Rect(900, 2384, 1236, 2480))
page_zones.AddRect(Rect(948, 1288, 1672, 1476))

# OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1)

# Reset zone container
page_zones.Clear()

page_zones.AddRect(Rect(428, 1484, 1784, 2344))

# OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2)

PHP

// Optionally specify page zones for OCR extraction in a multipage document
$page_zones = new RectCollection();

$page_zones->AddRect(new Rect(900.0, 2384.0, 1236.0, 2480.0));
$page_zones->AddRect(new Rect(948.0, 1288.0, 1672.0, 1476.0));

// OCR will only process the two specified zones on the first page
$opts->AddTextZonesForPage($page_zones, 1);

// Reset zone container
$page_zones->Clear();

$page_zones->AddRect(new Rect(428.0, 1484.0, 1784.0, 2344.0));

// OCR will only process one specified zone on the second page
$opts->AddTextZonesForPage($page_zones, 2);

Ruby

# Optionally specify page zones for OCR extraction in a multipage document
page_zones = RectCollection.new

page_zones.AddRect(Rect.new(900, 2384, 1236, 2480))
page_zones.AddRect(Rect.new(948, 1288, 1672, 1476))

# OCR will only process the two specified zones on the first page
opts.AddTextZonesForPage(page_zones, 1)

# Reset zone container
page_zones.Clear

page_zones.AddRect(Rect.new(428, 1484, 1784, 2344))

# OCR will only process one specified zone on the second page
opts.AddTextZonesForPage(page_zones, 2)
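
To make the stencil behaviour concrete, the toy model below applies ignore zones and text zones to a tiny grid of pixel values. It is only an illustration of the whiting-out rule described in this section, not how the SDK implements it. The function `apply_stencil`, the half-open rectangle convention and the grid representation are all assumptions made for this sketch.

```python
WHITE = 255


def apply_stencil(image, zones, mode):
    """Return a copy of `image` with pixels whited out according to `zones`.

    image: 2D list of pixel values, indexed as image[y][x].
    zones: list of (x1, y1, x2, y2) rectangles, inclusive of x1/y1 and
           exclusive of x2/y2 (a convention chosen for this sketch).
    mode:  "ignore" whites out pixels inside any zone;
           "text" whites out pixels outside every zone.
    """
    if mode not in ("ignore", "text"):
        raise ValueError(f"unknown mode: {mode!r}")
    result = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            inside = any(
                x1 <= x < x2 and y1 <= y < y2 for x1, y1, x2, y2 in zones
            )
            white_out = inside if mode == "ignore" else not inside
            new_row.append(WHITE if white_out else pixel)
        result.append(new_row)
    return result


# A 4x3 all-black "page" and one zone covering columns 1-2 of rows 0-1.
image = [[0, 0, 0, 0] for _ in range(3)]
zones = [(1, 0, 3, 2)]

print(apply_stencil(image, zones, "ignore"))
# [[0, 255, 255, 0], [0, 255, 255, 0], [0, 0, 0, 0]]
print(apply_stencil(image, zones, "text"))
# [[255, 0, 0, 255], [255, 0, 0, 255], [255, 255, 255, 255]]
```

In this model, an ignore zone that covers the whole grid whites out every pixel, which mirrors skipping a page by setting the ignore zone to the page's media box.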

Setting Input Resolution

You can manually set the input image resolution; tuning this value can often lead to better results in practice.

C++

// Manually override DPI
opts.AddDPI(300);

C#

// Manually override DPI
opts.AddDPI(300);

Go

// Manually override DPI
opts.AddDPI(300)

VB

' Manually override DPI
opts.AddDPI(300)

Java

// Manually override DPI
opts.addDPI(300);

JavaScript

// Manually override DPI
opts.addDPI(300);

Obj-C

// Manually override DPI
[opts AddDPI: 300];

Python

# Manually override DPI
opts.AddDPI(300)

PHP

// Manually override DPI
$opts->AddDPI(300);

Ruby

# Manually override DPI
opts.AddDPI(300)
