Handwriting Intelligent Character Recognition (ICR) workflows for the Apryse Server SDK

This guide walks through handwriting ICR workflows, starting with the simplest use cases and moving on to more advanced ones.

Process a scanned document

Make a searchable PDF by adding invisible text to an image-based PDF, such as a scanned document, using Handwriting ICR.

PDFDoc doc(input_pdf_path);

// Run ICR on the PDF with the default options.
HandwritingICRModule::ProcessPDF(doc);

Full code sample to process a scanned document

A full code sample is also available that adds searchable and selectable text to an image-based PDF, such as a scanned document. It shows how to use the Apryse Handwriting ICR module on scanned documents in multiple programming languages. The Handwriting ICR module can make PDFs searchable and extract scanned text for further indexing. Samples are available in Python, C# (.NET), C++, Go, Java, Node.js (JavaScript), PHP, Ruby, VB, and Obj-C.

Extract handwritten text as JSON

If you want to apply raw ICR output to the input document, call HandwritingICRModule::ProcessPDF. However, some post-processing is often beneficial, for example running a spell checker or comparing the results against allow lists or block lists. For this purpose, you can first extract the text and its corresponding metadata as JSON, post-process it, and then re-apply the processed results to the input document.

// Open the .pdf document.
PDFDoc doc(input_path + "icr.pdf");

// Extract ICR results in JSON format.
UString json = HandwritingICRModule::GetICRJsonFromPDF(doc);

// Post-processing step (whatever it might be)

// Re-apply results.
HandwritingICRModule::ApplyICRJsonToPDF(doc, json);
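
As an illustration of what the post-processing step might do, the following is a minimal sketch of a per-word filter that applies a table of known corrections and drops blocklisted words. The `WordFilter` struct and `PostProcessWord` function are hypothetical names, not part of the Apryse SDK, and the sketch assumes you have already parsed the word text out of the JSON with a JSON library of your choice (parsing is not shown).

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical rule set for illustration; not part of the Apryse SDK.
struct WordFilter {
    // Known misreadings mapped to their intended text.
    std::unordered_map<std::string, std::string> corrections;
    // Words that should be discarded from the results.
    std::unordered_set<std::string> blocklist;
};

// Returns the corrected word, the unchanged word if no rule matches,
// or an empty string if the word is on the blocklist.
std::string PostProcessWord(const std::string& word, const WordFilter& filter) {
    if (filter.blocklist.count(word) > 0) {
        return "";
    }
    auto it = filter.corrections.find(word);
    return it != filter.corrections.end() ? it->second : word;
}
```

Each word's "text" value would be passed through such a function, and the updated JSON would then be re-applied to the document.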

Output Attributes

ICR output consists of nested arrays:

  • Array of pages.
  • Array of paragraphs.
  • Array of lines.
  • Array of words.

Pages have additional metadata:

| Attribute | Value | Description |
| --- | --- | --- |
| num | | page number |
| dpi | | document resolution (needed to correctly scale the coordinates from points to pixels) |
| origin | TopLeft | coordinate system has origin at the top left corner (default) |
| | BottomLeft | coordinate system has origin at the bottom left corner (i.e., PDF page coordinate system) |

Then, each word in the ICR output includes the following:

| Attribute | Value | Description |
| --- | --- | --- |
| x | | bounding box lower left corner x coordinate |
| y | | bounding box lower left corner y coordinate |
| length | | length of bounding box |
| font-size | | text's font size |
| text | | text output |
| orientation | L | 270 degrees clockwise rotation |
| | R | 90 degrees clockwise rotation |
| | D | 180 degrees clockwise rotation |
| | U | 0 degrees clockwise rotation |
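
When consuming the output, the orientation codes listed above can be mapped to rotation angles with a small lookup. This is a minimal sketch that assumes the code arrives as a single character; `OrientationToClockwiseDegrees` is a hypothetical name, not an SDK function.

```cpp
#include <stdexcept>

// Map an orientation code (U, R, D, L) to its clockwise rotation in degrees.
int OrientationToClockwiseDegrees(char code) {
    switch (code) {
        case 'U': return 0;
        case 'R': return 90;
        case 'D': return 180;
        case 'L': return 270;
        default:
            throw std::invalid_argument("unknown orientation code");
    }
}
```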

Each line has an optional box property consisting of 4 values having the same interpretation as pdftron::PDF::Rect.

External ICR results

The API can also be used to apply ICR JSON generated by other OCR or ICR engines. The expected structure for the input JSON is:

JSON

{
  "Page": [
    {
      "Word": [
        {
          "font-size": 12,
          "length": 43,
          "text": "ABC",
          "x": 321,
          "y": 141
        }
      ],
      "num": 1,
      "dpi": 96,
      "origin": "TopLeft"
    }
  ]
}

Note that this structure is simplified: the API expects an array of Page objects, each containing a Word array. Each Word is described by its text content and four typographic point values (font-size 12, x 321, y 141, and length 43 in the example above) needed to construct the bounding box for placing the text on a page.
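
If the words come from another engine, the simplified structure can be assembled with plain string building. The following is a minimal sketch for a single page; `IcrWord`, `EscapeJson`, and `BuildIcrPageJson` are hypothetical names introduced for illustration, and a JSON library would be a more robust choice in production code.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical struct for one recognized word; fields mirror the JSON keys.
struct IcrWord {
    std::string text;
    int x;
    int y;
    int length;
    int font_size;
};

// Escape quotes, backslashes, and common whitespace characters.
// Other control characters are passed through unchanged in this sketch.
std::string EscapeJson(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:   out += c; break;
        }
    }
    return out;
}

// Build the simplified ICR JSON for one page with a TopLeft origin.
std::string BuildIcrPageJson(const std::vector<IcrWord>& words, int page_num, int dpi) {
    std::ostringstream os;
    os << "{\"Page\":[{\"Word\":[";
    for (std::size_t i = 0; i < words.size(); ++i) {
        const IcrWord& w = words[i];
        if (i > 0) {
            os << ",";
        }
        os << "{\"font-size\":" << w.font_size
           << ",\"length\":" << w.length
           << ",\"text\":\"" << EscapeJson(w.text) << "\""
           << ",\"x\":" << w.x
           << ",\"y\":" << w.y << "}";
    }
    os << "],\"num\":" << page_num
       << ",\"dpi\":" << dpi
       << ",\"origin\":\"TopLeft\"}]}";
    return os.str();
}
```

The resulting string would then be handed to HandwritingICRModule::ApplyICRJsonToPDF in the same way as the extracted JSON in the earlier snippet.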
