We also have a full code sample that adds searchable and selectable text to an image-based PDF, such as a scanned document. It shows how to use the Apryse Handwriting ICR module on scanned documents in multiple programming languages. The Handwriting ICR module can make PDFs searchable and extract scanned text for further indexing. Samples are available in Python, C# (.NET), C++, Go, Java, Node.js (JavaScript), PHP, Ruby, VB, and Obj-C.
Extract handwritten text as JSON
If you want to apply raw ICR output to the input document, you can call HandwritingICRModule.ProcessPDF. However, some post-processing is likely to be beneficial, e.g., running a spell checker or comparing results against whitelists and blacklists. For this purpose, you can first extract the text and corresponding metadata as JSON, process it, and then re-apply the processed results to the input document.
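The post-processing step described above can be sketched in plain Python. This is a minimal sketch, assuming the extracted JSON uses the Page/Word layout described in this section. The `CORRECTIONS` and `BLACKLIST` tables and the function name `postprocess_icr_json` are hypothetical, and the SDK calls that extract the JSON and re-apply it to the document are not shown.

```python
import json

# Hypothetical lookup tables; a real pipeline might call a spell checker instead.
CORRECTIONS = {"recieve": "receive", "adress": "address"}
BLACKLIST = {"|", "~"}


def postprocess_icr_json(icr_json: str) -> str:
    """Return ICR JSON with known misreadings fixed and blacklisted tokens dropped."""
    data = json.loads(icr_json)
    for page in data.get("Page", []):
        cleaned = []
        for word in page.get("Word", []):
            text = word.get("text", "")
            if text in BLACKLIST:
                continue  # skip tokens treated as recognition noise
            # Look up the lowercased text; unknown words are kept unchanged.
            word["text"] = CORRECTIONS.get(text.lower(), text)
            cleaned.append(word)
        page["Word"] = cleaned
    return json.dumps(data)
```

Only the `text` values and the set of words change; coordinates and page attributes pass through untouched, so the result can still be placed on the page.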
| Attribute | Value | Description |
| --- | --- | --- |
| dpi | | document resolution (needed to correctly scale the coordinates from points to pixels) |
| origin | TopLeft | coordinate system has its origin at the top left corner (default) |
| | BottomLeft | coordinate system has its origin at the bottom left corner (i.e., the PDF page coordinate system) |
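The role of the resolution and origin attributes can be illustrated with a short conversion sketch. It assumes the usual definition of a typographic point as 1/72 inch. The helper names and the `page_height` parameter are hypothetical and not part of the SDK; `page_height` must be given in the same unit as `y`.

```python
POINTS_PER_INCH = 72.0  # one typographic point is 1/72 inch


def points_to_pixels(value_pt: float, dpi: float) -> float:
    """Scale a length in points to pixels at the given resolution."""
    return value_pt * dpi / POINTS_PER_INCH


def flip_y(y: float, page_height: float) -> float:
    """Convert a y coordinate between the TopLeft and BottomLeft origins.

    The same formula works in both directions because the two systems
    differ only in the direction of the y axis.
    """
    return page_height - y
```

For example, a 12 pt value on a 96 dpi page corresponds to `points_to_pixels(12, 96)`, which is 16 pixels.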
Then, each word in the ICR output includes the following:
| Attribute | Value | Description |
| --- | --- | --- |
| x | | bounding box lower left corner x coordinate |
| y | | bounding box lower left corner y coordinate |
| length | | length of bounding box |
| font-size | | text's font size |
| text | | text output |
| orientation | L | 270 degrees clockwise rotation |
| | R | 90 degrees clockwise rotation |
| | D | 180 degrees clockwise rotation |
| | U | 0 degrees clockwise rotation |
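As an illustration of how these word attributes fit together, the sketch below maps the orientation codes to degrees and derives a rectangle from `x`, `y`, `length`, and `font-size`. Several things here are assumptions rather than documented behavior: the box height is approximated by the font size, a missing `orientation` is treated as `U`, the origin is assumed to be BottomLeft, and rotation is reported but not applied to the rectangle. The names `WordBox` and `word_box` are hypothetical.

```python
from dataclasses import dataclass

# Clockwise rotation in degrees for each orientation code.
ROTATION_DEGREES = {"U": 0, "R": 90, "D": 180, "L": 270}


@dataclass
class WordBox:
    x1: float
    y1: float
    x2: float
    y2: float
    rotation: int


def word_box(word: dict) -> WordBox:
    """Build an unrotated rectangle for one word entry."""
    x, y = word["x"], word["y"]
    # Assumption: the font size stands in for the box height.
    width, height = word["length"], word["font-size"]
    # Assumption: entries without an orientation are upright.
    rotation = ROTATION_DEGREES[word.get("orientation", "U")]
    return WordBox(x, y, x + width, y + height, rotation)
```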
Each line has an optional box property consisting of 4 values having the same interpretation as pdftron::PDF::Rect.
External ICR results
The API can also be used to apply ICR JSON generated by different OCR or ICR engines. The expected structure for input JSON is:
JSON
{
  "Page": [
    {
      "Word": [
        {
          "font-size": 12,
          "length": 43,
          "text": "ABC",
          "x": 321,
          "y": 141
        }
      ],
      "num": 1,
      "dpi": 96,
      "origin": "TopLeft"
    }
  ]
}
Note that the ICR structure is simplified: we expect an array of Page entries, each containing a Word array. Each Word is described by its text content and 4 typographic point values ("font-size": 12, "x": 321, "y": 141, and "length": 43 in the example above), which are needed to construct the bounding box for placing the text on a page.
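Results from another engine can be converted to this structure with a few small helpers. The following is a sketch under stated assumptions: the function names `make_word`, `make_page`, and `make_icr_json` are hypothetical, the accepted origin values are the two listed in this section, and passing the resulting string to the SDK is not shown.

```python
import json

VALID_ORIGINS = {"TopLeft", "BottomLeft"}


def make_word(text: str, x: float, y: float, length: float, font_size: float) -> dict:
    """Build one Word entry from its text and four point values."""
    return {"font-size": font_size, "length": length, "text": text, "x": x, "y": y}


def make_page(num: int, dpi: int, words: list, origin: str = "TopLeft") -> dict:
    """Build one Page entry; TopLeft is the default origin."""
    if origin not in VALID_ORIGINS:
        raise ValueError(f"unsupported origin: {origin!r}")
    return {"Word": words, "num": num, "dpi": dpi, "origin": origin}


def make_icr_json(pages: list) -> str:
    """Serialize a list of Page entries into the expected top-level object."""
    return json.dumps({"Page": pages}, indent=2)
```

Calling `make_icr_json([make_page(1, 96, [make_word("ABC", 321, 141, 43, 12)])])` produces the same data as the example JSON above.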