OCR to search PDFs and Extract Text - Ruby Sample Code


This sample code shows how to use the Apryse Server OCR module on scanned documents in multiple languages. It is provided in Python, C++, C# (.NET), Java, Node.js (JavaScript), PHP, Ruby and VB. The OCR module can produce searchable PDFs and extract scanned text for further indexing.

Looking for OCR + WebViewer? Check out our OCR - Showcase Sample Code

Learn more about our Server SDK and OCR capabilities.

Implementation steps

To run this sample, complete the following steps:

  1. Get started with Server SDK in your language/framework
  2. Download OCR Module
  3. Add the sample code provided below

To use this feature in production, your license key will need the OCR Package. Trial keys already include this package.

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2025 by Apryse Software Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------

require '../../../PDFNetC/Lib/PDFNetRuby'
include PDFNetRuby
require '../../LicenseKey/RUBY/LicenseKey'

$stdout.sync = true

# Relative path to the folder containing test files.
input_path = "../../TestFiles/OCR/"
output_path = "../../TestFiles/Output/"

#---------------------------------------------------------------------------------------
# The following sample illustrates how to use OCR module
#---------------------------------------------------------------------------------------

# The first step in every application using PDFNet is to initialize the
# library and set the path to common PDF resources. The library is usually
# initialized only once, but calling Initialize multiple times is also fine.
PDFNet.Initialize(PDFTronLicense.Key)

# The location of the OCR Module
PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")

begin

  # If the IRIS OCR module is available, use it instead of the default engine.
  use_iris = OCRModule.IsIRISModuleAvailable
  if !OCRModule.IsModuleAvailable
    puts 'Unable to run OCRTest: PDFTron SDK OCR module not available.'
    puts '---------------------------------------------------------------'
    puts 'The OCR module is an optional add-on, available for download'
    puts 'at https://dev.apryse.com/. If you have already downloaded this'
    puts 'module, ensure that the SDK is able to find the required files'
    puts 'using the PDFNet::AddResourceSearchPath() function.'

  else

    # Example 1) Process an image, specifying options: IRIS OCR module and English as the language of choice
    # --------------------------------------------------------------------------------

    # A) Setup empty destination doc
    doc = PDFDoc.new

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # C) Run OCR on the .png with options
    OCRModule.ImageToPDF(doc, input_path + "psychomachia_excerpt.png", opts)

    # D) Check the result
    doc.Save(output_path + "psychomachia_excerpt.pdf", 0)
    puts "Example 1: psychomachia_excerpt.png"

    doc.Close

    # Example 2) Process document using multiple languages
    # --------------------------------------------------------------------------------

    # A) Setup empty destination doc
    doc = PDFDoc.new

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. multiple target languages, English will always be considered as secondary language
    opts.AddLang("deu")
    opts.AddLang("fra")
    opts.AddLang("eng")

    # C) Run OCR on the .jpg with options
    OCRModule.ImageToPDF(doc, input_path + "multi_lang.jpg", opts)

    # D) Check the result
    doc.Save(output_path + "multi_lang.pdf", 0)
    puts "Example 2: multi_lang.jpg"

    doc.Close

    # Example 3) Process a .pdf specifying a language - German - and ignore zone comprising a sidebar image
    # --------------------------------------------------------------------------------

    # A) Open the .pdf document
    doc = PDFDoc.new(input_path + "german_kids_song.pdf")

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. German as the language of choice
    opts.AddLang("deu")

    # B.3. ignore zone comprising a sidebar image
    ignore_zones = RectCollection.new
    ignore_zones.AddRect(Rect.new(424, 163, 493, 730))
    opts.AddIgnoreZonesForPage(ignore_zones, 1)

    # C) Run OCR on the .pdf with options
    OCRModule.ProcessPDF(doc, opts)

    # D) check the result
    doc.Save(output_path + "german_kids_song.pdf", 0)
    puts "Example 3: german_kids_song.pdf"

    doc.Close

    # Example 4) Process multi-page tiff with text/ignore zones specified for each page,
    # optionally provide English as the target language
    # --------------------------------------------------------------------------------

    # A) Setup empty destination doc
    doc = PDFDoc.new

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # B.3 text/ignore zones
    ignore_zones = RectCollection.new

    # ignore signature box in the first 2 pages
    ignore_zones.AddRect(Rect.new(1492, 56, 2236, 432))
    opts.AddIgnoreZonesForPage(ignore_zones, 1)

    opts.AddIgnoreZonesForPage(ignore_zones, 2)

    # can use a combination of ignore and text boxes to focus on the page area of interest,
    # as ignore boxes are applied first, we remove the arrows before selecting part of the diagram
    ignore_zones.Clear
    ignore_zones.AddRect(Rect.new(992, 1276, 1368, 1372))
    opts.AddIgnoreZonesForPage(ignore_zones, 3)

    text_zones = RectCollection.new
    # we only have text zones selected in page 3

    # select horizontal BUFFER ZONE sign
    text_zones.AddRect(Rect.new(900, 2384, 1236, 2480))

    # select right vertical BUFFER ZONE sign
    text_zones.AddRect(Rect.new(1960, 1976, 2016, 2296))
    # select Lot No.
    text_zones.AddRect(Rect.new(696, 1028, 1196, 1128))

    # select part of the plan inside the BUFFER ZONE
    text_zones.AddRect(Rect.new(428, 1484, 1784, 2344))
    text_zones.AddRect(Rect.new(948, 1288, 1672, 1476))
    opts.AddTextZonesForPage(text_zones, 3)

    # C) Run OCR on the .tif with options
    OCRModule.ImageToPDF(doc, input_path + "bc_environment_protection.tif", opts)

    # D) check the result
    doc.Save(output_path + "bc_environment_protection.pdf", 0)
    puts "Example 4: bc_environment_protection.tif"

    doc.Close

    # Example 5) Alternative workflow for extracting OCR result JSON, postprocessing
    # (e.g., removing words not in the dictionary or filtering
    # out special characters), and finally applying modified OCR JSON to the source PDF document
    # --------------------------------------------------------------------------------

    # A) Open the .pdf document
    doc = PDFDoc.new(input_path + "zero_value_test_no_text.pdf")

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # C) Run OCR on the .pdf with options
    json = OCRModule.GetOCRJsonFromPDF(doc, opts)

    # D) Post-processing step (whatever it might be)
    puts "Have OCR result JSON, re-applying to PDF"
    OCRModule.ApplyOCRJsonToPDF(doc, json)

    # E) Check the result
    doc.Save(output_path + "zero_value_test_no_text.pdf", 0)
    puts "Example 5: extracting and applying OCR JSON from zero_value_test_no_text.pdf"

    doc.Close

    # Example 6) The postprocessing workflow also has an option of extracting OCR results in XML format,
    # similar to the one used by TextExtractor
    # --------------------------------------------------------------------------------

    # A) Setup empty destination doc
    doc = PDFDoc.new

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # C) Run OCR on the .tif with options, extracting OCR results in XML format. Note that
    # in the process we convert the source image into PDF.
    # We reuse this PDF document later to add hidden text layer to it.
    xml = OCRModule.GetOCRXmlFromImage(doc, input_path + "physics.tif", opts)

    # D) Post-processing step (whatever it might be)
    puts "Have OCR result XML, re-applying to PDF"
    OCRModule.ApplyOCRXmlToPDF(doc, xml)

    # E) Check the result
    doc.Save(output_path + "physics.pdf", 0)
    puts "Example 6: extracting and applying OCR XML from physics.tif"

    doc.Close

    # Example 7) Resolution can be manually set, when DPI is missing from metadata or is wrong
    # --------------------------------------------------------------------------------

    # A) Setup empty destination doc
    doc = PDFDoc.new

    # B) Setup options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. text zone (registered as a text zone, matching the comment and variable name)
    text_zones = RectCollection.new
    text_zones.AddRect(Rect.new(140, 870, 310, 920))
    opts.AddTextZonesForPage(text_zones, 1)

    # B.3 Manually override DPI
    opts.AddDPI(100)

    # C) Run OCR on the .jpg with options
    OCRModule.ImageToPDF(doc, input_path + "corrupted_dpi.jpg", opts)

    # D) Check the result
    doc.Save(output_path + "corrupted_dpi.pdf", 0)
    puts "Example 7: converting image with corrupted resolution metadata corrupted_dpi.jpg to pdf with searchable text"

    doc.Close

  end
rescue Exception => e
  puts e
end
PDFNet.Terminate
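
Each example in the sample repeats the same engine and language setup. As a minimal sketch, that setup could be factored into a small helper like the one below. The name configure_ocr_options and its keyword arguments are hypothetical and not part of the Apryse SDK; only SetOCREngine and AddLang are taken from the sample itself.

```ruby
# Hypothetical helper (not part of the Apryse SDK): applies the engine and
# language settings that the sample repeats in every example.
# `opts` is expected to respond to SetOCREngine and AddLang, as OCROptions
# does in the sample code.
def configure_ocr_options(opts, langs:, use_iris: false)
  # Select the IRIS engine only when the caller says it is available.
  opts.SetOCREngine("iris") if use_iris
  # Add the languages in the order given.
  langs.each { |lang| opts.AddLang(lang) }
  opts
end

# Possible usage inside the sample, once PDFNetRuby is loaded:
#   opts = configure_ocr_options(OCROptions.new, langs: %w[deu fra eng], use_iris: use_iris)
```

Passing the options object in, rather than creating it inside the helper, keeps the sketch independent of how the SDK is loaded.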
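
Example 5 leaves the post-processing step open ("whatever it might be"). The sketch below shows one possible filter: it drops words that contain no letters or digits before the JSON is re-applied. The function name is hypothetical, and the layout it assumes (Page > Para > Line > Word, with a "text" field on each word) is an assumption that should be checked against the JSON actually returned by GetOCRJsonFromPDF.

```ruby
require 'json'

# Hypothetical post-processing step: remove words without any letter or digit.
# ASSUMPTION: words are nested as Page -> Para -> Line -> Word and each word
# has a "text" field. Verify this against real OCR output before relying on it.
def drop_symbol_only_words(ocr_json)
  data = JSON.parse(ocr_json)
  Array(data["Page"]).each do |page|
    Array(page["Para"]).each do |para|
      Array(para["Line"]).each do |line|
        next unless line["Word"].is_a?(Array)
        # Keep only words whose text has at least one alphanumeric character.
        line["Word"].select! { |word| word["text"].to_s.match?(/[[:alnum:]]/) }
      end
    end
  end
  JSON.generate(data)
end
```

In Example 5 this would sit between the two OCR calls, for instance json = drop_symbol_only_words(json) just before OCRModule.ApplyOCRJsonToPDF(doc, json). Whether the SDK accepts JSON with words removed in this way has not been verified here.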
