OCR to search PDFs and Extract Text - Ruby Sample Code

This sample shows how to use the Apryse Server OCR module on scanned documents in multiple languages. It is provided in Python, C++, C# (.NET), Java, Node.js (JavaScript), PHP, Ruby and VB. The OCR module can make PDFs searchable and extract scanned text for further indexing.

To run this sample, you will need to:

  1. Get started with the Server SDK in your language/framework
  2. Download the OCR Module

Learn more about our Server SDK.

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2025 by Apryse Software Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------

require '../../../PDFNetC/Lib/PDFNetRuby'
include PDFNetRuby
require '../../LicenseKey/RUBY/LicenseKey'

$stdout.sync = true

# Relative path to the folder containing test files.
input_path = "../../TestFiles/OCR/"
output_path = "../../TestFiles/Output/"

#---------------------------------------------------------------------------------------
# The following sample illustrates how to use the OCR module
#---------------------------------------------------------------------------------------

# The first step in every application using PDFNet is to initialize the
# library and set the path to common PDF resources. The library is usually
# initialized only once, but calling Initialize multiple times is also fine.
PDFNet.Initialize(PDFTronLicense.Key)

# The location of the OCR Module
PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")

# Run the OCR examples below; errors are printed by the rescue clause at the end.

begin

  # If the IRIS OCR module is available, use it instead of the default engine
  use_iris = OCRModule.IsIRISModuleAvailable
  if !OCRModule.IsModuleAvailable
    puts 'Unable to run OCRTest: PDFTron SDK OCR module not available.'
    puts '---------------------------------------------------------------'
    puts 'The OCR module is an optional add-on, available for download'
    puts 'at https://dev.apryse.com/. If you have already downloaded this'
    puts 'module, ensure that the SDK is able to find the required files'
    puts 'using the PDFNet::AddResourceSearchPath() function.'

  else

    # Example 1) Process an image with explicit options: the IRIS OCR module (if available) and English as the language
    # --------------------------------------------------------------------------------

    # A) Set up an empty destination doc
    doc = PDFDoc.new

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # C) Run OCR on the .png with options
    OCRModule.ImageToPDF(doc, input_path + "psychomachia_excerpt.png", opts)

    # D) Check the result
    doc.Save(output_path + "psychomachia_excerpt.pdf", 0)
    puts "Example 1: psychomachia_excerpt.png"

    doc.Close

    # Example 2) Process a document using multiple languages
    # --------------------------------------------------------------------------------

    # A) Set up an empty destination doc
    doc = PDFDoc.new

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. Multiple target languages; English is always considered as a secondary language
    opts.AddLang("deu")
    opts.AddLang("fra")
    opts.AddLang("eng")

    # C) Run OCR on the .jpg with options
    OCRModule.ImageToPDF(doc, input_path + "multi_lang.jpg", opts)

    # D) Check the result
    doc.Save(output_path + "multi_lang.pdf", 0)
    puts "Example 2: multi_lang.jpg"

    doc.Close

    # Example 3) Process a .pdf, specifying German as the language and an ignore zone covering a sidebar image
    # --------------------------------------------------------------------------------

    # A) Open the .pdf document
    doc = PDFDoc.new(input_path + "german_kids_song.pdf")

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. German as the language of choice
    opts.AddLang("deu")

    # B.3. Ignore zone covering a sidebar image
    ignore_zones = RectCollection.new
    ignore_zones.AddRect(Rect.new(424, 163, 493, 730))
    opts.AddIgnoreZonesForPage(ignore_zones, 1)

    # C) Run OCR on the .pdf with options
    OCRModule.ProcessPDF(doc, opts)

    # D) Check the result
    doc.Save(output_path + "german_kids_song.pdf", 0)
    puts "Example 3: german_kids_song.pdf"

    doc.Close

    # Example 4) Process a multi-page .tif with text/ignore zones specified for each page,
    # optionally providing English as the target language
    # --------------------------------------------------------------------------------

    # A) Set up an empty destination doc
    doc = PDFDoc.new

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # B.3. Text/ignore zones
    ignore_zones = RectCollection.new

    # Ignore the signature box on the first 2 pages
    ignore_zones.AddRect(Rect.new(1492, 56, 2236, 432))
    opts.AddIgnoreZonesForPage(ignore_zones, 1)

    opts.AddIgnoreZonesForPage(ignore_zones, 2)

    # A combination of ignore and text boxes can focus OCR on the page area of interest.
    # Because ignore boxes are applied first, the arrows are removed before part of the diagram is selected.
    ignore_zones.Clear
    ignore_zones.AddRect(Rect.new(992, 1276, 1368, 1372))
    opts.AddIgnoreZonesForPage(ignore_zones, 3)

    text_zones = RectCollection.new
    # Text zones are selected only on page 3

    # Select the horizontal BUFFER ZONE sign
    text_zones.AddRect(Rect.new(900, 2384, 1236, 2480))

    # Select the right vertical BUFFER ZONE sign
    text_zones.AddRect(Rect.new(1960, 1976, 2016, 2296))
    # Select Lot No.
    text_zones.AddRect(Rect.new(696, 1028, 1196, 1128))

    # Select part of the plan inside the BUFFER ZONE
    text_zones.AddRect(Rect.new(428, 1484, 1784, 2344))
    text_zones.AddRect(Rect.new(948, 1288, 1672, 1476))
    opts.AddTextZonesForPage(text_zones, 3)

    # C) Run OCR on the .tif with options
    OCRModule.ImageToPDF(doc, input_path + "bc_environment_protection.tif", opts)

    # D) Check the result
    doc.Save(output_path + "bc_environment_protection.pdf", 0)
    puts "Example 4: bc_environment_protection.tif"

    doc.Close

    # Example 5) Alternative workflow: extract the OCR result as JSON, post-process it
    # (e.g., remove words not in the dictionary or filter out special characters),
    # and finally apply the modified OCR JSON to the source PDF document
    # --------------------------------------------------------------------------------

    # A) Open the .pdf document
    doc = PDFDoc.new(input_path + "zero_value_test_no_text.pdf")

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # C) Run OCR on the .pdf with options
    json = OCRModule.GetOCRJsonFromPDF(doc, opts)

    # D) Post-processing step (whatever it might be)
    puts "Have OCR result JSON, re-applying to PDF"
    OCRModule.ApplyOCRJsonToPDF(doc, json)

    # E) Check the result
    doc.Save(output_path + "zero_value_test_no_text.pdf", 0)
    puts "Example 5: extracting and applying OCR JSON from zero_value_test_no_text.pdf"

    doc.Close

    # Example 6) The post-processing workflow also has an option of extracting OCR results in XML format,
    # similar to the one used by TextExtractor
    # --------------------------------------------------------------------------------

    # A) Set up an empty destination doc
    doc = PDFDoc.new

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. English as the language of choice
    opts.AddLang("eng")

    # C) Run OCR on the .tif with options, extracting OCR results in XML format. Note that
    # in the process the source image is converted into a PDF.
    # This PDF document is reused later to add a hidden text layer to it.
    xml = OCRModule.GetOCRXmlFromImage(doc, input_path + "physics.tif", opts)

    # D) Post-processing step (whatever it might be)
    puts "Have OCR result XML, re-applying to PDF"
    OCRModule.ApplyOCRXmlToPDF(doc, xml)

    # E) Check the result
    doc.Save(output_path + "physics.pdf", 0)
    puts "Example 6: extracting and applying OCR XML from physics.tif"

    doc.Close

    # Example 7) Resolution can be set manually when DPI is missing from the metadata or is wrong
    # --------------------------------------------------------------------------------

    # A) Set up an empty destination doc
    doc = PDFDoc.new

    # B) Set up options with:
    opts = OCROptions.new

    # B.1. IRIS OCR module, if available
    if use_iris
      opts.SetOCREngine("iris")
    end

    # B.2. Text zone (registered as a text zone, matching the comment and variable name)
    text_zones = RectCollection.new
    text_zones.AddRect(Rect.new(140, 870, 310, 920))
    opts.AddTextZonesForPage(text_zones, 1)

    # B.3. Manually override DPI
    opts.AddDPI(100)

    # C) Run OCR on the .jpg with options
    OCRModule.ImageToPDF(doc, input_path + "corrupted_dpi.jpg", opts)

    # D) Check the result
    doc.Save(output_path + "corrupted_dpi.pdf", 0)
    puts "Example 7: converting image with corrupted resolution metadata corrupted_dpi.jpg to pdf with searchable text"

    doc.Close

  end
rescue Exception => e
  puts e

end
PDFNet.Terminate
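
Every example above repeats the same setup: create the options, select the IRIS engine when it is available, then add one or more languages. The following is a minimal sketch of how that repetition could be factored out. `configure_ocr_options` and `FakeOptions` are hypothetical names and are not part of the SDK. The helper only calls `SetOCREngine` and `AddLang`, the two methods the sample itself uses, and the stand-in class exists only so the sketch runs without the SDK installed.

```ruby
# Hypothetical helper (not part of the Apryse SDK): applies the engine and
# language settings that each example in the sample sets by hand.
# `opts` is any object responding to SetOCREngine and AddLang, such as the
# OCROptions instance created in the sample.
def configure_ocr_options(opts, languages:, use_iris: false)
  opts.SetOCREngine("iris") if use_iris
  languages.each { |lang| opts.AddLang(lang) }
  opts
end

# Stand-in used only to exercise the helper without the SDK.
# It records the calls it receives instead of configuring a real engine.
class FakeOptions
  attr_reader :calls

  def initialize
    @calls = []
  end

  def SetOCREngine(name)
    @calls << [:SetOCREngine, name]
  end

  def AddLang(lang)
    @calls << [:AddLang, lang]
  end
end

# Mirror the settings of Example 2 and print the recorded calls.
recorded = configure_ocr_options(FakeOptions.new, languages: %w[deu fra eng], use_iris: true).calls
recorded.each { |call| puts call.inspect }
```

With the SDK loaded, the setup in Example 2 could then plausibly shrink to `opts = configure_ocr_options(OCROptions.new, languages: %w[deu fra eng], use_iris: use_iris)`.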

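Example 5 leaves the post-processing step open. As one possible illustration, the sketch below drops words that contain no letters or digits from an OCR JSON string before it is re-applied. The JSON layout used here (`Page` > `Para` > `Line` > `Word`, each word carrying a `text` field) is an assumption made for illustration, and the input is made up. Inspect the string returned by `GetOCRJsonFromPDF` and adjust the keys before relying on it. `drop_symbol_only_words` is a hypothetical name.

```ruby
require 'json'

# Hypothetical post-processing step: remove words whose text has no letters
# or digits. ASSUMPTION: the OCR JSON nests Page > Para > Line > Word and each
# word has a "text" key; verify this against real output from the module.
def drop_symbol_only_words(ocr_json)
  data = JSON.parse(ocr_json)
  Array(data["Page"]).each do |page|
    Array(page["Para"]).each do |para|
      Array(para["Line"]).each do |line|
        # Leave lines without a word list untouched.
        next unless line["Word"].is_a?(Array)
        line["Word"] = line["Word"].select { |word| word["text"].to_s.match?(/[[:alnum:]]/) }
      end
    end
  end
  JSON.generate(data)
end

# Made-up input in the assumed layout, not real OCR output.
sample = {
  "Page" => [
    { "num" => 1,
      "Para" => [
        { "Line" => [
          { "Word" => [{ "text" => "Total" }, { "text" => "~~" }, { "text" => "0.00" }] }
        ] }
      ] }
  ]
}.to_json

cleaned = drop_symbol_only_words(sample)
puts cleaned
```

In the sample, the cleaned string would take the place of `json` in step D, for instance `OCRModule.ApplyOCRJsonToPDF(doc, drop_symbol_only_words(json))`, provided the assumed layout matches the real output.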