Smart Data Extraction - Ruby Sample Code

This sample code shows how to use the Apryse Data Extraction module to extract tabular data, document structure, form fields, and key-value pairs from PDF documents. Sample code is provided in Python, C++, C# (.NET), Java, Node.js (JavaScript), PHP, Ruby, and VB.

To run this sample, you will need to:

  1. Get started with Server SDK in your language/framework
  2. Download the Data Extraction Module

Learn more about our Server SDK.

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2024 by Apryse Software Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------

require '../../../PDFNetC/Lib/PDFNetRuby'
include PDFNetRuby
require '../../LicenseKey/RUBY/LicenseKey'

$stdout.sync = true

#---------------------------------------------------------------------------------------
# The Data Extraction suite is an optional PDFNet add-on collection that can be used to
# extract various types of data from PDF documents.
#
# The Apryse SDK Data Extraction suite can be downloaded from
# https://docs.apryse.com/core/guides/info/modules#data-extraction-module
#
# Please contact us if you have any questions.
#---------------------------------------------------------------------------------------

# Relative paths to the folders containing the test files and the output.
$inputPath = "../../TestFiles/"
$outputPath = "../../TestFiles/Output/"

def main()
  # The first step in every application using PDFNet is to initialize the
  # library. The library is usually initialized only once, but calling
  # Initialize() multiple times is also fine.
  PDFNet.Initialize(PDFTronLicense.Key)

  PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")

  #-----------------------------------------------------------------------------------
  # The following sample illustrates how to extract tables from PDF documents.
  #-----------------------------------------------------------------------------------

  # Test if the add-on is installed
  if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_Tabular) then
    puts ""
    puts "Unable to run Data Extraction: Apryse SDK Tabular Data module not available."
    puts "-----------------------------------------------------------------------------"
    puts "The Data Extraction suite is an optional add-on, available for download"
    puts "at https://docs.apryse.com/core/guides/info/modules#data-extraction-module . If you have already"
    puts "downloaded this module, ensure that the SDK is able to find the required files"
    puts "using the PDFNet.AddResourceSearchPath() function."
    puts ""
  else
    begin
      # Extract tabular data as a JSON file
      puts "Extract tabular data as a JSON file"

      outputFile = $outputPath + "table.json"
      DataExtractionModule.ExtractData($inputPath + "table.pdf", outputFile, DataExtractionModule::E_Tabular)

      puts "Result saved in " + outputFile

      #------------------------------------------------------
      # Extract tabular data as a JSON string
      puts "Extract tabular data as a JSON string"

      outputFile = $outputPath + "financial.json"
      json = DataExtractionModule.ExtractData($inputPath + "financial.pdf", DataExtractionModule::E_Tabular)
      File.open(outputFile, 'w') { |file| file.write(json) }

      puts "Result saved in " + outputFile

      #------------------------------------------------------
      # Extract tabular data as an XLSX file
      puts "Extract tabular data as an XLSX file"

      outputFile = $outputPath + "table.xlsx"
      DataExtractionModule.ExtractToXLSX($inputPath + "table.pdf", outputFile)

      puts "Result saved in " + outputFile

      #------------------------------------------------------
      # Extract tabular data as an XLSX stream (also known as filter)
      puts "Extract tabular data as an XLSX stream"

      outputFile = $outputPath + "financial.xlsx"
      outputXlsxStream = MemoryFilter.new(0, false)
      options = DataExtractionOptions.new()
      options.SetPages("1") # page 1
      DataExtractionModule.ExtractToXLSX($inputPath + "financial.pdf", outputXlsxStream, options)
      outputXlsxStream.SetAsInputFilter()
      outputXlsxStream.WriteToFile(outputFile, false)

      puts "Result saved in " + outputFile
    rescue => error
      puts "Unable to extract tabular data, error: " + error.message
    end
  end

  #-----------------------------------------------------------------------------------
  # The following sample illustrates how to extract document structure from PDF documents.
  #-----------------------------------------------------------------------------------

  # Test if the add-on is installed
  if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_DocStructure) then
    puts ""
    puts "Unable to run Data Extraction: Apryse SDK Structured Output module not available."
    puts "-----------------------------------------------------------------------------"
    puts "The Data Extraction suite is an optional add-on, available for download"
    puts "at https://docs.apryse.com/documentation/core/info/modules/. If you have already"
    puts "downloaded this module, ensure that the SDK is able to find the required files"
    puts "using the PDFNet.AddResourceSearchPath() function."
    puts ""
  else
    begin
      # Extract document structure as a JSON file
      puts "Extract document structure as a JSON file"

      outputFile = $outputPath + "paragraphs_and_tables.json"
      DataExtractionModule.ExtractData($inputPath + "paragraphs_and_tables.pdf", outputFile, DataExtractionModule::E_DocStructure)

      puts "Result saved in " + outputFile

      #------------------------------------------------------
      # Extract document structure as a JSON string
      puts "Extract document structure as a JSON string"

      outputFile = $outputPath + "tagged.json"
      json = DataExtractionModule.ExtractData($inputPath + "tagged.pdf", DataExtractionModule::E_DocStructure)
      File.open(outputFile, 'w') { |file| file.write(json) }

      puts "Result saved in " + outputFile
    rescue => error
      puts "Unable to extract document structure data, error: " + error.message
    end
  end

  #-----------------------------------------------------------------------------------
  # The following sample illustrates how to extract form fields from PDF documents.
  #-----------------------------------------------------------------------------------

  # Test if the add-on is installed
  if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_Form) then
    puts ""
    puts "Unable to run Data Extraction: Apryse SDK AIFormFieldExtractor module not available."
    puts "-----------------------------------------------------------------------------"
    puts "The Data Extraction suite is an optional add-on, available for download"
    puts "at https://docs.apryse.com/documentation/core/info/modules/. If you have already"
    puts "downloaded this module, ensure that the SDK is able to find the required files"
    puts "using the PDFNet.AddResourceSearchPath() function."
    puts ""
  else
    begin
      # Extract form fields as a JSON file
      puts "Extract form fields as a JSON file"

      outputFile = $outputPath + "formfields-scanned.json"
      DataExtractionModule.ExtractData($inputPath + "formfields-scanned.pdf", outputFile, DataExtractionModule::E_Form)

      puts "Result saved in " + outputFile

      #------------------------------------------------------
      # Extract form fields as a JSON string
      puts "Extract form fields as a JSON string"

      outputFile = $outputPath + "formfields.json"
      json = DataExtractionModule.ExtractData($inputPath + "formfields.pdf", DataExtractionModule::E_Form)
      File.open(outputFile, 'w') { |file| file.write(json) }

      puts "Result saved in " + outputFile

      #-----------------------------------------------------------------------------------
      # Detect and add form fields to a PDF document.
      # The input PDF already has form fields; this example updates them to the newly detected fields.
      puts "Detect and add form fields to a PDF file"
      doc = PDFDoc.new($inputPath + "formfields-scanned-withfields.pdf")

      outputFile = $outputPath + "formfields-scanned-fields-new.pdf"

      DataExtractionModule.DetectAndAddFormFieldsToPDF(doc)
      doc.Save(outputFile, SDFDoc::E_linearized)
      doc.Close

      puts "Result saved in " + outputFile

      #-----------------------------------------------------------------------------------
      # Detect and add form fields to a PDF document.
      # The input PDF already has form fields; this example keeps the original fields.
      puts "Detect and add form fields to a PDF file, keeping the original fields"
      doc = PDFDoc.new($inputPath + "formfields-scanned-withfields.pdf")

      outputFile = $outputPath + "formfields-scanned-fields-old.pdf"

      options = DataExtractionOptions.new()
      options.SetOverlappingFormFieldBehavior("KeepOld")
      DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)
      doc.Save(outputFile, SDFDoc::E_linearized)
      doc.Close

      puts "Result saved in " + outputFile
    rescue => error
      puts "Unable to extract form fields data, error: " + error.message
    end
  end

  #-----------------------------------------------------------------------------------
  # The following sample illustrates how to extract key-value pairs from PDF documents.
  #-----------------------------------------------------------------------------------

  # Test if the add-on is installed
  if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_GenericKeyValue) then
    puts ""
    puts "Unable to run Data Extraction: Apryse SDK AIFormFieldExtractor module not available."
    puts "-----------------------------------------------------------------------------"
    puts "The Data Extraction suite is an optional add-on, available for download"
    puts "at https://docs.apryse.com/documentation/core/info/modules/. If you have already"
    puts "downloaded this module, ensure that the SDK is able to find the required files"
    puts "using the PDFNet.AddResourceSearchPath() function."
    puts ""
  else
    begin
      puts "Extract key-value pairs from a PDF"
      # Simple example: extract keys and values as a JSON file
      DataExtractionModule.ExtractData($inputPath + "newsletter.pdf", $outputPath + "newsletter_key_val.json", DataExtractionModule::E_GenericKeyValue)
      puts "Result saved in " + $outputPath + "newsletter_key_val.json"

      # Example with customized options:
      # extract keys and values from pages 2-4, excluding ads
      options = DataExtractionOptions.new()
      options.SetPages("2-4")

      p2_exclusion_zones = RectCollection.new()
      # Exclude the ad on page 2.
      # These coordinates are in PDF user space, with the origin at the bottom-left corner of the page.
      # Coordinates rotate with the page if it has rotation applied.
      p2_exclusion_zones.AddRect(Rect.new(166, 47, 562, 222))
      options.AddExclusionZonesForPage(p2_exclusion_zones, 2)

      p4_inclusion_zones = RectCollection.new()
      p4_exclusion_zones = RectCollection.new()
      # Only include the article text for page 4; exclude ads and headings
      p4_inclusion_zones.AddRect(Rect.new(30, 432, 562, 684))
      p4_exclusion_zones.AddRect(Rect.new(30, 657, 295, 684))
      options.AddInclusionZonesForPage(p4_inclusion_zones, 4)
      options.AddExclusionZonesForPage(p4_exclusion_zones, 4)
      puts "Extract key-value pairs from specific pages and zones as a JSON file"
      DataExtractionModule.ExtractData($inputPath + "newsletter.pdf", $outputPath + "newsletter_key_val_with_zones.json", DataExtractionModule::E_GenericKeyValue, options)
      puts "Result saved in " + $outputPath + "newsletter_key_val_with_zones.json"
    rescue => error
      puts "Unable to extract key-value data, error: " + error.message
    end
  end

  #-----------------------------------------------------------------------------------

  PDFNet.Terminate
  puts "Done."
end

main()
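
The zone rectangles in the key-value example are given in PDF user space, whose origin is the bottom-left corner of the page. If zone coordinates come from a tool that measures from the top-left corner instead, they need to be flipped vertically before being passed to `Rect.new`. The following is a minimal sketch of that conversion in plain Ruby. The helper name `top_left_to_pdf_rect`, the input values, and the 792-point page height (US Letter) are assumptions for illustration and are not part of the Apryse SDK; the sketch also assumes an unrotated page whose lower-left corner is at (0, 0).

```ruby
# Hypothetical helper (not part of the Apryse SDK): converts a rectangle
# measured from the top-left corner of a page into PDF user-space
# coordinates, whose origin is the bottom-left corner.
#
# x, y          - top-left corner of the rectangle, measured from the page's top-left
# width, height - size of the rectangle in points
# page_height   - height of the page in points
#
# Returns [x1, y1, x2, y2]: the lower-left corner followed by the upper-right corner.
# Assumes an unrotated page whose lower-left corner is at (0, 0).
def top_left_to_pdf_rect(x, y, width, height, page_height)
  x1 = x
  x2 = x + width
  y2 = page_height - y             # top edge, measured from the bottom of the page
  y1 = page_height - (y + height)  # bottom edge, measured from the bottom of the page
  [x1, y1, x2, y2]
end

# Assumed page height: US Letter, 792 points. The input values are made up.
rect = top_left_to_pdf_rect(166, 570, 396, 175, 792)
puts rect.inspect  # prints [166, 47, 562, 222]
```

With these made-up inputs the result happens to equal the page-2 exclusion zone used in the sample, so the four values could be passed on as `Rect.new(*rect)`.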
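
Several steps in the sample write their results as JSON files under the output folder. One way to sanity-check such a file before processing it further is to parse it with Ruby's standard `json` library and look only at its top-level shape. The sketch below deliberately makes no assumption about the schema produced by `DataExtractionModule`; the helper name `summarize_json` is hypothetical, and the demo writes a small stand-in file rather than real extraction output.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: parses a JSON file and returns a short description
# of its top-level structure without assuming any particular schema.
def summarize_json(path)
  data = JSON.parse(File.read(path))
  case data
  when Hash  then "object with keys: #{data.keys.join(', ')}"
  when Array then "array with #{data.length} element(s)"
  else            "scalar value: #{data.inspect}"
  end
end

# Demo with a stand-in file. The content is made up and is NOT the format
# produced by DataExtractionModule.
Dir.mktmpdir do |dir|
  path = File.join(dir, "example.json")
  File.write(path, JSON.generate({ "alpha" => [], "beta" => "stand-in" }))
  puts summarize_json(path)  # prints object with keys: alpha, beta
end
```

In practice the path would be one of the files written by the sample, for example `$outputPath + "table.json"`. `JSON.parse` raises `JSON::ParserError` on malformed input, which makes a truncated or empty output file easy to detect.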
