Extract Images from PDFs - Ruby Sample Code

Sample code for using Apryse SDK to extract images from PDF files, along with their positioning information and DPI; provided in Python, C++, C#, Java, Node.js (JavaScript), PHP, Ruby and VB. Instead of converting PDF images to a Bitmap, you can also extract uncompressed/compressed image data directly using element.GetImageData() (described in the PDF Data Extraction code sample).
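The sample on this page prints two corner coordinates taken from each image's transformation matrix, but it does not compute DPI itself. The sketch below shows one way that figure could be derived. It assumes the usual PDF convention that an image is painted into a unit square mapped onto the page by the matrix, and that page units are 1/72 inch (no custom UserUnit). `Ctm` and `image_placement` are hypothetical plain-Ruby names used only for illustration; they are not part of the Apryse API.

```ruby
# Hypothetical stand-in for a PDF transformation matrix [a b c d h v].
# It is NOT the SDK's matrix class; it only mirrors the arithmetic.
Ctm = Struct.new(:a, :b, :c, :d, :h, :v) do
  # Map a point (x, y) from image space (unit square) to page space.
  def mult(x, y)
    [a * x + c * y + h, b * x + d * y + v]
  end
end

# Returns the four page-space corners of the image and its effective
# resolution, assuming 72 page units per inch.
def image_placement(ctm, pixel_width, pixel_height)
  # The lengths of the transformed unit vectors are the rendered
  # width and height in points, even when the image is rotated.
  width_pt  = Math.hypot(ctm.a, ctm.b)
  height_pt = Math.hypot(ctm.c, ctm.d)
  {
    corners: [ctm.mult(0, 0), ctm.mult(1, 0), ctm.mult(1, 1), ctm.mult(0, 1)],
    dpi_x: pixel_width * 72.0 / width_pt,
    dpi_y: pixel_height * 72.0 / height_pt
  }
end

# A 600x300 pixel image drawn 2 inches wide and 1 inch tall.
ctm = Ctm.new(144.0, 0.0, 0.0, 72.0, 36.0, 500.0)
info = image_placement(ctm, 600, 300)
puts "DPI: %.0f x %.0f" % [info[:dpi_x], info[:dpi_y]]
```

In the sample, the pixel dimensions would come from element.GetImageWidth() and element.GetImageHeight(), and the matrix entries from element.GetCTM(). A degenerate matrix (zero width or height) would need a guard before dividing.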

Learn more about our full PDF Data Extraction SDK Capabilities.

To start your free trial, get started with Server SDK.

#---------------------------------------------------------------------------------------
# Copyright (c) 2001-2023 by Apryse Software Inc. All Rights Reserved.
# Consult LICENSE.txt regarding license information.
#---------------------------------------------------------------------------------------

require '../../../PDFNetC/Lib/PDFNetRuby'
include PDFNetRuby
require '../../LicenseKey/RUBY/LicenseKey'

$stdout.sync = true

#-----------------------------------------------------------------------------------
# This sample illustrates one approach to PDF image extraction
# using PDFNet.
#
# Note: Besides direct image export, you can also convert PDF images
# to GDI+ Bitmap, or extract uncompressed/compressed image data directly
# using element.GetImageData() (e.g. as illustrated in ElementReaderAdv
# sample project).
#-----------------------------------------------------------------------------------

$image_counter = 0

# Relative path to the folder containing the test files.
$input_path = "../../TestFiles/"
$output_path = "../../TestFiles/Output/"

def ImageExtract(reader)
  element = reader.Next()
  while !(element.nil?) do
    if (element.GetType() == Element::E_image or
        element.GetType() == Element::E_inline_image)

      $image_counter = $image_counter + 1
      puts "--> Image: " + $image_counter.to_s()
      puts " Width: " + element.GetImageWidth().to_s()
      puts " Height: " + element.GetImageHeight().to_s()
      puts " BPC: " + element.GetBitsPerComponent().to_s()

      ctm = element.GetCTM()
      x2 = 1
      y2 = 1
      pt = Point.new(x2, y2)
      point = ctm.Mult(pt)
      puts " Coords: x1=%.2f, y1=%.2f, x2=%.2f, y2=%.2f" % [ctm.m_h, ctm.m_v, point.x, point.y]

      if element.GetType() == Element::E_image
        image = Image.new(element.GetXObject())

        fname = "image_extract1_" + $image_counter.to_s()

        path = $output_path + fname
        image.Export(path)

        #path = $output_path + fname + ".tif"
        #image.ExportAsTiff(path)

        #path = $output_path + fname + ".png"
        #image.ExportAsPng(path)
      end
    elsif element.GetType() == Element::E_form
      reader.FormBegin()
      ImageExtract(reader)
      reader.End()
    end
    element = reader.Next()
  end
end

# Initialize PDFNet
PDFNet.Initialize(PDFTronLicense.Key)

# Example 1:
# Extract images by traversing the display list for
# every page. With this approach it is possible to obtain
# image positioning information and DPI.

doc = PDFDoc.new($input_path + "newsletter.pdf")
doc.InitSecurityHandler()

reader = ElementReader.new()

# Read every page
itr = doc.GetPageIterator()
while itr.HasNext() do
  reader.Begin(itr.Current())
  ImageExtract(reader)
  reader.End()
  itr.Next()
end

doc.Close()

puts "Done."
puts "----------------------------------------------------------------"

# Example 2:
# Extract images by scanning the low-level document.

doc = PDFDoc.new($input_path + "newsletter.pdf")
doc.InitSecurityHandler()
$image_counter = 0

cos_doc = doc.GetSDFDoc()
num_objs = cos_doc.XRefSize()
i = 1
while i < num_objs do
  obj = cos_doc.GetObj(i)

  if !(obj.nil?) and !(obj.IsFree()) and obj.IsStream()
    # Process only images
    itr = obj.Find("Type")

    if !(itr.HasNext()) or !(itr.Value().GetName() == "XObject")
      i = i + 1
      next
    end

    itr = obj.Find("Subtype")
    if !(itr.HasNext()) or !(itr.Value().GetName() == "Image")
      i = i + 1
      next
    end

    image = Image.new(obj)
    $image_counter = $image_counter + 1
    puts "--> Image: " + $image_counter.to_s()
    puts " Width: " + image.GetImageWidth().to_s()
    puts " Height: " + image.GetImageHeight().to_s()
    puts " BPC: " + image.GetBitsPerComponent().to_s()

    fname = "image_extract2_" + $image_counter.to_s()

    path = $output_path + fname
    image.Export(path)

    #path = $output_path + fname + ".tif"
    #image.ExportAsTiff(path)

    #path = $output_path + fname + ".png"
    #image.ExportAsPng(path)
  end
  i = i + 1
end
doc.Close()
PDFNet.Terminate
puts "Done."
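
Example 2 above keeps an object only when it is a non-free stream whose dictionary has Type set to XObject and Subtype set to Image. The PDF specification makes the Type entry optional for XObjects, so an image stream that omits it would be skipped by that check. The sketch below models the filter with plain Ruby hashes to make the difference visible. The hash layout and the `image_xobject?` helper are hypothetical stand-ins, not SDK objects, and whether the relaxed check is appropriate depends on the documents being processed.

```ruby
# Hypothetical stand-ins for entries in a cross-reference table.
# :free and :stream mimic IsFree()/IsStream(); :dict mimics the
# stream dictionary. nil models a missing object.
objects = [
  { free: true,  stream: false, dict: {} },
  { free: false, stream: true,  dict: { "Type" => "XObject", "Subtype" => "Image" } },
  { free: false, stream: true,  dict: { "Type" => "XObject", "Subtype" => "Form" } },
  { free: false, stream: false, dict: { "Type" => "Page" } },
  { free: false, stream: true,  dict: { "Subtype" => "Image" } }, # Type omitted
  nil
]

# With require_type: true this mirrors the check used in Example 2.
# With require_type: false a missing Type entry is tolerated, but a
# Type with any value other than XObject is still rejected.
def image_xobject?(obj, require_type: true)
  return false if obj.nil? || obj[:free] || !obj[:stream]
  dict = obj[:dict]
  return false unless dict["Subtype"] == "Image"
  return dict["Type"] == "XObject" if require_type
  dict["Type"].nil? || dict["Type"] == "XObject"
end

strict  = (0...objects.size).select { |i| image_xobject?(objects[i]) }
lenient = (0...objects.size).select { |i| image_xobject?(objects[i], require_type: false) }

puts "strict:  #{strict.inspect}"
puts "lenient: #{lenient.inspect}"
```

Only the object at index 4 is treated differently by the two modes, since it is the one image stream without a Type entry.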
