Extracting images from a PDF using JavaScript

To extract image content from a PDF document, traverse each page's display list and export every image element you encounter:

WebViewer({ fullAPI: true })
  .then(async instance => {
    const { PDFNet } = instance.Core;
    await PDFNet.initialize();
    // filename is the URL of the PDF to open (defined elsewhere)
    const doc = await PDFNet.PDFDoc.createFromURL(filename);
    const reader = await PDFNet.ElementReader.create();

    // Read page content on every page in the document
    const itr = await doc.getPageIterator();
    for (itr; await itr.hasNext(); itr.next())
    {
      // Read the page
      const page = await itr.current();
      reader.beginOnPage(page);
      await ProcessElements(reader);
      reader.end();
    }

    async function ProcessElements(reader)
    {
      // Traverse the page display list
      for (let element = await reader.next(); element !== null; element = await reader.next()) {
        const elementType = await element.getType();
        switch (elementType)
        {
          case PDFNet.Element.Type.e_image:
          {
            const image = await PDFNet.Image.createFromObj(await element.getXObject());
            // output_stream is not defined in this snippet; create it before traversing
            await image.exportFromStream(output_stream); // or exportAsTiffFromStream or exportAsPngFromStream
            // optionally, you can also extract uncompressed/compressed
            // image data directly using element.getImageData()
            break; // without this, an image would fall through into the form case
          }
          case PDFNet.Element.Type.e_form:
          {
            // Open the Form XObject's child display list and process it recursively
            reader.formBegin();
            await ProcessElements(reader); // wait for the child list before closing it
            reader.end();
            break;
          }
        }
      }
    }
  })

If your WebViewer version exposes PDFNet directly on the instance rather than under instance.Core, replace the destructuring line with const { PDFNet } = instance; the rest of the code is unchanged.

PDF image extraction
Full code sample which illustrates a few approaches to PDF image extraction.

About reading page content

Page content is represented as a sequence of graphical Elements such as paths, text, images, and forms. The only effect of the ordering of Elements in the display list is the order in which Elements are painted. Elements that occur later in the display list can obscure earlier elements.

A display list can be traversed using an ElementReader object. To start traversing the display list of a page, call reader.beginOnPage(page). Each subsequent call to reader.next() returns the next Element, until null is returned to mark the end of the display list.

While ElementReader only works with one page at a time, the same ElementReader object may be reused to process multiple pages.
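The begin/next/end protocol and the reuse of one reader across pages can be sketched without loading a PDF. The example below is only an illustration: MockReader, makeElement, listElementTypes, and listAllPages are hypothetical names, and the mock merely imitates the three calls described above (beginOnPage, next, end). It is not the PDFNet ElementReader itself.

```javascript
// Hypothetical stand-in for an element reader (NOT the PDFNet class).
// It imitates the three calls used in the text: beginOnPage, next, end.
class MockReader {
  constructor() { this.queue = null; }
  async beginOnPage(page) { this.queue = [...page.elements]; }
  // Returns the next element in paint order, or null at the end of the list.
  async next() { return this.queue.length > 0 ? this.queue.shift() : null; }
  async end() { this.queue = null; }
}

// Hypothetical element factory: each element only knows its type.
function makeElement(type) {
  return { getType: async () => type };
}

// Walk one page's display list in paint order and collect the element types.
async function listElementTypes(reader, page) {
  const types = [];
  await reader.beginOnPage(page);
  for (let el = await reader.next(); el !== null; el = await reader.next()) {
    types.push(await el.getType());
  }
  await reader.end();
  return types;
}

// Reuse the same reader object for every page, one page at a time.
async function listAllPages(reader, pages) {
  const result = [];
  for (const page of pages) {
    result.push(await listElementTypes(reader, page));
  }
  return result;
}
```

Calling listAllPages(new MockReader(), pages) with two mock pages returns one array of types per page, in the order the elements would be painted. With the real API, the same loop shape appears in the sample at the top of this page.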

About Form XObjects, Type3 font glyphs, and tiling patterns

A PDF page display list may contain child display lists of Form XObjects, Type3 font glyphs, and tiling patterns. A Form XObject is a self-contained description of any sequence of graphics objects (such as path objects, text objects, and sampled images), defined as a PDF content stream. It may be painted multiple times, either on several pages or at several locations on the same page, and will produce the same results each time (subject only to the graphics state at the time the Form XObject is painted). To open a child display list for a Form XObject, call the reader.formBegin() method. To return processing to the parent display list, call reader.end(). Processing of the Form XObject display list (traversing the child display list) is illustrated in the sample code above.

Note that, in the above sample code, when the reader encounters an element of type PDFNet.Element.Type.e_form, a child display list is opened by calling reader.formBegin(). The child display list becomes the current display list until it is closed using reader.end(). At that point processing returns to the parent display list, and the next Element returned is the one following the Form XObject. Because Form XObjects may be nested, a child display list can have child display lists of its own. The sample above traverses these nested Form XObjects recursively.
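The way a child display list temporarily replaces the parent, and the parent resumes afterwards, can be shown with a small mock. This is a sketch under stated assumptions: NestedMockReader, makeImage, makeForm, and collectImageNames are hypothetical names, and the mock keeps a stack of display lists to imitate formBegin() and end(). It is not how the PDFNet reader is implemented.

```javascript
// Hypothetical reader with nested display lists (NOT the PDFNet class).
class NestedMockReader {
  constructor() { this.stack = []; this.last = null; }
  async beginOnPage(page) { this.stack = [{ items: page.elements, pos: 0 }]; }
  // Returns the next element of the current (topmost) list, or null at its end.
  async next() {
    const top = this.stack[this.stack.length - 1];
    this.last = top.pos < top.items.length ? top.items[top.pos++] : null;
    return this.last;
  }
  // Open the child display list of the form element most recently returned.
  async formBegin() { this.stack.push({ items: this.last.children, pos: 0 }); }
  // Close the current list; the parent resumes right after the form element.
  async end() { this.stack.pop(); }
}

// Hypothetical element factories.
const makeImage = (name) => ({ name, getType: async () => 'image' });
const makeForm = (children) => ({ children, getType: async () => 'form' });

// Recursively collect image names, descending into nested forms.
async function collectImageNames(reader) {
  const names = [];
  for (let el = await reader.next(); el !== null; el = await reader.next()) {
    const type = await el.getType();
    if (type === 'image') {
      names.push(el.name);
    } else if (type === 'form') {
      await reader.formBegin();
      // Await the recursion so the child list is fully read before it is closed.
      names.push(...(await collectImageNames(reader)));
      await reader.end();
    }
  }
  return names;
}
```

For a page laid out as image a, a form containing image b, a nested form with image c, then image d, followed by image e at the top level, the function returns a, b, c, d, e. Images d and e appear after the nested content, which shows each parent list resuming at the element that follows its form.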

Similarly, a pattern display list can be opened using the reader.patternBegin() method, and a Type3 glyph display list can be opened using the reader.type3FontBegin() method.
