Extracting text from PDF documents using JavaScript

Text extraction is based on an in-house heuristic algorithm that attempts to find the human-readable reading order of a document. The reading order is determined by a number of factors, such as spacing, font size, and font type. Text extraction is challenging because the PDF specification has no clear definition of semantic information or logical structure.

The reading order for text extraction is not defined in the ISO PDF standard. In fact, a typical PDF file has no concept of sentences, paragraphs, tables, or anything similar. Each PDF vendor is therefore left to design its own solution, and different vendors will extract text with some differences. As a result, the extracted order is not guaranteed to match the order in which a typical reader would follow the document.

A magazine, a newspaper article, and an academic article all have quite different reading orders, which come from the placement and ordering of text on the page rather than from semantic information in the PDF. Different users may also have different expectations of what the correct reading order is.
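To see why reading order is ambiguous, consider a toy sketch that sorts positioned text fragments top to bottom and then left to right. This is not the heuristic WebViewer uses; the fragment shape `{ text, x, y }` and the function name `naiveReadingOrder` are hypothetical and exist only for illustration.

```javascript
// Toy illustration only: sort fragments by line (y), then by x.
// Assumes y grows downward and fragments on the same line share the same y.
function naiveReadingOrder(fragments) {
  return [...fragments]
    .sort((a, b) => a.y - b.y || a.x - b.x)
    .map(f => f.text)
    .join(' ');
}

// A two-column page: left column at x = 0, right column at x = 300.
const twoColumnPage = [
  { text: 'Left-1', x: 0, y: 0 },
  { text: 'Left-2', x: 0, y: 20 },
  { text: 'Right-1', x: 300, y: 0 },
  { text: 'Right-2', x: 300, y: 20 },
];

// The naive sort reads across both columns instead of down each one.
console.log(naiveReadingOrder(twoColumnPage));
```

A reader would expect the left column to be finished before the right one starts, which is the kind of layout decision a reading-order heuristic has to make without any help from the file format.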

Use the loadPageText API to capture the text of a document page.

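// Releases that expose core objects under instance.Core
const wvElement = document.getElementById('viewer');
WebViewer({ ...options }, wvElement)
  .then(instance => {
    const { documentViewer } = instance.Core;

    // Wait for the document to load before requesting its text
    documentViewer.addEventListener('documentLoaded', async () => {
      const pageNumber = 1; // Extract the text on the first page
      const doc = documentViewer.getDocument();

      const text = await doc.loadPageText(pageNumber);
      // .. do something with text
      console.log(text);
    });
  });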

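// Older releases: docViewer is exposed directly on the instance
const wvElement = document.getElementById('viewer');
WebViewer({ ...options }, wvElement)
  .then(instance => {
    const { docViewer } = instance;

    // Wait for the document to load before requesting its text
    docViewer.on('documentLoaded', async () => {
      const pageNumber = 1; // Extract the text on the first page
      const doc = docViewer.getDocument();

      const text = await doc.loadPageText(pageNumber);
      // .. do something with text
      console.log(text);
    });
  });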

// Older callback-based releases: loadPageText takes a 0-based page index
const wvElement = document.getElementById('viewer');
WebViewer({ ...options }, wvElement)
  .then(instance => {
    const { docViewer } = instance;

    // Wait for the document to load before requesting its text
    docViewer.on('documentLoaded', () => {
      const pageIndex = 0; // Extract the text on the first page
      const doc = docViewer.getDocument();

      // Accepts a 0-based page index
      doc.loadPageText(pageIndex, text => {
        // .. do something with text
        console.log(text);
      });
    });
  });
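To capture the text of a whole document rather than a single page, the same call can be repeated for each page. The sketch below assumes the promise-based form of loadPageText with 1-based page numbers, and that the document object also provides getPageCount; the helper name `extractAllText` is hypothetical.

```javascript
// Collects the text of every page, in page order.
// `doc` is assumed to be the object returned by documentViewer.getDocument(),
// with a promise-based loadPageText that takes 1-based page numbers.
async function extractAllText(doc) {
  const pageCount = doc.getPageCount();
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
    // Pages are loaded one at a time to keep the order predictable.
    pages.push(await doc.loadPageText(pageNumber));
  }
  return pages;
}
```

Inside a documentLoaded handler, this could be called as `const pages = await extractAllText(documentViewer.getDocument());`.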

Advanced text extraction from a page region

To extract text from a specific region of a page, enable the full API and use the TextExtractor class, passing a rectangle that limits extraction to that region.

WebViewer({
  fullAPI: true,
  // Other instantiation options
})
  .then(instance => {
    const { PDFNet, documentViewer } = instance.Core;

    documentViewer.addEventListener('documentLoaded', async () => {
      await PDFNet.initialize();
      const doc = await documentViewer.getDocument().getPDFDoc();
      const firstPage = await doc.getPage(1);

      const txt = await PDFNet.TextExtractor.create();
      const rect = new PDFNet.Rect(0, 0, 612, 794);
      txt.begin(firstPage, rect); // Read the region of the page inside rect.

      // Walk the extracted lines, then the words on each line.
      let line = await txt.getFirstLine();
      for (; (await line.isValid()); line = (await line.getNextLine()))
      {
          for (let word = await line.getFirstWord(); (await word.isValid()); word = (await word.getNextWord()))
          {
              // await word.getString();
          }
      }
    });
  });

WebViewer({
  fullAPI: true,
  // Other instantiation options
})
  .then(instance => {
    const { PDFNet, docViewer } = instance;
    // The handler must be async because it awaits PDFNet calls.
    docViewer.on('documentLoaded', async () => {
      await PDFNet.initialize();
      const doc = await docViewer.getDocument().getPDFDoc();
      const firstPage = await doc.getPage(1);
      const txt = await PDFNet.TextExtractor.create();
      const rect = new PDFNet.Rect(0, 0, 612, 794);
      txt.begin(firstPage, rect); // Read the region of the page inside rect.
      // Walk the extracted lines, then the words on each line.
      let line = await txt.getFirstLine();
      for (; (await line.isValid()); line = (await line.getNextLine()))
      {
          for (let word = await line.getFirstWord(); (await word.isValid()); word = (await word.getNextWord()))
          {
              // await word.getString();
          }
      }
    });
  });
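The nested loops over lines and words can be wrapped in a small helper that returns the words instead of discarding them. This is a minimal sketch, assuming a TextExtractor on which begin() has already been called; it uses only the line and word methods shown in the sample, and the name `collectWordsByLine` is hypothetical.

```javascript
// Walks a TextExtractor that has already been started with begin()
// and returns the words grouped by line, e.g. [['Hello', 'world'], ['Bye']].
async function collectWordsByLine(txt) {
  const lines = [];
  for (let line = await txt.getFirstLine(); await line.isValid(); line = await line.getNextLine()) {
    const words = [];
    for (let word = await line.getFirstWord(); await word.isValid(); word = await word.getNextWord()) {
      words.push(await word.getString());
    }
    lines.push(words);
  }
  return lines;
}
```

Joining each inner array with a space and the outer array with a newline gives a plain-text rendering of the region.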

Sample: Read a PDF File (Parse & Extract Text). Full sample code that illustrates the basic text extraction capabilities.

Extract text under an annotation

To extract the text under an annotation, wait until all annotations are loaded, import them into the PDFNet document, and then call getTextUnderAnnot on a TextExtractor.

WebViewer({
  fullAPI: true,
  // Other instantiation options
})
  .then(instance => {
    // Under instance.Core the annotation manager is named annotationManager
    const { PDFNet, documentViewer, annotationManager } = instance.Core;
    documentViewer.addEventListener('annotationsLoaded', async () => {
      await PDFNet.initialize();
      const doc = await documentViewer.getDocument().getPDFDoc();
      // export annotations from the document
      const annots = await annotationManager.exportAnnotations();
      // Run PDFNet methods with memory management
      await PDFNet.runWithCleanup(async () => {
        // lock the document before a write operation
        // runWithCleanup will auto unlock when complete
        doc.lock();
        // import annotations to PDFNet
        const fdf_doc = await PDFNet.FDFDoc.createFromXFDF(annots);
        await doc.fdfUpdate(fdf_doc);
        const page = await doc.getPage(1);
        const rect = await page.getCropBox();
        const annotation = await page.getAnnot(0);
        const te = await PDFNet.TextExtractor.create();
        te.begin(page, rect);
        const textData = await te.getTextUnderAnnot(annotation);
        console.log(textData);
      });
    });
  });

WebViewer({
  fullAPI: true,
  // Other instantiation options
})
  .then(instance => {
    const { PDFNet, docViewer, annotManager } = instance;
    docViewer.on('annotationsLoaded', async () => {
      await PDFNet.initialize();
      const doc = await docViewer.getDocument().getPDFDoc();
      // export annotations from the document
      const annots = await annotManager.exportAnnotations();
      // Run PDFNet methods with memory management
      await PDFNet.runWithCleanup(async () => {
        // lock the document before a write operation
        // runWithCleanup will auto unlock when complete
        doc.lock();
        // import annotations to PDFNet
        const fdf_doc = await PDFNet.FDFDoc.createFromXFDF(annots);
        await doc.fdfUpdate(fdf_doc);
        const page = await doc.getPage(1);
        const rect = await page.getCropBox();
        const annotation = await page.getAnnot(0);
        const te = await PDFNet.TextExtractor.create();
        te.begin(page, rect);
        const textData = await te.getTextUnderAnnot(annotation);
        console.log(textData);
      });
    });
  });
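The sample reads the text under the first annotation only. A hedged sketch of extending it to every annotation on the page follows; it assumes the page object provides getNumAnnots alongside getAnnot, and that begin() has already been called on the extractor for that page. The helper name `textUnderAnnotations` is hypothetical.

```javascript
// Returns the text found under each annotation on a page, in annotation order.
// `page` is assumed to be a PDFNet page and `te` a TextExtractor
// on which begin(page, ...) has already been called.
async function textUnderAnnotations(page, te) {
  const count = await page.getNumAnnots();
  const results = [];
  for (let i = 0; i < count; i++) {
    const annotation = await page.getAnnot(i);
    results.push(await te.getTextUnderAnnot(annotation));
  }
  return results;
}
```

In the sample above, this would replace the getAnnot(0) and getTextUnderAnnot lines inside the runWithCleanup callback.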

About extracting text

When we use the ElementReader class to read elements from a PDF document, we often get data that is partial. For example, suppose we are trying to extract the sentence "This is a sample sentence." from a PDF document. We could end up with two elements: "T" and "his is a sample sentence.". This happens because text objects in a PDF document are not always cleanly organized into words, sentences, or paragraphs. The ElementReader class returns Element objects exactly as they are defined in the PDF page content stream.

Text runs

An element of type e_text corresponds directly to a Tj (show text) operator in the PDF page content stream. Each e_text element represents a text run: a sequence of text glyphs that share the same font and graphics attributes. For example, if each letter of a single word is set in a different font, each letter is a separate text run. You may also encounter text runs that contain multiple words separated by spaces. The PDF format does not guarantee that text runs appear in reading order.
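As a rough sketch of what reading raw text runs looks like, the helper below collects the string of every e_text element on a page in content-stream order. It assumes a reader created with PDFNet.ElementReader.create() and the value of PDFNet.Element.Type.e_text passed in as `textType`; the helper name `readTextRuns` is hypothetical, and text nested inside form XObjects is not handled.

```javascript
// Reads the raw text runs of a page in content-stream order.
// `reader` is assumed to be a PDFNet.ElementReader and `textType` the value
// of PDFNet.Element.Type.e_text; both are passed in to keep this testable.
async function readTextRuns(reader, page, textType) {
  const runs = [];
  await reader.beginOnPage(page);
  // next() is assumed to resolve to null once the page has no more elements.
  for (let element = await reader.next(); element; element = await reader.next()) {
    if ((await element.getType()) === textType) {
      runs.push(await element.getTextString());
    }
  }
  await reader.end();
  return runs;
}
```

For the sentence in the example above, the result could be `['T', 'his is a sample sentence.']`, which shows why runs alone are not enough to recover words.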

TextExtractor class

In short, using ElementReader to extract text from a PDF document is not guaranteed to return data in reading order. The most straightforward way to extract words and text from text runs is the pdftron.PDF.TextExtractor class, as shown in the TextExtract sample project.

TextExtractor will assemble words, lines, and paragraphs, remove duplicate strings, reconstruct text reading order, etc. Using TextExtractor you can also obtain bounding boxes for each word, line, or paragraph (along with style information such as font, color, etc). This information can be used to search for corresponding text elements using ElementReader.
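A minimal sketch of collecting each word together with its bounding box is shown below. It assumes that a word exposes getBBox() and that the result carries x1, y1, x2, and y2 coordinates; check both against the API reference before relying on them. The helper name `collectWordBoxes` is hypothetical.

```javascript
// Returns one record per word: its text and its bounding box corners.
// Assumes `txt` is a TextExtractor already started with begin(), and that
// word.getBBox() resolves to an object with x1, y1, x2, y2 (an assumption).
async function collectWordBoxes(txt) {
  const boxes = [];
  for (let line = await txt.getFirstLine(); await line.isValid(); line = await line.getNextLine()) {
    for (let word = await line.getFirstWord(); await word.isValid(); word = await word.getNextWord()) {
      const bbox = await word.getBBox();
      boxes.push({
        text: await word.getString(),
        x1: bbox.x1,
        y1: bbox.y1,
        x2: bbox.x2,
        y2: bbox.y2,
      });
    }
  }
  return boxes;
}
```

Records like these can be compared against element bounding boxes when looking up the matching text elements with ElementReader.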
