Extracting text from a PDF on Server/Desktop

The samples below show how to extract text from a PDF document.

Reading order for text extraction is not defined in the ISO PDF standard. In fact, a typical PDF file has no concept of sentences, paragraphs, tables, or similar structures. Each PDF vendor is therefore left to design its own solution, and different vendors will extract text somewhat differently. As a result, the extracted reading order is not guaranteed to match the order in which a typical reader would read the document.

A magazine, a newspaper article, and an academic article all have quite different reading orders. Because a PDF lacks semantic information and text can be placed in any order in the document, different users may have different expectations of the correct reading order.

// C#
PDFDoc doc = new PDFDoc(filename);
Page page = doc.GetPage(1);

TextExtractor txt = new TextExtractor();
txt.Begin(page); // Read the page.

// Extract words one by one.
TextExtractor.Word word;
for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
{
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
    {
        // word.GetString();
    }
}

// C++
PDFDoc doc(filename);
Page page = doc.GetPage(1);

TextExtractor txt;
txt.Begin(page); // Read the page.

// Extract words one by one.
TextExtractor::Line line = txt.GetFirstLine();
TextExtractor::Word word;
for (; line.IsValid(); line = line.GetNextLine())
{
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
    {
        // word.GetString();
    }
}

// Go
doc := NewPDFDoc(filename)
page := doc.GetPage(1)

txt := NewTextExtractor()
txt.Begin(page) // Read the page.

// Extract words one by one.
word := NewWord()
line := txt.GetFirstLine()
for line.IsValid() {
    word = line.GetFirstWord()
    for word.IsValid() {
        // wordString := word.GetString()
        word = word.GetNextWord()
    }
    line = line.GetNextLine()
}

// Java
PDFDoc doc = new PDFDoc(filename);
Page page = doc.getPage(1);

TextExtractor txt = new TextExtractor();
txt.begin(page); // Read the page.

// Extract words one by one.
TextExtractor.Word word;
for (TextExtractor.Line line = txt.getFirstLine(); line.isValid(); line = line.getNextLine())
{
    for (word = line.getFirstWord(); word.isValid(); word = word.getNextWord())
    {
        // word.getString();
    }
}

// JavaScript
async function main() {
    const doc = await PDFNet.PDFDoc.createFromURL(filename);
    const page = await doc.getPage(1);

    const txt = await PDFNet.TextExtractor.create();
    const rect = await page.getCropBox();
    txt.begin(page, rect); // Read the page.

    // Extract words one by one.
    let line = await txt.getFirstLine();
    for (; (await line.isValid()); line = (await line.getNextLine()))
    {
        // Declare word with let so the loop variable is not an implicit global.
        for (let word = await line.getFirstWord(); (await word.isValid()); word = (await word.getNextWord()))
        {
            // await word.getString();
        }
    }
}
PDFNet.runWithCleanup(main);

// Objective-C
PTPDFDoc *doc = [[PTPDFDoc alloc] initWithFilepath: filename];
PTPage *page = [doc GetPage:1];

PTTextExtractor *txt = [[PTTextExtractor alloc] init];
[txt Begin: page clip_ptr: 0 flags: 0]; // Read the page.

// Extract words one by one.
PTTextExtractorLine *line = [txt GetFirstLine];
PTWord *word;
for (; [line IsValid]; line = [line GetNextLine]) {
    for (word = [line GetFirstWord]; [word IsValid]; word = [word GetNextWord]) {
        // [word GetString];
    }
}

// PHP
$doc = new PDFDoc($filename);
$page = $doc->GetPage(1);

$txt = new TextExtractor();
$txt->Begin($page); // Read the page.

// Extract words one by one.
for ($line = $txt->GetFirstLine(); $line->IsValid(); $line = $line->GetNextLine())
{
    for ($word = $line->GetFirstWord(); $word->IsValid(); $word = $word->GetNextWord())
    {
        // $word->GetString();
    }
}

# Python
doc = PDFDoc(filename)
page = doc.GetPage(1)

txt = TextExtractor()
txt.Begin(page)  # Read the page.

# Extract words one by one.
word = Word()
line = txt.GetFirstLine()
while line.IsValid():
    word = line.GetFirstWord()
    while word.IsValid():
        # word.GetString()
        word = word.GetNextWord()
    line = line.GetNextLine()

# Ruby
doc = PDFDoc.new(filename)
page = doc.GetPage(1)

txt = TextExtractor.new
txt.Begin(page) # Read the page.

# Extract words one by one.
word = Word.new
line = txt.GetFirstLine
while line.IsValid do
    word = line.GetFirstWord
    while word.IsValid do
        # word.GetString
        word = word.GetNextWord
    end
    line = line.GetNextLine
end

' VB.NET
Dim doc As PDFDoc = New PDFDoc(filename)
Dim page As Page = doc.GetPage(1)

Dim txt As TextExtractor = New TextExtractor
txt.Begin(page) ' Read the page.

' Extract words one by one.
Dim word As TextExtractor.Word
Dim line As TextExtractor.Line = txt.GetFirstLine()
While line.IsValid()
    word = line.GetFirstWord()
    While word.IsValid()
        ' word.GetString()
        word = word.GetNextWord()
    End While
    line = line.GetNextLine()
End While

Read a PDF File sample
Full sample code which illustrates the basic text extraction capabilities.

Extract text under an annotation

The samples below show how to extract the text that lies under an annotation in the document.

// C#
PDFDoc doc = new PDFDoc(filename);
Page page = doc.GetPage(1);
Annot annotation = page.GetAnnot(0);

TextExtractor txt = new TextExtractor();
txt.Begin(page); // Read the page.
string textData = txt.GetTextUnderAnnot(annotation);

// C++
PDFDoc doc(filename);
Page page = doc.GetPage(1);
Annot annotation = page.GetAnnot(0);

TextExtractor txt;
txt.Begin(page); // Read the page.
UString textData = txt.GetTextUnderAnnot(annotation);

// Go
doc := NewPDFDoc(filename)
page := doc.GetPage(1)
annotation := page.GetAnnot(0)

txt := NewTextExtractor()
txt.Begin(page) // Read the page.
textData := txt.GetTextUnderAnnot(annotation)

// Java
PDFDoc doc = new PDFDoc(filename);
Page page = doc.getPage(1);
Annot annotation = page.getAnnot(0);

TextExtractor txt = new TextExtractor();
txt.begin(page); // Read the page.
String textData = txt.getTextUnderAnnot(annotation);

// JavaScript
async function main() {
    const doc = await PDFNet.PDFDoc.createFromURL(filename);
    const page = await doc.getPage(1);
    const annotation = await page.getAnnot(0);

    const txt = await PDFNet.TextExtractor.create();
    const rect = await page.getCropBox();
    txt.begin(page, rect); // Read the page.
    const textData = await txt.getTextUnderAnnot(annotation);
}
PDFNet.runWithCleanup(main);

// Objective-C
PTPDFDoc *doc = [[PTPDFDoc alloc] initWithFilepath: filename];
PTPage *page = [doc GetPage:1];
PTAnnot *annotation = [page GetAnnot:0];

PTTextExtractor *txt = [[PTTextExtractor alloc] init];
[txt Begin: page clip_ptr: 0 flags: 0]; // Read the page.
NSString *textData = [txt GetTextUnderAnnot: annotation];

// PHP
$doc = new PDFDoc($filename);
$page = $doc->GetPage(1);
$annotation = $page->GetAnnot(0);

$txt = new TextExtractor();
$txt->Begin($page); // Read the page.
$textData = $txt->GetTextUnderAnnot($annotation);

# Python
doc = PDFDoc(filename)
page = doc.GetPage(1)
annotation = page.GetAnnot(0)

txt = TextExtractor()
txt.Begin(page)  # Read the page.
textData = txt.GetTextUnderAnnot(annotation)

# Ruby
doc = PDFDoc.new(filename)
page = doc.GetPage(1)
annotation = page.GetAnnot(0)

txt = TextExtractor.new
txt.Begin(page) # Read the page.
textData = txt.GetTextUnderAnnot(annotation)

' VB.NET
Dim doc As PDFDoc = New PDFDoc(filename)
Dim page As Page = doc.GetPage(1)
Dim annotation As Annot = page.GetAnnot(0)

Dim txt As TextExtractor = New TextExtractor
txt.Begin(page) ' Read the page.
Dim textData As String = txt.GetTextUnderAnnot(annotation)

About extracting text

When we use the ElementReader class to read elements from a PDF document, we often receive partial data. For example, suppose we are attempting to extract the sentence "This is a sample sentence." from a PDF document. We could end up with two elements: "T" and "his is a sample sentence.". This is possible because text objects in a PDF document are not always cleanly organized into words, sentences, or paragraphs. The ElementReader class returns Element objects exactly as they are defined in the PDF page content stream.
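To make the "T" plus "his is a sample sentence." case concrete, here is a minimal Python sketch of stitching fragments back together by horizontal position. It does not use the SDK: the `Fragment` class, the `join_fragments` helper, the coordinates, and the `gap_tolerance` value are all hypothetical, chosen only for illustration, and real text assembly has to handle far more (fonts, rotation, multiple lines).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical stand-in for one text element read from a content stream."""
    text: str
    x: float      # left edge of the fragment, in page units
    width: float  # horizontal extent of the fragment


def join_fragments(fragments, gap_tolerance=1.0):
    """Join fragments that share a baseline, left to right.

    A space is inserted only where the horizontal gap between two
    neighbouring fragments exceeds gap_tolerance (an assumed threshold).
    """
    pieces = []
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev_end is not None and frag.x - prev_end > gap_tolerance:
            pieces.append(" ")
        pieces.append(frag.text)
        prev_end = frag.x + frag.width
    return "".join(pieces)


# The two elements arrive out of order and split mid-word.
fragments = [
    Fragment("his is a sample sentence.", x=16.0, width=120.0),
    Fragment("T", x=10.0, width=6.0),
]
sentence = join_fragments(fragments)
print(sentence)  # This is a sample sentence.
```

Because "T" ends exactly where the next fragment begins, no space is inserted and the word "This" is restored.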

Text runs

An element of type e_text corresponds directly to a Tj (show text) operator in the PDF page content stream. Each e_text element represents a text run: a sequence of text glyphs that use the same font and graphics attributes. For example, if each letter of a single word is set in a different font, then each letter is a separate text run. You may also encounter text runs that contain multiple words separated by spaces. The PDF format does not guarantee that text runs appear in reading order.
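The following Python sketch illustrates why reading order is ambiguous once text runs are reduced to positions. It is not SDK code: the `Run` class, both ordering functions, and the `column_split_x` parameter are assumptions made for this example, and the coordinates assume the default PDF user space with the origin at the bottom-left of the page.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical text run with the position of its first glyph."""
    text: str
    x: float  # left edge
    y: float  # baseline, measured upward from the bottom of the page


def naive_order(runs):
    """Top to bottom, then left to right, ignoring any column layout."""
    return [r.text for r in sorted(runs, key=lambda r: (-r.y, r.x))]


def column_order(runs, column_split_x):
    """Read the whole left column first, then the whole right column."""
    left = [r for r in runs if r.x < column_split_x]
    right = [r for r in runs if r.x >= column_split_x]
    return naive_order(left) + naive_order(right)


# A two-column page whose runs are stored in arbitrary order.
runs = [
    Run("B2", x=320, y=680),
    Run("A1", x=72, y=700),
    Run("B1", x=320, y=700),
    Run("A2", x=72, y=680),
]
print(naive_order(runs))        # ['A1', 'B1', 'A2', 'B2']
print(column_order(runs, 300))  # ['A1', 'A2', 'B1', 'B2']
```

Both results are defensible: the first suits a table read row by row, the second suits a two-column article. Nothing in the runs themselves says which one the author intended.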

TextExtractor class

In short, using an ElementReader to extract text data from a PDF document is not guaranteed to return data in the expected (reading) order. The most straightforward way to extract words and text from text runs is to use the pdftron.PDF.TextExtractor class, as shown in the TextExtract sample project.

TextExtractor assembles words, lines, and paragraphs, removes duplicate strings, reconstructs the text reading order, and so on. With TextExtractor you can also obtain the bounding box of each word, line, or paragraph, along with style information such as font and color. This information can be used to search for the corresponding text elements using ElementReader.
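One way to use bounding boxes for such a search is a plain rectangle-overlap test. The Python sketch below is only an illustration of that idea under stated assumptions: the `Box` class, the `intersects` helper, and the sample words and coordinates are hypothetical, and it does not show how the SDK itself matches words to elements.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned bounding box in page units."""
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top


def intersects(a, b):
    """Return True when the two boxes overlap with non-zero area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


# Words paired with the bounding boxes an extractor might report for them.
words = [
    ("Invoice", Box(72, 700, 120, 712)),
    ("Total", Box(72, 650, 100, 662)),
    ("42.00", Box(110, 650, 140, 662)),
]

# Region of interest, for example the bounding box of another page element.
region = Box(60, 645, 150, 665)

hits = [text for text, box in words if intersects(box, region)]
print(hits)  # ['Total', '42.00']
```

The same test works in either direction: start from a word's box to find the elements that drew it, or start from a region to collect the words inside it.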
