public class

TextExtractor

extends Object
implements AutoCloseable
java.lang.Object
   ↳ com.pdftron.pdf.TextExtractor

Class Overview

TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:

  • Converting PDF pages to text or XML for content repurposing.
  • Searching PDF pages for specific words or keywords.
  • Indexing large PDF repositories for search or content retrieval purposes (e.g. implementing a PDF search engine).
  • Classifying or summarizing PDF documents based on their text content.
  • Finding specific words for content editing purposes (such as splitting pages based on keywords).

The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:

  • Normalize all text content to Unicode.
  • Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
  • Extract positioning information for every line, word, or glyph.
  • Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
  • Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let users tune the content recognition algorithms to their requirements.
  • Offer utility methods to convert PDF page content to text, XML, or HTML.

Note: TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

A sample use case:

 ... Initialize PDFNet ...
 PDFDoc doc = new PDFDoc(filein);
 doc.initSecurityHandler();
 Page page = doc.pageBegin().current();
 TextExtractor txt = new TextExtractor();
 // Pass null as the clipping rectangle to read the whole page.
 txt.begin(page, null, TextExtractor.e_remove_hidden_text);
 String text = txt.getAsText();
 // or traverse words one by one...
 TextExtractor.Word word;
 for (TextExtractor.Line line = txt.getFirstLine(); line.isValid(); line = line.getNextLine()) {
   for (word = line.getFirstWord(); word.isValid(); word = word.getNextWord()) {
     String w = word.getString();
   }
 }
 

For full sample code, please take a look at the TextExtract sample project.

Summary

Nested Classes
class TextExtractor.CharRange TextExtractor.CharRange object represents a range of text based on Unicode character indices. 
class TextExtractor.Compat Compatibility layer API. 
class TextExtractor.Line  
class TextExtractor.Style A class representing predominant text style associated with a given Line, a Word, or a Glyph. 
class TextExtractor.Word  
Constants
int e_extract_using_zorder Use Z-order as reading order for text
int e_no_dup_remove Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.
int e_no_invisible_text Enables removing text that uses rendering mode 3 (i.e. invisible text).
int e_no_ligature_exp Disables expanding of ligatures using a predefined mapping.
int e_no_watermarks Enables removal of text that is marked as part of a Watermark layer
int e_output_bbox Include bounding box information for each XML element.
int e_output_style_info Include font and styling information.
int e_punct_break Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.
int e_remove_hidden_text Enables removal of text that is obscured by images or rectangles.
int e_words_as_elements Output words as XML elements instead of inline text.
Public Constructors
TextExtractor()
Constructor.
Public Methods
void begin(Page page, Rect clip_ptr)
Start reading the page.
void begin(Page page, Rect clip_ptr, int flags)
Start reading the page.
void begin(Page page)
Start reading the page.
void close()
Frees the native memory of the object.
void destroy()
Frees the native memory of the object.
String getAsText()
Get all words in the current selection as a single string.
String getAsText(boolean dehyphen)
Get all words in the current selection as a single string.
String getAsXML()
Get text content in a form of an XML string.
String getAsXML(int xml_output_flags)
Get text content in a form of an XML string.
TextExtractor.Line getFirstLine()
Get the first line.
Highlights getHighlights(CharRange[] char_ranges)
Get a Highlights object based on an array of character ranges.
int getNumLines()
Get the number of lines.
boolean getRightToLeftLanguage()
Checks whether the text extractor works in right-to-left language mode.
String getTextUnderAnnot(Annot annot)
Get all the characters that intersect an annotation.
int getWordCount()
Get the word count.
void setOCGContext(Context ctx)
Set the Optional Content Group (OCG) context that should be used when rendering the page.
void setRightToLeftLanguage(boolean right_2_left)
Sets text extractor to work in right-to-left language mode.
Inherited Methods
From class java.lang.Object
From interface java.lang.AutoCloseable

Constants

public static final int e_extract_using_zorder

Use Z-order as reading order for text

Constant Value: 256 (0x00000100)

public static final int e_no_dup_remove

Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.

Constant Value: 2 (0x00000002)

public static final int e_no_invisible_text

Enables removing text that uses rendering mode 3 (i.e. invisible text). Invisible text is usually used in 'PDF Searchable Images' (i.e. scanned pages with corresponding OCR text). For that reason, invisible text is extracted by default unless this flag is set.

Constant Value: 16 (0x00000010)

public static final int e_no_ligature_exp

Disables expanding of ligatures using a predefined mapping. Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st, oe, OE.

Constant Value: 1 (0x00000001)

public static final int e_no_watermarks

Enables removal of text that is marked as part of a Watermark layer

Constant Value: 128 (0x00000080)

public static final int e_output_bbox

Include bounding box information for each XML element. The bounding box information will be stored as 'bbox' attribute.

Constant Value: 2 (0x00000002)

public static final int e_output_style_info

Include font and styling information.

Constant Value: 4 (0x00000004)

public static final int e_punct_break

Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.

Constant Value: 4 (0x00000004)

public static final int e_remove_hidden_text

Enables removal of text that is obscured by images or rectangles. Because this option carries a small performance penalty during text extraction, it is not enabled by default.

Constant Value: 8 (0x00000008)

public static final int e_words_as_elements

Output words as XML elements instead of inline text.

Constant Value: 1 (0x00000001)
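
Judging by the descriptions and values above, the constants appear to form two separate groups that reuse the same bit positions: processing flags passed to begin (for example e_no_dup_remove = 2, e_remove_hidden_text = 8, e_no_invisible_text = 16) and XML output flags passed to getAsXML (e_words_as_elements = 1, e_output_bbox = 2, e_output_style_info = 4). The following stand-alone sketch mirrors a few of those values in a hypothetical FlagSketch class, so it runs without the PDFNet library, and shows how flags combine with bitwise OR and why the two groups should not be mixed.

```java
// Stand-alone sketch: the constants mirror values documented above.
// FlagSketch and its method names are hypothetical, not part of PDFNet.
final class FlagSketch {
    // Processing flags (intended for begin).
    static final int E_NO_DUP_REMOVE = 2;
    static final int E_REMOVE_HIDDEN_TEXT = 8;
    static final int E_NO_INVISIBLE_TEXT = 16;

    // XML output flags (intended for getAsXML).
    static final int E_WORDS_AS_ELEMENTS = 1;
    static final int E_OUTPUT_BBOX = 2;
    static final int E_OUTPUT_STYLE_INFO = 4;

    /** Combines any number of flags with bitwise OR. */
    static int combine(int... flags) {
        int result = 0;
        for (int f : flags) {
            result |= f;
        }
        return result;
    }

    /** Returns true if every bit of flag is set in mask. */
    static boolean isSet(int mask, int flag) {
        return (mask & flag) == flag;
    }

    public static void main(String[] args) {
        int processing = combine(E_REMOVE_HIDDEN_TEXT, E_NO_INVISIBLE_TEXT);
        int xml = combine(E_WORDS_AS_ELEMENTS, E_OUTPUT_BBOX, E_OUTPUT_STYLE_INFO);
        System.out.println("processing flags = " + processing);
        System.out.println("xml flags = " + xml);
        // E_OUTPUT_BBOX and E_NO_DUP_REMOVE share the value 2, so a mask
        // cannot tell them apart; keep each group with its own method.
        System.out.println("collision: " + isSet(xml, E_NO_DUP_REMOVE));
    }
}
```

With the real class, the equivalent expressions would be along the lines of TextExtractor.e_remove_hidden_text | TextExtractor.e_no_invisible_text for begin, and TextExtractor.e_words_as_elements | TextExtractor.e_output_bbox for getAsXML.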

Public Constructors

public TextExtractor ()

Constructor. Instantiate new TextExtractor.

Public Methods

public void begin (Page page, Rect clip_ptr)

Start reading the page.

Parameters
page Page to read.
clip_ptr An optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.

public void begin (Page page, Rect clip_ptr, int flags)

Start reading the page.

Parameters
page Page to read.
clip_ptr An optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
flags A bitwise combination of processing flags (such as e_remove_hidden_text) used to control the text extraction algorithm.

public void begin (Page page)

Start reading the page.

Parameters
page Page to read.

public void close ()

Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.

public void destroy ()

Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.
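
Because the class implements AutoCloseable, one way to guarantee that close() runs is a try-with-resources statement. The sketch below demonstrates the pattern with a hypothetical stand-in class (NativeBacked) rather than TextExtractor itself, so it runs without the PDFNet library; the stand-in only records whether close() was called.

```java
// Stand-alone sketch of deterministic cleanup via try-with-resources.
// NativeBacked is a hypothetical stand-in for an object owning native memory.
final class CloseSketch {
    static final class NativeBacked implements AutoCloseable {
        private boolean closed = false;

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            // Safe to call more than once in this sketch.
            closed = true;
        }
    }

    /** Uses a resource inside try-with-resources and reports its final state. */
    static boolean closedAfterBlock() {
        NativeBacked resource = new NativeBacked();
        try (NativeBacked r = resource) {
            // ... work with r here; close() runs even if this block throws ...
            if (r.isClosed()) {
                throw new IllegalStateException("closed too early");
            }
        }
        return resource.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("closed after block: " + closedAfterBlock());
    }
}
```

Applied to this class, the same shape would wrap new TextExtractor() in the try header, with begin and the extraction calls inside the block.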

public String getAsText ()

Get all words in the current selection as a single string.

Returns
  • The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
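
Since the returned string separates words with spaces or new lines, it can be tokenized with a simple split. The following stand-alone sketch (the class and method names are hypothetical) splits such a string into words and counts case-insensitive matches of a keyword, which is roughly the keyword-search use case mentioned in the overview. It operates on a plain String, so the input here is a hand-written sample rather than real getAsText() output.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch: tokenizing text whose words are separated by
// spaces or new lines. Names are illustrative, not part of PDFNet.
final class TextSearchSketch {
    /** Splits text on runs of spaces and new lines, dropping empty tokens. */
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[ \\n]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** Counts whole-word, case-insensitive occurrences of keyword. */
    static int countOccurrences(String text, String keyword) {
        int count = 0;
        for (String word : splitWords(text)) {
            if (word.equalsIgnoreCase(keyword)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "PDFNet SDK\nis a PDF toolkit";
        System.out.println(splitWords(sample));
        System.out.println(countOccurrences(sample, "pdf"));
    }
}
```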

public String getAsText (boolean dehyphen)

Get all words in the current selection as a single string.

Parameters
dehyphen If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.
Returns
  • The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
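
To make the idea of hyphen removal concrete, here is a deliberately naive stand-alone sketch. It is not the library's algorithm, which works on text runs and is not described here; it only illustrates the effect on a plain string by dropping a hyphen that sits between a letter and a line break followed by a lowercase letter. The class and method names are hypothetical.

```java
// Naive illustration of dehyphenation on a plain string.
// This is an assumption-laden sketch, not how TextExtractor implements it.
final class DehyphenSketch {
    /**
     * Removes "-\n" when it follows a letter and precedes a lowercase
     * letter, merging the two halves of a word split across lines.
     */
    static String dehyphenate(String text) {
        return text.replaceAll("(?<=\\p{L})-\\n(?=\\p{Ll})", "");
    }

    public static void main(String[] args) {
        String wrapped = "text extrac-\ntion works";
        System.out.println(dehyphenate(wrapped));
    }
}
```

A heuristic like this cannot distinguish a soft line-break hyphen from a genuine compound word that happens to wrap at its hyphen, which is one reason to prefer the built-in option when it is available.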

public String getAsXML ()

Get text content in a form of an XML string.

Note: This method returns the same as if calling getAsXML(0). Please see getAsXML(int) for more information.

Returns
  • The string containing XML output.

public String getAsXML (int xml_output_flags)

Get text content in a form of an XML string.

Note: XML output will be encoded in UTF-8 and will have the following structure:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
     <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
     <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
     <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
      ...
    </Line>
   </Para>
  </Flow>
 </Page>
 
 

The above XML output was generated by passing the following union of flags: e_words_as_elements | e_output_bbox | e_output_style_info.

In case 'xml_output_flags' was not specified, the default XML output would look as follows:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
    <Line>levels. Using the PDFNet PDF library, ...</Line>
     ...
   </Para>
  </Flow>
 </Page>
 
 

Parameters
xml_output_flags A bitwise combination of flags controlling XML output (e_words_as_elements, e_output_bbox, e_output_style_info).
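
The XML string can be consumed with any standard XML parser. As a minimal sketch, the stand-alone example below uses the JDK's DOM parser on a shortened copy of the sample output shown above (hard-coded, with the num attribute properly quoted) to list the Word elements and read a box attribute as numbers. The class and method names are hypothetical, and since the meaning of the four box values is not spelled out here, they are returned as a raw array.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Stand-alone sketch: reading Word elements from getAsXML-style output.
final class XmlWordsSketch {
    // Shortened copy of the documented sample; not produced by a real call.
    static final String SAMPLE =
        "<Page num=\"1\" crop_box=\"0, 0, 612, 792\" media_box=\"0, 0, 612, 792\" rotate=\"0\">"
        + "<Flow id=\"1\"><Para id=\"1\"><Line box=\"72, 708.075, 467.895, 10.02\">"
        + "<Word box=\"72, 708.075, 30.7614, 10.02\">PDFNet</Word>"
        + "<Word box=\"106.188, 708.075, 15.9318, 10.02\">SDK</Word>"
        + "<Word box=\"125.617, 708.075, 6.22242, 10.02\">is</Word>"
        + "</Line></Para></Flow></Page>";

    private static NodeList wordNodes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("Word");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the text of every Word element in document order. */
    static List<String> words(String xml) {
        NodeList nodes = wordNodes(xml);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    /** Parses the comma-separated box attribute of the Word at wordIndex. */
    static double[] box(String xml, int wordIndex) {
        Element word = (Element) wordNodes(xml).item(wordIndex);
        String[] parts = word.getAttribute("box").split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(words(SAMPLE));
        System.out.println(box(SAMPLE, 1).length);
    }
}
```

Word elements and box attributes only appear when the corresponding flags (e_words_as_elements and e_output_bbox) are passed; with the default output, the same approach would read the text of Line elements instead.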

public TextExtractor.Line getFirstLine ()

Get the first line.

Note: To traverse all text lines on the page, call getNextLine() on the returned line. To traverse all words on a given line, start with getFirstWord().

Returns
  • The first line of text on the selected page.

public Highlights getHighlights (CharRange[] char_ranges)

Get a Highlights object based on an array of character ranges.

Parameters
char_ranges an array of character ranges to be highlighted
Returns
  • a Highlights object containing the selected characters

public int getNumLines ()

Get the number of lines.

Returns
  • The number of lines of text on the selected page.

public boolean getRightToLeftLanguage ()

Checks whether the text extractor works in right-to-left language mode.

public String getTextUnderAnnot (Annot annot)

Get all the characters that intersect an annotation.

Parameters
annot The annotation to intersect with.

public int getWordCount ()

Get the word count.

Returns
  • the number of words on the page.

public void setOCGContext (Context ctx)

Set the Optional Content Group (OCG) context that should be used when rendering the page. This function can be used to selectively render optional content (such as PDF layers) based on the states of optional content groups in the given context.

Parameters
ctx Optional Content Group (OCG) context, or null if TextExtractor should process all content on the page.

public void setRightToLeftLanguage (boolean right_2_left)

Sets text extractor to work in right-to-left language mode.

Parameters
right_2_left If true, text extractor is set to right-to-left language mode.