public class TextExtractor
extends Object
implements AutoCloseable

java.lang.Object
↳	com.pdftron.pdf.TextExtractor

Class Overview

TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:

Converting PDF pages to text or XML for content repurposing.
Searching PDF pages for specific words or keywords.
Indexing large PDF repositories for search or content retrieval purposes (e.g. implementing a PDF search engine).
Classifying or summarizing PDF documents based on their text content.
Finding specific words for content editing purposes (such as splitting pages based on keywords).

The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:

Normalize all text content to Unicode.
Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
Extract positioning information for every line, word, or glyph.
Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let users direct the content recognition algorithms to meet their requirements.
Offer utility methods to convert PDF page content to text, XML, or HTML.

Note: TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) or vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

A sample use case:

 // ... Initialize PDFNet ...
 PDFDoc doc = new PDFDoc(filein);
 doc.initSecurityHandler();
 Page page = doc.getPage(1); // first page of the document
 TextExtractor txt = new TextExtractor();
 // No clipping rectangle: read the whole page. This assumes null is
 // accepted for the optional clip_ptr argument.
 txt.begin(page, null, TextExtractor.e_remove_hidden_text);
 String text = txt.getAsText();
 // or traverse words one by one...
 TextExtractor.Word word;
 for (TextExtractor.Line line = txt.getFirstLine(); line.isValid(); line = line.getNextLine()) {
   for (word = line.getFirstWord(); word.isValid(); word = word.getNextWord()) {
     String w = word.getString();
   }
 }
 txt.close(); // free the native memory once extraction is done

For full sample code, please take a look at the TextExtract sample project.

Summary

Nested Classes
class	TextExtractor.CharRange	TextExtractor.CharRange object represents a range of text based on Unicode character indices.
class	TextExtractor.Compat	Compatibility layer API.
class	TextExtractor.Line
class	TextExtractor.Style	A class representing predominant text style associated with a given Line, a Word, or a Glyph.
class	TextExtractor.Word

Constants
int	e_extract_using_zorder	Use Z-order as reading order for text
int	e_no_dup_remove	Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.
int	e_no_invisible_text	Enables removing text that uses rendering mode 3 (i.e. invisible text).
int	e_no_ligature_exp	Disables expanding of ligatures using a predefined mapping.
int	e_no_watermarks	Enables removal of text that is marked as part of a Watermark layer
int	e_output_bbox	Include bounding box information for each XML element.
int	e_output_style_info	Include font and styling information.
int	e_punct_break	Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.
int	e_remove_hidden_text	Enables removal of text that is obscured by images or rectangles.
int	e_words_as_elements	Output words as XML elements instead of inline text.

Public Constructors
	TextExtractor() Constructor.

Public Methods
void	begin(Page page, Rect clip_ptr) Start reading the page.
void	begin(Page page, Rect clip_ptr, int flags) Start reading the page.
void	begin(Page page) Start reading the page.
void	close() Frees the native memory of the object.
void	destroy() Frees the native memory of the object.
String	getAsText() Get all words in the current selection as a single string.
String	getAsText(boolean dehyphen) Get all words in the current selection as a single string.
String	getAsXML() Get text content in a form of an XML string.
String	getAsXML(int xml_output_flags) Get text content in a form of an XML string.
TextExtractor.Line	getFirstLine() Get the first line.
Highlights	getHighlights(CharRange[] char_ranges) Get a Highlights object based on an array of character ranges.
int	getNumLines() Get the number of lines.
boolean	getRightToLeftLanguage() Checks if text extractor works in right-to-left language mode.
String	getTextUnderAnnot(Annot annot) Get all the characters that intersect an annotation.
int	getWordCount() Get the word count.
void	setOCGContext(Context ctx) Set the Optional Content Group (OCG) context that should be used when rendering the page.
void	setRightToLeftLanguage(boolean right_2_left) Sets text extractor to work in right-to-left language mode.

Inherited Methods

From class java.lang.Object

From interface java.lang.AutoCloseable

Constants

public static final int e_extract_using_zorder

Use Z-order as reading order for text

Constant Value: 256 (0x00000100)

public static final int e_no_dup_remove

Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.

Constant Value: 2 (0x00000002)

public static final int e_no_invisible_text

Enables removing text that uses rendering mode 3 (i.e. invisible text). Invisible text is usually used in 'PDF Searchable Images' (i.e. scanned pages with a corresponding OCR text). As a result, invisible text will be extracted by default.

Constant Value: 16 (0x00000010)

public static final int e_no_ligature_exp

Disables expanding of ligatures using a predefined mapping. Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st, oe, OE.

Constant Value: 1 (0x00000001)

public static final int e_no_watermarks

Enables removal of text that is marked as part of a Watermark layer

Constant Value: 128 (0x00000080)

public static final int e_output_bbox

Include bounding box information for each XML element. The bounding box information will be stored as 'bbox' attribute.

Constant Value: 2 (0x00000002)

public static final int e_output_style_info

Include font and styling information.

Constant Value: 4 (0x00000004)

public static final int e_punct_break

Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.

Constant Value: 4 (0x00000004)

public static final int e_remove_hidden_text

Enables removal of text that is obscured by images or rectangles. Because this option carries a small performance penalty for text extraction, it is not enabled by default.

Constant Value: 8 (0x00000008)

public static final int e_words_as_elements

Output words as XML elements instead of inline text.

Constant Value: 1 (0x00000001)
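The constant values listed above overlap (for example, both e_no_dup_remove and e_output_bbox are 2), which suggests two separate bit-flag sets: processing flags for begin() and XML output flags for getAsXML(int). The following is a minimal sketch of how such flags combine, assuming that grouping, which is inferred from the descriptions and not stated explicitly. The class name and the local constants are hypothetical mirrors of the documented values; real code would use the TextExtractor.e_* constants directly.

```java
final class ExtractorFlagsSketch {
    // Processing flags (assumed to be the ones meant for begin()).
    static final int E_NO_LIGATURE_EXP = 1;
    static final int E_NO_DUP_REMOVE = 2;
    static final int E_PUNCT_BREAK = 4;
    static final int E_REMOVE_HIDDEN_TEXT = 8;
    static final int E_NO_INVISIBLE_TEXT = 16;
    static final int E_NO_WATERMARKS = 128;
    static final int E_EXTRACT_USING_ZORDER = 256;

    // XML output flags (assumed to be the ones meant for getAsXML(int)).
    static final int E_WORDS_AS_ELEMENTS = 1;
    static final int E_OUTPUT_BBOX = 2;
    static final int E_OUTPUT_STYLE_INFO = 4;

    /** Returns true if the given single-bit flag is present in the combined value. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    public static void main(String[] args) {
        // Combine flags with bitwise OR.
        int processing = E_REMOVE_HIDDEN_TEXT | E_NO_INVISIBLE_TEXT;
        int xml = E_WORDS_AS_ELEMENTS | E_OUTPUT_BBOX | E_OUTPUT_STYLE_INFO;

        System.out.println(processing); // 24
        System.out.println(xml);        // 7
        System.out.println(isSet(processing, E_REMOVE_HIDDEN_TEXT)); // true
        System.out.println(isSet(processing, E_PUNCT_BREAK));        // false
    }
}
```

Because the two sets share numeric values, passing an XML output flag to begin() would compile but select a different processing option than intended.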

Public Constructors

public TextExtractor ()

Constructor. Instantiate new TextExtractor.

Public Methods

public void begin (Page page, Rect clip_ptr)

Start reading the page.

Parameters

page	Page to read.
clip_ptr	A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.

public void begin (Page page, Rect clip_ptr, int flags)

Start reading the page.

Parameters

page	Page to read.
clip_ptr	A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
flags	A bitwise OR of processing flags (for example e_remove_hidden_text | e_no_invisible_text) used to control the text extraction algorithm.

public void begin (Page page)

Start reading the page.

Parameters

page	Page to read.

public void close ()

Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.

public void destroy ()

Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.
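Since the class implements AutoCloseable, one way to make sure close() runs even when extraction throws is a try-with-resources statement. The sketch below shows the pattern only. NativeResource is a hypothetical stand-in, not part of the library, used so that the example runs without PDFNet.

```java
final class CloseSketch {
    /** Hypothetical stand-in for an object that owns native memory. */
    static final class NativeResource implements AutoCloseable {
        private boolean closed;

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            // A real implementation would free native memory here.
            closed = true;
        }
    }

    /** Uses the resource inside try-with-resources and returns it afterwards. */
    static NativeResource useAndRelease() {
        NativeResource res = new NativeResource();
        try (NativeResource r = res) {
            // Work with r here; close() is called automatically on exit,
            // whether the block completes normally or throws.
            if (r.isClosed()) {
                throw new IllegalStateException("resource closed too early");
            }
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(useAndRelease().isClosed()); // true
    }
}
```

With the real class, the same shape would wrap a TextExtractor instance in the try header, under the assumption that its close() can be called this way without extra exception handling.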

public String getAsText ()

Get all words in the current selection as a single string.

Returns

The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
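Because the returned string separates words with spaces and newline characters, it can be split back into lines and words with plain string operations. The following is a small sketch under that assumption. The class and method names are illustrative, and the input is a hand-written string, not real extractor output.

```java
final class TextSplitSketch {
    /** Splits extracted text into lines on the '\n' separator. */
    static String[] lines(String text) {
        return text.isEmpty() ? new String[0] : text.split("\n");
    }

    /** Splits extracted text into words on runs of spaces and newlines. */
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("[ \\n]+");
    }

    public static void main(String[] args) {
        String text = "PDFNet SDK\nis great";
        System.out.println(lines(text).length); // 2
        System.out.println(words(text).length); // 4
    }
}
```

For positions or style information the Line and Word traversal is still needed; the string form only preserves the text itself.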

public String getAsText (boolean dehyphen)

Get all words in the current selection as a single string.

Parameters

dehyphen	If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.

Returns

The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
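To illustrate what dehyphenation means for the output string, here is a deliberately naive sketch that joins a word split by a hyphen at the end of a line. It is not the library's algorithm, which is not described here. The class name and the regular expression are assumptions chosen for illustration.

```java
final class DehyphenSketch {
    /**
     * Removes a hyphen that sits directly between a letter and a line break
     * followed by another letter, merging the two halves into one word.
     */
    static String dehyphenate(String text) {
        return text.replaceAll("(?<=\\p{L})-\\n(?=\\p{L})", "");
    }

    public static void main(String[] args) {
        System.out.println(dehyphenate("an exam-\nple of text")); // an example of text
        System.out.println(dehyphenate("a - b"));                 // a - b
    }
}
```

A rule this simple also merges genuine compounds that happen to break at the hyphen (for example "well-" followed by "known" becomes "wellknown"), which is one reason to prefer the built-in option where it applies.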

public String getAsXML ()

Get text content in a form of an XML string.

Note: This method returns the same as if calling getAsXML(0). Please see getAsXML(int) for more information.

Returns

The string containing XML output.

public String getAsXML (int xml_output_flags)

Get text content in a form of an XML string.

Note: XML output will be encoded in UTF-8 and will have the following structure:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
     <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
     <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
     <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
      ...
    </Line>
   </Para>
  </Flow>
 </Page>

The above XML output was generated by passing the following union of flags: e_words_as_elements | e_output_bbox | e_output_style_info.

In case 'xml_output_flags' was not specified, the default XML output would look as follows:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
    <Line>levels. Using the PDFNet PDF library, ...</Line>
     ...
   </Para>
  </Flow>
 </Page>

Parameters

xml_output_flags	flags controlling XML output.
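Once the XML string is available, it can be read with the JDK's standard DOM parser. The sketch below parses a hand-copied fragment shaped like the sample output above and collects the Word elements and the numbers in the first word's box attribute. The class name and the embedded sample are illustrative. The attribute name box is taken from the sample output, and the meaning of its four numbers is not assumed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XmlWordsSketch {
    // Fragment modeled on the documented output structure.
    static final String SAMPLE =
        "<Page num=\"1\" crop_box=\"0, 0, 612, 792\" media_box=\"0, 0, 612, 792\" rotate=\"0\">"
        + "<Flow id=\"1\"><Para id=\"1\"><Line box=\"72, 708.075, 467.895, 10.02\">"
        + "<Word box=\"72, 708.075, 30.7614, 10.02\">PDFNet</Word>"
        + "<Word box=\"106.188, 708.075, 15.9318, 10.02\">SDK</Word>"
        + "<Word box=\"125.617, 708.075, 6.22242, 10.02\">is</Word>"
        + "</Line></Para></Flow></Page>";

    /** Parses the XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Returns the text of every Word element in document order. */
    static List<String> words(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("Word");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    /** Returns the comma-separated numbers of the first Word's box attribute. */
    static double[] firstWordBox(String xml) {
        Element word = (Element) parse(xml).getElementsByTagName("Word").item(0);
        String[] parts = word.getAttribute("box").split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(words(SAMPLE));               // [PDFNet, SDK, is]
        System.out.println(firstWordBox(SAMPLE).length); // 4
    }
}
```

In the default output, where lines contain inline text and no Word elements, the same approach would read the Line elements instead.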

public TextExtractor.Line getFirstLine ()

Get the first line.

Note: To traverse the list of all text lines on the page use getNextLine(). To traverse the list of all words on a given line use getFirstWord().

Returns

The first line of text on the selected page.

public Highlights getHighlights (CharRange[] char_ranges)

Get a Highlights object based on an array of character ranges.

Parameters

char_ranges	an array of character ranges to be highlighted

Returns

a Highlights object containing the selected characters

public int getNumLines ()

Get the number of lines.

Returns

The number of lines of text on the selected page.

public boolean getRightToLeftLanguage ()

Checks if text extractor works in right-to-left language mode.

public String getTextUnderAnnot (Annot annot)

Get all the characters that intersect an annotation.

Parameters

annot	The annotation to intersect with.

public int getWordCount ()

Get the word count.

Returns

the number of words on the page.

public void setOCGContext (Context ctx)

Set the Optional Content Group (OCG) context that should be used when rendering the page. This function can be used to selectively render optional content (such as PDF layers) based on the states of optional content groups in the given context.

Parameters

ctx	Optional Content Group (OCG) context, or null if TextExtractor should process all content on the page.

public void setRightToLeftLanguage (boolean right_2_left)

Sets text extractor to work in right-to-left language mode.

Parameters

right_2_left	If `true`, text extractor is set to right-to-left language mode.
