java.lang.Object
   ↳ com.pdftron.pdf.TextExtractor
TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.
Possible use case scenarios for TextExtractor include:
The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:
Note: TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the ElementReader interface.
In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.
A sample use case:
// ... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
doc.initSecurityHandler();
Page page = doc.getPage(1); // first page of the document

TextExtractor txt = new TextExtractor();
// A null clip_ptr is assumed here to mean "no clipping rectangle".
txt.begin(page, null, TextExtractor.e_remove_hidden_text);
String text = txt.getAsText();

// or traverse words one by one...
TextExtractor.Word word;
for (TextExtractor.Line line = txt.getFirstLine(); line.isValid(); line = line.getNextLine()) {
    for (word = line.getFirstWord(); word.isValid(); word = word.getNextWord()) {
        String w = word.getString();
    }
}
For full sample code, please take a look at the TextExtract sample project.
Nested Classes | | |
---|---|---|
class | TextExtractor.CharRange | A TextExtractor.CharRange object represents a range of text based on Unicode character indices. |
class | TextExtractor.Compat | Compatibility layer API. |
class | TextExtractor.Line | |
class | TextExtractor.Style | A class representing the predominant text style associated with a given Line, Word, or Glyph. |
class | TextExtractor.Word | |
Constants | | |
---|---|---|
int | e_extract_using_zorder | Use Z-order as reading order for text. |
int | e_no_dup_remove | Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold. |
int | e_no_invisible_text | Enables removing text that uses rendering mode 3 (i.e. invisible text). |
int | e_no_ligature_exp | Disables expanding of ligatures using a predefined mapping. |
int | e_no_watermarks | Enables removal of text that is marked as part of a Watermark layer. |
int | e_output_bbox | Include bounding box information for each XML element. |
int | e_output_style_info | Include font and styling information. |
int | e_punct_break | Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters. |
int | e_remove_hidden_text | Enables removal of text that is obscured by images or rectangles. |
int | e_words_as_elements | Output words as XML elements instead of inline text. |
Public Constructors | |
---|---|
TextExtractor() | Constructor. |
Public Methods | | |
---|---|---|
void | begin(Page page, Rect clip_ptr) | Start reading the page. |
void | begin(Page page, Rect clip_ptr, int flags) | Start reading the page. |
void | begin(Page page) | Start reading the page. |
void | close() | Frees the native memory of the object. |
void | destroy() | Frees the native memory of the object. |
String | getAsText() | Get all words in the current selection as a single string. |
String | getAsText(boolean dehyphen) | Get all words in the current selection as a single string. |
String | getAsXML() | Get text content in the form of an XML string. |
String | getAsXML(int xml_output_flags) | Get text content in the form of an XML string. |
TextExtractor.Line | getFirstLine() | Get the first line. |
Highlights | getHighlights(CharRange[] char_ranges) | Get a Highlights object based on an array of character ranges. |
int | getNumLines() | Get the number of lines. |
boolean | getRightToLeftLanguage() | Checks whether the text extractor works in right-to-left language mode. |
String | getTextUnderAnnot(Annot annot) | Get all the characters that intersect an annotation. |
int | getWordCount() | Get the word count. |
void | setOCGContext(Context ctx) | Set the Optional Content Group (OCG) context that should be used when rendering the page. |
void | setRightToLeftLanguage(boolean right_2_left) | Sets the text extractor to work in right-to-left language mode. |
Inherited Methods | |
---|---|
From class | java.lang.Object |
From interface | java.lang.AutoCloseable |
e_extract_using_zorder: Use Z-order as reading order for text.
e_no_dup_remove: Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold.
e_no_invisible_text: Enables removing text that uses rendering mode 3 (i.e. invisible text). Invisible text is usually used in 'PDF Searchable Images' (i.e. scanned pages with corresponding OCR text). For this reason, invisible text is extracted by default.
e_no_ligature_exp: Disables expanding of ligatures using a predefined mapping. Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st, oe, OE.
e_no_watermarks: Enables removal of text that is marked as part of a Watermark layer.
e_output_bbox: Include bounding box information for each XML element. The bounding box information will be stored as 'bbox' attribute.
e_output_style_info: Include font and styling information.
e_punct_break: Treat punctuation (e.g. full stop, comma, semicolon, etc.) as word break characters.
e_remove_hidden_text: Enables removal of text that is obscured by images or rectangles. Since this option carries a small performance penalty for text extraction, it is not enabled by default.
e_words_as_elements: Output words as XML elements instead of inline text.
TextExtractor(): Constructor. Instantiates a new TextExtractor.
begin(Page page, Rect clip_ptr): Start reading the page.
Parameter | Description |
---|---|
page | Page to read. |
clip_ptr | The optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle. |
begin(Page page, Rect clip_ptr, int flags): Start reading the page.
Parameter | Description |
---|---|
page | Page to read. |
clip_ptr | The optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle. |
flags | A combination of ProcessingFlags constants (for example e_remove_hidden_text) used to control the text extraction algorithm. |
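Since flags is a single int, several processing options are presumably combined with bitwise OR, in the same way the XML output flags are combined for getAsXML(int). The sketch below shows that bit-mask pattern in plain Java. The FlagSketch class and its constant values are placeholders invented for illustration; real code would use the TextExtractor constants themselves rather than these numbers.

```java
// Illustration of combining int flags with bitwise OR; not PDFNet code.
class FlagSketch {
    // Placeholder values (assumed, not the library's actual values).
    static final int REMOVE_HIDDEN_TEXT = 1 << 0;
    static final int NO_INVISIBLE_TEXT  = 1 << 1;
    static final int PUNCT_BREAK        = 1 << 2;

    // ORs any number of flags into one mask.
    static int combine(int... flags) {
        int mask = 0;
        for (int f : flags) {
            mask |= f;
        }
        return mask;
    }

    // True if the given flag bit is set in the mask.
    static boolean isSet(int mask, int flag) {
        return (mask & flag) != 0;
    }

    public static void main(String[] args) {
        int mask = combine(REMOVE_HIDDEN_TEXT, NO_INVISIBLE_TEXT);
        System.out.println(isSet(mask, REMOVE_HIDDEN_TEXT)); // true
        System.out.println(isSet(mask, PUNCT_BREAK));        // false
    }
}
```

With the real API, the equivalent call would look like txt.begin(page, null, TextExtractor.e_remove_hidden_text | TextExtractor.e_no_invisible_text), assuming a null clip_ptr is accepted to mean "no clipping".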
close(): Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.
destroy(): Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.
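Because the class is listed as implementing java.lang.AutoCloseable, one way to make the explicit release deterministic is a try-with-resources block. The sketch below demonstrates the pattern with a stand-in class, NativeHandleSketch, which is hypothetical and only mimics an object that owns a native resource; it is not part of PDFNet.

```java
// Stand-in for an object that owns native memory; not PDFNet code.
class NativeHandleSketch implements AutoCloseable {
    private boolean open = true;

    boolean isOpen() {
        return open;
    }

    // Pretends to read from the native resource; fails once closed.
    String read() {
        if (!open) {
            throw new IllegalStateException("resource already closed");
        }
        return "text";
    }

    // Releases the (simulated) native resource.
    @Override
    public void close() {
        open = false;
    }

    // Uses the handle inside try-with-resources and reports whether it
    // was closed once the block finished.
    static boolean closedAfterUse() {
        NativeHandleSketch handle = new NativeHandleSketch();
        try (NativeHandleSketch h = handle) {
            h.read();
        }
        return !handle.isOpen();
    }

    public static void main(String[] args) {
        System.out.println(closedAfterUse()); // true
    }
}
```

Applied to this class, the same shape would presumably be try (TextExtractor txt = new TextExtractor()) { ... }, so that close() runs even if extraction throws.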
getAsText(): Get all words in the current selection as a single string.
getAsText(boolean dehyphen): Get all words in the current selection as a single string.
Parameter | Description |
---|---|
dehyphen | If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files. |
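To illustrate what the dehyphen option means in practice, the sketch below joins lines of plain text and merges a word that was split by a trailing hyphen. It is a simplified stand-in written with the standard library only. The class and method names (DehyphenSketch, joinLines) are hypothetical, and this is not the algorithm TextExtractor uses internally.

```java
import java.util.List;

// Hypothetical illustration of end-of-line hyphen removal; not PDFNet code.
class DehyphenSketch {

    // Joins lines into one string. When dehyphen is true, a line that ends
    // in '-' directly after a letter is merged with the following line.
    static String joinLines(List<String> lines, boolean dehyphen) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            boolean last = (i == lines.size() - 1);
            boolean splitWord = dehyphen && !last
                    && line.length() >= 2
                    && line.endsWith("-")
                    && Character.isLetter(line.charAt(line.length() - 2));
            if (splitWord) {
                // Drop the hyphen and do not add a separating space.
                out.append(line, 0, line.length() - 1);
            } else {
                out.append(line);
                if (!last) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("PDF text extrac-", "tion example");
        System.out.println(joinLines(lines, true));  // PDF text extraction example
        System.out.println(joinLines(lines, false)); // PDF text extrac- tion example
    }
}
```

A rule this simple would also merge genuinely hyphenated compounds that happen to break at a line end, which is one reason a real implementation needs more context than the sketch uses.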
getAsXML(): Get text content in the form of an XML string.
Note: This method returns the same result as calling getAsXML(0). Please see getAsXML(int) for more information.
getAsXML(int xml_output_flags): Get text content in the form of an XML string.
Note: XML output will be encoded in UTF-8 and will have the following structure:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
        <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
        <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
        <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
        ...
      </Line>
    </Para>
  </Flow>
</Page>
The above XML output was generated by passing the following union of flags: e_words_as_elements | e_output_bbox | e_output_style_info.
If 'xml_output_flags' is not specified, the default XML output looks as follows:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
      <Line>levels. Using the PDFNet PDF library, ...</Line>
      ...
    </Para>
  </Flow>
</Page>
Parameter | Description |
---|---|
xml_output_flags | Flags controlling XML output. |
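Once the XML string is available, it can be processed with any standard XML parser. The sketch below uses the JDK's DOM parser to collect each Word element and its box from a string shaped like the sample output. The XmlWordsSketch and WordBox names are hypothetical. Reading the four box values as x, y, width, height is an assumption inferred from the sample numbers, and the attribute name box is taken from the sample rather than from a documented schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for reading Word elements from getAsXML-style output.
class XmlWordsSketch {

    // A word and its box, interpreted as x, y, width, height (assumed).
    static final class WordBox {
        final String text;
        final double x, y, width, height;

        WordBox(String text, double[] b) {
            this.text = text;
            this.x = b[0];
            this.y = b[1];
            this.width = b[2];
            this.height = b[3];
        }
    }

    // Parses the XML and returns every Word element that has a 4-value box.
    static List<WordBox> parseWords(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("Word");
            List<WordBox> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String[] parts = e.getAttribute("box").split(",");
                if (parts.length < 4) {
                    continue; // no usable box on this element
                }
                double[] box = new double[4];
                for (int k = 0; k < 4; k++) {
                    box[k] = Double.parseDouble(parts[k].trim());
                }
                words.add(new WordBox(e.getTextContent(), box));
            }
            return words;
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    // Shortened copy of the sample output shown in this section.
    static final String SAMPLE =
            "<Page num=\"1\" crop_box=\"0, 0, 612, 792\" media_box=\"0, 0, 612, 792\" rotate=\"0\">"
            + "<Flow id=\"1\"><Para id=\"1\">"
            + "<Line box=\"72, 708.075, 467.895, 10.02\">"
            + "<Word box=\"72, 708.075, 30.7614, 10.02\">PDFNet</Word>"
            + "<Word box=\"106.188, 708.075, 15.9318, 10.02\">SDK</Word>"
            + "<Word box=\"125.617, 708.075, 6.22242, 10.02\">is</Word>"
            + "</Line></Para></Flow></Page>";

    public static void main(String[] args) {
        for (WordBox w : parseWords(SAMPLE)) {
            System.out.println(w.text + " starts at x=" + w.x);
        }
    }
}
```

Words are only present as separate elements when e_words_as_elements is passed, and boxes only when e_output_bbox is passed, so the helper returns an empty list for the default output.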
getFirstLine(): Get the first line.
Note: To traverse the list of all text lines on the page use getNextLine(). To traverse the list of all words on a given line use getFirstWord().
getHighlights(CharRange[] char_ranges): Get a Highlights object based on an array of character ranges.
Parameter | Description |
---|---|
char_ranges | An array of character ranges to be highlighted. |
getNumLines(): Get the number of lines.
getRightToLeftLanguage(): Checks whether the text extractor works in right-to-left language mode.
getTextUnderAnnot(Annot annot): Get all the characters that intersect an annotation.
Parameter | Description |
---|---|
annot | The annotation to intersect with. |
getWordCount(): Get the word count.
setOCGContext(Context ctx): Set the Optional Content Group (OCG) context that should be used when rendering the page. This function can be used to selectively render optional content (such as PDF layers) based on the states of optional content groups in the given context.
Parameter | Description |
---|---|
ctx | Optional Content Group (OCG) context, or null if TextExtractor should process all content on the page. |
setRightToLeftLanguage(boolean right_2_left): Sets the text extractor to work in right-to-left language mode.
Parameter | Description |
---|---|
right_2_left | If true, the text extractor is set to right-to-left language mode. |