#include <TextExtractor.h>
Public Types | |
enum | ProcessingFlags { e_no_ligature_exp = 1, e_no_dup_remove = 2, e_punct_break = 4, e_remove_hidden_text = 8, e_no_invisible_text = 16, e_no_watermarks = 128, e_extract_using_zorder = 256 } |
enum | XMLOutputFlags { e_words_as_elements = 1, e_output_bbox = 2, e_output_style_info = 4 } |
typedef pdftron::PDF::Style | Style |
typedef pdftron::PDF::Word | Word |
typedef pdftron::PDF::Line | Line |
Public Member Functions | |
TextExtractor () | |
~TextExtractor () | |
void | Begin (Page page, const Rect *clip_ptr=0, UInt32 flags=0) |
void | SetOCGContext (OCG::Context *ctx) |
int | GetWordCount () |
void | SetRightToLeftLanguage (bool rtl) |
bool | GetRightToLeftLanguage () |
UString | GetAsText (bool dehyphen=true) |
void | GetAsText (UString &out_str, bool dehyphen=true) |
UString | GetTextUnderAnnot (const Annot &annot) |
void | GetTextUnderAnnot (UString &out_str, const Annot &annot) |
UString | GetAsXML (UInt32 xml_output_flags=0) |
void | GetAsXML (UString &out_xml, UInt32 xml_output_flags=0) |
Highlights | GetHighlights (const std::vector< CharRange > &char_ranges) |
Highlights | GetHighlights (const CharRange *char_ranges, size_t char_ranges_count) |
int | GetNumLines () |
Line | GetFirstLine () |
void | Destroy () |
TextExtractor is used to analyze a PDF page and extract words and logical structure within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.
Possible use case scenarios for TextExtractor include:
The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:
Note: TextExtractor analyzes only the textual content of the page. Rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the pdftron.PDF.ElementReader interface.
In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.
A sample use case (in C++):
A sample use case (in C#):
For full sample code, see the TextExtract sample project.
Definition at line 116 of file TextExtractor.h.
Definition at line 121 of file TextExtractor.h.
Definition at line 119 of file TextExtractor.h.
Definition at line 120 of file TextExtractor.h.
Processing options that can be passed to the Begin() method to direct the content recognition algorithms.
Enumerator | |
---|---|
e_no_ligature_exp | |
e_no_dup_remove | |
e_punct_break | |
e_remove_hidden_text | |
e_no_invisible_text | |
e_no_watermarks | |
e_extract_using_zorder |
Definition at line 133 of file TextExtractor.h.
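Because every ProcessingFlags value is a distinct power of two, several options can be combined into the single UInt32 `flags` argument of Begin() with a bitwise OR. The sketch below illustrates only that arithmetic. It uses a local mirror of the enumerator values listed above so that it compiles without the PDFNet headers, and `HasFlag` and `HiddenTextFilter` are hypothetical helper names, not part of the library.

```cpp
#include <cstdint>

// Local mirror of the ProcessingFlags values listed above (an assumption made
// so the sketch is self-contained; real code would use TextExtractor's enum).
enum ProcessingFlags : std::uint32_t {
    e_no_ligature_exp      = 1,
    e_no_dup_remove        = 2,
    e_punct_break          = 4,
    e_remove_hidden_text   = 8,
    e_no_invisible_text    = 16,
    e_no_watermarks        = 128,
    e_extract_using_zorder = 256
};

// Hypothetical helper: true if the given option bit is set in a flag mask.
constexpr bool HasFlag(std::uint32_t flags, ProcessingFlags f) {
    return (flags & f) != 0;
}

// Hypothetical helper: the mask that requests removal of both obscured
// and invisible text.
constexpr std::uint32_t HiddenTextFilter() {
    return e_remove_hidden_text | e_no_invisible_text;
}
```

With the real headers, the equivalent mask would presumably be passed as the third argument of Begin(), for example `TextExtractor::e_remove_hidden_text | TextExtractor::e_no_invisible_text`.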
Flags controlling the structure of XML output in a call to GetAsXML().
Enumerator | |
---|---|
e_words_as_elements | |
e_output_bbox | |
e_output_style_info |
Definition at line 240 of file TextExtractor.h.
pdftron::PDF::TextExtractor::TextExtractor | ( | ) |
Constructor and destructor
pdftron::PDF::TextExtractor::~TextExtractor | ( | ) |
Start reading the page.
page | Page to read. |
clip_ptr | A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle. |
flags | A bitwise OR of ProcessingFlags values used to control the text extraction algorithm. |
void pdftron::PDF::TextExtractor::Destroy | ( | ) |
Frees the native memory of the object.
UString pdftron::PDF::TextExtractor::GetAsText | ( | bool | dehyphen = true | ) |
Get all words in the current selection as a single string.
out_str | The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters. |
dehyphen | If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files. |
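To make the effect of `dehyphen` concrete, here is a minimal sketch of the general idea of line-end dehyphenation. It is not the algorithm TextExtractor uses, which is not described in this reference. The function name `JoinLinesDehyphenated` and the heuristic (a trailing hyphen preceded by a letter and followed by a lowercase letter on the next line) are assumptions chosen for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration: join lines with '\n', but when a line ends in a
// hyphen that appears to split a word, drop the hyphen and merge the word.
std::string JoinLinesDehyphenated(const std::vector<std::string>& lines) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        const bool has_next = i + 1 < lines.size();
        // Heuristic (an assumption): letter + '-' at line end, and the next
        // line starts with a lowercase letter.
        const bool split_word =
            has_next && line.size() >= 2 && line.back() == '-' &&
            std::isalpha(static_cast<unsigned char>(line[line.size() - 2])) &&
            !lines[i + 1].empty() &&
            std::islower(static_cast<unsigned char>(lines[i + 1].front()));
        if (split_word) {
            out.append(line, 0, line.size() - 1);  // drop the hyphen, no separator
        } else {
            out += line;
            if (has_next) out += '\n';
        }
    }
    return out;
}
```

A simple heuristic like this can wrongly merge compound words that happen to break at a line end, which is one reason to keep the behaviour switchable through a parameter such as `dehyphen`.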
void pdftron::PDF::TextExtractor::GetAsText | ( | UString & | out_str, bool dehyphen = true | ) |
Get text content in the form of an XML string.
out_xml | - The string containing XML output. |
xml_output_flags | - flags controlling XML output. For more information, please see TextExtractor::XMLOutputFlags. |
XML output will be encoded in UTF-8 and will have the following structure:
The above XML output was generated by passing the following union of flags in the call to GetAsXML(): (TextExtractor::e_words_as_elements | TextExtractor::e_output_bbox | TextExtractor::e_output_style_info)
In case 'xml_output_flags' was not specified, the default XML output would look as follows:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
      <Line>levels. Using the PDFNet PDF library, ...</Line>
      ...
  </Flow>
</Page>
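As an illustration of consuming the default output, the following sketch pulls the text of each `<Line>` element out of an XML string with `std::regex`. It assumes the default structure shown above, in which `<Line>` contains only character data (no nested `<Word>` elements). It does not decode XML entities, and `ExtractLines` is a hypothetical helper name. A real XML parser is the safer choice for anything beyond quick inspection.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the character data of every <Line> element.
// Assumes <Line> has no attributes and no child elements (default output).
std::vector<std::string> ExtractLines(const std::string& xml) {
    static const std::regex line_re("<Line>([^<]*)</Line>");
    std::vector<std::string> lines;
    for (std::sregex_iterator it(xml.begin(), xml.end(), line_re), end;
         it != end; ++it) {
        lines.push_back((*it)[1].str());  // capture group 1: the line text
    }
    return lines;
}
```

If GetAsXML() is called with e_words_as_elements, the lines contain child elements and this pattern no longer matches, so the sketch applies to the default flags only.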
Line pdftron::PDF::TextExtractor::GetFirstLine | ( | ) |
Highlights pdftron::PDF::TextExtractor::GetHighlights | ( | const std::vector< CharRange > & | char_ranges | ) |
Get a Highlights object based on an array of character ranges.
char_ranges | an array of character ranges to be highlighted |
Highlights pdftron::PDF::TextExtractor::GetHighlights | ( | const CharRange * | char_ranges, size_t char_ranges_count | ) |
Get a Highlights object based on an array of character ranges.
char_ranges | an array of character ranges to be highlighted |
char_ranges_count | the number of ranges in the char_ranges array |
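One plausible way to produce such ranges is to search the extracted text for a term and record where each match starts and how long it is. The sketch below shows that step only. The layout of CharRange is not given in this section, so `CharRangeSketch` (a start index plus a length) is a hypothetical stand-in, and `FindRanges` is a hypothetical helper. Offsets here are byte offsets into a `std::string`, which may not correspond to character offsets in a UString.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for CharRange: a start offset and a length.
struct CharRangeSketch {
    int index;
    int length;
};

// Hypothetical helper: one range per non-overlapping occurrence of `term`.
std::vector<CharRangeSketch> FindRanges(const std::string& text,
                                        const std::string& term) {
    std::vector<CharRangeSketch> ranges;
    if (term.empty()) return ranges;  // avoid matching at every position
    for (std::size_t pos = text.find(term); pos != std::string::npos;
         pos = text.find(term, pos + term.size())) {
        ranges.push_back({static_cast<int>(pos),
                          static_cast<int>(term.size())});
    }
    return ranges;
}
```

The resulting vector would then be converted to the library's own CharRange type before being passed to either GetHighlights() overload.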
int pdftron::PDF::TextExtractor::GetNumLines | ( | ) |
bool pdftron::PDF::TextExtractor::GetRightToLeftLanguage | ( | ) |
Get all the characters that intersect an annotation.
annot | The annotation to intersect with. |
int pdftron::PDF::TextExtractor::GetWordCount | ( | ) |
void pdftron::PDF::TextExtractor::SetOCGContext | ( | OCG::Context * | ctx | ) |
Sets the Optional Content Group (OCG) context that should be used when processing the document. This function can be used to change the current OCG context. Optional content (such as PDF layers) will be selectively processed based on the states of optional content groups in the given context.
ctx | Optional Content Group (OCG) context, or NULL if TextExtractor should process all content on the page. |
void pdftron::PDF::TextExtractor::SetRightToLeftLanguage | ( | bool | rtl | ) |
Sets the directionality of the text extractor. Must be called before processing of a page starts.
rtl | If true, reverses the directionality of the TextExtractor algorithm. |