Class TextExtractor
TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.
Possible use case scenarios for TextExtractor include:
- Converting PDF pages to text or XML for content repurposing.
- Searching PDF pages for specific words or keywords.
- Indexing large PDF repositories for search or content retrieval.
- Classifying or summarizing PDF documents based on their text content.
- Finding specific words for content editing purposes (such as splitting pages).
TextExtractor can:
- Normalize all text content to Unicode.
- Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
- Extract positioning information for every line, word, or glyph.
- Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
- Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let the user direct the content recognition algorithms to meet their requirements.
- Offer utility methods to convert PDF page content to text, XML, or HTML.
TextExtractor analyzes only the textual content of the page. Rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the pdftron.PDF.ElementReader interface.
In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.
//... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
doc.InitSecurityHandler();
Page page = doc.GetPage(1); // first page of the document
TextExtractor txt = new TextExtractor();
// Read the area inside the page's crop box and drop text hidden behind other content.
txt.Begin(page, page.GetCropBox(), TextExtractor.ProcessingFlags.e_remove_hidden_text);
string text = txt.GetAsText();
// or traverse words one by one...
TextExtractor.Word word;
for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine()) {
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord()) {
        string w = word.GetString();
    }
}
Namespace: pdftron.PDF
Assembly: PDFNet.dll
Syntax
public class TextExtractor : IDisposable
Constructors
TextExtractor()
Creates a new TextExtractor.
Declaration
public TextExtractor()
Methods
Begin(Page)
Start reading the page.
Declaration
public void Begin(Page page)
Parameters
Type | Name | Description |
---|---|---|
Page | page | Page to read. |
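Begin(Page, Rect)
Start reading the page, restricted to a clipping rectangle.
Declaration
public void Begin(Page page, Rect clip_ptr)
Parameters
Type | Name | Description |
---|---|---|
Page | page | Page to read. |
Rect | clip_ptr | Clipping rectangle; only text within this region is read. |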
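Begin(Page, Rect, ProcessingFlags)
Start reading the page, restricted to a clipping rectangle and controlled by processing flags.
Declaration
public void Begin(Page page, Rect clip_ptr, TextExtractor.ProcessingFlags flags)
Parameters
Type | Name | Description |
---|---|---|
Page | page | Page to read. |
Rect | clip_ptr | Clipping rectangle; only text within this region is read. |
TextExtractor.ProcessingFlags | flags | Processing flags that control the content analysis (for example, e_remove_hidden_text). |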
Dispose()
Releases all resources used by the TextExtractor.
Declaration
public override sealed void Dispose()
Dispose(bool)
Declaration
[HandleProcessCorruptedStateExceptions]
protected virtual void Dispose(bool A_0)
Parameters
Type | Name | Description |
---|---|---|
bool | A_0 |
EnableActualText(bool)
If true, treats all actual text as a single word.
Declaration
public void EnableActualText(bool enable)
Parameters
Type | Name | Description |
---|---|---|
bool | enable |
~TextExtractor()
Declaration
protected ~TextExtractor()
GetAsText()
Get all words in the current selection as a single string.
Declaration
public string GetAsText()
Returns
Type | Description |
---|---|
string | The string containing all words in the current selection. Words are separated by space (' ') or newline ('\n') characters. |
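Because the returned string uses only spaces and newlines as separators, it can be split back into words with standard string functions. The following is a minimal sketch on a plain string; `ExtractedText` and `SplitWords` are hypothetical names used for illustration and are not part of PDFNet.

```csharp
using System;

// Hypothetical helper, not part of PDFNet.
public static class ExtractedText
{
    // Splits text on the two separators named in the return description:
    // space (' ') and newline ('\n'). Empty entries are discarded.
    public static string[] SplitWords(string text)
    {
        return text.Split(new[] { ' ', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

In practice the argument would be the string returned by GetAsText(). For positions or styles, traversing lines and words is likely the better fit, since a plain string carries no layout information.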
GetAsText(bool)
Get all words in the current selection as a single string.
Declaration
public string GetAsText(bool dehyphen)
Parameters
Type | Name | Description |
---|---|---|
bool | dehyphen | If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files. |
Returns
Type | Description |
---|---|
string | The string containing all words in the current selection. Words are separated by space (' ') or newline ('\n') characters. |
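To illustrate the effect that the dehyphen parameter describes, here is a small sketch that works on plain strings. It is not PDFNet's hyphen detection, which is not documented here; `Dehyphenator` and its heuristic (a trailing hyphen preceded by a letter, with the next line starting in lowercase) are assumptions chosen for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of PDFNet. Illustrates the idea of
// dehyphenation only; the library's own detection may differ.
public static class Dehyphenator
{
    public static string Dehyphenate(string text)
    {
        string[] lines = text.Split('\n');
        var result = new StringBuilder();

        for (int i = 0; i < lines.Length; i++)
        {
            string line = lines[i];
            bool hasNext = i + 1 < lines.Length;

            // Assumed heuristic: "letter + '-'" at the end of a line, followed
            // by a line that starts with a lowercase letter, is a split word.
            bool splitWord = hasNext
                && line.Length > 1
                && line.EndsWith("-", StringComparison.Ordinal)
                && char.IsLetter(line[line.Length - 2])
                && lines[i + 1].Length > 0
                && char.IsLower(lines[i + 1][0]);

            if (splitWord)
            {
                // Drop the hyphen and the line break so the halves merge.
                result.Append(line, 0, line.Length - 1);
            }
            else
            {
                result.Append(line);
                if (hasNext) result.Append('\n');
            }
        }
        return result.ToString();
    }
}
```

With this heuristic, a line ending in "extrac-" followed by "tor reads pages." is merged into "extractor reads pages.", while a numeric range such as "1-" at the end of a line is left untouched.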
GetAsXML()
Get text content in the form of an XML string.
Declaration
public string GetAsXML()
Returns
Type | Description |
---|---|
string | The string containing XML output. |
Remarks
XML output will be encoded in UTF-8 and will have the following structure:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
        <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
        <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
        <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
        ...
      </Line>
    </Para>
  </Flow>
</Page>
The above XML output was generated by passing the following union of flags in the call to GetAsXML():
(TextExtractor.XMLOutputFlags.e_words_as_elements | TextExtractor.XMLOutputFlags.e_output_bbox | TextExtractor.XMLOutputFlags.e_output_style_info)
If no flags are specified, the default XML output looks as follows:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
      <Line>levels. Using the PDFNet PDF library, ...</Line>
      ...
    </Para>
  </Flow>
</Page>
GetAsXML(XMLOutputFlags)
Get text content in the form of an XML string.
Declaration
public string GetAsXML(TextExtractor.XMLOutputFlags flags)
Parameters
Type | Name | Description |
---|---|---|
TextExtractor.XMLOutputFlags | flags | Flags controlling XML output. For more information, see TextExtractor.XMLOutputFlags. |
Returns
Type | Description |
---|---|
string | The string containing XML output. |
Remarks
XML output will be encoded in UTF-8 and will have the following structure:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
        <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
        <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
        <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
        ...
      </Line>
    </Para>
  </Flow>
</Page>
The above XML output was generated by passing the following union of flags in the call to GetAsXML():
(TextExtractor.XMLOutputFlags.e_words_as_elements | TextExtractor.XMLOutputFlags.e_output_bbox | TextExtractor.XMLOutputFlags.e_output_style_info)
If no flags are specified, the default XML output looks as follows:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
    <Para id="1">
      <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
      <Line>levels. Using the PDFNet PDF library, ...</Line>
      ...
    </Para>
  </Flow>
</Page>
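The word-level XML shown in the remarks can be post-processed with the standard System.Xml.Linq classes. The sketch below parses a copy of that sample and collects each Word element with the four numbers from its box attribute. `ExtractedWord` and `TextExtractorXml` are hypothetical names, not part of PDFNet, and the code makes no assumption about what the four box values mean beyond their order in the attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container, not part of PDFNet.
public sealed class ExtractedWord
{
    public string Text { get; }
    // The numbers from the "box" attribute, in document order.
    public double[] Box { get; }

    public ExtractedWord(string text, double[] box)
    {
        Text = text;
        Box = box;
    }
}

// Hypothetical helper, not part of PDFNet.
public static class TextExtractorXml
{
    // Copy of the word-level sample from the remarks (elided parts omitted).
    public const string SampleXml = @"<Page num=""1"" crop_box=""0, 0, 612, 792"" media_box=""0, 0, 612, 792"" rotate=""0"">
  <Flow id=""1"">
    <Para id=""1"">
      <Line box=""72, 708.075, 467.895, 10.02"" style=""font-family:Calibri; font-size:10.02; color: #000000;"">
        <Word box=""72, 708.075, 30.7614, 10.02"">PDFNet</Word>
        <Word box=""106.188, 708.075, 15.9318, 10.02"">SDK</Word>
        <Word box=""125.617, 708.075, 6.22242, 10.02"">is</Word>
      </Line>
    </Para>
  </Flow>
</Page>";

    // Returns every <Word> element in document order.
    public static List<ExtractedWord> ReadWords(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("Word")
            .Select(w => new ExtractedWord(w.Value, ParseBox((string)w.Attribute("box"))))
            .ToList();
    }

    // Parses "a, b, c, d" into doubles; returns an empty array if the
    // attribute is missing (for example, when e_output_bbox was not set).
    private static double[] ParseBox(string box)
    {
        if (string.IsNullOrWhiteSpace(box)) return new double[0];
        return box.Split(',')
            .Select(part => double.Parse(part.Trim(), CultureInfo.InvariantCulture))
            .ToArray();
    }
}
```

In real use the argument to ReadWords would be the string returned by GetAsXML with e_words_as_elements set; with the default flags there are no Word elements, so the returned list would be empty. Parsing with the invariant culture keeps the decimal points readable regardless of the machine's locale.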
GetFirstLine()
Gets the first line of text on the selected page.
Declaration
public TextExtractor.Line GetFirstLine()
Returns
Type | Description |
---|---|
TextExtractor.Line | The first line of text on the selected page. |
GetHighlights(CharRange[])
Declaration
public Highlights GetHighlights(TextExtractor.CharRange[] char_ranges)
Parameters
Type | Name | Description |
---|---|---|
CharRange[] | char_ranges |
Returns
Type | Description |
---|---|
Highlights |
GetNumLines()
Gets the number of lines.
Declaration
public int GetNumLines()
Returns
Type | Description |
---|---|
int | The number of lines. |
GetRightToLeftLanguage()
Gets the directionality of the text extractor.
Declaration
public bool GetRightToLeftLanguage()
Returns
Type | Description |
---|---|
bool | The directionality of the text extractor: true if right-to-left processing is enabled. |
GetTextUnderAnnot(Annot)
Declaration
public string GetTextUnderAnnot(Annot annot)
Parameters
Type | Name | Description |
---|---|---|
Annot | annot |
Returns
Type | Description |
---|---|
string |
GetWordCount()
Gets the word count.
Declaration
public int GetWordCount()
Returns
Type | Description |
---|---|
int | The number of words on the page. |
SetOCGContext(Context)
Declaration
public void SetOCGContext(Context ctx)
Parameters
Type | Name | Description |
---|---|---|
Context | ctx |
SetRightToLeftLanguage(bool)
Sets the directionality of the text extractor. Must be called before processing of a page starts. If true, reverses the directionality of the TextExtractor algorithm.
Declaration
public void SetRightToLeftLanguage(bool rtl)
Parameters
Type | Name | Description |
---|---|---|
bool | rtl |