TextExtractor Class
TextExtractor analyzes only the textual content of a page. Rasterized text (e.g. in scanned pages) and vectorized text (where glyphs have been converted to path outlines) are therefore not recognized as text. Such content can still be extracted through the pdftron.PDF.ElementReader interface.
In some cases TextExtractor may extract text that is not visible on the page (e.g. when text is obscured by an image or a rectangle). In these situations, processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' can be used to remove the hidden text.
    // ... Initialize PDFNet ...
    PDFDoc doc = new PDFDoc(filein);
    doc.InitSecurityHandler();
    Page page = doc.GetPage(1); // first page of the document

    TextExtractor txt = new TextExtractor();
    // No clipping rectangle (null); drop text hidden behind other content.
    txt.Begin(page, null, TextExtractor.ProcessingFlags.e_remove_hidden_text);
    string text = txt.GetAsText();

    // ... or traverse words one by one ...
    TextExtractor.Word word;
    for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
    {
        for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
        {
            string w = word.GetString();
        }
    }
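The two flags named above can presumably be used together. The following is a minimal sketch rather than confirmed SDK usage. It assumes that the flags are enum values that combine with the bitwise OR operator, that `null` is accepted as the clipping rectangle, and that PDFNet has already been initialized by the caller. The class name `HiddenTextSketch` and the method name `ExtractVisibleText` are hypothetical, and the enum is written as in the example above (the member list names it TextExtractorProcessingFlags, so adjust to what your SDK build exposes).

```csharp
using pdftron.PDF;

// Hypothetical helper, for illustration only.
static class HiddenTextSketch
{
    // Returns the text of a page with both hidden-text filters requested.
    public static string ExtractVisibleText(Page page)
    {
        // 'using' releases the extractor when the block exits.
        using (TextExtractor txt = new TextExtractor())
        {
            // Assumption: the two flags may be OR-ed into a single value.
            txt.Begin(page, null,
                TextExtractor.ProcessingFlags.e_remove_hidden_text |
                TextExtractor.ProcessingFlags.e_no_invisible_text);
            return txt.GetAsText();
        }
    }
}
```

Comparing the result with a plain `txt.Begin(page)` extraction of the same page is one way to check whether the flags actually removed anything.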
Namespace: pdftron.PDF
public sealed class TextExtractor : IClosable
The TextExtractor type exposes the following members.
Name | Description
---|---
TextExtractor | Constructor and destructor.

Name | Description
---|---
Begin(Page) | Start reading the page.
Begin(Page, Rect) | Start reading the page.
Begin(Page, Rect, TextExtractorProcessingFlags) | Start reading the page.
Close |
Equals | (Inherited from Object.)
GetAsText | Get all words in the current selection as a single string.
GetAsText(Boolean) | Get all words in the current selection as a single string.
GetAsXML | Get text content in the form of an XML string.
GetAsXML(TextExtractorXMLOutputFlags) | Get text content in the form of an XML string.
GetFirstLine | Gets the first line of text on the selected page.
GetHashCode | Serves as a hash function for a particular type. (Inherited from Object.)
GetHighlights |
GetNumLines | Gets the number of lines.
GetTextUnderAnnot | Get all the characters that intersect an annotation.
GetType | Gets the Type of the current instance. (Inherited from Object.)
GetWordCount | Gets the word count.
SetRightToLeftLanguage | Tells TextExtractor that the document reads from right to left.
ToString | Returns a string that represents the current object. (Inherited from Object.)
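
As a rough illustration of how several of the members listed above might be used together, the sketch below prints a short summary of one page. It is an assumption-laden example, not reference usage: it relies only on the parameterless overloads named in the table, assumes PDFNet is already initialized, and uses the hypothetical names `PageSummarySketch` and `PrintSummary`.

```csharp
using System;
using pdftron.PDF;

// Hypothetical helper, for illustration only.
static class PageSummarySketch
{
    // Writes line count, word count, plain text, and XML for one page.
    public static void PrintSummary(Page page)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            txt.Begin(page); // read the whole page with default settings

            Console.WriteLine("Lines: " + txt.GetNumLines());
            Console.WriteLine("Words: " + txt.GetWordCount());

            // All words of the page as a single string.
            Console.WriteLine(txt.GetAsText());

            // The same content as an XML string.
            Console.WriteLine(txt.GetAsXML());
        }
    }
}
```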