
TextExtractor Class

TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML. Possible use case scenarios for TextExtractor include:
  • Converting PDF pages to text or XML for content repurposing.
  • Searching PDF pages for specific words or keywords.
  • Indexing large PDF repositories for content retrieval purposes (e.g. implementing a PDF search engine).
  • Classifying or summarizing PDF documents based on their text content.
  • Finding specific words for content editing purposes (such as splitting pages).
The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:
  • Normalize all text content to Unicode.
  • Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
  • Extract positioning information for every line, word, or glyph.
  • Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
  • Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let the user direct the content recognition algorithms to meet their requirements.
  • Offer utility methods to convert PDF page content to text, XML, or HTML.
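The positioning information listed above can be read while traversing the page word by word. The following is a minimal sketch rather than code from the TextExtract sample. It assumes that TextExtractor.Word exposes a GetBBox() method returning a Rect with x1, y1, x2, and y2 members, that PDFNet.Initialize() can be called without arguments, and that PDFDoc.GetPage(1) returns the first page. None of these members is documented on this page, so verify them against the reference for your PDFNet version.

```csharp
using System;
using pdftron;
using pdftron.PDF;

class WordBoxes
{
    static void Main(string[] args)
    {
        // Assumption: parameterless initialization; some versions require a license key.
        PDFNet.Initialize();

        // args[0] is the path of the PDF to read.
        using (PDFDoc doc = new PDFDoc(args[0]))
        {
            // Assumption: GetPage uses 1-based page numbers.
            Page page = doc.GetPage(1);

            using (TextExtractor txt = new TextExtractor())
            {
                txt.Begin(page); // read the whole page with default options

                for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
                {
                    for (TextExtractor.Word word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
                    {
                        // Assumption: GetBBox() returns the word's bounding box as a Rect.
                        Rect box = word.GetBBox();
                        Console.WriteLine("{0}\t[{1:F2}, {2:F2}, {3:F2}, {4:F2}]",
                            word.GetString(), box.x1, box.y1, box.x2, box.y2);
                    }
                }
            }
        }
    }
}
```

Each output line pairs a word with its bounding box, which is the kind of data a highlighter or a page splitter would consume.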

TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) or vectorized text (where glyphs are converted to path outlines) will not be recognized as text. Please note that it is still possible to extract this content using the pdftron.PDF.ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.
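As a sketch of how the two flags named above might be combined, the helper below passes both to Begin(Page, Rect, ProcessingFlags) and returns the remaining text. The class and method names (VisibleText, Extract) are hypothetical. The code also assumes that the flag values can be combined with the bitwise OR operator and that passing null as the Rect argument selects the whole page. Neither assumption is confirmed on this page.

```csharp
using pdftron.PDF;

// Hypothetical helper, not part of the PDFNet API.
static class VisibleText
{
    public static string Extract(Page page)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            // Assumption: the processing flags can be OR-ed together.
            TextExtractor.ProcessingFlags flags =
                TextExtractor.ProcessingFlags.e_remove_hidden_text |
                TextExtractor.ProcessingFlags.e_no_invisible_text;

            // Assumption: a null clip rectangle means "no clipping".
            txt.Begin(page, null, flags);

            return txt.GetAsText();
        }
    }
}
```

A caller would obtain a Page from an open PDFDoc and call VisibleText.Extract(page) to get the page text with hidden and invisible text filtered out.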

For full sample code, please take a look at the TextExtract sample project.
//... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
Page page = doc.GetPage(1); // first page
TextExtractor txt = new TextExtractor();
// A null clip rectangle is assumed to select the whole page.
txt.Begin(page, null, TextExtractor.ProcessingFlags.e_remove_hidden_text);
string text = txt.GetAsText();
// or traverse words one by one...
TextExtractor.Word word;
for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine()) {
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord()) {
        string w = word.GetString();
    }
}

Inheritance Hierarchy

Namespace:  pdftron.PDF
Assembly:  pdftron (in pdftron.dll) Version:
public sealed class TextExtractor : IClosable

The TextExtractor type exposes the following members.

Public method TextExtractor
Constructor and destructor.
Public method Begin(Page)
Start reading the page.
Public method Begin(Page, Rect)
Start reading the page.
Public method Begin(Page, Rect, TextExtractorProcessingFlags)
Start reading the page.
Public method Close
Public method Equals
Determines whether the specified Object is equal to the current Object.
(Inherited from Object.)
Public method GetAsText
Get all words in the current selection as a single string.
Public method GetAsText(Boolean)
Get all words in the current selection as a single string.
Public method GetAsXML
Get text content in the form of an XML string.
Public method GetAsXML(TextExtractorXMLOutputFlags)
Get text content in the form of an XML string.
Public method GetFirstLine
Gets the first line of text on the selected page.
Public method GetHashCode
Serves as a hash function for a particular type.
(Inherited from Object.)
Public method GetHighlights
Public method GetNumLines
Gets the number of lines.
Public method GetTextUnderAnnot
Get all the characters that intersect an annotation.
Public method GetType
Gets the Type of the current instance.
(Inherited from Object.)
Public method GetWordCount
Gets the word count.
Public method SetRightToLeftLanguage
Tells TextExtractor that the document reads from right to left.
Public method ToString
Returns a string that represents the current object.
(Inherited from Object.)
See Also