TextExtractor Class

TextExtractor is used to analyze a PDF page and extract words and logical structures that are visible within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML. Possible use case scenarios for TextExtractor include:

Converting PDF pages to text or XML for content repurposing.
Searching PDF pages for specific words or keywords.
Indexing large PDF repositories for content retrieval purposes (e.g. implementing a PDF search engine).
Classifying or summarizing PDF documents based on their text content.
Finding specific words for content editing purposes (such as splitting pages).
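
As a rough sketch of the searching and indexing scenarios above, the snippet below builds a small inverted index from page text that has already been extracted. It assumes each input string is what GetAsText returned for one page. PageTextIndex and Build are hypothetical names used for illustration and are not part of pdftron.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the pdftron API: maps each word to the
// 1-based page numbers whose extracted text contains it.
public static class PageTextIndex
{
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // pageTexts[i] is assumed to hold the text extracted from page i + 1.
    public static Dictionary<string, SortedSet<int>> Build(IReadOnlyList<string> pageTexts)
    {
        // Case-insensitive keys so "PDF" and "pdf" share one entry.
        var index = new Dictionary<string, SortedSet<int>>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < pageTexts.Count; i++)
        {
            string[] tokens = pageTexts[i].Split(Separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (string raw in tokens)
            {
                // Trim leading and trailing non-alphanumeric characters,
                // so that "engine." is indexed as "engine".
                int start = 0, end = raw.Length;
                while (start < end && !char.IsLetterOrDigit(raw[start])) start++;
                while (end > start && !char.IsLetterOrDigit(raw[end - 1])) end--;
                if (start == end) continue;

                string word = raw.Substring(start, end - start);
                if (!index.TryGetValue(word, out SortedSet<int> pages))
                {
                    pages = new SortedSet<int>();
                    index[word] = pages;
                }
                pages.Add(i + 1);
            }
        }
        return index;
    }
}
```

A lookup such as index["pdf"] then yields the page numbers containing that word. A real search engine would likely add stemming, ranking, and persistent storage on top of this.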

The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:

Normalize all text content to Unicode.
Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
Extract positioning information for every line, word, or glyph.
Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let the user tune the content recognition algorithms to their requirements.
Offer utility methods to convert PDF page content to text, XML, or HTML.

Remarks

TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) or vectorized text (where glyphs are converted to path outlines) will not be recognized as text. Note that it is still possible to extract this content using the pdftron.PDF.ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

Examples

For full sample code, please take a look at the TextExtract sample project.

//... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
doc.InitSecurityHandler();
Page page = doc.GetPageIterator().Current();
TextExtractor txt = new TextExtractor();
// null Rect: no clip region is supplied.
txt.Begin(page, null, TextExtractorProcessingFlags.e_remove_hidden_text);
string text = txt.GetAsText();
// or traverse words one by one...
TextExtractor.Word word;
for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine()) {
    for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord()) {
        string w = word.GetString();
    }
}

Inheritance Hierarchy

System.Object
  pdftron.PDF.TextExtractor

Namespace: pdftron.PDF
Assembly: pdftron (in pdftron.dll) Version: 255.255.255.255

Syntax

C#
public sealed class TextExtractor : IClosable

VB
Public NotInheritable Class TextExtractor
	Implements IClosable

C++
public ref class TextExtractor sealed : IClosable

JavaScript
pdftron.PDF.TextExtractor = function();

Type.createClass(
	'pdftron.PDF.TextExtractor',
	null,
	Windows.Foundation.IClosable);

The TextExtractor type exposes the following members.

Constructors

	Name	Description
	TextExtractor	Constructor and destructor.

Methods

	Name	Description
	Begin(Page)	Start reading the page.
	Begin(Page, Rect)	Start reading the page.
	Begin(Page, Rect, TextExtractorProcessingFlags)	Start reading the page.
	Close
	Equals	Determines whether the specified Object is equal to the current Object. (Inherited from Object.)
	GetAsText	Get all words in the current selection as a single string.
	GetAsText(Boolean)	Get all words in the current selection as a single string.
	GetAsXML	Get text content in a form of an XML string.
	GetAsXML(TextExtractorXMLOutputFlags)	Get text content in a form of an XML string.
	GetFirstLine	Gets the first line of text on the selected page.
	GetHashCode	Serves as a hash function for a particular type. (Inherited from Object.)
	GetHighlights
	GetNumLines	Gets the number of lines.
	GetTextUnderAnnot	Get all the characters that intersect an annotation.
	GetType	Gets the Type of the current instance. (Inherited from Object.)
	GetWordCount	Gets the word count.
	SetRightToLeftLanguage	Tells TextExtractor that the document reads from right to left.
	ToString	Returns a string that represents the current object. (Inherited from Object.)

Reference

pdftron.PDF Namespace