
public class

TextSearch

extends Object
implements AutoCloseable

java.lang.Object
↳	com.pdftron.pdf.TextSearch

Class Overview

TextSearch searches a PDF document for a user-supplied pattern. The current implementation supports both verbatim search and regular-expression search; the detailed regular-expression syntax is documented at:

http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

TextSearch also provides several search modes and can return extra information besides the matched string. It can either keep running until a match is found or be set to return periodically so that the caller can perform any necessary updates (e.g., UI updates). The search mode can also be changed on the fly while searching through a document.

Possible use case scenarios for TextSearch include:

Guide users of a PDF viewer (e.g., one implemented with PDFViewCtrl) to the places they are interested in;
Find PDF documents of interest that contain certain patterns;
Extract information of interest (e.g., credit card numbers) from a set of files;
Extract Highlight information (refer to the Highlights class for details) from files for external use.

Note: Hyphens ('-') are frequently used in PDF documents to join the two pieces of a word that is broken at the end of a line. For example, in

"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow- erful."

a search for "powerful" should return both instances. However, not every end-of-line hyphen was added to connect a broken word; some are "real" hyphens. In addition, the search pattern itself may contain hyphens, which complicates the situation. To handle this, the following conventions are adopted:

In verbatim search mode, when the pattern contains no hyphen, a string matches if it is identical to the pattern or differs from it only by end-of-line or start-of-line hyphens. For example, as mentioned above, a search for "powerful" returns both instances.
In verbatim search mode, when the pattern contains one or more hyphens, a string matches only if it matches the pattern exactly. For example, a search for "pow-erful" returns only the second instance, and a search for "power-ful" returns nothing.
In regular-expression mode, hyphens are not handled implicitly; the pattern must account for them itself. For example, to find both "powerful" instances, the pattern can be "pow-{0,1}erful".
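
The regular-expression convention above can be tried out without PDFNet. The following is a minimal sketch that uses java.util.regex as a stand-in for the Boost Perl syntax TextSearch relies on (an assumption; the two engines treat this simple quantifier the same way). The class name HyphenRegexSketch and its helper are hypothetical and not part of the PDFNet API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenRegexSketch {
    // Pattern from the text: an optional hyphen between "pow" and "erful".
    static final Pattern OPTIONAL_HYPHEN = Pattern.compile("pow-{0,1}erful");

    // Counts non-overlapping matches of the pattern in the given text.
    static int countMatches(String text) {
        Matcher m = OPTIONAL_HYPHEN.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Plain-string stand-in for extracted page text (not read from a PDF).
        String text = "It is powerful; yes, it is really pow-erful.";
        System.out.println(countMatches(text));
    }
}
```

With this pattern both "powerful" and "pow-erful" match, while "power-ful" does not, because the optional hyphen is only allowed directly after "pow".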

A sample use case is shown below (for a full sample, see the TextSearch sample project):

 // ... Initialize PDFNet ...
 PDFDoc doc = new PDFDoc(filein);
 doc.initSecurityHandler();
 int mode = TextSearch.e_whole_word | TextSearch.e_page_stop;
 String pattern = "joHn sMiTh";
 TextSearch txt_search = new TextSearch();

 // PDFDoc doesn't allow simultaneous access from different threads. If this
 // document could be used from other threads (e.g., the rendering thread inside
 // PDFView/PDFViewCtrl, if used), it is good practice to lock it.
 // Notice: don't forget to call doc.unlock() to avoid deadlock.
 doc.lock();

 txt_search.begin( doc, pattern, mode, -1, -1 );
 while ( true ) {
     TextSearchResult result = txt_search.run();
     if ( result.getCode() == TextSearchResult.e_found ) {
         System.out.println("found one instance: " + result.getResultStr());
     }
     else if ( result.getCode() == TextSearchResult.e_page ) {
         // e_page_stop is set, so run() also returns at the end of each page;
         // update the UI here if needed, then keep searching.
     }
     else {
         break;
     }
 }

 // unlock the document to avoid deadlock.
 doc.unlock();

Summary

Constants
int	e_ambient_string	Tells the search process to compute the ambient string of the found pattern.
int	e_case_sensitive	Match case-sensitively.
int	e_highlight	Tells the search process to compute Highlight information.
int	e_page_stop	Tells the search process to return when each page is finished; this is useful when a user needs run() to return periodically so that certain things (e.g., UI) can be updated from time to time.
int	e_raw_text_search	Tells the search process to refrain from replacing newlines with spaces.
int	e_reg_expression	Use regular expressions.
int	e_search_up	Search upward (from the end of the file and from the bottom of a page).
int	e_search_using_zorder	Tells the search process to use Z-order as reading order for text.
int	e_whole_word	Match the entire word.

Public Constructors
	TextSearch() Constructor.

Public Methods
boolean	begin(PDFDoc doc, String pattern, int mode, int start_page, int end_page) Initialize for the search process.
void	close() Frees the native memory of the object.
void	destroy() Frees the native memory of the object.
int	getCurrentPage() Retrieve the number of the page currently being searched.
int	getMode() Retrieve the current search mode.
TextSearchResult	run() Search the document.
void	setAmbientLettersAfter(int ambient_letters_after) Sets the maximum number of ambient string letters after the search term (default: 70).
void	setAmbientLettersBefore(int ambient_letters_before) Sets the maximum number of ambient string letters before the search term (default: 30).
void	setAmbientWordsAfter(int ambient_words_after) Sets the maximum number of ambient string words after the search term (default: 10).
void	setAmbientWordsBefore(int ambient_words_before) Sets the maximum number of ambient string words before the search term (default: 1).
void	setMode(int mode) Set the current search mode.
void	setOCGContext(Context ctx) Set the Optional Content Group (OCG) context that should be used when rendering the page.
boolean	setPattern(String pattern) Set the current search pattern.
void	setRightToLeftLanguage(boolean flag) Tells TextSearch that the document reads from right to left.


Inherited Methods

From class java.lang.Object

From interface java.lang.AutoCloseable

Constants

public static final int e_ambient_string

Tells the search process to compute the ambient string of the found pattern. This is useful if a user wants to examine or display what surrounds the found pattern.

Constant Value: 64 (0x00000040)

public static final int e_case_sensitive

Match case-sensitively.

Constant Value: 2 (0x00000002)

public static final int e_highlight

Tells the search process to compute Highlight information.

Constant Value: 32 (0x00000020)

public static final int e_page_stop

Tells the search process to return when each page is finished; this is useful when a user needs run() to return periodically so that certain things (e.g., UI) can be updated from time to time.

Constant Value: 16 (0x00000010)

public static final int e_raw_text_search

Tells the search process to refrain from replacing newlines with spaces.

Constant Value: 128 (0x00000080)

public static final int e_reg_expression

Use regular expressions.

Constant Value: 1 (0x00000001)

public static final int e_search_up

Search upward (from the end of the file and from the bottom of a page).

Constant Value: 8 (0x00000008)

public static final int e_search_using_zorder

Tells the search process to use Z-order as reading order for text.

Constant Value: 256 (0x00000100)

public static final int e_whole_word

Match the entire word.

Constant Value: 4 (0x00000004)
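
Because each constant occupies a distinct bit, a mode is an int built by OR-ing constants together. The sketch below decodes such an int back into constant names, which can be handy when logging the value returned by getMode(). It mirrors the names and values listed above in local arrays so that it runs without PDFNet; the class SearchModeSketch and its describe helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SearchModeSketch {
    // Names and values copied from the "Constant Value" entries above.
    static final String[] NAMES = {
        "e_reg_expression", "e_case_sensitive", "e_whole_word",
        "e_search_up", "e_page_stop", "e_highlight",
        "e_ambient_string", "e_raw_text_search", "e_search_using_zorder"
    };
    static final int[] VALUES = {1, 2, 4, 8, 16, 32, 64, 128, 256};

    // Lists the names of all flags whose bit is set in the given mode.
    static List<String> describe(int mode) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < VALUES.length; i++) {
            if ((mode & VALUES[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        int mode = 4 | 16; // same bits as e_whole_word | e_page_stop
        System.out.println(mode + " -> " + describe(mode));
    }
}
```

Running main prints `20 -> [e_whole_word, e_page_stop]`.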

Public Constructors

public TextSearch ()

Constructor. Create a new TextSearch object.

Public Methods

public boolean begin (PDFDoc doc, String pattern, int mode, int start_page, int end_page)

Initialize for the search process. This should be called before starting the actual search with method run().

Parameters

doc	the PDF document to search in.
pattern	the pattern to search for. In regular-expression mode this is the expression; in verbatim mode it is the exact string to search for.
mode	the mode of the search process.
start_page	the start page of the page range to search in. -1 indicates the range starts from the first page.
end_page	the end page of the page range to search in. -1 indicates the range ends at the last page.

Returns

true if the initialization has succeeded.

public void close ()

Frees the native memory of the object. This can be called explicitly to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.

Throws

PDFNetException

public void destroy ()

Frees the native memory of the object. This can be called explicitly to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.

public int getCurrentPage ()

Retrieve the number of the page currently being searched. A return value of -1 indicates that the search process has not been initialized (e.g., begin() has not been called yet); 0 indicates that the search process has finished; a positive value is a valid page number.

Returns

the current page number.
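
The three cases of the return value can be made explicit in calling code. The following is a minimal sketch of such a helper; CurrentPageSketch and describe are hypothetical names, and rejecting negative values other than -1 is an assumption, since the text above does not say whether they can occur.

```java
class CurrentPageSketch {
    // Maps a getCurrentPage()-style value to a human-readable status.
    static String describe(int currentPage) {
        if (currentPage == -1) {
            return "not initialized"; // begin() has not been called yet
        }
        if (currentPage == 0) {
            return "finished";        // the search process has completed
        }
        if (currentPage > 0) {
            return "searching page " + currentPage;
        }
        // Assumption: any other negative value is undocumented, so reject it.
        throw new IllegalArgumentException("unexpected page value: " + currentPage);
    }

    public static void main(String[] args) {
        System.out.println(describe(-1));
        System.out.println(describe(3));
        System.out.println(describe(0));
    }
}
```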

public int getMode ()

Retrieve the current search mode.

Returns

the current search mode.

public TextSearchResult run ()

Search the document. This method returns upon the following circumstances:

Reached the end of the document;
Reached the end of a page (if set to return by specifying mode 'e_page_stop');
Found an instance matching the search pattern.

Note that this method should be called in a loop in order to find all matching instances; in other words, the search is conducted in an incremental fashion.

Returns

the text search result
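
To illustrate the incremental contract described above, the sketch below imitates it over plain strings, one string per "page". It is a stand-in for illustration only, not how TextSearch is implemented: the class IncrementalSearchSketch, its Code enum, and the drain helper are all hypothetical, and the choice to report a page stop after the last page before reporting the end of the input is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalSearchSketch {
    // Hypothetical result codes used by this sketch only.
    enum Code { DONE, PAGE, FOUND }

    private final List<String> pages;
    private final String needle;
    private final boolean pageStop;
    private int page = 0;   // index of the page being searched
    private int offset = 0; // position within that page

    IncrementalSearchSketch(List<String> pages, String needle, boolean pageStop) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        this.pages = pages;
        this.needle = needle;
        this.pageStop = pageStop;
    }

    // Returns on a match, at the end of a page (if pageStop is set),
    // or when all pages have been searched.
    Code run() {
        while (page < pages.size()) {
            int hit = pages.get(page).indexOf(needle, offset);
            if (hit >= 0) {
                offset = hit + needle.length(); // resume after this match
                return Code.FOUND;
            }
            page++;
            offset = 0;
            if (pageStop) {
                return Code.PAGE;
            }
        }
        return Code.DONE;
    }

    // Calls run() in a loop and records every code, including the final DONE.
    static List<Code> drain(IncrementalSearchSketch search) {
        List<Code> codes = new ArrayList<>();
        Code code;
        do {
            code = search.run();
            codes.add(code);
        } while (code != Code.DONE);
        return codes;
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        pages.add("a cat");
        pages.add("dog");
        pages.add("cat cat");
        System.out.println(drain(new IncrementalSearchSketch(pages, "cat", true)));
    }
}
```

The point of the loop in drain is that a single call never yields all matches: each call resumes where the previous one stopped, and the caller decides what to do with each returned code.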

public void setAmbientLettersAfter (int ambient_letters_after)

Sets the maximum number of ambient string letters after the search term (default: 70). This should be called before starting the actual search with method run().

Parameters

ambient_letters_after	maximum number of letters

Throws

PDFNetException

public void setAmbientLettersBefore (int ambient_letters_before)

Sets the maximum number of ambient string letters before the search term (default: 30). This should be called before starting the actual search with method run().

Parameters

ambient_letters_before	maximum number of letters

Throws

PDFNetException

public void setAmbientWordsAfter (int ambient_words_after)

Sets the maximum number of ambient string words after the search term (default: 10). This should be called before starting the actual search with method run().

Parameters

ambient_words_after	maximum number of words

Throws

PDFNetException

public void setAmbientWordsBefore (int ambient_words_before)

Sets the maximum number of ambient string words before the search term (default: 1). This should be called before starting the actual search with method run().

Parameters

ambient_words_before	maximum number of words

Throws

PDFNetException

public void setMode (int mode)

Set the current search mode. For example, the following code turns on regular-expression mode:

 TextSearch ts = new TextSearch();
 int mode = ts.getMode();
 mode |= TextSearch.e_reg_expression;
 ts.setMode(mode);

Parameters

mode	the search mode to set.
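
Turning a mode off again is the complementary bit operation. The following sketch shows enabling, disabling, and testing a flag on a plain int, using values copied from the Constants section so that it runs without PDFNet; the class ModeToggleSketch and its helpers are hypothetical. With a real TextSearch, the int would presumably come from getMode() and go back through setMode().

```java
class ModeToggleSketch {
    // Values copied from the documented constants.
    static final int E_REG_EXPRESSION = 1;
    static final int E_WHOLE_WORD = 4;

    // Returns the mode with the given flag switched on.
    static int enable(int mode, int flag) {
        return mode | flag;
    }

    // Returns the mode with the given flag switched off.
    static int disable(int mode, int flag) {
        return mode & ~flag;
    }

    // Reports whether every bit of the flag is set in the mode.
    static boolean isSet(int mode, int flag) {
        return (mode & flag) == flag;
    }

    public static void main(String[] args) {
        int mode = E_WHOLE_WORD;
        mode = enable(mode, E_REG_EXPRESSION);
        System.out.println(isSet(mode, E_REG_EXPRESSION));
        mode = disable(mode, E_REG_EXPRESSION);
        System.out.println(isSet(mode, E_REG_EXPRESSION));
    }
}
```

Disabling a flag that is not set leaves the mode unchanged, so these helpers are safe to call without checking first.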

public void setOCGContext (Context ctx)

Set the Optional Content Group (OCG) context that should be used when rendering the page. This function can be used to selectively render optional content (such as PDF layers) based on the states of optional content groups in the given context.

Parameters

ctx	Optional Content Group (OCG) context, or null if TextSearch should process all content on the page.

public boolean setPattern (String pattern)

Set the current search pattern. Note that it is not necessary to call this method since the search pattern is already set when calling the begin() method. This method is provided for users to change the search pattern while searching through a document.

Parameters

pattern	the search pattern to set.

Returns

true if the setting has succeeded.

public void setRightToLeftLanguage (boolean flag)

Tells TextSearch that the document reads from right to left.
