public class TextSearch
extends Object
implements AutoCloseable
java.lang.Object
   ↳ com.pdftron.pdf.TextSearch

Class Overview

TextSearch searches through a PDF document for a user-given search pattern. The current implementation supports both verbatim search and the search using regular expressions, whose detailed syntax can be found at:

TextSearch also provides users with several useful search modes and extra information besides the found string that matches the pattern. TextSearch can either keep running until a matched string is found or be set to return periodically in order for the caller to perform any necessary updates (e.g., UI updates). It is also worth mentioning that the search modes can be changed on the fly while searching through a document.

Possible use case scenarios for TextSearch include:

  • Guide users of a PDF viewer (e.g. implemented by PDFViewCtrl) to the places they are interested in;
  • Find PDF documents of interest that contain certain patterns;
  • Extract information of interest (e.g., credit card numbers) from a set of files;
  • Extract Highlight information (refer to the Highlights class for details) from files for external use.

Note: Since hyphens ('-') are frequently used in PDF documents to concatenate the two broken pieces of a word at the end of a line, for example

"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow- erful."

a search for "powerful" should return both instances. However, not all end-of-line hyphens are hyphens added to connect a broken word; some of them could be "real" hyphens. In addition, an input search pattern may also contain hyphens that complicate the situation. To tackle this problem, the following conventions are adopted:

  1. When in the verbatim search mode and the pattern contains no hyphen, a matching string is returned if it is exactly the same or it contains end-of-line or start-of-line hyphens. For example, as mentioned above, a search for "powerful" would return both instances.
  2. When in verbatim search mode and the pattern contains one or multiple hyphens, a matching string is returned only if the string matches the pattern exactly. For example, a search for "pow-erful" will only return the second instance, and a search for "power-ful" will return nothing.
  3. When searching using regular expressions, hyphens are not handled implicitly; users must account for them in the pattern. For example, in order to find both "powerful" instances, the input pattern can be "pow-{0,1}erful".
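
The regular-expression convention in item 3 can be tried out in isolation with java.util.regex. This is only a sketch under two assumptions: the broken word reaches the matcher as "pow-erful" (hyphen kept, line break dropped), and Java's own regex engine behaves like the one TextSearch uses for this simple pattern. The class and method names are illustrative and are not part of the TextSearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: uses java.util.regex, not TextSearch itself.
class HyphenPatternSketch {

    // Returns every substring of text matched by the given regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumption: the word broken across lines is seen as "pow-erful".
        String text = "TextSearch is powerful for finding patterns in PDF files; "
                + "yes, it is really pow-erful.";
        // The optional hyphen lets one pattern cover both spellings.
        System.out.println(findAll("pow-{0,1}erful", text));
        // Without the optional hyphen only the unbroken word matches.
        System.out.println(findAll("powerful", text));
    }
}
```

The quantifier "-{0,1}" is equivalent to "-?"; either form makes the hyphen optional at exactly one position in the word.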

A sample use case is shown below (for a full sample, see the TextSearch sample project):

 // ... Initialize PDFNet ...
 PDFDoc doc = new PDFDoc(filein);
 doc.initSecurityHandler();
 int mode = TextSearch.e_whole_word | TextSearch.e_page_stop;
 String pattern = "joHn sMiTh";
 TextSearch txt_search = new TextSearch();

 // PDFDoc doesn't allow simultaneous access from different threads. If this
 // document could be used from other threads (e.g., the rendering thread inside
 // PDFView/PDFViewCtrl, if used), it is good practice to lock it.
 // Notice: don't forget to call doc.unlock() to avoid deadlock.
 doc.lock();

 txt_search.begin( doc, pattern, mode, -1, -1 );
 while ( true ) {
     TextSearchResult result = txt_search.run();
     if ( result.getCode() == TextSearchResult.e_found ) {
         System.out.println("found one instance: " + result.getResultStr());
     }
     else if ( result.getCode() == TextSearchResult.e_page ) {
         // A page was finished (e_page_stop is set); update the UI here if needed.
     }
     else {
         // The end of the document was reached.
         break;
     }
 }

 // Unlock the document to avoid deadlock.
 doc.unlock();
 

Summary

Constants
int e_ambient_string Tells the search process to compute the ambient string of the found pattern.
int e_case_sensitive Match case-sensitively.
int e_highlight Tells the search process to compute Highlight information.
int e_page_stop Tells the search process to return when each page is finished; this is useful when a user needs run() to return periodically so that certain things (e.g., UI) can be updated from time to time.
int e_raw_text_search Tells the search process to refrain from replacing newlines with spaces.
int e_reg_expression Use regular expressions.
int e_search_up Search upward (from the end of the file and from the bottom of a page).
int e_search_using_zorder Tells the search process to use Z-order as reading order for text.
int e_whole_word Match the entire word.
Public Constructors
TextSearch()
Constructor.
Public Methods
boolean begin(PDFDoc doc, String pattern, int mode, int start_page, int end_page)
Initialize for the search process.
void close()
Frees the native memory of the object.
void destroy()
Frees the native memory of the object.
int getCurrentPage()
Retrieve the number of the page currently being searched.
int getMode()
Retrieve the current search mode.
TextSearchResult run()
Search the document.
void setAmbientLettersAfter(int ambient_letters_after)
Sets the maximum number of ambient string letters after the search term (default: 70).
void setAmbientLettersBefore(int ambient_letters_before)
Sets the maximum number of ambient string letters before the search term (default: 30).
void setAmbientWordsAfter(int ambient_words_after)
Sets the maximum number of ambient string words after the search term (default: 10).
void setAmbientWordsBefore(int ambient_words_before)
Sets the maximum number of ambient string words before the search term (default: 1).
void setMode(int mode)
Set the current search mode.
void setOCGContext(Context ctx)
Set the Optional Content Group (OCG) context that should be used when rendering the page.
boolean setPattern(String pattern)
Set the current search pattern.
void setRightToLeftLanguage(boolean flag)
Tells TextSearch that the document reads from right to left.
Inherited Methods
From class java.lang.Object
From interface java.lang.AutoCloseable

Constants

public static final int e_ambient_string

Tells the search process to compute the ambient string of the found pattern. This is useful if a user wants to examine or display what surrounds the found pattern.

Constant Value: 64 (0x00000040)

public static final int e_case_sensitive

Match case-sensitively.

Constant Value: 2 (0x00000002)

public static final int e_highlight

Tells the search process to compute Highlight information.

Constant Value: 32 (0x00000020)

public static final int e_page_stop

Tells the search process to return when each page is finished; this is useful when a user needs run() to return periodically so that certain things (e.g., UI) can be updated from time to time.

Constant Value: 16 (0x00000010)

public static final int e_raw_text_search

Tells the search process to refrain from replacing newlines with spaces.

Constant Value: 128 (0x00000080)

public static final int e_reg_expression

Use regular expressions.

Constant Value: 1 (0x00000001)

public static final int e_search_up

Search upward (from the end of the file and from the bottom of a page).

Constant Value: 8 (0x00000008)

public static final int e_search_using_zorder

Tells the search process to use Z-order as reading order for text.

Constant Value: 256 (0x00000100)

public static final int e_whole_word

Match the entire word.

Constant Value: 4 (0x00000004)
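
Each constant above occupies its own bit, so modes are combined with bitwise OR and inspected with bitwise AND. The sketch below copies the documented values into local constants so that it compiles without the PDFNet library; the class name and the helper methods isSet, enable and disable are illustrative and are not part of the TextSearch API.

```java
// Illustrative only: the constants mirror the values documented above.
class SearchModeFlagsSketch {
    static final int E_REG_EXPRESSION = 1;
    static final int E_CASE_SENSITIVE = 2;
    static final int E_WHOLE_WORD = 4;
    static final int E_SEARCH_UP = 8;
    static final int E_PAGE_STOP = 16;
    static final int E_HIGHLIGHT = 32;
    static final int E_AMBIENT_STRING = 64;
    static final int E_RAW_TEXT_SEARCH = 128;
    static final int E_SEARCH_USING_ZORDER = 256;

    // True if the given flag bit is present in mode.
    static boolean isSet(int mode, int flag) {
        return (mode & flag) != 0;
    }

    // Returns mode with the flag turned on; other bits are untouched.
    static int enable(int mode, int flag) {
        return mode | flag;
    }

    // Returns mode with the flag turned off; other bits are untouched.
    static int disable(int mode, int flag) {
        return mode & ~flag;
    }

    public static void main(String[] args) {
        int mode = E_WHOLE_WORD | E_PAGE_STOP;   // 4 | 16 = 20
        mode = enable(mode, E_HIGHLIGHT);        // 20 | 32 = 52
        mode = disable(mode, E_PAGE_STOP);       // 52 & ~16 = 36
        System.out.println(mode + " " + isSet(mode, E_WHOLE_WORD)); // prints "36 true"
    }
}
```

With the real class, the same arithmetic would be applied to the value returned by getMode() before passing the result to setMode().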

Public Constructors

public TextSearch ()

Constructor. Create a new TextSearch object.

Public Methods

public boolean begin (PDFDoc doc, String pattern, int mode, int start_page, int end_page)

Initialize for the search process. This should be called before starting the actual search with method run().

Parameters
doc the PDF document to search in.
pattern the pattern to search for. When regular expression is used, it contains the expression, and in verbatim mode, it is the exact string to search for.
mode the mode of the search process.
start_page the start page of the page range to search in. -1 indicates the range starts from the first page.
end_page the end page of the page range to search in. -1 indicates the range ends at the last page.
Returns
  • true if the initialization has succeeded.
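
As a sketch of how the -1 convention for start_page and end_page can be read, the helper below maps the two arguments to a concrete page range. It assumes pages are numbered from 1, which is consistent with getCurrentPage() reporting positive page numbers but is not stated for begin() itself. The class, the method and the pageCount parameter are hypothetical and are not part of the TextSearch API.

```java
// Illustrative only: shows one reading of the -1 convention for page ranges.
class PageRangeSketch {

    // Returns {firstPage, lastPage} after replacing -1 with the document bounds.
    // Assumption: pages are numbered 1..pageCount.
    static int[] resolve(int startPage, int endPage, int pageCount) {
        int first = (startPage == -1) ? 1 : startPage;
        int last = (endPage == -1) ? pageCount : endPage;
        return new int[] { first, last };
    }

    public static void main(String[] args) {
        int[] whole = resolve(-1, -1, 10);
        System.out.println(whole[0] + ".." + whole[1]); // prints "1..10"
        int[] tail = resolve(3, -1, 10);
        System.out.println(tail[0] + ".." + tail[1]);   // prints "3..10"
    }
}
```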

public void close ()

Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.

public void destroy ()

Frees the native memory of the object. This can be explicitly called to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.
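
Because TextSearch implements AutoCloseable, one way to make sure close() is called is a try-with-resources statement. The sketch below demonstrates the pattern with a hypothetical stand-in class, NativeResource, so that it compiles without the PDFNet library; none of the names in it belong to the TextSearch API.

```java
// Illustrative only: NativeResource is a hypothetical stand-in for any
// AutoCloseable object that owns native memory, such as TextSearch.
class TryWithResourcesSketch {

    static class NativeResource implements AutoCloseable {
        private boolean closed = false;

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true; // a real implementation would free native memory here
        }
    }

    // Uses the resource, then returns it so the caller can inspect its state.
    static NativeResource useAndRelease() {
        NativeResource handle = new NativeResource();
        try (NativeResource r = handle) {
            // ... work with r here; close() runs even if this block throws ...
        }
        return handle;
    }

    public static void main(String[] args) {
        System.out.println(useAndRelease().isClosed()); // prints "true"
    }
}
```

With the real class the statement would take the form try (TextSearch ts = new TextSearch()) { ... }. If close() declares a checked exception, the enclosing method has to catch or declare it.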

public int getCurrentPage ()

Retrieve the number of the page currently being searched. A return value of -1 indicates that the search process has not been initialized (e.g., begin() has not been called yet); 0 indicates that the search process has finished; a positive value is a valid page number.

Returns
  • the current page number.
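
The three cases for the return value can be told apart as in the sketch below. The class and method names are illustrative, and the sketch takes the value as a plain int so that it runs without the PDFNet library.

```java
// Illustrative only: interprets a value as documented for getCurrentPage().
class CurrentPageSketch {

    // Maps the documented return values to a human-readable status.
    static String describe(int currentPage) {
        if (currentPage == -1) {
            return "not initialized";
        }
        if (currentPage == 0) {
            return "finished";
        }
        if (currentPage > 0) {
            return "searching page " + currentPage;
        }
        // Values below -1 are not documented.
        return "unexpected value " + currentPage;
    }

    public static void main(String[] args) {
        System.out.println(describe(-1)); // prints "not initialized"
        System.out.println(describe(0));  // prints "finished"
        System.out.println(describe(7));  // prints "searching page 7"
    }
}
```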

public int getMode ()

Retrieve the current search mode.

Returns
  • the current search mode.

public TextSearchResult run ()

Search the document. This method returns upon the following circumstances:

  1. Reached the end of the document;
  2. Reached the end of a page (if set to return by specifying mode 'e_page_stop');
  3. Found an instance matching the search pattern.
Note that this method should be called in a loop in order to find all matching instances; in other words, the search is conducted in an incremental fashion.

Returns
  • the text search result
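
The incremental calling pattern can be sketched without the PDFNet library by replaying a scripted sequence of outcomes. Everything in this sketch is hypothetical: the Code enum only mirrors the three circumstances listed above, Step stands in for a search result, and the iterator's next() plays the role of run().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a scripted stand-in for repeated calls to run().
class RunLoopSketch {

    // Hypothetical outcome codes, one per circumstance in which run() returns.
    enum Code { DONE, PAGE, FOUND }

    // Hypothetical result: an outcome code plus the matched text, if any.
    static final class Step {
        final Code code;
        final String text;

        Step(Code code, String text) {
            this.code = code;
            this.text = text;
        }
    }

    // Collects every match until the scripted search reports DONE.
    static List<String> collect(Iterator<Step> search) {
        List<String> matches = new ArrayList<>();
        while (search.hasNext()) {
            Step step = search.next(); // plays the role of run()
            if (step.code == Code.FOUND) {
                matches.add(step.text);
            } else if (step.code == Code.PAGE) {
                continue; // page boundary: a caller could refresh its UI here
            } else {
                break; // DONE: end of document
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Step> script = Arrays.asList(
                new Step(Code.FOUND, "John Smith"),
                new Step(Code.PAGE, null),
                new Step(Code.FOUND, "john smith"),
                new Step(Code.DONE, null));
        System.out.println(collect(script.iterator())); // prints "[John Smith, john smith]"
    }
}
```

The point of the loop is that a page-boundary return must not end the search; only the end-of-document outcome does.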

public void setAmbientLettersAfter (int ambient_letters_after)

Sets the maximum number of ambient string letters after the search term (default: 70). This should be called before starting the actual search with method run().

Parameters
ambient_letters_after -- maximum number of letters

public void setAmbientLettersBefore (int ambient_letters_before)

Sets the maximum number of ambient string letters before the search term (default: 30). This should be called before starting the actual search with method run().

Parameters
ambient_letters_before -- maximum number of letters

public void setAmbientWordsAfter (int ambient_words_after)

Sets the maximum number of ambient string words after the search term (default: 10). This should be called before starting the actual search with method run().

Parameters
ambient_words_after -- maximum number of words

public void setAmbientWordsBefore (int ambient_words_before)

Sets the maximum number of ambient string words before the search term (default: 1). This should be called before starting the actual search with method run().

Parameters
ambient_words_before -- maximum number of words
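
To give a feel for what a letter limit before and after the search term means, the sketch below clips a context window around a match in a plain string. It is an illustration only: it ignores the word limits entirely, and how TextSearch combines the letter and word limits is not described here. The class, method and parameter names are hypothetical.

```java
// Illustrative only: clips context around a match by letter counts.
// This is not how TextSearch is known to compute its ambient string.
class AmbientWindowSketch {

    // Returns the match plus at most lettersBefore characters of leading
    // context and at most lettersAfter characters of trailing context.
    // matchStart is inclusive and matchEnd is exclusive.
    static String ambient(String text, int matchStart, int matchEnd,
                          int lettersBefore, int lettersAfter) {
        int from = Math.max(0, matchStart - lettersBefore);
        int to = Math.min(text.length(), matchEnd + lettersAfter);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps";
        int start = text.indexOf("brown");  // 10
        int end = start + "brown".length(); // 15
        System.out.println(ambient(text, start, end, 6, 4)); // prints "quick brown fox"
    }
}
```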

public void setMode (int mode)

Set the current search mode. For example, the following code turns on regular expression mode:

 TextSearch ts = new TextSearch();
 int mode = ts.getMode();
 mode |= TextSearch.e_reg_expression;
 ts.setMode(mode);

Parameters
mode the search mode to set.

public void setOCGContext (Context ctx)

Set the Optional Content Group (OCG) context that should be used when rendering the page. This function can be used to selectively render optional content (such as PDF layers) based on the states of optional content groups in the given context.

Parameters
ctx Optional Content Group (OCG) context, or null if TextSearch should process all content on the page.

public boolean setPattern (String pattern)

Set the current search pattern. Note that it is not necessary to call this method since the search pattern is already set when calling the begin() method. This method is provided for users to change the search pattern while searching through a document.

Parameters
pattern the search pattern to set.
Returns
  • true if the setting has succeeded.

public void setRightToLeftLanguage (boolean flag)

Tells TextSearch that the document reads from right to left.