java.lang.Object | |
↳ | com.pdftron.pdf.TextSearch |
TextSearch searches through a PDF document for a user-given search pattern. The current implementation supports both verbatim search and search using regular expressions, whose detailed syntax can be found at:

TextSearch also provides several useful search modes, as well as extra information besides the found string that matches the pattern. TextSearch can either keep running until a matched string is found or be set to return periodically so that the caller can perform any necessary updates (e.g., UI updates). The search modes can also be changed on the fly while searching through a document.
Possible use case scenarios for TextSearch include:
Highlights
class for details)
from files for external use.
Note: Hyphens ('-') are frequently used in PDF documents to join the two pieces of a word that is broken at the end of a line. For example, given the text
"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow- erful."
a search for "powerful" should return both instances. However, not every end-of-line hyphen was added to connect a broken word; some of them could be "real" hyphens. In addition, an input search pattern may itself contain hyphens, which complicates the situation. To tackle this problem, the following conventions are adopted:
A sample use case is coded below (for a full sample, please take a look at the TextSearch sample project):
    // ... Initialize PDFNet ...
    PDFDoc doc = new PDFDoc(filein);
    doc.initSecurityHandler();

    int mode = TextSearch.e_whole_word | TextSearch.e_page_stop;
    String pattern = "joHn sMiTh";
    TextSearch txt_search = new TextSearch();

    // PDFDoc doesn't allow simultaneous access from different threads. If this
    // document could be used from other threads (e.g., the rendering thread inside
    // PDFView/PDFViewCtrl, if used), it is good practice to lock it.
    // Notice: don't forget to call doc.unlock() to avoid deadlock.
    doc.lock();
    try {
        txt_search.begin(doc, pattern, mode, -1, -1);
        while (true) {
            TextSearchResult result = txt_search.run();
            if (result.getCode() == TextSearchResult.e_found) {
                System.out.println("found one instance: " + result.getResultStr());
            } else if (result.getCode() == TextSearchResult.e_page) {
                // e_page_stop is set, so run() also returns at the end of each page;
                // update the UI here if needed, then keep searching.
            } else {
                break;
            }
        }
    } finally {
        // Unlock the document to avoid deadlock.
        doc.unlock();
    }
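
When e_page_stop is set, run() returns periodically, so the calling loop has to tell apart "a match was found", "a page was finished", and "the search is over". The following library-free sketch shows only that control flow. The `Code` enum and the `Iterator<Code>` that stands in for repeated calls to run() are hypothetical names introduced for illustration; they are not part of the PDFNet API.

```java
import java.util.Arrays;
import java.util.Iterator;

final class SearchLoopSketch {
    // Hypothetical stand-in for the outcomes a single search step can report.
    enum Code { FOUND, PAGE_END, DONE }

    // Consumes search steps until DONE (or until the steps run out) and
    // returns { number of matches, number of finished pages }.
    static int[] drive(Iterator<Code> steps) {
        int matches = 0;
        int pages = 0;
        while (steps.hasNext()) {
            Code code = steps.next();
            if (code == Code.FOUND) {
                matches++;          // handle the match
            } else if (code == Code.PAGE_END) {
                pages++;            // a natural place to refresh the UI
            } else {
                break;              // DONE: stop searching
            }
        }
        return new int[] { matches, pages };
    }

    public static void main(String[] args) {
        Iterator<Code> steps = Arrays.asList(
                Code.FOUND, Code.PAGE_END, Code.PAGE_END, Code.FOUND, Code.DONE).iterator();
        int[] r = drive(steps);
        System.out.println("matches=" + r[0] + ", pages finished=" + r[1]);
        // prints "matches=2, pages finished=2"
    }
}
```

Treating the page-finished outcome as "continue" rather than "stop" is what lets the loop cover the whole page range.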
Constants

Type | Constant | Description |
---|---|---|
int | e_ambient_string | Tells the search process to compute the ambient string of the found pattern. |
int | e_case_sensitive | Match case-sensitively. |
int | e_highlight | Tells the search process to compute Highlight information. |
int | e_page_stop | Tells the search process to return when each page is finished; this is useful when a user needs run() to return periodically so that certain things (e.g., UI) can be updated from time to time. |
int | e_raw_text_search | Tells the search process to refrain from replacing newlines with spaces. |
int | e_reg_expression | Use regular expressions. |
int | e_search_up | Search upward (from the end of the file and from the bottom of a page). |
int | e_search_using_zorder | Tells the search process to use Z-order as reading order for text. |
int | e_whole_word | Match the entire word. |
Public Constructors

Constructor | Description |
---|---|
TextSearch() | Constructor. |
Public Methods

Return type | Method | Description |
---|---|---|
boolean | begin(PDFDoc doc, String pattern, int mode, int start_page, int end_page) | Initializes the search process. |
void | close() | Frees the native memory of the object. |
void | destroy() | Frees the native memory of the object. |
int | getCurrentPage() | Retrieves the number of the page currently being searched. |
int | getMode() | Retrieves the current search mode. |
TextSearchResult | run() | Searches the document. |
void | setAmbientLettersAfter(int ambient_letters_after) | Sets the maximum number of ambient string letters after the search term (default: 70). |
void | setAmbientLettersBefore(int ambient_letters_before) | Sets the maximum number of ambient string letters before the search term (default: 30). |
void | setAmbientWordsAfter(int ambient_words_after) | Sets the maximum number of ambient string words after the search term (default: 10). |
void | setAmbientWordsBefore(int ambient_words_before) | Sets the maximum number of ambient string words before the search term (default: 1). |
void | setMode(int mode) | Sets the current search mode. |
void | setOCGContext(Context ctx) | Sets the Optional Content Group (OCG) context that should be used when rendering the page. |
boolean | setPattern(String pattern) | Sets the current search pattern. |
void | setRightToLeftLanguage(boolean flag) | Tells TextSearch that the document reads from right to left. |
Inherited Methods

From class java.lang.Object
From interface java.lang.AutoCloseable
e_ambient_string
Tells the search process to compute the ambient string of the found pattern. This is useful if a user wants to examine or display what surrounds the found pattern.

e_case_sensitive
Match case-sensitively.

e_highlight
Tells the search process to compute Highlight information.

e_page_stop
Tells the search process to return when each page is finished; this is useful when a user needs run() to return periodically so that certain things (e.g., UI) can be updated from time to time.

e_raw_text_search
Tells the search process to refrain from replacing newlines with spaces.

e_reg_expression
Use regular expressions.

e_search_up
Search upward (from the end of the file and from the bottom of a page).

e_search_using_zorder
Tells the search process to use Z-order as reading order for text.

e_whole_word
Match the entire word.
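
To make the effect of whole-word and case-sensitive matching concrete, the sketch below emulates both options with java.util.regex on a plain string. It is an approximation for illustration only: it assumes that matching is case-insensitive unless case sensitivity is requested, it says nothing about how TextSearch implements these modes internally, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordSketch {
    // Counts occurrences of a literal term in text, optionally requiring
    // word boundaries around it and optionally matching case-sensitively.
    static int count(String text, String term, boolean wholeWord, boolean caseSensitive) {
        String regex = Pattern.quote(term);          // treat the term verbatim
        if (wholeWord) {
            regex = "\\b" + regex + "\\b";           // require word boundaries
        }
        int flags = caseSensitive ? 0 : (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "John Smith met john smithson and JOHN SMITH.";
        System.out.println(count(text, "joHn sMiTh", true, false));   // prints 2
        System.out.println(count(text, "joHn sMiTh", false, false));  // prints 3
        System.out.println(count(text, "joHn sMiTh", true, true));    // prints 0
    }
}
```

With whole-word matching on, "john smithson" is skipped because the term ends in the middle of a word.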
TextSearch()
Constructor. Creates a new TextSearch object.
begin(PDFDoc doc, String pattern, int mode, int start_page, int end_page)
Initializes the search process. This should be called before starting the actual search with run().

Parameter | Description |
---|---|
doc | the PDF document to search in. |
pattern | the pattern to search for. In regular-expression mode it contains the expression; in verbatim mode it is the exact string to search for. |
mode | the mode of the search process. |
start_page | the start page of the page range to search in. -1 indicates the range starts from the first page. |
end_page | the end page of the page range to search in. -1 indicates the range ends at the last page. |

Returns true if the initialization has succeeded.
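
The -1 convention for start_page and end_page can be pictured with a small helper. This is only a sketch of the documented meaning, assuming 1-based page numbers; `PageRangeSketch`, `resolve`, and the `pageCount` argument are hypothetical and do not reflect how the library resolves the range internally.

```java
import java.util.Arrays;

final class PageRangeSketch {
    // Maps (start_page, end_page) to concrete 1-based bounds:
    // -1 as start means the first page, -1 as end means the last page.
    static int[] resolve(int startPage, int endPage, int pageCount) {
        int first = (startPage == -1) ? 1 : startPage;
        int last = (endPage == -1) ? pageCount : endPage;
        return new int[] { first, last };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(-1, -1, 12))); // prints [1, 12]
        System.out.println(Arrays.toString(resolve(3, -1, 12)));  // prints [3, 12]
    }
}
```

The sketch does not validate out-of-range values, since the behavior for those is not described here.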
close()
Frees the native memory of the object. This can be called explicitly to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.

Throws: PDFNetException

destroy()
Frees the native memory of the object. This can be called explicitly to control the deallocation of native memory and avoid situations where the garbage collector does not free the object in a timely manner.
getCurrentPage()
Retrieves the number of the page currently being searched. A return value of -1 indicates that the search process has not been initialized (e.g., begin() has not been called yet); 0 indicates that the search process has finished; a positive value is a valid page number.
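
The three kinds of return value can be handled with a simple branch. The helper below is a hypothetical illustration that takes the integer as an argument, so it runs without the library; in real code the value would come from getCurrentPage().

```java
final class CurrentPageSketch {
    // Translates a getCurrentPage()-style value into a status message.
    static String describe(int currentPage) {
        if (currentPage == -1) {
            return "not initialized";
        }
        if (currentPage == 0) {
            return "finished";
        }
        if (currentPage > 0) {
            return "searching page " + currentPage;
        }
        // Values below -1 are not described in the documentation.
        return "unexpected value " + currentPage;
    }

    public static void main(String[] args) {
        System.out.println(describe(-1)); // prints "not initialized"
        System.out.println(describe(0));  // prints "finished"
        System.out.println(describe(7));  // prints "searching page 7"
    }
}
```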
getMode()
Retrieves the current search mode.

run()
Searches the document. This method returns under the following circumstances:
setAmbientLettersAfter(int ambient_letters_after)
Sets the maximum number of ambient string letters after the search term (default: 70). This should be called before starting the actual search with run().

Parameter | Description |
---|---|
ambient_letters_after | maximum number of letters |

Throws: PDFNetException

setAmbientLettersBefore(int ambient_letters_before)
Sets the maximum number of ambient string letters before the search term (default: 30). This should be called before starting the actual search with run().

Parameter | Description |
---|---|
ambient_letters_before | maximum number of letters |

Throws: PDFNetException

setAmbientWordsAfter(int ambient_words_after)
Sets the maximum number of ambient string words after the search term (default: 10). This should be called before starting the actual search with run().

Parameter | Description |
---|---|
ambient_words_after | maximum number of words |

Throws: PDFNetException

setAmbientWordsBefore(int ambient_words_before)
Sets the maximum number of ambient string words before the search term (default: 1). This should be called before starting the actual search with run().

Parameter | Description |
---|---|
ambient_words_before | maximum number of words |

Throws: PDFNetException
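
As a rough picture of what a letter limit before and after the search term means, the sketch below cuts a window around a match in a plain string. It applies only the two letter limits; how TextSearch combines the word limits with the letter limits is not described here, so this should not be read as the library's algorithm. All names in the sketch are hypothetical.

```java
final class AmbientSketch {
    // Returns the match plus at most lettersBefore characters before it
    // and at most lettersAfter characters after it, clamped to the text.
    static String ambient(String text, int matchStart, int matchEnd,
                          int lettersBefore, int lettersAfter) {
        int from = Math.max(0, matchStart - lettersBefore);
        int to = Math.min(text.length(), matchEnd + lettersAfter);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "TextSearch is powerful for finding patterns in PDF files";
        String term = "powerful";
        int start = text.indexOf(term);
        int end = start + term.length();
        System.out.println(ambient(text, start, end, 3, 4));   // prints "is powerful for"
        System.out.println(ambient(text, start, end, 30, 70)); // limits exceed the text, prints the whole string
    }
}
```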
setMode(int mode)
Sets the current search mode. For example, the following code turns on regular-expression matching:

    TextSearch ts = new TextSearch();
    int mode = ts.getMode();
    mode |= TextSearch.e_reg_expression;
    ts.setMode(mode);

Parameter | Description |
---|---|
mode | the search mode to set. |
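
Because search modes are combined with bitwise OR, the usual bit-flag idioms apply for turning a mode off again or checking whether it is set. The sketch below uses placeholder constants so that it runs without the library; their values are assumptions made for illustration and are not the actual values of the TextSearch.e_* constants, which real code should use instead.

```java
final class ModeFlagsSketch {
    // Placeholder bit values for illustration only (not the library's values).
    static final int WHOLE_WORD = 1 << 0;
    static final int REG_EXPRESSION = 1 << 1;
    static final int PAGE_STOP = 1 << 2;

    static int enable(int mode, int flag) {
        return mode | flag;        // set the flag's bit
    }

    static int disable(int mode, int flag) {
        return mode & ~flag;       // clear the flag's bit, keep the rest
    }

    static boolean isSet(int mode, int flag) {
        return (mode & flag) != 0;
    }

    public static void main(String[] args) {
        int mode = enable(WHOLE_WORD, PAGE_STOP);
        mode = enable(mode, REG_EXPRESSION);
        mode = disable(mode, WHOLE_WORD);
        System.out.println(isSet(mode, REG_EXPRESSION)); // prints true
        System.out.println(isSet(mode, WHOLE_WORD));     // prints false
        System.out.println(isSet(mode, PAGE_STOP));      // prints true
    }
}
```

Reading the current mode first and modifying only the bit of interest leaves all other modes as they were.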
setOCGContext(Context ctx)
Sets the Optional Content Group (OCG) context that should be used when rendering the page. This function can be used to selectively render optional content (such as PDF layers) based on the states of optional content groups in the given context.

Parameter | Description |
---|---|
ctx | Optional Content Group (OCG) context, or null if TextSearch should process all content on the page. |
setPattern(String pattern)
Sets the current search pattern. Calling this method is not required, since the search pattern is already set by begin(). It is provided so that users can change the search pattern while searching through a document.

Parameter | Description |
---|---|
pattern | the search pattern to set. |

Returns true if the setting has succeeded.
setRightToLeftLanguage(boolean flag)
Tells TextSearch that the document reads from right to left.