#include <TextSearch.h>
Public Types | |
enum | TextSearchModes { e_reg_expression = 0x0001, e_case_sensitive = e_reg_expression << 1, e_whole_word = e_case_sensitive << 1, e_search_up = e_whole_word << 1, e_page_stop = e_search_up << 1, e_highlight = e_page_stop << 1, e_ambient_string = e_highlight << 1, e_raw_text_search = e_ambient_string << 1, e_search_using_zorder = e_raw_text_search << 1 } |
typedef TRN_UInt32 | Mode |
Public Member Functions | |
TextSearch () | |
~TextSearch () | |
bool | Begin (PDFDoc &doc, const UString &pattern, Mode mode, int start_page=-1, int end_page=-1) |
SearchResult | Run () |
bool | SetPattern (const UString &pattern) |
Mode | GetMode () const |
void | SetMode (Mode mode) |
void | SetRightToLeftLanguage (bool flag) |
int | GetCurrentPage () const |
void | SetOCGContext (OCG::Context *context) |
void | Destroy () |
void | SetAmbientLettersBefore (int ambient_letters_before) |
void | SetAmbientLettersAfter (int ambient_letters_after) |
void | SetAmbientWordsBefore (int ambient_words_before) |
void | SetAmbientWordsAfter (int ambient_words_after) |
TextSearch searches through a PDF document for a user-given search pattern. The current implementation supports both verbatim search and search using regular expressions. The detailed regular expression syntax can be found at:
http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
TextSearch also provides several search modes and returns extra information in addition to the string that matches the pattern. TextSearch can either keep running until a matching string is found or be set to return periodically so that the caller can perform any necessary updates (e.g., UI updates). The search modes can also be changed on the fly while searching through a document.
Possible use case scenarios for TextSearch include:
Note:
Since hyphens ('-') are frequently used in PDF documents to concatenate the two broken pieces of a word at the end of a line, for example
"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow- erful."
a search for "powerful" should return both instances. However, not all end-of-line hyphens are hyphens added to connect a broken word; some of them could be "real" hyphens. In addition, an input search pattern may also contain hyphens that complicate the situation. To tackle this problem, the following conventions are adopted:
a) In verbatim search mode, when the pattern contains no hyphen, a matching string is returned if it is exactly the same as the pattern or if it additionally contains end-of-line or start-of-line hyphens. For example, as mentioned above, a search for "powerful" returns both instances.
b) In verbatim search mode, when the pattern contains one or more hyphens, a matching string is returned only if it matches the pattern exactly. For example, a search for "pow-erful" returns only the second instance, and a search for "power-ful" returns nothing.
c) When searching using regular expressions, hyphens are not handled implicitly; users must account for them in the pattern themselves. For example, to find both "powerful" instances, the input pattern can be "pow-{0,1}erful".
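Convention c) can be checked outside the library. The following is a minimal sketch, not TextSearch itself: it uses std::regex (ECMAScript grammar) as a stand-in for the Boost Perl-syntax engine, on the assumption that this particular pattern behaves the same in both. The helper name count_matches is hypothetical, and the sample text assumes the line break after "pow-" has already been removed.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

namespace sketch {

// Counts non-overlapping matches of `pattern` in `text`.
// std::regex is used here only as a stand-in for the regex engine
// that TextSearch uses internally.
inline int count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    const auto first = std::sregex_iterator(text.begin(), text.end(), re);
    const auto last = std::sregex_iterator();
    return static_cast<int>(std::distance(first, last));
}

}  // namespace sketch
```

With the sentence from the note as input, the pattern "pow-{0,1}erful" matches both "powerful" and "pow-erful", whereas the plain pattern "powerful" matches only the first instance, which is why the optional hyphen has to be written into the expression.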
A sample use case (in C++):
For a full sample, please take a look at the TextSearch sample project.
Definition at line 173 of file TextSearch.h.
typedef TRN_UInt32 pdftron::PDF::TextSearch::Mode |
Typedef the search mode.
Definition at line 186 of file TextSearch.h.
Search modes that control how searching is conducted.
Enumerator | |
---|---|
e_reg_expression | |
e_case_sensitive | |
e_whole_word | |
e_search_up | |
e_page_stop | |
e_highlight | |
e_ambient_string | |
e_raw_text_search | |
e_search_using_zorder |
Definition at line 191 of file TextSearch.h.
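Because each enumerator is the previous one shifted left by one bit, the modes are independent bit flags that can be combined in a single Mode value. The sketch below mirrors the enumerator values from the declaration above in a local namespace so that it compiles without the SDK; the namespace sketch, the helper names has_flag, set_flag and clear_flag, and the use of std::uint32_t in place of TRN_UInt32 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

using Mode = std::uint32_t;  // assumed equivalent of TRN_UInt32

// Local copy of the TextSearchModes values listed in the declaration.
enum TextSearchModes : Mode {
    e_reg_expression      = 0x0001,
    e_case_sensitive      = e_reg_expression << 1,
    e_whole_word          = e_case_sensitive << 1,
    e_search_up           = e_whole_word << 1,
    e_page_stop           = e_search_up << 1,
    e_highlight           = e_page_stop << 1,
    e_ambient_string      = e_highlight << 1,
    e_raw_text_search     = e_ambient_string << 1,
    e_search_using_zorder = e_raw_text_search << 1
};

// Returns true if `flag` is set in `mode`.
constexpr bool has_flag(Mode mode, TextSearchModes flag) {
    return (mode & flag) != 0;
}

// Returns `mode` with `flag` switched on.
constexpr Mode set_flag(Mode mode, TextSearchModes flag) {
    return mode | flag;
}

// Returns `mode` with `flag` switched off.
constexpr Mode clear_flag(Mode mode, TextSearchModes flag) {
    return mode & ~static_cast<Mode>(flag);
}

}  // namespace sketch
```

For instance, e_whole_word | e_page_stop yields a mode in which both flags are set and all others are clear; a flag can later be removed with clear_flag without disturbing the rest.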
pdftron::PDF::TextSearch::TextSearch | ( | ) |
Constructor and destructor.
pdftron::PDF::TextSearch::~TextSearch | ( | ) |
bool pdftron::PDF::TextSearch::Begin | ( | PDFDoc & doc, const UString & pattern, Mode mode, int start_page = -1, int end_page = -1 | ) |
Initializes the search process. This should be called before starting the actual search with method Run().
doc | the PDF document to search in. |
pattern | the pattern to search for. In regular expression mode, it is the expression; in verbatim mode, it is the exact string to search for. |
mode | the mode of the search process. |
start_page | the start page of the page range to search in. The default value is -1, indicating that the range starts from the first page. |
end_page | the end page of the page range to search in. The default value is -1, indicating that the range ends at the last page. |
void pdftron::PDF::TextSearch::Destroy | ( | ) |
Frees the native memory of the object.
int pdftron::PDF::TextSearch::GetCurrentPage | ( | ) | const |
Retrieves the number of the page currently being searched. A return value of -1 indicates that the search process has not been initialized (e.g., Begin() has not been called yet); a return value of 0 indicates that the search process has finished; a positive return value is a valid page number.
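The three cases of the return value can be made explicit in calling code. The following is a small sketch of how a caller might interpret it; the enum SearchState and the function classify_current_page are hypothetical names, not part of the TextSearch API. Only -1 is documented as the "not initialized" value, so treating other negative values the same way is an assumption.

```cpp
#include <cassert>

namespace sketch {

enum class SearchState { NotInitialized, Finished, OnPage };

// Maps a value returned by GetCurrentPage() to one of the three
// documented cases. Negative values other than -1 are not documented;
// this sketch treats them as NotInitialized.
inline SearchState classify_current_page(int current_page) {
    if (current_page > 0) {
        return SearchState::OnPage;  // valid page number
    }
    if (current_page == 0) {
        return SearchState::Finished;  // search process has finished
    }
    return SearchState::NotInitialized;  // e.g. Begin() not called yet
}

}  // namespace sketch
```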
Mode pdftron::PDF::TextSearch::GetMode | ( | ) | const |
Retrieve the current search mode.
SearchResult pdftron::PDF::TextSearch::Run | ( | ) |
Searches the document and returns when one of the following occurs:
a) the end of the document is reached;
b) the end of a page is reached (only if mode 'e_page_stop' is set);
c) an instance matching the search pattern is found.
Note that this method should be called in a loop in order to find all matching instances; in other words, the search is conducted in an incremental fashion.
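The incremental calling pattern can be illustrated without the SDK. The sketch below is a toy stand-in, not the TextSearch implementation: ToySearch, Status, Result and find_all_pages are hypothetical names, pages are plain strings, and matching is a simple substring search. It only mirrors the three return conditions listed above and the loop a caller would write around Run().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

enum class Status { Found, PageEnd, DocEnd };

struct Result {
    Status status;
    int page;            // 1-based page number; 0 when status is DocEnd
    std::size_t offset;  // match offset in the page text (Found only)
};

// Toy searcher over in-memory page strings (hypothetical, for illustration).
class ToySearch {
public:
    ToySearch(std::vector<std::string> pages, std::string pattern, bool page_stop)
        : pages_(std::move(pages)), pattern_(std::move(pattern)), page_stop_(page_stop) {}

    // Returns on the next match, at the end of a page (if page_stop is set),
    // or at the end of the document.
    Result Run() {
        while (page_ < pages_.size()) {
            // An empty pattern never matches in this sketch.
            const std::size_t hit = pattern_.empty()
                ? std::string::npos
                : pages_[page_].find(pattern_, offset_);
            if (hit != std::string::npos) {
                offset_ = hit + pattern_.size();  // resume after this match
                return {Status::Found, static_cast<int>(page_) + 1, hit};
            }
            const int finished_page = static_cast<int>(page_) + 1;
            ++page_;
            offset_ = 0;
            if (page_stop_) {
                return {Status::PageEnd, finished_page, 0};
            }
        }
        return {Status::DocEnd, 0, 0};
    }

private:
    std::vector<std::string> pages_;
    std::string pattern_;
    bool page_stop_;
    std::size_t page_ = 0;
    std::size_t offset_ = 0;
};

// Calls Run() in a loop and collects the page number of every match.
inline std::vector<int> find_all_pages(ToySearch& search) {
    std::vector<int> pages;
    for (;;) {
        const Result r = search.Run();
        if (r.status == Status::DocEnd) break;
        if (r.status == Status::Found) pages.push_back(r.page);
        // On PageEnd a caller could refresh a progress indicator here.
    }
    return pages;
}

}  // namespace sketch
```

The design point is that the searcher keeps its position between calls, so the caller decides after every return whether to continue, update the UI, or stop early.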
void pdftron::PDF::TextSearch::SetAmbientLettersAfter | ( | int | ambient_letters_after | ) |
Sets the maximum number of ambient string letters after the search term (default: 70). This should be called before starting the actual search with method Run().
ambient_letters_after | – maximum number of letters. |
void pdftron::PDF::TextSearch::SetAmbientLettersBefore | ( | int | ambient_letters_before | ) |
Sets the maximum number of ambient string letters before the search term (default: 30). This should be called before starting the actual search with method Run().
ambient_letters_before | – maximum number of letters. |
void pdftron::PDF::TextSearch::SetAmbientWordsAfter | ( | int | ambient_words_after | ) |
Sets the maximum number of ambient string words after the search term (default: 10). This should be called before starting the actual search with method Run().
ambient_words_after | – maximum number of words. |
void pdftron::PDF::TextSearch::SetAmbientWordsBefore | ( | int | ambient_words_before | ) |
Sets the maximum number of ambient string words before the search term (default: 1). This should be called before starting the actual search with method Run().
ambient_words_before | – maximum number of words. |
void pdftron::PDF::TextSearch::SetMode | ( | Mode | mode | ) |
Sets the current search mode. For example, the following code turns on regular expression search:
    TextSearch ts;
    ...
    TextSearch::Mode mode = ts.GetMode();
    mode |= TextSearch::e_reg_expression;
    ts.SetMode(mode);
    ...
mode | the search mode to set. |
void pdftron::PDF::TextSearch::SetOCGContext | ( | OCG::Context * | context | ) |
Sets the Optional Content Group (OCG) context that should be used when processing the document. This function can be used to change the current OCG context. Optional content (such as PDF layers) will be selectively processed based on the states of optional content groups in the given context.
context | Optional Content Group (OCG) context, or NULL if TextSearch should process all content on the page. |
bool pdftron::PDF::TextSearch::SetPattern | ( | const UString & | pattern | ) |
Set the current search pattern. Note that it is not necessary to call this method since the search pattern is already set when calling the Begin() method. This method is provided for users to change the search pattern while searching through a document.
pattern | the search pattern to set. |
void pdftron::PDF::TextSearch::SetRightToLeftLanguage | ( | bool | flag | ) |
Tells TextSearch that the language of the text is written from right to left.
flag | Set to true if the language is right to left. |