new TextSearch()
TextSearch searches a PDF document for a user-supplied search pattern.
The current implementation supports both verbatim search and search
using regular expressions, whose detailed syntax is described at:
http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
TextSearch also provides several search modes and can return extra
information besides the string that matches the pattern. A search
can either keep running until a match is found or be set to return
periodically so that the caller can perform any necessary updates
(e.g., UI updates). The search modes can also be changed on the fly
while searching through a document.
Possible use cases for TextSearch include:
- Guiding users of a PDF viewer (e.g., one implemented with PDFViewCtrl) to the places
they are interested in;
- Finding PDF documents of interest that contain certain patterns;
- Extracting information of interest (e.g., credit card numbers) from a set of files;
- Extracting Highlight information (refer to the Highlights class for details) from
files for external use.
Note:
- Since hyphens ('-') are frequently used in PDF documents to concatenate the two
broken pieces of a word at the end of a line, for example
"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow-
erful."
a search for "powerful" should return both instances. However, not all end-of-line
hyphens are hyphens added to connect a broken word; some of them could be "real"
hyphens. In addition, an input search pattern may also contain hyphens that complicate
the situation. To tackle this problem, the following conventions are adopted:
a) In verbatim search mode, when the pattern contains no hyphen, a string
matches if it is exactly the same as the pattern or differs from it only by end-of-line
or start-of-line hyphens. For example, as mentioned above, a search for "powerful"
returns both instances.
b) In verbatim search mode, when the pattern contains one or more hyphens, a
string matches only if it matches the pattern exactly. For
example, a search for "pow-erful" returns only the second instance, and a search
for "power-ful" returns nothing.
c) When searching with regular expressions, hyphens are not handled implicitly;
users must account for them in the pattern. For example, to find both
"powerful" instances, the input pattern can be "pow-{0,1}erful".
A sample use case (in C++):
    //... Initialize PDFNet ...
    PDFDoc doc(filein);
    doc.InitSecurityHandler();
    int page_num;
    char buf[32];
    UString result_str, ambient_string;
    Highlights hlts;
    TextSearch txt_search;
    TextSearch::Mode mode = TextSearch::e_whole_word | TextSearch::e_page_stop;
    UString pattern( "joHn sMiTh" );

    // PDFDoc doesn't allow simultaneous access from different threads. If this
    // document could be used from other threads (e.g., the rendering thread inside
    // PDFView/PDFViewCtrl, if used), it is good practice to lock it.
    // Notice: don't forget to call doc.Unlock() to avoid deadlock.
    doc.Lock();
    txt_search.Begin( doc, pattern, mode );
    while ( true )
    {
        TextSearch::ResultCode code = txt_search.Run( page_num, result_str, ambient_string, hlts );
        if ( code == TextSearch::e_found )
        {
            result_str.ConvertToAscii( buf, 32, true );
            cout << "found one instance: " << buf << endl;
        }
        else if ( code == TextSearch::e_page )
        {
            // End of a page reached (returned because e_page_stop is set);
            // update the UI here if needed, then keep searching.
        }
        else
        {
            break; // e_done: the whole document has been searched
        }
    }
    // Unlock the document to avoid deadlock.
    doc.Unlock();
For a full sample, please take a look at the TextSearch sample project.
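Convention (c) of the hyphen note above can be tried out on plain strings. The sketch below uses JavaScript's built-in RegExp as a stand-in for the Boost Perl syntax that TextSearch uses (an assumption; the two engines agree on this particular pattern), and the candidate strings are illustrative only.

```javascript
// Convention (c): an optional hyphen must be spelled out in the pattern itself.
// "-{0,1}" (equivalently "-?") matches zero or one hyphen at the break point.
const pattern = /pow-{0,1}erful/;

// Plain strings standing in for text found in a document (illustrative only).
const candidates = ["powerful", "pow-erful", "power-ful"];

// Keep only the candidates that the pattern matches.
const matched = candidates.filter((s) => pattern.test(s));
console.log(matched); // [ 'powerful', 'pow-erful' ]
```

The hyphen in "power-ful" sits at a different position, so the pattern does not match it, which mirrors the behaviour described for "power-ful" in convention (b).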
Extends
Members
-
<static> Mode
-
Properties:
Name              Type    Description
e_reg_expression  number
e_case_sensitive  number
e_whole_word      number
e_search_up       number
e_page_stop       number
e_highlight       number
e_ambient_string  number
-
<static> ResultCode
-
Properties:
Name     Type    Description
e_done   number
e_page   number
e_found  number
Methods
-
<static> create()
-
Creates a new TextSearch object.
Returns:
A promise that resolves to an object of type: "PDFNet.TextSearch"
- Type
- Promise.<PDFNet.TextSearch>
-
begin(doc, pattern, mode [, start_page] [, end_page])
-
Parameters:
Name        Type                                            Argument    Description
doc         PDFNet.PDFDoc | PDFNet.SDFDoc | PDFNet.FDFDoc
pattern     string
mode        number
start_page  number                                          <optional>
end_page    number                                          <optional>
Returns:
A promise that resolves to an object of type: "boolean"
- Type
- Promise.<boolean>
-
destroy()
-
Destructor
- Inherited From:
Returns:
- Type
- Promise.<void>
-
getCurrentPage()
-
Retrieves the number of the page currently being searched. A returned value of -1 indicates that the search process has not been initialized (e.g., begin() has not been called yet); 0 indicates that the search process has finished; a positive value is a valid page number.
Returns:
A promise that resolves to the current page number.
- Type
- Promise.<number>
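The three cases for the returned page number can be told apart with a small helper. This is a minimal sketch; `describeCurrentPage` and the strings it returns are hypothetical names chosen for illustration and are not part of the API.

```javascript
// Interpret the number that getCurrentPage() resolves to, following the
// rules in the description: -1 = not initialized, 0 = finished, >0 = page number.
function describeCurrentPage(pageNum) {
  if (pageNum === -1) return "not started";
  if (pageNum === 0) return "finished";
  if (Number.isInteger(pageNum) && pageNum > 0) return `searching page ${pageNum}`;
  // Anything else is outside the documented range.
  throw new RangeError(`unexpected page number: ${pageNum}`);
}

console.log(describeCurrentPage(-1)); // not started
console.log(describeCurrentPage(3));  // searching page 3
```

With a real TextSearch instance, the argument would be the resolved value of getCurrentPage().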
-
getMode()
-
Retrieve the current search mode.
Returns:
A promise that resolves to the current search mode.
- Type
- Promise.<number>
-
run()
-
Runs a search on the document for a certain string. Make sure to call TextSearch.begin(doc, pattern, mode) with the proper parameters before calling TextSearch.run(). The resolved object that TextSearch.run() returns contains the following properties:
- page_num - The number of the page with the match
- out_str - The string that matches the search pattern
- ambient_str - The ambient string of the found string (computed only if 'e_ambient_string' is set)
- highlights - The Highlights info associated with the match (computed only if 'e_highlight' is set)
- code - Number representing the status of the search:
  - 0 - e_done, reached end of document.
  - 1 - e_page, reached end of page (returned only if mode 'e_page_stop' is set).
  - 2 - e_found, found an instance matching the search pattern.
Returns:
A promise that resolves to an object containing the page_num, out_str, ambient_str, highlights, and result code.
- Type
- Promise.<any>
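A typical caller keeps calling run() until the result code signals the end of the document. The following is a minimal sketch of such a loop, relying only on the `code`, `page_num` and `out_str` fields described above. `collectMatches`, `onPage` and `fakeSearch` are hypothetical helpers; `fakeSearch` is a stand-in object so that the sketch runs without PDFNet.

```javascript
// Result codes as listed in the run() description.
const ResultCode = { e_done: 0, e_page: 1, e_found: 2 };

// Drive run() until the end of the document, collecting every match.
// `txtSearch` is any object whose run() resolves to { code, page_num, out_str, ... };
// `onPage` is a hypothetical callback invoked at each page boundary.
async function collectMatches(txtSearch, onPage = () => {}) {
  const matches = [];
  for (;;) {
    const result = await txtSearch.run();
    if (result.code === ResultCode.e_found) {
      matches.push({ page: result.page_num, text: result.out_str });
    } else if (result.code === ResultCode.e_page) {
      onPage(); // only reached when e_page_stop is set; e.g. update the UI
    } else {
      break; // e_done: the whole document has been searched
    }
  }
  return matches;
}

// Stand-in for a real TextSearch: replays a fixed list of results, then e_done.
function fakeSearch(results) {
  let i = 0;
  return {
    run: async () => (i < results.length ? results[i++] : { code: ResultCode.e_done }),
  };
}
```

With a real instance, the object passed in would be one on which begin(doc, pattern, mode) has already been called.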
-
setMode(mode)
-
Sets the current search mode. For example, the following C++ code turns on regular expressions:
    TextSearch ts;
    ...
    TextSearch::Mode mode = ts.GetMode();
    mode |= TextSearch::e_reg_expression;
    ts.SetMode(mode);
    ...
Parameters:
Name  Type    Description
mode  number  The search mode to set.
Returns:
- Type
- Promise.<void>
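Because the mode is a number combined from flags with bitwise operators, individual flags can be switched on or off without disturbing the others. The sketch below illustrates this. The numeric values in `Mode` are placeholders assumed for illustration (distinct powers of two), not values taken from this page; real code would read them from the Mode member instead. `withFlag`, `withoutFlag` and `hasFlag` are hypothetical helpers.

```javascript
// Placeholder flag values: distinct powers of two, assumed for illustration.
// Real code would use the values exposed by the TextSearch Mode member.
const Mode = {
  e_reg_expression: 1 << 0,
  e_case_sensitive: 1 << 1,
  e_whole_word: 1 << 2,
  e_page_stop: 1 << 4,
};

const withFlag = (mode, flag) => mode | flag;     // turn a flag on
const withoutFlag = (mode, flag) => mode & ~flag; // turn a flag off
const hasFlag = (mode, flag) => (mode & flag) !== 0;

let mode = Mode.e_whole_word | Mode.e_page_stop;
mode = withFlag(mode, Mode.e_reg_expression); // turn regular expressions on
mode = withoutFlag(mode, Mode.e_whole_word);  // turn whole-word matching off

console.log(hasFlag(mode, Mode.e_reg_expression)); // true
console.log(hasFlag(mode, Mode.e_whole_word));     // false
```

With a real instance, the starting value would come from getMode() and the updated value would be passed to setMode().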
-
setOCGContext(ctx)
-
Sets the Optional Content Group (OCG) context that should be used when processing the document. This function can be used to change the current OCG context. Optional content (such as PDF layers) will be selectively processed based on the states of optional content groups in the given context.
Parameters:
Name  Type               Description
ctx   PDFNet.OCGContext  Optional Content Group (OCG) context, or NULL if TextSearch should process all content on the page.
Returns:
- Type
- Promise.<void>
-
setPattern(pattern)
-
Sets the current search pattern. Calling this method is not required, since the search pattern is already set by the begin() method. It is provided so that the search pattern can be changed while searching through a document.
Parameters:
Name     Type    Description
pattern  string  The search pattern to set.
Returns:
A promise that resolves to true if the setting has succeeded.
- Type
- Promise.<boolean>
-
setRightToLeftLanguage(flag)
-
Tells TextSearch that the text being searched is in a right-to-left language.
Parameters:
Name  Type     Description
flag  boolean  Set to true if the language is right to left.
Returns:
- Type
- Promise.<void>
-
takeOwnership()
-
Take the ownership of this object, so that PDFNet.runWithCleanup won't destroy this object.
- Inherited From:
Returns:
- Type
- void