Class TextSearch
TextSearch searches through a PDF document for a user-given search pattern. The current implementation supports both verbatim search and the search using regular expressions, whose detailed syntax can be found at:
http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
TextSearch also provides users with several useful search modes and extra information besides the found string that matches the pattern. TextSearch can either keep running until a matched string is found or be set to return periodically in order for the caller to perform any necessary updates (e.g., UI updates). It is also worth mentioning that the search modes can be changed on the fly while searching through a document.
Possible use case scenarios for TextSearch include:
- Guide users of a PDF viewer (e.g., one implemented with PDFViewCtrl) to the places they are interested in;
- Find PDF documents of interest that contain certain patterns;
- Extract information of interest (e.g., credit card numbers) from a set of files;
- Extract Highlight information (refer to the Highlights class for details) from files for external use.
Note on hyphens: hyphens ('-') are frequently used in PDF documents to join the two pieces of a word that is broken at the end of a line. For example, in
"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow- erful."
a search for "powerful" should return both instances. However, not every end-of-line hyphen was added to connect a broken word; some of them could be "real" hyphens. In addition, the search pattern itself may contain hyphens, which complicates the situation. To handle this, the following conventions are adopted:
- In verbatim search mode, when the pattern contains no hyphen, a matching string is returned if it is exactly the same as the pattern or if it contains end-of-line or start-of-line hyphens. For example, as mentioned above, a search for "powerful" returns both instances.
- In verbatim search mode, when the pattern contains one or more hyphens, a matching string is returned only if it matches the pattern exactly. For example, a search for "pow-erful" returns only the second instance, and a search for "power-ful" returns nothing.
- In regular expression mode, hyphens are not handled implicitly; users must account for them in the pattern themselves. For example, to find both instances of "powerful", the input pattern can be "pow-{0,1}erful".
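The optional-hyphen pattern from the last convention can be tried outside of TextSearch. The following is a minimal sketch, not TextSearch itself: it uses .NET's System.Text.RegularExpressions rather than the Boost Perl syntax linked above, and it assumes the extracted text keeps the end-of-line hyphen with no whitespace after it ("pow-erful"). The names HyphenPatternDemo and CountMatches are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyphenPatternDemo
{
    // Counts the non-overlapping matches of a regular expression in a text.
    public static int CountMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        // Assumption: the line-break hyphen survives extraction as "pow-erful".
        string text = "TextSearch is powerful for finding patterns in PDF files; "
                    + "yes, it is really pow-erful.";

        // "-{0,1}" makes the hyphen optional, so both spellings match.
        Console.WriteLine(CountMatches(text, "pow-{0,1}erful"));

        // Without the optional hyphen, only the unbroken word matches.
        Console.WriteLine(CountMatches(text, "powerful"));
    }
}
```

The quantifier "-{0,1}" is equivalent to "-?" in this syntax; either form leaves the choice of where a hyphen may appear to the author of the pattern.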
//... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
doc.InitSecurityHandler();
int mode = (int)(TextSearch.SearchMode.e_whole_word | TextSearch.SearchMode.e_page_stop);
string pattern = "joHn sMiTh";
TextSearch txt_search = new TextSearch();
// Output arguments filled in by Run().
int page_num = 0;
string result_str = "", ambient_str = "";
Highlights hlts = new Highlights();
// PDFDoc doesn't allow simultaneous access from different threads. If this
// document could be used from other threads (e.g., the rendering thread inside
// PDFView/PDFViewCtrl, if used), it is good practice to lock it.
// Note: don't forget to call doc.Unlock() to avoid deadlock.
doc.Lock();
try
{
    txt_search.Begin(doc, pattern, mode, -1, -1);
    while (true)
    {
        TextSearch.ResultCode code = txt_search.Run(ref page_num, ref result_str, ref ambient_str, hlts);
        if (code == TextSearch.ResultCode.e_found)
        {
            Console.WriteLine("found one instance: " + result_str);
        }
        else if (code == TextSearch.ResultCode.e_done)
        {
            break; // reached the end of the document
        }
        // Otherwise the end of a page was reached (e_page_stop); keep searching.
    }
}
finally
{
    // Unlock the document to avoid deadlock.
    doc.Unlock();
}
Namespace: pdftron.PDF
Assembly: PDFNet.dll
Syntax
public class TextSearch : IDisposable
Constructors
TextSearch()
Constructs a TextSearch object.
Declaration
public TextSearch()
Methods
Begin(PDFDoc, string, int, int, int)
Initializes the search process. This should be called before starting the actual search with the Run() method.
Declaration
public bool Begin(PDFDoc doc, string pattern, int mode, int start_page, int end_page)
Parameters
Type | Name | Description |
---|---|---|
PDFDoc | doc | the PDF document to search in. |
string | pattern | the pattern to search for. In regular expression mode, this is the expression; in verbatim mode, it is the exact string to search for. |
int | mode | the mode of the search process. |
int | start_page | the start page of the page range to search in. -1 indicates the range starts from the first page. |
int | end_page | the end page of the page range to search in. -1 indicates the range ends at the last page. |
Returns
Type | Description |
---|---|
bool | true if the initialization has succeeded. |
Dispose()
Releases all resources used by the TextSearch.
Declaration
public override sealed void Dispose()
Dispose(bool)
Declaration
[HandleProcessCorruptedStateExceptions]
protected virtual void Dispose(bool A_0)
Parameters
Type | Name | Description |
---|---|---|
bool | A_0 |
~TextSearch()
Declaration
protected ~TextSearch()
GetCurrentPage()
Retrieves the number of the page currently being searched. A return value of -1 indicates that the search process has not been initialized (e.g., Begin() has not been called yet); 0 indicates that the search process has finished; a positive value is a valid page number.
Declaration
public int GetCurrentPage()
Returns
Type | Description |
---|---|
int | the current page number. |
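The three cases of the return value can be handled with a small helper, for example when reporting progress in a UI. This is a minimal sketch; the class CurrentPageDemo and the method Describe are hypothetical and not part of PDFNet. In real code the argument would be the value returned by GetCurrentPage().

```csharp
using System;

public static class CurrentPageDemo
{
    // Maps the documented return values of GetCurrentPage() to a description.
    public static string Describe(int currentPage)
    {
        if (currentPage == -1) return "not initialized";
        if (currentPage == 0) return "finished";
        if (currentPage > 0) return "searching page " + currentPage;
        // Values below -1 are not documented; report them rather than guess.
        return "unexpected value " + currentPage;
    }

    public static void Main()
    {
        // Stand-ins for values that GetCurrentPage() could return.
        foreach (int page in new[] { -1, 3, 0 })
        {
            Console.WriteLine(Describe(page));
        }
    }
}
```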
GetMode()
Retrieve the current search mode.
Declaration
public int GetMode()
Returns
Type | Description |
---|---|
int | the current search mode. |
Run(ref int, ref string, ref string, Highlights)
Searches the document and returns when one of the following occurs:
- The end of the document is reached
- The end of a page is reached (if set to return by specifying mode 'e_page_stop')
- An instance matching the search pattern is found
Declaration
public TextSearch.ResultCode Run(ref int page_num, ref string result_str, ref string ambient_str, Highlights hlts)
Parameters
Type | Name | Description |
---|---|---|
int | page_num | the number of the page the found instance is on. |
string | result_str | the found string that matches the search pattern. |
string | ambient_str | the ambient string of the found string (computed if 'e_ambient_string' is set). |
Highlights | hlts | the Highlights info associated with the found string (computed if 'e_highlight' is set). |
Returns
Type | Description |
---|---|
TextSearch.ResultCode | the code indicating the reason for the return. Note that the resulting information is meaningful only when the returned code is 'e_found'. |
SetAmbientLettersAfter(int)
Sets the maximum number of ambient string letters after the search term (default: 70). This should be called before starting the actual search with method Run().
Declaration
public void SetAmbientLettersAfter(int ambient_letters_after)
Parameters
Type | Name | Description |
---|---|---|
int | ambient_letters_after | maximum number of letters |
SetAmbientLettersBefore(int)
Sets the maximum number of ambient string letters before the search term (default: 30). This should be called before starting the actual search with method Run().
Declaration
public void SetAmbientLettersBefore(int ambient_letters_before)
Parameters
Type | Name | Description |
---|---|---|
int | ambient_letters_before | maximum number of letters |
SetAmbientWordsAfter(int)
Sets the maximum number of ambient string words after the search term (default: 10). This should be called before starting the actual search with method Run().
Declaration
public void SetAmbientWordsAfter(int ambient_words_after)
Parameters
Type | Name | Description |
---|---|---|
int | ambient_words_after | maximum number of words |
SetAmbientWordsBefore(int)
Sets the maximum number of ambient string words before the search term (default: 1). This should be called before starting the actual search with method Run().
Declaration
public void SetAmbientWordsBefore(int ambient_words_before)
Parameters
Type | Name | Description |
---|---|---|
int | ambient_words_before | maximum number of words |
SetMode(int)
Set the current search mode. For example, the following code turns on the regular expression:
TextSearch ts = new TextSearch();
...
int mode = ts.GetMode();
mode |= (int)TextSearch.SearchMode.e_reg_expression;
ts.SetMode(mode);
...
Declaration
public void SetMode(int mode)
Parameters
Type | Name | Description |
---|---|---|
int | mode | the search mode to set. |
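Because the mode is an integer bit mask, individual options can be switched on, switched off, or tested with bitwise operators before the value is passed to SetMode(). The sketch below shows the arithmetic only. The class SearchModeFlagsDemo and its constants are hypothetical, and the bit values are placeholders chosen for illustration; real code should use the constants defined by TextSearch.

```csharp
using System;

public static class SearchModeFlagsDemo
{
    // Placeholder bit values for illustration only (an assumption);
    // use the mode constants defined by TextSearch in real code.
    public const int RegExpression = 1 << 0;
    public const int WholeWord = 1 << 1;
    public const int PageStop = 1 << 2;

    // Returns the mode with the given flag switched on.
    public static int Enable(int mode, int flag) { return mode | flag; }

    // Returns the mode with the given flag switched off.
    public static int Disable(int mode, int flag) { return mode & ~flag; }

    // Reports whether the given flag is set in the mode.
    public static bool IsSet(int mode, int flag) { return (mode & flag) == flag; }

    public static void Main()
    {
        int mode = WholeWord | PageStop;     // stand-in for the value from GetMode()
        mode = Enable(mode, RegExpression);  // turn regular expressions on
        mode = Disable(mode, PageStop);      // stop returning at the end of each page
        Console.WriteLine(IsSet(mode, RegExpression));
        Console.WriteLine(IsSet(mode, PageStop));
    }
}
```

Reading the current mode first and modifying only the bits of interest leaves the other options unchanged, which matters when the mode is changed while a search is in progress.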
SetOCGContext(Context)
Declaration
public void SetOCGContext(Context ctx)
Parameters
Type | Name | Description |
---|---|---|
Context | ctx |
SetPattern(string)
Sets the current search pattern. Calling this method is not required, since the search pattern is already set by the Begin() method. It is provided so that users can change the search pattern while searching through a document.
Declaration
public bool SetPattern(string pattern)
Parameters
Type | Name | Description |
---|---|---|
string | pattern | the search pattern to set. |
Returns
Type | Description |
---|---|
bool | true if the setting has succeeded. |
SetRightToLeftLanguage(bool)
Tells TextSearch that the language reads from right to left.
Declaration
public void SetRightToLeftLanguage(bool flag)
Parameters
Type | Name | Description |
---|---|---|
bool | flag | True if the language is right to left. |