By default, PDF2Text writes the extracted text to the console window. To save the result to a folder instead, use the -o (or --output) parameter. For example:
pdf2text -o "..\..\My Output" 1.pdf
Note: If the specified path does not exist, PDF2Text will attempt to create the necessary folders.
By default, PDF2Text creates a separate text file for every page in the document. The output filename is constructed from the name of the input PDF file, a page counter, and the appropriate file extension. For example, the following command line generates a sequence of text files in "MyFolder": mydoc_1.txt, mydoc_2.txt, etc.:
pdf2text -o MyFolder mydoc.pdf
PDF2Text allows output filenames to be customized using the '--prefix' and '--digits' options. For example, the following command line generates a sequence of text files in "MyFolder": newname_0001.txt, newname_0002.txt, etc.:
pdf2text -o MyFolder --prefix newname --digits 4 mydoc.pdf
The '--digits' parameter specifies the number of digits used in the page counter portion of the output filename. By default, new digits are added as needed; however, this parameter can be used to format the page counter field to a uniform width (e.g. myfile_0001.txt, myfile_0010.txt, instead of myfile_1.txt, myfile_10.txt, etc).
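The naming rule described above can be sketched in Python. This is only an illustration of the documented pattern (name or prefix, underscore, page counter, extension); the function name output_name and its parameters are hypothetical and are not part of PDF2Text.

```python
from pathlib import Path


def output_name(input_pdf, page, prefix=None, digits=0, extension=".txt"):
    """Build a per-page output filename following the documented pattern.

    Assumption: the separator is an underscore and the default extension
    is .txt, as in the mydoc_1.txt example.
    """
    # Use the prefix when given, otherwise the input name without ".pdf".
    stem = prefix if prefix is not None else Path(input_pdf).stem
    # zfill pads with leading zeros up to the requested width and never
    # truncates, so extra digits are still added as needed.
    counter = str(page).zfill(digits)
    return f"{stem}_{counter}{extension}"


print(output_name("mydoc.pdf", 1))
print(output_name("mydoc.pdf", 10, prefix="newname", digits=4))
```

With a fixed width, the generated names sort in page order in a plain alphabetical listing, which is the usual reason to set '--digits'.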
To avoid any ambiguities in file naming, the prefix option should be used only for conversion of individual documents.
By default, PDF2Text converts a PDF to a plain .txt file without any extra metadata. The output format can be changed using the '-f' (or --format) option. For example,
pdf2text -f xml in.pdf
will convert the PDF to XML and include a number of additional properties, such as positioning and styling information for each word.
The '--format' parameter accepts any of the following text formats:
By default, PDF2Text uses UTF8 encoding. To change the output encoding, use the -e (or --encoding) option. For example,
pdf2text --encoding UTF16 in.pdf
The '--encoding' parameter supports two encoding formats:
PDF2Text will, without user intervention, decrypt and convert documents secured with a master/owner password. If the document is secured using a user (i.e. 'file open') password, PDF2Text will, by default, prompt the user to enter the password. If the '--noprompt' option is used, the program will not ask for a password and will display an error message instead.
For unattended conversion, the password can also be specified directly on the command-line using the '-p' (or --password) option. For example:
pdf2text -p secret -f xml secured.pdf
The above command line converts the PDF to XML format, using the provided password ('secret') to open the secured PDF document.
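For scripted, unattended runs, the command line can be assembled programmatically. The following Python sketch only builds the argument list from the options documented here ('-p', '-f', '--noprompt'); the helper name build_command is hypothetical, and the sketch assumes the pdf2text executable is on the PATH when the command is eventually run.

```python
def build_command(pdf_path, password=None, output_format=None, noprompt=False):
    """Return a pdf2text argument list for an unattended conversion."""
    command = ["pdf2text"]
    if password is not None:
        # -p (or --password): user password for a secured document.
        command += ["-p", password]
    if output_format is not None:
        # -f (or --format): output format, e.g. "xml".
        command += ["-f", output_format]
    if noprompt:
        # --noprompt: report an error instead of asking for a password.
        command.append("--noprompt")
    command.append(pdf_path)
    return command


print(build_command("secured.pdf", password="secret", output_format="xml"))
```

The resulting list could be passed to subprocess.run. Passing arguments as a list rather than a single string avoids quoting problems with passwords or paths that contain spaces.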
Note: PDF2Text supports all standard security options available in PDF, including 40-bit and 128-bit RC4 encryption, Crypt filters, and 128-bit AES (Advanced Encryption Standard) encryption.
By default, PDF2Text will convert all PDF pages to text. You can specify a subset of pages to convert using the '-a' (or '--pages') option. For example:
pdf2text -a 1,3,10 in.pdf
will convert only pages 1, 3, and 10. Please note that PDF2Text assumes that all pages are numbered sequentially starting from page 1.
To specify a range of pages, use a dash character between the page numbers. For example:
pdf2text -a 1,10-20,50- in.pdf
will convert the first page, pages in the range from 10 to 20, and all pages from page 50 to the last page in the document.
All even pages can be selected using the 'e' (or 'even') string. For example, the following line converts all even pages:
pdf2text --pages even in.pdf
Similarly, odd pages can be selected using the 'o' (or 'odd') string. The following line converts all odd pages in the document and every page in the range from 100 to the last page:
pdf2text --pages odd,100- in.pdf
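To make the page selection syntax concrete, the following Python sketch expands a page string into a sorted list of page numbers. It illustrates the documented syntax only and is not PDF2Text's own parser: the function name parse_pages is hypothetical, and dropping pages beyond the end of the document and merging duplicates are assumptions.

```python
def parse_pages(spec, page_count):
    """Expand a page string such as "1,10-20,50-" or "odd,100-".

    Pages are numbered sequentially starting from 1.
    """
    selected = set()
    for token in spec.split(","):
        token = token.strip().lower()
        if token in ("e", "even"):
            selected.update(range(2, page_count + 1, 2))
        elif token in ("o", "odd"):
            selected.update(range(1, page_count + 1, 2))
        elif "-" in token:
            first, _, last = token.partition("-")
            start = max(int(first), 1)
            # An open-ended range ("50-") runs to the last page.
            end = int(last) if last else page_count
            selected.update(range(start, min(end, page_count) + 1))
        else:
            page = int(token)
            if 1 <= page <= page_count:
                selected.add(page)
    return sorted(selected)


print(parse_pages("1,3,10", 12))
print(parse_pages("even", 6))
```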
PDF2Text supports batch conversion of many PDF files in a single pass. To convert all PDF files in a given folder(s) you can use the following syntax:
pdf2text myfolder1 myfolder2
The '--subfolders' option can be used to recursively process all subfolders. For example, the following line will convert all documents in 'myfolder1' and 'myfolder2' as well as all subfolders:
pdf2text --subfolders myfolder1 myfolder2
By default, PDF2Text will convert all files with the extension '.pdf'. To select different files based on the extension, use the '--extension' parameter. For example, to convert all documents with a custom extension '.blob', you could use the following line:
pdf2text --extension .blob --subfolders myfolder1
The use of wildcard characters is also allowed. For example, to convert all PDF files starting with 'x' in the current folder, use:
pdf2text x*.pdf
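The batch selection rules above (folders, an extension filter, and optional recursion into subfolders) can be previewed with a short Python sketch, for example to check which files a batch run would pick up. The function name collect_inputs is hypothetical and the matching shown here is an approximation; PDF2Text's own matching (for instance, its handling of letter case) may differ.

```python
from pathlib import Path


def collect_inputs(folders, extension=".pdf", subfolders=False):
    """List the files in the given folders that have the given extension."""
    pattern = f"*{extension}"
    found = []
    for folder in folders:
        root = Path(folder)
        # rglob descends into subfolders; glob stays in the folder itself.
        matches = root.rglob(pattern) if subfolders else root.glob(pattern)
        found.extend(sorted(p for p in matches if p.is_file()))
    return found


# Example: files in the current folder that a default run would select.
for path in collect_inputs(["."]):
    print(path)
```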
By default, PDF2Text will expand all ligatures in PDF. In writing and typography, a ligature occurs where two or more graphemes are joined as a single glyph. Use '--noligature' to disable ligature expansion. For example:
pdf2text --noligature mypdf
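As an illustration of what ligature expansion means for the extracted text, the following Python sketch replaces the common Latin ligature code points with their component letters. The function name and the mapping table are illustrative assumptions; they do not describe how PDF2Text implements expansion internally.

```python
# Unicode "Alphabetic Presentation Forms" for common Latin ligatures.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each single-glyph ligature with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


print(expand_ligatures("o\ufb03ce \ufb01le"))
```

Expanded text is generally easier to search and index, because a search for "file" does not match a string that stores "fi" as a single ligature character.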
PDF files sometimes contain duplicated text to achieve visual effects such as a drop shadow or a fake bold text style. By default, PDF2Text deletes duplicated overlapping text. To keep the duplicates, specify the '--no_dup_remove' option on the command line. For example:
pdf2text --no_dup_remove mypdf
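The idea behind duplicate removal can be sketched as follows. This is a simplified Python illustration, not PDF2Text's algorithm: the (text, x, y) tuple representation of a text run, the function name, and the one-point tolerance are all assumptions made for the example.

```python
def drop_overlapping_duplicates(runs, tolerance=1.0):
    """Keep the first of any identical text runs drawn at nearly the same spot.

    Each run is a (text, x, y) tuple with coordinates in points.
    """
    kept = []
    for text, x, y in runs:
        is_duplicate = any(
            text == k_text
            and abs(x - k_x) <= tolerance
            and abs(y - k_y) <= tolerance
            for k_text, k_x, k_y in kept
        )
        if not is_duplicate:
            kept.append((text, x, y))
    return kept


runs = [
    ("Title", 100.0, 700.0),
    ("Title", 100.5, 699.5),  # shadow copy, offset by half a point
    ("Title", 100.0, 650.0),  # same word elsewhere on the page
]
print(drop_overlapping_duplicates(runs))
```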
PDF2Text automatically removes hyphens in the original PDF file that connect words split across two lines. Use the '--nodehyphen' option to disable word merging across lines. For example:
pdf2text --nodehyphen mypdf
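The effect of dehyphenation can be illustrated with a deliberately simple Python sketch. It joins a line ending in a hyphen with the next line when that line starts with a lowercase letter. This rule and the function name are assumptions made for illustration, and PDF2Text's actual logic may be more elaborate.

```python
def dehyphenate(lines):
    """Merge words that were split across lines with a trailing hyphen."""
    merged = []
    for line in lines:
        if merged and merged[-1].endswith("-") and line[:1].islower():
            # Drop the hyphen and continue the previous line.
            merged[-1] = merged[-1][:-1] + line
        else:
            merged.append(line)
    return merged


print(dehyphenate(["This is an exam-", "ple of split text."]))
```

A naive rule like this one also joins genuinely hyphenated compounds that happen to break at the hyphen, which is one situation where disabling the merging with '--nodehyphen' may be preferable.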
PDF2Text provides several options related to the layout of text in the input PDF files.
In some cases, a PDF document may be missing spaces between punctuation characters and words, so that several words are merged into a single unit. To break words at punctuation characters, use the '--punct_break' option. For example:
pdf2text --punct_break mypdf
In some cases, text in a PDF may be obscured by images or rectangles. By default, PDF2Text will extract this hidden text; however, you can disable this behavior using the '--remove_hidden_text' option. For example:
pdf2text --remove_hidden_text mypdf
Similarly, some scanned PDF files or documents that went through OCR (Optical Character Recognition) may contain invisible text to facilitate text selection, highlighting, and text extraction. By default, PDF2Text will extract this invisible text. To prevent its extraction, use the '--remove_invisible_text' option. For example:
pdf2text --remove_invisible_text mypdf
If you are looking for more flexibility or a more programmatic approach to text extraction, you may want to consider using the Apryse PDF SDK, as shown in the sample code available at /samples/#textextract. The PDF SDK offers fine-grained control over text extraction and access to low-level features in PDF documents.
By default, PDF2Text merges all words into a single text line when converting to XML. To represent each word as a separate XML element with positioning and styling information, use the '--xml_words_as_elements' option. To include font and styling information for each word or line, use the '--xml_output_styles' option. For example, the following command produces the default XML output for a given PDF:
pdf2text --format xml my.pdf
With the '--xml_words_as_elements' and '--xml_output_styles' options, the generated output is richer:
pdf2text --format xml --xml_words_as_elements --xml_output_styles my.pdf
PDF2Text provides several options to retrieve page information from existing PDF documents:
Use the '--wordcount' option to retrieve the number of words for each page. For example:
pdf2text --wordcount my.pdf
will retrieve the number of words for each page in the specified document.
Use the '--charcount' option to retrieve the number of characters for each page. For example:
pdf2text --charcount my.pdf
will retrieve number of characters for each page in the specified document.
Use '--pageinfo' option to retrieve the width, height, media box, crop box, and rotation for every page in the document. For example:
pdf2text --pageinfo my.pdf
Using PDF2Text you can extract text from a subset of a page using the '--clip' parameter. The parameter accepts a list of four numbers, separated using commas, giving the coordinates of a pair of diagonally opposite corners. Typically, the list takes the form: llx, lly, urx, ury specifying the lower-left x, lower-left y, upper-right x, and upper-right y coordinates of the rectangle, in that order. The other two corners of the rectangle are then assumed to have coordinates (llx, ury) and (urx, lly). All coordinates need to be expressed in points (a basic unit of PDF 'user' coordinate system). One PDF point is 1⁄72 of an inch and is approximately the same as a point (unit commonly used in the printing industry).
For example:
pdf2text -c 150,600,250,700 license.pdf -a 1
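Because the clip rectangle is expressed in points, measurements taken in inches have to be multiplied by 72 first. The following Python sketch performs that conversion and formats the value in llx,lly,urx,ury order. The helper name clip_from_inches is hypothetical; the sketch accepts the two corners in either order and sorts them so that the lower-left corner comes first.

```python
POINTS_PER_INCH = 72


def clip_from_inches(x1, y1, x2, y2):
    """Format a clip rectangle, given two opposite corners in inches."""
    # Sort each axis so the smaller coordinate is the lower-left value.
    llx, urx = sorted((x1 * POINTS_PER_INCH, x2 * POINTS_PER_INCH))
    lly, ury = sorted((y1 * POINTS_PER_INCH, y2 * POINTS_PER_INCH))
    return ",".join(f"{value:g}" for value in (llx, lly, urx, ury))


# A 2 x 2 inch region whose lower-left corner is 1 inch from the left
# edge and 2 inches from the bottom edge of the page.
print(clip_from_inches(1, 2, 3, 4))
```

The returned string can be used as the value of the '--clip' parameter.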
PDF2Text is a completely standalone application and does not depend on any third-party components or software.