Did you find this guide helpful?
Some test text!
Cli / Guides / Options
Usage: pdf2text [<options>] file...
OPTIONS:
--file... arg A list of folders and/or file names to process.
-o [ --output ] arg The folder used to store output files. By
default, the output will be displayed on
screen.
-a [ --pages ] arg (=-) Specifies the list of pages to convert. By
default, all pages are converted.
-e [ --encoding ] arg (=UTF8) Output text encoding:
UTF8
UTF16
The default output encoding is UTF8.
-f [ --format ] arg (=plain) Output text formating:
plain
wordlist
textruns
xml
The default output format is 'plain' text.
--noligatures Disables expanding of ligatures using a
predefined mapping. Default ligatures are: fi,
ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st,
oe, OE.
--nodehyphen Disables finding and removing hyphens that
split words across two lines. Hyphens are often
used a the end of lines as an indicator that a
word spans two lines. Hyphen detection enables
removal of hyphen character and merging of text
runs to form a single word. This option has no
effect on Tagged PDF files.
--no_dup_remove Disables removing duplicated text that is
frequently used to achieve visual effects of
drop shadow and fake bold.
--punct_break Treat punctuation (e.g. full stop, comma,
semicolon, etc.) as word break characters.
--remove_hidden_text Enables removal of text that is obscured by
images or rectangles. Since this option has
small performance penalty on performance of
text extraction, by default it is not enabled.
--no_invisible_text Enables removing text that uses rendering mode
3 (i.e. invisible text). Invisible text is
usually used in 'PDF Searchable Images' (i.e.
scanned pages with a corresponding OCR text).
As a result, invisible text will be extracted
by default.
--output_bbox Include bounding box information for each text
element. If the output format is 'XML' the
bounding box information will be stored in
'bbox' attribute. If the output format is
'wordlist' the coordinates of the bounding box
will precede the word.
--xml_words_as_elements Output words as XML elements instead of inline
text.
--xml_output_styles Include font and styling information.
--wordcount Get the number of words on each page.
--charcount Get total number of characters on each page.
--pageinfo Get the width, height, media box, crop box, and
page rotation for every page.
--prefix arg The prefix for output text files. The output
filename will be constructed by appending the
prefix string, the page number, and the
appropriate file extension (e.g. myprefix1.txt,
myprefix2.xml, etc). The prefix option should
be used only for processing of individual
documents. By default, PDF filename will be
used as a prefix.
--digits arg The number of digits used in the page counter
portion of the output filename. By default, new
digits are added as needed; however this
parameter could be used to format the page
counter field to a uniform width (e.g.
myfile0001.txt, myfile0002.txt, etc).
--subfolders Process all sub-directory for every directory
specified in the argument list. By default,
sub-directories are not processed.
-c [ --clip ] arg User definable clip box. The default clip
region is crop box of the page.
--noprompt Disables any user input. By default, the
application will ask for a valid password if
the password is incorrect.
-p [ --pass ] arg The password for secured PDF files. Not
required if the input document is not secured
using the 'open' password.
--extension arg (=.pdf) The default file extension used to process PDF
documents. The default extension is ".pdf".
--verb arg (=1) Set the opt.m_verbosity level to 'arg' (0-2).
-v [ --version ] Print the version information.
-h [ --help ] Print a listing of available options.
Examples:
pdf2text my.pdf
pdf2text -o test_out/ex1 test/my.pdf
pdf2text --wordcount my.pdf
pdf2text -o test_out -a 1 -f xml --output_bbox my.pdf
Trial setup questions? Ask experts on Discord
Need other help? Contact Support
Pricing or product questions? Contact Sales