Command-Line Summary for PDF2Text

Usage: pdf2text [<options>] file...

OPTIONS:

  --file... arg                  A list of folders and/or file names to process.

  -o [ --output ] arg            The folder used to store output files. By
                                 default, the output will be displayed on
                                 screen.

  -a [ --pages ] arg (=-)        Specifies the list of pages to convert. By
                                 default, all pages are converted.

  -e [ --encoding ] arg (=UTF8)  Output text encoding:
                                   UTF8
                                   UTF16
                                 The default output encoding is UTF8.

  -f [ --format ] arg (=plain)   Output text formatting:
                                   plain
                                   wordlist
                                   textruns
                                   xml
                                 The default output format is 'plain' text.

  --noligatures                  Disables expanding of ligatures using a
                                 predefined mapping. Default ligatures are: fi,
                                 ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st,
                                 oe, OE.

  --nodehyphen                   Disables finding and removing hyphens that
                                 split words across two lines. Hyphens are often
                                 used at the end of lines as an indicator that a
                                 word spans two lines. Hyphen detection enables
                                 removal of the hyphen character and merging of
                                 text runs to form a single word. This option
                                 has no effect on Tagged PDF files.

  --no_dup_remove                Disables removing duplicated text that is
                                 frequently used to achieve visual effects of
                                 drop shadow and fake bold.

  --punct_break                  Treat punctuation (e.g. full stop, comma,
                                 semicolon, etc.) as word break characters.

  --remove_hidden_text           Enables removal of text that is obscured by
                                 images or rectangles. Because this option has a
                                 small performance penalty on text extraction,
                                 it is not enabled by default.

  --no_invisible_text            Enables removing text that uses rendering mode
                                 3 (i.e. invisible text). Invisible text is
                                 usually used in 'PDF Searchable Images' (i.e.
                                 scanned pages with corresponding OCR text), so
                                 it is extracted by default.

  --output_bbox                  Include bounding box information for each text
                                 element. If the output format is 'xml', the
                                 bounding box information will be stored in the
                                 'bbox' attribute. If the output format is
                                 'wordlist', the coordinates of the bounding box
                                 will precede the word.

  --xml_words_as_elements        Output words as XML elements instead of inline
                                 text.

  --xml_output_styles            Include font and styling information.

  --wordcount                    Get the number of words on each page.

  --charcount                    Get total number of characters on each page.

  --pageinfo                     Get the width, height, media box, crop box, and
                                 page rotation for every page.

  --prefix arg                   The prefix for output text files. The output
                                 filename is constructed by concatenating the
                                 prefix string, the page number, and the
                                 appropriate file extension (e.g. myprefix1.txt,
                                 myprefix2.xml, etc). The prefix option should
                                 be used only when processing individual
                                 documents. By default, the PDF filename is used
                                 as the prefix.

  --digits arg                   The number of digits used in the page counter
                                 portion of the output filename. By default, new
                                 digits are added as needed; however, this
                                 parameter can be used to format the page
                                 counter field to a uniform width (e.g.
                                 myfile0001.txt, myfile0002.txt, etc).

  --subfolders                   Process all sub-directories of every directory
                                 specified in the argument list. By default,
                                 sub-directories are not processed.

  -c [ --clip ] arg              User-definable clip box. The default clip
                                 region is the crop box of the page.

  --noprompt                     Disables any user input. By default, the
                                 application will ask for a valid password if
                                 the password is incorrect.

  -p [ --pass ] arg              The password for secured PDF files. Not
                                 required if the input document is not secured
                                 using the 'open' password.

  --extension arg (=.pdf)        The default file extension used to process PDF
                                 documents. The default extension is ".pdf".

  --verb arg (=1)                Set the verbosity level to 'arg' (0-2).

  -v [ --version ]               Print the version information.

  -h [ --help ]                  Print a listing of available options.

Examples:
  pdf2text my.pdf
  pdf2text -o test_out/ex1 test/my.pdf
  pdf2text --wordcount my.pdf
  pdf2text -o test_out -a 1 -f xml --output_bbox my.pdf
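
The output naming rule described under --prefix and --digits (prefix, then page number, then file extension) can be illustrated with a small shell sketch. This is not part of pdf2text: the helper name expected_output_name is hypothetical, and the sketch assumes that --digits zero-pads the page counter, as the myfile0001.txt example in the help text suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdf2text): predicts the per-page output
# filename from a prefix, a page number, a digit width, and an extension.
# Assumption: --digits zero-pads the page counter, as in myfile0001.txt.
expected_output_name() {
    prefix=$1
    page=$2
    digits=${3:-1}   # width 1 adds no padding; longer page numbers still fit
    ext=${4:-txt}
    printf "%s%0${digits}d.%s\n" "$prefix" "$page" "$ext"
}

# A matching invocation might look like this (only options listed above):
#   pdf2text -o test_out --prefix myfile --digits 4 my.pdf
expected_output_name myfile 1 4 txt     # prints myfile0001.txt
expected_output_name myprefix 2 1 xml   # prints myprefix2.xml
```

A helper like this can be handy in a batch script that needs to locate the file for a given page after conversion. The actual names written by the tool should be checked against a real run.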
