Command-Line Summary for PDF2Text

Usage: pdf2text [<options>] file...

OPTIONS:

  --file... arg                 A list of folders and/or file names to process.

  -o [ --output ] arg           The folder used to store output files. By
                                default, the output will be displayed on
                                screen.

  -a [ --pages ] arg (=-)       Specifies the list of pages to convert. By
                                default, all pages are converted.

  -e [ --encoding ] arg (=UTF8) Output text encoding:
                                 UTF8
                                 UTF16
                                The default output encoding is UTF8.

  -f [ --format ] arg (=plain)  Output text formatting:
                                 plain
                                 wordlist
                                 textruns
                                 xml
                                The default output format is 'plain' text.

  --noligatures                 Disables expanding of ligatures using a
                                predefined mapping. Default ligatures are: fi,
                                ff, fl, ffi, ffl, ch, cl, ct, ll, ss, fs, st,
                                oe, OE.

  --nodehyphen                  Disables finding and removing hyphens that
                                split words across two lines. Hyphens are often
                                used at the end of lines as an indicator that a
                                word spans two lines. Hyphen detection enables
                                removal of the hyphen character and merging of
                                text runs to form a single word. This option
                                has no effect on Tagged PDF files.

  --no_dup_remove               Disables removing duplicated text that is
                                frequently used to achieve visual effects of
                                drop shadow and fake bold.

  --punct_break                 Treat punctuation (e.g. full stop, comma,
                                semicolon, etc.) as word break characters.

  --remove_hidden_text          Enables removal of text that is obscured by
                                images or rectangles. Since this option carries
                                a small performance penalty for text
                                extraction, it is not enabled by default.

  --no_invisible_text           Enables removing text that uses rendering mode
                                3 (i.e. invisible text). Invisible text is
                                usually used in 'PDF Searchable Images' (i.e.
                                scanned pages with a corresponding OCR text).
                                Because of this, invisible text is extracted
                                by default.

  --output_bbox                 Include bounding box information for each text
                                element. If the output format is 'xml', the
                                bounding box information will be stored in the
                                'bbox' attribute. If the output format is
                                'wordlist', the coordinates of the bounding box
                                will precede the word.

  --xml_words_as_elements       Output words as XML elements instead of inline
                                text.

  --xml_output_styles           Include font and styling information.

  --wordcount                   Get the number of words on each page.

  --charcount                   Get the total number of characters on each
                                page.

  --pageinfo                    Get the width, height, media box, crop box, and
                                page rotation for every page.

  --prefix arg                  The prefix for output text files. The output
                                filename will be constructed by concatenating
                                the prefix string, the page number, and the
                                appropriate file extension (e.g. myprefix1.txt,
                                myprefix2.xml, etc). The prefix option should
                                be used only for processing of individual
                                documents. By default, the PDF filename is used
                                as the prefix.

  --digits arg                  The number of digits used in the page counter
                                portion of the output filename. By default, new
                                digits are added as needed; however, this
                                parameter can be used to format the page
                                counter field to a uniform width (e.g.
                                myfile0001.txt, myfile0002.txt, etc).

  --subfolders                  Process all sub-directories of every directory
                                specified in the argument list. By default,
                                sub-directories are not processed.

  -c [ --clip ] arg             User-definable clip box. The default clip
                                region is the crop box of the page.

  --noprompt                    Disables any user input. By default, the
                                application will ask for a valid password if
                                the password is incorrect.

  -p [ --pass ] arg             The password for secured PDF files. Not
                                required if the input document is not secured
                                using the 'open' password.

  --extension arg (=.pdf)       The default file extension used to process PDF
                                documents. The default extension is ".pdf".

  --verb arg (=1)               Set the verbosity level to 'arg' (0-2).

  -v [ --version ]              Print the version information.

  -h [ --help ]                 Print a listing of available options.

Examples:
  pdf2text my.pdf
  pdf2text -o test_out/ex1 test/my.pdf
  pdf2text --wordcount my.pdf
  pdf2text -o test_out -a 1 -f xml --output_bbox my.pdf
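
The file naming rule described under --prefix and --digits can be illustrated with a small shell sketch. This is not PDF2Text's own code: the function name output_name is hypothetical, and treating a digits value of 0 as "no padding" is an assumption made here to mimic the documented default ("new digits are added as needed").

```shell
# Hypothetical helper (not part of PDF2Text). It mimics the naming rule from
# the --prefix and --digits help text:
#   prefix + page number (zero-padded to a fixed width) + file extension
output_name() {
  prefix=$1   # value that would be passed to --prefix
  digits=$2   # value that would be passed to --digits; 0 = no padding (assumption)
  page=$3     # page number
  ext=$4      # file extension, e.g. txt or xml
  # Build a format such as %04d so the page counter has a uniform width.
  printf "%s%0${digits}d.%s\n" "$prefix" "$digits_unused$page" "$ext"
}

output_name myfile 4 1 txt      # fixed-width counter
output_name myprefix 0 2 xml    # counter without padding
```

Under the same reading of the help text, an invocation such as `pdf2text -o test_out --prefix myfile --digits 4 my.pdf` would be expected to write files named like myfile0001.txt into test_out; check the result against your installed version.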