Common examples of command-line PDF to Text extraction

Apryse PDF2Text is a command-line application that converts PDF documents to text or XML. This section covers the basic usage of PDF2Text and explains the most commonly used options through examples.

Basic Syntax

The basic command-line syntax is:

pdf2text [options] file1 file2 folder1 file3 ...

For more options, see the Command-Line Summary for PDF2Text.

General Usage Examples

Example 1. The simplest command line: Convert PDF to plain text.

Notes:

  • This command relies heavily on defaults. The default output format is plain text.

  • The '-o' (or '--output') parameter specifies the output folder. If this option is not specified, the extracted text is printed to the console window.

pdf2text -o ex1 test/importantdoc.pdf

Example 2. Convert specific PDF pages to XML, including font and styling information, while preserving ligatures and removing hidden text.

Notes:

  • The '-a' (or '--pages') option specifies the pages to be converted.

  • The '-f' option specifies the output file format.

  • The '--xml_output_styles' option includes font and styling information in the XML output.

  • The '--noligatures' option keeps the ligatures of the PDF file as they are instead of expanding them.

  • The '--remove_hidden_text' option removes hidden text of the PDF file from the output.

  • The '--output' option is equivalent to '-o' and specifies the output folder.

pdf2text --output ex2 -a 3-10 -f xml --xml_output_styles --noligatures --remove_hidden_text test/importantdoc.pdf
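
When commands like the one in Example 2 are generated from a script, it can help to assemble the argument list programmatically. The following is a minimal Python sketch, not part of PDF2Text itself. The helper name build_pdf2text_args and its parameters are hypothetical; only the flags already shown in this section are used.

```python
from typing import List, Optional, Sequence


def build_pdf2text_args(
    inputs: Sequence[str],
    output_dir: Optional[str] = None,
    pages: Optional[str] = None,
    fmt: Optional[str] = None,
    flags: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list of the form: pdf2text [options] inputs...

    Hypothetical helper for illustration; it only arranges strings and
    does not validate them against the real PDF2Text option set.
    """
    args = ["pdf2text"]
    if output_dir is not None:
        args += ["--output", output_dir]   # same as '-o'
    if pages is not None:
        args += ["-a", pages]              # pages to convert, e.g. "3-10"
    if fmt is not None:
        args += ["-f", fmt]                # output format, e.g. "xml"
    args += list(flags)                    # switches that take no value
    args += list(inputs)                   # files and/or folders come last
    return args


# Rebuild the command line from Example 2.
example2 = build_pdf2text_args(
    ["test/importantdoc.pdf"],
    output_dir="ex2",
    pages="3-10",
    fmt="xml",
    flags=["--xml_output_styles", "--noligatures", "--remove_hidden_text"],
)
print(" ".join(example2))
```

Assuming pdf2text is on the PATH, the resulting list could be passed directly to subprocess.run(example2). Keeping the arguments as a list avoids quoting problems with paths that contain spaces.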

Example 3. Extract PDF text runs from a given clip region of a password-protected PDF.

pdf2text -f textruns -o ex3 --c 0,0,595,842 test/blue_secret.pdf

Batch Processing and the Use of Wildcards

PDF2Text supports processing multiple input documents in the same run. For example, you can specify multiple folders, and PDF2Text will automatically process all PDF documents in them matching a given file extension. The following command line processes all PDF documents in the folders 'test1' and 'test2':

c:\>pdf2text -o c:/output_folder c:/test1 c:/test2

Wildcard characters can also be used to process multiple input files.

For example, if a directory contains the following PDF documents:

C:\test1 >dir
 Directory of C:\test1
 01/04/2007 03:35 PM <DIR> .
 01/04/2007 03:35 PM <DIR> ..
 05/21/2004 02:27 PM A1.pdf
 05/03/2005 09:38 AM A2.pdf
 05/20/2003 08:46 AM B1.pdf
 05/15/2003 12:50 PM B2.pdf

To process all PDF documents in this folder, you could specify:

pdf2text -o c:/output_folder c:/test1/*.pdf

To process all PDF documents starting with 'A', you could specify:

pdf2text -o c:/output_folder c:/test1/A*.pdf

Or to process all PDF documents ending with '1', you could specify:

pdf2text -o c:/output_folder c:/test1/*1.pdf

You can use either of the two standard wildcards, the question mark (?) and the asterisk (*), to specify filename and path arguments on the command line.

The wildcards are expanded in the same manner as in operating system commands. (Please refer to your operating system user's guide if you are unfamiliar with wildcards.) Enclosing an argument in double quotation marks (" ") suppresses the wildcard expansion. Within quoted arguments, you can represent quotation marks literally by preceding the double-quotation-mark character with a backslash (\). If no matches are found for the wildcard argument, the argument is passed literally.
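
To make the matching rules concrete, here is a small Python sketch that applies the patterns above to the four file names from the directory listing. It only approximates the behavior: it uses Python's fnmatch module rather than PDF2Text's own expansion, and it matches case-sensitively, whereas file name matching on Windows is normally case-insensitive. The function name expand is hypothetical.

```python
from fnmatch import fnmatchcase

# File names taken from the sample directory listing of C:\test1.
FILES = ["A1.pdf", "A2.pdf", "B1.pdf", "B2.pdf"]


def expand(pattern, names=FILES):
    """Return the names matching the wildcard pattern.

    If nothing matches, return the pattern itself, mirroring the rule
    that an unmatched wildcard argument is passed literally.
    """
    matches = [name for name in names if fnmatchcase(name, pattern)]
    return matches if matches else [pattern]


print(expand("*.pdf"))    # ['A1.pdf', 'A2.pdf', 'B1.pdf', 'B2.pdf']
print(expand("A*.pdf"))   # ['A1.pdf', 'A2.pdf']
print(expand("*1.pdf"))   # ['A1.pdf', 'B1.pdf']
print(expand("?2.pdf"))   # ['A2.pdf', 'B2.pdf']
print(expand("C*.pdf"))   # ['C*.pdf'] (no match, passed literally)
```

The asterisk matches any run of characters, while the question mark matches exactly one character, which is why '?2.pdf' selects A2.pdf and B2.pdf.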

Exit Codes

To provide additional feedback, PDF2Text returns an exit code when it finishes processing. Exit codes can be used for user feedback, logging, and similar purposes. This is particularly important for applications running in an unattended environment.

The following table lists possible exit codes and their description:

Exit Code       Description
--------------- ------------------------------------------------------------------
0               All files converted successfully.
1               Document is secured. Need a valid password to open the document.
2               Error opening the input file(s).
3               An unknown exception encountered.

All codes other than '0' indicate that there was an error during the conversion process.

The following illustrates a sample Windows batch script that processes exit codes:

@echo off
rem Convert all PDF files in the 'data' folder.

pdf2text ./data
rem 'if errorlevel N' is true for any exit code >= N, so test the highest code first.
if errorlevel 3 goto othererror
if errorlevel 2 goto inputerr
if errorlevel 1 goto passwd
goto exit

:passwd
echo Document is protected. Need a valid password to open the document.
goto exit

:inputerr
echo Error opening the input file(s).
goto exit

:othererror
echo An error was encountered during processing.
goto exit

:exit
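
For environments where a batch script is not convenient, the same exit-code handling can be sketched in Python. This is an illustrative wrapper, not an official Apryse API. The names EXIT_CODE_DESCRIPTIONS, describe_exit_code, and run_pdf2text are hypothetical, the fallback message for unlisted codes is an assumption, and pdf2text is assumed to be on the PATH.

```python
import subprocess
from typing import Sequence

# Descriptions copied from the exit code table above.
EXIT_CODE_DESCRIPTIONS = {
    0: "All files converted successfully.",
    1: "Document is secured. Need a valid password to open the document.",
    2: "Error opening the input file(s).",
    3: "An unknown exception encountered.",
}


def describe_exit_code(code: int) -> str:
    """Map an exit code to its description.

    The fallback text for codes not in the table is an assumption
    added for this sketch.
    """
    return EXIT_CODE_DESCRIPTIONS.get(code, f"Unrecognized exit code {code}.")


def run_pdf2text(args: Sequence[str]) -> int:
    """Run pdf2text with the given arguments and return its exit code."""
    completed = subprocess.run(["pdf2text", *args])
    return completed.returncode


# Show the message that would be reported for each documented code.
for code in sorted(EXIT_CODE_DESCRIPTIONS):
    print(code, describe_exit_code(code))
```

A caller would typically write code = run_pdf2text(["./data"]) and then log describe_exit_code(code). Unlike 'if errorlevel', the dictionary lookup compares codes exactly, so the order of the entries does not matter.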
