PDF to Text Command Line Extraction

Apryse's PDF2Text is an easy-to-use, multi-platform command-line program for high-quality and efficient text extraction from PDF documents. PDF2Text can extract text from any PDF document as Unicode text or as structured XML, with a wide range of output styles and configuration options.

PDF2Text is offered both as a command-line application and as a software development component that can serve as a building block for other client- and server-based applications.

Why PDF2Text?

  • Complete Unicode support. PDF2Text can process PDF files from any part of the world (including Asian languages) and represent the extracted text as UTF-8 or UTF-16. To improve Unicode output, PDF2Text can recognize vendor-specific Unicode character assignments (in the Private Use Area) and map them to the public Unicode area. Similarly, Unicode ligatures and PDF-specific ligatures can be broken into a sequence of individual Unicode characters. Characters that can't be mapped to Unicode are predictably mapped into the Private Use Area.

  • Intelligent Text Recognition. An intelligent text recognition and logical structure engine recognizes words, lines, paragraphs, and the reading order in PDF documents. The engine can remove duplicated text commonly used for drop shadow effects, or text that is obscured by other page content. The text extractor also works with PDF documents that contain rotated text, and with documents where the information is presented in a random order or is scattered across the page.

  • Highest Reliability and Robustness. PDF2Text was designed from the ground up to run in high-throughput, server-based, and multi-threaded applications. A regular and rigorous QA process sets high standards for the reliability of all Apryse products.

  • Top Performance. Advanced text recognition and content analysis algorithms, coupled with low memory usage and native code efficiency, make PDF2Text the ideal choice for high-traffic servers as well as for interactive applications.

Key Functions

  • Extracts text from any PDF document as plain text or as structured XML.
  • Offers different Unicode text encoding (UTF-8 and UTF-16) options.
  • Provides positioning, font, and styling information for every paragraph, line, word, or glyph on a page.
  • Offers options to control the level of detail and the formatting in the output XML.
  • Offers advanced options to control ligature expansion and hyphen removal, and to remove duplicate text (which is sometimes used for drop shadow effects).
  • Allows text extraction from within a clip rectangle, or hiding text in specific regions on a page.
  • Offers an option to remove hidden text or text that is obscured by other page elements (such as images or rectangles).
  • Supports all versions of the PDF format (PDF 1.0 to ISO 32000).
  • Fully supports encrypted documents (40-bit and 128-bit RC4, and 128-bit AES).
  • Supports automation and batch operation.

Sample Use Case Scenarios

  • Server-based, on-demand conversion of PDF documents to text format files.
  • Extract text from a large PDF repository for text indexing or content retrieval purposes (e.g. to implement a PDF search engine).
  • Classify or summarize PDF documents based on their content. Find specific words for content editing purposes (such as splitting pages based on keywords).
  • Convert PDF pages to text or XML for content repurposing.
  • Search PDF pages for specific words or keywords and return their positioning information (e.g. to highlight instances of a given word).
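
The indexing and keyword-search scenarios above can be sketched as a small post-processing step over already extracted text. This is only an illustration: the function name find_keyword is hypothetical, and the sketch assumes that an earlier PDF2Text run wrote plain-text files ending in .txt into a single directory, which the text above does not confirm.

```sh
#!/bin/sh
# List extracted-text files that mention a keyword, with match counts.
# Assumption: a previous pdf2text run wrote plain-text files ending in
# .txt into one directory; that naming is hypothetical.

find_keyword() {
    keyword=$1
    text_dir=$2
    for txt in "$text_dir"/*.txt; do
        # Skip the unexpanded pattern when the directory has no .txt files.
        [ -e "$txt" ] || continue
        # -i ignores case, -F treats the keyword as a literal string,
        # -c prints the number of matching lines.
        count=$(grep -i -F -c -- "$keyword" "$txt")
        if [ "$count" -gt 0 ]; then
            printf '%s\t%s\n' "$count" "$txt"
        fi
    done
}
```

Calling find_keyword "invoice" ./text_out would print one tab-separated line per matching file: the number of matching lines, then the file path. For positioning information, the wordlist or XML output with bounding boxes shown in the examples would be the starting point instead of plain text.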

Operating Systems Supported

  • Windows, Linux and Mac.

System Requirements

  • At least 10 MB of free disk space.
  • 2 GB of RAM.

Examples

sh

#!/bin/sh
# Replace <PDFNET_LICENSE_KEY> with your license key before running.
echo "Example 1): Convert PDF to Text"
./pdf2text "Apryse PDF2Text User Manual.pdf" --lic_key "<PDFNET_LICENSE_KEY>"
echo
echo "Example 2): Convert PDF to Text for page 1 in wordlist format with bounding box"
./pdf2text -o test_out -a 1 -f wordlist --output_bbox "Apryse PDF2Text User Manual.pdf" --lic_key "<PDFNET_LICENSE_KEY>"
echo
echo "Example 3): Convert all PDFs in the current directory to XML for page 1 with bounding box"
# The shell expands *.pdf to every PDF file in the current directory.
./pdf2text -o test_out -a 1 -f xml --output_bbox *.pdf --lic_key "<PDFNET_LICENSE_KEY>"
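
For batch operation, it can help to call pdf2text once per file so that a single problem document does not hide the others. The following is a minimal sketch, not an official script. It uses only the -o and --lic_key options that appear in the examples above, and it assumes (without confirmation from this page) that -o names an output directory and that pdf2text exits with a nonzero status on failure. The names convert_all, PDF2TEXT_BIN, and the PDFNET_LICENSE_KEY environment variable are hypothetical.

```sh
#!/bin/sh
# Batch-convert every PDF in a directory, one pdf2text call per file.
# Assumptions (not confirmed by this page): pdf2text exits with a nonzero
# status on failure, and -o names an output directory.

PDF2TEXT_BIN="${PDF2TEXT_BIN:-./pdf2text}"   # path to the executable
LICENSE_KEY="${PDFNET_LICENSE_KEY:-<PDFNET_LICENSE_KEY>}"

convert_all() {
    src_dir=$1
    out_dir=$2
    failed=0
    mkdir -p "$out_dir" || return 2
    for pdf in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        if ! "$PDF2TEXT_BIN" -o "$out_dir" "$pdf" --lic_key "$LICENSE_KEY"; then
            echo "failed: $pdf" >&2
            failed=$((failed + 1))
        fi
    done
    echo "done, $failed file(s) failed"
    # Return success only if every file converted.
    [ "$failed" -eq 0 ]
}
```

A call such as convert_all ./incoming ./text_out would then process each PDF in ./incoming, report failures on standard error, and return a nonzero status if any file failed, which makes the function usable from cron jobs or other automation.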

For developers who are looking for a software development component to integrate into their application, Apryse also offers a PDF SDK: an easy-to-use yet powerful software component for embedding into client- and server-based applications. Apryse's PDF SDK is available as a plain 'C DLL' and can be accessed from any programming language (including C#, VB.NET, C/C++, Java, VB6, Perl, Python, Ruby, Delphi, etc.). The PDF SDK is Apryse's own comprehensive PDF library. If you require rasterization or additional PDF functionality, please visit https://apryse.com/products/core-sdk/pdf/ or contact an Apryse representative for more information.
