Apryse's PDF2Text is an easy-to-use, multi-platform command-line program for high-quality, efficient text extraction from PDF documents. PDF2Text can extract text from any PDF document as Unicode or as structured XML, and it provides a wide range of output styles and configuration options.
PDF2Text is offered both as a command-line application and as a software development component that can be used as a building block for other client- and server-based applications.
part of the world (including Asian languages) and represent the extracted text using UTF-8 and UTF-16. To improve Unicode output, PDF2Text can recognize vendor-specific Unicode character assignments (in the Private Use Area) and map them to the public Unicode area. Similarly, Unicode ligatures and PDF-specific ligatures can be broken into a sequence of individual Unicode characters. Characters that can't be mapped to Unicode are predictably mapped into the Private Use Area.
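The ligature and Private Use Area handling described above can be illustrated with a minimal Python sketch. It is not PDF2Text's implementation: it relies only on the standard `unicodedata` module, and the function names `expand_ligatures` and `is_private_use` are hypothetical names chosen for this example.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace Unicode ligature characters with their component letters.

    Illustrative only: uses the compatibility decomposition from the
    Unicode Character Database, not PDF2Text's own mapping tables.
    """
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        name = unicodedata.name(ch, "")
        if decomp.startswith("<compat>") and "LIGATURE" in name:
            # decomp looks like "<compat> 0066 0069"; skip the tag and
            # convert each hexadecimal code point back into a character.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


def is_private_use(ch: str) -> bool:
    """Return True if the character has the Private Use category (Co)."""
    return unicodedata.category(ch) == "Co"
```

For example, `expand_ligatures("o\ufb03ce")` turns the single "ffi" ligature character into three letters and returns `"office"`, while `is_private_use("\ue000")` reports that U+E000 lies in the Private Use Area.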
logical structure engine used to recognize words, lines, paragraphs, and the reading order in PDF documents. The engine can remove duplicated text commonly used to create drop shadows, as well as text that is obscured by other page content. The text extractor also works with PDF documents that contain rotated text, and with documents where the information is presented in a random order or is scattered across the page.
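As a rough illustration of the drop-shadow case, duplicated text can be detected by looking for runs with identical content drawn at nearly the same position. The sketch below is an assumption-laden simplification, not the engine's actual algorithm: the `TextRun` type, the `drop_shadow_duplicates` function, and the `tolerance` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextRun:
    """Hypothetical text run: a string plus its origin in page units."""
    text: str
    x: float
    y: float


def drop_shadow_duplicates(runs: List[TextRun], tolerance: float = 2.0) -> List[TextRun]:
    """Keep the first copy of each run and drop near-coincident repeats.

    Two runs count as duplicates when their text is identical and their
    origins differ by at most `tolerance` on both axes.
    """
    kept: List[TextRun] = []
    for run in runs:
        is_duplicate = any(
            k.text == run.text
            and abs(k.x - run.x) <= tolerance
            and abs(k.y - run.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(run)
    return kept
```

Given a title drawn twice with a one-unit offset to fake a shadow, plus one body line, the function returns the title once followed by the body line. A real engine would also need to account for rotation, font size, and occlusion by other page content, which this sketch ignores.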
designed to be run in high-throughput, server-based, and multi-threaded applications. A regular and rigorous QA process sets high standards for the reliability of all Apryse products.
algorithms, coupled with low memory usage and native-code efficiency, make PDF2Text the ideal choice for high-traffic servers as well as for interactive applications.
For developers who are looking for a software development component to integrate into their application, Apryse also offers a PDF SDK, an easy-to-use yet powerful software component for embedding into client- and server-based applications. Apryse's PDF SDK is available as a plain 'C DLL' and can be easily accessed from any programming language (including C#, VB.NET, C/C++, Java, VB6, Perl, Python, Ruby, Delphi, etc.). Our PDF SDK is Apryse's own comprehensive PDF library. If you require rasterization or additional PDF functionality, please visit https://apryse.com/products/core-sdk/pdf/ or contact an Apryse representative for more information.