Server/Desktop PDF Content Extraction Library

The Apryse SDK offers deep, programmatic access to PDF content, so you can extract exactly the data you need.

Key capabilities include:

Text Extraction: Pull structured Unicode text with style, position, and layout details using pdftron.PDF.TextExtractor. Advanced options include ligature expansion, hidden/duplicated text handling, and more.
Signature Extraction: Retrieve digital signatures, timestamps, and verification details.
Graphics-Level Access: Extract and analyze graphical elements, including paths, color spaces, dash patterns, and transparency settings.
Low-Level Character Data: Access exact positioning of text runs and individual characters for precise downstream processing.
Font and Glyph Access: Extract embedded fonts and glyph outlines for advanced rendering or analysis.
Image Extraction: Extract all embedded images, with support for all PDF compression filters, plus optional RAW output and color normalization.
Layer (OCG) Extraction: Programmatically access PDF layers and optional content groups.
Annotations and Forms: Retrieve all form fields, annotations, and widget data directly from the document.
Tagged PDF Support: Access marked content for tagged PDFs, enabling structure-aware extraction.
Embedded Objects: Extract ICC profiles, U3D streams, attachments, and embedded files.
Metadata Access: Read document metadata for title, author, keywords, and more.
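
As a rough illustration of the text extraction capability listed above, here is a minimal Python sketch that walks a page line by line and word by word with TextExtractor. It makes several assumptions: that the SDK is installed as the apryse_sdk Python package, that you have a valid license key, and that the helper names (lines_from_extractor, extract_page_lines) are hypothetical names chosen for this example rather than part of the SDK.

```python
def lines_from_extractor(extractor):
    """Collect the text of each line from a TextExtractor-like object.

    Uses the line/word traversal pattern: GetFirstLine / GetNextLine for
    lines, GetFirstWord / GetNextWord for words, IsValid to detect the end.
    """
    lines = []
    line = extractor.GetFirstLine()
    while line.IsValid():
        words = []
        word = line.GetFirstWord()
        while word.IsValid():
            words.append(word.GetString())
            word = word.GetNextWord()
        lines.append(" ".join(words))
        line = line.GetNextLine()
    return lines


def extract_page_lines(pdf_path, license_key, page_number=1):
    """Return the text lines of one page (hypothetical helper).

    Assumption: the SDK is importable as `apryse_sdk`; older installs may
    expose the same classes under a different module name.
    """
    # Imported lazily so this module loads even where the SDK is absent.
    from apryse_sdk import PDFNet, PDFDoc, TextExtractor

    PDFNet.Initialize(license_key)
    doc = PDFDoc(pdf_path)
    doc.InitSecurityHandler()          # needed for encrypted documents
    page = doc.GetPage(page_number)    # pages are 1-indexed

    extractor = TextExtractor()
    extractor.Begin(page)              # parse the page content
    return lines_from_extractor(extractor)
```

Keeping the traversal in its own function separates the SDK setup from the iteration logic, which makes the loop easy to reuse for other per-word processing such as collecting positions or styles.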
