Server/Desktop PDF Content Extraction Library

The Apryse SDK offers deep, programmatic access to PDF content, so you can extract exactly the data you need.

Key capabilities include:

  • Text Extraction: Pull structured Unicode text with style, position, and layout details using pdftron.PDF.TextExtractor. Advanced options include ligature expansion, hidden/duplicated text handling, and more.
  • Signature Extraction: Retrieve digital signatures, timestamps, and verification details.
  • Graphics-Level Access: Extract and analyze graphical elements, including paths, color spaces, dash patterns, and transparency settings.
  • Low-Level Character Data: Access exact positioning of text runs and individual characters for precise downstream processing.
  • Font and Glyph Access: Extract embedded fonts and glyph outlines for advanced rendering or analysis.
  • Image Extraction: Extract all embedded images, whichever PDF compression filter they use, with optional RAW output and color normalization.
  • Layer (OCG) Extraction: Programmatically access PDF layers and optional content groups.
  • Annotations and Forms: Retrieve all form fields, annotations, and widget data directly from the document.
  • Tagged PDF Support: Access marked content for tagged PDFs, enabling structure-aware extraction.
  • Embedded Objects: Extract ICC profiles, U3D streams, attachments, and embedded files.
  • Metadata Access: Read document metadata for title, author, keywords, and more.
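
As an illustration of the text extraction capability listed above, here is a minimal Python sketch that walks the lines and words a TextExtractor produces for one page. It rests on several assumptions: that the Python bindings are importable as apryse_sdk (older releases ship under a different package name), that you supply your own license key, and that the helper names words_by_line and extract_first_page are hypothetical and not part of the SDK. Check the calls against the API reference for your SDK version before relying on them.

```python
def words_by_line(extractor):
    """Return the extracted text as a list of lines, each a list of word strings.

    `extractor` is expected to behave like a TextExtractor on which Begin()
    has already been called for a page.
    """
    lines = []
    line = extractor.GetFirstLine()
    while line.IsValid():
        words = []
        word = line.GetFirstWord()
        while word.IsValid():
            words.append(word.GetString())
            word = word.GetNextWord()
        if words:  # skip lines that contain no words
            lines.append(words)
        line = line.GetNextLine()
    return lines


def extract_first_page(path, license_key):
    """Hypothetical helper: extract the words on page 1 of the PDF at `path`."""
    # Assumption: the bindings are installed and importable under this name.
    from apryse_sdk import PDFNet, PDFDoc, TextExtractor

    PDFNet.Initialize(license_key)
    doc = PDFDoc(path)
    doc.InitSecurityHandler()
    extractor = TextExtractor()
    extractor.Begin(doc.GetPage(1))
    result = words_by_line(extractor)
    doc.Close()
    return result
```

Keeping the traversal in its own function means the line and word iteration can be exercised without a license key or an input file; only extract_first_page touches the SDK itself.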
