JavaScript PDF Extraction Library

Content extraction provides the ability to access specific content from a document.

Apryse SDK benefits include:

Extract digital signatures (timestamps, etc.)
Intuitive page content extraction based on a concept of graphical elements
High-quality and efficient text recognition engine (pdftron.PDF.TextExtractor). TextExtractor can be used to extract structured Unicode text, including style and positioning information, from any PDF document. The API is simple to use and has a number of advanced options related to hidden or duplicated text, ligature expansion, etc.
Low-level text extraction (including positioning information for text runs and individual characters)
Complete access to the graphics state (for color spaces and colorants, dash properties, etc.)
Full access to fonts, including glyph outlines
Image extraction. All compression filters allowed in PDF are supported, and images can optionally be extracted in RAW format
Image color-conversion and normalization filters
Full access to marked content (e.g. used in tagged PDF documents to preserve logical structure or to mark transparency groups)
Full access to page form fields and annotations
Extraction of embedded fonts, ICC color profiles, U3D streams, embedded files, etc.
Access to a document's metadata
High-level Logical Structure API and support for 'Tagged' PDF documents
Extract and render PDF layers (also known as Optional Content Groups, or OCGs)
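
As a rough illustration of the TextExtractor item in the list above, the following sketch pulls plain Unicode text from a page. It assumes the full API is enabled, that `PDFNet` is the already-initialized full-API namespace (in WebViewer this is typically `instance.Core.PDFNet`), and that `doc` is an open `PDFNet.PDFDoc`. The helper names `extractPageText` and `extractDocumentText` are hypothetical and exist only for this example; `PDFNet` is passed in as a parameter so the helpers do not depend on how your app obtains it.

```javascript
// Minimal sketch, not a definitive implementation.
// Assumption: `PDFNet` is the initialized full-API namespace and
// `doc` is an opened PDFNet.PDFDoc.
async function extractPageText(PDFNet, doc, pageNumber) {
  // Pages are 1-indexed.
  const page = await doc.getPage(pageNumber);
  const extractor = await PDFNet.TextExtractor.create();
  // Analyze the whole page with default options.
  await extractor.begin(page);
  // Resolves to the page text as a single Unicode string.
  return extractor.getAsText();
}

// Hypothetical convenience wrapper: one string per page, in page order.
async function extractDocumentText(PDFNet, doc) {
  const pageCount = await doc.getPageCount();
  const pages = [];
  for (let n = 1; n <= pageCount; n++) {
    pages.push(await extractPageText(PDFNet, doc, n));
  }
  return pages;
}
```

In a WebViewer app, `doc` would usually come from something like `documentViewer.getDocument().getPDFDoc()`, and the calls would be wrapped in `PDFNet.runWithCleanup` so that native objects are released afterwards. Check the exact setup against the API reference for your version.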

Get started

Extract text from a PDF
Learn how to extract text from a PDF document.

Tools & Utilities

PDF2Text
A command-line tool for text extraction from PDF documents.
