Document Structure Recognition

Apryse's Document Structure Recognition engine captures the visual and logical layout of a document. Unlike tabular extraction, this mode is designed to mimic how a human reads the page: it recognizes paragraphs, lists, headers, footers, and images as distinct blocks.

It's ideal for use cases involving:

  • Accessibility tagging (e.g., reading order)
  • Screen reading tools
  • Document reconstruction
  • Visual layout parsing

How It Works

The engine detects layout elements based on visual positioning, spacing, indentation, and structural boundaries. It separates:

  • Paragraphs and lists
  • Headers and footers
  • Section columns vs table columns
  • Tables embedded inside paragraphs
  • Images and graphical elements

JSON Output Specification

Refer to the following specifications to learn more about the output JSON format:

Extract document structure as JSON file

Specify the name of the input PDF file and the name of the output JSON file, then select the Doc Structure engine:

DataExtractionModule.ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule.DataExtractionEngine.e_doc_structure);

Extract document structure as JSON string

If you plan to parse the JSON immediately, retrieve it as an in-memory string instead of writing it to an external file.

Specify the name of the input PDF file, then select the Doc Structure engine:

string json = DataExtractionModule.ExtractData("tagged.pdf", DataExtractionModule.DataExtractionEngine.e_doc_structure);
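
Once the JSON is in memory, it can be inspected without touching the file system. The following is a minimal sketch, not part of the Apryse API: the `JsonInspector` class and its `TopLevelKeys` method are hypothetical helpers introduced here for illustration, and the code assumes `System.Text.Json` is available (built into .NET Core 3.0 and later, a NuGet package on .NET Framework). It makes no assumption about the engine's output schema; it only lists the property names found at the root of whatever JSON it receives.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration; not part of the Apryse SDK.
public static class JsonInspector
{
    // Returns the property names of the root JSON object,
    // or an empty list if the root is not an object (e.g. an array).
    public static List<string> TopLevelKeys(string json)
    {
        var keys = new List<string>();

        // JsonDocument is disposable; parsing throws JsonException on malformed input.
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            if (doc.RootElement.ValueKind == JsonValueKind.Object)
            {
                foreach (JsonProperty prop in doc.RootElement.EnumerateObject())
                {
                    keys.Add(prop.Name);
                }
            }
        }

        return keys;
    }
}
```

Passing the string returned by `ExtractData` to `JsonInspector.TopLevelKeys` is a quick way to see which top-level fields are present before writing a full parser against the output specification.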

Optional Configurations

The following optional settings are available:

  • Select OCR Language
  • Password-Protected PDFs
  • Page Range
  • Deep Learning Assist
