Document Structure Recognition

Apryse's Document Structure Recognition engine captures the visual and logical layout of a document. Unlike tabular extraction, this mode is designed to mimic how a human reads the page: it recognizes paragraphs, lists, headers, footers, and images as distinct blocks.

It's ideal for use cases involving:

  • Accessibility tagging (e.g., reading order)
  • Screen reading tools
  • Document reconstruction
  • Visual layout parsing

How It Works

The engine detects layout elements based on visual positioning, spacing, indentation, and structural boundaries. It separates:

  • Paragraphs and lists
  • Headers and footers
  • Section columns vs table columns
  • Tables embedded inside paragraphs
  • Images and graphical elements

Extract document structure as JSON file

Specify the name of the input PDF file and the name of the output JSON file, then select the Doc Structure engine:

DataExtractionModule.ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule.DataExtractionEngine.e_doc_structure);

Extract document structure as JSON string

If you plan to parse the JSON immediately, retrieve it as an in-memory string instead of writing it to an external file.

Specify the name of the input PDF file, then select the Doc Structure engine:

string json = DataExtractionModule.ExtractData("tagged.pdf", DataExtractionModule.DataExtractionEngine.e_doc_structure);
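
Once the JSON is in memory, a quick way to sanity-check the result is to tally the kinds of blocks the engine reported. The following is a minimal sketch, not part of the Apryse API: the helper class DocStructureSummary is hypothetical, and it assumes that each detected element carries a string field named "type" (for example "paragraph" or "list"). This page does not document the output schema, so check the field name against your own output before relying on it. The helper walks the whole JSON tree rather than assuming a fixed nesting, and uses only System.Text.Json.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of the Apryse SDK.
public static class DocStructureSummary
{
    // Walks the entire JSON tree and counts every string value stored under
    // a key named "type". The key name is an assumption about the output schema.
    public static Dictionary<string, int> CountTypes(string json)
    {
        var counts = new Dictionary<string, int>();
        using JsonDocument doc = JsonDocument.Parse(json);
        Walk(doc.RootElement, counts);
        return counts;
    }

    private static void Walk(JsonElement node, Dictionary<string, int> counts)
    {
        switch (node.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty prop in node.EnumerateObject())
                {
                    if (prop.Name == "type" && prop.Value.ValueKind == JsonValueKind.String)
                    {
                        // Tally this element kind.
                        string key = prop.Value.GetString() ?? "";
                        counts[key] = counts.TryGetValue(key, out int n) ? n + 1 : 1;
                    }
                    else
                    {
                        // Descend into nested objects and arrays.
                        Walk(prop.Value, counts);
                    }
                }
                break;

            case JsonValueKind.Array:
                foreach (JsonElement item in node.EnumerateArray())
                {
                    Walk(item, counts);
                }
                break;
        }
    }
}
```

With the json string returned by ExtractData above, usage might look like this:

```csharp
foreach (KeyValuePair<string, int> entry in DocStructureSummary.CountTypes(json))
{
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}
```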

Optional Configurations

Select OCR Language

Password-Protected PDFs

Page Range

Deep Learning Assist
