
C# .NET Intelligent Data Extraction

Overview

Apryse's Data Extraction Suite programmatically inspects unstructured PDF documents and detects their structural elements, exposing the results in an easy-to-process form.

Content recognition adds value to documents across many use cases, for example:

  • Data mining
  • Financial analysis, forecasting, projections, estimation, modeling, quarterly reports
  • Table detection, spreadsheet calculations, chart building
  • Natural language processing, artificial intelligence, intelligent document processing
  • Translation of content into multiple languages with natural flow preservation
  • Tagging, archiving, searching, indexing, keywording, author-date citation
  • Redaction, content editing and text replacement, page renumbering, header and footer editing
  • Semantic comparison
  • Accessibility, screen reading for the visually impaired, reading order assessment
  • Forms processing, form field identification
  • Optical character recognition (OCR)

We offer three Data Extraction modes (a sketch of how each mode is selected follows this list):

  • Tabular Data Extraction: identify column and row structure, edit spreadsheets, perform calculations on cells, and analyze numeric columns. Output is presented in JSON or Excel format.
  • Document Structure Recognition: discover the full logical structure, including headers, footers, paragraphs, list items, table columns, cells, borders, images, and graphics. Locate and extract each element in an easy-to-enumerate JSON format.
  • Form Field Identification: use artificial intelligence and computer vision to detect form fields in documents that do not have any interactive field annotations embedded. The detected fields can be automatically added to the PDF.
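
As a rough illustration, the sketch below selects each of the three modes through the Data Extraction Module's .NET API. The method and enum names (DataExtractionModule.ExtractData, ExtractToXLSX, DataExtractionEngine.e_Tabular / e_DocStructure / e_Form) follow the module's published samples, and the license key, resource path, and file names are placeholders; verify the exact signatures against your SDK version.

```csharp
using pdftron;
using pdftron.PDF;

PDFNet.Initialize("your_license_key");                                // placeholder key
PDFNet.AddResourceSearchPath(@"C:\apryse\DataExtractionModule\Lib");  // module binaries

// 1) Tabular Data Extraction: table structure as JSON, or written straight to Excel.
DataExtractionModule.ExtractData("tables.pdf", "tables.json",
    DataExtractionModule.DataExtractionEngine.e_Tabular);
DataExtractionModule.ExtractToXLSX("tables.pdf", "tables.xlsx");

// 2) Document Structure Recognition: full logical structure as JSON.
DataExtractionModule.ExtractData("report.pdf", "report_structure.json",
    DataExtractionModule.DataExtractionEngine.e_DocStructure);

// 3) Form Field Identification: detected form fields as JSON.
DataExtractionModule.ExtractData("form.pdf", "form_fields.json",
    DataExtractionModule.DataExtractionEngine.e_Form);

PDFNet.Terminate();
```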

Note: If you would prefer Word, Excel, or PowerPoint output for editing, viewing, or printing, we suggest our Office conversion APIs instead of data extraction, unless your goal is to perform extensive spreadsheet calculations or data mining on the cells, in which case Tabular Data Extraction may suit you better.
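
For completeness, here is a minimal sketch of that alternative path using the SDK's Office conversion calls (pdftron.PDF.Convert.ToWord / ToExcel / ToPowerPoint, which rely on the separate Structured Output module). Treat the exact call names, the module path, and the file names as assumptions to confirm against your SDK version.

```csharp
using pdftron;

PDFNet.Initialize("your_license_key");  // placeholder key
// The Structured Output module also needs to be on the resource search path (placeholder path).
PDFNet.AddResourceSearchPath(@"C:\apryse\StructuredOutputModule\Lib");

// Convert the PDF to editable Office formats instead of extracting raw data.
pdftron.PDF.Convert.ToWord("report.pdf", "report.docx");
pdftron.PDF.Convert.ToExcel("report.pdf", "report.xlsx");
pdftron.PDF.Convert.ToPowerPoint("slides.pdf", "slides.pptx");

PDFNet.Terminate();
```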

For developers, system integrators, statisticians, and machine learning engineers, JSON is usually the most suitable format. It is significantly easier to parse and iterate over than Excel or even HTML. The JSON links back to the input PDF via page numbers and bounding box coordinates, which lets you visualize the logical structure as annotation overlays on top of the PDF (sketched below). You may choose to highlight certain entities or draw boxes around them.

The JSON also supplies a reading order for natural language processing or screen reading.
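
To make the link-back concrete, the following sketch walks an extraction JSON and draws a rectangle annotation over each reported bounding box. The JSON key names used here ("pages", "num", "elements", "rect") are placeholders rather than the module's actual schema, so map them onto the keys in your own output; the annotation calls follow the standard PDFNet .NET API.

```csharp
using System.IO;
using System.Text.Json;
using pdftron;
using pdftron.PDF;
using pdftron.SDF;

PDFNet.Initialize("your_license_key");  // placeholder key

using (PDFDoc doc = new PDFDoc("input.pdf"))
using (JsonDocument json = JsonDocument.Parse(File.ReadAllText("structure.json")))
{
    doc.InitSecurityHandler();

    // "pages", "num", "elements" and "rect" are placeholder key names --
    // substitute the keys that actually appear in your extraction output.
    foreach (JsonElement pageJson in json.RootElement.GetProperty("pages").EnumerateArray())
    {
        Page page = doc.GetPage(pageJson.GetProperty("num").GetInt32());

        foreach (JsonElement element in pageJson.GetProperty("elements").EnumerateArray())
        {
            JsonElement box = element.GetProperty("rect");  // assumed [x1, y1, x2, y2] in PDF points
            Rect rect = new Rect(box[0].GetDouble(), box[1].GetDouble(),
                                 box[2].GetDouble(), box[3].GetDouble());

            // Overlay a red square annotation on the detected element.
            pdftron.PDF.Annots.Square overlay =
                pdftron.PDF.Annots.Square.Create(doc.GetSDFDoc(), rect);
            overlay.SetColor(new ColorPt(1, 0, 0), 3);
            overlay.RefreshAppearance();
            page.AnnotPushBack(overlay);
        }
    }

    doc.Save("input_annotated.pdf", SDFDoc.SaveOptions.e_linearized);
}

PDFNet.Terminate();
```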

The data extraction functionality is implemented as an external module that can be downloaded from the Data Extraction Module page. It is currently offered for desktop and server versions of Windows and Linux.
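
A minimal setup sketch, assuming the module archive has been unpacked locally: the path below is a placeholder, and DataExtractionModule.IsModuleAvailable plus the engine enum name come from the module's published API, so confirm them against your SDK version.

```csharp
using System;
using pdftron;
using pdftron.PDF;

PDFNet.Initialize("your_license_key");  // placeholder key

// Placeholder path: point this at the folder where the Data Extraction Module was unpacked.
PDFNet.AddResourceSearchPath(@"C:\apryse\DataExtractionModule\Lib");

// Verify the module can be found before running any extraction.
if (!DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_DocStructure))
{
    Console.WriteLine("Data Extraction Module not found; check the resource search path.");
}

PDFNet.Terminate();
```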

Evaluation Limitations

In evaluation mode, you are limited to processing no more than 6 pages in a single extraction operation.

In addition, an evaluation sheet is randomly inserted into the extraction result with the following text:

PDFTron Data Extraction trial mode. The trial is limited to 6 pages and will insert extra pages into the result (like this one).

This message will show up randomly in the JSON or Excel output.

Get started

Intelligent Data Extraction workflow

In this section, we showcase a potential Data Extraction workflow.
