When using Python on Windows or Linux, you can install the package via pip with this command:
pip install --extra-index-url=https://pypi.apryse.com apryse-data-extraction
When using Node.js on Windows or Linux, you can install the package via npm with this command:
npm install @pdftron/data-extraction
For Windows, copy DataExtractionModuleWindows.zip into your PDFNetC folder, then extract it there. You should have files like:
Lib\Windows\StructuredOutput.exe
Lib\Windows\OCRModule.exe
Lib\Windows\TabularData\TabularData.dll
Lib\Windows\AIPageObjectExtractor\AIPageObjectExtractor.dll
For Linux, copy DataExtractionModuleLinux.tar.gz into your PDFNetC directory, then extract it there. You should have files like:
Lib/Linux/StructuredOutput
Lib/Linux/OCRModule
Lib/Linux/TabularData/TabularData
Lib/Linux/AIPageObjectExtractor/AIPageObjectExtractor
Please refer to the specifications below to learn more about the output JSON format.
If you installed the package via pip or npm, you may skip setting AddResourceSearchPath. Otherwise, follow the directions below.
Before the module can be used, the SDK must be told the location of the Lib directory under which the external add-ons are installed. This is done via the PDFNet AddResourceSearchPath function. If a relative path is used, it is resolved relative to the end-user executable.
C#:
    PDFNet.AddResourceSearchPath("../../../../../Lib/");
C++:
    PDFNet::AddResourceSearchPath("../../../Lib/");
Go:
    PDFNetAddResourceSearchPath("../../../PDFNetC/Lib/")
Java:
    PDFNet.addResourceSearchPath("../../../Lib/");
JavaScript:
    await PDFNet.addResourceSearchPath('../../lib/');
PHP:
    PDFNet::AddResourceSearchPath("../../../PDFNetC/Lib/");
Python:
    PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")
Ruby:
    PDFNet.AddResourceSearchPath("../../../PDFNetC/Lib/")
VB:
    PDFNet.AddResourceSearchPath("../../../../../Lib/")
Note: Do not specify the platform-specific directory (Windows, Linux, or MacOS) that contains the individual executables; specify its parent folder (Lib) instead.
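Because a relative path is resolved against the executable rather than the source file, it can be convenient to compute an absolute path once at start-up. The following is a minimal Python sketch under that assumption; `resolve_lib_dir` is a hypothetical helper, not part of the SDK, and it also steps up to the parent folder if it is handed a platform folder by mistake:

```python
from pathlib import Path

# Platform sub-folders named in the note above; the search path must be their parent.
PLATFORM_DIRS = {"windows", "linux", "macos"}


def resolve_lib_dir(base_dir, relative_path):
    """Return an absolute Lib path built from base_dir and relative_path.

    Hypothetical helper (not an SDK function). If the result points at a
    platform folder such as Lib/Linux, its parent folder is returned instead.
    """
    path = (Path(base_dir) / relative_path).resolve()
    if path.name.lower() in PLATFORM_DIRS:
        path = path.parent
    return path
```

The result can then be passed on as a string, for example `PDFNet.AddResourceSearchPath(str(resolve_lib_dir(Path(__file__).parent, "../../../PDFNetC/Lib/")))`.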
For error handling purposes, it is generally advisable to test whether the module is available via the IsModuleAvailable function. Since the Data Extraction suite consists of multiple modules, an extra parameter specifies which component to test.
C#:
    if (!DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_tabular))
    {
        // Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    }
    if (!DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_doc_structure))
    {
        // Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    }
    if (!DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_form))
    {
        // Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    }
C++:
    if (!DataExtractionModule::IsModuleAvailable(DataExtractionModule::e_Tabular))
    {
        // Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    }
    if (!DataExtractionModule::IsModuleAvailable(DataExtractionModule::e_DocStructure))
    {
        // Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    }
    if (!DataExtractionModule::IsModuleAvailable(DataExtractionModule::e_Form))
    {
        // Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    }
Go:
    if !DataExtractionModuleIsModuleAvailable(DataExtractionModuleE_Tabular) {
        // Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    }
    if !DataExtractionModuleIsModuleAvailable(DataExtractionModuleE_DocStructure) {
        // Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    }
    if !DataExtractionModuleIsModuleAvailable(DataExtractionModuleE_Form) {
        // Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    }
Java:
    if (!DataExtractionModule.isModuleAvailable(DataExtractionModule.DataExtractionEngine.e_tabular))
    {
        // Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    }
    if (!DataExtractionModule.isModuleAvailable(DataExtractionModule.DataExtractionEngine.e_doc_structure))
    {
        // Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    }
    if (!DataExtractionModule.isModuleAvailable(DataExtractionModule.DataExtractionEngine.e_form))
    {
        // Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    }
JavaScript:
    if (!await PDFNet.DataExtractionModule.isModuleAvailable(PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular)) {
        // Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    }
    if (!await PDFNet.DataExtractionModule.isModuleAvailable(PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure)) {
        // Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    }
    if (!await PDFNet.DataExtractionModule.isModuleAvailable(PDFNet.DataExtractionModule.DataExtractionEngine.e_Form)) {
        // Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    }
PHP:
    if (!DataExtractionModule::IsModuleAvailable(DataExtractionModule::e_Tabular)) {
        // Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    }
    if (!DataExtractionModule::IsModuleAvailable(DataExtractionModule::e_DocStructure)) {
        // Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    }
    if (!DataExtractionModule::IsModuleAvailable(DataExtractionModule::e_Form)) {
        // Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    }
Python:
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_Tabular):
        pass  # Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_DocStructure):
        pass  # Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    if not DataExtractionModule.IsModuleAvailable(DataExtractionModule.e_Form):
        pass  # Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
Ruby:
    if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_Tabular) then
        # Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    end
    if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_DocStructure) then
        # Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    end
    if !DataExtractionModule.IsModuleAvailable(DataExtractionModule::E_Form) then
        # Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    end
VB:
    If Not DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_tabular) Then
        ' Unable to run Data Extraction: PDFTron SDK Tabular Data module not available.
    End If
    If Not DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_doc_structure) Then
        ' Unable to run Data Extraction: PDFTron SDK Structured Output module not available.
    End If
    If Not DataExtractionModule.IsModuleAvailable(DataExtractionModule.DataExtractionEngine.e_form) Then
        ' Unable to run Data Extraction: PDFTron SDK AIFormFieldExtractor module not available.
    End If
If you have the module installed but the function still returns false, double-check that the correct path was passed to AddResourceSearchPath earlier.
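When the availability check keeps failing, a quick look at the file system can narrow things down. The sketch below is a hypothetical diagnostic in Python, not an SDK feature: it reports which of the files named in the installation listings are absent under a given Lib folder. Those listings are introduced with "files like", so the sets here are an assumption and may not be exhaustive.

```python
from pathlib import Path

# File names taken from the installation listings earlier in this guide
# (assumed representative, not a complete manifest).
EXPECTED_FILES = {
    "Windows": [
        "StructuredOutput.exe",
        "OCRModule.exe",
        "TabularData/TabularData.dll",
        "AIPageObjectExtractor/AIPageObjectExtractor.dll",
    ],
    "Linux": [
        "StructuredOutput",
        "OCRModule",
        "TabularData/TabularData",
        "AIPageObjectExtractor/AIPageObjectExtractor",
    ],
}


def missing_module_files(lib_dir, platform_name):
    """List expected module files that are absent under lib_dir/platform_name."""
    platform_dir = Path(lib_dir) / platform_name
    return [
        rel for rel in EXPECTED_FILES[platform_name]
        if not (platform_dir / rel).is_file()
    ]
```

An empty list suggests the files are in place and the search path itself is the more likely culprit; a non-empty list points at an incomplete extraction.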
In this mode of operation, text on each page will become a part of a single table, like in a spreadsheet document. Headers, footers, paragraphs, lists, page numbers all come out of a combined tabular structure.
Two output formats are available: JSON or an Excel (XLSX) document. JSON is better suited to programmatic content discovery, natural language processing, statistical analysis, and artificial intelligence. Excel is ideal for standard spreadsheet operations, calculations, formulas, and charting.
Specify the name of the input PDF file and the name of the output JSON file, then select the Tabular engine:
C#:
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular);
C++:
    DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular);
Go:
    DataExtractionModuleExtractData("table.pdf", "table.json", DataExtractionModuleE_Tabular)
Java:
    DataExtractionModule.extractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular);
JavaScript:
    await PDFNet.DataExtractionModule.extractData('table.pdf', 'table.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular);
PHP:
    DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular);
Python:
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.e_Tabular)
Ruby:
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule::E_Tabular)
VB:
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular)
If you are going to parse the JSON right away, you may as well retrieve it as an in-memory string, instead of an external file.
Specify the name of the input PDF file, then select the Tabular engine:
C#:
    string json = DataExtractionModule.ExtractData("financial.pdf", DataExtractionModule.DataExtractionEngine.e_tabular);
C++:
    UString json = DataExtractionModule::ExtractData("financial.pdf", DataExtractionModule::e_Tabular);
Go:
    json := DataExtractionModuleExtractData("financial.pdf", DataExtractionModuleE_Tabular).(string)
Java:
    String json = DataExtractionModule.extractData("financial.pdf", DataExtractionModule.DataExtractionEngine.e_tabular);
JavaScript:
    const json = await PDFNet.DataExtractionModule.extractDataAsString('financial.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular);
PHP:
    $json = DataExtractionModule::ExtractData("financial.pdf", DataExtractionModule::e_Tabular);
Python:
    json = DataExtractionModule.ExtractData("financial.pdf", DataExtractionModule.e_Tabular)
Ruby:
    json = DataExtractionModule.ExtractData("financial.pdf", DataExtractionModule::E_Tabular)
VB:
    Dim json As String = DataExtractionModule.ExtractData("financial.pdf", DataExtractionModule.DataExtractionEngine.e_tabular)
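Once the JSON is held in memory, it can go straight to a parser. The following Python sketch shows one way to do that defensively with the standard `json` module; the helper names are hypothetical, and no assumption is made about the output schema beyond it being valid JSON:

```python
import json


def load_extraction_json(text):
    """Parse an extraction result string; return None if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def top_level_summary(data):
    """Map each top-level key to the Python type name of its value."""
    if not isinstance(data, dict):
        return {}
    return {key: type(value).__name__ for key, value in data.items()}
```

Printing `top_level_summary(load_extraction_json(text))` for a real result is a quick way to get oriented before consulting the output specification.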
Specify the name of the input PDF file and the name of the output XLSX file:
C#:
    DataExtractionModule.ExtractToXLSX("table.pdf", "table.xlsx");
C++:
    DataExtractionModule::ExtractToXLSX("table.pdf", "table.xlsx");
Go:
    DataExtractionModuleExtractToXLSX("table.pdf", "table.xlsx")
Java:
    DataExtractionModule.extractToXLSX("table.pdf", "table.xlsx");
JavaScript:
    await PDFNet.DataExtractionModule.extractToXLSX('table.pdf', 'table.xlsx');
PHP:
    DataExtractionModule::ExtractToXLSX("table.pdf", "table.xlsx");
Python:
    DataExtractionModule.ExtractToXLSX("table.pdf", "table.xlsx")
Ruby:
    DataExtractionModule.ExtractToXLSX("table.pdf", "table.xlsx")
VB:
    DataExtractionModule.ExtractToXLSX("table.pdf", "table.xlsx")
Specify the name of the input PDF file and an output filter, such as MemoryFilter:
C#:
    MemoryFilter output_xlsx_stream = new MemoryFilter(0, false);
    DataExtractionModule.ExtractToXLSX("financial.pdf", output_xlsx_stream);
C++:
    MemoryFilter output_xlsx_stream(0, false);
    DataExtractionModule::ExtractToXLSX("financial.pdf", output_xlsx_stream);
Go:
    outputXlsxStream := NewMemoryFilter(0, false)
    DataExtractionModuleExtractToXLSX("financial.pdf", outputXlsxStream)
Java:
    MemoryFilter output_xlsx_stream = new MemoryFilter(0, false);
    DataExtractionModule.extractToXLSX("financial.pdf", output_xlsx_stream);
JavaScript:
    const outputXlsxStream = PDFNet.Filters.MemoryFilter(0, false);
    await PDFNet.DataExtractionModule.extractToXLSX('financial.pdf', outputXlsxStream);
PHP:
    $outputXlsxStream = new MemoryFilter(0, false);
    DataExtractionModule::ExtractToXLSX("financial.pdf", $outputXlsxStream);
Python:
    outputXlsxStream = Filters.MemoryFilter(0, False)
    DataExtractionModule.ExtractToXLSX("financial.pdf", outputXlsxStream)
Ruby:
    outputXlsxStream = Filters.MemoryFilter.new(0, false)
    DataExtractionModule.ExtractToXLSX("financial.pdf", outputXlsxStream)
VB:
    Dim output_xlsx_stream As MemoryFilter = New MemoryFilter(0, False)
    DataExtractionModule.ExtractToXLSX("financial.pdf", output_xlsx_stream)
In this mode of operation, the full logical structure is discovered, including paragraphs, lists, tables, headers, footers, images, graphics, like in a typical word processor. Section columns are differentiated from table columns, and paragraph text is separated from tables, although table cells may contain further paragraphs.
While Tabular Data Extraction is more focused on cells, financial tables, spreadsheets, formulas, currency signs, percentages and calculations on cells, Document Structure Recognition aims at the visual accuracy of documents, such as indentation, gaps, line spacing, alignment, borders and colors.
Specify the name of the input PDF file and the name of the output JSON file, then select the Doc Structure engine:
C#:
    DataExtractionModule.ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule.DataExtractionEngine.e_doc_structure);
C++:
    DataExtractionModule::ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule::e_DocStructure);
Go:
    DataExtractionModuleExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModuleE_DocStructure)
Java:
    DataExtractionModule.extractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule.DataExtractionEngine.e_doc_structure);
JavaScript:
    await PDFNet.DataExtractionModule.extractData('paragraphs_and_tables.pdf', 'paragraphs_and_tables.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);
PHP:
    DataExtractionModule::ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule::e_DocStructure);
Python:
    DataExtractionModule.ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule.e_DocStructure)
Ruby:
    DataExtractionModule.ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule::E_DocStructure)
VB:
    DataExtractionModule.ExtractData("paragraphs_and_tables.pdf", "paragraphs_and_tables.json", DataExtractionModule.DataExtractionEngine.e_doc_structure)
If you are going to parse the JSON right away, you may as well retrieve it as an in-memory string, instead of an external file.
Specify the name of the input PDF file, then select the Doc Structure engine:
C#:
    string json = DataExtractionModule.ExtractData("tagged.pdf", DataExtractionModule.DataExtractionEngine.e_doc_structure);
C++:
    UString json = DataExtractionModule::ExtractData("tagged.pdf", DataExtractionModule::e_DocStructure);
Go:
    json := DataExtractionModuleExtractData("tagged.pdf", DataExtractionModuleE_DocStructure).(string)
Java:
    String json = DataExtractionModule.extractData("tagged.pdf", DataExtractionModule.DataExtractionEngine.e_doc_structure);
JavaScript:
    const json = await PDFNet.DataExtractionModule.extractDataAsString('tagged.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);
PHP:
    $json = DataExtractionModule::ExtractData("tagged.pdf", DataExtractionModule::e_DocStructure);
Python:
    json = DataExtractionModule.ExtractData("tagged.pdf", DataExtractionModule.e_DocStructure)
Ruby:
    json = DataExtractionModule.ExtractData("tagged.pdf", DataExtractionModule::E_DocStructure)
VB:
    Dim json As String = DataExtractionModule.ExtractData("tagged.pdf", DataExtractionModule.DataExtractionEngine.e_doc_structure)
In this mode of operation, the input is assumed to be a form. We currently offer 2 Form Field Identification Engines: "Form Field Detection" and "Form Field Key-Value Extraction".
Both engines require GLIBC 2.31 or newer on Linux, which means Debian 11 or Ubuntu 20.04 or newer.
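The GLIBC requirement can be checked up front rather than discovered through a load failure. The following is a minimal Python sketch using the standard library's `platform.libc_ver()`; the helper names are hypothetical and the check is not something the SDK performs for you:

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_meets_minimum(version_text, minimum="2.31"):
    """Return True when version_text is at least the required glibc version."""
    return parse_version(version_text) >= parse_version(minimum)


def current_glibc_ok(minimum="2.31"):
    """Check the running interpreter's libc; None when glibc is not detected."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None
    return glibc_meets_minimum(version, minimum)
```

Versions are compared as integer tuples rather than strings or floats, so that 2.4 is correctly treated as older than 2.31.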
Our Form Field Detection engine surveys the page layout and finds the most probable arrangement of the individual form fields, along with the type of the identified fields, such as text fields or check boxes.
The output is presented in a straightforward JSON format, where each field is made up of a type, confidence value and bounding box coordinates.
IMPORTANT: This engine is in beta. We expect the quality of the output to increase dramatically in subsequent releases, and there may be minor changes to the output schema.
Our Form Field Key-Value Extraction engine provides all the same information as the Form Field Detection engine, with some additional data provided for fields. For each form field in the document, an attempt is made to find the corresponding key (the label or question associated with a field) and value (the data within the field).
The output is presented in a JSON format, similar to the output of the Form Field Detection engine, with extra data fields for field key, label, and value text.
Specify the name of the input PDF file and the name of the output JSON file, then select the Form engine:
C#:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form);
C++:
    DataExtractionModule::ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule::e_Form);
Go:
    DataExtractionModuleExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModuleE_Form)
Java:
    DataExtractionModule.extractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form);
JavaScript:
    await PDFNet.DataExtractionModule.extractData('formfields-scanned.pdf', 'formfields-scanned.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_Form);
PHP:
    DataExtractionModule::ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule::e_Form);
Python:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.e_Form)
Ruby:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule::E_Form)
VB:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form)
Alternatively, you can select the Form Key-Value Extraction engine:
C#:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form_key_value);
C++:
    DataExtractionModule::ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule::e_FormKeyValue);
Go:
    DataExtractionModuleExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModuleE_FormKeyValue)
Java:
    DataExtractionModule.extractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form_key_value);
JavaScript:
    await PDFNet.DataExtractionModule.extractData('formfields-scanned.pdf', 'formfields-scanned.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_FormKeyValue);
PHP:
    DataExtractionModule::ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule::e_FormKeyValue);
Python:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.e_FormKeyValue)
Ruby:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule::E_FormKeyValue)
VB:
    DataExtractionModule.ExtractData("formfields-scanned.pdf", "formfields-scanned.json", DataExtractionModule.DataExtractionEngine.e_form_key_value)
If you are going to parse the JSON right away, you may as well retrieve it as an in-memory string, instead of an external file.
Specify the name of the input PDF file, then select the Form engine:
C#:
    string json = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form);
C++:
    UString json = DataExtractionModule::ExtractData("formfields.pdf", DataExtractionModule::e_Form);
Go:
    json := DataExtractionModuleExtractData("formfields.pdf", DataExtractionModuleE_Form).(string)
Java:
    String json = DataExtractionModule.extractData("formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form);
JavaScript:
    const json = await PDFNet.DataExtractionModule.extractDataAsString('formfields.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_Form);
PHP:
    $json = DataExtractionModule::ExtractData("formfields.pdf", DataExtractionModule::e_Form);
Python:
    json = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule.e_Form)
Ruby:
    json = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule::E_Form)
VB:
    Dim json As String = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form)
Alternatively, you can select the Form Key-Value Extraction engine:
C#:
    string json = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form_key_value);
C++:
    UString json = DataExtractionModule::ExtractData("formfields.pdf", DataExtractionModule::e_FormKeyValue);
Go:
    json := DataExtractionModuleExtractData("formfields.pdf", DataExtractionModuleE_FormKeyValue).(string)
Java:
    String json = DataExtractionModule.extractData("formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form_key_value);
JavaScript:
    const json = await PDFNet.DataExtractionModule.extractDataAsString('formfields.pdf', PDFNet.DataExtractionModule.DataExtractionEngine.e_FormKeyValue);
PHP:
    $json = DataExtractionModule::ExtractData("formfields.pdf", DataExtractionModule::e_FormKeyValue);
Python:
    json = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule.e_FormKeyValue)
Ruby:
    json = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule::E_FormKeyValue)
VB:
    Dim json As String = DataExtractionModule.ExtractData("formfields.pdf", DataExtractionModule.DataExtractionEngine.e_form_key_value)
You can automatically add detected forms to a PDF in a single step.
C#:
    PDFDoc doc = new PDFDoc("formfields.pdf");
    DataExtractionModule.DetectAndAddFormFieldsToPDF(doc);
C++:
    PDFDoc doc("formfields.pdf");
    DataExtractionModule::DetectAndAddFormFieldsToPDF(doc);
Go:
    doc := NewPDFDoc("formfields.pdf")
    DataExtractionModuleDetectAndAddFormFieldsToPDF(doc)
Java:
    PDFDoc doc = new PDFDoc("formfields.pdf");
    DataExtractionModule.detectAndAddFormFieldsToPDF(doc);
JavaScript:
    const doc = await PDFNet.PDFDoc.createFromFilePath("formfields.pdf");
    await PDFNet.DataExtractionModule.detectAndAddFormFieldsToPDF(doc);
PHP:
    $doc = new PDFDoc("formfields.pdf");
    DataExtractionModule::DetectAndAddFormFieldsToPDF($doc);
Python:
    doc = PDFDoc("formfields.pdf")
    DataExtractionModule.DetectAndAddFormFieldsToPDF(doc)
Ruby:
    doc = PDFDoc.new("formfields.pdf")
    DataExtractionModule.DetectAndAddFormFieldsToPDF(doc)
VB:
    Dim doc As PDFDoc = New PDFDoc("formfields.pdf")
    DataExtractionModule.DetectAndAddFormFieldsToPDF(doc)
By default, this function uses the Form Field Detection engine. To use the Form Field Key-Value Extraction engine, see below.
Although the default options will satisfy most common use cases, we offer several options to customize the extraction behavior and unlock lesser-used functionality.
The options object is passed as the last parameter to any extraction function, as shown below.
Use the Language option to set the preferred OCR language(s). If you work with scanned documents in languages other than English, specify one or more 3-letter ISO 639-2 language codes, separated by spaces. For example, "eng deu spa fra" for English, German, Spanish, and French. You may also use a comma or a plus sign as a separator.
Supported languages:
eng: English
deu or ger: German
fra or fre: French
ita: Italian
rus: Russian
spa: Spanish
Note: Listing too many languages at once may hurt performance and accuracy. If you know the exact language, it is always best to use that single setting.
C#:
    DataExtractionOptions options = new DataExtractionOptions();
    options.SetLanguage("fra spa"); // French and Spanish
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options);
C++:
    DataExtractionOptions options;
    options.SetLanguage("fra spa"); // French and Spanish
    DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular, &options);
Go:
    options := NewDataExtractionOptions()
    options.SetLanguage("fra spa") // French and Spanish
    DataExtractionModuleExtractData("table.pdf", "table.json", DataExtractionModuleE_Tabular, options)
Java:
    DataExtractionOptions options = new DataExtractionOptions();
    options.setLanguage("fra spa"); // French and Spanish
    DataExtractionModule.extractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options);
JavaScript:
    const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
    options.setLanguage("fra spa"); // French and Spanish
    await PDFNet.DataExtractionModule.extractData('table.pdf', 'table.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular, options);
PHP:
    $options = new DataExtractionOptions();
    $options->setLanguage("fra spa"); // French and Spanish
    DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular, $options);
Python:
    options = DataExtractionOptions()
    options.SetLanguage("fra spa")  # French and Spanish
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.e_Tabular, options)
Ruby:
    options = DataExtractionOptions.new()
    options.SetLanguage("fra spa") # French and Spanish
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule::E_Tabular, options)
VB:
    Dim options As DataExtractionOptions = New DataExtractionOptions()
    options.SetLanguage("fra spa") ' French and Spanish
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options)
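If the language list comes from user input or configuration, it may be worth validating it before handing it to the Language option. The Python sketch below is a hypothetical pre-check, not SDK behavior: it accepts the three separators described above, restricts codes to the supported list, and (as assumptions of this sketch) lower-cases the input and folds the alternate codes `ger` and `fre` into `deu` and `fra`.

```python
import re

# Codes from the supported-languages list; aliases map to the first-listed code.
SUPPORTED = {"eng", "deu", "fra", "ita", "rus", "spa"}
ALIASES = {"ger": "deu", "fre": "fra"}


def normalize_languages(spec):
    """Split on spaces, commas or plus signs; return a space-separated string."""
    codes = []
    for token in re.split(r"[\s,+]+", spec.strip().lower()):
        if not token:
            continue
        code = ALIASES.get(token, token)
        if code not in SUPPORTED:
            raise ValueError(f"unsupported OCR language code: {token!r}")
        if code not in codes:  # drop duplicates, keep first-seen order
            codes.append(code)
    return " ".join(codes)
```

The result can be passed on directly, for example `options.SetLanguage(normalize_languages("fra+spa"))`.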
Use the PDFPassword option to specify a PDF password if one is required.
Encrypted PDF files that are protected by a password may only be opened when the password is specified in addition to the filename. No password is necessary for files that can be viewed without any authentication.
C#:
    DataExtractionOptions options = new DataExtractionOptions();
    options.SetPDFPassword("password123"); // password for input PDF
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options);
C++:
    DataExtractionOptions options;
    options.SetPDFPassword("password123"); // password for input PDF
    DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular, &options);
Go:
    options := NewDataExtractionOptions()
    options.SetPDFPassword("password123") // password for input PDF
    DataExtractionModuleExtractData("table.pdf", "table.json", DataExtractionModuleE_Tabular, options)
Java:
    DataExtractionOptions options = new DataExtractionOptions();
    options.setPDFPassword("password123"); // password for input PDF
    DataExtractionModule.extractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options);
JavaScript:
    const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
    options.setPDFPassword("password123"); // password for input PDF
    await PDFNet.DataExtractionModule.extractData('table.pdf', 'table.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular, options);
PHP:
    $options = new DataExtractionOptions();
    $options->setPDFPassword("password123"); // password for input PDF
    DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular, $options);
Python:
    options = DataExtractionOptions()
    options.SetPDFPassword("password123")  # password for input PDF
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.e_Tabular, options)
Ruby:
    options = DataExtractionOptions.new()
    options.SetPDFPassword("password123") # password for input PDF
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule::E_Tabular, options)
VB:
    Dim options As DataExtractionOptions = New DataExtractionOptions()
    options.SetPDFPassword("password123") ' password for input PDF
    DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options)
Use the Pages option to restrict the extraction to a selected range of pages. This can be a single page number (such as "1" for the first page), or a range separated by a dash (such as "1-5", or "7-" for page 7 and beyond). An empty string means all pages are extracted.
C#:
DataExtractionOptions options = new DataExtractionOptions();
options.SetPages("1"); // extract page 1
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options);

C++:
DataExtractionOptions options;
options.SetPages("1"); // extract page 1
DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular, &options);

Go:
options := NewDataExtractionOptions()
options.SetPages("1") // extract page 1
DataExtractionModuleExtractData("table.pdf", "table.json", DataExtractionModuleE_Tabular, options)

Java:
DataExtractionOptions options = new DataExtractionOptions();
options.setPages("1"); // extract page 1
DataExtractionModule.extractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options);

JavaScript:
const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
options.setPages("1"); // extract page 1
await PDFNet.DataExtractionModule.extractData('table.pdf', 'table.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_Tabular, options);

PHP:
$options = new DataExtractionOptions();
$options->setPages("1"); // extract page 1
DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_Tabular, $options);

Python:
options = DataExtractionOptions()
options.SetPages("1") # extract page 1
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.e_Tabular, options)

Ruby:
options = DataExtractionOptions.new()
options.SetPages("1") # extract page 1
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule::E_Tabular, options)

VB:
Dim options As DataExtractionOptions = New DataExtractionOptions()
options.SetPages("1") ' extract page 1
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_tabular, options)
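The Pages syntax can be made concrete with a small helper. The following is a minimal Python sketch, not the SDK's own parser: `parse_pages` is a hypothetical function that expands a Pages-style string into 1-based page numbers. It assumes that pages past the end of the document are silently dropped, which may differ from how the SDK handles that case.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a Pages-style string into a list of 1-based page numbers.

    Hypothetical helper for illustration only; it is not part of the SDK.
    """
    spec = spec.strip()
    if spec == "":
        # An empty string selects every page.
        return list(range(1, page_count + 1))

    if "-" in spec:
        first_text, last_text = spec.split("-", 1)
        first = int(first_text)
        # An open-ended range such as "7-" has no explicit last page.
        last = int(last_text) if last_text.strip() else None
    else:
        # A single page number such as "1".
        first = last = int(spec)

    if first < 1 or (last is not None and last < first):
        raise ValueError(f"invalid page range: {spec!r}")

    # Assumption: pages beyond the end of the document are ignored.
    end = page_count if last is None else min(last, page_count)
    return list(range(first, end + 1))
```

For a 10-page document, `parse_pages("7-", 10)` yields pages 7 through 10, and `parse_pages("", 10)` yields all ten pages. A helper like this can be useful for validating user input before passing the string to SetPages.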
The DeepLearningAssist option specifies whether deep learning is used for table recognition. Enabling it improves table recognition accuracy at the cost of increased processing time. This option only affects the DocStructure engine.
C#:
DataExtractionOptions options = new DataExtractionOptions();
options.SetDeepLearningAssist(true); // enable deep learning assist
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_DocStructure, options);

C++:
DataExtractionOptions options;
options.SetDeepLearningAssist(true); // enable deep learning assist
DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_DocStructure, &options);

Go:
options := NewDataExtractionOptions()
options.SetDeepLearningAssist(true) // enable deep learning assist
DataExtractionModuleExtractData("table.pdf", "table.json", DataExtractionModuleE_DocStructure, options)

Java:
DataExtractionOptions options = new DataExtractionOptions();
options.setDeepLearningAssist(true); // enable deep learning assist
DataExtractionModule.extractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_DocStructure, options);

JavaScript:
const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
options.setDeepLearningAssist(true); // enable deep learning assist
await PDFNet.DataExtractionModule.extractData('table.pdf', 'table.json', PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure, options);

PHP:
$options = new DataExtractionOptions();
$options->setDeepLearningAssist(true); // enable deep learning assist
DataExtractionModule::ExtractData("table.pdf", "table.json", DataExtractionModule::e_DocStructure, $options);

Python:
options = DataExtractionOptions()
options.SetDeepLearningAssist(True) # enable deep learning assist
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.e_DocStructure, options)

Ruby:
options = DataExtractionOptions.new()
options.SetDeepLearningAssist(true) # enable deep learning assist
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule::E_DocStructure, options)

VB:
Dim options As DataExtractionOptions = New DataExtractionOptions()
options.SetDeepLearningAssist(True) ' enable deep learning assist
DataExtractionModule.ExtractData("table.pdf", "table.json", DataExtractionModule.DataExtractionEngine.e_DocStructure, options)
When automatically detecting form fields and adding them to a document, you can preserve the form annotations already present in the document, so that only newly detected fields are added, by setting the overlapping form field behavior to "KeepOld".
C#:
PDFDoc doc = new PDFDoc("formfields.pdf");
DataExtractionOptions options = new DataExtractionOptions();
options.SetOverlappingFormFieldBehavior("KeepOld");
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options);

C++:
PDFDoc doc("formfields.pdf");
DataExtractionOptions options;
options.SetOverlappingFormFieldBehavior("KeepOld");
DataExtractionModule::DetectAndAddFormFieldsToPDF(doc, &options);

Go:
doc := NewPDFDoc("formfields.pdf")
options := NewDataExtractionOptions()
options.SetOverlappingFormFieldBehavior("KeepOld")
DataExtractionModuleDetectAndAddFormFieldsToPDF(doc, options)

Java:
PDFDoc doc = new PDFDoc("formfields.pdf");
DataExtractionOptions options = new DataExtractionOptions();
options.setOverlappingFormFieldBehavior("KeepOld");
DataExtractionModule.detectAndAddFormFieldsToPDF(doc, options);

JavaScript:
const doc = await PDFNet.PDFDoc.createFromFilePath("formfields.pdf");
const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
options.setOverlappingFormFieldBehavior('KeepOld');
await PDFNet.DataExtractionModule.detectAndAddFormFieldsToPDF(doc, options);

PHP:
$doc = new PDFDoc("formfields.pdf");
$options = new DataExtractionOptions();
$options->SetOverlappingFormFieldBehavior("KeepOld");
DataExtractionModule::DetectAndAddFormFieldsToPDF($doc, $options);

Python:
doc = PDFDoc("formfields.pdf")
options = DataExtractionOptions()
options.SetOverlappingFormFieldBehavior("KeepOld")
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)

Ruby:
doc = PDFDoc.new("formfields.pdf")
options = DataExtractionOptions.new()
options.SetOverlappingFormFieldBehavior("KeepOld")
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)

VB:
Dim doc As PDFDoc = New PDFDoc("formfields.pdf")
Dim options = New DataExtractionOptions()
options.SetOverlappingFormFieldBehavior("KeepOld")
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)
By default, DetectAndAddFormFieldsToPDF uses the Form Field Detection engine. You can force the function to use the Form Field Key-Value Extraction engine with the "Form Extraction Engine" option.
C#:
PDFDoc doc = new PDFDoc("formfields.pdf");
DataExtractionOptions options = new DataExtractionOptions();
options.SetFormExtractionEngine("FormKeyValue");
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options);

C++:
PDFDoc doc("formfields.pdf");
DataExtractionOptions options;
options.SetFormExtractionEngine("FormKeyValue");
DataExtractionModule::DetectAndAddFormFieldsToPDF(doc, &options);

Go:
doc := NewPDFDoc("formfields.pdf")
options := NewDataExtractionOptions()
options.SetFormExtractionEngine("FormKeyValue")
DataExtractionModuleDetectAndAddFormFieldsToPDF(doc, options)

Java:
PDFDoc doc = new PDFDoc("formfields.pdf");
DataExtractionOptions options = new DataExtractionOptions();
options.setFormExtractionEngine("FormKeyValue");
DataExtractionModule.detectAndAddFormFieldsToPDF(doc, options);

JavaScript:
const doc = await PDFNet.PDFDoc.createFromFilePath("formfields.pdf");
const options = new PDFNet.DataExtractionModule.DataExtractionOptions();
options.setFormExtractionEngine('FormKeyValue');
await PDFNet.DataExtractionModule.detectAndAddFormFieldsToPDF(doc, options);

PHP:
$doc = new PDFDoc("formfields.pdf");
$options = new DataExtractionOptions();
$options->SetFormExtractionEngine("FormKeyValue");
DataExtractionModule::DetectAndAddFormFieldsToPDF($doc, $options);

Python:
doc = PDFDoc("formfields.pdf")
options = DataExtractionOptions()
options.SetFormExtractionEngine("FormKeyValue")
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)

Ruby:
doc = PDFDoc.new("formfields.pdf")
options = DataExtractionOptions.new()
options.SetFormExtractionEngine("FormKeyValue")
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)

VB:
Dim doc As PDFDoc = New PDFDoc("formfields.pdf")
Dim options = New DataExtractionOptions()
options.SetFormExtractionEngine("FormKeyValue")
DataExtractionModule.DetectAndAddFormFieldsToPDF(doc, options)
Note: this option only affects the `DetectAndAddFormFieldsToPDF` function. Passing it to `ExtractData` has no effect, because the `engine` parameter takes precedence.
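The precedence rule can be summarized as a small decision function. This is only an illustrative Python model of the behavior described above, not SDK code: `effective_engine` and the label `"FormFieldDetection"` are hypothetical names chosen for this sketch, and only `"FormKeyValue"` is taken from the option value shown earlier.

```python
from typing import Optional

# Hypothetical label for the default engine; not an SDK identifier.
FORM_FIELD_DETECTION = "FormFieldDetection"


def effective_engine(function: str,
                     engine_param: Optional[str] = None,
                     form_extraction_engine: Optional[str] = None) -> str:
    """Return which engine would run, per the rules described in the text."""
    if function == "ExtractData":
        # The explicit engine parameter wins; the option is ignored.
        if engine_param is None:
            raise ValueError("ExtractData requires an engine parameter")
        return engine_param
    if function == "DetectAndAddFormFieldsToPDF":
        # The option overrides the default Form Field Detection engine.
        return form_extraction_engine or FORM_FIELD_DETECTION
    raise ValueError(f"unknown function: {function!r}")
```

For example, `effective_engine("ExtractData", "e_Tabular", "FormKeyValue")` returns `"e_Tabular"`, which reflects that the option is ignored there.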