Intelligent Data Extraction - Packaging

Reducing Package Size

The Intelligent Data Extraction module uses artificial intelligence in several of its engines and, as a result, can consume a substantial amount of disk space. This can be a problem in environments with limited storage, such as certain cloud computing environments. This section describes how to reduce the package size depending on your Intelligent Data Extraction use case.

The Intelligent Data Extraction module is composed of 4 engines, each of which has its own file requirements. If you need only a subset of these engines, you can remove any files from the package that none of your required engines depend on.

The file dependencies of each engine are platform specific. The paths listed below are for Linux.

The table below maps each engine to its file dependencies.

Engine Name

Dependencies

Tabular Data Extraction

  • Lib/Linux/TabluarData/*
  • Lib/Linux/OCRModule

Document Structure Recognition

  • Lib/Linux/StructuredOutput
  • Lib/Linux/fonts2.pdf
  • Lib/Linux/tessdata/*

The following files are only required if using Deep Learning Assist:

  • Lib/Linux/AIPageObjectExtractor/AIPageObjectExtractor
  • Lib/Linux/AIPageObjectExtractor/table.onnx
  • Lib/Linux/AIPageObjectExtractor/table_tabular.onnx
  • Lib/Linux/AIPageObjectExtractor/Licenses

Form Field Detection

  • Lib/Linux/AIPageObjectExtractor/AIPageObjectExtractor
  • Lib/Linux/AIPageObjectExtractor/form.onnx
  • Lib/Linux/AIPageObjectExtractor/Licenses

Form Field Key-Value Extraction

  • Lib/Linux/AIPageObjectExtractor/AIPageObjectExtractor
  • Lib/Linux/AIPageObjectExtractor/form.onnx
  • Lib/Linux/AIPageObjectExtractor/kv.onnx
  • Lib/Linux/AIPageObjectExtractor/v.cab
  • Lib/Linux/AIPageObjectExtractor/Licenses

Generic Key-Value Extraction

  • Lib/Linux/AIPageObjectExtractor/AIPageObjectExtractor
  • Lib/Linux/AIPageObjectExtractor/kv.onnx
  • Lib/Linux/AIPageObjectExtractor/v.cab
  • Lib/Linux/AIPageObjectExtractor/Licenses
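The table above can be expressed as a lookup from engine name to dependency paths, so that the set of files to keep is the union of the entries for the engines you use. The following is a minimal sketch in Python, not part of the product: the names `ENGINE_DEPENDENCIES`, `DEEP_LEARNING_ASSIST_FILES`, and `required_paths` are hypothetical, and the paths are copied verbatim from the Linux table.

```python
from typing import Dict, Iterable, Set

# Paths are copied verbatim from the Linux table, relative to the package root.
# A trailing "/*" means "everything inside this directory".
AI_DIR = "Lib/Linux/AIPageObjectExtractor"

ENGINE_DEPENDENCIES: Dict[str, Set[str]] = {
    "Tabular Data Extraction": {
        "Lib/Linux/TabluarData/*",
        "Lib/Linux/OCRModule",
    },
    "Document Structure Recognition": {
        "Lib/Linux/StructuredOutput",
        "Lib/Linux/fonts2.pdf",
        "Lib/Linux/tessdata/*",
    },
    "Form Field Detection": {
        f"{AI_DIR}/AIPageObjectExtractor",
        f"{AI_DIR}/form.onnx",
        f"{AI_DIR}/Licenses",
    },
    "Form Field Key-Value Extraction": {
        f"{AI_DIR}/AIPageObjectExtractor",
        f"{AI_DIR}/form.onnx",
        f"{AI_DIR}/kv.onnx",
        f"{AI_DIR}/v.cab",
        f"{AI_DIR}/Licenses",
    },
    "Generic Key-Value Extraction": {
        f"{AI_DIR}/AIPageObjectExtractor",
        f"{AI_DIR}/kv.onnx",
        f"{AI_DIR}/v.cab",
        f"{AI_DIR}/Licenses",
    },
}

# Extra files for Document Structure Recognition with Deep Learning Assist.
DEEP_LEARNING_ASSIST_FILES: Set[str] = {
    f"{AI_DIR}/AIPageObjectExtractor",
    f"{AI_DIR}/table.onnx",
    f"{AI_DIR}/table_tabular.onnx",
    f"{AI_DIR}/Licenses",
}


def required_paths(engines: Iterable[str], deep_learning_assist: bool = False) -> Set[str]:
    """Return the union of the dependency paths for the selected engines."""
    selected = list(engines)
    needed: Set[str] = set()
    for engine in selected:
        # Raises KeyError for an engine name that is not in the table.
        needed |= ENGINE_DEPENDENCIES[engine]
    if deep_learning_assist and "Document Structure Recognition" in selected:
        needed |= DEEP_LEARNING_ASSIST_FILES
    return needed


if __name__ == "__main__":
    keep = required_paths(
        ["Form Field Key-Value Extraction", "Document Structure Recognition"]
    )
    for path in sorted(keep):
        print(path)
```

Using sets means that files shared by several engines, such as the AIPageObjectExtractor binary, appear only once in the result.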

If none of the engines you are using depend on a given file, you are free to remove that file. For example, if you are using the Form Field Key-Value Extraction engine and the Document Structure Recognition engine (without Deep Learning Assist), then you can remove any files that are needed only for the Tabular Data Extraction engine or for Deep Learning Assist. In this example, you would be left with the following:

Example Package Structure

Lib
└── Linux
    ├── AIPageObjectExtractor
    │   ├── AIPageObjectExtractor
    │   ├── form.onnx
    │   ├── kv.onnx
    │   ├── v.cab
    │   └── Licenses
    ├── fonts2.pdf
    ├── StructuredOutput
    └── tessdata
        ├── chi_sim.traineddata
        ├── chi_sim_vert.traineddata
        ├── chi_tra.traineddata
        ├── chi_tra_vert.traineddata
        ├── ell.traineddata
        ├── eng.traineddata
        ├── grc.traineddata
        ├── jpn.traineddata
        ├── jpn_vert.traineddata
        ├── kor.traineddata
        └── kor_vert.traineddata
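To arrive at a structure like the one above without deleting files by hand, one option is to list every file under Lib/Linux that is not covered by the paths you want to keep, review that list, and only then remove the files. The sketch below is an illustration under stated assumptions, not an official tool: the helper names `is_kept` and `removable_files` are hypothetical, each kept path is assumed to cover both the path itself and anything beneath it (the table does not say whether entries such as OCRModule or Licenses are files or directories), and the script only reports candidates and never deletes anything.

```python
from pathlib import Path, PurePosixPath
from typing import Iterable, List


def is_kept(relative_path: str, keep: Iterable[str]) -> bool:
    """True if relative_path equals a kept entry or lies beneath one."""
    rel = PurePosixPath(relative_path)
    for entry in keep:
        # "dir/*" in the table means "everything inside dir".
        base = PurePosixPath(entry[:-2] if entry.endswith("/*") else entry)
        if rel == base or base in rel.parents:
            return True
    return False


def removable_files(package_root: str, keep: Iterable[str],
                    platform_dir: str = "Lib/Linux") -> List[str]:
    """List files under platform_dir that no kept entry covers (dry run)."""
    root = Path(package_root)
    keep = list(keep)
    candidates = []
    for path in (root / platform_dir).rglob("*"):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            if not is_kept(rel, keep):
                candidates.append(rel)
    return sorted(candidates)


if __name__ == "__main__":
    # Keep list for the example above: Form Field Key-Value Extraction plus
    # Document Structure Recognition without Deep Learning Assist.
    keep = [
        "Lib/Linux/AIPageObjectExtractor/AIPageObjectExtractor",
        "Lib/Linux/AIPageObjectExtractor/form.onnx",
        "Lib/Linux/AIPageObjectExtractor/kv.onnx",
        "Lib/Linux/AIPageObjectExtractor/v.cab",
        "Lib/Linux/AIPageObjectExtractor/Licenses",
        "Lib/Linux/fonts2.pdf",
        "Lib/Linux/StructuredOutput",
        "Lib/Linux/tessdata/*",
    ]
    # "." assumes the script is run from the package root.
    for rel in removable_files(".", keep):
        print("not required:", rel)
```

Restricting the scan to Lib/Linux leaves the rest of the package untouched, since the table only describes dependencies inside that directory. Keeping a backup of the original package before removing anything makes it easy to restore a file if another engine is needed later.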
