Using Intelligent Data Extraction to Augment Contextual LLM Queries - Document RAG Example

Example 2 - Document RAG

The previous example works well with small documents, but larger documents or a large corpus of documents can cause problems, for two reasons:

  1. Pricing of requests to the OpenAI Chat Completions API is a function of input (and output) size. By attaching a large document to our query, the operation can get very expensive very quickly. This is also often unnecessary, as the query itself is rarely about the whole document, but rather about some subsection. See OpenAI's documentation for more information on pricing.
  2. Chat completion models have a "token limit", which constrains how much data we are able to pass to the model. Thus, we have to find a way to pass only the relevant subsections of the data when our total context exceeds the token limit. See OpenAI's documentation for more information on token limits.
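To get a feel for the numbers involved, a common rule of thumb is that one token corresponds to roughly four characters of English text (exact counts depend on the model's tokenizer, e.g. OpenAI's tiktoken package). A minimal sketch of a feasibility check using that heuristic; the window and reserve sizes below are illustrative assumptions, not fixed model limits:

```python
# Rough token estimate using the ~4 characters/token heuristic for English.
# For exact counts, use the model's tokenizer; this is only a quick check.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_window: int = 8192,
                    reserved_for_answer: int = 1024) -> bool:
    """Check whether `text` plausibly fits, leaving room for the reply."""
    return estimate_tokens(text) <= context_window - reserved_for_answer

# A ~1000-page standard at ~3000 characters per page is ~750k estimated
# tokens, far beyond a typical context window -- hence the need to retrieve
# only the relevant subsections.
doc = "x" * 3_000_000
print(estimate_tokens(doc))   # 750000
print(fits_in_context(doc))   # False
```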

What can we do about this? Here, we will introduce Retrieval-Augmented Generation (RAG); depending on context, we will also use "RAG" to refer to a Retrieval-Augmented Generator, a system that employs this technique. A RAG system finds the information in a large corpus that is most relevant to a query. That relevant information can then be attached to the query as context, without needing to attach the entire document. To do this, we will expand a bit on the list of steps we provided in the previous example:

  1. Extract document structure information using the Apryse Data Extraction Module.
  2. Split the document(s) into chunks that we can use to build an index. The index can then be used to search for the relevant context for a given query.
  3. Generate the required representations of the structured context data (we will use both HTML and plain text representations).
  4. Build the index using the OpenAI Embeddings API.
  5. Given a user question/query, search the index for the most suitable context chunks.
  6. Attach the most relevant context chunk(s) to the query, and send it to the OpenAI chat completion model.
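The retrieval steps (4 through 6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the example script itself: the `embed` function below substitutes a toy bag-of-words vector for the real call to the OpenAI Embeddings API (indicated in a comment), and the chunk texts are placeholders.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text: str) -> list[float]:
    # In the real pipeline this would call the OpenAI Embeddings API, e.g.:
    #   client.embeddings.create(model="text-embedding-3-small", input=text)
    # Here we use a toy bag-of-words vector so the sketch runs offline.
    vocab = ["text", "operator", "glyph", "bookmark", "section"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Step 4: embed every chunk once, up front."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def top_k(index: list[tuple[str, list[float]]], query: str,
          k: int = 2) -> list[str]:
    """Steps 5-6: embed the query, return the k most similar chunks."""
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine_similarity(pair[1], qv),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "The TJ operator shows text with individual glyph positioning adjustments.",
    "A bookmark entry points to a section of the document outline.",
]
index = build_index(chunks)
print(top_k(index, "which operator adjusts glyph spacing in text", k=1)[0])
```

Note that the chunks are embedded once when the index is built; only the short query string is embedded at question time, which keeps the per-query cost low even for a large corpus.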

For the following example, we will use a very large document, the ISO_32000-2:2020 PDF standard, to demonstrate these techniques. This document is available for free download from Adobe. If you haven't already, please download it and place it at the following location: idp_rag_guide/data/pdf/PDF_ISO_32000-2.pdf.

To run the example, use the following command (with your virtual environment active, if using):

```sh
python3 ./iso32000_rag.py
```

You should see some text indicating progress, with a question and answer about the document appearing at the end. LLMs aren't guaranteed to produce identical output between runs, but you should see something similar to the following:

```sh
Extracting Document Structure from /home/matt/dev/idp-deep-learning/document-summary-rag/idp_rag_guide/data/pdf/PDF_ISO_32000-2.pdf...
Extracted data to /home/matt/dev/idp-deep-learning/document-summary-rag/idp_rag_guide/data/output/rag_example/PDF_ISO_32000-2/json/PDF_ISO_32000-2.json
Using bookmark tree to split the document into sections...
Generating HTML and Text representations for each section...
Generating embeddings for each section...

================================================================================

Question: What are the meanings of the numeric values used by the Tj Operator? For example, "[(He)20(ll)10(o Wo)10(rld)]TJ"?
Detected Context:
 9.4.3 Text-showing operators
 9.2.3 Achieving special graphical effects
 9.2.4 Glyph positioning and metrics
Response: The numeric values used by the TJ operator in a text-showing command
like "[(He)20(ll)10(o Wo)10(rld)]TJ" represent adjustments to the text position
between the glyphs or strings of glyphs. According to Excerpt #1, each element
of the array passed to the TJ operator can be either a string or a number. If
the element is a string, the operator shows the string. If it is a number, the
operator adjusts the text position by that amount. This adjustment is a
translation of the text matrix, Tm, and the number is expressed in thousandths
of a unit of text space. The effect of this adjustment is to move the next
glyph painted either to the left or down by the given amount, depending on the
writing mode. In the default coordinate system, a positive adjustment moves the
next glyph to the left (in horizontal writing mode) by the amount specified.

Therefore, in the example "[(He)20(ll)10(o Wo)10(rld)]TJ":
- The "20" after "(He)" moves the next glyph ("ll") 20 thousandths of a unit of
text space to the left of where it would normally be placed.
- The "10" after "(ll)" moves the next glyph sequence "(o Wo)" 10 thousandths
of a unit of text space to the left of its standard position.
- Similarly, the "10" after "(o Wo)" adjusts the position of "(rld)" to the
left by 10 thousandths of a unit of text space from where it would otherwise be
positioned.

This mechanism allows for fine control over the spacing between glyphs or
groups of glyphs, enabling adjustments for kerning, aesthetic spacing, or other
typographic considerations.
```

Next Steps

For more details on how to build something like this yourself and a discussion of some of the decisions made for this example, see the Detailed Discussion.
