
Using Intelligent Data Extraction to Augment Contextual LLM Queries - Document RAG Example

Example 2 - Document RAG

The previous example works well with small documents, but issues may arise when confronted with larger documents or a large corpus of documents, for a couple of reasons:

  1. Pricing of requests to the OpenAI Chat Completions API is a function of input (and output) size. Attaching a large document to every query can make the operation very expensive very quickly. It is also often unnecessary, as a query is rarely about the whole document, but rather about some subsection of it. See OpenAI's pricing documentation for more information.
  2. Chat completion models have a "token limit", which constrains how much data we are able to pass to the model in a single request. We therefore have to find a way to pass only the relevant subsections of the data when our total context exceeds the token limit (see the token-counting sketch after this list). See OpenAI's documentation for more information on token limits.
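To get a feel for how quickly a large document consumes a model's token budget, here is a minimal sketch (not part of the example project) that counts tokens with the tiktoken library; the sample string and the commented figures are illustrative only.

import tiktoken

# cl100k_base is the tokenizer used by several OpenAI chat models.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens the model would see for this text."""
    return len(encoding.encode(text))

print(count_tokens("Hello, world!"))  # a short string is only a handful of tokens
# The full text of a document the size of the PDF specification can easily run to
# hundreds of thousands of tokens, far more than a single request can carry.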

What can we do about this? Here, we introduce the concept of Retrieval-Augmented Generation (RAG); depending on context, we will also use "RAG" to refer to a Retrieval-Augmented Generator, a system that employs this technique. A RAG can be used to find the information relevant to a query within a large corpus of context information. That relevant information can then be attached to the query as context, without attaching the entire document. To do this, we will expand a bit on the list of steps we provided in the previous example:

  1. Extract document structure information using the Apryse Data Extraction Module.
  2. Split the document(s) into chunks that we can use to build an index. The index can then be used to search for the relevant context for a given query.
  3. Generate the required representations of the structured context data (we will use both HTML and plain text representations).
  4. Build the index using the OpenAI Embeddings API.
  5. Given a user question / query, search the index for the most suitable context chunk.
  6. Attach the most relevant context chunk(s) to the query, and send it to the OpenAI chat completion model (a minimal sketch of steps 4-6 follows this list).
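To make steps 4 through 6 concrete, here is a minimal sketch of an embeddings index built and queried with the OpenAI Python client. This is not the code from iso32000_rag.py; the chunk contents, model names, and prompt wording are illustrative assumptions.

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts):
    """Embed a list of strings with the OpenAI Embeddings API."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# In the real example, these chunks come from the document sections produced in
# steps 2 and 3; the strings below are stand-ins.
chunks = [
    "9.4.3 Text-showing operators ...",
    "9.2.4 Glyph positioning and metrics ...",
]
index = embed(chunks)

def top_chunks(question, k=2):
    """Return the k chunks whose embeddings are most similar to the question."""
    q = embed([question])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question):
    """Attach the retrieved chunks to the query and ask the chat model."""
    excerpts = "\n\n".join(
        f"Excerpt #{i + 1}:\n{chunk}" for i, chunk in enumerate(top_chunks(question))
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the excerpts provided."},
            {"role": "user", "content": f"{excerpts}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

The actual example splits the extracted document structure into sections using the document's bookmark tree (as the progress output below shows) rather than hard-coding chunks, but the embed-search-attach flow is the same.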

For the following example, we will use a very large document, the ISO 32000-2:2020 PDF standard, to demonstrate these techniques. This document is available for free download from Adobe. If you haven't already, please download it and place it at the following location: idp_rag_guide/data/pdf/PDF_ISO_32000-2.pdf.

To run the example, use the following command (with your virtual environment active, if using):

python3 ./iso32000_rag.py

You should see some text indicating progress, with a question and answer about the document appearing at the end. LLMs aren't guaranteed to produce identical output between runs, but you should see something similar to the following:

Extracting Document Structure from /home/matt/dev/idp-deep-learning/
    document-summary-rag/idp_rag_guide/data/pdf/PDF_ISO_32000-2.pdf...
Extracted data to /home/matt/dev/idp-deep-learning/document-summary-rag/
    idp_rag_guide/data/output/rag_example/PDF_ISO_32000-2/json/PDF_ISO_32000-2.json
Using bookmark tree to split the document into sections...
Generating HTML and Text representations for each section...
Generating embeddings for each section...

================================================================================

Question: What are the meanings of the numeric values used by the Tj Operator? For example, "[(He)20(ll)10(o Wo)10(rld)]TJ"?
Detected Context: 
        9.4.3 Text-showing operators
        9.2.3 Achieving special graphical effects
        9.2.4 Glyph positioning and metrics
Response: The numeric values used by the TJ operator in a text-showing command 
like "[(He)20(ll)10(o Wo)10(rld)]TJ" represent adjustments to the text position 
between the glyphs or strings of glyphs. According to Excerpt #1, each element 
of the array passed to the TJ operator can be either a string or a number. If 
the element is a string, the operator shows the string. If it is a number, the 
operator adjusts the text position by that amount. This adjustment is a 
translation of the text matrix, Tm, and the number is expressed in thousandths 
of a unit of text space. The effect of this adjustment is to move the next 
glyph painted either to the left or down by the given amount, depending on the 
writing mode. In the default coordinate system, a positive adjustment moves the 
next glyph to the left (in horizontal writing mode) by the amount specified.

Therefore, in the example "[(He)20(ll)10(o Wo)10(rld)]TJ":
- The "20" after "(He)" moves the next glyph ("ll") 20 thousandths of a unit of 
text space to the left of where it would normally be placed.
- The "10" after "(ll)" moves the next glyph sequence "(o Wo)" 10 thousandths 
of a unit of text space to the left of its standard position.
- Similarly, the "10" after "(o Wo)" adjusts the position of "(rld)" to the 
left by 10 thousandths of a unit of text space from where it would otherwise be 
positioned.

This mechanism allows for fine control over the spacing between glyphs or 
groups of glyphs, enabling adjustments for kerning, aesthetic spacing, or other 
typographic considerations.

Next Steps

For more details on how to build something like this yourself and a discussion of some of the decisions made for this example, see the Detailed Discussion.

Have questions? Connect with our experts on Discord.