Using Intelligent Data Extraction to Augment Contextual LLM Queries - Detailed Discussion


Here we will go over some of the decisions we made in the Document RAG Example, including the rationale behind our choices, and some changes you might want to incorporate for your own use case, depending on your needs.

In the Document RAG Example, we refer to 6 basic steps that we performed when building our RAG. They are:

  1. Extract document structure information using the Apryse Data Extraction Module.
  2. Split the document(s) into chunks that we can use to build an index.
  3. Generate the required representations of the structured context data.
  4. Build the index using the Open AI Embeddings API.
  5. Given a user question / query, search the index for the most suitable context chunk.
  6. Attach the most relevant context chunk(s) to the query, and send it to the Open AI chat completion model.

Some of these steps are fairly self-explanatory and need little elaboration. For the others, we discuss our decisions below and provide alternatives where appropriate, since some of the choices we made may be less suitable for other use cases.

Splitting the Document

In order to retrieve relevant context to attach to a query, we first need to split the document into chunks. We did this using the bookmark tree to obtain chunks based on sections of the document (see the code in idp_rag_utils/bookmark_utils.py). Splitting on section boundaries ensures that a chunk does not end abruptly in the middle of a paragraph and that the full context of a section is captured in a single chunk. There are some downsides to this decision, however:

  • Not all documents have bookmarks.
  • The demonstrated technique for splitting the document based on bookmarks works for the provided example, but would not be sufficient for more complex documents, such as those with multiple columns.
  • The size of a section can vary wildly, and may, in some extreme cases, still be too large to meet your cost / token count requirements.

In these cases, you will need to use other methods of splitting up the document. A relatively simple technique is to split based on pages. We provide code to do this in the example, which you can toggle on and off by setting the SPLIT_BASED_ON_BOOKMARKS variable. Here's a simple implementation:

Python

doc_structure = iru.DocumentStructure(idp_data)

for page in doc_structure.pages:
    page_html = page.to_html()
    page_text = page.to_text()
    # Do something with the page content

Because this technique has a high likelihood of splitting context in the middle of a sentence or paragraph, you may wish to use a sliding window of pages so that each page break is always present in the middle of a context:

Python

doc_structure = iru.DocumentStructure(idp_data)

# Create start / end page pairs
page_pairs = zip(range(1, len(doc_structure.pages)), range(2, len(doc_structure.pages) + 1))

for p1, p2 in page_pairs:
    # Copy the section of the document containing the pages of interest
    section = doc_structure.copy_between(p1, None, p2, None)
    section_html = section.to_html()
    section_text = section.to_text()
    # Do something with the sections

Indexing and Embeddings

In order to retrieve relevant context from a corpus of document chunks or sections, we need a way to look up which chunks are most relevant. This lookup structure is called an index. Here, we use a vector index, which encodes each chunk as a vector (or embedding) of floating point numbers. Semantically similar chunks are represented by embeddings that lie relatively close together in vector space. Open AI can generate these embeddings for us using their embeddings API. We can therefore search for context by finding the chunks whose embeddings are closest to the embedding generated for the query. We use a plain-text representation of the data when generating embeddings for two reasons (a minimal index-building sketch follows the list below):

  • The query is also plain-text
  • Structural markup adds noise to the embedding that doesn't improve the semantic representation. Using HTML, for example, harms rather than helps the ability to match a chunk to a query, because the embedding will be influenced by the structural and stylistic markup.
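
For illustration, here is a minimal sketch of how such an index could be built with the openai Python package. The model name, the embed_texts helper, and the chunk_texts / sections variables are assumptions for this sketch, not names taken from the example code:

Python

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts, model="text-embedding-3-small"):
    # Request one embedding per input string from the Open AI Embeddings API
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(item.embedding) for item in response.data]

# Plain-text representation of each chunk, e.g. produced by the splitting step above
chunk_texts = [section.to_text() for section in sections]
index_embeddings = embed_texts(chunk_texts)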

Sometimes, your context chunks may be bigger than the maximum size supported by the Open AI Embeddings API. Our approach in this case was to create multiple embeddings for the chunk in question, splitting it into sub-chunks with an overlap of 500 tokens between neighbors to preserve the continuity of the text.
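
As an illustration, here is a minimal sketch of one way to split an oversized chunk into overlapping sub-chunks using the tiktoken tokenizer. The token limit, encoding name, and function name are assumptions for this sketch, not values taken from the example code:

Python

import tiktoken

def split_with_overlap(text, max_tokens=8191, overlap=500, encoding_name="cl100k_base"):
    # Tokenize the chunk so we can measure and slice it by token count
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return [text]

    sub_chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        sub_chunks.append(encoding.decode(window))
        if start + max_tokens >= len(tokens):
            break
        start += max_tokens - overlap  # step forward, keeping a 500-token overlap
    return sub_chunks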

See "STEP 4: Generate Index" in the example code (idp_rag_guide/iso32000_rag_example.py) for a demonstration of how to build such a vector index.

Text Representations of the Document

In order to generate an index and include context with queries, we need a textual representation of the document structure. The most obvious way to do this would be to use the JSON data output by the Data Extraction Module directly, but we found more success using HTML, likely because the JSON schema we use is not prevalent in Open AI's training data, while HTML is very prevalent. HTML is effective at conveying structure, which can help the LLM understand tables and lists, for example. There are contexts, however, where this structural information is not relevant to the task. For example, when generating embeddings, we found that the extra information in HTML just added noise to the similarity metrics used to compare the context sections to the query. For this reason, we chose to use a plain-text representation to create the embeddings of the document sections and build our index. Depending on the structure of the document you are using, you may wish to try out different representations of your data for each step.
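
In practice, this can be as simple as keeping both representations for each section and using each one where it works best. A minimal sketch, assuming a sections list from the splitting step above:

Python

# Keep both representations for each section: plain text for the embedding
# index, HTML for the context attached to the chat completion request.
# `sections` is assumed to come from the document-splitting step above.
chunks = [
    {"text": section.to_text(), "html": section.to_html()}
    for section in sections
]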

Attaching Context To Queries

Once we have a query, we compute its embedding the same way we did for the contextual information when we built our index (using the Open AI Embeddings API). We then compare that embedding to every embedding in our index using cosine similarity to find the most relevant chunks from our document to attach to the query. This is imperfect, and the document chunk that contains the answer to the question may not always be ranked highest. To mitigate this problem, we chose to attach multiple chunks to the query (up to five, depending on similarity score). The number of chunks to attach was chosen heuristically, and may need to be modified depending on your use case. Here are some things to consider when determining how many context chunks to attach (a retrieval sketch follows the list below):

  • How big is each chunk? The smaller the chunks you break your document into, the more you will be able to attach to the query.
  • How confident are you that your index was able to find the right chunk? Depending on your use case, you may find that quality of the segment ranking produced by your index will vary. The better performing your index, the fewer chunks you will need to attach.
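
As a rough sketch of this retrieval step, assuming the embed_texts helper, chunk_texts list, and index_embeddings from the indexing sketch above, along with illustrative variable names:

Python

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the user's question and score it against every chunk in the index
query_embedding = embed_texts([question_text])[0]
scores = [cosine_similarity(query_embedding, chunk_embedding)
          for chunk_embedding in index_embeddings]

# Attach up to five of the highest-scoring chunks to the query
top_indices = np.argsort(scores)[::-1][:5]
context_segments = [chunk_texts[i] for i in top_indices]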

Sometimes you might need to know if the context returned by your index actually contains the answer to the question. In this case, you may want to modify the query to instruct Open AI to use the contextual information only. For example, you might form your query as follows:

Python

1f""" You are a chatbot that answers questions about a given context. Use the provided context to answer the subsequent question. Use the provided context only. If the provided context does not contain information on the subject, answer 'Information not provided.' Context: \"\"\" {context_segment} \"\"\" Question: {question_text} """

Then, you can check the response for the "Information not provided" string to see if the LLM was able to find an answer in the provided context. This method is not guaranteed to correctly determine the utility of a context chunk, but you may still find it useful to leverage the LLM in this way.
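
For illustration, here is a minimal sketch of sending such a query and checking the response with the openai Python package. The model name is a placeholder, and prompt is assumed to be the f-string shown above with the context and question filled in:

Python

from openai import OpenAI

client = OpenAI()

# `prompt` is assumed to be the f-string shown above, with the retrieved
# context segment(s) and the user's question filled in
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; substitute the model you use
    messages=[{"role": "user", "content": prompt}],
)
answer = response.choices[0].message.content

if "Information not provided" in answer:
    # The model could not find an answer in the attached context
    print("The retrieved context did not contain the answer.")
else:
    print(answer)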
