Here we will go over some of the decisions we made in the Document RAG Example, including the rationale behind our choices, and some changes you might want to incorporate for your own use case, depending on your needs.
In the Document RAG Example, we refer to 6 basic steps that we performed when building our RAG. They are:
Some of these steps are fairly self-explanatory and don't require much discussion. For others, we explain our reasoning below and provide alternatives where appropriate, since some of the choices we made may be less suitable for other use cases.
In order to retrieve relevant context to attach to a query, we first need to split the document into chunks. We did this using the bookmark tree to obtain chunks based on sections of the document (see the code in idp_rag_utils/bookmark_utils.py). Doing this ensures that a chunk does not end abruptly in the middle of a paragraph, and that the entirety of a section's context is captured in its chunk. There are some downsides to this decision, however:
In these cases, you will need to use other methods of splitting up the document. A relatively simple technique is to split based on pages. We provide code to do this in the example, which you can toggle on and off by setting the SPLIT_BASED_ON_BOOKMARKS variable. Here's a simple implementation:
# idp_data is the JSON output produced by the Data Extraction Module
doc_structure = iru.DocumentStructure(idp_data)
for page in doc_structure.pages:
    page_html = page.to_html()
    page_text = page.to_text()
    # Do something with the page content
Because this technique has a high likelihood of splitting context in the middle of a sentence or paragraph, you may wish to use a sliding window of pages so that each page break always falls in the middle of some context chunk:
doc_structure = iru.DocumentStructure(idp_data)
# Create start / end page pairs, e.g. (1, 2), (2, 3), ...
page_pairs = zip(range(1, len(doc_structure.pages)), range(2, len(doc_structure.pages) + 1))
for p1, p2 in page_pairs:
    # Copy the section of the document containing the pages of interest
    section = doc_structure.copy_between(p1, None, p2, None)
    section_html = section.to_html()
    section_text = section.to_text()
    # Do something with the sections
In order to retrieve relevant context from a corpus of document chunks or sections, we need a way to look up which chunks are most relevant. This is called an index. Here, we use a vector index, which encodes each chunk as a vector (or embedding) of floating point numbers. Semantically similar chunks will be represented by embeddings that are relatively close together in vector space. OpenAI can generate these embeddings for us using their Embeddings API. We can therefore search for context by finding the chunks whose embeddings are closest to the embedding generated for the query. We use a plain-text representation of the data when generating embeddings for two reasons:
Sometimes, your context chunks may be larger than the maximum input size supported by the OpenAI Embeddings API. Our approach in this case was to create multiple embeddings for the chunk in question, using an overlap of 500 tokens between neighboring sub-chunks to ensure continuity of the text.
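As a rough illustration, here is one way to split an oversized chunk into overlapping token windows before embedding it. This is a sketch rather than the code from the example: the tokenizer name (cl100k_base), the default window size, and the split_with_overlap helper are our own assumptions.
import tiktoken

def split_with_overlap(text, max_tokens=8000, overlap=500):
    """Split `text` into windows of at most `max_tokens` tokens,
    sharing `overlap` tokens between neighboring windows."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return [text]
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        windows.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return windows
Each window can then be embedded separately, with all of the resulting embeddings pointing back to the same original chunk.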
See "STEP 4: Generate Index" in the example code (idp_rag_guide/iso32000_rag_example.py) for a demonstration of how to build such a vector index.
In order to generate an index and include context with queries, we need a textual representation of the document structure. The most obvious way to do this would be to use the JSON data output by the Data Extraction Module directly, but we found more success using HTML, likely because the JSON schema we use is not prevalent in OpenAI's training data, while HTML is very prevalent. HTML is effective at conveying structure, which can help the LLM understand tables and lists, for example. There are contexts, however, where this structural information is not relevant to the task. For example, when generating embeddings, we found that the extra information in HTML just added noise to the similarity metrics used to compare the context sections to the query. For this reason, we chose to use a plain-text representation to create the embeddings of the document sections and build our index. Depending on the structure of the document you are using, you may wish to try out different representations of your data for each step.
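One simple way to act on this is to keep both representations for each chunk, embedding the plain text while attaching the HTML to the prompt. The sketch below assumes the section objects produced by the splitting code above; the DocumentChunk name is ours.
from dataclasses import dataclass

@dataclass
class DocumentChunk:
    text: str  # plain text: used to generate the embedding for the index
    html: str  # HTML: attached to the prompt so the LLM sees the structure

def make_chunk(section):
    # `section` is one piece of the document produced by the splitting step above
    return DocumentChunk(text=section.to_text(), html=section.to_html())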
Once we have a query, we compute its embedding the same way we did for the contextual information when we built our index (using the OpenAI Embeddings API). We then compare that embedding to every embedding in our index using cosine similarity to find the most relevant chunks from our document to attach to the query. This is imperfect, and the document chunk that contains the answer to the question may not always be ranked highest. To mitigate this, we chose to attach multiple chunks to the query (up to five, depending on similarity score). The number of chunks to attach was chosen heuristically and may need to be modified depending on your use case. Here are some things to consider when determining how many context chunks to attach:
Sometimes you might need to know if the context returned by your index actually contains the answer to the question. In this case, you may want to modify the query to instruct the model to use the contextual information only. For example, you might form your query as follows:
f"""
You are a chatbot that answers questions about a given context. Use the
provided context to answer the subsequent question. Use the provided context
only. If the provided context does not contain information on the subject,
answer 'Information not provided.'
Context:
\"\"\"
{context_segment}
\"\"\"
Question: {question_text}
"""
Then, you can check the response for the "Information not provided" string to see if the LLM was able to find an answer in the provided context. This method is not guaranteed to correctly determine the utility of a context chunk, but you may still find it useful to leverage the LLM in this way.
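To illustrate, here is a minimal sketch of that check, assuming the OpenAI Python client; the model name gpt-4o, the answer_with_context helper, and the prompt variable (built as shown above) are placeholders, not the example's exact code.
from openai import OpenAI

client = OpenAI()

def answer_with_context(prompt):
    """Send the context-restricted prompt and flag whether an answer was found."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    found = "Information not provided" not in answer
    return answer, found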