This paper "Observations on Building RAG Systems for Technical Documents" (Soman et al., 31 Mar 2024 ) explores challenges and best practices when applying Retrieval Augmented Generation (RAG) to technical documentation, which often contains specialized terminology, definitions, and acronyms. The authors conducted experiments using IEEE technical specifications and a battery terminology glossary to analyze factors influencing RAG performance.
A key observation is that the reliability of sentence embeddings decreases with increasing chunk size, particularly when chunks exceed 200 words. This suggests that standard embedding models may struggle to capture meaning effectively in longer technical passages, leading to spurious similarities. For practical implementation, this highlights the importance of a carefully chosen chunking strategy, potentially favoring smaller chunks or evaluating different models/strategies for longer texts.
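For example, a simple word-count splitter with a small overlap keeps chunks below that limit; the 200-word ceiling follows the paper's observation, while the function name and 30-word overlap below are illustrative choices, not from the paper.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 30) -> list[str]:
    """Split text into chunks of at most max_words words, with overlapping context."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back so adjacent chunks share some context
    return chunks
```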
The paper presents several hypotheses and observations derived from these experiments, summarized in its Table 1:
- Handling Definitions (H1): For documents containing definitions, splitting the defined term from its definition and processing them separately for retrieval yielded better performance compared to embedding the entire paragraph.
- Implementation Tip: When indexing glossaries or documents with explicit definitions, consider creating separate index entries or combining embeddings strategically for the term and its corresponding definition.
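As a rough sketch of that idea (the glossary entry and data layout below are hypothetical, not taken from the paper), a term and its definition can be indexed as separate retrieval units that both point back to the same glossary entry:

```python
# Hypothetical glossary entry; the term and its definition become separate index
# records that resolve to the same entry at retrieval time.
glossary = {
    "State of charge": "The remaining capacity of a battery, expressed as a "
                       "percentage of its rated capacity.",
}

index_records = []  # (text_to_embed, entry_key) pairs to feed the embedding/index step
for term, definition in glossary.items():
    index_records.append((term, term))        # index the defined term on its own
    index_records.append((definition, term))  # index the definition separately
```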
- Similarity Scores (H2): Relying solely on the absolute value of similarity scores, or on a fixed threshold, is unreliable for retrieval. Chunks containing the correct answer can receive lower scores than irrelevant ones, and scores are not consistently comparable across different retrieval approaches.
- Implementation Tip: Avoid simple similarity score thresholds. Focus on rank-based retrieval (taking the top-k results) and potentially incorporate other ranking signals or re-ranking techniques.
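A minimal sketch of rank-based selection, assuming pre-computed, L2-normalized embeddings (the names below are illustrative):

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k highest-scoring chunks rather than applying a score threshold."""
    scores = chunk_vecs @ query_vec          # cosine similarity for L2-normalized vectors
    return np.argsort(-scores)[:k].tolist()  # rank by score and keep the top k
```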
- Keyword Position (H3): Keywords appearing closer to the beginning of a sentence were retrieved with higher accuracy than those appearing later.
- Implementation Tip: This suggests a limitation of current embedding models. While not directly implementable as a fix, it's an important consideration when evaluating retrieval performance, especially for queries targeting information deep within a sentence. Pre-processing or alternative search methods might be needed for critical keywords located late in sentences.
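One option, sketched below under the assumption that the third-party rank_bm25 package is available, is to complement embedding search with lexical BM25 scoring, since exact keyword matches are not sensitive to where a keyword sits in a sentence:

```python
from rank_bm25 import BM25Okapi  # assumed third-party dependency

corpus = [
    "The state of charge shall be reported every 10 seconds.",
    "Cells exceeding the temperature limit trigger a protection fault.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def lexical_scores(query: str) -> list[float]:
    """BM25 scores per document; these can be blended with embedding similarities."""
    return list(bm25.get_scores(query.lower().split()))
```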
- Retrieval Strategy (H4 & H5): Performing similarity search at the sentence level, retrieving the parent paragraphs that contain the matched sentences, and then passing those paragraphs to the generator as context performed better than paragraph-level similarity search alone. This approach provided more detailed and relevant context.
- Implementation Tip: Consider the following hybrid retrieval approach:
- 1. Index documents at both sentence and paragraph levels. Store mapping from sentences to their parent paragraphs.
- 2. For a query, search for similar sentences.
- 3. Retrieve the unique parent paragraphs corresponding to the top-k most similar sentences.
- 4. Pass these retrieved paragraphs to the LLM for generation.
```python
# Indexing phase: embed at sentence level, but remember each sentence's parent paragraph.
for paragraph in document:                      # placeholder iterable of paragraph objects
    store_paragraph(paragraph.id, paragraph.text)
    for sentence in split_sentences(paragraph.text):
        sentence_embedding = encode(sentence)   # encode() stands in for the embedding model
        store_sentence(sentence_embedding, paragraph.id)

# Retrieval phase: search at sentence level, then collect the parent paragraphs.
query_embedding = encode(user_query)
similar_sentences = find_top_k_similar_sentences(query_embedding, sentence_embeddings)

retrieved_paragraph_ids = set()
for similar_sentence in similar_sentences:
    paragraph_id = get_parent_paragraph_id(similar_sentence)
    retrieved_paragraph_ids.add(paragraph_id)

retrieved_paragraphs = [get_paragraph_text(pid) for pid in retrieved_paragraph_ids]

# Generation phase: pass the unique parent paragraphs to the LLM as context.
context = "\n\n".join(retrieved_paragraphs)
answer = generate(llm, context, user_query)
```
- Handling Acronyms (H6): RAG systems struggled with definitions or content heavily involving acronyms and abbreviations. The generator often just expanded the acronym or provided the abbreviation without adding significant value or context.
- Implementation Tip: This is a known challenge. Consider pre-processing steps like expanding common acronyms before indexing or finetuning/selecting LLMs known to handle technical abbreviations better. Specific strategies for acronym lookup might need to be integrated.
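As one example of such pre-processing (the acronym map below is illustrative, not from the paper), known abbreviations can be expanded in place before chunks are indexed:

```python
import re

ACRONYMS = {  # hypothetical map; could be built from the document's own abbreviation list
    "SOC": "state of charge",
    "BMS": "battery management system",
}

def expand_acronyms(text: str) -> str:
    """Append the expansion after the first occurrence of each known acronym."""
    for acronym, expansion in ACRONYMS.items():
        pattern = rf"\b{re.escape(acronym)}\b"
        text = re.sub(pattern, f"{acronym} ({expansion})", text, count=1)
    return text
```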
- Order of Context (H7): In their experiments, the order of the retrieved paragraphs provided to the generator did not significantly affect the quality of the generated answer.
- Implementation Tip: While some research suggests positional context can matter, this finding implies that for simpler setups like the one used (Llama2-7b-chat with specific prompts), complex re-ranking of retrieved chunks based on their original document position might not be necessary. However, testing this for your specific LLM and document type is recommended.
For practical deployment of RAG on technical documents, the paper's findings suggest:
- Smart Chunking: Avoid excessively large chunks. Experiment with chunk sizes and overlaps specific to your document structure and embedding model.
- Specialized Indexing: For glossaries or definition-rich content, consider indexing terms and definitions separately or using specialized embedding strategies.
- Robust Retrieval: Don't rely solely on similarity score thresholds. Implement rank-based retrieval and potentially hybrid methods combining sentence-level search with paragraph-level context.
- Address Acronyms: Develop strategies to handle technical acronyms and abbreviations, as RAG may struggle with them out-of-the-box.
- Evaluate Thoroughly: Use domain-specific metrics (such as RAGAS (Es et al., 2023) or custom evaluations) to assess the quality of both retrieval and generation for technical accuracy.
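As a lightweight custom check, separate from RAGAS (the eval-set format and retrieve() callable below are assumptions), retrieval quality can be spot-checked by measuring how often the expected answer string appears in the retrieved context:

```python
def retrieval_hit_rate(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose gold answer appears in at least one retrieved chunk."""
    hits = 0
    for example in eval_set:  # each example: {"question": ..., "answer": ...}
        chunks = retrieve(example["question"], k)
        if any(example["answer"].lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(eval_set)
```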
The authors propose using RAG evaluation metrics and exploring methods for handling follow-up questions as future work, which are critical steps for building robust, practical RAG systems for technical domains. The provided appendix includes the basic prompt structure used for the Llama2 model, emphasizing a direct, context-focused answer style.