Observations on Building RAG Systems for Technical Documents (2404.00657v1)

Published 31 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Retrieval augmented generation (RAG) for technical documents creates challenges as embeddings do not often capture domain information. We review prior art for important factors affecting RAG and perform experiments to highlight best practices and potential challenges to build RAG systems for technical documents.

This paper "Observations on Building RAG Systems for Technical Documents" (Soman et al., 31 Mar 2024 ) explores challenges and best practices when applying Retrieval Augmented Generation (RAG) to technical documentation, which often contains specialized terminology, definitions, and acronyms. The authors conducted experiments using IEEE technical specifications and a battery terminology glossary to analyze factors influencing RAG performance.

A key observation is that the reliability of sentence embeddings decreases with increasing chunk size, particularly when chunks exceed 200 words. This suggests that standard embedding models may struggle to capture meaning effectively in longer technical passages, leading to spurious similarities. For practical implementation, this highlights the importance of a carefully chosen chunking strategy, potentially favoring smaller chunks or evaluating different models/strategies for longer texts.
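
As a rough illustration of such a chunking strategy (not from the paper), the sketch below caps chunks at a configurable word budget with a small overlap; the 200-word cap and 20-word overlap are assumptions to tune for your corpus and embedding model.

def chunk_by_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    # Split text into word-budget-capped chunks with a small overlap between chunks.
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

chunks = chunk_by_words(section_text)  # section_text: placeholder for a document section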

The paper presents several hypotheses and observations derived from its experiments (summarized in the paper's Table 1):

  • Handling Definitions (H1): For documents containing definitions, splitting the defined term from its definition and processing them separately for retrieval yielded better performance compared to embedding the entire paragraph.
    • Implementation Tip: When indexing glossaries or documents with explicit definitions, consider creating separate index entries or combining embeddings strategically for the term and its corresponding definition.
  • Similarity Scores (H2): Relying solely on the absolute value or thresholding of similarity scores for retrieval is unreliable. Correct answers often have low similarity scores compared to incorrect ones, and scores aren't consistently comparable across different retrieval approaches.
    • Implementation Tip: Avoid simple similarity score thresholds. Focus on rank-based retrieval (taking the top-k results) and potentially incorporate other ranking signals or re-ranking techniques.
  • Keyword Position (H3): Keywords appearing closer to the beginning of a sentence were retrieved with higher accuracy than those appearing later.
    • Implementation Tip: This suggests a limitation of current embedding models. While not directly implementable as a fix, it's an important consideration when evaluating retrieval performance, especially for queries targeting information deep within a sentence. Pre-processing or alternative search methods might be needed for critical keywords located late in sentences.
  • Retrieval Strategy (H4 & H5): Performing similarity search at the sentence level, retrieving the parent paragraphs that contain the matched sentences, and then using those paragraphs as context for the generator performed better than paragraph-level similarity search alone, providing more detailed and relevant context.
    • Implementation Tip: Consider a hybrid retrieval approach (sketched in the code below):
      1. Index documents at both sentence and paragraph levels, storing a mapping from each sentence to its parent paragraph.
      2. For a query, search for similar sentences.
      3. Retrieve the unique parent paragraphs corresponding to the top-k most similar sentences.
      4. Pass these retrieved paragraphs to the LLM for generation.

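A minimal Python sketch of this flow is given below, assuming a sentence-transformers encoder (the paper cites Sentence-BERT and MPNet); document_paragraphs, split_into_sentences, and generate are placeholders for your document source, sentence splitter, and generator call.
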
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # assumed MPNet-based sentence encoder

# Indexing Phase: embed each sentence, remembering its parent paragraph.
paragraphs = list(document_paragraphs)               # placeholder: iterable of paragraph strings
sentences, parent_ids = [], []
for paragraph_id, paragraph_text in enumerate(paragraphs):
    for sentence_text in split_into_sentences(paragraph_text):  # placeholder sentence splitter
        sentences.append(sentence_text)
        parent_ids.append(paragraph_id)
sentence_embeddings = encoder.encode(sentences, convert_to_tensor=True)

# Retrieval Phase: find top-k similar sentences, then collect their unique parent paragraphs.
query_embedding = encoder.encode(user_query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, sentence_embeddings, top_k=5)[0]
retrieved_paragraph_ids = {parent_ids[hit["corpus_id"]] for hit in hits}
retrieved_paragraphs = [paragraphs[pid] for pid in retrieved_paragraph_ids]

# Generation Phase: pass the retrieved paragraphs to the LLM as context.
context = "\n\n".join(retrieved_paragraphs)
answer = generate(llm, context, user_query)          # placeholder for the generator call
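
This keeps the precision of sentence-level matching (which is less exposed to the long-chunk embedding degradation noted above) while still handing the generator full-paragraph context. In practice you would also cap the number or total length of retrieved paragraphs so the context fits the generator's window.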

  • Handling Acronyms (H6): RAG systems struggled with definitions or content heavily involving acronyms and abbreviations. The generator often just expanded the acronym or provided the abbreviation without adding significant value or context.
    • Implementation Tip: This is a known challenge. Consider pre-processing steps such as expanding common acronyms before indexing, or fine-tuning/selecting LLMs known to handle technical abbreviations better; a minimal pre-processing sketch follows this list. Specific strategies for acronym lookup may also need to be integrated.
  • Order of Context (H7): In their experiments, the order of the retrieved paragraphs provided to the generator did not significantly affect the quality of the generated answer.
    • Implementation Tip: While some research suggests positional context can matter, this finding implies that for simpler setups like the one used (Llama2-7b-chat with specific prompts), complex re-ranking of retrieved chunks based on their original document position might not be necessary. However, testing this for your specific LLM and document type is recommended.
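
As one illustration of the acronym pre-processing idea for H6 (not taken from the paper), the sketch below rewrites known acronyms as "expansion (ACRONYM)" before indexing; the glossary dictionary and regex matching are assumptions.

import re

# Hypothetical acronym glossary (placeholders; build this from your own documents).
ACRONYMS = {"MAC": "medium access control", "PHY": "physical layer", "SOC": "state of charge"}

def expand_acronyms(text: str, glossary: dict[str, str] = ACRONYMS) -> str:
    # Replace each standalone acronym with "expansion (ACRONYM)" so both forms get indexed.
    pattern = r"\b(" + "|".join(re.escape(a) for a in glossary) + r")\b"
    return re.sub(pattern, lambda m: f"{glossary[m.group(0)]} ({m.group(0)})", text)

print(expand_acronyms("Compliance with the MAC and PHY specifications is required."))
# -> Compliance with the medium access control (MAC) and physical layer (PHY) specifications is required.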

This insensitivity to the order of retrieved context (H7) contrasts with some prior work and may be specific to the setup studied (Llama2-7b-chat with a particular prompt structure) rather than a general property of RAG pipelines.

For practical deployment of RAG on technical documents, the paper's findings suggest:

  1. Smart Chunking: Avoid excessively large chunks. Experiment with chunk sizes and overlaps specific to your document structure and embedding model.
  2. Specialized Indexing: For glossaries or definition-rich content, consider indexing terms and definitions separately or using specialized embedding strategies (a minimal sketch follows this list).
  3. Robust Retrieval: Don't rely solely on similarity score thresholds. Implement rank-based retrieval and potentially hybrid methods combining sentence-level search with paragraph-level context.
  4. Address Acronyms: Develop strategies to handle technical acronyms and abbreviations, as RAG may struggle with them out-of-the-box.
  5. Evaluate Thoroughly: Use domain-specific metrics (such as RAGAS (Es et al., 2023) or custom evaluations) to assess the quality of both retrieval and generation for technical accuracy.
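
For item 2, a minimal sketch of indexing glossary terms and definitions as separate entries, in the spirit of H1; the example entries, encoder, and index layout are assumptions.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")  # assumed encoder

# Hypothetical glossary entries (placeholders, not quoted from the IEEE glossary).
glossary = [
    ("float charge", "A charge applied to keep a battery at or near full state of charge."),
    ("cell reversal", "A change in polarity of a cell caused by over-discharge."),
]

index = []  # each entry: (embedding, term, definition)
for term, definition in glossary:
    # Embed the term and the definition separately so either can match a query,
    # while both point back to the same glossary entry.
    for text in (term, definition):
        index.append((encoder.encode(text), term, definition))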

The authors propose using RAG evaluation metrics and exploring methods for handling follow-up questions as future work, which are critical steps for building robust, practical RAG systems for technical domains. The provided appendix includes the basic prompt structure used for the Llama2 model, emphasizing a direct, context-focused answer style.
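
The appendix prompt itself is not reproduced here; as a hedged illustration, a direct, context-focused prompt in the Llama2-chat format might look like the following (the wording is an assumption).

def build_prompt(context: str, question: str) -> str:
    # Hypothetical context-focused prompt in the Llama2-chat format;
    # the actual wording in the paper's appendix may differ.
    system = ("Answer the question using only the provided context. "
              "Be direct; if the answer is not in the context, say you do not know.")
    return (f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
            f"Context:\n{context}\n\nQuestion: {question} [/INST]")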

References (11)
  1. IEEE 1881-2016. IEEE standard glossary of stationary battery terminology. IEEE Std 1881-2016, pp.  1–42, 2016. doi: 10.1109/IEEESTD.2016.7552407.
  2. Understanding retrieval augmentation for long-form question answering. arXiv preprint arXiv:2310.12150, 2023a.
  3. Benchmarking large language models in retrieval-augmented generation. arXiv preprint arXiv:2309.01431, 2023b.
  4. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
  5. IEEE. IEEE standard for information technology–telecommunications and information exchange between systems - local and Metropolitan Area Networks–specific requirements - part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications. IEEE Std 802.11-2020 (Revision of IEEE Std 802.11-2016), pp.  1–4379, 2021. doi: 10.1109/IEEESTD.2021.9363693.
  6. Sentence-BERT: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
  7. Observations on LLMs for telecom domain: Capabilities and limitations (To appear in the proceedings of The Third International Conference on Artificial Intelligence and Machine Learning Systems). arXiv preprint arXiv:2305.13102, 2023.
  8. MPNET: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
  9. Dynamic retrieval augmented generation of ontologies using artificial intelligence (DRAGON-AI). arXiv preprint arXiv:2312.10904, 2023.
  10. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  11. Retrieval-augmented domain adaptation of language models. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pp.  54–64, 2023.
Authors (2)
  1. Sumit Soman
  2. Sujoy Roychowdhury