HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems (2411.02959v2)

Published 5 Nov 2024 in cs.IR

Abstract: Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial RAG systems have used Web search engines as their major retrieval systems. Typically, such RAG systems retrieve search results, download HTML sources of the results, and then extract plain texts from the HTML sources. Plain text documents or chunks are fed into the LLMs to augment the generation. However, much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during this plain-text-based RAG process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. We believe HTML is better than plain text in modeling knowledge in external documents, and most LLMs possess robust capacities to understand HTML. However, utilizing HTML presents new challenges. HTML contains additional content such as tags, JavaScript, and CSS specifications, which bring extra input tokens and noise to the RAG system. To address this issue, we propose HTML cleaning, compression, and a two-step block-tree-based pruning strategy, to shorten the HTML while minimizing the loss of information. Experiments on six QA datasets confirm the superiority of using HTML in RAG systems.


Summary

  • The paper introduces a novel two-stage pipeline that cleans and prunes HTML to preserve structural cues for enhanced retrieval-augmented generation.
  • It demonstrates that processed HTML outperforms plain text and Markdown across various QA benchmarks, as measured by metrics such as EM, ROUGE-L, and BLEU.
  • The approach significantly reduces HTML token length and computational cost while improving answer accuracy, as validated on six diverse QA datasets.

Retrieval-Augmented Generation (RAG) systems commonly use the web as a source of external knowledge. A typical approach involves retrieving web pages, extracting plain text from their HTML sources, and feeding this text to an LLM. However, this process discards valuable structural and semantic information present in HTML, such as headings, tables, and specific tag semantics (e.g., <code>, <a>). The paper "HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems" (2411.02959) proposes using HTML directly as the input format for LLMs in RAG systems to preserve this rich information.

The core idea of HtmlRAG is that LLMs, having been pre-trained on web data, inherently possess the capability to understand HTML structure and semantics. Using HTML directly could lead to better RAG performance compared to relying solely on plain text. However, raw HTML from the web presents significant challenges: it is often excessively long (averaging over 80K tokens) and contains a large amount of noise (CSS, JavaScript, comments, lengthy attributes) that can overwhelm LLMs and increase computational cost.

To address these challenges, HtmlRAG introduces a two-stage processing pipeline:

  1. HTML Cleaning: This is a rule-based pre-processing step that operates without considering the user query. Its goal is to remove semantically irrelevant content and compress redundant structures while retaining core semantic information.
    • Content Cleaning: Removes CSS style blocks, comments, JavaScript, and lengthy HTML tag attributes, drastically reducing the token count by eliminating non-essential elements.
    • Lossless Structural Compression: Simplifies the HTML tree by merging multiple layers of single-nested tags (e.g., <div><div><p>...</p></div></div> becomes <p>...</p>) and removing empty tags (<p></p>). This cleaning process can reduce the HTML length to about 6% of its original size, making it more manageable.
  2. HTML Pruning: Even after cleaning, HTML documents can be long, especially when multiple retrieved documents are concatenated. The pruning step further shortens the HTML based on its relevance to the user's query, leveraging the HTML's tree structure. This is a two-step process operating on a "block tree", a representation derived from the HTML's DOM tree but with adjustable granularity.
    • Block Tree Construction: Instead of pruning on the fine-grained DOM tree, HTML is parsed into a block tree where nodes are merged into hierarchical blocks. The granularity of blocks is controlled by a parameter (e.g., maxWords per block). Blocks can represent merged content of child nodes or the text directly attached to a node.
    • Pruning Blocks based on Text Embedding: This is the first pruning stage, typically applied to a coarser-grained block tree. Plain text is extracted from each block, and its relevance to the user's query is scored using an embedding model (e.g., BGE). A greedy algorithm then iteratively removes blocks with the lowest relevance scores until the total HTML length fits within the LLM's context window. The HTML structure is re-adjusted after pruning (merging single-nested tags, removing empty tags). While lightweight, this method's effectiveness is limited on fine-grained blocks where text content is too short for robust embeddings.
    • Generative Fine-Grained Block Pruning: This second pruning stage operates on a finer-grained block tree (expanding leaf nodes from the first stage). It uses a fine-tuned generative model (e.g., a lightweight LLM like Phi-3.5-Mini) to score blocks. Inspired by chunk scoring methods, the generative model calculates a score for each block based on the generation probability of its "block path" (the sequence of HTML tags from the root to the block's tag). The model is fine-tuned on a task to generate the path and content of the most relevant block given the HTML and query. An efficient tree-based inference method with dynamic skipping is used during scoring, leveraging the shared prefixes in tokenized block paths to reduce computation. A greedy pruning algorithm similar to the embedding-based method is then applied based on these generative scores.
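As a concrete illustration, the cleaning rules in step 1 can be sketched with a few regular expressions. This is a deliberately simplified, stdlib-only toy: the paper's implementation operates on a parsed DOM tree and is more robust than any regex-based approach.

```python
import re

def clean_html(html: str) -> str:
    """Toy sketch of rule-based HTML cleaning (not the paper's exact rules)."""
    # Content cleaning: drop scripts, styles, and comments.
    html = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    html = re.sub(r"<style\b.*?</style>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # Drop lengthy tag attributes, keeping only the tag name.
    html = re.sub(r"<(\w+)\b[^>]*>", r"<\1>", html)
    # Lossless structural compression: repeatedly remove empty tags and
    # unwrap a single-nested wrapper whose only child is a leaf tag.
    prev = None
    while prev != html:
        prev = html
        html = re.sub(r"<(\w+)>\s*</\1>", "", html)  # e.g. <p></p>
        html = re.sub(r"<(\w+)>\s*(<(\w+)>[^<]*</\3>)\s*</\1>", r"\2", html)
    return html.strip()

# <div><div><p>...</p></div></div> collapses to <p>...</p>, as described above.
print(clean_html('<div class="a"><div><p>HtmlRAG<span></span></p></div></div>'
                 '<script>x()</script>'))  # <p>HtmlRAG</p>
```

The loop runs the compression rules to a fixed point, so arbitrarily deep single-nested wrappers are unwrapped from the inside out.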

The experimental evaluation of HtmlRAG was conducted on six QA datasets (ASQA, Hotpot-QA, NQ, Trivia-QA, MuSiQue, ELI5) using real web search results from Bing. The method was compared against plain text and Markdown baselines using chunking-based rerankers (BM25, BGE, E5-Mistral) and abstractive refiners (LongLLMLingua, JinaAI Reader), with Llama-3.1 8B and 70B Instruct models as the readers.

Key findings from the experiments demonstrate the effectiveness of HtmlRAG:

  • HtmlRAG, particularly with the two-stage pruning, consistently outperforms plain-text-based and Markdown-based RAG systems across various QA tasks and evaluation metrics (EM, Hit@1, ROUGE-L, BLEU).
  • Using cleaned HTML directly (HtmlRAG without pruning) also shows competitive performance against plain text and Markdown when using LLMs with large context windows (128K tokens), highlighting the inherent advantage of the HTML format itself.
  • Ablation studies confirm that both the block tree structure and the two pruning stages (embedding-based and generative) contribute significantly to the performance gains. The generative pruning step is crucial for handling finer granularity effectively.
  • The efficiency analysis indicates that the computational cost of the HTML pruning steps is significantly lower than the final LLM inference for answer generation, making the proposed approach practical for deployment. The tree-based inference with dynamic skipping further optimizes the generative pruning.

For practical implementation, the HtmlRAG pipeline involves:

  1. Retrieving HTML documents (e.g., via a search API).
  2. Applying the rule-based HTML Cleaning module.
  3. Constructing a coarse-grained block tree from the cleaned HTML.
  4. Pruning the block tree using an embedding model's similarity scores.
  5. Constructing a finer-grained block tree from the result of step 4.
  6. Pruning the finer-grained block tree using a fine-tuned generative model's path probability scores.
  7. Feeding the final pruned HTML to the reader LLM along with the query.
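The embedding-based scoring and greedy pruning in steps 3–4 can be sketched as follows. Word overlap stands in for the embedding model's similarity score, and the HTML re-adjustment after each removal is omitted; both simplifications are assumptions for brevity, not the paper's implementation.

```python
def relevance(query: str, block: str) -> float:
    """Stand-in relevance score: fraction of query words present in the block.
    The paper uses an embedding model (e.g., BGE) here instead."""
    q, b = set(query.lower().split()), set(block.lower().split())
    return len(q & b) / (len(q) or 1)

def greedy_prune(blocks: list[str], query: str, budget: int) -> list[str]:
    """Iteratively drop the least-relevant block until the total length
    fits within the context budget."""
    kept = list(blocks)
    while kept and sum(len(b) for b in kept) > budget:
        kept.remove(min(kept, key=lambda b: relevance(query, b)))
    return kept

blocks = ["HtmlRAG keeps HTML structure",
          "CSS styling rules",
          "tables and headings in HTML"]
print(greedy_prune(blocks, "HTML structure", 60))
```

The second pruning stage follows the same greedy loop, but with scores coming from the fine-tuned generative model's block-path probabilities instead of embedding similarity.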

The paper provides pseudocode for the block tree construction and greedy pruning algorithms in its appendix, guiding the implementation process. The code and datasets are also made available on GitHub for reproducibility. The choice of granularity (maxWords) for the block tree governs a trade-off between pruning flexibility and the effectiveness of the scoring models: embedding models score larger blocks more reliably, while the fine-tuned generative model can handle finer-grained ones.
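To make the granularity trade-off concrete, here is a toy block-tree construction in the same spirit: a subtree small enough to fit under the word budget is merged into a single block, otherwise its children are split recursively. The (tag, children) tuple format and the merge criterion are illustrative assumptions, not the paper's pseudocode.

```python
# Hypothetical node format: ("tag", [children]) tuples, with plain strings
# as text leaves. Not the paper's DOM representation.
def word_count(node) -> int:
    if isinstance(node, str):
        return len(node.split())
    _, children = node
    return sum(word_count(c) for c in children)

def to_text(node) -> str:
    if isinstance(node, str):
        return node
    _, children = node
    return " ".join(to_text(c) for c in children)

def build_blocks(node, max_words: int) -> list[str]:
    """Merge a subtree into one block when it fits the word budget;
    otherwise recurse into its children. Text attached directly to a
    node always forms its own block."""
    if word_count(node) <= max_words or isinstance(node, str):
        return [to_text(node)]
    _, children = node
    blocks = []
    for c in children:
        blocks.extend(build_blocks(c, max_words))
    return blocks

tree = ("body", [("div", [("p", ["a b c"]), ("p", ["d e"])]),
                 ("p", ["f g h i j k"])])
print(build_blocks(tree, 5))    # finer granularity: two blocks
print(build_blocks(tree, 100))  # coarser granularity: one merged block
```

Lowering max_words yields more, smaller blocks (finer pruning, but shorter texts for the scorers); raising it yields fewer, larger blocks.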

In summary, HtmlRAG presents a novel and effective approach to improve RAG systems by leveraging the rich structural information in HTML documents. The proposed cleaning and two-stage pruning methods provide a practical pipeline to manage the length and noise of raw HTML, enabling LLMs to better utilize retrieved knowledge for superior question answering performance.
