PixelRAG: Pixel-Based Retrieval Augmented Generation

Updated 4 July 2026

PixelRAG is a system that retrieves webpage content in its native rendered form by slicing screenshots into tiles for visual embedding.
It bypasses traditional text extraction limitations by preserving layout cues and structural details, leading to improved performance on multiple QA benchmarks.
The approach employs an efficient single-vector retrieval method over millions of image tiles, significantly reducing storage overhead and enabling web-scale deployment.

PixelRAG is a retrieval-augmented generation system that represents the web in its native rendered form and performs retrieval and reading entirely in pixel space. Rather than parsing webpages into text, chunking that text, retrieving text chunks, and asking an LLM to read them, PixelRAG renders pages, slices the rendered pages into screenshot tiles, retrieves those tiles with a visual retriever, and feeds the retrieved images directly to a vision-LLM. In the specific system introduced in "PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation" (Wang et al., 1 Jun 2026), the retrieval corpus is a large datastore of webpage screenshots rather than text passages. This suggests a broader interpretation of "PixelRAG" as visually grounded RAG over native image-like artifacts, although related systems in other domains often operate over patches, tiles, or objects rather than literal individual pixels.

1. Conceptual basis and problem formulation

PixelRAG begins from a critique of standard web RAG: web content is usually converted from HTML into clean text before retrieval and reading, but that conversion is a complex engineering layer involving heuristic or learned extractors such as trafilatura, mwparserfromhell, ReaderLM, or Dripper, and the resulting abstraction can be lossy (Wang et al., 1 Jun 2026). The paper argues that webpages are heterogeneous objects mixing prose with tables, infoboxes, lists, charts, images, dynamically rendered elements, and navigation clutter. Even when extraction succeeds, layout, typography, visual emphasis, section hierarchy, and $2$D structure are flattened into a $1$D token stream. The authors note prior evidence that parser choice alone can materially affect downstream performance, report that two strong parsing methods differ by nearly $10$ absolute points on SimpleQA, and state that a single extractor can discard more than $40\%$ of recoverable webpage text in some settings.

The central thesis is therefore not merely that screenshots preserve non-textual content, but that they preserve retrieval-relevant rendered structure. Tables, infoboxes, bordered sidebars, captions, and paragraph blocks remain visually distinct, whereas linearized text can destroy grouping signals or create false topical overlap. PixelRAG treats screenshots as the primary knowledge representation and relies on modern VLMs for OCR-like reading, document understanding, and structured visual reasoning. The paper presents this as, to the authors' knowledge, the first end-to-end screenshot-based RAG pipeline at web scale, operating over the full English Wikipedia snapshot and also over a large news corpus.

A common misconception is that PixelRAG is relevant only when the answer itself is visually encoded. The reported results are more specific and more surprising: screenshot retrieval outperforms text retrieval not only on multimodal tasks, but also on canonical text-centric question answering benchmarks such as Natural Questions and SimpleQA. At the same time, the paper does not argue that text is always inferior. It explicitly states that text remains attractive when readers are weak at reading pixels, when HTML/text infrastructure is already mature, or when explicit symbolic structures such as hyperlinks are needed.

2. End-to-end architecture and screenshot datastore construction

The PixelRAG pipeline has three stages: data collection, index construction, and runtime retrieval plus generation (Wang et al., 1 Jun 2026). In the standard text-RAG path, HTML is parsed into text, chunked, embedded, indexed, retrieved, and passed as text to a reader. In PixelRAG, each webpage is rendered, the render is tiled into screenshots, the tiles are embedded visually and indexed, and the top retrieved screenshot tiles are passed as images to the reader. No parser is needed in the main pipeline.

For Wikipedia, all HTML, CSS, and image assets come from a Kiwix ZIM archive, giving a static local snapshot. For news, the corpus is collected with an asynchronous crawler over BBC, AP News, and CNN. Rendering is performed offline with Playwright using headless Chromium. The system strips non-content elements such as navigation bars, sidebars, browser chrome, and surrounding whitespace so that the image contains only the content region. Long pages are captured by scrolling. The viewport width is fixed to $875$ pixels, matching Wikipedia’s default content width, and each full-page screenshot is sliced into non-overlapping tiles that are $1024$ pixels tall, with the last tile possibly shorter.
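The tiling rule can be stated compactly in code. The following is a minimal sketch that computes crop boxes for one full-page screenshot under the geometry described above (875-pixel width, 1024-pixel tile height, shorter final tile). The function name `tile_boxes` and the box layout are illustrative assumptions rather than anything taken from the paper; the boxes use the (left, top, right, bottom) order that Pillow's `Image.crop` accepts.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(page_height: int, width: int = 875, tile_height: int = 1024) -> List[Box]:
    """Return crop boxes for non-overlapping tiles that cover a full-page screenshot."""
    boxes: List[Box] = []
    top = 0
    while top < page_height:
        # The final tile is clipped to the page, so it may be shorter than tile_height.
        bottom = min(top + tile_height, page_height)
        boxes.append((0, top, width, bottom))
        top = bottom
    return boxes


# Hypothetical page that is 2500 pixels tall: two full tiles plus a shorter one.
example_boxes = tile_boxes(2500)
```

Because tiles do not overlap, content that straddles a tile boundary is split across two tiles; the sketch reproduces that behavior rather than trying to avoid it.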

These choices produce a web-scale visual datastore. The paper reports roughly $30$ million tiles for Wikipedia from $7{,}134{,}778$ rendered articles, averaging about $4.2$ tiles per article, and about $3.6$ million tiles for the […]-article news corpus. A crucial design decision is single-vector retrieval rather than late-interaction multivector retrieval. A single $875 \times 1024$ tile produces about […] visual tokens after patching and projector merging, and at […] dimensions per token that would make each tile roughly […]K dimensions under a multivector index. For $30$ million tiles, the paper estimates about […] TB in fp16 for the index alone. PixelRAG instead uses a single […]-dimensional vector per tile, bringing the full $30$M-tile Wikipedia index down to about […] GB in fp16.
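The storage argument is plain arithmetic: index size equals tiles times vectors per tile times dimensions times bytes per value. The sketch below makes that relationship explicit. Only the tile count comes from the text; the tokens-per-tile and dimension values are hypothetical placeholders chosen for illustration and should not be read as the paper's figures.

```python
def index_size_bytes(num_tiles: int, vectors_per_tile: int, dim: int,
                     bytes_per_value: int = 2) -> int:
    """Raw embedding storage for an index (fp16 uses 2 bytes per value)."""
    return num_tiles * vectors_per_tile * dim * bytes_per_value


NUM_TILES = 30_000_000   # roughly the Wikipedia tile count stated in the text
TOKENS_PER_TILE = 1_000  # HYPOTHETICAL: visual tokens kept per tile (multivector)
TOKEN_DIM = 128          # HYPOTHETICAL: dimensions per token embedding
SINGLE_DIM = 1_024       # HYPOTHETICAL: dimensions of the single tile vector

multi_bytes = index_size_bytes(NUM_TILES, TOKENS_PER_TILE, TOKEN_DIM)
single_bytes = index_size_bytes(NUM_TILES, 1, SINGLE_DIM)

print(f"multivector: {multi_bytes / 1e12:.2f} TB")
print(f"single-vector: {single_bytes / 1e9:.2f} GB")
print(f"reduction factor: {multi_bytes / single_bytes:.0f}x")
```

Whatever the exact values, the reduction factor is (tokens per tile × token dimension) / single-vector dimension, which is why collapsing each tile to one vector changes the index from a multi-terabyte to a gigabyte-scale object.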

The approximate nearest-neighbor index is built with FAISS using IVF. The authors justify IVF on two grounds: exact search is infeasible at this scale, and IVF supports relatively fast build times plus incremental add/delete for corpus updates. The full offline Wikipedia pipeline, from rendering through indexing, completes in about two days on one machine with […] CPU cores, […] TB RAM, and […] H100 GPUs.
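The inverted-file idea behind IVF can be illustrated without FAISS. The toy class below is neither FAISS's API nor the paper's implementation: `ToyIVF`, its method names, and the hand-picked centroids are hypothetical, and a real IVF index learns its centroids (typically with k-means) over far more vectors. It only shows why search cost drops (a query scans a few buckets, not the whole corpus) and why adding or deleting a vector is a local operation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyIVF:
    """Toy inverted-file index: each vector lives in the bucket of its best centroid."""

    def __init__(self, centroids: List[Vector]):
        self.centroids = centroids
        self.buckets: List[Dict[int, Vector]] = [{} for _ in centroids]

    def _rank_centroids(self, vec: Vector) -> List[int]:
        # Centroid ids sorted by descending inner product with vec.
        return sorted(range(len(self.centroids)),
                      key=lambda c: -dot(vec, self.centroids[c]))

    def add(self, item_id: int, vec: Vector) -> None:
        self.buckets[self._rank_centroids(vec)[0]][item_id] = vec

    def remove(self, item_id: int) -> None:
        for bucket in self.buckets:
            bucket.pop(item_id, None)

    def search(self, query: Vector, k: int, nprobe: int = 1) -> List[Tuple[int, float]]:
        # Scan only the nprobe most promising buckets instead of every vector.
        candidates: List[Tuple[int, float]] = []
        for c in self._rank_centroids(query)[:nprobe]:
            candidates.extend((i, dot(query, v)) for i, v in self.buckets[c].items())
        return sorted(candidates, key=lambda pair: -pair[1])[:k]


index = ToyIVF(centroids=[(1.0, 0.0), (0.0, 1.0)])
index.add(0, (0.9, 0.1))
index.add(1, (0.8, 0.3))
index.add(2, (0.1, 0.95))
hits_before = index.search((1.0, 0.0), k=2)
index.remove(0)  # incremental delete touches only the owning bucket
hits_after = index.search((1.0, 0.0), k=2)
```

Raising `nprobe` trades speed for recall, since vectors near a bucket boundary can be missed when only one bucket is scanned.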

3. Visual retrieval model, synthetic supervision, and reading interface

At runtime, the query is embedded with the same visual embedding model used for screenshot tiles, the top-$k$ screenshot tiles are retrieved by inner-product similarity, and those retrieved tiles are fed directly as images, together with the question, to a VLM reader (Wang et al., 1 Jun 2026). There is no OCR step or intermediate text conversion in the core PixelRAG path. In the main experiments, the default reader is Qwen3.5-4B, and the default retrieval depth is […].
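The runtime path described above can be sketched in a few lines. This is a schematic under stated assumptions, not the paper's code: `embed_query` and `reader` stand in for the embedding model and the VLM and are passed in as plain callables, and the function names, the tile-id strings, and the dictionary payload are all hypothetical. The exhaustive scoring loop is used here only for clarity; at web scale it would be replaced by an approximate index.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def retrieve_top_k(query_vec: Vector, tile_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Rank tiles by inner product with the query embedding and keep the best k."""
    scores = {tile_id: sum(q * t for q, t in zip(query_vec, vec))
              for tile_id, vec in tile_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def answer(question: str,
           embed_query: Callable[[str], Vector],
           tile_vecs: Dict[str, Vector],
           reader: Callable[[dict], str],
           k: int) -> str:
    """Embed the query, retrieve tiles, and hand images plus question to the reader."""
    tile_ids = retrieve_top_k(embed_query(question), tile_vecs, k)
    # No OCR or text conversion: the reader receives the tile images themselves.
    return reader({"images": tile_ids, "text": question})


# Stub components for illustration only.
toy_tiles = {"a": (0.2, 0.9), "b": (0.9, 0.1), "c": (0.5, 0.5)}
toy_answer = answer(
    "example question",
    embed_query=lambda q: (1.0, 0.0),
    tile_vecs=toy_tiles,
    reader=lambda payload: f"read {len(payload['images'])} tiles",
    k=2,
)
```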

The base retrieval model is Qwen3-VL-Embedding-2B. PixelRAG then adapts it to webpage screenshots via contrastive fine-tuning using synthetic training data generated entirely from the screenshot datastore. Training triples have the form $(q, t^{+}, \{t^{-}_{j}\})$, where $q$ is a query, $t^{+}$ is a positive screenshot tile that answers the query, and $\{t^{-}_{j}\}$ are hard negative tiles that are topically similar but do not answer the query. The data pipeline has two stages. In Stage 1, the system generates positive query-tile pairs by sampling tiles, prompting an LLM to produce one factual, self-contained question whose answer is completely visible in that tile, and then filtering for self-containedness and answerability. In Stage 2, it mines hard negatives dynamically: for a given $(q, t^{+})$ pair, it retrieves the top-[…] candidates using the base embedding model, sets […], filters false negatives with an LLM-based answerability check, and retains the first […] surviving hard negatives. After filtering, the training set contains about […]K synthetic query-tile pairs.
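The Stage 2 filtering logic can be sketched as follows. This is an illustrative outline, not the authors' pipeline: `mine_hard_negatives` and its parameters are hypothetical names, the ranked candidate list is assumed to come from the base embedding model, and `answers_query` stands in for the LLM-based answerability check. Any further candidate-selection rule used in the paper is not modeled here.

```python
from typing import Callable, List, Sequence


def mine_hard_negatives(query: str,
                        positive_id: str,
                        ranked_ids: Sequence[str],
                        answers_query: Callable[[str, str], bool],
                        num_negatives: int) -> List[str]:
    """Walk retrieval results in rank order and keep the first tiles that do NOT answer."""
    negatives: List[str] = []
    for tile_id in ranked_ids:
        if tile_id == positive_id:
            continue  # the positive tile is never its own negative
        if answers_query(query, tile_id):
            continue  # false negative: similar AND answers the query, so drop it
        negatives.append(tile_id)
        if len(negatives) == num_negatives:
            break
    return negatives


# Toy run: "t3" also answers the query, so it is filtered as a false negative.
toy_negatives = mine_hard_negatives(
    query="example question",
    positive_id="t2",
    ranked_ids=["t1", "t2", "t3", "t4", "t5"],
    answers_query=lambda q, tile_id: tile_id == "t3",
    num_negatives=2,
)
```

Keeping the highest-ranked survivors is what makes the negatives hard: they are the tiles the base retriever already confuses with the positive.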

The retriever is trained with InfoNCE using in-batch negatives and mined negatives:

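$$\mathcal{L} = -\log \frac{\exp\left(s(q, t^{+})/\tau\right)}{\exp\left(s(q, t^{+})/\tau\right) + \sum_{t^{-} \in \mathcal{N}(q)} \exp\left(s(q, t^{-})/\tau\right)}$$

This is the standard InfoNCE form, in which $q$ is the query, $t^{+}$ its positive tile, $s(\cdot,\cdot)$ the embedding similarity, $\tau$ a temperature, and $\mathcal{N}(q)$ the set of in-batch negatives and mined hard negatives for $q$.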
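For a single query, InfoNCE is the negative log-softmax of the positive tile's similarity against the similarities of all negatives. The sketch below computes that quantity from precomputed similarity scores. It is a generic illustration rather than the paper's training code: the function name is hypothetical, and the default temperature of 0.05 is an assumed placeholder, not a value reported in the text.

```python
import math
from typing import Sequence


def info_nce(pos_score: float, neg_scores: Sequence[float],
             temperature: float = 0.05) -> float:
    """InfoNCE loss for one query given similarity scores (temperature is an assumption)."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# In-batch negatives and mined hard negatives simply share the denominator.
in_batch = [0.10, 0.05]
hard = [0.60, 0.55]
loss_easy = info_nce(0.9, in_batch)
loss_hard = info_nce(0.9, in_batch + hard)
```

Hard negatives score close to the positive, so they dominate the denominator and contribute most of the gradient; that is the motivation for mining them rather than relying on in-batch negatives alone.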

Training uses batch size […] with GradCache, […] hard negatives per query, peak learning rate […], […] warmup steps, and cosine decay, and completes in under […] hours on a single H100. An important model-design result is that PixelRAG benefits from adapting not just the language backbone but also the visual backbone. The authors apply LoRA to both the LLM backbone and the ViT and report consistent improvements on rendered webpage screenshots.

The reading stage remains fully visual. Retrieved screenshots are passed directly to the VLM reader with the text query. The paper also studies image compression as an efficiency lever specific to pixel-space RAG. Each tile originally measures $875 \times 1024$ pixels but can be downscaled before reading, for example to […], without changing retrieval or the datastore. Since modern VLMs use dynamic resolution, lowering image resolution reduces visual tokens roughly in proportion to pixel count. In controlled reader-SFT evaluation, average accuracy drops from […] uncompressed to […] at […] compression and to […] at […] compression if the reader is not trained for compressed inputs. After supervised fine-tuning with synthetic […] triples plus distractor tiles, […]-compressed reading reaches average […], exceeding the uncompressed ceiling of […], and […]-compressed reading reaches […], essentially matching the uncompressed baseline while using one third of the pixel budget.
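The relationship between downscaling and token budget can be made concrete. The sketch below assumes that a compression factor refers to the reduction in pixel count (so each side shrinks by the square root of the factor) and that visual tokens scale linearly with pixels, as the text states approximately; both function names are hypothetical.

```python
import math
from typing import Tuple

Size = Tuple[int, int]  # (width, height) in pixels


def downscale_for_pixel_factor(size: Size, pixel_factor: float) -> Size:
    """Shrink an image so its pixel count drops by about pixel_factor, keeping aspect ratio."""
    if pixel_factor < 1:
        raise ValueError("pixel_factor must be >= 1")
    side_scale = math.sqrt(pixel_factor)  # each side shrinks by sqrt of the pixel factor
    width, height = size
    return max(1, int(width / side_scale)), max(1, int(height / side_scale))


def relative_visual_tokens(original: Size, resized: Size) -> float:
    """Approximate token ratio, assuming tokens are proportional to pixel count."""
    return (resized[0] * resized[1]) / (original[0] * original[1])


tile = (875, 1024)
quarter = downscale_for_pixel_factor(tile, 4.0)
quarter_ratio = relative_visual_tokens(tile, quarter)
```

Because only the reader input is resized, the datastore and the retrieval index are untouched, which is what makes compression a reader-side knob.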

4. Benchmark behavior and empirical results

PixelRAG is evaluated on Wikipedia QA, multimodal open-domain QA, news QA, and agentic search (Wang et al., 1 Jun 2026). The Wikipedia benchmarks are Natural Questions, NQ-Tables, and SimpleQA, each with […] examples and GPT-4.1 judge-based accuracy. The multimodal benchmarks are MMSearch and Encyclopedic VQA on the landmarks subset. On news, the system is evaluated on LiveVQA-2025 over a screenshot datastore built from BBC, AP News, and CNN. For agentic search, PixelRAG serves as the search backend for a GPT-5 ReAct agent on MoNaCo.

The headline pattern is consistent across the six main benchmarks: with Qwen3.5-4B as reader and […], PixelRAG improves over both text baselines on all six tasks.

Benchmark	Strongest reported text baseline	Best reported PixelRAG result
NQ	55.9	58.7
NQ-Tables	42.5	48.8
SimpleQA	71.6	78.8
MMSearch	25.3	28.3
EVQA	29.6	45.1
LiveVQA	59.0	70.3

The detailed retrieval numbers reinforce the same conclusion. On NQ, Recall@3 rises from […] with trafilatura to […] with PixelRAG. On NQ-Tables, recall improves from […] to […]. On SimpleQA, Recall@3 rises from […] to […]. On EVQA, text retrieval has very low recall around […]–[…], while PixelRAG raises it to […] with the base retriever and […] with the fine-tuned retriever. The abstract summarizes the overall improvement as accuracy gains of up to […] over text-based baselines; in the detailed benchmark discussion, the clearest table-backed comparison to the strongest text baseline is $45.1$ versus $29.6$ on EVQA, a $15.5$-point gain.
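Recall@3 figures of this kind are typically computed as the share of questions for which at least one answer-bearing tile appears among the top three retrieved tiles. The sketch below implements that common definition; the paper's exact matching criterion is not given here, so the function name, the tile-id representation, and the "any gold tile counts as a hit" rule are assumptions.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[Sequence[str]],
                gold: Sequence[Set[str]],
                k: int = 3) -> float:
    """Fraction of queries with at least one gold tile in the top-k retrieved tiles."""
    if not gold or len(retrieved) != len(gold):
        raise ValueError("need one ranked list per query and at least one query")
    hits = sum(1 for ranked, relevant in zip(retrieved, gold)
               if relevant & set(ranked[:k]))
    return hits / len(gold)


# Two toy queries: the first has its gold tile at rank 3, the second at rank 4.
toy_retrieved = [["a", "b", "c", "d"], ["x", "y", "z", "w"]]
toy_gold = [{"c"}, {"w"}]
toy_recall_3 = recall_at_k(toy_retrieved, toy_gold, k=3)
toy_recall_4 = recall_at_k(toy_retrieved, toy_gold, k=4)
```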

The evidence-type analysis on SimpleQA is particularly important because it shows that the gains are not restricted to overtly visual structures. Overall, PixelRAG’s evidence Recall@3 is […] versus […] for trafilatura. For table questions, PixelRAG reaches […] evidence Recall@3 versus […] for trafilatura and […] for mwparserfromhell; corresponding QA accuracies are […], […], and […]. For paragraph evidence, PixelRAG reaches […] evidence Recall@3 versus […] for trafilatura, with accuracy […] versus […]. The paper attributes part of this paragraph gain to an "infobox displacement" failure mode in text retrieval: once an infobox is linearized, its keyword-dense text can crowd out the actual answer-bearing paragraph.

Modality ablations indicate that retrieval modality is the larger contributor, though visual reading also matters. Each configuration is written as retrieval modality$\rightarrow$reading modality. Screenshot$\rightarrow$Screenshot scores […] on SimpleQA and […] on LiveVQA; Screenshot$\rightarrow$OCR text scores […] and […]; Text$\rightarrow$Text scores […] and […]; Text$\rightarrow$Rendered image scores […] and […]. The fact that Screenshot$\rightarrow$OCR still beats Text$\rightarrow$Text shows that retrieving screenshots helps even if the reader later consumes text. The fact that Screenshot$\rightarrow$Screenshot beats Screenshot$\rightarrow$OCR shows that the visual reading stage adds value beyond retrieval.

Reader quality is a decisive variable. With weak VLMs, pixel retrieval can underperform text retrieval because the model cannot reliably read rendered webpage text. LLaVA-1.5-7B scores […] with pixel retrieval versus […] with text retrieval on SimpleQA; Qwen2-VL-2B scores […] versus […]. The crossover reported in the paper occurs at Qwen3-VL-4B, which reaches […] with pixel retrieval versus […] with text retrieval. Above that point, all tested stronger models favor pixel retrieval. This makes PixelRAG’s empirical claim contingent not only on retrieval quality but also on the maturity of the downstream VLM reader.

5. Relation to visually grounded RAG beyond the web

PixelRAG is a specific screenshot-based web RAG system, but it also sits within a wider family of visually grounded retrieval-augmented methods. In "ImageRAG: Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG" (Zhang et al., 2024), ultra-high-resolution remote sensing image analysis is reframed as image long-context selection. The retrieval corpus is derived from the input image itself through a patch division scheme, and the system retrieves visual contexts with a fast path based on text-image similarity and a slow path that bridges domain mismatch through a remote-sensing label-image gallery. The method is training-free as a framework and retrieves actual patches from the source image, not detached embeddings. The paper is explicit, however, that this is not pixel-level retrieval; it operates over patches, crops, and tiles.

"RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation" (Chang et al., 15 Jan 2026) pushes the same design pattern into $1$03D scene understanding. There, retrieval operates over reliable low-uncertainty objects in a scene-internal memory rather than over raw pixels or dense image patches. High-uncertainty objects retrieve nearby reliable object context for caption refinement, and generation is conditioned on both retrieved textual context and visual evidence in the form of a composite image containing a re-shot render and a selected crop. Again, the system is visually grounded but object-centric rather than pixel-centric.

"Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation" (Zhu et al., 28 May 2025) addresses another problem that is directly relevant to PixelRAG-style reasoning: global retrieval is often inadequate for compositional prompts. The paper decomposes queries into subqueries, computes subquery-conditioned visual representations, and returns a Pareto-optimal set of images that jointly cover the prompt. Generation is then guided by explicit subquery-aware instructions specifying which retrieved image should contribute which query aspects. This does not operate in screenshot space, but it provides a formal account of complementary evidence retrieval rather than ordinary top-$1$04 similarity.

Taken together, these works suggest a broader technical pattern: treat a large visual environment as a retrievable memory, define an appropriate retrieval unit, and supply only the most relevant visual evidence to a downstream generator. What differs across systems is the retrieval unit. PixelRAG uses webpage screenshot tiles; ImageRAG uses remote-sensing patches; RAG-3DSG uses reconstructed scene objects; Cross-modal RAG uses image-level references with subquery-conditioned representations. In that sense, the distinctive feature of PixelRAG proper is not merely that it uses images, but that it keeps both retrieval and reading in rendered pixel space over a web-scale corpus.

6. Limitations, systems considerations, and research directions

PixelRAG has substantial infrastructure and deployment costs. The screenshot store itself is about […] TB for Wikipedia and […] GB for news (Wang et al., 1 Jun 2026). The appendix notes that persistent tile storage is not strictly necessary after embedding, since one could retain only the vector index and the original HTML/CSS/assets and re-render retrieved pages on demand at inference time, but this is described as an engineering direction rather than an evaluated component. The paper also identifies several functional limitations: hyperlinks are visible as rendered text but not directly actionable; all corpora used are English; content moderation is harder in image form than on extracted text; and the fine-tuned retriever improves Wikipedia benchmarks but does not transfer cleanly to news, where the base retriever slightly outperforms the Wikipedia-tuned one on LiveVQA.

The paper’s comparison to HTML-native alternatives is also instructive. Preserving structure as raw HTML does not solve the problem for current readers, because HTML markup inflates context length by […] and wastes tokens on tags. A fully HTML-native RAG pipeline can preserve retrieval signals, but the reading stage collapses, with a […]-point drop on NQ and a […]-point drop on NQ-Tables relative to the trafilatura baseline. PixelRAG’s claim is therefore narrower and more technical than a general anti-text argument: screenshots preserve structure in a token-efficient form that current VLMs can exploit.

A separate systems issue concerns the balance between retrieval and generation. "An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline" (Kim et al., 11 Apr 2025) does not study PixelRAG directly, but it shows that retrieval and generation contend for the same GPU memory and scheduling budget. In its text-RAG setting, vector search can contribute up to […] of total time to first token (TTFT) for a […]M-vector, […]-dimensional database, and adaptive hot/cold index partitioning improves vector search responsiveness by […], reduces average TTFT by about […]–[…], and reaches […] improvement on the largest dataset. A plausible implication is that PixelRAG-like deployments, which must colocate large visual indices with VLM serving, will face at least the same GPU high-bandwidth memory (HBM) contention and TTFT trade-offs, possibly more acutely because visual prefill cost and image token budgets are larger than in text-only RAG.
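The general idea of hot/cold partitioning can be sketched as a budgeted placement problem: keep the most frequently probed index clusters in GPU memory and leave the rest on the host. The greedy heuristic below is a generic illustration and is not the adaptive scheme from Kim et al.; the function name, the accesses-per-byte ordering, and the toy numbers are all assumptions.

```python
from typing import Dict, List, Tuple


def partition_hot_cold(cluster_sizes: Dict[int, int],
                       access_counts: Dict[int, int],
                       gpu_budget: int) -> Tuple[List[int], List[int]]:
    """Greedily place clusters on the GPU by accesses per byte until the budget is full.

    Cluster sizes are assumed to be positive byte counts.
    """
    order = sorted(cluster_sizes,
                   key=lambda c: access_counts.get(c, 0) / cluster_sizes[c],
                   reverse=True)
    hot: List[int] = []
    cold: List[int] = []
    used = 0
    for cluster in order:
        if used + cluster_sizes[cluster] <= gpu_budget:
            hot.append(cluster)
            used += cluster_sizes[cluster]
        else:
            cold.append(cluster)  # stays in host memory and is searched on CPU
    return hot, cold


# Toy setup: three equal-sized clusters, GPU room for two of them.
toy_hot, toy_cold = partition_hot_cold(
    cluster_sizes={0: 100, 1: 100, 2: 100},
    access_counts={0: 5, 1: 50, 2: 20},
    gpu_budget=200,
)
```

In a colocated deployment the budget itself would shrink as the VLM's weights and KV cache grow, which is where the contention described above comes from.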

The main open problems follow directly from the evidence presented. Extending PixelRAG from static corpora to the live web raises unresolved questions around freshness, rendering correctness, crawling policies, and dynamic content. Improving cross-domain transfer for screenshot retrievers remains open. Hyperlink structure, while visually preserved, is not operational in the current representation. More broadly, PixelRAG challenges the assumption that text must be the canonical interface between the web and retrieval-augmented generation, but it does not eliminate the need for stronger readers, better multimodal systems engineering, or more flexible hybrid representations. The strongest conclusion supported by the current literature is narrower: for a wide class of web retrieval problems, rendered screenshots are not merely a fallback for visually rich pages, but a competitive primary representation for both retrieval and downstream reasoning.