State-Grounded Dynamic Retrieval (SGDR)

Updated 4 July 2026

State-Grounded Dynamic Retrieval (SGDR) is an online framework that dynamically retrieves intermediate, state-conditioned sub-procedures for web-agent tasks.
It uses sliding-window extraction and dual text-code representations to align task goals with current webpage states for actionable skill reuse.
Empirical evaluation on WebArena shows SGDR improves success rates and step efficiency compared to static, one-shot retrieval approaches.

State-Grounded Dynamic Retrieval (SGDR) is an online skill learning method for web agents that replaces static, task-level skill reuse with step-level, state-conditioned skill retrieval. In its original formulation, SGDR retrieves skills using both the task goal and the current webpage state, so that reusable sub-procedures can be invoked at intermediate execution states rather than being fixed once at task start (Li et al., 3 Jun 2026). Related work in retrieval-augmented generation, biomedical evidence acquisition, API routing, and region-level document retrieval is described as closely related or “SGDR-like,” which suggests a broader methodological pattern in which retrieval decisions are grounded in an evolving execution or evidence state rather than a single initial query (He et al., 28 Apr 2025).

1. Motivation and problem setting

SGDR was introduced to address a mismatch between prior online skill learning methods and the dynamics of web execution. Earlier methods such as AWM, ASI, and CER mainly perform one-shot retrieval at the beginning of the task: they select a set of workflows or skills from memory using the initial instruction and then keep those skills fixed throughout execution. The SGDR paper argues that this is too coarse for interactive, stateful web tasks, because a skill that is useful on the landing page may become irrelevant after navigation, while another skill may become useful only after reaching a form, a menu, or a detail page (Li et al., 3 Jun 2026).

The paper also identifies a granularity problem. If a system stores only full-trajectory skills, those skills are too tied to the original task context; if it stores only single-action skills, they are too primitive to provide meaningful abstraction. SGDR is therefore designed to learn reusable skills at an intermediate granularity that can be invoked from intermediate states (Li et al., 3 Jun 2026).

A closely related diagnosis appears in dynamic RAG. “Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models” states that conventional RAG retrieves a fixed set of documents once and then generates from that static context, even though generation itself is dynamic. That paper argues that static retrieval becomes misaligned with the evolving generation state and struggles with multi-turn reasoning, ambiguous queries, and multi-document fusion (He et al., 28 Apr 2025). This alignment between web-agent skill reuse and dynamic knowledge access is central to the broader significance of SGDR-like designs.

2. Core architecture in web agents

SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state (Li et al., 3 Jun 2026).

The overall online loop operates over a task stream. For each task $g_i$, the agent solves the task using the current skill library $\mathcal{S}_{i-1}$. After execution, an evaluator $E$ judges success, giving a binary outcome $y_i \in \{0, 1\}$. If the trajectory is successful, new skills are extracted and added to the library, yielding $\mathcal{S}_i = \mathcal{S}_{i-1} \cup \Delta \mathcal{S}_i$. The objective is to maximize cumulative success over the $N$ tasks in the stream: $\max_{\pi} \sum_{i=1}^{N} y_i$, where $\pi$ denotes the agent policy. This makes SGDR an online accumulation-and-reuse framework rather than a fixed offline skill bank (Li et al., 3 Jun 2026).
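The accumulate-and-reuse loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the callables `solve`, `evaluate`, and `extract`, and the representation of skills as plain strings, are assumptions made so that the loop runs end to end.

```python
from typing import Callable, List, Set, Tuple


def run_online(
    tasks: List[str],
    solve: Callable[[str, Set[str]], List[str]],
    evaluate: Callable[[str, List[str]], bool],
    extract: Callable[[List[str]], Set[str]],
) -> Tuple[int, Set[str]]:
    """Process a task stream, growing the skill library after each success."""
    library: Set[str] = set()  # S_0: the library starts empty
    successes = 0              # running sum of the outcomes y_i
    for goal in tasks:
        trajectory = solve(goal, set(library))  # act using S_{i-1}
        if evaluate(goal, trajectory):          # evaluator E judges success
            successes += 1
            library |= extract(trajectory)      # S_i = S_{i-1} union Delta S_i
    return successes, library


# Toy stand-ins (hypothetical) so the loop can be exercised.
def toy_solve(goal: str, library: Set[str]) -> List[str]:
    return [goal, "click", "type"]


def toy_evaluate(goal: str, trajectory: List[str]) -> bool:
    return goal != "hard"


def toy_extract(trajectory: List[str]) -> Set[str]:
    return {f"skill:{trajectory[0]}"}


total, final_library = run_online(["a", "hard", "b"], toy_solve, toy_evaluate, toy_extract)
```

Only successful trajectories contribute skills, so a failed task leaves the library unchanged for the next task in the stream.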

Skill induction begins from a successful trajectory

$\mathcal{T} = (o_1, a_1, o_2, a_2, \ldots, a_H, o_{H+1}).$

Instead of storing the entire trajectory as one skill, SGDR enumerates windows of length $l \in \mathcal{L}$, with

$\mathcal{L} = \{2,3,4,5\}.$

A candidate window is

$w_{t,l} = (o_t, a_t, \ldots, a_{t+l-1}, o_{t+l}).$

The extracted windows are intended to capture local reusable sub-procedures such as opening a settings page, filling and submitting a form, applying a filter, posting a comment, or selecting an option from a dropdown (Li et al., 3 Jun 2026).
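The window enumeration above can be sketched in a few lines. The function name and the use of strings for observations and actions are illustrative assumptions; indices are zero-based here, whereas the text uses one-based indices.

```python
from typing import List, Sequence, Tuple

WINDOW_LENGTHS = (2, 3, 4, 5)  # the set L of window lengths given in the text

Window = Tuple[str, Tuple[str, ...], str]  # (start observation, actions, end observation)


def enumerate_windows(observations: Sequence[str], actions: Sequence[str]) -> List[Window]:
    """Enumerate every window w_{t,l}: l consecutive actions plus the
    observations immediately before the first and after the last action."""
    horizon = len(actions)
    if len(observations) != horizon + 1:
        raise ValueError("a trajectory with H actions needs H + 1 observations")
    windows: List[Window] = []
    for length in WINDOW_LENGTHS:
        for t in range(horizon - length + 1):
            windows.append(
                (observations[t], tuple(actions[t:t + length]), observations[t + length])
            )
    return windows


obs = [f"o{i}" for i in range(6)]   # six observations
acts = [f"a{i}" for i in range(5)]  # five actions
candidates = enumerate_windows(obs, acts)
```

For a trajectory with five actions this produces 4 + 3 + 2 + 1 = 10 candidate windows, many of which overlap; that overlap is the redundancy the later reranking step is meant to handle.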

Each induced skill is represented as

$s_k = (d_k, c_k),$

where $d_k$ is a natural-language description used for retrieval and $c_k$ is executable code used for action execution. The paper emphasizes that retrieval and execution require different interfaces: the description supports semantic matching to the current goal and page state, whereas the code makes the skill callable as an actual browser procedure (Li et al., 3 Jun 2026).

Not every window becomes a stored skill. Each candidate is passed to an LLM to determine whether it is a reusable subroutine. Following ASI-style verification, the corresponding primitive action segment in the original trajectory is replaced with the candidate skill call and the rewritten trajectory is re-executed. Only skills whose substituted trajectories remain successful are added to the library. The resulting skill library is therefore behaviorally validated in the environment rather than merely induced from textual description (Li et al., 3 Jun 2026).
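The substitute-and-replay check can be illustrated with a small sketch. The `replay` callable stands in for re-executing the rewritten trajectory in the environment and is an assumption, as are the toy action names; the LLM reusability judgment that precedes this step is not modeled.

```python
from typing import Callable, Dict, List


def substitute(actions: List[str], start: int, length: int, skill_call: str) -> List[str]:
    """Replace actions[start:start + length] with a single skill call."""
    return actions[:start] + [skill_call] + actions[start + length:]


def verify_skill(
    actions: List[str],
    start: int,
    length: int,
    skill_call: str,
    replay: Callable[[List[str]], bool],
) -> bool:
    """Keep a candidate only if the rewritten trajectory still succeeds."""
    return replay(substitute(actions, start, length, skill_call))


# Toy environment (hypothetical): replay succeeds only when expanding the
# skill calls reproduces the original primitive action sequence.
original = ["open_menu", "click_settings", "toggle_dark", "save"]
skill_bodies: Dict[str, List[str]] = {"open_settings()": ["open_menu", "click_settings"]}


def toy_replay(rewritten: List[str]) -> bool:
    expanded: List[str] = []
    for step in rewritten:
        expanded.extend(skill_bodies.get(step, [step]))
    return expanded == original


accepted = verify_skill(original, 0, 2, "open_settings()", toy_replay)
rejected = verify_skill(original, 1, 2, "open_settings()", toy_replay)
```

The second call substitutes the skill at the wrong position, so the replay fails and the candidate would not enter the library.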

3. Retrieval mechanics and state grounding

At decision step $t$, the agent observes the current webpage state $o_t$. Because accessibility trees are verbose, SGDR first compresses the page into a short text summary: $p_t = \mathrm{Summarize}(o_t).$ The summary is intended to use the same operational vocabulary as skill descriptions, including phrases such as “product listing page,” “opened combobox,” or “submission form,” so that embedding-based retrieval is better aligned with executable subroutines (Li et al., 3 Jun 2026).

Retrieval then uses two queries: the original task instruction $g_i$ and the current page summary $p_t$. For each skill $s_k = (d_k, c_k)$, SGDR computes

$\mathrm{score}_t(s_k) = \lambda \, \mathrm{sim}(\phi(g_i), \phi(d_k)) + (1 - \lambda) \, \mathrm{sim}(\phi(p_t), \phi(d_k)).$

Here $\phi(\cdot)$ maps text into embedding space, $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity, and $\lambda$ balances task goal versus current state. The paper uses $\lambda = 0.5$, so task and state are weighted equally (Li et al., 3 Jun 2026).
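As a minimal sketch of this scoring rule, the relevance of a skill can be computed as a weighted sum of two cosine similarities, one against the task instruction and one against the page summary. The function names are illustrative, and the vectors are toy stand-ins for the output of a text embedding model, which is not specified here.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevance(
    goal_vec: Sequence[float],
    state_vec: Sequence[float],
    description_vec: Sequence[float],
    lam: float = 0.5,  # equal weighting of task and state, as stated in the text
) -> float:
    """Weighted sum of task-to-description and state-to-description similarity."""
    return lam * cosine(goal_vec, description_vec) + (1.0 - lam) * cosine(state_vec, description_vec)


# Toy embeddings: the goal and the page state point in orthogonal directions.
goal, state = [1.0, 0.0], [0.0, 1.0]
task_only_skill = relevance(goal, state, [1.0, 0.0])  # matches the goal only
balanced_skill = relevance(goal, state, [1.0, 1.0])   # matches both partially
```

In this toy setting the skill whose description overlaps both the goal and the current page outscores the one that matches the goal alone, which is the behavior the combined score is designed to favor.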

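After computing relevance scores for all skills in $\mathcal{S}_{i-1}$, SGDR keeps the top $K$ candidates, denoted $\mathcal{C}_t$, and reranks them using Maximal Marginal Relevance (MMR). Skills are greedily selected into a set $\mathcal{R}_t$ until $|\mathcal{R}_t| = k$, using a criterion of the standard MMR form

$s^{*} = \arg\max_{s_k \in \mathcal{C}_t \setminus \mathcal{R}_t} \left[ \mu \, \mathrm{score}_t(s_k) - (1 - \mu) \, \mathrm{red}(s_k, \mathcal{R}_t) \right],$

with

$\mathrm{red}(s_k, \mathcal{R}_t) = \max_{s_j \in \mathcal{R}_t} \mathrm{sim}(\phi(d_k), \phi(d_j)),$

where $\mathrm{score}_t(s_k)$ is the relevance score of skill $s_k$, $\phi(d_k)$ is the embedding of its description, $\mathrm{sim}$ is cosine similarity, and $\mu \in [0, 1]$ trades relevance against redundancy. The purpose of MMR is to reduce redundancy produced by overlapping sliding windows while preserving high relevance (Li et al., 3 Jun 2026).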
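A greedy MMR reranker in its standard textbook form can be sketched as below. The parameter names and default values (`mu`, `top_n`) are placeholders chosen for illustration and are not the paper's settings; skills are identified by name, and pairwise similarity is supplied as a callable.

```python
from typing import Callable, Dict, List


def mmr_select(
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    mu: float = 0.5,   # placeholder trade-off between relevance and redundancy
    top_n: int = 10,   # placeholder size of the candidate pool
) -> List[str]:
    """Greedily pick k skills, penalizing similarity to those already picked."""
    candidates = sorted(relevance, key=lambda s: relevance[s], reverse=True)[:top_n]
    selected: List[str] = []

    def mmr_score(skill: str) -> float:
        # Redundancy is the highest similarity to any already selected skill.
        redundancy = max((similarity(skill, chosen) for chosen in selected), default=0.0)
        return mu * relevance[skill] - (1.0 - mu) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: "fill_form_v2" nearly duplicates "fill_form" (overlapping windows).
scores = {"fill_form": 0.90, "fill_form_v2": 0.88, "open_menu": 0.60}


def toy_similarity(a: str, b: str) -> float:
    return 0.95 if {a, b} == {"fill_form", "fill_form_v2"} else 0.10


picked = mmr_select(scores, toy_similarity, k=2)
```

Plain top-2 selection would return the two near-duplicate form skills; with the redundancy penalty, the second slot goes to the less relevant but distinct skill.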

The retrieved skills are exposed only for the current step. This is a defining property of SGDR: retrieval is not a one-time initialization of the prompt, but a recurrent control decision made as the webpage changes. The paper is explicit that skills are meant to be state-contingent subroutines callable from intermediate webpage states. Each skill is a parameterized Python function whose arguments fall into two classes: structural webpage arguments such as element IDs, button IDs, and dropdown IDs, and task-specific content arguments such as search queries, comments, or destination names (Li et al., 3 Jun 2026).

The paper illustrates this with a representative Map-domain skill. It notes that a skill is reusable only if the required structural elements are visible in the current page state when the agent invokes it. The induction prompt explicitly rejects routines that depend on UI elements that become visible only after a later transition (Li et al., 3 Jun 2026).
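The split between structural and content arguments, and the visibility precondition, can be illustrated with a hypothetical skill. Everything below is an assumption made for illustration: the `RecordingBrowser` class, the `search_place` skill, and the element IDs are invented and do not reproduce the paper's example or any real browser API.

```python
from typing import Iterable, List, Set, Tuple


class RecordingBrowser:
    """Stand-in for a browser interface; records actions instead of driving a page."""

    def __init__(self, visible_ids: Iterable[str]) -> None:
        self.visible_ids: Set[str] = set(visible_ids)
        self.log: List[Tuple[str, ...]] = []

    def _require(self, element_id: str) -> None:
        if element_id not in self.visible_ids:
            raise LookupError(f"element {element_id!r} is not visible")

    def fill(self, element_id: str, text: str) -> None:
        self._require(element_id)
        self.log.append(("fill", element_id, text))

    def click(self, element_id: str) -> None:
        self._require(element_id)
        self.log.append(("click", element_id))


def search_place(browser: RecordingBrowser, search_box_id: str, go_button_id: str, query: str) -> None:
    """Hypothetical skill: two structural arguments (IDs), one content argument (query)."""
    browser.fill(search_box_id, query)
    browser.click(go_button_id)


def is_invocable(visible_ids: Set[str], structural_args: Iterable[str]) -> bool:
    """A skill is callable only if every structural element is currently visible."""
    return all(arg in visible_ids for arg in structural_args)


browser = RecordingBrowser({"12", "13"})
if is_invocable(browser.visible_ids, ["12", "13"]):
    search_place(browser, "12", "13", "coffee shops")
```

Checking visibility before the call mirrors the constraint in the text: a routine whose button appears only after a later page transition would fail this check at the current state.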

4. Experimental evaluation on WebArena

SGDR is evaluated on WebArena, using five single-domain website types: Shopping, Admin, Reddit, GitLab, and Map. Cross-site tasks are excluded, and the system maintains a separate skill library per domain to avoid cross-domain interference. The evaluation uses 764 single-domain tasks in total: 187 Shopping, 182 Admin, 106 Reddit, 180 GitLab, and 109 Map. Baselines are Vanilla, AWM, ASI, and CER, all under online variants where applicable. The reported metrics are success rate overall and by domain, together with the average number of steps per task (Li et al., 3 Jun 2026).

Backbone	SGDR success rate	Reported gain over strongest baseline
GPT-4.1	37.5%	10.6% relative
Qwen3-4B	24.3%	10.0% relative

The strongest baseline is CER. Relative to CER, SGDR improves overall success rate by +3.6 points with GPT-4.1 and +2.2 points with Qwen3-4B, which the paper reports as relative gains of about 10.6% and 10.0%, respectively (Li et al., 3 Jun 2026).
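As a consistency check, the relative gains follow from the point gains once the CER success rates are recovered. The CER values of 33.9% and 22.1% used below are not quoted in this section; they are inferred by subtracting the stated point gains from the SGDR success rates.

$\frac{3.6}{37.5 - 3.6} = \frac{3.6}{33.9} \approx 10.6\%, \qquad \frac{2.2}{24.3 - 2.2} = \frac{2.2}{22.1} \approx 10.0\%.$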

Domain-wise, SGDR is best on 4 of 5 domains with GPT-4.1, with a particularly strong improvement on Admin from 41.4% to 47.7%. GitLab is the main exception. The paper attributes this to GitLab tasks often requiring persistent repository preconditions and whole-task procedures that may be less compatible with the local, sliding-window skill granularity (Li et al., 3 Jun 2026).

SGDR also improves step efficiency. With GPT-4.1, SGDR uses 4.8 steps on average, compared with 6.0 for Vanilla, 5.2 for ASI, and 6.4 for CER. With Qwen3-4B, SGDR reduces steps by 11.1% relative to Vanilla and 13.8% relative to CER. This indicates that the retrieved skills compress multi-step browser interactions into higher-level callable routines rather than merely altering success outcomes (Li et al., 3 Jun 2026).

The ablations support the central design claims. Task-plus-state retrieval outperforms task-only and state-only retrieval; the best task-state weight $\lambda$ is around 0.5; MMR outperforms plain top-$k$ relevance retrieval because it reduces redundancy; and sliding-window extraction outperforms full-trajectory and single-action extraction. The reported results therefore tie SGDR’s gains to the joint choice of skill granularity, dual representation, and stepwise state-conditioned retrieval (Li et al., 3 Jun 2026).

5. SGDR-like formulations in retrieval-augmented generation and document retrieval

Although the term SGDR is introduced in the web-agent setting, several other systems are explicitly described as close in spirit to it or as “SGDR-like.” They share the property that retrieval is refreshed or routed according to an evolving generation, evidence, or spatial state rather than remaining static.

System	Grounding state	Retrieved unit
Context-guided dynamic RAG (He et al., 28 Apr 2025)	Current generation state	Documents
DynaRAG (Liang et al., 24 Feb 2026)	Sufficiency of current evidence	Passages or APIs
PubMed Reasoner (Zhang et al., 28 Mar 2026)	Metadata feedback and evidence pool	Queries, articles, evidence
Spatially grounded retrieval (Georgiou, 2 Dec 2025)	Query-conditioned patch relevance	Document regions

In “Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models,” the retrieval vector is recomputed from the current generation state at each step rather than being fixed by the initial query. The method combines multi-level perceptive retrieval vector construction with a differentiable document matching path, which produces a weight for each candidate document and a context embedding that aggregates the documents under those weights.

On Natural Questions, the paper reports BLEU and ROUGE-L improvements as retrieval-vector construction becomes more context-aware: Static Query yields BLEU 34.5 and ROUGE-L 45.1, whereas the proposed attention-fusion method reaches BLEU 41.0 and ROUGE-L 53.0. GPT-4o achieves the best overall reported quality with BLEU 41.7 and ROUGE-L 55.1 (He et al., 28 Apr 2025).

DynaRAG frames dynamic retrieval as routing between static corpus evidence and live structured external knowledge. The pipeline includes Retriever, Data Cleaning, LLM-based Reranker, Sufficiency Classifier, API Router or Fallback, and Answer Generator. The central routing decision is

$\mathrm{sufficient} = \mathbb{1}\left[ s_{\max} \ge \tau \right],$

where $s_{\max}$ is the highest reranked passage score and $\tau$ is a manually tuned threshold. If evidence is insufficient, the system searches an API schema catalog, retrieves candidate schemas with FAISS, invokes Gorilla v2 to generate the API call, executes the call, prepends the API response to context, and then generates the answer. On CRAG, the best setting is Task 2, which reaches 41.00% accuracy, 22.09% hallucination, and 36.91% missing rate, compared with 34.23% accuracy and 43.09% hallucination for Direct RAG (Liang et al., 24 Feb 2026).
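The threshold-based routing decision can be sketched as a single comparison. The function name, the route labels, and the choice of a non-strict comparison are assumptions; the text states only that the highest reranked passage score is compared against a manually tuned threshold.

```python
from typing import Sequence


def route(reranked_scores: Sequence[float], threshold: float) -> str:
    """Return "static" when the best passage clears the sufficiency
    threshold, otherwise "api" to trigger the API fallback path."""
    best = max(reranked_scores, default=float("-inf"))  # no passages -> insufficient
    return "static" if best >= threshold else "api"


confident = route([0.2, 0.9, 0.4], threshold=0.5)
uncertain = route([0.2, 0.3], threshold=0.5)
no_passages = route([], threshold=0.5)
```

Because the threshold is a single fixed number, a value tuned on one domain can misroute queries in another, which is the generalization concern raised for this design.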

PubMed Reasoner applies state-aware retrieval to biomedical QA through a three-stage search-first pipeline. It first generates candidate MeSH terms and an initial Boolean query, then iteratively refines that query using retrieved PubMed metadata and three assessment dimensions (coverage, alignment, and redundancy), and finally processes ranked articles in batches with a sufficiency check over the evolving evidence pool. The reflective retrieval stage updates the evidence pool with each processed batch and stops early when the sufficiency check returns Yes, when marginal utility is negligible, or when the token budget is reached. The paper uses a fixed batch size and a fixed maximum budget of ranked articles. On PubMedQA, PubMed Reasoner with a GPT-4o backbone reaches 78.32% accuracy, slightly above the reported human performance of 78.00%; on MMLU Clinical Knowledge, the same configuration reaches 63.21%. The query-quality analysis shows MeSH recall gains of 17.18% for Gemini and 20.99% for GPT over the raw backbone while maintaining very high precision (Zhang et al., 28 Mar 2026).
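Batch-wise evidence accumulation with early stopping can be sketched as follows. This covers only the sufficiency check and the article budget; the marginal-utility and token-budget stopping rules are omitted. The function name, the `is_sufficient` callable, and the numeric values in the usage lines are illustrative assumptions, not the paper's settings.

```python
from typing import Callable, List, Sequence


def collect_evidence(
    ranked_articles: Sequence[str],
    batch_size: int,
    max_articles: int,
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Add ranked articles to the evidence pool batch by batch, stopping
    as soon as the pool is judged sufficient or the budget is spent."""
    pool: List[str] = []
    budgeted = list(ranked_articles[:max_articles])  # enforce the article budget
    for start in range(0, len(budgeted), batch_size):
        pool.extend(budgeted[start:start + batch_size])
        if is_sufficient(pool):  # stand-in for the LLM sufficiency check
            break
    return pool


articles = [f"pmid{i}" for i in range(10)]
early = collect_evidence(articles, batch_size=3, max_articles=7,
                         is_sufficient=lambda pool: len(pool) >= 5)
exhausted = collect_evidence(articles, batch_size=3, max_articles=7,
                             is_sufficient=lambda pool: False)
```

The check runs after each batch rather than after each article, so the pool can overshoot the point of sufficiency by up to one batch.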

“Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation” is described as a training-free, region-level extension of SGDR-style retrieval. It combines ColPali’s patch-level similarity with OCR bounding boxes so that retrieval returns the relevant document region on a page rather than the entire page. For a query and the patches of a page, it forms a query-patch similarity matrix, derives a relevance score for each patch, and propagates that relevance to OCR regions, using IoU-weighted aggregation by default. The method uses two-stage inference-time retrieval: candidate page retrieval by mean pooling and ANN search in Qdrant, followed by region reranking. It establishes a localization precision bound that depends on region size relative to patch size, with example upper bounds of 73% for a paragraph region, 60% for a table cell, and 46% for a small label. The same paper also gives a theoretical context reduction bound for returning only the selected regions of a page instead of the full page (Georgiou, 2 Dec 2025).
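IoU-weighted propagation of patch relevance to a region can be sketched as below. This is an unnormalized weighted sum written under stated assumptions: the box format `(x0, y0, x1, y1)`, the function names, and the absence of any normalization are illustrative choices, and the paper's exact aggregation formula may differ.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def region_relevance(region: Box, patches: Sequence[Box], patch_scores: Sequence[float]) -> float:
    """Sum patch relevance scores, each weighted by its IoU with the region."""
    return sum(iou(region, patch) * score for patch, score in zip(patches, patch_scores))


# A region covering two unit patches; the third patch does not overlap it.
region = (0.0, 0.0, 2.0, 1.0)
patches = [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 1.0), (2.0, 0.0, 3.0, 1.0)]
score = region_relevance(region, patches, [1.0, 0.5, 0.9])
```

The high-scoring third patch contributes nothing because it lies outside the region, which is how the aggregation localizes relevance to an OCR bounding box.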

6. Limitations, boundaries, and interpretation

SGDR is not merely “dynamic retrieval” in a generic sense. In the original web-agent formulation, it is specifically a method for online skill learning in which the retrieved object is a verified, executable sub-procedure represented in dual text-code form and selected at each step using both task and state signals. This distinguishes it from approaches that retrieve only once at task start or that retrieve textual hints without executable action structure (Li et al., 3 Jun 2026).

The empirical results also constrain several common intuitions. The ablations do not support the assumption that whole-trajectory reuse is necessarily preferable; sliding-window extraction outperforms full-trajectory and single-action extraction. Nor do they support purely task-level or purely state-level retrieval; the best performance occurs when both are combined, with the task-state weight $\lambda$ around 0.5. At the same time, the GitLab exception shows that local sub-procedure granularity is not universally optimal, especially where tasks require persistent repository preconditions or whole-task procedures (Li et al., 3 Jun 2026).

Related SGDR-like systems expose different boundary conditions. DynaRAG warns that a fixed sufficiency threshold $\tau$ may not generalize across domains, and that routing errors can block API access when static evidence is stale. PubMed Reasoner notes that the pipeline is still largely linear and lacks backward adaptivity once later stages begin. Spatially grounded document retrieval shows that localization precision is fundamentally bounded by patch size, so finer state grounding is limited by the spatial resolution of the visual retriever (Liang et al., 24 Feb 2026, Zhang et al., 28 Mar 2026, Georgiou, 2 Dec 2025).

Taken together, these systems indicate that SGDR is fundamentally about retrieval timing, retrieval granularity, and retrieval control. This suggests that the central design problem is not only how to score candidates, but also how to define the current state, when to refresh retrieval, what unit to retrieve, and when the currently accumulated evidence is sufficient. In that sense, SGDR names both a concrete web-agent method and, more broadly, a family of state-conditioned retrieval strategies spanning interactive action, evidence-grounded generation, and spatially localized document access.