Query-Based Diffusion Policy
- The paper demonstrates how integrating explicit query signals within diffusion processes enhances control, generalization, and efficiency in action generation.
- Query-based diffusion policies are defined by conditioning denoising networks with multimodal queries, enabling precise modulation via attention and FiLM mechanisms.
- Retrieval-augmented techniques accelerate inference up to 20×, offering robust performance in complex robotic tasks with minimal loss in accuracy.
A query-based diffusion policy is a class of conditional generative policy models in which action or behavior generation is tightly regulated by explicit input queries. These models leverage diffusion processes—parametric stochastic differential equations or discrete Markov chains evolving in noise space—to model and denoise trajectories, parameters, or control signals in imitation learning and reinforcement learning domains. Queries may represent multimodal cues, including high-level task descriptions, perceptual observations, or semantic requirements, and their integration within the diffusion network defines the policy’s controllability, generalization, and efficiency.
1. Foundations of Query-Based Diffusion Policy
Query-based diffusion policy frameworks originate from the conditional diffusion policy paradigm, in which agent action distributions are modeled as conditional denoising processes. The foundational algorithmic structure involves a forward noising process that gradually corrupts demonstration trajectories, followed by a learned reverse process (often parameterized by neural networks) that denoises this sequence back into plausible actions, conditioned on query inputs. The conditioning signal—termed the “query”—guides the generative pathway, with architectures differing in how and where this signal is incorporated (Wang et al., 13 Feb 2025, Xu et al., 23 Sep 2025).
Mathematically, the standard setup is as follows for demonstrated action trajectories $a_0$, with noise schedule $\{\beta_t\}_{t=1}^{T}$, $\alpha_t = 1 - \beta_t$, and $\bar\alpha_t = \prod_{s \le t} \alpha_s$:
- Forward (training): For a time index $t \sim \mathrm{Uniform}\{1, \dots, T\}$,
$$a_t = \sqrt{\bar\alpha_t}\, a_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
with the network trained to minimize $\mathbb{E}\,\lVert \epsilon - \epsilon_\theta(a_t, t, q) \rVert^2$.
- Reverse (sampling):
$$a_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( a_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(a_t, t, q) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$
where $\epsilon_\theta$ is a neural network conditioned on the query $q$; the precise functional form and injection point of the query distinguish different architectures.
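This recipe can be made concrete in a few lines. The sketch below is a generic query-conditioned DDPM in PyTorch, assuming a hypothetical noise-prediction network `eps_theta(a_t, t, q)` and a standard variance-preserving schedule; it illustrates the pattern rather than any specific paper's implementation.

```python
# Minimal sketch of a query-conditioned DDPM over action trajectories.
# `eps_theta` is a hypothetical noise-prediction network taking (a_t, t, q);
# how q is injected (cross-attention, FiLM, ...) is left abstract here.
import torch

T = 100                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)     # variance-preserving noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

def training_loss(eps_theta, a0, q):
    """One training step: noise a clean trajectory a0, predict the noise."""
    t = torch.randint(0, T, (a0.shape[0],))
    eps = torch.randn_like(a0)
    ab = alpha_bar[t].view(-1, *([1] * (a0.dim() - 1)))
    a_t = ab.sqrt() * a0 + (1 - ab).sqrt() * eps          # forward noising
    return torch.mean((eps - eps_theta(a_t, t, q)) ** 2)

@torch.no_grad()
def sample(eps_theta, q, shape):
    """Reverse process: denoise from Gaussian noise, conditioned on query q."""
    a = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        eps_hat = eps_theta(a, torch.full((shape[0],), t), q)
        a = (a - (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        a = a + betas[t].sqrt() * z                       # sigma_t = sqrt(beta_t)
    return a
```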
2. Architectural Realizations and Query Integration
The integration of query signals into the diffusion network is a defining feature. Several mechanisms have emerged:
- Modulated Attention in MTDP: The Modulated Transformer Diffusion Policy (MTDP) framework employs a Transformer encoder-decoder structure in which the decoder blocks are replaced by Modulated Attention modules. Query inputs (here composed of perceptual image features and timestep embeddings) are processed by MLPs to yield FiLM-style scale and shift coefficients $(\gamma, \beta)$. These coefficients modulate the queries, keys, and values at every self- and cross-attention layer in the Transformer, as well as the feed-forward blocks. Modulation is thus deep (applied per layer) and carries over to UNet-based backbones (MUDP). This design contrasts with vanilla Transformers, where conditioning is injected once via cross-attention (Wang et al., 13 Feb 2025); see the modulated-attention sketch after this list.
- Explicit Query Indices in QDP: The Query-Centric Diffusion Policy (QDP) for robotic assembly encodes queries as discrete one-hot tensors that select specific skills, parts, and contact points. These query tensors are embedded, concatenated with point-cloud encodings and proprioceptive states, and fused throughout the action decoder via FiLM-style modulation (a minimal construction is sketched below). The result is a hierarchical perception-to-generation pipeline tightly governed by explicit semantic (skill/part/contact) queries (Xu et al., 23 Sep 2025).
- Latent/Language Conditioned Policy Diffusion: In latent diffusion policy models for quality-diversity reinforcement learning, user queries may take the form of continuous behavioral descriptors or language commands. These embeddings are injected through cross-attention and FiLM modulation at all blocks of the generative network, enabling query-driven synthesis of entire policies, not just trajectories (Hegde et al., 2023).
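To make per-layer query modulation concrete, the following is a minimal PyTorch sketch in the spirit of Modulated Attention and FiLM fusion, not the papers' exact architectures: an MLP maps the query embedding to scale/shift coefficients $(\gamma, \beta)$ that modulate the input of a self-attention layer, and hence its queries, keys, and values together (the published designs may modulate each stream and the feed-forward blocks separately). All dimensions are illustrative assumptions.

```python
# Sketch of FiLM-style modulation of a self-attention layer by a query embedding.
import torch
import torch.nn as nn

class ModulatedSelfAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_query=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # MLP mapping the query embedding to per-layer scale/shift (gamma, beta).
        self.film = nn.Sequential(
            nn.Linear(d_query, d_model), nn.SiLU(),
            nn.Linear(d_model, 2 * d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, q_emb):
        # q_emb: (B, d_query), e.g. image features + timestep embedding (MTDP)
        # or embedded skill/part/contact indices (QDP).
        gamma, beta = self.film(q_emb).chunk(2, dim=-1)            # (B, d_model) each
        h = gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)  # FiLM modulation
        out, _ = self.attn(h, h, h)                                # modulated Q, K, V
        return x + out                                             # residual connection

# Usage: x = torch.randn(2, 10, 256); q = torch.randn(2, 64)
# y = ModulatedSelfAttention()(x, q)   # -> (2, 10, 256)
```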
The deep and recurring injection of query signals enables richer context alignment and supports flexible conditional behavior selection.
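For QDP-style discrete queries, a hypothetical construction of the conditioning input might look as follows; vocabulary sizes, encoders, and feature dimensions are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical construction of a QDP-style conditioning vector: one-hot
# skill/part/contact indices are embedded, then concatenated with point-cloud
# and proprioceptive features before FiLM fusion in the action decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SKILLS, N_PARTS, N_CONTACTS = 4, 8, 16      # assumed vocabulary sizes

query_embed = nn.Linear(N_SKILLS + N_PARTS + N_CONTACTS, 64)

def build_condition(skill_id, part_id, contact_id, pc_feat, proprio):
    """Fuse a discrete (skill, part, contact) query with perception features."""
    one_hot = torch.cat([
        F.one_hot(torch.tensor(skill_id), N_SKILLS),
        F.one_hot(torch.tensor(part_id), N_PARTS),
        F.one_hot(torch.tensor(contact_id), N_CONTACTS),
    ]).float()
    q = query_embed(one_hot)                  # embedded semantic query
    return torch.cat([q, pc_feat, proprio])   # conditioning vector

# Example with assumed feature sizes: 128-d point-cloud, 14-d proprioception.
cond = build_condition(1, 3, 7, torch.randn(128), torch.randn(14))
```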
3. Inference and Acceleration via Retrieval-Based Query Anchoring
Inference in diffusion policy models is computationally intensive due to many iterative denoising steps. Query-based acceleration can be realized via retrieval methods:
- Retrieval-Augmented Generation Diffusion Policy (RAGDP): At inference, the query is used to retrieve the most similar observation-action pair $(\hat o, \hat a)$ from a vector database of expert demonstrations. The retrieved action anchors the diffusion process at an intermediate noise level corresponding to a "leap ratio" $k \in (0, 1)$, skipping a fraction $k$ of the total $T$ reverse steps (see the sketch after this list). The anchoring is given by:
- For VP-SDE/DDPM: $a_{t^*} = \sqrt{\bar\alpha_{t^*}}\,\hat a + \sqrt{1 - \bar\alpha_{t^*}}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and $t^* = (1-k)T$.
The remaining denoising steps then proceed as usual.
- No additional training is required; only a database of encoded expert examples is needed. RAGDP yields up to 20× faster inference with minimally reduced, or at moderate leap ratios even slightly improved, accuracy (Odonchimed et al., 29 Jul 2025).
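A minimal sketch of this retrieval-anchored sampler follows, reusing the schedule and hypothetical `eps_theta` network from the sketch in Section 1; the database layout and encoder are illustrative assumptions.

```python
# Sketch of RAGDP-style retrieval-anchored sampling: retrieve the nearest
# expert action, forward-noise it to timestep t* = (1 - k) * T, then run
# only the remaining reverse steps.
import torch

T = 100
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def rag_sample(eps_theta, obs_emb, db_obs, db_act, leap_ratio=0.8):
    # 1. Retrieve the expert action whose observation embedding is closest.
    dists = torch.cdist(obs_emb.unsqueeze(0), db_obs)       # (1, N)
    a_hat = db_act[dists.argmin()]                          # anchor action

    # 2. Forward-noise the anchor to the intermediate timestep t*.
    t_star = int((1 - leap_ratio) * T)
    ab = alpha_bar[t_star]
    a = (ab.sqrt() * a_hat + (1 - ab).sqrt() * torch.randn_like(a_hat)).unsqueeze(0)

    # 3. Run only the remaining (1 - k) * T reverse steps.
    for t in reversed(range(t_star)):
        z = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        eps_hat = eps_theta(a, torch.tensor([t]), obs_emb.unsqueeze(0))
        a = (a - (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        a = a + betas[t].sqrt() * z
    return a.squeeze(0)
```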
4. Key Quantitative Results and Benchmarks
Query-based diffusion policies have demonstrated state-of-the-art performance in various robotic control, imitation learning, and policy distillation benchmarks. Key empirical findings include:
| Model/Method | Sampling Steps | ToolHang Success (%) | Relative Gain |
|---|---|---|---|
| DP-Transformer (baseline) | 100 | 60 | baseline |
| MTDP (Modulated Attention) | 100 | 72 | +12 pts |
| MTDP-I (DDIM) | 60 | 68 | 40% step reduction |
| MUDP (Modulated UNet) | 100 | 81 | +1 pt over DP-UNet |
| MUDP-I (DDIM) | 60 | 81 | 40% step reduction |
In assembly tasks, QDP (query-centric diffusion policy) achieves skill-wise success rates of 0.59 for insertion and 0.80 for screwing skills, with >50% improvement over non-query-structured baselines. Ablations confirm that query-driven architecture is the dominant contributor to these gains, especially in contact-rich, long-horizon scenarios (Xu et al., 23 Sep 2025).
In acceleration experiments, RAGDP maintains 64–82% of the original accuracy at 20× speed-up, outperforming pure-sampler and Consistency Policy (CP) distillation baselines, which drop below 60% at similar acceleration (Odonchimed et al., 29 Jul 2025).
Compression of QD-RL archives via latent (query-driven) diffusion matches or surpasses the original coverage and reward (≈98% of reward, ≈89% of coverage at a substantial compression ratio) and supports flexible, user-conditioned policy sampling (Hegde et al., 2023).
5. Applications and Generalization of Query-Based Diffusion Policy
Applications of query-based diffusion policies include:
- Robotic Manipulation: Vision-based manipulation and assembly tasks, notably those tested in the robomimic suite and FurnitureBench, benefit from deep query integration for tool usage, contact-rich insertion, and multi-step assembly (Wang et al., 13 Feb 2025, Xu et al., 23 Sep 2025).
- Behavioral Policy Compression: Scaling to archives of behavior-diverse policies, where a query-conditioned latent diffusion model replaces thousands of neural policies, enabling direct retrieval and composition of new behaviors via descriptors or language (Hegde et al., 2023).
- Low-Latency Imitation: Real-time control scenarios leveraging retrieval-based acceleration without sacrificing precision (Odonchimed et al., 29 Jul 2025).
Architecturally, modulated attention mechanisms adapt readily to both Transformer and UNet backbones and to different diffusion formulations (DDPM, DDIM), demonstrating broad transferability and backbone-agnostic deployment (Wang et al., 13 Feb 2025).
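The step reduction of the "-I" variants comes from deterministic DDIM sampling over a strided subsequence of timesteps, which cuts the number of network evaluations without retraining. A minimal sketch under the same assumptions as the earlier sketches (hypothetical `eps_theta`, standard VP schedule):

```python
# Sketch of deterministic DDIM sampling (eta = 0) over a strided timestep
# subsequence, reducing the step count without retraining the network.
import torch

T = 100
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, T), dim=0)

@torch.no_grad()
def ddim_sample(eps_theta, q, shape, n_steps=60):
    ts = torch.linspace(T - 1, 0, n_steps).long()   # strided timesteps
    a = torch.randn(shape)
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < len(ts) else torch.tensor(1.0)
        eps_hat = eps_theta(a, torch.full((shape[0],), int(t)), q)
        a0_pred = (a - (1 - ab_t).sqrt() * eps_hat) / ab_t.sqrt()       # predict a_0
        a = ab_prev.sqrt() * a0_pred + (1 - ab_prev).sqrt() * eps_hat   # DDIM step
    return a
```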
6. Practical Considerations and Limitations
While query-based diffusion policies significantly improve controllability, sampling efficiency, and precision, several considerations arise:
- Database Dependence in RAGDP: The efficacy of retrieval-based acceleration is contingent on the density and quality of demonstration databases. Sparse or low-coverage databases may impair retrieval quality and thus degrade initial trajectory anchors (Odonchimed et al., 29 Jul 2025).
- Query Design: The expressivity and informativeness of the query directly determine control granularity; high-level language or descriptor queries provide powerful semantic control, but require robust embedding architectures (Hegde et al., 2023).
- Real-Time Constraints: DDIM and retrieval acceleration alleviate sampling bottlenecks but may still be bounded by neural network evaluation speed and the complexity of the backbone model (Wang et al., 13 Feb 2025).
- Generalization: Empirical results reinforce the robustness of query-based policies to novel configurations, history variations, and perceptual perturbations, particularly when queries encode precise subtask, object, and contact semantics (Xu et al., 23 Sep 2025).
7. Impact and Outlook
Query-based diffusion policies mark an inflection point in conditional policy generation, enabling context-sensitive, user-steerable, and highly compressed policy repositories. The architectural insight of deep, layer-wise query fusion, exemplified by Modulated Attention and query-centric encoding, enables effective deployment in high-dimensional, real-world robotic manipulation and behavior-sequence tasks.
These models unify the strengths of diffusion-based generative modeling, attention-driven context integration, and flexible query interfaces. Empirical benchmarks establish that they not only increase the upper bound of attainable task success (notably in challenging manipulation and assembly skills), but also meet practical constraints on inference time and deployment footprint.
Further research is poised to explore:
- Scaling of query structures to more abstract, compositional task spaces.
- Efficient organization and retrieval within massive demonstration archives.
- Unified frameworks supporting simultaneous low-level action and high-level policy generation from multi-modal queries.
The query-based paradigm is likely to remain central as diffusion policy research advances toward more generalized, interpretable, and robust AI agents in sequential decision-making domains (Wang et al., 13 Feb 2025, Odonchimed et al., 29 Jul 2025, Xu et al., 23 Sep 2025, Hegde et al., 2023).