Multi-View Diffusion Policy
- Multi-view diffusion policies are stochastic frameworks that fuse multiple sensory inputs to generate expressive, multi-modal action distributions.
- They employ iterative denoising and neural score functions, integrating diverse views such as RGB-D, LiDAR, and proprioceptive data for robust control.
- These policies enhance applications in robotics, autonomous driving, and 3D synthesis by improving success rates, generalization, and computational efficiency.
A multi-view diffusion policy refers to a class of stochastic policy learning frameworks that leverage denoising diffusion probabilistic models or related score-based generative processes, with an explicit focus on fusing and exploiting information from multiple “views”—which could correspond to diverse sensory modalities, distinct camera perspectives, or different feature spaces. This paradigm supports expressive, robust, and highly generalizable action or data distributions for control, planning, and synthesis tasks. Multi-view diffusion policies have rapidly advanced fields such as robotic manipulation, mobile manipulation, 3D synthesis, multi-agent reinforcement learning, navigation, and medical data imputation by exploiting the complementary advantages of multiple input views and the powerful multi-modal generative capacity of diffusion models.
1. Fundamentals of Multi-View Diffusion Policies
Multi-view diffusion policies extend classical diffusion model-based policies by explicitly incorporating observations or context from multiple distinct perspectives or modalities, often in combination with proprioceptive or historical features. At the core, these policies rely on a generative process that transforms a simple reference distribution—typically Gaussian noise—iteratively into a sample from a high-dimensional, complex, multi-modal target policy. The forward process injects noise into action or data representations, while the reverse process, parameterized by a neural score function, denoises the sample in a manner conditioned on fused multi-view features.
Formally, for a state $s$ and a set of views $v_{1:K}$ (e.g., images, depth maps), the conditional diffusion policy seeks to represent a distribution $\pi_\theta(a \mid s, v_{1:K})$ or, equivalently, $p_\theta(a \mid c)$ for a fused conditioning embedding $c$. The stochastic process is defined by a sequence of noising transitions, such as

$$q\!\left(a^k \mid a^{k-1}\right) = \mathcal{N}\!\left(a^k;\ \sqrt{1-\beta_k}\, a^{k-1},\ \beta_k I\right),$$

and its iterative reverse by a neural network predicting the mean or noise,

$$p_\theta\!\left(a^{k-1} \mid a^k, c\right) = \mathcal{N}\!\left(a^{k-1};\ \mu_\theta\!\left(a^k, k, c\right),\ \Sigma_k\right),$$

each step taking as input the composite feature embedding $c$ from all views (Ke et al., 16 Feb 2024, Dong et al., 18 Sep 2025).
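As a concrete illustration, the following minimal PyTorch sketch implements the standard epsilon-prediction training loss for such a conditional policy; the module names (`multi_view_encoder`, `noise_predictor`) and tensor layouts are illustrative assumptions, not an interface from any cited system:

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(noise_predictor, multi_view_encoder,
                          views, proprio, actions, alphas_cumprod):
    """One DDPM-style training step for a multi-view-conditioned policy.

    views:          (B, K, C, H, W) images from K cameras (assumed layout)
    proprio:        (B, D_p) proprioceptive state
    actions:        (B, D_a) expert actions to be noised and denoised
    alphas_cumprod: (T,) cumulative noise schedule \bar{alpha}_t
    """
    B = actions.shape[0]
    T = alphas_cumprod.shape[0]

    # Fuse all camera views and proprioception into one conditioning vector c.
    cond = multi_view_encoder(views, proprio)              # (B, D_c)

    # Sample a diffusion timestep and Gaussian noise for each example.
    t = torch.randint(0, T, (B,), device=actions.device)
    eps = torch.randn_like(actions)

    # Forward process: a_t = sqrt(abar_t) * a_0 + sqrt(1 - abar_t) * eps.
    abar = alphas_cumprod[t].unsqueeze(-1)                 # (B, 1)
    noisy_actions = abar.sqrt() * actions + (1.0 - abar).sqrt() * eps

    # The network predicts the injected noise, conditioned on (t, c).
    eps_hat = noise_predictor(noisy_actions, t, cond)
    return F.mse_loss(eps_hat, eps)
```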
The principal objective is to allow the policy to model rich, non-unimodal distributions over output variables, capturing the full complexity of real-world multimodal behaviors and enabling versatile and robust action generation.
2. Multi-View Feature Fusion and Conditioning
The fusion and processing of multi-view inputs is a technical cornerstone. Leading approaches utilize the following strategies (a minimal fusion sketch follows this list):
- 3D Feature Clouds: In 3D robot manipulation (e.g., 3D Diffuser Actor), 2D features from multiple RGB-D cameras are “lifted” into a unified 3D coordinate frame before passing into a translation-equivariant transformer (Ke et al., 16 Feb 2024).
- Multi-Scale Cross-Attention: In complex environments such as autonomous driving, hierarchical bidirectional cross-attention layers integrate features from multi-modal sensors (e.g., images + LiDAR), with the policy conditioned on the resulting scene embedding (Zhao et al., 26 May 2025).
- Semantic Projection: For navigation or zero-shot adaptation, 3D point clouds from multiple views are projected into a 2D bird's-eye-view semantic grid, enabling policies trained in 2D to generalize to 3D navigation (Zhang et al., 1 Apr 2024).
- Proprioceptive Integration: Robotic control policies commonly incorporate proprioceptive feedback (e.g., end-effector pose, joint state) along with multi-view vision to enable effective whole-body action synthesis (Dong et al., 18 Sep 2025).
- Modality-Composable Scoring: Some frameworks combine the score functions from independently trained policies on different modalities via weighted summation at inference, enabling policy compositionality without additional training (Cao et al., 16 Mar 2025).
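As a concrete example in the spirit of these strategies, the following is a minimal sketch of attention-based multi-view fusion; the `ViewFusion` module, its dimensions, and the additive proprioceptive conditioning are illustrative assumptions rather than an architecture drawn from the cited papers:

```python
import torch
import torch.nn as nn

class ViewFusion(nn.Module):
    """Fuse K per-view feature vectors with proprioception via attention pooling."""

    def __init__(self, view_dim: int, proprio_dim: int, embed_dim: int = 256):
        super().__init__()
        self.view_proj = nn.Linear(view_dim, embed_dim)
        self.proprio_proj = nn.Linear(proprio_dim, embed_dim)
        # A single learned query attends over the per-view tokens.
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                          batch_first=True)

    def forward(self, view_feats: torch.Tensor,
                proprio: torch.Tensor) -> torch.Tensor:
        # view_feats: (B, K, view_dim) -- one feature vector per camera view.
        tokens = self.view_proj(view_feats)                  # (B, K, E)
        q = self.query.expand(view_feats.shape[0], -1, -1)   # (B, 1, E)
        fused, _ = self.attn(q, tokens, tokens)              # (B, 1, E)
        # Condition on proprioception by adding its projection to the pooled view token.
        return fused.squeeze(1) + self.proprio_proj(proprio)  # (B, E)
```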
These designs support consistent and robust scene understanding, promote viewpoint invariance, and enable manipulation or planning regardless of occlusions, clutter, or environmental change.
3. Diffusion Process Modeling and Theoretical Properties
Multi-view diffusion policies build upon stochastic differential equations (SDEs) or discrete Markov chains to bridge noisy prior distributions and complex action or data posteriors (a sampler sketch follows these equations):
- The forward process starts from an initial action (or data) sample and incrementally applies noise, as modeled by an Ornstein–Uhlenbeck (OU) or analogous dynamic:

$$dX_t = -X_t\, dt + \sqrt{2}\, dW_t, \qquad X_0 \sim p_{\text{data}}.$$

- The reverse process sequentially denoises the sample, each iteration refined by a learned score function $s_\theta(x, t) \approx \nabla_x \log p_t(x)$, e.g.,

$$dY_t = \left[\, Y_t + 2\, s_\theta\!\left(Y_t,\, T - t\right) \right] dt + \sqrt{2}\, d\bar{W}_t.$$

- For practical deployment, discretized integrators (such as exponential integrator updates) are employed, typically

$$Y_{k+1} = e^{h}\, Y_k + 2\left(e^{h} - 1\right) s_\theta\!\left(Y_k,\, T - t_k\right) + \sqrt{e^{2h} - 1}\;\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I),$$

where $h = t_{k+1} - t_k$ is the step size.
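A minimal sketch of the resulting reverse sampler, applying the exponential-integrator update above with a uniform step size; the signature of `score_fn` is an assumption for illustration:

```python
import math
import torch

@torch.no_grad()
def exponential_integrator_sample(score_fn, cond, shape, T=1.0, steps=50,
                                  device="cpu"):
    """Integrate the reverse OU SDE with an exponential integrator.

    score_fn(y, t, cond) is assumed to approximate the score grad log p_t(y),
    conditioned on the fused multi-view embedding `cond`.
    """
    y = torch.randn(shape, device=device)   # Gaussian prior at time T
    h = T / steps                           # uniform step size
    eh, e2h = math.exp(h), math.exp(2 * h)
    for k in range(steps):
        t = T - k * h                       # reverse-time clock
        score = score_fn(y, t, cond)
        noise = torch.randn_like(y)
        # Exact solution of the reverse SDE with the score frozen over the step.
        y = eh * y + 2.0 * (eh - 1.0) * score + math.sqrt(e2h - 1.0) * noise
    return y
```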
Theoretical results (e.g., convergence rates for the KL divergence between the reverse process marginal and the true target policy) provide strong guarantees that the iterative process can approximate highly multimodal distributions with bounded error, with explicit terms accounting for discretization error, score-matching mismatch, and the convergence of the forward process (Yang et al., 2023, Psenka et al., 2023).
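Schematically, such bounds typically decompose as below; constants and exact exponents vary across analyses, so this should be read as an illustrative shape rather than a quoted theorem:

$$\mathrm{KL}\!\left(p_{\mathrm{data}} \,\middle\|\, \hat{p}\right) \;\lesssim\; \underbrace{e^{-2T}\,\mathrm{KL}\!\left(p_{\mathrm{data}} \,\middle\|\, \mathcal{N}(0, I)\right)}_{\text{forward-process convergence}} \;+\; \underbrace{T\,\varepsilon_{\mathrm{score}}^{2}}_{\text{score mismatch}} \;+\; \underbrace{d\, h\, T}_{\text{discretization}},$$

where $T$ is the diffusion horizon, $\varepsilon_{\mathrm{score}}$ the score-matching error, $d$ the dimension, and $h$ the step size.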
4. Algorithmic Architectures and Training Strategies
Multi-view diffusion policies have been instantiated through various algorithmic templates, typically involving alternating or joint updates to critic (Q-function) and actor (score-based diffusion network):
- Actor-Critic Integration: Algorithms such as DIPO and DDiffPG combine score-based actor updates with Q-learning-based critics, often using target actions generated from Q-gradients or behavioral cloning objectives modified by value estimates (Yang et al., 2023, Li et al., 2 Jun 2024); a sketch of the Q-gradient target-action mechanism follows this list.
- Score Matching and Reweighting: Efficient online RL is achieved by reweighted score matching losses (RSM), in which the denoising objective is weighted by energy terms (e.g., Mirror Descent or max-entropy targets), supporting both tractable computation and direct policy optimization (Ma et al., 1 Feb 2025).
- Policy Fine-Tuning by Policy Gradient: Recent works show that applying PPO or similar on-policy gradients to the entire denoising chain is both feasible and advantageous for policy robustness, structured exploration, and sim-to-real transfer (Ren et al., 1 Sep 2024).
- Efficient Architectures: The use of lightweight models (e.g., Mamba Policy’s SSM-Attention hybrid) dramatically reduces computational requirements without sacrificing policy performance, enabling deployment on resource-constrained hardware (Cao et al., 11 Sep 2024).
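The following is a minimal sketch of the Q-gradient target-action mechanism referenced above (DIPO-style); the hyperparameters and the `critic` interface are illustrative assumptions:

```python
import torch

def action_gradient_update(critic, states, actions, step_size=0.03, n_steps=5):
    """Improve replay-buffer actions by ascending the critic's Q-gradient,
    producing targets for the diffusion actor (DIPO-style)."""
    actions = actions.detach().clone().requires_grad_(True)
    for _ in range(n_steps):
        q = critic(states, actions).sum()
        (grad,) = torch.autograd.grad(q, actions)
        # Gradient-ascent step on Q, then re-enable gradients for the next step.
        actions = (actions + step_size * grad).detach().requires_grad_(True)
    return actions.detach().clamp(-1.0, 1.0)

# The diffusion actor is then trained with the standard denoising loss
# (as in Section 1) on (state, improved_action) pairs, while the critic
# is updated with ordinary TD targets.
```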
Adaptation, generalization, and regularization techniques include manipulability-aware control (for robust mobile manipulation), geometric initialization and constraint-based denoising, and test-time adaptation via manifold-constrained guidance (Dong et al., 18 Sep 2025, Li et al., 8 Aug 2025).
5. Performance Metrics, Empirical Evaluation, and Applications
Empirical results consistently show that multi-view diffusion policies yield substantial improvements on a range of benchmarks:
- Success Rate & Robustness: On RLBench and CALVIN, the multi-view 3D Diffuser Actor achieves new state-of-the-art results, with up to 18.1% higher average success rates; M⁴Diffuser achieves 7%–56% higher success rates and reduces collisions by up to 31% in mobile manipulation (Ke et al., 16 Feb 2024, Dong et al., 18 Sep 2025).
- Generalization: Multi-view architectures generalize robustly to unseen object instances, occlusions, and varying viewpoints, with policies demonstrating high transferability between environments and tasks, including zero-shot transfer for navigation via semantic projection (Zhang et al., 1 Apr 2024, Dong et al., 18 Sep 2025).
- Efficiency: Mamba Policy achieves >80% parameter reduction and up to 90% reduction in FLOPs relative to baseline 3D diffusion policies, with stable performance on long-horizon tasks (Cao et al., 11 Sep 2024).
- Sample Efficiency: Trajectory-based data augmentation and multi-modal representation notably improve data efficiency, reaching state-of-the-art performance with far less data (e.g., DOM2's >20× improvement in data efficiency for offline multi-agent RL) (Li et al., 2023).
- Real-World Deployment: Policies are validated not only in simulation but in hardware experiments (e.g., Diffuser Actor, M⁴Diffuser, ADPro) demonstrating markerless, robust manipulation and adaptive behavior in real-world environments (Ke et al., 16 Feb 2024, Dong et al., 18 Sep 2025, Li et al., 8 Aug 2025).
- Structured Outputs for 3D Synthesis: In computer vision, Sharp-It produces high-fidelity, 3D-consistent multi-view images, with markedly lower FID and higher CLIP and DINO similarity scores than direct baselines, supporting rapid, high-quality 3D reconstruction (Edelstein et al., 3 Dec 2024).
6. Extensions, Symmetries, and Emerging Techniques
Recent research expands the expressiveness and practicality of multi-view diffusion policies by incorporating invariance, adaptability, and modular design:
- Equivariance and Invariant Representations: Incorporation of SE(3)-invariant or equivariant feature encoders, as well as relative or delta action representations (especially with eye-in-hand setups), leads to improved sample efficiency and generalization (Wang et al., 19 May 2025).
- Composable and Modular Policies: Modality-Composable Diffusion Policy demonstrates score-level composition of policies trained on different modalities, enabling flexible, adaptive policy construction without joint retraining (Cao et al., 16 Mar 2025); see the sketch after this list.
- Dynamic Denoising and State-Dependent Adaptation: Policies such as D3P dynamically allocate denoising steps per action or per view, balancing computational cost and task performance, a particularly useful strategy for high-dimensional, multi-view input regimes (Yu et al., 9 Aug 2025).
- Manifold and Physics Constraints: Test-time adaptation with observation guidance constrains the diffusion process along task-specific manifolds (e.g., geodesic in manipulation), improving both efficiency and generalization in unseen scenarios (Li et al., 8 Aug 2025).
- Hybrid Supervision: Action diffusion can be integrated with supervised policy outputs via hybrid decoders, as in end-to-end autonomous driving, enabling robust, multi-modal, and physically interpretable control (Zhao et al., 26 May 2025).
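A minimal sketch of inference-time score composition as referenced above; the fixed scalar weights and the noise-predictor interface are illustrative assumptions, not the cited method's exact rule:

```python
import torch

@torch.no_grad()
def composed_denoise_step(noise_preds, conds, weights, noisy_action, t):
    """Combine per-modality noise predictions by weighted summation.

    noise_preds: list of networks, one independently trained per modality
    conds:       matching list of per-modality conditioning tensors
    weights:     scalars summing to 1 (a simple, illustrative choice)
    """
    eps = sum(w * net(noisy_action, t, c)
              for net, c, w in zip(noise_preds, conds, weights))
    return eps  # plug into the usual DDPM/DDIM update for step t
```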
These directions contribute to the scalability, deployment practicality, and interpretability of multi-view diffusion policy architectures.
7. Representative Applications Across Domains
Multi-view diffusion policies serve as foundational models across several domains:
| Application | Input Views/Modalities | Output/Task |
|---|---|---|
| Robot Manipulation | Multi-view RGB/RGB-D + proprioception | High-level end-effector poses |
| Mobile Manipulation | Body/arm state + panoramic cameras | Global scene-aware manipulation goals |
| Navigation | Multi-view semantic point clouds | 2D/3D trajectory action plans |
| Multi-Agent RL | Local observation + teammate views | Multi-modal action sequence |
| Healthcare Prediction | Multi-view EHR (diagnosis, notes, labs) | Patient outcome, imputation |
| 3D Content Creation | Multi-view images, text prompts | 3D-consistent multi-view renderings |
| Driving/Planning | Camera + LiDAR + navigation targets | Trajectory distribution + structured control |
The ability to flexibly handle multiple views and to model expressive, multi-modal distributions enables these frameworks to outperform their unimodal or single-view counterparts in robustness, generalization, and data efficiency.
The development of multi-view diffusion policies represents a convergence of advances in probabilistic generative modeling, multi-modal sensor fusion, and reinforcement learning. By leveraging iterative denoising conditioned on richly fused, multi-view observations, these policies are capable of robust, generalizable, and sample-efficient control, planning, and synthesis in high-dimensional and dynamic environments. This line of research brings the expressiveness, adaptability, and safety that sequential decision systems require in complex real-world applications (Yang et al., 2023, Ke et al., 16 Feb 2024, Dong et al., 18 Sep 2025, Cao et al., 16 Mar 2025, Edelstein et al., 3 Dec 2024).