RayFusion Methods in Visual Computing
- RayFusion is a family of ray-based fusion methods that integrate multi-modal, multi-agent, and temporal data to overcome single-view ambiguities in 3D perception.
- Techniques include ray occupancy encoding, attention-based fusion of aligned cost volumes, and distillation of radiance fields, improving both accuracy and efficiency.
- These methods enable practical applications in autonomous driving, sparse depth completion, and XR by reducing false positives and supporting dynamic scene manipulation.
RayFusion refers to a family of ray-based fusion methods and architectures in visual computing, spanning collaborative perception, depth estimation, radiance field compositing, and sparse depth completion. These approaches exploit the geometry and occupancy of rays to fuse information from multi-modal, multi-agent, or temporally sequential data sources, thereby improving accuracy, robustness, and efficiency in environments where perception is fundamentally limited by single-view ambiguities.
1. Ray-Based Collaborative Visual Perception
RayFusion as proposed in "RayFusion: Ray Fusion Enhanced Collaborative Visual Perception" (Wang et al., 9 Oct 2025) introduces a ray occupancy encoding framework for multi-agent camera-based 3D object detection in autonomous driving scenarios. The method improves depth estimation—traditionally ambiguous in monocular settings—by aggregating per-ray occupancy information from multiple collaborating agents equipped with cameras.
Each agent generates pixel-wise dense depth distributions from its own detections and encodes them with a high-frequency mapping $\gamma$:

$$\gamma(x) = \big(\sin(2^{0}\pi x), \cos(2^{0}\pi x), \ldots, \sin(2^{L-1}\pi x), \cos(2^{L-1}\pi x)\big)$$

The encodings of the optical center and the ray direction are combined with the depth distribution and fused using multi-layer perceptrons (MLPs).
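A minimal sketch of this style of ray encoding is given below, assuming the standard sinusoidal form of the high-frequency mapping and illustrative depth-bin counts, network widths, and class/parameter names (none of these are taken from the paper):

```python
import torch
import torch.nn as nn

def gamma(x: torch.Tensor, L: int = 8) -> torch.Tensor:
    """Sinusoidal high-frequency mapping applied element-wise to x (assumed form)."""
    freqs = (2.0 ** torch.arange(L, device=x.device, dtype=x.dtype)) * torch.pi  # (L,)
    angles = x.unsqueeze(-1) * freqs                      # (..., 3, L)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1) # (..., 3, 2L)
    return enc.flatten(-2)                                # (..., 6L)

class RayOccupancyEncoder(nn.Module):
    """Encodes a ray (optical center o, direction d) together with a per-ray
    depth distribution into a fused feature via MLPs (illustrative widths)."""
    def __init__(self, n_depth_bins: int = 64, L: int = 8, dim: int = 128):
        super().__init__()
        self.L = L
        ray_dim = 2 * 6 * L  # encoded o and d, concatenated
        self.ray_mlp = nn.Sequential(nn.Linear(ray_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.occ_mlp = nn.Sequential(nn.Linear(n_depth_bins, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, o: torch.Tensor, d: torch.Tensor, depth_dist: torch.Tensor) -> torch.Tensor:
        # o, d: (N, 3) ray origins and directions; depth_dist: (N, n_depth_bins)
        ray_feat = self.ray_mlp(torch.cat([gamma(o, self.L), gamma(d, self.L)], dim=-1))
        occ_feat = self.occ_mlp(depth_dist)
        return self.fuse(torch.cat([ray_feat, occ_feat], dim=-1))
```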
Object motion and ego-motion are accounted for by sequentially warping features from previous timestamps into the current frame.
Aligned features from distinct agents are fused via multi-scale instance feature aggregation, in which pyramid window self-attention restricts receptive fields so that attention scores are computed only within local windows at each scale (see the sketch below).
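The following sketch shows window-restricted self-attention over a feature map at a single scale, assuming square non-overlapping windows; the window size, head count, and class names are assumptions and the pyramid structure of the paper is not reproduced:

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of a feature map
    (illustrative; single scale, single window size)."""
    def __init__(self, dim: int = 128, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, H, W, C) with H and W divisible by the window size
        B, H, W, C = feat.shape
        w = self.window
        # partition the map into (B * num_windows, w*w, C) token sequences
        x = feat.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        out, _ = self.attn(x, x, x)  # attention is confined to each window
        # undo the window partition
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```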
The final prediction integrates the object anchor encoding, the fused ray occupancy features, and a delay encoding.
By intersecting the ray occupancy states from multiple agents, RayFusion reduces false positives, suppresses hard negative samples, and yields more accurate 3D localization. The approach demonstrated superior AP70 scores on DAIR-V2X, V2XSet, and OPV2V, outperforming prior methods such as IFTR.
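A toy illustration of the intuition behind intersecting per-ray occupancy evidence (not the paper's learned fusion): a candidate detection survives only where both agents assign high occupancy along the rays that observe it.

```python
import numpy as np

def intersect_occupancy(occ_a: np.ndarray, occ_b: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Combine per-cell occupancy probabilities from two agents.

    occ_a, occ_b: (N,) probabilities that each candidate 3D cell is occupied,
    obtained along each agent's viewing rays. Taking the minimum keeps only
    cells that both viewpoints support, pruning single-view hallucinations.
    """
    fused = np.minimum(occ_a, occ_b)
    return fused > thresh

# A candidate supported by agent A but not by agent B is rejected.
print(intersect_occupancy(np.array([0.9, 0.8]), np.array([0.85, 0.2])))  # [ True False]
```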
2. Sparse Depth Video Completion via Ray Attention Fusion
In "Deep Cost Ray Fusion for Sparse Depth Video Completion" (Kim et al., 23 Sep 2024), RayFusion denotes a deep learning framework for efficient temporal fusion of sparse depth and RGB streams from videos. The pipeline constructs cost volumes for each frame by stacking D uniformly spaced depth hypothesis planes, forming tensors .
Sequential cost volumes are aligned using camera poses. For each pixel, the framework extracts ray-wise features along its D depth hypotheses. Attention is applied exclusively to these per-pixel ray sequences, so its cost scales quadratically in D per ray rather than quadratically in the full volume size, as with global volume attention.
Fusion consists of two stages:
- Self-attention applied to the current and previous ray features,
- Cross-attention between the current and previous ray features (see the sketch below).
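A sketch of this two-stage per-ray attention, assuming per-pixel ray features of shape (H·W, D, C) from the current and previous cost volumes; layer counts, dimensions, and shared weights across stages are simplifying assumptions:

```python
import torch
import torch.nn as nn

class RayAttentionFusion(nn.Module):
    """Fuse current and previous per-pixel ray features with self- then
    cross-attention (illustrative sketch)."""
    def __init__(self, dim: int = 32, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # cur, prev: (H*W, D, C) -- one token sequence of D depth hypotheses per pixel
        cur, _ = self.self_attn(cur, cur, cur)        # stage 1: self-attention on each ray
        prev, _ = self.self_attn(prev, prev, prev)
        fused, _ = self.cross_attn(cur, prev, prev)   # stage 2: current ray attends to previous ray
        return fused
```

Because attention never crosses pixel boundaries, each sequence has only D tokens, which is what keeps the module lightweight relative to attending over the whole cost volume.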
Depth is regressed as the expectation over the depth hypotheses,

$$\hat{d}(p) = \sum_{k=1}^{D} d_k \, w_k(p),$$

where the per-hypothesis weights $w_k(p)$ are obtained from the softmaxed cost volume.
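A minimal version of this regression step; the hypothesis spacing and the sign convention of the cost are assumptions:

```python
import torch

def regress_depth(cost: torch.Tensor, d_min: float, d_max: float) -> torch.Tensor:
    """cost: (B, D, H, W) matching scores over D uniformly spaced depth planes.
    Returns (B, H, W) expected depth under the softmaxed cost volume."""
    B, D, H, W = cost.shape
    hypotheses = torch.linspace(d_min, d_max, D, device=cost.device).view(1, D, 1, 1)
    prob = torch.softmax(cost, dim=1)      # per-pixel distribution over hypotheses
    return (prob * hypotheses).sum(dim=1)  # soft-argmax / expectation
```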
On KITTI, VOID, and ScanNetV2 datasets, RayFusion achieved state-of-the-art accuracy (RMSE, Chamfer distance, F-score) with only 1.15M parameters, significantly fewer than prior deep learning methods.
3. Radiance Field Fusion for XR and AR
FusedRF (Goel et al., 2023) applies a fusion-by-distillation paradigm to NeRF-like Radiance Fields for scene compositing. Rather than tracing rays through each RF in parallel—where both render time and memory grow linearly with the number of scenes—FusedRF distills the outputs of multiple teacher RFs into a single student RF.
For each sampled point along a ray, the opacity is computed from the volumetric density as

$$\alpha_i = 1 - \exp(-\sigma_i \delta_i),$$

where $\sigma_i$ is the density and $\delta_i$ the inter-sample distance. Samples with high opacity (high $\alpha_i$) are used to form the union training set, and the student RF is supervised to match both the color and the density predicted by the teacher RFs at those samples.
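A sketch of this distillation target under the stated assumptions: per-sample opacity follows the formula above, high-opacity samples are kept, and the student is trained with plain MSE losses on the teachers' color and density (the threshold, loss weights, and function names are illustrative):

```python
import torch

def alpha_from_density(sigma: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Opacity of each ray sample: alpha_i = 1 - exp(-sigma_i * delta_i)."""
    return 1.0 - torch.exp(-sigma * delta)

def distill_loss(student_sigma, student_rgb, teacher_sigma, teacher_rgb, delta, tau: float = 0.5):
    """Supervise the student only at high-opacity teacher samples (illustrative)."""
    alpha = alpha_from_density(teacher_sigma, delta)
    mask = alpha > tau                                   # union training set of near-surface samples
    density_loss = ((student_sigma - teacher_sigma)[mask] ** 2).mean()
    color_loss = ((student_rgb - teacher_rgb)[mask] ** 2).mean()
    return density_loss + color_loss
```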
Final refinement is performed using RGB pixel losses. The fused representation keeps rendering speed and memory at the level of a single RF, supports dynamic manipulation (adding or removing objects via incremental distillation), and achieves PSNR competitive with naive compositing.
4. Convolutional Neural Ray Modeling for View Synthesis
CeRF (Yang et al., 2023) models novel view synthesis by learning the derivative of radiance along each ray rather than the absolute radiance, so the rendered color is recovered by integrating that derivative along the ray. This formulation exploits the sparsity of the signal: radiance changes only at surface boundaries. The framework discretizes the integral, predicting color only at likely surface intersections identified by an indicator function.
The architecture employs 1D convolutions (kernel sizes 1 and 3) to extract correlated ray features, followed by a GRU recurrent module to resolve geometric ambiguities via sequence modeling. The Unique Surface Constraint enforces selection of a dominant surface intersection point using a softmax normalization of the indicator values along the ray (sketched below).
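A sketch of that selection step: a per-sample indicator score is softmax-normalized along the ray and used to weight per-sample colors, so a single dominant intersection contributes most of the final color. The temperature and function names are assumptions, not the paper's notation.

```python
import torch

def render_with_unique_surface(indicator: torch.Tensor, colors: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """indicator: (R, S) surface-likelihood score for each of S samples on R rays.
    colors: (R, S, 3) color predicted at each sample.
    A low softmax temperature concentrates the weight on one dominant sample."""
    weights = torch.softmax(indicator / temperature, dim=-1)  # (R, S), sums to 1 per ray
    return (weights.unsqueeze(-1) * colors).sum(dim=1)        # (R, 3) rendered colors
```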
CeRF achieves higher PSNR and perceptual similarity than methods such as Mip-NeRF, NeuRay, and R2L, and offers a robust template for ray-wise fusion in complex, occluded scenes.
5. Multimodal Fusion: RGB-Thermal Depth Estimation
While not itself a "RayFusion" algorithm, RTFusion (Meng et al., 5 Mar 2025) demonstrates the value of fusion strategies in multimodal depth estimation for adverse environments. Its EGFusion module fuses ConvNeXt-extracted features from the RGB and thermal (THR) modalities using Mutual Complementary Attention (MCA) and an Edge Saliency Enhancement Module (ESEM). MCA performs cross-modal alignment between the two feature streams, while ESEM computes an edge weight map via convolutions and applies it to reweight the fused features so that boundary detail is preserved.
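A sketch of edge-guided reweighting in the spirit of ESEM, assuming a fixed Sobel operator for the edge weight map and simple multiplicative gating of the fused features; the actual module is learned, and its exact form is not reproduced here.

```python
import torch
import torch.nn.functional as F

def edge_weight_map(gray: torch.Tensor) -> torch.Tensor:
    """gray: (B, 1, H, W). Returns an edge-saliency map in [0, 1] from Sobel gradients."""
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]], device=gray.device)
    ky = kx.transpose(-1, -2)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    return mag / (mag.amax(dim=(-2, -1), keepdim=True) + 1e-6)

def apply_edge_saliency(fused_feat: torch.Tensor, gray: torch.Tensor) -> torch.Tensor:
    """Reweight fused RGB-thermal features so that boundary regions are emphasized."""
    w = edge_weight_map(gray)      # (B, 1, H, W)
    return fused_feat * (1.0 + w)  # multiplicative gating (illustrative)
```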
RTFusion significantly outperforms competing approaches (e.g., MCT, MURF) on the MS2 and ViViD++ datasets, especially under challenging lighting, by preserving boundary detail and adapting its cross-modal attention.
6. Applications and Impact in Visual Computing
RayFusion techniques have shown marked impact in several domains:
- Autonomous Driving: By aggregating ray-wise occupancy from multiple vehicles, RayFusion improves object detection, localization, and robustness under occlusion or field-of-view constraints (Wang et al., 9 Oct 2025).
- Sparse Depth Completion: Efficient temporal fusion of incomplete RGB-D video enables high-fidelity depth maps for mobile robotics and mapping (Kim et al., 23 Sep 2024).
- XR and AR: RayFusion of radiance fields via distillation allows compositing and real-time manipulation of complex scenes with bounded memory and computational costs (Goel et al., 2023).
- Multimodal Sensing: Cross-modal ray or feature fusion, as in RTFusion, enables robust depth estimation in adverse scenarios, valuable for mobile robotics, surveillance, and AR (Meng et al., 5 Mar 2025).
7. Future Directions and Open Challenges
Prominent future avenues identified across these works include:
- Collaborative Perception with Unknown Poses: Relaxing reliance on accurate agent localization and pose estimation, and expanding privacy-preserving protocols (Wang et al., 9 Oct 2025).
- Delay Robustness: Predicting future occupancy representations to mitigate adverse effects of network latency (Wang et al., 9 Oct 2025).
- Extending Sensor Modalities: Adapting ray-based fusion methods for modalities beyond standard cameras—for instance, radar, lidar, thermal, or event-based sensors—would further enhance system robustness (Meng et al., 5 Mar 2025).
- Dynamic Scene Editing and Incremental Fusion: Supporting continuous compositing, manipulation, and incremental updates in radiance field fusion to accommodate real-time scene changes (Goel et al., 2023).
- Efficient Global Ray Attention: Scaling attention mechanisms for global fusion across entire scenes or large sensor networks, while retaining computational efficiency (Kim et al., 23 Sep 2024).
RayFusion methodologies, characterized by their ray-wise signal representations and fusion strategies, continue to advance the limits of collaborative, multimodal, and temporally-aware visual perception systems across diverse real-world applications.