Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Published 14 Apr 2026 in cs.CV | (2604.12551v1)

Abstract: Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.


Authors (3)

Summary

The paper introduces CAMFusion, a transformer-based method that fuses multiview vision-language descriptors to enhance 3D semantic understanding.
It adds a novel self-supervised multiview consistency loss to a standard supervised target-class loss, so that fused embeddings align with the ground-truth text while staying consistent across views.
Quantitative results on datasets like Replica and ScanNet show significant improvements in segmentation IoU and instance mAP over traditional pooling methods.

Cross-Attentive Multiview Fusion of Vision-Language Embeddings: An Expert Summary

Introduction and Motivation

Despite substantial progress in vision-language models (VLMs) for open-vocabulary tasks in 2D vision, the transfer of these capabilities to 3D perception remains limited by fusion strategies that treat multi-view features as redundant and simply average them. This simplification ignores the complementary semantic information that diverse viewpoints provide, degrading the fidelity of the resulting 3D vision-language embeddings. The paper "Cross-Attentive Multiview Fusion of Vision-Language Embeddings" (2604.12551) introduces CAMFusion, a multiview transformer-based architecture that fuses single-view vision-language descriptors into unified, high-fidelity 3D embeddings, and proposes a novel self-supervised multiview consistency loss that improves generalization beyond class-prototype supervision.

Figure 1: Overview of CAMFusion: per-view vision-language features extracted from object masks are aggregated via the proposed transformer fusion module to generate a robust multi-view descriptor, augmented with a self-supervised consistency loss leveraging unseen views.

The work responds to limitations in current 3D open-vocabulary semantic segmentation methods, which primarily repurpose 2D VLMs by naively fusing descriptors across views, most often via mean or medoid pooling or by selecting a single representative descriptor. These methods, typified by OpenMask3D, Open3DIS, and OVO, fail to exploit the complementary semantic cues that multiview observation reveals, a critical factor for robust scene understanding in robotics, AR, and monitoring applications. Prior efforts on localized VLM descriptor extraction (TextRegion, weights predictors) improve 2D per-mask semantics but do not address the fusion bottleneck.
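
For reference, the two naive fusion baselines named above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited systems: the function names, the choice to re-normalize after averaging, and the use of summed cosine similarity to pick the medoid are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, so cosine similarity equals the dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pool(descriptors):
    """Average the per-view descriptors dimension by dimension, then re-normalize."""
    dim = len(descriptors[0])
    mean = [sum(d[i] for d in descriptors) / len(descriptors) for i in range(dim)]
    return l2_normalize(mean)

def medoid_pool(descriptors):
    """Return the one view descriptor with the highest summed cosine similarity to all views."""
    unit = [l2_normalize(d) for d in descriptors]

    def total_similarity(u):
        return sum(sum(a * b for a, b in zip(u, other)) for other in unit)

    return max(unit, key=total_similarity)

# Three hypothetical 2-D view descriptors of a single object instance.
views = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
fused_mean = mean_pool(views)      # a blend of all views
fused_medoid = medoid_pool(views)  # the middle view, [0.8, 0.6]
```

Both functions apply a fixed rule that is independent of what each view actually shows, which is the limitation the learned fusion is meant to remove.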

CAMFusion: Transformer-Based Multiview Semantic Super-Resolution

CAMFusion reframes descriptor fusion as a semantic super-resolution task. Unlike fixed pooling, the transformer explicitly models cross-view complementarity: it alternates self-attention (within a view’s embedding) and cross-attention (to the embeddings of the other views) over multiple blocks, and ends with a learned latent pooling over all views.

Figure 2: The multiview transformer alternates self- and cross-attention per view, progressively synthesizing a unified semantic descriptor; the final aggregation uses a learned latent query for pooling across views.

This design lets each viewpoint attend to complementary and discriminative semantic content in the other views. That inductive bias suits real, cluttered scenes, where distinctive object attributes may be visible from only some viewpoints.
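
The alternating attention pattern and the latent pooling described in this section can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes each view contributes a small set of tokens, uses single-head attention with residual connections only (no layer normalization or MLP sublayers), and runs with random, untrained weights. All names (`fusion_block`, `fuse`, `latent_query`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention: q is (n, d), k and v are (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fusion_block(tokens, w_self, w_cross):
    """Self-attention within each view, then cross-attention to all other views.

    tokens has shape (num_views, tokens_per_view, dim); at least two views are required.
    """
    num_views = tokens.shape[0]
    # Self-attention: each view attends only to its own tokens (with a residual).
    tokens = np.stack([
        t + attend(t @ w_self["q"], t @ w_self["k"], t @ w_self["v"]) for t in tokens
    ])
    # Cross-attention: queries from view i, keys and values from every other view.
    out = []
    for i in range(num_views):
        others = np.concatenate([tokens[j] for j in range(num_views) if j != i])
        out.append(tokens[i] + attend(
            tokens[i] @ w_cross["q"], others @ w_cross["k"], others @ w_cross["v"]))
    return np.stack(out)

def fuse(tokens, blocks, latent_query):
    """Run all blocks, then pool every token of every view with one latent query."""
    for w_self, w_cross in blocks:
        tokens = fusion_block(tokens, w_self, w_cross)
    flat = tokens.reshape(-1, tokens.shape[-1])
    fused = attend(latent_query[None, :], flat, flat)[0]
    return fused / np.linalg.norm(fused)

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
V, T, D = 4, 3, 8

def init_weights():
    return {name: rng.normal(scale=D ** -0.5, size=(D, D)) for name in ("q", "k", "v")}

blocks = [(init_weights(), init_weights()) for _ in range(2)]
latent_query = rng.normal(size=D)
tokens = rng.normal(size=(V, T, D))
fused = fuse(tokens, blocks, latent_query)  # one unit-norm descriptor of size D
```

Because this sketch uses no positional or view encodings, its output does not depend on the order in which views are supplied; whether the actual model shares that property is not stated in the summary.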

Self-Supervised Multiview Consistency Loss

Building on the supervised contrastive learning that is standard in VLMs, where aligned image-text pairs form the core supervision, CAMFusion introduces a multiview contrastive objective. This loss enforces that the fused descriptor not only aligns with the ground-truth text embedding but also remains consistent with descriptors of the same instance taken from views that were not part of the fusion input. This regularization mitigates overfitting to prototypical class embeddings, strengthening view-invariant and instance-specific semantics. A semantic class mask further refines the loss by ensuring that instances of the same class are not repelled in embedding space.

Figure 3: Illustration of the multiview contrastive loss without class masking, highlighting how semantic consistency among instances of the same class can be optimized.
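
One plausible way to write the objective described in this section is an InfoNCE-style loss over a batch of instances, sketched below with NumPy. The exact formulation in the paper may differ: the temperature `tau=0.07`, the weighting factor `0.5`, the choice to implement the class mask by excluding same-class instances from the softmax denominator, and all function names are assumptions made for this example.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits):
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def target_class_loss(fused, text_emb, labels, tau=0.07):
    """Supervised term: cross-entropy between fused descriptors and class text embeddings."""
    logits = normalize(fused) @ normalize(text_emb).T / tau
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

def multiview_consistency_loss(fused, heldout, labels, tau=0.07, class_mask=True):
    """Self-supervised term: pull each fused descriptor toward the held-out-view
    descriptor of the same instance and push it away from other instances."""
    logits = normalize(fused) @ normalize(heldout).T / tau  # (N, N), positives on the diagonal
    if class_mask:
        # Exclude other instances of the same class so they do not act as negatives.
        same_class = labels[:, None] == labels[None, :]
        drop = same_class & ~np.eye(len(labels), dtype=bool)
        logits = np.where(drop, -np.inf, logits)
    return -np.diag(log_softmax(logits)).mean()

# Toy batch: four instances, where instances 0 and 1 share a class.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 2])
fused = rng.normal(size=(4, 16))
heldout = fused + 0.1 * rng.normal(size=(4, 16))  # stand-in for unseen-view descriptors
text_emb = rng.normal(size=(3, 16))               # stand-in for class text embeddings

masked = multiview_consistency_loss(fused, heldout, labels)
unmasked = multiview_consistency_loss(fused, heldout, labels, class_mask=False)
total = target_class_loss(fused, text_emb, labels) + 0.5 * masked  # 0.5 is an arbitrary weight
```

With the mask enabled, same-class instances drop out of the denominator, so the masked loss can never exceed the unmasked one on the same batch.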

Quantitative and Qualitative Evaluation

Evaluations on Replica, ScanNet, and 3RScan show that CAMFusion consistently improves over traditional fusion schemes and state-of-the-art baselines on both 3D semantic segmentation (IoU, accuracy) and instance-level classification (mAP). For example, integrating CAMFusion with OVO raises segmentation IoU on Replica from 27% to 38%, whereas naive fusion of the same single-view descriptors reaches only 33%. The advantage holds across class-frequency regimes (head, common, tail) and in both closed-set and zero-shot settings on ScanNet200 and 3RScan.

Further, in ground-truth mask experiments, CAMFusion outperforms OV-3DIS, average pooling, and open-set baselines by over 10 percentage points on Top-k instance mAP, with consistent gains on difficult tail classes—demonstrating improved feature disentanglement and class separation capacity.

Figure 4: Qualitative comparison on Replica and ScanNet200: CAMFusion yields mask predictions with more coherent and sharper object boundaries, suppressing typical segmentation noise present in baseline methods.

Figure 5: Instance segmentation accuracy as a function of the number of input views; CAMFusion consistently outperforms the average pooling baseline and exhibits monotonic improvements with additional views.

Figure 6: Further qualitative evidence on Replica, illustrating robust and precise instance semantics under significant viewpoint variation and clutter.

Figure 7: Qualitative results on ScanNet200 reveal CAMFusion's capacity to generalize to real-world scenes with diverse and rare object classes.

Critical Discussion and Limitations

While CAMFusion demonstrates a robust multiview fusion approach, its effectiveness is bounded by the upstream segmentation quality; errors in mask generation invariably propagate. The selection mechanism for input viewpoints is heuristic, potentially suboptimal, and ripe for enhancement—e.g., with learned informativeness criteria or active selection. The current supervision regime leverages canonical class names, with only limited text augmentation, which may inhibit open-ended compositional generalization to arbitrarily rich queries.

Implications and Future Directions

CAMFusion advances the design of open-vocabulary 3D perception frameworks by providing a scalable, learnable, and flexible semantic fusion primitive. The transformer-based cross-attention approach is compatible with future advances in single-view descriptor extraction and is well-suited for integration into end-to-end 3D semantic pipelines. Its self-supervised multiview consistency loss provides a template for further regularization strategies in 3D semantic learning, potentially informing the design of future feed-forward models that do not explicitly rely on multiview fusion.

Conclusion

CAMFusion systematically addresses the fusion bottleneck in open-vocabulary 3D semantic perception by learning to aggregate view-complementary semantic information through a transformer-based design, complemented by a novel self-supervised consistency objective. The result is superior 3D vision-language descriptors, as reflected in both quantitative benchmarks and qualitative analysis. This work establishes a new baseline for robust multiview semantic fusion and opens avenues for research in learned viewpoint selection, richer linguistic supervision, and integration with end-to-end 3D perception systems.
