- The paper introduces UniVLG, which leverages pre-trained 2D models and a language-conditioned mask decoder to unify 2D and 3D vision-language tasks.
- It employs 2D-to-3D lifting strategies and 3D relative attention layers, achieving state-of-the-art results in 3D referential grounding and segmentation tasks.
- The approach removes the reliance on 3D mesh reconstructions and ground-truth object proposals, enabling a more realistic evaluation setting for embodied vision systems.
Unifying 2D and 3D Vision-Language Understanding
Introduction
The paper "Unifying 2D and 3D Vision-Language Understanding" addresses the existing gap between 2D-centric models and the rich 3D sensory data available in embodied systems due to the scarcity of large-scale 3D datasets. It introduces UniVLG, a unified architecture for 2D and 3D vision-language understanding. The model leverages pre-trained 2D models, proposes a language-conditioned mask decoder, and incorporates 2D-to-3D lifting strategies to enhance 3D performance. This paper highlights UniVLG's state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential for transferring 2D advances to the data-constrained 3D domain. The model also eliminates reliance on 3D mesh reconstruction and ground-truth object proposals, setting a realistic evaluation standard.
Figure 1: (A) UniVLG achieves state-of-the-art performance across various benchmarks. (B) Accepts posed RGB-D sequences or 2D images lifted to 3D pointmaps. (C) Benefits from joint 2D-3D training. (D) Example task inputs/outputs for UniVLG.
Methodology
Architecture
The UniVLG architecture is a vision-language transformer that processes a language utterance together with posed RGB-D views or RGB images lifted to 3D. A ViT backbone extracts 2D patch features, which are then passed through 3D relative attention layers that produce 3D-aware feature tokens. A mask decoder iteratively refines a set of learnable queries through token-language-query attention, decoding object segments and matching them to the corresponding noun phrases in the input referential utterance.
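A minimal sketch of what a 3D relative attention layer could look like, where attention logits are biased by an MLP over pairwise 3D offsets between tokens. The single-head formulation, dimensions, and bias MLP are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class RelativeAttention3D(nn.Module):
    """Single-head attention whose logits are biased by pairwise 3D offsets.
    A simplified, illustrative stand-in for UniVLG's 3D relative attention layers."""

    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pos_bias = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
        self.scale = dim ** -0.5

    def forward(self, feats, xyz):
        # feats: (B, N, dim) visual tokens; xyz: (B, N, 3) their 3D positions
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        logits = torch.einsum("bqd,bkd->bqk", q, k) * self.scale
        offsets = xyz[:, :, None, :] - xyz[:, None, :, :]     # (B, N, N, 3) relative offsets
        logits = logits + self.pos_bias(offsets).squeeze(-1)  # geometry-aware attention bias
        attn = logits.softmax(dim=-1)
        return torch.einsum("bqk,bkd->bqd", attn, v)
```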
Figure 2: UniVLG Architecture: Fuses information across vision and language to predict 3D segments or generate answers using a ViT backbone and 3D attentions.
Language-Conditioned Mask Decoder
UniVLG introduces a language-conditioned mask decoder shared between the 2D and 3D modalities. The decoder predicts segmentation masks by conditioning on both the visual features and the language instruction. This design is central to effective grounding in RGB and RGB-D inputs: it provides a unified output space of per-patch predictions, which yields precise segment decoding and is shown to outperform traditional box-based decoders.
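A rough sketch of the language-conditioned, per-patch mask decoding idea: learnable queries cross-attend to the language and visual tokens, and each refined query is dotted with per-patch features to produce a mask and a text-span score. The layer layout and names are assumptions for exposition, not the paper's exact decoder:

```python
import torch
import torch.nn as nn

class LanguageConditionedMaskDecoder(nn.Module):
    """Illustrative sketch: queries attend to language, then to visual tokens,
    and predict per-patch mask logits plus a distribution over text spans."""

    def __init__(self, dim=256, num_queries=100, num_layers=3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.lang_attn = nn.ModuleList(nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(num_layers))
        self.vis_attn = nn.ModuleList(nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(num_layers))
        self.ffn = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)) for _ in range(num_layers))

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, N, dim) per-patch features (2D patches or lifted 3D tokens)
        # lang_tokens: (B, L, dim) encoded referential utterance
        q = self.queries.weight.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        for la, va, ffn in zip(self.lang_attn, self.vis_attn, self.ffn):
            q = q + la(q, lang_tokens, lang_tokens)[0]  # condition queries on the utterance
            q = q + va(q, vis_tokens, vis_tokens)[0]    # refine queries against visual tokens
            q = q + ffn(q)
        mask_logits = torch.einsum("bqd,bnd->bqn", q, vis_tokens)   # per-patch masks per query
        span_logits = torch.einsum("bqd,bld->bql", q, lang_tokens)  # match queries to noun phrases
        return mask_logits, span_logits
```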
Training Strategy
UniVLG is co-trained on 2D and 3D datasets, using 2D-to-3D lifting strategies to narrow the domain gap between the two input types. The architecture benefits from the representational strength of pre-trained 2D backbones and is trained end-to-end with supervised mask, text-span, and box losses together with a text-generation objective.
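One concrete ingredient here is lifting 2D observations into 3D pointmaps. Below is a minimal sketch of standard pinhole unprojection, where depth comes either from an RGB-D sensor or, for RGB-only inputs, from a monocular estimate; the function name is hypothetical and the paper's specific lifting strategies may differ:

```python
import torch

def lift_to_pointmap(depth, intrinsics, cam_to_world):
    """Unproject a depth map into a world-frame 3D pointmap via the pinhole model.
    depth: (H, W) metric depth (sensed, or predicted for RGB-only inputs).
    intrinsics: (3, 3) camera matrix K; cam_to_world: (4, 4) camera pose."""
    H, W = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # back-project pixels into camera coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = torch.stack([x, y, depth, torch.ones_like(depth)], dim=-1)  # (H, W, 4)
    # transform to the world frame so multi-view tokens share one 3D coordinate space
    pts_world = pts_cam.reshape(-1, 4) @ cam_to_world.T
    return pts_world[:, :3].reshape(H, W, 3)
```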
Experiments and Results
UniVLG is evaluated on several benchmarks spanning 3D referential grounding, visual question answering, and 3D instance segmentation, where it significantly outperforms existing methods. On referential grounding, UniVLG achieves state-of-the-art results in both in-domain (e.g., ScanRefer, NR3D) and out-of-domain settings (e.g., the LX3D datasets), demonstrating strong generalization.
Figure 3: Visualization of UniVLG on 3D Referential Grounding Datasets. Shows predictions and ground-truth in complex scenes.
In 3D referential grounding, UniVLG is resilient to input noise and maintains robust performance on unseen scenes and tasks. Its lifting strategy broadens the domains it can be applied to, closing the performance gap with traditional mesh-based methods while operating directly on sensor-derived point clouds.
Discussion
The paper argues for the superiority of mask-based decoding over box-based approaches, citing its robustness in scenes with complex object instances. However, segment-mask decoding has its own failure modes, such as including distant points or multiple instances in a single mask, and these warrant further exploration.
Conclusion
UniVLG effectively bridges the gap between 2D and 3D vision-language tasks through training strategies and architectural designs that integrate pre-trained 2D features with a shared, language-conditioned decoding mechanism. Looking ahead, scaling the model with additional 2D datasets and refining mask decoding remain promising avenues for improving the robustness and applicability of 3D vision-language systems across complex real-world tasks.