Uni3DL: Unified Model for 3D and Language Understanding (2312.03026v1)

Published 5 Dec 2023 in cs.CV

Abstract: In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D, which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding. Project page: https://uni3dl.github.io.
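
The task-router idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the task-to-head mapping, the head names, and the `route_task` helper below are hypothetical, chosen only to show how one set of shared query features might feed just the output heads a given task needs.

```python
from typing import Callable, Dict, List, Sequence

# Stand-in for the query embeddings a shared decoder would produce (assumption).
QueryFeatures = Sequence[float]
Head = Callable[[QueryFeatures], object]

# Hypothetical mapping from task name to the output heads it needs.
# The paper's actual routing table may differ.
TASK_HEADS: Dict[str, List[str]] = {
    "semantic_segmentation": ["semantic", "mask"],
    "instance_segmentation": ["semantic", "mask"],
    "visual_grounding": ["mask", "grounding"],
    "captioning": ["caption"],
    "retrieval": ["embedding"],
}


def route_task(task: str, heads: Dict[str, Head],
               queries: QueryFeatures) -> Dict[str, object]:
    """Run only the heads that `task` needs on the shared query features."""
    if task not in TASK_HEADS:
        raise ValueError(f"unknown task: {task}")
    return {name: heads[name](queries) for name in TASK_HEADS[task]}


def make_toy_head(name: str) -> Head:
    """Build a placeholder head that just reports what it was asked to do."""
    return lambda queries: f"{name} output for {len(queries)} queries"


# One toy head per head name that appears in the routing table.
toy_heads: Dict[str, Head] = {
    name: make_toy_head(name)
    for names in TASK_HEADS.values()
    for name in names
}

result = route_task("captioning", toy_heads, [0.1, 0.2, 0.3])
print(result)
```

In this sketch the heads share one input, so adding a task means adding a row to the table (and possibly a head) rather than a separate model, which is the kind of parameter sharing the abstract describes.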

Authors (4)
  1. Xiang Li (1002 papers)
  2. Jian Ding (132 papers)
  3. Zhaoyang Chen (9 papers)
  4. Mohamed Elhoseiny (102 papers)
Citations (3)