3D-Language Dataset Overview
- 3D-Language Datasets are curated collections pairing 3D data (point clouds, meshes, motion capture) with textual annotations (captions, referring expressions, questions, instructions) to support research across vision, robotics, and design domains.
- They enable breakthroughs in vision–language pretraining, embodied robotics, CAD design, and sign language recognition by grounding spatial relations, dynamics, and affordances.
- Advanced annotation pipelines, integrating human expertise and LLM-driven methods, ensure high-quality, multimodal data for robust 3D scene understanding and retrieval benchmarks.
A 3D-Language Dataset is a curated collection in which three-dimensional spatial modalities—such as point clouds, meshes, motion capture, or CAD structures—are directly and densely paired with human language. This alignment underpins a wide spectrum of research in vision–language pretraining, 3D scene understanding, embodied robotics, grounded QA, sign language recognition, change detection, and CAD design. 3D-Language Datasets distinguish themselves from 2D counterparts by addressing unique 3D phenomena: spatial relations, affordances, dynamics, compositionality, and multi-modality. Recent years have witnessed a proliferation of such corpora, targeting both general scene grounding and specialized domains such as sign language, robotics, and CAD design.
1. Core Definitions and Scope
A 3D-Language Dataset is defined by the presence of paired three-dimensional data (e.g., point clouds, 3D meshes, MoCap sequences, 3D bounding boxes, panoramic scans, RGB-D videos) with textual or linguistic annotations. The text modalities vary—ranging from descriptive captions, referring expressions, question-answer pairs, dialogue, task instructions, to parametric modeling scripts. The 3D data may cover static scenes, articulated human motion, object assemblies, or dynamic scenes (including before/after scans for change detection) (Jia et al., 2024, Michel et al., 2023, Zhang et al., 2024, Tang et al., 2022, Zhen et al., 2024, Wei et al., 24 Nov 2025, Ranum et al., 2024, Ahmed et al., 12 Jan 2025, Zhou et al., 14 Oct 2025, Abdelreheem et al., 2022, Yu et al., 2023, Li et al., 13 Mar 2026).
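Concretely, a single entry in such a dataset is a paired (3D data, language, alignment) record. The sketch below illustrates that pairing in code; the schema and field names are illustrative assumptions for exposition, not the format of any particular dataset.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Scene3DLanguageSample:
    """One paired 3D-language sample (hypothetical schema for illustration)."""
    points: np.ndarray           # (N, 6) XYZRGB point cloud
    instance_labels: np.ndarray  # (N,) per-point instance IDs
    text: str                    # caption, referring expression, question, or instruction
    text_type: str               # e.g., "caption" | "referral" | "qa" | "instruction"
    target_ids: list = field(default_factory=list)  # instance IDs the text grounds to

# A referring expression grounded to the object with instance ID 7:
sample = Scene3DLanguageSample(
    points=np.zeros((1024, 6), dtype=np.float32),
    instance_labels=np.zeros(1024, dtype=np.int64),
    text="the chair closest to the window",
    text_type="referral",
    target_ids=[7],
)
```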
Prominent dataset families include:
- Vision–Language–Action (VLA) datasets: For navigation, manipulation, and world-model learning (Zhen et al., 2024, Zhang et al., 2024, Wang et al., 2024).
- Grounded scene understanding: Large-scale scene–language corpora for visual grounding, referential expressions, and scene graph alignment (Jia et al., 2024, Wei et al., 24 Nov 2025, Zhang et al., 20 Mar 2025, Wang et al., 2 Jan 2025, Abdelreheem et al., 2022).
- Human action and sign language: Datasets with 3D human body/hand motion paired with stepwise natural instructions or glosses (Tang et al., 2022, Ranum et al., 2024, Yu et al., 2023).
- Multimodal CAD and design: Parametric or mesh-based objects paired with language and structured scripts (Li et al., 13 Mar 2026, Ahmed et al., 12 Jan 2025).
- Change understanding: "Before/after" 3D scans paired with change-focused QA and situated linguistics (Liu et al., 13 Oct 2025).
2. Data Modalities, Annotation Schemes, and Statistics
3D-Language datasets reflect a variety of data modalities and annotation strategies:
3D Modalities
- Point clouds: XYZRGB, often with per-point instance or semantic labels (Jia et al., 2024, Zhang et al., 20 Mar 2025, Wei et al., 24 Nov 2025, Zhang et al., 2024).
- Meshes: OBJ/STL/PLY, sometimes with fine-grained segmentation (Ahmed et al., 12 Jan 2025, Li et al., 13 Mar 2026).
- Motion capture: 3D joint trajectories at hundreds of Hz, SMPL/SMPL-X parameters, handshape meshes (Tang et al., 2022, Ranum et al., 2024, Yu et al., 2023).
- Multi-view images/RGB-D: Rendered or captured images from several viewpoints, sometimes aligned to depth and point clouds (Wang et al., 2024, Wei et al., 24 Nov 2025, Hong et al., 2023, Zhou et al., 14 Oct 2025).
- CAD parametric scripts: Structured DSLs (e.g., Encoder_txt) derived from CAD feature trees (Li et al., 13 Mar 2026).
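In practice, the point-cloud modality above often reduces to plain arrays. A minimal loading sketch, assuming a hypothetical export of `points.npy` (float32 XYZRGB) and `labels.npy` (per-point semantic and instance IDs):

```python
import numpy as np

points = np.load("points.npy")   # (N, 6) float32: x, y, z, r, g, b (assumed layout)
labels = np.load("labels.npy")   # (N, 2) int64: semantic_id, instance_id (assumed layout)

xyz, rgb = points[:, :3], points[:, 3:]
semantic_ids, instance_ids = labels[:, 0], labels[:, 1]

# Extract the points of one annotated instance, e.g. the target of a referral:
mask = instance_ids == 7
centroid = xyz[mask].mean(axis=0)  # a crude 3D location for the referred object
```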
Language Modalities
- Descriptive and relational captions: Scene, object, and part-level natural language with spatial or functional focus (Jia et al., 2024, Zhang et al., 2024, Yu et al., 2023, Tang et al., 2022, Ahmed et al., 12 Jan 2025, Zhou et al., 14 Oct 2025).
- Referring expressions and dialogue: Unique and sometimes multi-turn object references; discriminative dialogue (Abdelreheem et al., 2022, Wei et al., 24 Nov 2025, Wang et al., 2023).
- QA/instruction: 3D VQA and stepwise manipulation/navigation instructions, sometimes aligned to goal representations or 3D waypoints (Zhen et al., 2024, Wang et al., 2024, Liu et al., 13 Oct 2025, Hong et al., 2023).
- Task-specific natural language: Language-generated change descriptions, repair/rearrangement plans, CAD modeling scripts (Li et al., 13 Mar 2026, Liu et al., 13 Oct 2025).
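These language modalities are typically serialized as per-scene annotation records. A hypothetical JSON-style record combining several of them (all field names illustrative):

```python
annotation = {
    "scene_id": "scene_0042",
    "captions": ["A small office with a desk facing the window."],
    "referrals": [
        {"text": "the monitor on the left desk", "target_instance": 12},
    ],
    "qa_pairs": [
        {"question": "How many chairs are in the room?", "answer": "three"},
    ],
    "instructions": ["Walk past the desk and stop at the door."],
}
```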
Scale and Composition
| Dataset | Scale | Modality | Text Types | Notable Features |
|---|---|---|---|---|
| SceneVerse | 68K scenes/2.5M pairs | point cloud, scene graph | captions, referral | Human and LLM annotated, grounding focus |
| Disc3D | 25K scenes/2.08M QA | point cloud, images | QA, dialogue, grounding | Discriminative object referencing, auto-pipeline |
| 3D-LEX | 2K signs | MoCap body/hands/face | gloss labels | ASL & NGT signs, high-fidelity markers |
| FLAG3D | 180K sequences | MoCap, mesh, video | fitness instructions | Professional annotation, stepwise guidance |
| 3DCoMPaT200 | 19K shapes | mesh, point cloud | compositional captions | Part, material, retrieval |
| IL3D | 27.8K layouts | layouts, assets, images | object descriptions | Multimodal export, LLM-aligned annotation |
| ScanEnts3D | 705 scenes/369K links | point cloud, mesh | dense phrase-object | Explicit anchor mapping |
| SldprtNet | 242.6K CAD parts | step/sldprt, images | param. scripts, descs | Parametrization, end-to-end roundtrip |
| Situat3DChange | 903 pairs/121K QA | point cloud before/after | QA, description, plans | Egocentric/allocentric, human situativity |
3. Collection Pipelines and Annotation Methodologies
Collection methods span human annotation, LLM-assisted generation, automated scene graph synthesis, multi-sensor MoCap integration, and parametric extraction.
- Human and expert annotation: Used in fitness (FLAG3D), sign language (3D-LEX, SignAvatars), SceneVerse referrals, and Situat3DChange change descriptions. Rigorous protocols with review and forced disambiguation ensure linguistic alignment (Tang et al., 2022, Ranum et al., 2024, Jia et al., 2024, Liu et al., 13 Oct 2025).
- Automated scene graph-based generation: LLMs and rules synthesize referential expressions and scene/part descriptions, often using Gricean minimality and exclusivity to drive discrimination; a toy sketch follows this list (Jia et al., 2024, Wei et al., 24 Nov 2025, Zhang et al., 20 Mar 2025).
- Motion capture and 3D sensor fusion: Synchronization of optical MoCap (Vicon, Shogun), sensor gloves (StretchSense), facial ARKit captures; pipelines include calibration marker sets, real-time QA, and semi-automatic phonetic labeling (Ranum et al., 2024, Yu et al., 2023).
- Parametric CAD extraction: Encoder/decoder pipelines traverse feature trees, record command sequences and numeric parameters, later recomposed into scripts and rendered images (Li et al., 13 Mar 2026).
- LLM-driven augmentation: Chain-of-thought prompting, synthetic QA, and paraphrasing enrich linguistic diversity, generate style variants, and offer multi-granularity question sets (e.g., 3D-LR, ViGiL3D) (Deng et al., 2024, Wang et al., 2 Jan 2025).
- Change and difference understanding: Pairwise scan and attribute comparison, both egocentric (relative from observer) and allocentric (world frame), supported by spatial heuristics and human annotation (Liu et al., 13 Oct 2025).
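To make the scene graph-based generation step concrete, here is a toy Python sketch of Gricean-minimal referring-expression synthesis: it searches for the smallest attribute set that excludes all distractors. Real pipelines operate over far richer scene graphs and add LLM rewriting; the object schema and attribute names here are illustrative assumptions.

```python
from itertools import combinations

# Toy scene graph: object ID -> category plus discriminating attributes.
objects = {
    1: {"category": "chair", "color": "red",  "near": "window"},
    2: {"category": "chair", "color": "blue", "near": "door"},
    3: {"category": "table", "color": "blue", "near": "window"},
}

def minimal_referring_expression(target_id, objects):
    """Return the smallest attribute combination that uniquely picks out the target."""
    target = objects[target_id]
    distractors = [o for i, o in objects.items() if i != target_id]
    attrs = [k for k in target if k != "category"]
    for size in range(len(attrs) + 1):  # prefer fewer attributes (Gricean minimality)
        for combo in combinations(attrs, size):
            matches = lambda o: all(
                o.get(k) == target[k] for k in ("category", *combo)
            )
            if not any(matches(o) for o in distractors):  # exclusivity achieved
                color = " ".join(target[k] for k in combo if k == "color")
                place = [f"near the {target[k]}" for k in combo if k == "near"]
                return " ".join(filter(None, [f"the {color}".strip(),
                                              target["category"], *place]))
    return f"the {target['category']}"  # no uniquely discriminating description found

print(minimal_referring_expression(1, objects))  # -> "the red chair"
```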
4. Downstream Tasks and Benchmarks
3D-Language Datasets have enabled advances across several benchmarks and task settings:
- Grounded 3D object/part retrieval: Given a text query, retrieve matching objects, parts, or compositions from a database (Recall@K, accuracy) (Ahmed et al., 12 Jan 2025, Abdelreheem et al., 2022); compositional retrieval grows harder as the number of text-grounded part/material phrases increases.
- 3D visual grounding and reference resolution: Localize the referred target(s) in 3D via bounding boxes (IoU-based metrics, Acc@k; a minimal metric sketch follows this list) and multi-object precision/recall (Zhang et al., 2024, Zhang et al., 20 Mar 2025, Wei et al., 24 Nov 2025, Wang et al., 2 Jan 2025, Hong et al., 2023).
- Visual question answering and dialogue: Answer referential, spatial, or change-focused questions using the complete 3D context (Exact Match, BLEU, GPT-score) (Zhen et al., 2024, Wang et al., 2024, Wei et al., 24 Nov 2025, Liu et al., 13 Oct 2025).
- Sign/action recognition and production: Isolated sign/action classification from 3D motion; full 3D sign synthesis conditioned on language or phonetic representations (accuracy, FID, multimodality, R-Precision) (Ranum et al., 2024, Tang et al., 2022, Yu et al., 2023).
- 3D-aware image and layout editing, CAD generation: Language-to-shape, script, or image transformations; correctness of edit, script, and geometric consistency (BLEU/FID/PSNR/parameter tolerance) (Yu et al., 2023, Li et al., 13 Mar 2026, Zhou et al., 14 Oct 2025).
- Change and dynamics: 3D change detection, difference description, rearrangement instruction grounded in egocentric or allocentric frames, evaluated via GPT-scored correctness and BLEU/CIDEr metrics (Liu et al., 13 Oct 2025).
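As a concrete example of the grounding metrics above, here is a minimal sketch of axis-aligned 3D IoU and an Acc@0.5-style score. Oriented boxes would need a rotated-IoU variant; the box format is an assumption.

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU; boxes are (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))  # intersection volume
    vol = lambda b: np.prod(b[3:] - b[:3])      # box volume
    return inter / (vol(box_a) + vol(box_b) - inter)

def acc_at_iou(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of predicted boxes whose IoU with ground truth reaches `thresh`."""
    hits = [iou_3d(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))

# One near-perfect and one off-target prediction -> Acc@0.5 of 0.5:
pred = [np.array([0, 0, 0, 1, 1, 1.0]), np.array([2, 2, 2, 3, 3, 3.0])]
gt   = [np.array([0, 0, 0, 1, 1, 1.1]), np.array([0, 0, 0, 1, 1, 1.0])]
print(acc_at_iou(pred, gt))  # 0.5
```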
5. Representative Dataset Properties and Benchmark Results
| Dataset | Notable Metrics (topline) |
|---|---|
| 3D-LEX | ISR (isolated sign recognition) Top-1: 44% without handshape features, 48% with expert handshapes, 49% with automatic handshapes (+5 pp over none) |
| SceneVerse | ScanRefer Acc@0.5: 48.1%; Sr3D overall: 77.5%; zero-shot: 60.6% |
| Disc3D | Attribute recog. EM: 0.874 (+0.284 over baseline), Rel. dist. EM: 0.730 |
| 3DCoMPaT200 | Part shape retrieval R@1: 23–59% (1–6 parts), compositional ablation |
| FLAG3D | In-domain HAR Top-1: ~98%; HMR PA-MPJPE: 62 mm (FT); HAG FID: 0.396→0.407 |
| SldprtNet | Command F1: 0.3247→0.3670, Partial Match: 0.5554→0.6162 (image+text) |
| Situat3DChange | Rearrangement GPT-score: 30.7; Change description: 13.9; QA: 53.8% |
Training with these datasets yields large gains in generalization and linguistic robustness, especially on referential out-of-distribution prompts, discriminative reasoning, and compositional retrieval (Deng et al., 2024, Wang et al., 2 Jan 2025, Wei et al., 24 Nov 2025).
6. Access, Licensing, and Future Directions
Most 3D-Language datasets are released under open research licenses:
- CC BY 4.0 (e.g., 3D-LEX, OBJECT, IL3D)
- CC BY-NC 4.0 (e.g., IL3D, 3D-VLA)
- CC BY-NC-SA 4.0 (e.g., ViGiL3D)
- Specialized dataset homepages and GitHub repositories provide programmatic loaders, parsing utilities, and conversion scripts (Ranum et al., 2024, Li et al., 13 Mar 2026, Wei et al., 24 Nov 2025, Jia et al., 2024, Ahmed et al., 12 Jan 2025, Zhou et al., 14 Oct 2025).
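Loader interfaces differ per dataset; the snippet below only illustrates the typical pattern of iterating scene directories and their annotation files. The directory layout, file names, and fields are hypothetical, not any released dataset's actual API.

```python
import json
from pathlib import Path

def iter_referrals(root):
    """Yield (point cloud path, referring expression, target instance) triples
    from a hypothetical layout of per-scene folders with annotations.json."""
    for scene_dir in sorted(Path(root).iterdir()):
        ann = json.loads((scene_dir / "annotations.json").read_text())
        for ref in ann.get("referrals", []):
            yield scene_dir / "points.npy", ref["text"], ref["target_instance"]

for pc_path, text, target in iter_referrals("data/example_dataset"):
    ...  # load the cloud and feed (points, text) into a grounding model
```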
Ongoing and future research directions, as identified in the literature, include:
- Scaling diversity of language beyond template-based or LLM-generated text, emphasizing human-written and idiomatic expressions (Deng et al., 2024, Wang et al., 2 Jan 2025, Wei et al., 24 Nov 2025).
- Expanding signer and actor diversity in motion capture and sign corpora (Ranum et al., 2024).
- Improved alignment between language, physics, and geometric affordance—especially for robotics and embodied agents (Zhen et al., 2024, Wang et al., 2024).
- Extension to assembly-level CAD, multi-turn 3D dialogue, dynamic scenes, robust zero/few-shot generalization, and joint 2D–3D scene understanding pipelines (Zhou et al., 14 Oct 2025, Li et al., 13 Mar 2026, Liu et al., 13 Oct 2025, Cho et al., 2024).
- Addressing remaining gaps in discourse-level grounding, multilinguality, non-manual sign features, and crowd-sourced change detection (Jia et al., 2024, Yu et al., 2023, Liu et al., 13 Oct 2025).
7. Key Considerations, Limitations, and Recommendations
Challenges in 3D-Language dataset creation and usage include:
- Synthetic language: Many datasets rely on LLM or template generation, introducing lexical and syntactic homogeneity, potentially limiting generalization (Zhang et al., 2024, Zhang et al., 20 Mar 2025, Wang et al., 2 Jan 2025).
- Sensor realism and bias: Most 3D reconstructions are mesh-derived or simulated and lack real sensor artifacts; categories and relations disproportionately reflect object-centric indoor navigation (Zhang et al., 2024, Wang et al., 2024).
- Evaluation complexity: Benchmarks require spatial, semantic, and procedural understanding; single-best-answer metrics may undervalue partial or pragmatic correctness (Wei et al., 24 Nov 2025, Wang et al., 2 Jan 2025, Abdelreheem et al., 2022).
- Language robustness: Existing 3D-VL models show marked performance drops under alternate phrasings, justifying dedicated robustness evaluation suites such as 3D-LR and ViGiL3D (Deng et al., 2024, Wang et al., 2 Jan 2025); a minimal sketch of such a check follows this list.
- Scalability and future annotation: Manual verification remains essential for high-fidelity tasks, but scalable, LLM-assisted pipelines (Disc3D, Situat3DChange) suggest viable paths to large, high-quality data without untenable costs (Wei et al., 24 Nov 2025, Liu et al., 13 Oct 2025).
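A minimal sketch of the robustness check referenced above: score a model on original prompts and on paraphrases, and report the gap. The `model(prompt, scene)` interface and `Sample` record are assumptions for illustration, not any suite's real API.

```python
from collections import namedtuple

Sample = namedtuple("Sample", "text scene answer")

def robustness_gap(model, samples, paraphrase):
    """Accuracy drop between original and paraphrased prompts, the style of
    evaluation that suites like 3D-LR and ViGiL3D formalize."""
    def acc(prompts):
        hits = [model(p, s.scene) == s.answer for p, s in zip(prompts, samples)]
        return sum(hits) / len(hits)
    base = acc([s.text for s in samples])
    para = acc([paraphrase(s.text) for s in samples])
    return base - para  # a large positive gap indicates brittleness to rephrasing
```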
Best practices call for multi-faceted data (2D+3D+language), multi-granularity annotation, rigorous human-in-the-loop curation/validation, and universal APIs that facilitate cross-dataset method comparison (Jia et al., 2024, Li et al., 13 Mar 2026, Wei et al., 24 Nov 2025). With continued advances, 3D-Language Datasets are foundational to robust, generalizable 3D vision-language-action systems across research and industry.