- The paper presents a novel task-to-scene generation framework integrating a large-scale dataset (MesaTask-10K), LLM-based spatial reasoning, and DPO training.
- The methodology employs object list completion, interrelation inference, and scene graph construction to ensure realism and semantic alignment.
- The approach outperforms baselines (FID = 40.3, success rate ≈ 99%), demonstrating strong generalization and practicality.
Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning: MesaTask
Introduction and Motivation
MesaTask addresses the challenge of generating physically plausible, task-oriented 3D tabletop scenes directly from high-level human instructions. The motivation stems from the need for robust training environments for robotic manipulation, where traditional manual or randomized scene construction falls short in diversity, realism, and alignment with task semantics. The paper introduces a new formulation, task-to-scene generation, which requires synthesizing interactive, realistic tabletop layouts that satisfy complex spatial and semantic constraints derived from abstract task descriptions.
Figure 1: MesaTask generates task-driven 3D tabletop scenes from high-level instructions, leveraging a large-scale dataset and spatial reasoning.
MesaTask-10K Dataset Construction
MesaTask-10K is a large-scale dataset comprising over 10,700 manually refined tabletop scenes, each constructed to ensure realistic object layouts and intricate inter-object relations. The dataset spans six common indoor table types and utilizes a curated library of 12,000+ 3D assets with rich semantic annotations.
The construction pipeline integrates LLMs for scene description generation, text-to-image models for reference image synthesis, and multimodal models for object detection and semantic labeling. Depth estimation and instance segmentation are used to build coarse 3D layouts, which are then refined by expert annotators in Blender to correct occlusions, scale inaccuracies, and spatial implausibilities. Final scenes are validated in IsaacSim to ensure physical plausibility.
Figure 2: MesaTask-10K dataset pipeline: LLM-driven scene description, image synthesis, 3D layout construction, and human refinement.
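The exact procedure for lifting depth and instance masks into coarse 3D boxes is not restated here, but a minimal sketch, assuming known pinhole intrinsics (fx, fy, cx, cy) and a per-object binary mask, could look like the following; this is illustrative only, not the authors' implementation:

```python
import numpy as np

def coarse_3d_box(depth, mask, fx, fy, cx, cy):
    """Back-project the pixels of one instance mask through a pinhole camera
    model and return an axis-aligned bounding box (min_xyz, max_xyz) in
    camera space as a coarse 3D placement for that object."""
    v, u = np.nonzero(mask)          # pixel rows/cols covered by the instance
    z = depth[v, u]                  # estimated depth at those pixels
    keep = z > 0                     # drop invalid depth readings
    u, v, z = u[keep], v[keep], z[keep]
    x = (u - cx) * z / fx            # pinhole back-projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    return pts.min(axis=0), pts.max(axis=0)
```

Boxes obtained this way are only a starting point; as described above, annotators then correct occlusions, scale, and implausible placements in Blender.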
The dataset features high object diversity (200+ categories), with an average of 15 objects per scene. The top 100 object categories are visualized to demonstrate coverage and frequency.
Figure 3: Distribution of the top 100 object categories in MesaTask-10K.
MesaTask Framework: LLM-Based 3D Scene Generation
MesaTask is an LLM-based framework that operationalizes task-to-scene generation through a structured spatial reasoning chain. The process decomposes the mapping from task instructions to 3D layouts into three stages:
- Object List Completion: Infers a comprehensive set of objects required for the task, extending beyond those explicitly mentioned.
- Interrelation Inference: Predicts spatial and functional relationships (e.g., stacking, containment, adjacency) among objects.
- Scene Graph Construction: Builds a structured graph encoding object attributes, positions, orientations (quantized into 8 bins), and interrelations, which guides the final 3D layout synthesis; a minimal data-structure sketch follows this list.
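As a rough illustration of what such a scene graph might look like as a data structure (field names are illustrative assumptions, not the paper's exact schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectNode:
    category: str                      # e.g. "mug", "laptop"
    size: Tuple[float, float, float]   # width, depth, height
    position: Tuple[float, float]      # (x, y) on the table surface
    orientation_bin: int               # yaw quantized into 8 bins (0..7, 45 degrees each)

@dataclass
class Relation:
    subject: str                       # object id
    predicate: str                     # e.g. "on_top_of", "inside", "next_to"
    target: str                        # object id

@dataclass
class SceneGraph:
    objects: Dict[str, ObjectNode] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)
```

Quantizing yaw into 8 discrete bins keeps the LLM's output space small and easy to validate before layout synthesis.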
Figure 4: MesaTask framework overview: task-to-scene generation, reasoning data construction, and DPO data construction.
The framework leverages multimodal LLMs for reasoning data construction, generating detailed task information, object lists, and interrelations from scene graphs and rendered images. For training, MesaTask employs supervised fine-tuning (SFT) on reasoning data, followed by Direct Preference Optimization (DPO) to mitigate common failure modes such as object collisions, misaligned relations, and omission of task-relevant objects. DPO training utilizes paired positive (high-quality) and negative (corrupted) layouts to optimize the LLM's policy for scene generation.
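As a reference point, the generic DPO objective over such layout preference pairs is sketched below; the policy_*/ref_* arguments denote summed log-probabilities of the serialized positive (chosen) and corrupted (rejected) layouts under the fine-tuned and frozen SFT models, and β = 0.1 is an assumed value, not the paper's reported hyperparameter:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Prefer the clean (chosen) layout over the corrupted (rejected) one,
    measured relative to the frozen SFT reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```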
Experimental Evaluation
MesaTask is evaluated against closed-source LLMs (GPT-4o) and modular scene generation baselines (Holodeck-table, I-Design-table). Metrics include Fréchet Inception Distance (FID) for realism, success rate for output format correctness, and a multi-dimensional GPT-score assessing task consistency, object size reasonableness, placement plausibility, layout coherence, and object visibility.
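For context, FID follows its standard definition over feature statistics of generated versus reference scene renderings, with μ and Σ the feature means and covariances (lower is better):

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$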
MesaTask achieves the highest performance across all metrics, with FID = 40.3, success rate ≈ 99%, and GPT-score avg. = 8.25, outperforming all baselines. The spatial reasoning chain and DPO training yield measurable improvements in scene plausibility and task alignment.
Figure 5: Qualitative comparison: MesaTask exhibits superior realism, alignment, and modeling of complex inter-object relations compared to baselines.
Ablation studies confirm the necessity of both spatial reasoning and DPO; removing either component increases collision rates and degrades object fidelity and task alignment.
Figure 6: Ablation study: DPO training reduces collision rate, increases object fidelity, and improves task alignment.
Generalization and Robustness
MesaTask demonstrates strong generalization to novel tabletop categories not present in the training set (e.g., nightstands, TV stands, cashier counters), maintaining high success rates and GPT-scores. The framework can synthesize scenes with objects and layouts outside the training distribution, leveraging the LLM's open-vocabulary capabilities.
Figure 7: MesaTask generates realistic scenes for novel tabletop categories, demonstrating generalization beyond training data.
Benchmarking and Comparative Analysis
MesaTask-10K serves as a benchmark for scene generation methods. Comparative experiments with ATISS, DiffuScene, and PhyScene (trained on MesaTask-10K) reveal trade-offs between visual quality and semantic alignment. MesaTask differs fundamentally by generating scenes from task instructions rather than static scene descriptions, enabling open-vocabulary synthesis and superior task-scene alignment.
Figure 8: Qualitative results of ATISS, DiffuScene, and PhyScene trained on MesaTask-10K.
Implementation Details
MesaTask is implemented using Qwen3-8B as the base LLM, with full-parameter fine-tuning on 8 A800 GPUs. The asset retrieval process combines SBERT-based text similarity and cosine size similarity, with empirically weighted scoring. Scene graph extraction employs geometric rules for spatial relations, orientation quantization, and grid-based positional encoding. DPO training uses three corruption strategies: geometric perturbation, scene graph corruption, and object removal.
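A minimal sketch of such a retrieval score, assuming a specific SBERT checkpoint and illustrative weights (the paper only states that the weighting is empirical):

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint
W_TEXT, W_SIZE = 0.7, 0.3                         # hypothetical weights

def retrieval_score(query_desc: str, query_size, asset_desc: str, asset_size) -> float:
    """Rank a candidate 3D asset by description similarity plus size similarity."""
    text_sim = util.cos_sim(sbert.encode(query_desc),
                            sbert.encode(asset_desc)).item()
    q = np.asarray(query_size, dtype=float)        # target (width, depth, height)
    a = np.asarray(asset_size, dtype=float)        # asset bounding-box dimensions
    size_sim = float(q @ a / (np.linalg.norm(q) * np.linalg.norm(a)))
    return W_TEXT * text_sim + W_SIZE * size_sim
```

The highest-scoring asset per object node would then be placed according to the generated layout; the actual weights and checkpoint used by the authors may differ.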
Limitations and Future Directions
MesaTask currently focuses on six indoor table types and relies on 3D object retrieval, limiting diversity to the asset database. Future work will integrate 3D object generation conditioned on bounding boxes to expand object diversity and realism. Extending to broader table types and more complex manipulation tasks is a natural progression.
Conclusion
MesaTask establishes a new paradigm for task-driven 3D tabletop scene generation, combining large-scale dataset construction, structured spatial reasoning, and LLM-based synthesis. The framework achieves state-of-the-art performance in generating physically plausible, semantically aligned scenes from abstract task instructions, with robust generalization and extensibility. MesaTask-10K provides a valuable benchmark for future research in embodied AI, scene synthesis, and robotic policy learning.
Figure 9: Additional qualitative comparisons of task-conditioned scene generation across GPT-4o, I-Design-table, Holodeck-table, and MesaTask.
Figure 10: Additional qualitative results generated by MesaTask, illustrating diverse task-driven scene layouts.
Figure 11: MesaTask qualitative results (continued), showing complex spatial arrangements and inter-object relations.
Figure 12: MesaTask qualitative results (continued), further demonstrating scene diversity and realism.