Simultaneous multi-object 9-DoF pose control in image generation

Establish methods that enable simultaneous, accurate control over the full 9-DoF poses—3D location, size, and orientation—of multiple objects within a single image generation process, ensuring reliable alignment between specified poses and generated content.
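
As a purely illustrative sketch, the nine degrees of freedom named above can be written as three 3-vectors per object, and a multi-object scene as a list of such poses. This is not the paper's CNOCS representation: the class name `Pose9D`, the yaw/pitch/roll Euler convention, and the axis layout are all assumptions made here for the example.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Pose9D:
    """Hypothetical 9-DoF pose: 3 location + 3 size + 3 orientation values."""
    location: Vec3     # box center (x, y, z)
    size: Vec3         # box extents along the object's local x, y, z axes
    orientation: Vec3  # Euler angles in radians (yaw, pitch, roll); assumed convention

    def rotation_matrix(self) -> List[List[float]]:
        """Return R = Rz(yaw) * Ry(pitch) * Rx(roll) as a 3x3 nested list."""
        yaw, pitch, roll = self.orientation
        cy, sy = math.cos(yaw), math.sin(yaw)
        cp, sp = math.cos(pitch), math.sin(pitch)
        cr, sr = math.cos(roll), math.sin(roll)
        return [
            [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
            [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
            [-sp, cp * sr, cp * cr],
        ]

    def corners(self) -> List[Vec3]:
        """Return the 8 corners of the oriented 3D box in scene coordinates."""
        rot = self.rotation_matrix()
        result = []
        for signs in product((-0.5, 0.5), repeat=3):
            # Corner in the object's local frame, before rotation and translation.
            local = [s * extent for s, extent in zip(signs, self.size)]
            world = tuple(
                sum(rot[i][j] * local[j] for j in range(3)) + self.location[i]
                for i in range(3)
            )
            result.append(world)
        return result

    def as_vector(self) -> Tuple[float, ...]:
        """Flatten the pose into its 9 scalar degrees of freedom."""
        return (*self.location, *self.size, *self.orientation)


# A multi-object scene is then a list of independent 9-DoF poses.
scene = [
    Pose9D(location=(0.0, 0.0, 4.0), size=(1.0, 1.0, 2.0), orientation=(0.0, 0.0, 0.0)),
    Pose9D(location=(1.5, 0.0, 6.0), size=(2.0, 1.0, 1.0), orientation=(math.pi / 2, 0.0, 0.0)),
]
```

Under these assumptions, "simultaneous control" means that a generator would have to honor every entry of `scene` in a single image, rather than one object or one subset of the nine values at a time.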

Background

The paper targets controllable image generation where users specify spatial properties of objects. Prior approaches typically handle only 2D constraints or limited 3D aspects: methods using 3D bounding boxes lack orientation control, angle-embedding approaches struggle with precise spatial positioning and multi-object diversity, and some one-step models restrict compatibility with widely used multi-step diffusion frameworks. As a result, comprehensive control over 3D location, size, and orientation for multiple objects in a single scene has not been reliably achieved.

The authors propose SceneDesigner, which combines a new CNOCS representation, a curated ObjectPose9D dataset, a two-stage learning strategy that includes reinforcement learning to handle pose imbalance, and an inference-time Disentangled Object Sampling technique. Although the method advances the state of the art, the paper explicitly frames simultaneous multi-object 9-DoF control as the open challenge that motivates the work.

References

"However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge."

SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation (2511.16666 - Qin et al., 20 Nov 2025) in Abstract (also reiterated in Section 1: Introduction)