Papers
Topics
Authors
Recent
Search
2000 character limit reached

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Published 2 Apr 2026 in cs.CV and cs.AI | (2604.02289v1)

Abstract: Recent multimodal LLMs have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

Summary

  • The paper proposes a unified multimodal autoregressive framework that integrates text-to-2D and text-to-3D generation using an interleaved X-to-X training paradigm.
  • It leverages large-scale, heterogeneous datasets and custom tokenizers to compress text, images, and 3D meshes into shared token sequences for efficient cross-modal fusion.
  • Experimental results demonstrate superior shape-text alignment and geometric fidelity compared to cascaded and diffusion-based methods, highlighting parameter efficiency.

Omni123: Unified Text-to-2D and 3D Generation with Limited 3D Data

Introduction

Omni123 proposes a unified, multimodal autoregressive framework for joint text-to-2D and text-to-3D generation, explicitly targeting the challenge of severe 3D data scarcity relative to abundant 2D web-scale data. The paper critically examines pipeline approaches that use 2D generation as a proxy and iterative optimization for 3D synthesis, noting their frequent failures in maintaining geometric consistency and native multimodal reasoning. Instead, Omni123 introduces an interleaved X-to-X training paradigm—treating text, 2D images, and 3D geometry as discrete tokens in a shared sequence space—to maximize cross-modal knowledge transfer and exploit the implicit 3D priors in image data, without requiring fully aligned text-image-3D triplets.

This unified approach aims to coordinate potentially interfering statistical priors from different modalities, enforce semantic-visual-geometric consistency, and enable practical downstream workflows of generation and editing. Figure 1

Figure 1: Comparison of methodological paradigms for text-to-3D generation, highlighting contrasts between classical cascaded, optimization-based, and native unified methods.

Data Design and Processing

Omni123 leverages large-scale, heterogeneous paired datasets with strongly curated quality control. The training corpus covers text, images, and 3D assets, parsed into text-image (63.7M), image-3D (120M), and text-3D (9M) pairings, and interleaved SFT (22M) triplets. To align the training with practical 3D-aware tasks, pipeline stages include rendering/canonicalization, synthetic data augmentation, multi-view rendering for explicit viewpoint control, chain-of-thought multi-granularity captioning, and robust filtering for geometric/textural fidelity and alignment.

The resulting text and 3D assets are optimized to bridge the modality gap both with diverse semantic detail and explicit pose information. Viewpoint control is operationalized in subsequent stages by attaching learnable view tokens.

Architecture and Tokenization Framework

Omni123's architecture innovates on the autoregressive transformer paradigm by compressing all modalities (text, image, 3D mesh) into short 1D token sequences. This is achieved via self-developed image tokenizers (with a two-stage VAE and 1D Q-Former extraction) and the Cube3D tokenizer for mesh geometry. Both achieve low reconstruction error and high semantic fidelity for their respective inputs, supporting efficient context lengths and avoiding modality-specific position encoding.

Dual text encoders (CLIP, Qwen3) provide complementary alignment for vision-language and enriched semantic features. The backbone is a dual-stream transformer with joint-attention for cross-modal fusion and subsequent single-stream layers for token prediction, facilitating maximal parameter sharing. Figure 2

Figure 2: The Omni123 architecture enables concatenated token sequences for text, image, and 3D, processed with shared self-attention to promote strong cross-modal coupling.

Training Curriculum: Cross-Modal Consistency and Viewpoint Awareness

Training proceeds in three interdependent stages:

  1. Cross-modal Pre-training: Tasks cover text-to-image, text-to-3D, image-to-3D, and 3D-to-image, formulated as next-token prediction. Pool temperature rebalancing and priority weighting address natural imbalances in the data for each direction. Cross-modal generative cycles enforce that the representations satisfy requirements for high-level semantics, image appearance, and explicit geometric consistency.
  2. Continued Training (CT): Introduces learnable view tokens to accomplish explicit viewpoint-controllable generation. Rendering canonical views for each 3D shape expands the model's native understanding of view-conditioned image synthesis, a critical step for robust projection and 3D-aware downstream tasks. Figure 3

    Figure 3: Continued training imbues the model with camera pose conditioning via view tokens, establishing explicit geometric correspondences across modalities.

  3. Supervised Fine-Tuning (SFT): Implements interleaved, multi-modal generation chains (e.g., text→image→3D→multi-view-image), closing the semantic-visual-geometric loop at the sequence level. This ensures that 3D outputs are not only informed by preceding image generations but are also directly evaluated by their capacity to render consistent multi-view images.

Instruction-Based 3D Editing

Omni123 natively supports language-driven 3D editing by treating source mesh tokens as prefixes and feeding edit instructions as conditioning, yielding autoregressively generated edited mesh tokens. Leveraging 3DEditVerse (the largest high-quality 3D edit dataset), Omni123 aligns structural and appearance modifications with user directives while enforcing minimal disruption to unedited regions and maintaining cross-view consistency through its sequential context. Figure 4

Figure 4: Omni123's framework naturally enables diverse, iterative 3D generation and editing tasks with high geometric fidelity and semantic control.

Experimental Results

Text-to-3D Generation

Quantitative and qualitative benchmarks against both cascaded (text→image→3D) and native text-3D generation baselines demonstrate that Omni123's cross-modal interleaving achieves superior shape-text alignment. Notably, a 2.2B parameter Omni123 surpasses larger (7B) autoregressive and diffusion-based alternatives (2604.02289). Numerical results indicate improved ULIP-T and Uni3D-T scores compared to both two-stage and native methods, establishing that interleaved X-to-X training is fundamentally more parameter-efficient. Figure 5

Figure 5: Omni123 generates more consistent, compositionally accurate 3D assets compared to both two-stage and native baseline models.

Instruction-Based Editing

On Edit3D-Bench, Omni123 achieves the lowest Chamfer Distance (CD), indicating superior global geometric fidelity; its F1 scores are competitive despite certain benchmarks being tilted in favor of pipeline-specific training distributions. The ability to propagate cross-modal priors from pre-training enables data-efficient adaptation to fine-grained editing.

Image Tokenizer Analysis

The bespoke 1D image tokenizer demonstrates state-of-the-art FID/PSNR/SSIM on ImageNet-1K, minimizing tokenization-induced degradation. Qualitative results show superior preservation of object structure, textural details, and text rendering over competing methods—crucial for guiding accurate 3D synthesis within the joint sequence. Figure 6

Figure 6: Comparative reconstructions reveal the tokenizer’s advantage in high-frequency detail and structural preservation across image classes.

Implications and Future Directions

Omni123’s unified framework represents a shift towards native, sequence-level multimodal 3D generation and editing. By recasting diverse synthesis pipelines as interleaved autoregressive sequences, Omni123 exploits implicit geometric priors from 2D data, enabling parameter-scalable 3D world modeling and practical workflows for 3D content creation. Its approach to viewpoint control, explicit cross-modal consistency, and editability closely aligns with emergent needs in embodied AI, robotics, simulation, and interactive design.

However, constraints remain: fixed mesh resolution and predetermined set of canonical views limit its expressiveness for high-complexity or scene-level content. Further work is needed in adaptive-resolution tokenization, generalization to arbitrary camera/viewpoint distributions, and integration of physical/material properties for simulation-ready assets.

Conclusion

Omni123 provides substantive evidence that unifying 2D and 3D generative pipelines within a shared autoregressive token space, with explicit cross-modal and view-conditioned training, dramatically enhances semantic and geometric fidelity in 3D generation and editing, especially under practical 3D data scarcity constraints. This framework establishes a robust foundation for future multimodal, world-centric AI systems that require seamless traversal and manipulation of visual, geometric, and linguistic information (2604.02289).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.