Extending test-time scaling to unified multimodal models

Develop a test-time scaling approach for unified multimodal models that supports iterative chain-of-thought reasoning, verification, and refinement across multiple rounds of interleaved text and image processing. The goal is to extend the inference-time compute benefits demonstrated for language models to architectures that handle both multimodal understanding and generation within a single model.
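
To make the target behavior concrete, here is a minimal sketch of such a reason-verify-refine loop. The `model` interface (`reason`, `generate_image`, `verify`) is a hypothetical placeholder for illustration, not an API from the paper.

```python
# Minimal sketch of a multi-round test-time scaling loop for a unified
# multimodal model. The model interface is assumed, not the paper's.

def multimodal_tts(model, prompt, max_rounds=4):
    """Alternate chain-of-thought reasoning, image generation,
    self-verification, and refinement until the verifier accepts
    or the round budget is exhausted."""
    context = [("text", prompt)]
    image = None
    for _ in range(max_rounds):
        thought = model.reason(context)            # textual CoT step
        context.append(("text", thought))
        image = model.generate_image(context)      # image output for this round
        context.append(("image", image))
        verdict, critique = model.verify(context)  # self-check the result
        if verdict == "accept":
            break
        context.append(("text", critique))         # feed critique into refinement
    return image, context
```

One design point this loop highlights: because understanding and generation live in one model, the verifier's critique can be appended directly to the shared context rather than routed between separate models.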

Background

Test-time scaling has yielded significant gains for LLMs by allocating more inference compute to extended chain-of-thought reasoning, verification, and refinement. However, applying this paradigm to unified multimodal models, which jointly handle visual understanding and generation in a single architecture, poses additional challenges: these models must interleave text and image tokens, maintain memory of visual content across rounds, and perform iterative image editing or generation.
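
As one illustration of the interleaving requirement, a unified model might flatten alternating text and image tokens into a single sequence with modality markers. The `<boi>`/`<eoi>` markers and discrete image codes below are illustrative assumptions, not the paper's token vocabulary.

```python
# Sketch of one way to represent an interleaved text/image token stream
# across rounds, so earlier images stay in context for later editing.

BOI, EOI = "<boi>", "<eoi>"  # hypothetical begin/end-of-image markers

def interleave(rounds):
    """Flatten alternating (text_tokens, image_tokens) pairs into one
    sequence; keeping earlier images in context lets later rounds
    reference and revise them."""
    seq = []
    for text_tokens, image_tokens in rounds:
        seq.extend(text_tokens)
        seq.append(BOI)
        seq.extend(image_tokens)  # e.g., discrete VQ codes for the image
        seq.append(EOI)
    return seq
```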

The paper proposes UniT, a framework that addresses this challenge by combining agentic data synthesis, unified model training, and test-time budget forcing. The authors note that extending test-time scaling to unified multimodal models had not been established prior to their work, which motivates the open challenge stated explicitly in the abstract.
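
Budget forcing at test time is typically implemented by intervening in the decoding loop. The sketch below follows the general pattern from LLM budget-forcing work (appending a continuation cue when the model tries to stop reasoning before the budget is spent); it is an assumed illustration, not necessarily UniT's exact mechanism.

```python
# Hedged sketch of test-time budget forcing. The next_token interface,
# end-of-thinking token, and "Wait" cue are assumptions for illustration.

def budget_forced_decode(model, context, min_tokens=1024, max_tokens=4096,
                         end_of_thinking="</think>", cue="Wait"):
    out = []
    while len(out) < max_tokens:
        token = model.next_token(context + out)
        if token == end_of_thinking and len(out) < min_tokens:
            out.append(cue)  # suppress early stopping; force more reasoning
            continue
        out.append(token)
        if token == end_of_thinking:
            break
    return out
```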

References

While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves LLM performance, extending this paradigm to unified multimodal models remains an open challenge.

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling (2602.12279 - Chen et al., 12 Feb 2026) in Abstract, page 1