T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT (2505.00703v1)

Published 1 May 2025 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Recent advancements in LLMs have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1

This paper presents T2I-R1 (Jiang et al., 1 May 2025), a novel text-to-image (T2I) generation model enhanced by a bi-level Chain-of-Thought (CoT) reasoning process and optimized using reinforcement learning (RL). The core idea is to apply reasoning strategies, which have been successful in LLMs, to the visual generation domain.

The authors identify two distinct levels of CoT relevant to autoregressive image generation:

  1. Semantic-level CoT: This is textual reasoning performed before image generation begins. It acts as a high-level planning phase, designing the global structure, object attributes, and spatial relationships of the image, and reasoning about the user's true intent behind potentially ambiguous or uncommon prompts. This step helps the model understand complex requirements and plan the image composition.
  2. Token-level CoT: This refers to the intermediate, patch-by-patch generation process of the image itself. Similar to textual CoT, each subsequent image token (patch) is generated conditioned on previous tokens within a discrete visual space. This focuses on low-level details, pixel generation, and maintaining visual coherence between adjacent patches; the notation sketched after this list makes the two conditioning structures explicit.
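
In schematic notation (a reading of this setup rather than the paper's exact formulation), let $p$ be the text prompt, $s = (s_1, \dots, s_M)$ the semantic-level CoT tokens, and $t = (t_1, \dots, t_N)$ the image tokens forming the token-level CoT, all produced autoregressively by the same policy $\pi_\theta$:

$$
s_m \sim \pi_\theta(\cdot \mid p,\ s_{<m}), \qquad t_n \sim \pi_\theta(\cdot \mid p,\ s,\ t_{<n}),
$$

after which the image tokens $t$ are decoded into the final image by the visual detokenizer.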

To enhance and coordinate these two levels of CoT within a single Unified LLM (ULM) capable of both understanding and generation, the authors introduce BiCoT-GRPO. This is an RL framework based on Group-Relative Policy Optimization (GRPO) (DeepSeek-AI et al., 22 Jan 2025). The process involves a two-step generation pipeline:

  1. Given a text prompt, the ULM is first prompted to generate the semantic-level CoT (textual plan).
  2. The ULM then uses the original text prompt and the generated semantic-level CoT as conditions to generate image tokens, which constitute the token-level CoT. These image tokens are then decoded into the final image; a minimal code sketch of the full pipeline follows below.

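A minimal sketch of this two-step pipeline, assuming a hypothetical `ulm` wrapper exposing `generate_text` and `generate_image_tokens` methods and a separate `decode_image_tokens` detokenizer (these names are illustrative and do not come from the released code):

```python
def generate_with_bicot(ulm, decode_image_tokens, prompt: str):
    """Two-step generation: semantic-level CoT first, then token-level CoT."""
    # Step 1: semantic-level CoT -- a textual plan of the image
    # (objects, attributes, spatial layout, interpretation of the prompt).
    plan_instruction = (
        f"Prompt: {prompt}\n"
        "Describe the image to generate step by step: objects, attributes, "
        "spatial relationships, and the user's likely intent."
    )
    semantic_cot = ulm.generate_text(plan_instruction)

    # Step 2: token-level CoT -- autoregressive image tokens conditioned on
    # both the original prompt and the semantic-level plan.
    image_tokens = ulm.generate_image_tokens(condition=f"{prompt}\n{semantic_cot}")

    # Decode the discrete image tokens into pixels with the visual detokenizer.
    image = decode_image_tokens(image_tokens)
    return semantic_cot, image_tokens, image
```
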
The objective function for BiCoT-GRPO adapts the standard GRPO loss to account for the two-part output (semantic CoT and token CoT). The probability ratio $r_{i,j}(\theta)$ is calculated differently depending on whether the current token belongs to the semantic CoT (conditioned on the prompt and previous semantic tokens) or the token CoT (conditioned on the prompt, the full semantic CoT, and previous image tokens). A token-level policy gradient loss is incorporated and normalized over all generated tokens to balance the optimization across both stages.
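
Following this description and the standard GRPO formulation, one plausible form of the objective is sketched below; the paper's exact clipping, normalization, and any KL regularization term may differ. For a group of $G$ sampled outputs $o_i = (s_i, t_i)$ per prompt, with rewards $R_i$ and group-relative advantages $\hat{A}_i = \big(R_i - \mathrm{mean}(\{R_k\}_{k=1}^G)\big) / \mathrm{std}(\{R_k\}_{k=1}^G)$:

$$
r_{i,j}(\theta) =
\begin{cases}
\dfrac{\pi_\theta(s_{i,j} \mid p,\ s_{i,<j})}{\pi_{\theta_\mathrm{old}}(s_{i,j} \mid p,\ s_{i,<j})}, & \text{if position } j \text{ lies in the semantic-level CoT}, \\[2ex]
\dfrac{\pi_\theta(t_{i,j} \mid p,\ s_i,\ t_{i,<j})}{\pi_{\theta_\mathrm{old}}(t_{i,j} \mid p,\ s_i,\ t_{i,<j})}, & \text{if position } j \text{ lies in the token-level CoT},
\end{cases}
$$

$$
\mathcal{J}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{j=1}^{|o_i|} \min\!\Big( r_{i,j}(\theta)\, \hat{A}_i,\ \mathrm{clip}\big(r_{i,j}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right],
$$

where $|o_i|$ counts both the semantic-level and token-level CoT tokens of sample $i$, so the loss is normalized over all generated tokens of both stages.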

A key challenge in applying RL to image generation is defining a suitable reward function, as image quality and prompt alignment are complex to evaluate with simple rules. T2I-R1 addresses this by proposing an ensemble of generation rewards using diverse vision experts. This ensemble includes:

  • Human Preference Models (HPMs) (e.g., HPS (Wu et al., 2023), ImageReward (Xu et al., 2023)): Evaluate aesthetic quality and overall prompt alignment based on learned human preferences.
  • Object Detectors (e.g., GroundingDINO (Liu et al., 2023), YOLO-World (Cheng et al., 2024)): Verify the existence, number, and spatial relationships of objects mentioned in the prompt.
  • Visual Question Answering (VQA) Models (e.g., BLIP (Li et al., 2022), GIT (Wang et al., 2022), LLaVA (Liu et al., 2023)): Assess the presence and attributes of objects by querying the VQA model about elements in the generated image.
  • Output Reward Model (ORM): A fine-tuned LMM trained to directly evaluate the overall image-prompt alignment.

The final reward for a generated image is an average of the scores from the selected experts. Using an ensemble provides a more reliable quality assessment across multiple dimensions and acts as a regularization method to prevent the model from overfitting to a single reward signal. The authors conducted experiments showing that a combination of HPM, Object Detector, and VQA (H+O+V) performs well and yields better visual quality according to human evaluations compared to individual rewards.
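
A rough sketch of how such an ensemble reward could be computed for one generated image, with hypothetical `score_*` callables standing in for the HPS/ImageReward, detector-based, VQA-based, and ORM scorers (these names are placeholders, not the released code's API):

```python
from statistics import mean

def ensemble_reward(image, prompt, experts):
    """Average the scores of the selected vision experts for one (image, prompt) pair.

    `experts` maps an expert name to a callable returning a scalar score,
    e.g. {"hpm": score_hpm, "detector": score_detector, "vqa": score_vqa}.
    """
    scores = [score_fn(image, prompt) for score_fn in experts.values()]
    return mean(scores)

# Example (H+O+V combination):
# reward = ensemble_reward(img, prompt,
#     {"hpm": score_hpm, "detector": score_detector, "vqa": score_vqa})
```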

The resulting model, T2I-R1, built upon the Janus-Pro-7B ULM, was trained using text prompts from T2I-CompBench (Huang et al., 2023) and other sources. Experiments on the T2I-CompBench and WISE (Niu et al., 10 Mar 2025) benchmarks demonstrate significant improvements over the baseline Janus-Pro model, with 13% and 19% increases, respectively. T2I-R1 also outperforms leading diffusion models like FLUX.1 (Black Forest Labs, 2024) on these benchmarks, particularly in areas requiring compositional understanding and world knowledge reasoning.

Ablation studies confirm the importance of both semantic-level and token-level CoT. Optimizing only token-level CoT leads to reduced image diversity and less effective handling of prompts requiring reasoning. Optimizing only semantic-level CoT shows some improvement but is less effective than joint optimization, and can result in lower visual quality. The joint optimization provided by BiCoT-GRPO is crucial for both high-level planning and low-level fidelity.

In conclusion, T2I-R1 demonstrates a practical approach to injecting reasoning capabilities into T2I models by explicitly modeling and optimizing both textual planning (semantic-level CoT) and visual generation (token-level CoT) within a unified framework using RL and an ensemble of vision-based rewards. This leads to improved performance on complex compositional prompts and those requiring external knowledge or interpretation, marking a step towards more intelligent and human-aligned image generation systems. The code is available at https://github.com/CaraJ7/T2I-R1.

Authors (9)
  1. Dongzhi Jiang (13 papers)
  2. Ziyu Guo (49 papers)
  3. Renrui Zhang (100 papers)
  4. Zhuofan Zong (14 papers)
  5. Hao Li (803 papers)
  6. Le Zhuo (25 papers)
  7. Shilin Yan (20 papers)
  8. Pheng-Ann Heng (196 papers)
  9. Hongsheng Li (340 papers)