This paper presents T2I-R1 (Jiang et al., 1 May 2025 ), a novel text-to-image (T2I) generation model enhanced by a bi-level Chain-of-Thought (CoT) reasoning process and optimized using reinforcement learning (RL). The core idea is to apply reasoning strategies, which have been successful in LLMs, to the visual generation domain.
The authors identify two distinct levels of CoT relevant to autoregressive image generation:
- Semantic-level CoT: This is textual reasoning performed before image generation begins. It acts as a high-level planning phase: designing the global structure, object attributes, and spatial relationships, and reasoning about the user's true intent behind potentially ambiguous or uncommon prompts. This step helps the model understand complex requirements and plan the image composition.
- Token-level CoT: This refers to the intermediate, patch-by-patch generation process of the image itself. Similar to textual CoT, each subsequent image token (patch) is generated conditioned on previous tokens within a discrete visual space. This focuses on low-level details, pixel generation, and maintaining visual coherence between adjacent patches.
To enhance and coordinate these two levels of CoT within a single Unified LLM (ULM) capable of both understanding and generation, the authors introduce BiCoT-GRPO, an RL framework based on Group Relative Policy Optimization (GRPO) (DeepSeek-AI et al., 22 Jan 2025). The process involves a two-step generation pipeline (sketched below):
- Given a text prompt, the ULM is first prompted to generate the semantic-level CoT (textual plan).
- The ULM then uses the original text prompt and the generated semantic-level CoT as conditions to generate image tokens, which represent the token-level CoT. These image tokens are then decoded into the final image.
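A minimal sketch of this two-step inference pipeline, assuming a Hugging Face-style interface to the ULM; the prompt templates and helper names (`bicot_generate`, `vision_decoder`) are illustrative, not the authors' exact API:

```python
import torch

@torch.no_grad()
def bicot_generate(ulm, tokenizer, vision_decoder, prompt,
                   max_plan_tokens=256, num_image_tokens=576):
    """Two-step inference: semantic-level CoT (text plan), then token-level CoT (image tokens)."""
    # Step 1: prompt the ULM for the semantic-level CoT -- a textual plan of the image.
    plan_inputs = tokenizer(
        f"Prompt: {prompt}\nPlan the image (objects, attributes, layout) before generating it:",
        return_tensors="pt")
    plan_ids = ulm.generate(**plan_inputs, max_new_tokens=max_plan_tokens, do_sample=True)
    semantic_cot = tokenizer.decode(plan_ids[0, plan_inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)

    # Step 2: condition on the original prompt AND the plan, emit image tokens
    # autoregressively (the token-level CoT), and decode them into pixels.
    gen_inputs = tokenizer(f"Prompt: {prompt}\nPlan: {semantic_cot}\nImage:",
                           return_tensors="pt")
    out_ids = ulm.generate(**gen_inputs, max_new_tokens=num_image_tokens, do_sample=True)
    image_tokens = out_ids[:, gen_inputs["input_ids"].shape[1]:]
    image = vision_decoder(image_tokens)  # e.g., a VQ decoder mapping discrete tokens to pixels
    return semantic_cot, image
```

During RL training, the same two steps are rolled out several times per prompt to form a group, and the decoded images are scored by the reward ensemble described below.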
The objective function for BiCoT-GRPO adapts the standard GRPO loss to account for the two-part output (semantic CoT and token CoT). The probability ratio is calculated differently depending on whether the current token belongs to the semantic CoT (conditioned on the prompt and previous semantic tokens) or the token CoT (conditioned on the prompt, the full semantic CoT, and previous image tokens). A token-level policy gradient loss is incorporated and normalized over all generated tokens to balance the optimization across both stages.
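Written out under notation introduced here (a reconstruction from the description above; the paper's exact formulation may differ in minor details): let $p$ be the prompt, $s_i$ the semantic-level CoT and $v_i$ the image tokens of the $i$-th of $G$ rollouts, and $R_i$ the ensemble reward of the decoded image. The per-token probability ratio is

$$
r_{i,t} =
\begin{cases}
\dfrac{\pi_\theta(s_{i,t}\mid p,\, s_{i,<t})}{\pi_{\theta_\text{old}}(s_{i,t}\mid p,\, s_{i,<t})}, & \text{if token } t \text{ belongs to the semantic-level CoT},\\[2ex]
\dfrac{\pi_\theta(v_{i,t}\mid p,\, s_i,\, v_{i,<t})}{\pi_{\theta_\text{old}}(v_{i,t}\mid p,\, s_i,\, v_{i,<t})}, & \text{if token } t \text{ belongs to the token-level CoT},
\end{cases}
$$

and the token-level loss, normalized over all generated tokens of both stages, is

$$
\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|s_i|+|v_i|}\sum_{t}\min\!\Big(r_{i,t}\hat{A}_i,\ \operatorname{clip}\big(r_{i,t},\,1-\epsilon,\,1+\epsilon\big)\hat{A}_i\Big),\qquad
\hat{A}_i=\frac{R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},
$$

with any KL-regularization term of standard GRPO omitted here for brevity.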
A key challenge in applying RL to image generation is defining a suitable reward function, as image quality and prompt alignment are complex to evaluate with simple rules. T2I-R1 addresses this by proposing an ensemble of generation rewards using diverse vision experts. This ensemble includes:
- Human Preference Models (HPMs) (e.g., HPS (Wu et al., 2023), ImageReward (Xu et al., 2023)): Evaluate aesthetic quality and overall prompt alignment based on learned human preferences.
- Object Detectors (e.g., GroundingDINO (Liu et al., 2023), YOLO-World (Cheng et al., 2024)): Verify the existence, number, and spatial relationships of objects mentioned in the prompt.
- Visual Question Answering (VQA) Models (e.g., BLIP (Li et al., 2022), GIT (Wang et al., 2022), LLaVA (Liu et al., 2023)): Assess the presence and attributes of objects by querying the VQA model about elements in the generated image.
- Output Reward Model (ORM): A fine-tuned LMM trained to directly evaluate the overall image-prompt alignment.
The final reward for a generated image is the average of the scores from the selected experts. Using an ensemble provides a more reliable quality assessment across multiple dimensions and acts as a regularizer, preventing the model from overfitting to any single reward signal. The authors' experiments show that a combination of HPM, Object Detector, and VQA (H+O+V) performs well and yields better visual quality in human evaluations than any individual reward.
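A minimal sketch of this ensemble reward, assuming each expert has already been wrapped as a callable that returns a normalized score in [0, 1]; the wrapper names (`hps_score`, `detector_score`, `vqa_score`) are hypothetical:

```python
from typing import Callable, Dict, List

def ensemble_reward(image, prompt: str,
                    experts: Dict[str, Callable[[object, str], float]]) -> float:
    """Average the normalized scores of the selected vision experts."""
    scores: List[float] = [score(image, prompt) for score in experts.values()]
    return sum(scores) / len(scores)

# Example: the H+O+V configuration (human preference model + object detector + VQA)
# reported to work well; the three wrappers are assumed, not part of any released API.
# reward = ensemble_reward(image, prompt, {
#     "HPM": hps_score,
#     "Detector": detector_score,
#     "VQA": vqa_score,
# })
```

Averaging rather than learning a weighted combination keeps any one expert from dominating, which is what gives the ensemble its regularizing effect against overfitting to a single reward signal.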
The resulting model, T2I-R1, built upon the Janus-Pro-7B ULM, was trained using text prompts from T2I-CompBench (Huang et al., 2023) and other sources. Experiments on the T2I-CompBench and WISE (Niu et al., 10 Mar 2025) benchmarks demonstrate significant improvements over the baseline Janus-Pro model, with 13% and 19% gains, respectively. T2I-R1 also outperforms leading diffusion models such as FLUX.1 (Black Forest Labs, 2024) on these benchmarks, particularly in areas requiring compositional understanding and world-knowledge reasoning.
Ablation studies confirm the importance of both semantic-level and token-level CoT. Optimizing only token-level CoT leads to reduced image diversity and less effective handling of prompts requiring reasoning. Optimizing only semantic-level CoT shows some improvement but is less effective than joint optimization, and can result in lower visual quality. The joint optimization provided by BiCoT-GRPO is crucial for both high-level planning and low-level fidelity.
In conclusion, T2I-R1 demonstrates a practical approach to injecting reasoning capabilities into T2I models by explicitly modeling and optimizing both textual planning (semantic-level CoT) and visual generation (token-level CoT) within a unified framework using RL and an ensemble of vision-based rewards. This leads to improved performance on complex compositional prompts and those requiring external knowledge or interpretation, marking a step towards more intelligent and human-aligned image generation systems. The code is available at https://github.com/CaraJ7/T2I-R1.