VARGPT-v1.1 is presented as an enhanced version of the VARGPT framework, designed as a unified visual autoregressive model capable of both visual understanding and image generation (Zhuang et al., 3 Apr 2025). It maintains the core VARGPT paradigm: predicting the next token for visual understanding tasks (like VQA) and predicting the next scale for visual generation tasks (text-to-image synthesis).
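In rough pseudocode, the two regimes differ in what each autoregressive step emits. The sketch below uses stand-in method names (next_text_token, next_scale_tokens, detokenize) that are not the actual API; it only illustrates the contrast between next-token and next-scale prediction.

```python
def understand(model, image_feats, prompt_tokens, max_new_tokens=128, eos_id=2):
    """Visual understanding: next-token prediction with standard causal attention."""
    seq = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = model.next_text_token(image_feats, seq)   # stand-in: one text token per step
        if tok == eos_id:
            break
        seq.append(tok)
    return seq


def generate_image(model, prompt_tokens, scale_sizes=(1, 4, 16, 64)):
    """Visual generation: next-scale prediction, emitting a whole token map per step."""
    scales = []
    for n_tokens in scale_sizes:                        # coarse-to-fine schedule (illustrative)
        scales.append(model.next_scale_tokens(prompt_tokens, scales, n_tokens))  # stand-in
    return model.detokenize(scales)                     # multi-scale tokenizer -> pixels
```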
Key Innovations and Enhancements
VARGPT-v1.1 introduces several improvements over its predecessor:
- Iterative Training Strategy: A novel multi-stage training approach combining iterative visual instruction tuning (SFT) with reinforcement learning via Direct Preference Optimization (DPO). This involves progressively increasing image resolution (from 256x256 to 512x512) and alternating between SFT and DPO phases within the generation training stage (see the schedule sketch after this list).
- Expanded Training Data: The visual generation corpus is significantly enlarged to 8.3 million instruction pairs (a 6x increase), comprising 4.2 million real-world samples (filtered LAION-COCO) and 4.1 million synthetic samples (Midjourney v6, Flux).
- Upgraded Language Backbone: The model adopts Qwen2-7B as its LLM backbone, benefiting from improved tokenization and attention mechanisms.
- Enhanced Generation Resolution: The model is explicitly trained for higher image generation resolution (up to 512x512).
- Emergent Image Editing: The model gains image editing capabilities through instruction fine-tuning on a dedicated dataset, without requiring architectural modifications.
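For concreteness, the iterative strategy in the first bullet can be written as an ordered phase schedule. The resolutions, step counts, and data sources below are taken from the Stage 3 description later in this summary; the dictionary representation itself is purely illustrative.

```python
# Stage-3 phase schedule (values from the paper; representation is illustrative).
STAGE3_SCHEDULE = [
    {"phase": "SFT", "res": 256, "steps": 40_000, "trains": "visual decoder + generation projectors"},
    {"phase": "DPO", "res": 256, "pairs": "SFT outputs (losers) vs. Midjourney v6 / Flux (winners)"},
    {"phase": "SFT", "res": 512, "steps": 30_000},
    {"phase": "DPO", "res": 512},
    {"phase": "SFT (editing)", "res": 512, "data": "StyleBooth (11.3k)", "trains": "all parameters"},
]
```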
Model Architecture
The architecture largely follows VARGPT, aiming to unify understanding and generation within a single autoregressive framework (Figure 3):
- Visual Understanding: Uses a ViT visual encoder and a linear projector to process input images. These visual features are combined with text embeddings and fed into the Qwen2-7B LLM, which predicts the next text token autoregressively using standard causal attention.
- Visual Generation: Employs a multi-scale image tokenizer (similar to a VQVAE, using bitwise multi-scale residual quantization) and a separate 2B-parameter visual decoder (32 Transformer layers). This decoder uses block causal attention (as in Infinity/VAR) to support the "next-scale prediction" paradigm; a mask sketch follows this list. Dual generation projectors map features between the LLM and the visual decoder. An infinite vocabulary classifier is used to compute the visual generation loss.
- Mixed-Modal Handling: Special tokens differentiate text and image generation segments. Classifier-Free Guidance (CFG) is used during inference (scale 1.5) to improve generation quality.
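Block-causal attention lets every token of a given scale attend to all tokens of that scale and of earlier (coarser) scales, but not to later (finer) scales. A minimal sketch of building such a mask follows; the scale sizes are illustrative, not the model's actual schedule.

```python
import torch

def block_causal_mask(scale_sizes):
    """(T, T) boolean mask for next-scale prediction: True = may attend.

    scale_sizes: tokens per scale, e.g. [1, 4, 16, 64] for 1x1, 2x2, 4x4, 8x8 maps.
    """
    scale_id = torch.cat([torch.full((n,), i) for i, n in enumerate(scale_sizes)])
    # a query at scale s may attend to any key at scale <= s
    return scale_id.unsqueeze(1) >= scale_id.unsqueeze(0)

mask = block_causal_mask([1, 4, 16, 64])   # shape (85, 85)
```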
Training Methodology
The training follows three stages (Figure 4):
- Stage 1: Pretraining: Initial training phase (details are limited in the paper; likely aligning the LLM and vision encoder), using the 8.3M generation dataset.
- Stage 2: Visual Instruction Tuning (Understanding): Fine-tunes the model for visual understanding tasks using instruction datasets (1.18M samples from LLaVA-1.5, LLaVA-OneVision) mixed with a small portion (50K) of the generation data.
- Stage 3: Iterative SFT + DPO (Generation & Editing): This is the core innovation for improving generation (Figure 5).
- SFT (256x256): Fine-tune the visual decoder and generation projectors on the 8.3M generation dataset at 256x256 resolution (40k steps). Other parts are frozen.
- DPO (256x256): Apply DPO using preference pairs. "Loser" images ($y_l$) are generated by the SFT model, while "winner" images ($y_w$) are generated by stronger models (Midjourney v6, Flux) from the same prompts (180k prompts sampled). The objective optimizes the policy model ($\pi_\theta$) to prefer $y_w$ over $y_l$ relative to a reference policy ($\pi_{\mathrm{ref}}$, the post-SFT model), with log-probabilities computed only over the image tokens (a loss sketch follows this list).
- SFT (512x512): Increase resolution to 512x512 and continue SFT (30k steps).
- DPO (512x512): Apply DPO again at the higher resolution.
- SFT (Visual Editing): Fine-tune the entire model (all parameters unfrozen) on an image editing instruction dataset (11.3k samples from StyleBooth, 512x512 resolution) to add editing capabilities. Input images pass through the vision encoder, instructions serve as text prompts, and the model generates the edited image tokens.
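A minimal sketch of the DPO objective restricted to image-token log-probabilities, following the standard DPO formulation. This is an illustrative re-implementation, not the authors' code; the $\beta$ value and the masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def image_token_logp(logits, labels, image_mask):
    """Sum of target-token log-probs, counted only where the target is an image token.

    logits: (B, T, V), labels: (B, T) token ids, image_mask: (B, T) bool.
    """
    logps = torch.log_softmax(logits, dim=-1)
    tok_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (B, T)
    return (tok_logps * image_mask.float()).sum(dim=-1)              # (B,)

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO: -log sigmoid(beta * [(pi_w - pi_l) - (ref_w - ref_l)])."""
    margin = (pi_w - pi_l) - (ref_w - ref_l)
    return -F.logsigmoid(beta * margin).mean()
```

Here pi_w / pi_l are the policy's image-token log-probabilities for the winner and loser images, and ref_w / ref_l come from the frozen post-SFT reference model.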
Data Summary
- Understanding: LLaVA-1.5 (665K), LLaVA-OneVision (total 1.18M samples).
- Generation: 8.3M pairs (4.2M LAION-COCO aesthetic filtered, 4.1M Midjourney/Flux synthetic).
- Preference (DPO): 180k prompts; pairs generated by the SFT model (losers) vs. Midjourney/Flux (winners); a construction sketch follows this list.
- Editing: 11.3k StyleBooth samples (image + instruction -> edited image).
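A sketch of how a single preference pair is assembled: the same prompt is rendered by a stronger external generator (winner) and by the post-SFT checkpoint (loser). The function and model names below are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    winner_tokens: list   # image tokens from a Midjourney v6 / Flux render of the prompt
    loser_tokens: list    # image tokens from the post-SFT VARGPT-v1.1 checkpoint

def build_pair(prompt, teacher_render, sft_model, tokenizer):
    """Render the same prompt with both sources and tokenize (stand-in callables)."""
    winner = tokenizer.encode(teacher_render(prompt))
    loser = tokenizer.encode(sft_model.generate_image(prompt))
    return PreferencePair(prompt, winner, loser)
```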
Implementation Details
- Initialization: LLM, ViT encoder, understanding projector from Qwen2VL-7B-Instruct. Visual decoder from Infinity-2B. Generation projectors randomly initialized.
- Tokenizer: Multi-scale VQVAE from Infinity/VAR.
- Hyperparameters: Detailed in Table 1, including learning rates (e.g., 5e-5 for SFT, 1e-6 for DPO), batch sizes, and training steps per phase. AdamW optimizer and DeepSpeed ZeRO stage 2/3 used.
- Inference: Top-k=900, Top-p=0.95, CFG scale=1.5 for generation.
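A sketch of one decoding step with the reported settings (top-k=900, top-p=0.95, CFG scale 1.5). The guidance formula follows the common classifier-free-guidance convention and is an assumption about how it is wired here, not the released implementation.

```python
import torch

def cfg_filtered_sample(cond_logits, uncond_logits, cfg_scale=1.5, top_k=900, top_p=0.95):
    """One sampling step: classifier-free guidance, then top-k / top-p filtering."""
    # CFG: push logits away from the unconditional prediction.
    logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)

    # Top-k: keep only the k highest-scoring tokens.
    k_vals, _ = torch.topk(logits, top_k)
    logits[logits < k_vals[..., -1, None]] = float("-inf")

    # Top-p (nucleus): drop the tail once cumulative probability exceeds p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    remove = cum - probs > top_p          # keep tokens until mass reaches p
    sorted_logits[remove] = float("-inf")
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
```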
Performance
- Visual Understanding: Achieves state-of-the-art results on several benchmarks compared to models of similar size, including MMBench (81.01), SEED-Bench (76.08), MMMU (48.56), MME (1684.1), and VQA benchmarks like TextVQA (82.0) and SciQA-img (91.6) (Tables 2, 3). It significantly outperforms VARGPT v1.0 and other unified/understanding models.
- Visual Generation: Outperforms many specialized generative models (diffusion and autoregressive) and other unified models on GenEval (0.53 overall) and DPG-Bench (78.59 overall) (Tables 4, 5). Qualitative results (Figure 6) show high fidelity and instruction following. Achieves this with a smaller generation dataset (8.3M) compared to some other unified models.
- Visual Editing: Demonstrates basic image editing capabilities (style transfer) acquired solely through instruction tuning, without architectural changes (Figure 7).
Practical Applications and Considerations
- Unified Interface: Enables applications requiring seamless integration of language understanding, image generation, and basic editing within a single conversational context (e.g., advanced chatbots, content creation aids).
- Architecture: The dual-paradigm approach (next-token vs. next-scale) allows leveraging standard LLM techniques while incorporating specialized image generation mechanisms. The separate visual decoder might mitigate task interference.
- Training: The iterative SFT+DPO strategy with progressive resolution is key to its performance. DPO avoids explicit reward model training but requires access to a stronger "teacher" model (or high-quality preference data) for constructing pairs. Full-parameter tuning for editing suggests significant parameter updates are needed for this capability.
- Limitations:
- Generation quality still lags behind top commercial models (attributed to data scale/quality).
- Editing capability is currently basic (style transfer) due to the limited editing dataset. More complex edits or fine-grained control are future work.
- Scalability to much higher resolutions or more complex generation tasks needs further investigation.
VARGPT-v1.1 demonstrates that combining sophisticated training strategies (iterative SFT+DPO, progressive resolution) with a unified autoregressive architecture can significantly advance multimodal AI, achieving strong performance in both understanding and generation, and even enabling emergent capabilities like editing.