
Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment (2410.09347v1)

Published 12 Oct 2024 in cs.CV, cs.LG, and eess.IV

Abstract: Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by LLM alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.

Summary

  • The paper presents a novel Condition Contrastive Alignment technique that aligns conditional and unconditional probabilities to eliminate reliance on classifier-free guidance.
  • It uses Noise Contrastive Estimation for one-epoch fine-tuning, significantly cutting computational overhead compared to traditional CFG approaches.
  • Experimental results demonstrate substantial improvements, with FID dropping from 19.07 to 3.41 and IS rising from 64.3 to 288.2, confirming its practical efficiency.

Condition Contrastive Alignment for Guidance-Free AR Visual Generation

The paper "Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment" presents a novel approach to autoregressive (AR) visual generation by introducing Condition Contrastive Alignment (CCA). This technique aims to enhance the performance of AR visual models without relying on classifier-free guidance (CFG), which is typically used to improve sample quality but also introduces inconsistencies between visual and language content generation.

Background and Motivation

Autoregressive models have been highly effective in language domains, and recent efforts extend these successes to visual generation. By quantizing images into discrete tokens, AR models can apply the same next-token prediction strategy used by LLMs. Although this unified design is theoretically appealing, CFG has proven essential for sample quality in visual tasks. CFG, however, requires the model to learn both conditional and unconditional distributions during training (typically via condition dropout) and doubles the number of network evaluations at each sampling step.
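The CFG sampling rule described above can be sketched in a few lines. This is a generic illustration of classifier-free guidance on next-token logits, not code from the paper; the function name and `scale` parameter are illustrative:

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: extrapolate the conditional logits away
    from the unconditional ones. scale = 1.0 recovers plain conditional
    sampling; larger scales sharpen fidelity at the cost of diversity.
    Note this needs TWO forward passes per token (conditional and
    unconditional), which is the sampling cost CCA removes."""
    return uncond_logits + scale * (cond_logits - uncond_logits)
```
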

Proposed Method: Condition Contrastive Alignment (CCA)

The authors propose CCA as an alternative to CFG. Inspired by LLM alignment methods, CCA aligns pre-trained models with the desired sampling distribution without altering the sampling process. This method contrasts positive and negative image-condition pairs, where positive pairs are matched, and negative pairs are randomly shuffled mismatches derived from the pretraining dataset.
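The pair-construction step described above can be sketched as follows. This is a hypothetical helper illustrating the matched/mismatched scheme, not the paper's implementation:

```python
import numpy as np

def make_contrastive_pairs(images, conditions, rng):
    """Build positive and negative image-condition pairs as CCA's
    training data. Positives: each image paired with its own condition.
    Negatives: the same images paired with randomly shuffled conditions
    drawn from the same dataset."""
    perm = rng.permutation(len(conditions))
    shuffled = conditions[perm]
    positives = list(zip(images, conditions))
    negatives = list(zip(images, shuffled))
    return positives, negatives
```
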

The key insight is to learn a "conditional residual" that captures the gap between conditional and unconditional log-probabilities. The authors employ Noise Contrastive Estimation (NCE) to train this target, so that a single model reaches guidance-level AR performance. CCA requires only one epoch of fine-tuning on the pretraining dataset (roughly 1% of the pretraining epochs), significantly reducing computational overhead compared to CFG.
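An NCE-style objective in the spirit of the description above can be sketched per example. The symbol names, the `beta` temperature, and the `lam` negative weight are illustrative assumptions, not the paper's exact formulation; the residual is the gap between the fine-tuned and frozen-pretrained conditional log-probabilities:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cca_style_loss(logp_theta, logp_ref, is_positive, beta=1.0, lam=1.0):
    """Sketch of a contrastive alignment loss: push the conditional
    residual log p_theta(x|c) - log p_ref(x|c) up for matched (positive)
    image-condition pairs and down for mismatched (negative) pairs,
    where p_ref is the frozen pretrained model."""
    residual = beta * (logp_theta - logp_ref)
    if is_positive:
        return -math.log(sigmoid(residual))
    return -lam * math.log(sigmoid(-residual))
```

Because only the residual enters the loss, the fine-tuned model absorbs the conditional/unconditional gap internally, leaving a single forward pass at sampling time.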

Experimental Results

The experiments conducted on LlamaGen and VAR models demonstrate that CCA significantly improves guidance-free sample quality, achieving performance levels comparable to CFG while halving the sampling cost. For instance, applying CCA to LlamaGen-L resulted in improving the FID from 19.07 to 3.41 and the IS from 64.3 to 288.2. The results reveal that CCA achieves a similar trade-off between sample diversity and fidelity as CFG, providing a viable alternative for producing high-quality visual samples.

Theoretical and Practical Implications

Theoretically, CCA and CFG aim to approximate the same target sampling distribution. However, CCA achieves this by refining a single model through fine-tuning rather than employing dual models for conditional and unconditional guidance. The practical application of CCA could allow for more efficient and cohesive multimodal generation systems, bridging the gap between visual and language generative models.
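Under common CFG conventions, the shared target distribution both methods approximate can be written compactly (the guidance scale $\gamma$ and this notation are illustrative, not necessarily the paper's):

```latex
p_{\text{sample}}(x \mid c) \;\propto\; p(x \mid c)\left[\frac{p(x \mid c)}{p(x)}\right]^{\gamma}
```

CFG realizes this at inference time by combining two model evaluations per step, whereas CCA fine-tunes one model so that its own distribution matches the target directly.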

Comparative Analysis

The paper also explores existing LLM alignment techniques, such as Direct Preference Optimization (DPO) and Unlearning, but finds these methods generally ineffective in the context of visual AR generation. CCA's performance surpasses these techniques due to its specific design for modeling conditional residuals in visual data.

Future Directions

CCA's development paves the way for further explorations in multimodal generative models where consistent training paradigms across modalities are critical. The research opens new avenues in reducing reliance on costly sampling techniques like CFG while maintaining—or even enhancing—output quality.

Conclusion

Condition Contrastive Alignment presents a substantial advancement in AR visual generation, offering a promising path to efficient, guidance-free models. By unifying alignment techniques with practical efficiency, it stands to influence both the AI theoretical framework and real-world generative model deployment.
