
LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization (2503.08619v1)

Published 11 Mar 2025 in cs.CV

Abstract: Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal LLMs (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only $0.7B$ parameters. Using a compact synthetic dataset of just $2M$ high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically reduces computational demands and reduces pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate the DPO technique that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at https://github.com/XianfengWu01/LightGen

Summary

LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

The paper presents LightGen, an image generation model that leverages knowledge distillation (KD) and Direct Preference Optimization (DPO) to achieve efficient text-to-image generation. Unlike traditional methods that rely heavily on large datasets and complex architectures, LightGen aims to democratize access to high-quality image generation by reducing computational and data requirements.

The authors integrate insights from multi-modal LLMs (MLLMs) to distill knowledge from state-of-the-art (SOTA) models into a lightweight Masked Autoregressive (MAR) architecture with just 0.7 billion parameters. Remarkably, this approach only utilizes a synthetic dataset of 2M images, which is considerably smaller than typical datasets for such tasks. The central premise of the paper is that data diversity can compensate for smaller data volumes, significantly reducing pre-training durations from thousands of GPU-days to only 88 GPU-days.
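The data-distillation recipe can be illustrated with a deliberately tiny sketch: a fixed "teacher" map stands in for a SOTA text-to-image model, a small synthetic dataset is generated from diverse inputs, and a compact low-rank "student" is fit to the teacher's outputs. Every name and dimension below is illustrative, not LightGen's actual architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: a fixed "teacher" map plays the role of a SOTA
# text-to-image model; a low-rank "student" plays the compact MAR model.
D_IN, RANK = 16, 8
W_teacher = rng.normal(size=(D_IN, D_IN))

def teacher(captions):
    """'Render' caption embeddings into image features."""
    return captions @ W_teacher

# Stage 1: synthesize a small but *diverse* dataset from the teacher,
# echoing the paper's claim that diversity matters more than volume.
captions = rng.normal(size=(512, D_IN))
images = teacher(captions)

# Stage 2: distill into a compact student (a rank-8 factorization,
# i.e. half the parameters) by regressing onto the teacher's outputs.
A = rng.normal(size=(D_IN, RANK)) * 0.5
B = rng.normal(size=(RANK, D_IN)) * 0.5

def mse(A, B):
    return float(np.mean((captions @ A @ B - images) ** 2))

initial_loss = mse(A, B)
lr = 0.02
for _ in range(2000):
    err = captions @ A @ B - images
    # Gradients of the squared-error objective w.r.t. the student factors.
    grad_A = captions.T @ (err @ B.T) / len(captions)
    grad_B = (captions @ A).T @ err / len(captions)
    A -= lr * grad_A
    B -= lr * grad_B

final_loss = mse(A, B)
print(initial_loss, final_loss)  # the student converges toward the teacher
```

The point of the sketch is structural: the student never sees "real" data, only samples the teacher produces, which is what makes the 2M-image synthetic corpus sufficient in the paper's setting.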

Two critical components underpin LightGen's efficacy:

  1. Knowledge Distillation: LightGen employs a strategy akin to data distillation used in MLLMs. By distilling knowledge from sophisticated text-to-image models into a more compact framework, LightGen preserves performance while curtailing the demand for extensive computing resources.
  2. Direct Preference Optimization: A distinctive feature of LightGen is the integration of DPO to improve image quality. Synthetic datasets often suffer from poor high-frequency details and spatial inaccuracies. DPO addresses these challenges, refining the image's fidelity and positional accuracy.
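The DPO objective used for the refinement stage follows the standard formulation: given a preferred and a dispreferred sample, the loss rewards the policy for increasing its likelihood margin over a frozen reference model. The sketch below is the generic DPO loss, not LightGen's implementation; the argument names are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred ("winner")
    and dispreferred ("loser") images; ref_* are the same quantities
    under a frozen reference model. Illustrative names, not LightGen's API.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written stably as log1p(exp(-margin)).
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2); preferring the winner more than the reference
# does drives the loss below that baseline.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

In LightGen's setting, the "winner" is the higher-fidelity image (e.g. with better high-frequency detail or object placement), so minimizing this loss pushes the model away from the artifacts the synthetic pre-training data introduces.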

Quantitative evaluations on the GenEval benchmark demonstrate that LightGen performs competitively with SOTA models. Specifically, it excels at generating images with precise object positioning and vibrant colors across resolutions from 256x256 to 1024x1024 pixels. This positions LightGen as a viable option for high-quality, efficient image generation, especially in resource-constrained settings.

Implications and Future Directions

The implications of this research are manifold. Practically, it reduces barriers to entry for researchers and organizations with limited computational resources, broadening access to advanced image generation capabilities. Theoretically, this paper challenges the conventional wisdom that large-scale datasets are indispensable for superior performance in generative models. The emphasis on data diversity as opposed to sheer volume signals a potential paradigm shift in the development and deployment of AI models.

Future developments in AI could include further exploration of distillation techniques and other optimization algorithms to enhance the scalability and efficiency of machine learning models. Additionally, the integration of more robust synthetic datasets and advanced post-processing methods like DPO could refine the capabilities of lightweight models. Broadly, this approach might extend to other domains beyond image generation, such as video synthesis or multi-modal AI tasks, where resource efficiency remains a critical concern.
