
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding (2410.01699v2)

Published 2 Oct 2024 in cs.CV

Abstract: Current large auto-regressive models can generate high-quality, high-resolution images, but these models require hundreds or even thousands of steps of next-token prediction during inference, resulting in substantial time consumption. In existing studies, Jacobi decoding, an iterative parallel decoding algorithm, has been used to accelerate auto-regressive generation and can be executed without training. However, Jacobi decoding relies on a deterministic criterion to determine the convergence of iterations. Thus, it works for greedy decoding but is incompatible with sampling-based decoding, which is crucial for visual quality and diversity in current auto-regressive text-to-image generation. In this paper, we propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation. By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding and allowing the model to generate diverse images. Specifically, SJD enables the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion, allowing the model to generate images with fewer steps than the conventional next-token-prediction paradigm. We also investigate token initialization strategies that leverage the spatial locality of visual data to further improve the acceleration ratio in specific scenarios. We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing visual quality. The code of our work is available here: https://github.com/tyshiwo1/Accelerating-T2I-AR-with-SJD/.

Summary

  • The paper presents a novel SJD approach that significantly reduces inference steps by probabilistically accepting multiple tokens per iteration.
  • It enhances auto-regressive models through draft sequence prediction and progressive verification, ensuring high-quality image outputs.
  • Experimental results demonstrate over 2× speed-up in models like Lumina-mGPT while preserving FID and CLIP scores for practical image synthesis.

Accelerating Auto-regressive Text-to-Image Generation with Speculative Jacobi Decoding

The paper presents a novel strategy, termed Speculative Jacobi Decoding (SJD), aimed at improving the efficiency of auto-regressive text-to-image generation models. By augmenting conventional Jacobi decoding with a probabilistic convergence criterion, SJD substantially reduces the number of inference steps required to generate images without diminishing their quality.

Methodological Insights

Auto-regressive models are foundational in generating high-quality visual data, adhering to a token-by-token prediction approach. However, the sequential nature demands significant computational resources and time, particularly since each auto-regressive step predicts only one token at a time. This process becomes resource-intensive, often requiring thousands of steps to generate a complete image.
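To make the cost of this paradigm concrete, the baseline can be sketched as a loop that performs one forward pass per generated token. This is a minimal illustration, not the paper's implementation; `model` and `sample_token` are hypothetical stand-ins for the real network and its sampler.

```python
import random

def sample_token(probs):
    """Sample one token id from a categorical distribution."""
    r, acc = random.random(), 0.0
    for tok, p in enumerate(probs):
        acc += p
        if r < acc:
            return tok
    return len(probs) - 1

def autoregressive_decode(model, prompt, num_tokens):
    """Generate num_tokens image tokens one at a time.

    model(seq) is assumed to return the next-token distribution
    given the sequence so far; each loop iteration costs one full
    forward pass, so generating N tokens takes N sequential steps.
    """
    seq = list(prompt)
    for _ in range(num_tokens):
        probs = model(seq)
        seq.append(sample_token(probs))
    return seq

# Toy stand-in model: uniform over a 4-token vocabulary.
uniform = lambda seq: [0.25, 0.25, 0.25, 0.25]
out = autoregressive_decode(uniform, [0], 8)
```

For an image encoded as thousands of such tokens, the sequential loop is exactly the bottleneck that parallel decoding schemes like SJD target.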

The paper identifies a limitation of standard Jacobi decoding when applied to recent text-to-image models: its deterministic convergence criterion suits greedy decoding but is incompatible with the sampling-based decoding these models rely on for diversity and quality. To address this, the authors propose SJD, which preserves diverse sampled outputs while significantly accelerating generation.

Technical Approach

SJD introduces a probabilistic criterion that enables the acceptance of multiple tokens per iteration, greatly enhancing speed while maintaining model output quality. The procedure involves:

  1. Draft Sequence Prediction: A window of draft tokens is initialized (e.g., from pre-filled tokens) and refined in parallel at each iteration.
  2. Progressive Verification: Each draft token is accepted with a probability determined by comparing its probability under the current iteration to the probability under the previous iteration from which it was sampled, so accepted tokens remain faithful to standard sampling.
  3. Spatial Token Initialization: Leveraging the spatial locality of visual data, draft tokens can be initialized from neighboring tokens, further improving the acceleration ratio without any training.
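The verification step above follows the acceptance rule familiar from speculative sampling: keep draft token x with probability min(1, p(x)/q(x)), where q is the distribution the draft was sampled from and p is the current-iteration distribution; on the first rejection, resample from the normalized residual and stop. The sketch below illustrates one such verification pass under these assumptions; it is not the authors' code, and `sjd_accept`/`sample_from` are hypothetical helpers.

```python
import random

def sample_from(probs):
    """Sample one token id from a categorical distribution."""
    r, acc = random.random(), 0.0
    for tok, pr in enumerate(probs):
        acc += pr
        if r < acc:
            return tok
    return len(probs) - 1

def sjd_accept(draft_tokens, q_probs, p_probs):
    """One verification pass of the probabilistic criterion (sketch).

    draft_tokens[i] was sampled from q_probs[i] (previous iteration);
    p_probs[i] is the model's distribution at position i given the
    currently accepted prefix.  Token i is kept with probability
    min(1, p_i(x) / q_i(x)); the first rejection is resampled from
    the normalized residual max(p - q, 0) and ends the pass.
    """
    accepted = []
    for x, q, p in zip(draft_tokens, q_probs, p_probs):
        if random.random() < min(1.0, p[x] / max(q[x], 1e-12)):
            accepted.append(x)  # token converged: keep it
        else:
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            z = sum(residual) or 1.0
            residual = [ri / z for ri in residual]
            accepted.append(sample_from(residual))
            break
    return accepted

# When p == q, every draft token is accepted (acceptance prob = 1).
draft = [1, 2, 0]
p = q = [[0.25] * 4] * 3
kept = sjd_accept(draft, q, p)
```

Because multiple draft tokens can converge in a single forward pass, the number of sequential model calls drops below the number of generated tokens, which is the source of the reported acceleration.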

Experimental Findings

The experimental results demonstrate that SJD accelerates models such as Lumina-mGPT and Anole by more than a factor of two, with negligible loss in visual fidelity. Lumina-mGPT, for instance, exhibited a step compression ratio exceeding 2×, while FID and CLIP scores remained comparable to standard decoding, indicating preserved visual quality and semantic alignment with textual prompts.

Practical and Theoretical Implications

The advancement provided by SJD has practical implications for applications requiring efficient and swift image synthesis, such as digital art development and realistic rendering in media. Theoretically, the probabilistic Jacobi decoding framework opens further work on balancing sampling fidelity against parallel decoding speed, potentially benefiting other areas that depend on long auto-regressive sequences.

Future Directions

While SJD provides substantial improvements, future work could explore the fine-tuning of models specifically for multi-token predictions to further increase efficiency. Additionally, there is potential for extending SJD to other domains such as video generation, where sequence optimization could lead to significant advancements in generative capacity and speed.

In summary, the paper delivers an articulate exploration and a substantial contribution to accelerating text-to-image generation, setting a significant benchmark in computational efficiency while retaining quality in generative models.
