- The paper introduces the Halton scheduler, a novel token unmasking method for MaskGIT that replaces the Confidence scheduler to improve image quality and diversity while simplifying hyperparameter tuning.
- The Halton scheduler utilizes a low-discrepancy Halton sequence to ensure tokens are unmasked in a spatially uniform order across the image, maximizing information gain and reducing correlation between selections.
- Experimental results on ImageNet and COCO datasets demonstrate the Halton scheduler's superior performance, showing significant FID reductions (e.g., 2.19 on ImageNet 256x256, 2.7 on COCO) compared to the Confidence scheduler.
The paper "Halton Scheduler For Masked Generative Image Transformer" (2503.17076) introduces a novel token unmasking scheduler for MaskGIT, designed to enhance image generation quality and diversity. The proposed Halton scheduler replaces the original Confidence scheduler, offering improvements in performance and hyperparameter tuning simplicity.
Halton Scheduler Mechanics
The Halton scheduler employs a low-discrepancy, quasi-random Halton sequence to determine the order in which tokens are unmasked during the iterative image generation process of MaskGIT. Unlike the Confidence scheduler, which prioritizes tokens with high confidence scores (potentially leading to clustered selections), the Halton scheduler ensures uniform spatial coverage across the image.
The implementation involves the following steps:
- Halton Sequence Generation: A 2D Halton sequence is generated using bases 2 and 3 via the radical inverse function. For an integer $n$ and a prime base $b$ with radix-$b$ expansion $n = \sum_{i=0}^{k} a_i b^i$, the radical inverse is $\phi_b(n) = \sum_{i=0}^{k} a_i b^{-i-1}$, and the 2D sequence is $(\phi_2(n), \phi_3(n))$. Because the sequence has low discrepancy, it covers the unit square more uniformly than purely random sampling.
- Discretization and Mapping: The continuous Halton coordinates (each in $[0, 1)$) are discretized onto the grid of the token map that represents the image, converting them into discrete token positions. Duplicate positions are discarded so that each token appears exactly once in the resulting order.
- Token Unmasking: The discretized sequence fixes the order in which tokens are unmasked, so that at every step the newly revealed tokens are spread out spatially across the image (a code sketch follows this list).
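A minimal Python sketch of these three steps, assuming a rectangular latent token grid; the function names are illustrative, not taken from the authors' code:

```python
def radical_inverse(n: int, base: int) -> float:
    """Radical inverse phi_b(n): mirror the base-b digits of n about the
    radix point, i.e. phi_b(sum_i a_i * b^i) = sum_i a_i * b^(-i-1)."""
    inv, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        inv += digit / denom
    return inv

def halton_token_order(height: int, width: int) -> list[tuple[int, int]]:
    """Enumerate every (row, col) token position in 2D Halton order
    (bases 2 and 3), discarding duplicates after discretization."""
    order, seen = [], set()
    n = 1
    while len(order) < height * width:
        y = int(radical_inverse(n, 2) * height)  # base 2 -> rows
        x = int(radical_inverse(n, 3) * width)   # base 3 -> cols
        if (y, x) not in seen:                   # keep each position once
            seen.add((y, x))
            order.append((y, x))
        n += 1
    return order
```

For a 16x16 VQ token grid, `halton_token_order(16, 16)` enumerates all 256 positions; prefixes of this list give the tokens to unmask at each successive step.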
This approach aims to maximize information gain at each step and reduce correlation between sampled tokens, leading to improved image quality and diversity.
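To show where this precomputed order plugs in, below is a hedged sketch of a MaskGIT-style decoding loop driven by the Halton order; the `model` interface, `mask_token_id`, and the cosine masking schedule are assumptions about a typical setup, not the paper's exact implementation:

```python
import math
import torch

@torch.no_grad()
def halton_decode(model, order, height=16, width=16, steps=12,
                  mask_token_id=0, temperature=1.0):
    """MaskGIT-style iterative decoding in which the tokens revealed at
    each step come from the precomputed Halton order rather than from
    model confidence. `model(tokens)` is assumed to return logits of
    shape (1, height*width, vocab_size)."""
    num_tokens = height * width
    tokens = torch.full((1, num_tokens), mask_token_id, dtype=torch.long)
    revealed = 0
    for t in range(steps):
        # Cosine schedule: fraction of tokens still masked after step t.
        mask_ratio = math.cos(math.pi / 2 * (t + 1) / steps)
        target = num_tokens - int(mask_ratio * num_tokens)
        logits = model(tokens)
        probs = torch.softmax(logits / temperature, dim=-1)
        sampled = torch.multinomial(probs[0], num_samples=1).squeeze(-1)
        # Reveal the next positions in Halton order; earlier ones stay fixed.
        for (y, x) in order[revealed:target]:
            idx = y * width + x
            tokens[0, idx] = sampled[idx]
        revealed = target
    return tokens.reshape(1, height, width)
```

Note that the selection is independent of the model's confidence scores, so no per-position ranking or noise injection is needed.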
Advantages Over Confidence Scheduler
The Halton scheduler addresses several limitations associated with the Confidence scheduler:
- Diversity Enhancement: The Confidence scheduler's tendency to cluster token selections around already unmasked regions limits image diversity. The Halton scheduler mitigates this by spatially distributing tokens, reducing correlation and promoting the generation of more varied image structures.
- Image Quality Improvement: By ensuring uniform coverage, the Halton scheduler maximizes information gain, leading to more detailed images.
- Simplified Hyperparameter Tuning: The Confidence scheduler typically requires injecting Gumbel noise into the confidence scores to avoid clustering, and the noise temperature must be tuned carefully. The Halton scheduler eliminates this noise injection entirely, simplifying the setup (a sketch of the confidence-based selection it replaces follows this list).
- Error Reduction: The Halton scheduler reduces non-recoverable sampling errors by minimizing the correlation of sampled tokens, leading to better overall performance.
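For contrast, here is a generic rendering of confidence-based selection with Gumbel noise, in the style of common MaskGIT reimplementations; the exact noise schedule varies across codebases:

```python
import torch

def confidence_select(logits, sampled, masked, k, gumbel_scale=1.0):
    """Confidence-scheduler selection: keep the k masked positions whose
    sampled token has the highest Gumbel-perturbed log-probability.
    `gumbel_scale` is exactly the knob the Halton scheduler removes."""
    log_probs = torch.log_softmax(logits, dim=-1)                   # (B, N, V)
    conf = log_probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)  # (B, N)
    u = torch.rand_like(conf).clamp_min(1e-20)
    conf = conf + gumbel_scale * (-torch.log(-torch.log(u)))        # Gumbel noise
    conf = conf.masked_fill(~masked, float("-inf"))  # only masked positions compete
    return conf.topk(k, dim=-1).indices                             # (B, k)
```

Because the highest-confidence positions tend to border already-revealed tokens, this selection clusters unless `gumbel_scale` is tuned per setting; the Halton order sidesteps the problem by construction.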
The paper also provides a theoretical justification based on mutual information: the unmasking order should minimize the mutual information, aggregated over the inference steps, between each newly revealed token and the tokens revealed before it. By spreading tokens out spatially, the Halton scheduler minimizes this shared information and keeps the entropy of each prediction high.
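Schematically, writing $x_{\sigma(t)}$ for the token revealed at step $t$ under an unmasking order $\sigma$ (our notation, not the paper's verbatim formulation), the objective is

$$
\sigma^{\star} = \arg\min_{\sigma} \sum_{t=1}^{T} I\bigl(x_{\sigma(t)}\,;\, x_{\sigma(1)}, \dots, x_{\sigma(t-1)}\bigr),
$$

and a low-discrepancy order keeps each newly revealed token spatially distant from previously revealed ones, which is where this mutual information tends to be smallest.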
Experimental Results
The efficacy of the Halton scheduler was validated through experiments on the ImageNet and COCO datasets.
- ImageNet (Class-Conditional Image Synthesis): The Halton scheduler was compared against the Confidence and Random schedulers using FID, IS, Precision, and Recall. It consistently outperformed the Confidence scheduler, reducing FID by 2.19 on ImageNet 256x256 and by 2.27 on ImageNet 512x512.
- COCO (Text-to-Image Generation): Evaluated with FID, CLIP-Score, Precision, and Recall, the Halton scheduler again outperformed the Confidence scheduler, cutting FID by 2.7. Compared to aMUSEd, it achieved an even larger FID reduction of 10.1.
These quantitative results are supported by qualitative observations, which indicate that the Halton scheduler produces more diverse and detailed images. Ablation studies further confirm the Halton scheduler's benefits, particularly its scalability with an increasing number of inference steps, unlike the Confidence scheduler, which may degrade with more steps due to noise injection.
In conclusion, the Halton scheduler presents a viable alternative to the Confidence scheduler in MaskGIT, offering enhanced image quality, diversity, and simplified hyperparameter tuning, as validated by experiments on ImageNet and COCO datasets.