- The paper introduces the Halton scheduler, a novel token unmasking method for MaskGIT that replaces the Confidence scheduler to improve image quality and diversity while simplifying hyperparameter tuning.
- The Halton scheduler utilizes a low-discrepancy Halton sequence to ensure tokens are unmasked in a spatially uniform order across the image, maximizing information gain and reducing correlation between selections.
- Experimental results on ImageNet and COCO datasets demonstrate the Halton scheduler's superior performance, showing significant FID reductions (e.g., 2.19 on ImageNet 256x256, 2.7 on COCO) compared to the Confidence scheduler.
The paper "Halton Scheduler For Masked Generative Image Transformer" (2503.17076) introduces a novel token unmasking scheduler for MaskGIT, designed to enhance image generation quality and diversity. The proposed Halton scheduler replaces the original Confidence scheduler, offering improvements in performance and hyperparameter tuning simplicity.
Halton Scheduler Mechanics
The Halton scheduler employs a low-discrepancy, quasi-random Halton sequence to determine the order in which tokens are unmasked during the iterative image generation process of MaskGIT. Unlike the Confidence scheduler, which prioritizes tokens with high confidence scores (potentially leading to clustered selections), the Halton scheduler ensures uniform spatial coverage across the image.
The implementation involves the following steps:
- Halton Sequence Generation: A 2D Halton sequence is generated using bases 2 and 3 via the radical inverse function. For an integer $n$ and a prime base $b$ with radix-$b$ expansion $n = \sum_{i=0}^{k} a_i b^i$, the radical inverse is $\phi_b(n) = \sum_{i=0}^{k} a_i b^{-i-1}$, and the 2D sequence is $(\phi_2(n), \phi_3(n))$. Because the sequence has low discrepancy, it covers the unit square more uniformly than purely random sampling.
- Discretization and Mapping: The continuous Halton coordinates (each in $[0, 1)$) are discretized onto the grid of the token map that represents the image, converting them into discrete token positions. Duplicate positions are discarded so that each token appears exactly once in the resulting order.
- Token Unmasking: The discretized sequence fixes the order in which tokens are unmasked, so that at every step the newly revealed tokens are spread out spatially across the image (a code sketch follows this list).
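A minimal Python sketch of these three steps, assuming a rectangular latent token grid; the function names are illustrative, not taken from the authors' code:

```python
def radical_inverse(n: int, base: int) -> float:
    """Radical inverse phi_b(n): mirror the base-b digits of n about the
    radix point, i.e. phi_b(sum_i a_i * b^i) = sum_i a_i * b^(-i-1)."""
    inv, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        inv += digit / denom
    return inv

def halton_token_order(height: int, width: int) -> list[tuple[int, int]]:
    """Enumerate every (row, col) token position in 2D Halton order
    (bases 2 and 3), discarding duplicates after discretization."""
    order, seen = [], set()
    n = 1
    while len(order) < height * width:
        y = int(radical_inverse(n, 2) * height)  # base 2 -> rows
        x = int(radical_inverse(n, 3) * width)   # base 3 -> cols
        if (y, x) not in seen:                   # keep each position once
            seen.add((y, x))
            order.append((y, x))
        n += 1
    return order
```

For a 16x16 VQ token grid, `halton_token_order(16, 16)` enumerates all 256 positions; prefixes of this list give the tokens to unmask at each successive step.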
This approach aims to maximize information gain at each step and reduce correlation between sampled tokens, leading to improved image quality and diversity.
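To show where this precomputed order plugs in, below is a hedged sketch of a MaskGIT-style decoding loop driven by the Halton order; the `model` interface, `mask_token_id`, and the cosine masking schedule are assumptions about a typical setup, not the paper's exact implementation:

```python
import math
import torch

@torch.no_grad()
def halton_decode(model, order, height=16, width=16, steps=12,
                  mask_token_id=0, temperature=1.0):
    """MaskGIT-style iterative decoding in which the tokens revealed at
    each step come from the precomputed Halton order rather than from
    model confidence. `model(tokens)` is assumed to return logits of
    shape (1, height*width, vocab_size)."""
    num_tokens = height * width
    tokens = torch.full((1, num_tokens), mask_token_id, dtype=torch.long)
    revealed = 0
    for t in range(steps):
        # Cosine schedule: fraction of tokens still masked after step t.
        mask_ratio = math.cos(math.pi / 2 * (t + 1) / steps)
        target = num_tokens - int(mask_ratio * num_tokens)
        logits = model(tokens)
        probs = torch.softmax(logits / temperature, dim=-1)
        sampled = torch.multinomial(probs[0], num_samples=1).squeeze(-1)
        # Reveal the next positions in Halton order; earlier ones stay fixed.
        for (y, x) in order[revealed:target]:
            idx = y * width + x
            tokens[0, idx] = sampled[idx]
        revealed = target
    return tokens.reshape(1, height, width)
```

Note that the selection is independent of the model's confidence scores, so no per-position ranking or noise injection is needed.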
Advantages Over Confidence Scheduler
The Halton scheduler addresses several limitations associated with the Confidence scheduler:
- Diversity Enhancement: The Confidence scheduler's tendency to cluster token selections around already unmasked regions limits image diversity. The Halton scheduler mitigates this by spatially distributing tokens, reducing correlation and promoting the generation of more varied image structures.
- Image Quality Improvement: By ensuring uniform coverage, the Halton scheduler maximizes information gain, leading to more detailed images.
- Simplified Hyperparameter Tuning: The Confidence scheduler typically requires injecting Gumbel noise into the confidence scores to avoid clustering, and the noise temperature must be tuned carefully. The Halton scheduler eliminates this noise injection entirely, simplifying the setup (a sketch of the confidence-based selection it replaces follows this list).
- Error Reduction: The Halton scheduler reduces non-recoverable sampling errors by minimizing the correlation of sampled tokens, leading to better overall performance.
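For contrast, here is a generic rendering of confidence-based selection with Gumbel noise, in the style of common MaskGIT reimplementations; the exact noise schedule varies across codebases:

```python
import torch

def confidence_select(logits, sampled, masked, k, gumbel_scale=1.0):
    """Confidence-scheduler selection: keep the k masked positions whose
    sampled token has the highest Gumbel-perturbed log-probability.
    `gumbel_scale` is exactly the knob the Halton scheduler removes."""
    log_probs = torch.log_softmax(logits, dim=-1)                   # (B, N, V)
    conf = log_probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)  # (B, N)
    u = torch.rand_like(conf).clamp_min(1e-20)
    conf = conf + gumbel_scale * (-torch.log(-torch.log(u)))        # Gumbel noise
    conf = conf.masked_fill(~masked, float("-inf"))  # only masked positions compete
    return conf.topk(k, dim=-1).indices                             # (B, k)
```

Because the highest-confidence positions tend to border already-revealed tokens, this selection clusters unless `gumbel_scale` is tuned per setting; the Halton order sidesteps the problem by construction.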
The paper also provides a theoretical justification based on mutual information: the unmasking order should minimize the mutual information, aggregated over the inference steps, between each newly revealed token and the tokens revealed before it. By spreading tokens out spatially, the Halton scheduler minimizes this shared information and keeps the entropy of each prediction high.
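Schematically, writing $x_{\sigma(t)}$ for the token revealed at step $t$ under an unmasking order $\sigma$ (our notation, not the paper's verbatim formulation), the objective is

$$
\sigma^{\star} = \arg\min_{\sigma} \sum_{t=1}^{T} I\bigl(x_{\sigma(t)}\,;\, x_{\sigma(1)}, \dots, x_{\sigma(t-1)}\bigr),
$$

and a low-discrepancy order keeps each newly revealed token spatially distant from previously revealed ones, which is where this mutual information tends to be smallest.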
Experimental Results
The efficacy of the Halton scheduler was validated through experiments on the ImageNet and COCO datasets.
- ImageNet (Class-Conditional Image Synthesis): The Halton scheduler was compared against the Confidence and Random schedulers using FID, IS, Precision, and Recall. It consistently outperformed the Confidence scheduler, reducing FID by 2.19 on ImageNet 256x256 and by 2.27 on ImageNet 512x512.
- COCO (Text-to-Image Generation): Evaluated with FID, CLIP-Score, Precision, and Recall, the Halton scheduler again outperformed the Confidence scheduler, cutting FID by 2.7. Compared to aMUSEd, it achieved an even larger FID reduction of 10.1.
These quantitative results are supported by qualitative observations, which indicate that the Halton scheduler produces more diverse and detailed images. Ablation studies further confirm the Halton scheduler's benefits, particularly its scalability with an increasing number of inference steps, unlike the Confidence scheduler, which may degrade with more steps due to noise injection.
In conclusion, the Halton scheduler presents a viable alternative to the Confidence scheduler in MaskGIT, offering enhanced image quality, diversity, and simplified hyperparameter tuning, as validated by experiments on ImageNet and COCO datasets.