AutoProg-Zero: Efficient Fine-Tuning for Vision Models
- AutoProg-Zero is a zero-shot fine-tuning algorithm that eliminates inner-loop training by using analytical proxies to guide dynamic parameter unfreezing.
- It leverages NTK and ZiCo metrics with a rank-voting scheme to choose optimal unfreezing schedules, effectively balancing computational cost and convergence quality.
- The integration of a Unique Stage Identifier (SID) stabilizes stage transitions, resulting in faster training and improved model performance.
AutoProg-Zero is a zero-shot automated progressive fine-tuning algorithm tailored for large vision models (LVMs)—most notably diffusion models—within the AutoProg framework. Distinct from AutoProg-One, which requires expensive one-shot supernet training to search for network growth schedules during pre-training, AutoProg-Zero introduces a zero-shot methodology that entirely eliminates inner-loop training or bi-level optimization for schedule selection. It leverages analytical proxies for trainability and convergence, enabling on-the-fly determination of unfreezing schedules, and incorporates a Unique Stage Identifier (SID) scheme to stabilize network behavior across progressive regime changes. This design achieves up to 2.86× acceleration in fine-tuning tasks while maintaining or improving model performance (Li et al., 2024).
1. Design Principles and Motivation
AutoProg-Zero was developed to address computational inefficiencies and instability in the progressive fine-tuning of high-capacity vision models. Traditional progressive learning for fine-tuning LVMs, particularly diffusion models, involves gradually unfreezing network parameters in multiple stages. While AutoProg-One applies a one-shot supernet regime with bi-level optimization for pre-training (increasing both resource and time costs), AutoProg-Zero eliminates:
- Supernet training: The two-epoch elastic supernet phase is bypassed.
- Bi-level schedule training: Instead of fine-tuning each candidate schedule, zero-shot proxies estimate the expected performance of every unfreezing configuration based solely on the current parameterization and incoming gradients.
- Inner-loop computational overhead: All candidate schedules at a given stage share the same model state, allowing immediate, parallelized scoring.
This approach allows the fine-tuning pipeline to dynamically select which parameters to unfreeze at each stage, guided by live statistical proxies for convergence and generalization.
2. Zero-Shot Unfreezing Schedule Search
AutoProg-Zero formalizes the progressive fine-tuning process as the selection of an unfreezing schedule $\Psi = (\psi_1, \dots, \psi_K)$, where each stage $k$ determines a subset of parameters $\theta_{\psi_k}$ to make learnable. The optimization problem is:

$$\min_{\Psi} \; \mathcal{L}\big(\theta^*(\Psi)\big) + \lambda\, \mathcal{C}(\Psi),$$

where $\mathcal{L}\big(\theta^*(\Psi)\big)$ is the ultimate fine-tuned loss and $\mathcal{C}(\Psi)$ denotes the computational cost of the schedule. The direct approach would necessitate fully fine-tuning for each candidate schedule—a prohibitive cost.
Instead, AutoProg-Zero introduces:
- Zero-shot proxy $\mathcal{S}(\psi)$: Predicts the future loss for a candidate $\psi$ using analytical and statistical metrics.
- Candidate space $\Omega_k$: At stage $k$, all permissible parameter subsets $\psi$ extending prior unfreezing actions.
The selection at stage $k$ is:

$$\psi_k^* = \arg\min_{\psi \in \Omega_k} \mathcal{S}(\psi).$$

Because two objectives are used for $\mathcal{S}$—the NTK condition number $\kappa_{\text{NTK}}$ for trainability and the ZiCo gradient statistic for convergence/generalization—AutoProg-Zero employs a rank-voting aggregation:

$$\mathcal{S}(\psi) = \mathrm{rank}_{\Omega_k}\big(\kappa_{\text{NTK}}(\psi)\big) + \mathrm{rank}_{\Omega_k}\big(-\mathrm{ZiCo}(\psi)\big),$$

$$\psi_k^* = \arg\min_{\psi \in \Omega_k} \mathcal{S}(\psi).$$

Here, $\mathrm{rank}_{\Omega_k}(\cdot)$ denotes ranking within the candidate space (lower is better). This methodology ensures a balance between computational cost and predicted optimization quality.
Pseudocode Sketch:
3. Mathematical Formulation of Zero-Shot Proxies
3.1. NTK Condition Number
For any stage and timestep, with current learnable parameters $\theta_\psi$, the neural tangent kernel (NTK) over a probe batch $\{x_i\}_{i=1}^{N}$ is:

$$\Theta(x_i, x_j) = \nabla_{\theta_\psi} f(x_i; \theta_\psi)^{\top}\, \nabla_{\theta_\psi} f(x_j; \theta_\psi).$$

The condition number proxy is:

$$\kappa_{\text{NTK}} = \frac{\lambda_{\max}(\Theta)}{\lambda_{\min}(\Theta)},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximal and minimal eigenvalues of $\Theta$, respectively. This metric reflects trainability; a lower condition number correlates with more stable gradients and predictable descent.
3.2. ZiCo Gradient Statistic
For candidate subset $\psi$ (parameters $\theta_\psi$):

$$\mathrm{ZiCo}(\psi) = \sum_{l \in \psi} \log \left( \sum_{\theta \in \theta_l} \frac{\big|\mathbb{E}\,[\nabla_{\theta}\mathcal{L}]\big|}{\sqrt{\mathrm{Var}\,[\nabla_{\theta}\mathcal{L}]}} \right),$$

where the gradient mean and variance are computed across training batches.
This proxy favors candidates where gradient means are larger (driving signal) and variances are smaller (stability), thus indicating faster, more robust convergence.
4. Unique Stage Identifier (SID) Scheme
When transitioning stages—i.e., a change in the trainable parameter set—the network’s output distribution can exhibit sudden drift, potentially destabilizing diffusion training. To counter these effects, AutoProg-Zero introduces a learnable Unique Stage Identifier embedding, $e_{\mathrm{SID}}^{(k)}$ for stage $k$, incorporated into model conditioning:
- Text-to-image: Each text prompt is prefixed with a unique token $[\mathrm{SID}_k]$, associated with $e_{\mathrm{SID}}^{(k)}$, so the conditioning vector becomes $c' = \big[e_{\mathrm{SID}}^{(k)};\, c\big]$.
- Class-conditional: For class embedding $e_c$, the new stage embedding becomes $e_c' = e_c + e_{\mathrm{SID}}^{(k)}$.
This mechanism preserves continuity in conditioning space and mitigates catastrophic forgetting or abrupt changes in model response after each stage switch.
5. Empirical Results and Benchmark Comparisons
Extensive experiments validate the efficacy of AutoProg-Zero on prominent LVM architectures and datasets:
| Task (Model/Dataset) | Full Fine-Tune Runtime | AutoProg-Zero Runtime (Speedup) | Performance Metric |
|---|---|---|---|
| DiT-XL/2 (Oxford Flowers) | 1× | 0.39× (≈2.56×) | FID: 21.05 (full) → 12.19 (APZ) |
| Stable Diff. (CUB/Flowers, T2I) | 1× | 0.39× | FID: 9.32/35.21 → 8.74/31.91 |
| DreamBooth (Stable Diffusion) | 1× | 0.35× (2.86×) | DINO: 0.849→0.874, CLIP-T: 0.214→0.280 |
Key findings:
- AutoProg-Zero routinely delivers a 2.5–2.9× wall-clock speedup over full fine-tuning.
- Final FID matches or improves upon that of full fine-tuning.
- SID yields improved FID (e.g., 8.61→7.70 on Food, DiT at 0.39× time).
Ablations show four stages as a robust default (too many stages revert to supernet-only behavior and degrade performance). AutoProg-Zero outperforms the original one-shot AutoProg-One for fine-tuning tasks.
6. Implementation Details and Hyperparameter Choices
Key experimental configurations:
- Models: DiT-XL/2 (class-conditional), Stable Diffusion v1.5 (text-to-image), DreamBooth (customization).
- Fine-tune steps: 240,000 (DiT), 32 epochs (Stable Diff.), batch sizes: 256 (DiT, 8× A800), 32 (Stable Diff., 1× TITAN RTX).
- Learning rates: 1e-4 (DiT), 1e-5 (Stable Diff.), 5e-6 (DreamBooth).
- Unfreezing stages: 4 by default; candidate parameter subsets at each stage are formed by splitting along patch count and network depth.
- Classifier-free guidance: 1.5→4.0 (DiT), 3.0 (Stable Diff.), always 256×256 resolution.
- Adaptive regularization: not applied for diffusion; reserved for ViT VOLO backbone.
Ablation studies confirm the utility of the SID, optimal number of stages, and the superiority of zero-shot schedule search over one-shot in fine-tuning regimes.
7. Integration within the AutoProg Framework and Workflow
AutoProg comprises:
- AutoProg-One: Pre-training stage growth for ViTs, based on one-shot elastic supernet search.
- AutoProg-Zero: Fine-tuning stage for diffusion models, with selection based on zero-shot proxies and SID stabilization.
A typical LVM development pipeline:
- Pre-train: AutoProg-One for ViT backbone (e.g., on ImageNet, if required).
- Transfer: Use the backbone within a diffusion or other vision architecture.
- Fine-tune: Apply AutoProg-Zero for efficient, staged adaptation to new tasks/datasets.
This design achieves efficient, robust, and scalable training of LVMs across pre-training and fine-tuning phases, with direct empirical support for strong speedup and stable convergence (Li et al., 2024).