AlphaNet: Improved Training of Supernets with Alpha-Divergence (2102.07954v2)

Published 16 Feb 2021 in cs.CV, cs.AI, and stat.ML

Abstract: Weight-sharing neural architecture search (NAS) is an effective technique for automating efficient neural architecture design. Weight-sharing NAS builds a supernet that assembles all the architectures as its sub-networks and jointly trains the supernet with the sub-networks. The success of weight-sharing NAS heavily relies on distilling the knowledge of the supernet to the sub-networks. However, we find that the widely used distillation divergence, i.e., KL divergence, may lead to student sub-networks that over-estimate or under-estimate the uncertainty of the teacher supernet, leading to inferior performance of the sub-networks. In this work, we propose to improve the supernet training with a more generalized alpha-divergence. By adaptively selecting the alpha-divergence, we simultaneously prevent the over-estimation or under-estimation of the uncertainty of the teacher model. We apply the proposed alpha-divergence based supernets training to both slimmable neural networks and weight-sharing NAS, and demonstrate significant improvements. Specifically, our discovered model family, AlphaNet, outperforms prior-art models on a wide range of FLOPs regimes, including BigNAS, Once-for-All networks, and AttentiveNAS. We achieve ImageNet top-1 accuracy of 80.0% with only 444M FLOPs. Our code and pretrained models are available at https://github.com/facebookresearch/AlphaNet.

Authors (5)
  1. Dilin Wang (37 papers)
  2. Chengyue Gong (30 papers)
  3. Meng Li (244 papers)
  4. Qiang Liu (405 papers)
  5. Vikas Chandra (75 papers)
Citations (41)

Summary

  • The paper introduces an adaptive α-divergence framework that refines knowledge distillation for supernet training by balancing uncertainty estimation.
  • It applies dynamic gradient clipping and parameter tuning to stabilize training and overcome limitations of conventional KL divergence.
  • Empirical results demonstrate AlphaNet outperforms state-of-the-art models, achieving 80.0% top-1 accuracy at 444M FLOPs on ImageNet.

Overview of AlphaNet: Improved Training of Supernets with Alpha-Divergence

The paper "AlphaNet: Improved Training of Supernets with Alpha-Divergence" introduces an enhanced methodology for training supernets in the context of weight-sharing neural architecture search (NAS). This approach is rooted in the application of a more generalized divergence metric, known as α\alpha-divergence, for knowledge distillation (KD), which is employed to inform sub-networks within a supernet. The authors present a compelling argument and empirical evidence that α\alpha-divergence serves as a robust alternative to the conventional Kullback-Leibler (KL) divergence typically used in training such architectures.

Motivation and Methodology

The authors highlight a fundamental challenge with traditional KD based on KL divergence: its tendency to either over-penalize or under-penalize the sub-networks' uncertainty estimates relative to the teacher supernet. The KL divergence is zero-avoiding, yet it fails to sufficiently penalize student sub-networks that over-estimate the teacher's uncertainty. To address this, the authors propose the α-divergence, which allows the penalty for deviations to be modulated by tuning the parameter α. By adaptively setting this parameter during training, their approach balances the over- and under-estimation of uncertainty, leading to a more accurate alignment between the sub-networks and the supernet.
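For reference, one common parameterization of the α-divergence between a teacher distribution p and a student distribution q is shown below; the paper's exact formulation may differ in constants, so this is given only to make the role of α concrete.

$$
D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha(\alpha - 1)} \left( \sum_{i} p_i^{\alpha}\, q_i^{1 - \alpha} \;-\; 1 \right)
$$

This family recovers KL(q ∥ p) in the limit α → 0 and KL(p ∥ q) in the limit α → 1. Increasingly negative α heavily penalizes the student for placing probability mass where the teacher assigns very little (over-estimated uncertainty), while α > 1 heavily penalizes the student for assigning too little mass where the teacher is confident (under-estimated uncertainty).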

They introduce an adaptive α-divergence framework where both positive and negative values of α are utilized to guide the KD process. This adaptation seeks to remedy both over-confident and overly conservative predictions by the student models relative to the teacher model.
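A minimal PyTorch-style sketch of how such an adaptive α-divergence distillation loss could look is given below. It assumes the parameterization above and an adaptive rule that back-propagates whichever of a negative and a positive α yields the larger divergence; the function names and the illustrative values α₋ = −1 and α₊ = 2 are our own and are not taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def alpha_divergence(teacher_logits, student_logits, alpha, eps=1e-8):
    """Alpha-divergence D_alpha(p || q) between the teacher distribution p and
    the student distribution q, using the parameterization
        D_alpha(p || q) = (sum_i p_i^alpha * q_i^(1-alpha) - 1) / (alpha * (alpha - 1)).
    Valid for alpha not in {0, 1}; those limits correspond to the two KL divergences."""
    p = F.softmax(teacher_logits, dim=-1).detach()  # teacher provides fixed soft targets
    q = F.softmax(student_logits, dim=-1)
    ratio = (p + eps) / (q + eps)
    # sum_i q_i * (p_i / q_i)^alpha == sum_i p_i^alpha * q_i^(1 - alpha)
    inner = (q * ratio.pow(alpha)).sum(dim=-1)
    return (inner - 1.0) / (alpha * (alpha - 1.0))

def adaptive_alpha_kd_loss(teacher_logits, student_logits,
                           alpha_neg=-1.0, alpha_pos=2.0):
    """Adaptive selection: evaluate the divergence at a negative and a positive
    alpha and penalize with whichever is larger, so that both over-estimated and
    under-estimated teacher uncertainty are discouraged."""
    d_neg = alpha_divergence(teacher_logits, student_logits, alpha_neg)
    d_pos = alpha_divergence(teacher_logits, student_logits, alpha_pos)
    return torch.maximum(d_neg, d_pos).mean()
```

In a weight-sharing supernet setting, the teacher logits would typically come from the largest sub-network and the student logits from sampled smaller sub-networks, with this distillation term combined with the standard label loss.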

To mitigate training instability caused by the large gradient variances associated with extreme α values, they apply a dynamic clipping strategy. This approach stabilizes training by capping the influence of high-variance gradient factors, ensuring reliable convergence of the supernet.
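The sketch below illustrates one way such clipping could be implemented on top of the loss above, by clamping the teacher/student density ratio before it is raised to the power α. The threshold `clip_ratio` is purely illustrative and not a value reported in the paper, and the paper's own clipping rule may act on a different factor.

```python
import torch.nn.functional as F

def clipped_alpha_divergence(teacher_logits, student_logits, alpha,
                             clip_ratio=5.0, eps=1e-8):
    """Alpha-divergence with the density ratio p/q clamped to
    [1/clip_ratio, clip_ratio]. Extreme ratios raised to large |alpha|
    dominate the gradient; capping them bounds the gradient variance."""
    p = F.softmax(teacher_logits, dim=-1).detach()
    q = F.softmax(student_logits, dim=-1)
    ratio = ((p + eps) / (q + eps)).clamp(min=1.0 / clip_ratio, max=clip_ratio)
    inner = (q * ratio.pow(alpha)).sum(dim=-1)
    return (inner - 1.0) / (alpha * (alpha - 1.0))
```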

Experimental Results and Evaluation

The authors demonstrate the efficacy of the proposed method through rigorous empirical validation on benchmark tasks including slimmable networks and weight-sharing NAS challenges. Notably, AlphaNet models derived from their training methodology surpass existing state-of-the-art models across a range of computational budgets, achieving an impressive ImageNet top-1 accuracy of 80.0% at merely 444M FLOPs.

  1. Slimmable Networks: Their adaptive KD mechanism enhances performance across varying configurations of MobileNetV1 and V2 architectures, compared to models trained with conventional KD strategies.
  2. Weight-sharing NAS: The method yields substantial improvements in Pareto optimal frontiers for accuracy versus computational cost. This is particularly evident in scenarios involving broad architectural spaces, underscoring the method's robustness and general applicability to diverse training scenarios.

Implications and Future Directions

The use of adaptive α-divergence as proposed opens a path towards more efficient and effective training regimes in NAS, addressing a key limitation of conventional divergence metrics in KD scenarios. This innovation not only enhances the performance of sub-networks post-training but also suggests broader implications for general network optimization frameworks where uncertainty estimation is crucial.

Moreover, the research presents potential avenues for further exploration in the balance and interplay between regularization and divergence metrics. Future work could investigate additional adaptive strategies or explore the transferability of this method to other domains of machine learning, such as LLMs or reinforcement learning contexts.

Overall, the work marks a significant step forward in NAS, providing a refined toolset for developing architectures that strike a compelling trade-off between prediction accuracy and computational efficiency. It contributes valuable insights and methodologies that could shape forthcoming developments in both AI research and its practical applications.
