- The paper introduces the 'divide and co-training' method, dividing a large neural network into smaller, independent networks trained collaboratively.
- Empirical results on benchmarks such as CIFAR and ImageNet show that ensembles of smaller networks outperform a single large model with similar resources, often with faster inference.
- This method offers 'network count' as a new model scaling dimension, providing a pragmatic alternative for efficient AI in resource-constrained scenarios like edge computing.
Towards Better Accuracy-Efficiency Trade-offs: Divide and Co-training
This paper discusses a methodology aimed at improving the trade-off between accuracy and efficiency in neural network models. The approach, termed "divide and co-training," hinges on the hypothesis that increasing the number of distinct neural networks in an ensemble often yields a better accuracy-efficiency trade-off than merely widening individual networks. The submission presents a framework that divides an initially large neural network into several smaller, independent networks, each holding a fraction of the parameters and compute (FLOPs) of the original.
Key Contributions and Findings
- Conceptual Framework: The authors introduce a method to divide a single large network, along its width, into multiple smaller independent networks whose combined parameters roughly match the original. The procedure also partitions the regularization, adjusting weight decay and dropout for each split so that every smaller network remains properly regularized (a minimal splitting sketch follows this list).
- Co-training Paradigm: The resulting networks are co-trained on different views of the same dataset, which improves generalization. In this collaborative process each network learns not only from the data but also from the other networks' outputs through a Jensen-Shannon divergence-based ensemble loss (one plausible form of this loss is sketched after the list).
- Empirical Validation: Extensive experiments were conducted on eight neural architectures across commonly used benchmarks, including CIFAR-10, CIFAR-100, and ImageNet. The results show that the ensemble of smaller networks frequently outperforms a single large model under a similar computational budget (parameters and FLOPs).
- Improvements in Inference Speed: Because the smaller networks can run concurrently, either on separate devices or on parallel streams within a single device, the authors report faster inference while maintaining or improving accuracy (a single-device CUDA-stream sketch appears after the list).
- Theoretical Underpinning: The paper revisits the bias-variance-covariance decomposition to explain why scaling the number of networks affects the error components differently from traditional scaling along depth or width alone (the decomposition is reproduced after the list).
- Practical Implications: The proposed method introduces a new scaling dimension to model engineering—network count—thus offering a pragmatic alternative for situations where increasing a model's size becomes computationally untenable or yields diminishing returns.
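The summary does not give the exact splitting rule, so the following is a minimal sketch that assumes channel width is scaled by roughly 1/sqrt(S) so that S narrower networks together match the wide model's parameter count; the ResNet-50 base and the `width_per_group` argument are illustrative choices, not the paper's configuration.

```python
import math

import torch.nn as nn
from torchvision import models


def build_split_ensemble(num_splits: int = 2, base_width: int = 64) -> nn.ModuleList:
    """Build `num_splits` narrower ResNet-50s whose combined parameter count
    roughly tracks one wide model. Parameters grow ~quadratically with width,
    hence the 1/sqrt(S) scaling (an assumption of this sketch)."""
    small_width = max(1, round(base_width / math.sqrt(num_splits)))
    nets = [
        # `width_per_group` rescales the bottleneck width in torchvision's ResNet.
        models.resnet50(weights=None, width_per_group=small_width)
        for _ in range(num_splits)
    ]
    return nn.ModuleList(nets)


# Two half-capacity networks to be co-trained as an ensemble.
ensemble = build_split_ensemble(num_splits=2)
```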
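For the co-training objective, the summary only names a Jensen-Shannon divergence-based ensemble loss; the sketch below shows one plausible form, per-network cross-entropy plus the generalized JS divergence between each network's softmax output and the ensemble mean. The weight `lam` and the exact combination are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F


def co_training_loss(logits_list, targets, lam: float = 0.5) -> torch.Tensor:
    """Average per-network cross-entropy plus a generalized Jensen-Shannon
    divergence that pulls each network toward the ensemble mean prediction.
    The weight `lam` is a hypothetical hyper-parameter."""
    ce = sum(F.cross_entropy(logits, targets) for logits in logits_list)
    ce = ce / len(logits_list)

    probs = [F.softmax(logits, dim=1) for logits in logits_list]
    mean_p = torch.stack(probs, dim=0).mean(dim=0)

    # Generalized JS divergence = mean_i KL(p_i || mean_p).
    # F.kl_div(log_q, p) computes KL(p || q), so log(mean_p) is the input.
    js = sum(
        F.kl_div(mean_p.clamp_min(1e-7).log(), p, reduction="batchmean")
        for p in probs
    ) / len(probs)

    return ce + lam * js
```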
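The single-device inference case can be illustrated with CUDA streams; this is a hypothetical sketch of overlapping the small networks' forward passes, not the authors' deployment code, and real speed-ups depend on model size and hardware.

```python
import torch


@torch.no_grad()
def parallel_inference(nets, x: torch.Tensor) -> torch.Tensor:
    """Run each small network on its own CUDA stream so forward passes can
    overlap on one device, then average the softmax outputs as the ensemble
    prediction. Illustrative sketch only."""
    streams = [torch.cuda.Stream() for _ in nets]
    outputs = [None] * len(nets)
    torch.cuda.synchronize()
    for i, (net, stream) in enumerate(zip(nets, streams)):
        with torch.cuda.stream(stream):
            outputs[i] = net(x).softmax(dim=1)
    torch.cuda.synchronize()
    return torch.stack(outputs).mean(dim=0)
```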
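The theoretical argument rests on the classical bias-variance-covariance decomposition for an averaged ensemble (Ueda and Nakano), reproduced below in standard notation rather than the paper's: adding members shrinks the variance term, while the covariance term is what co-training must keep small.

```latex
% Squared error of an M-member averaged ensemble \bar{f}(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)
\mathbb{E}\!\left[(\bar{f}(x) - y)^2\right]
  = \overline{\mathrm{bias}}^{\,2}
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \left(1 - \frac{1}{M}\right)\overline{\mathrm{cov}}
```

Here the barred quantities are averages over the M ensemble members.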
Discussion on Implications and Future Directions
This paper highlights a critical avenue for the ongoing development of efficient AI systems, especially in resource-constrained scenarios. Dividing a neural network in this way suits settings such as edge computing, where computational resources are limited. One potential extension of this research is to integrate automated machine learning (AutoML) techniques to dynamically determine the optimal number of networks and their configurations for specific tasks or deployment environments.
Further, this approach encourages exploration of model distillation strategies that could allow individual smaller networks or sub-ensembles to act independently, potentially improving robustness to data distribution shifts. Another intriguing direction is to investigate divide and co-training in the context of transfer learning, to understand whether such ensemble methods can adapt pre-trained models to novel tasks more efficiently, saving computational resources and time.
In conclusion, the paper presents a compelling argument and practical demonstration for revisiting the conventional wisdom regarding neural network scaling, proposing an alternative that is not only more efficient but often more effective in enhancing model performance. This contribution is poised to significantly influence both theoretical explorations and pragmatic implementations in the field of AI and deep learning.