Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks (1511.06314v1)

Published 19 Nov 2015 in cs.CV, cs.LG, and cs.NE

Abstract: Convolutional Neural Networks have achieved state-of-the-art performance on a wide range of tasks. Most benchmarks are led by ensembles of these powerful learners, but ensembling is typically treated as a post-hoc procedure implemented by averaging independently trained models with model variation induced by bagging or random initialization. In this paper, we rigorously treat ensembling as a first-class problem to explicitly address the question: what are the best strategies to create an ensemble? We first compare a large number of ensembling strategies, and then propose and evaluate novel strategies, such as parameter sharing (through a new family of models we call TreeNets) as well as training under ensemble-aware and diversity-encouraging losses. We demonstrate that TreeNets can improve ensemble performance and that diverse ensembles can be trained end-to-end under a unified loss, achieving significantly higher "oracle" accuracies than classical ensembles.

Citations (301)

Summary

  • The paper presents TreeNets, a novel parameter-sharing approach that balances shared low-level features with specialized deep layers in CNN ensembles.
  • The paper shows that ensemble-aware losses like Multiple Choice Learning induce model specialization to boost oracle accuracy.
  • The paper demonstrates that optimizing ensemble diversity during training improves robustness and reduces overfitting on benchmark datasets.

Evaluating Diverse Ensemble Strategies for Convolutional Neural Networks

The paper "Why MM Heads are Better than One: Training a Diverse Ensemble of Deep Networks" ventures into an in-depth exploration of ensembling strategies specifically tailored for deep neural networks, with a focus on Convolutional Neural Networks (CNNs). This paper is driven by the observation that top superior performances on several benchmarks, such as the ImageNet Large Scale Visual Recognition Challenge, are often achieved by ensembles of CNNs. However, traditional ensembling methods primarily view the process as a subsequent model aggregation step, often neglecting the potential for optimization during the ensemble training process.

Ensembling Strategies: Refining Methods for CNNs

The paper undertakes a comprehensive investigation of ensembling strategies, categorizing them into traditional approaches, such as random initialization and bagging, and novel strategies, including parameter sharing and ensemble-aware losses. Notably, the paper critiques bagging because resampling with replacement reduces the amount of unique data each member sees, and instead proposes alternatives such as TreeNets and losses that address the ensemble objective directly during training.

TreeNets: Parameter Sharing for Better Performance

The research introduces TreeNets, tree-structured architectures that allow varying degrees of parameter sharing among ensemble members. TreeNets share initial layers to capture generic low-level features while keeping deeper layers independent to foster diversity and specialization. This design is well supported empirically: TreeNets not only reduce redundant parameters but also outperform conventional ensembles of independently trained networks. Evaluations across several datasets show that TreeNets can mitigate overfitting and improve computational efficiency by concentrating parameter diversity in the higher layers.
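
To make the idea concrete, here is a minimal PyTorch-style sketch of a TreeNet-like ensemble: a shared convolutional trunk feeds M independent branches. The split point, layer widths, and names (`TreeNet`, `trunk`, `branches`) are illustrative assumptions for this summary, not the paper's Caffe implementation.

```python
import torch
import torch.nn as nn


class TreeNet(nn.Module):
    """Tree-structured ensemble: a shared low-level trunk plus M independent branches.

    A minimal sketch of the parameter-sharing idea; the split point and layer
    sizes are illustrative assumptions, not the paper's exact architecture.
    """

    def __init__(self, num_classes: int, num_heads: int = 4):
        super().__init__()
        # Shared trunk computes generic low-level features once for all members.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Each branch owns its own deeper, specialized layers.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )
            for _ in range(num_heads)
        ])

    def forward(self, x):
        shared = self.trunk(x)
        # One set of logits per ensemble member: shape (M, batch, num_classes).
        return torch.stack([branch(shared) for branch in self.branches])
```

Averaging the stacked logits recovers a conventional ensemble prediction, while keeping the per-member outputs separate is what the ensemble-aware losses discussed next rely on.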

Ensemble-Aware Losses: Encouraging Diversity

Another significant contribution is the study of ensemble-aware loss functions, which depart from training ensemble members independently. The experiments reveal that directly optimizing the accuracy of the averaged prediction yields disappointing results due to insufficient gradient diversity, whereas encouraging diversity through methods like Multiple Choice Learning (MCL) can substantially enhance ensemble oracle accuracy. MCL induces specialization by assigning each training example to the member network best able to handle it, thereby promoting diversity in the learned functions. The clear manifestation of specialization, illustrated through class assignments and image reconstructions, highlights the method's potential to yield more robust ensembles.
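
As a rough illustration of the oracle objective behind MCL, the sketch below charges each training example only to the member that currently predicts it best. It is a simplified rendering of the idea assuming a standard cross-entropy base loss; the function name `mcl_oracle_loss` is hypothetical and the paper's actual training procedure differs in detail.

```python
import torch
import torch.nn.functional as F


def mcl_oracle_loss(member_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """MCL-style oracle loss: each example is charged only to the member
    that currently predicts it best, which drives specialization.

    member_logits: (M, B, C) logits from M ensemble members.
    targets:       (B,) ground-truth class indices.
    """
    num_members = member_logits.shape[0]
    # Per-member, per-example cross-entropy: shape (M, B).
    per_member = torch.stack([
        F.cross_entropy(member_logits[m], targets, reduction="none")
        for m in range(num_members)
    ])
    # Oracle assignment: keep only the lowest loss over members for each example.
    min_loss, _ = per_member.min(dim=0)
    return min_loss.mean()
```

Paired with the `TreeNet` sketch above, `mcl_oracle_loss(model(images), labels)` backpropagates through only the winning branch for each example, which is the mechanism behind the specialization the paper reports.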

Implications and Future Prospects

The findings have notable implications. They suggest that CNN ensembles, often perceived as computationally expensive, can be made more efficient and effective through intelligent design choices such as parameter sharing and carefully chosen loss functions. The results reinforce the view that diversity is not merely an advantageous byproduct of ensembling but a critical component that should be optimized explicitly throughout training.

Looking ahead, these advances could inform the development of even more sophisticated ensemble strategies. As datasets and models continue to grow in size and complexity, managing diversity and making optimal use of parameters will remain crucial. Furthermore, the paper's MPI-Caffe framework, which distributes and coordinates complex networks across multiple GPUs, opens substantial room for further exploration of model coupling and non-trivial ensemble structures.

In conclusion, this paper emphasizes diversity and optimization as central tenets in powerful ensemble creation, offering profound insights and methodologies that can directly impact both the practical applications and theoretical underpinnings of deep learning systems.