- The paper presents a single-generation teacher-student optimization method that uses a cyclic learning rate schedule to avoid the training cost of conventional multi-generation distillation.
- The method combines the cyclic learning rate with teacher-signal smoothing to produce sufficiently distinct teacher and student snapshots, improving accuracy on CIFAR100 and ILSVRC2012.
- The paper provides empirical evidence that networks optimized via Snapshot Distillation exhibit enhanced transferability to tasks like object detection and semantic segmentation.
Snapshot Distillation: Teacher-Student Optimization in One Generation
The paper "Snapshot Distillation: Teacher-Student Optimization in One Generation" by Yang et al. presents a method for optimizing deep neural networks in computer vision that addresses the inefficiency of traditional teacher-student (T-S) optimization. Conventional T-S optimization requires a multi-generation training process in which each student network learns from a teacher pre-trained in the previous generation, so the total training cost grows roughly in proportion to the number of generations. The authors introduce Snapshot Distillation (SD), a framework that performs T-S optimization within a single training generation, retaining the accuracy benefits of distillation while substantially reducing training time.
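As background, the standard teacher-student objective that SD builds on combines a cross-entropy term on the ground-truth labels with a divergence between the teacher's and the student's softened predictions. The formula below uses conventional knowledge-distillation notation (balance weight λ, smoothing temperature τ) rather than the paper's exact symbols:

$$
\mathcal{L}(\theta_S) = (1-\lambda)\,\mathrm{CE}\big(y,\ \sigma(z_S)\big) + \lambda\,\tau^{2}\,\mathrm{KL}\big(\sigma(z_T/\tau)\,\big\|\,\sigma(z_S/\tau)\big)
$$

Here $z_S$ and $z_T$ denote the student and teacher logits and $\sigma$ the softmax; raising τ above 1 smooths the teacher distribution, which is the "smoothing" referred to throughout this summary.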
Key Contributions
- Single Generation T-S Optimization: The main contribution is enabling T-S optimization within one generation, which prior work had not achieved. SD obtains teacher signals from earlier epochs of the same training run by exploiting a cyclic learning rate policy, which keeps the teacher and student snapshots sufficiently different so that the teacher signal remains informative, while keeping the total training cost close to that of a single run.
- Cyclic Learning Rate and Smoothing: SD uses a cyclic learning rate policy that divides training into mini-generations; the final model snapshot of each cycle acts as the teacher for the next. Teacher signals are smoothed (softened) before distillation, giving the student richer secondary information and avoiding the degeneracy that arises when teacher and student are nearly identical. A minimal training-loop sketch follows this list.
- Empirical Validation: The method was evaluated on the standard image classification benchmarks CIFAR100 and ILSVRC2012, where SD consistently outperformed direct optimization baselines with only marginal additional computational overhead. Networks pre-trained with SD also transferred better to object detection and semantic segmentation, as verified on the PascalVOC benchmarks.
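To make the mechanism concrete, here is a minimal PyTorch-style sketch of the training loop described above: training is split into mini-generations, the learning rate is cosine-annealed within each cycle, and the snapshot taken at the end of a cycle serves as the temperature-smoothed teacher for the next. The function and hyper-parameter names (`snapshot_distillation`, `num_cycles`, `temperature`, `alpha`, with `alpha` playing the role of λ in the objective above) are illustrative assumptions, not the authors' code or exact settings.

```python
import copy
import math
import torch
import torch.nn.functional as F

def snapshot_distillation(model, train_loader, total_epochs=300, num_cycles=3,
                          base_lr=0.1, temperature=2.0, alpha=0.5, device="cuda"):
    """Minimal sketch of single-generation teacher-student training.

    Training is split into `num_cycles` mini-generations. Within each one the
    learning rate is cosine-annealed from `base_lr` toward zero; the snapshot
    taken at the end of a cycle becomes the frozen teacher for the next cycle.
    Hyper-parameter values are illustrative, not the paper's settings.
    """
    model = model.to(device)
    epochs_per_cycle = total_epochs // num_cycles
    teacher = None  # no teacher during the first mini-generation

    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)

    for epoch in range(total_epochs):
        # Cyclic (cosine-annealed) learning rate, restarted every mini-generation.
        t = (epoch % epochs_per_cycle) / epochs_per_cycle
        lr = 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
        for group in optimizer.param_groups:
            group["lr"] = lr

        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = F.cross_entropy(logits, labels)

            if teacher is not None:
                with torch.no_grad():
                    teacher_logits = teacher(images)
                # Temperature-smoothed teacher signal distilled via KL divergence.
                soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
                log_soft_student = F.log_softmax(logits / temperature, dim=1)
                kd = F.kl_div(log_soft_student, soft_teacher,
                              reduction="batchmean") * temperature ** 2
                loss = (1.0 - alpha) * loss + alpha * kd

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # End of a mini-generation: snapshot the model as the next teacher.
        if (epoch + 1) % epochs_per_cycle == 0:
            teacher = copy.deepcopy(model).eval()
            for p in teacher.parameters():
                p.requires_grad_(False)

    return model
```

The cosine restarts mirror the snapshot-ensemble style of cyclic scheduling, and the KL term is scaled by τ² as in standard distillation so that gradient magnitudes stay comparable across temperatures.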
Numerical Results and Implications
The experiments demonstrate the efficacy of SD on several deep architectures, notably ResNets and DenseNets. For instance, a DenseNet trained with SD reduced the CIFAR100 error from about 22% to 21.17%, with similar trends in other configurations. On ILSVRC2012, SD achieved a top-1 error rate of 21.25% with ResNet101, which is promising given the scale and difficulty of this dataset. These results indicate that SD transfers knowledge effectively within a single training run, highlighting its potential in complex visual recognition tasks.
Theoretical and Practical Implications
Theoretically, SD offers insights into efficient training strategies by rethinking the traditional multi-generation methodology. It shows that T-S optimization can draw on supervision signals generated within the same training run, challenging the assumption that the teacher must be trained completely independently of the student. Practically, this translates into substantial computational savings, which could lower the infrastructure barrier to high-performing vision systems.
Speculation on Future Developments
Future developments could focus on reducing the basic unit of T-S optimization below the mini-generation level realized in this paper. This could entail integrating supervision from prior iterations directly into the training loss, for example by introducing higher-order gradient information, which would refine the granularity of supervision signals and push single-generation optimization to even finer time scales.
In conclusion, Snapshot Distillation presents a significant leap forward in teacher-student optimization methodologies, not only enhancing the practical deployment of efficient training regimes but also prompting a re-evaluation of theoretical foundations in neural network training optimization.