Revisiting Locally Supervised Learning: an Alternative to End-to-end Training (2101.10832v1)

Published 26 Jan 2021 in cs.CV, cs.LG, and stat.ML

Abstract: Due to the need to store intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint. This paper aims to address this problem by revisiting locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with the E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro is capable of achieving competitive performance with less than 40% of the memory footprint of E2E training, while allowing the use of higher-resolution training data or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration. Code is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch.

Authors (5)
  1. Yulin Wang (45 papers)
  2. Zanlin Ni (11 papers)
  3. Shiji Song (103 papers)
  4. Le Yang (69 papers)
  5. Gao Huang (178 papers)
Citations (78)

Summary

  • The paper presents a modular training approach with an information propagation (InfoPro) loss that retains task-relevant information while reducing the training memory footprint to less than 40% of that of end-to-end training.
  • It divides networks into gradient-isolated modules trained with local supervision, offering an efficient and parallelizable alternative to end-to-end training.
  • Empirical results on datasets like CIFAR, SVHN, and ImageNet validate the method's competitive accuracy and potential for resource-limited applications.

Revisiting Locally Supervised Learning: An Alternative to End-to-End Training

The paper presents a distinctive approach to training deep neural networks (DNNs) by reconsidering locally supervised learning as an alternative to the widely accepted end-to-end (E2E) training paradigm. The central problem addressed is the high memory footprint of back-propagation in E2E training, which must store intermediate activations for the entire network. To alleviate this, the paper revisits locally supervised learning: the network is split into gradient-isolated modules, each trained with its own local supervision.
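
A minimal PyTorch-style sketch of how such gradient isolation can be expressed follows; the module split and layer sizes are illustrative placeholders, not the authors' implementation:

    import torch
    import torch.nn as nn

    # Hypothetical split of a small CNN into three gradient-isolated modules.
    modules = nn.ModuleList([
        nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)),
    ])

    def forward_gradient_isolated(x):
        """Forward pass in which gradients never cross module boundaries."""
        outputs = []
        h = x
        for m in modules:
            h = m(h)
            outputs.append(h)   # kept for each module's local loss
            h = h.detach()      # detach: no back-propagation into earlier modules
        return outputs

Because each boundary is detached, back-propagating a local loss through one module never reaches earlier modules, which is what allows their activations to be released early and the memory footprint to shrink.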

One of the core contributions of this paper is the introduction of the information propagation (InfoPro) loss. This loss is designed to preserve task-relevant information while progressively discarding task-irrelevant information as features pass through the network. The rationale is that naively training local modules with the E2E loss causes early layers to collapse task-relevant information, thereby degrading the performance of the full model. The InfoPro loss instead encourages each module to retain the information needed by downstream layers, mitigating the issues observed with greedy local learning strategies.
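
In illustrative notation (not taken verbatim from the paper), with h a module's output, x the input, y the label, and r the task-irrelevant nuisance in x, this design goal can be summarized as a mutual-information objective of roughly the following form:

    % Schematic per-module objective: discard nuisance, retain label-relevant information.
    % \lambda_1, \lambda_2 > 0 are illustrative trade-off weights.
    \min_{\theta} \;\; \lambda_1 \, I(h;\, r) \;-\; \lambda_2 \, I(h;\, y)

Neither mutual-information term can be computed directly, which is what motivates the surrogate objective discussed next.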

Because the InfoPro loss is intractable to compute in its original form, the authors derive a feasible upper bound as a surrogate objective, which combines a reconstruction loss with a cross-entropy or contrastive term and yields a practical training algorithm. Empirical results across varied datasets (CIFAR, SVHN, STL-10, ImageNet, and Cityscapes) confirm that the method achieves competitive accuracy while using less than 40% of the GPU memory required by full E2E training. This reduction makes it possible to train with higher-resolution inputs or larger batch sizes under the same memory budget.
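
A rough sketch of such a per-module surrogate in PyTorch is shown below; the auxiliary decoder, local classifier head, loss weights, and the assumption that inputs are images scaled to [0, 1] are illustrative placeholders rather than the paper's exact design:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Illustrative auxiliary networks attached to one module's 64-channel output.
    decoder = nn.Sequential(nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())               # reconstructs the input
    aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))  # local classifier

    def local_surrogate_loss(h, x, y, lambda_recon=1.0, lambda_ce=1.0):
        """Per-module surrogate: reconstruction term + cross-entropy term.

        Minimizing the reconstruction term pushes the features h to keep
        information about the input x (countering information collapse);
        the cross-entropy term keeps h predictive of the label y.
        """
        x_hat = F.interpolate(decoder(h), size=x.shape[-2:])  # match the input resolution
        recon = F.mse_loss(x_hat, x)                          # assumes x is scaled to [0, 1]
        ce = F.cross_entropy(aux_head(h), y)
        return lambda_recon * recon + lambda_ce * ce

As the abstract notes, a contrastive term can take the place of the cross-entropy term.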

A notable aspect of this paper is that gradient isolation opens the door to asynchronous training: individual modules can be updated independently, which may reduce wall-clock training time through parallel execution. This departs from the backward locking inherent in E2E back-propagation, where no layer can update until the full backward pass reaches it, and points toward more efficient training regimens.
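
Continuing the illustrative setup above, the decoupling can be made explicit by giving each module its own optimizer and local objective; since gradients never cross module boundaries, these updates could in principle run in parallel or asynchronously:

    import torch

    # One optimizer per gradient-isolated module (a real setup would also include
    # that module's auxiliary decoder/classifier parameters in its optimizer).
    optimizers = [torch.optim.SGD(m.parameters(), lr=0.1) for m in modules]

    def decoupled_train_step(x, y, local_losses):
        """local_losses[k](h, x, y) is module k's local objective, e.g. the
        reconstruction + cross-entropy surrogate sketched earlier, sized for
        that module's output."""
        h = x
        for m, opt, local_loss in zip(modules, optimizers, local_losses):
            h = m(h)
            opt.zero_grad()
            local_loss(h, x, y).backward()  # gradients stay inside this module
            opt.step()
            h = h.detach()                  # the next module sees a gradient-isolated input
        return h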

From a theoretical perspective, the paper situates locally supervised learning within an information-theoretic framework, hypothesizing that information collapse occurs under naive local learning. It validates this hypothesis empirically by measuring the mutual information between intermediate features and the input or the labels. The InfoPro loss aims to counteract the collapse by maintaining a flow of task-relevant information as features progress through successive modules.

Practical implications of this research include more resource-efficient neural network training, particularly valuable for applications constrained by hardware limitations, such as edge computing or mobile platforms. The decoupling of module training also introduces potential for more robust, distributable learning frameworks in AI systems.

Future work may extend this approach to regression tasks and to domains beyond conventional vision benchmarks, moving toward a more broadly applicable training algorithm. Further analysis of the trade-off between local information retention and global task relevance could also deepen the understanding of how such objectives interact with neural architecture design.

In conclusion, the paper provides a robust alternative to E2E training by leveraging locally supervised modules, emphasizing efficiency, modularity, and the potential for parallel training.