InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning (2303.04947v2)

Published 8 Mar 2023 in cs.CV

Abstract: Data pruning aims to obtain lossless performances with less overall cost. A common approach is to filter out samples that make less contribution to the training. This could lead to gradient expectation bias compared to the original data. To solve this problem, we propose InfoBatch, a novel framework aiming to achieve lossless training acceleration by unbiased dynamic data pruning. Specifically, InfoBatch randomly prunes a portion of less informative samples based on the loss distribution and rescales the gradients of the remaining samples to approximate the original gradient. As a plug-and-play and architecture-agnostic framework, InfoBatch consistently obtains lossless training results on classification, semantic segmentation, vision pretraining, and instruction fine-tuning tasks. On CIFAR10/100, ImageNet-1K, and ADE20K, InfoBatch losslessly saves 40% overall cost. For MAE pretraining and diffusion model training, InfoBatch can respectively save 24.8% and 27% cost. For LLaMA instruction fine-tuning, InfoBatch is also able to save 20% cost and is compatible with coreset selection methods. The code is publicly available at github.com/NUS-HPC-AI-Lab/InfoBatch (https://github.com/henryqin1997/InfoBatch).

Citations (37)

Summary

  • The paper introduces InfoBatch, which speeds up training by dynamically pruning less informative samples while preserving model performance.
  • It employs an expectation rescaling method to maintain unbiased gradient estimates despite reduced data processing.
  • Empirical results demonstrate up to 40% cost savings on benchmark datasets and versatile applicability across CNN and Transformer models.

Examining InfoBatch: A Framework for Lossless Training Speed Up via Unbiased Dynamic Data Pruning

In contemporary deep learning, especially in computer vision, the computational demands of training state-of-the-art models on large-scale datasets pose significant challenges. This paper introduces InfoBatch, a framework designed to accelerate training through unbiased dynamic data pruning. Fundamentally, InfoBatch aims to reduce training cost without compromising performance by dynamically pruning less informative samples from the dataset and rescaling the gradients of the remaining samples.

Main Contributions and Methodology

InfoBatch capitalizes on the notion that not all data samples are equally beneficial for each training iteration. Traditional data pruning methods that statically remove samples can introduce gradient expectation biases, compromising model convergence and performance. In contrast, InfoBatch introduces a dynamic pruning strategy that preserves the original dataset's gradient expectation through an expectation rescaling technique.

The framework maintains a score for each data sample based on its loss value recorded during forward propagation. InfoBatch probabilistically prunes samples whose scores fall below a dynamically computed mean threshold, so every pruned sample can still contribute to future training iterations. This dynamic ("soft") pruning differentiates InfoBatch from static pruning approaches by adjusting the influence of samples across training epochs. Importantly, InfoBatch rescales the gradient updates of the remaining samples to compensate for the reduced number of gradient computations caused by pruning.
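
A minimal sketch of this soft-pruning-plus-rescaling step is given below. It assumes per-sample losses from the previous epoch serve as the scores and uses a fixed pruning probability; the function name soft_prune_indices and its parameters are illustrative, not the authors' released API.

import numpy as np

def soft_prune_indices(scores: np.ndarray, prune_prob: float = 0.5, rng=None):
    """Keep every sample scoring at or above the mean; keep each below-mean
    sample with probability (1 - prune_prob). Returns the kept indices and a
    per-sample gradient rescaling weight."""
    rng = rng or np.random.default_rng()
    below = scores < scores.mean()

    keep = np.ones(len(scores), dtype=bool)
    # Randomly drop a fraction `prune_prob` of the low-score samples.
    keep[below] = rng.random(below.sum()) > prune_prob

    # Rescale surviving low-score samples by 1 / (1 - prune_prob) so their
    # expected gradient contribution matches the unpruned dataset.
    weights = np.ones(len(scores))
    weights[below] = 1.0 / (1.0 - prune_prob)

    kept_idx = np.flatnonzero(keep)
    return kept_idx, weights[kept_idx]

# Toy usage: random scores stand in for last-epoch per-sample losses.
scores = np.random.rand(10_000)
kept_idx, w = soft_prune_indices(scores, prune_prob=0.5)
print(f"kept {len(kept_idx)} of {len(scores)} samples; max weight = {w.max():.2f}")

In the paper's setting, the scores are refreshed every epoch from the latest per-sample losses, which is what makes the pruning dynamic rather than a one-time filtering of the dataset.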

Theoretical and Empirical Evaluations

Theoretically, the paper shows that the proposed rescaling strategy preserves the original dataset's gradient expectation on the pruned data. This keeps the optimization objective approximately equivalent to training on the full dataset, which is how the method achieves lossless performance at reduced computational cost. The paper's comprehensive experiments support these claims, showing consistent performance across different neural architectures and tasks.
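
As a brief sketch of why the rescaling is unbiased (using our own notation, which may not match the paper's): for a low-score sample x pruned with probability r, the rescaled gradient contribution \tilde{g}(x) satisfies

\begin{align*}
\mathbb{E}\big[\tilde{g}(x)\big]
  &= \underbrace{(1 - r)}_{\text{kept}} \cdot \frac{1}{1 - r}\,\nabla_{\theta}\mathcal{L}(x)
   + \underbrace{r}_{\text{pruned}} \cdot 0 \\
  &= \nabla_{\theta}\mathcal{L}(x),
\end{align*}

while high-score samples are always kept with weight 1, so the expected gradient over the pruned dataset matches that of the full dataset.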

Empirically, InfoBatch achieves significant cost savings: up to 40% on CIFAR10/100 and ImageNet-1K, 24.8% on MAE pre-training, and 20% on LLM instruction fine-tuning. The results reveal that InfoBatch can effectively work with CNN-based architectures like ResNet as well as Transformer-based models, indicating its versatility across model types.
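
Because the rescaling only touches the per-sample loss, plugging it into an existing training loop requires little change regardless of the backbone. The sketch below shows one way this could be wired in PyTorch; model, optimizer, and the weights tensor (produced by the pruning step sketched earlier) are placeholders, and this is not the authors' released InfoBatch API.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, inputs, targets, weights):
    # One optimization step on a pruned batch. Per-sample losses are
    # multiplied by the rescaling weights before averaging, so the
    # resulting gradient stays approximately unbiased.
    optimizer.zero_grad()
    logits = model(inputs)
    per_sample_loss = F.cross_entropy(logits, targets, reduction="none")
    loss = (per_sample_loss * weights).mean()
    loss.backward()
    optimizer.step()
    # The detached losses can be fed back as next epoch's pruning scores.
    return per_sample_loss.detach()

Nothing in this step depends on whether the model is a CNN or a Transformer, which is consistent with the architecture-agnostic behavior reported in the paper.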

Implications and Future Directions

The implications of this research are notable for practitioners constrained by available computational resources. By alleviating excessive computational loads, InfoBatch presents a pragmatic solution for accelerating training without the need for additional hardware resources, democratizing access to high-performance deep learning models.

InfoBatch's compatibility with diverse tasks and architectures suggests ample opportunities for future exploration. Integrating InfoBatch with existing training paradigms, especially those using large batches, could further enhance its applicability. Moreover, adapting InfoBatch to tasks trained for only a few epochs, such as LLMs, could broaden its usability in emerging domains.

In summary, InfoBatch emerges as a promising approach to efficient deep learning, striking a practical balance between computational cost and performance. Its dynamic pruning framework is a methodical advance in data sampling strategy and leaves room for further gains in training efficiency.
