AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference (2401.10652v3)

Published 19 Jan 2024 in cs.PF, cs.DC, and cs.LG

Abstract: Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for practical serving. While existing methods mainly address parameter memory, the importance of activation memory has been overlooked. For long input sequences in particular, activation memory is expected to grow rapidly as the sequence length increases. In this paper, we propose AutoChunk, an automatic and adaptive compiler system that efficiently reduces activation memory for long sequence inference through chunk strategies. The system generates chunk plans by optimizing over multiple stages: in each stage, a chunk search pass explores all possible chunk candidates and a chunk selection pass identifies the optimal one. At runtime, AutoChunk employs code generation to automatically apply the chosen chunk strategies. Experiments demonstrate that AutoChunk reduces activation memory by over 80% while keeping the speed loss within 10%, extends the maximum sequence length by 3.2x to 11.7x, and outperforms state-of-the-art methods by a large margin.
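
The chunk strategy the abstract describes can be pictured with a minimal, illustrative sketch; it is not the paper's actual generated code, and the `chunked_forward` helper, module, and shapes below are hypothetical. The idea is to apply a position-wise module to slices of the sequence dimension so that peak activation memory scales with the chunk size rather than the full sequence length.

```python
# Minimal sketch of the chunking idea (assumed implementation, not AutoChunk's
# generated code): apply a module chunk-by-chunk along the sequence dimension
# so that only one chunk's intermediate activations are alive at a time.
import torch
import torch.nn as nn


def chunked_forward(module: nn.Module, x: torch.Tensor,
                    chunk_size: int, dim: int = 1) -> torch.Tensor:
    """Apply `module` to `x` in chunks along `dim` and concatenate the results.

    This is valid only for modules whose output at position i depends solely on
    the input at position i (e.g. position-wise feed-forward layers); attention
    would need a chunk strategy over the query dimension instead.
    """
    chunks = torch.split(x, chunk_size, dim=dim)   # views over x, no extra copy
    outputs = [module(c) for c in chunks]          # peak activations ~ one chunk
    return torch.cat(outputs, dim=dim)


if __name__ == "__main__":
    # Hypothetical shapes: batch=1, sequence length=8192, hidden size=512.
    ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    x = torch.randn(1, 8192, 512)
    with torch.no_grad():
        y_full = ffn(x)                                   # whole-sequence activations
        y_chunked = chunked_forward(ffn, x, chunk_size=1024)
    print(torch.allclose(y_full, y_chunked, atol=1e-5))   # results match
```

In this sketch the chunk size and chunked dimension are fixed by hand; AutoChunk's contribution, per the abstract, is to search over candidate chunk strategies automatically and apply the selected one via code generation at runtime.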
