The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment
Abstract: Increased focus on the computational efficiency of NLP systems has motivated the design of efficient model architectures and improvements to underlying hardware accelerators. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies can be largely attributed to bottlenecks introduced by deep learning frameworks. We denote this phenomenon as the "framework tax", and observe that the disparity is growing as hardware speed increases over time. In this work, we examine this phenomenon through a series of case studies analyzing the effects of model design decisions, framework paradigms, and hardware platforms on total model latency. Code is available at https://github.com/JaredFern/Framework-Tax.
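The gap between operation counts and wall-clock latency described in the abstract can be illustrated with a toy model. The sketch below is not the authors' method or measurement code. It assumes, purely for illustration, that latency is the larger of a fixed per-inference framework overhead and the accelerator compute time. The function name `modeled_latency` and all constants are hypothetical.

```python
def modeled_latency(flops, throughput_flops_per_s, framework_overhead_s):
    """Toy model (an assumption, not from the paper): latency is the larger
    of a fixed framework overhead and the time spent on computation."""
    compute_s = flops / throughput_flops_per_s
    return max(framework_overhead_s, compute_s)


# Hypothetical values chosen only to make the effect visible.
OVERHEAD_S = 1e-3   # fixed per-inference framework cost, in seconds
SLOW_HW = 1e12      # accelerator throughput, in FLOP/s
FAST_HW = 4e12      # an accelerator with 4x the throughput

for flops in (1e8, 1e9, 1e10, 1e11):
    slow = modeled_latency(flops, SLOW_HW, OVERHEAD_S)
    fast = modeled_latency(flops, FAST_HW, OVERHEAD_S)
    print(f"{flops:.0e} FLOPs: speedup from 4x faster hardware = {slow / fast:.2f}x")
```

Under these assumptions, small workloads see no benefit from the faster accelerator because the fixed overhead dominates, while large workloads approach the full 4x speedup. The same reasoning suggests why reducing a model's floating point operations need not reduce its latency when the model is already overhead-bound.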