Global Optimizations & Lightweight Dynamic Logic for Concurrency (2409.02227v1)
Abstract: Modern accelerators like GPUs increasingly execute independent operations concurrently to improve the device's compute utilization. However, effectively harnessing this concurrency on GPUs for important primitives such as general matrix multiplications (GEMMs) remains challenging. Although modern GPUs have significant hardware and software support for GEMMs, their kernel implementations and optimizations typically assume each kernel executes in isolation and can utilize all GPU resources. This approach is highly efficient when kernels execute in isolation, but causes significant resource contention and slowdowns when kernels execute concurrently. Moreover, current approaches often expose and control parallelism within an application only statically, without considering runtime information such as varying input sizes and concurrent applications, which often exacerbates contention. These issues limit the performance benefits of concurrently executing independent operations. Accordingly, we propose GOLDYLOC, which considers the global resources across all concurrent operations to identify performant GEMM kernels, which we call globally optimized (GO)-Kernels. Moreover, GOLDYLOC introduces a lightweight dynamic logic that considers the dynamic execution environment for available parallelism and input sizes to execute performant combinations of concurrent GEMMs on the GPU. Overall, GOLDYLOC improves performance of concurrent GEMMs on a real GPU by up to 2$\times$ (18% geomean per workload) and provides up to 2.5$\times$ (43% geomean per workload) speedups over sequential execution.