Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services (2404.16283v2)
Abstract: LLMs are now at the core of conversational AI services such as real-time translation and chatbots, which provide live user interaction by incrementally streaming text to the user. However, existing LLM serving systems fail to provide a good user experience because their optimization metrics are not always aligned with user experience. In this paper, we first introduce and define the notion of Quality-of-Experience (QoE) for text streaming services by considering each user's end-to-end interaction timeline. Based on this, we propose Andes, a QoE-aware LLM serving system that enhances user experience by ensuring that users receive the first token promptly and subsequent tokens at a smooth, digestible pace, even during surge periods. This is enabled by Andes's preemptive request scheduler, which dynamically prioritizes requests at token granularity based on each request's expected QoE gain and GPU resource usage. Our evaluations demonstrate that, compared to state-of-the-art LLM serving systems, Andes improves average QoE by up to $4.7\times$ given the same GPU resources, or saves up to 61% of GPU resources while maintaining the same high QoE.
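The abstract describes the mechanism only at a high level. The sketch below is a minimal illustration, not Andes's actual implementation: it assumes an idealized token-delivery timeline (first token within a TTFT target, then tokens at a fixed "digestible" pace), scores a request's delivered tokens against that timeline, and ranks requests by deadline urgency per unit of GPU cost. All names here (`Request`, `qoe`, `priority`, `ttft_target`, `expected_tps`, `gpu_cost`) are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of token-level, QoE-aware scheduling priority.
# Not Andes's actual algorithm; an assumed, simplified stand-in.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Request:
    arrival: float        # arrival time in seconds (illustrative field)
    gpu_cost: float       # e.g., KV-cache footprint (illustrative field)
    token_times: List[float] = field(default_factory=list)  # delivery timestamps

def qoe(req: Request, ttft_target: float, expected_tps: float) -> float:
    """Fraction of delivered tokens that met the idealized timeline:
    first token within ttft_target, then one token every 1/expected_tps s."""
    if not req.token_times:
        return 0.0
    on_time = sum(
        1 for i, t in enumerate(req.token_times)
        if t <= req.arrival + ttft_target + i / expected_tps
    )
    return on_time / len(req.token_times)

def priority(req: Request, now: float, ttft_target: float,
             expected_tps: float) -> float:
    """Requests whose next token is closest to (or past) its deadline,
    per unit of GPU cost, run first; the rest can be preempted."""
    n = len(req.token_times)
    next_deadline = req.arrival + ttft_target + n / expected_tps
    slack = next_deadline - now             # negative once already late
    return -slack / max(req.gpu_cost, 1e-9)

# Example: the in-flight request is ahead of schedule, so the new
# arrival's looming first-token deadline gives it higher priority.
now = 2.0
reqs = [
    Request(arrival=0.0, gpu_cost=1.0,
            token_times=[0.5 + 0.2 * i for i in range(10)]),
    Request(arrival=1.8, gpu_cost=1.0),
]
reqs.sort(key=lambda r: priority(r, now, ttft_target=0.5, expected_tps=5.0),
          reverse=True)
```

Dividing urgency by per-request GPU cost is one simple way to capture the abstract's tradeoff between expected QoE gain and GPU resource usage; the paper's actual scheduler may select requests differently.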