Eloquent: A More Robust Transmission Scheme for LLM Token Streaming (2401.12961v2)
Abstract: To render each generated token for users in real time, the LLM server generates tokens one by one and streams each token (or group of a few tokens) over the network to the user right after generation, which we refer to as LLM token streaming. Under unstable network conditions, however, the token streaming experience can suffer greatly from stalls, since a single packet loss can block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer increased stalls under unstable networks. For this emerging token streaming problem in LLM chatbots, which differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens into the next outgoing packet. This ensures that each packet contains some new tokens and, at the same time, can be rendered independently when received, avoiding the stalls caused by missing packets. Through simulation under various network conditions, we show that Eloquent reduces the stall ratio (the proportion of time spent waiting for tokens to render) by 71.0% compared to the retransmission method commonly used by real chatbot applications and by 31.6% compared to a baseline packet-duplication scheme. By tailoring Eloquent to the token-by-token generation of LLMs, we enable chatbots to respond like an eloquent speaker, letting users better enjoy pervasive AI.
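To make the packetization idea concrete, below is a minimal Python sketch of the mechanism as the abstract describes it: every outgoing packet carries all currently unacknowledged tokens plus the newly generated ones, so any single received packet is enough to continue rendering. This is an illustrative sketch only; the class and field names (`EloquentSender`, `EloquentReceiver`, `unacked`, `acked_upto`) are assumptions, not the authors' implementation, and the actual transport layer is omitted.

```python
# Hypothetical sketch of the Eloquent packetization idea (names are illustrative,
# not taken from the paper's implementation).

from dataclasses import dataclass, field

@dataclass
class Packet:
    seq: int                  # packet sequence number
    first_token_idx: int      # index of the first token carried in this packet
    tokens: list              # all unacknowledged tokens + newly generated tokens

@dataclass
class EloquentSender:
    next_seq: int = 0
    acked_upto: int = 0                          # tokens the receiver confirmed rendering
    unacked: list = field(default_factory=list)  # tokens sent but not yet acknowledged

    def on_new_tokens(self, new_tokens: list) -> Packet:
        """Build the next packet: every unACKed token plus the new ones."""
        self.unacked.extend(new_tokens)
        pkt = Packet(seq=self.next_seq,
                     first_token_idx=self.acked_upto,
                     tokens=list(self.unacked))
        self.next_seq += 1
        return pkt

    def on_ack(self, acked_token_count: int) -> None:
        """Drop tokens the receiver has confirmed, shrinking future packets."""
        newly_acked = acked_token_count - self.acked_upto
        if newly_acked > 0:
            self.unacked = self.unacked[newly_acked:]
            self.acked_upto = acked_token_count

class EloquentReceiver:
    def __init__(self) -> None:
        self.rendered: list = []   # tokens already shown to the user

    def on_packet(self, pkt: Packet) -> int:
        """Render independently of other packets: any packet starting at or
        before the first missing token extends the rendered text."""
        if pkt.first_token_idx <= len(self.rendered):
            new_part = pkt.tokens[len(self.rendered) - pkt.first_token_idx:]
            self.rendered.extend(new_part)
        return len(self.rendered)   # ACK value: how many tokens are rendered
```

Because every packet restarts from the last acknowledged token, losing an earlier packet never blocks the rendering of a later one; the cost is resending the unacknowledged tokens, which for text is a small per-packet overhead.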