Optimizing LLM Queries in Relational Workloads (2403.05821v1)
Abstract: Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking LLMs through native user-defined functions (UDFs) to help users perform natural language tasks, such as classification, entity extraction, and translation, inside analytical workloads. For instance, an analyst might want to extract customer sentiment from millions of product reviews. However, LLM inference is highly expensive in both computational and economic terms: for example, an NVIDIA L4 GPU running Llama2-7B can only process 6 KB of text per second. In this paper, we explore how to optimize LLM inference for analytical workloads that invoke LLMs within relational queries. We show that relational queries present novel opportunities for accelerating LLM inference, including reordering rows to maximize key-value (KV) cache reuse within the LLM inference engine, reordering columns within a row to further increase cache reuse, and deduplicating redundant inference requests. We implement these optimizations in Apache Spark, with vLLM as the model-serving backend, and achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets. To the best of our knowledge, this is the first work to explicitly address the problem of optimizing LLM invocations within SQL queries.
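The three query-level optimizations named in the abstract can be illustrated as a small request-planning step performed before prompts are sent to the inference engine. The sketch below is illustrative only and is not the paper's actual algorithm: the function names (`plan_requests`, `build_prompt`) and the cardinality/lexicographic heuristics are assumptions. It reorders columns so low-cardinality fields come first (lengthening shared prompt prefixes), sorts rows so requests with shared prefixes are adjacent (favoring KV-cache reuse in a prefix-caching engine), and collapses identical prompts into a single inference call.

```python
from collections import OrderedDict

def build_prompt(row, fields):
    # Concatenate field values in the given column order.
    return " ".join(str(row[f]) for f in fields)

def order_fields_by_cardinality(rows, fields):
    # Column reordering: fewer distinct values => earlier position,
    # so more rows share a common prompt prefix.
    return sorted(fields, key=lambda f: len({str(r[f]) for r in rows}))

def plan_requests(rows, fields):
    fields = order_fields_by_cardinality(rows, fields)
    prompts = [build_prompt(r, fields) for r in rows]
    # Row reordering: lexicographic sort groups prompts that share
    # prefixes, so a prefix-caching engine can reuse KV entries.
    order = sorted(range(len(prompts)), key=lambda i: prompts[i])
    # Deduplication: identical prompts need only one inference call;
    # the result fans back out to every original row index.
    plan = OrderedDict()
    for i in order:
        plan.setdefault(prompts[i], []).append(i)
    return plan  # prompt -> list of original row indices

rows = [
    {"category": "shoes", "review": "great fit"},
    {"category": "shoes", "review": "great fit"},   # duplicate request
    {"category": "books", "review": "dull plot"},
    {"category": "shoes", "review": "too narrow"},
]
plan = plan_requests(rows, ["category", "review"])
```

In this toy run, four rows yield only three inference calls, and the two `shoes` prompts that share a prefix end up adjacent in the plan. A real implementation would operate on Spark partitions and hand the reordered batch to the serving backend.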
Authors: Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia