
HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflow (2505.05286v1)

Published 8 May 2025 in cs.DB

Abstract: Recent advances in leveraging the agentic paradigm of LLMs utilization have significantly enhanced Text-to-SQL capabilities, enabling users without specialized database expertise to query data intuitively. However, deploying these agentic LLM-based Text-to-SQL systems in production poses substantial challenges due to their inherently multi-stage workflows, stringent latency constraints, and potentially heterogeneous GPU infrastructure in enterprise environments. Current LLM serving frameworks lack effective mechanisms for handling interdependent inference tasks, dynamic latency variability, and resource heterogeneity, leading to suboptimal performance and frequent service-level objective (SLO) violations. In this paper, we introduce HEXGEN-TEXT2SQL, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters that handle multi-tenant end-to-end queries. HEXGEN-TEXT2SQL introduces a hierarchical scheduling approach combining global workload-balanced task dispatching and local adaptive urgency-guided prioritization, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, further enhancing robustness and adaptability. Our extensive evaluation on realistic Text-to-SQL benchmarks demonstrates that HEXGEN-TEXT2SQL significantly outperforms state-of-the-art LLM serving frameworks. Specifically, HEXGEN-TEXT2SQL reduces latency deadlines by up to 1.67$\times$ (average: 1.41$\times$) and improves system throughput by up to 1.75$\times$ (average: 1.65$\times$) compared to vLLM under diverse, realistic workload conditions. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.

Authors (4)
  1. You Peng (14 papers)
  2. Youhe Jiang (13 papers)
  3. Chen Wang (600 papers)
  4. Binhang Yuan (45 papers)

Summary

Overview of Hexgen-Text2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflows

The research paper entitled "Hexgen-Text2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflows" presents a sophisticated framework targeting the efficient scheduling and execution of Text-to-SQL workflows facilitated by LLMs. The framework, named Hexgen-Text2SQL, is developed to meet the growing demand for accurate, fast Text-to-SQL conversions in enterprise environments characterized by multi-tenancy and heterogeneous GPU resources. It addresses several core challenges associated with deploying agentic LLM-based systems, including multi-stage workflows, inter-task dependencies, and stringent service-level objectives (SLOs).

Technical Contributions and Methodology

Hexgen-Text2SQL introduces a hierarchical scheduling approach combining two critical components: global workload-balanced task dispatching and local adaptive urgency-guided prioritization within each heterogeneous GPU cluster. This two-tier system significantly enhances the parallel execution and dependency management across complex Text-to-SQL tasks. Key elements of the framework include:

  1. Hierarchical Scheduling Design:
    • Global Coordination: At the global level, Hexgen-Text2SQL employs a workload-balanced dispatcher that assigns incoming LLM inference requests to appropriate GPU model instances by evaluating both their computational capabilities and current load. This dispatcher leverages a tunable hyperparameter, α, to balance the trade-off between execution time and queue load, which is dynamically adjusted through a simulation-based tuning mechanism.
    • Local Queue Management: At the local level, each model instance is equipped with an adaptive priority queue, which continuously reorders tasks based on a deadline-aware urgency metric. This mechanism ensures that tasks approaching their SLO deadlines are prioritized, allowing the system to maintain high SLO attainment rates even under heavy multi-tenant workloads. (A minimal sketch of both mechanisms follows this list.)
  2. Simulation-Based Dynamic Tuning:
    • The framework includes a simulator-driven approach to dynamically adjust critical scheduling hyperparameters, ensuring robustness across diverse workload patterns. This adaptability is crucial for optimizing system performance amid varying Text-to-SQL query complexities and resource heterogeneity. (An illustrative tuning loop is sketched after this list.)
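
To make the two-tier design concrete, the sketch below shows how a global α-weighted dispatcher and per-instance deadline-aware queues could fit together. It is a minimal Python sketch, not the authors' implementation: the exact scoring formula, the slack-based urgency metric, and all names (Request, ModelInstance, dispatch) are illustrative assumptions based on the description above.

```python
"""Minimal sketch of the two-tier scheduling described above.

Assumed, not taken from the paper: the alpha-weighted scoring formula,
the slack-based urgency metric, and all class/function names.
"""
import heapq
import time
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    stage: str                # e.g. "schema_linking" or "sql_generation"
    estimated_tokens: int     # rough size of the inference task
    deadline: float           # absolute SLO deadline (epoch seconds)


@dataclass
class ModelInstance:
    name: str
    tokens_per_sec: float     # capability of this (possibly heterogeneous) GPU instance
    queue: list = field(default_factory=list)  # local priority queue (min-heap)
    queued_tokens: int = 0    # simple proxy for current load

    def estimated_exec_time(self, req: Request) -> float:
        return req.estimated_tokens / self.tokens_per_sec

    def urgency(self, req: Request, now: float) -> float:
        # Deadline-aware urgency: the less slack a request has left after
        # accounting for its remaining work, the sooner it should run.
        slack = req.deadline - now - self.estimated_exec_time(req)
        return -slack         # heapq pops the smallest key, so negate slack

    def enqueue(self, req: Request, now: float) -> None:
        heapq.heappush(self.queue, (self.urgency(req, now), req.request_id, req))
        self.queued_tokens += req.estimated_tokens


def dispatch(req: Request, instances: list[ModelInstance], alpha: float) -> ModelInstance:
    """Global workload-balanced dispatch: pick the instance that minimizes an
    alpha-weighted mix of estimated execution time and current queue load."""
    now = time.time()

    def score(inst: ModelInstance) -> float:
        exec_time = inst.estimated_exec_time(req)
        queue_delay = inst.queued_tokens / inst.tokens_per_sec
        return alpha * exec_time + (1.0 - alpha) * queue_delay

    target = min(instances, key=score)
    target.enqueue(req, now)
    return target
```

In this toy formulation, α near 1 favors the instance that would execute the request fastest, while α near 0 favors the least-loaded queue; the paper tunes this trade-off automatically rather than fixing it by hand, which the next sketch approximates.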

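A rough picture of how the simulator-driven tuning in item 2 could work: replay a sampled workload trace through the dispatcher for several candidate α values and keep the one with the highest SLO attainment. This is a hedged sketch that reuses the hypothetical dispatch and ModelInstance names from the block above; the paper's actual simulator, objective, and search procedure may differ.

```python
# Illustrative only: the candidate grid, the crude queueing model, and the
# choice of SLO attainment as the objective are assumptions.
import time

ALPHA_CANDIDATES = (0.1, 0.3, 0.5, 0.7, 0.9)


def slo_attainment(trace, make_instances, alpha: float) -> float:
    """Fraction of simulated requests that finish before their deadline."""
    instances = make_instances()          # fresh ModelInstance objects per run
    now = time.time()
    busy_until = {inst.name: now for inst in instances}
    met = 0
    for req in trace:                     # trace: list of Request in arrival order
        inst = dispatch(req, instances, alpha)
        finish = busy_until[inst.name] + inst.estimated_exec_time(req)
        busy_until[inst.name] = finish
        if finish <= req.deadline:
            met += 1
    return met / max(len(trace), 1)


def tune_alpha(trace, make_instances, candidates=ALPHA_CANDIDATES) -> float:
    """Keep the alpha value with the best simulated SLO attainment."""
    return max(candidates, key=lambda a: slo_attainment(trace, make_instances, a))
```
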
Empirical Evaluation

The performance of Hexgen-Text2SQL is validated through extensive experiments using realistic Text-to-SQL benchmarks. The results are significant, showcasing the framework’s superiority over existing state-of-the-art solutions. Notably, Hexgen-Text2SQL achieves:

  • Latency Deadline Reduction: Up to 1.67x (average 1.41x) reduction in latency deadlines compared to vLLM, highlighting its efficiency in meeting stringent latency requirements.
  • Throughput Improvements: Increase in system throughput by up to 1.75x (average 1.65x), demonstrating the framework’s ability to maximize resource utilization.
  • Improved SLO Compliance: Consistent meeting of strict service-level objectives, affirming the system’s practical utility in enterprise environments.

Theoretical and Practical Implications

The development and deployment of Hexgen-Text2SQL mark a significant advancement in the field of Text-to-SQL interfaces facilitated by LLMs. The framework's design paves the way for more efficient and reliable AI-driven database interactions, raising the bar for LLM application in real-time enterprise settings. The hierarchical scheduling approach, with its emphasis on adaptive prioritization and dynamic tuning, provides a robust model that can be extended beyond Text-to-SQL workflows to other LLM-based applications with similar architecture and serving constraints.

Future Directions

Future research directions suggested by the authors include exploring more refined approaches to workload prediction and queuing, extending the adaptive scheduling techniques to broader LLM applications, and further enhancing the simulator-driven tuning process for real-time optimizations. This work not only contributes to advancing LLM infrastructure optimization but also serves as a foundation for future innovations in adaptive, agentic AI systems.
