Hexgen-Text2SQL: Scalable Agentic Text-to-SQL Scheduling
- Hexgen-Text2SQL is a framework that enables efficient, SLO-compliant execution of multi-stage Text-to-SQL workflows on diverse, heterogeneous GPU clusters.
- It employs a hierarchical scheduling scheme that combines global workload-balanced dispatching with local urgency-guided prioritization to reduce deadline misses.
- Simulation-based hyperparameter tuning and ablation analyses demonstrate significant throughput gains and tighter SLO attainment compared to existing serving frameworks.
Hexgen-Text2SQL is a serving framework targeting the efficient scheduling and execution of agentic multi-stage Text-to-SQL workflows leveraging LLMs on heterogeneous GPU clusters. The system addresses the unique architectural and operational challenges stemming from multi-tenant, end-to-end Text-to-SQL pipelines in production, characterized by sequential inter-stage dependencies, fine-grained parallelism, stringent latency constraints, and infrastructure heterogeneity. It introduces a hierarchical scheduling methodology, simulation-based hyperparameter tuning, and achieves significant reductions in deadline miss rates and improvements in throughput relative to state-of-the-art serving frameworks (Peng et al., 8 May 2025).
1. System Architecture
Hexgen-Text2SQL orchestrates LLM-based Text-to-SQL workflows in a two-tier structure on GPU clusters. Each user query proceeds through four main stages as informed by the CHESS agentic pipeline:
- Schema Linking: Mapping user question entities to specific database tables and columns.
- SQL Candidate Generation: Initiating multiple parallel LLM invocations with varied prompts to produce SQL candidates.
- Self-Correction: Iteratively re-invoking the LLM (up to 10 iterations) to remediate errors in SQL execution.
- Evaluation: Generating unit tests and selecting the best SQL statement.
These stages form a sequential dependency chain, so each downstream stage must await its predecessors' outputs, while also exposing intra-stage parallelism (e.g., multiple candidate generations and self-correction passes executed concurrently).
The framework's core components include:
- Global Coordinator: Maintains per-query task dependency graphs and SLO budgets, dispatches sub-tasks across model instances.
- Model Instances: Each pinned to a GPU (Triton/vLLM-style endpoints) and equipped with a local, adaptive priority queue governing inference task execution.
- Monitoring & Feedback: Continuously updates remaining SLO budgets and activates downstream tasks upon upstream completion.
The lifecycle of a query entails constructing its task dependency graph, assigning a global SLO budget, and sequentially triggering schema linking, candidate generation, self-correction, and evaluation, with each step managed according to dependency and timing constraints.
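The staged lifecycle above can be sketched as a small task graph. This is an illustrative sketch, not the paper's implementation; the `Task` structure, field names, and the helper functions are assumptions chosen to mirror the CHESS stage ordering described in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)  # upstream tasks that must finish first
    done: bool = False

def build_query_graph(n_candidates: int = 4) -> list:
    """Schema linking -> parallel candidate generation -> self-correction -> evaluation."""
    link = Task("schema_linking")
    candidates = [Task(f"candidate_{i}", deps=[link]) for i in range(n_candidates)]
    corrections = [Task(f"self_correct_{i}", deps=[c]) for i, c in enumerate(candidates)]
    evaluate = Task("evaluation", deps=corrections)
    return [link, *candidates, *corrections, evaluate]

def ready_tasks(tasks: list) -> list:
    """Tasks whose dependencies have all completed become dispatchable."""
    return [t for t in tasks if not t.done and all(d.done for d in t.deps)]
```

Initially only schema linking is dispatchable; once it completes, all candidate-generation subtasks become ready at once, which is the intra-stage parallelism the global coordinator exploits.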
2. Hierarchical Scheduling Scheme
Hexgen-Text2SQL adopts a two-level scheduling paradigm combining:
- Global Workload-Balanced Dispatching
- Local Urgency-Guided Prioritization
2.1 Global Workload-Balanced Dispatching
For each subtask, the global scheduler selects the model instance that maximizes the probability the overall query completes within its SLO. Because this assignment problem is NP-hard, Hexgen-Text2SQL implements a heuristic scoring strategy:
- Predicted Computation Cost: the estimated prefill-plus-decode time on a given instance, derived from the subtask's input length and predicted output length.
- Queueing Cost: the expected waiting time implied by the instance's current queue backlog.
- Suitability Score: a weighted combination of the two costs, with one hyperparameter trading off "fast execute" against "light load" and a second rescaling queue effects.
Each subtask is dispatched to the instance with the highest suitability score.
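A minimal sketch of this dispatch heuristic follows. The linear per-token cost model, the parameter names (`alpha`, `lam`), and the particular weighted-sum form of the score are assumptions, since the source describes the scoring only qualitatively.

```python
# Illustrative workload-balanced dispatch; cost model and weights are assumed.

def predicted_cost(inst: dict, l_in: int, l_out: int) -> float:
    """Prefill time grows with input length, decode time with output length."""
    return inst["prefill_ms_per_tok"] * l_in + inst["decode_ms_per_tok"] * l_out

def queue_cost(inst: dict) -> float:
    """Expected wait: sum of predicted costs of tasks already enqueued."""
    return sum(inst["queue"])

def suitability(inst: dict, l_in: int, l_out: int,
                alpha: float = 0.6, lam: float = 1.0) -> float:
    """Higher is better: alpha trades off fast execution vs. light load,
    lam rescales queue effects."""
    return -(alpha * predicted_cost(inst, l_in, l_out)
             + (1 - alpha) * lam * queue_cost(inst))

def dispatch(instances: list, l_in: int, l_out: int, alpha: float = 0.6) -> dict:
    """Send the subtask to the instance with the maximal suitability score."""
    best = max(instances, key=lambda m: suitability(m, l_in, l_out, alpha))
    best["queue"].append(predicted_cost(best, l_in, l_out))
    return best
```

With this scoring, a fast GPU carrying a long backlog can lose to a slower but idle GPU, which is exactly the load-balancing behavior the global coordinator targets.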
2.2 Local Adaptive Urgency-Guided Prioritization
Once a subtask is enqueued at a model instance, its priority is dynamically determined by an urgency function:
- Subtask Time Allocation: a portion of the query's remaining SLO budget is allocated to the subtask.
- Urgency Definition: the subtask's queue waiting time measured against its allocated budget. Higher urgency signifies greater risk of deadline violation, and thus higher execution precedence.
The queue always schedules the subtask with the highest urgency for inference execution on the GPU.
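The local policy can be sketched as follows. The proportional budget-splitting rule and the ratio form of the urgency function are assumptions consistent with, but not stated explicitly in, the summary above.

```python
# Sketch of a local urgency-guided queue (allocation rule is an assumption).

def allocated_time(remaining_budget: float, my_cost: float,
                   total_remaining_cost: float) -> float:
    """Assumed rule: split the query's remaining SLO budget across remaining
    stages in proportion to their predicted cost."""
    return remaining_budget * my_cost / total_remaining_cost

def urgency(wait_time: float, allocation: float) -> float:
    """Waiting time relative to the subtask's allocated slice of the SLO
    budget; values near or above 1 signal imminent deadline risk."""
    return wait_time / allocation

def pick_next(queue: list, now: float) -> dict:
    """Run the enqueued subtask with the highest urgency."""
    return max(queue, key=lambda t: urgency(now - t["enqueued_at"], t["allocation"]))
```

Unlike FCFS, this lets a recently arrived subtask with a tight allocation jump ahead of an older one with ample slack, mitigating head-of-line blocking.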
3. Simulation-Based Hyperparameter Tuning
The principal dispatch hyperparameter (the weight trading off execution speed against load in the suitability score) is optimized online via a low-overhead, trace-driven CPU simulator:
- Warm-up: Collect workflow traces (arrival, queue, durations) for 100 seconds.
- Simulation: Replay the trace under candidate values.
- Selection: Choose the candidate value that minimizes the simulated end-to-end latency.
- Sweep: Coarse grid search (step 0.2), followed by a finer sweep (step 0.1) around the optimum.
- Retuning: After each sliding window, compare system latency using a one-sided t-test, triggering re-tuning if significant degradation is detected.
Tuning windows incur negligible overhead (115–158 s on CPU) relative to workload shifts, which occur on the order of hours.
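The coarse-then-fine sweep can be sketched as below. The `simulate` function here is a toy stand-in for the paper's trace-replay simulator, and the parameter name `alpha` is an assumption; only the grid steps (0.2 coarse, 0.1 fine) come from the description above.

```python
# Trace-driven hyperparameter sweep; `simulate` is a toy stand-in for the
# CPU replay simulator described in the text.

def simulate(trace: list, alpha: float) -> float:
    """Toy end-to-end latency model with its minimum at alpha = 0.6."""
    return sum(t * (1.0 + (alpha - 0.6) ** 2) for t in trace)

def sweep(trace: list) -> float:
    # Coarse grid over [0, 1] with step 0.2.
    coarse = [i / 5 for i in range(6)]  # 0.0, 0.2, ..., 1.0
    best = min(coarse, key=lambda a: simulate(trace, a))
    # Finer pass with step 0.1 around the coarse optimum.
    fine = [best + d for d in (-0.1, 0.0, 0.1) if 0.0 <= best + d <= 1.0]
    return min(fine, key=lambda a: simulate(trace, a))
```

Because the simulator replays a recorded trace rather than issuing real inference, the whole sweep runs on CPU and can be repeated whenever the retuning test detects drift.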
4. Empirical Evaluation
4.1 Experimental Setup
Key parameters for benchmarking include:
- Workflow: CHESS agent pipeline with Llama3.1-70B LLM.
- Workload Traces: BIRD-bench (finance, F1, mixed domains).
- Hardware Topologies: Heterogeneous clusters—
- Hetero-1: A100 + A6000
- Hetero-2: A100 + L40 + A6000
- Arrival Rates: Poisson arrivals at 0.5 qps and 1.0 qps.
- Baseline Comparator: vLLM framework using round-robin dispatch and FCFS local queue.
- Metrics: SLO attainment for 95% and 99% of requests under scaled deadlines; sustained throughput (qps).
4.2 Performance Results
Hexgen-Text2SQL demonstrates:
- SLO Attainment: Substantially tighter deadlines met at both the 95% and 99% attainment levels relative to vLLM.
- Throughput: Consistent throughput improvements over vLLM.
4.3 Ablation Analysis
- Workload-Balanced (WB) vs. Round-Robin (RR) Dispatch: WB attains the 95% SLO under tighter deadlines than RR.
- Urgency Queue vs. FCFS Queue: Adding local urgency prioritization further tightens the achievable 95% SLO.
5. Design Insights and Limitations
Hexgen-Text2SQL’s effectiveness derives from explicit modeling of multi-stage workflow dependencies, allowing for:
- Elimination of Idle Gaps: Synchronizing task execution reduces resource underutilization and redundant computations.
- Heterogeneity-Aware Task Assignment: Tailoring subtask scheduling to underlying GPU capabilities maximizes computation throughput and system efficiency.
- Urgency-Driven Local Queues: Prioritize tasks most at risk of SLO violation, mitigating head-of-line blocking.
- Simulation-Driven Tuning: Rapid, workload-responsive optimization of the core dispatch policy parameter.
Limitations and Prospects:
- The empirical computation cost model (prefill plus decode time) could be superseded by ML-based predictors for finer granularity.
- Extending to other LLM multi-stage pipelines (e.g., multimodal chains) would require richer and more general dependency graph modeling.
- Reinforcement learning or multi-armed bandit algorithms are suggested as alternatives to simulation sweeps for faster convergence.
- The potential use of preemption and state-swapping mechanisms to accommodate abrupt high-priority workload arrivals remains an open direction (Peng et al., 8 May 2025).
6. Impact and Future Directions
Hexgen-Text2SQL establishes a methodology for robust, SLO-compliant, and high-throughput serving of complex, multi-stage agentic Text-to-SQL workflows on heterogeneous GPU infrastructure. Its two-level scheduling—the combination of workload-balanced dispatching and urgency-based local priorities—coupled with simulation-based hyperparameter tuning, addresses critical production challenges in LLM-powered database querying. Prospective research avenues include more expressive cost-prediction models, broader generalization to multi-modal or other agentic LLM chains, dynamic adaptation beyond grid-based simulation, and principled integration of preemptive scheduling for extreme deadline sensitivity.