Hexgen-Text2SQL: Scalable Agentic Text-to-SQL Scheduling

Updated 26 January 2026
  • Hexgen-Text2SQL is a framework that enables efficient, SLO-compliant execution of multi-stage Text-to-SQL workflows on diverse, heterogeneous GPU clusters.
  • It employs a hierarchical scheduling scheme that combines global workload-balanced dispatching with local urgency-guided prioritization to reduce deadline misses.
  • Simulation-based hyperparameter tuning and ablation analyses demonstrate significant throughput gains and tighter SLO attainment compared to existing serving frameworks.

Hexgen-Text2SQL is a serving framework targeting the efficient scheduling and execution of agentic multi-stage Text-to-SQL workflows leveraging LLMs on heterogeneous GPU clusters. The system addresses the unique architectural and operational challenges stemming from multi-tenant, end-to-end Text-to-SQL pipelines in production, characterized by sequential inter-stage dependencies, fine-grained parallelism, stringent latency constraints, and infrastructure heterogeneity. It introduces a hierarchical scheduling methodology, simulation-based hyperparameter tuning, and achieves significant reductions in deadline miss rates and improvements in throughput relative to state-of-the-art serving frameworks (Peng et al., 8 May 2025).

1. System Architecture

Hexgen-Text2SQL orchestrates LLM-based Text-to-SQL workflows in a two-tier structure on GPU clusters. Each user query proceeds through four main stages as informed by the CHESS agentic pipeline:

  1. Schema Linking: Mapping user question entities to specific database tables and columns.
  2. SQL Candidate Generation: Initiating multiple parallel LLM invocations with varied prompts to produce SQL candidates.
  3. Self-Correction: Iteratively re-invoking the LLM (up to 10 iterations) to remediate errors in SQL execution.
  4. Evaluation: Generating unit tests and selecting the optimal SQL statement.

These stages exhibit sequential dependencies (each downstream stage must await its predecessors' outputs) alongside intra-stage parallelism (e.g., multiple candidate generations and self-correction passes executed concurrently).

The framework's core components include:

  • Global Coordinator: Maintains per-query task dependency graphs and SLO budgets, dispatches sub-tasks across model instances.
  • Model Instances: Each pinned to a GPU (Triton/vLLM-style endpoints) and equipped with a local, adaptive priority queue governing inference task execution.
  • Monitoring & Feedback: Continuously updates remaining SLO budgets and activates downstream tasks upon upstream completion.

The lifecycle of a query $Q$ entails constructing its tasks $\{q_1,\ldots,q_n\}$, assigning a global SLO $T^{\text{SLO}}$, and sequentially triggering schema linking, candidate generation, self-correction, and evaluation, with each step managed according to dependency and timing constraints.
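As an illustration, the per-query task graph described above can be sketched as follows. This is a minimal sketch: the `Task` class, stage names, and a fan-out of four SQL candidates are illustrative assumptions, not the system's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    stage: str
    deps: list = field(default_factory=list)  # tasks that must finish first


def build_query_tasks(num_candidates: int = 4) -> list:
    """Build the sequential/parallel task graph for one Text-to-SQL query."""
    schema = Task("schema_linking")
    # Candidate generation fans out in parallel; each depends on schema linking.
    candidates = [Task("sql_candidate", deps=[schema]) for _ in range(num_candidates)]
    # Each candidate gets its own self-correction loop (up to 10 LLM re-invocations).
    corrections = [Task("self_correction", deps=[c]) for c in candidates]
    # Evaluation waits on all corrected candidates.
    evaluation = Task("evaluation", deps=corrections)
    return [schema, *candidates, *corrections, evaluation]


tasks = build_query_tasks(num_candidates=4)
ready = [t for t in tasks if not t.deps]  # only schema linking can start first
```

The global coordinator would hold one such graph per in-flight query and activate a task once all of its `deps` have completed.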

2. Hierarchical Scheduling Scheme

Hexgen-Text2SQL adopts a two-level scheduling paradigm combining:

  • Global Workload-Balanced Dispatching
  • Local Urgency-Guided Prioritization

2.1 Global Workload-Balanced Dispatching

The global scheduler aims, for each subtask $q_{i,j}$, to select a model instance $m \in M$ that maximizes the probability that the overall query completes within its SLO:

$$\max_{\phi:\{q\}\to M}\;\Pr\Bigl(\sum_{(i,j)} t_{i,j}^{\phi(q_{i,j})} \le T^{\text{SLO}}_i\Bigr)$$

Because this assignment problem is NP-hard, Hexgen-Text2SQL implements a heuristic scoring strategy:

  • Predicted Computation Cost:

$$t^{m}_{\text{comp},i,j} = t^{m}_{\text{prefill}}\bigl(L_{\text{in}}(q_{i,j})\bigr) + t^{m}_{\text{decode}}\bigl(\widehat{L}_{\text{out}}(q_{i,j})\bigr)$$

where $L_{\text{in}}$ and $\widehat{L}_{\text{out}}$ denote the input length and the predicted output length.

  • Queueing Cost:

$$t^{m}_{\text{queue},i,j} = \sum_{q' \in \Theta^m} t^{m}_{\text{comp}}(q')$$

where $\Theta^m$ is the set of subtasks already queued at instance $m$.

  • Suitability Score for each pair $(q, m)$:

$$\mathrm{Score}(q_{i,j}, m) = (1-\alpha)\,\frac{\beta}{t^{m}_{\text{queue},i,j}} - \alpha\, t^{m}_{\text{comp},i,j}$$

with $\alpha \in [0,1]$ trading off “fast execution” against “light load,” and $\beta$ rescaling the queueing term.

Subtasks are dispatched to the instance maximizing this suitability score.
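A minimal sketch of this dispatch rule follows. The instance names and cost numbers are hypothetical; a real deployment would use profiled prefill/decode latencies per GPU type.

```python
def suitability_score(t_comp: float, t_queue: float, alpha: float, beta: float = 1.0) -> float:
    """Score(q, m) = (1 - alpha) * beta / t_queue - alpha * t_comp."""
    return (1.0 - alpha) * beta / t_queue - alpha * t_comp


def dispatch(comp_costs: dict, queue_costs: dict, alpha: float) -> str:
    """Pick the instance m that maximizes the suitability score for one subtask."""
    return max(
        comp_costs,
        key=lambda m: suitability_score(comp_costs[m], queue_costs[m], alpha),
    )


# Hypothetical per-instance predictions (seconds): the A100 computes
# faster but currently carries a longer queue than the A6000.
t_comp = {"A100": 1.0, "A6000": 2.5}
t_queue = {"A100": 6.0, "A6000": 1.0}

print(dispatch(t_comp, t_queue, alpha=0.9))  # execution-speed dominated -> "A100"
print(dispatch(t_comp, t_queue, alpha=0.1))  # queue-load dominated -> "A6000"
```

Note how $\alpha$ shifts the decision: a high $\alpha$ favors the fastest executor, while a low $\alpha$ favors the least-loaded instance, which is why tuning it online (Section 3) matters.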

2.2 Local Adaptive Urgency-Guided Prioritization

Once a subtask is enqueued at model instance $m$, its priority is dynamically determined by an urgency function:

  • Subtask Time Allocation

$$t^{\text{SLO}}_{i,j} = \bigl(T^{\text{SLO}}_i - \tau^{i}_{\text{elapsed}}\bigr) \times \frac{\overline{t}_{\text{comp},i,j}}{\sum_{k=j}^{n_i} \overline{t}_{\text{comp},i,k}}$$

  • Urgency Definition

$$U_{i,j} = t^{m}_{\text{comp},i,j} - \bigl(t^{\text{SLO}}_{i,j} - \tau_{i,j}\bigr)$$

where $\tau_{i,j}$ is the time $q_{i,j}$ has waited in the queue. A higher $U_{i,j}$ signifies a higher risk of deadline violation and thus higher execution precedence.

The queue always schedules the subtask with the highest urgency for inference execution on the GPU.
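The budget allocation and max-urgency queue above can be sketched with Python's `heapq`. This is a toy illustration under stated assumptions: the task names and timing values are invented, and a min-heap on negated urgency stands in for the instance's adaptive priority queue.

```python
import heapq


def slo_budget(T_slo: float, elapsed: float, avg_costs: list, j: int) -> float:
    """Allocate the remaining SLO to subtask j in proportion to its mean cost."""
    remaining = T_slo - elapsed
    return remaining * avg_costs[j] / sum(avg_costs[j:])


def urgency(t_comp: float, t_slo_ij: float, waited: float) -> float:
    """U = predicted compute time minus remaining slack; higher = more urgent."""
    return t_comp - (t_slo_ij - waited)


# Max-urgency queue implemented as a min-heap on negated urgency.
queue = []
heapq.heappush(queue, (-urgency(2.0, 5.0, 1.0), "q_a"))  # U = -2.0 (has slack)
heapq.heappush(queue, (-urgency(3.0, 4.0, 2.5), "q_b"))  # U =  1.5 (at risk)
_, next_task = heapq.heappop(queue)
print(next_task)  # "q_b" runs first: it is closest to violating its deadline
```

Here `q_b` preempts `q_a` in queue order even though it arrived later, which is exactly the head-of-line-blocking mitigation the paper attributes to urgency-guided prioritization.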

3. Simulation-Based Hyperparameter Tuning

The principal dispatch hyperparameter $\alpha$ is optimized online via a low-overhead, trace-driven CPU simulator:

  1. Warm-up: Collect workflow traces (arrival, queue, durations) for 100 seconds.
  2. Simulation: Replay the trace under candidate α\alpha values.
  3. Selection: Set

$$\alpha^* = \arg\min_{\alpha \in [0,1]} \frac{1}{N} \sum_{i=1}^{N} T_i(\alpha)$$

where $T_i(\alpha)$ denotes the simulated end-to-end latency of query $i$.

  4. Sweep: Coarse grid search (step 0.2), followed by a fine sweep (step 0.1) around the optimum.
  5. Retuning: After each sliding window, compare system latency using a one-sided t-test ($p < 0.01$); re-tuning is triggered if significant degradation is detected.

Tuning windows incur negligible overhead (115–158 s on CPU) relative to workload shifts, which occur on the order of hours.
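The coarse-then-fine sweep can be sketched as follows. This is a simplified stand-in: `simulate` here is a toy convex latency curve, whereas the real system replays collected workflow traces through its CPU simulator.

```python
def tune_alpha(simulate, coarse_step: float = 0.2, fine_step: float = 0.1) -> float:
    """Coarse grid over [0, 1], then a fine sweep around the coarse optimum."""
    coarse = [round(i * coarse_step, 2) for i in range(int(1 / coarse_step) + 1)]
    best = min(coarse, key=simulate)  # coarse winner
    fine = [
        round(best + d, 2)
        for d in (-fine_step, 0.0, fine_step)
        if 0.0 <= best + d <= 1.0
    ]
    return min(fine, key=simulate)


# Toy stand-in for trace replay: mean latency minimized near alpha = 0.7.
mean_latency = lambda a: (a - 0.7) ** 2 + 3.0
print(tune_alpha(mean_latency))  # -> 0.7
```

In the deployed system, `simulate` would replay the most recent 100 s trace under each candidate $\alpha$, and the sweep would re-run only when the t-test detects latency degradation.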

4. Empirical Evaluation

4.1 Experimental Setup

Key parameters for benchmarking include:

  • Workflow: CHESS agent pipeline with Llama3.1-70B LLM.
  • Workload Traces: BIRD-bench (finance, F1, mixed domains).
  • Hardware Topologies: Heterogeneous clusters—
    • Hetero-1: 2× A100 + 2× A6000
    • Hetero-2: 2× A100 + L40 + A6000
  • Arrival Rates: Poisson arrivals at 0.5 qps and 1.0 qps.
  • Baseline Comparator: vLLM framework using round-robin dispatch and FCFS local queue.
  • Metrics: Tightest scaled deadline at which 95% and 99% of requests meet their SLO; sustained throughput (qps).

4.2 Performance Results

Hexgen-Text2SQL demonstrates:

  • SLO Attainment: Up to 1.67× (average 1.41×) tighter deadlines at 95% attainment, and up to 1.60× (average 1.35×) at 99%, relative to vLLM.
  • Throughput: 1.57×–1.75× improvement (average 1.65×) over vLLM.

| Framework | Latency SLO (95%) | Throughput |
| --- | --- | --- |
| vLLM | $T_{\text{base}}$ | $X_{\text{base}}$ |
| Hexgen-Text2SQL | $T_{\text{base}}/1.41$ | $1.65\,X_{\text{base}}$ |

4.3 Ablation Analysis

  • Workload-Balanced (WB) vs. Round-Robin (RR) Dispatch: WB yields up to 1.38× (average 1.18×) better 95% SLO attainment than RR.
  • Urgency Queue vs. FCFS Queue: Adding local urgency prioritization delivers up to 1.5× (average 1.2×) speed-up in 95% SLO attainment.

5. Design Insights and Limitations

Hexgen-Text2SQL’s effectiveness derives from explicit modeling of multi-stage workflow dependencies, allowing for:

  • Elimination of Idle Gaps: Synchronizing task execution reduces resource underutilization and redundant computations.
  • Heterogeneity-Aware Task Assignment: Tailoring subtask scheduling to underlying GPU capabilities maximizes computation throughput and system efficiency.
  • Urgency-Driven Local Queues: Prioritize tasks most at risk of SLO violation, mitigating head-of-line blocking.
  • Simulation-Driven Tuning: Rapid, workload-responsive optimization of the core dispatch policy parameter α\alpha.

Limitations and Prospects:

  • The empirical computation cost model (prefill plus decode time) could be superseded by ML-based predictors for finer granularity.
  • Extending to other LLM multi-stage pipelines (e.g., multimodal chains) would require richer and more general dependency graph modeling.
  • Reinforcement learning or multi-armed bandit algorithms are suggested as alternatives to simulation sweeps for faster α\alpha convergence.
  • The potential use of preemption and state-swapping mechanisms to accommodate abrupt high-priority workload arrivals remains an open direction (Peng et al., 8 May 2025).

6. Impact and Future Directions

Hexgen-Text2SQL establishes a methodology for robust, SLO-compliant, and high-throughput serving of complex, multi-stage agentic Text-to-SQL workflows on heterogeneous GPU infrastructure. Its two-level scheduling—the combination of workload-balanced dispatching and urgency-based local priorities—coupled with simulation-based hyperparameter tuning, addresses critical production challenges in LLM-powered database querying. Prospective research avenues include more expressive cost-prediction models, broader generalization to multi-modal or other agentic LLM chains, dynamic adaptation beyond grid-based simulation, and principled integration of preemptive scheduling for extreme deadline sensitivity.
