Hexgen-Text2SQL: Scalable Agentic Text-to-SQL Scheduling

Updated 26 January 2026
  • Hexgen-Text2SQL is a framework that enables efficient, SLO-compliant execution of multi-stage Text-to-SQL workflows on diverse, heterogeneous GPU clusters.
  • It employs a hierarchical scheduling scheme that combines global workload-balanced dispatching with local urgency-guided prioritization to reduce deadline misses.
  • Simulation-based hyperparameter tuning and ablation analyses demonstrate significant throughput gains and tighter SLO attainment compared to existing serving frameworks.

Hexgen-Text2SQL is a serving framework targeting the efficient scheduling and execution of agentic multi-stage Text-to-SQL workflows leveraging LLMs on heterogeneous GPU clusters. The system addresses the unique architectural and operational challenges stemming from multi-tenant, end-to-end Text-to-SQL pipelines in production, characterized by sequential inter-stage dependencies, fine-grained parallelism, stringent latency constraints, and infrastructure heterogeneity. It introduces a hierarchical scheduling methodology, simulation-based hyperparameter tuning, and achieves significant reductions in deadline miss rates and improvements in throughput relative to state-of-the-art serving frameworks (Peng et al., 8 May 2025).

1. System Architecture

Hexgen-Text2SQL orchestrates LLM-based Text-to-SQL workflows in a two-tier structure on GPU clusters. Each user query proceeds through four main stages as informed by the CHESS agentic pipeline:

  1. Schema Linking: Mapping user question entities to specific database tables and columns.
  2. SQL Candidate Generation: Initiating multiple parallel LLM invocations with varied prompts to produce SQL candidates.
  3. Self-Correction: Iteratively re-invoking the LLM (up to 10 iterations) to remediate errors in SQL execution.
  4. Evaluation: Generating unit tests and selecting the optimal SQL statement.

These stages exhibit sequential dependencies (each downstream stage must await its predecessors' outputs) alongside intra-stage parallelism (e.g., multiple candidate generations and self-correction passes executed concurrently).

The framework's core components include:

  • Global Coordinator: Maintains per-query task dependency graphs and SLO budgets, dispatches sub-tasks across model instances.
  • Model Instances: Each pinned to a GPU (Triton/vLLM-style endpoints) and equipped with a local, adaptive priority queue governing inference task execution.
  • Monitoring & Feedback: Continuously updates remaining SLO budgets and activates downstream tasks upon upstream completion.

The lifecycle of a query $Q$ entails constructing its tasks $\{q_1,\ldots,q_n\}$, assigning a global SLO $T^{\text{SLO}}$, and sequentially triggering schema linking, candidate generation, self-correction, and evaluation, with each step managed according to dependency and timing constraints.
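As an illustration, the per-query task graph described above can be sketched as follows. This is a minimal sketch: the `Task` class, stage names, and a fan-out of four SQL candidates are illustrative assumptions, not the system's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    stage: str
    deps: list = field(default_factory=list)  # tasks that must finish first


def build_query_tasks(num_candidates: int = 4) -> list:
    """Build the sequential/parallel task graph for one Text-to-SQL query."""
    schema = Task("schema_linking")
    # Candidate generation fans out in parallel; each depends on schema linking.
    candidates = [Task("sql_candidate", deps=[schema]) for _ in range(num_candidates)]
    # Each candidate gets its own self-correction loop (up to 10 LLM re-invocations).
    corrections = [Task("self_correction", deps=[c]) for c in candidates]
    # Evaluation waits on all corrected candidates.
    evaluation = Task("evaluation", deps=corrections)
    return [schema, *candidates, *corrections, evaluation]


tasks = build_query_tasks(num_candidates=4)
ready = [t for t in tasks if not t.deps]  # only schema linking can start first
```

The global coordinator would hold one such graph per in-flight query and activate a task once all of its `deps` have completed.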

2. Hierarchical Scheduling Scheme

Hexgen-Text2SQL adopts a two-level scheduling paradigm combining:

  • Global Workload-Balanced Dispatching
  • Local Urgency-Guided Prioritization

2.1 Global Workload-Balanced Dispatching

The global scheduler aims, for each subtask $q_{i,j}$, to select a model instance $m \in M$ that maximizes the probability that the overall query completes within its SLO:

$$\max_{\phi:\{q\}\to M}\;\Pr\Bigl(\sum_{(i,j)} t_{i,j}^{\phi(q_{i,j})} \le T^{\text{SLO}}_i\Bigr)$$

Because this assignment problem is NP-hard, Hexgen-Text2SQL implements a heuristic scoring strategy:

  • Predicted Computation Cost:

$$t^{m}_{\text{comp},i,j} = t^{m}_{\text{prefill}}\bigl(L_{\text{in}}(q_{i,j})\bigr) + t^{m}_{\text{decode}}\bigl(\widehat{L}_{\text{out}}(q_{i,j})\bigr)$$

where $L_{\text{in}}$ and $\widehat{L}_{\text{out}}$ denote the input length and the predicted output length.

  • Queueing Cost:

$$t^{m}_{\text{queue},i,j} = \sum_{q' \in \Theta^m} t^{m}_{\text{comp}}(q')$$

where $\Theta^m$ is the set of subtasks already queued at instance $m$.

  • Suitability Score for each pair $(q, m)$:

$$\mathrm{Score}(q_{i,j}, m) = (1-\alpha)\,\frac{\beta}{t^{m}_{\text{queue},i,j}} - \alpha\, t^{m}_{\text{comp},i,j}$$

with $\alpha \in [0,1]$ trading off “fast execution” against “light load,” and $\beta$ rescaling the queueing term.

Subtasks are dispatched to the instance maximizing this suitability score.
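A minimal sketch of this dispatch rule follows. The instance names and cost numbers are hypothetical; a real deployment would use profiled prefill/decode latencies per GPU type.

```python
def suitability_score(t_comp: float, t_queue: float, alpha: float, beta: float = 1.0) -> float:
    """Score(q, m) = (1 - alpha) * beta / t_queue - alpha * t_comp."""
    return (1.0 - alpha) * beta / t_queue - alpha * t_comp


def dispatch(comp_costs: dict, queue_costs: dict, alpha: float) -> str:
    """Pick the instance m that maximizes the suitability score for one subtask."""
    return max(
        comp_costs,
        key=lambda m: suitability_score(comp_costs[m], queue_costs[m], alpha),
    )


# Hypothetical per-instance predictions (seconds): the A100 computes
# faster but currently carries a longer queue than the A6000.
t_comp = {"A100": 1.0, "A6000": 2.5}
t_queue = {"A100": 6.0, "A6000": 1.0}

print(dispatch(t_comp, t_queue, alpha=0.9))  # execution-speed dominated -> "A100"
print(dispatch(t_comp, t_queue, alpha=0.1))  # queue-load dominated -> "A6000"
```

Note how $\alpha$ shifts the decision: a high $\alpha$ favors the fastest executor, while a low $\alpha$ favors the least-loaded instance, which is why tuning it online (Section 3) matters.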

2.2 Local Adaptive Urgency-Guided Prioritization

Once a subtask is enqueued at model instance $m$, its priority is dynamically determined by an urgency function:

  • Subtask Time Allocation

$$t^{\text{SLO}}_{i,j} = \bigl(T^{\text{SLO}}_i - \tau^{i}_{\text{elapsed}}\bigr) \times \frac{\overline{t}_{\text{comp},i,j}}{\sum_{k=j}^{n_i} \overline{t}_{\text{comp},i,k}}$$

  • Urgency Definition

$$U_{i,j} = t^{m}_{\text{comp},i,j} - \bigl(t^{\text{SLO}}_{i,j} - \tau_{i,j}\bigr)$$

where $\tau_{i,j}$ is the time $q_{i,j}$ has waited in the queue. A higher $U_{i,j}$ signifies a higher risk of deadline violation and thus higher execution precedence.

The queue always schedules the subtask with the highest urgency for inference execution on the GPU.
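The budget allocation and max-urgency queue above can be sketched with Python's `heapq`. This is a toy illustration under stated assumptions: the task names and timing values are invented, and a min-heap on negated urgency stands in for the instance's adaptive priority queue.

```python
import heapq


def slo_budget(T_slo: float, elapsed: float, avg_costs: list, j: int) -> float:
    """Allocate the remaining SLO to subtask j in proportion to its mean cost."""
    remaining = T_slo - elapsed
    return remaining * avg_costs[j] / sum(avg_costs[j:])


def urgency(t_comp: float, t_slo_ij: float, waited: float) -> float:
    """U = predicted compute time minus remaining slack; higher = more urgent."""
    return t_comp - (t_slo_ij - waited)


# Max-urgency queue implemented as a min-heap on negated urgency.
queue = []
heapq.heappush(queue, (-urgency(2.0, 5.0, 1.0), "q_a"))  # U = -2.0 (has slack)
heapq.heappush(queue, (-urgency(3.0, 4.0, 2.5), "q_b"))  # U =  1.5 (at risk)
_, next_task = heapq.heappop(queue)
print(next_task)  # "q_b" runs first: it is closest to violating its deadline
```

Here `q_b` preempts `q_a` in queue order even though it arrived later, which is exactly the head-of-line-blocking mitigation the paper attributes to urgency-guided prioritization.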

3. Simulation-Based Hyperparameter Tuning

The principal dispatch hyperparameter $\alpha$ is optimized online via a low-overhead, trace-driven CPU simulator:

  1. Warm-up: Collect workflow traces (arrival, queue, durations) for 100 seconds.
  2. Simulation: Replay the trace under candidate α\alpha values.
  3. Selection: Set

$$\alpha^* = \arg\min_{\alpha \in [0,1]} \frac{1}{N} \sum_{i=1}^{N} T_i(\alpha)$$

where $T_i(\alpha)$ denotes the simulated end-to-end latency of query $i$.

  4. Sweep: Coarse grid search (step 0.2), followed by a fine sweep (step 0.1) around the optimum.
  5. Retuning: After each sliding window, compare system latency using a one-sided t-test ($p < 0.01$); re-tuning is triggered if significant degradation is detected.

Tuning windows incur negligible overhead (115–158 s on CPU) relative to workload shifts, which occur on the order of hours.
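The coarse-then-fine sweep can be sketched as follows. This is a simplified stand-in: `simulate` here is a toy convex latency curve, whereas the real system replays collected workflow traces through its CPU simulator.

```python
def tune_alpha(simulate, coarse_step: float = 0.2, fine_step: float = 0.1) -> float:
    """Coarse grid over [0, 1], then a fine sweep around the coarse optimum."""
    coarse = [round(i * coarse_step, 2) for i in range(int(1 / coarse_step) + 1)]
    best = min(coarse, key=simulate)  # coarse winner
    fine = [
        round(best + d, 2)
        for d in (-fine_step, 0.0, fine_step)
        if 0.0 <= best + d <= 1.0
    ]
    return min(fine, key=simulate)


# Toy stand-in for trace replay: mean latency minimized near alpha = 0.7.
mean_latency = lambda a: (a - 0.7) ** 2 + 3.0
print(tune_alpha(mean_latency))  # -> 0.7
```

In the deployed system, `simulate` would replay the most recent 100 s trace under each candidate $\alpha$, and the sweep would re-run only when the t-test detects latency degradation.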

4. Empirical Evaluation

4.1 Experimental Setup

Key parameters for benchmarking include:

  • Workflow: CHESS agent pipeline with Llama3.1-70B LLM.
  • Workload Traces: BIRD-bench (finance, F1, mixed domains).
  • Hardware Topologies: Heterogeneous clusters—
    • Hetero-1: 2× A100 + 2× A6000
    • Hetero-2: 2× A100 + L40 + A6000
  • Arrival Rates: Poisson arrivals at 0.5 qps and 1.0 qps.
  • Baseline Comparator: vLLM framework using round-robin dispatch and FCFS local queue.
  • Metrics: Tightest scaled deadline at which 95% and 99% of requests meet their SLO; sustained throughput (qps).

4.2 Performance Results

Hexgen-Text2SQL demonstrates:

  • SLO Attainment: Up to 1.67× (average 1.41×) tighter deadlines at 95% attainment, and up to 1.60× (average 1.35×) at 99%, relative to vLLM.
  • Throughput: 1.57×–1.75× improvement (average 1.65×) over vLLM.

| Framework | Latency SLO (95%) | Throughput |
| --- | --- | --- |
| vLLM | $T_{\text{base}}$ | $X_{\text{base}}$ |
| Hexgen-Text2SQL | $T_{\text{base}}/1.41$ | $1.65\,X_{\text{base}}$ |

4.3 Ablation Analysis

  • Workload-Balanced (WB) vs. Round-Robin (RR) Dispatch: WB yields up to 1.38× (average 1.18×) better 95% SLO attainment than RR.
  • Urgency Queue vs. FCFS Queue: Adding local urgency prioritization delivers up to 1.5× (average 1.2×) speed-up in 95% SLO attainment.

5. Design Insights and Limitations

Hexgen-Text2SQL’s effectiveness derives from explicit modeling of multi-stage workflow dependencies, allowing for:

  • Elimination of Idle Gaps: Synchronizing task execution reduces resource underutilization and redundant computations.
  • Heterogeneity-Aware Task Assignment: Tailoring subtask scheduling to underlying GPU capabilities maximizes computation throughput and system efficiency.
  • Urgency-Driven Local Queues: Prioritize tasks most at risk of SLO violation, mitigating head-of-line blocking.
  • Simulation-Driven Tuning: Rapid, workload-responsive optimization of the core dispatch policy parameter α\alpha.

Limitations and Prospects:

  • The empirical computation cost model (prefill plus decode time) could be superseded by ML-based predictors for finer granularity.
  • Extending to other LLM multi-stage pipelines (e.g., multimodal chains) would require richer and more general dependency graph modeling.
  • Reinforcement learning or multi-armed bandit algorithms are suggested as alternatives to simulation sweeps for faster α\alpha convergence.
  • The potential use of preemption and state-swapping mechanisms to accommodate abrupt high-priority workload arrivals remains an open direction (Peng et al., 8 May 2025).

6. Impact and Future Directions

Hexgen-Text2SQL establishes a methodology for robust, SLO-compliant, and high-throughput serving of complex, multi-stage agentic Text-to-SQL workflows on heterogeneous GPU infrastructure. Its two-level scheduling—the combination of workload-balanced dispatching and urgency-based local priorities—coupled with simulation-based hyperparameter tuning, addresses critical production challenges in LLM-powered database querying. Prospective research avenues include more expressive cost-prediction models, broader generalization to multi-modal or other agentic LLM chains, dynamic adaptation beyond grid-based simulation, and principled integration of preemptive scheduling for extreme deadline sensitivity.
