OraPlan-SQL: Bilingual NL2SQL Planning System
- OraPlan-SQL is a planning-centric NL2SQL system that decomposes natural language queries into explicit stepwise plans and SQL queries using a dual-agent architecture.
- The framework integrates feedback-guided meta-prompting, bilingual entity-linking, and diversified planning with majority voting to enhance reliability and reduce execution errors.
- OraPlan-SQL achieves notable performance with 54.96% execution accuracy in English, 56.67% in Chinese, and over 99% syntactic SQL validity.
OraPlan-SQL is a planning-centric NL2SQL system designed for the Archer NL2SQL Evaluation Challenge 2025, a benchmark focusing on bilingual (English/Chinese) natural language-to-SQL reasoning tasks that require arithmetic, commonsense, and hypothetical inference. The framework adopts a dual-agent architecture, integrating explicit stepwise planning, feedback-driven prompt refinement, entity-variant management for robust bilingual reasoning, diversified planning for reliability, and majority-vote-based answer selection. OraPlan-SQL surpasses prior state-of-the-art results, achieving 54.96% execution accuracy (EX) in English and 56.67% in Chinese, while ensuring over 99% syntactic SQL validity (VA) (Liu et al., 27 Oct 2025).
1. System Architecture and Pipeline
OraPlan-SQL decomposes the NL2SQL process into two distinct agents:
- Planner agent (A_plan): receives a natural-language query q and database schema D, and outputs an explicit, stepwise plan p in natural language detailing reasoning steps, including arithmetic operations, filtering, and hypothetical conditions.
- SQL agent (A_SQL): converts the plan p and schema D into an executable SQL query s, preserving the logical integrity of the plan.
The pipeline operates as follows:
- Accepts user queries q in English or Chinese.
- Retrieves the top-k schema elements D_k via an embedding-based retriever.
- The planner agent produces N diverse plans (p_1, …, p_N) by sampling at temperature 0.7.
- Each plan p_i is independently processed by the SQL agent, using the retrieved schema plus semantically similar in-context examples (ICL), yielding SQL outputs s_1, …, s_N.
- Executes each SQL query and collects results r_1, …, r_N.
- Applies majority voting across r_1, …, r_N to select the final answer, resolving ties via model log-probability.
This agentic framework isolates the complex reasoning from SQL syntax concerns and allows targeted improvements to the planning process (Liu et al., 27 Oct 2025).
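The pipeline above can be sketched as follows; the planner, SQL agent, retriever, and executor are stubbed callables, and all names and defaults are illustrative rather than the authors' implementation:

```python
from collections import Counter

def run_pipeline(query, schema, planner, sql_agent, retrieve, execute, n_plans=5):
    """Plan -> SQL -> execute -> majority vote, per the OraPlan-SQL pipeline sketch."""
    schema_top_k = retrieve(query, schema, k=5)    # embedding-based retrieval (stubbed)
    plans = [planner(query, schema_top_k) for _ in range(n_plans)]  # temp-0.7 sampling
    sqls = [sql_agent(plan, schema_top_k) for plan in plans]        # temp-0.0 conversion
    results = [execute(sql) for sql in sqls]
    # Majority vote over execution results (log-prob tie-breaking omitted here).
    winner, _ = Counter(results).most_common(1)[0]
    return winner
```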
2. Feedback-Guided Meta-Prompting
Rather than orchestrating multiple sub-agents, OraPlan-SQL applies a meta-prompting strategy to iteratively refine a single planner agent:
- Failure-case collection: The base planner is run on a held-out development set; cases with incorrect execution or flawed plans are collated.
- Clustering: Each failure (query and plan) is embedded via an LLM into a dense vector; k-means clustering then identifies K distinct error types (K typically 5–7).
- Human annotation and LLM distillation: Clusters are manually labeled, and a "meta-writer" LLM is prompted with representative examples to produce concise corrective guidelines.
- Prompt integration: The distilled guidelines are appended to the planner’s system prompt under "Planner Correction Guidelines".
At inference, the planner executes according to these guidelines. This feedback loop demonstrably enhances generalization and corrects systematic planning errors. Ablation reveals a 27.9-point drop in EX when meta-guidelines are removed (from 72.12% to 44.23%), underscoring their impact (Liu et al., 27 Oct 2025).
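The refinement loop can be sketched as below, with the LLM embedding model, clustering routine, and meta-writer LLM passed in as stand-in callables (all names are illustrative assumptions, not the authors' code):

```python
from collections import defaultdict

def refine_planner_prompt(failures, embed, cluster, meta_writer, base_prompt, k=6):
    """Feedback-guided meta-prompting: cluster failure cases, distill corrective
    guidelines per cluster, and append them to the planner's system prompt."""
    vectors = [embed(query + "\n" + plan) for query, plan in failures]  # one vector per failure
    labels = cluster(vectors, k)                                        # e.g. k-means labels
    groups = defaultdict(list)
    for failure, label in zip(failures, labels):
        groups[label].append(failure)
    # Distill up to 5 representative examples per cluster into a guideline.
    guidelines = [meta_writer(examples[:5]) for examples in groups.values()]
    return (base_prompt + "\n\nPlanner Correction Guidelines:\n"
            + "\n".join(f"- {g}" for g in guidelines))
```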
3. Bilingual Entity-Linking and Variants
Complex bilingual scenarios in NL2SQL surface transliteration and entity mismatch challenges, particularly when Chinese queries reference entities whose forms do not match English schema entries. OraPlan-SQL addresses this by:
- Entity variant generation: The planner system prompt instructs explicit identification and enumeration of plausible English variants (translations, abbreviations, aliases) for each entity detected in the query. For example, a Chinese-language mention of a park yields the candidate variants ['Glebe Park', 'Glebe Gardens'].
- Encoding in plan: A section titled "Entity Variants" is prepended to each plan, mapping each source-language entity to its list of candidate English variants.
- Downstream usage: The SQL agent consults these variants for WHERE and JOIN clause construction.
This methodology improves cross-lingual entity resolution and reduces execution errors related to schema mismatches (Liu et al., 27 Oct 2025).
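As an illustration of how a variant list might feed WHERE-clause construction, consider the string-template sketch below; the actual SQL agent emits such clauses via the LLM rather than templates, and the function name is hypothetical:

```python
def variant_where_clause(column, variants):
    """Expand an entity's candidate English variants into an OR-ed equality filter,
    as the SQL agent might when consulting the plan's "Entity Variants" section.
    (Illustrative only; real SQL generation is done by the LLM.)"""
    conditions = " OR ".join(f"{column} = '{v}'" for v in variants)
    return f"({conditions})"

# Usage with the paper's example variants for a Chinese park mention:
clause = variant_where_clause("park.name", ["Glebe Park", "Glebe Gardens"])
# → (park.name = 'Glebe Park' OR park.name = 'Glebe Gardens')
```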
4. Plan Diversification and Majority Voting
To enhance reliability and reduce brittleness, OraPlan-SQL introduces plan diversification and consensus selection:
- Diverse plan generation: The planner agent samples N diverse plans (e.g., N = 7) at temperature 0.7 to cover distinct reasoning pathways.
- SQL query synthesis: Each plan is processed with the SQL agent at temperature 0.0 for deterministic conversion.
- Majority voting: After executing the resulting SQLs s_1, …, s_N, the final answer is chosen as the mode of the execution results: r* = argmax_r Σ_{i=1}^{N} 1[r_i = r].
If no strict majority exists, the output with the highest model log-likelihood is selected, defaulting to the first plan's result in unresolved cases.
Ablation indicates that disabling majority voting leads to a modest decrease in EX (–1.0 point), suggesting the consensus mechanism adds a small but consistent robustness gain on top of an already strong pipeline (Liu et al., 27 Oct 2025).
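A minimal sketch of the selection rule, assuming a per-query log-probability is available for each candidate to break ties (function and argument names are illustrative):

```python
from collections import Counter

def majority_vote(results, logprobs):
    """Pick the modal execution result; break ties by highest model log-probability,
    falling back to the first plan's result (a sketch of the selection rule)."""
    counts = Counter(results).most_common()
    top_count = counts[0][1]
    tied = [r for r, c in counts if c == top_count]
    if len(tied) == 1:
        return tied[0]
    # Tie: choose the tied result whose generation had the highest log-probability.
    scored = [(lp, r) for r, lp in zip(results, logprobs) if r in tied]
    return max(scored, key=lambda t: t[0])[1] if scored else results[0]
```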
5. SQL Agent Engineering and In-Context Learning
SQL agent effectiveness depends on precise prompt construction:
- Prompt components:
- Agent role: "You are a SQL-generation agent. You will convert the user’s plan into valid SQL."
- Top-5 retrieved schema elements.
- Three in-context examples (NL plan to gold SQL).
- Instruction to adhere strictly to plan step order.
- Model selection: Both planner and SQL agent utilize GPT-5. The SQL agent decodes at temperature 0 to minimize hallucinations.
Schema retrieval and ICL demonstrably reduce syntactic and semantic SQL errors. Ablation shows that omitting ICL marginally decreases EX (–1.9 points), highlighting its refinement role (Liu et al., 27 Oct 2025).
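A sketch of how such a prompt might be assembled; the wording, field layout, and function name are assumptions, not the authors' exact template:

```python
def build_sql_prompt(plan, schema_elems, icl_examples):
    """Assemble the SQL agent prompt: agent role, top-5 retrieved schema elements,
    up to three ICL examples (NL plan -> gold SQL), and a strict-order instruction."""
    parts = [
        "You are a SQL-generation agent. You will convert the user's plan into valid SQL.",
        "Schema:\n" + "\n".join(schema_elems[:5]),
    ]
    for nl_plan, gold_sql in icl_examples[:3]:
        parts.append(f"Example plan:\n{nl_plan}\nSQL:\n{gold_sql}")
    parts.append("Adhere strictly to the plan's step order.\nPlan:\n" + plan)
    return "\n\n".join(parts)
```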
6. Evaluation Metrics and Comparative Results
Performance on the Archer 2025 bilingual benchmark is evaluated as follows:
- Execution Accuracy (EX): the fraction of queries whose predicted SQL, when executed, returns the same result as the gold SQL: EX = (1/|Q|) Σ_{q∈Q} 1[exec(s_q) = exec(s_q_gold)].
- SQL Validity (VA): the fraction of predicted SQL queries that parse and execute without error: VA = (1/|Q|) Σ_{q∈Q} 1[s_q executes successfully].
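Both metrics reduce to simple fractions over the evaluation set; a minimal sketch (function and argument names are illustrative):

```python
def execution_accuracy(pred_results, gold_results):
    """EX: fraction of queries whose predicted execution result equals the gold result."""
    matches = sum(p == g for p, g in zip(pred_results, gold_results))
    return matches / len(gold_results)

def sql_validity(execution_errors):
    """VA: fraction of predicted queries that executed without error
    (execution_errors[i] is None when query i ran successfully)."""
    ok = sum(e is None for e in execution_errors)
    return ok / len(execution_errors)
```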
Leaderboard results:
| System | EN EX | ZH EX | VA |
|---|---|---|---|
| OraPlan-SQL (GPT-5) | 54.96% | 56.67% | >99% |
| HIT–SCIR (GPT-4o) | 48.66% | 44.08% | — |
OraPlan-SQL achieves gains of +6.3 points in English and +12.6 points in Chinese over the runner-up (Liu et al., 27 Oct 2025).
Ablation studies (English dev set, GPT-5):
| Component Removed | EX (%) | ΔEX |
|---|---|---|
| Full system | 72.12 | — |
| w/o meta-prompt guidelines | 44.23 | –27.9 |
| w/o planner agent | 64.42 | –7.7 |
| w/o ICL in SQL agent | 70.19 | –1.9 |
| w/o majority voting | 71.15 | –1.0 |
7. Supplementary Algorithms and Formalization
Algorithm 1: Feedback-Guided Meta-Prompting
Given held-out queries Q_dev, schema D, and base planner A_plan:
- For each q in Q_dev: a. Generate a plan p = A_plan(q, D). b. Convert and execute: s = A_SQL(p, D), r = exec(s). c. If r does not match the gold answer, collect (q, p) as a failure case.
- Embed each failure into a dense vector via an LLM embedding model.
- Cluster the vectors into K clusters via k-means (cosine distance).
- For each cluster C_j: a. Sample 5 examples and label the error type. b. Prompt the meta-writer LLM for corrective guidelines. c. Receive a guideline bullet list G_j.
- Concatenate all G_j under "Planner Correction Guidelines" in the planner's system prompt.
- Return the refined planner prompt.
Majority Voting Formalization:
Given results r_1, …, r_N, define the vote count: count(r) = Σ_{i=1}^{N} 1[r_i = r].
Final answer: r* = argmax_r count(r), with ties broken by model log-likelihood.
This architecture, refinement methodology, bilingual entity management, plan diversification, and consensus selection together achieve state-of-the-art results on reasoning-intensive, bilingual NL2SQL benchmarks (Liu et al., 27 Oct 2025).