AgentSM: Semantic Memory for Text-to-SQL
- AgentSM is an agentic framework that utilizes structured, interpretable semantic memory to guide multi-step reasoning for Text-to-SQL generation.
- It integrates phase-specific retrieval and composite tool abstraction to reduce redundant exploration and lower latency in real-world enterprise settings.
- Experimental results on Spider benchmarks demonstrate improved execution accuracy, fewer reasoning steps, and enhanced overall efficiency compared to baseline methods.
Agent Semantic Memory (AgentSM) is an agentic framework for the Text-to-SQL problem: mapping natural language questions to executable SQL queries over realistic, large-scale enterprise databases. AgentSM maintains a structured, interpretable semantic memory of prior agent reasoning traces, which lets agents reuse systematic patterns of multi-step exploration and reasoning. This architecture addresses the deficiencies of traditional scratchpad or vector retrieval approaches by providing context-aware, phase-specific programmatic guidance, improving both efficiency and stability in demanding Text-to-SQL scenarios (Biswal et al., 22 Jan 2026).
1. Motivation and Problem Context
Traditional LLM-based Text-to-SQL systems achieve competitive performance on academic benchmarks but falter in real-world enterprise settings due to schema scale, dialect heterogeneity, and multi-step reasoning demands. Primary obstacles include:
- Repetitive exploration: Agents frequently duplicate expensive schema-exploration steps across questions, leading to wasted computation and increased latency.
- Rigid workflows: Fixed prompt engineering or scratchpad routines introduce unnecessary steps or fail to adequately adapt to schema variability.
- Instability and inefficiency: Minor divergences can derail agent reasoning, producing invalid SQL or truncated trajectories even when using low-temperature decoding.
Attempts to reuse history via vector search or raw scratchpads merely retrieve unstructured logs, which remain opaque and difficult for agents to interpret or incorporate effectively. AgentSM addresses these shortcomings by encoding, retrieving, and injecting structured execution traces as direct, interpretable guides for future agent trajectories (Biswal et al., 22 Jan 2026).
2. Formal Definitions and Semantic Memory Construction
Let $q$ denote the current natural-language question, $D$ the target database with schema $S$, and $\mathcal{T}$ the set of available agent tools (e.g., schema inspection, SQL execution, vector search).
A reasoning trajectory $\tau = (s_1, \ldots, s_n)$ encodes ordered agent steps, where each $s_i$ can be an LLM-generated reasoning "Thought", a tool invocation $a \in \mathcal{T}$, or a tool-obtained observation. Formally, SQL synthesis within a trajectory is modeled as a sequence of productions from a context-free grammar $G$, capturing both SQL and extended tool actions.
The semantic memory at time $t$ is a set
$$M_t = \{(q_i, D_i, \tau_i)\}_{i=1}^{N_t},$$
which evolves as
$$M_{t+1} = M_t \cup \{(q, D, \tau)\}$$
when a new trajectory $\tau$ on $(q, D)$ is generated.
Similarity-driven retrieval is defined by the function
$$\mathrm{sim}(q, q_i) = \cos\big(E(q), E(q_i)\big),$$
where $E$ is a pretrained encoder (MiniLM-L6) and $\cos$ is cosine similarity. Only entries on the same database $D$ are considered for reuse, with the top-$k$ highest-scoring trajectories selected for injection into the next agent run.
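As a rough illustration of the memory update and retrieval rule described above, the sketch below stores (question, database, trajectory) triples in a list, filters by database, and ranks the remaining entries by cosine similarity. It is not the authors' implementation: `MemoryEntry`, `add_entry`, `retrieve`, and the example data are hypothetical, and the bag-of-words `embed` function is a toy stand-in for the MiniLM-L6 encoder so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    question: str     # stored natural-language question
    database: str     # identifier of the database it was asked on
    trajectory: list  # ordered agent steps recorded for that question


def embed(text: str) -> Counter:
    """Toy stand-in for the pretrained encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def add_entry(memory: list, question: str, database: str, trajectory: list) -> None:
    """Memory update: add the new (question, database, trajectory) triple."""
    memory.append(MemoryEntry(question, database, trajectory))


def retrieve(memory: list, question: str, database: str, k: int = 1) -> list:
    """Return the top-k entries on the same database, ranked by similarity."""
    query_vec = embed(question)
    candidates = [e for e in memory if e.database == database]
    candidates.sort(key=lambda e: cosine(query_vec, embed(e.question)), reverse=True)
    return candidates[:k]


memory = []
add_entry(memory, "total sales per region in 2023", "retail",
          ["list_tables", "describe sales"])
add_entry(memory, "number of commits per repository", "github_repos",
          ["list_tables"])
add_entry(memory, "average order value per customer", "retail",
          ["list_tables", "describe orders"])

top = retrieve(memory, "total sales per customer in 2023", "retail", k=1)
print(top[0].question)
```

With a real sentence encoder, only `embed` and `cosine` would need to change; the database filter and the top-k selection do not depend on how questions are embedded.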
3. System Architecture and Algorithmic Workflow
AgentSM is constructed as a multi-agent system using the smolagents framework:
- Planner agent: Responsible for high-level reasoning, code and SQL generation, memory interleaving, and overall trajectory orchestration.
- Schema-linking agent: Specialized sub-agent with a 75 step budget, using vector retrieval to map question tokens to schema elements, returning these mappings to the planner.
The full workflow comprises these steps:
- Filter the memory $M$ for trajectories associated with the target database $D$.
- Compute similarity scores $\mathrm{sim}(q, q_i)$ between the current question $q$ and every stored question $q_i$ in the filtered set.
- Select and inject the top-$k$ exploration trajectories as agent context for the Planner.
- The Planner alternates between local reasoning and tool invocation, updating state until completion or budget exhaustion.
- The executed trajectory $\tau$ is recorded and appended to $M$.
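The planner loop in the last two steps can be sketched as a budgeted alternation between reasoning and tool calls. This is a minimal sketch under stated assumptions, not the smolagents API or the authors' code: `run_planner`, `RunResult`, `scripted_policy`, the `describe_table` tool, and the default `max_steps=20` are all hypothetical, and the scripted policy stands in for the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    trajectory: list = field(default_factory=list)  # ordered (kind, content) pairs
    final_sql: Optional[str] = None
    budget_exhausted: bool = False


def run_planner(question: str, injected: list, policy: Callable, tools: dict,
                max_steps: int = 20) -> RunResult:
    """Alternate reasoning and tool calls until an answer or the budget runs out.

    One loop iteration is one thought plus at most one tool call.
    """
    result = RunResult()
    state = {"question": question, "context": list(injected), "observations": []}
    for _ in range(max_steps):
        thought, action, argument = policy(state)
        result.trajectory.append(("thought", thought))
        if action == "final_answer":
            result.final_sql = argument
            return result
        observation = tools[action](argument)
        result.trajectory.append(("tool_call", f"{action}({argument!r})"))
        result.trajectory.append(("observation", observation))
        state["observations"].append(observation)
    result.budget_exhausted = True
    return result


def scripted_policy(state):
    """Hypothetical stand-in for the LLM: inspect the schema once, then answer."""
    if not state["context"] and not state["observations"]:
        return ("I need the schema first.", "describe_table", "orders")
    return ("I know the schema; writing SQL.", "final_answer",
            "SELECT COUNT(*) FROM orders")


tools = {"describe_table": lambda name: f"{name}(id INTEGER, total REAL)"}
question = "How many orders are there?"

cold = run_planner(question, [], scripted_policy, tools)
warm = run_planner(question, ["orders(id INTEGER, total REAL)"], scripted_policy, tools)

# Record the executed trajectory so later questions on this database can reuse it.
memory = [(question, "shop_db", warm.trajectory)]
print(len(cold.trajectory), len(warm.trajectory))
```

In this toy setup the run with injected exploration context skips the schema lookup entirely, which is the effect the injection step is meant to have on real trajectories.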
Structured memory entries are stored in Markdown (or optionally JSON), demarcating Exploration, Execution, and Validation phases. Each step incorporates metadata: phase, tool-name, code/query, observation. Step classification uses regex heuristics over tool names and SQL patterns, enabling selective phase replay during retrieval.
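To make the phase tagging concrete, here is a small sketch of regex-based step classification followed by rendering to a Markdown entry. The source does not list the actual heuristics, so every pattern, tool name (`list_tables`, `execute_sql`, `validate_result`), and the Markdown layout below is an assumption chosen for illustration.

```python
import re

# Hypothetical patterns; the real heuristics are not given in the source.
EXPLORATION_TOOL = re.compile(
    r"list_tables|describe_table|get_schema|vector_search", re.IGNORECASE)
VALIDATION_TOOL = re.compile(r"validate|check_result", re.IGNORECASE)
EXPLORATION_SQL = re.compile(
    r"information_schema|pragma\s+table_info|show\s+tables"
    r"|select\s+\*\s+from\s+\S+\s+limit\s+\d+",
    re.IGNORECASE)


def classify_step(tool_name: str, code: str = "") -> str:
    """Assign a phase from the tool name and, for SQL tools, the query text."""
    if EXPLORATION_TOOL.search(tool_name) or EXPLORATION_SQL.search(code):
        return "Exploration"
    if VALIDATION_TOOL.search(tool_name):
        return "Validation"
    return "Execution"


def to_markdown(steps: list) -> str:
    """Render steps as Markdown, opening a new heading whenever the phase changes."""
    lines, current = [], None
    for step in steps:
        phase = classify_step(step["tool"], step.get("code", ""))
        if phase != current:
            lines.append(f"## {phase}")
            current = phase
        lines.append(
            f"- tool: {step['tool']} | code: {step.get('code', '')} "
            f"| observation: {step['observation']}")
    return "\n".join(lines)


steps = [
    {"tool": "list_tables", "observation": "orders, customers"},
    {"tool": "execute_sql", "code": "SELECT * FROM orders LIMIT 5",
     "observation": "5 rows"},
    {"tool": "execute_sql",
     "code": "SELECT region, SUM(total) FROM orders GROUP BY region",
     "observation": "4 rows"},
    {"tool": "validate_result", "observation": "row count > 0"},
]
markdown = to_markdown(steps)
print(markdown)
```

Because the phase is stored with each step, a retriever can later replay only one phase of a trajectory without re-parsing the raw log.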
Composite tools are automatically mined by detecting frequently co-occurring low-level tool sequences and abstracting them into single higher-level actions, optimizing both planning efficiency and memory compactness.
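One simple way to detect frequently co-occurring tool sequences is to count contiguous n-grams of tool names across stored trajectories. The sketch below uses that technique as an assumption; the source does not specify the mining algorithm, and `mine_composites`, `apply_composite`, the thresholds, and the tool names are hypothetical.

```python
from collections import Counter


def mine_composites(trajectories, min_len=2, max_len=3, min_support=2):
    """Return contiguous tool n-grams seen in at least min_support trajectories."""
    support = Counter()
    for tools in trajectories:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(tools) - n + 1):
                seen.add(tuple(tools[i:i + n]))
        support.update(seen)  # each trajectory counts at most once per n-gram
    frequent = [(seq, c) for seq, c in support.items() if c >= min_support]
    # Highest support first, then longer sequences, then alphabetical order.
    frequent.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return frequent


def apply_composite(tools, sequence, name):
    """Replace each occurrence of sequence with a single composite action name."""
    out, i, n = [], 0, len(sequence)
    while i < len(tools):
        if tuple(tools[i:i + n]) == tuple(sequence):
            out.append(name)
            i += n
        else:
            out.append(tools[i])
            i += 1
    return out


trajectories = [
    ["list_tables", "describe_table", "sample_rows", "execute_sql"],
    ["list_tables", "describe_table", "sample_rows", "execute_sql", "execute_sql"],
    ["vector_search", "list_tables", "describe_table", "execute_sql"],
]
frequent = mine_composites(trajectories)
best_sequence, best_support = frequent[0]
compact = apply_composite(trajectories[0], best_sequence, "inspect_schema")
print(best_sequence, best_support, compact)
```

Rewriting stored trajectories with the mined composite shortens both the plan the agent has to produce and the memory entry that is later injected into its context.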
4. Memory Retrieval, Reuse, and Synthetic Trajectory Synthesis
Retrieval at inference time is strictly phase-specific. Only the "Exploration" component of the closest past trajectory is prepended to the prompt context, explicitly suppressing redundant schema-inquiry and setup steps for new questions over the same database.
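A minimal sketch of this phase-specific injection, assuming each stored step already carries a phase label: only Exploration steps are kept and placed ahead of the new question. The function names, the step dictionary layout, and the prompt wording are hypothetical.

```python
def exploration_prefix(trajectory: list) -> str:
    """Keep only Exploration-phase steps and format them as a context block."""
    kept = [s for s in trajectory if s["phase"] == "Exploration"]
    lines = ["Prior exploration on this database:"]
    for step in kept:
        lines.append(f"- {step['tool']}: {step['observation']}")
    return "\n".join(lines)


def build_prompt(question: str, closest_trajectory: list) -> str:
    """Prepend the exploration context of the closest past trajectory."""
    return exploration_prefix(closest_trajectory) + "\n\nQuestion: " + question


closest = [
    {"phase": "Exploration", "tool": "list_tables",
     "observation": "orders, customers"},
    {"phase": "Exploration", "tool": "describe_table",
     "observation": "orders(id, customer_id, total)"},
    {"phase": "Execution", "tool": "execute_sql",
     "observation": "region totals for 2023"},
    {"phase": "Validation", "tool": "validate_result",
     "observation": "row count > 0"},
]
prompt = build_prompt("What is the average order total per customer?", closest)
print(prompt)
```

Execution and Validation steps are left out on purpose: they answer the earlier question, whereas the schema facts gathered during exploration remain valid for any question on the same database.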
To prevent memory sparsity, AgentSM synthesizes, offline, a dense library of exploration trajectories using algorithmically generated synthetic questions over each schema. Synthetic question sampling is performed via LLM generation under a specified budget and schema distribution; each sampled question $q$ is executed to completion, and the resulting trajectory $\tau$ is stored. No supervised fine-tuning is applied; the entire approach leverages agentic self-play and prompt-based planning.
5. Experimental Evaluation and Ablation
AgentSM is evaluated on the Spider 2.0 and Spider 2.0 Lite benchmarks, covering BigQuery, Snowflake, and SQLite backends. The primary metrics include execution accuracy (EX%), average step count, input/output tokens, and end-to-end latency. Main comparative results are shown below:
| Method | Overall EX% | Avg Steps | In Toks | Out Toks | Latency (s) |
|---|---|---|---|---|---|
| SpiderAgent (Claude 3.7) | 28.7 | 18.9 | 200 K | 4 K | 363.2 |
| CodingAgent (Claude 4) | 24.7 | 18.1 | 200 K | 4 K | 325.5 |
| AgentSM (Claude 3.7) | 38.4 | 16.8 | 299 K | 5 K | 226.4 |
| AgentSM (Claude 4) | 44.8 | 16.4 | 300 K | 5 K | 247.1 |
AgentSM achieves 44.8% execution accuracy with 13% fewer steps and 30% lower latency relative to baseline agentic methods on Spider 2.0 Lite. Ablation demonstrates that disabling trajectory reading reduces accuracy by 34.7 points with a four-step average increase; disabling composite tools incurs a 35.8 point accuracy drop and further increases latency.
Selective trajectory reading eliminates approximately 25% of redundant steps. Composite tool abstraction further reduces trajectory length and token usage.
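The headline step and latency reductions can be checked against the results table. The source does not say which baseline row the percentages refer to, so the sketch below computes the relative reduction of AgentSM (Claude 4) against both baselines; `relative_reduction` is a hypothetical helper and the numbers are copied from the table.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline


# (average steps, latency in seconds) copied from the results table
rows = {
    "SpiderAgent (Claude 3.7)": (18.9, 363.2),
    "CodingAgent (Claude 4)": (18.1, 325.5),
    "AgentSM (Claude 4)": (16.4, 247.1),
}

agent_steps, agent_latency = rows["AgentSM (Claude 4)"]
for name in ("SpiderAgent (Claude 3.7)", "CodingAgent (Claude 4)"):
    steps, latency = rows[name]
    print(f"{name}: "
          f"{relative_reduction(steps, agent_steps):.1f}% fewer steps, "
          f"{relative_reduction(latency, agent_latency):.1f}% lower latency")
```

Against SpiderAgent this gives about 13.2% fewer steps and 32.0% lower latency, close to the quoted figures; against CodingAgent the reductions are about 9.4% and 24.1%.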
6. Analysis, Limitations, and Variance Across Domains
Phase-wise analysis indicates that AgentSM enters the Execution phase 30% more quickly than CodingAgent, confirming the effectiveness of exploration trajectory injection. Failure attribution shows:
- Schema-linking errors account for 30% of agent errors on domains with deeply nested Snowflake schemas.
- Step budget exhaustion triggers 5% of failures.
- Residual failures are attributable to logic or SQL dialect-specific issues.
Domain-specific performance varies substantially: common schema types see 60–78% execution accuracy, while specialized databases such as github_repos and idc see 14–40%. This indicates that semantic memory is most effective in settings where schema structures and reasoning patterns are recurrent.
7. Prospects and Future Directions
Potential extensions include:
- Fine-grained retrieval by matching trajectories on table or column usage, rather than full-question similarity.
- Trajectory consolidation to reduce memory size and eliminate redundant traces via merging.
- Shared memory across agents for scaling multi-agent pipelines, requiring mechanisms for consistency and update control.
- Learned retrieval strategies and adaptive composite tool formation.
A plausible implication is that AgentSM’s formalization of trajectory-based semantic memory and phase-specific retrieval could generalize to other agentic program synthesis tasks beyond SQL, provided similarly structured and reusable execution traces are available (Biswal et al., 22 Jan 2026). The framework highlights the importance of interpretability, structure, and phase-aware retrieval in advancing LLM-based agent reliability in enterprise decision environments.