Natural Language to KQL Translation
- Natural-language-to-KQL translation is the automatic conversion of plain language queries into syntactically and semantically valid Kusto Query Language statements for interacting with telemetry databases.
- It integrates semantic parsing, program synthesis, and schema grounding by leveraging grammar-aware neural models and LLM-driven prompt engineering to ensure accurate query generation.
- State-of-the-art pipelines combine modular schema refinement, few-shot exemplar retrieval, and query refinement to reduce latency and improve execution accuracy in high-stakes enterprise environments.
Natural-language-to-KQL (NL→KQL) translation refers to the automatic generation of syntactically and semantically valid Kusto Query Language (KQL) statements from natural language queries (NLQs), enabling users to interact with structured telemetry or log databases—such as those hosted in Azure Data Explorer and Microsoft security platforms—without explicit knowledge of KQL syntax or database schema. This task combines elements of semantic parsing, program synthesis, information retrieval, and LLM alignment, and is characterized by stringent requirements on schema grounding, syntactic correctness, and executable output in high-stakes domains such as enterprise security operations centers (SOCs).
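To make the translation target concrete, the following is an illustrative NLQ–KQL pair; the table and column names are hypothetical and not drawn from the cited papers. Note KQL's operator-pipeline style, where each `|` stage transforms the tabular output of the previous one:

```python
# Illustrative NLQ -> KQL pair (schema names are hypothetical).
nlq = "How many failed sign-ins per user occurred in the last 24 hours?"

kql = """
SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != 0
| summarize FailedCount = count() by UserPrincipalName
| order by FailedCount desc
"""
```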
1. Methodological Principles of NL→KQL Translation
The NL→KQL translation problem requires a model to map an unstructured, ambiguous natural-language question into the highly restricted, pipeline-oriented syntax of KQL. Unlike generic sequence-to-sequence translation, effective NL→KQL conversion is sensitive to both database schema (table/column inventory, value types) and the grammar of KQL, which favors operator pipelines over SQL-style SELECT–FROM–WHERE constructs (Cai et al., 2017). Two principal methodological paradigms are prominent:
- Grammar-Aware Neural Models: Early approaches adapt encoder-decoder neural architectures by injecting grammar state representations and semantic features into both encoder and decoder, leveraging a pre-enumerated BNF of KQL to enforce syntactic validity and to manage recursive subqueries via a stack. The decoder receives not only recurrent states and attention context but also a binary grammar vector tracking all open non-terminals in the KQL BNF, and its outputs are masked by both short-term and schema-derived long-term token constraints (see Section 1.5 in (Cai et al., 2017)); a minimal sketch of this masking follows this list.
- LLM-Driven Prompt Engineering with Schema Retrieval: Contemporary methods, notably the NL2KQL framework and its successors, use a modular inference pipeline in which a schema refiner selects a minimal, relevant schema subset and a few-shot selector retrieves prompt exemplars (NLQ–KQL pairs), both using semantic embeddings and cosine similarity (Tang et al., 3 Apr 2024, Muzammil et al., 7 Dec 2025). The LLM is then supplied with a carefully composed prompt including the selected schema, best practices, syntax guides, and exemplars, followed by an optional query refiner for deterministic post-parsing repair of the candidate KQL.
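The grammar-vector mechanism can be sketched minimally as follows; the non-terminal inventory, the source of the legal-token sets, and the NumPy formulation are illustrative placeholders, not the actual architecture of (Cai et al., 2017):

```python
import numpy as np

# Illustrative subset of KQL BNF non-terminals.
NON_TERMINALS = ["Query", "Pipeline", "OperatorCall", "Predicate"]

def grammar_vector(open_nonterminals):
    """Binary vector marking which BNF non-terminals are currently open;
    this is the signal injected into the decoder at each step."""
    return np.array([1.0 if nt in open_nonterminals else 0.0
                     for nt in NON_TERMINALS])

def masked_distribution(logits, legal_token_ids):
    """Suppress tokens disallowed by short-term (local BNF) and
    long-term (schema-derived) constraints, then renormalize."""
    mask = np.full(logits.shape, -np.inf)
    mask[list(legal_token_ids)] = 0.0
    z = logits + mask
    e = np.exp(z - z.max())
    return e / e.sum()
```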
2. System Architectures and Pipeline Components
NL→KQL systems are generally organized into modular pipelines with discrete responsibilities:
| Component | Responsibility | Canonical Implementation |
|---|---|---|
| Semantic Data Catalog | Maintains table, column, and (optionally) value embeddings with descriptions for retrieval and prompt conditioning | Embedding + vector store (Tang et al., 3 Apr 2024, Muzammil et al., 7 Dec 2025) |
| Schema Refiner | Selects the top-k tables and top-k enumerated values semantically relevant to the NLQ, filters by access control | Cosine similarity scoring |
| Few-Shot Selector | Retrieves the top-k few-shot NLQ–KQL pairs from a large synthetic database for in-context learning | Cosine matching on embeddings |
| Prompt Builder | Assembles full prompt, including schema, syntax guide, best practices, few-shots, user query | Textual concatenation |
| Query Generator | Produces candidate KQL via seq2seq model or LLM | LSTM-based model or LLM |
| Query Refiner/Judge | Parses and repairs candidate KQL for syntax and semantic errors, or selects among candidates using an "Oracle" LLM with schema context | Rule-based parser (Tang et al., 3 Apr 2024), LLM judge (Muzammil et al., 7 Dec 2025) |
Pipeline organization in NL2KQL (Tang et al., 3 Apr 2024) and SLM frameworks (Muzammil et al., 7 Dec 2025) prioritizes aggressive pruning of schema and example space to fit prompt context, maximize LLM attention efficiency, and reduce hallucinations or spurious table/column invention.
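In essence, the schema refiner and few-shot selector reduce to nearest-neighbor retrieval over embeddings, followed by textual prompt assembly. A minimal sketch follows, assuming a generic `embed()` function and in-memory catalogs; production systems such as NL2KQL use dedicated vector stores, and all names here are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, items, k):
    """items: list of (payload, embedding) pairs; returns the k payloads
    whose embeddings are most similar to the query vector."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [payload for payload, _ in ranked[:k]]

def build_prompt(nlq, embed, schema_catalog, fsdb, k_tables=5, k_shots=3):
    """Schema refiner + few-shot selector + prompt builder in miniature.
    schema_catalog holds (table_description, embedding) pairs; fsdb holds
    ((nlq, kql), embedding) pairs from the synthetic few-shot database."""
    q = embed(nlq)
    tables = top_k(q, schema_catalog, k_tables)   # schema refiner
    shots = top_k(q, fsdb, k_shots)               # few-shot selector
    parts = ["## Relevant schema", *tables, "## Examples"]
    parts += [f"NLQ: {n}\nKQL: {k}" for n, k in shots]
    parts += ["## Question", nlq, "KQL:"]
    return "\n".join(parts)
```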
3. Training, Data Generation, and Schema Grounding
Training NL→KQL models requires large, diverse, and schema-specific datasets of NLQ–KQL pairs. As full-scale annotated corpora are scarce, the dominant approach is synthetic data generation tuned to the deployment schema:
- Synthetic FSDB Construction: A synthetic few-shot database (FSDB) is built by (i) sampling tables/themes, (ii) prompting a base LLM to generate valid KQL queries, (iii) generating NL paraphrases by imperative explanation, and (iv) using round-trip generation and KQL parser validation with a token-Jaccard similarity threshold to ensure semantic faithfulness (Tang et al., 3 Apr 2024); a sketch of this validation follows this list.
- Transfer from SQL Datasets: For neural grammar-aware architectures, initial seed corpora are adapted from NL→SQL datasets via rule-based translation, then manually corrected and paraphrased, followed by fine-grained token and semantic label annotation for model input (Cai et al., 2017).
- Rationale Distillation for SLMs: Recent work distills stepwise model reasoning into "chain-of-thought" rationales concatenated before gold KQL output, facilitating compact LoRA-based SLM fine-tuning with limited capacity (Muzammil et al., 7 Dec 2025).
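The round-trip validation step can be sketched as follows; `parses` stands in for the official KQL parser, and the threshold value shown is illustrative, since the papers' exact setting is not reproduced here:

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-delimited token sets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def round_trip_ok(gold_kql, regenerated_kql, parses, threshold=0.8):
    """Keep a synthetic (NLQ, KQL) pair only if the KQL regenerated
    from the NL paraphrase parses and stays close to the original.
    The 0.8 threshold is illustrative, not the paper's value."""
    return parses(regenerated_kql) and \
           token_jaccard(gold_kql, regenerated_kql) >= threshold
```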
Schema grounding plays a critical role. Each NLQ is embedded and compared to schema element vectors to filter accessible tables and columns; only these are made available to the model during translation, enforcing high schema precision.
4. Grammar Constraints, Subquery Modeling, and Error Correction
Ensuring output KQL is both syntactically valid and semantically meaningful is a central challenge:
- Grammar State Vector: Neural models track a binary grammar vector indicating active BNF non-terminals, which is injected into all LSTM gates and used to mask output tokens at each decode step by both short-term (local BNF) and long-term (schema-derived, e.g., columns disallowed in a join) constraints. Subquery handling is managed by stack-based updates of this grammar state when "let"/";" tokens are encountered in the KQL (Cai et al., 2017).
- Query Refiner: LLM-based approaches postprocess the raw KQL using the official KQL parser to detect parse and semantic errors (undefined identifiers, missing operators). Deterministic repairs, such as prefix-matching unknown columns, substituting tables that supply the needed columns, and rule-based insertion of operators, yield measurable gains in execution correctness (recovering 1–2% accuracy per ablation in (Tang et al., 3 Apr 2024)); a simplified sketch follows this list.
- Error-Aware Prompting: For SLMs, error profiling of frequent parser failure patterns informs compact prompt modifications. Enforcing patterns for timestamp usage, infix restriction for operators (has, in, between), and explicit bans on hallucinated tables/columns helps mitigate low-parameter model errors with negligible token cost (Muzammil et al., 7 Dec 2025).
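A highly simplified sketch of the deterministic column-repair rule follows; a real refiner operates on a parse tree produced by the official KQL parser rather than on raw tokens, and the fuzzy-matching fallback here is an illustrative addition:

```python
import difflib

def repair_unknown_columns(kql_tokens, referenced_cols, schema_cols):
    """Map each column the candidate KQL references but the schema lacks
    to its closest schema column, preferring a prefix match and falling
    back to fuzzy matching. Operates on raw tokens for simplicity."""
    fixes = {}
    for col in referenced_cols:
        if col in schema_cols:
            continue
        prefix = [c for c in schema_cols if c.lower().startswith(col.lower())]
        close = difflib.get_close_matches(col, schema_cols, n=1)
        if prefix:
            fixes[col] = prefix[0]
        elif close:
            fixes[col] = close[0]
    return [fixes.get(tok, tok) for tok in kql_tokens]
```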
5. Empirical Performance and Evaluation Protocols
Evaluation comprehensively addresses both syntactic and semantic adequacy:
- Offline (Parse-based) Metrics: Syntax score, semantic score, table-overlap score, and Jaccard-based filter-column and filter-literal scores (Tang et al., 3 Apr 2024, Muzammil et al., 7 Dec 2025). The full NL2KQL pipeline achieves a syntax score of 0.9933 and a semantic score of 0.9724 in the Defender XDR domain, with ablations showing the necessity of the schema and few-shot components.
- Online (Execution-based) Metrics: Row-exec and column-exec compare the tuples and columns returned by predicted and gold KQL queries, with average execution accuracy (Avg-Exec) as the principal indicator of system utility for end-users (Tang et al., 3 Apr 2024); a simplified sketch of these metrics follows this list.
- Cost and Latency: SLM architectures achieve a 10–15× reduction in token cost and 4× lower latency versus GPT-5 baselines while matching or exceeding semantic accuracy (e.g., a 0.906 semantic score at 0.213 USD for 230 queries, vs. 0.861 at 2.018 USD for GPT-5) (Muzammil et al., 7 Dec 2025).
- Generalization: When transferred to unseen schemas (e.g., Sentinel after Defender), SLM two-stage pipelines exhibit only minor losses in syntax and semantic metrics, though filter precision can degrade significantly, which is attributed to schema-retrieval and value-grounding challenges (Muzammil et al., 7 Dec 2025).
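A minimal sketch of the execution-based metrics, under the simplifying assumption that query results are available as lists of comparable tuples and column names; the papers' exact scoring details may differ:

```python
def row_exec(pred_rows, gold_rows):
    """1.0 if predicted and gold queries return the same tuples
    (order-insensitive), else 0.0. Assumes comparable row values."""
    return float(sorted(map(tuple, pred_rows)) == sorted(map(tuple, gold_rows)))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def column_exec(pred_cols, gold_cols):
    """Jaccard overlap between predicted and gold result columns."""
    return jaccard(pred_cols, gold_cols)

def avg_exec(per_query_scores):
    """Average execution accuracy across a benchmark."""
    return sum(per_query_scores) / len(per_query_scores) if per_query_scores else 0.0
```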
6. Adaptation to KQL from SQL and Domain-Specific Considerations
Adapting neural NL→SQL frameworks for KQL translation principally requires:
- Re-enumeration of the non-terminal set in the grammar state vector according to the KQL BNF, which is characterized by a pipeline-dominant syntax (e.g., a left-recursive production of the form Pipeline → Pipeline "|" OperatorCall) and a distinct operator and aggregation inventory (Cai et al., 2017).
- Augmentation of encoder semantic features to distinguish KQL-specific pipeline operators and the expanded set of event-centric table/column names (Cai et al., 2017).
- Preservation of stack-based recursive management for subqueries, but with KQL-appropriate patterns (e.g., "let" as a subquery opener; see the sketch after this list) (Cai et al., 2017).
- Data pipeline modifications: KQL training corpora are partially generated by translating existing SQL examples, manual correction, and synthetic paraphrasing, followed by joint NL and schema feature annotation (Cai et al., 2017).
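A minimal sketch of the stack discipline for KQL subquery scopes, assuming "let" opens a binding and ";" closes it; the set-of-non-terminals state below is a simplification of the binary grammar vector discussed in Section 4:

```python
def track_subqueries(tokens):
    """Push a fresh grammar state when a 'let' binding opens and pop it
    when the binding is closed by ';', mirroring stack-based subquery
    handling. States are sets of open non-terminals for illustration."""
    stack = [{"Query"}]                 # outermost grammar state
    for tok in tokens:
        if tok == "let":
            stack.append({"Subquery"})  # enter the binding's scope
        elif tok == ";" and len(stack) > 1:
            stack.pop()                 # binding closed: resume outer state
    return stack[-1]                    # active grammar state after parsing

# e.g. track_subqueries('let x = T | where A > 1 ; x | count'.split())
```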
The NL2KQL pipeline and SLM two-stage frameworks are domain-agnostic; only the BNF, schema, and prompt instantiations need alteration to support other high-level, schema-driven query languages.
7. Limitations and Open Challenges
NL→KQL systems currently face several technical challenges:
- Filter Precision: Filter-literal and filter-column exactness remains a bottleneck on novel schemas, especially for rare columns and enumerated values. Current filter-literal Jaccard accuracy can drop as low as 0.10 for out-of-domain data (Muzammil et al., 7 Dec 2025).
- Schema Hallucination and Value Retrieval: LLM and SLM models may still attempt to access non-existent tables, columns, or create spurious values unless aggressively pruned and prompted.
- Generalization to Unseen Workloads: Degradation on unseen workloads or adversarial NLQ inputs (such as vague temporal scopes or ambiguous entity references) necessitates further research in robust prompt engineering, value grounding, and BNF-driven error correction.
- Adaptation to New Query Languages: While BNF adjustment and encoder tag extension are straightforward for languages like EQL or Splunk SPL, systematic validation remains to be demonstrated (Muzammil et al., 7 Dec 2025).
- Data Requirements for Fine-Tuning: Scaling high-quality, schema-grounded NLQ–KQL corpora remains resource-intensive, though synthetic round-trip curation and rationale distillation mitigate manual burden.
A plausible implication is that subsequent progress in sub-10B SLMs, schema representation learning, and data synthesis will further narrow the gap to full LLMs while reducing operational cost, thus broadening applicability to additional enterprise and scientific telemetry domains.
References:
- "An Encoder-Decoder Framework Translating Natural Language to Database Queries" (Cai et al., 2017)
- "NL2KQL: From Natural Language to Kusto Query" (Tang et al., 3 Apr 2024)
- "Towards Small LLMs for Security Query Generation in SOC Workflows" (Muzammil et al., 7 Dec 2025)