Skill-KNN: A Skill-Based Few-Shot Retrieval Method
- The paper demonstrates that Skill-KNN outperforms raw-input KNN by modeling underlying task skills, leading to higher robustness and accuracy.
- The method leverages LLM-based skill rewriting and embedding models to capture core reasoning operations rather than superficial input similarities.
- Empirical results show that Skill-KNN boosts performance on cross-domain semantic parsing tasks with significant gains over traditional retrieval baselines.
Skill-KNN is a skill-based few-shot selection method for in-context learning with LLMs. Unlike traditional K-nearest neighbor (KNN) strategies that compare surface linguistic features, Skill-KNN retrieves context examples by explicitly modeling the underlying reasoning operations, or "skills," required to solve each task instance. This approach reduces reliance on spurious surface-level input similarity, offering substantial gains in robustness and accuracy, particularly in cross-domain semantic parsing scenarios (An et al., 2023).
1. Formal Problem Statement
Skill-KNN targets in-context learning, where a frozen LLM M adapts to a downstream task using an example bank B = {(x_i, y_i)} composed of input-label pairs (for instance, natural-language questions plus schema as inputs, SQL queries as outputs). For each test input x_test, a retrieval function R selects k context examples from B
to form the prompt for M, which then generates the prediction ŷ = M(R(x_test, B), x_test).
The central challenge is to design the retrieval function R so as to maximize predictive accuracy for the frozen LLM.
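To make the setup concrete, here is a minimal sketch of how a retrieval function and prompt assembly fit together. The function names, the toy bank, and the placeholder "most recent" retriever are all illustrative, not from the paper:

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, label), e.g. (question + schema, SQL query)

def build_prompt(test_input: str,
                 retrieve: Callable[[str, List[Example], int], List[Example]],
                 bank: List[Example],
                 k: int = 2) -> str:
    """Assemble an in-context prompt from k retrieved bank examples."""
    demos = retrieve(test_input, bank, k)
    parts = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    parts.append(f"Input: {test_input}\nOutput:")  # frozen LLM completes this
    return "\n\n".join(parts)

# Placeholder retriever: simply takes the k last bank items.
toy_bank = [("q1", "SELECT 1"), ("q2", "SELECT 2"), ("q3", "SELECT 3")]
prompt = build_prompt("q4", lambda x, b, k: b[-k:], toy_bank, k=2)
```

Skill-KNN's contribution is entirely in the `retrieve` slot: the prompt format and the frozen LLM stay fixed.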
2. Motivation: Skill-Based Versus Surface-Based Selection
Conventional few-shot selection leverages embedding models such as SBERT to encode the input (typically just the natural-language question), selecting neighbors based on cosine similarity in embedding space. This "raw-input KNN" approach is prone to selecting instances sharing superficial lexical or entity overlap, rather than examples requiring similar underlying programmatic operations (e.g., "JOIN," "COUNT DISTINCT"). As a result, context selection can be misaligned with the latent reasoning requirements of the downstream task.
Skill-KNN mitigates this by rewriting each input into a concise, natural-language skill description capturing the key operations to be performed. KNN search is then conducted in the embedding space of these skill descriptions, not the original surface forms. Empirical t-SNE visualizations demonstrate that skill-based embeddings cluster according to operation type rather than lexical similarity, and retrieval becomes substantially less sensitive to irrelevant input perturbations (An et al., 2023).
3. Generation and Utilization of Skill-Based Descriptions
Skill-KNN constructs skill-based representations using a two-stage procedure:
- Demonstration Annotation: A small set of demonstration pairs (x, s) is manually constructed, where s enumerates the required operations (e.g., "join two tables and count distinct values in column X").
- Few-Shot Prompting: For each input x, a few-shot prompt is assembled by presenting the demos to a frozen LLM (gpt-3.5-turbo), followed by the new input; the completion after the cue "Skill:" yields the induced description s(x). This skill description is then embedded using an off-the-shelf encoder.
This process is repeated m times for each example, shuffling the demo order each time to address prompt sensitivity, producing m skill rewrites per input.
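The prompting step above can be sketched as follows. The prompt layout, the demo texts, and the seeded shuffle are illustrative assumptions; only the overall recipe (shuffled demos, then the new input, then a "Skill:" cue for the LLM to complete) follows the paper.

```python
import random

def skill_prompt(demos, new_input, seed=0):
    """Build one few-shot rewriting prompt for a frozen LLM.

    demos: list of (input, skill) annotation pairs.
    The demo order is shuffled per call; repeating with different seeds
    yields the m rewrite variants used to counter prompt sensitivity.
    """
    shuffled = demos[:]
    random.Random(seed).shuffle(shuffled)
    lines = [f"Input: {x}\nSkill: {s}" for x, s in shuffled]
    lines.append(f"Input: {new_input}\nSkill:")  # LLM completes the description
    return "\n\n".join(lines)

demos = [
    ("show each country and its singer count",
     "group rows by a column and count per group"),
    ("names of singers older than every French singer",
     "compare against an aggregated subquery"),
]
p1 = skill_prompt(demos, "how many distinct languages are spoken", seed=1)
```

In practice the completion would come from gpt-3.5-turbo at temperature 0; the returned text is then fed to the embedding model.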
4. Mathematical Formulation and Algorithmic Workflow
Let f denote the LLM-based skill rewriting function and E the embedding model. For each test input x_test and candidate bank example x_i, the basic similarity is

sim(x_test, x_i) = cos(E(f(x_test)), E(f(x_i)))

Selection retrieves the top-k indices with maximal similarity. To further improve robustness, m skill rewrites per input are generated and combined in one of two ways:
- Consistency Variant: compares the mean embedding across the m rewrites on each side.
- Distinctiveness Variant: takes the maximal pairwise similarity between any pair of rewrites (one from each side).
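The two variants can be written down directly over precomputed rewrite embeddings. This is a minimal sketch in plain Python (function and argument names are my own; embeddings are plain lists of floats):

```python
import math

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_vec(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def sim_consistency(test_rewrites, cand_rewrites):
    """Consistency variant: cosine between the mean embeddings of the rewrites."""
    return cos(mean_vec(test_rewrites), mean_vec(cand_rewrites))

def sim_distinctiveness(test_rewrites, cand_rewrites):
    """Distinctiveness variant: best cosine over all rewrite pairs."""
    return max(cos(u, v) for u in test_rewrites for v in cand_rewrites)
```

Averaging suppresses zero-mean rewrite noise, while taking the max ignores occasional bad rewrites as long as one pair matches well, which mirrors the two noise regimes discussed in Section 7.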
Skill-KNN Algorithm (paraphrased)
- Rewrite Stage: For each bank example and test input, generate skill-based rewrites using shuffled demo prompts.
- Embedding Stage: Compute embeddings for each rewrite.
- Similarity Computation: Score each bank example under the chosen variant (mean-embedding cosine for consistency, maximal pairwise similarity for distinctiveness).
- Retrieval: Select the best-matching examples for in-context learning.
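The final retrieval stage reduces to a top-k selection over the per-example similarity scores; a one-liner sketch (helper name is my own):

```python
def topk_indices(sims, k):
    """Indices of the k bank examples with the highest similarity scores."""
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

sims = [0.2, 0.9, 0.5, 0.7]
picked = topk_indices(sims, 2)  # → [1, 3]
```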
5. Elimination of Spurious Surface Features
Rewriting the input into skill-focused prompts decouples context selection from irrelevant lexical or entity overlap. t-SNE plots confirm that, compared to raw-input KNN, skill-based embeddings group by operation rather than by superficial linguistic similarity. On datasets with perturbations in schema or natural language (e.g., Dr. Spider-DB), Skill-KNN exhibits enhanced robustness, demonstrating a lower error rate and decreased sensitivity to surface-level database or input changes (An et al., 2023).
6. Empirical Performance and Implementation
Skill-KNN is empirically validated across five cross-domain semantic parsing datasets (Spider, Dr. Spider, KaggleDBQA, BIRD, COGS) and six LLMs (text-chat-davinci-002, code-davinci-002, text-davinci-003, code-cushman-002, gpt-35-turbo, gpt-4). Baselines include random selection, conventional KNN (SBERT, OpenAI Ada/Babbage), maximal marginal relevance (MMR), and fine-tuning-based selection (EPR, CEIL, TST).
Skill-KNN achieves the highest non-oracle accuracy on all benchmarks. For text-chat-davinci-002 on Spider: Random = 72.9%, KNN-SBERT = 73.0%, MMR = 74.6%, Skill-KNN(base) = 76.8%, Skill-KNN(distinct) = 78.3% (oracle Target-KNN = 78.6%). On Dr. Spider-DB, Skill-KNN(distinct) yields 57.0% accuracy versus 54.1% for MMR. On GSM8K math reasoning, Skill-KNN scores 71.0% (compared to 69.1% random, 69.9% KNN-SBERT), and using gpt-4 on Spider, Skill-KNN(consistency) achieves 82.7% (An et al., 2023).
Key implementation hyperparameters include:
- Rewriting LLM: gpt-3.5-turbo, temperature 0, max tokens 200
- Demo set size: 8 (ablations show only a mild performance drop for smaller demo sets)
- Embedding models: SBERT (all-mpnet-base-v2), OpenAI text-embedding-babbage-001, text-embedding-ada-002
- Retrieval size: k in-context examples
A summary of key empirical findings is presented in the table below:
| Dataset/Task | Best baseline | Skill-KNN (base) | Skill-KNN (distinct/consistency) |
|---|---|---|---|
| Spider (text-chat-davinci-002) | 73.0% (KNN-SBERT) | 76.8% | 78.3% (distinct) |
| Dr. Spider-DB | 54.1% (MMR) | – | 57.0% (distinct) |
| GSM8K (math) | 69.9% (KNN-SBERT) | 71.0% | – |
| Spider (gpt-4) | 76.7% (KNN-SBERT) | – | 82.7% (consistency) |
7. Insights, Limitations, and Prospective Extensions
Ablations confirm that increasing the demo set size beyond 8 yields diminishing returns. Both the consistency and distinctiveness variants are beneficial, each addressing a different noise regime in LLM rewrites: consistency mitigates zero-mean noise, while distinctiveness attenuates rare outlier noise. In terms of computational cost, rewriting all examples with m variants is expensive (roughly 400–500 hours on 8 A100 GPUs). While Skill-KNN exhibits robust generalization, its advantage wanes on tasks where surface similarity alone suffices. Potential extensions include automated demo selection, application to other domains (e.g., table reasoning, code), and a unified criterion combining the strengths of the two variants (An et al., 2023).