Kamiwaza Agentic Merit Index (KAMI) Benchmark

Updated 15 December 2025
  • Kamiwaza Agentic Merit Index (KAMI) is an interactive benchmark evaluating LLMs’ capacity to ground reasoning, sequence tool calls, and recover from errors.
  • The metric aggregates trial outcomes from filesystem, text extraction, CSV analysis, and SQL querying tasks into a pooled accuracy score.
  • KAMI offers actionable insights into models’ adaptation, error recovery, and robustness under uncertain, dynamic environmental conditions.

The Kamiwaza Agentic Merit Index (KAMI) is an interactive, multi-step benchmark for evaluating the real-world agentic capabilities of LLMs equipped with tool-use APIs. Unlike traditional static question answering or code generation benchmarks, KAMI assesses how reliably an LLM can ground its reasoning within its environment, sequence tool calls, recover from errors, and adapt under uncertainty. The benchmark is structured around filesystem, text extraction, CSV analysis, and SQL querying tasks, each instantiated as multiple randomized scenarios. Scoring is based on pooled accuracy across a large number of independent trials, yielding both aggregate measures and detailed behavioral traces for qualitative analysis. KAMI v0.1 aims to provide a rigorous foundation for analyzing interactive, environment-aware agentic behavior in LLM agents (Roig, 8 Dec 2025).

1. Formal Definition and Purpose

The KAMI benchmark is defined as a collection of interactive, multi-step scenarios designed to probe and quantify agentic robustness in LLMs that interface with tool-use APIs. Its evaluation targets include:

  • Grounding: Does the model anchor its reasoning in live observations and environmental data?
  • Sequencing: Does the model orchestrate the correct ordering of tool invocations?
  • Recovery: Can the agent identify and correct errors within the allowed inference rounds?
  • Adaptation: Is the model robust under varying uncertainty, environmental distractors, or workload?

KAMI distinguishes itself from static benchmarks by requiring actual state changes in an environment and by scoring based on the exactness of tool-mediated artifacts produced by the agent in each trial (Roig, 8 Dec 2025).
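
To make the interaction model concrete, the following Python sketch outlines a single bounded trial loop. It is purely illustrative: the names run_trial, agent_step, and env, and their interfaces, are assumptions for exposition rather than parts of the published harness; only the one-action-per-round structure and the bounded round budget mirror the benchmark's design.

```python
# Illustrative sketch of one KAMI-style trial loop (all names are hypothetical):
# the model is queried once per round, may issue a single tool call, observes the
# result, and the trial ends when it submits a final artifact or the round budget
# runs out. Recovery costs nothing as long as a correct artifact appears in time.

def run_trial(agent_step, env, max_rounds: int = 20) -> int:
    """Return 1 if the trial ends with a correct artifact, else 0."""
    observation = env.initial_observation()
    for _ in range(max_rounds):
        action = agent_step(observation)       # one tool call or a final answer
        if action.is_final:
            return int(env.check_artifact(action.artifact))
        observation = env.execute(action)      # tool output (including errors) feeds the next round
    return 0                                   # round budget exhausted without a correct artifact
```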

2. Mathematical Formulation

KAMI’s score is mathematically formulated as the trial-weighted pooled accuracy across all benchmark scenarios. Let $S = \{1, \dots, M\}$ be the set of agentic scenarios ($M = 19$ in KAMI v0.1), $T_s$ the number of randomized trials per scenario ($T_s = 240$), and $X_{s,t} \in \{0, 1\}$ a per-trial success indicator (1 for a correct, scoring-type-verified output; 0 otherwise). Then:

  • Scenario-level accuracy: $A_s = \frac{1}{T_s} \sum_{t=1}^{T_s} X_{s,t}$
  • Scenario weight: $w_s = \frac{T_s}{\sum_{k=1}^{M} T_k}$ (uniform in v0.1)
  • Pooled KAMI score:

$$\mathrm{KAMI} = \sum_{s=1}^{M} w_s A_s = \frac{1}{M} \sum_{s=1}^{M} A_s = \frac{\sum_{s=1}^{M} \sum_{t=1}^{T_s} X_{s,t}}{\sum_{s=1}^{M} T_s}$$

With fixed $T_s$ and uniform $w_s$, each scenario contributes equally and every trial carries identical weight, preventing the metric from being dominated by particular tasks or scenarios (Roig, 8 Dec 2025).
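
In code, the pooled score reduces to counting correct trials over all trials. The sketch below assumes per-trial outcomes are stored as lists of 0/1 indicators keyed by scenario ID; the data layout and function name are illustrative assumptions, not the published harness format.

```python
def kami_score(outcomes: dict[str, list[int]]) -> float:
    """Pooled KAMI accuracy: total correct trials over total trials.

    With equal trial counts per scenario (T_s = 240 in v0.1), this coincides
    with the unweighted mean of the scenario-level accuracies A_s.
    """
    total_trials = sum(len(trials) for trials in outcomes.values())
    total_correct = sum(sum(trials) for trials in outcomes.values())
    return total_correct / total_trials

# Toy example with two equally sized scenarios: pooled score equals the mean of A_s.
outcomes = {"Q201": [1, 1, 1, 1], "Q402": [1, 0, 1, 0]}
print(kami_score(outcomes))  # 0.75
```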

3. Benchmark Structure and Scenario Taxonomy

KAMI v0.1 comprises four principal task categories, each mapped to specific scenario ID series and defined operationally as follows:

Category | Scenario IDs | Description
Filesystem | Q201, Q202 | Create directories/files under implicit constraints
Text Extraction | Q301, Q302 | Retrieve specific lines from large files using line-indexed reads
CSV Analysis | Q401, Q402, Q403 | Compute counts, sums, averages over CSVs, requiring code execution
SQL Querying | Q501, Q502, Q503 | Schema discovery, joins, filters, aggregation with distractors

Each scenario is instantiated as $T_s = 240$ independent, randomized trials in the full benchmark. The qualitative study manually reviews a 12.5% sample (30 trials per scenario across selected models), surfacing both aggregate accuracy and fine-grained behavioral patterns. Non-core axes such as sanity checks (100-series), prompt engineering (600-series), and output format (700-series) are excluded from the scored set in the referenced work (Roig, 8 Dec 2025).
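
For orientation, the core taxonomy above can be captured as a small registry. The structure below is purely illustrative; the scenario IDs, categories, and 240-trial setting come from the table and text, while everything else is an assumption.

```python
# Illustrative registry of the core KAMI v0.1 scenarios listed in the table above
# (the data structure itself is an assumption, not the harness's representation).
TRIALS_PER_SCENARIO = 240

SCENARIOS = {
    "Filesystem":      ["Q201", "Q202"],
    "Text Extraction": ["Q301", "Q302"],
    "CSV Analysis":    ["Q401", "Q402", "Q403"],
    "SQL Querying":    ["Q501", "Q502", "Q503"],
}

# The table lists 10 scenario IDs; KAMI v0.1 scores 19 scenarios in total.
assert sum(len(ids) for ids in SCENARIOS.values()) == 10
```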

4. Scoring, Weighting, and Normalization

The aggregation of trial-level outcomes adheres to strict normalization:

  • All scenarios employ identical trial counts and equal per-scenario weights ($w_s = 1/19$).
  • KAMI’s trial-weighted pooling precludes selective overfitting; even the hardest scenarios contribute exactly 1/19 of the final metric.
  • Each trial’s evaluation is binary: $X_{s,t} = 1$ only if the final artifact (file, JSON, etc.) exactly matches the scenario’s canonical solution; otherwise $X_{s,t} = 0$.
  • The scoring mechanism accommodates multi-step error recovery: a model is not penalized for initial mistakes if it corrects them within the maximum of 20 inference rounds allotted per trial.

These criteria collectively ensure that KAMI reflects aggregate, scenario-balanced agentic reliability rather than idiosyncratic successes or degenerate strategies.
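
A hedged sketch of the binary grading rule follows. The paper specifies only that a trial scores 1 on an exact match between the final artifact and the canonical solution; the function name and the idea of parsing JSON artifacts before comparison are illustrative assumptions.

```python
import json
from pathlib import Path

def grade_trial(artifact_path: str, canonical: str) -> int:
    """Binary per-trial score: 1 only on an exact match with the canonical solution.

    Hypothetical sketch: JSON artifacts are parsed so that incidental formatting
    (whitespace, key order) does not mask an exact semantic match; other files are
    compared verbatim as text.
    """
    produced = Path(artifact_path).read_text()
    if artifact_path.endswith(".json"):
        try:
            return int(json.loads(produced) == json.loads(canonical))
        except json.JSONDecodeError:
            return 0
    return int(produced == canonical)
```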

5. Example Evaluation and Interpretation

Analysis in (Roig, 8 Dec 2025) reports performance of DeepSeek V3.1 over ten benchmarked scenarios, each with 30 manually traced trials. The corresponding accuracies are tabulated as follows:

Scenario | Correct Trials | Accuracy
Q201 | 30/30 | 1.000
Q202 | 30/30 | 1.000
Q301 | 27/30 | 0.900
Q302 | 30/30 | 1.000
Q401 | 30/30 | 1.000
Q402 | 20/30 | 0.667
Q403 | 30/30 | 1.000
Q501 | 30/30 | 1.000
Q502 | 14/30 | 0.467
Q503 | 20/30 | 0.667

The subset KAMI score (pooled over 10 scenarios) is calculated as

$$\mathrm{KAMI}_{\text{subset}} = \frac{1.000 + 1.000 + 0.900 + 1.000 + 1.000 + 0.667 + 1.000 + 1.000 + 0.467 + 0.667}{10} \approx 0.870$$

In the full benchmark (19 scenarios × 240 trials), the identical aggregation yields the reported “Pooled Accuracy” figure (e.g., DeepSeek V3.1 at 92.2%) (Roig, 8 Dec 2025).
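
The subset figure can be reproduced directly from the trial counts in the table; the snippet below simply checks the arithmetic.

```python
# Reproduce the subset score from the per-scenario correct-trial counts above.
correct = [30, 30, 27, 30, 30, 20, 30, 30, 14, 20]  # Q201 ... Q503, each out of 30 trials
print(sum(correct) / (30 * len(correct)))           # 0.87, matching the reported ~0.870
```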

6. Experimental Protocols and Design Assumptions

KAMI v0.1 enforces several design decisions to maintain interpretative clarity and reproducibility:

  • A single-tool-per-round constraint, not disclosed in the prompts, enforces stepwise agentic decomposition of tasks.
  • Temperature is fixed at 0.4 to balance deterministic execution against the exploration needed for recovery.
  • Each trial is capped at 20 inference rounds to avoid infinite loops or degenerate long-horizon behaviors.
  • Core scenarios uniformly exercise multi-layered agent capabilities; auxiliary or meta scenarios are excluded from scoring and qualitative error analysis.
  • The experimental sample in (Roig, 8 Dec 2025) comprises three models (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) across a subset of scenarios and trials. This leaves comprehensive model coverage and tool-augmented qualitative assessment as future work.

A plausible implication is that the current version deliberately abstracts away from variance in prompt engineering or output format effects to focus on core agentic behaviors.
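
The protocol constants above can be summarized in one place. The dataclass below is a hypothetical way of packaging them for a reimplementation; it is not the harness's actual configuration object.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KamiRunConfig:
    """Hypothetical bundle of the KAMI v0.1 protocol constants described above."""
    temperature: float = 0.4         # fixed sampling temperature
    max_rounds: int = 20             # inference-round cap per trial
    tools_per_round: int = 1         # single-tool-per-round constraint (not disclosed in prompts)
    trials_per_scenario: int = 240   # randomized trials per scored scenario
```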

7. Significance and Comparison to Traditional Metrics

KAMI provides several conceptual and practical advantages over aggregate leaderboard metrics common in QA or code-completion evaluation:

  • Fine-grained, per-trial interactive analysis: Each execution trace can be dissected for strategy, error recovery, and context handling.
  • True multi-step agentic evaluation: Scenarios require ongoing schema inspection, adaptation to environmental distractors, and stateful tool-use sequences.
  • Explicit measurement of recovery behavior: Success is not exclusively tied to first-attempt correctness, but to the agent’s overall capacity for diagnosis and correction within resource budgets.
  • Robustness under context pollution: Scenario distractors probe the agent’s ability to discriminate between salient and irrelevant contextual signals, surfacing failure modes invisible to one-shot evaluation tasks.

The index thus supports both quantitative cross-model comparison and granular investigation of recurrent failure archetypes—premature action without grounding, over-helpfulness in response to missing inputs, vulnerability to context pollution, and execution fragility under load—facilitating principled benchmark evolution and informed system design (Roig, 8 Dec 2025).

References (1)
