Granite 4 Small: Agentic LLM Evaluation

Updated 15 December 2025
  • Granite 4 Small is a 32-billion parameter LLM designed for agentic multi-task tool use, enabling structured commands in filesystem, text extraction, CSV analysis, and SQL reasoning.
  • The model demonstrates near-deterministic performance on filesystem tasks but shows critical weaknesses in CSV analysis and complex SQL queries, as revealed by the KAMI v0.1 benchmark.
  • Characteristic failure modes include premature ungrounded actions, over-helpfulness under ambiguity, and context pollution, underscoring key design challenges for reliable agent deployment.

Granite 4 Small is a 32-billion-parameter LLM designed for agentic tool use and evaluated extensively in the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark. Its principal research interest lies in its behavioral patterns and characteristic failure modes relative to other state-of-the-art LLMs for autonomous multi-step reasoning and environment interaction. Granite 4 Small provides empirical evidence of the limits of pretraining and scale for agentic robustness, and it highlights the critical design choices involved in deploying LLMs as reliable agents in enterprise scenarios (Roig, 8 Dec 2025).

1. Definition and Core Capabilities

Granite 4 Small is classified as a tool-augmented LLM, capable of issuing structured tool calls (such as filesystem operations, programmatic data extraction, and SQL querying) in agentic, multi-turn environments. It was evaluated on four canonical agentic task families in the KAMI v0.1 protocol:

  • Filesystem manipulation
  • Text extraction across files
  • Multi-file CSV data analysis
  • SQL workflow reasoning

The agentic evaluation is conducted at fine trace granularity, focusing on both per-trial success rates and qualitative behavioral phenomena during error-prone or ambiguous task execution. Such evaluation protocols move beyond isolated benchmark scores, surfacing recurrent strategies and failure archetypes foundational to understanding model capability and deployment risk (Roig, 8 Dec 2025).
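Per-trial trace evaluation of this kind can be sketched as below. The `Trial` record and failure-tag names are hypothetical stand-ins for illustration, not KAMI's actual trace schema:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Trial:
    scenario: str                 # e.g. "filesystem", "csv-analysis"
    success: bool
    failure_tags: list = field(default_factory=list)  # qualitative labels

def summarize(trials):
    """Per-scenario success rate plus a tally of qualitative failure tags,
    so behavioral archetypes surface alongside the aggregate score."""
    by_scenario = {}
    for t in trials:
        s = by_scenario.setdefault(t.scenario, {"n": 0, "ok": 0, "tags": Counter()})
        s["n"] += 1
        s["ok"] += t.success
        s["tags"].update(t.failure_tags)
    return {k: {"rate": v["ok"] / v["n"], "tags": dict(v["tags"])}
            for k, v in by_scenario.items()}

trials = [
    Trial("filesystem", True),
    Trial("csv-analysis", False, ["hallucinated-data"]),
    Trial("csv-analysis", False, ["hallucinated-data", "wrong-file"]),
]
print(summarize(trials))
```

The point of the tag counter is that two scenarios with identical success rates can exhibit entirely different failure archetypes, which aggregate scoring alone would hide.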

2. Scenario-Wise Quantitative Performance

Granite 4 Small’s empirical accuracy varies sharply between task classes. The table below summarizes the primary findings for the evaluated task families (each scenario consists of 30 independent trials):

| Task Type | Average Success Rate (%) | Notable Failure Modes |
|---|---|---|
| Filesystem | 96.7 | Rare, mostly formatting |
| Text-extraction | 80.0 | Occasional JSON errors |
| CSV-analysis | 3.3 | Logic errors, hallucination |
| SQL reasoning | 41.1 | Premature, ungrounded action |

Filesystem tasks (e.g., directory creation and file writing) approach near-deterministic execution. Text extraction yields moderate, but not universal, reliability, mainly limited by minor formatting capture issues. In contrast, CSV analysis exposes fundamental weak points: Granite achieves only 0–6.7% success across representative tasks (e.g., question answering over multiple, distractor-laden CSV files), with failures arising from unsound tool use and fabricated data. SQL evaluation, which requires schema inference and multi-step query synthesis, finds the model only partially successful: while straightforward join/aggregation is handled acceptably (up to 63.3%), more complex, multi-query or distractor scenarios result in rapid performance degradation (Roig, 8 Dec 2025).
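With only 30 trials per scenario, these point estimates carry non-trivial sampling error; a Wilson score interval makes that explicit. This is a generic statistical sketch, not part of the KAMI protocol:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate; better behaved
    than the normal approximation near 0% and 100% with small n."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 29/30 filesystem successes is the 96.7% figure above
lo, hi = wilson_interval(29, 30)
print(f"filesystem: {29/30:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

For 29/30 the interval spans roughly 83% to 99%, a reminder that "near-deterministic" here means "no more than a handful of failures observed", not a guarantee.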

3. Characteristic Failure Archetypes

Agentic evaluation reveals four consistent failure archetypes intrinsic to Granite 4 Small's operation across domains:

  1. Premature Action without Grounding: The model issues tool commands directly, such as SQL "SELECT" statements, without schema introspection or environment verification, leading to brittle, non-recoverable faults when entity mismatches occur. It does not invoke schema-discovery tools on the first attempt even when available, instead persisting with "best guess" queries.
  2. Over-Helpfulness under Uncertainty: Upon environmental ambiguity (e.g., failed file loads or unsupported queries), Granite offers plausible, but invented, solutions—such as synthesizing data or inserting imaginary sample rows—to "help" answer a question, thereby breaking grounding with source-of-truth data.
  3. Context Pollution from Distractors: When session context includes extraneous or distractor information (e.g., additional tables or files), Granite frequently ceases discrimination, either selecting irrelevant entities or declaring required tables missing, often outputting default or null values without recovery.
  4. Fragile Execution under Load: Faced with input anomalies (e.g., malformed JSON or code syntax errors), the model often cycles recurrently through formatting or parsing failures without adjusting strategy, rapidly exhausting allowed inference rounds.

These patterns persist across structured data and text, indicating that neither environment nor modality alone explains the failures.

4. Successful Strategies and Partial Recoveries

Despite these limitations, two patterns enable Granite's partial success:

A. Clean, Sequential Tool Sequencing: In scenarios conforming strictly to training priors (e.g., create a directory, then write a file), Granite faithfully sequences the necessary tool calls, often without intermediate intervention, achieving near-perfect accuracy.

B. Elementary Self-Correction: When presented with minor, explicit error feedback (for instance, missing JSON delimiters in a tool call), Granite sometimes successfully repairs its own output and resumes completion, though this is only effective for surface-level formatting corrections.
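The repair behavior in (B) amounts to feeding the parser error back to the model and retrying. A minimal sketch, where `model_step` is a hypothetical stand-in for one model turn:

```python
import json

def call_with_repair(model_step, raw, max_rounds=3):
    """Elementary self-correction loop: parse the tool call; on failure,
    hand the error message back to the model and retry a bounded number
    of times. Effective only for surface-level formatting faults."""
    error = None
    for _ in range(max_rounds):
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            error = f"{e.msg} at position {e.pos}"
            raw = model_step(raw, error)  # model emits a repaired attempt
    raise RuntimeError(f"unrecoverable after {max_rounds} rounds: {error}")

# Toy 'model' that appends a missing closing brace when given feedback:
fix = lambda raw, err: raw if raw.endswith("}") else raw + "}"
print(call_with_repair(fix, '{"tool": "write_file", "path": "out.txt"'))
```

Bounding the rounds matters: the fourth archetype above is precisely a model cycling through the same parsing failure until the inference budget is exhausted.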

Notably absent are robust multi-step verification cycles. The model does not, for example, learn to parse error messages from tool environments to adapt its queries beyond superficial fixes.

5. Comparative Analysis with Contemporary LLMs

When contrasted with Llama 4 Maverick (400B, 17B MoE) and DeepSeek V3.1 (671B, 37B MoE with RL), Granite 4 Small exhibits several distinguishing characteristics (Roig, 8 Dec 2025):

  • Model Size is Not Sufficient: Llama 4 Maverick marginally outperforms Granite on most tasks (74.6% vs. 58.5% pooled accuracy) but does not escape the core four failure archetypes. The substantial jump to DeepSeek V3.1’s 92.2% is attributed primarily to reinforcement learning for tool recovery and error handling, not scale or mixture-of-experts architecture alone.
  • Recovery and Adaptation are Critical: DeepSeek V3.1 demonstrates iterative error interpretation and correction, where initial failures are addressed by engaging with schema suppliers or adjusting query structure mid-trial. Granite exhibits only superficial recovery paths.
  • Constraint Adaptation: Maverick is distinctive in learning “one tool per round” constraints during evaluation, while Granite and DeepSeek do not, illustrating that inductive preference for interface adaptivity is highly training-dependent.
  • Persistent Context Errors: DeepSeek V3.1, despite superior performance, occasionally replicates Granite's context pollution errors in distractor-heavy environments, indicating a persistent upper bound set by input curation and not only model or training improvements.

6. Implications for Agentic Model Development

The KAMI v0.1 analysis of Granite 4 Small supports the following domain-general conclusions about LLM deployment in agentic environments (Roig, 8 Dec 2025):

  • Reliable multi-step, tool-driven LLM agency is not a direct function of parameter count; training on grounding, verification, and error recovery is necessary.
  • Evaluation protocols must move beyond aggregate scoring to detailed, per-trial trace analysis to uncover the prevalent behavioral failure patterns and enable targeted mitigation.
  • Constraints imposed by environment and session context (tool interface restrictions, distractor presence, error message structure) drive failure rates, underscoring the importance of both robust failure detection/recovery mechanisms and careful context engineering for agentic deployments.
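The context-engineering point can be illustrated with a simple relevance filter that withholds distractor artifacts from the session context. The keyword index here is a hypothetical stand-in for any retrieval or relevance mechanism:

```python
def curate_context(question, artifacts, keyword_map):
    """Pass the agent only artifacts whose keywords match the question,
    rather than the whole session directory; fall back to everything
    if nothing matches, so no task is starved of context."""
    q = question.lower()
    relevant = [a for a in artifacts
                if any(k in q for k in keyword_map.get(a, ()))]
    return relevant or artifacts

artifacts = ["sales_2024.csv", "sales_2023.csv", "employees.csv"]
kmap = {"sales_2024.csv": ["2024", "sales"],
        "sales_2023.csv": ["2023", "sales"],
        "employees.csv": ["employee", "headcount"]}
print(curate_context("What were total sales in 2024?", artifacts, kmap))
```

Even this crude filter excludes the `employees.csv` distractor from a sales question, shrinking the surface on which the context-pollution archetype can trigger.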

A plausible implication is that robust enterprise-grade agentic deployment necessitates not only scaling and architectural advances but systematic curriculum and reinforcement learning interventions focused explicitly on interactive, environment-driven adaptation and adherence to external ground truth.

References (1)
