It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

Published 12 May 2026 in cs.SE, cs.AI, and cs.OS | (2605.12129v1)

Abstract: This paper experimentally analyzes how the level of harness engineering affects the operational performance of small LLMs (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain. VCR (Verification Catch Rate)=0.625 across all pipeline runs.


Authors (1)

Yong-eun Cho

Summary

The paper shows that sophisticated harness design, such as a four-stage pipeline, enhances operational stability more than sheer model size.
The methodology employs Task Success Rate and Valid TSR metrics, revealing non-monotonic effects and critical roles of planning and recovery stages.
Ablation analysis quantifies planning and recovery contributions of roughly 24.7% each, underscoring the need for proactive scaffolding in SLM deployments.

Harness Design as a Determinant of Operational Stability for Small LLMs

Overview and Motivation

This paper, "It's Not the Size: Harness Design Determines Operational Stability in Small LLMs" (2605.12129), presents a systematic experimental study of how harness engineering affects the operational robustness of small LLMs (SLMs) in edge or on-premises scenarios. In contrast to the prevailing emphasis on static benchmark accuracy, the research asks whether SLMs (2B–3B parameters) can meet operational thresholds for task success and output format compliance under different harness conditions. The investigation covers three representative models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks in six operationally relevant categories, using three harness levels: model-only (raw), minimal-shell (tagged), and a four-stage pipeline (plan-execute-verify-recover).

The study addresses five research questions relating to threshold achievement, monotonicity of harness benefit, harness component contributions, cross-model reproducibility, and category-specific harness effects.

Experimental Setup and Harness Conditions

Models are deployed locally via Ollama with controlled inference parameters and GPU resources. Each task undergoes a single execution per harness condition, with retry allowed exclusively within the pipeline’s recovery stage. Evaluation employs manual scoring (0/1/2 rubric) and metrics: Task Success Rate (TSR) and Valid TSR (VTSR), with operational stability defined as TSR ≥ 0.65 and VTSR ≥ 0.80. The harnesses are:

Model-only: Unstructured prompt, acting as baseline.
Minimal-shell: Prompts wrapped in structural tags without execution flow logic.
Pipeline: Sequential plan, execute, verify, and recover stages, incorporating retries and error feedback for enhanced stability.
Ablation: Pipeline stages are individually removed to assess contribution, yielding pipeline-no-plan, pipeline-no-verify, and pipeline-no-recover variants.
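
The four-stage flow and its ablation variants can be illustrated with a minimal Python sketch. This is not the paper's implementation: `run_pipeline`, `json_verifier`, the prompt wording, and the `max_retries` default are all hypothetical, and `model` stands for any callable that maps a prompt string to a completion string (for example a thin wrapper around a local Ollama call).

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    output: Optional[str]  # last model output, if any
    attempts: int          # number of execute calls made
    ok: bool               # True if the output was accepted


def json_verifier(output: str) -> Optional[str]:
    """Hypothetical rule-based check: return None if valid, else an error message."""
    try:
        json.loads(output)
        return None
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"


def run_pipeline(
    task: str,
    model: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    max_retries: int = 2,      # assumed value, not taken from the paper
    use_plan: bool = True,     # False ~ pipeline-no-plan
    use_verify: bool = True,   # False ~ pipeline-no-verify
    use_recover: bool = True,  # False ~ pipeline-no-recover
) -> PipelineResult:
    # Plan: ask the model for a plan and prepend it to the task prompt.
    plan = model(f"Plan the steps for this task:\n{task}") if use_plan else ""
    prompt = f"{plan}\n\nTask:\n{task}" if plan else task

    # Recovery is modelled as extra attempts that carry error feedback.
    budget = 1 + (max_retries if use_recover else 0)
    output = None
    attempts = 0
    while attempts < budget:
        attempts += 1
        output = model(prompt)  # Execute
        if not use_verify:
            # Without verification the first output is accepted unchecked.
            return PipelineResult(output, attempts, True)
        error = verify(output)  # Verify
        if error is None:
            return PipelineResult(output, attempts, True)
        # Recover: feed the verifier's message back into the next attempt.
        prompt = f"{prompt}\n\nPrevious attempt failed verification: {error}\nPlease fix it."
    return PipelineResult(output, attempts, False)
```

Exposing each stage as a flag keeps the ablation variants on the same code path as the full pipeline, so a difference in outcome can be traced to the removed stage rather than to a separate implementation.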

Core Findings

Harness Effectiveness and Non-Monotonicity

Empirical results indicate that harness design often contributes more to operational stability than model size. Under the pipeline harness, Gemma4 E2B attains TSR = 0.952 and VTSR = 1.000 on T1–T5 (21 tasks), clearing both stability thresholds. Notably, the effect is non-monotonic: the minimal-shell harness produces lower TSR than model-only for Gemma4 E2B and Qwen3.5:2B. The paper attributes this to the added cognitive load and higher timeout frequency induced by simplistic wrapper tags, which suggests that insufficient harnessing can degrade task reliability rather than improve it.

For LLaMA 3.2 3B, the model-only condition exhibits "scaffold collapse": output format stability breaks down (TSR = 0.429, seven format violations) even though content generation remains adequate (VTSR = 0.952). The gap between the two metrics shows that format compliance is a separate concern from generative ability.
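
To make the threshold test concrete, the short Python sketch below applies the stability criterion from the experimental setup (TSR ≥ 0.65 and VTSR ≥ 0.80) to the two results quoted above. The function and variable names are illustrative; only the thresholds and the four metric values come from the text.

```python
TSR_THRESHOLD = 0.65   # operational stability threshold on TSR
VTSR_THRESHOLD = 0.80  # operational stability threshold on VTSR


def is_operationally_stable(tsr: float, vtsr: float) -> bool:
    """Both thresholds must be met for a configuration to count as stable."""
    return tsr >= TSR_THRESHOLD and vtsr >= VTSR_THRESHOLD


# (model, harness) -> (TSR, VTSR), as reported in the text
reported = {
    ("Gemma4 E2B", "pipeline"): (0.952, 1.000),
    ("LLaMA 3.2 3B", "model-only"): (0.429, 0.952),
}

for (model, harness), (tsr, vtsr) in reported.items():
    verdict = "stable" if is_operationally_stable(tsr, vtsr) else "not stable"
    print(f"{model} / {harness}: TSR={tsr:.3f}, VTSR={vtsr:.3f} -> {verdict}")
```

The LLaMA 3.2 3B model-only case fails on TSR alone while passing on VTSR, which is the pattern the paper labels scaffold collapse.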

Harness Component Contributions

Ablation analysis quantifies harness stage impacts, validating that planning and recovery each contribute approximately 24.7% of the total pipeline gain. Planning operates as a proactive format anchor, ensuring compliance with quantitative constraints (e.g., character limits), beyond simple decomposition. Recovery ensures completion in timeout-prone tasks, converting failures into successes.
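
One plausible way to read "share of total gain" is sketched below in Python. The formula is an assumption, since this summary does not state how the paper computes the 24.7% figure: the drop in TSR when a stage is removed, divided by the full pipeline's gain over the model-only baseline. The input values in the usage line are made up for illustration and are not results from the paper.

```python
def stage_share_of_gain(tsr_full: float, tsr_without_stage: float, tsr_baseline: float) -> float:
    """Assumed definition: fraction of the pipeline's TSR gain lost when one stage is removed."""
    total_gain = tsr_full - tsr_baseline
    if total_gain <= 0:
        raise ValueError("full pipeline shows no gain over the baseline")
    return (tsr_full - tsr_without_stage) / total_gain


# Hypothetical inputs, NOT taken from the paper: full pipeline 0.90,
# pipeline-no-plan 0.80, model-only baseline 0.50.
share = stage_share_of_gain(tsr_full=0.90, tsr_without_stage=0.80, tsr_baseline=0.50)
print(f"planning share of gain: {share:.1%}")
```

Under this reading the shares of different stages need not sum to 100%, because stages can interact, for example when recovery only helps on failures that verification detects.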

Verification Catch Rate (VCR) is 0.625 across pipeline runs, indicating that the verify-recover loop catches many but not all failures. How many of the caught failures it can fix depends on the failure type (timeout, rule-detectable constraint, scaffold collapse, hallucination).
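
As a rough illustration, VCR can be computed as a simple ratio. The definition used in this Python sketch (failures flagged by the verify stage divided by all failing attempts) is an assumption, as the summary does not give the paper's exact formula. The counts 5 and 8 are hypothetical and chosen only because they reproduce the reported ratio.

```python
def verification_catch_rate(caught: int, total_failures: int) -> float:
    """Assumed definition: share of failing attempts that the verify stage flagged."""
    if total_failures <= 0:
        raise ValueError("VCR is undefined when there are no failures")
    if not 0 <= caught <= total_failures:
        raise ValueError("caught must lie between 0 and total_failures")
    return caught / total_failures


# Hypothetical counts, NOT taken from the paper.
print(verification_catch_rate(caught=5, total_failures=8))
```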

Cross-Model Generalizability

Harness advantages replicate across models but are modulated by inherent model characteristics and inference speed. Qwen3.5:2B already achieves a high model-only TSR (0.952), which the paper attributes to rapid inference, so the pipeline has little room to improve on it (pipeline TSR = 0.857). LLaMA 3.2 3B, by contrast, needs harness intervention to reach operational stability because its output format is unstable under workload. Category decomposition reveals that workflow completion and constraint-sensitive tasks benefit most from harness complexity.

Failure Mode Taxonomy

Failures are classified by harness fixability: timeouts and rule-detectable constraints are generally fixable via reactive harness mechanisms; scaffold collapse mandates proactive scaffolding; quantitative constraints may require external rule-based verification; hallucinations indicate intrinsic model limitations.
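
The taxonomy above can be encoded as a small lookup table, for example to route logged failures to the right mitigation. In this Python sketch the enum and function names are illustrative; the mapping itself restates the classification in the preceding paragraph.

```python
from enum import Enum


class FailureMode(Enum):
    TIMEOUT = "timeout"
    RULE_DETECTABLE_CONSTRAINT = "rule-detectable constraint"
    SCAFFOLD_COLLAPSE = "scaffold collapse"
    QUANTITATIVE_CONSTRAINT = "quantitative constraint"
    HALLUCINATION = "hallucination"


# Mitigation per failure mode, as described in the taxonomy.
MITIGATION = {
    FailureMode.TIMEOUT: "reactive harness mechanism (retry / recovery)",
    FailureMode.RULE_DETECTABLE_CONSTRAINT: "reactive harness mechanism (verify then recover)",
    FailureMode.SCAFFOLD_COLLAPSE: "proactive scaffolding (planning stage)",
    FailureMode.QUANTITATIVE_CONSTRAINT: "external rule-based verification",
    FailureMode.HALLUCINATION: "intrinsic model limitation (not harness-fixable)",
}

# Modes the text describes as generally fixable by reactive mechanisms.
REACTIVELY_FIXABLE = {FailureMode.TIMEOUT, FailureMode.RULE_DETECTABLE_CONSTRAINT}


def is_reactively_fixable(mode: FailureMode) -> bool:
    """True if retrying with error feedback is the expected remedy."""
    return mode in REACTIVELY_FIXABLE


for mode in FailureMode:
    print(f"{mode.value}: {MITIGATION[mode]}")
```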

Tool Integration Limitations

Task T6 (Web Search) exposes a methodological limitation—pipeline vs. model-only comparisons conflate tool availability with orchestration quality. Redesign proposals call for equal-tool allocation across harness conditions for valid orchestration measurement.

Practical and Theoretical Implications

The findings indicate that the operational reliability of SLMs in real-world applications depends strongly on harness design, an effect that static benchmarks do not capture. Parameter count and generative quality alone are insufficient; harnesses must provide both reactive recovery and proactive scaffolding to achieve robust completion and compliance in multi-step, format-sensitive workloads.

From an engineering perspective, harness complexity must be judiciously balanced—half-measures may impede stability due to increased cognitive load or superficial structure. For theory, proactive planning emerges as a critical mechanism for quantitative constraint anchoring, aligning with recent agent-oriented LLM control architectures.

Future harness development should integrate rule-based verification and diversify recovery strategies, optimize pipelines for latency-sensitive use cases, and systematically benchmark orchestration under equal tool access. Unified cross-condition measurement is required to mitigate session or resource-driven artifacts.

Conclusion

This paper argues that harness engineering, not model size, is the principal determinant of operational stability for small LLMs deployed in edge scenarios. Pipeline harnesses that combine planning, verification, and recovery address both the failures that need reactive recovery and those that need proactive scaffolding, enabling SLMs to surpass operational thresholds for task completion and format compliance. Harnesses perform dual roles: enforcing structure independently of content generation, and actively mediating completion. The results point to comprehensive harness design as a prerequisite for reliable real-world SLM deployment and suggest future research on lightweight yet effective orchestration frameworks for efficient, stable SLM operation.
