Cause of progressive degradation in quantised Phi-3.5 models during extended processing

Identify the underlying cause(s) of the progressive model degradation and subsequent incoherent outputs observed in quantised Phi-3.5 Mini 3.8B models during extended batch processing of paediatric renal biopsy reports on CPU-only 16GB RAM hardware, and determine whether the failure mode is driven by memory exhaustion, key–value cache accumulation across sequential inference calls, quantisation-related instability with long context lengths, or other factors.
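To see why KV-cache accumulation is a plausible culprit on 16GB hardware, a back-of-envelope calculation of cache size versus context length helps. The sketch below uses illustrative hyperparameters (32 layers, 32 KV heads, head dimension 96, fp16 cache entries), which are assumptions for demonstration, not confirmed Phi-3.5 Mini values:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Total KV-cache size: the K and V tensors each hold one
    (n_kv_heads x head_dim) vector per layer per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative values only (NOT confirmed Phi-3.5 Mini hyperparameters).
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=96,
                       seq_len=4096, bytes_per_elem=2)
print(f"{cache / 2**30:.2f} GiB")  # grows linearly with seq_len
```

Under these assumptions a 4,096-token context already occupies on the order of 1.5 GiB, and the cost scales linearly with sequence length, so cache state carried across hundreds of sequential calls could plausibly crowd out the quantised weights on a 16GB machine.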

Background

During large-batch annotation with two-shot prompts, the quantised Phi-3.5 Mini models (Q4 and Q8) processed hundreds of reports before degrading catastrophically and producing incoherent text for the remaining reports, unlike the other small LLMs evaluated, which exhibited only minimal JSON-parsing errors.

The authors hypothesise several potential explanations, including hardware memory limits (16GB RAM), accumulation of the key–value (KV) cache across sequential inference calls, and quantisation-related instability under long contexts, but the precise mechanism has not yet been established.
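The competing hypotheses could be separated empirically by instrumenting the annotation loop: logging peak resident memory per report (to catch memory exhaustion) and recording the index of the first report whose output fails to parse (to correlate onset of incoherence with memory or call count). A minimal diagnostic harness along those lines, using only the standard library, might look like this; the `infer` callable, the 14GB threshold, and the function name are illustrative assumptions:

```python
import json
import resource  # POSIX-only; ru_maxrss is KiB on Linux, bytes on macOS


def annotate_reports(reports, infer, max_rss_mb=14_000):
    """Run sequential inference, flagging the first report whose output
    stops parsing as JSON and aborting if peak RSS exceeds a threshold."""
    results, first_failure = [], None
    for i, report in enumerate(reports):
        raw = infer(report)
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            if first_failure is None:
                first_failure = i  # onset of incoherent output
            results.append(None)
        peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024
        if peak_mb > max_rss_mb:
            break  # memory exhaustion is the likely culprit; stop early
    return results, first_failure
```

If the failure onset tracks peak RSS, memory exhaustion is implicated; if outputs degrade at a consistent report count regardless of memory headroom, state carried across calls is more likely. Explicitly resetting the model's cached state between reports (for example, llama-cpp-python exposes a `reset()` method on its `Llama` object) would further isolate KV-cache accumulation from quantisation-related instability.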

References

Whilst the cause of this progressive model degradation during extended processing sessions remains unclear, possible explanations include memory exhaustion on our limited hardware (16GB RAM), KV cache accumulation across sequential inference calls, or quantisation-related instability under extended processing with longer context lengths.

A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models  (2604.04168 - Vijayaraghavan et al., 5 Apr 2026) in Results: Model Performance and Comparison