Claude Haiku 4.5: Orthographic Evaluation
- Claude Haiku 4.5 is a proprietary language model evaluated here on its ability to generate valid English words while strictly adhering to fixed orthographic constraints.
- The evaluation uses uniform prompting with varying internal reasoning budgets; F1 rises from 0.544 in direct mode to 0.680 at a 16K reasoning budget.
- Benchmark comparisons show that while it trails GPT-5-mini in absolute performance, it substantially outperforms open-source models, an advantage plausibly attributable to specialized instruction tuning and training objectives.
Claude Haiku 4.5 is a proprietary LLM evaluated for orthographic constraint satisfaction, with particular emphasis on its ability to generate valid English words under strict character-level rules. Benchmarked against both proprietary and open-source models, Claude Haiku-4.5 demonstrates strong absolute and relative performance, especially on tasks requiring systematic enumeration of words that respect explicit orthographic constraints. Assessments using human difficulty ratings reveal both strengths and systematic failure modes, with implications for future architectural innovations.
1. Evaluation Context: Orthographic Constraint Satisfaction
Claude Haiku-4.5 was assessed in the context of controlled text generation tasks requiring adherence to orthographic constraints. The primary benchmark was a Spelling Bee–style evaluation based on 58 consecutive New York Times Spelling Bee puzzles (June 2–July 29, 2025). Each puzzle defines a set C of seven characters and a designated center character c ∈ C; a valid output w must:
- Use only characters from C,
- Include c,
- Be at least four characters long (|w| ≥ 4).
Across these puzzles, 2,007 unique solutions were identified in 2,710 word instances. Human difficulty was established via empirical success rates from over 10,000 daily solvers per puzzle, providing an external calibration curve for model output versus human ability (Tuck et al., 26 Nov 2025).
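Under these rules, checking a candidate word against a puzzle reduces to three set and length tests. A minimal sketch (function and example puzzle are illustrative, not from the paper):

```python
def is_valid(word: str, letters: set[str], center: str) -> bool:
    """Spelling Bee validity: only puzzle letters, center letter present,
    and length of at least four characters."""
    word = word.lower()
    return (
        len(word) >= 4
        and center in word
        and set(word) <= letters
    )

# Hypothetical puzzle: letters {a, d, l, o, p, t, u} with center 'a'
letters = set("adloptu")
assert is_valid("data", letters, "a")
assert not is_valid("dot", letters, "a")  # too short and missing no letters, but < 4 chars
```

Note that dictionary membership (whether the string is a real English word) is a separate check handled by the benchmark's human-verified solution set.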
2. Model Configurations and Experimental Procedure
Claude Haiku-4.5 was tested as a zero-shot generator using uniform prompts specifying the character set and constraint on the center letter. Four configurations were used:
- Direct: a single forward pass without explicit chain-of-thought or internal reasoning tokens.
- Thinking modes: the model was allowed internal chain-of-thought processing with maximum reasoning budgets of 4K, 8K, and 16K tokens.
Prompting remained identical across settings: “Find as many valid English words as possible—one per line,” with the seven permitted letters and the center letter highlighted; no information about the expected solution count was provided.
3. Performance Metrics and Comparative Results
Performance was quantified using precision (P), recall (R), and the F1 score (F1 = 2PR / (P + R)), with
P = |G ∩ S| / |G|, R = |G ∩ S| / |S|,
where G is the model's output set and S is the human-verified solution set.
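With G and S represented as sets of lowercase words, these metrics take only a few lines; a minimal sketch with toy data (the word sets below are hypothetical):

```python
def prf1(generated: set[str], solutions: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a model's output set G
    against the human-verified solution set S."""
    hits = len(generated & solutions)
    p = hits / len(generated) if generated else 0.0
    r = hits / len(solutions) if solutions else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: 3 of 4 generated words are valid; 3 of 6 solutions are found
p, r, f1 = prf1({"loll", "doll", "told", "xyzz"},
                {"loll", "doll", "told", "toll", "loot", "toot"})
# p = 0.75, r = 0.5, f1 = 0.6
```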
A comparison of top-performing configurations across families is captured in the following table:
| Model | Budget | Precision | Recall | F₁ |
|---|---|---|---|---|
| GPT-5-mini | 16K | 0.888 | 0.680 | 0.761 |
| Claude-Haiku-4.5 | 16K | 0.851 | 0.574 | 0.680 |
| Claude-Haiku-4.5 | 8K | 0.842 | 0.515 | 0.633 |
| Qwen-32B | 8K | 0.797 | 0.233 | 0.343 |
Key findings include:
- Proprietary models (Claude Haiku-4.5, GPT-5-mini) achieve F1 scores approximately 2.0–2.2× higher than the best-performing open-source model (Qwen-32B), a gap due primarily to recall (Claude Haiku-4.5 recall: 57.4% vs. Qwen-32B: 23.3%), while precision differences are moderate (~9 points).
- The best Claude Haiku-4.5 configuration (16K thinking budget) attains F1 = 0.680, trailing GPT-5-mini (0.761) but substantially outperforming Qwen-32B (0.343) (Tuck et al., 26 Nov 2025).
4. Thinking Budget Sensitivity and Architectural Observations
Claude Haiku-4.5 consistently converts increased internal “thinking” budget into monotonic performance gains:
- Direct mode (no chain-of-thought): F1 = 0.544
- 4K reasoning tokens: F1 = 0.616 (+0.072)
- 8K reasoning tokens: F1 = 0.633 (+0.017 over 4K)
- 16K reasoning tokens: F1 = 0.680 (+0.047 over 8K; total +0.136 over direct)
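The marginal gains in this progression can be tabulated mechanically from the reported F1 values; a small sketch:

```python
# Reported Claude Haiku-4.5 F1 scores by thinking budget ("direct" = no CoT)
f1_by_budget = {"direct": 0.544, "4K": 0.616, "8K": 0.633, "16K": 0.680}

budgets = list(f1_by_budget)
for prev, cur in zip(budgets, budgets[1:]):
    gain = f1_by_budget[cur] - f1_by_budget[prev]
    print(f"{prev} -> {cur}: +{gain:.3f}")
print(f"total over direct: +{f1_by_budget['16K'] - f1_by_budget['direct']:.3f}")
```

The per-step gains are not monotonically decreasing (the 8K→16K step is larger than the 4K→8K step), which matters for the adaptive-budget policies discussed later.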
This pattern exemplifies the "high-capacity" effect: larger models leverage additional reasoning steps for higher-precision constraint verification. In contrast, mid-sized Qwen models degrade with extra budget (Qwen-14B drops from 0.289 at 4K to 0.253 at 16K), establishing that budget sensitivity is heterogeneous and architecture-dependent.
A plausible implication is that internal architecture—including memory span and capacity for explicit multi-constraint reasoning—is decisive for effective orthographic verification.
5. Calibration to Human Difficulty and Systematic Error Patterns
Human difficulty for each word w is measured as the complement of the empirical success rate, d_w = 1 − s_w. The Pearson correlation r between d_w and the model's miss rate (the fraction of puzzles in which w is not generated) quantifies alignment with human word difficulty. The Qwen family reaches r values only as high as 0.266, whereas Claude Haiku-4.5 attains the strongest correlation of the models tested, ahead of GPT-5-mini.
Although Claude Haiku-4.5 yields the best human-alignment among tested models, it still demonstrates pronounced and consistent blind spots on frequent, human-trivial words with atypical orthography. Illustrative examples include high miss rates for “data,” “poop,” and “loll,” all with human success rates above 93% and model miss rates above 89%. These errors are not explained by vocabulary deficits; instead, they reflect an over-reliance on distributional plausibility, systematically penalizing:
- Double consonants (e.g., “loll,” “illicit”)
- Repeated letters (e.g., “poop,” “papa”)
- Truncated or informal forms (e.g., “acai,” “data”)
This suggests that Claude Haiku-4.5's underlying generation priorities may overweight common surface-form frequency, to the detriment of constraint-valid but orthographically uncommon words.
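The failure categories above can be detected with simple surface tests; a rough sketch (the heuristics are this summary's illustration, not the paper's method):

```python
import re

def surface_flags(word: str) -> dict[str, bool]:
    """Heuristic flags for orthographic patterns the model tends to miss."""
    w = word.lower()
    return {
        # adjacent identical consonants, e.g. "loll", "illicit"
        "double_consonant": bool(re.search(r"([b-df-hj-np-tv-z])\1", w)),
        # any adjacent identical letters, e.g. "poop"
        "double_letter": bool(re.search(r"(.)\1", w)),
        # any letter used more than once, e.g. "papa", "data"
        "repeated_letter": len(set(w)) < len(w),
    }

assert surface_flags("loll")["double_consonant"]
assert surface_flags("poop")["repeated_letter"]
```

Tagging benchmark solutions with such flags would make it possible to report miss rates conditioned on each pattern, quantifying the frequency bias described above.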
6. Architectural and Training Factors; Research Directions
Cross-family performance disparities (2.0–2.2×) outweigh within-family scaling gains (an 83% increase from Qwen-4B to Qwen-32B), indicating that architectural specialization and supervisory signals drive constraint-satisfaction capability more than sheer parameter count or standard scaling. Likely contributors to Claude Haiku-4.5's advantage include:
- Specialized instruction-tuned objectives promoting explicit constraint satisfaction over predictive plausibility.
- Broad and diverse pre-training and fine-tuning corpora that may include relevant constraint-based puzzles.
- Possible inclusion of internal logic modules or working-memory enhancements for multi-constraint evaluation and robust long-form word enumeration.
To address remaining limitations, three avenues are proposed:
- Integration of sub-modular orthographic verification (“symbolic checkers”) orthogonal to next-token prediction.
- Development of training objectives or auxiliary tasks that reward production of rare, but valid, orthographically complex words, thereby decoupling constraint-satisfaction from frequency-based plausibility.
- Adoption of adaptive budget policies for reasoning tokens, to optimize the allocation of generation effort in accordance with puzzle complexity and diminishing marginal utility of further computation.
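The adaptive-budget idea could be prototyped as a simple utility trade-off: choose the reasoning budget that maximizes expected F1 minus a linear compute cost. A hypothetical sketch (the function, cost model, and threshold are assumptions, not from the paper):

```python
def choose_budget(f1_by_budget: dict[int, float],
                  cost_per_token: float = 2e-6) -> int:
    """Pick the reasoning-token budget maximizing F1 minus a linear
    compute cost, modeling diminishing marginal utility of computation."""
    return max(f1_by_budget, key=lambda b: f1_by_budget[b] - cost_per_token * b)

# With the reported Claude Haiku-4.5 F1 values, a low per-token cost
# favors the full 16K budget; a high cost makes 4K the better trade-off.
f1 = {4000: 0.616, 8000: 0.633, 16000: 0.680}
cheap = choose_budget(f1, cost_per_token=2e-6)   # -> 16000
expensive = choose_budget(f1, cost_per_token=1e-5)  # -> 4000
```

In practice the F1 estimates would come from held-out puzzles, and the cost coefficient from the deployment's token pricing.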
7. Significance and Open Questions
Claude Haiku-4.5 establishes a strong baseline for orthographic constraint satisfaction in LLMs, attaining both high absolute performance (F1 up to 0.680 at a 16K token budget) and best-in-class calibration with human difficulty (the highest Pearson correlation among tested models). Nonetheless, persistent systematic errors on orthographically irregular but human-trivial words delimit the model's ability to match human flexibility and generalize across surface-form anomalies. Architectural enhancements focused on explicit symbolic verification, more targeted training, and dynamic resource allocation emerge as critical research directions for bridging the gap to human-level orthographic reasoning (Tuck et al., 26 Nov 2025).