LaCy: Pretraining Framework for Fact Delegation
- LaCy is a pretraining framework for small language models that improves factual generation by selectively delegating hard factual tokens using a dedicated call interface.
- It integrates a spaCy-based factuality detector with per-token loss ranking to identify and offload factually critical tokens during training.
- Empirical evaluations show that LaCy boosts FactScore, reduces fact leakage, and maintains language understanding performance with minimal added complexity.
LaCy is a pretraining framework for small language models (SLMs) designed to selectively delegate the prediction of factually critical tokens (those for which SLMs have limited capacity or accuracy) to a larger external model. Rather than relying solely on next-token prediction loss, LaCy combines a syntactic/semantic judgment of factuality with per-token loss statistics to partition tokens into those the SLM should learn to generate and those it should offload via a dedicated <CALL> interface. This strategy improves factual compositionality, FactScore, and robustness of cascaded generation, while keeping SLM pretraining efficient and simple (Ujváry et al., 12 Feb 2026).
1. Motivation for Selective Delegation in SLM Pretraining
The accuracy of LLMs is upper-bounded by their parameter capacity. SLMs (≤1B parameters) cannot memorize all domain knowledge, resulting in statistical approximations or hallucinations for factual content. In deployment settings where SLMs can query larger models or retrieval systems, a more principled approach is to (a) let the SLM freely predict grammatical or inferable content, and (b) explicitly delegate the "difficult" factual tokens whose prediction quality is degraded by limited capacity.
Naïve delegation strategies based purely on high cross-entropy loss are insufficient, because high-loss tokens may include both truly hard facts and grammatically flexible tokens, not all of which are factually significant. The central insight of LaCy is to guide delegation using the intersection of high loss and explicit factuality, as detected by a spaCy-based factuality parser, to avoid both memorization of hard facts and excessive delegation of harmless linguistic tokens (Ujváry et al., 12 Feb 2026).
2. Formal Definition and Core Algorithm
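Given a streaming corpus and LM parameters $\theta$, LaCy introduces two central components:
- Factuality Detector $C_{\text{spaCy}}$: A binary classifier implemented using spaCy's NER, POS, and dependency parsing, combined with direct heuristics to flag "first-mention" tokens for entities, dates, numerics, and other fact-bearing items.
- Per-batch Loss Ranking: For each position $i$, compute the next-token negative log-likelihood $\ell_i = -\log p(x_{i+1} \mid x_{1:i}; \theta)$ and mark as "hard" those tokens whose loss falls in the top $r\%$ within a mini-batch.
The call-mask is defined as:
$$C_{\text{LaCy}}(i) = \begin{cases} 1 & \text{if } C_{\text{spaCy}}(x_{i+1}) = 1 \text{ and } \ell_i \text{ is among the top } r\% \text{ of losses in the mini-batch}, \\ 0 & \text{otherwise,} \end{cases}$$
where $r$ is adjusted to match a predetermined call rate, typically 15%.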
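The pretraining objective is the cross-entropy against the call-masked targets:
$$\mathcal{L}_{\text{LaCy}}(\theta) = -\sum_i \Big[ C_{\text{LaCy}}(i)\, \log p(\texttt{<CALL>} \mid x_{1:i}; \theta) + \big(1 - C_{\text{LaCy}}(i)\big)\, \log p(x_{i+1} \mid x_{1:i}; \theta) \Big],$$
where the second term scores the ground-truth token $x_{i+1}$, which ranges over all non-<CALL> tokens.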
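The intersection rule behind the call-mask can be illustrated with a minimal plain-Python sketch. The function name `call_mask`, the flat-list batch layout, and the rounding used to turn the fraction `r` into a token count are assumptions made for illustration, not details taken from the paper.

```python
def call_mask(losses, factual, r=0.15):
    """Return 1 where a position is flagged factual AND is among the top-r fraction of losses.

    losses  : per-position next-token losses for the whole mini-batch (flattened)
    factual : 0/1 flags from the factuality detector, same length as losses
    r       : fraction of positions treated as "hard" (assumed rounding below)
    """
    n = len(losses)
    k = max(1, round(r * n))  # number of "hard" positions (illustrative choice)
    # Rank positions by loss, highest first; ties keep their original order.
    ranked = sorted(range(n), key=lambda i: losses[i], reverse=True)
    hard = set(ranked[:k])
    return [1 if (i in hard and factual[i]) else 0 for i in range(n)]


losses = [0.5, 6.1, 5.8, 1.2, 7.0, 0.3, 2.2, 4.9, 0.9, 1.1]
factual = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
mask = call_mask(losses, factual, r=0.3)
```

In this toy batch, position 2 has high loss but is not factual, and position 7 is factual but not among the three hardest positions, so neither is delegated. Only positions 1 and 4 satisfy both conditions.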
3. Implementation and Pretraining Workflow
LaCy is implemented by integrating spaCy's en_core_web_sm pipeline to annotate raw text with factual markers. Heuristics include registering only the first mention of each PERSON/ORG/DATE, the initial occurrence of numerics, and noun-chunk-based checks. On a large corpus (e.g., ∼3B-token dwiki), the cost of parsing and annotation (~152 hours per 1B tokens on a single CPU) is amortized over pretraining, and the annotator can also be integrated directly into the dataloader.
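The first-mention heuristic for named entities can be sketched as follows. This is a simplified illustration, not the authors' annotator: it covers only the PERSON/ORG/DATE rule, omits the numeric and noun-chunk checks, and ignores the alignment between spaCy tokens and the LM's subword tokens. The helper names `first_mention_flags` and `spacy_entities` are hypothetical. The second helper relies only on standard spaCy attributes (`doc.ents`, `ent.start`, `ent.end`, `ent.label_`, `ent.text`) and needs the `en_core_web_sm` model installed.

```python
FACT_LABELS = {"PERSON", "ORG", "DATE"}  # labels named in the text above


def first_mention_flags(n_tokens, entities):
    """Flag tokens of the first mention of each fact-bearing entity.

    entities: iterable of (start, end, label, text) token spans in document
              order, with `end` exclusive (the layout spaCy uses for spans).
    """
    flags = [0] * n_tokens
    seen = set()
    for start, end, label, text in entities:
        key = (label, text.lower())
        if label in FACT_LABELS and key not in seen:
            seen.add(key)
            for i in range(start, end):
                flags[i] = 1
    return flags


def spacy_entities(text):
    """Extract entity spans with spaCy (imported lazily; not run in the demo below)."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return len(doc), [(e.start, e.end, e.label_, e.text) for e in doc.ents]


# Hand-written spans standing in for spaCy output on an 8-token sentence.
spans = [
    (0, 2, "PERSON", "Marie Curie"),
    (3, 4, "GPE", "Paris"),
    (4, 5, "DATE", "1903"),
    (6, 8, "PERSON", "Marie Curie"),  # repeat mention, not flagged again
]
flags = first_mention_flags(8, spans)
```

Keeping the flagging logic separate from the spaCy call makes the rule easy to unit-test and lets the annotation run either offline or inside a dataloader.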
The model backbone is a GPT-2 style transformer (334M–1.3B parameters; 24 layers; up to 2048-dim, 16 heads; context window of 1024). Pretraining uses batch sizes of 112–128, AdamW with a learning rate of 2e-4 or 1e-4, and 340k–440k total steps. LaCy runs ~15% more steps than a vanilla baseline to equalize exposure to ground-truth tokens.
Pseudo-code for a pretraining step:
    Input: streaming text corpus, factuality detector C_spaCy, batch size B, call-rate target r (15%)
    Initialize θ randomly; add special token <CALL> to the vocabulary
    for each pretraining step:
        Sample a batch 𝔅 of B token sequences
        for each sequence x in 𝔅 and position i:
            Compute SLM loss ℓ_i = -log p(x_{i+1} | x_{1:i}; θ)
        end for
        Mark the top r% highest-loss positions in 𝔅 as hard
        for each sequence x in 𝔅 and position i:
            C_LaCy(i) ← 1 if C_spaCy(x_{i+1}) == 1 and i is hard; 0 otherwise
            Target(i) ← <CALL> if C_LaCy(i) == 1; else x_{i+1}
        end for
        Compute LaCy cross-entropy loss against the targets and backpropagate
    end for
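The last two steps of the pseudo-code, target substitution and the cross-entropy over the substituted targets, can be sketched in plain Python. The call-mask is taken as given here. The function names, the toy vocabulary of four ids (with id 3 standing in for <CALL>), and the averaging over positions are illustrative assumptions. A real implementation would compute the same quantity on model logits with an autodiff framework.

```python
import math


def build_targets(next_tokens, mask, call_id):
    """Replace the ground-truth next token with <CALL> wherever the mask is 1."""
    return [call_id if c else t for t, c in zip(next_tokens, mask)]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of `targets` under softmax(logits[i])."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]
    return total / len(targets)


CALL_ID = 3                      # hypothetical id of the <CALL> token
next_tokens = [0, 2, 1]          # ground-truth next tokens x_{i+1}
mask = [0, 1, 0]                 # call-mask for the same positions
targets = build_targets(next_tokens, mask, CALL_ID)

uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in targets]
loss = cross_entropy(uniform_logits, targets)
```

With uniform logits over four ids the loss equals log 4 regardless of the targets, which gives a quick sanity check before plugging in real model outputs.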
4. Empirical Evaluation and Comparative Results
LaCy’s improvements are evaluated across several tasks and metrics:
- Factual Generation (FactScore): On a 334M-parameter model, average FactScore reaches 32.2% for LaCy, compared with 25.4% for a vanilla SLM, 28.7% for loss-only gating, 29.1% for reference-loss delegation, and 30.3% using LLM-judge as the delegation oracle. With a more advanced retrieval-augmented generator (Qwen 3 32B + RAG) as the cascade, LaCy attains ~66% FactScore, outperforming all baselines.
- Fact Leakage: When <CALL> is disabled and the SLM is forced to generate all tokens, LaCy achieves the lowest leakage (30% containment of gold answers on BigBench-Wikidata and PopQA), compared with 45–50% for alternatives, confirming that LaCy discourages harmful memorization.
- Natural Language Understanding: On multiple-choice NLU tasks (ARC-Easy, HellaSwag, PIQA, SIQA), LaCy matches vanilla SLM performance (39.9% vs. 39.6%), indicating no degradation of general language competence.
- Loss Analysis: LaCy's delegation aligns with the highest-loss factual tokens (call-loss ≈5.7), and its non-call-loss (2.8) is as low as or lower than that of other approaches. However, aggregate validation loss does not correlate with FactScore, highlighting the inadequacy of loss-only evaluation.
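The fact-leakage figure above is described as containment of gold answers in the SLM's own output. A minimal sketch of such a containment metric is shown below. The case-insensitive substring match and the function name `containment_rate` are assumptions, and the benchmarks' actual normalization and matching rules may differ.

```python
def containment_rate(generations, gold_answers):
    """Fraction of generations that contain at least one of their gold answers.

    generations  : list of generated strings (one per question)
    gold_answers : list of lists of acceptable answer strings
    Matching is case-insensitive substring containment (an assumed rule).
    """
    hits = 0
    for text, golds in zip(generations, gold_answers):
        lowered = text.lower()
        if any(g.lower() in lowered for g in golds):
            hits += 1
    return hits / len(generations)


generations = [
    "Paris is the capital of France.",
    "I am not sure.",
    "It was completed in 1889.",
]
gold_answers = [["Paris"], ["Berlin"], ["1889", "eighteen eighty-nine"]]
rate = containment_rate(generations, gold_answers)
```

In this setting a lower rate is better, since it means fewer gold facts are produced by the SLM alone when delegation is switched off.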
The table below summarizes performance on key generation benchmarks (334M model, 22% actual call-rate):
| Approach | FactScore (%) | Fact Leakage (%) | NLU Score (%) |
|---|---|---|---|
| Vanilla SLM | 25.4 | 45–50 | 39.6 |
| Loss-only | 28.7 | ~40 | — |
| Rho-1 | 29.1 | ~40 | — |
| LLM-judge | 30.3 | ~35 | — |
| LaCy | 32.2 | 30 | 39.9 |
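To make the gaps in the table easier to read, the following short sketch derives absolute and relative FactScore gains over the vanilla SLM from the values reported above. Only the arithmetic is new, and the helper name `gains_over` is hypothetical.

```python
# FactScore values copied from the table above (334M model).
fact_score = {
    "Vanilla SLM": 25.4,
    "Loss-only": 28.7,
    "Rho-1": 29.1,
    "LLM-judge": 30.3,
    "LaCy": 32.2,
}


def gains_over(baseline, scores):
    """Return {approach: (absolute gain in points, relative gain in %)} vs. the baseline."""
    base = scores[baseline]
    return {
        name: (round(s - base, 1), round(100 * (s - base) / base, 1))
        for name, s in scores.items()
        if name != baseline
    }


gains = gains_over("Vanilla SLM", fact_score)
```

Here `gains["LaCy"]` evaluates to `(6.8, 26.8)`, meaning a 6.8-point absolute gain and a relative gain of about 27% over the vanilla SLM.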
5. Algorithmic Choices, Ablations, and Analysis
Ablation studies demonstrate that spaCy-only random sampling produces weaker FactScores and more leakage. spaCy with reference-model loss ("Rho") provides only marginal improvement at much higher compute overhead. Skipping backpropagation on remaining facts adds minimal further gain and can harm NLU if overapplied.
LaCy’s effectiveness is attributed to:
- Targeting delegation strictly to high-loss factual tokens.
- Retaining high-loss grammatical tokens in the SLM's learning stream, enabling better generalization.
- Maintaining a fixed call-rate budget, thus capping delegation and promoting efficiency.
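One way to keep the realized call rate on budget is to widen the "hard" loss cutoff until the factual tokens inside it fill the target number of calls. The sketch below shows that idea under stated assumptions. The function name `calibrate_hard_fraction`, the per-batch rounding of the budget, and the greedy sweep are illustrative choices, not the calibration procedure reported in the paper.

```python
def calibrate_hard_fraction(losses, factual, target_rate=0.15):
    """Find the smallest top-loss fraction whose factual tokens fill the call budget.

    Returns (hard_fraction, mask), where mask marks the delegated positions.
    """
    n = len(losses)
    budget = round(target_rate * n)  # number of <CALL> targets allowed (assumed rounding)
    mask = [0] * n
    if budget == 0:
        return 0.0, mask
    # Sweep positions from highest to lowest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    calls = 0
    for rank, i in enumerate(order, start=1):
        if factual[i]:
            mask[i] = 1
            calls += 1
            if calls == budget:
                return rank / n, mask
    return 1.0, mask  # fewer factual tokens than the budget allows


losses = [0.5, 6.1, 5.8, 1.2, 7.0, 0.3, 2.2, 4.9, 0.9, 1.1]
factual = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
hard_fraction, mask = calibrate_hard_fraction(losses, factual, target_rate=0.3)
```

In this toy batch the top 40% of losses must be treated as hard to obtain a 30% call rate, because one of the four highest-loss positions is not factual and therefore stays in the SLM's learning stream.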
The factuality annotation covers approximately 25% of corpus tokens; further advances in parsing, classification, or domain adaptation could improve coverage without negatively influencing language fluency.
6. Discussion, Limitations, and Future Directions
LaCy's approach to delegating factual tokens is scalable and lightweight. The hand-crafted factuality detector is efficient, but some genuinely factual tokens may fall outside the coverage of its spaCy heuristics. A plausible implication is that more sophisticated NLP classifiers or structured knowledge extraction could further refine the call-masks.
The experiments are limited to at most $10^{11}$ tokens and 1.3B-parameter models. It remains an open research question whether the optimal delegation threshold or the properties of the factuality detector change across orders of magnitude in scale. Integrating non-LM tools (e.g., retrieval-augmented generation, database queries) into the <CALL> handler would generalize LaCy to hybrid neuro-symbolic settings.
A theoretical analysis of optimal delegation under hard parameter-capacity constraints for SLMs is identified as an important direction for future work.
7. Summary of Impact and Practical Implications
LaCy establishes a principled, computationally lightweight strategy for SLMs to partition generation between learned context-sensitive patterns and factual delegation, implementing a modular approach to factual language modeling (Ujváry et al., 12 Feb 2026). It achieves higher factual precision in open-ended language generation, reduces fact leakage, and operates without negative side effects on NLU or training complexity. This suggests LaCy can serve as a practical base for future SLM development, particularly in resource-constrained or hybrid cascade deployments.