
LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Published 12 Feb 2026 in cs.CL | (2602.12005v1)

Abstract: LLMs have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter count. The capacity of Small LLMs (SLMs) in particular is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a <CALL> token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger a <CALL> even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token-selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho- or LLM-judge-trained SLMs, while being simpler and cheaper.

Summary

  • The paper demonstrates that loss signals alone are insufficient for SLM token selection, integrating spaCy-based factual annotation to guide delegation.
  • It introduces a dual pipeline that triggers <CALL> on high-loss factual tokens, achieving improved FactScore and reduced fact leakage in biographical generation.
  • Empirical findings on GPT-2 variants show that delegated factual predictions preserve SLM NLU capabilities while mitigating risky fact memorization.

Authoritative Summary of "LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss" (2602.12005)

Motivation and Problem Statement

The paper investigates the learnability boundaries of Small LLMs (SLMs), emphasizing their constrained parametric memory and susceptibility to factual errors. While large LLMs saturate benchmarks by memorizing world knowledge, SLMs, defined here by parameter counts typically below 1B, must balance efficient token prediction against their limited capacity. This constraint is often mitigated by cascade mechanisms: SLMs predict what they can, then delegate to a larger partner (a bigger model, documents, or a database) via a special <CALL> token. The authors pose the fundamental question: which tokens should be "learned" by an SLM, and which should trigger a <CALL> deferral?

Conventional pretraining minimizes token-wise cross-entropy loss. However, the paper demonstrates that loss alone is inadequate for token selection, because it cannot distinguish factual errors from acceptable alternative continuations. Many tokens have high loss simply because the prediction does not exactly match the reference, yet remain acceptable (non-factual variations such as synonyms or grammatical alternatives), whereas factual tokens (entities, dates) require precise prediction to avoid hallucinations.
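A toy numeric illustration of this blindness (the numbers are invented for illustration, not taken from the paper):

```python
import math

# Cross-entropy only sees the probability assigned to the ground-truth
# token, not whether a mismatch would be harmless or a factual error.
p_true = 0.05                 # probability the model gave the true token
loss = -math.log(p_true)      # ~3.0 nats in both scenarios below

# Scenario 1: truth "engine", model prefers "motor"   -> acceptable synonym
# Scenario 2: truth "Salzburg", model prefers "Paris" -> factual error
print(f"identical loss in both cases: {loss:.2f}")
```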

LaCy Framework and Methodological Innovations

LaCy is introduced as a token-selection framework that augments the loss signal with spaCy-based grammatical and factual annotation to distinguish between tokens that SLMs can safely learn and those that should be delegated. The method identifies factual tokens (e.g., new entities, dates) with spaCy, then delegates those with high loss. Non-factual tokens may have high loss without semantic risk, so they are not automatically delegated. Formally, within each batch, LaCy replaces ground-truth targets with <CALL> tokens where the spaCy fact label and the loss cross a threshold (the top n% highest-loss factual tokens).
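A minimal sketch of this selection step, under simplifying assumptions stated in the comments (an illustration, not the authors' released code):

```python
# Hedged sketch of LaCy-style target construction; not the authors' code.
# Assumptions: word-level tokens stand in for subwords, "named entity or
# number-like" stands in for the paper's full spaCy heuristics, and the
# call budget is applied per sequence rather than per batch.
import spacy

nlp = spacy.load("en_core_web_sm")   # CPU-only tagging pipeline
CALL_ID = 50257                      # reserved vocab id for <CALL> (illustrative)
CALL_BUDGET = 0.15                   # ~15% training call budget (from the paper)

def lacy_targets(text: str, target_ids: list[int], losses: list[float]) -> list[int]:
    """Retarget the highest-loss factual tokens to the <CALL> id."""
    doc = nlp(text)
    # A token counts as "factual" if spaCy tags it as part of a named
    # entity (person, place, date, ...) or as number-like.
    factual = [bool(tok.ent_type_) or tok.like_num for tok in doc]
    # Assume 1:1 alignment of spaCy tokens and model tokens here; a real
    # pipeline must map entity character spans onto subword offsets.
    n = min(len(factual), len(target_ids), len(losses))
    new_targets = list(target_ids)
    factual_idx = [i for i in range(n) if factual[i]]
    if not factual_idx:
        return new_targets           # nothing to delegate
    k = max(1, int(CALL_BUDGET * n))
    # Only factual tokens compete for the call budget; non-factual
    # high-loss tokens keep their ground-truth targets.
    for i in sorted(factual_idx, key=lambda i: losses[i], reverse=True)[:k]:
        new_targets[i] = CALL_ID
    return new_targets
```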


Figure 1: LaCy pipeline - during pretraining, SLM's prediction and loss are parsed for factuality; factual tokens with high loss trigger training towards <CALL> output, which at inference delegates to a larger cascade model.

At inference, the SLM generates tokens autoregressively, emitting <CALL> when its capacity is insufficient for factual correctness. This triggers the cascade partner to produce the next token, supporting high factuality in downstream outputs while minimizing factual memorization in the SLM itself.
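A minimal sketch of this cascaded decoding loop, with both models abstracted as next-token callables (the function names and interface are illustrative, not the paper's API):

```python
from typing import Callable

CALL = "<CALL>"   # special delegation token (string form, for illustration)
EOS = "<EOS>"     # assumed end-of-sequence marker

def cascade_generate(slm_next: Callable[[str], str],
                     partner_next: Callable[[str], str],
                     prompt: str, max_tokens: int = 256) -> str:
    """Greedy cascaded decoding: the SLM writes; the partner fills in facts."""
    text = prompt
    for _ in range(max_tokens):
        tok = slm_next(text)          # SLM proposes the next token
        if tok == CALL:               # SLM defers on a token it cannot get right
            tok = partner_next(text)  # larger cascade model supplies it
        if tok == EOS:
            break
        text += tok
    return text
```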

Empirical Evaluation and Analysis

Factual Accuracy and Targeted Delegation

The main experimental setup involves pretraining GPT-2 architectures (334M and 1.3B parameters) on the dwiki dataset. Biographical generation is evaluated with FactScore, which quantifies the proportion of generated atomic facts supported by the true Wikipedia entry. LaCy achieves the highest FactScore among tested methods, notably outperforming Rho-based methods and LLM-judge annotation baselines, even with a relatively limited cascade partner (Llama 3.2 1B; Figure 2).

Figure 2: LaCy achieves superior FactScore while minimizing fact leakage by successfully learning factual delegation locations and offloading factual predictions.

Ablations demonstrate that spaCy-only annotation (without loss selection) underperforms LaCy, confirming the necessity of loss-augmented factual token delegation. The use of a reference model’s loss (Rho-1) yields marginal improvements, but incurs significant computational overhead.

Fact Leakage and Parametric Memory

To assess factual memorization, the SLM is evaluated in isolation (with <CALL> suppressed) on QA tasks. LaCy yields the lowest fact leakage, confirming that its delegation leaves minimal factual entanglement in the SLM’s parametric memory (Figure 2).

Figure 2: Left - FactScore for biography generation with cascade partner; Right - fact leakage in direct SLM generation (no cascade). LaCy optimally balances factual precision and parameter memory.

Token-Type Sensitivity: Loss versus Acceptability

The authors show empirically that loss does not correlate with downstream factual performance (FactScore) in cascaded setups. High loss may occur on non-factual tokens (e.g., grammatical alternatives) that pose no hallucination risk. Only the combination of spaCy factual annotation and loss reveals the tokens where SLM predictions are likely unacceptable, motivating targeted delegation (Figure 3).

Figure 3: Difference between accuracy (exact matching) and acceptability; loss is blind to token type, but spaCy parsing differentiates factual tokens needing deferral from non-factual tokens.

Validation loss metrics (call, non-call, and total) are shown to be poor predictors for factual accuracy in cascade scenarios, diverging from historical findings in scaling law literature [kaplan2020scaling].

NLU Performance and Model Capacity

NLU tasks (ARC-Easy, HellaSwag, PIQA, SIQA) are employed to measure side effects. Offloading facts does not impair SLM NLU skills, nor does aggressive offloading (in ablations) improve NLU, establishing that factual memorization and natural language comprehension are largely orthogonal in SLM training.

Throughput and Practical Considerations

Because spaCy runs on CPU-only hardware, LaCy enables large-scale pretraining without the labeling overhead of LLM-judge or reference-loss methods, which require substantial GPU time. The pipeline is amenable to online labeling during data loading, as sketched below.
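A sketch of what such online labeling could look like in the data-loading path, using spaCy's batched pipe API (batch size and process count are illustrative choices, not values reported in the paper):

```python
# Hedged sketch: CPU-side factual labeling in the data-loading path, so
# annotation overlaps with GPU training on separate hardware.
import spacy

nlp = spacy.load("en_core_web_sm", disable=["lemmatizer"])  # keep tagger + NER

def label_stream(texts):
    """Yield a per-token 'is factual' mask for each incoming document."""
    for doc in nlp.pipe(texts, batch_size=128, n_process=4):
        yield [bool(tok.ent_type_) or tok.like_num for tok in doc]
```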

Implications, Limitations, and Future Directions

The findings contradict the idea that loss-based methods alone suffice for optimal SLM training. SLMs should be trained to memorize only what fits their capacity, namely non-factual tokens that remain semantically acceptable even at high loss, and to delegate factual, high-loss tokens. The LaCy framework, though conceptually simple, delivers measurable gains in factual accuracy in cascaded settings and keeps fact memorization out of the SLM, at minimal throughput cost.

This approach has ramifications for cost-efficient LLM deployment, LLM cascading architectures [gupta2024language], fact hallucination mitigation, and downstream parameter-efficient fine-tuning. Future work should focus on robust scaling, joint architectures with multi-modal or large KB cascades, and refined token-selection heuristics that combine other signals (contextual uncertainty, knowledge graphs).

Conclusion

LaCy establishes that optimal token selection for SLM pretraining is not a function of loss alone; factual token identification and targeted delegation are necessary for trustworthy, efficient model training. The spaCy+loss paradigm robustly separates learnable from non-learnable tokens, maximizing factual accuracy and minimizing factual entanglement, while retaining SLM NLU capabilities. The approach has practical scalability and theoretical significance in guiding SLM research towards fine-grained token-wise learning and intelligent deferral mechanisms.


Explain it Like I'm 14

What is this paper about?

This paper looks at how to train small LLMs (SLMs) so they make fewer factual mistakes. The authors argue that small models shouldn’t try to remember every fact (like exact dates or names). Instead, they should learn when to “ask for help” from a bigger, smarter model. They introduce a training method called LaCy that teaches a small model to do exactly that.

What questions did the researchers ask?

They focused on two simple questions:

  • Which pieces of text should a small model try to predict by itself?
  • Which pieces should it “delegate” by asking a bigger model for help?

They show this isn’t just about how “wrong” a guess is. Some wrong guesses are still okay (like choosing a different but reasonable next word), while others cause factual errors (like getting a person’s birthplace wrong). LaCy tries to tell these cases apart during training.

How did they do it?

Think of text as a sequence of tiny pieces called tokens (often parts of words). At each step, the model guesses the next token.

Here’s the idea behind LaCy:

  • Small models have limited memory. They can be good at general language (grammar, style), but not at storing countless facts (names, dates, locations).
  • During training, the model gets a score called “loss.” High loss means it was surprised (its guess didn’t match the training text). But high loss alone doesn’t tell you whether the guess matters for truth.
  • The authors use a grammar/linguistic tool called spaCy (a program that tags words as names, places, dates, etc.) to spot tokens that are likely factual and require an exact answer (for example, “Mozart was born in Salzburg” — “Salzburg” must be correct).
  • LaCy combines two signals: 1) Is this token a fact-like token (using spaCy)? 2) Is the model struggling with it (high loss)?
  • If both are true, the small model is trained to output a special placeholder called a <CALL> token instead of guessing. The <CALL> means “Ask a bigger model to fill this in.”
  • At inference time (when the model is actually used), if the small model outputs <CALL>, the system passes the context to a larger model, which fills in the factual token (like supplying “Salzburg”).

Analogy: Imagine writing a report. You write the sentences yourself (style, grammar), but when you reach a specific fact (a birth year), you quickly ask a more knowledgeable friend to confirm it before you write it down.
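If you're curious, this is what spaCy's tagging looks like on the Mozart example (real spaCy usage; the labels shown are the typical output of its small English model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mozart was born in Salzburg in 1756.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output:
#   Mozart    PERSON
#   Salzburg  GPE   (geopolitical entity: cities, regions, countries)
#   1756      DATE
```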

What did they find?

  • Better factual accuracy: On a biography-writing task (using Wikipedia-style text), a small model trained with LaCy produced more correct facts when paired with a bigger model. They measured this with FactScore, a metric that checks factual statements against the real Wikipedia page.
  • Smarter calling for help: LaCy learns to “call” mostly on tokens that really need precision (names, dates, places) instead of random words. This reduces hallucinations (confident but wrong statements).
  • Less “fact leakage”: When they disabled calling and forced the small model to answer factual questions on its own, the LaCy model gave fewer correct answers. That’s actually good here: it shows the small model didn’t memorize lots of facts; it learned to rely on the helper when needed.
  • Loss alone isn’t enough: They showed that “loss” (how wrong a guess is) doesn’t tell you whether a token is a fact that must be exact. Some high-loss tokens are harmless (like choosing between “The cat” and “The time”), while some low-loss tokens could still be wrong in a factual way. So combining loss with spaCy’s tags works better.
  • Cheap and simple labeling: LaCy’s spaCy-based labeling runs on CPUs and is cheaper than methods that rely on large helper models or GPU-heavy scoring during training.
  • General language skills stay intact: Offloading facts didn’t hurt the small model’s reading and reasoning skills (NLU). It still performed similarly on general understanding tasks.

Why does this matter?

LaCy shows a practical way to train small, fast models that know when to ask for help on facts. This can:

  • Make small models more trustworthy by reducing factual errors.
  • Keep small models efficient (they don’t waste capacity trying to memorize the world).
  • Lower training and deployment costs (small model does most of the work; big model steps in only when needed).
  • Fit nicely with tool use and retrieval systems (in the future, the <CALL> could trigger web search, databases, or other tools).

In short, the paper argues that small models should learn language and judgment, not memorize everything. Teaching them when to call for help leads to better, more reliable systems.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable follow-up work:

  • Reliability of the factual-token detector: No quantitative precision/recall analysis of the spaCy-based tagging (plus heuristics) on the training domain; error types, span coverage, and boundary accuracy (especially for multi-token entities) are unreported.
  • Generality of factual-token detection: It is unclear how well spaCy en_core_web_sm (and the custom heuristics) transfer to other domains (news, scientific, code), genres (dialogue, instructions), and languages; multilingual and domain-specific detectors were not evaluated.
  • Formalization and validation of “acceptability”: The definition is heuristic and validated only on a tiny sample with a proprietary LLM; no large-scale, human-annotated, or reproducible protocol exists to measure acceptability reliably.
  • Span-level vs token-level delegation: The method delegates at the next-token granularity, but many facts span multiple subword tokens; how to detect, initiate, and terminate span-level calls (entity/date ranges, multiword names) is unaddressed.
  • Tokenization mismatches: The SLM uses SentencePiece while the cascade model does not; multi-token returns are noted but not systematically handled or evaluated for their impact on coherence and downstream metrics.
  • Numeric/date handling: Numbers and dates are high-risk factual tokens with bounded loss; no targeted rules or detectors were developed to prioritize or tailor delegation for numeric content.
  • Call-budget selection and scheduling: The 15% training and 22% inference budgets are fixed and partly chosen for comparability; the optimal fraction, adaptive scheduling over training, and coverage–risk trade-offs remain unexplored.
  • Calibration of call thresholds: The running-quantile thresholding at inference is heuristic; robustness under distribution shift, per-domain calibration, and principled selective prediction calibration (e.g., risk-coverage curves) are unstudied.
  • Training–inference mismatch: During pretraining, the model never conditions on cascade outputs; the effect of this exposure bias on downstream calling behavior and factuality is unknown, and no methods (e.g., mixed-in teacher forcing with returned spans) were tested.
  • Dependency on cascade partner quality: Only two partners (Llama 3.2 1B; one Qwen 32B+RAG ablation) are tried; systematic variation of partner size/quality, RAG configurations, grounding sources, and cost–quality trade-offs is missing.
  • End-to-end latency and cost: The paper does not measure inference-time latency, throughput, or cost per token considering call frequency and cascade overhead; practical deployment implications remain unclear.
  • Utility vs. redundancy of calls: No detailed error analysis of unnecessary calls (overcalling) versus missed calls (undercalling) and their contribution to factual errors, fluency degradations, or cost.
  • Alternative token-type signals: No ablation with other linguistic signals (e.g., constituency/dep parsing, POS tags), stronger NERs, learned factuality classifiers, or ensembles combining multiple detectors.
  • Robustness to label noise: Sensitivity of LaCy to mislabeling in spaCy/heuristics is not quantified; no simulations of noise injection or ablations to estimate performance degradation under imperfect labels.
  • Scaling behavior: Results center on 334M (main) and 1.3B (appendix) SLMs over ~50B tokens from Wikipedia; scaling trends for larger SLMs, longer training, and heterogeneous corpora remain unknown.
  • Decoding effects: Only greedy decoding is used; interactions between sampling-based decoding (temperature/top-p) and the timing/frequency/quality of calls were not analyzed.
  • Breadth of evaluation: NLU is limited to a few multiple-choice sets; no assessments on instruction-following, long-context reasoning, code, math, or dialogue to test whether fact offloading helps or hurts broader capabilities.
  • Human evaluation: No human judgments of coherence, fluency, style consistency, or perceived factual soundness in cascaded generations, especially with frequent token-level calls.
  • Factuality metrics coverage: Results rely on FactScore (biographies); benchmarks like FEVER, TruthfulQA, HoVer, attribution-based QA, and contradiction detection were not tested.
  • Objective alignment: While loss is shown to misalign with factuality, no alternative training objective is proposed that directly optimizes factuality/abstention (e.g., selective likelihood, risk-aware losses).
  • Privacy and safety: Calling may expose user/contextual data to external models/tools; the paper does not address redaction, privacy constraints, call governance, or post-call safety filtering.
  • Verification of cascade outputs: There is no mechanism to validate or reject incorrect cascade returns; strategies for post-hoc verification, cross-checking, or self-consistency are not explored.
  • Consecutive/nested calls and long-span retrieval: Handling of back-to-back calls, multi-token retrieval windows, or structured tool outputs (tables, API returns) is out of scope but crucial for practical systems.
  • Interaction with instruction tuning/RLHF: Whether the learned calling behavior persists or is overwritten during downstream alignment/finetuning is unknown.
  • Stronger uncertainty baselines: Comparisons lack modern selective prediction baselines (e.g., entropy/margin, temperature scaling, conformal prediction, mutual information from ensembles/MC dropout).
  • Catastrophic interference and calibration: The effect of introducing <CALL> on probability calibration, long-horizon language modeling quality, and interference with non-factual tokens is not measured.
  • Adaptive or curriculum strategies: No investigation into curricula that phase in/out delegation, progressively refine factual detectors, or anneal budgets during training.
  • Online/teacher-assisted variants: Combining LaCy with online teacher signals (reference model/RAG) without full two-stage training (e.g., distillation, pseudo-labeling) is an open design question.
  • Reproducibility of labeling: The custom spaCy heuristics and exact rule set for fact detection are not fully specified; lack of released code or detailed schemas hinders replication and fair comparison.
  • Data contamination checks: The paper does not report contamination analysis between pretraining and evaluation pages in Wikipedia, which could inflate absolute scores.
  • Multilingual/code settings: No experiments on multilingual corpora or code-switching; coverage and accuracy of factual detectors and calling behavior in such settings are unknown.
  • Downstream cost–quality optimization: There is no principled framework to jointly optimize factuality gains against monetary/latency costs of cascading under production constraints.

Practical Applications

Immediate Applications

The following use cases can be deployed with current tooling, leveraging LaCy’s training recipe (spaCy-based factual token identification + loss-based selection) and standard cascade/RAG stacks.

  • Cost‑efficient LLM cascades for enterprise knowledge assistants
    • Sector: software, enterprise IT
    • What: Train small on‑prem SLMs to handle fluency and “safe-to-guess” tokens, emitting <CALL> only for high‑loss factual tokens; route those to a larger cloud model or RAG stack.
    • Tools/products/workflows: spaCy (NER/POS), SentencePiece, vector DB (FAISS/Elasticsearch), cascade partner (e.g., Llama 3.2 1B, Qwen 3 32B), FactScore-based evaluation; deploy as a microservice with a call‑ratio gate.
    • Assumptions/dependencies: Larger model or RAG is more factual than the SLM; acceptable cloud latency; security model allows contextual snippets to be sent on <CALL>; tuning of call thresholds to meet cost and quality targets.
  • On‑device assistants with selective cloud escalation
    • Sector: consumer software, mobile, IoT
    • What: Run an SLM locally for conversational flow and grammar; only send to cloud on <CALL> (names, dates, entities) to improve accuracy and reduce data usage.
    • Tools/products/workflows: Mobile/edge runtime for SLM, lightweight telemetry to monitor call budget, privacy filters on context before escalation.
    • Assumptions/dependencies: Network availability for escalations; privacy constraints on sending context; smaller vocabulary/tokenization alignment between on‑device and cloud models.
  • Factual content generation with editorial workflows (bios, product pages, briefs)
    • Sector: media, e‑commerce, marketing
    • What: Authoring pipelines where the SLM writes prose and defers only for factual spans; editors review retrieved facts explicitly marked as “called”.
    • Tools/products/workflows: CMS integration; audit trail that tags called tokens; FactScore monitoring to QA campaigns.
    • Assumptions/dependencies: High‑quality cascade partner or RAG corpus (e.g., product catalog, Wikipedia); latency budget compatible with batch generation.
  • Retrieval‑Augmented Generation (RAG) trigger gating
    • Sector: software, search, enterprise knowledge
    • What: Use <CALL> as a learned gate to trigger retrieval only when facts are likely needed; reduce unnecessary retrievals and hallucinations.
    • Tools/products/workflows: Retrieval middleware keyed by <CALL> events; hybrid pipelines (SLM → RAG → rewrite).
    • Assumptions/dependencies: Accurate entity detection in the domain; RAG index freshness; minimal domain shift relative to training data or updated spaCy rules.
  • Customer support chatbots with targeted deferral
    • Sector: customer service, telecom, SaaS
    • What: SLM handles general conversation; <CALL> triggers a KB lookup or hands off to a larger model for policy, pricing, or account‑specific facts.
    • Tools/products/workflows: Integration with ticketing/CRM; dynamic call budgets (risk‑aware gating for account‑critical questions); explainable escalation messages.
    • Assumptions/dependencies: Access to up‑to‑date KBs; ability to redact PII before deferral; clear fallback when larger model fails.
  • Report and narrative generation with verified numbers/dates
    • Sector: finance, operations, analytics
    • What: SLM drafts narratives while <CALL> retrieves exact figures (KPIs, dates) from warehouse/ERP via function calls.
    • Tools/products/workflows: Tool/function calling instead of LLM cascade (e.g., SQL connectors, API wrappers); typed <CALL> schemas.
    • Assumptions/dependencies: Stable APIs/metadata; governance ensuring retrieved values are authoritative; audit trails.
  • Safer non‑diagnostic medical knowledge assistants
    • Sector: healthcare (informational use only)
    • What: SLM manages conversational flow; <CALL> fetches medical facts (drug interactions, guidelines) from vetted sources.
    • Tools/products/workflows: UMLS/SNOMED/DailyMed integrations; explicit disclaimers; human‑in‑the‑loop for high‑risk queries.
    • Assumptions/dependencies: Strict compliance (HIPAA/GDPR); domain‑tuned fact detectors (beyond general spaCy); regulatory approval for deployment scope.
  • Developer assistants that defer on API facts
    • Sector: software engineering
    • What: SLM suggests code or explanations; <CALL> triggers docs search/static analysis when encountering API names/versions.
    • Tools/products/workflows: IDE plugins; docsearch over MDN/StackOverflow/internal docs; call logs for debugging.
    • Assumptions/dependencies: Fresh documentation index; appropriate parsing for code tokens (domain‑specific NER).
  • Legal/Policy summarization with fact deferral
    • Sector: legal, public sector
    • What: SLM summarizes prose; <CALL> inserts citations/statutes/precedents with links to verified sources.
    • Tools/products/workflows: Legal RAG (case law databases); citation validators; redlining workflows highlighting called spans.
    • Assumptions/dependencies: Licensed legal corpora; tolerance for added latency; jurisdiction awareness.
  • Data labeling at scale without GPUs for call training
    • Sector: ML infrastructure, academia
    • What: CPU‑only spaCy annotation integrated into the dataloader to flag factual tokens during pretraining; avoids GPU labeling (LLM judge) or two‑stage Rho pipelines.
    • Tools/products/workflows: spaCy pipelines, CPU workers in data loaders, reproducible masks per batch.
    • Assumptions/dependencies: spaCy coverage for target languages; maintainable heuristics for domain entities.
  • Privacy‑aware selective disclosure
    • Sector: security, compliance
    • What: Reduce exposure by deferring less often and only on factual spans; combine with redaction so only minimal context is sent when <CALL> triggers.
    • Tools/products/workflows: Context windows scrubbers; call‑time PII filters; DLP policies keyed off <CALL>.
    • Assumptions/dependencies: Effective redaction without breaking factual retrieval; organizational approval for selective disclosure.
  • Better evaluation and model selection for cascades
    • Sector: academia, ML ops
    • What: Replace validation‑loss as a single KPI with downstream factuality metrics (e.g., FactScore) when tuning cascade policies (as loss did not correlate with factuality in the paper).
    • Tools/products/workflows: Continuous evaluation harness with FactScore; offline call‑ratio sweeps; A/B tests of cascade partners.
    • Assumptions/dependencies: Availability of task‑aligned factuality metrics; representative evaluation data.

Long‑Term Applications

These use cases may require additional research, tooling, or standardization (e.g., domain‑specific parsers, span‑level deferral, security/legal frameworks).

  • Open standard for <CALL> protocols and tool schemas
    • Sector: software, interoperability
    • What: Define typed deferral APIs (entity, date, citation, numeric) so SLMs can interoperate with multiple tools/LLMs/KBs.
    • Dependencies: Community/spec governance; alignment across tokenizers and vendors; security model for call payloads.
  • Span‑level and structured deferral (beyond single tokens)
    • Sector: NLP research, enterprise apps
    • What: Train models to emit boundaries and types for factual spans and invoke specific tools (e.g., “<CALL:WIKIDATA>[entity span]”).
    • Dependencies: Training data with span annotations; decoding/beam support for structured calls; evaluation on span‑level factuality.
  • Domain‑specific fact detectors beyond general spaCy
    • Sector: healthcare, law, finance, scientific publishing
    • What: Replace/augment spaCy with domain NER (e.g., UMLS entities, legal citations, financial instruments) to improve deferral accuracy.
    • Dependencies: High‑quality domain parsers; licensing; labeled corpora for adaptation.
  • Verified knowledge integration and function‑first deferral
    • Sector: data platforms, enterprise IT
    • What: Prefer direct function calls (SQL, API) over LLM cascades for facts; unify governance and auditability.
    • Dependencies: Stable data contracts; provenance tracking; monitoring for stale sources.
  • Risk‑aware orchestration and dynamic call budgeting
    • Sector: finance, healthcare, government
    • What: Adjust call thresholds based on risk class of a task/user/context (e.g., stricter in regulated workflows).
    • Dependencies: Reliable risk classifiers; policy engines; explainability requirements.
  • Human‑in‑the‑loop deferral for high‑stakes decisions
    • Sector: healthcare, legal, safety‑critical operations
    • What: Elevate <CALL> to route specific factual spans to human experts (editor review queues).
    • Dependencies: Triage UIs; SLAs; auditability and attribution of changes.
  • Multilingual and cross‑domain LaCy variants
    • Sector: global software, education
    • What: Extend parsing/heuristics to other languages and scripts; ensure entity/date detection works across locales.
    • Dependencies: Multilingual NER/POS resources; tokenization harmonization; culturally specific formats.
  • Edge/robotics planning with selective knowledge offload
    • Sector: robotics, IoT
    • What: On‑device SLM for instruction following; <CALL> to remote knowledge/services when exact world facts are needed (maps, part specs).
    • Dependencies: Connectivity management; caching of frequent facts; safety gating in low‑connectivity scenarios.
  • Hardware/software co‑design for deferral gating
    • Sector: semiconductors, mobile
    • What: Accelerators that cheaply compute confidence/loss proxies and spaCy‑like features to trigger calls with minimal overhead.
    • Dependencies: Efficient token‑type detectors; frameworks exposing gating signals.
  • Continual learning with externalized facts (reduced obsolescence)
    • Sector: MLOps, enterprise AI
    • What: Keep SLM stable while updating external knowledge sources; avoid frequent retraining for fact drift.
    • Dependencies: Up‑to‑date KBs; monitoring for drift; retriever maintenance.
  • Policy and compliance standards for deferral and auditing
    • Sector: regulators, compliance
    • What: Mandate deferral on high‑risk factual tokens in sensitive domains; require call logs and provenance in AI‑assisted content.
    • Dependencies: Consensus on risk taxonomies; certification schemes; privacy regulations for call payloads.
  • New evaluation metrics and datasets centered on acceptability
    • Sector: academia, benchmarking
    • What: Develop acceptability‑based benchmarks and metrics that reflect factual correctness over exact-token matching.
    • Dependencies: Human/system annotations for acceptable continuations; community adoption.
  • Education/tutoring with source‑grounded answers
    • Sector: education
    • What: Tutors that defer to textbooks/journals for factual claims, citing sources automatically on <CALL>.
    • Dependencies: Licensed curricular materials; citation engines; classroom acceptance policies.

Cross‑cutting assumptions and caveats

  • Cascade partner quality matters: Gains depend on the larger model/RAG being more factual than the SLM; poor partners limit benefits.
  • Domain shift requires re‑tuning: spaCy rules and loss thresholds tuned on Wikipedia may not transfer without adaptation.
  • Privacy and governance: Any deferral sends context; organizations must implement redaction and auditing.
  • Tokenization/tool alignment: Mismatched tokenization between SLM and cascade must be handled (e.g., multi‑token numeric returns).
  • Latency budgets: Workflows must tolerate added latency on <CALL>; prefetching/caching may be needed.
  • Evaluation: Rely on downstream factuality metrics (e.g., FactScore) rather than validation loss alone when optimizing cascades.

Glossary

  • Acceptability: A criterion for whether a predicted next token maintains factual and logical consistency with the ground truth continuation. "we propose the concept of acceptability as a relaxation of accuracy"
  • Autoregressive LLMs: Models that predict the next token given previous tokens to approximate the data distribution. "Autoregressive LLMs approximate the data distribution by next-token prediction"
  • Bayesian active learning: A framework that selects informative data points for training using Bayesian principles. "has recently resurfaced in Bayesian active learning \citet{rho_loss}"
  • <CALL> token: A special placeholder token indicating the model should delegate prediction to an external source. "This is implemented by predicting some form of a <CALL> placeholder for the next token."
  • Cross-entropy loss: A standard training objective that measures the divergence between predicted and true token distributions. "the cross-entropy loss is blind to the type of error"
  • Domain adaptation: Techniques for improving model performance when training and test data come from different distributions. "It has roots in domain adaptation \citep{moore-lewis-2010-intelligent,xie_doremi}"
  • Distributionally robust optimization: Methods that optimize performance under worst-case distributional shifts. "distributionally robust optimization \citep{oren2019distributionallyrobustlanguagemodeling}"
  • Fact leakage: The unintended storage of factual knowledge in a small model’s parameters, measured by its ability to answer facts without calling. "Without calling, LaCy has lowest fact leakage, meaning the least facts were trained into the limited parametric SLM memory."
  • FactScore: An evaluation metric that measures the proportion of atomic facts in generated text supported by ground-truth sources. "Factual accuracy is measured by FactScore \citep{min-etal-2023-factscore}."
  • Greedy decoding: A generation strategy that selects the highest-probability token at each step without search. "We use greedy decoding."
  • Grokking: A phenomenon where models transition from memorization to generalized patterns after extended training. "models seem to start 'grokking', that is, they transition from nearly lossless to lossy predictions"
  • Learnability theory: The study of which data points or tokens a model can reliably learn given its capacity and training signals. "From a learnability theory standpoint, the loss on the true token during pretraining indicates both whether the SLM predicts the true token correctly and whether it will reliably predict it after training."
  • LLM judge: An approach that uses a larger LLM to annotate or classify tokens (e.g., as factual) for training smaller models. "LLM judge-based factual annotations \citep{zhao2025lmlm}"
  • Model cascade: A system where a smaller model defers to a larger, more capable model when needed. "assume a bigger model to step in when the SLM calls, forming a model cascade \citep{varshney-baral-2022-model-cascading,gupta2024language}"
  • Named Entity Recognition: A technique to identify entities such as names, dates, and locations in text. "spaCy's small English web model ... for Named Entity Recognition and linguistic annotation"
  • Parametric knowledge: Information stored in the model’s parameters rather than accessed externally. "they transition from nearly lossless to lossy predictions by overwriting and compressing parametric knowledge \citep{ghosal2025memorization}"
  • Retrieval-Augmented Generation (RAG): A method that augments generation by retrieving external knowledge based on the context. "enhanced with a RAG prompt."
  • Rho-1: A loss-based token selection method that uses a reference model’s loss to decide which tokens to train or delegate. "Rho-1 \citep{rho_1}"
  • Rho-loss: A learnability-based metric guiding token selection by comparing losses across models or training stages. "Rho-loss and Rho-1 \citep{rho_loss,rho_1} use this to decide which tokens to pretrain into the SLM's parameters and which to skip"
  • SentencePiece tokenizer: A subword tokenization algorithm that builds a vocabulary from data without language-specific rules. "We pretrain GPT-2 architectures from scratch with the SentencePiece tokenizer \citep{Kudo2018SentencePieceAS}."
  • Small LLMs (SLMs): Compact LLMs (around or below 1B parameters) designed to predict easy tokens and delegate hard factual ones. "Small LLMs (SLMs, \citeauthor{belcak2025small}, \citeyear{belcak2025small})."
  • spaCy: An NLP library used here for grammar parsing and entity detection to augment loss signals for token delegation. "We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate"
  • Token logits: The pre-softmax scores over the vocabulary that reflect the model’s preference for each next token. "... or token logits."
  • Validation loss: The loss measured on held-out data used to monitor training progress and generalization. "Models' abilities, especially during pretraining, are often assessed by their validation loss."

