
Just-in-Time Model Replacement (JITR)

Updated 12 December 2025
  • Just-in-Time Model Replacement (JITR) is an adaptive framework that replaces large language models with efficient surrogate models for recurring, template-based tasks.
  • It continuously monitors LLM usage, detects recurring task patterns via clustering and vectorization, and triggers model replacement based on cost and accuracy thresholds.
  • Leveraging methods like full fine-tuning, adapters/LoRA, and distillation, JITR achieves near LLM-level performance with significantly reduced computational overhead.

Just-in-Time Model Replacement (JITR) is an adaptive framework for dynamically substituting LLMs with computationally cheaper, task-specialized surrogate models in production pipelines. Upon detecting recurrent user requests that can be characterized as stable task templates, JITR identifies and fine-tunes small models—thereby reducing operational cost and latency without sacrificing accuracy for the repetitive task. The approach centers on continual monitoring of LLM usage, automatic detection and clustering of recurring task patterns, efficient surrogate model search and adaptation, and seamless runtime model switching with ongoing performance monitoring (Strassenburg et al., 5 Dec 2025).

1. Formal Problem Formulation

Let $\mathcal{D} = \{(x_i, y_i)\}$ be the stream of user prompts $x_i$ and LLM-generated outputs $y_i$. Many requests in practice correspond to a small set of recurring tasks $T$, recognizable as templates $x = \mathrm{template}_T(\ell; \phi)$, where $\ell$ defines the high-level instruction (e.g., "sentiment classification of movie review") and $\phi$ are slot-fillers (e.g., the actual review text). For each recurring task $T$ in a sliding window $W$, the frequency $f_T$ quantifies its rate of occurrence.

The objective is to find, for each detected $T$, a surrogate model $M_s$ that (i) achieves task accuracy $P(M_s, T) \ge \bar{P}_T$ and (ii) minimizes per-instance invocation cost $C(M_s, T)$ (monetary, energy, or time). Model replacement is enacted when, after accumulating $N_{\mathrm{warm}}$ labeled instances, a candidate model $M_s$ is shown to satisfy

$$C(M_s, T) < C(M_0, T) \quad \land \quad P(M_s, T) \ge \bar{P}_T,$$

where $M_0$ is the original LLM.
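
In code, the criterion reduces to a simple guard once cost and accuracy estimates are available. A minimal Python sketch, with illustrative thresholds and a hypothetical interface (the paper specifies only the criterion itself, not this API):

```python
from dataclasses import dataclass

@dataclass
class ReplacementPolicy:
    """Decides whether a surrogate may replace the original LLM for a task T.

    Illustrative sketch: the paper specifies only the criterion
    C(M_s, T) < C(M_0, T)  AND  P(M_s, T) >= P_bar_T, after N_warm instances.
    """
    accuracy_floor: float  # \bar{P}_T: minimum acceptable task accuracy
    n_warm: int            # N_warm: labeled instances required before deciding

    def should_replace(self, n_labeled: int, surrogate_cost: float,
                       llm_cost: float, surrogate_accuracy: float) -> bool:
        # Wait until enough labeled (prompt, output) pairs have accumulated.
        if n_labeled < self.n_warm:
            return False
        # Replace only if the surrogate is both cheaper and accurate enough.
        return surrogate_cost < llm_cost and surrogate_accuracy >= self.accuracy_floor

policy = ReplacementPolicy(accuracy_floor=0.88, n_warm=500)
print(policy.should_replace(n_labeled=512, surrogate_cost=0.02,
                            llm_cost=1.10, surrogate_accuracy=0.90))  # True
```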

2. Recurring-Task Detection and Trigger Mechanisms

Incoming LLM calls are monitored and analyzed through a multi-stage pipeline designed to extract recurring patterns and cluster requests into tasks $T$:

  • Prompt-Prefix Vectorization: Each request $x$ is embedded into a fixed-length vector $e(x)$ (extracted from the model's KV-cache or via wrapper prompts). Pairwise cosine similarities link prompt instances:

$$\cos(e(x_i), e(x_j)) = \frac{e(x_i) \cdot e(x_j)}{\|e(x_i)\|\,\|e(x_j)\|}$$

Thresholding ($\cos \ge \tau$) groups requests into candidate templates.

  • Wrapper Prompt Classification: Requests are optionally wrapped in metadata prompts instructing the LLM to emit $\{\text{input\_type}, \text{task\_type}\}$ fields, yielding initial cluster assignments.
  • Periodic Clustering: Offline clustering (agglomerative or $k$-means) is applied every 1,000 requests to recent embeddings, producing clusters $\{T_1, \ldots, T_k\}$.

Surrogate generation is triggered when the frequency $f_T$ and buffer size for cluster $T$ exceed user-specified or estimated thresholds $f_{\min}$ and $N_{\min}$, as in the sketch below.
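
A minimal sketch of this detection-and-trigger loop, assuming prompt embeddings are computed externally (the KV-cache extraction is not reproduced here) and with illustrative threshold values:

```python
import numpy as np

TAU = 0.9    # cosine-similarity threshold tau for linking prompts (illustrative)
F_MIN = 50   # minimum window frequency f_min (illustrative)
N_MIN = 500  # minimum buffered instances N_min (illustrative)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_to_cluster(e_x: np.ndarray, centroids: dict) -> int:
    """Link a new embedding to the most similar template with cos >= tau,
    otherwise open a new candidate cluster."""
    best_id, best_sim = None, TAU
    for cid, c in centroids.items():
        sim = cosine(e_x, c)
        if sim >= best_sim:
            best_id, best_sim = cid, sim
    if best_id is None:
        best_id = len(centroids)
        centroids[best_id] = e_x.copy()
    return best_id

def triggered_tasks(buffers: dict, window_counts: dict) -> list:
    """Clusters whose frequency and buffer size both exceed their thresholds."""
    return [cid for cid in buffers
            if window_counts.get(cid, 0) >= F_MIN and len(buffers[cid]) >= N_MIN]
```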

3. Surrogate Model Search and Selection Strategy

The candidate search space $\mathcal{S}$ comprises models from private repositories or public hubs (e.g., Hugging Face), each annotated with parameter count $\#\theta$, model size, inference latency $\ell(M)$, and benchmarking metadata. The optimization seeks

$$\min_{M \in \mathcal{S}} C(M, T) \quad \text{s.t.} \quad P(M, T) \ge \bar{P}_T,\; \ell(M) \le \bar{\ell}_T,\; \mathrm{mem}(M) \le \bar{m}_T,$$

where $\bar{\ell}_T$ and $\bar{m}_T$ are latency and memory constraints. The search systematically prunes infeasible models, ranks candidates using fast surrogate accuracy predictors $\hat{P}(M, T)$ on task samples, clusters candidates via Task2Vec-like embeddings, and fully fine-tunes the $\approx 5$ most promising meta-candidates before final selection, as summarized in the table and sketch below.

| Step | Input | Output |
|---|---|---|
| Prune | $\mathcal{S}$, constraints | Models with valid memory/latency |
| Surrogate prediction | Pruned models, task samples | Predicted accuracy for fast ranking |
| Clustering | Ranked models, embeddings | Top-$k$ clusters; cluster representatives |
| Fine-tuning | Meta-candidates, full data | Measured $P(M, T)$; best cost-achieving $M$ |
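
A condensed sketch of the staged search, under stated assumptions: candidates are plain records, `quick_predict` stands in for the fast accuracy predictor $\hat{P}(M, T)$, and the Task2Vec-like clustering step is abbreviated to keeping one representative per embedding group:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mem_gb: float       # memory footprint mem(M)
    latency_ms: float   # inference latency l(M)
    group: str          # stand-in for a Task2Vec-like embedding cluster

def search(candidates, mem_cap, latency_cap, p_bar, quick_predict, fine_tune, k=5):
    # 1) Prune: drop models violating the memory/latency constraints.
    feasible = [c for c in candidates
                if c.mem_gb <= mem_cap and c.latency_ms <= latency_cap]
    # 2) Rank by the fast predicted-accuracy proxy \hat{P}(M, T).
    ranked = sorted(feasible, key=quick_predict, reverse=True)
    # 3) Keep one representative per cluster (abbreviated clustering step).
    reps, seen = [], set()
    for c in ranked:
        if c.group not in seen:
            reps.append(c)
            seen.add(c.group)
    # 4) Fully fine-tune the ~k most promising meta-candidates; among those
    #    meeting the accuracy floor \bar{P}_T, pick the cheapest.
    results = [(c, *fine_tune(c)) for c in reps[:k]]  # -> (candidate, acc, cost)
    viable = [(c, acc, cost) for c, acc, cost in results if acc >= p_bar]
    return min(viable, key=lambda t: t[2]) if viable else None
```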

4. Transfer Learning and Fine-Tuning Pipeline

Surrogate adaptation uses several transfer learning paradigms:

  • Full Fine-Tuning updates all parameters.
  • Adapters / LoRA freeze the base weights $W_0$ and learn a low-rank update $W = W_0 + AB$.
  • Distillation minimizes the Kullback-Leibler divergence between the surrogate $M_s$'s output distribution and the full LLM's output logits.

The composite training loss is

$$\mathcal{L}(\theta) = \alpha \sum_i \mathrm{CE}\bigl(y_i, p(x_i;\theta)\bigr) + \beta \sum_i \mathrm{KL}\bigl(p^{(0)}(x_i) \,\|\, p(x_i;\theta)\bigr),$$

where $\mathrm{CE}$ is the cross-entropy, $\mathrm{KL}$ the distillation term, and $\alpha$, $\beta$ weight the ground-truth versus teacher signal.
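
A minimal PyTorch sketch of this composite loss for a classification task; temperature scaling, a common distillation refinement, is omitted because the formula above does not include it:

```python
import torch
import torch.nn.functional as F

def jitr_loss(student_logits: torch.Tensor,  # p(x_i; theta), shape [B, C]
              teacher_logits: torch.Tensor,  # p^(0)(x_i) from the LLM, shape [B, C]
              labels: torch.Tensor,          # ground-truth y_i, shape [B]
              alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # Ground-truth term: cross-entropy CE(y_i, p(x_i; theta)).
    ce = F.cross_entropy(student_logits, labels)
    # Distillation term: KL(p^(0) || p_theta). F.kl_div expects the student's
    # log-probabilities as input and the teacher's probabilities as target.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return alpha * ce + beta * kl

# Example: batch of 4 on a binary sentiment task.
s, t = torch.randn(4, 2), torch.randn(4, 2)
y = torch.tensor([0, 1, 1, 0])
print(jitr_loss(s, t, y))
```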

Empirical results indicate that only a few hundred to a few thousand examples are required to approach LLM-level test accuracy on straightforward tasks, using a standard data split, early stopping on $\mathcal{L}_{\mathrm{val}}$, and checkpoint retention based on validation-set $P(M_s, T)$.

5. System Architecture and Workflow (Poodle Framework)

The Poodle system is the canonical instantiation of JITR, with clearly delineated components:

  • Data Collector / Monitor: Hooks into each LLM API call, applies optional wrapper prompts, and logs $(x_i, y_i)$ pairs together with cost/timing metrics.
  • Task Analyzer: Performs clustering on recent logs to update the set of recurring tasks.
  • Model Manager / Generator: On a new recurring task $T$, runs the search-and-customization workflow and registers the selected surrogate $M_s^*$.
  • Inference Engine: Routes incoming requests for $T$ to $M_s$; otherwise defaults to the LLM $M_0$.
  • Model Monitor: Periodically shadow-tests a random fraction (1–5%) of requests on both $M_0$ and the surrogate $M_s$ to track performance drift $\Delta P$; triggers retraining or reversion if performance falls below a threshold $\epsilon$.

Integration is achieved via transparent proxying or a client SDK that intercepts and augments calls, enabling non-intrusive deployment with existing LLM APIs.
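
A minimal sketch of the Inference Engine's routing with shadow-testing, assuming hypothetical `classify_task`, `surrogates`, and `llm` handles (Poodle's actual interfaces are not published here):

```python
import random
from collections import defaultdict

SHADOW_RATE = 0.02  # shadow-test ~2% of requests (paper reports 1-5%)
EPSILON = 0.03      # tolerated disagreement rate before reversion (illustrative)
WINDOW = 200        # size of the recent agreement window (illustrative)

drift_stats = defaultdict(list)  # task -> list of agreement booleans

def route(request, classify_task, surrogates, llm):
    """Route a request to the registered surrogate for its task, shadow-testing
    a sampled fraction of traffic against the original LLM M_0."""
    task = classify_task(request)
    surrogate = surrogates.get(task)
    if surrogate is None:
        return llm(request)               # no surrogate yet: default to M_0
    answer = surrogate(request)
    if random.random() < SHADOW_RATE:     # shadow call on a random fraction
        drift_stats[task].append(answer == llm(request))
        recent = drift_stats[task][-WINDOW:]
        # Revert to M_0 (retraining is left to the Model Manager) when the
        # observed drift exceeds the tolerance epsilon.
        if len(recent) == WINDOW and 1.0 - sum(recent) / WINDOW > EPSILON:
            surrogates.pop(task)
    return answer
```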

6. Empirical Evaluation and Quantitative Results

The Poodle prototype was evaluated on binary sentiment classification (IMDB), comparing canonical LLMs (GPT-4.1, GPT-4.1-nano, Llama-405B Turbo, Llama-2-7B) with surrogates (BERT-base, ≈80M parameters).

  • Cost Savings (per 1 million requests, Table 1 prices; a worked amortization sketch follows this list):
    • GPT-4.1-nano → BERT: break-even at ~100k requests, saving $33.
    • GPT-4.1 → BERT: break-even at ~10k requests, saving $850.
    • Llama-405B Turbo → BERT: break-even at ~10k requests, saving $1,420.
  • Latency and Throughput (NVIDIA A5000, maximum batch size):
    • Llama-2-7B: 13 items/sec (batch = 16)
    • BERT: 254 items/sec (batch = 128), ≈19.6× faster
    • Break-even at ~100k requests; 7.5× speedup at 1M requests
  • Surrogate Accuracy (IMDB test set):

| #Examples | Ground-truth labels (train→test acc) | LLM labels (train→test acc) |
|---|---|---|
| 500 | 0.86→0.88 | 0.88→0.88 |
| 1,000 | 0.88→0.89 | 0.88→0.88 |
| 2,000 | 0.89→0.90 | 0.88→0.88 |
| 5,000 | 0.90→0.91 | 0.90→0.90 |
  • Development Efficiency:
    • Naïve full fine-tuning (10 candidates on 5,000 examples): 53 min, test accuracy 0.92
    • JITR search + fine-tune, best candidate on 500 examples: 2.8 min, accuracy 0.91
    • JITR search + fine-tune on 5,000 examples: 12 min, accuracy 0.92
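
The break-even points above follow from simple amortization: the one-time cost of searching for and fine-tuning a surrogate is recouped once accumulated per-request savings exceed it. A worked sketch with illustrative prices (not the paper's Table 1 values):

```python
# All prices below are illustrative assumptions, not the paper's Table 1 values.
dev_cost = 10.0                  # one-time search + fine-tuning cost ($)
llm_cost_per_req = 1.0e-3        # per-request LLM cost ($)
surrogate_cost_per_req = 1.0e-4  # per-request surrogate cost ($)

saving_per_req = llm_cost_per_req - surrogate_cost_per_req
break_even = dev_cost / saving_per_req
print(f"break-even after {break_even:,.0f} requests")  # ~11,111 requests
net = 1_000_000 * saving_per_req - dev_cost
print(f"net saving at 1M requests: ${net:,.0f}")       # $890
```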

7. Practical Challenges, Limitations, and Future Directions

Key challenges include early-detection overhead from wrapper tokens, scaling model-store indexing to millions of candidates, storage and throughput bottlenecks (requiring fast model loading and cluster-aware compression), and calibration of monitoring (how much shadow traffic is needed to robustly detect surrogate drift). Notable limitations are that surrogate accuracy is contingent on the quality and representativeness of the data collected for $T$; rare or shifting tasks yield weaker surrogates; and logging sensitive prompts and responses raises privacy concerns.

Proposed research directions include meta-learning for low-shot surrogate performance prediction, hardware and storage co-optimization, advanced distillation (including intermediate-representation matching), multi-task surrogates with shared layers, and dynamic refinement of user-defined performance/cost thresholds via automated feedback loops (Strassenburg et al., 5 Dec 2025).
