Just-in-Time Model Replacement (JITR)
- Just-in-Time Model Replacement (JITR) is an adaptive framework that replaces large language models with efficient surrogate models for recurring, template-based tasks.
- It continuously monitors LLM usage, detects recurring task patterns via clustering and vectorization, and triggers model replacement based on cost and accuracy thresholds.
- Leveraging methods like full fine-tuning, adapters/LoRA, and distillation, JITR achieves near LLM-level performance with significantly reduced computational overhead.
Just-in-Time Model Replacement (JITR) is an adaptive framework for dynamically substituting LLMs with computationally cheaper, task-specialized surrogate models in production pipelines. Upon detecting recurrent user requests that can be characterized as stable task templates, JITR identifies and fine-tunes small models—thereby reducing operational cost and latency without sacrificing accuracy for the repetitive task. The approach centers on continual monitoring of LLM usage, automatic detection and clustering of recurring task patterns, efficient surrogate model search and adaptation, and seamless runtime model switching with ongoing performance monitoring (Strassenburg et al., 5 Dec 2025).
1. Formal Problem Formulation
Let $S = \{(p_1, r_1), (p_2, r_2), \dots\}$ be the stream of user prompts $p_i$ and LLM-generated outputs $r_i$. Many requests in practice correspond to a small set of recurring tasks $T_1, \dots, T_k$, recognizable as templates $T_j = (I_j, X_j)$, where $I_j$ defines the high-level instruction (e.g., "sentiment classification of movie review") and $X_j$ are slot-fillers (e.g., the actual review text). For each recurring task $T_j$ in a sliding window $W$, a frequency $f_j$ quantifies its occurrence.
The objective is to find, for each detected $T_j$, a surrogate model $m_j$ that (i) achieves task accuracy $\mathrm{acc}(m_j, T_j) \ge \alpha_j$ and (ii) minimizes per-instance invocation cost $c(m_j)$ (monetary, energy, or time). Model replacement is enacted when, after accumulating labeled instances, a candidate model $m_j$ is shown to satisfy

$$\mathrm{acc}(m_j, T_j) \ge \alpha_j \quad \text{and} \quad c(m_j) < c(M),$$

where $M$ is the original LLM and $\alpha_j$ is the accuracy threshold for task $T_j$.
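The replacement decision reduces to a simple comparison of measured accuracy and per-request cost. The following is a minimal sketch of that criterion; the accessor names (`accuracy`, `cost_per_request`) and the example numbers are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class CandidateStats:
    accuracy: float          # measured task accuracy on held-out labeled instances
    cost_per_request: float  # e.g., USD, joules, or milliseconds per invocation

def should_replace(surrogate: CandidateStats,
                   llm: CandidateStats,
                   accuracy_threshold: float) -> bool:
    """Replace the LLM for this task only if the surrogate meets the
    accuracy threshold and is strictly cheaper per request."""
    return (surrogate.accuracy >= accuracy_threshold
            and surrogate.cost_per_request < llm.cost_per_request)

# Example: a small classifier at 0.90 accuracy vs. an LLM baseline.
print(should_replace(CandidateStats(accuracy=0.90, cost_per_request=0.0001),
                     CandidateStats(accuracy=0.91, cost_per_request=0.01),
                     accuracy_threshold=0.88))  # True
```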
2. Recurring-Task Detection and Trigger Mechanisms
Incoming LLM calls are monitored and analyzed through multilayered pipelines designed to extract recurring patterns and cluster requests into tasks $T_j$:
- Prompt-Prefix Vectorization: Each request $p_i$ is embedded into a fixed-length vector $v_i$ (extracted from model KV-cache or wrapper prompts). Pairwise cosine similarities $\mathrm{sim}(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\,\lVert v_j \rVert}$ are used to link prompt instances; thresholding ($\mathrm{sim}(v_i, v_j) \ge \tau$) groups requests into candidate templates.
- Wrapper Prompt Classification: Requests are optionally wrapped in metadata prompts instructing the LLM to additionally emit task-identifying fields, yielding initial cluster assignments.
- Periodic Clustering: Offline clustering (agglomerative or $k$-means) is applied every 1,000 requests to recent embeddings, producing clusters $C_1, \dots, C_m$.
Surrogate generation is triggered when the frequency $f_j$ and buffer size $b_j$ for cluster $C_j$ exceed user-specified or estimated thresholds $\theta_f$ and $\theta_b$.
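A minimal sketch of this detection-and-trigger loop is shown below, using a placeholder embedding function in place of KV-cache vectors and a greedy cosine-similarity grouping; the threshold values and function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def embed(prompt: str, dim: int = 64) -> np.ndarray:
    # Placeholder for a real prompt embedding (e.g., derived from the LLM's
    # KV-cache or a sentence encoder); here a deterministic hash-based vector
    # keyed on the prompt prefix before the first colon.
    rng = np.random.default_rng(abs(hash(prompt.split(":")[0])) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def assign_clusters(prompts, tau=0.8):
    """Greedy grouping: a prompt joins the first cluster whose representative
    embedding has cosine similarity >= tau, otherwise it starts a new cluster."""
    representatives, clusters = [], defaultdict(list)
    for i, p in enumerate(prompts):
        v = embed(p)
        sims = [float(v @ r) for r in representatives]
        if sims and max(sims) >= tau:
            j = int(np.argmax(sims))
        else:
            j = len(representatives)
            representatives.append(v)
        clusters[j].append(i)
    return clusters

def triggered(clusters, window_size, freq_threshold=0.3, min_buffer=500):
    """Return clusters whose frequency and buffered example count both
    exceed their thresholds, i.e., candidates for surrogate generation."""
    return [j for j, members in clusters.items()
            if len(members) / window_size >= freq_threshold
            and len(members) >= min_buffer]
```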
3. Surrogate Model Search and Selection Strategy
The candidate search space $\mathcal{M}$ comprises models from private repositories or public hubs (e.g., Hugging Face), each annotated with parameter count ($n_{\text{params}}$), model size, inference latency $\ell$, and benchmarking metadata. The optimization seeks

$$m_j^\ast = \arg\max_{m \in \mathcal{M}} \widehat{\mathrm{acc}}(m, T_j) \quad \text{s.t.} \quad \ell(m) \le L_{\max},\; \mathrm{mem}(m) \le M_{\max},$$

where $L_{\max}$ and $M_{\max}$ are latency and memory constraints. The search process systematically prunes infeasible models, ranks candidates using quick accuracy predictors on a few task samples, clusters candidates via Task2Vec-like embeddings, and fully fine-tunes the most promising meta-candidates before final selection (see the sketch following the table below).
| Step | Input | Output |
|---|---|---|
| Prune | $\mathcal{M}$, constraints | Models with valid memory/latency |
| Surrogate Prediction | Pruned models, task samples | Predicted accuracy for fast ranking |
| Clustering | Ranked models, embeddings | Top-$k$ clusters; select cluster representatives |
| Fine-tuning | Meta-candidates, full data | Measured $\mathrm{acc}$; select best cost-achieving model $m_j$ |
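The following is a minimal sketch of the prune-then-rank steps under these constraints; the model metadata fields, the toy accuracy predictor, and the hub entries are illustrative assumptions rather than Poodle's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModelCard:
    name: str
    params_millions: float
    memory_gb: float
    latency_ms: float   # measured or estimated per-request latency

def prune(candidates: List[ModelCard],
          max_latency_ms: float, max_memory_gb: float) -> List[ModelCard]:
    """Step 1: drop models that violate latency or memory constraints."""
    return [m for m in candidates
            if m.latency_ms <= max_latency_ms and m.memory_gb <= max_memory_gb]

def rank(candidates: List[ModelCard],
         predict_accuracy: Callable[[ModelCard], float],
         top_k: int = 3) -> List[ModelCard]:
    """Step 2: rank survivors with a cheap accuracy predictor and keep the
    top-k meta-candidates, which are then fully fine-tuned before selection."""
    return sorted(candidates, key=predict_accuracy, reverse=True)[:top_k]

# Usage with a toy predictor that simply favors larger models.
hub = [ModelCard("bert-base", 110, 0.4, 8.0),
       ModelCard("distilbert", 66, 0.25, 5.0),
       ModelCard("7b-llm", 7000, 14.0, 120.0)]
feasible = prune(hub, max_latency_ms=50, max_memory_gb=2.0)
shortlist = rank(feasible, predict_accuracy=lambda m: m.params_millions)
print([m.name for m in shortlist])  # ['bert-base', 'distilbert']
```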
4. Transfer Learning and Fine-Tuning Pipeline
Surrogate adaptation uses several transfer learning paradigms:
- Full Fine-Tuning updates all parameters.
- Adapters / LoRA freeze the base model and learn low-rank updates $\Delta W = BA$.
- Distillation minimizes the Kullback-Leibler divergence between the surrogate's output distribution and the full LLM's output logits.
The composite training loss is

$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{CE}} + \beta\,\mathcal{L}_{\text{KD}},$$

with $\mathcal{L}_{\text{CE}}$ as the cross-entropy against ground-truth labels, $\mathcal{L}_{\text{KD}}$ as the distillation loss, and $\alpha$, $\beta$ weighting ground-truth versus teacher signal.
Empirical results indicate only a few hundred to a few thousand examples are required to approach LLM-level test accuracy on straightforward tasks, using a standard train/validation/test split, early stopping on the validation loss, and checkpoint retention based on validation-set accuracy.
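A minimal PyTorch sketch of the composite loss from above is shown here, with a placeholder linear student and precomputed teacher logits; the architecture, batch contents, and the weights $\alpha$, $\beta$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   labels: torch.Tensor,
                   alpha: float = 0.5,
                   beta: float = 0.5) -> torch.Tensor:
    """alpha * cross-entropy on ground-truth labels
       + beta * KL divergence to the teacher (LLM) output distribution."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return alpha * ce + beta * kd

# Toy usage: a linear "student" classifier over precomputed features.
torch.manual_seed(0)
student = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

features = torch.randn(32, 16)          # batch of task inputs
labels = torch.randint(0, 2, (32,))     # ground-truth labels
teacher_logits = torch.randn(32, 2)     # logits recorded from the LLM

loss = composite_loss(student(features), teacher_logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```

Setting $\beta = 0$ recovers plain fine-tuning on ground-truth labels, while $\alpha = 0$ corresponds to pure distillation from the teacher.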
5. System Architecture and Workflow (Poodle Framework)
The Poodle system is the canonical instantiation of JITR, with clearly delineated components:
- Data Collector / Monitor: Hooks into each LLM API call, applies optional wrapper prompts, and logs prompt/response pairs $(p_i, r_i)$ along with cost/timing metrics.
- Task Analyzer: Performs clustering on recent logs to update the set of recurring tasks.
- Model Manager / Generator: On a new recurring task $T_j$, runs the search-and-customization workflow and registers the resulting surrogate $m_j$.
- Inference Engine: Routes incoming requests for $T_j$ to $m_j$; otherwise, defaults to the LLM $M$.
- Model Monitor: Periodically shadow-tests a random fraction (1–5%) of requests on both the LLM $M$ and the surrogate $m_j$ to monitor performance drift; triggers retraining or reversion if surrogate performance falls below the threshold $\alpha_j$.
Integration is achieved via transparent proxying or a client SDK that intercepts and augments calls, enabling non-intrusive deployment with existing LLM APIs.
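A minimal sketch of the routing and shadow-testing logic follows; the callable client interfaces, agreement-based drift proxy, sampling rate, and thresholds are illustrative assumptions rather than Poodle's API.

```python
import random
from typing import Callable, Dict, Optional

class InferenceRouter:
    """Routes requests for recognized recurring tasks to their surrogate,
    falls back to the LLM otherwise, and shadow-tests a small fraction of
    surrogate traffic against the LLM to detect drift."""

    def __init__(self, call_llm: Callable[[str], str],
                 shadow_rate: float = 0.02, drift_threshold: float = 0.9):
        self.call_llm = call_llm
        self.surrogates: Dict[str, Callable[[str], str]] = {}
        self.shadow_rate = shadow_rate
        self.drift_threshold = drift_threshold
        self.agree: Dict[str, int] = {}
        self.total: Dict[str, int] = {}

    def register(self, task_id: str, surrogate: Callable[[str], str]) -> None:
        self.surrogates[task_id] = surrogate
        self.agree[task_id], self.total[task_id] = 0, 0

    def handle(self, prompt: str, task_id: Optional[str]) -> str:
        surrogate = self.surrogates.get(task_id)
        if surrogate is None:
            return self.call_llm(prompt)           # no surrogate yet: use the LLM
        answer = surrogate(prompt)
        if random.random() < self.shadow_rate:     # shadow-test against the LLM
            self.total[task_id] += 1
            self.agree[task_id] += int(answer == self.call_llm(prompt))
            enough = self.total[task_id] >= 50     # minimum samples before judging
            rate = self.agree[task_id] / self.total[task_id]
            if enough and rate < self.drift_threshold:
                del self.surrogates[task_id]       # revert to LLM; retraining is triggered elsewhere
        return answer
```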
6. Empirical Evaluation and Quantitative Results
The Poodle prototype was evaluated on binary sentiment classification (IMDB), comparing canonical LLMs (GPT-4.1, GPT-4.1-nano, Llama-405B Turbo, Llama-2-7B) with surrogates (BERT-base, 80M params).
- Cost Savings (per 1 million requests, Table 1 prices; a worked break-even sketch follows this list):
- GPT-4.1-nano → BERT: Break-even at 100k requests, saving \$33.
- GPT-4.1 → BERT: Break-even at 10k requests, saving \$850.
- Llama-405B Turbo → BERT: Break-even at 10k requests, saving \$1,420.
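The break-even point is where per-request savings amortize the one-time surrogate development and deployment cost. The sketch below works through that arithmetic with made-up placeholder prices, not the Table 1 values.

```python
def break_even_requests(llm_cost_per_request: float,
                        surrogate_cost_per_request: float,
                        surrogate_setup_cost: float) -> float:
    """Number of requests after which the one-time cost of building and
    hosting the surrogate is amortized by the per-request savings."""
    savings_per_request = llm_cost_per_request - surrogate_cost_per_request
    return surrogate_setup_cost / savings_per_request

# Hypothetical numbers for illustration only.
print(break_even_requests(llm_cost_per_request=1e-3,        # $0.001 per LLM call
                          surrogate_cost_per_request=5e-5,  # $0.00005 per surrogate call
                          surrogate_setup_cost=20.0))       # $20 one-off search + fine-tuning
# ~21,053 requests before the surrogate pays for itself
```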
- Latency and Throughput (NVIDIA A5000, max batch size):
- Llama-2-7B: 13 items/sec (batch=16)
- BERT: 254 items/sec (batch=128), 19.6× faster
- Break-even at 100k requests; speedup at 1M
- Surrogate Accuracy (IMDB test set):
| #Examples | GT Train → Test Acc | LLM Train → Test Acc |
|---|---|---|
| 500 | 0.86 → 0.88 | 0.88 → 0.88 |
| 1,000 | 0.88 → 0.89 | 0.88 → 0.88 |
| 2,000 | 0.89 → 0.90 | 0.88 → 0.88 |
| 5,000 | 0.90 → 0.91 | 0.90 → 0.90 |
- Development Efficiency:
- Naïve full-fine-tune (10 candidates on 5,000 examples): 53 min, test-acc 0.92
- JITR search+fine-tune, best on 500 examples: 2.8 min, acc 0.91
- JITR search+fine-tune on 5,000 examples: 12 min, acc 0.92
7. Practical Challenges, Limitations, and Future Directions
Key challenges include early-detection overhead from wrapper tokens, scaling model-store indexing to millions of candidates, storage and throughput bottlenecks (requiring fast model loading and cluster-aware compression), and calibration of monitoring (how much shadowing is required to robustly detect surrogate drift). Notable limitations are that surrogate accuracy is contingent on the quality and representativeness of the collected data; rare or shifting tasks produce weaker surrogates; and privacy concerns arise when logging sensitive prompts and responses.
Proposed research directions include meta-learning for low-shot surrogate performance prediction, hardware and storage co-optimization, advanced distillation (including intermediate-representation matching), multi-task surrogates with shared layers, and dynamic refinement of user-defined performance/cost thresholds via automated feedback loops (Strassenburg et al., 5 Dec 2025).