Just-in-Time Model Replacement (JITR)
- Just-in-Time Model Replacement (JITR) is an adaptive framework that replaces large language models with efficient surrogate models for recurring, template-based tasks.
- It continuously monitors LLM usage, detects recurring task patterns via clustering and vectorization, and triggers model replacement based on cost and accuracy thresholds.
- Leveraging methods like full fine-tuning, adapters/LoRA, and distillation, JITR achieves near LLM-level performance with significantly reduced computational overhead.
Just-in-Time Model Replacement (JITR) is an adaptive framework for dynamically substituting LLMs with computationally cheaper, task-specialized surrogate models in production pipelines. Upon detecting recurrent user requests that can be characterized as stable task templates, JITR identifies and fine-tunes small models—thereby reducing operational cost and latency without sacrificing accuracy for the repetitive task. The approach centers on continual monitoring of LLM usage, automatic detection and clustering of recurring task patterns, efficient surrogate model search and adaptation, and seamless runtime model switching with ongoing performance monitoring (Strassenburg et al., 5 Dec 2025).
1. Formal Problem Formulation
Let $S = \{(p_1, r_1), (p_2, r_2), \dots\}$ be the stream of user prompts $p_i$ and LLM-generated outputs $r_i$. Many requests in practice correspond to a small set of recurring tasks $T_1, \dots, T_k$, recognizable as templates $T_j = (I_j, X_j)$, where $I_j$ defines the high-level instruction (e.g., "sentiment classification of movie review") and $X_j$ are slot-fillers (e.g., the actual review text). For each recurring task $T_j$ in a sliding window $W$, a frequency $f_j$ quantifies its occurrence.
The objective is to find, for each detected $T_j$, a surrogate model $m_j$ that (i) achieves task accuracy $\mathrm{acc}(m_j, T_j) \ge \alpha_j$ and (ii) minimizes per-instance invocation cost $c(m_j)$ (monetary, energy, or time). Model replacement is enacted when, after accumulating labeled instances, a candidate model $m_j$ is shown to satisfy

$$\mathrm{acc}(m_j, T_j) \ge \alpha_j \quad \text{and} \quad c(m_j) < c(M),$$

where $M$ is the original LLM and $\alpha_j$ is the accuracy threshold for task $T_j$.
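The replacement decision reduces to a simple comparison of measured accuracy and per-request cost. The following is a minimal sketch of that criterion; the accessor names (`accuracy`, `cost_per_request`) and the example numbers are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class CandidateStats:
    accuracy: float          # measured task accuracy on held-out labeled instances
    cost_per_request: float  # e.g., USD, joules, or milliseconds per invocation

def should_replace(surrogate: CandidateStats,
                   llm: CandidateStats,
                   accuracy_threshold: float) -> bool:
    """Replace the LLM for this task only if the surrogate meets the
    accuracy threshold and is strictly cheaper per request."""
    return (surrogate.accuracy >= accuracy_threshold
            and surrogate.cost_per_request < llm.cost_per_request)

# Example: a small classifier at 0.90 accuracy vs. an LLM baseline.
print(should_replace(CandidateStats(accuracy=0.90, cost_per_request=0.0001),
                     CandidateStats(accuracy=0.91, cost_per_request=0.01),
                     accuracy_threshold=0.88))  # True
```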
2. Recurring-Task Detection and Trigger Mechanisms
Incoming LLM calls are monitored and analyzed through multilayered pipelines designed to extract recurring patterns and cluster requests into tasks $T_j$:
- Prompt-Prefix Vectorization: Each request $p_i$ is embedded into a fixed-length vector $v_i$ (extracted from model KV-cache or wrapper prompts). Pairwise cosine similarities $\mathrm{sim}(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\,\lVert v_j \rVert}$ are used to link prompt instances; thresholding ($\mathrm{sim}(v_i, v_j) \ge \tau$) groups requests into candidate templates.
- Wrapper Prompt Classification: Requests are optionally wrapped in metadata prompts instructing the LLM to additionally emit task-identifying fields, yielding initial cluster assignments.
- Periodic Clustering: Offline clustering (agglomerative or $k$-means) is applied every 1,000 requests to recent embeddings, producing clusters $C_1, \dots, C_m$.
Surrogate generation is triggered when the frequency $f_j$ and buffer size $b_j$ for cluster $C_j$ exceed user-specified or estimated thresholds $\theta_f$ and $\theta_b$.
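A minimal sketch of this detection-and-trigger loop is shown below, using a placeholder embedding function in place of KV-cache vectors and a greedy cosine-similarity grouping; the threshold values and function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def embed(prompt: str, dim: int = 64) -> np.ndarray:
    # Placeholder for a real prompt embedding (e.g., derived from the LLM's
    # KV-cache or a sentence encoder); here a deterministic hash-based vector
    # keyed on the prompt prefix before the first colon.
    rng = np.random.default_rng(abs(hash(prompt.split(":")[0])) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def assign_clusters(prompts, tau=0.8):
    """Greedy grouping: a prompt joins the first cluster whose representative
    embedding has cosine similarity >= tau, otherwise it starts a new cluster."""
    representatives, clusters = [], defaultdict(list)
    for i, p in enumerate(prompts):
        v = embed(p)
        sims = [float(v @ r) for r in representatives]
        if sims and max(sims) >= tau:
            j = int(np.argmax(sims))
        else:
            j = len(representatives)
            representatives.append(v)
        clusters[j].append(i)
    return clusters

def triggered(clusters, window_size, freq_threshold=0.3, min_buffer=500):
    """Return clusters whose frequency and buffered example count both
    exceed their thresholds, i.e., candidates for surrogate generation."""
    return [j for j, members in clusters.items()
            if len(members) / window_size >= freq_threshold
            and len(members) >= min_buffer]
```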
3. Surrogate Model Search and Selection Strategy
The candidate search space $\mathcal{M}$ comprises models from private repositories or public hubs (e.g., Hugging Face), each annotated with parameter count ($n_{\text{params}}$), model size, inference latency $\ell$, and benchmarking metadata. The optimization seeks

$$m_j^\ast = \arg\max_{m \in \mathcal{M}} \widehat{\mathrm{acc}}(m, T_j) \quad \text{s.t.} \quad \ell(m) \le L_{\max},\; \mathrm{mem}(m) \le M_{\max},$$

where $L_{\max}$ and $M_{\max}$ are latency and memory constraints. The search process systematically prunes infeasible models, ranks candidates using quick accuracy predictors on a few task samples, clusters candidates via Task2Vec-like embeddings, and fully fine-tunes the most promising meta-candidates before final selection (see the sketch following the table below).
| Step | Input | Output |
|---|---|---|
| Prune | $\mathcal{M}$, constraints | Models with valid memory/latency |
| Surrogate Prediction | Pruned models, task samples | Predicted accuracy for fast ranking |
| Clustering | Ranked models, embeddings | Top-$k$ clusters; select cluster representatives |
| Fine-tuning | Meta-candidates, full data | Measured $\mathrm{acc}$; select best cost-achieving model $m_j$ |
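The following is a minimal sketch of the prune-then-rank steps under these constraints; the model metadata fields, the toy accuracy predictor, and the hub entries are illustrative assumptions rather than Poodle's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModelCard:
    name: str
    params_millions: float
    memory_gb: float
    latency_ms: float   # measured or estimated per-request latency

def prune(candidates: List[ModelCard],
          max_latency_ms: float, max_memory_gb: float) -> List[ModelCard]:
    """Step 1: drop models that violate latency or memory constraints."""
    return [m for m in candidates
            if m.latency_ms <= max_latency_ms and m.memory_gb <= max_memory_gb]

def rank(candidates: List[ModelCard],
         predict_accuracy: Callable[[ModelCard], float],
         top_k: int = 3) -> List[ModelCard]:
    """Step 2: rank survivors with a cheap accuracy predictor and keep the
    top-k meta-candidates, which are then fully fine-tuned before selection."""
    return sorted(candidates, key=predict_accuracy, reverse=True)[:top_k]

# Usage with a toy predictor that simply favors larger models.
hub = [ModelCard("bert-base", 110, 0.4, 8.0),
       ModelCard("distilbert", 66, 0.25, 5.0),
       ModelCard("7b-llm", 7000, 14.0, 120.0)]
feasible = prune(hub, max_latency_ms=50, max_memory_gb=2.0)
shortlist = rank(feasible, predict_accuracy=lambda m: m.params_millions)
print([m.name for m in shortlist])  # ['bert-base', 'distilbert']
```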
4. Transfer Learning and Fine-Tuning Pipeline
Surrogate adaptation uses several transfer learning paradigms:
- Full Fine-Tuning updates all parameters.
- Adapters / LoRA freeze the base model and learn low-rank updates $\Delta W = BA$.
- Distillation minimizes the Kullback-Leibler divergence between the surrogate's output distribution and the full LLM's output logits.
The composite training loss is

$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{CE}} + \beta\,\mathcal{L}_{\text{KD}},$$

with $\mathcal{L}_{\text{CE}}$ as the cross-entropy against ground-truth labels, $\mathcal{L}_{\text{KD}}$ as the distillation loss, and $\alpha$, $\beta$ weighting ground-truth versus teacher signal.
Empirical results indicate only a few hundred to a few thousand examples are required to approach LLM-level test accuracy on straightforward tasks, using a standard train/validation/test split, early stopping on the validation loss, and checkpoint retention based on validation-set accuracy.
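A minimal PyTorch sketch of the composite loss from above is shown here, with a placeholder linear student and precomputed teacher logits; the architecture, batch contents, and the weights $\alpha$, $\beta$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   labels: torch.Tensor,
                   alpha: float = 0.5,
                   beta: float = 0.5) -> torch.Tensor:
    """alpha * cross-entropy on ground-truth labels
       + beta * KL divergence to the teacher (LLM) output distribution."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return alpha * ce + beta * kd

# Toy usage: a linear "student" classifier over precomputed features.
torch.manual_seed(0)
student = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

features = torch.randn(32, 16)          # batch of task inputs
labels = torch.randint(0, 2, (32,))     # ground-truth labels
teacher_logits = torch.randn(32, 2)     # logits recorded from the LLM

loss = composite_loss(student(features), teacher_logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```

Setting $\beta = 0$ recovers plain fine-tuning on ground-truth labels, while $\alpha = 0$ corresponds to pure distillation from the teacher.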
5. System Architecture and Workflow (Poodle Framework)
The Poodle system is the canonical instantiation of JITR, with clearly delineated components:
- Data Collector / Monitor: Hooks into each LLM API call, applies optional wrapper prompts, and logs prompt/response pairs $(p_i, r_i)$ along with cost/timing metrics.
- Task Analyzer: Performs clustering on recent logs to update the set of recurring tasks.
- Model Manager / Generator: On a new recurring task $T_j$, runs the search-and-customization workflow and registers the resulting surrogate $m_j$.
- Inference Engine: Routes incoming requests for $T_j$ to $m_j$; otherwise, defaults to the LLM $M$.
- Model Monitor: Periodically shadow-tests a random fraction (1–5%) of requests on both the LLM $M$ and the surrogate $m_j$ to monitor performance drift; triggers retraining or reversion if surrogate performance falls below the threshold $\alpha_j$.
Integration is achieved via transparent proxying or a client SDK that intercepts and augments calls, enabling non-intrusive deployment with existing LLM APIs.
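A minimal sketch of the routing and shadow-testing logic follows; the callable client interfaces, agreement-based drift proxy, sampling rate, and thresholds are illustrative assumptions rather than Poodle's API.

```python
import random
from typing import Callable, Dict, Optional

class InferenceRouter:
    """Routes requests for recognized recurring tasks to their surrogate,
    falls back to the LLM otherwise, and shadow-tests a small fraction of
    surrogate traffic against the LLM to detect drift."""

    def __init__(self, call_llm: Callable[[str], str],
                 shadow_rate: float = 0.02, drift_threshold: float = 0.9):
        self.call_llm = call_llm
        self.surrogates: Dict[str, Callable[[str], str]] = {}
        self.shadow_rate = shadow_rate
        self.drift_threshold = drift_threshold
        self.agree: Dict[str, int] = {}
        self.total: Dict[str, int] = {}

    def register(self, task_id: str, surrogate: Callable[[str], str]) -> None:
        self.surrogates[task_id] = surrogate
        self.agree[task_id], self.total[task_id] = 0, 0

    def handle(self, prompt: str, task_id: Optional[str]) -> str:
        surrogate = self.surrogates.get(task_id)
        if surrogate is None:
            return self.call_llm(prompt)           # no surrogate yet: use the LLM
        answer = surrogate(prompt)
        if random.random() < self.shadow_rate:     # shadow-test against the LLM
            self.total[task_id] += 1
            self.agree[task_id] += int(answer == self.call_llm(prompt))
            enough = self.total[task_id] >= 50     # minimum samples before judging
            rate = self.agree[task_id] / self.total[task_id]
            if enough and rate < self.drift_threshold:
                del self.surrogates[task_id]       # revert to LLM; retraining is triggered elsewhere
        return answer
```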
6. Empirical Evaluation and Quantitative Results
The Poodle prototype was evaluated on binary sentiment classification (IMDB), comparing canonical LLMs (GPT-4.1, GPT-4.1-nano, Llama-405B Turbo, Llama-2-7B) with surrogates (BERT-base, 80M params).
- Cost Savings (per 1 million requests, Table 1 prices; a worked break-even sketch follows this list):
- GPT-4.1-nano → BERT: Break-even at 100k requests, saving \$33.
- GPT-4.1 → BERT: Break-even at 10k requests, saving \$850.
- Llama-405B Turbo → BERT: Break-even at 10k requests, saving \$1,420.
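The break-even point is where per-request savings amortize the one-time surrogate development and deployment cost. The sketch below works through that arithmetic with made-up placeholder prices, not the Table 1 values.

```python
def break_even_requests(llm_cost_per_request: float,
                        surrogate_cost_per_request: float,
                        surrogate_setup_cost: float) -> float:
    """Number of requests after which the one-time cost of building and
    hosting the surrogate is amortized by the per-request savings."""
    savings_per_request = llm_cost_per_request - surrogate_cost_per_request
    return surrogate_setup_cost / savings_per_request

# Hypothetical numbers for illustration only.
print(break_even_requests(llm_cost_per_request=1e-3,        # $0.001 per LLM call
                          surrogate_cost_per_request=5e-5,  # $0.00005 per surrogate call
                          surrogate_setup_cost=20.0))       # $20 one-off search + fine-tuning
# ~21,053 requests before the surrogate pays for itself
```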
- Latency and Throughput (NVIDIA A5000, max batch size):
- Llama-2-7B: 13 items/sec (batch=16)
- BERT: 254 items/sec (batch=128), 19.6× faster
- Break-even at 100k requests; speedup at 1M
- Surrogate Accuracy (IMDB test set):
| #Examples | GT Train → Test Acc | LLM Train → Test Acc |
|---|---|---|
| 500 | 0.86 → 0.88 | 0.88 → 0.88 |
| 1,000 | 0.88 → 0.89 | 0.88 → 0.88 |
| 2,000 | 0.89 → 0.90 | 0.88 → 0.88 |
| 5,000 | 0.90 → 0.91 | 0.90 → 0.90 |
- Development Efficiency:
- Naïve full-fine-tune (10 candidates on 5,000 examples): 53 min, test-acc 0.92
- JITR search+fine-tune, best on 500 examples: 2.8 min, acc 0.91
- JITR search+fine-tune on 5,000 examples: 12 min, acc 0.92
7. Practical Challenges, Limitations, and Future Directions
Key challenges include early-detection overhead from wrapper tokens, scaling model-store indexing to millions of candidates, storage and throughput bottlenecks (requiring fast model loading and cluster-aware compression), and calibration of monitoring (how much shadowing is required to robustly detect surrogate drift). Notable limitations are that surrogate accuracy is contingent on the quality and representativeness of the collected data; rare or shifting tasks produce weaker surrogates; and privacy concerns arise when logging sensitive prompts and responses.
Proposed research directions include meta-learning for low-shot surrogate performance prediction, hardware and storage co-optimization, advanced distillation (including intermediate-representation matching), multi-task surrogates with shared layers, and dynamic refinement of user-defined performance/cost thresholds via automated feedback loops (Strassenburg et al., 5 Dec 2025).