Predictive-LoRA: Efficient Serverless LLM Inference

Updated 30 December 2025
  • Predictive-LoRA (P-LoRA) is a serverless inference system that proactively prefetches LoRA adapters using an LSTM-based traffic predictor and page-based memory management.
  • It leverages a lightweight LSTM to forecast adapter demand, reducing cold start latency by 68% and boosting throughput by up to 1.52×.
  • Uniform 2 MB page management in P-LoRA maintains GPU memory utilization above 87% while significantly lowering memory fragmentation from mixed adapter ranks.

Predictive-LoRA (P-LoRA) is a proactive, fragmentation-aware serverless inference system for LLMs fine-tuned via Low-Rank Adaptation (LoRA). Designed to address the critical challenges of reactive adapter loading and GPU memory fragmentation in serverless environments, P-LoRA integrates two principal innovations: (1) a lightweight Long Short-Term Memory (LSTM)-based traffic predictor to forecast and proactively prefetch high-demand LoRA adapters, and (2) a page-based memory management regime that eliminates external GPU fragmentation regardless of adapter heterogeneity. Benchmarking demonstrates that P-LoRA reduces median cold start latency by up to 68%, increases throughput by up to 1.52×, and sustains GPU memory utilization above 87% even under mixed adapter ranks (Ni et al., 23 Dec 2025).

1. System Architecture and Operational Workflow

P-LoRA's architecture comprises several interconnected components, distributed across host and device-level resources. On the host CPU and memory, access logs and per-adapter time-series counters are maintained. The LSTM traffic predictor executes asynchronously on CPU cores, ingesting sliding-window histories and emitting predicted adapter access probabilities $\hat{p}_i$ to a shared prediction table. All LoRA adapter weight files (i.e., the $A$ and $B$ matrices) reside in host storage (SSD/DRAM) when not present in GPU memory.

GPU memory is partitioned into fixed-size, 2 MB pages managed by a custom page table. Active pages exclusively hold the weights of "in-GPU" adapters. The Prefetch Manager, an extension of the GPU scheduler, consumes predicted adapter probabilities from the predictor and orchestrates PCIe/DMA transfers to bring predicted-hot adapters into a GPU staging buffer, later promoting them into the active pool at batch boundaries. Inference execution, built as a vLLM extension, utilizes a scatter-gather CUDA kernel for page lookups and weight accesses.
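To make the page indirection concrete, the following NumPy sketch models on the host side what the scatter-gather lookup does logically; the actual kernel runs on the GPU as a vLLM/CUDA extension, and the names page_table and page_pool are illustrative, not P-LoRA's real data structures.

import numpy as np

PAGE_BYTES = 2 * 1024 * 1024   # uniform 2 MB pages

def gather_adapter(page_table, page_pool, adapter_id, adapter_bytes):
    """Host-side model of the scatter-gather lookup: follow the adapter's
    page table and reassemble its weight bytes from (possibly
    non-contiguous) physical page frames."""
    frames = page_table[adapter_id]                       # logical index -> physical frame
    flat = np.concatenate([page_pool[f] for f in frames])
    return flat[:adapter_bytes]                           # drop padding in the last page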

Upon receipt of a serverless request, the scheduler verifies the residency of the requested adapter. If the adapter is resident, inference commences immediately; otherwise, the scheduler either waits for an in-flight proactive prefetch to complete or loads the adapter reactively, potentially incurring a cold start. The inference pipeline fuses the transformer backbone with LoRA adapter weights read from in-GPU pages, and all adapter accesses are logged to feed subsequent predictions. P-LoRA operates entirely within the container managed by the serverless framework (e.g., Azure Functions), preserving zero-management semantics: prediction, prefetching, and memory packing run as transparent system services.
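As a rough, non-authoritative illustration of this per-request control flow, the Python sketch below models the residency check and the cold-start fallback; the state object and helper names (is_resident, prefetch_in_flight, load_adapter_reactively) are hypothetical stand-ins rather than P-LoRA's actual interfaces.

import time

def handle_request(request, state, run_inference):
    """Illustrative model of the scheduling path for one serverless request."""
    adapter_id = request.adapter_id

    if state.is_resident(adapter_id):
        pass                                    # hot path: pages already on GPU
    elif state.prefetch_in_flight(adapter_id):
        state.wait_for_prefetch(adapter_id)     # proactive load already underway
    else:
        t0 = time.monotonic()                   # reactive load: cold start
        state.load_adapter_reactively(adapter_id)
        state.record_cold_start(adapter_id, time.monotonic() - t0)

    state.log_access(adapter_id)                # feeds the LSTM traffic predictor
    return run_inference(request, adapter_id)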

2. LSTM-Based Adapter Traffic Prediction

The traffic predictor central to P-LoRA is a two-layer LSTM, each layer comprising 64 hidden units. For adapter $i$ and time interval $t$, the input features are the normalized access count vector:

$$[c_i^{t-w}, c_i^{t-w+1}, \ldots, c_i^{t-1}]$$

with $w$ representing a default 30-second sliding window in 1-second increments. Adapter identity is encoded via a learned 32-dimensional embedding $E(i)$. The LSTM's final hidden state is passed to a fully connected layer of size $N$ (the number of adapters), outputting $\hat{p}_i$, the predicted probability that adapter $i$ will be accessed in the next time interval.
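A minimal PyTorch sketch with these dimensions is shown below; the sizes (two layers, 64 hidden units, 32-dimensional embedding, $N$-way output) follow the description above, while details such as how the adapter embedding is fused with the count features are assumptions.

import torch
import torch.nn as nn

class AdapterTrafficLSTM(nn.Module):
    """Sketch of the two-layer, 64-unit LSTM traffic predictor.
    Fusing the 32-d adapter embedding by concatenating it with the
    per-step count is an assumption, not the paper's stated design."""

    def __init__(self, num_adapters, hidden=64, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_adapters, embed_dim)
        self.lstm = nn.LSTM(input_size=1 + embed_dim, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_adapters)

    def forward(self, counts, adapter_ids):
        # counts: (batch, window) normalized access counts c_i^{t-w..t-1}
        # adapter_ids: (batch,) integer adapter identities
        emb = self.embed(adapter_ids)                        # (batch, 32)
        emb = emb.unsqueeze(1).expand(-1, counts.size(1), -1)
        x = torch.cat([counts.unsqueeze(-1), emb], dim=-1)   # (batch, w, 33)
        _, (h_n, _) = self.lstm(x)                           # final hidden state
        return torch.sigmoid(self.head(h_n[-1]))             # (batch, N) predicted probabilities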

The objective function is binary cross-entropy, applied over all adapters at each tick:

$$L = -\sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]$$

where $y_i = 1$ if adapter $i$ is accessed in $(t, t+1]$, and $0$ otherwise.

Training utilizes production-scale traces (Azure Functions, mapped to unique adapters), with synthetic arrival rates spanning 10–500 req/s. Online learning occurs every 100 real requests, using a replay buffer of 10,000 observations; optimization uses Adam ($lr = 10^{-3}$, batch size = 64). Prediction accuracy is 86% on average, peaking at 89% during weekday business hours and dropping to 70–78% during weekend evenings; CPU inference cost is 2.3 ms per prediction.
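The online update described above might look roughly like the sketch below; the replay-buffer layout and sampling scheme are assumptions chosen to match the stated hyperparameters (Adam with $lr = 10^{-3}$, batch size 64, a 10,000-observation buffer, one update per 100 requests).

import random
import torch
import torch.nn.functional as F

def online_update(model, optimizer, replay_buffer, batch_size=64):
    """One online learning step: sample a minibatch of logged observations
    and apply the binary cross-entropy objective from the equation above.
    Each buffer entry is assumed to hold counts, adapter_id, and labels."""
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    counts = torch.stack([b["counts"] for b in batch])            # (B, window)
    ids = torch.tensor([b["adapter_id"] for b in batch])          # (B,)
    labels = torch.stack([b["labels"] for b in batch]).float()    # (B, N), y_i in {0, 1}

    probs = model(counts, ids)                                    # (B, N) predicted probabilities
    loss = F.binary_cross_entropy(probs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Invoked roughly every 100 real requests, e.g. with
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)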

Proactive prefetching occurs at 100 ms intervals: all adapters with $\hat{p}_i > \theta$ ($\theta \approx 0.5$) that are not resident on GPU are preemptively loaded. Double buffering overlaps data transfers with batched inference. This mechanism reduces median cold start latency from 68 ms (S-LoRA) to 22 ms under benchmark conditions, a 68% reduction.
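A hedged sketch of this prefetch tick follows; prediction_table, state, and the async_copy_to_staging helper are illustrative names, and only the interval and threshold values come from the description above.

import time

PREFETCH_INTERVAL_S = 0.100   # 100 ms prediction/prefetch tick
THETA = 0.5                   # threshold on the predicted probability

def prefetch_loop(prediction_table, state, stop_event):
    """Illustrative prefetch manager: every 100 ms, start asynchronous
    host-to-GPU copies for predicted-hot, non-resident adapters. Double
    buffering lets these transfers overlap with the running batch; staged
    adapters are promoted to the active page pool at the next batch boundary."""
    while not stop_event.is_set():
        for adapter_id, p_hat in prediction_table.items():
            if p_hat > THETA and not state.is_resident(adapter_id):
                state.async_copy_to_staging(adapter_id)   # PCIe/DMA transfer
        time.sleep(PREFETCH_INTERVAL_S)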

3. Page-Based Adapter Memory Management

GPU memory is managed using a uniform 2 MB page abstraction. LoRA adapters, whose sizes vary with rank (e.g., rank-8 ≈ 13 MB, rank-64 ≈ 100 MB for LLaMA2-7B), are divided into $M = \lceil S / P \rceil$ logical pages, with $S$ the adapter's size and $P$ the page size. Each adapter receives a dedicated page table, mapping logical indices to physical page frames; contiguity is not required.
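As a worked example using the sizes above, a rank-64 LLaMA2-7B adapter of roughly 100 MB occupies $M = \lceil 100 / 2 \rceil = 50$ pages, while a rank-8 adapter of roughly 13 MB occupies $\lceil 13 / 2 \rceil = 7$ pages; only the final page of each adapter is partially filled, so internal fragmentation is bounded by one 2 MB page per resident adapter.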

Allocation pseudo-code:

# P is the uniform 2 MB page size
pages_needed = ceil(adapter_size[adapter_id] / P)

# Evict the lowest-scoring resident adapters (Equation (3)) until enough frames are free
while free_page_count < pages_needed:
    victim = SelectVictimAdapter()
    FreePages(victim)

# Map the adapter's logical pages onto free physical frames (contiguity not required)
for page_index in range(pages_needed):
    phys = PopFreePage()
    page_table[adapter_id][page_index] = phys

MarkResident(adapter_id)

The eviction policy (Equation (3)) assigns each candidate adapter a score:

$$\text{score}_i = \alpha \cdot \text{LRU\_rank}_i + \beta \cdot \text{freq}_i + \gamma \cdot \hat{p}_i$$

Adapters with the lowest scores are evicted until sufficient pages are available. Background compaction periodically scans for adjacent free pages but is seldom necessary due to low fragmentation.
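A minimal sketch of victim selection under this rule is shown below; the weights $\alpha$, $\beta$, $\gamma$ and the normalization of the LRU-rank and frequency terms are not specified above and appear here only as placeholder parameters.

def select_victim_adapter(resident, alpha=1.0, beta=1.0, gamma=1.0):
    """Return the resident adapter with the lowest eviction score
    score_i = alpha * LRU_rank_i + beta * freq_i + gamma * p_hat_i.
    `resident` maps adapter_id -> {"lru_rank": ..., "freq": ..., "p_hat": ...};
    the default weights are placeholders, not the paper's settings."""
    def score(adapter_id):
        s = resident[adapter_id]
        return alpha * s["lru_rank"] + beta * s["freq"] + gamma * s["p_hat"]
    return min(resident, key=score)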

Uniform page sizing eliminates external fragmentation: all "holes" are integer multiples of 2 MB, so adapters of disparate ranks are handled with consistent efficiency. The measured fragmentation ratio $F = 1 - \text{used\_pages} / \text{total\_pages}$ is 12% on average for P-LoRA, versus 25% (S-LoRA) and 35% (block allocator baseline). Memory utilization $U$ remains above 87% across all workload mixes.

4. Experimental Results and Metrics

Evaluations use servers with 8× NVIDIA A100 (40 GB) GPUs, 512 GB DDR4 memory, and dual AMD EPYC 7763 CPUs, with models including LLaMA2-7B/13B and Mistral-7B. LoRA ranks tested are {8, 16, 32, 64}. Workloads are generated from Azure Functions traces at 10–500 req/s, with prompt/output lengths sampled from ShareGPT logs. Baselines include vLLM (no LoRA packing), S-LoRA, and dLoRA.

Key performance results:

Metric                        P-LoRA      S-LoRA     Block Allocator   vLLM
Median cold start latency     22 ms       68 ms      –                 –
Throughput (1000 adapters)    145 req/s   95 req/s   –                 –
TTFT (500 req/s)              340 ms      520 ms     –                 820 ms
Memory fragmentation          ~12%        ~25%       ~35%              –
Memory utilization            >87%        –          –                 –

Further, Time-Per-Output-Token (TPOT) improves by ~24% at peak load. The total runtime overhead per batch is 3.5 ms, broken down into the LSTM predictor (2.3 ms, 18 MB), page-table operations (0.4 ms, 32 MB), and the prefetch scheduler (0.8 ms, 8 MB), for a combined memory overhead of 58 MB.

5. Scalability, Overhead, and Limitations

P-LoRA scales to thousands of adapters with minimal throughput degradation, attributed to page-based memory packing and efficient prefetching. Online learning handles workload distributional shifts, updating every 100 requests.

The total runtime overhead for prediction, page table operations, and scheduling is 3.5 ms per batch, distributed across batches of 8–32 prompts. The corresponding memory overhead (58 MB) is negligible compared to typical LoRA adapter sizes.

Limitations include dependency on workload regularity: prediction and prefetch hit rates decrease to approximately 70% under unpredictable or adversarial access patterns (“unpredictable weekend bursts”), which increases cold-start probability. Page-table address translation introduces negligible per-batch overhead, fully overlappable with other GPU kernels.

Future research directions suggested include adaptive page sizing, more sophisticated multi-horizon adapter access forecasting, and integration with dynamic quantization and related memory-saving approaches.

6. Relationship to Prior Approaches and Broader Implications

S-LoRA and dLoRA serve as baseline methods for serverless LoRA inference but suffer from substantial cold start latency (68 ms vs. 22 ms for P-LoRA) and elevated memory fragmentation (~25–35%). P-LoRA’s integration of proactive prediction and uniform memory paging addresses these deficits by leveraging OS-inspired virtual memory abstractions and real-time traffic forecasting. The system’s architecture anticipates broader application to multi-tenant or large-scale inference environments where rapid workload shifts and adapter heterogeneity are common (Ni et al., 23 Dec 2025).

A plausible implication is that adoption of LSTM-driven prefetching and fine-grained page management may become standard for managing fine-tuned multi-LLM deployments in serverless settings, particularly as adapter counts and access variability increase.
