OTAS: Elastic Serving Systems for Dynamic Workloads
- Elastic Serving Systems (OTAS) are adaptive platforms that dynamically allocate computational resources for serving ML models and transactional workloads under fluctuating conditions.
- They integrate fine-grained resource pooling, declarative scheduling, and self-evolving, policy-driven control to optimize throughput, latency, and service-level objectives.
- Recent implementations employ automated buffering and scalable microservice architectures to ensure rapid adaptation and fault tolerance in high-demand environments.
Elastic Serving Systems (OTAS) are designed to deliver real-time adaptability in the allocation and organization of computational resources for serving machine learning models, transactional workloads, and online inference under high dynamism and uncertainty. The OTAS paradigm—On-Tap, On-Demand, Tiered, Adaptive Serving—focuses on continuous, policy-driven elasticity, spanning from fine-grained object scaling in middleware systems, through declarative resource scheduling for LLMs, to multi-modal and cache-centric approaches. OTAS systems must tightly couple adaptive scheduling, resource abstraction, and data movement optimizations for efficiency and service-level objective (SLO) compliance, frequently underpinned by dynamic, sometimes self-evolving, control logic.
1. Architectural Principles and Core Abstractions
Elastic Serving Systems instantiate elasticity at various levels of application and infrastructure, employing orthogonal abstractions tailored to specific domains.
- Object- and Class-level Elasticity: In middleware such as ElasticRMI, application objects are instantiated within pools that can grow or shrink; state is centralized in a strongly-consistent in-memory key–value store, with a “sentinel” dynamically coordinating pool size and object lifecycle via metrics streams and policy evaluation (Jayaram, 2019).
- Decoupled Planes for Policy Evolution: Autopoiesis introduces a two-plane architecture—data plane for policy execution, control plane for continuous policy evolution via LLM-driven online synthesis. This separation is crucial for achieving sustained adaptability in the face of volatile workloads and elastic infrastructure (Jiang et al., 8 Apr 2026).
- Declarative Resource and Cache Pools: In TokenLake, a unified, segment-level prefix cache pool is managed independently of the scheduler, exposing memory, hit-rate, and bandwidth state to a stateless scheduling layer. This interface enables optimization of communication and memory fragmentation without burdening the task scheduler with cache placement logic (Wu et al., 24 Aug 2025).
- Microservices and Sharding: ElasticRec demonstrates microservice decomposition by partitioning dense and embedding shards, enabling independent scaling and utility-based resource allocation, orchestrated by Kubernetes with horizontal auto-scaling (Choi et al., 2024).
OTAS systems tend to layer these abstractions, facilitating stateless, policy-driven scheduling at the top, and resource pooling or sharding at fine granularity in the underlying substrate.
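The separation between a data plane that executes a policy and a control plane that replaces it can be illustrated with a minimal sketch. Every name here (`DataPlane`, `decide`, `swap_policy`, the metric keys) is hypothetical and not taken from any cited system; the sketch only shows a policy being swapped at runtime without stopping the component that applies it.

```python
import threading
from typing import Callable, Dict

# Hypothetical policy signature: runtime metrics in, desired pool size out.
Policy = Callable[[Dict[str, float]], int]


class DataPlane:
    """Executes whichever policy is currently installed (illustrative only)."""

    def __init__(self, policy: Policy) -> None:
        self._policy = policy
        self._lock = threading.Lock()

    def decide(self, metrics: Dict[str, float]) -> int:
        # Read the policy reference under the lock, then call it outside the
        # lock so a slow policy never blocks a concurrent swap.
        with self._lock:
            policy = self._policy
        return policy(metrics)

    def swap_policy(self, new_policy: Policy) -> None:
        # Called by the control plane once it has produced a better candidate.
        with self._lock:
            self._policy = new_policy


def conservative(metrics: Dict[str, float]) -> int:
    # Assumed metric key "queue_len"; one worker per 100 queued requests.
    return 1 + int(metrics.get("queue_len", 0)) // 100


def aggressive(metrics: Dict[str, float]) -> int:
    # One worker per 25 queued requests.
    return 1 + int(metrics.get("queue_len", 0)) // 25
```

A control loop would call `swap_policy(aggressive)` when its own evaluation favors the new candidate; how candidates are generated and scored is outside the scope of this sketch.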
2. Elasticity Control, Scaling Policies, and Automation
Elastic scaling policies in OTAS are inherently multidimensional and exploit both system-level and application-specific signals.
- Metric-driven and Programmable Scaling: ElasticRMI supports policies that combine CPU, memory, queue length, and user-specified application metrics. The sentinel aggregates per-object metrics at burst intervals (60 s by default) and executes threshold-based or user-supplied scaling logic to compute the change in pool size (Jayaram, 2019).
- Self-evolving Policies via Program Synthesis: Autopoiesis dispenses with fixed policies, instead synthesizing new scheduling logic online. The LLM evolutionary engine mutates and evaluates candidate policies on recent runtime traces, directly optimizing end-to-end trace completion time under the triad of scheduling, reconfiguration, and serving costs (Jiang et al., 8 Apr 2026).
- Utility-based Resource Partitioning: In ElasticRec, per-shard utility is calculated as QPS per unit of memory, maximizing overall utility subject to memory and SLO constraints. The optimal partitioning is periodically recomputed, and scaling is implemented by autoscaling containers within Kubernetes (Choi et al., 2024).
- Lightweight SLO-aware Buffer Tuning: eLLM incorporates SLO-aware admission and buffer scaling. Buffer size is increased or decreased in response to violations of time-per-output-token (TPOT) and time-to-first-token (TTFT) targets, with physical GPU and CPU memory ballooning mediated by the scheduler (Xu et al., 18 Jun 2025).
Policy frameworks in OTAS must remain programmable or self-optimizing, with tight feedback loops coupled to system observability.
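A threshold-style scaling rule of the kind described for metric-driven policies might look like the following sketch. The thresholds, the 50% growth step, and the names `ObjectMetrics` and `pool_size_delta` are assumptions chosen for illustration; they are not ElasticRMI's actual API or defaults.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class ObjectMetrics:
    cpu_util: float   # fraction of one core, 0.0 to 1.0
    queue_len: int    # pending requests for this object


def pool_size_delta(
    samples: List[ObjectMetrics],
    pool_size: int,
    *,
    cpu_high: float = 0.8,    # assumed scale-out threshold
    cpu_low: float = 0.3,     # assumed scale-in threshold
    queue_high: int = 100,    # assumed queue-length threshold
    min_size: int = 1,
    max_size: int = 64,
) -> int:
    """Return how many objects to add (positive) or remove (negative)."""
    if not samples:
        return 0  # no observations in this interval: hold steady

    avg_cpu = mean(s.cpu_util for s in samples)
    avg_queue = mean(s.queue_len for s in samples)

    if avg_cpu > cpu_high or avg_queue > queue_high:
        # Overloaded: grow by roughly 50%, and by at least one object.
        target = pool_size + max(1, pool_size // 2)
    elif avg_cpu < cpu_low and avg_queue < queue_high / 10:
        # Clearly idle: shrink one object at a time to avoid oscillation.
        target = pool_size - 1
    else:
        target = pool_size

    # Clamp to the configured pool bounds before computing the delta.
    target = max(min_size, min(max_size, target))
    return target - pool_size
```

Growing faster than shrinking is a common design choice for this kind of controller, since under-provisioning usually costs more in SLO violations than a few idle objects cost in resources.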
3. Resource Pooling, Memory Elasticity, and Efficient Data Orchestration
Unified resource and cache pools are foundational for OTAS systems in high-throughput/low-latency inference or serving environments.
- Virtual Memory Pools and Ballooning: eLLM unifies KV and activation tensors within a single GPU memory pool, with ownership of physical memory chunks remapped by page-table updates. This supports on-demand inflation/deflation of working sets and admits overflow into CPU DRAM, minimizing queueing and maximizing batch sizes under dynamic load (Xu et al., 18 Jun 2025).
- Segment-level Cache Pooling and Heavy-hitter Replication: TokenLake manages cache as segments distributed across the cluster, decouples compute from storage, and applies heavy-hitter replication to balance load and deduplicate data. Batches are assigned by solving a bipartite matching problem to minimize communication, and eviction decisions are made globally via LRU (Wu et al., 24 Aug 2025).
- Elastic Sequence and Multimodal Parallelism: LoongServe’s elastic sequence parallelism (ESP) distributes tokens across instances at per-iteration, token-level granularity, achieving zero fragmentation and dynamic bin-packing, while ElasticMM orchestrates modality-oriented pools and stage-level (encoding, prefill, decode) elastic partition scheduling via gain–cost analyses (Wu et al., 2024, Liu et al., 14 Jul 2025).
Pooling and declarative resource interfaces enable schedulers to focus on compute- and SLO-driven batching logic, with memory and network orchestration handled orthogonally.
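The idea of a cache pool that evicts by LRU and exposes its state to a separate scheduler can be sketched on a single node as follows. This is a simplification under stated assumptions: it ignores distribution, replication, and bandwidth accounting, and the class and method names (`SegmentCachePool`, `state`) are hypothetical rather than TokenLake's interface.

```python
from collections import OrderedDict
from typing import Dict, Optional


class SegmentCachePool:
    """Fixed-capacity pool of prefix-cache segments with LRU eviction."""

    def __init__(self, capacity_segments: int) -> None:
        if capacity_segments <= 0:
            raise ValueError("capacity_segments must be positive")
        self.capacity = capacity_segments
        self._segments: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[bytes]:
        if key in self._segments:
            self._segments.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._segments[key]
        self.misses += 1
        return None

    def put(self, key: str, value: bytes) -> None:
        if key in self._segments:
            self._segments.move_to_end(key)
        self._segments[key] = value
        # Evict least recently used segments until back within capacity.
        while len(self._segments) > self.capacity:
            self._segments.popitem(last=False)

    def state(self) -> Dict[str, float]:
        # Read-only summary a scheduler could poll; it never needs to know
        # where individual segments live.
        total = self.hits + self.misses
        return {
            "used": len(self._segments),
            "capacity": self.capacity,
            "hit_rate": self.hits / total if total else 0.0,
        }
```

Keeping placement and eviction inside the pool, and exposing only aggregate state, is what lets the scheduling layer stay stateless in this kind of design.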
4. Scheduling, Batching, and Workload Adaptivity
Schedulers in OTAS systems interleave dynamic batching, degree-of-parallelism optimization, and SLO constraint enforcement.
- Dynamic Programming and Policy-Decomposition: Both OTAS (token-adaptive transformer serving) and LoongServe use dynamic programming for joint batching and resource allocation (batch boundaries, token adaptation or DoP) under memory and deadline constraints (Chen et al., 2024, Wu et al., 2024). These formulations map naturally to weighted scheduling and resource partitioning.
- Tandem and Micro-Request Abstraction: DynaServe selects a split point for each request, dividing the prefill and decode phases at an arbitrary position, while two-level schedulers (global, and local on each GPU) maintain QPS, goodput, and latency guarantees by continually adjusting split points and batch composition (Ruan et al., 12 Apr 2025).
- SLO-driven Scaling and Modality/Stage Decomposition: ElasticMM sustains SLOs under burst loads by allocating disjoint resource pools by modality and by pipeline stage, preemptively migrating resources as gain–cost calculations dictate and leveraging cache hits to reduce TTFT (Liu et al., 14 Jul 2025).
- Stateless Elastic Scheduling: In TokenLake, the stateless scheduler solves for optimal batch assignments while accounting for prefix cache load only as a constraint, allowing plug-and-play integration with different underlying pooling mechanisms (Wu et al., 24 Aug 2025).
Adaptive scheduling is therefore characterized by tight interleaving of batching, resource-bound estimation, and fine-grained preemption or scaling—often in real time or with sub-minute latency.
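To make the dynamic-programming view of batching concrete, the sketch below partitions requests into batches under a memory cap. The cost model is an assumption made for illustration (each batch is padded to its longest request, plus a fixed per-batch overhead) and is not the formulation used by OTAS or LoongServe; `plan_batches` and its parameters are hypothetical names.

```python
from typing import List, Tuple


def plan_batches(
    lengths: List[int],
    max_batch_tokens: int,
    per_batch_overhead: int = 0,
) -> Tuple[int, List[List[int]]]:
    """Split requests into batches minimizing padded tokens plus overhead.

    Requests are sorted by length so that each batch is a contiguous run;
    a batch of k requests whose longest has L tokens costs k * L.
    """
    ordered = sorted(lengths)
    n = len(ordered)
    if any(length > max_batch_tokens for length in ordered):
        raise ValueError("a single request exceeds max_batch_tokens")

    INF = float("inf")
    best = [0] + [INF] * n   # best[i]: minimal cost for the first i requests
    prev = [0] * (n + 1)     # prev[i]: start index of the last batch

    for i in range(1, n + 1):
        longest = ordered[i - 1]          # sorted, so the last item is longest
        for j in range(i - 1, -1, -1):    # candidate last batch: ordered[j:i]
            padded = longest * (i - j)
            if padded > max_batch_tokens:
                break                     # larger batches only get bigger
            cand = best[j] + padded + per_batch_overhead
            if cand < best[i]:
                best[i], prev[i] = cand, j

    # Walk back through prev[] to recover the chosen batch boundaries.
    batches: List[List[int]] = []
    i = n
    while i > 0:
        j = prev[i]
        batches.append(ordered[j:i])
        i = j
    batches.reverse()
    return int(best[n]), batches
```

With `plan_batches([10, 12, 100, 110], max_batch_tokens=256, per_batch_overhead=16)`, the short and long requests end up in separate batches, because mixing them would pad the short ones to the long length. Real schedulers add deadlines and parallelism degree as further state dimensions, which this sketch omits.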
5. Fault Tolerance, Online Scaling, and Production Considerations
Production OTAS deployments face unique demands for continuous availability, rapid scaling, and strong isolation.
- Process Group Elasticity and Fine-grained Fault Domains: MultiWorld introduces independently scalable and fault-isolated process groups (“worlds”) atop standard collective communication libraries (CCLs) such as NCCL. Worlds are created and torn down without full-system reinitialization; failures are isolated, and join times are on the order of tens of milliseconds with <5% throughput loss under high concurrency (Lee et al., 2024).
- Microservice Granularity and Autoscaling Orchestration: ElasticRec’s architecture naturally supports canary releases, live embedding updates, and online retraining, exploiting Kubernetes HPA for per-shard scaling. Per-shard QPS and utility feedback further enable capacity-aware fault recovery and resource minimization (Choi et al., 2024).
- Consistency and Write Coordination: Fine-grained synchronization of shared state (e.g., in-memory K/V stores for object state or versioned embeddings in RecSys) remains a limiting factor for throughput and latency in some designs. Mechanisms such as global LRU eviction, host–guest ballooning, and P2P asynchronous collective engines (as in TokenLake and eLLM) mitigate much of the bottleneck in practical settings (Wu et al., 24 Aug 2025, Xu et al., 18 Jun 2025).
OTAS solutions thus integrate elasticity with robust, rapid isolation and recovery, ensuring system availability and resource efficiency even under extreme dynamic workloads.
6. Performance Benchmarks and Comparative Results
Reported evaluations indicate that the systems surveyed here outperform prior static or coarse-grained approaches:
| System | Throughput / Efficiency Gain | SLO / Latency Improvement | Benchmarks / Baselines |
|---|---|---|---|
| ElasticRMI | Near-ideal agility (<2 nodes shortage/excess) | Sub-minute reaction, lower latency spikes | SPEC OSG, order router, pub/sub, consensus (Jayaram, 2019) |
| Autopoiesis | Up to 53% reduction in completion time | 0 ms hot-swap, auto-adapting | DistServe, HexGen, SpotServe (Jiang et al., 8 Apr 2026) |
| LoongServe | 3.0–3.8× throughput, ≥50% latency reduction | P90 goodput 1.5–2.3× higher | vLLM, SplitFuse, DistServe (Wu et al., 2024) |
| TokenLake | 2–5× goodput vs. baselines, 2× hit rate | Load CV 6–12× lower | SGLang-Router, MoonCake (Wu et al., 24 Aug 2025) |
| eLLM | Up to 2.3× decoding throughput, 3× batch | TTFT up to 295× lower | vLLM (Xu et al., 18 Jun 2025) |
| ElasticMM | 3.2–4.5× throughput, up to 4.2× TTFT reduction | 98–100% SLO compliance | vLLM (Liu et al., 14 Jul 2025) |
| ElasticRec | Avg 3.3× less memory, 8.1× higher utility | 1.6× lower deployment cost | Model-wise RecSys serving (Choi et al., 2024) |
Empirical results highlight the crucial role of fine-grained, policy-adaptive elasticity in sustaining both efficiency and service quality.
7. Design Trade-offs, Limitations, and Open Challenges
- Consistency vs. Overhead: Designs relying on distributed K/V stores for shared object state (e.g., HyperDex in ElasticRMI) must balance strong consistency against network overhead (Jayaram, 2019).
- Scalability of Matching and Pooling: Algorithms such as Hungarian matching in TokenLake are cubic in instance count, which can be a bottleneck at extreme scale; approximate algorithms or hierarchical dispatch may be warranted (Wu et al., 24 Aug 2025).
- Policy Complexity and Explainability: Self-evolving policy frameworks as in Autopoiesis introduce opaque control logic and require measures (timeouts, evaluators) to mitigate degenerate/hung states and ensure safe evolution (Jiang et al., 8 Apr 2026).
- SLO-aware Buffering and Phase Interference: Schemes using buffer ballooning (eLLM) or cross-pool capacity tuning (ElasticMM) require careful balance to avoid starvation or priority inversion under burst or multi-tenant regimes (Xu et al., 18 Jun 2025, Liu et al., 14 Jul 2025).
- Domain Specialization: While token adaptation and memory ballooning generalize, direct adaptation of techniques (e.g., ViT-specific merging strategies in OTAS) to LLM or RecSys requires further empirical study (Chen et al., 2024).
These trade-offs point to a need for continued refinement of resource abstraction, policy management, and statistical learning for dynamic environments; fairness, transparency, and multi-tenant isolation remain open questions.
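As an illustration of the approximate alternatives mentioned for cubic-time matching, the sketch below assigns batches to instances greedily by ascending cost. It is a generic heuristic, not a method from any cited paper; `greedy_assign` and the cost-matrix layout (`cost[b][i]` is the communication cost of placing batch `b` on instance `i`, with a rectangular matrix assumed) are hypothetical.

```python
from typing import Dict, List


def greedy_assign(cost: List[List[float]]) -> Dict[int, int]:
    """Map each batch to a distinct instance, cheapest pairs first.

    Runs in O(B * I * log(B * I)) for B batches and I instances, versus the
    cubic cost of an exact Hungarian solve, but may return a worse total.
    """
    n_batches = len(cost)
    n_instances = len(cost[0]) if cost else 0

    # Enumerate every (cost, batch, instance) pair and sort by cost.
    pairs = sorted(
        (cost[b][i], b, i)
        for b in range(n_batches)
        for i in range(n_instances)
    )

    assignment: Dict[int, int] = {}
    used_instances = set()
    for _, b, i in pairs:
        if b in assignment or i in used_instances:
            continue  # batch already placed or instance already taken
        assignment[b] = i
        used_instances.add(i)
    return assignment
```

For the matrix `[[1, 2], [1, 5]]` the greedy result pairs batch 0 with instance 0 and batch 1 with instance 1, for a total cost of 6, whereas the optimal assignment costs 3. That gap is the price paid for the lower running time, so a heuristic of this kind would need evaluation against the exact solver before being adopted.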
The accumulated evidence indicates that OTAS systems, grounded in dynamic, policy-driven scheduling, unified resource pooling, and fine-grained adaptivity, enable state-of-the-art throughput, utilization, and SLO compliance for transactional, inference, and large-scale data-serving applications. These designs are now foundational for new systems that require robust, elastic scaling amid evolving workloads and infrastructure uncertainty (Jayaram, 2019, Jiang et al., 8 Apr 2026, Wu et al., 2024, Wu et al., 24 Aug 2025, Xu et al., 18 Jun 2025, Liu et al., 14 Jul 2025, Lee et al., 2024, Chen et al., 2024, Choi et al., 2024).