- The paper presents a research agenda outlining cloud-native architectures to efficiently scale large language models while optimizing resource utilization.
- It proposes adaptive methods including elastic scheduling, energy-aware placement, and AI-driven orchestration to manage heterogeneous hardware and bursty workloads.
- It identifies challenges such as inter-service overhead, privacy, and standardization that must be addressed to support reliable LLM deployments.
Cloud-Native and Distributed Systems for Scalable LLMs
Motivation and Architectural Foundations
LLMs now underpin mission-critical AI workloads with unparalleled computational intensity and high-throughput operational requirements. The scale of transformer-based models (e.g., GPT-4, PaLM, LLaMA) means that training and inference consume substantial compute, memory, storage, and network resources in both cloud datacenters and at the distributed edge. This architectural shift diverges fundamentally from traditional ML deployments by requiring dynamic, multi-tenant management of heterogeneous clusters; static, monolithic cloud deployments no longer suffice.
The architectural paradigm for LLM system support integrates microservices, containerization, elastic orchestration, and runtime resource accounting. This cloud-native stack, spanning diverse hardware (GPUs, TPUs, NPUs), distributed scheduling substrates, and scalable runtime intermediaries, enables both fine-grained modularity and workload phase-specific optimizations. Atop this infrastructure, LLMs are deployed via standardized APIs that ensure integration with downstream applications, allowing the infrastructure to mask internal complexity and present robust, scalable services.
Figure 1: Hierarchical stack for LLM deployment detailing hardware, orchestration, resource management, and interface layers.
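To make the interface layer concrete, the following is a minimal sketch of a standardized completion endpoint of the kind such a stack might expose to downstream applications. The route, request schema, and `generate()` stub are illustrative assumptions rather than the paper's interface, and the handler uses only the Python standard library.

```python
# Minimal sketch of a standardized completion endpoint an LLM serving layer
# might expose to downstream applications. Route, schema, and generate()
# stub are illustrative assumptions, not the paper's interface.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate(prompt: str, max_tokens: int) -> str:
    """Placeholder for the actual model runtime behind the API."""
    return prompt[:max_tokens]  # echo stub


class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        req = json.loads(self.rfile.read(length) or b"{}")
        text = generate(req.get("prompt", ""), int(req.get("max_tokens", 64)))
        body = json.dumps({"choices": [{"text": text}]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CompletionHandler).serve_forever()
```

The point of the sketch is the abstraction boundary: downstream callers see only a stable HTTP contract, while scheduling, placement, and scaling decisions remain hidden inside the stack.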
Systemic and Operational Challenges
LLMs fundamentally alter system assumptions through bursty, user-driven inference, highly variable context lengths, multi-modal retrieval pipelines, and stringent service-level requirements (latency, throughput, SLOs). The paper identifies several core challenges, including inter-service communication overhead, privacy, and the lack of standardization across the serving stack.
Resource Management and Optimization Mechanisms
The authors systematize resource management into four pivotal strategies, each illustrated with a short sketch following Figure 3:
- Elastic Scheduling for Heterogeneous Hardware: Contemporary serving systems implement prefill/decode phase separation, disaggregation, and specialized hardware allocation to extract peak throughput, while adaptive autoscaling (e.g., coordinated pool balancing, token-velocity-driven scaling) maintains SLOs amidst variable demand.
- Energy- and Carbon-Aware Placement: Dynamic runtime scheduling now integrates region-level carbon telemetry, energy metrics, and cluster-level power efficiency into allocation logic, aiming to minimize emissions (e.g., via time-shifting, cooling-aware placement).
- Adaptive QoS in Multi-Tenant Contexts: Workload classification enables differentiated handling—interactive, batch, streaming, and tool-augmented inference requests—while fairness-aware algorithms (e.g., token-level VTC, quota-based isolation) prevent saturation from heavy tenants and improve utilization under dynamic concurrency.
- AI/DRL-Driven Orchestration: Reinforcement learning (e.g., AWARE, DRPC) and AI-augmented control automate scaling, placement, and real-time adaptation by ingesting fine-grained telemetry, with hybrid control guardrails ensuring reliable deployment.
Figure 3: Taxonomy of resource management and optimization strategies central to modern LLM serving and infrastructure orchestration.
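A minimal sketch of token-velocity-driven autoscaling for a decode pool, assuming a simple per-replica capacity model; the thresholds and constants are illustrative assumptions, not values from the paper.

```python
# Sketch of token-velocity-driven autoscaling (assumed thresholds and replica
# model). Scales a decode pool out when observed tokens/sec per replica
# approaches its SLO-safe capacity, and in when the pool is under-utilized.
from dataclasses import dataclass


@dataclass
class PoolState:
    replicas: int            # current decode replicas
    tokens_per_sec: float    # aggregate generated tokens/sec (telemetry)


def target_replicas(state: PoolState,
                    per_replica_capacity: float = 2_000.0,  # tokens/sec one replica sustains within SLO
                    high_water: float = 0.85,
                    low_water: float = 0.40,
                    min_replicas: int = 1,
                    max_replicas: int = 64) -> int:
    """Return the desired replica count for the next control interval."""
    utilization = state.tokens_per_sec / (state.replicas * per_replica_capacity)
    if utilization > high_water:      # scale out before SLOs are violated
        desired = state.replicas + max(1, round(state.replicas * 0.5))
    elif utilization < low_water:     # scale in to reclaim accelerators
        desired = state.replicas - 1
    else:
        desired = state.replicas
    return max(min_replicas, min(max_replicas, desired))


# Example: 8 replicas pushing 14,500 tokens/sec -> utilization ~0.91 -> scale out.
print(target_replicas(PoolState(replicas=8, tokens_per_sec=14_500)))
```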
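A sketch of carbon- and energy-aware placement scoring; the weights, metrics, and regions are illustrative assumptions used only to show how carbon telemetry and power efficiency can enter allocation logic.

```python
# Sketch of carbon- and energy-aware placement: candidate regions are ranked
# by a weighted combination of carbon intensity, power usage effectiveness
# (PUE), and network proximity. Weights and region values are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    carbon_intensity: float  # normalized grid carbon intensity
    pue: float               # normalized power usage effectiveness
    latency: float           # normalized network distance to the caller


def placement_score(r: Region, w_carbon: float = 0.5,
                    w_pue: float = 0.3, w_latency: float = 0.2) -> float:
    """Lower is better; inputs are assumed pre-normalized to [0, 1]."""
    return w_carbon * r.carbon_intensity + w_pue * r.pue + w_latency * r.latency


def pick_region(regions: List[Region]) -> Region:
    return min(regions, key=placement_score)


regions = [
    Region("eu-north", carbon_intensity=0.10, pue=0.20, latency=0.60),
    Region("us-east",  carbon_intensity=0.55, pue=0.35, latency=0.20),
]
print(pick_region(regions).name)  # "eu-north": cleaner energy outweighs the extra latency here
```

Time-shifting batch training jobs can be expressed with the same scoring idea, evaluated over forecasted rather than current carbon intensity.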
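A sketch of token-level fair scheduling in the spirit of a virtual token counter (VTC): the backlogged tenant that has consumed the fewest weighted tokens is served next. Tenant weights, the queueing model, and the counter update are simplified assumptions.

```python
# Sketch of VTC-style token-level fairness (simplified assumptions, not the
# exact published mechanism): heavy tenants accumulate larger counters and
# therefore yield to lighter tenants under contention.
from collections import defaultdict, deque


class FairTokenScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)    # tenant -> pending requests
        self.counters = defaultdict(float)  # tenant -> weighted tokens served

    def submit(self, tenant: str, request_id: str, est_tokens: int):
        self.queues[tenant].append((request_id, est_tokens))

    def next_request(self, weights=None):
        """Pick the backlogged tenant with the smallest virtual token counter."""
        weights = weights or {}
        candidates = [t for t, q in self.queues.items() if q]
        if not candidates:
            return None
        tenant = min(candidates, key=lambda t: self.counters[t])
        request_id, est_tokens = self.queues[tenant].popleft()
        # Heavy tenants fall behind in priority as their counter grows.
        self.counters[tenant] += est_tokens / weights.get(tenant, 1.0)
        return tenant, request_id


sched = FairTokenScheduler()
sched.submit("tenant-a", "r1", est_tokens=4000)  # heavy tenant
sched.submit("tenant-b", "r2", est_tokens=200)   # light tenant
sched.submit("tenant-a", "r3", est_tokens=4000)
print(sched.next_request())  # counters tie at 0, so the first submitter is served
print(sched.next_request())  # the other tenant is served next, despite the heavy backlog
```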
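A sketch of learned scaling decisions wrapped in a hybrid guardrail, in the spirit of RL-driven orchestration; the policy here is a stand-in heuristic with exploration noise, not AWARE or DRPC, and the telemetry fields and bounds are assumptions.

```python
# Sketch of RL-style orchestration with a hybrid control guardrail: a learned
# policy proposes a replica delta from telemetry, and a rule-based guardrail
# clamps the step size and range so the deployment stays safe.
import random


def learned_policy(telemetry: dict) -> int:
    """Stand-in for a trained RL policy: returns a proposed replica delta."""
    # A real policy would map queue depth, token velocity, and latency
    # percentiles to an action; here we use a trivial heuristic plus noise.
    delta = 1 if telemetry["p99_latency_ms"] > telemetry["slo_ms"] else -1
    return delta + random.choice([-1, 0, 1])  # exploration noise


def guardrail(current: int, delta: int, min_r: int = 1, max_r: int = 32,
              max_step: int = 4) -> int:
    """Rule-based safety layer: bound both the step size and the replica range."""
    delta = max(-max_step, min(max_step, delta))
    return max(min_r, min(max_r, current + delta))


telemetry = {"p99_latency_ms": 820.0, "slo_ms": 500.0}
replicas = 8
replicas = guardrail(replicas, learned_policy(telemetry))
print(replicas)  # a bounded scale-out proposal in [8, 10]
```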
Emerging Trends: Federated, Serverless, and Decentralized LLMs
Several architectural and algorithmic frontiers are identified, spanning federated, serverless, and decentralized approaches to LLM training and serving.
Future Directions and Research Frontiers
Key forward-looking research areas include hardware acceleration, adaptive orchestration, federated and decentralized training, self-managing clouds, and system-wide standardization.
Conclusion
LLMs demand a departure from legacy ML/DL infrastructure, necessitating LLM-aware cloud-native and distributed system architectures that co-optimize resource utilization, system reliability, energy sustainability, privacy, and developer/tenant abstraction. The research agenda synthesizes a layered vision—spanning hardware acceleration, adaptive orchestration, federated training, self-managing clouds, and system-wide standardization—needed to ensure LLMs remain both scalable and operationally viable as foundation models proliferate in research, enterprise, and edge domains (2604.17227).