TD3-Sched: RL for Cloud-Edge Scheduling
- TD3-Sched is a distributed reinforcement learning framework that dynamically allocates CPU and memory for containerized workloads in heterogeneous cloud-edge infrastructures.
- It leverages the TD3 algorithm with twin critics and target policy smoothing to minimize response latency, maximize resource utilization, and enforce SLOs.
- Empirical evaluations on a Kubernetes testbed using production-scale traces demonstrate significant latency reductions and improved SLO compliance compared to DQN, DDPG, and standard schedulers.
TD3-Sched is a distributed reinforcement learning–based resource scheduling framework tailored for orchestrating containerized workloads in heterogeneous cloud-edge environments. It employs the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to enable continuous and multi-dimensional control of CPU and memory resources, with the explicit objective of minimizing response latency, maximizing resource utilization, and enforcing Service Level Objectives (SLOs) in the presence of highly dynamic workloads and limited edge resources (Song et al., 23 Sep 2025). The system is evaluated on a realistic testbed integrating Kubernetes-based orchestration and production-scale traces, showing marked improvements in latency, SLO compliance, and convergence characteristics compared to both traditional and recent RL-based baselines.
1. Motivation and Problem Setting
Resource scheduling in cloud-edge infrastructures poses non-trivial challenges due to the asymmetric resource availability between cloud (resource-rich, higher latency) and edge (resource-constrained, latency-sensitive) nodes. Existing centralized schedulers often encounter bottlenecks that lead to degraded user experience, especially for latency-critical microservices. TD3-Sched frames the scheduling problem as a distributed, continuous-control task in which each node must dynamically provision CPU and memory allocations for multiple containerized services such that overall system latency is minimized and SLOs are met. The solution explicitly targets scenarios where workload demand is volatile, and resource and network states fluctuate rapidly.
2. Algorithmic Architecture
TD3-Sched leverages the following architectural components:
- State Space: At each scheduling decision point, the observed system state for $N$ services is a $4N$-dimensional vector
$$s_t = \big[u^{\mathrm{cpu}}_1, u^{\mathrm{mem}}_1, l_1, q_1, \ldots, u^{\mathrm{cpu}}_N, u^{\mathrm{mem}}_N, l_N, q_N\big],$$
where $u^{\mathrm{cpu}}_i$ and $u^{\mathrm{mem}}_i$ are per-service CPU and memory utilizations, $l_i$ is response latency, and $q_i$ is queries per second (QPS), with all entries normalized to $[0, 1]$.
- Action Space: The action is a continuous $2N$-dimensional vector encapsulating resource allocations
$$a_t = \big[c_1, m_1, \ldots, c_N, m_N\big],$$
where $c_i$ and $m_i$ are the CPU and memory allocations assigned to service $i$.
- Reward Function: The reward balances multiple objectives and is computed as a weighted combination
$$r_t = -w_1\, P_{\mathrm{lat}} - w_2\, P_{\mathrm{res}} + w_3\, R_{\mathrm{SLO}} - w_4\, P_{\mathrm{mig}},$$
where $P_{\mathrm{lat}}$ penalizes latency overshoot, $P_{\mathrm{res}}$ penalizes resource waste, $R_{\mathrm{SLO}}$ rewards SLO satisfaction, and $P_{\mathrm{mig}}$ penalizes excessive migration. The coefficients $w_1, \ldots, w_4$ are chosen to heavily favor minimizing latency (see the first sketch after this list).
- TD3 Core Mechanisms:
- Twin Q-networks $Q_{\theta_1}$ and $Q_{\theta_2}$ provide clipped double Q-learning to reduce estimation bias.
- An actor network $\pi_{\phi}$ computes the continuous resource allocations.
- Target policy smoothing adds clipped noise to the target action:
$$\tilde{a} = \pi_{\phi'}(s') + \epsilon, \qquad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c, c\big),$$
with noise scale $\sigma$ and clipping bound $c$.
- TD target:
$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\big(s', \tilde{a}\big).$$
- Detached target critics and delayed, less frequent actor updates are used for gradient stability (see the second sketch after this list).
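To make the formulation concrete, the following is a minimal sketch of how the state vector and reward described above could be assembled from monitored metrics. The weight values and the exact shapes of the penalty terms are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def build_state(cpu_util, mem_util, latency, qps, lat_max, qps_max):
    """Assemble the 4N-dimensional state: per-service CPU, memory, latency, and
    QPS, interleaved per service and normalized to [0, 1]. Inputs are length-N
    arrays of raw metrics; lat_max and qps_max are normalization constants."""
    features = np.stack([
        np.clip(cpu_util, 0.0, 1.0),
        np.clip(mem_util, 0.0, 1.0),
        np.clip(latency / lat_max, 0.0, 1.0),
        np.clip(qps / qps_max, 0.0, 1.0),
    ], axis=1)                                       # shape (N, 4)
    return features.reshape(-1).astype(np.float32)   # shape (4N,)

def reward(latency, slo_ms, cpu_alloc, cpu_used, migrations,
           w_lat=1.0, w_res=0.2, w_slo=0.5, w_mig=0.1):
    """Weighted multi-objective reward: penalize latency overshoot and resource
    waste, reward SLO satisfaction, penalize migrations. Weights and term shapes
    are placeholders chosen to emphasize latency, mirroring the stated priority."""
    p_lat = np.mean(np.maximum(latency - slo_ms, 0.0) / slo_ms)  # latency overshoot
    p_res = np.mean(np.maximum(cpu_alloc - cpu_used, 0.0))       # over-provisioned CPU
    r_slo = np.mean(latency <= slo_ms)                           # fraction of services meeting SLO
    p_mig = float(migrations)                                    # migration count
    return -w_lat * p_lat - w_res * p_res + w_slo * r_slo - w_mig * p_mig
```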
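The per-step learning update can be summarized as follows. This is a minimal PyTorch sketch of the TD3 update, assuming generic `actor`, `critic1`, and `critic2` networks (with target copies), separate actor/critic optimizers, and a replay-buffer batch; the hyperparameter values shown are common TD3 defaults, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_target, critic1, critic2,
               critic1_target, critic2_target, actor_opt, critic_opt,
               step, gamma=0.99, sigma=0.2, noise_clip=0.5,
               policy_delay=2, tau=0.005):
    """One TD3 step: clipped double-Q target, delayed actor and target updates."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(a) * sigma).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(0.0, 1.0)  # allocations kept in [0, 1]

        # Clipped double Q-learning: take the minimum of the two target critics.
        q_next = torch.min(critic1_target(s_next, a_next),
                           critic2_target(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_next  # TD target

    # Both critics regress toward the shared target y.
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed, less frequent actor and target-network updates.
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Polyak averaging of the target networks.
        for net, tgt in ((actor, actor_target), (critic1, critic1_target),
                         (critic2, critic2_target)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```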
3. System Implementation
TD3-Sched is integrated atop a Kubernetes-based cloud-edge testbed consisting of eight nodes with explicit edge/cloud designations. Key implementation points include:
- Resource Orchestration: Direct interfacing with Kubernetes APIs for dynamic, per-container resource allocation (sketched after this list).
- Monitoring Infrastructure: Continuous collection of utilization, latency, and QPS metrics via Prometheus and Node Exporter.
- Workload Realism: Replaying Alibaba trace data and deploying the Sock Shop microservices application ensure realistic workload, resource, and latency patterns.
- Training Regime: An offline training phase using historical data establishes initial policy, which is continuously refined online during actual system operation.
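As an illustration of this orchestration loop, the sketch below applies an agent's allocation via the Kubernetes Python client and pulls state metrics from Prometheus' HTTP API. The namespace handling, service names, and PromQL expression are hypothetical placeholders, not values from the paper.

```python
import requests
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

def apply_allocation(deployment, namespace, cpu_cores, mem_mib):
    """Patch a deployment's container resource requests/limits (illustrative;
    assumes the container is named after its deployment)."""
    body = {"spec": {"template": {"spec": {"containers": [{
        "name": deployment,
        "resources": {
            "requests": {"cpu": f"{int(cpu_cores * 1000)}m", "memory": f"{int(mem_mib)}Mi"},
            "limits":   {"cpu": f"{int(cpu_cores * 1000)}m", "memory": f"{int(mem_mib)}Mi"},
        }}]}}}}
    apps.patch_namespaced_deployment(deployment, namespace, body)

def query_prometheus(prom_url, promql):
    """Fetch an instant vector from the Prometheus HTTP API."""
    resp = requests.get(f"{prom_url}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example: per-pod CPU usage rate for a Sock Shop front-end (placeholder query).
cpu_samples = query_prometheus(
    "http://prometheus:9090",
    'rate(container_cpu_usage_seconds_total{pod=~"front-end.*"}[1m])')
```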
4. Empirical Evaluation
TD3-Sched’s efficacy is validated under varying loads and compared against DQN, DDPG, and the standard Kubernetes scheduler (BaseK):
| Metric | TD3-Sched | DQN | DDPG | BaseK |
|---|---|---|---|---|
| Avg latency (100 req/s) | 75.2 ms | 91.6 ms | 106.4 ms | 122.5 ms |
| SLO violation rate (300 req/s) | 0.47% | (higher) | (higher) | (higher) |
- Latency: TD3-Sched reduces latency by 17.9%–38.6% under low-to-moderate load and by 16%–31.6% under stress.
- Resource Efficiency: The system smooths CPU/memory usage across services and nodes, demonstrating less waste than the baselines.
- SLO Compliance: TD3-Sched maintains a very low rate of latency SLO violations even when the system is saturated.
- Learning Dynamics: Learning curves show faster convergence and lower variance, attributable to the twin-critic structure and the reward’s strong emphasis on latency reduction.
5. Design Analysis and Trade-offs
- The continuous action space addresses the need for fine-grained, real-time adaptation in resource allocation, which is not supported by rule-based or discretized approaches.
- Dual critics and policy smoothing in TD3 mitigate the risk of suboptimal policy updates and overestimation pitfalls, supporting more rapid and stable learning in dynamic operational contexts.
- The reward function’s composition balances several objectives; however, its weights require tuning to reflect deployment-specific trade-offs (e.g., penalizing migration in edge settings).
- The distributed nature of TD3-Sched’s policy nodes enables resource allocation to be tailored locally, reducing response times and improving scalability.
6. Integration and Extensibility
TD3-Sched is designed for seamless deployment with standard Kubernetes infrastructures. This facilitates:
- Near real-time resource orchestration for container-based applications,
- Scale-out to large clusters by virtue of distributed policy learning,
- Flexibility to integrate new resource types (e.g., GPUs, network bandwidth) or to support multi-tenant, service-specific constraints via modular policy extension.
A plausible implication is that further gains could be realized by extending the state and reward design to explicitly incorporate network-level constraints or finer workload heterogeneity.
7. Directions for Future Research
The authors propose several areas for extension:
- Adaptive tuning of reward weights for dynamic environments,
- Support for multi-tenant systems with differentiated QoS/SLO requirements,
- Extension to multidimensional resources and hardware accelerators,
- Advanced online learning protocols to enable rapid policy adaptation in real deployments,
- Large-scale evaluations in practical distributed cloud-edge production environments.
These directions reflect the need to refine the scheduler for generalizability, robustness, and integration with diverse resource types and evolving application classes.
TD3-Sched demonstrates that distributed continuous-control reinforcement learning, when coupled with carefully engineered system-state and action representations and directly integrated into production orchestration frameworks, can substantially improve latency, SLO compliance, and resource utilization for containerized microservices across heterogeneous cloud-edge architectures (Song et al., 23 Sep 2025).