Timely-RL: Real-Time Adaptive Reinforcement Learning
- Timely-RL is a reinforcement learning paradigm that optimizes agent behavior under strict, real-time wall-clock constraints.
- It employs a two-stage training process—SFT followed by RL with wall-clock reward shaping—to learn deadline-sensitive action planning.
- The approach dynamically adapts to variable latencies, enhancing tool-use strategies and on-time performance across diverse tasks.
A Timely-RL system is a reinforcement learning (RL) methodology designed to enable agentic systems—such as LLMs with tool-use capabilities—to operate optimally under hard real-time constraints. Unlike standard RL protocols for reasoning agents, Timely-RL is constructed to directly train agents for wall-clock time-awareness, strategic adaptation to latency, and deadline-sensitive action planning. The paradigm shifts the definition of “test-time” from generation length or FLOPs to true elapsed time, demarcating a class of RL agents that maximize task performance subject to dynamic, unpredictable external delays.
1. Wall-Clock Test-Time Formalization
Timely-RL begins with a formal redefinition of test-time in agentic environments where frequent tool calls create variable, exogenous latency. The cumulative elapsed time over $N$ reasoning steps is modeled as

$$T_{\text{total}} = \sum_{i=1}^{N} \left( t^{(i)}_{\text{gen}} + t^{(i)}_{\text{tool}} \right),$$

where $t^{(i)}_{\text{gen}}$ denotes in-model generation time and $t^{(i)}_{\text{tool}}$ is tool or environment latency at step $i$ (Ma et al., 23 Jan 2026).
Agents are evaluated under a fixed budget $B$ such that generating the answer after $T_{\text{total}} > B$ constitutes a failure, regardless of token count. Adaptive policy optimization must thus ingest real-time latency feedback to decide not only “what” to generate, but “when” and “how long” to engage external modules.
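The budget bookkeeping above can be sketched as a small helper; the class name `WallClockBudget` and its methods are illustrative, not from the paper:

```python
import time

class WallClockBudget:
    """Tracks cumulative elapsed wall-clock time against a fixed budget B."""

    def __init__(self, budget_s: float):
        self.budget_s = budget_s
        self.start = time.monotonic()

    def elapsed(self) -> float:
        # True elapsed seconds, covering generation AND external tool latency.
        return time.monotonic() - self.start

    def remaining(self) -> float:
        return self.budget_s - self.elapsed()

    def expired(self) -> bool:
        # Answering after the budget counts as failure, regardless of tokens.
        return self.remaining() <= 0.0
```

A policy can consult `remaining()` before committing to a slow tool call, which is exactly the “when and how long” decision the formalization requires.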
2. Timely-Eval Benchmark and Regime Sensitivity
Timely-RL is empirically grounded in Timely-Eval, a benchmark suite that probes time-budget adaptation across heterogeneous task classes:
- Interactive Games: Jericho text adventures, synthetic latencies injected into tool calls (0, 2, 10, 50s).
- ML Tasks: MLEBench-Lite problems; code execution with variable wall-clock delays.
- General Reasoning: MATH, AIME, GPQA-Diamond under strict real-time cutoffs (Ma et al., 23 Jan 2026).
Three latency regimes are characterized by the ratio $\rho = t_{\text{tool}} / t_{\text{gen}}$:
- Tool-dominated ($\rho \gg 1$): Planning prioritizes fewer, higher-quality tool interactions.
- Generation-dominated ($\rho \ll 1$): Strategy shifts toward brevity and more frequent calls.
- Balanced ($\rho \approx 1$): Requires a dynamic depth-breadth trade-off.
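The regime split can be sketched as a simple classifier over $\rho$; the numeric thresholds `hi` and `lo` are illustrative assumptions, since the source only names the three regimes:

```python
def latency_regime(t_tool: float, t_gen: float,
                   hi: float = 5.0, lo: float = 0.2) -> str:
    """Classify a step by the ratio rho = t_tool / t_gen.

    Thresholds `hi`/`lo` are illustrative; the benchmark distinguishes
    tool-dominated, generation-dominated, and balanced regimes.
    """
    rho = t_tool / max(t_gen, 1e-9)  # guard against zero generation time
    if rho >= hi:
        return "tool-dominated"        # fewer, higher-quality tool calls
    if rho <= lo:
        return "generation-dominated"  # brevity, more frequent calls
    return "balanced"                  # dynamic depth-breadth trade-off
```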
3. Timely-RL Training Procedure
Timely-RL is a two-stage process:
3.1 Supervised Cold-Start Fine-Tuning (SFT)
An initial policy is distilled from time-budget-aware teacher traces (e.g., from Qwen3-235B-Instruct), using explicit wall-clock calls (get_duration) and step-wise reasoning sequences (~1M examples across diverse tasks). This injects a primitive awareness of time constraints and tool types into the base model.
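One plausible shape for such a trace is shown below; the source specifies only that traces contain explicit `get_duration` calls and step-wise reasoning, so the exact record format here is an assumption:

```python
# Illustrative shape of a single time-budget-aware SFT trace. The actual
# distillation format is not specified in the source; only the presence of
# explicit get_duration calls and step-wise reasoning is documented.
sft_trace = {
    "prompt": "Solve the task within a 60s wall-clock budget.",
    "steps": [
        {"role": "assistant", "type": "tool_call", "name": "get_duration"},
        {"role": "tool", "content": "elapsed: 3.2s of 60s"},
        {"role": "assistant", "type": "reasoning",
         "content": "Plenty of budget left; attempt a full derivation."},
        {"role": "assistant", "type": "answer", "content": "..."},
    ],
}
```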
3.2 RL with Wall-Clock Reward Shaping
Subsequent RL optimization utilizes a wall-clock reward signal

$$R = S_{\text{task}} + \mathbb{1}[T_{\text{total}} \le B]\, r_{\text{ontime}} + \lambda\, r_{\text{budget}}(T_{\text{total}}, B),$$

where $S_{\text{task}}$ represents task performance (e.g., normalized score, accuracy), $r_{\text{ontime}}$ is an on-time completion bonus, and $r_{\text{budget}}$ encourages full, but not overshot, utilization of the budget. The coefficient $\lambda$ mediates efficiency–performance trade-offs. This reward is maximized via policy-gradient updates derived from the VeRL framework, a PPO-style algorithm with clipped importance weights (Ma et al., 23 Jan 2026).
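The reward structure can be sketched as follows. The exact functional form of the budget-utilization term is an assumption; here it simply grows as elapsed time approaches the budget from below, and overshooting forfeits both the bonus and the utilization term:

```python
def shaped_reward(task_score: float, elapsed_s: float, budget_s: float,
                  ontime_bonus: float = 1.0, lam: float = 0.1) -> float:
    """Sketch of a wall-clock-shaped reward: task score, an on-time bonus,
    and a term encouraging full (but not overshot) budget utilization.
    The linear utilization term is an illustrative assumption."""
    reward = task_score
    if elapsed_s <= budget_s:  # on-time completion
        reward += ontime_bonus
        # Peaks as elapsed time approaches the budget from below;
        # answering after the budget earns neither bonus nor this term.
        reward += lam * (elapsed_s / budget_s)
    return reward
```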
4. Policy Optimization and Inference Flow
The agent’s state at each step incorporates:
- Current prompt and cumulative history (including prior actions and tool outputs)
- Per-turn latency feedback from external calls
- Internal LLM hidden state
Actions comprise reasoning token generation and/or tool invocation. At every turn, the agent receives feedback on elapsed time and must dynamically plan the depth and tempo of its reasoning steps. Inference proceeds until $T_{\text{total}} \ge B$ or a terminal answer token is output.
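The inference flow above can be sketched as a turn loop; `policy`, `run_tool`, and the action dictionary shape are illustrative stand-ins, not the paper's API:

```python
import time

def run_episode(policy, run_tool, budget_s: float, max_turns: int = 64):
    """Sketch of deadline-constrained inference: each turn the agent observes
    elapsed time, then reasons, invokes a tool, or emits a terminal answer."""
    start = time.monotonic()
    history = []  # cumulative prompt history: prior actions and tool outputs
    for _ in range(max_turns):
        elapsed = time.monotonic() - start
        if elapsed >= budget_s:
            return None  # missed deadline: failure regardless of progress
        action = policy(history, elapsed, budget_s)  # latency-aware decision
        if action["type"] == "answer":
            return action["text"]          # terminal answer token
        if action["type"] == "tool":
            history.append(("tool", run_tool(action["call"])))
        else:
            history.append(("reason", action["text"]))
    return None
```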
5. Empirical Properties and Advantages
Timely-RL systems exhibit several notable empirical behaviors:
- Latency-Sensitivity: Smaller, faster LLMs outperform larger ones when tool latency is low; at high latency, the largest model dominates due to superior answer quality per tool interaction.
- Adaptive Reasoning Lengths: Reasoning trace length increases monotonically with time budget for Timely-RL agents; prior models show little adaptation.
- On-Time Completion and Accuracy: Timely-RL-trained models consistently exceed SFT-only baselines on tight deadlines (e.g., on-time success rates rise 5–8% absolute on constrained MATH/GPQA/AIME benchmarks).
- Task-Specific Scaling: Interactive environments benefit linearly from additional time, whereas ML tasks saturate quickly.
6. Broader Implications, Open Limitations, and Future Directions
Timely-RL recasts test-time as the primary optimization resource for agentic LLM deployments where unpredictability of API or tool latency is the norm. Moving away from token-length metrics to elapsed seconds enables robust, deadline-focused reasoning and practical integration into real-world tool-chaining workflows. Key limitations include exclusive focus on text tasks, lack of multi-agent negotiation, and open questions about richer risk-reward balancing for high-stakes domains.
Potential future directions include:
- Extension to multimodal Timely-RL (vision, audio).
- Hierarchical subtask planning with nested time constraints.
- Adversarial latency robustness and uncertainty-aware dynamic budgeting (Ma et al., 23 Jan 2026).
Timely-RL thus defines a class of wall-clock-adaptive RL algorithms for deadline-constrained, agentic reasoning environments, underlining the centrality of time as the operative resource in next-generation interactive AI systems.