Timely-RL: Real-Time Adaptive Reinforcement Learning
- Timely-RL is a reinforcement learning paradigm that optimizes agent behavior under strict, real-time wall-clock constraints.
- It employs a two-stage training process—SFT followed by RL with wall-clock reward shaping—to learn deadline-sensitive action planning.
- The approach dynamically adapts to variable latencies, enhancing tool-use strategies and on-time performance across diverse tasks.
A Timely-RL system is a reinforcement learning (RL) methodology designed to enable agentic systems—such as LLMs with tool-use capabilities—to operate optimally under hard real-time constraints. Unlike standard RL protocols for reasoning agents, Timely-RL is constructed to directly train agents for wall-clock time-awareness, strategic adaptation to latency, and deadline-sensitive action planning. The paradigm shifts the definition of “test-time” from generation length or FLOPs to true elapsed time, demarcating a class of RL agents that maximize task performance subject to dynamic, unpredictable external delays.
1. Wall-Clock Test-Time Formalization
Timely-RL begins with a formal redefinition of test-time in agentic environments where frequent tool calls create variable, exogenous latency. The cumulative elapsed time over $N$ reasoning steps is modeled as

$$T_{\text{total}} = \sum_{i=1}^{N} \left( t^{(i)}_{\text{gen}} + t^{(i)}_{\text{tool}} \right),$$

where $t^{(i)}_{\text{gen}}$ denotes in-model generation time and $t^{(i)}_{\text{tool}}$ is tool or environment latency at step $i$ (Ma et al., 23 Jan 2026).
Agents are evaluated under a fixed budget $B$ such that generating the answer after $T_{\text{total}} > B$ constitutes a failure, regardless of token count. Adaptive policy optimization must thus ingest real-time latency feedback to decide not only “what” to generate, but “when” and “how long” to engage external modules.
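The budget bookkeeping above can be sketched as a small helper; the class name `WallClockBudget` and its methods are illustrative, not from the paper:

```python
import time

class WallClockBudget:
    """Tracks cumulative elapsed wall-clock time against a fixed budget B."""

    def __init__(self, budget_s: float):
        self.budget_s = budget_s
        self.start = time.monotonic()

    def elapsed(self) -> float:
        # True elapsed seconds, covering generation AND external tool latency.
        return time.monotonic() - self.start

    def remaining(self) -> float:
        return self.budget_s - self.elapsed()

    def expired(self) -> bool:
        # Answering after the budget counts as failure, regardless of tokens.
        return self.remaining() <= 0.0
```

A policy can consult `remaining()` before committing to a slow tool call, which is exactly the “when and how long” decision the formalization requires.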
2. Timely-Eval Benchmark and Regime Sensitivity
Timely-RL is empirically grounded in Timely-Eval, a benchmark suite that probes time-budget adaptation across heterogeneous task classes:
- Interactive Games: Jericho text adventures, synthetic latencies injected into tool calls (0, 2, 10, 50s).
- ML Tasks: MLEBench-Lite problems; code execution with variable wall-clock delays.
- General Reasoning: MATH, AIME, GPQA-Diamond under strict real-time cutoffs (Ma et al., 23 Jan 2026).
Three latency regimes are characterized by the ratio $\rho = t_{\text{tool}} / t_{\text{gen}}$:
- Tool-dominated ($\rho \gg 1$): Planning prioritizes fewer, higher-quality tool interactions.
- Generation-dominated ($\rho \ll 1$): Strategy shifts toward brevity and more frequent calls.
- Balanced ($\rho \approx 1$): Requires a dynamic depth-breadth trade-off.
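The regime split can be sketched as a simple classifier over $\rho$; the numeric thresholds `hi` and `lo` are illustrative assumptions, since the source only names the three regimes:

```python
def latency_regime(t_tool: float, t_gen: float,
                   hi: float = 5.0, lo: float = 0.2) -> str:
    """Classify a step by the ratio rho = t_tool / t_gen.

    Thresholds `hi`/`lo` are illustrative; the benchmark distinguishes
    tool-dominated, generation-dominated, and balanced regimes.
    """
    rho = t_tool / max(t_gen, 1e-9)  # guard against zero generation time
    if rho >= hi:
        return "tool-dominated"        # fewer, higher-quality tool calls
    if rho <= lo:
        return "generation-dominated"  # brevity, more frequent calls
    return "balanced"                  # dynamic depth-breadth trade-off
```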
3. Timely-RL Training Procedure
Timely-RL is a two-stage process:
3.1 Supervised Cold-Start Fine-Tuning (SFT)
An initial policy is distilled from time-budget-aware teacher traces (e.g., from Qwen3-235B-Instruct), using explicit wall-clock calls (get_duration) and step-wise reasoning sequences (~1M examples across diverse tasks). This injects a primitive awareness of time constraints and tool types into the base model.
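One plausible shape for such a trace is shown below; the source specifies only that traces contain explicit `get_duration` calls and step-wise reasoning, so the exact record format here is an assumption:

```python
# Illustrative shape of a single time-budget-aware SFT trace. The actual
# distillation format is not specified in the source; only the presence of
# explicit get_duration calls and step-wise reasoning is documented.
sft_trace = {
    "prompt": "Solve the task within a 60s wall-clock budget.",
    "steps": [
        {"role": "assistant", "type": "tool_call", "name": "get_duration"},
        {"role": "tool", "content": "elapsed: 3.2s of 60s"},
        {"role": "assistant", "type": "reasoning",
         "content": "Plenty of budget left; attempt a full derivation."},
        {"role": "assistant", "type": "answer", "content": "..."},
    ],
}
```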
3.2 RL with Wall-Clock Reward Shaping
Subsequent RL optimization utilizes a wall-clock reward signal

$$R = S_{\text{task}} + \mathbb{1}[T_{\text{total}} \le B]\, r_{\text{ontime}} + \lambda\, r_{\text{budget}}(T_{\text{total}}, B),$$

where $S_{\text{task}}$ represents task performance (e.g., normalized score, accuracy), $r_{\text{ontime}}$ is an on-time completion bonus, and $r_{\text{budget}}$ encourages full, but not overshot, utilization of the budget. The coefficient $\lambda$ mediates efficiency–performance trade-offs. This reward is maximized via policy-gradient updates derived from the VeRL framework, a PPO-style algorithm with clipped importance weights (Ma et al., 23 Jan 2026).
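The reward structure can be sketched as follows. The exact functional form of the budget-utilization term is an assumption; here it simply grows as elapsed time approaches the budget from below, and overshooting forfeits both the bonus and the utilization term:

```python
def shaped_reward(task_score: float, elapsed_s: float, budget_s: float,
                  ontime_bonus: float = 1.0, lam: float = 0.1) -> float:
    """Sketch of a wall-clock-shaped reward: task score, an on-time bonus,
    and a term encouraging full (but not overshot) budget utilization.
    The linear utilization term is an illustrative assumption."""
    reward = task_score
    if elapsed_s <= budget_s:  # on-time completion
        reward += ontime_bonus
        # Peaks as elapsed time approaches the budget from below;
        # answering after the budget earns neither bonus nor this term.
        reward += lam * (elapsed_s / budget_s)
    return reward
```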
4. Policy Optimization and Inference Flow
The agent’s state at each step incorporates:
- Current prompt and cumulative history (including prior actions and tool outputs)
- Per-turn latency feedback from external calls
- Internal LLM hidden state
Actions comprise reasoning token generation and/or tool invocation. At every turn, the agent receives feedback on elapsed time and must dynamically plan the depth and tempo of its reasoning steps. Inference proceeds until $T_{\text{total}} \ge B$ or a terminal answer token is output.
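The inference flow above can be sketched as a turn loop; `policy`, `run_tool`, and the action dictionary shape are illustrative stand-ins, not the paper's API:

```python
import time

def run_episode(policy, run_tool, budget_s: float, max_turns: int = 64):
    """Sketch of deadline-constrained inference: each turn the agent observes
    elapsed time, then reasons, invokes a tool, or emits a terminal answer."""
    start = time.monotonic()
    history = []  # cumulative prompt history: prior actions and tool outputs
    for _ in range(max_turns):
        elapsed = time.monotonic() - start
        if elapsed >= budget_s:
            return None  # missed deadline: failure regardless of progress
        action = policy(history, elapsed, budget_s)  # latency-aware decision
        if action["type"] == "answer":
            return action["text"]          # terminal answer token
        if action["type"] == "tool":
            history.append(("tool", run_tool(action["call"])))
        else:
            history.append(("reason", action["text"]))
    return None
```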
5. Empirical Properties and Advantages
Timely-RL systems exhibit several notable empirical behaviors:
- Latency-Sensitivity: Smaller, faster LLMs outperform larger ones when tool latency is low; at high latency, the largest model dominates due to superior answer quality per tool interaction.
- Adaptive Reasoning Lengths: Reasoning trace length increases monotonically with time budget for Timely-RL agents; prior models show little adaptation.
- On-Time Completion and Accuracy: Timely-RL-trained models consistently exceed SFT-only baselines on tight deadlines (e.g., on-time success rates rise 5–8% absolute on constrained MATH/GPQA/AIME benchmarks).
- Task-Specific Scaling: Interactive environments benefit linearly from additional time, whereas ML tasks saturate quickly.
6. Broader Implications, Open Limitations, and Future Directions
Timely-RL recasts test-time as the primary optimization resource for agentic LLM deployments where unpredictability of API or tool latency is the norm. Moving away from token-length metrics to elapsed seconds enables robust, deadline-focused reasoning and practical integration into real-world tool-chaining workflows. Key limitations include exclusive focus on text tasks, lack of multi-agent negotiation, and open questions about richer risk-reward balancing for high-stakes domains.
Potential future directions include:
- Extension to multimodal Timely-RL (vision, audio).
- Hierarchical subtask planning with nested time constraints.
- Adversarial latency robustness and uncertainty-aware dynamic budgeting (Ma et al., 23 Jan 2026).
Timely-RL thus defines a class of wall-clock-adaptive RL algorithms for deadline-constrained, agentic reasoning environments, underlining the centrality of time as the operative resource in next-generation interactive AI systems.