
Timely-RL: Real-Time Adaptive Reinforcement Learning

Updated 30 January 2026
  • Timely-RL is a reinforcement learning paradigm that optimizes agent behavior under strict, real-time wall-clock constraints.
  • It employs a two-stage training process—SFT followed by RL with wall-clock reward shaping—to learn deadline-sensitive action planning.
  • The approach dynamically adapts to variable latencies, enhancing tool-use strategies and on-time performance across diverse tasks.

A Timely-RL system is a reinforcement learning (RL) methodology designed to enable agentic systems—such as LLMs with tool-use capabilities—to operate optimally under hard real-time constraints. Unlike standard RL protocols for reasoning agents, Timely-RL is constructed to directly train agents for wall-clock time-awareness, strategic adaptation to latency, and deadline-sensitive action planning. The paradigm shifts the definition of “test-time” from generation length or FLOPs to true elapsed time, demarcating a class of RL agents that maximize task performance subject to dynamic, unpredictable external delays.

1. Wall-Clock Test-Time Formalization

Timely-RL begins with a formal redefinition of test-time in agentic environments where frequent tool calls create variable, exogenous latency. The cumulative elapsed time for $N$ reasoning steps is modeled as

$$t_\text{all} = \sum_{i=1}^{N} t_\text{gen}^{(i)} + \sum_{i=1}^{N} t_\text{tool}^{(i)},$$

where $t_\text{gen}^{(i)}$ denotes in-model generation time and $t_\text{tool}^{(i)}$ is tool or environment latency at step $i$ (Ma et al., 23 Jan 2026).

Agents are evaluated under a fixed budget $T_b$ such that generating the answer after $t_\text{all} > T_b$ constitutes a failure, regardless of token count. Adaptive policy optimization must thus ingest real-time latency feedback to decide not only “what” to generate, but “when” and “how long” to engage external modules.
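
The budget semantics above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names and the per-step timings are hypothetical.

```python
# Minimal sketch of the wall-clock test-time definition: each of the N steps
# contributes a generation time and a tool/environment latency, and an answer
# produced after t_all > T_b counts as a failure regardless of token count.

def total_elapsed(gen_times, tool_times):
    """t_all = sum of per-step generation times plus per-step tool latencies."""
    assert len(gen_times) == len(tool_times)
    return sum(gen_times) + sum(tool_times)

def within_budget(gen_times, tool_times, budget_s):
    """True iff the episode finishes on or before the budget T_b (seconds)."""
    return total_elapsed(gen_times, tool_times) <= budget_s

# Example: 3 steps; seconds per step for generation and tool calls
gen = [1, 1, 2]
tool = [2, 10, 2]
print(total_elapsed(gen, tool))      # 18
print(within_budget(gen, tool, 20))  # True
```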

2. Timely-Eval Benchmark and Regime Sensitivity

Timely-RL is empirically grounded in Timely-Eval, a benchmark suite that probes time-budget adaptation across heterogeneous task classes:

  • Interactive Games: Jericho text adventures, synthetic latencies injected into tool calls (0, 2, 10, 50s).
  • ML Tasks: MLEBench-Lite problems; code execution with variable wall-clock delays.
  • General Reasoning: MATH, AIME, GPQA-Diamond under strict real-time cutoffs (Ma et al., 23 Jan 2026).
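
The synthetic-latency injection used for the interactive-game setting can be sketched with a simple wrapper. The latency levels (0, 2, 10, 50 s) follow the benchmark description above; the wrapper itself is an illustrative assumption, not Timely-Eval's actual harness.

```python
import time

# Latency levels reported for Timely-Eval's Jericho setting, in seconds.
LATENCY_LEVELS_S = [0, 2, 10, 50]

def with_injected_latency(tool_fn, delay_s):
    """Wrap a tool so every call blocks for an extra fixed delay,
    simulating slow external environments or APIs."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s)
        return tool_fn(*args, **kwargs)
    return wrapped
```

A benchmark run would then wrap each environment tool once per latency level and measure the agent's on-time performance under each.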

Three latency regimes are characterized by the ratio $m_i = t_\text{tool}^{(i)} / t_\text{gen}^{(i)}$:

  • Tool-dominated ($m_i \gg 1$): Planning prioritizes fewer, higher-quality tool interactions.
  • Generation-dominated ($m_i \ll 1$): Strategy shifts toward brevity and more frequent calls.
  • Balanced ($m_i \approx 1$): Requires a dynamic depth-breadth trade-off.
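
The regime split can be expressed as a small classifier over $m_i$. The thresholds below are illustrative choices for "much greater/less than 1", not values from the paper.

```python
def latency_regime(t_tool, t_gen, hi=10.0, lo=0.1):
    """Classify a step by m_i = t_tool / t_gen.
    Thresholds hi/lo approximate m_i >> 1 and m_i << 1 (illustrative)."""
    m = t_tool / t_gen
    if m >= hi:
        return "tool-dominated"
    if m <= lo:
        return "generation-dominated"
    return "balanced"
```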

3. Timely-RL Training Procedure

Timely-RL is a two-stage process:

3.1 Supervised Cold-Start Fine-Tuning (SFT)

An initial policy is distilled from time-budget-aware teacher traces (e.g., Qwen3-235B-Instruct), using explicit wall-clock calls (get_duration) and step-wise reasoning sequences (~1M examples across diverse tasks). This injects primitive awareness of time constraints and tool typologies into the base model.
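
A get_duration-style tool could look like the sketch below. Only the tool name comes from the description above; the class, method signature, and return schema are assumptions for illustration.

```python
import time

class WallClock:
    """Hypothetical wall-clock tool an agent can query mid-episode."""

    def __init__(self, budget_s):
        self.start = time.monotonic()
        self.budget_s = budget_s

    def get_duration(self):
        """Report elapsed and remaining seconds against the budget."""
        elapsed = time.monotonic() - self.start
        return {"elapsed_s": elapsed,
                "remaining_s": max(self.budget_s - elapsed, 0.0)}
```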

3.2 RL with Wall-Clock Reward Shaping

Subsequent RL optimization utilizes a wall-clock reward signal:

$$R(t, r) = \begin{cases} 0 & t > T_{\max} \\ r_f & t \le T_{\max},\ r = 0 \\ r_f + r + \lambda U(t) & t \le T_{\max},\ r > 0 \end{cases}$$

where $r$ represents task performance (e.g., normalized score, accuracy), $r_f$ is an on-time completion bonus, and $U(t) = \sin\big((\pi/2)\min(t/T_{\max}, 1)\big)$ encourages full, but not overshot, utilization of the budget. The coefficient $\lambda$ mediates the efficiency-performance trade-off. This reward is maximized via policy gradient updates implemented in the VeRL framework, using a PPO-style algorithm with clipped importance weights (Ma et al., 23 Jan 2026).
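
The shaped reward transcribes directly into code. The default values for $r_f$ and $\lambda$ below are placeholders, not the paper's settings.

```python
import math

def shaped_reward(t, r, t_max, r_f=0.1, lam=0.05):
    """Wall-clock shaped reward R(t, r): zero past the deadline, an on-time
    bonus r_f, and a sine-shaped budget-utilization term U(t) added when
    the task reward r is positive."""
    if t > t_max:
        return 0.0
    if r == 0:
        return r_f
    u = math.sin((math.pi / 2) * min(t / t_max, 1.0))
    return r_f + r + lam * u
```

Note that $U(t)$ rises smoothly from 0 to 1 as $t$ approaches $T_{\max}$, so the agent is nudged to use the budget fully but never rewarded for exceeding it.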

4. Policy Optimization and Inference Flow

The agent’s state at each step incorporates:

  • Current prompt and cumulative history (including prior actions and tool outputs)
  • Per-turn latency feedback from external calls
  • Internal LLM hidden state

Actions comprise reasoning token generation and/or tool invocation. At every turn, the agent receives feedback on elapsed time and must dynamically plan the depth and tempo of its reasoning steps. Inference proceeds until $t_\text{all} \ge T_b$ or a terminal answer token is output.
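
The inference flow above can be sketched as the loop below. The step_policy callable, the action dictionary, and the timing bookkeeping are illustrative assumptions standing in for the trained agent and its environment.

```python
def run_episode(step_policy, budget_s):
    """Alternate reasoning/tool steps until the budget T_b is exhausted
    or a terminal answer is emitted."""
    elapsed, history = 0.0, []
    while elapsed < budget_s:
        # The policy sees the history and remaining budget, and returns an
        # action plus the wall-clock cost of that step (generation + tool).
        action, dt = step_policy(history, budget_s - elapsed)
        elapsed += dt
        history.append(action)
        if action.get("terminal"):
            return action["answer"], elapsed
    return None, elapsed  # deadline exceeded: counts as a failure
```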

5. Empirical Properties and Advantages

Timely-RL systems exhibit several notable empirical behaviors:

  • Latency-Sensitivity: Smaller, faster LLMs outperform larger ones when tool latency is low; at high latency, the largest model dominates due to superior answer quality per tool interaction.
  • Adaptive Reasoning Lengths: Reasoning trace length increases monotonically with time budget $T_b$ for Timely-RL agents; prior models show little adaptation.
  • On-Time Completion and Accuracy: Timely-RL-trained models consistently exceed SFT-only baselines on tight deadlines (e.g., on-time success rates rise 5–8% absolute on constrained MATH/GPQA/AIME benchmarks).
  • Task-Specific Scaling: Interactive environments benefit linearly from additional time, whereas ML tasks saturate quickly.

6. Broader Implications, Open Limitations, and Future Directions

Timely-RL recasts test-time as the primary optimization resource for agentic LLM deployments where unpredictability of API or tool latency is the norm. Moving away from token-length metrics to elapsed seconds enables robust, deadline-focused reasoning and practical integration into real-world tool-chaining workflows. Key limitations include exclusive focus on text tasks, lack of multi-agent negotiation, and open questions about richer risk-reward balancing for high-stakes domains.

Potential future directions include:

  • Extension to multimodal Timely-RL (vision, audio).
  • Hierarchical subtask planning with nested time constraints.
  • Adversarial latency robustness and uncertainty-aware dynamic budgeting (Ma et al., 23 Jan 2026).

Timely-RL thus defines a class of wall-clock-adaptive RL algorithms for deadline-constrained, agentic reasoning environments, underlining the centrality of time as the operative resource in next-generation interactive AI systems.

References (1)
