Real-Time Chunking (RTC)

Updated 8 December 2025

Real-Time Chunking (RTC) is a technique that partitions model outputs into overlapping segments to ensure temporally smooth and responsive control in streaming applications.
It employs asynchronous inference and masked inpainting to bridge action chunk boundaries, reducing delays and avoiding jerky transitions in tasks such as robotic manipulation and speech recognition.
Empirical evaluations show RTC improves task success rates (up to 94.1% in robotics), reduces word error rate in ASR by 10–13.9%, and lowers initial latency in text-to-speech.

Real-Time Chunking (RTC) is a class of inference-time and training-time scheduling algorithms designed to enable low-latency, temporally consistent streaming outputs from high-capacity models, particularly diffusion- and flow-based policies, in real-time physical control, speech, and generative applications. RTC originates in the context of vision–language–action (VLA) control for robotics, but extends across domains where the latency of model inference is non-negligible relative to the system’s control or output update interval. The core principle is to partition model outputs into overlapping or sequential “chunks” of actions, features, or predictions, and to schedule planning and execution asynchronously, mitigating pauses, boundary discontinuities, and feedback delays without retraining the base model (Black et al., 9 Jun 2025).

1. Conceptual Foundations and Problem Statement

In high-frequency control systems, especially those using large neural architectures, the inference latency $\delta$ often exceeds the system’s actuation interval $\Delta t$ , introducing an inference delay $d = \lfloor \delta / \Delta t \rfloor > 0$ . A naive “synchronous chunking” strategy, in which the policy emits $H$ -step action chunks at each decision point, executes the first $s$ actions, then waits for the next chunk, incurs periodic pauses and produces “jerky” action or perception transitions at chunk boundaries. This disrupts closed-loop control and can degrade task success or perceptual quality across diverse domains such as robotic manipulation, streaming speech recognition, and incremental text-to-speech (Black et al., 9 Jun 2025, Le et al., 21 Feb 2025, Du et al., 3 Jan 2024).

RTC addresses these challenges by decoupling chunk inference and execution: the next chunk is generated in parallel with the execution of the current chunk, overlapping their boundaries. Actions or outputs that must be executed before the next inference completes are “frozen,” while the remaining steps—and the challenging interface at the chunk boundary—are inpainted or recomputed to ensure temporal smoothness and reactivity.

2. Mathematical Formulation and Algorithmic Design

Consider a control policy $\pi(A_t|o_t)$ that, given observation $o_t$ at time $t$ , emits an action chunk $A_t = [a_t, a_{t+1},\dots,a_{t+H-1}]$ . The control loop proceeds at interval $\Delta t$ , but each model call incurs $\delta$ seconds of compute delay.

Given a chunk execution horizon $s \leq H$ and effective delay $d = \lfloor \delta/\Delta t\rfloor$ , the first $d$ actions of a chunk are irrevocably executed before the next chunk is available and must be preserved. Let $A_{\rm prev}$ denote the tail of the previous chunk to be stitched with the new one. RTC frames this overlap as a masked inpainting problem, introducing a soft binary mask $W \in [0,1]^H$ :

$W_i = 1$ for $i < d$ (actions to stay “frozen”)
$W_i = 0$ for $i \geq H-s$ (unaffected future actions)
$W_i$ decays smoothly for $d \leq i < H-s$

During each denoising/flow-matching step, a guided update pulls the current prediction toward the preserved actions on the masked entries, using a gradient-based correction:

$v_{\text{guided}}(A^\tau) = v_\pi(A^\tau, o, \tau) + \lambda(\tau) \left( A_{\mathrm{prev}} - \widehat{A}^1 \right)^\top \mathrm{diag}(W) \frac{\partial \widehat{A}^1}{\partial A^\tau}$

where $\lambda(\tau)$ clips the correction, and $\widehat{A}^1$ is the current prediction (see (Black et al., 9 Jun 2025) for precise update equations).

Background threading supports asynchronous operation: while the controller executes actions from $A_{\rm cur}$ , a separate inference process begins early, receives the most recent observation, and produces a new chunk via guided inpainting. The scheduling ensures compatibility at chunk boundaries and continuous, jitter-free execution.

Pseudocode for this process demonstrates the sharing of observation buffers, mutex-protected chunk updates, and the inpainting-based chunk generation in the inference loop (Black et al., 9 Jun 2025):

def GetAction(o_next):
    lock(mutex)
    t += 1
    o_cur = o_next
    notify(InferenceLoop)
    a = A_cur[t-1]
    unlock(mutex)
    return a

def InferenceLoop():
    lock(mutex)
    # Q: delay buffer
    while True:
        wait_until(t >= s_min)
        s = t
        A_prev = A_cur[s:]
        d = max(Q)
        o = o_cur
        unlock(mutex)
        A_new = GuidedInference(pi, o, A_prev, d, s)
        lock(mutex)
        A_cur = A_new
        t -= s
        Q.push(d)
        unlock(mutex)

3. Empirical Evaluations Across Domains

RTC has been validated across multiple settings:

Robotics: In 12 Kinetix-simulated dynamic manipulation and locomotion tasks, RTC achieved $94.1\%$ success without delay ( $d=0$ ) and retained $85.3\%$ at $d=4$ , outperforming baselines including Best-In-Delay (BID, $67.9\%$ at $d=4$ ) and temporal-ensemble methods (TE, $<40\%$ at $d=4$ ). On six real-world bimanual tasks using the $\Pi$ Zero5 VLA, RTC maintained throughput ( $\sim$ 0.046–0.048 substeps/s) and success ( $82\%\pm5\%$ in "lighting candle") even as delay increased to $d\approx16$ , while synchronous and TE alternatives degraded or failed (Black et al., 9 Jun 2025).
Speech Recognition: In streaming ASR, chunk-based RTC methods leveraging “time-shifted contextual attention” (TSCA) and dynamic right context masking reduce relative WER by $10$– $13.9\%$ on LibriSpeech with negligible effect on user-perceived latency. The model can exploit future context without explicit wait, by shifting and overlapping chunks during inference (Le et al., 21 Feb 2025).
Text-to-Speech: Incremental FastPitch applies chunked self-attention with cache-based receptive fields, achieving nearly identical MOS (4.178 vs. 4.185) to its parallel counterpart but with $4\times$ lower time-to-first-chunk and only slight increase in real-time factor (RTF), via streaming chunk emission and careful masking (Du et al., 3 Jan 2024).

4. Extensions: Training-Time Action Conditioning

A further efficiency arises by shifting inpainting from inference to training. Training-time RTC simulates the expected inference delay during supervised learning, directly conditioning on the action prefix rather than requiring pseudo-inverse guidance at deployment. This eliminates the need for backward vector–Jacobian products during rollouts. At inference, the model simply copies the committed prefix and predicts the postfix as in standard forward diffusion:

Empirically, training-time RTC matches or surpasses inference-time RTC at higher delays. For example, in robot box building at $d>2$ , training-time RTC achieves $75\pm2\%$ simulated task success at $d=4$ , compared to $50\pm3\%$ for inference-time RTC. Real-world latency drops from $135 \pm 5$  ms to $108 \pm 4$  ms (box building) with comparable or higher completion rates (Black et al., 5 Dec 2025).
Integration requires only minimal modification of existing code bases (per-token flow-matching, mask for loss computation, and prefix override).

This suggests that training-time action conditioning is, in many settings, a more computationally attractive and empirically robust approach for RTC than inpainting-only deployment (Black et al., 5 Dec 2025).

5. Domain-Generalization and Additional Applications

Streaming Speech: Time-shifted chunking with left-shifting and masking enables real-time ASR to exploit “future” context by cyclically recycling chunk suffixes as in-chunk “right context” for attention, while dynamic context masking during training exposes the model to stochastic look-ahead, closing the accuracy gap to non-streaming models (Le et al., 21 Feb 2025).
Text-to-Speech: Chunked transformers with cached receptive fields and strict masking (static or dynamic) deliver streaming audio with sub-50 ms initial latency and near-parity MOS to parallel models, enabling deployment in latency-sensitive applications such as voice assistants (Du et al., 3 Jan 2024).
Real-Time Embedded Systems: In real-time calculus (RTC) analysis of timed automata, chunking (granularity-based abstractions) enables multi-scale performance bounds: coarser-g chunking accelerates analysis at the cost of looser bounds, while multi-granularity causality closure recovers precision efficiently (Altisen et al., 2010).

Domain	RTC Mechanism	Measured Benefit
Robotic control (VLA)	Asynchronous inpainting,	$>80\%$ task success at high delays;
	soft boundary stitching	robust to $>200$  ms delays (Black et al., 9 Jun 2025)
Streaming ASR	TSCA, DRC masks on chunks	$10$– $13.9\%$ relative WER reduction
		at constant user-perceived latency (Le et al., 21 Feb 2025)
Text-to-Speech	Chunked self-attention	$4\times$ lower time-to-first-chunk,
	& fixed receptive field	MOS $=$ parallel baseline (Du et al., 3 Jan 2024)

6. Robustness, Limitations, and Practical Considerations

RTC is robust to arbitrarily large delays; its performance remains constant as $d$ grows (up to tested $d\approx16$ ), whereas synchronous chunking degrades linearly, and naive ensemble approaches can trigger oscillatory failures. Soft-masked inpainting is critical for avoiding discontinuities at chunk boundaries when $d$ is small, but introduces additional inference overhead (≈20 ms per pass). Training-time RTC eliminates this overhead but sacrifices the flexibility of soft overlap weighting.

Integration is model-agnostic: any pretrained diffusion or flow-matching policy with chunked output can be retrofitted with RTC scheduling without retraining (for inference-time RTC). Additional requirements include access to velocity fields, vector–Jacobian products for inpainting, and minimal threading infrastructure.

Limitations include:

Additional inference overhead for pseudo-inverse guidance (fully eliminated by training-time RTC).
Real-world validation remains centered on position-controlled tasks; high-dynamic regimes (legged robots, drones) require further paper.
Complexity of scheduling and memory management increases with chunk size and execution overlap.

7. Future Directions

Several avenues for extension and research include:

Combining RTC with model-distilled, single-step diffusion policies (consistency models) to further reduce inference overhead.
Extending RTC to hierarchical (“System 2/1”) architectures for multi-timescale asynchronous control.
Applying RTC to reinforcement-learning–trained chunking policies for direct reward-sensitive inpainting.
Systematic evaluation in safety-critical domains (e.g., autonomous vehicles) and higher-agility scenarios (e.g., quadrupedal locomotion, aerial swarms) (Black et al., 9 Jun 2025).

A plausible implication is that RTC, especially in its training-time form, offers a generic framework for low-latency, feedback-reactive control and streaming prediction across high-latency model classes, without architectural retraining or specialized tuning, provided mask-based conditioning and scheduling are feasible at deployment (Black et al., 5 Dec 2025).