Mixture of Horizons (MoH) in Robotic Action

Updated 1 December 2025
  • Mixture of Horizons (MoH) is a strategy that injects multi-scale temporal prediction into full-attention modules, balancing fine-grained control with long-term planning.
  • It partitions action chunks into overlapping horizons that are processed in parallel via a shared transformer, achieving efficiency with minimal computational overhead.
  • MoH employs cross-horizon consensus for adaptive inference, dynamically selecting the executed action prefix to boost throughput without sacrificing success rates in robotic tasks.

A Mixture of Horizons (MoH) is a plug-and-play strategy for injecting multi-scale temporal prediction into full-attention neural action modules, designed to address the inherent trade-off in vision-language-action (VLA) models between fine-grained temporal precision (short horizons) and global foresight (long horizons). Developed in the context of robotic manipulation, MoH dynamically partitions action chunks into overlapping horizons, processes these in parallel with a shared transformer, and fuses their predictions via a lightweight linear gate. This enables simultaneous exploitation of both short-term accuracy and long-term planning within a single architecture, with minimal added computational or memory overhead. MoH further supports dynamic inference through cross-horizon consensus, which adapts the executed action prefix based on inter-horizon agreement, thereby increasing throughput and robustness in both simulated and real-world robotic tasks (Jing et al., 24 Nov 2025).

1. Motivation: The Horizon Trade-Off in Action Chunking

In VLA models for robotic manipulation, action sequences are divided into fixed-length “chunks” for tractable prediction and execution. The horizon—the chunk length chosen for training and inference—critically affects model performance. Longer horizons offer better global foresight but often reduce fine-grained accuracy, as errors compound further into the future and local corrections become more difficult. Conversely, short horizons facilitate precise local control but limit the model’s ability to coordinate extended behaviors, resulting in poor performance on long-term tasks. Empirical results demonstrate that any fixed horizon represents a trade-off that is suboptimal for complex mixed-horizon tasks (Jing et al., 24 Nov 2025).

2. Core Method: Multi-Horizon Chunking and Parallel Transformer Design

Let the model at timestep $t$ predict an action chunk $A_t = (a_{t,1}, \ldots, a_{t,H}) \in \mathbb{R}^{H \times d_a}$, where $d_a$ is the action dimension and $H$ is the maximum chunk length. MoH introduces a set of “candidate horizons” $\mathcal{H} = \{h_1, h_2, \ldots, h_N\}$ such that $0 < h_1 < \cdots < h_N = H$. For each horizon $h \in \mathcal{H}$, the ground-truth chunk is truncated: $A_t^{(h)} = (a_{t,1}, \ldots, a_{t,h}) \in \mathbb{R}^{h \times d_a}$. Each chunk $A_t^{(h)}$ is zero-padded to length $H$ and masked so that positions $k > h$ are ignored in self-attention. This enables all horizon-augmented input variants to be processed in parallel by a shared action transformer head of depth $L$, leveraging tensorized computation and weight sharing for efficiency.

The VLM backbone encodes the multi-modal context—including multi-view images $V_t$, past history $h_{<t}$, language instruction $T$, and proprioceptive input $s_t$—once per step. The resulting representation then feeds into the parallel transformer, with time and positional embeddings optionally applied (flow matching: time embedding $\tau$; one-step regression: a learnable query token). For each horizon and step, the transformer outputs $Z_t^{(h)} \in \mathbb{R}^{h \times d}$, masked appropriately.
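
As a concrete illustration of the chunking scheme, the sketch below (NumPy; the shapes, names, and stand-in data are assumptions, not the paper’s code) builds the zero-padded chunks and attention masks for all candidate horizons so that the shared head can process them in one batch:

import numpy as np

H, d_a = 30, 7                       # max chunk length and action dimension (assumed)
horizons = list(range(3, 31, 3))     # candidate set {3, 6, ..., 30} from the ablations
A_t = np.random.randn(H, d_a)        # ground-truth chunk at timestep t (stand-in data)

chunks, masks = [], []
for h in horizons:
    A_h = np.zeros((H, d_a))
    A_h[:h] = A_t[:h]                # truncate to horizon h, then zero-pad to length H
    chunks.append(A_h)
    masks.append(np.arange(H) < h)   # positions k > h are ignored in self-attention

# Stacking along a horizon axis lets the shared transformer head process
# all N variants in one batched, weight-shared forward pass.
chunks = np.stack(chunks)            # (N, H, d_a)
masks = np.stack(masks)              # (N, H)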

3. Horizon Gating, Prediction, and Fusion

Above the shared transformer, a “gate head”—a single linear layer—computes unnormalized log-weights $g_{t,k,h}$ for each time step $k$ and horizon $h$. Softmax normalization is performed over the set $\{h \in \mathcal{H} : h \geq k\}$, ensuring that gating is only performed among the horizons active at each index:

$$\alpha_{t,k,h} = \frac{\exp(g_{t,k,h})}{\sum_{h' \in \mathcal{H} : k \leq h'} \exp(g_{t,k,h'})}$$

Horizon-specific action heads convert $Z_t^{(h)}$ into predictions $\hat{A}_t^{(h)} = (\hat{a}_{t,1}^{(h)}, \ldots, \hat{a}_{t,h}^{(h)})$. The fused prediction at step $k$ is:

$$\hat{a}_{t,k} = \sum_{h \in \mathcal{H} : k \leq h} \alpha_{t,k,h} \, \hat{a}_{t,k}^{(h)}$$

This fusion explicitly allows the model to arbitrate between predictions at varying temporal scales, increasing both accuracy and task generalization.
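
The gating and fusion steps can be sketched for a single timestep as follows (NumPy; the logits and predictions are random stand-ins, and the array layout is an assumption):

import numpy as np

H, d_a = 30, 7
horizons = np.arange(3, 31, 3)                # candidate set {3, 6, ..., 30}
N = len(horizons)
g = np.random.randn(H, N)                     # gate-head logits g_{t,k,h} (stand-in)
a_hat = np.random.randn(N, H, d_a)            # per-horizon predictions, padded to H

k = np.arange(1, H + 1)[:, None]              # step index k, 1-based
active = k <= horizons[None, :]               # (H, N): horizon h still covers step k
g = np.where(active, g, -np.inf)              # exclude inactive horizons from the softmax
alpha = np.exp(g - g.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)     # α_{t,k,h}: normalized over active h only

a_fused = np.einsum('kn,nkd->kd', alpha, a_hat)   # fused actions, shape (H, d_a)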

4. Adaptive Inference via Cross-Horizon Consensus

MoH supports a dynamic inference mode that adaptively selects how much of the predicted action chunk to execute, based on inter-horizon agreement. For each step $k$, the model computes the $\ell_1$ disagreement between the fused action and each horizon’s prediction, weighted by the corresponding gating value:

  • $H_k = \{ h \in \mathcal{H} : k \leq h \}$
  • $\bar{d}_k = \sum_{h \in H_k} \alpha_{t,k,h} \, \|\hat{a}_{t,k} - \hat{a}_{t,k}^{(h)}\|_1$

A threshold $\mathrm{thresh} = \mathrm{mean}(\bar{d}_1, \ldots, \bar{d}_n) \times r$ is defined, with $n$ a minimum prefix length and $r > 1$ a margin. The executed prefix $K_{\mathrm{exec}}$ is the longest $K$ such that for all $k \leq K$, $\bar{d}_k \leq \mathrm{thresh}$ and $|H_k| \geq m$. This scheme yields increased throughput (up to $2.5\times$ longer execution intervals with fewer replans) without sacrificing success rates (Jing et al., 24 Nov 2025).

Dynamic Inference Sketch

A runnable rendering of the consensus procedure (Python; the array layouts are assumptions, not the paper’s implementation):

import numpy as np

def select_prefix(a_fused, a_hat, alpha, horizons, n, m, r):
    # a_fused: (H, d_a) fused actions; a_hat: (N, H, d_a) per-horizon
    # predictions padded to length H; alpha: (H, N) gate weights.
    H = len(a_fused)
    active = lambda k: [i for i, h in enumerate(horizons) if k + 1 <= h]
    d_bar = np.array([sum(alpha[k, i] * np.abs(a_fused[k] - a_hat[i, k]).sum()
                          for i in active(k)) for k in range(H)])
    thresh = d_bar[:n].mean() * r        # margin r > 1 over the first n steps
    K_exec = n                           # always execute at least n steps
    for k in range(n, H):                # extend the prefix while consensus holds
        if len(active(k)) < m or d_bar[k] > thresh:
            break
        K_exec = k + 1
    return a_fused[:K_exec]              # actions â_1, ..., â_{K_exec}

5. Training Objective and Computational Overhead

MoH is designed for efficiency. The architecture adds only $2 \times N$ parameters for the gating head (with $N = |\mathcal{H}|$). The VLM prefix is computed once per step, while the full-attention transformer head (approximately 300M parameters) benefits from parallelism. GPU overhead is under 5% for training; inference is virtually latency-free because all horizons are masked and batched together.

The composite training objective is:

$$L = L_{\text{mix}} + \lambda_{\text{ind}} L_{\text{ind}} + \lambda_{\text{bal}} L_{\text{bal}}$$

where

  • $L_{\text{mix}}$: prediction loss on the fused actions $\hat{a}$, using either flow matching or regression/classification,
  • $L_{\text{ind}} = \sum_{h \in \mathcal{H}} L^{(h)}$: the sum of per-horizon losses,
  • $L_{\text{bal}}$: a coefficient-of-variation (CV) penalty promoting usage of all horizons, calculated interval-wise over the action chunk.

In experiments, $\lambda_{\text{ind}} = 1$ and $\lambda_{\text{bal}} = 10^{-3}$. Ablation confirms that the CV penalty prevents gating collapse to the longest horizon, maximizing the benefits of multi-horizon fusion.
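
As an illustration, the balance term can be sketched as a coefficient of variation over mean gate usage (an assumed formulation; the paper computes the penalty interval-wise over the chunk, which may differ in detail):

import numpy as np

def balance_loss(alpha, active):
    # alpha: (H, N) gate weights; active: (H, N) boolean mask of valid (k, h) pairs
    usage = (alpha * active).sum(axis=0) / active.sum(axis=0)  # mean weight per horizon
    return usage.std() / (usage.mean() + 1e-8)  # CV: small when usage is balanced

def total_loss(l_mix, l_ind, alpha, active, lam_ind=1.0, lam_bal=1e-3):
    # Composite objective L = L_mix + λ_ind · L_ind + λ_bal · L_bal
    return l_mix + lam_ind * l_ind + lam_bal * balance_loss(alpha, active)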

6. Empirical Evaluation and Impact

MoH was evaluated on the LIBERO benchmark—comprising the Spatial, Object, Goal, and Long suites—as well as the RoboTwin2.0 bimanual manipulation benchmark and three real-robot tasks. Performance metrics demonstrate consistent and significant improvements when MoH is added to action-chunking policies. For example, on LIBERO, MoH with the flow policy $\pi_{0.5}$ attains a new state of the art (99.0% average success rate) after 30,000 iterations. Table 1 summarizes the main results:

Table 1: Success rates (%) on LIBERO.

Method                    | Spatial | Object | Goal | Long | Avg
$\pi_{\text{reg}}$        | 97.8    | 98.2   | 94.6 | 90.2 | 95.2
$\pi_{\text{reg}}$ + MoH  | 99.0    | 98.8   | 96.4 | 91.4 | 96.4
$\pi_0$                   | 97.4    | 98.2   | 95.4 | 84.2 | 93.8
$\pi_0$ + MoH             | 97.6    | 98.8   | 96.4 | 87.4 | 95.1
$\pi_{0.5}$               | 98.8    | 99.0   | 97.6 | 95.4 | 97.7
$\pi_{0.5}$ + MoH         | 98.8    | 100.0  | 98.8 | 98.4 | 99.0

Ablation studies show diminishing gains beyond a moderate candidate set (e.g., $\{3, 6, \ldots, 30\}$) and confirm that the gating-balance regularizer is critical for preventing collapse to the longest horizon. On RoboTwin2.0, $\pi_0$ + MoH improves average success rates across both easy and hard bimanual tasks; in real-robot experiments, MoH consistently increases success rates over non-MoH baselines.

7. Broader Significance and Outlook

MoH resolves the widely observed trade-off in action-horizon selection for sequential prediction, in robotics and more broadly wherever multi-scale temporal modeling is critical. Its plug-and-play nature, negligible training and inference overhead, and empirical robustness across simulated and real tasks support its deployment as a generic upgrade for full-attention action modules. A plausible implication is that similar multi-horizon strategies could benefit other domains—such as planning, policy learning, or even natural language generation—where fixed temporal scales limit performance (Jing et al., 24 Nov 2025).
