
RoboSSM: Scalable Robotic Imitation Learning

Updated 28 September 2025
  • RoboSSM is a scalable in-context imitation learning approach for robot manipulation, leveraging state-space models to overcome Transformer limitations.
  • It employs the Longhorn SSM update with linear-time inference, enabling efficient processing of both few-shot demonstrations and long-horizon tasks without retraining.
  • Empirical results on the LIBERO benchmark show RoboSSM’s superior generalization and robustness, maintaining performance with increasing prompt lengths and temporal dilations.

RoboSSM is a scalable in-context imitation learning (ICIL) architecture for robot manipulation, based on state-space models (SSMs) rather than Transformer architectures. The central innovation is the replacement of Transformer-based prompt processing with Longhorn, a modern SSM that supports linear-time inference and strong extrapolation capabilities over long-horizon and increased-context settings. RoboSSM is specifically engineered for few-shot adaptation, in which a robot must generalize to novel tasks by conditioning on a short sequence (prompt) of demonstration trajectories, without any parameter updates at deployment time. Experiments on the LIBERO benchmark demonstrate that RoboSSM achieves highly competitive generalization to unseen tasks and remains robust as prompt lengths and task horizons scale.

1. Foundations and Motivation

RoboSSM directly addresses two limitations of ICIL approaches that employ Transformers: computational inefficiency and prompt-length sensitivity. Self-attention-based models exhibit $\mathcal{O}(N^2)$ time and memory complexity in the sequence length $N$, creating scalability bottlenecks for long prompts with high-frequency sensory observations. Empirically, Transformers trained on fixed prompt lengths tend to underperform, sometimes catastrophically, when faced with longer in-context sequences at test time, due to overfitting to a narrow context-length regime and an inability to extrapolate sequence structure.

RoboSSM replaces self-attention with a state-space model backbone, which decomposes sequence modeling into a recurrent update:

$$s_t = A_t \odot s_{t-1} + B_t$$

where $s_t$ is the hidden state at time $t$, $A_t$ and $B_t$ are parameterized update factors, and $\odot$ denotes the Hadamard (element-wise) product. In particular, the Longhorn model underpins RoboSSM, yielding both theoretical and empirical improvements in scaling, extrapolation, and robustness.
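To make the linear-time claim concrete, below is a minimal NumPy sketch of this diagonal recurrence; the scan touches each time step exactly once, so cost grows linearly in sequence length. Shapes and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ssm_scan(A, B, s0):
    """Run the recurrence s_t = A_t * s_{t-1} + B_t over a full sequence.

    A, B: (T, d) per-step update factors; s0: (d,) initial hidden state.
    Each step is an O(d) element-wise update, so the whole scan is O(T),
    in contrast to self-attention's O(T^2).
    """
    T, d = A.shape
    states = np.empty((T, d))
    s = s0
    for t in range(T):
        s = A[t] * s + B[t]   # Hadamard product with prior state, plus input term
        states[t] = s
    return states
```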

2. State-Space Model (SSM) Approach

The core of RoboSSM is the Longhorn SSM, which operates by evolving an internal state with a learned recurrent update inspired by online convex programming. Each step seeks a hidden state that minimizes a regularized objective:

$$s_t = \arg\min_s \left\{ \|s - s_{t-1}\|_F^2 + \|s k_t - x_t\|^2_{\mathrm{diag}(\beta_t)} \right\}$$

where $k_t$ is a key vector, $x_t$ is the input embedding at time $t$, and $\beta_t$ is a learned per-step weighting vector. The closed-form update enables Longhorn's forward pass to scale linearly in input length, with low memory requirements and amenability to batched execution.
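Because the objective is quadratic in $s$, each step admits a closed form. The sketch below derives it row-wise via the Sherman-Morrison identity; this is a reconstruction from the stated objective under assumed shapes, and the actual Longhorn implementation (e.g., any diagonal approximation used for parallel scans) may differ.

```python
import numpy as np

def longhorn_step(S_prev, k, x, beta):
    """Closed-form minimizer of ||S - S_prev||_F^2 + ||S k - x||^2_diag(beta).

    S_prev: (d, n) state matrix; k: (n,) key vector;
    x: (d,) input embedding; beta: (d,) per-dimension weights.
    Row i solves (I + beta_i k k^T) s_i = s_prev_i + beta_i x_i k,
    inverted cheaply with the Sherman-Morrison identity.
    """
    eps = beta / (1.0 + beta * (k @ k))   # (d,) effective per-row step sizes
    readout = S_prev @ k                  # (d,) current prediction S_prev k
    # Nudge each row toward reproducing x_i from key k, scaled by eps_i.
    return S_prev + np.outer(eps * (x - readout), k)
```

In this form, each row of the state moves toward reproducing the new input from the current key, with $\beta_t$ controlling how aggressively the state is overwritten versus retained.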

Longhorn’s recurrence form allows it to “remember” and integrate demonstration information distributed over long contexts—a critical property for imitation learning, where the number and length of demonstrations during conditioning may vary unpredictably at test time.

3. In-Context Imitation Learning Pipeline

RoboSSM is evaluated in a prompt-based setup: the model receives a prompt $\mathcal{P}$ containing $N$ demonstration trajectories (each a sequence of observation embeddings), then predicts the next action(s) for a given query trajectory, without any parameter updates. Unlike classical meta-learning or multi-task setups, in-context learning aims to conditionally induce a task via demonstration without any explicit retraining or fine-tuning, making the approach suitable for rapid, repeated adaptation in robotic settings.

During both training and inference, the demonstration prompt and the query trajectory are concatenated and processed as a single sequence. The backbone SSM—parameterized as a Longhorn—encodes the full trajectory, enabling RoboSSM to perform direct next-action or trajectory-level prediction at each time step.
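A minimal sketch of this prompt-conditioned inference is shown below; the `backbone` interface and array shapes are assumptions for illustration, not the paper's API.

```python
import numpy as np

def icil_predict(backbone, demo_trajs, query_embeds):
    """Prompt-based in-context prediction with no parameter updates.

    demo_trajs: list of (T_i, d) observation-embedding arrays (the prompt).
    query_embeds: (T_q, d) embeddings of the query trajectory so far.
    backbone: assumed to map a (T, d) sequence to (T, action_dim) outputs.
    """
    prompt = np.concatenate(demo_trajs, axis=0)                # (T_prompt, d)
    sequence = np.concatenate([prompt, query_embeds], axis=0)  # one long sequence
    preds = backbone(sequence)                                 # (T_total, action_dim)
    return preds[len(prompt):]                                 # actions at query positions only
```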

4. Empirical Results and Comparative Analysis

RoboSSM was evaluated on the LIBERO benchmark, which comprises several tasks (LIBERO-Object, LIBERO-90, etc.) involving high-dimensional, temporally extended manipulation in visual environments. The key experimental axes are:

  • Prompt-Length Extrapolation: RoboSSM maintains or improves performance as the number of in-context demonstrations increases beyond the training regime, in stark contrast to Transformer-based ICIL (ICRT), which typically collapses out of distribution. For example, when trained on 2-shot prompts, RoboSSM achieves its highest accuracy with up to 32 demonstration sequences at evaluation time.
  • Few-Shot Generalization: RoboSSM preserves high accuracy even when trained and tested in extreme low-shot regimes (1-2 demonstrations), demonstrating a strong inductive bias for sequence-based adaptation.
  • Robustness to Temporal Dilation: Under test-time time dilation (simulating real-world delays or variable operator speed), RoboSSM remains robust, whereas ICRT degrades sharply.
  • Linear Inference Runtime: RoboSSM's runtime scales linearly with prompt length, compared to the quadratic scaling of ICRT caused by attention cache prefill (see the sketch after the table below).
  • Task Generalization: RoboSSM achieves strong generalization to completely unseen tasks, as LIBERO partitions training and test splits at the task (not trajectory) level.
| Model | Scenario | Extrapolation Robustness | Inference Complexity |
|---|---|---|---|
| RoboSSM (Longhorn) | Long prompt, time dilation | High (performance stable or improves) | $\mathcal{O}(N)$ (linear) |
| ICRT (Transformer) | Long prompt, time dilation | Low (performance degrades sharply) | $\mathcal{O}(N^2)$ (quadratic) |
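The complexity gap in the last column follows from the per-token work each model performs at decode time; the following schematic contrast illustrates the argument and is not either system's actual implementation.

```python
import numpy as np

def ssm_decode_step(state, A_t, B_t):
    """SSM decoding: only a fixed-size state is carried forward, so each new
    token costs O(d) and a length-N prompt costs O(N) in total."""
    return A_t * state + B_t

def attention_decode_step(q_t, K_cache, V_cache):
    """Attention decoding: each new token attends over the entire cache of
    earlier keys/values, so token t costs O(t) and N tokens cost O(N^2)."""
    scores = q_t @ K_cache.T              # (t,) similarity to all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the whole history
    return weights @ V_cache
```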

5. Technical Architecture

The Longhorn SSM update, as implemented in RoboSSM, involves a set of parameterized update functions for each time step:

$$s_t = A_t \odot s_{t-1} + B_t$$

where $A_t$ and $B_t$ are generated from the current and past input embeddings, key vectors, and scaling factors (including the $\beta_t$ weighting term learned or modulated at test time). A secondary scaling parameter $\gamma$ is used during evaluation to further adjust the contribution of new inputs relative to state retention, with experiments showing that choosing $\gamma < 1$ can further boost performance.
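One plausible reading of the $\gamma$ mechanism is sketched below, damping the input-dependent term relative to the retained state at evaluation time; exactly where $\gamma$ enters the Longhorn parameterization is an assumption here, not confirmed by the source.

```python
def scaled_update(s_prev, A_t, B_t, gamma=0.9):
    """Evaluation-time recurrence with an extra scale gamma < 1 that damps new
    inputs relative to state retention (assumed placement of gamma; the paper
    may apply it elsewhere in the update)."""
    return A_t * s_prev + gamma * B_t
```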

For training, a next-action prediction loss is computed on the query trajectory, with optimization using AdamW and a consistent architecture and parameter budget between RoboSSM and the baselines.
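A PyTorch-style training step consistent with this description might look as follows; the model interface, the choice of mean-squared-error loss, and the learning rate are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, prompt, query_obs, query_actions):
    """One optimization step: encode prompt + query as a single sequence and
    apply a next-action prediction loss at the query positions only.

    prompt: (T_p, d) demonstration embeddings; query_obs: (T_q, d);
    query_actions: (T_q, action_dim) ground-truth actions.
    """
    sequence = torch.cat([prompt, query_obs], dim=0)
    preds = model(sequence)[prompt.shape[0]:]   # predictions at query steps
    loss = F.mse_loss(preds, query_actions)     # assumed MSE regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch (AdamW as stated in the text; learning rate is illustrative):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```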

6. Implications, Limitations, and Future Directions

RoboSSM has several important implications for the field of imitation learning in robotics:

  • Scalability: By eliminating the quadratic bottleneck of Transformers, RoboSSM supports efficient processing of arbitrarily long prompt sequences on standard or embedded hardware.
  • Generalization: The SSM’s strong extrapolation enables robust performance as demonstration counts and temporal pattern statistics drift between training and operational deployment.
  • Adaptability: RoboSSM’s backbone can be adapted for both few-shot and large-context regimes, paving the way toward continual or lifelong robot learning without repeated retraining.

Limitations include the absence of demonstrated performance on highly compositional, open-ended task distributions and potential challenges as task complexity (or scene diversity) increases. The authors note that generalization to truly compositional or lifelong learning regimes remains an open area for research.

7. Mathematical Formulations and Figures

Key model update equations in RoboSSM:

  • Recurrent update:

$$s_t = A_t \odot s_{t-1} + B_t$$

  • Online convex programming formulation:

$$s_t = \arg\min_s \left\{ \|s - s_{t-1}\|_F^2 + \|s k_t - x_t\|^2_{\mathrm{diag}(\beta_t)} \right\}$$

Figures referenced in the original paper provide architectural diagrams (training/inference pipelines), demonstration of prompt-length extrapolation, runtime scaling, and detailed comparison between Transformer and SSM-based approaches.


RoboSSM establishes state-space models (specifically Longhorn) as an efficient and robust foundation for scalable in-context imitation learning in robotics, overcoming the computational and generalization limitations of Transformer-based paradigms. Its performance on continuous-control, vision-based robot manipulation benchmarks underscores the potential of SSMs in enabling practical, few-shot, and lifelong robotic adaptation (Yoo et al., 24 Sep 2025).
