RoboSSM: Scalable Robotic Imitation Learning
- RoboSSM is a scalable in-context imitation learning approach for robot manipulation, leveraging state-space models to overcome Transformer limitations.
- It employs the Longhorn SSM update with linear-time inference, enabling efficient processing of both few-shot demonstrations and long-horizon tasks without retraining.
- Empirical results on the LIBERO benchmark show RoboSSM’s superior generalization and robustness, maintaining performance with increasing prompt lengths and temporal dilations.
RoboSSM is a scalable in-context imitation learning (ICIL) architecture for robot manipulation, based on state-space models (SSMs) rather than Transformer architectures. The central innovation is the replacement of Transformer-based prompt processing with Longhorn, a modern SSM that supports linear-time inference and strong extrapolation capabilities over long-horizon and increased-context settings. RoboSSM is specifically engineered for few-shot adaptation, in which a robot must generalize to novel tasks by conditioning on a short sequence (prompt) of demonstration trajectories, without any parameter updates at deployment time. Experiments on the LIBERO benchmark demonstrate that RoboSSM achieves highly competitive generalization to unseen tasks and remains robust as prompt lengths and task horizons scale.
1. Foundations and Motivation
RoboSSM directly addresses two limitations of ICIL approaches that employ Transformers: computational inefficiency and prompt-length sensitivity. Self-attention-based models exhibit $O(L^2)$ time and memory complexity in sequence length $L$, creating scalability bottlenecks for long prompts with high-frequency sensory observations. Empirically, Transformers trained on fixed prompt lengths tend to underperform, sometimes catastrophically, when faced with longer in-context sequences at test time, due to overfitting to a narrow context-length regime and an inability to extrapolate sequence structure.
RoboSSM replaces self-attention with a state-space model backbone, which decomposes sequence modeling into a recurrent update of the form $h_t = a_t \odot h_{t-1} + b_t \odot x_t$, where $h_t$ is the hidden state at time $t$, $a_t$ and $b_t$ are parameterized update factors, $x_t$ is the input at time $t$, and $\odot$ denotes the Hadamard (element-wise) product. In particular, the Longhorn model underpins RoboSSM, yielding both theoretical and empirical improvements in scaling, extrapolation, and robustness.
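To make the recurrence concrete, here is a minimal NumPy sketch of a diagonal SSM scan. The sequential loop and the precomputed factor arrays are illustrative simplifications (practical implementations use parallel scans), not RoboSSM's actual code.

```python
import numpy as np

def diagonal_ssm_scan(a, b, x):
    """Run the recurrence h_t = a_t * h_{t-1} + b_t * x_t over a sequence.

    a, b, x: float arrays of shape (T, d) -- per-step update factors and
    input embeddings (assumed precomputed from the inputs by the model).
    Returns all hidden states, shape (T, d).
    """
    h = np.zeros(x.shape[1])
    states = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]   # two Hadamard products: O(d) per step
        states[t] = h
    return states

# Toy usage: 6 steps of a 4-dimensional hidden state.
rng = np.random.default_rng(0)
out = diagonal_ssm_scan(rng.uniform(0.8, 1.0, (6, 4)),
                        rng.normal(size=(6, 4)),
                        rng.normal(size=(6, 4)))
```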
2. State-Space Model (SSM) Approach
The core of RoboSSM is the Longhorn SSM, which operates by evolving an internal state with a learned recurrent update inspired by online convex programming. Each step seeks a hidden state that minimizes a regularized objective, $S_t = \arg\min_{S} \|S - S_{t-1}\|_F^2 + \|\beta_t \odot (S k_t - x_t)\|_2^2$, where $k_t$ is a key vector, $x_t$ is the input embedding at time $t$, and $\beta_t$ is a learned per-step weighting vector. The closed-form solution to this objective enables Longhorn's forward pass to scale linearly in input length, with low memory requirements and amenability to batched execution.
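The sketch below renders this closed-form step in NumPy, using the diagonal (element-wise) form commonly written for Longhorn-style models; the shapes and parameterization are assumptions and may differ from RoboSSM's implementation.

```python
import numpy as np

def longhorn_step(S, k, x, beta):
    """One Longhorn update in its diagonal (element-wise) form.

    Closed-form minimizer of ||S - S_prev||_F^2 + ||beta * (S k - x)||^2
    under a diagonal approximation of k k^T; shapes follow the common
    Longhorn convention, not necessarily RoboSSM's exact code.

    S: state, shape (d, m); k: key, shape (m,);
    x: input embedding, shape (d,); beta: per-step weights, shape (d,).
    """
    eps = beta / (1.0 + beta * (k @ k))       # effective step size, shape (d,)
    forget = 1.0 - np.outer(eps, k * k)       # state-retention factor, (d, m)
    return forget * S + np.outer(eps * x, k)  # decay old state, write new one
```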
Longhorn’s recurrence form allows it to “remember” and integrate demonstration information distributed over long contexts—a critical property for imitation learning, where the number and length of demonstrations during conditioning may vary unpredictably at test time.
3. In-Context Imitation Learning Pipeline
RoboSSM is evaluated in a prompt-based setup: the model receives a prompt containing demonstration trajectories (each a sequence of observation embeddings), then predicts the next action(s) for a given query trajectory, without any parameter updates. Unlike classical meta-learning or multi-task setups, in-context learning aims to conditionally induce a task via demonstration without any explicit retraining or fine-tuning, making the approach suitable for rapid, repeated adaptation in robotic settings.
During both training and inference, the demonstration prompt and the query trajectory are concatenated and processed as a single sequence. The backbone SSM—parameterized as a Longhorn—encodes the full trajectory, enabling RoboSSM to perform direct next-action or trajectory-level prediction at each time step.
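A minimal sketch of this concatenation step, with illustrative names and shapes (not the paper's API):

```python
import numpy as np

def build_icil_sequence(demo_embeddings, query_embeddings):
    """Assemble one input sequence: demonstrations (the prompt) followed
    by the query trajectory.

    demo_embeddings: list of (T_i, d) arrays, one per demonstration.
    query_embeddings: (T_q, d) array.
    Returns a (sum(T_i) + T_q, d) array; the SSM backbone consumes it left
    to right and emits an action prediction at each query time step.
    """
    return np.concatenate(list(demo_embeddings) + [query_embeddings], axis=0)

# Two 3-step demos plus a 5-step query -> an 11-step sequence.
d = 8
seq = build_icil_sequence([np.zeros((3, d)), np.ones((3, d))], np.zeros((5, d)))
assert seq.shape == (11, d)
```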
4. Empirical Results and Comparative Analysis
RoboSSM was evaluated on the LIBERO benchmark, which comprises several task suites (LIBERO-Object, LIBERO-90, etc.) involving high-dimensional, temporally extended manipulation in visual environments. The key experimental axes are:
- Prompt-Length Extrapolation: RoboSSM maintains or improves performance as the number of in-context demonstrations increases beyond the training regime, in stark contrast to Transformer-based ICIL (ICRT), which typically collapses out-of-distribution. For example, when trained on 2-shot prompts, RoboSSM reaches its highest accuracy when evaluated with prompts of up to 32 demonstrations.
- Few-Shot Generalization: RoboSSM preserves high accuracy even when trained and tested in extreme low-shot regimes (1-2 demonstrations), demonstrating strong inductive bias for sequence-based adaptation.
- Robustness to Temporal Dilation: Under test-time time-dilation (simulating real-world delays or variable operator speed), RoboSSM remains robust, whereas ICRT degrades sharply.
- Linear Inference Runtime: RoboSSM's runtime scales linearly with prompt length, compared to the quadratic scaling of ICRT caused by attention cache prefill (a back-of-the-envelope cost sketch follows the comparison table below).
- Task Generalization: RoboSSM achieves strong generalization to completely unseen tasks, as LIBERO partitions training and test splits at the task (not trajectory) level.
| Model | Scenario | Extrapolation Robustness | Inference Complexity |
|---|---|---|---|
| RoboSSM (Longhorn) | Long prompt, time dilation | High (performance stable or improves) | $O(L)$ (linear) |
| ICRT (Transformer) | Long prompt, time dilation | Low (performance degrades sharply) | $O(L^2)$ (quadratic) |
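The asymptotics in the table can be made concrete with an illustrative operation count; the constants below are back-of-the-envelope assumptions, not measured numbers from the paper.

```python
def step_estimates(prompt_len, d_state=64):
    """Illustrative operation counts only.

    A recurrent SSM touches a fixed-size state once per token: O(L).
    A Transformer must prefill its attention cache over token pairs: O(L^2).
    """
    return prompt_len * d_state, prompt_len ** 2

for L in (512, 2048, 8192):
    ssm_ops, attn_ops = step_estimates(L)
    print(f"L={L:5d}  SSM ~{ssm_ops:>10,}  attention ~{attn_ops:>12,}")
```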
5. Technical Architecture
The Longhorn SSM update, as implemented in RoboSSM, applies a parameterized update at each time step, $S_t = S_{t-1} \odot (1 - \epsilon_t \otimes k_t^{\circ 2}) + (\epsilon_t \odot x_t) \otimes k_t$, where the step size $\epsilon_t = \beta_t / (1 + \beta_t k_t^\top k_t)$ and the key $k_t$ are generated from the current and past input embeddings, key vectors, and scaling factors (including the weighting term $\beta_t$, learned or modulated at test time). A secondary scaling parameter, applied during evaluation, further adjusts the contribution of new inputs relative to state retention, with experiments showing that an appropriately chosen value can further boost performance.
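As a sketch, such an evaluation-time scale can be folded into the effective step size of the Longhorn update; the symbol `alpha` and its placement here are assumptions, and RoboSSM's exact parameterization may differ.

```python
import numpy as np

def longhorn_step_scaled(S, k, x, beta, alpha=1.0):
    """Longhorn step with a hypothetical evaluation-time scale `alpha`
    on the effective step size.

    alpha > 1 weights new inputs more heavily; alpha < 1 favors
    retention of the existing state.
    """
    eps = alpha * beta / (1.0 + beta * (k @ k))
    return S * (1.0 - np.outer(eps, k * k)) + np.outer(eps * x, k)
```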
For training, a next-action prediction loss is computed on the query trajectory and optimized with AdamW, with a consistent architecture and parameter budget across RoboSSM and the baselines.
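A minimal PyTorch sketch of such a training step, assuming an MSE next-action objective and a boolean mask over query positions (both illustrative assumptions, not the paper's stated loss):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sequence, query_mask, target_actions):
    """One next-action-prediction update on a prompt + query sequence.

    sequence: (B, L, d) concatenated prompt and query embeddings.
    query_mask: (B, L) bool, True at the query positions to supervise.
    target_actions: (N, action_dim) ground-truth actions at those positions.
    """
    pred = model(sequence)                            # (B, L, action_dim)
    loss = F.mse_loss(pred[query_mask], target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (hyperparameters illustrative):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```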
6. Implications, Limitations, and Future Directions
RoboSSM has several important implications for the field of imitation learning in robotics:
- Scalability: By eliminating the quadratic bottleneck of Transformers, RoboSSM supports efficient processing of arbitrarily long prompt sequences on standard or embedded hardware.
- Generalization: The SSM’s strong extrapolation enables robust performance as demonstration counts and temporal pattern statistics drift between training and operational deployment.
- Adaptability: RoboSSM’s backbone can be adapted for both few-shot and large-context regimes, paving the way toward continual or lifelong robot learning without repeated retraining.
Limitations include the absence of demonstrated performance on highly compositional, open-ended task distributions and potential challenges as task complexity (or scene diversity) increases. The authors note that generalization to truly compositional or lifelong learning regimes remains an open area for research.
7. Mathematical Formulations and Figures
Key model update equations in RoboSSM:
- Recurrent update: $S_t = S_{t-1} \odot (1 - \epsilon_t \otimes k_t^{\circ 2}) + (\epsilon_t \odot x_t) \otimes k_t$, with $\epsilon_t = \beta_t / (1 + \beta_t k_t^\top k_t)$
- Online convex programming formulation: $S_t = \arg\min_{S} \|S - S_{t-1}\|_F^2 + \|\beta_t \odot (S k_t - x_t)\|_2^2$
Figures referenced in the original paper provide architectural diagrams (training/inference pipelines), demonstration of prompt-length extrapolation, runtime scaling, and detailed comparison between Transformer and SSM-based approaches.
RoboSSM establishes state-space models (specifically Longhorn) as an efficient and robust foundation for scalable in-context imitation learning in robotics, overcoming the computational and generalization limitations of Transformer-based paradigms. Its performance on continuous-control, vision-based robot manipulation benchmarks underscores the potential of SSMs in enabling practical, few-shot, and lifelong robotic adaptation (Yoo et al., 24 Sep 2025).