Cross-Robot Action Generalization
- Cross-robot action generalization is the ability of learned policies or representations to transfer across different robot embodiments while maintaining task execution.
- The approach surveyed here employs a shared, task-centric latent affordance space that integrates action, object, and effect cues to overcome differences in kinematics and sensors.
- Empirical studies show low RMSE values in tasks like insertion and grasping, validating its potential for plug-and-play skill transfer in robotics.
Cross-robot action generalization is the capacity for policies, models, or system representations learned from data on one robot embodiment to transfer—without retraining or with only lightweight adaptation—to different robot hardware, morphologies, or configurations, while maintaining successful task execution. Achieving this requires overcoming differences in kinematics, sensors, actuators, and environment couplings that traditionally prevent direct policy or trajectory transfer between heterogeneous robots. A unifying approach is to construct shared, task-centric latent representations that embed actions, objects, and effects in a way that is invariant to the specifics of the acting robot, thereby enabling the reinterpretation or decoding of skills across embodiments.
1. Foundations: Affordance Spaces and Equivalence
A principal theoretical construct underlying cross-robot action generalization is the affordance space, defined as a latent space in which the effectual possibilities (affordances) of objects relative to agents are represented as shared latent codes. In the context of "Cross-Embodied Affordance Transfer through Learning Affordance Equivalences" (Aktas et al., 24 Apr 2024), each affordance sample is formalized as a tuple $(a_r, e, o, t)$, where $a_r$ is the continuous action of robot $r$, $e$ is the effect trajectory (e.g., force, displacement), $o$ is an object descriptor (depth image), and $t$ is the normalized phase. Robot actions, object, and effect cues are encoded by Conditional Neural Movement Primitive (CNMP) encoders into separate per-modality latents:

$$z_{a_r} = E_{a_r}(a_r, t), \qquad z_e = E_e(e, t), \qquad z_o = E_o(o, t).$$
A convex combination over robots,

$$z_a = \sum_{r} w_r\, z_{a_r}, \qquad w_r \ge 0, \quad \sum_{r} w_r = 1,$$

constructs a unified action code, which, together with the effect and object latents, yields the affordance latent:

$$z_{\text{aff}} = \operatorname{blend}\!\left(z_a, z_e, z_o\right).$$
This "affordance blending" makes no reference to the identity or specific embodiment of the robot, instead encoding the task in terms of the effect an action produces on an object, irrespective of which robot produced it. Affordance equivalence emerges when different agent-object-action tuples yield indistinguishable effect trajectories, allowing the policy to reconstruct the correct effect or action for any agent in the shared space.
2. Model Architectures and Training Procedures
Affordance blending networks are realized via simple, deep MLP architectures for each encoder (action, effect, object). Each encoder processes its respective modality via a concatenated (t, vector) input and emits a D-dimensional latent (typical D=64–128, 3-layer MLPs, with ReLU activations). Decoders are also MLPs, taking the merged affordance latent and a time/phase input and outputting Gaussian mean and variance over each target trajectory channel. Crucially, no RNNs, convolutions, or attention mechanisms are used; temporal information is encoded solely by the phase variable $t$.
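A minimal PyTorch sketch of this encoder/decoder layout follows; latent sizes, hidden widths, class names, and the softplus variance parameterization are plausible assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """3-layer MLP encoder: concatenated (phase t, modality vector) -> D-dim latent."""
    def __init__(self, in_dim, latent_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + 1, hidden), nn.ReLU(),   # +1 for the phase t
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

class TrajectoryDecoder(nn.Module):
    """3-layer MLP decoder: (affordance latent, query phase) -> Gaussian mean and
    variance over each target trajectory channel."""
    def __init__(self, latent_dim=64, out_dim=7, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * out_dim),             # mean and raw variance
        )

    def forward(self, z_aff, t):
        out = self.net(torch.cat([z_aff, t], dim=-1))
        mean, raw_var = out.chunk(2, dim=-1)
        var = nn.functional.softplus(raw_var) + 1e-6    # keep variance positive
        return mean, var
```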
The system is trained with a per-point CNMP negative log-likelihood loss plus a Selective Loss:

$$\mathcal{L} = -\log \mathcal{N}\!\left(y(t) \,\middle|\, \mu\!\left(z_{\text{aff}}, t\right), \sigma^{2}\!\left(z_{\text{aff}}, t\right)\right) + \mathcal{L}_{\text{selective}}.$$
Selective Loss is critical for stabilizing learning when multiple affordances share similar effect–action pairs; it prevents spurious high-loss regions where context samples overlap multiple valid solutions. Optimization uses Adam with standard hyperparameters, mini-batches of 32–64 samples, for 50–100 epochs. During training, dropout and layer normalization are optionally employed to regularize deeper networks.
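The objective can be sketched as below. The `selective_nll` helper is a hypothetical rendering of the Selective Loss (penalizing only the best-matching valid target for an ambiguous context), and the optimizer line simply uses the standard Adam betas mentioned above with a placeholder learning rate.

```python
import torch
import torch.nn as nn

def gaussian_nll(mean, var, target):
    """Per-point Gaussian negative log-likelihood (the CNMP reconstruction term)."""
    return 0.5 * (torch.log(var) + (target - mean) ** 2 / var).mean()

def selective_nll(mean, var, candidate_targets):
    """Hypothetical sketch of the Selective Loss: when one context is compatible with
    several valid target trajectories (overlapping affordances), penalize only the
    best-matching one so no spurious high-loss region is created."""
    losses = torch.stack([gaussian_nll(mean, var, y) for y in candidate_targets])
    return losses.min()

# Illustrative optimizer setup; (0.9, 0.999) are the standard Adam betas,
# and the learning rate is a placeholder assumption.
params = [nn.Parameter(torch.randn(8))]
optimizer = torch.optim.Adam(params, betas=(0.9, 0.999), lr=1e-3)
```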
Affordance equivalence is not enforced via a contrastive loss; rather, it emerges from reconstructing identical effect/action trajectories for equivalent tuples, which leads to clustering of equivalent skills in the latent affordance space.
3. Mechanisms of Cross-Embodiment Action Transfer
After learning, transfer proceeds as follows:
- Encoding: Encode a demonstration (trajectory) from the source robot to obtain its action latent $z_{a_{\text{src}}}$.
- Affordance Merging: Combine $z_{a_{\text{src}}}$ with the object and (optionally) effect latents to get the affordance code $z_{\text{aff}}$.
- Decoding: Decode $z_{\text{aff}}$ into the target robot’s action space via the respective decoder:

$$\hat{a}_{\text{tgt}}(t) = \mathrm{Dec}_{\text{tgt}}\!\left(z_{\text{aff}}, t\right).$$
At no stage is an explicit mapping between source and target action or joint spaces needed. The model has acquired a common, task-centric latent such that each robot’s decoder can translate it into that robot’s specific motor commands (see the sketch below). When a demonstration is available for only one robot, continual training with already learned affordances allows recovery of the missing action channels by inference.
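An end-to-end sketch of this transfer pipeline, reusing the `ModalityEncoder`, `TrajectoryDecoder`, and `blend_affordance` helpers from the earlier snippets, is given below; all dimensions, feature choices, and names are illustrative assumptions.

```python
import torch

# Assumes ModalityEncoder, TrajectoryDecoder, and blend_affordance as sketched above.
D = 64
enc_action_src = ModalityEncoder(in_dim=7, latent_dim=D)     # source robot (7-DoF)
enc_effect     = ModalityEncoder(in_dim=3, latent_dim=D)     # e.g. force / displacement
enc_object     = ModalityEncoder(in_dim=128, latent_dim=D)   # flattened depth features
dec_action_tgt = TrajectoryDecoder(latent_dim=D, out_dim=6)  # target robot (6-DoF)

t = torch.tensor([0.3])                      # normalized phase in [0, 1]
demo_action   = torch.randn(7)               # source-robot demonstration sample
effect_sample = torch.randn(3)               # observed effect at phase t
object_desc   = torch.randn(128)             # object descriptor

# 1) Encoding: source demonstration and context modalities
z_a_src = enc_action_src(t, demo_action)
z_e     = enc_effect(t, effect_sample)
z_o     = enc_object(t, object_desc)

# 2) Affordance merging into the robot-agnostic latent
z_aff = blend_affordance(z_a_src.unsqueeze(0), z_e, z_o,
                         robot_weights=torch.tensor([1.0]))

# 3) Decoding into the *target* robot's action space at any query phase
mean, var = dec_action_tgt(z_aff, t)
print(mean.shape)  # torch.Size([6]) -> target-robot motor command at phase t
```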
4. Empirical Results and Ablation Insights
Experiments were performed in simulation (CoppeliaSim) and via real-world direct imitation:
- Insertion: UR-10 + red rod × 10 openings — trajectory RMSE < 0.02 rad.
- Graspability: UR-10 (BarrettHand) & Baxter (two-finger) × 10 objects — grasp RMSE < 0.03 rad.
- Rollability (push): UR-10 & KUKA × 6 objects — push RMSE < 0.04 rad.
Effect prediction RMSE (e.g. force, displacement): insertion 0.1 N, grasp 2 mm, roll 5 mm. Held-out affordance classification/detection: insertion 100 %, graspability 94 %, rollability 96 %. Qualitatively, latent PCA plots reveal tight clusters aligning insertable/non-insertable or graspable/ungraspable classes across robots.
Ablation studies found removing the convex combination (multi-robot blending) breaks cross-robot transfer. Zeroing a modality weight (e.g., object) impairs reconstruction only of that modality but preserves effect–action equivalence. Disabling Selective Loss introduces high-loss instability where affordances overlap, degrading convergence rate by ~20 %.
5. Limitations, Scalability, and Future Prospects
The current method demonstrates cross-robot action generalization principally on a constrained set of relatively simple affordances and two robot embodiments. Extending to dozens of robots, more complex or hierarchical skills (multi-step tool use, long-horizon plans), or more diverse sensor modalities (tactile, language, multimodal vision) remains an open challenge; richer tactile or vision-language inputs in particular could expand the scope of generalizable affordances. The current system infers equivalence implicitly through reconstruction loss; explicit use of contrastive or triplet objectives may further tighten affordance clusters and improve zero-shot transfer. For applications requiring multi-step or hierarchical policy transfer, embedding the affordance space within a reinforcement learning pipeline or enabling composition of affordance codes (affordance chaining) is a plausible future direction. Scaling issues may also arise when learning over high-dimensional effect or action spaces or over many more robot-object classes.
6. Comparative Position in Broader Research
"Affordance Blending Networks" (Aktas et al., 24 Apr 2024) contrasts with prior work using explicit action retargeting, joint-space mapping, or morphology-encoded policies. It does not require knowledge of robot embodiment at inference time, nor does it maintain explicit inverse models mapping object–effect pairs to specific robot kinematics. Instead, all transfer proceeds via an effect-centric, agent-invariant latent, offering a distinctive functional foundation for task-centric, zero-shot skill sharing. This affordance-centric abstraction aligns with broader interest in grounding robot policies in task or object representations to facilitate transfer, rather than binding them to hardware-specific action spaces.
7. Practical Implications
In practical deployment, the affordance blending approach enables plug-and-play transfer of manipulation skills without per-embodiment calibration or joint-space retargeting. Direct imitation, as in the UR-10 replaying a human push, demonstrates real-world feasibility. The abstraction layer reduces the engineering barrier of mapping between diverse robots and opens avenues for scalable, data-sharing multi-robot learning. Current limitations in complexity and generality must be addressed, but the affordance equivalence approach establishes a mathematically elegant and empirically robust recipe for cross-robot manipulation transfer that is fundamentally different from models requiring per-robot adaptation or explicit morphological grounding.