Unified Action Model Overview

Updated 4 July 2026

Unified Action Model (UAM) is a family of architectures that introduce a stable intermediate action abstraction to decouple context understanding from execution.
It integrates discrete action vocabularies with continuous realization models, enabling robust multimodal prediction across heterogeneous embodiments.
UAM research addresses semantic preservation, architectural complexity, and cross-domain adaptability; the acronym is unrelated to Urban Air Mobility, with which it is easily confused.

Unified Action Model (UAM) denotes a family of research ideas rather than a single standardized formalism. In the narrowest explicit usage, it names a dual-stream vision–language–action architecture designed to preserve multimodal semantics during action learning (Zhang et al., 15 May 2026). In a broader and increasingly common sense, it refers to models that place a stable action abstraction between context understanding and executable behavior, often by combining a discrete action vocabulary with a continuous realization model, a shared latent interface across heterogeneous embodiments, or a joint world–action generator (Richter et al., 2022). The acronym is also polysemous: in adjacent literatures, “UAM” commonly means Urban Air Mobility, and in aerial robotics it can mean unmanned aerial manipulator rather than Unified Action Model (Erturk et al., 2020).

1. Term, scope, and competing usages

The literature does not present a single canonical definition of UAM. One explicit line uses the term for an embodied architecture that separates a pretrained semantic pathway from a control-oriented “Dorsal Expert,” thereby addressing the “embodiment tax” in VLA training (Zhang et al., 15 May 2026). A broader reading of nearby work suggests a recurring design pattern: actions are unified by introducing a compact intermediate representation that is stable across contexts, while context-dependent variation is handled by priors, decoders, or downstream embodiment-specific translators (Zheng et al., 17 Jan 2025).

This broader usage spans several representational choices. In vehicle behavior prediction, unification appears as a latent probabilistic model with a discrete variable for action identity and a continuous variable for within-action variation (Richter et al., 2022). In embodied robotics, it appears as a Universal Action Space implemented by a discrete codebook and lightweight embodiment-specific decoders (Zheng et al., 17 Jan 2025), as a semantically partitioned Unified Action Space with per-agent masks in heterogeneous MARL (Yu et al., 2024), or as an object-centric Unified Motion-Action Model in which 3D object motion trajectories serve as the shared interface between control and dynamics (Cao et al., 15 Jun 2026). This suggests that UAM is better treated as an architectural and representational program than as a single named model.

A persistent source of confusion is acronym overlap. In aviation and transportation, UAM overwhelmingly denotes Urban Air Mobility, including work on communication, navigation, surveillance, routing, and scheduling (Erturk et al., 2020). In aerial robotics, UAM can denote an unmanned aerial manipulator, as in vision-based approaching and object tracking for aerial manipulation (Zheng et al., 2022). Any technical use of “Unified Action Model” therefore requires local definition.

2. Probabilistic latent-action models for vehicle behavior

A particularly clear statistical formulation of a UAM-like idea appears in “Learning and Predicting Multimodal Vehicle Action Distributions in a Unified Probabilistic Model Without Labels” (Richter et al., 2022). The model is trained on scenario–trajectory pairs $(x_i,s_i)$, where $x_i$ is a future trajectory and $s_i$ is a rasterized scenario. Its latent structure uses a discrete variable $y \in \{1,\dots,K\}$ for action category and a continuous latent $z \in \mathbb{R}^D$ for within-action variability. The base factorization is

$p(x,y,z\mid s)=p(y\mid s)\,p(z\mid y)\,p(x\mid z), \qquad q(y,z\mid x,s)=q(y\mid x,s)\,q(z\mid x).$

Under this formulation, $p(y\mid s)$ is a context-conditioned action prior, $p(z\mid y)$ is a Gaussian latent prototype for each action, and $p(x\mid z)$ decodes latent realizations into trajectories. The resulting “action” is therefore not a fixed maneuver template, but a discrete mode whose geometric realization is modulated continuously by $z$.
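The generative direction of this factorization can be read as ancestral sampling: draw an action from the scene-conditioned prior, draw a latent from that action's Gaussian, then decode. The sketch below illustrates only that reading. The two-action prior, the Gaussian parameters, and `toy_decoder` are hypothetical stand-ins, not the paper's CNN prior or MLP decoder.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1  # guard against floating-point round-off

def sample_trajectory(prior_y, means, stds, decode, rng):
    """Ancestral sampling: y ~ p(y|s), z ~ p(z|y), x = decode(z)."""
    y = sample_categorical(prior_y, rng)
    z = [rng.gauss(m, s) for m, s in zip(means[y], stds[y])]
    return y, z, decode(z)

# Hypothetical toy setup: K = 2 actions, D = 2 latent dimensions.
PRIOR_Y = [0.7, 0.3]                 # stands in for p(y | s) in one scene
MEANS = [[1.0, 0.0], [1.0, 1.0]]     # per-action Gaussian means for p(z | y)
STDS = [[0.1, 0.1], [0.1, 0.1]]      # per-action standard deviations

def toy_decoder(z, steps=3):
    """Toy mean of p(x | z): z[0] acts as forward speed, z[1] as lateral drift."""
    return [(z[0] * t, z[1] * t) for t in range(1, steps + 1)]

rng = random.Random(0)
y, z, x = sample_trajectory(PRIOR_Y, MEANS, STDS, toy_decoder, rng)
```

Repeated calls with the same prior produce trajectories that cluster by action index while varying continuously within each cluster, which is the sense in which an action here is a mode rather than a template.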

The training objective follows variational inference. The evidence lower bound separates into a reconstruction term, a KL term aligning posterior action assignments with the scene-conditioned prior, and a mixture-weighted KL term pulling the continuous encoding toward one of the Gaussian action components. A notable feature is that the posterior over the discrete latent is derived analytically rather than learned with a separate network or relaxed with Gumbel-Softmax. Consistent with the factorization above, it takes the Bayes-rule form

$q(y\mid x,s)\propto p(y\mid s)\,p(z\mid y), \qquad z\sim q(z\mid x),$

normalized over the $K$ actions. This ties action inference jointly to scene plausibility and latent compatibility with the encoded trajectory.
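As a rough illustration of an analytically derived discrete posterior, the sketch below assumes the Bayes-rule form $q(y\mid x,s)\propto p(y\mid s)\,p(z\mid y)$ with $z$ taken from the continuous encoder and diagonal Gaussian components. That form is inferred from the description above rather than copied from the paper, and all numeric values are hypothetical. The prior is assumed to be strictly positive.

```python
import math

def log_gaussian(z, mean, std):
    """Log density of a diagonal Gaussian N(z; mean, diag(std^2))."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mean, std)
    )

def action_posterior(z, prior_y, means, stds):
    """Responsibilities proportional to p(y|s) * p(z|y), normalized in log space."""
    log_w = [
        math.log(p) + log_gaussian(z, m, s)   # assumes every prior entry is > 0
        for p, m, s in zip(prior_y, means, stds)
    ]
    top = max(log_w)                          # subtract the max for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical values: two actions, two latent dimensions.
PRIOR_Y = [0.7, 0.3]
MEANS = [[1.0, 0.0], [1.0, 1.0]]
STDS = [[0.2, 0.2], [0.2, 0.2]]

z_encoded = [1.0, 0.9]                        # stands in for a sample from q(z | x)
posterior = action_posterior(z_encoded, PRIOR_Y, MEANS, STDS)
```

In this toy case the scene prior favors action 0, but the encoded latent lies much closer to the second component, so the posterior mass moves almost entirely to action 1. No extra inference network or relaxation is needed because both factors are already part of the generative model.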

The paper then addresses a central UAM tension: a globally stable action vocabulary benefits from preventing scene information from bypassing the discrete bottleneck, but accurate prediction also requires scenario-conditioned refinement of continuous execution. Its solution is a second, scenario-conditioned encoder, introduced first in an alternate “dual encoder” setting and then in a unified model that duplicates the continuous latent branch while sharing the same action prior, Gaussian components, and decoder.

Conceptually, one branch preserves trajectory-to-latent clustering and globally stable action semantics, while the other learns how each action is instantiated in the current scene.

Architecturally, the scenario is encoded as a rasterized bird’s-eye-view image, the action prior $p(y\mid s)$ uses a CNN with a ResNet50 feature extractor, and the continuous encoder and decoder are MLPs. On the Waymo Open Motion Dataset, the latent dimension $D$ and the number of discrete actions $K$ are fixed hyperparameters, with only about 20 actions receiving significant probability after training. The reported strengths are interpretability, multimodality, probabilistic consistency, label-free action discovery, and scene-conditioned realization; the principal limitation is that evaluation is mostly qualitative, with no strong benchmark tables for prediction, calibration, or action interpretability (Richter et al., 2022).

3. Unified action spaces across heterogeneous embodiments

A second major UAM line addresses action heterogeneity across robots. “Universal Actions for Enhanced Embodied Foundation Models” introduces UniAct, whose central mechanism is a learned Universal Action Space implemented as a discrete codebook of universal-action embeddings (Zheng et al., 17 Jan 2025). Given the observation–goal context, the model predicts a distribution over universal actions and selects one of them, with Gumbel–Softmax used during training to keep the selection differentiable. The selected universal action is translated back into domain-specific control by a lightweight embodiment-specific decoder.

The global training objective is a multi-domain behavior-cloning loss over heterogeneous datasets, with alignment learned implicitly through the shared bottleneck rather than paired cross-embodiment demonstrations.
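The select-then-decode pattern can be sketched in a few lines. This is a toy illustration under stated assumptions, not UniAct's implementation: the three-entry codebook, the linear decoder heads, the embodiment names, and the use of argmax at inference are all hypothetical, and the real decoders may consume additional inputs.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep log() finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def select_universal_action(logits, codebook, tau=None, rng=None):
    """With tau: soft mixture of codes (training). Without: argmax code (assumed)."""
    if tau is None:
        idx = max(range(len(logits)), key=logits.__getitem__)
        return list(codebook[idx])
    weights = gumbel_softmax(logits, tau, rng)
    dim = len(codebook[0])
    return [sum(w * code[d] for w, code in zip(weights, codebook)) for d in range(dim)]

def make_linear_decoder(weight_rows):
    """Hypothetical embodiment head: one linear map from code to command."""
    def decode(code):
        return [sum(w * c for w, c in zip(row, code)) for row in weight_rows]
    return decode

# Toy codebook: 3 codes of dimension 2 (far smaller than the paper's codebook).
CODEBOOK = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
DECODERS = {
    "arm_eef_velocity": make_linear_decoder([[0.1, 0.0], [0.0, 0.1]]),
    "arm_joint_position": make_linear_decoder([[0.5, 0.5], [0.5, -0.5], [0.0, 1.0]]),
}

logits = [0.2, 2.0, -1.0]            # stands in for the model's prediction
code = select_universal_action(logits, CODEBOOK)
commands = {name: dec(code) for name, dec in DECODERS.items()}
soft_code = select_universal_action(logits, CODEBOOK, tau=1.0, rng=random.Random(0))
```

The same selected code yields a 2-dimensional command for one head and a 3-dimensional command for the other, which is the point of the design: the shared bottleneck carries the behavior, and only the small terminal head knows the control format.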

This design is explicitly motivated by action-space incompatibility across embodiments, including end-effector position, end-effector velocity, and joint position control. UniAct is trained on 1 million demonstrations from 28 embodiments using a 0.5B model, a codebook of 256 universal actions, and lightweight MLP decoder heads (Zheng et al., 17 Jan 2025). The paper reports that its 0.5B instantiation outperforms embodied foundation models 14× its size in extensive real-world and simulation evaluations, and that adaptation to a new AIRBOT embodiment requires training only 4M of its 500M parameters. The broader implication is that a UAM need not normalize all robots into one low-level control format; it can instead unify them at the level of generic atomic behaviors and retain embodiment-specific realization as a lightweight terminal layer.

An alternative unification mechanism appears in heterogeneous MARL. “Improving Global Parameter-sharing in Physically Heterogeneous Multi-agent Reinforcement Learning with Unified Action Space” defines a semantically partitioned Unified Action Space (UAS) composed of self actions, ally actions, and enemy actions (Yu et al., 2024). All agents produce outputs in the same UAS, but each agent’s valid action subset is recovered via an available-action mask. The paper augments this with a Cross-Group Inverse loss that predicts other groups’ masked policies or Q-values from trajectory information. Here unification is not a shared motor codebook but a semantically aligned output vocabulary with per-agent feasibility constraints.
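A minimal sketch of the masking idea follows: every agent type scores the same unified vocabulary, and an available-action mask removes the entries that agent cannot execute. The action names, the agent types, and the reuse of one logit vector for both agents are hypothetical simplifications for illustration; the Cross-Group Inverse loss is not modeled here.

```python
import math

# Hypothetical unified action space: each index carries a semantic group.
UNIFIED_ACTIONS = [
    ("self", "noop"),
    ("self", "move"),
    ("ally", "heal"),
    ("enemy", "attack"),
]

def masked_softmax(logits, mask):
    """Softmax restricted to entries where mask is 1; masked entries get 0."""
    if not any(mask):
        raise ValueError("at least one action must be available")
    top = max(l for l, m in zip(logits, mask) if m)
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_action(logits, mask):
    """Pick the most probable action among those the agent can execute."""
    probs = masked_softmax(logits, mask)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return UNIFIED_ACTIONS[idx], probs

shared_logits = [0.1, 0.5, 2.0, 1.0]   # output of a shared policy head
MEDIC_MASK = [1, 1, 1, 0]              # a medic cannot attack
SOLDIER_MASK = [1, 1, 0, 1]            # a soldier cannot heal

medic_action, medic_probs = greedy_action(shared_logits, MEDIC_MASK)
soldier_action, soldier_probs = greedy_action(shared_logits, SOLDIER_MASK)
```

Because the mask is applied after a shared output layer, parameters can be shared globally while each agent's policy stays confined to its feasible subset.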

A third formulation shifts the shared interface from actions to object motion. “Unified Motion-Action Modeling for Heterogeneous Robot Learning” defines a masked generative model in which 3D object motion trajectories and robot actions are treated as co-evolving variables (Cao et al., 15 Jun 2026). The key abstraction is that 3D object motion is more embodiment-agnostic than robot state or action, yet more physically grounded than pixel-space prediction. This suggests a broader UAM principle: the “unified” layer can sit above control as an object-centric motion interface rather than as a universal low-level action alphabet.

4. Embodied architectures that couple semantics, prediction, and control

The most explicit recent use of the term is “UAM: A Dual-Stream Perspective on Forgetting in VLA Training” (Zhang et al., 15 May 2026). That paper argues that end-to-end action tuning of pretrained VLMs imposes a structural bottleneck: the same encoder must simultaneously preserve language-grounded semantic representations and adapt to produce control-relevant visual features. UAM addresses this by adding a parallel Dorsal Expert that complements the semantic expert.

The dorsal stream is initialized from the pretrained generative expert of Bagel and trained with a visual-dynamics objective, while using no frozen parameters, no gradient stopping, no auxiliary VL co-training, and no replay of multimodal data. Reported results include high retention of the underlying VLM’s multimodal capability and the highest average success rate among compared baselines on out-of-distribution manipulation tasks involving unseen objects, novel object–target compositions, and instruction variation (Zhang et al., 15 May 2026).

A more strongly unified formulation appears in Pelican-Unified 1.0, which uses a single VLM for understanding and reasoning and a Unified Future Generator for joint video–action denoising (Zhang et al., 14 May 2026). The model maps the input context to a reasoning trace and a dense latent state, then conditions a shared diffusion transformer on that latent state to generate future video and future actions. Its joint training objective combines text, video, and action losses in one loop. With a single checkpoint, Pelican-Unified reports 64.7 on eight VLM benchmarks, 66.03 on WorldArena, and 93.5 on RoboTwin, thereby advancing a UAM interpretation in which understanding, reasoning, imagination, and action are trained as one embodied loop rather than as separate expert systems (Zhang et al., 14 May 2026).

ABot-M0.5 extends the same general direction to mobile manipulation by unifying future video prediction, intermediate latent actions, and disentangled mobility/manipulation control (Chen et al., 1 Jul 2026). Its core cascade runs from a future video latent, to a frame-level latent action, to an executable action split into mobility and manipulation subspaces. The model combines a world model, latent-action model, and action decoder in one autoregressive stack, while Dream Forcing trains inverse dynamics on model-predicted futures to reduce train–test mismatch. Reported results include 54.2 average on RoboCasa365 Target 100%, 94.1 average on RoboTwin 2.0, 99.4 average on LIBERO, and strong real-world long-horizon manipulation performance (Chen et al., 1 Jul 2026). A plausible implication is that unification in embodied systems often works best when it is structured rather than flat: one model spans the loop, but internal pathways remain factorized where dynamics differ sharply.

5. Conceptual action primitives, events, and behavior

At the most abstract end of the spectrum, “Conceptual Modeling of Actions” proposes a minimal action ontology for conceptual modeling rather than a robotics architecture (Al-Fedaghi, 2022). Its “thinging machine” model claims that actions can be expressed through five primitive actions: create, process, release, transfer, and receive. The definitions are intentionally elementary: create brings a new thing into being in the machine; process changes, handles, or examines a thing but yields no new thing; release makes a thing ready for transfer outside the machine; transfer moves a thing into or out of a machine; and receive collapses arrival and acceptance into one incoming action.

This model distinguishes sharply among static action structure, event, and behavior. A static TM diagram represents potential actions and flows. An event is “a subgraph of the static TM model (called a region of the event) plus time,” and behavior is “a chronology of events that occur at particular time and over a region” (Al-Fedaghi, 2022). In the paper’s reinterpretations of UML activity diagrams and BPMN task diagrams, many coarse-grained actions such as “Receive Order,” “Send Invoice,” or “Perform Reparation” become composites over the five primitives. For UAM discourse, this conceptual line contributes a useful reminder: any unified action model must decide whether “action” is treated as a primitive, a latent code, a probabilistic mode, a control token, or a temporally indexed region within a broader behavior model.
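To make the primitive, event, and behavior distinction concrete, the sketch below encodes the five primitives as an enumeration, expresses two coarse-grained actions as composites, and attaches time steps to form a chronology. The specific decompositions are hypothetical illustrations rather than the paper's diagrams, and a "region" is reduced here to a single primitive for brevity.

```python
from enum import Enum

class Primitive(Enum):
    """The five primitive actions of the thinging machine model."""
    CREATE = "create"
    PROCESS = "process"
    RELEASE = "release"
    TRANSFER = "transfer"
    RECEIVE = "receive"

# Hypothetical decompositions, for illustration only.
COMPOSITES = {
    "Receive Order": [Primitive.TRANSFER, Primitive.RECEIVE, Primitive.PROCESS],
    "Send Invoice": [Primitive.CREATE, Primitive.RELEASE, Primitive.TRANSFER],
}

def to_events(composite, start_time=0):
    """Pair each primitive with a time step: a simplified event is (time, region)."""
    return [(start_time + i, p) for i, p in enumerate(COMPOSITES[composite])]

# A behavior is a chronology of events.
behavior = to_events("Receive Order") + to_events("Send Invoice", start_time=3)
```

The static dictionary plays the role of the model of potential actions, while `behavior` exists only once time is attached, mirroring the separation between structure and chronology described above.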

6. Limitations, misconceptions, and open problems

A common misconception is that a UAM must define one universal low-level motor command space. The literature does not support that view. UniAct uses a shared discrete codebook but still decodes through embodiment-specific heads (Zheng et al., 17 Jan 2025); heterogeneous MARL work uses a unified output vocabulary plus masks rather than a single executable command set (Yu et al., 2024); and UMA explicitly retains an embodiment-specific action head while unifying action and dynamics around 3D object motion (Cao et al., 15 Jun 2026). This suggests that “unified” more often refers to the intermediate action abstraction than to the final actuator interface.

Another misconception is that unification is inherently architecture-agnostic. The current literature repeatedly makes it architectural. Dual-stream separation in VLA training is presented as the mechanism that reduces forgetting (Zhang et al., 15 May 2026); Pelican-Unified makes a shared latent loop state the organizing principle for reasoning, imagination, and action (Zhang et al., 14 May 2026); and ABot-M0.5 argues that temporal granularity, action-space structure, and train–test consistency must all be aligned for unified world–action modeling to work in mobile manipulation (Chen et al., 1 Jul 2026). A plausible implication is that unification is less about naming a single latent variable than about placing bottlenecks and shared computation in the right parts of the control stack.

The main limitations are equally recurrent. The vehicle action-distribution model fixes the number of discrete actions in advance and evaluates semantic quality mostly qualitatively rather than through strong benchmark tables (Richter et al., 2022). UniAct fixes the codebook size (256 universal actions), does not address zero-shot new-robot execution without adaptation data, and is evaluated mostly on single-arm manipulation settings (Zheng et al., 17 Jan 2025). UAM for VLA training introduces substantial additional complexity, combining a 7B semantic expert, a 7B dorsal expert, and a 2B action expert, with reported single-step inference latency of about 1500 ms (Zhang et al., 15 May 2026). UMA still requires calibrated camera-to-base extrinsics and an embodiment-specific action head at deployment (Cao et al., 15 Jun 2026). Conceptual action-primitive models, by contrast, sharpen terminology but do not provide execution semantics, concurrency formalisms, or planner-ready operator models (Al-Fedaghi, 2022).

Taken together, these works suggest that UAM is best understood as a research program centered on stable action abstractions, shared latent interfaces, and tighter coupling among context, prediction, and execution. What remains unresolved is whether a future mature UAM will converge on a common formal core, or whether “Unified Action Model” will continue to denote a family of architectures that solve different unification problems—semantic preservation, multimodal prediction, cross-embodiment transfer, or world–action consistency—under one umbrella.