- The paper introduces MA-HERP, a framework that jointly models continuous movements and discrete actions for effective human-robot collaboration.
- It employs hierarchical decomposition and recursive Bayesian filtering with Allen’s interval algebra to enforce temporal constraints on predictions.
- Experimental results demonstrate high prediction accuracy, robust uncertainty management, and real-time performance critical for safe robotic interactions.
Joint Probabilistic Inference for Human Motions and Actions in Human-Robot Collaboration
Motivation and Problem Statement
Anticipatory behavior is central to the efficacy of Human-Robot Collaboration (HRC). Robotic systems must integrate dense continuous sensing (e.g., body trajectories) with discrete, semantically meaningful inferences (e.g., action categories) under tight real-time constraints. Existing approaches predominantly treat either continuous human movements or discrete human actions in isolation, neglecting the hierarchical and temporally interconnected structure of human activity as it manifests in collaborative, real-world scenarios. Furthermore, practical HRC requires representations that accommodate partial concurrency, hierarchical compositionality, and explicit uncertainty.
To address these gaps, the paper introduces MA-HERP (Movement-Action Hierarchical Estimation and Recursive Prediction), a hierarchical and recursive probabilistic framework that jointly models and predicts human movement and action in HRC. The goal is to achieve online, bidirectional inference and prediction across multiple abstraction levels, thereby bridging the gap between geometric motion forecasting and symbolic action understanding.
Hierarchical Temporal Action-Motion Representation
The core representational structure in MA-HERP is a hierarchical decomposition of human behavior. At the lowest level, continuous movement segments are modeled as sequences in R^d (for example, joint-angle trajectories). At progressively higher levels, these movements are aggregated into discrete, labeled action intervals with explicit temporal boundaries and durations. Composite actions are recursively assembled, yielding a hierarchy of labeled intervals.
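The labeled-interval hierarchy described above can be sketched as a simple recursive data structure. This is an illustrative Python sketch only; the class and field names are hypothetical and the paper does not prescribe a concrete implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionInterval:
    """A labeled action interval with explicit temporal boundaries.
    Composite intervals recursively contain child intervals."""
    label: str
    start: float  # interval onset (seconds)
    end: float    # interval offset (seconds)
    children: List["ActionInterval"] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.end - self.start

    def active_labels(self, t: float) -> List[str]:
        """All labels active at time t, from this level down the hierarchy."""
        if not (self.start <= t <= self.end):
            return []
        labels = [self.label]
        for child in self.children:
            labels.extend(child.active_labels(t))
        return labels

# Example: a composite action containing two primitive movements.
reach = ActionInterval("reach", 0.0, 1.2)
grasp = ActionInterval("grasp", 1.0, 1.8)
task = ActionInterval("reach-and-grasp", 0.0, 1.8, children=[reach, grasp])
print(task.active_labels(1.1))  # ['reach-and-grasp', 'reach', 'grasp']
```

The recursive `active_labels` query is exactly the "explicit mapping between continuous observations and active symbolic labels at any instant" that the framework relies on.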
Temporal constraints among action intervals are formalized using Allen’s interval algebra, ensuring only feasible compositions are permitted. For instance, the system disallows overlapping actions for the same limb but permits them for distinct effectors, and it encodes admissible concurrency, sequencing, and containment relations as level-specific composition tables. Importantly, these constraints are enforced as hard or soft plausibility factors within the probabilistic model, thereby unifying symbolic and geometric reasoning.
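A minimal sketch of how an Allen relation between two intervals can be classified and checked against a level-specific composition table. The function names and the `same_limb_table` are hypothetical illustrations of the constraint described above (no overlap for the same limb):

```python
def allen_relation(a, b, eps=1e-9):
    """Classify the Allen relation between intervals a=(start, end) and b.
    Covers the forward relations; inverses can be obtained by swapping."""
    (as_, ae), (bs, be) = a, b
    if abs(as_ - bs) < eps and abs(ae - be) < eps:
        return "equals"
    if ae < bs:
        return "before"
    if abs(ae - bs) < eps:
        return "meets"
    if abs(as_ - bs) < eps and ae < be:
        return "starts"
    if abs(ae - be) < eps and as_ > bs:
        return "finishes"
    if as_ > bs and ae < be:
        return "during"
    if as_ < bs and bs < ae < be:
        return "overlaps"
    return "other"  # an inverse relation; classify via allen_relation(b, a)

def feasible(a, b, allowed):
    """Hard plausibility check: is the relation admitted by the
    level-specific composition table (a set of relation names)?"""
    return allen_relation(a, b) in allowed

# Same-limb actions may not overlap; sequencing is admissible.
same_limb_table = {"before", "meets"}
print(feasible((0.0, 1.0), (1.0, 2.0), same_limb_table))  # True ("meets")
print(feasible((0.0, 1.5), (1.0, 2.0), same_limb_table))  # False ("overlaps")
```

A soft variant would return a plausibility weight instead of a boolean, which is how such checks enter the probabilistic model as factors.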
This hierarchical structure provides explicit mappings between continuous observations and active symbolic labels at any instant. The framework supports both hard and soft enforcement of temporal constraints, enabling robust operation under noisy or ambiguous segmentation.
Joint Probabilistic Model and Recursive Inference
MA-HERP formulates behavior modeling as a joint probability distribution over the sets of action intervals at all hierarchy levels and the observed motion trajectories. The factor-graph construction makes the coupling terms explicit:
- Plausibility factors for Allen temporal relations,
- Hierarchical composition consistency penalties,
- Label-conditioned continuous movement likelihoods,
- Duration-regularized, context-sensitive action transition priors.
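The four factor families above can be sketched as one joint distribution. The notation here is illustrative rather than the paper's exact formulation: 𝓘 is the set of intervals across levels, ℓ_I and d_I the label and duration of interval I, and x_{1:T} the observed trajectory:

```latex
p(\mathcal{I}, x_{1:T}) \;\propto\;
  \underbrace{\prod_{(a,b)} \phi_{\mathrm{Allen}}(a, b)}_{\text{temporal plausibility}}
  \underbrace{\prod_{c} \phi_{\mathrm{comp}}(c)}_{\text{composition consistency}}
  \underbrace{\prod_{I \in \mathcal{I}} p\!\left(x_{s_I:e_I} \mid \ell_I\right)}_{\text{movement likelihoods}}
  \underbrace{\prod_{I \to I'} p\!\left(\ell_{I'} \mid \ell_I\right) p\!\left(d_{I'} \mid \ell_{I'}\right)}_{\text{transition and duration priors}}
```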
Because exact enumeration of interval configurations and relations is intractable, inference proceeds recursively in an online fashion, inspired by Bayesian filtering. At each timestep, action-label beliefs are propagated forward, updated via continuous observation likelihoods, and regularized with temporal constraints and duration priors (using HSMM hazard functions). This enables joint forward prediction of both future body configurations and likely forthcoming action events.
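One recursive step of such a duration-regularized filter can be sketched as follows. This is a generic HSMM-style predict-update cycle under assumed shapes and names, not the paper's exact algorithm:

```python
import numpy as np

def recursive_update(belief, transition, obs_likelihood, hazard, dwell):
    """One filtering step over K discrete action labels (illustrative).
    belief:         prior label probabilities, shape (K,)
    transition:     label transition matrix T[k, j] = p(j | k), shape (K, K)
    obs_likelihood: p(observation | label) at this timestep, shape (K,)
    hazard:         HSMM hazard h(dwell)[k] = P(label k ends | dwell time)
    dwell:          time already spent in the current segment
    """
    h = hazard(dwell)                        # per-label termination prob, (K,)
    stay = (1.0 - h) * belief                # mass remaining in each label
    switch = transition.T @ (h * belief)     # mass redistributed by transitions
    predicted = stay + switch                # duration-regularized prediction
    posterior = predicted * obs_likelihood   # observation update
    return posterior / posterior.sum()       # normalize

# Toy example: two labels, deterministic alternation, flat hazard.
T = np.array([[0.0, 1.0],
              [1.0, 0.0]])
hazard = lambda d: np.array([0.3, 0.3])
b = np.array([0.9, 0.1])
b = recursive_update(b, T, np.array([0.2, 0.8]), hazard, dwell=5)
print(b)
```

Iterating this step while feeding the predicted labels back into label-conditioned motion models yields the joint movement-and-action forecast described above.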
The recursive cycle is flexibly instantiated to handle multiple parallel chains (e.g., bimanual or multi-effector scenarios) and multiple abstraction levels. Each recursive step produces an inference summary that conditions both motion and symbolic action probability flows, supporting top-down and bottom-up evidence propagation.
Experimental Validation
Preliminary empirical analysis leverages controlled musculoskeletal simulations of upper-limb reaching, generated with Bioptim. Label-conditioned neural sequence models (Transformers) are trained on low-dimensional joint trajectories, covering all valid transitions among defined target regions. Models are evaluated under both noise-free and noise-perturbed scenarios to interrogate motion prediction, action inference, and computational tractability.
Movement Prediction
When trained and evaluated in the noise-free regime, label-conditioned models achieve PCCs near 0.99 and RMSEs on the order of 0.01 radians, indicating negligible discrepancy from ground truth. With noise levels up to 30%, prediction error and variance predictably increase, with corresponding degradation in both PCC and RMSE. This establishes that the framework can express calibrated uncertainty and detect sensor degradation, which is crucial for safe HRC deployment.
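The two reported metrics, Pearson correlation coefficient (PCC) and root-mean-square error (RMSE), can be computed per trajectory as follows; the synthetic trajectory below is a stand-in, not data from the paper:

```python
import numpy as np

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted series."""
    yt, yp = np.asarray(y_true), np.asarray(y_pred)
    return np.corrcoef(yt, yp)[0, 1]

def rmse(y_true, y_pred):
    """Root-mean-square error (in radians for joint-angle trajectories)."""
    yt, yp = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((yt - yp) ** 2))

# A near-perfect prediction yields PCC close to 1 and a small RMSE.
t = np.linspace(0, 1, 100)
truth = np.sin(2 * np.pi * t)
pred = truth + 0.01 * np.cos(5 * t)  # small perturbation
print(round(pcc(truth, pred), 3), round(rmse(truth, pred), 4))
```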
Action Prediction and Classification
Discrete action-label inference is performed with both incremental prefix (expanding-window) and fixed sliding-window schemes. In noise-free settings, classifiers achieve up to 1.00 accuracy for certain transition classes and, in general, high F1 scores when sufficient context is observed. Under noisy observation conditions, label inference becomes delayed or less confident, particularly for specific transitions (e.g., B-to-AC), which is reflected in decreased recall. These results confirm the framework's adaptive robustness and provide evidence that real-time, early intent inference is achievable in favorable regimes.
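The two windowing schemes differ only in what context the classifier sees at each step; a minimal sketch (function names are illustrative):

```python
def expanding_windows(seq, min_len=1):
    """Incremental prefix scheme: classify on a growing prefix of the
    observation sequence, so context accumulates over time."""
    return [seq[:i] for i in range(min_len, len(seq) + 1)]

def sliding_windows(seq, width):
    """Fixed sliding-window scheme: classify on only the most recent
    `width` samples, bounding memory and latency."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]

obs = [1, 2, 3, 4, 5]
print(expanding_windows(obs, min_len=3))  # [[1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]]
print(sliding_windows(obs, width=3))      # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

The prefix scheme tends to gain confidence as context grows, while the sliding window trades context for constant per-step cost, which matches the accuracy-versus-latency behavior reported above.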
Online inference for next-step movement prediction completes in less than 0.2 seconds per trajectory, and label prediction/classification is achieved in the microsecond range, far below the operational timescales required for collaborative safety and fluency. This supports the practical feasibility of deploying MA-HERP in closed-loop HRC systems.
Qualitative Example: End-to-End Recursive Prediction
Execution over sequences of composite reaching actions demonstrates the ability of MA-HERP to update action beliefs and movement predictions continuously as new sensory observations arrive. Early convergence of action beliefs enables the robotic system to plan and initiate collaborative behaviors in advance of motion completion, while uncertainty or delayed commitments naturally promote conservative (safer) robot policies.
Implications and Future Directions
MA-HERP advances the theoretical and practical frontiers of HRC by providing a unified, probabilistic, and hierarchical model that directly accommodates both continuous movement prediction and action-level intent inference. The explicit integration of Allen temporal constraints and hierarchical compositions into the inference machinery supports scenarios with complex, concurrent, and temporally interdependent tasks. Calibrated uncertainty expressions facilitate adaptive robot behavior under sensor noise and partial observability, a prerequisite for real-world robustness.
From a theoretical perspective, the approach clarifies the interplay between symbolic and geometric representations in collaborative robotics and suggests avenues for generalizing to more complex action grammars and effectors. Practically, the recursive and tractable nature of the inference algorithms reported makes real-time deployment feasible in collaborative workspaces, with immediate applications in industrial co-manipulation, assistive robotics, and intent-aware safety frameworks.
Potential future research directions include:
- Extension to higher-dimensional kinematics, contact-rich manipulation, and multi-agent interactions,
- Formal analysis of the trade-offs between observation memory (window lengths), prediction horizon, and computational cost,
- Real-time closed-loop validation in physical HRC environments with exogenous disturbances and latent action intents,
- Investigation of iterative re-labelling impacts and mechanisms to mitigate oscillatory or unstable inferences,
- Structured uncertainty calibration and integration with risk-sensitive robot planning and control policies.
Conclusion
MA-HERP establishes a principled approach for joint, online prediction and estimation of human movements and actions using hierarchical interval-based representations and recursive, probabilistic inference. Its strong empirical results in motion and action forecasting, robust performance under observation noise, and computational efficiency position it as a relevant model for embedding intent-aware, proactive reasoning into collaborative robotic systems. The framework offers a fertile basis for further theoretical development and practical validation in increasingly realistic and challenging HRC scenarios.