- The paper demonstrates a hierarchical control framework where a teacher-student architecture enables intent inference solely from proprioceptive history.
- It employs a pre-trained velocity-tracking backbone for low-level locomotion, ensuring robust tracking and compliance across varied terrains.
- Simulation and real-world experiments show the framework’s scalability to heterogeneous multi-robot teams and effective transport of irregular payloads.
PAINT: Partner-Agnostic Intent-Aware Cooperative Transport with Legged Robots
Physical human-robot interaction (pHRI) in collaborative transport tasks, particularly with heterogeneous partners and in complex environments, poses long-standing challenges in robust intent estimation and compliant motion generation. High-capacity quadrupedal robots offer a promising morphology for these settings; however, state-of-the-art approaches commonly depend on expensive, fragile end-effector force-torque (FT) sensors and explicit payload tracking, which limits practical deployment, especially in cluttered, rugged environments or flexible team settings. The "PAINT: Partner-Agnostic Intent-Aware Cooperative Transport with Legged Robots" (2604.12852) framework addresses these gaps with a lightweight, hierarchical control method that lets robots infer partner motion intent purely from proprioceptive feedback, decoupling intent inference from locomotion execution.
Figure 1: Overview of the PAINT partner-agnostic intent-aware cooperative transport architecture, detailing hierarchical controller composition and deployment paradigm.
Hierarchical Control and Intent Inference
PAINT formalizes cooperative transport as a POMDP, with the true coupled robot-payload-partner state only partially observable. The core technical contribution is the decoupling of high-level (HL) intent understanding from terrain-robust low-level (LL) locomotion:
- High-Level Policy and Intent Estimator: The HL controller is built around a teacher-student architecture. The teacher policy is trained with privileged interaction-wrench supervision, directly mapping FT observations to compliant arm/base actions. The deployable student, used at runtime, receives only proprioceptive joint histories. Embedded in the student is an intent estimator, a network trained via regression to map joint and velocity histories to an estimated wrench, whose output augments the proprioceptive input and encapsulates latent intent cues. The student actor is regularized by a KL penalty against the teacher, which transfers the teacher's finely shaped intent-to-action mappings to the sensor-restricted policy.
- Low-Level Locomotion Backbone: The LL component employs a pre-trained velocity-tracking backbone, using terrain, base commands, and proprioception, enabling robust planar base tracking under varying dynamic loads and terrain irregularity. This modular design explicitly leverages advances in learned legged locomotion while isolating intent interpretation to the HL policy.
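The two training signals named in the bullets above, the wrench regression for the intent estimator and the KL penalty tying the student to the teacher, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes diagonal-Gaussian action distributions, the KL direction KL(student || teacher), a mean-squared-error regression loss, and a planar wrench (Fx, Fy, Mz). The function names and the weights `beta` and `lam` are hypothetical.

```python
import math


def gaussian_kl(mu_s, std_s, mu_t, std_t):
    """KL(student || teacher) for diagonal Gaussians, summed over action dims.

    Assumption: both policies output a mean and a standard deviation per
    action dimension; the source does not specify the distribution family.
    """
    kl = 0.0
    for ms, ss, mt, st in zip(mu_s, std_s, mu_t, std_t):
        kl += math.log(st / ss) + (ss ** 2 + (ms - mt) ** 2) / (2.0 * st ** 2) - 0.5
    return kl


def wrench_regression_loss(wrench_est, wrench_true):
    """Mean squared error between estimated and privileged (simulated) wrench."""
    n = len(wrench_true)
    return sum((e - t) ** 2 for e, t in zip(wrench_est, wrench_true)) / n


def student_loss(policy_loss, kl, wrench_loss, beta=1.0, lam=1.0):
    """Hypothetical combined objective: RL loss + KL penalty + estimator loss."""
    return policy_loss + beta * kl + lam * wrench_loss


if __name__ == "__main__":
    # Student and teacher disagree by one unit in the first action dimension.
    kl = gaussian_kl([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
    # Planar wrench (Fx, Fy, Mz): estimate vs. privileged ground truth.
    reg = wrench_regression_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
    print(kl, reg, student_loss(0.0, kl, reg, beta=0.1))
```

At deployment only the student and its estimator would run; the teacher and the privileged wrench are needed only to compute these training losses.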
The reward function encodes force/torque alignment and payload stability and penalizes height deviations and high actuation, yielding smooth, compliant motions without dependence on FT sensors.
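One way to picture a reward of this shape is the sketch below. It is an assumption-laden illustration rather than the paper's reward: alignment is modeled as the cosine similarity between the partner's planar force and the robot's planar velocity, payload stability as a squared height error, and actuation as a sum of squared joint torques. All names and weights (`w_align`, `w_height`, `w_torque`) are hypothetical.

```python
import math


def cosine_alignment(force_xy, vel_xy, eps=1e-6):
    """Cosine similarity between partner force and robot velocity in the plane."""
    dot = force_xy[0] * vel_xy[0] + force_xy[1] * vel_xy[1]
    norm = math.hypot(*force_xy) * math.hypot(*vel_xy)
    return dot / (norm + eps)


def transport_reward(force_xy, vel_xy, height, height_ref, torques,
                     w_align=1.0, w_height=5.0, w_torque=1e-4):
    """Hypothetical reward: reward alignment, penalize height error and effort."""
    align = cosine_alignment(force_xy, vel_xy)
    height_pen = (height - height_ref) ** 2       # payload stability proxy
    torque_pen = sum(t * t for t in torques)      # actuation effort
    return w_align * align - w_height * height_pen - w_torque * torque_pen


if __name__ == "__main__":
    torques = [5.0] * 12
    # Yielding to the partner's push scores higher than resisting it.
    follow = transport_reward((10.0, 0.0), (0.5, 0.0), 0.50, 0.50, torques)
    resist = transport_reward((10.0, 0.0), (-0.5, 0.0), 0.50, 0.50, torques)
    print(follow > resist)
```

The force term here would come from privileged simulation data during training, which is consistent with the deployed policy not needing an FT sensor.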
Simulation and Real-World Benchmarks
PAINT is trained in highly parallelized Isaac Gym simulation, with rigorous domain randomization across payload geometry, mass (0–10 kg nominal, up to 28 kg in evaluation), and partner interaction profiles synthesized via stochastic wrench scheduling. The policy architecture is a compact multilayer perceptron (MLP). Real-world deployment on quadrupedal manipulators confirms stable, compliant collaborative transport with human and robotic leaders, across diverse team sizes, varied payloads, and unstructured terrains.
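The randomization described above can be illustrated with a small sampler. This is a sketch under assumptions, not the training code: only the 0–10 kg nominal mass range comes from the text, while the piecewise-constant schedule, the force and torque bounds, the hold durations, and all function names are hypothetical stand-ins for "stochastic wrench scheduling".

```python
import random


def sample_payload_mass(rng, low=0.0, high=10.0):
    """Sample a payload mass in kg from the nominal training range."""
    return rng.uniform(low, high)


def sample_wrench_schedule(rng, horizon_s=20.0, max_force=30.0,
                           max_torque=5.0, hold_range=(1.0, 4.0)):
    """Piecewise-constant planar partner wrench (Fx, Fy, Mz) over one episode.

    Returns a list of (start_time, hold_duration, wrench) segments that
    exactly tile [0, horizon_s). Bounds and hold times are assumptions.
    """
    schedule = []
    t = 0.0
    while t < horizon_s:
        hold = rng.uniform(*hold_range)
        last = t + hold >= horizon_s
        if last:
            hold = horizon_s - t  # clip the final segment to the horizon
        wrench = (rng.uniform(-max_force, max_force),
                  rng.uniform(-max_force, max_force),
                  rng.uniform(-max_torque, max_torque))
        schedule.append((t, hold, wrench))
        if last:
            break
        t += hold
    return schedule


def wrench_at(schedule, t):
    """Look up the scheduled wrench at time t; zero outside the schedule."""
    for start, hold, wrench in schedule:
        if start <= t < start + hold:
            return wrench
    return (0.0, 0.0, 0.0)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    print(sample_payload_mass(rng))
    for segment in sample_wrench_schedule(rng):
        print(segment)
```

In a parallelized simulator, each environment would draw its own mass and schedule at reset so that the policy sees many partner behaviors per batch.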
Figure 2: Capabilities of PAINT in real-world and simulation: intent inference purely from proprioceptive histories enables robust transport with different partners, payloads, and terrains.
Key quantitative outcomes include:
Generalization Across Robot Embodiments and Team Heterogeneity
PAINT demonstrates zero-shot transferability to robot embodiments beyond the training morphology. Experiments show that swapping the proprioceptive signal source (e.g., using leg-only proprioception on a manipulator-less quadruped) preserves compliant transport and intent inference capabilities. Furthermore, because the policy is decentralized, heterogeneous teams (e.g., quadrupeds with and without arms) synchronize through payload-coupled interaction, without the need for global state broadcast or explicit coordination.
Figure 4: Quantitative end-effector intent-alignment with increasing payload mass, showcasing superior or competitive performance for both single and multi-robot teams with the PAINT controller.
Figure 5: Measured versus commanded interaction wrenches for two robot embodiments, revealing smooth, compliant tracking even with synthetic partner wrench profiles.
Limitations and Failure Analysis
While the PAINT architecture robustly infers intent from interaction-induced proprioception, its purely reactive nature exposes limitations in tasks demanding anticipative behaviors (e.g., obstacle-rich navigation or semantic awareness). Without explicit perception, the policy cannot proactively avoid obstacles, delegating all corrective action to leader-applied intent, which may result in failure under ambiguous or delayed guidance.
Figure 6: HL policy failure case—without explicit obstacle awareness, unsafe base commands may still be issued, potentially leading to collision or locomotion failure.
Implications and Future Directions
PAINT provides a scalable, sensor-minimal paradigm for intent-aware physical collaboration in legged robots, eschewing reliance on fragile force/torque sensors or sophisticated external state estimators. From a practical perspective, this reduces system cost and integration overhead, enabling real-world deployment in unstructured, infrastructure-free environments and with flexible human/robot team compositions.
The approach validates that time-correlated proprioceptive signals, shaped through payload-coupled interactions, suffice for robust and physically meaningful intent estimation. This insight extends theoretical understanding of partial observability in pHRI and multi-agent collaborative manipulation. Future expansions may encompass: moving beyond planar wrench estimation to full 6D interaction modeling; augmenting the policy with onboard perception for safe, high-level planning and obstacle avoidance; and extending the framework to collaborative grasping, lifting, and dexterous reorientation.
Conclusion
The PAINT framework offers a deployable, robust, and scalable control methodology for intent-aware cooperative transport with legged robots that operates exclusively on proprioceptive input. Its hierarchical design, teacher-student training, and demonstrated real-world transferability support its applicability to complex multi-agent manipulation and human-robot collaboration tasks, and point to future research in sensor-minimal, scalable collaborative robotics.