Object-Relative Control in Robotics and AI

Updated 13 September 2025

Object-relative control is an approach that defines actions relative to specific objects, facilitating clear object segmentation and robust decision-making.
It enhances compositionality, sample efficiency, and adaptability in complex, high-dimensional environments such as robotics and autonomous navigation.
Key techniques include spatial mixture models, hierarchical controllers, and uncertainty quantification, which together improve policy performance and sim2real transfer.

Object-relative control refers to methods and frameworks in perception, planning, and actuation that represent and reason about agent actions with respect to specific objects or entities, rather than in purely global (world-centric), image-centric, or agent-centric coordinates. This paradigm substantially enhances compositionality, sample efficiency, adaptability, and robustness in scenarios where complex interactions, partial observability, and dynamic environments dominate, such as robotics, autonomous vehicles, and human–object interaction synthesis.

1. Principled Formulation in Model-Based RL: Object-based Perception Control (OPC)

Object-relative control in model-based reinforcement learning is anchored in the Joint Perception and Control as Inference (PCI) framework (Li et al., 2019), which unifies perception and control as a single inference problem under POMDPs. Here, perception infers object-level latent states while control jointly optimizes policies that interact with these object representations. The evidence lower bound (ELBO) decomposes into a perception term and a control term that are optimized together:

$\log p(o^{\geq t}, x^{\leq t} \mid a^{<t}) \geq \mathbb{E}_{x^{\leq t} \sim q^w}\left[\log \prod_{j=1}^{t} p(x^j \mid ...) - D_{KL}(...) \right] + \mathbb{E}_{x^{\leq t} \sim q^w}\left[\log \Diamond_1\right]$

OPC instantiates this by utilizing a spatial mixture model and iterative EM to segment pixels into object groups, maintaining a separate latent parameter vector $\theta_k$ for each object, optimized via gradient ascent:

$\theta_k^{t+1} = \theta_k^t + \alpha \sum_{i=1}^{D} \eta_{i,k}^t \cdot \frac{\psi_{i,k}^t - x_i^{t+1}}{\sigma^2} \cdot \frac{\partial \psi_{i,k}^t}{\partial \theta_k^t}$

Empirical results in high-dimensional pixel environments show not only improved perceptual grouping but also significant gains in accumulated rewards over baselines, attributed to incorporating object-level inductive biases which simplify the control task and enable efficient reward assignment.
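As a rough illustration of the perceptual grouping step, the sketch below computes soft pixel-to-object assignments of the kind denoted $\eta_{i,k}$ above. It assumes scalar pixel intensities, uniform mixing weights, and a fixed Gaussian noise level `sigma`; the function name `e_step` and the toy data are hypothetical and not taken from OPC.

```python
import math

def e_step(pixels, predictions, sigma=0.25):
    """Soft-assign each pixel to one of K Gaussian mixture components.

    pixels:      list of D observed intensities x_i
    predictions: K lists of D predicted intensities psi_{i,k}
    Returns eta[i][k]; each row sums to 1 over k.
    (Uniform mixing weights and a shared sigma are assumptions.)
    """
    num_components = len(predictions)
    eta = []
    for i, x in enumerate(pixels):
        # Unnormalised Gaussian log-likelihood of pixel i under component k.
        logits = [-(x - predictions[k][i]) ** 2 / (2.0 * sigma ** 2)
                  for k in range(num_components)]
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(logits)
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        eta.append([w / total for w in weights])
    return eta

# Toy scene: two "objects", one dark (0.0) and one bright (1.0).
pixels = [0.0, 0.1, 0.9, 1.0]
predictions = [[0.0] * 4, [1.0] * 4]
eta = e_step(pixels, predictions)
# Hard segmentation: the component with the largest responsibility per pixel.
segmentation = [max(range(2), key=lambda k: row[k]) for row in eta]
```

In this toy case the two dark pixels are grouped with the first component and the two bright pixels with the second; in a full model the predictions $\psi_{i,k}$ would come from a decoder applied to $\theta_k$.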

2. Hierarchical Composition of Object-Centric Controllers

In robotics, complex manipulation tasks often require simultaneous fulfillment of multiple sub-goals. Object-relative controllers manage these by defining control objectives in object-aligned coordinate frames and composing them hierarchically via nullspace projections (Sharma et al., 2020). Each primitive controller (for position, force, or rotation) operates relative to object features (e.g., axis, surface normal), and their errors are prioritized and projected into the nullspace of higher-priority controllers:

$\Delta_x^n = K_x \mathcal{N}([u^0, ..., u^{n-1}]) \delta_x(x_d^n, u^n, x_c)$

An RL policy selects the controller priorities. Because object-centric actions abstract away the specifics of geometry and sensor embodiment, this design yields sample-efficient learning, zero-shot generalization to unseen environments, and robust sim2real transfer.
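The prioritized composition can be sketched as follows. This is a minimal illustration, assuming each $u^n$ is the Cartesian direction along which controller $n$ acts and that $\mathcal{N}(\cdot)$ is the orthogonal projector onto the complement of the higher-priority directions; the helper names and the scalar `gain` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(directions, tol=1e-9):
    """Gram-Schmidt over the higher-priority control directions."""
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip directions that are already spanned
            basis.append([ri / norm for ri in r])
    return basis

def nullspace_project(error, higher_priority_directions):
    """Remove from `error` every component along higher-priority directions."""
    projected = list(error)
    for q in orthonormal_basis(higher_priority_directions):
        c = dot(projected, q)
        projected = [p - c * qi for p, qi in zip(projected, q)]
    return projected

def composed_command(errors, directions, gain=1.0):
    """Sum prioritized errors; errors[n] is filtered by directions[:n]."""
    command = [0.0] * len(errors[0])
    for n, err in enumerate(errors):
        filtered = nullspace_project(err, directions[:n])
        command = [c + gain * f for c, f in zip(command, filtered)]
    return command

# Priority 0 acts along z (e.g. a surface normal); priority 1 acts along x
# but its raw error also has a z component that must not disturb priority 0.
directions = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
errors = [[0.0, 0.0, 0.2], [0.3, 0.0, 0.4]]
command = composed_command(errors, directions)
```

Here the z component of the lower-priority error is discarded, so the higher-priority objective is left undisturbed. Reordering `directions` and `errors` corresponds to the policy choosing a different priority ordering.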

3. Object-Aware Representation Learning in Visuomotor Control

Object-relative control leverages representation learning frameworks that explicitly divide scenes into object-centric latent variables, as opposed to object-agnostic models (Heravi et al., 2022). Slot Attention mechanisms partition input features into K slots, each corresponding to a scene object, optimizing reconstruction and mask-based localization in a self-supervised regime:

$L = \frac{1}{N} \sum_{i=1}^{N} \left[ I_i - \sum_{k=1}^{K} M_{ik} \times R_{ik} \right]^2$

These object-level representations are reused for downstream control policy learning and object localization, yielding state-of-the-art accuracies (e.g., 95% PCK for keypoint localization in multi-object scenes) and marked improvement in policy performance and sample efficiency, particularly in low-data regimes.
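The reconstruction objective above can be written out directly. The sketch below assumes scalar pixel intensities and already-computed masks $M_{ik}$ and per-slot reconstructions $R_{ik}$; it shows only the loss, not the Slot Attention module, and the function name is hypothetical.

```python
def reconstruction_loss(image, masks, reconstructions):
    """Mean squared error between the image and the mask-weighted slot mixture.

    image:           list of N pixel values I_i
    masks:           K lists of N mask weights M_{ik}
    reconstructions: K lists of N per-slot pixel values R_{ik}
    """
    n = len(image)
    total = 0.0
    for i in range(n):
        # Composite pixel i from all K slots, weighted by their masks.
        mixed = sum(m[i] * r[i] for m, r in zip(masks, reconstructions))
        total += (image[i] - mixed) ** 2
    return total / n

image = [1.0, 0.0]
masks = [[1.0, 0.0], [0.0, 1.0]]  # slot 0 owns pixel 0, slot 1 owns pixel 1
perfect = reconstruction_loss(image, masks, [[1.0, 0.5], [0.2, 0.0]])
imperfect = reconstruction_loss(image, masks, [[1.0, 0.5], [0.2, 0.5]])
```

Because each slot contributes only where its mask is non-zero, the values a slot predicts outside its own region do not affect the loss, which is what encourages the masks to localize objects.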

4. Object-Relative Navigation with Minimal Sensing

Recent advances enable closed-loop, object-relative navigation in UAVs and mobile robots using minimal sensors (IMU, RGB camera) and computationally optimized deep networks for 6-DoF object pose estimation (Jantos et al., 8 Oct 2024). AI-based predictors (e.g., PoET) infer semantic object poses, which, after camera-agnostic correction via homographies, are fused with inertial data in EKF-based frameworks. Initialization and correction equations ensure accurate object-to-world frame mapping:

$R_{O_kW} = R_{O_kC} R_{IC}^T R_{WI}^T, \qquad p_{O_kW} = p_{O_kC} - R_{O_kW} ( R_{WI} p_{IC} + p_{WI} )$

Experiments on challenging inspection tasks demonstrate centimeter-level accuracy and robust operation under payload and computational constraints.
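The initialization equations above translate directly into code. The sketch below implements them as written, using plain 3x3 lists; the reading that $R_{AB}$ rotates coordinates from frame $B$ into frame $A$ and that $p_{AB}$ is the origin of $B$ expressed in $A$ is an assumption, and all function names and numeric values are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def init_object_world(R_OC, p_OC, R_IC, p_IC, R_WI, p_WI):
    """Object-to-world initialization from one object pose measurement.

    R_OC, p_OC: object pose measurement relative to the camera
    R_IC, p_IC: IMU-camera extrinsic calibration
    R_WI, p_WI: current IMU pose estimate in the world frame
    """
    # R_OW = R_OC R_IC^T R_WI^T
    R_OW = matmul(matmul(R_OC, transpose(R_IC)), transpose(R_WI))
    # Term in parentheses: R_WI p_IC + p_WI
    p_WC = [a + b for a, b in zip(matvec(R_WI, p_IC), p_WI)]
    # p_OW = p_OC - R_OW (R_WI p_IC + p_WI)
    p_OW = [a - b for a, b in zip(p_OC, matvec(R_OW, p_WC))]
    return R_OW, p_OW

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
YAW_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

R_OW, p_OW = init_object_world(
    R_OC=IDENTITY, p_OC=[0.0, 0.0, 2.0],
    R_IC=IDENTITY, p_IC=[0.1, 0.0, 0.0],
    R_WI=YAW_90, p_WI=[1.0, 0.0, 0.0],
)
```

In a filter, this mapping would typically be computed once when an object is first observed and then refined by subsequent measurement updates.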

5. Uncertainty-Aware Object-Relative State Estimation

Accurate uncertainty quantification for deep 6D object pose predictions is vital for object-relative state estimation (Jantos et al., 1 Sep 2025). Aleatoric uncertainty is inferred by augmenting pre-trained predictors with detached MLP heads for translation/rotation, outputting diagonal covariance matrices:

$\Sigma_{\hat{t}} = \text{diag}(\sigma_x^2, \sigma_y^2, \sigma_z^2), \qquad \Sigma_{\theta} = \text{diag}(\sigma_{\theta_x}^2, \sigma_{\theta_y}^2, \sigma_{\theta_z}^2)$

These uncertainties are incorporated as time-varying measurement covariances in Kalman filters, enabling dynamic anchor object selection and improved state estimates, while maintaining negligible computational overhead (∼0.6% increase).
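To illustrate how a predicted diagonal covariance acts as a time-varying measurement covariance, the sketch below performs a Kalman measurement update axis by axis. It is a simplification that assumes a direct position measurement (identity measurement matrix) and a diagonal state covariance; the function name and numbers are hypothetical.

```python
def kalman_update_diag(x, P_diag, z, R_diag):
    """Per-axis Kalman update with measurement variances supplied per call.

    x, P_diag: prior state and its variances
    z, R_diag: measurement and the variances predicted for this measurement
    """
    x_new, P_new = [], []
    for xi, pi, zi, ri in zip(x, P_diag, z, R_diag):
        gain = pi / (pi + ri)            # large ri -> small gain
        x_new.append(xi + gain * (zi - xi))
        P_new.append((1.0 - gain) * pi)
    return x_new, P_new

prior = [0.0, 0.0, 0.0]
prior_var = [1.0, 1.0, 1.0]
measurement = [1.0, 1.0, 1.0]
# The network is confident in x, unsure in y, and very unsure in z.
predicted_var = [0.01, 1.0, 100.0]
posterior, posterior_var = kalman_update_diag(prior, prior_var, measurement, predicted_var)
```

The axis with the smallest predicted variance is pulled most strongly toward the measurement, while the most uncertain axis barely moves.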

6. Parameterization and Optimization under Relative Measurement Constraints

In distributed control synthesis, object-relative measurements impose structural constraints, often encoded as differential outputs (e.g., $y = C_2 x$, where each row of $C_2$ has one $1$ and one $-1$) (Marshall et al., 2023). Controller synthesis uses a convex parameterization via a Youla parameter $Q$ constrained to a relative subspace:

$\min_{Q} \| T_1 - T_2 Q T_3 \| \quad \text{subject to} \quad Q \in S_{\text{rel}}(C_2),\; Q \text{ stable}$

Additional network structural constraints (sparsity, delays) are incorporated as subspace constraints on $Q$, retaining convexity and enabling scalable controller design for large systems.
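The structure of the relative measurement matrix $C_2$ can be made concrete with a small sketch. It assumes one scalar state per agent and a hypothetical list of measured pairs; it illustrates only the measurement structure, not the Youla-based synthesis itself.

```python
def relative_measurement_matrix(num_states, edges):
    """Row r encodes the relative measurement x[i] - x[j] for edges[r] = (i, j)."""
    C2 = []
    for i, j in edges:
        row = [0] * num_states
        row[i] = 1
        row[j] = -1
        C2.append(row)
    return C2

def measure(C2, x):
    return [sum(c * xi for c, xi in zip(row, x)) for row in C2]

# Three agents in a line; each neighbouring pair measures its offset.
C2 = relative_measurement_matrix(3, [(0, 1), (1, 2)])
x = [3, 5, 10]
shifted = [xi + 7 for xi in x]  # move every agent by the same amount
y = measure(C2, x)
y_shifted = measure(C2, shifted)
```

Every row sums to zero, so a common shift of all states leaves the output unchanged. This loss of absolute information is the structural limitation that the constraint on $Q$ has to respect.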

7. Object-Centric Scene Graphs and Relative Movement Dynamics

Visual navigation systems leverage object-centric scene graphs (“relative 3D Scene Graph”) and costmaps encoding object-level path planning costs for control in real and simulated environments (Garg et al., 11 Sep 2025). Controllers such as ObjectReact condition rollout predictions on multidimensional WayObject Costmap embeddings:

$E(l)_i = \begin{cases} \sin(l / Z^{i/D}) & \text{if}~i~\text{even} \\ \cos(l / Z^{(i-1)/D}) & \text{if}~i~\text{odd} \end{cases}$

This approach yields markedly improved spatial reasoning, robust performance under sensor variation and cross-embodiment deployment, and superior path planning flexibility.
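The sinusoidal embedding $E(l)$ can be sketched in a few lines. The defaults `dim=8` and `Z=10000.0` are assumptions borrowed from common positional-encoding practice, not values stated in the text, and the function name is hypothetical.

```python
import math

def encode_length(l, dim=8, Z=10000.0):
    """Map a scalar l to a dim-dimensional sinusoidal embedding E(l)."""
    return [
        math.sin(l / Z ** (i / dim)) if i % 2 == 0
        else math.cos(l / Z ** ((i - 1) / dim))
        for i in range(dim)
    ]

zero_code = encode_length(0.0, dim=4)
near = encode_length(1.0)
far = encode_length(25.0)
```

Each sine entry shares its frequency with the cosine entry that follows it, so every frequency contributes a (sin, cos) pair and all entries stay within $[-1, 1]$ however large $l$ becomes.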

In human-object interaction synthesis, object-relative control is articulated through bipartite Relative Movement Dynamics (RMD) graphs (Deng et al., 24 Mar 2025), where each edge encodes a human–object part relationship (approaching, stationary, retracting, unstable). Vision-language models automatically generate fine-grained, spatiotemporal plans that guide reinforcement learning policies, resulting in more naturalistic, physically plausible long-horizon interactions.
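One possible way to represent an RMD graph in code is shown below. The four edge labels come from the description above. The class names, the part names, and the distance-based `classify` heuristic are hypothetical illustrations; in the cited work the plans are generated by a vision-language model rather than by a rule like this.

```python
from dataclasses import dataclass
from enum import Enum

class Movement(Enum):
    APPROACHING = "approaching"
    STATIONARY = "stationary"
    RETRACTING = "retracting"
    UNSTABLE = "unstable"

@dataclass(frozen=True)
class RMDEdge:
    human_part: str
    object_part: str
    movement: Movement

def classify(distances, tol=1e-3):
    """Hypothetical heuristic: label a part-to-part distance trace."""
    deltas = [b - a for a, b in zip(distances, distances[1:])]
    if all(abs(d) <= tol for d in deltas):
        return Movement.STATIONARY
    if all(d <= tol for d in deltas):
        return Movement.APPROACHING   # distance never grows
    if all(d >= -tol for d in deltas):
        return Movement.RETRACTING    # distance never shrinks
    return Movement.UNSTABLE          # mixed behaviour

# A bipartite graph is a set of edges between human parts and object parts.
graph = [
    RMDEdge("right_hand", "handle", classify([1.0, 0.8, 0.5])),
    RMDEdge("pelvis", "seat", classify([0.5, 0.5, 0.5])),
]
```

A sequence of such graphs, one per sub-stage of an interaction, would give a policy a compact description of which body parts should move toward or away from which object parts.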

8. Applications and Directions

Object-relative control frameworks find applications in robotic manipulation (contact-rich, multi-object environments), UAV navigation (object-centric localization for inspection, mapping), state estimation (dynamic re-anchoring), animation (lifelike human–object motions), and distributed control (networked sensor limitations). The paradigm enables leveraging rich inductive biases—object decomposition, semantic segmentation, relational reasoning—for sample-efficient learning, robust transfer, and support for unsupervised and self-supervised learning regimes. Further research includes integrating physics-based and relational inductive biases, uncertainty quantification, multi-modal sensor fusion, and scalable design for large, distributed systems.