Bearing-Box Estimator for Visual Tracking

Updated 18 January 2026

The bearing-box estimator is a vision-based approach that uses 2D and 3D bounding boxes to recover a target's trajectory, velocity, and physical size from monocular observations.
It employs a pseudo-linear state-space model and Kalman filter update, integrating geometric features from detector outputs to overcome observability limitations inherent in bearing-only methods.
By using standard detector outputs and accommodating MAV thrust dynamics, the estimator improves performance in dynamic environments without requiring lateral maneuvers.

A bearing-box estimator is a class of vision-based target motion estimators that exploits the geometric and size information contained in object detection bounding boxes—either in 2D (image plane) or full 3D (cuboid) form—to recover both the trajectory and physical size of a moving target from monocular observations. Unlike classic bearing-only approaches that rely exclusively on the bearing direction and require specific observer maneuvers to break ambiguities, bearing-box estimators leverage standard detector outputs to achieve enhanced observability and estimation performance, including in cases without lateral or high-order observer motion (Ning et al., 2024, Zhang et al., 11 Jan 2026).

1. State and Motion Models

The bearing-box estimator seeks to reconstruct the position, velocity, and physical extent (“scale”) of a moving target of unknown size from visual data produced by a monocular camera with known observer pose (obtained via systems like RTK-GPS or VIO). The estimator utilizes a discrete-time state-space model. The canonical first-order model uses a 7-dimensional state vector: $x = \begin{bmatrix} p_T \\ v_T \\ \ell \end{bmatrix} \in \mathbb{R}^7$ where $p_T \in \mathbb{R}^3$ is the target position, $v_T \in \mathbb{R}^3$ is the velocity, and $\ell > 0$ is the characteristic physical size (e.g., diameter or height) of the target in the direction orthogonal to the bearing line (Ning et al., 2024).

The transition model assumes constant-velocity, constant-size motion: $x_{k+1} = F x_k + q_k,$ with

$F = \begin{bmatrix} I_3 & \Delta t I_3 & 0_{3\times 1} \\ 0_{3\times 3} & I_3 & 0_{3\times 1} \\ 0_{1\times 3} & 0_{1\times 3} & 1 \end{bmatrix},$

and process noise $q_k \sim \mathcal{N}(0, \Sigma_q)$.
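The transition model above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function names `make_transition` and `predict`, the time step, and the noise level are assumptions chosen for the example.

```python
import numpy as np

def make_transition(dt: float) -> np.ndarray:
    """Constant-velocity, constant-size transition for x = [p_T, v_T, ell]."""
    F = np.eye(7)
    F[0:3, 3:6] = dt * np.eye(3)  # position integrates velocity
    return F

def predict(x, P, dt, sigma_q):
    """One prediction step: propagate the mean and the covariance."""
    F = make_transition(dt)
    return F @ x, F @ P @ F.T + sigma_q

# Illustrative values only (hypothetical target state and noise level).
x0 = np.array([10.0, 0.0, 2.0, 1.0, -0.5, 0.0, 0.6])
P0 = np.eye(7)
x1, P1 = predict(x0, P0, dt=0.1, sigma_q=1e-3 * np.eye(7))
```

The position block moves by $\Delta t\, v_T$ while the velocity and the size $\ell$ are carried over unchanged, which is exactly what the last two block rows of $F$ encode.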

In applications involving multi-rotor micro aerial vehicles (MAVs), a second-order kinematic extension incorporating acceleration is used: $x = \begin{bmatrix} p_o^w \\ v_o^w \\ a_o^w \\ \alpha \end{bmatrix},$ with appropriate transition dynamics (Zhang et al., 11 Jan 2026). This accommodates the dynamic coupling between MAV thrust and motion.

2. Construction of Bearing-Box Measurements

2D Bounding-Box (Bearing-Angle) Measurements

For each detected 2D image bounding box:

The box’s pixel center $(q_x, q_y)$ provides a unit bearing vector $g \in \mathbb{R}^3$ in world coordinates, obtained using the camera calibration matrix and the camera rotation.
The box’s size $s_{\text{pix}}$ (e.g., width or height) determines the angle $\theta$ subtended at the camera center:

$\theta = \arccos\left( \frac{l_\text{left}^2 + l_\text{right}^2 - s_{\text{pix}}^2}{2\,l_\text{left} l_\text{right}} \right),$

where $l_\text{left}$ and $l_\text{right}$ are computed from image and camera parameters (Ning et al., 2024).
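As a hedged sketch of the two quantities above, the following computes the bearing and the subtended angle from a 2D box under a pinhole model. It assumes square pixels with focal length `f` and principal point `(cx, cy)`, uses the box width as $s_{\text{pix}}$, and takes $l_\text{left}$, $l_\text{right}$ to be the distances (in pixel units) from the camera center to the left and right box edges on the image plane. The function name and box layout are hypothetical.

```python
import numpy as np

def bearing_and_angle(box, f, cx, cy, R_wc):
    """box = (u_min, v_min, u_max, v_max) in pixels; R_wc rotates camera to world."""
    u_min, v_min, u_max, v_max = box
    qx, qy = 0.5 * (u_min + u_max), 0.5 * (v_min + v_max)

    # Bearing: ray through the box center, rotated into world coordinates.
    ray_c = np.array([qx - cx, qy - cy, f])
    g = R_wc @ (ray_c / np.linalg.norm(ray_c))

    # Distances from the camera center to the left/right edge midpoints.
    l_left = np.linalg.norm([u_min - cx, qy - cy, f])
    l_right = np.linalg.norm([u_max - cx, qy - cy, f])
    s_pix = u_max - u_min

    # Law of cosines, clipped to guard against rounding outside [-1, 1].
    cos_theta = (l_left**2 + l_right**2 - s_pix**2) / (2.0 * l_left * l_right)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return g, theta

# Box centered on the principal point, 100 px wide, f = 500 px.
g, theta = bearing_and_angle((270.0, 220.0, 370.0, 260.0), 500.0, 320.0, 240.0, np.eye(3))
```

For a box centered on the optical axis the result reduces to $\theta = 2\arctan\big(s_{\text{pix}} / (2f)\big)$, which is a convenient sanity check.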

The measurement model becomes:

Noisy bearing: $\hat{g} = g + \mu$, $\mu \sim \mathcal{N}(0, \sigma_\mu^2 I_3)$.
Noisy angle: $\hat{\theta} = \theta + w$, $w \sim \mathcal{N}(0, \sigma_w^2)$.

These measurements yield a set of pseudo-linear equations relating scene geometry to the state.
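One way such pseudo-linear equations can be formed is sketched below. It is an assumption of this example, not a statement of the cited papers' exact equations, that the bearing enters through the orthogonal projector $P_g = I_3 - g g^\top$ (so that $P_g (p_T - p_o) = 0$) and that the angle obeys the small-angle relation $\ell \approx \theta r$ (so that $\theta (p_T - p_o) - \ell g = 0$), where $p_o$ is the observer position and $r$ the range.

```python
import numpy as np

def pseudo_linear_measurement(g, theta, p_obs):
    """Return (z, H) with z = H x for x = [p_T, v_T, ell] in the noise-free case."""
    P_g = np.eye(3) - np.outer(g, g)
    H = np.zeros((6, 7))
    H[0:3, 0:3] = P_g                 # bearing rows: P_g p_T = P_g p_o
    H[3:6, 0:3] = theta * np.eye(3)   # angle rows: theta p_T - ell g = theta p_o
    H[3:6, 6] = -g
    z = np.concatenate([P_g @ p_obs, theta * p_obs])
    return z, H

# Consistency check on a hypothetical geometry.
p_obs = np.array([0.0, 0.0, 1.0])
x_true = np.array([8.0, 3.0, 1.0, 1.0, 0.0, 0.0, 0.5])
r_vec = x_true[0:3] - p_obs
rng = np.linalg.norm(r_vec)
z, H = pseudo_linear_measurement(r_vec / rng, x_true[6] / rng, p_obs)
residual = H @ x_true - z
```

The unknown state appears only linearly, while the measured $g$ and $\theta$ sit inside $H$ and $z$; that is what makes the model pseudo-linear rather than linear.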

3D Bounding-Box (Cuboid) Measurements

Contemporary detectors can provide full 3D bounding boxes, representing the target as an oriented cuboid with side lengths $\ell_1, \ell_2, \ell_3$. The box corners are projected to image coordinates, and a normalized “depth” is computed by solving an overdetermined pseudo-linear system, with one constraint per corner $i$: $Q_i (\mathbf{R}_o^c \mathbf{p}_i^o + \mathbf{p}_o^c) = 0,$ leading to

$\bar{\mathbf{p}}_o^c = -\left( \sum_i Q_i^T Q_i \right)^{-1} \sum_i Q_i^T Q_i \mathbf{R}_o^c \bar{\mathbf{p}}_i^o,$

which provides a direct linear relation between the estimated translation, velocity, and scale (up to normalization) (Zhang et al., 11 Jan 2026).
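The least-squares solution above can be checked numerically. The sketch below assumes that each $Q_i$ is the orthogonal projector onto the plane normal to the viewing ray of corner $i$ (one plausible choice; the source does not spell out $Q_i$ here), and it works in whatever length unit the corner coordinates use, so the normalization step is left out. All names and numbers are hypothetical.

```python
import itertools
import numpy as np

def ray_projector(m):
    """Projector onto the plane orthogonal to ray m (assumed form of Q_i)."""
    m = m / np.linalg.norm(m)
    return np.eye(3) - np.outer(m, m)

def solve_translation(corners_obj, rays_cam, R_oc):
    """Least-squares p_o^c from Q_i (R_o^c p_i^o + p_o^c) = 0 over all corners."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p_i, m_i in zip(corners_obj, rays_cam):
        Q = ray_projector(m_i)
        QtQ = Q.T @ Q
        A += QtQ
        b += QtQ @ (R_oc @ p_i)
    return -np.linalg.solve(A, b)

# Synthetic cuboid: 8 corners, rotated 30 degrees about z, placed 6 units ahead.
sides = np.array([1.0, 0.6, 0.4])
corners = np.array(list(itertools.product([-0.5, 0.5], repeat=3))) * sides
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
R_oc = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 6.0])
rays = corners @ R_oc.T + t_true   # only the ray directions are used
t_est = solve_translation(corners, rays, R_oc)
```

With noise-free rays the recovered translation matches the true one; with noisy corners the same normal equations give the least-squares estimate.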

3. Pseudo-Linear Estimation Framework

Both the 2D and 3D measurement constructions enable a pseudo-linear measurement equation of the form: $z_k = H_k x_k + \nu_k,$ where $z_k$ contains the processed geometric measurements and $H_k$ is constructed from the projection and bounding box geometry. In the Kalman filter context:

Prediction: $\hat{x}_- = F \hat{x}_{k-1}$, $P_- = F P_{k-1} F^\top + \Sigma_q$
Update: $K = P_- H_k^\top [H_k P_- H_k^\top + \Sigma_\nu]^{-1}$, $\hat{x}_k = \hat{x}_- + K(z_k - H_k \hat{x}_-)$, $P_k = (I - K H_k) P_-$

Care is taken to use the pseudoinverse for ill-posed cases and to adjust covariance terms according to measurement uncertainty (Ning et al., 2024, Zhang et al., 11 Jan 2026).
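A generic predict/update pair matching these equations might look as follows. It is a sketch under the assumption that the pseudoinverse is applied to the innovation covariance; the function names are illustrative, and the two-state example is a toy problem rather than the 7-dimensional estimator.

```python
import numpy as np

def kf_predict(x, P, F, Sigma_q):
    """Propagate mean and covariance through the linear motion model."""
    return F @ x, F @ P @ F.T + Sigma_q

def kf_update(x_pred, P_pred, z, H, Sigma_nu):
    """Kalman update; pinv tolerates a singular innovation covariance."""
    S = H @ P_pred @ H.T + Sigma_nu
    K = P_pred @ H.T @ np.linalg.pinv(S)
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(len(x)) - K @ H) @ P_pred
    return x, P

# Toy example: state [position, velocity], position-only measurement.
x_upd, P_upd = kf_update(
    np.zeros(2), np.eye(2), np.array([2.0]), np.array([[1.0, 0.0]]), np.eye(1)
)
```

In the toy example the prior and the measurement have equal variance, so the updated position lands halfway between them and its variance halves, while the unmeasured velocity is left untouched.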

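For MAVs, a second-order measurement is integrated, leveraging the coupling between acceleration and thrust: $z^{(2)} = P_{\tilde{h}} (g e_3), \quad H^{(2)} = [0, 0, P_{\tilde{h}}, 0],$ with $P_{\tilde{h}}$ the orthogonal projector onto the plane normal to the MAV's thrust vector. Here $g e_3$ is the gravity vector, so $g$ denotes the gravitational acceleration rather than the bearing vector of Section 2; projecting the acceleration onto this plane removes the unknown thrust magnitude and leaves only the known gravity term.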
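The second-order measurement can be illustrated with a short sketch. It assumes a translational model in which the acceleration is gravity plus a term along the thrust direction $\tilde{h}$, treats $\alpha$ as a scalar (so the state has 10 entries), and uses an axis convention with gravity along $+e_3$; these are assumptions of the example, and the names are hypothetical.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2
E3 = np.array([0.0, 0.0, 1.0])

def thrust_plane_measurement(h, alpha_dim=1):
    """Return (z2, H2) for the state x = [p, v, a, alpha]."""
    h = h / np.linalg.norm(h)
    P_h = np.eye(3) - np.outer(h, h)  # projector onto the plane normal to thrust
    z2 = P_h @ (G * E3)
    H2 = np.hstack([np.zeros((3, 6)), P_h, np.zeros((3, alpha_dim))])
    return z2, H2

# Any acceleration of the form g*e3 + c*h satisfies H2 x = z2, whatever c is.
h = np.array([0.1, -0.2, -1.0])
h_unit = h / np.linalg.norm(h)
accel = G * E3 + 12.0 * h_unit
x = np.concatenate([np.array([5.0, 1.0, -3.0]), np.array([0.5, 0.0, 0.1]), accel, [0.7]])
z2, H2 = thrust_plane_measurement(h)
residual = H2 @ x - z2
```

Because $P_{\tilde{h}} \tilde{h} = 0$, the thrust magnitude drops out of the constraint, which is why only the thrust direction (the attitude) needs to be known.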

4. Observability Conditions and Theoretical Analysis

General Case (2D/3D Common Object)

Observability in discrete time requires that the observer’s motion have strictly higher polynomial order than the target’s; for a constant-velocity target, observer acceleration (or higher) suffices.
In continuous time, an $n$th-order target trajectory is observable if and only if the observer’s trajectory has at least order $n+1$ at some instant.
At least $n+2$ measurements with a non-zero $(n+1)$th difference are required to reconstruct the system (Ning et al., 2024, Zhang et al., 11 Jan 2026).
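These conditions can be probed numerically by stacking the pseudo-linear equations over time and checking the rank of the resulting matrix. The sketch below is a toy check rather than the papers' proof: it assumes a constant-velocity target, projector-based bearing rows, and small-angle angle rows with $\theta = \ell / r$, and the three observer trajectories are hypothetical.

```python
import numpy as np

def stacked_rank(p_obs, p_tgt, dt, ell=None):
    """Rank of the stacked system in the unknown initial state.

    ell=None  -> bearing-only rows, unknowns [p_0, v_0] (6 columns).
    ell=float -> bearing plus angle rows, unknowns [p_0, v_0, ell] (7 columns).
    """
    rows = []
    for k, (po, pt) in enumerate(zip(p_obs, p_tgt)):
        r = pt - po
        rng = np.linalg.norm(r)
        g = r / rng
        P_g = np.eye(3) - np.outer(g, g)
        t = k * dt
        if ell is None:
            rows.append(np.hstack([P_g, t * P_g]))
        else:
            theta = ell / rng  # small-angle model (assumption)
            rows.append(np.hstack([P_g, t * P_g, np.zeros((3, 1))]))
            rows.append(np.hstack([theta * np.eye(3), theta * t * np.eye(3), -g[:, None]]))
    return int(np.linalg.matrix_rank(np.vstack(rows), tol=1e-9))

dt, n, ell = 0.5, 10, 1.0
t = dt * np.arange(n)[:, None]

def ranks(p_obs, p_tgt):
    """(bearing-only rank out of 6, bearing-plus-angle rank out of 7)."""
    return stacked_rank(p_obs, p_tgt, dt), stacked_rank(p_obs, p_tgt, dt, ell)

target = np.array([10.0, 0.0, 0.0]) + t * np.array([1.0, 0.5, 0.0])
target_los = np.array([10.0, 0.0, 0.0]) + t * np.array([1.0, 0.0, 0.0])

ranks_cv = ranks(t * np.array([0.5, 0.0, 0.0]), target)               # constant velocity
ranks_lat = ranks(0.5 * t**2 * np.array([0.0, 0.3, 0.0]), target)     # lateral acceleration
ranks_los = ranks(0.5 * t**2 * np.array([0.3, 0.0, 0.0]), target_los) # line-of-sight acceleration
```

In this toy setup a constant-velocity observer leaves one unobservable direction in both formulations, lateral acceleration restores full rank in both, and acceleration purely along the line of sight restores full rank only when the angle rows are included.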

MAV-Specific Relaxation

The inclusion of measurements derived from the MAV’s thrust-attitude dynamics yields a fundamental relaxation:

For a second-order MAV model, the state is observable if either (a) the observer’s trajectory exhibits non-zero jerk at some step, given at least four measurements, or (b) the relative acceleration (modulo thrust orientation) is non-degenerate, given three measurements.
These conditions remove the need for observer lateral maneuvers or strictly higher-order observer motion, and so relax the classical constraints (Zhang et al., 11 Jan 2026).

5. Comparison with Classical Approaches

Classical bearing-only estimators suffer from inherent observability limitations: if the observer moves only along the line of sight, target range and scale remain ambiguous. Full observability in bearing-only estimation demands observer maneuvers orthogonal to the bearing direction.

The bearing-box (and bearing-angle) estimators inherently resolve this ambiguity—through the inclusion of angle or normalized 3D box geometries—without requiring the observer to perform lateral “zig-zag” or spiral motion. As soon as nonzero observer acceleration or thrust-induced dynamics (for MAVs) exist, full state observability is restored (Ning et al., 2024, Zhang et al., 11 Jan 2026).

6. Implementation and Practical Results

Kalman filter-based estimators using pseudo-linear measurement updates are employed for both 2D and 3D cases. Key implementation steps include:

Acquisition of detector outputs (bounding boxes, 3D box parameters)
Construction of measurement and pseudo-linear system matrices
Execution of filter predict and update steps, incorporating noise covariance adjustment and possible MAV-specific measurements
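The steps above can be tied together in a small closed-loop simulation. This is a noise-free sketch under several assumptions: the 7-state constant-velocity model, projector-based bearing rows, small-angle angle rows with $\theta = \ell / r$, and a hypothetical observer that accelerates laterally. Detector outputs are replaced by measurements synthesized from the true state.

```python
import numpy as np

def simulate_and_filter(n_steps=60, dt=0.1):
    """Run a pseudo-linear Kalman filter on synthetic, noise-free measurements."""
    F = np.eye(7)
    F[0:3, 3:6] = dt * np.eye(3)
    x_true = np.array([10.0, 2.0, 0.0, 1.0, 0.5, 0.0, 0.8])  # [p_T, v_T, ell]
    x_hat = np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3])    # deliberately poor guess
    P = 100.0 * np.eye(7)
    Q = 1e-8 * np.eye(7)
    R = 1e-6 * np.eye(6)  # small fictitious noise keeps the innovation covariance invertible

    for k in range(n_steps):
        t = k * dt
        p_obs = np.array([0.0, 0.25 * t**2, 0.0])  # observer with lateral acceleration

        # Step 1: "detector output" synthesized from the truth.
        r = x_true[0:3] - p_obs
        rng = np.linalg.norm(r)
        g, theta = r / rng, x_true[6] / rng

        # Step 2: pseudo-linear measurement vector and matrix.
        P_g = np.eye(3) - np.outer(g, g)
        H = np.zeros((6, 7))
        H[0:3, 0:3] = P_g
        H[3:6, 0:3] = theta * np.eye(3)
        H[3:6, 6] = -g
        z = np.concatenate([P_g @ p_obs, theta * p_obs])

        # Step 3: update, then predict to the next time step.
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x_hat = x_hat + K @ (z - H @ x_hat)
        P = (np.eye(7) - K @ H) @ P
        x_true = F @ x_true
        x_hat, P = F @ x_hat, F @ P @ F.T + Q
    return x_true, x_hat

x_true, x_hat = simulate_and_filter()
pos_err = np.linalg.norm(x_true[0:3] - x_hat[0:3])
size_err = abs(x_true[6] - x_hat[6])
```

With real detections, the synthesized `g` and `theta` would be replaced by values computed from bounding boxes, and `R` would be set from the actual measurement uncertainty rather than a small constant.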

Key findings from extensive experimental validation:

Scenario	Bearing-Only	Bearing-Angle	Bearing-Box
KITTI (NIDE %)	54.5	41.6	34.6
Indoor car (NIDE %)	17.1	19.9	13.5
AirSim sim/MAV (static observer)	Fail	Fail	<15%
Real-world MAV tracking (outdoor)	Drifts	Drifts	Stable/Low
Monte-Carlo, circle/line/PNG pursuit	Poor/stable	Stable	N/A

In classic scenarios (e.g., straight-line observer motion), the bearing-only estimator diverges, whereas the bearing-box estimator remains consistent and accurate.
In high-fidelity simulation (AirSim) and real-world outdoor tests, the bearing-box estimator achieves stable estimation even for static observers, which the bearing-only and bearing-angle estimators do not (Ning et al., 2024, Zhang et al., 11 Jan 2026).

7. Extensions and Applicability

The bearing-box estimator applies broadly to target motion estimation problems where vision-based 2D or 3D bounding box detection is available. It is particularly advantageous in robotic scenarios such as ground-robot following and MAV-to-MAV tracking, and in other applications involving object detection in dynamic environments.

This approach leverages standard outputs of modern detection systems without needing additional sensors or modifications, making it directly compatible with widely deployed pipelines. For MAVs, the framework incorporates a unique coupling between thrust, acceleration, and observability that is not present in ground-based or unconstrained observer systems. This provides enhanced performance and relaxes previous maneuvering requirements, supporting new design and deployment strategies in aerial robotics (Ning et al., 2024, Zhang et al., 11 Jan 2026).


References

Ning et al. (2024). A Bearing-Angle Approach for Unknown Target Motion Analysis Based on Visual Measurements.

Zhang et al. (2026). Observability-Enhanced Target Motion Estimation via Bearing-Box: Theory and MAV Applications.
