Bearing-Box Estimator for Visual Tracking
- The bearing-box estimator is a vision-based approach that leverages 2D and 3D bounding boxes to recover a target's trajectory, velocity, and physical size from monocular observations.
- It employs a pseudo-linear state-space model and Kalman filter update, integrating geometric features from detector outputs to overcome observability limitations inherent in bearing-only methods.
- By using standard detector outputs and accommodating MAV thrust dynamics, the estimator improves performance in dynamic environments without requiring lateral maneuvers.
A bearing-box estimator is a class of vision-based target motion estimators that exploits the geometric and size information contained in object detection bounding boxes—either in 2D (image plane) or full 3D (cuboid) form—to recover both the trajectory and physical size of a moving target from monocular observations. Unlike classic bearing-only approaches that rely exclusively on the bearing direction and require specific observer maneuvers to break ambiguities, bearing-box estimators leverage standard detector outputs to achieve enhanced observability and estimation performance, including in cases without lateral or high-order observer motion (Ning et al., 2024, Zhang et al., 11 Jan 2026).
1. State and Motion Models
The bearing-box estimator seeks to reconstruct the position, velocity, and physical extent (“scale”) of a moving target of unknown size from visual data produced by a monocular camera whose pose is known (obtained from systems such as RTK-GPS or VIO). The estimator uses a discrete-time state-space model. The canonical first-order model uses a 7-dimensional state vector $x = [p^\top, v^\top, \ell]^\top \in \mathbb{R}^7$, where $p \in \mathbb{R}^3$ is the target position, $v \in \mathbb{R}^3$ is the velocity, and $\ell > 0$ is the characteristic physical size (e.g., diameter or height) of the target in the direction orthogonal to the bearing line (Ning et al., 2024).
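The transition model assumes constant-velocity, constant-size motion: $x_{k+1} = F x_k + w_k$, with

$$F = \begin{bmatrix} I_3 & \Delta t\, I_3 & 0 \\ 0 & I_3 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and process noise $w_k \sim \mathcal{N}(0, Q)$, where $\Delta t$ is the sampling period.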
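A minimal sketch of the constant-velocity, constant-size transition described above, assuming the state is ordered as position (3), velocity (3), size (1). The function name `make_cv_model` and the noise parameters `sigma_v` and `sigma_l` are illustrative assumptions; the process-noise structure in particular is a simple placeholder, not the tuning used in the cited work.

```python
import numpy as np

def make_cv_model(dt, sigma_v=0.1, sigma_l=1e-3):
    """Return (F, Q) for the state x = [p (3), v (3), l (1)]."""
    I3 = np.eye(3)
    F = np.eye(7)
    F[0:3, 3:6] = dt * I3              # p_{k+1} = p_k + dt * v_k
    # Assumed noise model: random-walk velocity and a slowly drifting size.
    Q = np.zeros((7, 7))
    Q[3:6, 3:6] = (sigma_v ** 2) * dt * I3
    Q[6, 6] = (sigma_l ** 2) * dt
    return F, Q

F, Q = make_cv_model(dt=0.1)
x = np.array([10.0, 2.0, 0.0, 1.0, 0.5, 0.0, 0.5])   # [p, v, l]
x_next = F @ x                                        # position advances, v and l unchanged
print(x_next)
```

Velocity and size are carried over unchanged by `F`; only the position block is coupled to velocity through `dt`.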
In applications involving multi-rotor micro aerial vehicles (MAVs), a second-order kinematic extension is used in which the state is augmented with the target acceleration and the transition dynamics are extended accordingly (Zhang et al., 11 Jan 2026). This accommodates the dynamic coupling between MAV thrust and motion.
2. Construction of Bearing-Box Measurements
2D Bounding-Box (Bearing-Angle) Measurements
For each detected 2D image bounding box:
- The box’s pixel center provides a unit bearing vector $g$ in world coordinates, obtained via the camera calibration (intrinsic) matrix and the camera rotation.
- The box’s size (e.g., width or height) determines the angle $\theta$ subtended by the target at the camera center; $\theta$ is computed from the box’s pixel extent and the camera parameters (Ning et al., 2024).
The measurement model becomes:
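- Noisy bearing: $\hat{g} = g + \nu_g$, $\nu_g \sim \mathcal{N}(0, \Sigma_g)$, where $g$ is the true unit bearing.
- Noisy angle: $\hat{\theta} = \theta + \nu_\theta$, $\nu_\theta \sim \mathcal{N}(0, \sigma_\theta^2)$, where $\theta$ is the true subtended angle.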
These measurements yield a set of pseudo-linear equations relating scene geometry to the state.
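The two measurements above can be sketched for an ideal pinhole camera as follows. This is an illustration under stated assumptions, not the cited authors' code: the function name `bearing_and_angle`, the box layout `(u_min, v_min, u_max, v_max)`, and the choice to take the angle from the box width at the box's vertical center are all hypothetical.

```python
import numpy as np

def bearing_and_angle(box, K, R_wc):
    """Bearing and subtended angle from a 2D box (pinhole model, no distortion).

    box  : (u_min, v_min, u_max, v_max) in pixels (assumed layout)
    K    : 3x3 camera intrinsic matrix
    R_wc : 3x3 rotation taking camera-frame vectors to the world frame
    """
    u_min, v_min, u_max, v_max = box
    K_inv = np.linalg.inv(K)
    u_c, v_c = 0.5 * (u_min + u_max), 0.5 * (v_min + v_max)

    def ray(u, v):
        # Back-project a pixel to a unit direction in the camera frame.
        d = K_inv @ np.array([u, v, 1.0])
        return d / np.linalg.norm(d)

    g_world = R_wc @ ray(u_c, v_c)                 # unit bearing through the box center
    q_left, q_right = ray(u_min, v_c), ray(u_max, v_c)
    # Angle between the rays through the left and right box edges.
    theta = np.arccos(np.clip(q_left @ q_right, -1.0, 1.0))
    return g_world, theta

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
g, theta = bearing_and_angle((280.0, 220.0, 360.0, 260.0), K, np.eye(3))
print(g, theta)
```

For a box centered on the principal point, the bearing is the optical axis and the angle reduces to $2\arctan(w / 2f)$ for pixel width $w$ and focal length $f$.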
3D Bounding-Box (Cuboid) Measurements
Contemporary detectors can provide full 3D bounding boxes, representing the target as an oriented cuboid with three side lengths. The box corners are projected to image coordinates, and a normalized “depth” is computed by solving an overdetermined pseudo-linear system. The solution provides a direct linear relation between the estimated translation, velocity, and scale (up to normalization) (Zhang et al., 11 Jan 2026).
3. Pseudo-Linear Estimation Framework
Both the 2D and 3D measurement constructions enable a pseudo-linear measurement equation of the form $z_k = H_k x_k + \nu_k$, where $z_k$ contains the processed geometric measurements, $H_k$ is constructed from the projection and bounding-box geometry, and $\nu_k$ is the measurement noise. In the Kalman filter context:
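- Prediction: $\hat{x}_{k|k-1} = F \hat{x}_{k-1|k-1}$, $P_{k|k-1} = F P_{k-1|k-1} F^\top + Q$.
- Update: $K_k = P_{k|k-1} H_k^\top (H_k P_{k|k-1} H_k^\top + R_k)^{-1}$, $\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (z_k - H_k \hat{x}_{k|k-1})$, $P_{k|k} = (I - K_k H_k) P_{k|k-1}$.

Here $F$ and $Q$ are the transition matrix and process-noise covariance, and $H_k$ and $R_k$ are the measurement matrix and measurement-noise covariance. The pseudoinverse replaces the inverse in ill-posed cases, and the covariance terms are adjusted according to measurement uncertainty (Ning et al., 2024, Zhang et al., 11 Jan 2026).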
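The predict and update steps can be sketched as below. The measurement construction here is an illustrative assumption rather than the cited papers' exact formulation: it takes a target of size $\ell$ at range $r$ to subtend $\theta = 2\arctan(\ell / 2r)$, so that $p_o = p - \ell\, g / (2\tan(\theta/2))$ is linear in the state $[p, v, \ell]$ with the observer position $p_o$ acting as the measurement. All function names are hypothetical, measurements are noise-free for clarity, and the Joseph-form covariance update is a choice made for numerical robustness.

```python
import numpy as np

def pseudo_linear_measurement(p_obs, g, theta):
    """Return (z, H) with z = H x for x = [p, v, l] (assumed construction)."""
    c = 1.0 / (2.0 * np.tan(0.5 * theta))   # equals r / l under the assumed geometry
    H = np.zeros((3, 7))
    H[:, 0:3] = np.eye(3)
    H[:, 6] = -c * g                        # p - l * c * g = p_obs
    return p_obs.copy(), H

def kf_predict(x, P, F, Q):
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.pinv(S)         # pseudoinverse guards ill-conditioned S
    x_new = x + K @ (z - H @ x)
    A = np.eye(len(x)) - K @ H
    P_new = A @ P @ A.T + K @ R @ K.T       # Joseph form keeps P symmetric
    return x_new, P_new

dt, n_steps = 0.1, 50
F = np.eye(7)
F[0:3, 3:6] = dt * np.eye(3)
Q, R = 1e-8 * np.eye(7), 1e-4 * np.eye(3)

x_true = np.array([10.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.5])   # target [p, v, l]
x_est = np.array([5.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0])     # deliberately wrong guess
P = 100.0 * np.eye(7)
acc_obs = np.array([0.0, 1.0, 0.0])                       # observer accelerates

for k in range(n_steps):
    p_obs = 0.5 * acc_obs * (k * dt) ** 2
    rel = x_true[0:3] - p_obs
    r = np.linalg.norm(rel)
    g, theta = rel / r, 2.0 * np.arctan(x_true[6] / (2.0 * r))
    z, H = pseudo_linear_measurement(p_obs, g, theta)
    x_est, P = kf_update(x_est, P, z, H, R)
    if k < n_steps - 1:
        x_true = F @ x_true
        x_est, P = kf_predict(x_est, P, F, Q)

print(np.round(x_est, 3), np.round(x_true, 3))
```

In practice `g` and `theta` are noisy, so `H` itself depends on noisy quantities; that dependence is what makes the model pseudo-linear rather than linear, and it is one reason the covariance terms need adjustment.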
For MAVs, a second-order measurement is integrated, leveraging the coupling between acceleration and thrust. This measurement is formed with the orthogonal projector onto the plane normal to the MAV's thrust vector.
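As a small illustration of the projector mentioned above, the sketch below builds the orthogonal projector onto the plane perpendicular to a thrust direction, $I - n n^\top$ for unit vector $n$. It assumes "plane normal to the thrust vector" means the plane whose normal is the thrust direction. The function name is hypothetical, and the full MAV measurement equation from the cited work is not reproduced here.

```python
import numpy as np

def thrust_normal_projector(thrust):
    """Orthogonal projector onto the plane perpendicular to `thrust`."""
    n = thrust / np.linalg.norm(thrust)     # unit thrust direction
    return np.eye(3) - np.outer(n, n)

thrust = np.array([0.5, 0.0, 9.0])          # example thrust vector (arbitrary values)
P_perp = thrust_normal_projector(thrust)

a = np.array([1.0, 2.0, 3.0])               # example acceleration
a_perp = P_perp @ a                         # part of `a` orthogonal to the thrust axis
print(a_perp, a_perp @ thrust)
```

Applying the projector removes the component along the thrust axis, leaving only the part of the vector that lies in the plane perpendicular to it.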
4. Observability Conditions and Theoretical Analysis
General Case (2D/3D Common Object)
- Observability in discrete time requires that the observer’s motion have strictly higher polynomial order than the target’s; for a constant-velocity target, observer acceleration (or higher) suffices.
- In continuous time, an $n$th-order target trajectory is observable if and only if the observer’s trajectory has at least order $n+1$ at some instant.
- At least $n+2$ measurements with a non-zero $(n+1)$th difference are required to reconstruct the system (Ning et al., 2024, Zhang et al., 11 Jan 2026).
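The order condition for a constant-velocity target can be checked numerically by stacking pseudo-linear measurement rows and inspecting their rank. The sketch below is an illustration under assumptions, not the papers' analysis: it uses noise-free measurements, a subtended angle of $\theta = 2\arctan(\ell / 2r)$, and the unknowns $(p_0, v, \ell)$ at the initial time; the function name is hypothetical.

```python
import numpy as np

def stacked_regressor(p_obs_traj, p0, v, ell, dt):
    """Stack rows of p_obs_k = p0 + k*dt*v - ell * c_k * g_k over all steps."""
    rows = []
    for k, p_obs in enumerate(p_obs_traj):
        rel = p0 + k * dt * v - p_obs              # true relative position
        r = np.linalg.norm(rel)
        g = rel / r                                # unit bearing
        theta = 2.0 * np.arctan(ell / (2.0 * r))   # assumed subtended-angle model
        c = 1.0 / (2.0 * np.tan(0.5 * theta))
        rows.append(np.hstack([np.eye(3), k * dt * np.eye(3), (-c * g)[:, None]]))
    return np.vstack(rows)                         # shape (3 * steps, 7)

dt, n = 0.1, 20
t = dt * np.arange(n)
p0, v, ell = np.array([10.0, 2.0, 1.0]), np.array([1.0, 0.5, 0.0]), 0.5

const_vel = np.outer(t, [0.5, 0.0, 0.0])           # observer at constant velocity
accel = np.outer(0.5 * t ** 2, [0.0, 1.0, 0.0])    # observer with constant acceleration

rank_cv = np.linalg.matrix_rank(stacked_regressor(const_vel, p0, v, ell, dt), tol=1e-9)
rank_acc = np.linalg.matrix_rank(stacked_regressor(accel, p0, v, ell, dt), tol=1e-9)
print(rank_cv, rank_acc)
```

Under these assumptions the constant-velocity observer is expected to give rank 6, leaving one unresolved scale direction, while the accelerating observer gives the full rank of 7, matching the requirement that the observer's motion be of higher order than the target's.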
MAV-Specific Relaxation
The inclusion of measurements derived from the MAV’s thrust-attitude dynamics yields a fundamental relaxation:
- For a second-order MAV model, the state is observable if either (a) the observer’s trajectory exhibits non-zero jerk at some step, given at least four measurements, or (b) the relative acceleration (modulo thrust orientation) is non-degenerate, given three measurements.
- These conditions eliminate the requirement for observer lateral maneuvers or strictly higher-order motion, strictly relaxing classical constraints (Zhang et al., 11 Jan 2026).
5. Comparison with Classical Approaches
Classical bearing-only estimators suffer from inherent observability limitations: if the observer moves only along the line of sight, target range and scale remain ambiguous. Full observability in bearing-only estimation demands observer maneuvers orthogonal to the bearing direction.
The bearing-box (and bearing-angle) estimators inherently resolve this ambiguity—through the inclusion of angle or normalized 3D box geometries—without requiring the observer to perform lateral “zig-zag” or spiral motion. As soon as nonzero observer acceleration or thrust-induced dynamics (for MAVs) exist, full state observability is restored (Ning et al., 2024, Zhang et al., 11 Jan 2026).
6. Implementation and Practical Results
Kalman filter-based estimators using pseudo-linear measurement updates are employed for both 2D and 3D cases. Key implementation steps include:
- Acquisition of detector outputs (bounding boxes, 3D box parameters)
- Construction of measurement and pseudo-linear system matrices
- Execution of filter predict and update steps, incorporating noise covariance adjustment and possible MAV-specific measurements
Key findings from extensive experimental validation:
| Scenario | Bearing-Only | Bearing-Angle | Bearing-Box |
|---|---|---|---|
| KITTI (NIDE %) | 54.5 | 41.6 | 34.6 |
| Indoor car (NIDE %) | 17.1 | 19.9 | 13.5 |
| AirSim sim/MAV (static observer) | Fail | Fail | <15% |
| Real-world MAV tracking (outdoor) | Drifts | Drifts | Stable/Low |
| Monte-Carlo, circle/line/PNG pursuit | Poor/stable | Stable | N/A |
- In classic scenarios (e.g., straight-line maneuvers), the bearing-only estimator diverges, whereas the bearing-box estimator remains consistent and accurate.
- In high-fidelity simulation (AirSim) and real-world outdoor tests, the bearing-box estimator achieves stable estimation even for static observers, which the bearing-only and bearing-angle estimators do not (Ning et al., 2024, Zhang et al., 11 Jan 2026).
7. Extensions and Applicability
The bearing-box estimator applies broadly to target motion estimation problems where vision-based 2D or 3D bounding box detection is available. It is notably advantageous in robotic scenarios such as ground-robot following, MAV-to-MAV tracking, and other applications that rely on object detection in dynamic environments.
This approach leverages standard outputs of modern detection systems without needing additional sensors or modifications, making it directly compatible with widely deployed pipelines. For MAVs, the framework incorporates a unique coupling between thrust, acceleration, and observability that is not present in ground-based or unconstrained observer systems. This provides enhanced performance and relaxes previous maneuvering requirements, supporting new design and deployment strategies in aerial robotics (Ning et al., 2024, Zhang et al., 11 Jan 2026).