WildLIFT: 3D Wildlife Monitoring Framework

Updated 3 July 2026

WildLIFT is a computational framework that converts monocular drone video into persistent 3D object tracks and semantic wildlife annotations.
It employs a three-stage pipeline integrating 3D reconstruction with Kalman-filtered tracking and PCA-based 3D bounding box fitting for accurate wildlife monitoring.
The approach reduces manual annotation by 97.7% while achieving high tracking accuracy and enabling detailed viewpoint and occlusion analyses for ecological research.

WildLIFT is a computational framework designed to extract three-dimensional (3D) scene geometry and structured wildlife annotations from monocular drone video, employing open-vocabulary instance segmentation and identity-consistent tracking to enable species-agnostic 3D detection, tracking, and behavioral analysis. Unlike traditional drone-based monitoring pipelines, which primarily operate in the two-dimensional image domain, WildLIFT leverages geometric information inherent in video data to produce persistent 3D object tracks, oriented bounding boxes (OBBs) with semantic face labeling, and quantitative viewpoint and occlusion statistics. This approach significantly enhances the analytical utility of aerial wildlife datasets for ecological research and monitoring by providing structured metadata and dramatically reducing manual annotation effort (Shukla et al., 27 Apr 2026).

1. Pipeline Architecture and Core Modules

WildLIFT is organized as a three-stage computational pipeline that processes uncalibrated monocular RGB drone video combined with an open-vocabulary textual prompt specifying the target species. The three stages are:

WildLIFT-RT (3D Scene Geometry and Tracking): A feed-forward transformer model, CUT3R, reconstructs dense per-frame 3D point-maps $\mathcal{P}_t \subset \mathbb{R}^3$ and camera poses $\mathbf{T}_t \in SE(3)$ without reliance on pre-calibrated camera intrinsics. Grounded-SAM, an open-vocabulary segmentation model, provides 2D instance masks $M_t^i$ for all detected animal instances per frame. Each 2D mask is "lifted" to 3D by selecting points $\mathcal{Q}_t^i = \{\mathbf{p} \in \mathcal{P}_t \;|\; \pi(\mathbf{p}) \in M_t^i \}$, where $\pi$ is a learned projection. Kalman-filtered tracking with a constant-velocity model associates these 3D clusters over time, generating identity-consistent trajectory labels.
WildLIFT-A (3D Bounding Box Fitting and Annotation Tools): The tracked 3D clusters are fit with oriented 3D bounding boxes $b=(\mathbf{c},\mathbf{d},\mathbf{R})$ via principal component analysis (PCA). Gimbal telemetry, if available, constrains vertical alignment. A web-based annotation tool allows users to refine OBBs and assign semantic face labels (front, top, left, etc.) on sparse keyframes. Edits are propagated between keyframes by linear interpolation for position and dimension and by SLERP (on $SO(3)$) for rotation.
WildLIFT-V (Viewpoint and Occlusion Analysis): Using the semantic OBB faces and the known camera location $\mathbf{c}_{\mathrm{cam},t}$, WildLIFT computes per-face visibility ($\mathbf{n}_f \cdot \hat{\mathbf{v}}_t > 0$) and quality scores $Q_t^{(f)}$. Occlusion analysis casts rays from the camera through sampled face points and checks for intersections with other OBBs, producing a per-face occlusion fraction and an overall effective visibility. Coverage diversity across faces is quantified using Shannon entropy.
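
The PCA-based box fit described for WildLIFT-A can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `fit_obb_pca` and the choice to order axes from longest to shortest are assumptions made for illustration, and the optional gimbal-based vertical constraint is omitted.

```python
import numpy as np

def fit_obb_pca(points):
    """Fit an oriented box b = (c, d, R) to an (N, 3) point cluster (illustrative)."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    # Eigenvectors of the covariance matrix give the principal axes.
    cov = np.cov(points - mean, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues come back in ascending order
    R = eigvecs[:, ::-1].copy()               # columns: longest axis first (assumed order)
    if np.linalg.det(R) < 0:                  # keep a right-handed rotation
        R[:, 2] *= -1.0
    # Express the points in the box frame and measure the extent along each axis.
    local = (points - mean) @ R
    lo, hi = local.min(axis=0), local.max(axis=0)
    d = hi - lo                               # box dimensions
    c = mean + R @ ((lo + hi) / 2.0)          # box center in world coordinates
    return c, d, R
```

Because the sign of each eigenvector is arbitrary, a fit like this cannot tell front from back on its own, which is consistent with the pipeline assigning semantic face labels in a separate, human-guided step.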

The pipeline flexibly accommodates different taxa by simply modifying the text prompt, and requires no additional model training for new species.

2. 3D Reconstruction and Tracking Methods

At the core of WildLIFT's geometry recovery is CUT3R, a feed-forward transformer designed to recover temporally consistent 3D point clouds $\mathcal{P}_t$ and corresponding camera transformations $\mathbf{T}_t$ from monocular video. This formulation, inspired by structure-from-motion, seeks to minimize the reprojection error between projected 3D points and their 2D observations, with the projection $\pi$ implicitly learning camera intrinsics and the 2D observations serving as ground truth. CUT3R bypasses iterative bundle adjustment by regressing 3D coordinates and poses directly from the video frames, maintaining temporal consistency through state propagation.

Tracking uses the centroid of each detected 3D instance $\mathcal{Q}_t^i$ and maintains one Kalman filter per trajectory. Data association across frames is posed as an assignment problem whose cost function blends a spatial term with a segmentation-overlap term, and is solved with the Hungarian method. Tracks that become occluded enter a dormant state for up to 100 frames, allowing later re-identification.
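
A constant-velocity Kalman filter over 3D centroids, as described above, might look like the following sketch. The class name, the state layout `[x, y, z, vx, vy, vz]`, and the noise levels `q` and `r` are illustrative assumptions, not values from the paper; the association step is left out because its exact cost function is not given here.

```python
import numpy as np

class ConstantVelocityKF:
    """Kalman filter with state [x, y, z, vx, vy, vz] and position-only measurements."""

    def __init__(self, centroid, dt=1.0, q=1e-2, r=1e-1):
        self.x = np.concatenate([np.asarray(centroid, dtype=float), np.zeros(3)])
        self.P = np.eye(6)                                   # state covariance
        self.F = np.eye(6)                                   # constant-velocity transition
        self.F[:3, 3:] = dt * np.eye(3)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # observe position only
        self.Q = q * np.eye(6)                               # process noise (assumed)
        self.R = r * np.eye(3)                               # measurement noise (assumed)

    def predict(self):
        """Advance one frame; usable on its own while a track is dormant."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, centroid):
        """Correct the prediction with a measured 3D centroid."""
        z = np.asarray(centroid, dtype=float)
        innovation = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]
```

Calling `predict()` repeatedly without `update()` extrapolates the track along its last estimated velocity, which is one plausible way a dormant track could be matched again after an occlusion.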

3. Integration of Open-Vocabulary Segmentation and 3D Lifting

WildLIFT employs Grounded-SAM, an open-vocabulary segmentation model, to generate high-fidelity 2D instance masks $M_t^i$ for each animal detected in a frame, based on a user-supplied textual prompt (e.g., "giraffe," "zebra," "elephant"). Each instance mask is mapped to 3D by selecting those 3D points from $\mathcal{P}_t$ whose projected locations fall within $M_t^i$.

A statistical outlier-removal step is applied within clusters to improve segmentation robustness, filtering points remote from the centroid. This methodology is species-agnostic and model-agnostic, requiring only updates to textual prompts to monitor different taxa, and feeds directly into subsequent tracking and OBB fitting stages.
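
The lifting and outlier-removal steps can be sketched as follows. The sketch assumes the point-map is stored per pixel as an `(H, W, 3)` array, so that testing whether a point projects into the mask reduces to boolean indexing; it also assumes a simple distance-from-centroid rule with a hypothetical threshold `k`, since the exact filter is not specified here.

```python
import numpy as np

def lift_mask_to_3d(pointmap, mask):
    """Select the 3D points whose pixels fall inside a boolean instance mask."""
    pointmap = np.asarray(pointmap, dtype=float)   # (H, W, 3), one 3D point per pixel
    mask = np.asarray(mask, dtype=bool)            # (H, W)
    return pointmap[mask]                          # (N, 3)

def remove_outliers(points, k=2.0):
    """Drop points far from the cluster centroid (threshold k is an assumption)."""
    centroid = points.mean(axis=0)
    dist = np.linalg.norm(points - centroid, axis=1)
    keep = dist <= dist.mean() + k * dist.std()
    return points[keep]

# Toy frame: a flat surface at depth 5 with one background point leaking into the mask.
ys, xs = np.mgrid[0:4, 0:4]
pointmap = np.stack([xs, ys, np.full((4, 4), 5.0)], axis=-1).astype(float)
pointmap[1, 1] = [1.0, 1.0, 500.0]
mask = np.zeros((4, 4), dtype=bool)
mask[0:3, 0:3] = True

cluster = remove_outliers(lift_mask_to_3d(pointmap, mask))
```

Filtering of this kind matters because mask boundaries often include a few background pixels, whose depths would otherwise inflate the fitted bounding box.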

4. 3D Bounding Box Representation and Viewpoint-Aware Metrics

WildLIFT represents objects as OBBs in KITTI format: center $\mathbf{c}$, dimensions $\mathbf{d}$, and orientation $\mathbf{R}$. Keyframe-based human annotation allows semantic face labeling (front, back, left, etc.) of each OBB. Between keyframes, position and dimensions are interpolated linearly; rotations use SLERP; semantic labels are propagated based on motion heuristics.

For each OBB, a semantic face $f$ is visible at time $t$ when its normal vector satisfies $\mathbf{n}_f \cdot \hat{\mathbf{v}}_t > 0$, where $\hat{\mathbf{v}}_t$ is the unit vector from the OBB center to the camera. Visibility quality scores $Q_t^{(f)}$ combine viewing angle, projected area, centroid centrality, and an aspect-ratio penalty.
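
The per-face visibility test can be sketched in a few lines. The face-to-axis mapping below (box x-axis pointing to the animal's front, y to its left, z up) is a hypothetical convention chosen for illustration; the quality score itself is not reproduced because its exact form is not given here.

```python
import numpy as np

# Hypothetical convention: columns of R are the box axes in world coordinates,
# with x = front, y = left, z = up.
FACE_AXES = {
    "front": (0, +1), "back": (0, -1),
    "left": (1, +1), "right": (1, -1),
    "top": (2, +1), "bottom": (2, -1),
}

def visible_faces(c, R, cam_pos):
    """Return {face: cosine of viewing angle} for faces with n_f . v_hat > 0."""
    R = np.asarray(R, dtype=float)
    v = np.asarray(cam_pos, dtype=float) - np.asarray(c, dtype=float)
    v_hat = v / np.linalg.norm(v)              # unit vector from box center to camera
    result = {}
    for name, (axis, sign) in FACE_AXES.items():
        n_f = sign * R[:, axis]                # outward face normal
        cos_angle = float(n_f @ v_hat)
        if cos_angle > 0:
            result[name] = cos_angle
    return result

# A drone ahead of and above an axis-aligned animal sees its front and top.
faces = visible_faces(c=[0, 0, 0], R=np.eye(3), cam_pos=[10, 0, 10])
```

The cosine returned for each visible face is a natural ingredient for a viewing-angle term, since it approaches 1 for head-on views and 0 for grazing ones.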

Occlusion is quantified by casting rays from the camera through sampled OBB face points and tallying those that intersect other OBBs. The resulting occlusion fraction yields an effective visibility for each face. Diversity of viewpoint coverage is summarized by a Shannon entropy metric

$H = -\sum_{f} p_f \log p_f$

where $p_f$ records per-face coverage. These metrics support downstream analyses including behavioral inference and training data selection.
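
The ray-casting occlusion test and the entropy summary can be sketched together. This is an illustrative version under stated assumptions: boxes are `(c, d, R)` tuples, the ray test is the standard slab method applied in each box's local frame, coverage counts are normalized to probabilities, and the natural logarithm is used. None of the function names come from the paper.

```python
import math
import numpy as np

def segment_hits_obb(origin, target, c, d, R):
    """True if the segment origin -> target passes through the box (c, d, R)."""
    c, R = np.asarray(c, dtype=float), np.asarray(R, dtype=float)
    o = R.T @ (np.asarray(origin, dtype=float) - c)    # segment start in box frame
    e = R.T @ (np.asarray(target, dtype=float) - c)    # segment end in box frame
    direction = e - o
    half = np.asarray(d, dtype=float) / 2.0
    t_min, t_max = 0.0, 1.0                            # parameter range of the segment
    for axis in range(3):
        if abs(direction[axis]) < 1e-12:               # parallel to this slab
            if abs(o[axis]) > half[axis]:
                return False
            continue
        t1 = (-half[axis] - o[axis]) / direction[axis]
        t2 = (half[axis] - o[axis]) / direction[axis]
        t_min = max(t_min, min(t1, t2))
        t_max = min(t_max, max(t1, t2))
        if t_min > t_max:
            return False
    return True

def occlusion_fraction(cam_pos, face_points, other_boxes):
    """Fraction of sampled face points whose line of sight is blocked."""
    blocked = sum(
        any(segment_hits_obb(cam_pos, p, *box) for box in other_boxes)
        for p in face_points
    )
    return blocked / len(face_points)

def coverage_entropy(coverage):
    """Shannon entropy (in nats) of per-face coverage counts."""
    total = sum(coverage.values())
    probs = [v / total for v in coverage.values() if v > 0]
    return -sum(p * math.log(p) for p in probs)
```

Entropy is zero when all observations show a single face and reaches its maximum when coverage is spread evenly, so it gives a compact measure of how varied the recorded viewpoints are.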

5. Interactive Keyframe-Based Annotation Workflow

WildLIFT-A introduces a browser-based, human-in-the-loop annotation tool that facilitates rapid geometric correction and semantic face assignment through keyframe selection. PCA-based auto-fitted OBBs require manual adjustment in only a minority of frames (7% across evaluated sequences). Editors manipulate a sparse set of keyframes, with interpolation propagating changes.

Let $b_1 = (\mathbf{c}_1, \mathbf{d}_1, \mathbf{R}_1)$ and $b_2 = (\mathbf{c}_2, \mathbf{d}_2, \mathbf{R}_2)$ denote edited bounding boxes for adjacent keyframes at times $t_1 < t_2$. For an intermediate frame $t$, the interpolation weight $\alpha = (t - t_1)/(t_2 - t_1)$ gives $\mathbf{c}_t = (1-\alpha)\,\mathbf{c}_1 + \alpha\,\mathbf{c}_2$ and $\mathbf{d}_t = (1-\alpha)\,\mathbf{d}_1 + \alpha\,\mathbf{d}_2$, with rotation $\mathbf{R}_t$ interpolated using SLERP. Semantic label propagation strategies accommodate heading changes and eigenvector ambiguity.
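
Keyframe propagation of this kind can be sketched as below. As an assumption made for illustration, rotations are stored as unit quaternions `(w, x, y, z)` rather than rotation matrices, which makes SLERP straightforward to write; the function names are hypothetical.

```python
import numpy as np

def slerp(q0, q1, alpha):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, dtype=float) / np.linalg.norm(q0)
    q1 = np.asarray(q1, dtype=float) / np.linalg.norm(q1)
    dot = float(q0 @ q1)
    if dot < 0.0:                      # q and -q are the same rotation: take the short arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly identical: normalized linear interpolation
        q = (1.0 - alpha) * q0 + alpha * q1
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - alpha) * theta) * q0 + np.sin(alpha * theta) * q1) / np.sin(theta)

def interpolate_box(t, t1, box1, t2, box2):
    """Interpolate (center, dimensions, quaternion) between keyframes t1 < t < t2."""
    alpha = (t - t1) / (t2 - t1)
    c1, d1, q1 = box1
    c2, d2, q2 = box2
    c = (1.0 - alpha) * np.asarray(c1, dtype=float) + alpha * np.asarray(c2, dtype=float)
    d = (1.0 - alpha) * np.asarray(d1, dtype=float) + alpha * np.asarray(d2, dtype=float)
    return c, d, slerp(q1, q2, alpha)

# Keyframes 10 frames apart: the animal moves 10 units and turns 90 degrees about z.
s = np.sqrt(0.5)
box_a = ([0, 0, 0], [2, 1, 1], [1, 0, 0, 0])
box_b = ([10, 0, 0], [4, 1, 1], [s, 0, 0, s])
c_mid, d_mid, q_mid = interpolate_box(5, 0, box_a, 10, box_b)
```

SLERP keeps the interpolated rotation on the unit sphere and turns at a constant angular rate, whereas interpolating matrix entries linearly would generally not yield a valid rotation.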

The pipeline achieves a 97.7% reduction in manual editing: on average, only 2.3% of frames need direct annotation, corresponding to 3.6 s/frame and a greater than 20x speed-up over frame-by-frame labeling.

6. Evaluation Metrics and Empirical Performance

WildLIFT has been quantitatively evaluated across 27 drone video sequences (2,581 frames, 6,799 ground-truth 3D detections, 77 tracked individuals) of rhinos, elephants, zebras, and giraffes. Key findings include:

Tracking Accuracy: WildLIFT-RT achieves perfect recall (1.000) and the highest identity consistency (IDF$_1$), outperforming established 2D trackers (OC-SORT, ByteTrack, BotSORT) by up to 2.2 percentage points. Giraffe sequences with tree occlusion show the greatest improvements in IDF$_1$ (versus 0.922 for BotSORT). Ablating the Kalman filter reduces IDF$_1$ (0.957), confirming the value of velocity-aware re-identification.
Annotation Efficiency: PCA-fitted OBBs require no geometric correction in 93% of frames, a substantial improvement over LiDAR-based single-image annotators (58.5%). Semantic heading flips occur in only 10% of frames.
Viewpoint and Face Classification: Automated per-face visibility classification achieved accuracy from 0.86 (zebra herd) to 0.95 (rhino), with rare false positives at shallow angles, and high mean F1 scores.
Occlusion Statistics: 15–40% of geometrically visible frames in multi-animal scenes are subject to partial flank occlusion, illustrating the importance of explicit occlusion modeling.

7. Limitations and Prospective Extensions

WildLIFT's performance is bounded by the quality of the CUT3R 3D reconstruction backbone: rapid camera motion or significant motion blur can degrade depth and pose estimation. The 3D geometry is internally consistent but possesses scale ambiguity; metric scaling necessitates external reference. WildLIFT is conceptually species-agnostic, but empirical validation has been limited to four large terrestrial mammals; generalization to birds, marine, or small-bodied taxa is untested. The annotation tool, while efficient, still requires limited human correction. Future research may explore training fully-automated monocular 3D detectors on WildLIFT's outputs, as well as assessing real-time deployment feasibility given CUT3R's online architecture.

WildLIFT constitutes a comprehensive solution for transforming monocular drone video into rich, structured, viewpoint-aware 3D representations that support scalable, species-agnostic wildlife monitoring and downstream ecological and computer vision tasks (Shukla et al., 27 Apr 2026).


References (1)

WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring (2026)
