Canterbury Forestry Dataset
- The Canterbury Forestry Dataset is a UAV-based stereo vision benchmark featuring 5,313 high-resolution stereo pairs captured from diverse forestry environments in New Zealand.
- It provides pseudo-ground-truth disparity maps and standard evaluation metrics like EPE, RMSE, and Bad-Pixel Ratio to assess deep stereo matching performance in complex vegetation.
- An optimized sensor setup, challenging lighting, and detailed foliage structure make it a valuable resource for evaluating cross-domain generalization in unconstrained, natural scenes.
The Canterbury Forestry Dataset is a UAV-acquired stereo vision benchmark designed to address the gap in high-resolution, cross-domain evaluation for dense vegetation and specialized forestry environments. Tailored for assessing the generalization performance of deep stereo matching methods in scenarios dominated by complex natural foliage, it advances depth estimation research beyond the traditional urban and indoor contexts. The dataset combines high-fidelity stereo imaging with pseudo-ground-truth disparities, enabling rigorous quantitative and qualitative assessment of modern disparity estimation algorithms in unconstrained, vegetation-rich aerial scenes (Lin et al., 3 Dec 2025).
1. Acquisition Protocol and Sensor Configuration
Stereo image acquisition was conducted using a ZED Mini stereo camera mounted on a custom multirotor UAV platform. Key specifications are:
- Sensor: ZED Mini stereo pair, global-shutter, hardware-synchronized
- Resolution: $1920 \times 1080$ pixels per camera (RGB, full HD)
- Baseline ($b$): $63$ mm
- Focal length ($f$): given in pixels (at full HD)
- Principal point: given in px
- Sensor area:
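As an illustration of how these parameters are consumed downstream, the following sketch assembles the pinhole intrinsics matrix $K$; the numeric values are hypothetical placeholders, with the actual calibration shipped in the per-frame metadata (Section 2):

```python
import numpy as np

def intrinsics_matrix(fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Assemble the 3x3 pinhole intrinsics matrix K from calibration values."""
    return np.array([[fx,  0.0, cx],
                     [0.0, fy,  cy],
                     [0.0, 0.0, 1.0]])

# Hypothetical placeholder values; the actual calibration is stored in the
# per-frame metadata files described in Section 2.
K = intrinsics_matrix(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```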
The camera was mounted on a vibration-damped gimbal and oriented nadir. Flights were conducted at altitudes of roughly $20$ m and above over the canopy, achieving ground-sampling distances (GSD) down to about $1$ cm. Imaging runs were flown at airspeeds of roughly $3$ m/s, with substantial forward and side overlap between frames. The dataset was acquired between March and October 2024 at mixed-age radiata pine plantations and native temperate bush sites in Canterbury, New Zealand. Recordings span varied illumination (sunny, overcast, dappled shade) and cover late summer through early spring, sampling vegetation densities from sparse to very dense.
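The nominal GSD follows from flight altitude and the focal length in pixels via $\mathrm{GSD} = Z/f$; the sketch below makes the relation concrete (the $1000$ px focal length is an illustrative assumption, not the dataset calibration):

```python
def ground_sampling_distance_cm(altitude_m: float, focal_px: float) -> float:
    """Nominal nadir GSD in cm/px: one pixel spans altitude_m / focal_px
    meters on the ground (GSD = Z / f for a pinhole camera)."""
    return altitude_m / focal_px * 100.0

# Illustration only: a 20 m flight with an assumed 1000 px focal length
# yields a GSD of 2 cm/px.
print(ground_sampling_distance_cm(20.0, 1000.0))
```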
2. Dataset Composition, Structure, and Format
A total of $5,313$ stereo pairs were selected from a larger pool of captured frames. Each stereo pair covers an approximately uniform ground footprint, giving the dataset broad nominal spatial coverage across the surveyed sites.
Data organization:
- All imagery and disparity maps are provided as PNG files: $8$-bit for images, $16$-bit for disparity (pseudo-ground-truth derived from DEFOM).
- Directory structure (per scene):
- Disparity maps, encoding disparity in pixels
- Per-frame metadata: camera intrinsics, GPS, timestamp
Naming follows a zero-padded four-digit index scheme, with frame indices running from $0001$ to $5313$.
Ground truth: No active LiDAR ground truth is provided. Stereo disparity pseudo-ground-truth is instead generated by the DEFOM foundation model, with additional qualitative checks of boundary and smoothness quality. The DEFOM output includes a validity mask alongside the $16$-bit disparity maps.
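A minimal loading sketch under stated assumptions: the sub-directory names are hypothetical stand-ins for the released layout, and a KITTI-style $1/256$ disparity scale is assumed for the $16$-bit PNGs, since the encoding is not specified here:

```python
import numpy as np
from PIL import Image

# Assumed KITTI-style encoding for 16-bit disparity PNGs (disparity * 256).
DISP_SCALE = 1.0 / 256.0

def load_pair(scene_root: str, idx: int):
    """Load one stereo pair and its DEFOM pseudo-ground-truth disparity.
    Sub-directory names are hypothetical; indices run 0001..5313."""
    stem = f"{idx:04d}"
    left  = np.asarray(Image.open(f"{scene_root}/left/{stem}.png"))   # 8-bit RGB
    right = np.asarray(Image.open(f"{scene_root}/right/{stem}.png"))
    raw   = np.asarray(Image.open(f"{scene_root}/disp/{stem}.png"))   # 16-bit PNG
    disp  = raw.astype(np.float32) * DISP_SCALE   # disparity in pixels
    valid = raw > 0   # zero treated as invalid here; the release also
                      # ships an explicit DEFOM validity mask
    return left, right, disp, valid
```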
3. Disparity and Depth Statistics
Statistical analysis over all valid pixels in $5,313$ stereo pairs yielded the following summary:
| Statistic | Disparity (px) | Depth (m) |
|---|---|---|
| Minimum | 2 | 1.0 |
| Maximum | 250 | 50.0 |
| Mean | 22.5 | 9.3 |
| Std. Dev. | 31.8 | 8.7 |
The aggregated per-pixel disparity histogram is skewed toward low disparity values (greater depth), with a long tail contributed by proximal structures. The depth histogram peaks at ranges beyond $5$ m.
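The table's summary statistics and the histograms can be reproduced with a few lines of NumPy; a sketch (function names are illustrative):

```python
import numpy as np

def summary_stats(values: np.ndarray, valid: np.ndarray) -> dict:
    """Min/max/mean/std over valid pixels, matching the table above."""
    v = values[valid].astype(np.float64)
    return {"min": v.min(), "max": v.max(), "mean": v.mean(), "std": v.std()}

def aggregated_histogram(values: np.ndarray, valid: np.ndarray, bins: int = 100):
    """Per-pixel histogram over valid pixels (e.g., of disparity or depth)."""
    return np.histogram(values[valid], bins=bins)
```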
Disparity-to-depth conversion follows the canonical relation

$$Z = \frac{f \, b}{d},$$

where $Z$ is depth in meters, $f$ the focal length in pixels, $b$ the baseline in meters, and $d$ the local disparity in pixels.
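A direct implementation of this relation, with non-positive disparities treated as invalid:

```python
import numpy as np

def disparity_to_depth(disp: np.ndarray, focal_px: float, baseline_m: float) -> np.ndarray:
    """Apply Z = f * b / d per pixel; non-positive disparities are
    treated as invalid and mapped to +inf."""
    depth = np.full(disp.shape, np.inf, dtype=np.float32)
    ok = disp > 0
    depth[ok] = focal_px * baseline_m / disp[ok]
    return depth
```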
4. Evaluation Metrics and Performance Protocols
Standard quantitative evaluation leverages metrics established in stereo correspondence:
- Average End-Point Error (EPE): $\mathrm{EPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{d}_i - d_i\right|$
- Bad-Pixel Ratio ($\mathrm{BP}_\tau$): percentage of valid pixels with $\left|\hat{d}_i - d_i\right| > \tau$ for a threshold $\tau$
- D1: percentage of pixels with absolute error $> 3$ px
- RMSE: $\mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}\left(\hat{d}_i - d_i\right)^2}$
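A minimal NumPy sketch of these four metrics, evaluated over valid pixels only (D1 is computed here with the conventional $3$ px threshold; the common KITTI variant additionally requires the error to exceed $5\%$ of the true disparity):

```python
import numpy as np

def stereo_metrics(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray,
                   tau: float = 3.0) -> dict:
    """EPE, bad-pixel ratio at threshold tau, D1, and RMSE over valid pixels."""
    err = np.abs(pred[valid].astype(np.float64) - gt[valid].astype(np.float64))
    return {
        "EPE": err.mean(),
        f"Bad>{tau:g}px (%)": (err > tau).mean() * 100.0,
        "D1 (%)": (err > 3.0).mean() * 100.0,   # simple >3 px variant
        "RMSE": np.sqrt((err ** 2).mean()),
    }
```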
Best practices recommend rectification and undistortion using the provided intrinsics, masking sky and invalid DEFOM regions during evaluation, clipping disparities for visualization, and adaptive filtering (median/bilateral) to mitigate speckle noise. Normalizing depth to a fixed range is advocated for deep-network supervision, as sketched below.
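A compact sketch of that pre-processing pipeline, assuming a normalization target of $[0,1]$ and that dividing by the dataset's $50$ m maximum depth is acceptable (the source only advocates normalization to a fixed range):

```python
import cv2
import numpy as np

def preprocess_depth(depth: np.ndarray, valid: np.ndarray,
                     max_depth_m: float = 50.0) -> np.ndarray:
    """Mask invalid/sky pixels, median-filter speckle noise, and normalize
    depth to [0, 1]. The 50 m divisor mirrors the dataset's maximum depth
    and is this sketch's assumption."""
    d = depth.astype(np.float32)
    d[~valid] = 0.0                  # mask sky and invalid DEFOM regions
    d = cv2.medianBlur(d, 5)         # speckle suppression (median filter)
    return np.clip(d / max_depth_m, 0.0, 1.0)
```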
5. Vegetation-Specific Challenges and Representative Scenarios
Foreground occlusions behind dense twigs and branches, fine-scale thin-branch structures, repetitive texture, and wind-induced motion blur characterize the dataset’s difficulty. These factors produce correspondence ambiguities, a heightened demand for sub-pixel disparity precision, and frequent occlusion boundaries. Representative scenes exemplify:
- Dense overlapping foliage with thin-branch detail and homogeneous sky
- Extreme depth discontinuities between canopy and understory
- Dappled illumination producing mixed-exposure regions
A plausible implication is the dataset's utility in benchmarking both depth smoothness and detailed structure preservation, two priorities that stereo matching methods often trade off in vegetated environments.
6. Model Evaluation and Benchmarking Insights
Zero-shot evaluation of state-of-the-art stereo methods was performed without fine-tuning, with all models trained on Scene Flow. Benchmarked methods include RAFT-Stereo, IGEV, IGEV++, BridgeDepth, StereoAnywhere, DEFOM, ACVNet, PSMNet, and TCstereo. Foundation models (e.g., BridgeDepth, DEFOM) excelled on structured, urban benchmarks (ETH3D: BridgeDepth $0.23$ px, DEFOM $0.35$–$4.65$ px EPE), yet iterative-refinement approaches (e.g., IGEV++: $0.36$–$6.77$ px; IGEV: $0.33$–$21.91$ px) displayed greater cross-domain robustness. RAFT-Stereo failed on ETH3D (EPE $26.23$ px) owing to negative disparity predictions, while performing normally on KITTI. Qualitative analysis designates DEFOM as the strongest “gold-standard baseline” for vegetation stereo, demonstrating superior depth smoothness, consistent occlusion handling, and cross-domain performance relative to IGEV++, which excels at fine-detail preservation (Lin et al., 3 Dec 2025).
7. Access, Licensing, and Citation
The Canterbury Forestry Dataset is publicly released under the Creative Commons Attribution 4.0 (CC BY 4.0) license, and the dataset and supporting resources are openly available from the authors.
The canonical citation is:
Lin et al., “Generalization Evaluation of Deep Stereo Matching Methods for UAV-Based Forestry Applications,” IVCNZ 2024.
The open-access framework facilitates secondary use for both evaluation of new stereo correspondence methods in vegetation-rich outdoor domains and development of foundation models with improved generalization to natural, unstructured scenes.