
SpatialTree-Bench Benchmark Suite

Updated 27 December 2025
  • SpatialTree-Bench is a benchmarking suite that evaluates spatial hierarchies in biological tissue simulations and remote tree mapping.
  • It compares diverse spatial tree data structures—such as k-d trees, octrees, and R-trees—using metrics like insertion and query times.
  • The framework extends to assessing spatial abilities in MLLMs via hierarchical tasks, enabling cross-domain model evaluations.

SpatialTree-Bench refers to a set of principled benchmarking frameworks used in two distinct research frontiers: (1) standardized evaluation of spatial hierarchies for simulation of spatially-resolved biological tissue, and (2) benchmarking for individual tree mapping using sub-meter resolution imagery. Additionally, related work extends the concept to multimodal LLMs (MLLMs) for spatial ability evaluation under the label "SpatialTree-Bench." Across these domains, SpatialTree-Bench frameworks provide rigorous, domain-specific metrics, dataset protocols, and comparative analyses of spatial structures or models, enabling reproducible, quantitative assessment of spatial reasoning and detection algorithms (Dmitrenok et al., 2016, Xiao et al., 23 Dec 2025, Gominski et al., 2023).

1. Benchmarking Spatial Hierarchies in Biological Tissue Simulation

SpatialTree-Bench is a benchmarking suite implemented within the BioDynaMo tissue simulator, explicitly designed for evaluating spatial hierarchy structures in large agent-based models of biological tissue (Dmitrenok et al., 2016). Its principal functions are to measure how alternative spatial tree data structures impact the computational costs of (a) broad-phase neighborhood search and (b) dynamic reinsertion of moving/deforming agents, as well as to systematically search for optimal parameterizations (e.g., tree depth, leaf capacity, splitting heuristic).

Protocol:

  • Agents (N = 10,000 to 1,000,000) are generated with centers uniformly in $[0,1]^3$.
  • Data structures are incrementally built through point insertions.
  • Fixed-radius neighbor queries are issued for each agent at radii $r=10^{-3}$ and $r=10^{-5}$.
  • Metrics reported: average insertion time and query time (over five trials), aggregated as total time $T_{\text{total}}=T_{\text{insert}}+T_{\text{query}}$.
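
The insertion/query protocol above can be sketched as a small timing harness. The snippet below is illustrative only, not the BioDynaMo benchmark code: it uses SciPy's `cKDTree` as a stand-in for the benchmarked structures, with bulk construction approximating incremental insertion.

```python
import time
import numpy as np
from scipy.spatial import cKDTree

def run_trial(n_agents=10_000, radius=1e-3, seed=0):
    """One benchmark trial: build a tree over uniform agents, then
    issue a fixed-radius neighbor query for every agent."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_agents, 3))      # centers uniform in [0,1]^3

    t0 = time.perf_counter()
    tree = cKDTree(points)                  # build (stands in for insertion)
    t_insert = time.perf_counter() - t0

    t0 = time.perf_counter()
    neighbors = tree.query_ball_point(points, r=radius)
    t_query = time.perf_counter() - t0

    return t_insert, t_query, neighbors

t_ins, t_qry, nbrs = run_trial()
print(f"T_total = {t_ins + t_qry:.4f} s over {len(nbrs)} agents")
```

Averaging `t_insert` and `t_query` over five seeds reproduces the protocol's $T_{\text{total}}$ aggregation.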

Spatial Trees Compared:

  • Octree: 8-ary, axis-aligned cube splitting, with configurable max depth $D$ and leaf capacity $C$.
  • k-d Tree: Binary space partition with four evaluated splitting heuristics:
    • MMAS (Median Multiple Axis Split): cyclical axis, split at median.
    • MSAS (Median Single Axis Split): always split along longest dimension at median.
    • CS (Center Split): cycle axes, split at box midpoint.
    • SAHS (Surface Area Heuristic Split): split to minimize surface area sum.
  • R-Tree: B-tree-like structure with $M$-fanout bounding rectangles, $M \in \{5, 25, 125\}$.

Complexities:

  • Insertion: Average $O(\log N)$ for k-d and octree, but worst-case $O(N)$ in degenerate splits.
  • Query: $O(\log N + k)$ average, $O(N + k)$ under degenerate pruning, where $k$ is the output size.
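
To make the splitting heuristics concrete, here is a minimal pure-Python k-d tree using an MSAS-style split (median along the longest box dimension) with a fixed-radius query; `LEAF_CAPACITY` plays the role of the benchmark's $C$. This is an illustrative sketch, not the BioDynaMo implementation.

```python
import numpy as np

LEAF_CAPACITY = 10  # the benchmark's leaf-capacity parameter C

def build(points):
    """Recursively split at the median along the longest box dimension (MSAS)."""
    if len(points) <= LEAF_CAPACITY:
        return ("leaf", points)
    extent = points.max(axis=0) - points.min(axis=0)
    axis = int(np.argmax(extent))            # longest dimension
    pts = points[np.argsort(points[:, axis])]
    mid = len(pts) // 2                      # median split
    return ("node", axis, pts[mid, axis], build(pts[:mid]), build(pts[mid:]))

def query(node, center, r):
    """Fixed-radius neighbor search with splitting-plane pruning."""
    if node[0] == "leaf":
        pts = node[1]
        d = np.linalg.norm(pts - center, axis=1)
        return [p for p, di in zip(pts, d) if di <= r]
    _, axis, split, left, right = node
    out = []
    if center[axis] - r <= split:            # left subtree may intersect ball
        out += query(left, center, r)
    if center[axis] + r >= split:            # right subtree may intersect ball
        out += query(right, center, r)
    return out

rng = np.random.default_rng(0)
pts = rng.random((5000, 3))
tree = build(pts)
hits = query(tree, pts[0], 1e-3)
```

Swapping the `axis` selection for a cyclic counter turns this into the MMAS variant; splitting at the box midpoint rather than the median gives CS.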

2. SpatialTree-Bench Protocols and Metrics in Tree Mapping

SpatialTree-Bench for tree mapping offers a unified evaluation framework for the detection, localization, and size estimation of individual trees from high-resolution imagery (Gominski et al., 2023). Key features include explicit annotation cost modeling, a flexible matching protocol from polygon annotations, and robust error metrics accounting for spatial uncertainty and label ambiguity.

Benchmark Datasets:

  • Denmark: 23,600 trees, 712 $\mathrm{km}^2$, 20 cm GSD, RGB+NIR.
  • Rwanda: 98,800 trees, 340 $\mathrm{km}^2$, 25 cm GSD, RGB.

Annotation Protocol:

  • Manual polygon crowns yield tree centers and crown areas.
  • Annotation with segmentation masks requires roughly an order of magnitude more effort than center+area (disk) labeling.

Evaluation Metrics:

  • Detection: Precision, recall, F1, and instance IoU.
  • Matching: Cost $c_{ij} = \|p_i-\hat p_j\|_2 + \lambda\,|\mathrm{CA}_i-\widehat{\mathrm{CA}}_j|$, under the constraint $\|p_i-\hat p_j\|_2 < \gamma\,\mathrm{CD}_i$.
  • Mode-Aware Scores: Balanced F1 ($bF1$) for one-to-one, many-to-one, and one-to-many matching, to counteract annotation noise.
  • Localization/Crown Area: Balanced RMSE over location and area.
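
Under the stated cost and gating rule, the matching step can be sketched as a gated Hungarian assignment. The `lam` and `gamma` defaults below are illustrative placeholders, not the benchmark's tuned values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(p, ca, cd, p_hat, ca_hat, lam=0.1, gamma=1.0):
    """One-to-one matching of ground-truth trees (positions p, crown areas
    ca, crown diameters cd) to predictions (p_hat, ca_hat) under the
    combined cost, with pairs violating the distance gate made infeasible."""
    dist = np.linalg.norm(p[:, None, :] - p_hat[None, :, :], axis=-1)
    cost = dist + lam * np.abs(ca[:, None] - ca_hat[None, :])
    infeasible = dist >= gamma * cd[:, None]   # gate: ||p - p_hat|| < gamma * CD
    rows, cols = linear_sum_assignment(np.where(infeasible, 1e9, cost))
    keep = ~infeasible[rows, cols]             # drop assignments outside the gate
    return rows[keep], cols[keep]
```

Matched index pairs feed directly into precision/recall/$bF1$; unmatched ground truths count as misses and unmatched predictions as false positives.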

Table: Detection and Localization Results (selected Denmark dataset results, $bF1$@1.0)

| Method       | nMAE (%) | $bF1$@1.0 |
|--------------|----------|-----------|
| Segmentation | 30.7     | 42.1      |
| CenterNet    | 55.2     | 40.7      |
| Heatmap      | 33.4     | 41.8      |
| P2P          | 26.4     | 49.6      |
| FasterRCNN   | 39.9     | 47.6      |

P2P (point-proposal) yields top strict detection, while the heatmap-only method provides annotation efficiency and robust performance.

3. Hierarchical Spatial Ability Benchmarking: SpatialTree-Bench in MLLMs

SpatialTree-Bench has been extended to a hierarchical capability-centric evaluation of spatial abilities in MLLMs, providing a cognitive-science-grounded taxonomy and organizing 27 sub-abilities into four ascending levels (Xiao et al., 23 Dec 2025):

  • L1: Low-level Perception: (e.g., depth, orientation, egocentric motion; 11 sub-abilities)
  • L2: Mental Mapping: (e.g., spatial captioning, semantic relations, cognitive-map memory; 7 sub-abilities)
  • L3: Mental Simulation: (e.g., geometric reasoning, sequential planning; 5 sub-abilities)
  • L4: Agentic Competence: (e.g., navigation, robotic/human manipulation; 4 sub-abilities)

Dataset and Protocols:

  • Unification of >12 prior datasets plus a new SpatialPlus dataset (250k QAs).
  • QA annotation leverages perception engines (e.g., DepthAnything3), LLM-generated templates, and human curation.
  • 41 benchmark tasks, >200k examples, with 70.7% MCQ, 15% numeric, 10% open response, 4.3% agentic trajectories.
  • Metrics: Per-level—accuracy, relative error, angular diff., continuous success for agentic tasks; results are aggregated with hierarchical weights.
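
The bottom-up aggregation with hierarchical weights can be sketched as a weighted average of per-level means; the level weights here are placeholders for illustration, not the benchmark's published values.

```python
def aggregate(level_scores, level_weights):
    """level_scores: {level: [task scores in 0..1]};
    level_weights: {level: weight}, weights summing to 1."""
    assert abs(sum(level_weights.values()) - 1.0) < 1e-9
    per_level = {lvl: sum(s) / len(s) for lvl, s in level_scores.items()}
    overall = sum(level_weights[lvl] * per_level[lvl] for lvl in per_level)
    return per_level, overall

# Hypothetical per-task scores and weights for the four levels.
scores = {"L1": [0.8, 0.6], "L2": [0.5], "L3": [0.4], "L4": [0.2]}
weights = {"L1": 0.4, "L2": 0.3, "L3": 0.2, "L4": 0.1}
per_level, overall = aggregate(scores, weights)
```

Weighting the foundational perceptual level most heavily reflects the "prioritize foundational skills" recommendation; any monotone weighting scheme plugs into the same function.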

Key Findings:

  • L1 sub-abilities exhibit near-orthogonality ($r<0.2$), while L3–L4 show strong correlation ($r>0.6$).
  • Supervised fine-tuning on single L1 tasks produces negative transfer within L1 but aids L2–L4.
  • Multi-ability SFT harnesses cross-skill synergy, mitigating negative transfer.
  • RL with uniform "thinking" benefits complex tasks but degrades perception; hierarchy-aware RL with level-sensitive reward shaping ("auto-think") yields positive transfer across all levels.

4. Comparative Analysis of Baselines and Methods

Agent-based Simulation (BioDynaMo):

  • Best-in-class: k-d trees with MSAS at small search radii (e.g., $r=10^{-5}$), $C \approx 10$, $D \approx 1000$ (Table: $T_{\text{total}}$ for 1M points: k-d MSAS 3,149 ms, MMAS 5,402 ms, octree 9,426 ms, R-tree >180,000 ms).
  • Trade-offs: lower $C$ yields deeper trees with faster search; raising $D$ beyond a threshold gives diminishing search-speed gains; SAHS pays off in complex volumes, not point-only settings.

Tree Mapping (Imagery):

  • Point-based detection (P2P): Best strict detection (low nMAE, high $bF1$).
  • Heatmap method: Most annotation-efficient, with competitive error rates and robustness to noisy labels; ensembling and test-time augmentation add a further +20–25% $bF1$.
  • Segmentation: Offers the highest upper bound on localization/crown-area accuracy (RMSE), by a slight margin.
  • Backbones: UNet-R50 outperforms alternatives (DeepLabV3, SegFormer, TransUNet) on this task.
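
The ensembling/test-time-augmentation gains reported for the heatmap method can be sketched as flip-averaged prediction. `model` below is a placeholder for any heatmap predictor (shown with a dummy identity function), not a specific architecture from the benchmark.

```python
import numpy as np

def tta_heatmap(model, image):
    """Average heatmap predictions over identity plus two flips; each
    (forward, inverse) pair maps the prediction back to input coordinates."""
    flips = [
        (lambda x: x,          lambda y: y),           # identity
        (lambda x: x[:, ::-1], lambda y: y[:, ::-1]),  # horizontal flip
        (lambda x: x[::-1, :], lambda y: y[::-1, :]),  # vertical flip
    ]
    preds = [inv(model(fwd(image))) for fwd, inv in flips]
    return np.mean(preds, axis=0)

identity_model = lambda img: img.astype(float)  # dummy stand-in predictor
heat = tta_heatmap(identity_model, np.arange(16).reshape(4, 4))
```

A real predictor would replace `identity_model`; model ensembling follows the same pattern by averaging over checkpoints instead of flips.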

5. Robustness, Annotation Efficiency, and Practical Recommendations

SpatialTree-Bench's balanced matching and scoring protocols address the non-uniqueness of mappings when annotation noise (merges/splits) is present. Balanced F1 ($bF1$) and the composite error metrics are substantially more robust to real-world label ambiguity than standard one-to-one matching. Annotation regimes built around disk-based targets (center+area) reduce labeling effort by one order of magnitude over segmentation masks, without significant loss in detection or localization accuracy in dense canopy conditions (Gominski et al., 2023).

For biological simulation, practitioners are advised to profile both insertion and search; in dynamic simulations, search cost dominates, so minimizing $T_{\text{query}}$ is favored even at slight insertion cost. Octrees remain valuable for distributed-memory implementations needing uniform partitions, despite being $\sim$1.5–2× slower than the best k-d trees.

For MLLMs, capability-centric, hierarchy-aware benchmarks and reward structures are essential. Fine-tuning should combine basic perceptual modes (distance, size, correspondence) for optimal transfer, and RL should penalize unnecessary reasoning in direct perception tasks while incentivizing it in sequential/high-level tasks.

6. Impact, Future Directions, and Best Practices

SpatialTree-Bench establishes a reproducible, quantitative standard in two otherwise disparate research domains: large-scale agent-based tissue simulation and individual tree detection in high-resolution remote sensing. Its methodology—robust, annotation-efficient, and protocol-driven—has enabled more accurate comparison of architectures and facilitated transfer of best practices such as balanced matching, capability-centric evaluation, and hierarchical aggregation of metrics.

Best Practices (as found across domains):

  • Use hierarchical benchmarks over isolated tasks (Xiao et al., 23 Dec 2025).
  • For simulation: prefer k-d MSAS trees for fine-grained, spatially-local interaction, octree for grid partitioning.
  • For tree mapping: employ disk-based heatmap targets for annotation efficiency, point-based proposals for maximum detection precision.
  • Aggregate scores bottom-up, prioritize foundational (perceptual) skills/data, and tailor RL to task complexity level.
  • Adopt ensembling and test-time augmentation to mitigate annotation/model uncertainty.

SpatialTree-Bench protocols and findings are foundational for scalable, robust spatial modeling in both environmental sensing and artificial intelligence research (Dmitrenok et al., 2016, Xiao et al., 23 Dec 2025, Gominski et al., 2023).
