Voxel-Height-Guided Sampling (VHS)

Updated 4 July 2026

Voxel-Height-Guided Sampling (VHS) is a technique that uses voxel or pillar-specific height priors to guide feature selection and pooling along the vertical axis.
Different frameworks implement VHS via methods like per-column truncation, occupancy-informed interval pooling, and mask-guided projection to reduce cross-height feature confusion.
Empirical studies show that integrating height priors in VHS enhances 3D occupancy prediction and object detection metrics compared to height-agnostic approaches.

Voxel-Height-Guided Sampling (VHS) denotes the use of voxel-space height information to guide how image-derived features are selected, pooled, or projected into 3D representations. In the current literature, the term is used explicitly in Collaborative Perceiver, while closely related mechanisms appear under different names in HiPR and Deep Height Decoupling: HiPR’s Height-Guided Reparameterization (HGR) and DHD’s Mask Guided Height Sampling (MGHS). Across these formulations, the shared objective is to reduce cross-height feature confusion by replacing height-agnostic sampling or pooling with operations conditioned on per-pillar heights, height intervals, or predicted voxel-height masks (Yuan et al., 28 Jul 2025, Wu et al., 6 May 2026, Wu et al., 2024).

1. Terminology and conceptual scope

The literature does not present VHS as a single canonical module. Collaborative Perceiver introduces “voxel-height-guided sampling” as an explicit component for vision-based 3D object detection with auxiliary occupancy supervision, using occupancy-driven height intervals and masked pooling along the vertical axis (Yuan et al., 28 Jul 2025). HiPR does not use the name VHS explicitly, but its Height-Guided Reparameterization is described as implementing “precisely what VHS refers to”: sampling vertically along each voxel or pillar column guided by a per-pillar height prior, with invalid pillars skipped (Wu et al., 6 May 2026). Deep Height Decoupling likewise does not use the term VHS, but its MGHS is described as “a direct instantiation of voxel-height-guided sampling,” because it predicts per-pixel height, converts that prediction into height-range masks, and gates feature projection into height-consistent voxel subspaces (Wu et al., 2024).

A useful way to differentiate the three formulations is by the source of the height prior and the locus of intervention in the pipeline.

Framework	Height prior	VHS operation
HiPR	LiDAR-derived or ground-truth-conditioned per-pillar height map	Truncate pillar sampling range and mask invalid pillars
CoP	LDO-informed height intervals over occupied voxels	Masked weighted pooling over $z$ and interval fusion
DHD (MGHS)	Predicted per-pixel voxel-height bins	Height-mask gating before projection into subspaces

This terminological variation is significant. VHS is therefore best understood as a design principle rather than a single algorithm: it injects explicit height structure into 2D-to-3D lifting, BEV construction, or query-to-image aggregation.

2. Problem setting and mathematical basis

The common problem is that conventional 2D-to-3D transformations are typically height-agnostic or rely on globally shared vertical ranges. HiPR frames 3D occupancy prediction as voxel-wise semantic inference in a fixed region of interest, with examples including Occ3D using $X,Y \in [-40,40]\ \mathrm{m}$ and $Z \in [-1,5.4]\ \mathrm{m}$ at $0.4\ \mathrm{m}$ resolution, and SurroundOcc using $X,Y \in [-50,50]\ \mathrm{m}$ and $Z \in [-5,3]\ \mathrm{m}$ at $0.5\ \mathrm{m}$ resolution (Wu et al., 6 May 2026). In HiPR, image lifting follows an LSS-based formulation: for pixel coordinate $p=[u,v,1]^\top$, camera intrinsics $K$, and extrinsics $(R,t)$,

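$$P = R^{-1}\left(d\,K^{-1}p - t\right),$$

where $d$ is the depth assigned to the pixel and $P$ is the lifted 3D point, taking $(R,t)$ as the world-to-camera transform.

The world-to-image form is

$$d\,p = K\left(R\,P + t\right).$$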

Multi-view features are accumulated into a coarse BEV query, after which BEVFormer-style deformable cross-attention samples image features from 3D reference points along each pillar (Wu et al., 6 May 2026).
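The world-to-image projection that underlies this sampling step can be illustrated with a minimal pure-Python sketch. It assumes a standard pinhole model in which $(R,t)$ maps world coordinates to camera coordinates; the function name `project_point` and the numeric values are illustrative and do not come from HiPR.

```python
def project_point(K, R, t, P):
    """Project a 3D point P into the image; returns (u, v, depth)."""
    # Camera-frame coordinates: X_c = R @ P + t (world-to-camera is assumed).
    cam = [sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3)]
    # Homogeneous pixel coordinates: d * [u, v, 1]^T = K @ X_c.
    hom = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
    depth = hom[2]
    if depth <= 0:
        raise ValueError("point is behind the camera")
    return hom[0] / depth, hom[1] / depth, depth


# Illustrative intrinsics: focal length 100 px, principal point (50, 40).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
u, v, d = project_point(K, R, t, [1.0, 2.0, 4.0])
```

Each 3D reference point along a pillar would be projected this way into every camera that observes it, so the choice of reference heights directly determines which image features are sampled.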

The difficulty arises because conventional sampling uses a globally shared interval:

$$z_n \in \left[z_{\min},\ z_{\max}\right], \quad n = 1,\dots,N, \quad \text{identical for every pillar}.$$

HiPR identifies that this uniform fixed-range sampling “struggles to capture the sparsity and height variations of real-world scenes,” causing ambiguous correspondences and unreliable aggregation (Wu et al., 6 May 2026). CoP describes an analogous failure mode from the perspective of BEV detection: BEV collapse destroys object-specific vertical structure, so objects with similar BEV footprints but different heights, such as cones and trucks, become difficult to distinguish (Yuan et al., 28 Jul 2025). DHD formulates the same issue in forward projection: VoxelPooling and BEVPooling introduce “many confusing features that belong to other height ranges,” because features are projected or collapsed without an explicit height prior (Wu et al., 2024).

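CoP formalizes the voxel grid in LiDAR coordinates with bounds $[x_{\min},x_{\max}] \times [y_{\min},y_{\max}] \times [z_{\min},z_{\max}]$, voxel sizes $(\Delta x,\Delta y,\Delta z)$, and voxel index

$$(i,j,k) = \left(\left\lfloor \frac{x-x_{\min}}{\Delta x} \right\rfloor,\ \left\lfloor \frac{y-y_{\min}}{\Delta y} \right\rfloor,\ \left\lfloor \frac{z-z_{\min}}{\Delta z} \right\rfloor\right).$$

Its lifted voxel feature volume is written as

$$V \in \mathbb{R}^{C \times N_x \times N_y \times N_z},$$

and a height at BEV location $(i,j)$ may be defined by an occupancy-weighted expectation

$$\bar{h}(i,j) = \frac{\sum_{k} O(i,j,k)\, z_k}{\sum_{k} O(i,j,k)},$$

where $O(i,j,k)$ is the occupancy of voxel $(i,j,k)$ and $z_k$ is the height of level $k$.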

DHD instead predicts per-pixel height bins directly and uses them to construct binary masks over discrete height intervals (Yuan et al., 28 Jul 2025, Wu et al., 2024).
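The occupancy-weighted height expectation mentioned above for CoP can be sketched for a single BEV column as follows. The grid values (16 levels of 0.4 m starting at -1 m) are borrowed from the Occ3D example purely for illustration, and the function name `expected_height` and the empty-column handling are assumptions rather than details from the paper.

```python
def expected_height(occupancy, z_centers, eps=1e-6):
    """Occupancy-weighted mean height of one BEV column, or None if empty."""
    total = sum(occupancy)
    if total < eps:
        return None  # no occupied evidence in this column
    return sum(o * z for o, z in zip(occupancy, z_centers)) / total


# Illustrative grid: 16 levels of 0.4 m starting at z = -1 m.
z_min, dz, n_z = -1.0, 0.4, 16
z_centers = [z_min + (k + 0.5) * dz for k in range(n_z)]

# Occupancy mass split equally between levels 2 and 4 (centers 0.0 m and 0.8 m).
column = [0.0] * n_z
column[2] = 0.5
column[4] = 0.5
h = expected_height(column, z_centers)
h_empty = expected_height([0.0] * n_z, z_centers)
```

Here the expected height falls midway between the two occupied levels, while an empty column yields no height at all, which is the case a validity mask has to handle.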

These formulations differ operationally, but all target the same structural deficiency: the vertical axis is not uniformly informative, and treating it as such dilutes geometry and semantics.

3. HiPR: pillar-wise height-bounded sampling and progressive conditioning

HiPR implements VHS through Height-Guided Projection Reparameterization within a camera-LiDAR occupancy framework. The key signal is a BEV height map obtained by collapsing the LiDAR occupancy grid along $z$. Given LiDAR points voxelized into a binary occupancy grid $O(i,j,k) \in \{0,1\}$, the maximum occupied height index is

$$k^{*}(i,j) = \max\{\,k : O(i,j,k) = 1\,\}$$

and the metric height is

$$H(i,j) = z_{k^{*}(i,j)},$$

where $z_k$ denotes the metric height of level $k$.

Empty pillars are assigned an invalid height value, with validity mask

$$M(i,j) = \mathbb{1}\!\left[\exists\, k : O(i,j,k) = 1\right].$$

HiPR then replaces the global vertical bound with a pillar-specific cap,

$$z_{\max}(i,j) = H(i,j),$$

and samples

$$z_n(i,j) \in \left[z_{\min},\ z_{\max}(i,j)\right], \quad n = 1,\dots,N.$$

If $M(i,j)=0$, the query is not updated; otherwise, deformable cross-attention aggregates multi-view features from the reparameterized reference points. The update is

$$Q'(i,j) = Q(i,j) + M(i,j)\,\mathrm{DeformAttn}\!\left(Q(i,j),\ \{z_n(i,j)\}_{n=1}^{N},\ F\right),$$

where $Q$ is the BEV query and $F$ denotes the multi-view image features.

The paper emphasizes that there is “no multi-range partitioning or explicit reweighting beyond deformable attention”; the effect comes from truncation and masking (Wu et al., 6 May 2026).
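The truncation-and-masking idea described above can be sketched in a few lines of pure Python. This is a simplified illustration, not HiPR's implementation: it assumes the pillar cap is the top face of the highest occupied voxel and that reference heights are evenly spaced below the cap, and the helper names `lidar_height_map` and `pillar_reference_heights` are hypothetical.

```python
def lidar_height_map(occ, z_min, dz):
    """occ[i][j][k] in {0, 1}. Returns (height, valid) BEV maps."""
    height, valid = [], []
    for row in occ:
        h_row, v_row = [], []
        for column in row:
            occupied = [k for k, o in enumerate(column) if o]
            if occupied:
                # Assumption: cap at the top face of the highest occupied voxel.
                h_row.append(z_min + (max(occupied) + 1) * dz)
                v_row.append(True)
            else:
                h_row.append(None)  # invalid height for an empty pillar
                v_row.append(False)
        height.append(h_row)
        valid.append(v_row)
    return height, valid


def pillar_reference_heights(z_min, z_cap, n_points):
    """n_points heights evenly spaced inside [z_min, z_cap]."""
    step = (z_cap - z_min) / n_points
    return [z_min + (n + 0.5) * step for n in range(n_points)]


# One row of two pillars with four height levels; the second pillar is empty.
occ = [[[1, 1, 0, 0], [0, 0, 0, 0]]]
z_min, dz = -1.0, 0.4
height, valid = lidar_height_map(occ, z_min, dz)
refs = [
    [pillar_reference_heights(z_min, h, 4) if ok else [] for h, ok in zip(hr, vr)]
    for hr, vr in zip(height, valid)
]
```

The occupied pillar receives four reference heights confined below its own cap, and the empty pillar receives none, so its query would be left unchanged.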

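HiPR adds Progressive Height Conditioning (PHC) because LiDAR-derived heights are noisy and sparse. During training,

$$H_{\text{train}}(i,j) = B(i,j)\,H_{\text{gt}}(i,j) + \left(1 - B(i,j)\right) H_{\text{lidar}}(i,j), \qquad B(i,j) \sim \mathrm{Bernoulli}(p_e),$$

where $H_{\text{lidar}}$ is the LiDAR-derived height map, $H_{\text{gt}}$ is the ground-truth height map, and valid LiDAR grids are independently replaced by ground-truth heights with an epoch-dependent probability $p_e$ that is largest at the start of training.

Early epochs therefore rely heavily on $H_{\text{gt}}$, while inference disables PHC and uses $H_{\text{lidar}}$ (Wu et al., 6 May 2026).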
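The stochastic replacement used by PHC can be sketched as below. The linear decay in `replacement_probability` is a hypothetical schedule chosen only for illustration, since the actual schedule is not reproduced here; `condition_heights` is likewise an assumed helper name operating on a flat list of pillars.

```python
import random


def replacement_probability(epoch, total_epochs):
    """Hypothetical linear decay from 1 to 0; the source schedule may differ."""
    return max(0.0, 1.0 - epoch / total_epochs)


def condition_heights(h_lidar, h_gt, valid, p, rng):
    """Replace each valid LiDAR height by the ground truth with probability p."""
    out = []
    for hl, hg, ok in zip(h_lidar, h_gt, valid):
        if ok and rng.random() < p:
            out.append(hg)  # ground-truth height
        else:
            out.append(hl)  # LiDAR height (None when the pillar is invalid)
    return out


h_lidar = [1.0, None, 2.0]
h_gt = [1.5, 0.5, 2.5]
valid = [True, False, True]
early = condition_heights(h_lidar, h_gt, valid, replacement_probability(0, 10), random.Random(0))
late = condition_heights(h_lidar, h_gt, valid, replacement_probability(10, 10), random.Random(0))
```

With the probability at 1 every valid pillar takes its ground-truth height, and with the probability at 0 the LiDAR heights pass through untouched, which corresponds to the inference-time behaviour. Invalid pillars are never replaced in either case.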

The occupancy decoder is query-based (Wu et al., 6 May 2026). The empirical ablations on ALOcc-2D-mini isolate the contribution of each VHS component by reporting mIoU for the baseline and then for the addition of height-guided sampling only, the height-validity mask, and PHC (Wu et al., 6 May 2026). The alternative sampling strategies compared are uniform sampling, a learned height predictor, LiDAR mean pillar height, and LiDAR highest occupied height (Wu et al., 6 May 2026).

HiPR reports higher mIoU than DAOcc on Occ3D with the camera visible mask and higher RayIoU than DAOcc on Occ3D without the camera mask (Wu et al., 6 May 2026). On SurroundOcc, it outperforms OccCylindrical in mIoU (Wu et al., 6 May 2026). The runtime table reports FPS for HiPR-mini on the ALOcc-2D-mini backbone and for full HiPR on ALOcc-2D; the paper characterizes HiPR-mini as real-time on the basis of its frame rate (Wu et al., 6 May 2026).

4. Collaborative Perceiver: interval-based VHS with local-density-aware occupancy

Collaborative Perceiver introduces VHS explicitly in a multi-task framework for vision-based 3D object detection, where spatial occupancy acts as auxiliary information to refine BEV representations (Yuan et al., 28 Jul 2025). Its motivation is distinct from HiPR’s query reparameterization but converges on the same principle: vertical occupancy is highly non-uniform, and dense evidence tends to concentrate in specific height bands. CoP formalizes this with Local-Density-Aware Occupancy (LDO), which produces a dense occupancy ground truth containing both semantic occupancy and a per-voxel local density weight (Yuan et al., 28 Jul 2025).

VHS in CoP operates not by altering pillar sample coordinates but by defining a set of Height-of-Interest intervals $\{I_m\}_{m=1}^{M}$ from LDO statistics. The reported intervals are grouped into three categories:

Base Layer (BL): four intervals
Universal Layer (UL): two intervals
Extended Focus Layer (EFL): two intervals

With $M=8$ intervals, CoP performs masked weighted pooling along $z$:

$$F_m(i,j) = \frac{\sum_{k} \mathbb{1}[k \in I_m]\, O(i,j,k)\, w(i,j,k)\, V(i,j,k)}{\sum_{k} \mathbb{1}[k \in I_m]\, O(i,j,k)\, w(i,j,k) + \epsilon},$$

where $\mathbb{1}[k \in I_m]$ indicates whether height index $k$ lies in interval $I_m$, $O(i,j,k)$ is an occupancy mask, and $w(i,j,k)$ is either $1$ or the local density weight (Yuan et al., 28 Jul 2025). The interval-specific pooled features are concatenated and fused with an SE-style mechanism:

$$F_{\text{local}} = A \odot \tilde{F},$$

where $\tilde{F}$ is produced by a convolution on the concatenated tensor $[F_1,\dots,F_M]$ and $A$ is produced by attention derived from GAP, MLP, broadcasting, and a convolution (Yuan et al., 28 Jul 2025).
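The masked weighted pooling along the vertical axis can be sketched for a single BEV column as follows. This is an illustrative reading rather than CoP's code: it assumes half-open index intervals and a normalized weighted mean, and the function name `interval_pool`, the two example intervals, and the feature values are all made up for the example.

```python
def interval_pool(features, occ, weights, intervals, eps=1e-6):
    """Pool one column per height interval.

    features[k] is the channel vector at height index k, occ[k] is a 0/1
    occupancy mask, and weights[k] is 1 or a local density weight.
    """
    channels = len(features[0])
    pooled = []
    for lo, hi in intervals:  # half-open index range [lo, hi)
        num = [0.0] * channels
        den = 0.0
        for k in range(lo, hi):
            w = occ[k] * weights[k]  # unoccupied levels contribute nothing
            den += w
            for c in range(channels):
                num[c] += w * features[k][c]
        # Assumption: normalize by the total weight inside the interval.
        pooled.append([x / (den + eps) for x in num])
    return pooled


features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
occ = [1, 1, 0, 1]
weights = [1.0, 3.0, 1.0, 1.0]
pooled = interval_pool(features, occ, weights, [(0, 2), (2, 4)])
```

The first interval is dominated by the level with the larger density weight, and the second interval ignores its unoccupied level entirely. The per-interval outputs would then be concatenated along the channel axis before fusion.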

This height-aware local feature is then combined with the global BEV feature in the Collaborative Feature Fusion (CFF) module, which computes an adaptive gate and applies it to produce the updated BEV feature (Yuan et al., 28 Jul 2025).

The occupancy branch receives a channel-to-height transformation, while the detection head predicts 3D boxes (Yuan et al., 28 Jul 2025). Training uses a joint objective, and the occupancy loss is weighted by local density (Yuan et al., 28 Jul 2025).

The ablations attribute clear gains to VHS. On nuScenes validation with ResNet-50, NDS and mAP are reported for four cumulative configurations: Dense Occ only, adding LDO, adding VHS, and adding CFF to obtain the final model (Yuan et al., 28 Jul 2025). The sampling strategy ablation compares global pooling, uniform height pooling, BL alone, BL+UL, and BL+UL+EFL in terms of mAP and NDS (Yuan et al., 28 Jul 2025). CoP also reports mAP and NDS on the nuScenes test set with ResNet-101 (Yuan et al., 28 Jul 2025).

A notable property of CoP’s VHS is that it is explicitly interval-based rather than pillar-specific. The mechanism is therefore closer to stratified vertical pooling than to HiPR’s per-column truncation. This suggests that VHS can refer either to adaptive coordinate reparameterization or to adaptive interval selection, provided the governing principle is occupancy- or height-guided vertical feature selection.

5. Deep Height Decoupling: MGHS as VHS by mask-guided projection

Deep Height Decoupling addresses vision-based 3D occupancy prediction by inserting an explicit height prior into forward projection (Wu et al., 2024). Its central observation is that both VoxelPooling and BEVPooling mix features from incompatible height ranges, either by splatting image features across many voxel heights or by collapsing height entirely. DHD’s solution is to predict a per-pixel height map with HeightNet and then use Mask Guided Height Sampling, which the supplied description identifies as conceptually equivalent to VHS (Wu et al., 2024).

Height is discretized into $N_h$ bins over the voxelized $z$-range. For Occ3D-nuScenes, the range is $[-1, 5.4]\ \mathrm{m}$ with $0.4\ \mathrm{m}$ voxel height, giving $N_h = 16$ bins (Wu et al., 2024). HeightNet predicts a categorical distribution $P_h(u,v) \in [0,1]^{N_h}$, and the discrete height index is

$$\hat{h}(u,v) = \arg\max_{c}\ P_h(u,v)_c.$$

LiDAR supervision is constructed by projecting LiDAR points into the image and assigning the ego height $z$ of the closest point to each pixel. Height and depth heads are trained with BCE on discretized bins:

$$\mathcal{L}_{\text{BCE}} = -\sum_{c} \left[\, y_c \log \hat{p}_c + (1 - y_c) \log\left(1 - \hat{p}_c\right) \right],$$

where $y$ is the one-hot target over bins and $\hat{p}$ is the predicted distribution.

From dataset height statistics, DHD selects $3$ intervals, $\{I_1, I_2, I_3\}$, described as the “4+4+8” split (Wu et al., 2024). The paper reports that this split yields the lowest weighted semantic entropy, lower than that of no split (Wu et al., 2024). For each interval $I_m$, the hard mask is

$$M_m(u,v) = \mathbb{1}\!\left[\hat{h}(u,v) \in I_m\right].$$

The height-aware feature is then

$$F_m(u,v) = M_m(u,v)\, F(u,v),$$

where $F$ is the image feature map.

Each masked feature map is forward-projected into the 3D subspace whose voxel $z$ indices lie in $I_m$, producing height-refined features $\{F^{h}_m\}$ (Wu et al., 2024).
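The mask-guided gating step can be sketched with plain Python lists. The sketch assumes the Occ3D vertical range of -1 m to 5.4 m at 0.4 m resolution and reads the “4+4+8” split as consecutive bin groups counted from the lowest bin; both the bottom-up ordering and the helper names (`interval_bounds`, `height_masks`, `gate_features`, `peaked`) are assumptions made for illustration.

```python
Z_MIN, Z_MAX, DZ = -1.0, 5.4, 0.4
N_BINS = round((Z_MAX - Z_MIN) / DZ)  # number of height bins
SPLIT = (4, 4, 8)  # assumed to be counted from the lowest bin upward


def interval_bounds(split):
    """Turn group sizes into half-open bin ranges [lo, hi)."""
    bounds, start = [], 0
    for size in split:
        bounds.append((start, start + size))
        start += size
    return bounds


def height_masks(height_probs, bounds):
    """height_probs[p] is a per-pixel distribution over bins.

    Returns masks[m][p] in {0, 1} from the argmax bin of each pixel.
    """
    idx = [max(range(len(p)), key=p.__getitem__) for p in height_probs]
    return [[1 if lo <= h < hi else 0 for h in idx] for lo, hi in bounds]


def gate_features(features, mask):
    """Zero out the feature vectors of pixels outside the interval."""
    return [[m * f for f in feat] for feat, m in zip(features, mask)]


def peaked(bin_index):
    """Toy distribution with most of its mass on one bin."""
    probs = [0.01] * N_BINS
    probs[bin_index] = 0.85
    return probs


bounds = interval_bounds(SPLIT)
masks = height_masks([peaked(1), peaked(5), peaked(12)], bounds)
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
middle = gate_features(features, masks[1])
```

Each pixel ends up in exactly one interval, so its feature is projected only into the voxel subspace that matches its predicted height. Because the assignment is hard, a wrong argmax sends the whole feature to the wrong subspace.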

DHD complements these height-refined features with standard depth-based BEV features and fuses them using the Synergistic Feature Aggregation (SFA) module, which applies a channel stage followed by a spatial stage to produce the fused feature (Wu et al., 2024). The full training objective combines several weighted loss terms (Wu et al., 2024).

The ablations separate the effect of the VHS-equivalent mechanism. On DHD-S, mIoU is reported for the baseline without MGHS or SFA, for adding height decoupling, for adding height decoupling plus mask projection, and for additionally adding SFA (Wu et al., 2024). Main validation results on Occ3D-nuScenes report mIoU for DHD-S, DHD-M, and DHD-L (Wu et al., 2024).

A defining characteristic of DHD’s version of VHS is that the gate is applied before projection and is driven by predicted rather than measured height. In contrast to HiPR’s pillar-local LiDAR cap and CoP’s occupancy-informed interval pooling, DHD creates multiple height-specific subspaces by hard assignment of image pixels to discrete height groups.

6. Empirical profile and relation to earlier height modeling

Across the reported frameworks, VHS-like mechanisms consistently outperform height-agnostic baselines, but the magnitude and interpretation of the gain depend on the task. In occupancy prediction, HiPR’s pillar-wise truncation and masking provide gains on Occ3D and SurroundOcc, while DHD’s mask-guided projection improves mIoU under single-frame and short-history settings (Wu et al., 6 May 2026, Wu et al., 2024). In 3D detection, CoP shows that height-aware local pooling improves both mAP and NDS over global or uniform height pooling (Yuan et al., 28 Jul 2025).

These results align with a recurring pattern in the literature: height priors can enter the pipeline at different stages. HiPR notes prior approaches such as OC-BEV, which augments uniform sampling with a scene-level local height prior; HV-BEV, which predicts per-grid discrete height distributions; DHD, which decouples projection spaces and uses predicted heights; and DA-Occ, which aggregates frustum features with height distributions (Wu et al., 6 May 2026). HiPR distinguishes itself by using a LiDAR-derived per-pillar height map and explicitly reparameterizing the projection space, while CoP uses occupancy-driven interval pooling, and DHD uses predicted height masks.

A central empirical distinction is the source of supervision. HiPR depends on LiDAR-derived heights and uses PHC to bridge the gap between noisy LiDAR and cleaner ground-truth heights during training (Wu et al., 6 May 2026). CoP derives its height intervals from LDO histograms and uses density-weighted pooling, but the intervals are fixed for training and inference (Yuan et al., 28 Jul 2025). DHD predicts height directly from images with explicit LiDAR supervision and then enforces a hard partition of features into three subspaces (Wu et al., 2024).

This suggests that “height guidance” is not monolithic. It may act as a per-column bound, a set of shared intervals, or a learned per-pixel categorical prior. What unifies these choices is the attempt to reorganize the vertical sampling space so that feature aggregation occurs in geometrically and semantically plausible regions.

7. Limitations, misconceptions, and open directions

A common misconception is that VHS refers to a single standardized architecture. The reported literature does not support that view. The term is explicit in CoP, but HiPR and DHD implement equivalent ideas under different names and with materially different mechanics (Yuan et al., 28 Jul 2025, Wu et al., 6 May 2026, Wu et al., 2024).

Another misconception is that VHS necessarily means multi-interval partitioning. HiPR explicitly states that its effect does not come from “multi-range partitioning or explicit reweighting beyond deformable attention”; instead, it arises from per-pillar truncation and invalidity masking (Wu et al., 6 May 2026). Conversely, CoP and DHD do use interval partitioning, but one performs masked pooling over voxel features and the other performs mask-guided projection from image space (Yuan et al., 28 Jul 2025, Wu et al., 2024).

The main limitations are also framework-specific. HiPR inherits LiDAR sparsity and noise: distant regions and occlusions make the LiDAR-derived height map noisy or invalid, and hard masking can skip regions that are valid in images but unobserved in LiDAR. Its single-height representation also struggles with multi-layer structures such as overpasses, balconies, or tree canopies above vehicles, and PHC does not fully solve temporal misalignment for dynamic objects (Wu et al., 6 May 2026). CoP is sensitive to biased or noisy LDO-derived priors; fixed Height-of-Interest intervals may under-sample unusual scenes, slopes, bridges, or overhead structures, and shared intervals may be suboptimal for rare classes with atypical vertical extent (Yuan et al., 28 Jul 2025). DHD depends on accurate height estimation; erroneous height predictions can misgate features, and fixed interval thresholds may not adapt to scene-specific geometry. The paper explicitly points to soft masks, confidence-weighted gating, learned thresholds, and uncertainty-aware sampling as possible extensions (Wu et al., 2024).

HiPR’s discussion of future improvements provides a concise research agenda for VHS more broadly: model per-pillar height distributions rather than a single cap; estimate uncertainty and sample more densely where uncertainty is high; fuse temporal LiDAR frames or radar to densify height priors at long range; jointly learn height priors with camera and LiDAR through consistency regularization; incorporate semantic cues so that vertical sampling ranges depend on class; and replace hard masks with soft masks to avoid entirely skipping uncertain pillars (Wu et al., 6 May 2026). CoP and DHD point in compatible directions through adaptive interval selection, per-class height priors, and soft gating (Yuan et al., 28 Jul 2025, Wu et al., 2024).

Taken together, these methods establish VHS as a technically specific response to vertical ambiguity in 2D-to-3D scene understanding. Its core contribution is not the use of height information in the abstract, but the reorganization of projection or pooling so that vertical sampling better matches occupancy structure, object extent, and sensor-derived geometric evidence.