Channel-to-Height Transformation

Updated 9 April 2026

Channel-to-height transformation is a method that infers vertical spatial structure from multi-channel inputs, enhancing 3D scene understanding in fields like computer vision and remote sensing.
It employs explicit height binning, probabilistic inference, and efficient channel reshaping to achieve improved accuracy, computational speedup, and lower memory usage.
Integrating height-aware feature fusion and attention mechanisms, the approach robustly supports applications in BEV object detection, occupancy prediction, and wireless propagation modeling.

A channel-to-height transformation is a paradigm and set of methods for inferring, encoding, or modeling vertical (height) structure from multi-channel or channelized input features. It appears across several fields: principally computer vision for 3D scene understanding (e.g., bird's-eye-view (BEV) perception and occupancy prediction), atmospheric and ocean remote sensing (e.g., wave height estimation), and wireless propagation modeling (e.g., air-to-ground (A2G) channel modeling for UAVs). The transformation typically involves (1) mapping each channel to a particular height bin or distribution, (2) aggregating feature or measurement data along that dimension, and (3) representing, reconstructing, or predicting height-dependent phenomena. The key technical approaches include explicit height binning, height distribution modeling, probabilistic inference, channel reshaping, and height-aware attention or fusion. This framework has enabled significant gains in accuracy, efficiency, and physical fidelity in tasks ranging from 3D object detection in BEV to large-scale radio propagation models.

1. Theoretical Foundations and Equivalence Relations

Channel-to-height transformation originated from the need to resolve ambiguities in monocular- or multi-view perception (especially in BEV tasks), where mapping 2D (image/channel) features to 3D (x, y, z) structures is ill-posed. Historically, two paradigms have been used: depth-based lifting and height-based lifting.

HeightFormer establishes a formal equivalence between height-based and depth-based transformations. In height-based methods, for each BEV cell $(x, z)$, the model predicts a center height $y_{xz}$ and a span $h_{xz}$, and then samples across $H$ adaptively located height bins. These samples are projected into each image view, and the resulting features are aggregated. The equivalence is quantified analytically: under a bounded placement error $\epsilon$, both depth error and height error yield proportional bounds on BEV misplacement, e.g.,

$\delta_{y,\max} = \epsilon \cdot (|v - v_0|/f_y) \cdot (f_x/(|u - u_0| + f_x))$

implying that either parametrization can recover the correct BEV location under controlled error (Wu et al., 2023).
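As a rough numeric illustration, the sketch below evaluates the bound above for a single pixel. It assumes, as an interpretation not spelled out in the text, that $(u, v)$ are pixel coordinates, $(u_0, v_0)$ is the principal point, and $f_x$, $f_y$ are focal lengths in pixels. The function name and the sample values are hypothetical.

```python
def height_error_bound(eps, u, v, u0, v0, fx, fy):
    """Evaluate delta_{y,max} = eps * (|v - v0| / fy) * (fx / (|u - u0| + fx)).

    Assumed meaning of the arguments (not stated in the source text):
    eps is the bounded placement error, (u, v) a pixel, (u0, v0) the
    principal point, and fx, fy the focal lengths in pixels.
    """
    return eps * (abs(v - v0) / fy) * (fx / (abs(u - u0) + fx))


# Hypothetical camera and pixel values, chosen only for illustration.
bound = height_error_bound(
    eps=0.5, u=1200.0, v=700.0, u0=800.0, v0=450.0, fx=1000.0, fy=1000.0
)

# A pixel on the principal-point row (v == v0) gives a zero bound.
bound_on_horizon = height_error_bound(
    eps=0.5, u=1200.0, v=450.0, u0=800.0, v0=450.0, fx=1000.0, fy=1000.0
)
```

Under these assumptions the bound shrinks for pixels near the principal-point row and for pixels far from the principal-point column.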

This theoretical insight underpins the adoption of explicit height modeling as a substitute for or complement to depth-based approaches, allowing models to predict vertical occupancy or feature structure without explicit depth supervision or auxiliary sensors.

2. Explicit Height Binning and Height Distribution Inference

Modern BEV and 3D occupancy frameworks operationalize channel-to-height transformations through explicit height binning procedures. Typically, dense multi-view image features (with $C$ channels) are projected into $H$ discretized or adaptively selected height bins per BEV cell.

In HeightFormer (Wu et al., 2023), each BEV cell learns $(y_{xz}, h_{xz})$ and forms $H$ sampling heights:

$y_{xz} - \frac{h_{xz}}{2},\,\, y_{xz} - \frac{h_{xz}}{2} + \Delta,\,\, \ldots,\,\, y_{xz} + \frac{h_{xz}}{2};\quad \Delta = \frac{h_{xz}}{H-1}$

This process replaces fixed, hand-designed anchors with data-driven, uncertainty-aware sampling. At each layer, a height head refines $(y_{xz}, h_{xz})$ and their associated predicted uncertainties via recursive updates, using a Laplace prior and an uncertainty-regularized loss.
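The sampling rule above can be sketched in a few lines. This is a minimal illustration of the evenly spaced heights only, not the HeightFormer implementation; the function and argument names are hypothetical.

```python
def sampling_heights(y_center, span, num_bins):
    """Return num_bins heights evenly spaced over [y_center - span/2, y_center + span/2]."""
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2 so that span / (H - 1) is defined")
    step = span / (num_bins - 1)   # Delta = h_xz / (H - 1)
    start = y_center - span / 2    # lowest sampling height
    return [start + k * step for k in range(num_bins)]


# Example: a cell centered at 1.0 m with a 2.0 m span and H = 5 bins.
heights = sampling_heights(y_center=1.0, span=2.0, num_bins=5)
# heights == [0.0, 0.5, 1.0, 1.5, 2.0]
```

Because the heights are a deterministic function of the predicted center and span, refining those two values at each layer moves all $H$ samples together.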

Other frameworks such as HeightMapNet (Qiu et al., 2024) and DHD (Wu et al., 2024) use 1×1 convolutions and softmax operations to obtain categorical or probabilistic distributions over discretized height bins, informed by height priors or empirical statistics.
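A generic sketch of the 1×1-convolution-plus-softmax pattern follows, written with NumPy. It is not taken from HeightMapNet or DHD: the tensor layout, the random weights, and all names are assumptions made for illustration.

```python
import numpy as np


def height_distribution(features, weight, bias):
    """Map per-cell features to a categorical distribution over height bins.

    features: (C, X, Z) feature map; weight: (H, C); bias: (H,).
    Returns an (H, X, Z) array whose entries sum to 1 along the first axis.
    """
    # A 1x1 convolution is a per-cell linear map over the channel axis.
    logits = np.einsum("hc,cxz->hxz", weight, features) + bias[:, None, None]
    # Numerically stable softmax over the height-bin axis.
    logits = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)


# Hypothetical sizes: 8 channels, 4 height bins, a 3 x 2 grid of cells.
rng = np.random.default_rng(0)
C, H, X, Z = 8, 4, 3, 2
probs = height_distribution(
    rng.normal(size=(C, X, Z)), rng.normal(size=(H, C)), np.zeros(H)
)
```

Each cell ends up with $H$ non-negative weights that sum to one, which can then be used to weight or select features by height bin.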

Self-attention can be applied along the height axis to extract local height distributions (as in HeightFormer for roadside vision (Zhang et al., 13 Mar 2025)), and transformer-based processing of height sequences enables adaptive feature weighting and accurate vertical representation, even in high-resolution (X, Y, Z) voxel grids.
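To make the idea of attention along the height axis concrete, here is a minimal single-head sketch in NumPy. It treats the height bins of each cell as a short sequence and applies scaled dot-product self-attention. The shapes, projection matrices, and names are assumptions for illustration and do not reproduce any cited architecture.

```python
import numpy as np


def height_self_attention(x, wq, wk, wv):
    """Single-head self-attention over the height axis.

    x: (N, H, D) with N cells, H height bins, D feature dimensions.
    wq, wk, wv: (D, D) projection matrices. Returns an (N, H, D) array.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # Pairwise similarity between height bins within each cell: (N, H, H).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each height bin becomes a weighted mix of all bins in the same cell.
    return weights @ v


# Hypothetical sizes: 6 cells, 4 height bins, 8-dimensional features.
rng = np.random.default_rng(0)
N, H, D = 6, 4, 8
x = rng.normal(size=(N, H, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
out = height_self_attention(x, wq, wk, wv)
```

Attention is computed independently per cell, so the cost grows with the square of the number of height bins rather than with the size of the full voxel grid.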

3. Channel Reshaping and Computational Efficiency

An alternative instantiation of channel-to-height transformation is the parameter-free channel reshaping paradigm, as in FlashOcc (Yu et al., 2023). Here, a high-dimensional BEV feature tensor of shape $B \times C \times X \times Y$ is produced, where $C = C^{*} \times Z$ (semantic classes × height bins). The transformation is realized by simply reshaping:

$B \times C \times X \times Y \;\rightarrow\; B \times C^{*} \times Z \times X \times Y$

This operation, analogous to sub-pixel upsampling in super-resolution, recovers a per-height-bin 3D occupancy prediction using only efficient 2D convolutions, with no additional learnable parameters or memory overhead for 3D structures until the final output. Compared to 3D convolution-based voxel lifting, this approach achieves a runtime speedup and up to 70% reduction in memory, with competitive or superior mIoU on standard benchmarks (Yu et al., 2023).
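The reshaping step described above can be sketched with NumPy as follows. This is an illustration under assumptions, not FlashOcc's code: the class-major channel ordering, the axis order of the output, and all names are hypothetical choices.

```python
import numpy as np


def channel_to_height(bev, num_classes, num_height_bins):
    """Reshape (B, C, X, Y) BEV features to (B, num_classes, num_height_bins, X, Y).

    Assumes the channel axis is laid out class-major, i.e. channel index
    c = class_index * num_height_bins + height_index (an assumption).
    """
    b, c, x, y = bev.shape
    if c != num_classes * num_height_bins:
        raise ValueError("channel count must equal num_classes * num_height_bins")
    return bev.reshape(b, num_classes, num_height_bins, x, y)


# Hypothetical sizes: batch 2, 3 classes x 4 height bins = 12 channels, 4 x 5 grid.
bev = np.arange(2 * 12 * 4 * 5, dtype=np.float32).reshape(2, 12, 4, 5)
occ = channel_to_height(bev, num_classes=3, num_height_bins=4)
# occ.shape == (2, 3, 4, 4, 5)
```

For a contiguous input, the reshape returns a view on the same memory, which is why this step adds no parameters and no copy of the feature tensor.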

4. Height-Aware Feature Fusion and Attention Mechanisms

Channel-to-height transformations are often embedded within broader height-aware fusion pipelines. In DHD (Wu et al., 2024), Mask-Guided Height Sampling (MGHS) splits height predictions into interval-specific binary masks, which are applied to feature maps prior to BEV pooling or voxelization. This process partitions the feature space into height-consistent subspaces, reducing contamination from features with misaligned vertical context.
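The masking idea can be illustrated with a small NumPy sketch. It is a simplified stand-in, not DHD's MGHS module: the interval edges, the tensor layout, and the function names are assumptions.

```python
import numpy as np


def height_interval_masks(height_map, edges):
    """Split a predicted height map into binary masks, one per height interval.

    height_map: (P, Q) predicted heights; edges: ascending interval boundaries.
    Returns a (K, P, Q) boolean array for the K intervals [edges[k], edges[k+1]).
    """
    lows, highs = edges[:-1], edges[1:]
    return np.stack(
        [(height_map >= lo) & (height_map < hi) for lo, hi in zip(lows, highs)]
    )


def masked_features(features, masks):
    """Apply each mask to a (C, P, Q) feature map, giving (K, C, P, Q) subsets."""
    return masks[:, None, :, :] * features[None, :, :, :]


# Hypothetical 2 x 2 height map (meters) and three 1 m intervals.
height_map = np.array([[0.2, 1.5], [2.7, 0.9]])
features = np.ones((3, 2, 2))
masks = height_interval_masks(height_map, edges=[0.0, 1.0, 2.0, 3.0])
subsets = masked_features(features, masks)
```

With half-open intervals that cover the full height range, every location falls into exactly one mask, so the masked subsets sum back to the original feature map.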

Synergistic Feature Aggregation (SFA) then combines depth-collapsed and height-refined features via channel and spatial affinity weighting, using global average pooling, learned gating (sigmoid activations), and convolutional spatial attention.

In remote sensing, SCAWaveNet (Zhang et al., 1 Jul 2025) employs spatial–channel attention within a transformer backbone. CYGNSS delay-Doppler maps (DDMs) are processed so that each channel is aligned with an attention head, and the network fuses spatial and cross-channel information before regressing a significant wave height (SWH) per channel.

5. Channel-to-Height in Wireless Propagation and Channel Modeling

Height dependence is essential in modern 3D geometry-based stochastic channel models (GSCMs), notably for cellular and A2G mmWave systems.

The 3GPP 3D-UMa/UMi model (Mondal et al., 2015) introduces height-dependent formulas for LOS/NLOS probability, path loss, and angular spreads. The model computes path loss as a function of user equipment (UE) height and augments the LOS probability for above-rooftop cases with a correction term that increases linearly with UE height above a threshold height, capturing the effect of floor elevation.

In A2G mmWave communications, Saboor et al. (13 Nov 2025) and Pang et al. (2021) provide height-dependent models for LOS probability, path-loss exponents, and shadow fading, driven by detailed geometric and stochastic urban environment parameters. The transformation from measured channel states (path loss, shadowing) to inferred transceiver height is nonlinear and must be solved algorithmically (e.g., via numeric root-finding), since all large-scale parameters are explicit functions of height.
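The inversion step can be sketched with simple bisection. The path-loss function below is a made-up toy model, not one from the cited papers: its constants, the fixed link distance, and all names are assumptions chosen only so that the loss decreases monotonically with height.

```python
import math


def path_loss_db(height_m, distance_m=500.0):
    """Toy height-dependent path loss (hypothetical constants).

    The path-loss exponent decays from 3.5 toward 2.0 as height grows,
    so the loss is monotonically decreasing in height.
    """
    exponent = 2.0 + 1.5 * math.exp(-height_m / 50.0)
    return 40.0 + 10.0 * exponent * math.log10(distance_m)


def infer_height(measured_db, lo=0.0, hi=500.0, tol=1e-6):
    """Solve path_loss_db(h) = measured_db for h by bisection on [lo, hi]."""
    f_lo = path_loss_db(lo) - measured_db
    f_hi = path_loss_db(hi) - measured_db
    if f_lo * f_hi > 0:
        raise ValueError("measured value is outside the range covered by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = path_loss_db(mid) - measured_db
        if f_lo * f_mid <= 0:
            hi = mid                # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid   # root lies in [mid, hi]
    return 0.5 * (lo + hi)


# Round trip: simulate a measurement at 120 m, then recover the height.
true_height = 120.0
estimate = infer_height(path_loss_db(true_height))
```

Bisection needs only a bracketing interval and monotonicity, so it remains usable when the forward model has no closed-form inverse. A realistic model would also have to account for shadowing noise and an unknown link distance, which this sketch ignores.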

6. Applications in 3D Occupancy, HD Map Construction, and Remote Sensing

Channel-to-height transformation is foundational for camera-only 3D object detection in BEV, 3D occupancy prediction, HD map vectorization, and significant wave height (SWH) remote sensing:

In BEV detection (HeightFormer (Wu et al., 2023)), explicit height modeling matches or exceeds the performance of depth-based methods without requiring LiDAR or direct depth supervision, and can be adapted to arbitrary camera geometries.
In occupancy prediction (FlashOcc (Yu et al., 2023), DHD (Wu et al., 2024)), the approach enables parameter-efficient, high-throughput models that preserve fine vertical voxelization, supporting real-time deployment and improved geometry recall (overhangs, small objects).
In HD map learning (HeightMapNet (Qiu et al., 2024)), explicit channel-to-height mapping improves the representation of critical road features and their elevations, with a multi-scale and foreground-background aware design.
In remote sensing (SCAWaveNet (Zhang et al., 1 Jul 2025)), spatial–channel attention maps multi-channel GNSS-R data to per-channel SWHs, leveraging cross-channel reinforcement for improved regression accuracy relative to single-channel baselines.

7. Architectural Considerations and Performance Benchmarks

Implementing channel-to-height transformations requires careful tuning of architectural hyperparameters (e.g., number of height bins $H$, attention head counts, embedding dimensions) and integration with downstream tasks (e.g., detection, segmentation, HD map decoding).

Key empirical findings include:

The roadside HeightFormer achieves state-of-the-art performance on DAIR-V2X-I and Rope3D for camera-only roadside 3D detection, with 3D box AP gains over previous systems (Zhang et al., 13 Mar 2025).
FlashOcc realizes a runtime speedup and a memory reduction versus 3D convolution designs with essentially no loss in mIoU (Yu et al., 2023).
SCAWaveNet reduces RMSE on both the ERA5 and NDBC datasets relative to the prior state of the art (Zhang et al., 1 Jul 2025).
HeightMapNet demonstrates clear gains on nuScenes and Argoverse 2 HD map datasets by fusing explicit height priors with multiscale, foreground-masked features (Qiu et al., 2024).

A plausible implication is that as BEV-based frameworks are deployed in broader contexts (e.g., robotics, aerial mapping), channel-to-height transformation—via explicit binning, probabilistic fusion, or efficient reshaping—will constitute a core architectural primitive for maintaining geometric fidelity and computational viability across diverse sensing modalities.