
Conditional Cost Volume Normalization (CCVNorm)

Updated 3 December 2025
  • Conditional Cost Volume Normalization (CCVNorm) is a neural network module that integrates sparse LiDAR disparities into the 4D cost-volume normalization process for stereo matching.
  • It modulates pixel-level cost activations to intensify agreements with LiDAR measurements, thereby reducing stereo ambiguities and focusing computational effort on reliable depth cues.
  • Hierarchical extensions like HierCCVNorm offer scalable, efficient implementations with minimal overhead, leading to improved performance on benchmarks such as KITTI.

Conditional Cost Volume Normalization (CCVNorm) is a neural network module designed to tightly integrate sparse LiDAR disparity observations into the cost-volume regularization process in stereo matching systems. Instead of fusing depth estimates post hoc, CCVNorm enables pixel-wise conditioning within the 4D cost-volume tensor frequently used in state-of-the-art end-to-end stereo architectures. By modulating local cost-volume activations based on the agreement with LiDAR-provided disparity, the module facilitates robust stereo-LiDAR fusion, leading to improved depth perception and effective ambiguity resolution (Wang et al., 2019).

1. Motivation and Conceptual Foundation

Standard end-to-end stereo networks (e.g., GC-Net) utilize a 4D cost volume, batched as $F \in \mathbb{R}^{N \times C \times H \times W \times D}$, to encapsulate per-pixel, per-disparity information fundamental for depth inference. Conventional normalization strategies such as Batch Normalization treat the data uniformly per channel, disregarding spatially sparse but geometrically informative LiDAR signals. CCVNorm advances this paradigm by introducing the sparse LiDAR disparity $L^s_{h,w}$ as a pixel-wise conditioning signal into normalization, thereby making the cost-volume regularizer disparity-aware and sensitive to strong, localized depth cues. When the disparity $d$ aligns with the LiDAR measurement, CCVNorm intensifies the associated cost-volume features; conversely, non-agreeing disparities are suppressed, effectively pruning the stereo search space. This form of conditioning allows the network to resolve stereo ambiguities more reliably and utilize sparse, high-confidence LiDAR data at the spatial locations where it is available.

2. Mathematical Formulation

Let $F_{i,c,h,w,d}$ denote the feature activation at batch index $i$, channel $c$, pixel location $(h, w)$, and disparity $d$. Over a training minibatch $\mathcal{B}$, per-channel statistics are computed as follows:

$$\mu_c \triangleq \mathbb{E}_{x \in \mathcal{B},\, F_{\cdot,c,\cdot,\cdot,\cdot}}[F_x], \qquad \sigma^2_c \triangleq \mathrm{Var}_{x \in \mathcal{B},\, F_{\cdot,c,\cdot,\cdot,\cdot}}[F_x]$$

The normalized activation is

$$\tilde F_{i,c,h,w,d} = \frac{F_{i,c,h,w,d} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}$$
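In a deep-learning framework, the statistics above reduce to a mean and variance taken over every axis except the channel axis. A minimal NumPy sketch with toy dimensions (not the original implementation):

```python
import numpy as np

# Toy 5D cost volume: (batch N, channels C, height H, width W, disparity D)
N, C, H, W, D = 2, 4, 8, 8, 16
F = np.random.randn(N, C, H, W, D).astype(np.float32)

# Per-channel mean/variance over all axes except the channel axis,
# mirroring Batch Normalization applied to the batched cost volume.
axes = (0, 2, 3, 4)
mu = F.mean(axis=axes, keepdims=True)      # shape (1, C, 1, 1, 1)
var = F.var(axis=axes, keepdims=True)
eps = 1e-5
F_tilde = (F - mu) / np.sqrt(var + eps)    # normalized activations

# After normalization, each channel has near-zero mean and near-unit variance.
```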

CCVNorm applies a spatially and disparity-specific affine transformation:

$$F^{\mathrm{CCV}}_{i,c,h,w,d} = \gamma_{i,c,h,w,d} \cdot \tilde F_{i,c,h,w,d} + \beta_{i,c,h,w,d}$$

where the modulation parameters $(\gamma, \beta)$ depend on the validity and value of $L^s_{i,h,w}$:

  • If $L^s_{i,h,w}$ is valid (discrete LiDAR disparity $\ell$):

$$\gamma_{i,c,h,w,d} = g_{c,d}(\ell), \qquad \beta_{i,c,h,w,d} = h_{c,d}(\ell)$$

  • Otherwise:

$$\gamma_{i,c,h,w,d} \equiv \bar\gamma_{c,d}, \qquad \beta_{i,c,h,w,d} \equiv \bar\beta_{c,d}$$

Two main variants exist for the parameterization:

  • Categorical CCVNorm: LiDAR disparity $L^s$ is quantized into $\tilde D$ bins; $g$ and $h$ are lookup tables of dimension $\tilde D \times C \times D$.
  • Continuous CCVNorm: A compact CNN encodes the raw LiDAR map to regress continuous $D \times C$ modulation parameters for each pixel.
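The categorical variant can be sketched as follows, with hypothetical toy shapes and randomly initialized tables standing in for the learned lookup tables:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W, D = 1, 4, 6, 6, 8      # toy cost-volume dimensions
D_tilde = 8                         # number of quantized LiDAR disparity bins

F_tilde = rng.standard_normal((N, C, H, W, D))   # already-normalized cost volume

# Lookup tables g, h: one (C, D) modulation per LiDAR bin (learned in practice),
# plus unconditional fallback parameters for pixels with no LiDAR return.
g = rng.standard_normal((D_tilde, C, D))   # gamma table
h = rng.standard_normal((D_tilde, C, D))   # beta table
gamma_bar = np.ones((C, D))
beta_bar = np.zeros((C, D))

# Sparse LiDAR map: a bin index per pixel, -1 where no measurement exists.
L = np.full((N, H, W), -1, dtype=int)
L[0, 2, 3] = 5                      # one valid LiDAR return, bin 5

F_ccv = np.empty_like(F_tilde)
for i in range(N):
    for y in range(H):
        for x in range(W):
            if L[i, y, x] >= 0:     # valid: condition on the LiDAR bin
                gamma, beta = g[L[i, y, x]], h[L[i, y, x]]
            else:                   # invalid: unconditional parameters
                gamma, beta = gamma_bar, beta_bar
            F_ccv[i, :, y, x, :] = gamma * F_tilde[i, :, y, x, :] + beta
```

A real implementation would replace the Python loops with a vectorized gather over the bin indices; the loops are kept here only to mirror the per-pixel case split in the equations above.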

3. Hierarchical Extension: HierCCVNorm

A direct lookup for $\gamma, \beta$ in categorical form incurs a parameter cost of $O(\tilde D \cdot C \cdot D)$ per layer. To address scalability, HierCCVNorm factorizes the modulation process:

  1. Compute intermediate channel-wise modulation $g_c(L^s_{i,h,w}),\ h_c(L^s_{i,h,w}) \in \mathbb{R}^C$.
  2. Use compact disparity-wise lookup mappings $\phi, \psi \in \mathbb{R}^{D \times C}$.

For locations with valid $L^s$:

$$\gamma_{i,c,h,w,d} = \phi^g(d) \cdot g_c(L^s_{i,h,w}) + \psi^g(d), \qquad \beta_{i,c,h,w,d} = \phi^h(d) \cdot h_c(L^s_{i,h,w}) + \psi^h(d)$$

This formulation reduces the parameter cost dramatically, to $O(C \cdot D)$ per layer, with minimal performance regression.
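The factorization can be illustrated numerically; shapes here are hypothetical toys, and random values stand in for the learned channel-wise modulation and disparity-wise mappings:

```python
import numpy as np

rng = np.random.default_rng(1)
C, D, D_tilde = 4, 8, 8

# Step 1: channel-wise modulation conditioned on the LiDAR bin (C values per bin).
g = rng.standard_normal((D_tilde, C))
h = rng.standard_normal((D_tilde, C))

# Step 2: compact disparity-wise mappings phi, psi of shape (D, C).
phi_g, psi_g = rng.standard_normal((D, C)), rng.standard_normal((D, C))
phi_h, psi_h = rng.standard_normal((D, C)), rng.standard_normal((D, C))

ell = 5  # quantized LiDAR disparity bin observed at some pixel

# Full (C, D) modulation recovered from the factorized parameters.
gamma = (phi_g * g[ell][None, :] + psi_g).T    # shape (C, D)
beta = (phi_h * h[ell][None, :] + psi_h).T

# Per-layer parameter comparison: flat categorical lookup vs. factorized form.
flat_params = 2 * D_tilde * C * D              # gamma and beta tables
hier_params = 4 * D * C + 2 * D_tilde * C      # phi/psi pairs + channel-wise tables
```

With realistic sizes (e.g. $\tilde D = D = 192$), the gap between the two counts grows by roughly a factor of $D$, which is what makes the hierarchical form scale.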

4. Integration within Stereo Matching Networks

In typical architectures such as GC-Net, CCVNorm systematically replaces every 3D BatchNorm operation within the 3D-CNN cost-regularization block. The original implementation applied CCVNorm at layers 21, 24, 27, 30, 33, 34, and 35. Architectural overhead is variant-dependent:

  • Categorical: Requires a lookup table per layer (e.g., $\tilde D = 192$, $C \approx 32$ yields $\sim$200K parameters/layer).
  • Continuous: Introduces a lightweight LiDAR encoder CNN (approximately 1.2M weights).
  • HierCCVNorm: Achieves significant efficiency, using only $C \cdot D \approx 6{,}000$ parameters per layer.
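The HierCCVNorm figure follows directly from the quoted dimensions, assuming $C = 32$ channels and $D = 192$ disparity levels (the standard maximum disparity for GC-Net on KITTI, an assumption here):

```python
# Per-layer disparity-wise modulation parameters for HierCCVNorm,
# assuming C = 32 channels and D = 192 disparity levels.
C, D = 32, 192
hier_params = C * D
print(hier_params)  # 6144, i.e. the ~6,000 figure quoted above
```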

Elementwise scale and shift introduce negligible additional FLOPs. Because the extra computation is limited to compact mapping networks or table lookups, the runtime increase is minimal: empirically, end-to-end latency on an NVIDIA 1080Ti was measured at 0.962 s/frame for the GC-Net baseline and 1.011 s/frame (+0.049 s) for Input Fusion + HierCCVNorm.

5. Experimental Evaluation and Comparative Performance

Quantitative and robustness evaluation was performed on KITTI Stereo 2015 and KITTI Depth Completion datasets:

KITTI Stereo 2015 (disparity error $>3$ px):

  • GC-Net (no LiDAR): 4.24%
  • Prob. Fusion [ICRA’18]: 5.91%
  • Park et al. [arXiv’18]: 4.84%
  • IF + HierCCVNorm: 3.35% (lowest error)

KITTI Depth Completion (RMSE / iRMSE):

  • FusionNet [ECCV’18]: 0.773 m / 2.19 1/km
  • Park et al.: 2.021 m / 3.39 1/km
  • IF + HierCCVNorm: 0.749 m / 1.40 1/km (lowest error)

Ablation on Depth Completion (subset, 1k frames):

Method                     Disparity >3 px   RMSE (m)
GC-Net                     0.2540            1.0314
+ Input Fusion only        0.1694            0.7659
+ Feature-Concat           0.1810            —
+ Naive CBN                0.2446            —
+ CCVNorm (Cat)            0.1254            0.8942
+ HierCCVNorm (Cat)        0.1268            0.8898
Full (IF + HierCCVNorm)    0.1196            0.7493

Robustness to LiDAR density:

CCVNorm and HierCCVNorm exhibited gracefully degrading performance as LiDAR point cloud density decreased from 100% to 10%. Other fusion schemes such as Input Fusion and Feature-Concat showed substantial drops below 60% density.

Sensitivity to local modifications:

Only CCVNorm-based models adapted disparity outputs locally when LiDAR patches were manually altered—demonstrating true conditional normalization behavior.

6. Broader Implications and Significance

CCVNorm embodies a principled approach to the fusion of active (LiDAR) and passive (stereo) depth sensing modalities at the feature normalization level. By leveraging pixelwise, disparity-aware conditioning during cost-volume regularization, the approach restricts stereo search space where reliable LiDAR ground truth exists and allows the network to focus regularization efforts at unresolved regions. The hierarchical extension enables scaling to larger networks and more complex tasks without substantial increases in runtime or memory. A plausible implication is that similar normalization techniques, when adapted to other modalities or uncertainty-aware fusion tasks, may further enhance performance in multimodal computer vision systems.

CCVNorm extends the line of LiDAR-stereo fusion strategies beyond direct output fusion and conventional feature concatenation. Compared to previous methods such as Probabilistic Fusion [ICRA’18] and Feature-Concat/Naive CBN, CCVNorm yields superior results on established benchmarks and demonstrates robust performance under adverse LiDAR sparsity. Its integration and performance on GC-Net, along with favorable comparisons to FusionNet and works by Park et al., position CCVNorm as a generic and high-performance solution for sparse-dense fusion at the level of cost-volume regularization (Wang et al., 2019).

In summary, Conditional Cost Volume Normalization achieves tightly coupled, efficient, and effective integration of LiDAR and stereo modalities, setting a new benchmark for depth perception tasks in autonomous driving and related fields.

References

  • Wang et al. (2019). 3D LiDAR and Stereo Fusion using Stereo Matching Network with Conditional Cost Volume Normalization.
