Interlaced Cost Volume Construction
- Interlaced cost volume construction is a method that dynamically adapts depth hypothesis sampling using global uniform, variance-based local, and epipolar propagation strategies.
- It improves efficiency and accuracy in multi-view stereo by reducing computational load and refining the search space based on scene geometry and confidence estimates.
- Empirical results show enhanced depth precision and speed, with significant error reduction and up to 6× speedup over static cost volume approaches.
Interlaced cost volume construction refers to a family of techniques in multi-view stereo (MVS) and stereo matching that assemble multi-dimensional cost volumes using several dynamically chosen search strategies across a pyramid of spatial resolutions. The aim is to exploit the non-uniform spatial and statistical properties of scene geometry and network predictions at different scales, thereby improving the efficiency, accuracy, and robustness of depth inference. The approach differs from static, uniform cost volume construction in its explicit use of stage-dependent, data-adaptive hypothesis sampling, neighborhood propagation, and hybridization with auxiliary context-driven mechanisms.
1. Principles of Interlaced Cost Volume Construction
Cost volume construction forms the computational backbone of most deep MVS and stereo pipelines: for a set of reference and source images, a cost volume encodes the matching cost or correlation for each pixel across a set of depth or disparity hypotheses. Interlacing, in this context, refers to changing the depth/disparity hypotheses, or the strategy for sampling those hypotheses, between different scales or stages of processing.
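To make the notion of a cost volume concrete, the following is a minimal sketch for the simplest case, a rectified grayscale stereo pair with an absolute-difference cost. The function name `build_cost_volume` and the choice of cost are illustrative assumptions; learned pipelines compare deep feature maps rather than raw intensities.

```python
import numpy as np


def build_cost_volume(ref, src, disparities):
    """Absolute-difference matching cost for a rectified grayscale pair.

    ref, src: (H, W) float arrays. disparities: sequence of non-negative ints.
    Returns a (D, H, W) volume where cost[k, y, x] compares ref[y, x] with
    src[y, x - disparities[k]]. Pixels that fall outside src keep an
    infinite cost so they are never selected.
    """
    disparities = list(disparities)
    h, w = ref.shape
    cost = np.full((len(disparities), h, w), np.inf, dtype=np.float64)
    for k, d in enumerate(disparities):
        if d == 0:
            cost[k] = np.abs(ref - src)
        elif d < w:
            cost[k, :, d:] = np.abs(ref[:, d:] - src[:, :-d])
    return cost


# Synthetic pair: the reference view is the source shifted right by 3 pixels.
rng = np.random.default_rng(0)
src = rng.random((4, 16))
ref = np.zeros_like(src)
ref[:, 3:] = src[:, :-3]

volume = build_cost_volume(ref, src, range(6))
best = volume.argmin(axis=0)  # winner-take-all disparity per pixel
```

For columns that have a valid match, `best` recovers the 3-pixel shift. The cost of such a volume grows with the product of image size and hypothesis count, which is the quantity that interlaced strategies try to keep small.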
Unlike fixed uniform sampling across all pyramid stages, interlaced construction leverages the observation that:
- Statistical and geometric uncertainty differs fundamentally between coarse and fine scales.
- Initial coarse-level hypotheses benefit from a global uniform sweep to ensure coverage.
- Finer levels can localize hypotheses based on previous estimates, confidence variances, or geometric neighborhoods.
This design paradigm is explicitly realized in frameworks such as MSCVP-MVSNet ("Cost Volume Pyramid Network with Multi-strategies Range Searching for Multi-view Stereo") and SuperMVS ("SuperMVS: Non-Uniform Cost Volume For High-Resolution Multi-View Stereo"), among others (Gao et al., 2022; Zhang, 2022).
2. Stage-wise Sampling Strategies
Interlaced cost volume construction typically employs distinct sampling techniques at each level of the pyramid, matched to the epistemic and geometric characteristics of the current depth estimate. Three principal strategies, as introduced in MSCVP-MVSNet, are:
- DHS1 (Global Uniform Sampling): At the coarsest stage, depth planes are uniformly sampled over the entire valid scene depth interval $[d_{\min}, d_{\max}]$. This guarantees that the initial estimate is not biased by local context and can account for the entire solution space.
- DHS2 (Variance-based Local Sampling): At intermediate stages, sampling focuses on a window around the previous stage's depth estimate for each pixel. The width of this per-pixel window is set adaptively, in proportion to the local posterior variance output by the previous stage, so that regions of high uncertainty are assigned wider search ranges.
- DHS3 (Epipolar Neighbor Propagation): At fine resolutions, hypotheses are sampled along an interval that is determined by propagating the upsampled depth prediction into spatial neighbors (epipolar propagation). For each pixel, the minimum and maximum depth among its upsampled (and epipolarly matched) neighbors are used to define the new search space, allowing sharp structural priors to guide local refinement.
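The first two strategies can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the MSCVP-MVSNet implementation: the function names, the `scale` constant of proportionality, and the numeric depth range and plane counts are all hypothetical.

```python
import numpy as np


def uniform_hypotheses(d_min, d_max, num_planes, shape):
    """DHS1-style sampling: the same evenly spaced planes for every pixel."""
    planes = np.linspace(d_min, d_max, num_planes)
    # Broadcast the 1-D plane list to a (D, H, W) per-pixel hypothesis volume.
    return np.broadcast_to(planes[:, None, None], (num_planes, *shape)).copy()


def variance_hypotheses(depth, variance, num_planes, scale=1.0):
    """DHS2-style sampling: a per-pixel window centred on the previous depth.

    The half-width is `scale * variance`. `scale` is a hypothetical constant
    of proportionality, not a value taken from the paper.
    """
    half_width = scale * variance
    lo, hi = depth - half_width, depth + half_width
    # Interpolation weights in [0, 1], shaped to broadcast over (H, W).
    t = np.linspace(0.0, 1.0, num_planes)[:, None, None]
    return lo[None] + t * (hi - lo)[None]


# Coarse stage: global sweep over an illustrative depth range.
coarse = uniform_hypotheses(425.0, 935.0, 48, (4, 4))

# Intermediate stage: previous estimate plus per-pixel variance.
prev_depth = np.full((4, 4), 600.0)
prev_var = np.full((4, 4), 5.0)
prev_var[0, 0] = 40.0  # one uncertain pixel
local = variance_hypotheses(prev_depth, prev_var, 8)
```

In this example the uncertain pixel at `(0, 0)` receives a search window eight times wider than its neighbors while using the same number of planes, which is the trade the variance-based strategy makes: coverage where the estimate is unreliable, resolution where it is confident.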
The table below summarizes the interlaced hypothesis strategies as implemented in MSCVP-MVSNet (Gao et al., 2022):
| Pyramid Stage | Sampling Strategy | Characteristics |
|---|---|---|
| 1 | DHS1: Uniform over the full depth range | Coarse, global, dense |
| 2 | DHS2: Variance-based per-pixel local range | Adaptive, confidence-driven |
| 3+ | DHS3: Epipolar neighbor propagation | Contextual, spatially aware |
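The neighbor-based range of DHS3 can be approximated with a sliding minimum and maximum over the upsampled depth map. The sketch below is a simplification: it uses a square spatial window and does not perform the epipolar matching of neighbors described above. The function names and the `radius` parameter are assumptions made for illustration.

```python
import numpy as np


def neighbor_depth_range(depth, radius=1):
    """Per-pixel [min, max] of depth over a (2r+1) x (2r+1) window."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # Replicate border values so every pixel has a full window.
    padded = np.pad(depth, radius, mode="edge")
    lo = np.full((h, w), np.inf)
    hi = np.full((h, w), -np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            window = padded[dy:dy + h, dx:dx + w]
            lo = np.minimum(lo, window)
            hi = np.maximum(hi, window)
    return lo, hi


def range_hypotheses(lo, hi, num_planes):
    """Evenly spaced per-pixel hypotheses between lo and hi -> (D, H, W)."""
    t = np.linspace(0.0, 1.0, num_planes)[:, None, None]
    return lo[None] + t * (hi - lo)[None]


# A depth map with a vertical step edge between depth 1 and depth 5.
upsampled = np.array([[1.0, 1.0, 5.0, 5.0]] * 3)
lo, hi = neighbor_depth_range(upsampled)
hyps = range_hypotheses(lo, hi, 4)
```

Pixels inside a flat region get a degenerate range (`lo == hi`), while pixels next to the step edge get a range spanning both surfaces, so the available planes are spent where the structure is ambiguous.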
SuperMVS adopts a closely related approach: its SampleNet predicts non-uniform, per-pixel hypothesis distributions at each stage from prior depths and image context, narrowing the search interval as resolution increases (Zhang, 2022).
3. Cost Volume Regularization and Adaptive Filtering
After the raw cost volume is constructed at each stage, it is passed through a sequence of group-wise correlation operations and regularized with a 3D CNN (commonly a U-Net) to enhance matching reliability. A key enhancement is the use of adaptive unimodal filtering (AUF) in the intermediate stages, which imposes a unimodal prior on the cost distribution. This is implemented by computing a softmax whose sharpness is controlled by a learned, confidence-based scale, and by applying a stereo focal loss to focus learning on reliable hypotheses.
Empirically, AUF reduces spurious multi-peak responses and sharpens the correct minimum in the cost distribution, especially important in regions of ambiguous or repetitive texture. These sharpened distributions lead to more effective narrowing of the depth search space in subsequent stages and increased final depth precision (Gao et al., 2022).
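The effect of a scale parameter on a unimodal softmax can be illustrated as follows. The sketch assumes a distribution of the form softmax(-|d - c| / sigma) over the depth hypotheses, with `c` the peak location and `sigma` the scale. This form, the function name, and the numeric values are assumptions for illustration, and in a trained network `sigma` would be predicted per pixel rather than fixed by hand.

```python
import numpy as np


def unimodal_distribution(hypotheses, center, sigma):
    """softmax(-|d - center| / sigma) over a 1-D array of depth hypotheses."""
    logits = -np.abs(hypotheses - center) / sigma
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


planes = np.linspace(0.0, 10.0, 11)
sharp = unimodal_distribution(planes, center=4.0, sigma=0.5)  # confident pixel
broad = unimodal_distribution(planes, center=4.0, sigma=3.0)  # uncertain pixel
```

Both distributions have a single peak at the same plane, but the smaller scale concentrates far more probability mass there. A sharper distribution yields a smaller variance, which in turn permits a narrower search window at the next stage.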
4. Multi-Scale Upsampling and Iterative Refinement
Between pyramid stages, the per-pixel depth map is upsampled to the next higher resolution, typically with bilinear or bicubic interpolation. The upsampled map then defines, via variance-based or neighbor-propagation windowing, the hypothesis range for the next stage. At each new resolution:
- A fresh set of hypothesis depths is generated according to the local context.
- The cost volume is rebuilt and regularized.
- The depth prediction is refined via soft-argmax regression.
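The soft-argmax regression in the last step can be sketched as below. The function name is an assumption, and the sketch treats the regularized volume as a cost (lower is better) that is negated before the softmax. It also returns the per-pixel variance of the resulting distribution, the kind of quantity a variance-based window could consume at the next stage.

```python
import numpy as np


def soft_argmax_depth(cost, hypotheses):
    """Expected depth and its variance under softmax(-cost).

    cost, hypotheses: (D, H, W) arrays. Returns two (H, W) arrays.
    """
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    depth = (prob * hypotheses).sum(axis=0)
    variance = (prob * (hypotheses - depth[None]) ** 2).sum(axis=0)
    return depth, variance


# Five planes at depths 1..5 and a cost that is symmetric around depth 3.
hyp = np.arange(1.0, 6.0)[:, None, None] * np.ones((1, 2, 2))
cost = (hyp - 3.0) ** 2
depth, variance = soft_argmax_depth(cost, hyp)
```

Because the expectation is taken over all hypotheses, the regressed depth is continuous and can fall between planes, unlike a hard argmin.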
This architecture supports recursive refinement with memory and compute complexity orders of magnitude lower than flat, high-resolution cost volumes. CVP-MVSNet ("Cost Volume Pyramid Based Depth Inference for Multi-View Stereo") sets the residual search range width per-pixel such that a depth deviation corresponds to a fixed pixel displacement, tying the sampling density directly to image resolution and effective baseline (Yang et al., 2019).
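The link between depth interval and pixel displacement can be worked through for the simplest geometry. The sketch below assumes a rectified stereo pair, where disparity equals focal length times baseline divided by depth. CVP-MVSNet itself reasons over general epipolar geometry, so this is a simplification, and the function name and numeric values are hypothetical.

```python
import numpy as np


def depth_step_for_pixel_shift(depth, focal_px, baseline, pixel_shift=0.5):
    """Depth increase that lowers the disparity by `pixel_shift` pixels.

    Rectified-stereo assumption: disparity = focal_px * baseline / depth.
    Solving fb / (depth + step) = fb / depth - pixel_shift for step gives
    step = pixel_shift * depth**2 / (fb - pixel_shift * depth).
    """
    depth = np.asarray(depth, dtype=np.float64)
    fb = focal_px * baseline
    denom = fb - pixel_shift * depth
    if np.any(denom <= 0):
        raise ValueError("disparity at this depth is below the requested shift")
    return pixel_shift * depth ** 2 / denom


# Same scene point at full resolution and at half resolution
# (halving the resolution halves the focal length expressed in pixels).
full_res = depth_step_for_pixel_shift(2.0, focal_px=1000.0, baseline=0.1)
half_res = depth_step_for_pixel_shift(2.0, focal_px=500.0, baseline=0.1)
```

The step is larger at half resolution and grows roughly with the square of depth, which is why a pyramid can afford coarse depth spacing at low resolution and only needs fine spacing within a narrow residual range at high resolution.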
5. Efficiency and Accuracy Gains
Interlaced cost volume construction is motivated by, and delivers, significant improvements in both computational efficiency and achievable depth accuracy. On the standard DTU dataset, MSCVP-MVSNet's multi-strategy sampling achieves an overall error of 0.334 mm, compared with 0.402 mm when uniform sampling is used at all stages (Gao et al., 2022). Combined with adaptive unimodal filtering and separate 3D regularizers per stage, the error falls further to 0.328 mm, outperforming previous state-of-the-art pipelines such as MVSNet and Cascade-MVSNet (CasMVSNet).
SuperMVS employs a non-uniform plane allocation that reduces the number of planes processed at high resolution (only 80 planes in total across four stages, with as few as 8 at full resolution) while achieving an overall error of 0.325 mm, using less than 5.4 GB of memory at a runtime of 0.51 s per view (Zhang, 2022).
CVP-MVSNet's cost volume pyramid with residual correction delivers a 6× speedup and memory reduction over monolithic approaches without performance drop, reflecting the impact of interlaced, resolution-adaptive cost construction (Yang et al., 2019).
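A rough sense of where such savings come from can be had by counting cost-volume cells. The arithmetic below is a back-of-the-envelope sketch and does not reproduce any reported memory or runtime figure: the image size, the 192-plane flat baseline, and the per-stage split of 80 planes are illustrative assumptions, and feature channels and network activations are ignored.

```python
def pyramid_voxels(full_h, full_w, planes_per_stage):
    """Total cost-volume cells for a pyramid listed coarse to fine.

    Assumes each stage doubles the resolution of the previous one and the
    last stage runs at full resolution.
    """
    n = len(planes_per_stage)
    total = 0
    for i, planes in enumerate(planes_per_stage):
        scale = 2 ** (n - 1 - i)  # downsampling factor of stage i
        total += (full_h // scale) * (full_w // scale) * planes
    return total


H, W = 1200, 1600
flat = H * W * 192                                # single full-resolution volume
pyramid = pyramid_voxels(H, W, (48, 16, 8, 8))    # hypothetical split of 80 planes
ratio = flat / pyramid
```

Under these assumptions the pyramid holds 22,560,000 cells against 368,640,000 for the flat volume, roughly a 16-fold reduction, with most of the remaining cost concentrated in the few planes kept at full resolution.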
6. Extensions and Alternative Interlacing Concepts
Recent methods have generalized interlaced volume concepts beyond depth/disparity space. For example, Image-Coupled Volume Propagation (ICVP) interleaves spatial cost-volume evolution with auxiliary image-context propagation at each encoder/decoder block, fusing both geometric and appearance cues in an alternating, scale-aware manner (Kwon et al., 2022). This allows significant reduction of 3D convolutional channel count without accuracy loss and leads to improved detail preservation at lower computational cost.
Attention-based methods such as ACVNet compute multi-level adaptive groupwise correlations to generate soft attention in the cost volume, filtering redundant information and enhancing match-related cues (Xu et al., 2022). While not a pyramid in the strict sense, these techniques reflect the general principle of interlacing: dynamically reweighting or resampling the search space based on network-internal or geometric cues.
7. Empirical Trends and Practical Considerations
The principal advantages and trade-offs of interlaced cost volume construction are summarized as follows:
| Method | Main Interlacing Strategy | Planes (high-res) | Overall Error (DTU) | Memory (GB) | Runtime (s) |
|---|---|---|---|---|---|
| MSCVP-MVSNet (Gao et al., 2022) | DHS1→DHS2→DHS3, AUF | 8 | 0.328 mm | – | – |
| SuperMVS (Zhang, 2022) | Uniform→non-uniform SampleNet | 8 | 0.325 mm | 5.4 | 0.51 |
| CVP-MVSNet (Yang et al., 2019) | Uniform→residual local | ~8 | 0.351 mm | 1.4 | 0.37 |
| CasMVSNet* | Uniform at all stages | ≥8 | 0.355 mm | >6 | 0.76 |
*For comparison to prior fixed-interval cascades.
The interlaced approach supports scalable deployment, reduces the need for large memory allocations, and improves robustness in ambiguous regions, as reflected both in metric benchmarks and in qualitative results (fewer holes, sharper details).
References
- "Cost Volume Pyramid Network with Multi-strategies Range Searching for Multi-view Stereo" (Gao et al., 2022)
- "SuperMVS: Non-Uniform Cost Volume For High-Resolution Multi-View Stereo" (Zhang, 2022)
- "Cost Volume Pyramid Based Depth Inference for Multi-View Stereo" (Yang et al., 2019)
- "Image-Coupled Volume Propagation for Stereo Matching" (Kwon et al., 2022)
- "Attention Concatenation Volume for Accurate and Efficient Stereo Matching" (Xu et al., 2022)