PatchMatch-RL for Multi-View Stereo
- The paper introduces PatchMatch-RL, which frames view and candidate selection as discrete RL policy actions to learn trainable costs and regularizations in MVS.
- It utilizes dual agents—a view-selection network leveraging geometric priors and a candidate-selection network with recurrent cost regularization—to achieve pixelwise depth and normal estimates.
- Experimental results on ETH3D and Tanks & Temples, along with ablation studies, underscore the importance of normal estimation, view selection, and RCR for global consistency.
Reinforcement Learning PatchMatch (PatchMatch-RL) is an end-to-end trainable approach for multi-view stereo (MVS) that integrates reinforcement learning (RL) with a PatchMatch-based optimization framework. Unlike traditional MVS techniques that rely on hand-crafted or non-trainable optimization procedures, PatchMatch-RL formulates core algorithmic steps—view selection and depth hypothesis candidate selection—as discrete policy actions optimized via RL. This enables learning trainable costs and regularizations while maintaining explicit, pixelwise estimates of depth, surface normals, and visibility, particularly suited to challenging scenarios with wide-baseline, sparse views and large depth ranges (Lee et al., 2021).
1. Reinforcement Learning Problem Formulation
PatchMatch-RL decomposes the discrete PatchMatch optimization into two intertwined agents, both parameterized as policies trained by REINFORCE.
- View-Selection Agent ($\pi_{\theta_\mathcal{V}}$) selects the most informative subset of source views per pixel, using as state input geometric priors (e.g., scale, incidence angle, triangulation angle) and group-wise correlations.
- Candidate-Selection Agent ($\pi_{\theta_\mathcal{S}}$) selects among propagated oriented plane hypotheses per pixel, with state comprising the set of candidates, their recurrent hidden states, and their per-view, per-candidate visibility-weighted correlations.
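Actions are the selection of a subset of the available source views (view agent) and the choice of one candidate hypothesis among the propagated options (candidate agent). Rewards are computed as factorized Gaussian likelihoods measuring the agreement of the selected hypothesis $\omega^t$ (with depth $d^t$ and normal $n^t$) with the ground truth $\omega^*$ (with depth $d^*$ and normal $n^*$):
$$r^t = \mathcal{N}(\omega^t; \omega^*, \sigma_\omega) = \mathcal{N}(d^t; d^*, \sigma_d)\,\mathcal{N}(n^t; n^*, \sigma_n)$$
The RL objective is to maximize the expected discounted sum of rewards $J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_t \gamma^t r^t\big]$ via policy-gradient updates:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\Big[\sum_t G^t \, \nabla_\theta \log \pi_\theta(a^t \mid s^t)\Big]$$
where $G^t = \sum_{t' \ge t} \gamma^{t'-t} r^{t'}$ denotes the discounted return. Additional likelihood-based surrogate losses (cross-entropy) further shape training: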
- Candidate agent: a cross-entropy loss over the candidate hypotheses.
- View agent: an approximate gradient, taken as the difference in log-probabilities between the selected views and the selected worst views.
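The reward, return, and policy-gradient objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-component Gaussian on the normal vector, and the scalar `sigma_d` / `sigma_n` parameters are all hypothetical choices made for the example.

```python
import math

def gaussian_likelihood(x, mean, sigma):
    """Density of a 1-D normal distribution N(x; mean, sigma)."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def hypothesis_reward(depth, normal, gt_depth, gt_normal, sigma_d, sigma_n):
    """Factorized reward: depth likelihood times normal likelihood.

    Assumption: the normal term is an independent Gaussian per component.
    """
    reward = gaussian_likelihood(depth, gt_depth, sigma_d)
    for n, gt_n in zip(normal, gt_normal):
        reward *= gaussian_likelihood(n, gt_n, sigma_n)
    return reward

def discounted_returns(rewards, gamma):
    """G^t = sum over t' >= t of gamma^(t'-t) * r^(t'), computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_surrogate(log_probs, returns):
    """Scalar loss whose gradient is the negated REINFORCE estimate.

    log_probs[t] is log pi(a^t | s^t) for the action actually sampled.
    """
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

In an autodiff framework, `log_probs` would carry gradients back to the policy parameters while `returns` would be treated as constants, so minimizing the surrogate raises the probability of actions that led to high returns.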
2. Architecture and Policy Network Components
PatchMatch-RL instantiates distinct networks for view and candidate selection.
- View-Selection Network: Receives per-pixel concatenated geometric priors and group-wise correlations, processed by an MLP to estimate per-source-view visibility logits.
- Candidate-Selection ("Photometric Scorer" with Recurrent Cost Regularization): For each candidate hypothesis, the network:
  - Computes the visibility-weighted correlation.
  - Aggregates pairwise smoothness features over local pixel neighborhoods.
  - Passes the concatenation of these features and the candidate's hidden state through a GRU; this forms the Recurrent Cost Regularization (RCR) scheme.
  - Outputs a regularized cost for policy sampling. The RCR mechanism propagates local consistency signals analogously to loopy belief propagation.
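To make the recurrent scoring step concrete, the sketch below runs one GRU update on a candidate's features and reads out a scalar score. It is an illustration only: the gate convention follows the common PyTorch formulation, and the parameter dictionary `p`, its key names, the linear readout, and the feature layout (one correlation value followed by smoothness features) are assumptions rather than details taken from the paper.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _affine(W, x, b):
    """Matrix-vector product W @ x plus bias b, on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def gru_step(x, h, p):
    """One GRU update; returns the new hidden state."""
    zeros = [0.0] * len(h)
    wz, uz = _affine(p["Wz"], x, p["bz"]), _affine(p["Uz"], h, zeros)
    wr, ur = _affine(p["Wr"], x, p["br"]), _affine(p["Ur"], h, zeros)
    wn, un = _affine(p["Wn"], x, p["bn"]), _affine(p["Un"], h, zeros)
    z = [_sigmoid(a + b) for a, b in zip(wz, uz)]                # update gate
    r = [_sigmoid(a + b) for a, b in zip(wr, ur)]                # reset gate
    n = [math.tanh(a + ri * b) for a, ri, b in zip(wn, r, un)]   # candidate state
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def regularized_score(corr, smooth, h, p):
    """Concatenate a candidate's features, update its hidden state, read out a score."""
    x = [corr] + list(smooth)
    h_new = gru_step(x, h, p)
    score = sum(w * hi for w, hi in zip(p["w_out"], h_new)) + p["b_out"]
    return score, h_new
```

Because the hidden state is carried from one PatchMatch iteration to the next, the score for a candidate can depend on evidence accumulated in earlier iterations, not only on the current correlation.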
3. Training Regime and Handling of Discrete Operations
PatchMatch-RL utilizes REINFORCE to propagate gradients through the non-differentiable, discrete sampling operations inherent to PatchMatch (view subset selection, hard hypothesis choice). Training proceeds as follows:
- At each PatchMatch iteration $t$ and pixel $p$, sample the view and candidate selection actions from their respective policies.
- Observe the reward $r^t$.
- Accumulate the returns $G^t$.
- Update the parameters $\theta_\mathcal{V}$ and $\theta_\mathcal{S}$ using policy-gradient estimators.
- Surrogate cross-entropy losses for both the candidate and view agents are used to further stabilize learning.
A decaying $\epsilon$-greedy exploration strategy controls stochasticity during training, with $\epsilon$ annealed over the course of training.
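A decaying epsilon-greedy selection over candidate scores might look like the sketch below. It uses the textbook form (a uniformly random candidate with probability epsilon, the best-scoring one otherwise) and a hypothetical exponential schedule; the schedule shape, the `eps_start` and `decay` parameters, and the choice to explore uniformly are assumptions, since the source does not specify them here.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def epsilon_at(step, eps_start, decay):
    """Hypothetical exponential schedule: eps_start * decay**step."""
    return eps_start * (decay ** step)

def select_candidate(scores, epsilon, rng):
    """Epsilon-greedy pick; returns (index, log-probability under the softmax policy)."""
    probs = softmax(scores)
    if rng.random() < epsilon:
        idx = rng.randrange(len(scores))                        # explore: uniform random candidate
    else:
        idx = max(range(len(scores)), key=scores.__getitem__)   # exploit: highest score
    return idx, math.log(probs[idx])
```

The returned log-probability is what a REINFORCE-style update would weight by the discounted return for that action.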
4. Novel Algorithmic Components
PatchMatch-RL introduces two primary innovations aiding robust end-to-end optimization:
- Dilated Patch Kernels for Normal Estimation: For each hypothesis, the algorithm defines a dilated support window with a fixed patch size and dilation. Correlations are computed across source views by aggregating softmaxed per-patch feature responses, enabling explicit estimation of surface normals.
- Recurrent Cost Regularization (RCR): By integrating a GRU across candidate hypotheses and their propagated costs, the model encourages global geometrical consistency, overcoming the limitations of local, plane-sweep-based methods.
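As a rough illustration of why a dilated support window makes normals observable, the sketch below enumerates the window's pixel offsets and evaluates the depth that an oriented plane hypothesis implies at each offset under a pinhole camera. The function names, the single `focal` length, and the `(x, y)` pixel convention are assumptions for the example; the paper's actual warping and correlation code is not reproduced here.

```python
def dilated_offsets(patch_size, dilation):
    """(dx, dy) offsets of a patch_size x patch_size window with the given dilation."""
    if patch_size % 2 == 0:
        raise ValueError("patch_size is assumed odd so the window is centered")
    half = patch_size // 2
    steps = [i * dilation for i in range(-half, half + 1)]
    return [(dx, dy) for dy in steps for dx in steps]

def _ray(x, y, focal, principal):
    """Back-projected viewing ray (z = 1) for pixel (x, y) in a pinhole camera."""
    cx, cy = principal
    return ((x - cx) / focal, (y - cy) / focal, 1.0)

def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def plane_depth_at(depth, normal, pixel, offset, focal, principal):
    """Depth of the plane through the center pixel's 3-D point, seen at pixel + offset."""
    x, y = pixel
    dx, dy = offset
    num = _dot(normal, _ray(x, y, focal, principal))
    den = _dot(normal, _ray(x + dx, y + dy, focal, principal))
    if abs(den) < 1e-12:
        raise ValueError("viewing ray is parallel to the plane")
    return depth * num / den
```

For a fronto-parallel normal `(0, 0, 1)` every offset gets the same depth, whereas a tilted normal yields a different depth per offset, so the photometric score over the window changes with the normal. A larger dilation spreads the samples further apart and makes that change easier to detect.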
5. Implementation Details and Hyperparameters
PatchMatch-RL employs a coarse-to-fine hierarchy:
- Three image scales, with FPN features at progressively reduced channel widths.
- PatchMatch iterations per scale (training/inference): (2,1,1)/(8,2,2).
- A fixed patch size and dilation for the dilated support window.
- 10 source views provided, with a subset sampled per pixel.
- $\epsilon$-greedy exploration with a decaying schedule.
- Separate discount factors for the photometric (candidate) agent and the view-selection agent.
- Training data: BlendedMVS at low resolution.
- Optimizer: Adam, with the learning rate decayed every epoch.
- Hardware: NVIDIA RTX 3090.
6. Experimental Results and Ablation Studies
PatchMatch-RL is evaluated on ETH3D High-Res and Tanks & Temples datasets. Key findings are summarized:
| Benchmark | Metric (@Threshold) | PatchMatch-RL | PatchMatchNet | PVSNet | COLMAP | ACMH | AttMVS | CasMVSNet | BP-MVSNet |
|---|---|---|---|---|---|---|---|---|---|
| ETH3D High-Res (test) | F1 (2 cm) | 72.4 | 73.1 | 72.1 | 73.0 | 75.9 | – | – | – |
| ETH3D High-Res (test) | F1 (5 cm) | 86.8 | 85.9 | 85.6 | 84.0 | – | – | – | – |
| Tanks & Temples Inter. | F1 | 51.8 | 53.2 | 56.9 | – | – | 60.1 | 56.8 | 57.6 |
| Tanks & Temples Advanced | F1 | 31.8 | 32.3 | 33.5 | – | – | 31.9 | 31.1 | 31.4 |
On ETH3D, PatchMatch-RL's F1 at the strict 2 cm threshold (72.4) is competitive with the other learning-based methods listed (PatchMatchNet 73.1, PVSNet 72.1), and it has the highest listed F1 at the 5 cm threshold (86.8). On Tanks & Temples, it trails the listed learning-based methods on the Intermediate set (51.8 versus 53.2 to 60.1) and is comparable on the Advanced set (31.8 versus 31.1 to 33.5).
Ablation studies reveal the criticality of key architectural choices:
- Removing normal estimation reduces Combined F1 (@2 cm).
- Omitting view selection or recurrent cost regularization each lowers F1 relative to the full model.
7. Significance, Limitations, and Context
PatchMatch-RL demonstrates that explicit modeling of traditionally non-differentiable MVS steps using RL enables end-to-end trainability without sacrificing the advantages of discrete, pixelwise candidate representations. It leverages both photometric and geometric cues, and recurrent regularization, to achieve global consistency even in sparse, wide-baseline scenarios where prior learning-based approaches often degrade (Lee et al., 2021).
The method’s handling of discrete sampling via RL, rather than relaxation or heuristic approximations, represents a principled advancement. However, on Tanks & Temples the method does not lead the compared learning-based methods, which suggests room for refinement in challenging generalization settings. The ablation analyses reinforce that each core component (normal estimation, view selection, and RCR) contributes to robustness.
A plausible implication is that PatchMatch-RL’s framework could catalyze further integration of RL-based optimization in tasks with discrete, non-differentiable steps, especially within geometric vision pipelines.