MC-NeRF: Multi-Camera Neural Radiance Field

Updated 11 March 2026

MC-NeRF is a method that jointly optimizes camera intrinsics, extrinsics, and neural radiance fields for diverse multi-camera systems.
It overcomes calibration challenges by incorporating intrinsic reprojection losses and AprilTag-based fiducial constraints to improve scene reconstruction.
Experimental results show significant gains in intrinsic accuracy and photometric rendering quality on both synthetic and real-world datasets.

MC-NeRF jointly optimizes per-image intrinsic and extrinsic camera parameters alongside a neural radiance field in multi-camera image acquisition systems. It targets the case where camera parameters are diverse, unknown, or poorly initialized across large-scale, heterogeneous camera networks (Gao et al., 2023). Unlike conventional NeRF pipelines, which assume a single shared camera model, MC-NeRF solves for scene representation and camera calibration simultaneously in scenarios where each training image may originate from a different camera with distinct intrinsics and pose.

1. Problem Setting and Motivation

Standard NeRF pipelines and public datasets (e.g., Synthetic, LLFF, Mip-NeRF360) presuppose that all images originate from a single camera model, where the ray-generation function is parameterized by a shared intrinsic matrix $K$ and a per-image extrinsic transformation $[R|T]$. This assumption is violated in practical multi-camera arrangements (e.g., motion-capture studios, wide-baseline surveillance, or facial acquisition rigs), where tens to hundreds of cameras may have different focal lengths, principal points, and poses. Manual calibration is labor-intensive and does not scale; re-calibration becomes prohibitive if cameras are moved.

Prior NeRF-based methods capable of optimizing camera parameters (such as BARF, SC-NeRF, NeRF--, L2G-NeRF) require shared intrinsics or a reasonable pose initialization, and their performance degrades with poor parameter estimates. Allowing each image an arbitrary $K$ and $[R|T]$ introduces coupling and degeneracy: (1) intrinsic and extrinsic ambiguities become inextricably linked, and (2) photometric loss alone cannot resolve intrinsic parameters. MC-NeRF addresses these challenges with additional constraints and a dedicated calibration protocol, formulating the inverse problem as the joint recovery of all $K_i$, $[R_i|T_i]$, and the radiance field from a mixed multi-camera dataset (with or without good initial values) (Gao et al., 2023).

2. Mathematical Formulation and Camera Model

MC-NeRF preserves the standard NeRF formulation, where the scene is modeled via a continuous function:

$f_\theta : (x, d) \rightarrow (c, \sigma)$

with $x \in \mathbb{R}^3$ a 3D position, $d \in \mathbb{R}^3$ a viewing direction, $c \in \mathbb{R}^3$ color, and $\sigma \in \mathbb{R}^+$ density. The color along camera ray $r(t) = o + t d$ is rendered as:

$C(r) = \int_0^\infty T(t)\, \sigma(r(t))\, c(r(t), d)\, dt,$

where $T(t) = \exp\left(-\int_0^t \sigma(r(s)) ds \right)$ .
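In practice the integral is evaluated numerically. The following is a minimal sketch of the usual NeRF-style quadrature over discrete samples along one ray; the function name `render_ray` and the very long final interval are illustrative assumptions, not details taken from the MC-NeRF paper.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate C(r) from samples at depths t_0 < t_1 < ... < t_{N-1}.

    sigma: (N,) densities, color: (N, 3) RGB values, t: (N,) sample depths.
    Returns the rendered RGB value and the per-sample weights.
    """
    # Spacing between adjacent samples; the last interval is treated as very long.
    delta = np.append(np.diff(t), 1e10)
    # Opacity of each interval: alpha_i = 1 - exp(-sigma_i * delta_i).
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance T_i = prod_{k<i} (1 - alpha_k), the discrete form of T(t).
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ color, weights
```

With zero density everywhere the weights vanish and the ray renders black; a single fully opaque sample receives weight one and determines the color.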

For camera geometry, each image $j$ employs its own intrinsic $K_j$ and extrinsic $[R_j \mid T_j]$ :

$s\, p = K_j [R_j | T_j] P$

where $P$ is a world 3D point and $p$ is the corresponding homogeneous pixel coordinate. The intrinsic matrix $K_j$ follows a pinhole form with per-camera (and per-image) multiplicative deltas:

$K_j = \begin{pmatrix} f_{xj} \Delta f_{xj} & 0 & u_{0j} \Delta u_{0j} \\ 0 & f_{yj} \Delta f_{yj} & v_{0j} \Delta v_{0j} \\ 0 & 0 & 1 \end{pmatrix}$

where each $\Delta$ is a learnable multiplicative correction factor (the skew coefficient is assumed to be zero). Volume rendering rays are generated by back-projecting image pixels through $K_j^{-1}$ and the inverse of the rigid transform $[R_j|T_j]$.
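The camera model above can be sketched as follows. This is an illustrative NumPy version that assumes the world-to-camera convention of the projection equation $s\, p = K_j [R_j | T_j] P$; the function names are hypothetical, and the axis conventions of an actual NeRF codebase may differ.

```python
import numpy as np

def intrinsic_matrix(fx, fy, u0, v0, d_fx=1.0, d_fy=1.0, d_u0=1.0, d_v0=1.0):
    """Pinhole intrinsics with multiplicative correction factors and zero skew."""
    return np.array([
        [fx * d_fx, 0.0,       u0 * d_u0],
        [0.0,       fy * d_fy, v0 * d_v0],
        [0.0,       0.0,       1.0],
    ])

def pixel_to_ray(K, R, T, u, v):
    """World-frame ray (origin, unit direction) through pixel (u, v).

    Assumes x_cam = R @ x_world + T, so the camera centre is -R^T T.
    """
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # back-project through K^-1
    d_world = R.T @ d_cam                             # rotate into the world frame
    origin = -R.T @ T
    return origin, d_world / np.linalg.norm(d_world)
```

A quick consistency check is to pick a point on the returned ray and project it back with $K_j [R_j | T_j]$; it should land on the original pixel.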

3. Joint Optimization Scheme

MC-NeRF jointly optimizes the following parameter groups:

Intrinsics: Each $K_j$ is initialized coarsely then refined via multiplicative updates $\Delta K_j$ .
Extrinsics: Camera pose is parameterized as a 6D vector $\alpha_j \in \mathfrak{se}(3)$ and exponentiated to $[R_j|T_j] \in \mathrm{SE}(3)$ .
NeRF network parameters $\theta$ including multi-layer perceptron weights and positional encoding (progressively activated following BARF).
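The extrinsic parameterization can be illustrated with the closed-form exponential map from $\mathfrak{se}(3)$ to $\mathrm{SE}(3)$. This is a sketch only: the ordering of the 6D vector (rotation first, then translation) and the small-angle threshold are assumptions, not details from the source.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ x == np.cross(w, x)."""
    return np.array([
        [0.0,  -w[2],  w[1]],
        [w[2],  0.0,  -w[0]],
        [-w[1], w[0],  0.0],
    ])

def se3_exp(alpha):
    """Map a 6D vector alpha = (w, v) to a rotation R and translation T."""
    w, v = alpha[:3], alpha[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # Leading Taylor terms keep the map well defined near zero rotation.
        A, B, C = 1.0, 0.5, 1.0 / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
    R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
    V = np.eye(3) + B * W + C * (W @ W)   # couples rotation and translation
    return R, V @ v
```

Optimizing the unconstrained vector $\alpha_j$ keeps $R_j$ a valid rotation at every step, which a direct optimization of the nine matrix entries would not guarantee.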

Optimization proceeds via a composite loss function:

Intrinsic reprojection loss: For AprilTag-detected fiducials in calibration images,

$L_{intr} = \sum_{j,i} \left\Vert \frac{\mathrm{gt}\,p_{ij}}{L_{norm}} - \frac{\mathrm{pd}\,p_{ij}}{L_{norm}} \right\Vert^2$

where $\mathrm{gt}\,p_{ij}$ are detected tag corners and $\mathrm{pd}\,p_{ij}$ the projected positions under current parameter estimates.

Extrinsic reprojection loss: Classic bundle adjustment/PnP, optimizing $[R_j|T_j]$ while holding $K_j$ fixed.
Photometric rendering loss:

$L_{phot} = \sum_{pixels} \left\Vert C_{pred}(r) - C_{gt}(r) \right\Vert^2$
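The loss terms above can be sketched for a single camera as follows. The source does not define $L_{norm}$; here it is assumed to be a scalar normalization length in pixels (`l_norm`), and all function names are illustrative.

```python
import numpy as np

def project(K, R, T, P):
    """Project world points P (N, 3) to pixels (N, 2) via s p = K [R | T] P."""
    cam = P @ R.T + T          # world frame -> camera frame
    uvw = cam @ K.T            # camera frame -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]

def intrinsic_reprojection_loss(gt_px, K, R, T, P, l_norm):
    """Sum of squared, normalized distances between detected and projected points."""
    pd_px = project(K, R, T, P)
    return np.sum(((gt_px - pd_px) / l_norm) ** 2)

def photometric_loss(c_pred, c_gt):
    """Sum of squared color differences over the sampled rays."""
    return np.sum((c_pred - c_gt) ** 2)
```

The reprojection term is zero when the current $K_j$ and $[R_j|T_j]$ reproduce the detected tag points exactly, and grows as any of the parameters drift.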

Degeneracy arises if all calibration points are coplanar, which prevents robust recovery of the intrinsics. MC-NeRF avoids this by ensuring that each calibration frame contains at least two AprilTags on different faces of a 3D cube, so the feature points are non-coplanar. The intrinsic reprojection branch also breaks the intrinsic-extrinsic coupling: augmenting photometric alignment with reprojection constraints from rich AprilTag coverage enables the system to solve for $K_j$ and $[R_j|T_j]$ independently (Gao et al., 2023).

4. Calibration Protocol and Fiducial Design

Calibration images are acquired with a purpose-built calibration cube whose six faces each carry a unique AprilTag (36h11 family). Each visible tag supplies five feature points, so frames with at least two visible faces yield a minimum of ten non-coplanar calibration points, circumventing planar degeneracies. The protocol comprises two acquisition packs:

Acquisition Pack	Objective	Procedure
Pack 1 (global)	Define world frame, rough initialization	Cube at geometric center, all cameras observe ≥1 tag
Pack 2 (local)	Intrinsic, local extrinsic observation	Randomly reorient/move cube, each camera sees ≥2 tags

Initial $K_j$ , $[R_j|T_j]$ are derived from these packs via homography estimation and bundle adjustment algorithms, providing coarse estimates for subsequent optimization stages (Gao et al., 2023).
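The non-coplanarity argument can be checked numerically. The sketch below assumes the five feature points per tag are its four corners plus its center, and uses hypothetical cube and tag dimensions; neither detail is specified in the source.

```python
import numpy as np

def face_points(normal_axis, sign, half=0.5, tag_half=0.4):
    """Five points (four corners and the center) of a tag on one cube face.

    normal_axis selects the face normal (0, 1 or 2); sign is +1 or -1.
    half is half the cube edge, tag_half half the tag edge (assumed values).
    """
    a, b = [i for i in range(3) if i != normal_axis]
    pts = []
    for sa, sb in [(-1, -1), (1, -1), (1, 1), (-1, 1), (0, 0)]:
        p = np.zeros(3)
        p[normal_axis] = sign * half
        p[a] = sa * tag_half
        p[b] = sb * tag_half
        pts.append(p)
    return np.array(pts)

def is_coplanar(points, tol=1e-9):
    """Points are coplanar when their centered coordinates have rank below 3."""
    centered = points - points.mean(axis=0)
    return np.linalg.matrix_rank(centered, tol=tol) < 3
```

For example, `is_coplanar(face_points(0, 1))` holds for a single face, while stacking `face_points(0, 1)` and `face_points(1, 1)` gives ten points that span all three dimensions.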

5. End-to-End Network Architecture and Training Strategy

MC-NeRF employs a multi-branch MLP closely following NeRF architectures, typically with 8–10 layers, 256 or 512 channels, ReLU activations, skip connections, and BARF-style positional encoding for progressive locality and smoothing in pose refinement.
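The BARF-style positional encoding mentioned above weights each frequency band by a coarse-to-fine schedule. The sketch below follows the weighting published with BARF, where a progress parameter `alpha` grows from 0 to the number of frequency bands; the helper names and the choice to concatenate the raw input are illustrative assumptions.

```python
import numpy as np

def barf_weights(alpha, num_freqs):
    """Per-band weights: 0 before a band is activated, 1 once fully active."""
    k = np.arange(num_freqs)
    x = np.clip(alpha - k, 0.0, 1.0)
    return 0.5 * (1.0 - np.cos(np.pi * x))  # smooth cosine ramp from 0 to 1

def encode(x, alpha, num_freqs):
    """Weighted positional encoding of x with shape (..., D)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    ang = x[..., None] * freqs                                   # (..., D, L)
    w = barf_weights(alpha, num_freqs)
    feats = np.concatenate([w * np.sin(ang), w * np.cos(ang)], axis=-1)
    return np.concatenate([x, feats.reshape(*x.shape[:-1], -1)], axis=-1)
```

Early in training (`alpha` near 0) only the raw coordinates pass through, which gives smooth gradients for camera refinement; high-frequency bands are enabled as `alpha` increases.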

Training is conducted in three stages:

Camera parameter initialization: Only $L_{intr}$ is used, focusing exclusively on calibration images (Pack 1 and Pack 2) to obtain initial $K_j$ and $[R_j|T_j]$ .
Global joint optimization: $L_{phot} + L_{intr}$ are co-optimized; all network and camera parameters are updated simultaneously. Progressive positional encoding (as in BARF) stabilizes joint adaptation.
Fine-tuning intrinsics: With extrinsics frozen, further optimize $K_j$ and $\theta$ to refine focal lengths and principal points.

In each iteration, rendering rays use the current $K_j, [R_j|T_j]$ to sample the 3D domain; both photometric and reprojection losses are back-propagated. This end-to-end approach allows for the decoupling of intrinsic/extrinsic ambiguities and robust convergence without strong a priori parameter estimates.
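The three-stage schedule can be summarized as a small configuration table. This is a sketch under stated assumptions: the group and loss names are hypothetical labels, and the source does not say which losses drive the third stage, so using both there is a guess.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    losses: tuple      # loss terms that are summed in this stage
    trainable: tuple   # parameter groups that receive gradient updates

SCHEDULE = (
    # Stage 1: calibration images only, reprojection loss only.
    Stage("camera_init", ("intr",), ("intrinsics", "extrinsics")),
    # Stage 2: everything is updated together.
    Stage("joint", ("phot", "intr"), ("intrinsics", "extrinsics", "nerf")),
    # Stage 3: extrinsics frozen; the active losses here are an assumption.
    Stage("finetune_intrinsics", ("phot", "intr"), ("intrinsics", "nerf")),
)

def total_loss(stage, loss_values):
    """Sum only the loss terms that are active in the given stage."""
    return sum(loss_values[name] for name in stage.losses)

def is_trainable(stage, group):
    """Whether a parameter group is updated in the given stage."""
    return group in stage.trainable
```

In an autodiff framework, `is_trainable` would typically decide which parameter groups are handed to the optimizer or have gradients enabled for a stage.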

6. Datasets: Synthetic and Real-World

MC-NeRF's evaluation employs both synthetic and real datasets constructed to reflect practical, large-scale multi-camera scenarios:

Synthetic: Four camera-rig styles (half-ball, ball, room, 2D array) across eight distinct scenes each (32 total). Each scene uses 100 cameras with randomized FOVs ( $\in [40^\circ, 80^\circ]$ ) and principal points. Five splits per scene: Pack 1 (global calibration), Pack 2 (local calibration), training, validation, test.
Real-world: Capture hall (9m × 6m × 2.4m) with 88 fixed consumer cameras, each with unique intrinsics. Calibration follows the two-pack protocol; object images are 1920×1080 (rescaled to 960×540 for experiments).

This dataset design enables rigorous assessment of MC-NeRF's intrinsic/extrinsic recovery and radiance field representation capabilities (Gao et al., 2023).
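Two pieces of arithmetic sit behind these dataset descriptions: converting a field of view to a focal length in pixels, and rescaling intrinsics when images are resized. The sketch below assumes the FOV is horizontal and ignores half-pixel center offsets; both are assumptions, as the source does not state them.

```python
import numpy as np

def focal_from_fov(fov_deg, width_px):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return 0.5 * width_px / np.tan(np.radians(fov_deg) / 2.0)

def rescale_intrinsics(K, scale):
    """Intrinsics after uniformly resizing the image by `scale`."""
    S = np.diag([scale, scale, 1.0])
    return S @ K  # scales focal lengths and principal point, keeps the last row
```

Under these assumptions, halving 1920×1080 images to 960×540 corresponds to `rescale_intrinsics(K, 0.5)`, which halves both focal lengths and the principal point.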

7. Experimental Findings and Limitations

Key experimental results include:

Compatibility with per-image intrinsics: NeRF trained on images spanning diverse FOVs achieves comparable PSNR/SSIM/LPIPS to subsets trained on fixed FOVs, provided accurate $K,[R|T]$ are available.
Intrinsic estimation and degeneracy: $L_{intr}$ does not converge with a single AprilTag, whose feature points are coplanar; with ≥2 tags, full intrinsic parameter recovery is robust. Having ≥3 tags improves convergence accuracy.
Extrinsic estimation: Existing methods (BARF, L2G-NeRF) cannot recover accurate poses from random initialization; MC-NeRF's calibration-derived initialization enables convergent joint optimization.
2D neural image alignment: Only MC-NeRF with six point constraints recovers subpixel alignment in a neural patch alignment benchmark; BARF/L2G-NeRF diverge.
Joint optimization improvements: Compared to fix-step pipelines, MC-NeRF's global optimization reduces errors in $K$ , $R$ , $T$ by 2–5× and improves LPIPS by ∼2×; PSNR may decrease slightly due to edge misalignment.
Real-world comparison: On the 88-camera real-world evaluation, MC-NeRF achieves the best intrinsic accuracy (mean $|\Delta f| \approx 4$ px vs $>7$ px) and best or near-best rendering ($\mathrm{PSNR}\sim24.3$ dB, $\mathrm{LPIPS}\sim0.36$) relative to COLMAP+NeRF, Meshroom+NeRF, NeRF--, BARF, and Instant-NGP.
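For readers interpreting the reported numbers, the two simplest metrics can be sketched as follows. The definitions are the standard ones; the assumption that images are scaled to [0, 1] and the function names are illustrative, and the source does not describe its exact evaluation code.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)

def mean_abs_focal_error(f_est, f_true):
    """Mean absolute focal-length error in pixels across cameras."""
    return np.mean(np.abs(np.asarray(f_est) - np.asarray(f_true)))
```

As a rough guide under the [0, 1] assumption, a PSNR of 20 dB corresponds to a root-mean-square pixel error of 0.1, and higher values mean smaller errors.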

MC-NeRF demonstrates that a multi-branch loss (photometric and intrinsic reprojection), coupled with BARF-style progressive positional encoding and a cost-effective AprilTag cube, suffices for high-fidelity, degeneracy-free calibration and 3D scene reconstruction in large-scale multi-camera systems.

Limitations are intrinsic to the MC-NeRF approach:

Necessity of dedicated calibration frames (AprilTags, static cameras).
Inapplicability to dynamic scenes or changing intrinsics during capture.
Computational overhead from optimizing both large network weights and large numbers of camera parameters.
Occasional PSNR/SSIM penalties at region boundaries, even if perceptual quality (LPIPS) is improved (Gao et al., 2023).


References (1)

MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image Acquisition Systems (2023)
