
MC-NeRF: Multi-Camera Neural Radiance Field

Updated 11 March 2026
  • MC-NeRF is a method that jointly optimizes camera intrinsics, extrinsics, and neural radiance fields for diverse multi-camera systems.
  • It overcomes calibration challenges by incorporating intrinsic reprojection losses and AprilTag-based fiducial constraints to improve scene reconstruction.
  • Experimental results show significant gains in intrinsic accuracy and photometric rendering quality on both synthetic and real-world datasets.

MC-NeRF is a method for jointly optimizing per-image intrinsic and extrinsic camera parameters alongside a neural radiance field in multi-camera image acquisition systems. It targets the challenges posed by diverse, unknown, or poorly initialized camera parameters across large-scale, heterogeneous camera networks (Gao et al., 2023). Unlike conventional NeRF pipelines, which assume a single, fixed camera model, MC-NeRF solves for the scene representation and the camera calibration simultaneously, in scenarios where each training image may originate from a different camera with distinct intrinsics and pose.

1. Problem Setting and Motivation

Standard NeRF pipelines and public datasets (e.g., Synthetic, LLFF, Mip-NeRF360) presuppose that all images originate from a single camera model, where the ray-generation function is parameterized by a shared intrinsic matrix $K$ and an extrinsic transformation $[R \mid T]$. This assumption is violated in practical multi-camera arrangements (e.g., motion-capture studios, wide-baseline surveillance, or facial acquisition rigs), where tens to hundreds of cameras may have different focal lengths, principal points, and poses. Manual calibration is labor-intensive and does not scale; re-calibration becomes prohibitive if cameras are moved.

Prior NeRF-based methods capable of optimizing camera parameters (such as BARF, SC-NeRF, NeRF--, L2G-NeRF) require shared intrinsics or reasonable pose initialization, and their performance degrades with poor parameter estimates. Allowing each image an arbitrary $K$ and $[R \mid T]$ introduces coupling and degeneracy: (1) intrinsic and extrinsic ambiguities become inextricably linked, and (2) photometric loss alone cannot resolve intrinsic parameters. MC-NeRF addresses these challenges by incorporating additional constraints and novel calibration protocols, formulating the inverse problem as the joint recovery of all $K_i$, $[R_i \mid T_i]$, and the radiance field from a mixed multi-camera dataset (with or without good initial values) (Gao et al., 2023).

2. Mathematical Formulation and Camera Model

MC-NeRF preserves the standard NeRF formulation, where the scene is modeled via a continuous function:

$$f_\theta : (x, d) \rightarrow (c, \sigma)$$

with $x \in \mathbb{R}^3$ a 3D position, $d \in \mathbb{R}^3$ a viewing direction, $c \in \mathbb{R}^3$ the color, and $\sigma \in \mathbb{R}^+$ the density. The color along camera ray $r(t) = o + t d$ is rendered as:

$$C(r) = \int_0^\infty T(t)\, \sigma(r(t))\, c(r(t), d)\, dt,$$

where $T(t) = \exp\left(-\int_0^t \sigma(r(s))\, ds \right)$.
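In practice the integral is evaluated numerically. The following is a minimal sketch of the standard quadrature used in NeRF-style renderers, written in NumPy; the function name `render_ray` and the choice to treat the last interval as effectively infinite are illustrative assumptions, not details taken from MC-NeRF.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate C(r) by piecewise-constant quadrature along one ray.

    sigma: (N,) densities at the sample points
    color: (N, 3) RGB values at the sample points
    t:     (N,) increasing sample depths along the ray
    """
    # Spacing between adjacent samples; the last interval is treated as very long.
    delta = np.append(np.diff(t), 1e10)
    # Opacity accumulated inside each interval.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Discrete transmittance T_i: fraction of light that reaches sample i.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = trans * alpha
    return weights @ color, weights

# An opaque red sample in front of a green one: the ray should come out red.
rgb, w = render_ray(
    sigma=np.array([1e3, 1e3]),
    color=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    t=np.array([1.0, 2.0]),
)
```

The weights sum to at most one, so a ray that passes only through empty space (all densities zero) renders as black in this sketch.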

For camera geometry, each image $j$ employs its own intrinsic matrix $K_j$ and extrinsic transformation $[R_j \mid T_j]$:

$$s\, p = K_j [R_j \mid T_j] P$$

where $P$ is a 3D world point (in homogeneous coordinates), $p$ is the corresponding homogeneous pixel coordinate, and $s$ is the projective scale factor. The intrinsic matrix $K_j$ follows a pinhole form with per-camera (and per-image) multiplicative deltas:

$$K_j = \begin{pmatrix} f_{xj} \Delta f_{xj} & 0 & u_{0j} \Delta u_{0j} \\ 0 & f_{yj} \Delta f_{yj} & v_{0j} \Delta v_{0j} \\ 0 & 0 & 1 \end{pmatrix}$$

where the deltas $\Delta$ are learnable correction factors and no skew is assumed (the skew coefficient is 0). Volume rendering rays are generated by back-projecting image pixels through $K_j^{-1}$ and $[R_j \mid T_j]^{-1}$.
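The camera model above can be sketched in a few lines of NumPy. The helper names `intrinsic_matrix` and `pixel_to_ray` are hypothetical, and the code assumes the world-to-camera convention implied by the projection equation ($X_{cam} = R X_{world} + T$); it is an illustration of the geometry rather than the authors' implementation.

```python
import numpy as np

def intrinsic_matrix(fx, fy, u0, v0, d_fx=1.0, d_fy=1.0, d_u0=1.0, d_v0=1.0):
    """Pinhole K with multiplicative correction factors (all 1.0 = no correction)."""
    return np.array([
        [fx * d_fx, 0.0,       u0 * d_u0],
        [0.0,       fy * d_fy, v0 * d_v0],
        [0.0,       0.0,       1.0],
    ])

def pixel_to_ray(K, R, T, u, v):
    """Back-project pixel (u, v) to a world-space ray (origin, unit direction)."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # direction in the camera frame
    d_world = R.T @ d_cam                             # rotate into the world frame
    origin = -R.T @ T                                 # camera center in world coordinates
    return origin, d_world / np.linalg.norm(d_world)

# Round trip: a point on the ray must project back onto the same pixel.
K = intrinsic_matrix(500.0, 500.0, 320.0, 240.0, d_fx=1.02, d_fy=1.02)
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
T = np.array([0.1, -0.2, 4.0])
o, d = pixel_to_ray(K, R, T, 100.0, 50.0)
p = K @ (R @ (o + 2.5 * d) + T)
pixel = p[:2] / p[2]
```

Because the deltas multiply the coarse values, a delta of exactly 1.0 leaves the initial estimate unchanged, which makes it a natural starting point for refinement.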

3. Joint Optimization Scheme

MC-NeRF jointly optimizes the following parameter groups:

  • Intrinsics: Each $K_j$ is initialized coarsely, then refined via multiplicative updates $\Delta K_j$.
  • Extrinsics: Camera pose is parameterized as a 6D vector $\alpha_j \in \mathfrak{se}(3)$ and exponentiated to $[R_j \mid T_j] \in \mathrm{SE}(3)$.
  • NeRF network parameters $\theta$, including multi-layer perceptron weights and positional encoding (progressively activated following BARF).
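The extrinsic parameterization relies on the exponential map from $\mathfrak{se}(3)$ to $\mathrm{SE}(3)$. A minimal NumPy sketch of the closed-form map is given below; the ordering of the 6D vector (rotation part first, translation part second) and the function names are assumptions made for illustration.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ x == np.cross(w, x)."""
    return np.array([
        [0.0,  -w[2],  w[1]],
        [w[2],  0.0,  -w[0]],
        [-w[1], w[0],  0.0],
    ])

def se3_exp(alpha):
    """Map a 6D twist (omega, v) to a rotation R and translation T."""
    w = np.asarray(alpha[:3], dtype=float)  # rotation part (assumed first)
    v = np.asarray(alpha[3:], dtype=float)  # translation part (assumed second)
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-8:
        # Small-angle limits of the coefficients below.
        A, B, C = 1.0, 0.5, 1.0 / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
    R = np.eye(3) + A * W + B * (W @ W)  # Rodrigues' formula
    V = np.eye(3) + B * W + C * (W @ W)  # couples rotation and translation
    return R, V @ v

R, T = se3_exp([0.0, 0.0, np.pi / 2, 1.0, 0.0, 0.0])
```

Optimizing in the Lie algebra keeps every iterate a valid rigid transformation, so no re-orthonormalization of $R_j$ is needed after a gradient step.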

Optimization proceeds via a composite loss function:

  1. Intrinsic reprojection loss: For AprilTag-detected fiducials in calibration images,

$$L_{intr} = \sum_{j,i} \left\Vert \frac{\mathrm{gt}\,p_{ij}}{L_{norm}} - \frac{\mathrm{pd}\,p_{ij}}{L_{norm}} \right\Vert^2$$

where $\mathrm{gt}\,p_{ij}$ are detected tag corners and $\mathrm{pd}\,p_{ij}$ the projected positions under current parameter estimates.

  2. Extrinsic reprojection loss: Classic bundle adjustment/PnP, optimizing $[R_j \mid T_j]$ while holding $K_j$ fixed.
  3. Photometric rendering loss:

$$L_{phot} = \sum_{pixels} \left\Vert C_{pred}(r) - C_{gt}(r) \right\Vert^2$$

Degeneracy arises if all calibration points are coplanar, which prevents robust recovery of the intrinsics. MC-NeRF addresses this by ensuring that each calibration frame contains at least two AprilTags on different faces of a 3D cube, so that the feature points are non-coplanar. The intrinsic reprojection branch breaks the intrinsic-extrinsic coupling: augmenting photometric alignment with reprojection constraints from rich AprilTag coverage enables the system to solve for $K_j$ and $[R_j \mid T_j]$ independently (Gao et al., 2023).
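The reprojection and photometric terms can be written compactly in NumPy. This is a sketch under stated assumptions: the text does not define $L_{norm}$, so it is treated here as a scalar normalization length (for example the image width), and the function names are hypothetical.

```python
import numpy as np

def project(K, R, T, P):
    """Project (N, 3) world points to (N, 2) pixels via s p = K [R | T] P."""
    cam = P @ R.T + T                # world frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]  # divide out the scale s

def reprojection_loss(K, R, T, P, p_gt, l_norm):
    """Sum of squared, normalized distances between detected and projected points."""
    p_pd = project(K, R, T, P)
    return np.sum(((p_gt - p_pd) / l_norm) ** 2)

def photometric_loss(c_pred, c_gt):
    """Sum of squared color differences over the sampled pixels."""
    return np.sum((c_pred - c_gt) ** 2)

# Synthetic check: detections generated with the true K give zero loss.
K_true = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, T = np.eye(3), np.array([0.0, 0.0, 5.0])
P = np.array([[0.5, 0.5, 0.5], [-0.5, 0.5, 0.5], [0.5, -0.5, -0.5], [0.5, 0.5, -0.5]])
p_gt = project(K_true, R, T, P)
K_bad = K_true.copy()
K_bad[0, 0] *= 1.1                   # 10% focal-length error
loss_true = reprojection_loss(K_true, R, T, P, p_gt, l_norm=640.0)
loss_bad = reprojection_loss(K_bad, R, T, P, p_gt, l_norm=640.0)
```

Unlike the photometric term, the reprojection term depends on the intrinsics directly through the projected pixel positions, which is why it can constrain $K_j$ even before the radiance field has converged.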

4. Calibration Protocol and Fiducial Design

Calibration images are acquired using a purpose-built calibration cube, with each of its six faces printed with a unique AprilTag (36h11 family). Each visible tag supplies five feature points, so frames with at least two visible faces yield a minimum of ten non-coplanar calibration points, circumventing planar degeneracies. The protocol comprises two acquisition packs:

| Acquisition pack | Objective | Procedure |
|---|---|---|
| Pack 1 (global) | Define world frame, rough initialization | Cube at geometric center; all cameras observe ≥1 tag |
| Pack 2 (local) | Intrinsic and local extrinsic observation | Randomly reorient/move cube; each camera sees ≥2 tags |

Initial $K_j$ and $[R_j \mid T_j]$ are derived from these packs via homography estimation and bundle adjustment, providing coarse estimates for subsequent optimization stages (Gao et al., 2023).
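The requirement that a Pack 2 frame shows at least two cube faces can be checked numerically: a point set is coplanar exactly when its centered coordinate matrix has rank at most two. The sketch below assumes a unit cube and a hypothetical five-point layout per tag (four corners plus the center); the actual point layout used by MC-NeRF is not specified here.

```python
import numpy as np

def is_coplanar(points, tol=1e-9):
    """True if all (N, 3) points lie in a single plane (within tolerance)."""
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    return bool(s[-1] <= tol * max(s[0], 1.0))

# Five points on the z = +0.5 face of a unit cube (hypothetical tag layout).
face_z = np.array([
    [0.5, 0.5, 0.5], [-0.5, 0.5, 0.5], [-0.5, -0.5, 0.5], [0.5, -0.5, 0.5],
    [0.0, 0.0, 0.5],
])
# Five points on the adjacent x = +0.5 face.
face_x = np.array([
    [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.5, -0.5, -0.5], [0.5, 0.5, -0.5],
    [0.5, 0.0, 0.0],
])

one_tag = is_coplanar(face_z)
two_tags = is_coplanar(np.vstack([face_z, face_x]))
```

A calibration pipeline could use such a test to reject frames in which only one face was detected before they enter the intrinsic reprojection loss.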

5. End-to-End Network Architecture and Training Strategy

MC-NeRF employs a multi-branch MLP closely following NeRF architectures, typically with 8–10 layers, 256 or 512 channels, ReLU activations, skip connections, and BARF-style positional encoding for progressive locality and smoothing in pose refinement.
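The BARF-style positional encoding mentioned above gradually enables higher frequency bands as training progresses. The sketch below follows the coarse-to-fine weighting published with BARF, where a progress parameter `alpha` runs from 0 to the number of frequency bands; how MC-NeRF schedules `alpha` is not stated here, and the function names are illustrative.

```python
import numpy as np

def barf_weights(alpha, num_freqs):
    """Per-band weights: 0 before a band is enabled, a cosine ramp, then 1."""
    k = np.arange(num_freqs)
    x = np.clip(alpha - k, 0.0, 1.0)
    return 0.5 * (1.0 - np.cos(np.pi * x))

def encode(x, alpha, num_freqs):
    """Coarse-to-fine positional encoding of x with shape (..., D)."""
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    w = barf_weights(alpha, num_freqs)
    angles = x[..., None] * freqs                                    # (..., D, L)
    enc = np.concatenate([w * np.sin(angles), w * np.cos(angles)], axis=-1)
    # Keep the raw input so the network still sees x when every band is off.
    return np.concatenate([x, enc.reshape(*x.shape[:-1], -1)], axis=-1)

w_mid = barf_weights(1.5, 4)
e_start = encode(np.array([0.1, 0.2, 0.3]), alpha=0.0, num_freqs=4)
```

With all bands disabled the encoding reduces to the raw coordinates, which gives smooth gradients with respect to camera parameters early in training; the high-frequency bands are then switched on once the poses are roughly aligned.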

Training is conducted in three stages:

  1. Camera parameter initialization: Only $L_{intr}$ is used, focusing exclusively on calibration images (Pack 1 and Pack 2) to obtain initial $K_j$ and $[R_j \mid T_j]$.
  2. Global joint optimization: $L_{phot} + L_{intr}$ is minimized; all network and camera parameters are updated simultaneously. Progressive positional encoding (as in BARF) stabilizes joint adaptation.
  3. Fine-tuning intrinsics: With extrinsics frozen, $K_j$ and $\theta$ are optimized further to refine focal lengths and principal points.

In each iteration, rendering rays use the current $K_j$ and $[R_j \mid T_j]$ to sample the 3D domain; both photometric and reprojection losses are back-propagated. This end-to-end approach allows for the decoupling of intrinsic/extrinsic ambiguities and robust convergence without strong a priori parameter estimates.
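The three training stages can be summarized as a small schedule that records which losses are active and which parameter groups receive gradients. This is a configuration sketch only: the step boundaries are placeholder values, and the losses used in the third stage are an assumption, since the description above specifies only which parameters are frozen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    losses: tuple     # active loss terms
    trainable: tuple  # parameter groups that receive gradients

SCHEDULE = (
    # Stage 1: calibration images only, intrinsic reprojection loss only.
    Stage("camera_init", ("intr",), ("intrinsics", "extrinsics")),
    # Stage 2: everything is optimized together.
    Stage("joint", ("phot", "intr"), ("intrinsics", "extrinsics", "nerf")),
    # Stage 3: extrinsics frozen; the loss combination here is an assumption.
    Stage("intrinsic_finetune", ("phot", "intr"), ("intrinsics", "nerf")),
)

def stage_for_step(step, boundaries=(2_000, 20_000)):
    """Return the active stage for a training step (boundaries are placeholders)."""
    if step < boundaries[0]:
        return SCHEDULE[0]
    if step < boundaries[1]:
        return SCHEDULE[1]
    return SCHEDULE[2]

current = stage_for_step(5_000)
```

In a training loop, the `trainable` field would decide which optimizer parameter groups are stepped, and the `losses` field which terms are summed before back-propagation.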

6. Datasets: Synthetic and Real-World

MC-NeRF's evaluation employs both synthetic and real datasets constructed to reflect practical, large-scale multi-camera scenarios:

  • Synthetic: Four camera-rig styles (half-ball, ball, room, 2D array) across eight distinct scenes each (32 total). Each scene uses 100 cameras with randomized FOVs in $[40^\circ, 80^\circ]$ and randomized principal points. Five splits per scene: Pack 1 (global calibration), Pack 2 (local calibration), training, validation, test.
  • Real-world: Capture hall (9m × 6m × 2.4m) with 88 fixed consumer cameras, each with unique intrinsics. Calibration follows the two-pack protocol; object images are 1920×1080 (rescaled to 960×540 for experiments).
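For the synthetic cameras, a field of view maps to a pinhole focal length through $f = (W/2) / \tan(\mathrm{FOV}/2)$. The sketch below draws heterogeneous intrinsics in that spirit; the image size, the interpretation of FOV as horizontal, the square pixels, and the principal-point jitter range are all assumptions made for illustration and are not taken from the dataset description.

```python
import numpy as np

def focal_from_fov(fov_deg, image_size_px):
    """Pinhole focal length in pixels for a given field of view."""
    return 0.5 * image_size_px / np.tan(0.5 * np.deg2rad(fov_deg))

def sample_cameras(num_cameras, width, height, rng,
                   fov_range=(40.0, 80.0), pp_jitter=0.05):
    """Draw per-camera (fx, fy, u0, v0) with random FOV and principal point."""
    fov = rng.uniform(*fov_range, size=num_cameras)
    f = focal_from_fov(fov, width)  # square pixels assumed: fx == fy
    u0 = width * (0.5 + rng.uniform(-pp_jitter, pp_jitter, size=num_cameras))
    v0 = height * (0.5 + rng.uniform(-pp_jitter, pp_jitter, size=num_cameras))
    return np.stack([f, f, u0, v0], axis=1)

rng = np.random.default_rng(0)
cams = sample_cameras(100, width=800, height=800, rng=rng)
f_90 = focal_from_fov(90.0, 800)  # 90 degree FOV gives f = W / 2
```

Across the 40 to 80 degree range the focal length varies by more than a factor of two, which illustrates why a single shared $K$ is a poor model for such a rig.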

This dataset design enables rigorous assessment of MC-NeRF's intrinsic/extrinsic recovery and radiance field representation capabilities (Gao et al., 2023).

7. Experimental Findings and Limitations

Key experimental results include:

  • Compatibility with per-image intrinsics: NeRF trained on images spanning diverse FOVs achieves comparable PSNR/SSIM/LPIPS to subsets trained on fixed FOVs, provided accurate $K$ and $[R \mid T]$ are available.
  • Intrinsic estimation and degeneracy: $L_{intr}$ does not converge with a single AprilTag (coplanar points); with ≥2 tags, full intrinsic parameter recovery is robust. Having ≥3 tags improves convergence accuracy.
  • Extrinsic estimation: Existing methods (BARF, L2G-NeRF) cannot recover accurate poses from random initialization; MC-NeRF's calibration-derived initialization enables convergent joint optimization.
  • 2D neural image alignment: Only MC-NeRF with six point constraints recovers subpixel alignment in a neural patch alignment benchmark; BARF/L2G-NeRF diverge.
  • Joint optimization improvements: Compared to fixed-step pipelines, MC-NeRF's global optimization reduces errors in $K$, $R$, and $T$ by 2–5× and improves LPIPS by ∼2×; PSNR may decrease slightly due to edge misalignment.
  • Real-world comparison: On the 88-camera real-world evaluation, MC-NeRF achieves the best intrinsic accuracy (mean $|\Delta f| \approx 4$ px vs. $>7$ px) and best or near-best rendering ($\mathrm{PSNR} \sim 24.3$ dB, $\mathrm{LPIPS} \sim 0.36$) relative to COLMAP+NeRF, Meshroom+NeRF, NeRF--, BARF, and Instant-NGP.
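For readers interpreting the PSNR figures, the metric is a logarithmic transform of the mean squared error. The short sketch below uses the usual definition for images scaled to $[0, 1]$; the function name and the peak-value default are illustrative choices.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image corresponds to 20 dB.
value = psnr(np.zeros((4, 4, 3)), np.full((4, 4, 3), 0.1))
```

Under this definition, a PSNR of about 24.3 dB corresponds to a mean squared error of roughly 0.0037 on a $[0, 1]$ intensity scale.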

MC-NeRF demonstrates that a multi-branch loss (photometric and intrinsic reprojection), coupled with BARF-style progressive positional encoding and a cost-effective AprilTag cube, suffices for high-fidelity, degeneracy-free calibration and 3D scene reconstruction in large-scale multi-camera systems.

The MC-NeRF approach has several inherent limitations:

  • Necessity of dedicated calibration frames (AprilTags, static cameras).
  • Inapplicability to dynamic scenes or changing intrinsics during capture.
  • Computational overhead from optimizing both large network weights and large numbers of camera parameters.
  • Occasional PSNR/SSIM penalties at region boundaries, even if perceptual quality (LPIPS) is improved (Gao et al., 2023).