
Camera-Aware Referring Field (CaRF)

Updated 11 November 2025
  • The paper introduces CaRF, a framework that integrates camera-aware encoding and paired-view supervision to enforce multi-view consistency in 3D Gaussian segmentation.
  • It leverages a novel Gaussian Field Camera Encoding module to fuse camera parameters with semantic features for precise cross-view mask predictions.
  • Evaluations demonstrate significant mIoU improvements over previous methods, validating its effectiveness in achieving view-consistent segmentation in 3D scenes.

Camera-Aware Referring Field (CaRF) is a fully differentiable framework for referring 3D Gaussian Splatting Segmentation (R3DGS) that addresses multi-view consistency in associating free-form language expressions with spatially localized 3D regions. CaRF introduces explicit camera geometry encoding as well as paired-view supervision; these mechanisms promote geometric reasoning and enforce view-consistent mask predictions directly in 3D Gaussian space, outperforming previous methods reliant on 2D projections and single-view learning.

1. Problem Formulation and Scene Representation

CaRF is defined in the context of referring 3D Gaussian Splatting Segmentation (R3DGS), where the goal is to spatially localize a natural-language query on a 3D scene represented by anisotropic Gaussians. The 3D scene is parameterized as

$$\mathcal{G} = \{ G_i \}_{i=1}^N, \qquad G_i = (\mu_i, \Sigma_i, c_i, \alpha_i)$$

where $\mu_i \in \mathbb{R}^3$ is the Gaussian center, $\Sigma_i \in \mathbb{R}^{3 \times 3}$ the covariance, $c_i \in \mathbb{R}^3$ the color, and $\alpha_i \in [0,1]$ the opacity. Each $G_i$ defines a density

$$G_i(x) = \alpha_i \exp\Bigl(-\tfrac12 (x-\mu_i)^\top \Sigma_i^{-1} (x-\mu_i)\Bigr).$$

A natural-language query $q$ is represented as a sequence of $L$ token embeddings $\mathbf{E} = [e_1, \dots, e_L] \in \mathbb{R}^{L \times d}$ derived from a pretrained encoder (e.g., BERT). Given the scene $\mathcal{G}$, camera calibrations $\{ (K_k, [R_k \mid t_k]) \}_{k=1}^K$, and a query $q$, the task is to produce a per-Gaussian referring score $m_i$. Rendering these scores in each calibrated camera view yields 2D masks aligned with the region described in natural language.
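
To make the parameterization concrete, the sketch below packs these attributes into tensors and evaluates the density $G_i(x)$ at a point; the class name, tensor layout, and batched evaluation are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the 3D Gaussian scene parameterization (illustrative, not the paper's code).
import torch


class GaussianScene:
    """Holds N anisotropic Gaussians: centers mu, covariances Sigma, colors c, opacities alpha."""

    def __init__(self, mu, sigma, color, alpha):
        self.mu = mu          # (N, 3) Gaussian centers
        self.sigma = sigma    # (N, 3, 3) covariance matrices
        self.color = color    # (N, 3) RGB colors
        self.alpha = alpha    # (N,) opacities in [0, 1]

    def density(self, x):
        """G_i(x) = alpha_i * exp(-0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i)) for a point x in R^3."""
        diff = x[None, :] - self.mu                          # (N, 3)
        inv_sigma = torch.linalg.inv(self.sigma)             # (N, 3, 3)
        mahalanobis = torch.einsum("ni,nij,nj->n", diff, inv_sigma, diff)
        return self.alpha * torch.exp(-0.5 * mahalanobis)    # (N,)


# Toy scene with 4 Gaussians, evaluated at the origin.
N = 4
scene = GaussianScene(
    mu=torch.randn(N, 3),
    sigma=torch.eye(3).expand(N, 3, 3).clone() * 0.1,
    color=torch.rand(N, 3),
    alpha=torch.rand(N),
)
print(scene.density(torch.zeros(3)))
```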

2. Gaussian Field Camera Encoding (GFCE)

CaRF’s core innovation is the Gaussian Field Camera Encoding (GFCE) module, which integrates view geometry into the cross-modal semantic matching between Gaussians and linguistic queries.

Camera Parameter Encoding:

Extrinsic parameters are flattened into $c_{\text{ext}} = [\mathrm{vec}(R); t] \in \mathbb{R}^{12}$. Intrinsic parameters (focal lengths and principal point from $K$) are normalized and concatenated, producing a full camera code $c \in \mathbb{R}^{16}$. This code is mapped via a multilayer perceptron (MLP) to a feature $f_{\text{cam}} \in \mathbb{R}^d$:

$$f_{\text{cam}} = \mathrm{MLP}_{\text{cam}}(c)$$
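
A minimal sketch of this encoding is shown below, assuming a two-layer MLP and normalization of the intrinsics by image width and height; the summary above fixes only the 16-dimensional camera code and the mapping to $\mathbb{R}^d$, so the depth and normalization scheme are assumptions.

```python
# Sketch of the GFCE camera code construction (assumed MLP depth and intrinsic normalization).
import torch
import torch.nn as nn


class CameraEncoder(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        # 12 extrinsic values (flattened R | t) + 4 normalized intrinsics (fx, fy, cx, cy) = 16
        self.mlp = nn.Sequential(nn.Linear(16, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, R, t, K, width, height):
        c_ext = torch.cat([R.reshape(-1), t.reshape(-1)])                         # (12,)
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        c_int = torch.stack([fx / width, fy / height, cx / width, cy / height])   # (4,)
        c = torch.cat([c_ext, c_int])                                             # (16,) camera code
        return self.mlp(c)                                                        # (d,) camera feature f_cam


# Usage with a toy pinhole camera.
enc = CameraEncoder(d=128)
R, t = torch.eye(3), torch.zeros(3)
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
f_cam = enc(R, t, K, width=640.0, height=480.0)
print(f_cam.shape)  # torch.Size([128])
```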

Cross-modal Interaction:

Each Gaussian $G_i$ carries a learnable semantic feature $f_i \in \mathbb{R}^d$. The interaction module $\phi$ fuses $f_i$ with the language embedding $\mathbf{E}$:

$$g_i = \phi(f_i, \mathbf{E}) \in \mathbb{R}^d$$
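
The summary specifies only the signature $g_i = \phi(f_i, \mathbf{E})$. The sketch below assumes a single cross-attention layer in which each Gaussian feature attends to the token embeddings; this is one plausible instantiation, not necessarily the paper's exact design.

```python
# Sketch of the cross-modal interaction phi (the attention-based form is an assumption).
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, d=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, f, E):
        """f: (N, d) per-Gaussian features, E: (L, d) token embeddings -> g: (N, d) fused features."""
        q = f.unsqueeze(0)             # (1, N, d) queries from Gaussians
        kv = E.unsqueeze(0)            # (1, L, d) keys/values from language tokens
        attended, _ = self.attn(q, kv, kv)
        return self.norm(f + attended.squeeze(0))   # residual fusion


# Usage on toy shapes.
N, L, d = 1000, 12, 128
fusion = CrossModalFusion(d)
g = fusion(torch.randn(N, d), torch.randn(L, d))
print(g.shape)  # torch.Size([1000, 128])
```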

Camera-Conditioned Feature Modulation:

GFCE injects view-dependent information via elementwise addition:

$$\tilde{g}_i^{(k)} = g_i + f_{\text{cam}}^{(k)}$$

The modulated feature $\tilde{g}_i^{(k)}$ thus carries explicit camera geometry from the $k$-th view. The per-Gaussian referring score under view $k$ is

$$m_i^{(k)} = \sum_{j=1}^L \left( \tilde{g}_i^{(k)} \right)^\top e_j$$

This view-sensitive modulation allows the model to encode occlusions, scale, and spatial relationships in a differentiable manner.
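
The two equations above translate directly into a few lines of tensor code. The sketch below is a literal transcription under assumed shapes, using the fact that summing the dot products over tokens equals a dot product with the summed token embeddings.

```python
# Sketch of the camera-conditioned modulation and per-Gaussian referring score.
import torch


def referring_scores(g, f_cam, E):
    """g: (N, d) fused features, f_cam: (d,) camera feature for view k, E: (L, d) token embeddings.

    Returns m: (N,) referring scores m_i^(k) = sum_j (g_i + f_cam)^T e_j.
    """
    g_tilde = g + f_cam              # broadcast elementwise addition: (N, d)
    return g_tilde @ E.sum(dim=0)    # equivalent to summing the per-token dot products


# Usage on toy shapes.
N, L, d = 1000, 12, 128
m = referring_scores(torch.randn(N, d), torch.randn(d), torch.randn(L, d))
print(m.shape)  # torch.Size([1000])
```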

3. In-Training Paired-View Supervision (ITPVS)

Standard single-view training enforces agreement between predictions and 2D pseudo-masks under a single camera at a time. CaRF instead introduces In-Training Paired-View Supervision (ITPVS), which samples two overlapping camera views in every training iteration.

Dual-View Rasterization:

For a view pair $(v_a, v_b)$, predicted masks $M_{\text{pred}}^{(v)}(p)$ are rendered via alpha compositing across the Gaussians:

$$M_{\text{pred}}^{(v)}(p) = \sum_{i=1}^{N_v} m_i^{(v)} \, \alpha_i^{(v)}(p) \prod_{k < i} \left( 1 - \alpha_k^{(v)}(p) \right), \qquad v \in \{v_a, v_b\}$$
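
A per-pixel version of this compositing is sketched below, assuming the Gaussians are already depth-sorted for the view and that per-pixel opacities $\alpha_i^{(v)}(p)$ come from the splatting rasterizer; CaRF itself renders through the differentiable Gaussian splatting pipeline.

```python
# Sketch of front-to-back mask compositing over depth-sorted Gaussians.
import torch


def composite_mask(m, alpha):
    """m: (N,) referring scores for one view, alpha: (N, P) per-pixel opacities in depth order.

    Returns M_pred: (P,) rendered mask, sum_i m_i * alpha_i(p) * prod_{k<i} (1 - alpha_k(p)).
    """
    transmittance = torch.cumprod(1.0 - alpha, dim=0)                                 # prod_{k<=i}
    transmittance = torch.cat([torch.ones_like(alpha[:1]), transmittance[:-1]], dim=0)  # shift to prod_{k<i}
    return (m[:, None] * alpha * transmittance).sum(dim=0)                            # (P,)


# Usage: each view of a pair calls this with its own scores and alphas.
N, P = 500, 64 * 64
mask_a = composite_mask(torch.rand(N), torch.rand(N, P) * 0.1)
print(mask_a.shape)  # torch.Size([4096])
```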

Weighted Two-View BCE Loss:

Binary cross-entropy loss $\mathcal{L}_{\text{bce}}^{(v)}$ is computed against per-view pseudo-ground-truth masks, and the joint objective is

$$\mathcal{L}_{2\text{view}} = \alpha\, \mathcal{L}_{\text{bce}}^{(v_a)} + (1-\alpha)\, \mathcal{L}_{\text{bce}}^{(v_b)}$$

usually with $\alpha = 0.5$.
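
A minimal implementation of this weighted objective might look as follows; using logits with `binary_cross_entropy_with_logits` is an implementation assumption, not something the summary specifies.

```python
# Sketch of the weighted two-view BCE objective with alpha = 0.5.
import torch
import torch.nn.functional as F


def two_view_loss(mask_logits_a, gt_a, mask_logits_b, gt_b, alpha=0.5):
    """Per-view BCE against pseudo-GT masks, combined as alpha * L_a + (1 - alpha) * L_b."""
    loss_a = F.binary_cross_entropy_with_logits(mask_logits_a, gt_a)
    loss_b = F.binary_cross_entropy_with_logits(mask_logits_b, gt_b)
    return alpha * loss_a + (1.0 - alpha) * loss_b


# Usage on toy rendered masks.
P = 64 * 64
loss = two_view_loss(torch.randn(P), torch.rand(P).round(),
                     torch.randn(P), torch.rand(P).round())
print(loss.item())
```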

Optional Logit Consistency:

A further regularization term penalizes per-Gaussian disagreements between the two views:

$$\mathcal{L}_{\text{pair}} = \sum_{(v_a, v_b)} \sum_{i=1}^N \left\| \sigma(z_i^{(v_a)}) - \sigma(z_i^{(v_b)}) \right\|_2^2$$

where $\sigma$ denotes the sigmoid function and $z_i^{(v)}$ is the pre-sigmoid referring logit of Gaussian $i$ under view $v$.
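
A sketch of this regularizer for a single view pair, assuming $z_i^{(v)}$ are the per-Gaussian logits and the outer sum runs over the sampled pairs during training:

```python
# Sketch of the optional logit-consistency regularizer for one view pair.
import torch


def pair_consistency_loss(z_a, z_b):
    """z_a, z_b: (N,) per-Gaussian logits under views v_a and v_b."""
    return (torch.sigmoid(z_a) - torch.sigmoid(z_b)).pow(2).sum()


# Usage on toy logits.
N = 1000
print(pair_consistency_loss(torch.randn(N), torch.randn(N)).item())
```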

ITPVS forces Gaussians to produce view-invariant semantic predictions, thereby mitigating overfitting to single-view artifacts and enforcing robust 3D consistency.

4. Network Architecture and Training Regimen

The CaRF pipeline consists of distinct modules: geometry pretraining, semantic field learning, language encoding, cross-modal fusion, GFCE, volumetric mask rendering, and multiple loss heads.

Key steps for one training iteration (a condensed, runnable sketch follows the list):

  1. Sample two camera views $(v_a, v_b)$ with $\geq 30\%$ spatial overlap.
  2. Encode the language query $q$ as $\mathbf{E}$.
  3. For each Gaussian:
    • Fuse with language via $g_i = \phi(f_i, \mathbf{E})$.
    • Compute $f_{\text{cam}}^{(v_a)}$ and $f_{\text{cam}}^{(v_b)}$, and modulate to $\tilde{g}_i^{(v)}$.
    • Compute referring scores $m_i^{(v)} = \sum_j (\tilde{g}_i^{(v)})^\top e_j$.
  4. Render predicted masks for both views.
  5. Compute the two-view loss $\mathcal{L}_{2\text{view}}$.
  6. Form a prototype feature $f_g$ from the top-$\tau$ Gaussians and contrast it against distractors for the contrastive loss $\mathcal{L}_{\text{con}}$.
  7. Combine both terms into the total loss $\mathcal{L} = \lambda_1 \mathcal{L}_{2\text{view}} + \lambda_2 \mathcal{L}_{\text{con}}$.
  8. Backpropagate to update $f_i$, the MLP parameters, and $\phi$.
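
The condensed sketch below strings steps 1–8 together; all module internals are simple stand-ins (a linear fusion layer for $\phi$, random tensors for the sampled views and pseudo-GT masks, a stubbed contrastive term, and assumed $\lambda$ weights), so it illustrates the data flow rather than the paper's actual implementation.

```python
# Condensed sketch of one CaRF training iteration (steps 1-8 above); stand-ins throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

N, L, d, P = 1000, 12, 128, 64 * 64                  # Gaussians, tokens, feature dim, pixels

f = nn.Parameter(torch.randn(N, d))                  # learnable per-Gaussian semantic features f_i
phi = nn.Linear(2 * d, d)                            # stand-in for the cross-modal module phi
mlp_cam = nn.Sequential(nn.Linear(16, d), nn.ReLU(), nn.Linear(d, d))   # GFCE camera MLP
# Single Adam group for brevity; the stated setup uses separate learning rates per module group.
opt = torch.optim.Adam([f, *phi.parameters(), *mlp_cam.parameters()], lr=2.5e-3)


def referring_logits(E, cam_code):
    g = phi(torch.cat([f, E.mean(0).expand(N, d)], dim=-1))   # step 3a: fuse with language
    g_tilde = g + mlp_cam(cam_code)                           # step 3b: camera-conditioned modulation
    return g_tilde @ E.sum(0)                                 # step 3c: referring scores, shape (N,)


def render_mask(m, alpha):                                    # step 4: front-to-back alpha compositing
    T = torch.cumprod(1 - alpha, dim=0)
    T = torch.cat([torch.ones_like(alpha[:1]), T[:-1]], dim=0)
    # Sigmoid keeps the composited mask in [0, 1] so plain BCE applies (an implementation assumption).
    return (torch.sigmoid(m)[:, None] * alpha * T).sum(0)


# Steps 1-2: a sampled overlapping view pair and an encoded query (toy tensors here).
E = torch.randn(L, d)
cam_a, cam_b = torch.randn(16), torch.randn(16)
alpha_a, alpha_b = torch.rand(N, P) * 0.1, torch.rand(N, P) * 0.1
gt_a, gt_b = torch.rand(P).round(), torch.rand(P).round()

m_a, m_b = referring_logits(E, cam_a), referring_logits(E, cam_b)
loss_2view = 0.5 * F.binary_cross_entropy(render_mask(m_a, alpha_a).clamp(0, 1), gt_a) \
           + 0.5 * F.binary_cross_entropy(render_mask(m_b, alpha_b).clamp(0, 1), gt_b)   # step 5
loss_con = torch.tensor(0.0)                                  # step 6: contrastive term (stubbed out)
loss = 1.0 * loss_2view + 1.0 * loss_con                      # step 7: lambda_1 = lambda_2 = 1 assumed
opt.zero_grad()
loss.backward()                                               # step 8: update f_i, phi, and the MLPs
opt.step()
```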

Training details (a schematic optimizer setup follows the list):

  • 30,000 iterations
  • $d = 128$ feature dimension
  • Adam optimizer, batch size of one query (two views per step)
  • Learning rates: $2.5\times 10^{-3}$ for the referring field and contrastive head, $1\times 10^{-4}$ for GFCE/gating
  • Mixed-precision, gradient clip = 1.0
  • Pseudo ground-truth masks synthesized using Grounded-SAM with confidence-weighted IoU
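
The optimizer configuration implied by these settings can be sketched as two Adam parameter groups with gradient clipping under mixed precision; the grouping of parameters shown below is schematic, not the paper's code.

```python
# Schematic optimizer setup matching the stated learning rates, clipping, and mixed precision.
import torch
import torch.nn as nn

referring_field = nn.Parameter(torch.randn(1000, 128))   # stands in for the referring field + contrastive head
gfce = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 128))   # stands in for GFCE / gating

optimizer = torch.optim.Adam([
    {"params": [referring_field], "lr": 2.5e-3},
    {"params": gfce.parameters(), "lr": 1e-4},
])
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())   # mixed-precision scaling

# Inside the training loop, after scaler.scale(loss).backward():
#   scaler.unscale_(optimizer)
#   torch.nn.utils.clip_grad_norm_([referring_field, *gfce.parameters()], max_norm=1.0)
#   scaler.step(optimizer)
#   scaler.update()
#   optimizer.zero_grad()
```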

5. Quantitative and Qualitative Evaluation

Extensive experiments across three standard referring 3D segmentation datasets demonstrate that CaRF achieves consistently higher mean Intersection-over-Union (mIoU) than previous methods.

Method        Ref-LERF (mIoU)    LERF-OVS (mIoU)    3D-OVS (mIoU)
ReferSplat    25.0               52.6               92.9
CaRF          29.2 (+16.8%)      54.9 (+4.3%)       94.7 (+2.0%)

Ablation studies reveal that both ITPVS and GFCE contribute to performance gains; jointly, they provide the strongest results (Ramen/Kitchen mIoU on Ref-LERF: Baseline 28.3/20.1, ITPVS only 31.6/22.4, GFCE only 24.3/13.5, full CaRF 33.5/24.7). Qualitative outputs show that CaRF produces masks preserving fine object details (e.g., glass rims, handle curvature) while maintaining cross-view coherence, whereas single-view methods tend to miss parts or over-segment into the background.

6. Context, Limitations, and Applications

CaRF's explicit camera-aware modulation allows features to account for occlusion, scale, and precise spatial arrangements that are view-dependent. This overcomes limitations of prior 2D pseudo-supervision and non-differentiable reprojection strategies, which are sensitive to thresholding and accumulate geometric errors over time. By coupling paired-view gradients, CaRF regularizes the model toward genuinely 3D-consistent segmentations.

Potential applications include:

  • Embodied AI: A robot can resolve queries such as "pick up the blue mug on the left shelf" with robust, view-invariant localization.
  • AR/VR interaction: Users can select and manipulate virtual objects anchored in real geometry using natural language.
  • Autonomous perception: The system enables open-vocabulary, viewpoint-invariant segmentation (e.g., "pedestrian crossing road") in dynamic 3D environments.

This suggests that explicit camera-awareness and multi-view training are jointly essential for consistent, reliable 3D segmentation linked to natural language. A plausible implication is that similar camera-aware mechanisms may benefit other 3D perception tasks requiring geometric and semantic consistency across views.
