
Camera-Aware Referring Field (CaRF)

Updated 11 November 2025
  • The paper introduces CaRF, a framework that integrates camera-aware encoding and paired-view supervision to enforce multi-view consistency in 3D Gaussian segmentation.
  • It leverages a novel Gaussian Field Camera Encoding module to fuse camera parameters with semantic features for precise cross-view mask predictions.
  • Evaluations demonstrate significant mIoU improvements over previous methods, validating its effectiveness in achieving view-consistent segmentation in 3D scenes.

Camera-Aware Referring Field (CaRF) is a fully differentiable framework for referring 3D Gaussian Splatting Segmentation (R3DGS) that addresses multi-view consistency in associating free-form language expressions with spatially localized 3D regions. CaRF introduces explicit camera geometry encoding as well as paired-view supervision; these mechanisms promote geometric reasoning and enforce view-consistent mask predictions directly in 3D Gaussian space, outperforming previous methods reliant on 2D projections and single-view learning.

1. Problem Formulation and Scene Representation

CaRF is defined in the context of referring 3D Gaussian Splatting Segmentation (R3DGS), where the goal is to spatially localize a natural-language query on a 3D scene represented by anisotropic Gaussians. The 3D scene is parameterized as

$$\mathcal{G} = \{ G_i \}_{i=1}^N, \qquad G_i = (\mu_i, \Sigma_i, c_i, \alpha_i)$$

where $\mu_i \in \mathbb{R}^3$ is the Gaussian center, $\Sigma_i \in \mathbb{R}^{3 \times 3}$ the covariance, $c_i \in \mathbb{R}^3$ the color, and $\alpha_i \in [0,1]$ the opacity. Each $G_i$ defines a density

$$G_i(x) = \alpha_i \exp\Bigl(-\tfrac12 (x-\mu_i)^\top \Sigma_i^{-1} (x-\mu_i)\Bigr).$$

A natural-language query $q$ is represented as a sequence of $L$ token embeddings $\mathbf{E} = [e_1, \dots, e_L] \in \mathbb{R}^{L \times d}$ derived from a pretrained encoder (e.g., BERT). Given the scene $\mathcal{G}$, camera calibrations $\{ (K_k, [R_k \mid t_k]) \}_{k=1}^K$, and a query $q$, the task is to produce a per-Gaussian referring score $m_i$. Rendering these scores in each calibrated camera view yields 2D masks aligned with the region described in natural language.
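
To make the parameterization concrete, the sketch below packs these attributes into tensors and evaluates the density $G_i(x)$ at a point; the class name, tensor layout, and batched evaluation are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the 3D Gaussian scene parameterization (illustrative, not the paper's code).
import torch


class GaussianScene:
    """Holds N anisotropic Gaussians: centers mu, covariances Sigma, colors c, opacities alpha."""

    def __init__(self, mu, sigma, color, alpha):
        self.mu = mu          # (N, 3) Gaussian centers
        self.sigma = sigma    # (N, 3, 3) covariance matrices
        self.color = color    # (N, 3) RGB colors
        self.alpha = alpha    # (N,) opacities in [0, 1]

    def density(self, x):
        """G_i(x) = alpha_i * exp(-0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i)) for a point x in R^3."""
        diff = x[None, :] - self.mu                          # (N, 3)
        inv_sigma = torch.linalg.inv(self.sigma)             # (N, 3, 3)
        mahalanobis = torch.einsum("ni,nij,nj->n", diff, inv_sigma, diff)
        return self.alpha * torch.exp(-0.5 * mahalanobis)    # (N,)


# Toy scene with 4 Gaussians, evaluated at the origin.
N = 4
scene = GaussianScene(
    mu=torch.randn(N, 3),
    sigma=torch.eye(3).expand(N, 3, 3).clone() * 0.1,
    color=torch.rand(N, 3),
    alpha=torch.rand(N),
)
print(scene.density(torch.zeros(3)))
```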

2. Gaussian Field Camera Encoding (GFCE)

CaRF’s core innovation is the Gaussian Field Camera Encoding (GFCE) module, which integrates view geometry into the cross-modal semantic matching between Gaussians and linguistic queries.

Camera Parameter Encoding:

Extrinsic parameters are flattened into $c_{\text{ext}} = [\mathrm{vec}(R); t] \in \mathbb{R}^{12}$. Intrinsic parameters (focal lengths and principal point from $K$) are normalized and concatenated, producing a full camera code $c \in \mathbb{R}^{16}$. This code is mapped via a multilayer perceptron (MLP) to a feature $f_{\text{cam}} \in \mathbb{R}^d$:

$$f_{\text{cam}} = \mathrm{MLP}_{\text{cam}}(c)$$
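
A minimal sketch of this encoding is shown below, assuming a two-layer MLP and normalization of the intrinsics by image width and height; the summary above fixes only the 16-dimensional camera code and the mapping to $\mathbb{R}^d$, so the depth and normalization scheme are assumptions.

```python
# Sketch of the GFCE camera code construction (assumed MLP depth and intrinsic normalization).
import torch
import torch.nn as nn


class CameraEncoder(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        # 12 extrinsic values (flattened R | t) + 4 normalized intrinsics (fx, fy, cx, cy) = 16
        self.mlp = nn.Sequential(nn.Linear(16, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, R, t, K, width, height):
        c_ext = torch.cat([R.reshape(-1), t.reshape(-1)])                         # (12,)
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        c_int = torch.stack([fx / width, fy / height, cx / width, cy / height])   # (4,)
        c = torch.cat([c_ext, c_int])                                             # (16,) camera code
        return self.mlp(c)                                                        # (d,) camera feature f_cam


# Usage with a toy pinhole camera.
enc = CameraEncoder(d=128)
R, t = torch.eye(3), torch.zeros(3)
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
f_cam = enc(R, t, K, width=640.0, height=480.0)
print(f_cam.shape)  # torch.Size([128])
```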

Cross-modal Interaction:

Each Gaussian $G_i$ carries a learnable semantic feature $f_i \in \mathbb{R}^d$. The interaction module $\phi$ fuses $f_i$ with the language embedding $\mathbf{E}$:

$$g_i = \phi(f_i, \mathbf{E}) \in \mathbb{R}^d$$
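
The summary specifies only the signature $g_i = \phi(f_i, \mathbf{E})$. The sketch below assumes a single cross-attention layer in which each Gaussian feature attends to the token embeddings; this is one plausible instantiation, not necessarily the paper's exact design.

```python
# Sketch of the cross-modal interaction phi (the attention-based form is an assumption).
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, d=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, f, E):
        """f: (N, d) per-Gaussian features, E: (L, d) token embeddings -> g: (N, d) fused features."""
        q = f.unsqueeze(0)             # (1, N, d) queries from Gaussians
        kv = E.unsqueeze(0)            # (1, L, d) keys/values from language tokens
        attended, _ = self.attn(q, kv, kv)
        return self.norm(f + attended.squeeze(0))   # residual fusion


# Usage on toy shapes.
N, L, d = 1000, 12, 128
fusion = CrossModalFusion(d)
g = fusion(torch.randn(N, d), torch.randn(L, d))
print(g.shape)  # torch.Size([1000, 128])
```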

Camera-Conditioned Feature Modulation:

GFCE injects view-dependent information via elementwise addition:

$$\tilde{g}_i^{(k)} = g_i + f_{\text{cam}}^{(k)}$$

The modulated feature $\tilde{g}_i^{(k)}$ thus carries explicit camera geometry from the $k$-th view. The per-Gaussian referring score under view $k$ is

$$m_i^{(k)} = \sum_{j=1}^L \left( \tilde{g}_i^{(k)} \right)^\top e_j$$

This view-sensitive modulation allows the model to encode occlusions, scale, and spatial relationships in a differentiable manner.
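
The two equations above translate directly into a few lines of tensor code. The sketch below is a literal transcription under assumed shapes, using the fact that summing the dot products over tokens equals a dot product with the summed token embeddings.

```python
# Sketch of the camera-conditioned modulation and per-Gaussian referring score.
import torch


def referring_scores(g, f_cam, E):
    """g: (N, d) fused features, f_cam: (d,) camera feature for view k, E: (L, d) token embeddings.

    Returns m: (N,) referring scores m_i^(k) = sum_j (g_i + f_cam)^T e_j.
    """
    g_tilde = g + f_cam              # broadcast elementwise addition: (N, d)
    return g_tilde @ E.sum(dim=0)    # equivalent to summing the per-token dot products


# Usage on toy shapes.
N, L, d = 1000, 12, 128
m = referring_scores(torch.randn(N, d), torch.randn(d), torch.randn(L, d))
print(m.shape)  # torch.Size([1000])
```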

3. In-Training Paired-View Supervision (ITPVS)

Standard single-view training enforces agreement between predictions and 2D pseudo-masks under a single camera at a time. CaRF instead introduces In-Training Paired-View Supervision (ITPVS), which samples two overlapping camera views in every training iteration.

Dual-View Rasterization:

For a view pair $(v_a, v_b)$, predicted masks $M_{\text{pred}}^{(v)}(p)$ are rendered via alpha compositing across the Gaussians:

$$M_{\text{pred}}^{(v)}(p) = \sum_{i=1}^{N_v} m_i^{(v)} \, \alpha_i^{(v)}(p) \prod_{k < i} \left( 1 - \alpha_k^{(v)}(p) \right), \qquad v \in \{v_a, v_b\}$$
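
A per-pixel version of this compositing is sketched below, assuming the Gaussians are already depth-sorted for the view and that per-pixel opacities $\alpha_i^{(v)}(p)$ come from the splatting rasterizer; CaRF itself renders through the differentiable Gaussian splatting pipeline.

```python
# Sketch of front-to-back mask compositing over depth-sorted Gaussians.
import torch


def composite_mask(m, alpha):
    """m: (N,) referring scores for one view, alpha: (N, P) per-pixel opacities in depth order.

    Returns M_pred: (P,) rendered mask, sum_i m_i * alpha_i(p) * prod_{k<i} (1 - alpha_k(p)).
    """
    transmittance = torch.cumprod(1.0 - alpha, dim=0)                                 # prod_{k<=i}
    transmittance = torch.cat([torch.ones_like(alpha[:1]), transmittance[:-1]], dim=0)  # shift to prod_{k<i}
    return (m[:, None] * alpha * transmittance).sum(dim=0)                            # (P,)


# Usage: each view of a pair calls this with its own scores and alphas.
N, P = 500, 64 * 64
mask_a = composite_mask(torch.rand(N), torch.rand(N, P) * 0.1)
print(mask_a.shape)  # torch.Size([4096])
```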

Weighted Two-View BCE Loss:

Binary cross-entropy loss $\mathcal{L}_{\text{bce}}^{(v)}$ is computed against per-view pseudo-ground-truth masks, and the joint objective is

$$\mathcal{L}_{2\text{view}} = \alpha\, \mathcal{L}_{\text{bce}}^{(v_a)} + (1-\alpha)\, \mathcal{L}_{\text{bce}}^{(v_b)}$$

usually with $\alpha = 0.5$.
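
A minimal implementation of this weighted objective might look as follows; using logits with `binary_cross_entropy_with_logits` is an implementation assumption, not something the summary specifies.

```python
# Sketch of the weighted two-view BCE objective with alpha = 0.5.
import torch
import torch.nn.functional as F


def two_view_loss(mask_logits_a, gt_a, mask_logits_b, gt_b, alpha=0.5):
    """Per-view BCE against pseudo-GT masks, combined as alpha * L_a + (1 - alpha) * L_b."""
    loss_a = F.binary_cross_entropy_with_logits(mask_logits_a, gt_a)
    loss_b = F.binary_cross_entropy_with_logits(mask_logits_b, gt_b)
    return alpha * loss_a + (1.0 - alpha) * loss_b


# Usage on toy rendered masks.
P = 64 * 64
loss = two_view_loss(torch.randn(P), torch.rand(P).round(),
                     torch.randn(P), torch.rand(P).round())
print(loss.item())
```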

Optional Logit Consistency:

A further regularization term penalizes per-Gaussian disagreements between the two views:

$$\mathcal{L}_{\text{pair}} = \sum_{(v_a, v_b)} \sum_{i=1}^N \left\| \sigma(z_i^{(v_a)}) - \sigma(z_i^{(v_b)}) \right\|_2^2$$

where $\sigma$ denotes the sigmoid function and $z_i^{(v)}$ is the pre-sigmoid referring logit of Gaussian $i$ under view $v$.
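
A sketch of this regularizer for a single view pair, assuming $z_i^{(v)}$ are the per-Gaussian logits and the outer sum runs over the sampled pairs during training:

```python
# Sketch of the optional logit-consistency regularizer for one view pair.
import torch


def pair_consistency_loss(z_a, z_b):
    """z_a, z_b: (N,) per-Gaussian logits under views v_a and v_b."""
    return (torch.sigmoid(z_a) - torch.sigmoid(z_b)).pow(2).sum()


# Usage on toy logits.
N = 1000
print(pair_consistency_loss(torch.randn(N), torch.randn(N)).item())
```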

ITPVS forces Gaussians to produce view-invariant semantic predictions, thereby mitigating overfitting to single-view artifacts and enforcing robust 3D consistency.

4. Network Architecture and Training Regimen

The CaRF pipeline consists of distinct modules: geometry pretraining, semantic field learning, language encoding, cross-modal fusion, GFCE, volumetric mask rendering, and multiple loss heads.

Key steps for one training iteration (a condensed, runnable sketch follows the list):

  1. Sample two camera views $(v_a, v_b)$ with $\geq 30\%$ spatial overlap.
  2. Encode the language query $q$ as $\mathbf{E}$.
  3. For each Gaussian:
    • Fuse with language via $g_i = \phi(f_i, \mathbf{E})$.
    • Compute $f_{\text{cam}}^{(v_a)}$ and $f_{\text{cam}}^{(v_b)}$, and modulate to $\tilde{g}_i^{(v)}$.
    • Compute referring scores $m_i^{(v)} = \sum_j (\tilde{g}_i^{(v)})^\top e_j$.
  4. Render predicted masks for both views.
  5. Compute the two-view loss $\mathcal{L}_{2\text{view}}$.
  6. Form a prototype feature $f_g$ from the top-$\tau$ Gaussians and contrast it against distractors for the contrastive loss $\mathcal{L}_{\text{con}}$.
  7. Combine both terms into the total loss $\mathcal{L} = \lambda_1 \mathcal{L}_{2\text{view}} + \lambda_2 \mathcal{L}_{\text{con}}$.
  8. Backpropagate to update $f_i$, the MLP parameters, and $\phi$.
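
The condensed sketch below strings steps 1–8 together; all module internals are simple stand-ins (a linear fusion layer for $\phi$, random tensors for the sampled views and pseudo-GT masks, a stubbed contrastive term, and assumed $\lambda$ weights), so it illustrates the data flow rather than the paper's actual implementation.

```python
# Condensed sketch of one CaRF training iteration (steps 1-8 above); stand-ins throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

N, L, d, P = 1000, 12, 128, 64 * 64                  # Gaussians, tokens, feature dim, pixels

f = nn.Parameter(torch.randn(N, d))                  # learnable per-Gaussian semantic features f_i
phi = nn.Linear(2 * d, d)                            # stand-in for the cross-modal module phi
mlp_cam = nn.Sequential(nn.Linear(16, d), nn.ReLU(), nn.Linear(d, d))   # GFCE camera MLP
# Single Adam group for brevity; the stated setup uses separate learning rates per module group.
opt = torch.optim.Adam([f, *phi.parameters(), *mlp_cam.parameters()], lr=2.5e-3)


def referring_logits(E, cam_code):
    g = phi(torch.cat([f, E.mean(0).expand(N, d)], dim=-1))   # step 3a: fuse with language
    g_tilde = g + mlp_cam(cam_code)                           # step 3b: camera-conditioned modulation
    return g_tilde @ E.sum(0)                                 # step 3c: referring scores, shape (N,)


def render_mask(m, alpha):                                    # step 4: front-to-back alpha compositing
    T = torch.cumprod(1 - alpha, dim=0)
    T = torch.cat([torch.ones_like(alpha[:1]), T[:-1]], dim=0)
    # Sigmoid keeps the composited mask in [0, 1] so plain BCE applies (an implementation assumption).
    return (torch.sigmoid(m)[:, None] * alpha * T).sum(0)


# Steps 1-2: a sampled overlapping view pair and an encoded query (toy tensors here).
E = torch.randn(L, d)
cam_a, cam_b = torch.randn(16), torch.randn(16)
alpha_a, alpha_b = torch.rand(N, P) * 0.1, torch.rand(N, P) * 0.1
gt_a, gt_b = torch.rand(P).round(), torch.rand(P).round()

m_a, m_b = referring_logits(E, cam_a), referring_logits(E, cam_b)
loss_2view = 0.5 * F.binary_cross_entropy(render_mask(m_a, alpha_a).clamp(0, 1), gt_a) \
           + 0.5 * F.binary_cross_entropy(render_mask(m_b, alpha_b).clamp(0, 1), gt_b)   # step 5
loss_con = torch.tensor(0.0)                                  # step 6: contrastive term (stubbed out)
loss = 1.0 * loss_2view + 1.0 * loss_con                      # step 7: lambda_1 = lambda_2 = 1 assumed
opt.zero_grad()
loss.backward()                                               # step 8: update f_i, phi, and the MLPs
opt.step()
```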

Training details (a schematic optimizer setup follows the list):

  • 30,000 iterations
  • $d = 128$ feature dimension
  • Adam optimizer, batch size of one query (two views per step)
  • Learning rates: $2.5\times 10^{-3}$ for the referring field and contrastive head, $1\times 10^{-4}$ for GFCE/gating
  • Mixed-precision, gradient clip = 1.0
  • Pseudo ground-truth masks synthesized using Grounded-SAM with confidence-weighted IoU
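
The optimizer configuration implied by these settings can be sketched as two Adam parameter groups with gradient clipping under mixed precision; the grouping of parameters shown below is schematic, not the paper's code.

```python
# Schematic optimizer setup matching the stated learning rates, clipping, and mixed precision.
import torch
import torch.nn as nn

referring_field = nn.Parameter(torch.randn(1000, 128))   # stands in for the referring field + contrastive head
gfce = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 128))   # stands in for GFCE / gating

optimizer = torch.optim.Adam([
    {"params": [referring_field], "lr": 2.5e-3},
    {"params": gfce.parameters(), "lr": 1e-4},
])
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())   # mixed-precision scaling

# Inside the training loop, after scaler.scale(loss).backward():
#   scaler.unscale_(optimizer)
#   torch.nn.utils.clip_grad_norm_([referring_field, *gfce.parameters()], max_norm=1.0)
#   scaler.step(optimizer)
#   scaler.update()
#   optimizer.zero_grad()
```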

5. Quantitative and Qualitative Evaluation

Extensive experiments across three standard referring 3D segmentation datasets demonstrate that CaRF achieves consistently higher mean Intersection-over-Union (mIoU) than previous methods.

Method        Ref-LERF (mIoU)    LERF-OVS (mIoU)    3D-OVS (mIoU)
ReferSplat    25.0               52.6               92.9
CaRF          29.2 (+16.8%)      54.9 (+4.3%)       94.7 (+2.0%)

Ablation studies reveal that both ITPVS and GFCE contribute to performance gains; jointly, they provide the strongest results (Ramen/Kitchen mIoU on Ref-LERF: Baseline 28.3/20.1, ITPVS only 31.6/22.4, GFCE only 24.3/13.5, full CaRF 33.5/24.7). Qualitative outputs show that CaRF produces masks preserving fine object details (e.g., glass rims, handle curvature) while maintaining cross-view coherence, whereas single-view methods tend to miss parts or over-segment into the background.

6. Context, Limitations, and Applications

CaRF's explicit camera-aware modulation allows features to account for occlusion, scale, and precise spatial arrangements that are view-dependent. This overcomes limitations of prior 2D pseudo-supervision and non-differentiable reprojection strategies, which are sensitive to thresholding and accumulate geometric errors over time. By coupling paired-view gradients, CaRF regularizes the model toward genuinely 3D-consistent segmentations.

Potential applications include:

  • Embodied AI: A robot can resolve queries such as "pick up the blue mug on the left shelf" with robust, view-invariant localization.
  • AR/VR interaction: Users can select and manipulate virtual objects anchored in real geometry using natural language.
  • Autonomous perception: The system enables open-vocabulary, viewpoint-invariant segmentation (e.g., "pedestrian crossing road") in dynamic 3D environments.

This suggests that explicit camera-awareness and multi-view training are jointly essential for consistent, reliable 3D segmentation linked to natural language. A plausible implication is that similar camera-aware mechanisms may benefit other 3D perception tasks requiring geometric and semantic consistency across views.
