Camera-Aware Referring Field (CaRF)
- The paper introduces CaRF, a framework that integrates camera-aware encoding and paired-view supervision to enforce multi-view consistency in 3D Gaussian segmentation.
- It leverages a novel Gaussian Field Camera Encoding module to fuse camera parameters with semantic features for precise cross-view mask predictions.
- Evaluations demonstrate significant mIoU improvements over previous methods, validating its effectiveness in achieving view-consistent segmentation in 3D scenes.
Camera-Aware Referring Field (CaRF) is a fully differentiable framework for referring 3D Gaussian Splatting Segmentation (R3DGS) that addresses multi-view consistency when associating free-form language expressions with spatially localized 3D regions. CaRF introduces explicit camera-geometry encoding and paired-view supervision; these mechanisms promote geometric reasoning and enforce view-consistent mask predictions directly in 3D Gaussian space, outperforming previous methods that rely on 2D projections and single-view learning.
1. Problem Formulation and Scene Representation
CaRF is defined in the context of referring 3D Gaussian Splatting Segmentation (R3DGS), where the goal is to spatially localize a natural-language query on a 3D scene represented by anisotropic Gaussians. The 3D scene is parameterized as

$$\mathcal{G} = \{ g_i \}_{i=1}^{N}, \qquad g_i = (\mu_i, \Sigma_i, c_i, \alpha_i),$$

where $\mu_i \in \mathbb{R}^3$ is the Gaussian center, $\Sigma_i$ the covariance, $c_i$ the color, and $\alpha_i$ the opacity. Each $g_i$ defines a density

$$G_i(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\right).$$

A natural-language query is represented as a sequence of token embeddings $T = \{ t_l \}_{l=1}^{L}$ derived from a pretrained encoder (e.g., BERT). Given the scene $\mathcal{G}$, camera calibrations $\{(K_v, E_v)\}_{v=1}^{V}$, and a query $T$, the task is to produce a per-Gaussian referring score $s_i \in [0,1]$. Rendering these scores in each calibrated camera view yields 2D masks aligned with the region described in natural language.
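As a concrete illustration, the sketch below evaluates the unnormalized densities $G_i(x)$ for a batch of query points; the PyTorch tensor layout and function name are assumptions for illustration, not the authors' implementation.

```python
import torch

def gaussian_density(x, mu, cov):
    """x: (P, 3) query points; mu: (N, 3) Gaussian centers; cov: (N, 3, 3) covariances.
    Returns (P, N) unnormalized densities exp(-0.5 * d^T Sigma^{-1} d)."""
    diff = x[:, None, :] - mu[None, :, :]          # (P, N, 3) offsets to each center
    cov_inv = torch.linalg.inv(cov)                # (N, 3, 3) inverse covariances
    # Mahalanobis term d^T Sigma^{-1} d for every (point, Gaussian) pair
    maha = torch.einsum('pni,nij,pnj->pn', diff, cov_inv, diff)
    return torch.exp(-0.5 * maha)
```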
2. Gaussian Field Camera Encoding (GFCE)
CaRF’s core innovation is the Gaussian Field Camera Encoding (GFCE) module, which integrates view geometry into the cross-modal semantic matching between Gaussians and linguistic queries.
Camera Parameter Encoding:
Extrinsic parameters $E_v = [R_v \mid t_v]$ are flattened into a vector $\mathrm{vec}(E_v) \in \mathbb{R}^{12}$. Intrinsic parameters (focal lengths $f_x, f_y$ and principal point $(c_x, c_y)$ from $K_v$) are normalized and concatenated, producing a full camera code $c_v$. This code is mapped via a multilayer perceptron (MLP) to a feature $e_v \in \mathbb{R}^{d}$:

$$e_v = \mathrm{MLP}(c_v).$$
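A minimal sketch of this camera encoding, assuming a PyTorch MLP, world-to-camera extrinsics, and normalization of the intrinsics by the image size; the class name, hidden size, and exact normalization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CameraEncoder(nn.Module):
    """Flatten the 3x4 extrinsics, normalize fx, fy, cx, cy by the image size,
    concatenate into a camera code, and map it through a small MLP."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(12 + 4, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )

    def forward(self, extrinsic, intrinsic, img_wh):
        # extrinsic: (B, 3, 4) [R | t]; intrinsic: (B, 3, 3); img_wh: (B, 2)
        e_flat = extrinsic.reshape(extrinsic.shape[0], 12)
        fx, fy = intrinsic[:, 0, 0], intrinsic[:, 1, 1]
        cx, cy = intrinsic[:, 0, 2], intrinsic[:, 1, 2]
        w, h = img_wh[:, 0], img_wh[:, 1]
        k_norm = torch.stack([fx / w, fy / h, cx / w, cy / h], dim=-1)  # (B, 4)
        cam_code = torch.cat([e_flat, k_norm], dim=-1)                  # camera code c_v
        return self.mlp(cam_code)                                       # feature e_v, (B, dim)
```

Normalizing the intrinsics by the image size keeps the camera code comparable across resolutions, which is one reasonable choice; the paper's exact normalization may differ.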
Cross-modal Interaction:
Each Gaussian carries a learnable semantic feature $f_i \in \mathbb{R}^{d}$. The interaction module fuses $f_i$ with the language embeddings $T$:

$$h_i = \Phi(f_i, T),$$

where $\Phi$ denotes the cross-modal fusion (e.g., attention over the token embeddings).
Camera-Conditioned Feature Modulation:
GFCE injects view-dependent information via elementwise addition:

$$\tilde{h}_i^{v} = h_i + e_v,$$

so $\tilde{h}_i^{v}$ carries explicit camera geometry from the $v$-th view. The per-Gaussian referring score under view $v$ is

$$s_i^{v} = \sigma\!\left(\psi\big(\tilde{h}_i^{v}\big)\right),$$

where $\psi$ is the scoring head and $\sigma$ the sigmoid.
This view-sensitive modulation allows the model to encode occlusions, scale, and spatial relationships in a differentiable manner.
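The modulation-and-scoring step could look roughly like the following sketch; the two-layer prediction head and the module name are assumptions standing in for the paper's exact design.

```python
import torch
import torch.nn as nn

class GFCEHead(nn.Module):
    """Add the camera feature to each language-fused Gaussian feature and
    predict a per-Gaussian, view-dependent referring score."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True),
                                   nn.Linear(dim, 1))

    def forward(self, h, e_v):
        # h: (N, dim) fused Gaussian features; e_v: (dim,) camera feature
        h_view = h + e_v                            # elementwise camera modulation
        logits = self.score(h_view).squeeze(-1)     # (N,) view-dependent logits
        return torch.sigmoid(logits), logits        # scores s_i^v in [0, 1], raw logits
```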
3. In-Training Paired-View Supervision (ITPVS)
Standard single-view training enforces agreement between predictions and 2D pseudo-masks under a single camera at a time. CaRF instead introduces In-Training Paired-View Supervision (ITPVS), in which two overlapping camera views are sampled in each training iteration.
Dual-View Rasterization:
For a view pair $(v_1, v_2)$, predicted masks are rendered via alpha compositing across the Gaussians:

$$\hat{M}_{v}(p) = \sum_{i \in \mathcal{N}(p)} s_i^{v}\, \alpha_i \prod_{j<i} \left(1 - \alpha_j\right), \qquad v \in \{v_1, v_2\},$$

where $\mathcal{N}(p)$ denotes the depth-ordered Gaussians overlapping pixel $p$.
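For intuition, the sketch below composites per-Gaussian referring scores for a single pixel using the standard front-to-back 3DGS weights; depth sorting and per-pixel Gaussian selection are assumed to happen upstream, and the function name is illustrative.

```python
import torch

def composite_scores(scores, alphas):
    """scores, alphas: (K,) for the K depth-sorted Gaussians hitting one pixel.
    Returns the rendered referring-mask value M(p) = sum_i s_i * alpha_i * prod_{j<i}(1 - alpha_j)."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1, device=alphas.device), 1.0 - alphas[:-1]]), dim=0
    )                                    # T_i = prod_{j<i} (1 - alpha_j)
    weights = alphas * transmittance     # per-Gaussian compositing weights
    return (weights * scores).sum()      # scalar mask value for this pixel
```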
Weighted Two-View BCE Loss:
Binary cross-entropy loss is computed against the per-view pseudo-GT masks $M_{v_1}, M_{v_2}$, and the joint objective is

$$\mathcal{L}_{\mathrm{pair}} = \lambda_1\, \mathcal{L}_{\mathrm{BCE}}\!\left(\hat{M}_{v_1}, M_{v_1}\right) + \lambda_2\, \mathcal{L}_{\mathrm{BCE}}\!\left(\hat{M}_{v_2}, M_{v_2}\right),$$

where $\lambda_1, \lambda_2$ weight the two views.
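A minimal sketch of this objective, with placeholder weights `lam1`/`lam2` since the paper's values are not reproduced here.

```python
import torch.nn.functional as F

def paired_view_loss(pred_v1, pred_v2, gt_v1, gt_v2, lam1=1.0, lam2=1.0):
    """Weighted two-view BCE: rendered masks and pseudo-GT masks of shape (H, W),
    values in [0, 1]. lam1/lam2 are placeholder view weights."""
    loss_v1 = F.binary_cross_entropy(pred_v1, gt_v1)
    loss_v2 = F.binary_cross_entropy(pred_v2, gt_v2)
    return lam1 * loss_v1 + lam2 * loss_v2
```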
Optional Logit Consistency:
A further regularization term penalizes per-Gaussian disagreements between the two views:

$$\mathcal{L}_{\mathrm{cons}} = \frac{1}{N}\sum_{i=1}^{N}\left(\sigma(z_i^{v_1}) - \sigma(z_i^{v_2})\right)^{2},$$

where $z_i^{v}$ is the pre-sigmoid logit of Gaussian $i$ under view $v$ and $\sigma$ denotes the sigmoid function.
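A sketch of one plausible form of this regularizer (the exact penalty used in the paper may differ): a mean squared difference between the two views' post-sigmoid scores.

```python
import torch

def logit_consistency(logits_v1, logits_v2):
    """logits_v1, logits_v2: (N,) per-Gaussian pre-sigmoid logits from the two views.
    Penalizes cross-view disagreement of the resulting referring scores."""
    return (torch.sigmoid(logits_v1) - torch.sigmoid(logits_v2)).pow(2).mean()
```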
ITPVS forces Gaussians to produce view-invariant semantic predictions, thereby mitigating overfitting to single-view artifacts and enforcing robust 3D consistency.
4. Network Architecture and Training Regimen
The CaRF pipeline consists of distinct modules: geometry pretraining, semantic field learning, language encoding, cross-modal fusion, GFCE, volumetric mask rendering, and multiple loss heads.
Key steps for one training iteration (a code sketch follows the list):
- Sample two camera views with spatial overlap.
- Encode the language query as token embeddings $T$ with the pretrained text encoder.
- For each Gaussian:
  - Fuse $f_i$ with the language embeddings via $h_i = \Phi(f_i, T)$.
  - Compute camera features $e_{v_1}, e_{v_2}$, and modulate $h_i$ to $\tilde{h}_i^{v_1}$ and $\tilde{h}_i^{v_2}$.
  - Compute referring scores $s_i^{v_1}$ and $s_i^{v_2}$.
- Render predicted masks for both views.
- Compute the two-view loss $\mathcal{L}_{\mathrm{pair}}$.
- Form a prototype feature from the top-$k$ Gaussians and contrast it against distractors for the contrastive loss $\mathcal{L}_{\mathrm{con}}$.
- Total loss combines both: $\mathcal{L} = \mathcal{L}_{\mathrm{pair}} + \mathcal{L}_{\mathrm{con}}$.
- Backpropagate to update the per-Gaussian semantic features $f_i$, the MLP parameters, and the GFCE camera encoder.
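Putting the steps above together, a single training iteration might be sketched as follows; `fuse`, `cam_encoder`, `head`, `render`, `pair_loss`, and `contrast` are hypothetical callables standing in for the components described in this summary, not the authors' API.

```python
import torch

def train_step(gaussian_feats, text_feat, cams, gt_masks, modules, optimizer):
    """One CaRF-style iteration over a pair of overlapping views."""
    fuse, cam_encoder, head, render, pair_loss, contrast = modules

    h = fuse(gaussian_feats, text_feat)           # cross-modal fusion, (N, d)
    preds, scores_per_view = [], []
    for cam in cams:                              # the two overlapping views
        e_v = cam_encoder(cam)                    # camera feature e_v, (d,)
        scores, _ = head(h, e_v)                  # per-Gaussian referring scores, (N,)
        scores_per_view.append(scores)
        preds.append(render(scores, cam))         # rasterized referring mask, (H, W)

    loss = pair_loss(preds[0], preds[1], gt_masks[0], gt_masks[1])
    loss = loss + contrast(h, scores_per_view)    # prototype/contrastive term

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    params = [p for group in optimizer.param_groups for p in group["params"]]
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
    return loss.item()
```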
Training details (a configuration sketch follows the list):
- 30,000 iterations
- Feature dimension $d$ shared by the Gaussian semantic features and camera codes
- Adam optimizer, batch size of one query (two views/step)
- Learning rates: separate rates for the referring field and contrastive head, and for the GFCE/gating parameters
- Mixed-precision, gradient clip = 1.0
- Pseudo ground-truth masks synthesized using Grounded-SAM with confidence-weighted IoU
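A configuration sketch consistent with these details, assuming PyTorch; the module names and learning-rate arguments are placeholders, since the actual values are not reproduced in this summary.

```python
import torch

def build_optimizer(referring_modules, gfce_modules, lr_field, lr_gfce):
    """Two Adam parameter groups with distinct learning rates."""
    return torch.optim.Adam([
        {"params": referring_modules.parameters(), "lr": lr_field},  # referring field + contrastive head
        {"params": gfce_modules.parameters(), "lr": lr_gfce},        # GFCE / gating parameters
    ])

# Generic mixed-precision update with gradient clipping at norm 1.0
# (a standard AMP pattern, not code from the paper):
scaler = torch.cuda.amp.GradScaler()

def amp_step(loss_fn, optimizer, params):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                        # so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```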
5. Quantitative and Qualitative Evaluation
Extensive experiments across three standard referring 3D segmentation datasets demonstrate that CaRF achieves consistently higher mean Intersection-over-Union (mIoU) than previous methods.
| Method | Ref-LERF (mIoU) | LERF-OVS (mIoU) | 3D-OVS (mIoU) |
|---|---|---|---|
| ReferSplat | 25.0 | 52.6 | 92.9 |
| CaRF | 29.2 (+16.8%) | 54.9 (+4.3%) | 94.7 (+2.0%) |
Ablation studies reveal that both ITPVS and GFCE contribute to the performance gains; jointly, they provide the strongest results (e.g., mIoU on the Ref-LERF Ramen/Kitchen scenes: baseline 28.3/20.1, ITPVS only 31.6/22.4, GFCE only 24.3/13.5, full CaRF 33.5/24.7). Qualitative outputs show that CaRF produces masks preserving fine object details (e.g., glass rims, handle curvature) while maintaining cross-view coherence, whereas single-view methods tend to miss parts or to over-segment into the background.
6. Context, Limitations, and Applications
CaRF's explicit camera-aware modulation allows features to account for occlusion, scale, and precise spatial arrangements that are view-dependent. This overcomes limitations of prior 2D pseudo-supervision and non-differentiable reprojection strategies, which are sensitive to thresholding and accumulate geometric errors over time. By coupling paired-view gradients, CaRF regularizes the model toward genuinely 3D-consistent segmentations.
Potential applications include:
- Embodied AI: A robot can resolve queries such as "pick up the blue mug on the left shelf" with robust, view-invariant localization.
- AR/VR interaction: Users can select and manipulate virtual objects anchored in real geometry using natural language.
- Autonomous perception: The system enables open-vocabulary, viewpoint-invariant segmentation (e.g., "pedestrian crossing road") in dynamic 3D environments.
This suggests that explicit camera-awareness and multi-view training are jointly essential for consistent, reliable 3D segmentation linked to natural language. A plausible implication is that similar camera-aware mechanisms may benefit other 3D perception tasks requiring geometric and semantic consistency across views.