Causal Ego-Referring Segmentation (CERES)
- CERES is a plug-in causal intervention framework that mitigates dataset-induced language bias and egocentric visual confounding in referring video object segmentation.
- It employs dual-modal adjustments by fusing semantic and depth features through attention mechanisms to generate de-biased representations.
- Empirical evaluations on standard benchmarks show that CERES achieves state-of-the-art performance and improved robustness against motion blur, occlusions, and novel object–action combinations.
Causal Ego-Referring Segmentation (CERES) is a plug-in causal intervention framework for egocentric referring video object segmentation (Ego-RVOS). Ego-RVOS requires segmenting the specific object-in-action denoted by a language query within first-person videos. CERES addresses two longstanding challenges in this domain: dataset-induced language bias and egocentric visual confounding. The approach implements dual-modal causal intervention by applying back-door adjustment to language representations and front-door adjustment to the visual stream. CERES achieves state-of-the-art results on canonical Ego-RVOS benchmarks and demonstrates improved robustness under challenging conditions such as motion blur, occlusions, and novel object–action combinations (Liu et al., 30 Dec 2025).
1. Problem Setting and Causal Modeling
The primary task in Ego-RVOS is to segment the active object described by a language query in a video recorded from a first-person perspective. The output is a dense binary mask sequence indicating, at each frame, the pixels belonging to the queried object.
The causal structure underpinning this formulation is encoded in a Structural Causal Model (SCM):
- Nodes: $L$ (language query), $X$ (visual input), $Z$ (observable dataset bias: object–action co-occurrence), $U$ (unobserved egocentric confounders, e.g., motion, occlusion), $M$ (mediator), $Y$ (segmentation output).
- Edges: $L \to Y$; $X \to M$, $M \to Y$ (mediated visual path).
- Confounding: $Z \to L$, $Z \to Y$ (spurious back-door for language); $U \to X$, $U \to Y$ (spurious back-door for vision).
The objective is to estimate $P(Y \mid \mathrm{do}(L))$ (effect of language) and $P(Y \mid \mathrm{do}(X))$ (effect of vision) by counteracting the respective confounding using principled causal adjustments.
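As a concrete illustration, the SCM above can be written down as a small directed graph. The sketch below uses networkx purely as a bookkeeping device for the nodes and edges listed above; it is not part of the CERES implementation.

```python
# Illustrative encoding of the CERES structural causal model (SCM).
import networkx as nx

scm = nx.DiGraph()
scm.add_edges_from([
    ("Z", "L"), ("Z", "Y"),   # dataset bias confounds language and output
    ("U", "X"), ("U", "Y"),   # egocentric confounders affect vision and output
    ("L", "Y"),               # direct causal path from language to segmentation
    ("X", "M"), ("M", "Y"),   # visual effect mediated through M (front-door path)
])

# Back-door path for L:  L <- Z -> Y   (blocked by adjusting for the observed Z)
# Back-door path for X:  X <- U -> Y   (U is unobserved, motivating the mediator M)
print(nx.is_directed_acyclic_graph(scm))  # True: the graph is a valid SCM skeleton
```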
2. Back-Door Adjustment: Language Deconfounding
Textual confounding arises from dataset bias: specific object–action pairs $z$ dominate the training distribution, inducing spurious correlations between $L$ and $Y$. To estimate $P(Y \mid \mathrm{do}(L))$, Pearl's back-door criterion is invoked:

$$P(Y \mid \mathrm{do}(L)) = \sum_{z} P(Y \mid L, Z = z)\, P(z)$$
CERES implements a neural back-door adjustment through the following procedure:
- The text encoder produces a language embedding $f_L$.
- All unique object–action pairs are embedded as confounder vectors $z_i$; their empirical frequencies $P(z_i)$ are computed from the training set.
- The confounder mean is calculated as $\bar{f}_Z = \sum_i P(z_i)\, z_i$.
- The de-biased language feature is $f'_L = f_L + \bar{f}_Z$.
The segmentation score is then computed from $f'_L$ in place of $f_L$. This neural realization, based on the Normalized Weighted Geometric Mean (NWGM) approximation, approximates the do-intervention on language by counteracting dataset-induced back-door dependencies.
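A minimal sketch of this adjustment is given below, assuming hypothetical names and shapes (`pair_embeddings` and `pair_probs` stand in for the object–action confounder dictionary and its empirical frequencies); it illustrates the frequency-weighted confounder mean and the additive shift, not the exact CERES implementation.

```python
import torch

def backdoor_language_feature(f_L: torch.Tensor,
                              pair_embeddings: torch.Tensor,
                              pair_probs: torch.Tensor) -> torch.Tensor:
    """Back-door adjusted language feature (NWGM-style approximation).

    f_L:             (D,)   language embedding from the text encoder
    pair_embeddings: (K, D) embeddings z_i of all object-action pairs
    pair_probs:      (K,)   empirical frequencies P(z_i) from the training set
    """
    # Confounder mean: f_Z_bar = sum_i P(z_i) * z_i
    f_Z_bar = (pair_probs.unsqueeze(-1) * pair_embeddings).sum(dim=0)
    # De-biased feature: the do-intervention is approximated by shifting
    # the query embedding with the dataset-level confounder expectation.
    return f_L + f_Z_bar

# Shape-only usage example with random placeholders.
f_L = torch.randn(256)
z = torch.randn(100, 256)
p = torch.softmax(torch.randn(100), dim=0)  # stands in for empirical frequencies
f_L_debiased = backdoor_language_feature(f_L, z, p)
```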
3. Front-Door Adjustment: Visual Confounder Mitigation
First-person video introduces unobserved confounders $U$ (e.g., head motion) impacting both $X$ and $Y$. Because $U$ is unobserved, back-door adjustment is not applicable; CERES instead targets $P(Y \mid \mathrm{do}(X))$ using front-door adjustment through the mediator $M$:

$$P(Y \mid \mathrm{do}(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid m, x')\, P(x')$$
The mediator $M$ is constructed by fusing:
- Semantic tokens $M_v$ from an RGB encoder, representing "what is present".
- Depth tokens $M_d$ from a monocular depth encoder, representing spatial structure.
Depth-aware cross-attention first computes a depth-aggregated token, which then guides attention over the semantic tokens to yield the fused mediator $\hat{H}_M$. Temporal context is approximated via a sliding memory bank over embeddings of the preceding frames, producing $\hat{H}_X$.
Fusion of $\hat{H}_M$ and $\hat{H}_X$ is performed with a gated residual block:

$$f'_X = \sigma \cdot \mathrm{MLP}([\hat{H}_M; \hat{H}_X]) + (1 - \sigma) \cdot \mathrm{flatten}(X_{\mathrm{rgb}})$$

where $\sigma$ is a learned gate and $X_{\mathrm{rgb}}$ is the backbone's semantic feature map. This constructs a de-confounded visual representation, satisfying the front-door criterion for $P(Y \mid \mathrm{do}(X))$.
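The depth-guided fusion can be sketched as two standard cross-attention steps, as below; the module structure and names (`DepthAwareCrossAttention`, a learnable depth query) are illustrative assumptions rather than the published module definitions.

```python
import torch
import torch.nn as nn

class DepthAwareCrossAttention(nn.Module):
    """Illustrative DAttn sketch: depth tokens are summarized into a single
    depth-aggregated token, which then attends over semantic tokens to
    produce a fused mediator token."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.depth_pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_attend = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_query = nn.Parameter(torch.randn(1, 1, dim))  # learnable query

    def forward(self, depth_tokens: torch.Tensor, sem_tokens: torch.Tensor):
        # depth_tokens: (B, N_d, D), sem_tokens: (B, N_v, D)
        b = depth_tokens.size(0)
        q = self.depth_query.expand(b, -1, -1)
        # Depth-aggregated token: a single query attends over all depth tokens.
        d_tok, _ = self.depth_pool(q, depth_tokens, depth_tokens)
        # The depth summary then guides attention over the semantic tokens.
        h_m, _ = self.sem_attend(d_tok, sem_tokens, sem_tokens)
        return h_m  # (B, 1, D) fused mediator token
```

In this reading, the per-layer mediator tokens from the last $n$ backbone layers are collected and later combined with the temporal context $\hat{H}_X$ in the gated residual block.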
4. Architecture and Implementation Specifics
CERES operates as a plug-in atop pre-trained RVOS backbones (e.g., ReferFormer), structured as follows:
- Linguistic Back-door Deconfounder (LBD): applies the back-door adjustment to the text embedding output.
- Visual stream augmented by a frozen Depth Anything V2 (ViT-B) encoder; semantic and geometric tokens are extracted from the last $n$ backbone layers.
- Dual-modal attention (DAttn) fuses depth and semantic streams.
- Memory Attention (MAttn) incorporates temporal context from a sliding window of preceding frames.
- Gated residual block forms the final de-biased visual feature.
Input images are resized to a fixed resolution; the text encoder is RoBERTa, and backbones evaluated include ResNet-101, Video Swin-B, and Swin-L.
Forward pass pseudocode for timestep $t$:
```
1. f_L  ← E_txt(ℓ)                               # language embedding
2. f'_L ← f_L + f̄_Z                              # back-door adjusted language feature
3. X_rgb ← E_rgb(x_t);  X_dep ← E_depth(x_t)     # semantic and depth features
4. for l in last n layers:
       M_{v,l}, M_{d,l} ← extract_tokens(X_rgb, X_dep)
       Ĥ_{M,l} ← DAttn(M_{d,l}, M_{v,l})         # depth-guided per-layer fusion
   Ĥ_M ← Ĥ_{M,layers}                            # fused mediator over the n layers
5. B_t ← update_bank(B_{t−1}, X_rgb)             # sliding memory bank
   Ĥ_{X,t} ← MAttn(x_t, B_t)                     # temporal context
6. f'_X ← σ·MLP([Ĥ_M; Ĥ_{X,t}]) + (1−σ)·flatten(X_rgb)   # gated residual fusion
7. ŷ_t ← mask_decoder(f'_X, f'_L)
8. return ŷ_t
```
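Steps 5–6 can be sketched in PyTorch as below; the window size, tensor shapes, and scalar gate are assumptions made for illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
from collections import deque

class MemoryGatedFusion(nn.Module):
    """Illustrative sketch of the sliding memory bank (MAttn) and the
    gated residual fusion producing the de-biased visual feature f'_X."""
    def __init__(self, dim: int, window: int = 4, heads: int = 8):
        super().__init__()
        self.bank = deque(maxlen=window)              # sliding memory bank B_t
        self.mattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Parameter(torch.tensor(0.0))   # gate logit (sigma after sigmoid)

    def forward(self, x_rgb_tokens: torch.Tensor, h_m: torch.Tensor) -> torch.Tensor:
        # x_rgb_tokens: (B, N, D) current-frame semantic tokens; h_m: (B, 1, D) mediator
        self.bank.append(x_rgb_tokens.detach())
        memory = torch.cat(list(self.bank), dim=1)    # tokens from preceding frames
        h_x, _ = self.mattn(x_rgb_tokens, memory, memory)        # temporal context H_X
        fused = self.mlp(torch.cat([h_m.expand_as(h_x), h_x], dim=-1))
        sigma = torch.sigmoid(self.gate)
        return sigma * fused + (1.0 - sigma) * x_rgb_tokens      # gated residual f'_X
```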
5. Training Protocols and Objectives
Supervision follows ActionVOS conventions, optimizing for:
- Cross-entropy loss on per-pixel segmentation masks.
- Dice loss for mask overlap.
- Focal loss to counteract class imbalance.
Deep supervision applies identical segmentation losses to outputs from intermediate mediator layers. Hyperparameters: batch size 4, 6 training epochs, AdamW optimizer with differentiated learning rates for the CERES modules and the backbone, and learning-rate decay by a factor of 0.1 at epochs 3 and 5.
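A minimal sketch of the combined objective on binary masks is given below, with equal loss weights and a focusing parameter chosen for illustration (the actual weighting follows the ActionVOS setup).

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor,
                      gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Cross-entropy + Dice + focal loss on per-pixel binary masks.
    logits, target: (B, H, W); target is a float mask in {0, 1}."""
    prob = torch.sigmoid(logits)

    # Per-pixel binary cross-entropy.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")

    # Dice loss on soft mask overlap.
    inter = (prob * target).sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)

    # Focal modulation down-weights easy pixels to counter class imbalance.
    p_t = prob * target + (1.0 - prob) * (1.0 - target)
    focal = (1.0 - p_t) ** gamma * bce

    return bce.mean() + dice.mean() + focal.mean()
```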
6. Empirical Evaluation
Experiments use VISOR (13,205 training clips with 76,873 annotated objects; 467 validation clips with 1,841 objects), with VOST and VSCOS used for zero-shot generalization. Key metrics: mIoU and cIoU, each reported separately for positive and negative objects, generalized IoU (gIoU; segmentation and activity), and standard binary classification measures.
| Method | mIoU (pos.) | mIoU (neg.) | gIoU | Accuracy |
|---|---|---|---|---|
| ActionVOS | 59.9 | 16.3 | 69.9 | 73.4 |
| CERES | 64.0 | 15.3 | 72.4 | 76.3 |
On zero-shot VSCOS, CERES achieves 55.3 mIoU and 62.5 cIoU (vs. ActionVOS: 52.5 and 57.7). On VOST, CERES attains 32.0 mIoU and 21.7 cIoU (vs. ActionVOS: 30.2 and 17.6). Ablations confirm additive benefits: LBD, DAttn, and MAttn each improve generalization, with their combination yielding maximum gains.
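For reference, the two IoU aggregations can be computed as in the sketch below; the per-object mean versus cumulative (pooled) definition is a common convention and should be read as an assumption about the exact benchmark protocol.

```python
import numpy as np

def miou_and_ciou(preds, gts):
    """preds, gts: lists of binary masks (H, W), one pair per object instance.
    mIoU averages per-object IoU; cIoU pools intersections and unions first."""
    inters, unions, ious = [], [], []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inters.append(inter)
        unions.append(union)
        ious.append(inter / union if union > 0 else 1.0)
    miou = float(np.mean(ious))
    ciou = float(np.sum(inters) / max(np.sum(unions), 1))
    return miou, ciou
```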
7. Significance and Implications
CERES demonstrates that interventions grounded in a causal framework, specifically back-door and front-door adjustment, can systematically mitigate both linguistic and visual confounding in egocentric segmentation. The dual-modal mediator leverages both semantic and geometric (depth) features, guided by causal principles, yielding representations robust to egocentric video artifacts such as heavy motion blur and occlusion. Extensive benchmarking on VISOR, VOST, and VSCOS supports substantial state-of-the-art improvement, and CERES is adaptable to a broad class of pre-trained RVOS backbones. A plausible implication is the extension of these causal approaches to broader egocentric understanding tasks, underscoring the value of principled deconfounding in complex vision-language integration (Liu et al., 30 Dec 2025).