Causal Ego-Referring Segmentation (CERES)
- CERES is a plug-in causal intervention framework that mitigates dataset-induced language bias and egocentric visual confounding in referring video object segmentation.
- It employs dual-modal adjustments by fusing semantic and depth features through attention mechanisms to generate de-biased representations.
- Empirical evaluations on standard benchmarks show that CERES achieves state-of-the-art performance and improved robustness against motion blur, occlusions, and novel object–action combinations.
Causal Ego-Referring Segmentation (CERES) is a plug-in causal intervention framework for egocentric referring video object segmentation (Ego-RVOS). Ego-RVOS requires segmenting the specific object-in-action denoted by a language query within first-person videos. CERES addresses two longstanding challenges in this domain: dataset-induced language bias and egocentric visual confounding. The approach implements dual-modal causal intervention by applying back-door adjustment to language representations and front-door adjustment to the visual stream. CERES achieves state-of-the-art results on canonical Ego-RVOS benchmarks and demonstrates improved robustness under challenging conditions such as motion blur, occlusions, and novel object–action combinations (Liu et al., 30 Dec 2025).
1. Problem Setting and Causal Modeling
The primary task in Ego-RVOS is to segment the active object described by a language query in a video recorded from a first-person perspective. The output is a dense binary mask sequence indicating, at each frame, the pixels belonging to the queried object.
The causal structure underpinning this formulation is encoded in a Structural Causal Model (SCM):
- Nodes: $L$ (language query), $X$ (visual input), $Z$ (observable dataset bias: object–action co-occurrence), $U$ (unobserved egocentric confounders, e.g., motion, occlusion), $M$ (mediator), $Y$ (segmentation output).
- Edges: $L \to Y$; $X \to M$, $M \to Y$ (mediated visual path).
- Confounding: $Z \to L$, $Z \to Y$ (spurious back-door for language); $U \to X$, $U \to Y$ (spurious back-door for vision).
The objective is to estimate $P(Y \mid \mathrm{do}(L))$ (effect of language) and $P(Y \mid \mathrm{do}(X))$ (effect of vision) by counteracting the respective confounding using principled causal adjustments.
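As a concrete illustration, the SCM above can be written down as a small directed graph. The sketch below uses networkx purely as a bookkeeping device for the nodes and edges listed above; it is not part of the CERES implementation.

```python
# Illustrative encoding of the CERES structural causal model (SCM).
import networkx as nx

scm = nx.DiGraph()
scm.add_edges_from([
    ("Z", "L"), ("Z", "Y"),   # dataset bias confounds language and output
    ("U", "X"), ("U", "Y"),   # egocentric confounders affect vision and output
    ("L", "Y"),               # direct causal path from language to segmentation
    ("X", "M"), ("M", "Y"),   # visual effect mediated through M (front-door path)
])

# Back-door path for L:  L <- Z -> Y   (blocked by adjusting for the observed Z)
# Back-door path for X:  X <- U -> Y   (U is unobserved, motivating the mediator M)
print(nx.is_directed_acyclic_graph(scm))  # True: the graph is a valid SCM skeleton
```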
2. Back-Door Adjustment: Language Deconfounding
Textual confounding arises from dataset bias: specific object–action pairs $z$ dominate the training distribution, inducing spurious correlations between $L$ and $Y$. To estimate $P(Y \mid \mathrm{do}(L))$, Pearl's back-door criterion is invoked:

$$P(Y \mid \mathrm{do}(L)) = \sum_{z} P(Y \mid L, Z = z)\, P(z)$$
CERES implements a neural back-door adjustment through the following procedure:
- The text encoder produces a language embedding $f_L$.
- All unique object–action pairs are embedded as confounder vectors $z_i$; their empirical frequencies $P(z_i)$ are computed from the training set.
- The confounder mean is calculated as $\bar{f}_Z = \sum_i P(z_i)\, z_i$.
- The de-biased language feature is $f'_L = f_L + \bar{f}_Z$.
The segmentation score is then computed from $f'_L$ in place of $f_L$. This neural realization, based on the Normalized Weighted Geometric Mean (NWGM) approximation, approximates the do-intervention on language by counteracting dataset-induced back-door dependencies.
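A minimal sketch of this adjustment is given below, assuming hypothetical names and shapes (`pair_embeddings` and `pair_probs` stand in for the object–action confounder dictionary and its empirical frequencies); it illustrates the frequency-weighted confounder mean and the additive shift, not the exact CERES implementation.

```python
import torch

def backdoor_language_feature(f_L: torch.Tensor,
                              pair_embeddings: torch.Tensor,
                              pair_probs: torch.Tensor) -> torch.Tensor:
    """Back-door adjusted language feature (NWGM-style approximation).

    f_L:             (D,)   language embedding from the text encoder
    pair_embeddings: (K, D) embeddings z_i of all object-action pairs
    pair_probs:      (K,)   empirical frequencies P(z_i) from the training set
    """
    # Confounder mean: f_Z_bar = sum_i P(z_i) * z_i
    f_Z_bar = (pair_probs.unsqueeze(-1) * pair_embeddings).sum(dim=0)
    # De-biased feature: the do-intervention is approximated by shifting
    # the query embedding with the dataset-level confounder expectation.
    return f_L + f_Z_bar

# Shape-only usage example with random placeholders.
f_L = torch.randn(256)
z = torch.randn(100, 256)
p = torch.softmax(torch.randn(100), dim=0)  # stands in for empirical frequencies
f_L_debiased = backdoor_language_feature(f_L, z, p)
```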
3. Front-Door Adjustment: Visual Confounder Mitigation
First-person video introduces unobserved confounders $U$ (e.g., head motion) impacting both $X$ and $Y$. Because $U$ is unobserved, back-door adjustment is not applicable; CERES instead targets $P(Y \mid \mathrm{do}(X))$ using front-door adjustment through the mediator $M$:

$$P(Y \mid \mathrm{do}(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid m, x')\, P(x')$$
The mediator $M$ is constructed by fusing:
- Semantic tokens $M_v$ from an RGB encoder, representing "what is present".
- Depth tokens $M_d$ from a monocular depth encoder, representing spatial structure.
Depth-aware cross-attention first computes a depth-aggregated token, which then guides attention over the semantic tokens to yield the fused mediator $\hat{H}_M$. Temporal context is approximated via a sliding memory bank over embeddings of the preceding frames, producing $\hat{H}_X$.
Fusion of $\hat{H}_M$ and $\hat{H}_X$ is performed with a gated residual block:

$$f'_X = \sigma \cdot \mathrm{MLP}([\hat{H}_M; \hat{H}_X]) + (1 - \sigma) \cdot \mathrm{flatten}(X_{\mathrm{rgb}})$$

where $\sigma$ is a learned gate and $X_{\mathrm{rgb}}$ is the backbone's semantic feature map. This constructs a de-confounded visual representation, satisfying the front-door criterion for $P(Y \mid \mathrm{do}(X))$.
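The depth-guided fusion can be sketched as two standard cross-attention steps, as below; the module structure and names (`DepthAwareCrossAttention`, a learnable depth query) are illustrative assumptions rather than the published module definitions.

```python
import torch
import torch.nn as nn

class DepthAwareCrossAttention(nn.Module):
    """Illustrative DAttn sketch: depth tokens are summarized into a single
    depth-aggregated token, which then attends over semantic tokens to
    produce a fused mediator token."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.depth_pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_attend = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_query = nn.Parameter(torch.randn(1, 1, dim))  # learnable query

    def forward(self, depth_tokens: torch.Tensor, sem_tokens: torch.Tensor):
        # depth_tokens: (B, N_d, D), sem_tokens: (B, N_v, D)
        b = depth_tokens.size(0)
        q = self.depth_query.expand(b, -1, -1)
        # Depth-aggregated token: a single query attends over all depth tokens.
        d_tok, _ = self.depth_pool(q, depth_tokens, depth_tokens)
        # The depth summary then guides attention over the semantic tokens.
        h_m, _ = self.sem_attend(d_tok, sem_tokens, sem_tokens)
        return h_m  # (B, 1, D) fused mediator token
```

In this reading, the per-layer mediator tokens from the last $n$ backbone layers are collected and later combined with the temporal context $\hat{H}_X$ in the gated residual block.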
4. Architecture and Implementation Specifics
CERES operates as a plug-in atop pre-trained RVOS backbones (e.g., ReferFormer), structured as follows:
- Linguistic Back-door Deconfounder (LBD): applies the back-door adjustment to the text embedding output.
- Visual stream augmented by a frozen Depth Anything V2 (ViT-B) encoder; semantic and geometric tokens are extracted from the last $n$ backbone layers.
- Dual-modal attention (DAttn) fuses depth and semantic streams.
- Memory Attention (MAttn) incorporates temporal context from a sliding window of preceding frames.
- Gated residual block forms the final de-biased visual feature.
Input images are resized to a fixed resolution; the text encoder is RoBERTa, and backbones evaluated include ResNet-101, Video Swin-B, and Swin-L.
Forward pass pseudocode for timestep $t$:
```
1. f_L  ← E_txt(ℓ)                               # language embedding
2. f'_L ← f_L + f̄_Z                              # back-door adjusted language feature
3. X_rgb ← E_rgb(x_t);  X_dep ← E_depth(x_t)     # semantic and depth features
4. for l in last n layers:
       M_{v,l}, M_{d,l} ← extract_tokens(X_rgb, X_dep)
       Ĥ_{M,l} ← DAttn(M_{d,l}, M_{v,l})         # depth-guided per-layer fusion
   Ĥ_M ← Ĥ_{M,layers}                            # fused mediator over the n layers
5. B_t ← update_bank(B_{t−1}, X_rgb)             # sliding memory bank
   Ĥ_{X,t} ← MAttn(x_t, B_t)                     # temporal context
6. f'_X ← σ·MLP([Ĥ_M; Ĥ_{X,t}]) + (1−σ)·flatten(X_rgb)   # gated residual fusion
7. ŷ_t ← mask_decoder(f'_X, f'_L)
8. return ŷ_t
```
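Steps 5–6 can be sketched in PyTorch as below; the window size, tensor shapes, and scalar gate are assumptions made for illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
from collections import deque

class MemoryGatedFusion(nn.Module):
    """Illustrative sketch of the sliding memory bank (MAttn) and the
    gated residual fusion producing the de-biased visual feature f'_X."""
    def __init__(self, dim: int, window: int = 4, heads: int = 8):
        super().__init__()
        self.bank = deque(maxlen=window)              # sliding memory bank B_t
        self.mattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Parameter(torch.tensor(0.0))   # gate logit (sigma after sigmoid)

    def forward(self, x_rgb_tokens: torch.Tensor, h_m: torch.Tensor) -> torch.Tensor:
        # x_rgb_tokens: (B, N, D) current-frame semantic tokens; h_m: (B, 1, D) mediator
        self.bank.append(x_rgb_tokens.detach())
        memory = torch.cat(list(self.bank), dim=1)    # tokens from preceding frames
        h_x, _ = self.mattn(x_rgb_tokens, memory, memory)        # temporal context H_X
        fused = self.mlp(torch.cat([h_m.expand_as(h_x), h_x], dim=-1))
        sigma = torch.sigmoid(self.gate)
        return sigma * fused + (1.0 - sigma) * x_rgb_tokens      # gated residual f'_X
```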
5. Training Protocols and Objectives
Supervision follows ActionVOS conventions, optimizing for:
- Cross-entropy loss on per-pixel segmentation masks.
- Dice loss for mask overlap.
- Focal loss to counteract class imbalance.
Deep supervision applies identical segmentation losses to outputs from intermediate mediator layers. Hyperparameters: batch size 4, 6 training epochs, AdamW optimizer with differentiated learning rates for the CERES modules and the backbone, and learning-rate decay by a factor of 0.1 at epochs 3 and 5.
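A minimal sketch of the combined objective on binary masks is given below, with equal loss weights and a focusing parameter chosen for illustration (the actual weighting follows the ActionVOS setup).

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor,
                      gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Cross-entropy + Dice + focal loss on per-pixel binary masks.
    logits, target: (B, H, W); target is a float mask in {0, 1}."""
    prob = torch.sigmoid(logits)

    # Per-pixel binary cross-entropy.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")

    # Dice loss on soft mask overlap.
    inter = (prob * target).sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)

    # Focal modulation down-weights easy pixels to counter class imbalance.
    p_t = prob * target + (1.0 - prob) * (1.0 - target)
    focal = (1.0 - p_t) ** gamma * bce

    return bce.mean() + dice.mean() + focal.mean()
```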
6. Empirical Evaluation
Experiments use VISOR (13,205 training clips with 76,873 annotated objects; 467 validation clips with 1,841 objects), with VOST and VSCOS used for zero-shot generalization. Key metrics: mIoU and cIoU, each reported separately for positive and negative objects, generalized IoU (gIoU; segmentation and activity), and standard binary classification measures.
| Method | mIoU (pos.) | mIoU (neg.) | gIoU | Accuracy |
|---|---|---|---|---|
| ActionVOS | 59.9 | 16.3 | 69.9 | 73.4 |
| CERES | 64.0 | 15.3 | 72.4 | 76.3 |
On zero-shot VSCOS, CERES achieves 55.3 mIoU and 62.5 cIoU (vs. ActionVOS: 52.5 and 57.7). On VOST, CERES attains 32.0 mIoU and 21.7 cIoU (vs. ActionVOS: 30.2 and 17.6). Ablations confirm additive benefits: LBD, DAttn, and MAttn each improve generalization, with their combination yielding maximum gains.
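For reference, the two IoU aggregations can be computed as in the sketch below; the per-object mean versus cumulative (pooled) definition is a common convention and should be read as an assumption about the exact benchmark protocol.

```python
import numpy as np

def miou_and_ciou(preds, gts):
    """preds, gts: lists of binary masks (H, W), one pair per object instance.
    mIoU averages per-object IoU; cIoU pools intersections and unions first."""
    inters, unions, ious = [], [], []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inters.append(inter)
        unions.append(union)
        ious.append(inter / union if union > 0 else 1.0)
    miou = float(np.mean(ious))
    ciou = float(np.sum(inters) / max(np.sum(unions), 1))
    return miou, ciou
```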
7. Significance and Implications
CERES demonstrates that interventions grounded in a causal framework, specifically back-door and front-door adjustment, can systematically mitigate both linguistic and visual confounding in egocentric segmentation. The dual-modal mediator leverages both semantic and geometric (depth) features, guided by causal principles, yielding representations robust to egocentric video artifacts such as heavy motion blur and occlusion. Extensive benchmarking on VISOR, VOST, and VSCOS supports substantial state-of-the-art improvement, and CERES is adaptable to a broad class of pre-trained RVOS backbones. A plausible implication is the extension of these causal approaches to broader egocentric understanding tasks, underscoring the value of principled deconfounding in complex vision-language integration (Liu et al., 30 Dec 2025).