Boosting Text-Driven Video Segmentation via Geometry-Aware Distillation

Published 23 Jun 2026 in cs.CV | (2606.24464v1)

Abstract: Text-driven Referring Video Object Segmentation (RVOS) aims to locate and segment target objects in videos given natural language. However, existing models are typically trained on 2D image or video datasets with naive segmentation losses, which overlooks the geometric consistency across frames and leads to weak spatial understanding. In this paper, we propose Geometry-enhanced Language-guided Video segmentation (GeoLaV), a two-stage framework that distills 3D geometric knowledge from images to enhance text-driven video segmentation. In the first stage, we perform monocular geometry pretraining with monocular novel-view synthesis, enabling the model to acquire geometry-consistent visual representations via spatial alignment on large-scale single-image datasets. In the second stage, we introduce geometry-aware distillation and fine-tune the model on video segmentation datasets, transferring 3D structural knowledge from a general 3D prior model. This process reinforces 3D awareness and improves both spatiotemporal coherence and language grounding in segmentation. Extensive experiments show that our method using only image segmentation data already provides notable zero-shot generalization in RVOS. When combined with geometry-aware distillation for fine-tuning on videos, our method achieves state-of-the-art performance across multiple RVOS benchmarks. The code is available at https://github.com/Tony1882880/GeoLaV.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces a two-stage framework that couples monocular geometry pretraining with geometry-aware distillation to enhance video segmentation.
It employs novel-view synthesis, depth estimation, and dual teacher guidance to enforce cross-frame geometric and semantic consistency.
Empirical results show improved spatial precision and tracking stability, achieving state-of-the-art scores on standard RVOS benchmarks.

Geometry-Aware Distillation for Text-Driven Referring Video Object Segmentation

Motivation and Problem Formulation

Text-driven Referring Video Object Segmentation (RVOS) requires segmenting target objects in videos as specified by natural language prompts. Existing approaches rely heavily on pretraining with large-scale 2D image datasets, failing to expose models to cross-frame geometric consistency or viewpoint variations crucial for spatial reasoning. This omission results in segmentation models with limited geometric awareness, weak spatial and temporal understanding, and poor generalization to complex video scenes. Furthermore, most models rely on naive segmentation losses, without explicit modeling of 3D geometry, hindering tracking stability under significant camera or object motion.

GeoLaV addresses these deficiencies by integrating explicit geometric priors and 3D structural knowledge. The framework consists of two main stages:

Stage I: Monocular Geometry Pretraining (MGP) via novel-view synthesis enforces cross-view geometric consistency on large-scale image datasets.
Stage II: Geometry-Aware Distillation (GAD) transfers 3D structural priors from general 3D models to the segmentation backbone for enhanced spatiotemporal coherence and language grounding.
Figure 1: GeoLaV's two-stage paradigm leverages monocular geometry pretraining followed by geometry-aware distillation.

Architecture and Methodology

Stage I: Monocular Geometry Pretraining (MGP)

This stage synthesizes geometry-consistent view sequences from single images to facilitate spatially robust representation learning. The pipeline involves:

Depth Estimation: Employing feed-forward monocular geometry models (e.g., $\pi^{3}$ , VGGT) to generate scale-invariant depth maps from RGB input.
3D Lifting and Camera Simulation: Projecting pixels into 3D space and simulating camera trajectories (linear, piecewise-linear, or curved) to create novel viewpoints. Each synthetic frame is generated via rigid 3D transformations, maintaining geometric coherence and depth-correspondence.
Figure 2: The pipeline for novel-view synthesis through 3D lifting, camera simulation, occlusion handling, and inpainting.
Occlusion Handling and Inpainting: Occlusion masks and inpainting networks restore visual completeness and mitigate artifacts from novel view generation.
Geometry-Consistent Supervisory Signals: Visual foundation model (VFM) encoders (e.g., DINOv3) extract features, which are aligned via a projection head with the memory-attention features from the SAM2 backbone. Feature alignment is enforced through auxiliary cosine similarity loss.
Figure 3: Examples of synthesized view sequences along physically valid 3D camera trajectories for geometry-consistent pretraining.

Stage II: Geometry-Aware Distillation (GAD)

Upon initialization with the geometry-consistent weights, the backbone is fine-tuned on real video datasets, leveraging dual supervision:

3D Teacher Features: General 3D models (e.g., $\pi^{3}$ ) encode spatial structure and depth relationships.
Visual Foundation Model Features: VFM teacher encodes semantic object boundaries.
Dual-Branch Distillation: Two projection heads align backbone memory features to both teacher spaces, regularizing representations to encode both geometric and semantic cues.
Segmentation Objective: Primary mask prediction remains supervised by text-driven segmentation loss.
Figure 4: GeoLaV's architecture overview, illustrating monocular geometry pretraining and teacher-guided geometry-aware distillation.

Empirical Evaluation

GeoLaV demonstrates substantial improvements across standard RVOS benchmarks, including Ref-Youtube-VOS, Ref-DAVIS17, and MeViS, relative to both non-large and large VLM-based baselines. The method achieves:

70.5 $\mathcal{J}{content}\mathcal{F}$ on Ref-Youtube-VOS,
72.5 $\mathcal{J}{content}\mathcal{F}$ on Ref-DAVIS17,
50.0 $\mathcal{J}{content}\mathcal{F}$ on MeViS,

with notable gains over prior approaches such as SAMWISE and ReferDINO. Zero-shot evaluation on unseen video domains after image-only pretraining shows over +15.1 improvement on Ref-Youtube-VOS compared to the SAMWISE baseline.

Figure 5: Qualitative comparison against SAMWISE, showing enhanced spatial precision and tracking stability.

Ablation studies affirm that performance improvement is driven by geometry-consistent pretraining and explicit 3D-aware distillation, not merely the exposure to more synthetic frames. Depth-based 3D warping brings larger gains than planar augmentation, and full two-stage training yields the highest scores.

Figure 6: Geometric consistency comparison (left: planar vs. GeoLaV) and feature representations (right: dual teacher distillation yields feature clusters with clear object separation).

Theoretical and Practical Implications

GeoLaV validates the hypothesis that explicit 3D geometry priors and monocular novel-view synthesis can compensate for the lack of diverse and annotation-rich video data. The deterministic multi-frame synthetic pipeline scales efficiently, offering a paradigm for geometry-aware pretraining in the absence of real videos. The geometry-aware distillation process leverages both geometric and semantic cues, enabling robust cross-frame correspondence and improved temporal consistency in segmentation.

Practically, GeoLaV's methodology is applicable to any scenario where spatial or temporal consistency is critical but high-quality video annotation resources are lacking. The approach is compatible with generic vision-language backbones and teacher models, facilitating transfer learning for other multimodal, spatiotemporal tasks.

Future Directions

Potential avenues include open-vocabulary segmentation, incorporation of larger teacher models with broader 3D semantic understanding, and scaling the synthetic training regime for even longer or more diverse camera trajectories. Extensions toward streaming video, continual learning, and multimodal 3D understanding are also tractable given the modularity of the framework.

Conclusion

GeoLaV establishes a principled two-stage pipeline for geometry-enhanced video segmentation by coupling monocular geometry pretraining with geometry-aware distillation. By injecting both 3D structure and semantic priors, the method achieves state-of-the-art performance while maintaining computational efficiency and robust generalization, providing a foundation for further advances in multimodal video understanding (2606.24464).

Markdown Report Issue