Triple Query Former (TQF)
- The paper introduces TQF, which factorizes queries into appearance, intra-frame interaction, and inter-frame motion components to reduce selection bias.
- It dynamically fuses linguistic and visual cues, employing cross-modal attention and motion-aware aggregation to enhance spatiotemporal object segmentation.
- Experimental validation on RVOS benchmarks demonstrates improvements of up to 21.8% in core metrics, confirming TQF's robust performance.
Triple Query Former (TQF) is a structured query architecture that factorizes a referring query for object selection in Referring Video Object Segmentation (RVOS) into three dynamically constructed components: appearance query, intra-frame interaction query, and inter-frame motion query. Developed to mitigate query selection bias—where static query embeddings can be misled by distractors with similar visual or motion profiles—TQF integrates linguistic cues and visual guidance during dynamic query construction and employs dedicated motion-aware aggregation modules for enhanced spatiotemporal coherence (Zhang et al., 17 Sep 2025).
1. Motivation and Structured Query Factorization
Conventional RVOS approaches typically encode the entire referring expression as a static query reused across all video frames. This paradigm induces query selection bias, particularly when distractor objects exhibit appearance or motion characteristics similar to the target. TQF redefines the query as a triplet, factorizing it into the following:
- Appearance Query: Captures static object attributes by hierarchical alignment of language-derived object cues (e.g., color, shape) with high-level visual features. The process involves initial coarse cross-modal attention, progressive Top-K selection over visual regions, and final refinement via cross-attention between textual and visual features until the query is visually grounded.
- Intra-frame Interaction Query: Models spatial relationships by combining learnable initial query vectors and relational language features, guiding the query to focus on object interactions (e.g., “person next to lamp”). A cross-attention module fuses relation-specific language with the full-sentence context.
- Inter-frame Motion Query: Encodes motion semantics using trajectory embeddings augmented with motion-specific linguistic cues and positional encodings (learned and sine-cosine from pixel coordinates). Attention pooling produces a query that reflects coherent object movement across time.
This decomposition allows each query component to specialize in capturing particular dimensions—static identity, spatial interactions, and temporal continuity—while remaining adaptable to video content.
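As a concrete illustration, the sketch below shows one way the three query branches could be realized in PyTorch. The module name `TripleQuery`, the feature dimension `d_model`, and the single attention layer per branch are illustrative assumptions rather than the authors' implementation; the paper's branches involve additional staged refinement described in the next section.

```python
import torch
import torch.nn as nn

class TripleQuery(nn.Module):
    """Illustrative factorized query with appearance, intra-frame interaction,
    and inter-frame motion branches (names, dims, and depths are assumptions)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Appearance: ground attribute words (color, shape) in high-level visual features.
        self.app_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Interaction: learnable slot fused with relation-specific language.
        self.inter_slot = nn.Parameter(torch.randn(1, 1, d_model))
        self.inter_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Motion: attention pooling over trajectory embeddings guided by motion words.
        self.motion_pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, attr_text, rel_text, motion_text, vis_feats, traj_embed):
        # attr_text / rel_text / motion_text: (B, L*, D) language cues split by type
        # vis_feats: (B, HW, D) high-level frame features; traj_embed: (B, T, D)
        app_q, _ = self.app_attn(attr_text.mean(1, keepdim=True), vis_feats, vis_feats)
        slot = self.inter_slot.expand(rel_text.size(0), -1, -1)
        inter_q, _ = self.inter_attn(slot, rel_text, rel_text)
        motion_q, _ = self.motion_pool(motion_text.mean(1, keepdim=True),
                                       traj_embed, traj_embed)
        return app_q, inter_q, motion_q  # each (B, 1, D)
```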
2. Dynamic Query Construction
Unlike static text-based queries, TQF constructs its three query branches dynamically during inference, fusing linguistic cues and visual representations at multiple processing stages:
- Appearance queries: Begin with coarse alignment at high-level features, then iteratively refine focus using Top-K visual regions, culminating in a visually grounded query.
- Intra-frame queries: Fuse relational language with learnable slots and overall sentence context, ensuring spatial interactions are directly encoded.
- Inter-frame queries: Integrate trajectory embeddings and motion language into a position-aware representation using attention pooling, yielding a query robust to appearance and movement fluctuations.
This dynamic fusion enables TQF to robustly select the most discriminative cues for object segmentation, reducing susceptibility to distractors.
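To make the appearance branch concrete, the following sketch shows one plausible form of the coarse-align, Top-K select, refine pipeline described above. The function name, the single refinement step, and the value `top_k=64` are assumptions; the paper describes a progressive, multi-stage selection.

```python
import torch
import torch.nn.functional as F

def build_appearance_query(text_cue, vis_feats, top_k=64):
    """Hypothetical appearance-query construction: coarse cross-modal alignment,
    Top-K visual region selection, then a refinement step (shapes/top_k assumed).

    text_cue:  (B, D)    pooled attribute-language embedding
    vis_feats: (B, N, D) flattened high-level visual features (N = H * W)
    """
    # 1. Coarse alignment: similarity of the language cue to every visual region.
    scores = torch.einsum('bd,bnd->bn', text_cue, vis_feats)             # (B, N)

    # 2. Top-K selection: keep only the most text-relevant regions.
    idx = scores.topk(top_k, dim=1).indices                               # (B, K)
    selected = torch.gather(
        vis_feats, 1, idx.unsqueeze(-1).expand(-1, -1, vis_feats.size(-1)))

    # 3. Refinement: cross-attend the text cue over the selected regions so the
    #    resulting query is visually grounded rather than purely linguistic.
    attn = F.softmax(torch.einsum('bd,bkd->bk', text_cue, selected), dim=1)
    return torch.einsum('bk,bkd->bd', attn, selected)                     # (B, D)
```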
3. Motion-aware Aggregation Modules
To refine and propagate the object token representations extracted per frame (e.g., via Mask2Former), TQF introduces two motion-aware aggregation modules (a code sketch follows the list below):
- Intra-frame Interaction Aggregation (IIA):
- For each pair of object tokens within a frame, position vectors combining centroid, size, and Intersection-over-Union (IoU) are passed through sine-cosine positional encoding and an MLP. These relational priors guide a self-attention mechanism where tokens aggregate features from spatially and semantically related neighbors.
  - Aggregated relation features are then fused with the original tokens through a further cross-attention step that involves the intra-frame interaction queries.
- Inter-frame Motion Aggregation (IMA):
- Applies localized self-attention on neighboring frames (windowed, e.g., three frames), followed by cross-modal attention with inter-frame queries for long-range motion alignment.
- A global temporal self-attention is performed across object tracks, ensuring coherent semantic propagation and robust object identity over time.
- The final video-level token representation is computed using a feed-forward network on the cross-attended outputs.
These modules enforce both spatially informed and temporally consistent token representations necessary for high-fidelity spatiotemporal object segmentation.
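A minimal sketch of the IIA relational prior is given below, assuming single-head attention, a 4-dimensional relational descriptor (centroid offsets, log size ratio, IoU), and a 64-dimensional sine-cosine encoding; these choices are illustrative, not the paper's exact configuration. IMA would follow an analogous pattern, with self-attention restricted to a small temporal window before cross-attention with the inter-frame motion query.

```python
import math
import torch
import torch.nn as nn

def sincos_embed(x, dim=64):
    """Sine-cosine encoding of scalar relational features (dim is an assumption)."""
    freqs = torch.exp(torch.arange(0, dim, 2, device=x.device)
                      * (-math.log(10000.0) / dim))
    ang = x.unsqueeze(-1) * freqs                        # (..., dim / 2)
    return torch.cat([ang.sin(), ang.cos()], dim=-1)     # (..., dim)

class IntraFrameAggregation(nn.Module):
    """Sketch of IIA: pairwise geometric priors bias a self-attention step
    over per-frame object tokens (single head for brevity)."""

    def __init__(self, d_model=256, pe_dim=64):
        super().__init__()
        # 4 relational scalars per token pair: dx, dy (centroid offset),
        # log size ratio, and IoU.
        self.prior_mlp = nn.Sequential(
            nn.Linear(4 * pe_dim, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.pe_dim = pe_dim

    def forward(self, tokens, boxes, iou):
        # tokens: (B, M, D) object tokens; boxes: (B, M, 4) as (cx, cy, w, h)
        # iou: (B, M, M) pairwise IoU between object boxes/masks
        cx, cy, w, h = boxes.unbind(-1)
        dx = cx.unsqueeze(2) - cx.unsqueeze(1)            # (B, M, M)
        dy = cy.unsqueeze(2) - cy.unsqueeze(1)
        size = (w * h).clamp_min(1e-6)
        dsz = (size.unsqueeze(2) / size.unsqueeze(1)).log()
        rel = torch.stack([dx, dy, dsz, iou], dim=-1)     # (B, M, M, 4)
        pe = sincos_embed(rel, self.pe_dim).flatten(-2)   # (B, M, M, 4 * pe_dim)
        bias = self.prior_mlp(pe).squeeze(-1)             # (B, M, M) attention bias

        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = (q @ k.transpose(1, 2)) / q.size(-1) ** 0.5 + bias
        return tokens + attn.softmax(-1) @ v              # relation-aware tokens
```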
4. Experimental Validation on RVOS Benchmarks
TQF's effectiveness is established via experiments on multiple RVOS benchmarks such as A2D-Sentences, JHMDB-Sentences, Ref-DAVIS17, Ref-YouTube-VOS, and MeViS:
- Performance: TQF consistently achieves state-of-the-art accuracy across all datasets and backbones (e.g., Video-Swin-T, Video-Swin-B).
- Metric Improvements: For instance, on Ref-YouTube-VOS, TQF improves region similarity and contour accuracy (𝒥&ℱ) by roughly 2% over prior art; on MeViS, gains reach +21.8% in core metrics relative to models that rely only on textual query initialization.
- Ablation studies confirm that structured query design and both aggregation modules are critical to these improvements.
This demonstrates the utility of query factorization and cross-modal dynamic query construction for mitigating alignment errors and increasing segmentation robustness.
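For reference, the region-similarity term 𝒥 of the 𝒥&ℱ score is the per-frame mask IoU averaged over a sequence; a minimal sketch is shown below (the contour term ℱ, a boundary F-measure, is omitted for brevity).

```python
import numpy as np

def region_similarity(pred_masks, gt_masks, eps=1e-7):
    """J term of J&F: mean per-frame mask IoU between predicted and
    ground-truth binary masks of shape (T, H, W)."""
    inter = np.logical_and(pred_masks, gt_masks).sum(axis=(1, 2))
    union = np.logical_or(pred_masks, gt_masks).sum(axis=(1, 2))
    return float(np.mean(inter / (union + eps)))
```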
5. Architectural Implications and Future Directions
TQF's design, comprising triplet query decomposition, dynamic cross-modal fusion, and hierarchical spatiotemporal aggregation, advances the field by addressing key limitations of static query-based RVOS.
The authors note residual challenges, such as segmentation reliability under long-term occlusion, which degrades when appearance and motion cues vanish. A plausible implication is that memory-based persistence and re-detection strategies may be necessary to mitigate mask drift after extended absences. The modular structure of TQF renders it amenable to such future extensions.
6. Significance in Cross-modal Object Selection Paradigms
The TQF framework generalizes the notion of query-based selection in RVOS by explicitly encoding and aligning appearance, spatial, and motion cues, ensuring robust identification and segmentation even among complex distractor scenarios. Its adaptable, multi-branch query design sets a precedent for similar factorized architectures in broader cross-modal reasoning tasks.
By directly integrating structured queries with spatiotemporal aggregation, TQF enables robust, context-aware video object segmentation and lays the groundwork for further innovation in dynamic, multimodal query processing (Zhang et al., 17 Sep 2025).