Cross Gloss Attention Fusion (CGAF)
- Cross Gloss Attention Fusion (CGAF) is a video feature fusion module that merges pose and RGB streams using a dual-stage cross-modal attention mechanism.
- It dynamically adjusts local temporal neighborhoods via learnable offsets and linear interpolation to capture semantically aligned gloss features.
- Integrated within the Semantically Enhanced Dual-Stream Encoder (SEDS), CGAF enhances sign language retrieval by preserving temporal coherence and computational efficiency.
Cross Gloss Attention Fusion (CGAF) is a video feature fusion module introduced within the Semantically Enhanced Dual-Stream Encoder (SEDS) framework for sign language retrieval. CGAF implements a two-stage, temporally localized cross-modal attention mechanism designed to aggregate semantically correlated Pose and RGB visual information (“glosses”) across adjacent video clips, followed by a feature fusion step that produces a unified sequence encoding suitable for late-stage cross-modal retrieval alignment (Jiang et al., 2024).
1. Functional Context within SEDS
In the SEDS architecture, CGAF is positioned directly after the intra-modal Transformer encoders that separately process Pose and RGB clip features. The input to CGAF consists of two sequences: Pose clip features f_p ∈ ℝ[T×D] and RGB clip features f_r ∈ ℝ[T×D], where T is the number of temporal clips per video and D is the feature dimensionality. CGAF’s output, a fused feature f_v ∈ ℝ[T×D], is subsequently employed for downstream cross-modal alignment between visual and textual (gloss- or sentence-level) representations.
2. Two-Stage Module Structure
CGAF is organized into two main stages:
Stage I: Dual-Stream Cross-Gloss Attention
Each stream (Pose and RGB) is projected via learnable matrices into queries, keys, and values: q_p, k_p, v_p = Linear_p(f_p) and q_r, k_r, v_r = Linear_r(f_r).
Two cross-attention groups are constructed:
- Pose queries attend to local windows in the RGB stream, producing the intermediate output h_p
- RGB queries attend to local windows in the Pose stream, producing h_r
For each temporal clip t within a sequence, a “gloss window” of N temporally adjacent neighbors is established, defined by base positions P_t and refined via learnable offsets O_t: P̂_t = (P_t + O_t) mod T.
The offset adjustment typically yields non-integral indices, necessitating linear interpolation for key/value selection at the interpolated positions. The attention-weighted value aggregation is then applied: α_t = softmax(q_t ⊗ k̂_t), h_t = Σ_i α_t[i] · v̂_t[i].
Here k̂_t, v̂_t denote the interpolated keys/values at the offset-adjusted neighbor locations. Layer normalization and residual connections are applied. This cross-gloss block is stacked for two successive layers on each stream.
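Because the offset-adjusted positions are fractional and wrapped modulo T, key/value lookup reduces to a gather with linear interpolation between the two surrounding integer clips. A minimal NumPy sketch; the helper name `interp_wrap` and the toy shapes are illustrative, not from the SEDS implementation:

```python
import numpy as np

def interp_wrap(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Linearly interpolate rows of x (T, D) at fractional positions
    pos (T, N), wrapping indices modulo T as in the cross-gloss window."""
    T = x.shape[0]
    pos = np.mod(pos, T)               # wrap offset-adjusted indices into [0, T)
    lo = np.floor(pos).astype(int)     # lower integer neighbor
    hi = (lo + 1) % T                  # upper integer neighbor, wrapped
    w = (pos - lo)[..., None]          # fractional weight, broadcast over D
    return (1.0 - w) * x[lo] + w * x[hi]   # (T, N, D)

keys = np.arange(8, dtype=np.float32)[:, None]   # (T=8, D=1): row t holds value t
pos = np.tile([[1.5, 6.25]], (8, 1))             # two fractional neighbors per query
k_hat = interp_wrap(keys, pos)
# k_hat[0, 0, 0] → 1.5 (halfway between rows 1 and 2)
# k_hat[0, 1, 0] → 6.25 (0.75·row 6 + 0.25·row 7)
```

The wraparound matters at sequence boundaries: a position such as 7.6 blends the last clip (row 7) with the first (row 0) rather than reading out of bounds.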
Stage II: Feature Fusion and Residual MLP
The updated streams f_p, f_r are concatenated along the feature dimension and compressed back to dimensionality D via a small MLP, m = MLP(concat(f_p, f_r)). A residual sum completes the fusion: f_v = m + f_p + f_r. This process yields the fused feature sequence, enforcing both information blending and identity preservation from each modality.
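Stage II can be sketched as follows. The weight shapes, the ReLU nonlinearity, and the assumption that the MLP’s hidden layer has width 2D are illustrative choices, not taken from the released code:

```python
import numpy as np

def fuse(f_p, f_r, W1, b1, W2, b2):
    """Stage II sketch: concat (T, 2D) -> hidden (T, 2D) -> (T, D),
    then a residual sum with both input streams."""
    x = np.concatenate([f_p, f_r], axis=-1)   # (T, 2D) channel concatenation
    h = np.maximum(x @ W1 + b1, 0.0)          # assumed ReLU hidden layer
    m = h @ W2 + b2                           # compress back to (T, D)
    return m + f_p + f_r                      # identity-preserving residual

rng = np.random.default_rng(0)
T, D = 4, 8
f_p, f_r = rng.normal(size=(T, D)), rng.normal(size=(T, D))
W1 = rng.normal(size=(2 * D, 2 * D)) * 0.1
W2 = rng.normal(size=(2 * D, D)) * 0.1
f_v = fuse(f_p, f_r, W1, np.zeros(2 * D), W2, np.zeros(D))
assert f_v.shape == (T, D)
```

Note the identity-preservation property: with the MLP zeroed out, the output degenerates exactly to f_p + f_r, so each modality always has a direct path to the fused representation.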
3. Mathematical Formulations and Algorithmic Details
CGAF’s attention mechanism incorporates dynamic, learnable offsets within local temporal neighborhoods, enabling flexible inter-clip dependency modeling. The indices are adjusted and wrapped modulo T, and are permitted to be continuous, with linear interpolation over keys/values addressing non-integer positions. Each attention group utilizes a single head; no multi-head splitting is performed.
Pseudocode for a forward pass through CGAF is:
```
Inputs:  f_p, f_r ∈ ℝ[T×D];  window positions P ∈ {0…T−1}^{T×N}

Project:
    q_p, k_p, v_p = Linear_p(f_p)
    q_r, k_r, v_r = Linear_r(f_r)

for layer = 1 to 2 do
    # Cross-gloss attention Pose→RGB
    O_p = W_o_p · q_p                       # (T×N) learnable offsets
    P̂_p = (P + O_p) mod T                   # (T×N) fractional positions
    [k̂_r, v̂_r] = Interpolate(k_r, v_r, P̂_p)
    α_p = softmax(q_p ⊗ k̂_r)                # (T×N) attention weights
    h_p = Σ_i α_p[:, i] · v̂_r[:, i]
    f_p ← LayerNorm(f_p + h_p)

    # Cross-gloss attention RGB→Pose (symmetric)
    O_r = W_o_r · q_r
    P̂_r = (P + O_r) mod T
    [k̂_p, v̂_p] = Interpolate(k_p, v_p, P̂_r)
    α_r = softmax(q_r ⊗ k̂_p)
    h_r = Σ_i α_r[:, i] · v̂_p[:, i]
    f_r ← LayerNorm(f_r + h_r)
end for

m   = MLP(concat(f_p, f_r))                 # → ℝ[T×D]
f_v = m + f_p + f_r
return f_v
```
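The pseudocode above can be turned into a runnable NumPy sketch. The per-layer re-projection, the parameter-free LayerNorm, and all weight names are simplifications for illustration, not the paper’s implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def interp(x, pos):
    """Linear interpolation of rows of x (T, D) at fractional pos (T, N)."""
    T = x.shape[0]
    pos = np.mod(pos, T)
    lo = np.floor(pos).astype(int)
    hi = (lo + 1) % T
    w = (pos - lo)[..., None]
    return (1 - w) * x[lo] + w * x[hi]      # (T, N, D)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def cross_gloss_layer(f_q, f_kv, W, P):
    """One cross-gloss step: queries from f_q attend over an
    offset-adjusted local window of f_kv."""
    q = f_q @ W["q"]                        # (T, D) queries
    k, v = f_kv @ W["k"], f_kv @ W["v"]     # (T, D) keys/values, other stream
    P_hat = np.mod(P + q @ W["o"], f_q.shape[0])      # learnable fractional offsets
    k_hat, v_hat = interp(k, P_hat), interp(v, P_hat) # (T, N, D)
    alpha = softmax(np.einsum("td,tnd->tn", q, k_hat))  # soft gate over window
    h = np.einsum("tn,tnd->td", alpha, v_hat)
    return layer_norm(f_q + h)              # residual + LayerNorm

def cgaf_forward(f_p, f_r, Wp, Wr, mlp, P, layers=2):
    for _ in range(layers):                 # two stacked cross-gloss layers
        f_p, f_r = (cross_gloss_layer(f_p, f_r, Wp, P),
                    cross_gloss_layer(f_r, f_p, Wr, P))
    x = np.concatenate([f_p, f_r], axis=-1)             # (T, 2D)
    m = np.maximum(x @ mlp["W1"], 0) @ mlp["W2"]        # fusion MLP -> (T, D)
    return m + f_p + f_r                                # residual fusion

rng = np.random.default_rng(0)
T, D, N = 6, 16, 3
P = np.arange(T)[:, None] + np.arange(-(N // 2), N // 2 + 1)[None, :]  # local windows
mk = lambda: {"q": rng.normal(size=(D, D)) * 0.1, "k": rng.normal(size=(D, D)) * 0.1,
              "v": rng.normal(size=(D, D)) * 0.1, "o": rng.normal(size=(D, N)) * 0.1}
mlp = {"W1": rng.normal(size=(2 * D, 2 * D)) * 0.1,
       "W2": rng.normal(size=(2 * D, D)) * 0.1}
f_v = cgaf_forward(rng.normal(size=(T, D)), rng.normal(size=(T, D)), mk(), mk(), mlp, P)
assert f_v.shape == (T, D)
```

The sketch preserves the structural properties described above: attention is restricted to an N-clip window per query, the window positions are shifted by learned continuous offsets, and fusion happens only after the stacked cross-modal refinements.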
4. Design Choices and Hyperparameter Configuration
CGAF’s configuration is determined by several key hyperparameters:
- T: number of temporal clips per video
- D: feature dimensionality per clip (e.g., 512)
- N: gloss window size, i.e., neighborhood size per attention query
- Stacking: two cross-gloss layers per modality branch
- Attention: single-head attention mechanism
- MLP: hidden layer doubles the input dimension; no additional dropout layers specified
- Projections: all query/key/value/offset projections are learnable matrices in ℝ[D×D] or ℝ[D×N]
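These hyperparameters can be collected into a single configuration object. The default values for `num_clips` and `window` below are placeholders, since the source does not state them; only the feature dimension, layer count, and head count come from the text:

```python
from dataclasses import dataclass

@dataclass
class CGAFConfig:
    num_clips: int = 32       # T: temporal clips per video (placeholder value)
    dim: int = 512            # D: feature dimensionality per clip
    window: int = 5           # N: gloss window size (placeholder value)
    layers: int = 2           # stacked cross-gloss layers per modality branch
    heads: int = 1            # single-head attention, no multi-head splitting
    mlp_hidden_mult: int = 2  # hidden layer doubles the MLP input dimension

cfg = CGAFConfig()
```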
No explicit multiplicative or gating mechanism is employed other than the residual summation in Stage II; attention weights function as soft gates over the gloss neighborhood.
5. Role of Gloss Locality and Temporal Coherence
CGAF’s central innovation is the imposition of “gloss locality,” wherein the fusion of modalities is mediated by local temporal neighborhoods that are further adapted via learned, dynamic offsets. This design preserves the temporal coherence essential to sign language, mitigating the risk of diffusing local action details when integrating broader visual context. By enabling selective cross-modal and intra-modal attention among semantically similar glosses, CGAF provides context-sensitive fusion without incurring excessive computational or memory overhead.
A plausible implication is that such locality-aware, offset-adjusted fusion may be generally advantageous for other multimodal temporal tasks where semantic “alignment” occurs locally but may shift by small durations due to gesture variability.
6. Empirical Impact and Application Scope
When incorporated into SEDS, CGAF enables end-to-end trainable, lightweight sign language representations that consistently outperform state-of-the-art approaches across various sign-language video retrieval benchmarks (Jiang et al., 2024). By eschewing global dense attention in favor of localized, dynamic cross-gloss attention, it maintains computational efficiency and reduces memory consumption.
Although designed for sign language retrieval—where precise alignment of multimodal, temporally localized glosses is critical—the CGAF architecture is potentially extendable to other multipath video-language tasks that exhibit strong local semantic structure across time.
7. Relation to Existing Approaches and Distinctiveness
CGAF differs from contemporaneous video fusion approaches by (a) explicitly enforcing dynamic local attention windows (“gloss neighborhood”) per query, (b) adapting neighbor selection via learned offsets, and (c) fusing only after sequential cross-modal refinements, rather than relying on offline or global feature mixing. No explicit gating beyond softmax attention weights is employed, and all feature blending is performed via residual addition following modular cross-attentional blocks. This design paradigm explicitly targets the challenges of preserving detailed action semantics in sign language while remaining computationally tractable (Jiang et al., 2024).