
Dual-SignLanguageNet: Dual-Stream Sign Recognition

Updated 17 September 2025
  • The paper presents DSLNet's dual-stream architecture that decouples hand morphology and trajectory using wrist-centric and facial-centric normalization techniques.
  • It employs dedicated neural networks—STDGCNN for morphology and FTDE for trajectory—with dynamic graph convolution and energy-weighted encoding for optimal feature extraction.
  • DSLNet achieves state-of-the-art performance on benchmarks like WLASL-100, demonstrating improved accuracy and efficiency with significantly fewer parameters than competing models.

Dual-SignLanguageNet (DSLNet) is a specialized neural architecture for sign language recognition and translation that explicitly decouples and models hand morphology and motion trajectory in complementary coordinate systems. This dual-reference, dual-stream framework leverages wrist-centric and facial-centric normalization for robust gesture analysis. The following sections detail its architecture, component networks, fusion mechanisms, benchmarking results, and technical innovations.

1. Dual-Reference Normalization and Stream Decoupling

DSLNet applies dual-reference normalization to the input skeletal sequence, producing two specialized representations:

  • Wrist-Centric (Morphological) Frame: Each hand joint position is normalized relative to the wrist, providing view-invariant, fine-grained shape features:

$X'_{\text{shape}}(t, i) = h_i(t) - h_w(t), \quad i \in \mathcal{J}$

This decouples the local hand morphology from the global signer pose.

  • Facial-Centric (Trajectory) Frame: The hand's global position is normalized with respect to the facial centroid and scaled by a facial scale factor, providing context-aware trajectory features:

$X'_{\text{traj}}(t) = \dfrac{h_w(t) - c_f(t)}{s_f(t) + \varepsilon}$

Here, $c_f(t)$ is the centroid of facial landmarks, $s_f(t)$ is the facial scale (e.g., bounding-box diagonal), and $\varepsilon$ prevents division by zero. This representation captures the motion of the hands with respect to the head, which is crucial for distinguishing semantically ambiguous, morphologically similar gestures.
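The two normalizations above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the function and variable names are hypothetical, and the facial scale is taken to be the bounding-box diagonal, as the text suggests.

```python
import numpy as np

def normalize_streams(hand, wrist, face_landmarks, eps=1e-6):
    """Split one skeletal frame into DSLNet's two reference frames.

    hand:           (J, 2) hand joint coordinates
    wrist:          (2,)   wrist coordinate
    face_landmarks: (F, 2) facial landmark coordinates
    """
    # Wrist-centric (morphology) frame: joints relative to the wrist
    x_shape = hand - wrist
    # Facial-centric (trajectory) frame: wrist position relative to the
    # facial centroid, scaled by the facial bounding-box diagonal
    c_f = face_landmarks.mean(axis=0)
    bbox = face_landmarks.max(axis=0) - face_landmarks.min(axis=0)
    s_f = np.linalg.norm(bbox)
    x_traj = (wrist - c_f) / (s_f + eps)
    return x_shape, x_traj
```

Note that the morphology stream is invariant to where the signer stands in the frame: translating every keypoint by the same offset leaves `x_shape` unchanged, which is exactly the decoupling the paper describes.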

2. Component Networks for Morphology and Trajectory

Each normalized stream is processed by a dedicated neural network optimized for its target aspect:

  • Morphology Stream (Topology-Aware Spatiotemporal Network, TSSN): Utilizes a Spatio-Temporal Dynamic Graph Convolutional Neural Network (STDGCNN) applied to the wrist-centric input. The architecture dynamically builds k-NN graphs over joints per frame, extracting spatial features with dynamic convolution, followed by temporal convolution over sequential frames. Final descriptors are aggregated using bidirectional LSTM and multi-head attention to yield the morphology feature $F_s$.
  • Trajectory Stream (Finsler Trajectory Dynamics Encoder, FTDE): Focuses on temporal, direction-sensitive motion, modeling physics-informed trajectory dynamics from the facial-centric input. For each timestep, an energy-weighted encoding is computed:

$F_t = \varphi_\theta(p_t, \dot{v}_t) \cdot \|\dot{x}_t\|_2^\alpha$

$E_t = \dfrac{F_t}{\sum_t F_t + \varepsilon}$

where $\varphi_\theta$ is a learnable fusion network, and $\alpha$ is a learned exponent on velocity magnitude. Temporal energy weights $E_t$ act as an attention mask, emphasizing salient motion. FTDE further employs causal convolutions and bidirectional LSTM to integrate context, producing trajectory features $F_t \in \mathbb{R}^{T \times d_t}$.
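The energy-weighted encoding above can be sketched as follows. The learnable network $\varphi_\theta$ is replaced here by a fixed stand-in so the example stays self-contained; velocity and acceleration are estimated with finite differences. All names are illustrative, not from the paper.

```python
import numpy as np

def energy_weights(traj, alpha=1.0, eps=1e-6):
    """Energy-weighted temporal attention over a facial-centric trajectory.

    traj: (T, d) hand positions over time in the facial-centric frame.
    """
    vel = np.gradient(traj, axis=0)          # approximates \dot{x}_t
    acc = np.gradient(vel, axis=0)           # approximates \dot{v}_t
    # Fixed stand-in for the learnable fusion network phi_theta(p_t, v_t)
    phi = 1.0 + np.linalg.norm(acc, axis=1)
    F = phi * np.linalg.norm(vel, axis=1) ** alpha   # per-frame energy
    E = F / (F.sum() + eps)                  # normalized attention mask
    return E
```

The resulting mask is (approximately) a probability distribution over frames, and frames where the hand moves fast receive higher weight than stationary frames, which is the intended "emphasize salient motion" behavior.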

3. Geometry-Driven Optimal Transport Fusion

Integration of the two streams utilizes a geometry-driven optimal transport (Geo-OT) mechanism:

  • Both $F_s$ and $F_t$ are enhanced by cross-attention layers, encouraging feature exchange.
  • A transport plan γ\gamma is computed to align the global morphological feature with the temporal trajectory profile:

$F_t^{(\text{aligned})} = \sum_j \gamma_{1j} \cdot F_t^{(\text{attn})}(j)$

  • The fusion seeks to minimize a joint cost comprising feature similarity and temporal priors.
  • Geometric consistency is enforced via the loss term:

$\mathcal{L}_{\text{geo}} = 1 - \cos\!\left(f_m(F_s^{(\text{attn})}),\, f_a(F_t^{(\text{aligned})})\right)$

where $f_m$, $f_a$ are modality-specific learnable projections and $\cos(\cdot,\cdot)$ is the cosine similarity.

  • The overall training loss is:

$\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha \cdot \mathcal{L}_{\text{geo}}$

with $\mathcal{L}_{\text{CE}}$ being the cross-entropy classification loss.
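The geometric consistency term and the combined objective can be written down directly. A minimal sketch, assuming the modality-specific projections $f_m$ and $f_a$ have already been applied and their outputs flattened to 1-D vectors; the weight `alpha` is a hypothetical value.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def total_loss(ce_loss, f_s_proj, f_t_proj, alpha=0.1):
    """L = L_CE + alpha * L_geo, with L_geo = 1 - cos(f_m(F_s), f_a(F_t)).

    f_s_proj, f_t_proj: projected morphology / aligned-trajectory features.
    """
    l_geo = 1.0 - cosine(f_s_proj, f_t_proj)  # geometric consistency term
    return ce_loss + alpha * l_geo
```

When the two projected streams already agree in direction, $\mathcal{L}_{\text{geo}}$ vanishes and the objective reduces to plain cross-entropy; misaligned streams incur an extra penalty proportional to `alpha`.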

4. Benchmark Performance

DSLNet achieves state-of-the-art isolated sign language recognition accuracy across multiple datasets:

| Dataset   | Accuracy (%) | Comparison                        | Parameters (M) |
|-----------|--------------|-----------------------------------|----------------|
| WLASL-100 | 93.70        | +1.45% over Uni-Sign              | 46.3           |
| WLASL-300 | 89.97        | +1.05% over Uni-Sign              | 46.3           |
| LSA64     | 99.79        | Outstanding fine-grained accuracy | 46.3           |

DSLNet demonstrates significant reductions in parameter count relative to competing architectures (e.g., Uni-Sign: 592.1M), yielding superior computational efficiency alongside high accuracy.

5. Technical Innovations

Prominent technical contributions of DSLNet are:

  • Dual-Reference Separation: Morphology and trajectory are modeled in specialized coordinate systems, drastically improving resolution of geometric ambiguity in morphologically-similar signs.
  • Dedicated Stream Specialization: STDGCNN and FTDE are tailored for structure and dynamics, respectively, complemented by physics-informed weighting and dynamic graph operations.
  • Semantic Alignment via Geo-OT: Optimal transport fusion aligns the semantic content of the two streams using temporal and feature-based priors, effectively marrying local gesture shape to global context.
  • Efficient Parameterization: DSLNet’s design achieves high discriminative power per trainable parameter, facilitating real-world deployment.

6. Significance and Outlook

By resolving morphological ambiguities and capturing context through dual-reference normalization and geometry-driven fusion, DSLNet advances the state of the art in skeleton-based isolated sign language recognition. Its architecture addresses crucial failure points in earlier methods, notably where gesture shape alone cannot determine semantic meaning. The efficiency and extensibility of DSLNet suggest its utility in broader application domains, including continuous sign language recognition and multimodal gesture understanding, and serve as a template for future dual-stream neural frameworks in vision-language modeling (Liu et al., 10 Sep 2025).
