Dual-SignLanguageNet: Dual-Stream Sign Recognition

Updated 17 September 2025

The paper presents DSLNet's dual-stream architecture that decouples hand morphology and trajectory using wrist-centric and facial-centric normalization techniques.
It employs a dedicated network per stream (STDGCNN for morphology, FTDE for trajectory), using dynamic graph convolution and energy-weighted encoding, respectively, for feature extraction.
DSLNet achieves state-of-the-art performance on benchmarks like WLASL-100, demonstrating improved accuracy and efficiency with significantly fewer parameters than competing models.

Dual-SignLanguageNet (DSLNet) is a specialized neural architecture for sign language recognition and translation that explicitly decouples and models hand morphology and motion trajectory in complementary coordinate systems. This dual-reference, dual-stream framework leverages wrist-centric and facial-centric normalization for robust gesture analysis. The following sections detail its architecture, component networks, fusion mechanisms, benchmarking results, and technical innovations.

1. Dual-Reference Normalization and Stream Decoupling

DSLNet applies dual-reference normalization to the input skeletal sequence, producing two specialized representations:

Wrist-Centric (Morphological) Frame: Each hand joint position is normalized relative to the wrist, providing view-invariant, fine-grained shape features:

$X'_{\text{shape}}(t, i) = h_i(t) - h_w(t), \quad i \in \mathcal{J}$

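Here, $h_i(t)$ is the position of hand joint $i$ at frame $t$, $h_w(t)$ is the wrist position, and $\mathcal{J}$ is the set of hand joints. This decouples the local hand morphology from the global signer pose.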

Facial-Centric (Trajectory) Frame: The hand's global position is normalized with respect to the facial centroid and scaled by a facial scale factor, providing context-aware trajectory features:

$X'_{\text{traj}}(t) = \frac{h_w(t) - c_f(t)}{s_f(t) + \varepsilon}$

Here, $c_f(t)$ is the centroid of the facial landmarks, $s_f(t)$ is the facial scale (e.g., the bounding-box diagonal), and $\varepsilon$ prevents division by zero. This representation captures the motion of the hands relative to the head, which is crucial for distinguishing gestures that are morphologically similar but semantically different.
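The two normalizations can be sketched for a single frame as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of joint index 0 as the wrist, and the 2D toy coordinates are all assumptions made for the example, and the facial scale is taken to be the bounding-box diagonal mentioned above.

```python
import math

def wrist_centric(hand, wrist_index=0):
    """Subtract the wrist position from every hand joint in one frame.

    Assumes (hypothetically) that the wrist is stored at wrist_index.
    """
    wrist = hand[wrist_index]
    return [tuple(c - w for c, w in zip(joint, wrist)) for joint in hand]

def facial_centric(wrist, face, eps=1e-6):
    """Express the wrist relative to the face centroid, scaled by face size."""
    n = len(face)
    dims = len(face[0])
    centroid = [sum(p[d] for p in face) / n for d in range(dims)]
    # Facial scale: diagonal of the axis-aligned bounding box of the landmarks.
    extent = [max(p[d] for p in face) - min(p[d] for p in face) for d in range(dims)]
    scale = math.sqrt(sum(e * e for e in extent))
    return tuple((w - c) / (scale + eps) for w, c in zip(wrist, centroid))

# One 2D frame: a wrist plus two finger joints, and four facial landmarks.
hand = [(4.0, 6.0), (5.0, 8.0), (3.0, 7.0)]
face = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]

shape = wrist_centric(hand)          # hand shape, independent of hand location
traj = facial_centric(hand[0], face) # hand location in face-relative units
```

Translating the whole hand leaves `shape` unchanged, while `traj` changes with the hand's position relative to the face, which is the separation the two reference frames are meant to provide.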

2. Component Networks for Morphology and Trajectory

Each normalized stream is processed by a dedicated neural network optimized for its target aspect:

Morphology Stream (Topology-Aware Spatiotemporal Network, TSSN): Utilizes a Spatio-Temporal Dynamic Graph Convolutional Neural Network (STDGCNN) applied to the wrist-centric input. The architecture dynamically builds k-NN graphs over joints per frame, extracting spatial features with dynamic convolution, followed by temporal convolution over sequential frames. Final descriptors are aggregated using a bidirectional LSTM and multi-head attention to yield the morphology feature $F_s$.
Trajectory Stream (Finsler Trajectory Dynamics Encoder, FTDE): Focuses on temporal, direction-sensitive motion, modeling physics-informed trajectory dynamics from the facial-centric input. For each timestep, an energy-weighted encoding is computed:

$F_t = \varphi_\theta(p_t, \dot{v}_t) \cdot \|\dot{x}_t\|_2^\alpha$

$E_t = \frac{F_t}{\sum_{\tau} F_\tau + \varepsilon}$

where $\varphi_\theta$ is a learnable fusion network and $\alpha$ is a learned exponent on the velocity magnitude. The normalized temporal energy weights $E_t$ act as an attention mask, emphasizing salient motion. FTDE further employs causal convolutions and a bidirectional LSTM to integrate context, producing the trajectory feature sequence $F_t \in \mathbb{R}^{T \times d_t}$ (the stream output, which reuses the symbol of the per-timestep energy $F_t$ defined above).
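The energy weighting can be sketched as below. This is a hedged illustration only: the learned network $\varphi_\theta$ is replaced by a list of placeholder non-negative scores, the velocity is approximated by finite differences of the positions, and the names `energy_weights`, `scores`, and the default `alpha=1.0` are assumptions of the example rather than details from the paper.

```python
import math

def energy_weights(positions, scores, alpha=1.0, eps=1e-6):
    """Return normalized temporal energy weights E_t for one trajectory.

    positions: list of coordinate tuples, one per frame (at least two frames).
    scores: placeholder for the learned fusion-network output per frame.
    """
    # Finite-difference velocity; the first frame reuses the first difference.
    velocities = [
        tuple(b - a for a, b in zip(positions[max(t - 1, 0)], positions[max(t, 1)]))
        for t in range(len(positions))
    ]
    speeds = [math.sqrt(sum(c * c for c in v)) for v in velocities]
    # Raw energy: score times speed raised to the exponent alpha.
    raw = [s * speed ** alpha for s, speed in zip(scores, speeds)]
    total = sum(raw) + eps
    return [r / total for r in raw]

# The hand rests, jumps between frames 1 and 2, then rests again.
positions = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
scores = [1.0, 1.0, 1.0, 1.0]
weights = energy_weights(positions, scores)
```

In this toy trajectory nearly all of the weight falls on the frame where the hand moves, which is the sense in which the weights act as an attention mask over time.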

3. Geometry-Driven Optimal Transport Fusion

Integration of the two streams utilizes a geometry-driven optimal transport (Geo-OT) mechanism:

Both $F_s$ and $F_t$ are enhanced by cross-attention layers, encouraging feature exchange.
A transport plan $\gamma$ is computed to align the global morphological feature with the temporal trajectory profile:

$F_t^{(\text{aligned})} = \sum_j \gamma_{1j} \cdot F_t^{(\text{attn})}(j)$

The fusion seeks to minimize a joint cost comprising feature similarity and temporal priors.
Geometric consistency is enforced via the loss term:

$\mathcal{L}_{\text{geo}} = 1 - \cos(f_m(F_s^{(\text{attn})}), f_a(F_t^{(\text{aligned})}))$

where $f_m$, $f_a$ are modality-specific learnable projections and $\cos(\cdot,\cdot)$ is the cosine similarity.
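The alignment step and the geometric loss can be sketched as follows. The summary does not specify the solver or the exact cost, so this example makes several assumptions: the cost is one minus cosine similarity with no temporal prior, the projections $f_m$ and $f_a$ are taken as identity, and the plan is entropy-regularized with only the source marginal constrained. For a single source feature, that regularized problem has the closed-form solution of a softmax over negative costs, which is what `transport_row` computes. All function names and the `reg` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def transport_row(f_s, f_t_seq, reg=0.1):
    """Plan row gamma_{1j}: softmax over negative costs (single-source case)."""
    costs = [1.0 - cosine(f_s, f_t) for f_t in f_t_seq]
    lowest = min(costs)  # shift costs for numerical stability
    unnorm = [math.exp(-(c - lowest) / reg) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def align(f_s, f_t_seq, reg=0.1):
    """Weighted sum of per-frame trajectory features under the plan row."""
    gamma = transport_row(f_s, f_t_seq, reg)
    dims = len(f_t_seq[0])
    aligned = [sum(g * f_t[d] for g, f_t in zip(gamma, f_t_seq)) for d in range(dims)]
    return gamma, aligned

def geo_loss(f_s, aligned):
    """One minus cosine similarity (projections assumed to be identity)."""
    return 1.0 - cosine(f_s, aligned)

f_s = [1.0, 0.0]                                   # global morphology feature
f_t_seq = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]    # trajectory features per frame
gamma, aligned = align(f_s, f_t_seq)
loss = geo_loss(f_s, aligned)
```

Here the plan concentrates on the frame whose trajectory feature points in the same direction as the morphology feature, so the aligned feature is close to it and the loss is near zero.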

The overall training loss is:

$\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha \cdot \mathcal{L}_{\text{geo}}$

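with $\mathcal{L}_{\text{CE}}$ being the cross-entropy classification loss and $\alpha$ a weighting coefficient for the geometric term (distinct from the velocity exponent $\alpha$ used in the FTDE encoding).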
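As a worked illustration of the combined objective, the sketch below computes the cross-entropy from raw logits and adds a weighted geometric term. The weight `alpha=0.1` and the example numbers are assumptions for the example; the summary does not report the value used in training.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def total_loss(logits, label, geo_term, alpha=0.1):
    """Classification loss plus the weighted geometric-consistency term."""
    return cross_entropy(logits, label) + alpha * geo_term

# Three-class toy example with a geometric loss of 0.5.
ce = cross_entropy([2.0, 0.0, 0.0], 0)
loss = total_loss([2.0, 0.0, 0.0], label=0, geo_term=0.5)
```

With a small weight, the geometric term acts as a regularizer on top of the classification loss rather than dominating it.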

4. Benchmark Performance

DSLNet achieves state-of-the-art isolated sign language recognition accuracy across multiple datasets:

Dataset	Accuracy (%)	Comparison / Notes	Parameters (M)
WLASL-100	93.70	+1.45% above Uni-Sign	46.3
WLASL-300	89.97	+1.05% above Uni-Sign	46.3
LSA64	99.79	Outstanding fine-grained accuracy	46.3

DSLNet uses 46.3M parameters, roughly 13 times fewer than Uni-Sign (592.1M), while reporting higher accuracy on the WLASL benchmarks above.

5. Technical Innovations

Prominent technical contributions of DSLNet are:

Dual-Reference Separation: Morphology and trajectory are modeled in specialized coordinate systems, which helps resolve geometric ambiguity between morphologically similar signs.
Dedicated Stream Specialization: STDGCNN and FTDE are tailored for structure and dynamics, respectively, complemented by physics-informed weighting and dynamic graph operations.
Semantic Alignment via Geo-OT: Optimal transport fusion aligns the semantic content of the two streams using temporal and feature-based priors, effectively marrying local gesture shape to global context.
Efficient Parameterization: DSLNet’s design achieves high discriminative power per trainable parameter, facilitating real-world deployment.
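The dynamic graph operations named above rebuild a k-nearest-neighbor graph over the joints for every frame. A minimal sketch of that idea follows; it assumes neighbors are chosen by Euclidean distance between joint coordinates (the actual network may measure distance in a learned feature space), and the function names and toy frames are hypothetical.

```python
import math

def knn_graph(joints, k):
    """Return, for each joint, the indices of its k nearest other joints."""
    neighbors = []
    for i, p in enumerate(joints):
        # Distances to every other joint; ties are broken by joint index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(joints) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def dynamic_graphs(sequence, k=2):
    """Rebuild the graph independently for every frame of a sequence."""
    return [knn_graph(frame, k) for frame in sequence]

# The last joint moves from far away to close to joint 0 between frames.
frame_open = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
frame_closed = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.5, 0.0)]
graphs = dynamic_graphs([frame_open, frame_closed], k=1)
```

Because the neighborhoods are recomputed per frame, the graph connectivity changes as the hand shape changes, unlike a fixed skeleton adjacency.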

6. Significance and Outlook

By resolving morphological ambiguities and capturing context through dual-reference normalization and geometry-driven fusion, DSLNet advances the state of the art in skeleton-based isolated sign language recognition. Its architecture addresses crucial failure points in earlier methods, notably where gesture shape alone cannot determine semantic meaning. The efficiency and extensibility of DSLNet suggest its utility in broader application domains, including continuous sign language recognition and multimodal gesture understanding, and serve as a template for future dual-stream neural frameworks in vision-language modeling (Liu et al., 10 Sep 2025).

References (1)

Skeleton-based sign language recognition using a dual-stream spatio-temporal dynamic graph convolutional network (2025)
