Dual-SignLanguageNet: Dual-Stream Sign Recognition
- The paper presents DSLNet's dual-stream architecture that decouples hand morphology and trajectory using wrist-centric and facial-centric normalization techniques.
- It employs dedicated neural networks—STDGCNN for morphology and FTDE for trajectory—with dynamic graph convolution and energy-weighted encoding for optimal feature extraction.
- DSLNet achieves state-of-the-art performance on benchmarks like WLASL-100, demonstrating improved accuracy and efficiency with significantly fewer parameters than competing models.
Dual-SignLanguageNet (DSLNet) is a specialized neural architecture for sign language recognition and translation that explicitly decouples and models hand morphology and motion trajectory in complementary coordinate systems. This dual-reference, dual-stream framework leverages wrist-centric and facial-centric normalization for robust gesture analysis. The following sections detail its architecture, component networks, fusion mechanisms, benchmarking results, and technical innovations.
1. Dual-Reference Normalization and Stream Decoupling
DSLNet applies dual-reference normalization to the input skeletal sequence, producing two specialized representations:
- Wrist-Centric (Morphological) Frame: Each hand joint position is expressed relative to the wrist joint, providing view-invariant, fine-grained shape features.
This decouples the local hand morphology from the global signer pose.
- Facial-Centric (Trajectory) Frame: The hand's global position is normalized with respect to the facial centroid and scaled by a facial scale factor, providing context-aware trajectory features:
$$\hat{p}_t = \frac{p_t - c_t}{s_t + \epsilon}$$
Here, $p_t$ is the hand position at frame $t$, $c_t$ is the centroid of facial landmarks, $s_t$ is the facial scale (e.g., bounding-box diagonal), and $\epsilon$ prevents division by zero. This representation captures the motion of the hands with respect to the head, which is crucial for distinguishing gestures that are morphologically similar and therefore semantically ambiguous from shape alone.
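The two normalizations described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 21-joint hand layout with the wrist at index 0, the 68 facial landmarks, and the choice of the bounding-box diagonal as the facial scale are all hypothetical choices made for the example.

```python
import numpy as np

def wrist_centric(hand, wrist_index=0):
    """Express every hand joint relative to the wrist joint.

    hand: array of shape (T, J, D) with T frames, J joints, D coordinates.
    wrist_index: position of the wrist in the joint axis (assumed to be 0).
    """
    wrist = hand[:, wrist_index:wrist_index + 1, :]  # (T, 1, D)
    return hand - wrist

def facial_centric(hand_pos, face, eps=1e-6):
    """Normalize a hand position by the facial centroid and facial scale.

    hand_pos: array of shape (T, D), one global hand position per frame.
    face: array of shape (T, F, D) with F facial landmarks per frame.
    The scale is the bounding-box diagonal of the landmarks (an assumption).
    """
    centroid = face.mean(axis=1)                              # (T, D)
    extent = face.max(axis=1) - face.min(axis=1)              # (T, D)
    scale = np.linalg.norm(extent, axis=-1, keepdims=True)    # (T, 1)
    return (hand_pos - centroid) / (scale + eps)

# Synthetic data purely for demonstration.
rng = np.random.default_rng(0)
T, J, F, D = 4, 21, 68, 2
hand = rng.normal(size=(T, J, D))
face = rng.normal(size=(T, F, D))

morph_input = wrist_centric(hand)               # (T, J, D), wrist at the origin
traj_input = facial_centric(hand[:, 0, :], face)  # (T, D), head-relative path
```

In this sketch the wrist-centric output does not change when the whole hand is translated, and the facial-centric output is (up to `eps`) unchanged when hand and face are shifted or rescaled together, which is the property the two reference frames are meant to provide.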
2. Component Networks for Morphology and Trajectory
Each normalized stream is processed by a dedicated neural network optimized for its target aspect:
- Morphology Stream (Topology-Aware Spatiotemporal Network, TSSN): Utilizes a Spatio-Temporal Dynamic Graph Convolutional Neural Network (STDGCNN) applied to the wrist-centric input. The architecture dynamically builds k-NN graphs over the joints of each frame, extracts spatial features with dynamic graph convolution, and then applies temporal convolution across frames. The resulting descriptors are aggregated by a bidirectional LSTM and multi-head attention to yield the morphology feature vector.
- Trajectory Stream (Finsler Trajectory Dynamics Encoder, FTDE): Focuses on temporal, direction-sensitive motion, modeling physics-informed trajectory dynamics from the facial-centric input. For each timestep, an energy-weighted encoding is computed using a learnable fusion network and a learned exponent applied to the velocity magnitude. The resulting temporal energy weights act as an attention mask, emphasizing salient motion. FTDE further employs causal convolutions and a bidirectional LSTM to integrate context, producing the trajectory features.
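Two building blocks mentioned in this section, per-frame k-NN graph construction and velocity-based energy weighting, can be illustrated with a small NumPy sketch. It rests on several assumptions: the helper names are hypothetical, neighbours are found in raw coordinate space rather than in learned feature space, the exponent `alpha` is a fixed number here although the text describes it as learned, and the weights are simply normalized to sum to one. It is not the STDGCNN or FTDE implementation.

```python
import numpy as np

def knn_graph(joints, k=4):
    """Return the indices of the k nearest neighbours of each joint.

    joints: array of shape (J, D) holding one frame of joint coordinates.
    Output: integer array of shape (J, k); self-loops are excluded.
    """
    diff = joints[:, None, :] - joints[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # (J, J) pairwise distances
    np.fill_diagonal(dist, np.inf)            # a joint is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

def energy_weights(traj, alpha=1.0, eps=1e-8):
    """Weight each timestep by its speed raised to the power alpha.

    traj: array of shape (T, D), e.g. a facial-centric hand trajectory.
    alpha: exponent on the velocity magnitude (fixed here, an assumption).
    Output: array of shape (T,) of non-negative weights summing to about 1.
    """
    vel = np.diff(traj, axis=0, prepend=traj[:1])   # (T, D); first row is zero
    energy = np.linalg.norm(vel, axis=-1) ** alpha
    return energy / (energy.sum() + eps)

# Four collinear joints: two tight pairs that should pick each other.
frame = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
neighbours = knn_graph(frame, k=1)

# A path that only moves between the second and third timestep.
path = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
weights = energy_weights(path)
```

Rebuilding the graph for every frame is what makes the neighbourhood structure "dynamic", while the energy weights concentrate on the frames where the hand actually moves and can therefore serve as a soft temporal mask.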
3. Geometry-Driven Optimal Transport Fusion
Integration of the two streams utilizes a geometry-driven optimal transport (Geo-OT) mechanism:
- Both the morphology and trajectory features are enhanced by cross-attention layers, encouraging feature exchange.
- A transport plan is computed to align the global morphological feature with the temporal trajectory profile.
- The fusion seeks to minimize a joint cost comprising feature similarity and temporal priors.
- Geometric consistency is enforced via a loss term defined on the cosine similarity between modality-specific learnable projections of the two stream features.
- The overall training loss combines this geometric consistency term with the cross-entropy classification loss.
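To make the fusion ingredients more concrete, the sketch below builds a joint cost from a feature term and a temporal prior, computes a transport plan, and evaluates a cosine-based consistency loss. Everything specific here is an assumption: the text does not say how the transport plan is solved, so entropy-regularized Sinkhorn iterations are swapped in as a common choice; the cost weighting `lam`, the uniform marginals, the `1 - cos` form of the loss, and all function names are hypothetical; and the learnable projections are omitted.

```python
import numpy as np

def joint_cost(x, y, lam=0.5, eps=1e-8):
    """Cosine-distance feature cost plus a temporal prior on relative position.

    x: (n, d) tokens from one stream; y: (m, d) tokens from the other.
    lam: weight of the temporal prior (hypothetical hyperparameter).
    """
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    yn = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    feature_cost = 1.0 - xn @ yn.T                      # (n, m)
    ti = np.linspace(0.0, 1.0, len(x))[:, None]
    tj = np.linspace(0.0, 1.0, len(y))[None, :]
    return feature_cost + lam * np.abs(ti - tj)

def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)   # match column marginals
        u = a / (kernel @ v)     # match row marginals
    return u[:, None] * kernel * v[None, :]

def cosine_consistency_loss(z_m, z_t, eps=1e-8):
    """One minus the cosine similarity between two projected feature vectors."""
    cos = z_m @ z_t / (np.linalg.norm(z_m) * np.linalg.norm(z_t) + eps)
    return 1.0 - cos

# Synthetic stream features purely for demonstration.
rng = np.random.default_rng(0)
morph_tokens = rng.normal(size=(5, 8))
traj_steps = rng.normal(size=(7, 8))
plan = sinkhorn_plan(joint_cost(morph_tokens, traj_steps))
geo_loss = cosine_consistency_loss(morph_tokens.mean(axis=0), traj_steps.mean(axis=0))
```

The plan is a non-negative matrix whose rows and columns sum to the chosen marginals, so each entry can be read as how much of one stream's token is matched to a given timestep of the other. In a trainable model the same steps would be written in an autodiff framework so that gradients flow through the cost and the projections.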
4. Benchmark Performance
DSLNet achieves state-of-the-art isolated sign language recognition accuracy across multiple datasets:
Dataset | Accuracy (%) | Comments | Parameters (M) |
---|---|---|---|
WLASL-100 | 93.70 | +1.45% above Uni-Sign | 46.3 |
WLASL-300 | 89.97 | +1.05% above Uni-Sign | 46.3 |
LSA64 | 99.79 | Outstanding fine-grained accuracy | 46.3 |
DSLNet uses 46.3M parameters, roughly 12.8 times fewer than competing architectures such as Uni-Sign (592.1M), so it reaches these accuracies with a much smaller model.
5. Technical Innovations
Prominent technical contributions of DSLNet are:
- Dual-Reference Separation: Morphology and trajectory are modeled in specialized coordinate systems, which helps resolve the geometric ambiguity between morphologically similar signs.
- Dedicated Stream Specialization: STDGCNN and FTDE are tailored for structure and dynamics, respectively, complemented by physics-informed weighting and dynamic graph operations.
- Semantic Alignment via Geo-OT: Optimal transport fusion aligns the semantic content of the two streams using temporal and feature-based priors, effectively marrying local gesture shape to global context.
- Efficient Parameterization: DSLNet’s design achieves high discriminative power per trainable parameter, facilitating real-world deployment.
6. Significance and Outlook
By resolving morphological ambiguities and capturing context through dual-reference normalization and geometry-driven fusion, DSLNet advances the state of the art in skeleton-based isolated sign language recognition. Its architecture addresses crucial failure points in earlier methods, notably cases where gesture shape alone cannot determine semantic meaning. The efficiency and extensibility of DSLNet suggest its utility in broader application domains, including continuous sign language recognition and multimodal gesture understanding, and it may serve as a template for future dual-stream neural frameworks in vision-language modeling (Liu et al., 10 Sep 2025).