The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 (2401.06788v2)
Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto team (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, participating in the fixed and open tracks of the Single-Speaker VSR Task and the open track of the Multi-Speaker VSR Task. For data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data. In addition, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first in all three tracks we participated in.
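The abstract names four video augmentations applied during training. A minimal NumPy sketch of two of them, speed perturbation (frame-index resampling) and random horizontal flipping, is given below; `augment_video` is a hypothetical helper for illustration, not the authors' actual pipeline, which may use library transforms and also includes rotation and color transformation.

```python
import numpy as np

def augment_video(frames: np.ndarray, rng: np.random.Generator,
                  speed: float = 1.0) -> np.ndarray:
    """Illustrative sketch: speed perturbation + random horizontal flip.

    frames: video clip of shape (T, H, W, C).
    speed:  playback-rate factor; speed < 1.0 slows the clip down
            (more frames), speed > 1.0 speeds it up (fewer frames).
    """
    t = frames.shape[0]
    # Speed perturbation: resample frame indices at the given rate,
    # clipping to valid range (nearest-frame resampling).
    idx = np.clip(np.round(np.arange(0.0, t, speed)).astype(int), 0, t - 1)
    out = frames[idx]
    # Random horizontal flip with probability 0.5 (mirror the width axis).
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]
    return out

# Example: a 10-frame clip slowed to 0.5x yields 20 frames.
clip = np.zeros((10, 96, 96, 1), dtype=np.uint8)
slow = augment_video(clip, np.random.default_rng(0), speed=0.5)
```

Resampling frame indices changes the apparent articulation rate of the lips, which is the visual analogue of audio speed perturbation in ASR training.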
- “Cn-cvs: A mandarin audio-visual dataset for large vocabulary continuous visual to speech synthesis,” in Proc. ICASSP. IEEE, 2023, pp. 1–5.
- “E-branchformer: Branchformer with enhanced merging for speech recognition,” in Proc. SLT. IEEE, 2023, pp. 84–91.
- “Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding,” in Proc. ICML. PMLR, 2022, pp. 17627–17643.
- “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech. ISCA, 2020, pp. 5036–5040.
- Jonathan G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in Proc. ASRU. IEEE, 1997, pp. 347–354.