AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition (2401.10411v1)
Abstract: Wearable devices like smart glasses are approaching the compute capability needed to generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses with microphone arrays, which fuses multi-channel ASR with serialized output training to disambiguate the wearer from conversation partners and to suppress noise and cross-talk speech from non-target directions. When ASR is part of a broader system-development process, microphone geometries may change as development progresses. This paper aims to make multi-channel ASR insensitive to limited variations of microphone-array geometry. We show that a model trained on multiple similar geometries is largely geometry-agnostic and generalizes well to new geometries, as long as they are not too different from those seen in training. Moreover, training the model this way improves accuracy on seen geometries by 15 to 28% relative. Lastly, we refine the beamforming with a novel Non-Linearly Constrained Minimum Variance criterion.
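As background for the final claim, a minimal sketch of the classical, linearly constrained Minimum Variance (MVDR/Capon) beamformer that the paper's Non-Linearly Constrained variant builds on; the paper's actual criterion is not specified in this abstract, so only the textbook form is shown, with illustrative function names and a toy noise covariance:

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Classical MVDR (Capon) beamformer weights:
    w = Phi^{-1} d / (d^H Phi^{-1} d),
    which minimizes output noise power subject to the distortionless
    constraint w^H d = 1 toward the steering vector d."""
    phi_inv_d = np.linalg.solve(noise_cov, steering)
    return phi_inv_d / (steering.conj() @ phi_inv_d)

# Toy example: 4-mic array, identity (spatially white) noise covariance,
# broadside steering vector of all ones.
phi = np.eye(4, dtype=complex)
d = np.ones(4, dtype=complex)
w = mvdr_weights(phi, d)
```

In practice the noise covariance is estimated per frequency bin from noise-dominant frames, and diagonal loading is commonly added for robustness against estimation error and microphone mismatch.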
- “Directional speech recognition for speaker disambiguation and cross-talk suppression,” in Proc. INTERSPEECH, 2023, pp. 3522–3526.
- “One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 271–275.
- “VarArray: Array-geometry-agnostic continuous speech separation,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6027–6031.
- “VarArray meets t-SOT: Advancing the state of the art of streaming distant conversational speech recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- “MIMO-speech: End-to-end multi-channel multi-speaker speech recognition,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 237–244.
- “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1–5.
- “Multi-channel multi-speaker ASR using 3D spatial feature,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6067–6071.
- “Multi-channel overlapped speech recognition with location guided speech extraction network,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 558–565.
- “Factored spatial and spectral multichannel raw waveform CLDNNs,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5075–5079.
- “Spatial attention for far-field speech recognition with deep beamforming neural networks,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7499–7503.
- Jack Capon, “High-resolution frequency-wavenumber spectrum analysis,” Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.
- “Superdirectional microphone arrays,” Kluwer International Series in Engineering and Computer Science, pp. 181–238, 2000.
- “Superdirective beamforming based on the Krylov matrix,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2531–2543, 2016.
- “Superdirective beamforming robust against microphone mismatch,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 617–631, 2007.
- “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456.
- “Language modeling with gated convolutional networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 933–941.
- “Alignment restricted streaming recurrent neural network transducer,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 52–59.
- “An investigation of monotonic transducers for large-scale automatic speech recognition,” arXiv preprint arXiv:2204.08858, 2022.
- “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6059–6063.
- “Developing RNN-T models surpassing high-performance hybrid models with customization capability,” arXiv preprint arXiv:2007.15188, 2020.
- “Streaming multi-talker ASR with token-level serialized output training,” arXiv preprint arXiv:2202.00842, 2022.
- “Extended graph temporal classification for multi-speaker end-to-end ASR,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7322–7326.
- “Prediction of energy decay in room impulse responses simulated with an image-source model,” The Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 269–277, 2008.
- “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 351–355.
- “The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” in Proc. INTERSPEECH, 2020.
- “Project Aria: A new tool for egocentric multi-modal AI research,” arXiv preprint arXiv:2308.13561, 2023.
- “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6783–6787.
- Ju Lin
- Niko Moritz
- Yiteng Huang
- Ruiming Xie
- Ming Sun
- Christian Fuegen
- Frank Seide