This paper, "Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems" (Bijwadia et al., 2022 ), presents a method to integrate acoustic endpointing (EP) capabilities directly into an end-to-end (E2E) Automatic Speech Recognition (ASR) model. Traditionally, ASR systems rely on a separate EP model for two primary tasks: Voice Activity Detection (VAD) for frame filtering (discarding non-speech segments before ASR processing) and End-of-Query (EOQ) detection (identifying when a user has finished speaking in short-query interactions). Integrating these functions into a single E2E model offers potential benefits in terms of improved quality through joint training and reduced infrastructure complexity by deploying only one model.
The core challenge the paper addresses is the differing computational requirements of VAD and EOQ. VAD for frame filtering must be extremely cheap, since it runs on every audio frame in real time. EOQ detection, while latency-sensitive for user experience, can leverage computation from the larger ASR model once speech is detected. Prior E2E approaches often rely on decoder-based EOQ signals (e.g., predicting an end-of-sentence token), which cannot provide frame-level VAD for filtering and may add latency because ASR computation is batched.
The proposed model integrates an acoustic EP component into the audio encoder of a streaming E2E ASR model (specifically, a cascaded Conformer-Transducer). The EP model shares the initial layers of the ASR encoder. To handle the computational constraint discrepancy, the paper introduces a novel "switch" connection.
Model Architecture and Training (Implementation Details):
- Multitask Training: The ASR and EP tasks are trained jointly using a weighted sum of their respective losses, L_total = λ_ASR · L_ASR + λ_EP · L_EP. The ASR loss is typically the RNN-T loss, and the EP loss is standard cross-entropy on frame-level speech classification labels. The EP task classifies each frame into one of four classes: speech, initial silence, intermediate silence, or final silence.
- Shared Encoder Layers: The EP model's input layer is placed downstream of the first few layers (specifically, the first two Conformer layers in their experiments) of the ASR audio encoder. This allows the EP to utilize the richer, low-level latent representations learned by the ASR encoder. The EP model itself consists of a projection layer, a few LSTM layers, and a final fully-connected layer with a softmax output over the four speech classes.
- The "Switch" Connection (Training): To train the EP model to accept either raw audio frames (for efficient VAD) or ASR latent features (for high-quality EOQ), a probabilistic "switch" is used during training. For each training example, the input to the EP branch is randomly chosen between the raw audio features and the output of the shared ASR encoder layers with equal probability. This forces the EP sub-model to learn to make predictions regardless of which input source it receives.
Inference Logic (Implementation Details):
During inference, the probabilistic switch is replaced with a state machine driven by the EP's own predictions (see Figure 2 in the paper); a minimal sketch of this logic follows the list below.
- EP Only State: Initially, the system is in an "EP Only" state. Raw audio frames are fed only to the EP model, and ASR computation is skipped. This is the low-cost VAD phase used for frame filtering; the EP predicts the probability of speech for each frame.
- Transition to ASR + EP: When the EP predicts speech with high confidence (i.e., the predicted speech probability exceeds a threshold), the system transitions to the "ASR + EP" state. Audio frames are now fed to the ASR model, and the output of the shared ASR encoder layers is fed to the EP model.
- Continuous Recognition (VAD Off-Detection): In continuous recognition scenarios (like dictation), if the EP's predicted speech probability drops below a threshold, the system transitions back to the "EP Only" state to resume computationally cheaper frame filtering.
- Short Query Recognition (EOQ Detection): In short-query scenarios (like voice commands), the system stays in the "ASR + EP" state until EOQ is detected. EOQ can be declared either by the acoustic EP signal (a high predicted probability of final silence) or by a decoder-based signal from the ASR (e.g., predicting an end-of-sentence token with high confidence). A short mandatory waiting period after a potential EOQ detection is applied to reduce premature cutoffs.
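The inference-time switching logic above can be captured in a small controller. The following is a hedged sketch, not the paper's implementation: the threshold names and values, the frame-based waiting period, and the boolean decoder-based EOQ signal are all illustrative assumptions.

```python
from enum import Enum, auto


class State(Enum):
    EP_ONLY = auto()   # cheap frame filtering: only the EP branch runs on raw audio
    ASR_EP = auto()    # full recognition: ASR runs, EP consumes shared-layer features


class EndpointController:
    """Sketch of the inference-time switch (threshold names/values are assumptions)."""

    def __init__(self, speech_on=0.9, speech_off=0.1, eoq_thresh=0.9,
                 wait_frames=10, continuous=False):
        self.state = State.EP_ONLY
        self.speech_on = speech_on      # P(speech) needed to start feeding the ASR
        self.speech_off = speech_off    # P(speech) below which ASR is paused (dictation)
        self.eoq_thresh = eoq_thresh    # P(final silence) needed to declare end-of-query
        self.wait_frames = wait_frames  # mandatory waiting period after a candidate EOQ
        self.continuous = continuous    # True for dictation, False for short queries
        self._wait = 0

    def step(self, p_speech, p_final_silence, decoder_eoq=False):
        """Consume one frame's EP posteriors; return (state, endpoint_declared)."""
        if self.state is State.EP_ONLY:
            if p_speech > self.speech_on:
                self.state = State.ASR_EP          # start ASR computation
            return self.state, False

        # State.ASR_EP
        if self.continuous:
            if p_speech < self.speech_off:
                self.state = State.EP_ONLY         # resume cheap frame filtering
            return self.state, False

        # Short query: declare EOQ on acoustic or decoder signal, after a short wait.
        if p_final_silence > self.eoq_thresh or decoder_eoq:
            self._wait += 1
            if self._wait >= self.wait_frames:
                return self.state, True            # endpoint: close the microphone
        else:
            self._wait = 0
        return self.state, False
```

In the continuous-recognition configuration the controller only toggles between the two states; in the short-query configuration it additionally raises the endpoint flag that ends the interaction.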
Practical Application and Results:
The unified model is evaluated on real-world voice search (short query) and voice dictation (continuous query) test sets.
- Short Query (EOQ): Compared to a baseline system with separate ASR and EP models, the proposed multitask model with shared layers and the switch connection (E3) significantly reduces endpoint latency. It achieves a 30.8% reduction in median latency (from 390ms to 270ms) and a 23.0% reduction in 90th percentile latency (from 740ms to 570ms) while maintaining the same Word Error Rate (WER). This reduction in latency directly improves user-perceived latency (UPL), a critical factor for interactive speech systems. The analysis shows that the multitask models consistently offer better latency-WER trade-offs across different operating points compared to the baseline.
- Continuous Query (Frame Filtering): The proposed system (E3) is as effective as the baseline at frame filtering, discarding the same percentage of non-speech frames (resulting in 73% of audio remaining after filtering). Surprisingly, the unified model also improves WER in continuous recognition (from 10.4% to 9.3%). The authors speculate this is because the shared encoder layers, benefiting from the EP task's frame-level speech/silence targets, become more robust to noisy non-speech segments, reducing hallucination errors even when frame filtering is disabled.
Implementation Considerations:
- Computational Cost: The key benefit for VAD efficiency comes from the "EP Only" state during inference, where only the small EP model (or a small initial part of the ASR encoder + EP) is active, allowing fast frame filtering. The "ASR + EP" state for EOQ detection leverages the ongoing ASR computation anyway, so the additional cost of feeding features to the EP branch is minimal during speech segments.
- Hyperparameter Tuning: The multitask loss weights, the VAD and acoustic EOQ thresholds, the decoder-based EOQ threshold, and the mandatory waiting period all need careful tuning on evaluation data to balance WER and latency requirements (an illustrative configuration sketch follows this list).
- Model Size: The core ASR model is a substantial ~150M parameter model. Deploying this system requires hardware capable of running this model in a streaming fashion with low latency. The EP sub-model itself is much smaller, closer in size to traditional standalone EPs.
- Data Requirements: Training requires a large dataset with both speech transcripts (for ASR) and frame-level speech/silence labels (for EP), typically generated via forced alignment.
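As a summary of the tunable knobs listed above, they can be grouped into a single configuration object. The field names and default values below are purely illustrative assumptions, not values from the paper; in practice each would be swept on evaluation data to pick an operating point on the latency-WER trade-off curve.

```python
from dataclasses import dataclass


@dataclass
class UnifiedEPConfig:
    # Multitask loss weights (relative weighting of the ASR and EP objectives).
    lambda_asr: float = 1.0
    lambda_ep: float = 1.0
    # VAD on/off thresholds on the EP's per-frame speech probability.
    vad_speech_on: float = 0.9
    vad_speech_off: float = 0.1
    # Acoustic EOQ threshold on P(final silence) and decoder-based EOQ threshold
    # on the confidence of the end-of-sentence token.
    acoustic_eoq_thresh: float = 0.9
    decoder_eoq_thresh: float = 0.9
    # Mandatory waiting period (in frames) after a candidate endpoint before cutting off.
    eoq_wait_frames: int = 10
```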
In summary, the paper demonstrates a practical and effective method for unifying ASR and acoustic EP into a single E2E streaming model. The novel "switch" connection allows the EP to operate efficiently on raw audio for VAD and leverage ASR features for improved EOQ quality, leading to significant latency reduction in short-query scenarios and improved WER in continuous recognition, while simplifying system architecture.