Montreal Forced Aligner (MFA)

Updated 2 April 2026
  • Montreal Forced Aligner (MFA) is a forced alignment toolkit that maps phonetic and orthographic transcriptions to speech audio with sub-25 ms temporal precision.
  • It employs a Kaldi-based GMM-HMM framework with Viterbi decoding and contextual triphone modeling to achieve high accuracy in phonetics research.
  • MFA's extensibility and integration with Praat TextGrid outputs support large-scale corpus phonetics, despite higher computational demands compared to newer neural approaches.

The Montreal Forced Aligner (MFA) is a forced alignment toolkit designed to align orthographic or phonemic transcriptions to corresponding speech audio with high temporal precision. Developed as a Kaldi-based system, MFA leverages statistical acoustic modeling and finite-state transduction to deliver word and phone boundary annotations necessary for large-scale corpus phonetics and downstream linguistic analysis. MFA remains the reference implementation for segment-level annotation in both research and applied settings due to its combination of accuracy, extensibility, and reproducibility on diverse language corpora (Chodroff, 2018, Rousso et al., 2024).

1. Core Architecture and Algorithms

MFA is based on the classical Gaussian Mixture Model–Hidden Markov Model (GMM–HMM) approach, as implemented in Kaldi. The fundamental components are:

  • Acoustic Model: MFA models each phonetic unit as a sequence of left-to-right HMM states with GMM output distributions. Context-dependent triphone modeling is performed following an initial monophone stage. Each frame $x_t$ in the audio feature sequence is evaluated as $p(x_t \mid s_t) = \sum_{m=1}^{M} w_{s_t,m}\,\mathcal{N}(x_t; \mu_{s_t,m}, \Sigma_{s_t,m})$.
  • Pronunciation Lexicon: A mapping from orthographic representations to phone sequences in a standard symbol set (typically ARPABET for English), enabling transcript-to-phone sequence conversion.
  • Decoding Algorithm: The forced-alignment task is solved via Viterbi decoding in a linear sequence FST, optimizing the state sequence $s_{1:T}$ over $T$ frames to maximize the combined acoustic and transition log-likelihoods (Rousso et al., 2024, Chodroff, 2018).
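The two components above (GMM emission scoring and Viterbi decoding over a left-to-right chain) can be illustrated with a minimal numpy sketch. This is a toy illustration, not MFA's Kaldi implementation: states are represented as (weights, means, variances) tuples for diagonal-covariance GMMs, and transitions may only stay in place or advance by one state, as in a linear alignment FST.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log p(x|s) for one diagonal-covariance GMM state:
    log sum_m w_m N(x; mu_m, Sigma_m), computed via log-sum-exp."""
    diff2 = ((x - means) ** 2 / variances).sum(axis=1)           # (M,)
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)  # (M,)
    comp = np.log(weights) + log_norm - 0.5 * diff2
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())

def viterbi_forced_align(frames, states):
    """Viterbi over a linear left-to-right state chain: each frame is
    assigned one state; a path may only stay or advance by one state."""
    T, S = len(frames), len(states)
    emis = np.array([[gmm_loglik(x, *st) for st in states] for x in frames])
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = emis[0, 0]             # alignment must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            adv = score[t - 1, s - 1] if s > 0 else -np.inf
            best, back[t, s] = (adv, s - 1) if adv > stay else (stay, s)
            score[t, s] = best + emis[t, s]
    path = [S - 1]                       # and end in the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy run: two 1-D, single-mixture states with means 0 and 5.
st0 = (np.array([1.0]), np.array([[0.0]]), np.array([[1.0]]))
st1 = (np.array([1.0]), np.array([[5.0]]), np.array([[1.0]]))
frames = np.array([[0.0], [0.1], [-0.2], [5.1], [4.9]])
print(viterbi_forced_align(frames, [st0, st1]))  # -> [0, 0, 0, 1, 1]
```

State-to-phone boundaries then fall out of the path: a phone boundary is placed wherever the state index advances.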

MFA's feature pipeline uses 13 MFCCs + Δ + ΔΔ (total 39-dimensional) computed every 10 ms with a 25 ms Hamming window. Cepstral mean and variance normalization are performed per speaker or speaker-adaptation cluster, while LDA+MLLT transforms and Speaker Adaptive Training (SAT) with fMLLR further enhance acoustic invariance.
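The feature pipeline can be sketched in a few lines of numpy. The delta formula below is the standard HTK/Kaldi regression over a ±2 frame window, and the normalization here is per-utterance; MFA applies CMVN per speaker (this sketch omits the LDA+MLLT and fMLLR transforms).

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based delta features over +/- `width` frames
    (the standard HTK/Kaldi delta formula)."""
    T, _ = feats.shape
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(n * (padded[width + n: width + n + T] -
                   padded[width - n: width - n + T])
              for n in range(1, width + 1))
    denom = 2 * sum(n * n for n in range(1, width + 1))
    return num / denom

def cmvn(feats):
    """Cepstral mean and variance normalization (zero mean, unit
    variance per dimension over the given frames)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# 13 MFCCs per frame -> 39-dim feature: [c, delta(c), delta(delta(c))]
mfcc = np.random.randn(100, 13)            # stand-in for real MFCCs
d1 = deltas(mfcc)
d2 = deltas(d1)
feats39 = cmvn(np.hstack([mfcc, d1, d2]))  # shape (100, 39)
```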

2. Training Regime and Hyperparameters

MFA's vanilla English acoustic model is trained on approximately 982 hours of 16 kHz LibriSpeech audio, paired with a large ARPABET pronunciation dictionary (~200k word types). The default training recipe involves:

  • Monophone model initialization (25 EM iterations, 1–200 Gaussians per state via mixup)
  • Triphone modeling with decision tree–based state tying (35 iterations, yielding ~2,000 clustered states)
  • LDA+MLLT for feature decorrelation (15–20 iterations)
  • SAT (fMLLR) for speaker normalization (10–12 iterations) (Rousso et al., 2024, Tosolini et al., 9 Apr 2025)

Hyperparameter optimization for alignment accuracy in low-resource contexts is critical. Increasing monophone training iterations by 4× and using fine-grained triphone groupings (e.g., 22 natural classes) yielded mean absolute boundary error reductions from 23.92 ms to as low as 21.08 ms. LDA/SAT scaling showed marginal incremental benefit. Audio augmentation techniques (bandwidth filtering, re-encoding, speed perturbation) provided negligible to negative benefit (Tosolini et al., 9 Apr 2025).
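Mean absolute boundary error, the metric behind the 23.92 ms → 21.08 ms figures above, is straightforward to compute from paired boundary lists. A sketch, assuming predicted and reference boundaries pair up one-to-one (as in a forced-alignment evaluation against hand-corrected TextGrids):

```python
import numpy as np

def mean_boundary_error_ms(pred, ref):
    """Mean absolute difference between predicted and reference
    boundary times (input in seconds, output in milliseconds)."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    return float(np.abs(pred - ref).mean() * 1000.0)

# Three phone boundaries: predicted vs hand-corrected reference.
err = mean_boundary_error_ms([0.100, 0.250, 0.430],
                             [0.110, 0.240, 0.430])
# err == (10 + 10 + 0) / 3 ~= 6.67 ms
```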

3. Data Preparation and Workflow

Input requirements for MFA are:

  • Audio: Single-channel 16 kHz WAV format.
  • Transcripts: Praat TextGrids (one interval tier, word- or utterance-level) or plain text with explicit utterance boundaries. Interval boundaries should not coincide with the exact file start or end; a 20–50 ms empty margin is recommended.
  • Lexicon: Two-column text file matching the acoustic model's phone set, accommodating multiple pronunciations and a special <UNK> token for out-of-vocabulary items.
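A sketch of writing such a two-column lexicon file. The word list here is a made-up miniature example; the `spn` (spoken-noise) phone for out-of-vocabulary items follows MFA's documented convention, but check the phone set of your acoustic model.

```python
# Hypothetical miniature ARPABET lexicon; one pronunciation per line,
# word and phones separated by a tab, repeated words = variant prons.
entries = {
    "the":   ["DH AH0", "DH IY0"],   # multiple pronunciations
    "cat":   ["K AE1 T"],
    "<unk>": ["spn"],                # OOV -> spoken-noise phone
}

def write_lexicon(entries, path):
    with open(path, "w", encoding="utf-8") as f:
        for word, prons in entries.items():
            for pron in prons:
                f.write(f"{word}\t{pron}\n")
```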

A typical MFA workflow comprises:

  • Preparing corpus and lexicon resources
  • Running the main alignment command (mfa_align) with specified jobs and CPU settings
  • Optionally, training a custom acoustic model (mfa_train_corpus) for language or domain adaptation
  • Generating output TextGrids with word and phone boundary tiers (Chodroff, 2018)
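The alignment step can be driven from Python via subprocess. Note an assumption: recent MFA releases expose a unified `mfa align` subcommand (the `mfa_align` name cited above is from older versions), and all paths below are placeholders for your own corpus and model.

```python
import subprocess

def align_corpus(corpus_dir, dictionary, acoustic_model, out_dir, jobs=4):
    """Build (and optionally run) an MFA alignment command.
    Assumes the modern `mfa align` CLI; adjust for older releases."""
    cmd = ["mfa", "align", corpus_dir, dictionary, acoustic_model,
           out_dir, "--num_jobs", str(jobs)]
    return cmd  # pass to subprocess.run(cmd, check=True) to execute

cmd = align_corpus("corpus/", "english.dict", "english_us_arpa", "aligned/")
```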

4. Alignment Performance and Comparative Evaluation

MFA's frame-level temporal modeling sustains high accuracy across diverse evaluation corpora:

Word Boundary Metrics

System     ≤10 ms   ≤25 ms   ≤50 ms   ≤100 ms
MFA        41.6%    72.8%    89.4%    97.4%
MMS        18.6%    43.5%    75.7%    94.7%
WhisperX   22.4%    52.7%    82.4%    94.2%

Phone Boundary Metrics (TIMIT/Buckeye)

Corpus    ≤10 ms   ≤25 ms   ≤50 ms   ≤100 ms
TIMIT     38.6%    72.3%    81.1%    84.6%
Buckeye   35.3%    60.6%    68.9%    72.7%

MFA consistently outperforms end-to-end ASR-based aligners such as WhisperX and MMS across all thresholds, especially for tight (≤10 ms) error bounds. Mean boundary errors are 21.9 ms (TIMIT) and 27.8 ms (Buckeye), with F₁@20 ms ≈ 65.7% and 65.4%, respectively. Observed failure cases include boundary drift in long utterances without nonspeech segmentation and reduced performance with spontaneous, non-read speech (Rousso et al., 2024).
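Tolerance-accuracy tables of this kind are simply the fraction of absolute boundary errors falling within each threshold; a short sketch with invented error values:

```python
import numpy as np

def tolerance_accuracy(errors_ms, thresholds=(10, 25, 50, 100)):
    """Fraction of absolute boundary errors (ms) within each tolerance."""
    e = np.abs(np.asarray(errors_ms, dtype=float))
    return {t: float((e <= t).mean()) for t in thresholds}

acc = tolerance_accuracy([4, 12, 30, 8, 60, 110])
# acc[10] == 2/6, acc[25] == 3/6, acc[50] == 4/6, acc[100] == 5/6
```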

5. Strengths, Limitations, and Practical Considerations

Strengths:

  • Temporal Precision: MFA achieves robust boundary accuracy for both word and phone tiers at sub-25 ms tolerance, outperforming CTC- and encoder–decoder-based ASR models in standard evaluations (Rousso et al., 2024).
  • Configurability and Extensibility: Supports custom model training, lexicon adaptation, and hyperparameter grid search for low-resource scenarios (Tosolini et al., 9 Apr 2025).
  • Integration in Phonetic Workflows: Output in Praat TextGrid format, compatibility with corpus phonetics toolchains, and reproducibility have established MFA as a community standard (Chodroff, 2018).

Limitations:

  • Boundary Resolution: Classic MFA aligns only at the 10 ms frame edges. Recent deep learning systems employing linear interpolation (e.g., MAPS) achieved 27.9% more boundaries within 10 ms of target, reducing median errors from 10.44 ms (MFA) to 7.31 ms (Kelley et al., 2023).
  • No Inter-Phoneme Gap Modeling: Models silences by a single phone class at word/utterance ends but does not predict both onset and offset per phone, unlike CTC/blank-based aligners (Rehman et al., 27 Sep 2025).
  • Resource Demands: Alignment is computationally intensive (real-time factor 52–194× on Buckeye), limiting scalability for interactive and large-scale streaming applications.
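The boundary-resolution limitation above can be softened by interpolating between frame centers. The sketch below is illustrative only: it places a boundary where the incoming phone's posterior linearly crosses 0.5, which captures the spirit of interpolation-based refinement but is not MAPS' exact formulation.

```python
def refine_boundary_ms(switch_frame, p_before, p_after, frame_ms=10.0):
    """Subframe boundary placement: given the frame index where the
    aligner switches phones, and the incoming phone's posterior on the
    frames before and after the switch, interpolate where the posterior
    crosses 0.5 between the two frame centers."""
    if p_after == p_before:
        frac = 0.5
    else:
        frac = (0.5 - p_before) / (p_after - p_before)
    frac = min(max(frac, 0.0), 1.0)      # clamp to the inter-frame span
    return ((switch_frame - 1) + frac) * frame_ms

# Switch at frame 12, posterior rising 0.2 -> 0.8: boundary at 115.0 ms,
# i.e. exactly between the frame-11 and frame-12 centers.
b = refine_boundary_ms(12, 0.2, 0.8)
```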

6. Innovations, Alternatives, and Developments

Recent research demonstrates complementary and competitive approaches:

  • MAPS: Introduces multilabel acoustic tagging and linear interpolation for subframe boundary placement. Interpolation yields up to 6% absolute gains in tight-error (<10 ms) alignment and can be retrofitted to MFA’s DP decoder with trivial CPU overhead (Kelley et al., 2023).
  • BFA: Replaces MFCCs with neural universal phoneme embeddings and utilizes a CTC decoding regime with explicit blank tokens for silence/gap modeling. BFA attains up to 240× faster alignment at competitive recall (especially for relaxed tolerances ≥40 ms), predicts both onsets and offsets, and enables explicit gap allocation between phones, which MFA cannot natively provide (Rehman et al., 27 Sep 2025).
  • Low-Resource Adaptation: For endangered/minority languages with <1 hour of data, fine-grained triphone grouping and increased monophone training yield 5–10% improvements, whereas audio augmentation shows negligible effect (Tosolini et al., 9 Apr 2025).

7. Outlook and Theoretical Considerations

MFA’s paradigm of GMM–HMM modeling, lexicon-based decoding, and frame-level alignment remains robust for research-level forced alignment. However, MAPS’ critique of one-hot acoustic targets—imposing minimum entropy and suppressing phone confusion—suggests adoption of smoothed or multi-task output representations, such as label-smoothing and auxiliary gestural tiers. Addressing sociolinguistic mismatch between training and operational data remains an open area, as does integrating flexible between-frame interpolation, richer posterior modeling, and more efficient data utilization (Kelley et al., 2023, Rousso et al., 2024).

Hybrid systems that combine ASR transcription for robust recognition with MFA for time-stamping are predicted to become standard, with ongoing research into neural architectures that bridge fine-grained boundary localization and the global robustness of modern end-to-end models. MFA continues to set the bar for precision word- and phone-level alignment, with new toolkits contributing additional temporal flexibility, annotation richness, and computational efficiency.
