LEMAS-Edit: Multilingual Speech Editing
- LEMAS-Edit is an autoregressive multilingual speech editing system that infills masked codec tokens based on precise word-level alignments.
- It leverages a VoiceCraft decoder-only Transformer and conditioned autoregressive generation to ensure smooth, seamless speech transitions.
- Adaptive strategies like repetition control and re-generation heuristics improve boundary management and performance in real-world recordings.
LEMAS-Edit is an autoregressive multilingual speech-editing system introduced within the LEMAS framework. It uses a decoder-only architecture and formulates speech editing as a masked token infilling task. It operates over discrete codec tokens rather than waveform samples, uses precise word-level alignments to construct edit masks, and applies adaptive decoding strategies intended to produce smooth-boundary edits with natural transitions. Within the broader LEMAS project, it complements LEMAS-TTS and is trained on timestamp-annotated multilingual speech data; the paper positions it as a practical system for prompt-based speech generation and editing in real recordings (Zhao et al., 4 Jan 2026).
1. Position within the LEMAS framework
LEMAS-Edit is presented together with the LEMAS-Dataset and LEMAS-TTS. The dataset is described as covering over 150,000 hours across 10 major languages with word-level timestamps, while LEMAS-Edit is the editing-oriented component that exploits those alignments for token-level infilling. The paper states that LEMAS-Edit adopts an autoregressive decoder-only architecture and that its central formulation is masked token infilling rather than full-span re-synthesis (Zhao et al., 4 Jan 2026).
This design choice is consequential because the system is not framed as a conventional text-to-speech model used post hoc for replacement. Instead, the edit region is explicitly localized in the codec-token sequence, and the model conditions on the preserved acoustic context surrounding that region. A common misconception is that multilingual speech editing in this setting is simply zero-shot TTS constrained by transcript replacement. The paper distinguishes LEMAS-Edit from that view by using a dedicated infilling formulation and by evaluating it against a multilingual zero-shot TTS baseline run in “editing mode,” that is, by re-synthesizing the span from scratch.
The editing model is also described as separate from the additional training mechanisms used for LEMAS-TTS. In particular, the paper specifies that no additional accent-adversarial or CTC objectives are used in the editing model; those are specific to LEMAS-TTS. This separation clarifies that the editing system is architecturally and procedurally specialized for local replacement rather than general multilingual synthesis.
2. Speech representation and autoregressive architecture
LEMAS-Edit is built on the same autoregressive, decoder-only Transformer backbone as VoiceCraft, with 330 M parameters, and inherits its discrete-token speech representation. Raw waveform is first encoded by a neural audio codec, EnCodec, into a sequence of quantized frame tokens; the paper gives the example of 8 codebooks and 6 frames per 10 ms. These codec tokens form the vocabulary over which the editing model operates (Zhao et al., 4 Jan 2026).
At both training and inference time, the decoder input consists of two parts: a prefix of unmasked codec tokens, corresponding to all tokens outside the edit region, and a special mask placeholder token, or repeated mask tokens, occupying exactly the number of time frames being edited. The model then autoregressively fills in the masked segment one codec token at a time, conditioning on all previously generated tokens, including the intact tokens preceding the mask.
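The input layout described here can be illustrated with a minimal sketch. It assumes a single token stream (real codec output has several codebooks per frame), and the names `MASK`, `build_decoder_input`, `infill`, and `toy_model` are hypothetical and do not come from the paper. For simplicity the stand-in model only sees the tokens to the left of the position being filled.

```python
from typing import Callable, List

MASK = -1  # hypothetical placeholder id; the real mask token id is not given in the source


def build_decoder_input(tokens: List[int], start: int, end: int) -> List[int]:
    """Keep tokens outside [start, end) and place one MASK per edited frame."""
    return tokens[:start] + [MASK] * (end - start) + tokens[end:]


def infill(seq: List[int], next_token: Callable[[List[int]], int]) -> List[int]:
    """Fill MASK positions left to right, one token at a time."""
    out = list(seq)
    for i, tok in enumerate(out):
        if tok == MASK:
            # Condition on everything before position i: the intact prefix plus
            # any tokens already generated inside the masked span.
            out[i] = next_token(out[:i])
    return out


def toy_model(history: List[int]) -> int:
    """Toy stand-in for the decoder: predicts the previous token plus one."""
    return history[-1] + 1 if history else 0


original = [10, 11, 12, 13, 14, 15]
masked = build_decoder_input(original, 2, 4)   # frames 2 and 3 are edited
edited = infill(masked, toy_model)
```

The number of mask placeholders equals the number of edited frames, so the surrounding tokens keep their positions in the sequence.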
This architecture makes the edit boundary a first-class object. The unmasked acoustic context is preserved explicitly in the sequence presented to the decoder, and the masked span is reconstructed in situ. The paper attributes the resulting smooth edit transitions to the tight coupling between word-level timing and codec-token masking. A plausible implication is that the model’s continuity properties derive less from global regeneration quality and more from local temporal anchoring at the token boundary.
3. Mask construction from word-level alignments
The model’s masking procedure depends on the LEMAS-Dataset forced-alignment pipeline, which uses the MMS aligner together with romanized transcripts to obtain precise word-start and word-end timestamps. For each training example, one contiguous target span of words is randomly sampled, often 1–3 words. The span’s begin and end times in seconds are then converted to EnCodec-frame indices through a frame-rate computation based on audio length in tokens and audio duration; those indices define a contiguous masked region (Zhao et al., 4 Jan 2026).
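The conversion from seconds to frame indices might look like the following sketch. The function name `span_to_frames`, the floor/ceil rounding rule, and the example numbers (200 frames over 4.0 s) are assumptions for illustration; the source only states that the frame rate is computed from the token length and the audio duration.

```python
import math
from typing import Tuple


def span_to_frames(start_s: float, end_s: float,
                   n_frames: int, duration_s: float) -> Tuple[int, int]:
    """Map a word span in seconds to a half-open codec-frame range [lo, hi)."""
    if duration_s <= 0 or not (0 <= start_s < end_s <= duration_s):
        raise ValueError("invalid span or duration")
    frames_per_second = n_frames / duration_s      # empirical frame rate
    lo = math.floor(start_s * frames_per_second)   # rounding rule is an assumption
    hi = math.ceil(end_s * frames_per_second)
    return max(lo, 0), min(hi, n_frames)


# Illustrative numbers only: a 4.0 s utterance encoded as 200 frames,
# with a word span from 1.25 s to 1.75 s.
lo, hi = span_to_frames(1.25, 1.75, n_frames=200, duration_s=4.0)
```

Deriving the rate from the utterance itself, rather than hard-coding it, keeps the mapping valid even if the codec configuration changes.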
The paper is explicit that no custom “slack” window is used. Token boundaries align exactly to the word timestamps, and the masked region is formed by replacing the corresponding codec tokens with a placeholder mask token while leaving all other codec tokens intact. This exact alignment is presented as the mechanism that ensures smooth edit transitions. In other words, the boundary condition is not softened heuristically; it is determined directly by forced alignment.
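A minimal sketch of this exact-boundary masking follows, assuming each word already carries a half-open frame range. `MASK_ID`, `sample_word_span`, `mask_span`, and the toy data are hypothetical names and values, not the paper's code.

```python
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical placeholder id


def sample_word_span(n_words: int, rng: random.Random,
                     max_len: int = 3) -> Tuple[int, int]:
    """Pick a contiguous half-open word range [i, j) of 1..max_len words."""
    length = rng.randint(1, min(max_len, n_words))
    i = rng.randint(0, n_words - length)
    return i, i + length


def mask_span(tokens: List[int], word_frames: List[Tuple[int, int]],
              span: Tuple[int, int]) -> Tuple[List[int], List[bool]]:
    """Mask exactly the frames of the chosen words; no slack frames are added."""
    lo = word_frames[span[0]][0]
    hi = word_frames[span[1] - 1][1]
    is_masked = [lo <= t < hi for t in range(len(tokens))]
    masked = [MASK_ID if m else tok for tok, m in zip(tokens, is_masked)]
    return masked, is_masked


tokens = list(range(100, 112))                     # 12 toy codec frames
word_frames = [(0, 3), (3, 7), (7, 9), (9, 12)]    # per-word [start, end) frames
span = sample_word_span(len(word_frames), random.Random(0))
masked, is_masked = mask_span(tokens, word_frames, span)
```

Because the mask edges come straight from the word boundaries, every frame outside the sampled words is passed through unchanged.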
The training objective is the negative log-likelihood over only the masked positions. Let $x = (x_1, \dots, x_T)$ be the full codec-token sequence and let $M \subseteq \{1, \dots, T\}$ denote the masked positions. The loss is
$$\mathcal{L}(\theta) = -\sum_{t \in M} \log p_\theta(x_t \mid x_{<t}).$$
The decoder remains autoregressive over the full sequence,
$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),$$
but the loss backpropagates only through masked positions. In implementation, the paper states that this is realized with a standard cross-entropy over the masked token labels. This objective makes LEMAS-Edit a targeted reconstruction model rather than a general language-model objective over all codec tokens.
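As a framework-free illustration of a cross-entropy restricted to masked positions, the sketch below computes the loss from raw logits in plain Python. The function name `masked_nll` and the toy inputs are hypothetical, and averaging over the masked positions (rather than summing) is an assumption.

```python
import math
from typing import Sequence


def masked_nll(logits: Sequence[Sequence[float]], targets: Sequence[int],
               is_masked: Sequence[bool]) -> float:
    """Mean cross-entropy over masked positions only."""
    total, count = 0.0, 0
    for row, y, m in zip(logits, targets, is_masked):
        if not m:
            continue  # unmasked positions contribute no loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[y]   # equals -log softmax(row)[y]
        count += 1
    return total / count if count else 0.0


# Three positions, vocabulary of two tokens; only positions 1 and 2 are masked.
logits = [[5.0, -5.0], [0.0, 0.0], [0.0, 0.0]]
targets = [1, 0, 1]
is_masked = [False, True, True]
loss = masked_nll(logits, targets, is_masked)
```

In a tensor framework the same effect is commonly obtained by passing an ignore index to the library cross-entropy, for example the `ignore_index` argument of `torch.nn.functional.cross_entropy`.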
4. Decoding heuristics and boundary control
The paper emphasizes that autoregressive codec models can loop on silence or syllables in long spans, and it introduces adaptive decoding strategies specifically to mitigate that behavior. The first is History-Aware Repetition Control, implemented as a dynamic repetition penalty within top-$k$/top-$p$ sampling. For tokens that have already appeared in the current generation history, positive logits are divided by the penalty and negative logits are multiplied by the penalty before sampling. The penalty ramps up over time, and the stated purpose is to prevent looping (Zhao et al., 4 Jan 2026).
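The logit adjustment described above can be sketched as follows. The divide/multiply rule comes from the text; the linear ramp in `ramped_penalty`, its constants, and both function names are assumptions, since the source does not specify how the penalty grows.

```python
from typing import Iterable, List


def ramped_penalty(step: int, base: float = 1.0, rate: float = 0.05,
                   cap: float = 2.0) -> float:
    """Hypothetical linear ramp: the penalty grows with the decoding step, up to cap."""
    return min(base + rate * step, cap)


def apply_repetition_penalty(logits: List[float], history: Iterable[int],
                             penalty: float) -> List[float]:
    """Return a copy of logits with previously generated tokens made less likely."""
    out = list(logits)
    for tok in set(history):
        # Dividing a positive logit and multiplying a negative one both lower
        # the logit when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
history = [0, 1, 0]                     # tokens 0 and 1 were already generated
penalty = ramped_penalty(step=10)       # 1.0 + 0.05 * 10
adjusted = apply_repetition_penalty(logits, history, penalty)
```

Treating positive and negative logits differently matters: dividing a negative logit by a penalty above 1 would raise it and make the repeated token more likely.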
The second mechanism is an Adaptive Re-Generation Mechanism. The system estimates a target speaking rate from the reference audio as the number of EnCodec tokens in the reference divided by the number of words in the reference. If an internal anomaly flag is raised, for example when the output is extremely short (below a minimum-length threshold) or when a special RE_GEN signal is triggered, the system automatically retries infilling up to a small maximum number of times. Each retry relaxes the repetition penalty further and can expand the mask region by a few frames to provide more acoustic context.
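A retry loop of this kind might be sketched as below. Only the length check is modeled, not the RE_GEN signal. The function `regenerate`, the `generate` callback signature, and the values of `min_ratio`, `relax`, `grow`, `max_retries`, and the starting penalty are all hypothetical; in particular, deriving the length threshold from the speaking rate is an assumption.

```python
from typing import Callable, List, Tuple


def regenerate(generate: Callable[[int, int, float], List[int]],
               lo: int, hi: int, n_frames: int,
               ref_tokens: int, ref_words: int, target_words: int,
               penalty: float = 1.5, max_retries: int = 3,
               min_ratio: float = 0.5, relax: float = 0.1,
               grow: int = 2) -> Tuple[List[int], int]:
    """Retry infilling while the output looks anomalously short.

    Returns the last output and the number of retries used.
    """
    tokens_per_word = ref_tokens / ref_words       # target speaking rate
    expected = tokens_per_word * target_words      # expected infill length
    out = generate(lo, hi, penalty)
    retries = 0
    while len(out) < min_ratio * expected and retries < max_retries:
        retries += 1
        penalty = max(1.0, penalty - relax)        # relax the repetition penalty
        lo, hi = max(0, lo - grow), min(n_frames, hi + grow)  # widen the mask
        out = generate(lo, hi, penalty)
    return out, retries


def toy_generate(lo: int, hi: int, penalty: float) -> List[int]:
    """Stand-in for the model: collapses to 2 tokens while the penalty is high."""
    return [0] * (2 if penalty > 1.35 else hi - lo)


out, retries = regenerate(toy_generate, lo=20, hi=40, n_frames=100,
                          ref_tokens=300, ref_words=30, target_words=2)
```

Capping the number of retries bounds the worst-case latency, and the last attempt is returned even if it still fails the check.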
The paper also describes three system-level enhancements. First, robust recognition and alignment: at inference, the original audio is re-aligned with Whisper-Large and MMS whenever edit boundaries are adjusted. Second, signal enhancement: optional denoising is performed via UVR5 in a light mode or DeepFilterNet in a more aggressive mode. Third, long-form processing: very long recordings are split into chunks, each chunk is edited independently, and the outputs are stitched with zero-crossing cross-fades.
These mechanisms indicate that LEMAS-Edit is not only a model architecture but also an editing pipeline. A plausible implication is that the reported smooth-boundary behavior depends on the interaction between the infilling model and these surrounding heuristics, especially under noisy or long-form conditions.
5. Training configuration and multilingual scope
The base model is a 330 M-parameter VoiceCraft decoder-only Transformer, warm-started from the publicly released English VoiceCraft checkpoint. Training data for the editing model comprise aligned utterances from LEMAS-Dataset in 7 languages—zh, en, de, fr, pt, es, and it—together with WenetSpeech4TTS, GigaSpeech, and MLS (Zhao et al., 4 Jan 2026).
Optimization uses Adam with one optimizer update per full batch and no gradient accumulation. The paper notes that this simplification made the effective batch size, measured in tokens per GPU, straightforward to tune and removed “hidden” dynamics from accumulation. The learning-rate schedule is a standard inverse-sqrt warmup over 10 k steps followed by a constant plateau at the final learning rate. Training is reported to converge on 16 A100 GPUs in approximately one week of wall-clock time.
The multilingual scope should be described precisely. The overall LEMAS-Dataset covers 10 major languages, but the editing model’s multilingual training set is explicitly stated as 7 languages plus additional corpora. This suggests a distinction between the dataset’s full coverage and the subset used for LEMAS-Edit training.
6. Evaluation, comparative results, and practical interpretation
The test set consists of 20 held-out utterances, with 2–3 utterances per language, drawn from the LEMAS evaluation partition and paired with gold word alignments. For each utterance, one word or short phrase is randomly selected via ChatGPT prompts to be replaced. The baseline is multilingual zero-shot TTS from LEMAS-TTS run in “editing mode,” meaning that the target span is re-synthesized from scratch rather than infilled (Zhao et al., 4 Jan 2026).
Evaluation is based on human A/B preference tests with six native or proficient listeners per language. The comparison focuses solely on naturalness, boundary smoothness, and coherence in the edited region. Figure 1 is described as a ridgeline plot summarizing preferences on a 0–100 scale, where 0 indicates strong preference for one system and 100 for the other. Across all seven languages, the paper reports that LEMAS-Edit is statistically indistinguishable from or slightly preferred to the TTS baseline.
That result is important to interpret carefully. The paper does not claim categorical superiority across all criteria; rather, it states statistical indistinguishability or slight preference relative to the baseline. The significance lies in achieving that level of performance while operating through localized masked-token infilling and tight alignment constraints. The concluding summary attributes the system’s behavior to the combination of tight word-level masking from LEMAS-Dataset alignments, dynamic repetition control, and re-generation heuristics, and states that this combination yields artifact-free, smooth-boundary edits in noisy “in the wild” recordings without any language-specific engineering.
Within the broader landscape of speech generation, LEMAS-Edit therefore occupies a specific methodological niche: multilingual local speech editing through codec-token infilling conditioned on exact alignment. Its contribution is not merely multilinguality or promptability in isolation, but the integration of timestamp supervision, autoregressive infilling, and runtime control heuristics into a single editing system (Zhao et al., 4 Jan 2026).