Custom Multilingual Seq2Seq LLM

Updated 4 December 2025
  • Custom Multilingual Seq2Seq LLMs are transformer-based encoder-decoder systems designed to map input sequences into outputs across diverse languages for tasks like translation and generation.
  • They leverage rigorous corpus curation, advanced tokenization, and embedding initialization techniques to balance high-resource and underrepresented languages.
  • Training protocols integrate span-masked denoising and causal objectives, followed by fine-tuning for tasks such as morphological segmentation, named entity recognition, and sentiment analysis.

A custom multilingual sequence-to-sequence LLM (Seq2Seq LLM) is defined as a Transformer-based architecture—typically encoder–decoder—explicitly designed to map input sequences to output sequences across multiple languages using shared or compositional tokenization and multilingual pretraining. These models address the limitations of monolingual or English-centric LLMs by leveraging cross-lingual transfer, large-scale datasets, and customized tokenization strategies, thereby achieving high performance on generative, translation, and morphological tasks in both high-resource and underrepresented languages. The architectural foundation, training pipelines, and practical deployment strategies are extensively guided by recent research advances, as outlined below.

1. Corpus Curation and Preprocessing

Effective custom multilingual Seq2Seq LLMs depend on high-quality corpora that combine monolingual depth and cross-lingual breadth. For robust coverage, practitioners assemble monolingual corpora in the target languages, targeting 10–60 GB per language (e.g., Wikipedia, news, web crawls), complemented with massive multilingual corpora such as mC4 (27 TB, 101 languages) (Eyal et al., 2022). European LLMs like Teuken-7B-Base emphasize balanced sampling of domain-specific and web-sourced data, comprising up to 4 trillion tokens with explicit up-sampling so that non-English content accounts for roughly 60% of the training mix (Ali et al., 30 Sep 2024).
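One common way to realize such up-sampling is temperature-based sampling over per-language corpus sizes, where the sampling probability is proportional to corpus size raised to a power α < 1. The sketch below illustrates the idea; the function name, α value, and corpus sizes are illustrative assumptions, not values from the cited papers.

```python
# Hedged sketch: temperature-based language sampling (alpha < 1 up-samples
# low-resource languages). Corpus sizes in GB are illustrative only.
def sampling_probs(corpus_sizes_gb: dict, alpha: float = 0.3) -> dict:
    weighted = {lang: size ** alpha for lang, size in corpus_sizes_gb.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}

probs = sampling_probs({"en": 2000.0, "de": 300.0, "ti": 12.0})
print(probs)  # the low-resource language "ti" gets a larger share than its raw size implies
```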

Preprocessing comprises the following deterministic pipeline:

  • Non-text noise removal (HTML tags, control characters)
  • Automated language detection and filtering
  • Deduplication to remove near-duplicates
  • Unicode normalization (NFKC)
  • Sentence splitting and task-specific tokenization

For highly morphologically rich or low-resource languages (e.g., Tigrinya, Romanized Hindi/Bengali), preprocessing protocols further enforce script normalization and domain adaptation (Teklehaymanot et al., 24 Sep 2025, Gharami et al., 27 Nov 2025).
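A minimal sketch of this pipeline is shown below, assuming a caller-supplied language-identification function (e.g., a fastText LID wrapper) and using exact-hash deduplication in place of the near-duplicate methods used in production; sentence splitting and task-specific tokenization are deferred to the tokenization stage.

```python
import hashlib
import re
import unicodedata

def preprocess(docs, keep_langs, detect_language):
    """Minimal sketch of the pipeline above. `detect_language` is a caller-supplied
    callable returning an ISO language code; deduplication here is exact-hash only."""
    seen = set()
    for text in docs:
        text = re.sub(r"<[^>]+>", " ", text)                      # strip HTML tags
        text = "".join(ch for ch in text if ch.isprintable())     # drop control characters
        text = unicodedata.normalize("NFKC", text)                # Unicode normalization
        text = re.sub(r"\s+", " ", text).strip()
        if not text or detect_language(text) not in keep_langs:   # language filtering
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:                                        # deduplication
            continue
        seen.add(digest)
        yield text
```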

2. Tokenization and Embedding Initialization

The construction of robust tokenizers is critical. Standard approaches involve training large-vocabulary (32k–250k) SentencePiece or BPE tokenizers on aggregated corpora, with strategies to minimize token “fertility” (tokens-per-word ratio) across all languages (Eyal et al., 2022, Ali et al., 30 Sep 2024, Kiulian et al., 24 Oct 2024). For morphologically rich languages or transliteration tasks, merging rules are engineered to respect morpheme boundaries and phonetic structure (e.g., no merges across Ge’ez base-character boundaries in Tigrinya, or explicit mapping tables in Bengali) (Teklehaymanot et al., 24 Sep 2025, Gharami et al., 27 Nov 2025).
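A minimal sketch of tokenizer training and fertility measurement with SentencePiece follows; the corpus path, vocabulary size, and BPE settings are placeholders rather than values taken from the cited papers.

```python
import sentencepiece as spm

# Train a shared multilingual tokenizer on the aggregated corpus (illustrative settings).
spm.SentencePieceTrainer.train(
    input="aggregated_corpus.txt",
    model_prefix="multi_spm",
    vocab_size=64000,
    model_type="bpe",
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="multi_spm.model")

def fertility(sentences):
    """Tokens-per-word ratio; lower and more uniform across languages is better."""
    tokens = sum(len(sp.encode(s, out_type=str)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / max(words, 1)
```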

Embedding initialization follows two main paradigms:

  • For overlapping tokens, embeddings are copied from pretrained models (MarianMT, mT5).
  • New tokens are initialized using uniform distributions (e.g., $U(-\sqrt{3/d}, \sqrt{3/d})$, where $d$ is the embedding dimension) or the NACHOS algorithm, where embeddings are computed as the average of constituent subword embeddings, preserving semantic continuity (Kiulian et al., 24 Oct 2024, Teklehaymanot et al., 24 Sep 2025).

This minimizes capacity dilution and accelerates convergence for low-resource languages.
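The two paradigms can be combined as in the hedged sketch below, where `old_emb`, `old_vocab`, `new_vocab`, and `old_tokenizer` are assumed inputs: the pretrained embedding matrix, its token-to-index map, the new token-to-index map, and a function that splits a new token into old-vocabulary subwords.

```python
import math
import torch

def init_embeddings(old_emb, old_vocab, new_vocab, old_tokenizer, d):
    """Sketch: copy embeddings for overlapping tokens, average constituent subword
    embeddings for new tokens, and fall back to U(-sqrt(3/d), sqrt(3/d)) otherwise."""
    new_emb = torch.empty(len(new_vocab), d)
    bound = math.sqrt(3.0 / d)
    new_emb.uniform_(-bound, bound)                        # default uniform initialization
    for tok, idx in new_vocab.items():
        if tok in old_vocab:                               # overlapping token: copy
            new_emb[idx] = old_emb[old_vocab[tok]]
        else:                                              # new token: average its subwords
            pieces = [p for p in old_tokenizer(tok) if p in old_vocab]
            if pieces:
                new_emb[idx] = torch.stack(
                    [old_emb[old_vocab[p]] for p in pieces]
                ).mean(dim=0)
    return new_emb
```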

3. Model Architecture and Training Objectives

Transformer encoder–decoder architectures remain standard, with variations in depth and parameter count depending on resource constraints and multilingual demands. Canonical configurations include:

| Model | Encoder Layers | Decoder Layers | Model Dim | FF Dim | Vocab Size | Source |
|---|---|---|---|---|---|---|
| mT5 | 24–48 | 24–48 | 1024–4096 | 4096–16384 | 250k | (Eyal et al., 2022) |
| MarianMT | 6 | 6 | 512 | 2048 | 32k | (Teklehaymanot et al., 24 Sep 2025; Gharami et al., 27 Nov 2025) |
| AlexaTM 20B | 46 | 32 | 4096 | 16384 | 150k | (Soltan et al., 2022) |
| Teuken-7B | N/A (decoder-only) | 32 | 4096 | 13440 | 50–100k | (Ali et al., 30 Sep 2024) |
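As an illustration, a MarianMT-sized encoder-decoder from the table can be instantiated with the Hugging Face transformers API roughly as follows; the attention-head counts and other unlisted fields are illustrative assumptions, not configuration taken from the cited papers.

```python
from transformers import MarianConfig, MarianMTModel

# Hedged sketch of a MarianMT-sized encoder-decoder mirroring the table above.
config = MarianConfig(
    vocab_size=32000,
    encoder_layers=6,
    decoder_layers=6,
    d_model=512,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
)
model = MarianMTModel(config)                      # randomly initialized weights
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```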

Loss functions consistently apply cross-entropy over the target sequence. Pretraining objectives combine span-masked denoising (dropping 15% of tokens in spans whose lengths are drawn from a Poisson(3) distribution) and causal (prefix) LM, mixed 80:20 in large-scale models such as AlexaTM 20B (Soltan et al., 2022). Hybrid objectives facilitate both scoring/classification and generative use cases.
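The denoising corruption can be sketched as below. This is a simplified T5/BART-style routine that matches the 15% / Poisson(3) description above rather than the exact AlexaTM implementation; the 80:20 denoising/causal mix would be applied when sampling training batches.

```python
import numpy as np

def span_corrupt(tokens, mask_ratio=0.15, mean_span=3.0, sentinel="<extra_id_{}>"):
    """Simplified span-masked denoising: mask ~mask_ratio of tokens in spans with
    Poisson(mean_span) lengths; each span is replaced by one sentinel token.
    Returns (corrupted_input, target). Illustrative only."""
    rng = np.random.default_rng()
    n_to_mask = max(1, int(round(len(tokens) * mask_ratio)))
    masked = set()
    while len(masked) < n_to_mask:
        span = max(1, rng.poisson(mean_span))
        start = rng.integers(0, len(tokens))
        masked.update(range(start, min(start + span, len(tokens))))
    corrupted, target, sent_id, in_span = [], [], 0, False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not in_span:                       # open a new masked span
                corrupted.append(sentinel.format(sent_id))
                target.append(sentinel.format(sent_id))
                sent_id += 1
            target.append(tok)
            in_span = True
        else:
            corrupted.append(tok)
            in_span = False
    return corrupted, target
```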

4. Fine-Tuning Strategies and Task Framing

Fine-tuning adapts the generic model to specific downstream tasks, framed in a text-to-text paradigm (illustrative input–output pairs follow the list):

  • Named entity recognition: input raw sentence, output delimited entity–type pairs (Eyal et al., 2022).
  • Morphological segmentation, lemmatization, POS tagging: inputs mapped via custom delimiter schemes (“@@”, “»”) to encoded morpheme/tag pairs (Eyal et al., 2022).
  • Machine translation: parallel sentence pairs, adapted for low-resource alphabets or noisy social-media variants (Gharami et al., 27 Nov 2025).
  • Image-to-text translation pipelines combine OCR outputs with subsequent seq2seq NMT (Sahay et al., 27 Oct 2025).
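The pairs below illustrate this framing; the task prefixes and the precise use of the "@@" / "»" delimiters are assumptions for illustration, not formats copied from the cited papers.

```python
# Hedged examples of text-to-text task framing.
ner_example = {
    "input":  "ner: Barack Obama visited Nairobi in July.",
    "output": "Barack Obama @@ PER » Nairobi @@ LOC",
}
segmentation_example = {
    "input":  "segment: unbelievable",
    "output": "un @@ believ @@ able",
}
transliteration_example = {
    "input":  "transliterate hi-Latn to hi: namaste",
    "output": "नमस्ते",
}
```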

Fine-tuning protocols utilize AdamW with linearly decayed learning rates (e.g., 1e-4–1e-3), moderate batch sizes (8–128), and regularization via dropout (0.1–0.3) and label smoothing ($\epsilon = 0.1$ for classification) (Eyal et al., 2022, Teklehaymanot et al., 24 Sep 2025, Gharami et al., 27 Nov 2025). Curriculum strategies—such as encoder freezing for initial steps—preserve cross-lingual knowledge and stabilize learning (Teklehaymanot et al., 24 Sep 2025, Soltan et al., 2023).
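A minimal PyTorch sketch of this optimization setup is shown below; the learning rate, weight decay, and step count are representative of the quoted ranges rather than tuned values.

```python
import torch
from torch import nn

def build_optimization(model, lr=3e-4, total_steps=10_000):
    """Sketch: AdamW, linear learning-rate decay, and label-smoothed cross-entropy."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(0.0, 1.0 - step / total_steps)  # linear decay to zero
    )
    loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)
    return optimizer, scheduler, loss_fn
```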

5. Evaluation Methodologies

Model evaluation integrates domain-aligned automatic and human-centric metrics:

| Task | Metrics | Noteworthy Benchmarks |
|---|---|---|
| Machine translation | BLEU, chrF, ROUGE, METEOR | Flores-101 (Soltan et al., 2022), OPUS-100 (Sahay et al., 27 Oct 2025) |
| Morphology | F1 (token-level, morpheme-level) | UD Hebrew (Eyal et al., 2022) |
| Question answering | F1, Exact Match (EM) | ParaShoot (Eyal et al., 2022), SQuADv2 (Soltan et al., 2022) |
| Sentiment | Macro F1 | NEMO (Hebrew) (Eyal et al., 2022) |
| Transliteration | BLEU, CER, WER, chrF | Dakshina (Hindi/Bengali) (Gharami et al., 27 Nov 2025) |
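Corpus-level BLEU and chrF can be computed with sacrebleu as sketched below; the hypothesis and reference strings are placeholders.

```python
import sacrebleu

# Hedged sketch: corpus-level BLEU and chrF. `refs` is a list of reference streams,
# each stream holding one reference string per hypothesis.
hyps = ["the cat sat on the mat"]
refs = [["the cat is sitting on the mat"]]
print(sacrebleu.corpus_bleu(hyps, refs).score)
print(sacrebleu.corpus_chrf(hyps, refs).score)
```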

Human evaluation focuses on fluency and adequacy, error taxonomies (lexical, morphological, OOV), and domain breakdowns (religious, news, education) (Teklehaymanot et al., 24 Sep 2025, Gharami et al., 27 Nov 2025). Bonferroni correction is applied to control for multiple comparisons when testing statistical significance (Teklehaymanot et al., 24 Sep 2025). Fertility, Non-Existing Words Ratio, and Code-Switching Word Ratio are additional diagnostic metrics for generative quality in underrepresented languages (Kiulian et al., 24 Oct 2024).

6. Adaptation Across Languages and Resource Regimes

Transfer learning approaches, tokenizer unification, and encoder warm-start protocols support rapid adaptation to new language pairs and scripts (Kiulian et al., 24 Oct 2024, Soltan et al., 2023). Shared vocabularies and embedding matrices maximize cross-lingual transfer, with curriculum and mixed-batch strategies favoring low-resource languages. Back-translation and synthetic data augmentation compensate for corpus scarcity, while modular adapters can localize specialization without disrupting generalization (Sahay et al., 27 Oct 2025).
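Back-translation can be sketched as follows, assuming a reverse-direction (target-to-source) seq2seq model is available; the model path and generation settings are placeholders.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def back_translate(target_sentences, reverse_model_name="path/to/reverse-mt-model"):
    """Hedged sketch: generate synthetic source sentences for monolingual target text."""
    tok = AutoTokenizer.from_pretrained(reverse_model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(reverse_model_name)
    batch = tok(target_sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        generated = model.generate(**batch, num_beams=4, max_new_tokens=128)
    synthetic_sources = tok.batch_decode(generated, skip_special_tokens=True)
    return list(zip(synthetic_sources, target_sentences))  # (synthetic source, real target)
```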

Multilingual models such as Teuken-7B-Base cover 24 languages by balancing data proportions and minimizing fertility variance, achieving ~54% average accuracy on European QA and commonsense benchmarks with efficient scaling to 7B parameters (Ali et al., 30 Sep 2024).

7. Scaling, Compute Efficiency, and Best Practices

Scaling laws indicate diminishing performance returns above certain model and data sizes, suggesting empirical tuning for optimal compute use (Eyal et al., 2022). Sequential pretraining pipelines—MLM encoder followed by seq2seq decoder with staged unfreezing—yield equivalent quality at 27% reduced compute cost relative to from-scratch dual training (Soltan et al., 2023). PreLayerNorm, mixed precision, ZeRO-style sharding, and instruction-tuning frameworks are industry standard for efficient large-scale deployment (Soltan et al., 2022, Ali et al., 30 Sep 2024).
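The staged-unfreezing curriculum can be sketched as below, assuming a Hugging Face-style seq2seq model that exposes get_encoder(); step counts and learning-rate choices are left to the training script.

```python
# Sketch of staged unfreezing: keep the pretrained encoder frozen for an initial
# phase, then unfreeze it for joint training.
def set_encoder_trainable(model, trainable: bool):
    for param in model.get_encoder().parameters():
        param.requires_grad = trainable

# Phase 1: train decoder (and any new embeddings) only.
# set_encoder_trainable(model, False)
# ... train for N steps ...
# Phase 2: unfreeze the encoder and continue, typically with a lower learning rate.
# set_encoder_trainable(model, True)
```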

Reproducibility mandates publishing code, exact hyperparameters, data splits, and checkpoints, with mean ± standard deviation reporting across random seeds (Eyal et al., 2022). Error analysis, annotation audits, and iterative parameter tuning remain central to final model refinement.

References

  • Eyal et al. (2022), "Multilingual Sequence-to-Sequence Models for Hebrew NLP"
  • Teklehaymanot et al. (24 Sep 2025), "Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks"
  • Gharami et al. (27 Nov 2025), "Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration"
  • Soltan et al. (2022), "AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model"
  • Ali et al. (30 Sep 2024), "Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs"
  • Kiulian et al. (24 Oct 2024), "From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages"
  • Soltan et al. (2023), "Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models"
  • Sahay et al. (27 Oct 2025), "A U-Net and Transformer Pipeline for Multilingual Image Translation"