WMT 2025 Terminology Shared Task Corpus
- WMT 2025 Terminology Shared Task Corpus is a specialized resource for terminology-constrained machine translation, featuring bilingual dictionaries and synthetic data.
- It includes three language pairs (EN→DE, EN→ES, and EN→RU) with explicit XML-style term constraints marked on both source and target sides.
- The corpus enables rigorous evaluation of MT systems via BLEU, chrF2++, and terminology success rate; dual-stage translation architectures report near-perfect success rates on it.
The WMT 2025 Terminology Shared Task Corpus is a specialized resource designed for research and evaluation in terminology-constrained machine translation (MT). Unlike conventional parallel corpora, its primary materials are bilingual terminology dictionaries covering three language directions: English–German, English–Spanish, and English–Russian. The corpus has been widely used as a benchmark for terminology-aware translation systems, most notably in Jaswal et al.’s "It Takes Two: A Dual Stage Approach for Terminology-Aware Translation" (Jaswal, 7 Nov 2025), where it serves as both a source of constraints and a held-out test set. The key focus is on sentence-level translation with explicit term constraints, supporting both the development of synthetic training data and rigorous evaluation protocols for MT models enhanced for controlled vocabulary.
1. Structure and Composition
The WMT 2025 Terminology Shared Task Corpus comprises:
- Language Coverage: Three translation directions: English→German (DE), English→Spanish (ES), English→Russian (RU).
- Terminology Dictionaries: Extracted from WMT 2025 development data, with each bilingual lexicon typically exceeding 1,000 unique source–target entries per direction. These term lists are tracked via repetition_ids; LLM-generated term suggestions are appended to supplement coverage in similar domains.
- Synthetic Parallel Data: To offset the scarcity of in-domain sentence-aligned data, synthetic parallel sentences are automatically generated, embedding source–target term pairs (an illustrative example follows this list).
- Single-term mode: Approximately 10,000–15,000 filtered sentence pairs per language, each enforcing a single required term constraint.
- Multi-term mode: Similar scale per language, with each sentence constraining 2–3 terms.
- Term Distribution: The design of synthetic data guarantees that every segment in single-term mode contains exactly one constraint, while multi-term mode averages 2–3 constraints per segment. No finer-grained per-segment histograms are reported.
- Corpus Size: Absolute counts of sentences or tokens in train, dev, and test splits are not disclosed; only filtered synthetic data counts (post-deduplication and quality estimation) are specified.
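To make the composition concrete, the sketch below shows how a single dictionary entry and a derived single-term synthetic pair might be represented in memory. The field names and example sentences are illustrative assumptions; the task materials do not prescribe this exact layout.

```python
# Hypothetical in-memory layout of one dictionary entry and a single-term
# synthetic pair derived from it; field names are illustrative only.
dictionary_entry = {
    "direction": "en-de",
    "source_term": "machine translation",
    "target_term": "maschinelle Übersetzung",
}

synthetic_pair = {
    "source": "[TERM]Machine translation[/TERM] is improving steadily.",
    "target": "Die [TERM]maschinelle Übersetzung[/TERM] verbessert sich stetig.",
    "required_terms": [
        {"src": "machine translation", "tgt": "maschinelle Übersetzung"}
    ],
}

print(synthetic_pair["required_terms"])
```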
2. Constraint Annotation and File Schema
- Constraint Markup: All term constraints are explicitly bracketed with XML-style tags [TERM]...[/TERM] on both the source (English) and target (DE/ES/RU) sides, ensuring alignment and facilitating automated post-processing.
- Metadata Packaging: Model outputs after post-editing are serialized as JSONL, with each object encapsulating:
- "source": the raw source segment
- "initial_translation": the NMT system output
- "required_terms": the list of required source–target term mappings
- Schema Validation: The integrity of this schema is programmatically validated via a dedicated JSON parser (a minimal validation sketch follows this list).
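The following is a minimal stand-in for the dedicated JSON parser mentioned above, assuming only the three documented keys; the actual validator presumably performs additional checks (e.g., tag balance and term coverage), and the function name and example record are illustrative.

```python
import json

REQUIRED_KEYS = {"source", "initial_translation", "required_terms"}

def validate_jsonl_line(line: str) -> dict:
    """Parse one JSONL record and check the documented schema.

    Simplified sketch of the validation step; the real parser may enforce
    additional constraints beyond key presence and basic typing.
    """
    record = json.loads(line)  # raises json.JSONDecodeError on malformed input
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    if not isinstance(record["required_terms"], list):
        raise ValueError("required_terms must be a list of term mappings")
    return record

# Example usage with a hypothetical record:
example = (
    '{"source": "The model uses attention.", '
    '"initial_translation": "Das Modell verwendet Attention.", '
    '"required_terms": [{"src": "attention", "tgt": "Attention"}]}'
)
print(validate_jsonl_line(example)["required_terms"])
```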
3. Preprocessing, Tag Management, and Integration
- Tag Standardization: Constraint markers are standardized through a pre-processing pass (approximated in a sketch after this list) employing:
- Longest-first matching: Preference for longest term matches
- Case-insensitive detection: Matching disregards case but retains original case in output
- Inverse mapping: Ensuring source and target tags are monotonically aligned
- Synthetic Data Quality Control:
- COMET_QE Filtering: Sentence pairs are scored with quality estimation, and only those meeting a per-language threshold (approximately $0.90$) are retained.
- Deduplication: Redundant or near-duplicate sentence sources are removed.
- Vocabulary Augmentation: The NMT model (NLLB-200 with 3.3B parameters) is modified to include atomic “[TERM]” and “[/TERM]” tokens in its vocabulary, preventing sub-word tokenization errors with the markup (see the tokenizer sketch after this list).
- Fine-Tuning Protocol: A mix of filtered synthetic parallel data and raw terminology lists (not parallel text) from the development dictionaries is used for parameter-efficient adapter fine-tuning of the underlying NLLB model. Further details on subword tokenization or BPE are not disclosed; the baseline NLLB preprocessing pipeline is applied.
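The tag standardization pass can be approximated as follows. This is a simplified sketch: it handles only source-side tagging with longest-first, case-insensitive matching, does not handle overlaps with already-tagged spans, and the function and variable names are illustrative rather than taken from the shared-task tooling.

```python
import re

def tag_terms(sentence: str, term_dict: dict[str, str]) -> str:
    """Wrap dictionary terms in [TERM]...[/TERM], longest match first.

    Matching is case-insensitive, but the original casing of the sentence is
    preserved in the output. Target-side tagging and monotonic source/target
    alignment (the "inverse mapping" step) are not shown here.
    """
    # Longest-first: try longer source terms before shorter ones so that
    # "machine translation system" wins over "machine translation".
    for src_term in sorted(term_dict, key=len, reverse=True):
        pattern = re.compile(re.escape(src_term), flags=re.IGNORECASE)
        sentence = pattern.sub(lambda m: f"[TERM]{m.group(0)}[/TERM]", sentence)
    return sentence

terms = {"machine translation": "maschinelle Übersetzung"}
print(tag_terms("Machine translation quality keeps improving.", terms))
# -> "[TERM]Machine translation[/TERM] quality keeps improving."
```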
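One plausible realization of the vocabulary augmentation step uses the Hugging Face transformers interface to the public NLLB-200 3.3B checkpoint; the checkpoint identifier and API calls below are assumptions about a standard setup, not code released with the task.

```python
# Assumes the public NLLB-200 3.3B checkpoint on the Hugging Face Hub;
# the shared-task system may load the model differently.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-3.3B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Register the constraint markers as atomic tokens so that subword
# segmentation can never split "[TERM]" or "[/TERM]" across pieces.
num_added = tokenizer.add_tokens(["[TERM]", "[/TERM]"], special_tokens=True)

# Grow the embedding matrix to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} marker tokens; new vocab size: {len(tokenizer)}")
```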
4. Evaluation Procedures and Metrics
- Evaluation Split: The WMT 2025 Shared Task corpus functions as a held-out, standardized test set for comparing systems that must apply explicit terminology constraints.
- Reported Metrics:
- BLEU (Papineni et al. 2002)
- chrF2++ (Popović 2017)
- Terminology Success Rates (a scoring sketch follows at the end of this section):
- Proper SR: the success rate when the official term constraints are enforced
- Rand. SR: the success rate when randomly selected terms are forced into the constraints, serving as a control
- Results on Official Test Set (Proper Constraints):
| Direction | BLEU | chrF2++ | Proper SR |
|---|---|---|---|
| EN → DE | 48.06 | 70.74 | 0.98 |
| EN → ES | 58.51 | 76.08 | 0.99 |
| EN → RU | 35.80 | 63.57 | 0.98 |
- Observations: Strict (“proper”) constraint settings maximize BLEU/chrF2++ and yield nearly perfect terminology accuracy (Proper SR of 0.98–0.99). Unconstrained (“noterm”) and random-term configurations demonstrate the functional trade-off between lexical flexibility and terminology adherence.
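No formal scoring code is given above; the sketch below implements the natural reading of terminology success rate (the fraction of required target terms found in the corresponding hypothesis) alongside BLEU and chrF2++ via sacrebleu. The case-insensitive substring matching rule and function name are assumptions; the official scorer may normalize casing or morphology differently.

```python
from sacrebleu.metrics import BLEU, CHRF

def term_success_rate(hypotheses, required_terms):
    """Fraction of required target terms found in the corresponding hypothesis.

    `required_terms` is a list (one entry per segment) of lists of target-side
    terms. Matching is simple case-insensitive substring search.
    """
    hits, total = 0, 0
    for hyp, terms in zip(hypotheses, required_terms):
        for term in terms:
            total += 1
            hits += int(term.lower() in hyp.lower())
    return hits / total if total else 0.0

hyps = ["Das neuronale Netz konvergiert schnell."]
refs = ["Das neuronale Netz konvergiert schnell."]
terms = [["neuronale Netz"]]

bleu = BLEU()
chrf = CHRF(word_order=2)  # word_order=2 with the default beta=2 gives chrF2++
print(bleu.corpus_score(hyps, [refs]).score)
print(chrf.corpus_score(hyps, [refs]).score)
print(term_success_rate(hyps, terms))
```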
5. Methodological Best Practices and Usage Guidelines
- Two-Stage Architecture ("DuTerm"): The preferred paradigm combines:
- Adapter-fine-tuned NMT on tag-annotated synthetic data
- Prompted LLM-based post-editing for fluency optimization and robust enforcement of term positions
- Tag Management: The introduction of dedicated markup tokens is essential to preserve tag boundaries and maintain consistent constraint placement, especially when using subword segmentation-based models.
- Quality Assurance: The application of a high COMET_QE filtering threshold (approximately $0.90$) is crucial for suppressing low-quality or noisy synthetic pairs.
- Prompt Engineering: Post-editing prompts should enumerate all required source–target term mappings explicitly (a prompt-construction sketch follows this list). Generation is performed with a low temperature (0.3) for determinism, coupled with output validation to ensure tag integrity.
- Cross-Lingual Transfer: Pooling synthetic sentences from multiple language pairs for joint fine-tuning enables representation sharing, potentially enhancing performance on low-resource pairs.
- Future Enhancements: Document-level consistency, dynamic prompt adaptation, and user-controllable constraint mechanisms are noted as promising directions to extend beyond current sentence-level constraints.
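To illustrate the prompt-engineering guideline, the sketch below assembles a post-editing prompt that enumerates every required mapping. The wording, function name, and example data are illustrative assumptions, not the prompt used by the DuTerm system.

```python
def build_postedit_prompt(source, initial_translation, required_terms, target_lang="German"):
    """Assemble a post-editing prompt that lists every required term mapping.

    Illustrative wording only; the actual DuTerm prompt is not reproduced here.
    """
    constraint_lines = "\n".join(
        f'- "{t["src"]}" must be translated as "{t["tgt"]}"' for t in required_terms
    )
    return (
        f"You are post-editing a machine translation into {target_lang}.\n"
        f"Source: {source}\n"
        f"Draft translation: {initial_translation}\n"
        f"Required terminology:\n{constraint_lines}\n"
        "Revise the draft for fluency while keeping every required term exactly "
        "as specified. Return only the revised translation."
    )

prompt = build_postedit_prompt(
    source="The [TERM]neural network[/TERM] converged quickly.",
    initial_translation="Das [TERM]neuronale Netz[/TERM] konvergierte schnell.",
    required_terms=[{"src": "neural network", "tgt": "neuronales Netz"}],
)
# The prompt would then be sent to an LLM with temperature 0.3, and the output
# checked for tag integrity before the markers are finally stripped.
print(prompt)
```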
6. Context, Limitations, and Implications
The design of the WMT 2025 Terminology Shared Task Corpus—with its emphasis on term dictionaries and context-rich synthetic parallel sentences—reflects its primary use case: benchmarking and refining term-sensitive MT systems. The corpus does not provide raw parallel text for training conventional sentence-level MT models, instead supplying constraint resources and evaluation scaffolds that ensure reproducibility and comparability in terminology-aware research. The absence of standard train/dev/test split statistics or token counts suggests the corpus is best understood as a resource for developing and evaluating constraint-aware MT pipelines rather than as a generic parallel corpus. A plausible implication is that future expansions of the WMT terminology corpus will focus on richer contexts (e.g., document-level constraints), greater annotation granularity and protocol standardization, and dynamic user-driven term selection, fostering further research into adaptive, reliable terminology control in neural MT.