Translate-Align-Retrieve (TAR) Pipeline
- Translate-Align-Retrieve (TAR) is a three-stage method combining neural machine translation, unsupervised word alignment, and answer span projection to construct parallel extractive QA datasets.
- It leverages large-scale English–Spanish parallel corpora and advanced preprocessing (e.g., BPE, Moses tokenization) to achieve nearly 100% data coverage with precise answer annotations.
- Empirical evaluations show that QA models trained on TAR-generated datasets achieve strong results on the Spanish portions of cross-lingual benchmarks such as MLQA and XQuAD.
The Translate–Align–Retrieve (TAR) pipeline is a principled methodology for generating fully-annotated, extractive question answering (QA) corpora in a target language by leveraging parallel corpora and word-level alignment. The approach was originally formulated to construct SQuAD-es, a Spanish translation of the Stanford Question Answering Dataset (SQuAD) v1.1, addressing the acute resource scarcity in large-scale non-English QA data. TAR decomposes the QA dataset transduction process into three sequential components: neural machine translation (NMT) of context and questions, unsupervised token-level alignment, and answer span projection from source to target text (Carrino et al., 2019).
1. Pipeline Architecture and Functional Stages
TAR processes an English QA triple $(c_{src}, q_{src}, a_{src})$ (context, question, answer span) to produce a target-language triple $(c_{tgt}, q_{tgt}, a_{tgt})$. The pipeline comprises:
1. Translation: Both context and question are translated using an NMT model.
2. Alignment: An unsupervised word-alignment model computes token-level mapping between each source sentence and its translation.
3. Retrieval (Answer Projection): The original answer span indices are projected into the target context using the alignment data to extract the target-language answer span.
The end-to-end result is a synthetic, parallel QA resource maintaining nearly 100% data coverage with fine-grained answer annotation.
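The way the three stages compose can be sketched as follows. This is a minimal illustration, not the authors' code: `QAExample`, `tar`, and the three callables `translate`, `align`, and `project` are hypothetical names standing in for the NMT model, the word aligner, and the span-projection step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Alignment = List[Tuple[int, int]]  # (source_index, target_index) links


@dataclass
class QAExample:
    context: List[str]             # tokenized context
    question: List[str]            # tokenized question
    answer_span: Tuple[int, int]   # inclusive token indices into context


def tar(
    example: QAExample,
    translate: Callable[[List[str]], List[str]],
    align: Callable[[List[str], List[str]], Alignment],
    project: Callable[[Tuple[int, int], Alignment], Tuple[int, int]],
) -> QAExample:
    """Run Translate -> Align -> Retrieve on one example (hypothetical interface)."""
    # Stage 1: translate context and question.
    tgt_context = translate(example.context)
    tgt_question = translate(example.question)
    # Stage 2: token-level alignment between source and translated context.
    links = align(example.context, tgt_context)
    # Stage 3: project the source answer span onto the translated context.
    tgt_span = project(example.answer_span, links)
    return QAExample(tgt_context, tgt_question, tgt_span)
```

Passing the stages in as callables keeps the sketch independent of any particular NMT toolkit or aligner.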
2. Translation Stage: Data, Model, and Preprocessing
The translation component uses a Transformer-based NMT model trained on approximately 6.5 million English–Spanish parallel sentence pairs drawn from WikiMatrix, TED-2013, News-Commentary, Tatoeba, and OpenSubtitles.
Data preprocessing includes:
- Punctuation normalization
- Moses tokenization and truecasing
- Joint byte-pair encoding (BPE) with 50k merge operations
- Sentence length capped at 80 tokens; duplicates removed
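The length cap and duplicate removal from the list above could look roughly like the sketch below. It assumes the text is already tokenized so that whitespace splitting gives the token count, that the 80-token cap applies to both sides of a pair, and that "duplicate" means an identical sentence pair; the source does not spell out these details, and `filter_pairs` is a hypothetical helper.

```python
from typing import Iterable, Iterator, Set, Tuple


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], max_len: int = 80
) -> Iterator[Tuple[str, str]]:
    """Drop over-long sentence pairs and exact duplicate pairs."""
    seen: Set[Tuple[str, str]] = set()
    for src, tgt in pairs:
        # Assumption: the cap is enforced on both source and target.
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue
        # Assumption: a duplicate is an identical (source, target) pair.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```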
The NMT model has the following settings:
- Base Transformer architecture per Vaswani et al. (2017)
- OpenNMT-py toolkit, default hyperparameters
- Shared source–target vocabulary/embeddings
- Trained for 200,000 update steps on a single GeForce GTX TITAN X GPU
- Checkpoint averaging over the last three snapshots
Model performance on a held-out test set is BLEU 45.60.
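Checkpoint averaging, listed among the settings above, takes the element-wise mean of the parameters saved in several snapshots. The sketch below illustrates the idea on plain Python lists; real checkpoints hold tensors and are averaged with toolkit utilities, so `average_checkpoints` and the flat-list parameter format are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def average_checkpoints(
    checkpoints: Sequence[Dict[str, List[float]]]
) -> Dict[str, List[float]]:
    """Element-wise mean of same-shaped parameter vectors across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Dict[str, List[float]] = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th value of every checkpoint together.
        averaged[name] = [sum(values) / n for values in zip(*vectors)]
    return averaged
```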
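The NMT training objective is the standard cross-entropy loss over the target-side sequence $y = (y_1, \dots, y_T)$ of length $T$, given the source sentence $x$ and model parameters $\theta$:
$$\mathcal{L}_{\text{NMT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$$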
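As a small numerical illustration of a token-level cross-entropy objective, the loss for one sentence is the negative sum of the log-probabilities the model assigns to each reference token. The function name `sequence_nll` and the example probabilities are hypothetical.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a target sequence.

    token_probs[t] is the model probability of the reference token at
    position t, conditioned on the source and the previous target tokens.
    """
    return -sum(math.log(p) for p in token_probs)
```

For reference-token probabilities 0.5 and 0.25, the loss is $\ln 2 + \ln 4 = \ln 8$.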
3. Alignment Stage: Methodology and Probability Model
Alignment relies on efmaral, a fast unsupervised Bayesian word aligner (eflomal is its low-memory successor), trained via Markov Chain Monte Carlo methods on the tokenized training data.
Alignment model details:
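- A source sentence $s = (s_1, \dots, s_I)$ and its target $t = (t_1, \dots, t_J)$ are aligned so that each target token $t_j$ is mapped to a source token $s_{a_j}$.
- The alignment $\mathbf{a} = (a_1, \dots, a_J)$ is scored in an IBM-Model-1 fashion:
$$p(t, \mathbf{a} \mid s) \propto \prod_{j=1}^{J} p(t_j \mid s_{a_j})$$
or equivalently
$$\log p(t, \mathbf{a} \mid s) = \sum_{j=1}^{J} \log p(t_j \mid s_{a_j}) + \text{const}$$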
- efmaral extends this with Bayesian priors and distortion parameters, inferred via Gibbs sampling.
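To make the IBM-Model-1-style score concrete, the sketch below sums the log lexical translation probabilities of each target token given the source token it is aligned to. It covers only the lexical term, not the Bayesian priors or distortion parameters the aligner adds; `alignment_log_score` and the toy probability table are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def alignment_log_score(
    source: List[str],
    target: List[str],
    alignment: List[int],
    lex_prob: Dict[Tuple[str, str], float],
) -> float:
    """Log-score of an alignment: sum over j of log p(target[j] | source[a_j]).

    alignment[j] is the index of the source token aligned to target token j.
    lex_prob maps (target_token, source_token) to a translation probability.
    """
    if len(alignment) != len(target):
        raise ValueError("one alignment link is required per target token")
    return sum(
        math.log(lex_prob[(target[j], source[a_j])])
        for j, a_j in enumerate(alignment)
    )
```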
Since truecasing and BPE-based tokenization may split words, token-to-word mappings are maintained explicitly so alignment can be collapsed to the word level.
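Collapsing a subword-level alignment to the word level can be sketched as below. The sketch assumes the `@@` continuation suffix that subword-nmt appends to non-final BPE pieces; both helper names are hypothetical.

```python
from typing import List, Tuple


def subword_to_word_index(subwords: List[str], marker: str = "@@") -> List[int]:
    """Map each subword position to the index of the word it belongs to."""
    mapping: List[int] = []
    word = 0
    for piece in subwords:
        mapping.append(word)
        # A piece without the continuation marker ends the current word.
        if not piece.endswith(marker):
            word += 1
    return mapping


def collapse_alignment(
    links: List[Tuple[int, int]],
    src_subwords: List[str],
    tgt_subwords: List[str],
) -> List[Tuple[int, int]]:
    """Turn subword-level (src, tgt) links into deduplicated word-level links."""
    src_map = subword_to_word_index(src_subwords)
    tgt_map = subword_to_word_index(tgt_subwords)
    return sorted({(src_map[i], tgt_map[j]) for i, j in links})
```

Several subword links that fall inside the same word pair collapse to a single word-level link, which is why the result is built as a set.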
4. Retrieval Stage: Answer Span Projection
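The answer projection step maps English answer span indices $[i_{start}, i_{end}]$ in the source context $c_{src}$ to target indices in the translated context $c_{tgt}$. Here $a_j$ denotes the source index aligned to target token $j$.
Procedure:
- For each position $i \in [i_{start}, i_{end}]$, retrieve the set of aligned target tokens $A(i) = \{ j : a_j = i \}$.
- Set $j_{start} = \min \bigcup_i A(i)$, $j_{end} = \max \bigcup_i A(i)$, to handle possible reordering.
- Extract $c_{tgt}[j_{start} : j_{end}]$ as the projected answer.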
- If the answer text does not occur verbatim as a substring of the translated context (because of translation variation), the method falls back to returning the alignment-projected span regardless.
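Pseudo-algorithm: gather every target index aligned to a source token inside the answer span; if no such index exists, the projected span is empty; otherwise return the target tokens between the smallest and the largest gathered index, inclusive.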
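A minimal Python sketch of this projection, under the assumption that the alignment is available as a list of `(source_index, target_index)` word-level links and that spans use inclusive token indices. `project_span` is a hypothetical name, and the sketch omits the punctuation trimming and other post-processing applied to the final dataset.

```python
from typing import List, Optional, Tuple


def project_span(
    span: Tuple[int, int],
    links: List[Tuple[int, int]],
    tgt_tokens: List[str],
) -> Optional[Tuple[Tuple[int, int], List[str]]]:
    """Project an inclusive source token span onto the target sentence."""
    start, end = span
    # Collect every target index aligned to a source token inside the span.
    targets = [j for i, j in links if start <= i <= end]
    if not targets:
        return None  # nothing aligned: empty answer span
    # Taking min and max absorbs word reordering between the two languages.
    j_start, j_end = min(targets), max(targets)
    return (j_start, j_end), tgt_tokens[j_start : j_end + 1]
```

For "the red house is big" aligned to "la casa roja es grande", the span covering "red house" projects to "casa roja" even though the two words swap order.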
5. Dataset Construction and Analytical Statistics
TAR produces two primary resources:
| Corpus | Examples | Construction Criterion |
|---|---|---|
| SQuAD-es v1.1 (full) | 87,595/87,599 (~100%) | All projected spans via alignment, including fallback |
| SQuAD-es v1.1 (small) | 46,260 (~53%) | Only examples whose answer occurs verbatim in the translated context |
The average context length is 140 tokens, average question length is 13, and average answer length is 4 tokens. Post-processing includes removing overlapping-span errors (trimming punctuation, removing adjacent-sentence tokens) and discarding empty answer spans.
6. Empirical Evaluation: QA Models, Metrics, and Results
Multilingual-BERT ("bert-multilingual-cased") is fine-tuned on SQuAD-es or SQuAD-es-small using the HuggingFace Transformers library (max 384 tokens per instance, batch size 12–16, learning rate 3e-5, three epochs).
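Performance is evaluated using Exact Match (EM) and token-level F1. For $N$ questions with predicted answers $\hat{a}_n$ and gold answers $a_n$:
$$\text{EM} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[\hat{a}_n = a_n]$$
$$\text{F1} = \frac{1}{N} \sum_{n=1}^{N} \frac{2 \, P_n R_n}{P_n + R_n}$$
where $P_n$ and $R_n$ are the precision and recall of the predicted answer tokens against the gold answer tokens for question $n$.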
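Per-question versions of both metrics can be sketched as follows. This simplified version only lowercases and splits on whitespace; the official SQuAD evaluation script applies further normalization (for example stripping punctuation and articles), so scores from this sketch are not directly comparable to the reported numbers. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the lowercased, stripped strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers agree; one empty answer scores zero.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level EM and F1 are then the averages of these per-question scores.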
On Spanish MLQA:
- TAR-train + mBERT (full): F1 = 68.1, EM = 48.3
- TAR-train + mBERT (small): F1 = 65.5, EM = 47.2
- mBERT zero/few-shot: F1 = 64.3, EM = 46.6
- translate-train: F1 = 53.9, EM = 37.4
- Prior state-of-the-art (XLM): F1 = 68.0, EM = 49.8
On Spanish XQuAD:
- TAR-train + mBERT (full): F1 = 77.6, EM = 61.8
- TAR-train + mBERT (small): F1 = 73.8, EM = 59.5
- Other mBERT-based baselines: F1 ≈ 59–74, EM ≈ 41–55
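These empirical findings indicate that mBERT fine-tuned on the TAR-built SQuAD-es corpus is competitive with the prior state of the art on Spanish MLQA (F1 68.1 vs. 68.0 for XLM, although EM is lower at 48.3 vs. 49.8) and outperforms the other mBERT-based baselines on Spanish XQuAD (Carrino et al., 2019).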
7. Implementation Details and Replicability
Critical components and hyperparameters:
- NMT training: Moses scripts for normalization/tokenization, Subword-nmt for BPE (50k merge operations), OpenNMT-py for the Transformer. Adam optimizer, Noam schedule, 4,000 warm-up steps, maximum sentence length of 80.
- Word alignment: efmaral with default MCMC settings, trained on tokenized NMT data to produce alignment priors.
- QA fine-tuning: bert-multilingual-cased, max length 384, learning rate 3e-5, 3 epochs, run on a single TITAN X GPU.
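The Noam schedule named in the NMT settings raises the learning rate linearly during warm-up and then decays it with the inverse square root of the step, following Vaswani et al. (2017). The sketch below uses the 4,000 warm-up steps stated above; the model dimension of 512 is the base Transformer value from that paper, and the `scale` multiplier is an assumed free parameter, since the source does not give the base learning rate.

```python
def noam_lr(
    step: int, d_model: int = 512, warmup: int = 4000, scale: float = 1.0
) -> float:
    """Learning rate at a given update step under the Noam schedule."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear growth for step < warmup, inverse-sqrt decay afterwards.
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of `min` meet at `step == warmup`, which is where the rate peaks.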
With these specifications and resources, the TAR pipeline can be adapted to generate high-coverage QA training resources for other languages using the same systematic methodology (Carrino et al., 2019).