Translate-Align-Retrieve (TAR) Pipeline

Updated 31 May 2026

Translate-Align-Retrieve (TAR) is a three-stage method combining neural machine translation, unsupervised word alignment, and answer span projection to construct parallel extractive QA datasets.
It leverages large-scale English–Spanish parallel corpora and advanced preprocessing (e.g., BPE, Moses tokenization) to achieve nearly 100% data coverage with precise answer annotations.
Empirical evaluations show that QA models trained on TAR-generated datasets reach competitive or state-of-the-art results on cross-lingual benchmarks such as MLQA and XQuAD.

The Translate–Align–Retrieve (TAR) pipeline is a methodology for generating fully annotated, extractive question answering (QA) corpora in a target language by leveraging parallel corpora and word-level alignment. The approach was originally formulated to construct SQuAD-es, a Spanish translation of the Stanford Question Answering Dataset (SQuAD) v1.1, addressing the scarcity of large-scale non-English QA data. TAR decomposes the QA dataset transduction process into three sequential components: neural machine translation (NMT) of context and questions, unsupervised token-level alignment, and answer span projection from source to target text (Carrino et al., 2019).

1. Pipeline Architecture and Functional Stages

TAR processes an English QA triple $(c_{src}, q_{src}, a_{src})$ (context, question, answer span) to produce a target-language triple $(c_{tgt}, q_{tgt}, a_{tgt})$. The pipeline comprises:

1. Translation: Both context and question are translated using an NMT model.

2. Alignment: An unsupervised word-alignment model computes token-level mapping between each source sentence and its translation.

3. Retrieval (Answer Projection): The original answer span indices $(a_{src}^{start}, a_{src}^{end})$ are projected into the target context using alignment data to extract $a_{tgt}$.

The end-to-end result is a synthetic, parallel QA resource maintaining nearly 100% data coverage with fine-grained answer annotation.
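The three stages compose naturally as a single function. The following is a minimal sketch of that composition, not the authors' implementation: `QAExample`, `run_tar`, and the `toy_*` stand-ins are hypothetical names introduced here, and the toy translator and aligner only exist so the example runs without an NMT model or an alignment tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]         # inclusive (start, end) token indices (assumed convention)
Links = List[Tuple[int, int]]  # (source index, target index) alignment pairs


@dataclass
class QAExample:
    context: List[str]
    question: List[str]
    answer: Span


def run_tar(
    example: QAExample,
    translate: Callable[[List[str]], List[str]],
    align: Callable[[List[str], List[str]], Links],
    project: Callable[[Links, Span], Optional[Span]],
) -> Optional[QAExample]:
    """Compose the Translate, Align, and Retrieve stages for one example."""
    c_tgt = translate(example.context)      # stage 1: translate context
    q_tgt = translate(example.question)     # stage 1: translate question
    links = align(example.context, c_tgt)   # stage 2: token-level alignment
    span = project(links, example.answer)   # stage 3: answer span projection
    if span is None:                        # no aligned target token: drop example
        return None
    return QAExample(c_tgt, q_tgt, span)


# Toy stand-ins (hypothetical): word-by-word lexicon and monotone alignment.
LEXICON = {"where": "dónde", "is": "está", "in": "en", "spain": "españa"}


def toy_translate(tokens: List[str]) -> List[str]:
    return [LEXICON.get(tok, tok) for tok in tokens]


def toy_align(src: List[str], tgt: List[str]) -> Links:
    return [(i, i) for i in range(min(len(src), len(tgt)))]


def toy_project(links: Links, span: Span) -> Optional[Span]:
    hits = [j for i, j in links if span[0] <= i <= span[1]]
    return (min(hits), max(hits)) if hits else None


example = QAExample(["madrid", "is", "in", "spain"], ["where", "is", "madrid"], (3, 3))
result = run_tar(example, toy_translate, toy_align, toy_project)
print(result.context[result.answer[0]])  # → españa
```

Passing the stages in as callables keeps the sketch independent of any particular translation or alignment toolkit.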

2. Translation Stage: Data, Model, and Preprocessing

The translation component uses a Transformer-based NMT model trained on approximately 6.5 million English–Spanish parallel sentence pairs drawn from WikiMatrix, TED-2013, News-Commentary, Tatoeba, and OpenSubtitles.

Data preprocessing includes:

Punctuation normalization
Moses tokenization and truecasing
Joint byte-pair encoding (BPE) with 50k merge operations
Sentence length capped at 80 tokens; duplicates removed
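The length cap and duplicate removal can be illustrated with a short filter. This is a sketch under stated assumptions: sentences are taken to be already tokenized and whitespace-separated, the 80-token cap is applied to both sides of a pair, and "duplicate" is read as an exact repeat of the (source, target) pair. The function name `filter_pairs` is hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

MAX_LEN = 80  # token cap stated in the text


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs that are non-empty, within the length cap, and unseen."""
    seen: Set[Tuple[str, str]] = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs with an empty side
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # skip over-long sentences
        if (src, tgt) in seen:
            continue  # skip exact duplicate pairs
        seen.add((src, tgt))
        yield src, tgt


corpus = [
    ("the house", "la casa"),
    ("the house", "la casa"),              # duplicate
    ("word " * 81, "palabra " * 81),       # 81 tokens, above the cap
    ("good morning", "buenos días"),
]
kept = list(filter_pairs(corpus))
print(len(kept))  # → 2
```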

The NMT model has the following settings:

Base Transformer architecture per Vaswani et al. (2017)
OpenNMT-py toolkit, default hyperparameters
Shared source–target vocabulary/embeddings
Trained for 200,000 update steps on a single GeForce GTX TITAN X GPU
Checkpoint averaging over the last three snapshots
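Checkpoint averaging takes the element-wise mean of the parameters stored in several snapshots. The sketch below shows the arithmetic on plain Python lists; real checkpoints hold tensors, and the name `average_checkpoints` and the dictionary layout are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]  # parameter name -> flattened values (assumed layout)


def average_checkpoints(checkpoints: Sequence[Params]) -> Params:
    """Return the element-wise mean of the parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Params = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th value of every checkpoint together
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


last_three = [
    {"w": [1.0, 2.0]},
    {"w": [2.0, 4.0]},
    {"w": [3.0, 6.0]},
]
print(average_checkpoints(last_three))  # → {'w': [2.0, 4.0]}
```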

Model performance on a held-out test set is BLEU 45.60.

The NMT training objective is the standard cross-entropy loss over the target-side sequence $y_1 \ldots y_N$ given the source sentence $x$:

$L_{MT} = -\sum_{t=1}^{N}\log\,p_{\theta}(y_t \mid y_{<t}, x)$
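As a small worked example of this loss, the sketch below sums the negative log-probabilities that a model assigns to each reference target token. The probabilities are made-up illustrative values, and `sequence_nll` is a hypothetical helper name.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a target sequence.

    token_probs[t] is the model probability of the reference token y_t
    given the previous tokens and the source sentence.
    """
    return -sum(math.log(p) for p in token_probs)


# Three target tokens with probabilities 0.5, 0.25, 0.5:
# loss = -(ln 0.5 + ln 0.25 + ln 0.5) = ln 16
loss = sequence_nll([0.5, 0.25, 0.5])
print(round(loss, 4))  # → 2.7726
```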

3. Alignment Stage: Methodology and Probability Model

Alignment relies on efmaral, a rapid implementation of eflomal, which is an unsupervised Bayesian word-alignment model trained via Markov Chain Monte Carlo methods on the tokenized training data.

Alignment model details:

A source sentence $s = s_1 \ldots s_I$ and target $t = t_1 \ldots t_J$ are aligned so that each target token $t_j$ is mapped to a source token $s_{a(j)}$.
The alignment $a$ is scored in an IBM-Model-1 fashion:

$p(t, a \mid s) \propto \prod_{j=1}^{J} p(t_j \mid s_{a(j)})$

or equivalently

$\log p(t, a \mid s) = \sum_{j=1}^{J} \log p(t_j \mid s_{a(j)}) + \text{const}$

efmaral extends this with Bayesian priors and distortion parameters, inferred via Gibbs sampling.
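The IBM-Model-1-style score can be computed directly from a lexical translation table. The sketch below covers only that lexical term (no priors or distortion), and the table values, the probability floor, and the name `alignment_log_score` are illustrative assumptions rather than anything produced by the alignment tool.

```python
import math
from typing import Dict, List, Tuple

TTable = Dict[Tuple[str, str], float]  # (target word, source word) -> p(t | s)


def alignment_log_score(
    src: List[str],
    tgt: List[str],
    alignment: List[int],
    t_table: TTable,
    floor: float = 1e-12,
) -> float:
    """Sum of log p(t_j | s_a(j)) over target positions j.

    alignment[j] is the source index a(j) that target token j maps to.
    Unseen word pairs receive a small floor probability (an assumption).
    """
    if len(alignment) != len(tgt):
        raise ValueError("alignment must have one entry per target token")
    total = 0.0
    for j, i in enumerate(alignment):
        total += math.log(t_table.get((tgt[j], src[i]), floor))
    return total


src = ["the", "house"]
tgt = ["la", "casa"]
table = {("la", "the"): 0.5, ("casa", "house"): 0.8}

good = alignment_log_score(src, tgt, [0, 1], table)  # la->the, casa->house
bad = alignment_log_score(src, tgt, [1, 0], table)   # la->house, casa->the
print(good > bad)  # → True
```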

Since truecasing and BPE-based tokenization may split words, token-to-word mappings are maintained explicitly so alignment can be collapsed to the word level.
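Collapsing a subword-level alignment to the word level can be sketched as follows. The example assumes the subword-nmt convention in which every non-final piece of a word carries a trailing `@@`; the function names are hypothetical.

```python
from typing import List, Tuple

Links = List[Tuple[int, int]]  # (source index, target index) pairs


def subword_to_word_index(tokens: List[str]) -> List[int]:
    """Map each subword position to the index of the word it belongs to."""
    mapping: List[int] = []
    word_idx = 0
    for tok in tokens:
        mapping.append(word_idx)
        if not tok.endswith("@@"):  # a token without '@@' closes the current word
            word_idx += 1
    return mapping


def collapse_alignment(links: Links, src_tokens: List[str], tgt_tokens: List[str]) -> Links:
    """Convert subword-level links into deduplicated word-level links."""
    src_map = subword_to_word_index(src_tokens)
    tgt_map = subword_to_word_index(tgt_tokens)
    return sorted({(src_map[i], tgt_map[j]) for i, j in links})


src_bpe = ["un@@", "believ@@", "able", "story"]  # words: unbelievable, story
tgt_bpe = ["historia", "in@@", "creíble"]         # words: historia, increíble
subword_links = [(0, 1), (1, 2), (2, 2), (3, 0)]

print(collapse_alignment(subword_links, src_bpe, tgt_bpe))  # → [(0, 1), (1, 0)]
```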

4. Retrieval Stage: Answer Span Projection

The answer projection step maps English answer span indices $(a_{src}^{start}, a_{src}^{end})$ in $c_{src}$ to target indices in $c_{tgt}$.

Procedure:

For each position $i \in [a_{src}^{start}, a_{src}^{end}]$, retrieve the set of aligned target tokens $T_i = \{\, j : a(j) = i \,\}$.
Set $a_{tgt}^{start} = \min \bigcup_i T_i$, $a_{tgt}^{end} = \max \bigcup_i T_i$, to handle possible reordering.
Extract $c_{tgt}[a_{tgt}^{start} : a_{tgt}^{end}]$ as the projected answer.
If the text does not exactly match any substring (due to translation variation), the method falls back to returning this substring regardless.

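Pseudo-algorithm:

Input: alignment $a$, source span $(a_{src}^{start}, a_{src}^{end})$, target context $c_{tgt}$
1. $T \leftarrow \{\, j : a_{src}^{start} \le a(j) \le a_{src}^{end} \,\}$
2. If $T = \emptyset$, discard the example
3. $a_{tgt}^{start} \leftarrow \min T$, $a_{tgt}^{end} \leftarrow \max T$
4. Return $c_{tgt}[a_{tgt}^{start} : a_{tgt}^{end}]$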
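The projection procedure can be sketched in Python as follows. This is an illustrative reading of the steps rather than the authors' code: alignment links are assumed to be word-level (source index, target index) pairs, span indices are assumed inclusive, and `project_span` and `extract_answer` are hypothetical names.

```python
from typing import List, Optional, Tuple

Links = List[Tuple[int, int]]  # (source index, target index) pairs


def project_span(links: Links, src_start: int, src_end: int) -> Optional[Tuple[int, int]]:
    """Project an inclusive source span onto the target via alignment links."""
    targets = [j for i, j in links if src_start <= i <= src_end]
    if not targets:
        return None  # nothing aligned to the answer: no span can be projected
    # min/max rather than first/last, so reordered translations are covered
    return min(targets), max(targets)


def extract_answer(tgt_tokens: List[str], span: Tuple[int, int]) -> str:
    """Return the target tokens inside the inclusive span as a string."""
    start, end = span
    return " ".join(tgt_tokens[start : end + 1])


src = ["she", "visited", "the", "white", "house"]
tgt = ["visitó", "la", "casa", "blanca"]
links = [(1, 0), (2, 1), (3, 3), (4, 2)]  # white->blanca, house->casa (reordered)

span = project_span(links, 3, 4)          # source answer: "white house"
print(extract_answer(tgt, span))          # → casa blanca
```

Taking the minimum and maximum aligned positions yields a contiguous span even when the adjective and noun swap order, at the cost of occasionally including unaligned tokens that fall between them.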

5. Dataset Construction and Analytical Statistics

TAR produces two primary resources:

Corpus	Examples	Construction Criterion
SQuAD-es v1.1 (full)	87,595/87,599 (~100%)	All projected spans via alignment, including fallback
SQuAD-es v1.1 (small)	46,260 (~53%)	Only examples with verbatim answer occurrences in $c_{tgt}$

The average context length is 140 tokens, average question length is 13, and average answer length is 4 tokens. Post-processing includes removing overlapping-span errors (trimming punctuation, removing adjacent-sentence tokens) and discarding empty answer spans.
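The punctuation trimming and empty-span filtering mentioned above might look like the sketch below. The set of characters to strip and the name `clean_answer` are assumptions made for illustration; the source does not list the exact rules.

```python
from typing import Optional

# Characters trimmed from both ends of a projected answer (assumed set).
EDGE_CHARS = " \t\n.,;:!?¿¡\"'«»"


def clean_answer(answer: str) -> Optional[str]:
    """Trim edge punctuation and whitespace; return None if nothing is left."""
    trimmed = answer.strip(EDGE_CHARS)
    return trimmed or None


candidates = [", casa blanca.", "«Madrid»", " ., ", "en 1990"]
cleaned = [c for c in (clean_answer(a) for a in candidates) if c is not None]
print(cleaned)  # → ['casa blanca', 'Madrid', 'en 1990']
```

Only edge characters are stripped, so punctuation inside an answer (for example a decimal point within a number) is left untouched.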

6. Empirical Evaluation: QA Models, Metrics, and Results

Multilingual-BERT ("bert-multilingual-cased") is fine-tuned on SQuAD-es or SQuAD-es-small using the HuggingFace Transformers library (max 384 tokens per instance, batch size 12–16, learning rate $3 \times 10^{-5}$, for three epochs).

Performance is evaluated using Exact Match (EM) and token-level F1:

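$\mathrm{EM} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[\hat{a}_i = a_i]$

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$

where $M$ is the number of evaluation examples, $\hat{a}_i$ and $a_i$ are the predicted and gold answers, and $P$ and $R$ are the token-level precision and recall of the prediction against the gold answer.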
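A simplified version of the two metrics is sketched below for a single prediction. It only lowercases and splits on whitespace; the official SQuAD evaluation applies further normalization (punctuation and article removal), which is omitted here, and the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the lowercased token sequences are identical, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction has one extra token: P = 2/3, R = 1, F1 = 0.8
print(exact_match("la casa blanca", "casa blanca"))         # → 0.0
print(round(token_f1("la casa blanca", "casa blanca"), 2))  # → 0.8
```

Corpus-level scores are the averages of these per-example values over the evaluation set.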

On Spanish MLQA:

TAR-train + mBERT (full): F1 = 68.1, EM = 48.3
TAR-train + mBERT (small): F1 = 65.5, EM = 47.2
mBERT zero/few-shot: F1 = 64.3, EM = 46.6
translate-train: F1 = 53.9, EM = 37.4
Prior state-of-the-art (XLM): F1 = 68.0, EM = 49.8

On Spanish XQuAD:

TAR-train + mBERT (full): F1 = 77.6, EM = 61.8
TAR-train + mBERT (small): F1 = 73.8, EM = 59.5
Other mBERT-based baselines: F1 ≈ 59–74, EM ≈ 41–55

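These empirical findings indicate that training on the SQuAD-es corpus constructed via TAR yields state-of-the-art or competitive results on cross-lingual extractive QA benchmarks: on Spanish XQuAD the full model exceeds the other mBERT-based baselines in both F1 and EM, while on Spanish MLQA it matches the prior XLM result in F1 (68.1 vs. 68.0) but trails it in EM (48.3 vs. 49.8) (Carrino et al., 2019).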

7. Implementation Details and Replicability

Critical components and hyperparameters:

NMT training: Moses scripts for normalization/tokenization, Subword-nmt for BPE (50k merge operations), OpenNMT-py for Transformer. Adam optimizer, Noam schedule, 4,000 warm-up steps, maximum sentence length of 80.
Word alignment: efmaral with default MCMC settings, trained on tokenized NMT data to produce alignment priors.
QA fine-tuning: bert-multilingual-cased, max length 384, learning rate 3e-5, 3 epochs, run on a single TITAN X GPU.
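The Noam schedule named in the NMT settings follows the learning-rate formula of Vaswani et al. (2017): a linear warm-up followed by inverse-square-root decay. The sketch below uses the 4,000 warm-up steps from the list above; `d_model = 512` is the base Transformer width, and the `factor` multiplier is an assumption, since toolkits typically scale the curve by a configurable base learning rate.

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000, factor: float = 1.0) -> float:
    """Learning rate at a given update step under the Noam schedule.

    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises linearly until the warm-up step, then decays as 1/sqrt(step).
for step in (1000, 4000, 16000, 200000):
    print(step, f"{noam_lr(step):.6f}")
```

The peak occurs exactly at the warm-up step, where both arguments of `min` coincide.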

With these specifications and resources, the TAR pipeline can be adapted to generate high-coverage QA training resources for other languages using the same systematic methodology (Carrino et al., 2019).


References

Carrino et al. (2019). Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering.
