
Translate-Align-Retrieve (TAR) Pipeline

Updated 31 May 2026
  • Translate-Align-Retrieve (TAR) is a three-stage method combining neural machine translation, unsupervised word alignment, and answer span projection to construct parallel extractive QA datasets.
  • It leverages large-scale English–Spanish parallel corpora and advanced preprocessing (e.g., BPE, Moses tokenization) to achieve nearly 100% data coverage with precise answer annotations.
  • Empirical evaluations show that QA models trained on TAR-generated datasets set new benchmarks on cross-lingual tasks such as SQuAD-es and XQuAD.

The Translate–Align–Retrieve (TAR) pipeline is a principled methodology for generating fully-annotated, extractive question answering (QA) corpora in a target language by leveraging parallel corpora and word-level alignment. The approach was originally formulated to construct SQuAD-es, a Spanish translation of the Stanford Question Answering Dataset (SQuAD) v1.1, addressing the acute resource scarcity in large-scale non-English QA data. TAR decomposes the QA dataset transduction process into three sequential components: neural machine translation (NMT) of context and questions, unsupervised token-level alignment, and answer span projection from source to target text (Carrino et al., 2019).

1. Pipeline Architecture and Functional Stages

TAR processes an English QA triple $(c_{src}, q_{src}, a_{src})$ (context, question, answer span) to produce a target-language triple $(c_{tgt}, q_{tgt}, a_{tgt})$. The pipeline comprises:

1. Translation: Both context and question are translated using an NMT model.

2. Alignment: An unsupervised word-alignment model computes token-level mapping between each source sentence and its translation.

3. Retrieval (Answer Projection): The original answer span indices $(a_{src}^{start}, a_{src}^{end})$ are projected into the target context using alignment data to extract $a_{tgt}$.

The end-to-end result is a synthetic, parallel QA resource maintaining nearly 100% data coverage with fine-grained answer annotation.
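The three stages can be sketched as a single function that takes the translation and alignment components as pluggable callables. This is a minimal illustration, not the authors' implementation: the names `QAExample`, `tar`, `toy_translate`, and `toy_align` are hypothetical, and the toy stand-ins replace the NMT model and the word aligner with a word-for-word dictionary and a monotone alignment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Links = List[Tuple[int, int]]  # (source index, target index) token pairs


@dataclass
class QAExample:
    context: List[str]    # tokenized context
    question: List[str]   # tokenized question
    answer_start: int     # inclusive token index into context
    answer_end: int       # inclusive token index into context


def tar(example: QAExample,
        translate: Callable[[List[str]], List[str]],
        align: Callable[[List[str], List[str]], Links]) -> Optional[QAExample]:
    # Stage 1: translate context and question.
    c_tgt = translate(example.context)
    q_tgt = translate(example.question)
    # Stage 2: token-level alignment between source and translated context.
    links = align(example.context, c_tgt)
    # Stage 3: project the answer span through the alignment links.
    hits = [j for i, j in links
            if example.answer_start <= i <= example.answer_end]
    if not hits:
        return None  # no aligned target token, so no answer can be projected
    return QAExample(c_tgt, q_tgt, min(hits), max(hits))


# Toy stand-ins (hypothetical): a word-for-word lexicon and a monotone alignment.
LEXICON = {"the": "el", "cat": "gato", "sleeps": "duerme", "who": "quién"}


def toy_translate(tokens: List[str]) -> List[str]:
    return [LEXICON[t] for t in tokens]


def toy_align(src: List[str], tgt: List[str]) -> Links:
    return [(i, i) for i in range(min(len(src), len(tgt)))]


example = QAExample(["the", "cat", "sleeps"], ["who", "sleeps"], 1, 1)
result = tar(example, toy_translate, toy_align)
```

Passing the components as callables keeps the sketch independent of any particular NMT toolkit or aligner; in a real setting they would wrap the trained models described in the following sections.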

2. Translation Stage: Data, Model, and Preprocessing

The translation component uses a Transformer-based NMT model trained on approximately 6.5 million English–Spanish parallel sentence pairs drawn from WikiMatrix, TED-2013, News-Commentary, Tatoeba, and OpenSubtitles.

Data preprocessing includes:

  • Punctuation normalization
  • Moses tokenization and truecasing
  • Joint byte-pair encoding (BPE) with 50k merge operations
  • Sentence length capped at 80 tokens; duplicates removed
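The length cap and duplicate removal can be illustrated with a short filter. This is a sketch under stated assumptions: sentences are already tokenized and whitespace-joined, the cap is applied to both sides of each pair, and a duplicate means an identical (source, target) pair. The source does not specify these details, and the function name `filter_pairs` is hypothetical.

```python
from typing import Iterable, List, Tuple


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 max_len: int = 80) -> List[Tuple[str, str]]:
    """Drop over-long sentence pairs and exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Assumption: the cap applies to both the source and the target side.
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


corpus = [
    ("the house", "la casa"),
    ("the house", "la casa"),           # exact duplicate
    ("word " * 81, "palabra " * 81),    # 81 tokens, over the cap
    ("a cat", "un gato"),
]
cleaned = filter_pairs(corpus)
```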

The NMT model has the following settings:

  • Base Transformer architecture per Vaswani et al. (2017)
  • OpenNMT-py toolkit, default hyperparameters
  • Shared source–target vocabulary/embeddings
  • Trained for 200,000 update steps on a single GeForce GTX TITAN X GPU
  • Checkpoint averaging over the last three snapshots
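Checkpoint averaging takes the element-wise mean of the parameters stored in several snapshots. The sketch below shows the arithmetic on plain Python lists; it is not the OpenNMT-py routine, and the parameter name `"w"` and the function name `average_checkpoints` are illustrative only.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of each parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the k-th value of the parameter from every checkpoint.
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


last_three = [
    {"w": [1.0, 2.0]},
    {"w": [2.0, 4.0]},
    {"w": [3.0, 6.0]},
]
avg = average_checkpoints(last_three)
```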

Model performance on a held-out test set is BLEU 45.60.

The NMT training objective is the standard cross-entropy loss over the target-side sequence of length $N$:

$$L_{MT} = -\sum_{t=1}^{N}\log\,p_{\theta}(y_t \mid y_{<t}, x)$$
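As a worked example of this loss, suppose a model assigns probabilities 0.5, 0.25, and 0.8 to the three reference tokens of a target sentence (illustrative values, not taken from the source). The loss is the negative sum of their logarithms, which equals the negative log of their product, 0.1.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a target sequence, given the probability
    the model assigned to each reference token."""
    return -sum(math.log(p) for p in token_probs)


# 0.5 * 0.25 * 0.8 = 0.1, so the loss is -log(0.1) = log(10).
loss = sequence_nll([0.5, 0.25, 0.8])
```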

3. Alignment Stage: Methodology and Probability Model

Alignment relies on efmaral, a rapid implementation of eflomal, which is an unsupervised Bayesian word-alignment model trained via Markov Chain Monte Carlo methods on the tokenized training data.

Alignment model details:

  • A source sentence $s = s_1 \ldots s_I$ and target $t = t_1 \ldots t_J$ are aligned so that each target token $t_j$ is mapped to a source token $s_{a(j)}$.
  • The alignment $a$ is scored in an IBM-Model-1 fashion:

$$P(t, a \mid s) = \prod_{j=1}^{J} p(t_j \mid s_{a(j)})$$

or equivalently

$$\log P(t, a \mid s) = \sum_{j=1}^{J} \log p(t_j \mid s_{a(j)})$$

  • efmaral extends this with Bayesian priors and distortion parameters, inferred via Gibbs sampling.
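The lexical part of this score, the sum over target tokens of $\log p(t_j \mid s_{a(j)})$, can be computed directly once a translation table is available. The sketch below covers only that IBM-Model-1-style term; it omits the Bayesian priors, distortion parameters, and Gibbs sampling mentioned above, and the table values are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def alignment_log_prob(source: List[str],
                       target: List[str],
                       alignment: List[int],
                       t_table: Dict[Tuple[str, str], float]) -> float:
    """Sum of log p(t_j | s_a(j)), where alignment[j] is the source index a(j)."""
    return sum(
        math.log(t_table[(target[j], source[alignment[j]])])
        for j in range(len(target))
    )


source = ["the", "house"]
target = ["la", "casa"]
# Hypothetical lexical translation probabilities p(target word | source word).
t_table = {
    ("la", "the"): 0.7, ("casa", "house"): 0.8,
    ("la", "house"): 0.1, ("casa", "the"): 0.05,
}

monotone = alignment_log_prob(source, target, [0, 1], t_table)  # la-the, casa-house
swapped = alignment_log_prob(source, target, [1, 0], t_table)   # la-house, casa-the
```

With these values the monotone alignment scores higher than the swapped one, which is the comparison an aligner makes when choosing among candidate alignments.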

Since truecasing and BPE-based tokenization may split words, token-to-word mappings are maintained explicitly so alignment can be collapsed to the word level.
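A minimal sketch of collapsing a subword-level alignment to the word level follows. It assumes BPE pieces carry the `@@` continuation marker that subword-nmt appends to non-final pieces by default; the function names and the example links are hypothetical, and the source does not state at which granularity the aligner was actually run.

```python
from typing import List, Tuple


def subword_to_word_index(subwords: List[str]) -> List[int]:
    """Map each subword position to the index of the word it belongs to."""
    mapping = []
    word = 0
    for piece in subwords:
        mapping.append(word)
        if not piece.endswith("@@"):  # a piece without the marker closes a word
            word += 1
    return mapping


def collapse_alignment(links: List[Tuple[int, int]],
                       src_subwords: List[str],
                       tgt_subwords: List[str]) -> List[Tuple[int, int]]:
    """Convert subword-level links to deduplicated word-level links."""
    src_map = subword_to_word_index(src_subwords)
    tgt_map = subword_to_word_index(tgt_subwords)
    return sorted({(src_map[i], tgt_map[j]) for i, j in links})


src = ["un@@", "believ@@", "able", "story"]   # "unbelievable story"
tgt = ["historia", "incre@@", "íble"]         # "historia increíble"
subword_links = [(0, 1), (1, 1), (2, 2), (3, 0)]
word_links = collapse_alignment(subword_links, src, tgt)
```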

4. Retrieval Stage: Answer Span Projection

The answer projection step maps English answer span indices $(a_{src}^{start}, a_{src}^{end})$ in $c_{src}$ to target indices in $c_{tgt}$.

Procedure:

  1. For each position $i \in [a_{src}^{start}, a_{src}^{end}]$, retrieve the set of aligned target tokens $A(i) = \{ j : a(j) = i \}$.
  2. Set $a_{tgt}^{start} = \min \bigcup_i A(i)$, $a_{tgt}^{end} = \max \bigcup_i A(i)$, to handle possible reordering.
  3. Extract $c_{tgt}[a_{tgt}^{start} : a_{tgt}^{end}]$ as the projected answer.
  4. If the translated answer text does not exactly match any substring of the target context (due to translation variation), the method falls back to returning the alignment-projected substring regardless.

Pseudo-algorithm:

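    input: alignment links A, source span (a_src^start, a_src^end), target context c_tgt
    T <- { j : (i, j) in A and a_src^start <= i <= a_src^end }
    if T is empty: return no answer
    a_tgt^start <- min(T)
    a_tgt^end <- max(T)
    return c_tgt[a_tgt^start .. a_tgt^end]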
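The projection procedure can be written as a short runnable sketch. It assumes word-level alignment links given as (source index, target index) pairs and inclusive span indices; the function names and the example sentence are illustrative rather than taken from the source.

```python
from typing import List, Optional, Tuple


def project_span(links: List[Tuple[int, int]],
                 src_start: int,
                 src_end: int) -> Optional[Tuple[int, int]]:
    """Project an inclusive source span onto the target side.

    Taking the minimum and maximum aligned target index keeps the projected
    span contiguous even when translation reorders the words.
    """
    targets = [j for i, j in links if src_start <= i <= src_end]
    if not targets:
        return None  # nothing aligned: the projected answer would be empty
    return min(targets), max(targets)


def extract_answer(tgt_tokens: List[str], span: Tuple[int, int]) -> str:
    start, end = span
    return " ".join(tgt_tokens[start:end + 1])


src_tokens = ["he", "visited", "the", "white", "house"]
tgt_tokens = ["visitó", "la", "casa", "blanca"]
# "white" and "house" swap order in Spanish ("casa blanca").
links = [(1, 0), (2, 1), (3, 3), (4, 2)]

span = project_span(links, 2, 4)          # source answer: "the white house"
answer = extract_answer(tgt_tokens, span)
```

The example shows why the minimum and maximum are used instead of the images of the two endpoints alone: the last source token, "house", aligns to target position 2, which would cut off "blanca".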

5. Dataset Construction and Analytical Statistics

TAR produces two primary resources:

| Corpus | Examples | Construction criterion |
|---|---|---|
| SQuAD-es v1.1 (full) | 87,595 of 87,599 (~100%) | All projected spans via alignment, including fallback |
| SQuAD-es v1.1 (small) | 46,260 (~53%) | Only examples with verbatim answer occurrences in $c_{tgt}$ |

The average context length is 140 tokens, average question length is 13, and average answer length is 4 tokens. Post-processing includes removing overlapping-span errors (trimming punctuation, removing adjacent-sentence tokens) and discarding empty answer spans.
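Part of this post-processing, trimming boundary punctuation and discarding answers that end up empty, can be sketched as follows. The exact rules used for SQuAD-es are not given in the source, so the set of trimmed characters is an assumption, and the removal of adjacent-sentence tokens is not covered. The function name `clean_answer` is hypothetical.

```python
import string
from typing import Optional

# Assumption: ASCII punctuation, whitespace, and the Spanish inverted marks
# are trimmed from both ends of a projected answer.
TRIM_CHARS = string.punctuation + string.whitespace + "¿¡"


def clean_answer(answer: str) -> Optional[str]:
    """Trim boundary punctuation; return None when nothing is left."""
    trimmed = answer.strip(TRIM_CHARS)
    return trimmed or None


kept = clean_answer(" , Madrid.")
dropped = clean_answer(" ., ")
```

Stripping only at the boundaries leaves punctuation inside the answer untouched, for example the comma in "Madrid, España".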

6. Empirical Evaluation: QA Models, Metrics, and Results

Multilingual-BERT ("bert-multilingual-cased") is fine-tuned on SQuAD-es or SQuAD-es-small using the HuggingFace Transformers library (max 384 tokens per instance, batch size 12–16, learning rate $3 \times 10^{-5}$, for three epochs).

Performance is evaluated using Exact Match (EM) and token-level F1:

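$$EM = \mathbb{1}[\text{prediction} = \text{gold answer}]$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$

where $P$ and $R$ are the token-level precision and recall of the predicted answer against the gold answer.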
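Both metrics can be computed per example as in the sketch below. The normalization here is deliberately simplified to lowercasing and whitespace splitting, which is an assumption: the official SQuAD evaluation script also strips punctuation and English articles, and the normalization used for Spanish is not specified in the source.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Simplified normalization (assumption): lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("la casa blanca", "casa blanca")
f1 = f1_score("la casa blanca", "casa blanca")
```

In this example two of the three predicted tokens are correct and both gold tokens are recovered, so precision is 2/3, recall is 1, and F1 is 0.8, while EM is 0. Corpus-level scores are the averages of these per-example values.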

On Spanish MLQA:

  • TAR-train + mBERT (full): F1 = 68.1, EM = 48.3
  • TAR-train + mBERT (small): F1 = 65.5, EM = 47.2
  • mBERT zero/few-shot: F1 = 64.3, EM = 46.6
  • translate-train: F1 = 53.9, EM = 37.4
  • Prior state-of-the-art (XLM): F1 = 68.0, EM = 49.8

On Spanish XQuAD:

  • TAR-train + mBERT (full): F1 = 77.6, EM = 61.8
  • TAR-train + mBERT (small): F1 = 73.8, EM = 59.5
  • Other mBERT-based baselines: F1 ≈ 59–74, EM ≈ 41–55

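These results indicate that mBERT fine-tuned on the full SQuAD-es corpus is competitive with the prior state of the art on Spanish MLQA (F1 68.1 vs. 68.0 for XLM, although XLM retains the higher EM, 49.8 vs. 48.3) and exceeds the other mBERT-based baselines on Spanish XQuAD (Carrino et al., 2019).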

7. Implementation Details and Replicability

Critical components and hyperparameters:

  • NMT training: Moses scripts for normalization/tokenization, subword-nmt for BPE (50k merge operations), OpenNMT-py for the Transformer. Adam optimizer, Noam schedule, 4,000 warm-up steps, maximum sentence length of 80.
  • Word alignment: efmaral with default MCMC settings, trained on tokenized NMT data to produce alignment priors.
  • QA fine-tuning: bert-multilingual-cased, max length 384, learning rate 3e-5, 3 epochs, run on a single TITAN X GPU.
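The Noam schedule named in the NMT settings increases the learning rate linearly during warm-up and then decays it with the inverse square root of the step. The sketch below follows the formula from Vaswani et al. (2017) with 4,000 warm-up steps; the model dimension of 512 is the base Transformer value, and the `scale` multiplier is a hypothetical parameter, since the source does not give the base learning rate.

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000,
            scale: float = 1.0) -> float:
    """Noam schedule: linear warm-up, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


peak = noam_lr(4000)      # the two branches meet at the end of warm-up
early = noam_lr(1000)     # a quarter of the way through warm-up
late = noam_lr(16000)     # four times the warm-up length
```

The rate peaks at step 4,000; at step 1,000 it is a quarter of the peak, and at step 16,000 it has decayed to half of the peak.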

With these specifications and resources, the TAR pipeline can be adapted to generate high-coverage QA training resources for other languages using the same systematic methodology (Carrino et al., 2019).
