AlignAR: Generative Sentence Alignment
- AlignAR is a generative sentence-alignment approach that reformulates mapping as a conditional inference problem using LLMs.
- The system combines sentence indexing, structured JSON output from LLMs, and human validation to manage many-to-many alignments.
- Empirical results show significant F₁ improvements on challenging Arabic–English corpora, especially for literary texts with low one-to-one correspondence.
AlignAR denotes a generative sentence-alignment method leveraging LLMs for robust mapping between source and target sentences in complex parallel corpora, as well as an associated Arabic–English dataset specifically curated for legal and literary texts. In contrast to prevailing alignment algorithms based on heuristics or embedding similarity, AlignAR operationalizes alignment as a generative inference problem, directly eliciting alignment structures from an LLM and refining them through targeted human interaction. On datasets with significant structural and translational divergence, AlignAR substantially outperforms classical and neural embedding-based baselines, particularly in scenarios requiring many-to-many alignment and resilience to paraphrastic or stylistic translation phenomena (Huang et al., 26 Dec 2025).
1. Generative Alignment Framework
The core innovation in AlignAR is the reframing of sentence alignment as the problem of directly modeling the conditional alignment distribution over sets of sentence-index pairs. Given a source document $S = (s_1, \dots, s_m)$ and a target document $T = (t_1, \dots, t_n)$, an alignment is a set $A = \{(A_k, B_k)\}_{k=1}^{K}$ where each tuple $(A_k, B_k)$ pairs a non-empty subset of source lines $A_k \subseteq \{1, \dots, m\}$ with a non-empty subset of target lines $B_k \subseteq \{1, \dots, n\}$.
AlignAR prompts the LLM to generate the most probable alignment $\hat{A} = \arg\max_A\, p(A \mid S, T)$, but in lieu of direct inference over the space of alignments, the LLM is instructed to output a structured (JSON) alignment specification given indexed source and target sentences. The approach admits a loss with respect to a gold alignment $A^{*}$, though in practice, prompt engineering and human validation are used instead of explicit likelihood optimization.
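The framework can be summarized in display form: a source document, a target document, an alignment as a set of non-empty index-set pairs, and a MAP objective over alignments. The notation below is a plausible reconstruction, since the paper's exact symbols are not reproduced here ($\theta$ denotes the LLM's parameters):

```latex
\begin{aligned}
&S = (s_1, \dots, s_m), \qquad T = (t_1, \dots, t_n), \\
&A = \{(A_k, B_k)\}_{k=1}^{K}, \quad
  \emptyset \ne A_k \subseteq \{1, \dots, m\}, \;
  \emptyset \ne B_k \subseteq \{1, \dots, n\}, \\
&\hat{A} = \arg\max_{A}\; p_\theta(A \mid S, T), \qquad
  \mathcal{L}(A^{*}) = -\log p_\theta(A^{*} \mid S, T).
\end{aligned}
```

The loss $\mathcal{L}(A^{*})$ is only nominal: as stated above, the system relies on prompting and human validation rather than gradient-based optimization of this quantity.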
2. System Architecture and Workflow
The AlignAR pipeline consists of three main stages:
- Sentence Indexing: Each sentence in both source and target is assigned a unique index to disambiguate mappings.
- Generative LLM Inference: The system constructs a prompt presenting the indexed sentences to a state-of-the-art LLM (e.g., Gemini-2.5-flash or GPT-5.1-mini) with instructions to generate the alignment as a mapping from source indices to lists of corresponding target indices, encoded in JSON.
- Ladder Construction and Human Validation: The raw LLM output is parsed to create a ladder structure of alignments. Skilled annotators employ an interface (“LLMAligner”) to perform Merge, Split, or Exchange operations, correcting errors and confirming gold-standard mappings. This step constitutes a light-touch, post-hoc validation not present in traditional purely automatic systems.
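The three stages above can be sketched end to end. Function names, the prompt wording, and the JSON schema below are illustrative assumptions rather than the paper's actual implementation:

```python
import json


def index_sentences(sentences):
    """Stage 1: attach a unique index to each sentence (e.g. '3: ...')."""
    return [f"{i}: {s}" for i, s in enumerate(sentences, start=1)]


def build_prompt(src_indexed, tgt_indexed):
    """Stage 2: build a prompt asking the LLM for a JSON list of index groups.
    The wording and schema are hypothetical reconstructions."""
    return (
        "Align the source and target sentences. Return JSON of the form "
        '[{"src": [..], "tgt": [..]}, ...] using the sentence indices.\n\n'
        "SOURCE:\n" + "\n".join(src_indexed)
        + "\n\nTARGET:\n" + "\n".join(tgt_indexed)
    )


def parse_ladder(llm_output):
    """Stage 3 (pre-validation): parse raw JSON into ladder rungs, i.e.
    (source-index-set, target-index-set) pairs, ready for annotator
    Merge/Split/Exchange operations."""
    rungs = json.loads(llm_output)
    return [(frozenset(r["src"]), frozenset(r["tgt"])) for r in rungs]


# A 2-to-1 merge as a toy example of a many-to-many rung.
raw = '[{"src": [1], "tgt": [1]}, {"src": [2, 3], "tgt": [2]}]'
ladder = parse_ladder(raw)
```

Using index sets (rather than single indices) on both sides is what lets one rung express splits, merges, and general many-to-many correspondences.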
This architecture exploits the generative, world-knowledge-rich capabilities of modern LLMs while retaining an efficient and scalable annotation workflow for high-precision use cases.
3. Arabic–English Parallel Corpus Construction
The empirical foundation of AlignAR’s evaluation is a newly sampled corpus of ten document pairs, divided into two distinctive subsets:
- Easy Subset (Legal Texts): Five statutes from the NCAR, encompassing 892 Arabic and 1,093 English sentences, characterized by high structural parallelism and predominantly one-to-one alignments (77–87%).
- Hard Subset (Literary Texts): Five text pairs (including Hayy ibn Yaqẓān and modern short stories) totaling 378 Arabic and 774 English sentences, with a much lower source-to-target sentence ratio (378/774 ≈ 0.49) and only 28–46% one-to-one alignments, thus necessitating accurate handling of many-to-many correspondences and significant linguistic divergence.
The corpus design is intended to stress-test alignment algorithms well beyond canonical legal and governmental text, underscoring the deficiencies of systems that assume monotonic or simple mapping structures.
4. Comparative Evaluation and Baselines
AlignAR is systematically benchmarked against three prominent non-generative alignment methods:
- BleuAlign [Sennrich & Volk 2010]: Utilizes machine translation of source sentences and BLEU scores between candidate pairs as local anchor scores, followed by dynamic programming to enforce global monotonicity and handle one-to-N mappings.
- VecAlign [Thompson & Koehn 2019]: Computes multilingual sentence embeddings (LASER) for both source and target, employs cosine similarity as local alignment costs, and leverages linear-time dynamic programming to allow for splits and merges across the alignment path.
- BertAlign [Liu & Zhu 2023]: Uses a multilingual BERT variant to compute pairwise contextualized embedding similarities, applying dynamic programming under monotonicity constraints to optimize alignments.
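The three baselines share a common skeleton: local similarity scores between sentence groups plus dynamic programming under a monotonicity constraint. The sketch below illustrates that skeleton with cosine similarity over arbitrary sentence vectors and only 1–1, 1–2, and 2–1 beads; it is not any system's actual code:

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def dp_align(src_vecs, tgt_vecs):
    """Monotonic DP alignment in the style of the embedding baselines,
    restricted to 1-1, 1-2, and 2-1 beads for brevity.
    Assumes (len(src_vecs), len(tgt_vecs)) is reachable with these moves."""
    m, n = len(src_vecs), len(tgt_vecs)
    NEG = float("-inf")
    score = [[NEG] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0.0

    def sim(si, sj, ti, tj):
        # Similarity of a bead: compare summed vectors of each side's group.
        s = [sum(v[k] for v in src_vecs[si:sj]) for k in range(len(src_vecs[0]))]
        t = [sum(v[k] for v in tgt_vecs[ti:tj]) for k in range(len(tgt_vecs[0]))]
        return cosine(s, t)

    for i in range(m + 1):
        for j in range(n + 1):
            if score[i][j] == NEG:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):  # allowed bead shapes
                ni, nj = i + di, j + dj
                if ni <= m and nj <= n:
                    cand = score[i][j] + sim(i, ni, j, nj)
                    if cand > score[ni][nj]:
                        score[ni][nj] = cand
                        back[ni][nj] = (i, j)

    # Trace back the best monotone path into (src-indices, tgt-indices) beads.
    beads, i, j = [], m, n
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi + 1, i + 1)), tuple(range(pj + 1, j + 1))))
        i, j = pi, pj
    return beads[::-1]
```

The monotonicity baked into this DP is precisely what the generative approach relaxes: an LLM emitting index sets directly is not forced to follow a single left-to-right path through the similarity matrix.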
All methods, including AlignAR, are evaluated using strict Precision (P), Recall (R), and micro-averaged F₁ metrics, requiring that predicted tuples exactly match the corresponding gold alignment sets.
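The strict matching criterion can be illustrated as follows: predicted and gold alignments are sets of (source-index-set, target-index-set) pairs, and a prediction earns credit only on an exact match (a minimal sketch; per-document pooling of counts for micro-averaging is elided):

```python
def strict_prf(pred, gold):
    """Precision/recall/F1 where a predicted tuple counts as correct
    only if it exactly matches a gold (source-set, target-set) pair."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


gold = {(frozenset({1}), frozenset({1})), (frozenset({2, 3}), frozenset({2}))}
pred = {(frozenset({1}), frozenset({1})), (frozenset({2}), frozenset({2}))}
p, r, f1 = strict_prf(pred, gold)  # partial overlap of the merged pair gets no credit
```

Note how unforgiving this is for many-to-many cases: aligning only sentence 2 of a gold 2-to-1 merge scores zero for that tuple, which is why the Hard subset separates systems so sharply.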
| Method | Easy Subset F₁ | Hard Subset F₁ |
|---|---|---|
| BleuAlign | >0.90 | 0.477 |
| VecAlign | >0.90 | 0.761 |
| BertAlign | >0.90 | 0.726 |
| GPT-5.1-mini | >0.90 | 0.767 |
| Gemini-2.5-flash | 0.993 | 0.855 |
On the structurally simple "Easy" (legal) subset, all systems perform near ceiling, highlighting the limited discriminatory power of such testbeds. On the "Hard" (literary) data, AlignAR's LLM-based generative inference (Gemini-2.5-flash) achieves F₁ = 0.855, an absolute gain of roughly 9 points over the best non-generative baseline (VecAlign at 0.761) and nearly 9 points over the next-best LLM (GPT-5.1-mini at 0.767).
5. Analysis of Robustness and Error Patterns
Conventional sentence aligners relying on sentence length, translation-based scoring, or embedding similarity degrade markedly (e.g., BleuAlign at F₁ = 0.477 and BertAlign at F₁ = 0.726 on the Hard subset) in the presence of creative translation, paraphrase, extensive sentence splitting/merging, or domain-driven divergence. The generative LLM approach in AlignAR remains robust by virtue of broad contextual comprehension, paraphrase resolution, handling of indirect mappings, and support for many-to-many or highly unbalanced correspondence patterns.
Differences between LLM variants (e.g., Gemini-2.5-flash vs GPT-5.1-mini) further indicate that model size, pretraining distribution, and instruction-tuning strategies contribute substantially to generative alignment quality.
6. Implications and Resource Availability
AlignAR demonstrates that the generative framing of sentence alignment with contemporary LLMs, supplemented by targeted human interventions, can yield state-of-the-art performance for challenging low-parallelism, high-variance parallel corpora. This approach is particularly advantageous for resource-poor language pairs and domains (e.g., literary translation) where mechanical or embedding-based strategies fall short.
All experimental datasets, code, and annotation tools are openly available, facilitating further research in parallel corpus construction and alignment (Huang et al., 26 Dec 2025).
7. Position Relative to the Field
AlignAR represents a distinct paradigm shift in sentence alignment by bypassing traditional similarity scoring protocols in favor of direct generation of alignment mappings by LLMs. This technique is orthogonal and complementary to embedding- and BLEU-based systems, offering improved flexibility in cases that involve significant paraphrastic or structural variation between source and target. The generative approach, especially when strategically combined with human post-editing, sets a new standard for constructing parallel resources for MT research and translation pedagogy, notably in low-resource, literary, or legal domains.