XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages
(2202.00291v2)
Published 1 Feb 2022 in cs.CL
Abstract: Multiple critical scenarios (like Wikipedia text generation given English Infoboxes) need automated generation of descriptive text in low resource (LR) languages from English fact triples. Previous work has focused on English fact-to-text (F2T) generation. To the best of our knowledge, there has been no previous attempt on cross-lingual alignment or generation for LR languages. Building an effective cross-lingual F2T (XF2T) system requires alignment between English structured facts and LR sentences. We propose two unsupervised methods for cross-lingual alignment. We contribute XALIGN, an XF2T dataset with 0.45M pairs across 8 languages, of which 5402 pairs have been manually annotated. We also train strong baseline XF2T generation models on the XAlign dataset.
This paper introduces the novel task of cross-lingual fact-to-text (XF2T) generation: producing descriptive text in low-resource (LR) languages from structured English fact triples, such as the Wikidata facts that underlie Wikipedia infoboxes (Abhishek et al., 2022). The authors highlight the lack of existing datasets and methods for this task, as previous work predominantly focused on monolingual English fact-to-text generation.
To address the need for aligned data, the paper proposes a two-stage unsupervised method for aligning English facts with sentences in LR languages.
Stage 1 (Candidate Generation): This stage identifies potential alignments between English facts and LR sentences for a given entity. It calculates a similarity score based on both syntactic (TF-IDF on translated/original text) and semantic (cosine similarity of MuRIL embeddings) matching. Sentences are retained if their highest-scoring fact exceeds a threshold (τ=0.65), and the top-K (K=10) candidate facts are kept for each retained sentence.
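A minimal sketch of this hybrid scoring, assuming an equal weighting of the syntactic and semantic scores and mean-pooled embeddings from the public google/muril-base-cased checkpoint (the summary does not pin down either detail, so both are assumptions):

```python
# Sketch of Stage 1 candidate generation. Assumptions: equal weighting of
# syntactic and semantic scores, mean-pooled MuRIL embeddings; the paper's
# exact combination may differ.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

def embed(texts):
    """Mean-pooled MuRIL embeddings for a list of strings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

def align_candidates(facts, sentences, tau=0.65, k=10):
    """Return (sentence, top-k facts) pairs whose best fact score exceeds tau."""
    # Syntactic score: TF-IDF cosine similarity between verbalized facts and
    # sentences (assumes the texts were translated into a common language).
    tfidf = TfidfVectorizer().fit(facts + sentences)
    syn = cosine_similarity(tfidf.transform(sentences), tfidf.transform(facts))
    # Semantic score: cosine similarity of MuRIL embeddings.
    sem = cosine_similarity(embed(sentences), embed(facts))
    score = 0.5 * syn + 0.5 * sem  # assumed equal weighting
    kept = []
    for i, sent in enumerate(sentences):
        if score[i].max() >= tau:
            top_k = score[i].argsort()[::-1][:k]
            kept.append((sent, [facts[j] for j in top_k]))
    return kept
```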
Stage 2 (Candidate Selection): This stage refines the candidate pairs from Stage 1 using more sophisticated techniques. Two approaches are explored:
Transfer Learning from NLI: Multilingual models (MuRIL, XLM-R, mT5) fine-tuned on the Cross-lingual Natural Language Inference (XNLI) task are used. A (fact, sentence) pair is considered aligned if the model predicts that the sentence (premise) entails the fact (hypothesis); a sketch of this entailment check appears after this list.
Distant Supervision: A binary classifier is trained on an existing English fact-to-text dataset (KELM) to predict fact-sentence alignment. This classifier is then applied in a cross-lingual setting (specifically, the 'translate-train' approach yielded the best results).
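To make the NLI-based selection concrete, here is a minimal sketch using an off-the-shelf XNLI-fine-tuned checkpoint (joeddav/xlm-roberta-large-xnli, a public substitute; the paper fine-tunes MuRIL, XLM-R, and mT5 on XNLI itself):

```python
# Sketch of Stage 2 candidate selection via NLI entailment. The checkpoint
# below is an off-the-shelf XNLI model used for illustration only; the paper
# fine-tunes its own MuRIL/XLM-R/mT5 models on XNLI.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_tokenizer = AutoTokenizer.from_pretrained("joeddav/xlm-roberta-large-xnli")
nli_model = AutoModelForSequenceClassification.from_pretrained(
    "joeddav/xlm-roberta-large-xnli")

def is_aligned(sentence, fact):
    """Keep the (fact, sentence) pair if the sentence entails the fact."""
    # Premise = LR sentence, hypothesis = verbalized English fact.
    inputs = nli_tokenizer(sentence, fact, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    pred = logits.argmax(-1).item()
    # Read the label name from the config rather than hardcoding an index.
    return nli_model.config.id2label[pred].lower() == "entailment"

print(is_aligned("सचिन तेंदुलकर का जन्म 1973 में मुंबई में हुआ था।",
                 "Sachin Tendulkar date of birth 24 April 1973"))
```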
The authors evaluated these alignment methods against a manually annotated ground-truth dataset of 5402 pairs across 8 languages (Hindi, Marathi, Telugu, Tamil, English, Gujarati, Bengali, Kannada). The transfer learning approach using mT5 achieved the highest average F1 score of 0.837 for candidate selection.
Using the best alignment method (mT5 transfer learning) on the output of Stage 1, the researchers created the XAlign dataset, containing 0.45 million automatically aligned (English fact set, LR sentence) pairs across the 8 languages.
Finally, the paper establishes baseline results for the XF2T generation task on the XAlign dataset. The authors trained several multilingual sequence-to-sequence models:
A baseline using simple translation and concatenation of facts.
A standard Transformer model.
A Graph Attention Network (GAT) + Transformer model.
An mT5-small model.
The input to these models consists of the concatenated English facts (optionally including Wikipedia section headers), and the output is the generated sentence in the target language. Evaluation using BLEU scores showed that the mT5-small model performed best, achieving an average BLEU score of 25.0 across all languages, significantly outperforming the baseline and other models. Performance varied by language, with higher scores for English, Hindi, and Bengali compared to other LR languages.
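A minimal sketch of this input linearization and generation, assuming simple <S>/<R>/<O> separator tokens, a "generate <lang>" control prefix, and the generic google/mt5-small checkpoint in place of the authors' fine-tuned model (all three are assumptions; in practice the model would first be fine-tuned on XAlign):

```python
# Sketch of XF2T input linearization and generation. The separator tokens,
# the "generate <lang>" prefix, and the section-header field are assumptions;
# google/mt5-small here stands in for the authors' fine-tuned checkpoint.
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def linearize(facts, section=None, lang="hi"):
    """Concatenate (subject, relation, object) triples into a flat string."""
    text = " ".join(f"<S> {s} <R> {r} <O> {o}" for s, r, o in facts)
    if section:  # optional Wikipedia section header
        text = f"<H> {section} {text}"
    return f"generate {lang}: {text}"  # assumed language-control prefix

facts = [("Sachin Tendulkar", "date of birth", "24 April 1973"),
         ("Sachin Tendulkar", "place of birth", "Mumbai")]
inputs = tokenizer(linearize(facts, section="Introduction"), return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```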
The main contributions are:
Defining the XF2T generation problem for LR languages.
Proposing effective unsupervised/distantly supervised methods for cross-lingual fact-sentence alignment.
Creating the large-scale XAlign dataset with both automatic and manual annotations.
Establishing strong baseline results for XF2T generation using state-of-the-art multilingual models.
The paper makes its dataset and code publicly available to encourage further research in this area.