PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale (2304.12206v2)

Published 24 Apr 2023 in cs.CL

Abstract: Existing question answering (QA) systems owe much of their success to large, high-quality training datasets. Such annotation efforts are costly, and the difficulty compounds in the cross-lingual setting. Therefore, prior cross-lingual QA work has focused on releasing evaluation datasets and then applying zero-shot methods as baselines. This work proposes a synthetic data generation method for cross-lingual QA that leverages indirect supervision from existing parallel corpora. Our method, termed PAXQA (Projecting annotations for cross-lingual (x) QA), decomposes cross-lingual QA into two stages. First, we apply a question generation (QG) model to the English side. Second, we apply annotation projection to translate both the questions and the answers. To better translate questions, we propose a novel use of lexically-constrained machine translation, in which the constrained entities are extracted from the parallel bitexts. We apply PAXQA to generate cross-lingual QA examples in 4 languages (662K examples in total), and perform human evaluation on a subset to create validation and test splits. We then show that models fine-tuned on these datasets outperform prior synthetic data generation models on several extractive QA datasets. The largest performance gains are for directions with non-English questions and English contexts. Ablation studies show that our dataset generation method is relatively robust to noise from automatic word alignments, indicating that our generations are of sufficient quality. To facilitate follow-up work, we release our code and datasets at https://github.com/manestay/paxqa.
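
The second stage described above, annotation projection, maps an annotated span on the English side of a bitext to the target side via word alignments. The following is a minimal sketch of that idea (not the authors' code; the function name, toy sentences, and alignment pairs are hypothetical, and real alignments would come from an automatic aligner):

```python
# Hedged sketch of annotation projection: map an English answer span
# onto the target side of a bitext using word-level alignments.

def project_span(src_start, src_end, alignments):
    """Map a source token span [src_start, src_end] to a target span.

    `alignments` is a list of (src_idx, tgt_idx) pairs, e.g. produced by
    an automatic word aligner. Returns (tgt_start, tgt_end), or None if
    no target token aligns to the span (such examples would be dropped).
    """
    tgt_indices = [t for s, t in alignments if src_start <= s <= src_end]
    if not tgt_indices:
        return None
    return min(tgt_indices), max(tgt_indices)

# Toy example: English "The Eiffel Tower is in Paris" aligned to
# French "La tour Eiffel est à Paris" (hypothetical word alignment;
# note "Eiffel Tower" -> "tour Eiffel" crosses word order).
alignments = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4), (5, 5)]
# English answer span "Eiffel Tower" = source tokens 1..2
print(project_span(1, 2, alignments))  # -> (1, 2), i.e. "tour Eiffel"
```

The same mapping can be applied to the generated questions, with the paper's lexically-constrained translation ensuring that aligned entities surface consistently in the translated question.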

Authors (2)
  1. Bryan Li (17 papers)
  2. Chris Callison-Burch (102 papers)
Citations (6)