- The paper describes the creation of a large multilingual parallel corpus from full-text scientific articles in English, Portuguese, and Spanish using automated sentence alignment.
- This corpus, including over 2.9 million aligned sentences for EN-PT, significantly improves Statistical Machine Translation (SMT) performance compared to previous benchmarks.
- The resource supports NLP tasks such as multilingual text mining and cross-language plagiarism detection, and can enhance Named Entity Recognition (NER) tools.
The paper "A Large Parallel Corpus of Full-Text Scientific Articles" describes the construction of a parallel corpus from full-text scientific articles in the Scielo database, a key resource for scientific literature in Latin America. Many Scielo articles are available in English, Portuguese, and Spanish, which makes the database well suited to NLP tasks and Statistical Machine Translation (SMT) applications.
Key Contributions and Methodology
- Corpus Construction: The authors built the parallel corpus from Scielo articles available in English, Portuguese, and Spanish. Sentences were aligned automatically with Hunalign, which combines sentence-length information with dictionary-based realignment.
- Scope and Scale: The work extends previous efforts by including full-text articles from multiple domains, not only biomedicine. The corpus comprises more than 2.9 million aligned sentences for the English-Portuguese language pair alone, alongside substantial datasets for the other language pairs and trilingual samples.
- Structural Alignment and Metadata: The corpus is organized according to the hierarchical structure of articles, preserving sections and paragraphs, which benefits tasks such as text summarization. Additionally, metadata such as journal name and subject area are included, enhancing the utility for text classification.
- Legal Considerations: In compliance with Creative Commons licenses, only articles that permit derivative works are included, so the corpus can be distributed legally. This matters because building the corpus modifies the articles, for example by removing non-textual elements.
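To make the idea of length-based sentence alignment concrete, here is a minimal sketch in Python. It is not Hunalign's implementation: it uses only character lengths (no dictionary and no realignment pass), and it allows only one-to-one pairs and skipped sentences. The names `length_cost`, `SKIP_COST`, and `align`, as well as the penalty value, are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical penalty for leaving a sentence without a counterpart.
SKIP_COST = 0.6


def length_cost(src: str, tgt: str) -> float:
    """Return 0.0 for equal character lengths, approaching 1.0 as they diverge."""
    longest = max(len(src), len(tgt))
    if longest == 0:
        return 0.0
    return abs(len(src) - len(tgt)) / longest


def align(src: List[str], tgt: List[str]) -> List[Tuple[int, int]]:
    """Align two sentence lists with dynamic programming.

    Returns (source_index, target_index) pairs for sentences judged to be
    translations of each other; skipped sentences do not appear in the result.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:  # pair src[i-1] with tgt[j-1]
                step = length_cost(src[i - 1], tgt[j - 1])
                candidates.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i > 0:  # skip a source sentence
                candidates.append((cost[i - 1][j] + SKIP_COST, (i - 1, j)))
            if j > 0:  # skip a target sentence
                candidates.append((cost[i][j - 1] + SKIP_COST, (i, j - 1)))
            if candidates:
                cost[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Walk back from the end, keeping only the diagonal (paired) steps.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        prev_i, prev_j = back[i][j]
        if (prev_i, prev_j) == (i - 1, j - 1):
            pairs.append((prev_i, prev_j))
        i, j = prev_i, prev_j
    pairs.reverse()
    return pairs
```

Calling `align(english_sentences, portuguese_sentences)` on two sentence-split versions of the same article returns index pairs that could then be written out as aligned lines. A production aligner would also handle one-to-many merges and use lexical evidence, which is what the dictionary component of a tool like Hunalign provides.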
Evaluation and Results
- SMT Performance: The paper evaluates the corpus by training SMT systems with Moses. The resulting BLEU scores, 48.51 for EN→PT and 49.24 for PT→EN, are substantially higher than those reported in prior work.
- Alignment Quality: Manual evaluation of sentence alignment revealed a high accuracy rate, with correct alignments exceeding 98% across all language pairs, illustrating the robustness of the Hunalign algorithm when extended with domain-specific dictionaries.
- Comparison with EuroMatrix: Although BLEU scores vary with the domain of the corpus, the results are comparable to established benchmarks such as the Europarl corpus, which shows that the corpus is of competitive quality for the scientific article domain.
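Since the evaluation rests on BLEU, a short sketch of how a corpus-level BLEU score is computed may help in reading the numbers above. This is a simplified version under stated assumptions: whitespace tokenization, a single reference per sentence, no smoothing, and n-grams up to length 4. It is not the scoring script used in the paper, and the function names `ngram_counts` and `corpus_bleu` are illustrative only.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(
                min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items()
            )
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

The score is the geometric mean of the clipped n-gram precisions multiplied by the brevity penalty. Reported scores also depend on tokenization and casing choices, so values from a sketch like this are not directly comparable to published figures.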
Implications and Future Work
The corpus is designed to support varied NLP applications, including multilingual text mining, cross-language plagiarism detection, and potentially the enhancement of Named Entity Recognition (NER) tools across multiple languages. The authors suggest future directions such as using the corpus to train Neural Machine Translation (NMT) systems and applying it to text classification tasks. Expanding into more domains and additional language pairs could further broaden the applicability and impact of the corpus.