
The SAMER Arabic Text Simplification Corpus (2404.18615v1)

Published 29 Apr 2024 in cs.CL

Abstract: We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process, and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.

Authors (5)
  1. Bashar Alhafni (21 papers)
  2. Reem Hazim (2 papers)
  3. Juan Piñeros Liberato (2 papers)
  4. Muhamed Al Khalil (3 papers)
  5. Nizar Habash (66 papers)
Citations (4)

Summary

Understanding Arabic Text Simplification for School-aged Learners: Introducing the SAMER Corpus

The Need for Simplified Text

Text simplification is crucial for making content accessible to a diverse audience including children, language learners, and those with cognitive disabilities. Simplification involves rewriting texts in a way that maintains the core meaning but enhances readability through lexical and syntactical changes. While much research has concentrated on English, there's a significant gap in resources for languages like Arabic, especially in creating text materials that cater to specific reader needs.

Introducing the SAMER Corpus

The "Simplification of Arabic Masterpieces for Extensive Reading" (SAMER) project aims to address this gap by creating a manually annotated Arabic parallel corpus specifically targeting school-aged learners. The corpus is drawn from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. It focuses on lexical simplification: substituting complex words with simpler alternatives while preserving the original meaning.

Key Features of the SAMER Corpus

  • Dual Simplification Levels: Each piece in the SAMER Corpus has been simplified to two distinct readability levels—Level 4 (suitable for grades 6-8) and Level 3 (suitable for grades 4-5).
  • Rich Annotations: Readability levels are annotated at both the document and word levels, alongside the two simplified parallel versions of each text.
  • Public Availability: In support of further research and application, the corpus is openly accessible for use in tasks like readability assessment and automated text simplification.

Technical Aspects and Annotation Process

  1. Choosing Texts: Texts were chosen based on historical significance, readability level, and public domain status.
  2. Annotation Tools and Guidelines: Annotators worked with an add-on tool that let them visualize and modify text readability interactively. The tool is backed by a lexicon that assigns each word one of five readability levels.
  3. Rigor in Annotation: The project employed native Arabic speakers skilled in linguistics, ensuring that simplifications were accurately aligned with intended readability improvements.
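
To make the lexicon-driven workflow concrete, here is a minimal Python sketch of how a five-level readability lexicon could be used to flag words that sit above a target level. It is an illustration only, not the project's tool: the lexicon entries (`w1` to `w4`), their levels, the function name `flag_hard_words`, and the choice to treat unknown words as Level 5 are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> readability level (1 = easiest, 5 = hardest).
# The entries are placeholders and are NOT taken from the SAMER lexicon.
TOY_LEXICON: Dict[str, int] = {"w1": 1, "w2": 3, "w3": 4, "w4": 5}


def flag_hard_words(
    tokens: List[str],
    target_level: int,
    lexicon: Dict[str, int],
    default_level: int = 5,  # assumption: unknown words count as hardest
) -> List[Tuple[int, str, int]]:
    """Return (position, token, level) for every token above target_level."""
    flagged = []
    for position, token in enumerate(tokens):
        level = lexicon.get(token, default_level)
        if level > target_level:
            flagged.append((position, token, level))
    return flagged


if __name__ == "__main__":
    sentence = ["w1", "w3", "w4", "w2"]
    # Words an annotator would need to review when targeting Level 3.
    for position, token, level in flag_hard_words(sentence, 3, TOY_LEXICON):
        print(f"token {position} ({token}) is Level {level}")
```

A flagged word is only a candidate for replacement; choosing a simpler substitute that preserves the meaning remains the annotator's decision.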

Corpus Statistics and Insights

  • Volume and Composition: The corpus comprises texts totaling about 159K words, each paired with two simplified parallel versions.
  • Lexical Transformations: Analysis shows predominant use of one-to-one word replacements, highlighting a focused approach to simplification rather than broad textual rewrites.
  • Distribution of Readability Adjustments: Words were often lowered by one or two readability levels, demonstrating a granular approach to simplification.
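
The two analyses above (replacement shape and size of the readability drop) can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' analysis code: the aligned edit format (a pair of token lists), the function name `summarize_edits`, and the toy lexicon and data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical aligned edit: (original tokens, simplified tokens).
Edit = Tuple[List[str], List[str]]


def summarize_edits(
    edits: List[Edit], lexicon: Dict[str, int]
) -> Tuple[Counter, Counter]:
    """Count replacement shapes (e.g. '1-to-1') and, for one-to-one
    replacements whose words are both in the lexicon, how many
    readability levels each word dropped."""
    shapes: Counter = Counter()
    drops: Counter = Counter()
    for original, simplified in edits:
        shapes[f"{len(original)}-to-{len(simplified)}"] += 1
        if len(original) == 1 and len(simplified) == 1:
            before = lexicon.get(original[0])
            after = lexicon.get(simplified[0])
            if before is not None and after is not None:
                drops[before - after] += 1
    return shapes, drops


if __name__ == "__main__":
    # Placeholder lexicon and edits, invented for this example.
    lexicon = {"w1": 1, "w2": 3, "w3": 4, "w4": 5}
    edits = [(["w4"], ["w2"]), (["w3"], ["w2"]), (["w4"], ["w1", "w2"])]
    shapes, drops = summarize_edits(edits, lexicon)
    print(dict(shapes))
    print(dict(drops))
```

Restricting the level-drop count to one-to-one replacements keeps the comparison well defined; a multi-word replacement has no single "after" level without a further assumption about how to aggregate.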

Practical Implications and Future Directions

The SAMER Corpus opens new avenues for developing automated tools that adapt Arabic text to different reader competencies. It can serve as a reference point for text simplification efforts in other non-English languages and encourages further research into domain-specific simplification strategies. Future work might expand to other genres, incorporate syntactic simplification, and develop automated readability assessment and simplification models tailored for Arabic.

By providing a structured approach to Arabic text simplification and establishing a publicly available resource, this work significantly contributes to the field, supporting educational technology advancements and promoting linguistic inclusion.
