
AnCora-ES Corpus for Spanish NLP

Updated 25 October 2025
  • AnCora-ES Corpus is a comprehensive annotated resource containing ~17,300 Spanish sentences and ~500,000 tokens from newspaper texts.
  • It utilizes XML-based annotations converted into Penn Treebank-inspired bracketed notation, capturing detailed morphological, syntactic, and semantic features.
  • The corpus underpins diverse NLP applications such as constituency parsing and AMR annotation, offering robust benchmarks for educational and research tools.

The AnCora-ES Corpus is a comprehensive linguistic resource for Spanish, annotated at morphological, syntactic, and semantic levels, and widely utilized in both semantic representation and syntactic parsing research. Originally derived from newspaper articles, it includes approximately 17,300 sentences comprising around 500,000 words and forms the basis for numerous downstream NLP applications, notably in constituency parsing and Abstract Meaning Representation (AMR) annotation. Its design is tailored not only for academic research but also for applications in educational technology, with data representations adapted to pedagogical needs and large-scale neural model training.

1. Corpus Composition and Structure

The AnCora-ES corpus consists of approximately 17,300 sentences sourced predominantly from Spanish newspaper texts, totaling roughly 500,000 words. Sentences are annotated with XML tags encoding syntactic, morphological, and semantic features across a range of syntactic categories and morphological properties, which gives NLP tasks fine-grained input. For integration with pedagogical and parsing tools, the original XML structure was converted to a Penn Treebank-inspired scheme that delimits constituents with square brackets instead of the Penn Treebank's round parentheses, which avoids collisions with punctuation in the Spanish text.

Dimension       | Value                    | Source Context
Sentences       | ~17,300                  | Newspaper articles
Tokens          | ~500,000                 | XML with morpho-syntax
Annotation Type | Syntactic, Morphological | XML tags, phrase structure

This reformatted data supports applications ranging from educational tools (e.g., MiSintaxis) to sequence-to-sequence model finetuning for syntactic analysis (Delgado et al., 18 Oct 2025).

2. Annotation Methodology and Syntactic Representation

AnCora-ES uses XML-based annotations that capture the syntactic and, where applicable, morphological attributes of every token. For parsing applications, preprocessing retains only the tags relevant to constituents and syntactic function. The retained phrase structure is then rendered in human-readable bracketed notation:

<s>[Compound.Sentence [NP/S [Det The] [N final] ...]]</s>

This representation, aligned with Spanish educational curricula, serves both computational parsing and classroom instruction. For transformer models with input-length constraints (e.g., GPT-2), sentences exceeding 512 tokens were filtered out, yielding a length-limited split alongside the full corpus for benchmarking (Delgado et al., 18 Oct 2025).
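The XML-to-bracket conversion described in this section can be sketched as a short recursive traversal. This is only an illustration under stated assumptions: the XML layout, the tag names (`sentence`, `node`, `tok`), the attribute names (`label`, `wd`), and the Spanish sample sentence are all hypothetical and are not the actual AnCora-ES schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML. Tag and attribute names are illustrative
# assumptions, not the real AnCora-ES schema.
SAMPLE_XML = """
<sentence label="S">
  <node label="NP/S">
    <tok label="Det" wd="El"/>
    <tok label="N" wd="perro"/>
  </node>
  <node label="VP/P">
    <tok label="V" wd="ladra"/>
  </node>
</sentence>
"""

def to_brackets(elem):
    """Render an element as [Label child child ...] using square brackets."""
    label = elem.get("label", elem.tag)
    word = elem.get("wd")
    if word is not None:
        # Leaf: a token carrying its surface form
        return f"[{label} {word}]"
    # Internal node: recurse over child elements only
    children = " ".join(to_brackets(child) for child in elem)
    return f"[{label} {children}]"

def convert(xml_text):
    """Parse one sentence and wrap its bracketed tree in <s>...</s>."""
    root = ET.fromstring(xml_text.strip())
    return f"<s>{to_brackets(root)}</s>"

bracketed = convert(SAMPLE_XML)
print(bracketed)
```

Because the traversal reads only the label and the surface form, any other attributes on the XML elements are dropped, which mirrors the idea of keeping just the tags relevant to constituents and syntactic function.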

3. Semantic Annotation: AMR and AnCora-Net Integration

The AnCora-ES lexicon is central to semantic annotation, specifically in projects adapting Abstract Meaning Representation to Spanish (Wein et al., 2022). Annotators employ the rolesets and verb senses defined in AnCora-Net, eschewing English PropBank senses in favor of Spanish lexical standards. Moving away from English-centric senses lets the annotation capture Spanish-specific phenomena, including:

  • Pro-drop constructions: Encoded with “sinnombre” AMR concepts, e.g.,
    (s / saber-01
      :polarity -
      :ARG0 (f / first-person-sing-sinnombre)
      :ARG1 (h / querer-01
          :ARG0 f))
  • Clitic and reflexive pronouns: Treated as discrete tokens and mapped with corresponding roles.
  • Double negation, derivational morphology: Annotated via modifier concepts and semantic features.
  • Gender and number distinctions: Lemmatized to base forms, with gender marked only when it affects interpretation.
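To make the pro-drop example concrete, the graph can be written out as triples and checked for reentrancy, i.e. a variable that fills more than one role. The sketch below transcribes the AMR shown above by hand; the triple layout and the helper name `reentrant_variables` are illustrative choices, not part of any annotation tool.

```python
# The pro-drop AMR from the list above, transcribed by hand.
# instances: variable -> concept; relations: (source, role, target)
instances = {
    "s": "saber-01",
    "f": "first-person-sing-sinnombre",
    "h": "querer-01",
}
relations = [
    ("s", ":polarity", "-"),
    ("s", ":ARG0", "f"),
    ("s", ":ARG1", "h"),
    ("h", ":ARG0", "f"),
]

def reentrant_variables(instances, relations):
    """Return variables that are the target of more than one relation."""
    counts = {}
    for _source, _role, target in relations:
        if target in instances:  # skip constants such as "-"
            counts[target] = counts.get(target, 0) + 1
    return sorted(var for var, n in counts.items() if n > 1)

shared = reentrant_variables(instances, relations)
print(shared)
```

The only reentrant variable is `f`: the implicit first-person subject is introduced once and reused as the `:ARG0` of both predicates, which is how the dropped pronoun is made explicit in the graph.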

Annotators add new verb senses sequentially as novel predicates arise, maintaining the alignment to AnCora lexical traditions. This approach supports the annotation of 586 Spanish AMRs over 486 unique sentences from diverse genres, enabling gold-standard evaluation for cross-lingual parsing systems (Wein et al., 2022).

4. Machine Learning Applications and Model Training

The AnCora-ES corpus serves as foundational training data for LLMs in constituency parsing via sequence-to-sequence methods (Delgado et al., 18 Oct 2025). Pre-trained transformers—including GPT-2 and Bloom variants from the Hugging Face repository—are fine-tuned using bracketed phrase-structure outputs. Two dataset splits are constructed: the full corpus for unrestricted models, and a 512-token-limited subset for models with input length constraints.
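The two splits can be sketched as a simple length filter. This is a minimal sketch under an explicit assumption: whitespace splitting stands in for the model's own subword tokenizer, so real token counts would differ. The function names and the sample strings are hypothetical.

```python
MAX_TOKENS = 512  # length limit quoted in the text

def count_tokens(text):
    # Assumption: whitespace tokens approximate the model tokenizer's count.
    return len(text.split())

def build_splits(examples, max_tokens=MAX_TOKENS):
    """Return (full, limited): all examples, and those within the limit."""
    full = list(examples)
    limited = [ex for ex in full if count_tokens(ex) <= max_tokens]
    return full, limited

# Hypothetical data: one short bracketed sentence and one over-long string
examples = [
    "[S [NP/S [N Ana]] [VP/P [V canta]]]",
    " ".join(["tok"] * 600),
]
full, limited = build_splits(examples)
print(len(full), len(limited))
```

In practice the count would come from the tokenizer of the model being fine-tuned, since the limit applies to subword tokens rather than to words.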

Model performance is evaluated using the F₁ score:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

On this metric, gpt2-large-bne attains F₁ ≈ 0.8141–0.8183, and the Bloom models yield competitive scores. Training is limited to five epochs to curb overfitting, and the recorded metrics include inference time, memory footprint, and loss evolution (visualized on semilog plots).
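As a worked example of the F₁ formula, the sketch below scores a predicted bracketing against a gold one by comparing labeled spans. It assumes a PARSEVAL-style comparison in which every bracket, including the word-level ones, counts as a constituent; the source does not specify how matches are counted, so this scoring choice and both sample trees are assumptions.

```python
import re
from collections import Counter

def labeled_spans(bracketed):
    """Collect (label, start, end) for every [Label ...] constituent."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", bracketed)
    spans, stack, position, i = Counter(), [], 0, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[":
            # The label is the token right after the opening bracket
            stack.append((tokens[i + 1], position))
            i += 2
        elif tok == "]":
            label, start = stack.pop()
            spans[(label, start, position)] += 1
            i += 1
        else:
            position += 1  # a word advances the span position
            i += 1
    return spans

def bracket_f1(gold, predicted):
    """Return (precision, recall, f1) over labeled spans."""
    g, p = labeled_spans(gold), labeled_spans(predicted)
    matched = sum((g & p).values())  # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical trees: the prediction attaches the noun to the wrong phrase
gold = "[S [NP/S [Det El] [N perro]] [VP/P [V ladra]]]"
pred = "[S [NP/S [Det El]] [VP/P [N perro] [V ladra]]]"
precision, recall, f1 = bracket_f1(gold, pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Here four of the six predicted spans match the gold tree, so precision and recall are both 4/6 and F₁ is about 0.667.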

This methodology demonstrates that LLMs can internalize Spanish constituency structures with high fidelity, suggesting applicability for both classroom tools and broader NLP tasks such as automated grammar checking or semantic role labeling.

5. Cross-Lingual and Evaluation Implications

AnCora-ES’s use in gold-standard Spanish AMR annotation (on the AMR 2.0 – Four Translations sembank) addresses key divergences between English and Spanish semantic frameworks (Wein et al., 2022). By leveraging AnCora rolesets, the resource avoids the imposition of English-centric semantic assumptions, facilitating more equitable cross-lingual transfer and evaluation.

This design supports the evaluation of multilingual parsers and generative models. One implication is the possibility for knowledge distillation and cross-lingual alignment strategies unencumbered by English-centric representations. The corpus’s adaptation further enables diagnostic studies of AMR adaptability and syntactic model transfer across languages.

6. Reliability, Disagreement Analysis, and Prospective Refinements

Annotator agreement in AMR annotation was assessed via the Smatch metric on a subset of 50 sentences, yielding scores from 0.83 to 0.89. Analysis reveals divergence stemming from:

  • Entity–event ambiguity in noun interpretation
  • Inconsistent verb sense assignments due to semantic subtleties
  • Non-core label selection for possessive and source relationships

Such findings suggest that while the annotation guidelines and AnCora-based senses provide high reliability, some interpretive subjectivity persists, pointing to prospective refinements for consistency and schema adaptability.
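The idea behind Smatch can be illustrated with a deliberately simplified sketch: flatten each AMR into triples, search for the variable mapping that maximizes the number of shared triples, and report the F-score of that overlap. The version below enumerates all mappings exhaustively, which is only feasible for tiny graphs, whereas the actual Smatch tool uses a heuristic search and also scores a root triple. The second annotation and its `querer-02` sense are invented here to show a verb-sense disagreement.

```python
from itertools import permutations

def triples(instances, relations):
    """Flatten an AMR into instance triples and relation triples."""
    out = {("instance", var, concept) for var, concept in instances.items()}
    out |= {(role, src, tgt) for src, role, tgt in relations}
    return out

def rename(triple_set, mapping):
    """Apply a variable mapping; concepts and constants stay unchanged."""
    renamed = set()
    for rel, a, b in triple_set:
        if rel == "instance":
            renamed.add((rel, mapping.get(a, a), b))
        else:
            renamed.add((rel, mapping.get(a, a), mapping.get(b, b)))
    return renamed

def smatch_exhaustive(amr_a, amr_b):
    """F-score of the best triple overlap over all variable mappings."""
    ta, tb = triples(*amr_a), triples(*amr_b)
    vars_a, vars_b = list(amr_a[0]), list(amr_b[0])
    # Pad so each variable of A gets a distinct target, possibly a dummy
    targets = vars_b + [None] * max(0, len(vars_a) - len(vars_b))
    best = 0
    for perm in permutations(targets, len(vars_a)):
        mapping = {a: (b if b is not None else f"<unmapped:{a}>")
                   for a, b in zip(vars_a, perm)}
        best = max(best, len(rename(ta, mapping) & tb))
    if best == 0:
        return 0.0
    precision, recall = best / len(ta), best / len(tb)
    return 2 * precision * recall / (precision + recall)

rels = lambda s, f, h: [(s, ":polarity", "-"), (s, ":ARG0", f),
                        (s, ":ARG1", h), (h, ":ARG0", f)]
annotator_1 = ({"s": "saber-01", "f": "first-person-sing-sinnombre",
                "h": "querer-01"}, rels("s", "f", "h"))
# Hypothetical second annotation differing only in the verb sense
annotator_2 = ({"a": "saber-01", "b": "first-person-sing-sinnombre",
                "c": "querer-02"}, rels("a", "b", "c"))
score = smatch_exhaustive(annotator_1, annotator_2)
print(round(score, 3))
```

With six of seven triples shared, the score is 6/7, about 0.857, which shows how a single verb-sense disagreement on a short sentence is enough to pull agreement well below 1.0.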

7. Educational Tools and Potential Extensions

The processing pipeline for AnCora-ES data (XML extraction, reformatting, model fine-tuning) underpins the MiSintaxis tool for Spanish syntax instruction (Delgado et al., 18 Oct 2025). Accurate real-time parsing enables dynamic feedback and student analysis, indicating strong pedagogical utility.

Acknowledged limitations include token restrictions for certain models, gaps in coverage of emergent syntactic patterns, and the need to expand the sentence inventory to keep pace with evolving curricular standards. Still, the corpus's comprehensive annotation enables robust grammatical assessment and supports the development of Spanish-centered NLP solutions in research and education.
