PIE-English Dataset Overview
- The PIE-English dataset is a rigorously constructed resource capturing multiword expressions that may bear idiomatic or literal meanings.
- It integrates dictionary mining, corpus harvesting, and human annotation with methods that achieve high precision, demonstrating F₁ scores up to 0.92 in span detection.
- The resource enables diverse NLP applications such as machine translation, text generation, and representation learning with flexible, standardized formats.
A Potential Idiomatic Expression (PIE)–English Dataset is a rigorously constructed resource that captures multiword expressions in English which can bear idiomatic as well as literal interpretations, serving as a foundation for classification, detection, and understanding of non-compositional language phenomena in NLP tasks. From its inception in "Casting a Wide Net" (Haagsma et al., 2019) through later large-scale and fine-grained corpora, PIE-English datasets provide both annotated and automatically extracted examples to enable state-of-the-art research on idiomaticity in machine translation, understanding, generation, and representation learning.
1. Definition and Conceptual Scope
A Potential Idiomatic Expression (PIE) is defined as a multiword string in English whose meaning can be, but is not necessarily, non-compositional—i.e., the meaning may depart from the sum of its parts, potentially bearing a figurative or fixed meaning in context (Haagsma et al., 2019). Datasets label each PIE instance in sentence context (token- or span-level), often with a dichotomous (idiomatic/literal) or multi-class (e.g., metaphor, euphemism, literal, etc.) sense assignment. The PIE scope includes fixed idioms, syntactically flexible expressions, and figurative multiword units.
2. Dataset Construction Methodologies
PIE-English datasets are curated via a pipeline that combines dictionary mining, corpus-driven extraction, and structured human annotation:
- Dictionary Aggregation: Resources such as Wiktionary, Oxford Dictionary of English Idioms, and UsingEnglish.com are scraped and normalized. The set intersection yields high-confidence PIE types (e.g., 591 entries in Haagsma et al., 2019).
- Corpus Harvesting: Candidate sentences are drawn from well-established corpora such as the British National Corpus (BNC) and Web-as-Corpus (UKWaC), targeting broad genre coverage (Haagsma et al., 2019, Adewumi et al., 2021).
- Span and Sense Annotation: Sentences identified via dependency parses or pattern-matching are annotated for PIE presence and sense, utilizing guidelines for core word alignment, permissible inflection or insertion, and idiom sense taxonomy (idiomatic, literal, other) (Haagsma et al., 2019, Qin et al., 2021, Adewumi et al., 2021).
- Statistical Sampling and Extension: To keep the resource extensible, sentences are sampled for each idiom type (15–22 per type), with automatic or manual balancing of literal and idiomatic senses (Adewumi et al., 2021).
- Automated Extraction: Parse-based subtree alignment and inflection-aware string matchers yield robust span identification, with system F₁ scores up to 92% (Haagsma et al., 2019).
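As an illustration of the inflection-aware string matching mentioned above, the following is a minimal sketch and not the extraction system of Haagsma et al. (2019). It assumes a crude suffix stripper in place of a real lemmatizer and allows a small number of inserted tokens between idiom components; the names `crude_stem`, `find_pie_span`, and `max_gap` are hypothetical.

```python
import re

def crude_stem(word: str) -> str:
    """Rough suffix stripper standing in for a real lemmatizer (assumption)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def find_pie_span(idiom: str, sentence: str, max_gap: int = 2):
    """Return (start, end) token indices (end exclusive) of the first match, or None.

    Idiom components must appear in order; up to `max_gap` other tokens
    may intervene between two consecutive components.
    """
    tokens = re.findall(r"\w+", sentence)
    stems = [crude_stem(t) for t in tokens]
    parts = [crude_stem(w) for w in idiom.split()]
    for start, stem in enumerate(stems):
        if stem != parts[0]:
            continue
        pos, matched = start, True
        for part in parts[1:]:
            window = stems[pos + 1 : pos + 2 + max_gap]
            if part not in window:
                matched = False
                break
            pos = pos + 1 + window.index(part)
        if matched:
            return start, pos + 1
    return None

span = find_pie_span("spill the beans", "He spilled all the beans yesterday.")
```

Here `span` covers the tokens "spilled all the beans", showing tolerance for both inflection and insertion. A matcher of this kind only proposes candidate spans; deciding whether a candidate is used idiomatically or literally remains a separate annotation or classification step.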
3. Annotation Schemes and Reliability
Annotation protocols marry precise linguistic heuristics with standardized evaluation of inter-annotator agreement (IAA). Key schemes and metrics:
- BIO Tagging: PIE spans are labeled in the token sequence with “B-IDIOM”/“I-IDIOM”/“O” tags (Qin et al., 2021, Matheny et al., 20 Nov 2025).
- Sense Labels: PIEs are assigned to fine-grained classes—beyond literal/non-literal, PIE-English supports metaphor, simile, euphemism, personification, oxymoron, paradox, hyperbole, irony, parallelism, and literal (Adewumi et al., 2021).
- Agreement Metrics: IAA is computed using Fleiss’ κ or raw percentage agreement. Reported values for English PIE datasets are κ = 0.74–0.91 for PIE detection and κ = 0.63–0.83 for sense labeling (Haagsma et al., 2019); an overall IAA of 88.89% is reported for the ten-class variant (Adewumi et al., 2021).
- Adjudication: Cases of disagreement are resolved through expert consensus or application of a label hierarchy.
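To make the agreement metric concrete, the sketch below computes Fleiss' κ from scratch using its standard definition. The toy ratings are invented for illustration and do not come from any PIE dataset; the function name `fleiss_kappa` and the label strings are assumptions.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists of labels, one inner list per item.
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions.
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)

# Toy data: four PIE instances, three annotators each (invented).
toy = [
    ["idiomatic", "idiomatic", "idiomatic"],
    ["literal", "literal", "literal"],
    ["idiomatic", "idiomatic", "literal"],
    ["literal", "literal", "idiomatic"],
]
kappa = fleiss_kappa(toy, ["idiomatic", "literal"])
```

On this toy data the observed agreement is 2/3 and the chance agreement is 1/2, giving κ = 1/3. Raw percentage agreement ignores the chance term, which is why the two figures reported above are not directly comparable.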
4. Dataset Composition and Format
PIE-English resources span varied corpus sizes, granularity, and annotation richness:
| Dataset | Sentences | PIE Types | Sense Labels | Format Details |
|---|---|---|---|---|
| (Haagsma et al., 2019) PIE corpus | 1,646,208* | 591 | idiomatic/literal/other | Sent + span + lemma/PoS + label |
| (Qin et al., 2021) EPIE Static | 21,890 | 359 | idiomatic (only) | Cloze-style, BIO, tokenized, context |
| (Adewumi et al., 2021, Adewumi et al., 2022) PIE | 20,174 | 1,200 | 10-class (metaphor, etc.) | Token-in-context, PoS, class, gloss |
| (Matheny et al., 20 Nov 2025) PIFL-OSCAR | 5.8M | 3,152 | unlabeled/human-labeled subset | Full sentence, BIO, PoS, metrics |
*Corpus tokens; 2,239 candidate sentences, 1,050 gold PIEs annotated.
Most datasets release tabular (csv/tsv), JSON, or HuggingFace-compatible formats, each record including sentence, idiom span or type, sense label, linguistic tags, and (where relevant) disambiguation scores or metrics.
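A record of this kind might look like the following sketch, which derives BIO tags from a token span and serializes the result as one JSON line. The field names (`sentence`, `tokens`, `idiom`, `span`, `bio`, `label`) are hypothetical and do not reproduce the schema of any specific release; only the tag names follow the B-IDIOM/I-IDIOM/O convention described in Section 3.

```python
import json

def bio_tags(n_tokens, span):
    """Build B-IDIOM/I-IDIOM/O tags for a (start, end) token span, end exclusive."""
    start, end = span
    tags = ["O"] * n_tokens
    for i in range(start, end):
        tags[i] = "B-IDIOM" if i == start else "I-IDIOM"
    return tags

tokens = ["He", "spilled", "the", "beans", "yesterday"]
record = {
    "sentence": " ".join(tokens),
    "tokens": tokens,
    "idiom": "spill the beans",   # dictionary form of the PIE type
    "span": [1, 4],               # token indices, end exclusive
    "bio": bio_tags(len(tokens), (1, 4)),
    "label": "idiomatic",         # sense label for this instance
}
line = json.dumps(record)  # one record per line (JSON Lines style)
```

Storing both the span and the derived tags is redundant but convenient: sequence labelers consume the tags directly, while sense classifiers usually need only the sentence, the span, and the label.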
5. Evaluation Protocols and Metrics
Standard protocol distinguishes between identification (span or sentence-level), disambiguation (idiomatic vs. literal), and classification (multi-class):
- Precision/Recall/F₁: F₁ is computed as $F_1 = \frac{2PR}{P + R}$, where $P = \frac{TP}{TP + FP}$ and $R = \frac{TP}{TP + FN}$.
- Inter-Annotator Agreement: Fleiss’ κ, raw agreement, Cohen’s κ (where pairwise), and classwise statistics (Adewumi et al., 2021, Haagsma et al., 2019, Matheny et al., 20 Nov 2025).
- Baseline Models: Sequence labeling (BiLSTM-CRF), transformer classifiers (BERT, T5), and dialogue generation models (DialoGPT) (Adewumi et al., 2022, Matheny et al., 20 Nov 2025).
- Slot/Sentence Labeling: Sequence accuracy (mean exact match), True Positive Consistency for idiom detection (Fornaciari et al., 2024).
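For span identification, precision, recall, and F₁ can be sketched as an exact-match comparison between gold and predicted spans. This is an assumption about the matching criterion, since individual papers may use relaxed or token-level matching instead; the function name `span_prf` and the `(sentence_id, start, end)` tuple layout are hypothetical.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of span tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # spans found in both sets
    fp = len(predicted - gold)   # predicted spans with no gold match
    fn = len(gold - predicted)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example (invented): four gold spans, two predicted, both correct.
gold = [(0, 1, 4), (1, 2, 5), (2, 0, 3), (3, 6, 8)]
pred = [(0, 1, 4), (2, 0, 3)]
p, r, f1 = span_prf(gold, pred)
```

In this toy case precision is 1.0 and recall is 0.5, so F₁ is 2/3: the harmonic mean penalizes the missed spans more than an arithmetic mean would.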
Reported performance includes F₁ up to 0.92 for parser-based extraction (Haagsma et al., 2019), F₁ of approximately 0.95 for BERT on multi-class sense classification (Adewumi et al., 2021), and macro-F₁ = 0.98 for T5 classification (Adewumi et al., 2022). PIFL-OSCAR achieves F₁ = 0.77/0.89 on human-annotated test splits (Matheny et al., 20 Nov 2025).
6. Use Cases and Applications
PIE-English datasets power a spectrum of research and applied NLP tasks:
- Detection and Disambiguation: Binary/multi-class labeling of idiom vs. literal usage, figurative-sense classification, and cross-lingual transfer (Adewumi et al., 2021, Adewumi et al., 2022).
- Machine Translation: Alignment-aware idiom detection as an aid to translation systems, and transfer of non-compositional meaning to target languages (Adewumi et al., 2021, Haagsma et al., 2019, Matheny et al., 20 Nov 2025).
- Text Generation: Controlled style transfer (literal-to-idiomatic paraphrasing) and benchmarking of text generation models for idiom plasticity (Zhou et al., 2021).
- Representation Learning: Probing contextual and static word/sentence embeddings for sensitivity to idiomaticity (Affinity and Scaled Similarity metrics) (He et al., 2024).
- Dialogue Systems: Fine-tuning conversational models for idiom-rich input domains, improving responsiveness and naturalness (Adewumi et al., 2022).
- Figurative Language Analysis: Richly annotated suites for metaphor, simile, and other figures-of-speech enable sophisticated NLU and WSD research (Adewumi et al., 2021).
7. Limitations and Extensibility
Notwithstanding their scale and detail, PIE-English resources exhibit certain limitations:
- Coverage: Even at millions of instances, idiom inventories may omit emergent, domain-specific, or regionally circumscribed expressions (Matheny et al., 20 Nov 2025).
- Balance: Class imbalance reflects the natural distribution of figurative types (e.g., metaphors vastly outnumber irony) (Adewumi et al., 2021).
- Annotation Depth: Full gold sense annotation is unavailable for large-scale automatically extracted corpora; verified subsets provide reliable gold standards (Matheny et al., 20 Nov 2025).
- Contextual Boundaries: Single-sentence focus may fail to capture discourse-level idiom cues or multi-sentence idiom instantiations (Zhou et al., 2021).
- Adaptability: Released guidelines encourage extensions to new idioms, genres, and research tasks; cross-lingual and multimodal analogs (e.g., XMPIE) are now emerging (Torunoğlu-Selamet et al., 13 Jan 2026).
References
- "Casting a Wide Net: Robust Extraction of Potentially Idiomatic Expressions" (Haagsma et al., 2019)
- "IBERT: Idiom Cloze-style reading comprehension with Attention" (Qin et al., 2021)
- "Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms" (Adewumi et al., 2021)
- "Vector Representations of Idioms in Conversational Systems" (Adewumi et al., 2022)
- "NLP Datasets for Idiom and Figurative Language Tasks" (Matheny et al., 20 Nov 2025)
- "A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding" (Torunoğlu-Selamet et al., 13 Jan 2026)