PIE-English Dataset Overview
- PIE-English is a comprehensive corpus of 20,174 samples covering nearly 1,200 idiom cases, enabling nuanced analysis of figurative language.
- It introduces a ten-class taxonomy that categorizes idioms into distinct types such as metaphor, simile, and euphemism, supporting fine-grained figurative-language processing.
- Baseline experiments with naive Bayes, linear SVM, and BERT classifiers establish strong reference results, with BERT reaching 93.4% accuracy and a weighted F1 of 0.948.
The Potential Idiomatic Expression (PIE)-English dataset is a large-scale, publicly available corpus designed to advance research on idiomatic language understanding in NLP. Uniquely, PIE-English classifies idioms into ten detailed categories, moving beyond the conventional binary distinction of literal vs. idiomatic, thereby enabling nuanced treatment of idiomatic phenomena in applications such as machine translation, word sense disambiguation, and information retrieval. The corpus features 20,174 carefully annotated samples spanning nearly 1,200 idiom cases, together with robust baseline results from both neural and statistical models.
1. Dataset Composition and Structure
PIE-English comprises 20,174 English sentence samples, predominantly sourced from the British National Corpus (BNC; 96.9%) with an additional 3.1% from UK web pages (UKWaC). The dataset encodes nearly 1,200 distinct idiomatic expressions, each annotated with its meaning and categorized into one of ten idiom classes. Each entry is structured as a row with well-defined columns, including:
- Idiom identifier (ID)
- Token (the idiomatic expression)
- Part-of-speech (PoS) tag, produced with the NLTK library
- Class label (idiom type)
- Paraphrased meaning
- Usage type (idiom or literal)
Each sample comprises one or two sentences. The annotation protocol ensures that, when both idiomatic and literal usages exist for an expression, there are generally 22 contextual samples per idiom; idioms attested only in nonliteral form receive 16 samples, supplemented with inflectional variants as needed.
| Source | Samples (%) | Idiom Cases |
|---|---|---|
| BNC | 96.9 | ~1,197 |
| UKWaC | 3.1 | Supplementary |
This organizational structure facilitates both token- and phrase-level analyses and supports downstream NLP tasks requiring labelled idiomatic data with rich contextual coverage.
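As a minimal illustration of working with this row structure, the sketch below loads the corpus with pandas. The file name and column identifiers are assumptions based on the fields listed above, not the release's actual schema.

```python
import pandas as pd

# Hypothetical file name and column names; the actual release at
# github.com/tosingithub/idesk may differ.
COLUMNS = ["id", "token", "pos_tag", "class_label", "meaning", "usage_type"]

df = pd.read_csv("pie_english.csv", names=COLUMNS, header=0)

# Per-class sample counts: a quick check of the natural class imbalance.
print(df["class_label"].value_counts())

# Restrict to idiomatic (nonliteral) usages only.
idiomatic = df[df["usage_type"] != "literal"]
print(f"{len(idiomatic)} idiomatic samples out of {len(df)} total")
```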
2. Idiom Class Taxonomy
PIE-English introduces a ten-class taxonomy of idioms—providing a granular labelling scheme that surpasses previous work restricted to simple literal/nonliteral designations. The classes are as follows:
- Metaphor: Implicit comparison (e.g., “ring a bell” meaning recollect).
- Simile: Explicit comparison using “as” or “like” (e.g., “as clear as a bell”).
- Euphemism: Mild expression for something harsh (e.g., “go belly up” for fail).
- Parallelism: Repetition or rhythmic structure (e.g., “day in, day out”).
- Personification: Attribution of human traits to non-humans (e.g., “take time by the forelock”); recognized as a subset of metaphor.
- Oxymoron: Contradictory elements (e.g., “a small fortune”).
- Paradox: Apparent contradiction with underlying truth (e.g., “here today, gone tomorrow”).
- Hyperbole: Exaggeration (e.g., “from the back of beyond”).
- Irony: Intended meaning diverges from the literal (e.g., “Pigs might fly”).
- Literal: Canonical, non-figurative usage (e.g., literal “ring a bell”).
The taxonomy interrelates certain classes; for example, personification is a subset of metaphor, and figures such as apostrophe are encompassed within personification. This hierarchical classification enables fine-grained distinctions essential for advanced figurative language processing.
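For downstream code that consumes these labels, the taxonomy can be encoded directly. The sketch below is a minimal Python rendering; the label strings and the parent map are illustrative, not the corpus's official identifiers.

```python
from enum import Enum

class IdiomClass(Enum):
    """The ten PIE-English idiom classes (label strings are illustrative)."""
    METAPHOR = "metaphor"
    SIMILE = "simile"
    EUPHEMISM = "euphemism"
    PARALLELISM = "parallelism"
    PERSONIFICATION = "personification"
    OXYMORON = "oxymoron"
    PARADOX = "paradox"
    HYPERBOLE = "hyperbole"
    IRONY = "irony"
    LITERAL = "literal"

# Hierarchical relation noted above: personification is a subset of metaphor.
PARENT = {IdiomClass.PERSONIFICATION: IdiomClass.METAPHOR}
```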
3. Annotation Protocol and Inter-Annotator Agreement
Corpus construction involved a multi-stage annotation pipeline. Four contributors, all advanced second-language English speakers, initially collected and annotated sentences based on idiom sense and usage from the BNC and UKWaC. The entire corpus underwent review by a near-native English speaker to ensure annotation quality and consistency.
Two independent annotators performed detailed labelling following explicit guidelines informed by Alm-Arvius (2003) and dictionary resources (e.g., The Free Dictionary). Disagreements were adjudicated strictly according to the corpus guidelines. The process yielded an overall inter-annotator agreement (IAA) of 88.89% across all idiom classes, with per-class figures reported as the lower of the two annotators' scores. The remaining discrepancies (~11.11% of cases) were resolved by guideline-based consensus.
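Assuming the reported IAA is simple percent agreement (the paper's exact computation may differ), a minimal check looks like this:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of samples on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy example: 8 of 9 labels agree, mirroring the reported ~88.89% IAA.
a = ["metaphor"] * 5 + ["simile", "irony", "literal", "euphemism"]
b = ["metaphor"] * 5 + ["simile", "irony", "literal", "hyperbole"]
print(f"{percent_agreement(a, b):.2%}")  # 88.89%
```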
4. Relevance to NLP Applications
PIE-English enables several critical NLP tasks:
- Machine Translation (MT): Class identification assists MT systems in producing contextually and culturally consonant translations; for instance, the euphemism class enables the generation of softened expressions in target languages.
- Word Sense Disambiguation (WSD): Fine-grained class labels, including distinctions like metaphor vs. simile, permit training of models to resolve multi-word expression ambiguity in context.
- Information Retrieval and Dialogue Systems: Context-specific idiom classification supports the retrieval and generation of semantically appropriate responses. An example is recognizing “kick the bucket” as a euphemism, prompting a conversational agent to select sensitive language.
- Neural and Statistical Idiom Detection: The naturally imbalanced class distribution supports experimentation with varied classifier architectures, spanning both traditional statistical and deep learning models.
The corpus ships with PoS tags produced by the NLTK toolkit and can be extended with IOB tags for phrase chunking and related sequence labelling tasks, as sketched below. The public release as a CC-BY 4.0 resource further encourages adaptations and extensions for novel research objectives.
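A minimal sketch of both steps follows: NLTK's standard tagger reproduces the kind of PoS annotation shipped with the corpus, and a hypothetical to_iob helper (the B-/I-IDIOM tag names and offset convention are assumptions, not part of the release) illustrates the IOB extension.

```python
import nltk

# One-time model downloads for the tokenizer and tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "That name rings a bell."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('That', 'DT'), ('name', 'NN'), ('rings', 'VBZ'), ...]

def to_iob(tokens, idiom_start, idiom_len):
    """Attach IOB chunk tags given the idiom's token offsets
    (offsets would come from the corpus annotation; assumed here)."""
    tags = []
    for i, tok in enumerate(tokens):
        if i == idiom_start:
            tags.append((tok, "B-IDIOM"))
        elif idiom_start < i < idiom_start + idiom_len:
            tags.append((tok, "I-IDIOM"))
        else:
            tags.append((tok, "O"))
    return tags

print(to_iob(tokens, idiom_start=2, idiom_len=3))
```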
5. Baseline Experiments and Evaluation
Baseline classification experiments assessed three standard models using the corpus:
- Multinomial Naive Bayes (mNB): Achieved ~74.7% accuracy and an F1 score of 0.66.
- Linear Support Vector Machine (SVM): Attained ~76.6% accuracy and an F1 of 0.67, trained with stochastic gradient descent, hinge loss, and L2 regularization.
- BERT: The transformer-based model yielded 93.4% accuracy and a weighted F1 of 0.948, the highest performance of the three. BERT was trained with WordPiece tokenization, a batch size of 64, and 7 epochs (a fine-tuning sketch appears after the results table below).
Standard preprocessing steps included lowercasing, removal of HTML tags, and exclusion of digits and non-alphabetic symbols. Features for mNB and SVM were extracted using a Count Vectorizer and normalized via TF-IDF.
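A hedged scikit-learn rendering of that feature pipeline is sketched below; hyperparameters beyond those named above are library defaults, and train_texts/train_labels are hypothetical placeholders for the corpus splits.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def make_pipeline(classifier):
    # Count features normalized via TF-IDF, as described above.
    return Pipeline([
        ("counts", CountVectorizer(lowercase=True)),
        ("tfidf", TfidfTransformer()),
        ("clf", classifier),
    ])

mnb = make_pipeline(MultinomialNB())
# Linear SVM trained with SGD, hinge loss, and L2 regularization.
svm = make_pipeline(SGDClassifier(loss="hinge", penalty="l2"))

# Hypothetical splits; the paper's exact preprocessing and partitions apply.
# svm.fit(train_texts, train_labels)
# print(svm.score(test_texts, test_labels))
```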
Class-wise performance favored categories with larger sample counts (e.g., metaphor), whereas hyperbole and oxymoron saw lower F1 as a consequence of smaller representation. Confusion matrix analysis revealed common misclassifications at boundaries between figurative and literal usage, particularly metaphor and euphemism.
The F1 metric was computed as:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

where $P$ and $R$ denote precision and recall, respectively.
| Model | Accuracy (%) | F1 Score |
|---|---|---|
| mNB | ~74.7 | 0.66 |
| SVM | ~76.6 | 0.67 |
| BERT | 93.4 | 0.948 |
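As a rough companion to the BERT row above, the following fine-tuning sketch uses the Hugging Face transformers library. The bert-base-uncased checkpoint, optimizer, and learning rate are assumptions; only the 7 epochs (and, in a real run, the batch size of 64) come from the description above.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=10)  # one logit per idiom class

texts = ["That name rings a bell.", "He went belly up last year."]
labels = torch.tensor([9, 2])  # illustrative class indices only

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(7):  # 7 epochs over one toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```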
These results establish robust baselines for idiom class prediction tasks over PIE-English.
6. Access, Licensing, and Extensibility
PIE-English and supporting code for preprocessing, annotation, and experiments are freely available under the CC-BY 4.0 license. The dataset and scripts are located at github.com/tosingithub/idesk. This licensing framework enables wide-ranging academic and research deployment, including redistribution, adaptation, and integration into existing NLP workflows.
Researchers are encouraged to augment the dataset for specific applications, including extension with novel idiomatic categories or further syntactic/semantic annotations. The reproducibility and extensibility of PIE-English position it as a foundational dataset for advancing idiomatic expression understanding.
7. Significance and Prospects
PIE-English constitutes the first large-scale English corpus to annotate idioms across a detailed, hierarchical ten-class schema, paired with nearly 1,200 idiom cases and 20,174 samples. The dataset's high annotation agreement and thorough baseline experimentation, including strong BERT performance, underscore its methodological rigor and empirical utility. Its open availability and extensible design make it a pivotal resource for ongoing research into figurative language, automatic idiom detection, machine translation, and broader linguistic tasks requiring nuanced interpretation of idiomatic and literal expressions.