MetaphorShare: Unified Metaphor Data Repository

Updated 6 April 2026
  • MetaphorShare is a unified, collaborative repository that standardizes multilingual metaphor datasets for interdisciplinary research.
  • It features a robust web infrastructure built with ReactJS, Python FastAPI, and PostgreSQL, enabling seamless upload, search, and download operations.
  • The platform promotes open licensing and community engagement to lower barriers in metaphor research and enhance NLP model development.

MetaphorShare is a dynamic, collaborative repository designed to unify, standardize, and openly disseminate labeled metaphor datasets across languages and disciplines. Developed in response to persistent fragmentation in metaphor research resources, MetaphorShare provides a web-based infrastructure (www.metaphorshare.com) facilitating upload, download, search, and annotation of datasets according to a unified schema. This infrastructure underpins both theoretical investigations in cognitive linguistics and practical advances in NLP-based metaphor identification, interpretation, and generation (Boisson et al., 2024).

1. Motivation and Scope

The metaphor studies and NLP communities have independently assembled dozens of labeled metaphor corpora over several decades. These datasets are characterized by heterogeneity in format, annotation conventions, and licensing, often remaining unpublished or limited to single research groups. MetaphorShare is motivated by three core objectives:

  • Resource Unification: Provide a single portal for seamless access to pre-formatted, multilingual metaphor resources.
  • Interdisciplinary Bridging: Facilitate theoretical linguistics and empirical NLP collaboration by standardizing data formats and increasing accessibility.
  • Promotion of Reuse: Lower technical and legal barriers to dataset reuse through minimal constraints on format and mandatory open licensing.

Through these goals, MetaphorShare endeavors to accelerate research on metaphor processing, supporting both qualitative analysis and development of data-driven NLP models (Boisson et al., 2024).

2. System Architecture and Data Schema

2.1 High-Level Architecture

MetaphorShare’s technical infrastructure is organized as follows:

  • Frontend: Built in ReactJS (v18), integrating Bootstrap and Ant Design for interface consistency, with Redux state management and Chart.js visualizations.
  • Backend: Python FastAPI provides REST endpoints for all operations, with background schedulers managing batch ingestion, index maintenance, and cleanup.
  • Database Layer: PostgreSQL stores all user-submitted metadata (titles, authors, licensing, language tags), while Elasticsearch indexes each labeled metaphor instance, enabling full-text and fuzzy search with multilingual capabilities.
  • Security and Hosting: Cardiff University hosts the platform under HTTPS/SSL, restricting all access to authenticated FastAPI endpoints (Boisson et al., 2024).

2.2 Unified Data Model

MetaphorShare mandates a standardized CSV schema:

  • Required Column: tagged_sentence with XML-style tags—<m>...</m> for metaphoric, <l>...</l> for literal, <t>...</t> for target concept cue (MIPVU-style), and <u>...</u> for user-defined tags (e.g., domain-specific cues). Example: I <m>swim</m> today in an <m>ocean</m> of <t>happiness</t>.
  • Extendable Fields: Free-named columns can encode additional metadata (e.g., concreteness_score, source_pos, wn_synset, mscore).
  • Dataset Index: Each submission includes required metadata (uploader name/email, dataset title, license, binary file) and optional metadata (publication details, annotator profiles, IAA measures).
  • Instance Index: Elasticsearch indexes each Potential Metaphoric Expression (PME) with dataset link, span, label (m, l), position, and all custom fields when available.
  • Multilinguality: No language-specific processing pipeline is enforced—Unicode and per-language analyzers in Elasticsearch ensure text searchability across scripts (Boisson et al., 2024).
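The tagging convention above can be illustrated with a small parser. This is a minimal sketch, not the platform's own code: the names `TaggedSpan` and `parse_tagged_sentence` are hypothetical, and it assumes that tags are never nested and that offsets are counted in characters of the untagged sentence.

```python
import re
from dataclasses import dataclass

# Tag letters described in the schema: m = metaphoric, l = literal,
# t = target concept cue, u = user-defined.
TAG_PATTERN = re.compile(r"<([mltu])>(.*?)</\1>")


@dataclass
class TaggedSpan:
    label: str  # one of "m", "l", "t", "u"
    text: str   # the tagged expression
    start: int  # character offset in the untagged sentence
    end: int    # exclusive end offset in the untagged sentence


def parse_tagged_sentence(tagged: str) -> tuple[str, list[TaggedSpan]]:
    """Return the untagged sentence and the spans it contains (assumes no nesting)."""
    plain_parts: list[str] = []
    spans: list[TaggedSpan] = []
    cursor = 0     # position in the tagged string
    plain_len = 0  # length of the untagged sentence built so far
    for match in TAG_PATTERN.finditer(tagged):
        before = tagged[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        label, text = match.group(1), match.group(2)
        spans.append(TaggedSpan(label, text, plain_len, plain_len + len(text)))
        plain_parts.append(text)
        plain_len += len(text)
        cursor = match.end()
    plain_parts.append(tagged[cursor:])
    return "".join(plain_parts), spans


sentence, spans = parse_tagged_sentence(
    "I <m>swim</m> today in an <m>ocean</m> of <t>happiness</t>."
)
```

Applied to the example sentence from the schema, this yields the untagged text plus three spans (two labeled m, one labeled t), which is roughly the information an instance-level index would need alongside the dataset link and custom columns.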

3. Core Workflows and Platform Functionalities

3.1 Dataset Submission and Validation

  1. CSV Preparation: Users format data according to the prescribed schema.
  2. Metadata Form Completion: Users fill in the required and optional metadata fields on the upload page.
  3. Automatic Tag Validation: Each entry is checked for valid XML tagging; malformed rows are flagged for correction.
  4. Review Status: Files that pass validation are assigned the status "UNDER_REVIEW".
  5. Manual Admin Verification: Admins ensure license compliance, metadata completeness, and optionally suggest renaming of fields for standardization.
  6. Ingestion: Upon approval, datasets are ingested into PostgreSQL and indexed in Elasticsearch (Boisson et al., 2024).
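The automatic tag validation step could look roughly like the following. The source does not describe the actual checks, so this is a sketch under assumptions: the function names `tag_errors` and `validate_csv` are hypothetical, only the four documented tag letters are accepted, nesting is treated as an error, and any `<...>` sequence is treated as a tag.

```python
import csv
import io
import re

ALLOWED_TAGS = {"m", "l", "t", "u"}
TOKEN = re.compile(r"<(/?)([^<>]*)>")  # any opening or closing tag


def tag_errors(tagged: str) -> list[str]:
    """List tagging problems found in one tagged_sentence value."""
    errors = []
    open_tag = None  # tag currently open, if any (nesting is assumed invalid)
    for match in TOKEN.finditer(tagged):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in ALLOWED_TAGS:
            errors.append(f"unknown tag <{match.group(1)}{name}>")
        elif not closing:
            if open_tag is not None:
                errors.append(f"<{name}> opened inside <{open_tag}>")
            open_tag = name
        elif open_tag != name:
            errors.append(f"</{name}> without matching <{name}>")
        else:
            open_tag = None
    if open_tag is not None:
        errors.append(f"<{open_tag}> is never closed")
    return errors


def validate_csv(csv_text: str) -> dict[int, list[str]]:
    """Map 1-based data row numbers to their tagging errors."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "tagged_sentence" not in (reader.fieldnames or []):
        raise ValueError("missing required column: tagged_sentence")
    flagged = {}
    for row_number, row in enumerate(reader, start=1):
        errors = tag_errors(row["tagged_sentence"] or "")
        if errors:
            flagged[row_number] = errors
    return flagged
```

Returning per-row error lists, rather than failing on the first problem, matches the described behavior of flagging malformed rows for correction before the file moves on to review.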

3.2 Download, Search, and Licensing

  • Web-Based Search: Dropdown filters allow users to search by dataset, language, label, or textual content. Elasticsearch supports BM25-based ranking and result pagination.
  • Download Options: Users may bulk download original CSVs or select records in CSV/JSON formats. All downloads preserve original tagging and custom columns.
  • Licensing Visibility: Each dataset displays its license; adherence to these terms is required for redistribution.
  • (Planned) Annotation Extension: Future releases aim to provide in-browser annotation and interactive correction, potentially with semi-automatic assistance based on fine-tuned transformer models (Boisson et al., 2024).
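The filter-then-export flow described above can be sketched with the standard library alone. This is an illustration, not the platform's implementation: the helper names and the record layout (a `language` key next to `tagged_sentence`) are assumptions, and plain substring matching stands in for the Elasticsearch BM25 ranking mentioned above.

```python
import csv
import io
import json


def filter_records(records, language=None, label=None, text=None):
    """Keep records that match every filter that was supplied."""
    selected = []
    for record in records:
        if language and record.get("language") != language:
            continue
        if label and f"<{label}>" not in record["tagged_sentence"]:
            continue
        if text and text.lower() not in record["tagged_sentence"].lower():
            continue
        selected.append(record)
    return selected


def to_csv(records):
    """Serialize records to CSV, keeping every custom column that appears."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize records to JSON without escaping non-ASCII text."""
    return json.dumps(records, ensure_ascii=False, indent=2)


records = [
    {"tagged_sentence": "I <m>swim</m> in joy", "language": "en", "mscore": "3"},
    {"tagged_sentence": "I <l>swim</l> in the lake", "language": "en"},
]
literal_only = filter_records(records, language="en", label="l")
```

Both exporters leave the `tagged_sentence` markup untouched, so the original tagging and any custom columns survive the round trip.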

4. Dataset Coverage and Empirical Evaluation

4.1 Repository Contents

At launch (April 2024), MetaphorShare integrated 25 datasets, primarily in English, although the platform explicitly supports any language. Examples include:

  • J&C: Psycholinguistic constructed sentences
  • CARD_V / CARD_N: Adjective-noun pair constructions
  • MOH: WordNet-based metaphors (~1600 instances)
  • NewsMet, TSV_A, GUT, PVC: Wide spectrum from fake news headlines to Wikipedia and phrasal verb corpora
  • MAD / MAGPIE / MIPVU_BO / TONG: Multiword, paraphrase, and variant-rich datasets

Instance counts range from under 2,000 to over 140,000 per dataset, with widely differing metaphor prevalence (Boisson et al., 2024).

4.2 Cross-Dataset Model Evaluation

A cross-dataset metaphor identification experiment demonstrated MetaphorShare’s potential for NLP benchmarking:

  • Model: RoBERTa-base, fine-tuned per dataset (10 datasets; J&C and 9 NLP corpora)
  • Training Regime: 800 randomly drawn samples per dataset; Bayesian hyperparameter optimization (BOHB via RayTune)
  • Task: Binary classification for metaphoric versus literal expressions
  • F1-Score Results (excerpt):

Test/Train  J&C   MOH   NewsMet  TSV_A  GUT   TONG  PVC   MAD   MAGPIE  VUAC_BO
J&C         0.89  0.52  0.68     0.78   0.71  0.54  0.61  0.63  0.77    0.60
MOH         0.70  0.58  0.42     0.73   0.67  0.38  0.21  0.59  0.26    0.28

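In-domain F1-scores were often, but not always, the highest: in the excerpt, the J&C in-domain score of 0.89 is the maximum of its row, whereas the MOH in-domain score of 0.58 is exceeded by two cross-dataset scores (0.70 and 0.73). Generalization was highly variable (e.g., GUT and J&C generalized robustly, whereas NewsMet and TONG exhibited lower cross-set transfer), which suggests substantial annotation or domain discrepancies between corpora (Boisson et al., 2024).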
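The two excerpted rows can be summarized by comparing each in-domain score with the mean of the remaining scores in the same row. The sketch below uses only the values reported in the excerpt; the helper name `summarize_row` is hypothetical, and the code makes no assumption about whether rows denote the training or the test corpus.

```python
DATASETS = ["J&C", "MOH", "NewsMet", "TSV_A", "GUT",
            "TONG", "PVC", "MAD", "MAGPIE", "VUAC_BO"]

# The two rows reported in the excerpt; the remaining rows are not given.
F1_EXCERPT = {
    "J&C": [0.89, 0.52, 0.68, 0.78, 0.71, 0.54, 0.61, 0.63, 0.77, 0.60],
    "MOH": [0.70, 0.58, 0.42, 0.73, 0.67, 0.38, 0.21, 0.59, 0.26, 0.28],
}


def summarize_row(name, scores):
    """Compare the in-domain score with the cross-dataset scores of one row."""
    paired = dict(zip(DATASETS, scores))
    in_domain = paired.pop(name)  # the remaining entries are cross-dataset
    best_other = max(paired, key=paired.get)
    return {
        "in_domain": in_domain,
        "cross_mean": sum(paired.values()) / len(paired),
        "best_other": best_other,
        "in_domain_is_best": in_domain >= paired[best_other],
    }


summary = {name: summarize_row(name, row) for name, row in F1_EXCERPT.items()}
```

For J&C the in-domain score exceeds every cross-dataset score, while for MOH the TSV_A column is higher than the in-domain score, which illustrates how uneven transfer is even within this small excerpt.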

5. Data Governance, Community Practices, and Collaboration

5.1 Submission Standards and Moderation

  • Field Labeling: Standardized names (e.g., pos, mscore, source_pos) are recommended to facilitate automatic integration.
  • Licensing: Open licensing (CC BY 4.0, GPL-3.0, Apache-2.0) is mandatory and explicitly enforced in the metadata.
  • Publication Linkage: Submissions related to published work should include BibTeX entries.
  • Review Process: Datasets undergo automated structural checking, followed by manual review for compliance and metadata completeness. Public release follows approval.

5.2 Community Engagement and Feedback

  • Commenting: Each dataset page supports a user comment field for errata and suggestions.
  • Future Features: Planned upgrades include rating and endorsement mechanisms by domain experts, as well as community-driven annotation expansion (Boisson et al., 2024).

6. Relation to Ontology-Based Metaphor Resources

MetaphorShare is referenced in the context of Amnestic Forgery, an OWL 2 ontology integrating conceptual metaphor semantics over the Framester Linked Open Data graph (Gangemi et al., 2018). Amnestic Forgery leverages Description and Situation (D&S) patterns to model metaphors as OWL classes linking source and target frames, semantic role mappings, and, prospectively, blended frames. This ontological infrastructure enables formal reasoning, SPARQL querying (e.g., mapping adjective–noun synset pairs), and automated metaphor generation. While MetaphorShare itself primarily standardizes and indexes annotated datasets, such structured resources may feed into or interoperate with ontology-based systems for advanced metaphor detection and interpretation tasks (Gangemi et al., 2018).

7. Future Directions and Open Challenges

Several enhancements are identified:

  • Multilingual Expansion: Active efforts toward broadening language coverage by integrating new corpora under the same schema.
  • Interactive Annotation: In-browser tools for direct labeling, correction, and model-in-the-loop tag suggestion.
  • Versioning and Provenance: Implementation of dataset version control and DOI assignment per release.
  • Analytics Dashboards: Exposure of usage metrics (downloads, user activity, ratings) using Chart.js.
  • Multimodal Data Support: Planned schema extension to include references to images, video, or audio as required for multimodal metaphor studies.
  • Standardized IAA Computation: Built-in calculators (Cohen’s κ, Krippendorff’s α) to facilitate and standardize reporting of inter-annotator agreement metrics (Boisson et al., 2024).
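As an illustration of what a built-in agreement calculator might compute, the sketch below implements Cohen's κ for two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name is hypothetical, and returning 1.0 when both annotators use a single identical label (where κ is formally undefined) is a convention chosen here, not something the source specifies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumed convention: kappa is undefined in this degenerate case
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(["m", "m", "l", "l"], ["m", "l", "l", "l"])
```

In this toy example the annotators agree on three of four items (p_o = 0.75) and chance agreement is p_e = 0.5, so κ = 0.5.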

MetaphorShare, as a shared infrastructure for metaphor data, constitutes an initial step toward a collaborative, cross-disciplinary data ecosystem. By enforcing a unified format and open licensing, and by supporting robust search and download, it materially reduces barriers to both empirical and theoretical research in metaphor processing.
