Metaphoric Paraphrase Generation
- Metaphoric paraphrase generation is a technique that transforms literal sentences into creative, meaning-preserving metaphorical expressions.
- It employs neural sequence-to-sequence models, masking-based methods, and controlled generation approaches to introduce figurative language.
- These methods trade off semantic fidelity against metaphoric novelty and support applications such as metaphor detection and text augmentation.
Metaphoric paraphrase generation is the task of transforming a literal sentence into a semantically faithful paraphrase that introduces metaphorical language. The generated output should preserve the core propositional meaning while rendering some element (typically, but not exclusively, a verb, noun, or adjective) metaphorically. This task intersects computational semantics, figurative language processing, and controllable text generation. Research in this domain provides computational models, datasets, and evaluation protocols for literal-to-metaphoric transformation and has demonstrated downstream value for tasks such as metaphor detection and text augmentation (Stowe et al., 2020, Stowe et al., 2021, Chakrabarty et al., 2021, Ottolina et al., 2022).
1. Problem Formulation and Task Definition
Formally, given a literal input sentence $L$, the goal is to generate a metaphoric paraphrase $M$ such that $\mathrm{meaning}(M) \approx \mathrm{meaning}(L)$, with $M$ containing at least one metaphoric expression absent from $L$ (Stowe et al., 2020). Approaches extend this to allow targeted paraphrasing by specifying which tokens (e.g., verb, noun, or adjective) are to be rendered metaphorically (Ottolina et al., 2022). Some models also enable control over the conceptual mapping underpinning the metaphor, for instance by specifying the source and target FrameNet domains (e.g., “argument” → “war”) to produce metaphors of the desired type (Stowe et al., 2021).
Most systems focus on single-word-to-metaphor transformations at the predicate level, but recent work generalizes to multiword, phrasal, and varied-POS expressions (Ottolina et al., 2022). Notably, the mask-then-generate paradigm supports both data-driven and controllable generation objectives.
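One way to make the two requirements of the task definition concrete is to treat generation as constrained selection over candidate outputs. The notation here is an assumption made for illustration and is not taken from the cited papers: $L$ is the literal input, $\mathcal{C}(L)$ a set of candidate paraphrases, $\mathrm{sim}$ a semantic similarity function, $\mathrm{met}$ a metaphoricity scorer, and $\tau$ a fidelity threshold.

$$
M^{*} = \arg\max_{M \in \mathcal{C}(L)} \mathrm{met}(M) \quad \text{subject to} \quad \mathrm{sim}(L, M) \ge \tau
$$

Under this reading, the threshold $\tau$ encodes how much meaning drift is tolerated in exchange for a more figurative output.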
2. Modeling Approaches
Lexical Replacement Baselines
Early models implement rule-based lexical substitution. For example, a WordNet-driven strategy identifies the main verb in the input sentence, retrieves troponyms (semantically narrower verbs) from WordNet, and chooses the candidate that maximizes context similarity via Word2Vec embeddings. Fluency is high because only a single word is changed, but the metaphoricity of generated sentences is often limited and literal readings predominate, especially out of domain or when WordNet coverage is thin (Stowe et al., 2020).
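The selection step of such a baseline can be sketched as follows. This is a minimal illustration rather than the cited implementation: the hand-written 2-d vectors stand in for Word2Vec embeddings, the candidate list stands in for troponyms retrieved from WordNet, and `replace_verb` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def replace_verb(tokens, verb_index, candidates, embeddings):
    """Replace the verb with the candidate most similar to the averaged context."""
    context = [embeddings[t] for i, t in enumerate(tokens)
               if i != verb_index and t in embeddings]
    usable = [c for c in candidates if c in embeddings]
    if not context or not usable:
        return list(tokens)  # nothing to compare against: keep the literal sentence
    ctx = mean_vector(context)
    best = max(usable, key=lambda c: cosine(embeddings[c], ctx))
    return tokens[:verb_index] + [best] + tokens[verb_index + 1:]

# Toy vectors (assumption): a real system would load pretrained Word2Vec vectors.
TOY_EMBEDDINGS = {
    "prices": [0.9, 0.1], "higher": [0.8, 0.3],
    "climbed": [0.7, 0.3], "crawled": [0.1, 0.9],
}
# Hypothetical troponym candidates for "moved"; the WordNet lookup is omitted.
CANDIDATES = ["climbed", "crawled"]

result = replace_verb(["prices", "moved", "higher"], 1, CANDIDATES, TOY_EMBEDDINGS)
print(" ".join(result))
```

Because only one slot changes and the candidate is picked by context similarity alone, nothing in this procedure rewards figurative readings, which is consistent with the limited metaphoricity reported for this family of baselines.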
Conceptual mapping methods (e.g., CM-Lex) leverage FrameNet. For a literal verb $v$ evoking frame $F$, an offset vector $\Delta$ in embedding space is added to the embedding of $v$, and the nearest verb candidate is chosen, followed by inflection adjustment. This supports mapping from a literal to a metaphorical source frame but is restricted to verb slots and fails to use full sentence context (Stowe et al., 2021).
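The offset-and-nearest-neighbor idea can be sketched as below. Several details are assumptions for illustration only: the offset is taken to be the difference between the centroids of two frames' verb embeddings, candidates are restricted to verbs of the source frame, the vectors are toy values, and the frame memberships are invented examples. Inflection adjustment is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(words, embeddings):
    """Average embedding of a non-empty list of words."""
    vecs = [embeddings[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def map_verb(verb, literal_frame_verbs, source_frame_verbs, embeddings):
    """Shift the verb by a frame-to-frame offset and return the nearest source-frame verb."""
    lit = centroid(literal_frame_verbs, embeddings)
    src = centroid(source_frame_verbs, embeddings)
    offset = [s - l for s, l in zip(src, lit)]          # assumed definition of the offset
    shifted = [x + d for x, d in zip(embeddings[verb], offset)]
    return max(source_frame_verbs, key=lambda w: cosine(embeddings[w], shifted))

# Toy embeddings and frame memberships (assumptions, not FrameNet data).
TOY_EMBEDDINGS = {
    "die": [1.0, 0.0], "perish": [0.9, 0.1],
    "depart": [0.1, 1.0], "leave": [0.0, 0.9],
}
chosen = map_verb("die", ["die", "perish"], ["depart", "leave"], TOY_EMBEDDINGS)
print(chosen)
```

Note that the function never looks at the sentence containing the verb, which mirrors the limitation stated above: the mapping operates on the verb slot in isolation.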
Sequence-to-Sequence and Masking-Based Architectures
Neural sequence-to-sequence (seq2seq) frameworks, particularly transformer-based models, have enabled free-form paraphrasing with explicit metaphor targets. The metaphor masking approach constructs training pairs by masking the metaphoric verb in context (i.e., the model input is the surrounding context plus a <METAPHOR> token in place of the verb) and training the model to reconstruct the original (metaphoric) sentence. At inference, masking a target verb in a literal sentence prompts the model to generate a new, fluent metaphoric verb or multi-token phrase (Stowe et al., 2020).
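The construction of training pairs and inference inputs can be illustrated with a few lines of string handling. This is a sketch under simple assumptions: sentences are pre-tokenized, the verb position is already known, and the helper names `mask_verb` and `make_training_pair` are hypothetical. The seq2seq model itself is not shown.

```python
MASK = "<METAPHOR>"

def mask_verb(tokens, verb_index):
    """Replace the token at verb_index with the mask token and join into a string."""
    return " ".join(tokens[:verb_index] + [MASK] + tokens[verb_index + 1:])

def make_training_pair(metaphoric_tokens, verb_index):
    """Return (source, target): masked context in, original metaphoric sentence out."""
    return mask_verb(metaphoric_tokens, verb_index), " ".join(metaphoric_tokens)

# Training: start from a sentence whose verb is already metaphoric.
train_src, train_tgt = make_training_pair(["The", "news", "crushed", "her"], 2)

# Inference: mask the literal verb; the model would be asked to fill the slot.
infer_src = mask_verb(["The", "news", "upset", "her"], 2)

print(train_src, "->", train_tgt)
print(infer_src)
```

In this toy case the inference input is identical to the training source, which shows why the approach needs no literal counterpart at training time: once the verb is masked, literal and metaphoric sentences share the same context.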
Controlled generation architectures (e.g., CM-BART) prepend source/target conceptual frame tokens to the input, conditioning the model to produce output metaphors invoking the desired source domain. Training is on parallel data constructed by matching gold paraphrase pairs via symbolic filtering and FrameNet frame parsing (Stowe et al., 2021). MERMAID follows a similar pipeline but bootstraps parallel data from unsupervised metaphor detection and symbolic mapping, using BART for generation and an auxiliary RoBERTa-based metaphoricity discriminator at inference to promote figurative outputs (Chakrabarty et al., 2021).
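Both ideas, frame conditioning and discriminator-assisted inference, can be sketched without a neural model. Everything specific here is an assumption: the bracketed control-token format is invented, the candidates and their log-probabilities are made up, `toy_metaphoricity` stands in for a trained discriminator, and whole-candidate reranking is only one simple way a discriminator could be used at inference.

```python
def build_controlled_input(sentence, target_frame, source_frame):
    """Prepend frame control tokens; the bracketed format is a made-up convention."""
    return f"[TGT:{target_frame}] [SRC:{source_frame}] {sentence}"

def rerank(candidates, metaphoricity, weight=1.0):
    """Sort (text, log_prob) candidates by generator score plus weighted metaphoricity."""
    return sorted(candidates,
                  key=lambda c: c[1] + weight * metaphoricity(c[0]),
                  reverse=True)

def toy_metaphoricity(text):
    """Stand-in scorer: 1.0 if a known figurative verb appears, else 0.0."""
    return 1.0 if "crushed" in text.split() else 0.0

controlled = build_controlled_input("She criticized his argument", "argument", "war")

# Hypothetical generator outputs with made-up log-probabilities.
candidates = [("The news upset her", -1.0), ("The news crushed her", -1.5)]
ranked = rerank(candidates, toy_metaphoricity)

print(controlled)
print(ranked[0][0])
```

With `weight=0` the ranking falls back to the generator's own preference, so the weight acts as a dial between fluency as judged by the generator and figurativeness as judged by the discriminator.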
Recent masked language modeling methods allow tokens of any major class (verb, noun, adjective) to be masked and then filled in by a metaphor-specialized seq2seq or masked LM (e.g., BART, XLM-RoBERTa) fine-tuned on metaphoric reconstructions (Ottolina et al., 2022). This "mask-and-fill" approach extends to more parts of speech and relies less on parallel corpora.
3. Data Construction and Parallel Corpus Acquisition
Constructing training data for metaphoric paraphrase is challenging due to the scarcity of aligned literal–metaphor pairs. Manual annotation is limited; thus, several large-scale automatic strategies are employed.
MERMAID extracts metaphorical sentences (verbs only) from the ∼3M-line Gutenberg Poetry Corpus using a fine-tuned BERT classifier, then generates literal paraphrases with masked language modeling, reranking by literalness and symbolic consistency (COMET SymbolOf relation) to yield ∼93.5K high-confidence literal–metaphor pairs (Chakrabarty et al., 2021). CM-BART constructs ∼248K pairs by aligning poetic/metaphoric and literal paraphrases, filtering for FrameNet frame compatibility and symbolic overlap (Stowe et al., 2021).
The “Metaphorical Paraphrase Generation” framework of Ottolina et al. (2022) adopts a pipeline in which a metaphor/literal classifier identifies candidate literal sentences, masking and unmasking perform self-supervised metaphorification, and a classifier filters the outputs for metaphoricity to maintain data quality. Transfer rates, defined as the proportion of masked words replaced with metaphoric tokens, are highest for verbs (56%) but also nontrivial for nouns (24%) and adjectives (31%).
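The mask, fill, and filter loop and the per-POS transfer rate can be sketched as follows. This is an illustrative outline, not the published pipeline: `fill_mask` and `is_metaphoric` are hypothetical callables standing in for a fine-tuned language model and a metaphor classifier, the `[MASK]` string is a generic placeholder, and the toy data is invented.

```python
from collections import Counter

MASK = "[MASK]"

def metaphorify(tagged, fill_mask, is_metaphoric):
    """tagged: list of (tokens, target_index, pos).

    Returns (kept_sentences, transfer_rate_by_pos), where the transfer rate is
    the share of masked words that were replaced by a token judged metaphoric.
    """
    attempts, hits, kept = Counter(), Counter(), []
    for tokens, idx, pos in tagged:
        masked = tokens[:idx] + [MASK] + tokens[idx + 1:]
        filler = fill_mask(masked)
        output = tokens[:idx] + [filler] + tokens[idx + 1:]
        attempts[pos] += 1
        # Keep only outputs that changed the word AND pass the metaphoricity filter.
        if filler != tokens[idx] and is_metaphoric(output, idx):
            hits[pos] += 1
            kept.append(" ".join(output))
    rates = {pos: hits[pos] / attempts[pos] for pos in attempts}
    return kept, rates

# Toy stand-ins (assumptions) for the language model and the classifier.
FILLS = {
    "prices [MASK]": "soared",
    "she [MASK] home": "walked",
    "a [MASK] decision": "thorny",
}
METAPHORIC_WORDS = {"soared", "thorny"}

def toy_fill(masked):
    return FILLS[" ".join(masked)]

def toy_is_metaphoric(output, idx):
    return output[idx] in METAPHORIC_WORDS

data = [
    (["prices", "rose"], 1, "VERB"),
    (["she", "walked", "home"], 1, "VERB"),
    (["a", "hard", "decision"], 1, "ADJ"),
]
kept, rates = metaphorify(data, toy_fill, toy_is_metaphoric)
print(kept, rates)
```

Counting an attempt for every masked word, including those where the model simply restores the original token, keeps the rate comparable across parts of speech.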
Datasets derived in this manner provide sufficient coverage to fine-tune transformers and enable robust evaluation and ablation.
4. Evaluation Methodologies
Human and automatic evaluation protocols have been developed to account for both semantic fidelity and metaphoricity, with particular consideration of creativity and figurative novelty.
- Human Evaluation: Crowdsourced or expert annotators rate generated paraphrases on Likert scales for metaphoricity, fluency, creativity, and meaning preservation. For example, (Stowe et al., 2020) employs 1–4 scales for metaphoricity, fluency, and paraphrase quality (semantic similarity), while (Chakrabarty et al., 2021) uses 1–5 scales for fluency, meaning retention, creativity, and metaphoricity. System-generated outputs are sometimes rated as more creative and metaphorical than human paraphrases (Creativity: 3.15 system vs. 2.76 human; Metaphoricity: 3.24 vs. 3.09) (Ottolina et al., 2022).
- Automatic Metrics: Embedding-based similarity metrics (e.g., cosine similarity via SBERT) measure both output-to-reference similarity and relational distance (change from the literal input). BLEU and ROUGE are used with caveats, since these n-gram overlap measures can overpenalize creative, low-overlap metaphors; BERTScore is reported as an embedding-based alternative. Some works propose combining contextual similarity with explicit metaphoricity scorers or Earth Mover’s Distance on BERT embeddings weighted by a metaphor detector (Stowe et al., 2020, Stowe et al., 2021).
- Extrinsic Evaluation: Generated metaphors used to augment labeled datasets improve downstream performance. Appending 428 system-generated metaphors to the TroFi-X metaphor detection set increased XLM-RoBERTa F1 by about 3 percentage points (from 93.13% to 96.12%) (Ottolina et al., 2022).
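The idea of combining a similarity measure with an explicit metaphoricity scorer can be sketched as a weighted sum. This is an assumed formulation for illustration, not a metric defined in the cited papers: the vectors stand in for sentence embeddings (for example from SBERT), the metaphoricity value is assumed to lie in [0, 1], and `alpha` is a hypothetical mixing weight.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def combined_score(literal_vec, output_vec, metaphoricity, alpha=0.5):
    """Weighted mix of meaning preservation (cosine) and metaphoricity in [0, 1]."""
    fidelity = cosine(literal_vec, output_vec)
    return alpha * fidelity + (1 - alpha) * metaphoricity

# Toy sentence vectors (assumption): a copy of the input vs. a metaphoric rewrite.
literal = [1.0, 0.0]
copy_score = combined_score(literal, [1.0, 0.0], metaphoricity=0.0)
metaphor_score = combined_score(literal, [0.8, 0.6], metaphoricity=0.9)

print(round(copy_score, 2), round(metaphor_score, 2))
```

A pure similarity metric would rank the unchanged copy first; adding the metaphoricity term lets a faithful but figurative rewrite score higher, which is the behavior the combined proposals aim for.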
5. Comparative Performance and Qualitative Analysis
The systems (lexical baselines, conceptual mapping models, and neural sequence-to-sequence models) exhibit trade-offs in fluency, metaphoricity, meaning preservation, and creativity.
A selection of results:
| System | Metaphoricity (human eval) | Fluency | Meaning preservation / paraphrase quality (PP) | Creativity |
|---|---|---|---|---|
| Gold (human) | 3.2 (Stowe et al., 2020); 4.02/3.58 (Chakrabarty et al., 2021) | 3.9 | 3.6 | - |
| Lexical Rep. | 2.7 | 4.0 | 3.8 (PP 5), 2.7 (6) | 2.16 |
| MetaphorMask | 3.1 | 3.3 | 2.4 (PP 7), 3.0 (8) | 3.00 |
| CM-BART | 2.72–4.0 (various) | - | 29.3% exact match | - |
| MERMAID | 3.07 | 3.46 | 3.35 | 3.50 |
| System (Ottolina et al., 2022) | 3.24 | 4.20 | 3.95 | 3.15 |
| Human (Ottolina et al., 2022) | 3.09 | 4.29 | 4.34 | 2.76 |
CM-BART achieves the lowest mean distance to gold metaphors (0.066) and the highest exact match rate (29.3%) (Stowe et al., 2021). MERMAID obtains the best combined scores for semantic similarity and BERTScore (85.0, 0.71) (Chakrabarty et al., 2021). The mask-and-fill approach generalizes across POS and yields metaphors judged more creative than human originals (Ottolina et al., 2022).
Qualitative errors cluster around low-frequency predicate substitutions with little metaphoricity gain and around ungrammatical or semantically odd outputs, especially for methods lacking contextual embeddings or robust domain constraints (Stowe et al., 2021, Chakrabarty et al., 2021). Masking-based approaches support multiword metaphors and content beyond verbs.
6. Research Limitations and Open Directions
Several limitations appear consistently:
- Data scarcity for large, parallel, gold-labeled literal-metaphor pairs constrains supervised model development (Stowe et al., 2020).
- Evaluation of metaphoricity remains nontrivial, with automatic metrics either underappreciating creativity or not aligning with human judgments (Stowe et al., 2020, Ottolina et al., 2022).
- Lexical and conceptual mapping methods struggle to maintain fluency and compatibility with context outside strict verb slot replacement (Stowe et al., 2021).
- Existing models typically focus on single-token metaphors and verbs; extending metaphorification to nouns, adjectives, multiword expressions, and full clauses remains a target of further research (Ottolina et al., 2022, Stowe et al., 2021).
- Conceptual coverage: Most approaches rely on FrameNet or hand-crafted source–target mapping, limiting metaphoric expressivity to recognized domains.
Recommended future work includes:
- Crowdsourcing or heuristic construction of larger-scale, fully parallel literal-metaphor datasets.
- Development of automatic evaluation metrics sensitive both to meaning preservation and figurative novelty, possibly learned via contextual embeddings plus explicit metaphor detectors (Stowe et al., 2020, Ottolina et al., 2022).
- Integration of structured knowledge bases (FrameNet, MetaNet, ConceptNet) or symbolic compatibility constraints to ground metaphor generation in authentic conceptual mappings (Stowe et al., 2021, Chakrabarty et al., 2021).
- Task adaptation to broader figurative language understanding, e.g., sentiment shift, persuasive writing, or narrative voice (Ottolina et al., 2022).
7. Applications and Downstream Impact
Metaphoric paraphrase generation supplies figurative rephrasings for data augmentation; in the reported TroFi-X experiment, adding system outputs to the training data raised metaphor detection F1 by about 3 percentage points (Ottolina et al., 2022). Task-based evaluations in literary domains show that metaphoric rewrites produced by models such as MERMAID increase the aesthetic preference for creative text, with Mechanical Turk workers preferring poems enhanced by these metaphors 68% of the time (Chakrabarty et al., 2021). A plausible implication is that controlled metaphor injection can be leveraged to boost expressiveness and engagement in NLG systems while supporting figurative language understanding for downstream NLP.
In summary, recent advancements in metaphoric paraphrase generation combine symbolic and neural architectures, explicit conceptual mapping, and self-supervised masking protocols to deliver fluent, creative, and semantically faithful metaphoric rewritings. Progress in this area establishes a foundation for figurative language generation and evaluation, with demonstrated utility in both linguistic research and practical NLP applications (Stowe et al., 2020, Chakrabarty et al., 2021, Stowe et al., 2021, Ottolina et al., 2022).