Cross-Lingual AMR Parsing

Updated 11 May 2026

Cross-Lingual AMR Parsing is the process of converting sentences from different languages into a unified semantic graph that abstracts away surface syntax.
It leverages annotation projection and word alignment along with synthetic silver data to overcome low-resource challenges in non-English languages.
Multilingual and bilingual modeling techniques, including sequence-to-sequence architectures and ensemble distillation, significantly boost parsing accuracy.

Cross-lingual Abstract Meaning Representation (AMR) parsing is the task of mapping sentences in multiple languages onto AMR, a shared semantic graph formalism that was originally designed for English. AMR encodes core semantic structures through concepts and relations, abstracting away from surface syntax. Cross-lingual AMR parsing aims to provide language-neutral representations, thereby enabling multilingual semantic analysis, supporting downstream cross-lingual tasks, and facilitating semantic resource development for low-resource languages. Because AMR annotation is costly and heavily anglocentric, a central challenge is accurate projection and parsing for non-English languages with limited or no gold-standard AMR resources in those languages.

1. AMR as an Interlingua and Cross-Lingual Challenges

AMR graphs employ English lemmas, PropBank frames, and semantic roles, but the formalism aspires to capture language-independent meaning structure. Early research established that textual content in German, Spanish, Italian, Chinese, and other languages can be mapped onto English AMR graphs by leveraging parallel corpora and annotation projection (Damonte et al., 2017). Nevertheless, several challenges persist:

Lexical bias: AMR nodes are typically aligned with English lemmas, making direct mapping from non-English tokens non-trivial, especially in morphologically rich or typologically distant languages.
Structural divergence: Syntactic and semantic differences across languages (e.g., argument realization, word order, head-swapping) frequently complicate graph projection.
Low-resource bottleneck: The small size of human-annotated AMR corpora (e.g., 55k English sentence–AMR pairs in AMR 3.0) constrains supervised cross-lingual transfer and necessitates techniques that leverage silver data and weak supervision (Kang et al., 2024, Uhrig et al., 2021).

2. Annotation Projection, Word Alignment, and Silver Data

Annotation projection has been foundational for cross-lingual AMR systems. The core technique projects AMR node alignments from English source sentences to their foreign equivalents using parallel corpora and word alignment models (fast_align, EM+JAMR, contextual alignment) (Damonte et al., 2017, Sheth et al., 2021). Both fast_align and contextual alignment via XLM-R embeddings serve as high-precision, weakly supervised methods for aligning foreign tokens to AMR nodes. Combined with robust transition-based or seq2seq parsers, annotation projection yields strong cross-lingual systems.
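The projection step can be illustrated with a minimal sketch. It assumes that English token-to-node alignments are already available and that the word aligner emits Pharaoh-style `i-j` pairs with English on the source side. The function names, the toy sentence pair, and the node ids are hypothetical and not taken from the cited systems.

```python
from collections import defaultdict

def parse_pharaoh(line):
    """Parse Pharaoh-style 'i-j' pairs into (source_index, target_index) tuples."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

def project_node_alignments(en_token_to_nodes, word_alignment):
    """Project AMR node alignments from English tokens onto target-language tokens.

    en_token_to_nodes: dict mapping English token index -> list of AMR node ids
    word_alignment:    list of (english_index, target_index) pairs
    Returns a dict mapping target token index -> sorted list of AMR node ids.
    """
    projected = defaultdict(set)
    for en_idx, tgt_idx in word_alignment:
        # English tokens without an aligned node (e.g. "The", "to") contribute nothing.
        for node in en_token_to_nodes.get(en_idx, []):
            projected[tgt_idx].add(node)
    return {idx: sorted(nodes) for idx, nodes in projected.items()}

# English: "The boy wants to sleep"   German: "Der Junge will schlafen"
en_token_to_nodes = {1: ["b"], 2: ["w"], 4: ["s"]}  # boy -> b, wants -> w, sleep -> s
alignment = parse_pharaoh("0-0 1-1 2-2 4-3")
projected = project_node_alignments(en_token_to_nodes, alignment)
print(projected)  # {1: ['b'], 2: ['w'], 3: ['s']}
```

Target tokens that receive no node stay unaligned, which is one source of the coverage loss that silver data and better aligners try to reduce.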

The use of synthetic silver data—where large-scale automatic English AMR annotation is projected to foreign sentences—has repeatedly been shown to substantially improve performance. Bootstrapping with synthetic data parsed from SQuAD2.0, OntoNotes, or newswire further increases coverage (Sheth et al., 2021).

Technique	Description	Notable Results
Annotation projection	Map English AMR alignments to target language via word alignment	+16–20 F1 over baselines
Silver data augmentation	Use machine-translated or LLM-generated AMRs for low-resource target languages	4–7 F1 boost in-domain
Contextual word alignment	Leverage XLM-R for token-level mapping	Outperforms fast-align

The “full-cycle” evaluation method reconstructs gold English AMR parse quality from a system trained in the reverse direction (F→EN), providing a strong, annotation-free proxy for target language performance (Damonte et al., 2017).

3. Multilingual and Bilingual Modeling Paradigms

Recent work has explored multilingual sequence-to-sequence architectures, often pre-trained on large multilingual corpora (mBART, mT5), that share a high degree of parameters across languages (Cai et al., 2021, Cai et al., 2021, Barta et al., 2025). In bilingual setups, simultaneous encoding of the target sentence and its English translation yields gains in concept specificity and structural fidelity. The auxiliary reconstruction task, which forces the decoder to regenerate the English input, sharpens latent representations and enhances concept disambiguation.
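As a rough illustration of the bilingual setup with an auxiliary reconstruction task, the sketch below builds one shared encoder input and two decoder targets per sentence, then combines the two losses. The separator string, task names, and the weight `aux_weight` are assumptions made for this example, not values reported in the cited work.

```python
def make_training_examples(src_sentence, en_translation, linearized_amr, sep=" <sep> "):
    """Build the main (AMR) and auxiliary (English reconstruction) examples.

    Both tasks share one encoder input: the source sentence followed by its
    English translation. Only the decoder target differs.
    """
    encoder_input = f"{src_sentence.strip()}{sep}{en_translation.strip()}"
    return [
        {"task": "amr", "input": encoder_input, "target": linearized_amr},
        {"task": "reconstruct_en", "input": encoder_input, "target": en_translation.strip()},
    ]

def combined_loss(loss_by_task, aux_weight=0.5):
    """Weighted sum of the AMR loss and the auxiliary reconstruction loss."""
    return loss_by_task["amr"] + aux_weight * loss_by_task["reconstruct_en"]

examples = make_training_examples(
    "Der Junge will schlafen",
    "The boy wants to sleep",
    "( want-01 :ARG0 ( boy ) :ARG1 ( sleep-01 ) )",
)
for example in examples:
    print(example["task"], "|", example["input"], "->", example["target"])

print(combined_loss({"amr": 2.0, "reconstruct_en": 1.0}))  # 2.5
```

In an actual training loop the per-task losses would come from the seq2seq model's cross-entropy on each target. The weighting is a tunable design choice.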

Ablation experiments show that bilingual input and multi-task losses (AMR + English reconstruction) outperform monolingual or translation-only pipelines by roughly 4–10 Smatch F1 points across languages:

Configuration	DE	IT	ES	ZH	Improvement (F1)
S2S (no bilingual/aux)	53.1	—	—	—	baseline
+ Bilingual input	57.5	—	—	—	+4.4
+ Auxiliary task	58.6	—	—	—	+5.5
Both	64.0	65.4	67.3	56.5	+10.2 avg

Adding silver-standard data derived from LLMs such as Llama-3.1-70B to mT5 or Llama-3.2-1B training pipelines improves domain-specific parsing (e.g., Hungarian news) by 4–7 F1 points (Barta et al., 2025).

4. Distillation, Ensemble Methods, and Model Transfer

Knowledge distillation from strong English AMR teachers to multilingual students has demonstrated large cross-lingual gains (Cai et al., 2021). Sequence-level distillation, where the student learns to imitate teacher parses on translated inputs (“MT noise”), makes the model robust to variation in lexicalization and structure. Augmenting mBART vocabulary with AMR frames and relations further improves target coverage.
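One way to picture the sequence-level distillation data is the sketch below: the teacher parses the clean English sentence, and the student is trained on a machine-translated version of the same sentence paired with the teacher's output. The `teacher_parse` and `translate` callables are hypothetical stand-ins for a real English AMR parser and MT system, and the toy implementations exist only to make the example run.

```python
def build_distillation_pairs(english_sentences, teacher_parse, translate, target_langs):
    """Pair machine-translated inputs with the teacher's parse of the English original."""
    pairs = []
    for en in english_sentences:
        silver_amr = teacher_parse(en)         # the teacher sees clean English
        for lang in target_langs:
            noisy_input = translate(en, lang)  # the student sees MT output ("MT noise")
            pairs.append({"lang": lang, "input": noisy_input, "target": silver_amr})
    return pairs

# Toy stand-ins so the sketch runs; replace with a real parser and MT system.
def toy_teacher(sentence):
    return f'(amr-of "{sentence}")'

def toy_translate(sentence, lang):
    return f"[{lang}] {sentence}"

pairs = build_distillation_pairs(
    ["The boy wants to sleep"], toy_teacher, toy_translate, ["de", "es"]
)
for pair in pairs:
    print(pair)
```

Every target language receives the same silver graph for a given English sentence, so the student learns to map differently lexicalized inputs to one shared output.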

Maximum Bayes Smatch Ensemble Distillation (MBSE) involves ensembling various English AMR parsers, fusing their predictions to form high-confidence silver graphs, and training a single student model on these distilled annotations (Lee et al., 2021). This approach outperforms the prior state of the art on all four standard evaluation languages:

Model	DE	ES	IT	ZH
MBSE (cross-lingual)	73.7	77.1	76.1	63.0

Scaling the number and diversity of “teacher” systems and silver data size provides diminishing but consistently positive returns.
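The consensus idea behind ensembling can be sketched as a selection rule: among the candidate graphs produced by different teachers, keep the one that agrees most with the others. The example below is a deliberate simplification and not the published MBSE algorithm. It assumes all candidates already share variable names, so agreement reduces to plain triple overlap rather than Smatch with alignment search, and it only selects a graph rather than fusing several.

```python
def triple_f1(pred, gold):
    """F1 over two sets of (source, relation, target) triples with shared variable names."""
    matched = len(pred & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

def select_consensus_graph(candidates):
    """Return (index, score) of the candidate with the highest mean F1 against the others."""
    best_idx, best_score = 0, -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(triple_f1(cand, o) for o in others) / len(others) if others else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score

core = {("w", "instance", "want-01"), ("b", "instance", "boy"), ("w", "ARG0", "b")}
teacher_a = set(core)                                                        # misses "sleep"
teacher_b = core | {("s", "instance", "sleep-01"), ("w", "ARG1", "s")}       # misses the reentrancy
teacher_c = teacher_b | {("s", "ARG0", "b")}                                 # full graph

best_idx, best_score = select_consensus_graph([teacher_a, teacher_b, teacher_c])
print(best_idx, round(best_score, 3))  # 1 0.83
```

Here the middle candidate wins because it overlaps strongly with both the sparser and the richer graph, which is the behavior a consensus criterion is meant to reward.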

5. Modular Pipelines: Translate-Then-Parse and Joint Learning

The “Translate, then Parse” (T+P) pipeline sets a robust, transparent, and high-performing baseline for cross-lingual AMR parsing (Uhrig et al., 2021). The process involves translating the input sentence to English (NMT), parsing the translation with an English AMR parser (e.g., T5-based amrlib), and returning the resulting semantic graph:

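def cross_lingual_amr_parse(src_sentence, src_lang, nmt, amr_parser):
    # nmt and amr_parser are placeholders for any MT system and English AMR parser
    en_sentence = nmt.translate(src_sentence, source=src_lang, target="en")  # 1. translate to English
    amr_graph = amr_parser.parse(en_sentence)  # 2. parse the English translation
    return amr_graph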

T+P outperforms prior multilingual end-to-end models by up to +16.0 Smatch points in Mandarin, +14.6 in German, and +14.3 in Spanish, and achieves uniformly higher scores on all fine-grained metrics, especially for negation and semantic role labeling.

In contrast, joint multilingual learning trains a single encoder–decoder model on parallel silver AMRs across multiple languages. Meta-learning via first-order MAML provides a modest advantage in zero-shot scenarios but yields negligible or no benefit when even minimal target-language data is available, corroborating the resilience and strength of classical joint learning (Kang et al., 2024).

6. Evaluation Metrics and Error Analysis

Smatch [Cai & Knight 2013] remains the principal metric for system comparison, measuring the proportion of matching triples between predicted and gold AMR graphs under the optimal node alignment. Writing $M$ for the set of predicted triples and $M'$ for the set of gold triples after that alignment, precision, recall, and F₁ are defined as:

$P = \frac{|M \cap M'|}{|M|};\;\; R = \frac{|M \cap M'|}{|M'|};\;\; F_1 = \frac{2PR}{P+R}$
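To make the role of the node alignment concrete, the following sketch scores a small predicted graph against a gold graph by exhaustively trying every injective mapping from predicted variables to gold variables. This brute-force search is an assumption chosen for clarity and only works for tiny graphs. The reference Smatch implementation uses hill-climbing with restarts instead. Function names and the example graphs are hypothetical.

```python
from itertools import permutations

def rename(triples, mapping):
    """Apply a variable renaming to a set of (source, relation, target) triples."""
    return {(mapping.get(s, s), r, mapping.get(t, t)) for s, r, t in triples}

def smatch_brute_force(pred, pred_vars, gold, gold_vars):
    """Return (precision, recall, f1) under the best variable mapping found exhaustively."""
    # Pad with dummy targets so a predicted variable may also stay unmatched.
    targets = list(gold_vars) + [f"_unmapped{i}" for i in range(len(pred_vars))]
    best = 0
    for perm in permutations(targets, len(pred_vars)):
        mapping = dict(zip(pred_vars, perm))
        best = max(best, len(rename(pred, mapping) & gold))
    precision = best / len(pred) if pred else 0.0
    recall = best / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Predicted graph: "the boy wants" (the sleep event is missing).
pred = {("x", "instance", "want-01"), ("y", "instance", "boy"), ("x", "ARG0", "y")}
# Gold graph: "the boy wants to sleep", with the boy re-entering as sleeper.
gold = {
    ("w", "instance", "want-01"), ("b", "instance", "boy"), ("s", "instance", "sleep-01"),
    ("w", "ARG0", "b"), ("w", "ARG1", "s"), ("s", "ARG0", "b"),
}

precision, recall, f1 = smatch_brute_force(pred, ["x", "y"], gold, ["w", "b", "s"])
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```

The best mapping sends `x` to `w` and `y` to `b`, so all three predicted triples match while half of the gold triples remain uncovered.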

Graded metrics such as S²MATCH give partial credit to near-synonymous concept nodes and thus provide a more tolerant evaluation of semantic content.

Error analysis commonly identifies unaligned or mistranslated concepts, relation mispredictions triggered by translation-induced constituent reorderings, reentrancy errors (especially with pronouns), and out-of-vocabulary named entities. Many “errors” penalized under strict Smatch are in fact benign paraphrases, highlighting the need for metrics sensitive to semantic similarity (Uhrig et al., 2021).

Fine-grained breakdowns implicate negation and reentrancy as areas of highest cross-lingual degradation. Domain adaptation remains a challenge, with parser generalization decreasing on out-of-domain or typologically distant data.

7. Recent Extensions and Future Directions

The field is expanding toward:

Broader language coverage: Creation of new AMR resources for typologically diverse languages, e.g., HuAMR for Hungarian (Barta et al., 2025), and test sets for Korean and Croatian (Kang et al., 2024).
Graph-aware encoders and structural priors: Incorporation of GNNs and graph structure into sequence models is proposed to better handle long-range semantic dependencies.
Improved alignment and silver data validation: Semi-automatic methods, human-in-the-loop refinement, and back-translation triangulation have potential to enhance projection fidelity and coverage.
Multilingual AMR formalisms: There is a recognized need to extend AMR’s predicate inventory and semantic role framework beyond PropBank’s English roots, especially for languages with divergent predicate-argument realizations.
Domain adaptation and robustness: Curriculum learning and domain-aware fine-tuning strategies, as well as smarter transfer schedules, are highlighted for further study.

In summary, cross-lingual AMR parsing has progressed from projection- and translation-based baselines to multilingual seq2seq and ensemble-distillation approaches, achieving high accuracy across several major languages and laying the foundation for extension to low-resource and typologically diverse languages (Uhrig et al., 2021, Cai et al., 2021, Lee et al., 2021, Barta et al., 2025, Kang et al., 2024, Sheth et al., 2021, Damonte et al., 2017).