Semantic Role Labeling

Updated 17 December 2025
  • Semantic Role Labeling (SRL) is a foundational NLP task that identifies predicate–argument structures, capturing who did what to whom, when, where, and how.
  • It has evolved from feature-engineered pipelines to advanced deep learning and LLM-based frameworks, enhancing accuracy and robustness.
  • SRL underpins practical applications such as information extraction, machine translation, and dialogue systems across multilingual and domain-specific contexts.

Semantic role labeling (SRL) is a foundational task in natural language processing focused on uncovering the predicate–argument structure of sentences. SRL systems map sentences to structured representations comprising predicates (typically verbs or nominalizations), their semantic arguments, and associated role labels—answering "who did what to whom, when, where, and how." SRL has evolved from feature-engineered pipelines to state-of-the-art deep learning and LLM frameworks. This article reviews SRL’s formal definitions, model architectures, syntactic interplay, multilingual and domain-specific considerations, practical applications, and future directions.

1. Formal Definitions and Problem Formulations

The goal of SRL is to predict a set of triplets

$$Y = \{\langle p_k, a_k, r_k \rangle \mid p_k \in P,\; a_k \in A,\; r_k \in R\}$$

for each sentence $S = \{w_1, w_2, \ldots, w_n\}$, where $P$ is the set of predicates, $A$ is the set of argument spans (span-based) or argument heads (dependency-based), and $R$ is the inventory of role labels (e.g., ARG0, ARG1, ARGM-LOC) (Chen et al., 9 Feb 2025). Two dominant annotation schemas are (see the sketch after this list):

  • Span-based SRL: Arguments are contiguous token spans; typical for PropBank and FrameNet.
  • Dependency-based SRL: Arguments correspond to head word indices; typical for CoNLL-2008/2009.
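
The two schemas differ only in how the argument $a_k$ is anchored. As a concrete illustration, the following minimal sketch (hypothetical types, not drawn from any cited system) encodes the same argument under both conventions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpanArgument:
    """Span-based SRL triple: the argument is a contiguous token span."""
    predicate: int  # token index of the predicate
    start: int      # inclusive start of the argument span
    end: int        # inclusive end of the argument span
    role: str       # e.g. "ARG0", "ARG1", "ARGM-LOC"

@dataclass(frozen=True)
class DepArgument:
    """Dependency-based SRL triple: the argument is a single head word."""
    predicate: int  # token index of the predicate
    head: int       # token index of the argument's head word
    role: str

# "The cat sat on the mat", predicate "sat" at index 2:
span_style = SpanArgument(predicate=2, start=0, end=1, role="ARG0")  # span "The cat"
dep_style = DepArgument(predicate=2, head=1, role="ARG0")            # head word "cat"
```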

Evaluation relies on matching the predicted (predicate, argument, role) triples exactly to gold annotation, with metrics:

$$\mathrm{Precision} = \frac{TP}{TP + FP},\quad \mathrm{Recall} = \frac{TP}{TP + FN},\quad F_1 = \frac{2\,P \cdot R}{P + R}$$

where $TP$, $FP$, and $FN$ are true positives, false positives, and false negatives, respectively (Chen et al., 9 Feb 2025).
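
A minimal sketch of this exact-match evaluation, assuming triples are represented as hashable tuples of (predicate index, argument, role):

```python
def srl_f1(gold: set, predicted: set) -> tuple:
    """Exact-match evaluation over (predicate, argument, role) triples."""
    tp = len(gold & predicted)       # true positives: triples matched exactly
    fp = len(predicted - gold)       # false positives: spurious predictions
    fn = len(gold - predicted)       # false negatives: missed gold triples
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(2, (0, 1), "ARG0"), (2, (3, 5), "ARGM-LOC")}
pred = {(2, (0, 1), "ARG0"), (2, (4, 5), "ARGM-LOC")}  # one boundary error
print(srl_f1(gold, pred))  # (0.5, 0.5, 0.5)
```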

2. Model Architectures: Historical and Contemporary Taxonomies

SRL model development traces a progression from pipeline architectures and feature-based classifiers to end-to-end neural and generative models. Principal categories include (Chen et al., 9 Feb 2025, Li et al., 2019, Zhang et al., 2018, Fernández-González, 2022, Cai et al., 2018):

  • Sequence Tagging Models: Cast SRL as token-level classification under a BIO/IOB2 labeling scheme. Early systems employed CRFs, MaxEnt, or SVMs with rich feature sets (Pham et al., 2017); modern approaches use BiLSTM or Transformer-based encoders, optionally with boundary indicators or self-attention (Zhang et al., 2018). A decoding sketch follows this list.
  • Span-Based Models: Score all possible contiguous spans as argument candidates for each predicate. Representations combine start/end boundary embeddings, pooled internal states, and headedness features. Typical classifiers employ biaffine scorers and non-overlap inference constraints (Li et al., 2019, Xia et al., 2019, Xia et al., 2019).
  • Dependency-Graph-Based Models: Encode the predicate–argument structure as labeled graphs. These models enumerate all potential (predicate, argument) pairs and jointly score and classify arcs, integrating span or token-level representations with global graph inference, often utilizing Graph Convolutional Networks (GCNs) or Tree-LSTMs (Munir et al., 2020, He et al., 2019).
  • Transition-Based Models: Leverage Pointer Networks or transition systems to construct the SRL graph incrementally in left-to-right or top-down passes, eschewing syntactic dependencies (Fernández-González, 2022). They run in O(n²) time per sentence and handle predicate detection, argument recognition, and role classification in a unified process.
  • End-to-End and Uniform Models: Directly predict all triplets over the token sequence, addressing both predicate sense disambiguation and argument labeling via unified deep networks (e.g., deep BiLSTMs + biaffine heads) (Li et al., 2019, Cai et al., 2018).
  • MRC-based and Seq2Seq Models: Reformulate SRL as machine reading comprehension (MRC) or sequence-to-sequence generation, leveraging predicate and role semantics via natural language queries or serializations of the role structure (Wang et al., 2021).
  • LLMs: Recent methods equip LLMs with retrieval-augmented prompting (for predicate/role knowledge) and self-correction, yielding SOTA results with parameter-efficient adaptation (Li et al., 3 Jun 2025). Such auxiliary mechanisms and fine-tuning close the performance gap between general-purpose LLMs and specialized SRL models.
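
As an illustration of the sequence-tagging formulation referenced above, this sketch (illustrative only, not taken from any cited system) decodes a BIO tag sequence for one predicate into labeled argument spans:

```python
from typing import List, Tuple

def decode_bio(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a BIO tag sequence into (start, end, role) argument spans.

    `tags` holds one tag per token, e.g. ["B-ARG0", "I-ARG0", "O"].
    Spans are inclusive token-index pairs.
    """
    spans, start, role = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:              # close the previous span
                spans.append((start, i - 1, role))
            start, role = i, tag[2:]
        elif tag.startswith("I-") and role == tag[2:]:
            continue                           # span continues
        else:                                  # "O" or an inconsistent I- tag
            if start is not None:
                spans.append((start, i - 1, role))
            start, role = None, None
    if start is not None:                      # close a span ending at the last token
        spans.append((start, len(tags) - 1, role))
    return spans

# "The cat sat on the mat" with predicate "sat":
print(decode_bio(["B-ARG0", "I-ARG0", "O", "B-ARGM-LOC", "I-ARGM-LOC", "I-ARGM-LOC"]))
# -> [(0, 1, 'ARG0'), (3, 5, 'ARGM-LOC')]
```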

3. Syntax in SRL: Feature Engineering, Neural Encoding, and the “Syntax-Free” Paradigm

Syntactic information, both constituent and dependency-based, has been pivotal since SRL’s inception (Hartmann et al., 2017). Key approaches include (Chen et al., 9 Feb 2025, Xia et al., 2019, Shi et al., 2020, Xia et al., 2019, Munir et al., 2020, Li et al., 2020, Zhang et al., 2019):

  • Traditional Syntax-Based Features: Parse tree paths, constituent types, function tags, and position-relative features have been classic predictors; Alva-Manchego et al. (2013) achieved F₁ = 79.6 on gold parses using explicit syntactic trees in Portuguese (Hartmann et al., 2017).
  • Neural Syntax Encoders: GCNs, Tree-GRUs, SA-LSTM, Tree-LSTM with relation gates, and Syntax-Enhanced Self-Attention (e.g., Relation-Aware, LISA) parameterize parse structures as neural modules yielding continuous syntax-aware representations (Xia et al., 2019, Zhang et al., 2019, Munir et al., 2020).
  • Syntax Pruning: The argument candidate space is restricted via hard/soft pruning heuristics (by tree distance, head ancestry, or pattern frequencies), reducing class imbalance and accelerating inference, especially in graph-based models (Li et al., 2020, He et al., 2019); a sketch follows this list. Syntax-aware multilingual models use uniform pruning across languages for portability (He et al., 2019).
  • Syntax-Free (“Agnostic”) Models: Deep contextualized representations (e.g., ELMo, BERT) have enabled robust SRL without explicit syntactic cues. Syntax-agnostic models built on deep BiLSTM, Transformer, or pointer-network architectures now consistently match or outperform syntax-aware systems, e.g., achieving F₁ = 89.6 on CoNLL-2009 English (Cai et al., 2018). Injecting lightweight syntax via feature concatenation or multitask learning can still yield small but statistically significant improvements of +0.5 to +1.2 F₁ over strong baselines, especially out-of-domain or for low-resource languages (Li et al., 2020, Xia et al., 2019).
  • Syntactic Conversion: The reduction from span-based SRL to dependency parsing via enriched arc labeling is nearly lossless: over 98% of PropBank relations are local to the dependency tree (Shi et al., 2020).
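
To make the pruning idea concrete, here is a minimal sketch of hard pruning over a dependency tree. The exact heuristic varies across the cited systems, so the rule below (keep descendants of the predicate and of its first k ancestors) is one illustrative assumption, not the method of any specific paper:

```python
from typing import Dict, List

def prune_candidates(heads: Dict[int, int], predicate: int, k: int = 2) -> List[int]:
    """Keep only tokens that are direct children of the predicate
    or of one of its first k ancestors in the dependency tree.

    `heads` maps each token index to its head index (root has head -1).
    """
    # Collect the predicate and its first k ancestors as anchor nodes.
    anchors, node = [predicate], predicate
    for _ in range(k):
        node = heads.get(node, -1)
        if node == -1:
            break
        anchors.append(node)
    # Candidates are the anchors plus any child of an anchor.
    candidates = set(anchors)
    for tok, head in heads.items():
        if head in anchors:
            candidates.add(tok)
    return sorted(candidates)

# Toy tree for "The cat sat on the mat"; each token maps to its head.
heads = {0: 1, 1: 2, 2: -1, 3: 5, 4: 5, 5: 2}
print(prune_candidates(heads, predicate=2, k=1))  # [1, 2, 5]
```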

4. Multilingual and Domain-Specific Adaptation

SRL research has expanded to diverse languages and application settings (Chen et al., 9 Feb 2025, He et al., 2019, Aghdam et al., 2023, Pham et al., 2017, Xia et al., 2019):

  • Multilingual SRL Benchmarks: CoNLL-2009 covers Catalan, Chinese, Czech, English, German, Japanese, and Spanish. Uniform models with argument pruning and biaffine scoring architectures achieve consistent SOTA, with gains ranging from +0.28 to +5.56 F₁ depending on lexical resource density and parse accuracy (He et al., 2019).
  • Persian and Vietnamese SRL: End-to-end BERT-based approaches for Persian eliminate feature engineering and set new accuracy records (F₁ = 86.26%), with monolingual pre-training outperforming multilingual models (Aghdam et al., 2023). Vietnamese SRL pipelines achieve F₁ = 73.53% using innovative constituent extraction and language-tailored features (Pham et al., 2017).
  • Cross-Domain and Robustness Studies: Syntax-free and memory-augmented models (AMN) show resilience when tested out-of-domain (e.g., WSJ→Brown) and pave the way for adaptation with minimal reliance on in-domain gold syntax (Guan et al., 2019, Xia et al., 2019, Cai et al., 2018).
  • Conversational and Discourse-Level SRL: Dialogue-oriented SRL (CSRL) recovers cross-utterance arguments by extending span labeling to the entire dialogue context, supporting tasks like context rewriting and response generation (Xu et al., 2021); a preprocessing sketch follows this list.
  • Unsupervised Transfer (Verbal→Nominal): Variational autoencoding and selectional-preference sharing enable transfer from verbal SRL to nominalizations with substantial gains (+7.24 F₁ over direct transfer) without labeled nominal roles (Zhao et al., 2020).
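
As a rough illustration of the dialogue-level extension (a hypothetical preprocessing step, not the actual method of Xu et al., 2021), a CSRL input can be built by flattening turns into one token sequence while tracking per-utterance offsets, so that cross-utterance arguments map back to their source turns:

```python
from typing import List, Tuple

def build_dialogue_input(utterances: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten a multi-turn dialogue into one token sequence for CSRL.

    Returns the flattened tokens plus the starting offset of each utterance.
    Real CSRL systems add speaker and turn features on top of this.
    """
    tokens, offsets = [], []
    for utt in utterances:
        offsets.append(len(tokens))  # global index where this turn begins
        tokens.extend(utt)
    return tokens, offsets

dialogue = [["I", "bought", "a", "book"], ["When", "did", "you", "buy", "it", "?"]]
tokens, offsets = build_dialogue_input(dialogue)
# For predicate "buy" in turn 2, ARG0 "I" resolves to global index 0 in turn 1.
print(offsets)  # [0, 4]
```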

5. Applications and Integration in Downstream Tasks

SRL representations underlie many core and emerging NLP tasks (Chen et al., 9 Feb 2025):

  • Information Extraction (IE): SRL-derived (predicate, argument, role) tuples provide structured facts for open IE and knowledge base construction in domains ranging from news to biomedical texts.
  • Machine Translation (MT): Semantic roles help ensure preservation of predicate–argument alignment across translation pairs, reducing semantic drift.
  • Question Answering (QA): SRL facilitates role-centric matching between questions and candidate answers, especially for “who/what/when/where/how” information.
  • Dialogue Systems and Generation: Recovery of ellipsis and anaphora via SRL supports robust context tracking and natural response generation in conversation modeling (Xu et al., 2021).
  • Legal, Biomedical, and Compliance Systems: Domain-adapted SRL parses obligations and relations critical for contract analysis, clinical event extraction, and regulatory compliance.
  • Robotics and Embodied AI: Grounded semantic parsing of action schemas enables mapping from natural language instructions to executable robot plans (Chen et al., 9 Feb 2025).
  • Vision, Video, and Speech SRL: Multimodal extensions align predicates and arguments to visual regions, temporal video segments, or speech units for situation recognition and cross-modal analytics.

6. Benchmark Results and Quantitative Comparisons

| Model/Setting | Dataset/Language | F₁ Score | Key Notes |
| --- | --- | --- | --- |
| BiLSTM + syntactic features [1] | PB-Br.v1 (BP) | 79.6 | Gold parses |
| Lexical model (no syntax) [8] | PB-Br.v1 (BP) | 68.0 | Lower recall/precision |
| BERT-based end-to-end | Persian | 86.26 | No auxiliary features |
| AMN-BiLSTM + ELMo | CoNLL-2009 (EN) | 89.6 | Syntax-agnostic, outperforms prior SOTA |
| DenseCNN + adaptive syntax | CoNLL-2009 (EN) | 90.9 | Syntax-aware, state-of-the-art |
| Relation-aware SA + Dep/BERT | CoNLL-2009 (ZH) | 87.35 | Deep syntax, contextual embeddings |
| LLM + retrieval/self-correction | CoNLL-2009 (EN) | 91.89 | SOTA, parameter-efficient adaptation |
| Unified span/dep (syntax-agnostic) | CoNLL-05/12/09 (EN) | 83.1–90.4 | End-to-end, SOTA on multiple formats |

Exposure to parse noise during training increases robustness for real-world, parse-noisy applications (Hartmann et al., 2017). Modern neural models, whether syntax-free or syntax-aware, achieve F₁ scores in the 85–91 range on English CoNLL-2009, with syntax-awareness most beneficial out-of-domain, in low-resource settings, or for long-distance argument contexts (Xia et al., 2019, PG et al., 2020, Chen et al., 9 Feb 2025).

7. Current Challenges and Future Directions

The field is moving toward more unified, robust, and multimodal SRL frameworks (Chen et al., 9 Feb 2025, Li et al., 3 Jun 2025):

  • Knowledge-Enhanced SRL: Integration with external ontologies or knowledge graphs to resolve implicit arguments and handle world knowledge.
  • Multimodal and Cross-Lingual Transfer: Joint modeling of text, vision, and speech, and leveraging cross-lingual representations for low-resource languages.
  • Interpretable and Explainable SRL: Making model decisions transparent and robust to noisy or adversarial input.
  • LLM-Based and Generative SRL: Probing and exploiting the structural knowledge encoded in LLMs; moving toward few-shot and zero-shot settings via prompt-based SRL.
  • Discourse-Scale SRL: Extending predicate–argument graphs to document-wide contexts, recovering inter-sentential and implicit arguments.
  • Scenario-Optimized and Real-Time SRL: Developing lightweight, fast, and domain-specialized models for deployment in dialogue agents and robotics.

Future advances are expected to arise from tighter integration of powerful pretrained LLMs, soft syntax features, and multimodal priors, with increased emphasis on cross-domain, cross-sentence, and resource-constrained evaluation. The relevance of explicit syntactic information appears likely to persist, especially when data, computation, or context are limited, or when structural interpretability and robustness are at a premium (Li et al., 2020, Xia et al., 2019, Chen et al., 9 Feb 2025).
