GE-VerbMLP: Graph-Enhanced Verb Classification

Updated 1 September 2025
  • The paper introduces GE-VerbMLP, which fuses an MLP head over frozen CLIP features with GNN modules and adversarial training to tackle semantic ambiguity in verb classification.
  • GE-VerbMLP leverages a sparsified label-correlation graph, processed by GCNs, to integrate semantic relationships between verbs and enhance multi-label predictions.
  • Evaluations report over a 3% MAP improvement on benchmarks, with tighter class clusters attributable to graph propagation and smoother decision boundaries resulting from adversarial training.

The Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP) is a neural network architecture developed for verb classification in situation recognition tasks, with a central focus on modeling semantic ambiguity and label correlations through the fusion of multilayer perceptron (MLP) components, graph neural networks (GNNs), and adversarial training mechanisms. GE-VerbMLP explicitly reformulates verb labeling as a single positive multi-label learning (SPMLL) problem, addressing the inherent multi-label nature of visual event recognition in images and advancing the state of the art in situation recognition (Lin et al., 29 Aug 2025).

1. Architectural Overview

GE-VerbMLP integrates three primary components:

  1. Image Encoding and MLP Projection: Visual representations are extracted using a frozen CLIP encoder. These features are then projected into a task-specific latent space by an MLP, allowing for the modeling of complex, nonlinear relationships between image content and verb categories.
  2. Graph Convolutional Network (GCN) Module: To incorporate semantic relationships between verb labels, the architecture employs a GCN acting on class embedding vectors. These vectors are initialized using sentence encodings (e.g., BERT representations of class names, definitions, and frames) and are refined via message passing over a sparsified label-correlation graph.
  3. Adversarial Training Pipeline: Robustness is strengthened with adversarial training, employing perturbation techniques such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) to promote smooth decision boundaries between overlapping classes.

This design directly targets the label ambiguity and semantic overlap observed in verb classification, with GCNs responsible for enforcing meaningful proximity between similar verbs and adversarial training counteracting model overconfidence near class boundaries.
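
As a concrete illustration, a minimal PyTorch sketch of the combined forward pass might look as follows; the module sizes, the dot-product fusion of image and class embeddings, and names such as GEVerbMLP are assumptions for exposition, not the authors' released code.

```python
import torch.nn as nn

class GEVerbMLP(nn.Module):
    """Sketch: frozen CLIP features -> MLP projection; logits are
    dot products against GCN-refined verb-class embeddings."""
    def __init__(self, feat_dim=512, hidden_dim=1024, embed_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, clip_feats, class_embeds):
        # clip_feats:   (B, feat_dim), precomputed by a frozen CLIP encoder
        # class_embeds: (V, embed_dim), verb-class centers refined by the GCN
        img = self.mlp(clip_feats)          # (B, embed_dim)
        return img @ class_embeds.t()       # (B, V) verb logits
```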

2. Semantic Correlation Modeling via Graph Neural Networks

Verbs representing visual events often manifest semantic redundancy and contextual ambiguity, necessitating explicit modeling of their interrelations. GE-VerbMLP operationalizes this as follows:

  • Class Embedding Initialization: Each verb class $i$ is embedded as $c_i$ through the encoding of its name, meaning, and usage, typically using a pretrained sentence encoder such as BERT.
  • Semantic Similarity Graph Construction: The similarity $a_{ij}$ between classes $i$ and $j$ is computed as

a_{ij} = \frac{c_i^T c_j}{\|c_i\|_2 \|c_j\|_2}

The resulting affinity matrix is sparsified by retaining only the top $K$ neighbors per class, yielding a sparse adjacency matrix reflecting the most relevant semantic relationships (see the sketch following this list).

  • Label Graph Propagation: The GCN propagates information among class centers according to:

C_{j+1} = \rho(\hat{A} C_j W_j),

where $C_j$ is the matrix of class representations at layer $j$, $\hat{A}$ is a smoothed adjacency matrix parameterized by a hyperparameter $s$, $W_j$ is a learnable weight matrix, and $\rho(\cdot)$ denotes a nonlinearity (e.g., ReLU or tanh). This mechanism increases the likelihood that similar verbs will cluster in embedding space, facilitating accurate multi-label assignment.
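
A minimal PyTorch sketch of both steps, graph construction and propagation, is given below. The symmetrization of the top-$K$ graph and the smoothing form $\hat{A} = sI + (1-s)A_{\text{norm}}$ are plausible assumptions; the paper specifies only that $\hat{A}$ is parameterized by $s$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_label_graph(C, k=5):
    """C: (V, d) class embeddings (e.g., BERT sentence encodings).
    Returns a sparse (V, V) affinity keeping each class's top-k neighbors."""
    A = F.normalize(C, dim=1) @ F.normalize(C, dim=1).t()  # cosine a_ij
    A.fill_diagonal_(0.0)
    topk = A.topk(k, dim=1).indices
    mask = torch.zeros_like(A).scatter_(1, topk, 1.0)      # keep top-k per row
    A = A * mask
    return torch.maximum(A, A.t())   # symmetrize (an assumed detail)

def smooth_adjacency(A, s=0.2):
    """One plausible reading of the smoothing hyperparameter s:
    blend the row-normalized graph with self-connections."""
    A_norm = A / A.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return s * torch.eye(A.size(0)) + (1 - s) * A_norm

class LabelGCNLayer(nn.Module):
    """One propagation step C_{j+1} = rho(A_hat C_j W_j)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.W = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, C, A_hat):
        # C: (V, dim_in) class centers; A_hat: (V, V) smoothed adjacency
        return torch.relu(A_hat @ self.W(C))
```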

3. Adversarial Training for Decision Boundary Smoothing

Classic multiclass classification is prone to overfitting on ambiguous image regions near class boundaries, especially when classes have overlapping semantics. GE-VerbMLP addresses this via adversarial augmentation:

  • FGSM-Based Perturbations:

\delta_{\text{FGSM}} = \epsilon\, \text{sign}(\nabla_x L(f_\theta(x), z)), \qquad x_{\text{FGSM}} = x + \delta_{\text{FGSM}}

where $L$ is the loss (e.g., cross-entropy), $\epsilon$ is a step size, $f_\theta$ is the model, $x$ is the original input, and $z$ the one-hot target.

  • PGD-Based Refinement:

x_{\text{PGD}} = \Pi_S\left(x + \delta_{\text{FGSM}}\right)

where $\Pi_S$ projects onto the allowed perturbation set $S$; PGD applies this perturb-and-project step iteratively, refining the perturbation over several rounds.

These adversarial examples challenge the model to retain consistent predictions against small but informative input changes, which has been shown to yield smoother class boundaries and enhanced robustness, without compromising primary accuracy metrics.
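
A minimal PyTorch sketch of both perturbations is shown below, under stated assumptions: the targets z are passed as class indices rather than one-hot vectors, the perturbation set $S$ is an $L_\infty$ ball around the clean input, and the step sizes and iteration count are illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, z, epsilon=0.01):
    """x_FGSM = x + epsilon * sign(grad_x L(f_theta(x), z))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), z)   # z: target class indices here
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

def pgd_perturb(model, x, z, epsilon=0.03, alpha=0.01, steps=5):
    """Iterated FGSM steps, each projected (Pi_S) back onto the
    L-infinity ball of radius epsilon around the clean input."""
    x_clean = x.clone().detach()
    x_adv = x_clean.clone()
    for _ in range(steps):
        x_adv = fgsm_perturb(model, x_adv, z, epsilon=alpha)
        x_adv = x_clean + torch.clamp(x_adv - x_clean, -epsilon, epsilon)
    return x_adv.detach()
```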

4. Reformulation as Single Positive Multi-Label Learning (SPMLL)

GE-VerbMLP is motivated by the empirical observation that visual scenes are often described by multiple, semantically overlapping verbs; however, practical datasets typically only annotate a single positive label per instance. The SPMLL reformulation incorporates this reality:

  • While a traditional multi-label model minimizes

R_{\text{full}}(f_\theta) = \frac{1}{m} \sum_{i=1}^m L(f_\theta(x_i), y_i)

with $y_i$ a full multi-hot label vector, SPMLL uses

R_{\text{partial}}(f_\theta) = \frac{1}{m} \sum_{i=1}^m L(f_\theta(x_i), z_i)

where $z_i$ contains only a single positive entry (the annotated label). This approach reflects the missing-label regime: the model must be capable of predicting other plausible labels at test time, even though they are absent from supervision.

This reformulation is essential for aligning the training regimen with the evaluation and for handling the ambiguous, underannotated nature of real-world verb classification.
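
A minimal sketch of a single-positive training loss follows; treating all unannotated labels as negatives ("assume-negative" binary cross-entropy) is a common SPMLL baseline and is an assumption here, since the paper specifies only the single-positive supervision regime.

```python
import torch
import torch.nn.functional as F

def spmll_loss(logits, pos_idx):
    """Single-positive multi-label loss.
    logits:  (B, V) verb scores; pos_idx: (B,) index of the one annotated verb.
    Unannotated labels are treated as negatives ('assume-negative' baseline,
    an assumption -- the paper only specifies single-positive supervision)."""
    z = torch.zeros_like(logits)
    z.scatter_(1, pos_idx.unsqueeze(1), 1.0)   # one-hot target z_i
    return F.binary_cross_entropy_with_logits(logits, z)
```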

5. Evaluation Metrics and Empirical Findings

GE-VerbMLP is assessed using both traditional and multi-label-specific metrics:

  • Top-1 / Top-5 Accuracy: Measures congruence between the predicted top verbs and the primary annotation.
  • Mean Average Precision (MAP): Accounts for all valid labels per image, thus directly reflecting the model's capacity for multi-label recognition.
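
As an illustration, MAP over the full label set can be computed with scikit-learn, as in the following sketch; the binary label matrix y_true holding all valid verbs per image is hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (N, V) binary matrix of all valid verb labels per image;
    y_score: (N, V) model scores. Averages AP over classes with positives."""
    aps = [average_precision_score(y_true[:, v], y_score[:, v])
           for v in range(y_true.shape[1]) if y_true[:, v].any()]
    return float(np.mean(aps))
```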

Experimental results demonstrate that GE-VerbMLP yields over a 3% improvement in MAP on curated benchmarks relative to baseline MLP approaches (both with and without GCN enhancements), while maintaining competitive top-1 and top-5 accuracy. Ablation studies attribute these improvements to both the GCN (clustering semantically close verbs) and the adversarial training pipeline (boundary smoothing).

Additional visualizations, such as t-SNE plots of class centers before and after GCN propagation, reveal tighter clustering of related verbs, corroborating the benefit of graph-based label smoothing in high-dimensional space.
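
Such a comparison can be reproduced with a few lines of scikit-learn; the sketch below projects a class-center matrix to 2-D (the perplexity value is illustrative).

```python
from sklearn.manifold import TSNE

def project_class_centers(C):
    """Project a (V, d) class-center matrix to 2-D for plotting.
    Note: perplexity must be smaller than the number of classes V."""
    return TSNE(n_components=2, init="pca", perplexity=30).fit_transform(C)
```

Applying this to the class centers before and after GCN propagation and plotting the two projections side by side yields the qualitative comparison described above.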

6. Context within the Literature and Implications

GE-VerbMLP extends established lines of research in GNN-based situation recognition (Li et al., 2017), graph-enhanced language representation (Wang et al., 2022), and recent advances in graph LLM architectures (Plenz et al., 13 Jan 2024):

  • Enhancing a basic MLP with label-graph propagation incorporates successful strategies from GGNN for joint verb-role inference but adapts them for label-level smoothing rather than instance-graph modeling.
  • Integration of CLIP-based encoders anchors the model in high-capacity visual feature spaces.
  • The adoption of adversarial training aligns with broader trends in robust classification methodology, explicitly targeting visual event ambiguity.

A plausible implication is that architectures modeled after GE-VerbMLP could be further generalized to other vision-language tasks characterized by annotation incompleteness and semantic overlap, such as relation detection or multi-label image captioning. Moreover, the graph-based class smoothing paradigm is especially suitable for domains with rich class ontologies, where proximity in class semantic space is meaningful.

7. Summary and Prospects

GE-VerbMLP presents a solution tailored to the multi-label reality of verb classification in situation recognition. By fusing MLP image encoding, GNN-based label correlation modeling, and adversarial boundary enforcement, the model achieves both increased MAP and robust standard classification metrics on challenging datasets (Lin et al., 29 Aug 2025).

This architecture is well-positioned as an exemplar for future work seeking to address label ambiguity through explicit class-graph modeling and adversarial learning, with prospective applications spanning multidimensional semantic analysis tasks in both vision and language domains.