GE-VerbMLP: Graph-Enhanced Verb Classification
- The paper introduces GE-VerbMLP, which couples an MLP head over frozen CLIP features with graph neural network (GNN) modules and adversarial training to tackle semantic ambiguity in verb classification.
- GE-VerbMLP leverages a sparsified label-correlation graph through GCNs to integrate semantic relationships between verbs, enhancing multi-label predictions.
- Evaluations report over a 3% MAP improvement on benchmarks, with tighter class clusters attributable to the GCN and smoother decision boundaries to adversarial training.
The Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP) is a neural network architecture developed for verb classification in situation recognition tasks, with a central focus on modeling semantic ambiguity and label correlations through the fusion of multilayer perceptron (MLP) components, graph neural networks (GNNs), and adversarial training mechanisms. GE-VerbMLP explicitly reformulates verb labeling as a single positive multi-label learning (SPMLL) problem, addressing the inherent multi-label nature of visual event recognition in images and advancing the state of the art in situation recognition (Lin et al., 29 Aug 2025).
1. Architectural Overview
GE-VerbMLP integrates three primary components:
- Image Encoding and MLP Projection: Visual representations are extracted using a frozen CLIP encoder. These features are then projected into a task-specific latent space by an MLP, allowing for the modeling of complex, nonlinear relationships between image content and verb categories.
- Graph Convolutional Network (GCN) Module: To incorporate semantic relationships between verb labels, the architecture employs a GCN acting on class embedding vectors. These vectors are initialized using sentence encodings (e.g., BERT representations of class names, definitions, and frames) and are refined via message passing over a sparsified label-correlation graph.
- Adversarial Training Pipeline: Robustness is strengthened with adversarial training, employing perturbation techniques such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) to promote smooth decision boundaries between overlapping classes.
This design directly targets the label ambiguity and semantic overlap observed in verb classification, with GCNs responsible for enforcing meaningful proximity between similar verbs and adversarial training counteracting model overconfidence near class boundaries.
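A minimal sketch of how these components combine, written in PyTorch; the class name `GEVerbMLP`, the layer dimensions, the class count (504, as in imSitu), and the dot-product scoring head are illustrative assumptions rather than the paper's released implementation:

```python
import torch
import torch.nn as nn

class GEVerbMLP(nn.Module):
    """Scores verbs by comparing projected image features to class centers."""
    def __init__(self, clip_dim=512, hidden_dim=1024, embed_dim=512, num_classes=504):
        super().__init__()
        # MLP projecting frozen CLIP features into a task-specific latent space
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        # Class centers: in the paper these are initialized from BERT encodings
        # of each verb's name, definition, and frame, then refined by the GCN
        # (see Section 2); random initialization here is a placeholder.
        self.class_centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, clip_features):          # clip_features: (B, clip_dim)
        z = self.mlp(clip_features)            # (B, embed_dim)
        return z @ self.class_centers.t()      # (B, num_classes) verb logits
```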
2. Semantic Correlation Modeling via Graph Neural Networks
Verbs representing visual events often manifest semantic redundancy and contextual ambiguity, necessitating explicit modeling of their interrelations. GE-VerbMLP operationalizes this as follows:
- Class Embedding Initialization: Each verb class $i$ is embedded as a vector $\mathbf{c}_i$ through the encoding of its name, meaning, and usage, typically using a pretrained sentence encoder such as BERT.
- Semantic Similarity Graph Construction: The similarity between classes $i$ and $j$ is computed as the cosine similarity of their embeddings,

$$s_{ij} = \frac{\mathbf{c}_i^{\top}\mathbf{c}_j}{\lVert\mathbf{c}_i\rVert\,\lVert\mathbf{c}_j\rVert}.$$

The resulting affinity matrix $S$ is sparsified by retaining only the top-$k$ neighbors per class, yielding a sparse adjacency matrix $A$ that reflects the most relevant semantic relationships.
- Label Graph Propagation: The GCN propagates information among class centers according to

$$H^{(l+1)} = \sigma\left(\tilde{A}\,H^{(l)}\,W^{(l)}\right),$$

where $H^{(l)}$ is the matrix of class representations at layer $l$, $\tilde{A}$ is a smoothed adjacency matrix parameterized by a hyperparameter $\lambda$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ denotes a nonlinearity (e.g., ReLU or tanh). This mechanism increases the likelihood that similar verbs cluster in embedding space, facilitating accurate multi-label assignment.
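The following sketch illustrates this construction in PyTorch, assuming cosine similarity, top-$k$ sparsification, and a convex-combination smoothing $\tilde{A} = (1-\lambda)A + \lambda I$; the exact smoothing form and hyperparameter values are assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_sparse_adjacency(class_embeds, k=5, lam=0.5):
    """Cosine-similarity label graph, sparsified to the top-k neighbors per
    class and smoothed with self-loops; k and lam are illustrative."""
    normed = F.normalize(class_embeds, dim=1)
    sim = normed @ normed.t()                            # (C, C) cosine affinity
    topk = sim.topk(k + 1, dim=1).indices                # +1: self is most similar
    mask = torch.zeros_like(sim).scatter_(1, topk, 1.0)  # keep only top neighbors
    A = sim * mask
    A = A / A.sum(dim=1, keepdim=True).clamp(min=1e-8)   # row-normalize
    C = class_embeds.size(0)
    return (1 - lam) * A + lam * torch.eye(C)            # smoothed adjacency

class LabelGCN(nn.Module):
    """Stacked propagation layers: H^{(l+1)} = relu(A_hat @ H^{(l)} @ W^{(l)})."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.weights = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(num_layers))

    def forward(self, H, A_hat):
        for W in self.weights:
            H = torch.relu(A_hat @ W(H))
        return H
```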
3. Adversarial Training for Decision Boundary Smoothing
Classic multiclass classification is prone to overfitting on ambiguous inputs near class boundaries, especially when classes have overlapping semantics. GE-VerbMLP addresses this via adversarial augmentation:
- FGSM-Based Perturbations:

$$x_{\mathrm{adv}} = x + \epsilon \cdot \operatorname{sign}\left(\nabla_{x}\,\mathcal{L}(f(x), y)\right),$$

where $\mathcal{L}$ is the loss (e.g., cross-entropy), $\epsilon$ is a step size, $f$ is the model, $x$ is the original input, and $y$ the one-hot target.
- PGD-Based Refinement:

$$x^{(t+1)} = \Pi\left(x^{(t)} + \alpha \cdot \operatorname{sign}\left(\nabla_{x}\,\mathcal{L}\left(f\left(x^{(t)}\right), y\right)\right)\right),$$

where $\Pi$ projects onto the allowed perturbation set (e.g., an $\epsilon$-ball around $x$).
These adversarial examples challenge the model to retain consistent predictions against small but informative input changes, which has been shown to yield smoother class boundaries and enhanced robustness, without compromising primary accuracy metrics.
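A minimal sketch of both perturbation schemes in PyTorch; `eps`, `alpha`, and `steps` are generic hyperparameters, and gradient bookkeeping (e.g., zeroing parameter gradients) is simplified relative to a full training loop:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: x_adv = x + eps * sign(grad_x L). y: class indices."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def pgd(model, x, y, eps, alpha, steps=10):
    """Iterative PGD: repeated sign-gradient steps, each projected back
    into the eps-ball around x (the allowed perturbation set Pi)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # projection step
        x_adv = x_adv.detach()
    return x_adv
```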
4. Reformulation as Single Positive Multi-Label Learning (SPMLL)
GE-VerbMLP is motivated by the empirical observation that visual scenes are often described by multiple, semantically overlapping verbs; however, practical datasets typically only annotate a single positive label per instance. The SPMLL reformulation incorporates this reality:
- While a traditional multi-label model minimizes

$$\mathcal{L}_{\mathrm{ML}} = -\sum_{c=1}^{C}\left[\,y_c \log \hat{y}_c + (1 - y_c)\log(1 - \hat{y}_c)\,\right]$$

with a full multi-hot label vector $\mathbf{y} \in \{0,1\}^{C}$, SPMLL uses

$$\mathcal{L}_{\mathrm{SPMLL}} = -\sum_{c=1}^{C}\left[\,\tilde{y}_c \log \hat{y}_c + (1 - \tilde{y}_c)\log(1 - \hat{y}_c)\,\right],$$

where $\tilde{\mathbf{y}}$ contains only a single positive entry (the annotated label). This approach reflects the missing-label regime: the model must be capable of predicting other plausible labels at test time, even though they are absent from supervision.
This reformulation is essential for aligning the training regimen with the evaluation and for handling the ambiguous, underannotated nature of real-world verb classification.
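As a concrete illustration, the following is a minimal assume-negative SPMLL loss in PyTorch, in which all unobserved classes are provisionally treated as negatives; whether the paper uses this plain baseline or a corrected variant is an assumption here:

```python
import torch
import torch.nn.functional as F

def spmll_bce(logits, pos_index):
    """Binary cross-entropy against a single observed positive per image.
    logits: (B, C) verb scores; pos_index: (B,) annotated verb indices.
    All other classes are treated as negatives, even though some may be
    valid but unannotated labels (the missing-label regime)."""
    target = torch.zeros_like(logits)
    target[torch.arange(logits.size(0)), pos_index] = 1.0
    return F.binary_cross_entropy_with_logits(logits, target)
```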
5. Evaluation Metrics and Empirical Findings
GE-VerbMLP is assessed using both traditional and multi-label-specific metrics:
- Top-1 / Top-5 Accuracy: Measures agreement between the top-ranked predicted verbs and the single annotated verb.
- Mean Average Precision (MAP): Accounts for all valid labels per image, thus directly reflecting the model's capacity for multi-label recognition.
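A minimal sketch of the MAP computation using scikit-learn, assuming multi-hot ground truth that marks every valid verb per image:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(scores, labels):
    """MAP over classes: average precision per verb, then the mean.
    scores: (N, C) model scores; labels: (N, C) multi-hot ground truth."""
    aps = [average_precision_score(labels[:, c], scores[:, c])
           for c in range(scores.shape[1]) if labels[:, c].any()]
    return float(np.mean(aps))
```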
Experimental results demonstrate that GE-VerbMLP yields over a 3% improvement in MAP on curated benchmarks relative to baseline MLP approaches with or without GCN enhancements, while maintaining competitive performance on top-1 and top-5 accuracy. Ablation studies attribute these improvements to both the GCN (clustering semantically close verbs) and the adversarial training pipeline (boundary smoothing).
Additional visualizations, such as t-SNE plots of class centers before and after GCN propagation, reveal tighter clustering of related verbs, corroborating the benefit of graph-based label smoothing in high-dimensional space.
6. Context within the Literature and Implications
GE-VerbMLP extends established lines of research in GNN-based situation recognition (Li et al., 2017), graph-enhanced language representation (Wang et al., 2022), and recent advances in graph LLM architectures (Plenz et al., 13 Jan 2024):
- Enhancing a basic MLP with label-graph propagation incorporates successful strategies from GGNN for joint verb-role inference but adapts them for label-level smoothing rather than instance-graph modeling.
- Integration of CLIP-based encoders anchors the model in high-capacity visual feature spaces.
- The adoption of adversarial training aligns with broader trends in robust classification methodology, explicitly targeting visual event ambiguity.
A plausible implication is that architectures modeled after GE-VerbMLP could be further generalized to other vision-language tasks characterized by annotation incompleteness and semantic overlap, such as relation detection or multi-label image captioning. Moreover, the graph-based class smoothing paradigm is especially suitable for domains with rich class ontologies, where proximity in class semantic space is meaningful.
7. Summary and Prospects
GE-VerbMLP presents a solution tailored to the multi-label reality of verb classification in situation recognition. By fusing MLP image encoding, GNN-based label correlation modeling, and adversarial boundary enforcement, the model achieves both increased MAP and robust standard classification metrics on challenging datasets (Lin et al., 29 Aug 2025).
This architecture is well-positioned as an exemplar for future work seeking to address label ambiguity through explicit class-graph modeling and adversarial learning, with prospective applications spanning multidimensional semantic analysis tasks in both vision and language domains.