Vision Graph Neural Networks (ViGs)
- Vision Graph Neural Networks (ViGs) are models that represent images as adaptive graphs, where nodes are patches and edges capture spatial and semantic relationships.
- ViGs leverage dynamic kNN, structured sparsity, and attention-based aggregation to enhance efficiency and expressivity, achieving impressive results in classification, segmentation, and detection.
- Hybrid architectures combining CNNs and GNNs enable multi-scale feature fusion, addressing the limitations of grid-based methods and boosting performance in complex visual domains.
Vision Graph Neural Networks (ViGs) are a family of neural architectures that represent images as graphs, where nodes correspond to patches or regions and edges encode relationships between them. ViGs generalize the convolutional and transformer paradigms by leveraging dynamic, adaptive, or structured graph construction to facilitate information flow between both local and non-local regions in an image. This framework has led to state-of-the-art performance across a wide range of visual tasks, including image classification, medical image segmentation, object detection, and recognition of highly flexible objects, while addressing the limitations of grid- and sequence-based approaches.
1. From Images to Graphs: Representation and Construction
The core innovation of ViG models is the representation of an image as a graph $G = (V, E)$, where $V$ comprises nodes representing localized patches, and the edges $E$ capture adaptable relationships that may be spatial, feature-driven, or semantically weighted.
Patch-to-node embedding and graph construction procedures are summarized as follows:
- The input image is partitioned into $N$ non-overlapping patches (typically $16 \times 16$ pixels). Each patch is linearly embedded or transformed by a convolutional "stem" to produce feature vectors $x_i \in \mathbb{R}^D$, $i = 1, \dots, N$ (Han et al., 2022, Jiang et al., 2023).
- Edges are determined by various strategies:
- k-Nearest Neighbor (kNN): For each node, edges are drawn to the $k$ most similar nodes by feature-space (e.g., Euclidean) distance. This process is repeated at every layer as features evolve, yielding dynamic connectivity (see the sketch after this list) (Han et al., 2022, Colomba et al., 15 Feb 2024).
- Dilated or Static Patterns: To improve efficiency, structured adjacency such as sparse axial patterns (SVGA), windowed graphs (WiGNet), or logarithmic expansion (LSGC—LogViG) are utilized to constrain neighbor search (Munir et al., 2023, Spadaro et al., 1 Oct 2024, Munir et al., 15 Oct 2025).
- Partitioned and Clustered Schemes: Methods like Dynamic Efficient Graph Convolution (DEGC—ClusterViG) partition nodes into clusters, constructing parallel graphs and fusing global/local context (Parikh et al., 18 Jan 2025).
- Gated/Attention-based or Data-driven Edge Selection: Learnable attention and thresholding mechanisms, such as cross-attention (AttentionViG) and differentiable soft-threshold policies (ViG-LRGC), directly yield edge weights or connectivity masks based on relational scores (Gedik et al., 29 Sep 2025, Elsharkawi et al., 23 Sep 2025, Munir et al., 13 Nov 2025).
- Adaptive/Saliency-based Graphs: FViG exemplifies topology adaptation by saliency analysis at both channel and spatial levels, capturing object-specific connectivity variations for flexible object classes (Zuo et al., 6 Jun 2024).
Adjacency formulation varies from hard $0/1$ masks to normalized affinity matrices and learned soft-gating.
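As a concrete illustration of the dynamic kNN strategy, here is a minimal PyTorch sketch; the function and parameter names (`build_knn_graph`, `k`) are illustrative assumptions, not drawn from any published codebase:

```python
# Minimal sketch of per-layer dynamic kNN graph construction, ViG-style.
import torch

def build_knn_graph(x: torch.Tensor, k: int = 9) -> torch.Tensor:
    """x: (N, D) node features for one image; returns (N, k) neighbor indices."""
    # Pairwise Euclidean distances in feature space; (N, N).
    dist = torch.cdist(x, x)
    # Exclude self-loops by pushing the diagonal to +inf before top-k.
    dist.fill_diagonal_(float("inf"))
    # k smallest distances per node -> neighbor indices. Recomputing this at
    # every layer as features evolve is what makes the connectivity "dynamic".
    return dist.topk(k, largest=False).indices

# Example: 196 nodes (a 14x14 patch grid) with 192-dim embeddings.
x = torch.randn(196, 192)
neighbors = build_knn_graph(x, k=9)   # (196, 9)
```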
2. Message Passing: Graph Convolutions and Aggregation Operators
After graph construction, feature propagation proceeds via a sequence of graph convolution (“Grapher”) blocks. Canonical update rules include:
- Max-Relative Aggregation [Li et al. 2019; adopted in ViG, PVG, WiGNet]:

$$x_i' = \max_{j \in \mathcal{N}(i)} \left( x_j - x_i \right),$$

where the maximum is taken elementwise over the neighborhood $\mathcal{N}(i)$. This is followed by a linear update and nonlinearity, often with multi-head variants for efficiency and expressivity (Jiang et al., 2023, Spadaro et al., 1 Oct 2024, Han et al., 2022).
- Hybrid Aggregation (e.g., MaxE in PVG):

$$x_i' = \left[\, \operatorname{mean}_{j \in \mathcal{N}(i)} (x_j - x_i) \;\big\|\; \max_{j \in \mathcal{N}(i)} (x_j - x_i) \,\right],$$

where mean and max pooling are combined to capture both centroid and extremal differences (Wu et al., 2023).
- Attention or Soft-Gated Aggregation:
- Cross-Attention (AttentionViG):

$$x_i' = \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, W_V x_j, \qquad \alpha_{ij} = \operatorname{softmax}_{j \in \mathcal{N}(i)} \left( \frac{(W_Q x_i)^\top (W_K x_j)}{\tau} \right),$$

where $W_Q$, $W_K$, $W_V$ are learnable projections and $\tau$ is a temperature parameter (Gedik et al., 29 Sep 2025).
- Learnable Soft-Threshold Policy (LRGC, FViG):

$$A_{ij} = \sigma\!\left( s_{ij} - \theta_\ell \right),$$

with $s_{ij}$ a pairwise relational score and $\theta_\ell$ being a trainable threshold per layer, enabling automatic sparsification and adaptation (Elsharkawi et al., 23 Sep 2025, Zuo et al., 6 Jun 2024).
- Feed-Forward and Residual Blocks: All major ViG models interleave graph convolutions with pointwise FFNs, layer normalization, and skip connections to preserve node-wise feature diversity and mitigate over-smoothing.
Different frameworks (e.g., GraphSAGE, EdgeConv, GIN as in AttentionViG) are benchmarked, with newer attention-based aggregators generally achieving superior expressivity and accuracy at comparable computational cost (Gedik et al., 29 Sep 2025).
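To make the aggregation rules above concrete, the following is a minimal sketch of max-relative and cross-attention aggregation on a single image's node features; the shapes, module names, and the $\sqrt{d}$ temperature default are assumptions, not the papers' exact implementations:

```python
# Two aggregation operators over (N, D) node features and (N, k) kNN indices.
import torch
import torch.nn as nn

def max_relative(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Max-relative aggregation: elementwise max of (x_j - x_i) over neighbors."""
    nbrs = x[idx]                                 # (N, k, D) gathered neighbors
    return (nbrs - x.unsqueeze(1)).max(dim=1).values

class CrossAttentionAgg(nn.Module):
    """Attention-weighted neighbor aggregation in the spirit of AttentionViG."""
    def __init__(self, dim: int, tau: float | None = None):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.tau = tau or dim ** 0.5              # temperature; sqrt(d) assumed

    def forward(self, x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        nbrs = x[idx]                             # (N, k, D)
        q = self.q(x).unsqueeze(1)                # (N, 1, D) query per node
        scores = (q * self.k(nbrs)).sum(-1) / self.tau   # (N, k) relational scores
        alpha = scores.softmax(dim=-1)            # attention over neighbors only
        return (alpha.unsqueeze(-1) * self.v(nbrs)).sum(1)  # weighted sum, (N, D)

x = torch.randn(196, 192)
idx = torch.randint(0, 196, (196, 9))             # stand-in for kNN indices
agg1 = max_relative(x, idx)                       # (196, 192)
agg2 = CrossAttentionAgg(192)(x, idx)             # (196, 192)
```

In a full Grapher block, either aggregate would be concatenated with $x_i$, passed through the linear update and nonlinearity, and wrapped with the residual FFN described above.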
3. Graph Construction: Efficiency, Scalability, and Hardware
Dynamic graph wiring is the principal computational bottleneck in ViGs, as the naive kNN search scales as $\mathcal{O}(N^2)$ in the number of nodes $N$ per layer. Numerous mechanisms address this:
- Structured Sparse Graphs: SVGA (MobileViG) and LSGC (LogViG) restrict each node’s neighborhood to a deterministic subset, e.g., stride-$K$ axial lines, logarithmically expansive rays, or fixed-size windowed subgraphs (WiGNet), thereby reducing complexity to $\mathcal{O}(N)$ or $\mathcal{O}(N \log N)$ (Munir et al., 2023, Munir et al., 15 Oct 2025, Spadaro et al., 1 Oct 2024).
- Partitioned/Clustered Graphs: DEGC in ClusterViG achieves roughly $\mathcal{O}(N^2/P)$ cost by distributing the kNN search across $P$ parallel clusters and augmenting local edges with global cluster-level communication (Parikh et al., 18 Jan 2025).
- Windowed Partitioning: WiGNet forms kNN graphs only within spatial windows (see the sketch after this list), optionally utilizing shifted windows for long-range horizontal/vertical information propagation, yielding linear scaling in image size (Spadaro et al., 1 Oct 2024).
- Learnable Edge Pruning: Methods such as LRGC (ViG-LRGC) adaptively adjust graph sparsity and per-node neighbor count based on learnable thresholds, balancing representation power and efficiency (Elsharkawi et al., 23 Sep 2025).
- Hardware Acceleration: Specialized FPGA accelerators enable real-time dynamic image graph construction, achieving substantial speedups over optimized CPU and GPU baselines on ViG tasks and generalizing across backbone and graph construction variants (Ramachandran et al., 29 Sep 2025).
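The windowed strategy is easy to see in code: restricting kNN to fixed windows turns one $N \times N$ distance computation into many small ones, so cost grows linearly with the number of windows. A minimal sketch, assuming non-overlapping windows that evenly tile the grid (no shifted-window pass); names and sizes are illustrative:

```python
# WiGNet-style windowed kNN: neighbor search only inside win x win windows.
import torch

def windowed_knn(x: torch.Tensor, h: int, w: int, win: int = 7, k: int = 9):
    """x: (h*w, D) features on an h x w grid; returns window-local kNN indices."""
    d = x.shape[-1]
    # Regroup the grid into non-overlapping windows: (num_windows, win*win, D).
    xw = (x.view(h // win, win, w // win, win, d)
           .permute(0, 2, 1, 3, 4)
           .reshape(-1, win * win, d))
    # Batched distances are (num_windows, win*win, win*win): each window's
    # search is independent, so total cost is linear in the window count.
    dist = torch.cdist(xw, xw)
    dist.diagonal(dim1=-2, dim2=-1).fill_(float("inf"))  # no self-loops
    # Indices are local to each window.
    return dist.topk(k, largest=False, dim=-1).indices

idx = windowed_knn(torch.randn(14 * 14, 192), h=14, w=14, win=7, k=9)
```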
A representative table of graph construction algorithms and their asymptotic complexities is given below:
| Graph Construction | Computational Cost | Adaptivity |
|---|---|---|
| kNN (dynamic, per-layer) | $\mathcal{O}(N^2)$ | Fully adaptive |
| SVGA / grid patterns | $\mathcal{O}(N)$ | Structured, static |
| Windowed kNN (WiGNet) | $\mathcal{O}(N)$ (linear in image size) | Local, block-shared |
| DEGC/partitioned | $\mathcal{O}(N^2/P)$ for $P$ clusters | Semi-global/cluster |
| LSGC (logarithmic rays) | $\mathcal{O}(N \log N)$ | Multiscale, fixed |
| LRGC (attention-threshold) | $\mathcal{O}(N^2)$ with learned sparsity | Fully learnable |
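For the last row of the table, a minimal sketch of LRGC-style soft-threshold edge gating; the cosine-similarity score, sigmoid gate, and temperature constant are assumptions standing in for the paper's exact policy:

```python
# Learnable soft-threshold edge gating: keep edges (softly) when a relational
# score exceeds a trainable per-layer threshold, so sparsity adapts in training.
import torch
import torch.nn as nn

class SoftThresholdEdges(nn.Module):
    def __init__(self, init_threshold: float = 0.0, temperature: float = 10.0):
        super().__init__()
        self.theta = nn.Parameter(torch.tensor(init_threshold))  # per-layer threshold
        self.temperature = temperature  # sharpness of the soft 0/1 gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Relational scores: cosine similarity between all node pairs, (N, N).
        xn = torch.nn.functional.normalize(x, dim=-1)
        scores = xn @ xn.t()
        # Differentiable gate: sigmoid((score - theta) * T) approximates a hard
        # threshold while letting gradients flow to theta.
        return torch.sigmoid((scores - self.theta) * self.temperature)

adj = SoftThresholdEdges()(torch.randn(196, 192))   # soft adjacency in [0, 1]
```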
4. Architectural Variants, Aggregation, and Specializations
ViG architectures are diversified by task demands and efficiency–accuracy tradeoffs:
- Hierarchical/Pyramid Backbones: Most ViG architectures employ multi-stage, pyramid-style downsampling, echoing CNN best practices for capturing multi-scale features (Pyramid ViG, PVG, LogViG, GreedyViG) (Han et al., 2022, Wu et al., 2023, Munir et al., 15 Oct 2025, Munir et al., 10 May 2024). Windowed and high-resolution fusion branches (LogViG, WiGNet) are used to preserve spatial detail at higher layers.
- Hybrid CNN–GNN Models: Architectures such as MobileViG, MobileViGv2, GreedyViG, and ClusterViG intersperse CNN-style inverted residual blocks (e.g., MBConv) with Grapher blocks to balance local and non-local context while optimizing parameter and FLOP budgets; a stage-level sketch follows this list (Munir et al., 2023, Avery et al., 9 Jun 2024, Munir et al., 10 May 2024, Parikh et al., 18 Jan 2025).
- Attention and Saliency: AttentionViG implements cross-attention aggregation within Grapher blocks and leverages learnable non-competitive weighting to suppress irrelevant neighbors, significantly outperforming uniform pooling methods (Gedik et al., 29 Sep 2025).
- Adaptive and Saliency-driven Graphs: FViG employs learnable channel-aware and spatial-aware self-saliency for robust flexible object recognition, with experiments demonstrating a +6.04% accuracy gain over standard ViG on challenging flexible-object data (Zuo et al., 6 Jun 2024).
- Prompt-based Adaptation: Vision Graph Prompting (VGP) introduces semantic low-rank decomposition for parameter-efficient adaptation to downstream tasks, matching or exceeding full fine-tuning on a variety of vision and molecular benchmarks with only ~5% of the parameter update cost (Ai et al., 7 May 2025).
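The hybrid pattern referenced above can be sketched as a stage that stacks MBConv-style local blocks before a Grapher block; the depths, widths, and max-relative Grapher internals here are illustrative assumptions rather than any specific model's configuration:

```python
# Hybrid CNN-GNN stage sketch: local inverted-residual blocks, then one
# graph block for non-local mixing, in the spirit of MobileViG/GreedyViG.
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Simplified inverted residual: expand -> depthwise conv -> project."""
    def __init__(self, dim: int, expand: int = 4):
        super().__init__()
        hidden = dim * expand
        self.body = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection

class Grapher(nn.Module):
    """Max-relative graph conv over a per-image dynamic kNN on feature maps."""
    def __init__(self, dim: int, k: int = 9):
        super().__init__()
        self.k = k
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        nodes = x.flatten(2).transpose(1, 2)            # (B, N, C), N = H*W
        dist = torch.cdist(nodes, nodes)                # dynamic per-image distances
        dist.diagonal(dim1=-2, dim2=-1).fill_(float("inf"))  # no self-loops
        idx = dist.topk(self.k, largest=False).indices       # (B, N, k)
        nbrs = torch.gather(
            nodes.unsqueeze(1).expand(-1, nodes.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, c))    # (B, N, k, C)
        agg = (nbrs - nodes.unsqueeze(2)).max(dim=2).values   # max-relative
        nodes = nodes + self.update(torch.cat([nodes, agg], dim=-1))
        return nodes.transpose(1, 2).reshape(b, c, h, w)

class HybridStage(nn.Module):
    """MBConv blocks for nearby context, then one Grapher for non-local mixing."""
    def __init__(self, dim: int, num_conv: int = 2):
        super().__init__()
        self.blocks = nn.Sequential(*[MBConv(dim) for _ in range(num_conv)],
                                    Grapher(dim))

    def forward(self, x):
        return self.blocks(x)

out = HybridStage(dim=64)(torch.randn(2, 64, 14, 14))   # (2, 64, 14, 14)
```

Stacking several such stages with downsampling between them yields the pyramid-style backbones described above.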
5. Applications: Effectiveness Across Visual Domains
ViGs have demonstrated widespread applicability across vision sub-domains:
- Classification: ViG variants, including PVG, GreedyViG, ClusterViG, and LogViG, achieve or surpass state-of-the-art ImageNet-1K top-1 accuracies with significantly reduced parameter and computation budgets (Wu et al., 2023, Munir et al., 10 May 2024, Munir et al., 15 Oct 2025, Parikh et al., 18 Jan 2025).
- Medical Imaging: ViG-UNet models underpin current best practices for medical image segmentation by effectively capturing irregular anatomical structures, outperforming CNN and transformer U-Nets on segmentation datasets (Jiang et al., 2023).
- Remote Sensing and EO: On large-scale satellite benchmarks (BigEarthNet, RESISC45, PatternNet), ViG matches or exceeds both CNN and ViT baselines with dynamic graph re-wiring enabling superior multi-label and multi-class performance (Colomba et al., 15 Feb 2024).
- Flexible Object Recognition: FViG attains a substantial edge on the Flexible Dataset (FDA) designed for semi-transparent, amorphous, and ambiguous objects (smoke, fog, fire), outperforming ViG and mixture baselines through adaptive self-saliency (Zuo et al., 6 Jun 2024).
- Dense Prediction Tasks: On MS COCO and ADE20K, ViG and its hybrids yield both high accuracy and efficient computation (COCO box AP up to 47.4, mIoU up to 47.8), and are competitive with or exceed state-of-the-art transformer and CNN backbones (Gedik et al., 29 Sep 2025, Parikh et al., 18 Jan 2025).
A summary table of representative ImageNet-1K results:
| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| ViG-Ti | 10.7 | 1.7 | 78.2 |
| GreedyViG-S | 12.0 | 1.6 | 81.1 |
| ClusterViG-S | 28.2 | 4.2 | 83.7 |
| PVG-S | 22.0 | 5.0 | 83.0 |
| LogViG-B | 30.5 | 4.6 | 83.6 |
| AttentionViG-B | 32.3 | 4.8 | 83.9 |
| AdaptViG-B | 26.8 | 3.8 | 83.3 |
6. Challenges, Limitations, and Future Directions
Despite their versatility, ViGs present new challenges:
- Scalability: The quadratic cost of dynamic kNN graph construction at large resolutions remains a bottleneck; although partitioned and sparse variants alleviate this, research into fully learnable, scalable graph wiring is ongoing (Ramachandran et al., 29 Sep 2025, Munir et al., 15 Oct 2025, Parikh et al., 18 Jan 2025).
- Graph Aggregation Expressivity: Early max-relative and mean-based pooling was less effective at filtering semantically irrelevant patches; learned or data-driven attention (AttentionViG, FViG, LRGC) now addresses this but increases model complexity and memory footprint (Gedik et al., 29 Sep 2025, Zuo et al., 6 Jun 2024, Elsharkawi et al., 23 Sep 2025).
- Inductive Bias and Interpretability: The flexibility of graph-based connectivity supports complex spatial and semantic structures but may lose strong local spatial priors inherent to CNNs. Hybrid designs and explicit positional encoding modules (CPE, HRS) counteract this loss (Munir et al., 10 May 2024, Munir et al., 15 Oct 2025).
- Application Customization: Domain-specific graph construction, e.g., for remote sensing, medical imaging, or molecular graphs, remains an active area.
- Efficient Adaptation and Transfer: Prompting and low-rank methods (VGP) for parameter-efficient domain adaptation are showing promising results (Ai et al., 7 May 2025).
Anticipated research directions include:
- End-to-end learnable graph construction with adaptive sparsity (Elsharkawi et al., 23 Sep 2025),
- Joint learning of edge weights and neighbor selection (Zuo et al., 6 Jun 2024, Gedik et al., 29 Sep 2025),
- Integrating multiscale and high-resolution feature fusion at low memory/compute cost (Munir et al., 15 Oct 2025),
- Hardware-friendly graph neural operators for mobile and high-throughput scenarios (Avery et al., 9 Jun 2024, Ramachandran et al., 29 Sep 2025),
- Extension of ViG principles to new modalities (video, multimodal, 3D) and generative frameworks.
7. Summary of Contributions and Impact
Vision Graph Neural Networks shift the paradigm of image representation from rigid grids and sequences to flexible, data-adaptive graphs, unifying the strengths of local and non-local reasoning. Progressive advances in graph construction, message-passing, hybrid architectures, and saliency-driven adaptation have resolved key bottlenecks and enabled state-of-the-art performance on diverse visual domains. Research continues to optimize efficiency, expressivity, scalability, and domain adaptability, establishing ViGs as a competitive and generalizable backbone for contemporary and future computer vision challenges (Han et al., 2022, Jiang et al., 2023, Zuo et al., 6 Jun 2024, Colomba et al., 15 Feb 2024, Parikh et al., 18 Jan 2025, Gedik et al., 29 Sep 2025, Munir et al., 13 Nov 2025).