LogViG: Efficient CNN–GNN Vision Backbone

Updated 4 July 2026

LogViG is a CNN–GNN hybrid model that employs Logarithmic Scalable Graph Construction to replace costly KNN graph generation with a deterministic, logarithmically scaling connectivity pattern.
It integrates multi-scale high-resolution processing via a dedicated shortcut branch that fuses local and global features across four stages.
Empirical evaluations show that LogViG variants achieve high accuracy on benchmarks like ImageNet-1K and ADE20K with fewer parameters and lower computational cost than competing models.

LogViG most directly denotes a family of hybrid CNN–GNN vision models built around Logarithmic Scalable Graph Construction (LSGC) together with a multi-scale high-resolution design for efficient image classification and semantic segmentation (Munir et al., 15 Oct 2025). In that usage, LogViG is a Vision GNN architecture that replaces expensive content-based graph construction such as KNN with a deterministic logarithmic connectivity pattern whose number of long-range links per node grows only logarithmically with image size. In adjacent multimodal literature, the string “LogViG” has also appeared in non-canonical or extrapolative senses: as a shorthand for a log-perplexity-based VIG-guided selective training scheme for LVLMs, and as a plausible logging-and-orchestration layer over VIG/VIC data-generation cycles in VIGC, rather than as the formal name of a model family (Lee et al., 19 Feb 2026, Wang et al., 2023).

1. Nomenclature and scope

Within vision backbones, LogViG is defined as a CNN–GNN hybrid Vision GNN that uses MBConv blocks for local convolutional processing in all 4 stages, LSGC-based grapher blocks in every stage for global processing, and a High-Resolution Shortcut (HRS) branch that maintains high-resolution features and fuses them with the low-resolution backbone (Munir et al., 15 Oct 2025). The model is positioned against three established baselines: CNNs, which have local receptive fields; ViTs, which provide global self-attention but incur quadratic token complexity; and prior ViGs, which represent images as graphs of patches but often rely on expensive or dense graph construction.

A recurrent source of confusion is the proximity between LogViG and VIG. In LVLM work, Visual Information Gain (VIG) is a perplexity-based quantity defined as

$\mathrm{VIG} = \log\left(\frac{\mathrm{PPL}(A \mid Q)}{\mathrm{PPL}(A \mid Q,I)}\right)$

and a “LogViG-style” procedure refers to selective training based on this log-perplexity ratio rather than to the LogViG CNN–GNN backbone (Lee et al., 19 Feb 2026). In VIGC, by contrast, the term “LogViG” is explicitly treated as an extrapolation: the paper defines VIG and VIC, while “LogViG” is described only as a plausible operational layer that would log and analyze those cycles (Wang et al., 2023). This suggests that, absent further qualification, LogViG should be read primarily as the LSGC-based Vision GNN family of (Munir et al., 15 Oct 2025).
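As a minimal sketch of the VIG definition above, assuming per-token natural-log probabilities of the answer are available from a model scored once without and once with the image. The function names and the toy numbers are illustrative and do not come from the cited paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def visual_information_gain(logprobs_text_only, logprobs_with_image):
    """VIG = log(PPL(A | Q) / PPL(A | Q, I)).

    A positive value means the image made the answer easier to predict.
    """
    return math.log(
        perplexity(logprobs_text_only) / perplexity(logprobs_with_image)
    )

# Toy three-token answer (made-up numbers, for illustration only).
without_image = [-2.0, -1.5, -2.5]  # log p(token | Q)
with_image = [-0.5, -0.4, -0.6]     # log p(token | Q, I)
vig = visual_information_gain(without_image, with_image)
```

Because the logarithm of a perplexity is the mean negative log-likelihood, the same quantity can be read as the drop in average per-token loss once the image is added.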

2. Logarithmic Scalable Graph Construction

LSGC is the central mechanism of LogViG. It defines a deterministic, structured graph on the pixel or patch grid using logarithmic step sizes determined by the bit depth of the feature-map height and width. For an input feature map

$X \in \mathbb{R}^{B \times C \times H \times W},$

each spatial location is treated conceptually as a node. Instead of building an explicit adjacency matrix or performing nearest-neighbor search, LSGC applies directional expansion operations along height and width.

The bit depth of a dimension is defined as the integer $b$ such that

$2^{b-1} \leq \text{size} < 2^b.$

For height and width, the method sets $h \gets \text{bit depth}(H)$ and $w \gets \text{bit depth}(W)$ . With expansion rate $K$ , the step sizes are

$s_i = K^i - 1.$
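The bit depth and step sizes can be sketched in a few lines. The inequality above is exactly what Python's built-in `int.bit_length` returns for a positive integer. The helper names, the default $K = 2$, and the index range of $i$ are assumptions made for illustration, since the text does not state them.

```python
def bit_depth(size):
    """Return b with 2**(b-1) <= size < 2**b (same as int.bit_length)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return size.bit_length()

def step_sizes(size, K=2):
    """Step sizes s_i = K**i - 1 for i = 0 .. bit_depth(size) - 1.

    ASSUMPTION: the range of i is not given in the text; starting at i = 0
    (which yields the trivial step 0) is chosen here for illustration.
    """
    return [K**i - 1 for i in range(bit_depth(size))]

depth_56 = bit_depth(56)   # 32 <= 56 < 64
steps_56 = step_sizes(56)  # log-spaced steps for a 56-wide axis
```

Under these assumptions a 56-wide axis has bit depth 6 and steps 0, 1, 3, 7, 15, 31, so doubling the resolution adds only one more step.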

For each scale, LSGC performs forward and backward expansions in both spatial dimensions:

$\mathrm{expand}_{\mathrm{forwardH}}(X, s)$,
$\mathrm{expand}_{\mathrm{backwardH}}(X, s)$,
$\mathrm{expand}_{\mathrm{forwardW}}(X, s)$,
$\mathrm{expand}_{\mathrm{backwardW}}(X, s)$.

When an expansion goes out of bounds, it wraps around to the other side of the image, yielding a toroidal boundary condition. For each expansion, the method computes a relative feature

$X_{\mathrm{rel}} = \mathrm{expand}(X, s) - X$

and aggregates across scales and directions by element-wise max: $X_{\max} \gets \max(X_{\max}, X_{\mathrm{rel}})$. The grapher output is then

$\mathrm{Conv}(\mathrm{Concat}(X, X_{\max})).$
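The expansion and max-relative aggregation described above can be sketched with NumPy, using `np.roll` for the toroidal wraparound. This is a hedged illustration rather than the authors' implementation: the function name, the default $K = 2$, the index range of the steps, the sign convention (shifted minus original), and the zero initialisation are all assumptions, and the final convolutional mixing step is omitted.

```python
import numpy as np

def lsgc_max_relative(x, K=2):
    """Element-wise max of relative features over log-spaced toroidal shifts.

    x has shape (B, C, H, W). ASSUMPTIONS: steps are K**i - 1 for
    i = 1 .. bit_depth - 1, and the relative feature is (shifted - x).
    The trivial step 0 gives a zero relative feature, so the running
    maximum simply starts from zeros.
    """
    _, _, H, W = x.shape
    x_max = np.zeros_like(x)
    for axis, size in ((2, H), (3, W)):
        for i in range(1, size.bit_length()):
            step = K**i - 1
            for shift in (step, -step):  # forward and backward expansion
                x_rel = np.roll(x, shift, axis=axis) - x  # wraps around
                x_max = np.maximum(x_max, x_rel)
    return x_max

# One row of four positions: steps 1 and 3 along the width axis.
x = np.array([[[[0.0, 1.0, 2.0, 3.0]]]])
print(lsgc_max_relative(x))  # [[[[3. 1. 1. 0.]]]]
```

No adjacency matrix or neighbor search appears anywhere: every "edge" is a tensor shift, which is why the construction avoids reshaping between 4D feature maps and 3D graph representations.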

This construction is designed to balance reachability and sparsity. KNN graph construction in classic ViG scales poorly with the number of patches and requires reshaping between 4D tensors and 3D graph representations. SVGA and MGC remove KNN search, but SVGA’s number of connections grows linearly with image dimensions, which can increase redundancy and over-squashing at high resolution. LSGC instead uses $O(\log H)$ height-direction expansions and $O(\log W)$ width-direction expansions, so the number of effective neighbors per node is $O(\log H + \log W)$ rather than linear in resolution (Munir et al., 15 Oct 2025).
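To make the growth rates concrete, the sketch below counts distinct nonzero offsets per node along a single axis for two schemes: log-spaced toroidal steps, and a fixed-stride scheme that links every $K$-th position. The second is only a stand-in for linear growth in the spirit of SVGA, and both helpers, including the step index range, are assumptions for illustration rather than the paper's exact graphs.

```python
def lsgc_offsets(size, K=2):
    """Distinct nonzero toroidal offsets reached by +/- (K**i - 1) steps."""
    offsets = set()
    for i in range(size.bit_length()):
        step = K**i - 1
        offsets.update({step % size, -step % size})
    offsets.discard(0)  # a zero offset is the node itself
    return offsets

def strided_offsets(size, K=2):
    """Distinct nonzero offsets when every K-th position is linked."""
    return {m * K for m in range(1, (size - 1) // K + 1)}

for size in (28, 56, 112, 224):
    print(size, len(lsgc_offsets(size)), len(strided_offsets(size)))
```

Under these assumptions, each doubling of the axis length adds two log-spaced offsets per axis, while the fixed-stride count roughly doubles.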

The paper analyzes graph quality through average shortest path length. A plain 2D lattice has long paths; LSGC sharply reduces them while using fewer edges than SVGA. The reported values are:

Resolution	Structure	Avg. shortest path
56×56	Lattice	37.333
56×56	LSGC	4.359
56×56	SVGA	2.895
28×28	Lattice	18.667
28×28	LSGC	3.719
28×28	SVGA	2.794

These values indicate that LSGC retains short communication paths without adopting the denser connectivity of SVGA. A plausible implication is that the method trades a small increase in path length for lower fan-in per node and reduced over-squashing.
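The lattice rows of the table agree with a closed form: on an $n \times n$ grid with 4-neighbor connectivity and no wraparound, the mean shortest-path length over pairs of distinct nodes is $2n/3$, which gives 37.333 for $n = 56$ and 18.667 for $n = 28$. The sketch below checks this identity by breadth-first search on a small grid. The function name is illustrative, and the LSGC and SVGA rows are not reproduced because they depend on construction details not given here.

```python
from collections import deque

def lattice_avg_shortest_path(n):
    """Mean BFS distance over ordered pairs of distinct nodes on an
    n x n grid with 4-neighbour connectivity and no wraparound."""
    total = 0
    for sr in range(n):
        for sc in range(n):
            dist = {(sr, sc): 0}
            queue = deque([(sr, sc)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= nr < n and 0 <= nc < n and (nr, nc) not in dist:
                        dist[(nr, nc)] = dist[(r, c)] + 1
                        queue.append((nr, nc))
            total += sum(dist.values())
    return total / (n * n * (n * n - 1))

small = lattice_avg_shortest_path(6)  # compare with 2 * 6 / 3
closed_form_56 = 2 * 56 / 3
closed_form_28 = 2 * 28 / 3
```

The pure-Python search is kept to a small grid because all-pairs BFS at 56×56 is slow; the closed form covers the table's resolutions directly.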

3. Network architecture

LogViG combines a convolutional stem, a multi-stage low-resolution backbone, a single high-resolution branch connected by HRS, and a final merge and classification head (Munir et al., 15 Oct 2025). The input image has size $H \times W \times 3$. The stem consists of two convolutional layers, each with stride 2, producing a $\frac{H}{4} \times \frac{W}{4}$ feature map. That stem output feeds both the low-resolution backbone and the high-resolution branch.

The low-resolution branch has 4 stages, with decreasing spatial resolution and increasing channel width. Within each stage, MBConv blocks provide local modeling and LSGC blocks provide graph-based long-range interactions. Between stages, downsample blocks reduce spatial resolution by stride-2 convolution. The MBConv block follows the standard MobileNetV2-like pattern: pointwise expansion, depthwise $3 \times 3$ convolution, and pointwise projection, typically with BN and GeLU and a residual connection when shapes match.

The LSGC block, termed the Logarithmic Grapher, takes an input feature map $X$, computes $X_{\max}$ by multi-scale expansion and max-relative aggregation, and mixes $X$ and $X_{\max}$ through convolution. This gives each stage both local convolutional bias and structured long-range message passing.

The High-Resolution Shortcut preserves finer spatial information. Starting from an earlier high-resolution map, it applies two convolutions, the first with stride 2 and the second with stride 1, each followed by BN and GeLU. At the end of the backbone, the low-resolution stage-4 features are upsampled by bilinear interpolation, passed through a pointwise convolution for channel alignment, summed with the HRS features, and then processed by another pointwise convolution with BN and GeLU. This implements the model’s multi-scale feature fusion.
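The spatial bookkeeping implied by this description can be traced with a short sketch. It assumes that every stride-2 convolution halves the resolution exactly, that stage 1 runs at the stem resolution, and that the HRS branch starts from the stem output. The function name and dictionary keys are hypothetical.

```python
def logvig_resolutions(height=224, width=224):
    """Trace spatial sizes through the stem, the 4 stages, and the HRS branch."""
    stem = (height // 4, width // 4)        # two stride-2 convolutions
    stages = [stem]                         # ASSUMPTION: stage 1 keeps stem size
    for _ in range(3):                      # stride-2 downsample between stages
        h, w = stages[-1]
        stages.append((h // 2, w // 2))
    hrs = (stem[0] // 2, stem[1] // 2)      # stride-2 conv, then stride-1 conv
    upsample_factor = hrs[0] // stages[-1][0]  # bilinear upsampling before the sum
    return {
        "stem": stem,
        "stages": stages,
        "hrs": hrs,
        "upsample_factor": upsample_factor,
    }

shapes = logvig_resolutions(224, 224)
```

For a 224×224 input under these assumptions, the stages run at 56, 28, 14, and 7 pixels per side, the HRS features sit at 28, and the stage-4 map is upsampled by a factor of 4 before the element-wise sum.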

The classifier head applies global average pooling and a feed-forward projection to class logits. For semantic segmentation, the ImageNet-pretrained LogViG backbone is used with a Semantic FPN decoder rather than the classification head.

4. Variants and training configuration

The paper defines Ti-LogViG, S-LogViG, and B-LogViG, together with a Wide Ti-LogViG ablation variant (Munir et al., 15 Oct 2025). Each of the three main variants is specified by its four stage widths and by the number of MBConv/LSGC repetitions in each of the four stages. Wide Ti-LogViG reduces depth, increases the widths of stages 1–3, and adjusts the stage-4 width so that the parameter count matches.

The principal ImageNet-1K results for these variants are:

Model	Params / GMACs	ImageNet-1K top-1
Ti-LogViG	8.1M / 1.1	79.9%
S-LogViG	13.9M / 1.9	81.5%
B-LogViG	30.5M / 4.6	83.6%

For image classification, the implementation uses PyTorch 1.12 and timm, with ImageNet-1K at input resolution 224×224, the AdamW optimizer, a cosine-annealed learning rate, and 300 epochs of training. The recipe includes knowledge distillation from a RegNetY-16GF teacher (82.9% top-1), together with RandAugment, Mixup, CutMix, random erasing, and repeated augmentation. For semantic segmentation on ADE20K, the ImageNet-1K-pretrained backbone is trained with Semantic FPN at resolution 512×512, using AdamW with poly learning-rate decay for 40K iterations on 8× NVIDIA RTX 6000 Ada GPUs.
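The two learning-rate schedules named in this recipe have standard closed forms, sketched below. The function names are illustrative, and the base learning rate and poly power used in the example calls are placeholder values chosen only to exercise the code; they are not the paper's hyperparameters.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def poly_decay_lr(step, total_steps, base_lr, power):
    """Polynomial ("poly") decay: base_lr * (1 - step / total_steps) ** power."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power

# PLACEHOLDER values, not taken from the paper.
placeholder_lr = 1e-3
halfway_cosine = cosine_annealing_lr(150, 300, placeholder_lr)
halfway_poly = poly_decay_lr(20_000, 40_000, placeholder_lr, power=1.0)
```

In practice both schedules are usually taken from the training framework rather than reimplemented; the sketch only shows the shape of the decay.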

The depth-vs-width ablation compares the deep Ti-LogViG (8.1M parameters, 79.9% top-1) with Wide Ti-LogViG (about 8.0M parameters, 79.6% top-1), indicating that the deeper, narrower configuration performs slightly better at a similar parameter count.

5. Empirical performance and ablations

On ImageNet-1K, Ti-LogViG reaches 79.9% top-1 with 8.1M parameters and 1.1 GMACs. Relative to Pyramid ViG-Ti (PViG-Ti), this is 1.7 percentage points higher top-1 accuracy with a 24.3% reduction in parameters and a 35.3% reduction in GMACs (Munir et al., 15 Oct 2025). The same paper reports PViG-Ti at 10.7M parameters, 1.7 GMACs, and 78.2%; PViHGNN-Ti at 12.3M parameters, 2.3 GMACs, and 78.9%; and MobileViG-S at 7.2M parameters, 1.0 GMACs, and 78.2%. At larger scales, S-LogViG attains 81.5% with 13.9M parameters and 1.9 GMACs, and B-LogViG attains 83.6% with 30.5M parameters and 4.6 GMACs, approaching or exceeding stronger CNN, ViT, and ViG baselines at substantially lower model size or compute.
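The relative figures quoted against PViG-Ti follow directly from the reported numbers, as this short check shows. The helper and variable names are illustrative.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of value relative to baseline."""
    return 100.0 * (baseline - value) / baseline

# Figures as reported in the text above.
pvig_ti = {"params_m": 10.7, "gmacs": 1.7, "top1": 78.2}
ti_logvig = {"params_m": 8.1, "gmacs": 1.1, "top1": 79.9}

param_cut = relative_reduction(pvig_ti["params_m"], ti_logvig["params_m"])
gmac_cut = relative_reduction(pvig_ti["gmacs"], ti_logvig["gmacs"])
top1_gain = ti_logvig["top1"] - pvig_ti["top1"]  # percentage points

print(f"{param_cut:.1f}% fewer parameters, {gmac_cut:.1f}% fewer GMACs, "
      f"+{top1_gain:.1f} points top-1")
```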

The ADE20K segmentation results use LogViG as a backbone with Semantic FPN. The paper reports 44.1 ± 0.6 mIoU for S-LogViG and 46.8 ± 0.4 mIoU for B-LogViG. The listed baselines include MobileViG-M* (41.8 mIoU), EfficientFormer-L1 (38.9 mIoU), EfficientFormer-L3 (43.5 mIoU), EfficientFormer-L7 (45.1 mIoU), and FastViT-SA36 (42.9 mIoU). B-LogViG exceeds all of these, and S-LogViG exceeds all except EfficientFormer-L7.

The ablation studies isolate two main effects. First, replacing SVGA with LSGC at fixed parameter count improves Ti-LogViG from 79.4% to 79.8% top-1 without extra parameters; adding HRS increases this further to 79.9% at the cost of additional parameters. Second, the number of stages containing graphers matters. Starting from MobileViG-S (7.2M, 78.2%), a Ti-LogViG variant with graphers only in stage 4 reaches 78.8% with 7.0M parameters; adding graphers to stages 2–4 yields 79.8% with 8.0M parameters; the full 4-stage deep configuration reaches 79.9% with 8.1M parameters. The reported pattern is that multi-scale global reasoning and depth both contribute materially to performance.

The LogViG model paper also implies several limitations. LSGC is a static graph structure based only on image dimensions rather than image content. Unlike KNN-based ViGs, it does not adapt edges to semantic similarity. The architecture uses a single high-resolution branch, not a fully parallel HRNet-style multi-branch hierarchy. The evaluation concentrates on ImageNet-1K and ADE20K, and the paper does not present explicit robustness or out-of-distribution analysis (Munir et al., 15 Oct 2025). These points delimit the present evidence: LogViG demonstrates strong efficiency–accuracy trade-offs for standard recognition and segmentation benchmarks, but does not yet establish advantages for content-adaptive graph learning or distribution shift.

In multimodal work, the term’s neighboring usages sharpen that distinction. In Selective Training for Large Vision LLMs via Visual Information Gain, VIG is the reduction in cross-entropy or log-perplexity produced by adding the image, and the associated selective training ranks samples by

$\mathrm{VIG} = \log \mathrm{PPL}(A \mid Q) - \log \mathrm{PPL}(A \mid Q, I),$

then retains the top-ranked samples and masks low-VIG tokens during instruction tuning (Lee et al., 19 Feb 2026). In VIGC: Visual Instruction Generation and Correction, the acronym VIG denotes Visual Instruction Generation, VIC denotes Visual Instruction Correction, and “LogViG” is explicitly described only as a possible logging, analytics, or dataset-versioning layer over those processes rather than as a named architecture (Wang et al., 2023). A common misconception is therefore to treat all three usages as a single method family. The literature instead supports a narrower statement: LogViG is canonically the LSGC-based hybrid CNN–GNN Vision GNN of Munir et al. (15 Oct 2025), while the LVLM usages are either “LogViG-style” shorthand or explicit extrapolation.

Taken together, these strands indicate two distinct research directions associated with the string “LogViG.” One is architectural: efficient high-resolution vision backbones through logarithmically scaled graph construction. The other is procedural and only partially canonical: log-perplexity-based sample and token selection, or logged orchestration of multimodal data-generation loops. The first is formally defined and experimentally validated as a model family; the second remains contextual, paper-specific terminology rather than a standardized architecture name (Munir et al., 15 Oct 2025, Lee et al., 19 Feb 2026, Wang et al., 2023).