
Dynamic Grained Encoder (DGE)

Updated 5 February 2026
  • Dynamic Grained Encoder (DGE) is a region-adaptive module that modulates attention query granularity in Vision Transformers to focus on discriminative regions.
  • It dynamically subdivides image regions to reduce computation, achieving up to a 60% reduction in FLOPs with minimal accuracy loss.
  • DGE integrates with existing architectures by replacing each encoder block with a two-stage router-plus-encoder design, using a gating network with Gumbel-Softmax for differentiable discrete routing.

The Dynamic Grained Encoder (DGE) is a module for Vision Transformers designed to adaptively allocate computation across spatial regions of input images by modulating the granularity of the attention queries. By introducing region-adaptive sparsification into the query set fed to Multi-Head Self-Attention (MHSA) blocks, DGE achieves a fine-grained representation in discriminative regions while substantially reducing overall computational cost. DGE is compatible with most Vision Transformer frameworks and demonstrates significant reductions in floating point operations (FLOPs) with negligible accuracy loss across diverse vision tasks (Song et al., 2023).

1. High-Level Architectural Description

In conventional Vision Transformers (ViTs), the structure comprises an input sequence of $N$ patch tokens $x \in \mathbb{R}^{N \times C}$, a stack of $L$ encoder blocks (each with MHSA and a multi-layer perceptron), followed by a classification head. DGE modifies this paradigm by replacing each dense encoder block with a two-stage process:

  1. Dynamic Grained Router: The feature map of size $H \times W$ ($N = H \cdot W$) is partitioned into $R$ non-overlapping regions of size $S \times S$. For each region $i$, a gating network determines its subdivision granularity, deciding between coarse and fine splits into sub-patches. This adaptive subdivision results in varying numbers of queries per region: a larger patch size yields fewer queries and thus faster attention computation for less informative regions.
  2. Vanilla Encoder: The tokens, pooled according to the region-assigned granularity, yield a reduced set of $Q \ll N$ queries. MHSA and MLP layers are then applied to these queries, optionally reusing or pruning the key/value set. After processing, the $Q$ outputs are unpooled back to the $H \times W$ spatial layout and added to the input via a residual connection, preserving the spatial resolution for subsequent blocks (Song et al., 2023).
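
A minimal, self-contained sketch of this two-stage block is given below (batch size 1 and a per-region loop for clarity; the reference implementation instead gathers all routed queries into a single set). All names (`DGEBlock`, `gate`) and sizes are illustrative assumptions, not the vtpack API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGEBlock(nn.Module):
    """Illustrative DGE block: route -> encode reduced queries -> unpool + residual."""

    def __init__(self, dim=192, heads=3, S=4, phis=(1, 2, 4)):
        super().__init__()
        self.S, self.phis = S, phis
        self.gate = nn.Linear(dim, len(phis))            # router: logits over granularities
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z):                                # z: (1, H, W, C)
        _, H, W, C = z.shape
        S, out = self.S, z.clone()
        kv = z.reshape(1, H * W, C)                      # reuse full token set as keys/values
        for i in range(0, H, S):                         # loop over S x S regions
            for j in range(0, W, S):
                region = z[:, i:i + S, j:j + S, :]       # (1, S, S, C)
                logits = self.gate(region.mean(dim=(1, 2)))
                phi = self.phis[int(logits.argmax())]    # inference-time hard routing
                q = F.avg_pool2d(region.permute(0, 3, 1, 2), phi)  # pool to queries
                q = q.flatten(2).transpose(1, 2)         # (1, (S/phi)^2, C)
                q = q + self.attn(q, kv, kv)[0]          # MHSA on the reduced query set
                q = q + self.mlp(q)
                up = q.transpose(1, 2).reshape(1, C, S // phi, S // phi)
                up = F.interpolate(up, scale_factor=phi, mode="nearest")  # unpool
                out[:, i:i + S, j:j + S, :] = region + up.permute(0, 2, 3, 1)
        return out                                       # (1, H, W, C), resolution preserved

# usage on a random feature map (hypothetical size)
out = DGEBlock()(torch.randn(1, 16, 16, 192))            # -> (1, 16, 16, 192)
```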

2. Core Routing Algorithm and Mathematical Formulation

Given an input $x \in \mathbb{R}^{(H \cdot W) \times C}$ reshaped to $z \in \mathbb{R}^{H \times W \times C}$, the routing mechanism proceeds as follows:

  • Region Partitioning: With $S = \max(\Phi)$ for candidate granularities $\Phi = \{\phi_1, \phi_2, \dots, \phi_K\}$ ($\phi_k \le S$), $z$ is split into $R = \lceil H/S \rceil \cdot \lceil W/S \rceil$ blocks $\{z_i\}$, each of shape $S \times S \times C$.
  • Gating Logits: For each region $i$, the gating logits $h_i \in \mathbb{R}^K$ are computed from the region's average-pooled descriptor:

$$h_i = W^{\top} \left( \frac{1}{S^2} \sum_{(u,v) \in \text{region } i} z_i[u, v] \right) + b$$

where $W \in \mathbb{R}^{C \times K}$ and $b \in \mathbb{R}^K$.

  • Granularity Selection at Inference: The assigned granularity index is

$$\theta_i = \arg\max_{k \in \{1, \dots, K\}} h_i[k]$$

Region $i$ is then subdivided into patches of size $\phi_{\theta_i} \times \phi_{\theta_i}$, yielding $N_i = (S/\phi_{\theta_i})^2$ tokens per region. The total number of queries is $Q = \sum_i N_i$.

  • Differentiable Training via Gumbel-Softmax: To enable backpropagation through the discrete routing decisions, training replaces the hard assignment $\theta_i$ with Gumbel-Softmax sampling:

$$g_k \sim \text{Gumbel}(0,1), \quad \alpha_i[k] = (h_i[k] + g_k) / \tau, \quad p_i = \operatorname{softmax}(\alpha_i)$$

Forward routing uses the hard index $\theta_i = \arg\max_k \alpha_i[k]$, and gradients are backpropagated through $p_i$ via the straight-through estimator (Song et al., 2023).
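
The sketch below combines the inference path (hard argmax) and the training path (`F.gumbel_softmax` with `hard=True`) in one routing function. Names and shapes are assumptions for illustration; in a full implementation the one-hot sample $p_i$ would also weight the pooled tokens so that gradients actually reach the gating layer, which this sketch does not show.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def route_regions(z, gate, S=4, phis=(1, 2, 4), training=False, tau=1.0):
    """Routing sketch. z: (H, W, C) feature map; gate: nn.Linear(C, K).
    Returns the granularity index per region and the total query count Q."""
    H, W, C = z.shape
    # average-pool every S x S region into a single C-dim descriptor
    desc = z.reshape(H // S, S, W // S, S, C).mean(dim=(1, 3))   # (H/S, W/S, C)
    h = gate(desc)                                               # logits h_i, (H/S, W/S, K)
    if training:
        # hard one-hot forward pass, straight-through soft gradients backward;
        # a real implementation also multiplies tokens by p for gradient flow
        p = F.gumbel_softmax(h, tau=tau, hard=True)              # (H/S, W/S, K)
        theta = p.argmax(dim=-1)
    else:
        theta = h.argmax(dim=-1)                                 # noise-free at inference
    phi = torch.tensor(phis)[theta]                              # patch size per region
    Q = int(((S // phi) ** 2).sum())                             # sum of N_i = (S/phi)^2
    return theta, Q

# usage on random features (hypothetical sizes)
z, gate = torch.randn(16, 16, 64), nn.Linear(64, 3)
theta, Q = route_regions(z, gate, training=True)
print(theta.shape, Q)   # (4, 4) regions; 16 <= Q <= 256
```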

3. Computational Complexity and FLOP Analysis

Standard MHSA with $N$ tokens incurs $O(N^2 C)$ complexity. With DGE, the number of queries is $Q \approx N / \mathbb{E}[\phi^2]$ (where $\mathbb{E}[\phi^2]$ is the average squared granularity), yielding an attention cost of $O(NQC)$. The computational savings ratio is approximately $1/\mathbb{E}[\phi^2]$. With $\Phi = \{1, 2, 4\}$ and a learned $\mathbb{E}[\phi^2] \approx 2.5$, the network achieves an empirical FLOP reduction of roughly $1 - 1/2.5 \approx 60\%$. On ImageNet, observed FLOP reductions are 40–60% for DeiT-S and PVT backbones, with $<0.2\%$ top-1 accuracy difference at a budget ratio $\gamma = 0.5$ (Song et al., 2023).
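
As a quick sanity check of this arithmetic, a few lines of Python (with illustrative token-count and width values) reproduce the 60% figure:

```python
# Worked example of the savings estimate above (pure arithmetic).
N, C = 196, 384                      # e.g. 14x14 tokens, DeiT-S width (illustrative)
E_phi2 = 2.5                         # learned average squared granularity
Q = N / E_phi2                       # approximate reduced query count
dense_attn = N * N * C               # O(N^2 C) dense attention cost
dge_attn = N * Q * C                 # O(N Q C) DGE attention cost
print(1 - dge_attn / dense_attn)     # 0.6 -> ~60% attention-FLOP reduction
```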

| Model  | Vanilla FLOPs | DGE FLOPs | Top-1 Accuracy (Vanilla → DGE) |
|--------|---------------|-----------|--------------------------------|
| DeiT-S | 14.3G         | 6.1G      | 79.8% → 79.6%                  |
| PVT-S  | 6.2G          | 3.4G      | 80.2% → 80.1%                  |
| PVT-M  | 14.1G         | 8.1G      | 81.5% → 81.4%                  |

4. Training Procedure and Loss Design

DGE training involves two loss components:

  • Task Loss: Standard for the target application, e.g., cross-entropy for classification, Mask-RCNN or FPN losses for object detection/segmentation.
  • Budget Loss: Controls sparseness by penalizing deviation from a target FLOP ratio $\gamma$:

$$\beta = \frac{\text{actual FLOPs used}}{\text{FLOPs of vanilla encoder}}$$

$$L_{\text{budget}} = (\beta - \gamma)^2$$

The total loss is $L = L_{\text{task}} + \lambda L_{\text{budget}}$ with $\lambda \approx 1$. Typical $\gamma$ values lie in $[0, 1]$. The gating networks contribute only $O(CK)$ parameters per block and are trained by gradient flow through the Gumbel-Softmax path.
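
A sketch of the budget penalty follows, under the simplifying assumption that the FLOP ratio $\beta$ can be approximated by the kept-query fraction $\mathbb{E}[1/\phi^2]$ (attention cost only; the exact FLOP accounting is more detailed):

```python
import torch

def budget_loss(p, gamma, phis=(1, 2, 4)):
    """Budget penalty sketch. p: (R, K) soft routing probabilities per region.
    beta approximates used/vanilla FLOPs by the kept-query ratio E[1/phi^2]."""
    inv_phi2 = torch.tensor([1.0 / (f * f) for f in phis])
    beta = (p @ inv_phi2).mean()          # fraction of dense queries kept
    return (beta - gamma) ** 2

# usage: soft probabilities keep the penalty differentiable during training
p = torch.softmax(torch.randn(16, 3), dim=-1)   # e.g. 16 regions, K = 3
loss = budget_loss(p, gamma=0.5)                # total: L_task + lambda * loss
```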

Strategies such as gradually warming $\gamma$ up from 1.0 to its target value over the initial epochs help prevent collapse to coarse tokens when $L_{\text{budget}}$ outweighs $L_{\text{task}}$ (Song et al., 2023).
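
One plausible form of such a warm-up (the exact schedule is not specified in the text) is a linear anneal:

```python
def gamma_schedule(epoch, warmup_epochs=10, gamma_target=0.5):
    """Linear warm-up sketch: start fully dense (gamma = 1.0), anneal to target."""
    if epoch >= warmup_epochs:
        return gamma_target
    return 1.0 - (1.0 - gamma_target) * epoch / warmup_epochs
```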

5. Empirical Behavior on Vision Tasks

ImageNet Classification

  • DeiT-S: 14.3G→6.1G FLOPs (∼57% reduction), accuracy from 79.8% to 79.6%.
  • PVT-S: 6.2G→3.4G (∼45% reduction), accuracy from 80.2% to 80.1%.
  • PVT-M: 14.1G→8.1G (∼42% reduction), accuracy from 81.5% to 81.4%.
  • The accuracy-FLOP trade-off curve (parametrized by $\gamma$) is smooth, with the knee at $\gamma \approx 0.5$.

COCO Detection and Segmentation

  • Mask-RCNN w/ PVT-S: 251G→185G, box AP from 40.4 to 40.1, mask AP from 37.8 to 37.5.
  • Mask-RCNN w/ DPVT-S: 186G→147G, box AP 44.0 to 43.8, mask AP 40.3 to 40.0.
  • Inference speed: the backbone runs ∼25% faster on a V100 GPU.

ADE20K Semantic Segmentation

  • Semantic-FPN w/ PVT-S: 226G→155G, mIoU 41.8 to 41.7.
  • Semantic-FPN w/ DPVT-S: 157G→121G, mIoU 44.4 unchanged.
  • PVT-M+DGE vs PVT-S: Improves by 2.1% mIoU at lower FLOPs (Song et al., 2023).

6. Implementation Considerations and Best Practices

  • PyTorch Implementation: The reference implementation (vtpack) uses a linear layer (nn.Linear(C, K)) applied to the average-pooled features of each region as the router. Region splitting uses nn.Unfold or custom indexing for $S \times S$ blocks.
  • Gumbel-Softmax: Employ torch.distributions.Gumbel or F.gumbel_softmax (with hard=True). During inference, disable noise and use argmax-only assignment.
  • Dynamic Query Length: DGE yields a variable $Q$ across images and batches. To handle this in MHSA (which expects a fixed sequence length), pad the query sets per image or process with a for-loop for small batch sizes; see the padding sketch after this list.
  • Budget Tracking: Track the chosen $\phi_i^2$ per region in the forward pass to compute $\beta$; apply $L_{\text{budget}}$ after averaging over the batch.
  • Warm-up: Gradually annealing $\gamma$ avoids collapse to coarse tokens if $L_{\text{budget}}$ dominates early training.
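
For the dynamic-query-length point above, one padding-based sketch (illustrative, not the vtpack approach): right-pad each image's query set to the batch maximum, run attention once, and zero out the padded outputs afterwards.

```python
import torch
import torch.nn as nn

# per-image query sets with different lengths Q (hypothetical sizes)
queries = [torch.randn(q, 64) for q in (37, 52, 41)]
Q_max = max(q.shape[0] for q in queries)

padded = torch.stack([nn.functional.pad(q, (0, 0, 0, Q_max - q.shape[0]))
                      for q in queries])                  # (B, Q_max, C)
pad_mask = torch.stack([torch.arange(Q_max) >= q.shape[0]
                        for q in queries])                # True where padded

attn = nn.MultiheadAttention(64, 4, batch_first=True)
kv = torch.randn(3, 196, 64)                              # full key/value token sets
out, _ = attn(padded, kv, kv)                             # padded rows attend too, but
out = out.masked_fill(pad_mask.unsqueeze(-1), 0.0)        # their outputs are discarded
```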

7. Context, Generalizability, and Limitations

DGE introduces data-adaptive query sparsification, enabling computational focus on informative image regions within standard Vision Transformer pipelines. Its design and empirical performance demonstrate broad applicability across image classification, object detection, and semantic segmentation tasks, yielding 40–60% computational savings at negligible accuracy cost. The mechanism is compatible with diverse vision transformer backbones and can be integrated with minimal architectural changes. A plausible implication is that DGE-type region-adaptive routing could be extended or specialized for tasks where spatial saliency is both highly variable and critical for computational efficiency (Song et al., 2023).
