
Dynamic Grained Encoder (DGE)

Updated 5 February 2026
  • Dynamic Grained Encoder (DGE) is a region-adaptive module that modulates attention query granularity in Vision Transformers to focus on discriminative regions.
  • It dynamically subdivides image regions to reduce computation, achieving up to a 60% reduction in FLOPs with minimal accuracy loss.
  • DGE integrates with existing architectures by replacing each encoder block with a two-stage router-plus-encoder design, using a gating network with Gumbel-Softmax for differentiable discrete routing.

The Dynamic Grained Encoder (DGE) is a module for Vision Transformers designed to adaptively allocate computation across spatial regions of input images by modulating the granularity of the attention queries. By introducing region-adaptive sparsification into the query set fed to Multi-Head Self-Attention (MHSA) blocks, DGE achieves a fine-grained representation in discriminative regions while substantially reducing overall computational cost. DGE is compatible with most Vision Transformer frameworks and demonstrates significant reductions in floating point operations (FLOPs) with negligible accuracy loss across diverse vision tasks (Song et al., 2023).

1. High-Level Architectural Description

In conventional Vision Transformers (ViTs), the structure comprises an input sequence of $N$ patch tokens $x \in \mathbb{R}^{N \times C}$, a stack of $L$ encoder blocks (each with MHSA and a multi-layer perceptron), followed by a classification head. DGE modifies this paradigm by replacing each dense encoder block with a two-stage process:

  1. Dynamic Grained Router: The feature map of size $H \times W$ ($N = H \cdot W$) is partitioned into $R$ non-overlapping regions of size $S \times S$. For each region $i$, a gating network determines its subdivision granularity, deciding between coarse and fine splits into sub-patches. This adaptive subdivision results in varying numbers of queries per region: a larger patch size yields fewer queries and thus faster attention computation for less informative regions.
  2. Vanilla Encoder: The tokens, pooled according to the region-assigned granularity, yield a reduced set of $Q \ll N$ queries. MHSA and MLP layers are then applied to these queries, optionally reusing or pruning the key/value set. After processing, the $Q$ outputs are unpooled back to the $H \times W$ spatial layout and added to the input via a residual connection, preserving the spatial resolution for subsequent blocks (Song et al., 2023).
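
A minimal, self-contained sketch of this two-stage block is given below (batch size 1 and a per-region loop for clarity; the reference implementation instead gathers all routed queries into a single set). All names (`DGEBlock`, `gate`) and sizes are illustrative assumptions, not the vtpack API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGEBlock(nn.Module):
    """Illustrative DGE block: route -> encode reduced queries -> unpool + residual."""

    def __init__(self, dim=192, heads=3, S=4, phis=(1, 2, 4)):
        super().__init__()
        self.S, self.phis = S, phis
        self.gate = nn.Linear(dim, len(phis))            # router: logits over granularities
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z):                                # z: (1, H, W, C)
        _, H, W, C = z.shape
        S, out = self.S, z.clone()
        kv = z.reshape(1, H * W, C)                      # reuse full token set as keys/values
        for i in range(0, H, S):                         # loop over S x S regions
            for j in range(0, W, S):
                region = z[:, i:i + S, j:j + S, :]       # (1, S, S, C)
                logits = self.gate(region.mean(dim=(1, 2)))
                phi = self.phis[int(logits.argmax())]    # inference-time hard routing
                q = F.avg_pool2d(region.permute(0, 3, 1, 2), phi)  # pool to queries
                q = q.flatten(2).transpose(1, 2)         # (1, (S/phi)^2, C)
                q = q + self.attn(q, kv, kv)[0]          # MHSA on the reduced query set
                q = q + self.mlp(q)
                up = q.transpose(1, 2).reshape(1, C, S // phi, S // phi)
                up = F.interpolate(up, scale_factor=phi, mode="nearest")  # unpool
                out[:, i:i + S, j:j + S, :] = region + up.permute(0, 2, 3, 1)
        return out                                       # (1, H, W, C), resolution preserved

# usage on a random feature map (hypothetical size)
out = DGEBlock()(torch.randn(1, 16, 16, 192))            # -> (1, 16, 16, 192)
```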

2. Core Routing Algorithm and Mathematical Formulation

Given an input $x \in \mathbb{R}^{(H \cdot W) \times C}$ reshaped to $z \in \mathbb{R}^{H \times W \times C}$, the routing mechanism proceeds as follows:

  • Region Partitioning: With $S = \max(\Phi)$ for candidate granularities $\Phi = \{\phi_1, \phi_2, \dots, \phi_K\}$ ($\phi_k \le S$), $z$ is split into $R = \lceil H/S \rceil \cdot \lceil W/S \rceil$ blocks $\{z_i\}$, each of shape $S \times S \times C$.
  • Gating Logits: For each region $i$, the gating logits $h_i \in \mathbb{R}^K$ are computed from the region's average-pooled descriptor:

$$h_i = W^{\top} \left( \frac{1}{S^2} \sum_{(u,v) \in \text{region } i} z_i[u, v] \right) + b$$

where $W \in \mathbb{R}^{C \times K}$ and $b \in \mathbb{R}^K$.

  • Granularity Selection at Inference: The assigned granularity index is

$$\theta_i = \arg\max_{k \in \{1, \dots, K\}} h_i[k]$$

Region $i$ is then subdivided into patches of size $\phi_{\theta_i} \times \phi_{\theta_i}$, yielding $N_i = (S/\phi_{\theta_i})^2$ tokens per region. The total number of queries is $Q = \sum_i N_i$.

  • Differentiable Training via Gumbel-Softmax: To enable backpropagation through the discrete routing decisions, training replaces the hard assignment $\theta_i$ with Gumbel-Softmax sampling:

$$g_k \sim \text{Gumbel}(0,1), \quad \alpha_i[k] = (h_i[k] + g_k) / \tau, \quad p_i = \operatorname{softmax}(\alpha_i)$$

Forward routing uses the hard index $\theta_i = \arg\max_k \alpha_i[k]$, and gradients are backpropagated through $p_i$ via the straight-through estimator (Song et al., 2023).
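
The sketch below combines the inference path (hard argmax) and the training path (`F.gumbel_softmax` with `hard=True`) in one routing function. Names and shapes are assumptions for illustration; in a full implementation the one-hot sample $p_i$ would also weight the pooled tokens so that gradients actually reach the gating layer, which this sketch does not show.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def route_regions(z, gate, S=4, phis=(1, 2, 4), training=False, tau=1.0):
    """Routing sketch. z: (H, W, C) feature map; gate: nn.Linear(C, K).
    Returns the granularity index per region and the total query count Q."""
    H, W, C = z.shape
    # average-pool every S x S region into a single C-dim descriptor
    desc = z.reshape(H // S, S, W // S, S, C).mean(dim=(1, 3))   # (H/S, W/S, C)
    h = gate(desc)                                               # logits h_i, (H/S, W/S, K)
    if training:
        # hard one-hot forward pass, straight-through soft gradients backward;
        # a real implementation also multiplies tokens by p for gradient flow
        p = F.gumbel_softmax(h, tau=tau, hard=True)              # (H/S, W/S, K)
        theta = p.argmax(dim=-1)
    else:
        theta = h.argmax(dim=-1)                                 # noise-free at inference
    phi = torch.tensor(phis)[theta]                              # patch size per region
    Q = int(((S // phi) ** 2).sum())                             # sum of N_i = (S/phi)^2
    return theta, Q

# usage on random features (hypothetical sizes)
z, gate = torch.randn(16, 16, 64), nn.Linear(64, 3)
theta, Q = route_regions(z, gate, training=True)
print(theta.shape, Q)   # (4, 4) regions; 16 <= Q <= 256
```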

3. Computational Complexity and FLOP Analysis

Standard MHSA with $N$ tokens incurs $O(N^2 C)$ complexity. With DGE, the number of queries is $Q \approx N / \mathbb{E}[\phi^2]$ (where $\mathbb{E}[\phi^2]$ is the average squared granularity), yielding an attention cost of $O(NQC)$. The computational savings ratio is approximately $1/\mathbb{E}[\phi^2]$. With $\Phi = \{1, 2, 4\}$ and a learned $\mathbb{E}[\phi^2] \approx 2.5$, the network achieves an empirical FLOP reduction of roughly $1 - 1/2.5 \approx 60\%$. On ImageNet, observed FLOP reductions are 40–60% for DeiT-S and PVT backbones, with $<0.2\%$ top-1 accuracy difference at a budget ratio $\gamma = 0.5$ (Song et al., 2023).
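
As a quick sanity check of this arithmetic, a few lines of Python (with illustrative token-count and width values) reproduce the 60% figure:

```python
# Worked example of the savings estimate above (pure arithmetic).
N, C = 196, 384                      # e.g. 14x14 tokens, DeiT-S width (illustrative)
E_phi2 = 2.5                         # learned average squared granularity
Q = N / E_phi2                       # approximate reduced query count
dense_attn = N * N * C               # O(N^2 C) dense attention cost
dge_attn = N * Q * C                 # O(N Q C) DGE attention cost
print(1 - dge_attn / dense_attn)     # 0.6 -> ~60% attention-FLOP reduction
```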

| Model  | Vanilla FLOPs | DGE FLOPs | Top-1 Accuracy (Vanilla → DGE) |
|--------|---------------|-----------|--------------------------------|
| DeiT-S | 14.3G         | 6.1G      | 79.8% → 79.6%                  |
| PVT-S  | 6.2G          | 3.4G      | 80.2% → 80.1%                  |
| PVT-M  | 14.1G         | 8.1G      | 81.5% → 81.4%                  |

4. Training Procedure and Loss Design

DGE training involves two loss components:

  • Task Loss: Standard for the target application, e.g., cross-entropy for classification, Mask-RCNN or FPN losses for object detection/segmentation.
  • Budget Loss: Controls sparseness by penalizing deviation from a target FLOP ratio $\gamma$:

$$\beta = \frac{\text{actual FLOPs used}}{\text{FLOPs of vanilla encoder}}$$

$$L_{\text{budget}} = (\beta - \gamma)^2$$

The total loss is $L = L_{\text{task}} + \lambda L_{\text{budget}}$ with $\lambda \approx 1$. Typical $\gamma$ values lie in $[0, 1]$. The gating networks contribute only $O(CK)$ parameters per block and are trained by gradient flow through the Gumbel-Softmax path.
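
A sketch of the budget penalty follows, under the simplifying assumption that the FLOP ratio $\beta$ can be approximated by the kept-query fraction $\mathbb{E}[1/\phi^2]$ (attention cost only; the exact FLOP accounting is more detailed):

```python
import torch

def budget_loss(p, gamma, phis=(1, 2, 4)):
    """Budget penalty sketch. p: (R, K) soft routing probabilities per region.
    beta approximates used/vanilla FLOPs by the kept-query ratio E[1/phi^2]."""
    inv_phi2 = torch.tensor([1.0 / (f * f) for f in phis])
    beta = (p @ inv_phi2).mean()          # fraction of dense queries kept
    return (beta - gamma) ** 2

# usage: soft probabilities keep the penalty differentiable during training
p = torch.softmax(torch.randn(16, 3), dim=-1)   # e.g. 16 regions, K = 3
loss = budget_loss(p, gamma=0.5)                # total: L_task + lambda * loss
```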

Strategies such as gradually warming $\gamma$ up from 1.0 to its target value over the initial epochs help prevent collapse to coarse tokens when $L_{\text{budget}}$ outweighs $L_{\text{task}}$ (Song et al., 2023).
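
One plausible form of such a warm-up (the exact schedule is not specified in the text) is a linear anneal:

```python
def gamma_schedule(epoch, warmup_epochs=10, gamma_target=0.5):
    """Linear warm-up sketch: start fully dense (gamma = 1.0), anneal to target."""
    if epoch >= warmup_epochs:
        return gamma_target
    return 1.0 - (1.0 - gamma_target) * epoch / warmup_epochs
```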

5. Empirical Behavior on Vision Tasks

ImageNet Classification

  • DeiT-S: 14.3G→6.1G FLOPs (∼57% reduction), accuracy from 79.8% to 79.6%.
  • PVT-S: 6.2G→3.4G (∼45% reduction), accuracy from 80.2% to 80.1%.
  • PVT-M: 14.1G→8.1G (∼42% reduction), accuracy from 81.5% to 81.4%.
  • The accuracy-FLOP trade-off curve (parametrized by $\gamma$) is smooth, with the knee at $\gamma \approx 0.5$.

COCO Detection and Segmentation

  • Mask-RCNN w/ PVT-S: 251G→185G, box AP from 40.4 to 40.1, mask AP from 37.8 to 37.5.
  • Mask-RCNN w/ DPVT-S: 186G→147G, box AP 44.0 to 43.8, mask AP 40.3 to 40.0.
  • Inference speed: the backbone runs ∼25% faster on a V100 GPU.

ADE20K Semantic Segmentation

  • Semantic-FPN w/ PVT-S: 226G→155G, mIoU 41.8 to 41.7.
  • Semantic-FPN w/ DPVT-S: 157G→121G, mIoU 44.4 unchanged.
  • PVT-M+DGE vs PVT-S: Improves by 2.1% mIoU at lower FLOPs (Song et al., 2023).

6. Implementation Considerations and Best Practices

  • PyTorch Implementation: The reference implementation (vtpack) uses a linear layer (nn.Linear(C, K)) applied to the average-pooled features of each region as the router. Region splitting uses nn.Unfold or custom indexing for $S \times S$ blocks.
  • Gumbel-Softmax: Employ torch.distributions.Gumbel or F.gumbel_softmax (with hard=True). During inference, disable noise and use argmax-only assignment.
  • Dynamic Query Length: DGE yields a variable $Q$ across images and batches. To handle this in MHSA (which expects a fixed sequence length), pad the query sets per image or process with a for-loop for small batch sizes; see the padding sketch after this list.
  • Budget Tracking: Track the chosen $\phi_i^2$ per region in the forward pass to compute $\beta$; apply $L_{\text{budget}}$ after averaging over the batch.
  • Warm-up: Gradually annealing $\gamma$ avoids collapse to coarse tokens if $L_{\text{budget}}$ dominates early training.
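
For the dynamic-query-length point above, one padding-based sketch (illustrative, not the vtpack approach): right-pad each image's query set to the batch maximum, run attention once, and zero out the padded outputs afterwards.

```python
import torch
import torch.nn as nn

# per-image query sets with different lengths Q (hypothetical sizes)
queries = [torch.randn(q, 64) for q in (37, 52, 41)]
Q_max = max(q.shape[0] for q in queries)

padded = torch.stack([nn.functional.pad(q, (0, 0, 0, Q_max - q.shape[0]))
                      for q in queries])                  # (B, Q_max, C)
pad_mask = torch.stack([torch.arange(Q_max) >= q.shape[0]
                        for q in queries])                # True where padded

attn = nn.MultiheadAttention(64, 4, batch_first=True)
kv = torch.randn(3, 196, 64)                              # full key/value token sets
out, _ = attn(padded, kv, kv)                             # padded rows attend too, but
out = out.masked_fill(pad_mask.unsqueeze(-1), 0.0)        # their outputs are discarded
```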

7. Context, Generalizability, and Limitations

DGE introduces data-adaptive query sparsification, enabling computational focus on informative image regions within standard Vision Transformer pipelines. Its design and empirical performance demonstrate broad applicability across image classification, object detection, and semantic segmentation tasks, yielding 40–60% computational savings at negligible accuracy cost. The mechanism is compatible with diverse vision transformer backbones and can be integrated with minimal architectural changes. A plausible implication is that DGE-type region-adaptive routing could be extended or specialized for tasks where spatial saliency is both highly variable and critical for computational efficiency (Song et al., 2023).
