DGAP: Deformable Global Attention Plugin
- The paper introduces DGAP, which integrates learnable spatial offsets into transformer global attention to improve segmentation accuracy and training speed.
- DGAP employs an Offset Net to predict deformed sampling coordinates, enabling shape-aware Q/K/V projections and reducing redundant computations.
- Empirical results on 3DTeethSAM show notable gains in boundary IoU, mIoU, and overall test accuracy, along with a >20% reduction in training time.
Deformable Global Attention Plugins (DGAP) introduce adaptive, morphology-driven sampling into transformer global attention, enabling models to focus computational resources on task-relevant image regions. DGAP modifies standard global multi-head self-attention by learning spatial offsets for the query, key, and value projections, resulting in shape-aware, content-adaptive attention fields. The primary motivation for DGAP is to reduce the redundancy and inefficiency of uniform global attention, particularly in visual domains with structured, repetitive geometries such as dental imagery, while maintaining the expressiveness and flexibility of global attention. Empirical results in the context of 3DTeethSAM demonstrate that DGAP delivers significant improvements in segmentation accuracy and training speed on complex 3D tooth datasets, validating its architectural and algorithmic contributions (Lu et al., 12 Dec 2025).
1. Module Design and Architectural Integration
DGAP is implemented as a lightweight module within the vision transformer backbone (specifically, SAM2’s Hiera-L encoder) and is integrated into every global attention block of Stage 3. The Hiera design, like other hierarchical ViT architectures, uses progressive spatial downsampling over four stages; Stage 3 is selected for DGAP integration due to its balance between spatial fidelity and high-level feature abstraction. The remaining attention stages and feed-forward (MLP) blocks retain their original implementations.
Within the target global attention block, the standard attention mechanism is augmented as follows:
- A dedicated Offset Net, implemented as a small MLP, predicts, for each spatial location, a set of K two-dimensional offsets relative to a regular local grid.
- These offsets are added to the fixed grid to obtain deformed sampling coordinates.
- Features at the deformed coordinates are gathered using bilinear interpolation, yielding a deformed local feature set for each token.
- The query, key, and value projections are generated from the deformed features, enabling the attention to be computed over regions aligned with salient morphological details (e.g., tooth boundaries and surfaces).
- The output of the deformable-attention operation is fused with the original feature map via a residual (skip) connection, preserving the baseline information path (Lu et al., 12 Dec 2025).
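The gather step above reads features at fractional, deformed coordinates. A minimal pure-Python sketch of that bilinear interpolation (illustrative only; the function name, list-based feature map, and clamping policy are assumptions, not the paper's implementation):

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a feature map `feat` (H x W x C nested lists) at fractional (y, x)."""
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    # Clamp coordinates so all four neighbours exist (boundary handling is an assumption).
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    out = []
    for c in range(C):
        # Interpolate along x on the top and bottom rows, then along y.
        top = feat[y0][x0][c] * (1 - dx) + feat[y0][x1][c] * dx
        bot = feat[y1][x0][c] * (1 - dx) + feat[y1][x1][c] * dx
        out.append(top * (1 - dy) + bot * dy)
    return out
```

In practice this gather is a single batched call (e.g. a grid-sampling op) over all K deformed points per token; the scalar version above only shows the arithmetic.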
2. Mathematical Formalism
Let $X \in \mathbb{R}^{N \times C}$ be the input tokens, where $N = H \times W$ indexes spatial locations and $C$ is the channel dimension.
- Offset prediction: For each token $x_i$, an Offset Net $f_{\theta}$ produces $K$ two-dimensional offsets: $\Delta p_i = f_{\theta}(x_i) \in \mathbb{R}^{K \times 2}$.
- Deformed sampling: For each reference grid point $p_{i,k}$ around token $i$, the deformed coordinate is $\tilde{p}_{i,k} = p_{i,k} + \Delta p_{i,k}$.
Features are then sampled at $\tilde{p}_{i,k}$ via bilinear interpolation from $X$: $\tilde{x}_{i,k} = \phi(X; \tilde{p}_{i,k})$.
- Q/K/V projection: The gathered set $\{\tilde{x}_{i,k}\}_{k=1}^{K}$ is reshaped and projected to form queries $Q$, keys $K$, and values $V$ for each attention head.
- Attention computation: Standard multi-head scaled dot-product attention is performed: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^{\top}/\sqrt{d}\big)\,V$, where $d$ is the per-head dimension.
The concatenated heads are projected by $W_O$ and added to the original $X$ through the residual path: $X' = X + W_O\,[\mathrm{head}_1; \dots; \mathrm{head}_h]$.
DGAP does not alter positional encoding or normalization conventions of the baseline transformer block (Lu et al., 12 Dec 2025).
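The attention formula above can be worked through on a toy single-query case. A pure-Python sketch (hypothetical helper; real implementations batch this over all tokens and heads):

```python
import math

def attention(q, keys, vals):
    """Single-query scaled dot-product attention; d is the query dimension."""
    d = len(q)
    # Affinity of the query with each (deformed-sample) key, scaled by sqrt(d).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Numerically stable softmax over the K sampled positions.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [wi / z for wi in w]
    # Weighted combination of the sampled values.
    return [sum(wi * v[j] for wi, v in zip(w, vals)) for j in range(len(vals[0]))]
```

With an uninformative query (all zeros) the weights are uniform and the output is the mean of the values; a query aligned with one key shifts weight toward that key's value, which is the mechanism DGAP steers by choosing *where* the keys and values are sampled.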
3. Hyperparameters, Computational Complexity, and Implementation
DGAP’s additional parameters are limited to the small feed-forward Offset Net and the modified Q/K/V projections. The default number of attention heads (e.g., 16 for Hiera-L) is retained from the baseline unless otherwise specified. The precise value of K (the number of sampling points per token) is not detailed; the module is described as learning “a small set of offsets per token.”
Standard global attention has a computational complexity of $O(N^2 C)$. DGAP reduces this to $O(NKC)$, due to replacing all-pair affinity computations with a fixed number of local, offset-driven samples. With $K \ll N$, both run-time and memory are substantially reduced. The Offset Net adds negligible overhead. By limiting DGAP insertion to Stage 3, parameter and FLOP increases are minimized while focusing adaptivity on the most discriminative layer (Lu et al., 12 Dec 2025).
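A back-of-envelope comparison makes the asymptotic saving concrete (the token count, channel width, and K below are illustrative values, not the paper's configuration):

```python
def global_attn_flops(N, C):
    # All-pairs affinities: every token attends to every token.
    return N * N * C

def dgap_attn_flops(N, C, K):
    # Each token attends to K offset-driven samples only.
    return N * K * C

N, C, K = 64 * 64, 144, 9   # e.g. a 64x64 token grid; C and K chosen for illustration
ratio = global_attn_flops(N, C) / dgap_attn_flops(N, C, K)
# With K << N, the reduction factor is roughly N / K (here ~455x on this term).
```

The channel dimension cancels in the ratio, so the saving on the affinity term depends only on N/K; surrounding projections and MLPs are unaffected and dilute the end-to-end speedup.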
4. Empirical Performance and Ablation
In 3DTeethSAM, controlled ablations isolate statistically significant improvements attributable to DGAP:
- Overall test set accuracy improves from 94.87% to 95.48% (+0.61 percentage points).
- Tooth-wise mIoU increases from 90.61% to 91.90% (+1.29 pp).
- Boundary IoU rises from 66.64% to 70.05% (+3.41 pp).
- Dice score improves from 93.45% to 94.33% (+0.88 pp).
Training convergence is also accelerated: the composite segmentation loss decreases more rapidly and achieves a lower plateau within 40 epochs when DGAP is enabled. These figures are directly attributable to DGAP’s capacity to provide more morphology-aware features, especially at complex boundaries between tooth and gingiva or adjacent teeth. The plugin operates without updating the majority of SAM2’s pretrained parameters, demonstrating strong adaptation with minimal finetuning (Lu et al., 12 Dec 2025).
5. Mechanistic Distinction from Standard and Deformable Attention
Standard global attention computes dense all-pairs interactions, treating every token equivalently and incurring $O(N^2)$ complexity. This uniformity results in significant redundancy for imagery dominated by structured but spatially repetitive patterns (e.g., dental surfaces), as computation is wasted on background or non-informative regions.
DGAP, by predicting per-token spatial offsets, makes the attention receptive field shape- and content-dependent. This contrasts with fixed grid or windowed attention used in PVT, Swin, or vanilla ViT, as well as with the deformable multi-head attention in DAT++ and related works. In particular, DGAP differs by:
- Deforming all Q/K/V projections via learned, per-token offsets relative to a grid, ensuring all heads share the same learned locality structure.
- Fusing the deformed attention output additively with the original feature for robustness.
- Applying the plugin only to the critical intermediate attention stage (Stage 3), rather than all blocks (Lu et al., 12 Dec 2025).
Compared to approaches like DAT++ (which also employ offset prediction and bilinear sampling but across multiple transformer stages and at varying resolutions), DGAP’s parameters are learned specifically for the anatomy and scale at hand, and its plugin design focuses on parameter and FLOP efficiency while exploiting content-driven attention localization (Xia et al., 2023).
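The distinguishing step, per-token offsets added to a regular reference grid and shared by Q, K, and V across all heads, can be sketched as follows (hypothetical helper; coordinate conventions and clamping are assumptions):

```python
def deformed_grid(center, offsets, H, W):
    """Shift a token's reference point by its K learned offsets.

    `center` is the token's (y, x) position; `offsets` is the list of K
    (dy, dx) pairs predicted by the Offset Net. The same K coordinates
    feed the Q, K and V projections for every attention head, so all
    heads share one learned locality structure.
    """
    y0, x0 = center
    coords = []
    for dy, dx in offsets:
        # Clamp so subsequent bilinear sampling stays inside the feature map.
        y = min(max(y0 + dy, 0.0), H - 1.0)
        x = min(max(x0 + dx, 0.0), W - 1.0)
        coords.append((y, x))
    return coords
```

In schemes with per-head or per-stage offsets (as in DAT++-style designs), this function would instead be evaluated with a different offset set per head or per resolution; DGAP's single shared set is part of its parameter economy.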
6. Practical Application: 3DTeethSAM and State-of-the-Art Segmentation
DGAP was evaluated within 3DTeethSAM, a system for high-resolution 3D dental mesh segmentation. The module is applied to rendered 2D views via the SAM2 foundation model. After prompt and mask refinement, predictions are projected back into 3D.
On the Teeth3DS benchmark, systems with DGAP achieve a tooth-wise mIoU of 91.90% and establish new state-of-the-art results. The largest absolute improvement is in boundary IoU (a 3.41 pp gain), demonstrating enhanced discrimination at anatomical edges—a critical property for dental applications (Lu et al., 12 Dec 2025).
Wall-clock training time also exhibits a >20% reduction when DGAP is enabled, illustrating simultaneous efficacy and efficiency advantages.
7. Contextual Significance and Outlook
DGAP represents a trend toward adaptive attention mechanisms that combine the representational power of transformers with spatial inductive biases tailored to the task domain. By focusing Q/K/V projections via content-aware spatial deformation, DGAP reduces wasteful computation and improves morphological alignment, especially in medical and other structured vision tasks. Its plug-and-play design, low parameter overhead, and efficacy when inserted into only a single attention stage offer a highly efficient and effective approach to transformer adaptation for specialized, high-resolution segmentation (Lu et al., 12 Dec 2025).
A plausible implication is that the DGAP framework could generalize to other domains characterized by local shape and boundary information, and its selective, per-stage application provides a blueprint for cost-effective transformer customization.