Tiled Squeeze-and-Excite (TSE) for CNNs
- Tiled Squeeze-and-Excite (TSE) is a variant of SE that computes channel descriptors using local, non-overlapping tile pooling rather than global average pooling.
- It preserves the original SE excitation network, serving as a drop-in replacement that significantly reduces activation buffer requirements and pipeline stalls.
- Empirical results show TSE achieving nearly identical accuracy to global SE while reducing buffer usage by up to 90%, making it a practical solution for hardware accelerators.
Tiled Squeeze-and-Excite (TSE) is a variant of the Squeeze-and-Excite (SE) channel attention mechanism for convolutional neural networks (CNNs) in which channel descriptors are computed not by global spatial pooling, but by aggregating over local, non-overlapping tiles. TSE preserves the original SE excitation network and can be used as a drop-in replacement for SE blocks, providing substantial hardware efficiency gains with negligible performance loss by exploiting the sufficiency of local spatial statistics for channel attention (Vosco et al., 2021).
1. Channel Attention and the Squeeze-and-Excite Mechanism
The canonical SE block operates on input activations $X \in \mathbb{R}^{C \times H \times W}$ by first “squeezing” spatial information through global average pooling (GAP), producing channel descriptors $z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$ for $c = 1, \dots, C$. The resulting vector $z \in \mathbb{R}^{C}$ feeds into a compact two-layer excitation network, specifically an MLP with bottleneck dimension $C/r$, which outputs scaling coefficients $s \in (0,1)^{C}$. Each channel of $X$ is then scaled by the corresponding $s_c$.
While compute- and parameter-efficient, SE’s reliance on GAP necessitates buffering the full feature map before attention modulation, which is nontrivial on streaming hardware accelerators and may cause substantial pipeline stalls (Vosco et al., 2021).
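As a concrete reference point, the SE computation above can be sketched in a few lines of NumPy. The weights here are randomly initialized purely to illustrate shapes; in a real network $W_1$ and $W_2$ are learned:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excite on a single input x of shape [C, H, W]."""
    # Squeeze: global average pooling over the spatial dimensions -> [C]
    z = x.mean(axis=(1, 2))
    # Excitation: two-layer bottleneck MLP (ReLU, then sigmoid gate)
    a = np.maximum(w1 @ z + b1, 0.0)          # [C/r]
    s = 1.0 / (1.0 + np.exp(-(w2 @ a + b2)))  # [C], each value in (0, 1)
    # Scale: broadcast one coefficient per channel over the H x W plane
    return x * s[:, None, None]

# Toy usage with C = 8 channels and reduction ratio r = 4
rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4
x = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
assert y.shape == x.shape
```

Note that every pixel in a channel receives the same coefficient, which is exactly why the whole feature map must be buffered before modulation can begin.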
2. Tiled Squeeze-and-Excite: Architecture and Mechanism
TSE replaces global average pooling with pooling over local, non-overlapping spatial tiles, while sharing the original SE excitation MLP across all tiles. For input $X \in \mathbb{R}^{C \times H \times W}$, the spatial domain is partitioned into $T = \lceil H/h \rceil \cdot \lceil W/w \rceil$ tiles of size $h \times w$. For each channel $c$ and tile $t$, TSE computes a local descriptor $z_{c,t}$ (the average of channel $c$ over tile $t$), assembling a tensor $Z \in \mathbb{R}^{C \times T}$. Each tile descriptor vector $z_{\cdot,t} \in \mathbb{R}^{C}$ then passes through the same excitation MLP:

$$s_{\cdot,t} = \sigma\!\left(W_2\,\mathrm{ReLU}(W_1 z_{\cdot,t})\right)$$

This yields $S \in (0,1)^{C \times T}$, which is broadcast and upsampled by nearest-neighbor interpolation to $S' \in \mathbb{R}^{C \times H \times W}$, permitting per-pixel modulation: $Y = X \odot S'$.
TSE Pseudocode
```python
def TSE(x, h, w, r):
    # x: [N, C, H, W];  r: reduction ratio
    # sq = AvgPool2d(kernel=(h, w), stride=(h, w), ceil_mode=True)
    z = sq(x)                                # [N, C, ⌈H/h⌉, ⌈W/w⌉]
    z = z.view(N, C, -1)                     # [N, C, T]
    a = conv1(z)                             # 1×1 conv, C → C/r
    b = relu(a)
    c = conv2(b)                             # 1×1 conv, C/r → C
    s = sigmoid(c)                           # [N, C, T]
    S = s.view(N, C, ceil(H / h), ceil(W / w))
    S_full = upsample(S, (H, W), mode='nearest')
    return x * S_full
```
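The pseudocode above can be made fully executable in NumPy by replacing the pooling and 1×1 convolutions with explicit tile averaging and matrix products. `tse_block` and its weight arguments are illustrative names, not from the paper:

```python
import numpy as np

def tse_block(x, h, w, w1, b1, w2, b2):
    """Tiled Squeeze-and-Excite on x of shape [C, H, W] with tile size h x w."""
    C, H, W = x.shape
    Th, Tw = -(-H // h), -(-W // w)  # ceil division, mirroring ceil_mode pooling
    # Squeeze: average-pool each non-overlapping tile (edge tiles may be smaller)
    z = np.empty((C, Th, Tw))
    for i in range(Th):
        for j in range(Tw):
            z[:, i, j] = x[:, i*h:(i+1)*h, j*w:(j+1)*w].mean(axis=(1, 2))
    # Excitation: the same bottleneck MLP applied to every tile descriptor
    zt = z.reshape(C, -1)                                           # [C, T]
    a = np.maximum(np.einsum('dc,ct->dt', w1, zt) + b1[:, None], 0.0)
    s = 1.0 / (1.0 + np.exp(-(np.einsum('cd,dt->ct', w2, a) + b2[:, None])))
    S = s.reshape(C, Th, Tw)
    # Nearest-neighbor upsample: every pixel in a tile gets that tile's coefficient
    S_full = np.repeat(np.repeat(S, h, axis=1), w, axis=2)[:, :H, :W]
    return x * S_full

# Usage: 20x20 map with 7x7 tiles gives a 3x3 tile grid (edge tiles are 6 wide)
rng = np.random.default_rng(0)
C, H, W, r = 8, 20, 20, 4
x = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = tse_block(x, 7, 7, w1, b1, w2, b2)
```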
3. Empirical Evaluation and Buffer Efficiency
Empirical analysis on ImageNet-1K with RegNetY-800MF demonstrated that TSE with row-strip tiles of height $h=7$ (tile size $7 \times W$) achieves nearly identical accuracy to global SE (76.29% vs. 76.30% top-1), while reducing the required activation buffer from 1.07M ($HWC$) to $7WC = 0.42$M activations. Similar results were observed with column-strip or square tiles. On EfficientDet-D2 (COCO), TSE ($h=7$) matched SE's mAP of 42.3 while decreasing buffer usage from 50.8M to 4.77M activations (Vosco et al., 2021).
| Model | SE Top-1/mAP | TSE ($h=7$) Top-1/mAP | SE Buffer | TSE Buffer |
|---|---|---|---|---|
| RegNetY-800MF (ImageNet-1K) | 76.30% | 76.29% | 1.07M activations | 0.42M activations |
| EfficientDet-D2 (COCO) | mAP 42.3 | mAP 42.3 | 50.8M activations | 4.77M activations |
A strip height or width of approximately 7 was typically sufficient to match global context, confirming that local context of this extent is adequate for channel attention (Vosco et al., 2021).
4. Deployment, Buffer Formulas, and Drop-in Replacement
TSE affords a direct, parameter-preserving replacement for SE blocks. Because each tile descriptor $z_{c,t}$ closely matches the global descriptor $z_c$ in practice, pretrained SE weights can be reused in TSE, with tile pooling substituting for GAP and no retraining required. The accuracy drop from direct weight reuse was small across a variety of architectures, and brief fine-tuning (on 10–20% of the data) recovers accuracy to within 0.2–0.4% of the original SE models.
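One way to sanity-check the drop-in property: when a single tile covers the full spatial extent ($h = H$, $w = W$), TSE collapses exactly to SE, so the same excitation weights must produce identical outputs. A minimal NumPy sketch (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def excite(z, w1, b1, w2, b2):
    # Shared SE excitation MLP: ReLU bottleneck, then sigmoid gate. z: [C, T]
    a = np.maximum(w1 @ z + b1[:, None], 0.0)
    return 1.0 / (1.0 + np.exp(-(w2 @ a + b2[:, None])))

def se(x, params):
    z = x.mean(axis=(1, 2)).reshape(-1, 1)   # global pooling -> [C, 1]
    return x * excite(z, *params).reshape(-1, 1, 1)

def tse(x, h, w, params):
    C, H, W = x.shape
    Th, Tw = -(-H // h), -(-W // w)          # ceil division (ceil_mode pooling)
    z = np.stack([x[:, i*h:(i+1)*h, j*w:(j+1)*w].mean(axis=(1, 2))
                  for i in range(Th) for j in range(Tw)], axis=1)   # [C, T]
    S = excite(z, *params).reshape(C, Th, Tw)
    S_full = np.repeat(np.repeat(S, h, axis=1), w, axis=2)[:, :H, :W]
    return x * S_full

rng = np.random.default_rng(1)
C, H, W, r = 8, 14, 14, 4
params = (rng.standard_normal((C // r, C)), np.zeros(C // r),
          rng.standard_normal((C, C // r)), np.zeros(C))
x = rng.standard_normal((C, H, W))
# With one tile covering the whole map, TSE reduces exactly to SE
assert np.allclose(tse(x, H, W, params), se(x, params))
```

With smaller tiles the outputs differ per tile, which is precisely where the (empirically small) accuracy gap from weight reuse comes from.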
Buffering requirements differ starkly:
- Global SE: $B_{\mathrm{SE}} = H \cdot W \cdot C$ activations (the entire feature map must be buffered before modulation).
- TSE (row strips of height $h$): $B_{\mathrm{TSE}} = h \cdot W \cdot C$ activations.
For EfficientDet-D2 (input resolution $768 \times 768$), this equates to 50.8M activations (SE) versus 4.77M activations (TSE with $h=7$) summed over all SE-bearing layers (Vosco et al., 2021).
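The buffer formulas are easy to verify on a single illustrative feature map (dimensions chosen for illustration, not taken from the paper):

```python
# Activation-buffer comparison for one feature map, per the formulas above.
# Illustrative dimensions: H = W = 56, C = 512, row-strip height h = 7.
H, W, C, h = 56, 56, 512, 7
se_buffer = H * W * C    # global SE holds the whole map: 1,605,632 activations
tse_buffer = h * W * C   # TSE holds only h rows at a time: 200,704 activations
print(se_buffer, tse_buffer, se_buffer // tse_buffer)  # reduction factor = H / h = 8
```

The reduction factor is simply $H/h$ per layer, which is why savings grow with feature-map resolution and why early, high-resolution stages benefit most.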
5. Practical Considerations and Hardware Implications
Tile shape selection in TSE is governed by several trade-offs:
- Use row strips ($h \times W$) if the hardware buffer scales with the width $W$.
- Square patches ($h \times h$) may increase parallelism if buffer scaling is independent of $W$.
- Strip or patch size of approximately 7 generally suffices for accuracy parity with SE.
TSE decreases activation buffer requirements by up to 90%, reduces pipeline stalls, and is highly compatible with streaming convolutional dataflows, requiring only minor bookkeeping at tile boundaries. TSE maintains the same parameter count as SE and introduces only a marginal FLOP increase due to repeated MLP evaluations, once per tile rather than once per feature map (Vosco et al., 2021).
6. Limitations and Observed Behavior
TSE’s limitations include a minor increase in FLOPs proportional to the tile count and the need to select tile dimensions appropriately. Tile sizes that are too small can degrade accuracy. However, local mean descriptors generally prove similar to global means, justifying the efficacy of TSE at moderate tile sizes.
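The claim that local means approximate global means can be illustrated on synthetic data: when a feature map is a constant level plus i.i.d. noise, averaging over an $h \times w$ tile shrinks the noise by a factor of $\sqrt{hw}$, so tile descriptors cluster tightly around the global mean. A toy NumPy check (not a result from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 56
h = w = 7
# Synthetic "feature map": constant base level plus unit-variance noise
x = 3.0 + rng.standard_normal((H, W))
# Non-overlapping 7x7 tile means, arranged on an 8x8 tile grid
tiles = x.reshape(H // h, h, W // w, w).mean(axis=(1, 3))
global_mean = x.mean()
# Tile means deviate from the global mean far less than individual pixels do
assert np.abs(tiles - global_mean).max() < np.abs(x - global_mean).max()
print(np.std(tiles), np.std(x))  # tile-mean spread is much smaller than pixel spread
```

This is an idealized model; real feature maps have spatial structure, but the empirical results above suggest the same concentration effect holds well enough at moderate tile sizes.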
In conclusion, TSE provides a hardware-friendly, empirically validated alternative to global SE, confirming that moderately local spatial statistics suffice for channel attention and enabling substantial reductions in accelerator buffer costs with negligible impact on accuracy (Vosco et al., 2021).