
Tiled Squeeze-and-Excite (TSE) for CNNs

Updated 9 February 2026
  • Tiled Squeeze-and-Excite (TSE) is a variant of SE that computes channel descriptors using local, non-overlapping tile pooling rather than global average pooling.
  • It preserves the original SE excitation network, serving as a drop-in replacement that significantly reduces activation buffer requirements and pipeline stalls.
  • Empirical results show TSE achieving nearly identical accuracy to global SE while reducing buffer usage by up to 90%, making it a practical solution for hardware accelerators.

Tiled Squeeze-and-Excite (TSE) is a variant of the Squeeze-and-Excite (SE) channel attention mechanism for convolutional neural networks (CNNs) in which channel descriptors are computed not by global spatial pooling, but by aggregating over local, non-overlapping tiles. TSE preserves the original SE excitation network and can be used as a drop-in replacement for SE blocks, providing substantial hardware efficiency gains with negligible performance loss by exploiting the sufficiency of local spatial statistics for channel attention (Vosco et al., 2021).

1. Channel Attention and the Squeeze-and-Excite Mechanism

The canonical SE block operates on input activations $X \in \mathbb{R}^{C \times H \times W}$ by first "squeezing" spatial information through global average pooling (GAP), producing channel descriptors $g_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} X_{c,h,w}$ for $c = 1, \dots, C$. The resulting vector $\mathbf{g} \in \mathbb{R}^{C}$ feeds into a compact two-layer excitation network, specifically an MLP with bottleneck dimension $C/r$, which outputs scaling coefficients $\mathbf{s} = \sigma(W_2\,\mathrm{ReLU}(W_1 \mathbf{g}))$. Each channel of $X$ is then scaled by the corresponding $s_c$.
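The squeeze-excite-scale steps above can be sketched as a minimal PyTorch module (a hedged illustration; the class name and the default reduction ratio `r=16` are conventional choices, not taken from the paper):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Canonical Squeeze-and-Excite: GAP -> bottleneck MLP -> sigmoid -> scale."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)  # W1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)  # W2: C/r -> C

    def forward(self, x):                              # x: [N, C, H, W]
        g = x.mean(dim=(2, 3))                         # squeeze: GAP -> [N, C]
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(g))))  # excite: [N, C]
        return x * s[:, :, None, None]                 # per-channel scaling
```

Note that `forward` must see the entire $H \times W$ map before any output pixel can be scaled, which is exactly the buffering constraint discussed next.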

While compute- and parameter-efficient, SE’s reliance on GAP necessitates buffering the full H×WH \times W feature map before attention modulation, which is nontrivial on streaming hardware accelerators and may cause substantial pipeline stalls (Vosco et al., 2021).

2. Tiled Squeeze-and-Excite: Architecture and Mechanism

TSE replaces global average pooling with pooling over local, non-overlapping spatial tiles, while sharing the original SE excitation MLP across all tiles. For input $X$, the spatial domain is partitioned into $T = \lceil H/h \rceil \cdot \lceil W/w \rceil$ tiles of size $(h, w)$. For each channel $c$ and tile $t$, TSE computes a local descriptor $z_{c,t} = \frac{1}{|\Omega_t|} \sum_{(h',w') \in \Omega_t} X_{c,h',w'}$, assembling a tensor $Z \in \mathbb{R}^{C \times T}$. Each tile descriptor vector $z^t$ then passes through the same excitation MLP:

  • $a^t = W_1 z^t \in \mathbb{R}^{C/r}$
  • $b^t = \mathrm{ReLU}(a^t)$
  • $c^t = W_2 b^t \in \mathbb{R}^{C}$
  • $s^t = \sigma(c^t) \in \mathbb{R}^{C}$

This yields $S \in \mathbb{R}^{C \times T}$, which is broadcast and upsampled by nearest-neighbor interpolation to form $\widetilde{S} \in \mathbb{R}^{C \times H \times W}$, permitting per-pixel modulation: $Y_{c,h,w} = \widetilde{S}_{c,h,w} X_{c,h,w}$.

TSE Pseudocode

import torch
import torch.nn.functional as F

def tse(x, conv1, conv2, h, w):
    # x: [N, C, H, W]; conv1, conv2: shared 1×1 convs (C→C/r, C/r→C)
    N, C, H, W = x.shape
    z = F.avg_pool2d(x, kernel_size=(h, w), stride=(h, w),
                     ceil_mode=True)              # [N, C, ⌈H/h⌉, ⌈W/w⌉]
    a = conv1(z)                                  # 1×1, C→C/r
    b = F.relu(a)
    c = conv2(b)                                  # 1×1, C/r→C
    s = torch.sigmoid(c)                          # [N, C, ⌈H/h⌉, ⌈W/w⌉]
    s_full = F.interpolate(s, size=(H, W), mode='nearest')
    return x * s_full                             # per-tile channel scaling

3. Empirical Evaluation and Buffer Efficiency

Empirical analysis on ImageNet-1K with RegNetY-800MF demonstrated that TSE with row-strip tiles of height $k = 7$ (tile size $7 \times W$) achieves nearly identical accuracy to global SE (76.29% vs. 76.30% top-1), while reducing the required activation buffer from $HWC = 1.07$M to $7WC = 0.42$M activations. Similar results were observed with column-strip or $7 \times 7$ square tiles. On EfficientDet-D2 (COCO), TSE ($7 \times W$) matched SE's mAP of 42.3 while decreasing buffer usage from 50.8M to 4.77M activations (Vosco et al., 2021).

Model                  | SE Top-1/mAP | TSE (7×W) Top-1/mAP | SE Buffer         | TSE Buffer
RegNetY-800MF          | 76.30%       | 76.29%              | 1.07M activations | 0.42M
EfficientDet-D2 (COCO) | mAP 42.3     | mAP 42.3            | 50.8M             | 4.77M

A strip height or width of approximately 7 was typically sufficient to match global context, confirming that local context of this extent is adequate for channel attention (Vosco et al., 2021).

4. Deployment, Buffer Formulas, and Drop-in Replacement

TSE affords a direct, parameter-preserving replacement for SE blocks. If $z_{c,t}$ for each tile closely matches $g_c$, pretrained SE weights $(W_1, W_2)$ can be reused in TSE, with tile pooling substituting for GAP and no retraining required. In practice, the accuracy drop was $\leq 0.6\%$ across a variety of architectures. Further, brief fine-tuning (on 10–20% of the data) recovers accuracy to within 0.2–0.4% of the original SE models.
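Because a 1×1 convolution applies the same linear map at every spatial position, reusing a pretrained SE MLP in TSE amounts to reshaping its weights. A hedged sketch of such a conversion (the helper name and layer layout are illustrative assumptions, not from the paper):

```python
import torch
import torch.nn as nn

def se_to_tse(fc1: nn.Linear, fc2: nn.Linear):
    """Hypothetical helper: reuse pretrained SE weights (W1, W2) as the
    1x1 convolutions of a TSE block, so the same excitation MLP runs on
    every tile descriptor instead of one global descriptor."""
    c_r, c = fc1.weight.shape                      # W1: [C/r, C]
    conv1 = nn.Conv2d(c, c_r, kernel_size=1)       # C -> C/r
    conv2 = nn.Conv2d(c_r, c, kernel_size=1)       # C/r -> C
    with torch.no_grad():
        conv1.weight.copy_(fc1.weight.view(c_r, c, 1, 1))
        conv1.bias.copy_(fc1.bias)
        conv2.weight.copy_(fc2.weight.view(c, c_r, 1, 1))
        conv2.bias.copy_(fc2.bias)
    return conv1, conv2
```

Applied to a map of tile descriptors, the converted convs reproduce the original MLP's output at each tile position exactly.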

Buffering requirements differ starkly:

  • Global SE: $\mathrm{Buf}_{\mathrm{SE}} = C \cdot H \cdot W$
  • TSE: $\mathrm{Buf}_{\mathrm{TSE}} = C \cdot h \cdot w$
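The two formulas above imply a reduction ratio of $hw / HW$ per SE site; a small numeric sketch with illustrative dimensions (a hypothetical 64-channel, 112×112 feature map, not a figure from the paper):

```python
def se_buffer(c: int, h_map: int, w_map: int) -> int:
    # Global SE must hold the full feature map: C * H * W activations
    return c * h_map * w_map

def tse_buffer(c: int, tile_h: int, tile_w: int) -> int:
    # TSE only needs one tile in flight: C * h * w activations
    return c * tile_h * tile_w

# Illustrative dimensions (hypothetical layer, not from the paper)
C, H, W = 64, 112, 112
print(se_buffer(C, H, W))       # 802816 activations
print(tse_buffer(C, 7, W))      # 50176 activations for a 7xW row strip
```

For row strips of height 7, the reduction is simply $H/7$, here a 16× saving.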

For EfficientDet-D2 (input $768 \times 768$, $C = 64$), this equates to 50.8M activations (SE) versus 4.77M (TSE with $h = 7$, $w = 768$) (Vosco et al., 2021).

5. Practical Considerations and Hardware Implications

Tile shape selection in TSE is governed by several trade-offs:

  • Use row strips ($k \times W$) if the hardware buffer scales with width $W$.
  • Square patches ($k \times k$) may increase parallelism if buffer scaling is independent of $W$.
  • A strip or patch size of approximately 7 generally suffices for accuracy parity with SE.
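The trade-off between the two tile shapes also shows up in the tile count $T = \lceil H/h \rceil \cdot \lceil W/w \rceil$, which sets how many excitation-MLP evaluations are needed. A quick sketch on an illustrative 56×56 feature map (hypothetical size, chosen only for the arithmetic):

```python
import math

def tile_count(H: int, W: int, h: int, w: int) -> int:
    # T = ceil(H/h) * ceil(W/w) non-overlapping tiles (ceil handles edge tiles)
    return math.ceil(H / h) * math.ceil(W / w)

H, W = 56, 56                    # illustrative feature-map size
print(tile_count(H, W, 7, W))    # row strips 7xW  -> 8 descriptors per channel
print(tile_count(H, W, 7, 7))    # square 7x7 tiles -> 64 descriptors per channel
```

Square tiles yield more independent descriptors (more parallelism, more MLP evaluations); row strips minimize $T$ while keeping the buffer proportional to $W$.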

TSE decreases activation buffer requirements by up to 90%, reduces pipeline stalls, and is highly compatible with streaming convolutional dataflows, apart from the piecewise-constant attention discontinuities at tile boundaries. TSE maintains the same parameter count as SE and introduces only a marginal FLOP increase from evaluating the excitation MLP $T > 1$ times (Vosco et al., 2021).

6. Limitations and Observed Behavior

TSE’s limitations include a minor increase in FLOPs proportional to the tile count $T$ and the need to select tile dimensions appropriately. Tile sizes that are too small can degrade accuracy. However, local mean descriptors generally prove similar to global means, justifying the efficacy of TSE at moderate tile sizes.
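The relationship between local and global means can be sanity-checked numerically: when equal-size tiles partition the map, the average of the tile means recovers the global GAP descriptor exactly (a hedged illustration on random data, not the paper's experiment):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 56, 56)            # random feature map (illustrative)
g = x.mean(dim=(2, 3))                    # global GAP descriptor: [1, 16]
z = F.avg_pool2d(x, kernel_size=(7, 56))  # 7xW row-strip means: [1, 16, 8, 1]

# With equal-size tiles, the mean of tile means equals the global mean
assert torch.allclose(z.mean(dim=(2, 3)), g, atol=1e-5)
```

Each individual tile mean, of course, still fluctuates around the global mean, which is why overly small tiles (noisy local statistics) can hurt accuracy.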

In conclusion, TSE provides a hardware-friendly, empirically validated alternative to global SE, confirming that moderately local spatial statistics suffice for channel attention and enabling substantial reductions in accelerator buffer costs with negligible impact on accuracy (Vosco et al., 2021).

References

  • Vosco et al. (2021). Tiled Squeeze-and-Excite: Channel Attention With Local Spatial Context.
