Tiled Squeeze-and-Excite (TSE) for CNNs
- Tiled Squeeze-and-Excite (TSE) is a variant of SE that computes channel descriptors using local, non-overlapping tile pooling rather than global average pooling.
- It preserves the original SE excitation network, serving as a drop-in replacement that significantly reduces activation buffer requirements and pipeline stalls.
- Empirical results show TSE achieving nearly identical accuracy to global SE while reducing buffer usage by up to 90%, making it a practical solution for hardware accelerators.
Tiled Squeeze-and-Excite (TSE) is a variant of the Squeeze-and-Excite (SE) channel attention mechanism for convolutional neural networks (CNNs) in which channel descriptors are computed not by global spatial pooling, but by aggregating over local, non-overlapping tiles. TSE preserves the original SE excitation network and can be used as a drop-in replacement for SE blocks, providing substantial hardware efficiency gains with negligible performance loss by exploiting the sufficiency of local spatial statistics for channel attention (Vosco et al., 2021).
1. Channel Attention and the Squeeze-and-Excite Mechanism
The canonical SE block operates on input activations $X \in \mathbb{R}^{C \times H \times W}$ by first “squeezing” spatial information through global average pooling (GAP), producing channel descriptors $z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j)$ for $c = 1, \dots, C$. The resulting vector $z \in \mathbb{R}^{C}$ feeds into a compact two-layer excitation network, specifically an MLP with bottleneck dimension $C/r$, which outputs scaling coefficients $s \in (0,1)^{C}$. Each channel of $X$ is then scaled by the corresponding $s_c$.
While compute- and parameter-efficient, SE’s reliance on GAP necessitates buffering the full feature map before attention modulation, which is nontrivial on streaming hardware accelerators and may cause substantial pipeline stalls (Vosco et al., 2021).
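As a concrete reference point, the SE computation above can be sketched in a few lines of NumPy. The weights here are randomly initialized purely to illustrate shapes; in a real network $W_1$ and $W_2$ are learned:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excite on a single input x of shape [C, H, W]."""
    # Squeeze: global average pooling over the spatial dimensions -> [C]
    z = x.mean(axis=(1, 2))
    # Excitation: two-layer bottleneck MLP (ReLU, then sigmoid gate)
    a = np.maximum(w1 @ z + b1, 0.0)          # [C/r]
    s = 1.0 / (1.0 + np.exp(-(w2 @ a + b2)))  # [C], each value in (0, 1)
    # Scale: broadcast one coefficient per channel over the H x W plane
    return x * s[:, None, None]

# Toy usage with C = 8 channels and reduction ratio r = 4
rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4
x = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
assert y.shape == x.shape
```

Note that every pixel in a channel receives the same coefficient, which is exactly why the whole feature map must be buffered before modulation can begin.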
2. Tiled Squeeze-and-Excite: Architecture and Mechanism
TSE replaces global average pooling with pooling over local, non-overlapping spatial tiles, while sharing the original SE excitation MLP across all tiles. For input $X \in \mathbb{R}^{C \times H \times W}$, the spatial domain is partitioned into $T = \lceil H/h \rceil \cdot \lceil W/w \rceil$ tiles of size $h \times w$. For each channel $c$ and tile $t$, TSE computes a local descriptor $z_{c,t}$ (the average of channel $c$ over tile $t$), assembling a tensor $Z \in \mathbb{R}^{C \times T}$. Each tile descriptor vector $z_{\cdot,t} \in \mathbb{R}^{C}$ then passes through the same excitation MLP:

$$s_{\cdot,t} = \sigma\!\left(W_2\,\mathrm{ReLU}(W_1 z_{\cdot,t})\right)$$

This yields $S \in (0,1)^{C \times T}$, which is broadcast and upsampled by nearest-neighbor interpolation to $S' \in \mathbb{R}^{C \times H \times W}$, permitting per-pixel modulation: $Y = X \odot S'$.
TSE Pseudocode
```python
def TSE(x, h, w, r):
    # x: [N, C, H, W];  r: reduction ratio
    # sq = AvgPool2d(kernel=(h, w), stride=(h, w), ceil_mode=True)
    z = sq(x)                                # [N, C, ⌈H/h⌉, ⌈W/w⌉]
    z = z.view(N, C, -1)                     # [N, C, T]
    a = conv1(z)                             # 1×1 conv, C → C/r
    b = relu(a)
    c = conv2(b)                             # 1×1 conv, C/r → C
    s = sigmoid(c)                           # [N, C, T]
    S = s.view(N, C, ceil(H / h), ceil(W / w))
    S_full = upsample(S, (H, W), mode='nearest')
    return x * S_full
```
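The pseudocode above can be made fully executable in NumPy by replacing the pooling and 1×1 convolutions with explicit tile averaging and matrix products. `tse_block` and its weight arguments are illustrative names, not from the paper:

```python
import numpy as np

def tse_block(x, h, w, w1, b1, w2, b2):
    """Tiled Squeeze-and-Excite on x of shape [C, H, W] with tile size h x w."""
    C, H, W = x.shape
    Th, Tw = -(-H // h), -(-W // w)  # ceil division, mirroring ceil_mode pooling
    # Squeeze: average-pool each non-overlapping tile (edge tiles may be smaller)
    z = np.empty((C, Th, Tw))
    for i in range(Th):
        for j in range(Tw):
            z[:, i, j] = x[:, i*h:(i+1)*h, j*w:(j+1)*w].mean(axis=(1, 2))
    # Excitation: the same bottleneck MLP applied to every tile descriptor
    zt = z.reshape(C, -1)                                           # [C, T]
    a = np.maximum(np.einsum('dc,ct->dt', w1, zt) + b1[:, None], 0.0)
    s = 1.0 / (1.0 + np.exp(-(np.einsum('cd,dt->ct', w2, a) + b2[:, None])))
    S = s.reshape(C, Th, Tw)
    # Nearest-neighbor upsample: every pixel in a tile gets that tile's coefficient
    S_full = np.repeat(np.repeat(S, h, axis=1), w, axis=2)[:, :H, :W]
    return x * S_full

# Usage: 20x20 map with 7x7 tiles gives a 3x3 tile grid (edge tiles are 6 wide)
rng = np.random.default_rng(0)
C, H, W, r = 8, 20, 20, 4
x = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = tse_block(x, 7, 7, w1, b1, w2, b2)
```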
3. Empirical Evaluation and Buffer Efficiency
Empirical analysis on ImageNet-1K with RegNetY-800MF demonstrated that TSE with row-strip tiles of height $h=7$ (tile size $7 \times W$) achieves nearly identical accuracy to global SE (76.29% vs. 76.30% top-1), while reducing the required activation buffer from 1.07M ($HWC$) to $7WC = 0.42$M activations. Similar results were observed with column-strip or square tiles. On EfficientDet-D2 (COCO), TSE ($h=7$) matched SE's mAP of 42.3 while decreasing buffer usage from 50.8M to 4.77M activations (Vosco et al., 2021).
| Model | SE Top-1/mAP | TSE ($h=7$) Top-1/mAP | SE Buffer | TSE Buffer |
|---|---|---|---|---|
| RegNetY-800MF (ImageNet-1K) | 76.30% | 76.29% | 1.07M activations | 0.42M activations |
| EfficientDet-D2 (COCO) | mAP 42.3 | mAP 42.3 | 50.8M activations | 4.77M activations |
A strip height or width of approximately 7 was typically sufficient to match global context, confirming that local context of this extent is adequate for channel attention (Vosco et al., 2021).
4. Deployment, Buffer Formulas, and Drop-in Replacement
TSE affords a direct, parameter-preserving replacement for SE blocks. Because each tile descriptor $z_{c,t}$ closely matches the global descriptor $z_c$ in practice, pretrained SE weights can be reused in TSE, with tile pooling substituting for GAP and no retraining required. The accuracy drop from direct weight reuse was small across a variety of architectures, and brief fine-tuning (on 10–20% of the data) recovers accuracy to within 0.2–0.4% of the original SE models.
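One way to sanity-check the drop-in property: when a single tile covers the full spatial extent ($h = H$, $w = W$), TSE collapses exactly to SE, so the same excitation weights must produce identical outputs. A minimal NumPy sketch (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def excite(z, w1, b1, w2, b2):
    # Shared SE excitation MLP: ReLU bottleneck, then sigmoid gate. z: [C, T]
    a = np.maximum(w1 @ z + b1[:, None], 0.0)
    return 1.0 / (1.0 + np.exp(-(w2 @ a + b2[:, None])))

def se(x, params):
    z = x.mean(axis=(1, 2)).reshape(-1, 1)   # global pooling -> [C, 1]
    return x * excite(z, *params).reshape(-1, 1, 1)

def tse(x, h, w, params):
    C, H, W = x.shape
    Th, Tw = -(-H // h), -(-W // w)          # ceil division (ceil_mode pooling)
    z = np.stack([x[:, i*h:(i+1)*h, j*w:(j+1)*w].mean(axis=(1, 2))
                  for i in range(Th) for j in range(Tw)], axis=1)   # [C, T]
    S = excite(z, *params).reshape(C, Th, Tw)
    S_full = np.repeat(np.repeat(S, h, axis=1), w, axis=2)[:, :H, :W]
    return x * S_full

rng = np.random.default_rng(1)
C, H, W, r = 8, 14, 14, 4
params = (rng.standard_normal((C // r, C)), np.zeros(C // r),
          rng.standard_normal((C, C // r)), np.zeros(C))
x = rng.standard_normal((C, H, W))
# With one tile covering the whole map, TSE reduces exactly to SE
assert np.allclose(tse(x, H, W, params), se(x, params))
```

With smaller tiles the outputs differ per tile, which is precisely where the (empirically small) accuracy gap from weight reuse comes from.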
Buffering requirements differ starkly:
- Global SE: $B_{\mathrm{SE}} = H \cdot W \cdot C$ activations (the entire feature map must be buffered before modulation).
- TSE (row strips of height $h$): $B_{\mathrm{TSE}} = h \cdot W \cdot C$ activations.
For EfficientDet-D2 (input resolution $768 \times 768$), this equates to 50.8M activations (SE) versus 4.77M activations (TSE with $h=7$) summed over all SE-bearing layers (Vosco et al., 2021).
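The buffer formulas are easy to verify on a single illustrative feature map (dimensions chosen for illustration, not taken from the paper):

```python
# Activation-buffer comparison for one feature map, per the formulas above.
# Illustrative dimensions: H = W = 56, C = 512, row-strip height h = 7.
H, W, C, h = 56, 56, 512, 7
se_buffer = H * W * C    # global SE holds the whole map: 1,605,632 activations
tse_buffer = h * W * C   # TSE holds only h rows at a time: 200,704 activations
print(se_buffer, tse_buffer, se_buffer // tse_buffer)  # reduction factor = H / h = 8
```

The reduction factor is simply $H/h$ per layer, which is why savings grow with feature-map resolution and why early, high-resolution stages benefit most.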
5. Practical Considerations and Hardware Implications
Tile shape selection in TSE is governed by several trade-offs:
- Use row strips ($h \times W$) if the hardware buffer scales with the width $W$.
- Square patches ($h \times h$) may increase parallelism if buffer scaling is independent of $W$.
- Strip or patch size of approximately 7 generally suffices for accuracy parity with SE.
TSE decreases activation buffer requirements by up to 90%, reduces pipeline stalls, and is highly compatible with streaming convolutional dataflows, requiring only minor bookkeeping at tile boundaries. TSE maintains the same parameter count as SE and introduces only a marginal FLOP increase due to repeated MLP evaluations, once per tile rather than once per feature map (Vosco et al., 2021).
6. Limitations and Observed Behavior
TSE’s limitations include a minor increase in FLOPs proportional to the tile count and the need to select tile dimensions appropriately. Tile sizes that are too small can degrade accuracy. However, local mean descriptors generally prove similar to global means, justifying the efficacy of TSE at moderate tile sizes.
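The claim that local means approximate global means can be illustrated on synthetic data: when a feature map is a constant level plus i.i.d. noise, averaging over an $h \times w$ tile shrinks the noise by a factor of $\sqrt{hw}$, so tile descriptors cluster tightly around the global mean. A toy NumPy check (not a result from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 56
h = w = 7
# Synthetic "feature map": constant base level plus unit-variance noise
x = 3.0 + rng.standard_normal((H, W))
# Non-overlapping 7x7 tile means, arranged on an 8x8 tile grid
tiles = x.reshape(H // h, h, W // w, w).mean(axis=(1, 3))
global_mean = x.mean()
# Tile means deviate from the global mean far less than individual pixels do
assert np.abs(tiles - global_mean).max() < np.abs(x - global_mean).max()
print(np.std(tiles), np.std(x))  # tile-mean spread is much smaller than pixel spread
```

This is an idealized model; real feature maps have spatial structure, but the empirical results above suggest the same concentration effect holds well enough at moderate tile sizes.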
In conclusion, TSE provides a hardware-friendly, empirically validated alternative to global SE, confirming that moderately local spatial statistics suffice for channel attention and enabling substantial reductions in accelerator buffer costs with negligible impact on accuracy (Vosco et al., 2021).