CRAM: Convolutional Rectangular Attention Module
- The paper introduces CRAM, a differentiable attention module that replaces per-pixel weighting with a compact, five-parameter rectangular window.
- Its design enforces spatial regularity and reduces parameter count, thereby avoiding overfitting common in irregular, position-wise attention maps.
- Empirical studies on benchmarks like the Oxford-IIIT Pet dataset show CRAM achieves consistent performance gains, with improvements of up to 0.5 percentage points over baseline models.
The Convolutional Rectangular Attention Module (CRAM) is a differentiable spatial attention mechanism designed for integration with convolutional neural networks (CNNs). It constrains attention to a single, oriented rectangular region within the spatial domain of a feature map, specified by only five parameters. This contrasts with conventional position-wise attention maps, which assign an independent weight to each spatial location, resulting in highly flexible but often irregular attention maps. CRAM reduces parameter count, regularizes the attended region, and improves generalization while retaining ease of interpretability and end-to-end trainability (Nguyen et al., 13 Mar 2025).
1. Motivation and Theoretical Premise
Traditional spatial attention within CNN architectures allocates a scalar attention weight to each location in the feature map independently, a methodology here referred to as “position-wise” attention. While this method maximizes representational flexibility, empirical observations indicate that the produced attention maps frequently display highly irregular and fragmented support, with noisy, jagged edges that can overfit to the idiosyncrasies of training data and generalize poorly to novel samples.
CRAM is motivated by the tendency of human visual attention to operate over compact, coherent regions akin to rectangular windows. By limiting the attention support to a single, parameterized rectangle, CRAM implicitly imposes global spatial regularity, reducing the parameter space (from one weight per spatial location, i.e. $H \times W$, to five) and promoting stability across varied data distributions. The rectangular constraint also enables direct interpretation of "where to look" and facilitates the gathering of descriptive statistics from the attention mechanism.
2. Mathematical Formulation
CRAM defines its attention window by a smooth, differentiable approximation to a rectangle. In one dimension, the window function is

$$f(t; \mu, \sigma, s) = \mathrm{sig}\!\left( s \left( 1 - \left( \frac{t - \mu}{\sigma} \right)^{2} \right) \right),$$

where $\mathrm{sig}(\cdot)$ is the sigmoid function, $\mu$ is the window center, $\sigma$ is the half-width, and $s$ is the steepness parameter.
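As a concrete illustration, the 1D window can be evaluated numerically. The sketch below (NumPy, with arbitrary example values) shows that the function is near 1 inside the window, near 0 outside, and exactly 0.5 at the window boundary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def window_1d(t, mu, half_width, s):
    # Smooth rectangular window: ~1 for |t - mu| < half_width, ~0 outside,
    # with transition sharpness controlled by the steepness s.
    return sigmoid(s * (1.0 - ((t - mu) / half_width) ** 2))

inside = window_1d(0.5, mu=0.5, half_width=0.2, s=50.0)   # at the window center
outside = window_1d(0.0, mu=0.5, half_width=0.2, s=50.0)  # far outside the window
boundary = window_1d(0.3, mu=0.5, half_width=0.2, s=50.0)  # exactly at the edge
```

Note that at the boundary $t = \mu \pm \sigma$ the argument of the sigmoid is zero, so the window value is exactly 0.5 regardless of $s$; this is why the 0.5-level set recovers the rectangle.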
The two-dimensional generalization is constructed as a product of two such 1D functions over a rotated coordinate frame. Let $\mu = (\mu_1, \mu_2)$ denote the center coordinates, $(\sigma_1, \sigma_2)$ the half-sizes, and $\alpha$ the rotation angle, and define rotated coordinates

$$(u, v)^{\top} = R(-\alpha)\,\big( (x, y)^{\top} - \mu^{\top} \big),$$

where $R(\cdot)$ is the standard 2D rotation matrix. The 2D attention window function is then

$$f(x, y) = \mathrm{sig}\!\left( s \left( 1 - \left( \frac{u}{\sigma_1} \right)^{2} \right) \right) \cdot \mathrm{sig}\!\left( s \left( 1 - \left( \frac{v}{\sigma_2} \right)^{2} \right) \right).$$

In the limit $s \to \infty$, this becomes an indicator over a rotated rectangle, but it remains fully differentiable for all finite $s$.
All parameters are generated by a compact subnetwork and constrained to valid domains via activation functions (sigmoid for positions and sizes, an appropriately periodic mapping for the angle). This guarantees differentiability of the attention map with respect to the input image and the subnetwork's weights, enabling efficient gradient-based learning.
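A minimal sketch of such a parameter projection, assuming sigmoid activations for positions and sizes as stated; the specific angle mapping via $\pi \tanh(\cdot)$ is an illustrative assumption, chosen only because it yields values in $(-\pi, \pi)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def project_params(raw):
    # raw: 5 unconstrained subnetwork outputs (u1, u2, v1, v2, theta).
    mu = sigmoid(raw[0:2])            # center (mu1, mu2) in (0,1)^2
    sigma = sigmoid(raw[2:4])         # half-sizes (sigma1, sigma2) in (0,1)^2
    alpha = np.pi * np.tanh(raw[4])   # rotation angle in (-pi, pi) [assumed mapping]
    return mu, sigma, alpha
```

All three mappings are smooth, so gradients flow through the projection to the subnetwork.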
3. Integration into Convolutional Networks
The CRAM module is integrated as follows:
- Attention-parameter subnetwork: Given convolutional features , a shallow subnetwork (typically 3 convolutional layers, global pooling, and a fully connected layer) predicts raw values for the five attention parameters.
- Parameter Projection: These outputs are mapped to valid values for $\mu$, $(\sigma_1, \sigma_2)$, and $\alpha$ as described above.
- Attention Mask Computation: For each spatial location $(x, y)$ (normalized to the unit square), compute the mask value $f(x, y)$.
- Residual Coupling: The attention map $f$ is broadcast across channels and combined with the feature map via $y = x + f \odot x$, maintaining a residual connection.
This sequence preserves spatial structure and introduces minimal overhead. The core operation is differentiable, allowing inclusion at any layer of a standard CNN.
Pseudocode Representation
function CRAM_Block(x):
    # x: [H, W, C] feature map; s is a fixed steepness constant
    params = AttNet(x)              # raw outputs (u1, u2, v1, v2, θ)
    mu    = sigmoid(params[1:2])    # center, in [0,1]^2
    sigma = sigmoid(params[3:4])    # half-sizes, in [0,1]^2
    alpha = scale_angle(params[5])  # angle, in [-π, π]
    # compute attention mask f of size [H, W]
    f = zeros(H, W)
    for i = 1..H, j = 1..W:
        (u, v) = rotate((i/H, j/W) - mu, -alpha)
        f[i,j] = sigmoid(s*(1 - (u/sigma[1])^2)) * sigmoid(s*(1 - (v/sigma[2])^2))
    return x + f[:,:,None] * x      # residual coupling, broadcast over channels
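The pseudocode above can be realized concretely. The NumPy sketch below computes the rotated-rectangle mask and the residual coupling for a given set of already-projected parameters, with `s` treated as a fixed steepness constant (the value 50.0 is an arbitrary choice for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cram_mask(H, W, mu, sigma, alpha, s=50.0):
    # Smooth rotated-rectangle attention mask over the unit square.
    # mu: (2,) center in [0,1]^2; sigma: (2,) half-sizes; alpha: rotation angle.
    ys, xs = np.meshgrid(np.arange(H) / H, np.arange(W) / W, indexing="ij")
    dx, dy = xs - mu[0], ys - mu[1]
    c, si = np.cos(-alpha), np.sin(-alpha)
    u = c * dx - si * dy            # rotate offsets by -alpha
    v = si * dx + c * dy
    return (sigmoid(s * (1.0 - (u / sigma[0]) ** 2))
            * sigmoid(s * (1.0 - (v / sigma[1]) ** 2)))

def cram_block(x, mu, sigma, alpha, s=50.0):
    # Residual coupling: y = x + f * x, mask broadcast over channels.
    H, W, _ = x.shape
    f = cram_mask(H, W, mu, sigma, alpha, s)
    return x + f[:, :, None] * x
```

Inside the rectangle the mask is near 1, so features are roughly doubled; outside it is near 0, so features pass through unchanged by the residual path.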
4. Training Strategy
The CRAM-enhanced network is trained end-to-end by backpropagation from the main task loss (e.g., cross-entropy for classification). No explicit supervision of spatial position or bounding box is required.
An optional auxiliary "equivariance" loss can be incorporated: randomly transform the input (translation, rotation, scaling) and encourage the attention parameter predictions to transform accordingly. This regularization promotes coherence and robustness of the attended region under spatial transformations. The main task loss remains the only required supervisory signal; the auxiliary loss is optional and enters the objective with a small weighting coefficient.
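A minimal sketch of such an equivariance penalty for the translation case (the `predict_params` interface, the squared-error form, and the weight `lam` are assumptions for illustration; the text also mentions rotation and scaling):

```python
import numpy as np

def equivariance_loss(predict_params, x, dy, dx, lam=0.1):
    # predict_params(x) -> (mu, sigma, alpha), with mu the rectangle center in
    # normalized [0,1]^2 coordinates (hypothetical interface).
    # Translating the input by (dy, dx) pixels should translate the predicted
    # center by (dx/W, dy/H) while leaving sizes and angle unchanged.
    H, W, _ = x.shape
    mu, sigma, alpha = predict_params(x)
    x_t = np.roll(x, (dy, dx), axis=(0, 1))          # translated input
    mu_t, sigma_t, alpha_t = predict_params(x_t)
    target_mu = mu + np.array([dx / W, dy / H])      # expected center shift
    return lam * (np.sum((mu_t - target_mu) ** 2)
                  + np.sum((sigma_t - sigma) ** 2)
                  + (alpha_t - alpha) ** 2)
```

A perfectly translation-equivariant predictor incurs zero penalty, while predictions that fail to track the transformed input are pushed toward coherence.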
5. Experimental Results
Evaluation was conducted on the Oxford-IIIT Pet dataset using MobileNetV3 and EfficientNet-b0 as backbone CNNs, both pretrained on ImageNet. Four variants were compared: baseline (no attention), standard position-wise attention (CBAM-style), CRAM without equivariance term, and CRAM with equivariance regularization.
Results Summary
| Model Variant | Top-1 Accuracy (MobileNetV3) | Top-1 Accuracy (EfficientNet-b0) |
|---|---|---|
| No attention | 91.15% ± 0.31% | — |
| Position-wise (CBAM) | 91.53% ± 0.17% | — |
| CRAM (no equivariance) | 91.76% ± 0.20% | — |
| CRAM (equivariance) | 91.82% ± 0.34% | ≈0.4–0.5 pp above baseline |
CRAM consistently outperforms both the baseline and position-wise attention mechanisms across multiple train/val splits (60:40, 70:30, 80:20). Ablation indicates that position-wise attention can degrade performance in larger models due to producing irregular maps, whereas CRAM yields stable and beneficial improvements under all configurations.
6. Interpretability and Applications
Visualization
CRAM yields an explicit, parameterized rectangular attention region. The predicted rectangle (the $0.5$-level set of $f$) can be overlaid on input images to visually assess the attended area. Empirical review shows that CRAM's attention consistently covers the relevant foreground object, in contrast to the less stable, sometimes background-spanning saliency of position-wise maps.
Quantitative Description
Analysis of the five predicted parameters across a dataset enables statistical characterization of attended regions. For example, plotting the distribution of rectangle centers provides insight into object location statistics in the data.
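For instance, with per-image parameter predictions stacked into an array, such summary statistics reduce to simple aggregations (the numbers below are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical predictions for N images, stacked as an (N, 5) array of
# (u1, u2, v1, v2, theta) = (center coords, half-sizes, rotation angle).
params = np.array([
    [0.48, 0.52, 0.20, 0.25,  0.05],
    [0.51, 0.49, 0.22, 0.24, -0.02],
    [0.50, 0.50, 0.21, 0.26,  0.01],
])
centers = params[:, 0:2]                   # rectangle centers
mean_center = centers.mean(axis=0)         # average attended location
areas = 4.0 * params[:, 2] * params[:, 3]  # area of each (2*sigma1 x 2*sigma2) rectangle
```

Histograms of `centers` or `areas` over a dataset then describe where, and how large, the attended regions tend to be.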
Extensions and Uses
Possible extensions include:
- Multi-rectangle attention by summing several CRAM modules
- Regions with other parametric forms (e.g., ellipses, oriented polygons)
- Application as a lightweight object detector, especially in low-data scenarios
- Temporal coherence for video, tracking object location over time with evolving rectangles
A plausible implication is that CRAM’s structure may provide substantial benefits for generalization and interpretability in settings where task-relevant regions are spatially compact or approximately rectangular, with minimal increase in model complexity.
7. Summary and Implications
CRAM imposes a structured, rectangular, and differentiable prior on spatial attention in CNNs by reducing the attended region to a five-parameter rectangle. This approach trades away the maximal flexibility of per-pixel attention for regularity, generalizability, and a form inherently suited for interpretability and statistical analysis. Empirical evidence demonstrates robust gains over standard position-wise attention across different architectures and data splits, and the methodology is extensible to numerous vision tasks (Nguyen et al., 13 Mar 2025).