STAM: Shape-Texture Attention Module

Updated 6 September 2025
  • The paper introduces STAM, a dual-branch attention architecture that leverages complementary shape and texture cues for improved recognition and segmentation.
  • STAM decouples geometric and textural feature extraction using deformable convolutions and learnable Gabor filters, enabling fine-grained discrimination.
  • Empirical evaluations show STAM’s effectiveness in enhancing accuracy and F1 scores in edge-deployed plant disease classification compared to traditional methods.

The Shape-Texture Attention Module (STAM) refers to a class of learnable architectural designs that decompose visual attention into distinct shape-aware and texture-aware processing branches, enabling models to capture and exploit complementary visual cues for discriminative tasks such as recognition, segmentation, or image quality assessment. The foundational principle is that shape and texture encode orthogonal information—shapes often provide geometric class invariance while textures contribute material or pathological sensitivity—and their explicit fusion yields superior performance compared to generic attention mechanisms, especially in domains requiring fine-grained discrimination.

1. Definition and Rationale

STAM is built on the observation that standard attention mechanisms and deep neural networks often exhibit bias—favoring either shape or texture cues, with the former typical in semantic segmentation and the latter prevalent in standard CNN training (Zhang et al., 2023, Oliveira et al., 2023, Cohen et al., 22 May 2025). This bias is suboptimal for scenarios where both cues are critical, such as plant disease identification (irregular lesion shapes and complex textures), medical imaging, or material science. STAM explicitly splits attention into two specialized branches:

  • Shape-aware branch: designed to perceive geometric structure, object contours, and boundary features.
  • Texture-aware branch: focused on extracting periodicity, orientation, and repetition inherent in textures.

The outputs are fused to synthesize an attention map that reflects the discriminative contributions from both modalities, resulting in improved recognition and localization performance (Qiu, 3 Sep 2025).

2. Architectural Components

a) Parallel Attention Branches

STAM begins with a compressed feature descriptor, typically obtained via a 1×1 convolution on the input feature map. This descriptor is routed into:

  • Shape branch: Employs deformable convolutional networks (DCNv4). The receptive field of each kernel is not fixed but adapts via learned spatial offsets, conforming to irregular shapes such as the boundaries of leaf lesions (Qiu, 3 Sep 2025).
  • Texture branch: Utilizes a learnable Gabor filter bank. Eight or more orientations are instantiated and the filters are trainable, allowing the model to adaptively extract dataset-specific pathological textures. A minimal sketch of both branches follows this list.
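
The following PyTorch sketch is illustrative rather than the released STA-Net code: torchvision's DeformConv2d (a DCNv2-style operator) stands in for DCNv4, which torchvision does not provide, and the Gabor bank is realized as a depthwise convolution whose kernels are initialized from an eight-orientation Gabor bank and left trainable. Layer widths and kernel sizes are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d  # DCNv2-style stand-in for DCNv4


class ShapeBranch(nn.Module):
    """Deformable convolution whose sampling grid adapts to object contours."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Two offsets (dy, dx) per kernel position, predicted from the input.
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))


def gabor_kernel(theta: float, k: int = 7, sigma: float = 2.0, lam: float = 4.0):
    """Real part of a Gabor filter at orientation theta (used only as init)."""
    half = k // 2
    grid = torch.arange(-half, half + 1, dtype=torch.float32)
    ys, xs = torch.meshgrid(grid, grid, indexing="ij")
    xr = xs * math.cos(theta) + ys * math.sin(theta)  # axis along theta
    return torch.exp(-(xs**2 + ys**2) / (2 * sigma**2)) * torch.cos(2 * math.pi * xr / lam)


class TextureBranch(nn.Module):
    """Depthwise convolution initialized from an 8-orientation Gabor bank;
    the kernels stay trainable so the bank can adapt to dataset textures."""

    def __init__(self, channels: int, k: int = 7, orientations: int = 8):
        super().__init__()
        bank = torch.stack([gabor_kernel(i * math.pi / orientations, k)
                            for i in range(orientations)])      # (O, k, k)
        # Assign one orientation per channel, cycling through the bank.
        weight = bank[torch.arange(channels) % orientations].unsqueeze(1)
        self.weight = nn.Parameter(weight)                       # (C, 1, k, k)
        self.k, self.channels = k, channels

    def forward(self, x):
        return F.conv2d(x, self.weight, padding=self.k // 2, groups=self.channels)
```

Both branches preserve spatial resolution and channel count, so their outputs can be concatenated directly in the fusion step described next.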

b) Branch Fusion

The responses from both branches are concatenated and processed by a fusion convolutional block, followed by a sigmoid activation that constrains the attention weights to [0, 1]. The result is an aggregated spatial attention map, which is applied residually to the input features to highlight discriminative image regions.

c) Mathematical Summary

Let $x$ denote the input features, $f^{1\times 1}$ the channel compressor, $f^{\text{shape}}$ and $f^{\text{texture}}$ the branch transformations, $M_{\text{shape}}$ and $M_{\text{texture}}$ the respective branch maps, and $f^{\text{fusion}}$ the fusion operator:

$$
\begin{aligned}
x_{\text{desc}} &= f^{1\times 1}(x) \\
M_{\text{shape}} &= f^{\text{shape}}(x_{\text{desc}}) \\
x_{\text{texture}} &= x_{\text{desc}} \otimes \sigma(M_{\text{shape}}) \\
M_{\text{texture}} &= f^{\text{texture}}(x_{\text{texture}}) \\
M_{\text{stam}} &= \sigma\left(f^{\text{fusion}}([M_{\text{shape}}; M_{\text{texture}}])\right)
\end{aligned}
$$

where $\otimes$ denotes element-wise multiplication and $[\,\cdot\,;\,\cdot\,]$ concatenation along the channel axis.
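
Read as code, the pipeline maps line-for-line onto these equations. The sketch below is a reconstruction under stated assumptions: the compressed channel width, the single-channel fused map, and the exact residual composition are illustrative choices rather than confirmed details of STA-Net. Any shape and texture modules with matching shapes, such as the branch sketches above, can be plugged in.

```python
import torch
import torch.nn as nn


class STAM(nn.Module):
    """Shape-texture attention: descriptor -> shape map -> shape-gated
    texture map -> fused spatial attention, applied residually."""

    def __init__(self, in_channels, mid_channels, shape_branch, texture_branch):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, mid_channels, 1)   # f^{1x1}
        self.f_shape = shape_branch                                # f^{shape}
        self.f_texture = texture_branch                            # f^{texture}
        # Fuse concatenated branch maps into a single-channel spatial map.
        self.fuse = nn.Conv2d(2 * mid_channels, 1, 3, padding=1)   # f^{fusion}

    def forward(self, x):
        d = self.compress(x)                    # x_desc
        m_shape = self.f_shape(d)               # M_shape
        x_tex = d * torch.sigmoid(m_shape)      # x_desc (*) sigma(M_shape)
        m_tex = self.f_texture(x_tex)           # M_texture
        m_stam = torch.sigmoid(self.fuse(torch.cat([m_shape, m_tex], dim=1)))
        return x + x * m_stam                   # residual re-weighting


# Usage with placeholder branches (swap in the deformable/Gabor sketches):
stam = STAM(64, 32,
            nn.Conv2d(32, 32, 3, padding=1),
            nn.Conv2d(32, 32, 3, padding=1))
y = stam(torch.randn(2, 64, 56, 56))            # y: (2, 64, 56, 56)
```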

3. Empirical Evaluation and Ablation

Empirical results on lightweight edge-deployed plant disease classification tasks indicate substantial gains when STAM is integrated into convolutional models. On the CCMT dataset (Qiu, 3 Sep 2025), the baseline model achieved 86.84% Top-1 accuracy and 86.71% F1; adding STAM raised accuracy to 87.41%, and further to 89.00% (F1=88.96%) when combined with the SE attention module, all within a resource profile of 401K parameters and 51.1M FLOPs. Ablation studies demonstrate that STAM outperforms both the baseline and generic attention frameworks (such as CBAM), with the fused shape-texture approach capturing subtle class cues missed by vanilla spatial attention. The improvements are robust even under hardware constraints typical of edge devices.

Model Variant    Accuracy (%)    F1 Score (%)    Parameters (K)
Baseline         86.84           86.71           401
+STAM            87.41           87.33           401
+SE+STAM         89.00           88.96           401

4. Design Choices

The shape branch’s utilization of DCNv4 enables adaptation to non-rigid target morphologies, which is crucial for segmenting and recognizing disease-induced plant lesions characterized by jagged and variable boundaries. The texture branch’s learnable Gabor filters provide orientation and frequency selectivity, maximizing responsiveness to texture signatures such as spots, rings, or filamentous structures that are diagnostic in pathogenesis.
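
To make the orientation and frequency selectivity concrete, the sketch below shows one way to parameterize a Gabor bank so that the orientation θ, wavelength λ, and envelope width σ are themselves learnable, with kernels rebuilt differentiably on each forward pass. This is a hypothetical design choice for illustration; the earlier sketch, where raw kernels are trained from a Gabor initialization, is an equally plausible reading of "learnable Gabor filters".

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableGaborBank(nn.Module):
    """Gabor filters whose orientation, wavelength, and envelope width are
    trained by gradient descent; kernels are rebuilt on every forward pass."""

    def __init__(self, orientations: int = 8, k: int = 7):
        super().__init__()
        self.k = k
        self.theta = nn.Parameter(torch.arange(orientations) * math.pi / orientations)
        # Log-parameterization keeps sigma and lambda positive during training.
        self.log_sigma = nn.Parameter(torch.full((orientations,), math.log(2.0)))
        self.log_lam = nn.Parameter(torch.full((orientations,), math.log(4.0)))

    def kernels(self):
        half = self.k // 2
        grid = torch.arange(-half, half + 1, dtype=torch.float32,
                            device=self.theta.device)
        ys, xs = torch.meshgrid(grid, grid, indexing="ij")
        t = self.theta[:, None, None]
        xr = xs[None] * torch.cos(t) + ys[None] * torch.sin(t)   # rotated axis
        sigma = self.log_sigma.exp()[:, None, None]
        lam = self.log_lam.exp()[:, None, None]
        envelope = torch.exp(-(xs[None]**2 + ys[None]**2) / (2 * sigma**2))
        return envelope * torch.cos(2 * math.pi * xr / lam)      # (O, k, k)

    def forward(self, x):                                        # x: (B, 1, H, W)
        w = self.kernels().unsqueeze(1)                          # (O, 1, k, k)
        return F.conv2d(x, w, padding=self.k // 2)               # (B, O, H, W)
```

Because the kernels are differentiable functions of (θ, σ, λ), gradient descent can retune both the orientations and the spatial frequencies the branch responds to, rather than only re-weighting a fixed bank.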

Fusion is performed via a lightweight convolutional module, which ensures computational efficiency and is suited for deployment in resource-constrained environments. The residual application of the fused attention map maintains original global context while selectively boosting task-relevant regions.

5. Impact and Domain-Specific Implications

STAM’s decoupled attention approach improves discriminative performance for fine-grained datasets—especially those requiring sensitivity to both geometry and material appearance. This addresses the limitations of standard lightweight attention designs that are tuned for object-centric benchmarks and do not generalize to pathology or texture-dominated settings. For precision agriculture, STAM enables effective disease detection on real-time, low-power devices, facilitating scalable solutions for crop management and early intervention.

The public release of the STA-Net codebase at https://github.com/RzMY/STA-Net provides a practical foundation for researchers to adapt and extend STAM to new applications.

6. Contextualization with Broader Research

The shape–texture decoupling embodied in STAM is a response to the observed anti-correlation between shape- and texture-sensitive neurons in deep models (Oliveira et al., 2023), the dominance of texture in segmentation models such as SAM (Zhang et al., 2023), and the need for balanced cue exploitation in critical vision domains (Cohen et al., 22 May 2025). A plausible implication is that future STAM architectures will further refine cross-modal fusion strategies, leveraging domain knowledge and hierarchical representations (e.g., multi-scale or task-adaptive attention) to enhance robustness and generalization in low-resource and OOD scenarios.

7. Future Directions

The continued refinement of shape-texture attention strategies is likely to include:

  • Multi-modal fusion, allowing integration of additional cues (e.g. color, edge, semantic priors).
  • Hierarchical attention stacking, inspired by temporal attention frameworks (Yang et al., 2021).
  • Expansion to non-plant domains, such as medical imaging, defect detection, and video analysis.
  • Adaptive regularization and bias mitigation techniques, as studied in CognitiveCNN (Mohla et al., 2020), to prevent over-reliance on either modality and enhance interpretability.

STAM's literature-grounded design and empirically validated gains underscore its relevance for both resource-conscious deployment and high-accuracy classification and segmentation in visually complex settings (Qiu, 3 Sep 2025).