Attribute-Centric Representations (ACR)
- ACR is a framework for fine-grained zero-shot learning that enforces attribute disentanglement by integrating expert modules directly within the representation backbone.
- It employs Mixture of Patch Experts (MoPE) and Mixture of Attribute Experts (MoAE) to generate part-aware, spatially localized attribute maps for interpretable feature extraction.
- Empirical results on benchmarks like CUB, AwA2, and SUN highlight ACR’s state-of-the-art performance and robust transfer capabilities with clear semantic structure.
Attribute-Centric Representations (ACR) provide a framework for fine-grained zero-shot learning (ZSL) that imposes attribute disentanglement directly within the backbone representation, as opposed to post-hoc solutions that operate on already entangled features. The central idea is to encode spatially localized, part-aware maps whose coordinates correspond directly to interpretable attributes (e.g., color, shape, texture), addressing the problem of attribute entanglement common in monolithic vision transformer (ViT) embeddings. This is achieved through two specialized mixture-of-experts modules: the Mixture of Patch Experts (MoPE) and the Mixture of Attribute Experts (MoAE), which together enable robust transfer to unseen categories while preserving semantic structure and interpretability (Chen et al., 13 Dec 2025).
1. Core Principles of Attribute-Centric Representations
In the ACR framework, the representation backbone transforms image information by routing patch tokens through banks of lightweight "attribute experts" before projecting onto a sparse, part-aware attribute map. The result is a global image descriptor in which each coordinate explicitly encodes evidence for one attribute $a$, grounded in a spatially localized heatmap. In contrast to conventional ZSL models that aggregate all visual cues into a single dense embedding (the CLS token), ACR specializes experts on coherent attribute families, thereby enforcing attribute disentanglement at the representation level and circumventing the limits of the post-hoc reweighting or subspace projections found in APN and TransZero++.
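A minimal schematic of this forward pass, in PyTorch-style pseudocode, is shown below; the function and module names (`acr_forward`, `mope_blocks`, `moae_head`, `class_attributes`) and the dot-product class scoring are illustrative assumptions, not the paper's released implementation.

```python
import torch

def acr_forward(patch_tokens, cls_token, mope_blocks, moae_head, class_attributes):
    """Schematic ACR forward pass.
    patch_tokens: (M, D) non-CLS tokens, cls_token: (D,),
    class_attributes: (C, A) per-class attribute signatures (assumed scoring rule)."""
    h, h_cls = patch_tokens, cls_token
    for block in mope_blocks:          # ViT blocks with MoPE adapters between MHSA and FFN
        h, h_cls = block(h, h_cls)     # patch tokens routed through attribute experts
    z, heatmaps = moae_head(h)         # (A,) global descriptor + per-attribute patch heatmaps
    scores = class_attributes @ z      # compatibility of z with each class's attribute vector
    return scores, z, heatmaps
```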
2. Mixture-of-Experts Architecture
2.1 Mixture of Patch Experts (MoPE)
MoPE is integrated into each layer of an $L$-layer ViT architecture, inserted between the Multi-Head Self-Attention (MHSA) and Feed-Forward (FFN) modules. At every layer $\ell$, each non-CLS patch token $\mathbf h_m$ is dispatched to a small, dynamically selected set $\mathcal S_m$ of $k$ experts out of $E$ in total, where each expert is a low-rank LoRA adapter.
- Dual-level Routing:
- Instance-level router: Computes a global image bias vector from the CLS token, retaining only its top-$k$ entries:
$\mathbf u^{\mathrm{logit}} = W^{I}\,\mathbf h_{\langle\mathrm{CLS}\rangle}\;\in\;\mathbb R^{E}$
- Patch-level router: Computes expert logits $\mathbf z_m\in\mathbb R^{E}$ for each patch token $\mathbf h_m$.
- The instance and patch signals are mixed into logits $\mathbf g_m$, yielding routing weights
$\mathbf w_m = \operatorname{softmax}\!\left(\frac{\mathcal M_k(\mathbf g_m)}{\tau}\right)\in\Delta^{E-1}$
where $\mathcal M_k$ masks all but the top-$k$ entries and $\tau$ is a routing temperature.
- Expert Update: For each token,
$\Delta \mathbf h_m = \sum_{e\in\mathcal S_m} \frac{w_m[e]}{\sum_{j\in\mathcal S_m}w_m[j]} \;\operatorname{EXP}_e(\mathbf h_m)$
where each expert $\operatorname{EXP}_e$ implements a residual LoRA adapter:
$\operatorname{EXP}_e(\mathbf h) = W^{(e)}_B\bigl(W^{(e)}_A\,\mathbf h\bigr)$
The adapter output is added to the usual FFN output:
$\mathbf h_m^{\mathrm{out}} = \operatorname{FFN}(\mathbf h_m) + \Delta\mathbf h_m$
A schematic implementation of this block is sketched below.
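The sketch assumes a linear patch-level router and an additive instance bias; all class, parameter, and variable names are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """EXP_e(h) = W_B^(e)(W_A^(e) h): a low-rank residual adapter."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)   # W_A^(e)
        self.B = nn.Linear(rank, dim, bias=False)   # W_B^(e)

    def forward(self, h):
        return self.B(self.A(h))

class MoPE(nn.Module):
    """Dual-level top-k routing over a bank of LoRA experts (illustrative sketch)."""
    def __init__(self, dim, num_experts=8, top_k=2, tau=1.0, rank=8):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(dim, rank) for _ in range(num_experts))
        self.instance_router = nn.Linear(dim, num_experts)  # W^I, reads the CLS token
        self.patch_router = nn.Linear(dim, num_experts)     # assumed linear patch-level router
        self.top_k, self.tau = top_k, tau

    def forward(self, h_patches, h_cls):
        # h_patches: (M, D) non-CLS tokens, h_cls: (D,) CLS token
        u = self.instance_router(h_cls)                      # (E,) instance-level logits
        keep = torch.zeros_like(u).scatter_(0, u.topk(self.top_k).indices, 1.0)
        g = self.patch_router(h_patches) + u * keep          # (M, E) mixed logits (assumed additive)
        topv, topi = g.topk(self.top_k, dim=-1)              # M_k: mask all but the top-k experts
        w = F.softmax(topv / self.tau, dim=-1)               # weights, normalized over selected experts
        delta = torch.zeros_like(h_patches)
        for slot in range(self.top_k):                       # accumulate weighted expert outputs
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e
                if sel.any():
                    delta[sel] += w[sel, slot, None] * expert(h_patches[sel])
        return delta                                         # caller adds this to FFN(h_patches)
```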
2.2 Mixture of Attribute Experts (MoAE) Head
After the final transformer layer, MoAE processes the set of patch tokens $\{\mathbf h_m\}_{m=1}^{M}$:
- Attribute Transform: Each token is projected into attribute space:
$\mathbf a_m = \operatorname{AT}(\mathbf h_m)\;\in\;\mathbb R^{A}$
forming the matrix $\mathbf A = [\mathbf a_1,\ldots,\mathbf a_M]\in\mathbb R^{A\times M}$.
- Attribute Router: For each attribute $a$, a sparse patch-wise heatmap $\mathbf f_a$ over the $M$ patches is computed:
$\mathbf f_a = \operatorname{softmax}\left(\frac{\mathcal M_j(W^{A}\,\mathbf A_{[a,:]})}{\tau}\right)$
where $\mathcal M_j$ retains only the top-$j$ patches. The mask is enforced via the straight-through Gumbel trick during training and by hard selection at test time.
- Sparse Attribute Map and Pooling: The attribute-activated patch map weights each attribute's activations $\mathbf A_{[a,:]}$ by its heatmap $\mathbf f_a$, and pooling over patches yields the global descriptor $\mathbf z\in\mathbb R^{A}$, one coordinate per attribute. Sparsity is imposed directly by the hard top-$j$ mask, obviating additional attribute reconstruction terms (see the sketch below).
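A minimal sketch of such a head follows, assuming the router transform is folded into the attribute projection and that pooling is heatmap-weighted summation over patches; names, defaults, and the soft Gumbel relaxation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoAE(nn.Module):
    """Attribute transform + sparse attribute routing + pooled global descriptor (sketch)."""
    def __init__(self, dim, num_attributes, top_j=4, tau=0.5):
        super().__init__()
        self.attr_transform = nn.Linear(dim, num_attributes)    # AT(.)
        self.top_j, self.tau = top_j, tau

    def forward(self, h_patches):
        # h_patches: (M, D) final-layer patch tokens
        A = self.attr_transform(h_patches).t()                   # (A, M) attribute activations
        logits = A                                               # router transform folded in (assumption)
        topv, topi = logits.topk(self.top_j, dim=-1)             # M_j: keep the top-j patches per attribute
        masked = torch.full_like(logits, float('-inf')).scatter(-1, topi, topv)
        if self.training:
            # soft Gumbel relaxation; the paper's straight-through variant is simplified here
            f = F.gumbel_softmax(masked, tau=self.tau, dim=-1)
        else:
            f = F.softmax(masked / self.tau, dim=-1)             # hard top-j support at test time
        z = (f * A).sum(dim=-1)                                  # (A,) heatmap-weighted pooling (assumed rule)
        return z, f                                              # global descriptor + per-attribute heatmaps
```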
3. Loss Functions and Regularization
Model training is driven by a composite loss:
- Classification Loss ($\mathcal L_{\rm cls}$): The score $s_c$ for class $c$ measures the compatibility of the global attribute descriptor $\mathbf z$ with the class attribute signature, and the scores $\mathbf s$ over seen classes are trained with softmax cross-entropy:
$\mathcal L_{\rm cls} = \operatorname{CE}(\operatorname{softmax}(\mathbf s), y^{s})$
- Load Balancing ($\mathcal L_{\rm bal}$): Prevents expert collapse by constraining the average routing distribution across experts.
- Cross-layer Consistency ($\mathcal L_{\rm cons}$): Encourages stable routing over depth:
$\mathcal L_{\rm cons} = \frac1{M\,L}\sum_{m,\ell} \operatorname{KL}\left(\mathbf w_m^{(\ell)}\,\|\,\bar{\mathbf w}_m\right)$
where $\bar{\mathbf w}_m$ is the routing distribution of token $m$ averaged over layers.
- Diversity Loss ($\mathcal L_{\rm div}$): Promotes expert exploration by maximizing the entropy of the per-token routing distributions.
This loss suite ensures that every expert is utilized without collapse, that token routing remains consistent across depth, and that the network explores attribute specializations during training. A sketch of the combined objective is given below.
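In this sketch, the dot-product class scoring, the uniformization-style load-balancing term, and the entropy-based diversity term are common stand-ins, since their exact formulations are not reproduced above; function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def acr_loss(z, class_attrs, label, routing, lam_bal=1.0, lam_cons=1.0, lam_div=1.0):
    """
    z:           (A,)      global attribute descriptor
    class_attrs: (C, A)    seen-class attribute signatures (assumed scoring rule)
    label:       ()        ground-truth seen-class index (LongTensor)
    routing:     (L, M, E) per-layer, per-token expert routing distributions w_m^(l)
    """
    eps = 1e-9
    # Classification: compatibility scores s_c = <z, phi_c>, softmax cross-entropy over seen classes
    scores = class_attrs @ z
    l_cls = F.cross_entropy(scores.unsqueeze(0), label.unsqueeze(0))

    # Load balancing (assumed form): keep the token/layer-averaged routing close to uniform
    mean_route = routing.mean(dim=(0, 1))                             # (E,)
    uniform = torch.full_like(mean_route, 1.0 / mean_route.numel())
    l_bal = (mean_route * ((mean_route + eps).log() - uniform.log())).sum()

    # Cross-layer consistency: KL(w_m^(l) || w_bar_m), with w_bar_m averaged over depth
    w_bar = routing.mean(dim=0, keepdim=True)                         # (1, M, E)
    l_cons = (routing * ((routing + eps).log() - (w_bar + eps).log())).sum(-1).mean()

    # Diversity: maximize entropy of per-token routing (minimize negative entropy)
    l_div = (routing * (routing + eps).log()).sum(-1).mean()

    return l_cls + lam_bal * l_bal + lam_cons * l_cons + lam_div * l_div
```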
4. Diagnosing and Resolving Attribute Entanglement
Empirical analyses using t-SNE, expert patch collages, and attribute heatmaps demonstrate effective attribute disentanglement:
- t-SNE Visualization: For vanilla ViT, the final CLS embeddings display overlapping class clusters and high intra-class variance. The ACR global attribute vector produces tightly clustered, well-separated points, indicating clean attribute separation.
- Expert Patch Collages: Each MoPE expert consistently specializes in a coherent visual pattern (“fine mottled textures,” “solid color fields,” “feather edges”), confirming that color/shape/pattern families are not entangled.
- Attribute Heatmaps: MoAE accurately localizes part-level attributes such as “breast: solid” or “throat_color: white” on relevant spatial regions, enabling interpretable inspection of which image parts contribute to each attribute.
Vanilla ViT models lack such interpretability or spatial grounding, operating instead on fused, indiscriminate embeddings.
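As an illustration of how such heatmaps can be rendered, the snippet below reshapes a per-attribute patch distribution to the ViT patch grid and upsamples it for overlay on the input image; the grid size, image size, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_heatmap(f, attr_idx, grid=14, image_size=224):
    """f: (A, M) per-attribute patch heatmaps from the MoAE router, with M = grid * grid."""
    h = f[attr_idx].reshape(1, 1, grid, grid)                  # heatmap on the patch grid
    h = F.interpolate(h, size=(image_size, image_size),
                      mode='bilinear', align_corners=False)    # upsample to pixel resolution
    return h.squeeze()                                         # (image_size, image_size) map to overlay
```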
5. Quantitative Results and Benchmark Performance
Performance comparisons on the CUB, AwA2, and SUN fine-grained ZSL/GZSL benchmarks highlight the empirical advantages of ACR. The following summarizes zero-shot top-1 accuracy (ZSL, T1) and generalized ZSL unseen accuracy (U), seen accuracy (S), and harmonic mean (H):
| Dataset | Metric | ACR | Prior Best |
|---|---|---|---|
| CUB | ZSL (T1) | 80.9 | 78.9 |
| CUB | GZSL U | 72.8 | 69.4 |
| CUB | GZSL S | 82.2 | 78.2 |
| CUB | GZSL H | 77.2 | 73.6 |
| AwA2 | ZSL (T1) | 79.1 | 76.6 |
| AwA2 | GZSL U | 74.1 | 71.8 |
| AwA2 | GZSL S | 86.3 | 84.3 |
| AwA2 | GZSL H | 79.7 | 77.6 |
| SUN | ZSL (T1) | 76.5 | 75.3 |
| SUN | GZSL U | 60.0 | 59.4 |
| SUN | GZSL S | 51.0 | 49.1 |
| SUN | GZSL H | 55.1 | 53.8 |
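As a quick consistency check on the table, the GZSL harmonic mean follows the standard definition $H = \frac{2\,S\,U}{S+U}$; the reported ACR values satisfy it:

```python
def gzsl_harmonic(u, s):
    """Harmonic mean of unseen (U) and seen (S) accuracies, in percent."""
    return 2 * s * u / (s + u)

print(round(gzsl_harmonic(72.8, 82.2), 1))  # CUB : 77.2
print(round(gzsl_harmonic(74.1, 86.3), 1))  # AwA2: 79.7
print(round(gzsl_harmonic(60.0, 51.0), 1))  # SUN : 55.1
```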
Key observations:
- On CUB, the harmonic mean H improves by 3.6 points over the prior best.
- The corresponding gains in H on AwA2 and SUN are 2.1 and 1.3 points, respectively.
- These results confirm that expert-driven, part-aware disentanglement is critical for transfer in fine-grained ZSL scenarios.
6. Component Ablations and Qualitative Insights
Ablation experiments on CUB/AwA2 show the dependence of overall performance on individual components:
| Component Removed | Δ (CUB) | Δ (AwA2) |
|---|---|---|
| $\mathcal L_{\rm cons}$ | -2.6 | -6.4 |
| | -0.9 | -1.9 |
| | -1.7 | -3.1 |
| Patch-level router | -3.1 | -3.3 |
The largest drops are observed when removing patch-level routing and the cross-layer consistency loss, signifying their essential role in the disentanglement process.
Qualitative examples show attribute-wise activations highlighting corresponding object parts (e.g., bird head, breast, wings) and attribute specialists (color, shape, pattern) emerging among experts. These findings further support the interpretability and specialization facilitated by the ACR approach.
ACR introduces a robust methodology for compositional zero-shot transfer by injecting attribute-specific inductive biases and leveraging conditional routing within vision transformers. MoPE and MoAE modules, in synergy with targeted loss functions, ensure an embedding whose coordinates correspond directly to disentangled, spatially grounded attributes. This architecture achieves state-of-the-art results on fine-grained ZSL tasks and enables direct interpretability through part-aware attribute heatmaps (Chen et al., 13 Dec 2025).