Cardinality-Preserved Attention (CPA)
- Cardinality-Preserved Attention (CPA) is a mechanism that retains explicit count information by modifying standard softmax normalization in attention models.
- It employs alternative aggregation methods, such as fixed-scaling or batch-normalized weighted sums in slot attention and additive or scaled aggregators in GNNs, to re-inject cardinality signals and restore discriminative capacity.
- Empirical results in object-centric segmentation and graph classification demonstrate that CPA boosts performance and robustness when input set sizes vary.
Cardinality-Preserved Attention (CPA) refers to a class of attention mechanisms designed to retain information about the number of attended entities in the output of attention layers. In traditional attention architectures, such as dot-product attention in Transformers or message-passing attention in Graph Neural Networks (GNNs), normalization by the attention weights' sum (i.e., a weighted mean) eliminates explicit signals about the cardinality (the count) of inputs aggregated. CPA introduces modifications to standard attention that preserve this cardinality information, improving generalization to variable set sizes and enhancing discriminative power, particularly in object-centric learning and structured data scenarios.
1. Cardinality Loss in Standard Attention
Standard attention mechanisms, as implemented in Transformer architectures and GNNs, employ a softmax-based normalization over attention scores. In this setup, given a set of inputs and queries (e.g., slots, nodes), attention coefficients $A_{ij}$ (or $\alpha_{vu}$ in GNNs) compute relevance between tokens or neighbors. Traditional aggregation forms a normalized weighted mean:

$$\tilde{u}_j = \frac{\sum_{i} A_{ij}\, v_i}{\sum_{i'} A_{i'j}},$$

where $v_i$ are the value vectors and $\tilde{u}_j$ is the aggregated update for query (slot) $j$.
This normalization causes all outputs to lose information about how many entities contribute, or how much total attention mass is assigned, i.e., the absolute "count" of contributing entities. As shown both in slot-attention models for unsupervised object-centric learning and in theoretical analysis of GNN attention aggregators, this renders the network unable to distinguish scenarios in which the same feature set is repeated with different multiplicities, and it degrades behavior when set sizes at inference differ from those seen during training (Krimmel et al., 2024, Zhang et al., 2019).
Formally, for any differentiable attention function, there is no function of the normalized output $\tilde{u}_j$ alone that recovers the column sum of attention assignments $\sum_i A_{ij}$. Thus, outputs become invariant under "multiplicities," causing a loss of discriminative capacity and degrading generalization when the set size changes between training and test.
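The following NumPy sketch (an illustration, not taken from the cited papers) makes this collapse concrete in the slot-attention setting: duplicating every input token leaves the weighted-mean update unchanged, while the per-slot attention mass, the very signal CPA preserves, doubles.

```python
import numpy as np

rng = np.random.default_rng(0)
slots = rng.normal(size=(2, 4))        # 2 slots (queries), feature dim 4
X = rng.normal(size=(3, 4))            # 3 input tokens (used as values here)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def slot_update_weighted_mean(slots, X):
    """Standard slot-attention update: softmax over slots, weighted MEAN over inputs."""
    A = softmax(X @ slots.T, axis=1)                     # (n_inputs, n_slots), rows sum to 1
    return (A.T @ X) / A.sum(axis=0, keepdims=True).T    # divide out the attention mass

X_dup = np.concatenate([X, X], axis=0)                   # the same tokens, each repeated twice

# The weighted mean cannot tell the two multisets apart:
print(np.allclose(slot_update_weighted_mean(slots, X),
                  slot_update_weighted_mean(slots, X_dup)))   # True

# ...even though the per-slot attention mass (the "count" signal) has doubled:
A, A_dup = softmax(X @ slots.T, axis=1), softmax(X_dup @ slots.T, axis=1)
print(A.sum(axis=0), A_dup.sum(axis=0))                  # second is twice the first
```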
2. Cardinality-Preserved Attention Formulations
CPA introduces alternative aggregation operations, either by incorporating unnormalized sums or scaling coefficients, that re-inject cardinality signals into the aggregation layer.
CPA in Object-Centric and Transformer Architectures
Two primary CPA variants are proposed for object-centric networks (Krimmel et al., 2024):
- Weighted Sum with Fixed Scaling (fixed-C variant):

  $$\tilde{u}_j = \frac{1}{N} \sum_{i} A_{ij}\, v_i$$

  Here, omitting the divisor $\sum_{i'} A_{i'j}$ and scaling by a fixed constant $1/N$ (the input size) allows the norm of $\tilde{u}_j$ to increase with the number of contributing tokens, rendering the cardinality observable (both variants are sketched in code after this list).
- Batch-Normalized Weighted Sum:

  $$\tilde{u}_j = \mathrm{BN}\!\left(\sum_{i} A_{ij}\, v_i\right), \qquad \mathrm{BN}(x) = \gamma\,\frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

  Here, the batch statistics $\mu$ and $\sigma^2$ are computed across the batch, slot, and feature axes, and $\gamma$, $\beta$ are learned scalars. Because this normalization is affine, the total attention mass is preserved in the output scale.
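A minimal PyTorch sketch of both aggregation variants follows; it assumes attention weights `A` of shape `(batch, n_inputs, n_slots)` and values `v` of shape `(batch, n_inputs, d)`, and is an illustrative reading of the formulas above rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def weighted_mean(A, v, eps=1e-8):
    """Baseline slot-attention aggregation: normalized weighted mean."""
    # A: (B, N, S) attention over slots per input token; v: (B, N, D) values
    return torch.einsum('bns,bnd->bsd', A, v) / (A.sum(dim=1).unsqueeze(-1) + eps)

def fixed_scaling(A, v):
    """CPA, fixed-C variant: weighted sum divided by the constant input size N."""
    N = v.shape[1]
    return torch.einsum('bns,bnd->bsd', A, v) / N

class BatchNormWeightedSum(nn.Module):
    """CPA, batch-normalized variant: unnormalized weighted sum followed by an
    affine normalization whose statistics span the batch, slot, and feature axes."""
    def __init__(self, eps: float = 1e-5, momentum: float = 0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.gamma = nn.Parameter(torch.ones(()))   # learned scalar scale
        self.beta = nn.Parameter(torch.zeros(()))   # learned scalar shift
        self.register_buffer('running_mean', torch.zeros(()))
        self.register_buffer('running_var', torch.ones(()))

    def forward(self, A, v):
        u = torch.einsum('bns,bnd->bsd', A, v)      # unnormalized weighted sum
        if self.training:
            mean, var = u.mean(), u.var(unbiased=False)
            with torch.no_grad():
                self.running_mean.lerp_(mean.detach(), self.momentum)
                self.running_var.lerp_(var.detach(), self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        return self.gamma * (u - mean) / torch.sqrt(var + self.eps) + self.beta
```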
CPA in Graph Neural Networks
CPA mechanisms for GNNs (Zhang et al., 2019):
- Additive CPA: augments the attention-aggregated sum with an unweighted sum over neighbors, restoring injectivity with respect to neighbor multiplicity:

  $$\mathrm{AGG}(v) = \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, h_u \;+\; w \odot \sum_{u \in \mathcal{N}(v)} h_u,$$

  where $w$ is a learned vector and $\odot$ denotes elementwise multiplication.
- Scaled CPA: applies a scaling function of the neighbor cardinality to the attention aggregation:

  $$\mathrm{AGG}(v) = \psi\big(|\mathcal{N}(v)|\big) \odot \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, h_u,$$

  where $\psi$ is an injective embedding of the neighbor count.
*Fixed-weight additive and scaled variants* (f-Additive, f-Scaled), obtained by fixing $w$ and $\psi$ to constants, offer computationally lightweight implementations that require no additional learnable components (all four aggregators are sketched below).
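A dense-adjacency sketch of these aggregators (illustrative only; the class and argument names are hypothetical, and the fixed choices $w = \mathbf{1}$ and $\psi(n) = n$ are one natural instantiation of the f-variants) could look like:

```python
import torch
import torch.nn as nn

class CPAAggregation(nn.Module):
    """Cardinality-preserving aggregation for attention-based GNN layers (dense form).

    h:     (n_nodes, d)        node features (already projected values)
    alpha: (n_nodes, n_nodes)  attention coefficients, zero outside each neighborhood
    adj:   (n_nodes, n_nodes)  binary adjacency matrix
    """
    def __init__(self, d: int, variant: str = 'additive'):
        super().__init__()
        self.variant = variant
        if variant == 'additive':
            self.w = nn.Parameter(torch.ones(d))    # learned vector w
        elif variant == 'scaled':
            self.psi = nn.Linear(1, d)              # injective embedding of |N(v)|

    def forward(self, h, alpha, adj):
        attn_sum = alpha @ h                        # sum_u alpha_vu * h_u
        degree = adj.sum(dim=1, keepdim=True)       # |N(v)| per node
        if self.variant == 'additive':
            return attn_sum + self.w * (adj @ h)    # + w * sum_u h_u
        if self.variant == 'scaled':
            return self.psi(degree) * attn_sum      # psi(|N(v)|) * attention sum
        if self.variant == 'f-additive':            # fixed choice: w = 1
            return attn_sum + adj @ h
        if self.variant == 'f-scaled':              # fixed choice: psi(n) = n
            return degree * attn_sum
        raise ValueError(f'unknown variant: {self.variant}')
```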
3. Theoretical Properties and Discriminability
The theoretical limitation of standard attention-based aggregation is its non-injectivity on multisets differing only in multiplicity: identical features with different counts yield identical aggregated outputs. CPA variants restore injectivity with respect to set cardinality: if a multiset $X'$ contains $k$ times as many copies of each element of $X$, the CPA output grows correspondingly, making the network sensitive to actual set sizes (Zhang et al., 2019, Krimmel et al., 2024).
In object-centric models, it is proven that recovering the total assignment mass $\sum_i A_{ij}$ from standard normalized updates is impossible for general value maps, but feasible for almost all linear value maps when scaled weighted sums are used. In GNNs, CPA lifts the discriminative bound to that of injective sum-based aggregators, matching the power of the 1-WL graph isomorphism test for the relevant cases.
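A small standalone check (an illustration under the same notation as above, not taken from the papers) contrasts the two behaviors: the normalized aggregator collapses a multiset and its duplicated copy, while an f-Scaled aggregator grows with the multiplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=(3, 4))                        # neighbor features
scores = rng.normal(size=3)
alpha = np.exp(scores) / np.exp(scores).sum()      # softmax attention weights

def mean_agg(alpha, h):
    return (alpha @ h) / alpha.sum()               # standard normalized aggregation

def f_scaled_agg(alpha, h):
    return len(h) * (alpha @ h) / alpha.sum()      # scaled by the cardinality |N(v)|

# Duplicate every neighbor: multiplicities change, the underlying set does not.
h2 = np.concatenate([h, h])
alpha2 = np.concatenate([alpha, alpha]) / 2        # softmax over duplicates halves each weight

print(np.allclose(mean_agg(alpha, h), mean_agg(alpha2, h2)))              # True: collapse
print(np.allclose(2 * f_scaled_agg(alpha, h), f_scaled_agg(alpha2, h2)))  # True: output doubles
```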
4. Empirical Results and Quantitative Analysis
In unsupervised object segmentation tasks, experiments on the CLEVR and MOVi datasets quantitatively demonstrate that CPA-enhanced normalization schemes maintain high segmentation performance as the number of objects or slots increases beyond the training regime. For example, the Foreground Adjusted Rand Index (F-ARI) on CLEVR increases from 0.75 (baseline) to 0.84–0.87 (CPA variants) as the slot count scales (Krimmel et al., 2024). Batch-normalized CPA achieves an F-ARI of 0.81 versus a baseline of 0.74 on MOVi-D in a zero-shot regime.
In GNNs, across node and graph classification tasks (TRIANGLE-NODE, REDDIT-BINARY/MULTI5K, MUTAG, PROTEINS, ENZYMES, NCI1), CPA variants consistently and significantly outperform standard attention models. For instance, in the TRIANGLE-NODE task, CPA accuracy exceeds 91% (original GAT: 78%). On social network datasets with collapsible multisets, CPA boosts accuracy from 50% (chance) to above 92% (Zhang et al., 2019). These results confirm that preservation of cardinality confers practical benefits in tasks where set size or local structure varies.
| Model (GNN) | REDDIT-BINARY (acc. %) | MUTAG (acc. %) |
|---|---|---|
| Original | 50.00 ± 0.00 | 84.96 ± 7.65 |
| CPA f-Scaled | 92.57 ± 2.06 | 90.44 ± 6.44 |
Ablation studies isolate the normalization mechanism as the key factor underpinning these improvements.
5. Algorithmic Implementation
CPA is implemented by modifying only the value aggregation mechanism in attention modules. In slot attention, this change corresponds to replacing the weighted mean with a scaled weighted sum or a batch-normalized sum. In GNNs, the AGG step of the message-passing layer is substituted with one of the CPA-specific formulas (additive, scaled, or their fixed-weight variants), as detailed in the pseudocode provided by Zhang et al. (2019). No architectural changes beyond the normalization are required, ensuring minimal computational overhead and easy integration into existing frameworks.
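To illustrate how localized the modification is, the sketch below shows one slot-attention update step in which the aggregation mode is the only switch point; the projection and GRU components follow the usual slot-attention recipe, and the function and argument names are hypothetical.

```python
import torch

def slot_attention_step(slots, k, v, q_proj, gru, agg='weighted_mean', eps=1e-8):
    """One slot-attention iteration; `agg` is the only point where CPA differs."""
    B, N, D = v.shape
    q = q_proj(slots)                                    # (B, S, D)
    logits = torch.einsum('bsd,bnd->bns', q, k) / D**0.5
    A = logits.softmax(dim=-1)                           # softmax over slots, per input token

    updates = torch.einsum('bns,bnd->bsd', A, v)         # unnormalized weighted sum
    if agg == 'weighted_mean':                           # baseline: divide by attention mass
        updates = updates / (A.sum(dim=1).unsqueeze(-1) + eps)
    elif agg == 'fixed_scaling':                         # CPA: divide by the constant N instead
        updates = updates / N
    # (the batch-normalized CPA variant would apply a BN layer to `updates` here)

    return gru(updates.reshape(-1, D), slots.reshape(-1, D)).view_as(slots)
```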
6. Extensions, Limitations, and Open Questions
CPA is applicable to any dot-product attention architecture, including Transformers, provided that aggregation moves from a softmax-weighted mean to a (possibly scaled) weighted sum or an affine normalization. In the context of Transformers, per-head batch normalization or fixed scaling of the attention sum can inject cardinality awareness into hidden representations (Krimmel et al., 2024).
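One speculative way to realize the fixed-scaling option in a Transformer head is sketched below: the softmax denominator is dropped and replaced with a fixed $1/N$ scale, so the output magnitude reflects the attention mass a query collects. This is an interpretation of the idea, not an implementation from (Krimmel et al., 2024).

```python
import torch

def fixed_scale_attention(q, k, v):
    """Speculative cardinality-aware single-head attention: the softmax
    denominator is replaced by a fixed 1/N scale, so repeating keys/values
    changes the output magnitude instead of being averaged away.
    q: (B, T, D) queries; k, v: (B, N, D) keys and values."""
    D, N = q.shape[-1], k.shape[1]
    scores = torch.einsum('btd,bnd->btn', q, k) / D**0.5
    # Max-subtraction for numerical stability; it rescales per query but
    # keeps the multiplicity signal (duplicated keys still double the mass).
    w = (scores - scores.max(dim=-1, keepdim=True).values).exp()
    return torch.einsum('btn,bnd->btd', w, v) / N
```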
Expected advantages include improved generalization to unseen set sizes and better robustness to distribution shifts in input cardinality. However, practical deployment raises open questions: tuning the scaling constant $1/N$ or the learned normalization parameters $\gamma, \beta$, stability with very large set sizes, and the interaction with multi-head attention all require further study. Additionally, the impact of CPA on downstream tasks outside segmentation and classification (e.g., tracking, detection) and in architectures with nonlinear "value" networks remains an active area of investigation.
7. Relation to Broader Attention Research
CPA contributes to the evolving understanding of limitations in softmax-based attention mechanisms, addressing a gap analogous to the difference between sum-, mean-, or max-pooling in permutation-invariant function design. By enhancing the theoretical expressivity and practical performance of attention-based models in varying set-size scenarios, CPA stands as a general principle for future attention architectures in deep learning for graphs, vision, and structured data.
References:
- "Attention Normalization Impacts Cardinality Generalization in Slot Attention" (Krimmel et al., 2024)
- "Improving Attention Mechanism in Graph Neural Networks via Cardinality Preservation" (Zhang et al., 2019)