Cosine Anchor Loss (CAL) for Deep Learning
- Cosine Anchor Loss (CAL) is a supervised geometric loss that aligns deep features with fixed, unit-norm class anchors using cosine similarity.
- It enforces intra-class compactness and inter-class angular separation without extra learnable parameters or margin tuning.
- CAL integrates efficiently into training pipelines for classification and deep hashing, achieving competitive performance on benchmarks.
Cosine Anchor Loss (CAL) is a supervised geometric loss formulation for deep representation learning, in which each class is associated with a fixed “anchor” vector in feature space, and learning proceeds by aligning network outputs for each sample with the anchor for its class using a cosine similarity objective. CAL has been established as a versatile and effective technique in both classification and deep hashing/regression contexts, providing a unified approach for producing discriminative and quantizable representations with a single loss term (Hao et al., 2018, Hoe et al., 2021).
1. Formal Definition and Mathematical Formulation
Let $f_\theta(x_i) \in \mathbb{R}^d$ denote the deep feature vector generated by a neural network (parameters $\theta$) for input $x_i$, and let $\{a_c\}_{c=1}^{C}$ denote fixed anchor vectors, one per class. CAL evaluates the angular proximity of features to their class anchors using the cosine similarity,

$$\cos\big(f_\theta(x_i), a_c\big) = \frac{f_\theta(x_i)^\top a_c}{\lVert f_\theta(x_i)\rVert_2 \, \lVert a_c\rVert_2}.$$

Defining class probabilities for input $x_i$ as

$$p(y = c \mid x_i) = \frac{\exp\!\big(\cos(f_\theta(x_i), a_c)\big)}{\sum_{c'=1}^{C} \exp\!\big(\cos(f_\theta(x_i), a_{c'})\big)},$$

the Cosine Anchor Loss over a minibatch of size $N$ is the negative log-likelihood,

$$\mathcal{L}_{\mathrm{CAL}} = -\frac{1}{N} \sum_{i=1}^{N} \log p\big(y = y_i \mid x_i\big),$$

which, up to an additive constant, is equivalent to maximizing the cosine similarity between $f_\theta(x_i)$ and its class anchor $a_{y_i}$.

In deep hashing, each anchor $a_c \in \{-1, +1\}^{K}$ is a binary, mutually orthogonal code. The loss becomes

$$\mathcal{L}_{\mathrm{hash}} = -\frac{1}{N} \sum_{i=1}^{N} \frac{z_i^\top a_{y_i}}{\lVert z_i\rVert_2 \, \lVert a_{y_i}\rVert_2},$$

where $z_i = f_\theta(x_i) \in \mathbb{R}^{K}$ is the continuous embedding output of the network for $x_i$ (Hoe et al., 2021).
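As a concrete illustration of both variants, the following minimal sketch (assuming PyTorch; the function names `cal_classification_loss` and `cal_hashing_loss` and the precomputed `anchors` matrix are illustrative, not taken from the cited papers) implements the cross-entropy-over-cosine and negative-cosine objectives defined above.

```python
import torch
import torch.nn.functional as F


def cal_classification_loss(features: torch.Tensor,
                            anchors: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over cosine similarities to fixed class anchors.

    features: (N, d) network outputs; anchors: (C, d) fixed unit-norm anchors;
    labels: (N,) integer class indices.
    """
    # Cosine similarity between every feature and every anchor gives (N, C) logits.
    logits = F.normalize(features, dim=1) @ F.normalize(anchors, dim=1).t()
    return F.cross_entropy(logits, labels)


def cal_hashing_loss(codes: torch.Tensor,
                     anchors: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity to each sample's binary class anchor.

    codes: (N, K) continuous embeddings; anchors: (C, K) float tensors with
    entries in {-1, +1}; labels: (N,) integer class indices.
    """
    targets = anchors[labels]                                # (N, K) per-sample anchors
    return -F.cosine_similarity(codes, targets, dim=1).mean()
```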
2. Anchor Construction and Design Principles
The utility of CAL depends on the principled construction of anchor sets:
- Unit-Norm Constraint: All anchors satisfy $\lVert a_c\rVert_2 = 1$. This places them on the unit hypersphere, ensuring scale invariance for cosine similarity.
- Maximal Angular Separation: Anchors are chosen to maximize the minimum pairwise angle; i.e., for all $c \neq c'$, the angle between $a_c$ and $a_{c'}$ is made as large as possible (equivalently, $a_c^\top a_{c'}$ is minimized). For small $C$ and $d$, anchors can be evenly spaced (e.g., in 2D as the vertices of a regular polygon); for high $d$, random orthonormal basis vectors or spherical code constructions are used (Hao et al., 2018). In deep hashing, binary anchors are synthesized from a Hadamard matrix or by orthogonalization of random sign vectors (Hoe et al., 2021); see the construction sketch after this list.
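A possible construction of such anchor sets is sketched below (NumPy and SciPy assumed; the helper names `random_orthonormal_anchors` and `hadamard_binary_anchors` are illustrative, not from the cited papers).

```python
import numpy as np
from scipy.linalg import hadamard


def random_orthonormal_anchors(num_classes: int, dim: int) -> np.ndarray:
    """Unit-norm, mutually orthogonal real anchors (requires dim >= num_classes)."""
    assert num_classes <= dim, "orthogonality requires dim >= num_classes"
    rng = np.random.default_rng(0)
    # QR factorization of a random Gaussian matrix yields orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    return q.T                                   # (num_classes, dim), rows orthonormal


def hadamard_binary_anchors(num_classes: int, code_length: int) -> np.ndarray:
    """Binary +/-1 anchors from a Hadamard matrix (code_length must be a power of 2)."""
    h = hadamard(code_length)                    # (code_length, code_length), entries +/-1
    assert num_classes <= code_length, "need at least as many Hadamard rows as classes"
    return h[:num_classes]                       # mutually orthogonal binary codes
```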
3. Rationale for Cosine Similarity and Loss Properties
Cosine Anchor Loss is characterized by several favorable properties:
- Scale-Invariance: Only angular direction matters, so the feature norm and anchor norm do not influence the loss directly.
- Intra-Class Compactness: All features of a class are driven to the direction of that class’s anchor, enhancing compactness.
- Inter-Class Angular Separation: Enforced by anchor choice on the sphere, classes are discriminated based on large mutual angles.
- No Learnable Centers or Margins: Anchors are fixed, eliminating the need for additional parameters or angular margin tuning as in L-Softmax (Hao et al., 2018). Compared to Euclidean approaches (which constrain both norm and direction), the cosine loss avoids norm imbalance or over-regularization.
A plausible implication is that this approach can be easily extended to strict quantization in hashing by using binary orthogonal anchors, eliminating the need for supplementary loss terms such as quantization or decorrelation penalties (Hoe et al., 2021).
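To make this implication concrete, the following minimal sketch (NumPy assumed; helper names are illustrative) shows how embeddings trained against binary anchors can be quantized by the sign function alone and compared by Hamming distance, with no auxiliary quantization or decorrelation terms.

```python
import numpy as np


def binarize(codes: np.ndarray) -> np.ndarray:
    """Sign binarization of continuous embeddings into {-1, +1} hash codes."""
    return np.where(codes >= 0, 1, -1)


def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distances between one +/-1 query code and a database of codes."""
    k = database.shape[1]
    # For +/-1 codes, Hamming distance = (K - <query, db_code>) / 2.
    return (k - database @ query) // 2
```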
4. Optimization Methods and Training Protocol
CAL is efficiently optimized via standard minibatch SGD or Adam, integrating seamlessly into modern deep learning frameworks. The following outlines the core algorithms:
Classification setting (Hao et al., 2018):
- Compute features $f_\theta(x_i)$ for each minibatch.
- For each anchor $a_c$, calculate $\cos\big(f_\theta(x_i), a_c\big)$.
- Logits: $s_{i,c} = \cos\big(f_\theta(x_i), a_c\big)$; softmax over $c$ produces $p(y = c \mid x_i)$.
- Loss is cross-entropy as above.
Hashing setting (Hoe et al., 2021):
- The negative cosine similarity is directly minimized between each output $z_i$ and its designated binary anchor $a_{y_i}$.
The gradient of a single cosine similarity term (for feature $z$ and anchor $a$) is

$$\frac{\partial}{\partial z} \cos(z, a) = \frac{a}{\lVert z\rVert_2 \, \lVert a\rVert_2} - \frac{(z^\top a)\, z}{\lVert z\rVert_2^{3} \, \lVert a\rVert_2},$$
and the gradient with respect to network weights follows via the chain rule (Hao et al., 2018, Hoe et al., 2021).
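The analytic gradient can be sanity-checked numerically; the short NumPy sketch below (illustrative only, not from either paper) compares it against central finite differences.

```python
import numpy as np


def cos_sim(z: np.ndarray, a: np.ndarray) -> float:
    return float(z @ a / (np.linalg.norm(z) * np.linalg.norm(a)))


def cos_sim_grad(z: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Analytic gradient of cos(z, a) with respect to z (formula above)."""
    nz, na = np.linalg.norm(z), np.linalg.norm(a)
    return a / (nz * na) - (z @ a) * z / (nz**3 * na)


rng = np.random.default_rng(0)
z, a = rng.standard_normal(8), rng.standard_normal(8)
eps = 1e-6
# Central finite differences along each coordinate direction.
numeric = np.array([
    (cos_sim(z + eps * e, a) - cos_sim(z - eps * e, a)) / (2 * eps)
    for e in np.eye(8)
])
assert np.allclose(numeric, cos_sim_grad(z, a), atol=1e-6)
```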
Recommended hyperparameters and regularization:
- Feature dimension: 256–512 for classification; for hashing, equal to the target code length $K$.
- Learning rate: $0.1$ with SGD for classification; Adam for hashing.
- Batch size: 128–256.
- Dropout: up to $0.25$ after pooling.
- Optimization with momentum ($0.9$), weight decay, and data augmentation (random cropping, flipping) further increases effectiveness (Hao et al., 2018, Hoe et al., 2021).
- In hashing, batch normalization on the code layer ensures automatic bit-balance.
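For orientation, the hedged sketch below collects the listed hyperparameters into a PyTorch optimizer configuration; the stand-in backbone and the weight-decay value are hypothetical placeholders, not taken from the papers.

```python
import torch

# Stand-in feature extractor purely for illustration: a 512-d feature layer
# (within the recommended 256-512 range) with dropout 0.25 after it.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 32 * 32, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.25),
)

# Classification: SGD with learning rate 0.1 and momentum 0.9, as listed above.
# The weight-decay value is a hypothetical placeholder.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Hashing: Adam; the papers' exact learning-rate schedule is not reproduced here.
adam = torch.optim.Adam(model.parameters())
```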
5. Extensions: Multi-Label Setting and Bit-Balance
CAL is directly extensible to multi-label classification and deep hashing via the construction of soft anchors. Given a binary (multi-hot) label vector $y_i \in \{0,1\}^{C}$, a convex combination of class anchors forms the target,

$$\tilde{a}_i = \sum_{c=1}^{C} w_{i,c}\, a_c, \qquad w_{i,c} = \frac{\tilde{y}_{i,c}}{\sum_{c'} \tilde{y}_{i,c'}},$$

where $\tilde{y}_i$ is the label vector after label smoothing (a small smoothing value is recommended). The loss is then the negative cosine similarity between $z_i$ and $\tilde{a}_i$ (Hoe et al., 2021).
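A possible realization of this soft-anchor construction (PyTorch assumed; the helper names and the smoothing default are illustrative, and the papers' recommended smoothing value is not reproduced here):

```python
import torch
import torch.nn.functional as F


def soft_anchor_targets(multi_labels: torch.Tensor,
                        anchors: torch.Tensor,
                        smoothing: float = 0.1) -> torch.Tensor:
    """Convex combination of class anchors weighted by smoothed multi-hot labels.

    multi_labels: (N, C) in {0, 1}; anchors: (C, K); smoothing: illustrative default.
    """
    y = multi_labels.float()
    # Label smoothing: move a little mass from present classes to absent ones.
    y = y * (1.0 - smoothing) + smoothing / y.shape[1]
    weights = y / y.sum(dim=1, keepdim=True)          # convex-combination weights
    return weights @ anchors                          # (N, K) soft anchor targets


def multilabel_cal_loss(codes: torch.Tensor,
                        multi_labels: torch.Tensor,
                        anchors: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity to each sample's soft anchor."""
    targets = soft_anchor_targets(multi_labels, anchors)
    return -F.cosine_similarity(codes, targets, dim=1).mean()
```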
Bit-balance and code orthogonality in hashing are achieved using batch normalization before binarization and the anchor construction method, without explicit penalty terms for bit-balance or code decorrelation (Hoe et al., 2021).
6. Empirical Results and Comparative Analysis
Empirical evaluation demonstrates that CAL yields improved or competitive discriminative power for both classification and hashing. In the table below, the error-rate rows compare the anchor-based objective (C-NCM / CAL column) against Softmax and L-Softmax baselines from (Hao et al., 2018), while the mAP rows report the deep-hashing variant from (Hoe et al., 2021), for which those baselines do not apply:
| Dataset | Metric | Softmax | L-Softmax | C-NCM / CAL |
|---|---|---|---|---|
| MNIST | Error Rate (%) | 0.47 | 0.31 | 0.25 |
| CIFAR-10 | Error Rate (%) | 10.22 | 7.58 | 8.78 |
| CIFAR-10+ | Error Rate (%) | 6.61 | 5.92 | 5.98 |
| CIFAR-100 | Error Rate (%) | 37.26 | 29.53 | 30.86 |
| CIFAR-10 | mAP@5000 (64b) | — | — | 89.1 |
| NUS-WIDE | mAP@5000 (64b) | — | — | 81.5 |
| MS-COCO | mAP@1000 (64b) | — | — | 51.8 |
CAL achieves significant gains over standard Softmax and rivals advanced techniques such as L-Softmax, but with zero extra learnable parameters and no margin tuning (Hao et al., 2018). In deep hashing, CAL consistently surpasses prior multi-loss baselines by 1–3 points in mAP across code lengths, while removing the need for complex regularization and auxiliary objectives (Hoe et al., 2021).
7. Practical Implications and Deployment Considerations
CAL is readily applicable in deep learning pipelines for both classification and supervised hashing. Its geometric formulation requires only the predefinition of anchor vectors and straightforward modifications to loss computation. It allows practitioners to eliminate explicit center/margin parameters, quantization objectives, and regularizers for bit-balance or class separation. CAL’s unification of intra/inter-class discrimination and quantization in a single loss renders model training stable, interpretable, and efficient (Hao et al., 2018, Hoe et al., 2021).
A plausible implication is that CAL provides a generic design pattern for loss construction wherever label supervision is available, anchor geometries can be imposed, and angular separation is desirable, including but not limited to image classification, instance retrieval, and multi-label recognition.