TokenCLIP: Token-Wise Anomaly Detection
- The paper presents a novel framework that employs token-wise dynamic alignment using optimal transport to match visual tokens to orthogonal textual subspaces for enhanced anomaly detection.
- TokenCLIP introduces multi-head prompting to generate diverse, learnable subspaces, ensuring semantic specialization and mitigating representational tradeoffs found in global alignment methods.
- Empirical evaluations demonstrate state-of-the-art region-level anomaly localization with notable improvements on benchmarks like MVTec AD and various medical datasets, highlighting its robustness and efficiency.
TokenCLIP is a token-wise adaptation framework for zero-shot anomaly detection that advances CLIP-based vision-language models by moving from global or indiscriminate alignment to dynamic per-token cross-modal adaptation. Unlike previous approaches that enforce a single, token-agnostic textual space for all visual tokens, TokenCLIP deploys a set of learnable, orthogonal textual subspaces and uses optimal transport to dynamically match each visual token to a semantically relevant subspace. This yields strong specialization, expressivity, and generalization for fine-grained, region-level anomaly detection across diverse domains.
1. Motivation: Limitations of Global Alignment in CLIP-Based ZSAD
CLIP and its early derivatives align global image and text representations using contrastive learning, excelling at coarse-grained tasks such as classification and retrieval. For zero-shot anomaly detection and localization, existing methods (e.g., AnomalyCLIP, WinCLIP, AdaCLIP, FAPrompt) map all visual tokens indiscriminately to a globally shared textual space, typically handcrafted or learnable prompts representing "normal" or "anomalous" states. This induces two problems:
- Semantic overgeneralization: A single prompt must capture all possible anomaly types, producing weak and unfocused token-level discriminability.
- Representational tradeoff: The textual space cannot simultaneously capture the diversity of localized visual abnormalities (e.g., cracks, stains, tumors).
Attempts to assign an individual learnable prompt to each token are computationally intractable and prone to optimization collapse due to insufficient data per token.
2. Architectural Principles and Token-Wise Dynamic Alignment
TokenCLIP introduces a principled token-wise alignment mechanism built around several technical innovations:
2.1 Orthogonal Textual Subspaces via Multi-Head Prompting
A base learnable text embedding $t_c$ (e.g., for the "normal" or "anomalous" state) is expanded by a multi-head MLP into $K$ orthogonal subspaces: for class $c$,

$$t_c^{(k)} = \mathrm{MLP}_k\!\left(t_c\right), \quad k = 1, \dots, K,$$

with orthogonality regularized by

$$\mathcal{L}_{\text{orth}} = \left\lVert \hat{T}_c \hat{T}_c^{\top} - I \right\rVert_F^{2},$$

where $\hat{T}_c \in \mathbb{R}^{K \times d}$ is the matrix of normalized subspace vectors. This allows each subspace to specialize in different anomaly/normal semantics.
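Below is a minimal PyTorch sketch of this idea, not the authors' implementation: the MLP head structure, embedding dimension, and head count are illustrative assumptions.

```python
# Sketch: expand a base class prompt embedding into K subspace vectors with a
# small multi-head MLP, and penalize deviation of their Gram matrix from the
# identity to encourage orthogonality. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadPrompt(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One lightweight projection per head (assumed architecture).
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_heads)
        )

    def forward(self, base_embedding: torch.Tensor) -> torch.Tensor:
        # base_embedding: (dim,) -> subspaces: (K, dim), L2-normalized per row.
        subspaces = torch.stack([head(base_embedding) for head in self.heads])
        return F.normalize(subspaces, dim=-1)

def orthogonality_loss(subspaces: torch.Tensor) -> torch.Tensor:
    # || T T^T - I ||_F^2 on the normalized subspace matrix.
    gram = subspaces @ subspaces.t()
    eye = torch.eye(subspaces.shape[0], device=subspaces.device)
    return ((gram - eye) ** 2).sum()

# Usage: one base prompt embedding (e.g., from the CLIP text encoder).
base = torch.randn(512)
expander = MultiHeadPrompt(dim=512, num_heads=4)
T = expander(base)                # (4, 512) subspace vectors
loss_orth = orthogonality_loss(T)
```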
2.2 Token-to-Subspace Assignment via Entropic Optimal Transport
For visual tokens $\{v_i\}_{i=1}^{N}$ and textual subspaces $\{t^{(j)}\}_{j=1}^{K}$, TokenCLIP formulates the assignment as an entropic optimal transport (OT) problem (a minimal sketch follows this list):
- Cost matrix: $C_{ij} = 1 - \cos\!\left(v_i, t^{(j)}\right)$.
- Transport plan: $P^{*} \in \mathbb{R}^{N \times K}$, solved by
$$P^{*} = \arg\min_{P \in \Pi(\mu, \nu)} \; \langle P, C \rangle - \epsilon H(P),$$
with uniform marginals $\mu = \tfrac{1}{N}\mathbf{1}_{N}$ and $\nu = \tfrac{1}{K}\mathbf{1}_{K}$.
- The Sinkhorn-Knopp iteration is used to solve the entropic OT problem efficiently.
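A minimal sketch of the entropic OT assignment via Sinkhorn-Knopp scaling is shown below; the cosine cost, the regularization strength `eps`, and the iteration count are assumed values, not taken from the paper.

```python
# Sketch: visual tokens V (N, d) and subspaces T (K, d) define a cosine cost;
# the entropic OT plan with uniform marginals is obtained by Sinkhorn scaling.
import torch
import torch.nn.functional as F

def sinkhorn_plan(V, T, eps=0.05, n_iters=50):
    V = F.normalize(V, dim=-1)
    T = F.normalize(T, dim=-1)
    cost = 1.0 - V @ T.t()                     # (N, K) cosine cost
    N, K = cost.shape
    mu = torch.full((N,), 1.0 / N)             # uniform token marginal
    nu = torch.full((K,), 1.0 / K)             # uniform subspace marginal
    Kmat = torch.exp(-cost / eps)              # Gibbs kernel
    u = torch.ones(N)
    for _ in range(n_iters):                   # Sinkhorn-Knopp iterations
        v = nu / (Kmat.t() @ u)
        u = mu / (Kmat @ v)
    return u[:, None] * Kmat * v[None, :]      # transport plan P (N, K)

P = sinkhorn_plan(torch.randn(196, 512), torch.randn(4, 512))
```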
2.3 Top-$k$ Sparsification for Specialization
Each token's alignment to the subspaces is sparsified by retaining only its $k$ largest transport weights,
$$\tilde{P}_{ij} = \begin{cases} P^{*}_{ij}, & j \in \operatorname{Top}\text{-}k\!\left(P^{*}_{i,:}\right), \\ 0, & \text{otherwise,} \end{cases}$$
followed by row normalization $\hat{P}_{ij} = \tilde{P}_{ij} \big/ \sum_{j'} \tilde{P}_{ij'}$. This ensures each token interacts mainly with a small number $k$ of subspaces, supporting semantic diversity and computational tractability (see the sketch below).
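A short sketch of the top-$k$ sparsification step, assuming the transport plan `P` from the previous sketch; the choice $k = 2$ is illustrative.

```python
# Sketch: keep each token's k strongest subspace assignments in the transport
# plan and renormalize rows so that the kept weights sum to one per token.
import torch

def topk_sparsify(P: torch.Tensor, k: int = 2) -> torch.Tensor:
    # P: (N, K) transport plan; zero out all but the top-k entries per row.
    vals, idx = P.topk(k, dim=-1)
    sparse = torch.zeros_like(P).scatter_(-1, idx, vals)
    return sparse / sparse.sum(dim=-1, keepdim=True)   # row-normalize

P_sparse = topk_sparsify(torch.rand(196, 4), k=2)
```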
3. Training Objectives and Loss Design
TokenCLIP’s optimization integrates several loss terms:
- Global cross-entropy loss ($\mathcal{L}_{\text{global}}$): standard image-level anomaly/normal classification.
- Local base alignment loss ($\mathcal{L}_{\text{base}}$): Focal and Dice losses using all tokens matched to the global textual space.
- Token-wise dynamic alignment loss ($\mathcal{L}_{\text{dyn}}$): Focal and Dice losses using the dynamic OT assignment described above.
- Orthogonality regularization ($\mathcal{L}_{\text{orth}}$): disentangles the subspace semantics.
- Hinge loss ($\mathcal{L}_{\text{hinge}}$): explicitly enlarges the score separation between normal and anomalous regions.
The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda_{1}\,\mathcal{L}_{\text{base}} + \lambda_{2}\,\mathcal{L}_{\text{dyn}} + \lambda_{3}\,\mathcal{L}_{\text{orth}} + \lambda_{4}\,\mathcal{L}_{\text{hinge}},$$
where $\lambda_{1}, \dots, \lambda_{4}$ are hyperparameters (a brief code sketch of this composition follows).
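The following is a hedged sketch of how these terms might be composed; the hinge formulation, margin value, and weight defaults are illustrative assumptions rather than the paper's exact definitions.

```python
# Sketch: region-level hinge separation plus weighted sum of the loss terms.
import torch
import torch.nn.functional as F

def hinge_separation(anomaly_scores, labels, margin=0.5):
    # Encourage anomalous-region scores to exceed normal-region scores by a
    # margin (assumed form of the hinge term; margin value is illustrative).
    pos = anomaly_scores[labels == 1]
    neg = anomaly_scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return anomaly_scores.new_zeros(())
    return F.relu(margin - (pos.mean() - neg.mean()))

def total_loss(l_global, l_base, l_dyn, l_orth, l_hinge,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    # Weighted combination of the five terms; lambda values are placeholders.
    l1, l2, l3, l4 = lambdas
    return l_global + l1 * l_base + l2 * l_dyn + l3 * l_orth + l4 * l_hinge
```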
4. Inference, Pixel-Level Decision, and Computational Properties
At inference:
- Each patch-wise token $v_i$ obtains, for each class $c \in \{\text{normal}, \text{anomalous}\}$, a logit
$$s_c(v_i) = \sum_{j} \hat{P}^{(c)}_{ij}\, \cos\!\left(v_i, t_c^{(j)}\right).$$
- Patch anomaly probability:
$$p_{\text{anom}}(v_i) = \frac{\exp\!\left(s_{\text{anom}}(v_i)/\tau\right)}{\exp\!\left(s_{\text{anom}}(v_i)/\tau\right) + \exp\!\left(s_{\text{norm}}(v_i)/\tau\right)},$$
where $s_{\text{norm}}(v_i)$ is the analogous normal score (a sketch follows this list).
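The per-patch scoring could be sketched as follows; the separate sparse plans for the normal and anomalous subspace sets and the temperature `tau` are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: plan-weighted cosine logits per class, softmax over normal/anomalous.
import torch
import torch.nn.functional as F

def patch_anomaly_prob(V, T_anom, T_norm, P_anom, P_norm, tau=0.07):
    # V: (N, d) patch tokens; T_*: (K, d) subspaces; P_*: (N, K) sparse plans.
    V = F.normalize(V, dim=-1)
    sims_a = V @ F.normalize(T_anom, dim=-1).t()   # (N, K) cosine similarities
    sims_n = V @ F.normalize(T_norm, dim=-1).t()
    s_a = (P_anom * sims_a).sum(dim=-1)            # plan-weighted anomaly logit
    s_n = (P_norm * sims_n).sum(dim=-1)            # plan-weighted normal logit
    return torch.softmax(torch.stack([s_n, s_a], dim=-1) / tau, dim=-1)[..., 1]
```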
Inference proceeds efficiently: the OT assignment with top-$k$ sparsification is computed in parallel across tokens, and the number of subspaces $K$ is a small constant (e.g., $4$). TokenCLIP's speed and memory footprint are comparable to AnomalyCLIP and significantly better than methods requiring explicit per-token prompt learning.
5. Empirical Evaluation and Analysis
Experimental results demonstrate:
- State-of-the-art pixel-level anomaly localization: On VisA, MVTec AD, and a range of medical datasets, TokenCLIP achieves substantial AUROC and PRO gains over previous methods, e.g., 92.2 AUROC and 87.9 PRO on MVTec AD (baseline: 91.1/81.4 for AnomalyCLIP).
- Token-to-subspace visualization: OT induces clear semantic specialization, with individual subspaces attending to background regions, objects, or distinct anomaly types.
- Ablation studies show that the dynamic OT assignment outperforms heuristic maximum-affinity matching and other non-OT assignment schemes.
| Component | Prior Approaches | TokenCLIP Approach | Effect |
|---|---|---|---|
| Visual-token alignment | All tokens to one global space | Dynamic OT to orthogonal subspaces | Fine-grained, region-specific alignment |
| Textual subspaces | Single or handcrafted prompt space | Multi-head orthogonal learning | Diversity, semantic specialization |
| Assignment method | Shared or greedy affinity | Entropic OT, top-$k$ sparse | Optimal, efficient, non-collapsed |
| Computational cost | Low or high (if per-token prompt) | Moderate (efficient OT, top-k) | Scalable to dense detection |
| Generalization | Limited for OOD/fine-grained | Strong (across domains) | Robust anomaly localization |
6. Context, Impact, and Relation to Broader Trends
TokenCLIP exemplifies a transition from global, indiscriminate cross-modal alignment in CLIP-style models to highly structured, token-level adaptation—directly addressing the need for dense, region-aware vision-language modeling in anomaly detection and other dense prediction tasks. By leveraging optimal transport, TokenCLIP establishes a strong foundation for semantically rich, computationally feasible token-wise prompt learning. This approach resolves both identification (specialization/diversity) and optimization (data starvation, instability) issues intrinsic to per-token adaptation schemes.
A plausible implication is that frameworks similar to TokenCLIP, which dynamically match localized vision features to diversified textual representations, will underpin next-generation models for downstream tasks requiring dense, open-vocabulary, or compositional reasoning. This also suggests potential extensibility to applications beyond anomaly detection—such as referring segmentation, open-vocabulary dense prediction, or text-to-region grounding.
7. Summary
TokenCLIP is a token-wise CLIP adaptation framework for zero-shot anomaly detection, based on dynamic and sparse optimal transport assignment of patch-level visual features to multiple orthogonal textual subspaces. This design overcomes the limitations of previous token-agnostic alignment by enabling robust specialization and efficient region-level reasoning. Empirical results across industrial and medical benchmarks demonstrate its superiority, and its methodological innovations establish a framework for future dense, token-level vision–language adaptation models (Zhou et al., 24 Oct 2025).