TokenCLIP: Token-Wise Anomaly Detection
- The paper presents a novel framework that employs token-wise dynamic alignment using optimal transport to match visual tokens to orthogonal textual subspaces for enhanced anomaly detection.
- TokenCLIP introduces multi-head prompting to generate diverse, learnable subspaces, ensuring semantic specialization and mitigating representational tradeoffs found in global alignment methods.
- Empirical evaluations demonstrate state-of-the-art region-level anomaly localization with notable improvements on benchmarks like MVTec AD and various medical datasets, highlighting its robustness and efficiency.
TokenCLIP is a token-wise adaptation framework for zero-shot anomaly detection that advances CLIP-based vision-language models by moving from global or indiscriminate alignment to dynamic per-token cross-modal adaptation. Unlike previous approaches that enforce a single, token-agnostic textual space for all visual tokens, TokenCLIP deploys a set of learnable, orthogonal textual subspaces and uses optimal transport to dynamically match each visual token to a semantically relevant subspace. This yields strong specialization, expressivity, and generalization for fine-grained, region-level anomaly detection across diverse domains.
1. Motivation: Limitations of Global Alignment in CLIP-Based ZSAD
CLIP and its early derivatives align global image and text representations using contrastive learning, excelling at coarse-grained tasks such as classification and retrieval. For zero-shot anomaly detection and localization, existing methods (e.g., AnomalyCLIP, WinCLIP, AdaCLIP, FAPrompt) map all visual tokens indiscriminately to a globally shared textual space, typically handcrafted or learnable prompts representing "normal" or "anomalous" states. This induces two problems:
- Semantic overgeneralization: A single prompt must capture all possible anomaly types, producing weak and unfocused token-level discriminability.
- Representational tradeoff: The textual space cannot simultaneously capture the diversity of localized visual abnormalities (e.g., cracks, stains, tumors).
Attempts to assign an individual learnable prompt to each token are computationally intractable and prone to optimization collapse due to insufficient data per token.
2. Architectural Principles and Token-Wise Dynamic Alignment
TokenCLIP introduces a principled token-wise alignment mechanism built around several technical innovations:
2.1 Orthogonal Textual Subspaces via Multi-Head Prompting
A base learnable text embedding $t_c$ (e.g., for the "normal" or "anomalous" state) is expanded by a multi-head MLP into $K$ orthogonal subspaces: for class $c$,

$$t_c^{(k)} = \mathrm{MLP}_k\!\left(t_c\right), \quad k = 1, \dots, K,$$

with orthogonality regularized by

$$\mathcal{L}_{\text{orth}} = \left\lVert \hat{T}_c \hat{T}_c^{\top} - I \right\rVert_F^{2},$$

where $\hat{T}_c \in \mathbb{R}^{K \times d}$ is the matrix of normalized subspace vectors. This allows each subspace to specialize in different anomaly/normal semantics.
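Below is a minimal PyTorch sketch of this idea, not the authors' implementation: the MLP head structure, embedding dimension, and head count are illustrative assumptions.

```python
# Sketch: expand a base class prompt embedding into K subspace vectors with a
# small multi-head MLP, and penalize deviation of their Gram matrix from the
# identity to encourage orthogonality. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadPrompt(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One lightweight projection per head (assumed architecture).
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_heads)
        )

    def forward(self, base_embedding: torch.Tensor) -> torch.Tensor:
        # base_embedding: (dim,) -> subspaces: (K, dim), L2-normalized per row.
        subspaces = torch.stack([head(base_embedding) for head in self.heads])
        return F.normalize(subspaces, dim=-1)

def orthogonality_loss(subspaces: torch.Tensor) -> torch.Tensor:
    # || T T^T - I ||_F^2 on the normalized subspace matrix.
    gram = subspaces @ subspaces.t()
    eye = torch.eye(subspaces.shape[0], device=subspaces.device)
    return ((gram - eye) ** 2).sum()

# Usage: one base prompt embedding (e.g., from the CLIP text encoder).
base = torch.randn(512)
expander = MultiHeadPrompt(dim=512, num_heads=4)
T = expander(base)                # (4, 512) subspace vectors
loss_orth = orthogonality_loss(T)
```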
2.2 Token-to-Subspace Assignment via Entropic Optimal Transport
For visual tokens $\{v_i\}_{i=1}^{N}$ and textual subspaces $\{t^{(j)}\}_{j=1}^{K}$, TokenCLIP formulates the assignment as an entropic optimal transport (OT) problem (a minimal sketch follows this list):
- Cost matrix: $C_{ij} = 1 - \cos\!\left(v_i, t^{(j)}\right)$.
- Transport plan: $P^{*} \in \mathbb{R}^{N \times K}$, solved by
$$P^{*} = \arg\min_{P \in \Pi(\mu, \nu)} \; \langle P, C \rangle - \epsilon H(P),$$
with uniform marginals $\mu = \tfrac{1}{N}\mathbf{1}_{N}$ and $\nu = \tfrac{1}{K}\mathbf{1}_{K}$.
- The Sinkhorn-Knopp iteration is used to solve the entropic OT problem efficiently.
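A minimal sketch of the entropic OT assignment via Sinkhorn-Knopp scaling is shown below; the cosine cost, the regularization strength `eps`, and the iteration count are assumed values, not taken from the paper.

```python
# Sketch: visual tokens V (N, d) and subspaces T (K, d) define a cosine cost;
# the entropic OT plan with uniform marginals is obtained by Sinkhorn scaling.
import torch
import torch.nn.functional as F

def sinkhorn_plan(V, T, eps=0.05, n_iters=50):
    V = F.normalize(V, dim=-1)
    T = F.normalize(T, dim=-1)
    cost = 1.0 - V @ T.t()                     # (N, K) cosine cost
    N, K = cost.shape
    mu = torch.full((N,), 1.0 / N)             # uniform token marginal
    nu = torch.full((K,), 1.0 / K)             # uniform subspace marginal
    Kmat = torch.exp(-cost / eps)              # Gibbs kernel
    u = torch.ones(N)
    for _ in range(n_iters):                   # Sinkhorn-Knopp iterations
        v = nu / (Kmat.t() @ u)
        u = mu / (Kmat @ v)
    return u[:, None] * Kmat * v[None, :]      # transport plan P (N, K)

P = sinkhorn_plan(torch.randn(196, 512), torch.randn(4, 512))
```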
2.3 Top-$k$ Sparsification for Specialization
Each token's alignment to the subspaces is sparsified by retaining only its $k$ largest transport weights,
$$\tilde{P}_{ij} = \begin{cases} P^{*}_{ij}, & j \in \operatorname{Top}\text{-}k\!\left(P^{*}_{i,:}\right), \\ 0, & \text{otherwise,} \end{cases}$$
followed by row normalization $\hat{P}_{ij} = \tilde{P}_{ij} \big/ \sum_{j'} \tilde{P}_{ij'}$. This ensures each token interacts mainly with a small number $k$ of subspaces, supporting semantic diversity and computational tractability (see the sketch below).
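A short sketch of the top-$k$ sparsification step, assuming the transport plan `P` from the previous sketch; the choice $k = 2$ is illustrative.

```python
# Sketch: keep each token's k strongest subspace assignments in the transport
# plan and renormalize rows so that the kept weights sum to one per token.
import torch

def topk_sparsify(P: torch.Tensor, k: int = 2) -> torch.Tensor:
    # P: (N, K) transport plan; zero out all but the top-k entries per row.
    vals, idx = P.topk(k, dim=-1)
    sparse = torch.zeros_like(P).scatter_(-1, idx, vals)
    return sparse / sparse.sum(dim=-1, keepdim=True)   # row-normalize

P_sparse = topk_sparsify(torch.rand(196, 4), k=2)
```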
3. Training Objectives and Loss Design
TokenCLIP’s optimization integrates several loss terms:
- Global cross-entropy loss ($\mathcal{L}_{\text{global}}$): standard image-level anomaly/normal classification.
- Local base alignment loss ($\mathcal{L}_{\text{base}}$): Focal and Dice losses using all tokens matched to the global textual space.
- Token-wise dynamic alignment loss ($\mathcal{L}_{\text{dyn}}$): Focal and Dice losses using the dynamic OT assignment described above.
- Orthogonality regularization ($\mathcal{L}_{\text{orth}}$): disentangles the subspace semantics.
- Hinge loss ($\mathcal{L}_{\text{hinge}}$): explicitly enlarges the score separation between normal and anomalous regions.
The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda_{1}\,\mathcal{L}_{\text{base}} + \lambda_{2}\,\mathcal{L}_{\text{dyn}} + \lambda_{3}\,\mathcal{L}_{\text{orth}} + \lambda_{4}\,\mathcal{L}_{\text{hinge}},$$
where $\lambda_{1}, \dots, \lambda_{4}$ are hyperparameters (a brief code sketch of this composition follows).
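The following is a hedged sketch of how these terms might be composed; the hinge formulation, margin value, and weight defaults are illustrative assumptions rather than the paper's exact definitions.

```python
# Sketch: region-level hinge separation plus weighted sum of the loss terms.
import torch
import torch.nn.functional as F

def hinge_separation(anomaly_scores, labels, margin=0.5):
    # Encourage anomalous-region scores to exceed normal-region scores by a
    # margin (assumed form of the hinge term; margin value is illustrative).
    pos = anomaly_scores[labels == 1]
    neg = anomaly_scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return anomaly_scores.new_zeros(())
    return F.relu(margin - (pos.mean() - neg.mean()))

def total_loss(l_global, l_base, l_dyn, l_orth, l_hinge,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    # Weighted combination of the five terms; lambda values are placeholders.
    l1, l2, l3, l4 = lambdas
    return l_global + l1 * l_base + l2 * l_dyn + l3 * l_orth + l4 * l_hinge
```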
4. Inference, Pixel-Level Decision, and Computational Properties
At inference:
- Each patch-wise token $v_i$ obtains, for each class $c \in \{\text{normal}, \text{anomalous}\}$, a logit
$$s_c(v_i) = \sum_{j} \hat{P}^{(c)}_{ij}\, \cos\!\left(v_i, t_c^{(j)}\right).$$
- Patch anomaly probability:
$$p_{\text{anom}}(v_i) = \frac{\exp\!\left(s_{\text{anom}}(v_i)/\tau\right)}{\exp\!\left(s_{\text{anom}}(v_i)/\tau\right) + \exp\!\left(s_{\text{norm}}(v_i)/\tau\right)},$$
where $s_{\text{norm}}(v_i)$ is the analogous normal score (a sketch follows this list).
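The per-patch scoring could be sketched as follows; the separate sparse plans for the normal and anomalous subspace sets and the temperature `tau` are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: plan-weighted cosine logits per class, softmax over normal/anomalous.
import torch
import torch.nn.functional as F

def patch_anomaly_prob(V, T_anom, T_norm, P_anom, P_norm, tau=0.07):
    # V: (N, d) patch tokens; T_*: (K, d) subspaces; P_*: (N, K) sparse plans.
    V = F.normalize(V, dim=-1)
    sims_a = V @ F.normalize(T_anom, dim=-1).t()   # (N, K) cosine similarities
    sims_n = V @ F.normalize(T_norm, dim=-1).t()
    s_a = (P_anom * sims_a).sum(dim=-1)            # plan-weighted anomaly logit
    s_n = (P_norm * sims_n).sum(dim=-1)            # plan-weighted normal logit
    return torch.softmax(torch.stack([s_n, s_a], dim=-1) / tau, dim=-1)[..., 1]
```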
Inference proceeds efficiently: the OT assignment with top-$k$ sparsification is computed in parallel across tokens, and the number of subspaces $K$ is a small constant (e.g., $4$). TokenCLIP's speed and memory footprint are comparable to AnomalyCLIP and significantly better than methods requiring explicit per-token prompt learning.
5. Empirical Evaluation and Analysis
Experimental results demonstrate:
- State-of-the-art pixel-level anomaly localization: On VisA, MVTec AD, and a range of medical datasets, TokenCLIP achieves substantial AUROC and PRO gains over previous methods, e.g., 92.2 AUROC and 87.9 PRO on MVTec AD (baseline: 91.1/81.4 for AnomalyCLIP).
- Token-to-subspace visualization: OT induces clear semantic specialization, with individual subspaces attending to background regions, objects, or distinct anomaly types.
- Ablation studies show that the dynamic OT assignment outperforms heuristic maximum-affinity matching and other non-OT assignment schemes.
| Component | Prior Approaches | TokenCLIP Approach | Effect |
|---|---|---|---|
| Visual-token alignment | All tokens to one global space | Dynamic OT to orthogonal subspaces | Fine-grained, region-specific alignment |
| Textual subspaces | Single or handcrafted prompt space | Multi-head orthogonal learning | Diversity, semantic specialization |
| Assignment method | Shared or greedy affinity | Entropic OT, top-$k$ sparse | Optimal, efficient, non-collapsed |
| Computational cost | Low or high (if per-token prompt) | Moderate (efficient OT, top-k) | Scalable to dense detection |
| Generalization | Limited for OOD/fine-grained | Strong (across domains) | Robust anomaly localization |
6. Context, Impact, and Relation to Broader Trends
TokenCLIP exemplifies a transition from global, indiscriminate cross-modal alignment in CLIP-style models to highly structured, token-level adaptation—directly addressing the need for dense, region-aware vision-language modeling in anomaly detection and other dense prediction tasks. By leveraging optimal transport, TokenCLIP establishes a strong foundation for semantically rich, computationally feasible token-wise prompt learning. This approach resolves both identification (specialization/diversity) and optimization (data starvation, instability) issues intrinsic to per-token adaptation schemes.
A plausible implication is that frameworks similar to TokenCLIP, which dynamically match localized vision features to diversified textual representations, will underpin next-generation models for downstream tasks requiring dense, open-vocabulary, or compositional reasoning. This also suggests potential extensibility to applications beyond anomaly detection—such as referring segmentation, open-vocabulary dense prediction, or text-to-region grounding.
7. Summary
TokenCLIP is a token-wise CLIP adaptation framework for zero-shot anomaly detection, based on dynamic and sparse optimal transport assignment of patch-level visual features to multiple orthogonal textual subspaces. This design overcomes the limitations of previous token-agnostic alignment by enabling robust specialization and efficient region-level reasoning. Empirical results across industrial and medical benchmarks demonstrate its superiority, and its methodological innovations establish a framework for future dense, token-level vision–language adaptation models (Zhou et al., 24 Oct 2025).