
AP-OOD: Attention Pooling for OOD Detection

Updated 6 February 2026
  • AP-OOD is a semi-supervised method that employs learnable attention pooling to selectively integrate token embeddings for more precise OOD detection.
  • It bridges unsupervised and supervised regimes by interpolating between loss functions, achieving state-of-the-art performance with improved AUROC and reduced FPR95.
  • The approach scales with multiple attention heads and optimized query vectors, addressing limitations of mean pooling by effectively capturing rare, anomalous signals.

AP-OOD (Attention Pooling for Out-of-Distribution Detection) is a semi-supervised approach for out-of-distribution (OOD) detection in natural language processing that employs attention pooling to aggregate token embeddings from LLMs, setting it apart from mean pooling or traditional Mahalanobis/KNN-based methods. AP-OOD leverages token-level features for enhanced OOD discrimination, provides a flexible mechanism to interpolate between unsupervised and supervised regimes, and achieves state-of-the-art detection performance, particularly in settings where rare, strongly OOD tokens might be masked by averaging operations (Hofmann et al., 5 Feb 2026).

1. Motivation and Theoretical Rationale

Most prior OOD detection approaches for LLMs collapse sequence embeddings via an arithmetic mean across token-level representations, $\bar h = \frac{1}{L}\sum_{i=1}^L h_i$. While this is effective in high-resource in-distribution (ID) settings, it "washes out" signals from rare but highly anomalous tokens. As demonstrated by constructible counterexamples, two sequences may have token means that are identical, but their distributions in feature space—especially in OOD contexts—differ profoundly. Mean-based aggregation, followed by Mahalanobis- or KNN-based scoring, thus cannot distinguish such cases.
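As a hypothetical numeric illustration of such a counterexample (constructed here, not taken from the paper), two sequences can share an identical token mean while only one contains a strongly anomalous token:

```python
# Hypothetical numeric illustration (not from the paper): sequences A and B
# have identical (zero) token means, but only B contains an anomalous token.
import numpy as np

rng = np.random.default_rng(0)

# Sequence A: 8 tokens, 4-dim embeddings, all tokens near the origin.
A = rng.normal(0.0, 0.1, size=(8, 4))
A -= A.mean(axis=0)            # centre: token mean is exactly zero

# Sequence B: same construction, plus one strongly anomalous token whose
# contribution is cancelled across the remaining tokens.
B = rng.normal(0.0, 0.1, size=(8, 4))
B -= B.mean(axis=0)
B[0] += 10.0                   # rare, strongly OOD token
B[1:] -= 10.0 / 7              # re-centre: token mean stays zero

print(np.allclose(A.mean(axis=0), 0.0))   # True
print(np.allclose(B.mean(axis=0), 0.0))   # True: means are indistinguishable
print(np.abs(B).max() > 5.0)              # True: the anomaly the mean hides
```

Any mean-based detector scores A and B identically, while a token-sensitive pooling can separate them.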

In contrast, attention pooling introduces a learned "focus" through a parameterized query vector and inverse temperature. The model then weights token embeddings in a manner responsive to the statistical characteristics of the in-distribution data. This mechanism amplifies OOD signals that would otherwise be lost and enables the pooling operator to emphasize or suppress regions of the sequence embedding space, based on characteristics most predictive of OOD behavior.

2. Formal Definition of Attention Pooling

Given token-level embeddings $h_1,\dots,h_L \in \mathbb{R}^D$ from a sequence, AP-OOD introduces a learnable query vector $q \in \mathbb{R}^D$ and an inverse-temperature parameter $\beta \geq 0$. The attention weight for each token is

$$\alpha_i = \frac{\exp(\beta\,q^{T}h_i)}{\sum_{j=1}^L \exp(\beta\,q^{T}h_j)}, \quad i = 1,\dots,L.$$

The attention-pooled representation is

$$r = \sum_{i=1}^L \alpha_i h_i.$$

For computational efficiency, these operations may be vectorized with $H = [h_1\, \dots\, h_L] \in \mathbb{R}^{D \times L}$:

$$r = H\,\mathrm{softmax}(\beta\,H^T q).$$

Unlike mean pooling, which uniformly averages token-wise features, attention pooling selectively integrates over tokens, with the attention weights parameterized and optimized for OOD signal extraction.
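The pooling operator above can be sketched in a few lines of NumPy (an illustrative implementation, not the authors' code); note that $\beta = 0$ recovers mean pooling:

```python
# Minimal NumPy sketch of attention pooling: r = H softmax(beta H^T q),
# where H has shape (D, L) as in the vectorized form above.
import numpy as np

def attention_pool(H, q, beta):
    """Return the attention-pooled representation of the columns of H."""
    logits = beta * (H.T @ q)          # shape (L,)
    logits -= logits.max()             # stable softmax
    alpha = np.exp(logits)
    alpha /= alpha.sum()               # attention weights, sum to 1
    return H @ alpha                   # shape (D,)

rng = np.random.default_rng(0)
D, L = 16, 10
H = rng.normal(size=(D, L))            # token embeddings as columns
q = rng.normal(size=D)                 # learnable query vector

r = attention_pool(H, q, beta=1.0)
r_mean = attention_pool(H, q, beta=0.0)           # uniform weights
print(np.allclose(r_mean, H.mean(axis=1)))        # True: beta=0 is mean pooling
```

Setting $\beta = 0$ makes all weights uniform, which is one way to see that attention pooling strictly generalizes the mean-pooling baseline.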

3. OOD Scoring and Aggregation over Multiple Attention Heads

AP-OOD extends single-head pooling to $M$ distinct heads, each parameterized by a query vector $w_j$, $j = 1,\dots,M$. Given a reference matrix $\tilde H$ concatenating token embeddings from a large pool of ID examples, each head computes a reference ("prototype") vector $\tilde r_j = \tilde H\,\mathrm{softmax}(\beta\,\tilde H^T w_j)$. For each input $H$:

  • Compute $r_j = H\,\mathrm{softmax}(\beta\,H^T w_j)$.
  • Per-head anomaly score: $d_j^2(H) = \big(w_j^T r_j - w_j^T \tilde r_j\big)^2$.
  • Total anomaly score: $d^2(H) = \sum_{j=1}^M d_j^2(H)$.
  • Scalar OOD score: $s(H) = -d^2(H) + \sum_{j=1}^M \log\|w_j\|_2^2$.

The method outputs a single scalar per sequence (higher s(H)s(H) indicates greater likelihood of in-distribution origin), amenable to thresholding or ROC analysis.
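The multi-head score can be sketched as follows (a hypothetical construction with random data, not the paper's trained setup; `pool` and `ood_score` are illustrative helper names):

```python
# NumPy sketch of the multi-head OOD score s(H). H_ref plays the role of the
# reference matrix of ID token embeddings; W stacks the M query vectors w_j.
import numpy as np

def pool(H, w, beta):
    """Attention-pool the columns of H (shape D x L) with query w."""
    logits = beta * (H.T @ w)
    alpha = np.exp(logits - logits.max())     # stable softmax
    return H @ (alpha / alpha.sum())

def ood_score(H, H_ref, W, beta):
    """s(H) = -sum_j (w_j^T r_j - w_j^T r~_j)^2 + sum_j log ||w_j||_2^2."""
    d2 = 0.0
    for w in W:                               # one query vector per head
        r = pool(H, w, beta)                  # pooled input representation
        r_ref = pool(H_ref, w, beta)          # per-head ID prototype
        d2 += (w @ r - w @ r_ref) ** 2
    return -d2 + np.sum(np.log(np.sum(W**2, axis=1)))

rng = np.random.default_rng(1)
D, L, M = 8, 6, 4
W = rng.normal(size=(M, D))                   # M query vectors
H_ref = rng.normal(size=(D, 200))             # token pool from ID examples
H_id = rng.normal(size=(D, L))                # sequence resembling ID data
H_ood = rng.normal(loc=50.0, size=(D, L))     # strongly shifted OOD sequence

# Higher score indicates in-distribution origin.
print(ood_score(H_id, H_ref, W, 1.0) > ood_score(H_ood, H_ref, W, 1.0))
```

The $\log\|w_j\|_2^2$ terms are identical for every input, so ranking inputs by $s(H)$ is equivalent to ranking by $-d^2(H)$; the norm terms matter only during training.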

4. Semi-Supervised Interpolation and Training Procedure

Training in AP-OOD is framed as an interpolation between unsupervised and supervised objectives:

  • Unsupervised loss:

$$\mathcal{L}_{\mathrm{unsup}} = \frac{1}{N}\sum_{i=1}^N d^2(H_i) - \sum_{j=1}^M \log\|w_j\|_2^2$$

with $N$ ID samples.

  • Supervised loss (with $N'$ auxiliary OOD realizations):

$$\mathcal{L}_{\mathrm{sup}} = \frac{1}{N + N'}\left[\sum_{i=1}^N d^2(H_i) - \sum_{i=N+1}^{N+N'} \log\big(1 - \exp(-d^2(H_i))\big)\right]$$

  • Total loss: $\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{unsup}} + (1-\lambda)\,\mathcal{L}_{\mathrm{sup}}$, for $\lambda \in [0,1]$.

When auxiliary OOD data is available, AP-OOD leverages it via interpolation, smoothly spanning the regime from classic Mahalanobis-style scoring (purely unsupervised) to explicit outlier exposure. This flexible semi-supervised formulation allows efficient use of even scarce labeled outliers.
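A NumPy sketch of this interpolation (illustrative helper names; it assumes the per-sequence squared distances $d^2(H_i)$ have already been computed by the pooling heads):

```python
# Sketch of the interpolated loss. `d2_id` / `d2_ood` hold precomputed
# squared distances d^2(H_i) for ID and auxiliary OOD sequences; W stacks
# the M query vectors w_j.
import numpy as np

def unsup_loss(d2_id, W):
    """L_unsup = mean_i d^2(H_i) - sum_j log ||w_j||_2^2."""
    return d2_id.mean() - np.sum(np.log(np.sum(W**2, axis=1)))

def sup_loss(d2_id, d2_ood):
    """L_sup over N ID and N' auxiliary OOD squared distances."""
    n = len(d2_id) + len(d2_ood)
    return (d2_id.sum() - np.log(1.0 - np.exp(-d2_ood)).sum()) / n

def total_loss(d2_id, d2_ood, W, lam):
    return lam * unsup_loss(d2_id, W) + (1 - lam) * sup_loss(d2_id, d2_ood)

d2_id = np.array([0.1, 0.3, 0.2])      # small distances for ID data
d2_ood = np.array([9.0, 12.0])         # large distances for auxiliary OOD
W = np.ones((4, 8))                    # M = 4 heads, D = 8

# lam = 1 recovers the purely unsupervised (Mahalanobis-style) objective.
print(np.isclose(total_loss(d2_id, d2_ood, W, 1.0), unsup_loss(d2_id, W)))
```

At $\lambda = 1$ only ID data influences training; at $\lambda = 0$ the auxiliary outliers are used as explicit negatives, with intermediate values trading off the two signals.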

The high-level training and inference algorithm consists of:

for step in 1..K:
    sample minibatch of sequence embeddings H_i
    for each head j:
        compute attention-pooled r_j and prototype r̃_j
        compute per-head squared distance d_j^2
    form total loss (unsupervised, supervised, or interpolated)
    backpropagate and update query vectors {w_j}
compute s(H) for new sequences; threshold to flag OOD
(Hofmann et al., 5 Feb 2026)

5. Empirical Results and Benchmarking

AP-OOD demonstrates substantial improvements across multiple standard benchmarks:

  • XSUM summarization (PEGASUS_LARGE; C4 AUX):
    • Mahalanobis (mean-pool): FPR95 = 27.84%; AUROC = 91.60%
    • AP-OOD: FPR95 = 4.67%; AUROC = 98.91%
  • WMT15 En→Fr translation (Transformer-base; ParaCrawl AUX):
    • Perplexity: FPR95 = 77.08%; AUROC = 72.25%
    • AP-OOD: FPR95 = 70.37%; AUROC = 74.81%
    • Embedding-based baselines (Mahalanobis/KNN/Deep SVDD) fall between these extremes.
  • Audio OOD detection (MIMII-DG):
    • AP-OOD FPR95 = 22.35%, outperforming MSP (36.43%), EBO (61.86%), Mahalanobis (84.39%), KNN (57.11%).
  • Ablation and scaling:
    • Varying attention heads (M, T) up to 1024×16 further increases mean AUROC (to 99.40%).
    • Dot-product attention consistently surpasses Euclidean-distance variants in high dimensions.
    • Final scoring using "sum + log norm" across heads outperforms alternatives.

AP-OOD maintains computational efficiency: at M=256, T=4 and batch size 32, the added overhead vs. mean-pooling Mahalanobis detection is ≈6.6 ms per batch, with negligible impact relative to the full encoder–decoder runtime.

6. Comparison with Related Methods

AP-OOD can be contrasted with mean-based Mahalanobis, KNN, Deep SVDD, MSP, Energy, and binary-logit-based methods. Its key innovation is selective aggregation via learnable attention weights, which enables the model to emphasize OOD-sensitive features at the token level, a property that is critical in language-modeling settings where OOD-ness may be manifested in isolated but semantically salient sub-sequences.

A plausible implication is that AP-OOD's architecture generalizes to other modalities where input granularity matters (such as dense audio or vision features), as supported by improved results on MIMII-DG audio benchmarks. The method is closely related to but distinct from prototypical and pseudo-labeling frameworks (e.g., ProtoOOD, APP) in that it provides a flexible mechanism for partial supervision and can effectively scale computationally in both unsupervised and outlier-exposure scenarios (Hofmann et al., 5 Feb 2026).

7. Implementation Considerations and Hyperparameter Choices

Models evaluated with AP-OOD include PEGASUS_LARGE for summarization, Transformer-base for translation, and lightweight transformer classifiers for audio. Key hyperparameter settings are:

  • Learning rate: 0.01 (Adam); batch size: 512; steps: 1,000
  • Inverse temperature $\beta \in \{1/\sqrt{D}, 0.25, 0.5, 1, 2\}$
  • Attention heads chosen to fix the parameter count: typically $M = 256$, $T = 4$ (with $M \times T = D$)
  • Interpolation weight $\lambda \in \{0.1, 1, 10\}$

Ablation studies confirm robustness to variation in $\beta$ (optimal range $0.25$–$1$) and scalability with increased $M, T$, supporting deployment in high-dimensional settings. Dot-product attention is preferred, and the "sum + log norm" scoring variant consistently yields superior AUROC.


AP-OOD introduces learnable, token-level attention pooling to OOD detection in sequential modeling, providing principled and empirically validated gains in precision, robustness, and computational efficiency across diverse NLP and audio settings, and setting new state-of-the-art benchmarks for OOD detection (Hofmann et al., 5 Feb 2026).
