New Intent Discovery in Conversational AI
- New Intent Discovery (NID) is the automated extraction of diverse action–object pairs from text to uncover unseen user intents.
- It leverages architectures like sequence tagging, adversarial regularization, and contrastive clustering to enhance intent detection and pairing.
- NID advances conversational AI by enabling robust, domain-adaptive recognition with minimal annotated samples and scalable clustering techniques.
New Intent Discovery (NID) refers to the automated identification and extraction of all user intentions from text utterances—including intents that were not part of a predefined taxonomy or seen during model training. This task, central to conversational AI and dialog modeling, addresses the limitations of closed-set intent classification by (i) detecting actionable intents, (ii) locating complete action–object pairs within utterances, and (iii) generalizing to arbitrary intent types or domains with minimal annotated data. A range of architectures—from sequence tagging, contrastive clustering, and deep adaptive methods to LLM-assisted schemes—has emerged to address NID’s structural, domain adaptation, and robustness challenges.
1. Formal Problem Definition
The NID task is defined over a user utterance $x = (x_1, \dots, x_n)$, a sequence of $n$ tokens, where each intent is a tuple $(a, o)$: $a$ is an action verb/phrase and $o$ is an object noun/phrase that is acted upon. Every token $x_i$ receives a tag $y_i \in \{\text{Action}, \text{Object}, \text{None}\}$, yielding three subtasks:
- (i) Utterance-level detection: decide whether $\exists\, i$ such that $y_i \neq \text{None}$;
- (ii) Tag-sequence inference: predict $\hat{y} = \arg\max_{y} p(y \mid x)$;
- (iii) Intent pairing: extract all $(a, o)$ pairs from contiguous "Action" and "Object" token spans.
This multi-intent, open-set discovery formalism departs from single-label, closed ontology approaches and supports explicit multi-span labeling. In practical deployments, models must:
- Discover whether any actionable intent is present in free-form dialog;
- Parse and pair all action–object structures per utterance;
- Operate over domains and classes unseen during training, generalizing from minimal supervision.
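The pairing subtask (iii) can be sketched as span grouping followed by a nearest-object heuristic over the tag sequence. The helper names (`spans`, `extract_intents`) and the example tags below are illustrative, not from the original work:

```python
# Minimal sketch: extract (Action, Object) intent pairs from per-token tags.
# Assumes contiguous tokens sharing a tag form one span; each Action span is
# paired with its nearest Object span, preferring spans to its right.

def spans(tokens, tags, label):
    """Group contiguous tokens carrying `label` into (start_index, phrase) spans."""
    out, cur, start = [], [], None
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag == label:
            if not cur:
                start = i
            cur.append(tok)
        elif cur:
            out.append((start, " ".join(cur)))
            cur = []
    if cur:
        out.append((start, " ".join(cur)))
    return out

def extract_intents(tokens, tags):
    """Pair each Action span with the nearest Object span (rightward preferred)."""
    actions = spans(tokens, tags, "Action")
    objects = spans(tokens, tags, "Object")
    pairs = []
    for a_pos, a in actions:
        if not objects:
            break
        _, o = min(objects, key=lambda s: (s[0] < a_pos, abs(s[0] - a_pos)))
        pairs.append((a, o))
    return pairs

tokens = "please book a flight and cancel my hotel".split()
tags = ["None", "Action", "None", "Object", "None", "Action", "None", "Object"]
print(extract_intents(tokens, tags))  # → [('book', 'flight'), ('cancel', 'hotel')]
```

A learned MLP scorer would replace the distance key in `min(...)` with a trained compatibility score over span-pair features.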
2. Principal Architectures and Core Methodologies
A. Sequence Tagging with Structural Constraints
The TOP-ID model exemplifies sequence-tagging approaches for NID (Vedula et al., 2019). It employs a two-stage architecture:
Stage I: Utterance-level detection
- Feature fusion: Char-CNN and GloVe embeddings, combined via a highway layer.
- Bi-LSTM encodes the sequence; max-pooled and sigmoid-activated for binary intent existence.
Stage II: Token-level tagging and pairing
- Adversarially regularized Bi-LSTM applies worst-case perturbations to input embeddings, yielding robust decision boundaries (adding 3–5 F1 points).
- Multi-head self-attention models long-distance token dependencies for context-aware tag assignment.
- Conditional Random Field (CRF) decodes the tag sequence, with augmentation:
- Beam search penalizes missing action/object.
- Integer linear programming ensures valid (Action, Object) pairs per utterance.
- Intent pairing uses a nearest-object heuristic or a learned MLP scorer with hinge loss.
This pipeline supports discovery of thousands of intent types (including unseen ones), achieving F1 gains of 5–17 points over baselines while requiring only about 100 annotated samples for new domains.
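The adversarial regularization idea, a worst-case perturbation of input embeddings, can be illustrated on a toy logistic model standing in for the Bi-LSTM; the FGSM-style update and the `eps` value are illustrative, not the paper's configuration:

```python
import numpy as np

# Sketch of adversarial embedding perturbation: nudge the input embedding in
# the direction that maximally increases the loss, then train on both the
# clean and the perturbed version to harden the decision boundary.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_example(x, y, w, eps=0.1):
    """Worst-case L-infinity perturbation of embedding x for binary label y."""
    p = sigmoid(w @ x)
    grad_x = (p - y) * w          # d(log-loss)/dx for logistic regression
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=4)
x, y = rng.normal(size=4), 1.0

x_adv = adversarial_example(x, y, w)
loss = lambda v: -np.log(sigmoid(w @ v))  # log-loss for the positive label
print(loss(x), "<=", loss(x_adv))  # the adversarial loss is at least the clean loss
```

Training would then minimize `loss(x) + loss(x_adv)` jointly, which is the regularization effect credited with the 3–5 F1-point gain above.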
B. Deep (Semi-)Supervised Clustering
Contrastive and clustering-based architectures mine semantic intent clusters from unlabeled pools:
- USNID (Zhang et al., 2023): Initializes representation via contrastive pre-training, applies centroid-guided k-means with self-supervised target refinement, and imposes joint cluster-level (CE) and instance-level (SCL) objectives. Cluster count can be estimated via post-hoc heuristics, and the system achieves up to 30 ARI-point improvements over deep clustering baselines.
- CDAC+ (Lin et al., 2019): Constrained deep adaptive clustering integrates pairwise must-link/cannot-link priors (labeled and dynamically pseudo-labeled), deep cosine-similarity matrix learning, and DEC-style refinement for high-confidence cluster assignments. It is insensitive to overestimation of the cluster count $K$ and robust with minimal supervision.
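The centroid-guided refinement loop at the core of these methods can be sketched with plain k-means over utterance embeddings, alternating pseudo-labeling and centroid updates. This omits the contrastive pre-training and target refinement; the toy blobs and the fixed initial centroids stand in for pre-trained intent embeddings:

```python
import numpy as np

# Minimal sketch of iterative pseudo-labeling: assign each embedding to its
# nearest centroid, then refine centroids from the current pseudo-labels.

def kmeans_refine(X, init, iters=20):
    centroids = init.astype(float).copy()
    for _ in range(iters):
        # pseudo-label every utterance embedding by its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # refine each centroid from its currently assigned points
        for j in range(len(centroids)):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels, centroids

# Two well-separated toy "intent" blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, _ = kmeans_refine(X, init=np.array([[1.0, 1.0], [4.0, 4.0]]))
print(labels[:20].tolist(), labels[20:].tolist())  # first blob → cluster 0, second → cluster 1
```

In USNID this assignment step is additionally coupled to instance-level contrastive losses, so the representation and the pseudo-labels improve together rather than the centroids alone.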
C. Graph-Based and Prototypical Contrastive Methods
- DWGF (Shi et al., 2023): Constructs multi-hop diffusion-weighted graphs on semantic neighborhoods of utterances, samples weighted positive pairs for local contrastive learning, and applies graph-based cluster smoothing at inference to filter boundary noise.
- RAP (Zhang et al., 2024), PLPCL (Deng et al., 2024), RoNID (Zhang et al., 2024): All employ explicit prototype learning strategies with loss terms for both intra-cluster attraction and inter-cluster dispersion. RoNID uses an EM-style loop (optimal transport pseudo-labeling, momentum prototype updates) to break cycles of noisy pseudo-labeling and poor feature discrimination, yielding cluster-friendly representations.
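A minimal numpy sketch of diffusion-weighted positive sampling, assuming a cosine-similarity kNN graph and an illustrative two-hop diffusion weight (not DWGF's exact construction):

```python
import numpy as np

# Sketch: build a nearest-neighbor graph over embeddings, diffuse edge weights
# over multi-hop paths, and row-normalize to get a sampling distribution for
# contrastive positives. k and alpha are illustrative hyperparameters.

def diffusion_weights(X, k=2, alpha=0.5):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    A = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]   # top-k cosine neighbors per node
    for i, nbrs in enumerate(idx):
        A[i, nbrs] = 1.0
    A = np.maximum(A, A.T)                  # symmetrize
    W = A + alpha * (A @ A)                 # 1-hop plus down-weighted 2-hop paths
    np.fill_diagonal(W, 0.0)
    return W / W.sum(1, keepdims=True)      # row-normalized sampling distribution

# Two directions standing in for two intents.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
W = diffusion_weights(X)
print(W[0, :3].sum() > W[0, 3:].sum())  # → True: positives concentrate within the same "intent"
```

Sampling positives from rows of `W` (instead of only 1-hop neighbors) is what lets multi-hop semantic neighbors contribute to the contrastive signal.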
D. LLM-Assisted and Hybrid Models
Recent methods incorporate LLMs for semantic pair labeling, iterative clustering refinement, or intent synthesis:
- NILC (Wang et al., 8 Nov 2025), LANID (Fan et al., 31 Mar 2025): Alternate between embedding-based clustering and LLM-guided centroid enrichment, rewriting ambiguous samples, or mining intent similarity via pairwise LLM judgement, then using contrastive or triplet losses for representation improvement.
- Industry pipeline (Chrabrowa et al., 2023): Combines self-supervised and weakly-supervised in-domain LLM pre-training (MLM, MTSO, contrastive objectives) with conversational context (question–answer) and fine-tunes for clustering, then applies human-in-the-loop cluster review.
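The pairwise LLM-judgement step can be sketched as asking a judge whether two cluster exemplars express the same intent and merging clusters on "yes". `llm_same_intent` is a stub standing in for a real LLM call, and the keyword rule inside it is purely illustrative:

```python
# Sketch of LLM-guided cluster merging: a pairwise judge produces must-link
# decisions between cluster exemplars; union-find merges the linked clusters.

def llm_same_intent(utt_a: str, utt_b: str) -> bool:
    """Stub for an LLM pairwise judgement (real systems prompt an LLM here)."""
    return ("balance" in utt_a) == ("balance" in utt_b)

def merge_clusters(exemplars: dict) -> dict:
    """Union-find merge of cluster ids whose exemplars the judge links."""
    parent = {c: c for c in exemplars}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    ids = sorted(exemplars)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if llm_same_intent(exemplars[a], exemplars[b]):
                parent[find(b)] = find(a)
    return {c: find(c) for c in ids}

exemplars = {0: "check my balance", 1: "what is my balance", 2: "block my card"}
print(merge_clusters(exemplars))  # → {0: 0, 1: 0, 2: 2}
```

In the pipelines above, the resulting must-link pairs also feed contrastive or triplet losses, so the embedding space (not just the cluster assignment) absorbs the LLM's judgement.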
3. Evaluation Protocols, Datasets, and Metrics
NID evaluation leverages a variety of datasets:
- Multi-domain: CLINC-150 (150 intents, 10 domains), BANKING-77 (77 intents), StackOverflow-20, DBPedia-14, SNIPS-7.
- Real-world imbalanced: ImbaNID-Bench (Zhang et al., 2024) simulates power-law intent distributions for tail-class robustness.
Key metrics:
- Clustering: Clustering Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), cluster purity, Silhouette coefficient.
- Token-level extraction: per-token F1, precision, recall (used in sequence tagging approaches).
- Discovery: Macro-F1 over detected novel intents, Binary-F1 (known vs. novel), best one-to-one mapping assessment.
Models are benchmarked against classical clustering algorithms (K-means, DBSCAN, HDBSCAN), deep clustering architectures (DEC, DCN, DeepAligned), and contrastive pipelines (CLNN, DSSCC, DPN), with state-of-the-art NID models outperforming previous bests by 1–17 F1 points and up to 30 ARI points depending on task regime and domain.
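The "best one-to-one mapping" underlying clustering ACC can be sketched by brute force over label permutations; practical evaluations use the Hungarian algorithm for the same optimum, but permutations suffice for small label sets:

```python
from itertools import permutations

# Clustering accuracy (ACC): fraction of utterances whose predicted cluster id,
# under the best one-to-one relabeling, matches the gold intent label.

def clustering_acc(gold, pred):
    labels = sorted(set(gold) | set(pred))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))       # predicted id -> candidate gold label
        hits = sum(mapping[p] == g for g, p in zip(gold, pred))
        best = max(best, hits)
    return best / len(gold)

gold = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 0, 0, 1, 1]  # identical partition, permuted cluster ids
print(clustering_acc(gold, pred))  # → 1.0
```

This is why ACC rewards recovering the partition itself: cluster ids are arbitrary, so only the best alignment counts.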
4. Empirical Findings and Domain Adaptation
- TOP-ID (Vedula et al., 2019): Achieves open intent extraction F1 of 0.76 (vs. baselines 0.49–0.59), with domain adaptation requiring only 80–100 in-domain samples to recover cross-domain F1.
- Prototype-guided continual methods (PLRD (Song et al., 2023)): Achieve 81.11–92.83% accuracy and minimal forgetting across continual streaming stages.
- Graph-based (DWGF): Outperforms prior contrastive and centroid-based approaches by 2–3% on ACC/NMI across all benchmarks.
- Semi-supervised pipelines utilize known intent prototypes and soft constraints to accelerate adaptation (PLPCL, RAP), with notable robustness when the true cluster count is unknown.
Label efficiency remains a central theme: domain adaptation often succeeds with only ~100 labeled examples thanks to adversarial and transfer pre-training. Imbalanced intent distributions (long-tailed benchmarks) remain a challenge, but ImbaNID (Zhang et al., 2024) demonstrates 3–4% gains on tail classes via relaxed optimal transport and noise-regularized contrastive clustering.
5. Limitations, Open Problems, and Future Directions
Common limitations:
- Reliance on explicit action–object phrasing: implicit requests or idiomatic expressions can elude models focused on structural token patterns (Vedula et al., 2019).
- Dependency on an accurate estimate of the cluster count; while robustness measures have improved, fully unsupervised estimation of the unknown $K$ remains difficult [(Mou et al., 2022), USNID].
- Increased inference cost for constrained decoding (integer linear programming, graph smoothing, or pairwise LLM querying).
- Interpretability: NID clusters are typically not labeled with human-intelligible names except where explicit parsing or generation components are added [(Liu et al., 2021), NILC].
Open research avenues:
- Extension to social media: adaptation for shorter, noisier utterances with non-standard grammar (Vedula et al., 2019).
- Hierarchical discovery: multi-resolution approaches for sub-intent and taxonomy induction [(Zhang et al., 2023), USNID].
- Integration of generative models (LLMs) for labeling and data augmentation, both for taxonomy expansion and cold-start intent synthesis (Rodrigues et al., 16 May 2025).
- Few-shot and generalized continual learning: rapid adaptation to streaming and evolving intent sets with minimal annotation (Song et al., 2023).
- Imbalanced intent handling: addressing head-tail skew in real-world systems (Zhang et al., 2024).
6. Practical Considerations and Best Practices
- Feature fusion (character + word), multi-head attention, adversarial embedding perturbation, and CRF constraints collectively yield high NID performance in sequence tagging regimes (Vedula et al., 2019).
- Pre-training on large, unlabeled in-domain corpora plus weak labels (dialog acts, conversation context) stabilizes clustering in industrial settings and domain transfer (Chrabrowa et al., 2023).
- Iterative clustering and representation co-learning—particularly with LLM aid for centroid refinement and rewriting (NILC, LANID)—effectively bridge semantic gaps between fixed embeddings and evolving intent structure.
- Once cluster boundaries stabilize, retraining the classifier on pseudo-labeled new intents incrementally extends a dialog system's coverage (Akbari et al., 2023).
- For small-to-moderate annotation budgets, prototype-based metric learning and semi-supervised constraints enable scalable NID in highly dynamic environments.
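The incremental-extension recipe can be sketched as confidence-thresholded routing against known intent prototypes, with low-confidence utterances pooled as pseudo-labeled candidates for a new intent. The prototype names, embeddings, and the 0.9 threshold below are all illustrative:

```python
import numpy as np

# Sketch: route confident embeddings to known intent prototypes; collect
# low-confidence ones as candidates for a new intent, to be reviewed and
# folded into the next retraining round.

def route(emb, prototypes, threshold=0.9):
    sims = {name: float(emb @ p / (np.linalg.norm(emb) * np.linalg.norm(p)))
            for name, p in prototypes.items()}
    name, best = max(sims.items(), key=lambda kv: kv[1])
    return name if best >= threshold else "candidate_new_intent"

prototypes = {"check_balance": np.array([1.0, 0.0]),
              "block_card":    np.array([0.0, 1.0])}
print(route(np.array([0.95, 0.05]), prototypes))  # → check_balance
print(route(np.array([0.7, 0.7]), prototypes))    # → candidate_new_intent
```

Once enough candidates accumulate and their cluster stabilizes, they become the pseudo-labeled training data for an added intent class, matching the incremental-coverage practice described above.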
7. Historical Context and Impact
“Towards Open Intent Discovery for Conversational Text” (Vedula et al., 2019) formalized the open intent discovery task, introduced TOP-ID, and highlighted the challenge of open-set, multi-intent extraction. Subsequent work advanced deep clustering frameworks resilient to label imbalance, domain shift, and pseudo-label noise. The NID paradigm has driven both theoretical progress and practical deployment—transforming dialog system design from ontology-centric, single-label regimes to systems capable of discovering and adapting to diverse, dynamic user needs across complex domains.