IntentGPT: Open-Domain Intent Discovery and Understanding
- IntentGPT is a class of methodologies that leverage large language models, symbolic reasoning, and multimodal embeddings to detect, discover, and refine user intent in open-domain, conversational, and interactive environments.
- Key applications include conversational agents, UI understanding, co-creative authoring tools, and spoken language interfaces, using approaches such as prompt-based learning, hierarchical task representations, and micro-prompted intent tagging.
- IntentGPT systems demonstrate strong performance in few-shot and zero-shot intent classification across text, UI, and video modalities, but face challenges with context window limits, low-resource domains, and ambiguity in multimodal scenarios.
IntentGPT refers to a class of systems and methodologies that integrate LLMs into the automated extraction, discovery, and interactive refinement of user intent, primarily in conversational agents, UI understanding, co-creative authoring, and spoken language interfaces. These systems unify prompt-driven learning, hierarchical symbolic reasoning, and multimodal embedding techniques to robustly interpret user goals and actions, even in open-world, multilingual, and limited-data settings. The following sections synthesize foundational works: the symbolic interactive paradigm (Lawley et al., 2023), prompt-based few-shot intent discovery (Rodriguez et al., 2024), multimodal UI trajectory and video-based intent inference (Fu et al., 2024, Berkovitch et al., 2024), granular micro-prompting workflows (Gmeiner et al., 2025), and benchmarking of LLM intent accuracy in spoken language understanding (He et al., 2023).
1. Problem Formalization and Scope
The core problem addressed by IntentGPT is open-domain intent detection, encompassing both the classification of utterances into a dynamically evolving set of intent categories and the discovery and definition of novel, user-expressed intents in real time. Input types span free-form natural language utterances, sequences of UI actions (snapshots and event logs), and multimodal interaction data. The target output is typically a symbolic intent label or a structured, interpretable semantic representation of the user's underlying goal.
Let $x$ denote a user input (utterance, UI trajectory) and $y$ the intent label or description. In open-set scenarios, the mapping $f: x \mapsto y$ requires $y$ to be either selected from a set of known intents $\mathcal{I}_k$ (with labeled examples) or generated as a new intent in $\mathcal{I}_u$ if $x$ is out-of-distribution (Rodriguez et al., 2024). Formally, intent identification is defined as modeling the conditional $p_\theta(y \mid x)$, where $\theta$ parameterizes the LLM or multimodal encoder-decoder (Berkovitch et al., 2024). In interactive and symbolic systems, $y$ may be decomposed recursively into hierarchical action structures (Lawley et al., 2023).
2. System Architectures and Learning Paradigms
2.1 Conversationally-Grounded Hierarchical Learning
IntentGPT systems for symbolic, interpretable task induction use tightly scoped LLM modules to parse user utterances into predicate–argument triples, resolve anaphora, and unify paraphrases (“grab” ↔ “pickUp”) (Lawley et al., 2023). The architecture comprises:
- Conversational Front-End: GPT-based modules (e.g., parseGPT, matchGPT) for semantic parsing.
- Symbolic Task Store: Hierarchical, predicate-based intent/action library $\mathcal{L} = \{(p, A, V, S)\}$, where $p$ is an intent predicate, $A$ the argument slots, $V$ the scoped variables, and $S$ an ordered list of subtasks.
- Recursive Interactive Learner: Upon encountering an unknown predicate, the system recursively prompts the user for clarification, incrementally building a reusable, interpretable intent ontology.
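The recursive learning loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation from Lawley et al.: the `Task` fields, the `learn` and `expand` helpers, the set of primitive actions, and the scripted user are all hypothetical, and scoped variables are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Task:
    """One entry in the symbolic task store (field names are illustrative)."""
    predicate: str                                      # intent predicate, e.g. "makeTea"
    args: List[str] = field(default_factory=list)       # argument slots
    subtasks: List[str] = field(default_factory=list)   # ordered subtask predicates


def learn(predicate: str,
          library: Dict[str, Task],
          ask_user: Callable[[str], List[str]],
          primitives: Set[str]) -> None:
    """Recursively ask the user to define any predicate the library lacks."""
    if predicate in primitives or predicate in library:
        return  # already executable or already learned
    steps = ask_user(predicate)  # one clarification turn with the user
    library[predicate] = Task(predicate, subtasks=steps)
    for step in steps:
        learn(step, library, ask_user, primitives)


def expand(predicate: str, library: Dict[str, Task]) -> List[str]:
    """Flatten a learned intent into its ordered primitive actions."""
    if predicate not in library:
        return [predicate]
    flat: List[str] = []
    for step in library[predicate].subtasks:
        flat.extend(expand(step, library))
    return flat


# Scripted stand-in for the user's answers during clarification.
answers = {
    "makeTea": ["boilWater", "steepBag"],
    "boilWater": ["fillKettle", "switchOn"],
}
primitives = {"fillKettle", "switchOn", "steepBag"}
library: Dict[str, Task] = {}
asked: List[str] = []


def scripted_user(predicate: str) -> List[str]:
    asked.append(predicate)
    return answers[predicate]


learn("makeTea", library, scripted_user, primitives)
plan = expand("makeTea", library)
```

Because each learned predicate is stored before its subtasks are explored, a later request that mentions `boilWater` reuses the stored definition instead of prompting the user again.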
2.2 Prompt-Based Few-Shot Discovery
In training-free scenarios, IntentGPT systems rely on prompt engineering, in-context learning, and semantic retrieval (SBERT embeddings) to synthesize effective demonstration prompts and perform both closed-set classification and open-set discovery of new intents (Rodriguez et al., 2024):
- In-Context Prompt Generator (ICPG): Samples representative utterance–intent pairs, synthesizes domain-aware prompts with LLM meta-prompting, and caches final instruction sets.
- Semantic Few-Shot Sampler (SFS): Employs SBERT-based nearest neighbor search to retrieve semantically similar few-shot examples for prompt composition.
- Intent Predictor: Constructs in-context prompts containing the evolving list of known and newly discovered intents, associated demonstration examples, and the current inference queries. Predicted intent names not found in the library are added to the set of discovered intents.
A feedback mechanism (“Known Intent Feedback”, KIF) maintains and augments the growing set of intent labels during inference.
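A rough sketch of how semantic few-shot sampling, prompt construction, and known-intent feedback might fit together is given below. It rests on several assumptions: the bag-of-words `embed` function is a toy stand-in for SBERT embeddings, `fake_llm` is a hypothetical stub in place of a real model call, and the prompt wording is invented for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (utterance, intent)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for an SBERT sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sample_few_shot(query: str, pool: List[Example], k: int) -> List[Example]:
    """Semantic few-shot sampling: the k labeled examples nearest to the query."""
    q = embed(query)
    return sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)[:k]


def build_prompt(query: str, shots: List[Example], known: List[str]) -> str:
    lines = ["Known intents: " + ", ".join(known),
             "Reuse a known intent if one fits; otherwise name a new intent."]
    lines += [f"Utterance: {u}\nIntent: {i}" for u, i in shots]
    lines.append(f"Utterance: {query}\nIntent:")
    return "\n".join(lines)


def predict(query: str, pool: List[Example], known: List[str],
            llm: Callable[[str], str], k: int = 2) -> str:
    """Classify or discover an intent, then feed new labels back into `known`."""
    prompt = build_prompt(query, sample_few_shot(query, pool, k), known)
    intent = llm(prompt).strip()
    if intent not in known:
        known.append(intent)  # known-intent feedback: grow the label set
    return intent


def fake_llm(prompt: str) -> str:
    """Hypothetical stub: a real system would call an LLM here."""
    query = prompt.rsplit("Utterance: ", 1)[1]
    return "book_flight" if "flight" in query else "check_weather"


pool = [("book me a flight to Paris", "book_flight"),
        ("what is my balance", "check_balance"),
        ("reserve a flight for Monday", "book_flight")]
known = ["book_flight", "check_balance"]

shots = sample_few_shot("I need a flight to Rome", pool, 2)
first = predict("I need a flight to Rome", pool, known, fake_llm)
second = predict("will it rain tomorrow", pool, known, fake_llm)
```

In this toy run the second query yields a label outside the known list, so `known` grows by one entry and later prompts advertise the new intent.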
2.3 Multimodal Embedding and Decoding
For UI-based and video action sequences, IntentGPT-style systems integrate JEPA-tuned vision transformers for video embedding, masking strategies for high-level action abstraction, and lightweight LLM decoders for textual intent generation (Fu et al., 2024):
- Encoder: JEPA-style ViT abstracts UI video to compact embeddings; masking targets high-level transitions.
- Decoder: A small LLM (e.g., Phi-3 + LoRA) is fine-tuned on paired video–intent data; inference is via concatenated video-text embedding streams.
- Evaluation: Datasets (e.g., IIW and IIT) are designed for few-shot and zero-shot UI intent understanding. On-device deployment is achieved with decoders on the order of 3B parameters, with reported reductions in both compute and latency compared to large MLLMs.
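The hand-off from encoder to decoder can be pictured as a linear projection followed by concatenation. The sketch below uses plain Python lists with toy dimensions. The projection matrix, the helper names, and the choice to place video embeddings before text embeddings are assumptions made for illustration, not details taken from Fu et al.

```python
from typing import List

Vector = List[float]


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Linear map from the video-embedding space to the decoder's token space."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def build_decoder_input(video_embs: List[Vector],
                        text_embs: List[Vector],
                        weight: List[Vector]) -> List[Vector]:
    """Project each video embedding, then concatenate with the text embeddings."""
    return [project(v, weight) for v in video_embs] + text_embs


# Toy sizes: 3-d video embeddings, 2-d decoder token embeddings.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
video_embs = [[0.5, 1.0, 2.0], [1.0, 0.0, -1.0]]
text_embs = [[0.1, 0.2]]  # e.g. an embedded instruction token
sequence = build_decoder_input(video_embs, text_embs, weight)
```

The decoder would then generate the intent text autoregressively from `sequence`; that step is omitted here because it depends on the specific model.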
2.4 Micro-Prompting and Intent Tagging
Fine-grained intent control adopts an “intent tag” formalism, where intent is expressed as a collection of atomic [attribute:value] pairs, with tags grouped by concern and iteratively refined via granular UIs and contextual LLM-based suggestions (Gmeiner et al., 2025). The workflow is highly interactive, supporting non-linear, multimodal, and iterative authoring environments.
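One way to picture the tag formalism is as a small data structure that can be edited and then compiled into a prompt fragment. The sketch below is illustrative only: the `IntentTag` class, the concern names, and the rendering format are assumptions and do not reproduce the IntentTagger implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IntentTag:
    concern: str    # grouping, e.g. "style" or "audience" (illustrative)
    attribute: str
    value: str

    def render(self) -> str:
        return f"[{self.attribute}:{self.value}]"


def group_by_concern(tags: List[IntentTag]) -> Dict[str, List[IntentTag]]:
    groups: Dict[str, List[IntentTag]] = {}
    for tag in tags:
        groups.setdefault(tag.concern, []).append(tag)
    return groups


def set_tag(tags: List[IntentTag], new: IntentTag) -> List[IntentTag]:
    """Replace the tag with the same concern and attribute, or append it."""
    kept = [t for t in tags
            if (t.concern, t.attribute) != (new.concern, new.attribute)]
    return kept + [new]


def to_micro_prompt(tags: List[IntentTag]) -> str:
    """Compile the current tag set into a compact prompt fragment."""
    parts = []
    for concern, group in group_by_concern(tags).items():
        parts.append(f"{concern}: " + " ".join(t.render() for t in group))
    return "\n".join(parts)


tags = [
    IntentTag("audience", "role", "executives"),
    IntentTag("style", "tone", "formal"),
    IntentTag("style", "length", "short"),
]
tags = set_tag(tags, IntentTag("style", "tone", "playful"))  # iterative refinement
prompt = to_micro_prompt(tags)
```

Keeping tags as persistent, individually editable objects is what allows a single attribute to change without restating the whole request.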
3. Algorithms and Evaluation Protocols
IntentGPT methodologies incorporate deterministic, recursive, and prompt-driven pipelines, summarized as follows:
| Component | Main Algorithmic Steps | Metrics |
|---|---|---|
| Hierarchical Task Learning | Parse → Predicate Match → User Prompt (if unknown) → Recursive Tree Update | Learning success, clarifications needed |
| Prompt-based Discovery | Prompt Synthesis (ICPG) → Semantic Example Retrieval (SFS) → LLM Classification/Discovery | NMI, ARI, clustering accuracy (ACC), number of discovered intents (NDI) |
| Multimodal UI Decoding | JEPA Embedding → Projection → LLM Decoding (Autoregressive Intent Text) | Intent similarity, compute/latency |
| Intent Tagging | Tag Definition/Editing → Micro-prompted Suggestions → UI/Content Generation | User study Likert ratings, task statistics |
Evaluation protocols address both open-set and closed-set accuracy, clustering fidelity (NMI/ARI), and semantic similarity. UI-intent trajectory benchmarks use paraphrase-aware “satisfaction” relations, evaluated manually and via LMM-judging to account for the one-to-many mapping between action sequence and user goal (Berkovitch et al., 2024).
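Clustering metrics such as NMI score a predicted labeling against the gold labeling without requiring the label names to match, which is why they suit open-label discovery. A minimal pure-Python sketch follows. It assumes the arithmetic-mean normalization, which is only one of several conventions, and the example labels are invented.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """NMI with arithmetic-mean normalization: 2 * I(T; P) / (H(T) + H(P))."""
    denom = entropy(true) + entropy(pred)
    if denom == 0.0:
        return 1.0  # both labelings put everything in a single cluster
    return 2.0 * mutual_information(true, pred) / denom


gold = ["pay", "pay", "weather", "weather"]
renamed = ["c1", "c1", "c0", "c0"]   # same grouping under different names
mixed = ["a", "b", "a", "b"]         # grouping unrelated to the gold labels

score_renamed = nmi(gold, renamed)
score_mixed = nmi(gold, mixed)
```

A labeling that recovers the gold grouping under different names scores at the top of the range, while one that is statistically independent of the gold labels scores at the bottom.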
4. Empirical Results and Comparative Performance
4.1 Few-Shot and Zero-Shot Textual Intent Discovery
On the CLINC (150 intents) and BANKING (77 intents) datasets, IntentGPT-4 achieves test NMI = 96.06/85.94, ARI = 84.76/66.66, and ACC = 88.76/77.21 at 50-shot with a high known intent ratio (KIR = 0.75), outperforming prior unsupervised and semi-supervised baselines, often with less than 10% of the domain data seen by alternatives. With zero-shot prompts, NMI = 94.35/81.42 and ACC = 83.20/64.22 show competitive performance without any fine-tuning or retraining (Rodriguez et al., 2024).
Scaling to multilingual (MTOP: English, Spanish, German, Hindi, Thai) and diverse domain (StackOverflow, SNIPS) settings confirms strong generalizability, though performance drops in low-resource languages.
4.2 Spoken Language and Multilingual SLU
Only very large models (ChatGPT, GPT-3.5) reach zero-shot intent accuracy (≈80%) competitive with supervised models on SLURP and MINDS-14, with multilingual accuracy up to 97.9% in French and above 90% in Korean (He et al., 2023). However, slot filling remains challenging (ChatGPT: F1 ≈ 13%), and ASR-induced noise (16.7% WER) lowers accuracy by 6–8 points.
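For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the ASR transcript and the reference, divided by the number of reference words. The sketch below is a standard dynamic-programming computation; the example sentences are invented and chosen so that one error in six words gives a WER of about 16.7%.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "turn on the kitchen lights please"
hypothesis = "turn on the chicken lights please"  # one substituted word
wer = word_error_rate(reference, hypothesis)
```

A single misrecognized content word, as in this example, can be enough to change the predicted intent or slot value, which is consistent with the accuracy drop reported above.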
4.3 Multimodal UI Trajectories and Video
JEPA-compressed video embeddings, when decoded by a small LLM, yield intent similarity scores exceeding those of GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2%, respectively, on UI-driven tasks (IIW/IIT). Computational cost and latency are both reported to be substantially lower than for these large models, enabling real-time, on-device inference (Fu et al., 2024).
Human annotators outperform GPT-4 and Gemini-1.5 on naturalistic web and Android UI trajectory goal identification (match rate of roughly 0.80 for humans vs. roughly 0.44–0.58 for LLMs), with LLMs exhibiting under-specificity or hallucination on ambiguous traces (Berkovitch et al., 2024).
4.4 Micro-prompting Co-Creation
User studies with IntentTagger reveal statistically significant improvements in perceived control, meta-intent elicitation, and iteration efficiency over chat-based and gallery-based generative creation tools. Task times were short, and participants made extensive use of system-suggested tags (Gmeiner et al., 2025).
5. Design Principles, Limitations, and Future Directions
IntentGPT best practices across modalities converge on the following design recommendations (Gmeiner et al., 2025, Rodriguez et al., 2024, Lawley et al., 2023, Berkovitch et al., 2024):
- Flexible, Interactive Workflows: Enable dynamic, iterative, and non-linear user interaction, with persistent, editable representations of intent (tags, predicates, clusters).
- Mixed-Initiative Discovery: Support dynamic intent extension, clarifying unknown intents via user dialog or iterative feedback.
- Semantic Prompt Engineering: Utilize automated or chain-of-thought prompt generation, semantic retrieval, and contextual pruning of the known-intent list (SKIF, the semantic variant of Known Intent Feedback) to manage prompt length and relevance.
- Interpretable Representations: Store intent libraries symbolically; structure tasks as explicit, human-interpretable trees or tag clusters.
- Efficiency for Real-Time Application: Favor small, LoRA-tuned decoders for computationally constrained scenarios (edge/mobile), backed by JEPA-tuned vision transformers for high-level action abstraction.
- Evaluation Alignment: Employ paraphrase-aware scoring in UI/trajectory domains, and cluster-based comparison in open-label discovery.
Limitations include context window constraints for large intent sets, accuracy degradation in low-resource and zero-shot cross-modal generalization, ASR and phonetic modeling brittleness, and privacy and contamination risks in LLMs potentially pre-trained on evaluation data (Rodriguez et al., 2024, He et al., 2023, Berkovitch et al., 2024). Directions for improvement include multimodal fusion (audio, speech, haptics), dynamic hierarchical intent compression, and robust detection of dataset leakage in LLM pretraining.
6. Theoretical Insights and Broader Impact
A central insight is that frozen LLMs, when combined with semantic retrieval and iterative feedback loops, are effective for open-set intent classification and discovery, eliminating the need for domain-specific fine-tuning in many cases (Rodriguez et al., 2024). The hierarchical, interactive paradigms pioneered in dialog-based symbolic learning (Lawley et al., 2023) enable compositionality and reusability of discovered skills and intents.
Practical impact spans voice-activated assistants, chatbots, UI automation, content co-creation interfaces, and on-device intent tracking in mobile ecosystems. The emergence of intent-in-the-loop systems capable of dynamically adapting to users' evolving goals signals a shift towards more adaptive, grounded, and transparent conversational AI.
The ongoing challenge remains closing the gap to human-level intent understanding in ambiguous, multimodal, and context-rich scenarios. Robust intent representation, grounding across modalities, and real-time, privacy-preserving inference are likely to shape the next generation of IntentGPT systems.