InterAct: Dialogue-Driven LLM Agents

Updated 10 April 2026

InterAct is a framework that integrates free-form dialogue, autonomous actions, and human feedback in LLM agents to address ambiguous real-world tasks.
It employs a perceive-think-act-speak loop, dynamically switching between internal reasoning, concrete actions, and interactive conversation.
Evaluations of ReSpAct implementations show improved task success on ALFWorld, WebShop, and MultiWOZ relative to a ReAct baseline.

InterAct refers to a class of frameworks and methodologies for integrating interaction—specifically, free-form dynamic dialogue—into LLM-based agents tasked with real-world problem solving in environments requiring both autonomous action and human collaboration. In contemporary LLM agent research, systems such as ReSpAct instantiate this approach by harmonizing three distinct modalities: internal reasoning ("Think"), grounded environmental actions ("Act"), and unstructured conversational exchange with users ("Speak"). This paradigm enables agents to clarify ambiguous goals, refine subtasks via dialogue, solicit endpoint-specific input, and interleave plan execution with ongoing user feedback—all without reliance on fixed dialogue schemas or pre-defined dialog acts (Dongre et al., 2024).

1. Architectural Foundations and Interaction Loop

In core InterAct-style frameworks exemplified by ReSpAct, the LLM agent operates in a generalized perceive-think-act(-speak) loop. The environment (which may be a structured API, web interface, or textual simulation) emits observations $o_t$. The agent's state at each timestep $t$ is captured by an internal context

$c_t = (o_1, a_1, o_2, a_2, \dots, o_t),$

where $a_i$ denotes agent actions and $o_i$ the corresponding environment observations. The policy $\pi$ (typically an LLM, e.g., GPT-4) maps this context $c_t$ to an action in the composite action space

$\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}, \quad \mathcal{U} \subset \mathcal{L},$

where $\mathcal{A}$ is the space of environment/API actions, $\mathcal{L}$ the space of internal language reasoning traces, and $\mathcal{U}$ the subset of dialogue ("Speak") actions. At each step, $\pi$ selects among:

Think: language-internal reasoning with no external effect,
Act: execution of a concrete environmental or API action,
Speak: user-directed utterances (clarifications, status reports, or alternative proposals).
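One way to make the composite action space concrete is a small type hierarchy in which Speak is a subtype of the language action, mirroring $\mathcal{U} \subset \mathcal{L}$. This is an illustrative sketch only: the class names `EnvAction`, `LanguageAction`, `Think`, `Speak` and the helpers `mode` and `has_external_effect` are hypothetical and are not taken from the ReSpAct code.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical encoding of the composite action space A-hat = A ∪ L, with U ⊂ L.
# Names are illustrative, not from the ReSpAct code base.


@dataclass(frozen=True)
class EnvAction:
    """Element of A: a concrete environment or API call."""
    name: str           # e.g. "go to" (illustrative)
    argument: str = ""  # e.g. "countertop 1" (illustrative)


@dataclass(frozen=True)
class LanguageAction:
    """Element of L: a free-form language trace."""
    text: str


@dataclass(frozen=True)
class Think(LanguageAction):
    """Internal reasoning; never leaves the agent."""


@dataclass(frozen=True)
class Speak(LanguageAction):
    """Element of U ⊂ L: an utterance addressed to the user."""


Action = Union[EnvAction, Think, Speak]


def mode(action: Action) -> str:
    """Return which of the three response modes an action belongs to."""
    if isinstance(action, EnvAction):
        return "Act"
    if isinstance(action, Speak):  # test the subset U before falling back to L
        return "Speak"
    return "Think"


def has_external_effect(action: Action) -> bool:
    """Act changes the environment and Speak reaches the user; Think does neither."""
    return mode(action) != "Think"
```

Modeling Speak as a subclass of the language-action type means every Speak is still a language action, while only Speak actions are routed to the user.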

When a "Speak" action elicits a user reply, that reply is appended to the context as the next observation $o_{t+1}$, so user information flows directly into subsequent reasoning and acting decisions. This cyclical, schema-free alternation among thinking, acting, and speaking underpins robust InterAct agent behavior (Dongre et al., 2024).

2. Formal Constructs and Operational Pseudocode

The InterAct formalism introduced in ReSpAct specifies policy and action types through the following constructs:

$\mathcal{A}$: space of callable environment or API actions,
$\mathcal{L}$: space of language-driven internal thought processes,
$\mathcal{U} \subset \mathcal{L}$: subset reserved for user-interactive dialogue actions ("Speak").

The composite action space is thus

$\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}, \quad \mathcal{U} \subset \mathcal{L},$

with action selection at step $t$ given by

$a_t \sim \pi(\cdot \mid c_t), \quad a_t \in \hat{\mathcal{A}}.$

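Operationally, the agent repeats this selection step: it chooses a Think, Act, or Speak action, appends the action and any resulting observation (an environment response after Act, a user reply after Speak) to the context, and continues until the episode terminates. A notable methodological feature is that the user simulator (for offline benchmarking) or real user (online, human-in-the-loop) is interleaved organically without requiring explicit dialog schemas (Dongre et al., 2024).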
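The Think/Act/Speak loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ReSpAct implementation: actions are `(mode, content)` pairs, `scripted_policy` is a hypothetical stand-in for the LLM, `toy_env` and `toy_user` are hypothetical stand-ins for the environment and the user, and the `max_steps` budget is an added assumption.

```python
from typing import Callable, List, Tuple

# Assumptions: an action is a (mode, content) pair with mode in
# {"Think", "Act", "Speak"}; the context c_t holds those pairs plus
# ("obs", text) entries for environment observations and user replies.
Action = Tuple[str, str]
Context = List[Tuple[str, str]]


def run_episode(
    task: str,
    policy: Callable[[Context], Action],
    env_step: Callable[[str], Tuple[str, bool]],
    user_reply: Callable[[str], str],
    max_steps: int = 20,
) -> Context:
    """Alternate Think/Act/Speak until the environment reports completion."""
    context: Context = [("obs", task)]  # o_1: the task instruction
    for _ in range(max_steps):  # step budget is an assumption
        mode, content = policy(context)  # stand-in for a_t ~ pi(. | c_t)
        context.append((mode, content))
        if mode == "Act":
            observation, done = env_step(content)
            context.append(("obs", observation))  # environment response
            if done:
                break
        elif mode == "Speak":
            context.append(("obs", user_reply(content)))  # user reply joins the context
        # Think has no external effect, so nothing else is appended.
    return context


def scripted_policy(context: Context) -> Action:
    """Hypothetical stand-in for the LLM: reason, ask once, then act on the answer."""
    n_actions = sum(1 for role, _ in context if role != "obs")
    if n_actions == 0:
        return ("Think", "The task does not say which pan; I should ask.")
    if n_actions == 1:
        return ("Speak", "Which pan do you want from the kitchen?")
    return ("Act", f"take {context[-1][1]}")  # use the user's latest reply


def toy_env(command: str) -> Tuple[str, bool]:
    return (f"Done: {command}.", True)  # every Act ends this toy episode


def toy_user(utterance: str) -> str:
    return "pan 2"  # fixed answer from a hypothetical user simulator


trajectory = run_episode("Put a pan on the stove.", scripted_policy, toy_env, toy_user)
for role, text in trajectory:
    print(f"{role}: {text}")
```

In this toy run the user's reply enters the context before the Act step, so the command `take pan 2` is grounded in the answer rather than in a guess.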

3. Innovations over Reasoning-First Architectures

Traditional reasoning-first agent architectures, such as ReAct, bifurcate agent action into thought (internal reasoning traces) and environmental actuation, but omit explicit mechanisms for dynamic, unstructured, user-agent dialogue. In contrast, InterAct models as instantiated in ReSpAct extend the action schema by incorporating the "Speak" mode ($\mathcal{U} \subset \mathcal{L}$). This enables agents to proactively ask clarification questions, present status updates, resolve subtask failures, and negotiate alternatives in natural language as needed, with the timing and content of dialogue determined by the LLM's internal reasoning rather than by pre-defined schema.

Unlike schema-guided dialog systems—where turn-taking, intent, and slot-filling are controlled by engineered dialog acts and finite-state machines—InterAct agents leverage a small few-shot prompting scaffold that defines response format (Think/Speak/Act) and allows for emergent, context-sensitive interaction policies without rigid templates. This supports broader generalization and responsiveness across diverse domains (Dongre et al., 2024).
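A few-shot scaffold of this kind might be assembled as below. The instruction wording, the `Task:` and `User:` prefixes, and the helpers `render` and `build_prompt` are hypothetical; ReSpAct's actual prompts are not reproduced here. The sketch only shows how a response-format instruction plus exemplars, rather than a dialog schema, carries the interaction policy.

```python
from typing import List, Tuple

# Hypothetical instruction naming the three response modes; wording is illustrative.
INSTRUCTION = (
    "Respond with exactly one line starting with 'Think:', 'Act:' or 'Speak:'.\n"
    "Use Speak only when you need information or confirmation from the user."
)

Turn = Tuple[str, str]  # (prefix, text), e.g. ("Speak", "Which pan?") or ("User", "pan 2")


def render(turns: List[Turn]) -> str:
    """Render turns as 'Prefix: text' lines."""
    return "\n".join(f"{prefix}: {text}" for prefix, text in turns)


def build_prompt(exemplars: List[List[Turn]], context: List[Turn]) -> str:
    """Instruction, then numbered exemplar trajectories, then the live context."""
    parts = [INSTRUCTION]
    for i, exemplar in enumerate(exemplars, start=1):
        parts.append(f"Example {i}:\n{render(exemplar)}")
    parts.append(f"Current task:\n{render(context)}")
    return "\n\n".join(parts)


exemplar = [
    ("Task", "put a pan on the stove"),
    ("Speak", "Which pan do you want from the kitchen?"),
    ("User", "pan 2"),
    ("Act", "take pan 2"),
]
print(build_prompt([exemplar], [("Task", "put a mug in the sink")]))
```

Because no dialog acts are hard-coded, when to Speak is conveyed only through the instruction and the exemplars; changing that behavior means editing exemplars rather than a state machine.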

4. Evaluation Benchmarks and Quantitative Performance

ReSpAct-style InterAct frameworks have been evaluated on multiple benchmarks that probe task-oriented dialogue, web-based goal completion, and embodied text environments:

Environment	Metric	ReAct Baseline	ReSpAct (InterAct)
ALFWorld (embodied text)	Success Rate (%)	80.6	87.3
WebShop (user simulation)	Success Rate (%)	8.0	12.0
MultiWOZ (dialogue)	Inform (%) / Success (%)	66.7 / 48.8	72.2 / 51.8

These results were obtained with a GPT-4 backbone and show consistent improvements over the ReAct baseline: absolute success-rate gains of 6.7 percentage points in ALFWorld and 4.0 points in WebShop, and gains of 5.5 points (Inform) and 3.0 points (Success) in MultiWOZ. MultiWOZ was scored with the AutoTOD script, and ALFWorld and WebShop with their standard success/score metrics; offline evaluation used both rule-based and LLM-based user simulators (Dongre et al., 2024).
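The absolute gains, and the corresponding relative gains, follow directly from the table; the snippet below recomputes them (the helper name `gains` is illustrative).

```python
# ReAct baseline vs. ReSpAct scores (%) from the table in this section.
results = {
    "ALFWorld success": (80.6, 87.3),
    "WebShop success": (8.0, 12.0),
    "MultiWOZ inform": (66.7, 72.2),
    "MultiWOZ success": (48.8, 51.8),
}


def gains(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %), rounded to 0.1."""
    absolute = round(ours - baseline, 1)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, (react, respact) in results.items():
    abs_pts, rel_pct = gains(react, respact)
    print(f"{name}: +{abs_pts} points ({rel_pct}% relative)")
```

The comparison is informative: the 4-point WebShop gain is a 50% relative improvement over its low 8.0% baseline, while the 6.7-point ALFWorld gain is roughly 8% relative.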

5. Mechanisms of Dynamic User–Agent Collaboration

InterAct agents employing ReSpAct policy structure dynamically incorporate user feedback to disambiguate tasks, request missing or underspecified parameters, confirm or propose alternatives, and provide ongoing status. Example interaction types extracted from benchmark trajectories include:

Disambiguation: "Which pan do you want from the kitchen?" (ALFWorld)
Missing Parameter Solicitation: "How many people for the hotel booking?" (MultiWOZ)
Alternative/Confirmation: "There is no cheap guesthouse; prefer hotel or shorter stay?" (MultiWOZ)
Status Updates: "I have found three credit cards; which two shall I put away?" (ALFWorld)

By incorporating each user utterance directly into the agent’s history, error propagation from prior mistaken assumptions is mitigated, and excessive, blind reasoning cycles are reduced. This suggests improved sample efficiency and robustness, particularly in tasks characterized by ambiguity or underspecification (Dongre et al., 2024).

6. Implementation Guidelines and Domain Adaptation

Implementing InterAct paradigms such as ReSpAct requires careful prompt engineering to distinguish the three response modes and a representative set of few-shot exemplars clarifying when to invoke "Speak." Extension to new APIs or domains requires only augmentation of the $\mathcal{A}$ action set to capture new environment actuators; the prompting and context-tracking scaffold remains otherwise domain-general.
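As an illustration of both points, the sketch below parses a single-line model response into one of the three modes and checks Act commands against a registry standing in for $\mathcal{A}$. The `Think:`/`Act:`/`Speak:` prefixes, the regular expression, and the `ACTION_SET` contents are assumptions; ReSpAct's actual output format may differ.

```python
import re
from typing import Dict, Tuple

# Hypothetical response format: one line starting with "Think:", "Act:" or "Speak:".
RESPONSE_PATTERN = re.compile(r"^\s*(Think|Act|Speak)\s*:\s*(.+?)\s*$", re.DOTALL)

# Illustrative registry standing in for the action set A. Extending the agent
# to a new domain means adding entries here; the parser itself is unchanged.
ACTION_SET: Dict[str, str] = {
    "go to": "move to a receptacle",
    "take": "pick up an object",
    "put": "place an object",
}


def parse_response(text: str, action_set: Dict[str, str] = ACTION_SET) -> Tuple[str, str]:
    """Split a response into (mode, content); reject malformed output and unknown actions."""
    match = RESPONSE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"response does not follow Think/Act/Speak format: {text!r}")
    mode, content = match.group(1), match.group(2)
    if mode == "Act" and not any(
        content == verb or content.startswith(verb + " ") for verb in action_set
    ):
        raise ValueError(f"unknown action: {content!r}")
    return mode, content


# Domain extension: register a hypothetical booking API without touching the parser.
booking_actions = {**ACTION_SET, "book_hotel": "reserve a hotel room"}
print(parse_response("Act: book_hotel people=2", booking_actions))
```

Rejecting unknown Act commands at parse time keeps malformed or out-of-domain actions from reaching the environment; how such errors are fed back to the model is a separate design choice.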

Offline evaluation necessitates construction of a user simulator (rule-based or LLM-driven) to produce interactive replies in response to "Speak" actions. Online, such simulators are replaced with human participants. Empirical ablation—such as reducing the incidence of "Speak" actions—can be used to calibrate the actions-to-dialogue ratio and avoid degenerate over-communication. Variants ranging from monologue-only to schema-guided baselines enable systematic exploration of the autonomy versus user involvement tradeoff. Any API-enabled domain (robotics, web navigation, database manipulation) is compatible with this schema-free InterAct scaffold (Dongre et al., 2024).
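A rule-based user simulator and a simple Speak-rate statistic, useful when calibrating the actions-to-dialogue ratio, might look like the following. The goal-slot representation, keyword matching, fallback reply, and the names `RuleBasedUser` and `speak_ratio` are all hypothetical simplifications; a benchmark-grade simulator would be considerably richer.

```python
from typing import Dict, List, Tuple


class RuleBasedUser:
    """Hypothetical simulator: answers a Speak utterance from a hidden goal via keyword rules."""

    def __init__(self, goal: Dict[str, str], keywords: Dict[str, List[str]]):
        self.goal = goal          # the user's private constraints, e.g. {"people": "2"}
        self.keywords = keywords  # slot -> trigger phrases signalling a question about it

    def reply(self, utterance: str) -> str:
        text = utterance.lower()
        for slot, triggers in self.keywords.items():
            if slot in self.goal and any(t in text for t in triggers):
                return self.goal[slot]
        return "I don't have a preference."  # fallback for unmatched utterances


def speak_ratio(trajectory: List[Tuple[str, str]]) -> float:
    """Fraction of agent actions that are Speak, ignoring observation entries."""
    modes = [mode for mode, _ in trajectory if mode in ("Think", "Act", "Speak")]
    return sum(m == "Speak" for m in modes) / len(modes) if modes else 0.0


user = RuleBasedUser(
    goal={"people": "2", "area": "centre"},
    keywords={"people": ["how many", "people"], "area": ["area", "where"]},
)
print(user.reply("How many people for the hotel booking?"))
```

Comparing `speak_ratio` across prompt variants run against such a simulator is one inexpensive way to spot over-communication before moving to human participants.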

7. Context and Significance within LLM Agent Research

InterAct and its instantiations such as ReSpAct represent a progression from static, fixed-policy dialogue agents and pure chain-of-thought approaches toward unified frameworks capable of tightly coupling open-ended reasoning, grounded action, and unconstrained, context-driven user collaboration. This supports more human-like task completion, resilience to task ambiguity, and adaptability across heterogeneous environments. A plausible implication is broader applicability of LLM agents as real-world digital assistants, robust to open-domain instruction and adaptive to varying user preferences, without the limitations of rigid dialog act ontologies (Dongre et al., 2024).


References (1)

Dongre et al. (2024). ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents.
