ReSpAct: Unified LLM Agent Framework
- ReSpAct is a unified framework that integrates THINK, SPEAK, and ACT modalities, enabling clear reasoning and proactive user dialogue.
- It employs a modular pipeline with frozen LLMs to alternate between internal planning, dynamic conversation, and actionable commands.
- Empirical evaluations show that ReSpAct outperforms ReAct, boosting task success rates and user alignment across diverse benchmarks.
ReSpAct (Reason + Speak + Act) is an LLM-based agent framework that unifies autonomous reasoning, dynamic user dialogue, and actionable commands for complex task-solving in interactive environments. Unlike earlier reasoning-centric frameworks such as ReAct, which augment LLMs with internal “thought” and external “action” channels, ReSpAct introduces a distinct “SPEAK” modality. This explicit dialogue channel lets agents proactively engage users in free-form, unscripted conversations that clarify instructions, confirm assumptions, deliver progress updates, and solicit user preferences, without relying on rigid dialogue schemas. By integrating conversational interaction, internal planning, and environment manipulation, ReSpAct improves both task completion and user alignment on the reported evaluation benchmarks (Dongre et al., 2024).
1. Framework Definition and Comparative Foundations
ReSpAct extends reasoning-first LLM agent paradigms by decomposing the agent’s action space into three disjoint categories:
- THINK: Internal “chain-of-thought” reasoning for decomposition, subgoal identification, plan revision, and diagnostic analysis.
- SPEAK: Natural-language utterances explicitly directed to human users, comprising clarification questions, confirmations, status updates, or requests for input.
- ACT: Concrete commands affecting the environment, encompassing API calls, item manipulation, navigation, or web interactions.
In contrast, ReAct restricts agent interaction to “thought” (internal) and “act” (external environment) steps, limiting conversational bandwidth with users. The introduction of SPEAK in ReSpAct is designed to bridge this gap by allowing the agent to elicit, assimilate, and respond to arbitrary user input during its decision cycle, thus enabling more collaborative and resilient task resolution strategies (Dongre et al., 2024).
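The three-way split of the action space described above can be sketched as a small data model. This is a minimal illustration, not the authors' implementation: the names `Kind`, `Step`, and `parse_step`, and the `kind: text` line convention, are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    THINK = "think"   # internal reasoning, never sent to the environment
    SPEAK = "speak"   # utterance directed at the user
    ACT = "act"       # command executed in the environment


@dataclass(frozen=True)
class Step:
    kind: Kind
    text: str


def parse_step(line: str) -> Step:
    """Parse one model output line such as 'speak: Which pan should I use?'.

    The 'kind: text' prefix convention is an assumption of this sketch;
    lines without a recognised prefix are treated as environment commands.
    """
    prefix, sep, rest = line.partition(":")
    key = prefix.strip().lower()
    if sep and key in {k.value for k in Kind}:
        return Step(Kind(key), rest.strip())
    return Step(Kind.ACT, line.strip())
```

Under this convention a ReAct-style agent would only ever produce `THINK` and `ACT` steps; the `SPEAK` member is what ReSpAct adds.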
2. Architectural Overview
At each timestep $t$, the ReSpAct agent maintains:
- $o_t$: Current observation from the environment (or user response, post-SPEAK).
- $c_t$: History of observations and actions forming the agent’s context.
ReSpAct’s control is structured as an in-context prompt-based pipeline using a frozen LLM (e.g., GPT-4o, LLaMA 3.1 405B). Three primary modules collaborate:
- Reasoning Module: Generates THINK traces to determine the next subgoal, assess progress, or identify missing information.
- Dialogue Manager: Decides, based on the reasoning trace, whether to issue a SPEAK act (query, confirmation, status update).
- Action Selector: Executes ACT steps in the environment when no user interaction is required.
This pipeline forms a loop where, on every step, the agent must select—via LLM output parsing—among THINK, SPEAK, or ACT, append user responses or environment feedback to context, and re-enter the prompt cycle, terminating upon task completion.
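The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: `llm`, `env_step`, and `ask_user` are hypothetical callables, the `think:` / `speak:` / `act:` line prefixes are an assumed output convention, and the scripted stubs at the bottom stand in for a real model, environment, and user.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for this sketch:
#   llm(context)  -> one output line, e.g. "think: ...", "speak: ...", "act: ..."
#   env_step(cmd) -> (observation, done)
#   ask_user(msg) -> the user's reply
LLM = Callable[[List[str]], str]
EnvStep = Callable[[str], Tuple[str, bool]]
AskUser = Callable[[str], str]


def run_episode(llm: LLM, env_step: EnvStep, ask_user: AskUser,
                first_obs: str, max_steps: int = 20) -> List[str]:
    """Run the THINK/SPEAK/ACT loop and return the accumulated context."""
    context = [f"obs: {first_obs}"]
    for _ in range(max_steps):
        line = llm(context).strip()
        context.append(line)
        kind, _, text = line.partition(":")
        kind, text = kind.strip().lower(), text.strip()
        if kind == "think":
            continue  # reasoning only extends the context
        if kind == "speak":
            context.append(f"user: {ask_user(text)}")
            continue
        # Anything else is treated as an environment command.
        command = text if kind == "act" else line
        obs, done = env_step(command)
        context.append(f"obs: {obs}")
        if done:
            break
    return context


# Scripted stand-ins for the model, the environment, and the user.
script = iter([
    "think: Two pans are visible; the task does not say which one.",
    "speak: Which pan should I use?",
    "act: take pan 2",
])
trace = run_episode(
    llm=lambda ctx: next(script),
    env_step=lambda cmd: ("You pick up pan 2.", True),
    ask_user=lambda msg: "The one on the stove, pan 2.",
    first_obs="You see pan 1 and pan 2.",
)
print("\n".join(trace))
```

The `max_steps` cap is a safeguard added for this sketch; the source only states that the loop terminates on task completion.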
3. Dynamic Interaction and Conversational Strategies
The defining feature of ReSpAct is fluid alternation among THINK, SPEAK, and ACT modalities, controlled by the LLM’s prompt-based policy:
- Ambiguity Handling: When internal reasoning detects underspecified or ambiguous task parameters (e.g., multiple candidate objects), the agent dispatches a SPEAK request for clarification (“Could you clarify which pan I should use?”).
- Status Communication: After multi-step or time-consuming subtasks, the agent proactively informs the user of interim results (“I’ve located three candidate pans…”).
- Failure Recovery: Failed or invalid external actions trigger dialogue to resolve impasses (“I tried to open cabinet 4 but it’s locked; should I try another location?”).
- Iterative Plan Refinement: User replies to SPEAK turns are appended to the context $c_t$, where they shape the next reasoning phase and the subsequent action selection.
This mechanism eschews formal dialogue schemas, instead empowering the agent to engage in free-form, context-sensitive natural language as warranted by domain reasoning and emergent task structure.
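To make the ambiguity-handling and failure-recovery cases above concrete, the sketch below hard-codes the decision that ReSpAct leaves to the prompted LLM. It is a hand-written stand-in for illustration only: the function `next_move`, its parameters, and the fixed wording of the utterances are assumptions, and the actual framework produces such turns in free-form language rather than from rules.

```python
from typing import List, Optional


def next_move(candidates: List[str], last_action_failed: bool = False,
              failure_note: Optional[str] = None) -> str:
    """Return a 'speak: ...' or 'act: ...' line for an object-selection step.

    Hand-written stand-in for a decision that ReSpAct leaves to the LLM.
    """
    if last_action_failed:
        # Failure recovery: report the problem and ask how to proceed.
        note = failure_note or "My last action failed."
        return f"speak: {note} Should I try another location?"
    if not candidates:
        return "speak: I could not find a matching object. Where should I look?"
    if len(candidates) > 1:
        # Ambiguity handling: several objects fit, so ask the user.
        options = ", ".join(candidates)
        return f"speak: I found {options}. Which one should I use?"
    # Exactly one candidate: no user interaction is needed.
    return f"act: take {candidates[0]}"
```

For example, `next_move(["pan 1", "pan 2"])` yields a clarification question, while `next_move(["pan 1"])` yields an environment command.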
4. Mathematical Formalism
The ReSpAct agent is modeled within the canonical sequential decision-making formalism:
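- Observation space: $\mathcal{O}$, with $o_t \in \mathcal{O}$
- Action space: $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$ (environment/API actions $\mathcal{A}$, language actions $\mathcal{L}$)
- Dialogue subspace: $\mathcal{U} \subseteq \mathcal{L}$ (SPEAK actions)
- Context: $c_t = (o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t)$
- Policy: $\pi(a_t \mid c_t)$ selects among THINK, SPEAK, or ACT.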
No reinforcement learning objective is directly optimized. Invalid or failed ACT steps are interpreted as the selection of a suboptimal action $a_t$ given context $c_t$, motivating the SPEAK modality as a feedback correction mechanism.
5. Prompt Engineering and Task Instantiation
ReSpAct relies exclusively on prompt-based few-shot learning with frozen LLMs. For each benchmark (ALFWorld, WebShop, MultiWOZ), three manually annotated successful agent trajectories interleaving THINK, SPEAK, and ACT are provided. Six prompt variants per task (every ordered pair drawn from the three demonstrations) serve both as in-context exemplars and as a check on prompt sensitivity. Notably:
- ReSpAct prompts: Include explicit SPEAK actions in demonstration (e.g., user confirmations, clarifications).
- ReAct prompts: Omit SPEAK, reflecting the baseline’s more limited interaction protocol.
Domain-specific prompting, such as the addition of “Objective” and constraint-tightening sections in MultiWOZ, encourages the agent to seek user input in response to ambiguity or large candidate result sets. All adaptation happens through prompt design; no parameter updates are performed.
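The count of six prompt variants follows from choosing ordered pairs out of three demonstrations (3 × 2 = 6). A minimal sketch, assuming placeholder demonstration names and an illustrative prompt layout (`build_prompt` is hypothetical and not taken from the source):

```python
from itertools import permutations

# Placeholder names; the actual annotated trajectories are not reproduced here.
demos = ["demo_A", "demo_B", "demo_C"]

# Ordered pairs of distinct demonstrations: 3 * 2 = 6 prompt variants.
variants = list(permutations(demos, 2))


def build_prompt(pair, task):
    """Concatenate two demonstrations and the new task (illustrative layout)."""
    return "\n\n".join([*pair, f"Task: {task}"])


for pair in variants:
    print(pair)
```

Running each task once per variant gives a simple way to observe how much success rates move with the choice and order of demonstrations.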
6. Empirical Evaluation and Quantitative Performance
Evaluation is conducted on three diverse benchmarks:
| Benchmark | Task Domain | ReAct | ReSpAct | Change |
|---|---|---|---|---|
| ALFWorld | Household manipulation | 80.6 % (Success) | 87.3 % (Success) | +6.7 pp |
| WebShop | Online shopping | 8 % (Success), 20.1 (Score) | 12 % (Success), 32.7 (Score) | +4 pp (Success), +12.6 (Score) |
| MultiWOZ | Task-oriented dialogue | 66.7 % (Inform), 48.8 % (Success) | 72.2 % (Inform), 51.8 % (Success) | +5.5 pp (Inform), +3.0 pp (Success) |
Improvements are consistent regardless of LLM backbone (GPT-4o, LLaMA 3.1 405B) and across prompt permutations. WebShop “score” reflects attribute match fraction; MultiWOZ uses “Inform” (entity coverage) and “Success” (attribute accuracy). A plausible implication is that the SPEAK modality enables context-sensitive disambiguation and enhances both user alignment and completion robustness (Dongre et al., 2024).
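The changes reported in the table are plain differences between the ReSpAct and ReAct values. The short check below recomputes them from the figures in the table; the dictionary keys are labels chosen for this example.

```python
# (ReAct, ReSpAct) values copied from the table above.
results = {
    "ALFWorld success (%)": (80.6, 87.3),
    "WebShop success (%)": (8.0, 12.0),
    "WebShop score": (20.1, 32.7),
    "MultiWOZ Inform (%)": (66.7, 72.2),
    "MultiWOZ Success (%)": (48.8, 51.8),
}

# Absolute change, rounded to one decimal to hide floating-point noise.
deltas = {name: round(respact - react, 1)
          for name, (react, respact) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

Percentage-valued rows are differences in percentage points, while the WebShop score row is a difference in score units.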
7. Limitations, Ablation Results, and Prospective Developments
Limitations:
- Evaluation settings remain synthetic or constrained; scalability to open-world, unstructured domains is untested.
- Excessive SPEAK actions risk overwhelming users with queries, potentially impeding usability.
- The strategy is tightly coupled to prompt design, with no end-to-end learning or policy gradient component; real-world generalization may be sensitive to simulator or user realism.
Ablations:
- User Simulator Robustness: With a “Helpful Knowledgeable” simulator, ALFWorld success is 85.3 %; “Perturbed” and “Unhelpful” simulators reduce it to 52.9 % and 32.1 %, respectively.
- Inner Monologue Replacement: Unregulated “Inner Monologue” SPEAK triggers lead to sharp performance drops (87.3 % → 48.5 % in ALFWorld), indicating the necessity of context-aware dialogue gating.
- Schema-Guided Dialogue: Restricting SPEAK to fixed dialogue schemas stabilizes act distribution but yields slightly lower task success rates.
Future directions include stateful confirmation policies, prompt-based SPEAK augmentation with light policy fine-tuning or retrieval, and extension to more complex and unstructured real-world scenarios. Reducing excessive user-agent turns is identified as necessary for practical deployment.
The ReSpAct framework demonstrates that harmonizing open-ended reasoning with dynamic, proactive user interaction significantly elevates the reliability, interpretability, and breadth of LLM-based agent competence in both dialogic and action-oriented settings (Dongre et al., 2024).