Interactive Multimodal LLMs
- Interactive multimodal LLMs are architectures that integrate language, vision, audio, and other sensory inputs with real-time, user-driven interaction.
- They employ modality-specific encoders, alignment modules, and Transformer backbones to achieve dynamic fusion and fine-grained grounding of inputs.
- Evaluations reveal gains in task comprehension and interaction quality, though challenges remain in scalability, multi-turn dialogue, and efficiency.
Interactive multimodal LLMs are architectures and systems that tightly couple robust language modeling with explicit, real-time interaction via multiple input/output modalities—most frequently vision, audio, and text, but with extensibility to video, gesture, gaze, and more. These systems depart from conventional passive multimodal LLMs by supporting fine-grained, dynamic exchanges between human users and the model, including region- or event-level grounding, stateful dialogue, and user-in-the-loop adaptation, thus enabling novel applications in human-computer interaction, robotics, creativity tools, and collaborative problem-solving.
1. Core Architectural Principles
Interactive multimodal LLMs are fundamentally structured around several essential architectural principles:
- Modality-specific encoding: Each input (image, audio, video, gaze, sketch, etc.) is mapped into a vector or token sequence via a domain-optimized encoder. Continuous encoders (e.g., Vision Transformer (ViT), InternViT, CLAP, Whisper, Q-Former, CNNs) and discrete tokenizers (e.g., VQ, RVQ) are both widely employed (Jiang et al., 16 Dec 2024).
- Connector/alignment module: All encoded features are projected into the shared embedding space of the LLM via modality-specific adapters, projectors, or cross-attention bridges. Choices include linear layers, multi-layer MLPs, Q-Former (Rekimoto, 31 Mar 2025), BLIP-style cross-attention (Chen et al., 2023), or more complex modules such as request-based interactive adapters (Li et al., 2023); a minimal connector sketch follows this list.
- Transformer backbone/interactivity: A unified Transformer (T5, LLaMA, Mixtral, Gemini, Qwen, etc.) processes concatenated or interleaved modality tokens and textual tokens, with potential for instruction-aware adapters or multimodal cross-attention at multiple layers (Li et al., 21 Feb 2024).
- Bidirectional interaction mechanisms: These systems frequently implement request-response cycles, allowing LLMs to query modality encoders (e.g., vision modules) contingent on user intent or dialogue state (Li et al., 2023), or to adaptively focus on user-indicated regions, time spans, or dynamic events (Zhao et al., 2023, Li et al., 21 Feb 2024, Rekimoto, 31 Mar 2025).
- Task and tool invocation heads: Modern systems integrate explicit token-level tool, generation, or control heads, so the LLM can orchestrate downstream semantic image generation, editing, segmentation, or even procedural world construction (Zhu et al., 22 Feb 2024, Duan et al., 5 Sep 2025).
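As a concrete illustration of the encoder/connector/backbone pattern, the sketch below projects frozen vision-encoder features into the LLM embedding space with a small MLP and prepends them to the text embeddings before the Transformer backbone processes the combined sequence. This is a minimal sketch under assumed dimensions; `MLPProjector`, `interleave`, and the stand-in tensors are hypothetical placeholders, not any cited system's API.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps modality-encoder features into the LLM embedding space
    (one common connector choice alongside Q-Former or cross-attention)."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, enc_dim) -> (batch, num_patches, llm_dim)
        return self.net(feats)

def interleave(visual_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected visual tokens to the text embeddings so the
    Transformer backbone attends over a single multimodal sequence."""
    return torch.cat([visual_tokens, text_embeds], dim=1)

# Hypothetical usage with stand-in tensors (no real encoder or LLM loaded):
batch, patches, enc_dim, llm_dim, txt_len = 2, 256, 1024, 4096, 32
image_feats = torch.randn(batch, patches, enc_dim)   # e.g., from a ViT-style encoder
text_embeds = torch.randn(batch, txt_len, llm_dim)   # e.g., from the LLM's embedding table
projector = MLPProjector(enc_dim, llm_dim)
multimodal_seq = interleave(projector(image_feats), text_embeds)
print(multimodal_seq.shape)  # torch.Size([2, 288, 4096])
```

In practice the projected tokens are usually spliced in at positions marked by image placeholder tokens rather than always prepended; prepending is used here only to keep the example short.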
2. Multimodal Fusion and Interactive Mechanisms
Integrating user interactivity across modalities demands advanced fusion strategies and dynamic attention mechanisms, including:
- Gaze- and region-aware input selection: Systems such as GazeLLM (Rekimoto, 31 Mar 2025) achieve substantial input compression and increased task relevance by decomposing video frames based on human visual attention: high-res processing for the gaze region, aggressive downsampling for the periphery, adaptive vision token fusion, and explicit pixel-reduction constraints (a schematic foveation sketch follows this list).
- Pointer and sketch interaction: Models like ChatSpot (Zhao et al., 2023) and LIM2N (Zu et al., 2023) accept direct spatial reference input (mouse clicks, boxes, hand-drawn paths/zones/sketches), which is injected into LLM prompts as normalized coordinates or rasterized masks. This grounding enables region-specific Q&A, navigation constraints, or visual tool invocation (see the prompt-construction sketch after this list).
- Request-based visual attention: LMEye (Li et al., 2023) employs an explicit “request–acquire–interact–respond” pipeline, where the LLM encodes latent requests that are then resolved dynamically via cross-modal attention blocks (RVII) and the resulting contextual vision tokens are reintegrated for response generation.
- Dynamic adapters for video and temporal content: Frame-wise or span-wise relevance is enforced by lightweight selection/interactor modules (IVA (Li et al., 21 Feb 2024)) deployed inside LLM layers. These can attend to question-conditioned, fine-grained visual elements in arbitrarily long video, enabling precise spatiotemporal grounding and efficient memory use.
- Interactive code execution in reasoning: Systems like Interactive Sketchpad (Chen et al., 12 Feb 2025) close the user–model loop by generating (and executing) code for visualizations in response to user questions, supporting collaborative whiteboarding or diagram-centric tutoring.
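To make the gaze-driven compression concrete, the following sketch keeps a full-resolution crop around the reported gaze point and coarsely subsamples the rest of the frame before tokenization. It is a schematic sketch of the general strategy only; the crop size, stride, and `foveate` helper are assumed values, not GazeLLM's actual pipeline.

```python
import numpy as np

def foveate(frame: np.ndarray, gaze_xy: tuple[int, int],
            fovea: int = 224, stride: int = 8):
    """Split a frame into a full-resolution crop around the gaze point and a
    coarsely subsampled periphery, shrinking the pixels handed to the encoder."""
    h, w, _ = frame.shape
    gx, gy = gaze_xy
    x0 = int(np.clip(gx - fovea // 2, 0, max(w - fovea, 0)))
    y0 = int(np.clip(gy - fovea // 2, 0, max(h - fovea, 0)))
    fovea_crop = frame[y0:y0 + fovea, x0:x0 + fovea]   # high-res gaze region
    periphery = frame[::stride, ::stride]               # aggressively downsampled context
    return fovea_crop, periphery

# Hypothetical 1080p egocentric frame with a gaze estimate near the center:
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
crop, context = foveate(frame, gaze_xy=(960, 540))
full = frame.shape[0] * frame.shape[1]
kept = crop.shape[0] * crop.shape[1] + context.shape[0] * context.shape[1]
print(f"pixels kept: {kept / full:.1%}")  # roughly 4% of the original frame
```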
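Region-level grounding of the pointer/box kind can likewise be reduced to serializing the user's selection as normalized coordinates inside the prompt. The `<box>` tag convention and helper names below are illustrative assumptions, not the prompt template of ChatSpot or any other specific model.

```python
def normalize_box(box, image_w, image_h, precision=3):
    """Convert a pixel-space (x1, y1, x2, y2) box into [0, 1] coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 / image_w, precision), round(y1 / image_h, precision),
            round(x2 / image_w, precision), round(y2 / image_h, precision))

def region_prompt(question, box, image_w, image_h):
    """Embed the normalized region reference directly in the text prompt so the
    LLM can condition its answer on the user-indicated area."""
    nx1, ny1, nx2, ny2 = normalize_box(box, image_w, image_h)
    return f"<image> Region <box>({nx1}, {ny1}), ({nx2}, {ny2})</box>: {question}"

# Hypothetical click-and-drag box on a 1280x720 image:
print(region_prompt("What does the sign in this region say?",
                    box=(320, 180, 640, 360), image_w=1280, image_h=720))
# <image> Region <box>(0.25, 0.25), (0.5, 0.5)</box>: What does the sign in this region say?
```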
3. Training Paradigms and Datasets
Interactive multimodal LLMs are generally trained in two phases (Jiang et al., 16 Dec 2024), sketched in code after the following bullets:
- Alignment pre-training: All modality encoders, connectors, and their interfaces to the LLM are optimized jointly, often with cross-entropy or contrastive objectives to align image/audio/video representations with grounded natural language.
- Instruction fine-tuning: After alignment, the model is exposed to large-scale, diverse, and (ideally) multi-turn instruction datasets incorporating multimodal context and user interactivity. LLM outputs may include tool tokens, chain-of-thought rationales, action predictions, and explicit grounding signals.
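A minimal staging sketch, assuming a frozen encoder, a trainable connector, and an LLM that is unfrozen only during instruction fine-tuning; which modules are frozen, the learning rates, and the module names are placeholder assumptions that vary widely across systems.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: str, encoder, projector, llm):
    """Freeze/unfreeze modules per training phase and build an optimizer.

    'align'    : alignment pre-training  -> train the connector only.
    'instruct' : instruction fine-tuning -> also unfreeze the LLM (or its adapters).
    """
    set_trainable(encoder, False)               # modality encoder kept frozen here
    set_trainable(projector, True)              # connector trained in both phases
    set_trainable(llm, stage == "instruct")     # LLM unfrozen only for instruction tuning
    params = [p for m in (encoder, projector, llm) for p in m.parameters()
              if p.requires_grad]
    return torch.optim.AdamW(params, lr=1e-4 if stage == "align" else 2e-5)

# Hypothetical tiny stand-ins for the real modules:
encoder, projector, llm = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
opt = configure_stage("align", encoder, projector, llm)
print(sum(p.numel() for g in opt.param_groups for p in g["params"]))  # connector params only: 72
```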
Representative datasets include:
- Task- and region-level dialogues (MGVLID (Zhao et al., 2023), LLaVA-Instruct, multimodal instruction sets).
- User-driven interaction logs for creativity, AR, or world manipulation (OmniActions (Li et al., 6 May 2024), TaleForge (Nguyen et al., 27 Jun 2025), LLMBind (Zhu et al., 22 Feb 2024)).
- Robotics and HRI data with mixed text, vision, sound, haptics, and demonstration sketches (Zu et al., 2023, Zhao et al., 2023).
- Situated scientific interaction corpora blending textual, figure, equation, and table grounding with conversational QA (cPAPERS (Sundar et al., 12 Jun 2024)).
4. Applications and System Prototypes
Interactive multimodal LLMs have been instantiated in a variety of application domains:
- Wearable and AR agents: GazeLLM (Rekimoto, 31 Mar 2025) achieves near-real-time task comprehension for first-person video with GPU-constrained hardware, reducing latency by 40% relative to full-frame processing, and matching or exceeding human task performance in robot skill transfer and real-world task annotation.
- Pervasive context-aware digital assistants: OmniActions (Li et al., 6 May 2024) links continuous real-world sensing (scene, object, audio, transcripts, activity) to digital action prediction, leveraging chain-of-thought LLM prompting and context-aware structured text grounding.
- Human-robot interaction and navigation: LIM2N (Zu et al., 2023) fuses language and sketch input with geometric/state information for RL-driven navigation, adjusting constraints online; Matcha (Zhao et al., 2023) integrates physical feedback (sound, touch, weight) with epistemic action sequencing.
- Personalized content generation: TaleForge (Nguyen et al., 27 Jun 2025) and LLMBind (Zhu et al., 22 Feb 2024) support direct user participation as protagonists in story/image/video/audio/scene generation, exposing fine-grained controls for illustration, editing, pose, and background through multimodal interfaces.
- Collaborative problem solving and tutoring: Interactive Sketchpad (Chen et al., 12 Feb 2025) allows dialogic manipulation of math/scientific diagrams, integrating code execution, visual sketch, and language-based guidance.
- Interactive research assistance: cPAPERS (Sundar et al., 12 Jun 2024) demonstrates fine-grained conversational interaction over figures, tables, and equations within scientific documents, providing baselines for zero-shot and fine-tuned QA over grounded LaTeX and image context.
- Omni-modal world and environment modeling: LatticeWorld (Duan et al., 5 Sep 2025) unifies symbolic and continuous layout/agent generation from language and visual sketches, coupling an LLM pipeline with physical simulation and multi-agent interaction in real-time UE5 environments.
5. Evaluation and Quantitative Analysis
Quantitative assessment of interactive multimodal LLMs spans diverse benchmarks, each tailored to the system’s modality coverage and interaction depth:
| Task | Model/Paper | Metric(s) | Results (condensed) |
|---|---|---|---|
| Task comprehension | GazeLLM (Rekimoto, 31 Mar 2025) | BLEU, ROUGE-L, SBERT, LLM-judged score | Gaze: BLEU 0.92, LLM judge 87/100 |
| Follow-up action pred. | OmniActions (Li et al., 6 May 2024) | Top-3 accuracy (17 labels) | Up to 67.1% (GPT-4, CoT) |
| Region QA/OCR | ChatSpot (Zhao et al., 2023) | Acc, AP, robustness | COCO-GT: 64.5% Acc |
| Multiturn navigation | LIM2N (Zu et al., 2023) | Success Rate (%) | Up to 95% (static), 61.7% (ped.) |
| Multimodal reasoning | LMEye (Li et al., 2023) | MMBench, SEED-Bench, VQA | 62.3% MMBench, 54.0% OK-VQA |
| Story engagement | TaleForge (Nguyen et al., 27 Jun 2025) | User ratings (1–5): alignment, engagement | 4.25 (alignment), 4.42 (engagement) |
Ablations consistently reveal that interactivity critically improves both accuracy and user satisfaction: removal of gaze, request, or region context typically reduces performance by double-digit percentage points (Rekimoto, 31 Mar 2025, Li et al., 2023, Li et al., 6 May 2024). User studies further highlight acceptance and preference for multimodal, interactive interfaces versus static or unimodal alternatives.
6. Limitations, Open Problems, and Future Directions
Despite substantial progress, key challenges remain:
- Long-context modeling and efficiency: Large latent or spatial features in high-res/long videos and continuous input streams drive memory and computation beyond practical bounds. Selective input routing (GazeLLM (Rekimoto, 31 Mar 2025); IVA (Li et al., 21 Feb 2024)), dynamic tokenization (Mao et al., 26 Sep 2025), and attention-based compression are actively investigated, but scaling remains a bottleneck for interactive deployment (Jiang et al., 16 Dec 2024).
- Multi-turn, multi-modal dialogue: Most released instruction datasets and benchmarks focus on single-turn or limited interaction. Rich, user-driven conversational episodes, especially with grounded region or pointer reference, are still lacking (Zhao et al., 2023, Sundar et al., 12 Jun 2024).
- Catastrophic forgetting and extensibility: Adding new modalities or fine-tuning on new instruction sets can degrade performance on previously learned interactions and tasks. Incremental, parameter-efficient, or modular training protocols are needed to mitigate forgetting (Jiang et al., 16 Dec 2024).
- Interactive generation and control: Seamless “any-to-any” modality translation (e.g., image+audio→video, pointer+text→edit, sketch→robot action) requires nuanced task and tool invocation, robust token-level coordination, and model-level extensibility (LLMBind (Zhu et al., 22 Feb 2024), LatticeWorld (Duan et al., 5 Sep 2025)).
- Evaluation and benchmarking: Unified and comprehensive benchmarks for conversational, grounded, and interactive multimodal tasks remain an open frontier. Current evaluations often rely on adapted single-modality datasets or proprietary platforms.
- Human-centered usability: Latency, proactive error recovery, and continuous context adaptation are critical for adoption of interactive systems in real-world contexts. Efficient on-device inference and attention to user experience are non-trivial engineering requirements (Li et al., 6 May 2024).
Promising research directions include:
- Life-long and modular instruction tuning with explicit mechanisms for dynamic modality addition (Jiang et al., 16 Dec 2024).
- Fully open, bilingual or multilingual interactive LLMs supporting seamless user-in-the-loop multimodal dialogue (e.g., VITA (Fu et al., 9 Aug 2024), built on Mixtral, alongside proprietary systems such as Gemini 1.5).
- Interactive token-based communication protocols for distributed, low-bandwidth collaboration between user devices and cloud agents (Mao et al., 26 Sep 2025).
Interactive multimodal LLMs represent a rapidly emerging paradigm that unifies modeling, interface design, and system integration to enable fluid, grounded, and context-aware reasoning across human-language and rich sensory streams.