MM-ReAct: Multimodal Reasoning Framework
- MM-ReAct is a prompting-based system that integrates a ChatGPT controller with specialized vision experts for zero-shot multimodal reasoning.
- It employs a ReAct-style loop where 'Thought' and 'Action' steps sequentially invoke external vision tools to manage complex visual tasks.
- The system demonstrates competitive performance on diverse benchmarks, offering extensibility and practical applications without joint vision-language pretraining.
MM-ReAct is a prompting-based system that integrates an LLM controller, specifically ChatGPT (gpt-3.5-turbo), with a pool of modular vision experts to achieve multimodal reasoning and action in a zero-shot fashion. The design lets the LLM orchestrate complex visual tasks by sequentially invoking external vision modules as tools, relying on structured prompt engineering and a ReAct-style division between reasoning ("Thought") and tool invocation ("Action"). The system neither embeds raw visual signals in the prompt nor relies on joint vision-language pretraining; instead, it textualizes visual results and exchanges visual references strictly via file-path tokens. This approach demonstrates competitive or superior performance on a diverse set of advanced multimodal benchmarks without additional model-specific finetuning (Yang et al., 2023).
1. System Architecture and Orchestration
MM-ReAct decomposes multimodal question answering into a language-based planning loop involving two principal elements: a controller LLM (ChatGPT) and a dynamic pool of specialist "vision experts". The controller LLM coordinates all dialog, internal reasoning, and actions via prompt engineering, while vision experts—themselves independent models specialized for tasks such as OCR, object detection, captioning, video summarization, or table parsing—are invoked as external processes ("tools").
At runtime, uploaded images and videos are referenced using opaque placeholder tokens (e.g., "<ImagePath1>", "<VideoPathA>"). The prompt provided to ChatGPT includes explicit descriptions of each expert's name, input/output signature, and example invocations, thereby conditioning the LLM's tool-use policy via in-context learning. The system loop proceeds as follows:
- User submits a question and reference files;
- ChatGPT receives the prefixed prompt and user input;
- For each dialog turn, ChatGPT either emits a "Thought:" (intermediate reasoning) or an "Action request: Assistant, please run <ExpertName> on <ImagePathX>.";
- The system parses the action via regex, invokes the corresponding vision expert, and textualizes the output as "Observation";
- The observation is appended to the prompt for continued deliberation;
- The procedure iterates until ChatGPT emits a final answer, which is returned to the user.
2. Text-Only Prompt Encoding and Formal Notation
MM-ReAct encodes all multimodal state into text-based constructs, separating visual data exchange from model embeddings:
- File-path placeholders: The file set F = {f1,…,fK} comprises string tokens (e.g., "<ImagePath1>"), which the LLM uses to reference visual assets during planning and tool calls. The tokens are semantically inert and serve only as pointers.
- Serialized results: Vision experts return outputs as textual sequences: text lines for OCR, tuple-lists for detections (e.g., <label, x1, y1, x2, y2> for object bounding boxes), or JSON-style lists for structured semantic extraction. Legends and serialization conventions are included to instruct the LLM.
- System state: Each input at time step t pairs the user query or system "Thought" with the current file-path set. Outputs are either "Thought" (textual reasoning) or "Action" (structured expert invocation).
Dense visual signals (pixels) are never embedded in the LLM's prompt; only the file-path tokens and serialized results are exchanged.
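The serialization conventions above can be illustrated with a minimal sketch. The function names, the legend wording, and the exact "Observation" layout are assumptions made for illustration; the source only states that OCR yields text lines and detections yield <label, x1, y1, x2, y2> tuples.

```python
from typing import List, Tuple

# (label, x1, y1, x2, y2), matching the tuple layout described in the text.
Detection = Tuple[str, int, int, int, int]


def serialize_ocr(lines: List[str]) -> str:
    """Render OCR output as one text line per recognized line."""
    return "\n".join(lines)


def serialize_detections(dets: List[Detection]) -> str:
    """Render detections as tuple rows, preceded by a legend for the LLM."""
    legend = "Format: <label, x1, y1, x2, y2>"  # hypothetical legend wording
    rows = [f"<{label}, {x1}, {y1}, {x2}, {y2}>" for label, x1, y1, x2, y2 in dets]
    return "\n".join([legend] + rows)


def make_observation(payload: str) -> str:
    """Wrap serialized expert output so it can be appended to the prompt."""
    return f"Observation: {payload}"


obs = make_observation(serialize_detections([("dog", 10, 20, 110, 220)]))
print(obs)
```

Only strings of this kind, together with the file-path tokens, would reach the controller; the pixels stay with the experts.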
3. Prompt Templates, Example Interactions, and Tool Invocation
The initialization prompt conditions ChatGPT to operate as a "multimodal reasoning agent", enumerating each expert's API and furnishing several in-context examples per expert. Each example demonstrates the full loop: user query, model reasoning, action formulation, and the use of structured "Observation" text. For instance:
- OCR Expert:
  - Capability: Extract all visible text from an image.
  - Input: <ImagePath>
  - Output: List of text lines.
  - Protocol: "Assistant, please run OCR on <ImagePath1>."
- Object Detector Expert:
  - Capability: Return bounding boxes and object labels.
  - Input: <ImagePath>
  - Output: List of <label, x1, y1, x2, y2> tuples.
Instructions mandate that the agent always emits a reasoning "Thought" prior to any Action or final answer. The invocation protocol is identified via the explicit phrase "Assistant, please run <ExpertName> on <ImagePathX>", enabling regex-based parsing and consistent logging of expert actions.
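To make the regex-based parsing concrete, here is a minimal sketch of how the invocation phrase might be matched. The pattern and the `parse_action` helper are illustrative assumptions; the source specifies only the phrase "Assistant, please run <ExpertName> on <ImagePathX>" and not the actual expression used.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: an expert name made of word characters, spaces, or
# hyphens, followed by an angle-bracketed file-path token.
ACTION_RE = re.compile(
    r"Assistant, please run (?P<expert>[\w\- ]+?) on (?P<file><[^<>\s]+>)"
)


def parse_action(response: str) -> Optional[Tuple[str, str]]:
    """Return (expert_name, file_token) if the response requests a tool call."""
    match = ACTION_RE.search(response)
    if match is None:
        return None  # a plain "Thought:" or a final answer
    return match.group("expert").strip(), match.group("file")


print(parse_action("Thought: I need the text.\nAssistant, please run OCR on <ImagePath1>."))
print(parse_action("Thought: The chart has three bars."))
```

Because the phrase is fixed, a response that paraphrases the request (for example "run the OCR tool") would not match under this sketch, which is one reason the in-context examples need to show the exact wording.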
4. Runtime Procedure and Pseudocode
The MM-ReAct orchestration algorithm proceeds in a sequential, interleaved fashion, as formalized by the following pseudocode:
```
Input:
  - User question Q
  - File-path set F = {f1,…,fK}
  - Prefix prompt P (describes experts & examples)

Context C ← P ∥ "User: " ∥ Q
while true:
    Response R ← ChatGPT_generate(C)
    if R contains final answer A:
        return A
    else if R contains "Assistant, please run E on f":
        Parse expert E and file f via regex
        Obs ← RunExpert(E, f)
        C ← C ∥ R ∥ "Observation: " ∥ Obs
    else:
        # it's a "Thought:" without an action
        C ← C ∥ R
```
At each step, ChatGPT may deliberate or trigger a tool call; results propagate as text back into the context window, recursively enabling multi-hop or multi-tool planning. The process repeats until a conclusive answer is generated.
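The loop can also be written as a small runnable Python sketch. Several details are assumptions rather than part of the described system: the `Final answer:` marker, the `max_steps` bound (the pseudocode loops until an answer appears), the handling of unknown expert names, and the scripted stand-ins for ChatGPT and the OCR expert.

```python
import re
from typing import Callable, Dict

ACTION_RE = re.compile(
    r"Assistant, please run (?P<expert>[\w\- ]+?) on (?P<file><[^<>\s]+>)"
)
FINAL_PREFIX = "Final answer:"  # assumed marker for the concluding response


def mm_react_loop(
    question: str,
    prefix_prompt: str,
    llm: Callable[[str], str],
    experts: Dict[str, Callable[[str], str]],
    max_steps: int = 10,  # safety bound added for this sketch
) -> str:
    context = f"{prefix_prompt}\nUser: {question}\n"
    for _ in range(max_steps):
        response = llm(context)
        if FINAL_PREFIX in response:
            return response.split(FINAL_PREFIX, 1)[1].strip()
        context += response + "\n"  # keep the Thought/Action in the context
        match = ACTION_RE.search(response)
        if match:
            name, file_token = match.group("expert").strip(), match.group("file")
            tool = experts.get(name)
            obs = tool(file_token) if tool else f"Unknown expert: {name}"
            context += f"Observation: {obs}\n"
    raise RuntimeError("no final answer within max_steps")


def scripted_llm(context: str) -> str:
    """Stand-in for ChatGPT: request OCR once, then answer."""
    if "Observation:" not in context:
        return "Thought: I should read the receipt.\nAssistant, please run OCR on <ImagePath1>."
    return "Thought: The total is printed on the receipt.\nFinal answer: 12.50"


answer = mm_react_loop(
    "What is the total on <ImagePath1>?",
    "You are a multimodal reasoning agent.",
    scripted_llm,
    {"OCR": lambda file_token: "Total: 12.50"},
)
print(answer)  # prints 12.50
```

Swapping `scripted_llm` for a real chat-completion call and the lambda for an actual vision model would leave the control flow unchanged, since both sides communicate only through text.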
5. Empirical Evaluation: Tasks and Comparative Performance
Zero-shot efficacy of MM-ReAct is demonstrated across a suite of challenging multimodal scenarios, each evaluated without additional model-specific finetuning. These include:
- Visual math and diagram reasoning
- Meme interpretation and context-dependent humor
- Spatial and coordinate-based tasks, complex visual planning
- Multi-image aggregation (e.g., totaling costs across receipts)
- Multi-hop chart and flowchart question answering
- Table-based factual queries; open-world object grounding
- Video summarization and event detection
Correctness was determined by human judges. Side-by-side comparisons with PaLM-E, a jointly finetuned state-of-the-art vision-language model, indicate that MM-ReAct matches or outperforms PaLM-E's zero-shot performance on several benchmarks. An illustrative table:
| Task | MM-ReAct (zero-shot) | PaLM-E (zero-shot) |
|---|---|---|
| Receipt sum | 100% (10/10) | 90% (9/10) |
| Bar-chart QA | 80% (8/10) | 70% (7/10) |
| Meme understanding | 100% (5/5) | 80% (4/5) |
| Video summary | 80% (4/5) | 60% (3/5) |
This suggests that explicit tool orchestration via prompt engineering can deliver competitive zero-shot multimodal capabilities (Yang et al., 2023).
6. Paradigmatic Comparison and Extensibility
MM-ReAct diverges from joint vision-language pretraining prevalent in integrated models such as Flamingo, PaLM-E, and GPT-4V, which encode visual signals as continuous tokens and require extensive labeled data and compute for finetuning. Instead, MM-ReAct assembles existing vision tools "on the fly," using only prompt edits to extend system competence to new tasks or experts, constrained by the available LLM context window.
A key feature of MM-ReAct is modular extensibility: new experts can be added by supplementing the prefix prompt with additional expert descriptions and example interactions.
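As a rough illustration of prompt-only extensibility, the sketch below builds the prefix prompt from a list of expert specifications, so that adding an expert amounts to appending one entry. The `ExpertSpec` fields, the header wording, the "Table Parser" output description, and the rendering layout are all hypothetical; the source does not publish the exact prompt text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExpertSpec:
    name: str
    capability: str
    input_sig: str
    output_sig: str
    example_file: str  # file-path token used in the example invocation


def render_expert(spec: ExpertSpec) -> str:
    """Render one expert description in a layout like the examples above."""
    return (
        f"- {spec.name} Expert:\n"
        f"  Capability: {spec.capability}\n"
        f"  Input: {spec.input_sig}\n"
        f"  Output: {spec.output_sig}\n"
        f'  Protocol: "Assistant, please run {spec.name} on {spec.example_file}."\n'
    )


def build_prefix_prompt(specs: List[ExpertSpec]) -> str:
    header = "You are a multimodal reasoning agent. Always emit a Thought before any Action.\n"
    return header + "".join(render_expert(s) for s in specs)


specs = [
    ExpertSpec("OCR", "Extract all visible text from an image.",
               "<ImagePath>", "List of text lines.", "<ImagePath1>"),
]
base_prompt = build_prefix_prompt(specs)

# Extending the system: append a description, with no model retraining.
specs.append(
    ExpertSpec("Table Parser", "Parse a table image into rows.",
               "<ImagePath>", "JSON-style list of rows.", "<ImagePath1>")
)
extended_prompt = build_prefix_prompt(specs)
print(len(base_prompt), len(extended_prompt))
```

Each added expert lengthens the prompt, which is how the context-window constraint mentioned above comes into play.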
7. Practical Applications, Limitations, and Future Enhancements
Demonstrated applications include automated receipt processing (multi-image sum), OCR-augmented data entry, diagrammatic and mathematical tutoring, analytical QA on charts and flowcharts, and contextual analysis of media such as memes and video.
Noted limitations include:
- Predominantly qualitative evaluation; lack of unified annotated benchmarks for several task categories.
- Reliance on expert pool defined in the initialization prompt; addressing unseen tasks requires manual prompt engineering.
- Bounded by the LLM context window (approximately 4k tokens), limiting expert enumeration and in-context demonstrations.
- Constraints of textual serialization for representing complex visual features.
- Sequential tool-use protocol induces higher latency owing to multiple API calls.
Currently, expert selection is performed symbolically: the LLM emits the expert name in the action string, and the system extracts it by fixed regex pattern matching. A plausible implication is that future variants could employ embedding-based scoring for expert selection, e.g., by computing a similarity score between the query embedding and each expert-description embedding and selecting the highest-scoring expert.
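The embedding-based selection idea mentioned above is speculative, but a toy sketch shows its shape. Everything here is an assumption for illustration: the bag-of-words `embed` function stands in for a real embedding model, and cosine similarity is one possible scoring function, not one the source specifies.

```python
import math
from collections import Counter
from typing import Dict


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (placeholder for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_expert(query: str, descriptions: Dict[str, str]) -> str:
    """Pick the expert whose description scores highest against the query."""
    q = embed(query)
    return max(descriptions, key=lambda name: cosine(q, embed(descriptions[name])))


descriptions = {
    "OCR": "extract all visible text from an image",
    "Object Detector": "return bounding boxes and object labels",
}
print(select_expert("read the text in this image", descriptions))
```

Unlike the current protocol, this scheme would move the choice of expert out of the LLM's generated text, so it would change the division of labor between controller and system rather than just the parsing step.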
In summary, MM-ReAct demonstrates that with meticulously crafted prompt designs and a ReAct-style architecture interleaving reasoning and tool invocation, competitive multimodal reasoning and decision-making can be achieved without dedicated multimodal pretraining or joint model finetuning (Yang et al., 2023).