MM-ReAct: Multimodal Reasoning Framework

Updated 18 March 2026

MM-ReAct is a prompting-based system that integrates a ChatGPT controller with specialized vision experts for zero-shot multimodal reasoning.
It employs a ReAct-style loop where 'Thought' and 'Action' steps sequentially invoke external vision tools to manage complex visual tasks.
The system demonstrates competitive performance on diverse benchmarks, offering extensibility and practical applications without joint vision-language pretraining.

MM-ReAct is a prompting-based system that pairs an LLM controller, specifically ChatGPT (gpt-3.5-turbo), with a pool of modular vision experts to achieve multimodal reasoning and action in a zero-shot fashion. The design lets the LLM orchestrate complex visual tasks by sequentially invoking external vision modules as tools, using structured prompt engineering and a ReAct-style division between reasoning ("Thought") and tool invocation ("Action"). The system neither embeds raw visual signals in the prompt nor relies on joint vision-language pretraining; instead, it textualizes visual results and exchanges visual references strictly via file-path tokens. This approach demonstrates competitive or superior performance on a diverse set of advanced multimodal benchmarks without additional model-specific finetuning (Yang et al., 2023).

1. System Architecture and Orchestration

MM-ReAct decomposes multimodal question answering into a language-based planning loop involving two principal elements: a controller LLM (ChatGPT) and a dynamic pool of specialist "vision experts". The controller LLM coordinates all dialog, internal reasoning, and actions via prompt engineering, while vision experts—themselves independent models specialized for tasks such as OCR, object detection, captioning, video summarization, or table parsing—are invoked as external processes ("tools").

At runtime, uploaded images and videos are referenced using opaque placeholder tokens (e.g., "<ImagePath1>", "<VideoPathA>"). The prompt provided to ChatGPT includes explicit descriptions of each expert's name, input/output signature, and example invocations, thereby conditioning the LLM's tool-use policy via in-context learning. The system loop proceeds as follows:

User submits a question and reference files;
ChatGPT receives the prefixed prompt and user input;
For each dialog turn, ChatGPT either emits a "Thought:" (intermediate reasoning) or an "Action request: Assistant, please run <ExpertName> on <ImagePathX>.";
The system parses the action via regex, invokes the corresponding vision expert, and textualizes the output as "Observation";
The observation is appended to the prompt for continued deliberation;
The procedure iterates until ChatGPT emits a final answer, which is returned to the user.

2. Text-Only Prompt Encoding and Formal Notation

MM-ReAct encodes all multimodal state into text-based constructs, separating visual data exchange from model embeddings:

File-path placeholders: The file set $F = \{ f_1, f_2, \dots, f_k \}$ comprises string tokens (e.g., $\mathtt{<ImagePath1>}$), which the LLM uses to reference visual assets during planning and tool calls. These tokens themselves are semantically inert except as pointers.
Serialized results: Vision experts return outputs as textual sequences—text lines for OCR, tuple-lists for detections (e.g., $b_j = ( \ell_j, x_j^{(1)}, y_j^{(1)}, x_j^{(2)}, y_j^{(2)} )$ for object bounding boxes), or JSON-style lists for structured semantic extraction. Legends and serialization conventions are included to instruct the LLM.
System state: Each input at time $t$ is $(Q_t, F)$, with $Q_t$ the user query or system "Thought", and $F$ the current file-path set. Outputs are either $\text{Thought}_t$ (textual reasoning) or $\text{Action}_t$ (structured expert invocation).

Dense visual signals (pixels) are never embedded in the LLM's prompt; only the file-path tokens and serialized results are exchanged.
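The text-only encoding described above can be illustrated with a minimal Python sketch. The helper names (`register_files`, `serialize_detections`), the legend wording, and the example path are assumptions made for illustration; the source does not specify the exact serialization format.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """One detector result b_j = (label, x1, y1, x2, y2)."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def register_files(paths: list[str]) -> dict[str, str]:
    """Map placeholder tokens such as <ImagePath1> to real file paths.

    Only the tokens (the dict keys) would ever appear in the prompt.
    """
    return {f"<ImagePath{i}>": path for i, path in enumerate(paths, start=1)}


def serialize_detections(detections: list[Detection]) -> str:
    """Render detections as plain text, led by a legend line for the LLM."""
    legend = "Objects (label, x1, y1, x2, y2):"  # hypothetical legend wording
    rows = [f"({d.label}, {d.x1}, {d.y1}, {d.x2}, {d.y2})" for d in detections]
    return "\n".join([legend, *rows])


files = register_files(["/tmp/receipt.png"])  # hypothetical path
observation = serialize_detections([Detection("cup", 10, 20, 110, 220)])
print(files)
print(observation)
```

The controller only ever sees the token and the serialized text; the mapping from token to real path stays on the system side.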

3. Prompt Templates, Example Interactions, and Tool Invocation

The initialization prompt conditions ChatGPT to operate as a "multimodal reasoning agent", enumerating each expert's API and furnishing several in-context examples per expert. Each example demonstrates the full loop: user query, model reasoning, action formulation, and the use of structured "Observation" text. For instance:

OCR Expert:
- Capability: Extract all visible text from an image.
- Input: <ImagePath>
- Output: List of text lines.
- Protocol: "Assistant, please run OCR on <ImagePath1>."
Object Detector Expert:
- Capability: Return bounding boxes and object labels.
- Input: <ImagePath>
- Output: List of <label, x1, y1, x2, y2> tuples.

Instructions mandate that the agent always emits a reasoning "Thought" prior to any Action or final answer. The invocation protocol is identified via the explicit phrase "Assistant, please run <ExpertName> on <ImagePathX>", enabling regex-based parsing and consistent logging of expert actions.
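A regex-based parser for this invocation phrase might look like the sketch below. The exact pattern used by the original system is not given in the source, so `ACTION_RE` and `parse_action` are illustrative assumptions that simply match the phrase quoted above.

```python
import re
from typing import Optional

# Assumed pattern: expert name, then a placeholder token such as
# <ImagePath1> or <VideoPathA>. The real system's regex may differ.
ACTION_RE = re.compile(
    r"Assistant, please run (?P<expert>.+?) on (?P<file><[A-Za-z]+Path\w+>)"
)


def parse_action(response: str) -> Optional[tuple[str, str]]:
    """Return (expert_name, file_token) if the response requests a tool call."""
    match = ACTION_RE.search(response)
    if match is None:
        return None
    return match.group("expert"), match.group("file")


print(parse_action("Thought: I need the text. Assistant, please run OCR on <ImagePath1>."))
print(parse_action("Thought: The chart has three bars."))
```

Responses without the phrase yield `None`, which lets the caller treat them as pure reasoning steps.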

4. Runtime Procedure and Pseudocode

The MM-ReAct orchestration algorithm proceeds in a sequential, interleaved fashion, as formalized by the following pseudocode:

Input:
    - User question Q
    - File–path set F = {f1,…,fK}
    - Prefix prompt P (describes experts & examples)
Context C ← P ∥ "User: " ∥ Q
while true:
    Response R ← ChatGPT_generate(C)
    if R contains final answer A:
        return A
    else if R contains "Assistant, please run E on f":
        Parse expert E and file f via regex
        Obs ← RunExpert(E, f)
        C ← C ∥ R ∥ "Observation: " ∥ Obs
    else:
        # it’s a “Thought:” without an action
        C ← C ∥ R

At each step, ChatGPT may deliberate or trigger a tool call; results propagate as text back into the context window, recursively enabling multi-hop or multi-tool planning. The process repeats until a conclusive answer is generated.
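The pseudocode can be turned into a small runnable Python sketch. Several details here are assumptions rather than part of the source: the `Final answer:` marker, the `max_steps` guard against endless loops, the handling of unknown experts, and the scripted stand-in for ChatGPT (`scripted_llm`), which replaces a real API call so the example runs offline.

```python
import re
from typing import Callable

ACTION_RE = re.compile(r"Assistant, please run (.+?) on (<\w+>)")
FINAL_PREFIX = "Final answer:"  # assumed marker; the source does not define one


def mm_react_loop(
    question: str,
    prefix: str,
    generate: Callable[[str], str],
    experts: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """Interleave LLM responses and expert observations until an answer appears."""
    context = f"{prefix}\nUser: {question}"
    for _ in range(max_steps):
        response = generate(context)
        if FINAL_PREFIX in response:
            return response.split(FINAL_PREFIX, 1)[1].strip()
        context += "\n" + response  # keep the Thought/Action in the context
        match = ACTION_RE.search(response)
        if match:
            expert, file_token = match.groups()
            run = experts.get(expert)
            observation = run(file_token) if run else f"Unknown expert: {expert}"
            context += f"\nObservation: {observation}"
    raise RuntimeError("no final answer within max_steps")


def scripted_llm(context: str) -> str:
    """Stand-in for ChatGPT: request OCR first, then answer from the observation."""
    if "Observation:" not in context:
        return "Thought: I should read the text. Assistant, please run OCR on <ImagePath1>."
    return "Thought: The receipt shows the total. Final answer: 12.50"


experts = {"OCR": lambda token: "Total: 12.50"}  # stub vision expert
answer = mm_react_loop(
    "What is the total on <ImagePath1>?",
    "You are a multimodal reasoning agent.",
    scripted_llm,
    experts,
)
print(answer)
```

The step cap is a design choice added for safety; the pseudocode above loops unconditionally until a final answer is produced.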

5. Empirical Evaluation: Tasks and Comparative Performance

Zero-shot efficacy of MM-ReAct is demonstrated across a suite of challenging multimodal scenarios, each evaluated without additional model-specific finetuning. These include:

Visual math and diagram reasoning
Meme interpretation and context-dependent humor
Spatial and coordinate-based tasks, complex visual planning
Multi-image aggregation (e.g., totaling costs across receipts)
Multi-hop chart and flowchart question answering
Table-based factual queries; open-world object grounding
Video summarization and event detection

Correctness was assessed by human judges. Side-by-side comparisons with PaLM-E, a jointly finetuned state-of-the-art vision-language model, indicate that MM-ReAct matches or outperforms PaLM-E's zero-shot performance on several tasks. An illustrative table:

Task	MM-ReAct (zero-shot)	PaLM-E (zero-shot)
Receipt sum	100% (10/10)	90% (9/10)
Bar-chart QA	80% (8/10)	70% (7/10)
Meme understanding	100% (5/5)	80% (4/5)
Video summary	80% (4/5)	60% (3/5)

This suggests that explicit tool orchestration via prompt engineering can deliver competitive zero-shot multimodal capabilities (Yang et al., 2023).

6. Paradigmatic Comparison and Extensibility

MM-ReAct diverges from the joint vision-language pretraining prevalent in integrated models such as Flamingo, PaLM-E, and GPT-4V, which encode visual signals as continuous tokens and require extensive labeled data and compute for finetuning. Instead, MM-ReAct assembles existing vision tools "on the fly," using only prompt edits to extend system competence to new tasks or experts, subject to the limits of the available LLM context window.

A key feature of MM-ReAct is modular extensibility: a new expert is added by appending its description and example interactions to the prefix prompt, with no change to the controller itself.
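As a rough sketch of this extensibility, the prefix prompt can be assembled from a list of expert descriptions, so that adding an expert amounts to appending one entry. The `ExpertSpec` class, `build_prefix` function, header text, and the capability wording for the table parser are all hypothetical; only the field layout follows the expert descriptions shown in Section 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertSpec:
    """Textual description of one vision expert (illustrative structure)."""
    name: str
    capability: str
    input_sig: str
    output_sig: str

    def describe(self) -> str:
        """Format the spec in the same layout as the prompt examples."""
        return "\n".join([
            f"{self.name} Expert:",
            f"- Capability: {self.capability}",
            f"- Input: {self.input_sig}",
            f"- Output: {self.output_sig}",
            f'- Protocol: "Assistant, please run {self.name} on <ImagePath1>."',
        ])


def build_prefix(header: str, experts: list[ExpertSpec]) -> str:
    """Join the header and every expert description into one prefix prompt."""
    return "\n\n".join([header, *(e.describe() for e in experts)])


experts = [
    ExpertSpec("OCR", "Extract all visible text from an image.",
               "<ImagePath>", "List of text lines."),
    ExpertSpec("Object Detector", "Return bounding boxes and object labels.",
               "<ImagePath>", "List of <label, x1, y1, x2, y2> tuples."),
]
# Extending the system: append one more description, nothing else changes.
table_parser = ExpertSpec("Table Parser", "Extract table cells as rows.",  # hypothetical wording
                          "<ImagePath>", "List of rows.")
prefix = build_prefix("You are a multimodal reasoning agent.", experts + [table_parser])
print(prefix)
```

In practice each added description consumes context-window tokens, so the number of experts that can be listed this way is bounded.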

7. Practical Applications, Limitations, and Future Enhancements

Demonstrated applications include automated receipt processing (multi-image sum), OCR-augmented data entry, diagrammatic and mathematical tutoring, analytical QA on charts and flowcharts, and contextual analysis of media such as memes and video.

Noted limitations include:

Predominantly qualitative evaluation; lack of unified annotated benchmarks for several task categories.
Reliance on expert pool defined in the initialization prompt; addressing unseen tasks requires manual prompt engineering.
Bounded by the LLM context window (approximately 4k tokens), limiting expert enumeration and in-context demonstrations.
Constraints of textual serialization for representing complex visual features.
Sequential tool-use protocol induces higher latency owing to multiple API calls.

Currently, the LLM selects an expert symbolically by emitting the expert's name in the action string, which the system then extracts by fixed regex pattern matching. Future variants could plausibly employ embedding-based scoring for expert selection, e.g. by computing $s_e = \cos( f(Q), f(D_e) )$ over embeddings of the query $Q$ and each expert description $D_e$, and selecting $e^* = \arg\max_e s_e$.
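The embedding-based scoring idea can be sketched as follows. This is not part of MM-ReAct as described; it is a toy illustration in which `embed` is a bag-of-words stand-in for a real sentence encoder $f$, and the expert descriptions are borrowed from the prompt examples in Section 3.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding f(.); a placeholder for a real encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_expert(query: str, descriptions: dict[str, str]) -> str:
    """Return e* = argmax_e cos(f(Q), f(D_e))."""
    q = embed(query)
    return max(descriptions, key=lambda name: cosine(q, embed(descriptions[name])))


descriptions = {
    "OCR": "extract all visible text from an image",
    "Object Detector": "return bounding boxes and object labels",
}
chosen = select_expert("extract the text printed on this image", descriptions)
print(chosen)
```

A real variant would swap `embed` for a learned text encoder; the argmax selection step stays the same.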

In summary, MM-ReAct demonstrates that with meticulously crafted prompt designs and a ReAct-style architecture interleaving reasoning and tool invocation, competitive multimodal reasoning and decision-making can be achieved without dedicated multimodal pretraining or joint model finetuning (Yang et al., 2023).

References (1)

Yang et al. (2023). MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.
