M3DBench: Multimodal & Embodied AI Benchmarks
- M3DBench is a comprehensive suite of benchmarks that evaluates multimodal and embodied AI by integrating text, images, 3D perception, and tool-use in complex tasks.
- It covers diverse applications such as 3D spatial reasoning, medical teleconsultations, mobile manipulation, and multi-threaded tool workflows using large, structured datasets.
- The benchmarks employ advanced evaluation protocols and baseline models, revealing current limitations and guiding future research in AI modalities and embodied systems.
M3DBench constitutes a family of rigorous benchmarks designed to assess the capabilities of modern multimodal models and embodied agents across diverse, complex tasks. These benchmarks target the integration of multiple modalities—including text, images, 3D perception, language reasoning, and tool-use—in domains such as 3D spatial reasoning, embodied robotics, telemedicine, and AI-driven tool orchestration. The following entry surveys the principal M3DBench benchmarks, focusing on their dataset scope, task formalisms, methodological innovations, evaluation metrics, baseline analysis, and implications for future multimodal and embodied AI research.
1. Dataset Composition and Multimodal Scope
The principal M3DBench instances are characterized by large-scale, highly structured datasets encompassing a wide array of modalities.
M3DBench (3D Instruction-following Benchmark) (Li et al., 2023):
- Contains approximately 327,000 instruction–response pairs, with ≈138,000 involving complex interleaving of text, images, point/box/3D object prompts, and numeric coordinates.
- Each instruction is a sequence of primitives, where each primitive is drawn from text, 2D/3D visual prompts, or coordinates.
- Covers both region-level (object detection, visual grounding) and scene-level (dense captioning, question answering, planning, navigation, multi-round dialogue) tasks over complex 3D scenes.
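The interleaved instruction format described above can be pictured as a small data structure. The following is a minimal sketch only: every class and field name is hypothetical and chosen for illustration, not taken from the released dataset schema.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# All names below are hypothetical; they do not mirror the dataset's real schema.

@dataclass
class Text:
    content: str

@dataclass
class Point:
    xyz: Tuple[float, float, float]

@dataclass
class Box:
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]

@dataclass
class ImagePrompt:
    path: str

Primitive = Union[Text, Point, Box, ImagePrompt]

@dataclass
class Instruction:
    primitives: List[Primitive]

    def modalities(self) -> List[str]:
        """Return the distinct primitive types in order of first appearance."""
        seen: List[str] = []
        for p in self.primitives:
            name = type(p).__name__
            if name not in seen:
                seen.append(name)
        return seen

    def is_interleaved(self) -> bool:
        """True when text is mixed with at least one non-text prompt."""
        kinds = self.modalities()
        return "Text" in kinds and len(kinds) > 1

# A text question that refers to a 3D box in the middle of the sentence.
example = Instruction([
    Text("What is the object at"),
    Box(center=(1.0, 0.5, 0.4), size=(0.6, 0.6, 0.8)),
    Text("used for?"),
])
```

Keeping the primitives in one ordered list preserves where each visual prompt sits relative to the surrounding text, which is the property that makes an instruction interleaved rather than merely multimodal.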
3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark) (Sviridov et al., 26 Mar 2025):
- Focuses on LVLM-driven telemedical consultations.
- Provides 3,030 curated cases across 34 diagnosis classes (dermatological, dental, nail, ophthalmological, mucosal).
- Each case combines a high-resolution clinical image, basic complaint (symptom narrative), structured "atomic" additional complaints, and a ground-truth diagnosis.
- Data sources include Kaggle ODC, DPT, NDD, CD, ISIC Archive, Google SCIN, Fitzpatrick17k, with image curation and augmentation to ensure representation per class.
M3Bench (Whole-body Motion for Mobile Manipulation in 3D) (Zhang et al., 2024):
- Comprises 30,000 object rearrangement tasks across 119 photorealistic household scenes and 32 object types.
- Each task associates a 3D environment, robot configuration, and object-centric rearrangement instruction.
- Demonstration data is synthesized via M3BenchMaker, which leverages optimization and affordance learning for motion trajectory generation.
M3-Bench (Multi-modal, Multi-hop, Multi-threaded Tool-using MLLM Agent Benchmark) (Zhou et al., 21 Nov 2025):
- Spans 28 MCP servers with 231 tools, exposing agents to image and text inputs, multi-hop and multi-threaded workflows, and toolchain dependencies.
- Benchmarks agent orchestration of tool calls, persistent resources, and reasoning over graphs of tool dependencies.
2. Task Taxonomies and Formalizations
M3DBench benchmarks are unified by formal taxonomies that delineate the compositional complexity of the tasks.
2.1 3D Multimodal Reasoning (Li et al., 2023)
- Region-Level:
  - Object Detection (OD): the input is a point cloud, the output is a set of bounding boxes, and the metric is Acc@mIoU.
  - Visual Grounding (VG): cross-modal reference to localize objects.
- Scene-Level:
  - Dense Captioning (DC), Visual Question Answering (VQA), Multi-region Reasoning (MR), Scene Description (SD), Multi-round Dialogue (MD), Embodied Planning (EP), Vision-Language Navigation (VLN).
- Prompts are highly compositional, supporting interleaved cues (e.g., "At the pointed region <box:...> tell me...").
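Localization metrics of the Acc@IoU kind rest on the overlap between a predicted and a ground-truth box. The sketch below assumes axis-aligned boxes and a caller-supplied threshold; the benchmark's exact box parameterization and threshold are not stated in this entry, so both are assumptions.

```python
from typing import Sequence, Tuple

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box3D = Tuple[float, float, float, float, float, float]

def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)

def accuracy_at_iou(preds: Sequence[Box3D], gts: Sequence[Box3D],
                    threshold: float) -> float:
    """Fraction of paired predictions whose IoU meets the threshold."""
    if not gts:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, two 2×2×2 cubes offset by one unit along every axis share a unit cube, so their IoU is 1/15 and the pair would count as a miss at any common threshold.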
2.2 LVLM-driven Medical Dialogue (Sviridov et al., 26 Mar 2025)
- Multi-agent simulation with role-playing Patient Agents (four temperament profiles) and an Assessor Agent (self-consistent evaluation).
- Scenarios modeled as turn-based interaction leveraging both image context and text-based complaint trajectories.
- Tasks: Differential diagnosis, rationale generation, context-sensitive questioning, dialogue adherence to clinical competence rubrics.
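The turn-based interaction can be sketched as a simple loop between a doctor model and a simulated patient. This is an illustrative skeleton under stated assumptions: the `Case` fields, the `DIAGNOSIS:` stop convention, and the scripted agents are all hypothetical stand-ins for the LVLM-backed agents used in the benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)

@dataclass
class Case:  # hypothetical case record
    image_path: str
    basic_complaint: str
    additional_complaints: List[str]
    diagnosis: str

def run_consultation(case: Case,
                     doctor: Callable[[str, List[Turn]], str],
                     patient: Callable[[Case, List[Turn]], str],
                     max_turns: int = 5) -> Tuple[List[Turn], str]:
    """Alternate doctor and patient turns until a diagnosis or the turn limit."""
    history: List[Turn] = [("patient", case.basic_complaint)]
    for _ in range(max_turns):
        reply = doctor(case.image_path, history)
        history.append(("doctor", reply))
        if reply.startswith("DIAGNOSIS:"):  # assumed stop convention
            return history, reply[len("DIAGNOSIS:"):].strip()
        history.append(("patient", patient(case, history)))
    return history, ""

def scripted_patient(case: Case, history: List[Turn]) -> str:
    """Reveal one additional complaint per doctor question (stand-in agent)."""
    asked = sum(1 for speaker, _ in history if speaker == "doctor")
    idx = asked - 1
    if idx < len(case.additional_complaints):
        return case.additional_complaints[idx]
    return "Nothing else to add."

def scripted_doctor(image_path: str, history: List[Turn]) -> str:
    """Ask one follow-up question, then commit to a fixed diagnosis (stand-in)."""
    patient_turns = [u for s, u in history if s == "patient"]
    if len(patient_turns) < 2:
        return "Do you have any other symptoms?"
    return "DIAGNOSIS: eczema"

def assess(case: Case, predicted: str) -> bool:
    """Simplest possible assessor: exact match against the ground truth."""
    return predicted.lower() == case.diagnosis.lower()

case = Case("case_001.jpg", "Red patches on my arm.",
            ["itching at night"], "eczema")
history, predicted = run_consultation(case, scripted_doctor, scripted_patient)
```

In the benchmark the assessor additionally scores the dialogue against clinical competence rubrics; the exact-match check here only covers the diagnosis outcome.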
2.3 Embodied Mobile Manipulation (Zhang et al., 2024)
- Focused on end-to-end whole-body planning: mobile base + manipulator arm in realistic, constraint-rich environments.
- Task: Generate a feasible trajectory (joint space) to move/rearrange objects under collision/self-collision/joint-limit constraints.
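- Problem posed as a constrained optimization over the joint-space trajectory: minimize a cost that combines goal accuracy, smoothness, and efficiency, subject to the collision, self-collision, and joint-limit constraints.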
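One generic way to write such a problem is shown below. The notation is assumed for illustration and is not taken from Zhang et al.: $\tau = (q_1, \dots, q_T)$ denotes the joint-space trajectory of base and arm, the three $J$ terms stand for the cost components named above, and the weights $w_g, w_s, w_e$ are hypothetical.

$$
\begin{aligned}
\min_{\tau}\quad & w_{g}\, J_{\text{goal}}(\tau) + w_{s}\, J_{\text{smooth}}(\tau) + w_{e}\, J_{\text{eff}}(\tau) \\
\text{s.t.}\quad & q_{\min} \le q_t \le q_{\max}, \\
& d_{\text{env}}(q_t) \ge 0, \quad d_{\text{self}}(q_t) \ge 0, \qquad t = 1, \dots, T
\end{aligned}
$$

Here $d_{\text{env}}$ and $d_{\text{self}}$ stand for signed clearances to the environment and to the robot's own links, so the two inequalities express the collision and self-collision constraints, while the box constraint expresses the joint limits.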
2.4 Multimodal Tool-use Orchestration (Zhou et al., 21 Nov 2025)
- Agents are benchmarked on realistic multi-step, multi-threaded MCP tool workflows.
- Tasks require coordination of image/text input, execution of REST/file/vision tools, and alignment of intermediate state across steps.
- Each workflow is represented as a trajectory of tool calls, supporting both sequential and parallel (multi-threaded) execution.
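A trajectory of this kind can be represented as an ordered list of steps, where each step holds one or more calls that may run in parallel. The sketch below is an assumed representation for illustration; the tool names and the `ToolCall` fields are hypothetical and do not come from the benchmark's MCP servers.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToolCall:  # hypothetical record of one tool invocation
    tool: str
    arguments: Dict[str, str] = field(default_factory=dict)

Step = List[ToolCall]     # calls inside one step may execute in parallel
Trajectory = List[Step]   # steps execute in order

def flatten(trajectory: Trajectory) -> List[ToolCall]:
    """All calls in execution order, ignoring the step grouping."""
    return [call for step in trajectory for call in step]

def is_multi_threaded(trajectory: Trajectory) -> bool:
    """True when at least one step contains parallel calls."""
    return any(len(step) > 1 for step in trajectory)

def hop_count(trajectory: Trajectory) -> int:
    """Number of sequential steps (each step may depend on earlier ones)."""
    return len(trajectory)

# Three hops; the middle hop runs two independent calls side by side.
workflow: Trajectory = [
    [ToolCall("describe_image", {"path": "receipt.png"})],
    [ToolCall("search_web", {"query": "store opening hours"}),
     ToolCall("convert_currency", {"amount": "12.50", "to": "EUR"})],
    [ToolCall("write_file", {"path": "summary.txt"})],
]
```

Grouping calls by step keeps the sequential order explicit while leaving the order inside a step unconstrained, which matters when an evaluator has to decide whether two differently ordered trajectories are equivalent.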
3. Methodological Contributions and Baselines
3.1 Data Generation and Augmentation
- (Li et al., 2023) Introduces systematic prompts and region-object interleaving to produce diverse multimodal instructions and evaluation targets.
- (Zhang et al., 2024) Details the M3BenchMaker pipeline: task builder, scene sampler (for randomized object and base placement), affordance-driven goal proposal, and stochastic optimization for collision-free, physically plausible motion trajectories.
- (Sviridov et al., 26 Mar 2025) Employs GPT-4o-mini for synthetic but clinically relevant complaint generation, with subsequent human-in-the-loop refinement.
3.2 Benchmark Model Architectures
- (Li et al., 2023) Baseline: Scene Perceiver (3D encoder; PointNet++/Vote2Cap), multimodal instruction encoder (linear mappings and frozen feature extractors for each modality), LLM-based decoder (frozen OPT-6.7B, LLaMA-2-7B, Vicuna-7B), trained via autoregressive cross-entropy over outputs.
- (Sviridov et al., 26 Mar 2025) Patient Agents modeled via Llama-3-8B-Instruct; Assessor Agent via Llava-OneVision-Qwen2-72b-ov-chat-hf.
- (Zhang et al., 2024) Trajectory optimization employs TrajOpt-inspired sequential convex programming with virtual kinematic chains and affordance-conditioned goal sampling.
- (Zhou et al., 21 Nov 2025) Orchestration of tool calls evaluated with similarity-driven call alignment, leveraging sentence encoders for argument-matching, and Hungarian matching for one-to-one correspondences.
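The one-to-one alignment of predicted and reference tool calls can be illustrated with a small stand-in. Two substitutions are made to keep the sketch dependency-free, and both are assumptions rather than the benchmark's method: token-level Jaccard overlap replaces sentence-encoder similarity, and an exhaustive search over permutations replaces the Hungarian algorithm (it finds the same optimal assignment but is practical only for a handful of calls).

```python
from itertools import permutations
from typing import Dict, List, Tuple

Call = Tuple[str, Dict[str, str]]  # (tool name, arguments)

def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def call_similarity(pred: Call, ref: Call) -> float:
    """Zero if tool names differ, else mean argument similarity."""
    if pred[0] != ref[0]:
        return 0.0
    keys = set(pred[1]) | set(ref[1])
    if not keys:
        return 1.0
    return sum(token_jaccard(pred[1].get(k, ""), ref[1].get(k, ""))
               for k in keys) / len(keys)

def align(preds: List[Call], refs: List[Call]) -> List[Tuple[int, int, float]]:
    """Best one-to-one (pred index, ref index, similarity) pairs by total similarity."""
    sim = [[call_similarity(p, r) for r in refs] for p in preds]
    n, m = len(preds), len(refs)
    if n <= m:
        candidates = ([(i, j) for i, j in enumerate(perm)]
                      for perm in permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in enumerate(perm)]
                      for perm in permutations(range(n), m))
    best_pairs: List[Tuple[int, int]] = []
    best_total = -1.0
    for pairs in candidates:
        total = sum(sim[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Pairs with zero similarity are not counted as matches.
    return [(i, j, sim[i][j]) for i, j in best_pairs if sim[i][j] > 0.0]

def precision_recall(preds: List[Call], refs: List[Call]) -> Tuple[float, float]:
    matched = len(align(preds, refs))
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(refs) if refs else 0.0
    return precision, recall

refs = [("search", {"query": "weather in Paris"}),
        ("read_file", {"path": "notes.txt"})]
preds = [("read_file", {"path": "notes.txt"}),
         ("search", {"query": "Paris weather"}),
         ("delete_file", {"path": "notes.txt"})]
```

With these inputs the two reference calls are each matched to one prediction regardless of order, and the extra `delete_file` call stays unmatched, lowering precision but not recall.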
3.3 Baseline Metrics
- (Li et al., 2023) Per-task metrics: Acc@mIoU (object detection, grounding), BLEU, ROUGE-L, METEOR, CIDEr (captioning, QA, planning), and GPT-4 holistic scores for dialogue.
- (Sviridov et al., 26 Mar 2025) Clinically motivated F1, precision, recall for diagnosis; binary and ordinal scoring for dialogue quality; Cohen’s κ for inter-rater reliability.
- (Zhang et al., 2024) Task success rate (SR), end-effector goal distance, collision/self-collision/joint-limit violation rates, motion smoothness.
- (Zhou et al., 21 Nov 2025) Tool call recall/precision, argument similarity (cosine), step coherence, merge purity (multi-hop separation), and order consistency (permutational alignment), all with recall-weighted reporting to prevent metric inflation.
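The motivation for recall-weighted reporting can be seen in a toy calculation. The exact weighting used by the benchmark is not given in this entry, so the sketch assumes the simplest possible form, a product of the quality score and recall; the function name and the numbers are hypothetical.

```python
def recall_weighted(score: float, recall: float) -> float:
    """Scale a quality score measured on matched calls by the share of calls recovered.

    Assumed form (score * recall); the benchmark's actual weighting may differ.
    """
    if not (0.0 <= score <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("score and recall must lie in [0, 1]")
    return score * recall

# A cautious agent: few calls, but the ones it makes match well.
cautious = recall_weighted(score=0.9, recall=0.2)
# A thorough agent: slightly noisier arguments, far better coverage.
thorough = recall_weighted(score=0.7, recall=0.8)
```

Without the weighting, the cautious agent would appear stronger on argument similarity (0.9 versus 0.7) even though it recovers only a fifth of the required calls; with it, the ranking reverses.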
4. Experimental Findings and Quantitative Analysis
4.1 Task and Model-level Performance
3D Multimodal Instruction-following (Li et al., 2023):
- Best dense captioning (BLEU-1 ≈ 11.96), visual QA (BLEU-1 ≈ 61.0), embodied QA (BLEU-1 ≈ 47.4).
- Low localization accuracy (≈ 1–3% under the reported IoU-thresholded accuracy metric) indicates limitations in 3D feature fusion.
- Multi-task instruction tuning yields partial generalization to held-out embodied tasks, especially with LLaMA-2 architectures.
Medical Multi-agent Dialogue (Sviridov et al., 26 Mar 2025):
- GPT-4o-mini in immediate, minimal input condition: F1 = 50.4%; gains to F1 = 66.8% with all complaints, and up to 70.3% with internal reasoning and CNN prediction cues injected.
- Internal reasoning prompts yield +6.5% F1 improvement; multimodal dialogue yields +1.4% over text-only; CNN cues (predictions of EfficientNetV2-XL) give up to +20% F1 gain.
- Clinical competence rubric: near-ceiling for history/symptom enquiry and rapport; lower scores for plan accuracy and diagnostic rationale (Table 2 in (Sviridov et al., 26 Mar 2025)).
Embodied Mobile Manipulation (Zhang et al., 2024):
- VKC+Afford (modmp) yields ~20% pick and ~3% place success rates. End-to-end MPNet-style and Skill-Transformer baselines demonstrate near-zero success, with frequent collisions and misaligned maneuvers.
- Place tasks exhibit lower success and higher time/collision penalties, reflecting orientation and stability challenges.
Tool-using MLLM Agent Benchmark (Zhou et al., 21 Nov 2025):
- GPT-5 system achieves highest average score (0.482), recall (0.627), argument similarity (0.583), and structure (StepCoh = 0.502, OrdCons = 0.290, MergePur = 0.453).
- Compact models may achieve higher precision at the expense of recall/coverage and dialogue length.
- Open-source models show substantial deficit in complex tool reasoning and structural consistency.
4.2 Error and Limitation Analysis
- (Li et al., 2023): Fine-grained perception and spatial referencing in 3D remain unsolved due to shallow fusion in baselines; GPT-4 dialogue scoring penalizes brevity and hallucination.
- (Sviridov et al., 26 Mar 2025): Dialogue models excel at rapport-building but exhibit deficits in clinical plan articulation and rationale explanation. Hybrid architectures demonstrate that image classifier fusion mitigates these weaknesses.
- (Zhang et al., 2024): Most models fail to coordinate base-arm trajectories in cluttered scenes, with hybrid approaches modestly outperforming deep learning-only baselines; the results suggest a gap in integrating high-level semantics with low-level kinematics.
- (Zhou et al., 21 Nov 2025): Common failures include schema-inconsistent tool calls, hallucinated invocation, argument-value errors, and poor handling of multi-threaded and multi-hop dependencies.
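Several of the tool-use failure modes listed above (hallucinated invocations, schema-inconsistent calls, argument-value errors) can be caught by checking each call against a declared schema. The sketch below is an assumption-laden illustration: the registry, its tools, and the type-based schema are hypothetical and far simpler than a real MCP tool schema.

```python
from typing import Any, Dict, List

# Hypothetical registry: tool name -> {argument name: expected Python type}.
REGISTRY: Dict[str, Dict[str, type]] = {
    "resize_image": {"path": str, "width": int, "height": int},
    "search_web": {"query": str},
}

def validate_call(name: str, arguments: Dict[str, Any],
                  registry: Dict[str, Dict[str, type]] = REGISTRY) -> List[str]:
    """Return a list of problems with one tool call; empty means it passes."""
    schema = registry.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]  # hallucinated invocation
    errors: List[str] = []
    for arg in schema:
        if arg not in arguments:
            errors.append(f"missing argument: {arg}")
    for arg, value in arguments.items():
        if arg not in schema:
            errors.append(f"unexpected argument: {arg}")
        elif not isinstance(value, schema[arg]):
            errors.append(
                f"wrong type for {arg}: expected {schema[arg].__name__}")
    return errors
```

For instance, `validate_call("resize_image", {"path": "a.png", "width": "64"})` reports both the missing `height` and the string passed where an integer is expected, while a call to an unregistered tool is flagged before its arguments are inspected.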
5. Evaluation Protocols
5.1 Quantitative Metrics
A range of established and benchmark-specific metrics are standardized and applied:
| Benchmark | Task Example | Metric(s) |
|---|---|---|
| (Li et al., 2023) | Region/scene QA, planning | BLEU, ROUGE-L, METEOR, CIDEr, Acc@mIoU |
| (Sviridov et al., 26 Mar 2025) | Diagnosis, clinical dialogue | Macro-averaged F1, Precision, Recall, Cohen’s κ |
| (Zhang et al., 2024) | Whole-body motion planning | SR, goal distance, collision/self-collision, smoothness |
| (Zhou et al., 21 Nov 2025) | Multimodal tool use | Recall, Precision, ArgSim, StepCoh, MergePur, OrdCons, LLM-ensemble scoring |
Metrics are tightly coupled to each task formalization: semantic fidelity, workflow structure, and multimodal alignment for the tool-use benchmark; trajectory feasibility and constraint satisfaction for the robotic rearrangement tasks; and rubric-driven dialogue scoring for the medical setting.
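Two of the metrics in the medical row of the table, macro-averaged F1 and Cohen's κ, have standard definitions that can be written out directly. The sketch below follows those textbook definitions; the diagnosis labels and rater scores are hypothetical, and the choice to average F1 over the union of true and predicted labels is an assumption about how absent classes are handled.

```python
from typing import Hashable, Sequence

def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    total = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n)
                   for l in labels)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use one label
    return (observed - expected) / (1.0 - expected)

# Hypothetical diagnosis labels for four cases.
y_true = ["acne", "acne", "eczema", "psoriasis"]
y_pred = ["acne", "eczema", "eczema", "acne"]

# Hypothetical binary rubric scores from two raters on six dialogues.
rater_a = [1, 1, 0, 0, 1, 0]
rater_b = [1, 0, 0, 0, 1, 1]
```

In the diagnosis example the per-class F1 values are 1/2, 2/3, and 0, so the macro average is 7/18; the rare class that is never predicted pulls the score down as much as a common one would, which is why macro averaging suits a benchmark with many small diagnosis classes.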
6. Open Challenges and Future Directions
Identified open challenges include:
- Developing richer fusion architectures for fine-grained 3D perception and spatially localized language grounding (Li et al., 2023).
- Closing the gap in clinical diagnostic planning and rationale generation in multi-agent telemedicine dialogue, possibly via model-based or hybrid approaches (Sviridov et al., 26 Mar 2025).
- Advancing hierarchical and sampling-based planners to improve coordinated manipulation under physical constraints, surpassing the brittle performance of current end-to-end and hybrid VKC+affordance solutions (Zhang et al., 2024).
- Achieving robust, auditably consistent multi-hop and multi-threaded multimodal tool-use by integrating schema-centric planning, tool-graph reasoning, and dynamic prompt schema discovery (Zhou et al., 21 Nov 2025).
Suggested future extensions include:
- End-to-end fine-tuning of all model components (rather than only lightweight projectors) in 3D instruction-following (Li et al., 2023).
- Incorporating temporal, semantic, and multi-agent coordination data for embodied and robotics tasks.
- Extending tool-use benchmarks beyond current MCP schema to encompass dynamic, open-vocabulary toolchains and real-world physical agents.
A plausible implication is that enhanced modality fusion, explicit reasoning over domain constraints, and cross-task/generalization-aware training protocols collectively represent the main frontiers to be addressed.
7. Public Resources and Reproducibility
All principal M3DBench benchmarks provide open-source data, baseline models, and evaluation scripts. Code and dataset links are as follows:
- 3D Multimodal instruction-following: https://github.com/OpenM3D/M3DBench (Li et al., 2023)
- Medical multi-agent dialogue: https://github.com/univanxx/3mdbench (Sviridov et al., 26 Mar 2025)
- Whole-body mobile manipulation: See referenced supplementary materials (Zhang et al., 2024)
- Multi-modal tool-use agent benchmark: https://github.com/EtaYang10th/Open-M3-Bench (Zhou et al., 21 Nov 2025)
Each repository includes model weights, annotation protocols, quantitative evaluation templates, and sample reproduction commands. This ensures full auditability, metric provenance, and baseline reproducibility for all tasks.