- The paper demonstrates that adaptive coordination-path selection significantly improves multimodal reasoning by tailoring strategies to varied input characteristics.
- It employs a planner-executor framework that assigns one of five distinct coordination paths with interpretable intermediate reasoning traces.
- Experimental results show notable accuracy gains and efficient token usage compared to fixed-path models, underscoring the framework’s practical value.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
Motivation and Path Diversity in Multimodal Reasoning
Unified multimodal models (UMMs) increasingly aim to integrate understanding and generation within a single architecture, yet inference-time coordination between the two remains largely underexplored. Prior approaches either enforce a fixed coordination protocol or couple the modalities only during training, which produces rigid execution patterns and limits adaptability. The paper provides empirical evidence that multimodal tasks exhibit substantial coordination-path diversity: individual inputs and subject domains consistently benefit from distinct coordination strategies. Oracle path selection at both the instance level and the subject level yields large gains over fixed-path baselines, indicating that the reasoning paths are strongly complementary.


Figure 1: Different inputs favor different coordination paths, illustrating the need for adaptive coordination.
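The oracle comparison described above can be made concrete with a small sketch. It assumes a hypothetical table recording whether each path answers each input correctly; the table values, the function names, and the short path labels are illustrative assumptions and do not come from the paper. The best fixed path is the single path with the highest accuracy, while the instance-level oracle counts an input as solved if any path solves it.

```python
# Hypothetical correctness table: correct[i][p] is True when path p answers
# input i correctly. The values are made up for illustration only.
PATHS = ["A", "U", "R", "C", "H"]

correct = [
    {"A": True,  "U": False, "R": False, "C": False, "H": False},
    {"A": False, "U": True,  "R": True,  "C": False, "H": False},
    {"A": False, "U": False, "R": False, "C": True,  "H": False},
    {"A": False, "U": False, "R": True,  "C": False, "H": True},
]


def fixed_path_accuracy(table, path):
    """Accuracy when every input is forced through the same path."""
    return sum(row[path] for row in table) / len(table)


def best_fixed_path(table):
    """The single path with the highest accuracy over the whole table."""
    return max(PATHS, key=lambda p: fixed_path_accuracy(table, p))


def instance_oracle_accuracy(table):
    """Accuracy when an oracle picks, per input, any path that succeeds."""
    return sum(any(row.values()) for row in table) / len(table)
```

On this toy table the best fixed path solves half of the inputs while the instance-level oracle solves all of them. The difference between those two numbers is the kind of gap the text describes as complementarity among paths.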
UniPath Framework: Path-Based Planner-Executor Architecture
UniPath operationalizes coordination-path diversity through an adaptive planner-executor framework. The system formalizes multimodal reasoning as the selection and execution of a coordination path, moving beyond capability coexistence toward explicit, input-dependent orchestration. Five canonical coordination paths are defined, each centered on one functional role: direct answering ($p_{\mathrm{A}}$), explicit understanding ($p_{\mathrm{U}}$), textual reasoning ($p_{\mathrm{R}}$), visual-thought construction ($p_{\mathrm{C}}$), and hypothesis exploration ($p_{\mathrm{H}}$). Role-aligned trajectories are constructed for supervised training, marking intermediate states according to their functional role and aligning visual-thought spans via image-derived summaries.
Figure 2: Overview of the training and inference process of UniPath, with separate planner and executor modules and explicit path conditioning.
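The control flow of a planner-executor split over the five paths can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_planner` is a hypothetical keyword heuristic standing in for the learned planner, `toy_executor` is a hypothetical stub standing in for the path-conditioned model, and the `Path` enum simply mirrors the five path labels.

```python
from enum import Enum


class Path(Enum):
    """The five coordination paths named in the text."""
    DIRECT_ANSWER = "A"
    UNDERSTANDING = "U"
    TEXT_REASONING = "R"
    VISUAL_THOUGHT = "C"
    HYPOTHESIS = "H"


def toy_planner(query: str) -> Path:
    """Hypothetical keyword router; the real planner is a learned model."""
    q = query.lower()
    if "diagram" in q or "circuit" in q:
        return Path.VISUAL_THOUGHT
    if "which of" in q:
        return Path.HYPOTHESIS
    if "calculate" in q or "why" in q:
        return Path.TEXT_REASONING
    if "describe" in q:
        return Path.UNDERSTANDING
    return Path.DIRECT_ANSWER


def toy_executor(query: str, path: Path) -> str:
    """Hypothetical executor stub: tags the output with its conditioning path."""
    return f"[path={path.value}] response to: {query}"


def run(query, planner=toy_planner, executor=toy_executor):
    """Select a path first, then execute the query conditioned on that path."""
    path = planner(query)
    return path, executor(query, path)
```

The point of the sketch is the separation of concerns: the planner only chooses a path, and the executor receives that choice as explicit conditioning, so either component can be swapped out independently.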
Executor training follows a staged curriculum with progressively activated objectives for textual, visual, and image-generation components. The path-conditioned executor is trained to produce interpretable reasoning traces, enabling differentiated intermediate behaviors and consistent output formatting. Planner supervision is multi-label, as several paths may solve a single input. Planner routing is calibrated with surface-level query-form buckets, mitigating domain bias and supporting generalization across benchmark distributions.
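Because several paths may solve the same input, the planner target is a set of paths, not a single class. One common way to train against such targets is an independent sigmoid per path with binary cross-entropy; the sketch below assumes that formulation, which the summary does not confirm as the paper's actual loss. The function names and example values are hypothetical.

```python
import math

PATHS = ["A", "U", "R", "C", "H"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over paths.

    logits:  one raw planner score per path
    targets: 1 if the path solves the input, else 0 (several may be 1)
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


# Hypothetical example: paths A and R both solve this input.
targets = [1, 0, 1, 0, 0]
loss_good = multilabel_bce([4.0, -4.0, 4.0, -4.0, -4.0], targets)
loss_bad = multilabel_bce([-4.0, 4.0, -4.0, 4.0, 4.0], targets)
```

Unlike a softmax over paths, this formulation does not penalize the planner for scoring two valid paths highly at once, which matches the multi-label framing in the text.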
Experimental Results and Benchmark Comparisons
UniPath achieves consistent improvements over the BAGEL backbone and other leading UMMs across expert-level multimodal benchmarks. On understanding tasks, UniPath outperforms baselines by significant margins (+4.3% on MMMU, +4.4% on MMBench-EN, +7.7% on MMStar), with gains directly attributable to adaptive path selection rather than larger post-training corpora. Notably, MathVista exhibits lower improvement due to homogeneous problem types, further confirming the necessity of path diversity in heterogeneous settings. Visual-thought alignment in executor training yields competitive results on generation tasks, while preserving understanding-generation consistency.
Planner Behavior and Coordination Path Analysis
The planner's routing behavior is consistent across benchmarks: selection frequencies for coordination paths correlate with task domain and complexity. Expert-level questions (MMMU) favor visual-thought construction, while recognition and commonsense datasets (MMBench-EN, MMStar) are dominated by direct answering. Conditional accuracy remains high across all selected paths, supporting the alignment between input characteristics and coordination strategy. Planner ablations confirm that neither raw learned scores nor bucket-only routing is sufficient on its own. Deployable planners that combine both mechanisms outperform fixed-path and domain-adapted variants.
Figure 3: Planner behavior across benchmarks and validation transfer, reporting selected path distributions and conditional accuracies.
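The two quantities reported for planner behavior, the selected-path distribution and the conditional accuracy of each selected path, can be computed from per-example logs as in the sketch below. The record format (a list of `(selected_path, is_correct)` pairs) and the function name are assumptions made for illustration, and the sample records are invented.

```python
from collections import Counter, defaultdict


def path_statistics(records):
    """Compute selection frequency and conditional accuracy per path.

    records: list of (selected_path, is_correct) pairs, one per evaluated input.
    Returns (freq, cond_acc), two dicts keyed by path label.
    """
    counts = Counter(path for path, _ in records)
    hits = defaultdict(int)
    for path, ok in records:
        hits[path] += int(ok)
    n = sum(counts.values())
    freq = {p: c / n for p, c in counts.items()}
    # Accuracy restricted to the inputs routed to each path.
    cond_acc = {p: hits[p] / counts[p] for p in counts}
    return freq, cond_acc


# Invented records for illustration only.
records = [("A", True), ("A", True), ("A", False), ("C", True)]
freq, cond_acc = path_statistics(records)
```

Conditional accuracy is only defined for paths that were actually selected, so a path the planner never chooses on a benchmark does not appear in either dictionary.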
Efficiency and Token-Accuracy Tradeoff
Adaptive coordination does not inflate output length. UniPath maintains or improves accuracy while using fewer output tokens than fixed-path or interleaved-reasoning post-training baselines (e.g., IRG, UniCoT), placing its results closer to the upper-left region of the token-accuracy frontier. This outcome indicates that adaptive selection of reasoning paths is more effective and efficient than uniform, lengthy coordination traces.
Figure 4: Accuracy versus average output-token cost on understanding benchmarks, highlighting efficiency of adaptive path selection.
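The "upper-left" reading of a token-accuracy plot corresponds to Pareto dominance: a system is dominated if another system uses no more tokens and achieves at least the same accuracy, with at least one of the two strictly better. The sketch below computes the non-dominated set under that definition. The system names and numbers are hypothetical placeholders and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the names of non-dominated systems, sorted alphabetically.

    points: dict mapping system name -> (avg_output_tokens, accuracy).
    Fewer tokens and higher accuracy are both better.
    """
    frontier = []
    for name, (tok, acc) in points.items():
        dominated = any(
            (t2 <= tok and a2 >= acc) and (t2 < tok or a2 > acc)
            for other, (t2, a2) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical systems: (average output tokens, accuracy).
points = {
    "adaptive": (120, 0.62),
    "fixed_long": (400, 0.60),
    "fixed_short": (40, 0.50),
}
frontier = pareto_frontier(points)
```

In this toy example the long fixed trace is dominated because the adaptive system is both cheaper and more accurate, while the short fixed trace stays on the frontier because nothing beats it on token cost.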
Qualitative analyses confirm that path-specific execution provides interpretable intermediate reasoning traces. For instance, visual-thought construction successfully disentangles complex circuit diagrams, while hypothesis-based paths support explicit visual comparison among plausible alternatives. These traces show why and how each reasoning path was favored for a given input, which makes the task-specific strategy usable as a diagnostic.
Implications, Limitations, and Future Work
Theoretically, UniPath elevates coordination policy to a central modeling problem for UMMs, demonstrating that explicit path selection augments both accuracy and interpretability. Practically, the planner-executor paradigm supports efficiency, modularity, and robustness across multimodal domains. The main limitation is the current gap between deployable planners and oracle routing, driven by expensive supervision requirements and domain generalization challenges. Future research should develop cost-effective path supervision methods and planners with improved cross-domain robustness.
Advancements in adaptive coordination are expected to drive next-generation UMMs toward more selective computation, explainable intermediate behaviors, and efficient inference. Conversely, improvements in path selection could increase the risk profile of multimodal systems if misapplied. Careful evaluation and safeguard strategies will become increasingly vital as adaptive reasoning becomes standard across high-stakes verticals.
Conclusion
UniPath establishes adaptive coordination-path selection as a key driver of unified multimodal reasoning. By formalizing and empirically validating coordination-path diversity, the planner-executor architecture delivers superior task-level performance, efficient inference, and interpretable execution traces. Remaining challenges in planner generalization and supervision cost underscore future research directions; nonetheless, UniPath provides a robust foundation for the principled orchestration of understanding and generation in unified multimodal models (2605.11400).