Breadth–Depth Compute Allocation for LVLM Test-Time Reasoning

Determine the optimal allocation of test-time compute between breadth (sampling more reasoning paths via multi-pass decoding) and depth (using stronger chain-of-thought or "thinking" modes) for large vision–language models on perception tasks.

Background

The paper studies when and how test-time "thinking" (explicit chain-of-thought-style decoding) benefits visual reasoning in large vision–language models (LVLMs), comparing the InternVL3.5 and Qwen3-VL families on MMMU and other benchmarks. A central practical challenge is how to allocate a fixed compute budget across two axes: breadth (the number of sampled reasoning paths) and depth (invoking stronger reasoning modes).

The authors explicitly note that this compute-allocation problem is not yet settled for perception tasks. Their analysis shows that more thinking is not always better, motivating strategies that adaptively decide, per input, whether to expand in breadth or deepen reasoning so as to improve visual grounding and accuracy under budget constraints.
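To make the trade-off concrete, the following is a minimal sketch of one possible adaptive policy, not the paper's method: probe with a few cheap samples, then spend the remaining budget on breadth (more samples plus majority vote) if the probe agrees, or on depth (costlier "thinking" passes) if it does not. All function names, the cost model, and the agreement threshold are hypothetical assumptions for illustration.

```python
from collections import Counter

def allocate_and_answer(question, budget, fast_model, deep_model,
                        agreement_threshold=0.6, deep_cost=4):
    """Hypothetical uncertainty-guided breadth/depth allocation.

    fast_model(question) -> answer string, costing 1 budget unit per call.
    deep_model(question) -> answer string from a "thinking" mode, assumed
        to cost `deep_cost` units per call (an illustrative assumption).
    """
    # Probe with a few cheap samples to estimate answer agreement (breadth).
    probe = [fast_model(question) for _ in range(min(3, budget))]
    budget -= len(probe)
    counts = Counter(probe)
    top_fraction = counts.most_common(1)[0][1] / len(probe)

    if top_fraction >= agreement_threshold or budget < deep_cost:
        # Probe agrees (or budget is too small for depth): spend the rest
        # on more cheap samples and return the majority answer.
        probe += [fast_model(question) for _ in range(budget)]
        return Counter(probe).most_common(1)[0][0]

    # Probe disagrees: spend the remaining budget on deeper "thinking"
    # passes and return their majority answer.
    deep_answers = [deep_model(question) for _ in range(budget // deep_cost)]
    return Counter(deep_answers).most_common(1)[0][0]
```

The sketch makes the paper's point operational: when the cheap samples already agree, extra depth is wasted compute, while disagreement is a signal that the fixed budget is better spent on fewer but stronger reasoning passes.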

References

We do not yet know how to best allocate test-time compute between sampling more reasoning paths (breadth) and using stronger reasoning modes (depth) in perception tasks.

When to Think and When to Look: Uncertainty-Guided Lookback (2511.15613 - Bi et al., 19 Nov 2025) in Section 1 (Introduction), under the question "How should we trade off breadth vs. depth of thinking?"