LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs (2501.06186v1)

Published 10 Jan 2025 in cs.CV

Abstract: Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LMMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.

Summary

  • The paper introduces LlamaV-o1, achieving a 3.8% absolute gain in average score and 5× faster inference scaling compared to the recent Llava-CoT.
  • The paper presents VRC-Bench, a benchmark with 1,000+ samples and over 4,000 verified reasoning steps to assess intermediate reasoning quality.
  • The paper proposes a novel evaluation metric that systematically quantifies faithfulness, informativeness, and logical coherence of each reasoning step.

The paper presents a comprehensive framework aimed at advancing multimodal step-by-step visual reasoning for large multimodal models. The work is organized around three primary contributions:

1. Visual Reasoning Benchmark (VRC-Bench):

  • The authors design a novel benchmark specifically tailored to assess multi-step reasoning across visual modalities.
  • VRC-Bench spans eight distinct categories: visual reasoning, complex visual perception, math and logic reasoning, scientific reasoning, OCR/document understanding, chart/diagram interpretation, social and cultural context, and medical imaging.
  • The benchmark comprises over 1,000 challenging samples with more than 4,000 manually verified reasoning steps. This detailed annotation supports investigating not only final-answer accuracy but also the interpretability and logical coherence of the intermediate reasoning chain; an illustrative annotation record is sketched below.
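
For concreteness, a single VRC-Bench item plausibly bundles the image, its category, the question, the verified step chain, and the gold answer. The sketch below is a hypothetical record layout; the field names are assumptions, not the released format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningStep:
    index: int  # 1-based position in the verified chain
    text: str   # the manually verified reasoning step

@dataclass
class VRCBenchSample:
    """Hypothetical layout for one VRC-Bench item (field names assumed)."""
    image_path: str             # the visual input
    category: str               # one of the eight task categories
    question: str
    steps: List[ReasoningStep]  # manually verified intermediate steps
    final_answer: str           # gold final answer

sample = VRCBenchSample(
    image_path="chart_017.png",
    category="chart/diagram interpretation",
    question="Which region grew fastest between 2010 and 2020?",
    steps=[
        ReasoningStep(1, "Read the 2010 and 2020 values for each region."),
        ReasoningStep(2, "Compute each region's growth rate."),
        ReasoningStep(3, "Compare the rates and select the maximum."),
    ],
    final_answer="Asia-Pacific",
)
```

Annotations at this granularity are what make step-level scoring possible: an evaluator can align a model's generated chain against `steps` rather than checking only `final_answer`.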

2. Novel Evaluation Metric:

  • Rather than relying solely on end-task accuracy, the framework complements it with a metric that quantifies the quality of step-by-step reasoning.
  • The metric is defined at the granularity of individual reasoning steps, assessing correctness as well as logical coherence.
  • Attributes such as Faithfulness (at both step and token level), Informativeness, Hallucination, Redundancy, Semantic Coverage, and Commonsense are systematically quantified, providing granular insight into the reasoning process; a scoring sketch follows this list.
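
As a rough illustration of how such step-level attributes might be aggregated (the paper uses reference-based judging against the verified annotations; the attribute names below follow the paper, but the weighting is an assumption, not the published formula):

```python
from statistics import mean

# Per-step attribute scores in [0, 1], e.g., produced by an LLM judge
# comparing generated steps against the verified reference chain.
judged_steps = [
    {"faithfulness": 0.90, "informativeness": 0.80, "hallucination": 0.00,
     "redundancy": 0.10, "semantic_coverage": 0.85, "commonsense": 1.00},
    {"faithfulness": 0.70, "informativeness": 0.60, "hallucination": 0.20,
     "redundancy": 0.00, "semantic_coverage": 0.70, "commonsense": 0.90},
]

def step_score(s: dict) -> float:
    """Reward desirable attributes; penalize hallucination and redundancy."""
    positives = mean([s["faithfulness"], s["informativeness"],
                      s["semantic_coverage"], s["commonsense"]])
    penalties = mean([s["hallucination"], s["redundancy"]])
    return max(0.0, positives - penalties)

# Chain-level score: mean over individual step scores.
chain_score = mean(step_score(s) for s in judged_steps)
print(f"chain-level reasoning score: {chain_score:.3f}")
```

The point of scoring at this granularity: two chains can reach the same final answer yet receive very different scores if one hallucinates or pads its intermediate steps.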

3. Multimodal Visual Reasoning Model – LlamaV-o1:

  • The model is built on the Llama-3.2-11B-Vision-Instruct backbone and fine-tuned via a multi-stage curriculum learning strategy.
  • Stage 1 trains the model on simpler tasks (e.g., summary generation and detailed captioning) using datasets such as PixMo and Geo170K, teaching it to first lay out a clear plan of approach.
  • Stage 2 refines the model’s capabilities by incorporating a chain-of-thought framework for detailed, multi-step reasoning and final-answer generation, leveraging datasets originally assembled for step-by-step reasoning tasks.
  • Inference efficiency is enhanced through a Beam Search strategy that generates multiple reasoning paths in parallel; a minimal inference sketch follows this list. Notably, the paper reports that LlamaV-o1 achieves an absolute gain of 3.8% in average score across six benchmarks while scaling inference 5× faster than recent methods that employ stage-level beam search.
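
A minimal inference sketch with Hugging Face transformers is given below. Since LlamaV-o1 builds on Llama-3.2-11B-Vision-Instruct, the standard Mllama classes apply; the checkpoint id and generation hyperparameters are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "omkarthawakar/LlamaV-o1"  # assumed hub id for the released checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("chart_017.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Reason step by step, then state the final answer: "
                             "Which region grew fastest between 2010 and 2020?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# num_beams > 1 enables ordinary beam search: several reasoning paths are
# explored within a single decoding pass, which keeps the cost linear in
# the number of beams.
output = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(processor.decode(output[0], skip_special_tokens=True))
```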

Experimental Validation and Ablations:

  • Extensive experiments are conducted on both the newly proposed VRC-Bench and six established multimodal benchmarks (including MMStar, MMBench, MMVet, MathVista, AI2D, and HallusionBench).
  • Results show that LlamaV-o1 outperforms existing open-source methods such as Llava-CoT, reaching an average score of 67.33%, with marked improvements in challenging domains such as mathematical reasoning, chart interpretation (83.18% on diagram understanding), and document OCR (93.44%).
  • Ablation studies confirm the contributions of both curriculum learning and the optimized Beam Search scheme. For example, moving from the baseline model to the curriculum learning approach improved the average score by approximately 9.14%, and incorporating Beam Search on top yielded further gains; a schematic of the staged schedule follows this list.
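
The staged schedule that these ablations credit is straightforward to express. Below is a hypothetical two-stage configuration with stub training helpers; the dataset mix and helper functions are illustrative, not the paper's recipe.

```python
# Hypothetical two-stage curriculum schedule (values illustrative).
CURRICULUM = [
    {"stage": 1,
     "skills": ["approach summary", "detailed captioning"],
     "datasets": ["PixMo", "Geo170K"]},
    {"stage": 2,
     "skills": ["multi-step chain-of-thought", "final answer"],
     "datasets": ["step-by-step reasoning corpus"]},
]

def load_mix(names):
    """Stub: stand-in for loading and mixing the stage's datasets."""
    return [f"<{n} samples>" for n in names]

def finetune(model, data):
    """Stub: stand-in for one supervised fine-tuning pass."""
    print(f"fine-tuning on {data}")
    return model

def run_curriculum(model, stages=CURRICULUM):
    """Fine-tune sequentially: each stage resumes from the previous one,
    so harder multi-step skills build on the simpler planning skills."""
    for cfg in stages:
        model = finetune(model, load_mix(cfg["datasets"]))
    return model

run_curriculum(model="<Llama-3.2-11B-Vision-Instruct>")
```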

Technical Highlights:

  • The framework emphasizes a structured reasoning process where each intermediate step is explicitly generated and verified, mirroring human problem decomposition.
  • By integrating a reference-based evaluation mechanism via automated and manually verified annotations, the work provides a rigorous benchmark for assessing interpretability and logical consistency at each stage of reasoning.
  • The efficiency of the Beam Search technique, which scales linearly rather than quadratically with the number of reasoning paths, is particularly notable for real-world applications; a back-of-envelope cost model is sketched below.
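
One plausible reading of that scaling claim, under the assumption that stage-level search generates N candidates per stage and ranks them via pairwise judging, is the toy cost model below; the token counts are arbitrary and the model itself is an assumption, not the paper's analysis.

```python
def beam_search_cost(beams: int, total_tokens: int) -> int:
    """One decoding pass with `beams` parallel hypotheses: linear in beams."""
    return beams * total_tokens

def stage_level_cost(candidates: int, stages: int, tokens_per_stage: int) -> int:
    """Assumed model of stage-level search: generate `candidates` options per
    stage, then pick the best via pairwise comparisons. The C(N, 2) judging
    term is what makes the cost quadratic in the number of candidates."""
    generation = candidates * stages * tokens_per_stage
    comparisons = stages * candidates * (candidates - 1) // 2
    judging = comparisons * 2 * tokens_per_stage  # each comparison reads two candidates
    return generation + judging

for n in (2, 4, 8):
    print(f"n={n}: beam={beam_search_cost(n, 512):>6}  "
          f"stage-level={stage_level_cost(n, 4, 128):>6}")
```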

Overall, the paper delivers a methodologically sound approach to multimodal reasoning that combines carefully constructed benchmarks, metrics, and a structured training methodology to enable interpretable, accurate, and efficient step-by-step reasoning in large multimodal models.
