
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Published 12 Feb 2026 in cs.AI | (2602.12268v1)

Abstract: AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.

Summary

  • The paper introduces a 'checklist reward' mechanism that assigns binary, verifiable rewards at each subgoal to mitigate reward hacking in complex tasks.
  • It demonstrates empirical gains of up to +14% accuracy and reduced confidently incorrect outputs across various multi-tool, multi-turn environments.
  • The approach enhances RL transparency and scalability by offering granular credit tracing and robust, stepwise policy optimization for agentic tool use.

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Motivation and Context

The paper addresses the growing demand for robust agentic tool use in multi-turn and multi-step scenarios—tasks where LLM-powered agents must reason over extended dialog or action chains while invoking APIs, manipulating environments, or dynamically leveraging external tools. The conventional RL reward modeling paradigm often relies on opaque scalar feedback or preference-based signals, which are susceptible to spurious correlations and reward hacking and lack fine-grained alignment with task compositionality. CM2 proposes a novel "checklist reward" formulation that decomposes rewards into stepwise, interpretable binary checkpoints based on task completion criteria and causal justification, thereby mitigating ambiguous reward attribution and enhancing alignment with complex task structures.

Checklist Reward Design and Implementation

CM2 formalizes checklist rewards as a sequence of binary indicators—each corresponding to a specific subgoal or tool invocation that is either achieved or not. Unlike scalar reward functions or batched human preferences, checklist rewards provide a verifiable, granular causal trace for each agent action and outcome. This allows RL protocols to precisely credit each atomic step in multi-turn, multi-modal, or tool-integrated pipelines. The reward traces are designed to be compositional, robust to distributional shifts, and directly interpretable, enabling more reliable credit assignment and facilitating downstream analysis of error patterns or reliability bottlenecks.
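To make this concrete, below is a minimal, hypothetical sketch of how a per-turn checklist could be represented and scored. The `ChecklistItem` fields, the `judge` callable, and the weighted aggregation are illustrative assumptions rather than the paper's actual data format or judging prompt.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChecklistItem:
    """One binary criterion for a single turn (illustrative fields)."""
    criterion: str       # e.g. "Agent asked for the order id before issuing a refund"
    evidence_hint: str   # where a judge should look: assistant message, tool log, ...
    weight: float = 1.0  # optional weighting; uniform by default

def checklist_reward(turn_trace: str,
                     checklist: List[ChecklistItem],
                     judge: Callable[[str, ChecklistItem], bool]) -> float:
    """Score one turn as the weighted fraction of satisfied binary criteria.

    `judge` is any binary classifier (an LLM judge, a regex check, ...) that
    decides whether `turn_trace` satisfies a single criterion.
    """
    if not checklist:
        return 0.0
    total = sum(item.weight for item in checklist)
    passed = sum(item.weight for item in checklist if judge(turn_trace, item))
    return passed / total

# Toy usage with a trivial keyword judge standing in for an LLM judge.
items = [
    ChecklistItem("Agent asked for the order id", "assistant message"),
    ChecklistItem("Agent called the refund tool", "tool call log"),
]
trace = "assistant: Could you share your order id? ... tool_call: refund(order_id=42)"

def keyword_judge(trace: str, item: ChecklistItem) -> bool:
    return ("order id" in trace) if "order id" in item.criterion else ("refund(" in trace)

print(checklist_reward(trace, items, keyword_judge))  # 1.0
```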

To efficiently operationalize checklist rewards, CM2 builds on an RL framework that integrates:

  • Automated checklist construction from tool specifications, environment APIs, or synthetic task protocols.
  • Reward evaluation mechanisms based on verifiable program outputs, structured action logs, or heuristic/ground-truth validators.
  • Step-granularity credit assignment in policy optimization (e.g., PPO or DPO) for agent training and inference-time adaptation.
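As a minimal sketch of the sparse-assignment, dense-criteria strategy mentioned in the abstract, the snippet below grants each turn's checklist score only at the step that ends that turn, leaves intermediate tool-call steps at zero, and folds the sparse rewards into discounted returns for a PPO-style update. The function names and the toy rollout are illustrative assumptions, not the paper's implementation.

```python
from typing import List

def assign_turn_rewards(step_is_turn_end: List[bool],
                        turn_scores: List[float]) -> List[float]:
    """Sparse assignment: each turn's checklist score lands only on its final step."""
    rewards, turn_idx = [], 0
    for is_end in step_is_turn_end:
        if is_end:
            rewards.append(turn_scores[turn_idx])
            turn_idx += 1
        else:
            rewards.append(0.0)  # intermediate reasoning / tool-call steps get no reward
    return rewards

def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Propagate the sparse turn rewards back to earlier steps as Monte Carlo returns."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A two-turn rollout with five agent steps; the turn-level checklist scores are 0.75 and 1.0.
step_ends = [False, False, True, False, True]
rewards = assign_turn_rewards(step_ends, [0.75, 1.0])
print(rewards)                      # [0.0, 0.0, 0.75, 0.0, 1.0]
print(discounted_returns(rewards))  # [1.75, 1.75, 1.75, 1.0, 1.0]
```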

The approach is validated across several agentic tool use domains, including web manipulation, API orchestration, and interactive simulations, with checklist extraction either automated or curated by domain experts.

Numerical Results and Empirical Claims

CM2 demonstrates strong empirical performance across a suite of benchmarks that demand compositional, multi-step tool use and interactive reasoning. Notably:

  • Checklist reward RL outperforms scalar or preference-based reward baselines by up to +14% absolute accuracy in multi-step agentic tasks.
  • The method exhibits substantially lower rates of confidently incorrect generations, as the binary checklist trace robustly conditions learning on key causal junctures in agentic workflows.
  • Across multiple environments (including ToolBench, R2E-Gym, τ²-Bench), CM2-trained agents achieve higher Pass@k and completion rates on long-horizon tasks, especially those requiring correct sequencing and justification of tool invocations.

These empirical gains are shown to be consistent across backbone models, task categories, and both synthetic and real-world agentic settings.

Theoretical Analysis and Guarantees

The paper includes formal analysis establishing that checklist rewards provide bounded, verifiable causal credit assignment, mitigating reward hacking and yielding local gradient alignment between confidence and reasoning quality. The authors prove that checklist-based gradients are fundamentally local, ensuring that policy updates improve task completion strictly within bounded regions and step sizes—a property not guaranteed by heuristic or preference-based reward signals. They further demonstrate that checklist rewards enable algorithms to avoid "confidently incorrect traps," where scalar reward models may assign high confidence to fundamentally erroneous behavior.

Comparative Study

CM2 is compared against leading RL reward modeling alternatives:

  • Rubric-based reward modeling [rubrics-as-rewards], synthetic preference generation [openrubrics], programmatic reward signals [generative-reward-models], and question-specific rubrics [rubric-is-all-you-need].
  • Scalar reward models and traditional RLHF setups, which tend to suffer from ambiguous credit assignment, reward hacking, and generalization failures in compositional agentic tasks.

Checklist rewards consistently realize higher compositional reliability, reduced hallucination, and improved agentic tool use fluency. The study also notes that checklist rewards are more scalable than rubric-based methods and more interpretable than scalar rewards.

Practical and Theoretical Implications

The checklist reward formulation advances practical RL for agentic tool use in several significant ways:

  • It enables transparent, verifiable reward traces for each agent action, supporting debugging, safety-critical oversight, and regulatory compliance.
  • The compositional structure improves transferability to unseen environments, novel tool APIs, or new task decompositions, making it suitable for real-world deployment of workflow-driven LLM agents.
  • The method helps mitigate confidently incorrect generalizations, reward hacking, and out-of-distribution failure modes typical of preference-based RLHF pipelines.

On a theoretical level, the formal guarantees about local positive correlation between checklist reward gradients and reasoning quality yield enhanced stability in policy optimization, reproducible improvement in task completion, and direct verifiability of agentic decisions.

Speculations and Future Directions

CM2 opens new avenues for agentic reinforcement learning:

  • Checklist rewards could be combined with synthetic rubric generation, leveraging LLMs to automatically bootstrap causal checkpoints for novel domains.
  • The framework is extensible to multi-agent, collaborative workflows, where compositional credit assignment is critical for scalable tool orchestration.
  • Integration with procedural environment generation [procedural-environment-generation, apigen-mt] and hybrid verifiers [r2e-gym] is anticipated, enabling scalable, verifiable multi-turn RL pipelines.
  • The approach can facilitate safety and interpretability research for RL-driven LLMs, particularly in domains where action logs and causal traces are essential for auditability.

Conclusion

CM2 introduces a reinforcement learning paradigm for agentic tool use based on decomposed checklist rewards, providing robust, verifiable, and compositional credit assignment in multi-turn and multi-step tasks. Empirical and theoretical evidence demonstrates that checklist rewards substantially strengthen agentic reasoning reliability, mitigate confidently incorrect behaviors, and enhance transferability to complex tool orchestration scenarios. The approach is poised to improve both practical deployment and theoretical understanding of RL for agentic LLMs; further integration with procedural generation, rubric bootstrapping, and hybrid verification is likely to accelerate progress in the next generation of AI agents.


Explain it Like I'm 14

A simple explanation of “Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space” (Appendix)

Overview

This paper explains and tests a new way for AI models to think through problems that involve both text and images. The idea is called DMLR (Dynamic Multimodal Latent Reasoning). Instead of only writing out step‑by‑step text, the model “thinks” inside its hidden mind (its internal states) and, at key moments, it brings in small parts of the image to guide its thinking. The appendix shows how the authors evaluated this idea, the settings they used, extra results, and the reasons why this approach works.

Key objectives and questions

To make the purpose of the paper clear, here are the main questions the authors wanted to answer:

  • When do AI models really need to “look” at the image during reasoning, and when can they just use text?
  • If the model dynamically brings in image details at the right moments, does it become more accurate and more confident?
  • How can we measure whether the model’s reasoning is faithful (based on real evidence), and whether it avoids hallucinations (making things up)?
  • Can we improve models at test time (without extra training) by optimizing their “inner thoughts” to be more reliable?

Methods and approach

The authors tested DMLR using a unified setup across many tasks, with clear prompts and a consistent way of judging answers.

  • How the model “thinks”: Instead of only producing visible text, the model uses “latent think tokens,” which you can imagine as invisible steps in its mind. After each hidden step, the model pulls in a few small image patches (like tiny crops of the picture) that its attention says are most relevant. This is called dynamic visual injection. Think of it like solving a math puzzle while occasionally glancing back at key parts of the image when you reach a tricky step.
  • Picking image patches: The model’s attention acts like a flashlight that highlights the most important parts of the picture. At each internal step, it re‑checks the image, picks new relevant patches, and adds them into its hidden thinking. This keeps its mental picture fresh and focused.
  • Confidence and correctness checks: The authors compared the model’s final answers to the ground truth and used a careful process to avoid judging mistakes caused by formatting. They also used a strong external judge (GPT‑4o, run deterministically) to check whether a reasoning chain was logically sound and truly used the given evidence.
  • Faithfulness and hallucination testing: The team looked at reasoning steps that referenced visual facts (like “the triangle is red” or “there are 3 coins”). They checked whether each claim matched the image. If not, it was labeled as a hallucination. They then compared the model’s confidence at those steps to see patterns.
  • Visual dependency analysis: To find out when tokens (pieces of the model’s reasoning) depend on the image, they perturbed images in different ways (blur, block occlusion, random masking, color changes) and watched how the reasoning changed. If a token’s output shifted a lot with image noise, it likely depended on vision.
  • Pass@k metric: This measures the chance that at least one of k attempts is correct. For example, Pass@8 asks, “If the model tries up to 8 different solutions, how likely is it to produce at least one correct answer?” This helps judge robustness beyond a single try; a short computation sketch follows this list.
  • Simple theory to explain why it works:
    • Confidence gradient: Imagine walking uphill toward “feeling more sure.” If “feeling more sure” aligns with “being more correct,” then taking small steps uphill improves quality.
    • Visual injection and information: Bringing in the right visual details increases how much the model’s inner state knows about the answer. When the model knows more, its uncertainty drops, so its confidence rises. In everyday terms: better clues → better chances → more certainty.
  • A note on optimization inside the mind: The model tries tiny changes to its hidden thoughts (like testing a small edit to its plan), scores the result (confidence as the “reward”), and then nudges its inner state in the direction that helped. This is similar to trying a small tweak, seeing if it helps, and keeping what works.
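The "tiny changes to its hidden thoughts" step can be read as a single-sample Gaussian policy-gradient (score-function) update with confidence as the reward. The numpy sketch below is a hedged illustration of that idea under those assumptions; `confidence_fn` stands in for whatever scalar confidence signal the model exposes, and none of the names come from the paper.

```python
import numpy as np

def latent_confidence_step(h, confidence_fn, sigma=0.05, lr=0.1,
                           rng=np.random.default_rng(0)):
    """One perturb-score-nudge update of a latent thought vector.

    A small Gaussian edit of h is scored with the confidence "reward"; the
    single-sample score-function estimate then nudges h in whatever
    direction raised confidence.
    """
    eps = rng.normal(size=h.shape)            # a tiny trial edit to the hidden plan
    baseline = confidence_fn(h)               # confidence before the edit
    reward = confidence_fn(h + sigma * eps)   # confidence after the edit
    grad_estimate = (reward - baseline) / sigma * eps
    return h + lr * grad_estimate

# Toy confidence surface peaking at h = [1, 1]; repeated updates drift toward the peak.
target = np.array([1.0, 1.0])
confidence = lambda h: float(np.exp(-np.sum((h - target) ** 2)))

h = np.zeros(2)
for _ in range(200):
    h = latent_confidence_step(h, confidence)
print(np.round(h, 2))  # drifts close to [1. 1.]
```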

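For reference, the standard unbiased Pass@k estimator (the "standard practice" the appendix says it follows, per the glossary quote further down) can be computed as below; the function name and toy numbers are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k solutions drawn
    without replacement from n attempts (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every size-k draw hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 sampled solutions, 3 correct; chance a random size-4 subset has a correct one.
print(round(pass_at_k(n=8, c=3, k=4), 3))  # 0.929
```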
Main findings and why they matter

The appendix reports several important results across math and visual reasoning benchmarks:

  • Vision is used at key moments: The model doesn’t need the image at every step. Instead, it depends on vision at meaningful stages, like describing the scene, checking spatial details, counting, or verifying a number. This pattern stayed stable even when the images were altered in different ways, showing the behavior is robust, not tied to one type of noise.
  • Dynamic visual injection boosts accuracy and robustness: When the model repeatedly and selectively brings in the right image patches during its hidden thinking, performance improves across many tasks. In Pass@8 evaluation, DMLR often increases the chance of getting at least one correct solution within 8 tries by roughly 2%–5%, especially on tasks needing careful multi‑step reasoning or precise visual grounding (like MM‑Math, HallusionBench, and MMVP).
  • Training‑free but competitive: DMLR consistently matches or beats some methods that require extra training (like MCOUT and IVT‑LR). Because DMLR adapts during inference—without changing the model’s parameters—it generalizes well, especially for smaller models that might overfit when trained heavily.
  • Confidence reflects grounding: Steps that hallucinate (claiming visuals that aren’t there) tend to have lower confidence and more uncertainty. This supports the idea that confidence signals can help detect weak or ungrounded reasoning.
  • Theory supports practice: The math shows that when “being more sure” aligns with “being more correct,” small steps to increase confidence tend to increase quality. It also explains why adding visual information lowers uncertainty and raises confidence.
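Assembled from the relations quoted in the glossary below (confidence strictly decreasing in the conditional entropy of Y given the latent state, and mutual information with Y strictly increasing in mutual information with the visual features), the "theory supports practice" point can be sketched compactly; the symbols C, g, and φ are used here only for illustration, not as the paper's exact statement:

```latex
% Hedged sketch of the chain the appendix argues informally:
% more visual information in the latent state -> lower uncertainty -> higher confidence.
\begin{align*}
  I(Y; T) &= g\bigl(I(T; z_v)\bigr), && g \text{ strictly increasing (visual info raises answer info)} \\
  I(Y; T) &= H(Y) - H(Y \mid T), && \text{(definition of mutual information)} \\
  C(T) &= \phi\bigl(H(Y \mid T)\bigr), && \phi \text{ strictly decreasing (confidence falls with uncertainty)} \\
  \Rightarrow \quad I(T; z_v) \uparrow &\;\Longrightarrow\; H(Y \mid T) \downarrow \;\Longrightarrow\; C(T) \uparrow. &&
\end{align*}
```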

Implications and potential impact

This work suggests a practical way to make multimodal AI more reliable without retraining:

  • Smarter “when to look” behavior: The model learns to glance back at the image at just the right times, not constantly, which is efficient and effective.
  • Fewer hallucinations: By tying confidence to real visual evidence, the model’s reasoning becomes more faithful and trustworthy.
  • Plug‑and‑play improvements: Because DMLR works at test time, it can be added to many existing models and tasks.
  • Better tools for learning and science: More reliable multimodal reasoning could help with math education, data analysis, science questions, and any task where images and text are combined.

Overall, the appendix shows that dynamically interleaving image details into the model’s “inner thoughts,” and gently optimizing those thoughts for confidence, leads to clearer, more grounded, and more accurate reasoning.

Knowledge Gaps

Knowledge Gaps, Limitations, and Open Questions

Below is a consolidated, actionable list of what remains missing, uncertain, or unexplored in the paper’s appendix. Each item is framed to guide future research.

Evaluation Protocol and Metrics

  • Lack of explicit definition and calibration of “token-level confidence”: clarify the exact metric (e.g., per-token log-probability, normalized confidence, entropy proxies), calibration status, and aggregation scheme across steps and modalities.
  • Limited statistical rigor: no confidence intervals, variance across seeds, or significance testing reported for accuracy and Pass@k; ablations are run on 300-sample subsets without power analysis.
  • Sampling setup under Pass@k is under-specified: sampling strategy (temperature, top-p, nucleus/top-k, number of draws n per problem) is not documented, making Pass@k comparability and reproducibility unclear.
  • Evaluation protocol caps datasets at 1000 samples and relies on “mini” benchmarks; no assessment of performance stability on full datasets or under data scaling.
  • Regex-based answer extraction and “boxed” formatting may bias or mis-score non-math tasks; robustness of answer parsing and error analysis across task types is not provided.
  • No calibration metrics (e.g., ECE, Brier score), overconfidence analysis, or selective prediction evaluation to substantiate claims about “confidence” beyond accuracy gains.
  • Latency–quality trade-offs are not measured: missing benchmarks for wall-clock time, energy use, and memory overhead per sample as optimization steps and k vary.

Theoretical Assumptions and Validation

  • Strong, unverified assumption that I(Y; T) is a strictly increasing function of I(T; z_v) (existence and monotonicity of g): no empirical tests or counterexamples to support this information-theoretic linkage in deep models.
  • Mutual information terms are never estimated or proxied; the key claim that DVI increases I(T; z_v) (and thereby I(Y; T)) is not validated empirically.
  • Confidence–quality gradient alignment is local and assumption-heavy (smoothness, alignment); no empirical test of alignment (e.g., directional derivatives, sensitivity analysis) in real models or detection of “confidently incorrect” traps in practice.
  • Latent policy gradient lacks convergence guarantees, variance analysis, or stability bounds under single-sample Monte Carlo; no baseline subtraction, control variates, or adaptive noise to reduce variance explored.
  • No characterization or detection method for confidently incorrect traps; no practical strategies to escape such traps or to adapt step sizes dynamically when entering misaligned regions.

Algorithm Design and Hyperparameters

  • Hyperparameter sensitivity is limited: no systematic sweeps for number of latent tokens, number of patches per step, maximum patches, step sizes, decay schedules, or learning rates across tasks and backbones.
  • Visual patch selection depends on attention scores; no comparison against alternative attribution methods (e.g., gradients, integrated gradients, RISE, occlusion-based) to test robustness of patch selection.
  • Image resolution fixed at 256 px for all tasks; no analysis of high-resolution requirements, scale sensitivity, or impact on fine-grained perception tasks.
  • The DVI procedure is partially described (appendix text truncates mid-algorithm) and lacks complete pseudocode/specifications (e.g., patch stride/size, re-encoding strategy, cache reuse), hindering exact reproduction.
  • No study on when DVI hurts performance (negative deltas): conditions, error modes, tasks, or model regimes where dynamic injection degrades reasoning or induces instability are not analyzed.

Robustness and Generalization

  • Visual dependency analysis uses common perturbations but does not test adversarial perturbations, distribution shifts, domain shifts (e.g., medical, aerial, document OCR), or robustness to spurious correlations.
  • Reliance on attention-aligned “high-dependency” tokens lacks human-annotated validation or causal tests (e.g., counterfactual interventions) to confirm that identified steps are truly vision-critical.
  • No evaluation of robustness to prompt manipulations (e.g., prompt injection, instruction conflicts) or to noisy/ambiguous image–text alignment scenarios.
  • Seed robustness is not reported (single seed fixed at 42); no assessment of variability across seeds or hardware backends.

Comparative Scope and Fairness

  • Exclusion of “think-with-image” models (e.g., DeepEye, GRIT) narrows comparative scope; a controlled study (possibly subsampled or matched compute) is needed to contextualize DMLR vs. training-heavy approaches.
  • Baseline configurations may not be fully optimal or aligned (e.g., greedy decoding for all vs. sampling for Pass@k); thorough per-method tuning and reporting of hyperparameters for fairness is absent.
  • Limited coverage of larger or proprietary frontier models; unclear how DMLR scales with model size and whether gains persist or saturate for stronger backbones.

Efficiency, Systems, and Deployment

  • Missing compute profile for DMLR: no measurements of FLOPs, GPU memory footprint, latency per iteration, and throughput under varying optimization steps and patch counts.
  • No analysis of amortizing DVI (e.g., caching visual features, reusing attention maps) or opportunities for early stopping, adaptive step counts, or dynamic compute allocation at inference time.
  • Precision and quantization not explored: all runs use float32; effects under FP16/BF16/INT8 (typical in deployment) and on different accelerators are unknown.
  • Applicability to resource-constrained or real-time settings is not discussed; feasibility on edge devices or with strict latency budgets remains open.

Annotation, Judging, and Faithfulness

  • Heavy reliance on GPT-4o as a judge (faithfulness, correctness, hallucination) raises potential bias and consistency issues; no human expert evaluation or cross-judge agreement analysis is provided.
  • Automatic extraction of “visual statements” from chains is not validated for precision/recall; mis-extractions may confound hallucination labeling and confidence associations.
  • The dual-pass GPT-4o judging protocol lacks an error analysis and does not report inter-annotator (inter-judge) agreement metrics or adjudication procedures for disagreements.

Reproducibility and Transparency

  • Code is not yet released; exact re-implementation is hindered by missing algorithmic specifics (e.g., DVI patching details, sampling parameters, policy gradient implementations).
  • Dataset subsampling procedures (selection criteria for “mini” sets, random seeds for subsampling) are not fully specified; risk of selection bias without shared splits.
  • Some formulae/typesetting (e.g., Pass@k equation) appear malformed, inviting ambiguity; a formal checklist of equations, hyperparameters, and inference flags would aid replication.

Safety, Ethics, and Reliability

  • No analysis of safety impacts from latent-state manipulation (e.g., bypassing safety filters, shifting outputs towards unsafe content) or interactions with guardrails.
  • Bias and fairness effects not studied: whether DVI shifts demographic performance disparities or reinforces dataset artifacts remains unknown.
  • No exploration of catastrophic failure cases (e.g., increased hallucinations under specific conditions) or of monitoring/abort criteria when confidence rises but external evidence is weak.

Scope of Applicability

  • Generality beyond vision–language tasks is not demonstrated: applicability to audio–text, video–text, or non-perceptual reasoning domains (code, symbolic math) is untested.
  • Interplay with training-time methods is unexplored: can DMLR be combined with training-based latent reasoning (e.g., MCOUT, IVT-LR) to further improve performance, or does it interfere?
  • Long-context and multi-image/multi-hop settings are not examined; how dynamic injection scales with multiple images, pages, or temporal sequences remains an open question.

Practical Applications

Immediate Applications

Below are actionable use cases derived from the findings, methods, and innovations of the research paper, categorized as "Immediate Applications" for deployment now.

Industry

  • Multimodal Visual Analysis Tools: Implement Dynamic Visual Injection (DVI) modules in sectors like finance and marketing to enhance visual data analysis and decision-making processes.
  • Augmented Reality in Retail: Utilize the DMLR framework to improve product visualization and buying experiences in retail environments.
  • Vision-LLMs for Automation: Apply multimodal reasoning techniques for automated quality assurance systems in manufacturing.

Academia

  • Educational Software: Develop educational applications that use MathVista benchmarks for enhanced mathematical visual reasoning in digital learning platforms.
  • Research in Multimodal Integration: Foster collaborations across fields like computer science and cognitive psychology to explore further integration of visual cues in reasoning processes.

Policy

  • Policy Planning for AI in Education: Adopt ScienceQA benchmarks for designing AI-assisted educational programs in public schooling systems.

Daily Life

  • Personalized Learning Applications: Leverage multimodal composition tasks as study tools for personalized learning applications, adaptable to individual student needs.

Long-Term Applications

These applications require further research, scaling, or development before deployment.

Industry

  • Advanced Robotics Systems: Scale DMLR-based reasoning systems for more adaptive and intelligent path-finding in robotics.
  • Energy Management Solutions: Investigate DMLR's potential for optimizing energy distribution systems through enhanced multimodal data processing.

Academia

  • Cross-Disciplinary Research: Establish foundational research investigating the boundaries of visual reasoning across different learning systems.

Policy

  • Regulatory Framework Development: Develop new policies around data privacy and ethical AI use concerning dynamic multimodal reasoning systems.

Daily Life

  • Assistive Technologies for Disabilities: Develop technologies that utilize visual-context reasoning to assist individuals with disabilities.

Assumptions and Dependencies

Here are some assumptions or dependencies that could impact the feasibility of each application.

  • Technical Dependencies: The implementation of DMLR requires high computational power and advanced machine learning infrastructure.
  • Data Availability: Successful deployment in industry requires access to multimodal datasets similar to those used in the paper.
  • Regulatory Approvals: Long-term applications, particularly in policy and industry, may require significant changes in legislation or regulatory frameworks.
  • User Acceptance: Applications in daily life and education need user acceptance and adaptability to new technology interfaces.

Glossary

  • Attention-driven Selection (ADS): A mechanism that selects relevant image regions based on model attention to interleave visual and textual reasoning. "It employs a plug-and-play Attention-driven Selection (ADS) mechanism to dynamically identify and insert relevant image regions into the reasoning chain based on the model's attention maps."
  • Basin of attraction: The set of initial states that converge to a particular point under iterative updates. "Let $\mathcal{B}(h_{\mathrm{trap}})$ denote the basin of attraction of $h_{\mathrm{trap}}$ under~\eqref{eq:conf_dynamics}"
  • Block occlusion: An image perturbation that hides content by covering it with blocks to test robustness. "we apply four distinct perturbation types including block occlusion, color jitter, random region masking, and Gaussian blur."
  • Cauchy–Schwarz inequality: A mathematical inequality used to bound inner products in proofs. "By the Cauchy--Schwarz inequality and Assumption~A.2,"
  • Chain-of-Thought (CoT): A method prompting models to generate intermediate reasoning steps before answers. "MCOUT (Training)~\cite{pham2025multimodalchaincontinuousthought} is a latent-space reasoning framework that replaces traditional text-based CoT with continuous hidden-state “thought vectors,”"
  • CLIP-blind image–text pairs: Image–text pairs that systematically fool CLIP-like models, exposing perception failures. "MMVP is a benchmark built from multimodal visual patterns designed to expose “CLIP-blind” image–text pairs,"
  • Confidence landscape: The function over latent states representing model confidence, treated as an optimization surface. "During test-time optimization, DMLR updates the latent state by ascending the confidence landscape:"
  • Conditional entropy: The uncertainty of a target variable given the latent state, related to confidence. "The model’s confidence objective is a strictly decreasing function of the conditional entropy of $Y$ given the latent state:"
  • Descent lemma: A smoothness-based inequality bounding function change via gradients and step sizes. "\begin{lemma}[Descent lemma form]"
  • Dynamic Visual Injection (DVI): A module that dynamically injects visual features or patches into the latent reasoning stream. "Section~\ref{e} further elaborates on the design choices, mechanisms, and stability analyses of the Dynamic Visual Injection module."
  • Eager attention backend: An inference implementation that evaluates attention computations in a non-lazy manner. "and use the eager attention backend for inference."
  • Gaussian blur: An image smoothing perturbation used to test visual dependency. "we apply four distinct perturbation types including block occlusion, color jitter, random region masking, and Gaussian blur."
  • Gaussian policy gradient: A method that uses Gaussian perturbations of latent actions to estimate gradients for optimization. "We give a detailed derivation of the gradient used to update the latent thought vectors $H$ ... via a Gaussian policy gradient method."
  • Greedy decoding: Deterministic generation without sampling for model outputs. "Unless otherwise stated, we use greedy decoding (do_sample=False) for all generation tasks."
  • Hallucination: Model-generated claims about visual content that are unsupported or fabricated. "HallusionBench is a benchmark for image-context reasoning that uses carefully structured question pairs to diagnose hallucination, visual illusion, and logical inconsistency in large vision-LLMs."
  • Latent reasoning state: A vector in latent space representing the internal reasoning configuration of the model. "We consider the latent reasoning state $h \in \mathbb{R}^d$"
  • Latent think tokens: Special latent tokens used to carry and structure internal “thought” during generation. "Latent Think Tokens $\mathcal{T}$: We set the number of latent think tokens to 4."
  • Latent thought vectors: Continuous hidden-state representations of intermediate reasoning steps. "We give a detailed derivation of the gradient used to update the latent thought vectors $H$ (e.g., latent think tokens)"
  • Monte Carlo sampling: Random sampling used to approximate expectations and gradients. "In practice, the expectation in~\eqref{eq:latent_pg_gaussian} is approximated via Monte Carlo sampling."
  • Mutual information (MI): An information-theoretic measure of shared information between variables. "the mutual information between the latent state and $Y$ is a strictly increasing function of the mutual information between the latent state and visual features."
  • Negative definite: A property of a Hessian indicating a strict local maximum in the confidence function. "$\nabla^2 C(h_{\mathrm{trap}}) \text{ is negative definite,}$"
  • Pass@k: A metric estimating the probability that at least one of k generated solutions is correct. "We employ the Pass@k metric to evaluate the accuracy of the model's generated answers."
  • Pearson correlation: A statistical measure of linear correlation between two variables or curves. "Values represent the average Pearson correlation between dependency curves under different perturbations."
  • Plug-and-play: An approach that can be integrated without additional training or parameter updates. "It employs a plug-and-play Attention-driven Selection (ADS) mechanism"
  • Policy gradient: A reinforcement learning technique to compute gradients of expected reward with respect to policy parameters. "\begin{lemma}[Policy gradient identity]"
  • Random region masking: An image perturbation that hides randomly chosen areas to probe visual reliance. "we apply four distinct perturbation types including block occlusion, color jitter, random region masking, and Gaussian blur."
  • Relative-position alignment scheme: A normalization method aligning tokens across different-length reasoning chains by relative indices. "we adopt a relative-position alignment scheme that normalizes each reasoning chain to a comparable relative index space."
  • Scene graph: A structured representation of objects, attributes, and relationships extracted from an image. "It first generates a scene graph to capture object attributes and relationships"
  • Sparsity pattern: A structured distribution where visual dependency concentrates at key reasoning steps rather than uniformly. "This consistency confirms that the sparsity pattern observed in the main paper is not tied to any specific perturbation method"
  • Unbiased estimator: An estimator whose expected value equals the true parameter being estimated. "Following standard practice, we calculate the unbiased estimator using the formula:"
  • Visual dependency: The extent to which tokens or reasoning steps rely on visual input. "To obtain a stable estimation of token-level visual dependency, each dependency value is averaged across five independently perturbed versions of the same image"
  • Visual grounding: Ensuring that textual reasoning or claims align with and are supported by visual evidence. "such tokens consistently align with reasoning stages in which visual grounding is intrinsically required"
  • Visual patches: Small image regions inserted into the model’s processing stream to refresh visual context. "We dynamically insert visual patches into the latent stream."
  • Zero-shot prompting: Using prompts to elicit capabilities without any task-specific training. "CCoT ... is a zero-shot prompting method that utilizes scene graphs to extract compositional knowledge."
  • Zero temperature: A deterministic setting for sampling or judgments that removes randomness in outputs. "All judgments are performed with zero temperature to maintain high determinism."
