CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Abstract: AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable, classification-style decisions. To balance stability and informativeness, the method pairs sparse reward assignment with dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning: starting from an 8B base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau-Bench, 10 points on BFCL-V4, and 12 points on ToolSandbox, matching or even outperforming similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code (open-source community release): https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
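To make the checklist-reward idea concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the released implementation: the criterion schema, the string-matching stand-in for the LLM judge, and the per-turn aggregation are all hypothetical, chosen only to show the "sparse reward, dense criteria" shape described in the abstract (many binary checks per turn, one scalar reward per turn).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    """One fine-grained binary check for a turn (hypothetical schema)."""
    description: str                 # e.g. "Agent asked for the order ID before calling lookup_order"
    check: Callable[[str], bool]     # stand-in for an LLM judge returning True/False

def checklist_reward(turn_trace: str, criteria: List[Criterion]) -> float:
    """Sparse reward, dense criteria: one scalar per turn,
    computed as the fraction of binary criteria satisfied."""
    if not criteria:
        return 0.0
    passed = sum(1 for c in criteria if c.check(turn_trace))
    return passed / len(criteria)

# Toy usage with string-matching stand-ins for the judge.
trace = "assistant: Could you share your order ID? ... tool_call: lookup_order(id=123)"
criteria = [
    Criterion("asked for order ID", lambda t: "order ID" in t),
    Criterion("called lookup_order", lambda t: "lookup_order" in t),
    Criterion("did not promise a refund", lambda t: "refund" not in t),
]
print(checklist_reward(trace, criteria))  # 1.0 -> all three criteria pass
```

In CM2 the binary decisions would come from an LLM judge grounded in evidence from the dialogue and tool trace rather than keyword matching; only the aggregation shape is the point here.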
Explain it Like I'm 14
A simple explanation of “Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space” (Appendix)
Overview
This paper explains and tests a new way for AI models to think through problems that involve both text and images. The idea is called DMLR (Dynamic Multimodal Latent Reasoning). Instead of only writing out step‑by‑step text, the model “thinks” inside its hidden mind (its internal states) and, at key moments, it brings in small parts of the image to guide its thinking. The appendix shows how the authors evaluated this idea, the settings they used, extra results, and the reasons why this approach works.
Key objectives and questions
To make the purpose of the paper clear, here are the main questions the authors wanted to answer:
- When do AI models really need to “look” at the image during reasoning, and when can they just use text?
- If the model dynamically brings in image details at the right moments, does it become more accurate and more confident?
- How can we measure whether the model’s reasoning is faithful (based on real evidence), and whether it avoids hallucinations (making things up)?
- Can we improve models at test time (without extra training) by optimizing their “inner thoughts” to be more reliable?
Methods and approach
The authors tested DMLR using a unified setup across many tasks, with clear prompts and a consistent way of judging answers.
- How the model “thinks”: Instead of only producing visible text, the model uses “latent think tokens,” which you can imagine as invisible steps in its mind. After each hidden step, the model pulls in a few small image patches (like tiny crops of the picture) that its attention says are most relevant. This is called dynamic visual injection. Think of it like solving a math puzzle while occasionally glancing back at key parts of the image when you reach a tricky step.
- Picking image patches: The model’s attention acts like a flashlight that highlights the most important parts of the picture. At each internal step, it re‑checks the image, picks new relevant patches, and adds them into its hidden thinking. This keeps its mental picture fresh and focused.
- Confidence and correctness checks: The authors compared the model’s final answers to the ground truth and used a careful process to avoid judging mistakes caused by formatting. They also used a strong external judge (GPT‑4o, run deterministically) to check whether a reasoning chain was logically sound and truly used the given evidence.
- Faithfulness and hallucination testing: The team looked at reasoning steps that referenced visual facts (like “the triangle is red” or “there are 3 coins”). They checked whether each claim matched the image. If not, it was labeled as a hallucination. They then compared the model’s confidence at those steps to see patterns.
- Visual dependency analysis: To find out when tokens (pieces of the model’s reasoning) depend on the image, they perturbed images in different ways (blur, block occlusion, random masking, color changes) and watched how the reasoning changed. If a token’s output shifted a lot under image perturbation, it likely depended on vision. (A small sketch of this bookkeeping appears after this list.)
- Pass@k metric: This measures the chance that at least one of k attempts is correct. For example, Pass@8 asks, “If the model tries up to 8 different solutions, how likely is it to produce at least one correct answer?” This helps judge robustness beyond a single try. (A short estimator sketch also follows this list.)
- Simple theory to explain why it works:
- Confidence gradient: Imagine walking uphill toward “feeling more sure.” If “feeling more sure” aligns with “being more correct,” then taking small steps uphill improves quality.
- Visual injection and information: Bringing in the right visual details increases how much the model’s inner state knows about the answer. When the model knows more, its uncertainty drops, so its confidence rises. In everyday terms: better clues → better chances → more certainty.
- A note on optimization inside the mind: The model tries tiny changes to its hidden thoughts (like testing a small edit to its plan), scores the result (confidence as the “reward”), and then nudges its inner state in the direction that helped. This is similar to trying a small tweak, seeing if it helps, and keeping what works. (A simplified code sketch of this loop, combined with the visual-injection step described above, appears right after this list.)
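The hidden-thought loop in these bullets can be illustrated with a toy NumPy sketch. It is a minimal illustration under stated assumptions, not the authors’ implementation: the attention scores, the quadratic confidence function, and the patch features are stand-ins, and the baseline subtraction in the policy-gradient step is our own addition for stability (the Knowledge Gaps section below notes that the paper itself does not explore baseline subtraction).

```python
import numpy as np

rng = np.random.default_rng(0)

def select_patches(attn_scores: np.ndarray, patch_feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Dynamic visual injection, toy version: keep the k patches the model
    attends to most so their features can be folded into the latent stream."""
    top = np.argsort(attn_scores)[-k:]
    return patch_feats[top]

def confidence(h: np.ndarray) -> float:
    """Stand-in confidence landscape C(h); in the paper this would be the
    model's own certainty about its answer, not a fixed quadratic."""
    target = np.ones_like(h)
    return float(-np.sum((h - target) ** 2))

def latent_update(h: np.ndarray, sigma: float = 0.1, lr: float = 0.05, n_samples: int = 8) -> np.ndarray:
    """Gaussian policy-gradient step: perturb the latent thought, score each
    perturbation with confidence as the reward, and move toward what helped.
    The baseline subtraction is our addition to tame variance."""
    baseline = confidence(h)
    grad = np.zeros_like(h)
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=h.shape)
        grad += (confidence(h + eps) - baseline) * eps / sigma**2  # REINFORCE-style estimate of the gradient of C at h
    return h + lr * grad / n_samples

# Toy reasoning loop: a few latent "think" steps, each refreshed with image patches.
h = rng.normal(size=8)                                      # latent reasoning state
patch_feats = rng.normal(loc=1.0, scale=0.3, size=(16, 8))  # 16 patches whose features happen to carry the answer
for step in range(4):                                       # 4 latent think tokens, as in the appendix settings
    attn = rng.random(16)                                   # stand-in attention over patches
    visual = select_patches(attn, patch_feats).mean(axis=0)
    h = 0.9 * h + 0.1 * visual                              # inject fresh visual context
    h = latent_update(h)                                    # ascend the confidence landscape
    print(f"step {step}: confidence = {confidence(h):.3f}")
```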
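The visual-dependency probe can be sketched in the same spirit. The model here is a deterministic stand-in, and the perturbations are toy versions of blur, occlusion, masking, and color jitter; the point is only the bookkeeping: perturb the image several ways, re-run the model, and score each reasoning token by how much its predicted distribution shifts.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, vocab = 6, 50
W = rng.normal(size=(n_tokens, vocab))      # fixed "readout" weights for the stand-in model

def token_distributions(image: np.ndarray) -> np.ndarray:
    """Stand-in for the model: per-token distributions that depend deterministically on image statistics."""
    feats = np.array([image.mean(), image.std(), image[:16].mean(),
                      image[16:].mean(), image.max(), image.min()])   # one scalar per reasoning token
    logits = W * feats[:, None]
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def perturb(image: np.ndarray, kind: str) -> np.ndarray:
    """Toy perturbations in place of blur / block occlusion / random masking / color jitter."""
    noisy = image.copy()
    if kind == "occlusion":
        noisy[:8, :8] = 0.0
    elif kind == "noise":
        noisy += rng.normal(0.0, 0.2, size=noisy.shape)
    return noisy

def visual_dependency(image: np.ndarray, kinds=("occlusion", "noise")) -> np.ndarray:
    """Per-token dependency: average total-variation shift of the token
    distribution between the clean and the perturbed image."""
    clean = token_distributions(image)
    shifts = []
    for kind in kinds:
        perturbed = token_distributions(perturb(image, kind))
        shifts.append(0.5 * np.abs(clean - perturbed).sum(axis=1))
    return np.mean(shifts, axis=0)

image = rng.random((32, 32))
print(np.round(visual_dependency(image), 3))  # one dependency score per reasoning token
```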
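Finally, the Pass@k bullet can be made precise with the standard unbiased estimator (n samples drawn per problem, c of them correct, budget k); assuming this is the “standard practice” formula the appendix refers to:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of Pass@k for one problem:
    n = total samples drawn, c = number of correct samples, k = budget."""
    if n - c < k:   # fewer than k incorrect samples -> any k samples contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 3 correct, estimate Pass@8.
print(round(pass_at_k(n=16, c=3, k=8), 3))  # -> 0.9
```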
Main findings and why they matter
The appendix reports several important results across math and visual reasoning benchmarks:
- Vision is used at key moments: The model doesn’t need the image at every step. Instead, it depends on vision at meaningful stages, like describing the scene, checking spatial details, counting, or verifying a number. This pattern stayed stable even when the images were altered in different ways, showing the behavior is robust, not tied to one type of noise.
- Dynamic visual injection boosts accuracy and robustness: When the model repeatedly and selectively brings in the right image patches during its hidden thinking, performance improves across many tasks. In Pass@8 evaluation, DMLR often increases the chance of getting at least one correct solution within 8 tries by roughly 2%–5%, especially on tasks needing careful multi‑step reasoning or precise visual grounding (like MM‑Math, HallusionBench, and MMVP).
- Training‑free but competitive: DMLR consistently matches or beats some methods that require extra training (like MCOUT and IVT‑LR). Because DMLR adapts during inference—without changing the model’s parameters—it generalizes well, especially for smaller models that might overfit when trained heavily.
- Confidence reflects grounding: Steps that hallucinate (claiming visuals that aren’t there) tend to have lower confidence and more uncertainty. This supports the idea that confidence signals can help detect weak or ungrounded reasoning.
- Theory supports practice: The math shows that when “being more sure” aligns with “being more correct,” small steps to increase confidence tend to increase quality. It also explains why adding visual information lowers uncertainty and raises confidence.
Implications and potential impact
This work suggests a practical way to make multimodal AI more reliable without retraining:
- Smarter “when to look” behavior: The model learns to glance back at the image at just the right times, not constantly, which is efficient and effective.
- Fewer hallucinations: By tying confidence to real visual evidence, the model’s reasoning becomes more faithful and trustworthy.
- Plug‑and‑play improvements: Because DMLR works at test time, it can be added to many existing models and tasks.
- Better tools for learning and science: More reliable multimodal reasoning could help with math education, data analysis, science questions, and any task where images and text are combined.
Overall, the appendix shows that dynamically interleaving image details into the model’s “inner thoughts,” and gently optimizing those thoughts for confidence, leads to clearer, more grounded, and more accurate reasoning.
Knowledge Gaps, Limitations, and Open Questions
Below is a consolidated, actionable list of what remains missing, uncertain, or unexplored in the paper’s appendix. Each item is framed to guide future research.
Evaluation Protocol and Metrics
- Lack of explicit definition and calibration of “token-level confidence”: clarify the exact metric (e.g., per-token log-probability, normalized confidence, entropy proxies), calibration status, and aggregation scheme across steps and modalities.
- Limited statistical rigor: no confidence intervals, variance across seeds, or significance testing reported for accuracy and Pass@k; ablations are run on 300-sample subsets without power analysis.
- Sampling setup under Pass@k is under-specified: sampling strategy (temperature, top-p, nucleus/top-k, number of draws n per problem) is not documented, making Pass@k comparability and reproducibility unclear.
- Evaluation protocol caps datasets at 1000 samples and relies on “mini” benchmarks; no assessment of performance stability on full datasets or under data scaling.
- Regex-based answer extraction and “boxed” formatting may bias or mis-score non-math tasks; robustness of answer parsing and error analysis across task types is not provided.
- No calibration metrics (e.g., ECE, Brier score), overconfidence analysis, or selective prediction evaluation to substantiate claims about “confidence” beyond accuracy gains (a minimal ECE sketch follows this list).
- Latency–quality trade-offs are not measured: missing benchmarks for wall-clock time, energy use, and memory overhead per sample as optimization steps and k vary.
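For concreteness, the calibration check requested a few bullets above (ECE, Brier score) is cheap to run once per-answer confidences exist. Below is a minimal sketch of expected calibration error with equal-width bins; the binning scheme and the confidence signal are our assumptions, since the paper defines neither.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted average gap between mean confidence
    and empirical accuracy inside each confidence bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Toy example: 5 answers with model confidence and 0/1 correctness.
conf = np.array([0.95, 0.80, 0.70, 0.60, 0.30])
corr = np.array([1, 1, 0, 1, 0])
print(round(expected_calibration_error(conf, corr), 3))
```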
Theoretical Assumptions and Validation
- Strong, unverified assumption that I(Y; T) is a strictly increasing function of I(T; z_v) (existence and monotonicity of g): no empirical tests or counterexamples support this information-theoretic linkage in deep models. (These assumptions are restated compactly after this list.)
- Mutual information terms are never estimated or proxied; the key claim that DVI increases I(T; z_v) (and thereby I(Y; T)) is not validated empirically.
- Confidence–quality gradient alignment is local and assumption-heavy (smoothness, alignment); no empirical test of alignment (e.g., directional derivatives, sensitivity analysis) in real models or detection of “confidently incorrect” traps in practice.
- Latent policy gradient lacks convergence guarantees, variance analysis, or stability bounds under single-sample Monte Carlo; no baseline subtraction, control variates, or adaptive noise to reduce variance explored.
- No characterization or detection method for confidently incorrect traps; no practical strategies to escape such traps or to adapt step sizes dynamically when entering misaligned regions.
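For readers who want the theoretical assumptions in one place, they can be restated compactly. The notation below is reconstructed from the fragments quoted in this summary ($T$/$h$ the latent reasoning state, $z_v$ visual features, $Y$ the answer, $C$ the confidence landscape, $Q$ answer quality); the paper’s exact symbols and side conditions may differ.

```latex
% Monotone information link assumed by the paper:
I(Y;\,T) \;=\; g\bigl(I(T;\,z_v)\bigr), \qquad g'(\cdot) > 0 .
% Test-time update: ascend the confidence landscape over latent states h:
h_{t+1} \;=\; h_t \;+\; \eta\,\nabla_h C(h_t).
% Local confidence--quality alignment needed for the ascent to raise quality Q:
\bigl\langle \nabla_h C(h),\, \nabla_h Q(h) \bigr\rangle \;\ge\; 0 .
```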
Algorithm Design and Hyperparameters
- Hyperparameter sensitivity is limited: no systematic sweeps for number of latent tokens, number of patches per step, maximum patches, step sizes, decay schedules, or learning rates across tasks and backbones.
- Visual patch selection depends on attention scores; no comparison against alternative attribution methods (e.g., gradients, integrated gradients, RISE, occlusion-based) to test robustness of patch selection.
- Image resolution fixed at 256 px for all tasks; no analysis of high-resolution requirements, scale sensitivity, or impact on fine-grained perception tasks.
- The DVI procedure is partially described (appendix text truncates mid-algorithm) and lacks complete pseudocode/specifications (e.g., patch stride/size, re-encoding strategy, cache reuse), hindering exact reproduction.
- No study on when DVI hurts performance (negative deltas): conditions, error modes, tasks, or model regimes where dynamic injection degrades reasoning or induces instability are not analyzed.
Robustness and Generalization
- Visual dependency analysis uses common perturbations but does not test adversarial perturbations, distribution shifts, domain shifts (e.g., medical, aerial, document OCR), or robustness to spurious correlations.
- Reliance on attention-aligned “high-dependency” tokens lacks human-annotated validation or causal tests (e.g., counterfactual interventions) to confirm that identified steps are truly vision-critical.
- No evaluation of robustness to prompt manipulations (e.g., prompt injection, instruction conflicts) or to noisy/ambiguous image–text alignment scenarios.
- Seed robustness is not reported (single seed fixed at 42); no assessment of variability across seeds or hardware backends.
Comparative Scope and Fairness
- Exclusion of “think-with-image” models (e.g., DeepEye, GRIT) narrows comparative scope; a controlled study (possibly subsampled or matched compute) is needed to contextualize DMLR vs. training-heavy approaches.
- Baseline configurations may not be fully optimal or aligned (e.g., greedy decoding for all vs. sampling for Pass@k); thorough per-method tuning and reporting of hyperparameters for fairness is absent.
- Limited coverage of larger or proprietary frontier models; unclear how DMLR scales with model size and whether gains persist or saturate for stronger backbones.
Efficiency, Systems, and Deployment
- Missing compute profile for DMLR: no measurements of FLOPs, GPU memory footprint, latency per iteration, and throughput under varying optimization steps and patch counts.
- No analysis of amortizing DVI (e.g., caching visual features, reusing attention maps) or opportunities for early stopping, adaptive step counts, or dynamic compute allocation at inference time.
- Precision and quantization not explored: all runs use float32; effects under FP16/BF16/INT8 (typical in deployment) and on different accelerators are unknown.
- Applicability to resource-constrained or real-time settings is not discussed; feasibility on edge devices or with strict latency budgets remains open.
Annotation, Judging, and Faithfulness
- Heavy reliance on GPT-4o as a judge (faithfulness, correctness, hallucination) raises potential bias and consistency issues; no human expert evaluation or cross-judge agreement analysis is provided.
- Automatic extraction of “visual statements” from chains is not validated for precision/recall; mis-extractions may confound hallucination labeling and confidence associations.
- The dual-pass GPT-4o judging protocol lacks an error analysis and does not report inter-annotator (inter-judge) agreement metrics or adjudication procedures for disagreements.
Reproducibility and Transparency
- Code is not yet released; exact re-implementation is hindered by missing algorithmic specifics (e.g., DVI patching details, sampling parameters, policy gradient implementations).
- Dataset subsampling procedures (selection criteria for “mini” sets, random seeds for subsampling) are not fully specified; risk of selection bias without shared splits.
- Some formulae/typesetting (e.g., Pass@k equation) appear malformed, inviting ambiguity; a formal checklist of equations, hyperparameters, and inference flags would aid replication.
Safety, Ethics, and Reliability
- No analysis of safety impacts from latent-state manipulation (e.g., bypassing safety filters, shifting outputs towards unsafe content) or interactions with guardrails.
- Bias and fairness effects not studied: whether DVI shifts demographic performance disparities or reinforces dataset artifacts remains unknown.
- No exploration of catastrophic failure cases (e.g., increased hallucinations under specific conditions) or of monitoring/abort criteria when confidence rises but external evidence is weak.
Scope of Applicability
- Generality beyond vision–language tasks is not demonstrated: applicability to audio–text, video–text, or non-perceptual reasoning domains (code, symbolic math) is untested.
- Interplay with training-time methods is unexplored: can DMLR be combined with training-based latent reasoning (e.g., MCOUT, IVT-LR) to further improve performance, or does it interfere?
- Long-context and multi-image/multi-hop settings are not examined; how dynamic injection scales with multiple images, pages, or temporal sequences remains an open question.
Practical Applications
Immediate Applications
Below are actionable use cases derived from the paper’s findings, methods, and innovations that could be deployed now.
Industry
- Multimodal Visual Analysis Tools: Implement Dynamic Visual Injection (DVI) modules in sectors like finance and marketing to enhance visual data analysis and decision-making processes.
- Augmented Reality in Retail: Utilize the DMLR framework to improve product visualization and buying experiences in retail environments.
- Vision-LLMs for Automation: Apply multimodal reasoning techniques for automated quality assurance systems in manufacturing.
Academia
- Educational Software: Develop educational applications that use MathVista benchmarks for enhanced mathematical visual reasoning in digital learning platforms.
- Research in Multimodal Integration: Foster collaborations across fields like computer science and cognitive psychology to explore further integration of visual cues in reasoning processes.
Policy
- Policy Planning for AI in Education: Adopt ScienceQA benchmarks for designing AI-assisted educational programs in public schooling systems.
Daily Life
- Personalized Learning Applications: Leverage multimodal composition tasks as study tools for personalized learning applications, adaptable to individual student needs.
Long-Term Applications
These applications require further research, scaling, or development before deployment.
Industry
- Advanced Robotics Systems: Scale DMLR-based reasoning systems for more adaptive and intelligent path-finding in robotics.
- Energy Management Solutions: Investigate DMLR's potential for optimizing energy distribution systems through enhanced multimodal data processing.
Academia
- Cross-Disciplinary Research: Establish foundational research investigating the boundaries of visual reasoning across different learning systems.
Policy
- Regulatory Framework Development: Develop new policies around data privacy and ethical AI use concerning dynamic multimodal reasoning systems.
Daily Life
- Assistive Technologies for Disabilities: Develop technologies that utilize visual-context reasoning to assist individuals with disabilities.
Assumptions and Dependencies
Here are some assumptions or dependencies that could impact the feasibility of each application.
- Technical Dependencies: The implementation of DMLR requires high computational power and advanced machine learning infrastructure.
- Data Availability: Successful deployment in industry requires access to multimodal datasets similar to those used in the paper.
- Regulatory Approvals: Long-term applications, particularly in policy and industry, may require significant changes in legislation or regulatory frameworks.
- User Acceptance: Applications in daily life and education need user acceptance and adaptability to new technology interfaces.
Glossary
- Attention-driven Selection (ADS): A mechanism that selects relevant image regions based on model attention to interleave visual and textual reasoning. "It employs a plug-and-play Attention-driven Selection (ADS) mechanism to dynamically identify and insert relevant image regions into the reasoning chain based on the model's attention maps."
- Basin of attraction: The set of initial states that converge to a particular point under iterative updates. "Let $\mathcal{B}(h_{\mathrm{trap}})$ denote the basin of attraction of $h_{\mathrm{trap}}$ under~\eqref{eq:conf_dynamics}"
- Block occlusion: An image perturbation that hides content by covering it with blocks to test robustness. "we apply four distinct perturbation types including block occlusion, color jitter, random region masking, and Gaussian blur."
- Cauchy–Schwarz inequality: A mathematical inequality used to bound inner products in proofs. "By the Cauchy--Schwarz inequality and Assumption~A.2,"
- Chain-of-Thought (CoT): A method prompting models to generate intermediate reasoning steps before answers. "MCOUT (Training)~\cite{pham2025multimodalchaincontinuousthought} is a latent-space reasoning framework that replaces traditional text-based CoT with continuous hidden-state “thought vectors,”"
- CLIP-blind image–text pairs: Image–text pairs that systematically fool CLIP-like models, exposing perception failures. "MMVP is a benchmark built from multimodal visual patterns designed to expose “CLIP-blind’’ image–text pairs,"
- Confidence landscape: The function over latent states representing model confidence, treated as an optimization surface. "During test-time optimization, DMLR updates the latent state by ascending the confidence landscape:"
- Conditional entropy: The uncertainty of a target variable given the latent state, related to confidence. "The model’s confidence objective is a strictly decreasing function of the conditional entropy of [the answer] given the latent state:"
- Descent lemma: A smoothness-based inequality bounding function change via gradients and step sizes. "\begin{lemma}[Descent lemma form]"
- Dynamic Visual Injection (DVI): A module that dynamically injects visual features or patches into the latent reasoning stream. "Section~\ref{e} further elaborates on the design choices, mechanisms, and stability analyses of the Dynamic Visual Injection module."
- Eager attention backend: An inference implementation that evaluates attention computations in a non-lazy manner. "and use the eager attention backend for inference."
- Gaussian blur: An image smoothing perturbation used to test visual dependency. "we apply four distinct perturbation types including block occlusion, color jitter, random region masking, and Gaussian blur."
- Gaussian policy gradient: A method that uses Gaussian perturbations of latent actions to estimate gradients for optimization. "We give a detailed derivation of the gradient used to update the latent thought vectors ... via a Gaussian policy gradient method."
- Greedy decoding: Deterministic generation without sampling for model outputs. "Unless otherwise stated, we use greedy decoding (do_sample=False) for all generation tasks."
- Hallucination: Model-generated claims about visual content that are unsupported or fabricated. "HallusionBench is a benchmark for image-context reasoning that uses carefully structured question pairs to diagnose hallucination, visual illusion, and logical inconsistency in large vision-LLMs."
- Latent reasoning state: A vector in latent space representing the internal reasoning configuration of the model. "We consider the latent reasoning state "
- Latent think tokens: Special latent tokens used to carry and structure internal “thought” during generation. "Latent Think Tokens $\mathcal{T}$: We set the number of latent think tokens to 4."
- Latent thought vectors: Continuous hidden-state representations of intermediate reasoning steps. "We give a detailed derivation of the gradient used to update the latent thought vectors (e.g., latent think tokens)"
- Monte Carlo sampling: Random sampling used to approximate expectations and gradients. "In practice, the expectation in~\eqref{eq:latent_pg_gaussian} is approximated via Monte Carlo sampling."
- Mutual information (MI): An information-theoretic measure of shared information between variables. "the mutual information between the latent state and [the answer] is a strictly increasing function of the mutual information between the latent state and visual features."
- Negative definite: A property of a Hessian indicating a strict local maximum in the confidence function. "$\nabla^2 C(h_{\mathrm{trap}}) \text{ is negative definite,}$"
- Pass@k: A metric estimating the probability that at least one of k generated solutions is correct. "We employ the Pass@$k$ metric to evaluate the accuracy of the model's generated answers."
- Pearson correlation: A statistical measure of linear correlation between two variables or curves. "Values represent the average Pearson correlation between dependency curves under different perturbations."
- Plug-and-play: An approach that can be integrated without additional training or parameter updates. "It employs a plug-and-play Attention-driven Selection (ADS) mechanism"
- Policy gradient: A reinforcement learning technique to compute gradients of expected reward with respect to policy parameters. "\begin{lemma}[Policy gradient identity]"
- Random region masking: An image perturbation that hides randomly chosen areas to probe visual reliance. "we apply four distinct perturbation types including block occlusion, color jitter, random region masking, and Gaussian blur."
- Relative-position alignment scheme: A normalization method aligning tokens across different-length reasoning chains by relative indices. "we adopt a relative-position alignment scheme that normalizes each reasoning chain to a comparable relative index space."
- Scene graph: A structured representation of objects, attributes, and relationships extracted from an image. "It first generates a scene graph to capture object attributes and relationships"
- Sparsity pattern: A structured distribution where visual dependency concentrates at key reasoning steps rather than uniformly. "This consistency confirms that the sparsity pattern observed in the main paper is not tied to any specific perturbation method"
- Unbiased estimator: An estimator whose expected value equals the true parameter being estimated. "Following standard practice, we calculate the unbiased estimator using the formula:"
- Visual dependency: The extent to which tokens or reasoning steps rely on visual input. "To obtain a stable estimation of token-level visual dependency, each dependency value is averaged across five independently perturbed versions of the same image"
- Visual grounding: Ensuring that textual reasoning or claims align with and are supported by visual evidence. "such tokens consistently align with reasoning stages in which visual grounding is intrinsically required"
- Visual patches: Small image regions inserted into the model’s processing stream to refresh visual context. "We dynamically insert visual patches into the latent stream."
- Zero-shot prompting: Using prompts to elicit capabilities without any task-specific training. "CCoT ... is a zero-shot prompting method that utilizes scene graphs to extract compositional knowledge."
- Zero temperature: A deterministic setting for sampling or judgments that removes randomness in outputs. "All judgments are performed with zero temperature to maintain high determinism."