
MuSoBench: Benchmark for Multi-Solution LLM Evaluation

Updated 8 December 2025
  • MuSoBench is a benchmark for multi-solution tasks with exhaustively enumerated ground-truth solution sets in domains such as scheduling and subset sum.
  • It quantifies reasoning overconfidence by comparing a model's verbalized confidence with its empirical recall, reporting precision, recall, and expected calibration error (ECE).
  • The benchmark evaluates models under Short-CoT and Long-CoT protocols, with the latter significantly enhancing recall and error correction.

MuSoBench is a rigorously constructed benchmark designed to evaluate LLMs on their ability to comprehensively enumerate all valid solutions in well-defined combinatorial tasks, rather than simply producing a single “best” answer. Its primary aim is to expose the phenomenon of reasoning overconfidence—where a model halts prematurely after finding an incomplete solution set yet assigns high confidence to its coverage. MuSoBench addresses the need for holistic evaluation protocols that emphasize completeness and calibration over narrow correctness, with a particular focus on the pathologies that emerge in multi-solution domains such as scheduling and combinatorial optimization (Guan et al., 1 Dec 2025).

1. Motivation and Conceptual Foundations

Traditional LLM evaluation emphasizes tasks with unique or tightly bounded answers, such as multiple-choice questions or arithmetic puzzles, providing little insight into models’ ability to comprehensively explore solution spaces. In practice, many domains—including scheduling, combinatorial optimization, and creative design—prioritize completeness and diversity rather than spot accuracy. Prior “multi-answer” benchmarks either severely constrain the solution set (making exhaustive enumeration trivial) or leave it unbounded (as in open-ended generation), precluding robust measurement of search completeness.

MuSoBench fills this gap by introducing multi-solution tasks whose ground-truth solution sets can be exhaustively enumerated and validated. Its chief design goal is to quantify reasoning overconfidence (ROC), where an LLM ceases exploration early yet expresses unjustified certainty in its output. This diagnostic construct is central to understanding the limitations of current prompt paradigms and internal reasoning architectures (Guan et al., 1 Dec 2025).

2. Dataset Construction and Structure

MuSoBench comprises two principal combinatorial problem domains:

  • TimeTabling: Requires the generation of all conflict-free course schedules under constraints on rooms, time slots, and instructor assignments.
  • SubsetSum: Involves finding every nonempty subset of a given set of small integers that sums to a target value.

Both domains are instantiated at several discrete complexity levels, defined by the cardinality of the ground-truth solution set. Specifically, TimeTabling is stratified into 10 complexity levels with 100 instances each (1,000 problems), while SubsetSum includes 7 complexity levels with 100 instances each (700 problems). For each problem, exhaustive enumeration of the full solution set is performed via backtracking, with manual verification to ensure correctness and completeness.
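
To make the construction concrete, the following is a minimal Python sketch of exhaustive subset-sum enumeration by backtracking. It illustrates the general procedure rather than the authors' actual generation code; the function name and the example instance are illustrative.

```python
from typing import List

def enumerate_subset_sums(values: List[int], target: int) -> List[List[int]]:
    """Exhaustively enumerate every nonempty subset of `values` that sums to `target`.

    Backtracks over include/exclude decisions for each element; tractable only
    for the small instance sizes used in SubsetSum-style benchmarks.
    """
    solutions: List[List[int]] = []

    def backtrack(index: int, chosen: List[int], running_sum: int) -> None:
        if index == len(values):
            if chosen and running_sum == target:
                solutions.append(chosen.copy())
            return
        # Branch 1: include values[index]
        chosen.append(values[index])
        backtrack(index + 1, chosen, running_sum + values[index])
        chosen.pop()
        # Branch 2: exclude values[index]
        backtrack(index + 1, chosen, running_sum)

    backtrack(0, [], 0)
    return solutions

# Example: all subsets of {1, 2, 3, 4, 5} summing to 6
print(enumerate_subset_sums([1, 2, 3, 4, 5], 6))
# [[1, 2, 3], [1, 5], [2, 4]]
```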

Instances are thereby grouped to enable a controlled study of reasoning performance as a function of enumerative difficulty. The formal task definition is

$$\mathcal{T} = \left\{ (x_i, \widehat{\mathcal{Y}_i}) \;\middle|\; x_i \text{ is a problem, } \widehat{\mathcal{Y}_i} \text{ is its ground-truth solution set, } |\widehat{\mathcal{Y}_i}| \ge 1 \right\}_{i=1}^{N}.$$

A model $\mathcal{M}$, given $x_i$, produces a set $\mathcal{Y}_i = \mathcal{M}(x_i)$ that is compared exhaustively against $\widehat{\mathcal{Y}_i}$.

| Domain | Complexity Levels | Instances per Level | Total Problems |
|---|---|---|---|
| TimeTabling | 10 | 100 | 1,000 |
| SubsetSum | 7 | 100 | 700 |

3. Evaluation Metrics and Formalism

MuSoBench is constructed with a suite of metrics that target three orthogonal aspects of LLM performance: coverage (performance), calibration (overconfidence), and solution dynamics (behavioral change upon reflection).

A. Coverage

  • $\mathrm{Precision}(x_i) = \dfrac{|\mathcal{Y}_i \cap \widehat{\mathcal{Y}_i}|}{|\mathcal{Y}_i|}$
  • $\mathrm{Recall}(x_i) = \dfrac{|\mathcal{Y}_i \cap \widehat{\mathcal{Y}_i}|}{|\widehat{\mathcal{Y}_i}|}$. Recall is considered the primary metric for enumerative completeness.
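
As a concrete illustration (not the benchmark's scoring code), the coverage metrics can be computed from a predicted solution set and the ground truth, assuming solutions are canonicalized into hashable forms such as sorted tuples; names are illustrative.

```python
def coverage_metrics(predicted: set, ground_truth: set) -> tuple[float, float]:
    """Return (precision, recall) of a predicted solution set versus ground truth."""
    hits = len(predicted & ground_truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth)  # ground truth is nonempty by construction
    return precision, recall

# Example: the model found 2 of 3 valid subsets plus one invalid one
pred = {(1, 2, 3), (1, 5), (2, 3)}      # (2, 3) does not sum to the target 6
truth = {(1, 2, 3), (1, 5), (2, 4)}
print(coverage_metrics(pred, truth))    # (0.666..., 0.666...)
```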

B. Overconfidence (Calibration)

$$\mathrm{ECE}(r) = \sum_{m=1}^{M}\frac{|B_m|}{N}\, \left|\overline{\mathrm{Recall}}(B_m) - \overline{\mathrm{Conf}}(B_m)\right|$$

where $\overline{\mathrm{Conf}}(B_m)$ and $\overline{\mathrm{Recall}}(B_m)$ are the bin-averaged verbalized confidence and empirical recall, respectively. A model is flagged as reasoning-overconfident if $c_i > \mathrm{Recall}(x_i)$ systematically.
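
A minimal sketch of this binned calibration error, assuming equal-width confidence bins (the binning scheme is an assumption, not specified above):

```python
import numpy as np

def ece_recall(confidences: np.ndarray, recalls: np.ndarray, n_bins: int = 10) -> float:
    """Expected calibration error between verbalized confidence and empirical recall.

    Both arrays hold per-instance values in [0, 1]; instances are grouped into
    equal-width bins over confidence.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Right-closed bins; the first bin also includes its left edge.
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            gap = abs(recalls[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece

# Example: three instances with 0-100 confidences rescaled to [0, 1]
conf = np.array([0.9, 0.95, 0.4])
rec = np.array([0.2, 0.5, 0.4])
print(ece_recall(conf, rec))  # ~0.383
```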

C. Behavioral Dynamics

Models undergo a second round of reflection, and three metrics are tracked:

  • Correct Solution Retention (CSR): fraction of initially correct answers that persist.
  • Error Solution Correction (ESC): fraction of incorrect answers removed.
  • New Solution Discovery (NSD): fraction of missing ground-truth solutions discovered upon re-examination.

Formally, the definitions are:

$$\mathrm{CSR} = \frac{|(\mathcal{Y}_{i,1} \cap \widehat{\mathcal{Y}_i}) \cap (\mathcal{Y}_{i,2} \cap \widehat{\mathcal{Y}_i})|}{|\mathcal{Y}_{i,1} \cap \widehat{\mathcal{Y}_i}|}, \qquad \mathrm{ESC} = 1 - \frac{|(\mathcal{Y}_{i,1}\setminus \widehat{\mathcal{Y}_i})\cap (\mathcal{Y}_{i,2}\setminus \widehat{\mathcal{Y}_i})|}{|\mathcal{Y}_{i,1}\setminus \widehat{\mathcal{Y}_i}|}$$

$$\mathrm{NSD} = \frac{|(\mathcal{Y}_{i,2}\cap \widehat{\mathcal{Y}_i})\setminus \mathcal{Y}_{i,1}|}{|\widehat{\mathcal{Y}_i}|}$$
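
A sketch of how the three behavioral metrics could be computed for one instance from the first-round set $\mathcal{Y}_{i,1}$, second-round set $\mathcal{Y}_{i,2}$, and ground truth (illustrative, not the benchmark's evaluation code):

```python
def behavioral_metrics(round1: set, round2: set, ground_truth: set) -> dict:
    """CSR, ESC, and NSD for one instance, given two rounds of model output."""
    correct1, correct2 = round1 & ground_truth, round2 & ground_truth
    wrong1, wrong2 = round1 - ground_truth, round2 - ground_truth

    csr = len(correct1 & correct2) / len(correct1) if correct1 else None   # retained correct answers
    esc = (1 - len(wrong1 & wrong2) / len(wrong1)) if wrong1 else None     # removed incorrect answers
    nsd = len(correct2 - round1) / len(ground_truth)                       # newly discovered solutions
    return {"CSR": csr, "ESC": esc, "NSD": nsd}
```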

4. Experimental Protocols and Baseline Comparisons

Two primary prompting paradigms are compared: Short-CoT and Long-CoT. The benchmark evaluates releases such as Qwen3-8B (with and without “thinking mode”), DeepSeek-V3 and DeepSeek-R1, and GPT-4o-mini under both paradigms. A verbal-elicitation protocol is applied after each output, requesting a confidence rating on a 0–100 scale.
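
The exact elicitation wording is not reproduced here; the following is a hypothetical prompt template consistent with the described 0–100 verbal-confidence protocol:

```python
# Hypothetical post-output elicitation prompt (wording is an assumption, not the paper's).
CONFIDENCE_PROMPT = (
    "You previously listed a set of solutions to the problem above. "
    "On a scale from 0 to 100, how confident are you that your list contains "
    "ALL valid solutions? Reply with a single integer."
)
```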

Selected findings:

  • On TimeTabling and SubsetSum, Short-CoT yields average Recall ≈ 10–30%, Precision ≈ 30–70%, and ECE(recall) >78.2% for all model families.
  • Long-CoT raises Recall by a factor of 3–5 and lowers ECE(recall) by ≥18.3%, attaining single-digit ECE on closed-source models (open-source remains ≥56.5%).
  • Behavioral metrics: Long-CoT achieves ESC up to 99.2% and NSD up to 3.0%, an order of magnitude above Short-CoT, which rarely recovers missing solutions upon reconsideration.

5. Analysis: Reasoning Overconfidence and Cognitive Rigidity

Short-CoT consistently produces outputs clustered in the low-recall, high-confidence regime—evidencing reasoning overconfidence (ROC), in which the model concludes it has found all solutions and halts prematurely. The cognitive-rigidity hypothesis posits that ROC is induced by the model’s rapid convergence onto a restricted set of thought paths, precluding branch-and-backtrack exploration.

Attention-entropy analysis of Qwen3-8B reveals distinct patterns:

  • Early transformer layers (0–10): converged entropy profiles.
  • Core reasoning layers (15–30): Short-CoT yields low entropy (rigid focus), whereas Long-CoT maintains higher entropy (indicative of flexible exploration).
  • Final layers (35+): entropy rises in Short-CoT and falls in Long-CoT, reflecting late-stage confusion versus certainty after deeper exploration.
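
A hedged sketch of the kind of per-layer attention-entropy computation such an analysis relies on, assuming access to per-layer attention weights (e.g., a Hugging Face model called with output_attentions=True); the averaging choices are assumptions:

```python
import torch

def attention_entropy_per_layer(attentions) -> list[float]:
    """Mean Shannon entropy of attention distributions, one value per layer.

    `attentions`: tuple of tensors shaped (batch, heads, query_len, key_len),
    as returned by Hugging Face models when output_attentions=True.
    """
    entropies = []
    for layer_attn in attentions:
        probs = layer_attn.clamp_min(1e-12)        # avoid log(0)
        ent = -(probs * probs.log()).sum(dim=-1)   # entropy over key positions
        entropies.append(ent.mean().item())        # average over batch, heads, queries
    return entropies
```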

6. Influential Factors and Mitigation Strategies

Performance and calibration are affected by several factors:

  • Reasoning Length: Longer outputs correlate with reduced confidence and superior calibration.
  • Task Complexity: Long-CoT appropriately decreases confidence as solution sets contract; Short-CoT is insensitive to complexity.
  • Decoding Temperature: Modulating sampling temperature has negligible impact on recall or ECE, suggesting ROC arises from internal reasoning pathologies, not surface-level randomness.

Mitigation approaches include:

  • Reflection loops and sequential prompting with cues such as “Wait, there may be others”.
  • Parallel solution enumeration with voting/self-consistency to motivate further exploration.

These interventions improve recall and calibration, though evidence suggests they do not entirely eliminate reasoning overconfidence.
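
As an illustration of the reflection-loop idea under stated assumptions (`generate` stands in for an LLM call plus answer parsing, and the cue wording is illustrative):

```python
def reflective_enumeration(problem: str, generate, max_rounds: int = 3) -> set:
    """Repeatedly re-prompt the model to look for further solutions and union the results.

    `generate(prompt) -> set` is a placeholder for an LLM call plus answer parsing.
    """
    found: set = set()
    prompt = problem
    for _ in range(max_rounds):
        new = generate(prompt)
        if new <= found:          # no new solutions surfaced; stop reflecting
            break
        found |= new
        prompt = (
            f"{problem}\n\nSolutions found so far: {sorted(found)}.\n"
            "Wait, there may be others. List any additional valid solutions."
        )
    return found
```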

7. Broader Implications and Future Directions

MuSoBench establishes a controlled environment for teasing apart LLMs’ enumerative and calibrative reasoning abilities in contexts where completeness is critical. The dual-task design, exhaustive ground-truth annotation, and comprehensive metric suite offer methodological advances over prior “multi-answer” benchmarks.

Findings underscore the persistent gap between LLM verbal confidence and actual coverage in multi-solution domains, highlighting the importance of long-form, iterative reasoning strategies and architectural mechanisms to foster exploratory search. The cognitive-rigidity analysis and attention-entropy diagnostics point to promising directions for developing models with enhanced branch-and-backtrack capabilities—a prerequisite for reliable performance in complex, open-ended reasoning settings (Guan et al., 1 Dec 2025).
