MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness (2504.21773v1)

Published 30 Apr 2025 in cs.CL and cs.AI

Abstract: With the widespread application of LLMs, the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.

MAC-Tuning: Enhancing LLM Reasoning in Multi-Problem Settings

The paper "MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness" addresses a significant challenge in the deployment of LLMs: hallucination, where models generate fabricated or erroneous information. While progress has been made on mitigating hallucinations in single-problem settings, the more demanding multi-problem setting remains relatively underexplored. This setting requires an LLM to answer multiple interconnected queries within a single input, which often exposes deficiencies in model reliability and confidence estimation.

Methodological Advances

The authors introduce a novel approach: Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning). This innovative method segregates the learning of answer generation and confidence estimation during the fine-tuning phase on instructional data. Such segregation ensures a more targeted learning approach, enhancing the model's sensitivity to its inherent knowledge boundaries and improving its reliability across multi-problem settings.

The methodology is implemented through several key procedures:

  1. Data Construction: Multiple single problems are randomly combined from existing datasets into compound inputs, and the model's outputs are systematically compared to ground-truth answers to establish a precise knowledge boundary. Confidence labels ("I am sure" or "I am unsure") are assigned based on alignment with the correct answers.
  2. Training: A two-step supervised fine-tuning process is employed where the model first learns to predict answers and subsequently calibrates its confidence levels against uncertain questions.
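The data-construction step above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the fixed match criterion (exact string equality), and the dictionary schema are all assumptions made for clarity.

```python
import random

def build_multi_problem_example(qa_pairs, model_answers, k=3):
    """Illustrative sketch: combine k single QA pairs into one
    multi-problem example and attach confidence labels by comparing
    the model's answers against the ground truth."""
    sampled = random.sample(range(len(qa_pairs)), k)
    questions = [qa_pairs[i][0] for i in sampled]
    gold = [qa_pairs[i][1] for i in sampled]
    predicted = [model_answers[i] for i in sampled]
    # Label "I am sure" when the model's answer matches the ground truth;
    # mismatches mark questions outside the model's knowledge boundary.
    labels = ["I am sure" if p == g else "I am unsure"
              for p, g in zip(predicted, gold)]
    return {"questions": questions, "answers": gold,
            "predictions": predicted, "confidence": labels}

# Toy usage: one wrong model answer yields one "I am unsure" label.
example = build_multi_problem_example(
    qa_pairs=[("Q1?", "A1"), ("Q2?", "A2"), ("Q3?", "A3"), ("Q4?", "A4")],
    model_answers=["A1", "wrong", "A3", "A4"],
    k=4,
)
```

In practice, matching a free-form model answer to a gold answer would use a softer criterion than exact equality, but the labeling logic is the same.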

Empirical Results

The empirical evaluation demonstrates that MAC-Tuning delivers substantial improvements over baseline models, with an average-precision increase of up to 25%. These gains are measured across diverse datasets spanning both Independent (e.g., CoQA, ParaRel) and Sequential (e.g., MTI-Bench, SQA) settings. The robustness of MAC-Tuning is further reflected in better-calibrated confidence scores, with lower Expected Calibration Error (ECE) and higher accuracy across different problem configurations.
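For reference, ECE measures the gap between a model's stated confidence and its empirical accuracy. Below is a minimal sketch using the standard equal-width binning variant; the paper's exact binning scheme and bin count are not specified here, so those choices are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - avg confidence|.
    `confidences` are in [0, 1]; `correct` holds 0/1 outcomes."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(acc - avg_conf)
    return ece

# Perfectly calibrated predictions give ECE = 0; overconfident ones do not.
print(expected_calibration_error([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0]))  # 0.0
```

A lower ECE means the confidence labels produced by MAC-Tuning ("I am sure" / "I am unsure") track actual correctness more faithfully.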

Moreover, MAC-Tuning shows promising adaptability, maintaining superior performance even in out-of-domain applications, which suggests its efficacy generalizes beyond specific dataset constraints. A focused examination of varying question counts in multi-problem scenarios reveals MAC-Tuning's capacity to manage multiple queries effectively, with particularly strong precision gains on relatively easier datasets.

Theoretical and Practical Implications

This research contributes significantly to the theoretical understanding of LLM knowledge awareness and confidence calibration in complex reasoning environments. Practically, the work advocates for a refined tuning process, facilitating improved LLM deployment in applications requiring simultaneous problem-solving—such as automated customer service, complex instructional systems, and integrated recommendation frameworks.

Future prospects lie in further refining LLM's contextual understanding and enhancing its self-awareness capabilities, providing stronger assurances of information reliability across increasingly complex reasoning tasks.

Conclusion

MAC-Tuning represents a key development in advancing the reliability and accuracy of LLMs within multi-problem contexts. By demonstrating substantial improvements in precision and confidence calibration, this paper suggests a viable pathway to enhancing the cognitive robustness of AI, ultimately translating into more trustworthy multi-faceted AI systems.

Authors (5)
  1. Junsheng Huang (6 papers)
  2. Zhitao He (9 papers)
  3. Sandeep Polisetty (7 papers)
  4. Qingyun Wang (41 papers)
  5. May Fung (8 papers)