
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data (2508.01450v1)

Published 2 Aug 2025 in cs.CL

Abstract: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting LLMs to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.

Summary

  • The paper introduces the DIQ framework that selects high-impact training samples, enabling efficient fine-tuning with only 1% of the data.
  • The paper demonstrates that combining difficulty scoring with gradient influence leads to performance on par with full-dataset training.
  • The paper highlights DIQ’s clinical value by improving reasoning metrics and reducing computational overhead in medical applications.

Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

The paper "Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data" investigates techniques to improve the efficiency of adapting LLMs for medical reasoning tasks. It introduces a novel data selection framework called Difficulty-Influence Quadrant (DIQ) to identify minimal, high-impact subsets of training data, which enhance learning without requiring extensive computational resources.

Supervised Fine-Tuning Challenges

Supervised Fine-Tuning (SFT) is critical for adapting LLMs to specialized domains. However, traditional SFT relies heavily on large, unfiltered datasets, which often contain redundant or low-quality samples. This results in unnecessary computational load and suboptimal model performance. Previous data selection strategies have primarily focused on sample difficulty, but these approaches miss the optimization utility of each sample as indicated by its gradient influence.

Difficulty-Influence Quadrant (DIQ) Framework

The DIQ framework selects training samples by jointly considering two critical dimensions:

  1. Difficulty: Reflects the complexity of reasoning required for each sample, determined using a BiomedBERT-based classifier fine-tuned to score medical questions on a 5-point Likert scale.
  2. Influence: Measures the optimization impact of each sample, approximated through gradient dot products between training and validation samples over epochs.

This combination allows DIQ to prioritize samples in the "high-difficulty–high-influence" quadrant. The balance ensures efficient learning by emphasizing samples that demand complex clinical reasoning while also producing substantial parameter shifts (Figure 1); a hedged code sketch of both scores and the quadrant selection follows the figure caption.

Figure 1: Overview of the DIQ framework. The method operates by first mapping each sample into a two-dimensional space defined by difficulty and influence, creating four distinct data quadrants for strategic selection.
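
To make the selection procedure concrete, below is a minimal sketch of how the two scores and the quadrant filter could be computed. It assumes a fine-tuned 5-way difficulty classifier (a placeholder path stands in for the paper's BiomedBERT-based scorer), approximates influence as a single-snapshot dot product between a training sample's gradient and an averaged validation gradient over the trainable (e.g., LoRA) parameters, and uses median thresholds for the quadrant split. The helper `loss_fn`, the placeholder paths, and the ranking rule are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of DIQ-style scoring and quadrant selection (not the released code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def difficulty_scores(questions, scorer_path="path/to/difficulty-scorer"):
    """Score question difficulty on a 1-5 scale.

    `scorer_path` is a placeholder for a BiomedBERT-style checkpoint
    fine-tuned as a 5-way (Likert) difficulty classifier.
    """
    tok = AutoTokenizer.from_pretrained(scorer_path)
    clf = AutoModelForSequenceClassification.from_pretrained(scorer_path, num_labels=5).eval()
    enc = tok(questions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = clf(**enc).logits.softmax(-1)              # (N, 5)
    return (probs * torch.arange(1, 6).float()).sum(-1)    # expected Likert score per sample


def flat_grad(loss, params):
    """Concatenate per-parameter gradients of a scalar loss into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_scores(model, loss_fn, train_samples, val_samples):
    """Approximate influence as grad(train sample) . mean grad(validation).

    The paper accumulates this signal over training epochs; this sketch shows a
    single snapshot. `loss_fn(model, sample)` is an assumed helper returning a
    scalar loss; only trainable (e.g., LoRA) parameters enter the gradients.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_val = torch.stack([flat_grad(loss_fn(model, s), params) for s in val_samples]).mean(0)
    return torch.tensor(
        [(flat_grad(loss_fn(model, s), params) @ g_val).item() for s in train_samples]
    )


def select_high_difficulty_high_influence(difficulty, influence, budget):
    """Keep up to `budget` samples from the high-difficulty, high-influence quadrant."""
    d_thr, i_thr = difficulty.median(), influence.median()   # median split (one simple choice)
    quadrant = ((difficulty > d_thr) & (influence > i_thr)).nonzero(as_tuple=True)[0]
    # Rank inside the quadrant by the sum of standardized scores.
    score = (difficulty[quadrant] - d_thr) / difficulty.std() \
          + (influence[quadrant] - i_thr) / influence.std()
    return quadrant[score.argsort(descending=True)][:budget]
```

In this sketch, the paper's 1% budget would correspond to `budget = int(0.01 * len(train_samples))` passed to the selection function.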

Implementation and Performance

Dataset and Model Details

The paper utilizes several datasets, such as Huatuo and FineMed, comprising up to 32k medical reasoning samples. Models like Llama3.1-8B-Instruct and Qwen3-8B serve as the primary LLMs for experimentation, fine-tuned via LoRA with specific hyperparameters to adapt them efficiently to medical tasks.
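
The paper's exact LoRA hyperparameters are not reproduced here; the configuration below is a minimal sketch using the Hugging Face transformers and peft libraries, with rank, alpha, dropout, and target modules chosen as illustrative placeholders rather than the reported settings.

```python
# Hedged sketch: LoRA setup for a Llama3.1-8B-Instruct-style model.
# All hyperparameter values are illustrative placeholders, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                          # adapter rank (placeholder)
    lora_alpha=32,                 # scaling factor (placeholder)
    lora_dropout=0.05,             # regularization (placeholder)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```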

Results

DIQ-enabled strategies show that fine-tuning on just 1% of selected data achieves equivalent performance to traditional full-dataset training, and using 10% consistently surpasses the baseline:

  • Llama3.1-8B-Instruct trained on 1% DIQ data performs comparably to full-dataset training, highlighting the effectiveness of DIQ in capturing the essential training signal (Figure 2).

    Figure 2: Downstream task performance comparison of models trained on MedReason-QA at different data keeping ratios.

Efficiency Analysis

A core advantage of DIQ is computational efficiency. The framework's overhead in computing difficulty and influence scores is minimal compared to full model fine-tuning, and this upfront cost is further amortized by reusing the computed scores across multiple experiments, making DIQ well suited to frequent fine-tuning cycles (Figure 3).

Figure 3: FLOPs consumption (in units of 10^14) of computing DIQ scores compared with fine-tuning the Llama3.1 and Qwen3 series models. The y-axis uses a log scale for readability.
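
One simple way to realize this score reuse is to persist the per-sample difficulty and influence scores once and reload them for later selection runs; the file path and key names below are assumptions for illustration, not part of the released code.

```python
# Hedged sketch: persist DIQ scores so later selection runs skip recomputation.
import json
import os
import torch


def cache_scores(difficulty, influence, path="diq_scores/medreason_llama3.1-8b.json"):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"difficulty": difficulty.tolist(), "influence": influence.tolist()}, f)


def load_scores(path="diq_scores/medreason_llama3.1-8b.json"):
    with open(path) as f:
        data = json.load(f)
    return torch.tensor(data["difficulty"]), torch.tensor(data["influence"])
```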

Clinical Value Assessment

Expert evaluations highlight improvements in clinical reasoning quality, with DIQ-selected subsets enhancing standard clinical reasoning metrics such as Differential Diagnosis, Safety Checks, and Evidence Citation. Models trained on DIQ data produce reasoning processes that are more closely aligned with expert judgments, indicative of their higher clinical value.

Conclusion

The Difficulty-Influence Quadrant framework stands out as a methodologically sound approach to optimizing medical LLM fine-tuning with minimal data. By strategically selecting high-impact training samples, DIQ reduces resource needs without sacrificing model performance and offers a robust, scalable solution for specialized domain adaptation in LLMs. Future work may explore extensions to larger model architectures and additional domains to further validate and expand the utility of DIQ.
