DrugReasoner: Interpretable Drug Approval Prediction with a Reasoning-augmented Language Model (2508.18579v1)

Published 26 Aug 2025 in cs.LG, cs.AI, and q-bio.QM

Abstract: Drug discovery is a complex and resource-intensive process, making early prediction of approval outcomes critical for optimizing research investments. While classical machine learning and deep learning methods have shown promise in drug approval prediction, their limited interpretability constrains their impact. Here, we present DrugReasoner, a reasoning-based LLM built on the LLaMA architecture and fine-tuned with group relative policy optimization (GRPO) to predict the likelihood of small-molecule approval. DrugReasoner integrates molecular descriptors with comparative reasoning against structurally similar approved and unapproved compounds, generating predictions alongside step-by-step rationales and confidence scores. DrugReasoner achieved robust performance with an AUC of 0.732 and an F1 score of 0.729 on the validation set and 0.725 and 0.718 on the test set, respectively. These results outperformed conventional baselines, including logistic regression, support vector machine, and k-nearest neighbors, and had competitive performance relative to XGBoost. On an external independent dataset, DrugReasoner outperformed both the baseline models and the recently developed ChemAP model, achieving an AUC of 0.728 and an F1-score of 0.774, while maintaining high precision and balanced sensitivity, demonstrating robustness in real-world scenarios. These findings demonstrate that DrugReasoner not only delivers competitive predictive accuracy but also enhances transparency through its reasoning outputs, thereby addressing a key bottleneck in AI-assisted drug discovery. This study highlights the potential of reasoning-augmented LLMs as interpretable and effective tools for pharmaceutical decision-making.

Summary

  • The paper introduces DrugReasoner, a reasoning-augmented LLM employing GRPO to provide interpretable drug approval predictions.
  • The methodology integrates molecular descriptors with comparative reasoning against similar compounds to enhance prediction accuracy and transparency.
  • Empirical results demonstrate robust performance with AUC scores around 0.73 and high recall, outperforming baselines including ChemAP on external datasets.

DrugReasoner: Interpretable Drug Approval Prediction with a Reasoning-Augmented LLM

Introduction

DrugReasoner introduces a reasoning-augmented LLM framework for interpretable drug approval prediction, leveraging the LLaMA-3.1-8B-Instruct architecture fine-tuned with group relative policy optimization (GRPO). The model is designed to address the dual challenge of predictive accuracy and interpretability in early-stage drug discovery, where the ability to rationalize predictions is critical for downstream decision-making. DrugReasoner integrates molecular descriptors and comparative reasoning against structurally similar compounds, outputting both a binary approval prediction and a stepwise rationale with a confidence score. This approach is positioned as an advancement over prior models such as ChemAP, which, despite strong performance, lack transparent reasoning capabilities (Figure 1).

Figure 1: Schematic representation of DrugReasoner development and assessment, including dataset preparation, GRPO-based model training, and comparative evaluation against baselines.

Model Architecture and Training

Data Processing and Feature Engineering

DrugReasoner operates on a curated dataset of 2,255 approved and 2,255 unapproved small molecules from ChEMBL, with class balance achieved via random undersampling. Each molecule is represented by a set of physicochemical and structural descriptors computed using RDKit, explicitly excluding SMILES to mitigate data leakage from LLM pretraining. Embeddings are generated using MOLFORMER, a transformer-based model trained on SMILES, with mean pooling over the last hidden states to yield 768-dimensional representations.
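The feature pipeline can be illustrated with a short sketch. It assumes the publicly available ibm/MoLFormer-XL-both-10pct checkpoint on Hugging Face and a small, illustrative subset of RDKit descriptors; the paper's exact descriptor list and MOLFORMER weights are not reproduced here.

```python
import torch
from rdkit import Chem
from rdkit.Chem import Descriptors
from transformers import AutoModel, AutoTokenizer

def rdkit_descriptors(smiles: str) -> dict:
    """Compute a few physicochemical descriptors; only these numeric features
    (never the SMILES string itself) are placed in the LLM prompt."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "NumHDonors": Descriptors.NumHDonors(mol),
        "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    }

# Assumed checkpoint; the paper does not name a specific MOLFORMER release here.
MODEL_ID = "ibm/MoLFormer-XL-both-10pct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
encoder = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

def molformer_embedding(smiles: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single 768-dimensional vector."""
    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)
```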

For each query molecule, the five most similar approved and five most similar unapproved compounds are identified using XGBoost-based leaf embedding similarity, where Hamming distance in the decision space quantifies structural proximity. This comparative context is critical for the model's reasoning process.
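A minimal sketch of this retrieval step is shown below, assuming an XGBClassifier fitted on the MOLFORMER embeddings; the synthetic arrays stand in for the curated ChEMBL data, and the hyperparameters are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier

# Stand-ins for the 768-dimensional MOLFORMER embeddings and approval labels.
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(400, 768))
train_labels = rng.integers(0, 2, size=400)      # 1 = approved, 0 = unapproved

clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(train_embeddings, train_labels)

# Each molecule becomes a vector of leaf indices, one entry per tree.
train_leaves = clf.apply(train_embeddings)       # shape (n_samples, n_estimators)

def top_k_neighbours(query_embedding: np.ndarray, k: int = 5):
    """Return the k most similar approved and unapproved training molecules,
    ranked by Hamming distance between leaf-index vectors."""
    query_leaves = clf.apply(query_embedding.reshape(1, -1))[0]
    hamming = (train_leaves != query_leaves).mean(axis=1)
    approved = np.where(train_labels == 1)[0]
    unapproved = np.where(train_labels == 0)[0]
    return (approved[np.argsort(hamming[approved])[:k]],
            unapproved[np.argsort(hamming[unapproved])[:k]])

nearest_approved, nearest_unapproved = top_k_neighbours(train_embeddings[0])
```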

Prompt Engineering and Output Schema

Inputs to DrugReasoner are structured prompts containing the RDKit descriptors of the query molecule and its ten nearest neighbors (five approved, five unapproved). The system prompt enforces a domain-specific instruction set to simulate expert chemical reasoning. The model outputs are formatted in XML with three fields: <reasoning> (rationale), <label> (approved/unapproved), and <score> (confidence, 0.0–1.0), ensuring both interpretability and compatibility with reward modeling.
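A hypothetical completion in this schema, and a minimal parser for it, might look as follows; the sample reasoning text is fabricated for illustration, and the tag handling is a sketch rather than the paper's exact post-processing.

```python
import re

sample_completion = """<reasoning>The query shares a high LogP and low TPSA with three of the
five approved neighbours, while its molecular weight is closer to the
unapproved set.</reasoning>
<label>approved</label>
<score>0.82</score>"""

def parse_prediction(text: str) -> dict:
    """Extract the three expected fields; missing fields are returned as None."""
    fields = {}
    for tag in ("reasoning", "label", "score"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    if fields["score"] is not None:
        fields["score"] = float(fields["score"])
    return fields

parsed = parse_prediction(sample_completion)
print(parsed["label"], parsed["score"])   # approved 0.82
```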

Fine-Tuning with Group Relative Policy Optimization

DrugReasoner is fine-tuned using GRPO, a reinforcement learning algorithm that extends PPO by leveraging group-based advantage estimation. For each input, the model generates $K_j$ candidate responses, each scored by a multi-objective reward function. The group mean reward $\bar{R}_j$ serves as a baseline, and the advantage for each response is computed as $A_{jk} = R_{jk} - \bar{R}_j$. The surrogate loss incorporates a clipped objective and a KL-divergence penalty to stabilize updates.
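The group-relative baseline reduces to a few lines of array arithmetic; the sketch below shows only the advantage computation with illustrative reward values (some GRPO implementations additionally normalize by the group's reward standard deviation).

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """A_{jk} = R_{jk} - mean_k(R_{jk}): each sampled completion is scored
    against the mean reward of its own group, which acts as the baseline."""
    return rewards - rewards.mean()

# Example: four completions sampled for one prompt.
rewards = np.array([2.0, 0.5, 1.5, 0.0])
print(group_relative_advantages(rewards))   # [ 1.  -0.5  0.5 -1. ]
```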

The reward function is multi-faceted, combining correctness, XML format compliance, soft format compliance, interpretability, and confidence-alignment. This design ensures that the model not only predicts accurately but also adheres to the required output structure and provides semantically valid, confidence-calibrated rationales.
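A hedged sketch of such a reward is given below; the component weights and scoring rules are illustrative and do not reproduce the paper's exact formulation.

```python
import re

def multi_objective_reward(completion: str, true_label: str) -> float:
    """Combine strict XML-format compliance, correctness, and confidence
    alignment into a single scalar reward (illustrative weights)."""
    total = 0.0

    # Format compliance: each expected tag appears exactly once.
    strict = all(
        len(re.findall(rf"<{t}>.*?</{t}>", completion, re.DOTALL)) == 1
        for t in ("reasoning", "label", "score")
    )
    total += 0.5 if strict else 0.0

    label = re.search(r"<label>(.*?)</label>", completion, re.DOTALL)
    score = re.search(r"<score>(.*?)</score>", completion, re.DOTALL)
    pred = label.group(1).strip().lower() if label else None
    try:
        confidence = float(score.group(1)) if score else 0.5
    except ValueError:
        confidence = 0.5

    # Correctness reward.
    correct = pred == true_label
    total += 2.0 if correct else 0.0

    # Confidence alignment: reward calibrated confidence on correct answers,
    # penalize overconfident mistakes.
    total += confidence if correct else -confidence
    return total
```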

Training is performed for 14,500 steps on a single NVIDIA V100 GPU, with 4-bit quantization and LoRA applied to key attention projection layers to reduce memory and computational overhead. The optimal checkpoint is selected based on validation AUC and output structure adherence (Figure 2).

Figure 2: DrugReasoner assessment, including reward trajectory during GRPO training, evaluation metrics, and comparative performance against baselines and ChemAP on validation, test, and external datasets.
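The quantization-plus-LoRA setup described above can be sketched as follows, assuming the meta-llama/Llama-3.1-8B-Instruct checkpoint with the transformers, bitsandbytes, and peft libraries; the LoRA rank, target modules, and other hyperparameters are illustrative, and the GRPO training loop itself (e.g., via TRL) is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization; float16 compute matches V100-class hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projection layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```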

Empirical Results

Validation and Test Performance

On the validation set, DrugReasoner achieves an AUC of 0.732 and F1-score of 0.729, outperforming logistic regression, SVM, and KNN, and matching or exceeding XGBoost. On the test set, the model maintains robust performance (AUC = 0.725, F1 = 0.718), with the highest recall (0.702) among all models. The model's confidence scores stabilize around 0.87, indicating reliable calibration.

External Dataset Generalization

On an independent external dataset (17 approved, 8 unapproved drugs), DrugReasoner achieves an AUC of 0.728 and F1-score of 0.774, outperforming all baselines and ChemAP (AUC = 0.64, F1 = 0.529). Notably, DrugReasoner attains the highest precision (0.857) and balanced accuracy (0.720), with strong recall (0.706) and specificity (0.750). Baseline models exhibit poor sensitivity (≤ 0.235) despite perfect specificity, underscoring DrugReasoner's superior generalizability and real-world applicability.

Interpretability and Output Structure

DrugReasoner consistently produces structured rationales that articulate the comparative reasoning process, referencing molecular features and similarities to known compounds. The model's output format is strictly enforced via reward shaping, resulting in 100% adherence to the expected XML schema at the selected checkpoint.

Implementation Considerations

Computational Requirements

Training DrugReasoner requires a single high-memory GPU (32 GB VRAM) and approximately 800 hours for full convergence. The use of 4-bit quantization and LoRA significantly reduces memory footprint, enabling efficient fine-tuning of the 8B-parameter LLaMA model. Inference can be performed on consumer-grade GPUs with reduced batch sizes.

Data Leakage Mitigation

To prevent memorization or data leakage from LLM pretraining, SMILES strings are excluded from model inputs. Only computed molecular descriptors are used, which also enhances interpretability. This design choice may limit the model's ability to capture fine-grained structural nuances but is justified by the need for robust generalization.

Reward Function Design

The multi-objective reward function is critical for aligning the model's outputs with both predictive accuracy and interpretability. The confidence-alignment component penalizes overconfident incorrect predictions and rewards calibrated uncertainty, improving trustworthiness in deployment.

Scaling and Future Extensions

The current implementation is constrained by context window (4,096 tokens) and model size (8B parameters). Scaling to larger models and longer contexts could enable the inclusion of richer structural information (e.g., SMILES for neighbors) and more complex reasoning chains. Systematic hyperparameter optimization and integration of additional domain knowledge (e.g., bioactivity, toxicity) are promising directions for further performance gains.

Theoretical and Practical Implications

DrugReasoner demonstrates that reasoning-augmented LLMs, when fine-tuned with structured comparative prompts and multi-objective RL, can achieve both high predictive accuracy and interpretable outputs in drug approval prediction. The model's ability to generalize to external datasets and provide transparent rationales addresses a key bottleneck in AI-assisted drug discovery, where trust and explainability are paramount.

The approach exemplifies a shift from black-box predictive models to transparent, reasoning-driven systems, aligning with regulatory and practical requirements in pharmaceutical R&D. The integration of GRPO enables efficient exploration of the output space, reinforcing desirable behaviors beyond simple accuracy.

Limitations and Future Directions

Key limitations include the exclusion of SMILES, which may restrict structural expressivity, and the computational cost of RL-based fine-tuning. The dataset size, while balanced, remains modest relative to the chemical space. Future work should explore controlled integration of structural data, scaling to larger models, and extension to multi-modal or multi-task settings (e.g., toxicity, efficacy, ADMET prediction). Incorporating active learning and uncertainty quantification could further enhance the model's utility in real-world pipelines.

Conclusion

DrugReasoner establishes a robust framework for interpretable drug approval prediction, combining LLM-based chain-of-thought reasoning, comparative molecular analysis, and reinforcement learning via GRPO. The model achieves strong empirical performance, outperforms established baselines and ChemAP on external data, and delivers structured, confidence-calibrated rationales. This work underscores the potential of reasoning-augmented LLMs as transparent, trustworthy tools for early-stage pharmaceutical decision-making and sets a foundation for future advances in interpretable AI for drug discovery.
