Hallucinations Can Improve Large Language Models in Drug Discovery (2501.13824v1)

Published 23 Jan 2025 in cs.CL and cs.AI

Abstract: Concerns about hallucinations in LLMs have been raised by researchers, yet their potential in areas where creativity is vital, such as drug discovery, merits exploration. In this paper, we come up with the hypothesis that hallucinations can improve LLMs in drug discovery. To verify this hypothesis, we use LLMs to describe the SMILES string of molecules in natural language and then incorporate these descriptions as part of the prompt to address specific tasks in drug discovery. Evaluated on seven LLMs and five classification tasks, our findings confirm the hypothesis: LLMs can achieve better performance with text containing hallucinations. Notably, Llama-3.1-8B achieves an 18.35% gain in ROC-AUC compared to the baseline without hallucination. Furthermore, hallucinations generated by GPT-4o provide the most consistent improvements across models. Additionally, we conduct empirical analyses and a case study to investigate key factors affecting performance and the underlying reasons. Our research sheds light on the potential use of hallucinations for LLMs and offers new perspectives for future research leveraging LLMs in drug discovery.

Summary

  • The paper shows that integrating LLM-generated hallucinations increases ROC-AUC scores over SMILES and MolT5 baselines.
  • Methodologically, a two-stage process first generates a natural language description of each molecule and then incorporates that description into the prompt for property classification tasks.
  • The findings show that the model used to generate the hallucinations matters, and that the language of the generated text, notably Chinese, can significantly boost performance.

This paper explores the counter-intuitive hypothesis that hallucinations in LLMs can actually improve their performance on drug discovery tasks (2501.13824). Although hallucinations are typically viewed as a defect, the authors propose that their creative aspect might be beneficial in a field that demands innovation, such as drug discovery.

Methodology:

The core methodology involves a two-stage process:

  1. Hallucination Generation: An LLM is prompted to generate a natural language description of a molecule based solely on its SMILES string. The prompt template used is:
    System: You are an expert in drug discovery.
    User: [SMILES]
    Describe the molecule in natural language:
    The generated descriptions were found to have low factual consistency (high hallucination) when compared to reference descriptions generated by MolT5, a domain-specific model (2501.13824).
  2. Label Prediction: The hallucinated description is then incorporated into a new prompt for a downstream drug discovery classification task (predicting properties like HIV inhibition, blood-brain barrier penetration, toxicity, etc.). The prompt template for this stage is:
    System: You are an expert in drug discovery.
    User: [SMILES] [Description] [Instruct]
    Here, [SMILES] is the molecule's SMILES string, [Description] is the potentially hallucinated text generated in step 1, and [Instruct] is the task-specific question (e.g., "Does the molecule have the ability to inhibit HIV replication? Only answer Yes or No:"). The LLM predicts the label ("Yes" or "No") based on the highest probability next token.
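
As a concrete illustration, the following Python sketch implements this two-stage pipeline with Hugging Face transformers. It is a minimal reconstruction from the paper's prompt templates, not the authors' code; the checkpoint name, decoding settings, and helper names (generate_description, yes_probability) are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # one of the evaluated models (assumed checkpoint name)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

SYSTEM_PROMPT = "You are an expert in drug discovery."

def generate_description(smiles: str, temperature: float = 0.7) -> str:
    # Stage 1: describe the molecule from its SMILES string alone.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{smiles}\nDescribe the molecule in natural language:"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=temperature)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)

def yes_probability(smiles: str, description: str, instruct: str) -> float:
    # Stage 2: score the label from the next-token distribution over "Yes"/"No".
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{smiles} {description} {instruct}"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # P("Yes") restricted to {Yes, No}

Restricting the softmax to the "Yes"/"No" token logits yields a continuous score rather than a hard label, which is what the ROC-AUC evaluation described next consumes.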

Experimental Setup:

  • Models Evaluated: Seven instruction-tuned LLMs were tested: Llama-3-8B, Llama-3.1-8B, Ministral-8B, Falcon3-Mamba-7B, ChemLLM-7B (domain-specific), GPT-3.5, and GPT-4o.
  • Datasets: Five binary classification datasets from the MoleculeNet benchmark were used: HIV, BBBP, ClinTox, SIDER (subset), and Tox21 (subset).
  • Baselines: Performance was compared against two baselines:
    • SMILES: The prompt contained only the SMILES string and the instruction (empty [Description]).
    • MolT5: The prompt used reference descriptions generated by MolT5 for the [Description].
  • Evaluation Metric: ROC-AUC was used to measure classification performance.
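
Because ROC-AUC requires a continuous score, the "Yes"-token probability from the scorer sketched above can serve as that score. The snippet below is a minimal evaluation sketch under that assumption; dataset loading and MoleculeNet split handling are omitted, and the evaluate helper is illustrative rather than the authors' code.

from sklearn.metrics import roc_auc_score

HIV_INSTRUCT = ("Does the molecule have the ability to inhibit HIV replication? "
                "Only answer Yes or No:")

def evaluate(dataset, instruct: str = HIV_INSTRUCT, use_description: bool = True) -> float:
    # dataset: iterable of (smiles, label) pairs with label in {0, 1}.
    labels, scores = [], []
    for smiles, label in dataset:
        description = generate_description(smiles) if use_description else ""
        scores.append(yes_probability(smiles, description, instruct))
        labels.append(label)
    return roc_auc_score(labels, scores)

# Comparing against the SMILES-only baseline (empty [Description]):
# auc_with_hallucination = evaluate(test_set)
# auc_smiles_baseline = evaluate(test_set, use_description=False)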

Key Findings:

  1. Hallucinations Improve Performance: The central hypothesis was confirmed. Most LLMs showed improved average ROC-AUC scores when using hallucinated descriptions compared to both the SMILES baseline and the MolT5 reference baseline.
    • Llama-3.1-8B achieved the most significant gain, with an 18.35% increase in average ROC-AUC over the SMILES baseline and 13.79% over the MolT5 baseline when using hallucinations generated by GPT-3.5.
    • Falcon3-Mamba-7B also demonstrated notable improvements (around 9.5-9.7% over baselines using GPT-4o hallucinations).
    • Even the domain-specific ChemLLM-7B benefited from hallucinated descriptions.
  2. Source of Hallucination Matters: Hallucinations generated by GPT-4o provided the most consistent and significant average performance boost across all tested LLMs. GPT-3.5 hallucinations also generally helped. Hallucinations from some models (e.g., Falcon3-Mamba, Llama-3.1) sometimes led to performance decreases compared to the SMILES baseline when used by other models.
  3. Impact of Model Size: Larger models tend to benefit more from hallucinations compared to the MolT5 baseline, although the effect seemed to plateau around the 8B parameter size for the Llama models tested. All sizes outperformed the SMILES baseline.
  4. Impact of Generation Temperature: Higher temperatures during hallucination generation led to lower factual consistency (more hallucination). However, the downstream task performance was relatively stable across different temperatures (0.1 to 0.9), although slightly better at lower temperatures. Crucially, performance at all tested temperatures was better than the baselines for Llama-3.1-8B.
  5. Impact of Hallucination Language: Generating the hallucinated description in different languages significantly affected performance. For Llama-3.1-8B, Chinese hallucinations yielded the highest average ROC-AUC, surprisingly outperforming English and the other languages covered in the model's pre-training (French, German, Spanish); Japanese performed worst. The authors noted that the Chinese output often contained Pinyin and English, which may have contributed to its effectiveness.
  6. Case Study (Why it Works): An attention analysis on Llama-3.1-8B processing a hallucinated description showed the model attended not only to incorrect factual claims but also to subjective, potentially useful phrases like "potential applications in drug discovery." The authors hypothesize that this "unrelated yet faithful information" might boost the model's confidence or provide useful contextual cues, leading to better predictions.
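
An attention inspection of this kind can be approximated with a small utility like the sketch below: it averages attention weights over layers and heads and lists the prompt tokens that receive the most attention at the final (answer) position. This is illustrative only; the averaging choice and the function name attention_to_prompt are assumptions, not the authors' exact procedure.

import torch

def attention_to_prompt(model, tokenizer, prompt_ids: torch.Tensor, top_k: int = 10):
    # Which prompt tokens receive the most attention at the final (answer) position?
    # May require loading the model with attn_implementation="eager" so that
    # attention weights are returned.
    with torch.no_grad():
        outputs = model(prompt_ids, output_attentions=True)
    # outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
    attn = torch.stack([layer[0] for layer in outputs.attentions])  # (layers, heads, seq, seq)
    attn = attn.mean(dim=(0, 1))[-1]                                # average, take last position -> (seq,)
    top = torch.topk(attn, k=min(top_k, attn.shape[-1]))
    tokens = tokenizer.convert_ids_to_tokens(prompt_ids[0].tolist())
    return [(tokens[i], attn[i].item()) for i in top.indices.tolist()]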

Practical Implications:

  • This research suggests a simple, prompt-based method to potentially enhance LLM performance in drug discovery without needing specific fine-tuning for hallucination control or complex architectural changes.
  • Instead of solely focusing on mitigating hallucinations, developers could explore leveraging them, particularly those generated by powerful models like GPT-4o, as a form of "creative" input augmentation.
  • The two-step prompting strategy (generate description -> use description in task prompt) is straightforward to implement.
  • Consider the source LLM for generating descriptions; models like GPT-4o seem more effective at producing beneficial "hallucinations" for this purpose.
  • The findings regarding language suggest potential for cross-lingual prompting strategies, even leveraging languages not explicitly in the base model's pre-training data.

Limitations & Considerations:

  • The mechanism by which hallucinations help is still only hypothesized (based on the case study) and needs further investigation.
  • The generation of descriptions adds an extra computational step.
  • Performance gains are variable and depend on the base LLM, the LLM generating the hallucination, and the specific dataset/task.
  • The term "hallucination" here broadly means low factual consistency with a reference; the generated text might contain useful abstract connections or analogies rather than just errors.