This paper evaluates the performance of 21 contemporary LLMs, including open-source and closed-source options, on the 2024 Portuguese National Exam for medical specialty access (PNA). The PNA serves as a standardized benchmark to assess the models' medical knowledge and reasoning capabilities in Portuguese, without specific fine-tuning on the exam itself. The goal is to understand the potential of current LLMs as assistive tools for medical diagnosis and treatment planning, considering both accuracy and cost-effectiveness.
Methodology:
The researchers tested the LLMs using the 150 multiple-choice questions from the PNA 2024 exam. A strict pass@1 methodology was employed, meaning only the first answer generated by the LLM for each question was considered. Questions were presented in batches of 10, using basic prompting without any specialized instructions, few-shot examples, or explicit reasoning prompts like Chain-of-Thought (CoT).
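As a rough illustration of this protocol, the sketch below shows how such a pass@1 run over batched multiple-choice questions could be scored. The prompt wording, the `ask_model` client, and the answer-parsing rules are assumptions made for illustration; the paper only specifies basic prompting, batches of 10, and scoring of the first answer per question.

```python
import re

BATCH_SIZE = 10  # questions were presented in batches of 10

def build_prompt(batch):
    """Plain prompt for a batch of multiple-choice questions: no few-shot
    examples and no explicit reasoning instructions (e.g. no CoT)."""
    lines = ["Answer each question with the letter of the correct option."]
    for i, q in enumerate(batch, start=1):
        options = "\n".join(f"{letter}) {text}" for letter, text in q["options"].items())
        lines.append(f"Q{i}. {q['question']}\n{options}")
    return "\n\n".join(lines)

def parse_first_answers(reply, n_questions):
    """Keep only the first letter given for each question (pass@1)."""
    answers = {}
    for i in range(1, n_questions + 1):
        match = re.search(rf"{i}[.):\s]+([A-D])\b", reply)
        answers[i] = match.group(1) if match else None
    return answers

def evaluate(questions, ask_model):
    """ask_model(prompt) -> str stands in for any LLM API call."""
    correct = 0
    for start in range(0, len(questions), BATCH_SIZE):
        batch = questions[start:start + BATCH_SIZE]
        reply = ask_model(build_prompt(batch))
        answers = parse_first_answers(reply, len(batch))
        correct += sum(
            1 for i, q in enumerate(batch, start=1) if answers[i] == q["answer"]
        )
    return correct  # raw accuracy out of 150 for the PNA 2024 exam
```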
To evaluate the models holistically, a custom scoring metric was developed:
$\text{Score} = 100 \times \left(\frac{\text{Correct}}{N}\right)^3 \times \frac{1}{\sqrt{1 + \log_{10}(P + 1)} \times C_{\text{risk}}}$
This formula heavily weights accuracy by cubing the fraction of correct answers ($\text{Correct}/N$), incorporates cost-efficiency by penalizing a higher price per million tokens ($P$) through a combined logarithm and square root, and includes a small penalty ($C_{\text{risk}}$) for models with a higher theoretical risk of data contamination, based on their knowledge cutoff date relative to the exam's publication date.
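A minimal Python sketch of this metric follows. The function name, the example inputs, and the use of $C_{\text{risk}} = 1.0$ for a low-risk model are illustrative assumptions; the summary does not reproduce the paper's exact penalty values.

```python
import math

def pna_score(correct, n_questions, price_per_million_tokens, c_risk):
    """Composite score: accuracy cubed, damped by a log/sqrt cost term and
    divided by a contamination-risk penalty (assumed >= 1.0)."""
    accuracy_term = (correct / n_questions) ** 3
    cost_term = math.sqrt(1 + math.log10(price_per_million_tokens + 1))
    return 100 * accuracy_term / (cost_term * c_risk)

# Hypothetical example: 136/150 correct, $15 per million tokens, low-risk penalty of 1.0
print(round(pna_score(136, 150, 15.0, 1.0), 1))  # ~50.2
```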
Key Findings:
- High Performance: Several LLMs demonstrated strong performance, exceeding the median human score (101/150) on the PNA 2024. OpenAI's O1 achieved the highest raw accuracy (136/150), slightly surpassing the top human student score (135/150). Other top performers in accuracy included Google's Gemini 2.5 Pro (Exp) (135/150) and OpenAI's GPT-4.5 Preview (133/150).
- Cost-Effectiveness: When cost is factored in, Google's experimental models (Gemini 2.5 Pro, Gemini 2.0 Flash Thinking) and open-source models such as Meta's LLaMA 4 Maverick and DeepSeek R1 achieved high overall scores by combining strong accuracy with low or zero cost. High-cost models like O1 and GPT-4.5 Preview, despite top accuracy, received lower overall scores.
- Reasoning Methods: Models utilizing explicit reasoning or "thinking" modes (often related to CoT) generally performed well, suggesting that structured reasoning aids performance on complex medical questions. The paper also mentions Chain of Draft (CoD) as a potential future direction for efficient reasoning.
- Provider Landscape: Distinct performance and cost profiles were observed across providers (Google, OpenAI, Anthropic, Meta, DeepSeek, etc.). Open-source models are becoming increasingly competitive.
- Data Contamination Risk: The paper acknowledges that models with later knowledge cutoff dates may have been inadvertently trained on the exam data, and assigns each model a risk level (Low, Medium, or High) based on the temporal overlap between its cutoff and the exam's publication date (see the sketch after this list).
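The sketch below shows one way such a temporal-overlap rule could be encoded. The reference publication date and the Medium/High thresholds are assumptions for illustration, not the paper's exact criteria.

```python
from datetime import date

# Assumed reference date for the PNA 2024 exam; the paper's exact date and
# thresholds are not reproduced in this summary.
EXAM_PUBLICATION = date(2024, 11, 1)

def contamination_risk(knowledge_cutoff: date) -> str:
    """Map a model's knowledge cutoff to a Low/Medium/High contamination risk
    based on temporal overlap with the exam's publication (illustrative only)."""
    if knowledge_cutoff < EXAM_PUBLICATION:
        return "Low"     # cutoff predates the exam, so it could not be in training data
    if (knowledge_cutoff - EXAM_PUBLICATION).days <= 90:
        return "Medium"  # cutoff shortly after publication; limited exposure possible
    return "High"        # cutoff well after publication; exam may appear in training data
```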
Discussion and Implications:
- AI as Assistive Tools: The results support the potential for LLMs as valuable assistants in clinical settings (e.g., for differential diagnosis suggestions, information retrieval) but emphasize they should augment, not replace, human clinicians due to limitations like potential hallucinations, "jagged intelligence" (inconsistent performance), bias, and the inability to replicate nuanced clinical skills.
- Challenges: Significant challenges remain for clinical deployment, including ensuring accuracy and reliability, data privacy (HIPAA/GDPR), seamless workflow integration, regulatory compliance (e.g., EU AI Act, which may classify diagnostic aids as high-risk), mitigating bias, ensuring transparency/explainability, and establishing ethical guidelines.
- Model Selection: Choosing an LLM for clinical use involves trade-offs between accuracy, cost, data privacy, API reliability, and specific features. The paper provides a breakdown of top models based on different priorities (peak accuracy, best value, open-source).
- Regulation: The paper maps potential LLM use cases (e.g., patient chatbots, clinical decision support) to risk categories under the EU AI Act, highlighting the stringent requirements for high-risk applications like diagnostic aids.
Future Directions:
The authors suggest future work should include evaluation on realistic clinical vignettes, specialty-specific testing, real-world integration studies, direct comparison with clinicians, robust safety/reliability validation, exploring novel architectures like diffusion models for text, and developing ethical/regulatory frameworks. They propose a "PNA 2025 LLM-Student Showdown" – a live, blinded benchmark comparing LLMs and humans on the next exam to eliminate contamination concerns.
Conclusion:
LLMs show remarkable potential for supporting medical tasks, with some models achieving high accuracy on a challenging medical exam, often at low cost. However, their integration into clinical practice must be approached cautiously, addressing significant challenges related to reliability, safety, ethics, and regulation, positioning them as complementary tools under strict human oversight.