Comparative correctness of expert versus LLM confounder designations when they disagree

Establish whether expert-selected confounders for the Coronary Drug Project are more correct than the confounder designations produced by large language models when the two sets differ.

Background

The paper finds moderate agreement between LLM outputs and expert confounder lists but also substantial inconsistency across models, prompts, and iterations. In cases of disagreement, the authors emphasize that their evaluation cannot confirm that experts are universally more correct than LLMs, leaving open the question of relative correctness.
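
One way to make the "moderate agreement, substantial inconsistency" framing concrete is to treat each confounder designation as a set of variable names and score overlap. The sketch below uses Jaccard similarity; the metric choice and all variable names are illustrative assumptions, not drawn from the paper.

    from itertools import combinations

    def jaccard(a, b):
        """Jaccard similarity between two sets of variable names."""
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    # Hypothetical expert confounder list for a trial like the Coronary Drug Project.
    expert = {"age", "baseline_cholesterol", "smoking", "diabetes"}

    # Hypothetical outputs from repeated LLM queries (different models, prompts, iterations).
    llm_runs = [
        {"age", "smoking", "diabetes", "bmi"},
        {"age", "baseline_cholesterol", "exercise"},
        {"age", "smoking", "diabetes", "baseline_cholesterol"},
    ]

    # Agreement of each LLM run with the expert list.
    for i, run in enumerate(llm_runs, start=1):
        print(f"run {i}: Jaccard vs expert = {jaccard(expert, run):.2f}")

    # Consistency of the LLM runs with one another: pairwise Jaccard among the runs.
    pairwise = [jaccard(a, b) for a, b in combinations(llm_runs, 2)]
    print(f"mean pairwise LLM consistency = {sum(pairwise) / len(pairwise):.2f}")

Overlap metrics of this kind only show that the lists differ; they cannot say which side is correct when they do, which is precisely the gap this question highlights.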

Resolving this question would require independent validation of causal relationships beyond expert consensus to determine whether discrepancies arise from LLM errors, expert limitations, or methodological differences.

References

"it is possible that the LLMs are in fact not recalling expert opinion from its text data and instead are doing a better job of applying causal reasoning to identify confounders than the experts; we cannot prove that in the case of a deviation the experts are more correct than the LLMs, only that they are not the same."

Huntington-Klein et al., "Do LLMs Act as Repositories of Causal Knowledge?" (arXiv:2412.10635, 14 Dec 2024), Conclusion.