CHARM: Molecule Caption Accuracy Metric
- The paper introduces CHARM and its recall-oriented complement RCHARM to measure hallucinated and omitted molecular entities in model-generated captions.
- It details a computational workflow using high-confidence entity extraction via BERN2 to compare predicted molecular entities with ground truth.
- The approach integrates traditional metrics with LLM-based judgment to provide a balanced, chemistry-aware evaluation of molecular language models.
CHARM (Caption Hallucination Assessment with Relevance to Molecule) is a molecule-centric, automatic metric designed to quantify factual hallucinations and omissions in model-generated molecular captions. It addresses the deficiencies of traditional token-based metrics by evaluating entity-level correspondence between generated text and molecular ground truth, providing an assessment that is directly tied to the chemical fidelity of captions. CHARM and its recall-oriented complement RCHARM have become central to the evaluation of large molecular language models (LMLMs), particularly for tasks where factual accuracy in molecular entity attribution is crucial (Park et al., 18 Jan 2026).
1. Formal Definitions and Metric Specification
CHARM (Caption Hallucination Assessment with Relevance to Molecule) is a precision-oriented metric that measures the fraction of molecular entities mentioned in a generated caption that lack grounding in the true input molecule. Its complementary version, RCHARM, is recall-oriented and quantifies the fraction of ground-truth entities that are omitted from the generated caption.
Let
- $E_{\text{pred}}$: the set of molecular entities extracted from the model-generated caption,
- $E_{\text{grd}}$: the set of entities correctly grounded in the input molecule (typically from the ground-truth caption or molecule annotation),
- $E_{\text{gt}}$: the set of molecular entities present in the ground-truth caption.
The metrics are defined as:

$$\text{CHARM} = \frac{|E_{\text{pred}} \setminus E_{\text{grd}}|}{|E_{\text{pred}}|}, \qquad \text{RCHARM} = \frac{|E_{\text{grd}} \setminus E_{\text{pred}}|}{|E_{\text{grd}}|}.$$

CHARM penalizes the hallucination of entities, while RCHARM penalizes the omission of correct entities (Park et al., 18 Jan 2026).
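These definitions can be sketched in a few lines of Python; the function names and the empty-set convention (a score of 0 when the relevant set is empty) are illustrative assumptions, not taken from the paper:

```python
def charm(pred_entities: set[str], grounded: set[str]) -> float:
    """Fraction of predicted entities not grounded in the molecule (lower is better)."""
    if not pred_entities:
        return 0.0
    return len(pred_entities - grounded) / len(pred_entities)

def rcharm(pred_entities: set[str], grounded: set[str]) -> float:
    """Fraction of grounded entities omitted from the caption (lower is better)."""
    if not grounded:
        return 0.0
    return len(grounded - pred_entities) / len(grounded)
```

For example, a caption mentioning {hydroxyl, nitro} against a grounded set {hydroxyl, carboxyl} hallucinates one of two mentions (CHARM = 0.5) and omits one of two grounded entities (RCHARM = 0.5).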
2. Underlying Principles
The central principle of CHARM is the explicit detection of factual hallucinations at the entity level. A hallucination is operationally defined as any molecular entity—such as an atom type, substructure, or functional group—asserted in the generated caption but absent from the input molecule. Conversely, RCHARM quantifies the omission rate: the model's failure to mention entities that are present in the molecule.
Entity extraction for CHARM and RCHARM relies on the BERN2 named entity recognition (NER) system, filtered by a confidence threshold for high precision. There are no relation-aware components within the scoring process; CHARM and RCHARM operate purely on entity sets, in contrast to the relation-aware modality fusion present in LMLM architectures such as CoLLaMo (Park et al., 18 Jan 2026).
3. Computational Workflow
The computation of CHARM and RCHARM involves the following steps:
- Entity Extraction: Apply BERN2 NER to both the generated caption and the ground-truth caption, retaining only entity mentions above a high confidence threshold, yielding $E_{\text{pred}}$ and $E_{\text{gt}}$.
- Establish Entity Reference Set: In practice, $E_{\text{grd}}$ is taken to be $E_{\text{gt}}$ or a curated molecular entity list.
- Set Operations: Compute the hallucinated ($E_{\text{pred}} \setminus E_{\text{grd}}$) and omitted ($E_{\text{grd}} \setminus E_{\text{pred}}$) entities.
- Score Calculation:
  - If $E_{\text{pred}} = \emptyset$, $\text{CHARM} = 0$; else, $\text{CHARM} = |E_{\text{pred}} \setminus E_{\text{grd}}| / |E_{\text{pred}}|$.
  - If $E_{\text{grd}} = \emptyset$, $\text{RCHARM} = 0$; else, $\text{RCHARM} = |E_{\text{grd}} \setminus E_{\text{pred}}| / |E_{\text{grd}}|$.
This process is repeated for each input molecule/caption pair to assess model performance across datasets (Park et al., 18 Jan 2026).
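The workflow above can be sketched end to end as follows. BERN2 output is mocked here as simple (mention, confidence) pairs and the 0.9 threshold is illustrative; the real BERN2 API and the paper's threshold may differ:

```python
def extract_entities(ner_output, threshold=0.9):
    """Filter mock NER output, keeping only high-confidence mentions.

    `ner_output` stands in for BERN2 predictions as (mention, confidence)
    pairs; the threshold value is illustrative, not taken from the paper.
    """
    return {mention.lower() for mention, conf in ner_output if conf >= threshold}

# Mock NER output for a generated caption and its ground-truth caption.
pred = extract_entities([("hydroxyl group", 0.95), ("nitro group", 0.40),
                         ("fatty acyl-CoA", 0.97)])
gt = extract_entities([("hydroxyl group", 0.98), ("carboxyl group", 0.97)])

hallucinated = pred - gt   # entities asserted but not grounded
omitted = gt - pred        # grounded entities the caption misses
charm_score = len(hallucinated) / len(pred) if pred else 0.0
rcharm_score = len(omitted) / len(gt) if gt else 0.0
```

Note that the low-confidence "nitro group" mention is dropped before scoring, so a noisy NER hit does not inflate the hallucination count.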
4. Comparison to Traditional Token-based Metrics
Standard generation metrics such as BLEU, ROUGE, and METEOR evaluate n-gram or lexical overlap between predicted and reference text. However, they are insensitive to the factual grounding of molecular entities. These metrics do not penalize the hallucination of chemically impossible substructures or the invention of spurious atoms, nor do they reward the correct identification of functional groups expressed in different wording.
CHARM and RCHARM address this shortcoming by:
- Directly penalizing spurious or ungrounded entity mentions (CHARM).
- Penalizing omissions of real, but unmentioned, molecular entities (RCHARM).
Together, they provide a balanced, chemistry-aware evaluation of both overgeneration and undergeneration of molecular facts—dimensions not captured by traditional overlap-based metrics (Park et al., 18 Jan 2026).
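A toy illustration of the failure mode, using unigram precision as a crude stand-in for BLEU-style overlap (both candidate captions are hypothetical):

```python
def unigram_precision(pred: str, ref: str) -> float:
    """Crude stand-in for BLEU-style lexical overlap."""
    pred_toks, ref_toks = pred.lower().split(), set(ref.lower().split())
    return sum(t in ref_toks for t in pred_toks) / len(pred_toks)

ref = "the molecule contains a hydroxyl group"
paraphrase = "the molecule contains a hydroxyl moiety"  # correct entity, reworded
spurious = "the molecule contains a nitro group"        # hallucinated entity

# Both candidates share 5 of 6 tokens with the reference, so lexical
# overlap cannot tell them apart; an entity-level comparison can:
# {"hydroxyl"} vs. {"hydroxyl"} -> CHARM = 0, while
# {"nitro"} vs. {"hydroxyl"} -> CHARM = 1.
```

This is exactly the distinction an overlap metric misses and an entity-set metric captures.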
5. Integration with LLM-based Caption Quality Judges
In addition to CHARM and RCHARM, an LLM-as-a-judge approach using GPT-4o is employed for qualitative caption assessment. The evaluation prompt includes the molecule in SELFIES representation, the ground-truth caption, and the model's caption, and requests a score from 0 to 5 for factual informativeness and alignment with the ground truth. The resulting composite "LLM Score" complements the entity-based metrics by assessing perceived informativeness and accuracy in a manner closer to human expert evaluation.
This LLM judge is used primarily to validate that lower hallucination (CHARM) and omission (RCHARM) rates correspond to higher factual informativeness and human-like quality in generated captions (Park et al., 18 Jan 2026).
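A hypothetical prompt template capturing the described inputs (SELFIES string, ground-truth caption, model caption); this is a sketch of the setup, not the paper's verbatim prompt:

```python
JUDGE_PROMPT = """\
Molecule (SELFIES): {selfies}
Ground-truth caption: {reference}
Model-generated caption: {candidate}

Rate the model-generated caption from 0 to 5 for factual informativeness
and alignment with the ground-truth caption. Reply with the score only."""

def build_judge_prompt(selfies: str, reference: str, candidate: str) -> str:
    """Fill the template; the result would be sent to the judge model (GPT-4o)."""
    return JUDGE_PROMPT.format(selfies=selfies, reference=reference,
                               candidate=candidate)
```

Constraining the reply to a bare score makes the judge's output easy to parse and aggregate alongside CHARM and RCHARM.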
6. Empirical Results and Significance
Empirical evaluation demonstrates the discriminatory power and utility of CHARM and RCHARM in benchmarking LMLMs. Table 4 from (Park et al., 18 Jan 2026) compares several models on three metrics—CHARM, RCHARM (both lower is better), and LLM Score (higher is better):
| Model | CHARM (%) | RCHARM (%) | LLM Score (0–5) |
|---|---|---|---|
| GPT-4 | 98.4 | 94.3 | 1.95 |
| GPT-4 (ICL) | 77.1 | 76.9 | 2.04 |
| GPT-4o | 99.0 | 98.4 | 1.99 |
| GPT-4o (ICL) | 74.0 | 75.1 | 2.15 |
| o1-mini | 99.1 | 98.8 | 2.17 |
| o1-mini (ICL) | 83.5 | 82.7 | 2.01 |
| LLaMo | 64.7 | 67.2 | 2.17 |
| CoLLaMo (Ours) | 58.5 | 59.9 | 2.52 |
CoLLaMo exhibits the lowest CHARM (58.5%) and RCHARM (59.9%) and the highest LLM Score (2.52), indicating both improved factual precision and enhanced human-judged informativeness relative to prior models. Qualitative examples confirm that CoLLaMo produces chemically accurate, detailed captions (e.g., correctly identifying "3-hydroxy fatty acyl-CoA(4−)"), while competitors tend to hallucinate or omit key molecular groups (Park et al., 18 Jan 2026).
7. Broader Implications and Evaluation Context
CHARM and RCHARM offer a principled, entity-based alternative to n-gram overlap metrics, advancing the standard for factual assessment in molecular language modeling. They facilitate rigorous evaluation by quantifying both spurious attribution and omission of central molecular constituents, with strong empirical alignment to human and LLM-based judgments. This approach is integral to robust benchmarking of LMLMs in molecule captioning and related factual generation tasks, supporting the development of models with higher chemical fidelity (Park et al., 18 Jan 2026).