CHEM: Conformal Hallucination Estimation Metric

Updated 17 December 2025

CHEM is a general-purpose, model-agnostic metric that quantifies hallucinated textures in image reconstruction models using multiscale transforms and conformalized quantile regression.
It isolates spurious high-frequency artifacts via wavelet and shearlet decompositions, enabling localized error assessments that standard metrics often miss.
CHEM exposes the trade-off between reducing MSE and increasing hallucination risk, which matters most in safety-critical applications.

The Conformal Hallucination Estimation Metric (CHEM) is a general-purpose, model-agnostic quantitative framework for identifying and assessing hallucinated textures in image reconstruction models. Hallucinations, defined here as plausible but incorrect image features introduced by models such as U-Net and its variants, pose significant risks in safety-critical applications. CHEM evaluates hallucination artifacts through a combination of sparse multiscale signal representations (wavelets and shearlets) and conformalized quantile regression, producing statistically valid uncertainty intervals at the coefficient level and quantifying excess artifact energy not explained by calibration. The methodology is directly applicable to any image-to-image mapping, making it a robust instrument for both scientific analysis and practical validation of computer vision models (Li et al., 10 Dec 2025).

1. Core Principles and Scope

CHEM targets the quantification and explanation of hallucinated texture artifacts in image-to-image reconstruction models $\varphi(x)$, such as deconvolution or denoising networks. Hallucinations typically manifest as spurious directional energy in high-frequency subbands, which conventional pixel-domain statistics like mean-squared error (MSE) or peak signal-to-noise ratio (PSNR) do not reveal. The metric operates in the transform domain, using either the discrete wavelet transform (DWT) or the discrete shearlet transform (DST), where directional and multiscale image features can be isolated for analysis.

A fundamental component of CHEM is the use of conformalized quantile regression to obtain per-coefficient confidence bands in the multiscale space. This allows pixel- or coefficient-wise assessment of whether a model's prediction exceeds expected variability, with $(1-\alpha)$ coverage guarantees that are distribution-free, provided the calibration and test samples are exchangeable.

2. Mathematical Formulation

Multiscale Transform Domain

Let $W$ denote a DWT or DST. An image $x \in \mathbb{R}^t$ is decomposed into coefficients $\widehat{w} = W x$, with each coefficient $\widehat{w}_j$ indexed by location, scale, and (for shearlets) orientation. The true image $y$ and model output $\varphi(x)$ admit transforms $\widehat{w}_j(y)$ and $\widehat{w}_j(\varphi(x))$. Hallucinations are most apparent as irregular energy in higher frequencies, so CHEM monitors $|\widehat{w}_j(\varphi(x)) - \widehat{w}_j(y)|$ scale by scale.
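The coefficient-wise error above can be illustrated with a minimal sketch. It assumes a hand-written one-level orthonormal Haar transform as a stand-in for $W$ (the source uses general wavelet and shearlet dictionaries), and the function names `haar2d` and `coefficient_error` and the subband labels are hypothetical. The toy prediction adds a small checkerboard texture to a flat image, so the error is tiny in MSE terms but concentrates entirely in the finest diagonal subband.

```python
import numpy as np

def haar2d(img):
    """One-level orthonormal 2D Haar transform of an even-sized image."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = img[1::2, 0::2], img[1::2, 1::2]   # bottom-left, bottom-right
    return {
        "approx":   (a + b + c + d) / 2.0,    # low-pass band
        "col_diff": (a - b + c - d) / 2.0,    # differences across columns
        "row_diff": (a + b - c - d) / 2.0,    # differences across rows
        "diag":     (a - b - c + d) / 2.0,    # diagonal (checkerboard) detail
    }

def coefficient_error(y, pred):
    """Per-subband |w_j(pred) - w_j(y)| for the Haar stand-in."""
    wy, wp = haar2d(y), haar2d(pred)
    return {band: np.abs(wp[band] - wy[band]) for band in wy}

# Toy example (synthetic, not from the source): flat truth, checkerboard artifact.
y = np.zeros((8, 8))
i, j = np.indices(y.shape)
pred = y + 0.1 * (-1.0) ** (i + j)

err = coefficient_error(y, pred)
mse = float(np.mean((pred - y) ** 2))
for band, e in err.items():
    print(band, float(e.max()))
print("MSE", mse)
```

In this toy case the pixel-domain MSE stays small while the diagonal detail band carries all of the discrepancy, which is the kind of localization the transform-domain view is meant to provide.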

Conformal Quantile Regression

A calibration set $\{(x_i, y_i)\}_{i=1}^N$ is used to fit an initial pointwise nonconformity radius $\widehat{r}_j(x)$ for each coefficient $j$ around the model output. For each coefficient $j$ and calibration sample $n$, the smallest scaling factor $\lambda_j^n$ is found such that $g_\lambda(\widehat{r}_j(x_n)) \geq |\widehat{w}_j(\varphi(x_n)) - \widehat{w}_j(y_n)|$, with $g_\lambda(r) = \lambda r$. The calibration quantile $\lambda_j$ is determined empirically as the $(1-\alpha)(1+1/N)$ quantile of $\{\lambda_j^n\}_{n=1}^N$, and the final interval is $[\widehat{w}_j(\varphi(x)) - \widehat{R}_j(x),\, \widehat{w}_j(\varphi(x)) + \widehat{R}_j(x)]$ with $\widehat{R}_j(x) = g_{\lambda_j}(\widehat{r}_j(x))$.
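The calibration step can be sketched as follows, under stated assumptions. The source does not specify how the initial radius $\widehat{r}_j$ is fit, so the sketch takes the radii as given positive arrays; the data are synthetic, and the name `calibrate_lambda` is hypothetical. With $g_\lambda(r) = \lambda r$, the smallest admissible $\lambda_j^n$ is simply the ratio of absolute error to radius, and the $(1-\alpha)(1+1/N)$ empirical quantile is taken as the $\lceil (N+1)(1-\alpha) \rceil$-th smallest value.

```python
import math
import numpy as np

def calibrate_lambda(abs_err, radius, alpha):
    """abs_err, radius: arrays of shape (N, J), radius > 0. Returns lambda_j, shape (J,)."""
    n = abs_err.shape[0]
    lam = abs_err / radius                   # smallest lambda with lambda * r >= error
    k = math.ceil((n + 1) * (1 - alpha))     # rank of the (1-alpha)(1+1/N) empirical quantile
    if k > n:                                # too few calibration samples for this alpha
        return np.full(abs_err.shape[1], np.inf)
    return np.sort(lam, axis=0)[k - 1]       # k-th smallest value, per coefficient

# Synthetic calibration data (assumption: errors scale with the given radius).
rng = np.random.default_rng(0)
N, J, alpha = 500, 16, 0.1
radius = rng.uniform(0.5, 2.0, size=(N, J))
abs_err = np.abs(rng.standard_normal((N, J))) * radius
lam_j = calibrate_lambda(abs_err, radius, alpha)

# Calibrated half-widths R_j(x) = lambda_j * r_j(x) on fresh synthetic samples.
radius_test = rng.uniform(0.5, 2.0, size=(2000, J))
err_test = np.abs(rng.standard_normal((2000, J))) * radius_test
R_test = lam_j * radius_test
coverage = np.mean(err_test <= R_test, axis=0)
print(coverage.round(3))
```

Returning an infinite scaling factor when $N$ is too small for the requested $\alpha$ is one common convention for split conformal procedures; it keeps the coverage statement valid at the cost of an uninformative interval.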

For a test pair $(x, y)$, the excess coefficient-wise error is

$H_j(x, y) = \max\left\{ |\widehat{w}_j(\varphi(x)) - \widehat{w}_j(y)| - \widehat{R}_j(x),\, 0 \right\}$

and is truncated at $\theta$ to avoid single-coefficient domination: $H_j^\theta(x, y) = \min\{ H_j(x, y),\, \theta \}.$ The overall CHEM score is then

$\mathrm{CHEM}(\varphi) = \mathbb{E}_{(X,Y)}\left[\, \frac{1}{\hat{J}}\sum_{j=1}^{\hat{J}} H_j^\theta(X, Y) \right]$

where $\hat{J}$ is the total number of coefficients. With probability at least $1-\delta$, the empirical estimate over $M$ test samples deviates from this expectation by at most $O\!\left(\theta \sqrt{\log(1/\delta)/(2M)}\right)$.
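A small worked example may help make the score concrete. The sketch below assumes the absolute coefficient errors and calibrated half-widths $\widehat{R}_j$ are already available as arrays; the numbers are made up for illustration, and `chem_score` and `hoeffding_radius` are hypothetical names. The deviation radius uses the Hoeffding form $\theta\sqrt{\log(1/\delta)/(2M)}$, which applies because each per-sample average lies in $[0, \theta]$ after truncation.

```python
import math
import numpy as np

def chem_score(abs_err, R, theta):
    """abs_err, R: arrays of shape (M, J). Returns (per-sample scores, overall mean)."""
    excess = np.maximum(abs_err - R, 0.0)     # H_j: error beyond the calibrated band
    truncated = np.minimum(excess, theta)     # H_j^theta: cap each coefficient at theta
    per_sample = truncated.mean(axis=1)       # average over the J coefficients
    return per_sample, float(per_sample.mean())

def hoeffding_radius(theta, m, delta):
    """Deviation radius theta * sqrt(log(1/delta) / (2 m)) for values in [0, theta]."""
    return theta * math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Illustrative numbers only: two test samples, three coefficients each.
abs_err = np.array([[0.5, 2.0, 10.0],
                    [0.1, 0.2, 0.3]])
R = np.ones_like(abs_err)                     # calibrated half-widths, all equal to 1
theta = 3.0

per_sample, overall = chem_score(abs_err, R, theta)
print(per_sample, overall)
print(hoeffding_radius(theta, m=abs_err.shape[0], delta=0.05))
```

In the first sample the excesses are 0, 1, and 9; truncation at $\theta = 3$ caps the last one, so a single large coefficient cannot dominate the average. The second sample stays inside its bands and contributes zero.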

Table 1: Example Model Comparison (from CANDELS Deconvolution Task)

Method	Loss	Parameters	Train Time
Learnlets	L1/L2	21k	62 h
SwinUNet	L1/L2	99M	134 h
U-Net	L1/L2	7.8M	75 h

3. Theoretical Insights: Hallucination Origin in U-shaped Networks

CHEM provides a rigorous lens on why U-shaped architectures like U-Net are prone to hallucinations. The theoretical analysis frames the supervised reconstruction problem as approximating an unknown operator $M: C(\Omega)\to C(\Omega)$ from polynomially projected samples. The key result is an upper bound on the achievable error using U-Net-type architectures:

$\|\mathcal{S}(M(f)) - \Phi(\mathcal{S}(V_m(f)))\|_\infty \leq C_1\, \omega_f(2/m) + \frac{C_2}{m} + \frac{C_3\, t^3}{(L K \log(K/L))^{1/t}}$

where $\omega_f$ is the modulus of continuity of $f$, $t$ is the number of grid points, and $L$, $K$ govern network depth/parameterization.

The implication is that errors due to discretization ($O(1/m)$), intrinsic function complexity (large $\omega_f$), and limited network capacity all contribute to a nonvanishing residual, especially in high-frequency regions. Such residuals may appear as "hallucinated" features because the network fills in plausible-but-incorrect fine texture not supported by the data. A nonzero lower bound further asserts that, even on finite grids, perfect recovery is impossible.

A plausible implication is that increasing network capacity can reduce hallucination risk, but only to a point; practical architectures invariably produce some spurious details when high-frequency content is present in real data (Li et al., 10 Dec 2025).

4. Experimental Outcomes and Quantitative Evaluation

CHEM was evaluated on the CANDELS astronomical imaging dataset, specifically for deconvolution tasks. Three denoisers were compared: U-Net (7.8M parameters), Learnlets (21k), and SwinUNet (99M). Training used 10,000 galaxy cutouts and 500 epochs per model.

Key findings include:

Under mild point-spread function (PSF) perturbations, both Learnlets and SwinUNet maintain low CHEM scores, indicating robustness to hallucinations, whereas U-Net exhibits a rapid increase in CHEM as the PSF deviates (as reflected in the curves of CHEM against PSF full width at half maximum, FWHM).
Dictionary ablations with Haar, Daubechies-4 (db4), Daubechies-8 (db8), and shearlets show that Learnlets and SwinUNet are consistently more robust, while U-Net produces hallucinated fine-scale structure.
Over the first 300 epochs, U-Net demonstrates a clear trade-off: optimizing for lower MSE eventually increases CHEM, indicating more hallucination. SwinUNet reveals a milder version of this trade-off, whereas Learnlets exhibits a flat, low hallucination profile during training.
Fine-scale visualization in the transformed (db8 and shearlet) domain reveals “false” clumps and streaks in U-Net outputs, directly localized by CHEM. In contrast, the other models do not show such artifacts.
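One way such localization could be visualized is sketched below. This is an illustrative assumption, not the procedure used in the source: it presumes a one-level transform in which each detail coefficient covers a 2x2 pixel block, sums the excess errors $H_j$ across detail subbands, and repeats each value over its block. The name `artifact_map` and the subband labels are hypothetical.

```python
import numpy as np

def artifact_map(excess_by_subband, factor=2):
    """Sum per-subband excess and repeat each coefficient over its pixel block."""
    total = sum(excess_by_subband.values())            # elementwise sum over subbands
    return np.repeat(np.repeat(total, factor, axis=0), factor, axis=1)

# Illustrative excess values: one coefficient location exceeds its band in two subbands.
col_diff = np.zeros((2, 2))
diag = np.zeros((2, 2))
col_diff[0, 1] = 0.5
diag[0, 1] = 0.2

amap = artifact_map({"col_diff": col_diff, "diag": diag})
print(amap)
```

The resulting 4x4 map is nonzero only over the top-right 2x2 pixel block, pointing to where the prediction departs from its calibrated bands. Multi-level or shearlet dictionaries would need a different mapping from coefficients back to pixels.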

5. Discussion: Merits, Limitations, and Interpretation

CHEM delivers several advantages: it is agnostic to network architecture and training loss, requires no distributional assumption about training data, and provides localized artifact maps due to its transform-domain formulation. The conformal regression element confers principled, statistically valid uncertainty intervals that need no Gaussian assumption, as long as calibration and test data are exchangeable (a weaker requirement than i.i.d.).

Limitations include dependence on the choice of transform basis (wavelet or shearlet), added calibration overhead from conformal interval computation, and potential dependence of the uncertainties $\chi$ on the truncation parameter $\theta$. Neither the practical interpretation of $\theta$ nor the choice of dictionary is settled, so both remain open degrees of freedom.

Standard pixelwise or aggregate perceptual metrics routinely miss subtle texture hallucinations that CHEM highlights, making it especially relevant for safety-critical domains, such as medical imaging or astronomical data analysis, where spurious detail can mislead downstream decisions or invalidate findings.

6. Applications and Future Directions

CHEM directly targets scenarios where identifying artifact-prone image reconstructions is paramount. Applications extend to benchmarking and design of new architectures with improved robustness to hallucinated structure, as well as domain-specific tasks in which downstream analytic integrity is compromised by model hallucinations.

Stated avenues for future work include:

Dictionary learning or adaptation to tailor the multiscale basis to the data modality (“learned multiscale transforms”).
Theoretical and empirical investigation of the trade-off frontier between reconstruction performance and hallucination risk, to guide architecture and loss function development.
Integrating CHEM-derived artifact maps with out-of-distribution detection, providing an automated mechanism for flagging unreliable predictions in batch-processing pipelines (Li et al., 10 Dec 2025).

References

Li et al. (2025). CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing.
