CHEM: Conformal Hallucination Estimation Metric

Updated 17 December 2025
  • CHEM is a general-purpose, model-agnostic metric that quantifies hallucinated textures in image reconstruction models using multiscale transforms and conformalized quantile regression.
  • It isolates spurious high-frequency artifacts via wavelet and shearlet decompositions, enabling localized error assessments that standard metrics often miss.
  • CHEM exposes the trade-off between lowering MSE and raising hallucination risk, which matters most in safety-critical applications.

The Conformal Hallucination Estimation Metric (CHEM) is a general-purpose, model-agnostic quantitative framework for identifying and assessing hallucinated textures in image reconstruction models. Hallucinations, defined here as plausible but incorrect image features introduced by models such as U-Net and its variants, pose significant risks in safety-critical applications. CHEM evaluates hallucination artifacts through a combination of sparse multiscale signal representations (wavelets and shearlets) and conformalized quantile regression, producing statistically valid uncertainty intervals at the coefficient level and quantifying excess artifact energy not explained by calibration. The methodology is directly applicable to any image-to-image mapping, making it a robust instrument for both scientific analysis and practical validation of computer vision models (Li et al., 10 Dec 2025).

1. Core Principles and Scope

CHEM targets the quantification and explanation of hallucinated texture artifacts in image-to-image reconstruction models $\varphi(x)$, such as deconvolution or denoising networks. Hallucinations typically manifest as spurious directional energy in high-frequency subbands, which conventional pixel-domain statistics such as mean-squared error (MSE) or peak signal-to-noise ratio (PSNR) do not reveal. The metric operates in the transform domain, using either a discrete wavelet transform (DWT) or a discrete shearlet transform (DST), in which directional and multiscale image features can be isolated for analysis.

A fundamental component of CHEM is the use of conformalized quantile regression to obtain per-coefficient confidence bands in the multiscale space. This allows pixel- or coefficient-wise assessment of whether a model’s prediction exceeds expected variability, with a distribution-free guarantee of $(1-\alpha)$ statistical coverage that requires only that calibration and test samples be exchangeable.

2. Mathematical Formulation

Multiscale Transform Domain

Let $W$ denote a DWT or DST. An image $x \in \mathbb{R}^t$ is decomposed into coefficients $\widehat{w} = Wx$, indexed by location, scale, and (for shearlets) orientation, with individual coefficients written $\widehat{w}_j$. The true image $y$ and model output $\varphi(x)$ admit transforms $\widehat{w}_j(y)$ and $\widehat{w}_j(\varphi(x))$. Hallucinations are most apparent as irregular energy at higher frequencies, so CHEM monitors $|\widehat{w}_j(\varphi(x)) - \widehat{w}_j(y)|$ scale by scale.
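As a rough illustration of the transform-domain comparison described above, the following sketch computes a one-level orthonormal 2D Haar transform in plain Python and the absolute coefficient differences per subband. The Haar choice, the single decomposition level, and the names `haar_dwt2` and `coefficient_errors` are assumptions made for this example; the source also uses db4, db8, and shearlet dictionaries, for which a dedicated transform library would normally be used.

```python
from typing import Dict, List

Image = List[List[float]]
BANDS = ("LL", "LH", "HL", "HH")


def haar_dwt2(img: Image) -> Dict[str, Image]:
    """One-level orthonormal 2D Haar transform (illustrative stand-in for W).

    The image must have even height and width. Returns the approximation
    subband 'LL' and three detail subbands.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands: Dict[str, Image] = {name: [] for name in BANDS}
    for i in range(0, rows, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, cols, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll.append((a + b + c + d) / 2.0)  # local average
            lh.append((a + b - c - d) / 2.0)  # difference between the two rows
            hl.append((a - b + c - d) / 2.0)  # difference between the two columns
            hh.append((a - b - c + d) / 2.0)  # diagonal difference
        for name, row in zip(BANDS, (ll, lh, hl, hh)):
            bands[name].append(row)
    return bands


def coefficient_errors(pred: Image, truth: Image) -> Dict[str, Image]:
    """Absolute coefficient differences |w_j(pred) - w_j(truth)| per subband."""
    wp, wt = haar_dwt2(pred), haar_dwt2(truth)
    return {
        name: [
            [abs(p - t) for p, t in zip(row_p, row_t)]
            for row_p, row_t in zip(wp[name], wt[name])
        ]
        for name in BANDS
    }
```

Keeping the errors grouped by subband makes it possible to inspect fine-scale detail bands separately from the coarse approximation, which is where the text locates hallucinated texture.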

Conformal Quantile Regression

A calibration set $\{(x_n, y_n)\}_{n=1}^N$ is used to fit an initial pointwise nonconformity radius $\widehat{r}_j(x)$ for each coefficient $j$ around the model output. For each coefficient $j$ and calibration sample $n$, the smallest scaling factor $\lambda_j^n$ is found such that $g_{\lambda_j^n}(\widehat{r}_j(x_n)) \geq |\widehat{w}_j(\varphi(x_n)) - \widehat{w}_j(y_n)|$, with $g_\lambda(r) = \lambda r$. The calibration quantile $\lambda_j$ is determined empirically as the $(1-\alpha)(1+1/N)$ quantile of $\{\lambda_j^n\}_{n=1}^N$, and the final interval is

$$\left[\widehat{w}_j(\varphi(x)) - \widehat{R}_j(x),\; \widehat{w}_j(\varphi(x)) + \widehat{R}_j(x)\right], \qquad \widehat{R}_j(x) = g_{\lambda_j}(\widehat{r}_j(x)).$$
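The calibration step can be sketched for a single coefficient as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the initial radii are taken as given (how they are fitted is not shown), and the quantile is implemented as the k-th smallest scale with k = ceil((1 - alpha)(N + 1)), which is one common reading of the $(1-\alpha)(1+1/N)$ empirical quantile.

```python
import math
from typing import Sequence, Tuple


def conformal_scale(
    abs_errors: Sequence[float], radii: Sequence[float], alpha: float
) -> float:
    """Calibrated scale lambda_j for one coefficient j.

    abs_errors[n] is |w_j(phi(x_n)) - w_j(y_n)| and radii[n] is the initial
    radius r_j(x_n) on calibration sample n. Radii must be positive.
    """
    n = len(abs_errors)
    if n == 0 or n != len(radii):
        raise ValueError("need the same positive number of errors and radii")
    if any(r <= 0 for r in radii):
        raise ValueError("initial radii must be positive")
    # Smallest lambda with lambda * r >= |error| on each calibration sample.
    scales = sorted(e / r for e, r in zip(abs_errors, radii))
    # Empirical (1 - alpha)(1 + 1/N) quantile, taken as the k-th smallest value.
    k = math.ceil((1.0 - alpha) * (n + 1))
    if k > n:
        return math.inf  # too few calibration samples for this alpha
    return scales[k - 1]


def conformal_interval(
    pred_coeff: float, radius: float, lam: float
) -> Tuple[float, float]:
    """Interval [w - R, w + R] with R = lam * radius around a predicted coefficient."""
    half_width = lam * radius
    return pred_coeff - half_width, pred_coeff + half_width
```

Returning infinity when the calibration set is too small for the requested alpha keeps the interval valid at the cost of being uninformative; other conventions are possible.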

For a test pair $(x, y)$, the excess coefficient-wise error is

$$H_j(x, y) = \max\left\{ |\widehat{w}_j(\varphi(x)) - \widehat{w}_j(y)| - \widehat{R}_j(x),\, 0 \right\}$$

and is truncated at $\theta$ to avoid single-coefficient domination: $H_j^\theta(x, y) = \min\{ H_j(x, y),\, \theta \}$. The overall CHEM score is then

$$\mathrm{CHEM}(\varphi) = \mathbb{E}_{(X,Y)}\left[ \frac{1}{\hat{J}}\sum_{j=1}^{\hat{J}} H_j^\theta(X, Y) \right]$$

where $\hat{J}$ is the total number of coefficients. The empirical estimate over $M$ test samples enjoys a concentration bound $O\left(\theta \sqrt{\log(1/\delta)/(2M)}\right)$.
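The score and its concentration bound can be sketched as below, assuming the transform coefficients and calibrated radii are already available as flat sequences. The function names are hypothetical, and the deviation term is the standard one-sided Hoeffding bound for the mean of M independent values in [0, theta], which matches the stated rate; the exact constants used in the source are not reproduced here.

```python
import math
from typing import Sequence


def chem_sample(
    pred_coeffs: Sequence[float],
    true_coeffs: Sequence[float],
    radii: Sequence[float],
    theta: float,
) -> float:
    """Mean truncated excess error over the coefficients of one test pair."""
    if not pred_coeffs or not (len(pred_coeffs) == len(true_coeffs) == len(radii)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, t, r in zip(pred_coeffs, true_coeffs, radii):
        excess = max(abs(p - t) - r, 0.0)  # H_j: error beyond the calibrated radius
        total += min(excess, theta)        # H_j^theta: truncated at theta
    return total / len(pred_coeffs)


def chem_score(per_sample_scores: Sequence[float]) -> float:
    """Empirical CHEM: average of the per-sample scores over the test set."""
    if not per_sample_scores:
        raise ValueError("need at least one test sample")
    return sum(per_sample_scores) / len(per_sample_scores)


def hoeffding_deviation(theta: float, m: int, delta: float) -> float:
    """theta * sqrt(log(1/delta) / (2m)): Hoeffding deviation at confidence 1 - delta."""
    return theta * math.sqrt(math.log(1.0 / delta) / (2.0 * m))
```

Because every truncated term lies in [0, theta], each per-sample score does too, which is what makes a Hoeffding-type bound applicable. The untruncated values of `excess`, kept per coefficient, would give a localized artifact map.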

Table 1: Example Model Comparison (from CANDELS Deconvolution Task)

| Method | Loss | Parameters | Train Time |
|---|---|---|---|
| Learnlets | L1/L2 | 21k | 62 h |
| SwinUNet | L1/L2 | 99M | 134 h |
| U-Net | L1/L2 | 7.8M | 75 h |

3. Theoretical Insights: Hallucination Origin in U-shaped Networks

CHEM provides a rigorous lens on why U-shaped architectures like U-Net are prone to hallucinations. The theoretical analysis frames the supervised reconstruction problem as approximating an unknown operator $M: C(\Omega) \to C(\Omega)$ from polynomially projected samples. The key result is an upper bound on achievable error using U-Net–type architectures:

$$\|\mathcal{S}(M(f)) - \Phi(\mathcal{S}(V_m(f)))\|_\infty \leq C_1\, \omega_f(2/m) + C_2/m + \frac{C_3\, t^3}{(L K \log(K/L))^{1/t}}$$

where $\omega_f$ is the modulus of continuity of $f$, $t$ is the number of grid points, and $L$, $K$ govern network depth/parameterization.

The implication is that errors due to discretization ($O(1/m)$), intrinsic function complexity (large $\omega_f$), and limited network capacity all contribute to a nonvanishing residual, especially in high-frequency regions. Such residuals may appear as “hallucinated” features because the network fills in plausible-but-incorrect fine texture not supported by data. A nonzero lower bound further asserts that, even on finite grids, perfect recovery is impossible.

A plausible implication is that increasing network capacity can reduce hallucination risk, but only to a point; practical architectures invariably produce some spurious details when high-frequency content is present in real data (Li et al., 10 Dec 2025).
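To get a feel for how the three terms of the upper bound behave, the sketch below evaluates them numerically. Everything beyond the formula itself is an assumption for illustration: the constants $C_1, C_2, C_3$ are set to 1 (the source does not give their values), the function $f$ is taken to be Lipschitz so that $\omega_f(\delta) = \mathrm{Lip}\cdot\delta$, and the name `bound_terms` is hypothetical.

```python
import math
from typing import Tuple


def bound_terms(
    m: int,
    t: int,
    l_param: float,
    k_param: float,
    lipschitz: float = 1.0,
    c1: float = 1.0,
    c2: float = 1.0,
    c3: float = 1.0,
) -> Tuple[float, float, float]:
    """Return the (smoothness, discretization, capacity) terms of the bound.

    Assumes omega_f(d) = lipschitz * d and placeholder constants c1, c2, c3.
    """
    if m <= 0 or t <= 0 or l_param <= 0:
        raise ValueError("m, t and L must be positive")
    if k_param <= l_param:
        raise ValueError("need K > L so that log(K / L) is positive")
    smoothness = c1 * lipschitz * (2.0 / m)  # C1 * omega_f(2/m)
    discretization = c2 / m                  # C2 / m
    capacity = c3 * t**3 / (l_param * k_param * math.log(k_param / l_param)) ** (1.0 / t)
    return smoothness, discretization, capacity
```

Under these assumptions the first two terms shrink as $m$ grows, while the capacity term shrinks as $L$ and $K$ grow. Because of the exponent $1/t$, that decay becomes slow when the number of grid points $t$ is large.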

4. Experimental Outcomes and Quantitative Evaluation

CHEM was evaluated on the CANDELS astronomical imaging dataset, specifically for deconvolution tasks. Three denoisers were compared: U-Net (7.8M parameters), Learnlets (21k), and SwinUNet (99M). Training used 10,000 galaxy cutouts and 500 epochs per model.

Key findings include:

  • Under mild point-spread function (PSF) perturbations, both Learnlets and SwinUNet maintain low CHEM scores, indicating robustness to hallucinations, whereas U-Net exhibits a rapid increase in CHEM as the PSF deviates (as reflected in the CHEM–FWHM curves).
  • Dictionary ablations with Haar, Daubechies-4 (db4), Daubechies-8 (db8), and shearlets show that Learnlets and SwinUNet are consistently more robust, while U-Net produces hallucinated fine-scale structure.
  • Over the first 300 epochs, U-Net demonstrates a clear trade-off: optimizing for lower MSE eventually increases CHEM, indicating more hallucination. SwinUNet reveals a milder version of this trade-off, whereas Learnlets exhibits a flat, low hallucination profile during training.
  • Fine-scale visualization in the transformed (db8 and shearlet) domain reveals “false” clumps and streaks in U-Net outputs, directly localized by CHEM. In contrast, the other models do not show such artifacts.
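A PSF perturbation sweep of the kind behind the CHEM–FWHM curves could be set up along the following lines. This is only a sketch under stated assumptions: the PSF is modeled as an isotropic Gaussian (the source does not specify its shape), the FWHM is measured in pixels, and `gaussian_psf` and `perturbed_fwhms` are hypothetical helper names. The conversion between FWHM and standard deviation, FWHM = 2·sqrt(2·ln 2)·sigma, is the standard one for a Gaussian.

```python
import math
from typing import List, Sequence


def gaussian_psf(size: int, fwhm: float) -> List[List[float]]:
    """Normalized size x size Gaussian kernel with the given FWHM in pixels."""
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    if fwhm <= 0:
        raise ValueError("fwhm must be positive")
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    center = size // 2
    kernel = [
        [
            math.exp(-((i - center) ** 2 + (j - center) ** 2) / (2.0 * sigma**2))
            for j in range(size)
        ]
        for i in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]  # sums to 1


def perturbed_fwhms(nominal: float, relative_shifts: Sequence[float]) -> List[float]:
    """FWHM values obtained by applying relative shifts to a nominal FWHM."""
    return [nominal * (1.0 + shift) for shift in relative_shifts]
```

Each perturbed kernel would be used to re-blur the test images before reconstruction, and the CHEM score recorded per FWHM value to trace out one curve per model.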

5. Discussion: Merits, Limitations, and Interpretation

CHEM delivers several advantages: it is agnostic to network architecture and training loss, requires no distributional assumption about training data, and provides localized artifact maps due to its transform-domain formulation. The conformal regression element confers principled, statistically valid uncertainty intervals, offering reliability in non-Gaussian settings and, more generally, whenever calibration and test data are exchangeable rather than strictly i.i.d.

Limitations include dependence on the choice of transform basis (wavelet or shearlet), added calibration overhead from conformal interval computation, and potential dependence of $\chi$ (the uncertainties) on the truncation parameter $\theta$. In addition, the practical interpretation of $\theta$ and the choice of dictionary represent open degrees of freedom.

Standard pixelwise or aggregate perceptual metrics routinely miss subtle texture hallucinations that CHEM highlights, making it especially relevant for safety-critical domains, such as medical imaging or astronomical data analysis, where spurious detail can mislead downstream decisions or invalidate findings.

6. Applications and Future Directions

CHEM directly targets scenarios where identifying artifact-prone image reconstructions is paramount. Applications extend to benchmarking and design of new architectures with improved robustness to hallucinated structure, as well as domain-specific tasks in which downstream analytic integrity is compromised by model hallucinations.

Stated avenues for future work include:

  • Dictionary learning or adaptation to tailor the multiscale basis to the data modality (“learned multiscale transforms”).
  • Theoretical and empirical investigation of the trade-off frontier between reconstruction performance and hallucination risk, to guide architecture and loss function development.
  • Integrating CHEM-derived artifact maps with out-of-distribution detection, providing an automated mechanism for flagging unreliable predictions in batch-processing pipelines (Li et al., 10 Dec 2025).
References (1)
