CHEM: Conformal Hallucination Estimation Metric
- CHEM is a general-purpose, model-agnostic metric that quantifies hallucinated textures in image reconstruction models using multiscale transforms and conformalized quantile regression.
- It isolates spurious high-frequency artifacts via wavelet and shearlet decompositions, enabling localized error assessments that standard metrics often miss.
- CHEM exposes the trade-off between reducing MSE and increasing hallucination risk, which matters for safety-critical applications.
The Conformal Hallucination Estimation Metric (CHEM) is a general-purpose, model-agnostic quantitative framework for identifying and assessing hallucinated textures in image reconstruction models. Hallucinations, defined here as plausible but incorrect image features introduced by models such as U-Net and its variants, pose significant risks in safety-critical applications. CHEM evaluates hallucination artifacts through a combination of sparse multiscale signal representations (wavelets and shearlets) and conformalized quantile regression, producing statistically valid uncertainty intervals at the coefficient level and quantifying excess artifact energy not explained by calibration. The methodology is directly applicable to any image-to-image mapping, making it a robust instrument for both scientific analysis and practical validation of computer vision models (Li et al., 10 Dec 2025).
1. Core Principles and Scope
CHEM targets the quantification and explanation of hallucinated texture artifacts in image-to-image reconstruction models, such as deconvolution or denoising networks. Hallucinations typically manifest as spurious directional energy in high-frequency subbands, which conventional pixel-domain statistics like mean-squared error (MSE) or peak signal-to-noise ratio (PSNR) do not reveal. The metric operates in the transform domain, using either the discrete wavelet transform (DWT) or the discrete shearlet transform (DST), in which directional and multiscale image features can be isolated for analysis.
A fundamental component of CHEM is the use of conformalized quantile regression to obtain per-coefficient confidence bands in the multiscale space. This allows a pixel- or coefficient-wise assessment of whether a model's prediction error exceeds the expected variability, with coverage guarantees that are distribution-free.
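How a multiscale transform separates fine texture from coarse content can be illustrated with a minimal sketch. It uses a single-level 2D Haar transform written in plain Python rather than a full DWT or DST implementation; the function names `haar2d_level1` and `band_energy` and the toy images are illustrative assumptions, not part of CHEM itself.

```python
def haar2d_level1(img):
    """One level of an orthonormal 2D Haar transform.

    img: list of rows (lists of floats) with even height and width.
    Returns four subbands, each half the size of img:
    (approx, col_diff, row_diff, diag).
    """
    approx, col_diff, row_diff, diag = [], [], [], []
    for i in range(0, len(img), 2):
        r_a, r_c, r_r, r_d = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_a.append((a + b + c + d) / 2)  # local average (coarse content)
            r_c.append((a - b + c - d) / 2)  # left column minus right column
            r_r.append((a + b - c - d) / 2)  # top row minus bottom row
            r_d.append((a - b - c + d) / 2)  # diagonal (checkerboard) detail
        approx.append(r_a)
        col_diff.append(r_c)
        row_diff.append(r_r)
        diag.append(r_d)
    return approx, col_diff, row_diff, diag


def band_energy(band):
    """Sum of squared coefficients in one subband."""
    return sum(v * v for row in band for v in row)


# Synthetic toy data: a flat image and the same image with a faint checkerboard texture.
flat = [[1.0] * 4 for _ in range(4)]
textured = [[1.0 + 0.1 * (-1) ** (i + j) for j in range(4)] for i in range(4)]

for name, img in (("flat", flat), ("textured", textured)):
    energies = [round(band_energy(b), 6) for b in haar2d_level1(img)]
    print(name, energies)
```

In this toy case the added texture leaves the coarse band unchanged and appears essentially only in the diagonal detail band, which is the kind of separation that a transform-domain metric relies on.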
2. Mathematical Formulation
Multiscale Transform Domain
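Let $\mathcal{T}$ denote a DWT or DST. An image $x$ is decomposed into coefficients $c = \mathcal{T}x$, indexed by location, scale, and (for shearlets) orientation: $c = (c_k)_{k=1}^{N}$, where each index $k$ encodes one such combination. The true image $x$ and the model output $\hat{x} = f(y)$, obtained from a degraded input $y$, admit transforms $c = \mathcal{T}x$ and $\hat{c} = \mathcal{T}\hat{x}$. Hallucinations are most apparent as irregular energy in higher frequencies, so CHEM monitors the coefficients scale by scale.
Conformal Quantile Regression
A calibration set $\{(y_i, x_i)\}_{i=1}^{n}$ is used to fit an initial pointwise nonconformity radius $r_k$ for each coefficient around the model output. For each coefficient, the smallest scaling factor $\lambda_{i,k}$ is found such that $|c_{i,k} - \hat{c}_{i,k}| \le \lambda_{i,k}\, r_k$, with $\lambda_{i,k} \ge 0$. The calibration quantile $\hat{q}_k$ is determined empirically as the $\lceil (n+1)(1-\alpha) \rceil / n$ quantile of $\{\lambda_{i,k}\}_{i=1}^{n}$, and the final interval is:
$$C_k = \left[\hat{c}_k - \hat{q}_k r_k,\; \hat{c}_k + \hat{q}_k r_k\right]$$
with $\alpha \in (0,1)$ the target miscoverage level.
For a test pair $(y, x)$, the excess coefficient-wise error is
$$e_k = \left(|c_k - \hat{c}_k| - \hat{q}_k r_k\right)_+$$
and is truncated at $\tau > 0$ to avoid single-coefficient domination: $\tilde{e}_k = \min(e_k, \tau)$. The overall CHEM score is then
$$\mathrm{CHEM}(x, \hat{x}) = \frac{1}{N} \sum_{k=1}^{N} \tilde{e}_k$$
where $N$ is the total number of coefficients. The empirical estimate over test samples enjoys a concentration bound around its expectation.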
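The calibration and scoring steps can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: coefficients are flattened into lists, the radius is fixed per coefficient, the excess is measured as an absolute (not squared) distance beyond the interval, and the names `conformal_scales` and `chem_score` are hypothetical. The rank `ceil((n + 1) * (1 - alpha))` is the usual split-conformal choice.

```python
import math


def conformal_scales(cal_true, cal_pred, radius, alpha=0.1):
    """Per-coefficient conformal scaling factors (assumed split-conformal recipe).

    cal_true, cal_pred: lists of coefficient vectors, one per calibration image.
    radius: initial per-coefficient radii, all strictly positive.
    Returns one scale per coefficient; the calibrated interval for coefficient k
    is [pred[k] - scale[k] * radius[k], pred[k] + scale[k] * radius[k]].
    """
    n = len(cal_true)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed conformal rank
    if rank > n:
        raise ValueError("too few calibration samples for this alpha")
    scales = []
    for k, r in enumerate(radius):
        # Smallest factor covering each calibration error on coefficient k.
        lam = sorted(abs(t[k] - p[k]) / r for t, p in zip(cal_true, cal_pred))
        scales.append(lam[rank - 1])
    return scales


def chem_score(true, pred, radius, scales, tau):
    """Mean truncated excess of the true coefficients beyond their intervals."""
    total = 0.0
    for t, p, r, q in zip(true, pred, radius, scales):
        excess = max(abs(t - p) - q * r, 0.0)  # zero when the interval covers t
        total += min(excess, tau)              # truncation limits any single term
    return total / len(true)


# Synthetic toy data: three calibration images with two coefficients each.
cal_true = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
cal_pred = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
radius = [1.0, 1.0]
scales = conformal_scales(cal_true, cal_pred, radius, alpha=0.5)
print(chem_score([5.0, 0.5], [0.0, 0.0], radius, scales, tau=1.0))
```

A score of zero means every true coefficient fell inside its calibrated interval; larger values indicate error that calibration does not account for, capped at `tau` per coefficient.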
Table 1: Example Model Comparison (from CANDELS Deconvolution Task)
| Method | Loss | Parameters | Train Time |
|---|---|---|---|
| Learnlets | L1/L2 | 21k | 62 h |
| SwinUNet | L1/L2 | 99M | 134 h |
| U-Net | L1/L2 | 7.8M | 75 h |
3. Theoretical Insights: Hallucination Origin in U-shaped Networks
CHEM provides a rigorous lens on why U-shaped architectures like U-Net are prone to hallucinations. The theoretical analysis frames the supervised reconstruction problem as approximating an unknown operator from polynomially projected samples. The key result is an upper bound on the error achievable with U-Net-type architectures. The bound depends on the modulus of continuity of the target operator, the number of grid points $m$, and parameters that govern network depth and parameterization.
The implication is that errors due to discretization ($O(1/m)$), intrinsic function complexity (a large modulus of continuity), and limited network capacity all contribute to a nonvanishing residual, especially in high-frequency regions. Such residuals may appear as "hallucinated" features because the network fills in plausible-but-incorrect fine texture not supported by data. A nonzero lower bound further asserts that, even on finite grids, perfect recovery is impossible.
A plausible implication is that increasing network capacity can reduce hallucination risk, but only to a point; practical architectures invariably produce some spurious details when high-frequency content is present in real data (Li et al., 10 Dec 2025).
4. Experimental Outcomes and Quantitative Evaluation
CHEM was evaluated on the CANDELS astronomical imaging dataset, specifically for deconvolution tasks. Three denoisers were compared: U-Net (7.8M parameters), Learnlets (21k), and SwinUNet (99M). Training used 10,000 galaxy cutouts and 500 epochs per model.
Key findings include:
- Under mild point-spread function (PSF) perturbations, both Learnlets and SwinUNet maintain low CHEM scores, indicating robustness to hallucinations, whereas U-Net exhibits a rapid increase in CHEM as the PSF deviation grows (as reflected in curves of CHEM versus the PSF full width at half maximum, FWHM).
- Dictionary ablations with Haar, Daubechies-4 (db4), Daubechies-8 (db8), and shearlets show that Learnlets and SwinUNet are consistently more robust, while U-Net produces hallucinated fine-scale structure.
- Over the first 300 epochs, U-Net demonstrates a clear trade-off: optimizing for lower MSE eventually increases CHEM, indicating more hallucination. SwinUNet reveals a milder version of this trade-off, whereas Learnlets exhibits a flat, low hallucination profile during training.
- Fine-scale visualization in the transformed (db8 and shearlet) domain reveals “false” clumps and streaks in U-Net outputs, directly localized by CHEM. In contrast, the other models do not show such artifacts.
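The MSE versus CHEM trade-off reported above can be monitored during training with a simple check on logged values. The sketch below is an assumption about how such monitoring might look, not a procedure from the study; the function name `tradeoff_checkpoints` and the numbers are synthetic placeholders, not experimental results.

```python
def tradeoff_checkpoints(mse_history, chem_history, tol=0.0):
    """Indices of checkpoints where MSE improved while CHEM got worse.

    Both histories hold one value per logged checkpoint, in training order.
    tol is a slack on the CHEM increase so that tiny fluctuations are ignored.
    """
    flagged = []
    for e in range(1, len(mse_history)):
        mse_improved = mse_history[e] < mse_history[e - 1]
        chem_worsened = chem_history[e] > chem_history[e - 1] + tol
        if mse_improved and chem_worsened:
            flagged.append(e)
    return flagged


# Synthetic placeholder logs (not measurements from any model).
mse_log = [1.0, 0.8, 0.6, 0.5]
chem_log = [0.2, 0.1, 0.15, 0.3]
print(tradeoff_checkpoints(mse_log, chem_log))
```

Checkpoints flagged this way are candidates for early stopping or closer inspection, since further MSE gains there coincide with rising hallucination scores.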
5. Discussion: Merits, Limitations, and Interpretation
CHEM delivers several advantages: it is agnostic to network architecture and training loss, requires no distributional assumption about training data, and provides localized artifact maps due to its transform-domain formulation. The conformal regression element confers principled, statistically valid uncertainty intervals, offering reliability even in non-Gaussian, non-i.i.d. settings.
Limitations include dependence on the choice of transform basis (wavelet or shearlet), added calibration overhead from conformal interval computation, and potential dependence of the calibrated uncertainties on the truncation parameter. In addition, the practical interpretation of the truncation parameter and the choice of dictionary represent open degrees of freedom.
Standard pixelwise or aggregate perceptual metrics routinely miss subtle texture hallucinations that CHEM highlights, making it especially relevant for safety-critical domains, such as medical imaging or astronomical data analysis, where spurious detail can mislead downstream decisions or invalidate findings.
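A localized artifact map of the kind discussed in this section can be sketched as the per-coefficient excess of the true coefficient beyond its calibrated interval, kept in the 2D layout of one subband. The names `artifact_map`, `hotspots`, and `half_width` are hypothetical, and the interval half-widths are assumed to have been computed beforehand by a calibration step.

```python
def artifact_map(true_band, pred_band, half_width):
    """Excess of |true - pred| beyond the interval half-width, per coefficient.

    All three arguments are 2D lists with the same shape (one subband).
    Entries are 0.0 wherever the calibrated interval covers the true value.
    """
    return [
        [max(abs(t - p) - h, 0.0) for t, p, h in zip(t_row, p_row, h_row)]
        for t_row, p_row, h_row in zip(true_band, pred_band, half_width)
    ]


def hotspots(amap):
    """(row, col) positions in the subband with positive excess."""
    return [
        (i, j)
        for i, row in enumerate(amap)
        for j, v in enumerate(row)
        if v > 0.0
    ]


# Synthetic toy subband: two coefficients fall outside their intervals.
true_band = [[1.0, 0.0], [0.0, 2.0]]
pred_band = [[0.0, 0.0], [0.0, 0.0]]
half_width = [[0.5, 0.5], [0.5, 0.5]]
amap = artifact_map(true_band, pred_band, half_width)
print(amap, hotspots(amap))
```

Because each coefficient corresponds to a location and scale, the positions returned by `hotspots` indicate where in the image the unexplained error concentrates.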
6. Applications and Future Directions
CHEM directly targets scenarios where identifying artifact-prone image reconstructions is paramount. Applications extend to benchmarking and design of new architectures with improved robustness to hallucinated structure, as well as domain-specific tasks in which downstream analytic integrity is compromised by model hallucinations.
Stated avenues for future work include:
- Dictionary learning or adaptation to tailor the multiscale basis to the data modality (“learned multiscale transforms”).
- Theoretical and empirical investigation of the trade-off frontier between reconstruction performance and hallucination risk, to guide architecture and loss function development.
- Integrating CHEM-derived artifact maps with out-of-distribution detection, providing an automated mechanism for flagging unreliable predictions in batch-processing pipelines (Li et al., 10 Dec 2025).
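The last direction, flagging unreliable predictions in a batch pipeline, could take a form like the following. This is a speculative sketch, not a described method: it assumes artifact maps are already available for each item (how they would be obtained without ground truth is left open), and the function name `flag_unreliable` and the threshold `max_fraction` are hypothetical.

```python
def flag_unreliable(artifact_maps, max_fraction=0.05):
    """Return ids of items whose artifact map has too many positive entries.

    artifact_maps: dict mapping an item id to a non-empty 2D list of
    non-negative excess values (0.0 means no unexplained error there).
    max_fraction: largest tolerated share of coefficients with positive excess.
    """
    flagged = []
    for item_id, amap in artifact_maps.items():
        values = [v for row in amap for v in row]
        fraction = sum(1 for v in values if v > 0.0) / len(values)
        if fraction > max_fraction:
            flagged.append(item_id)
    return flagged


# Synthetic toy batch with two items.
batch = {
    "clean": [[0.0, 0.0], [0.0, 0.0]],
    "suspect": [[0.3, 0.0], [0.0, 0.1]],
}
print(flag_unreliable(batch, max_fraction=0.25))
```

A fraction-based rule is only one option; a threshold on the aggregate score or on the largest connected region of positive excess would fit the same interface.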