Zero-DCE: Zero-Reference Deep Curve Estimation
- Zero-DCE is a deep learning framework for low-light image enhancement that predicts pixel-wise quadratic curves without the need for paired reference images.
- It employs a lightweight seven-layer fully convolutional network to iteratively refine image brightness while preserving contrast and detail.
- By combining zero-reference losses with controllable exposure adjustments, Zero-DCE has paved the way for successors like CuDi that offer real-time performance and reduced model complexity.
Zero-Reference Deep Curve Estimation (Zero-DCE) is a deep learning paradigm for low-light image enhancement that eschews the requirement for paired or unpaired reference images, instead learning pixel-wise nonlinear transformation curves through a compact convolutional neural network. The approach has set a precedent for zero-reference enhancement frameworks and spawned highly efficient successors leveraging curve distillation and controllable exposure mechanisms.
1. Formulation of Pixel-wise Curve Estimation
Zero-DCE frames enhancement as the prediction of image-specific, pixel-wise, high-order light-enhancement curves. The quadratic LE-curve adopted for each pixel and channel is
$$LE(I(x); \alpha) = I(x) + \alpha\, I(x)\,(1 - I(x)),$$
where $I(x) \in [0, 1]$ is the normalized input intensity at pixel $x$ and $\alpha \in [-1, 1]$ is a predicted parameter map. This mapping preserves the input range, enforces monotonicity for $\alpha \in [-1, 1]$, and is differentiable. To permit greater curvature, the mapping is applied iteratively,
$$LE_n(x) = LE_{n-1}(x) + \mathcal{A}_n(x)\, LE_{n-1}(x)\,\big(1 - LE_{n-1}(x)\big),$$
across $n$ steps ($n = 8$ in practice), with independent parameter maps $\mathcal{A}_n$ for each step and color channel. The final output is $LE_n(x)$ (Guo et al., 2020, Li et al., 2021).
This curve framework contrasts with pixel-to-pixel mappings and hand-crafted global enhancement curves, providing a differentiable, adaptive solution where the enhancement is locally modulated for each spatial location.
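To make the curve application concrete, the following is a minimal PyTorch sketch of the iterative LE-curve step, assuming the parameter maps are stacked channel-wise in a single tensor (function and variable names are illustrative):

```python
import torch

def apply_le_curves(x: torch.Tensor, curve_maps: torch.Tensor, n: int = 8) -> torch.Tensor:
    """Iteratively apply the quadratic LE-curve: LE(I) = I + A * I * (1 - I).

    x          -- input image, shape (B, 3, H, W), values normalized to [0, 1]
    curve_maps -- stacked parameter maps A_1..A_n, shape (B, 3*n, H, W), in [-1, 1]
    """
    maps = torch.split(curve_maps, 3, dim=1)  # one 3-channel map per iteration step
    out = x
    for k in range(n):
        out = out + maps[k] * out * (1.0 - out)
    return out
```

Given a curve-estimation network `dce_net` (a hypothetical handle for the DCE-Net described below), enhancement reduces to `apply_le_curves(img, dce_net(img))`.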
2. DCE-Net Network Architecture
DCE-Net is a lightweight, fully convolutional architecture with seven layers, each a $3 \times 3$ convolution with 32 channels and ReLU activation (tanh for the final layer), and no downsampling or pooling. For $n$ iterations and three color channels, the output comprises $3n$ parameter maps at full image resolution (e.g., 24 channels for $n = 8$). This design enables spatially detailed curve prediction without spatial bottlenecks and generalizes to arbitrary image sizes.
Key architectural details:
- $n = 8$ curve iterations.
- Trainable parameters: 79 K.
- Inference runtime: 0.0025 s per image on a GPU (~500 FPS) (Guo et al., 2020, Li et al., 2021).
- DCE-Net++: depthwise separable convolutions, a single parameter map shared across all iterations, and input downsampling reduce the model to ~10 K parameters and reach ~1,000 FPS on a GPU.
The architecture provides a resource-efficient mapping from image to curve parameters for real-time deployment scenarios, including mobile and embedded applications.
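A PyTorch sketch of such a network is given below. It follows the seven-layer, 32-channel, $3 \times 3$ convolution design with symmetric skip concatenations described in Guo et al. (2020); exact details may differ from the reference implementation, so treat it as an approximation (its parameter count lands near the reported 79 K):

```python
import torch
import torch.nn as nn

class DCENet(nn.Module):
    """Sketch of the seven-layer curve-estimation network (DCE-Net).

    Seven 3x3 convolutions with 32 feature maps and ReLU activations,
    symmetric skip concatenations, and a final tanh producing 3*n curve
    parameter maps at full resolution (24 channels for n = 8).
    """

    def __init__(self, n_iter: int = 8, feat: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(3, feat, 3, padding=1)
        self.conv2 = nn.Conv2d(feat, feat, 3, padding=1)
        self.conv3 = nn.Conv2d(feat, feat, 3, padding=1)
        self.conv4 = nn.Conv2d(feat, feat, 3, padding=1)
        self.conv5 = nn.Conv2d(feat * 2, feat, 3, padding=1)
        self.conv6 = nn.Conv2d(feat * 2, feat, 3, padding=1)
        self.conv7 = nn.Conv2d(feat * 2, 3 * n_iter, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.relu(self.conv1(x))
        f2 = self.relu(self.conv2(f1))
        f3 = self.relu(self.conv3(f2))
        f4 = self.relu(self.conv4(f3))
        f5 = self.relu(self.conv5(torch.cat([f3, f4], dim=1)))
        f6 = self.relu(self.conv6(torch.cat([f2, f5], dim=1)))
        return torch.tanh(self.conv7(torch.cat([f1, f6], dim=1)))
```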
3. Zero-Reference Losses and Training Paradigm
Zero-DCE is trained solely with non-reference losses, circumventing the need for paired ground-truth images. Four differentiable losses drive the solution:
- Spatial Consistency Loss: Preserves local contrast across neighboring patches.
- Exposure Control Loss: Adjusts patch-wise mean intensities towards a fixed well-exposedness level (typically $E = 0.6$).
- Color Constancy Loss: Maintains Gray-World channel balances globally.
- Illumination Smoothness Loss: Encourages smoothness in learned curve coefficient maps.
The total loss is a weighted sum,
$$L_{\text{total}} = L_{\text{spa}} + L_{\text{exp}} + W_{\text{col}}\, L_{\text{col}} + W_{\text{tv}_{\mathcal{A}}}\, L_{\text{tv}_{\mathcal{A}}},$$
with the weights set empirically.
Training employs images from the SICE dataset, random cropping and normalization, the Adam optimizer with default momentum parameters and a fixed learning rate of $10^{-4}$, and a batch size of 8. No reference enhanced images are required, as the non-reference objectives suffice to enforce contrast, exposure, color, and smoothness constraints (Guo et al., 2020, Li et al., 2021, Li et al., 2022).
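As a hedged sketch, two of the four objectives can be written as follows; the well-exposedness level $E = 0.6$ and $16 \times 16$ patches follow the paper, while the reduction and the total-variation formulation are simplifications (the spatial consistency and color constancy terms are analogous patch- and channel-wise penalties):

```python
import torch
import torch.nn.functional as F

def exposure_control_loss(enhanced: torch.Tensor, e: float = 0.6, patch: int = 16) -> torch.Tensor:
    """Pull the mean intensity of non-overlapping patches towards a target level E."""
    gray = enhanced.mean(dim=1, keepdim=True)   # average over the RGB channels
    patch_means = F.avg_pool2d(gray, patch)     # per-patch mean intensity
    return torch.abs(patch_means - e).mean()

def illumination_smoothness_loss(curve_maps: torch.Tensor) -> torch.Tensor:
    """Total-variation penalty encouraging spatially smooth curve parameter maps."""
    dh = (curve_maps[:, :, 1:, :] - curve_maps[:, :, :-1, :]).pow(2).mean()
    dw = (curve_maps[:, :, :, 1:] - curve_maps[:, :, :, :-1]).pow(2).mean()
    return dh + dw
```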
4. Curve Distillation and Controllable Exposure Adjustment
CuDi (Curve Distillation) builds upon the Zero-DCE framework with a two-stage distillation scheme and explicit exposure control. The n-step high-order pixel-wise curve mapping of Zero-DCE is analytically approximated by its first-order Taylor expansion (tangent line) at each pixel,
$$\tilde{F}\big(I(x)\big) \approx k(x)\, I(x) + b(x),$$
where the per-pixel slope $k(x)$ and intercept $b(x)$ are derived analytically from the teacher's n-step mapping (Li et al., 2022).
- Two-stage training: The teacher network (U-Net with n-step mapping) is trained with exposure-map guided losses; the distilled student network (tiny convolutional net, 3K parameters) is optimized to mimic the teacher's output by predicting the tangent approximation.
- Self-supervised Spatial Exposure Control Loss: enforces that spatial regions of the output conform to an input exposure map, enabling global and local exposure control via user- or algorithm-defined exposure maps $E$.
CuDi achieves a drastic reduction in model complexity and inference time (~12 ms per image on GPU, 0.5 s on CPU) relative to Zero-DCE, and corrects both underexposed and overexposed images within a single model. Crucially, training the student directly with zero-reference losses fails; distillation through the analytical tangent approximation is essential (Li et al., 2022).
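At inference time the student therefore replaces the n iterations with a single per-pixel linear mapping. The sketch below shows only this application step; the analytic derivation of the slope and intercept from the teacher follows Li et al. (2022) and is not reproduced here, and the names and bilinear upsampling choice are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def apply_tangent_curve(x: torch.Tensor, slope: torch.Tensor, intercept: torch.Tensor) -> torch.Tensor:
    """Apply a per-pixel linear (tangent) approximation of the n-step curve.

    x                -- full-resolution input, shape (B, 3, H, W), values in [0, 1]
    slope, intercept -- per-pixel coefficient maps from the tiny student network,
                        possibly predicted at reduced resolution
    """
    if slope.shape[-2:] != x.shape[-2:]:
        slope = F.interpolate(slope, size=x.shape[-2:], mode="bilinear", align_corners=False)
        intercept = F.interpolate(intercept, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (slope * x + intercept).clamp(0.0, 1.0)
```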
5. Quantitative Results, Benchmarking, and Ablation
Extensive empirical evaluations on under-/overexposed benchmarks and standard datasets demonstrate:
- Benchmark performance:
| Metric (Dataset) | Zero-DCE | CuDi (automatic / manual exposure) |
|---|---|---|
| OverExp-Pair PSNR (overexposed) | 9.07 dB | 16.55/17.94 dB |
| VE-LOL PSNR (underexposed) | 18.10 dB | 19.76/22.49 dB |
| SICE Part 2 PSNR (low/normal pairs) | 16.57 dB | — |
| SSIM / LPIPS / PI / NIQE / MUSIQ | similar gains reported | — |
- Model size and speed:
| | Zero-DCE | CuDi (student) |
|---|---|---|
| Parameters | ~80 K | 3 K |
| Inference (GPU, 1024×1024) | 28.6 ms | 5.4 ms |
| Inference (CPU, 4K) | 44 s | 0.5 s |
- Ablation: Loss term removal yields artifacts (under- or over-enhancement, color casts); curve distillation is required for tiny-net efficacy.
- Practical utility: Enhanced images improve face detection AP on DARK FACE dataset; model generalizes to diverse illumination conditions without mode collapse, inversion, or overflow (Guo et al., 2020, Li et al., 2022).
6. Exposure Control: Global and Local Mechanisms
Controllable exposure is achieved via an input exposure map $E$:
- Global: a uniform map (a higher value brightens underexposed inputs; a lower value darkens overexposed ones).
- Local: a spatially varying map (e.g., with values such as $0.25$), where $E$ modulates exposure according to local luminance.
Architecture modifications integrate $E$ as a fourth input channel alongside the RGB channels, with depthwise separable convolutions and downsampling for efficient prediction. The output can reflect a 1D exposure slider or flexible 2D control for advanced shadow/highlight correction (Li et al., 2022).
Model flexibility allows both global and spatially variant corrections within one learned representation.
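A minimal sketch of the exposure-conditioned input is shown below, assuming $E$ is supplied as a single-channel map and concatenated with the RGB channels; the default value of 0.5 is an illustrative placeholder rather than a value from the paper:

```python
from typing import Optional
import torch

def build_exposure_input(img: torch.Tensor, e_map: Optional[torch.Tensor] = None,
                         e_value: float = 0.5) -> torch.Tensor:
    """Concatenate an exposure map E as a fourth input channel, giving (B, 4, H, W).

    img   -- RGB input, shape (B, 3, H, W), values in [0, 1]
    e_map -- optional user- or algorithm-defined spatial map, shape (B, 1, H, W);
             if None, a uniform (global) map with value `e_value` is used
    """
    if e_map is None:
        e_map = torch.full_like(img[:, :1], e_value)  # global, uniform exposure target
    return torch.cat([img, e_map], dim=1)
```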
7. Significance, Limitations, and Prospective Directions
Zero-DCE provides a principled, efficient low-light enhancement technique through zero-reference curve estimation, outperforming deep supervised and conventional methods in both quantitative metrics and speed. CuDi, via curve distillation and spatial exposure control, delivers similar visual quality at drastically reduced resource cost and offers explicit exposure adjustability.
Zero-reference losses obviate the need for curated ground-truth datasets, making the framework data-efficient and broadly applicable. Limitations include the failure of the tiny student network when distillation supervision is omitted, and possible suboptimality relative to heavily annotated, task-specific supervised solutions under certain conditions.
These paradigms have informed subsequent research in image enhancement, robust exposure adjustment, and lightweight vision models for downstream recognition tasks.
References: (Guo et al., 2020, Li et al., 2021, Li et al., 2022)