Differentiable ISP: Concepts & Architectures

Updated 4 July 2026

Differentiable ISP is a camera pipeline where traditional processes like demosaicing and tone mapping are recast as differentiable functions or neural modules.
It blends analytical operators and learned proxies, enabling jointly optimized loss functions that improve tasks such as image restoration and detection.
Architectural paradigms range from modular reconfiguration to complete end-to-end RAW-to-RGB networks, balancing interpretability with performance.

A Differentiable Image Signal Processor (ISP) is a software-based camera pipeline in which each traditional step, from RAW sensor data to final sRGB output, is expressed as a differentiable function or neural module, so that a single loss, or a combination of losses, can be back-propagated through the entire chain and all parameters can be jointly optimized (Silva et al., 2023). In the literature, this concept spans several distinct regimes: analytic operators recast as differentiable layers, classical ISP modules approximated by differentiable proxies, compact end-to-end neural RAW→RGB pipelines, bidirectional forward/inverse ISP models, and task-driven pipelines optimized not only for image fidelity but also for detection, denoising, or language-conditioned rendering (Yu et al., 2021, Li et al., 2024, Mayer et al., 13 Sep 2025). The common premise is that end-to-end differentiability can eliminate error accumulation found in sequential heuristic pipelines, enable task-aware tuning, and accommodate perceptual, adversarial, or high-level vision losses (Silva et al., 2023).

1. Definition and conceptual scope

A conventional ISP transforms Color Filter Array sensor data through stages such as black-level correction, defective pixel correction, white balance, demosaicing, denoising, color correction, tone mapping, gamma correction, and post-processing (Silva et al., 2023). In a differentiable ISP, each of these stages is implemented either as a differentiable operator or as a neural module, so that gradients flow end-to-end from the final objective to the earliest RAW-domain computations (Silva et al., 2023).

The literature does not treat differentiable ISP as a single architecture class. One line of work retains explicit modular decomposition and makes existing modules differentiable, either directly or through learned surrogates. ReconfigISP, for example, preserves a classic modular structure while making both the choice of modules and their internal parameters learnable; many classical blocks that are not differentiable with respect to their hyper-parameters are replaced by small neural proxies, after which the entire super-network becomes trainable end-to-end (Yu et al., 2021). Another line of work replaces the whole pipeline by a learned RAW→RGB network, as in AWNet, LW-ISP, and SimpleISP, where the entire graph is a composition of standard tensor operations in PyTorch or related frameworks (Dai et al., 2020, Chen et al., 2022, Elezabi et al., 2024). A third line extends differentiable ISP beyond one-way rendering and models both forward and inverse camera processing, as in ParamISP and Uni-ISP (Kim et al., 2023, Li et al., 2024).

This diversity clarifies a recurrent misconception: differentiable ISP does not imply that every stage is a large neural network. The survey literature explicitly describes differentiable formulations of traditional steps such as $y = x - b$ for black-level offset correction, pixel-wise channel scaling for white balance, learned $3 \times 3$ color transforms for color correction, and differentiable power-law gamma or parametric curve layers for tone mapping (Silva et al., 2023). Conversely, recent work on language-based tuning makes only a single ISP block differentiable—the $3 \times 3$ color-adjustment matrix—while other stages remain fixed or are handled off-line (Mayer et al., 13 Sep 2025). This suggests that differentiability is best understood as a property of the computational graph rather than a commitment to a particular degree of learned complexity.

2. Differentiable formulations of ISP stages

The survey literature provides a stage-by-stage account of how traditional camera operations can be rewritten in differentiable form (Silva et al., 2023). Black-level correction is represented as subtraction of a constant or per-channel bias, white balance as channel-wise multiplication, demosaicing as convolutional interpolation, denoising as residual CNN blocks, color correction as a learned matrix and offset, tone mapping as a differentiable parametric curve or small CNN, and post-processing as residual convolutional or attention modules (Silva et al., 2023). These formulations are technically significant because they allow the RAW-to-sRGB chain to be optimized under composite objectives rather than by independently tuning separate heuristic blocks.
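
As a minimal sketch of the stage-by-stage view above, the following pure-Python example chains black-level subtraction, white balance, a $3 \times 3$ color matrix, and a power-law gamma for a single RGB pixel, then uses a finite difference to confirm that the final loss responds smoothly to one white-balance gain. The function name `tiny_isp`, all numeric values, and the single-pixel simplification are illustrative assumptions and are not taken from any of the cited systems.

```python
def tiny_isp(raw, black, wb_gains, ccm, gamma):
    """Apply four simple ISP stages to one RGB pixel with values in [0, 1]."""
    # Black-level correction: y = x - b, floored at a tiny positive value so
    # the power law below stays well defined.
    x = [max(r - black, 1e-6) for r in raw]
    # White balance: channel-wise multiplication by per-channel gains.
    x = [g * v for g, v in zip(wb_gains, x)]
    # Color correction: 3x3 matrix applied to the RGB vector.
    x = [sum(ccm[i][j] * x[j] for j in range(3)) for i in range(3)]
    # Gamma correction: power law with exponent 1 / gamma.
    return [max(v, 1e-6) ** (1.0 / gamma) for v in x]


def mse(out, target):
    return sum((o - t) ** 2 for o, t in zip(out, target)) / len(out)


RAW = [0.30, 0.40, 0.20]          # assumed toy pixel
TARGET = [0.70, 0.70, 0.70]       # assumed toy reference
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def loss_for_red_gain(gain):
    out = tiny_isp(RAW, 0.05, [gain, 1.0, 1.5], IDENTITY, 2.2)
    return mse(out, TARGET)


# Central finite difference of the loss with respect to the red gain.
eps = 1e-5
grad = (loss_for_red_gain(1.2 + eps) - loss_for_red_gain(1.2 - eps)) / (2 * eps)

# One gradient-descent step on the gain should lower the loss.
new_gain = 1.2 - 1.0 * grad
```

In an automatic-differentiation framework the same gradient would come from back-propagation rather than a finite difference; the point here is only that every stage passes a usable gradient to the parameters before it.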

Several papers instantiate these principles with explicit operator definitions. In the language-based color tuning work, the differentiable ISP block is a single linear color-adjustment transform on per-pixel RGB. If $X \in \mathbb{R}^{3 \times H \times W}$ and $\phi = \{\phi_{ij}\}$, the color-transform matrix is

$M_{(\phi)} = \begin{bmatrix} 1-\phi_{11} & \phi_{12} & \phi_{13} \\ \phi_{21} & 1-\phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & 1-\phi_{33} \end{bmatrix},$

with forward operation

$Y = g_{(\mathrm{color})}(X,\phi) = M_{(\phi)} \cdot X.$

Because the operation is matrix multiplication, it is fully differentiable; the parameters are initialized at the identity, each row sums to one for white-point conservation, and each $\phi_{ij}$ is clipped to $|\phi_{ij}| \le \tau$ during optimization (Mayer et al., 13 Sep 2025).
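
A minimal sketch of this block, assuming nothing beyond the matrix layout and the clipping rule stated above, might look as follows. The helper names `color_matrix`, `apply_color`, and `clip_phi`, the toy pixel, and the value of `tau` are hypothetical; the sketch operates on one pixel and does not enforce the row-sum constraint.

```python
def color_matrix(phi):
    """Build M(phi): diagonal entries 1 - phi_ii, off-diagonal entries phi_ij."""
    return [[(1.0 - phi[i][j]) if i == j else phi[i][j] for j in range(3)]
            for i in range(3)]


def apply_color(pixel, phi):
    """Y = M(phi) . X for a single RGB pixel."""
    m = color_matrix(phi)
    return [sum(m[i][j] * pixel[j] for j in range(3)) for i in range(3)]


def clip_phi(phi, tau):
    """Clamp every parameter to |phi_ij| <= tau, as done during optimization."""
    return [[max(-tau, min(tau, v)) for v in row] for row in phi]


pixel = [0.2, 0.5, 0.8]                      # assumed toy RGB value
zeros = [[0.0] * 3 for _ in range(3)]        # identity initialization
unchanged = apply_color(pixel, zeros)

# An out-of-range update is pulled back into the allowed band.
clipped = clip_phi([[0.5, -0.5, 0.01], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], tau=0.1)
```

With all parameters at zero the matrix is the identity, so the pixel passes through unchanged, which matches the identity initialization described above.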

Other systems differentiate through more heterogeneous stages. DualDn constructs a simplified differentiable ISP for training two denoisers, including EXIF-driven black-level subtraction and normalization, white balance, a differentiable re-expression of Adaptive Homogeneity-Directed demosaicing via convolutions and soft masks, color correction with a $3 \times 3$ matrix, global tone mapping through a “deep curve,” and optional gamma correction (Li et al., 2024). ParamISP combines fixed canonical operators and learnable modules: CanoNet performs Malvar–He–Cutler demosaicing, known white-balance gains, and a known camera color-correction matrix; LocalNet models local residual operations; and GlobalNet models cascaded quadratic and gamma transforms, all with differentiable operations in PyTorch (Kim et al., 2023). Dark-ISP factorizes the ISP into a learnable linear sensor-calibration map and a nonlinear tone-mapping curve, with one set of curve coefficients predicted per RGB channel (Guo et al., 11 Sep 2025).

A separate formulation strategy is exact differentiable signal transforms inside end-to-end networks. AWNet uses a 1-level Haar DWT in each scaling module, with grouped convolutions implementing discrete wavelet transform and inverse DWT as fixed Haar filters. These operators are described as exactly invertible, lossless, and fully differentiable (Dai et al., 2020). This is technically distinct from proxy-based differentiation: rather than approximating a non-differentiable block, the pipeline uses operators that are already linear and therefore naturally compatible with back-propagation.
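
The invertibility claim can be illustrated with a minimal pure-Python sketch of a one-level 2-D Haar transform and its inverse. This is not AWNet's implementation, which uses grouped convolutions with fixed filters; the function names, the sub-band labels, and the orthonormal scaling by 1/2 are assumptions chosen for illustration.

```python
def haar_dwt2(img):
    """One-level 2-D Haar transform of an image with even height and width."""
    h, w = len(img), len(img[0])
    bands = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    ll, lh, hl, hh = bands
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2   # local average
            lh[i // 2][j // 2] = (a - b + c - d) / 2   # left/right difference
            hl[i // 2][j // 2] = (a + b - c - d) / 2   # top/bottom difference
            hh[i // 2][j // 2] = (a - b - c + d) / 2   # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h2, w2 = len(ll), len(ll[0])
    img = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for i in range(h2):
        for j in range(w2):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + x + y + z) / 2
            img[2 * i][2 * j + 1] = (s - x + y - z) / 2
            img[2 * i + 1][2 * j] = (s + x - y - z) / 2
            img[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return img


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
restored = haar_idwt2(*haar_dwt2(image))
```

Both directions are fixed linear maps, so gradients pass through them unchanged in form, and the round trip reproduces the input up to floating-point error.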

3. Architectural paradigms

The principal architectural paradigms can be organized into modular reconfiguration, end-to-end learned rendering, forward/inverse unification, and differentiable training scaffolds.

Paradigm	Representative systems	Defining characteristic
Modular differentiable ISP	ReconfigISP, language-based color tuning	Classic ISP structure retained; modules or parameters made learnable
End-to-end RAW→RGB ISP	AWNet, LW-ISP, SimpleISP, Mobile AI learned ISP	Entire RAW→RGB graph implemented as neural or differentiable operators
Unified forward/inverse ISP	ParamISP, Uni-ISP	Joint modeling of RAW↔sRGB or XYZ↔sRGB mappings
Task-coupled differentiable ISP	DualDn, Dark-ISP	ISP embedded in a larger optimization target such as denoising or detection

ReconfigISP is a canonical example of modular differentiable design. It defines a multi-step super-network over a pool of 22 candidate algorithms divided into RAW→RAW denoising, RAW→sRGB demosaicing, and several sRGB→sRGB stages, with soft architecture weights at each step and differentiable proxies for non-differentiable modules (Yu et al., 2021). This preserves explicit pipeline structure while enabling differentiable neural architecture search over module choice, hyper-parameters, and efficiency trade-offs.
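
The idea of soft architecture weights can be sketched as a softmax-weighted blend of candidate module outputs at one pipeline step, in the style of differentiable architecture search. The three candidate functions below are hypothetical stand-ins, not modules from ReconfigISP's pool, and `mixed_step` is an illustrative name.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_step(x, candidates, arch_logits):
    """Blend the outputs of all candidates with softmax(arch_logits) as weights."""
    weights = softmax(arch_logits)
    outputs = [fn(x) for fn in candidates]
    return [sum(w * out[k] for w, out in zip(weights, outputs))
            for k in range(len(x))]


# Hypothetical candidates for one sRGB->sRGB step.
candidates = [
    lambda x: list(x),                          # skip connection
    lambda x: [v ** (1 / 2.2) for v in x],      # gamma curve
    lambda x: [min(1.0, 1.5 * v) for v in x],   # clipped gain
]

uniform = mixed_step([0.25], candidates, [0.0, 0.0, 0.0])
peaked = mixed_step([0.25], candidates, [50.0, 0.0, 0.0])
```

Because the blend is a smooth function of the logits, the logits can be trained by gradient descent; once one logit dominates, the step behaves like the corresponding single module, which is what makes pruning to a discrete pipeline possible.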

AWNet, LW-ISP, and the Mobile AI learned ISP systems illustrate end-to-end rendering networks. AWNet is a two-branch U-Net–style encoder–decoder with a 4-channel RAW branch and a 3-channel demosaiced RGB branch; discrete wavelet down/up-sampling and global context blocks are integrated into both branches, and the final image is obtained by averaging the two outputs (Dai et al., 2020). LW-ISP uses a tiny U-Net with Fine-Grained Attention Modules between down stages and Contextual Complement Upsampling Blocks between up stages, with the entire graph composed of convs, pooling, PixelShuffle, sigmoid, ReLU, additions, and concatenations (Chen et al., 2022). The Mobile AI challenge report describes an ISP as a single TensorFlow Lite graph with demosaicing, denoising, color correction, and tone mapping/gamma blocks, all implemented with small CNN components and optimized for smartphone NPUs (Ignatov et al., 2021).

SimpleISP separates global color transformation from local reconstruction. Its Color Module, or CMod, learns global pixel-wise color transformations guided by full-image statistics, while a lightweight CNN reconstruction network restores local structure and detail (Elezabi et al., 2024). This formulation makes the division explicit: full-image guidance is used to modulate pixel-wise transformations consistently across the patch or image (Elezabi et al., 2024).

ParamISP and Uni-ISP extend differentiable ISP to bidirectional modeling. ParamISP defines both a forward ISP and an inverse ISP, with camera parameters entering through ParamNet, a small MLP that transforms EXIF-style optical parameters into a feature vector injected throughout the network (Kim et al., 2023). Uni-ISP instead uses a shared backbone conditioned on device-aware embeddings through a Device Embedding Interaction Module, allowing a single model to support multiple cameras in both inverse and forward directions (Li et al., 2024). This suggests that differentiable ISP has become a vehicle not only for image rendering but also for parameter-conditioned and device-conditioned camera modeling.

4. Objectives, supervision, and optimization

Differentiable ISPs are defined as much by their objectives as by their architectures. The survey literature lists common losses including $L_1$, $L_2$, perceptual losses, adversarial losses, and SSIM-based losses, often combined into a weighted total loss of the form

$L_{\mathrm{total}} = \sum_{k} \lambda_k L_k,$

where each $L_k$ is one loss term and $\lambda_k$ is its weight, with the advantage that all terms can supervise the entire RAW→RGB chain simultaneously (Silva et al., 2023).
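
The combination of loss terms described above can be sketched as a weighted sum over interchangeable loss functions. The two pixel-wise terms and the weights below are illustrative assumptions; perceptual, adversarial, or SSIM terms would plug into the same interface.

```python
def l1_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred, target, weighted_terms):
    """Weighted sum of loss terms; weighted_terms is a list of (weight, loss_fn)."""
    return sum(w * fn(pred, target) for w, fn in weighted_terms)


pred = [0.5, 0.25, 1.0]       # assumed toy prediction
target = [0.0, 0.25, 0.5]     # assumed toy reference
value = total_loss(pred, target, [(1.0, l1_loss), (0.5, l2_loss)])
```

Since the total is a single scalar, one backward pass distributes gradients from every term to every parameter in the chain, which is the property the survey emphasizes.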

AWNet uses multi-scale supervision. Its Charbonnier loss takes the standard form

$L_{\mathrm{Char}} = \sqrt{\lVert \hat{y} - y \rVert^2 + \epsilon^2}$

with a small constant $\epsilon$, where $\hat{y}$ is the prediction and $y$ the reference. It is accompanied by VGG-19 perceptual loss and SSIM loss, applied with different weights across decoder scales; the total loss sums scale-wise objectives, configured separately for the RAW branch and the demosaiced branch (Dai et al., 2020). SimpleISP uses a dedicated color-module loss and a final reconstruction objective combining four terms, with all weights set to 1 in the experiments (Elezabi et al., 2024). The Mobile AI challenge report similarly combines a pixel-level fidelity loss with SSIM and perceptual VGG-19 loss during final fine-tuning (Ignatov et al., 2021).
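
For reference, a Charbonnier penalty can be sketched in a few lines. The version below is an element-wise variant averaged over samples, and the default `eps=1e-3` is an assumed value rather than the setting reported for AWNet.

```python
import math


def charbonnier(pred, target, eps=1e-3):
    """Mean of sqrt(d^2 + eps^2) over elements, a smooth stand-in for |d|."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2)
               for p, t in zip(pred, target)) / len(pred)


at_zero = charbonnier([1.0], [1.0])      # equals eps when the error is zero
far_away = charbonnier([1.0], [0.0])     # close to |d| = 1 for large errors
```

Unlike the absolute error, this penalty has a well-defined gradient at zero error, which tends to make optimization better behaved near convergence.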

Proxy-based and search-based systems require additional optimization machinery. ReconfigISP alternates a parameter step updating module hyper-parameters and an architecture step updating architecture weights through a DARTS-style meta-learning scheme; it also employs online pruning and proxy tuning using a memory of recent intermediate activations (Yu et al., 2021). This is qualitatively different from pure end-to-end CNN training because the optimization target includes not only numerical parameters but pipeline topology.

Task-conditioned differentiable ISPs add supervision from downstream objectives. Dark-ISP uses the usual object-detection loss and augments it with a self-boost loss and with regularization on the tone-curve coefficients and the learned color matrix. The self-boost term is turned on only after epoch 10 (Guo et al., 11 Sep 2025). DualDn uses both raw-domain supervision and sRGB-domain supervision, with a differentiable ISP connecting the raw-domain and sRGB-domain denoisers during training (Li et al., 2024).

Language-conditioned optimization departs from paired reconstruction altogether. In language-based color ISP tuning, OpenAI CLIP with ViT-B/32, LAION-2B weights supplies image and text embeddings, and single-prompt tuning optimizes the color parameters $\phi$ under an objective based on the cosine similarity between the embedding of the rendered image and the embedding of the text prompt. For two-prompt interpolation, the method applies a 2-D softmax to the two prompt similarities and minimizes a loss that places the output style at a chosen interpolation point between prompt A and prompt B (Mayer et al., 13 Sep 2025). A plausible implication is that differentiable ISP can support objectives defined in semantic embedding spaces, not only in pixel or feature reconstruction spaces.
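
The ingredients named above, cosine similarity and a softmax over two prompt similarities, can be sketched as follows. The embeddings are toy 2-D vectors rather than CLIP outputs, and the squared-error form of `two_prompt_loss`, the `temperature` value, and the parameter name `alpha` are assumptions made for illustration; the paper's exact loss may differ.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_prompt_loss(image_emb, text_a, text_b, alpha, temperature=0.1):
    """Hypothetical loss: push the softmax share of prompt B toward alpha."""
    sim_a = cosine_similarity(image_emb, text_a)
    sim_b = cosine_similarity(image_emb, text_b)
    exp_a = math.exp(sim_a / temperature)
    exp_b = math.exp(sim_b / temperature)
    share_b = exp_b / (exp_a + exp_b)
    return (share_b - alpha) ** 2


prompt_a = [1.0, 0.0]          # toy stand-in for a text embedding
prompt_b = [0.0, 1.0]          # toy stand-in for a second text embedding
halfway_image = [1.0, 1.0]     # equally similar to both prompts

balanced = two_prompt_loss(halfway_image, prompt_a, prompt_b, alpha=0.5)
mismatched = two_prompt_loss(prompt_a, prompt_a, prompt_b, alpha=1.0)
```

In an actual tuning loop the image embedding would depend on $\phi$ through the rendered image, so gradients of such a loss would flow back through the image encoder into the color matrix.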

5. Representative systems and empirical patterns

Empirical results across the literature indicate that differentiable ISP is useful in several distinct operating regimes, including restoration, learned photography, efficiency-constrained deployment, multi-camera adaptation, and vision-task optimization.

ReconfigISP reports substantial gains in low-light image restoration and detection. On the SID dataset, the default ISP yields PSNR = 15.69, while ReconfigISP reaches PSNR = 25.65; on the S7 ISP dataset, the default ISP yields PSNR = 21.08 and ReconfigISP reaches PSNR = 23.31. For object detection on a custom OnePlus low-light dataset, Lightroom JPEG gives AP 0.318, software ISP 0.515, and ReconfigISP 0.601 (Yu et al., 2021). The same study reports efficiency-adapted variants, including ReconfigISP-Fast at 0.63 s/MP versus 1.16 s/MP with a PSNR drop of about 2 dB, and ReconfigISP-Faster at 0.049 s/MP while still matching or outperforming the default ISP (Yu et al., 2021).

For learned RAW→RGB conversion, AWNet and LW-ISP emphasize quality-efficiency trade-offs. AWNet reports an ensemble result of PSNR 21.97 and SSIM 0.7818 on the AIM2020/ZRR smartphone versus DSLR test set, with challenge placements of 5th in Track 1 and 2nd in Track 2 (Dai et al., 2020). LW-ISP reports PSNR = 21.57 dB versus PyNET’s 21.19 dB on the Zurich RAW→RGB benchmark, with 2.01 M parameters versus 47.55 M, 4.23 G FLOPs@224² versus 342.7 G, and inference time at 12 MP on a V100 GPU of 0.25 s versus 3.8 s (Chen et al., 2022). The Mobile AI challenge report adds deployment-scale evidence: models compatible with the MediaTek Dimensity 1000+ NPU process Full HD photos under 60–100 milliseconds, with reported test-set results including PSNR 23.73 and SSIM 0.8487 for AIISP at 90.8 ms, and PSNR 23.20 and SSIM 0.8467 for dh_isp at 61 ms (Ignatov et al., 2021).

Global-context designs show that patch-trained neural ISPs can benefit from explicit full-image guidance. On ZRR Small, LiteISP + CMod with full-image guidance reports PSNR 24.79, SSIM 0.8593, LPIPS 0.135, compared to LiteISP at PSNR 22.18, SSIM 0.8305, LPIPS 0.162; on ISPIW, LiteISP + CMod reports PSNR 24.04 and SSIM 0.8409 versus LiteISP at PSNR 22.14 and SSIM 0.8146 (Elezabi et al., 2024). The paper also reports that SimpleISP has 0.064 M parameters and 8.42 G MACs at its evaluated input size, and rivals state-of-the-art with 20× fewer parameters (Elezabi et al., 2024).

Forward/inverse and multi-camera systems show that differentiable ISP can model camera-specific variation rather than only a single rendering pipeline. ParamISP reports inverse-ISP and forward-ISP PSNRs across five cameras that improve over the best prior method by about 2 dB on average, with only 0.7 M parameters (Kim et al., 2023). Uni-ISP reports that on FiveCam, inverse ISP PSNR improves from 31.21 dB for ParamISP to 32.70 dB, and forward ISP PSNR improves from 26.74 dB to 29.15 dB, with corresponding SSIM improvements from 0.918 to 0.940 for inverse and 0.918 to 0.931 for forward (Li et al., 2024).

Task-driven differentiable ISPs provide a different empirical pattern: the primary metric is not necessarily image fidelity. Dark-ISP, trained end-to-end for low-light object detection, reports COCO mAP 70.4 on the LOD dataset with a ResNet-50 backbone, compared with 67.0 for IA-ISP, 67.9 for LIS, and 66.0 for default ISP→detector, using about 0.49 MB of ISP parameters and about 3.4 ms/frame on a single GPU (Guo et al., 11 Sep 2025). DualDn shows that a differentiable ISP can serve purely as a training scaffold: it is used to co-train raw-domain and sRGB-domain denoisers under varied synthetic ISP settings and then discarded at inference time (Li et al., 2024).

Language-based tuning demonstrates a still narrower but conceptually important use case. Optimization of only a $3 \times 3$ color-adjustment matrix yields changes in CLIP-IQA colorfulness of up to 0.49 and in Hasler colorfulness of up to 61.3 for vibrant versus dull prompts, with prompt varieties including color names, cultural references, and emotions; two-prompt interpolation produces smooth visual transitions controlled by the interpolation weight, and the final images exhibit the intended color cast without introducing artifacts (Mayer et al., 13 Sep 2025).

6. Interpretability, misconceptions, and open problems

A persistent tension in the literature concerns modularity versus full end-to-end learning. The survey notes that fully end-to-end nets lack explicit control over individual ISP parameters such as white balance, complicating fine-tuning (Silva et al., 2023). ReconfigISP addresses this by retaining interpretable stage boundaries and searching over explicit modules (Yu et al., 2021). ParamISP combines fixed canonical steps with learned local and global refinements conditioned on camera parameters (Kim et al., 2023). Language-based color tuning goes further toward explicit control, because the only learnable ISP block is the constrained $3 \times 3$ color-adjustment matrix with white-point conservation and clipping to avoid runaway color shifts (Mayer et al., 13 Sep 2025). This suggests that differentiable ISP is not inherently opposed to interpretability; rather, the degree of interpretability depends on where one places the learnable components.

Another common misunderstanding is that differentiability requires direct differentiation through every classical algorithm exactly as originally written. ReconfigISP instead trains a proxy network for each non-differentiable module and tunes those proxies online during search (Yu et al., 2021). DualDn replaces hard branches and thresholding with smooth masks or smooth approximations so that back-propagation is never cut off (Li et al., 2024). By contrast, AWNet relies on exact differentiable wavelet operators rather than surrogates (Dai et al., 2020). These examples show that differentiable ISP includes both exact operator reformulation and learned approximation.
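
The difference between a hard threshold and a smooth approximation can be made concrete with a small sketch. The sigmoid relaxation, the `sharpness` parameter, and the sample values are illustrative assumptions and do not reproduce DualDn's actual masks.

```python
import math


def hard_mask(x, threshold):
    """Step function: its derivative is zero wherever it is defined."""
    return 1.0 if x > threshold else 0.0


def soft_mask(x, threshold, sharpness=20.0):
    """Sigmoid relaxation; larger sharpness approaches the hard step."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))


def numerical_grad(fn, x, eps=1e-4):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


hard_grad = numerical_grad(lambda v: hard_mask(v, 0.5), 0.4)
soft_grad = numerical_grad(lambda v: soft_mask(v, 0.5), 0.4)
```

Away from the threshold the hard mask gives a zero gradient, so nothing upstream can learn from it, whereas the soft mask passes a nonzero gradient that shrinks smoothly as the input moves away from the decision boundary.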

The literature also identifies several unresolved problems. The survey highlights sensor diversity and misalignment, real-time and edge deployment, dynamic range and low light, interpretability and modularity, and domain generalization as open challenges (Silva et al., 2023). SimpleISP addresses the global-context deficit of patch-based training by conditioning color transformations on full-image guidance, but its motivation itself indicates that global properties such as color constancy and illumination remain difficult for patch-only methods (Elezabi et al., 2024). Uni-ISP addresses camera diversity through device-aware embeddings and a multi-camera training scheme, yet the same line of work makes clear that synchronously captured multi-camera datasets are scarce enough that a new 4K dataset, FiveCam, had to be constructed (Li et al., 2024). ParamISP notes that interpretability remains incomplete with respect to how aperture or focal length modulate ISP behavior and suggests future extensions to video ISPs or explicit disentanglement between photometric and geometric stages (Kim et al., 2023).

Future directions are stated explicitly in the survey: Vision Transformers as ISP backbones, neural architecture search for per-stage modules tailored to hardware constraints, self-supervised RAW↔RGB mapping, and hybrid pipelines that mix analytic differentiable blocks with learned modules for interpretability and robustness (Silva et al., 2023). The language-based color tuning paper leaves differentiability of demosaic, white-balance, tone-mapping, gamma, and other stages to future work, while noting that its architecture, loss formulas, and optimization recipes generalize directly to any other ISP stage that can be recast as a differentiable operator (Mayer et al., 13 Sep 2025). A plausible implication is that the field is moving toward hybrid systems in which analytic camera structure, task-driven optimization, semantic conditioning, and device-aware adaptation coexist within a single differentiable framework.