RetouchLLM: Code-Based Image Retouching
- RetouchLLM is a training-free white-box image retouching system that uses iterative, interpretable code-based adjustments to enhance high-resolution photos.
- It integrates a visual critic to analyze photometric differences and a code generator to create explicit Python operations, enabling user-adapted editing.
- Evaluated on datasets like MIT-Adobe FiveK, it achieves competitive performance by combining transparency, control, and iterative refinement.
RetouchLLM is a training-free, white-box image retouching system designed to deliver interpretable, code-based enhancement of high-resolution photographs. Distinct from opaque, deep learning-based models, it requires no task-specific training and enables controllable, user-adapted adjustment paths. RetouchLLM operates by iteratively improving images in a multi-step fashion, emulating the staged workflows characteristic of human photo editing. Its architecture comprises two tightly coupled modules: a visual critic that decomposes the difference between the source and target styles, and a code generator that translates these differences into executable retouching operations. The resulting edits are explicitly represented, facilitating transparency, editability, and direct user interaction—an approach that generalizes across a diverse range of retouching styles.
1. Architecture and Workflow
RetouchLLM is structured as a closed-loop pipeline with two major components: the visual critic and the code generator. The process begins with an input source image and either reference images or natural language instructions. The visual critic, based on vision-LLMs, analyzes photometric discrepancies along multiple axes—exposure, contrast, highlight, shadow, saturation, temperature, and texture. For each iteration, it outputs several candidate textual descriptions that characterize plausible directions for retouching.
Each description is then provided to the code generator, which utilizes an LLM to plan and synthesize executable Python code, applying one or several image adjustment filters corresponding to the critic's suggestions. These generated programs are explicit, interpretable, and accessible for further editing or inspection.
Multiple candidate retouched images are produced and subsequently scored for reference-style alignment using a CLIP-based metric. The scoring mechanism computes the Kullback–Leibler (KL) divergence between the embedding distributions of each candidate and averaged reference embeddings, using both image and text modalities. The best aligned candidate is selected as the new source, and the iteration proceeds until a stopping criterion is met (e.g., satisfactory similarity, user termination).
The schematic workflow can be summarized as:
| Stage | Functionality | Output/Role |
|---|---|---|
| Input | Source & reference images / instructions | Initial photo and targets |
| Visual Critic | Photometric difference analysis | Multiple textual descriptions |
| Code Generator | Conversion to executable Python code for retouching | Several code-based retouching strategies |
| Execution+Scoring | Filter application and CLIP/KL scoring | Set of retouched candidates; best pick |
| Iteration | Repeat with updated source | Progressive enhancement |
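A minimal sketch of this closed loop is shown below. All helper names (`describe_differences`, `generate_retouch_code`, `run_program`, `clip_kl_score`) and the tolerance-based stopping rule are illustrative assumptions, not the authors' actual API:

```python
# Illustrative sketch of the RetouchLLM closed loop.
# describe_differences, generate_retouch_code, run_program, and
# clip_kl_score are hypothetical placeholders for the critic, code
# generator, executor, and CLIP/KL scorer described in the text.

def retouch(source, references, max_iters=10, tol=1e-3):
    image = source
    prev_score = float("inf")
    for _ in range(max_iters):
        # 1. Visual critic: several candidate descriptions of the gap.
        descriptions = describe_differences(image, references)
        # 2. Code generator: one executable program per description.
        programs = [generate_retouch_code(d) for d in descriptions]
        # 3. Execute each program to obtain candidate retouched images.
        candidates = [run_program(p, image) for p in programs]
        # 4. Score candidates against the references (CLIP + KL).
        scores = [clip_kl_score(c, references) for c in candidates]
        best = min(range(len(candidates)), key=lambda i: scores[i])
        # 5. Stop once the improvement becomes negligible.
        if prev_score - scores[best] < tol:
            break
        image, prev_score = candidates[best], scores[best]
    return image
```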
2. Visual Critic: Photometric Analysis
The visual critic module constitutes the perceptual backbone of RetouchLLM. Using a vision-LLM, it decomposes aesthetic and structural differences between the source image and reference images (or user-supplied instructions). It does not restrict itself to a single interpretation; instead, it generates multiple candidate descriptions, each representing an actionable adjustment direction. These descriptions encompass a wide span of retouching factors: exposure (brightness), contrast, highlights, shadows, saturation, temperature (color warmth/coolness), and texture. This diversity increases the probability that subsequent edits will align well with the references, even when source and target images differ in content.
Experiments confirm that multiple candidates are essential for robust retouching, as they efficiently cover the parameter space of potential adjustments and facilitate exploration of varied enhancement paths. The critic thus anchors the retouching workflow in an interpretable and semantically rich manner, mitigating issues stemming from style ambiguity or content discrepancy.
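For intuition, the critic's multiple candidates can be thought of as structured adjustment directions along the supported axes. The format below is an illustrative assumption, not the paper's exact output schema:

```python
# Hypothetical example of what the visual critic might return for one
# iteration: several alternative, actionable adjustment directions.
critic_candidates = [
    {"exposure": "+0.3 EV", "contrast": "slightly higher", "temperature": "warmer"},
    {"shadows": "lift moderately", "saturation": "reduce slightly"},
    {"highlights": "recover", "texture": "increase clarity", "contrast": "higher"},
]
```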
3. Code Generator: Planning and Execution
The code generator module translates natural language descriptions into explicit, executable Python programs that perform the retouching. It leverages LLM-based planning to select from a set of supported image filters (exposure, contrast, saturation, temperature, highlight, shadow, texture), optimize their parameters, and produce code sequences that define adjustment strategies.
Each filter invocation is rendered in "white-box" form, such as:
```python
filter.exposure(image, value=1.2)
filter.contrast(image, value=1.1)
```
This explicit code structure enables several key capabilities: (1) transparency—the transformation applied is clearly visible and editable; (2) adaptability—users may inspect and modify filters and parameters; and (3) reuse—the generated code can function as a customizable retouching preset. The generative approach further yields multiple candidate programs per iteration, allowing the system to explore a diverse set of retouching paths through sequential application.
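The paper does not specify how the filters themselves are implemented; a minimal NumPy sketch of what white-box exposure, contrast, and temperature operations could look like is given below. The function names mirror the example above, but the formulas and sign conventions are assumptions:

```python
import numpy as np

# Assumed to live in a module named `filter`, so calls read as
# filter.exposure(image, value=1.2) in the generated code.

def exposure(image: np.ndarray, value: float = 1.0) -> np.ndarray:
    """Scale brightness multiplicatively; image is float RGB in [0, 1]."""
    return np.clip(image * value, 0.0, 1.0)

def contrast(image: np.ndarray, value: float = 1.0) -> np.ndarray:
    """Stretch pixel values around the mid-gray point 0.5."""
    return np.clip((image - 0.5) * value + 0.5, 0.0, 1.0)

def temperature(image: np.ndarray, value: float = 0.0) -> np.ndarray:
    """Warm (+) or cool (-) the image by shifting the red/blue channels."""
    out = image.copy()
    out[..., 0] = np.clip(out[..., 0] + value, 0.0, 1.0)  # red channel
    out[..., 2] = np.clip(out[..., 2] - value, 0.0, 1.0)  # blue channel
    return out
```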
4. Iterative Retouching Strategy
RetouchLLM's retouching process is inherently iterative and multi-step, mirroring professional editing practices where coarse, global adjustments precede fine, localized refinements. For each cycle:
- The visual critic produces candidate descriptions of source-to-target difference.
- The code generator converts these into candidate adjustment codes.
- Each program is executed, yielding several retouched images.
- Candidates are scored using a CLIP-based KL divergence alignment metric.
Selection is guided mathematically as follows. For each candidate image $I_k$ and a set of text prompts $\{t_j\}_{j=1}^{J}$, the CLIP encoders induce a distribution over prompts:

$$p_k(j) = \frac{\exp\big(\mathrm{sim}(E_I(I_k),\, E_T(t_j))\big)}{\sum_{j'} \exp\big(\mathrm{sim}(E_I(I_k),\, E_T(t_{j'}))\big)},$$

where $E_I$ and $E_T$ are the CLIP image and text encoders and $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. For reference images $\{R_m\}_{m=1}^{M}$, the averaged reference distribution is:

$$q(j) = \frac{1}{M} \sum_{m=1}^{M} \frac{\exp\big(\mathrm{sim}(E_I(R_m),\, E_T(t_j))\big)}{\sum_{j'} \exp\big(\mathrm{sim}(E_I(R_m),\, E_T(t_{j'}))\big)}.$$

The selection score is:

$$S_k = D_{\mathrm{KL}}\big(p_k \,\|\, q\big) = \sum_j p_k(j) \log \frac{p_k(j)}{q(j)}.$$

The candidate minimizing $S_k$ is selected. As the process repeats, the best score $S^{(t)} = \min_k S_k^{(t)}$ is non-increasing and converges ($S^{(t+1)} \le S^{(t)}$), ensuring progressive alignment toward the reference style.
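A sketch of this selection score is shown below, with the CLIP embedding calls abstracted into placeholder functions (`clip_image_embed`, `clip_text_embed` are assumed helpers returning unit-normalized vectors, not a specific library's API):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def prompt_distribution(image, prompts, clip_image_embed, clip_text_embed):
    """Distribution over text prompts induced by CLIP cosine similarities."""
    img = clip_image_embed(image)                           # (d,) unit vector
    txt = np.stack([clip_text_embed(t) for t in prompts])   # (J, d) unit vectors
    return softmax(txt @ img)

def selection_score(candidate, references, prompts, clip_image_embed, clip_text_embed):
    """KL divergence between candidate and averaged reference distributions."""
    p = prompt_distribution(candidate, prompts, clip_image_embed, clip_text_embed)
    q = np.mean(
        [prompt_distribution(r, prompts, clip_image_embed, clip_text_embed)
         for r in references],
        axis=0,
    )
    eps = 1e-8  # numerical stability
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```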
This iterative approach allows strong global changes initially, followed by finer detail-focused enhancements, reflecting human editing heuristics.
5. Interpretable and Interactive User Control
RetouchLLM incorporates user interaction both at the image and instruction level. Users may provide explicit instructions ("increase warmth, lower shadows") instead of image references. The system parses these directives, generating candidate adjustment codes while optionally allowing users to select preferred candidates after each step.
All code is human-readable and modifiable, which supports transparent, personalized retouching workflows. Users are empowered to review the editing history, adapt parameters, and create bespoke presets with minimal system opacity. This stands in contrast to black-box, end-to-end neural models, promoting trust and customizability.
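For example, an instruction such as "increase warmth, lower shadows" might be turned into a short, editable program in the same white-box form shown earlier; the parameter values here are purely illustrative:

```python
# Hypothetical code generated from the instruction
# "increase warmth, lower shadows"; values are illustrative.
image = filter.temperature(image, value=0.15)  # warmer color balance
image = filter.shadow(image, value=-0.20)      # darken shadow regions
```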
6. Experimental Evaluation
RetouchLLM has been evaluated on MIT-Adobe FiveK and PPR10K datasets using four quality metrics: PSNR (fidelity), SSIM (structure), LPIPS (perceptual similarity), and ΔE (CIELAB color difference). Results indicate that RetouchLLM outperforms the training-free baseline Z-STAR and achieves competitive performance with task-specific supervised models, all without requiring paired training data.
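A sketch of how three of these metrics can be computed with scikit-image is given below, assuming 8-bit RGB inputs of identical shape; LPIPS would additionally require the separate `lpips` package, and the exact ΔE formula used in the paper (CIE76 vs. CIEDE2000) is an assumption here:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from skimage.color import rgb2lab, deltaE_ciede2000

def evaluate(pred: np.ndarray, target: np.ndarray) -> dict:
    """pred/target: uint8 RGB arrays of identical shape."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)
    # Mean CIELAB color difference (CIEDE2000 variant shown here).
    delta_e = deltaE_ciede2000(rgb2lab(target / 255.0), rgb2lab(pred / 255.0)).mean()
    return {"PSNR": psnr, "SSIM": ssim, "deltaE": float(delta_e)}
```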
Ablation studies demonstrate the necessity of multiple candidate exploration, the critical roles of the visual critic and code generator, and the effectiveness of the CLIP/KL selection framework. The method scales directly to high-resolution images and generalizes across varied retouching styles.
7. Limitations and Future Directions
RetouchLLM exhibits several limitations. Its performance is contingent on the accuracy of the underlying vision-language models and LLMs; generated code may occasionally require re-querying due to errors or nonfunctional stubs. The system currently supports seven basic retouching operations, which could be expanded to accommodate a broader spectrum of nuanced effects.
The selection metric, while effective, relies on CLIP alignment; future work may explore perceptually tuned or adaptive scoring models that correlate even more closely with human judgments. An important trajectory is the integration of a broader filter set and the study of real-world human-AI interaction for improved personal adaptation and robustness.
A plausible implication is that RetouchLLM’s transparent, multi-candidate, code-driven paradigm could serve as a foundation for further research in iterative, interpretable, and user-controllable image enhancement, particularly as large vision-LLMs continue to advance.
RetouchLLM thus marks a significant development in training-free, interpretable photo retouching, combining multi-candidate exploration, code-centric adjustment, and transparent selection strategies for high-resolution, style-adaptive enhancement (Ye-Bin et al., 9 Oct 2025).