- The paper introduces a human-guided ground-truth generation method that enhances perceptual quality in image super-resolution training.
- It leverages diverse CNN and transformer architectures along with human annotations to build a dataset of 20,193 groups and 80,772 enhanced patches.
- 78.72% of the enhanced patches were rated positive by human evaluators, and models retrained on the dataset show clear perceptual gains, demonstrating the method's practical impact on Real-ISR.
# Human Guided Ground-truth Generation for Realistic Image Super-resolution
The paper "Human Guided Ground-truth Generation for Realistic Image Super-resolution" proposes a novel methodology for generating ground-truth (GT) data used to train realistic image super-resolution (Real-ISR) models. The research addresses significant limitations in current LR-HR pair generation techniques, which traditionally use high-resolution images as GTs without adequately considering human perception, thereby potentially constraining the perceptual quality of model outputs. The limitations of existing datasets often result in trained ISR models producing over-smoothed outputs or outputs with visual artifacts.
The authors introduce a human-guided GT generation approach to raise the perceptual quality of the GTs used in training. They first train multiple image enhancement models on high-quality photorealistic datasets to improve the perceptual quality of existing HR images. These models span different backbone architectures (CNN and transformer) and different levels of degradation handling, yielding a diverse pool of enhanced versions of each HR image. Volunteers are then enlisted to annotate high-quality regions of the enhanced HR images as positive GTs and regions containing artifacts as negative samples.
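A hypothetical sketch of this enhancement-pool step follows. The function name and the assumption that all models share a single interface are mine, not the authors' released code; each pre-trained enhancement model produces one candidate version of every HR patch, and it is these candidates that the volunteers label.

```python
import torch

def build_candidate_pool(hr_patches, models):
    """hr_patches: float tensor [N, 3, H, W] in [0, 1].
    models: list of pre-trained nn.Module enhancers (CNN or transformer)."""
    candidates = []
    with torch.no_grad():
        for model in models:
            model.eval()
            # One enhanced candidate per model for every HR patch.
            candidates.append(model(hr_patches).clamp(0, 1))
    # Shape [num_models, N, 3, H, W]: the pool presented to annotators.
    return torch.stack(candidates, dim=0)
```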
Through this process, a comprehensive human-guided GT dataset is constructed with clearly labeled positive and negative samples, enabling loss functions specifically tailored to Real-ISR model training. The dataset comprises 20,193 groups with 80,772 enhanced patches, of which a high proportion (78.72%) were marked "Positive" by the human annotators, indicating the effectiveness of the authors' enhancement process.
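Below is a minimal sketch of how such labels could enter a training objective: pull the super-resolved output toward the human-approved positive GT while pushing it away from artifact-marked negative patches. The hinge/margin formulation and the use of the third-party `lpips` package are assumptions for illustration; the paper's actual loss may differ.

```python
import torch
import lpips  # pip install lpips

percep = lpips.LPIPS(net="vgg")  # perceptual distance; expects inputs in [-1, 1]

def pos_neg_loss(sr, pos_gt, neg_gt, margin=0.5):
    """sr, pos_gt, neg_gt: float tensors [N, 3, H, W] scaled to [-1, 1]."""
    d_pos = percep(sr, pos_gt).mean()  # distance to the positive (approved) GT
    d_neg = percep(sr, neg_gt).mean()  # distance to the negative (artifact) patch
    # Hinge: minimize d_pos while keeping d_neg at least `margin` larger than d_pos.
    return d_pos + torch.clamp(margin + d_pos - d_neg, min=0.0)
```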
The implications for practice are substantial. Models trained on the proposed dataset show a marked improvement in qualitative assessments over those trained on traditional datasets such as DF2K-OST, and perceptual metrics such as LPIPS and DISTS improve consistently when state-of-the-art Real-ISR models are retrained on the new data. This supports the assertion that incorporating human perception into GT selection can significantly refine model outputs, with clear value for real-world ISR applications where visual fidelity is paramount.
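For reference, here is a hedged evaluation sketch using the two metrics named above. It relies on the third-party `lpips` and `piq` packages; both are real libraries, but their use here is my choice for illustration rather than the authors' evaluation code. Lower scores are better for both metrics.

```python
import torch
import lpips  # pip install lpips
import piq    # pip install piq

lpips_fn = lpips.LPIPS(net="alex")      # expects inputs scaled to [-1, 1]
dists_fn = piq.DISTS(reduction="mean")  # expects inputs in [0, 1]

def perceptual_scores(sr, gt):
    """sr, gt: float tensors [N, 3, H, W] in [0, 1]."""
    lp = lpips_fn(sr * 2 - 1, gt * 2 - 1).mean().item()  # rescale for LPIPS
    ds = dists_fn(sr, gt).item()
    return {"LPIPS": lp, "DISTS": ds}
```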
Moving forward, the GT-generation methodology proposed in this paper could encourage broader adoption of human-centric approaches to other problems where perceptual quality is crucial. Future work could also develop automatic annotation systems that reduce the labor of building such large-scale datasets by harnessing AI methodologies.
In conclusion, the paper offers strong empirical support for the role of human perception in guiding data generation for AI models, presenting a method for creating perceptually realistic GT data that markedly improves the fidelity of Real-ISR outputs. It marks a step toward closing the gap between synthetic training data and real-world applicability in image processing models.