
PatchGAN Discriminator Overview

Updated 6 January 2026
  • PatchGAN Discriminator is a CNN-based module in GANs that assesses local image patches, ensuring high-frequency detail and texture fidelity.
  • It outputs a spatial probability map that evaluates 70×70 patches, a key design choice popularized by Pix2Pix for enforcing local realism.
  • Variants such as Temporal PatchGAN and SPatchGAN extend the approach to video and global feature matching, enhancing performance in tasks like underwater reconstruction and image translation.

A PatchGAN discriminator is a convolutional neural network (CNN) module used in generative adversarial networks (GANs) to determine if local image or video patches are real or fake. Rather than producing a single scalar for an entire image, PatchGAN outputs a spatial probability map, where each scalar assesses the realism of a corresponding patch of the input. This architecture was introduced to ensure high-frequency detail and textural realism in image-to-image and video-to-video translation, as well as related generation tasks. PatchGAN variants have been widely adopted in applications requiring fine-grained texture discrimination, including underwater image enhancement, video inpainting, and unsupervised domain translation (Akash et al., 5 Dec 2025, Shao et al., 2021, Chang et al., 2019).

1. Architectural Design and Implementation

A canonical PatchGAN discriminator processes as input the concatenation of a source (e.g., low-quality or masked) image (or video clip) and its target (either ground truth or generator output) along the channel dimension. The core architectural characteristic is a fully convolutional stack, where all outputs have limited receptive fields corresponding to local, overlapping image (or video) patches. In the "Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator" application, the implementation is as follows (Akash et al., 5 Dec 2025):

Layer   Filters   Kernel   Stride   Padding   Output Resolution       Normalization   Activation
Conv1   64        4×4      2        1         ⌊H/2⌋ × ⌊W/2⌋ × 64      —               LeakyReLU(0.2)
Conv2   128       4×4      2        1         ⌊H/4⌋ × ⌊W/4⌋ × 128     BatchNorm       LeakyReLU(0.2)
Conv3   256       4×4      2        1         ⌊H/8⌋ × ⌊W/8⌋ × 256     BatchNorm       LeakyReLU(0.2)
Conv4   512       4×4      1        1         ⌊H/8⌋ × ⌊W/8⌋ × 512     BatchNorm       LeakyReLU(0.2)
Conv5   1         4×4      1        1         ⌊H/8⌋ × ⌊W/8⌋ × 1       —               Sigmoid

Each output scalar in the final feature map “sees” a 70×70 region of the original input, the canonical patch size for PatchGAN as popularized in Pix2Pix.
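The 70×70 figure can be verified from the kernels and strides in the table above; the following is a small standalone check (plain Python, not from the cited paper):

```python
def receptive_field(kernels, strides):
    """Receptive field of one output unit of a convolutional stack.

    Walk backwards from a single output pixel: each layer maps a region of
    size r in its output to a region of size s*r + (k - s) in its input.
    """
    r = 1
    for k, s in zip(reversed(kernels), reversed(strides)):
        r = s * r + (k - s)
    return r

# The five 4x4 convolutions from the table: strides 2, 2, 2, 1, 1.
print(receptive_field([4, 4, 4, 4, 4], [2, 2, 2, 1, 1]))  # → 70
```

This also shows how the patch size is set purely by the architecture: adding or removing stride-2 layers grows or shrinks the receptive field geometrically.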

Pseudocode for forward and loss computation is as follows:

def D_forward(I_low, I_ref_or_pred):
    x = concat(I_low, I_ref_or_pred, dim=channels)              # B×6×H×W
    h1 = LeakyReLU(Conv2d(6, 64, 4, stride=2, pad=1)(x))
    h2 = LeakyReLU(BatchNorm(Conv2d(64, 128, 4, stride=2, pad=1)(h1)))
    h3 = LeakyReLU(BatchNorm(Conv2d(128, 256, 4, stride=2, pad=1)(h2)))
    h4 = LeakyReLU(BatchNorm(Conv2d(256, 512, 4, stride=1, pad=1)(h3)))
    h5 = Sigmoid(Conv2d(512, 1, 4, stride=1, pad=1)(h4))
    return h5  # B×1×(H/8)×(W/8) patch probability map
The loss is binary cross-entropy averaged over all patches and batch elements. The generator's adversarial loss is combined with a pixelwise L1 reconstruction loss weighted by λ = 100 (Akash et al., 5 Dec 2025).
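As an illustrative sketch of how these terms combine (NumPy, with toy values in place of real network outputs; `patch_bce` and the shapes here are assumptions for illustration):

```python
import numpy as np

def patch_bce(d_map, target):
    """Binary cross-entropy averaged over all patch outputs and the batch."""
    return float(np.mean(-(target * np.log(d_map)
                           + (1.0 - target) * np.log(1.0 - d_map))))

lam = 100.0                               # L1 weight reported in the paper
d_fake = np.full((2, 1, 30, 30), 0.5)     # toy patch map for generated pairs
y = np.ones((2, 3, 240, 240))             # toy ground truth
y_hat = y - 0.02                          # toy generator output

adv = patch_bce(d_fake, 1.0)              # generator wants patches judged real
l1 = float(np.mean(np.abs(y - y_hat)))    # pixelwise reconstruction term
loss_G = adv + lam * l1
```

With these toy numbers the adversarial term is about 0.69 while the weighted L1 term is 2.0, showing how λ = 100 keeps reconstruction dominant early in training.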

2. Patch-wise Adversarial Loss Formulation

PatchGAN uses a patch-based adversarial loss, enforcing realism at the local patch level:

\mathcal{L}_{GAN}(G,D) = \mathbb{E}_{x,y}\bigl[\log D(x,y)\bigr] + \mathbb{E}_{x}\bigl[\log(1 - D(x, G(x)))\bigr]

where D(x,y) is the discriminator's response map, averaged over all outputs, with each output corresponding to a 70×70 patch (for the standard configuration). The generator aims to maximize discriminator error over these local patches, while the discriminator seeks to correctly classify them as real or fake. The discriminator loss is

\mathcal{L}_D = -\,\mathbb{E}_{x,y}\bigl[\log D(x,y)\bigr] - \,\mathbb{E}_{x}\bigl[\log(1 - D(x, G(x)))\bigr]

(Akash et al., 5 Dec 2025).

3. High-Frequency Detail Preservation and Patch Size

By assigning real/fake labels at the patch level, PatchGAN focuses discriminator capacity on local realism, textural consistency, and edge fidelity. The overlapping nature of patches ensures spatial continuity. A patch size of 70×70 pixels is empirically large enough to capture local semantic structure and nontrivial objects, while retaining discrimination ability for fine-grained textures. If the patch is too large, the discriminator can memorize global color/lighting; too small, and it cannot enforce spatial coherence (Akash et al., 5 Dec 2025). The convolutional architecture with strides and padding automatically determines the receptive field per output and thus the patch size.

4. Variants and Extensions: Temporal and Statistical PatchGANs

PatchGAN has been extended in several domains:

Temporal PatchGAN adapts the architecture to video, replacing all 2D convolutions with 3D convolutions, such that each output maps to a spatio-temporal patch (e.g., 13 frames × 253×253 spatial pixels after 6 layers). This enables the discriminator to enforce both spatial detail and temporal consistency, an essential property for applications like video inpainting. The loss uses a hinge formulation:

L_D = \mathbb{E}_{x \sim P_{data}}\bigl[\mathrm{ReLU}(1 - D(x))\bigr] + \mathbb{E}_{z \sim P_z}\bigl[\mathrm{ReLU}(1 + D(G(z)))\bigr]

With spectral normalization on all convolutions, the network prevents temporal "flicker" and mode collapse, focusing both on sharp details and temporal smoothness (Chang et al., 2019).
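The hinge objective can be checked numerically; the following NumPy sketch (not the paper's code) uses constant score maps in place of discriminator outputs:

```python
import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Discriminator hinge loss over (spatio-temporal) patch score maps:
    real scores are pushed above +1, fake scores below -1."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))

# Well-separated toy scores incur zero loss ...
print(hinge_d_loss(np.full((1, 4, 8, 8), 1.5), np.full((1, 4, 8, 8), -1.5)))  # 0.0
# ... while scores inside the margin are penalized linearly.
print(hinge_d_loss(np.zeros((1, 4, 8, 8)), np.zeros((1, 4, 8, 8))))           # 2.0
```

Unlike the saturating log loss, the hinge saturates at zero once a patch is classified with margin, which is one reason it is favored for stabilizing video discriminators.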

SPatchGAN (Statistical PatchGAN) further generalizes the concept by extracting channel-wise statistics (mean, max, stddev) over all patches at several scales, feeding them to scale-specific MLP heads. This approach matches the distributions of such statistics between real and generated samples, giving greater stability and global feature enforcement, especially for tasks demanding significant shape deformation (e.g., selfie-to-anime). The loss is derived from the Least-Squares GAN (LSGAN) approach but computed over statistical summaries rather than raw convolutional outputs (Shao et al., 2021).
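A minimal sketch of the statistics-extraction step, assuming a B×C×H×W feature map (NumPy; the function name and shapes are illustrative, not SPatchGAN's actual code):

```python
import numpy as np

def patch_statistics(feat):
    """Reduce a B×C×H×W feature map to channel-wise mean, max, and standard
    deviation over all spatial positions (i.e., over all patches)."""
    mean = feat.mean(axis=(2, 3))
    mx = feat.max(axis=(2, 3))
    std = feat.std(axis=(2, 3))
    return np.concatenate([mean, mx, std], axis=1)   # B × 3C

feat = np.random.default_rng(0).normal(size=(2, 64, 16, 16))
stats = patch_statistics(feat)
print(stats.shape)   # (2, 192), fed to a scale-specific MLP head
```

Because the spatial dimensions are reduced away, the adversarial signal depends only on the distribution of patch responses, not on where individual patches fall, which is what relaxes the locality constraint.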

5. GAN Framework Integration and Training Workflow

PatchGAN is integrated in conditional GAN (cGAN) frameworks for paired or unpaired image translation. Training proceeds iteratively:

  1. Sample a mini-batch of source and target pairs.
  2. The generator transforms the source (e.g., I_low) to a candidate target ŷ.
  3. The discriminator receives real (I_low ‖ I_high) and fake (I_low ‖ ŷ) pairs and computes the patch-wise probability maps.
  4. Compute PatchGAN loss terms; update discriminator and generator through backpropagation.
  5. For image tasks, combine the patch-based adversarial loss with an L1 or L2 reconstruction loss in the generator, employing a high relative weight (e.g., λ = 100) for reconstruction to stabilize training (Akash et al., 5 Dec 2025, Shao et al., 2021).
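The steps above can be condensed into one toy training iteration (NumPy sketch; `G` and `D` are trivial stand-ins, not real networks, and parameter updates via backpropagation are omitted):

```python
import numpy as np

def G(x):                       # hypothetical generator: identity stand-in
    return x

def D(src, tgt):                # hypothetical discriminator: constant patch map
    return np.full((src.shape[0], 1, 4, 4), 0.5)

def patch_bce(d_map, target):   # BCE averaged over all patches and the batch
    return float(np.mean(-(target * np.log(d_map)
                           + (1.0 - target) * np.log(1.0 - d_map))))

lam = 100.0
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 32, 32))       # 1. mini-batch of sources
y = x + 0.1                               #    paired targets
y_hat = G(x)                              # 2. candidate target
d_real = D(x, y)                          # 3. patch maps for real / fake pairs
d_fake = D(x, y_hat)
loss_D = patch_bce(d_real, 1.0) + patch_bce(d_fake, 0.0)                  # 4.
loss_G = patch_bce(d_fake, 1.0) + lam * float(np.mean(np.abs(y - y_hat)))  # 5.
```

A real implementation would update D on `loss_D`, then update G on `loss_G`, alternating each iteration.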

For video PatchGANs, mini-batches include video clips; generator and discriminator operate on sequences, with the discriminator’s receptive field encompassing both time and space (Chang et al., 2019).

6. Comparative Analysis and Limitations

PatchGAN offers improved preservation of local detail, edge sharpness, and textural realism compared to global discriminators. However, because it is primarily local, it may not enforce global property consistency (e.g., object-level shape, inter-patch semantics) without additional constraints. This limitation is evidenced in tasks requiring major shape deformation, where statistical or multi-scale discriminators, such as SPatchGAN, provide enhanced performance by incorporating global patch statistics (Shao et al., 2021).

Temporal PatchGANs address the inability of 2D PatchGANs to enforce temporal coherence. Yet, their receptive fields, though large, are still fundamentally local compared to discriminators that analyze full sequences or objects holistically (Chang et al., 2019). No spectral normalization or explicit multi-scale discriminators were included in the "Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator" design, which closely follows the canonical PatchGAN as popularized in Pix2Pix (Akash et al., 5 Dec 2025).

7. Applications and Empirical Results

PatchGAN and its variants have demonstrated strong empirical results in:

  • Underwater image reconstruction: Used with a Swin Transformer-based generator, PatchGAN enables state-of-the-art PSNR (24.76 dB) and SSIM (0.89) on EUVP paired underwater datasets, with effective restoration of color, contrast, and haze reduction (Akash et al., 5 Dec 2025).
  • Unsupervised image-to-image translation: SPatchGAN outperforms PatchGAN and prior methods on selfie-to-anime, male-to-female, and glasses-removal tasks (noted with lower FID/KID, better shape consistency) (Shao et al., 2021).
  • Video inpainting: Temporal PatchGAN discriminator with 3D convolutional structure achieves improved temporal consistency and spatial sharpness compared to 2D PatchGAN, with empirical benefits demonstrated on the FaceForensics and FVI benchmarks (Chang et al., 2019).

PatchGAN has become a de facto standard discriminator module for local texture enforcement in GAN-based restoration and translation pipelines, with extensions addressing its locality and domain-specific requirements.
