Memory-Augmented Patch Encoders
- Memory-augmented patch encoders are neural architectures that combine deep feature extraction with external patch memories to enrich local context in vision tasks.
- They use both patch-wise and layer-wise memory banks along with adaptive coreset sampling to balance computational efficiency and performance.
- Applied in anomaly detection and image inpainting, these encoders achieve high accuracy and robustness, outperforming baseline methods in efficiency and visual fidelity.
Memory-augmented patch encoders are a class of neural architectures that explicitly leverage external memory banks structured at the patch level to store, retrieve, and manipulate local feature representations during tasks such as anomaly detection and image inpainting. These encoders integrate fixed or adaptive patch memories—populated with embedding vectors from reference images or unmasked regions—enabling rich, non-parametric local context utilization. Such designs offer improved generalization to rare or out-of-distribution phenomena compared to purely parametric deep architectures, and achieve strong tradeoffs between efficiency, accuracy, and realism in high-resolution vision tasks.
1. Architectural Fundamentals of Memory-Augmented Patch Encoders
At their core, memory-augmented patch encoders combine a parametric feature extraction backbone with an explicitly maintained patch memory. Images are processed through a deep encoder (e.g., Wide-ResNet50), yielding mid-level feature maps. These feature maps are split into non-overlapping patches, and for each patch index the corresponding feature vectors, whose dimension matches the backbone channel count, form a local patch-level memory (Kim et al., 2022; Xu et al., 2020).
This structure can be formalized as follows:
- Patch Indexing: The input is encoded by a frozen, ImageNet-pretrained backbone (e.g., Wide-ResNet50). For each mid-level layer, the spatial feature map is divided into a 7 x 7 grid of patches, each identified by a patch index.
- Memory Construction (Training): For every training image, each patch yields a set of feature vectors. These sets are concatenated across the dataset to form the memory for that patch index.
- Dimensionality: The size of each patch memory is determined by the training set size, the number of feature vectors per patch, and the channel dimension.
A similar principle governs image inpainting, although the memory is typically constructed per image from unmasked regions; see (Xu et al., 2020).
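The dataset-wide memory construction described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `build_patch_memories`, the toy tensor shapes, and the use of random numbers in place of frozen backbone features are all assumptions made for the example.

```python
import numpy as np

def build_patch_memories(features, grid=7):
    """Group feature vectors by patch index across a set of training images.

    features: array of shape (N, C, H, W) holding one backbone layer's
              feature maps for N images (H and W must be divisible by grid).
    Returns a list of grid*grid arrays, each of shape (N * h * w, C),
    where h = H // grid and w = W // grid.
    """
    n, c, height, width = features.shape
    h, w = height // grid, width // grid
    # Split each spatial axis into (grid index, offset inside the patch).
    blocks = features.reshape(n, c, grid, h, grid, w)
    # Reorder to (grid_row, grid_col, N, h, w, C) so each patch can be sliced out.
    blocks = blocks.transpose(2, 4, 0, 3, 5, 1)
    # For every patch index, flatten everything except the channel axis.
    return [blocks[i, j].reshape(-1, c) for i in range(grid) for j in range(grid)]

# Toy stand-in for backbone features: 5 images, 16 channels, 14x14 feature map.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16, 14, 14))
memories = build_patch_memories(feats)
```

Each entry of `memories` holds the vectors for one patch position only, so a test patch never has to be compared with features from other locations.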
2. Patch-wise and Layer-wise Memory Banks
Memory-augmented patch encoders, as operationalized in Fast Adaptive Patch Memory (FAPM) (Kim et al., 2022), employ both patch-wise and layer-wise memory structures:
- Patch-wise Memory: Each patch index has its own independent memory bank, aggregating feature vectors across all training images.
- Layer-wise Memory: A separate memory is maintained for each backbone layer, preserving the multi-level semantic richness without upsampling deeper features, which would otherwise incur significant compute and memory cost.
Each layer's memory therefore consists of the patch-wise banks built from that layer's features. At inference, patch features from each relevant layer are compared solely within the corresponding layer’s memory, and the results are fused post hoc by weighted aggregation. This approach avoids the concatenation and upsampling bottlenecks of prior methods (e.g., PatchCore, PaDiM).
In the context of texture inpainting, a per-image patch memory $\mathcal{T}\in\mathbb{R}^{N_{\text{pool}}\times k_p\times k_p\times 3}$ is constructed from unmasked regions and encoded via lightweight patch “heads” (Xu et al., 2020).
3. Memory Compression via Adaptive Coreset Sampling
A central technical challenge is maintaining tractable memory and inference cost, especially when populating each patch memory with numerous high-dimensional vectors. FAPM introduces an adaptive coreset sampling mechanism (Kim et al., 2022):
- Basic Farthest-Point Coreset: For each patch memory, select cluster centers (with the number of centers set to retain 10% of the vectors) using a greedy facility-location algorithm.
- Cluster Spread and Adaptation: For each cluster, compute the normalized spread. If the largest spread exceeds a threshold, double the number of centers and resample.
- Benefits: "Wide" (high-variance) patch memories retain more centers; "narrow" patches are aggressively downsampled, yielding significant memory and runtime savings with minimal loss in coverage.
At inference, only the reduced key set is used for nearest-neighbor computations, so the number of comparisons per patch scales with the number of retained centers rather than with the full memory size.
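The sampling loop described above can be sketched as follows. This is a rough illustration under stated assumptions, not FAPM's exact procedure: the source does not preserve the spread definition or the threshold, so the sketch takes the spread to be the largest point-to-center distance divided by the square root of the feature dimension, and `tau` is a hypothetical threshold.

```python
import numpy as np

def farthest_point_coreset(memory, k, start=0):
    """Greedy k-center selection: repeatedly add the point farthest from the chosen set."""
    chosen = [start]
    dist = np.linalg.norm(memory - memory[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        # Keep, for every point, the distance to its nearest chosen center.
        dist = np.minimum(dist, np.linalg.norm(memory - memory[nxt], axis=1))
    return np.array(chosen), dist

def adaptive_coreset(memory, ratio=0.10, tau=0.5):
    """Start with `ratio` of the vectors and double the center count while clusters are too wide.

    ASSUMPTION: the spread measure and `tau` are illustrative choices; the
    largest per-cluster radius equals the largest nearest-center distance.
    """
    n, c = memory.shape
    k = max(1, int(round(ratio * n)))
    while True:
        idx, dist = farthest_point_coreset(memory, k)
        spread = dist.max() / np.sqrt(c)
        if spread <= tau or k >= n:
            return memory[idx]
        k = min(2 * k, n)

rng = np.random.default_rng(0)
narrow_core = adaptive_coreset(rng.normal(scale=0.01, size=(200, 8)))
wide_core = adaptive_coreset(rng.normal(scale=1.0, size=(200, 8)))
```

In this toy run the tightly clustered memory stops at the initial 10% budget, while the widely scattered one keeps more centers.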
4. Inference and Fusion Mechanisms
The inference phase capitalizes on the memory organization for efficient and robust decision-making:
- Anomaly Detection: Each patch’s features from the test image are compared (Euclidean distance, no normalization) with the memory for the same patch index. The minimum distance serves as the patch-level anomaly score, and the top-$k$ largest distances are retained to stabilize the score.
- Fusion: Patch scores are fused across layers (weighted sum or max), arranged into a 7x7 map, upsampled, and Gaussian-blurred to form a full-resolution heatmap. The maximal patch-level score gives the image-level anomaly score.
- Image Inpainting: For each patch in a missing region, the top-$N_c$ candidates ($N_c=4$) are retrieved from the patch memory $\mathcal{T}$ by normalized dot-product similarity and discretely sampled via a straight-through estimator (to avoid blurry soft interpolation). These candidate textures are fused in a patch synthesis module (PSNet) via skip connections at multiple spatial scales (Xu et al., 2020).
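For the anomaly-detection branch, the per-patch scoring step might look like the sketch below. The function name, the default `k`, and the choice to average the retained distances are assumptions for illustration; the source specifies only Euclidean nearest-neighbor distances and a top-k selection.

```python
import numpy as np

def patch_anomaly_score(query, memory, k=3):
    """Score one test patch against the memory bank of the same patch index.

    query:  (m, C) feature vectors of the test patch.
    memory: (n, C) coreset keys stored for that patch index.
    Each query vector gets its minimum Euclidean distance to the memory;
    the k largest of those minima are averaged into one patch score.
    ASSUMPTION: k and the averaging are illustrative, not from the source.
    """
    diff = query[:, None, :] - memory[None, :, :]        # (m, n, C)
    nearest = np.linalg.norm(diff, axis=2).min(axis=1)   # (m,)
    top = np.sort(nearest)[-k:]                          # up to k largest minima
    return float(top.mean())

memory = np.eye(4)
normal_patch = np.eye(4)[:2]         # vectors that already exist in memory
shifted_patch = normal_patch + 5.0   # vectors far from every stored key
normal_score = patch_anomaly_score(normal_patch, memory)
shifted_score = patch_anomaly_score(shifted_patch, memory)
```

A patch whose features already appear in memory scores zero, and the score grows as features move away from every stored key.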
5. Integration with Task-Specific Pipelines
Memory-augmented patch encoders are tightly integrated into the end-to-end task pipeline.
In industrial anomaly detection (Kim et al., 2022):
- Training involves constructing the downsampled patch and layer-wise memories via adaptive coreset.
- Inference applies per-patch nearest-neighbor distance computation, patch/layer fusion, and upsampling to generate pixel-level and image-level anomaly scores.
- FAPM achieves an image-level AUROC of 99.0%, a pixel-level AUROC of 98.0%, and a throughput of 44.1 FPS on the MVTec AD test set, with a memory bank (tens of MB) that fits in GPU DRAM.
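The fusion and upsampling stage of this pipeline can be sketched with plain NumPy. The output resolution (`out_size=224`), the blur width (`sigma=4.0`), and nearest-neighbor upsampling are assumptions chosen for the example, since the source does not preserve those values.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at three standard deviations."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def anomaly_heatmap(layer_scores, weights, out_size=224, sigma=4.0):
    """Fuse per-layer 7x7 patch-score maps into a blurred full-resolution heatmap.

    ASSUMPTION: out_size and sigma are placeholder values for illustration.
    """
    fused = sum(w * s for w, s in zip(weights, layer_scores))   # weighted sum over layers
    image_score = float(fused.max())                            # image-level anomaly score
    scale = out_size // fused.shape[0]
    heat = np.kron(fused, np.ones((scale, scale)))              # nearest-neighbor upsampling
    kernel = gaussian_kernel(sigma)
    pad = len(kernel) // 2
    padded = np.pad(heat, pad, mode="edge")
    # Separable Gaussian blur: filter along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
    return blurred, image_score

maps = [np.zeros((7, 7)), np.zeros((7, 7))]
maps[0][3, 3] = 2.0
maps[1][3, 3] = 4.0
heatmap, image_score = anomaly_heatmap(maps, weights=[0.5, 0.5])
```

Because each layer contributes only a small score map, fusion happens before any upsampling, which is the cost saving the layer-wise design is meant to provide.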
In image inpainting (Xu et al., 2020):
- A coarse encoder-decoder network first fills missing regions.
- Structural patches in the filled regions query a per-image patch memory using differentiable retrieval.
- The resulting texture candidates are fused (with contextual neighborhoods) in a two-stream PSNet, which is optimized with a combined objective that includes VGG perceptual, GAN, and distribution-alignment (patch distribution loss) terms.
- On the Places dataset, T-MAD achieves PSNR = 17.20 dB, SSIM = 0.867, and FID = 8.13 at 47 FPS, outperforming DeepFill and other baselines.
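The retrieval step of the inpainting pipeline can be illustrated with a forward-only sketch. The function and argument names are hypothetical, and the straight-through estimator that makes the discrete selection trainable is omitted here because it requires an autograd framework.

```python
import numpy as np

def retrieve_candidates(query, memory_keys, memory_patches, n_c=4):
    """Return the n_c memory patches most similar to the query, best first.

    query:          (D,) embedding of a patch in the coarsely filled region.
    memory_keys:    (N, D) embeddings of patches from unmasked regions.
    memory_patches: (N, k, k, 3) the stored RGB patches themselves.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    keys = memory_keys / (np.linalg.norm(memory_keys, axis=1, keepdims=True) + 1e-8)
    sim = keys @ q                       # normalized dot-product similarity, shape (N,)
    top = np.argsort(-sim)[:n_c]         # indices of the n_c highest similarities
    return memory_patches[top], sim[top]

# Toy memory: six keys with different magnitudes; patch i is filled with the value i.
keys = np.eye(6) * np.array([1.0, 10.0, 100.0, 1.0, 1.0, 1.0])[:, None]
patches = np.arange(6.0).reshape(6, 1, 1, 1) * np.ones((6, 2, 2, 3))
query = np.array([1.0, 0.5, 0.2, 0.1, 0.0, 0.0])
candidates, scores = retrieve_candidates(query, keys, patches)
```

Normalizing both sides makes the ranking depend on direction only, so keys with large magnitudes do not dominate the retrieval.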
6. Design Trade-offs, Ablation Studies, and Hyperparameter Sensitivity
Quantitative ablation studies highlight critical trade-offs:
| Variant (FAPM) | Img AUROC | Pix AUROC | FPS |
|---|---|---|---|
| PatchCore (10% memory) | 98.9% | 98.1% | 23.4 |
| Patch-wise memory only | 98.5% | 97.5% | 34.1 |
| Layer-wise memory only | 98.3% | 97.1% | 35.7 |
| Patch- & layer-wise, 10% | 98.4% | 97.8% | 46.0 |
| Full FAPM (adaptive) | 99.0% | 98.0% | 44.1 |
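To make the trade-off concrete, the short calculation below compares the full FAPM row with the PatchCore row. It uses only the figures from the table; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the ablation table above.
results = {
    "PatchCore (10% memory)": {"img_auroc": 98.9, "pix_auroc": 98.1, "fps": 23.4},
    "Full FAPM (adaptive)": {"img_auroc": 99.0, "pix_auroc": 98.0, "fps": 44.1},
}

base = results["PatchCore (10% memory)"]
fapm = results["Full FAPM (adaptive)"]

speedup = fapm["fps"] / base["fps"]                            # throughput ratio
img_delta = round(fapm["img_auroc"] - base["img_auroc"], 1)    # percentage points
pix_delta = round(fapm["pix_auroc"] - base["pix_auroc"], 1)    # percentage points

print(f"speedup: {speedup:.2f}x, image AUROC {img_delta:+.1f}, pixel AUROC {pix_delta:+.1f}")
```

Full FAPM therefore runs at nearly twice the throughput of PatchCore, with AUROC differences of about a tenth of a percentage point in either direction.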
Key hyperparameters for FAPM include:
- Input size: each image is divided into a 7x7 grid of patches
- Layers: mid-level Wide-ResNet50 blocks
- Initial downsampling: 10% coreset, with an adaptive spread threshold
- Top-$k$ nearest keys for inference, followed by Gaussian blur
Ablation on T-MAD (Xu et al., 2020) reveals:
- Optimal number of texture candidates $N_c=4$; larger $N_c$ degrades FID.
- Patch distribution loss is essential; standard GAN loss increases FID.
- Blending loss removal causes patch-boundary artifacts.
7. Task-Specific Implications and Comparison with Related Methods
Memory-augmented patch encoders provide distinct advantages:
- In anomaly detection, FAPM’s decoupling of layer-wise and patch-wise memories drastically reduces upsampling and nearest-neighbor computation costs, enabling real-time throughput with minimal accuracy loss (Kim et al., 2022).
- In inpainting, per-image memory supports high-fidelity texture reproduction by reusing real content from the unmasked regions of the same image, while still enabling end-to-end differentiable learning (Xu et al., 2020).
Compared to baseline methods such as PatchCore, PaDiM, and DeepFill, the introduction of adaptive coreset strategies, per-image (rather than dataset-wide) memories, and differentiable, discrete memory querying yields marked improvements in efficiency, robustness, and output realism.
A plausible implication is that memory-augmented patch encoders provide a flexible framework for out-of-distribution reasoning and sample-efficient learning in dense vision tasks, especially where local context and per-instance adaptation are crucial.