Dynamic Residual Encoding via SLCL
- The paper introduces dynamic residual encoding with a memory bank and slide-level contrastive learning to improve cancer subtyping and gene mutation prediction.
- It employs a Vision Transformer for tile encoding, K-means clustering for codebook generation, and VLAD aggregation refined by a Transformer for slide embedding.
- Experimental results show significant gains in F1 and AUC scores across multiple histopathology datasets over conventional methods.
Dynamic Residual Encoding with Slide-Level Contrastive Learning (DRE-SLCL) is a methodology for end-to-end representation learning of whole slide images (WSIs) in histopathology, aimed at overcoming GPU memory limitations arising from the massive number of tiles per gigapixel WSI. DRE-SLCL combines tile-level dynamic residual encoding with an external memory bank and slide-level contrastive learning, aligning image and text (report) embeddings to improve performance on cancer subtyping, recognition, and gene mutation prediction tasks (Jin et al., 7 Nov 2025).
1. Architectural Overview
DRE-SLCL is organized into four primary components for WSI representation:
- Tile Encoder: A Vision Transformer (ViT-256), pretrained with HIPT, encodes each 256×256 image tile into a $d$-dimensional feature vector $f_{i,j}$.
- Memory Bank: A two-level dictionary structure storing L2-normalized tile features for each slide, indexed as MemoryBank[$i$][$j$], where $i$ denotes the slide and $j$ the tile.
- Residual Encoder: Uses a fixed codebook $C = \{c_1, \dots, c_K\}$ (obtained via K-means) and implements a VLAD aggregation scheme. For each tile, its residual with respect to its nearest codeword is computed and aggregated, then concatenated and normalized before refinement via a small Transformer, producing the slide embedding $z_i$.
- Projection Head: A linear layer projects $z_i$ into a multimodal joint space of dimension 4096 for alignment with LLaMA2-7B text encodings of histopathology reports.
During inference, tile and residual encoders are applied to all tiles once per slide. During training, only a subset is processed per iteration, with features and residuals recomputed in a dynamic fashion leveraging the memory bank.
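A minimal PyTorch-style skeleton of how these four components could be wired together is sketched below. The 384-dimensional tile features, the 2-layer refinement Transformer, and the choice to feed the $K$ residual vectors to the Transformer as a token sequence are illustrative assumptions, not the authors' exact configuration; the memory bank is sketched separately in Section 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRESLCL(nn.Module):
    """Illustrative skeleton of the four DRE-SLCL components (dimensions assumed)."""

    def __init__(self, tile_encoder, codebook, tile_dim=384, joint_dim=4096):
        super().__init__()
        self.tile_encoder = tile_encoder            # HIPT-pretrained ViT for 256x256 tiles
        self.register_buffer("codebook", codebook)  # (K, tile_dim), fixed after K-means
        enc_layer = nn.TransformerEncoderLayer(d_model=tile_dim, nhead=4,
                                               batch_first=True)
        self.refiner = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.proj_head = nn.Linear(tile_dim, joint_dim)  # into the image-text joint space

    def encode_tiles(self, tiles):
        """Tile encoder: (n, 3, 256, 256) -> L2-normalized features (n, tile_dim)."""
        return F.normalize(self.tile_encoder(tiles), dim=-1)

    def encode_slide(self, tile_feats):
        """Residual encoder: VLAD over the fixed codebook, refined by a small Transformer."""
        assign = torch.cdist(tile_feats, self.codebook).argmin(dim=1)      # nearest codeword
        K, d = self.codebook.shape
        residuals = torch.zeros(K, d, device=tile_feats.device)
        residuals.index_add_(0, assign, tile_feats - self.codebook[assign])
        residuals = F.normalize(residuals.flatten(), dim=0).view(1, K, d)  # concat + L2-norm
        z = self.refiner(residuals).squeeze(0).mean(dim=0)                 # slide embedding
        return F.normalize(z, dim=-1)

    def project(self, z):
        """Projection head into the 4096-d joint space shared with report embeddings."""
        return F.normalize(self.proj_head(z), dim=-1)
```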
2. Memory Bank Structure and Update Mechanism
The memory bank is crucial for scaling to whole slides containing tens of thousands of tiles. Its structure and use are as follows:
- Initialization: Before training, ViT-256 encodes all tiles, and their L2-normalized features are stored in MemoryBank[$i$][$j$].
- Organization: Implemented as a dict of dicts, with top-level keys by slide and subordinate keys by tile index.
- Readout: To encode an entire WSI, all tile features are retrieved from MemoryBank[$i$].
- Dynamic Update: At each iteration during training, for every WSI in the batch, a randomly sampled subset of $n$ tiles is re-encoded and normalized, replacing the old features in the bank. This keeps the memory bank synchronized with ongoing updates to the tile encoder parameters (see the sketch below).
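A minimal sketch of this dict-of-dicts structure and its dynamic update follows; the class and method names are illustrative, and storing features detached (without gradients) is an assumption of this sketch.

```python
import random
import torch
import torch.nn.functional as F

class MemoryBank:
    """Two-level store: memory[slide_id][tile_idx] -> L2-normalized tile feature."""

    def __init__(self):
        self.memory = {}  # dict of dicts

    def init_slide(self, slide_id, tile_feats):
        """Populate a slide's entries once, before training, with ViT-256 features."""
        self.memory[slide_id] = {j: F.normalize(f, dim=-1).detach()
                                 for j, f in enumerate(tile_feats)}

    def update(self, slide_id, tile_indices, new_feats):
        """Overwrite the n re-encoded tiles so the bank tracks the current encoder."""
        for j, f in zip(tile_indices, new_feats):
            self.memory[slide_id][j] = F.normalize(f, dim=-1).detach()

    def readout(self, slide_id):
        """Return every stored feature for one slide as a (num_tiles, dim) tensor."""
        feats = self.memory[slide_id]
        return torch.stack([feats[j] for j in sorted(feats)])


def sample_tile_indices(num_tiles, n):
    """Randomly choose the n tile indices to re-encode in this iteration."""
    return random.sample(range(num_tiles), k=min(n, num_tiles))
```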
3. Dynamic Residual Encoding
To mitigate computational infeasibility with full-tile re-encoding and to control gradient memory footprint, DRE-SLCL implements dynamic residual encoding as follows:
- For slide $i$ with $N_i$ tiles, at each iteration randomly sample $n$ tiles, encode them, and update MemoryBank[$i$][$j$] for the sampled indices $j$.
- Retrieve all features $\{f_{i,j}\}_{j=1}^{N_i}$ for slide $i$ from the memory bank.
- Assign each tile feature to its nearest codebook entry: $k(j) = \arg\min_{k} \lVert f_{i,j} - c_k \rVert_2$.
- Aggregate the residuals per codeword: $r_k = \sum_{j:\, k(j)=k} \left(f_{i,j} - c_k\right)$.
- Concatenate and L2-normalize: $u_i = \mathrm{normalize}\big([r_1; r_2; \dots; r_K]\big)$.
- Feed $u_i$ through a small Transformer to yield the slide embedding $z_i$.
This incremental update and aggregation overcome sampling bias of previous methods and avoid re-encoding the full WSI per iteration.
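Combined into a single routine, one plausible per-iteration encoding of a slide is sketched below. It assumes the MemoryBank interface sketched in Section 2, lets gradients flow only through the freshly re-encoded tiles (a memory-saving choice this summary does not state explicitly), and uses an illustrative sample size.

```python
import torch
import torch.nn.functional as F

def encode_slide_dynamic(slide_id, tiles, tile_encoder, bank, codebook, refiner,
                         n_sample=32):
    """One training-iteration encoding of a WSI with dynamic residual encoding.

    `bank` follows the MemoryBank sketch in Section 2, `codebook` is the (K, d)
    K-means codebook, `refiner` is the small Transformer; n_sample is illustrative.
    """
    # 1) Randomly sample n tiles, re-encode them, and refresh the memory bank.
    idx = torch.randperm(len(tiles))[:n_sample].tolist()
    fresh = F.normalize(tile_encoder(tiles[idx]), dim=-1)   # gradients flow through these
    bank.update(slide_id, idx, fresh.detach())              # bank itself stays gradient-free

    # 2) Read back every tile feature for the slide, splicing in the fresh subset
    #    so only the sampled tiles contribute gradients this iteration.
    feats = bank.readout(slide_id).clone()
    feats[idx] = fresh

    # 3) VLAD: assign each tile to its nearest codeword, sum residuals per codeword.
    assign = torch.cdist(feats, codebook).argmin(dim=1)
    K, d = codebook.shape
    residuals = torch.zeros(K, d, device=feats.device)
    residuals.index_add_(0, assign, feats - codebook[assign])

    # 4) Concatenate, L2-normalize, and refine with the small Transformer.
    residuals = F.normalize(residuals.flatten(), dim=0).view(1, K, d)
    z = refiner(residuals).squeeze(0).mean(dim=0)            # slide embedding z_i
    return F.normalize(z, dim=-1)
```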
4. Slide-Level Contrastive Learning
To further enrich the learned slide representations, DRE-SLCL employs slide-level contrastive learning by aligning slide embeddings with textual report embeddings:
- Report Encoder: LLaMA2-7B encodes each pathology report into a normalized 4096-dimensional vector $t_i$.
- Image-Text Alignment: The slide embedding $z_i$ is projected by a linear layer and L2-normalized to obtain $v_i$, so $\lVert v_i \rVert_2 = 1$.
- Similarity Logits: For each batch of $B$ slides, define $S = V T^{\top} / \tau$, with $V, T \in \mathbb{R}^{B \times 4096}$ the batch matrices of slide and report embeddings and $\tau$ the learnable temperature parameter.
- Contrastive Loss: $\mathcal{L}_{\mathrm{con}} = \tfrac{1}{2}\left[\mathrm{CE}(S, y) + \mathrm{CE}(S^{\top}, y)\right]$, where the ground-truth targets $y_i = i$ correspond to matching slide/report indices.
Positive pairs are true slide-report correspondences in the mini-batch, with all others serving as negatives. This encourages the visual representation to capture semantics that are also salient in clinical text.
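In code, this batch loss could take the standard CLIP-style symmetric form sketched below; whether DRE-SLCL uses exactly this symmetric variant and a log-parameterized temperature is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def slide_report_contrastive_loss(v, t, log_tau):
    """Slide-level contrastive loss over a mini-batch.

    v: (B, 4096) L2-normalized projected slide embeddings.
    t: (B, 4096) L2-normalized LLaMA2-7B report embeddings.
    log_tau: learnable log-temperature, so tau = log_tau.exp().
    """
    logits = v @ t.T / log_tau.exp()                      # (B, B) similarity logits S
    targets = torch.arange(v.shape[0], device=v.device)   # matching indices are positives
    loss_i2t = F.cross_entropy(logits, targets)           # slide -> report direction
    loss_t2i = F.cross_entropy(logits.T, targets)         # report -> slide direction
    return 0.5 * (loss_i2t + loss_t2i)
```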
5. Training Pipeline
The DRE-SLCL training process comprises preparation and iterative training phases:
Preparation
- Extract features for all tiles with ViT-256 to populate the initial MemoryBank.
- Apply K-means clustering with $K$ clusters over all features to construct the codebook $C = \{c_1, \dots, c_K\}$ (see the sketch below).
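One possible codebook-construction step over the populated memory bank is shown here; this sketch swaps in scikit-learn's MiniBatchKMeans for scalability and uses $K = 256$, the largest size in the ablation, as an illustrative default.

```python
import numpy as np
import torch
from sklearn.cluster import MiniBatchKMeans  # mini-batch K-means, swapped in for scale

def build_codebook(memory_bank, num_clusters=256):
    """Cluster all stored tile features into K codewords (the fixed VLAD codebook C)."""
    feats = np.concatenate([
        torch.stack(list(tile_dict.values())).numpy()
        for tile_dict in memory_bank.memory.values()
    ])
    kmeans = MiniBatchKMeans(n_clusters=num_clusters, batch_size=4096).fit(feats)
    return torch.from_numpy(kmeans.cluster_centers_).float()  # (K, d)
```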
Training Loop
- For each batch of slides and their corresponding tiles and reports:
- Dynamic Sampling & Update: For each slide, randomly select $n$ tiles, re-encode them, and update the MemoryBank.
- Residual Encoding: Retrieve all tile features and perform VLAD + Transformer encoding.
- Contrastive Alignment: Encode reports, compute projected slide embeddings, calculate similarity logits and contrastive loss.
- Classification: Use the slide embedding for downstream tasks, with a classification head trained via cross-entropy.
- Optimization: Backpropagate on the combined contrastive and classification losses.
Initial epochs keep the tile encoder frozen; subsequent epochs fine-tune the entire network end-to-end, leveraging the up-to-date memory bank.
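A condensed sketch of one training iteration that ties these steps together follows, reusing the helpers sketched in the earlier sections; the equal loss weighting (`lam`), the classification head, and the per-report encoding call are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(batch, tile_encoder, bank, codebook, refiner, proj_head,
                  cls_head, report_encoder, log_tau, optimizer, lam=1.0):
    """One iteration: dynamic encoding, contrastive alignment, classification."""
    slide_embs, report_embs, labels = [], [], []
    for slide_id, tiles, report, label in batch:
        # Dynamic sampling & update plus residual encoding (Sections 2-3).
        z = encode_slide_dynamic(slide_id, tiles, tile_encoder, bank,
                                 codebook, refiner)
        slide_embs.append(z)
        report_embs.append(report_encoder(report))    # 4096-d LLaMA2-7B embedding
        labels.append(label)

    z = torch.stack(slide_embs)
    v = F.normalize(proj_head(z), dim=-1)              # projected slide embeddings
    t = F.normalize(torch.stack(report_embs), dim=-1)  # report embeddings
    y = torch.tensor(labels)

    # Combined contrastive + classification objective (Section 4 loss sketch).
    loss = slide_report_contrastive_loss(v, t, log_tau) \
           + lam * F.cross_entropy(cls_head(z), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```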
6. Experimental Results
DRE-SLCL was evaluated on several large-scale histopathology datasets:
| Task | Dataset | DRE-SLCL Metric | Best Baseline | Δ (pp) |
|---|---|---|---|---|
| LUAD vs. LUSC Subtyping | TCGA_LUNG | F1 = 80.88% | 76.05% | +4.83 |
| LUAD vs. LUSC Subtyping | CPTAC_LUNG | F1 = 76.43% | 68.65% | +7.78 |
| Cancer Recognition | CPTAC_PDA | AUC = 92.34% | 89.51% | +2.83 |
| TP53 Mutation Prediction | TCGA_LUAD | AUC = 71.33%, F1 = 82.76% | 69.23% (AUC), 79.40% (F1) | +2.10 (AUC), +3.36 (F1) |
DRE-SLCL consistently outperformed MIL, Transformer-MIL, HIPT, Prov-GigaPath, and prior end-to-end schemes across cancer subtyping, recognition, and gene mutation prediction.
7. Ablation Studies and Interpretations
Ablation experiments on TP53 mutation prediction tested key hyperparameters and loss components:
- Batch Size: Larger batches (up to 64) yielded better AUC, attributed to a greater number of negatives in the contrastive loss.
- Codebook Size: Increasing from 32 to 256 improved AUC, highlighting the importance of fine-grained residual encoding.
- Tiles per WSI in Update ($n$): Best performance was attained at an intermediate value of $n$, reflecting a balance between computational efficiency and representative sampling.
- Contrastive Loss: Removing the contrastive loss reduced AUC by roughly 2–3%; adding the same loss to ABMIL improved its results, but by less than full DRE-SLCL achieved.
- End-to-End Training: End-to-end fine-tuning (unfreezing ViT-256 after initial freezing) with memory bank yielded the highest performance.
These findings substantiate that the combination of dynamic VLAD residual encoding and a synchronized memory bank addresses sampling bias, while slide-level contrastive alignment with real pathology reports enhances generalization. Optimal choices of mini-batch size, codebook granularity, and tile sampling rate are critical for best empirical results.