
Dynamic Residual Encoding via SLCL

Updated 14 November 2025
  • The paper introduces dynamic residual encoding with a memory bank and slide-level contrastive learning to improve cancer subtyping and gene mutation prediction.
  • It employs a Vision Transformer for tile encoding, K-means clustering for codebook generation, and VLAD aggregation refined by a Transformer for slide embedding.
  • Experimental results show significant gains in F1 and AUC scores across multiple histopathology datasets over conventional methods.

Dynamic Residual Encoding with Slide-Level Contrastive Learning (DRE-SLCL) is a methodology for end-to-end representation learning of whole slide images (WSIs) in histopathology, aimed at overcoming GPU memory limitations arising from the massive number of tiles per gigapixel WSI. DRE-SLCL combines tile-level dynamic residual encoding with an external memory bank and slide-level contrastive learning, aligning image and text (report) embeddings to improve performance on cancer subtyping, recognition, and gene mutation prediction tasks (Jin et al., 7 Nov 2025).

1. Architectural Overview

DRE-SLCL is organized into four primary components for WSI representation:

  • Tile Encoder: A Vision Transformer (ViT-256), pretrained with HIPT, encodes each 256×256 image tile $x$ into a $d$-dimensional feature vector $f(x)\in\mathbb{R}^d$.
  • Memory Bank: A two-level dictionary structure storing L2-normalized tile features $f(x)$ for each slide, indexed as MemoryBank[$i$][$j$], where $i$ denotes the slide and $j$ the tile.
  • Residual Encoder: Uses a fixed codebook $C=\{\mathbf c_1,\ldots,\mathbf c_K\}$ (obtained via K-means) and implements a VLAD aggregation scheme. Each tile's residual with respect to its nearest codeword is computed; the residuals are aggregated per codeword, concatenated, and normalized, then refined by a small Transformer to produce the slide embedding $h\in\mathbb{R}^D$.
  • Projection Head: A linear layer projects $h$ into a multimodal joint space of dimension $d'$ (e.g., 4096) for alignment with LLaMA2-7B text encodings of histopathology reports.

During inference, tile and residual encoders are applied to all tiles once per slide. During training, only a subset is processed per iteration, with features and residuals recomputed in a dynamic fashion leveraging the memory bank.
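
These components can be composed roughly as in the minimal PyTorch sketch below. This is not the authors' implementation: the feature dimension $d$ (384 for HIPT's ViT-256), the refiner depth, the treatment of the VLAD vector as $K$ tokens of dimension $d$, and the random codebook initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRESLCL(nn.Module):
    """Sketch of the four DRE-SLCL components; dimensions and layer sizes are illustrative."""
    def __init__(self, tile_encoder, d=384, K=256, d_joint=4096, n_classes=2):
        super().__init__()
        self.tile_encoder = tile_encoder                      # ViT-256 pretrained with HIPT
        self.register_buffer("codebook", torch.randn(K, d))   # placeholder; filled with K-means centroids
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.refiner = nn.TransformerEncoder(layer, num_layers=2)  # small Transformer refining the VLAD vector
        self.proj = nn.Linear(K * d, d_joint)                 # projection head into the joint image-text space
        self.cls_head = nn.Linear(K * d, n_classes)           # downstream classification head

    def encode_tiles(self, tiles):
        """256x256 tiles -> L2-normalized d-dimensional features f(x)."""
        return F.normalize(self.tile_encoder(tiles), dim=-1)

    def refine(self, v_vlad):
        """Refine the concatenated VLAD vector into the slide embedding h."""
        K, d = self.codebook.shape
        tokens = v_vlad.reshape(1, K, d)          # treat the K per-codeword residuals as tokens
        return self.refiner(tokens).reshape(-1)   # flatten back after refinement
```

The paper denotes the refined slide embedding $h\in\mathbb{R}^D$; since the refiner architecture and $D$ are not fully specified here, the flattened $K\cdot d$ form above is only one plausible reading.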

2. Memory Bank Structure and Update Mechanism

The memory bank is crucial for scaling to whole slides containing tens of thousands of tiles. Its structure and use are as follows:

  • Initialization: Before training, ViT-256 encodes all tiles, and their L2-normalized features are stored in MemoryBank[$i$][$j$].
  • Organization: Implemented as a dict of dicts, with top-level keys by slide, subordinate keys by tile index.
  • Readout: To encode an entire WSI, all tile features are retrieved from MemoryBank[$i$].
  • Dynamic Update: At each training iteration, for every WSI in the batch, a randomly sampled subset of $r$ tiles is re-encoded and normalized, replacing the old features in the bank. This keeps the memory bank synchronized with ongoing updates to the tile encoder parameters (a minimal sketch follows this list).
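
The sketch below illustrates this structure in PyTorch under stated assumptions: the `tile_loader` callable (returning pixel data for the requested tiles) and the choice to keep detached CPU copies in the bank are illustrative, not details from the paper.

```python
import random
import torch
import torch.nn.functional as F

class MemoryBank:
    """Two-level dictionary: bank[slide_id][tile_idx] -> L2-normalized d-dim tile feature."""
    def __init__(self):
        self.bank = {}

    @torch.no_grad()
    def initialize(self, slide_id, tile_encoder, tiles):
        """Populate the bank for one slide by encoding all of its tiles once (done before training)."""
        feats = F.normalize(tile_encoder(tiles), dim=-1)
        self.bank[slide_id] = {j: feats[j].cpu() for j in range(feats.shape[0])}

    def readout(self, slide_id):
        """Return all N_i tile features of a slide as an (N_i, d) tensor."""
        entries = self.bank[slide_id]
        return torch.stack([entries[j] for j in sorted(entries)])

    def dynamic_update(self, slide_id, tile_encoder, tile_loader, r=10):
        """Re-encode r randomly sampled tiles with the current encoder and overwrite their entries."""
        sampled = random.sample(sorted(self.bank[slide_id]), k=r)
        fresh = F.normalize(tile_encoder(tile_loader(slide_id, sampled)), dim=-1)
        for j, f in zip(sampled, fresh):
            self.bank[slide_id][j] = f.detach().cpu()  # the bank itself stores detached copies
        return sampled, fresh                          # fresh features still carry gradients
```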

3. Dynamic Residual Encoding

To mitigate computational infeasibility with full-tile re-encoding and to control gradient memory footprint, DRE-SLCL implements dynamic residual encoding as follows:

  • For slide $i$ with $N_i$ tiles at iteration $t$, randomly sample $r$ tiles $\mathcal S_i^{(t)}$, encode them, and update the corresponding entries MemoryBank[$i$][$j$].
  • Retrieve all $N_i$ features for slide $i$ from the memory bank.
  • Assign each tile feature $f_{i,j}$ to its nearest codebook entry:

$$k^* = \arg\min_{1\le k\le K}\|f_{i,j}-\mathbf c_k\|_2, \qquad \mathbf r_{i,j} = f_{i,j}-\mathbf c_{k^*}.$$

  • Aggregate the residuals per codeword:

$$\mathbf v_k^{(i)} = \sum_{j:\,k^*(j)=k}\mathbf r_{i,j}, \qquad k=1,\ldots,K.$$

  • Concatenate and L2-normalize:

$$\mathbf v_{\mathrm{VLAD}}^{(i)} = [\mathbf v_1^{(i)},\ldots,\mathbf v_K^{(i)}]\in\mathbb{R}^{Kd}, \qquad \mathbf v_{\mathrm{VLAD}}^{(i)} \leftarrow \frac{\mathbf v_{\mathrm{VLAD}}^{(i)}}{\|\mathbf v_{\mathrm{VLAD}}^{(i)}\|_2}.$$

  • Feed $\mathbf v_{\mathrm{VLAD}}^{(i)}$ through a Transformer to yield the slide embedding $h_i = \mathrm{Transformer}(\mathbf v_{\mathrm{VLAD}}^{(i)})$.

This incremental update and aggregation scheme overcomes the sampling bias of previous methods and avoids re-encoding the full WSI at every iteration.
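
The assignment, aggregation, and normalization steps above map directly onto a few tensor operations. The sketch below assumes PyTorch and an $(N_i, d)$ feature matrix retrieved from the memory bank; the subsequent Transformer refinement (e.g., the hypothetical `refine` method from the earlier sketch) would be applied to its output.

```python
import torch
import torch.nn.functional as F

def vlad_encode(features, codebook):
    """VLAD aggregation of one slide's tile features.

    features: (N_i, d) tile features f_{i,j}
    codebook: (K, d) fixed K-means centroids c_k
    Returns the L2-normalized concatenation v_VLAD in R^{K*d}.
    """
    # k* = argmin_k ||f_{i,j} - c_k||_2 : nearest-codeword assignment per tile
    assign = torch.cdist(features, codebook).argmin(dim=1)          # (N_i,)

    # r_{i,j} = f_{i,j} - c_{k*} : residual of each tile w.r.t. its codeword
    residuals = features - codebook[assign]                         # (N_i, d)

    # v_k = sum of residuals of the tiles assigned to codeword k
    K, d = codebook.shape
    v = torch.zeros(K, d, dtype=features.dtype, device=features.device)
    v = v.index_add(0, assign, residuals)                           # (K, d)

    # concatenate and L2-normalize
    return F.normalize(v.reshape(-1), dim=0)                        # (K*d,)
```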

4. Slide-Level Contrastive Learning

To further enrich the learned slide representations, DRE-SLCL employs slide-level contrastive learning by aligning slide embeddings with textual report embeddings:

  • Report Encoder: LLaMA2-7B encodes each pathology report $T_i$ into a normalized 4096-dimensional vector $t_i$.
  • Image-Text Alignment: The slide embedding $h_i$ is projected and normalized to $h'_i$ with a linear layer, so $h'_i = \mathrm{Proj}(h_i)$.
  • Similarity Logits: For each batch of $b$ slides, define

$$S_{\mathrm{img2txt}} = \tfrac{1}{\tau}\, H' T^\top, \qquad S_{\mathrm{txt2img}} = \tfrac{1}{\tau}\, T (H')^\top,$$

with $H'$ and $T$ as batch matrices and $\tau$ the learnable temperature parameter.

  • Contrastive Loss:

$$\mathcal{L}_{\mathrm{str}} = -\frac{1}{b} \sum_{i=0}^{b-1} \log \frac{\exp(S_{\mathrm{img2txt}}[i,y_i])}{\sum_{j=0}^{b-1} \exp(S_{\mathrm{img2txt}}[i,j])}$$

$$\mathcal{L}_{\mathrm{rts}} = -\frac{1}{b} \sum_{i=0}^{b-1} \log \frac{\exp(S_{\mathrm{txt2img}}[i,y_i])}{\sum_{j=0}^{b-1} \exp(S_{\mathrm{txt2img}}[i,j])}$$

$$\mathcal{L}_{\mathrm{contrastive}} = \tfrac{1}{2} \left(\mathcal{L}_{\mathrm{str}} + \mathcal{L}_{\mathrm{rts}}\right)$$

where ground-truth pairs correspond to matching slide/report indices.

Positive pairs are true slide-report correspondences in the mini-batch, with all others serving as negatives. This encourages the visual representation to capture semantics that are also salient in clinical text.
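
Under these definitions, the symmetric loss reduces to two cross-entropies over the similarity matrices with in-batch targets $y_i = i$. The following PyTorch sketch assumes pre-normalized embeddings and a scalar temperature; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def slide_level_contrastive_loss(h_proj, t, tau):
    """Symmetric contrastive loss over a batch of b matched slide/report pairs.

    h_proj: (b, d') projected, L2-normalized slide embeddings h'_i
    t:      (b, d') L2-normalized report embeddings t_i (e.g., from LLaMA2-7B)
    tau:    temperature (scalar tensor, possibly learnable)
    """
    s_img2txt = h_proj @ t.T / tau                                  # S_img2txt[i, j]
    s_txt2img = t @ h_proj.T / tau                                  # S_txt2img[i, j]
    targets = torch.arange(h_proj.shape[0], device=h_proj.device)   # y_i = i: matching in-batch indices
    loss_str = F.cross_entropy(s_img2txt, targets)                  # L_str (slide -> report)
    loss_rts = F.cross_entropy(s_txt2img, targets)                  # L_rts (report -> slide)
    return 0.5 * (loss_str + loss_rts)                              # L_contrastive
```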

5. Training Pipeline

The DRE-SLCL training process comprises preparation and iterative training phases:

Preparation

  • Extract features for all tiles with ViT-256 to populate the initial MemoryBank.
  • Apply K-means clustering (with $K$ clusters) over all features to construct the codebook $C$.

Training Loop

  • For each batch of $b$ slides and their corresponding tiles and reports:
    1. Dynamic Sampling & Update: For each slide, randomly select $r$ tiles, encode them, and update the MemoryBank.
    2. Residual Encoding: Retrieve all tile features and perform VLAD + Transformer encoding.
    3. Contrastive Alignment: Encode reports, compute projected slide embeddings, calculate similarity logits and contrastive loss.
    4. Classification: Use the slide embedding $h_i$ for downstream tasks, with a classification head trained via cross-entropy.
    5. Optimization: Backpropagate on the combined contrastive and classification losses.

Initial epochs keep the tile encoder frozen; subsequent epochs fine-tune the entire network end-to-end, leveraging the up-to-date memory bank.
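
Putting the pieces together, one training iteration might look like the sketch below. It reuses the hypothetical `MemoryBank`, `vlad_encode`, `DRESLCL.refine`, and `slide_level_contrastive_loss` helpers from the earlier sketches; the `tile_loader` callable, the classification-loss weight `lam`, and single-device tensor handling are additional assumptions not specified by the paper.

```python
import torch
import torch.nn.functional as F

def train_step(model, memory_bank, tile_loader, batch, text_encoder, optimizer, tau, r=10, lam=1.0):
    """One iteration over a batch of (slide_id, report, label) triples (illustrative only)."""
    slide_embeds, labels, report_feats = [], [], []
    for slide_id, report, label in batch:
        # 1. Dynamic sampling & update: re-encode r tiles with the current tile encoder
        sampled, fresh = memory_bank.dynamic_update(slide_id, model.tile_encoder, tile_loader, r=r)

        # 2. Residual encoding: VLAD over all cached features, with gradients flowing
        #    only through the freshly re-encoded tiles, then Transformer refinement -> h_i
        feats = memory_bank.readout(slide_id).to(fresh.device)
        feats = feats.index_copy(0, torch.tensor(sampled, device=fresh.device), fresh)
        h = model.refine(vlad_encode(feats, model.codebook))

        slide_embeds.append(h)
        labels.append(label)
        report_feats.append(F.normalize(text_encoder(report), dim=-1))

    H = torch.stack(slide_embeds)                      # (b, dim of h)
    T = torch.stack(report_feats)                      # (b, 4096)

    # 3. Contrastive alignment in the joint image-text space
    H_proj = F.normalize(model.proj(H), dim=-1)
    loss_con = slide_level_contrastive_loss(H_proj, T, tau)

    # 4. Classification on the slide embedding + 5. joint optimization
    loss_cls = F.cross_entropy(model.cls_head(H), torch.tensor(labels, device=H.device))
    loss = loss_con + lam * loss_cls

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```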

6. Experimental Results

DRE-SLCL was evaluated on several large-scale histopathology datasets:

| Task | Dataset | Metric | DRE-SLCL | Best Baseline | Δ |
|------|---------|--------|----------|---------------|---|
| LUAD vs. LUSC Subtyping | TCGA_LUNG | F1 | 80.88% | 76.05% | +4.83 |
| LUAD vs. LUSC Subtyping | CPTAC_LUNG | F1 | 76.43% | 68.65% | +7.78 |
| Cancer Recognition | CPTAC_PDA | AUC | 92.34% | 89.51% | +2.83 |
| TP53 Mutation Prediction | TCGA_LUAD | AUC / F1 | 71.33% / 82.76% | 69.23% / 79.40% | +2.10 / +3.36 |

DRE-SLCL consistently outperformed MIL, Transformer-MIL, HIPT, Prov-GigaPath, and prior end-to-end schemes across cancer subtyping, recognition, and gene mutation prediction.

7. Ablation Studies and Interpretations

Ablation experiments on TP53 mutation prediction tested key hyperparameters and loss components:

  • Batch Size: Larger batches (up to 64) yielded better AUC, attributed to a greater number of negatives in the contrastive loss.
  • Codebook Size: Increasing $K$ from 32 to 256 improved AUC, highlighting the importance of fine-grained residual encoding.
  • Tiles per WSI in Update ($r$): Best performance was attained at $r=10$, reflecting a balance between computational efficiency and representative sampling.
  • Contrastive Loss: Removing the contrastive loss reduced AUC by ~2–3%. Adding the loss to ABMIL improved its results, but by less than full DRE-SLCL.
  • End-to-End Training: End-to-end fine-tuning (unfreezing ViT-256 after initial freezing) with memory bank yielded the highest performance.

These findings substantiate that the combination of dynamic VLAD residual encoding and a synchronized memory bank addresses sampling bias, while slide-level contrastive alignment with real pathology reports enhances generalization. Optimal choices of mini-batch size, codebook granularity, and tile sampling rate are critical for best empirical results.
