AffinityNet: Weak Supervision & Few-shot Learning
- The paper demonstrates that learning semantic pixel affinities using a modified ResNet-38 and random-walk propagation significantly boosts segmentation performance on benchmarks like PASCAL VOC and DeepGlobe.
- It introduces a methodology that leverages affinity structures and kNN attention pooling to enhance label propagation and sample efficiency in high-dimensional, few-shot genomic applications.
- Practical implementation details, including hyperparameters such as gamma=5, beta=8, and 256 iterations, are provided to ensure reproducibility and guide adaptation across diverse domains.
AffinityNet refers to a family of neural network architectures designed to model affinities, or pairwise similarities, between instances in various domains. The dominant instantiations of AffinityNet fall into two principal categories: (1) models that predict semantic pixel affinities for weakly supervised semantic segmentation in images, and (2) architectures for semi-supervised few-shot learning, particularly in genomic and structured data settings. Both lines of work leverage learned affinity structures to enable label propagation, regularization, and improved sample efficiency, but differ in network design, application, and theoretical motivation.
1. AffinityNet for Weakly Supervised Semantic Segmentation
The seminal AffinityNet, introduced by Ahn and Kwak for weakly supervised semantic segmentation, addresses the scarcity of pixel-level annotations by learning to predict class-agnostic semantic affinities between adjacent image locations using only image-level class labels as supervision (Ahn et al., 2018). The key innovation lies in leveraging these affinities to propagate sparse discriminative object responses across object regions via random walk, thereby generating pseudo-labels suitable for full segmentation network training.
Model Architecture
- Backbone Feature Extractor: A modified ResNet-38 (pretrained on ImageNet) serves as the feature pipeline. The final three levels of residual blocks, originally with stride 2, are converted to atrous convolutions with dilation rates 1, 2, and 4, resulting in a feature map with overall stride 8.
- Multi-level Feature Aggregation: Feature maps from the last three backbone stages are reduced in dimension (to 128, 256, and 512 channels, respectively) with 1x1 convolutions, concatenated, and passed through an additional 1x1 convolution to yield a fused feature map tailored to the affinity task.
- Affinity Computation: For any two spatial locations i and j within Euclidean radius gamma, the semantic affinity is computed as W_ij = exp(-||f_i^aff - f_j^aff||_1), where f_i^aff denotes the fused affinity feature at location i. This forms a sparse, symmetric affinity matrix W with W_ij in (0, 1] and W_ii = 1.
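As a concrete illustration, the affinity computation above can be sketched in NumPy. The dense double loop and shapes here are illustrative only; a practical implementation stores just the sparse within-radius pairs.

```python
import numpy as np

def affinity_matrix(feats, radius=5):
    """Sketch of the pairwise semantic affinity computation.

    feats: (H, W, C) fused feature map from the backbone.
    Returns an (H*W, H*W) matrix whose entries for location pairs within
    `radius` hold exp(-||f_i - f_j||_1); all other entries stay zero.
    """
    H, W, C = feats.shape
    n = H * W
    flat = feats.reshape(n, C)
    # Grid coordinates of each flattened location, for the radius test.
    coords = np.stack(
        np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1
    ).reshape(n, 2)
    A = np.zeros((n, n))
    for i in range(n):
        # Only pairs within the Euclidean radius receive an affinity.
        d = np.linalg.norm(coords - coords[i], axis=1)
        nbrs = np.where(d <= radius)[0]
        l1 = np.abs(flat[nbrs] - flat[i]).sum(axis=1)
        A[i, nbrs] = np.exp(-l1)
    return A
```

Because the L1 distance is symmetric and zero for i = j, the resulting matrix is symmetric with unit diagonal, matching the properties stated above.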
2. Random Walk Propagation and Training Pipeline
To address incomplete object localization in seed maps generated from class activation maps (CAMs), AffinityNet employs random-walk propagation with the learned affinities:
- Transition Matrix Construction: The random-walk transition matrix is T = D^{-1} W^{beta}, where W^{beta} (Hadamard power) amplifies strong affinities and D_ii = sum_j W_ij^beta.
- Iterative Propagation: Given a vectorized CAM vec(M), the propagation proceeds via vec(M) <- T . vec(M), iterated t times. The resulting vec(M*) constitutes the refined CAM.
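A minimal sketch of the transition-matrix construction and iterative propagation, assuming a dense affinity matrix W:

```python
import numpy as np

def random_walk_refine(cam, W, beta=8, t=256):
    """Sketch of random-walk CAM refinement.

    cam: (n,) vectorized CAM scores; W: (n, n) affinity matrix.
    Builds T = D^{-1} W^beta with D_ii = sum_j W_ij^beta, then applies
    T to the CAM for t iterations.
    """
    Wb = W ** beta                           # Hadamard power: sharpens strong affinities
    T = Wb / Wb.sum(axis=1, keepdims=True)   # row-normalize, i.e. multiply by D^{-1}
    m = cam.copy()
    for _ in range(t):
        m = T @ m
    return m
```

Since T is row-stochastic, each iteration replaces every location's score with a convex combination of its neighbors' scores, which is what diffuses sparse CAM seeds across object regions.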
Supervision for Affinity Prediction is derived without pixel-level labels:
- CAMs for ground-truth classes are produced, normalized, and refined with dense CRF.
- "Confident foreground," "confident background," and "neutral" regions are determined by varying thresholds ().
- Pixel pairs within confident regions are labeled as positive (same label), negative (different labels), or ignored (involving neutral regions).
- The cross-entropy loss over positive and negative pairs enforces boundary sensitivity.
Training and Deployment: After AffinityNet is trained, the propagated pseudo-labels serve as synthetic ground-truth for standard segmentation network training.
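The region partitioning and pair labeling described above can be sketched as follows. The two fixed thresholds here are illustrative stand-ins for the paper's background-scoring scheme, not its exact values:

```python
import numpy as np

def affinity_pair_labels(cam_fg, bg_lo=0.05, bg_hi=0.3):
    """Hypothetical sketch of deriving affinity supervision from a CAM.

    cam_fg: (H, W) normalized foreground activation for one class.
    Pixels above `bg_hi` become confident foreground, below `bg_lo`
    confident background; everything between is neutral.
    Returns a label map: 1 = fg, 0 = bg, -1 = neutral (ignored in the loss).
    """
    labels = np.full(cam_fg.shape, -1, dtype=int)
    labels[cam_fg >= bg_hi] = 1
    labels[cam_fg <= bg_lo] = 0
    return labels

def pair_label(labels, i, j):
    """Positive pair (1) if both pixels are confident and agree, negative (0)
    if both are confident and disagree, None if either pixel is neutral."""
    a, b = labels[i], labels[j]
    if a == -1 or b == -1:
        return None
    return 1 if a == b else 0
```

Only the positive and negative pairs enter the cross-entropy loss; neutral pixels are simply excluded, which is how the method avoids training on uncertain regions.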
3. AffinityNet for Semi-supervised Few-shot Learning
A parallel instantiation, introduced by Ma and Zhang (Ma et al., 2018), targets "big p, small N" regimes in high-dimensional data such as cancer genomics. Here, AffinityNet is characterized by k-Nearest-Neighbor (kNN) attention pooling layers, which can act as plug-in modules in arbitrary neural architectures.
Model Structure
- Input: X in R^{N x p} (N samples, p features).
- Feature Attention Layer: Weighted re-scaling selects informative features, implemented as x -> w . x with learned, non-negative weights w in R^p normalized to sum to one.
- Stacked kNN Attention Pooling: At each layer l, for every sample i, h_i^{(l+1)} = sum_{j in N_k(i)} a_ij . f(h_j^{(l)}), where N_k(i) are the k-nearest neighbors of i, a_ij is the normalized attention weight (softmax over similarity scores), and f comprises an affine transform and nonlinearity.
- Affinity Computation and Label Propagation: The kNN attention acts as an implicit regularizer by enforcing neighborhood consistency in the learned representation.
Training: Supervised cross-entropy is applied to labeled data, while unlabeled samples influence representations via neighbor pooling, enabling semi-supervised and few-shot learning. No explicit regularizer is used.
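A minimal sketch of one kNN attention pooling step with a cosine-similarity kernel; the layer's affine transform and nonlinearity are omitted for brevity:

```python
import numpy as np

def knn_attention_pool(H, k=3):
    """Sketch of kNN attention pooling over sample representations.

    H: (N, d) representations. Each sample's new representation is an
    attention-weighted average of its k most similar samples (itself
    included), with weights softmaxed over cosine similarities.
    """
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    Hn = H / np.clip(norms, 1e-8, None)
    sim = Hn @ Hn.T                          # pairwise cosine similarities
    out = np.zeros_like(H)
    for i in range(len(H)):
        nbrs = np.argsort(-sim[i])[:k]       # k most similar samples (includes i)
        w = np.exp(sim[i, nbrs])
        w /= w.sum()                         # softmax over neighbor scores
        out[i] = w @ H[nbrs]
    return out
```

Because unlabeled samples also participate as neighbors, their representations shape those of labeled samples (and vice versa), which is the mechanism behind the semi-supervised behavior described above.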
4. Experimental Results and Quantitative Performance
Semantic Segmentation
- On PASCAL VOC 2012:
- CAM alone: 48.0% mIoU (train set)
- CAM + AffinityNet random-walk: 58.1%
- + dCRF: 59.7%
- Final segmentation, ResNet-38 backbone: 61.7% (val) / 63.7% (test), surpassing all prior image-level supervised methods (previous best ≈ 55%) (Ahn et al., 2018).
- On DeepGlobe Land Cover (satellite images) (Nivaggioli et al., 2019):
- Weakly supervised AffinityNet + random walk: mIoU 45.90% (no background), within 7–8 points of fully supervised top entries (53.58%).
- Off-the-shelf segmentation nets, trained on AffinityNet labels, exhibited negligible degradation relative to full supervision.
Few-shot and Genomics
- Synthetic 4-cluster data (p=42, 1% labels): AffinityNet accuracy 98.2% vs. 46.9% for plain neural net.
- TCGA kidney cancer, 1% labeled: AffinityNet AMI 0.84 vs. 0.70 (NN/SVM).
- Survival analysis (Cox model): c-index ~0.69–0.73 for AffinityNet features (Ma et al., 2018).
5. Implementation Details and Hyperparameters
Semantic Segmentation (Image Domain)
- Backbone: ResNet-38/ResNet-74 with atrous convolutions and feature aggregation.
- Affinity radius: gamma = 5 (patch-wise for satellite imagery, up to 10).
- Random-walk: beta = 8, t = 256 iterations (T^t can be computed with repeated squaring for efficiency).
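The repeated-squaring trick reduces the t = 256 propagation steps to log2(t) = 8 matrix squarings; a sketch, assuming t is a power of two:

```python
import numpy as np

def matrix_power_by_squaring(T, t):
    """Compute T^t with O(log t) matrix multiplications instead of t.
    For t = 256 this is 8 squarings. Sketch handles powers of two only."""
    assert t > 0 and (t & (t - 1)) == 0, "t must be a power of two"
    P = T
    while t > 1:
        P = P @ P   # square: P holds T^2, T^4, ... until T^t
        t //= 2
    return P
```

Note that squaring densifies a sparse T, so for large images applying T to the CAM vector t times (or a hybrid) may be preferable despite the extra iterations.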
- Loss: Weighted sum of cross-entropies over foreground, background, and negative pairs; in remote sensing, the background-affinity term is weighted more heavily.
- Training: Adam optimizer; typical convergence: ~30 epochs (CAM), ~7 (AffinityNet), ~32 (segmentation net).
- Data Augmentation: Horizontal flips, random crops, color jitter; random scaling for segmentation, not AffinityNet.
kNN Attention (Tabular Domain)
- Attention kernel: Cosine similarity (for cancer data), alternatives supported.
- k: 2–3 in genomics; more generally, on the order of 2–5% of the number of samples N.
- Hidden dimensions and layer counts: dataset-specific (different settings were used for the kidney and uterus cancer datasets).
- Batch size: All or mini-batches; neighbor sets constructed per batch if needed.
6. Strengths, Limitations, and Generalizations
Strengths
- Enables propagation from sparse seed activations to dense, high-quality pseudo-labels using only image-level supervision; achieves segmentation results near those of full supervision (Ahn et al., 2018, Nivaggioli et al., 2019).
- Affinity structures encode boundary sensitivity and improve label consistency without direct pixelwise annotations.
- kNN attention pooling in tabular AffinityNet provides regularization and sample efficiency, facilitating few-shot learning and semi-supervised clustering (Ma et al., 2018).
- Plug-and-play kNN modules are flexible, generalize beyond graph data, and can replace normalization/pooling in arbitrary pipelines.
Limitations
- Naive affinity/random-walk computation is quadratic in the number of spatial locations (an n x n matrix for n pixels). Efficient sparse or approximate implementations are needed for large or dense segmentation maps.
- Patch-wise prediction in large images can yield boundary artifacts.
- The hyperparameters (gamma, beta, t, k) and attention kernel must be domain-tuned; oversmoothing is possible if class structure is unclear.
- For weakly supervised segmentation, minor/rare classes may be under-represented in CAMs, limiting affinity label diversity.
Generalizations
- AffinityNet's affinity prediction paradigm can be directly extended to multi-spectral or SAR remote sensing data by adapting the backbone for additional channels (Nivaggioli et al., 2019).
- Multi-scale affinity computation and graph-cut regularization are viable enhancements.
- Combination with weak forms of supervision (scribbles, points) could further densify and regularize pseudo-label propagation.
7. Relation to Prior Work and Impact
AffinityNet's core contributions include an end-to-end learnable pixel-affinity predictor and a random-walk label propagation framework, together providing a principled and empirically validated approach to weakly supervised segmentation (Ahn et al., 2018). The extension to kNN attention for few-shot learning connects to and generalizes the Graph Attention Model (GAM) (Ma et al., 2018), removing the requirement of a fixed graph and increasing applicability across domains. AffinityNet has demonstrated marked gains in low-label, high-dimensional regimes (cancer genomics), providing a template for future affinity-based regularization and label propagation models.