Streaming convolutional neural networks for end-to-end learning with multi-megapixel images (1911.04432v1)

Published 11 Nov 2019 in cs.CV

Abstract: Due to memory constraints on current hardware, most convolution neural networks (CNN) are trained on sub-megapixel images. For example, most popular datasets in computer vision contain images much less than a megapixel in size (0.09MP for ImageNet and 0.001MP for CIFAR-10). In some domains such as medical imaging, multi-megapixel images are needed to identify the presence of disease accurately. We propose a novel method to directly train convolutional neural networks using any input image size end-to-end. This method exploits the locality of most operations in modern convolutional neural networks by performing the forward and backward pass on smaller tiles of the image. In this work, we show a proof of concept using images of up to 66-megapixels (8192x8192), saving approximately 50GB of memory per image. Using two public challenge datasets, we demonstrate that CNNs can learn to extract relevant information from these large images and benefit from increasing resolution. We improved the area under the receiver-operating characteristic curve from 0.580 (4MP) to 0.706 (66MP) for metastasis detection in breast cancer (CAMELYON17). We also obtained a Spearman correlation metric approaching state-of-the-art performance on the TUPAC16 dataset, from 0.485 (1MP) to 0.570 (16MP). Code to reproduce a subset of the experiments is available at https://github.com/DIAGNijmegen/StreamingCNN.

Authors (3)
  1. Hans Pinckaers (6 papers)
  2. Bram van Ginneken (69 papers)
  3. Geert Litjens (33 papers)
Citations (85)

Summary

  • The paper proposes a streaming method that divides high-resolution images into smaller tiles, significantly reducing memory usage for CNN training.
  • The approach leverages gradient checkpointing and localized convolutions to match conventional performance across datasets, including medical imaging.
  • Empirical validation on datasets like TUPAC16 and CAMELYON17 shows enhanced metrics on images up to 8192×8192 pixels, underlining its diagnostic potential.

Analysis of Streaming Convolutional Neural Networks for High-Resolution Image Processing

The paper "Streaming Convolutional Neural Networks for End-to-End Learning with Multi-Megapixel Images" presents a novel methodology for training convolutional neural networks (CNNs) on high-resolution images without the prohibitive memory requirements typically associated with such tasks. This work is particularly relevant for fields like medical imaging, where images can exceed gigapixels in resolution and contain critical information only visible at high resolutions.

Methodology

In the context of CNN architectures, a major limitation is the memory bottleneck that arises from the large intermediate activation maps produced when processing high-resolution input images. The authors address this challenge by introducing a streaming approach in which images are divided into smaller tiles that can be processed separately. This strategy substantially reduces memory usage while still allowing the network to learn from full-resolution image features, as the sketch below illustrates.
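
For intuition on why tiling is possible at all, the following minimal PyTorch sketch (written for this summary, not taken from the authors' StreamingCNN code; tile and kernel sizes are arbitrary) checks that a convolution computed on overlapping tiles and stitched back together matches the convolution computed on the full image:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(1, 3, 256, 256)      # stand-in for a large input
weight = torch.randn(8, 3, 3, 3)         # 8 filters, 3x3 kernel

# Reference: one convolution over the full image (stride 1, no padding).
full = F.conv2d(image, weight)           # output is 1 x 8 x 254 x 254

# Tile-wise: a 2x2 grid of tiles that overlap by (kernel_size - 1) pixels,
# so every output pixel sees the same receptive field as in `full`.
k, tile = 3, 129
starts = [0, tile - (k - 1)]             # tile origins: 0 and 127
rows = []
for y in starts:
    cols = [F.conv2d(image[:, :, y:y + tile, x:x + tile], weight)
            for x in starts]
    rows.append(torch.cat(cols, dim=3))  # stitch along width
stitched = torch.cat(rows, dim=2)        # stitch along height

print(torch.allclose(full, stitched, atol=1e-5))  # True
```

Because the stitched result is identical, only one tile's activations need to reside in memory at a time.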

The streaming method performs the forward and backward passes on these image tiles. The underlying principle exploits the locality of convolutions: because each output pixel depends only on a small neighborhood of the input, the convolution can be computed sequentially over small segments of the image rather than over the entire image at once. The technique further incorporates gradient checkpointing, which reduces the memory footprint by re-computing intermediate activations during backpropagation instead of storing them.
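
Gradient checkpointing itself is available off the shelf in PyTorch. The sketch below is a generic illustration of the idea (the module name `TinyStreamingBlock` is hypothetical, and this is not the paper's implementation): the convolutional block's activations are recomputed during the backward pass rather than stored.

```python
import torch
from torch.utils.checkpoint import checkpoint

class TinyStreamingBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 16, 3, padding=1), torch.nn.ReLU(),
        )
        self.head = torch.nn.Linear(16, 1)

    def forward(self, x):
        # Activations inside `features` are recomputed during backward
        # instead of being stored, trading compute for memory.
        x = checkpoint(self.features, x, use_reentrant=False)
        return self.head(x.mean(dim=(2, 3)))  # global average pool

model = TinyStreamingBlock()
image = torch.randn(1, 3, 512, 512, requires_grad=True)
model(image).sum().backward()  # `features` runs twice: forward + recompute
```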

Empirical Validation

The authors validate their method across multiple datasets, including the ImageNette subset of ImageNet and the TUPAC16 and CAMELYON17 datasets, which feature high-resolution medical images. Notably, their empirical results suggest that CNNs trained using the streaming method perform equivalently to those trained conventionally, as demonstrated by virtually identical loss curves when initialization seeds are controlled.

Additionally, the authors tested the effect of larger image inputs on these datasets. Performance increased with higher image resolutions up to dataset-specific thresholds, indicating that the networks capture finer image characteristics that are pivotal for accurate predictions.

Key Results

Significantly, in the experiments on the TUPAC16 dataset, the paper reports a Spearman rank-order correlation for the image-level regression task that approaches current state-of-the-art performance. Notably, the streaming method allowed input images of up to 8192×8192 pixels, a scale at which conventional end-to-end training would require excessive memory (approximately 825GB per mini-batch).
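
For intuition on where such figures come from, a rough back-of-the-envelope estimate helps. The channel schedule below is an illustrative, VGG-style assumption, not the paper's exact architecture; even so, storing fp32 activations for a single 66-megapixel image already runs to tens of gigabytes.

```python
# Illustrative fp32 activation-memory estimate for a conventional forward
# pass on one 8192x8192 RGB image. The channel schedule is assumed for
# illustration, not taken from the paper.
bytes_per_float = 4
h = w = 8192
stages = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]  # (channels, convs)

total = 0
for channels, convs in stages:
    total += convs * h * w * channels * bytes_per_float  # stored activations
    h, w = h // 2, w // 2                                # 2x2 max-pooling

print(f"~{total / 2**30:.0f} GiB of activations for a single image")
```

This toy estimate lands at roughly 68 GiB per image, the same order of magnitude as the ~50GB per-image saving quoted in the abstract; batching then multiplies it toward the ~825GB per-mini-batch figure.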

In the classification task on the CAMELYON17 dataset, there was a discernible improvement in the area under the ROC curve when employing higher resolution images, underscoring the advantage of accessing high-resolution details for detecting diverse classes of tumor metastases in the tissue samples.

Implications and Future Directions

The implications of the proposed streaming method are substantial. It has the potential to revolutionize how high-resolution images are utilized within CNN frameworks, particularly in domains requiring fine-grained image analysis. The approach presents a feasible pathway to overcome hardware constraints without sacrificing the depth or detail of the image analysis.

For future work, exploration could be directed toward optimizing the computational efficiency of the streaming process, potentially through mixed-precision training or parallel processing techniques. Moreover, adapting the approach to support image operations that depend on full-image context could broaden its applicability across various deep learning tasks.

In sum, this work provides a robust framework that bridges the gap between current hardware limitations and the demand for high-resolution image processing, a necessity in advancing fields such as medical diagnosis, where detail at the macro-level is as critical as at the micro-level.