FeatUp: A Model-Agnostic Framework for Features at Any Resolution (2403.10516v2)
Abstract: Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
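The multi-view consistency idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "backbone" here is plain average pooling, the "views" are circular pixel shifts, and every function name (`avg_pool`, `toy_backbone`, `shift`, `multiview_consistency_loss`) is a hypothetical stand-in chosen for illustration. The sketch assumes the loss compares low-resolution features of a transformed image against high-resolution features that are transformed in the same way and then downsampled.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool an (H, W, C) array over non-overlapping k x k blocks."""
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def toy_backbone(image, k=4):
    # Hypothetical stand-in for a deep model: it pools aggressively,
    # so its features have 1/k of the input resolution.
    return avg_pool(image, k)

def shift(x, dy, dx):
    # Hypothetical "view" transform: a circular shift by (dy, dx) pixels.
    return np.roll(x, (dy, dx), axis=(0, 1))

def multiview_consistency_loss(image, hr_feats, shifts, k=4):
    """Mean squared gap between the low-res features of each shifted view
    and the identically shifted, then downsampled, high-res features."""
    total = 0.0
    for dy, dx in shifts:
        target = toy_backbone(shift(image, dy, dx), k)
        pred = avg_pool(shift(hr_feats, dy, dx), k)
        total += np.mean((target - pred) ** 2)
    return total / len(shifts)

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
shifts = [(0, 0), (1, 2), (3, 1), (2, 3)]

# Candidate A: nearest-neighbour upsampling of the low-res features.
low_res = toy_backbone(image)
nearest = low_res.repeat(4, axis=0).repeat(4, axis=1)

# Candidate B: the true high-res signal (in this toy setup, the image itself).
loss_nearest = multiview_consistency_loss(image, nearest, shifts)
loss_true = multiview_consistency_loss(image, image, shifts)
print(loss_nearest, loss_true)
```

In this toy setting the true high-resolution signal is consistent with every shifted view, while naive nearest-neighbour upsampling is not, which is the kind of constraint that many low-resolution views can place on a single high-resolution feature map.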