Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
Abstract: Black-Box Knowledge Distillation (B2KD) formulates the problem of cloud-to-edge model compression in which the data and models hosted on the server are invisible. B2KD faces challenges such as limited Internet exchange and a disparity between the edge and cloud data distributions. In this paper, we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically derive a new optimization direction, from logits to cell boundary, that differs from direct logits alignment. Guided by this result, we propose Mapping-Emulation KD (MEKD), a method that distills a cumbersome black-box model into a lightweight one. Our method treats soft and hard responses alike and consists of two steps: 1) deprivatization, which emulates the inverse mapping of the teacher function with a generator, and 2) distillation, which aligns the low-dimensional logits of the teacher and student models by reducing the distance between their high-dimensional image points. Across different teacher-student pairs, our method yields encouraging distillation performance on various benchmarks and outperforms previous state-of-the-art approaches.
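To make the two-step workflow concrete, the following is a toy linear sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the teacher, generator, and student are plain matrices, the generator is fit by least squares as a stand-in for whatever generator training the paper actually uses, and the names `query_teacher`, `fit_generator`, `image_space_loss`, `distill`, and `run_demo` are hypothetical. It only illustrates the idea stated in the abstract: the student's logits are aligned with the teacher's by shrinking the distance between the generator's outputs for the two sets of logits, rather than the distance between the logits themselves.

```python
import numpy as np

def query_teacher(x, w_teacher):
    """Black-box teacher (assumed linear here): only its outputs are observed."""
    return x @ w_teacher.T  # (n, k) soft responses

def fit_generator(teacher_logits, x):
    """Step 1, deprivatization (toy stand-in): fit a linear map A so that
    teacher_logits @ A approximates x, i.e. emulate the teacher's inverse."""
    a, *_ = np.linalg.lstsq(teacher_logits, x, rcond=None)
    return a  # (k, d)

def image_space_loss(w_student, x, teacher_logits, a):
    """Mean squared distance between generated points for student and teacher logits."""
    diff = (x @ w_student.T - teacher_logits) @ a  # (n, d)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def distill(x, teacher_logits, a, steps=2000, seed=1):
    """Step 2, distillation: gradient descent on the image-space loss.
    The generator A stays frozen; only the student weights are updated."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    k = teacher_logits.shape[1]
    w = 0.1 * rng.standard_normal((k, d))
    # Step size 1/L, where L bounds the curvature of this quadratic loss.
    lip = 2.0 / n * np.linalg.eigvalsh(a @ a.T).max() * np.linalg.eigvalsh(x.T @ x).max()
    lr = 1.0 / lip
    w_init = w.copy()
    for _ in range(steps):
        resid = (x @ w.T - teacher_logits) @ a   # (n, d) image-space residual
        grad_logits = 2.0 / n * resid @ a.T      # (n, k) gradient w.r.t. student logits
        w -= lr * grad_logits.T @ x              # (k, d) gradient w.r.t. student weights
    return w_init, w

def run_demo():
    rng = np.random.default_rng(0)
    x = rng.standard_normal((512, 8))            # hypothetical edge-side inputs
    w_teacher = rng.standard_normal((3, 8))      # hidden from the student
    teacher_logits = query_teacher(x, w_teacher)
    a = fit_generator(teacher_logits, x)
    w_init, w_final = distill(x, teacher_logits, a)
    before = image_space_loss(w_init, x, teacher_logits, a)
    after = image_space_loss(w_final, x, teacher_logits, a)
    return a.shape, before, after

if __name__ == "__main__":
    shape, before, after = run_demo()
    print(f"generator shape {shape}, loss before {before:.4f}, after {after:.2e}")
```

In this sketch the teacher weights are used only inside `query_teacher`, which mirrors the black-box constraint: the student sees responses, never parameters. Hard responses could be fed through the same two steps by replacing the teacher logits with one-hot labels, though that variant is not shown here.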