Estimating Conditional Mutual Information for Dynamic Feature Selection (2306.03301v3)
Abstract: Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into a model's predictions. The problem is challenging, however, as it requires both predicting with arbitrary feature sets and learning a policy to identify valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is implementing this policy, and we design a new approach that estimates the mutual information in a discriminative rather than generative fashion. Building on our approach, we then introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform feature costs, incorporating prior information, and exploring modern architectures to handle partial inputs. Our experiments show that our method provides consistent gains over recent methods across a variety of datasets.
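To make the selection criterion concrete, the sketch below computes the conditional mutual information I(y; x_i | x_S) exactly on a small discrete toy problem and picks the next feature greedily. It illustrates only the quantity being prioritized. It is not the discriminative estimator described in the abstract, which has to learn the quantity from data because the joint distribution is unknown in practice. The toy distribution and the names `toy_joint`, `cmi`, and `greedy_next` are hypothetical and chosen for illustration.

```python
import numpy as np

def toy_joint(flip=(0.1, 0.4)):
    """Hypothetical joint table p[y, x0, x1].

    y ~ Bernoulli(0.5); each x_i equals y, flipped with probability flip[i],
    independently given y.
    """
    p = np.zeros((2, 2, 2))
    for y in range(2):
        for x0 in range(2):
            for x1 in range(2):
                p0 = flip[0] if x0 != y else 1 - flip[0]
                p1 = flip[1] if x1 != y else 1 - flip[1]
                p[y, x0, x1] = 0.5 * p0 * p1
    return p

def cmi(joint, i, observed):
    """Exact I(y; x_i | x_S = observed) in nats.

    `observed` maps feature index -> observed value; feature i must be unobserved.
    Axis 0 of `joint` is y, axis j + 1 is feature j.
    """
    # Condition on the observed values (length-1 slices keep the axes in place).
    idx = [slice(None)] * joint.ndim
    for j, v in observed.items():
        idx[j + 1] = slice(v, v + 1)
    cond = joint[tuple(idx)]
    cond = cond / cond.sum()  # renormalize; creates a copy, joint is untouched

    # Marginalize out every feature except x_i, leaving p(y, x_i | x_S).
    other_axes = tuple(a for a in range(1, joint.ndim) if a != i + 1)
    p_yx = cond.sum(axis=other_axes)
    p_y = p_yx.sum(axis=1, keepdims=True)
    p_x = p_yx.sum(axis=0, keepdims=True)

    # KL divergence between the conditional joint and the product of marginals.
    prod = p_y * p_x
    mask = p_yx > 0
    return float(np.sum(p_yx[mask] * np.log(p_yx[mask] / prod[mask])))

def greedy_next(joint, observed):
    """Return the unobserved feature with the largest conditional mutual information."""
    candidates = [i for i in range(joint.ndim - 1) if i not in observed]
    return max(candidates, key=lambda i: cmi(joint, i, observed))

joint = toy_joint()
first = greedy_next(joint, {})            # query chosen with nothing observed
second = greedy_next(joint, {first: 1})   # query chosen after observing x_first = 1
```

In this toy setting the less noisy feature `x0` is selected first, since it carries more information about `y`. With real data the joint table is not available, which is why the quantity has to be estimated rather than computed.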