
Estimating Conditional Mutual Information for Dynamic Feature Selection (2306.03301v3)

Published 5 Jun 2023 in cs.LG, cs.IT, and math.IT

Abstract: Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into a model's predictions. The problem is challenging, however, as it requires both predicting with arbitrary feature sets and learning a policy to identify valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is implementing this policy, and we design a new approach that estimates the mutual information in a discriminative rather than generative fashion. Building on our approach, we then introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform feature costs, incorporating prior information, and exploring modern architectures to handle partial inputs. Our experiments show that our method provides consistent gains over recent methods across a variety of datasets.
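
The greedy selection rule described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's estimator: it computes the conditional mutual information exactly by counting over a small, fully enumerated table, whereas the paper learns a discriminative estimate. The toy data (a label equal to `x0 AND x1`, plus an irrelevant `x2`) and the helper names `cmi` and `greedy_select` are assumptions made for illustration only.

```python
import itertools
import math
from collections import Counter

# Hypothetical toy data: three binary features, label y = x0 AND x1; x2 is irrelevant.
rows = [(x, x[0] & x[1]) for x in itertools.product([0, 1], repeat=3)]

def cmi(rows, i, observed):
    """Exact I(y; x_i | x_S = observed), in nats, from equally weighted rows."""
    # Keep only rows consistent with the feature values observed so far.
    matching = [(x, y) for x, y in rows
                if all(x[j] == v for j, v in observed.items())]
    n = len(matching)
    if n == 0:
        return 0.0
    joint = Counter((x[i], y) for x, y in matching)
    p_x = Counter(x[i] for x, _ in matching)
    p_y = Counter(y for _, y in matching)
    return sum((c / n) * math.log((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), c in joint.items())

def greedy_select(rows, x_true, budget):
    """Query `budget` features of x_true, each time taking the highest-CMI one."""
    observed = {}
    for _ in range(budget):
        candidates = [i for i in range(len(x_true)) if i not in observed]
        # Ties are broken in favor of the lowest feature index.
        best = max(candidates, key=lambda i: cmi(rows, i, observed))
        observed[best] = x_true[best]
    return list(observed)  # dicts preserve insertion order, so this is the query order

if __name__ == "__main__":
    print(greedy_select(rows, (1, 1, 0), budget=2))
    print(cmi(rows, 1, {0: 1}), cmi(rows, 2, {0: 1}))
```

In this toy setting, once `x0 = 1` is observed, `x1` carries all remaining information about the label and `x2` carries none. If `x0 = 0` is observed instead, the label is already determined and every remaining feature has zero conditional mutual information, which is one intuition for letting the feature budget vary across samples.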
