- The paper introduces a variational information maximization framework using tractable lower bounds on mutual information for high-dimensional feature selection.
- Empirical results demonstrate that the proposed variational methods significantly outperform existing information-theoretic methods across synthetic and real-world datasets.
- The paper offers a more theoretically sound and empirically superior method for high-dimensional feature selection compared to traditional MI approximations.
Variational Information Maximization for Feature Selection
The paper presents a novel approach to feature selection built on a variational information maximization framework. Central to the approach is maximizing the mutual information (MI) between subsets of features and class labels, a common objective in information-theoretic feature selection. Computing MI in high dimensions, however, is intractable, so existing methods rely on heuristic approximations with weak theoretical underpinnings. This paper critiques those traditional methods and proposes an alternative that uses variational distributions to construct tractable lower bounds on MI, allowing theoretically sound approximations even in high-dimensional spaces.
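To make the objective concrete, the sketch below states the subset-selection problem and one standard form of such a variational bound (the Barber–Agakov-style bound); the paper's exact formulation may differ, and the notation here is illustrative.

```latex
% Feature selection as MI maximization over subsets S of fixed size k:
S^{*} = \operatorname*{arg\,max}_{S \subseteq \{1,\dots,d\},\; |S| = k} I(\mathbf{x}_S ; y)

% For ANY variational distribution q(\mathbf{x}_S \mid y), MI is lower-bounded:
I(\mathbf{x}_S ; y) \;\ge\; H(\mathbf{x}_S) + \mathbb{E}_{p(\mathbf{x}_S, y)}\!\left[\ln q(\mathbf{x}_S \mid y)\right],

% with equality when q(\mathbf{x}_S \mid y) = p(\mathbf{x}_S \mid y).
```

Because the bound holds for any q, one can restrict q to a tractable family (e.g., factorized or tree-structured) and still optimize a quantity that never overestimates the true MI.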
Key Contributions
The authors highlight several key contributions of their work:
- Critique of Existing Assumptions: The paper rigorously assesses the assumptions underlying traditional MI-based feature selection methods and illustrates that these assumptions are often mutually inconsistent, thus limiting the methods' effectiveness.
- Variational Lower Bounds: The core innovation of this research is the use of variational lower bounds to approximate MI. This approach admits weaker, more flexible assumptions and yields a theoretically robust feature selection framework that is optimal under tree graphical models.
- Auto-regressive Decomposition: The methodology utilizes an auto-regressive decomposition strategy that aligns naturally with forward feature selection, enabling stepwise optimization of the MI lower bounds (a sketch of this decomposition follows the list).
- Empirical Superiority: Empirical results demonstrate that the proposed method substantially outperforms existing state-of-the-art methods in both synthetic and real-world datasets, including gene expression and standard machine learning datasets.
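The following sketch shows how an auto-regressive factorization lines up with greedy forward selection. The notation (the selection order s_1, ..., s_T and the estimator Î_LB) is illustrative rather than the paper's own:

```latex
% Auto-regressive factorization of the variational distribution over the
% features selected in order s_1, ..., s_T:
q(\mathbf{x}_S \mid y) = \prod_{t=1}^{T} q\!\left(x_{s_t} \mid x_{s_{1:t-1}},\, y\right)

% The lower bound then decomposes into per-step terms, so greedy forward
% selection adds at step t the feature with the largest estimated gain:
s_t = \operatorname*{arg\,max}_{j \notin S_{t-1}}
      \left[ \widehat{I}_{\mathrm{LB}}\!\left(\mathbf{x}_{S_{t-1} \cup \{j\}} ; y\right)
           - \widehat{I}_{\mathrm{LB}}\!\left(\mathbf{x}_{S_{t-1}} ; y\right) \right]
```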
Numerical Results and Implications
The paper provides extensive quantitative evaluations in which the proposed variational methods, termed VMI_naive and VMI_pairwise, consistently outperform existing information-theoretic methods such as mRMR, JMI, CMIM, and SPEC_CMI across multiple datasets. The gains are especially notable in high-dimensional settings such as gene expression data, where traditional approximation methods falter. The variational methods also match the computational complexity of mRMR, keeping them feasible in practical applications.
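As a rough illustration of how such a bound can drive greedy selection in practice, the sketch below runs forward selection with a plug-in estimate of a variational lower bound under a fully factorized ("naive") family for discrete features. It mirrors the flavor of VMI_naive but is a simplified reconstruction, not the authors' implementation; the function names, Laplace smoothing, and the mixture form of the marginal are assumptions made here.

```python
import numpy as np

def naive_vmi_bound(X, y, subset, alpha=1.0):
    """Plug-in estimate of a variational MI lower bound with a fully
    factorized family q(x_S | y) = prod_{j in S} q(x_j | y).
    X: (n, d) integer-coded discrete features; y: (n,) class labels.
    NOTE: a hypothetical sketch, not the paper's estimator."""
    n = X.shape[0]
    classes, y_idx = np.unique(y, return_inverse=True)
    log_prior = np.log(np.bincount(y_idx) / n)               # log p(y)
    log_lik = np.zeros((n, len(classes)))                    # log q(x_S | y') per sample/class
    for j in subset:
        vals, x_idx = np.unique(X[:, j], return_inverse=True)
        counts = np.full((len(classes), len(vals)), alpha)   # Laplace smoothing
        np.add.at(counts, (y_idx, x_idx), 1.0)               # class-conditional counts
        cond = counts / counts.sum(axis=1, keepdims=True)    # q(x_j = v | y)
        log_lik += np.log(cond[:, x_idx]).T                  # accumulate over j in S
    log_joint = log_prior + log_lik                          # log p(y') q(x_S | y')
    log_marg = np.logaddexp.reduce(log_joint, axis=1)        # log q(x_S) as a mixture over y'
    # average pointwise log-ratio: mean of log q(x_S | y) - log q(x_S)
    return np.mean(log_lik[np.arange(n), y_idx] - log_marg)

def forward_select(X, y, k):
    """Greedy forward selection maximizing the bound estimate."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: naive_vmi_bound(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage: pick 5 features from a discretized dataset
# selected = forward_select(X_discrete, y, k=5)
```

Each greedy step re-scores candidate features by the increase in the estimated bound, which is what keeps the per-step cost comparable to pairwise methods like mRMR under a factorized q.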
Implications and Future Directions
The implications of this research are twofold:
- Practical Implications: In practical feature selection tasks, especially in high-dimensional spaces common in genomics and other data-intensive domains, this approach presents a promising alternative by balancing computational tractability with theoretical rigor.
- Theoretical Implications: This work reveals the potential of variational approaches in overcoming the traditional shortcomings of MI estimation. It sets the stage for further exploration of variational methods, not only in feature selection but also in broader machine learning tasks where information-theoretic measures are involved.
Looking forward, this framework opens multiple avenues for future research:
- Global Optimization: Exploring global optimization strategies for the variational lower bounds could further enhance the framework's robustness and convergence properties.
- Enhanced Variational Approaches: Developing more sophisticated variational distributions that capture intricate dependencies between features could yield even stronger performance.
- Broader Applicability: Extending this information-theoretic approach to unsupervised and semi-supervised learning scenarios could broaden the framework’s applicability across machine learning problems.
In conclusion, this paper makes a substantial contribution to feature selection by refining the use of mutual information via variational lower bounds, thereby providing a more reliable and efficient technique for selecting informative features in high-dimensional learning problems.