Spectral Metric for Dataset Complexity Assessment (1905.07299v1)

Published 17 May 2019 in cs.LG and stat.ML

Abstract: In this paper, we propose a new measure to gauge the complexity of image classification problems. Given an annotated image dataset, our method computes a complexity measure called the cumulative spectral gradient (CSG) which strongly correlates with the test accuracy of convolutional neural networks (CNN). The CSG measure is derived from the probabilistic divergence between classes in a spectral clustering framework. We show that this metric correlates with the overall separability of the dataset and thus its inherent complexity. As will be shown, our metric can be used for dataset reduction, to assess which classes are more difficult to disentangle, and approximate the accuracy one could expect to get with a CNN. Results obtained on 11 datasets and three CNN models reveal that our method is more accurate and faster than previous complexity measures.

Citations (27)

Summary

  • The paper introduces the innovative Cumulative Spectral Gradient (CSG) measure that quantifies dataset complexity using spectral analysis.
  • The method leverages CNN-AE and t-SNE embeddings to estimate inter-class overlaps, reducing computational cost while ensuring robust performance.
  • The strong Pearson correlation between CSG scores and CNN test errors validates its practical utility for rapid dataset assessment and data reduction.

This paper, "Spectral Metric for Dataset Complexity Assessment" (1905.07299), introduces a novel measure called the Cumulative Spectral Gradient (CSG) to quantify the inherent complexity of image classification datasets. The primary motivation is to assess how challenging a classification problem is, identify difficult classes, and estimate the potential performance of a Convolutional Neural Network (CNN) without needing to train multiple CNNs extensively. This addresses a significant bottleneck in dataset development and selection, where the standard practice is time-consuming and requires fully annotated data.

Existing dataset complexity measures (c-measures), such as those by Ho and Basu, are often ill-suited to the large, modern image datasets used with deep learning. Their limitations include assumptions of linear separability, restriction to two-class problems, computational cost that scales poorly with the number of samples or feature dimensions, and reliance on raw pixel data, which does not reflect the feature space learned by CNNs.

The proposed CSG method overcomes these limitations by first projecting raw images into a lower-dimensional latent space using an embedding function $\phi(x)$. This allows the analysis to occur in a feature space more relevant to what CNNs learn. The core steps of the method are:

  1. Embedding: Project input images $x$ into an embedding space $\phi(x) \in \mathbb{R}^d$. The paper explores various embeddings, including the raw image data, t-SNE, a CNN autoencoder (CNN-AE), and t-SNE applied to the CNN-AE embedding. The CNN-AE + t-SNE embedding was found to be the most effective in practice.
  2. Estimate Pairwise Class Overlap: For every pair of classes $C_i$ and $C_j$, estimate the overlap between their distributions in the embedding space. Instead of computing the integral of minimum probabilities (Eq. 1) or direct probability product kernels (Eq. 2), which are computationally expensive, the method uses a Monte-Carlo approximation of the expectation $E_{P(\phi(x)|C_i)}[P(\phi(x)|C_j)]$ (Eq. 3): the estimated probability $P(\phi(x_m)|C_j)$ is averaged over $M$ samples $\phi(x_m)$ drawn from class $C_i$ (a minimal code sketch of this step appears after this list).
  3. Approximate Probability Density: The unknown class-conditional density $P(\phi(x)|C_j)$ is approximated with a K-nearest neighbor estimator (Eq. 4), which is computationally tractable.
  4. Construct Inter-class Similarity Matrix: The Monte-Carlo approximations of the divergence between all pairs of classes form a $K \times K$ similarity matrix $\mathcal{S}$, where $K$ is the number of classes. Since $\mathcal{S}$ is not necessarily symmetric, it is treated as a set of class signature vectors ($\mathcal{S}_i$ being the signature of class $C_i$). An undirected adjacency matrix $W$ of size $K \times K$ is computed using the Bray-Curtis distance between these signature vectors (Eq. 5): $w_{ij} = 1$ indicates identical distributions (high overlap), and $w_{ij} = 0$ indicates no overlap.
  5. Compute Laplacian and Spectrum: A Laplacian matrix $L = D - W$ is constructed, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j w_{ij}$. The eigenvalues $\{\lambda_0, ..., \lambda_{K-1}\}$ of $L$ form the spectrum. The distribution and magnitude of these eigenvalues indicate the dataset's complexity; larger eigenvalues and steeper gradients suggest more entangled classes.
  6. Calculate CSG: The Cumulative Spectral Gradient (CSG) is derived from the spectrum. First, normalized eigengaps $\Delta\widetilde{\lambda}_i = (\lambda_{i+1} - \lambda_i) / (K - i)$ are calculated (Eq. 6). The CSG is then the sum of the cumulative maximum of these normalized eigengaps (Eq. 7). This metric captures both the overall scale of the eigenvalues and the position of the most significant eigengap (gradient discontinuity), which reflects how difficult it is to partition the graph, i.e., to separate the classes.
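
A minimal sketch of steps 2 and 3 in Python (NumPy/scikit-learn), assuming a standard k-nearest-neighbor density estimate; the function names (knn_density, class_overlap_matrix) and the exact form of the density estimate are illustrative assumptions, not the authors' implementation of Eqs. 3-4.

import math
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(queries, reference, k):
    # Rough kNN density estimate of P(query | reference class): k points fall
    # inside the d-dimensional ball whose radius is the k-th neighbor distance.
    d = reference.shape[1]
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(queries)
    r = dist[:, -1] + 1e-12                      # radius enclosing k neighbors
    volume = (math.pi ** (d / 2) / math.gamma(d / 2 + 1)) * r ** d
    return k / (len(reference) * volume)

def class_overlap_matrix(embeddings, labels, M=100, k=10, seed=0):
    # S[i, j] is a Monte-Carlo estimate of E_{phi(x) ~ C_i}[ P(phi(x) | C_j) ].
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    K = len(classes)
    S = np.zeros((K, K))
    for i, ci in enumerate(classes):
        xi = embeddings[labels == ci]
        idx = rng.choice(len(xi), size=min(M, len(xi)), replace=False)
        sample = xi[idx]                         # M samples drawn from class C_i
        for j, cj in enumerate(classes):
            S[i, j] = knn_density(sample, embeddings[labels == cj], k).mean()
    return S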

The algorithm for computing CSG is summarized below:

Algorithm: Compute CSG
Input: Dataset {(phi(x_1), t_1), ..., (phi(x_N), t_N)}, Parameters M (samples per class), k (neighbors)
Output: CSG score

1.  For each pair of classes (Ci, Cj):
    a.  Select M samples from Ci.
    b.  For each sample phi(x_m) from Ci:
        i.  Estimate P(phi(x_m) | Cj) using k-nearest neighbors in Cj.
    c.  Compute the Monte-Carlo average: S_ij = (1/M) * sum(P(phi(x_m) | Cj)) over the selected samples.
2.  The results form the K x K similarity matrix S.
3.  For each pair of classes (Ci, Cj):
    a.  Compute the Bray-Curtis distance between the signature vectors S_i and S_j.
    b.  Calculate the adjacency matrix entry: w_ij = 1 - Bray-Curtis(S_i, S_j).
4.  Compute the degree matrix D where D_ii = sum(w_ij over j).
5.  Compute the Laplacian matrix L = D - W.
6.  Compute the eigenvalues {lambda_0, ..., lambda_{K-1}} of L.
7.  Sort eigenvalues in ascending order: lambda_0 <= lambda_1 <= ... <= lambda_{K-1}.
8.  Compute normalized eigengaps: delta_tilde_lambda_i = (lambda_{i+1} - lambda_i) / (K - i) for i=0, ..., K-2.
9.  Compute cumulative maximum of normalized eigengaps: cummax(delta_tilde_lambda).
10. Compute CSG = sum(cummax(delta_tilde_lambda)_i) over i.
11. Return CSG.
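
Continuing from the overlap matrix S above, the following is a hedged NumPy/SciPy sketch of steps 3-11 of the pseudocode (Bray-Curtis adjacency, Laplacian, spectrum, and the CSG of Eqs. 6-7); the function name csg_from_overlap is illustrative, not taken from the authors' code.

import numpy as np
from scipy.spatial.distance import braycurtis

def csg_from_overlap(S):
    K = S.shape[0]
    # Steps 3a-3b: symmetric adjacency from the class signature vectors (rows of S).
    W = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            W[i, j] = 1.0 - braycurtis(S[i], S[j])
    # Steps 4-5: unnormalized graph Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Steps 6-7: eigenvalues of L in ascending order.
    eigvals = np.sort(np.linalg.eigvalsh(L))
    # Step 8: normalized eigengaps (Eq. 6).
    gaps = np.array([(eigvals[i + 1] - eigvals[i]) / (K - i) for i in range(K - 1)])
    # Steps 9-11: sum of the cumulative maximum of the eigengaps (Eq. 7).
    return np.maximum.accumulate(gaps).sum()

# Example usage together with the earlier sketch:
# csg = csg_from_overlap(class_overlap_matrix(embeddings, labels, M=100, k=10))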

Practical Implementation and Evaluation:

The method was evaluated on 11 image classification datasets, varying in size, content, and complexity (e.g., MNIST, CIFAR10, SVHN, MioTCD, STL-10, SeeFood). Correlation was measured against the test error rates of three CNN architectures (AlexNet, ResNet-50, XceptionNet).

  • Embeddings: The study showed that processing data in a learned embedding space (like from a CNN-AE) and further reducing dimensionality with t-SNE significantly improved the correlation between CSG and CNN accuracy compared to using raw data or just CNN-AE embeddings.
  • Correlation: CSG, particularly with the CNN-AE + t-SNE embedding, achieved strong Pearson correlations (up to 0.968) with CNN error rates across multiple datasets and models, outperforming traditional Ho-Basu complexity measures which showed weak or no significant correlation.
  • Speed: The Monte-Carlo approximation and Bray-Curtis distance provide significant runtime improvements. While training the CNN-AE embedding takes time (e.g., hours), computing the CSG score after the embedding is available is very fast (seconds to minutes), making it orders of magnitude faster than training/evaluating multiple CNNs or computing some traditional c-measures.
  • Hyperparameters: The CSG metric was shown to be relatively insensitive to the choice of $M$ (number of samples per class) and $k$ (number of neighbors for density estimation), providing robust performance across a range of values.

Applications:

  • Dataset Complexity Assessment: CSG provides a single, interpretable number that correlates strongly with expected CNN performance. This helps gauge the inherent difficulty of a dataset.
  • Dataset Reduction: By monitoring the CSG as samples are removed, one can identify the point at which removing more data causes a sharp increase in complexity, indicating potential performance degradation. This was demonstrated on the MioTCD dataset, where CSG remained stable for large reduction ratios before increasing sharply, mirroring the CNN error rate.
  • Class Disentanglement Analysis: The similarity matrix $W$ itself can be used to visualize the relationships between classes, e.g., with Multidimensional Scaling (MDS), as sketched below. The resulting 2D plots and the structure of $W$ show which classes are most entangled, effectively predicting the structure of a confusion matrix without any CNN training.
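
As an illustration of this use of $W$ (a sketch under assumed names, not the paper's code), the classes can be projected to 2D with scikit-learn's MDS by treating $1 - W$ as a precomputed dissimilarity matrix:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

def plot_class_entanglement(W, class_names):
    # Entangled classes (high similarity) end up close together in the 2D plot.
    dissimilarity = 1.0 - W
    np.fill_diagonal(dissimilarity, 0.0)
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dissimilarity)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), name in zip(coords, class_names):
        plt.annotate(name, (x, y))
    plt.title("Class entanglement (MDS of 1 - W)")
    plt.show()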

Future Directions:

The authors suggest extending the method to compare datasets with different numbers of classes (e.g., by analyzing random subsets of classes), generalizing it to other tasks such as segmentation and localization, incorporating the similarity matrix $W$ directly into CNN training objectives (similar to a triplet loss) to improve class separation, and applying the method to other domains such as NLP using existing embeddings like Word2Vec.
