- The paper introduces cmsAULS as a novel metric to assess image dataset complexity through Laplacian spectrum analysis.
- It employs dimension reduction, similarity matrix construction via the probability product kernel, and eigenvalue analysis of the resulting Laplacian matrix.
- Empirical findings show that lower cmsAULS scores correlate with lower DCNN test error rates, guiding model and dataset selection.
This paper introduces a novel method called cumulative maximum scaled Area Under Laplacian Spectrum (cmsAULS) for assessing the complexity of classification datasets, particularly for image data intended for deep convolutional neural network (DCNN) training. The primary motivation is that training DCNNs is computationally expensive and time-consuming, and being able to predict classification performance based on dataset complexity before training can save significant resources.
The core idea is to capture the entanglement or overlap between classes within a dataset. The proposed cmsAULS metric is derived from the Laplacian spectrum of a similarity matrix constructed between the dataset's classes. The method consists of three main steps:
- Dimension Reduction: High-dimensional image data is reduced to a lower-dimensional feature space ψ(x) ∈ ℝ^d. Any standard dimension-reduction technique, such as an autoencoder, t-SNE, or PCA, can be used for this step. The choice of method and the target dimension d can influence the complexity assessment results.
- Similarity Matrix Construction: The overlap between each pair of classes is estimated with the probability product kernel, which measures the similarity between the feature distributions of two classes. The kernel's expectation is approximated by Monte Carlo sampling: M feature vectors are drawn from one class, and the probability of each belonging to the other class is estimated with a k-nearest-neighbor estimator built from E samples of that class (a sketch follows this list). This yields an n×n similarity matrix X, where n is the number of classes.
- Dataset Complexity Calculation using Laplacian Spectrum:
  - The asymmetric similarity matrix X is converted into a symmetric similarity matrix W using the Bray-Curtis distance.
  - A Laplacian matrix L is constructed from W and the degree matrix D (L = D − W).
  - The eigenvalues (λ_0, λ_1, …, λ_{n−1}) of the Laplacian matrix L are computed, forming the Laplacian spectrum. The magnitude of these eigenvalues reflects the similarity (or separability) between classes.
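To make the similarity-matrix step concrete, here is a minimal sketch of estimating a single entry X_ij, assuming a standard k-NN density estimator (the paper's exact estimator variant may differ); all function and variable names are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import gamma
from sklearn.neighbors import NearestNeighbors

def knn_density(queries, reference, k=3):
    """k-NN density estimate: p(x) ~ k / (E * V_d * r_k(x)^d), where r_k(x) is
    the distance from x to its k-th nearest neighbor among the E reference points."""
    E, d = reference.shape
    dists, _ = NearestNeighbors(n_neighbors=k).fit(reference).kneighbors(queries)
    r_k = dists[:, -1]                                   # distance to the k-th neighbor
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)            # volume of the unit d-ball
    return k / (E * v_d * np.maximum(r_k, 1e-12) ** d)   # guard against r_k = 0

def similarity_entry(feats_i, feats_j, M=100, E=100, k=3, seed=0):
    """Monte Carlo estimate of X_ij: mean density of class j evaluated at M features
    sampled from class i (assumes M <= len(feats_i) and E <= len(feats_j))."""
    rng = np.random.default_rng(seed)
    queries = feats_i[rng.choice(len(feats_i), size=M, replace=False)]
    reference = feats_j[rng.choice(len(feats_j), size=E, replace=False)]
    return knn_density(queries, reference, k=k).mean()
```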
The cmsAULS metric is calculated based on the cumulative maximum of a scaled difference between squared adjacent eigenvalues:
$$\Delta\lambda_i = \frac{\lambda_{i+1}^2 - \lambda_i^2}{2(n - i)}$$

$$\text{cmsAULS} = \sum_{i=0}^{n-2} \operatorname{cummax}(\Delta\lambda)_i$$
A lower cmsAULS score indicates lower dataset complexity (easier to classify), while a higher score suggests higher complexity (more difficult).
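The two formulas above reduce directly to a few lines of NumPy. In this sketch, `eigvals` is assumed to hold the Laplacian spectrum computed from L:

```python
import numpy as np

def cmsauls_from_eigenvalues(eigvals):
    """cmsAULS from the non-decreasing Laplacian spectrum lambda_0, ..., lambda_{n-1}."""
    lam = np.sort(np.asarray(eigvals))
    n = len(lam)
    i = np.arange(n - 1)
    delta = (lam[1:] ** 2 - lam[:-1] ** 2) / (2 * (n - i))  # scaled squared differences
    return np.maximum.accumulate(delta).sum()               # sum of cumulative maxima
```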
The overall process can be summarized as follows:
```text
Algorithm: Compute cmsAULS
Input:  dataset with n classes; M (Monte Carlo samples); E (samples for k-NN); k (neighbors for k-NN)
Output: cmsAULS score

1.  For each image x in the dataset:
        apply dimension reduction: psi_x = psi(x)
        store the features grouped by class
2.  Initialize the n x n similarity matrix X.
3.  For each ordered pair of classes (i, j):
        randomly sample M features from class i
        for each sampled feature psi_m:
            estimate p(psi_m | class j) with a k-NN estimator on E samples from class j
        X_ij = (1/M) * sum_m p(psi_m | class j)
4.  Initialize the n x n symmetric similarity matrix W.
5.  For each pair of classes (i, j), compute the Bray-Curtis similarity of columns X_i, X_j:
        W_ij = 1 - sum(|X_i - X_j|) / sum(X_i + X_j)
6.  Construct the degree matrix D: D_ii = sum_j W_ij.
7.  Construct the Laplacian matrix L = D - W.
8.  Compute the eigenvalues of L, sorted non-decreasingly: lambda_0, ..., lambda_{n-1}.
9.  For i = 0 to n-2:
        delta_lambda_i = (lambda_{i+1}^2 - lambda_i^2) / (2 * (n - i))
10. Compute cummax_delta_lambda = cumulative maximum of [delta_lambda_0, ..., delta_lambda_{n-2}].
11. Return cmsAULS = sum(cummax_delta_lambda).
```
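For readers who want something executable, below is a minimal Python sketch of steps 2-11, reusing the illustrative `similarity_entry` and `cmsauls_from_eigenvalues` helpers defined in the earlier snippets. It follows the listing above but is a sketch, not the authors' reference implementation:

```python
import numpy as np

def cmsauls(features_by_class, M=100, E=100, k=3):
    """End-to-end cmsAULS for a list of per-class feature arrays (steps 2-11)."""
    n = len(features_by_class)
    # Steps 2-3: Monte Carlo similarity matrix X (asymmetric in general).
    X = np.array([[similarity_entry(features_by_class[i], features_by_class[j], M, E, k)
                   for j in range(n)] for i in range(n)])
    # Steps 4-5: symmetrize via the Bray-Curtis similarity between columns of X.
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            W[i, j] = 1.0 - np.abs(X[:, i] - X[:, j]).sum() / (X[:, i] + X[:, j]).sum()
    # Steps 6-8: degree matrix, Laplacian, and its spectrum.
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals = np.linalg.eigvalsh(L)   # symmetric L -> real, non-decreasing eigenvalues
    # Steps 9-11: scaled differences, cumulative maximum, final score.
    return cmsauls_from_eigenvalues(eigvals)
```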
Practical Implementation and Application:
- Dimension Reduction Choice: The paper evaluates CNN autoencoders, t-SNE, and PCA. Combining a pretrained DCNN feature extractor (e.g., EfficientNet-B4 trained on ImageNet) with t-SNE yields the highest correlations with DCNN error rates, suggesting that features learned by powerful models are well suited to this complexity assessment. For practical use, leveraging an existing state-of-the-art feature extractor is therefore a good strategy (see the sketch after this list).
- Hyperparameter Tuning: The method involves the hyperparameters M, E, and k for similarity matrix construction, plus the chosen dimension-reduction method's own parameters (e.g., target dimension d, perplexity for t-SNE, contribution rate for PCA). The paper uses M = 100, E = 100, k = 3, with d = 128 for the autoencoder and d = 3 for t-SNE. Experimentation might be needed to find optimal parameters for different types of datasets.
- Computational Complexity: The asymptotic time complexity is O(M·d²·n²). While this can be significant for very large numbers of classes or high feature dimensions, the experiments show reasonable computation times (e.g., about 50 s for CIFAR-10, which has 10 classes). For datasets with thousands of classes, optimizing the similarity estimation (Step 3) or the eigenvalue computation (Step 8) might be necessary.
- Predicting Performance: The core application is predicting the expected error rate of a classifier (specifically, DCNNs) on a dataset before expensive training. A strong positive correlation between cmsAULS and DCNN test error rates means datasets with high cmsAULS scores are likely to result in higher error rates. This allows developers to prioritize datasets, select appropriate model complexity, or decide if data augmentation or cleaning is required.
- Classifier Selection and Dataset Reduction: High complexity scores can indicate that a simpler classifier might struggle, suggesting the need for more powerful models. Conversely, low scores might suggest a dataset is amenable to simpler models or even dataset reduction techniques without significant performance loss.
- Visualization: The derived symmetric similarity matrix W can be used with techniques like Multidimensional Scaling (MDS) to visualize class separability in a 2D or 3D space. This provides an intuitive understanding of which classes are most confused or difficult to separate. The paper's visualization shows clear separation for MNIST and more overlap for CIFAR-10 and STL-10, consistent with their known complexities.
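As an illustration of the feature-extraction and visualization points above, here is a hedged sketch using timm and scikit-learn. The model name and target dimension mirror the paper's choices, but the exact preprocessing is not specified there; `images` is assumed to be a batch already preprocessed for the extractor:

```python
import numpy as np
import torch
import timm
from sklearn.manifold import TSNE, MDS

# Pretrained EfficientNet-B4 as a feature extractor (num_classes=0 returns pooled features).
model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=0).eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: preprocessed (N, 3, H, W) batch; returns an (N, feature_dim) array."""
    return model(images).cpu().numpy()

def tsne_features(images: torch.Tensor, dim: int = 3) -> np.ndarray:
    """Pretrained-extractor + t-SNE pipeline (d = 3, as in the paper)."""
    return TSNE(n_components=dim).fit_transform(extract_features(images))

def mds_embedding(W: np.ndarray) -> np.ndarray:
    """2D MDS embedding of the classes, treating 1 - W as a precomputed dissimilarity."""
    return MDS(n_components=2, dissimilarity="precomputed").fit_transform(1.0 - W)
```

Using 1 − W as the dissimilarity assumes W lies in [0, 1] with W_ii = 1, which holds for the Bray-Curtis similarity of non-negative columns.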
Limitations:
- Uncertain Upper Limit: Unlike some normalized complexity measures, cmsAULS does not have a fixed upper limit value, which could make comparing scores across vastly different numbers of classes less intuitive without normalization.
- Suitability for Pixel-Level Tasks: The method is designed for classification datasets where each sample belongs to a single class. Its direct application to pixel-level tasks like segmentation, where images contain many different classes simultaneously, is not straightforward and identified as future work.
In summary, the cmsAULS method provides a practical, spectrally-derived metric for assessing image dataset complexity, strongly correlating with DCNN classification performance. Its implementation involves standard steps of dimension reduction and graph Laplacian analysis on class-similarity matrices. Leveraging powerful pretrained feature extractors appears to be key for achieving high predictive accuracy in practice. The metric can serve as a valuable tool in the machine learning lifecycle for dataset selection, model choice, and effort estimation before embarking on lengthy training processes.