
What is the dimension of your binary data?

Published 4 Feb 2019 in cs.LG and stat.ML | (1902.01480v1)

Abstract: Many 0/1 datasets have a very large number of variables; on the other hand, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, as such the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset $D$, its normalized fractal dimension is the number of columns in a dataset $D'$ with independent columns and having the same (unnormalized) fractal dimension as $D$. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against baseline measures such as PCA. We also study the relationship of the dimension of the whole dataset and the dimensions of subgroups formed by clustering. The results indicate interesting differences between and within datasets.


Summary

  • The paper introduces a normalized correlation dimension that adapts fractal concepts to measure the effective dimensionality of sparse binary datasets.
  • It details efficient computation methods using direct calculation, sparse optimizations, and sampling to estimate pairwise L1 distances.
  • The approach enables comparing dataset complexity and serves as a complementary tool to PCA by revealing intrinsic structural dependencies.

This paper (1902.01480) addresses the challenge of defining a meaningful "effective dimension" for binary datasets, which are often high-dimensional but sparse and structured. Traditional dimensionality reduction methods like PCA and SVD are designed for real-valued data and are not directly suitable. The authors propose adapting concepts from fractal dimension, specifically the correlation dimension, for binary data and introduce a normalized correlation dimension to make the measure more interpretable.

The core idea is to analyze the distribution of pairwise distances between points in the binary dataset. For a dataset $D$ with $K$ binary variables, the $L_1$ (Manhattan) distance is used between two points $x, y \in D$. The random variable $Z_D$ represents the $L_1$ distance between two randomly chosen points from $D$. The correlation dimension is based on the probability $P(Z_D < r)$, which is the fraction of point pairs with an $L_1$ distance less than $r$.

The authors define the correlation dimension, denoted $cd$, as the slope of a line fitted to the log-log plot of $(r, P(Z_D < r))$ for various radii $r$. To handle the discrete nature of binary distances, they linearly interpolate $P(Z_D < r)$ between integer values of $r$ to obtain a continuous function $f(r)$. The dimension $cd_R(D, r_1, r_2)$ is the slope of the least-squares linear fit to the points $\{(\log r, \log f(r))\}$ for $r$ in a range $[r_1, r_2]$. A related definition, $cd_A(D, \alpha_1, \alpha_2)$, uses radii $r_1, r_2$ such that $f(r_1) = \alpha_1$ and $f(r_2) = \alpha_2$, effectively focusing on quantiles of the distance distribution. The paper primarily uses $cd_A(D, 1/4, 3/4)$, i.e., the range between the first and third quartiles of the pairwise distance distribution.

Implementation of Correlation Dimension:

To compute $cd_A(D, \alpha_1, \alpha_2)$, you need to calculate $f(r) = P(Z_D < r)$ for integer values of $r$ from $0$ to $K$. This involves calculating the $L_1$ distance between all pairs of points in $D$. The $L_1$ distance between two binary vectors $x, y$ is the number of positions where they differ: $\|x - y\|_1 = \sum_{i=1}^K |x_i - y_i|$; since $x_i, y_i \in \{0, 1\}$, $|x_i - y_i|$ is $1$ if $x_i \neq y_i$ and $0$ if $x_i = y_i$. The number of pairs with distance less than $r$ is $\sum_{x \in D} \sum_{y \in D} I(\|x - y\|_1 < r)$, where $I(\cdot)$ is the indicator function; $f(r)$ is this count divided by $|D|^2$.
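As an illustrative sketch (not code from the paper), $f(r)$ can be computed directly for a moderately sized 0/1 matrix using the identity $\|x - y\|_1 = \|x\|_1 + \|y\|_1 - 2\, x \cdot y$; the helper name `distance_cdf` is hypothetical:

```python
import numpy as np

def distance_cdf(D):
    """Estimate f(r) = P(Z_D < r) from all pairwise L1 (Hamming) distances.

    D: (n, K) 0/1 numpy array. Returns an array f of length K + 2 where
    f[r] is the fraction of ordered point pairs with distance < r.
    """
    n, K = D.shape
    # For 0/1 vectors: ||x - y||_1 = ||x||_1 + ||y||_1 - 2 * (x . y)
    ones = D.sum(axis=1)
    overlap = D @ D.T                      # pairwise dot products
    dist = ones[:, None] + ones[None, :] - 2 * overlap
    counts = np.bincount(dist.ravel(), minlength=K + 1)
    return np.concatenate(([0], np.cumsum(counts))) / (n * n)
```

This includes self-pairs, which matches the $|D|^2$ normalization in the definition above.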

  • Direct Computation: Calculating all pairwise $L_1$ distances naively takes $O(|D|^2 K)$ time. For sparse binary data, the $L_1$ distance between vectors $x$ and $y$ is $\|x\|_1 + \|y\|_1 - 2\|x \odot y\|_1$, where $\odot$ is element-wise multiplication (AND), so only the positions of the 1s matter. With a sparse matrix representation, letting $L$ denote the total number of 1s in $D$ and $m_i$ the number of 1s in row $x_i$, the sum of all pairwise distances is $\sum_i \sum_j (m_i + m_j - 2\,\mathrm{overlap}(x_i, x_j))$, and the terms $\sum_i \sum_j (m_i + m_j) = 2|D|L$ can be accumulated without touching zero entries. The paper gives an $O(|D| L)$ bound for computing all pairwise distances.
  • Approximation via Sampling: For very large datasets (large $|D|$), direct computation is too slow. The paper proposes estimating $P(Z_D < r)$ using a random subset $D_s \subset D$. Two estimators are given:

    1. Pick $x \in D$, $y \in D_s$: $\frac{1}{|D||D_s|} \sum_{x \in D} \sum_{y \in D_s} I(\|x-y\|_1 < r)$.
    2. Pick $x, y \in D_s$: $\frac{1}{|D_s|^2} \sum_{x \in D_s} \sum_{y \in D_s} I(\|x-y\|_1 < r)$.

The experiments used the first estimator with $|D_s| = 10{,}000$ points. With an efficient sparse distance computation, the cost of the first estimator is roughly $O(|D_s| L)$. The experimental results (Table 3) suggest the running time is proportional to the total number of 1s ($L$) when sampling $D_s$, with a factor related to $|D_s|$, so an efficient sparse distance calculation is crucial.
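A rough sketch of the first sampling scheme, using SciPy sparse matrices and the sparse distance identity; the function name `sampled_distance_cdf` is my own, not from the paper:

```python
import numpy as np
import scipy.sparse as sp

def sampled_distance_cdf(D, n_sample=10_000, rng=None):
    """Estimate P(Z_D < r) by pairing every point x in D with points y
    drawn from a random subset D_s (the first sampling estimator).

    D: scipy.sparse CSR matrix of 0/1 entries, shape (n, K).
    Returns cdf with cdf[r] = estimated P(Z_D < r).
    """
    rng = np.random.default_rng(rng)
    n, K = D.shape
    idx = rng.choice(n, size=min(n_sample, n), replace=False)
    Ds = D[idx]
    ones_D = np.asarray(D.sum(axis=1)).ravel()
    ones_S = np.asarray(Ds.sum(axis=1)).ravel()
    # Sparse identity: ||x - y||_1 = ||x||_1 + ||y||_1 - 2 x.y
    overlap = (D @ Ds.T).toarray()
    dist = ones_D[:, None] + ones_S[None, :] - 2 * overlap
    counts = np.bincount(dist.astype(int).ravel(), minlength=K + 1)
    return np.concatenate(([0], np.cumsum(counts))) / dist.size
```

The only dense object is the `|D| × |D_s|` overlap matrix, so memory scales with the sample size rather than $|D|^2$.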

Once $f(r)$ is estimated for the relevant integer $r$, you find $r_1, r_2$ such that $f(r_1) \approx \alpha_1$ and $f(r_2) \approx \alpha_2$, then compute the slope of $\log f(r)$ versus $\log r$ for $r \in [r_1, r_2]$ using linear regression. The paper uses $N = 50$ points within $[r_1, r_2]$ for the linear fit $\mathcal{I}(D, r_1, r_2, N)$.
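The quantile inversion and log-log fit can be sketched as follows, assuming a `cdf` array with values at integer radii (like the one produced by the direct computation above); this is an illustration, not the paper's implementation:

```python
import numpy as np

def correlation_dimension(cdf, alpha1=0.25, alpha2=0.75, n_points=50):
    """Estimate cd_A(D, alpha1, alpha2): the slope of log f(r) vs log r
    between the radii where the interpolated distance CDF hits the quantiles.

    cdf: array with cdf[r] = P(Z_D < r) at integer r, cdf[0] = 0.
    """
    r_grid = np.arange(len(cdf))
    # Invert the linearly interpolated CDF to find r1, r2 at the quantiles.
    r1 = np.interp(alpha1, cdf, r_grid)
    r2 = np.interp(alpha2, cdf, r_grid)
    rs = np.linspace(r1, r2, n_points)
    fs = np.interp(rs, r_grid, cdf)        # linear interpolation of f(r)
    slope, _ = np.polyfit(np.log(rs), np.log(fs), 1)
    return slope
```

As a sanity check, a synthetic CDF of the form $f(r) \propto r^2$ should yield a slope close to 2.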

Normalization: Normalized Correlation Dimension (NCD):

The raw correlation dimension ($cd$) can be small and hard to interpret. The NCD aims to provide a more intuitive scale. The NCD of dataset $D$, $ncd_A(D, \alpha_1, \alpha_2)$, is defined as the number of columns $H$ that a synthetic dataset $ind(H, s)$ with $H$ independent binary variables (each equal to 1 with probability $s$) would need in order to have the same correlation dimension as $D$, i.e., $cd_A(ind(H, s)) = cd_A(D)$. The marginal probability $s$ is chosen such that $cd_A(ind(K, s)) = cd_A(ind(D))$, where $ind(D)$ is a dataset with $K$ independent columns having the same marginal probabilities as $D$.

  • Implementation of NCD:

1. Calculate $cd_A(D)$ using the method described above.
2. Create a synthetic dataset $ind(D)$ by randomizing each column of $D$ independently (or generating data with the same marginal probabilities).
3. Calculate $cd_A(ind(D))$. This involves estimating $P(Z_{ind(D)} < r)$, e.g., by generating random pairs from the independent distribution implied by $D$'s marginals.
4. Find a probability $s$ such that $cd_A(ind(K, s)) = cd_A(ind(D))$ using binary search; $cd_A(ind(H, s))$ can be approximated theoretically (Proposition 1) or estimated by generating synthetic data.
5. Find an integer $H$ such that $cd_A(ind(H, s)) = cd_A(D)$ using binary search. This $H$ is the normalized correlation dimension $ncd_A(D)$.
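The five steps above can be sketched end to end. This is a rough, self-contained illustration under my own assumptions: all helper names are hypothetical, $cd_A(ind(H, s))$ is estimated by Monte Carlo rather than via Proposition 1, and the binary search over $s$ on $(0, 1/2]$ assumes $cd_A(ind(H, s))$ grows with both $H$ and $s$ in that range (plausible for datasets with marginals below $1/2$):

```python
import numpy as np

def _cd(D, a1=0.25, a2=0.75, n_points=50):
    """cd_A: log-log slope of the pairwise-L1 CDF between quantiles a1, a2."""
    ones = D.sum(axis=1)
    dist = ones[:, None] + ones[None, :] - 2 * (D @ D.T)
    counts = np.bincount(dist.ravel(), minlength=D.shape[1] + 1)
    cdf = np.concatenate(([0], np.cumsum(counts))) / dist.size
    r_grid = np.arange(len(cdf))
    r1, r2 = np.interp([a1, a2], cdf, r_grid)
    rs = np.linspace(r1, r2, n_points)
    return np.polyfit(np.log(rs), np.log(np.interp(rs, r_grid, cdf)), 1)[0]

def cd_independent(H, s, n=1000, seed=0):
    """Monte Carlo estimate of cd_A(ind(H, s)) from n sampled rows."""
    rng = np.random.default_rng(seed)
    return _cd((rng.random((n, H)) < s).astype(np.int64))

def ncd(D, n=1000, seed=0):
    """Normalized correlation dimension via the two binary searches."""
    K = D.shape[1]
    target = _cd(D)                                      # step 1
    rng = np.random.default_rng(seed)
    ind_D = np.stack([rng.permutation(c) for c in D.T], axis=1)  # step 2
    cd_ind = _cd(ind_D)                                  # step 3
    lo, hi = 1e-3, 0.5                                   # step 4: search s
    for _ in range(20):
        s = (lo + hi) / 2
        if cd_independent(K, s, n, seed) < cd_ind:
            lo = s
        else:
            hi = s
    lo_h, hi_h = 1, K                                    # step 5: search H
    while lo_h < hi_h:
        mid = (lo_h + hi_h) // 2
        if cd_independent(mid, s, n, seed) < target:
            lo_h = mid + 1
        else:
            hi_h = mid
    return lo_h
```

For data that is already independent, the result should come out close to the actual number of columns $K$, up to Monte Carlo noise.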

  • Approximation for NCD: Proposition 2 offers a direct approximation: $ncd_A(D) \approx K \cdot (cd_A(D) / cd_A(ind(D)))^2$. This avoids the binary searches for $H$ and $s$. The empirical results suggest this approximation works well for synthetic data but can be less accurate for sparse real-world data.
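In code the approximation is a one-liner; its inputs are the two dimension estimates computed earlier, and the numeric values in the example are made up for illustration:

```python
def ncd_approx(K, cd_data, cd_indep):
    """Proposition 2 approximation: ncd_A(D) ~ K * (cd_A(D) / cd_A(ind(D)))^2."""
    return K * (cd_data / cd_indep) ** 2

# Hypothetical values: K = 100 columns, cd_A(D) = 2.0, cd_A(ind(D)) = 4.0.
print(ncd_approx(100, 2.0, 4.0))  # → 25.0
```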

Practical Applications and Interpretation:

  • Complexity Measure: NCD provides a single number describing the "effective complexity" or intrinsic dimensionality of a binary dataset. A high NCD relative to the number of variables $K$ suggests less structure and more independence, while a low NCD suggests strong dependencies or sparsity patterns that reduce the effective degrees of freedom.

  • Dataset Comparison: NCD allows comparing the structural complexity of different binary datasets, even if they differ in the number of variables or in sparsity. For instance, the paper shows NCD varies significantly across real datasets (Table 2): Retail ($K \approx 16{,}000$) has ncd $\approx 1800$ (about 11% of $K$), while Accidents ($K = 469$) has ncd $\approx 220$ (about 47% of $K$), indicating Retail has more structure per variable than Accidents.
  • Alternative to PCA for Binary Data: The paper compares NCD to PCA (number of components for 90% variance). They correlate positively, but there are differences. For 'Paleo' data, PCA suggests higher dimension than NCD and average correlation, indicating PCA might overestimate complexity for some binary data structures, especially those with homogeneous margins (which NCD might handle better). This suggests NCD can be a complementary or better measure for certain binary datasets.
  • Analyzing Subgroups: Studying the NCD of clusters or subgroups can reveal how dimensionality changes locally. The experiments show clusters can have higher dimensions than the combined dataset, implying the structure reducing the overall dimension might be due to the relationships between clusters.

Implementation Considerations:

  • Sparsity: Binary data is often sparse. Efficiently computing $L_1$ distances and sums of distances is crucial. Use sparse matrix libraries (e.g., SciPy in Python) and optimize distance calculations for binary data (popcount, bitwise operations if applicable).
  • Sampling: For datasets with millions of rows, sampling is essential to make the computation feasible. The choice of sample size $|D_s|$ affects both accuracy and runtime.
  • Linear Regression: Fitting the line to $\{(\log r, \log f(r))\}$ is a standard linear regression task; appropriate $r$ values or $\alpha$ quantiles must be selected. The paper used $\alpha_1 = 1/4$, $\alpha_2 = 3/4$ and $N = 50$ points.
  • Binary Search (for NCD without approximation): Implementing the binary searches for $H$ and $s$ requires an efficient way to calculate $cd_A(ind(H, s))$ or its approximation. Proposition 1 provides an analytical form based on the normal approximation of sums of Bernoulli variables, which can be used.
  • Computational Resources: Calculating pairwise distances (even with sampling) can be memory and CPU intensive. Distributed computing frameworks could be beneficial for large datasets.
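The popcount/bitwise suggestion in the Sparsity bullet can be sketched as follows: pack each 0/1 row into bytes, then count differing bits with XOR and a byte-level popcount table. This is my own illustration (`packed_hamming` is a hypothetical name), not code from the paper:

```python
import numpy as np

def packed_hamming(D):
    """Pairwise L1 distances for 0/1 rows via bit-packing and popcount.

    Packs each row into uint8 words with np.packbits, then counts
    differing bits with XOR and a 256-entry popcount lookup table.
    Padding bits added by packbits are identical across rows, so they
    XOR to zero and do not affect the counts.
    """
    packed = np.packbits(D.astype(np.uint8), axis=1)      # (n, ceil(K/8))
    popcount = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None],
                             axis=1).sum(axis=1)           # bits set in 0..255
    n = packed.shape[0]
    dist = np.empty((n, n), dtype=np.int64)
    for i in range(n):
        dist[i] = popcount[packed[i] ^ packed].sum(axis=1)
    return dist
```

Each row comparison then touches $K/8$ bytes instead of $K$ entries, which helps when the data is dense enough that a sparse representation no longer pays off.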

The paper highlights that this method, unlike PCA or SVD, does not provide a low-dimensional embedding. Its purpose is to measure intrinsic dimension, not to map data to a lower space for visualization or feature reduction. However, knowing the intrinsic dimension can be useful for model selection, algorithm choice (e.g., which indexing structures or distance metrics might work well), or simply understanding the underlying structure of the data.
