
What is the dimension of your binary data?

Published 4 Feb 2019 in cs.LG and stat.ML | (1902.01480v1)

Abstract: Many 0/1 datasets have a very large number of variables; on the other hand, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, as such the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset $D$, its normalized fractal dimension is the number of columns in a dataset $D'$ with independent columns and having the same (unnormalized) fractal dimension as $D$. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against baseline measures such as PCA. We also study the relationship of the dimension of the whole dataset and the dimensions of subgroups formed by clustering. The results indicate interesting differences between and within datasets.


Summary

  • The paper introduces a normalized correlation dimension that adapts fractal concepts to measure the effective dimensionality of sparse binary datasets.
  • It details efficient computation methods using direct calculation, sparse optimizations, and sampling to estimate pairwise L1 distances.
  • The approach enables comparing dataset complexity and serves as a complementary tool to PCA by revealing intrinsic structural dependencies.

This paper (1902.01480) addresses the challenge of defining a meaningful "effective dimension" for binary datasets, which are often high-dimensional but sparse and structured. Traditional dimensionality reduction methods like PCA and SVD are designed for real-valued data and are not directly suitable. The authors propose adapting concepts from fractal dimension, specifically the correlation dimension, for binary data and introduce a normalized correlation dimension to make the measure more interpretable.

The core idea is to analyze the distribution of pairwise distances between points in the binary dataset. For a dataset $D$ with $K$ binary variables, the $L_1$ (Manhattan) distance is used between two points $x, y \in D$. The random variable $Z_D$ represents the $L_1$ distance between two randomly chosen points from $D$. The correlation dimension is based on the probability $P(Z_D < r)$, which is the fraction of point pairs with an $L_1$ distance less than $r$.

The authors define the correlation dimension, denoted $cd$, as the slope of a line fitted to the log-log plot of $(r, P(Z_D < r))$ for various radii $r$. To handle the discrete nature of binary distances, they linearly interpolate $P(Z_D < r)$ between integer values of $r$ to obtain a continuous function $f(r)$. The dimension $cd_R(D, r_1, r_2)$ is the slope of the least-squares linear fit to the points $\{(\log r, \log f(r))\}$ for $r$ in a range $[r_1, r_2]$. A related definition, $cd_A(D, \alpha_1, \alpha_2)$, uses radii $r_1, r_2$ such that $f(r_1) = \alpha_1$ and $f(r_2) = \alpha_2$, effectively focusing on quantiles of the distance distribution. The paper primarily uses $cd_A(D, 1/4, 3/4)$, i.e., the range between the first and third quartiles of the pairwise distance distribution.

Implementation of Correlation Dimension:

To compute $cd_A(D, \alpha_1, \alpha_2)$, you need to calculate $f(r) = P(Z_D < r)$ for integer values of $r$ from $0$ to $K$. This involves calculating the $L_1$ distance between all pairs of points in $D$. The $L_1$ distance between two binary vectors $x, y$ is the number of positions where they differ: $\|x - y\|_1 = \sum_{i=1}^K |x_i - y_i|$; since $x_i, y_i \in \{0, 1\}$, $|x_i - y_i|$ is $1$ if $x_i \neq y_i$ and $0$ if $x_i = y_i$. The number of pairs with distance less than $r$ is $\sum_{x \in D} \sum_{y \in D} I(\|x - y\|_1 < r)$, where $I(\cdot)$ is the indicator function; $f(r)$ is this count divided by $|D|^2$.
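As an illustrative sketch (not code from the paper), $f(r)$ can be computed directly for a moderately sized 0/1 matrix using the identity $\|x - y\|_1 = \|x\|_1 + \|y\|_1 - 2\, x \cdot y$; the helper name `distance_cdf` is hypothetical:

```python
import numpy as np

def distance_cdf(D):
    """Estimate f(r) = P(Z_D < r) from all pairwise L1 (Hamming) distances.

    D: (n, K) 0/1 numpy array. Returns an array f of length K + 2 where
    f[r] is the fraction of ordered point pairs with distance < r.
    """
    n, K = D.shape
    # For 0/1 vectors: ||x - y||_1 = ||x||_1 + ||y||_1 - 2 * (x . y)
    ones = D.sum(axis=1)
    overlap = D @ D.T                      # pairwise dot products
    dist = ones[:, None] + ones[None, :] - 2 * overlap
    counts = np.bincount(dist.ravel(), minlength=K + 1)
    return np.concatenate(([0], np.cumsum(counts))) / (n * n)
```

This includes self-pairs, which matches the $|D|^2$ normalization in the definition above.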

  • Direct Computation: Calculating all pairwise $L_1$ distances naively takes $O(|D|^2 K)$ time. For sparse binary data, the $L_1$ distance between vectors $x$ and $y$ is $\|x\|_1 + \|y\|_1 - 2\|x \odot y\|_1$, where $\odot$ is element-wise multiplication (AND), so only the positions of the 1s matter. With a sparse matrix representation, letting $L$ denote the total number of 1s in $D$ and $m_i$ the number of 1s in row $x_i$, the sum of all pairwise distances is $\sum_i \sum_j (m_i + m_j - 2\,\mathrm{overlap}(x_i, x_j))$, and the terms $\sum_i \sum_j (m_i + m_j) = 2|D|L$ can be accumulated without touching zero entries. The paper gives an $O(|D| L)$ bound for computing all pairwise distances.
  • Approximation via Sampling: For very large datasets (large $|D|$), direct computation is too slow. The paper proposes estimating $P(Z_D < r)$ using a random subset $D_s \subset D$. Two estimators are given:

    1. Pick $x \in D$, $y \in D_s$: $\frac{1}{|D||D_s|} \sum_{x \in D} \sum_{y \in D_s} I(\|x-y\|_1 < r)$.
    2. Pick $x, y \in D_s$: $\frac{1}{|D_s|^2} \sum_{x \in D_s} \sum_{y \in D_s} I(\|x-y\|_1 < r)$.

The experiments used the first estimator with $|D_s| = 10{,}000$ points. With an efficient sparse distance computation, the cost of the first estimator is roughly $O(|D_s| L)$. The experimental results (Table 3) suggest the running time is proportional to the total number of 1s ($L$) when sampling $D_s$, with a factor related to $|D_s|$, so an efficient sparse distance calculation is crucial.
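A rough sketch of the first sampling scheme, using SciPy sparse matrices and the sparse distance identity; the function name `sampled_distance_cdf` is my own, not from the paper:

```python
import numpy as np
import scipy.sparse as sp

def sampled_distance_cdf(D, n_sample=10_000, rng=None):
    """Estimate P(Z_D < r) by pairing every point x in D with points y
    drawn from a random subset D_s (the first sampling estimator).

    D: scipy.sparse CSR matrix of 0/1 entries, shape (n, K).
    Returns cdf with cdf[r] = estimated P(Z_D < r).
    """
    rng = np.random.default_rng(rng)
    n, K = D.shape
    idx = rng.choice(n, size=min(n_sample, n), replace=False)
    Ds = D[idx]
    ones_D = np.asarray(D.sum(axis=1)).ravel()
    ones_S = np.asarray(Ds.sum(axis=1)).ravel()
    # Sparse identity: ||x - y||_1 = ||x||_1 + ||y||_1 - 2 x.y
    overlap = (D @ Ds.T).toarray()
    dist = ones_D[:, None] + ones_S[None, :] - 2 * overlap
    counts = np.bincount(dist.astype(int).ravel(), minlength=K + 1)
    return np.concatenate(([0], np.cumsum(counts))) / dist.size
```

The only dense object is the `|D| × |D_s|` overlap matrix, so memory scales with the sample size rather than $|D|^2$.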

Once $f(r)$ is estimated for the relevant integer $r$, you find $r_1, r_2$ such that $f(r_1) \approx \alpha_1$ and $f(r_2) \approx \alpha_2$, then compute the slope of $\log f(r)$ versus $\log r$ for $r \in [r_1, r_2]$ using linear regression. The paper uses $N = 50$ points within $[r_1, r_2]$ for the linear fit $\mathcal{I}(D, r_1, r_2, N)$.
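The quantile inversion and log-log fit can be sketched as follows, assuming a `cdf` array with values at integer radii (like the one produced by the direct computation above); this is an illustration, not the paper's implementation:

```python
import numpy as np

def correlation_dimension(cdf, alpha1=0.25, alpha2=0.75, n_points=50):
    """Estimate cd_A(D, alpha1, alpha2): the slope of log f(r) vs log r
    between the radii where the interpolated distance CDF hits the quantiles.

    cdf: array with cdf[r] = P(Z_D < r) at integer r, cdf[0] = 0.
    """
    r_grid = np.arange(len(cdf))
    # Invert the linearly interpolated CDF to find r1, r2 at the quantiles.
    r1 = np.interp(alpha1, cdf, r_grid)
    r2 = np.interp(alpha2, cdf, r_grid)
    rs = np.linspace(r1, r2, n_points)
    fs = np.interp(rs, r_grid, cdf)        # linear interpolation of f(r)
    slope, _ = np.polyfit(np.log(rs), np.log(fs), 1)
    return slope
```

As a sanity check, a synthetic CDF of the form $f(r) \propto r^2$ should yield a slope close to 2.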

Normalization: Normalized Correlation Dimension (NCD):

The raw correlation dimension ($cd$) can be small and hard to interpret. The NCD aims to provide a more intuitive scale. The NCD of dataset $D$, $ncd_A(D, \alpha_1, \alpha_2)$, is defined as the number of columns $H$ that a synthetic dataset $ind(H, s)$ with $H$ independent binary variables (each equal to 1 with probability $s$) would need in order to have the same correlation dimension as $D$, i.e., $cd_A(ind(H, s)) = cd_A(D)$. The marginal probability $s$ is chosen such that $cd_A(ind(K, s)) = cd_A(ind(D))$, where $ind(D)$ is a dataset with $K$ independent columns having the same marginal probabilities as $D$.

  • Implementation of NCD:

1. Calculate $cd_A(D)$ using the method described above.
2. Create a synthetic dataset $ind(D)$ by randomizing each column of $D$ independently (or generating data with the same marginal probabilities).
3. Calculate $cd_A(ind(D))$. This involves estimating $P(Z_{ind(D)} < r)$, e.g., by generating random pairs from the independent distribution implied by $D$'s marginals.
4. Find a probability $s$ such that $cd_A(ind(K, s)) = cd_A(ind(D))$ using binary search; $cd_A(ind(H, s))$ can be approximated theoretically (Proposition 1) or estimated by generating synthetic data.
5. Find an integer $H$ such that $cd_A(ind(H, s)) = cd_A(D)$ using binary search. This $H$ is the normalized correlation dimension $ncd_A(D)$.
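The five steps above can be sketched end to end. This is a rough, self-contained illustration under my own assumptions: all helper names are hypothetical, $cd_A(ind(H, s))$ is estimated by Monte Carlo rather than via Proposition 1, and the binary search over $s$ on $(0, 1/2]$ assumes $cd_A(ind(H, s))$ grows with both $H$ and $s$ in that range (plausible for datasets with marginals below $1/2$):

```python
import numpy as np

def _cd(D, a1=0.25, a2=0.75, n_points=50):
    """cd_A: log-log slope of the pairwise-L1 CDF between quantiles a1, a2."""
    ones = D.sum(axis=1)
    dist = ones[:, None] + ones[None, :] - 2 * (D @ D.T)
    counts = np.bincount(dist.ravel(), minlength=D.shape[1] + 1)
    cdf = np.concatenate(([0], np.cumsum(counts))) / dist.size
    r_grid = np.arange(len(cdf))
    r1, r2 = np.interp([a1, a2], cdf, r_grid)
    rs = np.linspace(r1, r2, n_points)
    return np.polyfit(np.log(rs), np.log(np.interp(rs, r_grid, cdf)), 1)[0]

def cd_independent(H, s, n=1000, seed=0):
    """Monte Carlo estimate of cd_A(ind(H, s)) from n sampled rows."""
    rng = np.random.default_rng(seed)
    return _cd((rng.random((n, H)) < s).astype(np.int64))

def ncd(D, n=1000, seed=0):
    """Normalized correlation dimension via the two binary searches."""
    K = D.shape[1]
    target = _cd(D)                                      # step 1
    rng = np.random.default_rng(seed)
    ind_D = np.stack([rng.permutation(c) for c in D.T], axis=1)  # step 2
    cd_ind = _cd(ind_D)                                  # step 3
    lo, hi = 1e-3, 0.5                                   # step 4: search s
    for _ in range(20):
        s = (lo + hi) / 2
        if cd_independent(K, s, n, seed) < cd_ind:
            lo = s
        else:
            hi = s
    lo_h, hi_h = 1, K                                    # step 5: search H
    while lo_h < hi_h:
        mid = (lo_h + hi_h) // 2
        if cd_independent(mid, s, n, seed) < target:
            lo_h = mid + 1
        else:
            hi_h = mid
    return lo_h
```

For data that is already independent, the result should come out close to the actual number of columns $K$, up to Monte Carlo noise.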

  • Approximation for NCD: Proposition 2 offers a direct approximation: $ncd_A(D) \approx K \cdot (cd_A(D) / cd_A(ind(D)))^2$. This avoids the binary searches for $H$ and $s$. The empirical results suggest this approximation works well for synthetic data but can be less accurate for sparse real-world data.
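In code the approximation is a one-liner; its inputs are the two dimension estimates computed earlier, and the numeric values in the example are made up for illustration:

```python
def ncd_approx(K, cd_data, cd_indep):
    """Proposition 2 approximation: ncd_A(D) ~ K * (cd_A(D) / cd_A(ind(D)))^2."""
    return K * (cd_data / cd_indep) ** 2

# Hypothetical values: K = 100 columns, cd_A(D) = 2.0, cd_A(ind(D)) = 4.0.
print(ncd_approx(100, 2.0, 4.0))  # → 25.0
```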

Practical Applications and Interpretation:

  • Complexity Measure: NCD provides a single number describing the "effective complexity" or intrinsic dimensionality of a binary dataset. A high NCD relative to the number of variables $K$ suggests less structure and more independence, while a low NCD suggests strong dependencies or sparsity patterns that reduce the effective degrees of freedom.

  • Dataset Comparison: NCD allows comparing the structural complexity of different binary datasets, even if they differ in the number of variables or in sparsity. For instance, the paper shows NCD varies significantly across real datasets (Table 2): Retail ($K \approx 16{,}000$) has ncd $\approx 1800$ (about 11% of $K$), while Accidents ($K = 469$) has ncd $\approx 220$ (about 47% of $K$), indicating Retail has more structure per variable than Accidents.
  • Alternative to PCA for Binary Data: The paper compares NCD to PCA (number of components for 90% variance). They correlate positively, but there are differences. For 'Paleo' data, PCA suggests higher dimension than NCD and average correlation, indicating PCA might overestimate complexity for some binary data structures, especially those with homogeneous margins (which NCD might handle better). This suggests NCD can be a complementary or better measure for certain binary datasets.
  • Analyzing Subgroups: Studying the NCD of clusters or subgroups can reveal how dimensionality changes locally. The experiments show clusters can have higher dimensions than the combined dataset, implying the structure reducing the overall dimension might be due to the relationships between clusters.

Implementation Considerations:

  • Sparsity: Binary data is often sparse. Efficiently computing $L_1$ distances and sums of distances is crucial. Use sparse matrix libraries (e.g., SciPy in Python) and optimize distance calculations for binary data (popcount, bitwise operations if applicable).
  • Sampling: For datasets with millions of rows, sampling is essential to make the computation feasible. The choice of sample size $|D_s|$ affects both accuracy and runtime.
  • Linear Regression: Fitting the line to $\{(\log r, \log f(r))\}$ is a standard linear regression task; appropriate $r$ values or $\alpha$ quantiles must be selected. The paper used $\alpha_1 = 1/4$, $\alpha_2 = 3/4$ and $N = 50$ points.
  • Binary Search (for NCD without approximation): Implementing the binary searches for $H$ and $s$ requires an efficient way to calculate $cd_A(ind(H, s))$ or its approximation. Proposition 1 provides an analytical form based on the normal approximation of sums of Bernoulli variables, which can be used.
  • Computational Resources: Calculating pairwise distances (even with sampling) can be memory and CPU intensive. Distributed computing frameworks could be beneficial for large datasets.
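The popcount/bitwise suggestion in the Sparsity bullet can be sketched as follows: pack each 0/1 row into bytes, then count differing bits with XOR and a byte-level popcount table. This is my own illustration (`packed_hamming` is a hypothetical name), not code from the paper:

```python
import numpy as np

def packed_hamming(D):
    """Pairwise L1 distances for 0/1 rows via bit-packing and popcount.

    Packs each row into uint8 words with np.packbits, then counts
    differing bits with XOR and a 256-entry popcount lookup table.
    Padding bits added by packbits are identical across rows, so they
    XOR to zero and do not affect the counts.
    """
    packed = np.packbits(D.astype(np.uint8), axis=1)      # (n, ceil(K/8))
    popcount = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None],
                             axis=1).sum(axis=1)           # bits set in 0..255
    n = packed.shape[0]
    dist = np.empty((n, n), dtype=np.int64)
    for i in range(n):
        dist[i] = popcount[packed[i] ^ packed].sum(axis=1)
    return dist
```

Each row comparison then touches $K/8$ bytes instead of $K$ entries, which helps when the data is dense enough that a sparse representation no longer pays off.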

The paper highlights that this method, unlike PCA or SVD, does not provide a low-dimensional embedding. Its purpose is to measure intrinsic dimension, not to map data to a lower space for visualization or feature reduction. However, knowing the intrinsic dimension can be useful for model selection, algorithm choice (e.g., which indexing structures or distance metrics might work well), or simply understanding the underlying structure of the data.
