- The paper introduces a normalized correlation dimension that adapts fractal concepts to measure the effective dimensionality of sparse binary datasets.
- It details efficient computation methods using direct calculation, sparse optimizations, and sampling to estimate pairwise L1 distances.
- The approach enables comparing dataset complexity and serves as a complementary tool to PCA by revealing intrinsic structural dependencies.
This paper (1902.01480) addresses the challenge of defining a meaningful "effective dimension" for binary datasets, which are often high-dimensional but sparse and structured. Traditional dimensionality reduction methods like PCA and SVD are designed for real-valued data and are not directly suitable. The authors propose adapting concepts from fractal dimension, specifically the correlation dimension, for binary data and introduce a normalized correlation dimension to make the measure more interpretable.
The core idea is to analyze the distribution of pairwise distances between points in the binary dataset. For a dataset D with K binary variables, the L1 distance (Manhattan distance) is used between two points x,y∈D. The random variable ZD represents the L1 distance between two randomly chosen points from D. The correlation dimension is based on the probability P(ZD<r), which is the fraction of point pairs with an L1 distance less than r.
The authors define the correlation dimension, denoted cd, as the slope of a line fitted to the log-log plot of (r,P(ZD<r)) for various radii r. To handle the discrete nature of binary distances, they linearly interpolate P(ZD<r) between integer values of r to create a continuous function f(r). The dimension cdR(D,r1,r2) is the slope of the least-squares linear fit to the points {logr,logf(r)} for r in a range [r1,r2]. A related definition, cdA(D,α1,α2), uses radii r1,r2 such that f(r1)=α1 and f(r2)=α2, effectively focusing on quantiles of the distance distribution. The paper primarily uses cdA(D,1/4,3/4), based on the distances between the first and third quartiles of the pairwise distance distribution.
Implementation of Correlation Dimension:
To compute cdA(D,α1,α2), you need to calculate f(r)=P(ZD<r) for integer values of r from 0 to K. This involves calculating the L1 distance between all pairs of points in D.
The L1 distance between two binary vectors x,y is the number of positions where they differ: ∣∣x−y∣∣1=∑i=1K∣xi−yi∣. Since xi,yi∈{0,1}, ∣xi−yi∣ is 1 if xi≠yi and 0 if xi=yi.
The number of pairs with distance <r is ∑x∈D∑y∈DI(∣∣x−y∣∣1<r), where I(⋅) is the indicator function. f(r) is this count divided by ∣D∣2.
- Direct Computation: Calculating all pairwise L1 distances naively takes O(∣D∣2K) time. For binary data the distance decomposes as ∣∣x−y∣∣1=∣∣x∣∣1+∣∣y∣∣1−2⋅∣∣x⊙y∣∣1, where ⊙ is element-wise multiplication (AND), so the sum over all pairs is ∑i ∑j (mi+mj−2⋅overlap(xi,xj)), where mi is the number of 1s in xi. Let L be the total number of 1s in D. The margin terms contribute ∑i ∑j (mi+mj)=2∣D∣L, and with a sparse (inverted-index) representation the overlap terms can also be accumulated in O(∣D∣L) total, since a column containing ck ones generates ck2≤∣D∣⋅ck pair updates. This yields the paper's O(∣D∣L) bound for all pairwise distances, a substantial saving over O(∣D∣2K) when the data is sparse.
- Approximation via Sampling: For very large datasets (large ∣D∣), direct computation is too slow. The paper proposes estimating P(ZD<r) using a random subset Ds⊂D. Two estimation methods are given:
- Pick x∈D, y∈Ds: (1/(∣D∣⋅∣Ds∣)) ∑x∈D ∑y∈Ds I(∣∣x−y∣∣1<r).
- Pick x∈Ds, y∈Ds: (1/∣Ds∣2) ∑x∈Ds ∑y∈Ds I(∣∣x−y∣∣1<r).
The experiments used the first method with ∣Ds∣=10,000 points. With the sparse distance trick, the first method costs roughly O(∣Ds∣⋅L) time, equivalently O(∣Ds∣⋅∣D∣⋅avg nonzeros per row): every sampled point is compared against the whole dataset, but the work scales with the number of 1s rather than with ∣D∣⋅K. Table 3 in the experimental section is consistent with this, showing runtime roughly proportional to the total number of 1s (L), with a factor depending on ∣Ds∣. An efficient sparse distance calculation is therefore crucial.
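Both routes can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration under the definitions above, not the paper's code; the function names are mine:

```python
import numpy as np
from scipy import sparse

def pairwise_l1_histogram(X):
    """Histogram of all ordered-pair L1 distances for a binary matrix X,
    using ||x - y||_1 = ||x||_1 + ||y||_1 - 2<x, y>, so only row sums and
    a sparse overlap (Gram) matrix are needed."""
    Xs = sparse.csr_matrix(X)
    m = np.asarray(Xs.sum(axis=1)).ravel()        # number of 1s per row
    overlap = (Xs @ Xs.T).toarray()               # <x_i, x_j> for all pairs
    dist = m[:, None] + m[None, :] - 2 * overlap  # full L1 distance matrix
    return np.bincount(dist.astype(int).ravel(), minlength=Xs.shape[1] + 1)

def estimate_f_sampled(X, n_sample=10_000, seed=0):
    """Estimate f(r) = P(Z_D < r) by pairing every x in D with y in a
    random subset D_s (the paper's first estimator). Returns a vector
    with f[r] for integer r = 0..K+1."""
    rng = np.random.default_rng(seed)
    Xs = sparse.csr_matrix(X)
    n, K = Xs.shape
    idx = rng.choice(n, size=min(n_sample, n), replace=False)
    m = np.asarray(Xs.sum(axis=1)).ravel()
    overlap = (Xs @ Xs[idx].T).toarray()          # shape (n, |D_s|)
    dist = m[:, None] + m[idx][None, :] - 2 * overlap
    counts = np.bincount(dist.astype(int).ravel(), minlength=K + 1)
    return np.concatenate(([0.0], np.cumsum(counts))) / (n * len(idx))

# toy data: 3 points, 4 variables
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
counts = pairwise_l1_histogram(X)   # ordered pairs at distance 0, 2, 4
f = estimate_f_sampled(X, n_sample=3)
```

On the toy matrix the three diagonal pairs have distance 0, four ordered pairs have distance 2, and two have distance 4; the estimator with the full sample reproduces the exact CDF.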
Once f(r) is estimated at the relevant integer values of r, find r1,r2 such that f(r1)≈α1 and f(r2)≈α2, then compute the slope of logf(r) versus logr for r∈[r1,r2] by least-squares linear regression. The paper evaluates the fit at N=50 points within [r1,r2] (the linear fit is denoted I(D,r1,r2,N)).
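The quantile-and-fit step might look like the following sketch (naming is mine; it assumes `f` is the vector of f(r) at integer r, as produced by an estimator like the ones above, and that f is increasing over the relevant range):

```python
import numpy as np

def cd_alpha(f, alpha1=0.25, alpha2=0.75, n_points=50):
    """cdA(D, alpha1, alpha2): slope of log f(r) vs log r between the
    quantile radii r1, r2 with f(r1) = alpha1 and f(r2) = alpha2,
    linearly interpolating the discrete CDF as in the paper."""
    r_grid = np.arange(len(f), dtype=float)
    r1 = np.interp(alpha1, f, r_grid)      # invert the interpolated CDF
    r2 = np.interp(alpha2, f, r_grid)
    rs = np.linspace(r1, r2, n_points)     # N fit points in [r1, r2]
    fs = np.interp(rs, r_grid, f)
    slope, _ = np.polyfit(np.log(rs), np.log(fs), 1)
    return slope

# sanity check: if f(r) follows an exact power law (r/K)^2, cd should be ~2
K = 100
f_power = (np.arange(K + 1) / K) ** 2
```

A quick way to validate an implementation is exactly this power-law check: the log-log slope of r² data should come out very close to 2.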
Normalized Correlation Dimension (NCD):
The raw correlation dimension (cd) can be small and hard to interpret on its own, so the NCD puts it on a more intuitive scale. The NCD of dataset D, ncdA(D,α1,α2), is defined as the number of columns H that a synthetic dataset ind(H,s), with H independent binary variables each equal to 1 with probability s, would need in order to have the same correlation dimension as D, i.e., cdA(ind(H,s))=cdA(D). The marginal probability s is chosen such that cdA(ind(K,s))=cdA(ind(D)), where ind(D) is a dataset with K independent columns having the same marginal probabilities as D. The computation proceeds in five steps:
1. Calculate cdA(D) using the method described above.
2. Create a synthetic dataset ind(D) by randomizing each column of D independently (or generating data with the same marginal probabilities).
3. Calculate cdA(ind(D)). This involves estimating P(Zind(D)<r) by generating random pairs from the independent distribution implied by D's marginals.
4. Find a probability s such that cdA(ind(K,s))=cdA(ind(D)) using binary search. cdA(ind(H,s)) can be approximated theoretically (Proposition 1) or estimated by generating synthetic data.
5. Find an integer H such that cdA(ind(H,s))=cdA(D) using binary search. This H is the normalized correlation dimension ncdA(D).
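Steps 4 and 5 both reduce to an integer binary search against a monotone quantity. A generic sketch, assuming cd grows monotonically with the number of independent columns; `cd_of_h` is a hypothetical callable that would estimate cdA(ind(H,s)), e.g. by generating Bernoulli(s) data or via the paper's Proposition 1:

```python
def search_h(cd_target, cd_of_h, h_lo=1, h_hi=1_000_000):
    """Smallest integer H with cd_of_h(H) >= cd_target, assuming
    cd_of_h is nondecreasing in H."""
    while h_lo < h_hi:
        mid = (h_lo + h_hi) // 2
        if cd_of_h(mid) < cd_target:
            h_lo = mid + 1       # H too small, move right
        else:
            h_hi = mid           # mid is feasible, keep it
    return h_lo
```

With a toy monotone stand-in such as `lambda h: h ** 0.5`, `search_h(10.0, ...)` returns the smallest H whose square root reaches 10, i.e. 100; in practice `cd_of_h` is noisy when estimated from synthetic data, so some smoothing or repeated estimation may be needed.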
- Approximation for NCD: Proposition 2 offers a direct approximation: ncdA(D)≈K⋅(cdA(D)/cdA(ind(D)))2. This avoids the binary search for H and s. The empirical results suggest this approximation works well for synthetic data but can be less accurate for sparse real-world data.
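Proposition 2's closed form is a one-liner (the example values below are illustrative, not from the paper):

```python
def ncd_approx(K, cd_data, cd_ind):
    """Proposition 2 approximation: ncd(D) ~ K * (cd(D) / cd(ind(D)))^2."""
    return K * (cd_data / cd_ind) ** 2
```

For instance, with K=100 variables and a dataset whose correlation dimension is half that of its independent counterpart, the approximation gives ncd ≈ 25, i.e. dependencies cut the effective dimension to a quarter of K.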
Practical Applications and Interpretation:
- Complexity Measure: NCD provides a single number describing the "effective complexity" or intrinsic dimensionality of a binary dataset. A high NCD relative to the number of variables K suggests less structure/more independence, while a low NCD suggests strong dependencies or sparsity patterns reducing the effective degrees of freedom.
- Dataset Comparison: NCD allows comparing the structural complexity of different binary datasets, even if they have different numbers of variables or sparsity. For instance, the paper shows NCD varies significantly across real datasets (Table 2). Retail (K=16k) has ncd ~1.8k (11% of K), while Accidents (K=469) has ncd ~220 (47% of K), indicating Retail has more structure per variable than Accidents.
- Alternative to PCA for Binary Data: The paper compares NCD to PCA (number of components for 90% variance). They correlate positively, but there are differences. For 'Paleo' data, PCA suggests higher dimension than NCD and average correlation, indicating PCA might overestimate complexity for some binary data structures, especially those with homogeneous margins (which NCD might handle better). This suggests NCD can be a complementary or better measure for certain binary datasets.
- Analyzing Subgroups: Studying the NCD of clusters or subgroups can reveal how dimensionality changes locally. The experiments show clusters can have higher dimensions than the combined dataset, implying the structure reducing the overall dimension might be due to the relationships between clusters.
Implementation Considerations:
- Sparsity: Binary data is often sparse. Efficiently computing L1 distances and sums of distances is crucial. Use sparse matrix libraries (e.g., SciPy in Python) and optimize distance calculations for binary data (popcount, bitwise operations if applicable).
- Sampling: For datasets with millions of rows, sampling is essential to make the computation feasible. The choice of sample size ∣Ds∣ affects accuracy and runtime.
- Linear Regression: Fitting the line to {logr,logf(r)} is a standard linear regression task; you need to select appropriate r values or α quantiles. The paper used α1=1/4, α2=3/4 and N=50 points.
- Binary Search (for NCD without approximation): Implementing the binary search for H and s requires an efficient way to calculate cdA(ind(H,p)) or its approximation. Proposition 1 provides an analytical form based on the normal approximation of sums of Bernoulli variables, which can be used.
- Computational Resources: Calculating pairwise distances (even with sampling) can be memory and CPU intensive. Distributed computing frameworks could be beneficial for large datasets.
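The popcount/bitwise route mentioned under Sparsity can be sketched in NumPy: Hamming distance equals L1 distance on binary vectors, so XOR plus a bit count suffices (illustrative, names mine; best suited to dense-ish data, since it scales with K/8 per pair regardless of sparsity):

```python
import numpy as np

def packed_l1(X):
    """Pairwise L1 distances via bit-packing: pack 8 binary columns per
    byte, XOR all row pairs, and count the set bits (Hamming = L1)."""
    P = np.packbits(np.asarray(X, dtype=np.uint8), axis=1)  # bytes per row
    xor = P[:, None, :] ^ P[None, :, :]                     # pairwise XOR
    return np.unpackbits(xor, axis=2).sum(axis=2)           # popcount per pair

X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
D = packed_l1(X)
```

On the toy matrix this reproduces the distances computed directly (2, 2, and 4 for the three distinct pairs).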
The paper highlights that this method, unlike PCA or SVD, does not provide a low-dimensional embedding. Its purpose is to measure intrinsic dimension, not to map data to a lower space for visualization or feature reduction. However, knowing the intrinsic dimension can be useful for model selection, algorithm choice (e.g., which indexing structures or distance metrics might work well), or simply understanding the underlying structure of the data.