Data complexity measured by principal graphs (1212.5841v2)

Published 23 Dec 2012 in cs.LG, cs.IT, and math.IT

Abstract: How to measure the complexity of a finite set of vectors embedded in a multidimensional space? This is a non-trivial question which can be approached in many different ways. Here we suggest a set of data complexity measures using universal approximators, principal cubic complexes. Principal cubic complexes generalise the notion of principal manifolds for datasets with non-trivial topologies. The type of the principal cubic complex is determined by its dimension and a grammar of elementary graph transformations. The simplest grammar produces principal trees. We introduce three natural types of data complexity: 1) geometric (deviation of the data's approximator from some "idealized" configuration, such as deviation from harmonicity); 2) structural (how many elements of a principal graph are needed to approximate the data), and 3) construction complexity (how many applications of elementary graph transformations are needed to construct the principal object starting from the simplest one). We compute these measures for several simulated and real-life data distributions and show them in the "accuracy-complexity" plots, helping to optimize the accuracy/complexity ratio. We discuss various issues connected with measuring data complexity. Software for computing data complexity measures from principal cubic complexes is provided as well.

Summary

  • The paper introduces a method to quantify data complexity by assessing the complexity of principal cubic complexes and trees that approximate multidimensional data.
  • It employs an EM-based algorithm to fit elastic graph approximators and balances approximation accuracy (measured by MSD) against intrinsic complexity (measured by geometrical and structural measures).
  • The approach is validated on both simulated and real datasets, using accuracy-complexity plots to identify optimal approximators and reveal different data scales.

The paper "Data complexity measured by principal graphs" (1212.5841) addresses the non-trivial problem of quantifying the complexity of a finite set of multidimensional vectors. Instead of directly measuring data complexity, the authors propose measuring the complexity of an approximator that represents the data's underlying structure. The core idea is that less complex data can be well-approximated by simpler objects, and complex data requires more complex approximators to achieve high accuracy. The optimal approximator balances approximation accuracy and its own complexity, aligning with the structural risk minimization principle.

The chosen approximators are Principal Cubic Complexes (PCCs), which are generalizations of principal manifolds capable of approximating data with non-trivial topologies like branches. A PCC is defined as a Cartesian product of graphs, where one-dimensional PCCs are principal graphs or trees. These approximators are constructed iteratively using a graph grammar, a set of elementary transformations (e.g., adding/removing a node, bisecting an edge). An algorithm based on the Expectation-Maximization (EM) approach is used to find the optimal embedding of a given graph structure in the data space. This algorithm minimizes an energy functional that combines the Mean Squared Distance (MSD) between data points and the closest point on the embedded graph with an elastic energy term penalizing deviations from graph regularity (like stretching and bending). A softening strategy, starting with high elasticity coefficients and gradually decreasing them, is employed to help mitigate issues with local minima in the optimization landscape.
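To make the optimization target concrete, here is a minimal NumPy sketch of such an energy functional for a one-dimensional elastic graph. It assumes the embedding is stored as an array of node positions together with an edge list and a star list, and that each data point is attached to its closest node; the uniform elasticity moduli `lam` and `mu` and all names are illustrative rather than taken from the authors' software.

```python
import numpy as np

def elastic_energy(X, nodes, edges, stars, lam=0.01, mu=0.1):
    """Sketch of U^phi(X, G): approximation term (MSD) + edge stretching
    + star non-harmonicity. `nodes` is an (n_nodes, dim) array of embedded
    node positions, `edges` a list of (i, j) index pairs, `stars` a list of
    (centre, [leaf indices]) pairs; lam and mu are uniform elasticity moduli
    (an illustrative simplification)."""
    # Attach each data point to its closest node and compute the MSD term
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    msd = d2.min(axis=1).mean()

    # Stretching energy of the edges
    u_edges = sum(lam * np.sum((nodes[i] - nodes[j]) ** 2) for i, j in edges)

    # Bending energy: deviation of each star centre from the mean of its leaves
    u_stars = sum(mu * np.sum((nodes[c] - nodes[leaves].mean(axis=0)) ** 2)
                  for c, leaves in stars)

    return msd + u_edges + u_stars
```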

The paper introduces three types of complexity measures for the approximator:

  1. Geometrical Complexity (GC): This measures the deviation of the approximator's embedding in the data space from an "idealized" configuration. The paper generalizes the concept of linearity to harmonicity. An embedding is harmonic if, for every "star" (a central node connected to neighbors), the central node's position is the average of its neighbors' positions. The GC is quantified as the sum of squared deviations from harmonicity for all stars in the graph, scaled by the squared number of nodes ($N_{nodes}^2$) to account for changes in edge lengths as nodes are added:

    $$GC^{\varphi}(G) = N_{nodes}^2 \sum\limits_{S_{k}^{(j)}} \mu_{kj} \left\| \varphi\big(S_{k}^{(j)}(0)\big) - \frac{1}{k}\sum\limits_{i=1}^{k} \varphi\big(S_{k}^{(j)}(i)\big) \right\|^{2}$$

    In the examples, $\mu_{kj}=1$ (see the sketch after this list).

  2. Structural Complexity (SC): This describes the complexity in terms of the number and types of elements in the graph structure (nodes, edges, k-stars). The paper uses the number of nodes as a primary indicator and introduces a symbolic barcode notation, $N_{k\text{-stars}}|\dots|N_{4\text{-stars}}|N_{3\text{-stars}}||N_{nodes}$, to represent the counts of the different k-stars and the total number of nodes. For trees, the number of edges is always $N_{nodes}-1$.
  3. Construction Complexity (CC): This is defined as the minimum number of elementary graph grammar applications needed to construct the approximator. For the simple grammars primarily used (adding one node at a time), CC often equals $N_{nodes}-1$, making it closely related to SC. The paper focuses mainly on GC and SC in its examples.
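To illustrate how GC and the SC barcode can be read off a fitted graph (as referenced in item 1), here is a small sketch. It assumes the graph is stored as an array of node positions plus an adjacency map from node index to neighbour indices, treats every node of degree $k \geq 2$ as the centre of a k-star, and sets $\mu_{kj}=1$ as in the paper's examples; the helper names and data layout are illustrative, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def geometrical_complexity(nodes, adjacency, mu=1.0):
    """GC = N_nodes^2 * sum over k-stars of ||centre - mean(leaves)||^2,
    taking every node of degree >= 2 as the centre of a k-star."""
    gc = 0.0
    for centre, nbrs in adjacency.items():
        if len(nbrs) >= 2:
            gc += mu * np.sum((nodes[centre] - nodes[nbrs].mean(axis=0)) ** 2)
    return len(nodes) ** 2 * gc

def structural_barcode(adjacency):
    """Barcode of the form ...|N_{4-stars}|N_{3-stars}||N_nodes,
    listing counts of k-stars with k >= 3 followed by the node count."""
    degrees = Counter(len(nbrs) for nbrs in adjacency.values())
    k_max = max(3, max(degrees, default=0))
    star_counts = [str(degrees.get(k, 0)) for k in range(k_max, 2, -1)]
    return "|".join(star_counts) + "||" + str(len(adjacency))
```

For instance, under this reading a five-node tree containing a single 3-star would be written 1||5 in the barcode notation.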

To identify the complexity of the data, the authors propose using an "accuracy-complexity" plot. This plot graphs the Geometrical Complexity (GC) against the approximation accuracy, measured by the Fraction of Variance Explained (FVE), which is $1 - \frac{MSD}{\text{Total Variance}}$. Changes in Structural Complexity (SC) are marked on the plot using vertical lines and the barcode notation. The optimal approximator, and thus the inferred data complexity, corresponds to points on this plot where increasing accuracy leads to a drastic increase in complexity, often observed as local minima in the GC landscape or sudden jumps in SC without significant FVE gain. Multiple local minima can reveal different "scales" of complexity within the data.
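A minimal sketch of the accuracy axis follows, assuming for simplicity that each data point is projected onto its closest node (the paper projects onto the closest point of the embedded graph, which may lie on an edge):

```python
import numpy as np

def fraction_of_variance_explained(X, nodes):
    """FVE = 1 - MSD / TotalVariance, with MSD the mean squared distance
    from each data point to its closest node (a simplified stand-in for the
    closest point on the embedded graph) and TotalVariance the mean squared
    distance of the data from its centroid."""
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    msd = d2.min(axis=1).mean()
    total_variance = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    return 1.0 - msd / total_variance
```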

The paper demonstrates this approach using both simulated 2D datasets (linear, arc, branching) and real-world datasets from the UCI Machine Learning Repository (Iris, Wine, Forestfires, Abalone). These examples show how different datasets exhibit distinct "accuracy-complexity" landscapes. For instance, linear data shows low GC until overfitting begins, while branching data requires a branching approximator (principal tree) to achieve high accuracy efficiently. The Wine dataset plot clearly shows an initial optimal structure (a 3-star) corresponding to its cluster structure, with subsequent complexity increases yielding diminishing returns in accuracy.

The construction algorithm for principal trees involves starting with a simple graph (e.g., two nodes connected by an edge on the first principal component), iteratively applying graph grammar operations (like adding a node or bisecting an edge) from a predefined set ($O^{(grow)}$ and optionally $O^{(shrink)}$), and selecting the operation that minimizes the energy functional $U^{\varphi}(X,G)$ after re-optimizing the node positions. This process continues until a maximum allowed structural or construction complexity is reached or no operation improves the energy significantly.
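A structural outline of this greedy loop is sketched below; `fit`, `energy`, the grammar operations, and the `n_nodes` attribute are hypothetical stand-ins for the elastic embedding, the functional $U^{\varphi}(X,G)$, and the elementary transformations, so the sketch shows control flow only and is not the authors' implementation.

```python
def grow_principal_tree(X, graph, grammar_ops, fit, energy, max_nodes=50):
    """Greedy construction sketch for a principal tree.

    Hypothetical callables and attributes (not part of the paper's software):
      fit(graph, X)    -- re-optimise node positions (the EM-style embedding)
      energy(graph, X) -- evaluate the functional U^phi(X, G)
      op(graph)        -- each op in grammar_ops yields modified *copies* of
                          graph, one per way the elementary transformation
                          (add a node, bisect an edge, ...) can be applied
      graph.n_nodes    -- current number of nodes
    """
    fit(graph, X)                              # embed the initial graph
    current_energy = energy(graph, X)
    while graph.n_nodes < max_nodes:           # cap on structural complexity
        best, best_energy = None, current_energy
        for op in grammar_ops:                 # e.g. the O^(grow) operation set
            for candidate in op(graph):
                fit(candidate, X)              # re-optimise node positions
                e = energy(candidate, X)
                if e < best_energy:
                    best, best_energy = candidate, e
        if best is None:                       # no operation improves the energy
            break
        graph, current_energy = best, best_energy
    return graph
```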

The implementation of this method for elastic principal graphs, including principal trees, exists in Java, with graphical interfaces available online, demonstrating its practical application in visualizing and analyzing diverse datasets.