Topological Prototype Selector (TPS)
- TPS is a topological data analysis framework that uses persistent homology and bifiltration to extract representative prototypes from large datasets.
- It combines neighbor filtration for capturing class boundary features with radius filtration for intra-class refinement to ensure high classification fidelity.
- Empirical results show TPS significantly reduces data volume while boosting or maintaining accuracy across synthetic and real-world benchmarks.
The Topological Prototype Selector (TPS) is a topological data analysis (TDA) framework for prototype selection, that is, choosing a representative subset of a large dataset. TPS uses persistent homology to identify the data points that best capture both the intra-class topological “shape” and the boundary regions separating classes. By combining neighbor-based and radius-based filtrations into a bifiltration, TPS selects boundary-informative prototypes, is tunable through geometric regularization parameters, and achieves substantial data reduction while preserving, and often improving, classification accuracy on simulated and real-world tasks.
1. Prototype Selection and Topological Motivation
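Given a labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, the principal challenge of prototype selection is to identify the smallest subset $S \subseteq D$ such that the accuracy of a classifier trained on $S$, measured on validation data, matches or exceeds full-data performance. Formally,
$$S^{*} = \arg\min_{S \subseteq D} |S| \quad \text{subject to} \quad \mathrm{Acc}(f_S) \ge \mathrm{Acc}(f_D),$$
where $f_S$ and $f_D$ denote the classifier trained on $S$ and on $D$, respectively.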
Traditional approaches, including distance-based heuristics such as Condensed Nearest Neighbor (CNN) and Edited Nearest Neighbor (ENN), clustering (K-Means), and optimization (set cover), tend to retain excessive internal or noisy points, can be sensitive to instance order, and lack multiscale geometric robustness. TPS addresses these issues with TDA: persistent homology, which is robust to noise and to high-dimensional structure, quantifies and extracts the significant topological features at multiple scales, especially at class boundaries.
2. Mathematical Foundations of TPS
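TPS operates on the data space $X$, equipped with a metric $d_X$ so that $(X, d_X)$ is a metric space. The approach constructs TDA objects, beginning with simplicial complexes:
- A $k$-simplex is defined as the convex hull of $k+1$ affinely independent points.
- A simplicial complex is a finite set of simplices closed under taking faces and intersections.
The Vietoris–Rips complex at scale $\epsilon$,
$$\mathrm{VR}_\epsilon(X) = \{\sigma \subseteq X : d_X(x_i, x_j) \le \epsilon \ \text{for all } x_i, x_j \in \sigma\},$$
contains all simplices whose vertices are pairwise within distance $\epsilon$. A filtration is a nested sequence of such complexes over increasing radii:
$$\mathrm{VR}_{\epsilon_1}(X) \subseteq \mathrm{VR}_{\epsilon_2}(X) \subseteq \cdots \subseteq \mathrm{VR}_{\epsilon_m}(X), \qquad \epsilon_1 \le \epsilon_2 \le \cdots \le \epsilon_m.$$
Persistent homology computes the birth and death ($b$, $d$) of $k$-dimensional features (connected components, loops, voids), visualized in persistence diagrams $\mathrm{Dgm}_k$.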
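As a minimal sketch of the Vietoris–Rips definition, the pure-Python snippet below enumerates the simplices present at a given scale and shows that a larger scale yields a larger complex, which is the nesting property a filtration relies on. The function name `vietoris_rips`, the `max_dim` cap, and the toy points are illustrative assumptions, not part of TPS; in practice a library such as Ripser or GUDHI would do this work.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, eps, max_dim=2):
    """Return the simplices of the Rips complex at scale eps, up to max_dim.

    Each simplex is a sorted tuple of vertex indices.
    """
    n = len(points)
    # Two vertices are joined when they lie within eps of each other.
    close = [[dist(points[i], points[j]) <= eps for j in range(n)] for i in range(n)]
    simplices = [(i,) for i in range(n)]
    for size in range(2, max_dim + 2):  # `size` vertices span a (size-1)-simplex
        for sigma in combinations(range(n), size):
            if all(close[i][j] for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices


# Toy data (assumed): a unit right triangle plus one distant point.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
small = vietoris_rips(pts, eps=1.0)  # the hypotenuse is too long to be an edge
large = vietoris_rips(pts, eps=1.5)  # the hypotenuse and the triangle appear
print(len(small), len(large))
print(set(small) <= set(large))
```

Brute-force enumeration like this is only practical for a handful of points; it is meant to make the definition concrete rather than to be efficient.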
The central mechanism of TPS is a bifiltration, which uses two parameters:
- Neighbor filtration (inter-class proximity, $s$-axis)
- Radius filtration (intra-class connectivity, $t$-axis)
A bifiltration $\{K_{s,t}\}$ satisfies $K_{s,t} \subseteq K_{s',t'}$ whenever $s \le s'$, $t \le t'$. For a feature interval $[b, d)$, its lifetime is $d - b$.
3. Topological Prototype Selector Algorithm
TPS consists of two successive filtrations per class, formalized as follows:
3.1 Neighbor Filtration (Inter-Class)
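- Partition $X$ into the target class $X_c$ and the non-target points $X_{\bar c} = X \setminus X_c$.
- For each $x \in X_c$, compute the sum of the distances to its $k$ nearest neighbors in $X_{\bar c}$,
  $$\nu(x) = \sum_{j=1}^{k} d_X\big(x, \mathrm{NN}_j^{\bar c}(x)\big),$$
  where $d_X$ is the metric and $\mathrm{NN}_j^{\bar c}(x)$ is the $j$-th nearest neighbor of $x$ in $X_{\bar c}$.
- Construct a weighted Rips complex on $X_c$ whose edge scales are derived from these neighbor sums.
- Compute $0$- and $1$-dimensional persistence ($H_0$, $H_1$); extract the lifetimes $\ell_i = d_i - b_i$.
- Remove lifetimes below the minimum persistence threshold (geometric regularization).
- Choose the target lifetime as the neighbor quantile of the remaining lifetimes.
- Select the slice index whose scale is nearest to the target lifetime.
- Extract the vertices present at that scale.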
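The selection steps above can be sketched in a few lines of standard-library Python. This is a hedged illustration rather than the authors' implementation: the persistence intervals are supplied as a hand-written list (in practice they would come from a persistence library), and the names `neighbor_values`, `quantile`, `select_scale`, `k`, `q`, and `min_persistence` are assumptions. The final rule, keeping vertices whose neighbor sum does not exceed the chosen scale, is also an assumption about how "vertices at a scale" are read off.

```python
from math import dist


def neighbor_values(target, others, k):
    """Sum of distances from each target point to its k nearest non-target points."""
    return [sum(sorted(dist(x, z) for z in others)[:k]) for x in target]


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def select_scale(intervals, scales, q, min_persistence):
    """Pick the scale nearest to the q-quantile of the lifetimes that survive filtering."""
    lifetimes = [d - b for b, d in intervals if d - b >= min_persistence]
    target_lifetime = quantile(lifetimes, q)  # assumes at least one lifetime survives
    return min(scales, key=lambda s: abs(s - target_lifetime))


# Toy data (assumed): three target points and two non-target points.
target = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
others = [(2.0, 0.0), (2.0, 1.0)]
values = neighbor_values(target, others, k=1)

# Hypothetical (birth, death) intervals standing in for real persistence output.
intervals = [(0.0, 0.05), (0.0, 1.0), (0.0, 1.4), (0.5, 2.5)]
scale = select_scale(intervals, sorted(set(values)), q=0.5, min_persistence=0.1)

# Assumed extraction rule: keep vertices whose neighbor sum is within the scale.
boundary = [i for i, v in enumerate(values) if v <= scale]
print(values, scale, boundary)
```

In this toy case only the middle target point, the one closest to the other class, is kept, which matches the intuition that the neighbor filtration favors boundary points.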
3.2 Radius Filtration (Intra-Class)
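- Restrict to the subset extracted by the neighbor filtration.
- For each retained point, compute its radii with respect to same-class points.
- Build a Rips complex whose edge scales are derived from these radii.
- Compute persistence; extract the lifetimes.
- Filter the lifetimes using the minimum persistence threshold.
- Set the target lifetime to the mean of the remaining lifetimes.
- Select the slice closest to the target lifetime.
- Collect the vertices at that slice as the prototypes for class $c$.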
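The persistence and mean-selection steps can be illustrated for dimension 0 without any external library. In a plain Rips filtration every vertex is born at scale 0 and a connected component dies at the length of the edge that merges it into another, so the finite $H_0$ lifetimes can be computed with a union-find pass over the sorted edges. The sketch below uses the plain pairwise distance as the edge scale, which is an assumption (the source's radius-based edge rule is not reproduced here), and the names `h0_lifetimes` and `mean_slice` are hypothetical.

```python
from itertools import combinations
from math import dist


def h0_lifetimes(points):
    """Finite H0 lifetimes of the Rips filtration on `points`, in increasing order."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    lifetimes = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components, so one of them dies
            parent[ri] = rj
            lifetimes.append(length)  # death - birth = length - 0
    return lifetimes


def mean_slice(lifetimes, scales, min_persistence=0.0):
    """Scale closest to the mean of the lifetimes that pass the persistence filter."""
    kept = [life for life in lifetimes if life >= min_persistence]
    target = sum(kept) / len(kept)  # assumes at least one lifetime is kept
    return min(scales, key=lambda s: abs(s - target))


# Toy data (assumed): four collinear points with growing gaps.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
lifetimes = h0_lifetimes(pts)
chosen = mean_slice(lifetimes, scales=lifetimes)
print(lifetimes, chosen)
```

Higher-dimensional features such as loops need a full persistence computation, which is where Ripser or GUDHI would come in.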
TPS implicitly solves
Because each class is processed independently of the others, the per-class selection loop parallelizes directly.
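A minimal sketch of this class-wise parallelism, using only the standard library, is shown below. The per-class routine `select_for_class` is a hypothetical placeholder that simply keeps every other point; it stands in for the two filtrations and is not the TPS selection rule.

```python
from concurrent.futures import ThreadPoolExecutor


def select_for_class(item):
    """Hypothetical stand-in for the per-class routine: keeps every other point."""
    label, points = item
    return label, points[::2]


def select_all(points, labels, max_workers=None):
    """Group points by class label and process each class independently."""
    by_class = {}
    for p, y in zip(points, labels):
        by_class.setdefault(y, []).append(p)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results pair up with their labels.
        return dict(pool.map(select_for_class, by_class.items()))


points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
labels = ["a", "b", "a", "b", "a"]
prototypes = select_all(points, labels)
print(prototypes)
```

Threads are used here to keep the example portable. For CPU-bound persistence computations in pure Python, `ProcessPoolExecutor` from the same module is usually the better fit, since it sidesteps the global interpreter lock.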
Algorithm Overview
| Phase | Operation | Output |
|---|---|---|
| Neighbor filtration | Rips + persistence | Boundary subset |
| Radius filtration | Rips + persistence | Prototypes |
4. Theoretical Guarantees and Computational Complexity
4.1 Complexity Analysis
With , constructing Rips up to maximum dimension incurs cost. Persistent homology (Ripser) practically executes in to . TPS’s two computations per class yield total complexity
Parameter greatly influences combinatorial cost.
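To give a feel for this combinatorial growth, the sketch below counts the simplices of the complete complex on $n$ vertices up to a given dimension, which is the worst case for a Rips complex at large scale. The helper name `max_simplices` and the sample sizes are assumptions chosen for illustration.

```python
from math import comb


def max_simplices(n, max_dim):
    """Worst-case simplex count up to max_dim: sum of C(n, j) for j = 1..max_dim + 1."""
    return sum(comb(n, j) for j in range(1, max_dim + 2))


for dim in (0, 1, 2):
    print(dim, max_simplices(100, dim))
```

Raising the maximum dimension by one multiplies the worst-case count by roughly a factor of $n$, which is why persistence is usually restricted to low dimensions.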
4.2 Stability
By the stability theorem for persistent homology, a small perturbation of the metric changes the persistence diagrams by a correspondingly small amount (measured in bottleneck distance), so the selected prototype set is expected to vary smoothly under input noise.
4.3 Prototype Cardinality Bound
If are Betti numbers at chosen slices,
Thus, the number of prototypes is bounded by total Betti numbers.
5. Empirical and Practical Evaluation
5.1 Simulated Data
On nine synthetic 2D datasets (blobs, circles, moons, imbalanced scenarios), TPS demonstrates:
- Mean G-Mean improvement: (1-NN), (SVM).
- Data reduction rates: (1-NN), (SVM).
TPS typically preserves or enhances accuracy while retaining only 10–40% of original points.
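For reference, the two quantities reported here can be computed as follows. This sketch assumes the standard definition of G-Mean as the geometric mean of the per-class recalls and defines the reduction rate as the fraction of points removed; the function names and the toy labels are illustrative.

```python
from math import prod


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (standard G-Mean definition, assumed)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return prod(recalls) ** (1 / len(recalls))


def reduction_rate(n_original, n_selected):
    """Fraction of the original points removed by prototype selection."""
    return 1 - n_selected / n_original


# Toy labels (assumed): recall is 3/4 for class 0 and 1/2 for class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = g_mean(y_true, y_pred)
rate = reduction_rate(1000, 250)
print(round(score, 4), rate)
```

Because G-Mean multiplies the recalls, a classifier that ignores a minority class scores zero, which makes the metric a natural choice for the imbalanced scenarios mentioned above.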
5.2 Hyperparameter Effects
- The neighbor quantile and the minimum persistence threshold regularize the geometric structure, trading off data reduction against classification performance.
- Smaller targets boundary topological features more effectively.
5.3 Computational Performance
- For points in , TPS processes each class in seconds.
- Classwise parallelization achieves linear speedup.
5.4 Real Data Benchmarks
Across eight UCI datasets:
- TPS achieves average reduction , average G-Mean change .
- Outperforms CNN+ENN (), AllKNN (), and matches BienTib () and K-Means, typically with fewer prototypes and lower runtime.
5.5 Metric Robustness
For text (Spam/Ham, Doc2Vec, ):
- Under Euclidean metric: G-Mean at reduction.
- Under Cosine metric: G-Mean at reduction.
TPS leverages structure induced by different metric choices, outperforming purely distance-based alternatives.
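The sketch below illustrates why the metric matters: the same vectors can have different nearest neighbors under Euclidean and cosine distance, so any distance-driven filtration sees a different structure. The vectors and function names are assumptions for illustration. Note that cosine distance is a dissimilarity rather than a true metric, since it can violate the triangle inequality.

```python
from math import dist, sqrt


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(points, metric):
    """Full pairwise matrix, suitable as a precomputed input to a persistence library."""
    return [[metric(p, q) for q in points] for p in points]


# Toy embeddings (assumed): 0 and 1 share a direction, 2 is orthogonal to both.
vecs = [(1.0, 0.0), (10.0, 0.0), (0.0, 1.0)]
euclid = distance_matrix(vecs, dist)
cosine = distance_matrix(vecs, cosine_distance)

# Nearest neighbor of vector 0 under each choice of distance.
nn_euclid = min((1, 2), key=lambda j: euclid[0][j])
nn_cosine = min((1, 2), key=lambda j: cosine[0][j])
print(nn_euclid, nn_cosine)
```

Under the Euclidean distance the orthogonal vector is the closest, while under the cosine distance the parallel one is, regardless of its length.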
6. Implementation Notes and Geometric Interpretation
- Ripser and GUDHI libraries efficiently compute Rips complexes and persistence for small .
- Recommended pipeline: precompute distance matrix, process each class with neighbor → persistence → quantile selection, radius → persistence → mean selection, and extract required vertices.
- Geometric intuition:
  - Neighbor filtration identifies boundary/“thin” regions near other classes.
  - Radius filtration isolates “thick” intra-class zones near core features.
  - Their intersection selects topologically critical boundary exemplars.
7. Significance and Distinction
TPS is the first prototype selector fundamentally founded on TDA principles. Its bifiltration construction retains mathematically significant, boundary-informative points in a fashion that is parallelizable, interpretable, metric-flexible, and robust to noise. Unlike previous methods, TPS offers explicit geometric regularization through its neighbor quantile and minimum persistence settings, giving practitioners principled control over dataset reduction while maintaining high classification fidelity across a range of domains.