Topological Prototype Selector (TPS)

Updated 10 November 2025

TPS is a topological data analysis framework that uses persistent homology and bifiltration to extract representative prototypes from large datasets.
It combines neighbor filtration for capturing class boundary features with radius filtration for intra-class refinement to ensure high classification fidelity.
Empirical results show TPS significantly reduces data volume while boosting or maintaining accuracy across synthetic and real-world benchmarks.

The Topological Prototype Selector (TPS) is a framework based on topological data analysis (TDA) for prototype selection, that is, choosing a representative subset of a large dataset. TPS uses persistent homology to identify the data points that best capture both the intra-class topological “shape” and the boundary regions separating classes. By combining neighbor-based and radius-based filtrations in a bifiltration, TPS selects boundary-informative prototypes, is tunable through geometric regularization parameters, and achieves substantial data reduction while preserving, and often improving, classification accuracy on simulated and real-world tasks.

1. Prototype Selection and Topological Motivation

Given a labeled dataset $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^n$, the principal challenge of prototype selection is to identify the smallest subset $\mathcal{S} \subset \mathcal{X}$ such that the accuracy $\mathrm{Acc}(f_{\mathcal{S},\mathrm{Val}})$ of a classifier trained on $\mathcal{S}$ and evaluated on validation data matches or exceeds full-data performance. Formally,

$\min_{\mathcal{S} \subset \mathcal{X}} |\mathcal{S}| \quad \text{subject to} \quad \mathrm{Acc}(f_{\mathcal{S},\mathrm{Val}}) \approx \mathrm{Acc}(f_{\mathcal{X},\mathrm{Val}})$
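To make the objective concrete, the following minimal sketch compares 1-NN validation accuracy on a full toy dataset against a hand-picked subset. The data, the `subset` indices, and the helper names are illustrative assumptions, not part of TPS; the point is only to show what the constraint in the objective measures.

```python
import numpy as np

def one_nn_predict(train_X, train_y, query_X):
    """Predict labels for query_X using the single nearest training point."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    return train_y[np.argmin(dists, axis=1)]

def accuracy(train_X, train_y, val_X, val_y):
    """Fraction of validation points classified correctly by 1-NN."""
    return float(np.mean(one_nn_predict(train_X, train_y, val_X) == val_y))

# Toy data: two well-separated 1-D clusters (illustrative only).
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
val_X = np.array([[0.05], [1.15]])
val_y = np.array([0, 1])

# Hypothetical prototype set: the one point of each class nearest the other class.
subset = np.array([2, 3])

acc_full = accuracy(X, y, val_X, val_y)
acc_subset = accuracy(X[subset], y[subset], val_X, val_y)
reduction = 1.0 - len(subset) / len(X)  # fraction of points discarded
```

Here both accuracies are equal while two thirds of the points are dropped, which is the situation the objective asks for.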

Traditional approaches, whether distance-based heuristics (e.g., Condensed Nearest Neighbor, CNN, and Edited Nearest Neighbor, ENN), clustering (K-Means), or optimization (set cover), tend to retain excessive internal or noisy points, can be sensitive to instance order, and lack multiscale geometric robustness. TPS addresses these issues with TDA: persistent homology, which is robust to noise and applicable to high-dimensional structure, quantifies and extracts the significant topological features at multiple scales, especially at class boundaries.

2. Mathematical Foundations of TPS

TPS operates on data $\mathcal{X} \subset X$, where $(X, d)$ is a metric space. The approach constructs TDA objects, beginning with simplicial complexes:

A $q$-simplex $\sigma$ is the convex hull of $q+1$ affinely independent points.
A simplicial complex $\mathcal{K}$ is a finite set of simplices that contains every face of each of its simplices and in which any two simplices intersect in a common face or not at all.

The Vietoris–Rips complex at scale $\epsilon$,

$R_\epsilon(\mathcal{X}) = \{\sigma : d(x_i, x_j) \leq \epsilon, \forall x_i, x_j \in \sigma\}$

contains all simplices whose vertices are pairwise within $\epsilon$. A filtration is a nested sequence of such complexes over increasing radii: $\emptyset = \mathcal{K}_0 \subseteq \mathcal{K}_1 \subseteq \cdots \subseteq \mathcal{K}_m = \mathcal{K}$. Persistent homology computes the birth and death $(b_i, d_i)$ of $k$-dimensional features (connected components, loops, voids), visualized in persistence diagrams $\mathrm{PD}_k$.
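As an illustration of these definitions, the sketch below lists the edges of a Vietoris–Rips complex at a given scale and computes 0-dimensional persistence with a union-find pass over the sorted edges. It covers connected components only; loops and voids require a full persistence library. The function names and the toy points are assumptions made for this example.

```python
import numpy as np

def rips_edges(dist, eps):
    """Edges (1-simplices) of the Vietoris-Rips complex at scale eps."""
    n = dist.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i, j] <= eps]

def h0_persistence(dist):
    """Finite death times of 0-dimensional features in the Rips filtration.

    Every point is born at scale 0. A component dies at the length of the
    edge that merges it into another component (Kruskal-style union-find).
    One component never dies and is not listed.
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components
            parent[ri] = rj
            deaths.append(float(w))
    return deaths

# Four points on a line at positions 0, 1, 3, 7.
pts = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(pts[:, None] - pts[None, :])

edges_at_2 = rips_edges(dist, 2.0)  # [(0, 1), (1, 2)]
deaths = h0_persistence(dist)       # [1.0, 2.0, 4.0]
```

The death times are exactly the gaps between consecutive points, so the long-lived component (death at 4.0) corresponds to the isolated point at 7.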

The TPS central mechanism is bifiltration—using two parameters:

Neighbor filtration (inter-class proximity, $j$-axis)
Radius filtration (intra-class connectivity, $i$-axis)

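A bifiltration $\{\mathcal{K}_{i,j}\}_{i,j}$ satisfies $\mathcal{K}_{i,j} \subseteq \mathcal{K}_{i',j'}$ whenever $i \leq i'$ and $j \leq j'$. For a persistence interval $I = (b, d)$, its lifetime is $\mathrm{int}(I) = \min\{d, \max(\mathcal{E})\} - b$, so a feature that never dies is capped at $\max(\mathcal{E})$ (with $\mathcal{E}$ presumably the set of filtration scales).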
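The capped lifetime can be computed directly from birth and death pairs. This is a minimal sketch; `max_scale` stands in for $\max(\mathcal{E})$ and is assumed here to be the largest filtration scale, and the intervals are made up for illustration.

```python
import math

def lifetime(birth, death, max_scale):
    """Capped lifetime min(death, max_scale) - birth of a persistence interval."""
    return min(death, max_scale) - birth

# Illustrative intervals; the last feature never dies (death = infinity).
intervals = [(0.0, 1.0), (0.0, 2.5), (0.5, math.inf)]
max_scale = 4.0  # assumed largest filtration scale

lifetimes = [lifetime(b, d, max_scale) for b, d in intervals]  # [1.0, 2.5, 3.5]
```

Capping keeps essential (infinite) features comparable with finite ones when lifetimes are later averaged or ranked by quantile.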

3. Topological Prototype Selector Algorithm

TPS consists of two successive filtrations per class, formalized as follows:

3.1 Neighbor Filtration (Inter-Class)

Partition $\mathcal{X}$ into the target class $\mathcal{X}_c$ and the non-target points $\mathcal{X}_{\setminus c}$.
For each $x \in \mathcal{X}_c$, compute the sum of the distances to its $K$ nearest neighbors in $\mathcal{X}_{\setminus c}$,

$n(x) = \sum_{j=1}^K d(x, x_{\setminus c}^{(j)})$

Construct a weighted Rips complex in which edge $(x, y)$ enters at scale $\max\{d(x, y), n(x), n(y)\}$.
Compute $0$- and $1$-dimensional persistence ($\mathrm{PD}^{(n)}$) and extract the lifetimes $\{\ell_j\}$.
Remove lifetimes $\ell < \tau_{\min}$ (geometric regularization).
Choose the threshold $\theta^{(n)}$ as the $q$-quantile of the remaining lifetimes, where $q$ is the neighbor quantile.
Select the slice index $j^*$ nearest to $\theta^{(n)}$.
Extract the vertices $V^{(n)}$ present at scale $\epsilon_{j^*}$.
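The numerical parts of these steps can be sketched with NumPy as follows. The persistence computation itself is left out (a library such as Ripser or GUDHI would supply the lifetimes), so the `lifetimes` array is made-up input. The rule that a vertex $x$ is present once $n(x) \leq \epsilon$ is an assumption of this sketch, as are all function names and the chosen `eps`.

```python
import numpy as np

def neighbor_weights(X_c, X_other, K):
    """n(x): sum of distances from each target point to its K nearest non-target points."""
    d = np.linalg.norm(X_c[:, None, :] - X_other[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :K].sum(axis=1)

def weighted_edge_scales(X_c, w):
    """Scale at which edge (x, y) enters: max{d(x, y), w(x), w(y)}."""
    d = np.linalg.norm(X_c[:, None, :] - X_c[None, :, :], axis=2)
    return np.maximum(d, np.maximum(w[:, None], w[None, :]))

def quantile_threshold(lifetimes, tau_min, q):
    """Drop lifetimes below tau_min, then take the q-quantile of the rest."""
    kept = lifetimes[lifetimes >= tau_min]
    return float(np.quantile(kept, q))

# Target class on the left, other class on the right (toy data).
X_c = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
X_other = np.array([[3.0, 0.0], [4.0, 0.0]])

n = neighbor_weights(X_c, X_other, K=1)   # [3.0, 2.0, 1.0]
scales = weighted_edge_scales(X_c, n)     # e.g. scales[1, 2] == 2.0

# Made-up lifetimes standing in for the output of a persistence library.
lifetimes = np.array([0.05, 1.0, 2.0, 3.0])
theta_n = quantile_threshold(lifetimes, tau_min=0.1, q=0.5)  # 2.0

eps = 2.0                        # hypothetical slice scale
V_n = np.flatnonzero(n <= eps)   # assumption: x is present once n(x) <= eps
```

In this toy case the two points closest to the other class (indices 1 and 2) enter first, which matches the intent that small scales of the neighbor filtration expose boundary points.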

3.2 Radius Filtration (Intra-Class)

Restrict to $V^{(n)} \subset \mathcal{X}_c$.
For each $x \in V^{(n)}$, compute the same-class radius,

$r(x) = \sum_{y \in V^{(n)}} d(x, y)$

Build a Rips complex in which edge $(x, y)$ enters at scale $\max\{d(x, y), r(x), r(y)\}$.
Compute the persistence $\mathrm{PD}^{(r)}$ and extract the lifetimes $\{\ell_i\}$.
Remove lifetimes below $\tau_{\min}$.
Set $\theta^{(r)} = \mathrm{mean}\{\ell_i\}$.
Select the slice $i^*$ closest to $\theta^{(r)}$.
Collect the vertices $V^{(r)}$ as the prototypes for class $c$.
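A corresponding sketch for the radius phase follows, implementing $r(x)$ exactly as written above. The lifetimes and the grid of slice scales are made-up inputs, and the names are assumptions of this example.

```python
import numpy as np

def radius_weights(V):
    """Pairwise distances within V and r(x) = sum of distances from x to all of V."""
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    return d, d.sum(axis=1)

# Points retained by the neighbor phase (toy data).
V = np.array([[0.0], [1.0], [3.0]])
d, r = radius_weights(V)  # r = [4.0, 3.0, 5.0]

# Edge (x, y) enters at max{d(x, y), r(x), r(y)}.
edge_scales = np.maximum(d, np.maximum(r[:, None], r[None, :]))

# Made-up lifetimes standing in for the output of a persistence library.
lifetimes = np.array([0.02, 1.0, 3.0])
tau_min = 0.1
kept = lifetimes[lifetimes >= tau_min]
theta_r = float(kept.mean())  # mean lifetime after filtering: 2.0

# Hypothetical grid of filtration scales; pick the slice closest to theta_r.
slices = np.array([0.5, 1.5, 2.2, 3.5])
i_star = int(np.argmin(np.abs(slices - theta_r)))  # 2
```

Note that with $r(x)$ defined as a sum over all of $V^{(n)}$, the term $d(x, y)$ is already one of its summands, so the edge scale reduces to $\max\{r(x), r(y)\}$ in this sketch.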

TPS implicitly solves

$\min_{\substack{i,j,\,V \subset \mathcal{X}_c}} |V| \quad \text{subject to} \quad V \text{ captures significant topological features at } (i, j)$

Because each class is processed independently, the per-class selection loop can be run in parallel.
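One way to exploit this independence is to map a per-class function over the class labels with an executor. In the sketch below, `select_for_class` is a hypothetical placeholder that does not implement TPS; it only shows the dispatch pattern. A thread pool is used for simplicity; a process pool may be preferable when the per-class work is CPU-bound Python code.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def select_for_class(args):
    """Hypothetical stand-in for the per-class selection step."""
    X, y, c = args
    idx = np.flatnonzero(y == c)
    # Placeholder rule (NOT TPS): keep the first point of the class.
    return int(c), idx[:1]

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 1, 1, 2, 2])

# One independent task per class label.
tasks = [(X, y, c) for c in np.unique(y)]
with ThreadPoolExecutor() as pool:
    prototypes = dict(pool.map(select_for_class, tasks))
```

The result maps each class label to the indices selected for it, so the per-class outputs can be concatenated afterwards without any shared state.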

Algorithm Overview

Phase	Operation	Output
Neighbor filtration	Rips + persistence	$V^{(n)}$ (boundary)
Radius filtration	Rips + persistence	$V^{(r)}$ (prototypes)

4. Theoretical Guarantees and Computational Complexity

4.1 Complexity Analysis

With $|\mathcal{X}_c| = n_c$, constructing the Rips complex up to maximum simplex dimension $h$ costs $O(n_c^{h+1})$ in the worst case, since an $h$-simplex has $h+1$ vertices. In practice, persistent homology (e.g., with Ripser) runs in $O(n_c^2)$ to $O(n_c^3)$. With two persistence computations per class, TPS has total complexity

$O\left(K \sum_c n_c^3\right)$

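The maximum dimension $h$ dominates the combinatorial cost, because the number of candidate simplices grows polynomially in $n_c$ with a degree set by $h$.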
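The effect of $h$ can be checked with a simple count. A $k$-simplex has $k+1$ vertices, so a complex on $n$ vertices has at most $\binom{n}{k+1}$ simplices of dimension $k$. The helper name below is an assumption of this sketch.

```python
from math import comb

def max_rips_simplices(n, h):
    """Upper bound on the number of simplices of dimension <= h on n vertices."""
    return sum(comb(n, k + 1) for k in range(h + 1))

upper_h1 = max_rips_simplices(100, 1)  # vertices + edges: 5050
upper_h2 = max_rips_simplices(100, 2)  # plus triangles: 166750
```

Going from $h = 1$ to $h = 2$ on 100 points raises the worst-case count by a factor of about 33, which is why small $h$ is used in practice.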

4.2 Stability

By the stability theorem for persistent homology, a small perturbation of the metric changes the persistence diagrams by at most a proportional amount, which suggests that the prototype set adapts smoothly to input noise.
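For reference, the classical stability theorem can be stated as follows. This is the standard result for filtrations induced by two functions $f, g$ on the same complex, not a statement specific to TPS; $d_B$ (bottleneck distance) and $\delta$ are notation introduced here.

$d_B\big(\mathrm{PD}_k(f), \mathrm{PD}_k(g)\big) \leq \|f - g\|_\infty$

In a Rips filtration, the filtration value of a simplex is its largest pairwise distance. If every point moves by at most $\delta$, each pairwise distance changes by at most $2\delta$ by the triangle inequality, so the diagrams move by at most $2\delta$ in bottleneck distance.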

4.3 Prototype Cardinality Bound

With $\beta_k$ denoting the $k$-th Betti number at the chosen slices ($\beta_0$ counts connected components, $\beta_1$ loops),

$|\mathcal{S}_c| = |V^{(r)}| \leq \sum_{k=0}^h \dim H_k(\mathcal{K}_{i^*,j^*}) \leq \sum_{k=0}^h \beta_k(\mathcal{X}_c)$

Thus, the number of prototypes is bounded by the sum of the Betti numbers up to dimension $h$.

5. Empirical and Practical Evaluation

5.1 Simulated Data

On nine synthetic 2D datasets (blobs, circles, moons, imbalanced scenarios), TPS demonstrates:

Mean G-Mean improvement: $+1.06\%$ (1-NN), $+1.98\%$ (SVM).
Data reduction rates: $78.0\%$ (1-NN), $80.2\%$ (SVM).

TPS typically preserves or enhances accuracy while retaining only 10–40% of original points.
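The two reported quantities can be computed as in the sketch below. It assumes the usual definition of G-Mean as the geometric mean of per-class recalls, which the text does not spell out, and uses made-up labels and counts.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition of G-Mean)."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(classes)))

def reduction_rate(n_original, n_selected):
    """Fraction of the original points that were discarded."""
    return 1.0 - n_selected / n_original

# Imbalanced toy labels: recall is 3/4 for class 0 and 2/2 for class 1.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])

score = g_mean(y_true, y_pred)    # sqrt(0.75 * 1.0)
rate = reduction_rate(1000, 220)  # 0.78
```

Unlike plain accuracy, G-Mean drops to zero if any class is never predicted correctly, which makes it a common choice for imbalanced scenarios such as those listed above.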

5.2 Hyperparameter Effects

$q$ (neighbor quantile) and $\tau_{\min}$ (minimum persistence) regularize geometric structure, balancing reduction vs. performance.
Smaller $K$ targets boundary topological features more effectively.

5.3 Computational Performance

For $\sim 2500$ points in $\mathbb{R}^4$, TPS processes each class in $\approx 2$ seconds.
Classwise parallelization achieves linear speedup.

5.4 Real Data Benchmarks

Across eight UCI datasets:

TPS achieves an average reduction of $69.3\%$ with an average G-Mean change of $+0.013$.
TPS outperforms CNN+ENN ($-0.040$) and AllKNN ($-0.001$), and matches BienTib ($+0.020$) and K-Means, typically with fewer prototypes and lower runtime.

5.5 Metric Robustness

For text (Spam/Ham, Doc2Vec embeddings in $\mathbb{R}^{100}$):

Under the Euclidean metric: G-Mean change of $-0.08\%$ at $80$–$90\%$ reduction.
Under the cosine metric: G-Mean change of $+1.7\%$ at $62$–$90\%$ reduction.

TPS leverages structure induced by different metric choices, outperforming purely distance-based alternatives.
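Switching metrics only changes the pairwise distance matrix that the filtrations are built from. The sketch below computes both matrices for a toy input; the function names are assumptions of this example.

```python
import numpy as np

def euclidean_matrix(X):
    """Pairwise Euclidean distances."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def cosine_matrix(X):
    """Pairwise cosine distances, 1 - cos(angle), for nonzero rows."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Clip to guard against tiny negative values from rounding.
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

# Rows 0 and 1 point in the same direction but differ in length.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
D_euc = euclidean_matrix(X)  # D_euc[0, 1] == 1.0
D_cos = cosine_matrix(X)     # D_cos[0, 1] == 0.0, D_cos[0, 2] == 1.0
```

Cosine distance ignores vector length, so rows 0 and 1 coincide under it but not under the Euclidean metric. Cosine distance is a dissimilarity rather than a true metric (it can violate the triangle inequality), but a Rips construction only needs pairwise values.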

6. Implementation Notes and Geometric Interpretation

The Ripser and GUDHI libraries efficiently compute Rips complexes and persistence for small $h$.
Recommended pipeline: precompute the distance matrix; then, for each class, run the neighbor filtration (persistence, then quantile selection), run the radius filtration (persistence, then mean selection), and extract the selected vertices.
Geometric intuition:
- Neighbor filtration identifies boundary/“thin” regions near other classes.
- Radius filtration isolates “thick” intra-class zones near core features.
- Their intersection selects topologically critical boundary exemplars.

7. Significance and Distinction

TPS is the first prototype selector fundamentally founded on TDA principles. Its bifiltration construction retains mathematically significant, boundary-informative points in a fashion that is parallelizable, interpretable, metric-flexible, and robust to noise. Unlike previous methods, TPS offers explicit geometric regularization through $q$ and $\tau_{\min}$ settings, providing practitioners principled control over dataset reduction while maintaining high classification fidelity across a spectrum of domains.
