
Topological Prototype Selector (TPS)

Updated 10 November 2025
  • TPS is a topological data analysis framework that uses persistent homology and bifiltration to extract representative prototypes from large datasets.
  • It combines neighbor filtration for capturing class boundary features with radius filtration for intra-class refinement to ensure high classification fidelity.
  • Empirical results show TPS significantly reduces data volume while boosting or maintaining accuracy across synthetic and real-world benchmarks.

The Topological Prototype Selector (TPS) constitutes a topological data analysis (TDA)-based framework for representative subset selection (prototype selection) from large datasets. TPS exploits persistent homology to identify data points that best capture both the intra-class topological “shape” and boundary-region separation between classes. By combining neighbor-based and radius-based filtrations in a bifiltration approach, TPS rigorously selects boundary-informative prototypes, tunable through geometric regularization parameters, and achieves substantial data reduction while preserving—often improving—classification accuracy across simulated and real-world tasks.

1. Prototype Selection and Topological Motivation

Given a labeled dataset $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^n$, the principal challenge of prototype selection is to identify the smallest subset $\mathcal{S} \subset \mathcal{X}$ such that the classifier accuracy $\mathrm{Acc}(f_{\mathcal{S},\mathrm{Val}})$ on validation data matches or exceeds full-data performance. Formally,

$$\min_{\mathcal{S} \subset \mathcal{X}} |\mathcal{S}| \quad \text{subject to} \quad \mathrm{Acc}(f_{\mathcal{S},\mathrm{Val}}) \approx \mathrm{Acc}(f_{\mathcal{X},\mathrm{Val}})$$

Traditional distance-based heuristics (e.g., Condensed Nearest Neighbor (CNN), Edited Nearest Neighbor (ENN)), clustering methods (K-Means), and optimization formulations (set cover) retain excessive interior or noisy points, are sensitive to instance order, and lack multiscale geometric robustness. TPS addresses these issues using TDA, leveraging persistent homology, a method robust to noise and high-dimensional structure, to quantify and extract the significant topological features at multiple scales, especially at class boundaries.
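
As a concrete illustration of this criterion, the following minimal sketch compares the validation accuracy of a 1-NN classifier trained on the full dataset against one trained on a candidate subset; `prototype_idx` is a placeholder for the output of any selector (such as TPS), and the function name is illustrative.

```python
# Minimal sketch of the selection criterion: compare validation accuracy of a
# 1-NN classifier trained on the full data vs. on a candidate prototype subset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def subset_accuracy(X_train, y_train, X_val, y_val, subset_idx=None):
    """Fit 1-NN on the full set (subset_idx=None) or on a prototype subset."""
    if subset_idx is not None:
        X_train, y_train = X_train[subset_idx], y_train[subset_idx]
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    return clf.score(X_val, y_val)

# acc_full  = subset_accuracy(X, y, X_val, y_val)
# acc_proto = subset_accuracy(X, y, X_val, y_val, subset_idx=prototype_idx)
# A good selector keeps acc_proto close to (or above) acc_full with far fewer points.
```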

2. Mathematical Foundations of TPS

TPS operates on a dataset $\mathcal{X} \subset X$, where $(X, d)$ is a metric space. The approach constructs TDA objects, beginning with simplicial complexes:

  • A $q$-simplex $\sigma$ is the convex hull of $q+1$ affinely independent points.
  • A simplicial complex $\mathcal{K}$ is a finite set of simplices closed under taking faces and intersections.

The Vietoris–Rips complex at scale $\epsilon$,

$$R_\epsilon(\mathcal{X}) = \{\sigma : d(x_i, x_j) \leq \epsilon \ \ \forall\, x_i, x_j \in \sigma\},$$

contains all simplices whose vertices are pairwise within $\epsilon$. A filtration is a nested sequence of such complexes over increasing radii:

$$\emptyset = \mathcal{K}_0 \subseteq \mathcal{K}_1 \subseteq \cdots \subseteq \mathcal{K}_m = \mathcal{K}.$$

Persistent homology computes the birth and death $(b_i, d_i)$ of $k$-dimensional features (connected components, loops, voids), visualized in persistence diagrams $\mathrm{PD}_k$.
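
As a small illustration (not drawn from the paper), the following sketch computes Rips persistence on a noisy circle with the `ripser` package; GUDHI offers equivalent functionality, and the data and parameters are illustrative only.

```python
# Sketch: persistent homology of a Vietoris-Rips filtration on a noisy circle.
# The single long-lived feature in PD_1 corresponds to the circle's loop.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))

dgms = ripser(X, maxdim=1)["dgms"]           # dgms[k]: (birth, death) pairs of H_k
lifetimes_h1 = dgms[1][:, 1] - dgms[1][:, 0]
print("longest H_1 lifetime:", lifetimes_h1.max())
```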

The central mechanism of TPS is a bifiltration built from two parameters:

  • Neighbor filtration (inter-class proximity, $j$-axis)
  • Radius filtration (intra-class connectivity, $i$-axis)

A bifiltration $\{\mathcal{K}_{i,j}\}_{i,j}$ satisfies $\mathcal{K}_{i,j} \subseteq \mathcal{K}_{i',j'}$ whenever $i \leq i'$ and $j \leq j'$. For a feature with interval $(b, d)$, its lifetime is $\mathrm{int}(d) = \min\{d, \max(\mathcal{E})\} - b$, where $\max(\mathcal{E})$ is the largest filtration value, so features that never die are truncated at the end of the filtration.
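
A minimal helper illustrating this truncated-lifetime computation; the argument `max_filtration` plays the role of $\max(\mathcal{E})$ and the function name is illustrative.

```python
# Truncated lifetimes: infinite (or very late) deaths are capped at the largest
# filtration value max(E) before subtracting the birth time.
import numpy as np

def truncated_lifetimes(diagram, max_filtration):
    """diagram: (m, 2) array of (birth, death) pairs; death may be np.inf."""
    births, deaths = diagram[:, 0], diagram[:, 1]
    return np.minimum(deaths, max_filtration) - births
```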

3. Topological Prototype Selector Algorithm

TPS consists of two successive filtrations per class, formalized as follows (an illustrative code sketch follows each list):

3.1 Neighbor Filtration (Inter-Class)

  1. Partition $\mathcal{X}$ into the target class $\mathcal{X}_c$ and the non-target classes $\mathcal{X}_{\setminus c}$.
  2. For each $x \in \mathcal{X}_c$, compute the sum of its $K$ nearest-neighbor distances in $\mathcal{X}_{\setminus c}$,

$$n(x) = \sum_{j=1}^K d(x, x_{\setminus c}^{(j)})$$

  3. Construct a weighted Rips complex in which edge $(x, y)$ appears at $\max\{d(x, y), n(x), n(y)\}$.
  4. Compute $0$- and $1$-dimensional persistence ($\mathrm{PD}^{(n)}$); extract lifetimes $\{\ell_j\}$.
  5. Remove lifetimes $\ell < \tau_{\min}$ (geometric regularization).
  6. Choose the quantile lifetime $\theta^{(n)}$.
  7. Select the slice index $j^*$ nearest to $\theta^{(n)}$.
  8. Extract the vertices $V^{(n)}$ present at scale $\epsilon_{j^*}$.
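
The following is a hedged Python sketch of these steps, using scikit-learn for the cross-class $K$-NN distances and GUDHI's `SimplexTree` for the weighted Rips filtration. All names and default values are illustrative, and as a simplification the quantile lifetime $\theta^{(n)}$ is used directly as the extraction scale rather than locating the nearest slice index.

```python
# Illustrative sketch of the neighbor (inter-class) filtration for one class.
# Vertices enter at n(x); edge (x, y) enters at max{d(x, y), n(x), n(y)}.
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors

def neighbor_filtration_vertices(X_c, X_other, K=3, tau_min=0.0, q=0.5, max_dim=1):
    # n(x): sum of distances to the K nearest points of the other classes.
    n = NearestNeighbors(n_neighbors=K).fit(X_other).kneighbors(X_c)[0].sum(axis=1)
    D = squareform(pdist(X_c))
    m = len(X_c)

    # Weighted Rips filtration built as a flag complex on the modified weights.
    st = gudhi.SimplexTree()
    for i in range(m):
        st.insert([i], filtration=n[i])
    for i in range(m):
        for j in range(i + 1, m):
            st.insert([i, j], filtration=max(D[i, j], n[i], n[j]))
    st.expansion(max_dim + 1)          # 2-simplices are needed for H_1
    st.persistence()                   # computes PD^(n)

    # Truncated lifetimes, with geometric regularization by tau_min.
    eps_max = max(f for _, f in st.get_filtration())
    lifetimes = []
    for k in range(max_dim + 1):
        for b, d in st.persistence_intervals_in_dimension(k):
            lifetimes.append(min(d, eps_max) - b)
    lifetimes = np.array([l for l in lifetimes if l >= tau_min])

    # Quantile lifetime theta^(n); vertices present at that scale form V^(n).
    theta_n = np.quantile(lifetimes, q) if len(lifetimes) else eps_max
    V_n = X_c[n <= theta_n]
    return V_n if len(V_n) else X_c    # fall back to the whole class if nothing qualifies
```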

3.2 Radius Filtration (Intra-Class)

  1. Restrict attention to $V^{(n)} \subset \mathcal{X}_c$.
  2. For each $x \in V^{(n)}$, compute the same-class radius,

$$r(x) = \sum_{y \in V^{(n)}} d(x, y)$$

  3. Build a Rips complex in which edge $(x, y)$ appears at $\max\{d(x, y), r(x), r(y)\}$.
  4. Compute persistence $\mathrm{PD}^{(r)}$; extract lifetimes $\{\ell_i\}$.
  5. Filter lifetimes using $\tau_{\min}$.
  6. Set $\theta^{(r)} = \mathrm{mean}\{\ell_i\}$.
  7. Select the slice $i^*$ closest to $\theta^{(r)}$.
  8. Collect the vertices $V^{(r)}$ as the prototypes for class $c$.
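
A matching sketch of the radius filtration on the boundary candidates $V^{(n)}$; as above, the names are illustrative and the mean lifetime $\theta^{(r)}$ is used directly as the extraction scale.

```python
# Illustrative sketch of the radius (intra-class) filtration on V^(n).
# r(x) is the sum of same-class distances; edge (x, y) enters at
# max{d(x, y), r(x), r(y)}, and the slice is chosen via the mean lifetime.
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform

def radius_filtration_prototypes(V_n, tau_min=0.0, max_dim=1):
    D = squareform(pdist(V_n))
    r = D.sum(axis=1)
    m = len(V_n)

    st = gudhi.SimplexTree()
    for i in range(m):
        st.insert([i], filtration=r[i])
    for i in range(m):
        for j in range(i + 1, m):
            st.insert([i, j], filtration=max(D[i, j], r[i], r[j]))
    st.expansion(max_dim + 1)
    st.persistence()                   # computes PD^(r)

    eps_max = max(f for _, f in st.get_filtration())
    lifetimes = []
    for k in range(max_dim + 1):
        for b, d in st.persistence_intervals_in_dimension(k):
            lifetimes.append(min(d, eps_max) - b)
    lifetimes = np.array([l for l in lifetimes if l >= tau_min])

    theta_r = lifetimes.mean() if len(lifetimes) else eps_max
    V_r = V_n[r <= theta_r]            # V^(r): prototypes for this class
    return V_r if len(V_r) else V_n
```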

TPS implicitly solves

$$\min_{\substack{i,\,j,\,V \subset \mathcal{X}_c}} |V| \quad \text{subject to} \quad V \text{ captures the significant topological features at } (i, j)$$

Parallelization is straightforward because the per-class selection loops are independent of one another.
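
A minimal class-parallel driver using joblib (one convenient option among several); `per_class_selector` stands in for the two filtrations sketched above.

```python
# Embarrassingly parallel per-class prototype selection.
import numpy as np
from joblib import Parallel, delayed

def select_prototypes_per_class(X, y, per_class_selector, n_jobs=-1):
    """per_class_selector(X_c, X_other) -> prototype array for that class."""
    classes = np.unique(y)
    results = Parallel(n_jobs=n_jobs)(
        delayed(per_class_selector)(X[y == c], X[y != c]) for c in classes
    )
    return dict(zip(classes, results))
```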

Algorithm Overview

| Phase | Operation | Output |
|---|---|---|
| Neighbor filtration | Rips + persistence | $V^{(n)}$ (boundary) |
| Radius filtration | Rips + persistence | $V^{(r)}$ (prototypes) |

4. Theoretical Guarantees and Computational Complexity

4.1 Complexity Analysis

With $|\mathcal{X}_c| = n_c$, constructing the Rips complex up to maximum dimension $h$ incurs $O(n_c^h)$ cost. Persistent homology (Ripser) practically executes in $O(n_c^2)$ to $O(n_c^3)$. TPS's two computations per class yield total complexity

$$O\left(K \sum_c n_c^3\right)$$

Parameter $h$ greatly influences the combinatorial cost.

4.2 Stability

By the stability theorem for persistent homology, minor metric perturbations yield proportional changes in persistence diagrams, ensuring that the prototype set adapts smoothly to input noise.
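
One standard quantitative form of this statement, for Rips filtrations of finite metric spaces $\mathcal{X}$ and $\mathcal{Y}$, bounds the bottleneck distance between corresponding persistence diagrams by the Gromov–Hausdorff distance between the spaces:

$$d_B\left(\mathrm{PD}_k(\mathcal{X}), \mathrm{PD}_k(\mathcal{Y})\right) \leq 2\, d_{GH}(\mathcal{X}, \mathcal{Y}),$$

so a small perturbation of the input point cloud moves each persistence diagram, and hence the computed lifetimes, only a correspondingly small amount.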

4.3 Prototype Cardinality Bound

If $\beta_0, \beta_1$ are the Betti numbers at the chosen slices,

$$|\mathcal{S}_c| = |V^{(r)}| \leq \sum_{k=0}^h \dim H_k(\mathcal{K}_{i^*,j^*}) \leq \sum_{k=0}^h \beta_k(\mathcal{X}_c)$$

Thus, the number of prototypes per class is bounded by the sum of the Betti numbers of $\mathcal{X}_c$.

5. Empirical and Practical Evaluation

5.1 Simulated Data

On nine synthetic 2D datasets (blobs, circles, moons, imbalanced scenarios), TPS demonstrates:

  • Mean G-Mean improvement: +1.06% (1-NN), +1.98% (SVM).
  • Data reduction rates: 78.0% (1-NN), 80.2% (SVM).

TPS typically preserves or enhances accuracy while retaining only 10–40% of original points.

5.2 Hyperparameter Effects

  • $q$ (neighbor quantile) and $\tau_{\min}$ (minimum persistence) regularize the geometric structure, balancing reduction against performance.
  • Smaller $K$ targets boundary topological features more effectively.

5.3 Computational Performance

  • For $\sim 2500$ points in $\mathbb{R}^4$, TPS processes each class in $\approx 2$ seconds.
  • Classwise parallelization achieves linear speedup.

5.4 Real Data Benchmarks

Across eight UCI datasets:

  • TPS achieves an average reduction of 69.3% with an average G-Mean change of +0.013.
  • Outperforms CNN+ENN (-0.040) and AllKNN (-0.001), and matches BienTib (+0.020) and K-Means, typically with fewer prototypes and lower runtime.

5.5 Metric Robustness

For text data (Spam/Ham, Doc2Vec embeddings in $\mathbb{R}^{100}$):

  • Under the Euclidean metric: G-Mean -0.08% at 80–90% reduction.
  • Under the cosine metric: G-Mean +1.7% at 62–90% reduction.

TPS leverages structure induced by different metric choices, outperforming purely distance-based alternatives.
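
As a brief illustration of this flexibility (with synthetic filler data rather than the benchmark embeddings), pairwise distances under any metric supported by scikit-learn can be precomputed and substituted into the distance-based filtrations above:

```python
# Metric flexibility: precompute distance matrices under different metrics.
import numpy as np
from sklearn.metrics import pairwise_distances

X_c = np.random.default_rng(0).normal(size=(50, 100))  # stand-in for Doc2Vec-style vectors
D_euclidean = pairwise_distances(X_c, metric="euclidean")
D_cosine = pairwise_distances(X_c, metric="cosine")
# Either matrix can replace the squareform(pdist(...)) call in the earlier sketches.
```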

6. Implementation Notes and Geometric Interpretation

  • The Ripser and GUDHI libraries efficiently compute Rips complexes and persistence for small $h$.
  • Recommended pipeline: precompute the distance matrix, then for each class run neighbor filtration → persistence → quantile-based slice selection, followed by radius filtration → persistence → mean-based slice selection, and extract the selected vertices (see the sketch after this list).
  • Geometric intuition:
    • Neighbor filtration identifies boundary/“thin” regions near other classes.
    • Radius filtration isolates “thick” intra-class zones near core features.
    • Their intersection selects topologically critical boundary exemplars.
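
Putting the pieces together, the following is an end-to-end sketch of this pipeline on synthetic data, reusing the illustrative per-class routines from Section 3 (`neighbor_filtration_vertices`, `radius_filtration_prototypes`); all parameter values are placeholders rather than recommended settings.

```python
# End-to-end sketch on synthetic data; assumes the per-class sketches from
# Section 3 are in scope. Parameter values are illustrative only.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

proto_X, proto_y = [], []
for c in np.unique(y_tr):
    X_c, X_other = X_tr[y_tr == c], X_tr[y_tr != c]
    V_n = neighbor_filtration_vertices(X_c, X_other, K=3, tau_min=0.0, q=0.5)
    V_r = radius_filtration_prototypes(V_n, tau_min=0.0)
    proto_X.append(V_r)
    proto_y.append(np.full(len(V_r), c))
X_s, y_s = np.vstack(proto_X), np.concatenate(proto_y)

acc_full = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_val, y_val)
acc_tps = KNeighborsClassifier(n_neighbors=1).fit(X_s, y_s).score(X_val, y_val)
print(f"kept {len(X_s)}/{len(X_tr)} points; acc full={acc_full:.3f}, TPS={acc_tps:.3f}")
```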

7. Significance and Distinction

TPS is the first prototype selector founded directly on TDA principles. Its bifiltration construction retains mathematically significant, boundary-informative points in a fashion that is parallelizable, interpretable, metric-flexible, and robust to noise. Unlike previous methods, TPS offers explicit geometric regularization through the $q$ and $\tau_{\min}$ settings, providing practitioners with principled control over dataset reduction while maintaining high classification fidelity across a spectrum of domains.
