Large-scale Multi-view Subspace Clustering in Linear Time (1911.09290v1)

Published 21 Nov 2019 in cs.LG, cs.CV, and stat.ML

Abstract: A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers have managed to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms typically have quadratic or even cubic complexity, which makes them inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.

Summary

The paper introduces the LMVSC algorithm, a scalable method that achieves linear time multi-view clustering through an innovative anchor-based strategy.
It builds a reduced-size anchor graph for each view and integrates the graphs through a singular value decomposition (SVD), which streamlines computation while preserving the accuracy of traditional spectral methods.
Empirical results on datasets with over 10,000 samples show that LMVSC outperforms existing methods in both clustering accuracy and computational efficiency.

Large-scale Multi-view Subspace Clustering in Linear Time

The paper proposes the Large-scale Multi-view Subspace Clustering (LMVSC) algorithm. It addresses the scalability of existing multi-view subspace clustering algorithms, which are often constrained by quadratic or cubic complexity, by presenting a method with linear time complexity. The paper argues that efficient algorithms are imperative in the age of big data, particularly for clustering tasks involving multiple views, which are increasingly common because data can be captured and represented in many different ways.

Methodology

The LMVSC algorithm uses an anchor-based strategy: instead of the full graph over all pairs of data points that subspace clustering methods traditionally construct, it builds a reduced-size graph that relates each data point to a small set of anchor points. This reduced representation substantially lowers both time and space complexity. Anchor points are chosen with an efficient procedure such as k-means, so the overall cost remains linear in the number of data points.
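The anchor idea for a single view can be sketched as follows. This is a minimal illustration and not the paper's exact formulation: it assumes anchors come from plain Lloyd's k-means and that each sample is expressed as a ridge-regularized combination of the anchors, with any constraints the paper may place on the graph left out. The function names `kmeans_anchors` and `anchor_graph` and the parameter `alpha` are hypothetical.

```python
import numpy as np

def kmeans_anchors(X, m, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns m anchor points with shape (m, d)."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every sample to every anchor: shape (n, m).
        d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            members = X[labels == j]
            if len(members):  # keep the old anchor if its cluster is empty
                anchors[j] = members.mean(axis=0)
    return anchors

def anchor_graph(X, anchors, alpha=1.0):
    """Assumed simplification: unconstrained ridge regression onto anchors.

    Solves min_Z ||X^T - A^T Z^T||_F^2 + alpha * ||Z||_F^2 in closed form,
    where A holds the anchors. Returns Z with shape (n, m).
    """
    m = anchors.shape[0]
    G = anchors @ anchors.T + alpha * np.eye(m)   # (m, m), independent of n
    return np.linalg.solve(G, anchors @ X.T).T    # (n, m)
```

With n samples, m anchors, and d features, each k-means iteration costs on the order of n·m·d operations and the ridge solve costs on the order of m³ + n·m·d, so both steps grow linearly in n as long as m stays fixed.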

The method first constructs a smaller graph for each data view and then integrates these graphs into a unified representation. The integration avoids costly spectral decomposition of large matrices: the smaller graphs are concatenated into a single matrix, and a singular value decomposition (SVD) is applied to that matrix instead. The paper supports this step with a theoretical guarantee that the embedding obtained from the SVD of the concatenated matrix is equivalent to the one produced by traditional spectral methods.
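The integration step can be illustrated with a short sketch. The layout is an assumption made here for illustration: each view contributes an n × m anchor graph, the graphs are stacked side by side and scaled by 1/√v for v views, and the fused n × n graph is taken to be Z Zᵀ. Under that assumption the equivalence is a standard linear-algebra fact: if Z = U Σ Vᵀ, then Z Zᵀ = U Σ² Uᵀ, so the top eigenvectors of the large graph are the top left singular vectors of the thin matrix Z. The function name `fuse_and_embed` is hypothetical.

```python
import numpy as np

def fuse_and_embed(Z_views, k):
    """Fuse per-view anchor graphs and return an (n, k) spectral embedding.

    Z_views: list of v arrays, each of shape (n, m) (assumed layout).
    """
    v = len(Z_views)
    Z = np.hstack(Z_views) / np.sqrt(v)   # concatenated matrix, shape (n, m * v)
    # Thin SVD of an n x (m*v) matrix; the n x n graph Z @ Z.T is never formed.
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k]                       # singular values come back in descending order
```

The rows of the returned embedding would then be clustered, for example with k-means, to obtain the final labels. The thin SVD costs on the order of n·(m·v)² operations, which is linear in n when the number of anchors and views is fixed.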

Experimental Results

Empirical evaluations on several benchmark datasets, including Handwritten, Caltech-101, Reuters, and NUS-WIDE-Object, demonstrate the strong performance of LMVSC in both clustering accuracy and computational efficiency. Notably, on datasets with more than 10,000 samples, the algorithm consistently outperformed state-of-the-art multi-view clustering methods at a fraction of the computational cost. The paper also shows that the method remains robust on single-view clustering tasks, delivering promising results on traditional large-scale datasets such as RCV1 and CoverType.

Implications and Future Directions

The introduction of LMVSC carries significant implications for practical applications where speed and scalability of data processing are paramount. Industries that rely on real-time data analysis, particularly those handling multimodal data, such as computer vision, natural language processing, and bioinformatics, could benefit from integrating such efficient clustering techniques into their pipelines.

Theoretically, the work sets a precedent for future research on scaling clustering algorithms, suggesting that anchor-based and graph-integration methods can be further refined, adapted to other clustering paradigms, or even extended to supervised learning scenarios with similar scalability challenges.

Future research may explore adaptive strategies for anchor selection, developing methods that automatically balance computational cost against clustering accuracy based on the characteristics of a specific dataset. Furthermore, integrating deep learning approaches with LMVSC could yield further improvements by using learned representations to inform both the choice of anchors and the construction of the multi-view graphs.

Overall, the paper contributes a scalable solution to a pervasive problem in multi-view clustering and paves the way for further advances in an area that is essential to contemporary data analysis.