
Scalable Tensor Factorizations for Incomplete Data (1005.2197v1)

Published 12 May 2010 in math.NA, cs.NA, and physics.data-an

Abstract: The problem of incomplete data - i.e., data with missing or unknown values - in multi-way arrays is ubiquitous in biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, communication networks, etc. We consider the problem of how to factorize data sets with missing values with the goal of capturing the underlying latent structure of the data and possibly reconstructing missing values (i.e., tensor completion). We focus on one of the most well-known tensor factorizations that captures multi-linear structure, CANDECOMP/PARAFAC (CP). In the presence of missing data, CP can be formulated as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) that uses a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factorize tensors with noise and up to 99% missing data. A unique aspect of our approach is that it scales to sparse large-scale data, e.g., 1000 x 1000 x 1000 with five million known entries (0.5% dense). We further demonstrate the usefulness of CP-WOPT on two real-world applications: a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes and the problem of modeling computer network traffic where data may be absent due to the expense of the data collection process.

Citations (600)

Summary

Scalable Tensor Factorizations for Incomplete Data

The reviewed paper addresses the challenge of performing tensor factorizations on datasets with significant missing entries, a common problem encountered in diverse fields such as EEG analysis, social network analysis, and chemometrics. It introduces the CP-WOPT (CP Weighted OPTimization) algorithm, leveraging the CANDECOMP/PARAFAC (CP) tensor decomposition method. The principal objective is to develop a scalable solution capable of accurate tensor factorization and completion, even when up to 99% of a dataset is missing.

Methodology

The authors reformulate CP factorization in the presence of missing data as a weighted least squares problem that fits only the known entries. CP-WOPT solves this problem with a first-order (gradient-based) optimization approach, which is key to its scalability on large, sparse data arrays. Extensive numerical experiments confirm that the algorithm handles 1000 x 1000 x 1000 tensors with only five million known entries (0.5% dense).
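To make the weighted formulation concrete, here is a minimal NumPy sketch of the objective and its gradients for a 3-way CP model. The function names are illustrative, and the dense `einsum` reconstruction is for clarity only; the paper's actual implementation exploits sparsity and avoids forming the full model tensor.

```python
import numpy as np

def cp_wls_objective(X, W, A, B, C):
    """Weighted least-squares objective for a 3-way CP model.

    X : data tensor (values at missing entries are ignored)
    W : binary weight tensor (1 = known entry, 0 = missing)
    A, B, C : factor matrices of shapes (I, R), (J, R), (K, R)
    """
    # Reconstruct the CP model: M[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]
    M = np.einsum('ir,jr,kr->ijk', A, B, C)
    # Only known entries (W == 1) contribute to the loss
    return 0.5 * np.sum((W * (X - M)) ** 2)

def cp_wls_gradients(X, W, A, B, C):
    """Gradients of the objective with respect to each factor matrix."""
    M = np.einsum('ir,jr,kr->ijk', A, B, C)
    T = W * (M - X)  # weighted residual; zero at missing entries
    gA = np.einsum('ijk,jr,kr->ir', T, B, C)
    gB = np.einsum('ijk,ir,kr->jr', T, A, C)
    gC = np.einsum('ijk,ir,jr->kr', T, A, B)
    return gA, gB, gC
```

These two functions are exactly what a first-order method needs: the objective and gradients can be handed to any gradient-based optimizer, with all factor matrices updated together rather than one at a time as in alternating schemes.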

Numerical Results

The paper presents robust numerical evidence for CP-WOPT's efficacy under substantial data sparsity: the algorithm successfully factorizes tensors with noise and up to 99% missing data. The results further indicate that CP-WOPT achieves significant computational efficiency gains over traditional methods such as EM-ALS and INDAFAC, particularly as the fraction of missing data grows, underscoring its potential in real-world settings such as EEG analysis, where electrode disconnections cause data loss.

Practical Implications

One of the paper's primary contributions lies in demonstrating CP-WOPT's application to multi-channel EEG data, effectively capturing brain dynamics despite missing signals. This capability is particularly valuable for practitioners who routinely face data loss challenges in real-time applications.

Moreover, in network traffic analysis, the paper highlights CP-WOPT's capacity for tensor completion, a critical task when data collection is expensive and measurements are therefore sparse. CP-WOPT preserves modeling accuracy even when large portions of the data are absent.
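Once the factor matrices have been fitted to the known entries, completion itself is a simple reconstruction: known values are kept and missing ones are filled from the low-rank model. A minimal sketch for the 3-way case, with the function name `cp_complete` chosen here for illustration:

```python
import numpy as np

def cp_complete(X, W, A, B, C):
    """Fill missing entries of X from a fitted 3-way CP model.

    X : observed tensor (missing entries may hold any value, e.g. 0)
    W : binary weight tensor (1 = known entry, 0 = missing)
    A, B, C : fitted factor matrices of shapes (I, R), (J, R), (K, R)
    """
    # Low-rank reconstruction from the CP factors
    M = np.einsum('ir,jr,kr->ijk', A, B, C)
    # Keep known entries from X, take missing entries from the model
    return W * X + (1 - W) * M
```

If the underlying data are well approximated by a low-rank CP model, the filled-in entries inherit that structure, which is what makes completion viable even at high missing-data percentages.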

Theoretical Implications and Future Research

Theoretically, the paper makes a significant contribution by extending CP tensor factorization to incomplete data, underscoring the potential of the CP model for higher-order data decompositions. Future research could integrate constraints such as non-negativity and sparsity, which would yield more interpretable factorization models, and could investigate robust techniques for centering data with missing entries, further broadening CP-WOPT's applicability.

In conclusion, CP-WOPT represents a substantial advancement for tensor factorization in incomplete datasets, offering scalable and efficient computational advantages. It opens avenues for future exploration in both methodological enhancements and new domain-specific applications, making it a pertinent contribution to the ongoing development of tensor analytics.