A Survey of Parallel Sequential Pattern Mining (1805.10515v2)

Published 26 May 2018 in cs.DB

Abstract: With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms generally have problems and challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, i.e., frequent itemset mining and association rule mining, and also suffers from the above challenges when handling the large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, an in-depth survey of the current status of parallel sequential pattern mining (PSPM) is investigated and provided, including detailed categorization of traditional serial SPM approaches, and state of the art parallel SPM. We review the related work of parallel sequential pattern mining in detail, including partition-based algorithms for PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for PSPM, and provide deep description (i.e., characteristics, advantages, disadvantages and summarization) of these parallel approaches of PSPM. Some advanced topics for PSPM, including parallel quantitative / weighted / utility sequential pattern mining, PSPM from uncertain data and stream data, hardware acceleration for PSPM, are further reviewed in details. Besides, we review and provide some well-known open-source software of PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era.

Summary

The paper surveys parallel sequential pattern mining (PSPM), analyzing the challenges that large datasets pose and classifying parallel strategies (partition-based, Apriori-based, pattern-growth-based, and hybrid) along with their performance.
It highlights partition-based strategies such as NPSPM, SPSPM, and HPSPM for distributing work across processors, and discusses scalable techniques such as MG-FSM for distributed architectures.
The survey outlines implications for fields such as bioinformatics and web mining, and suggests future directions, including integration with deep learning and better handling of stream and uncertain data.

An In-depth Analysis of Parallel Sequential Pattern Mining Approaches

The paper "A Survey of Parallel Sequential Pattern Mining" gives an extensive overview of parallel sequential pattern mining (PSPM). It identifies the challenges that traditional data mining algorithms face on large datasets and explains why parallel and distributed computing is needed to address them. The authors systematically categorize existing PSPM strategies, focusing on how well they adapt to real-world applications and how they perform on contemporary computational architectures.

Core Themes and Findings

The paper treats sequential pattern mining (SPM) as a fundamental data mining task. Unlike frequent itemset mining or association rule mining, SPM must respect the order of events within each sequence, which adds complexity and calls for specialized algorithms. The authors group serial SPM algorithms into four paradigms: Apriori-based methods, pattern-growth techniques, early-pruning algorithms, and constraint-based approaches. Of these, Apriori-style algorithms typically generate candidates level by level (breadth-first) and pay a high computational cost for candidate generation.
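To make the level-wise idea concrete, here is a minimal sketch of Apriori-style mining. It is an illustration under simplifying assumptions, not an algorithm taken from the survey: every sequence is a list of single items (no itemsets), support is the number of sequences that contain the pattern as a subsequence, and `min_support` is at least 1. The function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def apriori_spm(database, min_support):
    """Level-wise (breadth-first) mining; assumes min_support >= 1.

    Length-k candidates are built from frequent length-(k-1) patterns,
    then the whole database is scanned to count every candidate.
    """
    items = sorted({item for seq in database for item in seq})
    frequent = {}
    level = [(item,) for item in items]
    while level:
        counted = {cand: support(cand, database) for cand in level}
        survivors = {p: s for p, s in counted.items() if s >= min_support}
        frequent.update(survivors)
        # Join step: extend p with the last item of q when p minus its
        # first item equals q minus its last item.
        level = [p + (q[-1],) for p in survivors for q in survivors
                 if p[1:] == q[:-1]]
    return frequent


db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c", "d"], ["b", "c"]]
patterns = apriori_spm(db, min_support=2)
```

Even on this four-sequence database, the second level counts nine candidates and keeps only three, which is the candidate-generation overhead the paragraph above refers to.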

The discussion then turns to parallel SPM, where strategies are classified as partition-based, Apriori-based, pattern-growth-based, or hybrid. Partition-based strategies aim to reduce the computational load on each processor by balancing the workload across processors. NPSPM, SPSPM, and HPSPM are presented as representative methods of this kind.
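The following sketch illustrates the general idea of partitioning the workload: the database is split into near-equal chunks, each worker counts candidate supports on its own chunk, and the partial counts are summed. It is not a reproduction of NPSPM, SPSPM, or HPSPM. Threads stand in for processors, sequences are assumed to be lists of single items, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def partition(database, n_parts):
    """Round-robin split so each part gets a near-equal number of sequences."""
    return [database[i::n_parts] for i in range(n_parts)]


def local_counts(task):
    """Count the support of every candidate within one chunk."""
    chunk, candidates = task
    return Counter({c: sum(is_subsequence(c, s) for s in chunk)
                    for c in candidates})


def parallel_support(database, candidates, n_workers=2):
    """Sum per-chunk supports; every worker sees all candidates."""
    chunks = partition(database, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        tasks = [(chunk, candidates) for chunk in chunks]
        for partial in pool.map(local_counts, tasks):
            total.update(partial)  # Counter.update adds counts
    return total


db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c", "d"], ["b", "c"]]
counts = parallel_support(db, [("a", "b"), ("b", "c")], n_workers=2)
```

Because each sequence lives in exactly one chunk, the summed counts equal the serial counts regardless of the number of workers. The price is a reduction step after every round of counting.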

Among Apriori-based approaches, the survey focuses on techniques such as pSPADE and DGSP, with an emphasis on efficiency in shared-memory environments. In distributed settings, however, these methods incur significant communication and synchronization costs. Pattern-growth algorithms such as Par-ASP and its derivatives use depth-first strategies, which scale better than their breadth-first counterparts.
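A depth-first, pattern-growth search can be sketched with a simplified PrefixSpan-style recursion. This is the serial idea only, assuming sequences of single items; it is not Par-ASP or any parallel algorithm from the survey, and the function name is hypothetical. No candidates are generated: each frequent prefix is extended using only the suffixes that follow it.

```python
def prefixspan(database, min_support):
    """Depth-first pattern growth over projected databases (simplified)."""
    results = {}

    def grow(prefix, projected):
        # Count each item at most once per projected suffix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Projection: keep what follows the first occurrence of item.
            next_projected = [s[s.index(item) + 1:]
                              for s in projected if item in s]
            grow(pattern, next_projected)

    grow((), [list(s) for s in database])
    return results


db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c", "d"], ["b", "c"]]
patterns = prefixspan(db, min_support=2)
```

Each top-level projected database can be mined without reference to the others, which is one reason depth-first schemes tend to need less synchronization than level-wise ones. How a given parallel algorithm actually assigns that work is outside the scope of this sketch.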

Noteworthy Numerical and Theoretical Insights

The survey weighs the strengths of contemporary PSPM techniques against their scalability limits and computational trade-offs. MG-FSM, for example, incorporates gap constraints to reduce both memory usage and computational overhead, which makes it a scalable option on distributed platforms such as MapReduce. Approaches of this kind can also integrate hierarchical data structures, which supports more nuanced data exploration.
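A gap constraint changes what counts as a match. The sketch below checks a maximum-gap constraint: consecutive pattern items may be separated by at most `max_gap` skipped items. It shows only the matching rule, under the assumption of single-item sequences, and does not implement MG-FSM's distributed processing. The function names and the `max_gap` parameter are illustrative.

```python
def matches_with_gap(pattern, sequence, max_gap):
    """True if pattern occurs in sequence with at most max_gap items
    skipped between consecutive matched items."""
    if not pattern:
        return True
    # All positions where the first pattern item can be matched.
    positions = {i for i, item in enumerate(sequence) if item == pattern[0]}
    for item in pattern[1:]:
        # From every reachable position, look at most max_gap + 1 steps ahead.
        positions = {
            j
            for i in positions
            for j in range(i + 1, min(i + 2 + max_gap, len(sequence)))
            if sequence[j] == item
        }
        if not positions:
            return False
    return bool(positions)


def gap_support(pattern, database, max_gap):
    """Number of sequences that contain the pattern under the gap constraint."""
    return sum(matches_with_gap(pattern, seq, max_gap) for seq in database)
```

All candidate positions are tracked, not just the first match, because a greedy match can fail where a later one succeeds: in `["a", "x", "b", "a", "b"]` with `max_gap=0`, the first `a` is followed by `x`, but the second `a` is followed directly by `b`.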

The paper also covers specialized scenarios such as PSPM over uncertain data and stream data, noting both current limitations and possible directions for improvement. Advanced algorithms rely on predictive and probabilistic models to handle data uncertainty, while PSPM over stream data is still at an early stage and poses distinct efficiency and scalability challenges.
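One common way to model uncertainty, used here purely as an assumption for illustration, is to attach an existence probability to each sequence and rank patterns by expected support, the sum of the probabilities of the sequences that contain the pattern. The survey's summary does not specify which model individual algorithms use, and the names below are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def expected_support(pattern, uncertain_db):
    """Sum of existence probabilities over sequences containing the pattern.

    uncertain_db is a list of (sequence, probability) pairs; treating
    uncertainty at the level of whole sequences is an assumed model.
    """
    return sum(prob for seq, prob in uncertain_db
               if is_subsequence(pattern, seq))


uncertain_db = [(["a", "b", "c"], 0.9), (["a", "c"], 0.5), (["b", "a"], 0.8)]
score = expected_support(("a", "c"), uncertain_db)
```

Because expected support is a plain sum, partial sums computed on separate partitions can be added together in the same way as ordinary counts.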

Theoretical Implications and Future Directions

The survey underscores the implications of PSPM beyond data mining itself, for fields such as bioinformatics (e.g., DNA sequence analysis), web mining, and market basket analysis. Parallelizing SPM contributes to the understanding of data-driven decision-making on complex datasets. The intersection of SPM with emerging paradigms such as deep learning is a further direction for future inquiry: integration with neural network-based models could improve pattern detection and extend both the depth and breadth of sequential analysis.

Conclusion

Gan et al. offer a thorough survey that catalogs the evolution of PSPM and lays out a roadmap for future research. By systematically evaluating operational constraints, computational demands, and real-world applicability, the paper serves as both a comprehensive reference and a guide for advancing research on parallel and distributed frameworks for sequential pattern analysis. Future work is likely to benefit from addressing the outlined challenges, particularly those concerning dynamic data environments and privacy preservation.
