A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques (1707.02919v2)

Published 10 Jul 2017 in cs.CL, cs.AI, and cs.IR

Abstract: The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.



Summary

The paper provides a comprehensive overview of text mining by discussing classification, clustering, and extraction techniques with a focus on biomedical applications.
The paper details key methodologies such as Naive Bayes, decision trees, SVM, hierarchical clustering, k-means, and probabilistic models like HMM and CRF for structuring unstructured text data.
The paper outlines the challenges of high-dimensional, sparse data and points towards future integration of advanced NLP and deep learning models for improved analytical accuracy.

A Brief Survey of Text Mining: Classification, Clustering, and Extraction Techniques

This paper provides a comprehensive overview of text mining, describing the key methodologies and algorithms used to extract information from large volumes of text. It covers the principal tasks in text mining (classification, clustering, and information extraction), with an emphasis on techniques applied in domains such as biomedicine.

Text Mining Approaches

The authors introduce text mining as an essential process for deriving meaningful information from unstructured text data. They distinguish knowledge discovery from data mining, emphasizing the extraction of valid patterns. They also highlight how text mining relates to traditional data mining and stress its interdisciplinary nature, particularly its overlap with machine learning, databases, and statistics.

Core Techniques

Classification: The paper outlines classification techniques used in text mining, such as Naive Bayes, decision trees, and Support Vector Machines (SVM), emphasizing their application in assigning predefined categories to documents. The authors provide a detailed explanation of the Naive Bayes classifier, highlighting the differences between multivariate Bernoulli and multinomial models.
Clustering: The clustering section discusses methodologies like hierarchical clustering and k-means clustering, used for grouping similar documents. The unique characteristics of text data, such as high dimensionality and sparsity, necessitate specialized algorithms that consider word correlations and document normalization.
Information Extraction: The paper describes key tasks like Named Entity Recognition (NER) and relation extraction, essential for identifying entities and their interrelations. Probabilistic models such as Hidden Markov Models and Conditional Random Fields are explored for their effectiveness in extracting structured data from text.
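
The difference between the two Naive Bayes variants mentioned under Classification can be sketched as follows. This is a minimal illustration and not the paper's own code: the toy corpus, the labels "spam" and "ham", whitespace tokenization, and add-one (Laplace) smoothing are all assumptions made for the example. The multinomial model scores each token occurrence, so repeated words matter; the multivariate Bernoulli model scores only the presence or absence of every vocabulary term.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical data, for illustration only).
TRAIN = [
    ("cheap pills cheap offer", "spam"),
    ("limited offer buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

def train(docs):
    """Collect the vocabulary plus per-class term and document counts."""
    vocab = set()
    term_counts = {}  # class -> term occurrence counts (multinomial model)
    doc_counts = {}   # class -> number of documents containing term (Bernoulli model)
    n_docs = Counter()
    for text, label in docs:
        tokens = text.split()
        vocab.update(tokens)
        term_counts.setdefault(label, Counter()).update(tokens)
        doc_counts.setdefault(label, Counter()).update(set(tokens))
        n_docs[label] += 1
    return vocab, term_counts, doc_counts, n_docs

def multinomial_log_scores(text, model):
    """log P(c) + sum over token occurrences of log P(token | c)."""
    vocab, term_counts, _, n_docs = model
    total_docs = sum(n_docs.values())
    scores = {}
    for c in n_docs:
        total_terms = sum(term_counts[c].values())
        score = math.log(n_docs[c] / total_docs)
        for tok in text.split():
            if tok in vocab:  # out-of-vocabulary tokens are ignored
                score += math.log((term_counts[c][tok] + 1) / (total_terms + len(vocab)))
        scores[c] = score
    return scores

def bernoulli_log_scores(text, model):
    """log P(c) + sum over ALL vocabulary terms of log P(present/absent | c)."""
    vocab, _, doc_counts, n_docs = model
    total_docs = sum(n_docs.values())
    present = set(text.split())
    scores = {}
    for c in n_docs:
        score = math.log(n_docs[c] / total_docs)
        for term in vocab:
            p = (doc_counts[c][term] + 1) / (n_docs[c] + 2)
            score += math.log(p if term in present else 1 - p)
        scores[c] = score
    return scores

def predict(scores):
    """Return the class with the highest log score."""
    return max(scores, key=scores.get)

model = train(TRAIN)
query = "cheap offer now"
multinomial_label = predict(multinomial_log_scores(query, model))
bernoulli_label = predict(bernoulli_log_scores(query, model))
```

In this sketch, repeating a word in the query changes the multinomial score but leaves the Bernoulli score untouched, which is the practical distinction between the two event models.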
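
The role of document normalization in clustering can be illustrated with a small k-means sketch. Everything specific here is an assumption for the example rather than something taken from the paper: the four toy documents, the plain tf × log(N/df) weighting, unit-length (L2) normalization, and the choice of the first and third documents as initial centroids. Dense Python lists are used for readability; a real system would likely use sparse vectors because most entries are zero.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data): two biomedical and two finance documents.
DOCS = [
    "gene protein expression",
    "protein gene sequence",
    "stock market price",
    "market price trading",
]

def normalize(vec):
    """Scale a vector to unit L2 length so document length does not dominate."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors over a sorted vocabulary."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(normalize([tf[t] * math.log(n / df[t]) for t in vocab]))
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_centroids, n_iter=10):
    """Plain k-means: assign each vector to its nearest centroid, then re-average."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vectors = tfidf_vectors(DOCS)
# Fixed initial centroids keep the example deterministic.
labels = kmeans(vectors, init_centroids=[vectors[0], vectors[2]])
```

With unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity, so the assignment step effectively groups documents by shared vocabulary rather than by length.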
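
To make the use of Hidden Markov Models for named entity recognition concrete, the sketch below decodes the most likely tag sequence with the Viterbi algorithm. It is an illustration under stated assumptions only: the two tags ("O" and "GENE"), all start, transition, and emission probabilities, and the fallback probability for unknown words are hand-picked hypothetical values, whereas a real tagger would estimate them from annotated text. Conditional Random Fields, the discriminative counterpart the paper also covers, are not shown.

```python
import math

# Hypothetical two-state HMM: "O" (outside any entity) and "GENE".
STATES = ["O", "GENE"]
START = {"O": 0.8, "GENE": 0.2}
TRANS = {
    "O": {"O": 0.8, "GENE": 0.2},
    "GENE": {"O": 0.6, "GENE": 0.4},
}
EMIT = {
    "O": {"the": 0.4, "regulates": 0.3, "gene": 0.2, "brca1": 0.05, "tp53": 0.05},
    "GENE": {"the": 0.02, "regulates": 0.02, "gene": 0.06, "brca1": 0.45, "tp53": 0.45},
}
UNK = 1e-6  # assumed fallback emission probability for unseen words

def viterbi(tokens):
    """Return the most probable state sequence for the tokens (log-space)."""
    if not tokens:
        return []
    # best[t][s]: log-probability of the best path ending in state s at position t.
    best = [{s: math.log(START[s]) + math.log(EMIT[s].get(tokens[0], UNK)) for s in STATES}]
    back = []  # back[t-1][s]: best predecessor of state s at position t
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (
                best[-1][prev]
                + math.log(TRANS[prev][s])
                + math.log(EMIT[s].get(tok, UNK))
            )
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

tags = viterbi("the gene brca1 regulates tp53".split())
# tags == ["O", "O", "GENE", "O", "GENE"]
```

Working in log-space avoids numerical underflow on long sentences, since the raw path probabilities shrink multiplicatively with every token.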

Biomedical Applications

The paper extends its discussion to text mining in biomedical domains, underscoring processes like information extraction and summarization. Unique challenges in biomedical text mining, such as the dynamic nature of medical terminology, are addressed. The utility of biomedical ontologies in enhancing semantic understanding and facilitating accurate extraction is also emphasized.

Implications and Future Directions

Although it proposes no new methods, this survey highlights the ongoing importance and challenges of text mining. Its implications span both practical applications across industries and theoretical frameworks that continue to evolve. Future work in this field may integrate more sophisticated models, drawing on advances in natural language processing and deep learning to improve accuracy and efficiency.

Overall, the paper serves as a foundational review of text mining, offering insights into its methodologies, challenges, and applications, particularly in biomedicine. The structured overview aids in understanding the complexities and nuances of managing and interpreting large text corpora.
