Ontology-based Feature Selection: A Survey (2104.07720v2)

Published 15 Apr 2021 in cs.AI and cs.LG

Abstract: The Semantic Web emerged as an extension to the traditional Web, towards adding meaning to a distributed Web of structured and linked data. At its core, the concept of ontology provides the means to semantically describe and structure information and data and expose it to software and human agents in a machine- and human-readable form. For software agents to be realized, it is crucial to develop powerful artificial intelligence and machine learning techniques, able to extract knowledge from information and data sources and represent it in the underlying ontology. This survey aims to provide insight into key aspects of ontology-based knowledge extraction, from various sources such as text, images, databases and human expertise, with emphasis on the task of feature selection. First, some of the most common classification and feature selection algorithms are briefly presented. Then, selected methodologies, which utilize ontologies to represent features and perform feature selection and classification, are described. The presented examples span diverse application domains, e.g., medicine, tourism, mechanical and civil engineering, and demonstrate the feasibility and applicability of such methods.

Summary

The paper surveys various methodologies for Ontology-based Feature Selection (OBFS), explaining how they integrate domain knowledge from ontologies to enhance the feature selection process in machine learning.
Core OBFS approaches include leveraging concept mapping and semantic similarity, exploiting ontological hierarchies, ontology-driven filtering and ranking, and using rule-based reasoning.
OBFS techniques demonstrate effectiveness across diverse domains by improving performance, reducing dimensionality, and increasing interpretability, although their success relies heavily on the quality and availability of domain ontologies.

Ontology-based Feature Selection (OBFS) leverages the structured, semantic knowledge encoded within domain ontologies to enhance the feature selection process in machine learning workflows. By utilizing formal representations of concepts, properties, and their interrelations, OBFS aims to identify more relevant, interpretable, and potentially higher-level features than purely data-driven methods do. This approach integrates domain expertise, formalized within the ontology, directly into the feature engineering and selection pipeline, seeking improvements in dimensionality reduction, classification performance, and model comprehensibility. The survey "Ontology-based Feature Selection: A Survey" (2021) provides a comprehensive overview of the methodologies developed for this purpose.

Core Methodologies in Ontology-Based Feature Selection

Several distinct approaches utilize ontological structures for feature selection, often combining semantic knowledge with traditional machine learning techniques.

Concept Mapping and Semantic Similarity: This common technique maps low-level features (e.g., terms from text documents, raw data attributes) to concepts within an ontology or lexical resource (e.g., WordNet, UMLS, SNOMED CT). Features that fail to map, or that are semantically distant from the relevant concepts according to ontological relationships (e.g., no hypernym path, low semantic similarity score), are pruned. For instance, Elhadad et al. [56] utilized WordNet to filter terms from Bag-of-Words representations of web documents, retaining only terms with a semantic path via a common hypernym to target categories, subsequently weighting them with TF-IDF. Similarly, Vicient et al. [58] mapped Potential Named Entities (PNEs) from tourism texts to ontological classes using direct matching and WordNet hypernym expansion, employing web-based Pointwise Mutual Information (PMI) scores to rank the relevance of classes.
Hierarchy Exploitation: The inherent hierarchical structure (superclass-subclass, part-whole relationships) within ontologies is frequently exploited. This can involve generalizing specific features to higher-level concepts, specializing concepts to finer-grained features, or searching for optimal feature sets across different abstraction levels. Wang et al. [60], working with medical documents, mapped terms to UMLS concepts, organized these concepts hierarchically per class, and employed a hill-climbing search based on concept frequency to identify an optimal subset of representative concepts. Lu et al. [67] merged RxNorm and NDF-RT drug ontologies into a hierarchical structure, performing feature selection via a top-down traversal, sorting nodes by Information Gain Ratio (IGR), and pruning nodes based on parent/child relationships and IGR scores.
Ontology-Driven Filtering and Ranking: Ontological characteristics, such as concept types, properties, cardinalities, or derived semantic metrics, can serve as criteria for filtering or ranking potential features. Abdollahi et al. [65] mapped clinical note expressions to UMLS concepts for Coronary Artery Disease (CAD) classification but restricted the feature set to concepts explicitly typed as "Disease or Syndrome" or "Sign or Symptom" within UMLS. Particle Swarm Optimization (PSO) was then applied to this semantically pre-filtered set, optimizing for classification accuracy. Di Noia et al. [75] leveraged ontology-driven data summarization (patterns, frequencies, cardinalities) from Linked Data for recommender systems, using cardinality descriptors to filter properties and pattern frequency to rank the remaining properties (features).
Rule-Based Reasoning with Ontologies: Ontologies provide the vocabulary and semantic framework for defining logical rules (often using languages like SWRL - Semantic Web Rule Language) that encode domain knowledge or heuristics for feature selection. Reasoning engines infer relationships or select features based on these rules applied to instance data mapped to the ontology. Mabkhot et al. [70] employed an ontology of manufacturing processes and materials where SWRL rules matched product features to process capabilities. This was integrated with Case-Based Reasoning (CBR), using ontology-defined similarity measures. Han et al. [73] encoded relationships between noise signals (targets) and vibration signals (sources) in an NVH ontology using SWRL rules based on signal characteristics (frequency, amplitude), enabling reasoning to identify noise sources (features).
Ontology for Domain Representation and Feature Definition: In some applications, the ontology primarily serves to structure the domain, formally define the concepts and attributes relevant as features, and establish their interrelationships. Subsequent feature selection or classification may then operate on this ontology-defined representation. Kang et al. [72] used a process ontology to model machining features and capabilities, with inference rules selecting appropriate processes. Belgiu et al. [74] defined building type classes in an ontology; while Random Forest (RF) initially identified important features (e.g., slope, height) from Airborne Laser Scanning (ALS) data, the final classification logic, incorporating RF-determined thresholds for these features, was modeled within the ontology. Guan et al. [76] utilized a two-level ontology describing Security Requirements (SRs) and Security Patterns (SPs), with ontology-defined attributes and a classification scheme based on ontological facets selecting relevant SPs (features) for given SRs.
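
The concept-mapping and hierarchy-exploitation ideas above can be illustrated with a minimal sketch. It is not the procedure of any surveyed paper: the toy is-a hierarchy and the function names (hypernym_chain, filter_by_hypernym, generalize) are assumptions made for illustration, and a real pipeline would query WordNet or UMLS instead of a hand-written dictionary. The filter is also simplified to "the term descends from the target category", which is narrower than a path through any common hypernym.

```python
# Toy is-a hierarchy (child -> parent). All names are illustrative
# assumptions, not entries taken from WordNet or UMLS.
TOY_IS_A = {
    "aspirin": "analgesic",
    "ibuprofen": "analgesic",
    "analgesic": "drug",
    "penicillin": "antibiotic",
    "antibiotic": "drug",
    "drug": "substance",
    "hotel": "accommodation",
    "hostel": "accommodation",
    "accommodation": "facility",
}


def hypernym_chain(concept, is_a):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    seen = {concept}
    while chain[-1] in is_a:
        parent = is_a[chain[-1]]
        if parent in seen:  # guard against accidental cycles
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def filter_by_hypernym(terms, target, is_a):
    """Keep terms that map to the hierarchy and are, or descend from, `target`."""
    known = set(is_a) | set(is_a.values())
    return [t for t in terms if t in known and target in hypernym_chain(t, is_a)]


def generalize(term, is_a, levels=1):
    """Replace a term with its ancestor `levels` steps up (capped at the root)."""
    chain = hypernym_chain(term, is_a)
    return chain[min(levels, len(chain) - 1)]


terms = ["aspirin", "hotel", "penicillin", "banana"]

# Pruning: "hotel" has no path to "drug"; "banana" does not map at all.
kept = filter_by_hypernym(terms, "drug", TOY_IS_A)
print(kept)

# Generalization: replace each surviving term by its direct parent concept.
print(sorted({generalize(t, TOY_IS_A) for t in kept}))
```

Unmapped terms are dropped silently here; as the coverage limitation reported for WordNet suggests, a practical system might instead keep them aside for review so that relevant domain-specific terms are not lost.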

Associated Algorithms and Tools

The implementation of OBFS methodologies typically involves integrating ontological resources and reasoning tools with standard machine learning and NLP techniques.

Ontologies & Lexical Resources: Foundational resources include general-purpose ontologies like WordNet, domain-specific ontologies such as UMLS, RxNorm, NDF-RT in medicine, or custom-built domain ontologies.
Mapping & NLP Tools: Tools like MetaMap are used for mapping text to UMLS concepts. Standard NLP pipelines involving POS tagging, stemming, named entity recognition (e.g., using OpenNLP), and parsing are often prerequisite steps.
Semantic & Statistical Measures: Feature relevance is often quantified using metrics like TF-IDF, PMI, Information Gain Ratio (IGR), or various semantic similarity measures computable over the ontology structure.
Search & Optimization Algorithms: When the feature space remains large even after semantic filtering, search algorithms like Hill Climbing or metaheuristics like Particle Swarm Optimization (PSO) may be employed to find optimal feature subsets, often using classifier performance as the fitness function in a wrapper approach.
Reasoning Engines: For rule-based approaches, SWRL-capable reasoners and rule engines (e.g., Pellet, Drools) are used to infer relationships or make selections. Case-Based Reasoning (CBR) systems can also leverage ontology-defined similarity metrics.
Machine Learning Classifiers: Standard classifiers like Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees (DT), Logistic Regression (LR), and Random Forests (RF) are commonly used either to evaluate the quality of the selected features or as part of wrapper/embedded feature selection methods integrated with the ontological processing.
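
To make two of the listed ingredients concrete, the sketch below computes the Information Gain Ratio of discrete concept features and runs a greedy hill-climbing search over feature subsets. It is a generic illustration under stated assumptions, not the algorithm of any surveyed paper: the feature names and data are invented, and the score function is a stand-in (summed IGR minus a size penalty) for the classifier performance that a wrapper approach would use as its fitness function.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain_ratio(feature, labels):
    """IGR = (H(labels) - H(labels | feature)) / H(feature)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:  # a constant feature carries no information
        return 0.0
    return (entropy(labels) - conditional) / split_info


def hill_climb(candidates, score):
    """Greedy forward search: add the best feature until the score stops improving."""
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        for f in candidates:
            if f in selected:
                continue
            trial = score(selected + [f])
            if trial > best:
                best, choice, improved = trial, f, True
        if improved:
            selected.append(choice)
    return selected, best


# Invented toy data: one binary indicator per (hypothetical) ontology concept.
labels = [1, 1, 0, 0]
features = {
    "has_drug_concept": [1, 1, 0, 0],   # perfectly aligned with the labels
    "has_hotel_concept": [1, 0, 1, 0],  # independent of the labels
    "constant": [1, 1, 1, 1],           # never varies
}
igr = {name: information_gain_ratio(col, labels) for name, col in features.items()}


def score(subset):
    # Stand-in fitness: summed IGR with a small penalty per selected feature.
    return sum(igr[f] for f in subset) - 0.1 * len(subset)


selected, best = hill_climb(list(features), score)
print(igr)
print(selected, best)
```

Hill climbing of this kind can stop at a local optimum, which is one reason population-based metaheuristics such as PSO are sometimes preferred when the filtered feature space is still large.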

Application Domains and Effectiveness

OBFS techniques have been applied across a diverse range of domains, demonstrating their versatility.

Domains: Notable applications include text classification (web documents, scientific literature, medical notes), medicine and healthcare (disease prediction, clinical trial analysis, drug interaction analysis), engineering (manufacturing process selection, NVH analysis), geospatial analysis (building type classification from remote sensing data), recommender systems (leveraging Linked Data), and information security (mapping requirements to patterns).
Effectiveness: The surveyed studies generally report favorable outcomes compared to baseline methods lacking ontological guidance or using purely statistical dimensionality reduction techniques like PCA.
- Performance: Improvements in classification accuracy are frequently reported [56, 60, 65, 67]. For example, Abdollahi et al. [65] found that combining UMLS-based filtering with PSO outperformed methods using only PSO or standard classifiers on the raw feature set for CAD prediction.
- Dimensionality Reduction: OBFS methods effectively reduce feature space dimensionality by focusing on semantically meaningful concepts [56, 60, 65].
- Interpretability: Features derived from or mapped to ontology concepts are often more easily interpreted by domain experts.
- Knowledge Integration: Ontologies provide a principled mechanism for incorporating explicit domain knowledge into the feature selection process [70, 73].
Limitations: The success of OBFS is often contingent on the availability, quality, and completeness of the underlying domain ontology. Wang et al. [60] noted the dependency on a well-developed ontology. Elhadad et al. [56] observed that the coverage limitations of general ontologies like WordNet could lead to the exclusion of relevant domain-specific terms. Constructing and maintaining high-quality domain ontologies remains a significant knowledge engineering effort.

Conclusion

Ontology-based feature selection offers a potent set of methodologies for integrating semantic domain knowledge into machine learning workflows. By leveraging the structured representations within ontologies, practitioners can guide the selection process towards more meaningful, interpretable, and often more predictive features. While dependent on the quality of the available ontological resources, OBFS has demonstrated effectiveness across diverse application domains, providing a valuable approach particularly when domain knowledge is crucial for model performance and understanding.