Generating Knowledge Graphs by Employing Natural Language Processing and Machine Learning Techniques within the Scholarly Domain
(2011.01103v1)
Published 28 Oct 2020 in cs.CL, cs.AI, and cs.LG
Abstract: The continuous growth of scientific literature brings innovations and, at the same time, raises new challenges. One of them is that analysis has become difficult due to the high volume of published papers, which require manual effort to annotate and manage. Novel technological infrastructures are needed to help researchers, research policy makers, and companies to time-efficiently browse, analyse, and forecast scientific research. Knowledge graphs, i.e., large networks of entities and relationships, have proved to be an effective solution in this space. Scientific knowledge graphs focus on the scholarly domain and typically contain metadata describing research publications such as authors, venues, organizations, research topics, and citations. However, the current generation of knowledge graphs lacks an explicit representation of the knowledge presented in the research papers. As such, in this paper, we present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications and integrates them in a large-scale knowledge graph. Within this research work, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, ii) describe an approach for integrating entities and relationships generated by these tools, iii) show the advantage of such a hybrid system over alternative approaches, and iv) as a chosen use case, generate a scientific knowledge graph including 109,105 triples, extracted from 26,827 abstracts of papers within the Semantic Web domain. As our approach is general and can be applied to any domain, we expect that it can facilitate the management, analysis, dissemination, and processing of scientific knowledge.
The paper presents a novel pipeline that integrates multiple NLP and ML tools to extract fine-grained entities and relationships from scholarly texts.
The approach achieves high performance with a precision of 78.71%, recall of 80.19%, and an F-measure of 81.17% when applied to Semantic Web abstracts.
The method enhances research navigation and semantic querying while offering adaptability to other domains by replacing domain-specific components.
The paper "Generating Knowledge Graphs by Employing Natural Language Processing and Machine Learning Techniques within the Scholarly Domain" (Dessì et al., 2020) addresses the challenge of extracting structured knowledge from the ever-growing volume of scientific literature. Traditional methods and existing scientific knowledge graphs (SKGs) often focus on metadata (authors, venues, citations) and lack an explicit representation of the actual content and relationships discussed within the research papers. The authors propose a novel architecture and pipeline to automatically generate large-scale SKGs by extracting entities and relationships directly from the text of scholarly publications, integrating outputs from multiple NLP and ML tools, and refining the results.
The core problem is the difficulty in analyzing and gaining fine-grained insights from vast collections of scientific papers due to their unstructured nature. Existing SKGs in the scholarly domain, while useful for navigation and analysis based on metadata, do not capture the intricate knowledge embedded in the text, such as specific methods used for tasks, comparisons, or properties of entities. This limits advanced semantic querying, exploration, recommendation, and trend analysis.
The proposed solution is a multi-stage pipeline that processes scientific text (specifically abstracts in the presented use case) through an ensemble of extraction and refinement modules:
Extraction of Entities and Relations: This initial step utilizes several tools:
Extractor Framework [29]: A deep learning-based tool specifically designed for scientific literature, extracting entities (Task, Method, Metric, Material, Other-Scientific-Term, Generic) and predefined relations (Compare, Part-of, Conjunction, Evaluate-for, Feature-of, Used-for, Hyponym-Of). Conjunction relations are discarded.
CSO Classifier [30]: Identifies research topics based on the Computer Science Ontology (CSO) [31] using syntactic (n-grams, Levenshtein similarity) and semantic (Word2Vec, POS tagging) methods. CSO topics are added as entities.
OpenIE [32]: A general-purpose Open Information Extraction tool from Stanford CoreNLP. It extracts subject-verb-object triples by analyzing sentence structure. Only triples where the subject and object match entities found by the Extractor Framework or CSO Classifier are kept to focus on the target domain.
Stanford CoreNLP PoS tagger: Extracts verbs appearing between any two entities identified by the Extractor Framework and CSO Classifier, generating <entity, verb, entity> triples.
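To make this last step concrete, here is a minimal sketch of the verb-between-entities heuristic. It uses spaCy's POS tagger for brevity rather than Stanford CoreNLP, assumes entity mentions are supplied by the upstream extractors, and, as a simplification, only pairs adjacent mentions; the function name and matching logic are illustrative, not the authors' exact implementation.

```python
# Minimal sketch: extract <entity, verb, entity> triples for verbs that
# occur between two known entity mentions in a sentence.
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_triples(sentence: str, entities: list[str]) -> list[tuple[str, str, str]]:
    """Return <entity, verb, entity> triples for every verb appearing
    between two (adjacent) entity mentions in the sentence."""
    doc = nlp(sentence)
    # Locate character offsets of each entity mention in the sentence.
    spans = []
    for ent in entities:
        start = sentence.lower().find(ent.lower())
        if start != -1:
            spans.append((start, start + len(ent), ent))
    spans.sort()
    triples = []
    for (s1, e1, ent1), (s2, e2, ent2) in zip(spans, spans[1:]):
        # Collect lemmatized verbs whose tokens fall between the two mentions.
        verbs = [t.lemma_ for t in doc
                 if t.pos_ == "VERB" and e1 <= t.idx < s2]
        for v in verbs:
            triples.append((ent1, v, ent2))
    return triples

print(verb_triples(
    "The CSO Classifier identifies research topics in scholarly abstracts.",
    ["CSO Classifier", "research topics"]))
# [('CSO Classifier', 'identify', 'research topics')]
```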
Entities Manager: This stage refines and merges the extracted entities.
Entities Refiner: Cleans entities by removing punctuation and stop words, discards overly generic terms using a frequency-based filter (comparing frequency in domain-specific, general computer science, and broad datasets, along with black/whitelists), splits compound entities containing "and", and resolves acronyms within the same abstract using regular expressions.
Entities Merger: Combines entities referring to the same concept using lemmatization (SpaCy) and leverages alternative labels provided by the CSO ontology to merge synonymous research topics.
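A minimal sketch of two of these refinement steps follows, assuming the common "long form (ACRONYM)" pattern for acronym resolution and spaCy lemmatization for merging variants; the regex and merge policy are illustrative assumptions, not the authors' exact rules.

```python
# Sketch: acronym resolution within one abstract, plus lemma-based merging
# of entity label variants (e.g. singular/plural).
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def acronym_map(abstract: str) -> dict[str, str]:
    """Map acronyms to their expansion, assuming 'long form (ACRONYM)'."""
    mapping = {}
    for match in re.finditer(r"([A-Za-z][A-Za-z\s-]*)\s\(([A-Z]{2,})\)", abstract):
        words = match.group(1).split()
        acronym = match.group(2)
        candidate = words[-len(acronym):]
        # Accept only if the candidate words' initials spell the acronym.
        if "".join(w[0] for w in candidate).upper() == acronym:
            mapping[acronym] = " ".join(candidate).lower()
    return mapping

def canonical(label: str) -> str:
    """Lemmatize an entity label so morphological variants merge."""
    return " ".join(tok.lemma_ for tok in nlp(label.lower()))

abstract = "We build a Knowledge Graph (KG) from scholarly abstracts."
print(acronym_map(abstract))          # {'KG': 'knowledge graph'}
print(canonical("knowledge graphs"))  # 'knowledge graph'
```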
Relations Manager: This stage processes the extracted triples to find the best relation predicate and map them to a common vocabulary.
Best Relation Finder: Triples from the Extractor Framework are processed by selecting the most frequent relation for each entity pair. Triples from OpenIE and the PoS tagger (which have verbs as predicates) are processed by finding the verb whose word embedding is closest to the average embedding of all verbs linking the same entity pair (using word embeddings trained on MAG data). The "support" (number of papers supporting a triple) is recorded for each triple.
Mapper: Reduces redundant relations by mapping semantically similar verb predicates to a single representative label. This is done by clustering verb word embeddings using hierarchical clustering (based on 1-cosine similarity) and manually defining a map from clustered verbs to representative predicates (e.g., uses, utilizes, employs -> uses). Relations from the Extractor Framework are manually integrated into this map.
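Two of these steps lend themselves to short sketches. First, the best-relation selection for verb-based triples: among all verbs observed between an entity pair, keep the one closest to the centroid of their embeddings. The toy vectors below stand in for the word embeddings trained on MAG data.

```python
# Sketch: pick the verb nearest (by cosine similarity) to the average
# embedding of all verbs linking the same entity pair.
import numpy as np

def best_verb(verbs: list[str], emb: dict[str, np.ndarray]) -> str:
    vecs = np.stack([emb[v] for v in verbs])
    centroid = vecs.mean(axis=0)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(verbs, key=lambda v: cos(emb[v], centroid))

# Toy embeddings, an assumption for illustration only.
emb = {
    "uses":    np.array([1.0, 0.1, 0.0]),
    "employs": np.array([0.9, 0.2, 0.1]),
    "extends": np.array([0.0, 1.0, 0.8]),
}
print(best_verb(["uses", "employs", "extends"], emb))  # 'employs'
```

Second, the predicate-mapping step can be sketched with SciPy's hierarchical clustering over 1 - cosine similarity; the cut threshold and the representative label per cluster are assumptions here, since the paper defines the final verb-to-predicate map manually.

```python
# Sketch: cluster verb embeddings so near-synonymous predicates can be
# mapped to a single representative label.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

verbs = ["uses", "utilizes", "employs", "extends"]
vecs = np.array([[1.0, 0.1], [0.95, 0.15], [0.9, 0.2], [0.0, 1.0]])

# 1 - cosine similarity as the distance measure, as in the paper.
dist = pdist(vecs, metric="cosine")
clusters = fcluster(linkage(dist, method="average"), t=0.1, criterion="distance")
for verb, c in zip(verbs, clusters):
    print(verb, c)  # 'uses', 'utilizes', 'employs' share a cluster id
```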
Triples Selection: This stage filters the extracted triples to include only "valid" and "consistent" ones.
Valid Triples: All triples from the Extractor Framework and OpenIE, as well as PoS tagger triples with high support (≥ 10 papers), are considered valid.
Consistent Triples: PoS tagger triples with low support are evaluated for consistency with the set of valid triples. A Multi-Layer Perceptron classifier is trained on the valid triples (the concatenation of subject and object embeddings as input, the relation as output) and then applied to the low-support PoS triples. If the predicted relation matches the actual relation, or if the semantic similarity (the average of cosine and Wu-Palmer similarity) between the predicted and actual relation embeddings exceeds a threshold (empirically set at 0.5), the triple is deemed consistent and added to the valid set.
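A minimal sketch of the consistency check, assuming precomputed entity embeddings (random stand-ins below) and using scikit-learn's MLPClassifier. Only the exact-match criterion is shown; the similarity fallback (average of cosine and Wu-Palmer similarity against the 0.5 threshold) is omitted for brevity.

```python
# Sketch: train an MLP on valid triples (subject+object embeddings -> relation)
# and keep a low-support triple when the predicted relation matches its own.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
DIM = 50
entity_emb: dict[str, np.ndarray] = {}  # entity -> vector; random stand-ins

def vec(entity: str) -> np.ndarray:
    if entity not in entity_emb:
        entity_emb[entity] = rng.normal(size=DIM)
    return entity_emb[entity]

def features(subj: str, obj: str) -> np.ndarray:
    return np.concatenate([vec(subj), vec(obj)])

# Valid (high-support) triples used as training data; toy examples repeated
# so the classifier has enough samples to fit.
valid = [("ontology alignment", "uses", "similarity measure"),
         ("neural network", "supports", "entity extraction")] * 20
X = np.stack([features(s, o) for s, _, o in valid])
y = [r for _, r, _ in valid]

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X, y)

def is_consistent(triple: tuple[str, str, str]) -> bool:
    subj, rel, obj = triple
    predicted = clf.predict(features(subj, obj).reshape(1, -1))[0]
    return predicted == rel

print(is_consistent(("ontology alignment", "uses", "similarity measure")))
```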
Knowledge Graph Enhancement: The final set of triples is enriched by leveraging the hierarchical structure of the CSO ontology. If a triple (e2, r, e1) exists, e3 is a superTopicOf e1 in CSO, and no triple links e2 and e3, then the triple (e2, r, e3) is inferred and added.
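A minimal sketch of this inference rule, with a toy superTopicOf map standing in for the CSO hierarchy:

```python
# Sketch: propagate a relation from a topic to its super-topic when the
# two entities are not yet linked by any triple.
super_topic_of = {"ontology matching": ["ontology engineering"]}  # toy data

def enhance(triples: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    inferred = set()
    linked = {(s, o) for s, _, o in triples}
    for s, r, o in triples:
        for parent in super_topic_of.get(o, []):
            # Add (s, r, parent) only if s and parent are not yet linked.
            if (s, parent) not in linked:
                inferred.add((s, r, parent))
    return triples | inferred

triples = {("our approach", "supports", "ontology matching")}
print(enhance(triples))
# adds ('our approach', 'supports', 'ontology engineering')
```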
Finally, the resulting triples are converted to RDF format to form the SKG.
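A minimal sketch of this conversion using rdflib; the namespace URI and labeling scheme are placeholders rather than the authors' actual vocabulary.

```python
# Sketch: serialize the final set of triples as RDF (Turtle) with rdflib.
from rdflib import Graph, Namespace, URIRef

SKG = Namespace("http://example.org/skg/")  # placeholder namespace

def to_uri(label: str) -> URIRef:
    return SKG[label.replace(" ", "_")]

g = Graph()
g.bind("skg", SKG)
for subj, rel, obj in [("ontology alignment", "uses", "similarity measure")]:
    g.add((to_uri(subj), to_uri(rel), to_uri(obj)))

print(g.serialize(format="turtle"))
```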
For the use case, the authors applied the pipeline to 26,827 abstracts of Semantic Web papers from the Microsoft Academic Graph (MAG). The resulting SKG contained 109,105 triples. An evaluation was conducted on a gold standard of 818 triples related to Semantic Web sub-topics, manually annotated by five researchers. The hybrid approach combining OpenIE, the Extractor Framework, and the PoS tagger with consistent triple selection achieved the best performance, with a precision of 78.71%, recall of 80.19%, and an F-measure of 81.17% (Table 2), demonstrating the advantage of integrating multiple methods over using individual tools.
The authors discuss examples showing the interpretability and potential usefulness of the extracted triples for understanding research topics like "ontology alignment" (Table 3, Figure 4) and "supervised machine learning" (Figure 5). They also highlight current limitations, such as ambiguous relations, potentially incorrect predicate mapping, missing sub-concept relationships not captured by CSO's superTopicOf, and the difficulty in recognizing synonyms not in existing knowledge bases.
The approach is designed to be generalizable to other domains, although domain-specific components like the Extractor Framework (which was trained on Computer Science papers) and the CSO ontology would need to be replaced or adapted with resources relevant to the new domain (e.g., SciSpacy for biomedical text, MeSH or SNOMED-CT for biomedical ontologies, MSC for Mathematics, PhySH for Physics). The entity and relation handling logic is largely domain-independent.
Potential real-world applications discussed include powering intelligent systems for navigating research, providing structured data for graph embedding techniques, enhancing academic recommender systems by explaining suggestions based on article content, and improving trend detection systems by providing a rich network of research entities and their relationships.
Future work outlined by the authors includes: enforcing semantic constraints on entities and relations using technologies like SHACL, allowing the extraction of multiple relationships between entities, better handling of conjunctions, integrating cross-document relations like citations, improving synonym detection using word and graph embeddings, addressing scalability bottlenecks, and conducting extrinsic evaluations to assess the value of the SKG in downstream AI tasks.