- The paper introduces SKG, a framework that integrates semantic concept extraction with traditional metadata for comprehensive academic paper analysis.
- It employs a semi-supervised NER approach using RoBERTa and human-in-the-loop fine-tuning to accurately classify key scientific concepts.
- The framework supports interactive semantic queries and visualizations, demonstrating utility in tasks like literature reviews and scholarly profiling.
This paper introduces SKG, a Semantic Knowledge Graph framework designed for versatile information retrieval and analysis in academic literature. The exponential growth of research publications calls for methods beyond simple keyword search and topic modeling, both of which lack semantic understanding and structured output. SKG addresses this by integrating semantic concepts extracted from abstracts with traditional meta-information into a unified graph structure.
The core of the framework is the Semantic Knowledge Graph (SKG). Unlike many existing academic knowledge graphs that primarily focus on metadata, SKG incorporates concepts and their semantic roles extracted from paper abstracts. The defined ontology includes five entity types:
- Paper: Represents individual papers with attributes like title, ID, year, and URL.
- Concept: Represents scientific keywords or keyphrases, with semantic roles such as Application, Data, Method, Visualization, and Evaluation. Concepts recognized in DBpedia are linked with URLs and normalized names.
- Author: Represents authors by name.
- Journal: Represents journals by name.
- Venue: Represents conference venues by name.
Relationships connect these entities, including standard meta-data links (Paper -> Author, Paper -> Journal/Venue, Paper -> Paper (citations)) and crucial semantic links (Paper -> Concept, categorized by the Concept's semantic role).
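To make the ontology concrete, here is a minimal sketch of how a paper and its links could be expressed as RDF triples with rdflib; the namespace and property names (e.g., skg:hasAuthor, skg:usesVisualization) are illustrative assumptions rather than the paper's published schema.

```python
# Minimal sketch of the SKG ontology as RDF triples using rdflib.
# Namespace and property names are illustrative assumptions, not the
# schema published with the paper.
from rdflib import Graph, Namespace, Literal, RDF

SKG = Namespace("http://example.org/skg/")  # hypothetical namespace

g = Graph()
g.bind("skg", SKG)

paper = SKG["paper/12345"]
g.add((paper, RDF.type, SKG.Paper))
g.add((paper, SKG.title, Literal("A Survey of Graph Visualization")))
g.add((paper, SKG.year, Literal(2021)))

author = SKG["author/jane_doe"]
g.add((author, RDF.type, SKG.Author))
g.add((paper, SKG.hasAuthor, author))

# Semantic link: the predicate encodes the Concept's semantic role.
concept = SKG["concept/sankey_diagram"]
g.add((concept, RDF.type, SKG.Concept))
g.add((paper, SKG.usesVisualization, concept))

print(g.serialize(format="turtle"))
```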
The construction of SKG involves building a large, domain-specific dataset called VISBank (derived by filtering and enhancing the Semantic Scholar Open Research Corpus, S2ORC), extracting Concept entities, and representing the resulting graph in RDF. Concept entity extraction is the most demanding step and is handled by a semi-supervised module consisting of:
- Unsupervised Pipeline: Generates weak training samples for Named Entity Recognition (NER). A rule-based approach uses POS tags to extract noun n-grams as candidate entities; a fine-tuned QA model (RoBERTa) then assigns initial semantic labels (Application, Data, Method, Visualization, Evaluation) to these candidates based on their context and a pre-defined question for each label (sketched in code after this list).
- Human Fine-tuning: High-confidence samples from the unsupervised step are selected and manually refined by human annotators to create a high-quality training dataset. An iterative process with domain experts ensures consistency in labeling.
- Supervised Training: An NER model is fine-tuned on the human-annotated dataset to automatically recognize and classify Concept entities across the entire corpus. The paper reports evaluating several pre-trained language models, with RoBERTa achieving the highest average F1 score.
- Entity Normalization: Extracted concepts are mapped to DBpedia using DBpedia Spotlight to normalize entities and enrich SKG with links to external knowledge.
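A minimal sketch of how the weak-labeling step could look, assuming spaCy noun chunks stand in for the rule-based POS extraction and a SQuAD-style RoBERTa QA model (deepset/roberta-base-squad2) with illustrative question templates; the paper's exact prompts and candidate rules may differ.

```python
# Weak labeling: POS-based candidate extraction plus QA-based role
# assignment. Model choice and question wording are assumptions.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

QUESTIONS = {  # one question per semantic role (wording is assumed)
    "Application": "What application or task does the paper address?",
    "Data": "What data does the paper use?",
    "Method": "What method does the paper propose?",
    "Visualization": "What visualization technique does the paper use?",
    "Evaluation": "How is the approach evaluated?",
}

def weak_label(abstract: str):
    """Extract noun-phrase candidates via POS-based chunking, then
    assign a role to candidates overlapping the QA answer span."""
    candidates = [c.text for c in nlp(abstract).noun_chunks]
    samples = []
    for role, question in QUESTIONS.items():
        ans = qa(question=question, context=abstract)
        for cand in candidates:
            if cand.lower() in ans["answer"].lower():
                samples.append((cand, role, ans["score"]))
    return samples
```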
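The normalization step can be illustrated against DBpedia Spotlight's public REST endpoint; the confidence threshold, error handling, and batching below are simplified assumptions.

```python
# Entity normalization via the public DBpedia Spotlight REST API.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def normalize(concept: str, confidence: float = 0.5):
    """Return the DBpedia URI for a concept string, or None if no
    resource is recognized above the confidence threshold."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": concept, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resources = resp.json().get("Resources", [])
    return resources[0]["@URI"] if resources else None

print(normalize("sankey diagram"))  # e.g. http://dbpedia.org/resource/Sankey_diagram
```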
Semantic queries over SKG are essentially subgraph extractions based on specified criteria. The paper proposes a versatile algorithm (Algorithm 1) that takes source entities, a target entity type, and the number of desired results as input. It finds reachable nodes of the target type from the source nodes, scores them based on the number of simple paths connecting source and target entities, and returns the top-K target entities and their connecting edges. This algorithm forms the basis for the interactive query system.
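A hedged reconstruction of this query on a NetworkX property graph follows; the paper's exact scoring and path constraints may differ, and a hop cutoff is added here to keep path enumeration tractable. Only the top-K targets are returned; collecting the connecting edges is omitted for brevity.

```python
# Sketch of the semantic query: score reachable target-type nodes by
# the number of simple paths from the source entities, return top-k.
import networkx as nx

def semantic_query(g: nx.Graph, sources, target_type: str, k: int, cutoff: int = 3):
    """Rank nodes of `target_type` by the number of simple paths
    (up to `cutoff` hops) connecting them to the source nodes."""
    scores = {}
    for node, attrs in g.nodes(data=True):
        if attrs.get("type") != target_type or node in sources:
            continue
        n_paths = sum(
            len(list(nx.all_simple_paths(g, s, node, cutoff=cutoff)))
            for s in sources
        )
        if n_paths > 0:
            scores[node] = n_paths
    return sorted(scores, key=scores.get, reverse=True)[:k]
```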
To let users interact with SKG and run these semantic queries efficiently, the authors designed and developed a dataflow system. It is built around components (operators for data manipulation, viewers for visualization) that users connect via a drag-and-drop interface, meeting requirements for interactive querying, heterogeneous data visualization, flexible pipelines, and raw-data access. Key components include the following (a minimal sketch of chaining them in code follows the list):
- Operators:
  - Querier: Retrieves initial entities by keyword/keyphrase matching.
  - Expander: Implements the semantic query algorithm (Algorithm 1) to retrieve target entities of a specified type along with their connections to the source entities. It can output the target entities, the source-target graph, or the graph among the target entities.
  - Comparer: Merges and visualizes graphs from multiple Expanders, highlighting common nodes.
- Viewers:
  - Node Visualizer: Displays graph structures (e.g., citation or collaboration networks) as node-link diagrams and concept relationships as Sankey diagrams.
  - Table Viewer: Presents raw data from operators in tabular form.
  - Node Viewer: Links concepts to their DBpedia pages for external information.
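The component-chaining idea can be approximated in code as below; the class and method names are hypothetical (the actual system is an interactive drag-and-drop UI), and `semantic_query` refers to the sketch shown earlier.

```python
# Illustrative Querier -> Expander chaining; names are hypothetical.
class Querier:
    """Retrieve seed entities whose name matches a keyword."""
    def __init__(self, graph, keyword):
        self.graph, self.keyword = graph, keyword
    def run(self):
        return [n for n, a in self.graph.nodes(data=True)
                if self.keyword.lower() in str(a.get("name", "")).lower()]

class Expander:
    """Run the semantic query from a set of source entities."""
    def __init__(self, graph, target_type, k=10):
        self.graph, self.target_type, self.k = graph, target_type, k
    def run(self, sources):
        return semantic_query(self.graph, sources, self.target_type, self.k)

# Pipeline: papers matching "text mining", expanded to their most
# strongly connected Concept entities.
# papers = Querier(skg, "text mining").run()
# concepts = Expander(skg, "Concept", k=20).run(papers)
```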
The effectiveness of the Concept entity extraction module was evaluated by measuring the F1 score of entities recognized by the unsupervised pipeline and by the fine-tuned models against human-labeled ground truth. The unsupervised pipeline proved a useful starting point, and fine-tuning, particularly with RoBERTa, substantially improved accuracy.
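For instance, entity-level F1 against such ground truth might be computed with seqeval over BIO-tagged sequences; the data below is illustrative, not from the paper.

```python
# Entity-level F1 over BIO tags (illustrative data).
from seqeval.metrics import f1_score

y_true = [["B-Method", "I-Method", "O", "B-Data", "O"]]
y_pred = [["B-Method", "I-Method", "O", "O", "O"]]
print(f1_score(y_true, y_pred))  # ~0.67: one of two gold entities recovered
```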
Case studies demonstrate the framework's utility. The "Information Summarization of Trust in ML" case validated SKG's coverage and the quality of extracted concepts by comparing automatically summarized concepts for a set of papers with expert-provided summaries. It also showcased using the dataflow system for co-authorship analysis. The "Knowledge Discovery of Text Mining" case highlighted the system's flexibility in building analysis pipelines for tasks like automated literature review (combining concept-based and citation-based retrieval, summarizing concepts for a set of papers) and comparative scholarly profiling (comparing the research interests/publication venues of two authors based on their connected concepts and metadata).
The authors emphasize SKG's advantages: Extensibility (new data is easy to incorporate), Usability (the graph can serve downstream tasks such as recommendation or act as input to graph neural networks), and Transferability (the approach adapts to other domains with minimal modification). Future work includes integrating new knowledge and applying machine learning for insight discovery over the heterogeneous SKG.
In conclusion, the paper proposes SKG as a novel framework for navigating academic literature, combining a rich semantic knowledge graph built via a semi-supervised extraction process with an interactive visual dataflow system. This approach lets users perform flexible, semantically aware queries and analyses that go beyond traditional metadata-only knowledge graphs, effectively supporting tasks like literature review, scholarly profiling, and knowledge discovery.