Data-driven Coreference-based Ontology Building (2410.17051v1)

Published 22 Oct 2024 in cs.CL

Abstract: While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a creative-commons license, along with the code.

Summary

The paper presents a data-driven method that leverages coreference chains from 30 million biomedical abstracts to construct ontologies.
The method transforms the extracted phrase graph into a Directed Acyclic Graph (DAG), using betweenness centrality to separate hierarchical, identity, and noisy edges; evaluated against existing biomedical ontologies, it reports 84.3% hierarchy-edge recall and 92.1% direction consistency.
The results indicate that the approach uncovers hierarchical relationships missing from traditional ontologies and can complement them, with the potential to support automated semantic analysis and reduce manual curation costs.

Overview of "Data-driven Coreference-based Ontology Building"

This paper presents a novel approach to ontology construction by leveraging coreference resolution over a vast corpus, specifically 30 million biomedical abstracts. The authors propose a data-driven method for building ontologies that utilizes coreference chains to establish a graph of phrases, providing an alternative to traditional, manually curated ontologies.

Methodology

The process constructs a graph whose nodes are string phrases and whose edges connect phrases that co-occur within the same coreference chain. Betweenness centrality, computed on this graph, is then used to distinguish hierarchical, identity (alias), and noisy edges, to assign a direction to hierarchical edges, and to split ambiguous nodes, that is, strings that denote several distinct concepts.
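The graph-construction step can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `normalize` and `build_phrase_graph`, the normalization rule (lowercasing and whitespace collapsing), and the toy chains are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def normalize(phrase: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(phrase.lower().split())


def build_phrase_graph(chains):
    """Map each unordered phrase pair to the number of chains containing both.

    `chains` is an iterable of coreference chains, each a list of mention
    strings. Phrases are deduplicated within a chain before pairing.
    """
    edge_counts = defaultdict(int)
    for chain in chains:
        phrases = sorted({normalize(mention) for mention in chain})
        for a, b in combinations(phrases, 2):
            edge_counts[frozenset((a, b))] += 1
    return dict(edge_counts)


# Toy chains (invented for illustration only).
chains = [
    ["Aspirin", "the drug", "aspirin"],
    ["Ibuprofen", "the drug"],
    ["aspirin", "the NSAID"],
    ["aspirin", "The drug"],
]
graph = build_phrase_graph(chains)
for pair, count in sorted(graph.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), count)
```

Keeping a per-edge count is a design choice of this sketch; it would allow rarely observed pairs to be filtered out before any centrality analysis.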

Key steps include:

Coreference Graph Construction: Running an efficient coreference resolution algorithm over PubMed abstracts to extract coreference chains, followed by filtering and normalization of phrases.
Ontology Extraction: Transforming the graph into a Directed Acyclic Graph (DAG) using betweenness centrality to guide edge directionality, identify aliases, and distinguish potential noise.
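To make the role of betweenness centrality concrete, the sketch below computes node betweenness on a toy phrase graph using Brandes' algorithm in plain Python. The orientation rule in `orient_edges` (treat the more central endpoint as the more general term) is an illustrative assumption, not the procedure from the paper, and the summary does not specify whether the authors use node or edge betweenness. The graph and all function names are hypothetical.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized node betweenness for an undirected, unweighted graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in bc.items()}


def orient_edges(adj, bc):
    """ASSUMPTION: point each edge from the more central node to the less central one."""
    return {(u, v) for u in adj for v in adj[u] if bc[u] > bc[v]}


# Toy undirected phrase graph (invented for illustration only).
edges = [
    ("disease", "infection"), ("disease", "cancer"), ("disease", "diabetes"),
    ("infection", "pneumonia"), ("infection", "sepsis"),
    ("cancer", "melanoma"), ("cancer", "leukemia"),
]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

bc = betweenness_centrality(adj)
directed = orient_edges(adj, bc)  # set of (general, specific) pairs
print(sorted(directed))
```

On this toy graph the general term "disease" lies on the most shortest paths, so every edge is oriented away from it. Real coreference graphs are far noisier, which is presumably why the paper also has to separate identity and noise edges before orienting.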

Results and Evaluation

The resulting data-driven ontology demonstrated substantial overlap with existing biomedical ontologies, although it also uncovered hierarchical relationships that were absent in these traditional resources. The evaluation employed comparisons with the SnomedCT and UMLS ontologies, achieving:

Hierarchy-edge recall of 84.3% and precision of 40.1%, with precision rising to 75% under human verification.
Direction consistency with SnomedCT of 92.1%.
Identity-edge clustering quality, measured by an entropy of 0.406 and an Adjusted Rand Index (ARI) of 0.387.
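As a rough guide to how figures like these can be computed, the sketch below scores a set of predicted (parent, child) edges against a reference set. It assumes a simplified protocol: precision and recall ignore direction, and direction consistency is measured only on edges found in both sets. The summary does not describe how the paper maps phrases to SnomedCT or UMLS concepts, so the function `hierarchy_metrics` and the toy edge sets are hypothetical.

```python
def hierarchy_metrics(predicted, reference):
    """Return (precision, recall, direction_consistency).

    `predicted` and `reference` are sets of (parent, child) tuples.
    ASSUMPTION: each unordered pair appears in at most one direction per set.
    """
    pred_pairs = {frozenset(edge) for edge in predicted}
    ref_pairs = {frozenset(edge) for edge in reference}
    shared = pred_pairs & ref_pairs

    precision = len(shared) / len(pred_pairs) if pred_pairs else 0.0
    recall = len(shared) / len(ref_pairs) if ref_pairs else 0.0

    # Among shared pairs, count those predicted in the reference direction.
    same_direction = sum(1 for edge in predicted if edge in reference)
    direction_consistency = same_direction / len(shared) if shared else 0.0
    return precision, recall, direction_consistency


# Toy edge sets (invented for illustration only).
predicted = {
    ("disease", "infection"),
    ("infection", "sepsis"),
    ("sepsis", "shock"),      # not in the reference
    ("cancer", "disease"),    # in the reference, but reversed
}
reference = {
    ("disease", "infection"),
    ("infection", "sepsis"),
    ("disease", "cancer"),
}
precision, recall, direction_consistency = hierarchy_metrics(predicted, reference)
print(precision, recall, direction_consistency)
```

Under this definition, a predicted edge missing from the reference lowers precision even if it is correct, which is one way an automatic precision figure can understate quality relative to human judgment.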

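Taken together, these figures indicate broad coverage (high recall) with moderate automatic precision. The gap between the automatic precision of 40.1% and the human-verified 75% suggests that many edges absent from the reference ontologies are nonetheless valid.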
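For readers unfamiliar with the ARI score mentioned in the results, here is a small self-contained sketch of the standard Adjusted Rand Index computed from a contingency table. It illustrates the general metric only; how the paper builds its gold clusters and which entropy variant it reports are not stated in this summary, and the label lists below are invented.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI between two flat clusterings of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: clusterings trivially agree

    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions with different label names score 1.0.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that cuts across every gold cluster scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

An ARI of 0 corresponds to chance-level agreement and 1 to identical partitions, so a value such as 0.387 reflects partial but clearly better-than-chance agreement.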

Implications

The research suggests that data-driven ontologies can complement human-curated knowledge bases by offering real-time, comprehensive, and contextually aligned representations of domain knowledge. This approach could significantly aid in extending coverage in specialized areas, reducing maintenance costs, and facilitating more accurate text mining applications.

Future Directions

Anticipated future developments involve applying this scalable ontology-building method in other knowledge domains, refining the assessment techniques to mitigate evaluation challenges, and enhancing integration with existing ontological databases. Additionally, the application of machine learning refinements and increasing the efficiency of computation algorithms may further optimize this approach.

Combined with future advances in AI, this method has the potential to improve understanding not only in the biomedical domain but also across other scientific fields, providing a foundation for automated knowledge discovery and semantic analysis.

Conclusion

The paper presents a framework for ontology construction that sidesteps the cost of manual curation by deriving structure from coreference chains in text. Its results support the feasibility of data-driven ontologies as a way to capture complex, domain-specific knowledge structures.