A new methodology for constructing a publication-level classification system of science (1203.0532v1)

Published 2 Mar 2012 in cs.DL

Abstract: Classifying journals or publications into research areas is an essential element of many bibliometric analyses. Classification usually takes place at the level of journals, where the Web of Science subject categories are the most popular classification system. However, journal-level classification systems have two important limitations: They offer only a limited amount of detail, and they have difficulties with multidisciplinary journals. To avoid these limitations, we introduce a new methodology for constructing classification systems at the level of individual publications. In the proposed methodology, publications are clustered into research areas based on citation relations. The methodology is able to deal with very large numbers of publications. We present an application in which a classification system is produced that includes almost ten million publications. Based on an extensive analysis of this classification system, we discuss the strengths and the limitations of the proposed methodology. Important strengths are the transparency and relative simplicity of the methodology and its fairly modest computing and memory requirements. The main limitation of the methodology is its exclusive reliance on direct citation relations between publications. The accuracy of the methodology can probably be increased by also taking into account other types of relations, for instance based on bibliographic coupling.


Summary

The paper introduces a novel methodology that classifies individual publications using direct citation relationships to form hierarchical research areas.
The approach produces three classification levels from broad disciplines to specific subfields, enabling detailed analysis of multidisciplinary trends.
The study highlights limitations of relying solely on direct citations and proposes integrating additional relational data to enhance accuracy.

Publication-Level Classification System for Science

Waltman and van Eck present a novel methodology for constructing a publication-level classification system that addresses limitations in prevalent journal-level systems, such as those used by Web of Science and Scopus. Traditional systems classify journals into research areas, which can lead to inadequate detail and challenges with multidisciplinary journals. This paper discusses a new approach that classifies individual publications based on citation relationships, enabling greater granularity and flexibility.

Methodology Overview

The methodology involves three primary steps:

Determining Relatedness: Relatedness between publications is determined from direct citations, which yields a binary matrix of citation relationships. This keeps computing and memory requirements modest, but it ignores indirect relations such as co-citation and bibliographic coupling.
Cluster Formation: Publications are clustered into research areas at several hierarchical levels. Each cluster represents a research area at a specific granularity, from broad disciplines down to specific subfields. Parameters such as the resolution at each level and the minimum number of publications per area guide this process.
Labeling Research Areas: Labels are generated from terms extracted from publication titles and abstracts. These terms help characterize each research area, though refinement is needed, particularly at higher aggregation levels.
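The first two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_relatedness` and `local_moving`, the toy citation pairs, and the simplified quality function (within-cluster links minus `resolution` times the number of within-cluster pairs) are all assumptions made for illustration, and the minimum-size parameter is omitted.

```python
from collections import defaultdict


def build_relatedness(citations):
    """Symmetric binary relatedness: two publications are related
    if either one cites the other (direct citation only)."""
    neighbors = defaultdict(set)
    for citing, cited in citations:
        if citing != cited:
            neighbors[citing].add(cited)
            neighbors[cited].add(citing)
    return dict(neighbors)


def local_moving(neighbors, resolution, max_passes=20):
    """Greedy local moving (illustrative assumption, not the paper's code).

    Each publication either stays in its cluster or moves to a
    neighboring cluster, whichever gives the highest gain, where
    gain = links to the cluster - resolution * size of the cluster.
    """
    cluster = {p: p for p in neighbors}  # start from singletons
    size = {p: 1 for p in neighbors}
    for _ in range(max_passes):
        moved = False
        for p in sorted(neighbors):
            # Count links from p to each cluster among its neighbors.
            links = defaultdict(int)
            for q in neighbors[p]:
                links[cluster[q]] += 1
            old = cluster[p]
            size[old] -= 1  # temporarily take p out of its cluster
            best = old
            best_gain = links.get(old, 0) - resolution * size[old]
            for c, k in sorted(links.items()):
                gain = k - resolution * size[c]
                if gain > best_gain:
                    best, best_gain = c, gain
            cluster[p] = best
            size[best] += 1
            if best != old:
                moved = True
        if not moved:
            break
    return cluster


# Toy data: two citation triangles joined by a single citation (4 cites 3).
citations = [(2, 1), (3, 1), (3, 2), (5, 4), (6, 4), (6, 5), (4, 3)]
neighbors = build_relatedness(citations)
coarse = local_moving(neighbors, resolution=0.5)
fine = local_moving(neighbors, resolution=1.5)
print(len(set(coarse.values())), len(set(fine.values())))  # prints: 2 6
```

On this toy input the lower resolution groups the two triangles into two areas, while the higher resolution leaves every publication on its own, which is the sense in which a resolution parameter controls granularity.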

Application and Results

The methodology was applied to a dataset of almost ten million publications spanning 2001 to 2010. The classification structure includes three levels: broad disciplines, fields, and subfields. At the highest level, 20 areas were identified, which overlap substantially with traditional scientific disciplines. Notably, some areas have no clear correspondence to established disciplines, reflecting the evolving nature of scientific inquiry.

The second level contains 672 research areas, which are visualized using bibliometric mapping to show how areas relate and to highlight potential hot spots, such as graphene research in physics. The third level offers finer resolution with over 22,000 areas, though accuracy occasionally suffers because the classification relies solely on direct citation data.

Limitations and Future Directions

While the method offers transparency and modest resource requirements, its exclusive reliance on direct citations is a limitation. Publications with too few citation links to other publications in the dataset cannot be reliably assigned to a research area. Future work could incorporate additional relational data, such as shared bibliographic references or semantic similarity, to improve coverage and accuracy.

Assessing publication relatedness through bibliographic coupling or content analysis could mitigate issues of misclassification, particularly for multidisciplinary or sparsely connected publications. Moreover, better labeling techniques, perhaps integrating expert judgment or journal title terms, could enhance clarity and usability.
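To make the bibliographic coupling idea concrete, here is a minimal Python sketch that counts the references shared by each pair of publications. The function name `bibliographic_coupling`, the input layout (a dictionary from publication ID to its reference list), and the toy IDs are assumptions for illustration; the paper only suggests this kind of relation as a possible extension.

```python
from collections import defaultdict
from itertools import combinations


def bibliographic_coupling(references):
    """Return {(pub_a, pub_b): number of shared references}.

    `references` maps each publication ID to the IDs it cites
    (hypothetical layout chosen for this sketch).
    """
    # Invert the mapping: reference -> set of publications citing it.
    citing = defaultdict(set)
    for pub, refs in references.items():
        for ref in set(refs):
            citing[ref].add(pub)

    # Every pair of publications citing the same reference gains one unit.
    strength = defaultdict(int)
    for pubs in citing.values():
        for a, b in combinations(sorted(pubs), 2):
            strength[(a, b)] += 1
    return dict(strength)


# A and B never cite each other, yet they share two references.
references = {
    "A": ["r1", "r2", "r3"],
    "B": ["r2", "r3"],
    "C": ["r4"],
}
coupling = bibliographic_coupling(references)
print(coupling)  # prints: {('A', 'B'): 2}
```

Publications A and B would be unrelated under direct citation alone, but coupling links them through their shared references, which is how such a measure could help place sparsely connected publications.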

The approach holds promise not only for publication-level classification but also for informing more refined journal-level systems. As scientific disciplines continue to evolve, such adaptable methodologies are important for capturing the complexity and interconnected nature of modern research.

Overall, the methodology moves beyond the limitations of journal-level classification and offers a scalable, more detailed alternative. Future research using richer relatedness measures could further improve its accuracy and applicability.
