- The paper introduces the concept of 'frequent users' (100-1000 downloads/year) as a robust indicator of active research engagement beyond traditional citation counts.
- It employs detailed data acquisition, filtering, and correlation analyses between usage logs and publication records to validate its approach.
- The study highlights how immediate usage metrics can overcome citation lags, offering a promising alternative for measuring scholarly impact across disciplines.
Research activity is typically assessed with traditional bibliometric indicators such as publication and citation counts. While valuable, these metrics don't capture the full picture, particularly the essential activity of researchers reading and accessing literature as part of the research cycle. The paper "Usage Bibliometrics as a Tool to Measure Research Activity" (1706.02153) explores the practical use of digital usage data, specifically downloads of scholarly articles, as a complementary measure of research activity.
A core challenge in using usage data from digital libraries (like the Astrophysics Data System - ADS, used in this paper) is that access logs contain significant noise from user types (students, practitioners, the general public, crawlers, random browsers) who are not directly engaged in generating scholarly publications. To address this, the paper proposes focusing on a filtered subset of users, defined as "frequent users." In the context of ADS, this class is empirically defined as users who download between 100 and 1000 full-text articles per year. This threshold is based on the observation that usage patterns of users within this range correlate well with known populations of active researchers and authors in astronomy.
Implementing a system to leverage usage bibliometrics based on this paper involves several practical steps:
- Data Acquisition:
- Usage Logs: Obtain detailed click-stream logs from the digital library or platform. These logs should ideally include information about the requested item (e.g., article ID), the time of the request, and a unique identifier associated with the user session or account (beyond just IP addresses). IP addresses are also needed for geolocation.
- Publication Data: Access a database of scholarly publications, including metadata such as authors, affiliations, publication year, journal, and cited references. This can often be obtained via APIs provided by the platform or external databases.
- Data Preprocessing and Filtering:
- Robot/Crawler Filtering: Implement robust filtering mechanisms to remove automated access (bots, crawlers). This involves analyzing "User Agent" strings and identifying known bot patterns or IP addresses associated with systematic access.
- Define "Frequent Users": For each user identifier, count the number of full-text downloads within a specified time period (e.g., annually). Identify users whose download counts fall within the defined threshold (e.g., 100-1000 downloads/year, based on the paper's astronomy findings). This threshold may need empirical tuning for different disciplines or platforms.
- Geolocation: Map user IP addresses or associated organizational information to geographic entities (countries, institutes). This requires a reliable IP-to-location database or organizational affiliation data associated with user accounts.
- Publication Filtering: Filter publication data based on criteria relevant to the analysis, such as journal list (the paper focuses on "main astronomy journals"), publication year, and author affiliations.
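The filtering steps above can be sketched in plain Python. This is a minimal illustration, not the paper's actual pipeline: the log record format `(user_id, year, user_agent)` and the `BOT_MARKERS` list are assumptions, and production bot filtering would also use IP patterns and known-crawler lists.

```python
from collections import Counter

# Hypothetical log records: one (user_id, year, user_agent) tuple per
# full-text download. Real ADS click-stream logs are much richer.
BOT_MARKERS = ("bot", "crawler", "spider", "curl", "wget")

def is_bot(user_agent: str) -> bool:
    """Crude User-Agent screen; real filters also use IP-based heuristics."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def frequent_users(log, year, lo=100, hi=1000):
    """Return user IDs with lo..hi full-text downloads in `year`.

    The 100-1000 band is the paper's empirically derived threshold for
    ADS/astronomy; other platforms or disciplines would need their own tuning.
    """
    counts = Counter(
        uid for uid, yr, ua in log
        if yr == year and not is_bot(ua)
    )
    return {uid for uid, n in counts.items() if lo <= n <= hi}
```

Grouping by year before counting matters because the threshold is defined annually; a user may be "frequent" in one year and not another.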
A simplified data processing flow could look like this:
```
Usage Logs (clickstream)
  |
  +--- Filter out bots/crawlers (User Agent, IP patterns)
  |
  +--- Group downloads by User ID and Year
  |
  +--- Count annual downloads per User ID
  |
  +--- Identify "Frequent Users" (downloads per year within threshold)
  |
  +--- Geolocation (IP/Affiliation -> Country/Institute)
  |
  v
Frequent User Downloads (per Entity, per Year)

Publication Database
  |
  +--- Filter by Journal List, Year, Affiliation
  |
  v
Entity Publications (per Entity, per Year)
  |
  +--- Extract Cited References
  |
  v
Cited Publications (per Entity, per Year)
```
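The aggregation stage of this flow can be sketched as follows, under the simplifying assumption that each frequent user maps to a single entity (country or institute) via a hypothetical `user_entity` lookup built from geolocation or account affiliation data:

```python
from collections import defaultdict

def downloads_per_entity(downloads, user_entity):
    """Aggregate frequent-user downloads into an (entity, year) -> count map.

    `downloads`: iterable of (user_id, year, article_id) events that have
    already passed the bot filter and the frequent-user cut.
    `user_entity`: assumed user_id -> entity mapping (country or institute).
    """
    counts = defaultdict(int)
    for uid, year, _article in downloads:
        entity = user_entity.get(uid)
        if entity is not None:  # skip users we cannot geolocate/affiliate
            counts[(entity, year)] += 1
    return dict(counts)
```

The same pattern applies to the publication branch of the flow: group filtered publications by (affiliation, year) instead of (user, year).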
- Calculating Usage Metrics and Correlations:
- Entity-level Counts: For each entity (country or institute) and year, aggregate three counts: the number of frequent users, the number of publications (optionally restricted to first-authored papers), and the number of frequent user downloads.
- Correlation Analysis: Calculate correlations between these aggregated metrics over time for different entities. This can involve standard statistical methods like Pearson correlation. The paper shows strong correlations between frequent user counts, affiliated first author counts, and publication counts (Figure 4, Figure 6).
- Similarity Analysis (Overlap): For a given entity and year, create a set of unique identifiers for publications downloaded by frequent users associated with that entity. Create a set of unique identifiers for publications cited by authors affiliated with that entity in the same year. Calculate the overlap fraction (e.g., size of intersection / size of union, or size of intersection / size of cited set) between these two sets (Figure 8). Compare this overlap to what would be expected from random downloads (Figure 10).
- Obsolescence Pattern Comparison: Analyze the distribution of publication years for both downloaded and cited papers. Compare these distributions (e.g., normalized counts per year) to see if frequent user downloads follow a similar temporal pattern to citations (Figure 7).
- Correlation with Traditional Metrics: Calculate correlations between usage metrics (e.g., frequent user downloads) and traditional citation-based metrics (e.g., h-index calculated on publications from that entity) potentially with a time lag (Figure 9).
- Correlation with External Data: Integrate external data like socio-economic indicators (e.g., GDP per capita) and correlate them with usage metrics (Figure 11).
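Two of the core computations above, the download/citation overlap and the correlation between aggregated metrics, can be sketched with stdlib-only helpers. This is an illustrative sketch, not the paper's code; the Jaccard form (intersection over union) is one of the overlap variants mentioned, and `pearson` is a plain implementation of what `scipy.stats.pearsonr` would give you:

```python
def overlap_fraction(downloaded_ids, cited_ids):
    """Jaccard-style overlap between the set of papers downloaded by an
    entity's frequent users and the set cited by its authors in a year."""
    downloaded, cited = set(downloaded_ids), set(cited_ids)
    union = downloaded | cited
    return len(downloaded & cited) / len(union) if union else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. annual frequent-user counts vs. annual publication counts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Comparing the observed `overlap_fraction` against the value obtained from randomly sampled downloads (as in the paper's Figure 10) is what distinguishes genuine reading/citing alignment from chance overlap.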
- Implementation Considerations:
- Scalability: Processing massive click-stream logs and publication databases requires scalable infrastructure, potentially utilizing distributed computing frameworks (like Spark) or efficient database systems optimized for log analysis.
- Data Quality: The accuracy of results heavily depends on the quality of the input data, including the richness of usage logs, the accuracy of geolocation, and the completeness and correctness of publication metadata and affiliations. Careful data cleaning and deduplication are essential.
- Definition Portability: The specific definition of "frequent user" (100-1000 downloads/year) is derived from the ADS astronomy context. Applying this method to other fields or platforms would require re-validating these thresholds based on the usage patterns in that specific domain. A data-driven approach to clustering users based on usage patterns might be more generally applicable.
- Privacy: Handling detailed usage logs and user identifiers requires careful consideration of user privacy and compliance with data protection regulations.
- Deployment: The analysis system could be deployed as a batch processing pipeline that periodically updates metrics or as a more interactive system allowing on-demand analysis of specific entities or time periods.
In summary, the paper demonstrates a practical methodology for extracting meaningful research activity signals from potentially noisy usage data. By focusing on carefully filtered "frequent users" and leveraging comprehensive data from platforms like ADS, it's possible to establish correlations between reading behavior (downloads) and publishing behavior (publications, citations). Implementing this involves robust data acquisition and filtering pipelines, definition of user categories based on usage patterns, and various statistical analyses comparing usage and publication datasets at different levels of aggregation. While the specific thresholds and correlations found are tied to the astronomy domain, the overall approach of using filtered usage data as a proxy for research activity is generalizable, provided the data source and user behavior characteristics in the target domain are well understood. This offers a promising avenue for developing new, more temporally immediate indicators of research engagement compared to citation-based metrics which inherently have a time lag.