- The paper presents a two-phase framework that first determines institutional expertise using a multi-layered NLP module and weighted network analysis, then retrieves collaboration recommendations based on that expertise.
- The paper employs novel metrics, including x-index, x(g)-index, CRR, and CCSRR, to quantify thematic strengths derived from publication citations.
- The paper demonstrates its practical application by recommending collaborations that are evaluated for novelty, coverage, and diversity.
This paper (Institutional Collaboration Recommendation: An expertise-based framework using NLP and Network Analysis, 2022) presents a framework for institutional collaboration recommendation in academia, addressing a gap in existing research, which largely focuses on individual-level collaborations. The motivation stems from the global shift towards performance-based funding for research institutions, which requires them to enhance their research output and thematic strengths. The framework leverages NLP and network analysis techniques to identify the core and potential core competency areas of institutions and then recommends potential collaborators based on these identified strengths.
The framework is structured into two main sections: expertise determination and recommendation retrieval.
1. Expertise Determination Section
This section focuses on understanding the research strengths of institutions in different thematic areas.
- Data Collection: The process begins by collecting data from reputed online databases such as Web of Science (WoS), primarily author keywords (the DE field in WoS) and citation counts (the Z9 field). The case study uses data from 195 Indian institutions in the field of Computer Science over the period 2010-2019, restricted to institutions with at least 25 publications.
- Data Pre-processing using NLP Module: Author keywords are used to represent thematic areas, but raw keywords can be generic, too fine-grained, or have variations (plural/singular). A multi-layered NLP module is employed to address these issues:
- A Word2Vec model, trained on a domain-specific corpus (Semantic Scholar Open Research Corpus subset), is used to find semantically similar keywords. For each author keyword, the top 5 most similar keywords are found. The most frequent keyword among the original and its 5 similar keywords replaces the original keyword. This helps group fine-grained terms under more general thematic areas (e.g., 'Hierarchical Clustering' might be replaced by 'Clustering').
- A second layer uses Levenshtein distance to identify and replace plural forms with singular forms where the difference is only a trailing 's' (e.g., 'Algorithms' -> 'Algorithm'). More complex pluralizations are expected to be handled by the Word2Vec layer. The output is a set of NLP-processed author keywords, K(A).
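The two normalization layers can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `neighbors` mapping is a hypothetical stand-in for the top-5 similar keywords that a trained Word2Vec model would return, a plain trailing-'s' check stands in for the Levenshtein comparison (the two agree for this specific case), and requiring the singular form to already occur in the keyword set is an added assumption that avoids truncating words such as 'k-means'.

```python
from collections import Counter


def generalize(keywords, neighbors, top_n=5):
    """Layer 1: swap each keyword for the most frequent term among
    itself and its top_n most similar keywords."""
    freq = Counter(keywords)  # corpus-wide keyword frequencies
    result = []
    for kw in keywords:
        candidates = [kw] + list(neighbors.get(kw, []))[:top_n]
        # max() returns the first maximal item, so ties keep the original.
        result.append(max(candidates, key=lambda c: freq[c]))
    return result


def singularize(keywords):
    """Layer 2: drop a trailing 's' when the shorter form is also a keyword
    (assumption: the singular must already exist in the keyword set)."""
    vocab = set(keywords)
    return [
        kw[:-1] if kw.endswith("s") and kw[:-1] in vocab else kw
        for kw in keywords
    ]


# Hypothetical data: lower-cased author keywords and precomputed neighbours.
raw = ["hierarchical clustering", "clustering", "clustering",
       "algorithms", "algorithm", "k-means"]
neighbors = {"hierarchical clustering": ["clustering", "k-means"]}

processed = singularize(generalize(raw, neighbors))
print(processed)
```

In this toy input, 'hierarchical clustering' collapses to the more frequent 'clustering' and 'algorithms' becomes 'algorithm', while 'k-means' is left alone because 'k-mean' never occurs as a keyword.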
- Network Creation and Analysis Module: To map publications to thematic areas and measure institutional strength, a publication-thematic area affiliation network is created.
- A bipartite network, W-K(A), is formed with publications (W) as the first mode and processed keywords/thematic areas (K(A)) as the second mode. Links exist from a publication to the keywords associated with it.
- This network is then weighted, W-K*(A), by injecting an attribute of the first mode vertices (publications). Total citation count is used as the attribute for 'injection'. The weight of a link between a publication w and a keyword k(a) is the citation count of w.
- Thematic strength of an institution in a specific thematic area is computed as the weighted indegree of that thematic area node in the W-K*(A) network for publications authored by that institution. This is essentially the sum of citations accumulated by an institution's publications tagged with that thematic area.
- Example: Suppose an institution has publications w1 (2 citations), w2 (5 citations), and w3 (10 citations), where w1 is tagged with t1 and t2, w2 with t1 and t3, and w3 with t2, t3, and t4. The thematic strength for t1 is then $2+5=7$ (illustrated in Figure 3).
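The weighted-indegree computation can be reproduced with a short sketch. It assumes each publication is given as a hypothetical `(citation_count, thematic_areas)` pair for a single institution, and that a thematic area repeated within one publication is counted once (an assumption, since the summary does not say how duplicates after keyword normalization are handled).

```python
from collections import defaultdict


def thematic_strengths(publications):
    """Sum citation counts per thematic area, i.e. the weighted indegree
    of each area node in the citation-weighted bipartite network."""
    strength = defaultdict(int)
    for citations, areas in publications:
        # dict.fromkeys removes duplicate areas while keeping their order.
        for area in dict.fromkeys(areas):
            strength[area] += citations
    return dict(strength)


# The worked example from the text: w1, w2, w3 with 2, 5 and 10 citations.
pubs = [
    (2, ["t1", "t2"]),
    (5, ["t1", "t3"]),
    (10, ["t2", "t3", "t4"]),
]
strengths = thematic_strengths(pubs)
print(strengths)  # {'t1': 7, 't2': 12, 't3': 15, 't4': 10}
```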
- Determination of Areas of Expertise: Expertise indices are used to identify core and potential core competency areas based on thematic strengths.
- Thematic strengths for all keywords associated with an institution are sorted in decreasing order.
- The x-index is defined, inspired by the h-index. An institution has an x-index of x if it has published papers in at least x thematic areas with a thematic strength of at least x in those areas. The top x areas form the x-core, treated as core competency areas. Mathematically, $x = \max\{r : s_r \ge r\}$, where $s_r$ is the thematic strength at position $r$ of the sorted list.
- The x(g)-index is defined, inspired by the g-index. An institution has an x(g)-index of x(g) if it has published papers in at least x(g) thematic areas such that the average strength in these areas is at least x(g). This is equivalent to $\sum_{i=1}^{x(g)} s_i \ge x(g)^2$, where $s_i$ is the thematic strength at position $i$ of the sorted list. The top x(g) areas form the x(g)-core. Mathematically, $x(g) = \max\{r : \sum_{i=1}^{r} s_i \ge r^2\}$.
- Two ratios are defined based on the sorted list of thematic strengths:
- Citation to Rank Ratio (CRR): $\mathrm{CRR}_r = \frac{s_r}{r}$, where $s_r$ is the thematic strength at position $r$ of the sorted list.
- Cumulative Citations to Squared Rank Ratio (CCSRR): $\mathrm{CCSRR}_r = \frac{\sum_{i=1}^{r} s_i}{r^2}$
- Thematic areas with $\mathrm{CRR}_r \ge 1$ are considered core competency areas.
- Thematic areas with $\mathrm{CCSRR}_r \ge 1$ that are not already core competency areas are considered potential core competency areas.
- For BHU in the case study, the x-index was 41 (41 core areas) and the x(g)-index was 54, leaving $54 - 41 = 13$ potential core areas.
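The indices and ratios above can be computed in one pass over the sorted strengths. The sketch below is an illustration under stated assumptions rather than the authors' code: the function name `expertise_profile`, the returned dictionary layout, and the example strengths are all hypothetical.

```python
def expertise_profile(strengths):
    """Compute the x-index, x(g)-index, core areas and potential core
    areas from a mapping of thematic area -> thematic strength."""
    ranked = sorted(strengths.items(), key=lambda kv: kv[1], reverse=True)
    x = xg = 0
    cumulative = 0
    core, potential = [], []
    for r, (area, s) in enumerate(ranked, start=1):
        cumulative += s
        crr = s / r                   # Citation to Rank Ratio
        ccsrr = cumulative / r ** 2   # Cumulative Citations to Squared Rank Ratio
        if crr >= 1:
            x = r                     # largest rank with strength >= rank
            core.append(area)
        elif ccsrr >= 1:
            potential.append(area)    # in the x(g)-core but not the x-core
        if ccsrr >= 1:
            xg = r                    # largest rank with cumulative sum >= rank^2
    return {"x": x, "xg": xg, "core": core, "potential": potential}


# Hypothetical strengths for one institution.
profile = expertise_profile({"a": 9, "b": 4, "c": 3, "d": 2, "e": 1})
print(profile)
# {'x': 3, 'xg': 4, 'core': ['a', 'b', 'c'], 'potential': ['d']}
```

Because the strengths are sorted in decreasing order, both ratios are non-increasing in the rank, so the areas satisfying each threshold always form a prefix of the list and the potential core areas are exactly the ranks between x and x(g).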
2. Recommendation Retrieval Section
This section utilizes the identified expertise to generate collaboration recommendations based on two strategies.
- Determination of Unique Thematic Areas: A unique set of all processed keywords/thematic areas (K′) across all institutions is identified.
- Creation of Thematic Area-Institution Matrices: Three matrices of size m×n (where m is the number of unique thematic areas with non-zero strength for at least one institution, and n is the number of institutions) are created:
- Citation weighted T-I matrix: Contains the thematic strength (total citations) of each institution for each thematic area.
- CRR weighted T-I matrix: Contains the CRR value of each institution for each thematic area.
- CCSRR weighted T-I matrix: Contains the CCSRR value of each institution for each thematic area.
- Rows corresponding to thematic areas with zero strength across all institutions are removed.
- Execution of Strategies:
- Strategy 1 (Enhance Core Competency): For a given institution i, and one of its core competency thematic areas t (where $\mathrm{CRR}_{t,i} \ge 1$):
- Identify institutions j that also have t as a core competency area ($\mathrm{CRR}_{t,j} \ge 1$). Let this set be S.
- From set S (excluding institution i), recommend institutions j whose thematic strength in t ($\mathrm{Citation}_{t,j}$) is greater than or equal to δ times the thematic strength of institution i in t ($\mathrm{Citation}_{t,i}$), where δ is a threshold (e.g., 0.75). This recommends institutions strong in the same core area.
- Example (BHU, 'machine learning', δ=0.75): BHU's strength is 60, so recommended institutions must have a strength of at least $0.75 \times 60 = 45$. 25 institutions satisfied this criterion (Table 2).
- Strategy 2 (Complement Potential Core Competency): For a given institution i, and one of its potential core competency thematic areas u (where $\mathrm{CCSRR}_{u,i} \ge 1$ and $\mathrm{CRR}_{u,i} < 1$):
- Identify institutions j that are core competent in u ($\mathrm{CRR}_{u,j} \ge 1$). Let this set be Y. These are High Priority recommendations. All institutions in Y (excluding i) are recommended.
- Identify institutions k that are potentially core competent in u ($\mathrm{CCSRR}_{u,k} \ge 1$) but not core competent ($\mathrm{CRR}_{u,k} < 1$). Let this set be Z. These are potential Low Priority recommendations.
- From set Z (excluding institution i), recommend institutions k whose CCSRR value for u ($\mathrm{CCSRR}_{u,k}$) is greater than or equal to the CCSRR value of institution i for u ($\mathrm{CCSRR}_{u,i}$). This recommends institutions potentially strong in the area who are stronger than or similar to the institution of interest in that potential area.
- Example (BHU, 'PSO'): The high-priority set contains 46 institutions with $\mathrm{CRR} \ge 1$ for 'PSO' (Table 3). The low-priority candidates are institutions with $\mathrm{CCSRR} \ge 1$ and $\mathrm{CRR} < 1$ for 'PSO', of which those with a CCSRR at least equal to BHU's are recommended. No institution met this low-priority criterion for BHU and 'PSO'; the paper instead illustrates the low-priority case with 'fuzzy sets' for South Asian University.
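Both strategies reduce to simple filters over the three thematic area-institution matrices. The sketch below stores the matrices as a nested dictionary, `profiles[institution][area]`, holding `strength`, `crr` and `ccsrr`; this layout, the function names, and all institution names and values are hypothetical and chosen only to exercise the rules.

```python
def strategy_1(profiles, inst, area, delta=0.75):
    """Institutions that are core competent in `area` and whose strength
    is at least delta times that of `inst`."""
    own = profiles[inst][area]
    if own["crr"] < 1:  # the strategy applies only to core areas of inst
        return []
    return sorted(
        j for j, areas in profiles.items()
        if j != inst and area in areas
        and areas[area]["crr"] >= 1
        and areas[area]["strength"] >= delta * own["strength"]
    )


def strategy_2(profiles, inst, area):
    """Return (high_priority, low_priority) recommendations for a
    potential core competency area of `inst`."""
    own = profiles[inst][area]
    if not (own["ccsrr"] >= 1 and own["crr"] < 1):
        return [], []
    high, low = [], []
    for j, areas in profiles.items():
        if j == inst or area not in areas:
            continue
        m = areas[area]
        if m["crr"] >= 1:
            high.append(j)  # set Y: core competent in the area
        elif m["ccsrr"] >= 1 and m["ccsrr"] >= own["ccsrr"]:
            low.append(j)   # set Z, filtered by CCSRR
    return sorted(high), sorted(low)


# Made-up values for four institutions A-D.
profiles = {
    "A": {"ml": {"strength": 60, "crr": 2.0, "ccsrr": 3.0},
          "pso": {"strength": 5, "crr": 0.5, "ccsrr": 1.2}},
    "B": {"ml": {"strength": 50, "crr": 1.5, "ccsrr": 2.0},
          "pso": {"strength": 30, "crr": 1.4, "ccsrr": 2.0}},
    "C": {"ml": {"strength": 40, "crr": 1.2, "ccsrr": 1.8},
          "pso": {"strength": 8, "crr": 0.7, "ccsrr": 1.5}},
    "D": {"ml": {"strength": 100, "crr": 0.8, "ccsrr": 1.1},
          "pso": {"strength": 4, "crr": 0.6, "ccsrr": 1.1}},
}

print(strategy_1(profiles, "A", "ml"))   # ['B']
print(strategy_2(profiles, "A", "pso"))  # (['B'], ['C'])
```

In the toy data, C is excluded from Strategy 1 because its strength of 40 is below $0.75 \times 60 = 45$, and D is excluded because 'ml' is not one of its core areas, regardless of its raw strength.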
Evaluation:
The framework's performance was evaluated using standard metrics for recommendation systems: Novelty, Coverage, and Diversity. Accuracy (Precision/Recall) could not be evaluated as there was no ground truth data on successful past collaborations or future collaboration outcomes.
- Novelty: Assessed the system's ability to recommend institutions not frequently recommended. Measured using a Novelty index based on the frequency distribution of recommended institutions across all recommendations. Lower index values indicate higher novelty. The system showed low novelty index values (close to 0), indicating high novelty in recommendations.
- Coverage: Assessed the distinctiveness of recommendations. Intra-set coverage (within each of the three recommendation sets: Strategy 1, Strategy 2 high priority, and Strategy 2 low priority) was measured using the Gini index (GI), where lower values mean higher distinction. Inter-set coverage (distinctness between the three sets) was measured using Jaccard dissimilarity (JD), where higher values mean higher distinction. The intra-set coverage was deemed satisfactory (GI < 0.5), and inter-set coverage was also considered satisfactory (JD > 0.5).
- Diversity: Assessed the distribution of recommended institutions across different groups. Institutions were grouped based on their x-index (four groups: >80, 60-80, 40-60, <40). Diversity was measured using Shannon entropy, and evenness of distribution using Shannon equitability index. Higher values (closer to 1) indicate higher diversity/evenness. The system showed high diversity and evenness across all three recommendation sets.
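For reference, the coverage and diversity measures can be sketched with their common textbook definitions. The summary does not reproduce the paper's exact formulas, so the forms below are assumptions (in particular the Gini normalization and the use of natural logarithms), and the Novelty index is omitted because its definition is not given here.

```python
import math


def gini_index(counts):
    """Gini index of recommendation counts; 0 means a perfectly even spread."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * k - n - 1) * x for k, x in enumerate(xs, start=1))
    return weighted / (n * total)


def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 1 means the two sets share nothing."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a list of group counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def shannon_equitability(counts):
    """Entropy divided by its maximum, ln(number of groups)."""
    k = len(counts)
    return shannon_entropy(counts) / math.log(k) if k > 1 else 0.0


# Hypothetical counts: recommendations per institution and per x-index group.
print(gini_index([1, 1, 1, 1]))                      # 0.0
print(gini_index([0, 0, 0, 4]))                      # 0.75
print(jaccard_dissimilarity({"A", "B"}, {"B", "C"}))
print(shannon_equitability([5, 5, 5, 5]))
```

With four x-index groups, as in the paper's grouping, an equitability near 1 corresponds to recommendations being spread almost evenly over the groups.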
Implementation Considerations and Applications:
Implementing this framework requires:
- Access to a comprehensive publication database like WoS with author keywords and citation data.
- Developing or utilizing a robust NLP pipeline, including training a domain-specific word embedding model (Word2Vec) and implementing string similarity checks (Levenshtein distance).
- Implementing network creation and analysis tools capable of building weighted bipartite networks and computing weighted node degrees.
- Developing modules to compute the x-index, x(g)-index, CRR, and CCSRR based on sorted thematic strengths.
- Building matrix representation of thematic area-institution relationships and implementing the logic for Strategy 1 and Strategy 2 recommendations, including filtering based on thematic strength and CCSRR thresholds.
- Considering scalability for large datasets (hundreds/thousands of institutions, millions of publications, tens of thousands of keywords). Matrix operations can become computationally intensive.
The paper highlights potential applications for various stakeholders:
- National Policymakers: Identifying expert institutions in 'thrust areas' for funding and fostering networks of excellence.
- Institutional Policymakers: Identifying suitable partners to enhance core strengths or develop potential strengths.
- Funding Agencies: Assessing institutional strength in specific thematic areas for performance-based funding decisions.
The framework provides a data-driven approach to identify complementary or reinforcing expertise, offering a structured way for institutions to seek collaborations that align with their strategic goals for performance improvement.