A Hierarchical Classification System of Research Fields to Understand Research Interconnectedness

Today’s challenges in scientometrics include an overemphasis on existing metrics and the need for improved interdisciplinary indices [1,2,3,5,8]. An interdisciplinary project team based at ETH Zurich, co-led by Prof. Peter Egger (economist) and Prof. Ce Zhang (computer scientist), addressed this issue by developing a hierarchical classification system that categorizes publications into disciplines, fields, and subfields based on their abstracts, which improves the classification of interdisciplinary research [6]. The system helps to understand research interconnectedness by laying the groundwork for delineating discipline and field boundaries [4] and for analyzing how disciplines and fields connect [8]. This blog post presents one of our research projects and its findings.

The classification system presented in Fig. 1 enables a systematic categorization of research activities into the hierarchical tiers of discipline, field, and subfield. We look at knowledge production through publications and at impact through citations, and we permit each publication to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields, and 1,485 subfields among some 160 million abstract snippets in the bibliometric database Microsoft Academic Graph (MAG, Version 2018-05-17). We used batch training in a modularized and distributed fashion to allow for interdisciplinary and interfield classifications. In total, we conducted over 3,000 experiments to assign disciplines, fields, and subfields to each publication. The classification accuracy is higher than 90% in more than 75% of these experiments.
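To make the multi-category idea concrete, here is a minimal sketch of how a publication carrying several discipline-field-subfield labels could be represented. It is an illustration only, not the data model used in [6]: the class names `Label` and `Publication`, the method names, and the field and subfield names in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Label:
    """One path through the discipline -> field -> subfield hierarchy."""
    discipline: str
    field: str
    subfield: str


@dataclass
class Publication:
    """A publication that may carry several labels (multi-label assignment)."""
    title: str
    labels: List[Label]

    def disciplines(self) -> Set[str]:
        # Collect the distinct top-level disciplines across all labels.
        return {label.discipline for label in self.labels}

    def is_interdisciplinary(self) -> bool:
        # Assumption for this sketch: a publication counts as
        # interdisciplinary when its labels span more than one discipline.
        return len(self.disciplines()) > 1


# Hypothetical example; the field and subfield names are made up.
pub = Publication(
    title="Hypothetical paper",
    labels=[
        Label("Computer science", "Machine learning", "Text classification"),
        Label("Economics", "Applied economics", "Economics of science"),
    ],
)
print(sorted(pub.disciplines()), pub.is_interdisciplinary())
```

Keeping each label as a full path through the hierarchy means the same publication can be counted at any of the three tiers without extra lookups.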

Illustration of the hierarchical classification system
Figure 1: Hierarchical classification system with three levels: Discipline, field and subfield

Our hierarchical classification system described above is valuable since it can identify the interdisciplinary exchange of ideas between fields and disciplines, assess the concentration of such interdisciplinarity, and determine whether it stems from authors with cross-disciplinary backgrounds. Additionally, by assigning researchers to disciplines, the framework can aid in understanding the likelihood of interdisciplinary impact in collaborations, which can inform the selection of instruments to promote interdisciplinary research and its impact.

As an example, we show how the hierarchical classification system makes it possible to identify whether a discipline mainly influences another discipline or absorbs from it. In Fig. 2 we present the net output, obtained by subtracting citation input (the number of citations appearing in the bibliography of each article, classified by discipline, field, and subfield) from citation output (the number of articles that cite each article, classified by discipline, field, and subfield), for each pair in a set of 44 disciplines, each discipline being associated with a number between 0 and 43. The colors correspond to the value of net output for that discipline pair. We read the heatmap row by row. Higher values (red) correspond to a positive net output from the discipline in the row to the discipline in the column. Lower values (blue) correspond to a negative net output. Note that the gap between the dark red cells and the dark blue cells corresponds to a difference in citations on the order of 1 million.

An inspection of this figure suggests that certain disciplines have a positive net output, which is indicated by the red rows, e.g., “Computer science” (0), “Economics” (1), “Pure mathematics” (2), “Space science” (27), “Chemistry” (34) and “Biology” (43). A positive net output means that a discipline influences other disciplines more than it absorbs from them. The red cells in column 4 (“Agriculture”) and column 15 (“Human physical performance and recreation”) show similar patterns across the disciplines. This indicates that they require similar inputs from disciplines such as “Computer science” (0), “Pure mathematics” (2), “Chemistry” (34), “Engineering” (35) and “Biology” (43). Conversely, a negative net output means that a discipline absorbs more from other disciplines than it influences them.
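The net-output calculation can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline behind Fig. 2: the function names `net_output` and `row_totals` are hypothetical, the 3x3 counts are invented, and we assume a pairwise citation matrix where `citations[i][j]` counts citations that articles in discipline `j` make to articles in discipline `i`.

```python
from typing import List


def net_output(citations: List[List[int]]) -> List[List[int]]:
    """Pairwise net output between disciplines.

    Assumed convention: citations[i][j] is the number of citations that
    articles in discipline j make to articles in discipline i.
    Citation output of i towards j is citations[i][j]; citation input of
    i from j is citations[j][i]. Net output is output minus input.
    """
    n = len(citations)
    return [
        [citations[i][j] - citations[j][i] for j in range(n)]
        for i in range(n)
    ]


def row_totals(matrix: List[List[int]]) -> List[int]:
    """Sum each row: a positive total means the row discipline
    influences the others more than it absorbs from them."""
    return [sum(row) for row in matrix]


# Invented counts for three disciplines (rows are cited, columns are citing).
toy_citations = [
    [0, 5, 2],
    [1, 0, 4],
    [3, 0, 0],
]

net = net_output(toy_citations)
print(net)              # [[0, 4, -1], [-4, 0, 4], [1, -4, 0]]
print(row_totals(net))  # [3, 0, -3]
```

Under this convention the matrix is antisymmetric: a positive (red) cell for one discipline pair is mirrored by a negative (blue) cell of the same size for the reversed pair.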

Heatmap representing net output across 44 disciplines
Figure 2: Net output across 44 disciplines

Showing the input-output relationships between the disciplines is only one downstream analytical product enabled by our hierarchical classification system. Our system also enables a more fine-grained analysis at the field level, in particular between fields that do not belong to the same discipline. These types of analysis give us a holistic view of how collaboration across domains operates. To summarize, our hierarchical classification system can support objective scientometric evaluations by providing a more accurate and transparent way to categorize research activities. It can provide a more detailed understanding of the knowledge production and impact of research activities. In addition, we have designed an annotation and inference engine, SAINE [7], to support understanding and fine-tuning the models developed in [6]. We welcome collaboration and feedback from the scientific community on these projects.

References

[1] Cassi, Lorenzo, Raphael Champeimont, Wilfriedo Mescheba, and Elisabeth De Turckheim. “Analysing institutions interdisciplinarity by extensive use of Rao-Stirling diversity index.” PLoS One 12, no. 1 (2017): e0170296.

[2] Clarivate Analytics. “Interdisciplinarity index”, http://help.prod-incites.com/incitesLiveESI/4274-TRS.html, last accessed Sep. 27, 2023.

[3] Leydesdorff, Loet, Caroline S. Wagner, and Lutz Bornmann. “Interdisciplinarity as diversity in citation patterns among journals: Rao-Stirling diversity, relative variety, and the Gini coefficient.” Journal of Informetrics 13, no. 1 (2019): 255-269.

[4] Leydesdorff, Loet, and Ismael Rafols. “A global map of science based on the ISI subject categories.” Journal of the American Society for Information Science and Technology 60, no. 2 (2009): 348-362.

[5] Okamura, Keisuke. “Interdisciplinarity revisited: evidence for research impact and dynamism.” Palgrave Communications 5, no. 1 (2019).

[6] Rao, Susie Xi, Peter H. Egger, and Ce Zhang. “Hierarchical classification of research fields in the ‘web of science’ using deep learning.” Under review in Quantitative Science Studies (QSS), 2023.

[7] Rao, Susie Xi, Yilei Tu, and Peter H. Egger. “SAINE: scientific annotation and inference engine of scientific research.” IJCNLP-AACL 2023, System Demonstrations.

[8] Van Noorden, Richard. “Interdisciplinary research by the numbers.” Nature 525, no. 7569 (2015): 306-307.


Dr Susie Xi Rao

Dr Susie Xi Rao is a senior scientific staff member at the Chair of Applied Economics at ETHZ, with research interests in dynamic network analysis, information extraction, and natural language processing. She has worked on projects in pattern matching and record linkage, legal text analysis, and fraud detection. Currently, she mainly works on projects involving the hierarchical classification of research fields and the understanding of discipline connectedness. In addition, since Nov. 2022 she has been a Guest Lecturer at Lucerne University of Applied Sciences and Arts, leading lab exercises for the module "Fraud Detection".
