An initial investigation into the visibility of Swiss Higher Education Institutions (HEIs) in commercial bibliometric databases was published here in April 2024. It highlighted the importance of (1) metadata quality, (2) institutional identifiers, and (3) language disambiguation. These findings revealed significant differences between the platforms Dimensions, Scopus and Web of Science (WoS).
This blog post presents additional results and lays the groundwork for assessing the findability of Swiss HEIs in two open bibliometric databases, OpenAlex and OpenAIRE. We explore how open databases compare to their commercial counterparts, focusing on the findability of Swiss HEIs and on the publication counts associated with them.
Commercial databases were evaluated through their respective web applications, while open databases were explored through data dumps from March 2024.
Findability of Institutions in Commercial and Open Databases
As outlined in our previous blog post [1], we focus on 52 institutions listed by swissuniversities [2] as Accredited Swiss Higher Education Institutions in accordance with the Higher Education Funding and Coordination Act in February 2024 [3].
The first step was to find the ROR (Research Organization Registry) identifier of each institution and to check whether this ROR ID is findable in OpenAlex and in OpenAIRE. In OpenAlex, our approach involved inspecting authorships to determine the ROR ID of the authors’ institution. In OpenAIRE, we used the ROR ID in a different way: we began by searching for the ROR ID in the “PID” field. If the ROR ID could not be found, a manual search was conducted using the institution’s name and its variations to ensure comprehensive coverage. In the interest of comparability with the analysis of commercial databases, we adopted the classification system introduced by Dimensions for the definition of child and related institutions of universities, even though this definition does not always correspond to the one used in OpenAlex and OpenAIRE.
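The two-step lookup described for OpenAIRE (identifier first, name variants as a fallback) can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the record layout (dicts with `pids` and `names` keys), the function name `find_organisation`, and the sample records are all assumptions made for this example, and the real data dump is structured differently.

```python
def find_organisation(records, ror_id, name_variants):
    """Return (record, how) for the first match, or (None, "not found").

    `records` is a hypothetical list of dicts with "pids" and "names" keys.
    """
    # Step 1: exact match on the ROR ID among the persistent identifiers.
    if ror_id:
        for rec in records:
            if ror_id in rec.get("pids", []):
                return rec, "ror"
    # Step 2: fall back to a case-insensitive match on known name variants.
    wanted = {name.casefold() for name in name_variants}
    for rec in records:
        if any(name.casefold() in wanted for name in rec.get("names", [])):
            return rec, "name"
    return None, "not found"


# Made-up sample records (IDs and names are placeholders, not real data).
sample = [
    {"id": "org-1", "pids": ["https://ror.org/000000000"],
     "names": ["Example University"]},
    {"id": "org-2", "pids": [],
     "names": ["Beispiel Hochschule", "Example College"]},
]

by_ror = find_organisation(sample, "https://ror.org/000000000", ["Example University"])
by_name = find_organisation(sample, None, ["example college"])
missing = find_organisation(sample, None, ["Unknown Institute"])
```

Returning how a match was made (`"ror"` or `"name"`) keeps the manually matched institutions easy to separate from those found through their identifier.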
In OpenAlex we were able to identify 46 institutions when we searched using ROR IDs (see Figure 1). The remaining six institutions that could not be found in OpenAlex (nor in ROR) as of April 2024 are:
- Hochschulinstitut Schaffhausen (HSSH)
- Pädagogische Hochschule Nordwestschweiz (PH FHNW)
- Swiss Business School (SBS)
- Swiss UMEF (SUMEF)
- SUPSI – Dipartimento formazione e apprendimento (SUPSI-DFA)
- Schweizerisches universitäres Institut für traditionelle chinesische Medizin (TCMUNI)
OpenAIRE organises institutions into approved and pending categories. A total of 43 institutions were found with approved status (they appear under the “openorgs” prefix). This number rises to 46 when institutions with pending approval are included (they appear with the prefix “pending_org”). Two of the three institutions with pending status have a ROR ID, namely Pädagogische Hochschule Graubünden (PHGR) and Pädagogische Hochschule Wallis (PHVS). However, six institutions could not be identified in OpenAIRE. Only one of these has a ROR ID:
- Stiftung Universitäre Fernstudien Schweiz, Brig (SUFS)
The remaining five have no ROR ID:
- Hochschulinstitut Schaffhausen (HSSH)
- Swiss Business School (SBS)
- Swiss UMEF (SUMEF)
- SUPSI – Dipartimento formazione e apprendimento (SUPSI-DFA)
- Schweizerisches universitäres Institut für traditionelle chinesische Medizin (TCMUNI)
The counts of identified institutions are shown in Figure 1, along with our findings for the commercial databases [1]. The differences in findability are remarkable. The open databases and Dimensions have the highest counts, with 46 identified institutions each, whereas the more restrictive databases identify only 17 (Scopus) and 26 (WoS) institutions.
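To put these counts in proportion, the short sketch below converts them into shares of the 52 listed institutions and compares the two lists of institutions not found in OpenAlex and OpenAIRE. The numbers and abbreviations are taken from the text above; the variable names are chosen for this example only.

```python
TOTAL = 52  # institutions on the swissuniversities list

# Identified institutions per database, as reported above.
identified = {"OpenAlex": 46, "OpenAIRE": 46, "Dimensions": 46, "WoS": 26, "Scopus": 17}
shares = {db: round(100 * n / TOTAL, 1) for db, n in identified.items()}

# Institutions that could not be found, by abbreviation.
missing_openalex = {"HSSH", "PH FHNW", "SBS", "SUMEF", "SUPSI-DFA", "TCMUNI"}
missing_openaire = {"SUFS", "HSSH", "SBS", "SUMEF", "SUPSI-DFA", "TCMUNI"}

missing_in_both = missing_openalex & missing_openaire
only_openalex = missing_openalex - missing_openaire
only_openaire = missing_openaire - missing_openalex

for db, pct in shares.items():
    print(f"{db}: {identified[db]}/{TOTAL} = {pct}%")
print("missing in both:", sorted(missing_in_both))
print("missing only in OpenAlex:", sorted(only_openalex))
print("missing only in OpenAIRE:", sorted(only_openaire))
```

The set comparison makes explicit that equal totals do not imply identical coverage: five institutions are missing from both open databases, while PH FHNW is missing only from OpenAlex and SUFS only from OpenAIRE.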
Figure 1.
A comprehensive overview of the institutional visibility across different bibliometric databases is provided in the corresponding PDF file detailing which HEIs were identified in each database.
A significant challenge that impacts the findability and consistency of institutional data is the way records are generated. For instance, various names for the same institution exist across different databases. A compelling example is OpenAIRE, where identical Swiss institutions are recorded under several different names, and often with unique identifiers. This lack of uniformity can lead to fragmentation in the way institutions are represented and found in the database.
However, in OpenAIRE, there is often only one identifier that is approved, while the others are pending. This ensures that each institution is associated with one unique approved identifier, even if it appears under several names with different IDs. The presence of pending identifiers can therefore be beneficial, as they highlight areas where the available data could be improved. Allowing variants in OpenAIRE could also raise awareness of the importance of high-quality metadata and encourage users to work towards improving the data. To give one example, we illustrate this for the University of Basel below:
Approved Name:
- openorgs____::a5124687d06ee9348a73a7dcfba96ec7 (‘University of Basel’)
Pending Variants:
- pending_org_::00e97cf9dfb17d03db98d2db2cc583e7 (‘Basel university’)
- pending_org_::2bbb9a361d1ed5099f752126a70d1dc4 (‘UNI: Basel Universität Basel CH’)
- pending_org_::572545ae002a4fd7d630ecd95438d1df (‘Universitätsspital Universität Basel’)
- pending_org_::8aa7f6b1353028eeebb04e8c9c52c7c4 (‘WWZ Uni Basel Universität Basel’)
- pending_org_::a5827f0cdb16445e884256da6f9bc7cd (‘Abteilung Wirtschaft und Politik Wirtschaftswissenschaftliche Fakultät (WWZ) Universität Basel’)
- pending_org_::466cfe34d8a54740f6f3939ada3eab4c (‘Abteilung Wirtschaftstheorie Wirtschaftswissenschaftliche Fakultät Universität Basel’)
- …
Fragmentation like this can lead to confusion and inefficiency. Analysts searching for all relevant information on the University of Basel might miss critical data because it is scattered across multiple records. However, the presence of multiple identifiers in OpenAIRE, including pending ones, also offers analysts an opportunity to access more comprehensive data by enabling searches across records that might not appear under the approved identifier. By contrast, such extensive searches can be much more challenging on platforms such as OpenAlex.
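As a small illustration of how approved and pending records can be told apart, the sketch below groups some of the University of Basel records listed above by the prefix of their OpenAIRE identifier. The identifiers and names are copied from that list; the helper `status_of` and the status labels are assumptions made for this example, not part of any OpenAIRE tooling.

```python
from collections import defaultdict

# Identifiers and names copied from the University of Basel example above.
records = [
    ("openorgs____::a5124687d06ee9348a73a7dcfba96ec7", "University of Basel"),
    ("pending_org_::00e97cf9dfb17d03db98d2db2cc583e7", "Basel university"),
    ("pending_org_::2bbb9a361d1ed5099f752126a70d1dc4", "UNI: Basel Universität Basel CH"),
    ("pending_org_::572545ae002a4fd7d630ecd95438d1df", "Universitätsspital Universität Basel"),
]


def status_of(org_id):
    """Map the identifier prefix (the part before '::') to a status label."""
    prefix = org_id.split("::", 1)[0].rstrip("_")
    return {"openorgs": "approved", "pending_org": "pending"}.get(prefix, "other")


# Collect the recorded names under each status.
by_status = defaultdict(list)
for org_id, name in records:
    by_status[status_of(org_id)].append(name)

for status, names in by_status.items():
    print(status, len(names), names)
```

An analyst who wants the broader view would query all identifiers in both groups, whereas a strict count would use only the single approved one.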
A Collaborative Approach to Data Curation
The discrepancies in organizational identifiers have already been outlined. For emphasis, OpenAlex lists one University of Basel (via its ROR ID), while OpenAIRE provides access to one main identifier (“openorgs”), usually linked to a ROR ID, as well as further identifiers (“pending_org”). The most crucial difference is the absence or presence of alternative identifiers, which highlights the different approaches and choices made by each data provider. Each approach comes with its own advantages.
One core benefit of open bibliometric databases is the potential for the community to significantly improve the quality of organizational identification. For instance, in OpenAIRE, community members can actively participate in identifying duplicates once they have registered and participated in a training session. Over 200 entries associated with ETH Zurich have already been deduplicated in OpenOrgs. In OpenAIRE, the deduplication process by a curator works as follows:
- Institutions can be listed under different identifiers, including “openorgs” and “pending_org”. For example, there is one ETH Zurich in OpenAlex (represented by a single ROR ID), but many identifiers for ETH Zurich in OpenAIRE, such as:
- openorgs____::fb1e14f93f04d43e1a10a9f17d12c669 (‘EPFZ’)
- pending_org_::889b0ea359235ee890f576b3a18904f8 (‘ETH Zuerich’)
Unlike closed databases, the community-driven nature of ROR IDs and OpenOrgs allows curators to improve the data quality continuously, reducing the need for repeated curation efforts. Consequently, we expect that the next data dump will already reflect better quality for ETH Zurich, because previous duplicate entries have been resolved.
The proportion of HEIs found in OpenAlex is identical to the proportion found in OpenAIRE. This suggests that even though identification issues may persist, they are not unique to one database. By addressing these challenges, we can enhance our understanding of the underlying data, and find ways to increase the accuracy and reliability of bibliometric analyses.
Publication Counts
The counts of publications from Swiss HEIs that we can access without any curation can differ massively across the considered databases. To associate publications with their corresponding institutions, we used the ROR ID (as well as approved identifiers in the case of OpenAIRE) as the key identifier. Additionally, we only included institutions from OpenAIRE that were verified and had an “openorgs” identifier.
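The counting step can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the pipeline used for the figures: the record layout (dicts with `year` and `rors` keys), the function name, and the placeholder identifiers `ror:A`, `ror:B`, and `ror:C` are hypothetical, and the default year range simply mirrors the 2012 to 2022 period shown in the figures.

```python
from collections import Counter


def count_by_institution_and_year(works, ror_ids, first_year=2012, last_year=2022):
    """Count works per (ROR ID, year) for the institutions of interest."""
    wanted = set(ror_ids)
    counts = Counter()
    for work in works:
        year = work["year"]
        if not first_year <= year <= last_year:
            continue
        # Count each institution at most once per work,
        # even if several of its authors share the affiliation.
        for ror in set(work["rors"]) & wanted:
            counts[(ror, year)] += 1
    return counts


# Made-up works with placeholder identifiers (not real ROR IDs).
works = [
    {"id": "w1", "year": 2015, "rors": ["ror:A", "ror:A", "ror:B"]},
    {"id": "w2", "year": 2015, "rors": ["ror:A"]},
    {"id": "w3", "year": 2023, "rors": ["ror:A"]},  # outside the year range
    {"id": "w4", "year": 2020, "rors": ["ror:C"]},  # institution not of interest
]

counts = count_by_institution_and_year(works, {"ror:A", "ror:B"})
```

Whether a co-authored work counts once per institution, as here, or fractionally is a design choice; different choices are one of several reasons why counts may differ between analyses.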
Figure 2.
Figure 2 shows the counts of published articles of eight Swiss HEIs across different bibliometric databases from 2012 to 2022. For institutions such as EPFL, ETH Zurich, UNIBAS, and UNIGE, the counts are generally higher in OpenAIRE and OpenAlex than in Scopus, Web of Science (WoS), and Dimensions. In particular, ETH Zurich and UNIGE show significant growth in article counts over the years, especially in OpenAIRE. By contrast, smaller institutions such as BFH, FHGR, FHNW, and HSLU, all universities of applied sciences, show lower counts and more variation across all databases. These discrepancies highlight the different data processing and inclusion criteria used by the respective bibliometric databases.
This variability across databases is also visible in Figure 3, which focuses on the count of book publications. For example, ETH Zurich and EPFL display higher counts in OpenAIRE than in the other databases, especially in recent years. It is noteworthy that databases such as Scopus and WoS consistently report lower counts for most institutions. UNIGE stands out as an institution with significant book counts in OpenAIRE, while FHGR and FHNW display small counts across all databases.
Figure 3.
Some of the higher counts can be affected by the deduplication process. For example, although all databases attempt to deduplicate preprints and their published versions, there may be instances where deduplication has not yet occurred. As a result, one piece of research is counted twice in databases that have not deduplicated it, but only once, and correctly, in those that have.
One illustrative example is the publication with doi: 10.48550/arxiv.2204.03554. OpenAlex represents the preprint and the corresponding publication with two entries:
- Work ID: W4223588778 | DOI: 10.48550/arxiv.2204.03554 | Year: 2022
- Work ID: W4313525046 | DOI: 10.1109/msp.2022.3219240 | Year: 2023
The same piece of research is also duplicated in OpenAIRE:
- Publication ID: “doi_dedup___::c9087f7d9f805a41f6291b863691e13f” | Year: 2022 | DOI: https://dx.doi.org/10.48550/arxiv.2204.03554
- Publication ID: “doi_dedup___::087826c5d566e4b4187c98f090b93b43” | Year: 2023 | DOI: 10.1109/msp.2022.3219240
In Dimensions, both the preprint (pub.1146977105) and the published article (pub.1154180151) are found under different IDs. However, in this particular example, Web of Science and Scopus only include the final published article, not the preprint. This illustration of the problem may explain at least part of the story behind potential differences in counts of research work. However, it is certainly not the only explanation. Other underlying mechanisms for the outlined discrepancies may not be immediately obvious and require further investigation. Some studies have already started exploring these differences (e.g., Mongeon & Paul-Hus, 2016; Singh et al., 2021).
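A first step towards spotting such cases is to bring DOIs into one form and flag those issued by arXiv. The sketch below applies this to the four OpenAlex and OpenAIRE records listed above. It is a simplified illustration: the function names are chosen for this example, and the check only recognises arXiv DOIs (prefix 10.48550/arxiv). Linking a preprint to its published version would need additional matching, for instance on title and authors, which is not shown here.

```python
import re


def normalise_doi(raw):
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' label."""
    doi = raw.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def is_arxiv_preprint(doi):
    """Return True for DOIs issued by arXiv (prefix 10.48550/arxiv)."""
    return normalise_doi(doi).startswith("10.48550/arxiv.")


# (database, record ID, DOI as recorded) for the example discussed above.
records = [
    ("OpenAlex", "W4223588778", "10.48550/arxiv.2204.03554"),
    ("OpenAlex", "W4313525046", "10.1109/msp.2022.3219240"),
    ("OpenAIRE", "doi_dedup___::c9087f7d9f805a41f6291b863691e13f",
     "https://dx.doi.org/10.48550/arxiv.2204.03554"),
    ("OpenAIRE", "doi_dedup___::087826c5d566e4b4187c98f090b93b43",
     "10.1109/msp.2022.3219240"),
]

preprints = [(db, rec_id) for db, rec_id, doi in records if is_arxiv_preprint(doi)]
distinct_dois = {normalise_doi(doi) for _, _, doi in records}
```

After normalisation, the four records reduce to two distinct DOIs, one of which is an arXiv preprint in each database. Normalising first matters because the same DOI can be stored with or without a resolver URL.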
Conclusion
Our investigation into the visibility of Swiss HEIs across bibliometric databases reveals several key messages:
- When it comes to the findability of HEIs, the open databases OpenAlex and OpenAIRE, as well as the closed database Dimensions, are more inclusive than the closed and more established databases Scopus and WoS.
- The collaborative approach in open databases makes data curation more sustainable than in commercial databases, because it reduces duplicated curation effort.
- Publication counts of institutions exhibit different patterns over the years, with open databases generally reporting higher counts for major institutions. Smaller institutions display more variability, likely reflecting differences in data processing and inclusion criteria.
- Challenges in data deduplication, such as treating preprints and published articles as different records, and the differences in the use of institutional identifiers, affect the institutional representation.
Addressing the issues presented here may help to improve the accuracy and reliability of bibliometric analyses for Swiss HEIs and pave the way forward to an ethical use of these data.
Acknowledgement
This blog post was partly inspired by ChatGPT 4.0, version dated 14th May 2024 (OpenAI 2021). The author further acknowledges helpful comments on the blog post by Dr Kathrin Thomas (University of Aberdeen).
This blog post has been written within the framework of the project “Towards Open Bibliometric Indicators” (TOBI), which is co-funded by swissuniversities and the ETH Library.
References
Mongeon, P., & Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: a comparative analysis. Scientometrics, 106, 213-228.
OpenAI. (2021). GPT-4o (ChatGPT) [Computer software]. Retrieved from https://openai.com, accessed 14.05.2024.
Footnotes
[1] See Dederke, J. (2024, April 9). Exploring Swiss higher education institutions in commercial bibliometric databases. Swiss Year of Scientometrics SYoS. https://yearofscientometrics.ethz.ch/swissuniversities-institutions-in-commercial-databases/
[2] Swissuniversities is the Rectors’ Conference of the Swiss Universities, i.e., the umbrella organization of Swiss universities (https://www.swissuniversities.ch/).
[3] See https://www.swissuniversities.ch/themen/lehre-studium/akkreditierte-schweizer-hochschulen, accessed 21.02.2024. In the meantime, the following two institutions have been added: César Ritz Colleges Switzerland (university of applied sciences institute), Franklin University Institute Switzerland (university institute).
Last update: 16 July 2024