Introduction

Current developments in institutional and national databases have led to more information about research outputs, especially on publications which have become commonplace output to be used on the research evaluation and funding allocation. Institutional and national databases on publications, be it an institutional current research information systems (later on CRIS) or a national house-built solution, are in almost all cases built for the context of monitoring researcher's outputs in a way that this data can then be used to evaluate on an institutional level (e.g. tenure-track, recruitment) or nationally (e.g. national funding models that take into account the research outputs in form of indicators) and further explored by general public (via e.g. portals or statistical dashboards). What makes both institutional and national databases apart from larger commercial bibliographic databases is that they assertive in including outputs from social sciences and humanities (later on SSH). To some extent also research and research outputs that are highly national (e.g. articles in domestic journals, publications in national language) are not well covered in commercial databases.

-yksi kappale CRIS-järjestelmien eduista (yleensä hyvin-strukturoitua ja määriteltyä, kattavaa sillä tyypillisesti käytetään tutkimuksen arvioinnissa tai rahoitusmallissa

To achieve a complete and accurate bibliographic database on an European level has been on the agenda for quite a few organizations or projects. European Network for Research Evaluation within the Social Sciences and Humanities (later on ENRESSH) and the Working Group 3 with a main objective to reflect upon the standardisation and the interoperability of CRISs dedicated to research outputs from SSH. One of the goals is to develop shared procedures for building and maintaining databases and design a roadmap for a European database. As part of the working group's work, a pilot case study was made on utilizing the Finnish national VIRTA Publication Information system's solution for wider set of organizations in Europe. This collaborative VIRTA-ENRESSH-POC of a decentralized approach to aggregate publication metadata was launched in Spring 2016 and the case study was carried out between 2017-2018 for 6 organizations from 4 countries (Belgium, Finland, Norway and Spain). In this POC, it was discussed that if on an European level a so called European Research Information Service could be built, that would provide a complete overview, i.e. metadata on publications, and would include all types of scholarly publications from all fields of science. The data collected in the pilot from had its highest quality and consistency in terms of the bibliographic data meanwhile the classifications varied.

The main objective of ENRESSH Working Group 3 is to reflect upon the standardisation and the interoperability of current research information systems (CRIS) dedicated to research outputs from the social sciences and humanities (SSH). One of the goals is to develop shared procedures for building and maintaining databases and design a roadmap for a European database.

One of the goals of the ENRESSH is to design a roadmap for a European database for SSH outputs. A proof of concept VIRTA-ENRESSH was built up - especially for SSH but not excluding others fields either.
In the context of ENRESSH and research of publishing in SSH fields, a metadata has to be at a sufficient level
VIRTA-ENRESSH to some extent have already explored the topic on minimum metadata set

European Publication Information Infrastructure

Perhaps even a bigger undertaking of collecting and combining bibliographic metadata on research publications on European level is ongoing as part the agenda of OpenAIRE, an organization behind a network of open science specialists in Europe and currently hosting one of the largest databases in Europe on research outputs. By utilizing the euroCRIS's CERIF data model at it's core, OpenAIRE has, with first version dating back to 2015 and recent updates in 2018, gained momentum by the OpenAIRE Guidelines for CRIS Managers to support metadata harvests from various institutional and national publication databases (e.g. VIRTA) and CRIS systems (e.g. METIS, PURE). As it stands, in 2019 several CRIS systems aim to be compliant with the Guidelines and thus harvestable by the OpenAIRE.

OpenAIREen ei kovin moni CRIS liittynyt. VIRTA ensimmäinen kansallinen.
OpenAIRE: ei myöskään kontrolloi kattavuutta (organisaatioittain, tieteenaloittain, julkaisutyypeittäin), luokitusten yhteensopivuutta eikä näin ollen mahdollista vertailua maiden välillä (tämä vaatisi ehdottomasti yhteisesti sovitettuja standardeja käytäntöjä - nyt top-down)

Albeit both initiatives seemingly share a common goal of having a complete set of metadata on publications, there is somewhat distinctive difference in the approach on accumulating metadata on research publications. While OpenAIRE already has a high number of records for publications (some 26 mil.), there is a high variation on the coverage of publications e.g. on a national level and a vast majority of publication are harvested either via repositories and other publication aggregators. This, although seemingly a very high number of records, is still far from commercial databases i.e. Scopus, Web of Science and Google Scholar (amount of records ranging from 100 to 400 mil.). OpenAIRE Explore (a portal for exploring the individual publication and its metadata from OpenAIRE database) aims to provide researchers and other interested persons a way to find relevant research. Other services include are the so called Content Provider Dashboards, which make it possible to measure and monitor the contents of your harvested database and make it possible to enrich the harvested metadata even further.

For purposes of exploring research publications, OpenAIRE provides great starting point. However, if there is a need to evaluate, monitor or assess some part of research done on an institutional, national or international level, the contents fall short. Mainly this is related to how the publications have been accumulated to the OpenAIRE database. Each harvested systems has to follow a set of Guidelines provided by OpenAIRE (currently including 3, for Literature Repositories, Data Archives and CRIS Managers) which state the use of data models (for repositories Dublin Core; data archives Data Cite and CRISs CERIF) and what validation there is in place for each metadata value provided in the harvest (format, ranges etc.). Especially for the CRIS Managers Guidelines there is little control on the metadata quality itself as many elements are included as optional and only a bare minimum of metadata is mandatory for the harvest. This leads to cases for example for research publications in which the metadata is minimal, only containing information e.g. on the title and the general publication type of this certain record. On many occasions the source systems, e.g. CRIS systems, have much higher quality metadata available, but via OpenAIRE harvest much of this information is lost due to shortcuts on mapping of the data models, the small amount of resources invested in providing metadata via endpoint to be harvested or the incompatibilities between data models. There is also little guarantee that the metadata on publications is evenly spread among scientific fields or between national and international (i.e. English) language, as some disciplines and publications written in English have tendency to skew the contains of large bibliographic databases.

One major findings of the ENRESSHs projects is that for a research publication database it is of great importance to be able to have a complete and inclusive set of research outputs, be it on institutional or nation level, for it to be used in any form of assessment or evaluation of research. This is generally achieved by the use of context relevant system choices, data models and criteria to import or input the publication metadata to databases. This approach is quite different of those of commercial databases or e.g. OpenAIRE, where metadata requirements are not able to take into account context related documentation on metadata, e.g. criteria on what is determined as "scientific" or what counts as an "article". For this reason, the aggregating commercial databases are in many cases not well equipped to answer to questions like "How many publications does organization X produce?" or "Which scientific field is most prominent in country Y?". Thus, their use in institutional or national contexts is difficult, as the coverage or the quality of metadata do not meet the needs set in various use cases.

Data model for ENRESSH

ENRESSH-VIRTA-POC

Common European standardization and data content need to be defined in collaboration with ENRESSH - “lowest common denominator” + additional optional information

The next step is to develop a data model specifically for the purpose of integrating institutional or national publication data from different countries. This needs to be done with an eye towards enhancing comprehensiveness, comparability and further use of the data. Although the data model and system should allow inclusion of all relevant scholarly outputs in different fields, it should also have enough metadata and structure to permit relevant subsets of publications to be used in comparisons and benchmarking
-inclusion of data: ei vielä niin formaalistai ja kattavasti kerättyjä. - mahdollista myöhemmin
-how implemented: yhteistyössä data providerien kanssa siten, että voidaan huomioida eri maiden erilaidet keruukäytännöt ja käyttötarkoitukset

CERIF

Figure 1: ENRESSH Minimum Data Model in relation to CERIF data model and OpenAIRE Guidelines for CRIS Managers

Focus on publications set only
- Inclusion of emerging outputs (data sets, researcher activities) might lead to worse data quality

How and if this kind of minimum metadata standard can be implemented - and on what level → Implementing in ENRESSH-VIRTA infrastructure
- Could be used on a separate ENRESSH publication database
- Could be used as a reference for e.g. OpenAIRE harvesting

Deliverable

A summary of minimum CERIF data model elements needed in research publication metadata transfers considering CRIS systems and national aggregators in European context
1. ENRESSH Minimum Data Model

First iteration, more work could be used on the specifics of metadata elements,

Implementing to ENRESSH-VIRTA

The VIRTA-ENRESSH-pilot was set up to integrate bibliographic metadata originating from different research information source systems.

"It is also possible to increase the comparability of data by developing automated methods to restructure and reclassify VIRTA data in a uniform way on the basis of the bibliographic metadata as well as information from external sources."

"The ontological approach also supports making data exchangeable with current research information standards such as EuroCRIS’s CERIF data model. In an ontology-based approach, an important decision is of course the choice of ontology. Here, various factors are relevant, such as expressiveness, domain- specificity, broadness, and adoption elsewhere. The CERIF interchange format, maintained by EuroCRIS, is a logical candidate, given its adoption in various European (CRIS) systems, high level of sophistication, and broad coverage of research information."

"Enriching these data with metadata on publication channels, e.g. the classification of journals as peer-reviewed or not, as high-prestige in different national contexts, or with Web of Science and Scopus based impact factors, makes them immediately useful for benchmarking and monitoring at local, regional, national and European level."
-enriching the matedata with publication channels
- mahdollistaa paremman metadatan (esim julkaisutyypit, tieteenalat)
-lähteiden dokumentointi!

Deliverable

An outline of implementing the research publication metadata transfer in ENRESSH-VIRTA infrastructure
1. Implementing in ENRESSH-VIRTA Infrastructure
2. Interoperability platform as supporting tool

Future tasks

As part of the NordRIS proposal and further work on ENRESSH-VIRTA
Discuss with validation on OpenAIRE's side if ENRESSH Minimum Data Model could be used
CEF Telecom call for proposals. The next call on ”Access to re-usable public sector information –PUBLIC OPEN DATA” opening on July might be relevant for a project compiling and opening European publication data. The deadline for proposals is on 14 November. Find more (pages 31-34): https://ec.europa.eu/inea/sites/inea/files/cef_telecom_work_programme_2019.pdf
Presentation(s) in euroCRIS Membership Meeting 2019 and/or euroCRIS Conference 2020

Things to keep in mind

Authority lists (journal lists etc.) to support the quality aspect of publication metadata
- Could be used to bypass some of the problems e.g. in field of science classification and/or quality aspects
Manual of good practices (SSH databases)

Supporting documents:

Puuska, Hanna-Mari; Guns, Raf; Pölönen, Janne; Sivertsen, Gunnar; Mañana-Rodríguez, Jorge; Engels, Tim (2018): Proof of Concept of a European database for social sciences and humanities publications: Description of the VIRTA-ENRESSH pilot. figshare. Journal contribution. https://doi.org/10.6084/m9.figshare.5993506.v1

https://dspacecris.eurocris.org/bitstream/11366/682/1/Puuska_et_al_CRIS2018_paper_Proof_of_concept_VIRTA-ENRESSH.pdf

Sīle, L. et al. (2017). European Databases and Repositories for Social Sciences and Humanities Research Output. Antwerp: ECOOM & ENRESSH. https://doi.org/10.6084/m9.figshare.5172322

http://enressh.eu/wp-content/uploads/2017/09/2017_ENRESSH_European_Databases.pdf

Towards the integration of European research information

https://dspacecris.eurocris.org/handle/11366/593

https://openaire-guidelines-for-cris-managers.readthedocs.io/en/latest/index.html

CERIF-tietomallin määrittely OpenAIRE tiedonsiirrossa

CERIF - VIRTA mapping

https://docs.google.com/document/d/1Rm4OMOUf3JEti6aLmCrnSilX-sbutknFTR7njeItmBc/edit?usp=sharing