In this article, we propose a Category Theory approach to (syntactic) interoperability between linguistic tools. The resulting category consists of textual documents, including any linguistic annotations, NLP tools that analyze texts and add additional linguistic information, and format converters. Format converters are necessary to make the tools both able to read and to produce different output formats, which is the key to interoperability. The idea behind this document is the parallelism between the concepts of composition and associativity in Category Theory with the NLP pipelines. We show how pipelines of linguistic tools can be modeled into the conceptual framework of Category Theory and we successfully apply this method to two real-life examples. Paper submitted to Applied Category Theory 2020 and accepted for Virtual Poster Session
In 2018, the European Strategic Forum for research infrastructures (ESFRI) was tasked by the Competitiveness Council, a configuration of the Council of the EU, to develop a common approach for monitoring of Research Infrastructures' performance. To this end, ESFRI established a working group, which has proposed 21 Key Performance Indicators (KPIs) to monitor the progress of the Research Infrastructures (RIs) addressed towards their objectives. The RIs were then asked to assess their relevance for their institution. The paper aims to identify the relevance of certain indicators for particular groups of RIs by using cluster and discriminant analysis. This could contribute to development of a monitoring system, tailored to particular RIs. To obtain a typology of the RIs, we first performed cluster analysis of the RIs according to their properties, which revealed clusters of RIs with similar characteristics, based on to the domain of operation, such as food, environment or engineering. Then, discriminant analysis was used to study how the relevance of the KPIs differs among the obtained clusters. This analysis revealed that the percentage of RIs correctly classified into five clusters, using the KPIs, is 80%. Such a high percentage indicates that there are significant differences in the relevance of certain indicators, depending on the ESFRI domain of the RI. The indicators therefore need to be adapted to the type of infrastructure. It is therefore proposed that the Strategic Working Groups of ESFRI addressing specific domains should be involved in the tailored development of the monitoring of pan-European RIs. Comment: 15 pages, 8 tables, 3 figures
More and more cultural institutions use Linked Data principles to share and connect their collection metadata. In the archival field, initiatives emerge to exploit data contained in archival descriptions and adapt encoding standards to the semantic web. In this context, online authority files can be used to enrich metadata. However, relying on a decentralized network of knowledge bases such as Wikidata, DBpedia or even Viaf has its own difficulties. This paper aims to offer a critical view of these linked authority files by adopting a close-reading approach. Through a practical case study, we intend to identify and illustrate the possibilities and limits of RDF triples compared to institutions' less structured metadata. Comment: Workshop "Dariah "Trust and Understanding: the value of metadata in a digitally joined-up world" (14/05/2018, Brussels), preprint of the submission to the journal "Archives et Biblioth\`eques de Belgique"
AARC (Authentication and Authorisation for Research Communities) is a two-year EC-funded project to develop and pilot an integrated cross-discipline authentication and authorisation framework, building on existing authentication and authorisation infrastructures (AAIs) and production federated infrastructure. AARC also champions federated access and offers tailored training to complement the actions needed to test AARC results and to promote AARC outcomes. This article describes a high-level blueprint architectures for interoperable AAIs. Comment: This text was part of a (public) EU deliverable document. It has a main part and a long appendix with more details about example infrastructures that were taken into acount
This paper describes the achievements of the H2020 project INDIGO-DataCloud. The project has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of scientific communities, either locally or through enhanced interfaces. The middleware developed allows to federate hybrid resources, to easily write, port and run scientific applications to the cloud. In particular, we have extended existing PaaS (Platform as a Service) solutions, allowing public and private e-infrastructures, including those provided by EGI, EUDAT, and Helix Nebula, to integrate their existing services and make them available through AAI services compliant with GEANT interfederation policies, thus guaranteeing transparency and trust in the provisioning of such services. Our middleware facilitates the execution of applications using containers on Cloud and Grid based infrastructures, as well as on HPC clusters. Our developments are freely downloadable as open source components, and are already being integrated into many scientific applications. 39 pages, 15 figures.Version accepted in Journal of Grid Computing
We propose a morphologically informed model for named entity recognition, which is based on LSTM-CRF architecture and combines word embeddings, Bi-LSTM character embeddings, part-of-speech (POS) tags, and morphological information. While previous work has focused on learning from raw word input, using word and character embeddings only, we show that for morphologically rich languages, such as Bulgarian, access to POS information contributes more to the performance gains than the detailed morphological information. Thus, we show that named entity recognition needs only coarse-grained POS tags, but at the same time it can benefit from simultaneously using some POS information of different granularity. Our evaluation results over a standard dataset show sizable improvements over the state-of-the-art for Bulgarian NER. named entity recognition; Bulgarian NER; morphology; morpho-syntax
International audience; This contribution will show how Access play a strong role in the creation and structuring of DARIAH, a European Digital Research Infrastructure in Arts and Humanities.To achieve this goal, this contribution will develop the concept of Access from five examples:_ Interdisciplinarity point of view_ Manage contradiction between national and international perspectives_ Involve different communities (not only researchers stakeholders)_ Manage tools and services_ Develop and use new collaboration toolsWe would like to demonstrate that speaking about Access always implies a selection, a choice, even in the perspective of "Open Access".
We investigated the evolution and transformation of scientific knowledge in the early modern period, analyzing more than 350 different editions of textbooks used for teaching astronomy in European universities from the late fifteenth century to mid-seventeenth century. These historical sources constitute the Sphaera Corpus. By examining different semantic relations among individual parts of each edition on record, we built a multiplex network consisting of six layers, as well as the aggregated network built from the superposition of all the layers. The network analysis reveals the emergence of five different communities. The contribution of each layer in shaping the communities and the properties of each community are studied. The most influential books in the corpus are found by calculating the average age of all the out-going and in-coming links for each book. A small group of editions is identified as a transmitter of knowledge as they bridge past knowledge to the future through a long temporal interval. Our analysis, moreover, identifies the most disruptive books. These books introduce new knowledge that is then adopted by almost all the books published afterwards until the end of the whole period of study. The historical research on the content of the identified books, as an empirical test, finally corroborates the results of all our analyses. 19 pages, 9 figures
KBSET is an environment that provides support for scholarly editing in two flavors: First, as a practical tool KBSET/Letters that accompanies the development of editions of correspondences (in particular from the 18th and 19th century), completely from source documents to PDF and HTML presentations. Second, as a prototypical tool KBSET/NER for experimentally investigating novel forms of working on editions that are centered around automated named entity recognition. KBSET can process declarative application-specific markup that is expressed in LaTeX notation and incorporate large external fact bases that are typically provided in RDF. KBSET includes specially developed LaTeX styles and a core system that is written in SWI-Prolog, which is used there in many roles, utilizing that it realizes the potential of Prolog as a unifying language. Comment: To appear in DECLARE 2019 Revised Selected Papers
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities, synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear