In this article, we propose a Category Theory approach to (syntactic) interoperability between linguistic tools. The resulting category consists of textual documents, including any linguistic annotations, NLP tools that analyze texts and add additional linguistic information, and format converters. Format converters are necessary to make the tools both able to read and to produce different output formats, which is the key to interoperability. The idea behind this document is the parallelism between the concepts of composition and associativity in Category Theory with the NLP pipelines. We show how pipelines of linguistic tools can be modeled into the conceptual framework of Category Theory and we successfully apply this method to two real-life examples. Paper submitted to Applied Category Theory 2020 and accepted for Virtual Poster Session
Miriam Baglioni; Alessia Bardi; Argiro Kokogiannaki; Paolo Manghi; Katerina Iatropoulou; Pedro Príncipe; André Vieira; Lars Holm Nielsen; Harry Dimitropoulos; Ioannis Foufoulas; +7 more
Miriam Baglioni; Alessia Bardi; Argiro Kokogiannaki; Paolo Manghi; Katerina Iatropoulou; Pedro Príncipe; André Vieira; Lars Holm Nielsen; Harry Dimitropoulos; Ioannis Foufoulas; Natalia Manola; Claudio Atzori; Sandro La Bruzzo; Emma Lazzeri; Michele Artini; Michele De Bonis; Andrea Dell’Amico;
Despite the hype, the effective implementation of Open Science is hindered by several cultural and technical barriers. Researchers embraced digital science, use “digital laboratories” (e.g. research infrastructures, thematic services) to conduct their research and publish research data, but practices and tools are still far from achieving the expectations of transparency and reproducibility of Open Science. The places where science is performed and the places where science is published are still regarded as different realms. Publishing is still a post-experimental, tedious, manual process, too often limited to articles, in some contexts semantically linked to datasets, rarely to software, generally disregarding digital representations of experiments. In this work we present the OpenAIRE Research Community Dashboard (RCD), designed to overcome some of these barriers for a given research community, minimizing the technical efforts and without renouncing any of the community services or practices. The RCD flanks digital laboratories of research communities with scholarly communication tools for discovering and publishing interlinked scientific products such as literature, datasets, and software. The benefits of the RCD are show-cased by means of two real-case scenarios: the European Marine Science community and the European Plate Observing System (EPOS) research infrastructure. This work is partly funded by the OpenAIRE-Advance H2020 project (grant number: 777541; call: H2020-EINFRA-2017) and the OpenAIREConnect H2020 project (grant number: 731011; call: H2020-EINFRA-2016-1). Moreover, we would like to thank our colleagues Michele Manunta, Francesco Casu, and Claudio De Luca (Institute for the Electromagnetic Sensing of the Environment, CNR, Italy) for their work on the EPOS infrastructure RCD; and Stephane Pesant (University of Bremen, Germany) his work on the European Marine Science RCD. First Online 30 August 2019
This paper describes the achievements of the H2020 project INDIGO-DataCloud. The project has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of scientific communities, either locally or through enhanced interfaces. The middleware developed allows to federate hybrid resources, to easily write, port and run scientific applications to the cloud. In particular, we have extended existing PaaS (Platform as a Service) solutions, allowing public and private e-infrastructures, including those provided by EGI, EUDAT, and Helix Nebula, to integrate their existing services and make them available through AAI services compliant with GEANT interfederation policies, thus guaranteeing transparency and trust in the provisioning of such services. Our middleware facilitates the execution of applications using containers on Cloud and Grid based infrastructures, as well as on HPC clusters. Our developments are freely downloadable as open source components, and are already being integrated into many scientific applications. 39 pages, 15 figures.Version accepted in Journal of Grid Computing
We propose a morphologically informed model for named entity recognition, which is based on LSTM-CRF architecture and combines word embeddings, Bi-LSTM character embeddings, part-of-speech (POS) tags, and morphological information. While previous work has focused on learning from raw word input, using word and character embeddings only, we show that for morphologically rich languages, such as Bulgarian, access to POS information contributes more to the performance gains than the detailed morphological information. Thus, we show that named entity recognition needs only coarse-grained POS tags, but at the same time it can benefit from simultaneously using some POS information of different granularity. Our evaluation results over a standard dataset show sizable improvements over the state-of-the-art for Bulgarian NER. named entity recognition; Bulgarian NER; morphology; morpho-syntax
International audience; This contribution will show how Access play a strong role in the creation and structuring of DARIAH, a European Digital Research Infrastructure in Arts and Humanities.To achieve this goal, this contribution will develop the concept of Access from five examples:_ Interdisciplinarity point of view_ Manage contradiction between national and international perspectives_ Involve different communities (not only researchers stakeholders)_ Manage tools and services_ Develop and use new collaboration toolsWe would like to demonstrate that speaking about Access always implies a selection, a choice, even in the perspective of "Open Access".
We investigated the evolution and transformation of scientific knowledge in the early modern period, analyzing more than 350 different editions of textbooks used for teaching astronomy in European universities from the late fifteenth century to mid-seventeenth century. These historical sources constitute the Sphaera Corpus. By examining different semantic relations among individual parts of each edition on record, we built a multiplex network consisting of six layers, as well as the aggregated network built from the superposition of all the layers. The network analysis reveals the emergence of five different communities. The contribution of each layer in shaping the communities and the properties of each community are studied. The most influential books in the corpus are found by calculating the average age of all the out-going and in-coming links for each book. A small group of editions is identified as a transmitter of knowledge as they bridge past knowledge to the future through a long temporal interval. Our analysis, moreover, identifies the most disruptive books. These books introduce new knowledge that is then adopted by almost all the books published afterwards until the end of the whole period of study. The historical research on the content of the identified books, as an empirical test, finally corroborates the results of all our analyses. 19 pages, 9 figures
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities, synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear
Lexical Markup Framework (LMF) or ISO 24613  is a de jure standard that provides a framework for modelling and encoding lexical information in retrodigitised print dictionaries and NLP lexical databases. An in-depth review is currently underway within the standardisation subcommittee , ISO-TC37/SC4/WG4, to find a more modular, flexible and durable follow up to the original LMF standard published in 2008. In this paper we will present some of the major improvements which have so far been implemented in the new version of LMF. Comment: AsiaLex 2019: Past, Present and Future, Jun 2019, Istanbul, Turkey
Publisher: Multidisciplinary Digital Publishing Institute
The historic view of metadata as “data about data” is expanding to include data about other items that must be created, used, and understood throughout the data and project life cycles. In this context, metadata might better be defined as the structured and standard part of documentation, and the metadata life cycle can be described as the metadata content that is required for documentation in each phase of the project and data life cycles. This incremental approach to metadata creation is similar to the spiral model used in software development. Each phase also has distinct users and specific questions to which they need answers. In many cases, the metadata life cycle involves hierarchies where latter phases have increased numbers of items. The relationships between metadata in different phases can be captured through structure in the metadata standard, or through conventions for identifiers. Metadata creation and management can be streamlined and simplified by re-using metadata across many records. Many of these ideas have been developed to various degrees in several Geoscience disciplines and are being used in metadata for documenting the integrated life cycle of environmental research in the Arctic, including projects, collection sites, and datasets.
CLARIN (Common Language Resources and Technology Infrastructure) is regarded as one of the most important European research infrastructures, offering and promoting a wide array of useful services for (digital) research in linguistics and humanities. However, the assessment of the users for its core technical development has been highly limited, therefore, it is unclear if the community is thoroughly aware of the status-quo of the growing infrastructure. In addition, CLARIN does not seem to be fully materialised marketing and business plans and strategies despite its strong technical assets. This article analyses the web traffic of the Virtual Language Observatory, one of the main web applications of CLARIN and a symbol of pan-European re-search cooperation, to evaluate the users and performance of the service in a transparent and scientific way. It is envisaged that the paper can raise awareness of the pressing issues on objective and transparent operation of the infrastructure though Open Evaluation, and the synergy between marketing and technical development. It also investigates the "science of web analytics" in an attempt to document the research process for the purpose of reusability and reproducibility, thus to find universal lessons for the use of a web analytics, rather than to merely produce a statistical report of a particular website which loses its value outside its context.