The following results are related to DARIAH EU. Are you interested to view more results? Visit OpenAIRE - Explore.
47 Research products, page 1 of 5

  • DARIAH EU
  • Publications
  • Research data
  • Other research products
  • 2018-2022
  • FR
  • English
  • Mémoires en Sciences de l'Information et de la Communication
  • Hal-Diderot
  • Hyper Article en Ligne
  • DARIAH EU
  • Digital Humanities and Cultural Heritage

  • Publication . Article . Other literature type . 2022
    Open Access English
    Authors: 
    Elisa Nury; Claire Clivaz; Marta Błaszczyńska; Michael Kaiser; Agata Morka; Valérie Schaefer; Jadranka Stojanovski; Erzsébet Tóth-Czifra;
    Publisher: HAL CCSD
    Countries: Croatia, France
    Project: EC | OPERAS-P (871069)

    International audience; Published in OA on RESSI (http://www.ressi.ch/) at the end of October 2021. We present here highlights from an enquiry into innovations in scholarly writing in the Humanities and Social Sciences, carried out in the H2020 project OPERAS-P. This article explores the theme of Open Research Data and its role in the emergence of new models of scholarly writing. We examine more closely the obstacles to, and the conditions that foster, the publication of research data, from both a social and a technical perspective.

  • Open Access English
    Authors: 
    Frank Uiterwaal; Franco Niccolucci; Sheena Bassett; Steven Krauwer; Hella Hollander; Femmy Admiraal; Laurent Romary; George Bruseker; Carlo Meghini; Jennifer Edmond; +1 more
    Publisher: Edinburgh University Press for the Association for History and Computing, Edinburgh, United Kingdom
    Countries: France, Italy, Netherlands
    Project: EC | PARTHENOS (654119)

    This article has been accepted for publication by EUP in the IJHAC: International Journal of Humanities and Arts Computing (https://www.euppublishing.com/loi/ijhac); International audience; Since the first ESFRI roadmap in 2006, multiple humanities Research Infrastructures (RIs) have been set up all over the European continent, supporting archaeologists (ARIADNE), linguists (CLARIN-ERIC), Holocaust researchers (EHRI), cultural heritage specialists (IPERION-CH) and others. These examples only scratch the surface of the breadth of research communities that have benefited from close cooperation in the European Research Area. While each field developed discipline-specific services over the years, common themes can also be distinguished. All humanities RIs address, in varying degrees, questions around research data management, the use of standards and the desired interoperability of data across disciplinary boundaries. This article sheds light on how the cluster project PARTHENOS developed pooled services and shared solutions for its audience of humanities researchers, RI managers and policymakers. At a time when the convergence of existing infrastructure is becoming ever more important – with the construction of a European Open Science Cloud as an audacious, ultimate goal – we hope that our experiences inform future work and provide inspiration on how to exploit synergies in interdisciplinary, transnational, scientific cooperation.

  • Open Access English
    Authors: 
    Marie-Laure Massot; Agnès Tricoche;
    Publisher: HAL CCSD
    Country: France
    Project: ANR | PSL (ANR-10-IDEX-0001)

    This article presents a study of the French-speaking digital humanities. It is based on the experience of two research engineers from the French National Center for Scientific Research (CNRS) who have been studying these issues for the last ten years. They conducted a survey at the École Normale Supérieure (ENS-Paris) which enabled them to draw up an overview of the transformation of the profession of humanities and social sciences research engineers in the context of the digital humanities. The Digit_Hum initiative, which they run in parallel with their respective activities at the ENS, also informed this overview through its role as a space for discussion, training and the structuring of the digital humanities at the ENS and the Université Paris Sciences & Lettres (PSL).

  • Open Access English
    Authors: 
    Stefan Buddenbohm; Maaike A. de Jong; Jean-Luc Minel; Yoann Moranville;
    Publisher: HAL CCSD
    Country: France
    Project: EC | HaS-DARIAH (675570)

    How can researchers identify suitable research data repositories for the deposit of their research data? Which repository best matches the technical and legal requirements of a specific research project? To this end, and from a humanities perspective, the Data Deposit Recommendation Service (DDRS) has been developed as a prototype. It serves not only as a functional service for selecting humanities research data repositories but, in particular, as a technical demonstrator illustrating the potential of re-using an already existing infrastructure (in this case re3data) and the feasibility of setting up this kind of service for other research disciplines. The documentation and the code of this project can be found in the DARIAH GitHub repository: https://dariah-eric.github.io/ddrs/.

  • Open Access English
    Authors: 
    Maryl, Maciej; Błaszczyńska, Marta; Zalotyńska, Agnieszka; Taylor, Laurence; Avanço, Karla; Balula, Ana; Buchner, Anna; Caliman, Lorena; Clivaz, Claire; Costa, Carlos; +21 more
    Publisher: HAL CCSD
    Countries: Croatia, France
    Project: EC | OPERAS-P (871069)

    This report discusses the scholarly communication issues in Social Sciences and Humanities that are relevant to the future development and functioning of OPERAS. The outcomes collected here can be divided into two groups of innovations regarding 1) the operation of OPERAS, and 2) its activities. The “operational” issues include the ways in which an innovative research infrastructure should be governed (Chapter 1) as well as the business models for open access publications in Social Sciences and Humanities (Chapter 2). The other group of issues is dedicated to strategic areas where OPERAS and its services may play an instrumental role in providing, enabling, or unlocking innovation: FAIR data (Chapter 3), bibliodiversity and multilingualism in scholarly communication (Chapter 4), the future of scholarly writing (Chapter 5), and quality assessment (Chapter 6). Each chapter provides an overview of the main findings and challenges with emphasis on recommendations for OPERAS and other stakeholders like e-infrastructures, publishers, SSH researchers, research performing organisations, policy makers, and funders. Links to data and further publications stemming from work concerning particular tasks are located at the end of each chapter.

  • Publication . Article . Other literature type . 2020
    Open Access English
    Authors: 
    Clivaz, Claire; Allen, Garrick V.;
    Publisher: HAL CCSD
    Country: France

    Ancient Manuscripts and Virtual Research Environments Lausanne, 10–11 September 2020 - Conference report

  • English
    Authors: 
    Edmond, Jennifer; Basaraba, Nicole; Doran, Michelle; Garnett, Vicky; Grile, Courtney Helen; Papaki, Eliza; Tóth-Czifra, Erzsébet;
    Publisher: HAL CCSD
    Country: France
  • Open Access English
    Authors: 
    Luca Foppiano; Laurent Romary;
    Publisher: HAL CCSD
    Country: France
    Project: EC | HIRMEOS (731102)

    International audience; This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the national infrastructure Huma-Num in France, this service provides an efficient state-of-the-art implementation coupled with standardised interfaces allowing easy deployment in a variety of potential digital humanities contexts. The topics of accessibility and sustainability have long been discussed in the attempt to provide some best practices in the widely fragmented ecosystem of the DARIAH research infrastructure. The history of entity-fishing has been mentioned as an example of good practice: initially developed in the context of the FP7 CENDARI project, it was well received by the user community and continued to be developed within the H2020 HIRMEOS project, where several open access publishers have integrated the service into their collections of published monographs as a means to enhance retrieval and access. entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API, which allows easier and seamless integration, a language-independent and stable convention, and a widely used service-oriented architecture (SOA) design. Input and output data are carried over a query data model with a defined structure, providing flexibility to support the processing of partially annotated text or the repartition of text over several queries. The interface implements a variety of functionalities, such as language recognition, sentence segmentation and modules for accessing and looking up concepts in the knowledge base. The API itself integrates more advanced contextual parametrisation and ranked outputs, allowing for resilient integration in various possible use cases. The entity-fishing API has been used as a concrete use case to draft the experimental stand-off proposal, which has been submitted for integration into the TEI guidelines. The representation is also compliant with the Web Annotation Data Model (WADM). In this paper we aim to describe the functionalities of the service as a reference contribution to the subject of web-based NERD services. In order to cover all aspects, the description of the architecture is structured to provide two complementary viewpoints. First, we discuss the system from the data angle, detailing the workflow from input to output and unpacking each building block in the processing flow. Secondly, with a more academic approach, we provide a transversal schema of the different components, taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots and weaknesses. The aim here is to give a description of the tool and, at the same time, a technical software engineering analysis that will help the reader understand our choices for the resources allocated in the infrastructure. Thanks to the work of millions of volunteers, Wikipedia has today reached a stability and completeness that leave no usable alternatives on the market (considering also the licence aspect). The launch of Wikidata in 2012 has completed the picture with a complementary language-independent meta-model which is becoming the scientific reference for many disciplines. After providing an introduction to Wikipedia and Wikidata, we describe the knowledge base: the data organisation, the entity-fishing process to exploit it and the way it is built from nightly dumps using an offline process. We conclude the paper by presenting our solution for the service deployment: which resources were allocated, and how. The service has been in production since Q3 of 2017, and was extensively used by the H2020 HIRMEOS partners during the integration with the publishing platforms. We believe we have strived to provide the best performance with the minimum amount of resources. Thanks to the Huma-Num infrastructure we still have the possibility to scale up the infrastructure as needed, for example to support an increase in demand or temporary needs to process a huge backlog of documents. In the long term, thanks to this sustainable environment, we are planning to keep delivering the service far beyond the end of the H2020 HIRMEOS project.

  • English
    Authors: 
    Khemakhem, Mohamed;
    Publisher: HAL CCSD
    Project: ANR | BASNUM (ANR-18-CE38-0003), EC | PARTHENOS (654119)

    Dictionaries could be considered the most comprehensive reservoir of human knowledge: they carry not only the lexical description of words in one or more languages, but also the common awareness of a certain community about every known piece of knowledge in a given time frame. Print dictionaries are the principal resources that enable the documentation and transfer of such knowledge. They already exist in abundant numbers, while new ones are continuously compiled, even with the recent strong move to digital resources. However, a majority of these dictionaries, even when available digitally, is still not fully structured due to the absence of scalable methods and techniques that can cover the variety of corresponding material. Moreover, the relatively few existing structured resources present limited exchange and query alternatives, given the discrepancy of their data models and formats. In this thesis we address the task of parsing lexical information in print dictionaries through the design of computer models that enable their automatic structuring. Solving this task goes hand in hand with finding a standardised output for these models to guarantee maximum interoperability among resources and usability for downstream tasks. First, we present different classifications of dictionary resources to delimit the category of print dictionaries we aim to process. Second, we introduce the parsing task by providing an overview of the processing challenges and a study of the state of the art. Then, we present a novel approach based on top-down parsing of the lexical information. We also outline the architecture of the resulting system, called GROBID-Dictionaries, and the methodology we followed to close the gap between the conception of the system and its applicability to real-world scenarios. After that, we survey the landscape of the leading standards for structured lexical resources. In addition, we provide an analysis of two ongoing initiatives, TEI-Lex-0 and LMF, that aim at unifying the modelling of lexical information in print and electronic dictionaries. Based on that, we present a serialisation format that is in line with the schemes of the two standardisation initiatives and fits the approach implemented in our parsing system. After presenting the parsing and standardised serialisation facets of our lexical models, we provide an empirical study of their performance and behaviour. The investigation is based on a specific machine learning setup and a series of experiments carried out with a selected pool of varied dictionaries. In this study we present different approaches to feature engineering and exhibit the strengths and limits of the best resulting models. We also dedicate two series of experiments to exploring the scalability of our models with regard to the processed documents and the employed machine learning technique. Finally, we sum up this thesis by presenting the major conclusions and opening new perspectives for extending our investigations in a number of research directions for parsing entry-based documents.

  • Publication . Report . 2020
    English
    Authors: 
    Bertrand, Loïc; Anglos, Demetrios; Castillejo, Marta; Charbonnel, Bénédicte; David, Sophie; de Clercq, Hilde; Dubray, Fanny; Spring, Marika;
    Publisher: HAL CCSD
    Country: France
    Project: EC | E-RIHS PP (739503)

    The study and preservation of tangible cultural and natural heritage is a global challenge for science and society at large. The European Research Infrastructure for Heritage Science (E-RIHS) will play a leading role in research on the interpretation, preservation, documentation and management of heritage. As an interdisciplinary infrastructure, E-RIHS will interconnect knowledge and methodologies to address key scientific questions in the field of heritage as a whole. The infrastructure is built on ten core pillars. It will provide a structured and unified input of large-scale instruments, portable devices, physical and digital archives. Its implementation will focus on scientific excellence, interdisciplinarity and cooperation. In doing so, it will offer unprecedented research opportunities to a wide range of interdisciplinary scientific communities.

Advanced search in Research products
Research products
arrow_drop_down
Searching FieldsTerms
Any field
arrow_drop_down
includes
arrow_drop_down
Include:
The following results are related to DARIAH EU. Are you interested to view more results? Visit OpenAIRE - Explore.
47 Research products, page 1 of 5
  • Publication . Article . Other literature type . 2022
    Open Access English
    Authors: 
    Elisa Nury; Claire Clivaz; Marta Błaszczyńska; Michael Kaiser; Agata Morka; Valérie Schaefer; Jadranka Stojanovski; Erzsébet Tóth-Czifra;
    Publisher: HAL CCSD
    Countries: Croatia, France, France
    Project: EC | OPERAS-P (871069)

    International audience; Published in OA on RESSI (http://www.ressi.ch/) at the end of Octobre 2021. We present here highlights from an enquiry on the innovations in scholarly writing in the Humanities and Social Sciences in the H2020 project OPERAS-P. This article explores the theme of Open Research Data and its role in the emergence of new models of scholarly writing. We examine more closely the obstacles and fostering conditions to the publication of research data, both from a social and a technical perspective.

  • Open Access English
    Authors: 
    Frank Uiterwaal; Franco Niccolucci; Sheena Bassett; Steven Krauwer; Hella Hollander; Femmy Admiraal; Laurent Romary; George Bruseker; Carlo Meghini; Jennifer Edmond; +1 more
    Publisher: Edinburgh University Press for the Association for History and Computing,, Edinburgh , Regno Unito
    Countries: France, France, France, Italy, Italy, Netherlands
    Project: EC | PARTHENOS (654119)

    This article has been accepted for publication by EUP in the IJHAC: International Journal of Humanities and Arts Computing (https://www.euppublishing.com/loi/ijhac); International audience; Since the first ESFRI roadmap in 2006, multiple humanities Research Infrastructures (RIs) have been set up all over the European continent, supporting archaeologists (ARIADNE), linguists (CLARIN-ERIC), Holocaust researchers (EHRI), cultural heritage specialists (IPERION-CH) and others. These examples only scratch the surface of the breadth of research communities that have benefited from close cooperation in the European Research Area.While each field developed discipline-specific services over the years, common themes can also be distinguished. All humanities RIs address, in varying degrees, questions around research data management, the use of standards and the desired interoperability of data across disciplinary boundaries.This article sheds light on how cluster project PARTHENOS developed pooled services and shared solutions for its audience of humanities researchers, RI managers and policymakers. In a time where the convergence of existing infrastructure is becoming ever more important – with the construction of a European Open Science Cloud as an audacious, ultimate goal – we hope that our experiences inform future work and provide inspiration on how to exploit synergies in interdisciplinary, transnational, scientific cooperation.

  • Open Access English
    Authors: 
    Marie-Laure Massot; Agnès Tricoche;
    Publisher: HAL CCSD
    Country: France
    Project: ANR | PSL (ANR-10-IDEX-0001)

    This article presents a study of the French-speaking digital humanities. It is based on the experience of two research engineers from the French National Center for Scientific Research (CNRS) who have been studying these issues for the last ten years. They conducted a survey at the École Normale Supérieure (ENS-Paris) which enabled them to draw up an overview of the transformation of the profession of humanities and social sciences research engineers in the context of the digital humanities. The Digit_Hum initiative, which they run in parallel with their respective activities at the ENS, also provided information for this overview thanks to its role as a space for discussion about the digital humanities along with training and structuring of this field at the ENS and the Université Paris Sciences & Lettres (PSL). Cet article est une réflexion sur les humanités numériques en contexte francophone. Elle s’appuie sur l'expérience de deux ingénieures du Centre National de la Recherche Scientifique travaillant sur ces questions depuis une dizaine d'années. À travers l'enquête qu'elles ont menée à l'École normale supérieure (ENS-Paris), elles dressent un panorama de la transformation du métier d'ingénieur(e) en sciences humaines et sociales dans le contexte des humanités numériques. L'initiative Digit_Hum, qu'elles animent en parallèle de leurs activités respectives à l'École, nourrit également ce témoignage en constituant un espace de discussions, de formations et de structuration des humanités numériques au sein de l'ENS et de l’Université Paris Sciences & Lettres.

  • Open Access English
    Authors: 
    Stefan Buddenbohm; Maaike A. de Jong; Jean-Luc Minel; Yoann Moranville;
    Publisher: HAL CCSD
    Country: France
    Project: EC | HaS-DARIAH (675570)

    AbstractHow can researchers identify suitable research data repositories for the deposit of their research data? Which repository matches best the technical and legal requirements of a specific research project? For this end and with a humanities perspective the Data Deposit Recommendation Service (DDRS) has been developed as a prototype. It not only serves as a functional service for selecting humanities research data repositories but it is particularly a technical demonstrator illustrating the potential of re-using an already existing infrastructure - in this case re3data - and the feasibility to set up this kind of service for other research disciplines. The documentation and the code of this project can be found in the DARIAH GitHub repository: https://dariah-eric.github.io/ddrs/.

  • Open Access English
    Authors: 
    Maryl, Maciej; Błaszczyńska, Marta; Zalotyńska, Agnieszka; Taylor, Laurence; Avanço, Karla; Balula, Ana; Buchner, Anna; Caliman, Lorena; Clivaz, Claire; Costa, Carlos; +21 more
    Publisher: HAL CCSD
    Countries: Croatia, France
    Project: EC | OPERAS-P (871069)

    This report discusses the scholarly communication issues in Social Sciences and Humanities that are relevant to the future development and functioning of OPERAS. The outcomes collected here can be divided into two groups of innovations regarding 1) the operation of OPERAS, and 2) its activities. The “operational” issues include the ways in which an innovative research infrastructure should be governed (Chapter 1) as well as the business models for open access publications in Social Sciences and Humanities (Chapter 2). The other group of issues is dedicated to strategic areas where OPERAS and its services may play an instrumental role in providing, enabling, or unlocking innovation: FAIR data (Chapter 3), bibliodiversity and multilingualism in scholarly communication (Chapter 4), the future of scholarly writing (Chapter 5), and quality assessment (Chapter 6). Each chapter provides an overview of the main findings and challenges with emphasis on recommendations for OPERAS and other stakeholders like e-infrastructures, publishers, SSH researchers, research performing organisations, policy makers, and funders. Links to data and further publications stemming from work concerning particular tasks are located at the end of each chapter.

  • Publication . Article . Other literature type . 2020
    Open Access English
    Authors: 
    Clivaz, Claire; Allen, Garrick V.;
    Publisher: HAL CCSD
    Country: France

    Ancient Manuscripts and Virtual Research Environments Lausanne, 10–11 September 2020 - Conference report

  • English
    Authors: 
    Edmond, Jennifer; Basaraba, Nicole; Doran, Michelle; Garnett, Vicky; Grile, Courtney Helen; Papaki, Eliza; Tóth-Czifra, Erzsébet;
    Publisher: HAL CCSD
    Country: France
  • Open Access English
    Authors: 
    Luca Foppiano; Laurent Romary;
    Publisher: HAL CCSD
    Country: France
    Project: EC | HIRMEOS (731102)

    International audience; This paper presents an attempt to provide a generic named-entity recognition and disambiguation module (NERD) called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the national infrastructure Huma-Num in France, this service provides an efficient state-of-the-art implementation coupled with standardised interfaces allowing an easy deployment on a variety of potential digital humanities contexts. The topics of accessibility and sustainability have been long discussed in the attempt of providing some best practices in the widely fragmented ecosystem of the DARIAH research infrastructure. The history of entity-fishing has been mentioned as an example of good practice: initially developed in the context of the FP9 CENDARI, the project was well received by the user community and continued to be further developed within the H2020 HIRMEOS project where several open access publishers have integrated the service to their collections of published monographs as a means to enhance retrieval and access.entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API which allows easier and seamless integration, language independent and stable convention and a widely used service oriented architecture (SOA) design. Input and output data are carried out over a query data model with a defined structure providing flexibility to support the processing of partially annotated text or the repartition of text over several queries. The interface implements a variety of functionalities, like language recognition, sentence segmentation and modules for accessing and looking up concepts in the knowledge base. 
The API itself integrates more advanced contextual parametrisation or ranked outputs, allowing for the resilient integration in various possible use cases. The entity-fishing API has been used as a concrete use case3 to draft the experimental stand-off proposal, which has been submitted for integration into the TEI guidelines. The representation is also compliant with the Web Annotation Data Model (WADM).In this paper we aim at describing the functionalities of the service as a reference contribution to the subject of web-based NERD services. In order to cover all aspects, the architecture is structured to provide two complementary viewpoints. First, we discuss the system from the data angle, detailing the workflow from input to output and unpacking each building box in the processing flow. Secondly, with a more academic approach, we provide a transversal schema of the different components taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots and weaknesses. The attempt here is to give a description of the tool and, at the same time, a technical software engineering analysis which will help the reader to understand our choice for the resources allocated in the infrastructure.Thanks to the work of million of volunteers, Wikipedia has reached today stability and completeness that leave no usable alternatives on the market (considering also the licence aspect). The launch of Wikidata in 2010 have completed the picture with a complementary language independent meta-model which is becoming the scientific reference for many disciplines. After providing an introduction to Wikipedia and Wikidata, we describe the knowledge base: the data organisation, the entity-fishing process to exploit it and the way it is built from nightly dumps using an offline process.We conclude the paper by presenting our solution for the service deployment: how and which the resources where allocated. 
The service has been in production since Q3 of 2017 and has been used extensively by the H2020 HIRMEOS partners during the integration with the publishing platforms. We have strived to provide the best performance with the minimum amount of resources. Thanks to the Huma-Num infrastructure we still have the possibility to scale up the infrastructure as needed, for example to support an increase in demand or a temporary need to process a large backlog of documents. In the long term, thanks to this sustainable environment, we are planning to keep delivering the service far beyond the end of the H2020 HIRMEOS project.

  • English
    Authors: 
    Khemakhem, Mohamed;
    Publisher: HAL CCSD
    Project: ANR | BASNUM (ANR-18-CE38-0003), EC | PARTHENOS (654119)

    Dictionaries could be considered the most comprehensive reservoir of human knowledge, carrying not only the lexical description of words in one or more languages, but also the common awareness of a certain community about every known piece of knowledge in a given time frame. Print dictionaries are the principal resources which enable the documentation and transfer of such knowledge. They already exist in abundant numbers, while new ones are continuously compiled, even with the recent strong move to digital resources. However, a majority of these dictionaries, even when available digitally, are still not fully structured due to the absence of scalable methods and techniques that can cover the variety of corresponding material. Moreover, the relatively few existing structured resources present limited exchange and query alternatives, given the discrepancy of their data models and formats. In this thesis we address the task of parsing lexical information in print dictionaries through the design of computer models that enable their automatic structuring. Solving this task goes hand in hand with finding a standardised output for these models to guarantee maximum interoperability among resources and usability for downstream tasks. First, we present different classifications of dictionary resources to delimit the category of print dictionaries we aim to process. Second, we introduce the parsing task by providing an overview of the processing challenges and a study of the state of the art. Then, we present a novel approach based on a top-down parsing of the lexical information. We also outline the architecture of the resulting system, called GROBID-Dictionaries, and the methodology we followed to close the gap between the conception of the system and its applicability to real-world scenarios. After that, we draw the landscape of the leading standards for structured lexical resources.
In addition, we provide an analysis of two ongoing initiatives, TEI-Lex-0 and LMF, that aim at unifying the modelling of lexical information in print and electronic dictionaries. Based on that, we present a serialisation format that is in line with the schemes of the two standardisation initiatives and fits the approach implemented in our parsing system. After presenting the parsing and standardised serialisation facets of our lexical models, we provide an empirical study of their performance and behaviour. The investigation is based on a specific machine learning setup and a series of experiments carried out with a selected pool of varied dictionaries. In this study we present different approaches to feature engineering and show the strengths and limits of the best resulting models. We also dedicate two series of experiments to exploring the scalability of our models with regard to the processed documents and the employed machine learning technique. Finally, we sum up this thesis by presenting the major conclusions and opening new perspectives for extending our investigations in a number of research directions for parsing entry-based documents.

  • Publication . Report . 2020
    English
    Authors: 
    Bertrand, Loïc; Anglos, Demetrios; Castillejo, Marta; Charbonnel, Bénédicte; David, Sophie; de Clercq, Hilde; Dubray, Fanny; Spring, Marika;
    Publisher: HAL CCSD
    Country: France
    Project: EC | E-RIHS PP (739503)

    The study and preservation of tangible cultural and natural heritage is a global challenge for science and society at large. The European Research Infrastructure for Heritage Science (E-RIHS) will play a leading role in research on the interpretation, preservation, documentation and management of heritage. As an interdisciplinary infrastructure, E-RIHS will interconnect knowledge and methodologies to address key scientific questions in the field of heritage as a whole. The infrastructure is built on ten core pillars. It will provide a structured and unified input of large-scale instruments, portable devices, physical and digital archives. Its implementation will focus on scientific excellence, interdisciplinarity and cooperation. In doing so, it will offer unprecedented research opportunities to a wide range of interdisciplinary scientific communities.