Publisher: Japanese Association for Digital Humanities
Project: EC | HIRMEOS (731102)
International audience; This paper presents an attempt to provide a generic named-entity recognition and disambiguation (NERD) module called entity-fishing as a stable online service, demonstrating the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the French national infrastructure Huma-Num, this service provides an efficient state-of-the-art implementation coupled with standardised interfaces, allowing easy deployment in a variety of digital humanities contexts. The topics of accessibility and sustainability have long been discussed in the attempt to establish best practices in the widely fragmented ecosystem of the DARIAH research infrastructure. The history of entity-fishing has been cited as an example of good practice: initially developed in the context of the FP7 project CENDARI, it was well received by the user community and continued to be developed within the H2020 HIRMEOS project, where several open access publishers have integrated the service into their collections of published monographs as a means to enhance retrieval and access.

entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API, which allows easy and seamless integration, a language-independent and stable convention, and a widely used service-oriented architecture (SOA) design. Input and output data are exchanged via a query data model with a defined structure, providing the flexibility to process partially annotated text or to distribute a text across several queries. The interface implements a variety of functionalities, such as language recognition, sentence segmentation, and modules for accessing and looking up concepts in the knowledge base.
The API also integrates more advanced contextual parametrisation and ranked outputs, allowing for resilient integration in a variety of use cases. The entity-fishing API has been used as a concrete use case to draft the experimental stand-off proposal which has been submitted for integration into the TEI guidelines. The representation is also compliant with the Web Annotation Data Model (WADM).

In this paper we aim to describe the functionalities of the service as a reference contribution on the subject of web-based NERD services. In order to cover all aspects, the description is structured to provide two complementary viewpoints. First, we discuss the system from the data angle, detailing the workflow from input to output and unpacking each building block in the processing flow. Secondly, with a more academic approach, we provide a transversal schema of the different components, taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots and weaknesses. The aim here is to give a description of the tool and, at the same time, a technical software-engineering analysis that will help the reader understand our choices for the resources allocated in the infrastructure.

Thanks to the work of millions of volunteers, Wikipedia has today reached a stability and completeness that leave no usable alternative on the market (considering also the licensing aspect). The launch of Wikidata in 2012 has completed the picture with a complementary, language-independent meta-model which is becoming the scientific reference for many disciplines. After providing an introduction to Wikipedia and Wikidata, we describe the knowledge base: the data organisation, the way entity-fishing exploits it, and the way it is built from nightly dumps using an offline process. We conclude the paper by presenting our solution for the service deployment: how and which resources were allocated.
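To make the query data model described above more concrete, the following Python sketch builds a disambiguation query and reads entities back from a response. The field names ("text", "language", "mentions", "nbest", "rawName", "wikidataId", "nerd_selection_score") and the Huma-Num endpoint follow the published entity-fishing documentation, but they are assumptions here and should be checked against the live API.

```python
# Hedged sketch of the entity-fishing query data model: field names and the
# endpoint follow the public documentation and may differ in the live service.

def build_query(text, lang="en"):
    """Assemble the JSON query object sent to the disambiguation endpoint."""
    return {
        "text": text,                      # raw text to annotate
        "language": {"lang": lang},        # omit to let the service detect it
        "mentions": ["ner", "wikipedia"],  # mention-detection strategies
        "nbest": False,                    # only the best-ranked entity per mention
    }

def extract_entities(response):
    """Return (surface form, Wikidata id, confidence) for each resolved entity."""
    return [
        (e.get("rawName"), e.get("wikidataId"), e.get("nerd_selection_score"))
        for e in response.get("entities", [])
    ]

# Actual call (requires the third-party `requests` package; the endpoint below
# is the assumed DARIAH/Huma-Num deployment and may change):
#
#   import json, requests
#   r = requests.post("https://nerd.huma-num.fr/nerd/service/disambiguate",
#                     files={"query": (None, json.dumps(build_query(text)))})
#   entities = extract_entities(r.json())
```

Wrapping the text in a structured query object, rather than posting raw text, is what allows partially annotated input or a text split across several queries to be handled uniformly.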
The service has been in production since Q3 2017 and has been used extensively by the H2020 HIRMEOS partners during the integration with their publishing platforms. We have strived to provide the best performance with the minimum amount of resources. Thanks to the Huma-Num infrastructure, we retain the possibility to scale up as needed, for example to support increased demand or a temporary need to process a large backlog of documents. In the long term, thanks to this sustainable environment, we plan to keep delivering the service well beyond the end of the H2020 HIRMEOS project.
This poster was awarded the Best Poster Award at the DARIAH2020 virtual annual event (https://twitter.com/dariaheu/status/1327290958971609090?s=21). In order to provide the global community of scholars working in this field with a greater understanding of the current Spanish scenario, LINHD has recently promoted research on the evolution of Digital Humanities in Spain over the last 25 years, a timeframe comparable with Unsworth's first formulation of the scholarly primitives. More than 1,000 records have been mapped, distributed as follows: 577 researchers; 368 projects; 88 resources; 9 post-graduate courses; and 8 specialised journals. Digital resources (i.e. repositories of documents, collections of artefacts, crowdsourcing platforms, dictionaries, databases, etc.), which are the object of this poster, have mostly been produced with the aim of publishing a service that improves the day-to-day research workflow in the Humanities. Our initial objectives were: to classify and describe the digital resources mapped according to the classical and new scholarly primitives, in order to highlight presences, absences and recurring associations of these categories; to visualize the relationships between the scholarly primitives and other dimensions in our data, such as discipline and typology; and to identify how the introduction of digital tools and methods has affected the basic functions of research in the Humanities in Spain over time. The data analysed are part of a larger dataset that can be downloaded at https://doi.org/10.5281/zenodo.3893546; the whole dataset has been extensively analysed in https://doi.org/10.3145/epi.2020.nov.01
International audience; This article presents an overview of the approaches and results of our participation in the CLEF HIPE 2020 NERC-COARSE-LIT and EL-ONLY tasks for English and French. For these two tasks, we use two systems: 1) DeLFT, a deep learning framework for text processing; and 2) entity-fishing, a generic named-entity recognition and disambiguation service deployed in the technical framework of INRIA.
The paper presents Intergraph, a graph-based visual analytics technical demonstrator for the exploration and study of content in historical document collections. The prototype is motivated by a practical use case on a corpus of circa 15,000 digitized resources about European integration since 1945. The corpus allowed us to generate a dynamic multilayer network which represents the different kinds of named entities appearing and co-appearing in the collections. To our knowledge, Intergraph is one of the first interactive tools to visualize dynamic multilayer graphs for collections of digitized historical sources. Graph visualization and interaction methods have been designed to meet the content-exploration requirements of non-technical users without a strong background in network science, and to compensate for common flaws in named-entity annotation. Users work with self-selected subsets of the overall data by interacting with a scene of small graphs which can be added, altered and compared. This allows interest-driven navigation in the corpus and the discovery of the interconnections of its entities across time.
International audience; Because manuscripts are lost, burned, torn apart or thrown away, it is as complex as it is crucial for any philologist preparing an edition to know how many of them still exist. Thanks to a (semi-)automatic and fully open-source workflow, we have extracted, structured and annotated hundreds of manuscript sale catalogues published in 19th-c. Paris. The obtained level of granularity allows us not only to reconcile different sales of a single item sold multiple times, but also to identify whether a manuscript is now kept in a library. Using Sévigné as a test case, we were able to calculate that c. 1% of her manuscripts remain to be found because they are still circulating on the private market. All the data we produced remain available for similar research on other authors.
International audience; The aim of this talk is to present the methodology used to reorganise the PACTOLS thesaurus of Frantiq, launched within the framework of the MASA consortium. PACTOLS is a multilingual and open repository covering archaeology from Prehistory to the present as well as Classics. It is organized into micro-thesauri whose initials form its name (Peuples, Anthroponymes, Chronologie, Toponymes, Oeuvres, Lieux, Sujets). The goal is to turn it into a tool interoperable with information systems beyond its original documentary purpose, and usable by archaeologists as a repository for managing scientific data. During the talk, we will describe the choice of tools, the organisation of work within the steering group, and the collaborations with specialists for the upgrading and development of the vocabulary, while showing the strengths and limitations of some experiments. Above all, it will show how the introduction of the conceptual categories of the DARIAH BackBone Thesaurus, modelled on the CIDOC CRM ontology, through a progressive deconstruction/reconstruction process, eventually had an impact on all the micro-thesauri and called into question the organisation of knowledge proposed so far.