
entity-fishing: a DARIAH entity recognition and disambiguation service

Luca Foppiano; Laurent Romary
Open Access; English
  • Published: 19 Nov 2020
  • Publisher: HAL CCSD
  • Country: France
Abstract
This paper presents an attempt to provide a generic named-entity recognition and disambiguation (NERD) module called entity-fishing as a stable online service that demonstrates the possible delivery of sustainable technical services within DARIAH, the European digital research infrastructure for the arts and humanities. Deployed as part of the national infrastructure Huma-Num in France, the service provides an efficient state-of-the-art implementation coupled with standardised interfaces allowing easy deployment in a variety of digital humanities contexts. The topics of accessibility and sustainability have long been discussed in the attempt to establish best practices in the widely fragmented ecosystem of the DARIAH research infrastructure. The history of entity-fishing has been mentioned as an example of good practice: initially developed in the context of the FP7 project CENDARI, it was well received by the user community and further developed within the H2020 HIRMEOS project, where several open access publishers have integrated the service into their collections of published monographs as a means to enhance retrieval and access. entity-fishing implements entity extraction as well as disambiguation against Wikipedia and Wikidata entries. The service is accessible through a REST API, which allows easy and seamless integration, a language-independent and stable convention, and a widely used service-oriented architecture (SOA) design. Input and output data are carried over a query data model with a defined structure, providing the flexibility to support the processing of partially annotated text or the repartition of text over several queries. The interface implements a variety of functionalities, such as language recognition, sentence segmentation, and modules for accessing and looking up concepts in the knowledge base. The API also integrates more advanced contextual parametrisation and ranked outputs, allowing for resilient integration in various possible use cases. The entity-fishing API has been used as a concrete use case to draft the experimental stand-off annotation proposal which has been submitted for integration into the TEI guidelines; the representation is also compliant with the Web Annotation Data Model (WADM). In this paper we aim to describe the functionalities of the service as a reference contribution to the subject of web-based NERD services. To cover all aspects, the description is structured around two complementary viewpoints. First, we discuss the system from the data angle, detailing the workflow from input to output and unpacking each building block in the processing flow. Second, with a more academic approach, we provide a transversal schema of the different components, taking into account non-functional requirements in order to facilitate the discovery of bottlenecks, hotspots, and weaknesses. The attempt here is to give a description of the tool and, at the same time, a technical software-engineering analysis which will help the reader understand our choices for the resources allocated in the infrastructure. Thanks to the work of millions of volunteers, Wikipedia has today reached a stability and completeness that leave no usable alternative on the market (also considering the licence aspect). The launch of Wikidata in 2012 has completed the picture with a complementary, language-independent meta-model which is becoming the scientific reference for many disciplines.
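
To make the query data model concrete, the following is a minimal client sketch that posts a text to the disambiguation endpoint and prints the resolved entities. The base URL, the multipart field name ("query"), and the response fields shown are assumptions taken from the public entity-fishing documentation and may differ on a given deployment such as the Huma-Num instance.

    import json
    import requests

    # Assumed public entity-fishing instance; adjust to the deployment in use.
    BASE_URL = "https://cloud.science-miner.com/nerd/service"

    # Query data model: the text to process plus optional hints (language, mention sources).
    query = {
        "text": "Austria invaded and fought the Serbian army at the Battle of Cer.",
        "language": {"lang": "en"},        # optional: omit to let the service detect the language
        "mentions": ["ner", "wikipedia"],  # which mention-detection strategies to use (assumed values)
        "nbest": False,
    }

    # The JSON query is sent as a multipart form field named "query".
    response = requests.post(
        BASE_URL + "/disambiguate",
        files={"query": (None, json.dumps(query))},
        timeout=60,
    )
    response.raise_for_status()

    # Each entity carries its surface form, offsets and the linked Wikidata/Wikipedia identifiers.
    for entity in response.json().get("entities", []):
        print(
            entity.get("rawName"),
            entity.get("offsetStart"),
            entity.get("offsetEnd"),
            entity.get("wikidataId"),
        )

The same query object can also carry pre-annotated entities or be split over several requests, which is how the processing of partially annotated text and the repartition of text over several queries mentioned above are supported.
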
After providing an introduction to Wikipedia and Wikidata, we describe the knowledge base: its data organisation, the way entity-fishing exploits it, and the way it is built from nightly dumps using an offline process. We conclude the paper by presenting our solution for the service deployment: how and which resources were allocated. The service has been in production since Q3 2017 and has been used extensively by the H2020 HIRMEOS partners during the integration with their publishing platforms. We have strived to provide the best performance with the minimum amount of resources. Thanks to the Huma-Num infrastructure we retain the possibility to scale up as needed, for example to support an increase in demand or a temporary need to process a huge backlog of documents. In the long term, thanks to this sustainable environment, we plan to keep delivering the service far beyond the end of the H2020 HIRMEOS project.
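
As a complement, here is a minimal sketch of a knowledge-base lookup, retrieving the entry associated with a Wikidata identifier such as one returned by the disambiguation step above. The endpoint path, the lang parameter, and the response fields are likewise assumptions based on the public entity-fishing documentation.

    import requests

    # Assumed public entity-fishing instance; adjust to the deployment in use.
    BASE_URL = "https://cloud.science-miner.com/nerd/service"

    def get_concept(identifier: str, lang: str = "en") -> dict:
        """Fetch the knowledge-base entry (labels, definition, statements)
        for a Wikidata identifier such as 'Q90'."""
        response = requests.get(
            f"{BASE_URL}/kb/concept/{identifier}",
            params={"lang": lang},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        concept = get_concept("Q90")  # Q90 is the Wikidata identifier for Paris
        # Field names are assumptions; inspect the full payload for the actual structure.
        print(concept.get("preferredTerm"), concept.get("wikipediaExternalRef"))
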
Subjects
free text keywords: [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, [INFO] Computer Science [cs], Computer science, Schema (psychology), Annotation, World Wide Web, Use case, Software, business.industry, business, Software deployment, Knowledge base, Architecture, Workflow