publication . Conference object . 2017

Towards a IIIF-based corpus management platform

Joke Daems; Sally Chambers; Tecle Zere; Christophe Verbruggen
Open Access
  • Published: 03 Jul 2017
Abstract
The digital text platform is part of the Flemish contribution to DARIAH Belgium (DARIAH = Digital Research Infrastructure for the Arts and Humanities). The goal is to create a platform for the collaborative management and discovery of digitised textual collections that allows digital humanities researchers to prepare their corpora (consisting of, for example, digitised newspapers and books) for textual analysis. The platform will enable researchers to browse and search the digitised collections that they have compiled, cleaned, enriched and managed themselves. Once the relevant research sub-corpus has been compiled, data export tools using standardised open formats (such as XML, JSON, CSV and plain text) will enable researchers to export the sub-corpus for analysis with existing digital text analysis tools such as MALLET (http://mallet.cs.umass.edu/topics.php) for topic modelling, VOYANT (http://voyant-tools.org) for data visualisation or AntConc (http://www.laurenceanthony.net/software/antconc/) for concordance and textual analysis.

The platform has been conceived as part of a larger, modular virtual research environment service infrastructure (http://www.ghentcdh.ugent.be/projects/dariah-vl_vre.si). In a previous phase, possible frameworks and content management systems were tested, notably Islandora (a digital asset management system based on Fedora Commons and Drupal), but also MediaWiki and Omeka.

One of the main challenges for the envisaged new platform is integrating a wider variety of textual data streams (including a scan workflow). In addition, user-friendliness, scalability, adherence to standards and the interoperability of data are key issues to be addressed. The platform will build on IIIF, the International Image Interoperability Framework. IIIF is used by some of the most important libraries and cultural heritage institutions in the world, and therefore provides access to enormous collections of digital objects. As the name suggests, IIIF is mainly focused on displaying and annotating images. However, we fully endorse the IIIF community's vision of developing an overarching interoperability framework for other data types, including all kinds of textual data. Benefits of the framework include its interoperability, the ease of sharing images and annotations without the need to exchange files, and its support for multilingual data.

In the months leading up to the conference, we will evaluate existing IIIF-powered digital libraries and research projects, and how they deal with practices of co-creation, data cleaning and enrichment of (structural) metadata. OCR improvement will become vital, as digital textual analysis can only be performed well on high-quality textual data. A related challenge will be combining the various input formats and converting them to the different output formats required for analysis. In our poster, we will present a summary of our experiences with, and technical assessment of, our previous Islandora installation, in addition to our survey of the existing corpus management solutions. By way of conclusion, we will introduce the envisioned new version of the platform.
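
To illustrate the point about sharing images without exchanging files: under the IIIF Image API, an image (or a region of it) is requested through a URL of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}, so a link is enough to share a page scan. The Python sketch below only assembles such a URL. The helper name iiif_image_url, the server address and the identifier are hypothetical placeholders introduced for illustration; they are not endpoints or code of the platform described in the abstract.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble an IIIF Image API request URL (illustrative helper).

    base_url and identifier are assumptions supplied by the caller;
    the path layout follows the IIIF Image API URI template.
    """
    # The identifier must be percent-encoded, including any slashes.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


if __name__ == "__main__":
    # Hypothetical server and identifier, for demonstration only.
    print(iiif_image_url("https://example.org/iiif", "newspaper/page 1"))
    # A 500x500 pixel crop from the top-left corner, scaled to 250 px wide.
    print(iiif_image_url("https://example.org/iiif", "newspaper/page 1",
                         region="0,0,500,500", size="250,"))
```

Whether a given server accepts "max" or the older "full" as the size value depends on the Image API version it implements, so the default here is an assumption that may need adjusting.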
Subjects
free text keywords: DARIAH, DARIAH-BE, Digital Humanities, Text analysis, Corpus management, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, [SHS]Humanities and Social Sciences