Licence Creative Commons Attribution 4.0 (CC-BY); By making large amounts of text available to us, the digital age prompts us to develop new, computer-supported reading strategies capable of dealing with such quantities of text. One such "distant reading" strategy is stylometry, a method of quantitative text analysis that relies on the frequencies of certain linguistic features, such as words, letters or grammatical units, to statistically assess the relative similarity of texts and to classify them on this basis. This method is applied here to French drama of the seventeenth century, more precisely to the now famous "Corneille/Molière controversy", in which some researchers claim that Pierre Corneille wrote several of the plays traditionally attributed to Molière. The methodological challenge, it is shown here, lies in the fact that categories such as authorship, genre (comedy vs. tragedy) and literary form (prose vs. verse) all influence stylometric distance measures and classification. Cross-genre and cross-form authorship attribution needs to distinguish such competing signals if it is to produce reliable attribution results. This contribution describes two attempts to accomplish this: parameter optimization and feature-range selection. It concludes with some more general remarks on the use of quantitative methods in a hermeneutic discipline such as literary studies.
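As a rough illustration of the kind of computation stylometry rests on, the following sketch derives most-frequent-word profiles from three toy token lists and compares them with Burrows' Delta, a standard stylometric distance. The "plays" and the vocabulary size are invented for the example; the actual study works with full dramatic texts and optimized parameters:

```python
from collections import Counter
from statistics import mean, stdev

# Toy "plays": short token lists standing in for full texts.
corpus = {
    "play_A": "le roi et la reine et le roi".split(),
    "play_B": "la reine le peuple et la reine la".split(),
    "play_C": "le peuple et le roi et le peuple".split(),
}

# Feature set: the most frequent words across the whole corpus.
all_tokens = [t for toks in corpus.values() for t in toks]
vocab = [w for w, _ in Counter(all_tokens).most_common(4)]

def rel_freqs(tokens):
    """Relative frequency of each vocabulary word in one text."""
    c = Counter(tokens)
    return [c[w] / len(tokens) for w in vocab]

profiles = {name: rel_freqs(toks) for name, toks in corpus.items()}

# Corpus-wide mean and standard deviation per feature, for z-scoring.
means = [mean(p[i] for p in profiles.values()) for i in range(len(vocab))]
sds = [stdev(p[i] for p in profiles.values()) for i in range(len(vocab))]

def delta(a, b):
    """Burrows' Delta: mean absolute difference of z-scored frequencies."""
    za = [(f - m) / s for f, m, s in zip(profiles[a], means, sds)]
    zb = [(f - m) / s for f, m, s in zip(profiles[b], means, sds)]
    return sum(abs(x - y) for x, y in zip(za, zb)) / len(vocab)

print(round(delta("play_A", "play_B"), 3), round(delta("play_A", "play_C"), 3))
```

Attribution then amounts to assigning a disputed text to the candidate whose profile lies at the smallest distance; the paper's point is that genre and form can shrink or inflate exactly these distances.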
International audience; This paper describes the workflow of the Grammateus project, from gathering data on Greek documentary papyri to the creation of a web application. The first stage is the selection of a corpus and the choice of metadata to record: papyrology specialists gather data from printed editions, existing online resources and digital facsimiles. In the next step, this data is transformed into the EpiDoc standard of XML TEI encoding, to facilitate its reuse by others, and processed for HTML display. We also reuse existing text transcriptions available on . Since these transcriptions may be regularly updated by the scholarly community, we aim to access them dynamically. Although the transcriptions follow the EpiDoc guidelines, the wide diversity of the papyri as well as small inconsistencies in encoding make data reuse challenging. Currently, our data is available on an institutional GitLab repository, and we will archive our final dataset according to the FAIR principles.
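To give a concrete sense of what the transformation into TEI encoding produces, here is a minimal Python sketch that turns one hypothetical metadata record into a TEI-namespaced skeleton (title, identifier, one line of text). All names and values are invented for illustration, and the output is a simplified skeleton in the spirit of EpiDoc, not a complete, schema-valid EpiDoc document:

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata for one papyrus, as a specialist might record it
# from a printed edition (all values are illustrative only).
record = {
    "title": "Loan of money",
    "idno": "P.Example 42",
    "text": "example transcription line",
}

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)

def el(tag, parent=None, text=None, **attrs):
    """Create a TEI-namespaced element, optionally attached to a parent."""
    qname = f"{{{TEI}}}{tag}"
    node = ET.SubElement(parent, qname, attrs) if parent is not None else ET.Element(qname, attrs)
    if text is not None:
        node.text = text
    return node

root = el("TEI")
header = el("teiHeader", root)
file_desc = el("fileDesc", header)
title_stmt = el("titleStmt", file_desc)
el("title", title_stmt, text=record["title"])
source_desc = el("sourceDesc", file_desc)
ms_id = el("msIdentifier", el("msDesc", source_desc))
el("idno", ms_id, text=record["idno"])
body = el("body", el("text", root))
edition = el("div", body, type="edition")
el("ab", edition, text=record["text"])

xml = ET.tostring(root, encoding="unicode")
print(xml)
```

Encoding thousands of diverse papyri into such a uniform structure is what makes later HTML display and reuse possible; the inconsistencies mentioned above arise precisely when different encoders fill in such skeletons in slightly different ways.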
Publication: book chapter, 2016.
International audience; This chapter gives an overview of one possible staged methodology for structuring LCI data by presenting a new scientific object, LEarning and TEaching Corpora (LETEC). Firstly, the chapter clarifies the notion of corpora, used in so many different ways in language studies, and underlines how corpora differ from raw language data. Secondly, using examples taken from actual online learning situations, the chapter illustrates the methodology used to collect, transform and organize data from online learning situations so as to make them shareable through open-access repositories. The ethics of, and rights involved in, releasing a corpus as OpenData are also discussed. Thirdly, the authors suggest how the transcription of interactions may become more systematic, and what benefits may be expected from analysis tools, before opening up the CALL research perspective on LCI towards its applications in teacher training in Computer-Mediated Communication (CMC) and the common interests the CALL field shares with researchers in Corpus Linguistics working on CMC.
International audience; The thesis advanced here locates the conversion manoeuvres in a diametrically opposed faith community: a religion of Big Data does indeed exist, and one of its gospels is called network visualization. Nothing works without the network; everything is a network. Certainly, the flood of data on the one hand and the links between these very data on the other make it necessary to find orientation. In the course of this, intellectual and literary history has been burdened with recourse to network analysis; the cross to be carried is precisely the network. But do networks and their visualizations really provide the orientation the humanities need? What are networks good for in literary studies? What do they allow us to do that we could not accomplish otherwise? Approaches to answering these questions are presented in three steps. First, I begin with the basic question "What is a network?" My aim here is to outline what makes a "good" network, that is, a network from which meaningful information can be gained from the perspective of literary studies. In the second part, I present digital editions of correspondence (in particular my own) and the points of connection they offer for network models. In a third part, finally, I address the embedding of network models in concrete textual work.
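What "network" means here can be made concrete in a few lines: in a correspondence edition, the nodes are persons, and a weighted, undirected edge counts the letters exchanged between two of them. The names below are purely illustrative, and the degree count is only the simplest of the measures one might apply:

```python
from collections import Counter

# Hypothetical letter metadata (sender, recipient) from a digital edition;
# the persons named are purely illustrative.
letters = [
    ("Schlegel", "Novalis"),
    ("Novalis", "Schlegel"),
    ("Schlegel", "Tieck"),
    ("Tieck", "Schlegel"),
]

# Weighted, undirected network: an edge links two correspondents,
# its weight counts the letters exchanged between them.
edges = Counter(frozenset(pair) for pair in letters)

# Degree: with how many distinct correspondents is each person linked?
degree = Counter()
for pair in edges:
    for person in pair:
        degree[person] += 1

for person, d in degree.most_common():
    print(person, d)
```

Even this toy model exposes the interpretive step the text insists on: whether such a graph is a "good" network depends on whether counting letters and correspondents actually answers a literary-historical question.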
International audience; One of the funded project proposals under DARIAH’s Open Humanities call 2015 was “Open History: Sustainable digital publishing of archival catalogues of twentieth-century history archives”. Building on the experiences of the Collaborative EuropeaN Digital Archival Research Infrastructure (CENDARI) and the European Holocaust Research Infrastructure (EHRI), the main goal of the “Open History” project was to enhance the dialogue between (meta-)data providers and research infrastructures. Integrating archival descriptions, where they already existed, held at a wide variety of twentieth-century history archives (from classic archives to memorial sites, libraries and private archives) into research infrastructures has proven to be a major challenge, one that could not be met without preparatory work ranging from limited to extensive pre-processing. The “Open History” project organized two workshops and developed two resources: an easily accessible, general article on why the practice of standardization and sharing is important and how it can be achieved; and a model providing checklists for the self-analysis of archival institutions. The text that follows is the article we developed. It intentionally remains at a general level, with little jargon, so that it can be read easily by non-archivists and non-IT specialists. We therefore hope it will be accessible both to those who describe the sources at various archives (with or without IT or archival-science degrees) and to decision-makers (directors and advisory boards) who wish to understand the benefits of investing in the standardization and sharing of data. It is important to note that this text is a first step, not a static, final result. Not all aspects of the standardization and publication of (meta-)data are discussed, nor are update or feedback mechanisms for annotations and comments.
The idea is that this text can be used in full or in part, and that it will gain further chapters and section updates as time goes by and as other communities begin using it. Some archives will read through much of it and see confirmation of what they have already been implementing; others, especially smaller institutions such as private memory institutions, will find it a low-key, hands-on introduction to support their efforts.