Advanced search in Research products
The following results are related to DARIAH EU. Are you interested to view more results? Visit OpenAIRE - Explore.
13 Research products, page 1 of 2

  • DARIAH EU
  • Publications
  • Other research products
  • 2018-2022
  • Open Access
  • Article
  • arXiv.org e-Print Archive

  • Publication . Article . Preprint . 2020
    Open Access English
    Authors: 
    Del Gratta, Riccardo;

    In this article, we propose a Category Theory approach to (syntactic) interoperability between linguistic tools. The resulting category consists of textual documents, including any linguistic annotations, NLP tools that analyze texts and add additional linguistic information, and format converters. Format converters are necessary to make the tools both able to read and to produce different output formats, which is the key to interoperability. The idea behind this document is the parallelism between the concepts of composition and associativity in Category Theory and NLP pipelines. We show how pipelines of linguistic tools can be modeled into the conceptual framework of Category Theory and we successfully apply this method to two real-life examples. Paper submitted to Applied Category Theory 2020 and accepted for the Virtual Poster Session.

  • Publication . Article . Preprint . 2019 . Embargo End Date: 01 Jan 2019
    Open Access
    Authors: 
    Kolar, Jana; Cugmas, Marjan; Ferligoj, Anuška;
    Publisher: arXiv
    Project: EC | ACCELERATE (731112)

    In 2018, the European Strategic Forum for research infrastructures (ESFRI) was tasked by the Competitiveness Council, a configuration of the Council of the EU, to develop a common approach for monitoring of Research Infrastructures' performance. To this end, ESFRI established a working group, which has proposed 21 Key Performance Indicators (KPIs) to monitor the progress of the Research Infrastructures (RIs) addressed towards their objectives. The RIs were then asked to assess their relevance for their institution. The paper aims to identify the relevance of certain indicators for particular groups of RIs by using cluster and discriminant analysis. This could contribute to development of a monitoring system, tailored to particular RIs. To obtain a typology of the RIs, we first performed cluster analysis of the RIs according to their properties, which revealed clusters of RIs with similar characteristics, based on the domain of operation, such as food, environment or engineering. Then, discriminant analysis was used to study how the relevance of the KPIs differs among the obtained clusters. This analysis revealed that the percentage of RIs correctly classified into five clusters, using the KPIs, is 80%. Such a high percentage indicates that there are significant differences in the relevance of certain indicators, depending on the ESFRI domain of the RI. The indicators therefore need to be adapted to the type of infrastructure. It is therefore proposed that the Strategic Working Groups of ESFRI addressing specific domains should be involved in the tailored development of the monitoring of pan-European RIs. Comment: 15 pages, 8 tables, 3 figures

  • Open Access English
    Authors: 
    Rizza, Ettore; Chardonnens, Anne; Van Hooland, Seth;
    Publisher: HAL CCSD
    Countries: France, Belgium

    More and more cultural institutions use Linked Data principles to share and connect their collection metadata. In the archival field, initiatives emerge to exploit data contained in archival descriptions and adapt encoding standards to the semantic web. In this context, online authority files can be used to enrich metadata. However, relying on a decentralized network of knowledge bases such as Wikidata, DBpedia or even VIAF has its own difficulties. This paper aims to offer a critical view of these linked authority files by adopting a close-reading approach. Through a practical case study, we intend to identify and illustrate the possibilities and limits of RDF triples compared to institutions' less structured metadata. Comment: DARIAH workshop "Trust and Understanding: the value of metadata in a digitally joined-up world" (14/05/2018, Brussels), preprint of the submission to the journal "Archives et Bibliothèques de Belgique"

  • Publication . Article . Preprint . Conference object . 2019
    Open Access
    Authors: 
    Lilia Simeonova; Kiril Simov; Petya Osenova; Preslav Nakov;
    Publisher: Incoma Ltd., Shoumen, Bulgaria

    We propose a morphologically informed model for named entity recognition, which is based on an LSTM-CRF architecture and combines word embeddings, Bi-LSTM character embeddings, part-of-speech (POS) tags, and morphological information. While previous work has focused on learning from raw word input, using word and character embeddings only, we show that for morphologically rich languages, such as Bulgarian, access to POS information contributes more to the performance gains than the detailed morphological information. Thus, we show that named entity recognition needs only coarse-grained POS tags, but at the same time it can benefit from simultaneously using some POS information of different granularity. Our evaluation results over a standard dataset show sizable improvements over the state-of-the-art for Bulgarian NER. Comment: named entity recognition; Bulgarian NER; morphology; morpho-syntax

  • Publication . Article . Preprint . 2020 . Embargo End Date: 01 Jan 2020
    Open Access
    Authors: 
    Zamani, Maryam; Tejedor, Alejandro; Vogl, Malte; Krautli, Florian; Valleriani, Matteo; Kantz, Holger;
    Publisher: arXiv

    We investigated the evolution and transformation of scientific knowledge in the early modern period, analyzing more than 350 different editions of textbooks used for teaching astronomy in European universities from the late fifteenth century to mid-seventeenth century. These historical sources constitute the Sphaera Corpus. By examining different semantic relations among individual parts of each edition on record, we built a multiplex network consisting of six layers, as well as the aggregated network built from the superposition of all the layers. The network analysis reveals the emergence of five different communities. The contribution of each layer in shaping the communities and the properties of each community are studied. The most influential books in the corpus are found by calculating the average age of all the out-going and in-coming links for each book. A small group of editions is identified as a transmitter of knowledge as they bridge past knowledge to the future through a long temporal interval. Our analysis, moreover, identifies the most disruptive books. These books introduce new knowledge that is then adopted by almost all the books published afterwards until the end of the whole period of study. The historical research on the content of the identified books, as an empirical test, finally corroborates the results of all our analyses. Comment: 19 pages, 9 figures

  • Publication . Other literature type . Article . Preprint . 2021
    Open Access

    The concept of literary genre is a highly complex one: not only are different genres frequently defined on several, but not necessarily the same, levels of description, but consideration of genres as cognitive, social, or scholarly constructs with a rich history further complicates the matter. This contribution focuses on thematic aspects of genre with a quantitative approach, namely Topic Modeling. Topic Modeling has proven to be useful to discover thematic patterns and trends in large collections of texts, with a view to classifying or browsing them on the basis of their dominant themes. It has rarely if ever, however, been applied to collections of dramatic texts. In this contribution, Topic Modeling is used to analyze a collection of French Drama of the Classical Age and the Enlightenment. The general aim of this contribution is to discover what semantic types of topics are found in this collection, whether different dramatic subgenres have distinctive dominant topics and plot-related topic patterns, and inversely, to what extent clustering methods based on topic scores per play produce groupings of texts which agree with more conventional genre distinctions. This contribution shows that interesting topic patterns can be detected which provide new insights into the thematic, subgenre-related structure of French drama as well as into the history of French drama of the Classical Age and the Enlightenment. Comment: 11 figures

  • Publication . Article . Preprint . 2018
    Open Access English
    Authors: 
    Nadia Boukhelifa; Michael Bryant; Natasa Bulatovic; Ivan Čukić; Jean-Daniel Fekete; Milica Knežević; Jörg Lehmann; David I. Stuart; Carsten Thiel;
    Publisher: HAL CCSD
    Countries: United Kingdom, France
    Project: EC | CENDARI (284432)

    The CENDARI infrastructure is a research-supporting platform designed to provide tools for transnational historical research, focusing on two topics: medieval culture and World War I. It exposes to the end users modern Web-based tools relying on a sophisticated infrastructure to collect, enrich, annotate, and search through large document corpora. Supporting researchers in their daily work is a novel concern for infrastructures. We describe how we gathered requirements through multiple methods to understand historians' needs and derive an abstract workflow to support them. We then outline the tools that we have built, tying their technical descriptions to the user requirements. The main tools are the note-taking environment and its faceted search capabilities; the data integration platform including the Data API, supporting semantic enrichment through entity recognition; and the environment supporting the software development processes throughout the project to keep both technical partners and researchers in the loop. The outcomes are technical together with new resources developed and gathered, and the research workflow that has been described and documented.

  • Publication . Article . Conference object . Preprint . 2018 . Embargo End Date: 01 Jan 2018
    Open Access
    Authors: 
    Christoph Hube; Besnik Fetahu;
    Publisher: arXiv
    Project: EC | DESIR (731081), EC | ALEXANDRIA (339233), EC | AFEL (687916)

    Biased language commonly occurs around topics which are of controversial nature, thus stirring disagreement between the different involved parties of a discussion. This is due to the fact that for language and its use, specifically, the understanding and use of phrases, the stances are cohesive within the particular groups. However, such cohesiveness does not hold across groups. In collaborative environments or environments where impartial language is desired (e.g. Wikipedia, news media), statements and the language therein should represent equally the involved parties and be neutrally phrased. Biased language is introduced through the presence of inflammatory words or phrases, or statements that may be incorrect or one-sided, thus violating such consensus. In this work, we focus on the specific case of phrasing bias, which may be introduced through specific inflammatory words or phrases in a statement. For this purpose, we propose an approach that relies on recurrent neural networks in order to capture the inter-dependencies between words in a phrase that introduce bias. We perform a thorough experimental evaluation, where we show the advantages of a neural based approach over competitors that rely on word lexicons and other hand-crafted features in detecting biased language. We are able to distinguish biased statements with a precision of P=0.92, thus significantly outperforming baseline models with an improvement of over 30%. Finally, we release the largest corpus of statements annotated for biased language. Comment: The Twelfth ACM International Conference on Web Search and Data Mining, February 11-15, 2019, Melbourne, VIC, Australia

  • Publication . Article . Preprint . 2021 . Embargo End Date: 01 Jan 2021
    Open Access
    Authors: 
    Papadopoulou, Maria; Smyrnaiou, Zacharoula;
    Publisher: arXiv

    Digital technologies, such as the Internet and Artificial Intelligence, are part of our daily lives, influencing broader aspects of our way of life, as well as the way we interact with the past. Having dramatically changed the ways in which knowledge is produced and consumed, the algorithmic age has also radically changed the relationship that the general public has with History. Fields of History such as Public and Oral History have particularly benefitted from the rise of digital culture. How does our digital culture affect the way we think, study, research and teach the past, as historical evidence spreads rapidly in the public sphere? How do digital technologies promote the study, writing and teaching of History? What should historians, students of history and pre-service history teachers be critically aware of, when swarmed with digitized or born-digital content, constantly growing on the Internet? And while these changes are now visible globally, how is the discipline of History situated within the digital transformation rapidly advancing in Greece? Finally, what are the consequences of these changes for History as a subject taught at Greek secondary schools? These are some of the issues raised in the text that follows, which is part of the course materials of the undergraduate course offered during winter semester 2020-2021 at the University of Athens, School of Philosophy, Pedagogy, Psychology. Course Title: 'Pedagogics of History: Theory and Practice', Academic Institution: School of Philosophy-Pedagogy-Psychology, University of Athens. Comment: 47 pages, in Greek, 8 figures

  • Publication . Preprint . Article . 2019
    Open Access English
    Authors: 
    Bamman, David; Lewke, Olivia; Mansoor, Anya;

    We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction. This dataset differs from previous coreference datasets in containing documents whose average length (2,105.3 words) is four times longer than other benchmark datasets (463.7 for OntoNotes), and contains examples of difficult coreference problems common in literature. This dataset allows for an evaluation of cross-domain performance for the task of coreference resolution, and analysis into the characteristics of long-distance within-document coreference.
