11 Research products, page 1 of 2
- Publication . Article . Other literature type . Conference object . 2020 . Open Access . English . Authors: Stefan Bornhofen; Marten Düring . Publisher: HAL CCSD . Country: France . Project: ANR | BLIZAAR (ANR-15-CE23-0002)
Abstract: The paper presents Intergraph, a graph-based visual analytics technical demonstrator for the exploration and study of content in historical document collections. The designed prototype is motivated by a practical use case on a corpus of circa 15,000 digitized resources about European integration since 1945. The corpus allowed generating a dynamic multilayer network which represents different kinds of named entities appearing and co-appearing in the collections. To our knowledge, Intergraph is one of the first interactive tools to visualize dynamic multilayer graphs for collections of digitized historical sources. Graph visualization and interaction methods have been designed based on user requirements for content exploration by non-technical users without a strong background in network science, and to compensate for common flaws in the annotation of named entities. Users work with self-selected subsets of the overall data by interacting with a scene of small graphs which can be added, altered and compared. This allows interest-driven navigation in the corpus and the discovery of the interconnections of its entities across time.
- Publication . Article . Preprint . 2019 . Open Access . English . Authors: Kolar, Jana; Cugmas, Marjan; Ferligoj, Anuška . Project: EC | ACCELERATE (731112)
In 2018, the European Strategic Forum for Research Infrastructures (ESFRI) was tasked by the Competitiveness Council, a configuration of the Council of the EU, to develop a common approach for monitoring the performance of Research Infrastructures. To this end, ESFRI established a working group, which has proposed 21 Key Performance Indicators (KPIs) to monitor the progress of the Research Infrastructures (RIs) towards their objectives. The RIs were then asked to assess the relevance of these KPIs for their institution. The paper aims to identify the relevance of certain indicators for particular groups of RIs by using cluster and discriminant analysis. This could contribute to the development of a monitoring system tailored to particular RIs. To obtain a typology of the RIs, we first performed cluster analysis of the RIs according to their properties, which revealed clusters of RIs with similar characteristics, based on the domain of operation, such as food, environment or engineering. Then, discriminant analysis was used to study how the relevance of the KPIs differs among the obtained clusters. This analysis revealed that the percentage of RIs correctly classified into five clusters, using the KPIs, is 80%. Such a high percentage indicates that there are significant differences in the relevance of certain indicators, depending on the ESFRI domain of the RI. The indicators therefore need to be adapted to the type of infrastructure. It is therefore proposed that the Strategic Working Groups of ESFRI addressing specific domains should be involved in the tailored development of the monitoring of pan-European RIs. 15 pages, 8 tables, 3 figures
- Publication . Other literature type . Conference object . 2018 . Open Access . English . Authors: Longhi, Julien . Publisher: HAL CCSD . Country: France
International audience
- Publication . Conference object . Other literature type . 2016 . Open Access . English . Authors: Longhi, Julien . Publisher: HAL CCSD . Country: France
International audience; This poster aims to describe issues encountered whilst structuring a corpus of tweets compiled from the keyword "intermittent" (arts worker) in order to analyse a discursive topic related to the controversy surrounding the status of French arts workers. This corpus is part of the CoMeRe project (CoMeRe, 2014), which aims to build a kernel corpus of computer-mediated communication (CMC) genres with interactions in the French language. Three key words characterize the project: variety, standards and openness. A variety of interactions was sought: public or private interactions as well as interactions from informal, learning and professional situations. The CoMeRe project structured the corpora in a uniform way using the Text Encoding Initiative format (TEI, Burnard & Bauman, 2013) and described each corpus using Dublin Core and OLAC standards for metadata (DCMI, 2014; OLAC, 2008). The TEI model was extended in order to encompass the Interaction Space (IS) of CMC multimodal discourse (Chanier et al., 2014). The term 'openness' also characterizes the project: the corpora have been released as open data on the French national platform of linguistic resources (ORTOLANG, 2013) in order to pave the way for scientific examination by partners not involved in the project as well as replicative and cumulative research. This poster presentation aims to give an overview of the corpus building process using, as a case study, a corpus of tweets, cmr-intermittent (Longhi et al., 2016). The following steps led to the choice of tweets: 1) In 2015, applying a threshold of at least 10 tweets containing the hashtag #intermittent(s), we identified 215 accounts, each of which had produced at least 10 tweets explicitly referenced as contributing to this theme (in order to have representative accounts). 2) By gathering all of the tweets sent by those 215 accounts, we collected 586,239 tweets.
3) 10,876 of the 586,239 tweets contained the hashtag #intermittent(s): the #intermittent corpus corresponds to these 10,876 tweets. The poster will focus, firstly, on how features that are specific to Twitter were included and structured in the interaction space TEI model. We will exemplify how certain features are accounted for in TEI. These include hashtags, which label tweets so that other users can see tweets on the same topic, and at signs, which allow users to mention or reply to other users. Secondly, the poster will evoke some of the ethical and rights issues that had to be considered before publishing this corpus of tweets. Finally, the workflow and multi-stage quality control procedure adopted during the corpus building process will be illustrated.
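The three selection steps in this abstract can be sketched in Python. This is only an illustration under stated assumptions, not the project's actual tooling: the `(account, text)` input shape, the function name `select_corpus`, and the hashtag pattern are hypothetical, and the sketch works on a single pool of tweets, whereas the abstract describes collecting each account's full timeline in a separate step.

```python
import re
from collections import Counter

# Assumed pattern: matches #intermittent or #intermittents, case-insensitively.
HASHTAG = re.compile(r"#intermittents?\b", re.IGNORECASE)


def select_corpus(tweets, min_tagged=10):
    """Apply the three selection steps to (account, text) pairs.

    Returns (accounts, all_tweets, tagged):
      accounts   -- accounts with at least `min_tagged` hashtagged tweets (step 1)
      all_tweets -- every tweet sent by those accounts (step 2)
      tagged     -- the subset of all_tweets containing the hashtag (step 3)
    """
    tweets = list(tweets)
    # Step 1: count hashtagged tweets per account and apply the threshold.
    tagged_counts = Counter(acc for acc, text in tweets if HASHTAG.search(text))
    accounts = {acc for acc, n in tagged_counts.items() if n >= min_tagged}
    # Step 2: gather all tweets sent by the retained accounts.
    all_tweets = [(acc, text) for acc, text in tweets if acc in accounts]
    # Step 3: keep only the tweets that carry the hashtag.
    tagged = [(acc, text) for acc, text in all_tweets if HASHTAG.search(text)]
    return accounts, all_tweets, tagged
```

With the figures reported above, step 1 would yield 215 accounts, step 2 the 586,239 collected tweets, and step 3 the 10,876 tweets that form the corpus.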
- Publication . Other literature type . Part of book or chapter of book . 2017 . Open Access . French . Authors: Julien Longhi . Publisher: HAL CCSD . Country: France
International audience; The analysis of political discourse is undergoing a significant renewal, due in particular to new media and formats of expression such as digital social networks. Yet these sites of written production are most often taken up by disciplines that treat them as social data rather than as discourse. This article aims to describe the philological and hermeneutic, as well as institutional and interdisciplinary, issues involved in building a corpus of political tweets. The Polititweets corpus (Longhi et al. 2014: 34,273 messages, 205 users) was built in the TEI format (with extensions towards CMC formats proposed by a European group formed around this question), in order to take into account the spatio-temporal, contextual, technological, interactional, thematic, dialogic, etc. elements of the messages produced. We first describe the context in which the corpus was developed, the methodology, and legal considerations. We then detail the philological issues of building the corpus, making explicit the criteria that governed its structuring, in order to move from a database to a corpus in TEI format. Finally, we describe the process of making the corpus available and the associated questions of open access.
- Publication . Report . 2016 . Open Access . French . Authors: Alès, Catherine; Arena, Richard; Brandt-Grau, Astrid; Chaabane, Naceur; Cortes, Geneviève; Crespin, Renaud; Fretel, Julien; Gardey, Delphine; Guermeur, Ivan; Gueye, Lamine; Haegeman, Lilian; Hostein, Antony; Michel, Hélène; Nef, Anneliese; Vienne-Guerrin, Nathalie; Le Tellier-Becquart, Nathalie; Michel, Cécile; Vaccaro, Rossana; Didier, Emmanuel; Auvergnon, Philippe; Inowlocki, Lena . Publisher: HAL CCSD . Country: France
- Publication . Other literature type . Conference object . 2015 . Open Access . English . Authors: Longhi, Julien; Wigham, Ciara R. . Publisher: HAL CCSD . Country: France
International audience; The CoMeRe project (CoMeRe, 2014) aims to build a kernel corpus of computer-mediated communication (CMC) genres with interactions in the French language. Three key words characterize the project: variety, standards and openness. The project gathered mono- and multimodal, synchronous and asynchronous communication data from both Internet and telecommunication networks (text chat, tweets, SMSs, forums, blogs). A variety of interactions was sought: public or private interactions as well as interactions from informal, learning and professional situations. Whereas some CMC data types were collected within the CoMeRe project, others had previously been collected and structured within different project partners’ local research teams. This meant that the project had to overcome disparities in corpus compilation choices. For this reason, the CoMeRe project structured the corpora in a uniform way using the Text Encoding Initiative format (TEI, Burnard & Bauman, 2013) and decided to describe each corpus using Dublin Core and OLAC standards for metadata (DCMI, 2014; OLAC, 2008). The TEI model was extended in order to encompass the Interaction Space (IS) of CMC multimodal discourse (Chanier et al., 2014). The term ‘openness’ also characterizes the project: the corpora have been released as open data on the French national platform of linguistic resources (ORTOLANG, 2013) in order to pave the way for scientific examination by partners not involved in the project as well as replicative and cumulative research. This poster presentation aims to give an overview of the corpus building process using, as a case study, a corpus of political tweets, cmr-polititweets (Longhi et al., 2014). The corpus stemmed from a local research project on lexicon (Digital Humanities and datajournalism, supported by the Fondation of Cergy-Pontoise University). It was built starting from seven French politicians from six different political parties.
In order to generate political tweets, a set of lists citing these politicians was generated (7087 lists), and lists that have tweeted at least six times and for which the description contained the word ‘politics’ were selected (120 lists in total). Finally, 2934 tweets were recovered. In order to be sure that we selected politicians’ tweets (and not, for example, those of journalists), only the accounts cited in more than 12 lists were considered; 205 politicians were tweeting. We took the last 200 tweets of each of the 205 accounts on 27 March 2014 (34,273 tweets). This allowed us to recover data that focused on the period between the two rounds of the 2014 municipal elections in France. The poster will focus, firstly, on how features specific to Twitter were included and structured in the interaction space TEI model. We will exemplify how three features were integrated into the model: hashtags, which label tweets so that other users can see tweets on the same topic; at signs, which allow a user to mention or reply to other users; and retweets, which allow a user to repost a message from another Twitter user and share it with their own followers. Secondly, the poster will evoke some of the ethical and rights issues that had to be considered before publishing a corpus of tweets. Finally, the workflow and multi-stage quality control process adopted during the building of the corpus will be illustrated. This was an essential aspect considering that the corpus underwent format conversions: the local research team had initially structured the corpus in XML whilst the CoMeRe project applied the IS TEI model to the corpus. The political tweets corpus is now structured and available online. Analyses have started to be carried out: some ideas have been launched in Djemili et al. (2014) but further analyses must adhere rigorously to methodologies stemming from the natural language processing (NLP) field.
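The list-based account selection described in this abstract can be sketched as follows. This is a minimal sketch under assumptions, not the authors' code: the dictionary keys `description`, `tweet_count` and `members`, and the function names, are hypothetical, and the abstract does not say whether citations were counted over all lists or only over the selected ones (the sketch counts over the selected lists).

```python
from collections import Counter


def select_accounts(lists, min_list_tweets=6, keyword="politics", min_lists=12):
    """Filter Twitter lists, then keep accounts cited in more than `min_lists` of them.

    Each list is a dict with hypothetical keys:
      "description" (str), "tweet_count" (int), "members" (list of account names).
    """
    # Keep lists with enough tweets whose description mentions the keyword.
    kept = [
        lst for lst in lists
        if lst["tweet_count"] >= min_list_tweets
        and keyword in lst["description"].lower()
    ]
    # Count in how many kept lists each account is cited (once per list).
    citations = Counter(acc for lst in kept for acc in set(lst["members"]))
    # "Cited in more than 12 lists" is a strict inequality.
    accounts = sorted(acc for acc, n in citations.items() if n > min_lists)
    return kept, accounts


def last_tweets(timeline, n=200):
    """Return the n most recent tweets of a chronologically ordered timeline."""
    return timeline[-n:]
```

Taking at most 200 tweets from each of 205 accounts gives an upper bound of 41,000 tweets; the reported 34,273 is consistent with some accounts having fewer than 200 tweets available.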
- Publication . Other literature type . Article . 2014 . Open Access . English . Authors: Thierry Chanier; Celine Poudat; Benoit Sagot; Georges Antoniadis; Ciara Wigham; Linda Hriba; Julien Longhi; Djame Seddah . Publisher: HAL CCSD . Country: France
Final version to Special Issue of JLCL (Journal of Language Technology and Computational Linguistics, http://jlcl.org/): BUILDING AND ANNOTATING CORPORA OF COMPUTER-MEDIATED DISCOURSE: Issues and Challenges at the Interface of Corpus and Computational Linguistics (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel); International audience; The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, as well as mono and multimodal, synchronous and asynchronous communications. Corpora are assembled using a standard, thanks to the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text, in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat, or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora.
We then turn to decisions made concerning which annotations to make for which units and describe the processing pipeline for adding these. All CoMeRe corpora are verified, thanks to a staged quality control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus, and disseminated through the national linguistic infrastructure ORTOLANG. We, therefore, highlight issues and decisions made concerning the OpenData perspective.
- Publication . Part of book or chapter of book . 2016 . Open Access . English . Authors: Buzzoni, Marina . Publisher: Open Book Publishers . Country: Italy
- Publication . Conference object . 2020 . Open Access . English . Authors: Nicholas, Lionel; Lyding, Verena; Borg, Claudia; Forascu, Corina; Fort, Karen; Zdravkova, Katerina; Kosem, Iztok; Cibej, Jaka; Holdt, Spela Arhar; Millour, Alice; Konig, Alexander; Rodosthenous, Christos; Sangati, Federico; Hassan, Umair ul; Katinskaia, Anisia; Barreiro, Anabela; Aparaschivei, Lavina; HaCohen-Kerner, Yaakov . Venue: 12th edition of the Language Resources and Evaluation Conference (LREC'20) . Country: Malta
We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of the generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs. peer-reviewed
11 Research products, page 1 of 2
Loading
- Publication . Article . Other literature type . Conference object . 2020Open Access EnglishAuthors:Stefan Bornhofen; Marten Düring;Stefan Bornhofen; Marten Düring;Publisher: HAL CCSDCountry: FranceProject: ANR | BLIZAAR (ANR-15-CE23-0002)
AbstractThe paper presents Intergraph, a graph-based visual analytics technical demonstrator for the exploration and study of content in historical document collections. The designed prototype is motivated by a practical use case on a corpus of circa 15.000 digitized resources about European integration since 1945. The corpus allowed generating a dynamic multilayer network which represents different kinds of named entities appearing and co-appearing in the collections. To our knowledge, Intergraph is one of the first interactive tools to visualize dynamic multilayer graphs for collections of digitized historical sources. Graph visualization and interaction methods have been designed based on user requirements for content exploration by non-technical users without a strong background in network science, and to compensate for common flaws with the annotation of named entities. Users work with self-selected subsets of the overall data by interacting with a scene of small graphs which can be added, altered and compared. This allows an interest-driven navigation in the corpus and the discovery of the interconnections of its entities across time.
Average popularityAverage popularity In bottom 99%Average influencePopularity: Citation-based measure reflecting the current impact.Average influence In bottom 99%Influence: Citation-based measure reflecting the total impact.add Add to ORCIDPlease grant OpenAIRE to access and update your ORCID works.This Research product is the result of merged Research products in OpenAIRE.
You have already added works in your ORCID record related to the merged Research product. - Publication . Article . Preprint . 2019Open Access EnglishAuthors:Kolar, Jana; Cugmas, Marjan; Ferligoj, Anu��ka;Kolar, Jana; Cugmas, Marjan; Ferligoj, Anu��ka;Project: EC | ACCELERATE (731112)
In 2018, the European Strategic Forum for research infrastructures (ESFRI) was tasked by the Competitiveness Council, a configuration of the Council of the EU, to develop a common approach for monitoring of Research Infrastructures' performance. To this end, ESFRI established a working group, which has proposed 21 Key Performance Indicators (KPIs) to monitor the progress of the Research Infrastructures (RIs) addressed towards their objectives. The RIs were then asked to assess their relevance for their institution. The paper aims to identify the relevance of certain indicators for particular groups of RIs by using cluster and discriminant analysis. This could contribute to development of a monitoring system, tailored to particular RIs. To obtain a typology of the RIs, we first performed cluster analysis of the RIs according to their properties, which revealed clusters of RIs with similar characteristics, based on to the domain of operation, such as food, environment or engineering. Then, discriminant analysis was used to study how the relevance of the KPIs differs among the obtained clusters. This analysis revealed that the percentage of RIs correctly classified into five clusters, using the KPIs, is 80%. Such a high percentage indicates that there are significant differences in the relevance of certain indicators, depending on the ESFRI domain of the RI. The indicators therefore need to be adapted to the type of infrastructure. It is therefore proposed that the Strategic Working Groups of ESFRI addressing specific domains should be involved in the tailored development of the monitoring of pan-European RIs. 15 pages, 8 tables, 3 figures
Average popularityAverage popularity In bottom 99%Average influencePopularity: Citation-based measure reflecting the current impact.Average influence In bottom 99%Influence: Citation-based measure reflecting the total impact.add Add to ORCIDPlease grant OpenAIRE to access and update your ORCID works.This Research product is the result of merged Research products in OpenAIRE.
You have already added works in your ORCID record related to the merged Research product. - Publication . Other literature type . Conference object . 2018Open Access EnglishAuthors:Longhi, Julien;Longhi, Julien;Publisher: HAL CCSDCountry: France
International audience
- Publication . Conference object . Other literature type . 2016Open Access EnglishAuthors:Longhi, Julien;Longhi, Julien;Publisher: HAL CCSDCountry: France
International audience; This poster aims to describe issues encountered whilst structuring a corpus of tweets compiled from the key word intermittent (arts worker) in order to analyse a discursive topic related to the controversy surrounding the status of French arts workers. This corpus is part of the CoMeRe project (CoMeRe, 2014): it aims to build a kernel corpus of computer-mediated communication (CMC) genres with interactions in the French language. Three key words characterize the project: variety, standards and openness. A variety of interactions was sought: public or private interactions as well as interactions from informal, learning and professional situations. The CoMeRe project structured the corpora in a uniform way using the Text Encoding Initiative format (TEI, Burnard & Bauman, 2013) and described each corpus using Dublin Core and OLAC standards for metadata (DCMI, 2014; OLAC, 2008). The TEI model was extended in order to encompass the Interaction Space (IS) of CMC multimodal discourse (Chanier et al., 2014). The term 'openness' also characterizes the project: The corpora have been released as open data on the French national platform of linguistic resources (ORTOLANG, 2013) in order to pave the way for scientific examination by partners not involved in the project as well as replicative and cumulative research. This poster presentation aims to give an overview of the corpus building process using, as a case study, a corpus of tweets cmr-intermittent (Longhi et al., 2016). The following steps led to the choice of tweets: 1) In 2015, with the creation of a threshold of at least 10 tweets with the #intermittent (s), we identified 215 accounts, each of which had produced at least 10 tweets explicitly referenced as contributing to this theme (in order to have representative accounts). 2) By gathering all of the tweets sent by those 215 people, we collected 586, 239 tweets. 
3) 10,876 of the 586, 239 tweets contained the #: #intermittent(s): the #intermittent corpus corresponds to these 10, 876 tweets. The poster will focus, firstly, on how features that are specific to Twitter were included and structured in the interaction space TEI model. We will exemplify how certain features are accounted for in TEI. These include hashtags that label tweets in order that other users can see tweets on the same topic and at signs that allow users to mention or reply to other users. Secondly, the poster will evoke some of the ethical and rights issues that had to be considered before publishing this corpus of tweets. Finally, the workflow and multi-stage quality control procedure adopted during the corpus building process will be illustrated.
- Publication . Other literature type . Part of book or chapter of book . 2017Open Access FrenchAuthors:Julien Longhi;Julien Longhi;Publisher: HAL CCSDCountry: France
International audience; L'analyse du discours politique connaît un renouvellement important, dû notamment aux nouveaux supports et formats d'expression, comme les réseaux sociaux numériques (RSN). Or, ces lieux de production d'écrits sont le plus souvent saisis par des disciplines qui les traitent comme des données sociales, plutôt que comme des discours. Cet article vise à décrire les enjeux philologiques, herméneutiques, et également institutionnels et interdisciplinaires, de la constitution d'un corpus de tweets politiques. Le corpus Polititweets (Longhi et al. 2014 : 34273 messages, 205 utilisateurs) a été élaboré selon le format TEI (avec des pistes d'extension aux formats CMC proposées par un groupe européen qui s'est constitué autour de cette question), afin de tenir compte des éléments spatio-temporels, contextuels, technologiques, interactionnels, thématiques, dialogiques, etc. des messages produits. Il s'agit donc dans un premier temps de décrire le contexte d'élaboration du corpus, la méthodologie et des considérations juridiques. Dans un second temps, nous détaillons les enjeux philologiques de la constitution du corpus, en explicitant les critères qui ont présidé à sa structuration, pour passer d'une base de données à un corpus au format TEI. Dans un dernier temps, nous décrivons la démarche de mise à disposition du corpus et les questions d'« open access ».
- Publication . Report . 2016Open Access FrenchAuthors:Alès, Catherine; Arena, Richard; Brandt-Grau, Astrid; Chaabane, Naceur; Cortes, Geneviève; Crespin, Renaud; Fretel, Julien; Gardey, Delphine; Guermeur, Ivan; Gueye, Lamine; +11 moreAlès, Catherine; Arena, Richard; Brandt-Grau, Astrid; Chaabane, Naceur; Cortes, Geneviève; Crespin, Renaud; Fretel, Julien; Gardey, Delphine; Guermeur, Ivan; Gueye, Lamine; Haegeman, Lilian; Hostein, Antony; Michel, Hélène; Nef, Anneliese; Vienne-Guerrin, Nathalie; Le Tellier-Becquart, Nathalie; Michel, Cécile; Vaccaro, Rossana; Didier, Emmanuel; Auvergnon, Philippe; Inowlocki, Lena;Publisher: HAL CCSDCountry: France
- Publication . Other literature type . Conference object . 2015Open Access EnglishAuthors:Longhi, Julien; Wigham, Ciara R.;Longhi, Julien; Wigham, Ciara R.;Publisher: HAL CCSDCountry: France
International audience; The CoMeRe project (CoMeRe, 2014) aims to build a kernel corpus of computer-mediated communication (CMC) genres with interactions in the French language. Three key words characterize the project: variety, standards and openness. The project gathered mono- and multimodal, synchronous and asynchronous communication data from both Internet and telecommunication networks (text chat, tweets, SMSs, forums, blogs). A variety of interactions was sought: public or private interactions as well as interactions from informal, learning and professional situations. Whereas some CMC data types were collected within the CoMeRe project, others had previously been collected and structured within different project partners’ local research teams. This meant that the project had to overcome disparities in corpus compilation choices. For this reason, the CoMeRe project structured the corpora in a uniform way using the Text Encoding Initiative format (TEI, Burnard & Bauman, 2013) and decided to describe each corpus using Dublin Core and OLAC standards for metadata (DCMI, 2014; OLAC, 2008). The TEI model was extended in order to encompass the Interaction Space (IS) of CMC multimodal discourse (Chanier et al., 2014). The term ‘openness’ also characterizes the project: The corpora have been released as open data on the French national platform of linguistic resources (ORTOLANG, 2013) in order to pave the way for scientific examination by partners not involved in the project as well as replicative and culumative research. This poster presentation aims to give an overview of the corpus building process using, as a case study, a corpus of political tweets cmr-polititweets (Longhi et al., 2014). The corpus stemmed from a local research project on lexicon (Digital Humanities and datajournalism, supported by the Fondation of Cergy-Pontoise University). It was built starting from seven French politicians from six different political parties. 
In order to generate political tweets, a set of lists citing these politicians was generated (7087 lists), and lists that have tweeted at least six times and for which the description contained the word ‘politics’ were selected (120 lists in total). Finally, 2934 tweets were recovered. In order to be sure that we selected politicians’ tweets (and not, for example, those of journalists), only the accounts cited in more than 12 lists were considered; 205 politicians were tweeting. We took the last 200 tweets of each of the 205 accounts on 27 March 2014 (34,273 tweets). This allowed us to recover data that focused on the period between the two rounds of the 2014 municipal elections in France. The poster will focus, firstly, on how features specific to Twitter were included and structured in the interaction space TEI model. We will exemplify how features including hashtags that label tweets so that other users can see tweets on the same topic, at signs that allow a user to mention or reply to other users and retweets that allow a user to repost a message from another Twitter user and share it with his own followers, were integrated into the model. Secondly, the poster will evoke some of the ethical and rights issues that had to be considered before publishing a corpus of tweets. Finally, the workflow & multi-stage quality control process adopted during the building of the corpus will be illustrated. This was an essential aspect considering that the corpus underwent format conversions: the local research team had initially structured the corpus in XML whilst the CoMeRe project applied the IS TEI model to the corpus.The political tweets corpus is now structured and available online. Analyses have started to be carried out: some ideas have been launched in Djemili et al. (2014) but further analyses must adhere rigorously to methodologies stemming from the natural language processing (NLP) field.
- Publication . Other literature type . Article . 2014 Open Access English Authors: Thierry Chanier; Celine Poudat; Benoit Sagot; Georges Antoniadis; Ciara Wigham; Linda Hriba; Julien Longhi; Djame Seddah; Publisher: HAL CCSD Country: France
Final version for the Special Issue of the Journal of Language Technology and Computational Linguistics (JLCL, http://jlcl.org/): BUILDING AND ANNOTATING CORPORA OF COMPUTER-MEDIATED DISCOURSE: Issues and Challenges at the Interface of Corpus and Computational Linguistics (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel); International audience; The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, as well as mono and multimodal, synchronous and asynchronous communications. Corpora are assembled using a standard, thanks to the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text, in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat, or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. 
We then turn to decisions made concerning which annotations to make for which units and describe the processing pipeline for adding these. All CoMeRe corpora are verified, thanks to a staged quality control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus, and disseminated through the national linguistic infrastructure ORTOLANG. We, therefore, highlight issues and decisions made concerning the OpenData perspective.
- Publication . Part of book or chapter of book . 2016 Open Access English Authors: Buzzoni, Marina; Publisher: Open Book Publishers Country: Italy
- Publication . Conference object . 2020 Open Access English Authors: Nicholas, Lionel; Lyding, Verena; Borg, Claudia; Forascu, Corina; Fort, Karen; Zdravkova, Katerina; Kosem, Iztok; Cibej, Jaka; Holdt, Spela Arhar; Millour, Alice; Konig, Alexander; Rodosthenous, Christos; Sangati, Federico; Hassan, Umair ul; Katinskaia, Anisia; Barreiro, Anabela; Aparaschivei, Lavina; HaCohen-Kerner, Yaakov; 12th edition of the Language Resources and Evaluation Conference (LREC'20); Country: Malta
We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of the generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs. peer-reviewed