Publication · Preprint · 2019

An Annotated Dataset of Coreference in English Literature

Bamman, David; Lewke, Olivia; Mansoor, Anya
Open Access English
  • Published: 02 Dec 2019
Abstract
We present a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction. The dataset differs from previous coreference datasets in two ways: its documents average 2,105.3 words, more than four times the average length in other benchmark datasets (463.7 for OntoNotes), and it contains examples of difficult coreference problems common in literature. It supports evaluation of cross-domain performance on coreference resolution and analysis of the characteristics of long-distance within-document coreference.
Subjects
free text keywords: Computer Science - Computation and Language
Communities
DARIAH EU