publication . Conference object . 2015

Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access

Grefenstette, Gregory; Muchemi, Lawrence;
  • Published: 14 Dec 2015
  • Publisher: HAL CCSD
  • Country: France
International audience; Topic models provide a weighted list of terms specific to a given domain. For example, the terminology for painting, as a hobby, might include specific tools user in painting such brush, easel, canvas, as well as more specific terms such a common oil colors: deep aquamarine, cerulean blue, zinc white. For clothing, a topic model should include words such as shoes, boots, socks, skirt, hats, as well as more specific terms such as tennis shoes, cocktail dress, and specific brands of shoes, hats, shirts, etc. In addition to containing the characteristic terms of a topic, a topic model also contains the relative frequency of each term's use in the topic text. This frequency is useful in information retrieval settings; when a large number of results are returned for a query, they can be ordered by pertinence using the relative frequency of domain words to rank the responses. Providing a hierarchic topic model also allows an information retrieval application to create facets (Tunkelang, 2009), or categories appearing the result sets, with which the user can filter results, as on an online shopping site. One problem for many information retrieval platforms in digital humanity archives is the lack of topic models, other than those already foreseen and implemented when the archive was first digitized. A researcher wishing to look at a collection or archive from a new angle has no means of exploiting a new topic model corresponding to his or her axis of research. This obstacle has two causes: (1) technologically, the platform has to allow a re-annotation of the underlying archive with a new topic model. This technological problem is solvable by implementing a suite of natural language processing tools that can access the description of the textual description of elements in the archive, and identify there terms from a new topic model. For example, the commonly used information retrieval platform Lucene (Grainger, 2014) allows the administrator to add new facet annotations to existing documents. A second, more difficult problem is (2) building a new topic model. When done manually, this is a time-consuming task, with no assurance of being complete or adequate, unless great expense is outlayed, as is the case for MeSH, a medical subject heading taxonomy (Coletti and Bleich, 2001), for which regular monthly meetings are held for maintaining and updating the terminology. For subjects less important for society, few such ontological resources exist. When topic models are created automatically they can homogenize existing terminology (Newman et al, 2007) but often result in noise (Steyvers, et al., 2004) that may seem excessive to some archivists.
free text keywords: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Any information missing or wrong?Report an Issue