Discovering and Describing Coherent and Meaningful Topics from Document Collections

Author

Anaya Sánchez, Henry

Director

Peñas Padilla, Anselmo

Berlanga Llavorí, Rafael

Date of defense

2016-01-25

Pages

134 p.



Department/Institute

Universitat Jaume I. Departament de Llenguatges i Sistemes Informàtics

Abstract

The main motivation behind this thesis is the problem of automatically discovering and describing coherent and meaningful topics underlying a target collection of text documents, where a topic is a theme that runs through documents in the collection. In this work, discovering topics means automatically producing a processable representation for each individual topic in the collection, even though topics are unobserved data (e.g., using clusters of documents or probability distributions of words), whereas describing a topic means generating a summary of the topic's representation that allows users to identify and discriminate the topic in the context of the target collection. By semantically coherent topics, we refer to topics that humans can easily interpret because they bear an intelligible underlying subject matter, whereas meaningful topics are those that represent and summarize the main (as opposed to background or supporting) themes addressed by each individual document in the target collection. Discovering and describing topics with these two features is useful for exploratory browsing, and also for obtaining semantic decompositions of document collections that support many information access and processing tasks. Note that these topics and their descriptions can be applied directly to provide end users with a summary of the main contents of a target collection of texts. There are two major trends in discovering topics from a collection of text documents: clustering-based approaches and approaches based on Probabilistic Topic Modeling (PTM). The former represent each topic using a cluster of documents, whereas the latter use a probability distribution of words to define each topic. Nevertheless, as far as we know, no existing approach simultaneously addresses the issues of ensuring both the coherence and the meaningfulness of the discovered topics as defined in this work. 
Indeed, only a few existing approaches have focused on the problem of discovering coherent topics, whereas the issue of providing meaningful topics has not been addressed so far. In this context, this thesis first proposes an abstract framework for discovering and describing topics. Then, from the proposed framework, we derive and evaluate two general methodologies, one producing clusters of documents and the other producing probability distributions of words, both aimed at discovering and describing topics that satisfy the requirements of coherence and meaningfulness. The main novelty of these methodologies is the combination of two elements: (i) modeling topics from sets of words that are lexically related in the context of the collection, so that these sets of words determine the aboutness of each topic and topic coherence is therefore expected to hold; and (ii) assessing topic meaningfulness by means of probabilistic criteria that penalize topics whose underlying content is close to the random contents underlying the target text collection (e.g., topics determined by abstract concepts such as "death victims of murder or accidents", which can merge topics about specific accidents or crimes). In the framework and, consequently, in the two derived methodologies, the topic discovery process is implemented as an iterative search in which topics are successively discovered, in a fully unsupervised manner, until every document in the target collection is considered to be covered by at least one topic. No prior knowledge about the topics is used, and the number of topics does not need to be specified beforehand. The latter is one of the strongest points of our proposal, since many approaches (most of them based on PTM) require setting a priori the number of topics to be discovered from the collection, which is very difficult to know in practice (especially when the aim is precisely to obtain data that describe the collection). 
The experiments carried out on target collections of news stories and on collections of tweets about different entities in a given domain (e.g., music/artists and carmakers) show that the proposed methodologies achieve higher performance in terms of coherence scores and meaningfulness than related state-of-the-art approaches. Meaningfulness is measured by agreement (i.e., comparison) with human annotations.
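
The iterative, coverage-driven search described in the abstract can be illustrated with a minimal sketch. This is not the method of the thesis: the function names, the use of KL divergence against the collection-wide word distribution as the "probabilistic criterion", the additive smoothing constant, the threshold value, and the toy documents and seed word sets are all assumptions made only for illustration.

```python
import math
from collections import Counter

def word_dist(docs, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over vocab (alpha is an assumed constant)."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for distributions over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def discover_topics(docs, seed_word_sets, min_divergence=0.3):
    """Accept seed-based topics one by one until every document is covered.

    Hypothetical stand-in for the framework: a seed (a set of related words)
    defines a topic as the documents containing any of its words; topics whose
    word distribution is too close to the background are rejected.
    """
    vocab = sorted({w for d in docs for w in d})
    background = word_dist(docs, vocab)
    uncovered = set(range(len(docs)))
    topics = []
    for seed in seed_word_sets:
        if not uncovered:
            break  # every document is covered by at least one topic
        members = [i for i, d in enumerate(docs) if seed & set(d)]
        if not uncovered & set(members):
            continue  # the candidate adds no coverage
        topic = word_dist([docs[i] for i in members], vocab)
        if kl(topic, background) < min_divergence:
            continue  # too close to the collection's background content
        topics.append((seed, members))
        uncovered -= set(members)
    return topics, uncovered

# Toy collection (invented for this sketch).
docs = [
    "earthquake chile rescue victims".split(),
    "earthquake chile aftershock victims".split(),
    "election senate vote results".split(),
    "election vote campaign results".split(),
]
seeds = [{"victims", "results"}, {"earthquake", "chile"}, {"election", "vote"}]
topics, uncovered = discover_topics(docs, seeds)
```

In this toy run the generic seed {"victims", "results"} matches every document, so its word distribution equals the background and it is rejected, while the two specific seeds are accepted and together cover the collection. The number of topics is not fixed in advance; it falls out of the stopping condition.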

Keywords

Computer science

Subjects

004 - Computer science and technology. Computing. Data processing

Documents

2016_Tesis_Anaya Sánchez_Henry.pdf

1.712Mb

 

Rights

NOTICE. All rights reserved. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for reference or private study, as well as in research and teaching activities or materials, under the terms established in Art. 32 of the Revised Text of the Intellectual Property Law (RDL 1/1996). Any other use requires the prior and express authorization of the author. In any case, when its contents are used, the full name of the author and the title of the doctoral thesis must be clearly indicated. Reproduction or other forms of exploitation for profit are not authorized, nor is public communication from a site other than the TDX service. Presenting its content in a window or frame external to TDX (framing) is also not authorized. This reservation of rights applies both to the contents of the thesis and to its abstracts and indexes.
