Session 1: Topic Modeling / Text Mining

  • Cédric Archambeau (University College London – UK / Amazon – Berlin – Germany) - Latent IBP Compound Dirichlet Allocation: Sparse Topic Models Fit for Natural Languages.

Probabilistic topic models such as latent Dirichlet allocation are widespread tools to analyse and explore large text corpora. They postulate a generative model of text which ignores its sequential structure, but has been proven to be sufficient to capture the underlying semantics. However, the generative model is unable to handle out-of-vocabulary words and does not account for the power-law distribution of the vocabulary of natural languages. Among other issues, this leads in practice to the creation of an unnecessarily large number of topics when modelling corpora of increasing size. We tackle this problem by introducing a probabilistic topic model based on the four-parameter IBP compound Dirichlet process, a stochastic process that generates sparse nonnegative vectors with a potentially unbounded number of entries. We call this new model latent IBP compound Dirichlet allocation (LIDA); it enables us to model power-law distributions both in the number of topics summarising the documents and in the number of words defining each topic. It can be interpreted as a sparse variant of the hierarchical Pitman-Yor process when applied to topic modelling. We derive an efficient and simple collapsed Gibbs sampler closely related to the collapsed Gibbs sampler of latent Dirichlet allocation (LDA), making the model applicable in a wide range of domains. Our nonparametric Bayesian topic model compares favourably to the widely used hierarchical Dirichlet process and its heavy-tailed version, the hierarchical Pitman-Yor process, on benchmark corpora. Experiments demonstrate that accounting for the power-law distribution of real data is beneficial and that sparsity provides more interpretable results.
This is joint work with Balaji Lakshminarayanan and Guillaume Bouchard.
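For context, the sketch below shows the standard collapsed Gibbs update for plain LDA, which the LIDA sampler mentioned above closely mirrors. It is only a minimal reference implementation under stated assumptions: the function name, hyperparameter values and the fixed number of topics K are illustrative, whereas LIDA itself lets the number of topics grow.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Standard collapsed Gibbs sampler for LDA (fixed K), for reference only.

    docs : list of lists of word ids in [0, V); V : vocabulary size; K : number of topics.
    This is the classical sampler the abstract compares against; the LIDA sampler
    additionally handles an unbounded, power-law-distributed number of topics.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))     # document-topic counts
    n_kw = np.zeros((K, V))             # topic-word counts
    n_k = np.zeros(K)                   # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments

    for d, doc in enumerate(docs):      # initialise the count tables
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]             # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # conditional posterior over topics for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k             # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```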

  • Julien Velcin (ERIC / Université Lyon 2 – Lyon – France) – Joint extraction of topics and sentiments

Faced with the "big" data generated by users on the Web and social media, we need efficient techniques to provide overviews such as categories and trends. Useful summaries can be computed from textual data using topic models, such as latent Dirichlet allocation (LDA) and variants able to deal with temporal dynamics (DTM, TOT). Moreover, topic models can also integrate the modeling of sentiments expressed towards these topics (ASUM, JST). Topic modeling and sentiment analysis (or opinion mining) are two popular tasks that have been treated separately in the past, even though they are complementary: sentiments usually target topics, and topics can be the basis of subjective positions. In this talk, I will first give a brief overview of the various ways topics and opinions have been modeled together. I will then present an attempt to jointly extract topics and sentiments by extending the classical probabilistic LDA model. Based on two case studies, I will show that our model, named TTS, is better suited to capturing the overall dynamics of the opinions expressed in short messages. In addition, I will show how such a hybrid topic-opinion model has recently been used to address the task of stock market prediction.
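To make the model family concrete, here is a toy generative sketch in the spirit of JST/ASUM-style joint topic-sentiment models. It is a hypothetical simplification in which topic and sentiment are drawn independently for each word; it is not the TTS model presented in the talk, and all names and hyperparameters are assumptions.

```python
import numpy as np

def generate_topic_sentiment_corpus(D, K, S, V, doc_len=50,
                                    alpha=0.1, gamma=1.0, beta=0.01, seed=0):
    """Toy generative sketch of a joint topic-sentiment model (JST/ASUM family).

    Each document draws a topic mixture and a sentiment mixture; each word is then
    generated from a word distribution indexed by its (sentiment, topic) pair.
    Illustrative simplification only, not the TTS model from the talk.
    """
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=(S, K))   # word dist. per (sentiment, topic)
    corpus = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))         # document-topic proportions
        pi = rng.dirichlet(np.full(S, gamma))            # document-sentiment proportions
        doc = []
        for _ in range(doc_len):
            s = rng.choice(S, p=pi)                      # sentiment label for this word
            k = rng.choice(K, p=theta)                   # topic for this word
            w = rng.choice(V, p=phi[s, k])               # word from the (s, k) distribution
            doc.append((w, k, s))
        corpus.append(doc)
    return corpus, phi
```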

  • Quentin Pleplé (EPSI, Big Datext and Short Edition – Grenoble – France) – Interactive Topic Modeling

Topics discovered by the latent Dirichlet allocation (LDA) method are sometimes not meaningful for humans. The goal of our work is to improve the quality of the topics presented to end users. We present a novel method for interactive topic modeling. The method allows the user to give live feedback on the topics, and allows the inference algorithm to use that feedback to guide the LDA parameter search. The user can indicate that words should be removed from a topic, that topics should be merged, or that a topic should be split or deleted. After each item of user feedback, we change the internal state of the variational EM algorithm in a way that preserves correctness, then re-run the algorithm until convergence. Experiments show that both contributions are successful in practice.
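As an illustration of the kind of state edits involved, the sketch below shows how two feedback operations ("remove word from topic" and "merge topics") might be applied to the topic-word and document-topic pseudo-count matrices before variational EM is re-run. This is a plausible reading of the approach under assumed array names, not the authors' actual implementation.

```python
import numpy as np

def remove_word_from_topic(lambda_kw, topic, word, eps=1e-12):
    """Apply a 'remove word from topic' feedback item.

    lambda_kw : (K, V) array of topic-word variational parameters (pseudo-counts).
    Sketch only: the mass the word held in that topic is redistributed over the
    remaining words, after which variational EM would be re-run until convergence.
    """
    removed = lambda_kw[topic, word]
    lambda_kw[topic, word] = eps                        # effectively forbid the word
    other = np.arange(lambda_kw.shape[1]) != word
    lambda_kw[topic, other] += removed / other.sum()    # spread its mass over other words
    return lambda_kw


def merge_topics(lambda_kw, gamma_dk, k1, k2):
    """Merge topic k2 into topic k1 by summing their pseudo-counts.

    gamma_dk : (D, K) document-topic variational parameters. Returns arrays with
    one fewer topic; again an illustrative sketch, not the authors' code.
    """
    lambda_kw[k1] += lambda_kw[k2]
    gamma_dk[:, k1] += gamma_dk[:, k2]
    return np.delete(lambda_kw, k2, axis=0), np.delete(gamma_dk, k2, axis=1)
```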