Model-based clustering and classification for high-dimensional data (with R)
Charles Bouveyron (Université Paris Descartes, web)
Model-based clustering is a popular tool renowned for its probabilistic foundations and its flexibility. However, high-dimensional data are nowadays more and more frequent and, unfortunately, classical model-based clustering techniques behave disappointingly in high-dimensional spaces, mainly because they are dramatically over-parametrized in this setting. Nevertheless, high-dimensional spaces have specific characteristics which are useful for clustering, and recent techniques exploit those characteristics. After recalling the basics of model-based clustering, dimension reduction approaches, regularization-based techniques, parsimonious modeling, subspace clustering methods and clustering methods based on variable selection are reviewed. Existing software for model-based clustering of high-dimensional data will also be reviewed, and its practical use will be illustrated on real-world data sets.
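As a minimal illustration of the dimension-reduction family of approaches mentioned above (my own sketch, not code from the tutorial, which uses R), one can project high-dimensional data to a low-dimensional subspace before fitting a Gaussian mixture, avoiding the over-parametrization of a full-covariance mixture in the original space:

```python
# Sketch: PCA dimension reduction followed by Gaussian-mixture clustering,
# one simple way to make model-based clustering workable in high dimension.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated groups embedded in a 50-dimensional space.
X = np.vstack([rng.normal(0, 1, (100, 50)), rng.normal(4, 1, (100, 50))])

Z = PCA(n_components=5).fit_transform(X)   # cluster in a 5-dim subspace
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(Z)
print(len(set(labels)))
```

Fitting the mixture directly in 50 dimensions would require estimating two 50x50 covariance matrices from 200 points; the subspace model needs far fewer parameters.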
Intermediate R Programming: The transition from “using” to “scientific computing”
John W. Emerson (Yale University, web)
This tutorial will be accessible to “newbie R users” who have strong programming backgrounds in other languages (Matlab, C/C++, Python, …) but is really aimed at “intermediate R users” of various levels. Instead of “using R” for the purpose of statistical analyses, we will emphasize understanding the structure of the language including some of its strengths and weaknesses. Some of the material covered will set the stage for the subsequent conference talk by the instructor in the session on High-Dimensional and Big Data.
Diffusion phenomena in networks: virality, influence and control Nicolas Vayatis (Ecole Normale Supérieure de Cachan, web)
In this talk, a unified framework is proposed to characterize contagion phenomena in three different fields: SIR epidemics, bond percolation and information cascades. In particular, a phase transition phenomenon is described for the behavior of the influence under the assumption of a locally positively correlated random graph, which involves spectral characteristics of the underlying network. In the second part of the talk, the question of controlling SIS epidemics will be tackled under two different strategies: one based on dynamic allocation of treatments, seeking the largest reduction in the number of infectious edges, and the second based on a priority planning for which tight bounds on the epidemic threshold and the extinction time are provided. All strategies have been tested numerically for a variety of network models and diffusion/recovery rates.
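The kind of numerical testing mentioned at the end can be sketched as follows (a toy discrete-time SIS simulation of my own, with arbitrary parameter names and values, on an Erdős–Rényi graph):

```python
# Toy SIS dynamics on a random graph: infection spreads along edges with
# probability beta per infectious neighbour, infected nodes recover with
# probability delta per step.
import numpy as np

def sis_step(adj, infected, beta, delta, rng):
    """One discrete-time SIS step: infections along edges, then recoveries."""
    pressure = adj @ infected                  # number of infectious neighbours
    p_inf = 1 - (1 - beta) ** pressure         # prob. a susceptible is infected
    new_inf = (rng.random(len(infected)) < p_inf) & (infected == 0)
    recover = (rng.random(len(infected)) < delta) & (infected == 1)
    return np.clip(infected + new_inf - recover, 0, 1)

rng = np.random.default_rng(1)
n = 200
adj = (rng.random((n, n)) < 0.05).astype(int)  # Erdos-Renyi adjacency
adj = np.triu(adj, 1); adj = adj + adj.T       # symmetric, no self-loops
state = np.zeros(n, dtype=int); state[:5] = 1  # seed 5 infected nodes
for _ in range(50):
    state = sis_step(adj, state, beta=0.1, delta=0.2, rng=rng)
print(int(state.sum()))
```

A control strategy such as dynamic treatment allocation would, in this setting, modify `delta` node-by-node at each step according to the current infectious edges.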
Autoregressive Generative Models with Deep Learning Hugo Larochelle (Google, web)
In machine learning, the two dominant approaches to learning generative models of data have mostly been based on either directed graphical models or undirected graphical models. In this talk, I’ll discuss a third approach, which has become more popular recently: autoregressive generative models. Thanks to neural networks, this family of models has been shown to be very competitive, both in terms of the realism of the data they can generate and the data representations they can learn. I’ll discuss a variety of such neural autoregressive models and dissect the advantages and disadvantages of this approach.
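The defining property of this model family is the chain-rule factorization p(x) = ∏ᵢ p(xᵢ | x₍₍less than i₎₎). A toy sketch (mine, with arbitrary fixed weights; in NADE-style models a neural network produces the conditionals) shows both exact sampling and exact likelihood evaluation, the two operations that make autoregressive models attractive:

```python
# Toy autoregressive model over binary vectors: each conditional
# p(x_i = 1 | x_<i) is a logistic function of the previous bits.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(weights, biases, rng):
    """Sample a binary vector one dimension at a time (ancestral sampling)."""
    d = len(biases)
    x = np.zeros(d)
    for i in range(d):
        p = sigmoid(biases[i] + weights[i, :i] @ x[:i])
        x[i] = rng.random() < p
    return x

def log_prob(x, weights, biases):
    """Exact log-likelihood: sum of the conditional Bernoulli log-probs."""
    lp = 0.0
    for i in range(len(x)):
        p = sigmoid(biases[i] + weights[i, :i] @ x[:i])
        lp += np.log(p if x[i] == 1 else 1 - p)
    return lp

rng = np.random.default_rng(0)
d = 8
W = rng.normal(0, 0.5, (d, d)); b = rng.normal(0, 0.5, d)
x = sample(W, b, rng)
print(x, log_prob(x, W, b))
```

Unlike undirected models, no partition function is needed: the likelihood is tractable by construction.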
On MCMC methods for tall data Rémi Bardenet (CNRS – Université de Lille, web)
Markov chain Monte Carlo methods are often deemed too computationally intensive to be of any practical use for big data applications, and in particular for inference on datasets containing a large number n of individual data points, also known as tall datasets. In scenarios where data are assumed independent, various approaches to scale up the Metropolis-Hastings algorithm in a Bayesian inference context have recently been proposed in machine learning and computational statistics. These approaches can be grouped into two categories: divide-and-conquer approaches and subsampling-based algorithms. In this talk, I will give an overview of the existing literature, commenting on the underlying assumptions and theoretical guarantees of each method. In particular, I will argue that it remains an open challenge to develop efficient MCMC algorithms for tall data in scenarios where the Bernstein-von Mises approximation is poor. Based on joint work with Arnaud Doucet and Chris Holmes (Univ. Oxford, UK): http://arxiv.org/abs/1505.02827
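To make the subsampling category concrete, here is a naive subsampling Metropolis-Hastings sketch of my own (not one of the algorithms analysed in the talk): each iteration estimates the log-likelihood on a mini-batch of size m, rescaled by n/m. Each iteration is cheap, but, as the talk's guarantees discussion makes clear, the resulting chain is biased in general:

```python
# Naive subsampling Metropolis-Hastings on a tall Gaussian-mean problem.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, 100_000)          # tall dataset, unknown mean
n, m = len(data), 1_000

def est_loglik(theta, batch):
    # Unbiased estimate of the (unnormalized) log-likelihood, up to constants.
    return (n / m) * np.sum(-0.5 * (batch - theta) ** 2)

theta, chain = 0.0, []
for _ in range(2_000):
    prop = theta + 0.05 * rng.normal()         # random-walk proposal
    batch = rng.choice(data, m)                # fresh subsample each step
    if np.log(rng.random()) < est_loglik(prop, batch) - est_loglik(theta, batch):
        theta = prop
    chain.append(theta)
print(np.mean(chain[1000:]))
```

An unbiased estimator of the log-likelihood does not give an unbiased estimator of the acceptance ratio, which is precisely why more careful constructions (or the divide-and-conquer route) are needed.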
Scalable Programming Strategies for Massive Data in R John W. Emerson (Yale University, web)
Computing with native R objects is limited to available memory (RAM), lacks shared-memory capabilities, and frequently incurs costly memory overhead. This talk will present three specific examples with take-away material: (1) using package _bigmemory_ for storing and interacting with massive matrices; (2) using package _foreach_ as an elegant and portable framework for parallel computing; and (3) building a basic R package that includes C/C++ code for an algorithm that scales beyond the constraints of RAM.
An Adaptive Ridge Procedure for L0 Regularization and Applications Grégory Nuel (CNRS – Université Pierre et Marie Curie, web)
We present here a recently published approach (Frommlet and N., 2016) whose purpose is to perform L0-penalized maximization by iteratively solving a weighted ridge problem. The approach is very similar in spirit and method to the (multi-step) adaptive LASSO, with the noticeable difference that the ridge problem is generally much easier to solve than the lasso one. We illustrate the method on simple generalized linear models and then consider various more sophisticated applications: estimation of piecewise-constant hazard models in survival analysis, irregular histograms, and image segmentation.
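The iteration can be sketched in a few lines for linear regression (my own minimal version with made-up parameter values, not the authors' implementation): each pass solves a weighted ridge problem whose weights w_j = 1/(β_j² + ε²) approximate the L0 penalty, driving small coefficients to numerical zero:

```python
# Adaptive-ridge iteration for L0-like sparse linear regression.
import numpy as np

def adaptive_ridge(X, y, lam=1.0, eps=1e-4, n_iter=50):
    p = X.shape[1]
    w = np.ones(p)
    for _ in range(n_iter):
        # Weighted ridge: a single linear solve per iteration.
        beta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
        w = 1.0 / (beta ** 2 + eps ** 2)   # reweight toward an L0 penalty
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true = np.zeros(10); true[[0, 3]] = [2.0, -1.5]    # sparse ground truth
y = X @ true + 0.1 * rng.normal(size=200)
beta = adaptive_ridge(X, y)
print(np.round(beta, 2))
```

Note that, unlike a lasso solve, each step is a plain linear system, which is the practical advantage emphasized above.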
Real time community detection in large networks Sébastien Loustau (artfact, web)
From Word Vectors to Sentence Representations Martin Jaggi (Ecole Polytechnique Fédérale de Lausanne, web)
Learning good representations for text is crucially important to advance many promising applications of learning on text data. We will overview some recent developments such as word vectors (trained unsupervised), and several recent ways to generalize word representations to sequences of words. Surprisingly, current state-of-the-art methods for the latter task are only available for supervised machine learning, and it is an important direction to find better unsupervised techniques benefiting from the huge text datasets which are abundant. With the example of sentiment classification (SemEval 2016 competition), we’ll discuss current convolutional neural network techniques, and also more recent simpler alternatives based on matrix factorizations, starting from Facebook’s FastText algorithm.
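The simplest way to go from word vectors to a sentence representation, and the scheme underlying FastText-style classifiers, is to average the vectors of the words in the sentence. A toy sketch (the 4-dimensional vectors below are made up for illustration, not trained embeddings):

```python
# Bag-of-words sentence representation: average the word vectors.
import numpy as np

# Hypothetical word vectors (real ones would be trained on a large corpus).
vecs = {
    "great": np.array([ 0.9, 0.1, 0.0, 0.2]),
    "movie": np.array([ 0.1, 0.8, 0.1, 0.0]),
    "awful": np.array([-0.8, 0.2, 0.1, 0.1]),
}

def sentence_vector(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    words = [vecs[w] for w in sentence.lower().split() if w in vecs]
    return np.mean(words, axis=0) if words else np.zeros(4)

v1 = sentence_vector("great movie")
v2 = sentence_vector("awful movie")
# Cosine similarity between the two sentence representations.
cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(round(cos, 3))
```

Averaging ignores word order entirely, which is exactly the limitation that the convolutional and matrix-factorization approaches discussed in the talk try to address.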
Novelties and limits of neural approaches for information access Benjamin Piwowarski (CNRS – Université Pierre et Marie Curie, web)
Recent years have witnessed a wide development of (deep) neural network approaches to information access, borrowing early achievements in natural language processing. In this talk, I will focus on the novelties brought by such approaches, and point out the current limitations, the latter being mainly due to the inherent difficulty of processing text at the document level.
NLP-driven Data Journalism: Event-based Extraction and Aggregation of International Alliances Relations Xavier Tannier (Université Paris Sud, web)
We take inspiration from computational and data journalism, and propose to combine techniques from supervised information extraction, information aggregation and visualization to build a tool identifying the evolution of alliance and opposition relations between countries on specific topics. These relations are aggregated into numerical data that are visualized by time-series plots or dynamic graphs.
About two disinherited sides of statistics: data units and computational saving Christophe Biernacki (Université de Lille, web)
Statistics often focuses on designing models, theoretical estimates, related algorithms and model selection. However, some aspects of this whole process are not really tackled by statisticians, leaving the practitioner with empirical choices and thus poor theoretical guarantees. In this context, we identify two situations of interest: firstly, the definition of the data units, in cases where the practitioner hesitates between several; and secondly, ways of saving computational time, for instance through early stopping rules for some estimation algorithms. In the first case (data units), we highlight that it is possible to embed data unit selection into a classical model selection principle. We introduce the problem in a regression context before focusing on the model-based clustering and co-clustering context, for data of different kinds (continuous, categorical). This is joint work with Alexandre Lourme (University of Bordeaux). In the second case (computational saving), we recall that an increasingly recurrent statistical question is how to design a trade-off between estimate accuracy and computation time. In practice, most estimates arise from algorithmic processes aiming at optimizing some standard, but usually only asymptotically relevant, criteria. Thus, the quality of the resulting estimate is a function of both the iteration number and the sample size involved. We focus on estimating an early stopping time of a gradient descent estimation process aiming at maximizing the likelihood, in the simplified context of linear regression (with some discussion of other contexts). It appears that the accuracy gain from such a stopping time increases with the number of covariates, indicating the potential interest of the method in real situations involving many covariates. This is joint work with Alain Célisse and Maxime Brunin (both University of Lille and Inria).
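The computational-saving setting can be illustrated with a toy version of early-stopped gradient descent for linear regression (my own sketch with a simple held-out-error patience rule, not the authors' estimator of the stopping time):

```python
# Gradient descent on least squares, stopped when held-out error stalls.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

beta = np.zeros(p)
lr = 0.1
best_err, best_iter = np.inf, 0
for t in range(1, 1001):
    grad = X_tr.T @ (X_tr @ beta - y_tr) / len(y_tr)
    beta -= lr * grad
    err = np.mean((X_val @ beta - y_val) ** 2)
    if err < best_err:
        best_err, best_iter = err, t
    elif t - best_iter > 50:      # patience: stop when no recent improvement
        break
print(best_iter, round(best_err, 3))
```

Each extra iteration costs O(np); the point of a principled stopping time is to spend those iterations only while they still buy statistical accuracy.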
ABC random forests for Bayesian parameter inference Christian Robert (Université Paris Dauphine, web)
Approximate Bayesian Computation (ABC) has grown into a standard methodology to handle Bayesian inference in models associated with intractable likelihood functions. Most ABC implementations require the selection of a summary statistic as the data itself is too large to be compared to simulated realisations from the assumed model. The dimension of this statistic is generally quite large. Furthermore, the tolerance level that governs the acceptance or rejection of parameter values needs to be calibrated and the range of calibration techniques available so far is mostly based on asymptotic arguments. We propose here to conduct Bayesian inference based on an arbitrarily large vector of summary statistics without imposing a selection of the relevant components and bypassing the derivation of a tolerance. The approach relies on the random forest methodology of Breiman (2001) when applied to regression. We advocate the derivation of a new random forest for each component of the parameter vector. Correlations between parameter components are handled by separate random forests. When compared with standard ABC solutions, this technology offers significant gains in terms of robustness to the choice of the summary statistics and of computing time.
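The core mechanism can be sketched on a toy normal-mean example (my own illustration, not the authors' implementation): simulate (θ, data) pairs from the prior predictive, regress θ on a deliberately redundant vector of summary statistics with a random forest, and read off the prediction at the observed summaries, with no tolerance calibration or summary selection:

```python
# ABC via random-forest regression of the parameter on summary statistics.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def summaries(x):
    # A redundant summary vector; the forest sorts out which components matter.
    return [x.mean(), np.median(x), x.std(), x.min(), x.max()]

# Reference table: theta ~ N(0, 3^2) prior, data ~ N(theta, 1), 50 points each.
thetas = rng.normal(0, 3, 5000)
table = np.array([summaries(rng.normal(t, 1, 50)) for t in thetas])

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(table, thetas)

x_obs = rng.normal(1.5, 1, 50)                 # observed data, true theta = 1.5
pred = float(rf.predict([summaries(x_obs)])[0])
print(round(pred, 2))
```

In the full methodology one forest is grown per parameter component, as stated above; the toy example has a single scalar parameter.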