**Statlearn** is a scientific workshop held every year, which focuses on current and upcoming trends in Statistical Learning. **Statlearn’19** took place in **Grenoble** on **April 4-5, 2019**. Statlearn is a conference of the **French Society of Statistics (SFdS)**. A **poster session**, with some local food, was held during lunch time, and participants were encouraged to present their own work. It was a great opportunity for young researchers in particular.

## Program and slides

**Thursday, April 4th**

Aapo Hyvarinen (University College London, UK), *Nonlinear Independent Component Analysis: a Principled Framework for Unsupervised Deep Learning* [slides]

Nelly Pustelnik (ENS Lyon, France), *Discrete Mumford-Shah Model: from Image Restoration to Graph Analysis* [slides]

Diane Larlus (NAVER LABS Europe), *Learning Image Representations for Efficient Visual Search* [slides]

Judith Rousseau (University of Oxford, UK), *On the Impact of the Activation Function on Deep Neural Networks Training* [slides]

Julie Josse (Ecole Polytechnique, France), *On the Consistency of Supervised Learning with Missing Values* [slides]

**Friday, April 5th**

Max Welling (University of Amsterdam and Qualcomm, Netherlands), *Gauge Fields in Deep Learning* [slides]

Gaël Varoquaux (INRIA Paris, France), *Statistics on Dirty Categories: neither Categories, nor Free Text* [slides]

Gabriel Peyré (ENS Paris, France), *Optimal Transport for Machine Learning* [slides]

Nicolas Bonneel (University of Lyon 1, France), *Sliced Partial Optimal Transport* [slides]

Liliana Forzani (Universidad Nacional del Litoral, Argentina), *Partial Least Square: Statistics for the Chemometrics* [slides]

Eyke Hüllermeier (Paderborn University, Germany), *Analyzing and Learning from Ranking Data: New Problems and Challenges* [slides]

## Abstracts

**Aapo Hyvarinen, University College London, UK**

*Nonlinear Independent Component Analysis: a Principled Framework for Unsupervised Deep Learning*

Unsupervised learning, in particular learning general nonlinear representations, is one of the deepest problems in machine learning. Estimating latent quantities in a generative model provides a principled framework, and has been successfully used in the linear case, e.g. with independent component analysis (ICA) and sparse coding. However, extending ICA to the nonlinear case has proven to be extremely difficult: a straightforward extension is unidentifiable, i.e. it is not possible to recover those latent components that actually generated the data. Here, we show that this problem can be solved by using additional information, either in the form of temporal structure or an additional, auxiliary variable. We start by formulating two generative models in which the data is an arbitrary but invertible nonlinear transformation of time series (components) which are statistically independent of each other. Drawing from the theory of linear ICA, we formulate two distinct classes of temporal structure of the components which enable identification, i.e. recovery of the original independent components. We show that in both cases, the actual learning can be performed by ordinary neural network training where only the input is defined in an unconventional manner, making software implementations straightforward. We further generalize the framework to the case where, instead of temporal structure, an additional auxiliary variable is observed (e.g. audio in addition to video). Our methods are closely related to “self-supervised” methods heuristically proposed in computer vision, and also provide a theoretical foundation for such methods.

The talk is based on the following papers:

http://www.cs.helsinki.fi/u/ahyvarin/papers/NIPS16.pdf

http://www.cs.helsinki.fi/u/ahyvarin/papers/AISTATS17.pdf

https://arxiv.org/pdf/1805.08651

**Nelly Pustelnik, ENS Lyon, France**

*Discrete Mumford-Shah Model: from Image Restoration to Graph Analysis*

The Mumford–Shah model is a standard model in image segmentation, and many approximations of it have been proposed. The main appeal of this functional is that it performs image restoration and contour detection jointly. In this work, we propose a general formulation of the discrete counterpart of the Mumford–Shah functional. We derive a new proximal alternating minimization scheme that handles the non-convex objective function, with proven convergence and numerical robustness to the initialization. The behavior of the proposed strategy is evaluated and compared to state-of-the-art approaches in image restoration, and extended to graph analysis.

**Diane Larlus, Naver Labs Europe, France**

*Learning Image Representations for Efficient Visual Search*

Querying with an example image is a simple and intuitive interface to retrieve relevant information from a collection of images. In general, this is done by computing simple similarity functions between the representation of the visual query and the representations of the images in the collection, provided that the representation is suitable for this task, i.e. assuming that relevant items have similar representations and non-relevant items do not. This presentation will show how to train an embedding function that maps visual content into an appropriate representation space where the previous assumption holds, producing a solution to visual search that is both effective and computationally efficient.

In a second part, the presentation will move beyond instance-level search and consider the task of semantic image search in complex scenes, where the goal is to retrieve images that share the same semantics as the query image. Despite being more subjective and more complex, one can show that the task of semantically ranking visual scenes is consistently implemented across a pool of human annotators, and that suitable embedding spaces can also be learnt for this task of semantic retrieval.

**Judith Rousseau, University of Oxford, UK**

*On the Impact of the Activation Function on Deep Neural Networks Training*

The weight initialization and the activation function of deep neural networks have a crucial impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during back-propagation. Understanding the theoretical properties of untrained random networks is key to identifying which deep networks may be trained successfully, as recently demonstrated by Schoenholz et al. (2017), who showed that for deep feedforward neural networks only a specific choice of hyperparameters, known as the Edge of Chaos, can lead to good performance. While the work by Schoenholz et al. (2017) discusses trainability issues, we focus here on training acceleration and overall performance. We give a comprehensive theoretical analysis of the Edge of Chaos and show that we can indeed tune the initialization parameters and the activation function in order to accelerate the training and improve the performance.

**Julie Josse, Ecole Polytechnique, France**

*On the Consistency of Supervised Learning with Missing Values*

Missing data is a classic data science problem. There is an abundant literature on the subject, but it has mainly focused on estimating parameters and their variance from incomplete tables and is not positioned in a supervised context where the objective is to best predict a target and where missing data are considered in training and in the test set.

In this work, we show the consistency of two approaches to estimating the regression function.

The most striking one, which has important consequences in practice, shows that mean imputation is consistent for supervised learning when missing values are not informative. This is, as far as we know, the first result justifying this very convenient practice of handling missing values. We then focus on decision trees, as they also offer a natural way to perform empirical risk minimization with missing values, especially when using the “missing incorporated in attributes” (MIA) method.
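The mean-imputation practice discussed in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`is_missing`, `fit_column_means`, `impute`) and the toy data are hypothetical, and missing values are assumed to be encoded as `None` or NaN. The point it shows is that the column means are learned on the training set and then reused for the test set.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an encoding assumed for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_column_means(rows):
    """Per-column mean of the observed (non-missing) training values."""
    means = []
    for column in zip(*rows):
        observed = [v for v in column if not is_missing(v)]
        if not observed:
            raise ValueError("a column has no observed value")
        means.append(sum(observed) / len(observed))
    return means


def impute(rows, means):
    """Replace each missing entry by the training mean of its column."""
    return [
        [means[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


# Toy data: means are fitted on the training rows only.
X_train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
X_test = [[None, 1.0]]

means = fit_column_means(X_train)          # [2.0, 6.0]
X_train_imputed = impute(X_train, means)
X_test_imputed = impute(X_test, means)     # [[2.0, 1.0]]
```

Any supervised learner can then be trained on `X_train_imputed` and evaluated on `X_test_imputed`.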

**Max Welling, University of Amsterdam and Qualcomm, Netherlands**

*Gauge Fields in Deep Learning*

Gauge field theory is the foundation of modern physics, including general relativity and the standard model of physics. It describes how a theory of physics should transform under symmetry transformations. For instance, in electrodynamics, electric forces may transform into magnetic forces if we transform a static observer to one that moves at constant speed. Similarly, in general relativity acceleration and gravity are equated to each other under symmetry transformations. Gauge fields also play a crucial role in modern quantum field theory and the standard model of physics, where they describe the forces between particles that transform into each other under (abstract) symmetry transformations.

In this work we describe how the mathematics of gauge groups becomes inevitable when you are interested in deep learning on manifolds. Defining a convolution on a manifold involves transporting geometric objects such as feature vectors and kernels across the manifold, and this transport becomes path dependent due to curvature. As such, it becomes very difficult to represent these objects in a global reference frame and one is forced to consider local frames. These reference frames are arbitrary, and changing between them is called a (local) gauge transformation. Since we do not want our computations to depend on the specific choice of frames, we are in turn forced to consider equivariance of our convolutions under gauge transformations. These considerations result in the first fully general theory of deep learning on manifolds, with gauge equivariant convolutions as the necessary key ingredient.

We develop a highly efficient gauge equivariant deep neural network (Unet) for segmentation on a sphere by approximating the sphere by an icosahedron. This model is tested on global climate data as well as omnidirectional indoor scene data.

**Gaël Varoquaux, INRIA Paris, France**

*Statistics on Dirty Categories: neither Categories, nor Free Text*

The notion of categorical variables is classic in statistics: a feature of the data that takes distinct discrete values, such as “Patient” or “Control”. However, when integrating real-world data tables into a statistical analysis, many non-numerical columns do not clearly delineate categories. These may present typos in string representations, or morphological variants, e.g. “control” or “typical control”. The traditional approach to analyse such data is to first “clean” it, via simple transformation rules or feature engineering, to cast it into well-separated categories. However, data cleaning is a major time sink for analysts.

Here, we consider statistical analysis directly on non-standardized data. We introduce the notion of “Dirty categories”, which are neither well-separated categories nor natural language. We show that accounting for their string representation helps the statistical analysis, for instance improving the prediction in supervised learning. Finally, we discuss approaches to represent such entries in ways that can be interpreted as categories, without losing information on the morphological variants. Such data encoding is based on string similarities and character-level modeling. We show that these always improve on the common practice of one-hot encoding.
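To make the contrast with one-hot encoding concrete, here is a minimal sketch of an encoding based on string similarities. It is an illustration under stated assumptions rather than the method presented in the talk: the similarity used (Jaccard overlap of character 3-grams) and all function names are choices made for this example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (in [0, 1])."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def similarity_encode(value, categories, n=3):
    """Encode a string by its similarity to each reference category."""
    return [ngram_similarity(value, c, n) for c in categories]


def one_hot_encode(value, categories):
    """Standard one-hot encoding: exact string match only."""
    return [1.0 if value == c else 0.0 for c in categories]


categories = ["control", "patient"]

# A morphological variant is invisible to one-hot encoding...
print(one_hot_encode("typical control", categories))
# ...but stays close to "control" under the similarity encoding.
print(similarity_encode("typical control", categories))
```

With this encoding, the variant “typical control” receives a non-zero coordinate on “control” and zero on “patient”, whereas one-hot encoding maps it to the all-zero vector.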

**Gabriel Peyré, ENS Paris, France**

*Optimal Transport for Machine Learning*

Optimal transport (OT) has become a fundamental mathematical tool at the interface between calculus of variations, partial differential equations and probability. However, it took much more time for this notion to become mainstream in numerical applications. This situation is in large part due to the high computational cost of the underlying optimization problems. There is a recent wave of activity on the use of OT-related methods in fields as diverse as image processing, computer vision, computer graphics, statistical inference, and machine learning. In this talk, I will review an emerging class of numerical approaches for the approximate resolution of OT-based optimization problems. This offers a new perspective for the application of OT in high dimension, to solve supervised (learning with transportation loss function) and unsupervised (generative network training) machine learning problems. More information and references can be found on the website of our book “Computational Optimal Transport”: https://optimaltransport.github.io/
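The abstract does not name a specific algorithm. As one well-known example of an approximate OT solver, the sketch below uses entropic regularization with Sinkhorn iterations on a small discrete problem. The function name `sinkhorn_plan`, the parameter values and the toy cost matrix are assumptions made for illustration only.

```python
import math


def sinkhorn_plan(a, b, cost, eps=0.1, n_iter=200):
    """Approximate transport plan between histograms a and b.

    a, b   : lists of non-negative weights, each summing to 1
    cost   : cost[i][j] is the cost of moving mass from bin i to bin j
    eps    : entropic regularization strength (smaller = closer to exact OT)
    """
    n, m = len(a), len(b)
    # Gibbs kernel associated with the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two bins, moving mass across bins costs 1, staying costs 0.
a = [0.5, 0.5]
b = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_plan(a, b, cost)
transport_cost = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(2))
```

In this toy problem almost all mass stays in place, so `transport_cost` is close to zero. For very small `eps` the kernel entries can underflow, which is why practical implementations usually work in the log domain.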

**Nicolas Bonneel, Liris, University of Lyon 1, France**

*Sliced Partial Optimal Transport*

Sliced optimal transport is a blazing fast way to compute a notion of optimal transport between uniform measures supported on point clouds via 1-d projections. However, it requires these point clouds to have the same cardinality. This talk will first introduce optimal transport, and show a fast numerical scheme to compute partial optimal transport in 1-d: this corresponds to solving an alignment problem often solved with dynamic programming, though our solution is much faster. We integrate this 1-d alignment algorithm within a sliced transport framework, for applications such as color transfer. We also make use of sliced partial optimal transport to solve point cloud registration tasks such as those traditionally solved with ICP. I’ll show results involving hundreds of thousands of points computed within seconds or minutes. I’ll also show preliminary results on sliced partial Wasserstein barycenters.
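For readers unfamiliar with the sliced approach, the following minimal sketch shows the standard (non-partial) sliced Wasserstein distance between two point clouds of equal cardinality. It does not implement the partial 1-d alignment algorithm presented in the talk. The function names, the number of projections and the seed are assumptions made for this example.

```python
import math
import random


def wasserstein_1d_sq(xs, ys):
    """Squared 2-Wasserstein distance between two equal-size 1-d samples.

    In 1-d, the optimal assignment matches sorted values in order.
    """
    if len(xs) != len(ys):
        raise ValueError("standard sliced OT needs equal cardinalities")
    return sum((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def sliced_wasserstein_sq(points_x, points_y, n_projections=64, seed=0):
    """Average of 1-d squared Wasserstein distances over random directions."""
    rng = random.Random(seed)
    dim = len(points_x[0])
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction, drawn as a normalized Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both clouds onto the direction and solve the 1-d problem.
        proj_x = [sum(c * v for c, v in zip(direction, p)) for p in points_x]
        proj_y = [sum(c * v for c, v in zip(direction, p)) for p in points_y]
        total += wasserstein_1d_sq(proj_x, proj_y)
    return total / n_projections


cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in cloud]
distance_sq = sliced_wasserstein_sq(cloud, shifted)
```

Each projection costs only a sort, which is what makes the sliced approach fast. The equal-cardinality check in `wasserstein_1d_sq` is exactly the restriction that the partial formulation in the talk removes.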

**Liliana Forzani, Universidad Nacional del Litoral, Argentina**

*Partial Least Square: Statistics for the Chemometrics*

Partial least squares (PLS) is one of the first methods for prediction in high-dimensional linear regressions in which the sample size need not be large relative to the number of predictors. Since its introduction, the development of PLS regression has taken place mainly within the chemometrics community, where empirical prediction is the main issue, but PLS is now a core method for big data. However, studies of PLS have appeared in the mainstream statistics literature only from time to time, and there have been no positive results on the theoretical properties of PLS as used by the chemometrics community. In joint work with R. Dennis Cook, we study the theoretical properties of prediction using PLS in the same context in which the chemometrics community uses it.

**Eyke Hüllermeier, Paderborn University, Germany**

*Analyzing and Learning from Ranking Data: New Problems and Challenges*

The analysis of ranking data has a long tradition in statistics, and corresponding methods have been used in various fields of application, such as psychology and the social sciences. More recently, applications in information retrieval and machine learning have caused a renewed interest in the analysis of rankings and topics such as “learning to rank” and preference learning. This talk provides a snapshot of ranking in the field of machine learning, with a specific focus on new problems and challenges from a statistical point of view. In addition to problems of unsupervised learning on ranking data and different types of ranking tasks in the realm of supervised learning, this also includes recent work on preference learning and ranking in an online setting.