Session 2: High dimension and applications

    • Ernest Fokoué (Rochester Institute of Technology – USA) – Statistical Machine Learning with High Multidimensional Arrays of Data

This lecture gives the audience a cursory tour of some of the most common tools for learning the patterns underlying high multidimensional arrays of data. I will first focus on the ubiquitous two-dimensional array (matrix) setting, with n denoting the sample size, i.e. the number of p-dimensional vectors under consideration, and will touch on some of the most recent statistical and computational methods for dealing with both the n ≫ p and the n ≪ p scenarios. Among other things, I will discuss the techniques of regularization/penalization, selection and projection that help circumvent the inferential and prediction challenges inherent in high-dimensional regression, classification and clustering. I will also cover ensemble techniques, such as random forests and random subspace learning in general, that have proved formidable in mitigating many learning challenges arising in high-dimensional predictive modelling. Throughout this presentation, the concept of “high” will remain relative, typically approached from pure common sense but also with respect to the computational architecture ultimately used for implementing the devised methods. Indeed, some of the solutions to the high-dimensional data modelling conundrum will come from making the most of the high-performance parallel computation available through multicore CPU architectures now standard on desktop computers, through Graphics Processing Units (GPUs) that can easily be added, or through the clusters of computers increasingly used by statisticians. For most of the methods, techniques and algorithms mentioned earlier, I will point to existing parallel implementations wherever possible. Time permitting, I will give an overview of some of the most promising results and applications of multidimensional arrays (tensors) in machine learning, namely the use of m-tensors in image processing, topic modelling, recommender systems and community detection, to name a few.
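To make the regularization idea concrete in the n ≪ p setting, here is a minimal numpy sketch (illustrative only, not from the lecture; the data, penalty value and helper name are assumptions): the ridge penalty makes the normal equations solvable even when there are far more features than samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated n << p regression problem: 50 samples, 500 features,
# only the first 5 features carry signal.
n, p = 50, 500
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y.
    The penalty lam*I makes the system invertible even when p > n."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_hat = ridge(X, y, lam=1.0)
# The informative coefficients are shrunk but still stand out from the noise ones.
print(beta_hat[:5].round(2))
```

Penalized estimators like this trade a little bias for a large reduction in variance, which is exactly what stabilizes inference and prediction when p dwarfs n.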

    • Mathieu Fauvel (INRA et INPT-ENSAT – Toulouse – France) – Spectral-spatial classification of high-dimensional remote sensing images

In this talk, the classification of high-dimensional remote sensing images will be discussed. By high-dimensional, we mean that the number of features is high, typically several hundred, while the number of training samples remains low. Features will be of three kinds:
– Spectral: they are related to the reflectance in the different wavelength domains acquired by the sensor.
– Spatial: they are related to geometric features extracted from the data.
– Temporal: they are related to different acquisitions over time.
Recent advances in spectral-spatial classification of high-dimensional remote sensing images will be presented. Several techniques for combining spatial and spectral information are investigated. Spatial information is extracted at the object level (a set of pixels) rather than at the conventional pixel level. Mathematical morphology, Markov random fields, and object classification will be discussed within a kernel methods framework.
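One common way to combine spectral and spatial information within a kernel framework is a composite kernel, i.e. a convex combination of a kernel on spectral features and a kernel on spatial features. The following numpy sketch illustrates that idea on synthetic data (the RBF kernels, the weight mu, and the kernel ridge classifier are illustrative assumptions, not the specific methods of the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Synthetic two-class data: high-dimensional spectral features and
# low-dimensional spatial features (e.g. morphological attributes).
n = 60
labels = np.repeat([0, 1], n // 2)
spec = rng.standard_normal((n, 100)) + labels[:, None] * 0.5
spat = rng.standard_normal((n, 5)) + labels[:, None] * 1.0

# Composite kernel: convex combination of the spectral and spatial kernels.
# mu is a hypothetical trade-off weight, tuned by cross-validation in practice.
mu = 0.4
K = mu * rbf_kernel(spec, spec, gamma=0.01) + (1 - mu) * rbf_kernel(spat, spat, gamma=0.1)

# Kernel ridge "classifier": fit on labels coded as +/-1, predict by sign.
y = 2.0 * labels - 1.0
alpha = np.linalg.solve(K + 1e-2 * np.eye(n), y)
pred = np.sign(K @ alpha)
accuracy = (pred == y).mean()
print(accuracy)
```

Because a convex combination of positive semi-definite kernels is itself a valid kernel, the same construction carries over to any kernel machine, SVMs included.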

    • Emeline Perthame (Inria Grenoble – Rhône-Alpes – France) – Variable selection for correlated data in high dimension using decorrelation methods

The analysis of high-throughput data has renewed the statistical methodology for feature selection. Such data are characterized both by their high dimension and by their heterogeneity, as the true signal and several confounding factors are often observed at the same time. In such a framework, the usual statistical approaches are called into question and can lead to misleading decisions, as they were initially designed under an assumption of independence among variables. In this talk, I will present some improvements of variable selection methods for regression and supervised classification problems, obtained by accounting for the dependence between selection statistics. The methods proposed in this talk are based on a factor model of the covariates, which assumes that the variables are conditionally independent given a vector of latent variables. I will illustrate the impact of dependence on the stability of some usual selection procedures. Next, I will focus in particular on the analysis of event-related potentials (ERP) data, which are widely collected in psychological research to determine the time courses of mental events. Such data are characterized by a temporal dependence pattern that is both strong and complex, and which can be modeled by the above-mentioned factor model.
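To make the decorrelation idea concrete, here is a schematic numpy sketch (not the authors' exact procedure): covariates are generated from a one-factor model, the latent factor is estimated by PCA, and its contribution is regressed out of each covariate, which shrinks the spurious correlations that destabilize selection statistics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated covariates via a one-factor model: X = Z b' + E,
# where the latent factor Z is shared by all p variables.
n, p = 100, 200
Z = rng.standard_normal((n, 1))
b = rng.standard_normal((1, p))
X = Z @ b + rng.standard_normal((n, p))

# Estimate the latent factor by the leading left singular vector of X
# (a simple PCA-based stand-in for the model-based estimation used in practice).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z_hat = U[:, :1] * s[0]

# Decorrelate: regress each covariate on the estimated factor and keep
# the residuals, which are approximately independent across variables.
coef = np.linalg.lstsq(Z_hat, X, rcond=None)[0]
X_dec = X - Z_hat @ coef

def mean_abs_offdiag_corr(M):
    """Average absolute off-diagonal correlation between columns of M."""
    C = np.corrcoef(M, rowvar=False)
    return np.abs(C - np.diag(np.diag(C))).mean()

# Off-diagonal correlations shrink markedly after decorrelation.
print(mean_abs_offdiag_corr(X), mean_abs_offdiag_corr(X_dec))
```

Selection statistics computed on the decorrelated residuals behave much closer to their independence-based theory, which is what restores the stability of the selected variable sets.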