Thursday, January 22, 2026, 2:00pm
Rubén Martos, Université Paris 1 Panthéon-Sorbonne
[Website](https://rmartosprieto.github.io)
Causal Inference of post-transcriptional regulation from long-read RNA sequences
Abstract We propose a novel framework for reconstructing the chronology of genetic regulation using causal inference based on Pearl’s theory. The approach proceeds in three main stages: causal discovery, causal inference, and chronology construction. We apply it to the *ndhB* and *ndhD* genes of the chloroplast in *Arabidopsis thaliana*, generating four alternative maturation timeline models per gene, each derived from a different causal discovery algorithm (HC, PC, LiNGAM, or NOTEARS).
Two methodological challenges are addressed: the presence of missing data, handled via an EM algorithm that jointly imputes missing values and estimates the Bayesian network, and the selection of the l1-regularization parameter in NOTEARS, for which we introduce a stability selection strategy. The resulting causal models consistently outperform reference chronologies in terms of both reliability and model fit. Moreover, by combining causal reasoning with domain expertise, the framework enables the formulation of testable hypotheses and the design of targeted experimental interventions grounded in theoretical predictions.
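As an illustration of the stability-selection idea for choosing an ℓ1 penalty, here is a minimal numpy sketch on a generic sparse linear model. This is a stand-in only: the talk applies the strategy to the NOTEARS objective, not to this toy lasso, and the solver, penalty grid, and thresholds below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # ISTA solver for (1/2n)||y - Xb||^2 + lam * ||b||_1
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - grad / L, lam / L)
    return b

def stability_selection(X, y, lambdas, n_subsamples=50, seed=0):
    # selection frequency of each feature over random half-subsamples,
    # for each candidate penalty level
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(lambdas), p))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        for i, lam in enumerate(lambdas):
            freq[i] += np.abs(lasso_ista(X[idx], y[idx], lam)) > 1e-8
    return freq / n_subsamples
```

A stable penalty level is one at which the truly relevant coefficients are selected in nearly every subsample while spurious ones rarely are.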
3:00pm
Marina Gomtsyan, Sorbonne University, LPSM
[Website](https://www.lpsm.paris/users/mgomtsian/index)
Variable selection methods in sparse GLARMA models
Abstract
We propose novel variable selection methods for sparse GLARMA (Generalised Linear Autoregressive Moving Average) models, which can be used for modelling discrete-valued time series. These models allow us to introduce some dependence in a Generalised Linear Model (GLM). The key idea behind our estimation procedure is first to estimate the coefficients of the ARMA part of the GLARMA model and then use a regularised approach, namely the Lasso, to estimate the regression coefficients of the GLM part of the model.
Furthermore, we establish a sign-consistency result for the estimator of the regression coefficients in a sparse Poisson model without time dependence. The performance of our proposed methods was assessed in simulation studies across different frameworks and on several datasets from molecular biology. Our approaches exhibit very good statistical performance, surpassing other methods in identifying non-null regression coefficients. Moreover, their low computational burden enables their application to relatively large datasets. Our proposed methods are implemented in R packages, which are publicly available on the Comprehensive R Archive Network (CRAN).
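As a rough sketch of the lasso step in such a procedure, the fragment below fits a sparse Poisson regression (the no-time-dependence setting of the sign-consistency result) by proximal gradient descent. The step size, penalty level, and solver are illustrative assumptions, not the authors' R implementation, and the ARMA estimation stage is omitted.

```python
import numpy as np

def poisson_lasso(X, y, lam, step=0.05, n_iter=3000):
    # proximal gradient for the Poisson negative log-likelihood + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(X @ b)                 # Poisson mean under the log link
        grad = X.T @ (mu - y) / n          # gradient of the NLL
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b
```

On well-separated signals the sign pattern of the true coefficients is recovered, which is exactly the property the sign-consistency result formalizes.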
Thursday, June 26, 2025
Geneviève Robin, OWKIN → CNRS MAP5
Thursday, May 22, 2025
Thursday, April 10, 2025
Chrysoula Kosma, ENS Paris-Saclay, Centre Borelli
Parameterizing Convolutional Neural Networks to Address Challenges in Modeling Irregularly Sampled Time Series
Abstract Multivariate time series are common in various real-world application areas, such as industrial and medical ones. In several cases, the sampling rate is not fixed, which results in sparse or misaligned observations across different variables. Traditional sequential neural network models, such as recurrent and convolutional neural networks (RNNs, CNNs), typically assume evenly spaced observations, which creates significant challenges for modeling irregular time series. While many proposed architectures use RNN variants to address irregular time intervals, CNNs have not been sufficiently explored in this context. In this talk, we will introduce a method to parameterize convolutional layers with kernels that are explicit functions of time. These time-dependent functions enhance the learning of continuous-time hidden dynamics and can be efficiently incorporated into convolutional kernel weights. We will thus introduce a time-parameterized convolutional neural network, which retains the properties of standard convolutions but is specifically tailored to irregularly sampled time series. The proposed method is evaluated on interpolation and classification tasks with real-world datasets involving irregularly sampled multivariate time series. The experimental results demonstrate the competitive performance of the proposed convolutions and provide interpretability for the input series through combinations of learnable time functions, marking a first robust application of convolutions to the domain of irregular sampling. Our presentation will also address the challenges of diachronic modeling in neural networks for time series, offering insights into how incorporating structure specifically tailored to the unique characteristics of time series data, such as time parameterization, can improve model robustness.
This approach also helps mitigate the increasing demand for large datasets, a challenge that has become even more pronounced in the era of large pretrained models.
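The core idea, kernel weights computed as continuous functions of the actual time offsets rather than of array positions, can be sketched as follows. The damped-cosine kernel family and the causal windowing are illustrative assumptions, not the exact parameterization of the talk.

```python
import numpy as np

def time_kernel(dt, a, b, c):
    # continuous-time kernel: a damped cosine parameterized by (a, b, c)
    return a * np.exp(-b * np.abs(dt)) * np.cos(c * dt)

def time_param_conv(t, x, params, window=5):
    # causal "convolution" over an irregularly sampled series (t, x):
    # each output uses the previous `window` observations, weighted by
    # the kernel evaluated at the true elapsed times, not at index offsets
    out = np.zeros_like(x)
    for i in range(len(x)):
        j0 = max(0, i - window + 1)
        dt = t[i] - t[j0:i + 1]            # actual time offsets
        w = time_kernel(dt, *params)
        out[i] = w @ x[j0:i + 1]
    return out
```

Because the weights depend on elapsed time, the same parameters apply whether the series is regularly or irregularly sampled.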
Thursday, March 27, 2025, 2:00pm
Solenne Gaucher, École Polytechnique
[Website](https://cmap.ip-paris.fr/recherche/decision-et-donnees/simpas)
[Website](https://solennegaucher.github.io)
Classification and regression under fairness constraints
Abstract Artificial intelligence (AI) is increasingly shaping the decisions that affect our lives—from hiring and education to healthcare and access to social services. While AI promises efficiency and objectivity, it also carries the risk of perpetuating and even amplifying societal biases embedded in the data used to train these systems. Algorithmic fairness aims to design and analyze algorithms capable of providing predictions that are both reliable and equitable.
In this talk, I will introduce one of the main approaches to achieving this goal: statistical fairness. After outlining the basic principles of this approach, I will focus specifically on a fairness criterion known as “demographic parity,” which seeks to ensure that the distribution of predictions is identical across different populations. I will then discuss recent results related to regression and classification problems under this fairness constraint, exploring scenarios where differentiated treatment of populations is either permitted or prohibited.
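A minimal post-processing sketch of demographic parity, assuming group membership may be used at prediction time (the "differentiated treatment permitted" scenario): each group receives its own threshold so that positive-prediction rates match a common target. The quantile-threshold rule is one simple construction, not the estimators analyzed in the talk.

```python
import numpy as np

def dp_thresholds(scores, groups, target_rate):
    # one threshold per group, chosen so that each group's
    # positive-prediction rate equals target_rate (demographic parity)
    thr = {}
    for g in np.unique(groups):
        s = scores[groups == g]
        thr[g] = np.quantile(s, 1 - target_rate)
    return thr

def predict_dp(scores, groups, thr):
    # apply the group-specific thresholds
    return np.array([scores[i] >= thr[g] for i, g in enumerate(groups)], dtype=int)
```

Even when the two groups' score distributions differ, the resulting positive rates coincide by construction.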
Thursday, March 13, 2025
Pavlo Mozharovskyi, Télécom Paris
Thursday, February 6, 2025, 2:00pm
François Bachoc, University Paul Sabatier, Institut de Mathématiques de Toulouse
[Website](https://www.math.univ-toulouse.fr/~fbachoc/index.html)
Improved learning theory for kernel distribution regression with two-stage sampling
Abstract The distribution regression problem encompasses many important statistics and machine learning tasks, and arises in a large range of applications. Among various existing approaches to tackle this problem, kernel methods have become a method of choice. Indeed, kernel distribution regression is both computationally favorable, and supported by a recent learning theory. This theory also tackles the two-stage sampling setting, where only samples from the input distributions are available. In this talk, we improve the learning theory of kernel distribution regression. We address kernels based on Hilbertian embeddings, that encompass most, if not all, of the existing approaches. We introduce the novel near-unbiased condition on the Hilbertian embeddings, that enables us to provide new error bounds on the effect of the two-stage sampling, thanks to a new analysis. We show that this near-unbiased condition holds for three important classes of kernels, based on optimal transport and mean embedding. As a consequence, we strictly improve the existing convergence rates for these kernels. Our setting and results are illustrated by numerical experiments.
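A small numpy sketch of the two-stage sampling setting with a mean-embedding kernel: each input is only a bag of samples from its distribution, the inner product of Gaussian-kernel mean embeddings is estimated by averaging pairwise kernel values, and kernel ridge regression is run on top. Bandwidth and ridge levels are illustrative assumptions.

```python
import numpy as np

def bag_kernel(A, B, gamma=5.0):
    # estimated inner product of the Gaussian-kernel mean embeddings of
    # two sample bags A (na x d) and B (nb x d): average pairwise kernel value
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).mean()

def fit_predict(bags_tr, y_tr, bags_te, lam=1e-3, gamma=5.0):
    # kernel ridge regression on the bag-level Gram matrix
    n = len(bags_tr)
    K = np.array([[bag_kernel(a, b, gamma) for b in bags_tr] for a in bags_tr])
    alpha = np.linalg.solve(K + lam * np.eye(n), y_tr)
    K_te = np.array([[bag_kernel(a, b, gamma) for b in bags_tr] for a in bags_te])
    return K_te @ alpha
```

The two-stage sampling error discussed in the talk is exactly the gap between these empirical bag kernels and their population counterparts.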
3:15pm
Toby Dylan Hocking, Univ. Sherbrooke, Canada
[Website](https://www.usherbrooke.ca/informatique/nous-joindre/personnel/corps-professoral/professeurs/toby-dylan-hocking)
SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets
Toby Dylan Hocking, Gabrielle Thibault, Cameron Scott Bodine, Paul Nelson Arellano, Alexander F Shenkin, Olivia Jasmine Lindly
Abstract In many real-world applications of machine learning, we would like to know whether it is possible to train on the data gathered so far and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc.). Another question is whether data subsets are similar enough that it is beneficial to combine subsets during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method which can be used to answer both questions. SOAK systematically compares models trained on different subsets of data and then used for prediction on a fixed test subset, to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on six new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), three image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and eleven benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
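In pseudocode terms, SOAK amounts to choosing one of three training masks per (subset, fold) pair. A minimal sketch, where the least-squares base learner, the mask conventions, and the squared-error metric are simplifying assumptions rather than the paper's exact protocol:

```python
import numpy as np

def lstsq_predict(X_tr, y_tr, X_te):
    # simple least-squares base learner
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ w

def soak(X, y, subset, K=3, seed=0):
    # Same/Other/All K-fold CV: for each subset s and fold k, evaluate on
    # the held-out fold of s after training on Same (rest of s),
    # Other (all other subsets), or All (everything outside fold k)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, K, size=len(y))
    errs = {"same": [], "other": [], "all": []}
    for s in np.unique(subset):
        for k in range(K):
            test = (subset == s) & (folds == k)
            if test.sum() == 0:
                continue
            train = {
                "same": (subset == s) & (folds != k),
                "other": subset != s,
                "all": folds != k,
            }
            for name, m in train.items():
                pred = lstsq_predict(X[m], y[m], X[test])
                errs[name].append(np.mean((pred - y[test]) ** 2))
    return {name: float(np.mean(v)) for name, v in errs.items()}
```

Comparable "same" and "other" errors suggest the subsets share a learnable pattern; a much larger "other" error suggests they do not.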
Thursday, December 12, 2024, 3:00pm
Marie Chion, MRC Biostatistics Unit, Univ. Cambridge, UK.
[Website](https://www.mrc-bsu.cam.ac.uk/staff/marie-chion)
From multiple imputation to the Bayesian framework in quantitative proteomics
Abstract In this seminar, we will look at the problem of missing values in quantitative mass-spectrometry-based proteomics data. One way of dealing with this problem is to impute missing values, i.e. to replace them with a value chosen by the user or by an algorithm. Multiple imputation iterates the imputation process several times to obtain several complete data sets, which are then combined before applying conventional statistical tools. However, the usual software for the statistical analysis of proteomics data uses the averaged complete dataset and ignores the uncertainty induced by the random imputation process.
Therefore, we present a rigorous method for multiple imputation using Rubin's rules and a variant of the moderated t-test that accounts for the variability arising from both the initial dataset and the multiple imputation process. As the moderated t-test is based on a Bayesian hierarchical model, we also propose a fully Bayesian framework for differential proteomic analysis and discuss the place of multiple imputation in such a framework.
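The pooling step of Rubin's rules is short enough to state in full; the degrees-of-freedom formula below is the classical one (Rubin, 1987), shown here for a scalar parameter:

```python
import numpy as np

def rubin_pool(estimates, variances):
    # Rubin's rules: pool M point estimates and their variances,
    # one pair per imputed dataset
    estimates = np.asarray(estimates, float)
    variances = np.asarray(variances, float)
    M = len(estimates)
    qbar = estimates.mean()                  # pooled point estimate
    ubar = variances.mean()                  # within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    t = ubar + (1 + 1 / M) * b               # total variance
    df = (M - 1) * (1 + ubar / ((1 + 1 / M) * b)) ** 2   # classical df
    return qbar, t, df
```

The total variance `t` is precisely the quantity that single-imputation analyses understate by ignoring the between-imputation term.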
Thursday, December 12, 2024, 4:15pm
Thursday, December 5, 2024, 2:00pm
Tâm Le Minh, Inria, Laboratoire Jean Kuntzmann, Université Grenoble Alpes
[Website](https://tam-leminh.github.io)
Exchangeable models for ecological interaction data
Abstract In ecology, the analysis of survey data (presence-absence, abundances, interactions between species) often relies on "null models". However, this approach has limitations that are frequently ignored in ecological studies. Taking plant-pollinator interaction networks as an example, we introduce the BEDD (Bipartite Expected Degree Distribution) model, a null model that overcomes several of these limitations by relying on the assumption that the observed species are exchangeable.
The properties of exchangeable models make it possible to use inference methods based on U-statistics, a class of statistics particularly well suited to this type of data structure. I will describe some of the opportunities offered by U-statistics for the analysis of bipartite networks, in particular in the context of ecological interactions. Through examples on simulated and real data, I will highlight the potential of this approach, while also discussing its limits, notably those induced by the exchangeability assumption.
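To give the flavor of such U-statistics, here is a toy estimator on a binary bipartite interaction matrix: averaging a kernel over pairs of distinct rows and distinct columns yields an unbiased estimate of the squared mean interaction probability under exchangeability. The kernel choice is purely illustrative, not one of the statistics from the talk.

```python
import numpy as np
from itertools import combinations

def u_stat_edge_prob_sq(Y):
    # U-statistic estimating p^2, where p is the mean interaction
    # probability of the exchangeable bipartite matrix Y: the kernel
    # Y[i1,j1] * Y[i2,j2] is averaged over distinct row and column pairs
    n, m = Y.shape
    total, count = 0.0, 0
    for i1, i2 in combinations(range(n), 2):
        for j1, j2 in combinations(range(m), 2):
            # symmetrize over the two ways of matching rows to columns
            total += 0.5 * (Y[i1, j1] * Y[i2, j2] + Y[i1, j2] * Y[i2, j1])
            count += 1
    return total / count
```

Because the kernel never reuses a row or column, the estimator is unbiased, and asymptotic normality follows from the U-statistic machinery mentioned in the talk.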
Thursday, June 6, 2024
Aurélien Beaude, PhD student, AROB@S team, IBISC, [Website](https://forge.ibisc.univ-evry.fr/abeaude/AttOmics)
Title: The attention mechanism for omics data
Abstract
The increasing availability of high-throughput omics data makes it possible to envision a new medicine centered on the individual patient. Precision medicine relies on exploiting these high-throughput data with machine-learning models, especially deep-learning approaches, to improve diagnosis. Due to the high-dimensional, small-sample nature of omics data, current deep-learning models end up with many parameters and have to be fitted on a limited training set. Cellular functions are governed by the combined action of multiple molecular entities specific to a patient. The expression of one gene may impact the expression of other genes differently in different patients. With classical deep-learning approaches, the interactions learned during training are assumed to be identical for all patients at inference time.
Self-attention can be used to improve the representation of the feature vector by incorporating dynamically computed relationships between elements of the vector, i.e., computing patient-specific feature interactions. Applying self-attention to high-dimensional vectors such as omics profiles is challenging, as the memory requirements of self-attention scale quadratically with the number of elements. In AttOmics, to reduce the memory footprint of the self-attention computation, we decompose each omics profile into a set of groups, where each group contains related features. Group embeddings are computed by projecting each group with its own fully connected network (FCN), considering only intra-group interactions. Inter-group interactions are computed by applying multi-head self-attention to the set of groups. With this approach, we can reduce the number of parameters compared to an MLP of similar dimension while accurately predicting the type of cancer.
We extended this work to a multimodal setting in CrossAttOmics and used cross-attention to compute interactions between two modalities. Instead of computing the interactions between all the modality pairs, we focused on the known regulatory links between the different omics. By using only two or three omics combinations, CrossAttOmics can achieve better accuracy than training only on one modality. When training on small datasets, CrossAttOmics performs better than other architectures.
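A shape-level numpy sketch of the grouped attention described above, where random linear projections stand in for the learned per-group FCNs and single-head attention replaces multi-head attention for brevity:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(x, groups, Ws, Wq, Wk, Wv):
    # x: (p,) omics profile; groups: index arrays partitioning the p features
    # 1) per-group embeddings: intra-group interactions only
    E = np.stack([Ws[g_id] @ x[g] for g_id, g in enumerate(groups)])  # (G, d)
    # 2) self-attention across the G group embeddings: inter-group interactions
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)               # (G, G)
    return A @ V
```

The attention matrix is G x G instead of p x p, which is the memory saving the abstract refers to.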
Thursday, May 23, 2024
Adaptive Functional Data Analysis (trajectories, random functions)
Valentin Patilea, CREST, ENSAI, France, [Website](https://ensai.fr/equipe/valentin-patilea/)
Abstract
Functional Data Analysis (FDA) depends critically on the regularity of the observed curves or surfaces. Estimating this regularity is a difficult problem in nonparametric statistics. In FDA, however, it is much easier due to the replication nature of the data. After introducing the concept of local regularity for functional data, we provide user-friendly nonparametric methods for investigating it, for which we derive non-asymptotic concentration results. As an application of the local regularity estimation, the implications for functional PCA are shown. Flexible and computationally tractable estimators for the eigenelements of noisy, discretely observed functional data are proposed. These estimators adapt to the local smoothness of the sample paths, which may be non-differentiable and have time-varying regularity. In the course of constructing our estimator, we derive upper bounds on the quadratic risk and obtain the optimal smoothing bandwidth that minimizes these risk bounds. The optimal bandwidth can be different for each of the eigenelements. Simulation results justify our methodological contribution, which is available for use in the R package FDAdapt. Extensions of the adaptive FDA approach to streaming and multivariate functional data are also discussed.
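The replication idea behind local regularity estimation can be sketched in a few lines: with many observed curves, the mean squared increments θ(s, t) are estimated empirically, and the local exponent follows from comparing increments at two scales. The two-scale ratio below is one simple variant for illustration, not necessarily the estimator implemented in FDAdapt.

```python
import numpy as np

def local_regularity(curves, t_grid, t0, delta):
    # estimate the local Hölder exponent H at t0 from replicated curves,
    # using theta(s, t) = mean over curves of (X(t) - X(s))^2 and
    # H ≈ log(theta(t0-d, t0+d) / theta(t0-d, t0)) / (2 log 2),
    # since theta scales like |t - s|^(2H) locally
    def theta(s, t):
        i = np.abs(t_grid - s).argmin()
        j = np.abs(t_grid - t).argmin()
        return np.mean((curves[:, j] - curves[:, i]) ** 2)
    num = theta(t0 - delta, t0 + delta)
    den = theta(t0 - delta, t0)
    return np.log(num / den) / (2 * np.log(2))
```

For Brownian motion, whose paths have local regularity 1/2, the estimator concentrates around 0.5 as the number of replicated curves grows.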
Thursday, April 25, 2024
2:00pm: Hugues Van Assel, ENS de Lyon
Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein Projection
Abstract Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. Traditionally, this involves using dimensionality reduction methods to project data onto lower-dimensional spaces or organizing points into meaningful clusters. In practice, these methods are used sequentially, without guaranteeing that the clustering aligns well with the conducted dimensionality reduction. In this work, we offer a fresh perspective: that of distributions. Leveraging tools from optimal transport, particularly the Gromov-Wasserstein distance, we unify clustering and dimensionality reduction into a single framework called distributional reduction. This allows us to jointly address clustering and dimensionality reduction with a single optimization problem. Through comprehensive experiments, we highlight the versatility of our method and show that it outperforms existing approaches across a variety of image and genomics datasets.
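The objective being minimized is the Gromov-Wasserstein discrepancy between the dataset's pairwise-distance geometry and that of a small set of embedded prototypes. Evaluating it for a given coupling is straightforward; the joint optimization over couplings and prototype geometry (e.g. with the POT library) is omitted here.

```python
import numpy as np

def gw_cost(C1, C2, T):
    # Gromov-Wasserstein objective for a coupling T between two spaces
    # with pairwise-distance matrices C1 (n x n) and C2 (m x m):
    #   sum_{i,j,k,l} (C1[i,k] - C2[j,l])^2 * T[i,j] * T[k,l]
    # expanded as (a - b)^2 = a^2 - 2ab + b^2 to avoid the 4-index tensor
    p, q = T.sum(axis=1), T.sum(axis=0)          # marginals of the coupling
    const = p @ (C1 ** 2) @ p + q @ (C2 ** 2) @ q
    cross = np.sum(C1 * (T @ C2 @ T.T))
    return const - 2 * cross
```

The cost is zero exactly when the coupling matches the two geometries perfectly, which is why clustering (coupling) and embedding (prototype geometry) can be optimized jointly.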
3:15pm: Miguel Atencia, Universidad de Málaga, Spain
Challenges in Reservoir Computing
Abstract In this expository talk, we will review the Echo State Network (ESN), a recurrent neural network that has achieved good results in time series tasks such as forecasting, classification, and encoding-decoding. However, the lack of a rigorous mathematical foundation makes its application in a general context difficult. On the one hand, strong theoretical results, such as the Echo State Property and universal approximation, are non-constructive and require critical simplifying assumptions. On the other hand, the usual heuristics for optimal hyper-parameter selection have turned out to be neither necessary nor sufficient. Some connections of ESN models with ideas from dynamical systems will be presented, together with recent design proposals, as well as a novel application to time series clustering.
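For readers unfamiliar with ESNs, a minimal sketch: a fixed random reservoir rescaled to a target spectral radius (the common heuristic whose insufficiency the talk discusses), with only a linear readout trained by ridge regression. All sizes and constants are illustrative.

```python
import numpy as np

def esn_states(u, n_res=100, rho=0.9, seed=0):
    # drive a fixed random reservoir with the scalar input sequence u;
    # only the readout below is trained, never the reservoir itself
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_res, n_res))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius
    w_in = rng.normal(size=n_res)
    x, states = np.zeros(n_res), []
    for ut in u:
        x = np.tanh(W @ x + w_in * ut)
        states.append(x.copy())
    return np.array(states)

def ridge_readout(S, y, lam=1e-6):
    # linear readout fitted by ridge regression on the reservoir states
    return np.linalg.solve(S.T @ S + lam * np.eye(S.shape[1]), S.T @ y)
```

A standard demonstration is one-step-ahead forecasting of a smooth signal, after discarding an initial washout of transient states.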
Thursday, March 7, 2024
Title: Molecular Motors: Stochastic Modeling and Statistical Inference.
Speaker: John Fricks, Arizona State University, U.S.A.
Abstract Molecular motors, specifically kinesin and dynein, transport cargos, including vesicles and ion channels, along microtubules in neurons to where they are needed. Such transport is vital to the well-functioning of neurons, and the breakdown in such transport function has been implicated in a number of neurodegenerative diseases. Since their discovery several decades ago, a variety of nano-scale experimental methods have been developed to better understand the function of transport-based molecular motors. In this talk, it will be shown how stochastic modeling techniques, such as functional central limit theorems, and statistical inference techniques for time series, such as particle filtering and EM algorithms, can be combined to better understand these experiments and give insight into the mechanisms behind motor-based intra-cellular transport.
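As a toy version of the inference problem, the sketch below runs a bootstrap particle filter on a caricature of motor stepping: a hidden position that advances by a fixed step size at random frames, observed with Gaussian noise. All model constants are invented for illustration and bear no relation to real kinesin data.

```python
import numpy as np

def bootstrap_pf(obs, n_part=500, step_rate=0.3, step=8.0, obs_sd=5.0, seed=0):
    # bootstrap particle filter for a toy stepping model: the hidden
    # position advances by `step` with probability step_rate per frame,
    # and each frame is observed with Gaussian noise of sd obs_sd
    rng = np.random.default_rng(seed)
    x = np.zeros(n_part)
    means = []
    for y in obs:
        x = x + step * rng.binomial(1, step_rate, n_part)   # propagate
        w = np.exp(-0.5 * ((y - x) / obs_sd) ** 2)          # weight by likelihood
        w /= w.sum()
        x = rng.choice(x, size=n_part, p=w)                 # resample
        means.append(x.mean())
    return np.array(means)
```

The filtered means recover the staircase-like hidden trajectory well below the observation noise level, which is the kind of state estimate the EM-based parameter inference builds on.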
Tuesday, February 27, 2024
Title: Boosting Diversity in Regression Ensembles
Speaker: Jean-Michel Poggi (LMO, Orsay, University Paris-Saclay & University Paris Cité, France)
Abstract The practical interest of using ensemble methods has been highlighted in several works. Aggregation estimation as well as sequential prediction provide natural frameworks for studying ensemble methods and for adapting such strategies to time series data. Sequential prediction focuses on how to combine, by weighting, a given set of individual experts, while aggregation is mainly interested in how to generate individual experts so as to improve prediction performance. We propose, in the regression context, a gradient-boosting-based algorithm that incorporates a diversity term to guide the gradient boosting iterations. The idea is to trade off some individual optimality for global enhancement. The improvement is obtained with progressively generated predictors by boosting diversity. A convergence result is given, ensuring that the associated optimisation strategy reaches the global optimum. Finally, we consider simulated and benchmark datasets as well as a real-world electricity demand dataset to show, by means of numerical experiments, the appropriateness of our procedure, examining the behavior not only of the final or aggregated predictor but also of the whole generated sequence. In the experiments we consider a variety of base learners of increasing complexity: stumps, CART trees, purely random forests, and Breiman's random forests.
This is joint work with Mathias Bourel (Universidad de la República, Montevideo, Uruguay), Jairo Cugliari (University Lyon 2, France), and Yannig Goude (EDF, France)
M. Bourel, J. Cugliari, Y. Goude, J.-M. Poggi, Boosting Diversity in Regression Ensembles, Stat. Anal. Data Min.: ASA Data Sci. J. (2023), 1-17, https://doi.org/10.1002/sam.11654
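A bare-bones sketch of L2 gradient boosting with stumps, the simplest base learner from the experiments above; here diversity is injected only through random subsampling of each learner's training set, a deliberate simplification that does not reproduce the paper's explicit diversity term in the boosting objective.

```python
import numpy as np

def fit_stump(x, r):
    # exhaustive best-threshold stump for squared loss on one feature
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best_err, best = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        lm, rm = rs[:i].mean(), rs[i:].mean()
        err = ((rs[:i] - lm) ** 2).sum() + ((rs[i:] - rm) ** 2).sum()
        if err < best_err:
            best_err, best = err, (0.5 * (xs[i - 1] + xs[i]), lm, rm)
    return best

def predict_stump(stump, x):
    s, lv, rv = stump
    return np.where(x <= s, lv, rv)

def boost(x, y, n_rounds=200, lr=0.1, subsample=0.5, seed=0):
    # L2 gradient boosting: each stump fits the current residuals
    # (the negative gradient) on a fresh random subsample
    rng = np.random.default_rng(seed)
    F = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        stump = fit_stump(x[idx], (y - F)[idx])
        F += lr * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, F
```

The paper's contribution is precisely to replace this implicit, randomization-only diversity with a diversity term inside the gradient itself.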