====== Probability and Statistics ======
===== Scientific Seminars =====
=== Seminar by Paul Bastide, 28/09/2018 ===
http://pbastide.github.io/docs/20180928_Evry.pdf
===== SSB Seminar =====
This monthly seminar takes place on Tuesdays at 2 pm, alternating between the Jouy-en-Josas, Évry and AgroParisTech sites. The [[http://www.ssbgroup.fr/seminaire.html|programme]] is available on the [[http://www.ssbgroup.fr/|SSB]] group website.
==== Access ====
  * To reach the Unité Mathématique, Informatique & Génome (INRA, Jouy-en-Josas): directions at [[http://w3.jouy.inra.fr/acces/acces.shtml|www.jouy.inra.fr]]
  * To reach LaMME (Évry): [[../contact|directions]].
  * To reach AgroParisTech, 16 rue Claude Bernard, Paris: directions at [[http://www.agroparistech.fr/-Adresses-et-plans-d-acces-.html|www.agroparistech.fr]]
===== Atelier Universitaire d'Evry en Bioinformatique (AUDEBI) =====
  * [[http://www.ens.univ-evry.fr/emedia2014/course/view.php?id=1099|AUDeBI]]: Atelier Universitaire d'Evry en Bioinformatique (Évry University Bioinformatics Workshop)
===== Occasional Talks =====
==== 2020/2021 ====
  * 1/12/2020: Alain Durmus (CMLA, ENS Paris-Saclay) //Quantitative convergence of Unadjusted Langevin Monte Carlo and application to stochastic approximation.// ++ Show abstract | \\ Stochastic approximation methods play a central role in maximum likelihood estimation problems involving intractable likelihood functions, such as marginal likelihoods arising in problems with missing or incomplete data, and in parametric empirical Bayesian estimation. Combined with Markov chain Monte Carlo algorithms, these stochastic optimisation methods have been successfully applied to a wide range of problems in science and industry. However, this strategy scales poorly to large problems because of methodological and theoretical difficulties related to using high-dimensional Markov chain Monte Carlo algorithms within a stochastic approximation scheme.
This paper proposes to address these difficulties by using unadjusted Langevin algorithms to construct the stochastic approximation. This leads to a highly efficient stochastic optimisation methodology with favourable convergence properties that can be quantified explicitly and easily checked. The proposed methodology is demonstrated with three experiments, including a challenging application to high-dimensional statistical audio analysis and a sparse Bayesian logistic regression with random effects problem.++
==== 2014/2015 ====
  * 30/06/2015: Toby Dylan Hocking (McGill University, Montréal). //PeakSegJoint: fast supervised peak detection via joint segmentation of multiple count data samples// [[http://arxiv.org/abs/1506.01286|arXiv preprint]]
  * 20/05/2015: Henri-Jean Garchon (UVSQ). //Study of the genetic factors of rheumatoid spondylitis//.
  * 19/05/2015:
    * [[http://odin.mdacc.tmc.edu/~wwang7|Wenyi Wang]], The University of Texas MD Anderson Cancer Center. //DeMix-Bayes: A Bayesian model for the deconvolution of mixed cancer transcriptomes in microarray and RNA sequencing data//. ++Abstract|\\ Clinically derived tumor tissues are often made up of both cancer and surrounding stromal cells. The expression measures of these samples are therefore partially derived from the non-tumor cells. This may explain why some previous studies have identified only a fraction of differentially expressed genes between tumor and surrounding tissue samples. What makes the in silico estimation of mixture components difficult is that the percentage of stromal cells varies from one tissue sample to another. Until recently, there has been limited work on developing statistical methods that account for tumor heterogeneity in gene expression data. To this end, we propose a two-stage Bayesian deconvolution model (DeMix-Bayes) for both RNA-seq read counts and microarray expressions.
Similar to our previous method DeMix, a heuristic search algorithm, DeMix-Bayes addresses two challenges: 1) estimation of both tumor proportion and tumor-specific expression, when neither is known a priori, 2) estimation of individualized expression profiles for both tumor and stromal tissues. We demonstrate the performance of our model in both synthetic microarray datasets and RNA-seq validation datasets.++
    * Kim-Anh Do, The University of Texas MD Anderson Cancer Center. //Statistical Challenges in Cancer Research: Heterogeneity in Functional Imaging and Multi-Dimensional Omics Data//. ++Abstract|\\ Understanding different types of heterogeneity is one of the challenges that needs to be addressed to drive cancer research forward. I will describe, at a high level, the statistical questions posed by cancer research at MD Anderson that involve the different facets of heterogeneity. I will focus on two main research projects: (i) Simultaneous supervised classification of multivariate correlated objects collected from perfusion computed tomography. The aim is to distinguish between biologically distinct tissue types, metastatic versus normal liver, through the evaluation of vasculature heterogeneity; (ii) Pathway-based differential networks in genomics (DINGO) to jointly estimate group-specific conditional dependencies between treatment groups, cancer subtypes, or prognostic features, using different modalities of genomic data (mRNA expression, DNA copy number, methylation), in association with heterogeneous survival times in glioblastoma patients from the Cancer Genome Atlas (TCGA) study. I will discuss the development of computer-intensive statistical models, simulation studies conducted, and inferential results. This is joint work with Veera Baladandayuthapani, Min Jin Ha, Brian Hobbs, Jianhua Hu, and Yuan Wang.++
  * 17/12/2014: Flora Jay, Muséum d'histoire naturelle.
//Statistical methods for population genetics: demographic inference and relationship with the environment//.
  * 30/09/2014: Nathan Touati, Genomic Vision. //Bimodality detection and DNA combing//.
  * 23/09/2014: Marta Avalos, ISPED, Université de Bordeaux. //Medication use and risk of road traffic accidents: analysis with "sparse" methods of a large epidemiological study based on medico-administrative data//.
==== 2013/2014 ====
  * **Tuesday 8 April 2014**: [[http://fnavarro.perso.math.cnrs.fr|Fabien Navarro]], Université de Nantes. **Title**: Block thresholding estimators.
  * **Tuesday 4 March 2014 at 2 pm**: Charles-Elie Rabier, MIA - INRA Toulouse. **Title**: Gaussian processes and phylogenetic trees for the genome.
  * **Tuesday 14 January 2014 at 10 am**: [[http://halweb.uc3m.es/esp/Personal/personas/aarribas/esp/perso.html|Ana Arribas-Gil]], Univ. Carlos III, Madrid. **Title**: Dynamic time warping and sequence alignment.
  * **Tuesday 7 January 2014 at 11 am**: [[https://sites.google.com/site/remikazma/|Rémi Kazma]], CNG. **Title**: Rare variants and gene-environment interactions.
==== 2012/2013 ====
  * **Thursday 30 May 2013 at 11 am**, Philippe Lopez, Université Pierre et Marie Curie. **Title**: Genetic diversity and mosaicism: the contribution of network methods.
  * **Friday 22 March 2013 at 11:30 am**, Amélie Peres, ENS Paris. **Title**: Study of the functional evolution of ancestral vertebrate genomes through gene duplication. ++ Abstract | The scientific study of biological processes generally relies on observing the results of experiments carried out on living models. This obvious fact masks another: all living organisms are the product of hundreds of millions of years of evolution, which are inaccessible to experiment because they lie in the past.
To help restore this historical dimension to biology, our laboratory has undertaken to reconstruct a succession of ancestral genomes in vertebrates. Here, we make large-scale use of vertebrate genome reconstructions to identify, quantify and date gene duplications within different vertebrate lineages. By allowing selection pressure to relax, these duplications are a key source for the emergence of new functions. First, this study established a methodological framework for studying the functional evolution of genomes through gene duplication: the method, which distinguishes the different types of gene duplication (tandem, dispersed, following a whole-genome duplication, ...), reveals the appearance of specific functions over the course of evolution. Our results agree with the literature for functions whose history was already known, such as the emergence of the innate immune system from Bilateria onwards (about 600 million years ago), followed later by the development of the acquired immune system from Chordata onwards (about 550 million years ago). Other functions, such as the response to olfactory stimuli, epidermis development and the synthesis of certain hormones, appear at different stages of evolution, in a lineage-specific manner among the vertebrate lineages studied. Using all the data obtained, we will focus more specifically on analysing the evolution of certain functions (e.g. immune system, development, reproduction, ...) and on studying the evolution of certain gene families. We will combine these data on family expansion by duplication with information on the episodes of positive selection that may have occurred during the same evolutionary period.
This work provides a general picture of the impact of the different types of duplication on the evolution and adaptation of organisms. ++
  * **Thursday 21 March 2013 at 10 am**, [[http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/en| Hidetoshi Shimodaira]], Osaka University. **Title**: Bayesian is converted into frequentist by reversing the sign of the data length. ++ Abstract | The observed frequency of a particular outcome in data-based simulation, known as the bootstrap probability (BP) of Felsenstein (1985), is very useful as a confidence level for data analysis with discrete outcomes, such as estimating the phylogenetic tree from aligned DNA sequences or identifying clusters from microarray expression profiles. We argue that the length of simulated data sets should be (-1) times the original data length to avoid false positives, i.e., bias of hypothesis testing, although such a negative data length cannot be realized. In other words, we perform the "m out of n" bootstrap with m=-n. This turns out to be equivalent to the approximately unbiased (AU) confidence level computed by the multiscale bootstrap of Shimodaira (2002), but such a notion of negative data length was not known until Shimodaira (2008). The method is illustrated in real data analysis of phylogenetic inference and hierarchical clustering. In the latter part of the talk, the mathematical justification is explained in terms of distance and curvature, with connection to the geometrical theory of Efron and Tibshirani (1998) and the argument of Perlman and Wu (1999). BP is interpreted as a Bayesian posterior probability and AU is the frequentist p-value, and thus changing the length of simulated data sets bridges the gap between these two confidence levels. ++
  * **Thursday 22 November 2012 at 10:30 am**, [[http://www.maths.lancs.ac.uk/~parkj1/|Juhyun Park]], Lancaster University. **Title**: Functional average derivative regression.
++ Abstract | The single index model is studied in the framework of regression estimation involving functional data. The model extends a linear model with an unknown link function and has the advantage of low sensitivity to dimensional effects, flexibility and ease of interpretation. However, its usage in the functional regression setting is still limited, partly due to the complexity of fitting the model. In this paper we develop a functional adaptation of the average derivative method that allows us to construct an explicit estimator for the functional single index. Thanks to the explicit form of the estimator, implementation is relatively simple and straightforward. Theoretical properties of the estimator of the index are studied based on asymptotic results on nonparametric functional derivatives estimation. In particular, it is shown that the nonlinear regression component of the model can be estimated at the standard univariate rate, being therefore insensitive to the infinite dimensionality of the data. Numerical examples are used to illustrate the method and to evaluate finite sample performance. ++ {{:intranet:parkj.pdf|Slides}} (restricted access)
==== 2011/2012 ====
  * **Thursday 26 July 2012 at 11 am**, [[http://www.stat.berkeley.edu/~epurdom|Elizabeth Purdom]], UC Berkeley Statistics, USA. **Title**: Timing Chromosomal Abnormalities using Mutation Data ++ Abstract | Tumors accumulate large numbers of mutations and other chromosomal abnormalities due to the breakdown in genomic repair mechanisms that is a hallmark of tumors. However, not all of these abnormalities are believed to be crucial for tumor growth and progression. One important indicator of an abnormality's importance is the order in which it occurred relative to other abnormalities. Early events may be critical abnormalities, and possibly targets for drug treatment or early diagnosis.
Outside of animal models, we generally will not have tumors from multiple time points in the progression of the tumor, but rather only at the time point at which the tumor was removed. Therefore we cannot directly observe the temporal ordering of genomic abnormalities. However, the distribution of allele frequencies within regions with copy number aberrations provides information about when the chromosomal abnormality occurred, relative to other abnormalities in the tumor. Using sequencing data, we develop a probabilistic model for the observed allele frequency of a mutation (defined as the proportion of the number of reads covering the nucleotide position that contain the mutation) that allows us to order abnormalities within a tumor. Our method gives a novel insight into the biology of tumor progression through a quantitative evaluation of temporal ordering of chromosomal abnormalities. Moreover, it gives a quantitative measure to compare across samples for highlighting driver mutations and events. ++ {{:intranet:Purdom_CNRS2012.pdf|Slides}} (restricted access)
  * **Friday 29 June 2012 at 11 am**, [[https://www.msu.edu/~hanada/|Kousuke Hanada]], RIKEN Plant Science Center, Japan. **Title**: Small coding genes associated with morphogenesis are hidden in plant genomes ++ Abstract | Peptides translated from small coding genes play essential roles in multicellular organisms. However, small coding sequences in genomes tend to be missed. Here, we show that novel small open reading frames (sORFs: 30 to 100 amino acids) are associated with morphogenesis in A. thaliana. Using a designed array, we generated an expression atlas in 16 organs and 17 environmental conditions for 7901 identified coding sORFs. 4664 coding sORFs were expressed in at least one experimental condition, and 6516 were conserved in other land plants at the amino acid level.
When we overexpressed 473 transcribed and/or conserved coding sORFs, ~10% (49/473) induced visible phenotypic effects, a proportion approximately seven times higher than that of randomly chosen known genes. Many coding sORFs hidden in plant genomes may be associated with morphogenesis. Our expression and phenotypic data for sORFs will promote further study of their roles in plants. ++
  * **Tuesday 12 June 2012 at 10 am**, [[http://people.stat.sfu.ca/~dac5/Dave_Campbell/Dave_Campbell.html| Dave Campbell]], Simon Fraser University. **Title**: Parameter Estimability and Statistical Inference for Dynamical Systems using Data Cloning ++ Abstract | Systems of ordinary differential equations are commonly used to model natural phenomena. However, such models are often built with more parameters than can be uniquely identified by available data. We introduce a test for parameter estimability that works with Data Cloning (a Markov chain Monte Carlo algorithm) to simultaneously determine whether the model parameters are estimable and, if so, determine their maximum likelihood estimates and provide asymptotic standard errors. When not all model parameters are estimable, we show how to use the Data Cloning results and our estimability test to determine estimable parameter combinations when possible. We illustrate the method using three different real data systems that are known to be difficult to analyze. ++
  * **Thursday 07 June 2011 at 10 am**, Nicolas Giroud. **Title**: Presentation of [[http://mathsamodeler.ujf-grenoble.fr/|Maths à modeler]].
  * **Friday 18 November 2011 at 3 pm**, [[http://sueur.jerome.perso.neuf.fr|Jérôme Sueur]] (Laboratoire Origine et Structure de la Biodiversité, Muséum national d'Histoire naturelle) **Title**: Estimating animal biodiversity through acoustics: concepts and techniques
  * **Tuesday 17 October 2011 at 11 am**, Henrik Bengtsson (University of California at San Francisco) **Title**: Single-pair parent-specific copy number analysis ++ Abstract | It has been nearly a decade since the first DNA-microarray experiments for studying genomic aberrations and inferring copy numbers were done. Several statistical methods have since evolved, providing us with better signal-to-noise ratios and improved detection of copy number aberrations. More recently, inference on parent-specific copy numbers (PSCNs) based on SNP microarrays has gained interest and a handful of methods have been proposed. Here we will present a method (Paired PSCBS) that allows us to obtain high-quality PSCNs from a single pair of tumor-normal samples without the use of external references or prior estimates. ++
  * **Tuesday 4 October 2011** (as part of SSB), 2 talks:
    * at 11 am, [[http://sph.bu.edu/index.php?option=com_sphdir&id=239&Itemid=340&INDEX=10833|Josée Dupuis]] (Boston University), Meta-analysis of genome-wide association results allowing for gene-by-environment interactions {{:intranet:reunions:dupuis_cnrs2011.pdf|Slides}} (restricted access)
    * at 2 pm, [[http://math.univ-lille1.fr/~biernack/|Christophe Biernacki]] (Université de Lille 1), Simultaneous Variables Clustering and Selection in Regression Models. {{:intranet:reunions:clere_ly.pdf|Slides}} (restricted access)\\
  * **Wednesday 21 September 2011 at 11 am**, [[http://math.bu.edu/people/kolaczyk/index.html| Eric Kolaczyk]] (Boston University, visiting the lab) **Title**: Some Results on Asymptotics for Inference in Networks. ++ Abstract | Contrary to the classical i.i.d.
context, or even the contexts of time series or spatial data, there are a variety of seemingly "basic" inferential tasks in the context of networks for which the appropriate methodologies and/or the supporting theory are not yet developed. In particular, there is currently very little in the way of asymptotic statistics. In this talk I will discuss ongoing work on confidence intervals for network parameters in two settings: (i) parameter inference in exponential random graph models; and (ii) estimation of network summary characteristics. ++ {{:intranet:reunions:kolaczykevryexpo.pdf|Slides}} (restricted access)\\
  * **Tuesday 6 September 2011 at 11 am**, [[http://genome.jouy.inra.fr/~kzimm/|Karel Zimmermann]] (MIG, INRA, Jouy-en-Josas) **Title**: Some physical frolics with protein sequences: profiles, harmonic analysis, repeats, periodicity... ++ Abstract | Signal-processing methods from physics (autocorrelation, harmonic analysis) are used to analyse biological sequences (proteins). The results reflect the 3D structure well. The SVD method gives a better understanding of substitution matrices (Blosum, PAM) and makes it possible to build a "complete profile" of a sequence. ++ {{:intranet:zimmermann_evry_20110906.pdf|Slides}} (restricted access)\\
[[seminaire:archives|Archives]]
===== Internal Working Groups =====
  * [[http://www.ens.univ-evry.fr/emedia2014/course/view.php?id=1099 |Audébi working group]]
  * [[intranet:reunions:gt_stat_gen|Statistical genetics working group]] (restricted access)
  * [[intranet:reunions:connaissons_nous|Connaissons-nous]] (restricted access)
  * [[intranet:reunions:duplinet|Duplinet]] (restricted access)
  * [[intranet:reunions:journal_club| Journal club "genomic organization"]] (restricted access)