This page hosts the posters of the “Statistical Analysis of Massive Genomic Data” conference which was held in Evry in November 2015. For other information about this conference, please visit:
IMGT/HighV-QUEST and IMGT Clonotype (AA): Identification and Statistical Significance Diversity per Gene for NGS Immunoprofiles of Immunoglobulins and T Cell Receptors
The adaptive immune response relies on our ability to produce up to 2 × 10^12 different immunoglobulins (antibodies) and T cell receptors per individual to fight pathogens. IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org), built on IMGT-ONTOLOGY, has been created to manage this huge diversity. IMGT/HighV-QUEST is the unique web portal for the analysis of immunoglobulin and T cell receptor high-throughput sequences. It provides standardized, high-quality outputs, including the characterization and comparison of clonotype diversity in up to one million sequences. We present a statistical procedure to analyze IMGT/HighV-QUEST outputs for evaluating the significance of the IMGT clonotype (AA) diversity and expression, per gene of a given group.
Model selection for Genome Wide Association Studies
Genome wide association studies are performed for increasingly complex data. Within the framework of fitting a linear mixed model for each genotype there are a lot of design issues to consider. Depending on the phenotype of interest it might be necessary to control for other variables. It might also be beneficial to transform variables to make inference more robust. To deal with the complexity of the data within the Milieu Interieur project we have designed a pipeline that performs an automatic selection of covariates and then transforms the response variable such that the residuals of the regression of the response on the covariates are normally distributed. The pipeline has been used for a genome wide association study of immunophenotypes with hundreds of potential covariates.
Haplotype based estimation of complex diseases
GWAS data can be used to try and predict disease. To increase predictive accuracy, we try to take into account biological structure, namely chromosomal distance and phase information. In our new method, we capture interaction in short haplotypes using the popular machine learning algorithm random forest. We then sum up the interactions using lasso regression. We show results on the WTCCC 1 dataset.
Adaptive ridge regression for variable selection
Assessing the uncertainty pertaining to the conclusions derived from experimental data is challenging when the number of possible explanations is high compared to the number of experiments. We propose a new two-stage “screen and clean” procedure for assessing the uncertainties pertaining to the selection of relevant variables in high-dimensional regression problems. In this two-stage method, screening consists in selecting a subset of candidate variables by a sparsity-inducing penalized regression, while cleaning consists in discarding all variables that do not pass a significance test. This test was originally based on ordinary least squares regression. We propose to improve the procedure by conveying more information from the screening stage to the cleaning stage. Our cleaning stage is based on an adaptively penalized regression whose weights are adjusted in the screening stage. Our procedure is amenable to the computation of p-values, which makes it possible to control the False Discovery Rate. Our experiments show the benefits of our procedure, as we observe a systematic improvement of sensitivity compared to the original procedure.
Influence of transposable elements on the fate of duplicated genes in human
Gene duplication is an engine for the emergence of new gene functions. The fate of most duplicated genes is loss of function, but some duplicate copies remain functional. The evolutionary mechanisms explaining the maintenance of duplicated genes are poorly understood. A recent study reveals that epigenetic modifications, such as DNA methylation, may contribute to duplicate gene evolution. Transposable elements (TE) are repeated genomic sequences that influence genome evolution, and they are known to be associated with DNA methylation. TE thus play a role in epigenetic modifications. Studying the fate of duplicated genes via functionalization, taking into account the TE context, is one of the key issues of our project. We identified duplicate and singleton genes from the HOGENOM database. The overall TE density and the local GC content were calculated. The selective pressures acting on human and chimpanzee orthologous gene pairs were also evaluated. Our analysis reveals that the density of TE varies between duplicate and singleton genes.
Use of De Bruijn Graphs for Bacterial GWAS
Antimicrobial resistance has become a major public health concern, calling for a better definition of existing and novel resistance mechanisms and the discovery of novel resistance markers. Most existing GWAS approaches for bacterial genomes either look at SNPs obtained by sequence alignment or consider sets of k-mers whose presence in the genome is associated with the phenotype of interest. We propose an alignment-free GWAS method that targets any region of the genome and selects haplotypes of variable length associated with the resistance phenotype. Exploiting the De Bruijn graph structure, which implicitly contains the k-mers of all genomes for all sizes, makes it possible to drastically reduce the number of explored features without loss of information, thus increasing the statistical power of the tests.
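As a rough illustration of why a graph of k-mers can shrink the feature space, the sketch below builds a small k-mer graph from several genomes and groups k-mers by their presence pattern across genomes. It is not the authors' method: the function names, the toy sequences and the choice to ignore reverse complements are all assumptions made for the example.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the k-mers of a sequence in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_graph(genomes, k):
    """Build a k-mer graph from named genomes (hypothetical helper).

    presence[kmer] is the set of genomes containing the k-mer;
    edges[kmer] is the set of k-mers that follow it in some genome.
    Reverse complements are ignored to keep the sketch short.
    """
    presence = defaultdict(set)
    edges = defaultdict(set)
    for name, sequence in genomes.items():
        genome_kmers = kmers(sequence, k)
        for kmer in genome_kmers:
            presence[kmer].add(name)
        for left, right in zip(genome_kmers, genome_kmers[1:]):
            edges[left].add(right)
    return presence, edges


def distinct_presence_patterns(presence):
    """Collapse k-mers that occur in exactly the same set of genomes."""
    return {frozenset(names) for names in presence.values()}


# Toy genomes differing by one substitution (assumed data).
genomes = {"g1": "ACGTACGA", "g2": "ACGTTCGA"}
presence, edges = build_kmer_graph(genomes, k=3)
patterns = distinct_presence_patterns(presence)
```

In this toy case the eight distinct 3-mers collapse into three presence patterns, so only three association tests would be needed instead of eight. The branching at `CGT` marks the position where the two genomes diverge.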
Expression of paralogous gene families in the human brain
The tissue-specificity of paralogous gene functions can be brought out by measuring gene expression. For example, members of the SRGAP2 gene family show specific expression profiles in different brain tissues. Our main objective is to identify gene families with expression patterns that are specific to certain brain tissues or shared between brain and accessible tissues. To reach this goal we have to deal with genomic regions of paralogs with a high sequence homology that can impact RNA-seq expression measurements. It will be necessary to develop a bioinformatics method to improve the accuracy of these measures on duplicated genes. Another need is to compare expression profiles of gene families between brain and blood by using expression correlation analyses in both tissues. GTEx consortium data will be used to compare family expression profiles between different brain tissues and between brain tissues and blood.
Predicting from high-dimensional molecular data and environmental variables in stratified samples
We show approaches to predict childhood asthma based on genetic predictors (2.5 million SNPs) as well as environmental variables from 1708 children. Besides the n-smaller-p problem, the following issue has to be faced: the dataset arose from a two-phase sampling procedure. In order to take into account the resulting sample bias, we first carried out genetic feature selection using univariate survey regression models. Afterwards, we performed LASSO regression with observation weights on the selected features. Purely multivariate approaches were applied to a manual selection of risk SNPs from the literature. Approaches applied to the full set of SNPs have to be capable of handling the large number of predictors and should incorporate the complex sampling design. Combining high-dimensional genetic data with low-dimensional environmental data by special approaches can improve the predictions.
Searching for missing heritability using univariate and multivariate approaches on both genotyping and sequencing data
To investigate the missing heritability of Alzheimer’s Disease, we used public data of 809 individuals from the Alzheimer’s Disease Neuroimaging Initiative with both genotyping (2.5M SNPs) and whole genome sequencing data. We first performed a GWAS on both types of data, testing the association between each SNP/rare variant and the phenotype. We found significant associations for 16 common SNPs in Linkage Disequilibrium in the region of the well-known APOE gene, after Bonferroni correction. Gene-based approaches applied on sequencing data (SKAT, burden test) could only confirm the association of APOE driven by common SNPs. This suggests that such approaches have low power with very rare causal variants. Finally, multivariate methods (sparse linear regression, decision tree, SNP heritability) failed to predict the case/control status from genotyping data but these results could be greatly improved on a much larger dataset.
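To make the Bonferroni step concrete, here is a minimal sketch of the per-test threshold. The function names are hypothetical, and the use of 2.5 million tests simply reuses the SNP count quoted above; the abstract does not state the exact number of tests that was corrected for.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Indices of p-values that remain significant after Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [index for index, p in enumerate(p_values) if p <= threshold]


# Assumed illustration: 2.5 million tests at a family-wise level of 0.05.
gwas_threshold = bonferroni_threshold(0.05, 2_500_000)
```

With these assumed numbers the per-SNP threshold is 2e-8, which shows how demanding the correction becomes when millions of variants are tested individually.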
Estimation of relationships and inbreeding from sequence data in presence of admixture
The 1000 Genomes Project (TGP) provides a unique source of whole genome sequencing data for studies of human population genetics and human diseases. We estimated the genomic inbreeding coefficient of each individual and found an unexpectedly high level of inbreeding in TGP. Inbred individuals were found in each of the 26 populations, with some populations showing proportions above 50%. We also detected 227 previously unreported pairs of close relatives (up to and including first cousins). In addition, because admixed populations are present in the TGP, we performed simulations to study the robustness of inbreeding coefficient estimation in the presence of admixture. We found that our multi-point approach (FSuite) was quite robust to admixture, unlike single-point methods (PLINK).
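For readers unfamiliar with single-point estimation, the sketch below shows a simplified method-of-moments estimator that compares observed and expected homozygosity marker by marker. It is an assumption-laden illustration in the spirit of single-point methods, not the exact PLINK computation and not the multi-point FSuite model; the function name and inputs are hypothetical.

```python
def single_point_inbreeding(genotypes, allele_freqs):
    """Simplified method-of-moments inbreeding estimate for one individual.

    genotypes: alternate-allele counts (0, 1 or 2), one per marker.
    allele_freqs: alternate-allele frequency of each marker in the
    reference population (assumed known and correct).
    """
    n_markers = len(genotypes)
    observed_hom = sum(1 for g in genotypes if g != 1)
    # Under Hardy-Weinberg, a marker is homozygous with probability 1 - 2pq.
    expected_hom = sum(1.0 - 2.0 * p * (1.0 - p) for p in allele_freqs)
    return (observed_hom - expected_hom) / (n_markers - expected_hom)


# Toy data (assumed): four markers with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
f_outbred = single_point_inbreeding([0, 1, 2, 1], freqs)
f_inbred = single_point_inbreeding([0, 2, 2, 0], freqs)
```

Because the expected homozygosity depends entirely on the supplied allele frequencies, frequencies that do not match an individual's ancestry shift the estimate. This is one way to see why admixture is a concern for estimators of this kind.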
From Exome to Whole Genome sequencing analysis
The last few years have seen a rapid evolution in the profile of human genome resequencing projects. We have reached the point where Whole Genome Sequencing (WGS) projects are about to become a new standard for research and personal genomics. To achieve this goal, major challenges have to be addressed: management of massive data storage (1 PB/year for X5 Illumina), optimization of computing resources and development of new analysis strategies. Here we focus on the latest progress made at the Centre National de Génotypage (CNG) to handle this transition phase. Since 2013, the bioinformatics analyses of the CNG have been run at the HPC facility TGCC (CEA, “Très Grand Centre de Calcul”, Bruyère le Chatel; 2 Pflops, including 400 Tflops and 5 PBytes dedicated to France Genomics projects). For example, pipelines for exome sequencing data analysis (varscope), mainly based on best practices for mapping, calling and first-pass annotation (DePristo et al., 2011; Van der Auwera et al., 2013), were set up in 2013 but required a major upgrade to ensure the speed and scalability needed for the anticipated production flow (up to 9000 WG for X5 Illumina sequencers). We present here the high-throughput calibrated process deployed on the TGCC facility (fig. 1). This process includes a “map-reduce” optimization and a software upgrade to handle the tenfold increase in capacity required by the transition from Whole Exome Sequencing (WES) to WGS (fig. 2). Moreover, we benchmarked several variant calling algorithms (NIST/GIAB; Zook et al., 2014) and added a “4-multicall” step to our process based on HaplotypeCaller (HC; McKenna et al., 2010), UnifiedGenotyper (UG), Platypus (PY; Rimmer et al., 2014), and Sambamba/Samtools (SB; Tarasov et al., 2015; Li et al., 2009) (fig. 1, fig. 3). We evaluated our process from low-coverage analysis (10X, e.g. very large cohort sequencing), to standard coverage (30X) and finally high-coverage sequencing (> 100X, e.g. somatic mutation or mosaicism detection).
Under “standard” conditions, the analysis of data from a 30X experiment is currently completed in approx. 8 hours (fig. 1). We did not identify any deadlock situation up to 240X coverage (equivalent to a full X5 flowcell). Finally, we compared the relative performance of WES and WGS and showed equivalent detection of coding variants, a steadier genome coverage distribution and up to 50% wider access to biologically relevant annotated regions (fig. 7, 8, 9). In conclusion, we are facing a major transition which requires refactoring and development to harvest knowledge from constantly overflowing sources of data. We showed here the benefits brought by WGS over WES in terms of coverage and precision, and the important upgrades made to support large WGS programs. Our next step in this context will be to evaluate and add data compression strategies for storage and processing optimization. In addition, we will add our structural variation process to the production flow to provide our collaborators with a more complete view of the genome organization.
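The abstract does not say how the calls from the four callers are combined in the “4-multicall” step. As one possible reading, the sketch below counts how many callers support each variant and keeps those reaching a minimum level of support. The variant representation, the function names and the `min_callers` parameter are assumptions for illustration only.

```python
from collections import Counter


def caller_support(calls_by_caller):
    """Count, for each variant, how many callers reported it.

    calls_by_caller maps a caller label (e.g. "HC") to an iterable of
    variants, each written as a (chrom, pos, ref, alt) tuple.
    """
    support = Counter()
    for variants in calls_by_caller.values():
        for variant in set(variants):  # a caller counts once per variant
            support[variant] += 1
    return support


def consensus_calls(calls_by_caller, min_callers=2):
    """Variants reported by at least min_callers callers, in sorted order."""
    support = caller_support(calls_by_caller)
    return sorted(v for v, count in support.items() if count >= min_callers)


# Toy call sets (assumed data) labelled with the abbreviations used above.
calls = {
    "HC": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
    "UG": [("chr1", 100, "A", "G")],
    "PY": [("chr1", 100, "A", "G"), ("chr2", 42, "G", "A")],
    "SB": [("chr1", 250, "C", "T")],
}
kept = consensus_calls(calls, min_callers=2)
```

Raising `min_callers` trades sensitivity for precision; a benchmark against a reference call set such as the NIST/GIAB one cited above is the natural way to pick a value.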
Back to the stemhood: a top-down approach identifying extracellular factors involved in stemness of human embryonic stem cells
Human embryonic stem cells (hESCs) come from the inner cell mass of in vitro fertilized blastocysts. These cells, transiently present in vivo, give rise to the germ layers. In vitro, they can be maintained undifferentiated or committed into lineages under specific culture conditions. At present, little is known about the role of cell-cell/matrix interactions in regulating hESC stemness. We propose a top-down approach to identify novel extracellular proteins responsible for hESC stemness maintenance. The first step consists of an in silico study, combining the analysis of microarrays and the use of annotated databases, to build and define a hESC transcriptome (8,934 mRNAs) and interactome (6,514 proteins and 46,954 interactions). This interactome has been structurally studied and filtered to narrow it down to a short list of 125 candidate proteins, focusing on the extracellular proteins and their links to the transcription factor network. Some of these 125 proteins are already known to be involved in hESC stemness, whereas most of them are not. The second step comprises an in vitro study in which the role of selected candidates in maintaining hESC stemness is being assessed. To this end, shRNA technology is used to knock down candidate gene expression. The effect this has on the expression of pluripotency and differentiation markers is investigated at the mRNA and protein levels using qPCR, immunofluorescence and Western blotting. This Systems Biology approach provides a more comprehensive understanding of the stemness state and its relation to the cell micro-environment, helping to mobilise the great potential of hESCs.
Statistical and computational issues in subnetwork analysis of genome-scale data
Over the last decade, significant efforts have been undertaken to identify genes relevant for a large number of diseases using genome-wide association studies (GWAS). In most cases, the results have been underwhelming. Many phenotypes appear to be associated with variations across multiple genes. Pathway-based analysis is widely used to prioritize groups of interacting genes associated with the phenotype of interest. Yet this analysis is limited to known pathways. Network-based methods relax this limitation by treating all subnetworks as potential pathways. Our goal is to apply a network-based method that combines knowledge of gene functional interactions with GWAS results in order to search for groups of interacting SNPs that associate best with dengue infectious disease. Such methods already exist. Nevertheless, we have identified a statistical bias in the core statistic that is now widely used to score subnetworks: subnetwork scores are biased towards larger subnetworks. Here, we illustrate the statistical and computational issues around subnetwork-based analysis.
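The abstract does not name the scoring statistic. Assuming it is a Stouffer-type aggregate (the sum of the z-scores of the k members divided by the square root of k), the simulation sketch below shows one way a size bias can arise under the null: when the best-scoring subset of each size is searched for, larger sizes reach higher scores on average. Network connectivity is ignored, and all names and parameters are hypothetical.

```python
import random
import statistics


def aggregate_z(z_scores):
    """Stouffer-type aggregate: sum of z-scores divided by sqrt(k)."""
    return sum(z_scores) / len(z_scores) ** 0.5


def best_score_of_size(z_scores, k):
    """Highest aggregate over all subsets of size k (connectivity ignored).

    For a fixed k the aggregate is maximal for the k largest z-scores.
    """
    top_k = sorted(z_scores, reverse=True)[:k]
    return aggregate_z(top_k)


def mean_best_null_score(n_genes, k, n_reps=500, seed=0):
    """Average best size-k score when every gene z-score is pure noise."""
    rng = random.Random(seed)
    best_scores = []
    for _ in range(n_reps):
        null_z = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        best_scores.append(best_score_of_size(null_z, k))
    return statistics.mean(best_scores)


mean_best_k1 = mean_best_null_score(n_genes=20, k=1)
mean_best_k5 = mean_best_null_score(n_genes=20, k=5)
```

A single pre-specified subset has a standard normal aggregate whatever its size, so the inflation seen here comes from the search over subsets, whose number grows with k. Scores of different sizes are therefore not directly comparable without a size-specific null calibration.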
Discovery of pairwise monotonic biomarkers for dengue severity
During dengue virus outbreaks, hospitals are regularly overcrowded with patients because of potential complications, which occur several days after hospital admission in only 0.1% of cases. Being able to predict early on which patients will develop complications would make it possible to focus the existing medical resources on fewer patients. Based on clinical data and transcriptomic data from blood serum in 50 patients at hospital admission, we aim to find simple predictors of dengue complications from omics data. Specifically, we use a generalization of linear models that describes the disease severity as a monotonic function of two transcript measurements. The monotonic model allows genome-wide screening and goes beyond the linear and logistic models: we are able to pick up relations such as “AND” and “OR” between genes. We will present the results from our genome-wide screen for biomarkers for dengue severity.
Change-point detection with kernel methods: application to DNA copy number segmentation in cancer studies
A number of change-point detection methods have been proposed to analyse DNA copy number and allele B fraction profiles. These profiles are characterized by abrupt changes in their distribution (mean, number of modes, variance…). However, available approaches do not directly tackle this problem. In fact, they first pre-process and transform the data and then detect abrupt changes in the mean of the pre-processed signal. This pre-processing results in a loss of information. The recently proposed kernel-based segmentation approach offers a unified framework to detect changes in the whole distribution of a signal and is an interesting alternative to this ad hoc pre-processing. However, kernel-based segmentation is computationally inefficient and cannot be applied as is to large DNA copy number profiles. Indeed, for an arbitrary kernel its complexity is quadratic in the size of the data, both in space and time. We illustrate the performance of kernel-based segmentation and of a heuristic on copy number and allele B fraction profiles, for which we designed an adapted kernel. We assess the competitive performance of our approach using realistic profiles simulated with the acnr R package.
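To see why a monotonic model of two measurements can capture “AND” and “OR” relations, note that the minimum and the maximum of two values are both non-decreasing in each argument. The sketch below checks this on a grid; the function names and the grid are assumptions, and the check is only an illustration, not the screening procedure of the abstract.

```python
def severity_and(x, y):
    """AND-like relation: high only when both measurements are high."""
    return min(x, y)


def severity_or(x, y):
    """OR-like relation: high as soon as one measurement is high."""
    return max(x, y)


def is_monotone_in_both(f, grid):
    """Check on a grid that f never decreases when either argument grows."""
    points = sorted(grid)
    for low, high in zip(points, points[1:]):
        for other in points:
            if f(high, other) < f(low, other):
                return False
            if f(other, high) < f(other, low):
                return False
    return True


grid = [0.0, 0.5, 1.0]
and_ok = is_monotone_in_both(severity_and, grid)
or_ok = is_monotone_in_both(severity_or, grid)
# An XOR-like relation is not monotone and falls outside this model class.
xor_ok = is_monotone_in_both(lambda x, y: abs(x - y), grid)
```

Neither `min` nor `max` is a linear function of the two inputs, which is the sense in which the monotonic family is larger than the linear one.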
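The quadratic cost mentioned above comes from the Gram matrix of pairwise kernel values. The sketch below, a naive illustration with hypothetical names, uses a Gaussian kernel and searches for a single change-point by minimizing the sum of kernel least-squares costs of the two segments. It is neither the adapted kernel nor the heuristic of the abstract.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two scalar observations."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def gram_matrix(signal, bandwidth=1.0):
    """n x n matrix of pairwise kernel values: quadratic in memory."""
    return [[gaussian_kernel(a, b, bandwidth) for b in signal] for a in signal]


def segment_cost(gram, start, end):
    """Kernel least-squares cost of the segment [start, end)."""
    length = end - start
    diagonal = sum(gram[i][i] for i in range(start, end))
    block = sum(gram[i][j] for i in range(start, end) for j in range(start, end))
    return diagonal - block / length


def best_single_change_point(signal, bandwidth=1.0):
    """Index t minimizing cost(0, t) + cost(t, n); naive, unoptimized search."""
    gram = gram_matrix(signal, bandwidth)
    n = len(signal)
    best_t, best_cost = None, float("inf")
    for t in range(1, n):
        cost = segment_cost(gram, 0, t) + segment_cost(gram, t, n)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


# Toy profile (assumed data) with one abrupt level change at index 5.
profile = [0.0] * 5 + [3.0] * 5
change_point = best_single_change_point(profile)
```

The toy profile only changes in mean, which keeps the result easy to verify by hand. With a characteristic kernel such as the Gaussian one, the same cost is also sensitive to changes in other features of the distribution, which is the motivation given in the abstract.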
New insights for drug sensitivity prediction from cancer cell lines data
A recent comparative analysis of large-scale pharmacogenomic high-throughput screening (HTS) datasets found inconsistency between the measured cell line drug sensitivities across several studies. In this study, we explored the possibility of improving our understanding of cell line drug sensitivity results and their consistency. We used data from the Cancer Genome Project (CGP) and the Cancer Cell Line Encyclopedia project (CCLE) studies. We focused on the common cell lines and the available molecular information, as well as the IC50 of 15 drugs that have been screened by both CGP and CCLE. We propose a methodology based only on transcriptomic data, applied in parallel to each dataset, to build eleven robust clusters of cell lines based on genes selected with biological knowledge. High consistency was found between the two clusterings, with an accuracy of 90%. Eleven couples were then defined, each being the pair of clusters from CGP and CCLE that share the highest proportion of cell lines. This clustering shows higher homogeneity in terms of drug sensitivity than the tissue type does, and allows us to significantly associate six different couples with sensitivity or resistance to Erlotinib, Lapatinib, Palbociclib, Vemurafenib, PD0325901 or Selumetinib. We developed a methodology based on transcriptomic data and biological knowledge to build clusters of cell lines in independent datasets without involving drug sensitivity. We then defined a new universe in which consistency between drug sensitivity data can be improved.
High-dimensional longitudinal genomic data: a survey and evaluation of publicly available implementations of machine learning methods
Problems related to high dimensionality arise nowadays in many fields of biomedical and clinical trial research, in which longitudinal studies are usually conducted. In these fields, high-dimensional data have led to the publication of an increasing number of related articles. However, methods appropriate for high-dimensional data analysis that simultaneously account for the longitudinal dimension of the data have been proposed only recently. We performed a review of articles proposing such methods under a mixed effects model. We evaluated by simulations those methods that are implemented in publicly available code. L1 regularization methods were the most common approaches. We discuss capacities and limitations with a view to analysing the DALIA-1 trial data, a therapeutic HIV vaccine clinical trial in which 19 patients were vaccinated. This trial evaluated the administration of a dendritic cell based vaccine to HIV-infected patients as a way to boost their immune response against HIV infection. A large amount of data was collected: longitudinal gene expression in the blood was repeatedly measured with microarrays over the course of the trial, as well as blood cell markers that were measured with flow cytometry and multiplex technologies [Lévy et al., European Journal of Immunology, 2014].
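The pairing of clusters across the two datasets can be sketched as follows. The abstract does not define “proportion of cell lines” precisely; this example assumes the Jaccard index (shared cell lines divided by the union of both clusters), and the cluster labels, cell line names and function name are made up for illustration.

```python
def pair_clusters(clusters_a, clusters_b):
    """Pair each cluster of A with the cluster of B sharing most cell lines.

    clusters_a and clusters_b map a cluster label to a set of cell line
    names. Sharing is measured with the Jaccard index (an assumption).
    Returns {label_a: (label_b, share)}.
    """
    pairs = {}
    for label_a, members_a in clusters_a.items():
        best_label, best_share = None, -1.0
        for label_b, members_b in clusters_b.items():
            union = members_a | members_b
            share = len(members_a & members_b) / len(union) if union else 0.0
            if share > best_share:
                best_label, best_share = label_b, share
        pairs[label_a] = (best_label, best_share)
    return pairs


# Toy clusterings (assumed data) over six hypothetical cell lines.
cgp_clusters = {"A": {"c1", "c2", "c3"}, "B": {"c4", "c5"}}
ccle_clusters = {"X": {"c4", "c5", "c6"}, "Y": {"c1", "c2"}}
couples = pair_clusters(cgp_clusters, ccle_clusters)
```

This greedy pairing does not prevent two clusters of one dataset from choosing the same partner; a one-to-one assignment would need an additional matching step.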
Searching gene-gene interactions in GWAS using a Group Lasso approach
It is recognized that some diseases may result from a complex genetic structure involving multiple interactions between genetic markers. Considering interactions at the gene level rather than at the scale of single markers may offer many advantages. Several gene-gene methods have therefore been proposed, most of them only applicable to a reduced number of genes. In this work we propose a group lasso approach that takes into account the gene structure as groups of SNPs. We first define interaction variables for every couple of genes as products of the genes' representative variables. To obtain gene representative variables, we compare different dimension reduction methods: principal component analysis, partial least squares and canonical correlation analysis. We then use a group lasso regression with a penalty by gene and a penalty by gene pair. In order to improve estimation accuracy and to obtain p-values for each selected variable, we use an adaptive ridge cleaning approach.
Use of multivariate predictive modelling to explain Tacrolimus pharmacokinetics inter-patient variability based on a high-throughput genetic screening approach
With a high-throughput genetic screening approach (16,561 SNPs), we aimed to identify a set of co-variant germline polymorphisms predictive of the inter-patient Tacrolimus (Tac) pharmacokinetics variability in kidney transplantation. Tac blood concentrations of 280 renal transplant recipients were monitored at each follow-up time during the three months after transplantation (days 10, 14, 30, 60, 90). We used a predictive multivariate approach integrating both a feature selection step (Fisher and Mutual Information) and a regression step (PLS1 and PLS2). At days 60 and 90, the predictive models significantly explain 70.2%, 62.9% and 22.9% of total Tac Co/Dose variability with 44 and 33 genes, respectively. As expected, they include the well-known CYP3A4 and CYP3A5 variants and highlight some molecular networks of drug metabolism. This approach also led us to identify the SLC28A3 polymorphism as a potential candidate.
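The construction of gene-pair interaction variables can be sketched as below. The example assumes that each gene has already been summarized by a single representative variable per individual (for instance a first principal component), which is a simplification; the gene names, values and function name are hypothetical.

```python
from itertools import combinations


def interaction_variables(gene_representatives):
    """Build one interaction variable per gene pair.

    gene_representatives maps a gene name to its representative variable,
    given as a list with one value per individual. The interaction of two
    genes is the element-wise product of their representatives.
    """
    interactions = {}
    for gene_a, gene_b in combinations(sorted(gene_representatives), 2):
        rep_a = gene_representatives[gene_a]
        rep_b = gene_representatives[gene_b]
        interactions[(gene_a, gene_b)] = [a * b for a, b in zip(rep_a, rep_b)]
    return interactions


# Toy representatives (assumed data) for three genes and four individuals.
representatives = {
    "geneA": [1.0, -1.0, 0.5, 0.0],
    "geneB": [2.0, 1.0, -2.0, 3.0],
    "geneC": [0.0, 1.0, 1.0, -1.0],
}
pair_variables = interaction_variables(representatives)
```

With G genes this yields G(G-1)/2 pair variables, which grows quickly with G and is consistent with the abstract's remark that many existing gene-gene methods only handle a reduced number of genes.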