- Scientific output
The statistical analysis of biological sequences, whether nucleotide sequences (DNA and RNA) or amino-acid sequences (proteins), calls for different models depending on the question under study. Because successive nucleotides in a DNA sequence are not independent of one another, Markov models are widely used for this purpose. Their drawback is that they assume the sequence is homogeneous, and biological sequences are not. A well-known example is GC content: along a sequence, GC-rich and GC-poor regions alternate. Hidden Markov models (HMMs) are used to account for this heterogeneity: the sequence is divided into homogeneous regions. HMMs have many applications, such as the search for coding regions. However, not every biological feature can be captured by these models, which is why we developed new models: drifting Markov models (DMMs).

Instead of fitting one transition matrix to the whole sequence (classical Markov model) or different transition matrices to different homogeneous parts of the sequence (HMM), we allow the transition matrix to vary (to drift) from the beginning to the end of the sequence. At each position $t$ we obtain a different transition matrix $\Pi_{t/n}$, where $n$ is the sequence length. Our models are thus constrained heterogeneous Markov models. We give two ways to constrain them: polynomial DMMs and polynomial-spline DMMs. For instance, for a DMM of degree 1 (linear drift), we fix a transition matrix $\Pi_0$ at the beginning of the sequence and a transition matrix $\Pi_1$ at the end, and we let the transition matrix vary linearly from $\Pi_0$ to $\Pi_1$:
$$\Pi_{t/n} = \left(1 - \frac{t}{n}\right)\Pi_0 + \frac{t}{n}\,\Pi_1.$$
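As an illustration of the linear drift formula above, here is a minimal Python sketch that builds the transition matrix at position t and simulates a sequence from it. It is not taken from DRIMM: the function names, the four-letter alphabet, the uniform first letter, the indexing convention (positions 1 to n, with the letter at position t drawn from the matrix at t/n) and the two example matrices are all assumptions made for the example.

```python
import random

def linear_drift_matrix(pi0, pi1, t, n):
    """Return Pi_{t/n} = (1 - t/n) * Pi_0 + (t/n) * Pi_1 as a list of rows."""
    w = t / n
    return [
        [(1 - w) * a + w * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(pi0, pi1)
    ]

def simulate_linear_dmm(pi0, pi1, n, alphabet="acgt", seed=0):
    """Simulate n letters; the letter at position t (2..n) uses Pi_{t/n}."""
    rng = random.Random(seed)
    # Assumption: the first letter is drawn uniformly from the alphabet.
    state = rng.randrange(len(alphabet))
    letters = [alphabet[state]]
    for t in range(2, n + 1):
        # Row of the drifting matrix indexed by the previous letter.
        row = linear_drift_matrix(pi0, pi1, t, n)[state]
        state = rng.choices(range(len(alphabet)), weights=row)[0]
        letters.append(alphabet[state])
    return "".join(letters)

# Made-up matrices (rows and columns ordered a, c, g, t):
# AT-rich at the start of the sequence, GC-rich at the end.
PI0 = [[0.4, 0.1, 0.1, 0.4]] * 4
PI1 = [[0.1, 0.4, 0.4, 0.1]] * 4

print(linear_drift_matrix(PI0, PI1, 50, 100)[0])  # row for 'a' at mid-sequence
print(simulate_linear_dmm(PI0, PI1, 100))
```

Because each drifting matrix is a convex combination of two stochastic matrices, its rows still sum to 1 at every position, so no renormalisation is needed.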
Such a model can describe a smooth evolution between two hidden states of an HMM, where the HMM's own transitions may appear too abrupt. DMMs can be seen as competitors to HMMs, but above all they should be understood as a complementary tool: the hidden models of an HMM, usually fixed (homogeneous) Markov chains, can be replaced by DMMs.
Throughout this work, we consider polynomial drift and drift by polynomial splines (the latter being more flexible than the former). We estimate our models in several ways, evaluate their quality, and use them in biological applications such as the search for rare words. We developed the present software, DRIMM, which is dedicated to the estimation of DMMs. The program provides the full range of DMM functionality, such as computing the transition matrix at each position, computing stationary laws… Auxiliary programs show how to use it to search for rare words.
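To make the idea of a stationary law at each position concrete, the following sketch computes, for a linear drift, the stationary distribution of each matrix taken as if it were a fixed transition matrix. This is only one plausible reading and does not reproduce DRIMM's code or interface: the function names, the power-iteration method, the assumption that every matrix is ergodic, and the two-state example matrices are all hypothetical.

```python
def linear_drift_matrix(pi0, pi1, t, n):
    """Return Pi_{t/n} = (1 - t/n) * Pi_0 + (t/n) * Pi_1 as a list of rows."""
    w = t / n
    return [
        [(1 - w) * a + w * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(pi0, pi1)
    ]

def stationary_law(matrix, iterations=1000, tol=1e-12):
    """Approximate the row vector mu with mu = mu * matrix by power iteration.

    Assumes the chain defined by `matrix` is irreducible and aperiodic.
    """
    k = len(matrix)
    mu = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [sum(mu[i] * matrix[i][j] for i in range(k)) for j in range(k)]
        if max(abs(a - b) for a, b in zip(mu, nxt)) < tol:
            return nxt
        mu = nxt
    return mu

def stationary_laws_along_sequence(pi0, pi1, n):
    """Stationary law of Pi_{t/n} for every t = 0, ..., n."""
    return [stationary_law(linear_drift_matrix(pi0, pi1, t, n))
            for t in range(n + 1)]

# Made-up two-state example.
PI0 = [[0.9, 0.1], [0.2, 0.8]]
PI1 = [[0.5, 0.5], [0.5, 0.5]]

for t, mu in enumerate(stationary_laws_along_sequence(PI0, PI1, 10)):
    print(t, [round(p, 4) for p in mu])
```

Power iteration is used here only to keep the sketch dependency-free; solving the linear system directly would serve equally well for small alphabets.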
Vergne, N. Chaînes de Markov régulées pour l'analyse de séquences biologiques. PhD thesis, Université d'Evry Val d'Essonne, 2008.