Université Paris-Est Marne-la-Vallée     GDR-IM CNRS     GDR BiM     LIGM
Journées COMATEGE-SeqBio
26 & 27 November 2012


Monday 26 November

  • 9:45--10:15 : Welcome coffee

  • 10:15--11:30 : Keynote Talk
    Speaker: Riccardo Dondi (University of Bergamo, Italy)

    Title: Some recent combinatorial approaches to genome comparison

    Abstract:

    Genome comparison has been the inspiration for several nice combinatorial problems in the last two decades. Genomes are usually considered as strings, hence natural problems related to string comparison, such as longest common subsequence and sequence alignment, have natural applications in the comparison of two or more genomes.
    This talk will consider two recent approaches related to genome comparison. The first approach is motivated by the concept of "exemplar". Given two genomes, with possibly duplicated genes, the goal is to infer, for each gene, the original ("exemplar") copy from which all other copies of that gene have originated. We will consider some combinatorial problems related to this approach, based on the longest common subsequence problem. We will overview the main combinatorial variants of the problem, and some complexity and algorithmic results.
    The second approach we will deal with is related to the organization and evolution of genomes. Recently, an alignment approach has been introduced for the comparison of two genomes, based on an evolutionary model where the evolutionary events considered are duplications and losses. In this talk we will overview some interesting combinatorial problems related to this approach, and some preliminary complexity and algorithmic results.
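    The longest common subsequence problem mentioned above can be illustrated by the textbook dynamic program (a generic sketch, not the exemplar variants discussed in the talk):

```python
def lcs_length(a, b):
    """Classic dynamic program: dp[i][j] is the length of a longest
    common subsequence of a[:i] and b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

print(lcs_length("AGCAT", "GAC"))  # 2 (e.g. "GA" or "AC")
```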

  • 11:30--12:05 : Pierre Nicodeme

    Title: Waiting time in DNA evolution

    Pierre Nicodeme

    Abstract:

    Given a stochastic model of sequence evolution, the aim is to compute how long one must wait until a k-mer absent from the sequence at the start time appears in it.
    We will review work of Behrens and Vingron that makes a simplifying assumption, further work of Behrens, Nicaud and P.N. that rigorously solves the problem using automata, and recent work of P.N. that, using the Regnier-Szpankowski equations, provides explicit formulas that also cover the Markov case.
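    As a toy illustration of the question (not of the analytic methods reviewed in the talk), the waiting time can be estimated by Monte Carlo simulation under a naive per-position substitution model; the sequence, k-mer and mutation rate below are arbitrary:

```python
import random

def waiting_time(seq, kmer, mu, rng):
    """Generations until `kmer` first occurs in `seq`, where each position
    independently mutates to a uniform random base with probability `mu`
    per generation (a toy substitution model, for illustration only)."""
    seq = list(seq)
    t = 0
    while kmer not in "".join(seq):
        t += 1
        for i in range(len(seq)):
            if rng.random() < mu:
                seq[i] = rng.choice("ACGT")
    return t

rng = random.Random(42)
times = [waiting_time("AAAAAAAAAA", "ACG", 0.1, rng) for _ in range(200)]
print(sum(times) / len(times))  # Monte Carlo estimate of the mean waiting time
```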

  • 12:05--14:00 Lunch

  • 14:00--14:35 : Annelyse Thévenin

    Title: Gene Family Assignment-Free Comparative Genomics

    Daniel Doerr, Annelyse Thévenin, Jens Stoye (Genome Informatics, Faculty of Technology and Institute for Bioinformatics, Center for Biotechnology (CeBiTec), Bielefeld University, Germany)

    Abstract:

    The comparison of relative gene orders between two genomes offers deep insights into functional correlations of genes and the evolutionary relationships between the corresponding organisms. Methods for gene order analyses often require prior knowledge of homologies between all genes of the genomic data set. Since such information is hard to obtain, it is common to predict homologous groups based on sequence similarity. These hypothetical groups of homologous genes are called gene families.
    This talk promotes a new branch of gene order studies in which prior assignment of gene families is not required. As a case study, we present a new similarity measure between pairs of genomes that is related to the breakpoint distance. We propose an exact and a heuristic algorithm for its computation. We evaluate our methods on a data set comprising 12 gamma-proteobacteria from the literature.
    In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small genomes. Moreover, the results of our heuristic are close to those of the exact algorithm. In general, we demonstrate that gene order studies can be improved by direct, gene-family-assignment-free comparisons.
    This work will be presented at RecombCG 2012.
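    For genomes with a known one-to-one gene correspondence, the classical breakpoint distance that the new measure relates to can be sketched as follows (a simplified, unsigned version; the talk's family-free measure is more involved):

```python
def breakpoints(g1, g2):
    """Number of adjacencies of genome g1 (a permutation of gene labels,
    unsigned here for simplicity) that are absent from genome g2."""
    adj2 = {frozenset(p) for p in zip(g2, g2[1:])}
    return sum(1 for p in zip(g1, g1[1:]) if frozenset(p) not in adj2)

print(breakpoints([1, 2, 3, 4, 5], [1, 2, 4, 3, 5]))  # 2
```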

  • 14:35--15:10 : Thierry Lecroq

    Title: Three new strategies for exact string matching

    Simone Faro (University of Catania, Italy), Thierry Lecroq (LITIS EA 4108, Université de Rouen, France)

    Abstract:

    In this talk we will present three new strategies for efficient practical online pattern searching in texts.
    The first strategy consists in using multiple sliding text windows. We will show how it can be applied to some of the most efficient exact algorithms for the problem, based on nondeterministic automata and character comparisons.
    The second strategy is based on a suitable factorization of the pattern and on a straightforward, lightweight encoding of its suffix automaton. It turns out that on average this new technique leads to longer shifts than those of other known solutions that make use of suffix automata. The third strategy is based on multiple hashing functions, which improves the performance of existing string matching algorithms when they are used for searching DNA sequences.
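    As background for the third strategy, a single-hash filtering search in the classic Rabin-Karp style (a generic sketch, not the authors' multiple-hash algorithm) looks like this:

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Rolling-hash search: compare hashes first, characters only on a hash hit."""
    m, n = len(pattern), len(text)
    if m > n:
        return []
    hp = ht = 0
    high = pow(base, m - 1, mod)  # weight of the outgoing character
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        ht = (ht * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if ht == hp and text[i:i+m] == pattern:
            hits.append(i)
        if i < n - m:  # slide the window by one character
            ht = ((ht - ord(text[i]) * high) * base + ord(text[i+m])) % mod
    return hits

print(rabin_karp("ACGTACGT", "ACG"))  # [0, 4]
```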

  • 15:10--15:45 : Lhouari Nourine

    Title: New classes of problems for mining interesting patterns in databases

    Jean Marc Petit (LIRIS, Lyon), Lhouari Nourine (LIMOS)

    Abstract:

    Given a partially ordered set (P, <) and a monotone predicate Q on P, dualization consists in identifying all the maximal elements of P satisfying Q from all the minimal elements of P not satisfying Q, and vice versa.
    In the context of discovering interesting patterns in databases, P represents a finite set of patterns, and when P is isomorphic to a Boolean lattice the pattern mining problem is said to be representable by sets. The class of these problems is denoted RAS.
    In this work we introduce a class larger than RAS, denoted WRAS, which introduces a weak representation by sets for interesting-pattern mining problems. We identify an efficient subclass of WRAS, denoted EWRAS, for which the dualization problem remains quasi-polynomial.
    Finally, we show that one of the problems known not to belong to RAS, namely frequent rigid sequences with wildcards, belongs to EWRAS.
    These new classes should have a significant impact on clarifying approaches to the discovery of interesting patterns in databases.
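    On a small Boolean lattice, dualization can be made concrete by brute force; the toy "frequent itemset" predicate below is an assumption for illustration only:

```python
from itertools import combinations

def dualize(universe, q):
    """Brute-force dualization on the Boolean lattice of subsets of
    `universe`: q is an anti-monotone predicate (if q(S) holds, it holds
    for every subset of S).  Returns (maximal sets satisfying q,
    minimal sets violating q) -- the two dual borders."""
    subsets = [frozenset(c) for r in range(len(universe) + 1)
               for c in combinations(universe, r)]
    pos = [s for s in subsets if q(s)]
    neg = [s for s in subsets if not q(s)]
    max_pos = [s for s in pos if not any(s < t for t in pos)]
    min_neg = [s for s in neg if not any(t < s for t in neg)]
    return max_pos, min_neg

# Toy predicate: a pattern is "frequent" iff contained in some transaction.
transactions = [frozenset("ab"), frozenset("bc")]
freq = lambda s: any(s <= t for t in transactions)
borders = dualize("abc", freq)
print(sorted(sorted(s) for s in borders[0]))  # [['a', 'b'], ['b', 'c']]
print(sorted(sorted(s) for s in borders[1]))  # [['a', 'c']]
```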

  • 15:45--16:15 Coffee break

  • 16:15--16:50 : Arnaud Lefebvre

    Title: Linear Computation of Unbordered Conjugate

    J.-P. Duval (LITIS, Université de Rouen), T. Lecroq (LITIS, Université de Rouen), A. Lefebvre (LITIS, Université de Rouen)

    Abstract:

    We present an algorithm that, given a word w of length n, computes one of its unbordered conjugates. If such a conjugate does not exist, the algorithm computes one of its conjugates that is a power of an unbordered word. The time complexity is O(n): the number of comparisons is bounded by 4n.
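    A naive quadratic-time version of the task (checking every rotation for a border) makes the linear-time result easy to appreciate:

```python
def is_bordered(w):
    """A word is bordered if some proper nonempty prefix equals a suffix."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))

def unbordered_conjugate(w):
    """Naive search over all rotations of w for an unbordered conjugate.
    Returns None when every conjugate is bordered (e.g. for a power of a
    word), whereas the talk's algorithm then returns a conjugate that is
    a power of an unbordered word."""
    for i in range(len(w)):
        c = w[i:] + w[:i]
        if not is_bordered(c):
            return c
    return None

print(unbordered_conjugate("abaab"))  # aabab
```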

  • 16:50--17:25 : Laura Giambruno

    Title: Transducers for the bidirectional decoding of codes with a finite deciphering delay

    Laura Giambruno (GREYC, Université de Caen Basse-Normandie, France), Sabrina Mantaci (Dipartimento di Matematica e Informatica, Università di Palermo, Italy), Jean Néraud (LITIS, Université de Rouen, France), Carla Selmi (LITIS, Université de Rouen, France)

    Abstract:

    Coding methods that allow bidirectional decoding of messages are important in order to guarantee data integrity. Several types of codes are used for this purpose; for instance, bifix codes allow instantaneous bidirectional decoding, but they are usually too big (so they do not guarantee compression) and they are difficult to construct. Prefix codes, by contrast, are usually very small, but even though they allow instantaneous left-to-right decoding, the right-to-left decoding requires some deciphering delay. In 1999 B. Girod introduced an encoding/decoding method on binary alphabets that, even though it makes use of prefix codes, allows instantaneous decoding from right to left as well, at the cost of just a few additional memory bits. The method is based on a bitwise operation.
    We generalize this method to an arbitrary finite alphabet A, to any operation defined on A, to any code with finite deciphering delay, and to any key word in A of a length depending on the deciphering delay. We construct a deterministic transducer for this generalized method and we prove some interesting isomorphism properties. As an important result we prove that, for a fixed code X with finite deciphering delay, the transducers associated to different keys have an isomorphic non-trivial strongly connected component.

  • 17:25--18:00 : Thierry Lecroq

    Title: Quasi-Linear Time Computation of Abelian Periods of a Word

    G. Fici (I3S, CNRS & Université Nice Sophia Antipolis, France), T. Lecroq, A. Lefebvre, É. Prieur-Gaston (LITIS EA4108, Université de Rouen, 76821 Mont-Saint-Aignan Cedex, France), W. F. Smyth (Dept of Computing and Software, McMaster University, Hamilton ON L8S 4K1, Canada & Dept of Mathematics & Statistics, University of Western Australia, Crawley WA 6009, Australia)

    Abstract:

    In the last couple of years much research has been devoted to the Abelian complexity of words. Recently, Constantinescu and Ilie (Bulletin EATCS 89, 167-170, 2006) introduced the notion of Abelian period with head and tail.
    An integer p is an Abelian period for a word w over a finite alphabet if w can be written as w = u_0 u_1 ... u_k where all the u_i's have the same composition in letters. If the length of u_0 (respectively u_k) is smaller than p and its composition is contained in that of the other u_i's, we call it the head (respectively the tail).
    In this article we present two O(n log log n) time algorithms for computing all the Abelian periods with neither head nor tail, and an O(n log n) time algorithm for computing all the Abelian periods without head and with a possibly non-empty tail, of a word of length n.
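    A brute-force check of Abelian periods with neither head nor tail (such a period p must divide |w|, and all length-p blocks must share the same letter counts, i.e. the same Parikh vector) can be sketched as:

```python
from collections import Counter

def abelian_periods(w):
    """Abelian periods of w with neither head nor tail, by quadratic
    brute force: p divides |w| and all consecutive factors of length p
    have equal letter counts.  For illustration only; the talk's
    algorithms are quasi-linear."""
    n = len(w)
    periods = []
    for p in range(1, n + 1):
        if n % p == 0:
            blocks = [Counter(w[i:i+p]) for i in range(0, n, p)]
            if all(b == blocks[0] for b in blocks):
                periods.append(p)
    return periods

print(abelian_periods("ababba"))  # [2, 6]
```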

Tuesday 27 November

  • 9:45--10:15 Welcome coffee

  • 10:15--10:50 : Guillaume Rizk

    Title: Efficient data structures for large-scale genome sequencing data

    Rayan Chikhi (ENS Cachan Bretagne), Guillaume Rizk (Algorizk)

    Abstract:

    To assemble and/or detect variants in large genomic data, an algorithmic ingredient of choice is the de Bruijn graph. On large datasets, analyzing such a graph requires a significant amount of system memory. This talk deals with an ultra-low-memory representation of de Bruijn graphs.
    We propose a new encoding, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly software tool implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours. The article (and, in the near future, the public source code) can be found on the webpage of Minia: http://minia.genouest.org
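    The core idea, storing the k-mer set of a de Bruijn graph in a Bloom filter, can be sketched as follows (a minimal generic Bloom filter, not Minia's actual structure, which additionally handles critical false positives):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    Membership tests may return false positives, never false negatives."""
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, bytearray((m + 7) // 8)
    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)
    def __contains__(self, item):
        return all(self.bits[h // 8] >> (h % 8) & 1 for h in self._hashes(item))

seq = "ACGTACGTGACG"
kmer_len = 4
bf = BloomFilter(m=1024, k=3)
for i in range(len(seq) - kmer_len + 1):  # insert every k-mer of the sequence
    bf.add(seq[i:i+kmer_len])
print("ACGT" in bf)  # True
print("TTTT" in bf)  # False, unless a (rare) false positive
```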

  • 10:50--11:25 : Van Du Tran

    Title: New approaches to differential expression in RNA-seq data

    Van Du Tran (Institut des Sciences du Végétal, CNRS, Gif-sur-Yvette; Institut de Génétique et Microbiologie, Université Paris-Sud, Orsay), Caroline Hartmann (Institut des Sciences du Végétal, CNRS, Gif-sur-Yvette), Martin Crespi (Institut des Sciences du Végétal, CNRS, Gif-sur-Yvette), Daniel Gautheret (Institut de Génétique et Microbiologie, Université Paris-Sud, Orsay).

    Abstract:

    RNA-seq has recently become the most popular tool for transcriptome analysis. We present in this work two new RNA-seq analysis methods that cannot be found in currently available pipelines or programs and yet address important questions. The first method is intended to determine clusters of co-regulated genes in time series experiments while the second one aims to identify the variation in RNA processing between different conditions. We show an application of these methods to data sets of Arabidopsis thaliana RNA deep sequencing experiments performed in various stress conditions.
    All our RNA-seq analyses start with identical steps: for each sequencing library, mapping of short reads to a genome is performed with a standard splice-junction mapper, TopHat, and we use HTSeq-count for counting reads associated with annotated genes. From this point, we can perform the following two analyses.
    Firstly, we consider the fold change profiles of genes in time series experiments. Each fold change profile, identified by a vector of fold change measures between consecutive time points, expresses the behavior of a gene over the course of time. Pearson correlation is then used as the main measure for gathering fold change profiles to form K-means clusters of co-regulated genes. Applied to a simple 4-point time course following a root stress, the method is able to identify clear sets of genes with similar behavior. For instance, a group of 331 (6%) genes shows a common peak followed by a descent after the initial root stress.
    Our second method studies variations in the coverage profiles of genes obtained from different libraries. Such variations may capture events that are not visible when just counting reads independently of their position on the transcription unit. Such events may include alternative splicing, polyadenylation or start sites, as well as any kind of post-processing event an RNA can undergo. We again use Pearson correlation to distinguish the coverage profiles and discover this type of event. Applied to a pair of Arabidopsis RNA-seq libraries, this method identifies a number of intriguing processing events that escaped standard quantitative analysis. We will present a visual selection of such events.
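    Both methods rely on Pearson correlation between per-gene profiles; its computation, shown here on hypothetical fold-change vectors, is simply:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two genes with parallel fold-change profiles correlate strongly;
# an anti-regulated gene correlates negatively (made-up numbers).
g1, g2, g3 = [2.0, 1.0, -1.0], [4.0, 2.0, -2.0], [-2.0, -1.0, 1.0]
print(round(pearson(g1, g2), 3), round(pearson(g1, g3), 3))  # 1.0 -1.0
```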

  • 11:25--12:00 : Yann Ponty

    Title: RNA folding with pseudoknots: Intractable, honestly?

    Yann Ponty (CNRS/Ecole Polytechnique, France), Rolf Backofen (Uni Freiburg, Germany) and Saad Sheikh (University of Florida, USA)

    Abstract:

    Predicting the folding of an RNA sequence, while allowing general pseudoknots (PK), consists in finding a minimal free-energy matching of its n positions. Assuming independently contributing base pairs, the problem can be solved in polynomial O(n^3) time using a variant of maximum weighted matching. By contrast, the problem was previously proven NP-hard in the more realistic nearest-neighbor energy model.
    In this joint work with Rolf Backofen (Uni Freiburg, Germany) and Saad Sheikh (University of Florida, USA), we consider an intermediate model, called the stacking-pairs energy model. We extend a result by Lyngso, showing that RNA folding with PK is NP-hard within a large class of parametrizations of the model. We also show the approximability of the problem by giving a practical O(n^3) algorithm that achieves at least a 5-approximation for any parametrization of the stacking model. This contrasts nicely with the nearest-neighbor version of the problem, which we prove cannot be approximated within any positive ratio, unless P=NP.
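    In the independent base-pair model, folding with arbitrary pseudoknots is exactly a maximum-weight matching of sequence positions. A brute-force version with a toy Watson-Crick weight function (exponential and purely illustrative; the polynomial algorithm relies on general weighted matching):

```python
def best_matching(seq, pair_weight):
    """Exhaustive maximum-weight matching of sequence positions.
    Crossing pairs are allowed, i.e. arbitrary pseudoknots.  Exponential
    time, tiny inputs only."""
    def rec(free):
        if len(free) < 2:
            return 0.0, []
        i, rest = free[0], free[1:]
        best, pairs = rec(rest)  # option: leave position i unpaired
        for idx, j in enumerate(rest):  # option: pair i with some j
            w = pair_weight(seq[i], seq[j])
            if w > 0:
                sub, sp = rec(rest[:idx] + rest[idx+1:])
                if w + sub > best:
                    best, pairs = w + sub, [(i, j)] + sp
        return best, pairs
    return rec(tuple(range(len(seq))))

# Toy weights: 1 for a Watson-Crick pair, 0 otherwise.
wc = lambda a, b: 1.0 if {a, b} in ({"A", "U"}, {"G", "C"}) else 0.0
score, pairs = best_matching("GACU", wc)
print(score, pairs)  # 2.0 [(0, 2), (1, 3)] -- a crossing (pseudoknotted) matching
```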

  • 12:00--14:00 : Lunch

  • 14:00--14:35 : Julien Clément

    Title: Analysis of sorting and searching algorithms in terms of the number of symbol comparisons

    Julien Clément (GREYC), Thu Hien Nguyen Thi (GREYC), Brigitte Vallée (GREYC)

    Abstract:

    The average-case analysis of algorithms is classically approached by measuring complexity as a number of comparisons. This yields complexity statements of the form "such an algorithm runs in O(n log n) on average" (Quicksort, to cite one of the best known, for example). These statements rest on specific assumptions, namely that comparisons between data items (keys), or between symbols, have unit cost, and they have the obvious merit of presenting an easily digestible picture of the complexity of sorting algorithms. These simplifying assumptions are nevertheless debatable: they do not allow a precise comparison of the advantages and drawbacks of algorithms that rest on different principles (i.e., those that compare keys and those that use a representation as a sequence of symbols) in a way that is fully satisfactory from both the information-theoretic and the algorithmic point of view. Indeed, computation is then not viewed at its most fundamental level, that is, through the manipulation of elementary symbols such as the bit, the byte, the character or the machine word. Moreover, the simplified analyses are not always informative in many applications (databases or natural language processing) where the information is highly "non-atomic" in nature, in the sense that it does not reduce to a single machine word. Using several examples of searching and sorting algorithms, I will present a method for analyzing the average number of symbol comparisons. The keys to be sorted are viewed as words (i.e., sequences of symbols) produced by a general source model. For particular sources (such as memoryless sources), one obtains a fine analysis (with computation of explicit constants...) of some of these algorithms.
    Joint work with Brigitte Vallée and Thu Hien Nguyen-Thi (initiated with Philippe Flajolet).
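    The cost model can be illustrated by charging each key comparison the length of the common prefix plus one deciding symbol; the plain Quicksort and the keys below are a toy sketch, not the talk's analysis:

```python
def lcp_cost(a, b):
    """Cost of comparing two keys symbol by symbol: length of the
    common prefix, plus one deciding comparison."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i + 1

def quicksort_symbol_cost(keys):
    """Total symbol comparisons performed by a plain Quicksort when each
    key comparison is charged lcp_cost symbols (toy cost model)."""
    if len(keys) <= 1:
        return 0
    pivot, rest = keys[0], keys[1:]
    cost = sum(lcp_cost(pivot, k) for k in rest)  # partition step
    small = [k for k in rest if k < pivot]
    large = [k for k in rest if k >= pivot]
    return cost + quicksort_symbol_cost(small) + quicksort_symbol_cost(large)

print(quicksort_symbol_cost(["abcd", "abce", "abba", "zzz"]))  # 9
```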

  • 14:35--15:20 : Keynote Talk
    Speaker: Valentina Boeva (Institut Curie)

    Title: Structural variants, loss of heterozygosity regions and copy number alterations in cancer: strategy of their detection using NGS data

    Abstract:

    Cancer genomes often display copy number alterations (CNAs) and/or losses of heterozygosity (LOH) (Hanahan and Weinberg, 2011). Genetic abnormalities in specific regions may be related to the aggressiveness of a cancer and be associated with clinical outcomes. In cancer, tumor suppressor genes can be deleted or mutated, whereas oncogenes can be amplified or mutated with a gain of function. At the same time, translocations can result in cancer-causing fusion proteins (BCR/ABL fusion in CML, BCL1/IGH in multiple myeloma, EWS/FLI1 in Ewing sarcoma, etc.)
    With the arrival of new high-throughput sequencing technologies, our current power to detect genetic abnormalities has significantly improved. Genomic breakpoints of large structural variants (i.e., translocations or large duplications and deletions) can be identified using two complementary approaches: (1) calculation and segmentation of copy number and allelic content profiles and (2) analysis of 'discordant' mate-paired/paired-ends mappings (PEMs).
    Investigation of copy number profiles allows identification of genomic regions of gain and loss. There exist three frequent obstacles in the analysis of cancer genomes: absence of an appropriate control sample for normal tissue, possible polyploidy and contamination of a tumor sample by normal cells. We therefore developed a bioinformatics tool, called FREEC [2], able to automatically detect CNAs with or without use of a control dataset. If a control sample is not available, FREEC normalizes copy number profiles using read mappability and GC-content. FREEC applies a LASSO-based segmentation procedure to the normalized profiles to predict CNAs. FREEC is able to analyze over-diploid tumor samples and samples contaminated by normal cells. If sequencing coverage is large enough (>15x), FREEC's extension, Control-FREEC [3], is able to calculate allelic content profiles and consequently predict loss of heterozygosity regions.
    For PEM data, one can complement the information about CNAs and LOH (i.e., output of FREEC) with the predictions of structural variants made by another tool that we have developed, SVDetect [4]. SVDetect finds clusters of 'discordant' PEMs and uses all the characteristics of reads inside the clusters (orientation, order and clone insert size) to identify structural variants type. SVDetect allows identification of a large spectrum of rearrangements including large insertions-deletions, duplications, inversions, insertions of genomic shards and balanced/unbalanced intra/inter-chromosomal translocations.
    The intersection of FREEC and SVDetect outputs allows one to (1) refine coordinates of CNAs using PEM data and (2) improve confidence in calling true positive rearrangements (particularly, in ambiguous satellite/repetitive regions).
    1. Hanahan, D. and Weinberg, R.A. (2011) Hallmarks of cancer: the next generation. Cell, 144, 646-674.
    2. Boeva, V. et al. (2011) Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics, 27(2), 268-269.
    3. Boeva, V. et al. (2012) Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics, 28(3), 423-425.
    4. Zeitouni, B. et al. (2010) SVDetect: a bioinformatic tool to identify genomic structural variations from paired-end next-generation sequencing data. Bioinformatics, 26, 1895-1896.
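    The control-free normalization idea can be caricatured as dividing each window's read count by the median count of windows with similar GC content (a simplified sketch with made-up numbers, not FREEC's actual procedure, which also uses mappability and LASSO-based segmentation):

```python
from collections import defaultdict

def gc_normalize(counts, gc):
    """Normalize per-window read counts by the median count of windows
    in the same (rounded) GC-fraction bin -- usable when no matched
    normal control sample is available (a toy sketch)."""
    by_gc = defaultdict(list)
    for c, g in zip(counts, gc):
        by_gc[round(g, 1)].append(c)
    med = {g: sorted(v)[len(v) // 2] for g, v in by_gc.items()}
    return [c / med[round(g, 1)] for c, g in zip(counts, gc)]

counts = [100, 110, 90, 210, 200, 55]           # hypothetical window counts
gc     = [0.41, 0.40, 0.39, 0.41, 0.40, 0.62]   # hypothetical GC fractions
print([round(r, 2) for r in gc_normalize(counts, gc)])
# [0.91, 1.0, 0.82, 1.91, 1.82, 1.0] -- ratios near 2 suggest a copy number gain
```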

  • 15:20--15:55 : Benjamin Martin

    Title: Estimating approximate similarity between musical sequences

    Benjamin Martin, Daniel G. Brown (University of Waterloo, ON, Canada), Pierre Hanna (LaBRI, Université Bordeaux 1, France) and Pascal Ferraro (LaBRI, Université Bordeaux 1, France)

    Abstract:

    Estimating approximate similarity between musical sequences has numerous applications in the analysis of music databases. In this context, local similarity by sequence alignment is easily adapted to the semantic particularities of musical symbols, so as to relate pieces on the basis of musically relevant criteria. However, as with biological sequence comparison, computing alignments is computationally expensive, which is problematic for the practical estimation of these similarities on music databases. Our study therefore focuses on adapting the ideas of the BLAST indexing heuristic to musical sequences. We present and justify the heuristic filters introduced in our setting, which quickly eliminate a large number of sequence pairs that are a priori not very similar. We detail a first statistical study that guides the choice of a relevant indexing seed. Our method is tested on the task of identifying musical covers. Our experiments compare detection by alignment with detection by BLAST, and show a substantial gain in computation time while preserving accurate identification of similar sequences.

  • 15:55--16:25 Coffee break

  • 16:25--17:00 : Alain Guénoche

    Title: Multiple consensus: a method to separate divergent genes

    Alain Guénoche

    Abstract:

    We present a method for deciding whether a given set (profile) of phylogenetic trees admits a single consensus tree, or whether several trees, each a consensus of one class of a partition of the profile, provide a better representation. The method is based on a measure of the congruence of the trees in the profile. Optimizing this measure makes it possible to decide whether the profile is homogeneous enough to admit a single consensus tree, or whether several trees, corresponding to divergent genes, are required.

  • 17:00--17:35 : Gabriele Fici

    Title: On Binary Jumbled Pattern Matching

    Gabriele Fici

    Abstract:

    Given a text s and a pattern t, the Jumbled Pattern Matching problem consists in finding the occurrences of the anagrams of t in s. We investigate the JPM problem in the case when s is a binary string. We show how the combinatorial properties of binary words allow one to design algorithms and data structures for this specific case.
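    In the binary case, an anagram of t occurs at a window of s iff the window has length |t| and the same number of 1s as t, a condition that prefix sums test in constant time per window (a basic sketch of the combinatorial property, not the talk's data structures):

```python
def jumbled_occurrences(s, t):
    """Starting positions of anagrams of t in binary string s: a window
    of length |t| matches iff it contains the same number of '1's as t."""
    m, ones = len(t), t.count("1")
    pref = [0]
    for c in s:  # pref[i] = number of '1's in s[:i]
        pref.append(pref[-1] + (c == "1"))
    return [i for i in range(len(s) - m + 1) if pref[i + m] - pref[i] == ones]

print(jumbled_occurrences("0110101", "110"))  # [0, 1, 2, 4]
```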

