Professor Hadi Quesneville
URGI - Unité de Recherche Génomique Info, genomics and bioinformatics research unit at the Institut National de la Recherche Agronomique (INRA), Versailles, France.
Jozef Schell seminar room
Recent advances in sequencing technologies now make it possible to sequence increasingly large genomes at reduced cost. Transposable elements (TEs) are the most structurally dynamic components and the largest fraction of the nuclear sequence in these large genomes, e.g. 85% of the maize genome (Schnable et al. 2009) and 88% of the wheat genome (Choulet et al. 2010). TE annotation should therefore be considered a major task in these genome projects. It remains a major challenge, however, because good TE annotation relies critically on an expertly assembled set of reference sequences, which currently cannot be obtained automatically. This crucial step is now a bottleneck for many genome analyses.
We scaled up a repeat detection pipeline and an annotation pipeline, both part of the REPET package (Flutre et al. 2011, now at release v2.2). We applied new strategies to cope with very large genomes such as that of wheat. These strategies are iterative and can be summarized as follows. (i) Detection of the TEs that are easiest to find, using stringent parameters, to build a first TE library. These are often young TEs and the least degenerate ones. (ii) TE annotation, then splicing out of the annotated sequences from the initial contigs, which yields a reduced genome sequence. (iii) Detection of the remaining TEs with sensitive parameters on the reduced genome sequence, to build a second TE library. (iv) Annotation of the original contigs with the concatenation of the two TE libraries. The rationale is that these large genomes are mostly made up of a few TE families that are easy to find because they are present in many copies. These families are detected in the first step, which reduces the genome size by a large factor. Using this approach, we reduced the wheat chromosome 3B sequence from 986 Mbp to ~230 Mbp, a reasonable size for TE detection with sensitive parameters.
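The splicing in step (ii) and the size reduction reported for chromosome 3B can be illustrated with a minimal Python sketch. This is not REPET code: the function names (`merge_intervals`, `splice_out`, `reduction_factor`), the half-open 0-based coordinate convention, and the toy contig are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed convention: 0-based, half-open intervals [start, end).
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent TE annotations into disjoint intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def splice_out(sequence: str, te_intervals: List[Interval]) -> str:
    """Return the contig with all annotated TE intervals removed (step ii)."""
    kept = []
    cursor = 0
    for start, end in merge_intervals(te_intervals):
        kept.append(sequence[cursor:start])  # keep the non-TE stretch
        cursor = end                         # skip the annotated TE
    kept.append(sequence[cursor:])           # keep the tail of the contig
    return "".join(kept)


def reduction_factor(original_size: float, reduced_size: float) -> float:
    """Ratio between the original and the reduced sequence size."""
    return original_size / reduced_size


if __name__ == "__main__":
    # Hypothetical toy contig with three annotated TE copies, two overlapping.
    contig = "AAAATTTTTTCCCCGGGGGGAA"
    annotations = [(4, 10), (8, 12), (14, 20)]
    print(splice_out(contig, annotations))

    # Sizes quoted in the abstract for wheat chromosome 3B, in Mbp.
    print(round(reduction_factor(986, 230), 2))
```

With the sizes quoted above, the reduced sequence is roughly a quarter of the original, which is what makes the sensitive second detection pass in step (iii) tractable.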
With these tools and approaches, we study genome complexity through the impact of transposable elements and other repeats on genome structure, function and evolution. We found that the majority of the repeats in the A. thaliana genome are rather ancient and likely derive from the retention of fragments deposited by ancestral bursts that occurred early in the evolution of the Brassicaceae. We illustrate how repeated sequences are degraded by mutations towards genomic dark matter over time. Our results further suggest that the deleterious impact of repeats on gene expression, as well as their regulation through small RNA-mediated pathways, can last over prolonged periods. We show that a substantial pool of small RNAs corresponds to old repeats, suggesting that repeat sequence divergence is accompanied by a diversifying population of small RNAs.