  1. Albers, P., Weytjens, B., De Mot, R., Marchal, K., & Springael, D. (2018). Molecular processes underlying synergistic linuron mineralization in a triple-species bacterial consortium biofilm revealed by differential transcriptomics. MICROBIOLOGYOPEN, 7(2).
    The proteobacteria Variovorax sp. WDL1, Comamonas testosteroni WDL7, and Hyphomicrobium sulfonivorans WDL6 compose a triple-species consortium that synergistically degrades and grows on the phenylurea herbicide linuron. To acquire a better insight into the interactions between the consortium members and the underlying molecular mechanisms, we compared the transcriptomes of the key biodegrading strains WDL7 and WDL1 grown as biofilms in either isolation or consortium conditions by differential RNAseq analysis. Differentially expressed pathways and cellular systems were inferred using the network-based algorithm PheNetic. Coculturing affected mainly metabolism in WDL1. Significantly enhanced expression of hylA encoding linuron hydrolase was observed. Moreover, differential expression of several pathways involved in carbohydrate, amino acid, nitrogen, and sulfur metabolism was observed indicating that WDL1 gains carbon and energy from linuron indirectly by consuming excretion products from WDL7 and/or WDL6. Moreover, in consortium conditions, WDL1 showed a pronounced stress response and overexpression of cell to cell interaction systems such as quorum sensing, contact-dependent inhibition, and Type VI secretion. Since the latter two systems can mediate interference competition, it prompts the question if synergistic linuron degradation is the result of true adaptive cooperation or rather a facultative interaction between bacteria that coincidentally occupy complementary metabolic niches.
  2. Mushthofa, M. (2018). Network-based modelling for omics data. Ghent University. Faculty of Sciences, Ghent, Belgium.
  3. Forslund, K., Pereira, C., Capella-Gutierrez, S., Sousa da Silva, A., Altenhoff, A., Huerta-Cepas, J., Muffato, M., et al. (2018). Gearing up to handle the mosaic nature of life in the quest for orthologs. BIOINFORMATICS, 34(2), 323–329.
    The Quest for Orthologs (QfO) is an open collaboration framework for experts in comparative phylogenomics and related research areas who have an interest in highly accurate orthology predictions and their applications. We here report highlights and discussion points from the QfO meeting 2015 held in Barcelona. Achievements in recent years have established a basis to support developments for improved orthology prediction and to explore new approaches. Central to the QfO effort is proper benchmarking of methods and services, as well as design of standardized datasets and standardized formats to allow sharing and comparison of results. Simultaneously, analysis pipelines have been improved, evaluated and adapted to handle large datasets. All this would not have occurred without the long-term collaboration of Consortium members. Meeting regularly to review and coordinate complementary activities from a broad spectrum of innovative researchers clearly benefits the community. Highlights of the meeting include addressing sources of and legitimacy of disagreements between orthology calls, the context dependency of orthology definitions, special challenges encountered when analyzing very anciently rooted orthologies, orthology in the light of whole-genome duplications, and the concept of orthologous versus paralogous relationships at different levels, including domain-level orthology. Furthermore, particular needs for different applications (e.g. plant genomics, ancient gene families and others) and the infrastructure for making orthology inferences available (e.g. interfaces with model organism databases) were discussed, with several ongoing efforts that are expected to be reported on during the upcoming 2017 QfO meeting.
  4. Hansen, B. O., Meyer, E. H., Ferrari, C., Vaid, N., Movahedi, S., Vandepoele, K., Nikoloski, Z., et al. (2018). Ensemble gene function prediction database reveals genes important for complex I formation in Arabidopsis thaliana. NEW PHYTOLOGIST, 217(4), 1521–1534.
    Recent advances in gene function prediction rely on ensemble approaches that integrate results from multiple inference methods to produce superior predictions. Yet, these developments remain largely unexplored in plants. We have explored and compared two methods to integrate 10 gene co-function networks for Arabidopsis thaliana and demonstrate how the integration of these networks produces more accurate gene function predictions for a larger fraction of genes with unknown function. These predictions were used to identify genes involved in mitochondrial complex I formation, and for five of them, we confirmed the predictions experimentally. The ensemble predictions are provided as a user-friendly online database, EnsembleNet. The methods presented here demonstrate that ensemble gene function prediction is a powerful method to boost prediction performance, whereas the EnsembleNet database provides a cutting-edge community tool to guide experimentalists.
  5. Khan, Aziz, Fornes, O., Stigliani, A., Gheorghe, M., Castro-Mondragon, J. A., van der Lee, R., Bessy, A., et al. (2018). JASPAR 2018 : update of the open-access database of transcription factor binding profiles and its web framework. NUCLEIC ACIDS RESEARCH, 46(D1), D260–D266.
    JASPAR is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups. In the 2018 release of JASPAR, the CORE collection has been expanded with 322 new PFMs (60 for vertebrates and 262 for plants) and 33 PFMs were updated (24 for vertebrates, 8 for plants and 1 for insects). These new profiles represent a 30% expansion compared to the 2016 release. In addition, we have introduced 316 TFFMs (95 for vertebrates, 218 for plants and 3 for insects). This release incorporates clusters of similar PFMs in each taxon and each TF class per taxon. The JASPAR 2018 CORE vertebrate collection of PFMs was used to predict TF-binding sites in the human genome. The predictions are made available to the scientific community through a UCSC Genome Browser track data hub. Finally, this update comes with a new web framework with an interactive and responsive user-interface, along with new features. All the underlying data can be retrieved programmatically using a RESTful API and through the JASPAR 2018 R/Bioconductor package.
  6. Lang, D., Ullrich, K. K., Murat, F., Fuchs, J., Jenkins, J., Haas, F. B., Piednoel, M., et al. (2018). The Physcomitrella patens chromosome-scale assembly reveals moss genome structure and evolution. PLANT JOURNAL, 93(3), 515–533.
    The draft genome of the moss model, Physcomitrella patens, comprised approximately 2000 unordered scaffolds. In order to enable analyses of genome structure and evolution we generated a chromosome-scale genome assembly using genetic linkage as well as (end) sequencing of long DNA fragments. We find that 57% of the genome comprises transposable elements (TEs), some of which may be actively transposing during the life cycle. Unlike in flowering plant genomes, gene- and TE-rich regions show an overall even distribution along the chromosomes. However, the chromosomes are mono-centric with peaks of a class of Copia elements potentially coinciding with centromeres. Gene body methylation is evident in 5.7% of the protein-coding genes, typically coinciding with low GC and low expression. Some giant virus insertions are transcriptionally active and might protect gametes from viral infection via siRNA mediated silencing. Structure-based detection methods show that the genome evolved via two rounds of whole genome duplications (WGDs), apparently common in mosses but not in liverworts and hornworts. Several hundred genes are present in colinear regions conserved since the last common ancestor of plants. These syntenic regions are enriched for functions related to plant-specific cell growth and tissue organization. The P. patens genome lacks the TE-rich pericentromeric and gene-rich distal regions typical for most flowering plant genomes. More non-seed plant genomes are needed to unravel how plant genomes evolve, and to understand whether the P. patens genome structure is typical for mosses or bryophytes.
  7. Van Bel, M., Diels, T., Vancaester, E., Kreft, L., Botzki, A., Van de Peer, Y., Coppens, F., et al. (2018). PLAZA 4.0 : an integrative resource for functional, evolutionary and comparative plant genomics. NUCLEIC ACIDS RESEARCH, 46(D1), D1190–D1196.
    PLAZA is a plant-oriented online resource for comparative, evolutionary and functional genomics. The PLAZA platform consists of multiple independent instances focusing on different plant clades, while also providing access to a consistent set of reference species. Each PLAZA instance contains structural and functional gene annotations, gene family data and phylogenetic trees and detailed gene colinearity information. A user-friendly web interface makes the necessary tools and visualizations accessible, specific for each data type. Here we present PLAZA 4.0, the latest iteration of the PLAZA framework. This version consists of two new instances (Dicots 4.0 and Monocots 4.0) providing a large increase in newly available species, and offers access to updated and newly implemented tools and visualizations, helping users with the ever-increasing demands for complex and in-depth analyses. The total number of species across both instances nearly doubles from 37 species in PLAZA 3.0 to 71 species in PLAZA 4.0, with a much broader coverage of crop species (e.g. wheat, palm oil) and species of evolutionary interest (e.g. spruce, Marchantia). The new PLAZA instances can also be accessed by a programming interface through a RESTful web service, thus allowing bioinformaticians to optimally leverage the power of the PLAZA platform.
  8. Van Goethem, M. W., Pierneef, R., Bezuidt, O. K., Van de Peer, Y., Cowan, D. A., & Makhalanyane, T. P. (2018). A reservoir of “historical” antibiotic resistance genes in remote pristine Antarctic soils. MICROBIOME, 6.
    Background: Soil bacteria naturally produce antibiotics as a competitive mechanism, with a concomitant evolution, and exchange by horizontal gene transfer, of a range of antibiotic resistance mechanisms. Surveys of bacterial resistance elements in edaphic systems have originated primarily from human-impacted environments, with relatively little information from remote and pristine environments, where the resistome may comprise the ancestral gene diversity. Methods: We used shotgun metagenomics to assess antibiotic resistance gene (ARG) distribution in 17 pristine and remote Antarctic surface soils within the undisturbed Mackay Glacier region. We also interrogated the phylogenetic placement of ARGs compared to environmental ARG sequences and tested for the presence of horizontal gene transfer elements flanking ARGs. Results: In total, 177 naturally occurring ARGs were identified, most of which encoded single or multi-drug efflux pumps. Resistance mechanisms for the inactivation of aminoglycosides, chloramphenicol and beta-lactam antibiotics were also common. Gram-negative bacteria harboured most ARGs (71%), with fewer genes from Gram-positive Actinobacteria and Bacilli (Firmicutes) (9%), reflecting the taxonomic composition of the soils. Strikingly, the abundance of ARGs per sample had a strong, negative correlation with species richness (r=-0.49, P < 0.05). This result, coupled with a lack of mobile genetic elements flanking ARGs, suggests that these genes are ancient acquisitions of horizontal transfer events. Conclusions: ARGs in these remote and uncontaminated soils most likely represent functional efficient historical genes that have since been vertically inherited over generations. The historical ARGs in these pristine environments carry a strong phylogenetic signal and form a monophyletic group relative to ARGs from other similar environments.
  9. Wan, T., Liu, Z.-M., Li, L.-F., Leitch, A. R., Leitch, I. J., Lohaus, R., Liu, Z.-J., et al. (2018). A genome for gnetophytes and early evolution of seed plants. NATURE PLANTS, 4(2), 82–89.
    Gnetophytes are an enigmatic gymnosperm lineage comprising three genera, Gnetum, Welwitschia and Ephedra, which are morphologically distinct from all other seed plants. Their distinctiveness has triggered much debate as to their origin, evolution and phylogenetic placement among seed plants. To increase our understanding of the evolution of gnetophytes, and their relation to other seed plants, we report here a high-quality draft genome sequence for Gnetum montanum, the first for any gnetophyte. By using a novel genome assembly strategy to deal with high levels of heterozygosity, we assembled >4 Gb of sequence encoding 27,491 protein-coding genes. Comparative analysis of the G. montanum genome with other gymnosperm genomes unveiled some remarkable and distinctive genomic features, such as a diverse assemblage of retrotransposons with evidence for elevated frequencies of elimination rather than accumulation, considerable differences in intron architecture, including both length distribution and proportions of (retro) transposon elements, and distinctive patterns of proliferation of functional protein domains. Furthermore, a few gene families showed Gnetum-specific copy number expansions (for example, cellulose synthase) or contractions (for example, Late Embryogenesis Abundant protein), which could be connected with Gnetum's distinctive morphological innovations associated with their adaptation to warm, mesic environments. Overall, the G. montanum genome enables a better resolution of ancestral genomic features within seed plants, and the identification of genomic characters that distinguish Gnetum from other gymnosperms.
  10. Zwaenepoel, Arthur, Diels, T., Amar, D., Van Parys, T., Shamir, R., Van de Peer, Y., & Tzfadia, O. (2018). MorphDB : prioritizing genes for specialized metabolism pathways and gene ontology categories in plants. FRONTIERS IN PLANT SCIENCE, 9.
    Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available online. We also provide a toolkit, named "MORPH bulk", for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
  11. Burgess, S. T., Bartley, K., Marr, E. J., Wright, H. W., Weaver, R. J., Prickett, J. C., … Nisbet, A. J. (2018). Draft genome assembly of the sheep scab mite, Psoroptes ovis. MICROBIOLOGY RESOURCE ANNOUNCEMENTS, 6(16).
    Sheep scab, caused by infestation with Psoroptes ovis, is highly contagious, results in intense pruritus, and represents a major welfare and economic concern. Here, we report the first draft genome assembly and gene prediction of P. ovis based on PacBio de novo sequencing. The ∼63.2-Mb genome encodes 12,041 protein-coding genes.
  12. Kulkarni, S. R., Vaneechoutte, D., Van de Velde, J., & Vandepoele, K. (2018). TF2Network : predicting transcription factor regulators and gene regulatory networks in Arabidopsis using publicly available binding site information. NUCLEIC ACIDS RESEARCH, 46(6).
    A gene regulatory network (GRN) is a collection of regulatory interactions between transcription factors (TFs) and their target genes. GRNs control different biological processes and have been instrumental to understand the organization and complexity of gene regulation. Although various experimental methods have been used to map GRNs in Arabidopsis thaliana, their limited throughput combined with the large number of TFs makes that for many genes our knowledge about regulating TFs is incomplete. We introduce TF2Network, a tool that exploits the vast amount of TF binding site information and enables the delineation of GRNs by detecting potential regulators for a set of co-expressed or functionally related genes. Validation using two experimental benchmarks reveals that TF2Network predicts the correct regulator in 75-92% of the test sets. Furthermore, our tool is robust to noise in the input gene sets, has a low false discovery rate, and shows a better performance to recover correct regulators compared to other plant tools. TF2Network is accessible through a web interface where GRNs are interactively visualized and annotated with various types of experimental functional information. TF2Network was used to perform systematic functional and regulatory gene annotations, identifying new TFs involved in circadian rhythm and stress response.
  13. Crauwels, Sam, Van Opstaele, F., Jaskula-Goiris, B., Steensels, J., Verreth, C., Bosmans, L., Paulussen, C., et al. (2017). Fermentation assays reveal differences in sugar and (off-) flavor metabolism across different Brettanomyces bruxellensis strains. FEMS YEAST RESEARCH, 17(1).
    Brettanomyces (Dekkera) bruxellensis is an ascomycetous yeast of major importance in the food, beverage and biofuel industry. It has been isolated from various man-made ecological niches that are typically characterized by harsh environmental conditions such as wine, beer, soft drink, etc. Recent comparative genomics studies revealed an immense intraspecific diversity, but it is still unclear whether this genetic diversity also leads to systematic differences in fermentation performance and (off-)flavor production, and to what extent strains have evolved to match their ecological niche. Here, we present an evaluation of the fermentation properties of eight genetically diverse B. bruxellensis strains originating from beer, wine and soft drinks. We show that sugar consumption and aroma production during fermentation are determined by both the yeast strain and composition of the medium. Furthermore, our results indicate a strong niche adaptation of B. bruxellensis, most clearly for wine strains. For example, only strains originally isolated from wine were able to thrive well and produce the typical Brettanomyces-related phenolic off-flavors 4-ethylguaiacol and 4-ethylphenol when inoculated in red wine. Sulfite tolerance was found as a key factor explaining the observed differences in fermentation performance and off-flavor production. Sequence analysis of genes related to phenolic off-flavor production, however, revealed only marginal differences between the isolates tested, especially at the amino acid level. Altogether, our study provides novel insights in the Brettanomyces metabolism of flavor production, and is highly relevant for both the wine and beer industry.
  14. Crèvecoeur, I., Gudmundsdottir, V., Vig, S., Marques Câmara Sodré, F., D’Hertog, W., Fierro Gutierrez, A. C. E., Van Lommel, L., et al. (2017). Early differences in islets from prediabetic NOD mice : combined microarray and proteomic analysis. DIABETOLOGIA, 60(3), 475–489.
    AIMS/HYPOTHESIS: Type 1 diabetes is an endocrine disease where a long preclinical phase, characterised by immune cell infiltration in the islets of Langerhans, precedes elevated blood glucose levels and disease onset. Although several studies have investigated the role of the immune system in this process of insulitis, the importance of the beta cells themselves in the initiation of type 1 diabetes is less well understood. The aim of this study was to investigate intrinsic differences present in the islets from diabetes-prone NOD mice before the onset of insulitis. METHODS: The islet transcriptome and proteome of 2-3-week-old mice was investigated by microarray and 2-dimensional difference gel electrophoresis (2D-DIGE), respectively. Subsequent analyses using sophisticated pathway analysis and ranking of differentially expressed genes and proteins based on their relevance in type 1 diabetes were performed. RESULTS: In the preinsulitic period, alterations in general pathways related to metabolism and cell communication were already present. Additionally, our analyses pointed to an important role for post-translational modifications (PTMs), especially citrullination by PAD2 and protein misfolding due to low expression levels of protein disulphide isomerases (PDIA3, 4 and 6), as causative mechanisms that induce beta cell stress and potential auto-antigen generation. CONCLUSIONS/INTERPRETATION: We conclude that the pancreatic islets, irrespective of immune differences, may contribute to the initiation of the autoimmune process. DATA AVAILABILITY: All microarray data are available in the ArrayExpress database under accession number E-MTAB-5264.
  15. Mizrachi, E., Verbeke, L., Christie, N., Fierro Gutierrez, A. C. E., Mansfield, S. D., Davis, M. F., Gjersing, E., et al. (2017). Network-based integration of systems genetics data reveals pathways associated with lignocellulosic biomass accumulation and processing. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 114(5), 1195–1200.
    As a consequence of their remarkable adaptability, fast growth, and superior wood properties, eucalypt tree plantations have emerged as key renewable feedstocks (over 20 million ha globally) for the production of pulp, paper, bioenergy, and other lignocellulosic products. However, most biomass properties such as growth, wood density, and wood chemistry are complex traits that are hard to improve in long-lived perennials. Systems genetics, a process of harnessing multiple levels of component trait information (e.g., transcript, protein, and metabolite variation) in populations that vary in complex traits, has proven effective for dissecting the genetics and biology of such traits. We have applied a network-based data integration (NBDI) method for a systems-level analysis of genes, processes and pathways underlying biomass and bioenergy-related traits using a segregating Eucalyptus hybrid population. We show that the integrative approach can link biologically meaningful sets of genes to complex traits and at the same time reveal the molecular basis of trait variation. Gene sets identified for related woody biomass traits were found to share regulatory loci, cluster in network neighborhoods, and exhibit enrichment for molecular functions such as xylan metabolism and cell wall development. These findings offer a framework for identifying the molecular underpinnings of complex biomass and bioprocessing-related traits. A more thorough understanding of the molecular basis of plant biomass traits should provide additional opportunities for the establishment of a sustainable bio-based economy.
  16. Vandepoele, Klaas. (2017). A guide to the PLAZA 3.0 plant comparative genomic database. In A. D. van Dijk (Ed.), Plant genomics databases : methods and protocols (Vol. 1533, pp. 183–200). New York, NY, USA: Springer.
    PLAZA 3.0 is an online resource for comparative genomics and offers a versatile platform to study gene functions and gene families or to analyze genome organization and evolution in the green plant lineage. Starting from genome sequence information for over 35 plant species, precomputed comparative genomic data sets cover homologous gene families, multiple sequence alignments, phylogenetic trees, and genomic colinearity information within and between species. Complementary functional data sets, a Workbench, and interactive visualization tools are available through a user-friendly web interface, making PLAZA an excellent starting point to translate sequence or omics data sets into biological knowledge. PLAZA is available online.
  17. Verheggen, K., Volders, P.-J., Mestdagh, P., Menschaert, G., Van Damme, P., Gevaert, K., Martens, L., et al. (2017). Noncoding after all : biases in proteomics data do not explain observed absence of lncRNA translation products. JOURNAL OF PROTEOME RESEARCH, 16(7), 2508–2515.
    Over the past decade, long noncoding RNAs (lncRNAs) have emerged as novel functional entities of the eukaryotic genome. However, the scientific community remains divided over the amount of true noncoding transcripts among the large number of unannotated transcripts identified by recent large scale and deep RNA-sequencing efforts. Here, we systematically exclude possible technical reasons underlying the absence of lncRNA-encoded proteins in mass spectrometry data sets, strongly suggesting that the large majority of lncRNAs is indeed not translated.
  18. De Schutter, K., Tsaneva, M., Kulkarni, S. R., Rougé, P., Vandepoele, K., & Van Damme, E. (2017). Evolutionary relationships and expression analysis of EUL domain proteins in rice (Oryza sativa). RICE, 10.
    Background: Lectins, defined as 'Proteins that can recognize and bind specific carbohydrate structures', are widespread among all kingdoms of life and play an important role in various biological processes in the cell. Most plant lectins are involved in stress signaling and/or defense. The family of Euonymus-related lectins (EULs) represents a group of stress-related lectins composed of one or two EUL domains. The latter protein domain is unique in that it is ubiquitous in land plants, suggesting an important role for these proteins. Results: Despite the availability of multiple completely sequenced rice genomes, little is known on the occurrence of lectins in rice. We identified 329 putative lectin genes in the genome of Oryza sativa subsp. japonica belonging to nine out of 12 plant lectin families. In this paper, an in-depth molecular characterization of the EUL family of rice was performed. In addition, analyses of the promoter sequences and investigation of the transcript levels for these EUL genes enabled retrieval of important information related to the function and stress responsiveness of these lectins. Finally, a comparative analysis between rice cultivars and several monocot and dicot species revealed a high degree of sequence conservation within the EUL domain as well as in the domain organization of these lectins. Conclusions: The presence of EULs throughout the plant kingdom and the high degree of sequence conservation in the EUL domain suggest that these proteins serve an important function in the plant cell. Analysis of the promoter region of the rice EUL genes revealed a diversity of stress responsive elements. Furthermore analysis of the expression profiles of the EUL genes confirmed that they are differentially regulated in response to several types of stress. These data suggest a potential role for the EULs in plant stress signaling and defense.
  19. Avila Cobos, F., Anckaert, J., Volders, P.-J., Everaert, C., Rombaut, D., Vandesompele, J., De Preter, K., et al. (2017). Zipper plot : visualizing transcriptional activity of genomic regions. BMC BIOINFORMATICS, 18.
    Background: Reconstructing transcript models from RNA-sequencing (RNA-seq) data and establishing these as independent transcriptional units can be a challenging task. Current state-of-the-art tools for long non-coding RNA (lncRNA) annotation are mainly based on evolutionary constraints, which may result in false negatives due to the overall limited conservation of lncRNAs. Results: To tackle this problem we have developed the Zipper plot, a novel visualization and analysis method that enables users to simultaneously interrogate thousands of human putative transcription start sites (TSSs) in relation to various features that are indicative for transcriptional activity. These include publicly available CAGE-sequencing, ChIP-sequencing and DNase-sequencing datasets. Our method only requires three tab-separated fields (chromosome, genomic coordinate of the TSS and strand) as input and generates a report that includes a detailed summary table, a Zipper plot and several statistics derived from this plot. Conclusion: Using the Zipper plot, we found evidence of transcription for a set of well-characterized lncRNAs and observed that fewer mono-exonic lncRNAs have CAGE peaks overlapping with their TSSs compared to multi-exonic lncRNAs. Using publicly available RNA-seq data, we found more than one hundred cases where junction reads connected protein-coding gene exons with a downstream mono-exonic lncRNA, revealing the need for a careful evaluation of lncRNA 5′-boundaries. Our method is implemented using the statistical programming language R and is freely available as a webtool.
  20. Cormier, A., Avia, K., Sterck, L., Derrien, T., Wucher, V., Andres, G., Monsoor, M., et al. (2017). Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus. NEW PHYTOLOGIST, 214(1), 219–232.
    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles.
  21. De La Torre, A. R., Li, Z., Van de Peer, Y., & Ingvarsson, P. K. (2017). Contrasting rates of molecular evolution and patterns of selection among gymnosperms and flowering plants. MOLECULAR BIOLOGY AND EVOLUTION, 34(6), 1363–1377.
    The majority of variation in rates of molecular evolution among seed plants remains both unexplored and unexplained. Although some attention has been given to flowering plants, reports of molecular evolutionary rates for their sister plant clade (gymnosperms) are scarce, and to our knowledge differences in molecular evolution among seed plant clades have never been tested in a phylogenetic framework. Angiosperms and gymnosperms differ in a number of features, of which contrasting reproductive biology, life spans, and population sizes are the most prominent. The highly conserved morphology of gymnosperms evidenced by similarity of extant species to fossil records and the high levels of macrosynteny at the genomic level have led scientists to believe that gymnosperms are slow-evolving plants, although some studies have offered contradictory results. Here, we used 31,968 nucleotide sites obtained from orthologous genes across a wide taxonomic sampling that includes representatives of most conifers, cycads, ginkgo, and many angiosperms with a sequenced genome. Our results suggest that angiosperms and gymnosperms differ considerably in their rates of molecular evolution per unit time, with gymnosperm rates being, on average, seven times lower than angiosperm species. Longer generation times and larger genome sizes are some of the factors explaining the slow rates of molecular evolution found in gymnosperms. In contrast to their slow rates of molecular evolution, gymnosperms possess higher substitution rate ratios than angiosperm taxa. Finally, our study suggests stronger and more efficient purifying and diversifying selection in gymnosperm than in angiosperm species, probably in relation to larger effective population sizes.
  22. Ruprecht, C., Proost, S., Hernandez-Coronado, M., Ortiz-Ramirez, C., Lang, D., Rensing, S. A., Becker, J. D., et al. (2017). Phylogenomic analysis of gene co-expression networks reveals the evolution of functional modules. PLANT JOURNAL, 90(3), 447–465.
    Molecular evolutionary studies correlate genomic and phylogenetic information with the emergence of new traits of organisms. These traits are, however, the consequence of dynamic gene networks composed of functional modules, which might not be captured by genomic analyses. Here, we established a method that combines large-scale genomic and phylogenetic data with gene co-expression networks to extensively study the evolutionary make-up of modules in the moss Physcomitrella patens, and in the angiosperms Arabidopsis thaliana and Oryza sativa (rice). We first show that younger genes are less annotated than older genes. By mapping genomic data onto the co-expression networks, we found that genes from the same evolutionary period tend to be connected, whereas old and young genes tend to be disconnected. Consequently, the analysis revealed modules that emerged at a specific time in plant evolution. To uncover the evolutionary relationships of the modules that are conserved across the plant kingdom, we added phylogenetic information that revealed duplication and speciation events on the module level. This combined analysis revealed an independent duplication of cell wall modules in bryophytes and angiosperms, suggesting a parallel evolution of cell wall pathways in land plants.
  23. Pannier, L., Merino, E., Marchal, K., & Collado-Vides, J. (2017). Effect of genomic distance on coexpression of coregulated genes in E. coli. PLOS ONE, 12(4).
    In prokaryotes, genomic distance is a feature that in addition to coregulation affects coexpression. Several observations, such as genomic clustering of highly coexpressed small regulons, support the idea that coexpression behavior of coregulated genes is affected by the distance between the coregulated genes. However, the specific contribution of distance in addition to coregulation in determining the degree of coexpression has not yet been studied systematically. In this work, we exploit the rich information in RegulonDB to study how the genomic distance between coregulated genes affects their degree of coexpression, measured by pairwise similarity of expression profiles obtained under a large number of conditions. We observed that, in general, coregulated genes display higher degrees of coexpression as they are more closely located on the genome. This contribution of genomic distance in determining the degree of coexpression was relatively small compared to the degree of coexpression that was determined by the tightness of the coregulation (degree of overlap of regulatory programs) but was shown to be evolutionarily constrained. In addition, the distance effect was sufficient to guarantee coexpression of coregulated genes that are located at very short distances, irrespective of their tightness of coregulation. This is partly but definitely not always because the close distance is also the cause of the coregulation. In cases where it is not, we hypothesize that the effect of the distance on coexpression could be caused by the fact that coregulated genes closely located to each other are also relatively more equidistantly located from their common TF and therefore subject to more similar levels of TF molecules. The absolute genomic distance of the coregulated genes to their common TF-coding gene tends to be less important in determining the degree of coexpression. Our results pinpoint the importance of taking into account the combined effect of distance and coregulation when studying prokaryotic coexpression and transcriptional regulation.
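    The two quantities compared in this abstract, pairwise similarity of expression profiles and genomic distance, can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and a circular chromosome; the abstract does not state which measure the authors used, and the function names, coordinates, and genome length below are hypothetical.

```python
import math


def pearson(profile_a, profile_b):
    """Pearson correlation between two equally long expression profiles."""
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in profile_a))
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in profile_b))
    return cov / (sd_a * sd_b)


def circular_distance(pos_a, pos_b, genome_length):
    """Shortest distance between two positions on a circular chromosome."""
    direct = abs(pos_a - pos_b)
    return min(direct, genome_length - direct)


# Hypothetical gene pair: expression across five conditions and start coordinates.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.1, 3.9, 6.2, 8.0, 9.8]
similarity = pearson(gene_x, gene_y)
distance = circular_distance(50_000, 950_000, genome_length=1_000_000)
print(f"coexpression={similarity:.3f}, distance={distance} bp")
```

    Taking the shorter arc matters on a circular chromosome: the two hypothetical genes above are 900 kb apart by direct subtraction but only 100 kb apart across the origin.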
  24. Van de Peer, Y., Mizrachi, E., & Marchal, K. (2017). The evolutionary significance of polyploidy. NATURE REVIEWS GENETICS, 18(7), 411–424.
    Polyploidy, or the duplication of entire genomes, has been observed in prokaryotic and eukaryotic organisms, and in somatic and germ cells. The consequences of polyploidization are complex and variable, and they differ greatly between systems (clonal or non-clonal) and species, but the process has often been considered to be an evolutionary 'dead end'. Here, we review the accumulating evidence that correlates polyploidization with environmental change or stress, and that has led to an increased recognition of its short-term adaptive potential. In addition, we discuss how, once polyploidy has been established, the unique retention profile of duplicated genes following whole-genome duplication might explain key longer-term evolutionary transitions and a general increase in biological complexity.
  25. Vlastaridis, P., Kyriakidou, P., Chaliotis, A., Van de Peer, Y., Oliver, S. G., & Amoutzias, G. D. (2017). Estimating the total number of phosphoproteins and phosphorylation sites in eukaryotic proteomes. GIGASCIENCE, 6(2), 1–11.
    Background: Phosphorylation is the most frequent post-translational modification made to proteins and may regulate protein activity as either a molecular digital switch or a rheostat. Despite the cornucopia of high-throughput (HTP) phosphoproteomic data in the last decade, it remains unclear how many proteins are phosphorylated and how many phosphorylation sites (p-sites) can exist in total within a eukaryotic proteome. We present the first reliable estimates of the total number of phosphoproteins and p-sites for four eukaryotes (human, mouse, Arabidopsis, and yeast). Results: In all, 187 HTP phosphoproteomic datasets were filtered, compiled, and studied along with two low-throughput (LTP) compendia. Estimates of the number of phosphoproteins and p-sites were inferred by two methods: Capture-Recapture, and fitting the saturation curve of cumulative redundant vs. cumulative non-redundant phosphoproteins/p-sites. Estimates were also adjusted for different levels of noise within the individual datasets and other confounding factors. We estimate that in total, 13 000, 11 000, and 3000 phosphoproteins and 230 000, 156 000, and 40 000 p-sites exist in human, mouse, and yeast, respectively, whereas estimates for Arabidopsis were not as reliable. Conclusions: Most of the phosphoproteins have been discovered for human, mouse, and yeast, while the dataset for Arabidopsis is still far from complete. The datasets for p-sites are not as close to saturation as those for phosphoproteins. Integration of the LTP data suggests that current HTP phosphoproteomics appears to be capable of capturing 70% to 95% of total phosphoproteins, but only 40% to 60% of total p-sites.
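    As an illustration of the Capture-Recapture idea named in this abstract, the sketch below uses the simplest two-sample form (the Lincoln-Petersen estimator) and its bias-corrected variant (Chapman). This is an assumption made for illustration: the abstract does not say which estimator the authors applied, and the protein identifiers are invented.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Two-sample capture-recapture estimate of the total population size."""
    if n_both <= 0:
        raise ValueError("the two samples must share at least one item")
    return n_first * n_second / n_both


def chapman(n_first, n_second, n_both):
    """Bias-corrected variant that stays defined when the overlap is zero."""
    return (n_first + 1) * (n_second + 1) / (n_both + 1) - 1


# Hypothetical phosphoproteins detected in two independent datasets.
dataset_a = {"P1", "P2", "P3", "P4", "P5", "P6"}
dataset_b = {"P4", "P5", "P6", "P7", "P8"}
shared = len(dataset_a & dataset_b)

total_lp = lincoln_petersen(len(dataset_a), len(dataset_b), shared)
total_ch = chapman(len(dataset_a), len(dataset_b), shared)
print(total_lp, total_ch)
```

    The intuition is that the fraction of dataset B already seen in dataset A estimates the fraction of the whole population that A captured, so a small overlap implies many items remain undiscovered.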
  26. Yao, Y., & Van de Peer, Y. (2017). Simulating biological complexity through artificial evolution. In 2017 3RD IEEE INTERNATIONAL CONFERENCE ON CYBERNETICS (CYBCONF) (pp. 101–108). New York, NY, USA: IEEE.
  27. Jacobs, B., Goetghebeur, E., Vandesompele, J., De Ganck, A., Nijs, N., Beckers, A., Papazova, N., et al. (2017). Model-based classification for digital PCR : your Umbrella for rain. ANALYTICAL CHEMISTRY, 89(8), 4461–4467.
    Standard data analysis pipelines for digital PCR estimate the concentration of a target nucleic acid by digitizing the end-point fluorescence of the parallel micro-PCR reactions, using an automated hard threshold. While it is known that misclassification has a major impact on the concentration estimate and substantially reduces accuracy, the uncertainty of this classification is typically ignored. We introduce a model-based clustering method to estimate the probability that the target is present (absent) in a partition conditional on its observed fluorescence and the distributional shape in no-template control samples. This methodology acknowledges the inherent uncertainty of the classification and provides a natural measure of precision, both at individual partition level and at the level of the global concentration. We illustrate our method on genetically modified organism, inhibition, dynamic range, and mutation detection experiments. We show that our method provides concentration estimates of similar accuracy or better than the current standard, along with a more realistic measure of precision. The individual partition probabilities and diagnostic density plots further allow for some quality control. An R implementation of our method, called Umbrella, is available, providing a more objective and automated data analysis procedure for absolute dPCR quantification.
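    The "standard pipeline" that this abstract contrasts with can be sketched as follows: each partition is classified with a hard fluorescence threshold, and the mean number of target copies per partition is then derived from the fraction of negative partitions under a Poisson assumption. This is a sketch of the conventional baseline, not of the Umbrella method; the threshold and fluorescence values are hypothetical.

```python
import math


def copies_per_partition(fluorescence, threshold):
    """Hard-threshold classification followed by a Poisson correction.

    Returns the estimated mean number of target copies per partition,
    lambda = -ln(fraction of negative partitions).
    """
    total = len(fluorescence)
    positives = sum(1 for value in fluorescence if value > threshold)
    if positives == total:
        raise ValueError("all partitions are positive; estimate is undefined")
    return -math.log(1 - positives / total)


# Hypothetical end-point fluorescence readings for four partitions.
readings = [100.0, 2000.0, 150.0, 3000.0]
lam = copies_per_partition(readings, threshold=1000.0)
print(f"estimated copies per partition: {lam:.4f}")
```

    Because every partition is forced into one of two classes, a reading close to the threshold contributes as much certainty as one far from it, which is the limitation the abstract's probabilistic classification addresses.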
  28. Van Parys, T., Melckenbeeck, I., Houbraken, M., Audenaert, P., Colle, D., Pickavet, M., … Van de Peer, Y. (2017). A Cytoscape app for motif enumeration with ISMAGS. BIOINFORMATICS, 33(3), 461–463.
    We present a Cytoscape app for the ISMAGS algorithm, which can enumerate all instances of a motif in a graph, making optimal use of the motif's symmetries to make the search more efficient. The Cytoscape app provides a handy interface for this algorithm, which allows more efficient network analysis.
  29. Decock, A., Ongenaert, M., De Wilde, B., Brichard, B., Noguera, R., Speleman, F., & Vandesompele, J. (2017). Stage 4S neuroblastoma tumors show a characteristic DNA methylation portrait. OncoPoint, 5th Research seminar, Abstracts. Presented at the 5th OncoPoint research seminar.
  30. Mizrachi, E., Verbeke, L., Van de Peer, Y., Marchal, K., & Myburg, A. A. (2017). Principles of systems biology, no. 14 : [...] Network analysis of woody biomass. CELL SYSTEMS.
    This month: sage advice from phage to their offspring; systematic analyses of protein quality control, mitochondrial respiration, and woody biomass; a continental-scale experiment; and engineered protein tools galore.
  31. Li, Z., De La Torre, A. R., Sterck, L., Cánovas, F. M., Avila, C., Merino, I., Cabezas, J. A., et al. (2017). Single-copy genes as molecular markers for phylogenomic studies in seed plants. GENOME BIOLOGY AND EVOLUTION, 9(5), 1130–1147.
    Phylogenetic relationships among seed plant taxa, especially within the gymnosperms, remain contested. In contrast to angiosperms, for which several genomic, transcriptomic and phylogenetic resources are available, there are few, if any, molecular markers that allow broad comparisons among gymnosperm species. With few gymnosperm genomes available, recently obtained transcriptomes in gymnosperms are a great addition to identifying single-copy gene families as molecular markers for phylogenomic analysis in seed plants. Taking advantage of an increasing number of available genomes and transcriptomes, we identified single-copy genes in a broad collection of seed plants and used these to infer phylogenetic relationships between major seed plant taxa. This study aims at extending the current phylogenetic toolkit for seed plants, assessing its ability for resolving seed plant phylogeny, and discussing potential factors affecting phylogenetic reconstruction. In total, we identified 3,072 single-copy genes in 31 gymnosperms and 2,156 single-copy genes in 34 angiosperms. All studied seed plants shared 1,469 single-copy genes, which are generally involved in functions like DNA metabolism, cell cycle, and photosynthesis. A selected set of 106 single-copy genes provided good resolution for the seed plant phylogeny except for gnetophytes. Although some of our analyses support a sister relationship between gnetophytes and other gymnosperms, phylogenetic trees from concatenated alignments without 3rd codon positions and amino acid alignments under the CAT + GTR model support gnetophytes as a sister group to Pinaceae. Our phylogenomic analyses demonstrate that, in general, single-copy genes can uncover both recent and deep divergences of seed plant phylogeny.
  32. Ruprecht, C., Lohaus, R., Vanneste, K., Mutwil, M., Nikoloski, Z., Van de Peer, Y., & Persson, S. (2017). Revisiting ancestral polyploidy in plants. SCIENCE ADVANCES, 3(7).
    Whole-genome duplications (WGDs) or polyploidy events have been studied extensively in plants. In a now widely cited paper, Jiao et al. presented evidence for two ancient, ancestral plant WGDs predating the origin of flowering and seed plants, respectively. This finding was based primarily on a bimodal age distribution of gene duplication events obtained from molecular dating of almost 800 phylogenetic gene trees. We reanalyzed the phylogenomic data of Jiao et al. and found that the strong bimodality of the age distribution may be the result of technical and methodological issues and may hence not be a "true" signal of two WGD events. By using a state-of-the-art molecular dating algorithm, we demonstrate that the reported bimodal age distribution is not robust and should be interpreted with caution. Thus, there exists little evidence for two ancient WGDs in plants from phylogenomic dating.
  33. Unver, T., Wu, Z., Sterck, L., Turktas, M., Lohaus, R., Li, Z., Yang, M., et al. (2017). Genome of wild olive and the evolution of oil biosynthesis. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 114(44), E9413–E9422.
    Here we present the genome sequence and annotation of the wild olive tree (Olea europaea var. sylvestris), called oleaster, which is considered an ancestor of cultivated olive trees. More than 50,000 protein-coding genes were predicted, a majority of which could be anchored to 23 pseudochromosomes obtained through a newly constructed genetic map. The oleaster genome contains signatures of two Oleaceae lineage-specific paleopolyploidy events, dated at ~28 and ~59 Mya. These events contributed to the expansion and neofunctionalization of genes and gene families that play important roles in oil biosynthesis. The functional divergence of oil biosynthesis pathway genes, such as FAD2, SACPD, EAR, and ACPTE, following duplication, has been responsible for the differential accumulation of oleic and linoleic acids produced in olive compared with sesame, a closely related oil crop. Duplicated oleaster FAD2 genes are regulated by an siRNA derived from a transposable element-rich region, leading to suppressed levels of FAD2 gene expression. Additionally, neofunctionalization of members of the SACPD gene family has led to increased expression of SACPD2, 3, 5, and 7, consequently resulting in an increased desaturation of stearic acid. Taken together, decreased FAD2 expression and increased SACPD expression likely explain the accumulation of exceptionally high levels of oleic acid in olive. The oleaster genome thus provides important insights into the evolution of oil biosynthesis and will be a valuable resource for oil crop genomics.
  34. Cañas, R. A., Li, Z., Pascual, M. B., Castro-Rodríguez, V., Ávila, C., Sterck, L., Van de Peer, Y., et al. (2017). The gene expression landscape of pine seedling tissues. PLANT JOURNAL, 91(6), 1064–1087.
    Conifers dominate vast regions of the Northern hemisphere. They are the main source of raw materials for the timber industry as well as a wide range of biomaterials. Despite their inherent difficulties as experimental models for classical plant biology research, the technological advances in genomics research are enabling fundamental studies on these plants. The use of laser capture microdissection followed by transcriptomic analysis is a powerful tool for unravelling the molecular and functional organization of conifer tissues and specialized cells. In the present work, 14 different tissues from 1-month-old maritime pine (Pinus pinaster) seedlings have been isolated and their transcriptomes analysed. The results increased the sequence information and number of full-length transcripts from a previous reference transcriptome and added 39 841 new transcripts. In total, 2376 transcripts were ubiquitously expressed in all of the examined tissues. These transcripts could be considered the core 'housekeeping genes' in pine. The genes have been clustered according to their expression profiles. This analysis reduced the number of profiles to 38, most of these defined by their expression in a unique tissue that is much higher than in the other tissues. The expression and localization data are accessible online. This study presents an overview of the gene expression distribution in different pine tissues, specifically highlighting the relationships between tissue gene expression and function. This transcriptome atlas is a valuable resource for functional genomics research in conifers.
  35. Kreft, L., Botzki, A., Coppens, F., Vandepoele, K., & Van Bel, M. (2017). PhyD3 : a phylogenetic tree viewer with extended phyloXML support for functional genomics data visualization. BIOINFORMATICS, 33(18), 2946–2947.
    Motivation: Comparative and evolutionary studies utilize phylogenetic trees to analyze and visualize biological data. Recently, several web-based tools for the display, manipulation and annotation of phylogenetic trees, such as iTOL and Evolview, have released updates to be compatible with the latest web technologies. While those web tools operate an open server access model with a multitude of registered users, a feature-rich open source solution using current web technologies is not available. Results: Here, we present an extension of the widely used PhyloXML standard with several new options to accommodate functional genomics or annotation datasets for advanced visualization. Furthermore, PhyD3 has been developed as a lightweight tool using the JavaScript library D3.js to achieve a state-of-the-art phylogenetic tree visualization in the web browser, with support for advanced annotations. The current implementation is open source, easily adaptable and easy to implement in third parties' web sites. Availability and implementation: More information about PhyD3 itself, installation procedures and implementation links is available online. Supplementary information: Supplementary data are available at Bioinformatics online.
  36. Roodt, D., Lohaus, R., Sterck, L., Swanepoel, R. L., Van de Peer, Y., & Mizrachi, E. (2017). Evidence for an ancient whole genome duplication in the cycad lineage. PLOS ONE, 12(9).
    Contrary to the many whole genome duplication events recorded for angiosperms (flowering plants), whole genome duplications in gymnosperms (non-flowering seed plants) seem to be much rarer. Although ancient whole genome duplications have been reported for most gymnosperm lineages as well, some are still contested and need to be confirmed. For instance, data for ginkgo, but particularly cycads have remained inconclusive so far, likely due to the quality of the data available and flaws in the analysis. We extracted and sequenced RNA from both the cycad Encephalartos natalensis and Ginkgo biloba. This was followed by transcriptome assembly, after which these data were used to build paralog age distributions. Based on these distributions, we identified remnants of an ancient whole genome duplication in both cycads and ginkgo. The most parsimonious explanation would be that this whole genome duplication event was shared between both species and had occurred prior to their divergence, about 300 million years ago.
  37. Zhang, G.-Q., Liu, K.-W., Li, Z., Lohaus, R., Hsiao, Y.-Y., Niu, S.-C., … Liu, Z.-J. (2017). The Apostasia genome and the evolution of orchids. NATURE, 549(7672), 379–383.
    Constituting approximately 10% of flowering plant species, orchids (Orchidaceae) display unique flower morphologies, possess an extraordinary diversity in lifestyle, and have successfully colonized almost every habitat on Earth. Here we report the draft genome sequence of Apostasia shenzhenica, a representative of one of two genera that form a sister lineage to the rest of the Orchidaceae, providing a reference for inferring the genome content and structure of the most recent common ancestor of all extant orchids and improving our understanding of their origins and evolution. In addition, we present transcriptome data for representatives of Vanilloideae, Cypripedioideae and Orchidoideae, and novel third-generation genome data for two species of Epidendroideae, covering all five orchid subfamilies. A. shenzhenica shows clear evidence of a whole-genome duplication, which is shared by all orchids and occurred shortly before their divergence. Comparisons between A. shenzhenica and other orchids and angiosperms also permitted the reconstruction of an ancestral orchid gene toolkit. We identify new gene families, gene family expansions and contractions, and changes within MADS-box gene classes, which control a diverse suite of developmental processes, during orchid evolution. This study sheds new light on the genetic mechanisms underpinning key orchid innovations, including the development of the labellum and gynostemium, pollinia, and seeds without endosperm, as well as the evolution of epiphytism; reveals relationships between the Orchidaceae subfamilies; and helps clarify the evolutionary history of orchids within the angiosperms.
  38. Heydari, M., Miclotte, G., Demeester, P., Van de Peer, Y., & Fostier, J. (2017). Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC BIOINFORMATICS, 18.
    Background: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. Results: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. Conclusions: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
  39. Miclotte, G., Plaisance, S., Rombauts, S., Van de Peer, Y., Audenaert, P., & Fostier, J. (2017). OMSim : a simulator for optical map data. BIOINFORMATICS, 33(17), 2740–2742.
    Motivation: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, it can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available. Results: We have developed a simulator, OMSim, that produces synthetic optical map data that mimics real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements.
  40. De Smet, R., Sabaghian, E., Li, Z., Saeys, Y., & Van de Peer, Y. (2017). Coordinated functional divergence of genes after genome duplication in Arabidopsis thaliana. PLANT CELL, 29(11), 2786–2800.
    Gene and genome duplications have been rampant during the evolution of flowering plants. Unlike small-scale gene duplications, whole-genome duplications (WGDs) copy entire pathways or networks, and as such create the unique situation in which such duplicated pathways or networks could evolve novel functionality through the coordinated sub- or neofunctionalization of their constituent genes. Here, we describe a remarkable case of coordinated gene expression divergence following WGDs in Arabidopsis thaliana. We identified a set of 92 homoeologous gene pairs that all show a similar pattern of tissue-specific gene expression divergence following WGD, with one homoeolog showing predominant expression in aerial tissues and the other homoeolog showing biased expression in tip-growth tissues. We provide evidence that this pattern of gene expression divergence seems to involve genes with a role in cell polarity and that likely function in the maintenance of cell wall integrity. Following WGD, many of these duplicated genes evolved separate functions through subfunctionalization in growth/development and stress response. Uncoupling these processes through genome duplications likely provided important adaptations with respect to growth and morphogenesis and defense against biotic and abiotic stress.
  41. Tasdighian, S., Van Bel, M., Li, Z., Van de Peer, Y., Carretero-Paulet, L., & Maere, S. (2017). Reciprocally retained genes in the angiosperm lineage show the hallmarks of dosage balance sensitivity. PLANT CELL, 29(11), 2766–2785.
    In several organisms, particular functional categories of genes, such as regulatory and complex-forming genes, are preferentially retained after whole-genome multiplications but rarely duplicate through small-scale duplication, a pattern referred to as reciprocal retention. This peculiar duplication behavior is hypothesized to stem from constraints on the dosage balance between the genes concerned and their interaction context. However, the evidence for a relationship between reciprocal retention and dosage balance sensitivity remains fragmentary. Here, we identified which gene families are most strongly reciprocally retained in the angiosperm lineage and studied their functional and evolutionary characteristics. Reciprocally retained gene families exhibit stronger sequence divergence constraints and lower rates of functional and expression divergence than other gene families, suggesting that dosage balance sensitivity is a general characteristic of reciprocally retained genes. Gene families functioning in regulatory and signaling processes are much more strongly represented at the top of the reciprocal retention ranking than those functioning in multiprotein complexes, suggesting that regulatory imbalances may lead to stronger fitness effects than classical stoichiometric protein complex imbalances. Finally, reciprocally retained duplicates are often subject to dosage balance constraints for prolonged evolutionary times, which may have repercussions for the ease with which genome multiplications can engender evolutionary innovation.
  42. Del Cortona, A., Leliaert, F., Bogaert, K., Turmel, M., Boedeker, C., Janouškovec, J., Lopez-Bautista, J. M., et al. (2017). The plastid genome in Cladophorales green algae is encoded by hairpin chromosomes. CURRENT BIOLOGY, 27(24), 3771–3782.
    Virtually all plastid (chloroplast) genomes are circular double-stranded DNA molecules, typically between 100 and 200 kb in size and encoding circa 80-250 genes. Exceptions to this universal plastid genome architecture are very few and include the dinoflagellates, where genes are located on DNA minicircles. Here we report on the highly deviant chloroplast genome of Cladophorales green algae, which is entirely fragmented into hairpin chromosomes. Short- and long-read high-throughput sequencing of DNA and RNA demonstrated that the chloroplast genes of Boodlea composita are encoded on 1- to 7-kb DNA contigs with an exceptionally high GC content, each containing a long inverted repeat with one or two protein-coding genes and conserved non-coding regions putatively involved in replication and/or expression. We propose that these contigs correspond to linear single-stranded DNA molecules that fold onto themselves to form hairpin chromosomes. The Boodlea chloroplast genes are highly divergent from their corresponding orthologs, and display an alternative genetic code. The origin of this highly deviant chloroplast genome most likely occurred before the emergence of the Cladophorales, and coincided with an elevated transfer of chloroplast genes to the nucleus. A chloroplast genome that is composed only of linear DNA molecules is unprecedented among eukaryotes, and highlights unexpected variation in plastid genome architecture.
  43. Vaneechoutte, D., Estrada, A. R., Lin, Y.-C., Loraine, A. E., & Vandepoele, K. (2017). Genome-wide characterization of differential transcript usage in Arabidopsis thaliana. PLANT JOURNAL, 92(6), 1218–1231.
    Alternative splicing and the usage of alternate transcription start- or stop sites allow a single gene to produce multiple transcript isoforms. Most plant genes express certain isoforms at a significantly higher level than others, but under specific conditions this expression dominance can change, resulting in a different set of dominant isoforms. These events of differential transcript usage (DTU) have been observed for thousands of Arabidopsis thaliana, Zea mays and Vitis vinifera genes, and have been linked to development and stress response. However, neither the characteristics of these genes, nor the implications of DTU on their protein coding sequences or functions, are currently well understood. Here we present a dataset of isoform dominance and DTU for all genes in the AtRTD2 reference transcriptome based on a protocol that was benchmarked on simulated data and validated through comparison with a published reverse transcriptase-polymerase chain reaction panel. We report DTU events for 8148 genes across 206 public RNA-Seq samples, and find that protein sequences are affected in 22% of the cases. The observed DTU events show high consistency across replicates, and reveal reproducible patterns in response to treatment and development. We also demonstrate that genes with different evolutionary ages, expression breadths and functions show large differences in the frequency at which they undergo DTU, and in the effect that these events have on their protein sequences. Finally, we showcase how the generated dataset can be used to explore DTU events for genes of interest or to find genes with specific DTU in samples of interest.
  44. Swings, T., Weytjens, B., Schalck, T., Bonte, C., Verstraeten, N., Michiels, J., & Marchal, K. (2017). Network-based identification of adaptive pathways in evolved ethanol-tolerant bacterial populations. MOLECULAR BIOLOGY AND EVOLUTION, 34(11), 2927–2943.
    Efficient production of ethanol for use as a renewable fuel requires organisms with a high level of ethanol tolerance. However, this trait is complex and increased tolerance therefore requires mutations in multiple genes and pathways. Here, we use experimental evolution for a system-level analysis of adaptation of Escherichia coli to high ethanol stress. As adaptation to extreme stress often results in complex mutational data sets consisting of both causal and noncausal passenger mutations, identifying the true adaptive mutations in these settings is not trivial. Therefore, we developed a novel method named IAMBEE (Identification of Adaptive Mutations in Bacterial Evolution Experiments). IAMBEE exploits the temporal profile of the acquisition of mutations during evolution in combination with the functional implications of each mutation at the protein level. These data are mapped to a genome-wide interaction network to search for adaptive mutations at the level of pathways. The 16 evolved populations in our data set together harbored 2,286 mutated genes with 4,470 unique mutations. Analysis by IAMBEE significantly reduced this number and resulted in identification of 90 mutated genes and 345 unique mutations that are most likely to be adaptive. Moreover, IAMBEE not only enabled the identification of previously known pathways involved in ethanol tolerance, but also identified novel systems such as the AcrAB-TolC efflux pump and fatty acid biosynthesis, and even provided insight into the temporal profile of adaptation to ethanol stress. Furthermore, this method offers a solid framework for identifying the molecular underpinnings of other complex traits as well.
  45. Wuyts, V., Mattheus, W., Roosens, N. H., Marchal, K., Bertrand, S., & De Keersmaecker, S. C. (2017). Molecular subtyping of Salmonella Typhimurium with multiplex oligonucleotide ligation-PCR (MOL-PCR). In K. A. Bishop-Lilly (Ed.), Diagnostic bacteriology : methods and protocols (Vol. 1616, pp. 39–69). New York, NY, USA: Springer Humana Press.
    A multiplex oligonucleotide ligation-PCR (MOL-PCR) assay is a valuable high-throughput technique for the detection of bacteria and viruses, for characterization of pathogens and for diagnosis of genetic diseases, as it allows one to combine different types of molecular markers in a high-throughput multiplex assay. A MOL-PCR assay starts with a multiplex oligonucleotide ligation reaction for detection of the molecular marker, followed by a singleplex PCR for signal amplification and analysis of the MOL-PCR products on a Luminex platform. This last step occurs through a liquid bead suspension array in which the MOL-PCR products are hybridized to MagPlex-TAG beads. In this chapter, we describe the complete procedure for a MOL-PCR assay for subtyping of Salmonella enterica subsp. enterica serovar Typhimurium (S. Typhimurium) and its monophasic variant S. 1,4[5],12:i:- from DNA isolation through heat lysis up to data interpretation through a Gödel Prime Product. The subtyping assay consists of 50 discriminative molecular markers and two internal positive control markers divided over three MOL-PCR assays.
  46. Wingfield, B. D., Berger, D. K., Steenkamp, E. T., Lim, H.-J., Duong, T. A., Bluhm, B. H., De Beer, Z. W., et al. (2017). Draft genome of Cercospora zeina, Fusarium pininemorale, Hawksworthiomyces lignivorus, Huntiella decipiens and Ophiostoma ips. IMA FUNGUS, 8(2), 385–396.
    The genomes of Cercospora zeina, Fusarium pininemorale, Hawksworthiomyces lignivorus, Huntiella decipiens, and Ophiostoma ips are presented in this genome announcement. Three of these genomes are from plant pathogens and otherwise economically important fungal species. Fusarium pininemorale and H. decipiens are not known to cause significant disease but are closely related to species of economic importance. The genome sizes range from 25.99 Mb in the case of O. ips to 4.82 Mb for H. lignivorus. These genomes include the first reports of a genome from the genus Hawksworthiomyces. The availability of these genome data will allow the resolution of longstanding questions regarding the taxonomy of these species. In addition, these genome sequences, through comparative studies with closely related organisms, will increase our understanding of how these species or close relatives cause disease.
  47. Orr, Russell JS, Rombauts, S., Van de Peer, Y., & Shalchian-Tabrizi, K. (2017). Draft genome sequences of two unclassified Chitinophagaceae bacteria, IBVUCB1 and IBVUCB2, isolated from environmental samples. GENOME ANNOUNCEMENTS, 5(33).
    We report here the draft genome sequences of two Chitinophagaceae bacteria, IBVUCB1 and IBVUCB2, assembled from metagenomes of surface samples from freshwater lakes. The genomes are >99% complete and may represent new genera within the Chitinophagaceae family, indicating a larger diversity than currently identified.
  48. Babiychuk, E., Trinh, H. K., Vandepoele, K., Van De Slijke, E., Geelen, D., De Jaeger, G., Obokata, J., et al. (2017). The mutation nrpb1-A325V in the largest subunit of RNA polymerase II suppresses compromised growth of Arabidopsis plants deficient in a function of the general transcription factor IIF. PLANT JOURNAL, 89(4), 730–745.
    The evolutionarily conserved 12-subunit RNA polymerase II (Pol II) is a central catalytic component that drives RNA synthesis during the transcription cycle that consists of transcription initiation, elongation, and termination. A diverse set of general transcription factors, including a multifunctional TFIIF, govern Pol II selectivity, kinetic properties, and transcription coupling with posttranscriptional processes. Here, we show that TFIIF of Arabidopsis (Arabidopsis thaliana) resembles the metazoan complex that is composed of the TFIIFα and TFIIFβ polypeptides. Arabidopsis has two TFIIFβ subunits, of which TFIIFβ1/MAN1 is essential and TFIIFβ2/MAN2 is not. In the partial loss-of-function mutant allele man1-1, the winged helix domain of Arabidopsis TFIIFβ1/MAN1 was dispensable for plant viability, whereas the cellular organization of the shoot and root apical meristems was abnormal. Forward genetic screening identified an epistatic interaction between the largest Pol II subunit nrpb1-A325V variant and the man1-1 mutation. The suppression of the man1-1 mutant developmental defects by a mutation in Pol II suggests a link between TFIIF functions in the Arabidopsis transcription cycle and the maintenance of cellular organization in the shoot and root apical meristems.
  49. Christie, N., Myburg, A. A., Joubert, F., Murray, S. L., Carstens, M., Lin, Y.-C., Meyer, J., et al. (2017). Systems genetics reveals a transcriptional network associated with susceptibility in the maize-grey leaf spot pathosystem. PLANT JOURNAL, 89(4), 746–763.
    We used a systems genetics approach to elucidate the molecular mechanisms of the responses of maize to grey leaf spot (GLS) disease caused by Cercospora zeina, a threat to maize production globally. Expression analysis of earleaf samples in a subtropical maize recombinant inbred line population (CML444 x SC Malawi) subjected in the field to C. zeina infection allowed detection of 20206 expression quantitative trait loci (eQTLs). Four trans-eQTL hotspots coincided with GLS disease QTLs mapped in the same field experiment. Co-expression network analysis identified three expression modules correlated with GLS disease scores. The module (GY-s) most highly correlated with susceptibility (r=0.71; 179 genes) was enriched for the glyoxylate pathway, lipid metabolism, diterpenoid biosynthesis and responses to pathogen molecules such as chitin. The GY-s module was enriched for genes with trans-eQTLs in hotspots on chromosomes 9 and 10, which also coincided with phenotypic QTLs for susceptibility to GLS. This transcriptional network has significant overlap with the GLS susceptibility response of maize line B73, and may reflect pathogen manipulation for nutrient acquisition and/or unsuccessful defence responses, such as kauralexin production by the diterpenoid biosynthesis pathway. The co-expression module that correlated best with resistance (TQ-r; 1498 genes) was enriched for genes with trans-eQTLs in hotspots coinciding with GLS resistance QTLs on chromosome 9. Jasmonate responses were implicated in resistance to GLS through co-expression of COI1 and enrichment of genes with the Gene Ontology term 'cullin-RING ubiquitin ligase complex' in the TQ-r module. Consistent with this, JAZ repressor expression was highly correlated with the severity of GLS disease in the GY-s susceptibility network.
  50. Van de Velde, Jan, Van Bel, M., Vaneechoutte, D., & Vandepoele, K. (2016). A collection of conserved noncoding sequences to study gene regulation in flowering plants. PLANT PHYSIOLOGY, 171(4), 2586–2598.
    Transcription factors (TFs) regulate gene expression by binding cis-regulatory elements, of which the identification remains an ongoing challenge owing to the prevalence of large numbers of nonfunctional TF binding sites. Powerful comparative genomics methods, such as phylogenetic footprinting, can be used for the detection of conserved noncoding sequences (CNSs), which are functionally constrained and can greatly help in reducing the number of false-positive elements. In this study, we applied a phylogenetic footprinting approach for the identification of CNSs in 10 dicot plants, yielding 1,032,291 CNSs associated with 243,187 genes. To annotate CNSs with TF binding sites, we made use of binding site information for 642 TFs originating from 35 TF families in Arabidopsis (Arabidopsis thaliana). In three species, the identified CNSs were evaluated using TF chromatin immunoprecipitation sequencing data, resulting in significant overlap for the majority of data sets. To identify ultraconserved CNSs, we included genomes of additional plant families and identified 715 binding sites for 501 genes conserved in dicots, monocots, mosses, and green algae. Additionally, we found that genes that are part of conserved mini-regulons have a higher coherence in their expression profile than other divergent gene pairs. All identified CNSs were integrated in the PLAZA 3.0 Dicots comparative genomics platform together with new functionalities facilitating the exploration of conserved cis-regulatory elements and their associated genes. The availability of this data set in a user-friendly platform enables the exploration of functional noncoding DNA to study gene regulation in a variety of plant species, including crops.
  51. Perazzolli, M., Herrero, N., Sterck, L., Lenzi, L., Pellegrini, A., Puopolo, G., Van de Peer, Y., et al. (2016). Transcriptomic responses of a simplified soil microcosm to a plant pathogen and its biocontrol agent reveal a complex reaction to harsh habitat. BMC GENOMICS, 17.
    Background: Soil microorganisms are key determinants of soil fertility and plant health. Soil phytopathogenic fungi are one of the most important causes of crop losses worldwide. Microbial biocontrol agents have been extensively studied as alternatives for controlling phytopathogenic soil microorganisms, but molecular interactions between them have mainly been characterised in dual cultures, without taking into account the soil microbial community. We used an RNA sequencing approach to elucidate the molecular interplay of a soil microbial community in response to a plant pathogen and its biocontrol agent, in order to examine the molecular patterns activated by the microorganisms. Results: A simplified soil microcosm containing 11 soil microorganisms was incubated with a plant root pathogen (Armillaria mellea) and its biocontrol agent (Trichoderma atroviride) for 24 h under controlled conditions. More than 46 million paired-end reads were obtained for each replicate and 28,309 differentially expressed genes were identified in total. Pathway analysis revealed complex adaptations of soil microorganisms to the harsh conditions of the soil matrix and to reciprocal microbial competition/cooperation relationships. Both the phytopathogen and its biocontrol agent were specifically recognised by the simplified soil microcosm: defence reaction mechanisms and neutral adaptation processes were activated in response to competitive (T. atroviride) or non-competitive (A. mellea) microorganisms, respectively. Moreover, activation of resistance mechanisms dominated in the simplified soil microcosm in the presence of both A. mellea and T. atroviride. Biocontrol processes of T. atroviride were already activated during incubation in the simplified soil microcosm, possibly to occupy niches in a competitive ecosystem, and they were not further enhanced by the introduction of A. mellea. 
    Conclusions: This work represents an additional step towards understanding molecular interactions between plant pathogens and biocontrol agents within a soil ecosystem. Global transcriptional analysis of the simplified soil microcosm revealed complex metabolic adaptation in the soil environment and specific responses to antagonistic or neutral intruders.
  52. Bolton, M. D., Ebert, M. K., Faino, L., Rivera-Varas, V., de Jonge, R., Van de Peer, Y., Thomma, B. P., et al. (2016). RNA-sequencing of Cercospora beticola DMI-sensitive and -resistant isolates after treatment with tetraconazole identifies common and contrasting pathway induction. FUNGAL GENETICS AND BIOLOGY, 92, 1–13.
    Cercospora beticola causes Cercospora leaf spot of sugar beet. Cercospora leaf spot management measures often include application of the sterol demethylation inhibitor (DMI) class of fungicides. The reliance on DMIs and the consequent selection pressures imposed by their widespread use have led to the emergence of resistance in C. beticola populations. Insight into the molecular basis of tetraconazole resistance may lead to molecular tools to identify DMI-resistant strains for fungicide resistance management programs. Previous work has shown that expression of the gene encoding the DMI target enzyme (CYP51) is generally higher and inducible in DMI-resistant C. beticola field strains. In this study, we extended the molecular basis of DMI resistance in this pathosystem by profiling the transcriptional response of two C. beticola strains contrasting for resistance to tetraconazole. A majority of the genes in the ergosterol biosynthesis pathway were induced to similar levels in both strains with the exception of CbCyp51, which was induced several-fold higher in the DMI-resistant strain. In contrast, a secondary metabolite gene cluster was induced in the resistant strain, but repressed in the sensitive strain. Genes encoding proteins involved in various cell membrane fortification processes were induced in the resistant strain. Site-directed and ectopic mutants of candidate DMI-resistance genes all resulted in significantly higher EC50 values than the wild type strain, suggesting that the cell wall and/or membrane modified as a result of the transformation process increased resistance to tetraconazole. Taken together, this study identifies important cell membrane components and provides insight into the molecular events underlying DMI resistance in C. beticola.
  53. Van de Peer, Y., & Pires, J. C. (2016). Editorial overview: Genome studies and molecular genetics : of plant genes, genomes, and genomics. CURRENT OPINION IN PLANT BIOLOGY.
  54. Li, Zhen, Defoort, J., Tasdighian, S., Maere, S., Van de Peer, Y., & De Smet, R. (2016). Gene duplicability of core genes is highly consistent across all angiosperms. PLANT CELL, 28(2), 326–344.
    Gene duplication is an important mechanism for adding to genomic novelty. Hence, which genes undergo duplication and are preserved following duplication is an important question. It has been observed that gene duplicability, or the ability of genes to be retained following duplication, is a nonrandom process, with certain genes being more amenable to survive duplication events than others. Primarily, gene essentiality and the type of duplication (small-scale versus large-scale) have been shown in different species to influence the (long-term) survival of novel genes. However, an overarching view of "gene duplicability" is lacking, mainly due to the fact that previous studies usually focused on individual species and did not account for the influence of genomic context and the time of duplication. Here, we present a large-scale study in which we investigated duplicate retention for 9178 gene families shared between 37 flowering plant species, referred to as angiosperm core gene families. For most gene families, we observe a strikingly consistent pattern of gene duplicability across species, with gene families being either primarily single-copy or multicopy in all species. An intermediate class contains gene families that are often retained in duplicate for periods extending to tens of millions of years after whole-genome duplication, but ultimately appear to be largely restored to singleton status, suggesting that these genes may be dosage balance sensitive. The distinction between single-copy and multicopy gene families is reflected in their functional annotation, with single-copy genes being mainly involved in the maintenance of genome stability and organelle function and multicopy genes in signaling, transport, and metabolism. The intermediate class was overrepresented in regulatory genes, further suggesting that these represent putative dosage-balance-sensitive genes.
  55. Lohaus, R., & Van de Peer, Y. (2016). Of dups and dinos : evolution at the K/Pg boundary. (Y. Van de Peer & J. C. Pires, Eds.) CURRENT OPINION IN PLANT BIOLOGY, 30, 62–69.
    Fifteen years into sequencing entire plant genomes, more than 30 paleopolyploidy events could be mapped on the tree of flowering plants (and many more when also transcriptome data sets are considered). While some genome duplications are very old and have occurred early in the evolution of dicots and monocots, or even before, others are more recent and seem to have occurred independently in many different plant lineages. Strikingly, a majority of these duplications date somewhere between 55 and 75 million years ago (mya), and thus likely correlate with the K/Pg boundary. If true, this would suggest that plants that had their genome duplicated at that time, had an increased chance to survive the most recent mass extinction event, at 66 mya, which wiped out a majority of plant and animal life, including all non-avian dinosaurs. Here, we review several processes, both neutral and adaptive, that might explain the establishment of polyploid plants, following the K/Pg mass extinction.
  56. Miclotte, G., Heydari, M., Demeester, P., Rombauts, S., Van de Peer, Y., Audenaert, P., & Fostier, J. (2016). Jabba: hybrid error correction for long sequencing reads. ALGORITHMS FOR MOLECULAR BIOLOGY, 11, 10.
    Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.
  57. Xie, Q., Tzfadia, O., Levy, M., Weithorn, E., Peled-Zehavi, H., Van Parys, T., Van de Peer, Y., et al. (2016). hfAIM: a reliable bioinformatics approach for in silico genome-wide identification of autophagy-associated Atg8-interacting motifs in various organisms. AUTOPHAGY, 12(5), 876–887.
    Most of the proteins that are specifically turned over by selective autophagy are recognized by the presence of short Atg8 interacting motifs (AIMs) that facilitate their association with the autophagy apparatus. Such AIMs can be identified by bioinformatics methods based on their defined degenerate consensus F/W/Y-X-X-L/I/V sequences in which X represents any amino acid. Achieving reliability and/or fidelity of the prediction of such AIMs on a genome-wide scale represents a major challenge. Here, we present a bioinformatics approach, high fidelity AIM (hfAIM), which uses additional sequence requirements (the presence of acidic amino acids and the absence of positively charged amino acids in certain positions) to reliably identify AIMs in proteins. We demonstrate that the use of the hfAIM method allows for in silico high fidelity prediction of AIMs in AIM-containing proteins (ACPs) on a genome-wide scale in various organisms. Furthermore, by using hfAIM to identify putative AIMs in the Arabidopsis proteome, we illustrate a potential contribution of selective autophagy to various biological processes. More specifically, we identified 9 peroxisomal PEX proteins that contain hfAIM motifs, among which AtPEX1, AtPEX6 and AtPEX10 possess evolutionary-conserved AIMs. Bimolecular fluorescence complementation (BiFC) results verified that AtPEX6 and AtPEX10 indeed interact with Atg8 in planta. In addition, we show that mutations occurring within or nearby hfAIMs in PEX1, PEX6 and PEX10 caused defects in the growth and development of various organisms. Taken together, the above results suggest that the hfAIM tool can be used to effectively perform genome-wide in silico screens of proteins that are potentially regulated by selective autophagy. The hfAIM system is available as a web tool.
  58. Kaewphan, S., Van Landeghem, S., Ohta, T., Van de Peer, Y., Ginter, F., & Pyysalo, S. (2016). Cell line name recognition in support of the identification of synthetic lethality in cancer from text. BIOINFORMATICS, 32(2), 276–282.
    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
  59. Tzfadia, O., Diels, T., De Meyer, S., Vandepoele, K., Aharoni, A., & Van de Peer, Y. (2016). CoExpNetViz: comparative co-expression networks construction and visualization tool. FRONTIERS IN PLANT SCIENCE, 6.
    Motivation: Comparative transcriptomics is a common approach in functional gene discovery efforts. It allows for finding conserved co-expression patterns between orthologous genes in closely related plant species, suggesting that these genes potentially share similar function and regulation. Several efficient co-expression-based tools have been commonly used in plant research but most of these pipelines are limited to data from model systems, which greatly limits their utility. Moreover, none of the existing pipelines allows plant researchers to make use of their own unpublished gene expression data to perform a comparative co-expression analysis and generate multi-species co-expression networks. Results: We introduce CoExpNetViz, a computational tool that uses a set of query or "bait" genes as an input (chosen by the user) and a minimum of one pre-processed gene expression dataset. The CoExpNetViz algorithm proceeds in three main steps: (i) for every bait gene submitted, co-expression values are calculated using mutual information and Pearson correlation coefficients, (ii) non-bait (or target) genes are grouped based on cross-species orthology, and (iii) output files are generated and results can be visualized as network graphs in Cytoscape. Availability: The CoExpNetViz tool is freely available both as a PHP web server (implemented in C++) and as a Cytoscape plugin (implemented in Java). Both versions of the CoExpNetViz tool support Linux and Windows platforms.
  60. Van Landeghem, S., Van Parys, T., Dubois, M., Inzé, D., & Van de Peer, Y. (2016). Diffany: an ontology-driven framework to infer, visualise and analyse differential molecular networks. BMC BIOINFORMATICS, 17.
    Background: Differential networks have recently been introduced as a powerful way to study the dynamic rewiring capabilities of an interactome in response to changing environmental conditions or stimuli. Currently, such differential networks are generated and visualised using ad hoc methods, and are often limited to the analysis of only one condition-specific response or one interaction type at a time. Results: In this work, we present a generic, ontology-driven framework to infer, visualise and analyse an arbitrary set of condition-specific responses against one reference network. To this end, we have implemented novel ontology-based algorithms that can process highly heterogeneous networks, accounting for both physical interactions and regulatory associations, symmetric and directed edges, edge weights and negation. We propose this integrative framework as a standardised methodology that allows a unified view on differential networks and promotes comparability between differential network studies. As an illustrative application, we demonstrate its usefulness on a plant abiotic stress study and we experimentally confirmed a predicted regulator. Availability: Diffany is freely available as an open-source Java library and Cytoscape plugin.
  61. Zhang, G.-Q., Xu, Q., Bian, C., Tsai, W.-C., Yeh, C.-M., Liu, K.-W., Yoshida, K., et al. (2016). The Dendrobium catenatum Lindl. genome sequence provides insights into polysaccharide synthase, floral development and adaptive evolution. SCIENTIFIC REPORTS, 6.
    Orchids make up about 10% of all seed plant species, have great economic value, and are of specific scientific interest because of their renowned flowers and ecological adaptations. Here, we report the first draft genome sequence of a lithophytic orchid, Dendrobium catenatum. We predict 28,910 protein-coding genes, and find evidence of a whole genome duplication shared with Phalaenopsis. We observed the expansion of many resistance-related genes, suggesting a powerful immune system responsible for adaptation to a wide range of ecological niches. We also discovered extensive duplication of genes involved in glucomannan synthase activities, likely related to the synthesis of medicinal polysaccharides. Expansion of MADS-box gene clades ANR1, StMADS11, and MIKC*, involved in the regulation of development and growth, suggests that these expansions are associated with the astonishing diversity of plant architecture in the genus Dendrobium. In contrast, members of the type I MADS-box gene family are missing, which might explain the loss of the endospermous seed. The findings reported here will be important for future studies into polysaccharide synthesis, adaptations to diverse environments and flower architecture of Orchidaceae.
  62. Goeminne, L., Gevaert, K., & Clement, L. (2016). Peptide-level robust ridge regression improves estimation, sensitivity, and specificity in data-dependent quantitative label-free shotgun proteomics. MOLECULAR & CELLULAR PROTEOMICS, 15(2), 657–668.
    Peptide intensities from mass spectra are increasingly used for relative quantitation of proteins in complex samples. However, numerous issues inherent to the mass spectrometry workflow turn quantitative proteomic data analysis into a crucial challenge. We and others have shown that modeling at the peptide level outperforms classical summarization-based approaches, which typically also discard a lot of proteins at the data preprocessing step. Peptide-based linear regression models, however, still suffer from unbalanced datasets due to missing peptide intensities, outlying peptide intensities and overfitting. Here, we further improve upon peptide-based models by three modular extensions: ridge regression, improved variance estimation by borrowing information across proteins with empirical Bayes and M-estimation with Huber weights. We illustrate our method on the CPTAC spike-in study and on a study comparing wild-type and ArgP knock-out Francisella tularensis proteomes. We show that the fold change estimates of our robust approach are more precise and more accurate than those from state-of-the-art summarization-based methods and peptide-based regression models, which leads to an improved sensitivity and specificity. We also demonstrate that ionization competition effects come already into play at very low spike-in concentrations and confirm that analyses with peptide-based regression methods on peptide intensity values aggregated by charge state and modification status (e.g. MaxQuant’s peptides.txt file) are slightly superior to analyses on raw peptide intensity values (e.g. MaxQuant’s evidence.txt file).
  63. Yao, Yao, Marchal, K., & Van de Peer, Y. (2016). Adaptive self-organizing organisms using a bio-inspired gene regulatory network controller: for the aggregation of evolutionary robots under a changing environment. In Ying Tan (Ed.), Handbook of research on design, control and modeling of swarm robotics (pp. 68–82). Hershey, PA, USA: IGI Global.
    This work has explored the adaptive potential of simulated swarm robots that contain a genomic encoding of a bio-inspired gene regulatory network (GRN). An artificial genome is combined with a flexible agent-based system, representing the activated part of the regulatory network that transduces environmental cues into phenotypic behavior. Using an Alife simulation framework that mimics a changing environment, we have shown that separating the static from the conditionally active part of the network contributes to a better adaptive behavior. This chapter describes the biologically inspired concept of GRNs to develop a distributed robot self-organizing approach. In particular, it shows that by using this approach, multiple swarm robots can aggregate to form a robotic organism that can adapt its configuration as a response to a dynamically changing environment. In addition, through the comparison of several different simulation experiments, the results illustrate the impact of evolutionary operators such as mutations and duplications on improving the adaptability of organisms.
  64. Wuyts, V., Roosens, N. H., Bertrand, S., Marchal, K., & De Keersmaecker, S. C. (2016). Optimized MOL-PCR for characterization of microbial pathogens. In Current protocols in cytometry (Vol. suppl. 75, pp. 13.15.1-13.15.15). New York, NY, USA: Wiley.
    Characterization of microbial pathogens is necessary for surveillance, outbreak detection, and tracing of outbreak sources. This unit describes a multiplex oligonucleotide ligation-PCR (MOL-PCR) optimized for characterization of microbial pathogens. With MOL-PCR, different types of markers, like unique sequences, single-nucleotide polymorphisms (SNPs) and indels, can be simultaneously analyzed in one assay. This assay consists of a multiplex ligation for detection of the markers, a singleplex PCR for signal amplification, and hybridization to MagPlex-TAG beads for readout on a Luminex platform after fluorescent staining. The current protocol describes the MOL-PCR, as well as methods for DNA isolation, probe design, and data interpretation and it is based on an optimized MOL-PCR assay for subtyping of Salmonella Typhimurium.
  65. De Coninck, A. (2016). High performance computing for large-scale genomic prediction. Ghent University. Faculty of Bioscience Engineering, Ghent, Belgium.
    In the past decades genetics was studied intensively leading to the knowledge that DNA is the molecule behind genetic inheritance and starting from the new millennium next-generation sequencing methods made it possible to sample this DNA with an ever decreasing cost. Animal and plant breeders have always made use of genetic information to predict agronomic performance of new breeds. While this genetic information previously was gathered from the pedigree of the population under study, genomic information of the DNA makes it possible to also deduce correlations between individuals that do not share any known ancestors leading to so-called genomic prediction of agronomic performance. Nowadays, the number of informative samples that can be taken from a genome ranges from one thousand to one million. Using all this information in a breeding context where agronomic performance is predicted and optimized for different environmental conditions is not a straightforward task. Moreover, the number of individuals for which this information is available keeps on growing and thus sophisticated computational methods are required for analyzing these large scale genomic data sets. This thesis introduces some concepts of high performance computing in a genomic prediction context and shows that analyzing phenotypic records of large numbers of genotyped individuals leads to a better prediction accuracy of the agronomic performance in different environments. Finally, it is even shown that the parts of the DNA that influence the agronomic performance under certain environmental conditions can be pinpointed, and this knowledge can thus be used by breeders to select individuals that thrive better in the targeted environment.
  66. Debyser, G., Mesuere, B., Clement, L., Van de Weygaert, J., Van Hecke, P., Duytschaever, G., Aerts, M., et al. (2016). Faecal proteomics : a tool to investigate dysbiosis and inflammation in patients with cystic fibrosis. JOURNAL OF CYSTIC FIBROSIS, 15(2), 242–250.
  67. Veeckman, E., Ruttink, T., & Vandepoele, K. (2016). Are we there yet? : reliably estimating the completeness of plant genome sequences. PLANT CELL, 28(8), 1759–1768.
    Genome sequencing is becoming cheaper and faster thanks to the introduction of next-generation sequencing techniques. Dozens of new plant genome sequences have been released in recent years, ranging from small to gigantic repeat-rich or polyploid genomes. Most genome projects have a dual purpose: delivering a contiguous, complete genome assembly and creating a full catalog of correctly predicted genes. Frequently, the completeness of a species' gene catalog is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. Large-scale population resequencing studies have revealed that gene space is fairly variable even between closely related individuals, which limits the definition of the expected gene space, and, consequently, the accuracy of estimates used to assess genome and gene space completeness. We argue that, based on the desired applications of a genome sequencing project, different completeness scores for the genome assembly and/or gene space should be determined. Using examples from several dicot and monocot genomes, we outline some pitfalls and recommendations regarding methods to estimate completeness during different steps of genome assembly and annotation.
  68. Cao, T. N. P., Greenhalgh, R., Dermauw, W., Rombauts, S., Bajda-Wybouw, S., Zhurov, V., … Clark, R. M. (2016). Complex evolutionary dynamics of massively expanded chemosensory receptor families in an extreme generalist chelicerate herbivore. GENOME BIOLOGY AND EVOLUTION, 8(11), 3323–3339.
    While mechanisms to detoxify plant produced, anti-herbivore compounds have been associated with plant host use by herbivores, less is known about the role of chemosensory perception in their life histories. This is especially true for generalists, including chelicerate herbivores that evolved herbivory independently from the more studied insect lineages. To shed light on chemosensory perception in a generalist herbivore, we characterized the chemosensory receptors (CRs) of the chelicerate two-spotted spider mite, Tetranychus urticae, an extreme generalist. Strikingly, T. urticae has more CRs than reported in any other arthropod to date. Including pseudogenes, 689 gustatory receptors were identified, as were 136 degenerin/Epithelial Na+ Channels (ENaCs) that have also been implicated as CRs in insects. The genomic distribution of T. urticae gustatory receptors indicates recurring bursts of lineage-specific proliferations, with the extent of receptor clusters reminiscent of those observed in the CR-rich genomes of vertebrates or C. elegans. Although pseudogenization of many gustatory receptors within clusters suggests relaxed selection, a subset of receptors is expressed. Consistent with functions as CRs, the genomic distribution and expression of ENaCs in lineage-specific T. urticae expansions mirrors that observed for gustatory receptors. The expansion of ENaCs in T. urticae to > 3-fold that reported in other animals was unexpected, raising the possibility that ENaCs in T. urticae have been co-opted to fulfill a major role performed by unrelated CRs in other animals. More broadly, our findings suggest an elaborate role for chemosensory perception in generalist herbivores that are of key ecological and agricultural importance.
  69. Yao, Y., Storme, V., Marchal, K., & Van de Peer, Y. (2016). Emergent adaptive behaviour of GRN-controlled simulated robots in a changing environment. PEERJ, 4.
    We developed a bio-inspired robot controller combining an artificial genome with an agent-based control system. The genome encodes a gene regulatory network (GRN) that is switched on by environmental cues and, following the rules of transcriptional regulation, provides output signals to actuators. Whereas the genome represents the full encoding of the transcriptional network, the agent-based system mimics the active regulatory network and signal transduction system also present in naturally occurring biological systems. Using such a design that separates the static from the conditionally active part of the gene regulatory network contributes to a better general adaptive behaviour. Here, we have explored the potential of our platform with respect to the evolution of adaptive behaviour, such as preying when food becomes scarce, in a complex and changing environment and show through simulations of swarm robots in an A-life environment that evolution of collective behaviour likely can be attributed to bio-inspired evolutionary processes acting at different levels, from the gene and the genome to the individual robot and robot population.
  70. Van Leene, J., Blomme, J., Kulkarni, S. R., Cannoot, B., De Winne, N., Eeckhout, D., Persiau, G., et al. (2016). Functional characterization of the Arabidopsis transcription factor bZIP29 reveals its role in leaf and root development. JOURNAL OF EXPERIMENTAL BOTANY, 67(19), 5825–5840.
    Plant bZIP group I transcription factors have been reported mainly for their role during vascular development and osmosensory responses. Interestingly, bZIP29 has been identified in a cell cycle interactome, indicating additional functions of bZIP29 in plant development. Here, bZIP29 was functionally characterized to study its role during plant development. It is not present in vascular tissue but is specifically expressed in proliferative tissues. Genome-wide mapping of bZIP29 target genes confirmed its role in stress and osmosensory responses, but also identified specific binding to several core cell cycle genes and to genes involved in cell wall organization. bZIP29 protein complex analyses validated interaction with other bZIP group I members and provided insight into regulatory mechanisms acting on bZIP dimers. In agreement with bZIP29 expression in proliferative tissues and with its binding to promoters of cell cycle regulators, dominant-negative repression of bZIP29 altered the cell number in leaves and in the root meristem. A transcriptome analysis on the root meristem, however, indicated that bZIP29 might regulate cell number through control of cell wall organization. Finally, ectopic dominant-negative repression of bZIP29 and redundant factors led to a seedling-lethal phenotype, pointing to essential roles for bZIP group I factors early in plant development.
  71. Kerchev, P., Waszczak, C., Lewandowska, A., Willems, P., Shapiguzov, A., Li, Z., … Van Breusegem, F. (2016). Lack of GLYCOLATE OXIDASE1, but not GLYCOLATE OXIDASE2, attenuates the photorespiratory phenotype of CATALASE2-deficient Arabidopsis. PLANT PHYSIOLOGY, 171(3), 1704–1719.
    The genes coding for the core metabolic enzymes of the photorespiratory pathway that allows plants with C3-type photosynthesis to survive in an oxygen-rich atmosphere have been largely discovered in genetic screens aimed to isolate mutants that are unviable under ambient air. As an exception, glycolate oxidase (GOX) mutants with a photorespiratory phenotype have not been described yet in C3 species. Using Arabidopsis (Arabidopsis thaliana) mutants lacking the peroxisomal CATALASE2 (cat2-2) that display stunted growth and cell death lesions under ambient air, we isolated a second-site loss-of-function mutation in GLYCOLATE OXIDASE1 (GOX1) that attenuated the photorespiratory phenotype of cat2-2. Interestingly, knocking out the nearly identical GOX2 in the cat2-2 background did not affect the photorespiratory phenotype, indicating that GOX1 and GOX2 play distinct metabolic roles. We further investigated their individual functions in single gox1-1 and gox2-1 mutants and revealed that their phenotypes can be modulated by environmental conditions that increase the metabolic flux through the photorespiratory pathway. High light negatively affected the photosynthetic performance and growth of both gox1-1 and gox2-1 mutants, but the negative consequences of severe photorespiration were more pronounced in the absence of GOX1, which was accompanied by a lesser ability to process glycolate. Taken together, our results point toward divergent functions of the two photorespiratory GOX isoforms in Arabidopsis and contribute to a better understanding of the photorespiratory pathway.
  72. Vlastaridis, P., Oliver, S. G., Van de Peer, Y., & Amoutzias, G. D. (2016). The challenges of interpreting phosphoproteomics data : a critical view through the bioinformatics lens. In C. Angelini, P. M. Rancoita, & S. Rovetta (Eds.), Lecture Notes in Computer Science (Vol. 9874, pp. 196–204). Presented at the 12th International meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2015), Cham, Switzerland: Springer.
    During the last decade, there has been great progress in high-throughput (HTP) phosphoproteomics and hundreds or even thousands of phosphorylation sites (p-sites) can now be detected in a single experiment. This success is attributable to a combination of very sensitive Mass Spectrometry instruments, better phosphopeptide enrichment techniques and bioinformatics software that are capable of detecting peptides and localizing p-sites. These new technologies have opened up a whole new level of gene regulation to be studied, with great potential for therapeutics and synthetic biology. Nevertheless, many challenges remain to be resolved; these concern the biases and noise of these proteomic technologies, the biological noise that is present, as well as the incompleteness of the current datasets. Despite these problems, the datasets published so far appear to represent a good sample of a complete phosphoproteome of some organisms and are capable of revealing their major properties.
  73. De Maeyer, D. (2016). Network-based omics data analysis. KU Leuven. Faculteit Bio-ingenieurswetenschappen, Leuven.
  74. Pulido Tamayo, S. (2016). Exploiting natural selection to study adaptive behavior. Ghent University. Faculty of Sciences ; KU Leuven. Faculty of Bioscience Engineering, Ghent ; Leuven, Belgium.
  75. Pulido Tamayo, S., Weytjens, B., De Maeyer, D., & Marchal, K. (2016). SSA-ME Detection of cancer driver genes using mutual exclusivity by small subnetwork analysis. SCIENTIFIC REPORTS, 6.
    Because of its clonal evolution a tumor rarely contains multiple genomic alterations in the same pathway as disrupting the pathway by one gene often is sufficient to confer the complete fitness advantage. As a result, many cancer driver genes display mutual exclusivity across tumors. However, searching for mutually exclusive gene sets requires analyzing all possible combinations of genes, leading to a problem which is typically too computationally complex to be solved without a stringent a priori filtering, restricting the mutations included in the analysis. To overcome this problem, we present SSA-ME, a network-based method to detect cancer driver genes based on independently scoring small subnetworks for mutual exclusivity using a reinforced learning approach. Because of the algorithmic efficiency, no stringent upfront filtering is required. Analysis of TCGA cancer datasets illustrates the added value of SSA-ME: well-known recurrently mutated but also rarely mutated drivers are prioritized. We show that using mutual exclusivity to detect cancer driver genes is complementary to state-of-the-art approaches. This framework, in which a large number of small subnetworks are being analyzed in order to solve a computationally complex problem (SSA), can be generically applied to any problem in which local neighborhoods in a network hold useful information.
  76. Van, T. L., van Leeuwen, M., Fierro Gutierrez, A. C. E., De Maeyer, D., Van den Eynden, J., Verbeke, L., De Raedt, L., et al. (2016). Simultaneous discovery of cancer subtypes and subtype features by molecular data integration. BIOINFORMATICS, 32(17), i445–i454. Presented at the 15th European conference on Computational Biology (ECCB).
    Motivation: Subtyping cancer is key to an improved and more personalized prognosis/treatment. The increasing availability of tumor related molecular data provides the opportunity to identify molecular subtypes in a data-driven way. Molecular subtypes are defined as groups of samples that have a similar molecular mechanism at the origin of the carcinogenesis. The molecular mechanisms are reflected by subtype-specific mutational and expression features. Data-driven subtyping is a complex problem as subtyping and identifying the molecular mechanisms that drive carcinogenesis are confounded problems. Many current integrative subtyping methods use global mutational and/or expression tumor profiles to group tumor samples in subtypes but do not explicitly extract the subtype-specific features. We therefore present a method that solves both tasks of subtyping and identification of subtype-specific features simultaneously. Hereto our method integrates mutational and expression data while taking into account the clonal properties of carcinogenesis. Key to our method is a formalization of the problem as a rank matrix factorization of ranked data that approaches the subtyping problem as multi-view bi-clustering. Results: We introduce a novel integrative framework to identify subtypes by combining mutational and expression features. The incomparable measurement data is integrated by transformation into ranked data and subtypes are defined as multi-view bi-clusters. We formalize the model using rank matrix factorization, resulting in the SRF algorithm. Experiments on simulated data and the TCGA breast cancer data demonstrate that SRF is able to capture subtle differences that existing methods may miss.
  77. Le, P., Makhalanyane, T. P., Guerrero, L. D., Vikram, S., Van de Peer, Y., & Cowan, D. A. (2016). Comparative metagenomic analysis reveals mechanisms for stress response in hypoliths from extreme hyperarid deserts. GENOME BIOLOGY AND EVOLUTION, 8(9), 2737–2747.
    Understanding microbial adaptation to environmental stressors is crucial for interpreting broader ecological patterns. In the most extreme hot and cold deserts, cryptic niche communities are thought to play key roles in ecosystem processes and represent excellent model systems for investigating microbial responses to environmental stressors. However, relatively little is known about the genetic diversity underlying such functional processes in climatically extreme desert systems. This study presents the first comparative metagenome analysis of cyanobacteria-dominated hypolithic communities in hot (Namib Desert, Namibia) and cold (Miers Valley, Antarctica) hyperarid deserts. The most abundant phyla in both hypolith metagenomes were Actinobacteria, Proteobacteria, Cyanobacteria and Bacteroidetes with Cyanobacteria dominating in Antarctic hypoliths. However, no significant differences between the two metagenomes were identified. The Antarctic hypolithic metagenome displayed a high number of sequences assigned to sigma factors, replication, recombination and repair, translation, ribosomal structure, and biogenesis. In contrast, the Namib Desert metagenome showed a high abundance of sequences assigned to carbohydrate transport and metabolism. Metagenome data analysis also revealed significant divergence in the genetic determinants of amino acid and nucleotide metabolism between these two metagenomes and those of soil from other polar deserts, hot deserts, and non-desert soils. Our results suggest extensive niche differentiation in hypolithic microbial communities from these two extreme environments and a high genetic capacity for survival under environmental extremes.
  78. Jelen, V., de Jonge, R., Van de Peer, Y., Javornik, B., & Jakše, J. (2016). Complete mitochondrial genome of the Verticillium-wilt causing plant pathogen Verticillium nonalfalfae. PLOS ONE, 11(2).
    Verticillium nonalfalfae is a fungal plant pathogen that causes wilt disease by colonizing the vascular tissues of host plants. The disease induced by hop isolates of V. nonalfalfae manifests in two different forms, ranging from mild symptoms to complete plant dieback, caused by mild and lethal pathotypes, respectively. Pathogenicity variations between the causal strains have been attributed to differences in genomic sequences and perhaps also to differences in their mitochondrial genomes. We used data from our recent Illumina NGS-based project of genome sequencing V. nonalfalfae to study the mitochondrial genomes of its different strains. The aim of the research was to prepare a V. nonalfalfae reference mitochondrial genome and to determine its phylogenetic placement in the fungal kingdom. The resulting 26,139 bp circular DNA molecule contains a full complement of the 14 "standard" fungal mitochondrial protein-coding genes of the electron transport chain and ATP synthase subunits, together with a small rRNA subunit, a large rRNA subunit, which contains ribosomal protein S3 encoded within a type IA-intron and 26 tRNAs. Phylogenetic analysis of this mitochondrial genome placed it in the Verticillium spp. lineage in the Glomerellales group, which is also supported by previous phylogenetic studies based on nuclear markers. The clustering with the closely related Verticillium dahliae mitochondrial genome showed a very conserved synteny and a high sequence similarity. Two distinguishing mitochondrial genome features were also found: a potential long non-coding RNA (orf414) contained only in the Verticillium spp. of the fungal kingdom, and a specific fragment length polymorphism observed only in V. dahliae and V. nubilum of all the Verticillium spp., thus showing potential as a species-specific biomarker.
  79. Olsen, J. L., Rouzé, P., Verhelst, B., Lin, Y.-C., Bayer, T., Collen, J., Dattolo, E., et al. (2016). The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. NATURE, 530(7590), 331–335.
    Seagrasses colonized the sea(1) on at least three independent occasions to form the basis of one of the most productive and widespread coastal ecosystems on the planet(2). Here we report the genome of Zostera marina (L.), the first, to our knowledge, marine angiosperm to be fully sequenced. This reveals unique insights into the genomic losses and gains involved in achieving the structural and physiological adaptations required for its marine lifestyle, arguably the most severe habitat shift ever accomplished by flowering plants. Key angiosperm innovations that were lost include the entire repertoire of stomatal genes(3), genes involved in the synthesis of terpenoids and ethylene signalling, and genes for ultraviolet protection and phytochromes for far-red sensing. Seagrasses have also regained functions enabling them to adjust to full salinity. Their cell walls contain all of the polysaccharides typical of land plants, but also contain polyanionic, low-methylated pectins and sulfated galactans, a feature shared with the cell walls of all macroalgae(4) and that is important for ion homoeostasis, nutrient uptake and O2/CO2 exchange through leaf epidermal cells. The Z. marina genome resource will markedly advance a wide range of functional ecological studies from adaptation of marine ecosystems under climate warming(5,6), to unravelling the mechanisms of osmoregulation under high salinities that may further inform our understanding of the evolution of salt tolerance in crop plants(7).
  80. De Maeyer, D., Weytjens, B., De Raedt, L., & Marchal, K. (2016). Network-based analysis of eQTL data to prioritize driver mutations. GENOME BIOLOGY AND EVOLUTION, 8(3), 481–494.
    In clonal systems, interpreting driver genes in terms of molecular networks helps understanding how these drivers elicit an adaptive phenotype. Obtaining such a network-based understanding depends on the correct identification of driver genes. In clonal systems, independent evolved lines can acquire a similar adaptive phenotype by affecting the same molecular pathways, a phenomenon referred to as parallelism at the molecular pathway level. This implies that successful driver identification depends on interpreting mutated genes in terms of molecular networks. Driver identification and obtaining a network-based understanding of the adaptive phenotype are thus confounded problems that ideally should be solved simultaneously. In this study, a network-based eQTL method is presented that solves both the driver identification and the network-based interpretation problem. As input the method uses coupled genotype-expression phenotype data (eQTL data) of independently evolved lines with similar adaptive phenotypes and an organism-specific genome-wide interaction network. The search for mutational consistency at pathway level is defined as a subnetwork inference problem, which consists of inferring a subnetwork from the genome-wide interaction network that best connects the genes containing mutations to differentially expressed genes. Based on their connectivity with the differentially expressed genes, mutated genes are prioritized as driver genes. Based on semisynthetic data and two publicly available data sets, we illustrate the potential of the network-based eQTL method to prioritize driver genes and to gain insights in the molecular mechanisms underlying an adaptive phenotype. The method is available at
  81. Hermans, K., Roberfroid, S., Thijs, I. M., Kint, G., De Coster, D., Marchal, K., Vanderleyden, J., et al. (2016). FabR regulates Salmonella biofilm formation via its direct target FabB. BMC GENOMICS, 17.
    Background: Biofilm formation is an important survival strategy of Salmonella in all environments. By mutant screening, we showed a knock-out mutant of fabR, encoding a repressor of unsaturated fatty acid biosynthesis (UFA), to have impaired biofilm formation. In order to unravel how this regulator impinges on Salmonella biofilm formation, we aimed at elucidating the S. Typhimurium FabR regulon. Hereto, we applied a combinatorial high-throughput approach, combining ChIP-chip with transcriptomics. Results: All the previously identified E. coli FabR transcriptional target genes (fabA, fabB and yqfA) were shown to be direct S. Typhimurium FabR targets as well. As we found a fabB overexpressing strain to partly mimic the biofilm defect of the fabR mutant, the effect of FabR on biofilms can be attributed at least partly to FabB, which plays a key role in UFA biosynthesis. Additionally, ChIP-chip identified a number of novel direct FabR targets (the intergenic regions between hpaR/hpaG and ddg/ydfZ) and yet putative direct targets (i.a. genes involved in tRNA metabolism, ribosome synthesis and translation). Next to UFA biosynthesis, a number of these direct targets and other indirect targets identified by transcriptomics (e.g. ribosomal genes, ompA, ompC, ompX, osmB, osmC, sseI), could possibly contribute to the effect of FabR on biofilm formation. Conclusion: Overall, our results point at the importance of FabR and UFA biosynthesis in Salmonella biofilm formation and their role as potential targets for biofilm inhibitory strategies.
  82. Pulido Tamayo, S., Duitama, J., & Marchal, K. (2016). EXPLoRA-web: linkage analysis of quantitative trait loci using bulk segregant analysis. NUCLEIC ACIDS RESEARCH, 44(W1), W142–W146.
    Identification of genomic regions associated with a phenotype of interest is a fundamental step toward solving questions in biology and improving industrial research. Bulk segregant analysis (BSA) combined with high-throughput sequencing is a technique to efficiently identify these genomic regions associated with a trait of interest. However, distinguishing true from spuriously linked genomic regions and accurately delineating the genomic positions of these truly linked regions requires the use of complex statistical models currently implemented in software tools that are generally difficult to operate for non-expert users. To facilitate the exploration and analysis of data generated by bulked segregant analysis, we present EXPLoRA-web, a web service wrapped around our previously published algorithm EXPLoRA, which exploits linkage disequilibrium to increase the power and accuracy of quantitative trait loci identification in BSA analysis. EXPLoRA-web provides a user friendly interface that enables easy data upload and parallel processing of different parameter configurations. Results are provided graphically and as BED file and/or text file and the input is expected in widely used formats, enabling straightforward BSA data analysis. The web server is available at
  83. Mushthofa, M., Schockaert, S., & De Cock, M. (2016). Computing attractors of multi-valued gene regulatory networks using fuzzy answer set programming. Computational Intelligence, IEEE World congress, Proceedings. Presented at the 2016 IEEE World congress on Computational Intelligence (WCCI 2016), New York, NY, USA: IEEE.
  84. Gerits, E., Blommaert, E., Lippell, A., O’Neill, A. J., Weytjens, B., De Maeyer, D., Fierro, A. C., et al. (2016). Elucidation of the mode of action of a new antibacterial compound active against Staphylococcus aureus and Pseudomonas aeruginosa. PLOS ONE, 11(5).
    Nosocomial and community-acquired infections caused by multidrug resistant bacteria represent a major human health problem. Thus, there is an urgent need for the development of antibiotics with new modes of action. In this study, we investigated the antibacterial characteristics and mode of action of a new antimicrobial compound, SPI031 (N-alkylated 3, 6-dihalogenocarbazol 1-(sec-butylamino)-3-(3,6-dichloro-9H-carbazol-9-yl) propan-2-ol), which was previously identified in our group. This compound exhibits broad-spectrum antibacterial activity, including activity against the human pathogens Staphylococcus aureus and Pseudomonas aeruginosa. We found that SPI031 has rapid bactericidal activity (7-log reduction within 30 min at 4x MIC) and that the frequency of resistance development against SPI031 is low. To elucidate the mode of action of SPI031, we performed a macromolecular synthesis assay, which showed that SPI031 causes non-specific inhibition of macromolecular biosynthesis pathways. Liposome leakage and membrane permeability studies revealed that SPI031 rapidly exerts membrane damage, which is likely the primary cause of its antibacterial activity. These findings were supported by a mutational analysis of SPI031-resistant mutants, a transcriptome analysis and the identification of transposon mutants with altered sensitivity to the compound. In conclusion, our results show that SPI031 exerts its antimicrobial activity by causing membrane damage, making it an interesting starting point for the development of new antibacterial therapies.
  85. Veeckman, E., Vandepoele, K., Asp, T., Roldàn-Ruiz, I., & Ruttink, T. (2016). Genomic variation in the FT gene family of perennial ryegrass (Lolium perenne). In I. Roldàn-Ruiz, J. Baert, & D. Reheul (Eds.), Breeding in a world of scarcity : proceedings of the 2015 meeting of the section “Forage Crops and Amenity Grasses” of Eucarpia (pp. 121–126). Presented at the 31st Symposium of Eucarpia’s “Forage Crops and Amenity Grasses” Section, Cham, Switzerland: Springer.
    The timing of flowering is of prime importance for several agronomic traits, and its genetic control is therefore of great interest to breeders. Several signaling pathways converge on FLOWERING LOCUS T (FT) gene family members, which act as central regulators of flowering, branching and seed dormancy. We identified the complete FT gene family in the Lolium perenne genome and performed phylogenetic analysis to delineate functional clades and to identify putative functionally redundant paralogs. Five FT genes of L. perenne were selected for targeted resequencing in a genepool of 746 accessions to describe genetic diversity in wild accessions, commercial cultivars and breeding material.
  86. Vandermarliere, E., Maddelein, D., Hulstaert, N., Stes, E., Di Michele, M., Gevaert, K., … Martens, L. (2015). PepShell : visualization of conformational proteomics data. JOURNAL OF PROTEOME RESEARCH, 14(4), 1987–1990.
    Proteins are dynamic molecules; they undergo crucial conformational changes induced by post-translational modifications and by binding of cofactors or other molecules. The characterization of these conformational changes and their relation to protein function is a central goal of structural biology. Unfortunately, most conventional methods to obtain structural information do not provide information on protein dynamics. Therefore, mass spectrometry-based approaches, such as limited proteolysis, hydrogen-deuterium exchange, and stable-isotope labeling, are frequently used to characterize protein conformation and dynamics, yet the interpretation of these data can be cumbersome and time consuming. Here, we present PepShell, a tool that allows interactive data analysis of mass spectrometry-based conformational proteomics studies by visualization of the identified peptides both at the sequence and structure levels. Moreover, PepShell allows the comparison of experiments under different conditions, including different proteolysis times or binding of the protein to different substrates or inhibitors.
  87. Ferreira, G. B., Vanherwegen, A.-S., Eelen, G., Gutíerrez, A. C. F., Van Lommel, L., Marchal, K., Verlinden, L., et al. (2015). Vitamin D3 induces tolerance in human dendritic cells by activation of intracellular metabolic pathways. CELL REPORTS, 10(5), 711–725.
    Metabolic switches in various immune cell subsets enforce phenotype and function. In the present study, we demonstrate that the active form of vitamin D, 1,25-dihydroxyvitamin D3 (1,25(OH)2D3), induces human monocyte-derived tolerogenic dendritic cells (DC) by metabolic reprogramming. Microarray analysis demonstrated that 1,25(OH)2D3 upregulated several genes directly related to glucose metabolism, tricarboxylic acid cycle (TCA), and oxidative phosphorylation (OXPHOS). Although OXPHOS was promoted by 1,25(OH)2D3, hypoxia did not change the tolerogenic function of 1,25(OH)2D3-treated DCs. Instead, glucose availability and glycolysis, controlled by the PI3K/Akt/mTOR pathway, dictate the induction and maintenance of the 1,25(OH)2D3-conditioned tolerogenic DC phenotype and function. This metabolic reprogramming is unique for 1,25(OH)2D3, because the tolerogenic DC phenotype induced by other immune modulators did not depend on similar metabolic changes. We put forward that these metabolic insights in tolerogenic DC biology can be used to advance DC-based immunotherapies, influencing DC longevity and their resistance to environmental metabolic stress.
  88. Proost, S., Van Bel, M., Vaneechoutte, D., Van de Peer, Y., Inzé, D., Mueller-Roeber, B., & Vandepoele, K. (2015). PLAZA 3.0 : an access point for plant comparative genomics. NUCLEIC ACIDS RESEARCH, 43(D1), D974–D981.
    Comparative sequence analysis has significantly altered our view on the complexity of genome organization and gene functions in different kingdoms. PLAZA 3.0 is designed to make comparative genomics data for plants available through a user-friendly web interface. Structural and functional annotation, gene families, protein domains, phylogenetic trees and detailed information about genome organization can easily be queried and visualized. Compared with the first version released in 2009, which featured nine organisms, the number of integrated genomes is more than four times higher, and now covers 37 plant species. The new species provide a wider phylogenetic range as well as a more in-depth sampling of specific clades, and genomes of additional crop species are present. The functional annotation has been expanded and now comprises data from Gene Ontology, MapMan, UniProtKB/Swiss-Prot, PlnTFDB and PlantTFDB. Furthermore, we improved the algorithms to transfer functional annotation from well-characterized plant genomes to other species. The additional data and new features make PLAZA 3.0 a versatile and comprehensible resource for users wanting to explore genome information to study different aspects of plant biology, both in model and non-model organisms.
  89. Szakonyi, D., Van Landeghem, S., Baerenfaller, K., Baeyens, L., Blomme, J., Casanova-Sáez, R., De Bodt, S., et al. (2015). The KnownLeaf literature curation system captures knowledge about Arabidopsis leaf growth and development and facilitates integrated data mining. CURRENT PLANT BIOLOGY, 2, 1–11.
    The information that connects genotypes and phenotypes is essentially embedded in research articles written in natural language. To facilitate access to this knowledge, we constructed a framework for the curation of the scientific literature studying the molecular mechanisms that control leaf growth and development in Arabidopsis thaliana (Arabidopsis). Standard structured statements, called relations, were designed to capture diverse data types, including phenotypes and gene expression linked to genotype description, growth conditions, genetic and molecular interactions, and details about molecular entities. Relations were then annotated from the literature, defining the relevant terms according to standard biomedical ontologies. This curation process was supported by a dedicated graphical user interface, called Leaf Knowtator. A total of 283 primary research articles were curated by a community of annotators, yielding 9947 relations monitored for consistency and over 12,500 references to Arabidopsis genes. This information was converted into a relational database (KnownLeaf) and merged with other public Arabidopsis resources relative to transcriptional networks, protein–protein interaction, gene co-expression, and additional molecular annotations. Within KnownLeaf, leaf phenotype data can be searched together with molecular data originating either from this curation initiative or from external public resources. Finally, we built a network (LeafNet) with a portion of the KnownLeaf database content to graphically represent the leaf phenotype relations in a molecular context, offering an intuitive starting point for knowledge mining. Literature curation efforts such as ours provide high quality structured information accessible to computational analysis, and thereby to a wide range of applications. 
    DATA: The presented work was performed in the framework of the AGRON-OMICS project (Arabidopsis GROwth Network integrating OMICS technologies) supported by the European Commission 6th Framework Programme (Grant number LSHG-CT-2006-037704). This is a data integration and data sharing portal collecting all the major results from the consortium. All data presented in our paper are available here.
  90. Goeminne, L., Argentini, A., Martens, L., & Clement, L. (2015). Summarization vs. peptide-based models in label-free quantitative proteomics : performance, pitfalls, and data analysis guidelines. JOURNAL OF PROTEOME RESEARCH, 14(6), 2457–2465.
    Quantitative label-free mass spectrometry is increasingly used to analyze the proteomes of complex biological samples. However, the choice of appropriate data analysis methods remains a major challenge. We therefore provide a rigorous comparison between peptide-based models and peptide-summarization-based pipelines. We show that peptide-based models outperform summarization-based pipelines in terms of sensitivity, specificity, accuracy, and precision. We also demonstrate that the predefined FDR cutoffs for the detection of differentially regulated proteins can become problematic when differentially expressed (DE) proteins are highly abundant in one or more samples. Care should therefore be taken when data are interpreted from samples with spiked-in internal controls and from samples that contain a few very highly abundant proteins. We do, however, show that specific diagnostic plots can be used for assessing differentially expressed proteins and the overall quality of the obtained fold change estimates. Finally, our study also illustrates that imputation under the "missing by low abundance" assumption is beneficial for the detection of differential expression in proteins with low abundance, but it negatively affects moderately to highly abundant proteins. Hence, imputation strategies that are commonly implemented in standard proteomics software should be used with care.
  91. Masuzzo, P., Martens, L., 2014 Cell Migration workshop participants, the, Ampe, C., De Wever, O., & Van Troys, M. (2015). An open data ecosystem for cell migration research. TRENDS IN CELL BIOLOGY.
    Cell migration research has recently become both a high content and a high throughput field thanks to technological, computational, and methodological advances. Simultaneously, however, urgent bioinformatics needs regarding data management, standardization, and dissemination have emerged. To address these concerns, we propose to establish an open data ecosystem for cell migration research.
  92. Vaudel, M., Burkhart, J. M., Zahedi, R. P., Oveland, E., Berven, F. S., Sickmann, A., Martens, L., et al. (2015). PeptideShaker enables reanalysis of MS-derived proteomics data sets. NATURE BIOTECHNOLOGY, 33(1), 22–24.
  93. De Coninck, A., Kourounis, D., Verbosio, F., Schenk, O., De Baets, B., Maenhout, S., & Fostier, J. (2015). Including explicit marker-by-environment interaction for large-scale genomic prediction. In X. Draye (Ed.), COMMUNICATIONS IN AGRICULTURAL AND APPLIED BIOLOGICAL SCIENCES (Vol. 80, pp. 117–121). Presented at the 20th National symposium on Applied Biological Sciences.
    Genomic prediction for plants is heavily influenced by the environment. Not only do the environmental conditions influence the phenotypic traits directly, genetic effects may also vary across different environments. Therefore, it is essential to include marker-by-environment interactions in the linear mixed models used for analyzing the genomic data. However, when every genetic marker is coupled to every environmental covariate, the problem size grows dramatically. Luckily, information about marker-by-environment interaction is only sparsely present in data sets, since each plant is tested in a limited number of environmental conditions only. In contrast, the genotypes of plants are a dense source of information and thus including marker effects and their interaction with environment in a single-step genomic prediction setting requires the coupling of sparse and dense matrix algebra. Our implementation of this strategy uses distributed computing techniques together with an optimized library for sparse matrix manipulations (PARDISO) to efficiently use a high performance computing cluster for the analysis of large-scale data sets.
  94. Wuyts, V., Roosens, N. H., Bertrand, S., Marchal, K., & De Keersmaecker, S. C. (2015). Guidelines for optimisation of a multiplex oligonucleotide ligation-PCR for characterisation of microbial pathogens in a microsphere suspension array. BIOMED RESEARCH INTERNATIONAL.
    With multiplex oligonucleotide ligation-PCR (MOL-PCR) different molecular markers can be simultaneously analysed in a single assay and high levels of multiplexing can be achieved in high-throughput format. As such, MOL-PCR is a convenient solution for microbial detection and identification assays where many markers should be analysed, including for routine further characterisation of an identified microbial pathogenic isolate. For an assay aimed at routine use, optimisation in terms of differentiation between positive and negative results and of cost and effort is indispensable. As MOL-PCR includes a multiplex ligation step, followed by a singleplex PCR and analysis with microspheres on a Luminex device, several parameters are accessible for optimisation. Although MOL-PCR performance may be influenced by the markers used in the assay and the targeted bacterial species, evaluation of the method of DNA isolation, the probe concentration, the amount of microspheres, and the concentration of reporter dye is advisable in the development of any MOL-PCR assay. Therefore, we here describe our observations made during the optimisation of a 20-plex MOL-PCR assay for subtyping of Salmonella Typhimurium with the aim to provide a possible workflow as guidance for the development and optimisation of a MOL-PCR assay for the characterisation of other microbial pathogens.
  95. Crappé, J., Ndah, E., Koch, A., Steyaert, S., Fijałkowska, D., De Keulenaer, S., De Meester, E., et al. (2015). PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. NUCLEIC ACIDS RESEARCH, 43(5).
    An increasing number of studies integrate mRNA sequencing data into MS-based proteomics to complement the translation product search space. However, several factors, including extensive regulation of mRNA translation and the need for three- or six-frame translation, impede the use of mRNA-seq data for the construction of a protein sequence search database. With that in mind, we developed the PROTEOFORMER tool that automatically processes data of the recently developed ribosome profiling method (sequencing of ribosome-protected mRNA fragments), resulting in genome-wide visualization of ribosome occupancy. Our tool also includes a translation initiation site calling algorithm allowing the delineation of the open reading frames (ORFs) of all translation products. A complete protein synthesis-based sequence database can thus be compiled for mass spectrometry-based identification. This approach increases the overall protein identification rates by 3% and 11% (improved and new identifications) for human and mouse, respectively, and enables proteome-wide detection of 5'-extended proteoforms, upstream ORF translation and near-cognate translation start sites. The PROTEOFORMER tool is available as a stand-alone pipeline and has been implemented in the Galaxy framework for ease of use.
  96. Cai, J., Liu, X., Vanneste, K., Proost, S., Tsai, W.-C., Liu, K.-W., … Liu, Z.-J. (2015). The genome sequence of the orchid Phalaenopsis equestris. NATURE GENETICS, 47(1), 65–72.
    Orchidaceae, renowned for its spectacular flowers and other reproductive and ecological adaptations, is one of the most diverse plant families. Here we present the genome sequence of the tropical epiphytic orchid Phalaenopsis equestris, a frequently used parent species for orchid breeding. P. equestris is the first plant with crassulacean acid metabolism (CAM) for which the genome has been sequenced. Our assembled genome contains 29,431 predicted protein-coding genes. We find that contigs likely to be underassembled, owing to heterozygosity, are enriched for genes that might be involved in self-incompatibility pathways. We find evidence for an orchid-specific paleopolyploidy event that preceded the radiation of most orchid clades, and our results suggest that gene duplication might have contributed to the evolution of CAM photosynthesis in P. equestris. Finally, we find expanded and diversified families of MADS-box C/D-class, B-class AP3 and AGL6-class genes, which might contribute to the highly specialized morphology of orchid flowers.
  97. Van Leene, J., Eeckhout, D., Cannoot, B., De Winne, N., Persiau, G., Van De Slijke, E., Vercruysse, L., et al. (2015). An improved toolbox to unravel the plant cellular machinery by tandem affinity purification of Arabidopsis protein complexes. NATURE PROTOCOLS, 10(1), 169–187.
    Tandem affinity purification coupled to mass spectrometry (TAP-MS) is one of the most advanced methods to characterize protein complexes in plants, giving a comprehensive view on the protein-protein interactions (PPIs) of a certain protein of interest (bait). The bait protein is fused to a double affinity tag, which consists of a protein G tag and a streptavidin-binding peptide separated by a very specific protease cleavage site, allowing highly specific protein complex isolation under near-physiological conditions. Implementation of this optimized TAP tag, combined with ultrasensitive MS, means that these experiments can be performed on small amounts (25 mg of total protein) of protein extracts from Arabidopsis cell suspension cultures. It is also possible to use this approach to isolate low abundant protein complexes from Arabidopsis seedlings, thus opening perspectives for the exploration of protein complexes in a plant developmental context. Next to protocols for efficient biomass generation of seedlings (~7.5 months), we provide detailed protocols for TAP (1 d), and for sample preparation and liquid chromatography-tandem MS (LC-MS/MS; ~5 d), either from Arabidopsis seedlings or from cell cultures. For the identification of specific co-purifying proteins, we use an extended protein database and filter against a list of nonspecific proteins on the basis of the occurrence of a co-purified protein among 543 TAP experiments. The value of the provided protocols is illustrated through numerous applications described in recent literature.
  98. Volders, P.-J., Verheggen, K., Menschaert, G., Vandepoele, K., Martens, L., Vandesompele, J., & Mestdagh, P. (2015). An update on LNCipedia : a database for annotated human lncRNA sequences. NUCLEIC ACIDS RESEARCH, 43(D1), D174–D180.
    The human genome is pervasively transcribed, producing thousands of non-coding RNA transcripts. The majority of these transcripts are long non-coding RNAs (lncRNAs) and novel lncRNA genes are being identified at rapid pace. To streamline these efforts, we created LNCipedia, an online repository of lncRNA transcripts and annotation. Here, we present LNCipedia 3.0, the latest version of the publicly available human lncRNA database. Compared to the previous version of LNCipedia, the database grew over five times in size, gaining over 90 000 new lncRNA transcripts. Assessment of the protein-coding potential of LNCipedia entries is improved with state-of-the-art methods that include large-scale reprocessing of publicly available proteomics data. As a result, a high-confidence set of lncRNA transcripts with low coding potential is defined and made available for download. In addition, a tool to assess lncRNA gene conservation between human, mouse and zebrafish has been implemented.
  99. Glibert, P., Meert, P., Van Steendam, K., Van Nieuwerburgh, F., De Coninck, D., Martens, L., Dhaenens, M., et al. (2015). Phospho-iTRAQ : assessing isobaric labels for the large-scale study of phosphopeptide stoichiometry. JOURNAL OF PROTEOME RESEARCH, 14(2), 839–849.
    The ability to distinguish between phosphopeptides of high and low stoichiometry is essential to discover the true extent of protein phosphorylation. We here extend the strategy whereby a peptide sample is briefly split in two identical parts and differentially labeled preceding the phosphatase treatment of one part. Our use of isobaric tags for relative and absolute quantitation (iTRAQ) marks the first time that isobaric tags have been applied for the large-scale analysis of phosphopeptides. Our Phospho-iTRAQ method focuses on the unmodified counterparts of phosphorylated peptides, which thus circumvents the ionization, fragmentation, and phospho-enrichment difficulties that hamper quantitation of stoichiometry in most common phosphoproteomics methods. Since iTRAQ enables multiplexing, simultaneous (phospho)proteome comparison between internal replicates and multiple samples is possible. The technique was validated on multiple instrument platforms by adding internal standards of high stoichiometry to a complex lysate of control and EGF-stimulated HeLa cells. To demonstrate the flexibility of Phospho-iTRAQ with regard to the experimental setup, the proteome coverage was extended through gel fractionation, while an internal replicate measurement created more stringent data analysis opportunities. The latest developments in MS instrumentation promise to further increase the resolution of the stoichiometric measurement of Phospho-iTRAQ in the future. The data have been deposited to the ProteomeXchange with identifier PXD001574.
  100. Verbist, B., Klambauer, G., Vervoort, L., Talloen, W., Shkedy, Z., Thas, O., Bender, A., et al. (2015). Using transcriptomics to guide lead optimization in drug discovery projects : lessons learned from the QSTAR project. DRUG DISCOVERY TODAY, 20(5), 505–513.
    The pharmaceutical industry is faced with steadily declining R&D efficiency which results in fewer drugs reaching the market despite increased investment. A major cause for this low efficiency is the failure of drug candidates in late-stage development owing to safety issues or previously undiscovered side-effects. We analyzed to what extent gene expression data can help to de-risk drug development in early phases by detecting the biological effects of compounds across disease areas, targets and scaffolds. For eight drug discovery projects within a global pharmaceutical company, gene expression data were informative and able to support go/no-go decisions. Our studies show that gene expression profiling can detect adverse effects of compounds, and is a valuable tool in early-stage drug discovery decision making.
  101. Mensaert, K., Van Criekinge, W., Thas, O., Schuuring, E., Steenbergen, R. D., Wisman, G. B. A., & De Meyer, T. (2015). Mining for viral fragments in methylation enriched sequencing data. FRONTIERS IN GENETICS, 6.
  102. Crauwels, S., Van Assche, A., de Jonge, R., Borneman, A., Verreth, C., Troels, P., De Samblanx, G., et al. (2015). Comparative phenomics and targeted use of genomics reveals variation in carbon and nitrogen assimilation among different Brettanomyces bruxellensis strains. APPLIED MICROBIOLOGY AND BIOTECHNOLOGY, 99(21), 9123–9134.
    Recent studies have suggested a correlation between genotype groups of Brettanomyces bruxellensis and their source of isolation. To further explore this relationship, the objective of this study was to assess metabolic differences in carbon and nitrogen assimilation between different B. bruxellensis strains from three beverages, including beer, wine, and soft drink, using Biolog Phenotype Microarrays. While some similarities of physiology were noted, many traits were variable among strains. Interestingly, some phenotypes were found that could be linked to strain origin, especially for the assimilation of particular alpha- and beta-glycosides as well as alpha- and beta-substituted monosaccharides. Based upon gene presence or absence, an alpha-glucosidase and beta-glucosidase were found explaining the observed phenotypes. Further, using a PCR screen on a large number of isolates, we have been able to specifically link a genomic deletion to the beer strains, suggesting that this region may have a fitness cost for B. bruxellensis in certain fermentation systems such as brewing. More specifically, none of the beer strains were found to contain a beta-glucosidase, which may have direct impacts on the ability for these strains to compete with other microbes or on flavor production.
  103. Ranade, S. S., Lin, Y.-C., Van de Peer, Y., & García-Gil, M. R. (2015). Comparative in silico analysis of SSRs in coding regions of high confidence predicted genes in Norway spruce (Picea abies) and Loblolly pine (Pinus taeda). BMC GENETICS, 16.
    Background: Microsatellites or simple sequence repeats (SSRs) are DNA sequences consisting of 1-6 bp tandem repeat motifs present in the genome. SSRs are considered to be one of the most powerful tools in genetic studies. We carried out a comparative study of perfect SSR loci belonging to class I (>= 20 bp) and class II (>= 12 and < 20 bp) types located in coding regions of high confidence genes in Picea abies and Pinus taeda. SSRLocator was used to retrieve SSRs from the full length CDS of predicted genes in both species. Results: Trimers were the most abundant motifs in class I followed by hexamers in Picea abies, while trimers and hexamers were equally abundant in Pinus taeda class I SSRs. Hexamers were most frequent within class II SSRs followed by trimers, in both species. Although the frequency of genes containing SSRs was slightly higher in Pinus taeda, SSR counts per Mbp for class I were similar in both species (P-value = 0.22), while for class II SSRs they were significantly higher in Picea abies (P-value = 0.00009). AT-rich motifs were more abundant than GC-rich motifs within class II SSRs in both species (P-values = 10^-9 and 0). With reference to class I SSRs, AT-rich and GC-rich motifs were detected with equal frequency in Pinus taeda (P-value = 0.24), while in Picea abies, GC-rich motifs were detected with higher frequency than AT-rich motifs (P-value = 0.0005). Conclusions: Our study gives a comparative overview of the genome SSR composition based on high confidence genes in the two recently sequenced and economically important conifers, and also provides information on functional molecular markers that can be applied in genetic studies in Pinus and Picea species.
  104. Delhomme, N., Sundstrom, G., Zamani, N., Lantz, H., Lin, Y.-C., Hvidsten, T. R., Hoppner, M. P., et al. (2015). Serendipitous meta-transcriptomics : the fungal community of Norway spruce (Picea abies). PLOS ONE, 10(9).
    After performing de novo transcript assembly of >1 billion RNA-Sequencing reads obtained from 22 samples of different Norway spruce (Picea abies) tissues that were not surface sterilized, we found that assembled sequences captured a mix of plant, lichen, and fungal transcripts. The latter were likely expressed by endophytic and epiphytic symbionts, indicating that these organisms were present, alive, and metabolically active. Here, we show that these serendipitously sequenced transcripts need not be considered merely as contamination, as is common, but that they provide insight into the plant's phyllosphere. Notably, we could classify these transcripts as originating predominantly from Dothideomycetes and Leotiomycetes species, with functional annotation of gene families indicating active growth and metabolism, with particular regard to glucose intake and processing, as well as gene regulation.
  105. Van den Eynden, J., Fierro Gutierrez, A. C. E., Verbeke, L., & Marchal, K. (2015). SomInaClust: detection of cancer genes based on somatic mutation patterns of inactivation and clustering. BMC BIOINFORMATICS, 16.
    Background: With the advances in high throughput technologies, increasing amounts of cancer somatic mutation data are being generated and made available. Only a small number of (driver) mutations occur in driver genes and are responsible for carcinogenesis, while the majority of (passenger) mutations do not influence tumour biology. In this study, SomInaClust is introduced, a method that accurately identifies driver genes based on their mutation pattern across tumour samples and then classifies them into oncogenes or tumour suppressor genes. Results: SomInaClust starts from the observation that oncogenes mainly contain mutations that, due to positive selection, cluster at similar positions in a gene across patient samples, whereas tumour suppressor genes contain a high number of protein-truncating mutations throughout the entire gene length. The method was shown to prioritize driver genes in 9 different solid cancers. Furthermore, it was found to be complementary to existing similar-purpose methods with the additional advantages that it has a higher sensitivity, also for rare mutations (occurring in less than 1% of all samples), and it accurately classifies candidate driver genes into putative oncogenes and tumour suppressor genes. Pathway enrichment analysis showed that the identified genes belong to known cancer signalling pathways, and that the distinction between oncogenes and tumour suppressor genes is biologically relevant. Conclusions: SomInaClust was shown to detect candidate driver genes based on somatic mutation patterns of inactivation and clustering and to distinguish oncogenes from tumour suppressor genes. The method could be used for the identification of new cancer genes or to filter mutation data for further data-integration purposes.
  106. Verstraeten, N., Knapen, W. J., Kint, C. I., Liebens, V., Van den Bergh, B., Dewachter, L., Michiels, J. E., et al. (2015). Obg and membrane depolarization are part of a microbial bet-hedging strategy that leads to antibiotic tolerance. MOLECULAR CELL, 59(1), 9–21.
    Within bacterial populations, a small fraction of persister cells is transiently capable of surviving exposure to lethal doses of antibiotics. As a bet-hedging strategy, persistence levels are determined both by stochastic induction and by environmental stimuli called responsive diversification. Little is known about the mechanisms that link the low frequency of persisters to environmental signals. Our results support a central role for the conserved GTPase Obg in determining persistence in Escherichia coli in response to nutrient starvation. Obg-mediated persistence requires the stringent response alarmone (p)ppGpp and proceeds through transcriptional control of the hokB-sokB type I toxin-antitoxin module. In individual cells, increased Obg levels induce HokB expression, which in turn results in a collapse of the membrane potential, leading to dormancy. Obg also controls persistence in Pseudomonas aeruginosa and thus constitutes a conserved regulator of antibiotic tolerance. Combined, our findings signify an important step toward unraveling shared genetic mechanisms underlying persistence.
  107. Voordeckers, K., Kominek, J., Das, A., Espinosa-Cantú, A., De Maeyer, D., Arslan, A., Van Pee, M., et al. (2015). Adaptation to high ethanol reveals complex evolutionary pathways. PLOS GENETICS, 11(11).
    Tolerance to high levels of ethanol is an ecologically and industrially relevant phenotype of microbes, but the molecular mechanisms underlying this complex trait remain largely unknown. Here, we use long-term experimental evolution of isogenic yeast populations of different initial ploidy to study adaptation to increasing levels of ethanol. Whole-genome sequencing of more than 30 evolved populations and over 100 adapted clones isolated throughout this two-year evolution experiment revealed how a complex interplay of de novo single nucleotide mutations, copy number variation, ploidy changes, mutator phenotypes, and clonal interference led to a significant increase in ethanol tolerance. Although the specific mutations differ between different evolved lineages, application of a novel computational pipeline, PheNetic, revealed that many mutations target functional modules involved in stress response, cell cycle regulation, DNA repair and respiration. Measuring the fitness effects of selected mutations introduced in non-evolved ethanol-sensitive cells revealed several adaptive mutations that had previously not been implicated in ethanol tolerance, including mutations in PRT1, VPS70 and MEX67. Interestingly, variation in VPS70 was recently identified as a QTL for ethanol tolerance in an industrial bio-ethanol strain. Taken together, our results show how, in contrast to adaptation to some other stresses, adaptation to a continuous complex and severe stress involves interplay of different evolutionary mechanisms. In addition, our study reveals functional modules involved in ethanol resistance and identifies several mutations that could help to improve the ethanol tolerance of industrial yeasts.
  108. De Maeyer, D., Weytjens, B., Renkens, J., De Raedt, L., & Marchal, K. (2015). PheNetic : network-based interpretation of molecular profiling data. NUCLEIC ACIDS RESEARCH, 43(W1), W244–W250.
    Molecular profiling experiments have become standard in current wet-lab practices. Classically, enrichment analysis has been used to identify biological functions related to these experimental results. Combining molecular profiling results with the wealth of currently available interactomics data, however, offers the opportunity to identify the molecular mechanism behind an observed molecular phenotype. In this paper, we therefore introduce 'PheNetic', a user-friendly web server for inferring a sub-network based on probabilistic logical querying. PheNetic extracts from an interactome the sub-network that best explains genes prioritized through a molecular profiling experiment. Depending on its run mode, PheNetic searches either for a regulatory mechanism that gave rise to the observed molecular phenotype or for the pathways (in)activated in the molecular phenotype. The web server provides access to a large number of interactomes, making sub-network inference readily applicable to a wide variety of organisms. The inferred sub-networks can be interactively visualized in the browser. PheNetic's method and use are illustrated using an example analysis of differential expression results of ampicillin treated Escherichia coli cells. The PheNetic web service is available online.
  109. De Witte, D., Van de Velde, J., Decap, D., Van Bel, M., Audenaert, P., Demeester, P., … Fostier, J. (2015). BLSSpeller : exhaustive comparative discovery of conserved cis-regulatory elements. BIOINFORMATICS, 31(23), 3758–3766.
    Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O. sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z. mays.
  110. Soltis, P. S., Marchant, D. B., Van de Peer, Y., & Soltis, D. E. (2015). Polyploidy and genome evolution in plants. CURRENT OPINION IN GENETICS & DEVELOPMENT, 35, 119–125.
    Plant genomes vary in size and complexity, fueled in part by processes of whole-genome duplication (WGD; polyploidy) and subsequent genome evolution. Despite repeated episodes of WGD throughout the evolutionary history of angiosperms in particular, the genomes are not uniformly large, and even plants with very small genomes carry the signatures of ancient duplication events. The processes governing the evolution of plant genomes following these ancient events are largely unknown. Here, we consider mechanisms of diploidization, evidence of genome reorganization in recently formed polyploid species, and macroevolutionary patterns of WGD in plant genomes and propose that the ongoing genomic changes observed in recent polyploids may illustrate the diploidization processes that result in ancient signatures of WGD over geological timescales.
  111. Sundell, D., Mannapperuma, C., Netotea, S., Delhomme, N., Lin, Y.-C., Sjödin, A., Van de Peer, Y., et al. (2015). The plant genome integrative explorer resource : NEW PHYTOLOGIST, 208(4), 1149–1156.
    Accessing and exploring large-scale genomics data sets remains a significant challenge to researchers without specialist bioinformatics training. We present the integrated platform for exploration of Populus, conifer and Arabidopsis genomics data, which includes expression networks and associated visualization tools. Standard features of a model organism database are provided, including genome browsers, gene list annotation, BLAST homology searches and gene information pages. Community annotation updating is supported via integration of WebApollo. We have produced an RNA-sequencing (RNA-Seq) expression atlas for Populus tremula and have integrated these data within the expression tools. An updated version of the COMPLEX resource for performing comparative plant expression analyses of gene coexpression network conservation between species has also been integrated. The platform provides intuitive access to large-scale and genome-wide genomics data from model forest tree species, facilitating both community contributions to annotation improvement and tools supporting use of the included data resources to inform biological insight.
  112. Wuyts, V., Denayer, S., Roosens, N. H., Mattheus, W., Bertrand, S., Marchal, K., … De Keersmaecker, S. C. (2015). The usefulness of whole genome sequencing for outbreak investigation of food pathogens, Salmonella Enteritidis as a case study. LABINFO. Brussel: FAVV.
  113. Wuyts, V., Denayer, S., Roosens, N. H., Mattheus, W., Bertrand, S., Marchal, K., … De Keersmaecker, S. C. (2015). User‐friendly WGS analysis of Salmonella Enteritidis PT4 outbreaks. In Food Microbiology, 20th Conference, Abstracts. Brussels, Belgium: Belgian Society for Food Microbiology (BSFM).
  114. De Meyer, Tim, Bady, P., Trooskens, G., Kurscheid, S., Bloch, J., Kros, J. M., Hainfellner, J. A., et al. (2015). Genome-wide DNA methylation detection by MethylCap-seq and Infinium HumanMethylation450 BeadChips: an independent large-scale comparison. SCIENTIFIC REPORTS, 5.
    Two cost-efficient genome-scale methodologies to assess DNA-methylation are MethylCap-seq and Illumina's Infinium HumanMethylation450 BeadChips (HM450). Objective information regarding the best-suited methodology for a specific research question is scant. Therefore, we performed a large-scale evaluation on a set of 70 brain tissue samples, i.e. 65 glioblastoma and 5 non-tumoral tissues. As MethylCap-seq coverages were limited, we focused on the inherent capacity of the methodology to detect methylated loci rather than a quantitative analysis. MethylCap-seq and HM450 data were dichotomized and performances were compared using a gold standard free Bayesian modelling procedure. While conditional specificity was adequate for both approaches, conditional sensitivity was systematically higher for HM450. In addition, genome-wide characteristics were compared, revealing that HM450 probes identified substantially fewer regions compared to MethylCap-seq. Although results indicated that the latter method can detect more potentially relevant DNA-methylation, this did not translate into the discovery of more differentially methylated loci between tumours and controls compared to HM450. Our results therefore indicate that both methodologies are complementary, with a higher sensitivity for HM450 and a far larger genome-wide coverage for MethylCap-seq, but also that a more comprehensive character does not automatically imply more significant results in biomarker studies.
  115. Gonzalez Sanchez, N., Pauwels, L., Baekelandt, A., De Milde, L., Van Leene, J., Besbrugge, N., … Inzé, D. (2015). A repressor protein complex regulates leaf growth in Arabidopsis. PLANT CELL, 27(8), 2273–2287.
    Cell number is an important determinant of final organ size. In the leaf, a large proportion of cells are derived from the stomatal lineage. Meristemoids, which are stem cell-like precursor cells, undergo asymmetric divisions, generating several pavement cells adjacent to the two guard cells. However, the mechanism controlling the asymmetric divisions of these stem cells prior to differentiation is not well understood. Here, we characterized PEAPOD (PPD) proteins, the only transcriptional regulators known to negatively regulate meristemoid division. PPD proteins interact with KIX8 and KIX9, which act as adaptor proteins for the corepressor TOPLESS. D3-type cyclin encoding genes were identified among direct targets of PPD2, being negatively regulated by PPDs and KIX8/9. Accordingly, kix8 kix9 mutants phenocopied PPD loss-of-function producing larger leaves resulting from increased meristemoid amplifying divisions. The identified conserved complex might be specific for leaf growth in the second dimension, since it is not present in Poaceae (grasses), which also lack the developmental program it controls.
  116. Verbeke, L., Van den Eynden, J., Demeester, P., Marchal, K., & Fostier, J. (2015). Pathway relevance ranking for tumor samples through network-based data integration (award for Outstanding Oral Poster Presentation). In 23rd Annual International Conference on Intelligent Systems for Molecular Biology, Abstracts (p. 1). Dublin, Ireland.
  117. Miclotte, G., Heydari, M., Demeester, P., Audenaert, P., & Fostier, J. (2015). Jabba: hybrid error correction for long sequencing reads using maximal exact matches. In Lecture Notes in Bioinformatics (Vol. 9289, pp. 175–188). Georgia Technol Inst, Atlanta, GA: Springer.
    Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed and extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented.
  118. Pulido Tamayo, S., Sanchez Rodriguez, A., Swings, T., Van den Berghe, B., Dubey, A., Steenackers, H., … Marchal, K. (2015). Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations. NUCLEIC ACIDS RESEARCH, 43(16).
    Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequency at which the haplotypes occur. However, this reconstruction is technically not trivial, especially not in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus obviates the reconstruction of genome-wide haplotypes based on sequence overlap information. Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As was shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.
  119. Decap, D., Reumers, J., Herzeel, C., Costanza, P., & Fostier, J. (2015). Halvade: scalable sequence analysis with MapReduce. BIOINFORMATICS, 31(15), 2482–2488.
    Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading. Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
  120. Verbeke, L., Van den Eynden, J., Fierro Gutierrez, A. C. E., Demeester, P., Fostier, J., & Marchal, K. (2015). Pathway relevance ranking for tumor samples through network-based data integration. PLOS ONE, 10(7).
    The study of cancer, a highly heterogeneous disease with different causes and clinical outcomes, requires a multi-angle approach and the collection of large multi-omics datasets that, ideally, should be analyzed simultaneously. We present a new pathway relevance ranking method that is able to prioritize pathways according to the information contained in any combination of tumor related omics datasets. Key to the method is the conversion of all available data into a single comprehensive network representation containing not only genes but also individual patient samples. Additionally, all data are linked through a network of previously identified molecular interactions. We demonstrate the performance of the new method by applying it to breast and ovarian cancer datasets from The Cancer Genome Atlas. By integrating gene expression, copy number, mutation and methylation data, the method's potential to identify key pathways involved in breast cancer development shared by different molecular subtypes is illustrated. Interestingly, certain pathways were ranked equally important for different subtypes, even when the underlying (epi)-genetic disturbances were diverse. Next to prioritizing universally high-scoring pathways, the pathway ranking method was able to identify subtype-specific pathways. Often the score of a pathway could not be motivated by a single mutation, copy number or methylation alteration, but rather by a combination of genetic and epi-genetic disturbances, stressing the need for a network-based data integration approach. The analysis of ovarian tumors, as a function of survival-based subtypes, demonstrated the method's ability to correctly identify key pathways, irrespective of tumor subtype. A differential analysis of survival-based subtypes revealed several pathways with higher importance for the bad-outcome patient group than for the good-outcome patient group. Many of the pathways exhibiting higher importance for the bad-outcome patient group could be related to ovarian tumor proliferation and survival.
  121. Herzeel, C., Costanza, P., Decap, D., Fostier, J., & Reumers, J. (2015). elPrep: high-performance preparation of sequence alignment/map files for variant calling. PLOS ONE, 10(7).
    elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture that allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1:40 hours, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23 GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundreds of hours of computing time, and thus substantially reduce analysis time and cost.
  122. Vyverman, M., De Baets, B., Fack, V., & Dawyndt, P. (2015). A long fragment aligner called ALFALFA. BMC BIOINFORMATICS, 16.
    Background: Rapid evolutions in sequencing technology force read mappers into flexible adaptation to longer reads, changing error models, memory barriers and novel applications. Results: ALFALFA achieves a high performance in accurately mapping long single-end and paired-end reads to gigabase-scale reference genomes, while remaining competitive for mapping shorter reads. Its seed-and-extend workflow is underpinned by fast retrieval of super-maximal exact matches from an enhanced sparse suffix array, with flexible parameter tuning to balance performance, memory footprint and accuracy. Conclusions: ALFALFA is open source and available at
  123. Mushthofa, M., Schockaert, S., & De Cock, M. (2015). Solving disjunctive fuzzy answer set programs. In F. Calimeri, G. Ianni, & M. Truszczynski (Eds.), Lecture Notes in Computer Science (Vol. 9345, pp. 453–466). Presented at the 13th International conference of Logic Programming and Nonmonotonic Reasoning (LPNMR 2015), Cham, Switzerland: Springer.
    Fuzzy Answer Set Programming (FASP) is an extension of the popular Answer Set Programming (ASP) paradigm which is tailored for continuous domains. Despite the existence of several prototype implementations, none of the existing solvers can handle disjunctive rules in a sound and efficient manner. We first show that a large class of disjunctive FASP programs called the self-reinforcing cycle-free (SRCF) programs can be polynomially reduced to normal FASP programs. We then introduce a general method for solving disjunctive FASP programs, which combines the proposed reduction with the use of mixed integer programming for minimality checking. We also report the result of the experimental benchmark of this method.
  124. Wang, F., Muto, A., Van de Velde, J., Neyt, P., Himanen, K., Vandepoele, K., & Van Lijsebettens, M. (2015). Functional analysis of the Arabidopsis TETRASPANIN gene family in plant growth and development. PLANT PHYSIOLOGY, 169(3), 2200–2214.
    TETRASPANIN (TET) genes encode conserved integral membrane proteins that are known in animals to function in cellular communication during gamete fusion, immunity reaction and pathogen recognition. In plants, functional information is limited to one of the 17 members of the Arabidopsis TET gene family and to expression data in reproductive stages. Here, the promoter activity of all 17 Arabidopsis TET genes was investigated by pAtTET::NLS-GFP/GUS reporter lines throughout the life cycle, which predicted functional divergence in the paralogous genes per clade. However, partial overlap was observed for many TET genes across the clades, correlating with few phenotypes in single mutants and therefore requiring double mutant combinations for functional investigation. Mutational analysis showed a role for TET13 in primary root growth and lateral root development, and redundant roles for TET5 and TET6 in leaf and root growth through negative regulation of cell proliferation. Strikingly, a number of TET genes were expressed in embryonic and seedling progenitor cells and remained expressed until the differentiation state in the mature plant, suggesting a dynamic function over developmental stages. cis-regulatory elements together with transcription factor binding data provided molecular insight into the site, conditions and perturbations that affect TET gene expression, and positioned the TET genes in different molecular pathways; the data represent a hypothesis-generating resource for further functional analyses.
  125. Potenza, E., Racchi, M. L., Sterck, L., Coller, E., Asquini, E., Tosatto, S. C., Velasco, R., et al. (2015). Exploration of alternative splicing events in ten different grapevine cultivars. BMC GENOMICS, 16.
    Background: The complex dynamics of gene regulation in plants are still far from being fully understood. Among many factors involved, alternative splicing (AS) in particular is one of the least well documented. For many years, AS has been considered less relevant in plants, especially when compared to animals. However, since the introduction of next generation sequencing techniques, the number of plant genes believed to be alternatively spliced has increased exponentially. Results: Here, we performed a comprehensive high-throughput transcript sequencing of ten different grapevine cultivars, which resulted in the first high coverage atlas of the grape berry transcriptome. We also developed findAS, a software tool for the analysis of alternatively spliced junctions. We demonstrate that at least 44 % of multi-exonic genes undergo AS and that a large number of low abundance splice variants are present within the 131,622 splice junctions we have annotated from Pinot noir. Conclusions: Our analysis shows that ~70 % of AS events have relatively low expression levels. Furthermore, alternative splice sites seem to be enriched near the constitutive ones, to some extent showing the noise of the splicing mechanisms. However, AS seems to be extensively conserved among the 10 cultivars.
  126. Van den Berge, K., De Smet, R., Van de Peer, Y., & Clement, L. (2015). Quantifying expression divergence of duplicated genes with microarrays. Belgian Statistical Society, 23rd Annual meeting, Abstracts. Presented at the 23rd Annual meeting of the Belgian Statistical Society.
    Whole genome duplication (WGD) events are widespread among flowering plants. They result in two redundant genomes within the individual. Most duplicated genes derived from a WGD event (i.e. homeologous genes) will get lost during evolution. Nonetheless, they provide raw material for the evolution of genes with novel functions. Expression divergence is often used to assess the contribution of WGD in this respect. Microarray technology can be used for this purpose. With microarrays, the expression of a gene is measured by multiple 'probes', i.e. a probeset. Quantifying expression divergence involves differential expression analysis between two distinct genes, which is challenging as it involves different probesets, each having different characteristics. We show that standard analysis methods adopted in the evolutionary genomics literature typically lead to an excess of false positives, explaining the high number of reported significantly diverged genes. We propose a novel data analysis strategy to account for these probe effects. An empirical null distribution is established by adopting a test statistic on probes within a probeset. This null distribution can be incorporated in a local fdr estimate for every gene pair, which rigorously defines significant expression divergence. We illustrate our method in a case study on Arabidopsis thaliana.
  127. Trypsteen, W., De Neve, J., Bosman, K., Nijhuis, M., Thas, O., Vandekerckhove, L., & De Spiegelaere, W. (2015). Robust regression methods for real-time PCR. qPCR and digital PCR congress 2015 : poster presentation abstracts. Presented at the 3rd qPCR and Digital PCR congress.
  128. Trypsteen, W., Vynck, M., De Neve, J., Bonczkowski, P., Kiselinova, M., Malatinková, E., Vervisch, K., et al. (2015). ddcRquant: threshold determination for single channel droplet digital PCR experiments. qPCR and digital PCR congress 2015 : poster presentation abstracts. Presented at the 3rd qPCR and Digital PCR congress.
  129. Mattiello, F., Thas, O., & Verbist, B. (2015). Principal bicorrelation analysis: unraveling associations between three data sources. JOURNAL OF BIOPHARMACEUTICAL STATISTICS, 26(3), 534–551.
    In this article, we propose a statistical explorative method for data integration. It is developed in the context of early drug development for which it enables the detection of chemical substructures and the identification of genes that mediate their association with the bioactivity (BA). The core of the method is a sparse singular value decomposition for the identification of the gene set and a permutation-based method for the control of the false discovery rate. The method is illustrated using a real dataset, and its properties are empirically evaluated by means of a simulation study. Quantitative Structure Transcriptional Activity Relationship (QSTAR) is a new paradigm in early drug development that extends QSAR by not only considering data on the chemical structure of the compounds and on the compound-induced BA, but by simultaneously using transcriptomics data (gene expression). This approach enables, for example, the detection of chemical substructures that are associated with BA, while at the same time a gene set is correlated with both these substructures and the BA. Although causal associations cannot be formally concluded, these associations may suggest that the compounds act on the BA through a particular genomic pathway.
  130. Vanneste, K., Sterck, L., Myburg, A. A., Van de Peer, Y., & Mizrachi, E. (2015). Horsetails are ancient polyploids : evidence from Equisetum giganteum. PLANT CELL, 27(6), 1567–1578.
    Horsetails represent an enigmatic clade within the land plants. Despite consisting only of one genus (Equisetum) that contains 15 species, they are thought to represent the oldest extant genus within the vascular plants dating back possibly as far as the Triassic. Horsetails have retained several ancient features and are also characterized by a particularly high chromosome count (n = 108). Whole-genome duplications (WGDs) have been uncovered in many angiosperm clades and have been associated with the success of angiosperms, both in terms of species richness and biomass dominance, but remain understudied in nonangiosperm clades. Here, we report unambiguous evidence of an ancient WGD in the fern lineage, based on sequencing and de novo assembly of an expressed gene catalog (transcriptome) from the giant horsetail (Equisetum giganteum). We demonstrate that horsetails underwent an independent paleopolyploidy during the Late Cretaceous prior to the diversification of the genus but did not experience any recent polyploidizations that could account for their high chromosome number. We also discuss the specific retention of genes following the WGD and how this may be linked to their long-term survival.
  131. Verkest, A., Byzova, M., Martens, C., Willems, P., Verwulgen, T., Slabbinck, B., Rombaut, D., et al. (2015). Selection for improved energy use efficiency and drought tolerance in canola results in distinct transcriptome and epigenome changes. PLANT PHYSIOLOGY, 168(4), 1338–1350.
    To increase both the yield potential and stability of crops, integrated breeding strategies are used that have mostly a direct genetic basis, but the utility of epigenetics to improve complex traits is unclear. A better understanding of the status of the epigenome and its contribution to agronomic performance would help in developing approaches to incorporate the epigenetic component of complex traits into breeding programs. Starting from isogenic canola (Brassica napus) lines, epilines were generated by selecting, repeatedly for three generations, for increased energy use efficiency and drought tolerance. These epilines had an enhanced energy use efficiency, drought tolerance, and nitrogen use efficiency. Transcriptome analysis of the epilines and a line selected for its energy use efficiency solely revealed common differentially expressed genes related to the onset of stress tolerance-regulating signaling events. Genes related to responses to salt, osmotic, abscisic acid, and drought treatments were specifically differentially expressed in the drought-tolerant epilines. The status of the epigenome, scored as differential trimethylation of lysine-4 of histone 3, further supported the phenotype by targeting drought-responsive genes and facilitating the transcription of the differentially expressed genes. From these results, we conclude that the canola epigenome can be shaped by selection to increase energy use efficiency and stress tolerance. Hence, these findings warrant the further development of strategies to incorporate epigenetics into breeding.
  132. Zhang, Zhonghua, Mao, L., Chen, H., Bu, F., Li, G., Sun, J., Li, S., et al. (2015). Genome-wide mapping of structural variations reveals a copy number variant that determines reproductive morphology in cucumber. PLANT CELL, 27(6), 1595–1604.
    Structural variations (SVs) represent a major source of genetic diversity. However, the functional impact and formation mechanisms of SVs in plant genomes remain largely unexplored. Here, we report a nucleotide-resolution SV map of cucumber (Cucumis sativus) that comprises 26,788 SVs based on deep resequencing of 115 diverse accessions. The largest proportion of cucumber SVs was formed through nonhomologous end-joining rearrangements, and the occurrence of SVs is closely associated with regions of high nucleotide diversity. These SVs affect the coding regions of 1676 genes, some of which are associated with cucumber domestication. Based on the map, we discovered a copy number variation (CNV) involving four genes that defines the Female (F) locus and gives rise to gynoecious cucumber plants, which bear only female flowers and set fruit at almost every node. The CNV arose from a recent 30.2-kb duplication at a meiotically unstable region, likely via microhomology-mediated break-induced replication. The SV set provides a snapshot of structural variations in plants and will serve as an important resource for exploring genes underlying key traits and for facilitating practical breeding in cucumber.
  133. Defauw, A. (2015). Human heart heterogeneity and its role in the onset and perpetuation of cardiac arrhythmias. Ghent University. Faculty of Sciences, Ghent, Belgium.
  134. Ghorbani, S., Lin, Y.-C., Parizot, B., Fernandez Salina, A., Njo, M., Van de Peer, Y., Beeckman, T., et al. (2015). Expanding the repertoire of secretory peptides controlling root development with comparative genome analysis and functional assays. JOURNAL OF EXPERIMENTAL BOTANY, 66(17), 5257–5269.
    Plant genomes encode numerous small secretory peptides (SSPs) whose functions have yet to be explored. Based on structural features that characterize SSP families known to take part in postembryonic development, this comparative genome analysis resulted in the identification of genes coding for oligopeptides potentially involved in cell-to-cell communication. Because genome annotation based on short sequence homology is difficult, the criteria for the de novo identification and aggregation of conserved SSP sequences were first benchmarked across five reference plant species. The resulting gene families were then extended to 32 genome sequences, including major crops. The global phylogenetic pattern common to the functionally characterized SSP families suggests that their apparition and expansion coincide with that of the land plants. The SSP families can be searched online for members, sequences and consensus. Looking for putative regulators of root development, Arabidopsis thaliana SSP genes were further selected through transcriptome meta-analysis based on their expression at specific stages and in specific cell types in the course of the lateral root formation. As an additional indication that formerly uncharacterized SSPs may control development, this study showed that root growth and branching were altered by the application of synthetic peptides matching conserved SSP motifs, sometimes in very specific ways. The strategy used in the study, combining comparative genomics, transcriptome meta-analysis and peptide functional assays in planta, pinpoints factors potentially involved in non-cell-autonomous regulatory mechanisms. A similar approach can be implemented in different species for the study of a wide range of developmental programmes.
  135. Fierro Gutierrez, A. C. E., Leroux, O., De Coninck, B., Cammue, B. P., Marchal, K., Prinsen, E., Van Der Straeten, D., et al. (2015). Ultraviolet-B radiation stimulates downward leaf curling in Arabidopsis thaliana. PLANT PHYSIOLOGY AND BIOCHEMISTRY, 93, 9–17.
  136. De Tiège, A., Tanghe, K., Braeckman, J., & Van de Peer, Y. (2015). Life’s dual nature: a way out of the impasse of the gene-centred “versus” complex systems controversy on life. In P. Pontarotti (Ed.), Evolutionary biology : biodiversification from genotype to phenotype (pp. 35–52). Berlin, Germany: Springer.
    Living cells and organisms are complex physical systems. Does their organization or complexity primarily rely on the intra-molecular crystalline structure of genetic nucleic acid sequences? Or is it, as critics of the ‘gene-centred’ perspective claim, predominantly a result of the inter- and supra-molecular – thus ‘holistic’ – network dynamics of genetic and various extra-genetic factors? The twentieth-century successes in several branches of genetics caused intensive focus on the causal role of genes in the biochemistry, development and evolution of living organisms, resulting in a relative abstraction or even neglect of life’s complex systems dynamics. Today, however, partly due to the success of systems biology, a number of authors defend life’s systems complexity while criticizing the gene-centred approach. Here, we offer a way out of the impasse of the gene-centred ‘versus’ complex systems perspective to arrive at a more balanced and complete understanding of life’s multifaceted nature. After sketching the conceptual and historical background of the controversy, we show how the present state of knowledge in biology vindicates both the holistically complex and gene-centred nature of life on Earth, but decisively falsifies extreme genetic ‘determinism’ and ‘reductionism’ as well as extreme ‘gene-de-centrism’. Contrary to what is often claimed, the fact that genes are one among many extra-genetic causal factors contributing to the biochemistry and development of cells and organisms, only undermines or falsifies genetic determinism and reductionism but not necessarily gene-centrism. Some implications for evolutionary theory, i.e., for the controversy between the Modern Synthesis and an ‘Extended Synthesis’, are outlined.
  137. Morel, G., Sterck, L., Swennen, D., Marcet-Houben, M., Onesime, D., Levasseur, A., Jacques, N., et al. (2015). Differential gene retention as an evolutionary mechanism to generate biodiversity and adaptation in yeasts. SCIENTIFIC REPORTS, 5.
    The evolutionary history of the characters underlying the adaptation of microorganisms to food and biotechnological uses is poorly understood. We undertook comparative genomics to investigate evolutionary relationships of the dairy yeast Geotrichum candidum within Saccharomycotina. Surprisingly, a remarkable proportion of genes showed discordant phylogenies, clustering with the filamentous fungus subphylum (Pezizomycotina), rather than the yeast subphylum (Saccharomycotina), of the Ascomycota. These genes appear not to be the result of Horizontal Gene Transfer (HGT), but to have been specifically retained by G. candidum after the filamentous fungi-yeasts split concomitant with the yeasts' genome contraction. We refer to these genes as SRAGs (Specifically Retained Ancestral Genes), having been lost by all or nearly all other yeasts, and thus contributing to the phenotypic specificity of lineages. SRAG functions include lipases consistent with a role in cheese making and novel endoglucanases associated with degradation of plant material. Similar gene retention was observed in three other distantly related yeasts representative of this ecologically diverse subphylum. The phenomenon thus appears to be widespread in the Saccharomycotina and argues that, alongside neo-functionalization following gene duplication and HGT, specific gene retention must be recognized as an important mechanism for generation of biodiversity and adaptation in yeasts.
  138. Gonnelli, G., Stock, M., Verwaeren, J., Maddelein, D., De Baets, B., Martens, L., & Degroeve, S. (2015). A decoy-free approach to the identification of peptides. JOURNAL OF PROTEOME RESEARCH, 14(4), 1792–1798.
    A growing number of proteogenomics and metaproteomics studies indicate potential limitations of the application of the decoy database paradigm used to separate correct peptide identifications from incorrect ones in traditional shotgun proteomics. We therefore propose a binary classifier called Nokoi that allows fast yet reliable decoy-free separation of correct from incorrect peptide-to-spectrum matches (PSMs). Nokoi was trained on a very large collection of heterogeneous data using ranks supplied by the Mascot search engine to label correct and incorrect PSMs. We show that Nokoi outperforms Mascot and achieves a performance very close to that of Percolator at substantially higher processing speeds.
  139. Oveland, E., Muth, T., Rapp, E., Martens, L., Berven, F. S., & Barsnes, H. (2015). Viewing the proteome : how to visualize proteomics data? PROTEOMICS, 15(8), 1341–1355.
    Proteomics has become one of the main approaches for analyzing and understanding biological systems. Yet similar to other high-throughput analysis methods, the presentation of the large amounts of obtained data in easily interpretable ways remains challenging. In this review, we present an overview of the different ways in which proteomics software supports the visualization and interpretation of proteomics data. The unique challenges and current solutions for visualizing the different aspects of proteomics data, from acquired spectra via protein identification and quantification to pathway analysis, are discussed, and examples of the most useful visualization approaches are highlighted. Finally, we offer our ideas about future directions for proteomics data visualization.
  140. Vermeire, T., Vermaere, S., Schepens, B., Saelens, X., Van Gucht, S., Martens, L., & Vandermarliere, E. (2015). Scop3D : three-dimensional visualization of sequence conservation. PROTEOMICS, 15(8), 1448–1452.
    The integration of a protein's structure with its known sequence variation provides insight on how that protein evolves, for instance in terms of (changing) function or immunogenicity. Yet, collating the corresponding sequence variants into a multiple sequence alignment, calculating each position's conservation, and mapping this information back onto a relevant structure is not straightforward. We therefore built the Sequence Conservation on Protein 3D structure (scop3D) tool to perform these tasks automatically. The output consists of two modified PDB files in which the B-values for each position are replaced by the percentage sequence conservation, or the information entropy for each position, respectively. Furthermore, text files with absolute and relative amino acid occurrences for each position are also provided, along with snapshots of the protein from six distinct directions in space. The visualization provided by scop3D can for instance be used as an aid in vaccine development or to identify antigenic hotspots, which we here demonstrate based on an analysis of the fusion proteins of human respiratory syncytial virus and mumps virus.
  141. Muth, T., Behne, A., Heyer, R., Kohrs, F., Benndorf, D., Hoffmann, M., Lehteva, M., et al. (2015). The MetaProteomeAnalyzer : a powerful open-source software suite for metaproteomics data analysis and interpretation. JOURNAL OF PROTEOME RESEARCH, 14(3), 1557–1565.
    The enormous challenges of mass spectrometry-based metaproteomics are primarily related to the analysis and interpretation of the acquired data. This includes reliable identification of mass spectra and the meaningful integration of taxonomic and functional meta-information from samples containing hundreds of unknown species. To ease these difficulties, we developed a dedicated software suite, the MetaProteomeAnalyzer, an intuitive open-source tool for metaproteomics data analysis and interpretation, which includes multiple search engines and the feature to decrease data redundancy by grouping protein hits to so-called meta-proteins. We also designed a graph database back-end for the MetaProteomeAnalyzer to allow seamless analysis of results. The functionality of the MetaProteomeAnalyzer is demonstrated using a sample of a microbial community taken from a biogas plant.
  142. Wuyts, V., Mattheus, W., Roosens, N. H., Marchal, K., Bertrand, S., & De Keersmaecker, S. C. (2015). A multiplex oligonucleotide ligation-PCR as a complementary tool for subtyping of Salmonella Typhimurium. APPLIED MICROBIOLOGY AND BIOTECHNOLOGY, 99(19), 8137–8149.
    Subtyping below the serovar level is essential for surveillance and outbreak detection and investigation of Salmonella enterica subsp. enterica serovar Typhimurium (S. Typhimurium) and its monophasic variant 1,4,[5],12:i:- (S. 1,4,[5],12:i:-), frequent causes of foodborne infections. In an attempt to overcome the intrinsic shortcomings of currently used subtyping techniques, a multiplex oligonucleotide ligation-PCR (MOL-PCR) assay was developed which combines different types of molecular markers in a high throughput microsphere suspension array. The 52 molecular markers include prophage genes, amplified fragment length polymorphism (AFLP) elements, Salmonella genomic island 1 (SGI1), allantoinase gene allB, MLVA locus STTR10, antibiotic resistance genes, single nucleotide polymorphisms (SNPs) and phase 2 flagellar gene fljB. The in vitro stability of these markers was confirmed in a serial passage experiment. The validation of the MOL-PCR assay for subtyping of S. Typhimurium and S. 1,4,[5],12:i:- on 519 isolates shows that the method is rapid, reproducible, flexible, accessible, easy to use and relatively inexpensive. Additionally, a 100 % typeability and a discriminatory power equivalent to that of phage typing were observed, and epidemiological concordance was assessed on isolates of 2 different outbreaks. Furthermore, a data analysis method is provided so that the MOL-PCR assay allows for objective, computerised data analysis and data interpretation of which the results can be easily exchanged between different laboratories in an international surveillance network.
  143. Martens, L., Kohlbacher, O., & Weintraub, S. T. (2015). Managing expectations when publishing tools and methods for computational proteomics. JOURNAL OF PROTEOME RESEARCH.
    Computational tools are pivotal in proteomics because they are crucial for identification, quantification, and statistical assessment of data. The gateway to finding the best choice of a tool or approach for a particular problem is frequently journal articles, yet there is often an overwhelming variety of options that makes it hard to decide on the best solution. This is particularly difficult for nonexperts in bioinformatics. The maturity, reliability, and performance of tools can vary widely because publications may appear at different stages of development. A novel idea might merit early publication despite only offering proof-of-principle, while it may take years before a tool can be considered mature, and by that time it might be difficult for a new publication to be accepted, because of a perceived lack of novelty. After discussions with members of the computational mass spectrometry community, we describe here proposed recommendations for organization of informatics manuscripts as a way to set the expectations of readers (and reviewers) through three different manuscript types that are based on existing journal designations. Brief Communications are short reports describing novel computational approaches where the implementation is not necessarily production-ready. Research Articles present both a novel idea and mature implementation that has been suitably benchmarked. Application Notes focus on a mature and tested tool or concept and need not be novel but should offer advancement from improved quality, ease of use, and/or implementation. Organizing computational proteomics contributions into these three manuscript types will facilitate the review process and will also enable readers to identify the maturity and applicability of the tool for their own workflows.
  144. Wuyts, V., Denayer, S., Roosens, N. H., Mattheus, W., Bertrand, S., Marchal, K., … De Keersmaecker, S. C. (2015). Whole genome sequence analysis of Salmonella Enteritidis PT4 outbreaks from a national reference laboratory’s viewpoint. PLOS CURRENTS OUTBREAKS.
    Introduction: In April and May 2014, two suspected egg-related outbreaks of Salmonella enterica subsp. enterica serovar Enteritidis (S. Enteritidis) were investigated by the Belgian National Reference Laboratory of Foodborne Outbreaks. Both the suspected food and human isolates being available, and this for both outbreaks, made these the ideal case study for a retrospective whole genome sequencing (WGS) analysis with the goal to investigate the feasibility of this technology for outbreak investigation by a National Reference Laboratory or National Reference Centre without thorough bioinformatics expertise. Methods: The two outbreaks were originally investigated epidemiologically with a standard questionnaire and with serotyping, phage typing, multiple-locus variable-number of tandem repeats analysis (MLVA) and antimicrobial susceptibility testing as classical microbiological methods. Retrospectively, WGS of six outbreak isolates was done on an Illumina HiSeq. Analysis of the WGS data was performed with currently available, user-friendly software and tools, namely CLC Genomics Workbench, the tools available on the server of the Center for Genomic Epidemiology and BLAST Ring Image Generator (BRIG). Results: To all collected human and food outbreak isolates, classical microbiological investigation assigned phage type PT4 (variant phage type PT4a for one human isolate) and MLVA profile 3-10-5-4-1, both of which are common for human isolates in Belgium. The WGS analysis confirmed the link between food and human isolates for each of the outbreaks and clearly discriminated between the two outbreaks occurring in the same time period, thereby suggesting a non-common source of contamination. Also, an additional plasmid carrying an antibiotic resistance gene was discovered in the human isolate with the variant phage type PT4a. Discussion: For the two investigated outbreaks occurring at geographically separated locations, the gold standard classical microbiological subtyping methods were not sufficiently discriminative to distinguish between or assign a common origin of contamination for the two outbreaks, while WGS analysis could do so. This case study demonstrated the added value of WGS for outbreak investigations by confirming and/or discriminating food and human isolates between and within outbreaks. It also proved the feasibility of WGS as complementary or even future replacing (sub)typing method for the average routine laboratory.
  145. De Coninck, A., Kourounis, D., Verbosio, F., Schenk, O., De Baets, B., Maenhout, S., & Fostier, J. (2015). Towards parallel large-scale genomic prediction by coupling sparse and dense matrix algebra. In M. Daneshtalab, M. Aldinucci, V. Leppänen, J. Lilius, & M. Brorsson (Eds.), 23RD EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2015) (pp. 747–750). New York, NY, USA: IEEE.
    Genomic prediction for plant breeding requires taking into account environmental effects and variations of genetic effects across environments. The latter can be modelled by estimating the effect of each genetic marker in every possible environmental condition, which leads to a huge number of effects to be estimated. Nonetheless, the information about these effects is only sparsely present, due to the fact that plants are only tested in a limited number of environmental conditions. In contrast, the genotypes of the plants are a dense source of information and thus the estimation of both types of effects in one single step would require both dense and sparse matrix formalisms. This paper presents a way to efficiently apply a high performance computing infrastructure for dealing with large-scale genomic prediction settings, relying on the coupling of dense and sparse matrix algebra.
  146. Glover, N. M., Daron, J., Pingault, L., Vandepoele, K., Paux, E., Feuillet, C., & Choulet, F. (2015). Small-scale gene duplications played a major role in the recent evolution of wheat chromosome 3B. GENOME BIOLOGY, 16.
    Background: Bread wheat is not only an important crop, but its large (17 Gb), highly repetitive, and hexaploid genome makes it a good model to study the organization and evolution of complex genomes. Recently, we produced a high quality reference sequence of wheat chromosome 3B (774 Mb), which provides an excellent opportunity to study the evolutionary dynamics of a large and polyploid genome, specifically the impact of single gene duplications. Results: We find that 27 % of the 3B predicted genes are non-syntenic with the orthologous chromosomes of Brachypodium distachyon, Oryza sativa, and Sorghum bicolor, whereas, by applying the same criteria, non-syntenic genes represent on average only 10 % of the predicted genes in these three model grasses. These non-syntenic genes on 3B have high sequence similarity to at least one other gene in the wheat genome, indicating that hexaploid wheat has undergone massive small-scale interchromosomal gene duplications compared to other grasses. Insertions of non-syntenic genes occurred at a similar rate along the chromosome, but these genes tend to be retained at a higher frequency in the distal, recombinogenic regions. The ratio of non-synonymous to synonymous substitution rates showed a more relaxed selection pressure for non-syntenic genes compared to syntenic genes, and gene ontology analysis indicated that non-syntenic genes may be enriched in functions involved in disease resistance. Conclusion: Our results highlight the major impact of single gene duplications on the wheat gene complement and confirm the accelerated evolution of the Triticeae lineage among grasses.
  147. Koch, A., De Meyer, T., Jeschke, J., & Van Criekinge, W. (2015). MEXPRESS : visualizing expression, DNA methylation and clinical TCGA data. BMC GENOMICS, 16.
    Background: In recent years, increasing amounts of genomic and clinical cancer data have become publicly available through large-scale collaborative projects such as The Cancer Genome Atlas (TCGA). However, as long as these datasets are difficult to access and interpret, they are essentially useless for a major part of the research community and their scientific potential will not be fully realized. To address these issues we developed MEXPRESS, a straightforward and easy-to-use web tool for the integration and visualization of the expression, DNA methylation and clinical TCGA data on a single-gene level. Results: In comparison to existing tools, MEXPRESS allows researchers to quickly visualize and interpret the different TCGA datasets and their relationships for a single gene, as demonstrated for GSTP1 in prostate adenocarcinoma. We also used MEXPRESS to reveal the differences in the DNA methylation status of the PAM50 marker gene MLPH between the breast cancer subtypes and how these differences were linked to the expression of MLPH. Conclusions: We have created a user-friendly tool for the visualization and interpretation of TCGA data, offering clinical researchers a simple way to evaluate the TCGA data for their genes or candidate biomarkers of interest.
  148. Merhej, E., Schockaert, S., & De Cock, M. (2015). Using rules of thumb for repairing inconsistent answer set programs. In C. Beierle & A. Dekhtyar (Eds.), Lecture Notes in Artificial Intelligence (Vol. 9310, pp. 368–381). Presented at the 9th International conference on Scalable Uncertainty Management (SUM), Berlin, Germany: Springer.
    Answer set programming is a form of declarative programming that can be used to elegantly model various systems. When the available knowledge about these systems is imperfect, however, the resulting programs can be inconsistent. In such cases, it is of interest to find plausible repairs, i.e. plausible modifications to the original program that ensure the existence of at least one answer set. Although several approaches to this end have already been proposed, most of them merely find a repair which is in some sense minimal. In many applications, however, expert knowledge is available which could allow us to identify better repairs. In this paper, we analyze the potential of using expert knowledge in this way, by focusing on a specific case study: gene regulatory networks. We show how we can identify the repairs that best agree with insights about such networks that have been reported in the literature, and experimentally compare this strategy against the baseline strategy of identifying minimal repairs.
  149. Van Neste, C. (2015). Porting forensic DNA analysis to deep sequencing. Ghent University. Faculty of Pharmaceutical Sciences, Ghent, Belgium.
    Forensic DNA profiles of short tandem repeat (STR) loci are currently obtained using PCR followed by capillary electrophoresis (CE). Massively parallel sequencing (MPS) technologies do not rely on size separation and thus relieve the limitations on locus multiplexy. Deep sequencing with MPS creates possibilities within forensics for analyzing degraded samples and mixed samples. Furthermore, in the same analysis single nucleotide polymorphism (SNP) markers can be included, which can generate phenotypic or ancestry leads for forensic investigators. Data analysis of raw sequencer reads, resulting in a reliable and usable forensic human identification report is still in early development. The aim of the doctoral research was to develop a program for forensic DNA data analysis. The main results are the data analysis framework MyFLq (My Forensic Loci queries) and nomenclature service FLAD (Forensic Loci Allele Database). MyFLq and FLAD can be used together in a forensic workflow that has backward compatibility with CE. To my knowledge, this is the first open-source and complete solution for forensic MPS raw data analysis.
  150. De La Torre, A. R., Lin, Y.-C., Van de Peer, Y., & Ingvarsson, P. K. (2015). Genome-wide analysis reveals diverged patterns of codon bias, gene expression, and rates of sequence evolution in Picea gene families. GENOME BIOLOGY AND EVOLUTION, 7(4), 1002–1015.
    The recent sequencing of several gymnosperm genomes has greatly facilitated studying the evolution of their genes and gene families. In this study, we examine the evidence for expression-mediated selection in the first two fully sequenced representatives of the gymnosperm plant clade (Picea abies and Picea glauca). We use genome-wide estimates of gene expression (> 50,000 expressed genes) to study the relationship between gene expression, codon bias, rates of sequence divergence, protein length, and gene duplication. We found that gene expression is correlated with rates of sequence divergence and codon bias, suggesting that natural selection is acting on Picea protein-coding genes for translational efficiency. Gene expression, rates of sequence divergence, and codon bias are correlated with the size of gene families, with large multicopy gene families having, on average, a lower expression level and breadth, lower codon bias, and higher rates of sequence divergence than single-copy gene families. Tissue-specific patterns of gene expression were more common in large gene families with large gene expression divergence than in single-copy families. Recent family expansions combined with large gene expression variation in paralogs and increased rates of sequence evolution suggest that some Picea gene families are rapidly evolving to cope with biotic and abiotic stress. Our study highlights the importance of gene expression and natural selection in shaping the evolution of protein-coding genes in Picea species, and sets the ground for further studies investigating the evolution of individual gene families in gymnosperms.
  151. Vriet, C., Lemmens, K., Vandepoele, K., Reuzeau, C., & Russinova, E. (2015). Evolutionary trails of plant steroid genes. TRENDS IN PLANT SCIENCE, 20(5), 301–308.
    Plant steroids - brassinosteroids (BRs) and their precursors, phytosterols - play a major role in plant growth, development, stress tolerance, and have high potential for agricultural applications. Currently, this prospect is limited by a lack of information about their evolution and expression dynamics (spatial and temporal) across plant species. The increasing number of sequenced genomes offers an opportunity for evolutionary studies that might help to prioritize functional analyses with the aim to improve crop yield and stress tolerance. In this review we provide a glimpse of the origin, evolution, and functional conservation of phytosterol and BR genes in the green plant lineage using comparative sequence and expression analyses of publicly available datasets.
  152. Yao, Y. (2015). Using a novel bio-inspired robotic model to study artificial evolution. Ghent University. Faculty of Sciences, Ghent, Belgium.
  153. Voordeckers, K., Kominek, J., Das, A., Espinosa-Cantu, A., De Maeyer, D., Marchal, K., DeLuna, A., et al. (2015). Adaptation to high ethanol reveals complex evolutionary pathways. YEAST (Vol. 32, pp. S272–S272). Presented at the 27th International conference on Yeast Genetics and Molecular Biology (ICYGMB).
  154. Ccenhua, M. T., Pulido-Tamayo, S., Imamura, H., Verbeke, L., Cotton, J., Dujardin, J.-C., & Marchal, K. (2015). Studying the relationship between drug resistance and genomic variations in Leishmania donovani using a network-based method. In TROPICAL MEDICINE & INTERNATIONAL HEALTH (Vol. 20, pp. 165–165). Basel, Switzerland.
  155. Van der Borght, K., Thys, K., Wetzels, Y., Clement, L., Verbist, B., Reumers, J., van Vlijmen, H., et al. (2015). QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles. BMC BIOINFORMATICS, 16.
    Background: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV, a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores, to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. Results: For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNVD). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNVHS). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNVHS-P80). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNVD performed similarly to the existing approaches. QQ-SNVHS was more sensitive on all test sets but with more false positives. QQ-SNVHS-P80 was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 %, QQ-SNVHS-P80 revealed a sensitivity of 100 % (vs. 40-60 % for the existing methods) and a specificity of 100 % (vs. 98.0-99.7 % for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 % were consistently detected by QQ-SNVHS-P80 from different generations of Illumina sequencers. Conclusions: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
  156. Verbist, B., Clement, L., Reumers, J., Thys, K., Vapirev, A., Talloen, W., Wetzels, Y., et al. (2015). ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering. BMC BIOINFORMATICS, 16.
    Background: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. Results: Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. Conclusions: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection.
  157. Verbist, B., Thys, K., Reumers, J., Wetzels, Y., Van der Borght, K., Talloen, W., Aerssens, J., et al. (2015). VirVarSeq: a low-frequency virus variant detection pipeline for Illumina sequencing using adaptive base-calling accuracy filtering. BIOINFORMATICS, 31(1), 94–101.
    Motivation: In virology, massively parallel sequencing (MPS) opens many opportunities for studying viral quasi-species, e.g. in HIV-1- and HCV-infected patients. This is essential for understanding pathways to resistance, which can substantially improve treatment. Although MPS platforms allow in-depth characterization of sequence variation, their measurements still involve substantial technical noise. For Illumina sequencing, single base substitutions are the main error source and impede powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores (Qs) that are useful for differentiating errors from the real low-frequency mutations. Results: A variant calling tool, Q-cpileup, is proposed, which exploits the Qs of nucleotides in a filtering strategy to increase specificity. The tool is imbedded in an open-source pipeline, VirVarSeq, which allows variant calling starting from fastq files. Using both plasmid mixtures and clinical samples, we show that Q-cpileup is able to reduce the number of false-positive findings. The filtering strategy is adaptive and provides an optimized threshold for individual samples in each sequencing run. Additionally, linkage information is kept between single-nucleotide polymorphisms as variants are called at the codon level. This enables virologists to have an immediate biological interpretation of the reported variants with respect to their antiviral drug responses. A comparison with existing SNP caller tools reveals that calling variants at the codon level with Q-cpileup results in an outstanding sensitivity while maintaining a good specificity for variants with frequencies down to 0.5%.
  158. Nelissen, H., Eeckhout, D., Demuynck, K., Persiau, G., Walton, A., Van Bel, M., … De Jaeger, G. (2015). Dynamic changes in ANGUSTIFOLIA3 complex composition reveal a growth regulatory mechanism in the maize leaf. PLANT CELL, 27(6), 1605–1619.
    Most molecular processes during plant development occur with a particular spatio-temporal specificity. Thus far, it has remained technically challenging to capture dynamic protein-protein interactions within a growing organ, where the interplay between cell division and cell expansion is instrumental. Here, we combined high-resolution sampling of the growing maize (Zea mays) leaf with tandem affinity purification followed by mass spectrometry. Our results indicate that the growth-regulating SWI/SNF chromatin remodeling complex associated with ANGUSTIFOLIA3 (AN3) was conserved within growing organs and between dicots and monocots. Moreover, we were able to demonstrate the dynamics of the AN3-interacting proteins within the growing leaf, since copurified GROWTH-REGULATING FACTORs (GRFs) varied throughout the growing leaf. Indeed, GRF1, GRF6, GRF7, GRF12, GRF15, and GRF17 were significantly enriched in the division zone of the growing leaf, while GRF4 and GRF10 levels were comparable between division zone and expansion zone in the growing leaf. These dynamics were also reflected at the mRNA and protein levels, indicating tight developmental regulation of the AN3-associated chromatin remodeling complex. In addition, the phenotypes of maize plants overexpressing miRNA396a-resistant GRF1 support a model proposing that distinct associations of the chromatin remodeling complex with specific GRFs tightly regulate the transition between cell division and cell expansion. Together, our data demonstrate that advancing from static to dynamic protein-protein interaction analysis in a growing organ adds insights in how developmental switches are regulated.
  159. Cloots, L., De Maeyer, D., & Marchal, K. (2014). Path finding in biological networks. In N. Kasabov (Ed.), Springer handbook of bio-/neuroinformatics (pp. 289–309). Berlin, Germany: Springer.
  160. De Tiège, A., Tanghe, K., Braeckman, J., & Van de Peer, Y. (2014). From DNA- to NA-centrism and the conditions for gene-centrism revisited. BIOLOGY & PHILOSOPHY, 29(1), 55–69.
    First the 'Weismann barrier' and later on Francis Crick's 'central dogma' of molecular biology nourished the gene-centric paradigm of life, i.e., the conception of the gene/genome as a 'central source' from which hereditary specificity unidirectionally flows or radiates into cellular biochemistry and development. Today, due to advances in molecular genetics and epigenetics, such as the discovery of complex post-genomic and epigenetic processes in which genes are causally integrated, many theorists argue that a gene-centric conception of the organism has become problematic. Here, we first explore the causal implications of the following two central dogma-related issues: (1) widespread reverse transcription, arguing for an extension from 'DNA-genome' to RNA-encompassing 'NA-genome' and, thus, from traditional DNA-centrism to a broader 'NA-centrism'; and (2) the absence of a mechanism of reverse translation, arguing for the 'structural primacy' of NA-sequence over protein in cellular biochemistry. Secondly, we explore whether this latter conclusion can be extended to a 'functional primacy' of NA-sequence over protein in cellular biochemistry, which would imply a limited kind of 'gene/NA-centrism' confined to the subcellular level of NA/protein-based biochemistry. Finally, we explore the conditions, and their (non)fulfilment, for a more generalised form of gene-centrism extendable to higher levels of biological organisation. We conclude that the higher we go in the biological hierarchy, the more dubious gene-centric claims become.
  161. De Witte, D., Van Bel, M., Audenaert, P., Demeester, P., Dhoedt, B., Vandepoele, K., & Fostier, J. (2014). A parallel, distributed-memory framework for comparative motif discovery. In R. Wyrzykowski, J. Dongarra, K. Karczewski, & J. Wasniewski (Eds.), Lecture Notes in Computer Science (Vol. 8385, pp. 268–277). Warsaw, Poland: Springer.
    The increasing number of sequenced organisms has opened new possibilities for the computational discovery of cis-regulatory elements ('motifs') based on phylogenetic footprinting. Word-based, exhaustive approaches are among the best performing algorithms, however, they pose significant computational challenges as the number of candidate motifs to evaluate is very high. In this contribution, we describe a parallel, distributed-memory framework for de novo comparative motif discovery. Within this framework, two approaches for phylogenetic footprinting are implemented: an alignment-based and an alignment-free method. The framework is able to statistically evaluate the conservation of motifs in a search space containing over 160 million candidate motifs using a distributed-memory cluster with 200 CPU cores in a few hours. Software available from
  162. Zhurov, V., Navarro, M., Bruinsma, K. A., Arbona, V., Santamaria, M. E., Cazaux, M., … Grbić, V. (2014). Reciprocal responses in the interaction between Arabidopsis and the cell-content feeding chelicerate herbivore spider mite. PLANT PHYSIOLOGY, 164(1), 384–399.
    Most molecular-genetic studies of plant defense responses to arthropod herbivores have focused on insects. However, plant-feeding mites are also pests of diverse plants, and mites induce different patterns of damage to plant tissues than do well-studied insects (e.g. lepidopteran larvae or aphids). The two-spotted spider mite (Tetranychus urticae) is among the most significant mite pests in agriculture, feeding on a staggering number of plant hosts. To understand the interactions between spider mite and a plant at the molecular level, we examined reciprocal genome-wide responses of mites and their host Arabidopsis (Arabidopsis thaliana). Despite differences in feeding guilds, we found that transcriptional responses of Arabidopsis to mite herbivory resembled those observed for lepidopteran herbivores. Mutant analysis of induced plant defense pathways showed functionally that only a subset of induced programs, including jasmonic acid signaling and biosynthesis of indole glucosinolates, are central to Arabidopsis's defense to mite herbivory. On the herbivore side, indole glucosinolates dramatically increased mite mortality and development times. We identified an indole glucosinolate dose-dependent increase in the number of differentially expressed mite genes belonging to pathways associated with detoxification of xenobiotics. This demonstrates that spider mite is sensitive to Arabidopsis defenses that have also been associated with the deterrence of insect herbivores that are very distantly related to chelicerates. Our findings provide molecular insights into the nature of, and response to, herbivory for a representative of a major class of arthropod herbivores.
  163. Ciesielska, K., Van Bogaert, I., Chevineau, S., Li, B., Groeneboer, S., Soetaert, W., Van de Peer, Y., et al. (2014). Exoproteome analysis of Starmerella bombicola results in the discovery of an esterase required for lactonization of sophorolipids. JOURNAL OF PROTEOMICS, 98, 159–174.
  164. Bolton, M. D., de Jonge, R., Inderbitzin, P., Liu, Z., Birla, K., Van de Peer, Y., Subbarao, K. V., et al. (2014). The heterothallic sugarbeet pathogen Cercospora beticola contains exon fragments of both MAT genes that are homogenized by concerted evolution. FUNGAL GENETICS AND BIOLOGY, 62, 43–54.
    Dothideomycetes is one of the most ecologically diverse and economically important classes of fungi. Sexual reproduction in this group is governed by mating type (MAT) genes at the MAT1 locus. Self-sterile (heterothallic) species contain one of two genes at MAT1 (MAT1-1-1 or MAT1-2-1) and only isolates of opposite mating type are sexually compatible. In contrast, self-fertile (homothallic) species contain both MAT genes at MAT1. Knowledge of the reproductive capacities of plant pathogens is of particular interest because recombining populations tend to be more difficult to manage in agricultural settings. In this study, we sequenced MAT1 in the heterothallic Dothideomycete fungus Cercospora beticola to gain insight into the reproductive capabilities of this important plant pathogen. In addition to the expected MAT gene at MAT1, each isolate contained fragments of both MAT1-1-1 and MAT1-2-1 at ostensibly random loci across the genome. When MAT fragments from each locus were manually assembled, they reconstituted MAT1-1-1 and MAT1-2-1 exons with high identity, suggesting a retroposition event occurred in a homothallic ancestor in which both MAT genes were fused. The genome sequences of related taxa revealed that the MAT gene fragment pattern of Cercospora zeae-maydis was analogous to C. beticola. In contrast, the genome of more distantly related Mycosphaerella graminicola did not contain MAT fragments. Although fragments occurred in syntenic regions of the C. beticola and C. zeae-maydis genomes, each MAT fragment was more closely related to the intact MAT gene of the same species. Taken together, these data suggest MAT genes fragmented after divergence of M. graminicola from the remaining taxa, and concerted evolution functioned to homogenize MAT fragments and MAT genes in each species.
  165. Bracken-Grissom, H., Collins, A. G., Collins, T., Crandall, K., Distel, D., Dunn, C., Giribet, G., et al. (2014). The Global Invertebrate Genomics Alliance (GIGA): developing community resources to study diverse invertebrate genomes. JOURNAL OF HEREDITY, 105(1), 1–18.
    Over 95% of all metazoan (animal) species comprise the invertebrates, but very few genomes from these organisms have been sequenced. We have, therefore, formed a Global Invertebrate Genomics Alliance (GIGA). Our intent is to build a collaborative network of diverse scientists to tackle major challenges (e.g., species selection, sample collection and storage, sequence assembly, annotation, analytical tools) associated with genome/transcriptome sequencing across a large taxonomic spectrum. We aim to promote standards that will facilitate comparative approaches to invertebrate genomics and collaborations across the international scientific community. Candidate study taxa include species from Porifera, Ctenophora, Cnidaria, Placozoa, Mollusca, Arthropoda, Echinodermata, Annelida, Bryozoa, and Platyhelminthes, among others. GIGA will target 7000 noninsect/nonnematode species, with an emphasis on marine taxa because of the unrivaled phyletic diversity in the oceans. Priorities for selecting invertebrates for sequencing will include, but are not restricted to, their phylogenetic placement; relevance to organismal, ecological, and conservation research; and their importance to fisheries and human health. We highlight benefits of sequencing both whole genomes (DNA) and transcriptomes and also suggest policies for genomic-level data access and sharing based on transparency and inclusiveness. The GIGA Web site has been launched to facilitate this collaborative venture.
  166. Verheggen, K., Barsnes, H., & Martens, L. (2014). Distributed computing and data storage in proteomics: many hands make light work, and a stronger memory. PROTEOMICS, 14(4-5), 367–377.
    Modern day proteomics generates ever more complex data, causing the requirements on the storage and processing of such data to outgrow the capacity of most desktop computers. To cope with the increased computational demands, distributed architectures have gained substantial popularity in the recent years. In this review, we provide an overview of the current techniques for distributed computing, along with examples of how the techniques are currently being employed in the field of proteomics. We thus underline the benefits of distributed computing in proteomics, while also pointing out the potential issues and pitfalls involved.
  167. Kelchtermans, P., Bittremieux, W., De Grave, K., Degroeve, S., Ramon, J., Laukens, K., Valkenborg, D., et al. (2014). Machine learning applications in proteomics research: how the past can boost the future. PROTEOMICS, 14(4-5), 353–366.
    Machine learning is a subdiscipline within artificial intelligence that focuses on algorithms that allow computers to learn solving a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, given that enough data are available to train and subsequently evaluate an algorithm on. Since MS-based proteomics has no shortage of complex problems, and since publicly available data are becoming available in ever growing amounts, machine learning is fast becoming a very popular tool in the field. We here therefore present an overview of the different applications of machine learning in proteomics that together cover nearly the entire wet- and dry-lab workflow, and that address key bottlenecks in experiment planning and design, as well as in data processing and analysis.
  168. Beck, F., Geiger, J., Gambaryan, S., Veit, J., Vaudel, M., Nollau, P., Kohlbacher, O., et al. (2014). Time-resolved characterization of cAMP/PKA-dependent signaling reveals that platelet inhibition is a concerted process involving multiple signaling pathways. BLOOD, 123(5), e1–e10.
    One of the most important physiological platelet inhibitors is endothelium-derived prostacyclin which stimulates the platelet cyclic adenosine monophosphate/protein kinase A (cAMP/PKA)-signaling cascade and inhibits virtually all platelet-activating key mechanisms. Using quantitative mass spectrometry, we analyzed time-resolved phosphorylation patterns in human platelets after treatment with iloprost, a stable prostacyclin analog, for 0, 10, 30, and 60 seconds to characterize key mediators of platelet inhibition and activation in 3 independent biological replicates. We quantified over 2700 different phosphorylated peptides of which 360 were significantly regulated upon stimulation. This comprehensive and time-resolved analysis indicates that platelet inhibition is a multi-pronged process involving different kinases and phosphatases as well as many previously unanticipated proteins and pathways.
  169. Muth, T., Weilnböck, L., Rapp, E., Huber, C. G., Martens, L., Vaudel, M., & Barsnes, H. (2014). DeNovoGUI: an open source graphical user interface for de novo sequencing of tandem mass spectra. JOURNAL OF PROTEOME RESEARCH, 13(2), 1143–1146.
    De novo sequencing is a popular technique in proteomics for identifying peptides from tandem mass spectra without having to rely on a protein sequence database. Despite the strong potential of de novo sequencing algorithms, their adoption threshold remains quite high. We here present a user-friendly and lightweight graphical user interface called DeNovoGUI for running parallelized versions of the freely available de novo sequencing software PepNovo+, greatly simplifying the use of de novo sequencing in proteomics. Our platform-independent software is freely available under the permissible Apache2 open source license. Source code, binaries, and additional documentation are available at
  170. Vaudel, M., Sickmann, A., & Martens, L. (2014). Introduction to opportunities and pitfalls in functional mass spectrometry based proteomics. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS, 1844(1A), 12–20.
  171. Liebens, V., Defraine, V., Van der Leyden, A., De Groote, V. N., Fierro, C., Beullens, S., Verstraeten, N., et al. (2014). A putative de-N-acetylase of the PIG-L superfamily affects fluoroquinolone tolerance in Pseudomonas aeruginosa. PATHOGENS AND DISEASE, 71(1), 39–54.
    A major cause of treatment failure of infections caused by Pseudomonas aeruginosa is the presence of antibiotic-insensitive persister cells. The mechanism of persister formation in P.aeruginosa is largely unknown, and so far, only few genetic determinants have been linked to P.aeruginosa persistence. Based on a previous high-throughput screening, we here present dnpA (de-N-acetylase involved in persistence; gene locus PA14_66140/PA5002) as a new gene involved in noninherited fluoroquinolone tolerance in P.aeruginosa. Fluoroquinolone tolerance of a dnpA mutant is strongly reduced both in planktonic culture and in a biofilm model, whereas overexpression of dnpA in the wild-type strain increases the persister fraction. In addition, the susceptibility of the dnpA mutant to different classes of antibiotics is not affected. dnpA is part of the conserved LPS core oligosaccharide biosynthesis gene cluster. Based on primary sequence analysis, we predict that DnpA is a de-N-acetylase, acting on an unidentified substrate. Site-directed mutagenesis suggests that this enzymatic activity is essential for DnpA-mediated persistence. A transcriptome analysis indicates that DnpA primarily affects the expression of genes involved in surface-associated processes. We discuss the implications of these findings for future antipersister therapies targeted at chronic P.aeruginosa infections.
  172. Spaepen, S., Bossuyt, S., Engelen, K., Marchal, K., & Vanderleyden, J. (2014). Phenotypical and molecular responses of Arabidopsis thaliana roots as a result of inoculation with the auxin-producing bacterium Azospirillum brasilense. NEW PHYTOLOGIST, 201(3), 850–861.
    The auxin-producing bacterium Azospirillum brasilense Sp245 can promote the growth of several plant species. The model plant Arabidopsis thaliana was chosen as host plant to gain an insight into the molecular mechanisms that govern this interaction. The determination of differential gene expression in Arabidopsis roots after inoculation with either A. brasilense wild-type or an auxin biosynthesis mutant was achieved by microarray analysis. Arabidopsis thaliana inoculation with A. brasilense wild-type increases the number of lateral roots and root hairs, and elevates the internal auxin concentration in the plant. The A. thaliana root transcriptome undergoes extensive changes on A. brasilense inoculation, and the effects are more pronounced at later time points. The wild-type bacterial strain induces changes in hormone- and defense-related genes, as well as in plant cell wall-related genes. The A. brasilense mutant, however, does not elicit these transcriptional changes to the same extent. There are qualitative and quantitative differences between A. thaliana responses to the wild-type A. brasilense strain and the auxin biosynthesis mutant strain, based on both phenotypic and transcriptomic data. This illustrates the major role played by auxin in the Azospirillum-Arabidopsis interaction, and possibly also in other bacterium-plant interactions.
  173. Myburg, A. A., Grattapaglia, D., Tuskan, G. A., Hellsten, U., Hayes, R. D., Grimwood, J., Jenkins, J., et al. (2014). The genome of Eucalyptus grandis. NATURE, 510(7505), 356–362.
    Eucalypts are the world's most widely planted hardwood trees. Their outstanding diversity, adaptability and growth have made them a global renewable resource of fibre and energy. We sequenced and assembled >94% of the 640-megabase genome of Eucalyptus grandis. Of 36,376 predicted protein-coding genes, 34% occur in tandem duplications, the largest proportion thus far in plant genomes. Eucalyptus also shows the highest diversity of genes for specialized metabolites such as terpenes that act as chemical defence and provide unique pharmaceutical oils. Genome sequencing of the E. grandis sister species E. globulus and a set of inbred E. grandis tree genomes reveals dynamic genome evolution and hotspots of inbreeding depression. The E. grandis genome is the first reference for the eudicot order Myrtales and is placed here sister to the eurosids. This resource expands our understanding of the unique biology of large woody perennials and provides a powerful tool to accelerate comparative biology, breeding and biotechnology.
  174. Vanneste, K., Maere, S., & Van de Peer, Y. (2014). Tangled up in two: a burst of genome duplications at the end of the Cretaceous and the consequences for plant evolution. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 369(1648).
    Genome sequencing has demonstrated that besides frequent small-scale duplications, large-scale duplication events such as whole genome duplications (WGDs) are found on many branches of the evolutionary tree of life. Especially in the plant lineage, there is evidence for recurrent WGDs, and the ancestor of all angiosperms was in fact most likely a polyploid species. The number of WGDs found in sequenced plant genomes allows us to investigate questions about the roles of WGDs that were hitherto impossible to address. An intriguing observation is that many plant WGDs seem associated with periods of increased environmental stress and/or fluctuations, a trend that is evident for both present-day polyploids and palaeopolyploids formed around the Cretaceous-Palaeogene (K-Pg) extinction at 66 Ma. Here, we revisit the WGDs in plants that mark the K-Pg boundary, and discuss some specific examples of biological innovations and/or diversifications that may be linked to these WGDs. We review evidence for the processes that could have contributed to increased polyploid establishment at the K-Pg boundary, and discuss the implications on subsequent plant evolution in the Cenozoic.
  175. Mushthofa, M., Torres Torres, G. A., Van de Peer, Y., Marchal, K., & De Cock, M. (2014). ASP-G: an ASP-based method for finding attractors in genetic regulatory networks. BIOINFORMATICS, 30(21), 3086–3092.
    Motivation: Boolean network models are suitable to simulate GRNs in the absence of detailed kinetic information. However, reducing the biological reality implies making assumptions on how genes interact (interaction rules) and how their state is updated during the simulation (update scheme). The exact choice of the assumptions largely determines the outcome of the simulations. In most cases, however, the biologically correct assumptions are unknown. An ideal simulation thus implies testing different rules and schemes to determine those that best capture an observed biological phenomenon. This is not trivial because most current methods to simulate Boolean network models of GRNs and to compute their attractors impose specific assumptions that cannot be easily altered, as they are built into the system. Results: To allow for a more flexible simulation framework, we developed ASP-G. We show the correctness of ASP-G in simulating Boolean network models and obtaining attractors under different assumptions by successfully recapitulating the detection of attractors of previously published studies. We also provide an example of how performing simulation of network models under different settings helps determine the assumptions under which a certain conclusion holds. The main added value of ASP-G is in its modularity and declarativity, making it more flexible and less error-prone than traditional approaches. The declarative nature of ASP-G comes at the expense of being slower than the more dedicated systems but still achieves a good efficiency with respect to computational time. Availability and implementation: The source code of ASP-G is available at
  176. Mushthofa, M., Schockaert, S., & De Cock, M. (2014). A finite-valued solver for disjunctive fuzzy answer set programs. In T. Schaub, G. Friedrich, & B. O’Sullivan (Eds.), Frontiers in Artificial Intelligence and Applications (Vol. 263, pp. 645–650). Presented at the 21st European conference on Artificial Intelligence (ECAI 2014), Amsterdam, The Netherlands: IOS Press.
    Fuzzy Answer Set Programming (FASP) is a declarative programming paradigm which extends the flexibility and expressiveness of classical Answer Set Programming (ASP), with the aim of modeling continuous application domains. In contrast to the availability of efficient ASP solvers, there have been few attempts at implementing FASP solvers. In this paper, we propose an implementation of FASP based on a reduction to classical ASP. We also develop a prototype implementation of this method. To the best of our knowledge, this is the first solver for disjunctive FASP programs. Moreover, we experimentally show that our solver performs well in comparison to an existing solver (under reasonable assumptions) for the more restrictive class of normal FASP programs.
  177. Crauwels, S., Zhu, B., Steensels, J., Busschaert, P., De Samblanx, G., Marchal, K., Willems, K. A., et al. (2014). Assessing genetic diversity among Brettanomyces yeasts by DNA fingerprinting and whole-genome sequencing. APPLIED AND ENVIRONMENTAL MICROBIOLOGY, 80(14), 4398–4413.
    Brettanomyces yeasts, with the species Brettanomyces (Dekkera) bruxellensis being the most important one, are generally reported to be spoilage yeasts in the beer and wine industry due to the production of phenolic off flavors. However, B. bruxellensis is also known to be a beneficial contributor in certain fermentation processes, such as the production of certain specialty beers. Nevertheless, despite its economic importance, Brettanomyces yeasts remain poorly understood at the genetic and genomic levels. In this study, the genetic relationship between more than 50 Brettanomyces strains from all presently known species and from several sources was studied using a combination of DNA fingerprinting techniques. This revealed an intriguing correlation between the B. bruxellensis fingerprints and the respective isolation source. To further explore this relationship, we sequenced a (beneficial) beer isolate of B. bruxellensis (VIB X9085; ST05.12/22) and compared its genome sequence with the genome sequences of two wine spoilage strains (AWRI 1499 and CBS 2499). ST05.12/22 was found to be substantially different from both wine strains, especially at the level of single nucleotide polymorphisms (SNPs). In addition, there were major differences in the genome structures between the strains investigated, including the presence of large duplications and deletions. Gene content analysis revealed the presence of 20 genes which were present in both wine strains but absent in the beer strain, including many genes involved in carbon and nitrogen metabolism, and vice versa, no genes that were missing in both AWRI 1499 and CBS 2499 were found in ST05.12/22. Together, this study provides tools to discriminate Brettanomyces strains and provides a first glimpse at the genetic diversity and genome plasticity of B. bruxellensis.
  178. Lindemose, S., Jensen, M. K., Van de Velde, J., O’Shea, C., Heyndrickx, K., Workman, C. T., Vandepoele, K., et al. (2014). A DNA-binding-site landscape and regulatory network analysis for NAC transcription factors in Arabidopsis thaliana. NUCLEIC ACIDS RESEARCH, 42(12), 7681–7693.
    Target gene identification for transcription factors is a prerequisite for the systems wide understanding of organismal behaviour. NAM-ATAF1/2-CUC2 (NAC) transcription factors are amongst the largest transcription factor families in plants, yet limited data exist from unbiased approaches to resolve the DNA-binding preferences of individual members. Here, we present a TF-target gene identification workflow based on the integration of novel protein binding microarray data with gene expression and multi-species promoter sequence conservation to identify the DNA-binding specificities and the gene regulatory networks of 12 NAC transcription factors. Our data offer specific single-base resolution fingerprints for most TFs studied and indicate that NAC DNA-binding specificities might be predicted from their DNA-binding domain's sequence. The developed methodology, including the application of complementary functional genomics filters, makes it possible to translate, for each TF, protein binding microarray data into a set of high-quality target genes. With this approach, we confirm NAC target genes reported from independent in vivo analyses. We emphasize that candidate target gene sets together with the workflow associated with functional modules offer a strong resource to unravel the regulatory potential of NAC genes and that this workflow could be used to study other families of transcription factors.
  179. Defauw, A., Vandersickel, N., Dawyndt, P., & Panfilov, A. (2014). Small size ionic heterogeneities in the human heart can attract rotors. AMERICAN JOURNAL OF PHYSIOLOGY-HEART AND CIRCULATORY PHYSIOLOGY, 307(10), H1456–H1468.
    Rotors occurring in the heart underlie the mechanisms of cardiac arrhythmias. Answering the question whether or not the location of rotors is related to local properties of cardiac tissue has important practical applications. This is because ablation of rotors has been shown to be an effective way to fight cardiac arrhythmias. In this study, we investigate, in silico, the dynamics of rotors in 2D and in an anatomical model of human ventricles using a TNNP model for ventricular cells. We study the effect of small size ionic heterogeneities, similar to those measured experimentally. It is shown that such heterogeneities can not only anchor, but can also attract rotors rotating at a substantial distance from the heterogeneity. This attraction distance depends on the extent of the heterogeneities and can be as large as 5-6 cm in realistic conditions. We conclude that small size ionic heterogeneities can be preferred localization points for rotors, and discuss their possible mechanism and value for applications.
  180. Houbraken, M., Demeyer, S., Michoel, T., Audenaert, P., Colle, D., & Pickavet, M. (2014). The index-based subgraph matching algorithm with general symmetries (ISMAGS): exploiting symmetry for faster subgraph enumeration. PLOS ONE, 9(5).
    Subgraph matching algorithms are used to find and enumerate specific interconnection structures in networks. By enumerating these specific structures/subgraphs, the fundamental properties of the network can be derived. More specifically in biological networks, subgraph matching algorithms are used to discover network motifs, specific patterns occurring more often than expected by chance. Finding these network motifs yields information on the underlying biological relations modelled by the network. In this work, we present the Index-based Subgraph Matching Algorithm with General Symmetries (ISMAGS), an improved version of the Index-based Subgraph Matching Algorithm (ISMA). ISMA quickly finds all instances of a predefined motif in a network by intelligently exploring the search space and taking into account easily identifiable symmetric structures. However, more complex symmetries (possibly involving switching multiple nodes) are not taken into account, resulting in superfluous output. ISMAGS overcomes this problem by using a customised symmetry analysis phase to detect all symmetric structures in the network motif subgraphs. These structures are then converted to symmetry-breaking constraints used to prune the search space and speed up calculations. The performance of the algorithm was tested on several types of networks (biological, social and computer networks) for various subgraphs with a varying degree of symmetry. For subgraphs with complex (multi-node) symmetric structures, high speed-up factors are obtained as the search space is pruned by the symmetry-breaking constraints. For subgraphs with no or simple symmetric structures, ISMAGS still reduces computation times by optimising set operations. Moreover, the calculated list of subgraph instances is minimal as it contains no instances that differ by only a subgraph symmetry. An implementation of the algorithm is freely available at
  181. Blanc-Mathieu, R., Verhelst, B., Derelle, E., Rombauts, S., Bouget, F.-Y., Carre, I., Chateau, A., et al. (2014). An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina de novo assemblies. BMC GENOMICS, 15.
    Background: Cost effective next generation sequencing technologies now enable the production of genomic datasets for many novel planktonic eukaryotes, representing an understudied reservoir of genetic diversity. O. tauri is the smallest free-living photosynthetic eukaryote known to date, a coccoid green alga that was first isolated in 1995 in a lagoon by the Mediterranean Sea. Its simple features, ease of culture and the sequencing of its 13 Mb haploid nuclear genome have promoted this microalga as a new model organism for cell biology. Here, we investigated the quality of genome assemblies of Illumina GAIIx 75 bp paired end reads from Ostreococcus tauri, thereby also improving the existing assembly and showing the genome to be stably maintained in culture. Results: The 3 assemblers used, ABySS, CLCBio and Velvet, produced 95% complete genomes in 1402 to 2080 scaffolds with a very low rate of misassembly. Reciprocally, these assemblies improved the original genome assembly by filling in 930 gaps. Combined with additional analysis of raw reads and PCR sequencing effort, 1194 gaps have been solved in total adding up to 460 kb of sequence. Mapping of RNAseq Illumina data on this updated genome led to a twofold reduction in the proportion of multi-exon protein coding genes, representing 19% of the total 7699 protein coding genes. The comparison of the DNA extracted in 2001 and 2009 revealed the fixation of 8 single nucleotide substitutions and 2 deletions during the approximately 6000 generations in the lab. The deletions either knocked out or truncated two predicted transmembrane proteins, including a glutamate receptor like gene. Conclusion: High coverage (>80 fold) paired end Illumina sequencing enables a high quality 95% complete genome assembly of a compact 13 Mb haploid eukaryote. This genome sequence has remained stable for 6000 generations of lab culture.
  182. Le Van, Thanh, Van Leeuwen, M., Nijssen, S., Fierro, A. C., Marchal, K., & De Raedt, L. (2014). Ranked tiling. In T. Calders, F. Esposito, E. Hüllermeier, & R. Meo (Eds.), Lecture Notes in Artificial Intelligence (Vol. 8725, pp. 98–113). Presented at the 7th European conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), Berlin, Germany: Springer.
    Tiling is a well-known pattern mining technique. Traditionally, it discovers large areas of ones in binary databases or matrices, where an area is defined by a set of rows and a set of columns. In this paper, we introduce the novel problem of ranked tiling, which is concerned with finding interesting areas in ranked data. In this data, each transaction defines a complete ranking of the columns. Ranked data occurs naturally in applications like sports or other competitions. It is also a useful abstraction when dealing with numeric data in which the rows are incomparable. We introduce a scoring function for ranked tiling, as well as an algorithm using constraint programming and optimization principles. We empirically evaluate the approach on both synthetic and real-life datasets, and demonstrate the applicability of the framework in several case studies. One case study involves a heterogeneous dataset concerning the discovery of biomarkers for different subtypes of breast cancer patients. An analysis of the tiles by a domain expert shows that our approach can lead to the discovery of novel insights.
  183. De Coninck, A., Fostier, J., Maenhout, S., & De Baets, B. (2014). A high performance computing approach for genomic prediction. In N. Gengler (Ed.), 19th National symposium on Applied Biological Sciences, Proceedings (Vol. 79, pp. 115–119). Presented at the 19th National symposium on Applied Biological Sciences.
    In the field of genomic prediction, genotypes of animals or plants are used to predict either phenotypic properties of new crosses or breeding values (EBVs) for detecting superior parents. Since quantitative traits of importance to breeders are mostly regulated by a large number of loci (QTL), high-density SNP markers are used to genotype individuals. The most frequently applied SNP arrays for cattle consist of 50,000 SNP markers, but even genotypes with 700,000 SNPs are already available (Cole et al., 2012). Some widely used analysis methods rely on a linear mixed model backbone (Meuwissen et al., 2001), which models the SNP marker effects as random effects, drawn from a normal distribution. The estimates for the marker effects are known as BLUP, which are linear functions of the response variates. It has been shown that when no major genes contribute to the trait, Bayesian predictions and BLUP result in approximately the same prediction accuracy for the EBVs (Hayes et al., 2009; Legarra et al., 2011; Daetwyler et al., 2013). At present the number of individuals included in the genomic prediction setting is still an order of magnitude smaller than the number of genetic markers on widely used SNP arrays, causing algorithms to directly estimate EBVs, which is in this case computationally more efficient than first estimating the marker effects (VanRaden, 2008; Misztal et al., 2009; Piepho, 2009; Shen et al., 2013). Nonetheless, it has been shown theoretically (Hayes et al., 2009) that in order to increase the prediction accuracy of the EBVs for traits with a low heritability, the number of genotyped records should increase dramatically. Most widely used implementations like synbreed (Wimmer et al., 2012) and BLUPF90 are not able to handle data sets that contain more than a few thousand individuals, since they are limited by the physical memory accessible by the computing processor. We present DAIRRy-BLUP, a parallel framework that takes advantage of a distributed-memory compute cluster in order to enable the analysis of large-scale datasets. Additionally, results on simulated data illustrate that the use of such large-scale datasets is warranted as it significantly improves the prediction accuracy of EBVs and marker effects.
  184. Vanhauwaert, Suzanne, Van Peer, G., Rihani, A., Janssens, E., Rondou, P., Lefever, S., De Paepe, A., et al. (2014). Expressed repeat elements improve RT-qPCR normalization across a wide range of zebrafish gene expression studies. PLOS ONE, 9(10).
    The selection and validation of stably expressed reference genes is a critical issue for proper RT-qPCR data normalization. In zebrafish expression studies, many commonly used reference genes are not generally applicable given their variability in expression levels under a variety of experimental conditions. Inappropriate use of these reference genes may lead to false interpretation of expression data and unreliable conclusions. In this study, we evaluated a novel normalization method in zebrafish using expressed repetitive elements (ERE) as reference targets, instead of specific protein coding mRNA targets. We assessed and compared the expression stability of a number of EREs to that of commonly used zebrafish reference genes in a diverse set of experimental conditions including a developmental time series, a set of different organs from adult fish and different treatments of zebrafish embryos including morpholino injections and administration of chemicals. Using geNorm and rank aggregation analysis we demonstrated that EREs have a higher overall expression stability compared to the commonly used reference genes. Moreover, we propose a limited set of ERE reference targets (hatn10, dna15ta1 and loopern4), that show stable expression throughout the wide range of experiments in this study, as strong candidates for inclusion as reference targets for qPCR normalization in future zebrafish expression studies. Our applied strategy to find and evaluate candidate expressed repeat elements for RT-qPCR data normalization has high potential to be used also for other species.
  185. Vandepitte, Katrien, De Meyer, T., Helsen, K., Van Acker, K., Roldàn-Ruiz, I., Mergeay, J., & Honnay, O. (2014). Rapid genetic adaptation precedes the spread of an exotic plant species. MOLECULAR ECOLOGY, 23(9), 2157–2164.
    Human activities have increasingly introduced plant species far outside their native ranges under environmental conditions that can strongly differ from those originally met. Therefore, before spreading, and potentially causing ecological and economical damage, non-native species may rapidly evolve. Evidence of genetically based adaptation during the process of becoming invasive is very scant, however, which is due to the lack of knowledge regarding the historical genetic makeup of the introduced populations and the lack of genomic resources. Capitalizing on the availability of old non-native herbarium specimens, we examined frequency shifts in genic SNPs of the Pyrenean Rocket (Sisymbrium austriacum subsp. chrysanthum), comparing the (i) native, (ii) currently spreading non-native and (iii) historically introduced gene pool. Results show strong divergence in flowering time genes during the establishment phase, indicating that rapid genetic adaptation preceded the spread of this species and possibly assisted in overcoming environmental constraints.
  186. Mensaert, K., Denil, S., Trooskens, G., Van Criekinge, W., Thas, O., & De Meyer, T. (2014). Next-generation technologies and data analytical approaches for epigenomics. ENVIRONMENTAL AND MOLECULAR MUTAGENESIS, 55(3), 155–170.
  187. Pajoro, A., Biewers, S., Dougali, E., Valentim, F. L., Mendes, M. A., Porri, A., Coupland, G., et al. (2014). The (r)evolution of gene regulatory networks controlling Arabidopsis plant reproduction: a two-decade history. JOURNAL OF EXPERIMENTAL BOTANY, 65(17), 4731–4745.
    Successful plant reproduction relies on the perfect orchestration of singular processes that culminate in the product of reproduction: the seed. The floral transition, floral organ development, and fertilization are well-studied processes and the genetic regulation of the various steps is being increasingly unveiled. Initially, based predominantly on genetic studies, the regulatory pathways were considered to be linear, but recent genome-wide analyses, using high-throughput technologies, have begun to reveal a different scenario. Complex gene regulatory networks underlie these processes, including transcription factors, microRNAs, movable factors, hormones, and chromatin-modifying proteins. Here we review recent progress in understanding the networks that control the major steps in plant reproduction, showing how new advances in experimental and computational technologies have been instrumental. As these recent discoveries were obtained using the model species Arabidopsis thaliana, we will restrict this review to regulatory networks in this important model species. However, more fragmentary information obtained from other species reveals that both the developmental processes and the underlying regulatory networks are largely conserved, making this review also of interest to those studying other plant species.
  188. Sante, T., Vergult, S., Volders, P.-J., Kloosterman, W. P., Trooskens, G., De Preter, K., Dheedene, A., et al. (2014). ViVar: a comprehensive platform for the analysis and visualization of structural genomic variation. PLOS ONE, 9(12).
    Structural genomic variations play an important role in human disease and phenotypic diversity. With the rise of high-throughput sequencing tools, mate-pair/paired-end/single-read sequencing has become an important technique for the detection and exploration of structural variation. Several analysis tools exist to handle different parts and aspects of such sequencing based structural variation analysis pipelines. A comprehensive analysis platform to handle all steps, from processing the sequencing data, to the discovery and visualization of structural variants, is missing. The ViVar platform is built to handle the discovery of structural variants, from Depth Of Coverage analysis, aberrant read pair clustering to split read analysis. ViVar provides you with powerful visualization options, enables easy reporting of results and better usability and data management. The platform facilitates the processing, analysis and visualization of structural variation based on massive parallel sequencing data, enabling the rapid identification of disease loci or genes. ViVar allows you to scale your analysis with your work load over multiple (cloud) servers, has user access control to keep your data safe and is easily expandable as analysis techniques advance.
  189. Vermeirssen, V., De Clercq, I., Van Parys, T., Van Breusegem, F., & Van de Peer, Y. (2014). Arabidopsis ensemble reverse-engineered gene regulatory network discloses interconnected transcription factors in oxidative stress. PLANT CELL, 26(12), 4656–4679.
    The abiotic stress response in plants is complex and tightly controlled by gene regulation. We present an abiotic stress gene regulatory network of 200,014 interactions for 11,938 target genes by integrating four complementary reverse-engineering solutions through average rank aggregation on an Arabidopsis thaliana microarray expression compendium. This ensemble performed the most robustly in benchmarking and greatly expands upon the availability of interactions currently reported. Besides recovering 1182 known regulatory interactions, cis-regulatory motifs and coherent functionalities of target genes corresponded with the predicted transcription factors. We provide a valuable resource of 572 abiotic stress modules of coregulated genes with functional and regulatory information, from which we deduced functional relationships for 1966 uncharacterized genes and many regulators. Using gain- and loss-of-function mutants of seven transcription factors grown under control and salt stress conditions, we experimentally validated 141 out of 271 predictions (52% precision) for 102 selected genes and mapped 148 additional transcription factor-gene regulatory interactions (49% recall). We identified an intricate core oxidative stress regulatory network where NAC13, NAC053, ERF6, WRKY6, and NAC032 transcription factors interconnect and function in detoxification. Our work shows that ensemble reverse-engineering can generate robust biological hypotheses of gene regulation in a multicellular eukaryote that can be tested by medium-throughput experimental validation.
  190. Vargas, L., Santa Brigida, A. B., Mota Filho, J. P., de Carvalho, T. G., Rojas, C. A., Vaneechoutte, D., Van Bel, M., et al. (2014). Drought tolerance conferred to sugarcane by association with Gluconacetobacter diazotrophicus: a transcriptomic view of hormone pathways. PLOS ONE, 9(12).
    Sugarcane interacts with particular types of beneficial nitrogen-fixing bacteria that provide fixed-nitrogen and plant growth hormones to host plants, promoting an increase in plant biomass. Other benefits, such as enhanced tolerance to abiotic stresses, have been reported for some diazotrophs. Here we aim to study the effects of the association between the diazotroph Gluconacetobacter diazotrophicus PAL5 and sugarcane cv. SP70-1143 during water depletion by characterizing differential transcriptome profiles of sugarcane. RNA-seq libraries were generated from roots and shoots of sugarcane plants free of endophytes that were inoculated with G. diazotrophicus and subjected to water depletion for 3 days. A sugarcane reference transcriptome was constructed and used for the identification of differentially expressed transcripts. The differential profile of non-inoculated SP70-1143 suggests that it responds to water deficit stress by the activation of drought-responsive markers and hormone pathways, such as ABA and ethylene. qRT-PCR revealed that root samples had higher levels of G. diazotrophicus 3 days after water deficit, compared to roots of inoculated plants watered normally. With prolonged drought only inoculated plants survived, indicating that SP70-1143 plants colonized with G. diazotrophicus become more tolerant to drought stress than non-inoculated plants. Strengthening this hypothesis, several gene expression responses to drought were inactivated or regulated in an opposite manner, especially in roots, when plants were colonized by the bacteria. The data suggest that colonized roots would not be suffering from stress in the same way as non-inoculated plants. On the other hand, shoots specifically activate ABA-dependent signaling genes, which could act as key elements in the drought resistance conferred by G. diazotrophicus to SP70-1143. This work reports for the first time the involvement of G. diazotrophicus in the promotion of drought tolerance in sugarcane cv. SP70-1143, and it describes the initial molecular events that may trigger the increased drought tolerance in the host plant.
  191. Ranade, S. S., Lin, Y.-C., Zuccolo, A., Van de Peer, Y., & Garcia-Gil, M. del R. (2014). Comparative in silico analysis of EST-SSRs in angiosperm and gymnosperm tree genera. BMC PLANT BIOLOGY, 14.
    Background: Simple Sequence Repeats (SSRs) derived from Expressed Sequence Tags (ESTs) belong to the expressed fraction of the genome and are important for gene regulation, recombination, DNA replication, cell cycle and mismatch repair. Here, we present a comparative analysis of the SSR motif distribution in the 5'UTR, ORF and 3'UTR fractions of ESTs across selected genera of woody trees representing gymnosperms (17 species from seven genera) and angiosperms (40 species from eight genera). Results: Our analysis supports a modest contribution of EST-SSR length to genome size in gymnosperms, while EST-SSR density was not associated with genome size in either angiosperms or gymnosperms. Multiple factors seem to have contributed to the lower abundance of EST-SSRs in gymnosperms that has resulted in a non-linear relationship with genome size diversity. The AG/CT motif was found to be the most abundant in SSRs of both angiosperms and gymnosperms, with a relative increase in AT/AT in the latter. Our data also reveal a higher abundance of hexamers across the gymnosperm genera. Conclusions: Our analysis provides the foundation for future comparative studies at the species level to unravel the evolutionary processes that control the SSR genesis and divergence between angiosperm and gymnosperm tree species.
  192. Lin, Y.-C., Boone, M., Meuris, L., Lemmens, I., Van Roy, N., Soete, A., Reumers, J., et al. (2014). Genome dynamics of the human embryonic kidney 293 lineage in response to cell biology manipulations. NATURE COMMUNICATIONS, 5.
    The HEK293 human cell lineage is widely used in cell biology and biotechnology. Here we use whole-genome resequencing of six 293 cell lines to study the dynamics of this aneuploid genome in response to the manipulations used to generate common 293 cell derivatives, such as transformation and stable clone generation (293T); suspension growth adaptation (293S); and cytotoxic lectin selection (293SG). Remarkably, we observe that copy number alteration detection could identify the genomic region that enabled cell survival under selective conditions (i.c. ricin selection). Furthermore, we present methods to detect human/vector genome breakpoints and a user-friendly visualization tool for the 293 genome data. We also establish that the genome structure composition is in steady state for most of these cell lines when standard cell culturing conditions are used. This resource enables novel and more informed studies with 293 cells, and we will distribute the sequenced cell lines to this effect.
  193. Jacobs, Bart, Goetghebeur, E., & Clement, L. (2014). Impact of variance components on reliability of absolute quantification using digital PCR. BMC BIOINFORMATICS, 15.
    Background: Digital polymerase chain reaction (dPCR) is an increasingly popular technology for detecting and quantifying target nucleic acids. Its advertised strength is high precision absolute quantification without needing reference curves. The standard data analytic approach follows a seemingly straightforward theoretical framework but ignores sources of variation in the data generating process. These stem from both technical and biological factors, where we distinguish features that are 1) hard-wired in the equipment, 2) user-dependent and 3) provided by manufacturers but may be adapted by the user. The impact of the corresponding variance components on the accuracy and precision of target concentration estimators presented in the literature is studied through simulation. Results: We reveal how system-specific technical factors influence accuracy as well as precision of concentration estimates. We find that a well-chosen sample dilution level and modifiable settings such as the fluorescence cut-off for target copy detection have a substantial impact on reliability and can be adapted to the sample analysed in ways that matter. User-dependent technical variation, including pipette inaccuracy and specific sources of sample heterogeneity, leads to a steep increase in uncertainty of estimated concentrations. Users can discover this through replicate experiments and derived variance estimation. Finally, the detection performance can be improved by optimizing the fluorescence intensity cut point as suboptimal thresholds reduce the accuracy of concentration estimates considerably. Conclusions: Like any other technology, dPCR is subject to variation induced by natural perturbations, systematic settings as well as user-dependent protocols. Corresponding uncertainty may be controlled with an adapted experimental design. Our findings point to modifiable key sources of uncertainty that form an important starting point for the development of guidelines on dPCR design and data analysis with correct precision bounds. Besides clever choices of sample dilution levels, experiment-specific tuning of machine settings can greatly improve results. Well-chosen data-driven fluorescence intensity thresholds in particular result in major improvements in target presence detection. We call on manufacturers to provide sufficiently detailed output data that allows users to maximize the potential of the method in their setting and obtain high precision and accuracy for their experiments.
  194. Ahmed, S., Cock, J. M., Pessia, E., Luthringer, R., Cormier, A., Robuchon, M., Sterck, L., et al. (2014). A haploid system of sex determination in the brown alga Ectocarpus sp. CURRENT BIOLOGY, 24(17), 1945–1957.
    Background: A common feature of most genetic sex-determination systems studied so far is that sex is determined by nonrecombining genomic regions, which can be of various sizes depending on the species. These regions have evolved independently and repeatedly across diverse groups. A number of such sex-determining regions (SDRs) have been studied in animals, plants, and fungi, but very little is known about the evolution of sexes in other eukaryotic lineages. Results: We report here the sequencing and genomic analysis of the SDR of Ectocarpus, a brown alga that has been evolving independently from plants, animals, and fungi for over one giga-annum. In Ectocarpus, sex is expressed during the haploid phase of the life cycle, and both the female (U) and the male (V) sex chromosomes contain nonrecombining regions. The U and V of this species have been diverging for more than 70 mega-annum, yet gene degeneration has been modest, and the SDR is relatively small, with no evidence for evolutionary strata. These features may be explained by the occurrence of strong purifying selection during the haploid phase of the life cycle and the low level of sexual dimorphism. V is dominant over U, suggesting that femaleness may be the default state, adopted when the male haplotype is absent. Conclusions: The Ectocarpus UV system has clearly had a distinct evolutionary trajectory not only to the well-studied XY and ZW systems but also to the UV systems described so far. Nonetheless, some striking similarities exist, indicating remarkable universality of the underlying processes shaping sex chromosome evolution across distant lineages.
  195. Walzer, M., Pernas, L. E., Nasso, S., Bittremieux, W., Nahnsen, S., Kelchtermans, P., Pichler, P., et al. (2014). qcML: an exchange format for quality control metrics from mass spectrometry experiments. MOLECULAR & CELLULAR PROTEOMICS, 13(8), 1905–1913.
    Quality control is increasingly recognized as a crucial aspect of mass spectrometry based proteomics. Several recent papers discuss relevant parameters for quality control and present applications to extract these from the instrumental raw data. What has been missing, however, is a standard data exchange format for reporting these performance metrics. We therefore developed the qcML format, an XML-based standard that follows the design principles of the related mzML, mzIdentML, mzQuantML, and TraML standards from the HUPO-PSI (Proteomics Standards Initiative). In addition to the XML format, we also provide tools for the calculation of a wide range of quality metrics as well as a database format and interconversion tools, so that existing LIMS systems can easily add relational storage of the quality control data to their existing schema. We here describe the qcML specification, along with possible use cases and an illustrative example of the subsequent analysis possibilities. All information about qcML is available at
  196. Chaves, I., Lin, Y.-C., Pinto-Ricardo, C., Van de Peer, Y., & Miguel, C. (2014). miRNA profiling in leaf and cork tissues of Quercus suber reveals novel miRNAs and tissue-specific expression patterns. TREE GENETICS & GENOMES, 10(3), 721–737.
    The differentiation of cork (phellem) cells from the phellogen (cork cambium) is a secondary growth process observed in the cork oak tree conferring a unique ability to produce a thick layer of cork. At present, the molecular regulators of phellem differentiation are unknown. The previously documented involvement of microRNAs (miRNAs) in the regulation of developmental processes, including secondary growth, motivated the search for these regulators in cork oak tissues. We performed deep sequencing of the small RNA fraction obtained from cork oak leaves and differentiating phellem. RNA sequences with lengths of 19-25 nt derived from the two libraries were analysed, leading to the identification of 41 families of conserved miRNAs, of which the most abundant were miR167, miR165/166, miR396 and miR159. Thirty novel miRNA candidates were also unveiled, 11 of which were unique to leaves and 13 to phellem. Northern blot detection of a set of conserved and novel miRNAs confirmed their differential expression profile. Prediction and analysis of putative miRNA target genes provided clues regarding processes taking place in leaf and phellem tissues, but further experimental work will be needed for functional characterization. In conclusion, we here provide a first characterization of the miRNA population in a Fagaceae species, and the comparative analysis of miRNA expression in leaf and phellem libraries represents an important step to uncovering specific regulatory networks controlling phellem differentiation.
  197. Choulet, F., Alberti, A., Theil, S., Glover, N., Barbe, V., Daron, J., Pingault, L., et al. (2014). Structural and functional partitioning of bread wheat chromosome 3B. SCIENCE, 345(6194).
    We produced a reference sequence of the 1-gigabase chromosome 3B of hexaploid bread wheat. By sequencing 8452 bacterial artificial chromosomes in pools, we assembled a sequence of 774 megabases carrying 5326 protein-coding genes, 1938 pseudogenes, and 85% of transposable elements. The distribution of structural and functional features along the chromosome revealed partitioning correlated with meiotic recombination. Comparative analyses indicated high wheat-specific inter- and intrachromosomal gene duplication activities that are potential sources of variability for adaptation. In addition to providing a better understanding of the organization, function, and evolution of a large and polyploid genome, the availability of a high-quality sequence anchored to genetic maps will accelerate the identification of genes underlying important agronomic traits.
  198. Sonnhammer, E. L., Gabaldón, T., da Silva, A. W. S., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P. D., et al. (2014). Big data and other challenges in the quest for orthologs. BIOINFORMATICS.
    Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application. Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third 'Quest for Orthologs' meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes. The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking.
  199. Steyaert, Sandra, Van Criekinge, W., De Paepe, A., Denil, S., Mensaert, K., Vandepitte, K., Vanden Berghe, W., et al. (2014). SNP-guided identification of monoallelic DNA-methylation events from enrichment-based sequencing data. NUCLEIC ACIDS RESEARCH, 42(20).
    Monoallelic gene expression is typically initiated early in the development of an organism. Dysregulation of monoallelic gene expression has already been linked to several non-Mendelian inherited genetic disorders. In humans, DNA-methylation is deemed to be an important regulator of monoallelic gene expression, but only few examples are known. One important reason is that current, cost-affordable truly genome-wide methods to assess DNA-methylation are based on sequencing post-enrichment. Here, we present a new methodology based on classical population genetic theory, i.e. the Hardy-Weinberg theorem, that combines methylomic data from MethylCap-seq with associated SNP profiles to identify monoallelically methylated loci. Applied on 334 MethylCap-seq samples of very diverse origin, this resulted in the identification of 80 genomic regions featured by monoallelic DNA-methylation. Of these 80 loci, 49 are located in genic regions of which 25 have already been linked to imprinting. Further analysis revealed statistically significant enrichment of these loci in promoter regions, further establishing the relevance and usefulness of the method. Additional validation was done using both 14 whole-genome bisulfite sequencing data sets and 16 mRNA-seq data sets. Importantly, the developed approach can be easily applied to other enrichment-based sequencing technologies, like the ChIP-seq-based identification of monoallelic histone modifications.
  200. Kyndt, T., Denil, S., Bauters, L., Van Criekinge, W., & De Meyer, T. (2014). Systemic suppression of the shoot metabolism upon rice root nematode infection. PLOS ONE, 9(9).
    Hirschmanniella oryzae is the most common plant-parasitic nematode in flooded rice cultivation systems. These migratory animals penetrate the plant roots and feed on the root cells, creating large cavities, extensive root necrosis and rotting. The objective of this study was to investigate the systemic response of the rice plant upon root infection by this nematode. RNA sequencing was applied on the above-ground parts of the rice plants at 3 and 7 days post inoculation. The data revealed significant modifications in the primary metabolism of the plant shoot, with a general suppression of for instance chlorophyll biosynthesis, the brassinosteroid pathway, and amino acid production. In the secondary metabolism, we detected a repression of the isoprenoid and shikimate pathways. These molecular changes can have dramatic consequences for the growth and yield of the rice plants, and could potentially change their susceptibility to above-ground pathogens and pests.
  201. Heyndrickx, K., Van de Velde, J., Wang, C., Weigel, D., & Vandepoele, K. (2014). A functional and evolutionary perspective on transcription factor binding in Arabidopsis thaliana. PLANT CELL, 26(10), 3894–3910.
    Understanding the mechanisms underlying gene regulation is paramount to comprehend the translation from genotype to phenotype. The two are connected by gene expression, and it is generally thought that variation in transcription factor (TF) function is an important determinant of phenotypic evolution. We analyzed publicly available genome-wide chromatin immunoprecipitation experiments for 27 TFs in Arabidopsis thaliana and constructed an experimental network containing 46,619 regulatory interactions and 15,188 target genes. We identified hub targets and highly occupied target (HOT) regions, which are enriched for genes involved in development, stimulus responses, signaling, and gene regulatory processes in the currently profiled network. We provide several lines of evidence that TF binding at plant HOT regions is functional, in contrast to that in animals, and not merely the result of accessible chromatin. HOT regions harbor specific DNA motifs, are enriched for differentially expressed genes, and are often conserved across crucifers and dicots, even though they are not under higher levels of purifying selection than non-HOT regions. Distal bound regions are under purifying selection as well and are enriched for a chromatin state showing regulation by the Polycomb repressive complex. Gene expression complexity is positively correlated with the total number of bound TFs, revealing insights in the regulatory code for genes with different expression breadths. The integration of noncanonical and canonical DNA motif information yields new hypotheses on cobinding and tethering between specific TFs involved in flowering and light regulation.
  202. Van de Velde, Jan, Heyndrickx, K., & Vandepoele, K. (2014). Inference of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. PLANT CELL, 26(7), 2729–2745.
    Transcriptional regulation plays an important role in establishing gene expression profiles during development or in response to (a)biotic stimuli. Transcription factor binding sites (TFBSs) are the functional elements that determine transcriptional activity, and the identification of individual TFBS in genome sequences is a major goal in inferring regulatory networks. We have developed a phylogenetic footprinting approach for the identification of conserved noncoding sequences (CNSs) across 12 dicot plants. Whereas both alignment and non-alignment-based techniques were applied to identify functional motifs in a multispecies context, our method accounts for incomplete motif conservation as well as high sequence divergence between related species. We identified 69,361 footprints associated with 17,895 genes. Through the integration of known TFBS obtained from the literature and experimental studies, we used the CNSs to compile a gene regulatory network in Arabidopsis thaliana containing 40,758 interactions, of which two-thirds act through binding events located in DNase I hypersensitive sites. This network shows significant enrichment toward in vivo targets of known regulators, and its overall quality was confirmed using five different biological validation metrics. Finally, through the integration of detailed expression and function information, we demonstrate how static CNSs can be converted into condition-dependent regulatory networks, offering opportunities for regulatory gene annotation.
  203. Vanneste, Kevin, Baele, G., Maere, S., & Van de Peer, Y. (2014). Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary. GENOME RESEARCH, 24(8), 1334–1347.
    Ancient whole-genome duplications (WGDs), also referred to as paleopolyploidizations, have been reported in most evolutionary lineages. Their attributed role remains a major topic of discussion, ranging from an evolutionary dead end to a road toward evolutionary success, with evidence supporting both fates. Previously, based on dating WGDs in a limited number of plant species, we found a clustering of angiosperm paleopolyploidizations around the Cretaceous-Paleogene (K-Pg) extinction event about 66 million years ago. Here we revisit this finding, which has proven controversial, by combining genome sequence information for many more plant lineages and using more sophisticated analyses. We include 38 full genome sequences and three transcriptome assemblies in a Bayesian evolutionary analysis framework that incorporates uncorrelated relaxed clock methods and fossil uncertainty. In accordance with earlier findings, we demonstrate a strongly nonrandom pattern of genome duplications over time with many WGDs clustering around the K-Pg boundary. We interpret these results in the context of recent studies on invasive polyploid plant species, and suggest that polyploid establishment is promoted during times of environmental stress. We argue that considering the evolutionary potential of polyploids in light of the environmental and ecological conditions present around the time of polyploidization could mitigate the stark contrast in the proposed evolutionary fates of polyploids.
  204. Vandenbussche, Filip, Tilbrook, K., Fierro, A. C., Marchal, K., Poelman, D., Van Der Straeten, D., & Ulm, R. (2014). Photoreceptor-mediated bending towards UV-B in Arabidopsis. MOLECULAR PLANT, 7(6), 1041–1052.
    Plants reorient their growth towards light to optimize photosynthetic light capture, a process known as phototropism. Phototropins are the photoreceptors essential for phototropic growth towards blue and ultraviolet-A (UV-A) light. Here we detail a phototropic response towards UV-B in etiolated Arabidopsis seedlings. We report that early differential growth is mediated by phototropins but clear phototropic bending to UV-B is maintained in phot1 phot2 double mutants. We further show that this phototropin-independent phototropic response to UV-B requires the UV-B photoreceptor UVR8. Broad UV-B-mediated repression of auxin-responsive genes suggests that UVR8 regulates directional bending by affecting auxin signaling. Kinetic analysis shows that UVR8-dependent directional bending occurs later than the phototropin response. We conclude that plants may use the full short-wavelength spectrum of sunlight to efficiently reorient photosynthetic tissue with incoming light.
  205. Vercruyssen, L., Verkest, A., Gonzalez Sanchez, N., Heyndrickx, K., Eeckhout, D., Han, S.-K., Jégu, T., et al. (2014). ANGUSTIFOLIA3 binds to SWI/SNF chromatin remodeling complexes to regulate transcription during Arabidopsis leaf development. PLANT CELL, 26(1), 210–229.
    The transcriptional coactivator ANGUSTIFOLIA3 (AN3) stimulates cell proliferation during Arabidopsis thaliana leaf development, but the molecular mechanism is largely unknown. Here, we show that inducible nuclear localization of AN3 during initial leaf growth results in differential expression of important transcriptional regulators, including GROWTH REGULATING FACTORs (GRFs). Chromatin purification further revealed the presence of AN3 at the loci of GRF5, GRF6, CYTOKININ RESPONSE FACTOR2, CONSTANS-LIKE5 (COL5), HECATE1 (HEC1), and ARABIDOPSIS RESPONSE REGULATOR4 (ARR4). Tandem affinity purification of protein complexes using AN3 as bait identified plant SWITCH/SUCROSE NONFERMENTING (SWI/SNF) chromatin remodeling complexes formed around the ATPases BRAHMA (BRM) or SPLAYED. Moreover, SWI/SNF ASSOCIATED PROTEIN 73B (SWP73B) is recruited by AN3 to the promoters of GRF5, GRF3, COL5, and ARR4, and both SWP73B and BRM occupy the HEC1 promoter. Furthermore, we show that AN3 and BRM genetically interact. The data indicate that AN3 associates with chromatin remodelers to regulate transcription. In addition, modification of SWI3C expression levels increases leaf size, underlining the importance of chromatin dynamics for growth regulation. Our results place the SWI/SNF-AN3 module as a major player at the transition from cell proliferation to cell differentiation in a developing leaf.
  206. Verkest, A., Abeel, T., Heyndrickx, K., Van Leene, J., Lanz, C., Van De Slijke, E., … De Jaeger, G. (2014). A generic tool for transcription factor target gene discovery in Arabidopsis cell suspension cultures based on tandem chromatin affinity purification. PLANT PHYSIOLOGY, 164(3), 1122–1133.
    Genome-wide identification of transcription factor (TF) binding sites is pivotal to our understanding of gene expression regulation. Although much progress has been made in the determination of potential binding regions of proteins by chromatin immunoprecipitation, this method has some inherent limitations regarding DNA enrichment efficiency and antibody necessity. Here, we report an alternative strategy for assaying in vivo TF-DNA binding in Arabidopsis (Arabidopsis thaliana) cells by tandem chromatin affinity purification (TChAP). Evaluation of TChAP using the E2Fa TF and comparison with traditional chromatin immunoprecipitation and single chromatin affinity purification illustrates the suitability of TChAP and provides a resource for exploring the E2Fa transcriptional network. Integration with transcriptome, cis-regulatory element, functional enrichment, and coexpression network analyses demonstrates the quality of the E2Fa TChAP sequencing data and validates the identification of new direct E2Fa targets. TChAP enhances both TF target mapping throughput, by circumventing issues related to antibody availability, and output, by improving DNA enrichment efficiency.
  207. Yao, Yao, Marchal, K., & Van de Peer, Y. (2014). Improving the adaptability of simulated evolutionary swarm robots in dynamically changing environments. PLOS ONE, 9(3).
    One of the important challenges in the field of evolutionary robotics is the development of systems that can adapt to a changing environment. However, the ability to adapt to unknown and fluctuating environments is not straightforward. Here, we explore the adaptive potential of simulated swarm robots that contain a genomic encoding of a bio-inspired gene regulatory network (GRN). An artificial genome is combined with a flexible agent-based system, representing the activated part of the regulatory network that transduces environmental cues into phenotypic behaviour. Using an artificial life simulation framework that mimics a dynamically changing environment, we show that separating the static from the conditionally active part of the network contributes to a better adaptive behaviour. Furthermore, in contrast with most hitherto developed ANN-based systems that need to re-optimize their complete controller network from scratch each time they are subjected to novel conditions, our system uses its genome to store GRNs whose performance was optimized under a particular environmental condition for a sufficiently long time. When subjected to a new environment, the previous condition-specific GRN might become inactivated, but remains present. This ability to store 'good behaviour' and to disconnect it from the novel rewiring that is essential under a new condition allows faster re-adaptation if any of the previously observed environmental conditions is reencountered. As we show here, applying these evolutionary-based principles leads to accelerated and improved adaptive evolution in a non-stable environment.
  208. Zamariola, L., De Storme, N., Vannerum, K., Vandepoele, K., Armstrong, S. J., Franklin, F. C. H., & Geelen, D. (2014). SHUGOSHINs and PATRONUS protect meiotic centromere cohesion in Arabidopsis thaliana. PLANT JOURNAL, 77(5), 782–794.
    In meiosis, chromosome cohesion is maintained by the cohesin complex, which is released in a two-step manner. At meiosis I, the meiosis-specific cohesin subunit Rec8 is cleaved by the protease Separase along chromosome arms, allowing homologous chromosome segregation. Next, in meiosis II, cleavage of the remaining centromere cohesin results in separation of the sister chromatids. In eukaryotes, protection of centromeric cohesion in meiosis I is mediated by SHUGOSHINs (SGOs). The Arabidopsis genome contains two SGO homologs. Here we demonstrate that Atsgo1 mutants show a premature loss of cohesion of sister chromatid centromeres at anaphase I and that AtSGO2 partially rescues this loss of cohesion. In addition to SGOs, we characterize PATRONUS which is specifically required for the maintenance of cohesion of sister chromatid centromeres in meiosis II. In contrast to the Atsgo1 Atsgo2 double mutant, patronus T-DNA insertion mutants only display loss of sister chromatid cohesion after meiosis I, and additionally show disorganized spindles, resulting in defects in chromosome segregation in meiosis. This leads to reduced fertility and aneuploid offspring. Furthermore, we detect aneuploidy in sporophytic tissue, indicating a role for PATRONUS in chromosome segregation in somatic cells. Thus, ploidy stability is preserved in Arabidopsis by PATRONUS during both meiosis and mitosis.
  209. Duitama, J., Sánchez-Rodriguez, A., Goovaerts, A., Pulido Tamayo, S., Hubmann, G., Foulquié-Moreno, M. R., Thevelein, J. M., et al. (2014). Improved linkage analysis of Quantitative Trait Loci using bulk segregants unveils a novel determinant of high ethanol tolerance in yeast. BMC GENOMICS, 15.
    Background: Bulk segregant analysis (BSA) coupled to high throughput sequencing is a powerful method to map genomic regions related with phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, the DNA extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors. Results: To increase the power of the BSA technology and obtain a better distinction between spuriously and truly linked regions, we developed EXPLoRA (EXtraction of over-rePresented aLleles in BSA), an algorithm for BSA data analysis that explicitly models the dependency between neighboring marker sites by exploiting the properties of linkage disequilibrium through a Hidden Markov Model (HMM). Reanalyzing a BSA dataset for high ethanol tolerance in yeast allowed reliably identifying QTLs linked to this phenotype that could not be identified with statistical significance in the original study. Experimental validation of one of the least pronounced linked regions, by identifying its causative gene VPS70, confirmed the potential of our method. Conclusions: EXPLoRA has a performance at least as good as the state-of-the-art and it is robust even at low signal-to-noise ratios, i.e. when the true linkage signal is diluted by sampling or screening errors, or when few segregants are available.
  210. Morreel, K., Saeys, Y., Dima, O., Lu, F., Van de Peer, Y., Vanholme, R., Ralph, J., et al. (2014). Systematic structural characterization of metabolites in Arabidopsis via candidate substrate-product pair networks. PLANT CELL, 26(3), 929–945.
    Plant metabolomics is increasingly used for pathway discovery and to elucidate gene function. However, the main bottleneck is the identification of the detected compounds. This is more pronounced for secondary metabolites as many of their pathways are still underexplored. Here, an algorithm is presented in which liquid chromatography-mass spectrometry profiles are searched for pairs of peaks that have mass and retention time differences corresponding with those of substrates and products from well-known enzymatic reactions. Concatenating the latter peak pairs, called candidate substrate-product pairs (CSPP), into a network displays tentative (bio)synthetic routes. Starting from known peaks, propagating the network along these routes allows the characterization of adjacent peaks leading to their structure prediction. As a proof-of-principle, this high-throughput cheminformatics procedure was applied to the Arabidopsis thaliana leaf metabolome where it allowed the characterization of the structures of 60% of the profiled compounds. Moreover, based on searches in the Chemical Abstracts Service database, the algorithm led to the characterization of 61 compounds that had never been described in plants before. The CSPP-based annotation was confirmed by independent MSn experiments. In addition to being high throughput, this method allows the annotation of low-abundance compounds that are otherwise not amenable to isolation and purification. This method will greatly advance the value of metabolomics in systems biology.
  211. Fu, Q., Fierro Gutierrez, A. C. E., Meysman, P., Sanchez Rodriguez, A., Vandepoele, K., Marchal, K., & Engelen, K. (2014). MAGIC: access portal to a cross-platform gene expression compendium for maize. BIOINFORMATICS, 30(9), 1316–1318.
    To facilitate the exploration of publicly available Zea mays expression data, we constructed a maize expression compendium, making use of an integration methodology and a consistent probe to gene mapping based on the 5b.60 sequence release of Z. mays. The compendium is made available through a web portal MAGIC that hosts a variety of analysis tools to easily browse and analyze the data. Our compendium is different from previous initiatives in combining expression values across different experiments by providing a consistent gene annotation across different platforms.
  212. Sánchez-Rodríguez, A., Tytgat, H. L., Winderickx, J., Vanderleyden, J., Lebeer, S., & Marchal, K. (2014). A network-based approach to identify substrate classes of bacterial glycosyltransferases. BMC GENOMICS, 15.
    Background: Bacterial interactions with the environment and/or host largely depend on the bacterial glycome. The specificities of a bacterial glycome are largely determined by glycosyltransferases (GTs), the enzymes involved in transferring sugar moieties from an activated donor to a specific substrate. Of these GTs their coding regions, but mainly also their substrate specificity are still largely unannotated as most sequence-based annotation flows suffer from the lack of characterized sequence motifs that can aid in the prediction of the substrate specificity. Results: In this work, we developed an analysis flow that uses sequence-based strategies to predict novel GTs, but also exploits a network-based approach to infer the putative substrate classes of these predicted GTs. Our analysis flow was benchmarked with the well-documented GT-repertoire of Campylobacter jejuni NCTC 11168 and applied to the probiotic model Lactobacillus rhamnosus GG to expand our insights in the glycosylation potential of this bacterium. In L. rhamnosus GG we could predict 48 GTs of which eight were not previously reported. For at least 20 of these GTs a substrate relation was inferred. Conclusions: We confirmed through experimental validation our prediction of WelI acting upstream of WelE in the biosynthesis of exopolysaccharides. We further hypothesize to have identified in L. rhamnosus GG the yet undiscovered genes involved in the biosynthesis of glucose-rich glycans and novel GTs involved in the glycosylation of proteins. Interestingly, we also predict GTs with well-known functions in peptidoglycan synthesis to also play a role in protein glycosylation.
  213. Van Maele, L., Fougeron, D., Janot, L., Didierlaurent, A., Cayet, D., Tabareau, J., Rumbo, M., et al. (2014). Airway structural cells regulate TLR5-mediated mucosal adjuvant activity. MUCOSAL IMMUNOLOGY, 7(3), 489–500.
    Antigen-presenting cell (APC) activation is enhanced by vaccine adjuvants. Most vaccines are based on the assumption that adjuvant activity of Toll-like receptor (TLR) agonists depends on direct, functional activation of APCs. Here, we sought to establish whether TLR stimulation in non-hematopoietic cells contributes to flagellin's mucosal adjuvant activity. Nasal administration of flagellin enhanced T-cell-mediated immunity, and systemic and secretory antibody responses to coadministered antigens in a TLR5-dependent manner. Mucosal adjuvant activity was not affected by either abrogation of TLR5 signaling in hematopoietic cells or the presence of flagellin-specific, circulating neutralizing antibodies. We found that flagellin is rapidly degraded in conducting airways, does not translocate into lung parenchyma and stimulates an early immune response, suggesting that TLR5 signaling is regionalized. The flagellin-specific early response of lung was regulated by radioresistant cells expressing TLR5 (particularly the airway epithelial cells). Flagellin stimulated the epithelial production of a small set of mediators that included the chemokine CCL20, which is known to promote APC recruitment in mucosal tissues. Our data suggest that (i) the adjuvant activity of TLR agonists in mucosal vaccination may require TLR stimulation of structural cells and (ii) harnessing the effect of adjuvants on epithelial cells can improve mucosal vaccines.
  214. Vaudel, M., Venne, A. S., Berven, F. S., Zahedi, R. P., Martens, L., & Barsnes, H. (2014). Shedding light on black boxes in protein identification. PROTEOMICS, 14(9), 1001–1005.
    Performing a well thought-out proteomics data analysis can be a daunting task, especially for newcomers to the field. Even researchers experienced in the proteomics field can find it challenging to follow existing publication guidelines for MS-based protein identification and characterization in detail. One of the primary goals of bioinformatics is to enable any researcher to interpret the vast amounts of data generated in modern biology, by providing user-friendly and robust end-user applications, clear documentation, and corresponding teaching materials. In that spirit, we here present an extensive tutorial for peptide and protein identification, available at . The material is completely based on freely available and open-source tools, and has already been used and refined at numerous international courses over the past 3 years. During this time, it has demonstrated its ability to allow even complete beginners to intuitively conduct advanced bioinformatics workflows, interpret the results, and understand their context. This tutorial is thus aimed at fully empowering users, by removing black boxes in the proteomics informatics pipeline.
  215. Vizcaino, J. A., Deutsch, E. W., Wang, R., Csordas, A., Reisinger, F., Ríos, D., Dianes, J. A., et al. (2014). ProteomeXchange provides globally coordinated proteomics data submission and dissemination. NATURE BIOTECHNOLOGY, 32(3), 223–226.
  216. Ishchukov, I., Wu, Y., Van Puyvelde, S., Vanderleyden, J., & Marchal, K. (2014). Inferring the relation between transcriptional and posttranscriptional regulation from expression compendia. BMC MICROBIOLOGY, 14.
    Background: Publicly available expression compendia that measure both mRNAs and sRNAs provide a promising resource to simultaneously infer the transcriptional and the posttranscriptional network. To maximally exploit the information contained in such compendia, we propose an analysis flow that combines publicly available expression compendia and sequence-based predictions to infer novel sRNA-target interactions and to reconstruct the relation between the sRNA and the transcriptional network. Results: We relied on module inference to construct modules of coexpressed genes (sRNAs). TFs and sRNAs were assigned to these modules using the state-of-the-art inference techniques LeMoNe and Context Likelihood of Relatedness (CLR). Combining these expressions with sequence-based sRNA-target interactions allowed us to predict 30 novel sRNA-target interactions comprising 14 sRNAs. Our results highlight the role of the posttranscriptional network in finetuning the transcriptional regulation, e.g. by intra-operonic regulation. Conclusion: In this work we show how strategies that combine expression information with sequence-based predictions can help unveiling the intricate interaction between the transcriptional and the posttranscriptional network in prokaryotic model systems.
  217. Meysman, P., Sonego, P., Bianco, L., Fu, Q., Ledezma-Tejeida, D., Gama-Castro, S., Liebens, V., et al. (2014). COLOMBOS v2.0 : an ever expanding collection of bacterial expression compendia. NUCLEIC ACIDS RESEARCH, 42(D1), D649–D653.
    The COLOMBOS database features comprehensive organism-specific cross-platform gene expression compendia of several bacterial model organisms and is supported by a fully interactive web portal and an extensive web API. COLOMBOS was originally published in PLoS One, and COLOMBOS v2.0 includes both an update of the expression data, by expanding the previously available compendia and by adding compendia for several new species, and an update of the surrounding functionality, with improved search and visualization options and novel tools for programmatic access to the database. The scope of the database has also been extended to incorporate RNA-seq data in our compendia by a dedicated analysis pipeline. We demonstrate the validity and robustness of this approach by comparing the same RNA samples measured in parallel using both microarrays and RNA-seq. As far as we know, COLOMBOS currently hosts the largest homogenized gene expression compendia available for seven bacterial model organisms.
  218. De Antonellis, Pasqualino, Carotenuto, M., Vandenbussche, J., De Vita, G., Ferrucci, V., Medaglia, C., Boffa, I., et al. (2014). Early targets of miR-34a in neuroblastoma. MOLECULAR & CELLULAR PROTEOMICS, 13(8), 2114–2131.
    Several genes encoding proteins involved in proliferation, invasion, and apoptosis are known to be direct miR-34a targets. Here, we used proteomics to screen for targets of miR-34a in neuroblastoma (NBL), a childhood cancer that originates from precursor cells of the sympathetic nervous system. We examined the effect of miR-34a overexpression using a tetracycline inducible system in two NBL cell lines (SHEP and SH-SY5Y) at early time points of expression (6, 12, and 24 h). Proteome analysis using post-metabolic labeling led to the identification of 2,082 proteins, and among these 186 were regulated (112 proteins down-regulated and 74 up-regulated). Prediction of miR-34a targets via bioinformatics showed that 32 transcripts held miR-34a seed sequences in their 3'-UTR. By combining the proteomics data with Kaplan-Meier gene-expression studies, we identified seven new gene products (ALG13, TIMM13, TGM2, ABCF2, CTCF, Ki67, and LYAR) that were correlated with worse clinical outcomes. These were further validated in vitro by 3'-UTR seed sequence regulation. In addition, Michigan Molecular Interactions searches indicated that together these proteins affect signaling pathways that regulate cell cycle and proliferation, focal adhesions, and other cellular properties that overall enhance tumor progression (including signaling pathways such as TGF-beta, WNT, MAPK, and FAK). In conclusion, proteome analysis has here identified early targets of miR-34a with relevance to NBL tumorigenesis. Along with the results of previous studies, our data strongly suggest miR-34a as a useful tool for improving the chance of therapeutic success with NBL.
  219. Mestdagh, P., Hartmann, N., Baeriswyl, L., Andreasen, D., Bernard, N., Chen, C., Cheo, D., et al. (2014). Evaluation of quantitative miRNA expression platforms in the microRNA quality control (miRQC) study. NATURE METHODS, 11(8), 809–815.
    MicroRNAs are important negative regulators of protein-coding gene expression and have been studied intensively over the past years. Several measurement platforms have been developed to determine relative miRNA abundance in biological samples using different technologies such as small RNA sequencing, reverse transcription quantitative PCR (RT-qPCR) and (microarray) hybridization. In this study, we systematically compared 12 commercially available platforms for analysis of microRNA expression. We measured an identical set of 20 standardized positive and negative control samples, including human universal reference RNA, human brain RNA and titrations thereof, human serum samples and synthetic spikes from microRNA family members with varying homology. We developed robust quality metrics to objectively assess platform performance in terms of reproducibility, sensitivity, accuracy, specificity and concordance of differential expression. The results indicate that each method has its strengths and weaknesses, which help to guide informed selection of a quantitative microRNA gene expression platform for particular study goals.
  220. Zarrineh, P., Sanchez-Rodriguez, A., Hosseinkhan, N., Narimani, Z., Marchal, K., & Masoudi-Nejad, A. (2014). Genome-scale co-expression network comparison across Escherichia coli and Salmonella enterica serovar Typhimurium reveals significant conservation at the regulon level of local regulators despite their dissimilar lifestyles. PLOS ONE, 9(8).
    Availability of genome-wide gene expression datasets provides the opportunity to study gene expression across different organisms under a plethora of experimental conditions. In our previous work, we developed an algorithm called COMODO (COnserved MODules across Organisms) that identifies conserved expression modules between two species. In the present study, we expanded COMODO to detect the co-expression conservation across three organisms by adapting the statistics behind it. We applied COMODO to study expression conservation/divergence between Escherichia coli, Salmonella enterica, and Bacillus subtilis. We observed that some parts of the regulatory interaction networks were conserved between E. coli and S. enterica, especially in the regulon of local regulators. However, such conservation was not observed between the regulatory interaction networks of B. subtilis and the two other species. We found co-expression conservation on a number of genes involved in quorum sensing, but almost no conservation for genes involved in pathogenicity across E. coli and S. enterica, which could partially explain their different lifestyles. We concluded that, despite their different lifestyles, no significant rewiring has occurred at the level of local regulons, and notable conservation can be detected in signaling pathways and stress sensing in the phylogenetically close species S. enterica and E. coli. Moreover, conservation of local regulons seems to depend on the evolutionary time of divergence across species, disappearing at larger distances as shown by the comparison with B. subtilis. Global regulons follow a different trend and show major rewiring even at the limited evolutionary distance that separates E. coli and S. enterica.
  221. Fawcett, J., Van de Peer, Y., & Maere, S. (2013). Significance and biological consequences of polyploidization in land plant evolution. In J. Greilhuber, J. Doležel, & J. F. Wendel (Eds.), Physical structure, behaviour and evolution of plant genomes (Vol. 2, pp. 277–293). Vienna, Austria: Springer.
  222. Vanneste, Kevin, Van de Peer, Y., & Maere, S. (2013). Inference of genome duplications from age distributions revisited. MOLECULAR BIOLOGY AND EVOLUTION, 30(1), 177–190.
    Whole-genome duplications (WGDs), thought to facilitate evolutionary innovations and adaptations, have been uncovered in many phylogenetic lineages. WGDs are frequently inferred from duplicate age distributions, where they manifest themselves as peaks against a small-scale duplication background. However, the interpretation of duplicate age distributions is complicated by the use of Ks, the number of synonymous substitutions per synonymous site, as a proxy for the age of paralogs. Two particular concerns are the stochastic nature of synonymous substitutions leading to increasing uncertainty in Ks with increasing age since duplication and Ks saturation caused by the inability of evolutionary models to fully correct for the occurrence of multiple substitutions at the same site. Ks stochasticity is expected to erode the signal of older WGDs, whereas Ks saturation may lead to artificial peaks in the distribution. Here, we investigate the consequences of these effects on Ks-based age distributions and WGD inference by simulating the evolution of duplicated sequences according to predefined real age distributions and re-estimating the corresponding Ks distributions. We show that, although Ks estimates can be used for WGD inference far beyond the commonly accepted Ks threshold of 1, Ks saturation effects can cause artificial peaks at higher ages. Moreover, Ks stochasticity and saturation may lead to confounded peaks encompassing multiple WGD events and/or saturation artifacts. We argue that Ks effects need to be properly accounted for when inferring WGDs from age distributions and that the failure to do so could lead to false inferences.
  223. Vandepitte, K, Honnay, O., Mergeay, J., Breyne, P., Roldàn-Ruiz, I., & De Meyer, T. (2013). SNP discovery using paired-end RAD-tag sequencing on pooled genomic DNA of Sisymbrium austriacum (Brassicaceae). MOLECULAR ECOLOGY RESOURCES, 13(2), 269–275.
    Single nucleotide polymorphisms (SNPs) are rapidly replacing anonymous markers in population genomic studies, but their use in non-model organisms is hampered by the scarcity of cost-effective approaches to uncover genome-wide variation in a comprehensive subset of individuals. The screening of one or only a few individuals induces ascertainment bias. To discover SNPs for a population genomic study of the Pyrenean rocket (Sisymbrium austriacum subsp. chrysanthum), we undertook a pooled RAD-PE (Restriction site Associated DNA Paired-End sequencing) approach. RAD tags were generated from the PstI-digested pooled genomic DNA of 12 individuals sampled across the species distribution range and paired-end sequenced using Illumina technology to produce ~24.5 Mb of sequences, covering ~7% of the species' genome. Sequences were assembled into ~76,000 contigs with a mean length of 323 bp (N50 = 357 bp, sequencing depth = 24x). In all, >15,000 SNPs were called, of which 47% were annotated in putative genic regions based on homology with the Arabidopsis thaliana genome. Gene ontology (GO) slim categorization demonstrated that the identified SNPs covered extant genic variation well. The validation of 300 SNPs on a larger set of individuals using a KASPar assay underpinned the utility of pooled RAD-PE as an inexpensive genome-wide SNP discovery technique (success rate: 87%). In addition to SNPs, we discovered >600 putative SSR markers.
  224. Colaert, N., Maddelein, D., Impens, F., Van Damme, P., Plasman, K., Helsens, K., Hulstaert, N., et al. (2013). The Online Protein Processing Resource (TOPPR) : a database and analysis platform for protein processing events. NUCLEIC ACIDS RESEARCH, 41(D1), D333–D337.
    We here present The Online Protein Processing Resource (TOPPR), an online database that contains thousands of published proteolytically processed sites in human and mouse proteins. These cleavage events were identified with COmbined FRActional DIagonal Chromatography proteomics technologies, and the resulting database is provided with full data provenance. Indeed, TOPPR provides an interactive visual display of the actual fragmentation mass spectrum that led to each identification of a reported processed site, complete with fragment ion annotations and search engine scores. Apart from warehousing and disseminating these data in an intuitive manner, TOPPR also provides an online analysis platform, including methods to analyze protease specificity and substrate-centric analyses. Concretely, TOPPR supports three ways to retrieve data: (i) the retrieval of all substrates for one or more cellular stimuli or assays; (ii) a substrate search by UniProtKB/Swiss-Prot accession number, entry name or description; and (iii) a motif search that retrieves substrates matching a user-defined protease specificity profile. The analysis of the substrates is supported through the presence of a variety of annotations, including predicted secondary structure, known domains and experimentally obtained 3D structure where available. Across substrates, substrate orthologs and conserved sequence stretches can also be shown, with iceLogo visualization provided for the latter.
  225. Andolfo, G., Sanseverino, W., Rombauts, S., Van de Peer, Y., Bradeen, J., Carputo, D., Frusciante, L., et al. (2013). Overview of tomato (Solanum lycopersicum) candidate pathogen recognition genes reveals important Solanum R locus dynamics. NEW PHYTOLOGIST, 197(1), 223–237.
    To investigate the genome-wide spatial arrangement of R loci, a complete catalogue of tomato (Solanum lycopersicum) and potato (Solanum tuberosum) nucleotide-binding site (NBS), receptor-like protein (RLP) and receptor-like kinase (RLK) gene repertoires was generated. Candidate pathogen recognition genes were characterized with respect to structural diversity, phylogenetic relationships and chromosomal distribution. NBS genes frequently occur in clusters of related gene copies that also include RLP or RLK genes. This scenario is compatible with the existence of selective pressures optimizing coordinated transcription. A number of duplication events associated with lineage-specific evolution were discovered. These findings suggest that different evolutionary mechanisms shaped pathogen recognition gene cluster architecture to expand and to modulate the defence repertoire. Analysis of pathogen recognition gene clusters associated with documented resistance function allowed the identification of adaptive divergence events and the reconstruction of the evolutionary history of these loci. Differences in candidate pathogen recognition gene number and organization were found between tomato and potato. Most candidate pathogen recognition gene orthologues were distributed at less than perfectly matching positions, suggesting an ongoing lineage-specific rearrangement. Indeed, a local expansion of Toll/Interleukin-1 receptor (TIR)-NBS-leucine-rich repeat (LRR) (TNL) genes in the potato genome was evident. Taken together, these findings have implications for improved understanding of the mechanisms of molecular adaptive selection at Solanum R loci.
  226. Volders, P.-J., Helsens, K., Wang, X., Menten, B., Martens, L., Gevaert, K., Vandesompele, J., et al. (2013). LNCipedia : a database for annotated human lncRNA transcript sequences and structures. NUCLEIC ACIDS RESEARCH, 41(D1), D246–D251.
    Here, we present LNCipedia, a novel database for human long non-coding RNA (lncRNA) transcripts and genes. LncRNAs constitute a large and diverse class of non-coding RNA genes. Although several lncRNAs have been functionally annotated, the majority remains to be characterized. Different high-throughput methods to identify new lncRNAs (including RNA sequencing and annotation of chromatin-state maps) have been applied in various studies resulting in multiple unrelated lncRNA data sets. LNCipedia offers 21 488 annotated human lncRNA transcripts obtained from different sources. In addition to basic transcript information and gene structure, several statistics are determined for each entry in the database, such as secondary structure information, protein coding potential and microRNA binding sites. Our analyses suggest that, much like microRNAs, many lncRNAs have a significant secondary structure, in line with their presumed association with proteins or protein complexes. Available literature on specific lncRNAs is linked, and users or authors can submit articles through a web interface. Protein coding potential is assessed by two different prediction algorithms: Coding Potential Calculator and HMMER. In addition, a novel strategy has been integrated for detecting potentially coding lncRNAs by automatically re-analysing the large body of publicly available mass spectrometry data in the PRIDE database. LNCipedia is publicly available and allows users to query and download lncRNA sequences and structures based on different search criteria. The database may serve as a resource to initiate small- and large-scale lncRNA studies. As an example, the LNCipedia content was used to develop a custom microarray for expression profiling of all available lncRNAs.
  227. Van Landeghem, S., Bjorne, J., Wei, C.-H., Hakala, K., Pyysalo, S., Ananiadou, S., Kao, H.-Y., et al. (2013). Large-scale event extraction from literature with multi-level gene normalization. PLOS ONE, 8(4).
    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access. Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
  228. Van Bogaert, Inge, Holvoet, K., Roelants, S., Li, B., Lin, Y.-C., Van de Peer, Y., & Soetaert, W. (2013). The biosynthetic gene cluster for sophorolipids : a biotechnological interesting biosurfactant produced by Starmerella bombicola. MOLECULAR MICROBIOLOGY, 88(3), 501–509.
    Sophorolipids are promising biologically derived surfactants or detergents which find application in household cleaning, personal care and cosmetics. They are produced by specific yeast species and among those, Starmerella bombicola (formerly Candida bombicola) is the most widely used and studied one. Despite the commercial interest in sophorolipids, the biosynthetic pathway of these secondary metabolites remained hitherto partially unsolved. In this manuscript we present the sophorolipid gene cluster consisting of five genes directly involved in sophorolipid synthesis: a cytochrome P450 monooxygenase, two glucosyltransferases, an acetyltransferase and a transporter. It was demonstrated that disabling the first step of the pathway (cytochrome P450 monooxygenase-mediated terminal or subterminal hydroxylation of a common fatty acid) results in complete abolishment of sophorolipid production. This phenotype could be complemented by supplying the yeast with hydroxylated fatty acids. On the other hand, knocking out the transporter gene yields mutants still able to secrete sophorolipids, though only at levels of 10% as compared with the wild type, suggesting alternative routes for secretion. Finally, it was proved that hampering sophorolipid production does not affect cell growth or cell viability in laboratory conditions, as can be expected for secondary metabolites.
  229. Van Landeghem, S., De Bodt, S., Drebert, Z., Inzé, D., & Van de Peer, Y. (2013). The potential of text mining in data integration and network biology for plant research : a case study on Arabidopsis. PLANT CELL, 25(3), 794–807.
    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
  230. Van den Abbeele, P., Belzer, C., Goossens, M., Kleerebezem, M., De Vos, W. M., Thas, O., De Weirdt, R., et al. (2013). Butyrate-producing Clostridium cluster XIVa species specifically colonize mucins in an in vitro gut model. ISME JOURNAL, 7(5), 949–961.
    The human gut is colonized by a complex microbiota with multiple benefits. Although the surface-attached, mucosal microbiota has a unique composition and potential to influence human health, it remains difficult to study in vivo. Therefore, we performed an in-depth microbial characterization (human intestinal tract chip (HITChip)) of a recently developed dynamic in vitro gut model, which simulates both luminal and mucosal gut microbes (mucosal-simulator of human intestinal microbial ecosystem (M-SHIME)). Inter-individual differences among human subjects were confirmed and microbial patterns unique for each individual were preserved in vitro. Furthermore, in correspondence with in vivo studies, Bacteroidetes and Proteobacteria were enriched in the luminal content while Firmicutes rather colonized the mucin layer, with Clostridium cluster XIVa accounting for almost 60% of the mucin-adhered microbiota. Of the many acetate and/or lactate-converting butyrate producers within this cluster, Roseburia intestinalis and Eubacterium rectale most specifically colonized mucins. These 16S rRNA gene-based results were confirmed at a functional level as butyryl-CoA:acetate-CoA transferase gene sequences belonged to different species in the luminal as opposed to the mucin-adhered microbiota, with Roseburia species governing the mucosal butyrate production. Correspondingly, the simulated mucosal environment induced a shift from acetate towards butyrate. Because inter-individual differences were preserved and, in contrast to conventional models, washout of relevant mucin-adhered microbes was avoided, simulating the mucosal gut microbiota represents a breakthrough in modeling and mechanistically studying the human intestinal microbiome in health and disease. Finally, as mucosal butyrate producers produce butyrate close to the epithelium, they may enhance butyrate bioavailability, which could be useful in treating diseases, such as inflammatory bowel disease.
  231. Vyverman, M., De Baets, B., Fack, V., & Dawyndt, P. (2013). essaMEM: finding maximal exact matches using enhanced sparse suffix arrays. BIOINFORMATICS, 29(6), 802–804.
  232. Claesen, J., Clement, L., Shkedy, Z., Foulquié-Moreno, M. R., & Burzykowski, T. (2013). Simultaneous mapping of multiple gene loci with pooled segregants. PLOS ONE, 8(2).
    The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases remains an important challenge. It requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms (SNPs) as genetic markers. Combining the technologies with pooling of segregants, as performed in bulked segregant analysis (BSA), should, in principle, allow the simultaneous mapping of multiple genetic loci present throughout the genome. The gene mapping process, applied here, consists of three steps: First, a controlled crossing of parents with and without a trait. Second, selection based on phenotypic screening of the offspring, followed by the mapping of short offspring sequences against the parental reference. The final step aims at detecting genetic markers such as SNPs, insertions and deletions with next generation sequencing (NGS). Markers in close proximity to genomic loci that are associated with the trait have a higher probability of being inherited together. Hence, these markers are very useful for discovering the loci and the genetic mechanism underlying the characteristic of interest. Within this context, NGS produces binomial counts along the genome, i.e., the number of sequenced reads that match the SNP of the parental reference strain, which is a proxy for the number of individuals in the offspring that share the SNP with the parent. Genomic loci associated with the trait can thus be discovered by analyzing trends in the counts along the genome. We exploit the link between smoothing splines and generalized mixed models for estimating the underlying structure present in the SNP scatterplots.
  233. Muth, T., Benndorf, D., Reichl, U., Rapp, E., & Martens, L. (2013). Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. MOLECULAR BIOSYSTEMS, 9(4), 578–585.
    In the past years the integral study of microbial communities of varying complexity has gained increasing research interest. Mass spectrometry-driven metaproteomics enables the analysis of such communities on the functional level, but this fledgling field still faces various technical and semantic challenges regarding experimental data analysis and interpretation. In the present review, we outline the hurdles involved and attempt to cover the most valuable methods and software implementations available to researchers in the field today. Beyond merely focusing on protein identification, we provide an overview on different data pre- and post-processing steps, such as metabolic pathway analysis, that can be useful in a typical metaproteomics workflow. Finally, we briefly discuss directions for future work.
  234. Vandermarliere, E., & Martens, L. (2013). Protein structure as a means to triage proposed PTM sites. PROTEOMICS, 13(6), 1028–1035.
    PTMs such as phosphorylation are often important actors in protein regulation and recognition. These functions require both visibility and accessibility to other proteins; that is, the modification must be located at the surface of the protein. Currently, many repositories provide information on PTMs but structural information is often lacking. This study, which focuses on phosphorylation sites available in UniProtKB/Swiss-Prot, illustrates that most phosphorylation sites are indeed found at the surface of the protein, but that some sites are found buried in the core of the protein. Several of these identified buried phosphorylation sites can easily become accessible upon small conformational changes while others would require the whole protein to unfold and are hence most unlikely modification sites. Subsequent analysis of phosphorylation sites available in PRIDE demonstrates that taking the structure of the protein into account would be a good guide in the identification of the actual phosphorylated positions in phosphoproteomics experiments. This analysis illustrates that care must be taken when simply accepting the position of a PTM without first analyzing its position within the protein structure if the latter is available.
  235. De Smet, Riet, Adams, K. L., Vandepoele, K., Van Montagu, M., Maere, S., & Van de Peer, Y. (2013). Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 110(8), 2898–2903.
    The importance of gene gain through duplication has long been appreciated. In contrast, the importance of gene loss has only recently attracted attention. Indeed, studies in organisms ranging from plants to worms and humans suggest that duplication of some genes might be better tolerated than that of others. Here we have undertaken a large-scale study to investigate the existence of duplication-resistant genes in the sequenced genomes of 20 flowering plants. We demonstrate that there is a large set of genes that is convergently restored to single-copy status following multiple genome-wide and smaller scale duplication events. We rule out the possibility that such a pattern could be explained by random gene loss only and therefore propose that there is selection pressure to preserve such genes as singletons. This is further substantiated by the observation that angiosperm single-copy genes do not comprise a random fraction of the genome, but instead are often involved in essential housekeeping functions that are highly conserved across all eukaryotes. Furthermore, single-copy genes are generally expressed more highly and in more tissues than non-single-copy genes, and they exhibit higher sequence conservation. Finally, we propose different hypotheses to explain their resistance against duplication.
  236. Barsnes, H., & Martens, L. (2013). Crowdsourcing in proteomics: public resources lead to better experiments. AMINO ACIDS, 44(4), 1129–1137.
    With the growing interest in the field of proteomics, the amount of publicly available proteome resources has also increased dramatically. This means that there are many useful resources available for almost all aspects of a proteomics experiment. However, it remains vital to use the right resource, for the right purpose, at the right time. This review is therefore meant to aid the reader in obtaining an overview of the available resources and their application, thus providing the necessary background to choose the appropriate resources for the experiment at hand. Many of the resources are also taking advantage of so-called crowdsourcing to maximize the potential of the resource. What this means and how this can improve future experiments will also be discussed. The text roughly follows the steps involved in a proteomics experiment, starting with the planning of the experiment, via the processing of the data and the analysis of the results, to the community-wide sharing of the produced data.
  237. Delbroek, L., Van Kolen, K., Steegmans, L., da Cunha, R., Mandemakers, W., Daneels, G., De Bock, P.-J., et al. (2013). Development of an enzyme-linked immunosorbent assay for detection of cellular and in vivo LRRK2 S935 phosphorylation. JOURNAL OF PHARMACEUTICAL AND BIOMEDICAL ANALYSIS, 76, 49–58.
  238. Menschaert, G., Van Criekinge, W., Notelaers, T., Koch, A., Crappé, J., Gevaert, K., & Van Damme, P. (2013). Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. MOLECULAR & CELLULAR PROTEOMICS, 12(7), 1780–1790.
    An increasing number of studies involve integrative analysis of gene and protein expression data, taking advantage of new technologies such as next-generation transcriptome sequencing (RNA-Seq) and highly sensitive mass spectrometry (MS) instrumentation. Recently, a strategy, termed ribosome profiling (or RIBO-seq), based on deep sequencing of ribosome-protected mRNA fragments, indirectly monitoring protein synthesis, has been described. We devised a proteogenomic approach constructing a custom protein sequence search space, built from both SwissProt and RIBO-seq derived translation products, applicable for MS/MS spectrum identification. To record the impact of using the constructed deep proteome database we performed two alternative MS-based proteomic strategies: (I) a regular shotgun proteomic and (II) an N-terminal COFRADIC approach. While the former technique gives an overall assessment on the protein and peptide level, the latter technique, specifically enabling the isolation of N-terminal peptides, is very appropriate in validating the RIBO-seq derived (alternative) translation initiation site profile. We demonstrate that this proteogenomic approach increases the overall protein identification rate by 2.5% (e.g. new protein products, new protein splice variants, SNP variant proteins, and N-terminally extended forms of known proteins) as compared to only searching UniProtKB-SwissProt. Furthermore, using this custom database, identification of N-terminal COFRADIC data resulted in detection of 16 alternative start sites giving rise to N-terminally extended protein variants besides the identification of four translated uORFs. Notably, the characterization of these new translation products revealed the use of multiple near-cognate (non-AUG) start codons. As deep sequencing techniques are becoming more standard, less expensive, and widespread, we anticipate that mRNA-seq and especially custom-tailored RIBO-seq will become indispensable in the MS-based protein or peptide identification process.
  239. Verbeke, Lieven, Cloots, L., Demeester, P., Fostier, J., & Marchal, K. (2013). EPSILON: an eQTL prioritization framework using similarity measures derived from local networks. BIOINFORMATICS, 29(10), 1308–1316.
    Motivation: When genomic data are associated with gene expression data, the resulting expression quantitative trait loci (eQTL) will likely span multiple genes. eQTL prioritization techniques can be used to select the most likely causal gene affecting the expression of a target gene from a list of candidates. As an input, these techniques use physical interaction networks that often contain highly connected genes and unreliable or irrelevant interactions that can interfere with the prioritization process. We present EPSILON, an extendable framework for eQTL prioritization, which mitigates the effect of highly connected genes and unreliable interactions by constructing a local network before a network-based similarity measure is applied to select the true causal gene. Results: We tested the new method on three eQTL datasets derived from yeast data using three different association techniques. A physical interaction network was constructed, and each eQTL in each dataset was prioritized using the EPSILON approach: first, a local network was constructed using a k-trials shortest path algorithm, followed by the calculation of a network-based similarity measure. Three similarity measures were evaluated: random walks, the Laplacian Exponential Diffusion kernel and the Regularized Commute-Time kernel. The aim was to predict knockout interactions from a yeast knockout compendium. EPSILON outperformed two reference prioritization methods, random assignment and shortest path prioritization. Next, we found that using a local network significantly increased prioritization performance in terms of predicted knockout pairs when compared with using exactly the same network similarity measures on the global network, with an average increase in prioritization performance of 8 percentage points (P < 10^-5).
  240. De Meyer, Tim, Mampaey, E., Vlemmix, M., Denil, S., Trooskens, G., Renard, J.-P., De Keulenaer, S., et al. (2013). Quality evaluation of methyl binding domain based kits for enrichment DNA-methylation sequencing. PLOS ONE, 8(3).
    DNA-methylation is an important epigenetic feature in health and disease. Methylated sequence capturing by Methyl Binding Domain (MBD) based enrichment followed by second-generation sequencing provides the best combination of sensitivity and cost-efficiency for genome-wide DNA-methylation profiling. However, existing implementations are numerous, and quality control and optimization require expensive external validation. Therefore, this study has two aims: 1) to identify a best performing kit for MBD-based enrichment using independent validation data, and 2) to evaluate whether quality evaluation can also be performed solely based on the characteristics of the generated sequences. Five commercially available kits for MBD enrichment were combined with Illumina GAIIx sequencing for three cell lines (HCT15, DU145, PC3). Reduced representation bisulfite sequencing data (all three cell lines) and publicly available Illumina Infinium BeadChip data (DU145 and PC3) were used for benchmarking. Consistent large-scale differences in yield, sensitivity and specificity between the different kits could be identified, with Diagenode's MethylCap kit as overall best performing kit under the tested conditions. This kit could also be identified with the Fragment CpG-plot, which summarizes the CpG content of the captured fragments, implying that the latter can be used as a tool to monitor data quality. In conclusion, there are major quality differences between kits for MBD-based capturing of methylated DNA, with the MethylCap kit performing best under the used settings. The Fragment CpG-plot is able to monitor data quality based on inherent sequence data characteristics, and is therefore a cost-efficient tool for experimental optimization, but also to monitor quality throughout routine applications.
  241. Steenbergen, R. D., Ongenaert, M., Snellenberg, S., Trooskens, G., van der Meide, W. F., Pandey, D., Bloushtain-Qimron, N., et al. (2013). Methylation-specific digital karyotyping of HPV16E6E7 expressing human keratinocytes identifies novel methylation events in cervical carcinogenesis. JOURNAL OF PATHOLOGY, 231(1), 53–62.
  242. Verelst, W., Bertolini, E., De Bodt, S., Vandepoele, K., Demeulenaere, M., Pé, M. E., & Inzé, D. (2013). Molecular and physiological analysis of growth-limiting drought stress in Brachypodium distachyon leaves. MOLECULAR PLANT, 6(2), 311–322.
    The drought-tolerant grass Brachypodium distachyon is an emerging model species for temperate grasses and cereal crops. To explore the usefulness of this species for drought studies, a reproducible in vivo drought assay was developed. Spontaneous soil drying led to a 45% reduction in leaf size, and this was mostly due to a decrease in cell expansion, whereas cell division remained largely unaffected by drought. To investigate the molecular basis of the observed leaf growth reduction, the third Brachypodium leaf was dissected in three zones, namely proliferation, expansion, and mature zones, and subjected to transcriptome analysis, based on a whole-genome tiling array. This approach allowed us to highlight that transcriptome profiles of different developmental leaf zones respond differently to drought. Several genes and functional processes involved in drought tolerance were identified. The transcriptome data suggest an increased energy availability in the proliferation zones, along with an up-regulation of sterol synthesis that may influence membrane fluidity. This information may be used to improve the tolerance of temperate cereals to drought, which is undoubtedly one of the major environmental challenges faced by agriculture today and in the near future.
  243. Swarts, D. R., Henfling, M. E., Van Neste, L., van Suylen, R. J., Dingemans, A.-M. C., Dinjens, W. N., Haesevoets, A., et al. (2013). CD44 and OTP are strong prognostic markers for pulmonary carcinoids. CLINICAL CANCER RESEARCH, 19(8), 2197–2207.
  244. De Storme, N., De Schrijver, J., Van Criekinge, W., Wewer, V., Dörmann, P., & Geelen, D. (2013). GLUCAN SYNTHASE-LIKE8 and STEROL METHYLTRANSFERASE2 are required for ploidy consistency of the sexual reproduction system in Arabidopsis. PLANT CELL, 25(2), 387–403.
    In sexually reproducing plants, the meiocyte-producing archesporal cell lineage is maintained at the diploid state to consolidate the formation of haploid gametes. In search of molecular factors that regulate this ploidy consistency, we isolated an Arabidopsis thaliana mutant, called enlarged tetrad2 (et2), which produces tetraploid meiocytes through the stochastic occurrence of premeiotic endomitosis. Endomitotic polyploidization events were induced by alterations in cell wall formation, and similar cytokinetic defects were sporadically observed in other tissues, including cotyledons and leaves. ET2 encodes GLUCAN SYNTHASE-LIKE8 (GSL8), a callose synthase that mediates the deposition of callose at developing cell plates, root hairs, and plasmodesmata. Unlike other gsl8 mutants, in which defects in cell plate formation are seedling lethal, cytokinetic defects in et2 predominantly occur in flowers and have little effect on vegetative growth and development. Similarly, mutations in STEROL METHYLTRANSFERASE2 (SMT2), a major sterol biosynthesis enzyme, also lead to weak cytokinetic defects, primarily in the flowers. In addition, SMT2 allelic mutants also generate tetraploid meiocytes through the ectopic induction of premeiotic endomitosis. These observations demonstrate that appropriate callose and sterol biosynthesis are required for maintaining the ploidy level of the premeiotic germ lineage and that subtle defects in cytokinesis may lead to diploid gametes and polyploid offspring.
  245. Meysman, P., Sánchez-Rodríguez, A., Fu, Q., Marchal, K., & Engelen, K. (2013). Expression divergence between Escherichia coli and Salmonella enterica serovar Typhimurium reflects their lifestyles. MOLECULAR BIOLOGY AND EVOLUTION, 30(6), 1302–1314.
    Escherichia coli K12 is a commensal bacterium and one of the best-studied model organisms. Salmonella enterica serovar Typhimurium, on the other hand, is a facultative intracellular pathogen. These two prokaryotic species can be considered related phylogenetically, and they share a large amount of their genetic material, which is commonly termed the "core genome." Despite their shared core genome, both species display very different lifestyles, and it is unclear to what extent the core genome, apart from the species-specific genes, plays a role in this lifestyle divergence. In this study, we focus on the differences in expression domains for the orthologous genes in E. coli and S. Typhimurium. The iterative comparison of coexpression methodology was used on large expression compendia of both species to uncover the conservation and divergence of gene expression. We found that gene expression conservation occurs mostly independently from amino acid similarity. According to our estimates, more than one quarter of the orthologous genes have a different expression domain in E. coli than in S. Typhimurium. Genes involved with key cellular processes are most likely to have conserved their expression domains, whereas genes showing diverged expression are associated with metabolic processes that, although present in both species, are regulated differently. The expression domains of the shared "core" genome of E. coli and S. Typhimurium, consisting of highly conserved orthologs, have been tuned to help accommodate the differences in lifestyle and the pathogenic potential of Salmonella.
  246. Ruttink, T., Sterck, L., Rohde, A., Bendixen, C., Rouzé, P., Asp, T., Van de Peer, Y., et al. (2013). Orthology Guided Assembly in highly heterozygous crops: creating a reference transcriptome to uncover genetic diversity in Lolium perenne. PLANT BIOTECHNOLOGY JOURNAL, 11(5), 605–617.
    Despite current advances in next-generation sequencing data analysis procedures, de novo assembly of a reference sequence required for SNP discovery and expression analysis is still a major challenge in genetically uncharacterized, highly heterozygous species. High levels of polymorphism inherent to outbreeding crop species hamper De Bruijn Graph-based de novo assembly algorithms, causing transcript fragmentation and the redundant assembly of allelic contigs. If multiple genotypes are sequenced to study genetic diversity, primary de novo assembly is best performed per genotype to limit the level of polymorphism and avoid transcript fragmentation. Here, we propose an Orthology Guided Assembly procedure that first uses sequence similarity (tBLASTn) to proteins of a model species to select allelic and fragmented contigs from all genotypes and then performs CAP3 clustering on a gene-by-gene basis. Thus, we simultaneously annotate putative orthologues for each protein of the model species, resolve allelic redundancy and fragmentation and create a de novo transcript sequence representing the consensus of all alleles present in the sequenced genotypes. We demonstrate the procedure using RNA-seq data from 14 genotypes of Lolium perenne to generate a reference transcriptome for gene discovery and translational research, to reveal the transcriptome-wide distribution and density of SNPs in an outbreeding crop and to illustrate the effect of polymorphisms on the assembly procedure. The results presented here illustrate that constructing a non-redundant reference sequence is essential for comparative genomics, orthology-based annotation and candidate gene selection but also for read mapping and subsequent polymorphism discovery and/or read count-based gene expression analysis.
  247. De Maeyer, D., Renkens, J., Cloots, L., De Raedt, L., & Marchal, K. (2013). PheNetic: network-based interpretation of unstructured gene lists in E. coli. MOLECULAR BIOSYSTEMS, 9(7), 1594–1603.
    At the present time, omics experiments are commonly used in wet lab practice to identify leads involved in interesting phenotypes. These omics experiments often result in unstructured gene lists, the interpretation of which in terms of pathways or the mode of action is challenging. To aid in the interpretation of such gene lists, we developed PheNetic, a decision theoretic method that exploits publicly available information, captured in a comprehensive interaction network to obtain a mechanistic view of the listed genes. PheNetic selects from an interaction network the sub-networks highlighted by these gene lists. We applied PheNetic to an Escherichia coli interaction network to reanalyse a previously published KO compendium, assessing gene expression of 27 E. coli knock-out mutants under mild acidic conditions. Being able to unveil previously described mechanisms involved in acid resistance demonstrated both the performance of our method and the added value of our integrated E. coli network.
  248. Sánchez-Rodríguez, A., Cloots, L., & Marchal, K. (2013). Omics derived networks in bacteria. CURRENT BIOINFORMATICS, 8(4), 489–495.
    Understanding the cellular behavior from a systems perspective requires the identification of functional and physical interactions among diverse molecular entities in a cell (i.e. DNA/RNA, proteins and metabolites). Powerful and scalable technologies enabled the generation of genome-wide datasets that describe cellular systems by capturing the interactions of their building blocks under different environmental stimuli. The most straightforward way to represent such datasets is by means of molecular networks of which nodes correspond to molecular entities and edges to the interactions amongst those entities. In this review we give an overview of the different functional and physical interaction networks in bacteria that have been or potentially can be built by the integration of diverse omics datasets.
  249. Meysman, P., Marchal, K., & Engelen, K. (2013). Identifying common structural DNA properties in transcription factor binding site sets of the LacI-GalR family. CURRENT BIOINFORMATICS, 8(4), 483–488.
    It is well known that transcription factors can induce deformations in their DNA-binding sites upon complex formation. However, few attempts have been made to investigate the extent to which induced structural deformations in the DNA molecule are conserved between different members of the same transcription factor family. In this article, we used the CRoSSeD methodology for describing DNA structural properties to extract common features in the binding sites of different LacI-GalR family members. The most significant feature identified in this way was located at the center of the binding sites, which is also the most likely location for an induced DNA deformation following an amino acid interdigitation. This feature was related further to specific elements present in the protein structure and was used to identify and characterize deviant family members. A general family-wide binding site model was constructed and applied to screen for unknown member binding sites.
  250. Wrangle, J., Wang, W., Koch, A., Easwaran, H., Mohammad, H. P., Vendetti, F., Van Criekinge, W., et al. (2013). Alterations of immune response of non-small lung cancer with azacytidine. ONCOTARGET, 4(11), 2067–2079.
    Innovative therapies are needed for advanced Non-Small Cell Lung Cancer (NSCLC). We have undertaken a genomics based, hypothesis driving, approach to query an emerging potential that epigenetic therapy may sensitize to immune checkpoint therapy targeting PD-L1/PD-1 interaction. NSCLC cell lines were treated with the DNA hypomethylating agent azacytidine (AZA - Vidaza) and genes and pathways altered were mapped by genome-wide expression and DNA methylation analyses. AZA-induced pathways were analyzed in The Cancer Genome Atlas (TCGA) project by mapping the derived gene signatures in hundreds of lung adeno (LUAD) and squamous cell carcinoma (LUSC) samples. AZA up-regulates genes and pathways related to both innate and adaptive immunity and genes related to immune evasion in several NSCLC lines. DNA hypermethylation and low expression of IRF7, an interferon transcription factor, tracks with this signature particularly in LUSC. In concert with these events, AZA up-regulates PD-L1 transcripts and protein, a key ligand-mediator of immune tolerance. Analysis of TCGA samples demonstrates that a significant proportion of primary NSCLC have low expression of AZA-induced immune genes, including PD-L1. We hypothesize that epigenetic therapy combined with blockade of immune checkpoints - in particular the PD-1/PD-L1 pathway - may augment response of NSCLC by shifting the balance between immune activation and immune inhibition, particularly in a subset of NSCLC with low expression of these pathways. Our studies define a biomarker strategy for response in a recently initiated trial to examine the potential of epigenetic therapy to sensitize patients with NSCLC to PD-1 immune checkpoint blockade.
  251. Ji, Hongli, Gheysen, G., Denil, S., Lindsey, K., Topping, J. F., Nahar, K., Haegeman, A., et al. (2013). Transcriptional analysis through RNA sequencing of giant cells induced by Meloidogyne graminicola in rice roots. JOURNAL OF EXPERIMENTAL BOTANY, 64(12), 3885–3898.
    One of the reasons for the progressive yield decline observed in aerobic rice production is the rapid build-up of populations of the rice root knot nematode Meloidogyne graminicola. These nematodes induce specialized feeding cells inside root tissue, called giant cells. By injecting effectors in and sipping metabolites out of these cells, they reprogramme normal cell development and deprive the plant of its nutrients. In this research we have studied the transcriptome of giant cells in rice, after isolation of these cells by laser-capture microdissection. The expression profiles revealed a general induction of primary metabolism inside the giant cells. Although the roots were shielded from light induction, we detected a remarkable induction of genes involved in chloroplast biogenesis and tetrapyrrole synthesis. The presence of chloroplast-like structures inside these dark-grown cells was confirmed by confocal microscopy. On the other hand, genes involved in secondary metabolism and more specifically, the majority of defence-related genes were strongly suppressed in the giant cells. In addition, significant induction of transcripts involved in epigenetic processes was detected inside these cells 7 days after infection.
  252. Houbraken, Maarten, Demeyer, S., Staessens, D., Audenaert, P., Colle, D., & Pickavet, M. (2013). Fault tolerant network design inspired by Physarum polycephalum. NATURAL COMPUTING, 12(2), 277–289.
    Physarum polycephalum, a true slime mould, is a primitive, unicellular organism that creates networks to transport nutrients while foraging. The design of these natural networks proved to be advanced, e.g. the slime mould was able to find the shortest path through a maze. The underlying principles of this design have been mathematically modelled in literature. As in real life the slime mould can design fault tolerant networks, its principles can be applied to the design of man-made networks. In this paper, an existing model and algorithm are adapted and extended with stimulation and migration mechanisms which encourage formation of alternative paths, optimize edge positioning and allow for automated design. The extended model can then be used to better design fault tolerant networks. The extended algorithm is applied to several national and international network configurations. Results show that the extensions allow the model to capture the fault tolerance requirements more accurately. The resulting extended algorithm overcomes weaknesses in geometric graph design and can be used to design fault tolerant networks such as telecommunication networks with varying fault tolerance requirements.
  253. Demeyer, S., Michoel, T., Fostier, J., Audenaert, P., Pickavet, M., & Demeester, P. (2013). The index-based subgraph matching algorithm (ISMA): fast subgraph enumeration in large networks using optimized search trees. PLOS ONE, 8(4).
    Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at
  254. Fannes, T., Vandermarliere, E., Schietgat, L., Degroeve, S., Martens, L., & Ramon, J. (2013). Predicting tryptic cleavage from proteomics data using decision tree ensembles. JOURNAL OF PROTEOME RESEARCH, 12(5), 2253–2259.
    Trypsin is the workhorse protease in mass spectrometry-based proteomics experiments and is used to digest proteins into more readily analyzable peptides. To identify these peptides after mass spectrometric analysis, the actual digestion has to be mimicked as faithfully as possible in silico. In this paper we introduce CP-DT (Cleavage Prediction with Decision Trees), an algorithm based on a decision tree ensemble that was learned on publicly available peptide identification data from the PRIDE repository. We demonstrate that CP-DT is able to accurately predict tryptic cleavage: tests on three independent data sets show that CP-DT significantly outperforms the Keil rules that are currently used to predict tryptic cleavage. Moreover, the trees generated by CP-DT can make predictions efficiently and are interpretable by domain experts.
  255. Zimmer, A. D., Lang, D., Buchta, K., Rombauts, S., Nishiyama, T., Hasebe, M., Van de Peer, Y., et al. (2013). Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions. BMC GENOMICS, 14.
    Background: The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results: Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the resource as a central repository for this plant "flagship" genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the model organism database. Conclusions: Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5'-UTR introns in the moss. Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes.
  256. Stock, M., Hoefman, S., Kerckhof, F.-M., Boon, N., De Vos, P., De Baets, B., Heylen, K., et al. (2013). Exploration and prediction of interactions between methanotrophs and heterotrophs. RESEARCH IN MICROBIOLOGY, 164(10), 1045–1054.
  257. De Witte, D., Van de Velde, J., Van Bel, M., Audenaert, P., Demeester, P., Dhoedt, B., Vandepoele, K., et al. (2013). Comparative motif discovery in the cloud. Benelux Bioinformatics Conference 2013, Abstracts. Presented at the Benelux Bioinformatics Conference 2013.
  258. De Clercq, I., Vermeirssen, V., Van Aken, O., Vandepoele, K., Murcha, M. W., Law, S. R., Inzé, A., et al. (2013). The membrane-bound NAC transcription factor ANAC013 functions in mitochondrial retrograde regulation of the oxidative stress response in Arabidopsis. PLANT CELL, 25(9), 3472–3490.
    Upon disturbance of their function by stress, mitochondria can signal to the nucleus to steer the expression of responsive genes. This mitochondria-to-nucleus communication is often referred to as mitochondrial retrograde regulation (MRR). Although reactive oxygen species and calcium are likely candidate signaling molecules for MRR, the protein signaling components in plants remain largely unknown. Through meta-analysis of transcriptome data, we detected a set of genes that are common and robust targets of MRR and used them as a bait to identify its transcriptional regulators. In the upstream regions of these mitochondrial dysfunction stimulon (MDS) genes, we found a cis-regulatory element, the mitochondrial dysfunction motif (MDM), which is necessary and sufficient for gene expression under various mitochondrial perturbation conditions. Yeast one-hybrid analysis and electrophoretic mobility shift assays revealed that the transmembrane domain-containing NO APICAL MERISTEM/ARABIDOPSIS TRANSCRIPTION ACTIVATION FACTOR/CUP-SHAPED COTYLEDON transcription factors (ANAC013, ANAC016, ANAC017, ANAC053, and ANAC078) bound to the MDM cis-regulatory element. We demonstrate that ANAC013 mediates MRR-induced expression of the MDS genes by direct interaction with the MDM cis-regulatory element and triggers increased oxidative stress tolerance. In conclusion, we characterized ANAC013 as a regulator of MRR upon stress in Arabidopsis thaliana.
  259. Verhelst, Bram, Van de Peer, Y., & Rouzé, P. (2013). The complex intron landscape and massive intron invasion in a picoeukaryote provides insights into intron evolution. GENOME BIOLOGY AND EVOLUTION, 5(12), 2393–2401.
    Genes in pieces and spliceosomal introns are a landmark of eukaryotes, with intron invasion usually assumed to have happened early on in evolution. Here, we analyse the intron landscape of Micromonas, a unicellular green alga in the Mamiellophyceae lineage, demonstrating the co-existence of several classes of introns and the occurrence of recent massive intron invasion. This study focuses on two strains, CCMP1545 and RCC299, and their related individuals from ocean samplings, showing that they not only harbour different classes of introns depending on their location in the genome, as for other Mamiellophyceae, but uniquely carry several classes of repeat introns. These introns, dubbed introner elements (IEs), are found at novel positions in genes and have conserved sequences, contrary to canonical introns. This IE invasion has a huge impact on the genome, doubling the number of introns in the CCMP1545 strain. We hypothesize that each IE class originated from a single ancestral IE that has been colonizing the genome after strain divergence by inserting copies of itself into genes by intron transposition, likely involving reverse splicing. Along with similar cases recently observed in other organisms, our observations in Micromonas strains shed a new light on the evolution of introns, suggesting that intron gain is more widespread than previously thought.
  260. Read, B. A., Kegel, J., Klute, M. J., Kuo, A., Lefebvre, S. C., Maumus, F., Mayer, C., et al. (2013). Pan genome of the phytoplankton Emiliania underpins its global distribution. NATURE, 499(7457), 209–213.
    Coccolithophores have influenced the global climate for over 200 million years. These marine phytoplankton can account for 20 per cent of total carbon fixation in some systems. They form blooms that can occupy hundreds of thousands of square kilometres and are distinguished by their elegantly sculpted calcium carbonate exoskeletons (coccoliths), rendering them visible from space. Although coccolithophores export carbon in the form of organic matter and calcite to the sea floor, they also release CO2 in the calcification process. Hence, they have a complex influence on the carbon cycle, driving either CO2 production or uptake, sequestration and export to the deep ocean. Here we report the first haptophyte reference genome, from the coccolithophore Emiliania huxleyi strain CCMP1516, and sequences from 13 additional isolates. Our analyses reveal a pan genome (core genes plus genes distributed variably between strains) probably supported by an atypical complement of repetitive sequence in the genome. Comparisons across strains demonstrate that E. huxleyi, which has long been considered a single species, harbours extensive genome variability reflected in different metabolic repertoires. Genome variability within this species complex seems to underpin its capacity both to thrive in habitats ranging from the equator to the subarctic and to form large-scale episodic blooms under a wide variety of environmental conditions.
  261. Galagan, J. E., Minch, K., Peterson, M., Lyubetskaya, A., Azizi, E., Sweet, L., Gomes, A., et al. (2013). The Mycobacterium tuberculosis regulatory network and hypoxia. NATURE, 499(7457), 178–183.
    We have taken the first steps towards a complete reconstruction of the Mycobacterium tuberculosis regulatory network based on ChIP-Seq and combined this reconstruction with system-wide profiling of messenger RNAs, proteins, metabolites and lipids during hypoxia and re-aeration. Adaptations to hypoxia are thought to have a prominent role in M. tuberculosis pathogenesis. Using ChIP-Seq combined with expression data from the induction of the same factors, we have reconstructed a draft regulatory network based on 50 transcription factors. This network model revealed a direct interconnection between the hypoxic response, lipid catabolism, lipid anabolism and the production of cell wall lipids. As a validation of this model, in response to oxygen availability we observe substantial alterations in lipid content and changes in gene expression and metabolites in corresponding metabolic pathways. The regulatory network reveals transcription factors underlying these changes, allows us to computationally predict expression changes, and indicates that Rv0081 is a regulatory hub.
  262. Masuzzo, P., Hulstaert, N., Huyck, L., Ampe, C., Van Troys, M., & Martens, L. (2013). CellMissy: a tool for management, storage and analysis of cell migration data produced in wound healing-like assays. BIOINFORMATICS, 29(20), 2661–2663.
    Automated image processing has allowed cell migration research to evolve to a high-throughput research field. As a consequence, there is now an unmet need for data management in this domain. The absence of a generic management system for the quantitative data generated in cell migration assays results in each dataset being treated in isolation, making data comparison across experiments difficult. Moreover, by integrating quality control and analysis capabilities into such a data management system, the common practice of having to manually transfer data across different downstream analysis tools will be markedly sped up and made more robust. In addition, access to a data management solution creates gateways for data standardization, meta-analysis and structured public data dissemination.
  263. Degroeve, S., & Martens, L. (2013). MS2PIP: a tool for MS/MS peak intensity prediction. BIOINFORMATICS, 29(24), 3199–3203.
    Motivation: Tandem mass spectrometry provides the means to match mass spectrometry signal observations with the chemical entities that generated them. The technology produces signal spectra that contain information about the chemical dissociation pattern of a peptide that was forced to fragment using methods like collision-induced dissociation. The ability to predict these MS2 signals and to understand this fragmentation process is important for sensitive high-throughput proteomics research. Results: We present a new tool called MS2PIP for predicting the intensity of the most important fragment ion signal peaks from a peptide sequence. MS2PIP pre-processes a large dataset with confident peptide-to-spectrum matches to facilitate data-driven model induction using a random forest regression learning algorithm. The intensity predictions of MS2PIP were evaluated on several independent evaluation sets and found to correlate significantly better with the observed fragment-ion intensities as compared with the current state-of-the-art PeptideART tool.
  264. Defauw, A., Kazbanov, I., Dierckx, H., Dawyndt, P., & Panfilov, A. (2013). Action potential duration heterogeneity of cardiac tissue can be evaluated from cell properties using Gaussian Green’s function approach. PLOS ONE, 8(11).
    Action potential duration (APD) heterogeneity of cardiac tissue is one of the most important factors underlying initiation of deadly cardiac arrhythmias. In many cases such heterogeneity can be measured at tissue level only, while it originates from differences between the individual cardiac cells. The extent of heterogeneity at tissue and single cell level can differ substantially and in many cases it is important to know the relation between them. Here we study effects from cell coupling on APD heterogeneity in cardiac tissue in numerical simulations using the ionic TP06 model for human cardiac tissue. We show that the effect of cell coupling on APD heterogeneity can be described mathematically using a Gaussian Green's function approach. This relates the problem of electrotonic interactions to a wide range of classical problems in physics, chemistry and biology, for which robust methods exist. We show that, both for determining effects of tissue heterogeneity from cell heterogeneity (forward problem) as well as for determining cell properties from tissue level measurements (inverse problem), this approach is promising. We illustrate the solution of the forward and inverse problem on several examples of 1D and 2D systems.
  265. Defauw, A., Dawyndt, P., & Panfilov, A. (2013). Initiation and dynamics of a spiral wave around an ionic heterogeneity in a model for human cardiac tissue. PHYSICAL REVIEW E, 88(6).
    In relation to cardiac arrhythmias, heterogeneity of cardiac tissue is one of the most important factors underlying the onset of spiral waves and determining their type. In this paper, we numerically model heterogeneity of realistic size and value and study formation and dynamics of spiral waves around such heterogeneity. We find that the only sustained pattern obtained is a single spiral wave anchored around the heterogeneity. Dynamics of an anchored spiral wave depend on the extent of heterogeneity, and for certain heterogeneity size, we find abrupt regional increase in the period of excitation occurring as a bifurcation. We study factors determining spatial distribution of excitation periods of anchored spiral waves and discuss consequences of such dynamics for cardiac arrhythmias and possibilities for experimental testings of our predictions.
  266. Muth, T., Peters, J., Blackburn, J., Rapp, E., & Martens, L. (2013). ProteoCloud: a full-featured open source proteomics cloud computing pipeline. JOURNAL OF PROTEOMICS, 88, 104–108.
  267. Perez-Riverol, Y., Hermjakob, H., Kohlbacher, O., Martens, L., Creasy, D., Cox, J., Leprevost, F., et al. (2013). Computational proteomics pitfalls and challenges: HavanaBioinfo 2012 Workshop report. JOURNAL OF PROTEOMICS, 87, 134–138.
  268. Martens, Lennart. (2013). Bringing proteomics into the clinic: the need for the field to finally take itself seriously. PROTEOMICS CLINICAL APPLICATIONS, 7(5-6), 388–391.
    Proteomics has fast become a standard tool in the life sciences, with increasingly sophisticated approaches and instruments delivering ever growing numbers of identified and quantified proteins. Yet despite the enormous technological progress, and the triumphant papers published on whole-cell proteomes being collected and analyzed, proteomics has so far failed to enter the clinic for routine applications. This is a peculiar contradiction, and one that warrants some closer study. I here argue that for proteomics to make a difference in the clinic, it needs to stop shirking responsibility, and to mature into an analytical, transparent, and reproducible discipline that also invests in the consolidation of its technology rather than only focusing on the next big leap forward. A key enabling factor in this maturation process is quality control and quality assurance, with bioinformatics, in its least noticeable but most influential form, as a key underlying technology.
  269. Martens, Lennart. (2013). Resilience in the proteomics data ecosystem: how the field cares for its data. PROTEOMICS, 13(10-11), 1548–1550.
    The public dissemination of data is an integral part of the life sciences. In the field of proteomics too, data sharing has taken off over the last few years, with the first downstream uses of these data quickly gaining prominence. At the same time, the recent unfortunate demise of two repositories, NCBI Peptidome and ProteomeCommons Tranche, has shown the frailty of such data gathering efforts. Heroic efforts by the PRIDE and Peptidome teams to rescue the Peptidome data have now ensured their continued availability to the field, and alternatives have already been put in place for Tranche. But with public data increasingly at the hub of the life sciences, it is a good time to look at the proteomics data ecosystem in some more detail.
  270. Baele, G., Lemey, P., & Vansteelandt, S. (2013). Make the most of your samples : Bayes factor estimators for high-dimensional models of sequence evolution. BMC BIOINFORMATICS, 14.
    Background: Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed through the evaluation of a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model's marginal likelihood will be estimated individually, which leads the resulting Bayes factor to suffer from errors associated with each of these independent estimation processes. Results: We here assess the original 'model-switch' path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples, to construct a direct path between two competing models, thereby eliminating the need to calculate each model's marginal likelihood independently. Further, we provide a competing Bayes factor estimator using an adaptation of the recently introduced stepping-stone sampling algorithm and set out to determine appropriate settings for accurately calculating such Bayes factors, with context-dependent evolutionary models as an example. While we show that modest efforts are required to roughly identify the increase in model fit, only drastically increased computation times ensure the accuracy needed to detect more subtle details of the evolutionary process. Conclusions: We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison. Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational efforts in order to approximate the true value of the (log) Bayes factor compared to the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.
  271. Vandepoele, Klaas, Van Bel, M., Richard, G., Van Landeghem, S., Verhelst, B., Moreau, H., Van de Peer, Y., et al. (2013). pico-PLAZA, a genome database of microbial photosynthetic eukaryotes. ENVIRONMENTAL MICROBIOLOGY, 15(8), 2147–2153.
    With the advent of next generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichment analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains.
  272. Ciesielska, K., Li, B., Groeneboer, S., Van Bogaert, I., Lin, Y.-C., Soetaert, W., Van de Peer, Y., et al. (2013). SILAC-based proteome analysis of Starmerella bombicola sophorolipid production. JOURNAL OF PROTEOME RESEARCH, 12(10), 4376–4392.
    Starmerella (Candida) bombicola is the biosurfactant-producing species that caught the greatest deal of attention in the academic and industrial world due to its ability to produce large amounts of sophorolipids. Despite its high economic potential, the biochemistry behind the sophorolipid biosynthesis is still poorly understood. Here we present the first proteomic characterization of S. bombicola for which we created a lys1Δ mutant to allow the use of SILAC for quantitative analysis. To characterize the processes behind the production of these biosurfactants, we compared the proteome of sophorolipid producing (early stationary phase) and nonproducing cells (exponential phase). We report the simultaneous production of all known enzymes involved in sophorolipid biosynthesis including a predicted sophorolipid transporter. In addition, we identified the heme binding protein Dap1 as a possible regulator for Cyp52M1. Our results further indicate that ammonium and phosphate limitation are not the sole limiting factors inducing sophorolipid biosynthesis.
  273. Yang, Yudi, Foulquié-Moreno, M. R., Clement, L., Erdei, É., Tanghe, A., Schaerlaekens, K., Dumortier, F., et al. (2013). QTL analysis of high thermotolerance with superior and downgraded parental yeast strains reveals new minor QTLs and converges on novel causative alleles involved in RNA processing. PLOS GENETICS, 9(8).
    Revealing QTLs with a minor effect in complex traits remains difficult. Initial strategies had limited success because of interference by major QTLs and epistasis. New strategies focused on eliminating major QTLs in subsequent mapping experiments. Since genetic analysis of superior segregants from natural diploid strains usually also reveals QTLs linked to the inferior parent, we have extended this strategy for minor QTL identification by eliminating QTLs in both parent strains and repeating the QTL mapping with pooled-segregant whole-genome sequence analysis. We first mapped multiple QTLs responsible for high thermotolerance in a natural yeast strain, MUCL28177, compared to the laboratory strain, BY4742. Using single and bulk reciprocal hemizygosity analysis we identified MKT1 and PRP42 as causative genes in QTLs linked to the superior and inferior parent, respectively. We subsequently downgraded both parents by replacing their superior allele with the inferior allele of the other parent. QTL mapping using pooled-segregant whole-genome sequence analysis with the segregants from the cross of the downgraded parents, revealed several new QTLs. We validated the two most-strongly linked new QTLs by identifying NCS2 and SMD2 as causative genes linked to the superior downgraded parent and we found an allele-specific epistatic interaction between PRP42 and SMD2. Interestingly, the related function of PRP42 and SMD2 suggests an important role for RNA processing in high thermotolerance and underscores the relevance of analyzing minor QTLs. Our results show that identification of minor QTLs involved in complex traits can be successfully accomplished by crossing parent strains that have both been downgraded for a single QTL. 
    This novel approach has the advantage of maintaining all relevant genetic diversity as well as enough phenotypic difference between the parent strains for the trait-of-interest and thus maximizes the chances of successfully identifying additional minor QTLs that are relevant for the phenotypic difference between the original parents.
  274. Forcheh, A. C., Verbeke, G., Kasim, A., Lin, D., Shkedy, Z., Talloen, W., … Clement, L. (2013). beadarrayFilter : an R package to filter beads. R JOURNAL, 5(1), 171–180.
    Microarrays enable the expression levels of thousands of genes to be measured simultaneously. However, only a small fraction of these genes are expected to be expressed under different experimental conditions. Nowadays, filtering has been introduced as a step in the microarray preprocessing pipeline. Gene filtering aims at reducing the dimensionality of data by filtering redundant features prior to the actual statistical analysis. Previous filtering methods focus on the Affymetrix platform and can not be easily ported to the Illumina platform. As such, we developed a filtering method for Illumina bead arrays. We developed an R package, beadarrayFilter, to implement the latter method. In this paper, the main functions in the package are highlighted and using many examples, we illustrate how beadarrayFilter can be used to filter bead arrays.
  275. Roelants, S., Saerens, K., Derycke, T., Li, B., Lin, Y.-C., Van de Peer, Y., De Maeseneire, S., et al. (2013). Candida bombicola as a platform organism for the production of tailor-made biomolecules. BIOTECHNOLOGY AND BIOENGINEERING, 110(9), 2494–2503.
  276. Vaudel, M., Breiter, D., Beck, F., Rahnenführer, J., Martens, L., & Zahedi, R. P. (2013). D-score: a search engine independent MD-score. PROTEOMICS, 13(6), 1036–1041.
    While peptides carrying PTMs are routinely identified in gel-free MS, the localization of the PTMs onto the peptide sequences remains challenging. Search engine scores of secondary peptide matches have been used in different approaches in order to infer the quality of site inference, by penalizing the localization whenever the search engine similarly scored two candidate peptides with different site assignments. In the present work, we show how the estimation of posterior error probabilities for peptide candidates allows the estimation of a PTM score called the D-score, for multiple search engine studies. We demonstrate the applicability of this score to three popular search engines: Mascot, OMSSA, and X!Tandem, and evaluate its performance using an already published high resolution data set of synthetic phosphopeptides. For those peptides with phosphorylation site inference uncertainty, the number of spectrum matches with correctly localized phosphorylation increased by up to 25.7% when compared to using Mascot alone, although the actual increase depended on the fragmentation method used. Since this method relies only on search engine scores, it can be readily applied to the scoring of the localization of virtually any modification at no additional experimental or in silico cost.
  277. Perez Novo, C., Zhang, Y., Denil, S., Trooskens, G., De Meyer, T., Van Criekinge, W., Van Cauwenberge, P., et al. (2013). Staphylococcal enterotoxin B influences the DNA methylation pattern in nasal polyp tissue : a preliminary study. ALLERGY ASTHMA AND CLINICAL IMMUNOLOGY, 9.
    Staphylococcal enterotoxins may influence the pro-inflammatory pattern of chronic sinus diseases via epigenetic events. This work intended to investigate the potential of staphylococcal enterotoxin B (SEB) to induce changes in the DNA methylation pattern. Nasal polyp tissue explants were cultured in the presence and absence of SEB; genomic DNA was then isolated and used for whole genome methylation analysis. Results showed that SEB stimulation altered the methylation pattern of gene regions when compared with non-stimulated tissue. Data enrichment analysis highlighted two genes, IKBKB and STAT-5B, both playing a crucial role in T-cell maturation/activation and immune response.
  278. Nystedt, B., Street, N. R., Wetterbom, A., Zuccolo, A., Lin, Y.-C., Scofield, D. G., Vezzi, F., et al. (2013). The Norway spruce genome sequence and conifer genome evolution. NATURE, 497(7451), 579–584.
    Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.
  279. Vandepitte, Katrien, De Meyer, T., Jacquemyn, H., Roldàn-Ruiz, I., & Honnay, O. (2013). The impact of extensive clonal growth on fine-scale mating patterns: a full paternity analysis of a lily-of-the-valley population (Convallaria majalis). ANNALS OF BOTANY, 111(4), 623–628.
    The combination of clonality and a mating system promoting outcrossing is considered advantageous because outcrossing avoids the fitness costs of selfing within clones (geitonogamy) while clonality assures local persistence and increases floral display. The spatial spread of genetically identical plants (ramets) may, however, also decrease paternal diversity (the number of sires fertilizing a given dam) and fertility, particularly towards the centre of large clumped clones. This study aimed to quantify the impact of extensive clonal growth on fine-scale paternity patterns in a population of the allogamous Convallaria majalis. A full analysis of paternity was performed by genotyping all flowering individuals and all viable seeds produced during a single season using AFLP. Mating patterns were examined and the spatial position of ramets was related to the extent of multiple paternity, fruiting success and seed production. The overall outcrossing rate was high (91 %) and pollen flow into the population was considerable (27 %). Despite extensive clonal growth, multiple paternity was relatively common (the fraction of siblings sharing the same father was 0.53 within ramets). The diversity of offspring collected from reproductive ramets surrounded by genetically identical inflorescences was as high as among offspring collected from ramets surrounded by distinct genets. There was no significant relationship between the similarity of the pollen load received by two ramets and the distance between them. Neither the distance of ramets with respect to distinct genets nor the distance to the genet centre significantly affected fruiting success or seed production. Random mating and considerable pollen inflow most probably implied that pollen dispersal distances were sufficiently high to mitigate local mate scarcity despite extensive clonal spread. The data provide no evidence for the intrusion of clonal growth on fine-scale plant mating patterns.
  280. Sharpe, K., Stewart, G. D., Mackay, A., Van Neste, C., Rofe, C., Berney, D., Kayani, I., et al. (2013). The effect of VEGF-targeted therapy on biomarker expression in sequential tissue from patients with metastatic clear cell renal cancer. CLINICAL CANCER RESEARCH, 19(24), 6924–6934.
  281. Lutz, S. M., Vansteelandt, S., & Lange, C. (2013). Testing for direct genetic effects using a screening step in family-based association studies. FRONTIERS IN GENETICS, 4.
  282. Heyman, J., Cools, T., Vandenbussche, F., Heyndrickx, K., Van Leene, J., Vercauteren, I., Vanderauwera, S., et al. (2013). ERF115 controls root quiescent center cell division and stem cell replenishment. SCIENCE, 342(6160), 860–863.
    The quiescent center (QC) plays an essential role during root development by creating a microenvironment that preserves the stem cell fate of its surrounding cells. Despite being surrounded by highly mitotic active cells, QC cells self-renew at a low proliferation rate. Here, we identified the ERF115 transcription factor as a rate-limiting factor of QC cell division, acting as a transcriptional activator of the phytosulfokine PSK5 peptide hormone. ERF115 marks QC cell division but is restrained through proteolysis by the APC/C-CCS52A2 ubiquitin ligase, whereas QC proliferation is driven by brassinosteroid-dependent ERF115 expression. Together, these two antagonistic mechanisms delimit ERF115 activity, which is called upon when surrounding stem cells are damaged, revealing a cell cycle regulatory mechanism accounting for stem cell niche longevity.
  283. Staes, A., Vandenbussche, J., Demol, H., Goethals, M., Yilmaz-Rumpf, S., Hulstaert, N., Degroeve, S., et al. (2013). Asn₃, a reliable, robust, and universal lock mass for improved accuracy in LC-MS and LC-MS/MS. ANALYTICAL CHEMISTRY, 85(22), 11054–11060.
    The use of internal calibrants (the so-called lock mass approach) provides much greater accuracy in mass spectrometry based proteomics. However, the polydimethylcyclosiloxane (PCM) peaks commonly used for this purpose are quite unreliable, leading to missing calibrant peaks in spectra and correspondingly lower mass measurement accuracy. Therefore, we here introduce a universally applicable and robust internal calibrant, the tripeptide Asn(3). We show that Asn(3) is a substantial improvement over PCM both in terms of consistent detection and resulting mass measurement accuracy. Asn(3) is also very easy to adopt in the lab, as it requires only minor adjustments to the analytical setup.
  284. Vandermarliere, E., Mueller, M., & Martens, L. (2013). Getting intimate with trypsin, the leading protease in proteomics. MASS SPECTROMETRY REVIEWS, 32(6), 453–465.
  285. De Antonellis, P., Carotenuto, M., De Vita, G., Vandenbussche, J., Medaglia, C., Vandesompele, J., Mestdagh, P., et al. (2013). Early targets of MIR-34A in neuroblastoma. PEDIATRIC BLOOD & CANCER (Vol. 60, pp. 116–116). Presented at the 45th Congress of the International Society of Paediatric Oncology (SIOP 2013).
  286. Wuyts, V., Mattheus, W., De Laminne de Bex, G., Wildemauwe, C., Roosens, N. H., Marchal, K., De Keersmaecker, S. C., et al. (2013). MLVA as a tool for public health surveillance of human Salmonella Typhimurium : prospective study in Belgium and evaluation of MLVA loci stability. PLOS ONE, 8(12).
    Surveillance of Salmonella enterica subsp. enterica serovar Typhimurium (S. Typhimurium) is generally considered to benefit from molecular techniques like multiple-locus variable-number of tandem repeats analysis (MLVA), which allow earlier detection and confinement of outbreaks. Here, a surveillance study, including phage typing, antimicrobial susceptibility testing and the 5-loci MLVA most commonly used in Europe on 1,420 S. Typhimurium isolates collected between 2010 and 2012 in Belgium, was used to evaluate the added value of MLVA for public health surveillance. Phage types DT193, DT195, DT120, DT104, DT12 and U302 dominate the Belgian S. Typhimurium population. A combined resistance to ampicillin, streptomycin, sulphonamides and tetracycline (ASSuT) with or without additional resistances was observed for 42.5% of the isolates. 414 different MLVA profiles were detected, of which 14 frequent profiles included 44.4% of the S. Typhimurium population. During a serial passage experiment on selected isolates to investigate the in vitro stability of the 5 MLVA loci, variations over time were observed for loci STTR6, STTR10, STTR5 and STTR9. This study demonstrates that MLVA improves public health surveillance of S. Typhimurium. However, the 5-loci MLVA should be complemented with other subtyping methods for investigation of possible outbreaks with frequent MLVA profiles. Also, variability in these MLVA loci should be taken into account when investigating extended outbreaks and studying dynamics over longer periods.
  287. Van Bel, M., Proost, S., Van Neste, C., Deforce, D., Van de Peer, Y., & Vandepoele, K. (2013). TRAPID : an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes. GENOME BIOLOGY, 14(12).
    Transcriptome analysis through next-generation sequencing technologies allows the generation of detailed gene catalogs for non-model species, at the cost of new challenges with regards to computational requirements and bioinformatics expertise. Here, we present TRAPID, an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data, developed to mitigate these challenges. TRAPID offers high-throughput open reading frame detection, frameshift correction and includes a functional, comparative and phylogenetic toolbox, making use of 175 reference proteomes. Benchmarking and comparison against state-of-the-art transcript analysis tools reveals the efficiency and unique features of the TRAPID system.
  288. Verbeke, Lieven, Demeester, P., Fostier, J., & Marchal, K. (2013). EPSILON: an eQTL prioritization framework using similarity measures derived from local networks. Intelligent Systems for Molecular Biology, 21st Annual international conference, Abstracts. Presented at the 21st Annual international conference on Intelligent Systems for Molecular Biology (ISMB 2013).
  289. Verbeke, Lieven, Fierro, A., Van den Eynden, J., Demeester, P., Fostier, J., & Marchal, K. (2013). Identifying relevant pathways for different breast cancer subtypes using network based data integration. Regulatory and Systems Genomics, 6th Annual RECOMB/ISCB conference, Abstracts. Presented at the 6th Annual RECOMB/ISCB conference on Regulatory and Systems Genomics, with DREAM Challenges 2013, International Society for Computational Biology (ISCB).
  290. Venken, L., Marchal, K., & Vanderleyden, J. (2013). Synthetic biology and microdevices : a powerful combination. ACM JOURNAL ON EMERGING TECHNOLOGIES IN COMPUTING SYSTEMS, 9(4).
    Recent developments demonstrate that the combination of microbiology with micro- and nanoelectronics is a successful approach to develop new miniaturized sensing devices and other technologies. In the last decade, there has been a shift from the optimization of the abiotic components, for example, the chip, to the improvement of the processing capabilities of cells through genetic engineering. The synthetic biology approach will not only give rise to systems with new functionalities, but will also improve the robustness and speed of their response towards applied signals. To this end, the development of new genetic circuits has to be guided by computational design methods that make it possible to tune and optimize the circuit response. As the successful design of genetic circuits is highly dependent on the quality and reliability of its composing elements, intense characterization of standard biological parts will be crucial for an efficient rational design process in the development of new genetic circuits. Microengineered devices can thereby offer a new analytical approach for the study of complex biological parts and systems. By summarizing the recent techniques in creating new synthetic circuits and in integrating biology with microdevices, this review aims at emphasizing the power of combining synthetic biology with microfluidics and microelectronics.
  291. Meyer, Pablo, Siwo, G., Zeevi, D., Sharon, E., Norel, R., Segal, E., Stolovitzky, G., et al. (2013). Inferring gene expression from ribosomal promoter sequences, a crowdsourcing approach. GENOME RESEARCH, 23(11), 1928–1937.
    The Gene Promoter Expression Prediction challenge consisted of predicting gene expression from promoter sequences in a previously unknown experimentally generated data set. The challenge was presented to the community in the framework of the sixth Dialogue for Reverse Engineering Assessments and Methods (DREAM6), a community effort to evaluate the status of systems biology modeling methodologies. Nucleotide-specific promoter activity was obtained by measuring fluorescence from promoter sequences fused upstream of a gene for yellow fluorescence protein and inserted in the same genomic site of yeast Saccharomyces cerevisiae. Twenty-one teams submitted results predicting the expression levels of 53 different promoters from yeast ribosomal protein genes. Analysis of participant predictions shows that accurate values for low-expressed and mutated promoters were difficult to obtain, although in the latter case, only when the mutation induced a large change in promoter activity compared to the wild-type sequence. As in previous DREAM challenges, we found that aggregation of participant predictions provided robust results, but did not fare better than the three best algorithms. Finally, this study not only provides a benchmark for the assessment of methods predicting activity of a specific set of promoters from their sequence, but it also shows that the top performing algorithm, which used machine-learning approaches, can be improved by the addition of biological features such as transcription factor binding sites.
  292. Aslankoohi, E., Zhu, B., Rezaei, M. N., Voordeckers, K., De Maeyer, D., Marchal, K., Dornez, E., et al. (2013). Dynamics of the Saccharomyces cerevisiae transcriptome during bread dough fermentation. APPLIED AND ENVIRONMENTAL MICROBIOLOGY, 79(23), 7325–7333.
    The behavior of yeast cells during industrial processes such as the production of beer, wine, and bioethanol has been extensively studied. In contrast, our knowledge about yeast physiology during solid-state processes, such as bread dough, cheese, or cocoa fermentation, remains limited. We investigated changes in the transcriptomes of three genetically distinct Saccharomyces cerevisiae strains during bread dough fermentation. Our results show that regardless of the genetic background, all three strains exhibit similar changes in expression patterns. At the onset of fermentation, expression of glucose-regulated genes changes dramatically, and the osmotic stress response is activated. The middle fermentation phase is characterized by the induction of genes involved in amino acid metabolism. Finally, at the latest time point, cells suffer from nutrient depletion and activate pathways associated with starvation and stress responses. Further analysis shows that genes regulated by the high-osmolarity glycerol (HOG) pathway, the major pathway involved in the response to osmotic stress and glycerol homeostasis, are among the most differentially expressed genes at the onset of fermentation. More importantly, deletion of HOG1 and other genes of this pathway significantly reduces the fermentation capacity. Together, our results demonstrate that cells embedded in a solid matrix such as bread dough suffer severe osmotic stress and that a proper induction of the HOG pathway is critical for optimal fermentation.
  293. De Roy, Karen, Clement, L., Thas, O., Wang, Y., & Boon, N. (2012). Flow cytometry for fast microbial community fingerprinting. WATER RESEARCH, 46(3), 907–919.
  294. Abeel, T., Van Parys, T., Saeys, Y., Galagan, J., & Van de Peer, Y. (2012). GenomeView : a next-generation genome browser. NUCLEIC ACIDS RESEARCH, 40(2).
    Due to ongoing advances in sequencing technologies, billions of nucleotide sequences are now produced on a daily basis. A major challenge is to visualize these data for further downstream analysis. To this end, we present GenomeView, a stand-alone genome browser specifically designed to visualize and manipulate a multitude of genomics data. GenomeView enables users to dynamically browse high volumes of aligned short-read data, with dynamic navigation and semantic zooming, from the whole genome level to the single nucleotide. At the same time, the tool enables visualization of whole genome alignments of dozens of genomes relative to a reference sequence. GenomeView is unique in its capability to interactively handle huge data sets consisting of tens of aligned genomes, thousands of annotation features and millions of mapped short reads both as viewer and editor. GenomeView is freely available as an open source software package.
  295. Fawcett, J., Rouzé, P., & Van de Peer, Y. (2012). Higher intron loss rate in Arabidopsis thaliana than A. lyrata is consistent with stronger selection for a smaller genome. MOLECULAR BIOLOGY AND EVOLUTION, 29(2), 849–859.
    The number of introns varies considerably among different organisms. This can be explained by the differences in the rates of intron gain and loss. Two factors that are likely to influence these rates are selection for or against introns and the mutation rate that generates the novel intron or the intronless copy. Although it has been speculated that stronger selection for a compact genome might result in a higher rate of intron loss and a lower rate of intron gain, clear evidence is lacking, and the role of selection in determining these rates has not been established. Here, we studied the gain and loss of introns in the two closely related species Arabidopsis thaliana and A. lyrata as it was recently shown that A. thaliana has been undergoing a faster genome reduction driven by selection. We found that A. thaliana has lost six times more introns than A. lyrata since the divergence of the two species but gained very few introns. We suggest that stronger selection for genome reduction probably resulted in the much higher intron loss rate in A. thaliana, although further analysis is required as we could not find evidence that the loss rate increased in A. thaliana as opposed to having decreased in A. lyrata compared with the rate in the common ancestor. We also examined the pattern of the intron gains and losses to better understand the mechanisms by which they occur. Microsimilarity was detected between the splice sites of several gained and lost introns, suggesting that nonhomologous end joining repair of double-strand breaks might be a common pathway not only for intron gain but also for intron loss.
  296. Clement, Lieven, De Beuf, K., Thas, O., Vuylsteke, M., Irizarry, R., & Crainiceanu, C. M. (2012). Fast wavelet based functional models for transcriptome analysis with tiling arrays. STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 11(1).
    For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adopt wavelet based functional models to the context of tiling arrays. The high dimensionality of the data triggered us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
  297. Thas, O., Clement, L., Rayner, J., Carvalho, B., & Van Criekinge, W. (2012). An omnibus consistent adaptive percentile modified Wilcoxon rank sum test with applications in gene expression studies. BIOMETRICS, 68(2), 446–454.
    We present an adaptive percentile modified Wilcoxon rank sum test for the two-sample problem. The test is basically a Wilcoxon rank sum test applied on a fraction of the sample observations, and the fraction is adaptively determined by the sample observations. Most of the theory is developed under a location-shift model, but we demonstrate that the test is also meaningful for testing against more general alternatives. The test may be particularly useful for the analysis of massive datasets in which quasi-automatic hypothesis testing is required. We investigate the power characteristics of the new test in a simulation study, and we apply the test to a microarray experiment on colorectal cancer. These empirical studies demonstrate that the new test has good overall power and that it succeeds better in finding differentially expressed genes as compared to other popular tests. We conclude that the new nonparametric test is widely applicable and that its power is comparable to the power of the Baumgartner-Weiß-Schindler test.
  298. Proost, Sebastian, Fostier, J., De Witte, D., Dhoedt, B., Demeester, P., Van de Peer, Y., & Vandepoele, K. (2012). i-ADHoRe 3.0 : fast and sensitive detection of genomic homology in extremely large data sets. NUCLEIC ACIDS RESEARCH, 40(2).
  299. Jones, Alexandre ME, Aebersold, R., Ahrens, C. H., Apweiler, R., Baerenfaller, K., Baker, M., Bendixen, E., et al. (2012). The HUPO initiative on Model Organism Proteomes, iMOP. PROTEOMICS, 12(3), 340–345.
    The community working on model organisms is growing steadily and the number of model organisms for which proteome data are being generated is continuously increasing. To standardize efforts and to make optimal use of proteomics data acquired from model organisms, a new Human Proteome Organisation (HUPO) initiative on model organism proteomes (iMOP) was approved at the HUPO Ninth Annual World Congress in Sydney, 2010. iMOP will seek to stimulate scientific exchange and disseminate HUPO best practices. The needs of model organism researchers for central databases will be better represented, catalyzing the integration of proteomics and organism-specific databases. Full details of iMOP activities, members, tools and resources can be found at our website and new members are invited to join us.
  300. Deutsch, E. W., Chambers, M., Neumann, S., Levander, F., Binz, P.-A., Shofstahl, J., Campbell, D. S., et al. (2012). TraML: a standard format for exchange of selected reaction monitoring transition lists. MOLECULAR & CELLULAR PROTEOMICS, 11(4).
    Targeted proteomics via selected reaction monitoring (SRM) is a powerful mass spectrometric technique affording higher dynamic range, increased specificity and lower limits of detection than other shotgun mass spectrometry methods when applied to proteome analyses. However, it involves selective measurement of predetermined analytes, which requires more preparation in the form of selecting appropriate signatures for the proteins and peptides that are to be targeted. There is a growing number of software programs and resources for selecting optimal transitions and the instrument settings used for the detection and quantification of the targeted peptides, but the exchange of this information is hindered by a lack of a standard format. We have developed a new standardized format, called TraML, for encoding transition lists and associated metadata. In addition to introducing the TraML format, we demonstrate several implementations across the community, and provide semantic validators, extensive documentation, and multiple example instances to demonstrate correctly written documents. Widespread use of TraML will facilitate the exchange of transitions, reduce time spent handling incompatible list formats, increase the reusability of previously optimized transitions, and thus accelerate the widespread adoption of targeted proteomics via SRM.
  301. Wang, Rui, Fabregat, A., Ríos, D., Ovelleiro, D., Foster, J. M., Côté, R. G., Griss, J., et al. (2012). PRIDE Inspector: a tool to visualize and validate MS proteomics data. NATURE BIOTECHNOLOGY, 30(2), 135–137.
  302. Vansteelandt, S., VanderWeele, T. J., & Robins, J. M. (2012). Semiparametric tests for sufficient cause interaction. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 74(2), 223–244.
    A sufficient cause interaction between two exposures signals the presence of individuals for whom the outcome would occur only under certain values of the two exposures. When the outcome is dichotomous and all exposures are categorical, then, under certain no confounding assumptions, empirical conditions for sufficient cause interactions can be constructed on the basis of the sign of linear contrasts of conditional outcome probabilities between differently exposed subgroups, given confounders. It is argued that logistic regression models are unsatisfactory for evaluating such contrasts, and that Bernoulli regression models with linear link are prone to misspecification. We therefore develop semiparametric tests for sufficient cause interactions under models which postulate probability contrasts in terms of a finite dimensional parameter, but which are otherwise unspecified. Estimation is often not feasible in these models because it would require non-parametric estimation of auxiliary conditional expectations given high dimensional variables. We therefore develop multiply robust tests under a union model which assumes that at least one of several working submodels holds. In the special case of a randomized experiment or a family-based genetic study in which the joint exposure distribution is known by design or Mendelian inheritance, the procedure leads to asymptotically distribution-free tests of the null hypothesis of no sufficient cause interaction.
  303. Hacquard, S., Joly, D. L., Lin, Y.-C., Tisserant, E., Feau, N., Delaruelle, C., Legué, V., et al. (2012). A comprehensive analysis of genes encoding small secreted proteins identifies candidate effectors in Melampsora larici-populina (poplar leaf rust). MOLECULAR PLANT-MICROBE INTERACTIONS, 25(3), 279–293.
    The obligate biotrophic rust fungus Melampsora larici-populina is the most devastating and widespread pathogen of poplars. Studies over recent years have identified various small secreted proteins (SSP) from plant biotrophic filamentous pathogens and have highlighted their role as effectors in host-pathogen interactions. The recent analysis of the M. larici-populina genome sequence has revealed the presence of 1,184 SSP-encoding genes in this rust fungus. In the present study, the expression and evolutionary dynamics of these SSP were investigated to pinpoint the arsenal of putative effectors that could be involved in the interaction between the rust fungus and poplar. Similarity with effectors previously described in Melampsora spp., richness in cysteines, and organization in large families were extensively detailed and discussed. Positive selection analyses conducted over clusters of paralogous genes revealed fast-evolving candidate effectors. Transcript profiling of selected M. larici-populina SSP showed a timely coordinated expression during leaf infection, and the accumulation of four candidate effectors in distinct rust infection structures was demonstrated by immunolocalization. This integrated and multifaceted approach helps to prioritize candidate effector genes for functional studies.
  304. Malacarne, G., Perazzolli, M., Cestaro, A., Sterck, L., Fontana, P., Van de Peer, Y., Viola, R., et al. (2012). Deconstruction of the (paleo)polyploid grapevine genome based on the analysis of transposition events involving NBS resistance genes. PLOS ONE, 7(1).
    Plants have followed a reticulate type of evolution and taxa have frequently merged via allopolyploidization. A polyploid structure of sequenced genomes has often been proposed, but the chromosomes belonging to putative component genomes are difficult to identify. The 19 grapevine chromosomes are evolutionarily stable structures: their homologous triplets have strongly conserved gene order, interrupted by rare translocations. The aim of this study is to examine how the grapevine nucleotide-binding site (NBS)-encoding resistance (NBS-R) genes have evolved in the genomic context and to understand mechanisms of genome evolution. We show that, in grapevine, i) helitrons have significantly contributed to transposition of NBS-R genes, and ii) NBS-R gene cluster similarity indicates the existence of two groups of chromosomes (named as Va and Vc) that may have evolved independently. Chromosome triplets consist of two Va and one Vc chromosomes, as expected from the tetraploid and diploid conditions of the two component genomes. The hexaploid state could have been derived from either allopolyploidy or the separation of the Va and Vc component genomes in the same nucleus before fusion, as known for Rosaceae species. Time estimation indicates that grapevine component genomes may have fused about 60 mya, having had at least 40-60 mya to evolve independently. Chromosome number variation in the Vitaceae and related families, and the gap between the time of eudicot radiation and the age of Vitaceae fossils, are accounted for by our hypothesis.
  305. Abou-El-Ardat, Khalil, Monsieurs, P., Anastasov, N., Atkinson, M., Derradji, H., De Meyer, T., Bekaert, S., et al. (2012). Low dose irradiation of thyroid cells reveals a unique transcriptomic and epigenetic signature in RET/PTC-positive cells. MUTATION RESEARCH-FUNDAMENTAL AND MOLECULAR MECHANISMS OF MUTAGENESIS, 731(1-2), 27–40.
  306. Dessimoz, C., Gabaldón, T., Roos, D. S., Sonnhammer, E. L., Herrero, J., Quest Orthologs Consortium, the, Vandepoele, K., et al. (2012). Toward community standards in the quest for orthologs. BIOINFORMATICS, 28(6), 900–904.
  307. Kyndt, T., Denil, S., Haegeman, A., Trooskens, G., De Meyer, T., Van Criekinge, W., & Gheysen, G. (2012). Transcriptome analysis of rice mature root tissue and root tips in early development by massive parallel sequencing. JOURNAL OF EXPERIMENTAL BOTANY, 63(5), 2141–2157.
  308. Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer, Y., & Vandepoele, K. (2012). Dissecting plant genomes with the PLAZA comparative genomics platform. PLANT PHYSIOLOGY, 158(2), 590–600.
    With the arrival of low-cost, next-generation sequencing, a multitude of new plant genomes are being publicly released, providing unseen opportunities and challenges for comparative genomics studies. Here, we present PLAZA 2.5, a user-friendly online research environment to explore genomic information from different plants. This new release features updates to previous genome annotations and a substantial number of newly available plant genomes as well as various new interactive tools and visualizations. Currently, PLAZA hosts 25 organisms covering a broad taxonomic range, including 13 eudicots, five monocots, one lycopod, one moss, and five algae. The available data consist of structural and functional gene annotations, homologous gene families, multiple sequence alignments, phylogenetic trees, and colinear regions within and between species. A new Integrative Orthology Viewer, combining information from different orthology prediction methodologies, was developed to efficiently investigate complex orthology relationships. Cross-species expression analysis revealed that the integration of complementary data types extended the scope of complex orthology relationships, especially between more distantly related species. Finally, based on phylogenetic profiling, we propose a set of core gene families within the green plant lineage that will be instrumental to assess the gene space of draft or newly sequenced plant genomes during the assembly or annotation phase.
  309. Whitford, R., Fernandez Salina, A., Tejos Ulloa, R., Cuéllar Pérez, A., Kleine-Vehn, J., Vanneste, S., Drozdzecki, A., et al. (2012). GOLVEN secretory peptides regulate auxin carrier turnover during plant gravitropic responses. DEVELOPMENTAL CELL, 22(3), 678–685.
  310. Vyverman, M., De Baets, B., Fack, V., & Dawyndt, P. (2012). Prospects and limitations of full-text index structures in genome analysis. NUCLEIC ACIDS RESEARCH, 40(15), 6993–7015.
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.
  311. Milner, D. A., Jr, Pochet, N., Krupka, M., Williams, C., Seydel, K., Taylor, T., Van de Peer, Y., et al. (2012). Transcriptional profiling of Plasmodium falciparum parasites from patients with severe malaria identifies distinct low vs. high parasitemic clusters. PLOS ONE, 7(7).
    Background: In the past decade, estimates of malaria infections have dropped from 500 million to 225 million per year; likewise, mortality rates have dropped from 3 million to 791,000 per year. However, approximately 90% of these deaths continue to occur in sub-Saharan Africa, and 85% involve children less than 5 years of age. Malaria mortality in children generally results from one or more of the following clinical syndromes: severe anemia, acidosis, and cerebral malaria (CM). Although much is known about the clinical and pathological manifestations of CM, insights into the biology of the malaria parasite, specifically transcription during this manifestation of severe infection, are lacking. Methods and Findings: We collected peripheral blood from children meeting the clinical case definition of cerebral malaria from a cohort in Malawi, examined the patients for the presence or absence of malaria retinopathy, and performed whole genome transcriptional profiling for Plasmodium falciparum using a custom designed Affymetrix array. We identified two distinct physiological states that showed highly significant association with the level of parasitemia. We compared both groups of Malawi expression profiles with our previously acquired ex vivo expression profiles of parasites derived from infected patients with mild disease; a large collection of in vitro Plasmodium falciparum life cycle gene expression profiles; and an extensively annotated compendium of expression data from Saccharomyces cerevisiae. The high parasitemia patient group demonstrated a unique biology with elevated expression of Hrd1, a member of endoplasmic reticulum-associated protein degradation system. Conclusions: The presence of a unique high parasitemia state may be indicative of the parasite biology of the clinically recognized hyperparasitemic severe disease syndrome.
  312. Heyndrickx, K., & Vandepoele, K. (2012). Systematic identification of functional plant modules through the integration of complementary data sources. PLANT PHYSIOLOGY, 159(3), 884–901.
    A major challenge is to unravel how genes interact and are regulated to exert specific biological functions. The integration of genome-wide functional genomics data, followed by the construction of gene networks, provides a powerful approach to identify functional gene modules. Large-scale expression data, functional gene annotations, experimental protein-protein interactions, and transcription factor-target interactions were integrated to delineate modules in Arabidopsis (Arabidopsis thaliana). The different experimental input data sets showed little overlap, demonstrating the advantage of combining multiple data types to study gene function and regulation. In the set of 1,563 modules covering 13,142 genes, most modules displayed strong coexpression, but functional and cis-regulatory coherence was less prevalent. Highly connected hub genes showed a significant enrichment toward embryo lethality and evidence for cross talk between different biological processes. Comparative analysis revealed that 58% of the modules showed conserved coexpression across multiple plants. Using module-based functional predictions, 5,562 genes were annotated, and an evaluation experiment disclosed that, based on 197 recently experimentally characterized genes, 38.1% of these functions could be inferred through the module context. Examples of confirmed genes of unknown function related to cell wall biogenesis, xylem and phloem pattern formation, cell cycle, hormone stimulus, and circadian rhythm highlight the potential to identify new gene functions. The module-based predictions offer new biological hypotheses for functionally unknown genes in Arabidopsis (1,701 genes) and six other plant species (43,621 genes). Furthermore, the inferred modules provide new insights into the conservation of coexpression and coregulation as well as a starting point for comparative functional annotation.
  313. Petrov, V., Vermeirssen, V., De Clercq, I., Van Breusegem, F., Minkov, I., Vandepoele, K., & Gechev, T. S. (2012). Identification of cis-regulatory elements specific for different types of reactive oxygen species in Arabidopsis thaliana. GENE, 499(1), 52–60.
  314. Quimbaya Gomez, M. A., Vandepoele, K., Raspé, E., Matthijs, M., Dhondt, S., Beemster, G., Berx, G., et al. (2012). Identification of putative cancer genes through data integration and comparative genomics between plants and humans. CELLULAR AND MOLECULAR LIFE SCIENCES, 69(12), 2041–2055.
    Coordination of cell division with growth and development is essential for the survival of organisms. Mistakes made during replication of genetic material can result in cell death, growth defects, or cancer. Because of the essential role of the molecular machinery that controls DNA replication and mitosis during development, its high degree of conservation among organisms is not surprising. Mammalian cell cycle genes have orthologues in plants, and vice versa. However, besides the many known and characterized proliferation genes, still undiscovered regulatory genes are expected to exist with conserved functions in plants and humans. Starting from genome-wide Arabidopsis thaliana microarray data, an integrative strategy based on coexpression, functional enrichment analysis, and cis-regulatory element annotation was combined with a comparative genomics approach between plants and humans to detect conserved cell cycle genes involved in DNA replication and/or DNA repair. With this systemic strategy, a set of 339 genes was identified as potentially conserved proliferation genes. Experimental analysis confirmed that 20 out of 40 selected genes had an impact on plant cell proliferation; likewise, an evolutionarily conserved role in cell division was corroborated for two human orthologues. Moreover, association analysis integrating Homo sapiens gene expression data with clinical information revealed that, for 45 genes, altered transcript levels and relapse risk clearly correlated. Our results illustrate how a systematic exploration of the A. thaliana genome can contribute to the experimental identification of new cell cycle regulators that might represent novel oncogenes or/and tumor suppressors.
  315. Van de Peer, Y., & Pires, J. C. (2012). Getting up to speed. CURRENT OPINION IN PLANT BIOLOGY.
  316. Wang, F., Vandepoele, K., & Van Lijsebettens, M. (2012). Tetraspanin genes in plants. PLANT SCIENCE, 190, 9–15.
  317. Brown, J. R., Hanna, M., Tesar, B., Werner, L., Pochet, N., Asara, J. M., Wang, Y. E., et al. (2012). Integrative genomic analysis implicates gain of PIK3CA at 3q26 and MYC at 8q24 in chronic lymphocytic leukemia. CLINICAL CANCER RESEARCH, 18(14), 3791–3802.
    Purpose: The disease course of chronic lymphocytic leukemia (CLL) varies significantly within cytogenetic groups. We hypothesized that high-resolution genomic analysis of CLL would identify additional recurrent abnormalities associated with short time-to-first therapy (TTFT). Experimental Design: We undertook high-resolution genomic analysis of 161 prospectively enrolled CLLs using Affymetrix 6.0 SNP arrays, and integrated analysis of this data set with gene expression profiles. Results: Copy number analysis (CNA) of nonprogressive CLL reveals a stable genotype, with a median of only 1 somatic CNA per sample. Progressive CLL with 13q deletion was associated with additional somatic CNAs, and a greater number of CNAs was predictive of TTFT. We identified other recurrent CNAs associated with short TTFT: 8q24 amplification focused on the cancer susceptibility locus near MYC in 3.7%; 3q26 amplifications focused on PIK3CA in 5.6%; and 8p deletions in 5% of patients. Sequencing of MYC further identified somatic mutations in two CLLs. We determined which catalytic subunits of phosphoinositide 3-kinase (PI3K) were in active complex with the p85 regulatory subunit and showed enrichment for the α subunit in three CLLs carrying PIK3CA amplification. Conclusions: Our findings implicate amplifications of 3q26 focused on PIK3CA and 8q24 focused on MYC in CLL.
  318. Vaulot, D., Lepere, C., Toulza, E., De la Iglesia, R., Poulain, J., Gaboyer, F., Moreau, H., et al. (2012). Metagenomes of the picoalga Bathycoccus from the Chile coastal upwelling. PLOS ONE, 7(6).
    Among small photosynthetic eukaryotes that play a key role in oceanic food webs, picoplanktonic Mamiellophyceae such as Bathycoccus, Micromonas, and Ostreococcus are particularly important in coastal regions. By using a combination of cell sorting by flow cytometry, whole genome amplification (WGA), and 454 pyrosequencing, we obtained metagenomic data for two natural picophytoplankton populations from the coastal upwelling waters off central Chile. About 60% of the reads of each sample could be mapped to the genome of Bathycoccus strain from the Mediterranean Sea (RCC1105), representing a total of 9 Mbp (sample T142) and 13 Mbp (sample T149) of non-redundant Bathycoccus genome sequences. WGA did not amplify all regions uniformly, resulting in unequal coverage along a given chromosome and between chromosomes. The identity at the DNA level between the metagenomes and the cultured genome was very high (96.3% identical bases for the three larger chromosomes over a 360 kbp alignment). At least two to three different genotypes seemed to be present in each natural sample based on read mapping to Bathycoccus RCC1105 genome.
  319. Van Landeghem, S., Björne, J., Abeel, T., De Baets, B., Salakoski, T., & Van de Peer, Y. (2012). Semantically linking molecular entities in literature through entity relationships. BMC BIOINFORMATICS, 13. Presented at the Conference on BioNLP Shared Task.
    Background: Text mining tools have gained popularity to process the vast amount of available research articles in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts. Results: We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving respectively high-precision and high-recall results. Finally, we highlight extremely high-performance results (F-score >90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts. Conclusions: The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
  320. Van Landeghem, S., Hakala, K., Rönnqvist, S., Salakoski, T., Van de Peer, Y., & Ginter, F. (2012). Exploring biomolecular literature with EVEX : connecting genes through events, homology, and indirect associations. ADVANCES IN BIOINFORMATICS, 2012.
    Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
  321. Amoutzias, G. D., He, Y., Lilley, K. S., Van de Peer, Y., & Oliver, S. G. (2012). Evaluation and properties of the budding yeast phosphoproteome. MOLECULAR & CELLULAR PROTEOMICS, 11(6).
    We have assembled a reliable phosphoproteomic data set for budding yeast Saccharomyces cerevisiae and have investigated its properties. Twelve publicly available phosphoproteome data sets were triaged to obtain a subset of high-confidence phosphorylation sites (p-sites), free of "noisy" phosphorylations. Analysis of this combined data set suggests that the inventory of phosphoproteins in yeast is close to completion, but that these proteins may have many undiscovered p-sites. Proteins involved in budding and protein kinase activity have high numbers of p-sites and are highly over-represented in the vast majority of the yeast phosphoproteome data sets. The yeast phosphoproteome is characterized by a few proteins with many p-sites and many proteins with a few p-sites. We confirm a tendency for p-sites to cluster together and find evidence that kinases may phosphorylate off-target amino acids that are within one or two residues of their cognate target. This suggests that the precise position of the phosphorylated amino acid is not a stringent requirement for regulatory fidelity. Compared with nonphosphorylated proteins, phosphoproteins are more ancient, more abundant, have longer unstructured regions, have more genetic interactions, more protein interactions, and are under tighter post-translational regulation. It appears that phosphoproteins constitute the raw material for pathway rewiring and adaptation at various evolutionary rates.
  322. Brown, JR, Hanna, M., Tesar, B., Pochet, N., Vartanov, A., Fernandes, S., Werner, L., et al. (2012). Germline copy number variation associated with Mendelian inheritance of CLL in two families. LEUKEMIA, 26(7), 1710–1713.
  323. De Smet, Riet, & Van de Peer, Y. (2012). Redundancy and rewiring of genetic networks following genome-wide duplication events. CURRENT OPINION IN PLANT BIOLOGY, 15(2), 168–176.
    Polyploidy or whole-genome duplication is a frequent phenomenon within the plant kingdom and has been associated with the occurrence of evolutionary novelty and increase in biological complexity. Because genome-wide duplication events duplicate whole molecular networks it is of interest to investigate how these networks evolve subsequent to such events. Although genome duplications are generally followed by massive gene loss, at least part of the network is usually retained in duplicate and can rewire to execute novel functions. Alternatively, the network can remain largely redundant and as such confer robustness against mutations. The increasing availability of high-throughput data makes it possible to study evolution following whole genome duplication events at the network level. Here we discuss how the use of 'omics' data in network analysis can provide novel insights on network redundancy and rewiring and conclude with some directions for future research.
  324. Helsens, K., Mueller, M., Hulstaert, N., & Martens, L. (2012). Sigpep: calculating unique peptide signature transition sets in a complete proteome background. PROTEOMICS, 12(8), 1142–1146.
  325. Moruz, L., Staes, A., Foster, J. M., Hatzou, M., Timmerman, E., Martens, L., & Käll, L. (2012). Chromatographic retention time prediction for posttranslationally modified peptides. PROTEOMICS, 12(8), 1151–1159.
    Retention time prediction of peptides in liquid chromatography has proven to be a valuable tool for mass spectrometry-based proteomics, especially in designing more efficient procedures for state-of-the-art targeted workflows. Additionally, accurate retention time predictions can also be used to increase confidence in identifications in shotgun experiments. Despite these obvious benefits, the use of such methods has so far not been extended to (posttranslationally) modified peptides due to the absence of efficient predictors for such peptides. We here therefore describe a new retention time predictor for modified peptides, built on the foundations of our existing Elude algorithm. We evaluated our software by applying it on five types of commonly encountered modifications. Our results show that Elude now yields equally good prediction performances for modified and unmodified peptides, with correlation coefficients between predicted and observed retention times ranging from 0.93 to 0.98 for all the investigated datasets. Furthermore, we show that our predictor handles peptides carrying multiple modifications as well. This latest version of Elude is fully portable to new chromatographic conditions and can readily be applied to other types of posttranslational modifications. Elude is available under the permissive Apache2 open source license and can also be run via a web interface.
  326. Moreau, H., Verhelst, B., Couloux, A., Derelle, E., Rombauts, S., Grimsley, N., Van Bel, M., et al. (2012). Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage. GENOME BIOLOGY, 13(8).
    Background: Bathycoccus prasinos is an extremely small cosmopolitan marine green alga whose cells are covered with intricate spider's web patterned scales that develop within the Golgi cisternae before their transport to the cell surface. The objective of this work is to sequence and analyze its genome, and to present a comparative analysis with other known genomes of the green lineage. Research: Its small genome of 15 Mb consists of 19 chromosomes and lacks transposons. Although 70% of all B. prasinos genes share similarities with other Viridiplantae genes, up to 428 genes were probably acquired by horizontal gene transfer, mainly from other eukaryotes. Two chromosomes, one big and one small, are atypical, an unusual synapomorphic feature within the Mamiellales. Genes on these atypical outlier chromosomes show lower GC content and a significant fraction of putative horizontal gene transfer genes. Whereas the small outlier chromosome lacks colinearity with other Mamiellales and contains many unknown genes without homologs in other species, the big outlier shows a higher intron content, increased expression levels and a unique clustering pattern of housekeeping functionalities. Four gene families are highly expanded in B. prasinos, including sialyltransferases, sialidases, ankyrin repeats and zinc ion-binding genes, and we hypothesize that these genes are associated with the process of scale biogenesis. Conclusion: The minimal genomes of the Mamiellophyceae provide a baseline for evolutionary and functional analyses of metabolic processes in green plants.
  327. Decock, A., Ongenaert, M., Hoebeeck, J., De Preter, K., Van Peer, G., Van Criekinge, W., Ladenstein, R., et al. (2012). Genome-wide promoter methylation analysis in neuroblastoma identifies prognostic methylation biomarkers. GENOME BIOLOGY, 13(10).
    BACKGROUND: Accurate outcome prediction in neuroblastoma, which is necessary to enable the optimal choice of risk-related therapy, remains a challenge. To improve neuroblastoma patient stratification, this study aimed to identify prognostic tumor DNA methylation biomarkers. RESULTS: To identify genes silenced by promoter methylation, we first applied two independent genome-wide methylation screening methodologies to eight neuroblastoma cell lines. Specifically, we used re-expression profiling upon 5-aza-2'-deoxycytidine (DAC) treatment and massively parallel sequencing after capturing with a methyl-CpG-binding domain (MBD-seq). Putative methylation markers were selected from DAC-upregulated genes through a literature search and an upfront methylation-specific PCR on 20 primary neuroblastoma tumors, as well as through MBD-seq in combination with publicly available neuroblastoma tumor gene expression data. This yielded 43 candidate biomarkers that were subsequently tested by high-throughput methylation-specific PCR on an independent cohort of 89 primary neuroblastoma tumors that had been selected for risk classification and survival. Based on this analysis, methylation of KRT19, FAS, PRPH, CNR1, QPCT, HIST1H3C, ACSS3 and GRB10 was found to be associated with at least one of the classical risk factors, namely age, stage or MYCN status. Importantly, HIST1H3C and GNAS methylation was associated with overall and/or event-free survival. CONCLUSIONS: This study combines two genome-wide methylation discovery methodologies and is the most extensive validation study in neuroblastoma performed thus far. We identified several novel prognostic DNA methylation markers and provide a basis for the development of a DNA methylation-based prognostic classifier in neuroblastoma.
  328. De Meyer, T., Van daele, C., De Buyzere, M., Denil, S., De Bacquer, D., Segers, P., Cooman, L., et al. (2012). No shorter telomeres in subjects with a family history of cardiovascular disease in the Asklepios Study. ARTERIOSCLEROSIS THROMBOSIS AND VASCULAR BIOLOGY, 32(12), 3076–3081.
    Objective: Shorter telomere length is associated with the occurrence of cardiovascular events, but the question of causality is complicated by the intertwined effects of inheritance, aging, and lifestyle factors on both telomere length and cardiovascular disease (CVD). Some studies indicated that healthy offspring of coronary artery disease patients exhibited shorter telomeres than subjects without a family history. Importantly, this result would imply that inheritance of shorter telomeres is a primary abnormality associated with an increased risk of CVD, the so-called Telomere Hypothesis of CVD. Therefore, we aimed at further validating the latter results in the large, population-representative Asklepios Study. Methods and results: Peripheral blood leukocyte telomere length was measured using telomere restriction fragment analysis in the young to middle-aged (approximately 35-55 years old) Asklepios study population, free from overt CVD, and could be successfully combined with data from the Asklepios Family History Database for 2136 subjects. No shorter telomere length could be found in healthy subjects with a family history of CVD compared with those without. Conclusion: These findings cast serious doubt on the hypothesis that telomere length is shorter in families with an increased risk of CVD and do not support the Telomere Hypothesis of CVD.
  329. Klochendler, A., Weinberg-Corem, N., Moran, M., Swisa, A., Pochet, N., Savova, V., Vikeså, J., et al. (2012). A transgenic mouse marking live replicating cells reveals in vivo transcriptional program of proliferation. DEVELOPMENTAL CELL, 23(4), 681–690.
    Most adult mammalian tissues are quiescent, with rare cell divisions serving to maintain homeostasis. At present, the isolation and study of replicating cells from their in vivo niche typically involves immunostaining for intracellular markers of proliferation, causing the loss of sensitive biological material. We describe a transgenic mouse strain, expressing a CyclinB1-GFP fusion reporter, that marks replicating cells in the S/G2/M phases of the cell cycle. Using flow cytometry, we isolate live replicating cells from the liver and compare their transcriptome to that of quiescent cells to reveal gene expression programs associated with cell proliferation in vivo. We find that replicating hepatocytes have reduced expression of genes characteristic of liver differentiation. This reporter system provides a powerful platform for gene expression and metabolic and functional studies of replicating cells in their in vivo niche.
  330. Gonnelli, G., Hulstaert, N., Degroeve, S., & Martens, L. (2012). Towards a human proteomics atlas. ANALYTICAL AND BIOANALYTICAL CHEMISTRY, 404(4), 1069–1077.
    Proteomics research has taken up an increasingly important role in life sciences over the past few years. Due to a strong push from publishers and funders alike, the community has also started to freely share its data in earnest, making use of public repositories such as the highly popular PRIDE database at EMBL-EBI. Reuse of these publicly available data has so far been confined to rather specific, targeted reanalyses, but this limited reuse is set to expand dramatically as repositories continue to grow exponentially. Examples of large-scale reuse are readily found in other omics disciplines, where more comprehensive public data have already accumulated over longer periods. Here, a typical example of integrative data reuse is provided by the construction of so-called expression atlases. We here therefore investigate the issues involved in using the human data currently stored in the PRIDE database to construct a robust, tissue-specific protein expression atlas from tandem-MS based label-free quantification.
  331. Movahedi, S., Van Bel, M., Heyndrickx, K., & Vandepoele, K. (2012). Comparative co-expression analysis in plant biology. PLANT CELL AND ENVIRONMENT, 35(10), 1787–1798.
    The analysis of gene expression data generated by high-throughput microarray transcript profiling experiments has shown that transcriptionally coordinated genes are often functionally related. Based on large-scale expression compendia grouping multiple experiments, this guilt-by-association principle has been applied to study modular gene programmes, identify cis-regulatory elements or predict functions for unknown genes in different model plants. Recently, several studies have demonstrated how, through the integration of gene homology and expression information, correlated gene expression patterns can be compared between species. The incorporation of detailed functional annotations as well as experimental data describing protein-protein interactions, phenotypes or tissue specific expression, provides an invaluable source of information to identify conserved gene modules and translate biological knowledge from model organisms to crops. In this review, we describe the different steps required to systematically compare expression data across species. Apart from the technical challenges to compute and display expression networks from multiple species, some future applications of plant comparative transcriptomics are highlighted.
  332. Vaudel, M., Burkhart, J. M., Radau, S., Zahedi, R. P., Martens, L., & Sickmann, A. (2012). Integral quantification accuracy estimation for reporter ion-based quantitative proteomics (iQuARI). JOURNAL OF PROTEOME RESEARCH, 11(10), 5072–5080.
    With the increasing popularity of comparative studies of complex proteomes, reporter ion-based quantification methods such as iTRAQ and TMT have become common-place in biological studies. Their appeal derives from simple multiplexing and quantification of several samples at reasonable cost. This advantage yet comes with a known shortcoming: precursors of different species can interfere, thus reducing the quantification accuracy. Recently, two methods were brought to the community alleviating the amount of interference via novel experimental design. Before considering setting up a new workflow, tuning the system, optimizing identification and quantification rates, etc. one legitimately asks: is it really worth the effort, time and money? The question is actually not easy to answer since the interference is heavily sample and system dependent. Moreover, there was to date no method allowing the inline estimation of error rates for reporter quantification. We therefore introduce a method called iQuARI to compute false discovery rates for reporter ion based quantification experiments as easily as Target/Decoy FDR for identification. With it, the scientist can accurately estimate the amount of interference in his sample on his system and eventually consider removing shadows subsequently, a task for which reporter ion quantification might not be the solution of choice.
  333. Vaudel, M., Burkhart, J. M., Breiter, D., Zahedi, R. P., Sickmann, A., & Martens, L. (2012). A complex standard for protein identification, designed by evolution. JOURNAL OF PROTEOME RESEARCH, 11(10), 5065–5071.
    Shotgun proteomic investigations rely on the algorithmic assignment of mass spectra to peptides. The quality of these matches is therefore a cornerstone in the analysis and has been the subject of numerous recent developments. In order to establish the benefits of novel algorithms, they are applied to reference samples of known content. However, these were recently shown to be either too simple to resemble typical real-life samples or as leading to results of lower accuracy as the method itself. Here, we describe how to use the proteome of Pyrococcus furiosus, a hyperthermophile, as a standard to evaluate proteomics identification workflows. Indeed, we prove that the Pyrococcus furiosus proteome provides a valid method for detecting random hits, comparable to the decoy databases currently in popular use, but we also prove that the Pyrococcus furiosus proteome goes squarely beyond the decoy approach by also providing many hundreds of highly reliable true positive hits. Searching the Pyrococcus furiosus proteome can thus be used as a unique test that provides the ability to reliably detect both false positives as well as proteome-scale true positives, allowing the rigorous testing of identification algorithms at the peptide and protein level.
  334. Sato, S., Tabata, S., Hirakawa, H., Asamizu, E., Shirasawa, K., Isobe, S., Kaneko, T., et al. (2012). The tomato genome sequence provides insights into fleshy fruit evolution. NATURE, 485(7400), 635–641.
    Tomato (Solanum lycopersicum) is a major crop plant and a model system for fruit development. Solanum is one of the largest angiosperm genera(1) and includes annual and perennial plants from diverse habitats. Here we present a high-quality genome sequence of domesticated tomato, a draft sequence of its closest wild relative, Solanum pimpinellifolium(2), and compare them to each other and to the potato genome (Solanum tuberosum). The two tomato genomes show only 0.6% nucleotide divergence and signs of recent admixture, but show more than 8% divergence from potato, with nine large and several smaller inversions. In contrast to Arabidopsis, but similar to soybean, tomato and potato small RNAs map predominantly to gene-rich chromosomal regions, including gene promoters. The Solanum lineage has experienced two consecutive genome triplications: one that is ancient and shared with rosids, and a more recent one. These triplications set the stage for the neofunctionalization of genes controlling fruit characteristics, such as colour and fleshiness.
  335. Sterck, L., Billiau, K., Abeel, T., Rouzé, P., & Van de Peer, Y. (2012). ORCAE: online resource for community annotation of eukaryotes. NATURE METHODS, 9(11), 1041–1041.
  336. Vekemans, D., Proost, S., Vanneste, K., Coenen, H., Viaene, T., Ruelens, P., Maere, S., et al. (2012). Gamma paleohexaploidy in the stem lineage of core eudicots: significance for MADS-box gene and species diversification. MOLECULAR BIOLOGY AND EVOLUTION, 29(12), 3793–3806.
    Comparative genome biology has unveiled the polyploid origin of all angiosperms and the role of recurrent polyploidization in the amplification of gene families and the structuring of genomes. Which species share certain ancient polyploidy events, and which do not, is ill defined because of the limited number of sequenced genomes and transcriptomes and their uneven phylogenetic distribution. Previously, it has been suggested that most, but probably not all, of the eudicots have shared an ancient hexaploidy event, referred to as the gamma triplication. In this study, detailed phylogenies of subfamilies of MADS-box genes suggest that the gamma triplication has occurred before the divergence of Gunnerales but after the divergence of Buxales and Trochodendrales. Large-scale phylogenetic and Ks-based approaches on the inflorescence transcriptomes of Gunnera manicata (Gunnerales) and Pachysandra terminalis (Buxales) provide further support for this placement, enabling us to position the gamma triplication in the stem lineage of the core eudicots. This triplication likely initiated the functional diversification of key regulators of reproductive development in the core eudicots, comprising 75% of flowering plants. Although it is possible that the gamma event triggered early core eudicot diversification, our dating estimates suggest that the event occurred early in the stem lineage, well before the rapid speciation of the earliest core eudicot lineages. The evolutionary significance of this paleopolyploidy event may thus rather lie in establishing a species lineage that was resilient to extinction, but with the genomic potential for later diversification. We consider that the traits generated from this potential characterize extant core eudicots both chemically and morphologically.
  337. Kyndt, T., Denil, S., Haegeman, A., Trooskens, G., Bauters, L., Van Criekinge, W., De Meyer, T., et al. (2012). Transcriptional reprogramming by root knot and migratory nematode infection in rice. NEW PHYTOLOGIST, 196(3), 887–900.
    Rice is one of the most important staple crops worldwide, but its yield is compromised by different pathogens, including plant-parasitic nematodes. In this study we have characterized specific and general responses of rice (Oryza sativa) roots challenged with two endoparasitic nematodes with very different modes of action. Local transcriptional changes in rice roots upon root knot (Meloidogyne graminicola) and root rot nematode (RRN, Hirschmanniella oryzae) infection were studied at two time points (3 and 7 d after infection, dai), using mRNA-seq. Our results confirm that root knot nematodes (RKNs), which feed as sedentary endoparasites, stimulate metabolic pathways in the root, and enhance nutrient transport towards the induced root gall. The migratory RRNs, on the other hand, induce programmed cell death and oxidative stress, and obstruct the normal metabolic activity of the root. While RRN infection causes up-regulation of biotic stress-related genes early in the infection, the sedentary RKNs suppress the local defense pathways (e.g. salicylic acid and ethylene pathways). Interestingly, hormone pathways mainly involved in plant development were strongly induced (gibberellin) or repressed (cytokinin) at 3 dai. These results uncover previously unrecognized nematode-induced expression profiles related to their specific infection strategy.
  338. Degroeve, S., Staes, A., De Bock, P.-J., & Martens, L. (2012). The effect of peptide identification search algorithms on MS2-based label-free protein quantification. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY, 16(9), 443–448.
    Several approaches exist for the quantification of proteins in complex samples processed by liquid chromatography-mass spectrometry followed by fragmentation analysis (MS2). One of these approaches is label-free MS2-based quantification, which takes advantage of the information computed from MS2 spectrum observations to estimate the abundance of a protein in a sample. As a first step in this approach, fragmentation spectra are typically matched to the peptides that generated them by a search algorithm. Because different search algorithms identify overlapping but non-identical sets of peptides, here we investigate whether these differences in peptide identification have an impact on the quantification of the proteins in the sample. We therefore evaluated the effect of using different search algorithms by examining the reproducibility of protein quantification in technical repeat measurements of the same sample. From our results, it is clear that a search engine effect does exist for MS2-based label-free protein quantification methods. As a general conclusion, it is recommended to address the overall possibility of search engine-induced bias in the protein quantification results of label-free MS2-based methods by performing the analysis with two or more distinct search engines.
  339. Burkhart, J. M., Vaudel, M., Gambaryan, S., Radau, S., Walter, U., Martens, L., Geiger, J., et al. (2012). The first comprehensive and quantitative analysis of human platelet protein composition allows the comparative analysis of structural and functional pathways. BLOOD, 120(15), E73–E82.
    Antiplatelet treatment is of fundamental importance in combatting functions/dysfunction of platelets in the pathogenesis of cardiovascular and inflammatory diseases. Dysfunction of anucleate platelets is likely to be completely attributable to alterations in posttranslational modifications and protein expression. We therefore examined the proteome of platelets highly purified from fresh blood donations, using elaborate protocols to ensure negligible contamination by leukocytes, erythrocytes, and plasma. Using quantitative mass spectrometry, we created the first comprehensive and quantitative human platelet proteome, comprising almost 4000 unique proteins, estimated copy numbers for approximately 3700 of those, and assessed intersubject (4 donors) as well as intrasubject (3 different blood samples from 1 donor) variations of the proteome. For the first time, our data allow for a systematic and weighted appraisal of protein networks and pathways in human platelets, and indicate the feasibility of differential and comprehensive proteome analyses from small blood donations. Because 85% of the platelet proteome shows no variation between healthy donors, this study represents the starting point for disease-oriented platelet proteomics. In the near future, comprehensive and quantitative comparisons between normal and well-defined dysfunctional platelets, or between platelets obtained from donors at various stages of chronic cardiovascular and inflammatory diseases will be feasible.
  340. De Keulenaer, S., Hellemans, J., Lefever, S., Renard, J.-P., De Schrijver, J., Van de Voorde, H., Tabatabaiefar, M. A., et al. (2012). Molecular diagnostics for congenital hearing loss including 15 deafness genes using a next generation sequencing platform. BMC MEDICAL GENOMICS, 5.
    Background: Hereditary hearing loss (HL) can originate from mutations in one of many genes involved in the complex process of hearing. Identification of the genetic defects in patients is currently labor intensive and expensive. While screening with Sanger sequencing for GJB2 mutations is common, this is not the case for the other known deafness genes (> 60). Next generation sequencing technology (NGS) has the potential to be much more cost efficient. Published methods mainly use hybridization based target enrichment procedures that are time saving and efficient, but lead to loss in sensitivity. In this study we used a semi-automated PCR amplification and NGS in order to combine high sensitivity, speed and cost efficiency. Results: In this proof of concept study, we screened 15 autosomal recessive deafness genes in 5 patients with congenital genetic deafness. 646 specific primer pairs for all exons and most of the UTR of the 15 selected genes were designed using primerXL. Using patient specific identifiers, all amplicons were pooled and analyzed using the Roche 454 NGS technology. Three of these patients are members of families in which a region of interest has previously been characterized by linkage studies. In these, we were able to identify two new mutations in CDH23 and OTOF. For another patient, the etiology of deafness was unclear, and no causal mutation was found. In a fifth patient, included as a positive control, we could confirm a known mutation in TMC1. Conclusions: We have developed an assay that holds great promise as a tool for screening patients with familial autosomal recessive nonsyndromal hearing loss (ARNSHL). For the first time, an efficient, reliable and cost effective genetic test, based on PCR enrichment, for newborns with undiagnosed deafness is available.
  341. De Beuf, K., De Schrijver, J., Thas, O., Van Criekinge, W., Irizarry, R. A., & Clement, L. (2012). Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model. BMC BIOINFORMATICS, 13.
    Background: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform. Results: We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores. Conclusions: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
  342. Helsens, K., & Martens, L. (2012). Enabling computational proteomics by public and local data management systems. CIRCULATION-CARDIOVASCULAR GENETICS, 5(2), O9–O15.
  343. Vaudel, M., Sickmann, A., & Martens, L. (2012). Current methods for global proteome identification. EXPERT REVIEW OF PROTEOMICS, 9(5), 519–532.
    In a time frame of a few decades, protein identification went from laborious single protein identification to automated identification of entire proteomes. This shift was enabled by the emergence of peptide-centric, gel-free analyses, in particular the so-called shotgun approaches, which not only rely on extensive experiments, but also on cutting-edge data processing methods. The present review therefore provides an overview of a shotgun proteomics identification workflow, listing the state-of-the-art methods involved and software that implement these. The authors focus on freely available tools where possible. Finally, data analysis in the context of emerging across-omics studies will also be discussed briefly, where proteomics goes beyond merely delivering a list of protein accession numbers.
  344. Xu, S., Wu, W., Sun, H., Lu, J., Yuan, B., Xia, Y., De Moor, B., et al. (2012). Association of the vascular endothelial growth factor gene polymorphisms (-460C/T,+405G/C and+936T/C) with endometriosis: a meta-analysis. ANNALS OF HUMAN GENETICS, 76(6), 464–471.
    Published data on the association between the vascular endothelial growth factor (VEGF) gene -460C/T (rs833061), +405G/C (rs2010963), +936T/C (rs3025039) polymorphisms and endometriosis risk are inconclusive. Eleven eligible case-control studies including 2690 cases and 2803 controls were included in this meta-analysis through searching the databases of PubMed and CBMdisc (up to August 1, 2011). In the overall analysis, no significant association between the -460C/T and +405G/C polymorphisms and risk of endometriosis was observed. However, significant associations were observed between endometriosis risk and VEGF+936T polymorphism with summarized odds ratio of 1.19 (95%CI, 1.02–1.37), 1.18 (95%CI, 1.03–1.37), 1.15 (95%CI, 1.01–1.30) for CT versus CC genotype, dominant mode (CT/TT vs. CC) and allele comparison (T vs. C), respectively. Furthermore, stratified analysis showed that significantly strong association between +936T/C polymorphism and endometriosis was present only in stage III–IV (OR = 1.32 for dominant mode; OR = 1.30 for T vs. C), but not in stage I–II. However, no significantly increased risk of endometriosis was found in any of the genetic models in Asians or in Caucasians. This meta-analysis supports that VEGF+936T/C polymorphism is capable of causing endometriosis susceptibility.
  345. Aslankoohi, E., Voordeckers, K., Sun, H., Sanchez-Rodriguez, A., van der Zande, E., Marchal, K., & Verstrepen, K. J. (2012). Nucleosomes affect local transformation efficiency. NUCLEIC ACIDS RESEARCH, 40(19), 9506–9512.
    Genetic transformation is a natural process during which foreign DNA enters a cell and integrates into the genome. Apart from its relevance for horizontal gene transfer in nature, transformation is also the cornerstone of today's recombinant gene technology. Despite its importance, relatively little is known about the factors that determine transformation efficiency. We hypothesize that differences in DNA accessibility associated with nucleosome positioning may affect local transformation efficiency. We investigated the landscape of transformation efficiency at various positions in the Saccharomyces cerevisiae genome and correlated these measurements with nucleosome positioning. We find that transformation efficiency shows a highly significant inverse correlation with relative nucleosome density. This correlation was lost when the nucleosome pattern, but not the underlying sequence was changed. Together, our results demonstrate a novel role for nucleosomes and also allow researchers to predict transformation efficiency of a target region and select spots in the genome that are likely to yield higher transformation efficiency.
  346. Wu, W., Cai, H., Sun, H., Lu, J., Zhao, D., Qin, Y., Han, X., et al. (2012). Follicle stimulating hormone receptor G-29A, 919A > G, 2039A > G polymorphism and the risk of male infertility: A meta-analysis. GENE, 505(2), 388–392.
  347. Voordeckers, K., De Maeyer, D., van der Zande, E., Vinces, M. D., Meert, W., Cloots, L., Ryan, O., et al. (2012). Identification of a complex genetic network underlying Saccharomyces cerevisiae colony morphology. MOLECULAR MICROBIOLOGY, 86(1), 225–239.
    When grown on solid substrates, different microorganisms often form colonies with very specific morphologies. Whereas the pioneers of microbiology often used colony morphology to discriminate between species and strains, the phenomenon has not received much attention recently. In this study, we use a genome-wide assay in the model yeast Saccharomyces cerevisiae to identify all genes that affect colony morphology. We show that several major signalling cascades, including the MAPK, TORC, SNF1 and RIM101 pathways play a role, indicating that morphological changes are a reaction to changing environments. Other genes that affect colony morphology are involved in protein sorting and epigenetic regulation. Interestingly, the screen reveals only few genes that are likely to play a direct role in establishing colony morphology, with one notable example being FLO11, a gene encoding a cell-surface adhesin that has already been implicated in colony morphology, biofilm formation, and invasive and pseudohyphal growth. Using a series of modified promoters for fine-tuning FLO11 expression, we confirm the central role of Flo11 and show that differences in FLO11 expression result in distinct colony morphologies. Together, our results provide a first comprehensive look at the complex genetic network that underlies the diversity in the morphologies of yeast colonies.
  348. Meysman, P., Marchal, K., & Engelen, K. (2012). DNA structural properties in the classification of genomic transcription. BIOINFORMATICS AND BIOLOGY INSIGHTS, 6, 155–168.
    It has been long known that DNA molecules encode information at various levels. The most basic level comprises the base sequence itself and is primarily important for the encoding of proteins and direct base recognition by DNA-binding proteins. A more elusive level consists of the local structural properties of the DNA molecule wherein the DNA sequence only plays an indirect supportive role. These properties are nevertheless an important factor in a large number of biomolecular processes and can be considered as informative signals for the presence of a variety of genomic features. Several recent studies have unequivocally shown the benefit of relying on such DNA properties for modeling and predicting genomic features as diverse as transcription start sites, transcription factor binding sites, or nucleosome occupancy. This review is meant to provide an overview of the key aspects of these DNA conformational and physicochemical properties. To illustrate their potential added value compared to relying solely on the nucleotide sequence in genomics studies, we discuss their application in research on transcription regulation mechanisms as representative cases.
  349. Zhang, Y., & Thas, O. (2012). Constrained ordination analysis in the presence of zero inflation. STATISTICAL MODELLING, 12(6), 463–485.
    Constrained ordination analysis, with canonical correspondence analysis (CCA) as its best known method, is a class of popular techniques for analyzing species abundance studies in ecology. These methods rely on distributional assumptions on the conditional abundance distributions. For abundance observations, the Poisson and the negative binomial distributions are the most frequently considered distributions. However, many large abundance studies result in many zero abundances. This may happen because of several reasons. To name one, in microbial community ecology the abundances of a very large number of species are nowadays often obtained by means of sequencing the pooled DNA sample. Due to the small sensitivity for rare species, too many observed zeroes are to be expected. Moreover, more zeroes are expected with increasing number of species. We propose a constrained ordination method based on zero-altered count distributions (e.g., zero-inflated Poisson, hurdle models). We show how the parameters and the environmental gradients can be estimated. In simulation studies we examine the behaviour of the estimators, and we apply the method to a real data set. We conclude that in the presence of zero inflation our methods give better results than the Poisson-based approaches.
  350. De Witte, D., Van Bel, M., Demeester, P., Dhoedt, B., Vandepoele, K., & Fostier, J. (2012). A high performance computing approach to the discovery of conserved motifs. 20th Annual Conference on Intelligent Systems for Molecular Biology, Abstracts (pp. 1–1). Presented at the 20th Annual Conference on Intelligent Systems for Molecular Biology (ISMB - 2012).
  351. De Witte, D., Van Bel, M., Demeester, P., Dhoedt, B., Vandepoele, K., & Fostier, J. (2012). Alignment-free genome-wide comparative motif discovery in 4 Monocot species. 11th European Conference on Computational Biology, Abstracts (pp. 1–1). Presented at the 11th European Conference on Computational Biology (ECCB - 2012).
  352. Verbeke, L., Marchal, K., & Fostier, J. (2012). Exploring the complementarity of eQTL mapping methods (poster). 20th Annual Conference on Intelligent Systems for Molecular Biology, Abstracts (pp. 1–1). Presented at the 20th Annual Conference on Intelligent Systems for Molecular Biology (ISMB - 2012).
  353. Côté, R. G., Griss, J., Dianes, J. A., Wang, R., Wright, J. C., van den Toorn, H. W., van Breukelen, B., et al. (2012). The PRoteomics IDEntification (PRIDE) converter 2 framework: an improved suite of tools to facilitate data submission to the PRIDE database and the ProteomeXchange consortium. MOLECULAR & CELLULAR PROTEOMICS, 11(12), 1682–1689.
    The original PRIDE Converter tool greatly simplified the process of submitting mass spectrometry (MS)-based proteomics data to the PRIDE database. However, after much user feedback, it was noted that the tool had some limitations and could not handle several user requirements that were now becoming commonplace. This prompted us to design and implement a whole new suite of tools that would build on the successes of the original PRIDE Converter and allow users to generate submission-ready, well-annotated PRIDE XML files. The PRIDE Converter 2 tool suite allows users to convert search result files into PRIDE XML (the format needed for performing submissions to the PRIDE database), generate mzTab skeleton files that can be used as a basis to submit quantitative and gel-based MS data, and post-process PRIDE XML files by filtering out contaminants and empty spectra, or by merging several PRIDE XML files together. All the tools have both a graphical user interface that provides a dialog-based, user-friendly way to convert and prepare files for submission, as well as a command-line interface that can be used to integrate the tools into existing or novel pipelines, for batch processing and power users. The PRIDE Converter 2 tool suite will thus become a cornerstone in the submission process to PRIDE and, by extension, to the ProteomeXchange consortium of MS proteomics data repositories.
  354. Vaudel, M., Burkhart, J. M., Zahedi, R. P., Martens, L., & Sickmann, A. (2012). iTRAQ data interpretation. In K. Marcus (Ed.), Quantitative methods in proteomics (Vol. 893, pp. 501–509). New York, NY, USA: Humana Press.
    Quantitative proteomic analysis can help elucidate unexplored biological questions; it, however, relies on highly reproducible experiments and reliable data processing. Among the existing strategies, iTRAQ is known as an easy-to-use method allowing relative comparison of up to eight multiplexed samples. Once the data is acquired, it is important that the final protein quantification reflects the actual amounts in the samples. Data interpretation must thus be achieved with a constant focus on quality. Here, we describe a workflow for processing iTRAQ data in user-friendly environments with emphasis on quality control.
  355. Cock, J. M., Sterck, L., Ahmed, S., Allen, A. E., Amoutzias, G., Anthouard, V., Artiguenave, F., et al. (2012). The Ectocarpus genome and brown algal genomics: the Ectocarpus Genome Consortium. (G. Piganeau, Ed.) Advances in Botanical Research, 64, 141–184.
    Brown algae are important organisms both because of their key ecological roles in coastal ecosystems and because of the remarkable biological features that they have acquired during their unusual evolutionary history. The recent sequencing of the complete genome of the filamentous brown alga Ectocarpus has provided unprecedented access to the molecular processes that underlie brown algal biology. Analysis of the genome sequence, which exhibits several unusual structural features, identified genes that are predicted to play key roles in several aspects of brown algal metabolism, in the construction of the multicellular bodyplan and in resistance to biotic and abiotic stresses. Information from the genome sequence is currently being used in combination with other genomic, genetic and biochemical tools to further investigate these and other aspects of brown algal biology at the molecular level. Here, we review some of the major discoveries that emerged from the analysis of the Ectocarpus genome sequence, with a particular focus on the unusual genome structure, inferences about brown algal evolution and novel aspects of brown algal metabolism.
  356. Murat, F., Van de Peer, Y., & Salse, J. (2012). Decoding plant and animal genome plasticity from differential paleo-evolutionary patterns and processes. GENOME BIOLOGY AND EVOLUTION, 4(9), 917–928.
    Continuing advances in genome sequencing technologies and computational methods for comparative genomics currently allow inferring the evolutionary history of entire plant and animal genomes. Based on the comparison of the plant and animal genome paleohistory, major differences are unveiled in 1) evolutionary mechanisms (i.e., polyploidization versus diploidization processes), 2) genome conservation (i.e., coding versus noncoding sequence maintenance), and 3) modern genome architecture (i.e., genome organization including repeats expansion versus contraction phenomena). This article discusses how extant animal and plant genomes are the result of inherently different rates and modes of genome evolution resulting in relatively stable animal and much more dynamic and plastic plant genomes.
  357. Claeys, Marleen, Storms, V., Sun, H., Michoel, T., & Marchal, K. (2012). MotifSuite: workflow for probabilistic motif detection and assessment. BIOINFORMATICS, 28(14), 1931–1932.
    Motivation: Probabilistic motif detection requires a multi-step approach going from the actual de novo regulatory motif finding up to a tedious assessment of the predicted motifs. MotifSuite, a user-friendly web interface, streamlines this analysis flow. Its core consists of two post-processing procedures that allow prioritizing the motif detection output. The tools offered by MotifSuite are built around the well-established motif detection tool MotifSampler and can also be used in combination with any other probabilistic motif detection tool. Elaborate guidelines on each of its applications have been provided.
  358. Sun, Hong, Guns, T., Fierro, A. C., Thorrez, L., Nijssen, S., & Marchal, K. (2012). Unveiling combinatorial regulation through the combination of ChIP information and in silico cis-regulatory module detection. NUCLEIC ACIDS RESEARCH, 40(12).
    Computationally retrieving biologically relevant cis-regulatory modules (CRMs) is not straightforward. Because of the large number of candidates and the imperfection of the screening methods, many spurious CRMs are detected that are as high scoring as the biologically true ones. Using ChIP-information allows not only to reduce the regions in which the binding sites of the assayed transcription factor (TF) should be located, but also allows restricting the valid CRMs to those that contain the assayed TF (here referred to as applying CRM detection in a query-based mode). In this study, we show that exploiting ChIP-information in a query-based way makes in silico CRM detection a much more feasible endeavor. To be able to handle the large datasets, the query-based setting and other specificities proper to CRM detection on ChIP-Seq based data, we developed a novel powerful CRM detection method 'CPModule'. By applying it on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells, we demonstrate how our tool can recover combinatorial regulation of five known TFs that are key in the self-renewal of mouse embryonic stem cells. Additionally, we make a number of new predictions on combinatorial regulation of these five key TFs with other TFs documented in TRANSFAC.
  359. Jeschke, J., Van Neste, L., Glöckner, S. C., Dhir, M., Calmon, M. F., Deregowski, V., Van Criekinge, W., et al. (2012). Biomarkers for detection and prognosis of breast cancer identified by a functional hypermethylome screen. EPIGENETICS, 7(7), 701–709.
    Breast cancer (BC) is a disease with diverse tumor heterogeneity, which challenges conventional approaches to develop biomarkers for early detection and prognosis. To identify effective biomarkers, we performed a genome-wide screen for functional methylation changes in BC, i.e., genes silenced by promoter hypermethylation, using a functionally proven gene expression approach. A subset of candidate hypermethylated genes were validated in primary BCs and tested as markers for detection and prognosis prediction of BC. We identified 33 cancer specific methylated genes and, among these, two categories of genes: (1) highly frequent methylated genes that detect early stages of BC. Within that category, we have identified the combination of NDRG2 and HOXD1 as the most sensitive (94%) and specific (90%) gene combination for detection of BC; (2) genes that show stage dependent methylation frequency pattern, which are candidates to help delineate BC prognostic signatures. For this category, we found that methylation of CDO1, CKM, CRIP1, KL and TAC1 correlated with clinical prognostic variables and was a significant prognosticator for poor overall survival in BC patients. CKM [Hazard ratio (HR) = 2.68] and TAC1 (HR = 7.73) were the strongest single markers and the combination of both (TAC1 and CKM) was associated with poor overall survival independent of age and stage in our training (HR = 1.92) and validation cohort (HR = 2.87). Our study demonstrates an efficient method to utilize functional methylation changes in BC for the development of effective biomarkers for detection and prognosis prediction of BC.
  360. Carvalho, Beatriz, Sillars-Hardebol, A. H., Postma, C., Mongera, S., Terhaar Sive Droste, J., Obulkasim, A., van de Wiel, M., et al. (2012). Colorectal adenoma to carcinoma progression is accompanied by changes in gene expression associated with ageing, chromosomal instability, and fatty acid metabolism. CELLULAR ONCOLOGY, 35(1), 53–63.
    Colorectal cancer develops in a multi-step manner from normal epithelium, through a pre-malignant lesion (so-called adenoma), into a malignant lesion (carcinoma), which invades surrounding tissues and eventually can spread systemically (metastasis). It is estimated that only about 5% of adenomas do progress to a carcinoma. The present study aimed to unravel the biology of adenoma to carcinoma progression by mRNA expression profiling, and to identify candidate biomarkers for adenomas that are truly at high risk of progression. Genome-wide mRNA expression profiles were obtained from a series of 37 colorectal adenomas and 31 colorectal carcinomas using oligonucleotide microarrays. Differentially expressed genes were validated in an independent colorectal gene expression data set. Gene Set Enrichment Analysis (GSEA) was used to identify altered expression of sets of genes associated with specific biological processes, in order to better understand the biology of colorectal adenoma to carcinoma progression. mRNA expression of 248 genes was significantly different, of which 96 were upregulated and 152 downregulated in carcinomas compared to adenomas. Classification of adenomas and carcinomas using the expression of these genes showed to be very accurate, also when tested in an independent expression data set. Gene-sets associated with ageing (which is related to senescence) and chromosomal instability were upregulated, and a gene-set associated with fatty acid metabolism was downregulated in carcinomas compared to adenomas. Moreover, gene-sets associated with chromosomal location revealed chromosome 4q22 loss and chromosome 20q gain of gene-set expression as being relevant in this progression. These data are consistent with the notion that adenomas and carcinomas are distinct biological entities. 
    Disruption of specific biological processes like senescence (ageing), maintenance of chromosomal instability and altered metabolism are key factors in the progression from adenoma to carcinoma.
  361. Van Neste, L., Herman, J. G., Otto, G., Bigley, J. W., Epstein, J. I., & Van Criekinge, W. (2012). The epigenetic promise for prostate cancer diagnosis. PROSTATE, 72(11), 1248–1261.
  362. De Beuf, K., Pipelers, P., Andriankaja, M., Thas, O., Inzé, D., Crainiceanu, C., & Clement, L. (2012). Analysis of tiling array expression studies with flexible designs in Bioconductor (waveTiling). BMC BIOINFORMATICS, 13.
    Background: Existing statistical methods for tiling array transcriptome data either focus on transcript discovery in one biological or experimental condition or on the detection of differential expression between two conditions. Increasingly often, however, biologists are interested in time-course studies, studies with more than two conditions or even multiple-factor studies. As these studies are currently analyzed with the traditional microarray analysis techniques, they do not exploit the genome-wide nature of tiling array data to its full potential. Results: We present an R Bioconductor package, waveTiling, which implements a wavelet-based model for analyzing transcriptome data and extends it towards more complex experimental designs. With waveTiling the user is able to discover (1) group-wise expressed regions, (2) differentially expressed regions between any two groups in single-factor studies and in (3) multifactorial designs. Moreover, for time-course experiments it is also possible to detect (4) linear time effects and (5) a circadian rhythm of transcripts. By considering the expression values of the individual tiling probes as a function of genomic position, effect regions can be detected regardless of existing annotation. Three case studies with different experimental set-ups illustrate the use and the flexibility of the model-based transcriptome analysis. Conclusions: The waveTiling package provides the user with a convenient tool for the analysis of tiling array transcriptome data for a multitude of experimental set-ups. Regardless of the study design, the probe-wise analysis allows for the detection of transcriptional effects in exonic, intronic and intergenic regions, without prior consultation of existing annotation.
  363. Vansteelandt, S., & Lange, C. (2012). Causation and causal inference for genetic effects. HUMAN GENETICS, 131(10), 1665–1676.
    Over the past three decades, substantial developments have been made on how to infer the causal effect of an exposure on an outcome, using data from observational studies, with the randomized experiment as the gold standard. These developments have reshaped the paradigm of how to build statistical models, how to adjust for confounding, how to assess direct effects, mediated effects and interactions, and even how to analyze data from randomized experiments. The congruence of random transmission of alleles during meiosis and the randomization in controlled experiments/trials suggests that genetic studies may lend themselves naturally to a causal analysis. In this contribution, we will reflect on this and motivate, through illustrative examples, where insights from the causal inference literature may help to understand and correct for typical biases in genetic effect estimates.
  364. Berzuini, C., Vansteelandt, S., Foco, L., Pastorino, R., & Bernardinelli, L. (2012). Direct genetic effects and their estimation from matched case-control data. GENETIC EPIDEMIOLOGY, 36(6), 652–662.
    In genetic association studies, a single marker is often associated with multiple, correlated phenotypes (e.g., obesity and cardiovascular disease, or nicotine dependence and lung cancer). A pervasive question is then whether that marker exerts independent effects on all phenotypes. In this paper, we address this question by assessing whether there is a genetic effect on one phenotype that is not mediated through the other ones, a so-called direct genetic effect. Answering such a question may represent an important step in the elucidation of the underlying biological mechanism. Under rather restrictive conditions, such direct genetic effects are known to be estimable by standard regression methods. Under more lenient conditions, in a prospective or unmatched case-control study, these effects can be estimated by using a previously proposed G-estimation method (Vansteelandt [2009] Epidemiology 20, 851–860). The present paper extends this method to matched case-control studies, and investigates the conditions under which this extension is valid. We illustrate the method on data from a matched case-control study, which we use to elucidate the pathway implications of a detected association between myocardial infarction and a genetic locus in the chromosomal region of the FTO gene.
  365. Fardo, D. W., Liu, J., Demeo, D. L., Silverman, E. K., & Vansteelandt, S. (2012). Gene-environment interaction testing in family-based association studies with phenotypically ascertained samples: a causal inference approach. BIOSTATISTICS, 13(3), 468–481.
    We propose a method for testing gene-environment (G x E) interactions on a complex trait in family-based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid modeling of the association between the environmental exposure and the phenotype, to gain robustness against unmeasured confounding due to population substructure, and to acknowledge the ascertainment conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation studies to an analogous conditional likelihood-based approach and to the QBAT-I test, which also invokes the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic obstructive pulmonary disorder.
  366. Schockaert, S., & De Cock, M. (2011). Diversification of search results as a fuzzy satisfiability problem. Information Retrieval, 33rd European conference, Proceedings. Presented at the DDR-2011 : diversity in document retrieval, at the 33rd European conference on Information Retrieval (ECIR 2011).
  367. Proost, Sebastian, Pattyn, P., Gerats, T., & Van de Peer, Y. (2011). Journey through the past: 150 million years of plant genome evolution. PLANT JOURNAL, 66(1), 58–65.
  368. Armananzas, R., Saeys, Y., Inza, I., Garcia-Torres, M., Bielza, C., Van de Peer, Y., & Larranaga, P. (2011). Peakbin selection in mass spectrometry data using a consensus approach with estimation of distribution algorithms. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 8(3), 760–774.
    Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices for tackling this problem. The data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks), where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a high number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDA), as an efficient peak selector in this MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and another two SELDI-TOF) are analyzed. The results are compared to the original works, and a new plot (peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page includes extended info and results, in addition to Matlab scripts and references.
  369. Fostier, J., Proost, S., Dhoedt, B., Saeys, Y., Demeester, P., Van de Peer, Y., & Vandepoele, K. (2011). A greedy, graph-based algorithm for the alignment of multiple homologous gene lists. BIOINFORMATICS, 27(6), 749–756.
  370. Bowden, J., & Vansteelandt, S. (2011). Mendelian randomization analysis of case-control data using structural mean models. STATISTICS IN MEDICINE, 30(6), 678–694.
  371. Babiychuk, E., Vandepoele, K., Wissing, J., Garcia-Diaz, M., De Rycke, R., Akbari, H., Joubès, J., et al. (2011). Plastid gene expression and plant development require a plastidic protein of the mitochondrial transcription termination factor family. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 108(16), 6674–6679.
  372. Vyverman, M., De Schrijver, J., Van Criekinge, W., Dawyndt, P., & Fack, V. (2011). Accurate long read mapping using enhanced suffix arrays. In M. Pellegrini, A. Fred, J. Filipe, & H. Gamboa (Eds.), BIOINFORMATICS (pp. 102–107). Presented at the International conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS 2011), Lisbon, Portugal: SciTePress.
    With the rise of high throughput sequencing, new programs have been developed for dealing with the alignment of a huge amount of short read data to reference genomes. Recent developments in sequencing technology allow longer reads, but the mappers for short reads are not suited for reads of several hundreds of base pairs. We propose an algorithm for mapping longer reads, which is based on chaining maximal exact matches and uses heuristics and the Needleman-Wunsch algorithm to bridge the gaps. To compute maximal exact matches we use a specialized index structure, called enhanced suffix array. The proposed algorithm is very accurate and can handle large reads with mutations and long insertions and deletions.
  373. Victor, P., Cornelis, C., De Cock, M., & Teredesai, A. (2011). Trust and distrust-based recommendations for controversial reviews. IEEE INTELLIGENT SYSTEMS, 26(1), 48–55.
  374. Victor, P., Cornelis, C., & De Cock, M. (2011). Trust networks for recommender systems. Atlantis Computation Intelligence Systems (Vol. 4). Amsterdam, The Netherlands: Atlantis.
  375. Schockaert, S., Makarytska, N., & De Cock, M. (2011). Fuzzy methods on the web: a critical discussion. In Chris Cornelis, G. Deschrijver, M. Nachtegael, S. Schockaert, & Y. Shi (Eds.), 35 Years of fuzzy set theory : celebratory volume dedicated to the retirement of Etienne E. Kerre (Vol. 261, pp. 237–266). Berlin, Germany: Springer.
  376. Martens, Lennart, Chambers, M., Sturm, M., Kessner, D., Levander, F., Shofstahl, J., Tang, W. H., et al. (2011). mzML: a community standard for mass spectrometry data. MOLECULAR & CELLULAR PROTEOMICS, 10(1).
    Mass spectrometry is a fundamental tool for discovery and analysis in the life sciences. With the rapid advances in mass spectrometry technology and methods, it has become imperative to provide a standard output format for mass spectrometry data that will facilitate data sharing and analysis. Initially, the efforts to develop a standard format for mass spectrometry data resulted in multiple formats, each designed with a different underlying philosophy. To resolve the issues associated with having multiple formats, vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from the previous data formats, while adding a number of improvements, including features such as a controlled vocabulary with validation tools to ensure consistent usage of the format, improved support for selected reaction monitoring data, and immediately available implementations to facilitate rapid adoption by the community. The resulting standard data format, mzML, is a well tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
  377. Vaudel, M., Barsnes, H., Berven, F. S., Sickmann, A., & Martens, L. (2011). SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. PROTEOMICS, 11(5), 996–999.
    The identification of proteins by mass spectrometry is a standard technique in the field of proteomics, relying on search engines to perform the identifications of the acquired spectra. Here, we present a user-friendly, lightweight and open-source graphical user interface called SearchGUI, for configuring and running the freely available OMSSA (open mass spectrometry search algorithm) and X!Tandem search engines simultaneously. Freely available under the permissive Apache2 license, SearchGUI is supported on Windows, Linux and OSX.
  378. Barsnes, H., Vaudel, M., Colaert, N., Helsens, K., Sickmann, A., Berven, F. S., & Martens, L. (2011). compomics-utilities: an open-source Java library for computational proteomics. BMC BIOINFORMATICS, 12.
    Background: The growing interest in the field of proteomics has increased the demand for software tools and applications that process and analyze the resulting data. And even though the purpose of these tools can vary significantly, they usually share a basic set of features, including the handling of protein and peptide sequences, the visualization of (and interaction with) spectra and chromatograms, and the parsing of results from various proteomics search engines. Developers typically spend considerable time and effort implementing these support structures, which detracts from working on the novel aspects of their tool. Results: In order to simplify the development of proteomics tools, we have implemented an open-source support library for computational proteomics, called compomics-utilities. The library contains a broad set of features required for reading, parsing, and analyzing proteomics data. compomics-utilities is already used by a long list of existing software, ensuring library stability and continued support and development. Conclusions: As a user-friendly, well-documented and open-source library, compomics-utilities greatly simplifies the implementation of the basic features needed in most proteomics tools. Implemented in 100% Java, compomics-utilities is fully portable across platforms and architectures. Our library thus allows the developers to focus on the novel aspects of their tools, rather than on the basic functions, which can contribute substantially to faster development, and better tools for proteomics.
  379. Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., & Dawyndt, P. (2011). Make Histri: reconstructing the exchange history of bacterial and archaeal type strains. SYSTEMATIC AND APPLIED MICROBIOLOGY, 34(5), 328–336.
  380. De Ceuleneer, M., De Wit, V., Van Steendam, K., Van Nieuwerburgh, F., Tilleman, K., & Deforce, D. (2011). Modification of citrulline residues with 2,3-butanedione facilitates their detection by liquid chromatography/mass spectrometry. RAPID COMMUNICATIONS IN MASS SPECTROMETRY, 25(11), 1536–1542.
  381. Vermoote, M., Vandekerckhove, T., Flahou, B., Pasmans, F., Smet, A., De Groote, D., Van Criekinge, W., et al. (2011). Genome sequence of Helicobacter suis supports its role in gastric pathology. VETERINARY RESEARCH, 42.
    Helicobacter (H.) suis has been associated with chronic gastritis and ulcers of the pars oesophagea in pigs, and with gastritis, peptic ulcer disease and gastric mucosa-associated lymphoid tissue lymphoma in humans. In order to obtain better insight into the genes involved in pathogenicity and in the specific adaptation to the gastric environment of H. suis, a genome analysis was performed of two H. suis strains isolated from the gastric mucosa of swine. Homologs of the vast majority of genes shown to be important for gastric colonization of the human pathogen H. pylori were detected in the H. suis genome. H. suis encodes several putative outer membrane proteins, two of which are similar to the H. pylori adhesins HpaA and HorB. H. suis harbours an almost complete comB type IV secretion system and members of the type IV secretion system 3, but lacks most of the genes present in the cag pathogenicity island of H. pylori. Homologs of genes encoding the H. pylori neutrophil-activating protein and γ-glutamyl transpeptidase were identified in H. suis. H. suis also possesses several other presumptive virulence-associated genes, including homologs for mviN, the H. pylori flavodoxin gene, and a homolog of the H. pylori vacuolating cytotoxin A gene. It was concluded that although genes coding for some important virulence factors in H. pylori, such as the cytotoxin-associated protein (CagA), are not detected in the H. suis genome, homologs of other genes associated with colonization and virulence of H. pylori and other bacteria are present.
  382. Degroeve, S., Colaert, N., Vandekerckhove, J., Gevaert, K., & Martens, L. (2011). A reproducibility-based evaluation procedure for quantifying the differences between MS/MS peak intensity normalization methods. PROTEOMICS, 11(6), 1172–1180.
    The identification of peptides and proteins from fragmentation mass spectra is a very common approach in the field of proteomics. Contemporary high-throughput peptide identification pipelines can quickly produce large quantities of MS/MS data that contain valuable knowledge about the actual physicochemical processes involved in the peptide fragmentation process, which can be extracted through extensive data mining studies. As these studies attempt to exploit the intensity information contained in the MS/MS spectra, a critical step required for a meaningful comparison of this information between MS/MS spectra is peak intensity normalization. We here describe a procedure for quantifying the efficiency of different published normalization methods in terms of the quartile coefficient of dispersion (qcod) statistic. The quartile coefficient of dispersion is applied to measure the dispersion of the peak intensities between redundant MS/MS spectra, allowing the quantification of the differences in computed peak intensity reproducibility between the different normalization methods. We demonstrate that our results are independent of the data set used in the evaluation procedure, allowing us to provide generic guidance on the choice of normalization method to apply in a certain MS/MS pipeline application.
  383. Colaert, N., Vandekerckhove, J., Gevaert, K., & Martens, L. (2011). A comparison of MS2-based label-free quantitative proteomic techniques with regards to accuracy and precision. PROTEOMICS, 11(6), 1110–1113.
    The advent of algorithms for fragmentation spectrum-based label-free quantitative proteomics has enabled straightforward quantification of shotgun proteomic experiments. Despite the popularity of these approaches, few studies have been performed to assess their performance. We have therefore profiled the precision and the accuracy of three distinct relative label-free methods on both the protein and the proteome level. We derived our test data from two well-characterized publicly available quantitative data sets.
  384. Ghesquière, B., Helsens, K., Vandekerckhove, J., & Gevaert, K. (2011). A stringent approach to improve the quality of nitrotyrosine peptide identifications. PROTEOMICS, 11(6), 1094–1098.
    Tyrosine nitration is the consequence of a complex machinery of formation and merging of oxygen and nitrogen radicals, and has been associated with both physiological pathways as well as with several human diseases. The latter turned this posttranslational protein modification into an interesting biomarker, being either a consequence of the disease or a factor contributing to the disease onset. However, the interpretation of MS and MS/MS data of peptides containing nitrotyrosine has proven to be very challenging and consequently, the risk of linking MS/MS spectra to incorrect peptide sequences exists and has been reported. Here, we discuss the causes of data misinterpretation and describe a general method to avoid mistakes of MS/MS spectrum misinterpretation. Central in our approach is the reduction of nitrotyrosine into aminotyrosine and the use of the Peptizer algorithm to inspect MS/MS quality-related assumptions.
  385. Burkhart, J. M., Vaudel, M., Zahedi, R., Martens, L., & Sickmann, A. (2011). iTRAQ protein quantification: a quality-controlled workflow. PROTEOMICS, 11(6), 1125–1134.
    Reporter ion-based methods are among the major techniques to quantify peptides and proteins. Two main labels, tandem mass tag (TMT) and iTRAQ, are widely used by the proteomics community. They are, however, often applied as out-of-the-box methods, without thorough quality control. Thus, due to undiscovered limitations of the technique, irrelevant results might be trusted. To address this issue, we here propose a step-by-step quality control of the iTRAQ workflow. From sample preparation to final ratio calculation we provide metrics and techniques assessing the actual effectiveness of iTRAQ quantification as well as a novel method for more reliable protein ratio estimation.
  386. Barsnes, H., Eidhammer, I., & Martens, L. (2011). A global analysis of peptide fragmentation variability. PROTEOMICS, 11(6), 1181–1188.
    Understanding the fragmentation process in MS/MS experiments is vital when trying to validate the results of such experiments, and one way of improving our understanding is to analyze existing data. We here present our findings from an analysis of a large and diverse data set of MS/MS-based peptide identifications, in which each peptide has been identified from multiple spectra, recorded on two commonly used types of electrospray instruments. By analyzing these data we were able to study fragmentation variability on three levels: (i) variation in detection rates and intensities for fragment ions from the same peptide sequence measured multiple times on a single instrument; (ii) consistency of rank-based fragmentation patterns; and (iii) a set of general observations on fragment ion occurrence in MS/MS experiments, regardless of sequence. Our results confirm that substantial variation can be found at all levels, even when high-quality identifications are used and the experimental conditions as well as the peptide sequences are kept constant. Finally, we discuss the observed variability in light of ongoing efforts to create spectral libraries and predictive software for target selection in targeted proteomics.
  387. De Leeneer, K., Hellemans, J., De Schrijver, J., Baetens, M., Poppe, B., Van Criekinge, W., De Paepe, A., et al. (2011). Massive parallel amplicon sequencing of the breast cancer genes BRCA1 and BRCA2: opportunities, challenges, and limitations. HUMAN MUTATION, 32(3), 335–344.
  388. Van Maerken, T., Rihani, A., Dreidax, D., De Clercq, S., Yigit, N., Marine, J.-C., Westermann, F., et al. (2011). Functional analysis of the p53 pathway in neuroblastoma cells using the small-molecule MDM2 antagonist Nutlin-3. MOLECULAR CANCER THERAPEUTICS, 10(6), 983–993.
  389. Raj, D. B. T. G., Ghesquière, B., Tharkeshwar Raghunath, A. K., Coen, K., Derua, R., Vanderschaeghe, D., Rysman, E., et al. (2011). A novel strategy for the comprehensive analysis of the biomolecular composition of isolated plasma membranes. MOLECULAR SYSTEMS BIOLOGY, 7.
  390. Blondeel, Marjon, Schockaert, S., De Cock, M., & Vermeir, D. (2011). Fuzzy autoepistemic logic: reflecting about knowledge of truth degrees. Lecture Notes in Computer Science (Vol. 6717, pp. 616–627). Presented at the 11th European conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 2011), Berlin, Germany: Springer.
  391. Colaert, Niklaas, Gevaert, K., & Martens, L. (2011). RIBAR and xRIBAR: methods for reproducible relative MS/MS-based label-free protein quantification. JOURNAL OF PROTEOME RESEARCH, 10(7), 3183–3189.
    Mass spectrometry-driven proteomics is increasingly relying on quantitative analyses for biological discoveries. As a result, different methods and algorithms have been developed to perform relative or absolute quantification based on mass spectrometry data. Among the most popular quantification methods are the so-called label-free approaches, which require no special sample processing, and can even be applied retroactively to existing data sets. Of these label-free methods, the MS/MS-based approaches are most often applied, mainly because of their inherent simplicity as compared to MS-based methods. The main application of these approaches is the determination of relative protein amounts between different samples, expressed as protein ratios. However, as we demonstrate here, there are some issues with the reproducibility across replicates of these protein ratio sets obtained from the various MS/MS-based label-free methods, indicating that the existing methods are not optimally robust. We therefore present two new methods (called RIBAR and xRIBAR) that use the available MS/MS data more effectively, achieving increased robustness. Both the accuracy and the precision of our novel methods are analyzed and compared to the existing methods to illustrate the increased robustness of our new methods over existing ones.
  392. Vaudel, M., Burkhart, J. M., Sickmann, A., Martens, L., & Zahedi, R. P. (2011). Peptide identification quality control. PROTEOMICS, 11(10), 2105–2114.
    Identification of large proteomics data sets is routinely performed using sophisticated software tools called search engines. Yet despite the importance of the identification process, its configuration and execution are often performed according to established lab habits, and are mostly unsupervised by detailed quality control. In order to establish easily obtainable quality control criteria that can be broadly applied to the identification process, we here introduce several simple quality control methods. An unbiased quality control of identification parameters will be conducted using target/decoy searches providing significant improvement over identification standards. MASCOT identifications were for instance increased by 13% at a constant level of confidence. The target/decoy approach can however not be universally applied. We therefore also quality control the application of this strategy itself, providing useful and intuitive metrics for evaluating the precision and robustness of the obtained false discovery rate.
  393. Hu, T. T., Pattyn, P., Bakker, E. G., Cao, J., Cheng, J.-F., Clark, R. M., Fahlgren, N., et al. (2011). The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. NATURE GENETICS, 43(5), 476–481.
    We report the 207-Mb genome sequence of the North American Arabidopsis lyrata strain MN47 based on 8.3x dideoxy sequence coverage. We predict 32,670 genes in this outcrossing species compared to the 27,025 genes in the selfing species Arabidopsis thaliana. The much smaller 125-Mb genome of A. thaliana, which diverged from A. lyrata 10 million years ago, likely constitutes the derived state for the family. We found evidence for DNA loss from large-scale rearrangements, but most of the difference in genome size can be attributed to hundreds of thousands of small deletions, mostly in noncoding DNA and transposons. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome. The high-quality reference genome sequence for A. lyrata will be an important resource for functional, evolutionary and ecological studies in the genus Arabidopsis.
  394. Movahedi, S., Van de Peer, Y., & Vandepoele, K. (2011). Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice. PLANT PHYSIOLOGY, 156(3), 1316–1330.
    Microarray experiments have yielded massive amounts of expression information measured under various conditions for the model species Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa). Expression compendia grouping multiple experiments make it possible to define correlated gene expression patterns within one species and to study how expression has evolved between species. We developed a robust framework to measure expression context conservation (ECC) and found, by analyzing 4,630 pairs of orthologous Arabidopsis and rice genes, that 77% showed conserved coexpression. Examples of nonconserved ECC categories suggested a link between regulatory evolution and environmental adaptations and included genes involved in signal transduction, response to different abiotic stresses, and hormone stimuli. To identify genomic features that influence expression evolution, we analyzed the relationship between ECC, tissue specificity, and protein evolution. Tissue-specific genes showed higher expression conservation compared with broadly expressed genes but were fast evolving at the protein level. No significant correlation was found between protein and expression evolution, implying that both modes of gene evolution are not strongly coupled in plants. By integration of cis-regulatory elements, many ECC conserved genes were significantly enriched for shared DNA motifs, hinting at the conservation of ancestral regulatory interactions in both model species. Surprisingly, for several tissue-specific genes, patterns of concerted network evolution were observed, unveiling conserved coexpression in the absence of conservation of tissue specificity. These findings demonstrate that orthologs inferred through sequence similarity in many cases do not share similar biological functions and highlight the importance of incorporating expression information when comparing genes across species.
  395. Van Damme, Petra, Evjenth, R., Foyn, H., Demeyer, K., De Bock, P.-J., Lillehaug, J. R., Vandekerckhove, J., et al. (2011). Proteome-derived peptide libraries allow detailed analysis of the substrate specificities of Nα-acetyltransferases and point to hNaa10p as the post-translational actin Nα-acetyltransferase. MOLECULAR & CELLULAR PROTEOMICS, 10(5).
    The impact of N(alpha)-terminal acetylation on protein stability and protein function in general recently acquired renewed and increasing attention. Although the substrate specificity profile of the conserved enzymes responsible for N(alpha)-terminal acetylation in yeast has been well documented, the lack of higher eukaryotic models has hampered the specificity profile determination of N(alpha)-acetyltransferases (NATs) of higher eukaryotes. The fact that several types of protein N termini are acetylated by so far unknown NATs stresses the importance of developing tools for analyzing NAT specificities. Here, we report on a method that implies the use of natural, proteome-derived modified peptide libraries, which, when used in combination with two strong cation exchange separation steps, allows for the delineation of the in vitro specificity profiles of NATs. The human NatA complex, composed of the auxiliary hNaa15p (NATH/hNat1) subunit and the catalytic hNaa10p (hArd1) and hNaa50p (hNat5) subunits, cotranslationally acetylates protein N termini initiating with Ser, Ala, Thr, Val, and Gly following the removal of the initial Met. In our studies, purified hNaa50p preferred Met-Xaa starting N termini (Xaa mainly being a hydrophobic amino acid) in agreement with previous data. Surprisingly, purified hNaa10p preferred acidic N termini, representing a group of in vivo acetylated proteins for which there are currently no NAT(s) identified. The most prominent representatives of the group of acidic N termini are gamma- and beta-actin. Indeed, by using an independent quantitative assay, hNaa10p strongly acetylated peptides representing the N termini of both gamma- and beta-actin, and only to a lesser extent, its previously characterized substrate motifs. The immunoprecipitated NatA complex also acetylated the actin N termini efficiently, though displaying a strong shift in specificity toward its known Ser-starting type of substrates. Thus, complex formation of NatA might alter the substrate specificity profile as compared with its isolated catalytic subunits, and, furthermore, NatA or hNaa10p may function as a post-translational actin N(alpha)-acetyltransferase.
  396. Colaert, N., Van Huele, C., Degroeve, S., Staes, A., Vandekerckhove, J., Gevaert, K., & Martens, L. (2011). Combining quantitative proteomics data processing workflows for greater sensitivity. NATURE METHODS, 8(6), 481–U66.
    We here describe a normalization method to combine quantitative proteomics data. By merging the output of two popular quantification software packages, we obtained a 20% increase (on average) in the number of quantified human proteins without suffering from a loss of quality. Our integrative workflow is freely available through our user-friendly, open-source Rover software.
  397. Foster, J. M., Degroeve, S., Gatto, L., Visser, M., Wang, R., Griss, J., Apweiler, R., et al. (2011). A posteriori quality control for the curation and reuse of public proteomics data. PROTEOMICS, 11(11), 2182–2194.
    Proteomics is a rapidly expanding field encompassing a multitude of complex techniques and data types. To date much effort has been devoted to achieving the highest possible coverage of proteomes with the aim to inform future developments in basic biology as well as in clinical settings. As a result, growing amounts of data have been deposited in publicly available proteomics databases. These data are in turn increasingly reused for orthogonal downstream purposes such as data mining and machine learning. These downstream uses, however, need ways to a posteriori validate whether a particular data set is suitable for the envisioned purpose. Furthermore, the (semi-)automatic curation of repository data is dependent on analyses that can highlight misannotation and edge conditions for data sets. Such curation is an important prerequisite for efficient proteomics data reuse in the life sciences in general. We therefore present here a selection of quality control metrics and approaches for the a posteriori detection of potential issues encountered in typical proteomics data sets. We illustrate our metrics by relying on publicly available data from the Proteomics Identifications Database (PRIDE), and simultaneously show the usefulness of the large body of PRIDE data as a means to derive empirical background distributions for relevant metrics.
  398. Chancerel, E., Lepoittevin, C., Le Provost, G., Lin, Y.-C., Jaramillo-Correa, J. P., Eckert, A. J., Wegrzyn, J. L., et al. (2011). Development and implementation of a highly-multiplexed SNP array for genetic mapping in maritime pine and comparative mapping with loblolly pine. BMC GENOMICS, 12.
    Background: Single nucleotide polymorphisms (SNPs) are the most abundant source of genetic variation among individuals of a species. New genotyping technologies allow examining hundreds to thousands of SNPs in a single reaction for a wide range of applications such as genetic diversity analysis, linkage mapping, fine QTL mapping, association studies, marker-assisted or genome-wide selection. In this paper, we evaluated the potential of highly-multiplexed SNP genotyping for genetic mapping in maritime pine (Pinus pinaster Ait.), the main conifer used for commercial plantation in southwestern Europe. Results: We designed a custom GoldenGate assay for 1,536 SNPs detected through the resequencing of gene fragments (707 in vitro SNPs/Indels) and from Sanger-derived Expressed Sequence Tags assembled into a unigene set (829 in silico SNPs/Indels). Offspring from three-generation outbred (G2) and inbred (F2) pedigrees were genotyped. The success rate of the assay was 63.6% and 74.8% for in silico and in vitro SNPs, respectively. A genotyping error rate of 0.4% was further estimated from segregating data of SNPs belonging to the same gene. Overall, 394 SNPs were available for mapping. A total of 287 SNPs were integrated with previously mapped markers in the G2 parental maps, while 179 SNPs were localized on the map generated from the analysis of the F2 progeny. Based on 98 markers segregating in both pedigrees, we were able to generate a consensus map comprising 357 SNPs from 292 different loci. Finally, the analysis of sequence homology between mapped markers and their orthologs in a Pinus taeda linkage map made it possible to align the 12 linkage groups of both species. Conclusions: Our results show that the GoldenGate assay can be used successfully for high-throughput SNP genotyping in maritime pine, a conifer species that has a genome seven times the size of the human genome. This SNP-array will be extended thanks to recent sequencing effort using new generation sequencing technologies and will include SNPs from comparative orthologous sequences that were identified in the present study, providing a wider collection of anchor points for comparative genomics among the conifers.
  399. Van Damme, Petra, Hole, K., Pimenta-Marques, A., Helsens, K., Vandekerckhove, J., Martinho, R. G., Gevaert, K., et al. (2011). NatF contributes to an evolutionary shift in protein N-terminal acetylation and is important for normal chromosome segregation. PLOS GENETICS, 7(7).
    N-terminal acetylation (N-Ac) is a highly abundant eukaryotic protein modification. Proteomics revealed a significant increase in the occurrence of N-Ac from lower to higher eukaryotes, but evidence explaining the underlying molecular mechanism(s) is currently lacking. We first analysed protein N-termini and their acetylation degrees, suggesting that evolution of substrates is not a major cause for the evolutionary shift in N-Ac. Further, we investigated the presence of putative N-terminal acetyltransferases (NATs) in higher eukaryotes. The purified recombinant human and Drosophila homologues of a novel NAT candidate were subjected to in vitro peptide library acetylation assays. This provided evidence for its NAT activity targeting Met-Lys- and other Met-starting protein N-termini, and the enzyme was termed Naa60p and its activity NatF. Its in vivo activity was investigated by ectopically expressing human Naa60p in yeast followed by N-terminal COFRADIC analyses. hNaa60p acetylated distinct Met-starting yeast protein N-termini and increased general acetylation levels, thereby altering yeast in vivo acetylation patterns towards those of higher eukaryotes. Further, its activity in human cells was verified by overexpression and knockdown of hNAA60 followed by N-terminal COFRADIC. NatF's cellular impact was demonstrated in Drosophila cells where NAA60 knockdown induced chromosomal segregation defects. In summary, our study revealed a novel major protein modifier contributing to the evolution of N-Ac, redundancy among NATs, and an essential regulator of normal chromosome segregation. With the characterization of NatF, the co-translational N-Ac machinery appears complete since all the major substrate groups in eukaryotes are accounted for.
  400. Helsens, K., Van Damme, P., Degroeve, S., Martens, L., Arnesen, T., Vandekerckhove, J., & Gevaert, K. (2011). Bioinformatics analysis of a Saccharomyces cerevisiae N-terminal proteome provides evidence of alternative translation initiation and post-translational N-terminal acetylation. JOURNAL OF PROTEOME RESEARCH, 10(8), 3578–3589.
    Initiation of protein translation is a well-studied fundamental process, albeit high-throughput and more comprehensive determination of the exact translation initiation sites (TIS) was only recently made possible following the introduction of positional proteomics techniques that target protein N-termini. Precise translation initiation is of crucial importance, as truncated or extended proteins might fold, function, and locate erroneously. Still, as already shown for some proteins, alternative translation initiation can also serve as a regulatory mechanism. By applying N-terminal COFRADIC (combined fractional diagonal chromatography), we here isolated N-terminal peptides of a Saccharomyces cerevisiae proteome and analyzed both annotated and alternative TIS. We analyzed this N-terminome of S. cerevisiae which resulted in the identification of 650 unique N-terminal peptides corresponding to database annotated TIS. Furthermore, 56 unique N(alpha)-acetylated peptides were identified that suggest alternative TIS (MS/MS-based), while MS-based evidence of N(alpha)-acetylation led to an additional 33 such peptides. To improve the overall sensitivity of the analysis, we also included the 5' UTR (untranslated region) in-frame translations together with the yeast protein sequences in UniProtKB/Swiss-Prot. To ensure the quality of the individual peptide identifications, peptide-to-spectrum matches were only accepted at a 99% probability threshold and were subsequently analyzed in detail by the Peptizer tool to automatically ascertain their compliance with several expert criteria. Furthermore, we have also identified 60 MS/MS-based and 117 MS-based N(alpha)-acetylated peptides that point to N(alpha)-acetylation as a post-translational modification since these peptides did not start nor were preceded (in their corresponding protein sequence) by a methionine residue. Next, we evaluated consensus sequence features of nucleic acids and amino acids across each of these groups of peptides and evaluated the results in the context of publicly available data. Taken together, we present a list of 706 annotated and alternative TIS for yeast proteins and found that under normal growth conditions alternative TIS might (co)occur in S. cerevisiae in roughly one tenth of all proteins. Furthermore, we found that the nucleic acid and amino acid features proximate to these alternative TIS favor either guanine or adenine nucleotides following the start codon or acidic amino acids following the initiator methionine. Finally, we also observed an unexpectedly high number of N(alpha)-acetylated peptides that could not be related to TIS and therefore suggest events of post-translational N(alpha)-acetylation.
  401. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2011). Context-dependent codon partition models provide significant increases in model fit in atpB and rbcL protein-coding genes. BMC EVOLUTIONARY BIOLOGY, 11.
    Background: Accurate modelling of substitution processes in protein-coding sequences is often hampered by the computational burdens associated with full codon models. Lately, codon partition models have been proposed as a viable alternative, mimicking the substitution behaviour of codon models at a low computational cost. Such codon partition models however impose independent evolution of the different codon positions, which is overly restrictive from a biological point of view. Given that empirical research has provided indications of context-dependent substitution patterns at four-fold degenerate sites, we take those indications into account in this paper. Results: We present so-called context-dependent codon partition models to assess previous empirical claims that the evolution of four-fold degenerate sites is strongly dependent on the composition of its two flanking bases. To this end, we have estimated and compared various existing independent models, codon models, codon partition models and context-dependent codon partition models for the atpB and rbcL genes of the chloroplast genome, which are frequently used in plant systematics. Such context-dependent codon partition models employ a full dependency scheme for four-fold degenerate sites, whilst maintaining the independence assumption for the first and second codon positions. Conclusions: We show that, both in the atpB and rbcL alignments of a collection of land plants, these context-dependent codon partition models significantly improve model fit over existing codon partition models. Using Bayes factors based on thermodynamic integration, we show that in both datasets the same context-dependent codon partition model yields the largest increase in model fit compared to an independent evolutionary model. Context-dependent codon partition models hence perform closer to codon models, which remain the best performing models at a drastically increased computational cost, compared to codon partition models, but remain computationally interesting alternatives to codon models. Finally, we observe that the substitution patterns in both datasets are drastically different, leading to the conclusion that combined analysis of these two genes using a single model may not be advisable from a context-dependent point of view.
  402. Colaert, N., Barsnes, H., Vaudel, M., Helsens, K., Timmerman, E., Sickmann, A., Gevaert, K., et al. (2011). thermo-msf-parser: an open source Java library to parse and visualize Thermo Proteome Discoverer msf files. JOURNAL OF PROTEOME RESEARCH, 10(8), 3840–3843.
    The Thermo Proteome Discoverer program integrates both peptide identification and quantification into a single workflow for peptide-centric proteomics. Furthermore, its close integration with Thermo mass spectrometers has made it increasingly popular in the field. Here, we present a Java library to parse the msf files that constitute the output of Proteome Discoverer. The parser is also implemented as a graphical user interface allowing convenient access to the information found in the msf files, and in Rover, a program to analyze and validate quantitative proteomics information. All code, binaries, and documentation are freely available.
  403. Staes, A., Impens, F., Van Damme, P., Ruttens, B., Goethals, M., Demol, H., Timmerman, E., et al. (2011). Selecting protein N-terminal peptides by combined fractional diagonal chromatography. NATURE PROTOCOLS, 6(8), 1130–1141.
    In recent years, procedures for selecting the N-terminal peptides of proteins with analysis by mass spectrometry have been established to characterize protease-mediated cleavage and protein alpha-N-acetylation on a proteomic level. As a pioneering technology, N-terminal combined fractional diagonal chromatography (COFRADIC) has been used in numerous studies in which these protein modifications were investigated. Derivatization of primary amines (which can include stable isotope labeling) occurs before trypsin digestion so that cleavage occurs after arginine residues. Strong cation exchange (SCX) chromatography results in the removal of most of the internal peptides. Diagonal, reversed-phase peptide chromatography, in which the two runs are separated by reaction with 2,4,6-trinitrobenzenesulfonic acid, results in the removal of the C-terminal peptides and remaining internal peptides and the fractionation of the sample. We describe here the fully matured N-terminal COFRADIC protocol as it is currently routinely used, including the most substantial improvements (including treatment with glutamine cyclotransferase and pyroglutamyl aminopeptidase to remove pyroglutamate before SCX, and a sample pooling scheme to reduce the overall number of liquid chromatography-tandem mass spectrometry analyses) that were made since its original publication. Completion of the N-terminal COFRADIC procedure takes ~5 d.
  404. Decock, A., Ongenaert, M., Vandesompele, J., & Speleman, F. (2011). Neuroblastoma epigenetics : from candidate gene approaches to genome-wide screenings. EPIGENETICS, 6(8), 962–970.
    Neuroblastoma (NB) is a childhood tumor originating from sympathetic nervous system cells. Although recently new insights into genes involved in NB have emerged, the molecular basis of neuroblastoma development and progression still remains poorly understood. The best-characterized genetic alterations include amplification of the proto-oncogene MYCN, ALK activating mutations or amplification, gain of chromosome arm 17q and losses of 1p, 3p and 11q. Epigenetic alterations have been described as well: caspase-8 (CASP8) and RAS-association domain family 1 isoform A (RASSF1A) DNA-methylation are important events for the development and progression of neuroblastoma. In total, about 75 genes are described as epigenetically affected in NB cell lines and/or NB primary samples. These epigenetic alterations were either found using a candidate gene approach or based on the analysis of genome-wide screening techniques. This review gives an extensive overview of all epigenetic changes described in NB as of today, with a main focus on both prognostic use and the potential of genome-wide techniques to find epigenetic prognostic biomarkers in NB. We summarize the key findings so far and the state-of-the-art of the upcoming methods at a unique time frame in the transition towards combined genome-wide chromatin immunoprecipitation (ChIP) and DNA sequencing techniques.
  405. Michoel, T., Joshi, A., Nachtergaele, B., & Van de Peer, Y. (2011). Enrichment and aggregation of topological motifs are independent organizational principles of integrated interaction networks. MOLECULAR BIOSYSTEMS, 7(10), 2769–2778.
    Topological network motifs represent functional relationships within and between regulatory and protein-protein interaction networks. Enriched motifs often aggregate into self-contained units forming functional modules. Theoretical models for network evolution by duplication-divergence mechanisms and for network topology by hierarchical scale-free networks have suggested a one-to-one relation between network motif enrichment and aggregation, but this relation has never been tested quantitatively in real biological interaction networks. Here we introduce a novel method for assessing the statistical significance of network motif aggregation and for identifying clusters of overlapping network motifs. Using an integrated network of transcriptional, posttranslational and protein-protein interactions in yeast we show that network motif aggregation reflects a local modularity property which is independent of network motif enrichment. In particular our method identified novel functional network themes for a set of motifs which are not enriched yet aggregate significantly and challenges the conventional view that network motif enrichment is the most basic organizational principle of complex networks.
  406. Lefever, E., Hoste, V., & De Cock, M. (2011). ParaSense or how to use parallel corpora for word sense disambiguation. Proceedings of the 49th annual meeting of the Association for Computational Linguistics : short papers (pp. 317–322). Presented at the 49th Annual meeting of the Association for Computational Linguistics : Human Language Technologies (ACL-HLT 2011), Association for Computational Linguistics (ACL).
  407. Vansteelandt, S., Bowden, J., Babanezhad, M., & Goetghebeur, E. (2011). On instrumental variables estimation of causal odds ratios. STATISTICAL SCIENCE, 26(3), 403–422.
    Inference for causal effects can benefit from the availability of an instrumental variable (IV) which, by definition, is associated with the given exposure, but not with the outcome of interest other than through a causal exposure effect. Estimation methods for instrumental variables are now well established for continuous outcomes, but much less so for dichotomous outcomes. In this article we review IV estimation of so-called conditional causal odds ratios which express the effect of an arbitrary exposure on a dichotomous outcome conditional on the exposure level, instrumental variable and measured covariates. In addition, we propose IV estimators of so-called marginal causal odds ratios which express the effect of an arbitrary exposure on a dichotomous outcome at the population level, and are therefore of greater public health relevance. We explore interconnections between the different estimators and support the results with extensive simulation studies and three applications.
  408. Clayton, R., Bernus, O., Cherry, E., Dierckx, H., Fenton, F., Mirabella, L., Panfilov, A., et al. (2011). Models of cardiac tissue electrophysiology: progress, challenges and open questions. PROGRESS IN BIOPHYSICS & MOLECULAR BIOLOGY, 104(1-3), 22–48.
  409. Baetens, M., Van Laer, L., De Leeneer, K., Hellemans, J., De Schrijver, J., Van de Voorde, H., Renard, M., et al. (2011). Applying massive parallel sequencing to molecular diagnosis of Marfan and Loeys-Dietz syndromes. HUMAN MUTATION, 32(9), 1053–1062.
  410. Clayton, R., Nash, M., Bradley, C., Panfilov, A., Paterson, D., & Taggart, P. (2011). Experiment-model interaction for analysis of epicardial activation during human ventricular fibrillation with global myocardial ischaemia. PROGRESS IN BIOPHYSICS & MOLECULAR BIOLOGY, 107(1), 101–111.
  411. Quinn, T., Granite, S., Allessie, M., Antzelevitch, C., Bollensdorff, C., Bub, G., Burton, R., et al. (2011). Minimum information about a cardiac electrophysiology experiment (MICEE): standardised reporting for model reproducibility, interoperability, and data sharing. PROGRESS IN BIOPHYSICS & MOLECULAR BIOLOGY, 107(1), 4–10.
  412. Van Maerken, T., Rihani, A., De Paepe, A., Vandesompele, J., & Speleman, F. (2011). Pharmacological activation of p53 and gene therapy as emerging treatment options for neuroblastoma. BELGIAN JOURNAL OF MEDICAL ONCOLOGY, 5(5), 219–221.
  413. Mestdagh, Pieter, Lefever, S., Pattyn, F., Ridzon, D., Fredlund, E., Fieuw, A., Ongenaert, M., et al. (2011). The microRNA body map : dissecting microRNA function through integrative genomics. NUCLEIC ACIDS RESEARCH, 39(20).
    While a growing body of evidence implicates regulatory miRNA modules in various aspects of human disease and development, insights into specific miRNA function remain limited. Here, we present an innovative approach to elucidate tissue-specific miRNA functions that goes beyond miRNA target prediction and expression correlation. This approach is based on a multi-level integration of corresponding miRNA and mRNA gene expression levels, miRNA target prediction, transcription factor target prediction and mechanistic models of gene network regulation. Predicted miRNA functions were either validated experimentally or compared to published data. The predicted miRNA functions are accessible in the miRNA bodymap, an interactive online compendium and mining tool of high-dimensional newly generated and published miRNA expression profiles. The miRNA bodymap enables prioritization of candidate miRNAs based on their expression pattern or functional annotation across tissue or disease subgroup. The miRNA bodymap project provides users with a single one-stop data-mining solution and has great potential to become a community resource.
  414. De Weer, An, Van der Meulen, J., Rondou, P., Taghon, T., Konrad, T. A., De Preter, K., Mestdagh, P., et al. (2011). EVI1-mediated down regulation of MIR449A is essential for the survival of EVI1 positive leukaemic cells. BRITISH JOURNAL OF HAEMATOLOGY, 154(3), 337–348.
    Chromosomal rearrangements involving the MECOM (MDS1 and EVI1 complex) locus are recurrent genetic events in myeloid leukaemia and are associated with poor prognosis. In this study, we assessed the role of MECOM locus protein EVI1 in the transcriptional regulation of microRNAs (miRNAs) involved in the leukaemic phenotype. For this, we profiled expression of 366 miRNAs in 38 MECOM-rearranged patient samples, normal bone marrow controls and MECOM (EVI1) knock down/re-expression models. Cross-comparison of these miRNA expression profiling data showed that MECOM rearranged leukaemias are characterized by down regulation of MIR449A. Reconstitution of MIR449A expression in MECOM-rearranged cell line models induced apoptosis resulting in a strong decrease in cell viability. These effects might be mediated in part by MIR449A regulation of NOTCH1 and BCL2, which are shown here to be bona fide MIR449A targets. Finally, we confirmed that MIR449A repression is mediated through direct promoter occupation of the EVI1 transcriptional repressor. In conclusion, this study reveals MIR449A as a crucial direct target of the MECOM locus protein EVI1 involved in the pathogenesis of MECOM-rearranged leukaemias and unravels NOTCH1 and BCL2 as important novel targets of MIR449A. This EVI1-MIR449A-NOTCH1/BCL2 regulatory axis might open new possibilities for the development of therapeutic strategies in this poor prognostic leukaemia subgroup.
  415. Kano, Y., Bjorne, J., Ginter, F., Salakoski, T., Buyko, E., Hahn, U., … Tsujii, J. (2011). U-Compare bio-event meta-service : compatible BioNLP event extraction services. BMC BIOINFORMATICS, 12.
    Background: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. Results: We have integrated nine event extraction systems in the U-Compare framework, making them inter-compatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. Conclusions: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.
  416. Van Landeghem, S., Ginter, F., Van de Peer, Y., & Salakoski, T. (2011). EVEX: a PubMed-scale resource for homology-based generalization of text mining predictions. Proceedings of the 2011 workshop on biomedical natural language processing (pp. 28–37). Presented at the Workshop on Biomedical Natural Language Processing (ACL-HLT 2011), Association for Computational Linguistics (ACL).
    In comparative genomics, functional annotations are transferred from one organism to another relying on sequence similarity. With more than 20 million citations in PubMed, text mining provides the ideal tool for generating additional large-scale homology-based predictions. To this end, we have refined a recent dataset of biomolecular events extracted from text, and integrated these predictions with records from public gene databases. Accounting for lexical variation of gene symbols, we have implemented a disambiguation algorithm that uniquely links the arguments of 11.2 million biomolecular events to well-defined gene families, providing interesting opportunities for query expansion and hypothesis generation. The resulting MySQL database, including all 19.2 million original events as well as their homology-based variants, is publicly available.
  417. Vandamme, P., & Dawyndt, P. (2011). Classification and identification of the Burkholderia cepacia complex: past, present and future. SYSTEMATIC AND APPLIED MICROBIOLOGY, 34(2), 87–95.
  418. Hole, K., Van Damme, P., Dalva, M., Aksnes, H., Glomnes, N., Varhaug, J. E., Lillehaug, J. R., et al. (2011). The human N-alpha-acetyltransferase 40 (hNaa40p/hNatD) is conserved from yeast and N-terminally acetylates histones H2A and H4. PLOS ONE, 6(9).
    Protein N(alpha)-terminal acetylation (Nt-acetylation) is considered one of the most common protein modifications in eukaryotes, and 80-90% of all soluble human proteins are modified in this way, with functional implications ranging from altered protein function and stability to translocation potency amongst others. Nt-acetylation is catalyzed by N-terminal acetyltransferases (NATs), and in yeast five NAT types are identified and denoted NatA-NatE. Higher eukaryotes additionally express NatF. Except for NatD, human orthologues for all yeast NATs are identified. yNatD is defined as the catalytic unit Naa40p (Nat4) which co-translationally Nt-acetylates histones H2A and H4. In this study we identified and characterized hNaa40p/hNatD, the human orthologue of the yeast Naa40p. An in vitro proteome-derived peptide library Nt-acetylation assay indicated that recombinant hNaa40p acetylates N-termini starting with the consensus sequence Ser-Gly-Gly-Gly-Lys-, strongly resembling the N-termini of the human histones H2A and H4. This was confirmed as recombinant hNaa40p Nt-acetylated the oligopeptides derived from the N-termini of both histones. In contrast, a synthetically Nt-acetylated H4 N-terminal peptide with all lysines being non-acetylated, was not significantly acetylated by hNaa40p, indicating that hNaa40p catalyzed H4 N(alpha)-acetylation and not H4 lysine N(epsilon)-acetylation. Also, immunoprecipitated hNaa40p specifically Nt-acetylated H4 in vitro. Heterologous expression of hNaa40p in a yeast naa40-Delta strain restored Nt-acetylation of yeast histone H4, but not H2A in vivo, probably reflecting the fact that the N-terminal sequences of human H2A and H4 are highly similar to each other and to yeast H4 while the N-terminal sequence of yeast H2A differs. Thus, Naa40p seems to have co-evolved with the human H2A sequence. Finally, a partial co-sedimentation with ribosomes indicates that hNaa40p co-translationally acetylates H2A and H4. Combined, our results strongly suggest that human Naa40p/NatD is conserved from yeast. Thus, the NATs of all classes of N-terminally acetylated proteins in humans now appear to be accounted for.
  419. Grbić, M., Van Leeuwen, T., Clark, R. M., Rombauts, S., Rouzé, P., Grbić, V., Osborne, E. J., et al. (2011). The genome of Tetranychus urticae reveals herbivorous pest adaptations. NATURE, 479(7374), 487–492.
    The spider mite Tetranychus urticae is a cosmopolitan agricultural pest with an extensive host plant range and an extreme record of pesticide resistance. Here we present the completely sequenced and annotated spider mite genome, representing the first complete chelicerate genome. At 90 megabases T. urticae has the smallest sequenced arthropod genome. Compared with other arthropods, the spider mite genome shows unique changes in the hormonal environment and organization of the Hox complex, and also reveals evolutionary innovation of silk production. We find strong signatures of polyphagy and detoxification in gene families associated with feeding on different hosts and in new gene families acquired by lateral gene transfer. Deep transcriptome analysis of mites feeding on different plants shows how this pest responds to a changing host environment. The T. urticae genome thus offers new insights into arthropod evolution and plant-herbivore interactions, and provides unique opportunities for developing novel plant protection strategies.
  420. Helsens, K., Brusniak, M.-Y., Deutsch, E., Moritz, R. L., & Martens, L. (2011). jTraML: an open source Java API for TraML, the PSI standard for sharing SRM transitions. JOURNAL OF PROTEOME RESEARCH, 10(11), 5260–5263.
    We here present jTraML, a Java API for the Proteomics Standards Initiative TraML data standard. The library provides fully functional classes for all elements specified in the TraML XSD document, as well as convenient methods to construct controlled vocabulary-based instances required to define SRM transitions. The use of jTraML is demonstrated via a two-way conversion tool between TraML documents and vendor specific files, facilitating the adoption process of this new community standard. The library is released as open source under the permissive Apache2 license and is available for download. TraML files can also be converted online.
  421. VanderWeele, T. J., & Vansteelandt, S. (2011). A weighting approach to causal effects and additive interaction in case-control studies: marginal structural linear odds models. AMERICAN JOURNAL OF EPIDEMIOLOGY, 174(10), 1197–1203.
    Estimates of additive interaction from case-control data are often obtained by logistic regression; such models can also be used to adjust for covariates. This approach to estimating additive interaction has come under some criticism because of possible misspecification of the logistic model: If the underlying model is linear, the logistic model will be misspecified. The authors propose an inverse probability of treatment weighting approach to causal effects and additive interaction in case-control studies. Under the assumption of no unmeasured confounding, the approach amounts to fitting a marginal structural linear odds model. The approach allows for the estimation of measures of additive interaction between dichotomous exposures, such as the relative excess risk due to interaction, using case-control data without having to rely on modeling assumptions for the outcome conditional on the exposures and covariates. Rather than using conditional models for the outcome, models are instead specified for the exposures conditional on the covariates. The approach is illustrated by assessing additive interaction between genetic and environmental factors using data from a case-control study.
  422. Martinussen, T., Vansteelandt, S., Gerster, M., & Hjelmborg, J. von B. (2011). Estimation of direct effects for survival data by using the Aalen additive hazards model. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 73(5), 773–788.
    We extend the definition of the controlled direct effect of a point exposure on a survival outcome, other than through some given, time-fixed intermediate variable, to the additive hazard scale. We propose two-stage estimators for this effect when the exposure is dichotomous and randomly assigned and when the association between the intermediate variable and the survival outcome is confounded only by measured factors, which may themselves be affected by the exposure. The first stage of the estimation procedure involves assessing the effect of the intermediate variable on the survival outcome via Aalen's additive regression for the event time, given exposure, intermediate variable and confounders. The second stage involves applying Aalen's additive model, given the exposure alone, to a modified stochastic process (i.e. a modification of the observed counting process based on the first-stage estimates). We give the large sample properties of the estimator proposed and investigate its small sample properties by Monte Carlo simulation. A real data example is provided for illustration.
  423. Victor, P., Cornelis, C., De Cock, M., & Herrera-Viedma, E. (2011). Practical aggregation operators for gradual trust and distrust. FUZZY SETS AND SYSTEMS, 184(1), 126–147. Presented at the EUROFUSE Workshop on Preference Modelling and Decision Analysis.
  424. Abou-El-Ardat, K., Derradji, H., De Vos, W., De Meyer, T., Bekaert, S., Van Criekinge, W., & Baatout, S. (2011). Response to low-dose X-irradiation is p53-dependent in a papillary thyroid carcinoma model system. INTERNATIONAL JOURNAL OF ONCOLOGY, 39(6), 1429–1441.
    The link between high doses of radiation and thyroid cancer has been well established in various studies, as opposed to the effects of low doses. In this study, we investigated the effects of low-dose X-ray irradiation in a papillary thyroid carcinoma model with wild-type and mutated p53. A low dose of 62.5 mGy was enough to cause an upregulation of p16 and a decrease in the number of TPC-1 cells in the S phase, but not in the number of BCPAP p53-mutant cells. At a dose of 0.5 Gy, visible signs of senescence appeared only in the TPC-1 cells. We conclude that low doses of X-rays are enough to cause a change in cell cycle distribution, possibly p53-dependent p16 activation, but no significant apoptosis. Senescence requires higher doses of X-irradiation via a mechanism involving both p16 and p21.
  425. Bauters, Kim, Schockaert, S., De Cock, M., & Vermeir, D. (2011). Weak and strong disjunction in possibilistic ASP. In S. Benferhat & J. Grant (Eds.), Lecture Notes in Artificial Intelligence (Vol. 6929, pp. 475–488). Presented at the 5th International conference on Scalable Uncertainty Management (SUM 2011), Berlin, Germany: Springer.
    Possibilistic answer set programming (PASP) unites answer set programming (ASP) and possibilistic logic (PL) by associating certainty values with rules. The resulting framework allows to combine both non-monotonic reasoning and reasoning under uncertainty in a single framework. While PASP has been well-studied for possibilistic definite and possibilistic normal programs, we argue that the current semantics of possibilistic disjunctive programs are not entirely satisfactory. The problem is twofold. First, the treatment of negation-as-failure in existing approaches follows an all-or-nothing scheme that is hard to match with the graded notion of proof underlying PASP. Second, we advocate that the notion of disjunction can be interpreted in several ways. In particular, in addition to the view of ordinary ASP where disjunctions are used to induce a non-deterministic choice, the possibilistic setting naturally leads to a more epistemic view of disjunction. In this paper, we propose a semantics for possibilistic disjunctive programs, discussing both views on disjunction. Extending our earlier work, we interpret such programs as sets of constraints on possibility distributions, whose least specific solutions correspond to answer sets.
  426. Joshi, Anagha, Van de Peer, Y., & Michoel, T. (2011). Structural and functional organization of RNA regulons in the post-transcriptional regulatory network of yeast. NUCLEIC ACIDS RESEARCH, 39(21), 9108–9117.
    Post-transcriptional control of mRNA transcript processing by RNA binding proteins (RBPs) is an important step in the regulation of gene expression and protein production. The post-transcriptional regulatory network is similar in complexity to the transcriptional regulatory network and is thought to be organized in RNA regulons, coherent sets of functionally related mRNAs combinatorially regulated by common RBPs. We integrated genome-wide transcriptional and translational expression data in yeast with large-scale regulatory networks of transcription factor and RBP binding interactions to analyze the functional organization of post-transcriptional regulation and RNA regulons at a system level. We found that post-transcriptional feedback loops and mixed bifan motifs are overrepresented in the integrated regulatory network and control the coordinated translation of RNA regulons, manifested as clusters of functionally related mRNAs which are strongly coexpressed in the translatome data. These translatome clusters are more functionally coherent than transcriptome clusters and are expressed with higher mRNA and protein levels and less noise. Our results show how the post-transcriptional network is intertwined with the transcriptional network to regulate gene expression in a coordinated way and that the integration of heterogeneous genome-wide datasets allows to relate structure to function in regulatory networks at a system level.
  427. Lefever, E., Hoste, V., & De Cock, M. (2011). Using parallel corpora for word sense disambiguation. In P. De Causmaecker, J. Maervoet, T. Messelis, K. Verbeeck, & T. Vermeulen (Eds.), BNAIC : Belgian/Netherlands Artificial Intelligence Conference (pp. 407–408). Gent, Belgium.
  428. De Leeneer, K., De Schrijver, J., Clement, L., Baetens, M., Lefever, S., De Keulenaer, S., Van Criekinge, W., et al. (2011). Practical tools to implement massive parallel pyrosequencing of PCR products in next generation molecular diagnostics. PLOS ONE, 6(9).
    Despite improvements in terms of sequence quality and price per basepair, Sanger sequencing remains restricted to screening of individual disease genes. The development of massively parallel sequencing (MPS) technologies heralded an era in which molecular diagnostics for multigenic disorders becomes reality. Here, we outline different PCR amplification based strategies for the screening of a multitude of genes in a patient cohort. We performed a thorough evaluation in terms of set-up, coverage and sequencing variants on the data of 10 GS-FLX experiments (over 200 patients). Crucially, we determined the actual coverage that is required for reliable diagnostic results using MPS, and provide a tool to calculate the number of patients that can be screened in a single run. Finally, we provide an overview of factors contributing to false negative or false positive mutation calls and suggest ways to maximize sensitivity and specificity, both important in a routine setting. By describing practical strategies for screening of multigenic disorders in a multitude of samples and providing answers to questions about minimum required coverage, the number of patients that can be screened in a single run and the factors that may affect sensitivity and specificity we hope to facilitate the implementation of MPS technology in molecular diagnostics.
  429. Waegeman, W., & De Baets, B. (2011). ERA ranking representability: the missing link between ordinal regression and multi-class classification. Proceedings of the 2011 11th International Conference on Intelligent Systems Design and Applications (ISDA) (pp. 1188–1193). Presented at the 11th International conference on Intelligent Systems Design and Applications (ISDA 2011), Piscataway, NJ, USA: IEEE.
    Can a multi-class classification model in some situations be simplified to an ordinal regression model without sacrificing performance? We try to answer this question from a theoretical point of view for one-versus-one multi-class ensembles. To that end, sufficient conditions are derived for which a one-versus-one ensemble becomes ranking representable, i.e. conditions for which the ensemble can be reduced to a ranking or ordinal regression model such that a similar performance on training data is measured. As performance measure, we use the area under the ROC curve (AUC) and its reformulation in terms of graphs. For the three-class case, this results in a new type of cycle transitivity for pairwise AUCs that can be verified by solving an integer quadratic program. Moreover, solving this integer quadratic program can be avoided, since its solution converges for an infinite data sample to a simple form, resulting in a deviation bound that becomes tighter with increasing sample size.
  430. Fukuda, S., Nakajima, J., De Baets, B., Waegeman, W., Mukai, T., Mouton, A., & Orikura, N. (2011). A discussion on the accuracy-complexity relationship in modelling fish habitat preference using genetic Takagi-Sugeno fuzzy systems. Proceedings 2011 IEEE 5th International Workshop on Genetic and Evolutionary Fuzzy Systems (GEFS 2011) (pp. 81–86). Presented at the 2011 IEEE 5th International workshop on Genetic and Evolutionary Fuzzy Systems (GEFS 2011), Piscataway, NJ, USA: IEEE.
    The relationship among accuracy, interpretability, and complexity of genetic fuzzy systems (GFSs) is a hot topic and is actively studied in the GFS domain. Because different problems have different views of interpretation, it is quite difficult to evaluate the interpretability of GFSs in general. The present study aims to analyze accuracy-complexity relationship in fish habitat modelling using a genetic Takagi-Sugeno fuzzy model called fuzzy habitat preference model (FHPM). The model complexity was defined by bit lengths of a genetic algorithm (GA) assigned to the consequent part of the model, while fuzzy rules and antecedent parts were kept the same. FHPM was developed on the basis of the mean squared errors between the composite habitat preference and the observed presence-absence of fish. The model accuracy was evaluated using multiple performance measures. As a result, the different model complexities resulted in slightly different habitat preference curves and model accuracies. At some complexities, the model accuracy was found to be slightly improved with increased model complexity. The result suggests that an optimal point exists where the model complexity can take a balance between the accuracy and the complexity of the target models, which depends partly on data characteristics and model formulations of the GFSs.
  431. Rayner, J. C. W., Thas, O., & Best, D. (2011). Smooth tests of goodness of fit. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS, 3(5), 397–406.
  432. Van Landeghem, S., De Baets, B., Van de Peer, Y., & Saeys, Y. (2011). High-precision bio-molecular event extraction from text using parallel binary classifiers. COMPUTATIONAL INTELLIGENCE, 27(4), 645–664.
    We have developed a machine learning framework to accurately extract complex genetic interactions from text. Employing type-specific classifiers, this framework processes research articles to extract various biological events. Subsequently, the algorithm identifies regulation events that take other events as arguments, allowing a nested structure of predictions. All predictions are merged into an integrated network, useful for visualization and for deduction of new biological knowledge. In this paper, we discuss several design choices for an event-based extraction framework. These detailed studies help improving on existing systems, which is illustrated by the relative performance gain of 10% of our system compared to the official results in the recent BioNLP'09 Shared Task. Our framework now achieves state-of-the-art performance with 37.43 recall, 54.81 precision and 44.48 F-score. We further present the first study of feature selection for bio-molecular event extraction from text. While producing more cost-effective models, feature selection can also lead to a better insight into the complexity of the challenge. Finally, this paper tries to bridge the gap between theoretical relation extraction from text and experimental work on bio-molecular interactions by discussing interesting opportunities to employ event-based text mining tools for real-life tasks such as hypothesis generation, database curation and knowledge discovery.
  433. Young, N. D., Debellé, F., Oldroyd, G. E., Geurts, R., Cannon, S. B., Udvardi, M. K., Benedito, V. A., et al. (2011). The Medicago genome provides insight into the evolution of rhizobial symbioses. NATURE, 480(7378), 520–524.
    Legumes (Fabaceae or Leguminosae) are unique among cultivated plants for their ability to carry out endosymbiotic nitrogen fixation with rhizobial bacteria, a process that takes place in a specialized structure known as the nodule. Legumes belong to one of the two main groups of eurosids, the Fabidae, which includes most species capable of endosymbiotic nitrogen fixation(1). Legumes comprise several evolutionary lineages derived from a common ancestor 60 million years ago (Myr ago). Papilionoids are the largest clade, dating nearly to the origin of legumes and containing most cultivated species(2). Medicago truncatula is a long-established model for the study of legume biology. Here we describe the draft sequence of the M. truncatula euchromatin based on a recently completed BAC assembly supplemented with Illumina shotgun sequence, together capturing ~94% of all M. truncatula genes. A whole-genome duplication (WGD) approximately 58 Myr ago had a major role in shaping the M. truncatula genome and thereby contributed to the evolution of endosymbiotic nitrogen fixation. Subsequent to the WGD, the M. truncatula genome experienced higher levels of rearrangement than two other sequenced legumes, Glycine max and Lotus japonicus. M. truncatula is a close relative of alfalfa (Medicago sativa), a widely cultivated crop with limited genomics tools and complex autotetraploid genetics. As such, the M. truncatula genome sequence provides significant opportunities to expand alfalfa's genomic toolbox.
  434. Vernieuwe, H., De Baets, B., Minet, J., Pauwels, V., Lambot, S., Vanclooster, M., & Verhoest, N. (2011). Integrating coarse-scale uncertain soil moisture data into a fine-scale hydrological modelling scenario. HYDROLOGY AND EARTH SYSTEM SCIENCES, 15(10), 3101–3114.
  435. De Bock, Jasper, & De Cooman, G. (2011). State sequence prediction in imprecise hidden Markov models. In F. Coolen, G. De Cooman, T. Fetz, & M. Oberguggenberger (Eds.), ISIPTA ’11 - PROCEEDINGS OF THE SEVENTH INTERNATIONAL SYMPOSIUM ON IMPRECISE PROBABILITY: THEORIES AND APPLICATIONS (pp. 159–168). Presented at the 7th International symposium on Imprecise Probability: Theories and Applications (ISIPTA 2011).
    We present an efficient exact algorithm for estimating state sequences from outputs (or observations) in imprecise hidden Markov models (iHMM), where both the uncertainty linking one state to the next, and that linking a state to its output, are represented using coherent lower previsions. The notion of independence we associate with the credal network representing the iHMM is that of epistemic irrelevance. We consider as best estimates for state sequences the (Walley-Sen) maximal sequences for the posterior joint state model (conditioned on the observed output sequence), associated with a gain function that is the indicator of the state sequence. This corresponds to (and generalises) finding the state sequence with the highest posterior probability in HMMs with precise transition and output probabilities (pHMMs). We argue that the computational complexity is at worst quadratic in the length of the Markov chain, cubic in the number of states, and essentially linear in the number of maximal state sequences. For binary iHMMs, we investigate experimentally how the number of maximal state sequences depends on the model parameters.
  436. Van Damme, Petra, Arnesen, T., & Gevaert, K. (2011). Protein alpha-N-acetylation studied by N-terminomics. FEBS JOURNAL, 278(20), 3822–3834.
    Cotranslational protein N-terminal modifications, including proteolytic maturation such as initiator methionine excision by methionine aminopeptidases and N-terminal blocking, occur universally. Protein alpha-N-acetylation, or the transfer of the acetyl moiety of acetyl-coenzyme A to nascent protein N-termini, catalysed by multisubunit N-terminal acetyltransferase complexes, generally takes place during protein translation. Nearly all protein modifications are known to influence different protein aspects such as folding, stability, activity and localization, and several studies have indicated similar functions for protein alpha-N-acetylation. However, until recently, protein alpha-N-acetylation remained poorly explored, mainly due to the absence of targeted proteomics technologies. The recent emergence of N-terminomics technologies that allow isolation of protein N-terminal peptides, together with proteogenomics efforts combining experimental and informational content have greatly boosted the field of alpha-N-acetylation. In this review, we report on such emerging technologies as well as on breakthroughs in our understanding of protein N-terminal biology.
  437. Blondeel, M., Schockaert, S., De Cock, M., & Vermeir, D. (2011). Complexity of fuzzy answer set programming under Łukasiewicz semantics: first results. In Poster proceedings of the 5th international conference on scalable uncertainty management (pp. 7–12). Dayton, OH, USA.
    Fuzzy answer set programming (FASP) has recently been proposed as a generalization of answer set programming in which propositions are allowed to be graded. Little is known about its computational complexity. In this paper we present some results and reveal a connection to an open problem about integer equations, suggesting that characterizing the complexity of FASP may not be straightforward.
  438. Colaert, N., Degroeve, S., Helsens, K., & Martens, L. (2011). Analysis of the resolution limitations of peptide identification algorithms. JOURNAL OF PROTEOME RESEARCH, 10(12), 5555–5561.
    Proteome identification using peptide-centric proteomics techniques is a routinely used analysis technique. One of the most powerful and popular methods for the identification of peptides from MS/MS spectra is protein database matching using search engines. Significance thresholding through false discovery rate (FDR) estimation by target/decoy searches is used to ensure the retention of predominantly confident assignments of MS/MS spectra to peptides. However, shortcomings have become apparent when such decoy searches are used to estimate the FDR. To study these shortcomings, we here introduce a novel kind of decoy database that contains isobaric mutated versions of the peptides that were identified in the original search. Because of the supervised way in which the entrapment sequences are generated, we call this a directed decoy database. Since the peptides found in our directed decoy database are thus specifically designed to look quite similar to the forward identifications, the limitations of the existing search algorithms in making correct calls in such strongly confusing situations can be analyzed. Interestingly, for the vast majority of confidently identified peptide identifications, a directed decoy peptide-to-spectrum match can be found that has a better or equal match score than the forward match score, highlighting an important issue in the interpretation of peptide identifications in present-day high-throughput proteomics.
  439. De Bruyne, Katrien, Slabbinck, B., Waegeman, W., Vauterin, P., De Baets, B., & Vandamme, P. (2011). Bacterial species identification from MALDI-TOF mass spectra through data analysis and machine learning. SYSTEMATIC AND APPLIED MICROBIOLOGY, 34(1), 20–29.
  440. De Ruyck, Kim, Sabbe, N., Oberije, C., Vandecasteele, K., Thas, O., De Ruysscher, D., Lambin, P., et al. (2011). Development of a multicomponent prediction model for acute esophagitis in lung cancer patients receiving chemoradiotherapy. INTERNATIONAL JOURNAL OF RADIATION ONCOLOGY BIOLOGY PHYSICS, 81(2), 537–544.
  441. Abrahams, J.-P., Apweiler, R., Balling, R., Bertero, M. G., Bujnicki, J. M., Chayen, N. E., … Taussig, M. J. (2011). “4D Biology for health and disease” workshop report. NEW BIOTECHNOLOGY, 28(4), 291–293.
    The "4D Biology Workshop for Health and Disease", held on 16-17th of March 2010 in Brussels, aimed at finding the best organising principles for large-scale proteomics, interactomics and structural genomics/biology initiatives, and setting the vision for future high-throughput research and large-scale data gathering in biological and medical science. Major conclusions of the workshop include the following. (i) Development of new technologies and approaches to data analysis is crucial. Biophysical methods should be developed that span a broad range of time/spatial resolution and characterise structures and kinetics of interactions. Mathematics, physics, computational and engineering tools need to be used more in biology and new tools need to be developed. (ii) Database efforts need to focus on improved definitions of ontologies and standards so that system-scale data and associated metadata can be understood and shared efficiently. (iii) Research infrastructures should play a key role in fostering multidisciplinary research, maximising knowledge exchange between disciplines and facilitating access to diverse technologies. (iv) Understanding disease on a molecular level is crucial. System approaches may represent a new paradigm in the search for biomarkers and new targets in human disease. (v) Appropriate education and training should be provided to help efficient exchange of knowledge between theoreticians, experimental biologists and clinicians. These conclusions provide a strong basis for creating major possibilities in advancing research and clinical applications towards personalised medicine.
  442. Vandewoestyne, M., Pede, V., Lambein, K., Dhaenens, M., Offner, F., Praet, M., Philippé, J., et al. (2011). Laser microdissection for the assessment of the clonal relationship between chronic lymphocytic leukemia/small lymphocytic lymphoma and proliferating B cells within lymph node pseudofollicles. LEUKEMIA, 25(5), 883–888.
  443. Victor, P., Cornelis, C., & De Cock, M. (2011). Trust and recommendations. In F. Ricci, L. Rokach, B. Shapira, & P. B. Kantor (Eds.), Recommender systems handbook (pp. 645–676). New York, NY, USA: Springer.
  444. Fayruzov, T., Janssen, J., Vermeir, D., Cornelis, C., & De Cock, M. (2011). Modelling gene and protein regulatory networks with answer set programming. INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, 5(2), 209–229.
    Recently, many approaches to model regulatory networks have been proposed in the systems biology domain. However, the task is far from being solved. In this paper, we propose an Answer Set Programming (ASP)-based approach to model interaction networks. We build a general ASP framework that describes the network semantics and allows modelling specific networks with little effort. ASP provides a rich and flexible toolbox that allows expanding the framework with desired features. In this paper, we tune our framework to mimic Boolean network behaviour and apply it to model the Budding Yeast and Fission Yeast cell cycle networks. The obtained steady states of these networks correspond to those of the Boolean networks.
  445. Rademaker, M., & De Baets, B. (2011). Optimal restoration of stochastic monotonicity with respect to cumulative label frequency loss functions. INFORMATION SCIENCES, 181(4), 747–757.
  446. Van Pottelberge, G., Mestdagh, P., Bracke, K., Thas, O., Van Durme, Y., Joos, G., Vandesompele, J., et al. (2011). MicroRNA expression in induced sputum of smokers and patients with chronic obstructive pulmonary disease. AMERICAN JOURNAL OF RESPIRATORY AND CRITICAL CARE MEDICINE, 183(7), 898–906.
    Rationale: Chronic obstructive pulmonary disease (COPD) is characterized by progressive inflammation in the airways and lungs combined with disturbed homeostatic functions of pulmonary cells. MicroRNAs (miRNAs) have the ability to regulate these processes by interfering with gene transcription and translation. Objectives: We aimed to identify miRNA expression in induced sputum and examined whether the expression of miRNAs differed between patients with COPD and subjects without airflow limitation. Methods: Expression of 627 miRNAs was evaluated in induced sputum supernatant of 32 subjects by stem loop reverse transcription-quantitative polymerase chain reaction. Differentially expressed miRNAs were validated in an independent replication cohort of 41 subjects. Enrichment of miRNA target genes was identified by in silico analysis. Protein expression of target genes was determined by ELISA. Measurements and Main Results: Thirty-four miRNAs were differentially expressed between never-smokers and current smokers without airflow limitation in the screening cohort. Eight miRNAs were expressed at a significantly lower level in current-smoking patients with COPD compared with never-smokers without airflow limitation. Reduced expression of let-7c and miR-125b in patients with COPD compared with healthy subjects was confirmed in the validation cohort. Target genes of let-7c were significantly enriched in the sputum of patients with severe COPD. The concentration of tumor necrosis factor receptor type II (TNFR-II, implicated in COPD pathogenesis and a predicted target gene of let-7c) was inversely correlated with the sputum levels of let-7c. Conclusions: let-7c is significantly reduced in the sputum of currently smoking patients with COPD and is associated with increased expression of TNFR-II.
  447. Duplessis, S., Cuomo, C. A., Lin, Y.-C., Aerts, A., Tisserant, E., Veneault-Fourrey, C., Joly, D. L., et al. (2011). Obligate biotrophy features unraveled by the genomic analysis of rust fungi. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 108(22), 9166–9171.
    Rust fungi are some of the most devastating pathogens of crop plants. They are obligate biotrophs, which extract nutrients only from living plant tissues and cannot grow apart from their hosts. Their lifestyle has slowed the dissection of molecular mechanisms underlying host invasion and avoidance or suppression of plant innate immunity. We sequenced the 101-Mb genome of Melampsora larici-populina, the causal agent of poplar leaf rust, and the 89-Mb genome of Puccinia graminis f. sp. tritici, the causal agent of wheat and barley stem rust. We then compared the 16,399 predicted proteins of M. larici-populina with the 17,773 predicted proteins of P. graminis f. sp. tritici. Genomic features related to their obligate biotrophic lifestyle include expanded lineage-specific gene families, a large repertoire of effector-like small secreted proteins, impaired nitrogen and sulfur assimilation pathways, and expanded families of amino acid and oligopeptide membrane transporters. The dramatic up-regulation of transcripts coding for small secreted proteins, secreted hydrolytic enzymes, and transporters in planta suggests that they play a role in host infection and nutrient acquisition. Some of these genomic hallmarks are mirrored in the genomes of other microbial eukaryotes that have independently evolved to infect plants, indicating convergent adaptation to a biotrophic existence inside plant cells.
  448. Audenaert, P., Van Parys, T., Brondel, F., Pickavet, M., Demeester, P., Van de Peer, Y., & Michoel, T. (2011). CyClus3D: a Cytoscape plugin for clustering network motifs in integrated networks. BIOINFORMATICS, 27(11), 1587–1588.
    Network motifs in integrated molecular networks represent functional relationships between distinct data types. They aggregate to form dense topological structures corresponding to functional modules which cannot be detected by traditional graph clustering algorithms. We developed CyClus3D, a Cytoscape plugin for clustering composite three-node network motifs using a 3D spectral clustering algorithm.
  449. Bauters, Kim, Schockaert, S., Vermeir, D., & De Cock, M. (2011). Communicating ASP and the polynomial hierarchy. In J. Delgrande & W. Faber (Eds.), Lecture Notes in Artificial Intelligence (Vol. 6645, pp. 67–79). Presented at the 11th International conference on Logic Programming and Nonmonotonic Reasoning (LPNMR 2011), Berlin, Germany: Springer.
    Communicating answer set programming is a framework to represent and reason about the combined knowledge of multiple agents using the idea of stable models. The semantics and expressiveness of this framework crucially depends on the nature of the communication mechanism that is adopted. The communication mechanism we introduce in this paper allows us to focus on a sequence of programs, where each program in the sequence may successively eliminate some of the remaining models. The underlying intuition is that of leaders and followers: each agent’s decisions are limited by what its leaders have previously decided. We show that extending answer set programs in this way allows us to capture the entire polynomial hierarchy.
  450. Afanasyeva, E., Mestdagh, P., Kumps, C., Vandesompele, J., Ehemann, V., Theissen, J., Zapatka, M., et al. (2011). MicroRNA miR-885-5p targets CDK2 and MCM5, activates p53 and inhibits proliferation and survival. CELL DEATH AND DIFFERENTIATION, 18(6), 974–984.
    Several microRNA (miRNA) loci are found within genomic regions frequently deleted in primary neuroblastoma, including miR-885-5p at 3p25.3. In this study, we demonstrate that miR-885-5p is downregulated on loss of 3p25.3 region in neuroblastoma. Experimentally enforced miR-885-5p expression in neuroblastoma cell lines inhibits proliferation triggering cell cycle arrest, senescence and/or apoptosis. miR-885-5p leads to the accumulation of p53 protein and activates the p53 pathway, resulting in upregulation of p53 targets. Enforced miR-885-5p expression consistently leads to downregulation of cyclin-dependent kinase (CDK2) and mini-chromosome maintenance protein (MCM5). Both genes are targeted by miR-885-5p via predicted binding sites within the 3'-untranslated regions (UTRs) of CDK2 and MCM5. Transcript profiling after miR-885-5p introduction in neuroblastoma cells reveals alterations in expression of multiple genes, including several p53 target genes and a number of factors involved in p53 pathway activity. Taken together, these data provide evidence that miR-885-5p has a tumor suppressive role in neuroblastoma interfering with cell cycle progression and cell survival.
  451. De Preter, K., Mestdagh, P., Vermeulen, J., Zeka, F., Naranjo, A., Bray, I., Castel, V., et al. (2011). miRNA Expression profiling enables risk stratification in archived and fresh neuroblastoma tumor samples. CLINICAL CANCER RESEARCH, 17(24), 7684–7692.
    Purpose: More accurate assessment of prognosis is important to further improve the choice of risk-related therapy in neuroblastoma (NB) patients. In this study, we aimed to establish and validate a prognostic miRNA signature for children with NB and tested it in both fresh frozen and archived formalin-fixed paraffin-embedded (FFPE) samples. Experimental Design: Four hundred-thirty human mature miRNAs were profiled in two patient subgroups with maximally divergent clinical courses. Univariate logistic regression analysis was used to select miRNAs correlating with NB patient survival. A 25-miRNA gene signature was built using 51 training samples, tested on 179 test samples, and validated on an independent set of 304 fresh frozen tumor samples and 75 archived FFPE samples. Results: The 25-miRNA signature significantly discriminates the test patients with respect to progression-free and overall survival (P < 0.0001), both in the overall population and in the cohort of high-risk patients. Multivariate analysis indicates that the miRNA signature is an independent predictor of patient survival after controlling for current risk factors. The results were confirmed in an external validation set. In contrast to a previously published mRNA classifier, the 25-miRNA signature was found to be predictive for patient survival in a set of 75 FFPE neuroblastoma samples. Conclusions: In this study, we present the largest NB miRNA expression study so far, including more than 500 NB patients. We established and validated a robust miRNA classifier, able to identify a cohort of high-risk NB patients at greater risk for adverse outcome using both fresh frozen and archived material.
  452. Ban, J., Jug, G., Mestdagh, P., Schwentner, R., Kauer, M., Aryee, D., Schaefer, K.-L., et al. (2011). Hsa-mir-145 is the top EWS-FLI1-repressed microRNA involved in a positive feedback loop in Ewing’s sarcoma. ONCOGENE, 30(18), 2173–2180.
    EWS-FLI1 is a chromosome translocation-derived chimeric transcription factor that has a central and rate-limiting role in the pathogenesis of Ewing's sarcoma. Although the EWS-FLI1 transcriptomic signature has been extensively characterized on the mRNA level, information on its impact on non-coding RNA expression is lacking. We have performed a genome-wide analysis of microRNAs affected by RNAi-mediated silencing of EWS-FLI1 in Ewing's sarcoma cell lines, and differentially expressed between primary Ewing's sarcoma and mesenchymal progenitor cells. Here, we report on the identification of hsa-mir-145 as the top EWS-FLI1-repressed microRNA. Upon knockdown of EWS-FLI1, hsa-mir-145 expression dramatically increases in all Ewing's sarcoma cell lines tested. Vice versa, ectopic expression of the microRNA in Ewing's sarcoma cell lines strongly reduced EWS-FLI1 protein, whereas transfection of an anti-mir to hsa-mir-145 increased the EWS-FLI1 levels. Reporter gene assays revealed that this modulation of EWS-FLI1 protein was mediated by the microRNA targeting the FLI1 3'-untranslated region. Mutual regulations of EWS-FLI1 and hsa-mir-145 were mirrored by an inverse correlation between their expression levels in four of the Ewing's sarcoma cell lines tested. Consistent with the role of EWS-FLI1 in Ewing's sarcoma growth regulation, forced hsa-mir-145 expression halted Ewing's sarcoma cell line growth. These results identify feedback regulation between EWS-FLI1 and hsa-mir-145 as an important component of the EWS-FLI1-mediated Ewing's sarcomagenesis that may open a new avenue to future microRNA-mediated therapy of this devastating malignant disease.
  453. Hoffmann, T. J., Vansteelandt, S., Lange, C., Silverman, E. K., DeMeo, D. L., & Laird, N. M. (2011). Combining disease models to test for gene-environment interaction in nuclear families. BIOMETRICS, 67(4), 1260–1270.
    It is useful to have robust gene-environment interaction tests that can utilize a variety of family structures in an efficient way. This article focuses on tests for gene-environment interaction in the presence of main genetic and environmental effects. The objective is to develop powerful tests that can combine trio data with parental genotypes and discordant sibships when parents' genotypes are missing. We first make a modest improvement on a method for discordant sibs (discordant on phenotype), but the approach does not allow one to use families when all offspring are affected, e.g., trios. We then make a modest improvement on a Mendelian transmission-based approach that is inefficient when discordant sibs are available, but can be applied to any nuclear family. Finally, we propose a hybrid approach that utilizes the most efficient method for a specific family type, then combines over families. We utilize this hybrid approach to analyze a chronic obstructive pulmonary disorder dataset to test for gene-environment interaction in the Serpine2 gene with smoking. The methods are freely available in the R package fbati.
  454. De Neve, J., Thas, O., Clement, L., & Ottoy, J.-P. (2011). Probabilistic index models. In P. Canas Rodrigues & M. de Carvalho (Eds.), Proceedings of the 17th European Young Statisticians Meeting (pp. 73–77). Presented at the 17th European Young Statisticians Meeting, Lisbon, Portugal: Universidade Nova de Lisboa. Faculdade de Ciências e Tecnologia.
  455. Blondé, W., Mironov, V., Venkatesan, A., Antezana San Roman, E. Z., De Baets, B., & Kuiper, M. (2011). Reasoning with bio-ontologies: using relational closure rules to enable practical querying. BIOINFORMATICS, 27(11), 1562–1568.
    Motivation: Ontologies have become indispensable in the Life Sciences for managing large amounts of knowledge. The use of logics in ontologies ranges from sound modelling to practical querying of that knowledge, thus adding a considerable value. We conceive reasoning on bio-ontologies as a semi-automated process in three steps: (i) defining a logic-based representation language; (ii) building a consistent ontology using that language; and (iii) exploiting the ontology through querying. Results: Here, we report on how we have implemented this approach to reasoning on the OBO Foundry ontologies within BioGateway, a biological Resource Description Framework knowledge base. By separating the three steps in a manual curation effort on Metarel, a vocabulary that specifies relation semantics, we were able to apply reasoning on a large scale. Starting from an initial 401 million triples, we inferred about 158 million knowledge statements that allow for a myriad of prospective queries, potentially leading to new hypotheses about for instance gene products, processes, interactions or diseases.
  456. Mironov, V., Antezana San Roman, E. Z., Egaña, M., Blondé, W., De Baets, B., Kuiper, M., & Stevens, R. (2011). Flexibility and utility of the Cell Cycle Ontology. APPLIED ONTOLOGY, 6(3), 247–261.
    The Cell Cycle Ontology (CCO) has the aim to provide a 'one stop shop' for scientists interested in the biology of the cell cycle that would like to ask questions from a molecular and/or systems perspective: what are the genes, proteins, and so on involved in the regulation of cell division? How do they interact to produce the effects observed in the regulation of the cell cycle? To answer these questions, the CCO must integrate a large amount of knowledge from diverse sources; the irregularity and incompleteness of this information suggests an ontology can act as the means of this integration. The volatility and continued expansion of biological knowledge means the content and modelling of the CCO will have to be frequently changed and updated. The CCO is generated from the input data automatically once every two months. This makes it easy to change the representation to enable certain queries; incorporate new knowledge; and consistently apply design patterns across the CCO. The automatic process also allows the CCO to be delivered in a variety of representations that suit the needs of various CCO customers and the abilities of existing toolsets. In this paper we present the CCO and its characteristics of utility and flexibility, that, from our perspective, make it a beautiful ontology.
  457. Coyne, R. S., Hannick, L., Shanmugam, D., Hostetler, J. B., Brami, D., Joardar, V. S., Johnson, J., et al. (2011). Comparative genomics of the pathogenic ciliate Ichthyophthirius multifiliis, its free-living relatives and a host species provide insights into adoption of a parasitic lifestyle and prospects for disease control. GENOME BIOLOGY, 12(10).
    BACKGROUND: Ichthyophthirius multifiliis, commonly known as Ich, is a highly pathogenic ciliate responsible for 'white spot', a disease causing significant economic losses to the global aquaculture industry. Options for disease control are extremely limited, and Ich's obligate parasitic lifestyle makes experimental studies challenging. Unlike most well-studied protozoan parasites, Ich belongs to a phylum composed primarily of free-living members. Indeed, it is closely related to the model organism Tetrahymena thermophila. Genomic studies represent a promising strategy to reduce the impact of this disease and to understand the evolutionary transition to parasitism. RESULTS: We report the sequencing, assembly and annotation of the Ich macronuclear genome. Compared with its free-living relative T. thermophila, the Ich genome is reduced approximately two-fold in length and gene density and three-fold in gene content. We analyzed in detail several gene classes with diverse functions in behavior, cellular function and host immunogenicity, including protein kinases, membrane transporters, proteases, surface antigens and cytoskeletal components and regulators. We also mapped by orthology Ich's metabolic pathways in comparison with other ciliates and a potential host organism, the zebrafish Danio rerio. CONCLUSIONS: Knowledge of the complete protein-coding and metabolic potential of Ich opens avenues for rational testing of therapeutic drugs that target functions essential to this parasite but not to its fish hosts. Also, a catalog of surface protein-encoding genes will facilitate development of more effective vaccines. The potential to use T. thermophila as a surrogate model offers promise toward controlling 'white spot' disease and understanding the adaptation to a parasitic lifestyle.
  458. Janssen, J., Schockaert, S., Vermeir, D., & De Cock, M. (2011). Aggregated fuzzy answer set programming. ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 63(2), 103–147.
    Fuzzy Answer Set programming (FASP) is an extension of answer set programming (ASP), based on fuzzy logic. It allows to encode continuous optimization problems in the same concise manner as ASP allows to model combinatorial problems. As a result of its inherent continuity, rules in FASP may be satisfied or violated to certain degrees. Rather than insisting that all rules are fully satisfied, we may only require that they are satisfied partially, to the best extent possible. However, most approaches that feature partial rule satisfaction limit themselves to attaching predefined weights to rules, which is not sufficiently flexible for most real-life applications. In this paper, we develop an alternative, based on aggregator functions that specify which (combination of) rules are most important to satisfy. We extend upon previous work by allowing aggregator expressions to define partially ordered preferences, and by the use of a fixpoint semantics.
  459. Yao, Yao, Baele, G., & Van de Peer, Y. (2011). A bio-inspired agent-based system for controlling robot behaviour. 2011 IEEE symposium on intelligent agent (IA). Presented at the 2011 IEEE Symposium on Intelligent Agent (IA), New York, NY, USA: IEEE.
    In this paper, we present an agent-based system to control a single robot’s behaviour. We present an artificial genome structure, based on gene regulatory networks, in which several regions can be distinguished such as promoter regions, indicator genes, transcription factor binding sites, regulatory genes and expressed genes. We use agent-based modeling (ABM) to simulate a bio-inspired system based on the artificial genome, with the ultimate goal of providing phenotypic information for a simulated robot. We show that the presence of a feedback loop in the agent based system, along with the corresponding agent replacements, is essential to allow the robot to perform its tasks.