The plant transcriptome—from integrating observations to models

B. Usadel, A. Fernie

Published 2013 in Frontiers in Plant Science

ABSTRACT

Transcriptomes, as assessed by either microarrays or next-generation sequencing, have produced an unprecedented flood of data on transcript identity and levels in plant systems. Microarray data have been used extensively over the last 15 years or so, and evaluation of the data thus produced has progressed well beyond early statistical quality evaluation and descriptive lists to a mature science in which gene networks and cascades provide mechanistic insight. The development of sensitive quantitative PCR for lowly expressed genes such as transcription factors has additionally allowed another layer of complexity to be accessed, and the modeling of transcription factor expression together with that of target genes has met with considerable success. More recently, data emanating from RNAseq studies have greatly improved the coverage of transcript profiling. That said, this technology has further complicated transcriptome analysis by making it possible to identify, among other things, differentially spliced transcripts. In this research topic we would like to provide an “on the fly” portrait of the use of either microarray or RNAseq based datasets in contemporary Plant Systems Biology. Given the relative simplicity of doing so, much information has been gleaned from microarray datasets by assuming guilt-by-association, that is, by inferring that genes with similar expression profiles share a function. The success of this approach is summarized in the articles of Provart (2012) and Tohge and Fernie (2012), as are recent studies that go beyond transcription and link in physiological and metabolic aspects. As in the legal process from which the approach takes its name, it is important to note that suspects obtained this way require a “fair trial,” since assuming “guilt” is fraught with dangers, as summarized in Usadel et al. (2009a).
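To make the guilt-by-association idea concrete, the following is a minimal sketch rather than the method of any of the cited articles. It assumes Pearson correlation as the similarity measure and an arbitrary threshold of 0.9; the gene names, the toy expression values, and the function names `pearson` and `coexpressed_candidates` are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_candidates(expression, bait, threshold=0.9):
    """Return (gene, r) pairs whose profile correlates with the bait gene.

    The threshold is an assumption chosen for illustration only.
    """
    bait_profile = expression[bait]
    hits = []
    for gene, profile in expression.items():
        if gene == bait:
            continue
        r = pearson(bait_profile, profile)
        if r >= threshold:
            hits.append((gene, r))
    # Strongest correlation first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: hypothetical genes measured across five conditions.
expression = {
    "known_pathway_gene": [1.0, 2.0, 3.0, 4.0, 5.0],
    "candidate_A": [2.1, 4.0, 6.2, 8.1, 9.9],  # rises with the bait
    "candidate_B": [5.0, 4.0, 3.0, 2.0, 1.0],  # anti-correlated
    "candidate_C": [3.0, 3.1, 2.9, 3.0, 3.2],  # essentially flat
}
hits = coexpressed_candidates(expression, "known_pathway_gene")
```

In this toy example only `candidate_A` passes the threshold. In keeping with the “fair trial” caveat, such a hit is a suspect to be tested with independent evidence, not a functional annotation.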
Thus, Tohge and Fernie extend the use of the co-expression approach for the annotation of assumed gene function and discuss bringing in further experimental “evidence” as provided by metabolomics, proteomics, or physiological measurements (Tohge et al., 2005; De Boldt et al., 2012). They then delve further into the subject by explaining how to make a more solid case by linking gene functions across multiple species (Mutwil et al., 2011; Obayashi et al., 2011). Provart (2012) also reviews novel aspects of visualized correlations but pays more attention to marrying these data with subcellular localization and tissue/organ-specific networks, such as those defined by SeedNet (Kohl et al., 2011), and to the overlay of such networks with those derived from protein-protein interaction studies (Geisler-Lee et al., 2007). Junker et al. (2012a) follow a similar direction, extending ideas put forward in their recent Trends in Biotechnology review (Junker et al., 2012b), here focusing on visual analysis of the transcriptome. They provide an overview of plant transcriptomics repositories and detail how these can serve as useful resources for visualization programs such as HIVE. They also detail how the color-coded output from such programs can be integrated with known biological networks, using analyses of floral homeotic gene expression patterns and seed expression profiles as exemplary case studies. They further discuss information visualization standards as suggested by Card et al. (1999) and the eFP browser (Winter et al., 2007). Friedel et al. (2012) and Grene et al. (2012) follow a similar approach, re-analysing data using both visualization and network techniques, with both groups interested in abiotic conditions. Whereas Friedel et al. use network approaches and functional categories to investigate stress responses, Grene et al. focus on winter hardening in spruce. Interestingly, Grene et al.
(2012) are able to show a reprogramming of cell wall and nucleotide sugar metabolism using MapMan (Usadel et al., 2009b) and GO terms. However, when it comes to data analysis of whole genome expression datasets, particularly those obtained from complex temporally and/or spatially resolved experiments, visualization helps in finding “the meaning within the noise.” Thus, currently the researcher typically zooms in on a particular subset of the data which excites their biological curiosity, often obtaining such data from public repositories such as Genevestigator (https://www.genevestigator.com/gv). But with this approach much information, and potentially knowledge, remains untapped. This leaves one wondering whether, aided by modern biostatistics and bioinformatics, one should not be able to do better. To improve this situation, Klie et al. (2012) present a computational solution in which a recent extension of the principal component analysis variants STATIS and dual-STATIS (Lavit et al., 1994; Abdi et al., 2012) is applied to study the time-resolved response of Arabidopsis thaliana to perturbations in the prevailing light and/or temperature conditions. This proof-of-concept study illustrates that these tools can clearly aid in dataset-wide analyses and, furthermore, that they can specify the extent to which either the transcript levels or the experimental treatments reflect these perturbations, thus providing biological insight across the entire datasets obtained. As is evident from the multitude of manuscripts dealing with microarray data, there is still much to be learned from these datasets. However, time moves on, and whilst it seems difficult to teach old dogs new omics tricks, RNAseq is slowly becoming more and more popular. Already machine learning techniques are trickling in to help separate noise from the data. Thus, Thieme et al. (2012) try to find the proverbial needle in the haystack by identifying Argonaute sorting signals for miRNAs.
Whilst mutual information did not indicate that any position other than the 5′ position dictates which of the 10 Argonaute proteins processes which miRNA, Thieme et al. address the problem of having only four possible 5′ bases for 10 different proteins by showing that other positions likely play a role as well. Such analyses assume, however, that one actually knows which transcripts to deal with. But one of the perceived beauties of RNAseq is that one could learn about the transcriptome on the fly whilst analysing the data, by assembling the reads into transcripts. This seems, however, an ambitious goal, and thus in their article Schliesky et al. (2012) address the question “RNAseq assembly: are we there yet?” They review plant applications of 454/Roche and Illumina sequencing, which in combination have, to date, already been used to assess the transcriptomes of over 50 plant species. Although they argue that these approaches have been useful in downstream applications such as proteomics (Lopez-Casado et al., 2012), and the same can be argued for their use to augment recent genome sequencing efforts (Tomato Genome Consortium, 2012), assemblies may well not accurately reflect the actual plant transcriptomes, especially if not checked well. To ameliorate the challenges of transcriptome assembly they provide a list of quality control parameters and the scripts necessary to produce them, most likely providing an invaluable resource for this burgeoning area of transcriptomics and bringing the old idea of genomeless genomics (Rudd, 2005) within the reach of even the smallest labs. Rose et al. (2012) then round up the uses of RNAseq by providing both insights into how RNAseq has already benefited the plant community and detailed examples where genomeless genomics was used. Extending beyond this, they show that RNAseq is also valuable in finding small non-coding RNAs, in the manner demonstrated in the Thieme et al. (2012) article.
In addition, they demonstrate how important RNAseq can be for bulk segregant analysis and thus for the identification of causal mutations. Alongside these illustrations they provide the wet bench biologist with comprehensive workflows on how the RNA should be processed for these varied applications. Finally, in his article Kliebenstein (2012) tries to answer the other burning question of RNA-seq: how deep does deep sequencing need to go to capture the majority of network or genomic information present in a variety of transcriptomics experiments? To address this question he applied Shannon entropy analysis to existing Arabidopsis transcriptomics data, namely a co-expression network, an expression QTL analysis, and a temporal analysis of the circadian clock. Intriguingly, he came to the conclusion that at least 80% of the information present in a transcriptomic study is likely obtainable by measuring only the top 10% of the transcripts within a sample. This rather surprising finding has important consequences for experimental design, particularly with regard to the scale and affordability of large-scale studies.
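As a rough illustration of what a Shannon entropy calculation on transcript abundances can look like, the sketch below computes the entropy of a toy abundance distribution and the share of that entropy contributed by the most abundant transcripts. This is an assumption-laden toy, not a reconstruction of the analysis in Kliebenstein (2012): the counts, the function names, and the choice to define "information captured" as the sum of per-transcript entropy terms are all hypothetical.

```python
from math import log2


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a list of transcript counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)


def top_fraction_entropy_share(counts, fraction=0.1):
    """Share of the total entropy contributed by the most abundant transcripts.

    'fraction' is the proportion of transcripts (ranked by abundance) to keep;
    at least one transcript is always kept.
    """
    total = sum(counts)
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    terms = [-(c / total) * log2(c / total) for c in ranked]
    return sum(terms[:keep]) / sum(terms)


# Toy, strongly skewed abundance distribution (hypothetical read counts).
counts = [1000, 500, 250, 100, 50, 20, 10, 5, 3, 2]
entropy_bits = shannon_entropy(counts)
share_top_10_percent = top_fraction_entropy_share(counts, fraction=0.1)
```

How much of the information the top-ranked transcripts carry depends heavily on how "information" is defined and on the structure of the dataset, so the numbers from a toy like this should not be compared with the 80% figure reported above.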
