CummeRbund: Visualization and Exploration of Cuﬄinks
4 Reading cuﬀdiﬀ output
cummeRbund was designed to process the multi-ﬁle output format for a ’cuﬀdiﬀ’ diﬀerential expression analysis. In this type of analysis, a user will use a reference .gtf ﬁle (either known annotation or a .gtf ﬁle created from a cuﬄinks assembly or merge of assemblies) and quantitate the expression values and diﬀerential regulation of the annotation(s) in the .gtf ﬁle across two or more SAM/BAM ﬁles. By design, cuﬀdiﬀ produces a number of output ﬁles that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former allows one to see changes in splicing, and the latter lets one see changes in relative promoter use within a gene. Note:Early versions of Cuﬀdiﬀ required that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. This is no longer the case with >=v1.1.1 of cummeRbund, however we still recommend the use of both the following attributes in your GTF ﬁle to enable all downstream features of cummeRbund.
These attributes are:
- tss id: The ID of this transcript’s inferred start site. Determines which primary transcript this processed
transcript is believed to come from. Cuﬀcompare appends this attribute to every transcript reported in the
- p id The ID of the coding sequence this transcript contains. This attribute is attached by Cuﬀcompare to
the .combined.gtf records only when it is run with a reference annotation that include CDS records. Further,
diﬀerential CDS analysis is only performed when all isoforms of a gene have p id attributes, because neither
Cuﬄinks nor Cuﬀcompare attempt to assign an open reading frame to transcripts. cuﬀdiﬀ calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. The results are output in FPKM tracking ﬁles, the structure of which can be found in the cuﬄinks manual.
There are four FPKM tracking ﬁles:
- isoforms.fpkm tracking Transcript FPKMs
- genes.fpkm tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene id
- cds.fpkm tracking Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p id,
independent of tss id
- tss groups.fpkm tracking Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss id
cuﬀdiﬀ also performs diﬀerential expression tests between supplied conditions. This tab delimited ﬁle lists the results of diﬀerential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For detailed ﬁle structure see cuﬄinks manual.
Four .diﬀ ﬁles are created:
- isoform exp.diﬀ Transcript diﬀerential FPKM.
- gene exp.diﬀ Gene diﬀerential FPKM. Tests diﬀerence sin the summed FPKM of transcripts sharing each gene id
- tss group exp.diﬀ Primary transcript diﬀerential FPKM. Tests diﬀerences in the summed FPKM of transcripts sharing each tss id
- cds exp.diﬀ Coding sequence diﬀerential FPKM. Tests diﬀerences in the summed FPKM of transcripts sharing each p id independent of tss id
In addition, cuﬀdiﬀ also performs diﬀerential splicing, CDS usage, and promoter usage tests for each gene across conditions:
- splicing.diﬀ Diﬀerential splicing tests.
- CDS.diﬀ Diﬀerential coding output.
- promoters.diﬀ Diﬀerential promoter use.
All of these output ﬁles are related to each other through their various tracking ids, but parsing through individual ﬁles to query for important result information requires both a good deal of patience and a strong grasp of commandline text manipulation. Enter cummeRbund, an R solution to aggregate, organize, and help visualize this multilayered dataset. One of the principle beneﬁts of using cummeRbund is that data are stored in a SQLite database. This allows for out of-memory analysis of data, quick retrieval, and only a one-time cost to setup the tables. By default, cummeRbund assumes that all output ﬁles from cuﬀdiﬀ are in the current working directory. To read these ﬁles, populate the ’cuﬀData.db’ database backend, and return the CuﬀSet pointer object, you can do the following.
library(cummeRbund) cuff<-readCufflinks() cuff
disp<-dispersionPlot(genes(cuff)) disp genes.scv<-fpkmSCVPlot(genes(cuff)) isoforms.scv<-fpkmSCVPlot(isoforms(cuff)) #To assess the distributions of FPKM scores across samples, you can use the csDensity plot (Figure 1). dens<-csDensity(genes(cuff)) dens densRep<-csDensity(genes(cuff),replicates=T) densRep #Boxplots can be visualized using the csBoxplot method (Figure 2). b<-csBoxplot(genes(cuff)) b brep<-csBoxplot(genes(cuff),replicates=T) brep #A matrix of pairwise scatterplots can be drawn using the csScatterMatrix() method. s<-csScatterMatrix(genes(cuff)) s<-csScatter(genes(cuff),"hESC","Fibroblasts",smooth=T) s #树状图 dend<-csDendro(genes(cuff)) dend.rep<-csDendro(genes(cuff),replicates=T) #MvsA plots can be useful to determine any systematic bias that may be present between #conditions. The CuﬀData method MAplot() can be used to examine these intensity vs fold-#change plots. You must specify the sample names to use for the pairwise comparison with x #and y: m<-MAplot(genes(cuff),"hESC","Fibroblasts") m mCount<-MAplot(genes(cuff),"hESC","Fibroblasts",useCount=T) mCount #Volcano plots are also available for the CuﬀData objects. v<-csVolcanoMatrix(genes(cuff)) v #For individual pairwise comparisons, you must specify the comparisons by sample name. v<-csVolcano(genes(cuff),"hESC","Fibroblasts")
6 Accessing Data
Cuﬀdiﬀ run information
Run-level information such as run parameters, and sample information can be accessed from a CuﬀSet object by using the runInfo and replicates methods:
runInfo(cuff) replicates(cuff) #Features/Annotation #Feature-level information can be accessed directly from a CuﬀData object using the fpkm, #repFpkm, count, diﬀData, gene.features<-annotation(genes(cuff)) head(gene.features) gene.fpkm<-fpkm(genes(cuff)) head(gene.fpkm) gene.repFpkm<-repFpkm(genes(cuff)) head(gene.repFpkm) gene.counts<-count(genes(cuff)) head(gene.counts) isoform.fpkm<-fpkm(isoforms(cuff)) head(isoform.fpkm) gene.diff<-diffData(genes(cuff)) head(gene.diff) #Condition and feature names #Vectors of sample names and feature names are available by using the samples and featureNames #methods: sample.names<-samples(genes(cuff)) head(sample.names) gene.featurenames<-featureNames(genes(cuff)) head(gene.featurenames) #Convenience functions #To facilitate Bioconductor-like operations, an ’FPKM-matrix’ can be returned easily using the #fpkmMatrix method: gene.matrix<-fpkmMatrix(genes(cuff)) head(gene.matrix) gene.rep.matrix<-repFpkmMatrix(genes(cuff)) head(gene.rep.matrix) gene.count.matrix<-countMatrix(genes(cuff)) head(gene.count.matrix)
7 Creating Gene Sets
Gene Sets (stored in a CuﬀGeneSet object) can be created using the getGenes method on a CuﬀSet object. You must ﬁrst create a vector of ’gene id’ or ’gene short name’ values to identify the genes you wish to select:
data(sampleData) myGeneIds<-sampleIDs myGeneIds myGenes<-getGenes(cuff,myGeneIds) myGenes #The same fpkm, repFpkm, count, annotation, diﬀData, samples, and featureNames methods are #available for instances of the CuﬀGeneSet class, but additional accessor methods are available #for the promoters, relCDS, and splicing slot data as well. #FPKM values for genes in gene set head(fpkm(myGenes)) head(fpkm(isoforms(myGenes))) head(repFpkm(TSS(myGenes))) #As of cummeRbund v2.0 CuﬀGeneSet classes can be created from any type of identiﬁer (’gene #id’,’isoform id’,’TSS group id’,or ’CDS id’). When you pass a list of identiﬁers that are not #gene id to getGenes(), the function attempts to lookup the parent gene id for each feature and #returns all relevant information for the given genes and all of their subfeatures (not just the #sub-features passed to getGenes()). If you are interested in just retrieving information for a #given set of features, please use the new getFeatures() method described later. myGeneset.pluri<-getGenes(cuff,myGeneIds,sampleIdList=c("hESC","iPS")) myGeneset.pluri
7.1 Geneset level plots
There are several plotting functions available for gene-set-level visualization:
The csHeatmap() function is a plotting wrapper that takes as input either a CuﬀGeneSet or a CuﬀFeatureSet object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. The ’cluster’ argument can be used to re-order either ’row’, ’column’, or ’both’ dimensions of this matrix. By default, the Jensen-Shannon distance is used as the clustering metric, however, any function that produces a dist object can be passed to the ’cluster’ argument as well.
h<-csHeatmap(myGenes,cluster= ’ both ’ ) h h.rep<-csHeatmap(myGenes,cluster= ’ both ’ ,replicates=T) h.rep b<-expressionBarplot(myGenes) b #The csScatter() method can be used to produce scatter plot comparisons between any two #conditions. s<-csScatter(myGenes,"Fibroblasts","hESC",smooth=T) s #The volcano plot is a useful visualization to compare fold change between any two conditions #and signiﬁcance (-log P-values). v<-csVolcano(myGenes,"Fibroblasts","hESC") v #Similar plots can be made for all sub-level features of a CuﬀGeneSet class by specifying which #slot you would like to plot (eg. isoforms(myGenes),TSS(myGenes),CDS(myGenes)). ih<-csHeatmap(isoforms(myGenes),cluster= ’ both ’ ,labRow=F) ih th<-csHeatmap(TSS(myGenes),cluster= ’ both ’ ,labRow=F) th #Dendrograms can provide insight into the relationships between conditions for various genesets #(e.g. signiﬁcant genes used to draw relationships between conditions). As of v1.1.3 the method #csDendro() can be used to plot a dendrogram based on Jensen-Shannon distances between #conditions for a given CuﬀFeatureSet or CuﬀGeneSet. den<-csDendro(myGenes)
8 Individual Genes
An individual CuﬀGene object can be created by using the getGene() function for a given ’gene id’ or ’gene short name’. As of cummeRbund ≥ v2.0 you can also use isoform id, tss group id, or cds id values to retrieve the corresponding parent gene object.
myGeneId<-"PINK1" myGene<-getGene(cuff,myGeneId) myGene head(fpkm(myGene)) head(fpkm(isoforms(myGene)))
8.1 Gene-level plots
gl<-expressionPlot(myGene) gl gl.rep<-expressionPlot(myGene,replicates=TRUE) gl.rep gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T) gl.iso.rep gl.cds.rep<-expressionPlot(CDS(myGene),replicates=T) gl.cds.rep gb<-expressionBarplot(myGene) gb gb.rep<-expressionBarplot(myGene,replicates=T) gb.rep igb<-expressionBarplot(isoforms(myGene),replicates=T) igb gp<-csPie(myGene,level="isoforms") gp
8.1.1 Gene Feature plots
If you included both the genome build and gtfFile in your call to readCuﬄinks() then you will be able to access some of the transcript-structure level features that are now being integrated into cummeRbund. For now, these features are extended only to the single gene, CuﬀGene objects.
head(features(myGene)) #The Gviz package can be used to display features in a ’track’-like format. In particular, the #GeneRegionTrack class creates a mechanism by which we can start to visualize transcript-level #structures in their genomic context. cummeRbund implements the makeGeneRegionTrack() method to #quickly create a GeneRegionTrack from the gene features. genetrack<-makeGeneRegionTrack(myGene) plotTracks(genetrack)
9 Data Exploration
The cummeRbund package is more than just a visualization tool as well. We are working to implement several diﬀerent means of data exploration from gene and condition clustering, ﬁnding features with similar expression proﬁles, as well as incorporating Gene Ontology analysis.
9.1 Overview of signiﬁcant features
The sigMatrix() function can provide you with a “quick–and–dirty” view of the number of signiﬁcant features of a particular type, and at a given alpha (0.05 by default). For example:
mySigMat<-sigMatrix(cuff,level= ’ genes ’ ,alpha=0.05)
9.2 Creating gene sets from signiﬁcantly regulated genes
One of the primary roles of a diﬀerential expression analysis is to conduct signiﬁcance tests on each feature (genes,isoforms, TSS, and CDS) for appropriate pairwise comparisons of conditions. The results of these tests (after multiple testing correction of course) can be used to determine which genes are diﬀerentially regulated. cummeRbund makes accessing the results of these signiﬁcance tests simple via getSig(). This function takes a CuﬀSet object and will scan at various feature levels (’genes’ by default) to produce a vector of feature IDs. By default getSig() outputs a vector of tracking IDs corresponding to all genes that reject the null hypothesis in any condition tested. The default feature type can be changed by adjusting the ’level’ argument to getSig(). In addition, a alpha value can be provided on which to ﬁlter the resulting list (the default is 0.05 to match the default of cuﬀdiﬀ).
mySigGeneIds<-getSig(cuff,alpha=0.05,level= ’ genes ’ ) head(mySigGeneIds) length(mySigGeneIds) #By default getSig() outputs a vector of tracking IDs corresponding to all genes that reject the #null hypothesis in any condition tested. The default feature type can be changed by adjusting #the ’level’ argument to getSig(). In addition, a alpha value can be provided on which to ﬁlter #the resulting list (the default is 0.05 to match the default of cuﬀdiﬀ). Signiﬁcance results #for speciﬁc pairwise comparisons can be retrieved as well by specifying the two conditions as #’x’ and ’y’. In this case, p-values are adjusted to reduce the impact of multiple-testing #correction when only one set of tests is being conducted. hESC_vs_iPS.sigIsoformIds<-getSig(cuff,x= ’ hESC ’ ,y= ’ iPS ’ ,alpha=0.05,level= ’ isoforms ’ ) head(hESC_vs_iPS.sigIsoformIds) length(hESC_vs_iPS.sigIsoformIds) #The values returned for each level of this list can be used as an argument to getGenes, to #create a CuﬀGeneSet object of signiﬁcantly regulated genes (or features). mySigGenes<-getGenes(cuff,mySigGeneIds) mySigGenes #Alternatively, you can use the getSigTable() method to return a full test-table of ’signiﬁcant #features’ x ’pairwise tests’ for all comparisons. Only features in which the null hypothesis #can be rejected in at least one test are reported. mySigTable<-getSigTable(cuff,alpha=0.01,level= ’ genes ’ ) head(mySigTable,20)
9.3 Exploring the relationships between conditions
9.3.1 Distance matrix
Similarities between conditions and/or replicates can provide useful insight into the relationship between various groupings of conditions and can aid in identifying outlier replicates that do not behave as expected. cummeRbund provides the csDistHeat() method to visualize the pairwise similarities between conditions:
9.3.2 Dimensionality reduction
Dimensionality reduction is an informative approach for clustering and exploring the relationships between conditions. It can be useful for feature selection as well as identifying the sources of variability within your data. To this end, we have applied two diﬀerent dimensionality reduction strategies in cummeRbund: principal component analysis (PCA) and multi-dimensional scaling (MDS). We provide the two wrapper methods, PCAplot and MDSplot
genes.MDS<-MDSplot(genes(cuff)) genes.PCA.rep<-PCAplot(genes(cuff),"PC1","PC2",replicates=T) genes.MDS.rep<-MDSplot(genes(cuff),replicates=T) genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
K-means clustering is a useful tool that can be helpful in identifying clusters of genes with similar expression proﬁles. In fact, these proﬁles are learned from the data during the clustering. csCluster() uses the pam() method from the clustering package to perform the partitioning around medoids. In this case however, the distance metric used by default is the Jensen-Shannon distance instead of the default Euclidean distance. Prior to performing this particular partitioning, the user must choose the number of clusters (K) into which the expression proﬁles should be divided.
ic<-csCluster(myGenes,k=4) head(ic$cluster) icp<-csClusterPlot(ic) icp
In some cases, a researcher may be interested in identifying features that are ’condition-speciﬁc’. Or, more likely, producing an ordered list of genes based on their speciﬁcity for a given condition. We deﬁne a speciﬁcity score (S) as the following:
9.6 Finding similar genes
Another common question in large-scale gene expression analyses is ’How can I ﬁnd genes with similar expression proﬁles to gene x?’. We have implemented a method, ﬁndSimilar to allow you to identify a ﬁxed number of the most similar genes to a given gene of interest. For example, if you wanted to ﬁnd the 20 genes most similar to ”PINK1”, you could do the following:
By default, ﬁndSimilar will return a CuﬀGeneSet of similar genes matching your criteria. Recently a few additional features have been added as well to enhance this type of exploration:
- If ’returnGeneSet’ is set to FALSE, then ﬁndSimilar returns a data.frame of distance-ranked similar genes with distances. This is useful if you would like to see a rank-ordered list of similar genes.
- The ’distThresh’ argument allows you to pass a value [between 0-1] to be used as a distance threshold instead of an arbitrary ’n’ number of genes. setting distThresh=1.0 will return all genes ranked by their distance to your gene of interest.
You are also able to provide your own expression proﬁle in lieu of a ’gene id’. The vector provided must match the order and length of samples().
myProfile<-c(500,0,400) mySimilar2<-findSimilar(cuff,myProfile,n=10) mySimilar2.expression<-expressionPlot(mySimilar2,logMode=T,showErrorbars=F) ﬁndSimilar() also uses the Jensen-Shannon distance between the probability distributions of each gene across conditions to determine the similarity. We have found this to be a more robust way to determine distance between genes using the high dynamic range of FPKM data. Future versions may allow for other dissimilarity measures to be used instead.
- In appropriate plots, using the argument replicates=T will allow you to visualize replicate-level FPKM values either in lieu of or in addition to condition-level FPKMs.
- As of v1.1.3 we attempt to provide new visual cues in most plots that will indicate the quantiﬁcation status for a particular feature in each given condition. We have enabled this feature by default for most plots to suggest a measure of reliability for each feature in a particular condition. In most cases, this feature can be disabled by setting ’showStatus=FALSE’.
- CummeRbund will now work with the hidden ’–no-diﬀ’ argument for cuﬀdiﬀ. This will quantify features against .bam ﬁles but not do diﬀerential testing. This is useful when you want to aggregate very large numbers of conditions, and cannot aﬀord the time or space for the diﬀerential test results. (Not recommended unless you have a SPECIFIC need for this).
- All plotting functions return ggplot objects and the resulting objects can be manipulated/faceted/altered using standard ggplot2 methods.
- There are occasional DB connectivity issues that arise. Not entirely sure why yet. If necessary, just readCufflinks again and this should solve connectivity issues with a new RSQLite connection object. If connectivity continues to be a problem, try cuff<-readCufflinks(rebuild=T)
- I am still working on fully documenting each of the methods. There are a good number of arguments that exist, but might be hard to ﬁnd without looking at the reference manual.