|
Exercise - R
Download the file R_Data.zip from http://microarrays.btk.fi and extract it into a new folder called ‘Illumina’ on the desktop. Start a new R session by double clicking on the R icon. Change the working directory to the newly created Illumina folder.
The following code loads the beadarray package, reads the targets file and imports the data into R:
> library(beadarray) # loads the package
> BStargets <- readBeadSummaryTargets(“BStargets.txt”) # reads the targets file into the object BStargets
> BSData <- readBeadSummaryData(BStargets) # reads the .csv files specified in the targets file
Now we have a look at the BSData object:
BSData is an object of type ExpressionSetIlluminawhich is an extension of the ExpressionSet class developed by the Biocore team used as a container for high-throughput assays. The data from the the raw data file has been written to the assayData slot of the object, whereas the phenoData slot contains information from sample sheet and the QC slot contains the quality control information. For consistency with the definition of other ExpressionSet objects, we now refer to the expression values as the exprs matrix which can be accessed using exprs and subset in the usual manner. The BeadStDevmatrix can be accessed using se.exprs. The rows of exprs are named according to the row names of the original raw_data file.
> BSData # calls the function show(BSData)
Explore the object by using the functions exprs, se.exprs, pData, NoBeads and str. Don’t forget to use subscripts!
Boxplots of expression values are useful for quality control. The code below draws boxplots of the log2 intensities and of the average number of beads of each array in the experiment.
> par(mfrow = c(1, 2))
> boxplot(log2(exprs(BSData)[1:1000, ]), las = 2)
> boxplot(NoBeads(BSData)[1:1000, ], las = 2)
In the expression boxplots we notice that there are differences in expression level across a chip and between chips. Therefore we might want to normalise the arrays in the experiment comparable. We also see the 3rd array has significantly different intensity. The sample on this array is replicated three times on the first chip, so comparing the MA and XY plots for the replicates of this sample can be informative.
Particular genes of interest may be highlighted on the MA and XY plots by using the genesToLabel argument which should match up with the row names in BSData. The labelCol argument can be used to specify a colour for each gene. For simplicity sake we simply highlight the first ten genes in the expression matrix, a possible application might be to highlight control genes on the plot or particular genes of interest.
> g = rownames(exprs(BSData))[1:10]
> g
[1] "GI_10047089-S" "GI_10047091-S" "GI_10047093-S" "GI_10047099-S"
[5] "GI_10047103-S" "GI_10047105-S" "GI_10047121-S" "GI_10047123-S"
[9] "GI_10047133-A" "GI_10047133-I"
> cols = rainbow(start = 0, end = 5/6, n = 10)
> plotMAXY(exprs(BSData)[1:1000, ], arrays = 1:3,
+ genesToLabel = g, labelCols = cols)
From the obtained plot one can clearly see that the third array differs significantly from the other replicates and requires normalisation.
For evaluation purposes I’d like you to use a function which is plotting MA and XY plots of replicates in a data set. It is possible to query for a certain character string specifying
the replicates such as “I” or “P” in the current data set. The code to plot e.g. all possible MA and scatter plots for the “I” samples is shown below.
> source("replicate.plot.R")
> source("replicatePlot.R")
> replicate.plot(exprs(BSData), find.pattern="I",
+ plot.type="scatter")
> replicate.plot(exprs(BSData), find.pattern="I",
+ plot.type="MA")
To produce a plot of the same samples as with the plotMAXY function use the following code:
> replicate.plot(exprs(BSData), samples=1:3,
+ plot.type="scatter")
> replicate.plot(exprs(BSData), samples=1:3, plot.type="MA")
It is possible to use the normalisation methods available in the affy package such as quantile, qspline or others. As quantile normalisation has shown to perform best for Illumina data this will be the method of choice in this course. However, from the biologists’ point of view, it is recommendable to “compare” different normalisation methods and choose one which gives the most meaningful results throughout the analysis.
As assayDataElementReplace returns a matrix and doesn’t transform the output back to a data.frame, we need to perform this transformation by ourselves. Additionally, the column and row names need to be restored as they are lost during normalisation. This is probably a bug and will be corrected in future beadarray versions.
> BSData.quantile <- assayDataElementReplace(BSData, "exprs",
+ normalize.quantiles(as.matrix(exprs(BSData))))
> BSData.quantile@assayData$exprs <- as.data.frame(BSData.quantile@assayData$exprs)
> names(BSData.quantile@assayData$exprs) <- colnames(exprs(BSData))
> row.names(BSData.quantile@assayData$exprs) <- row.names(exprs(BSData))
Please repeat boxplots, MA plots and scatter plots using the normalised data and note the changes which occurred after normalisation. You may use functions of your choice to produce the plots.
Another way of testing the efficacy of the normalisation method is to perform hierarchical clustering. Although this is not exactly a pre-processing method, it may be helpful to decide whether or not the normalisation was appropriate.
> d = dist(t(exprs(BSData)))
> plclust(hclust(d), labels = rownames(pData(BSData)))
> d.quantile = dist(t(exprs(BSData.quantile)))
> plclust(hclust(d.quantile),
+ labels = rownames(pData(BSData.quantile)))
Tip: To open a new graphics window (grapics device in R terminology) you may use the functions x11 or windows, either with default values or setting arguments such as width and height. windows works under MS Windows, only, while x11 works under Unix, MS Windows as well as MacOS X (needs X11 running). In R for Mac one may also call quartz. Please refer to the respective help files for more information.
To visualise the behaviour of replicate samples as sample groups PCA or Principal Components Analysis may be used. An appropriate plot is a 3D-scatter plot as provided by the packages made4, ade4 and scatterplot3d.
> library(made4)
> pca.2 <- dudi.pca(log2(exprs(BSData.quantile)))
> x <- c("IH", "IC", "IH", "MC", "MD", "MT", "IC", "IH", "IC",
+ "P", "P", "Norm", "MC", "MD", "MT", "P", "Norm", "P")
> pData(BSData.quantile)$Group <- x
> do3d(pca.1$co, angle=272,
+ classvec=pData(BSData.quantile)$Group)
> legend("bottomright", levels(pData(BSData.quantile)$Group),
+ col=getcol(nc=c(1:7), palette="colours1"), pch=16)
Try different angles by adjusting the angle parameter. You will notice that the positions of the spots in relation to each other differ considerably. Find the plot which shows the behaviour of the samples best in your opinion.
At the moment, beadarray doesn’t offer methods for detecting differential expression. In the meantime, users are able to use the lmFit and eBayes functions from limma on the matrix exprs(BSdata.quantile) with a log2 transformation applied.
The following code shows how to set up a design matrix for the example experiment combining the I, MC, MD, MT, P and Normal samples together. We then define contrasts comparing the I samples to the P samples and I to Normal and perform an empirical bayes shrinkage. In this particular experiment, the I and P samples are completely different so we would expect to see plenty of differentially expressed genes.
> design <- model.matrix(~ -1 + factor(c(1, 1, 1, 2, 3, 4, 1,
+ 1, 1, 5, 5, 6, 2, 3, 4, 5, 6, 5)))
> colnames(design) = c("I", "MC", "MD", "MT", "P", "Norm")
> fit <- lmFit(log2(exprs(BSData.quantile)), design)
> cont.matrix <- makeContrasts(IvsP=I - P, IvsNorm=I – Norm,
+ PvsNorm=P - Norm, levels=design)
> fit2 <- contrasts.fit(fit, cont.matrix)
> ebFit <- eBayes(fit2)
> results <- topTable(ebFit, number=40, sort.by="B",
+ resort.by="M")
At this point the primary analsis, in this form e.g. provided by the Micorarray Data Analysis Team at Turku Centre for Biotechnology, is done. The only part missing is of course annotation information. Scope and kind of these information are subjective, however, the
usual data provided are gene names or symbols, gene description and several gene identifiers in addition to the platform specific IDs. These information are included with the microarray chip package when purchased and are available in form of a text file which can be joined with the results table.
To offer something somewhat more interesting in this course we are going to use the biomaRt package to obtain annotation data. The advantage of this procedure is that the functions included with biomaRt are able to perform “online” database queries and fetch most recent information.
> library(biomaRt)
> ensembl <- useMart("ensembl",
+ dataset="hsapiens_gene_ensembl")
> illuids <- results$ID
> BM <- getBM(attributes=c(“illumina_v1", "entrezgene", "go",
+ "go_description"), filters=“illumina_v1", values=illuids,
+ mart=ensembl)
> BM[1:20, ]
illumina_v1 entrezgene go go_description
1 GI_13027795-S 4320 GO:0004249 stromelysin 3 activity
2 GI_13027795-S 4320 GO:0005509 calcium ion binding
3 GI_13027795-S 4320 GO:0005578 extracellular matrix (sensu Metazoa)
4 GI_13027795-S 4320 GO:0006508 proteolysis
5 GI_13027795-S 4320 GO:0007275 development
6 GI_13027795-S 4320 GO:0008270 zinc ion binding
7 GI_13027795-S 4320 GO:0009653 morphogenesis
8 GI_13027795-S 4320 GO:0030574 collagen catabolism
9 GI_14456711-S 3039 GO:0005344 oxygen transporter activity
10 GI_14456711-S 3039 GO:0005506 iron ion binding
11 GI_14456711-S 3039 GO:0005515 protein binding
12 GI_14456711-S 3039 GO:0005833 hemoglobin complex
13 GI_14456711-S 3039 GO:0006810 transport
14 GI_14456711-S 3039 GO:0015671 oxygen transport
15 GI_14456711-S 3039 GO:0019825 oxygen binding
16 GI_14456711-S 3039 GO:0020037 heme binding
17 GI_14456711-S 3039 GO:0046872 metal ion binding
18 GI_14456711-S 3040 GO:0005344 oxygen transporter activity
19 GI_14456711-S 3040 GO:0005506 iron ion binding
20 GI_14456711-S 3040 GO:0005515 protein binding
Result: Contact Jitendra Narayan
|
|