inst/scripts/crlmmIllumina.Rnw
852cc5cf
 %\VignetteIndexEntry{crlmm Vignette - Illumina 370k chip}
 %\VignetteKeywords{genotype, crlmm, Illumina}
 %\VignettePackage{crlmm}
 
 \documentclass[12pt]{article}
 
 \usepackage{amsmath,pstricks}
 \usepackage[authoryear,round]{natbib}
 \usepackage{hyperref}
 \usepackage{Sweave}
 
 \textwidth=6.2in
 \textheight=8.5in
 %\parskip=.3cm
 \oddsidemargin=.1in
 \evensidemargin=.1in
 \headheight=-.3in
 
 \newcommand{\scscst}{\scriptscriptstyle}
 \newcommand{\scst}{\scriptstyle}
 
 
 \newcommand{\Rfunction}[1]{{\texttt{#1}}}
 \newcommand{\Robject}[1]{{\texttt{#1}}}
 \newcommand{\Rpackage}[1]{{\textit{#1}}}
 \newcommand{\Rmethod}[1]{{\texttt{#1}}}
 \newcommand{\Rfunarg}[1]{{\texttt{#1}}}
 \newcommand{\Rclass}[1]{{\textit{#1}}}
 
 \textwidth=6.2in
 
3be95c2b
 \bibliographystyle{plainnat}
 
852cc5cf
 \begin{document}
 %\setkeys{Gin}{width=0.55\textwidth}
 
 \title{Using \Rpackage{crlmm} to genotype data from Illumina's Infinium BeadChips}
 \author{Matt Ritchie}
 \maketitle
 
 \section{Getting started}
 
3be95c2b
 In this user guide we read in and genotype data from 40 HapMap samples
852cc5cf
 which have been analyzed using Illumina's 370k Duo BeadChips.
3be95c2b
 This data is available in the \Rpackage{hapmap370k} package.
 Additional chip-specific model parameters and basic SNP annotation
a11efc9c
 information used by CRLMM is stored in the \Rpackage{human370v1cCrlmm} package.
82b78441
 The required packages can be installed in the usual way using the \Rfunction{biocLite} function.
852cc5cf
 
d3668036
 <<echo=TRUE, results=hide, eval=FALSE>>=
a11efc9c
 source("http://www.bioconductor.org/biocLite.R")
82b78441
 biocLite(c("crlmm", "hapmap370k", "human370v1cCrlmm"))
3be95c2b
 @
852cc5cf
 
3be95c2b
 \section{Reading in data}
 The function \Rfunction{readIdatFiles} extracts the Red and Green intensities
852cc5cf
 from the binary {\tt idat} files output by Illumina's scanning device.
 The file {\tt samples370k.csv} contains information about each sample.
 
d3668036
 <<echo=FALSE, results=hide, eval=TRUE>>=
 options(width=70)
3be95c2b
 @
852cc5cf
 
a11efc9c
 <<read, results=hide, eval=TRUE>>=
852cc5cf
 library(Biobase)
 library(crlmm)
 library(hapmap370k)
 
 data.dir = system.file("idatFiles", package="hapmap370k")
 
 # Read in sample annotation info
 samples = read.csv(file.path(data.dir, "samples370k.csv"), as.is=TRUE)
 samples[1:5,]
3be95c2b
 @
852cc5cf
 
 <<read2, results=hide, cache=TRUE>>=
 # Read in .idats using sampleSheet information
3be95c2b
 RG = readIdatFiles(samples, path=data.dir,
d3668036
 arrayInfoColNames=list(barcode=NULL,position="SentrixPosition"),saveDate=TRUE)
852cc5cf
 @
 
3be95c2b
 Reading in this data takes approximately 100 seconds and peak memory usage
450a6d8a
 was 0.8 GB of RAM on our linux system.
82b78441
 If memory is limiting, load the \Rpackage{ff} package and run the same command.
 When this package is available, the objects are stored using disk rather then RAM.
3be95c2b
 The \Robject{RG} object is an \Rclass{NChannelSet} which stores the
 Red and Green intensities, the number of beads and standard errors for
 each bead-type.
d3a7556e
 The scanning date of each array is stored in \Robject{protocolData}.
852cc5cf
 
 <<explore>>=
 class(RG)
 dim(RG)
 slotNames(RG)
 channelNames(RG)
 exprs(channel(RG, "R"))[1:5,1:5]
 exprs(channel(RG, "G"))[1:5,1:5]
 pd = pData(RG)
 pd[1:5,]
 
4b098e36
 scandatetime = strptime(protocolData(RG)[["ScanDate"]], "%m/%d/%Y %H:%M:%S %p")
852cc5cf
 datescanned = substr(scandatetime, 1, 10)
 scanbatch =  factor(datescanned)
 levels(scanbatch) = 1:16
 scanbatch = as.numeric(scanbatch)
 @
 
3be95c2b
 If GenCall output is available instead of idat files, the function \Rfunction{readGenCallOutput} can be
 used to read in the data.
d3668036
 This function assumes the GenCall output is formatted to have samples listed one below the other,
3be95c2b
 and that the columns 'X Raw' and 'Y Raw' are available in the file.
d3668036
 The resulting \Robject{NChannelSet} from this function can be used as input to \Rfunction{crlmmIllumina} via the \Robject{XY} argument (instead of the usual \Rfunction{RG} argument used when the data has been read in from idat files).
 
 Plots of the summarised data can be easily generated to check for arrays with poor signal.
852cc5cf
 
 <<boxplots, fig=TRUE, width=8, height=8>>=
 par(mfrow=c(2,1), mai=c(0.4,0.4,0.4,0.1), oma=c(1,1,0,0))
3be95c2b
 boxplot(log2(exprs(channel(RG, "R"))), xlab="Array", ylab="", names=1:40,
d3668036
 main="Red channel",outline=FALSE,las=2)
3be95c2b
 boxplot(log2(exprs(channel(RG, "G"))), xlab="Array", ylab="", names=1:40,
d3668036
 main="Green channel",outline=FALSE,las=2)
852cc5cf
 mtext(expression(log[2](intensity)), side=2, outer=TRUE)
 mtext("Array", side=1, outer=TRUE)
 @
 
 \section{Genotyping}
 
a11efc9c
 Next we use the function \Rfunction{crlmmIllumina} which performs preprocessing followed by genotyping using the CRLMM algorithm.
852cc5cf
 
 <<genotype, results=hide, cache=TRUE>>=
d3668036
 crlmmResult = crlmmIllumina(RG=RG, cdfName="human370v1c", returnParams=TRUE)
3be95c2b
 @
852cc5cf
 
d3668036
 This analysis took 3 minutes to complete and peak memory usage was 1.9 GB on our system.
3be95c2b
 The output stored in \Robject{crlmmResult} is a \Rclass{SnpSet} object.
852cc5cf
 <<explore2>>=
d3668036
 class(crlmmResult)
 dim(crlmmResult)
 slotNames(crlmmResult)
 calls(crlmmResult)[1:10, 1:5]
3be95c2b
 @
852cc5cf
 
3be95c2b
 Plotting the {\it SNR} reveals no obvious batch effects in this data set (different symbols are used for
852cc5cf
 arrays scanned on different days).
 
 <<snr,  fig=TRUE, width=8, height=6>>=
3be95c2b
 plot(crlmmResult[["SNR"]], pch=scanbatch, xlab="Array", ylab="SNR",
      main="Signal-to-noise ratio per array",las=2)
852cc5cf
 @
 
3be95c2b
 An all-in-one function named \Rfunction{crlmmIlluminaV2} that combines
 reading of idat files with genotyping is also available.
d3668036
 
 <<readandgenotypeinone, results=hide, cache=TRUE>>=
3be95c2b
 crlmmResult2 <- crlmmIlluminaV2(samples, path=data.dir,
 				arrayInfoColNames=list(barcode=NULL,position="SentrixPosition"),
 				saveDate=TRUE, cdfName="human370v1c", returnParams=TRUE)
 @
d3668036
 
852cc5cf
 \section{System information}
 
3be95c2b
 This analysis was carried out on a linux machine with 32GB of RAM
852cc5cf
 using the following packages:
 
 <<session>>=
 sessionInfo()
 @
 
 \end{document}