%\VignetteIndexEntry{crlmm Vignette - Illumina 370k chip}
%\VignetteKeywords{genotype, crlmm, Illumina}








\title{Using \Rpackage{crlmm} to genotype data from Illumina's Infinium BeadChips}
\author{Matt Ritchie}

\section{Getting started}

In this user guide we read in and genotype data from 40 HapMap samples 
which have been analyzed using Illumina's 370k Duo BeadChips.
This data is available in the \Rpackage{hapmap370k} package.  
Additional chip-specific model parameters and basic SNP annotation 
information used by CRLMM is stored in the \Rpackage{human370v1cCrlmm} package.
These packages can be installed in the usual way using the \Rfunction{biocLite} function.

<<<echo=TRUE, results=hide, eval=FALSE>>=
biocLite(c("hapmap370k", "human370v1cCrlmm"))

\section{Reading in data} 
The function \Rfunction{readIdatFiles} extracts the Red and Green intensities 
from the binary {\tt idat} files output by Illumina's scanning device.
The file {\tt samples370k.csv} contains information about each sample.

<<<echo=FALSE, results=hide, eval=TRUE>>=

<<read, results=hide, eval=TRUE>>=

data.dir = system.file("idatFiles", package="hapmap370k")

# Read in sample annotation info
samples = read.csv(file.path(data.dir, "samples370k.csv"), as.is=TRUE)

<<read2, results=hide, cache=TRUE>>=
# Read in .idats using sampleSheet information
RG = readIdatFiles(samples, path=data.dir, arrayInfoColNames=list(barcode=NULL, position="SentrixPosition"), saveDate=TRUE)

Reading in this data takes approximately 90 seconds and peak memory usage 
was 1.2 GB of RAM on our linux system.
The \Robject{RG} object is an \Rclass{NChannelSet} which stores the 
Red and Green intensities, the number of beads and standard errors for 
each bead-type.  
The scanning date of each array is stored in \Robject{protocolData}.

exprs(channel(RG, "R"))[1:5,1:5]
exprs(channel(RG, "G"))[1:5,1:5]
pd = pData(RG)

scandatetime = strptime(protocolData(RG)[["ScanDate"]], "%m/%d/%Y %H:%M:%S %p")
datescanned = substr(scandatetime, 1, 10)
scanbatch =  factor(datescanned)
levels(scanbatch) = 1:16
scanbatch = as.numeric(scanbatch)

Plots of the summarised data can be easily generated to check for arrays 
with poor signal.

<<boxplots, fig=TRUE, width=8, height=8>>=
par(mfrow=c(2,1), mai=c(0.4,0.4,0.4,0.1), oma=c(1,1,0,0))
boxplot(log2(exprs(channel(RG, "R"))), xlab="Array", ylab="", names=1:40, main="Red channel", outline=FALSE, las=2)
boxplot(log2(exprs(channel(RG, "G"))), xlab="Array", ylab="", names=1:40, main="Green channel", outline=FALSE, las=2)
mtext(expression(log[2](intensity)), side=2, outer=TRUE)
mtext("Array", side=1, outer=TRUE)


Next we use the function \Rfunction{crlmmIllumina} which performs preprocessing followed by genotyping using the CRLMM algorithm.

<<genotype, results=hide, cache=TRUE>>=
crlmmResult = crlmmIllumina(RG=RG, cdfName="human370v1c", sns=pData(RG)$ID, returnParams=TRUE)

This analysis took 470 seconds to complete and peak memory usage was 3.3 GB on our system.
The output stored in \Robject{crlmmResult} is a \Rclass{SnpSet} object.                                                                                                                                                         
  calls(crlmmResult)[1:10, 1:5]

Plotting the {\it SNR} reveals no obvious batch effects in this data set (different symbols are used for 
arrays scanned on different days).

<<snr,  fig=TRUE, width=8, height=6>>=
plot(crlmmResult[["SNR"]], pch=scanbatch, xlab="Array", ylab="SNR", main="Signal-to-noise ratio per array", las=2)

\section{System information}

This analysis was carried out on a linux machine with 32GB of RAM 
using the following packages: