%\VignetteIndexEntry{From Genotypes to Association} %\VignetteKeywords{genotype, crlmm, SNP 5, SNP 6} %\VignettePackage{crlmm} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{crlmm to downstream data analysis} \author{VJ Carey, B Carvalho} \date{March, 2012} \maketitle \section{Running CRLMM on a nontrivial set of CEL files} To use the \Rmethod{crlmm} algorithm, the user must load the \Rpackage{crlmm} package, as described below: <<loadPkg>>= library(crlmm) @ We work with the 90 CEU samples hybridized to Affy 6.0 chips. When CEL files are available, they must be identified and passed to \Rmethod{crlmm}, as shown below. In this example, we assume that the results are stored in a variable called \Robject{crlmmResult}. <<lkd, eval=FALSE>>= celFiles <- list.celfiles() crlmmResult <- crlmm(celFiles) @ Alternatively, the data aforementioned are available through the \Rpackage{hapmapsnp6} package (required minimum version $1.3.6$) and can be loaded by using: <<loadFromPkg>>= suppressPackageStartupMessages(library(hapmapsnp6)) data(crlmmResult) @ This is currently a \Rclass{SnpSet} object. <<lkj21>>= class(crlmmResult) @ %% In order to reduce the memory requirements for this task, we will use %% only results for chromosome 20. %% %% <<getSubset>>= %% @ \section{Adding information to a \Rclass{SnpSet}} We will use the \Rpackage{GGdata} package to obtain extra information on the samples. This will be later used when building an \Rclass{eSet} extension to store the genotyping results. <<getpd>>= suppressPackageStartupMessages(library(GGdata)) hmceuB36 <- getSS('GGdata', as.character(1:22)) pd <- phenoData(hmceuB36) ggn <- sampleNames(pd) preSN <- sampleNames(crlmmResult) simpSN <- gsub("_.*", "", preSN) if (!all.equal(simpSN, ggn)) stop("align GGdata phenoData with crlmmResult read") @ The additional information obtained from \Rpackage{GGdata} can be easily combined to what is already available on \Robject{crlmmResult}. <<docl>>= sampleNames(crlmmResult) <- simpSN phenoData(crlmmResult) <- combine(pd, phenoData(crlmmResult)) dim(calls(crlmmResult)) dim(confs(crlmmResult, FALSE)) calls(crlmmResult)[1:10, 1:2] confs(crlmmResult, FALSE)[1:10, 1:2] @ \section{Coercing to SnpMatrix as a prelude to a GWAS} From this point on, we will use only the genotype calls. Therefore, to reduce memory requirements, we will recode the \Rpackage{crlmm} genotype calls, so the \Rpackage{snpStats} package can be used, and delete the remaining \Rmethod{crlmm} results. <<clean>>= theCalls <- t(calls(crlmmResult))-1L rm(crlmmResult) @ <<morecleaning, echo=FALSE>>= gc() @ SNP's for which all the samples have the same genotype are not informative for association studies. Therefore, we remove such SNP's prior to fitting the models. <<rmNonInformative>>= gtypeCounts <- rbind(AA=colSums(theCalls == 0L), AB=colSums(theCalls == 1L), BB=colSums(theCalls == 2L)) gtypeCounts[, 1:5] toRemove <- which(colSums(gtypeCounts == 0) == 2L) gtypeCounts[, toRemove[1:4]] theCalls <- theCalls[, -toRemove] @ The \Rpackage{snpStats} provides tools to simplify the analysis of GWAS. The snippet below shows how to load the package and convert the genotype calls to a format that \Rpackage{snpStats} is able to handle. <<lksnm>>= suppressPackageStartupMessages(library(snpStats)) crlmmSM <- new("SnpMatrix", theCalls) crlmmSM @ \section{Conducting a GWAS} We want to find SNP for which genotype is predictive of expression of CPNE1. We will use expression data available from GGdata, using a naive analysis. <<doa>>= suppressPackageStartupMessages(library(illuminaHumanv1.db)) rmm <- revmap(illuminaHumanv1SYMBOL) mypr <- get("CPNE1", rmm) ex <- as.numeric(exprs(hmceuB36)[mypr[1],]) subjdata <- pData(hmceuB36) subjdata[["ex"]] <- ex head(subjdata) @ With the expression data now available in \Robject{subjdata}, we can use the tools from \Rpackage{SnpMatrix} to fit models that will be used to evaluate the association between the genotypes of each available SNP and the expression levels of CPNE1. <<model>>= gwas <- snp.rhs.tests(ex~male, data=subjdata, snp.data=crlmmSM, family="gaussian") ok <- which(p.value(gwas) < 1e-10) gwas[ok,] @ <<dopl,fig=TRUE>>= snp <- names(gwas[ok,])[1] gtypes <- theCalls[,snp]+1L boxplot(ex~gtypes, xlab=paste("Genotype Call for", snp), ylab="CPNE1 Expression", xaxt="n", range=0) points(ex~jitter(gtypes), col=gtypes, pch=19) axis(1, at=1:3, labels=c("AA", "AB", "BB")) @ \section{Session Info} This vignette was created using the following packages: <<lksess>>= sessionInfo() @ \end{document}