vignettes/gmapR.Rnw
25c2e281
 
b22faf0d
 % \VignetteIndexEntry{gmapR}
 % \VignetteKeywords{gmap,gsnap}
 % \VignettePackage{gmapR}
 \documentclass[10pt]{article}
 
 \usepackage{times}
 \usepackage{hyperref}
 \usepackage{Sweave}
 
 \textwidth=6.5in
 \textheight=8.5in
 % \parskip=.3cm
 \oddsidemargin=-.1in
 \evensidemargin=-.1in
 \headheight=-.3in
 
 \newcommand{\Rfunction}[1]{{\texttt{#1}}}
 \newcommand{\Robject}[1]{{\texttt{#1}}}
 \newcommand{\Rpackage}[1]{{\textit{#1}}}
 \newcommand{\Rmethod}[1]{{\texttt{#1}}}
 \newcommand{\Rfunarg}[1]{{\texttt{#1}}}
 \newcommand{\Rclass}[1]{{\textit{#1}}}
 \newcommand{\Rcode}[1]{{\texttt{#1}}}
 
 \newcommand{\software}[1]{\textsf{#1}}
 \newcommand{\R}{\software{R}}
 
 \title{gmapR: Use the GMAP Suite of Tools in R}
 \author{Michael Lawrence, Cory Barr}
 \date{\today}
 
 \begin{document}
 
 \maketitle
 \tableofcontents
 \newpage
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Introduction}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 The \Rpackage{gmapR} packages provides users with a way to access
 \software{GSNAP}, \software{bam\_tally}, and other utilities from the
 \software{GMAP} suite of tools within an R session. In this vignette,
 we briefly look at the \software{GMAP} suite of tools available
 through the \Rpackage{gmapR} package and work through an example.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{What is \software{GMAP}, \software{GSNAP}, and \software{bam\_tally}?}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 The \software{GMAP} suite offers useful tools for the following:
 
 \begin{itemize}
 
   \item Genomic mapping: Given a cDNA, find where it best aligns to
     an entire genome
 
   \item Genomic alignment: Given a cDNA and a genomic segment,
     provide a nucleotide-level correspondence for the exons of the
     cDNA to the genomic segment
 
   \item Summarization via coverage plus reference and allele
     nucleotide polymorphism counts for an aligned set of sequencing
     reads over a given genomic location
 
 \end{itemize}
 
 \software{GMAP} (Genomic Mapping and Alignment Program) is
 particularly suited to relatively long mRNA and EST sequences such as
 those that are obtained from Roche 454 or Pacific Biosciences
 sequencing technologies. (At present, only \software{GSNAP} is
 available through the \Rpackage{gmapR}. \software{GMAP} integration is
 scheduled for the near future.)
 
 \software{GSNAP} (Genomic Short-read Nucleotide Alignment Program)
 also provides users with genomic mapping and alignment capabilities,
 but is optimized to handle issues that arise when dealing with the
 alignment of short reads generated from sequencing technologies such
 as those from Illumina/Solexa or ABI/SOLiD. \software{GSNAP} offers
 the following functionality, as mentioned in
 \href{http://bioinformatics.oxfordjournals.org/content/26/7/873.full}{Fast
   and SNP-tolerant detection of complex variants and splicing in short
   reads} by Thomas D. Wu and Serban Nacu:
 
 \begin{itemize}
   
   \item fast detection of complex variants and splicing in short
     reads, based on a successively constrained search process of
     merging and filtering position lists from a genomic index
 
   \item alignment of both single- and paired-end reads as short as 14
     nt and of arbitrarily long length
     
   \item detection of short- and long-distance splicing, including
     interchromosomal splicing, in individual reads, using
     probabilistic models or a database of known splice sites
     
   \item SNP-tolerant alignment to a reference space of all possible
     combinations of major and minor alleles
 
   \item alignment of reads from bisulfite-treated DNA for the study of
     methylation state
     
 \end{itemize}
     
 \software{bam\_tally} provides users with the coverage as well as
 counts of both reference alleles, alternative alleles, and indels over
 a genomic region.
 
 For more detailed information on the \software{GMAP} suite of tools
 including a detailed explication of algorithmic specifics, see
 \url{http://research-pub.gene.com/gmap/}.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Create a \Rclass{GmapGenome} Object}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 To align reads with \software{GMAP} or \software{GSNAP}, or to use
 \software{bam\_tally}, you will need to either obtain or create a
 \Rclass{GmapGenome} object. \Rclass{GmapGenome} objects can be created
 from FASTA files or \Rclass{BSgenome} objects, as the following
 example demonstrates:
 
 <<create_GmapGenome, eval=FALSE>>=
 library(gmapR)
 
 if (!suppressWarnings(require(BSgenome.Dmelanogaster.UCSC.dm3))) {
   source("http://bioconductor.org/biocLite.R")
   biocLite("BSgenome.Dmelanogaster.UCSC.dm3")
   library(BSgenome.Dmelanogaster.UCSC.dm3)
 }
 
6747ae05
 gmapGenomePath <- file.path(getwd(), "flyGenome")
b22faf0d
 gmapGenomeDirectory <- GmapGenomeDirectory(gmapGenomePath, create = TRUE)
 ##> gmapGenomeDirectory
 ##GmapGenomeDirectory object
 ##path: /reshpcfs/home/coryba/projects/gmapR2/testGenome 
 
 gmapGenome <- GmapGenome(genome=Dmelanogaster,
afabc0fa
                          directory=gmapGenomeDirectory,
                          name="dm3",
                          create=TRUE,
a500e01a
                          k = 12L)
b22faf0d
 ##> gmapGenome
 ##GmapGenome object
 ##genome: dm3 
 ##directory: /reshpcfs/home/coryba/projects/gmapR2/testGenome 
 @
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Aligning with \software{GSNAP}}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 The \software{GSNAP} algorithm incorporates biological knowledge to
6747ae05
 provide accurate alignments, particularly for RNA-seq data. In this
 section, we will align reads from an RNA-seq experiment provided in a
 fastq file to a selected region of the human genome. In this example,
 we will align to the region of the human genome containing TP53 plus
 an additional one megabase on each side of this gene.
 
 First we need to obtain the desired region of interest from the genome:
 
 <<get_TP53_coordinates>>=
 library("org.Hs.eg.db")
 library("TxDb.Hsapiens.UCSC.hg19.knownGene")
 eg <- org.Hs.eg.db::org.Hs.egSYMBOL2EG[["TP53"]]
 txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
8b7fd7ae
 tx <- transcripts(txdb, filter = list(gene_id = eg))
6747ae05
 roi <- range(tx) + 1e6
afabc0fa
 strand(roi) <- "*"
6747ae05
 @ 
b22faf0d
 
6747ae05
 Next we get the genetic sequence and use it to create a GmapGenome
e8a3d1d3
 object. (Please note that the TP53\_demo GmapGenome object is used by
6747ae05
 many examples and tests in the gmapR and VariationTools packages. If
 the object has been created before, its subsequent creation will be
 instantaneous.)
 
 <<get_TP53_seq>>=
 library("BSgenome.Hsapiens.UCSC.hg19")
2b429ef0
 library("gmapR")
6747ae05
 p53Seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, roi,
                  as.character = FALSE) 
 names(p53Seq) <- "TP53"
fab13f05
 gmapGenome <- GmapGenome(genome = p53Seq, 
                          name = paste0("TP53_demo_", 
                            packageVersion("TxDb.Hsapiens.UCSC.hg19.knownGene")), 
                          create = TRUE, 
afabc0fa
                          k = 12L)
fab13f05
 @
afabc0fa
 
 We add the known transcripts (splice sites) to the genome index:
 <<set_TP53_splicesites>>=
 exons <- gmapR:::subsetRegion(exonsBy(txdb), roi, "TP53")
 spliceSites(gmapGenome, "knownGene") <- exons
b22faf0d
 @ 
 
6747ae05
 The data package \software{LungCancerLines} contains fastqs of reads
 obtained from sequencing H1993 and H2073 cell lines. We will use these
 fastqs to demonstrate \software{GSNAP} alignment with \software{gmapR}.
b22faf0d
 
6747ae05
 <<get_lung_cancer_fastqs>>=
 library("LungCancerLines")
 fastqs <- LungCancerFastqFiles()
 @ 
b22faf0d
 
6747ae05
 \software{GSNAP} is highly configurable. Users create
 \Rclass{GsnapParam} objects to specify desired \software{GSNAP}
 behavior. 
 
 <<create_gsnapParam>>=
b22faf0d
 ##specify how GSNAP should behave using a GsnapParam object
afabc0fa
 gsnapParam <- GsnapParam(genome=gmapGenome,
                          unique_only=FALSE,
                          suboptimal_levels=2L, 
                          npaths=10L,
                          novelsplicing=TRUE,
                          splicing="knownGene",
                          indel_penalty=1L,
                          distant_splice_penalty=1L,
0379507d
                          clip_overlap=TRUE)
6747ae05
 @ 
b22faf0d
 
6747ae05
 Now we are ready to align.
 
 <<align_with_gsnap, eval=FALSE>>=
 gsnapOutput <- gsnap(input_a=fastqs["H1993.first"],
                      input_b=fastqs["H1993.last"],
b22faf0d
                      params=gsnapParam)
afabc0fa
 
b22faf0d
 ##gsnapOutput
 ##An object of class "GsnapOutput"
 ##Slot "path":
 ##[1] "/local/Rtmporwsvr"
 ##
 ##Slot "param":
 ##A GsnapParams object
 ##genome: dm3 (/reshpcfs/home/coryba/projects/gmapR2/testGenome)
 ##part: NULL
 ##batch: 2
 ##max_mismatches: NULL
 ##suboptimal_levels: 0
 ##snps: NULL
 ##mode: standard
 ##nthreads: 1
 ##novelsplicing: FALSE
 ##splicing: NULL
 ##npaths: 10
 ##quiet_if_excessive: FALSE
 ##nofails: FALSE
 ##split_output: TRUE
 ##extra: list()
 ##
 ##Slot "version":
 ## [1] NA NA NA NA NA NA NA NA NA NA NA NA
 ##
 @ 
 
 The \Rcode{gsnapOutput} object will be of the class
 \Rclass{GsnapOutput}. It will provide access to the BAM files of
 alignments that \software{GSNAP} outputs along with other utilities.
 
 <<get_gsnap_bam_files, eval=FALSE>>=
 ##> dir(path(gsnapOutput), full.names=TRUE, pattern="\\.bam$")
 ##[1] "/local/Rtmporwsvr/file1cbc73503e9.1.nomapping.bam"        
 ##[2] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_mult.bam"    
 ##[3] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_transloc.bam"
 ##[4] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_uniq.bam"    
 @ 
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Using \Rmethod{bam\_tally}}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 Running the \Rmethod{bam\_tally} method will return a \Rclass{GRanges}
6747ae05
 of information per nucleotide. Below is an example demonstrating how
 to find variants in the TP53 gene of the human genome. See the
 documentation for the \Rmethod{bam\_tally} method for more details.
 
 \software{gmapR} provides access to a demo genome for examples. This
 genome encompasses the TP53 gene along with a 1-megabase flanking
 region on each side.
 <<TP53Genome_accessor>>=
 genome <- TP53Genome()
 @ 
 
 The \software{LungCancerLines} R package contains a BAM file of reads
 aligned to the TP53 region of the human genome. We'll use this file to
e8a3d1d3
 demonstrate the use of \software{bam\_tally} through the
6747ae05
 \software{gmapR} package. The resulting data structure will contain
 the needed information such as number of alternative alleles, quality
59c36844
 scores, and read position for each allele.
b22faf0d
 
4d5cc806
 The call to \Rfunction{bam\_tally} returns an opaque pointer to a
 C-level data structure. We anticipate having multiple means of
 summarizing these data into R data structures. For now, there is one:
05eefeec
 \Rfunction{variantSummary}, which returns a \Rcode{VRanges} object
4d5cc806
 describing putative genetic variants in the sample.
b22faf0d
 <<run_bamtally, eval=FALSE>>=
6747ae05
 bam_file <- system.file("extdata/H1993.analyzed.bam", 
                         package="LungCancerLines", mustWork=TRUE)
b22faf0d
 breaks <- c(0L, 15L, 60L, 75L)
 bqual <- 56L
 mapq <- 13L
4d5cc806
 param <- BamTallyParam(genome,
25c2e281
                        minimum_mapq = mapq,
b22faf0d
                        concordant_only = FALSE, unique_only = FALSE,
                        primary_only = FALSE,
                        min_depth = 0L, variant_strand = 1L,
                        ignore_query_Ns = TRUE,
8418591e
                        indels = FALSE, include_soft_clips = 1L, xs=TRUE)
b22faf0d
 tallies <-bam_tally(bam_file,
                     param)
4d5cc806
 variantSummary(tallies, high_base_quality = 23L)
b22faf0d
 @ 
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Creating a \Rclass{GmapGenome} Package}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 After creating a \Rclass{GmapGenome} object, you might want to
 distribute it for collaboration or version it for reproducible
 research. The function \Rcode{makeGmapGenomePackage} allows you to do
 this. Continuing on with the D. melanogaster example from above, here
 is how to archive the D. melanogaster \Rclass{GmapGenome} object in a
 \Rclass{GmapGenome} package:
 
 <<create_GmapGenomePackge, eval=FALSE>>=
 makeGmapGenomePackage(gmapGenome=gmapGenome,
                       version="1.0.0",
                       maintainer="<your.name@somewhere.com>",
                       author="Your Name",
                       destDir="myDestDir",
                       license="Artistic-2.0",
                       pkgName="GmapGenome.Dmelanogaster.UCSC.dm3")
753b566b
 @
b22faf0d
 
 After creating the package, you can run \Rcode{R CMD INSTALL
   myDestDir} to install it, or run \Rcode{R CMD build myDestDir} to
 create a distribution of the package.
 
0a4b811e
 Many users with be working with the human genome. Many of the examples
 used in \software{gmapR} make use of a particular build of the human
 genome. As such, creating a \Rclass{GmapGenome} of hg19 is
 recommended. Here is one way to create it, using a \Rclass{BSgenome}
 object:
 
 <<create_hg19_GmapGenome, eval=FALSE>>=
 if (!suppressWarnings(require(BSgenome.Hsapiens.UCSC.hg19))) {
   source("http://bioconductor.org/biocLite.R")
   biocLite("BSgenome.Hsapiens.UCSC.hg19")
   library(BSgenome.Hsapiens.UCSC.hg19)
 }
 gmapGenome <- GmapGenome(genome=Hsapiens,
                          directory = "Hsapiens",
                          name = "hg19",
                          create = TRUE)
 destDir <- "HsapiensGmapGenome"
 pkgName <- "GmapGenome.Hsapiens.UCSC.hg19"
 makeGmapGenomePackage(gmapGenome=gmapGenome,
                       version="1.0.0",
                       maintainer="<your.name@somewhere.com>",
                       author="Your Name",
                       destDir=destDir,
                       license="Artistic-2.0",
                       pkgName="GmapGenome.Hsapiens.UCSC.hg19")
 @ 
 
 After running the above code, you should be able to run \Rcode{R CMD
   INSTALL GmapGenome.Hsapiens.UCSC.hg19} in the appropriate directory
 from the command line, submit GmapGenome.Hsapiens.UCSC.hg19 to a
 repository, etc. After installation,
 \Rcode{library("GmapGenome.Hsapiens.UCSC.hg19")} will load a
 \Rclass{GmapGenome} object that has the same name as the package.
 
 
 
6747ae05
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %\section{Aligning with SNP Tolerance}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0a4b811e
 
6747ae05
 %Both \software{GSNAP} and \software{GMAP} have the ability to perform
 %alignment without penalizing against a set of known SNPs. In this
 %section, we work through an example of aligning with this SNP
 %tolerance. By using SNP-tolerance, we'll perform alignments without
 %penalizing reads that match SNPs contained in the provided SNPs. In
 %this case, we'll use dbSNP to provide the SNPs.
 %
 %This example will use hg19. If you have not installed
 %\Rcode{GmapGenome.Hsapiens.UCSC.hg19}, please refer to the section
 %\ref{create_hg19_GmapGenome}.
 %
 %First we load the Hsapiens \Rclass{GmapGenome} object:
 %
 %<<<load_GmapGenomeHsapiens>>=
 %library("GmapGenome.Hsapiens.UCSC.hg19")
 %@ 
0a4b811e
 
b22faf0d
 <<SessionInfo>>=
 sessionInfo()
 @ 
 \end{document}