moved vignette to Rmd/html using latest BiocStyle style

Laurent authored on 22/08/2017 14:33:49
Showing 3 changed files

... ...
@@ -19,7 +19,7 @@ License: Artistic-2.0
19 19
 LazyLoad: yes
20 20
 Depends: Rcpp (>= 0.10.1), methods, utils
21 21
 Imports: Biobase, BiocGenerics (>= 0.13.6), ProtGenerics
22
-Suggests: msdata (>= 0.15.1), RUnit, mzID, BiocStyle, knitr, XML
22
+Suggests: msdata (>= 0.15.1), RUnit, mzID, BiocStyle (>= 2.5.19), knitr, XML
23 23
 VignetteBuilder: knitr
24 24
 LinkingTo: Rcpp, zlibbioc
25 25
 RcppModules: Ramp, Pwiz, Ident
26 26
new file mode 100644
... ...
@@ -0,0 +1,203 @@
1
+---
2
+title: "A parser for raw and identification mass-spectrometry data"
3
+author: 
4
+- name: Bernd Fischer
5
+- name: Steffen Neumann
6
+- name: Laurent Gatto
7
+- name: Qiang Kou
8
+package: mzR
9
+output:
10
+  BiocStyle::html_document:
11
+    toc_float: true
12
+bibliography: mzR.bib 	
13
+vignette: >
14
+  %\VignetteEngine{knitr::rmarkdown}
15
+  %\VignetteIndexEntry{Accessing raw mass spectrometry and identification data}
16
+  %\VignetteKeywords{mzXML, mzData, netCDF, mzML, mzIdentML, mass spectrometry, proteomics, metabolomics}
17
+  %\VignetteEncoding{UTF-8}
18
+  %\VignettePackage{mzR}
19
+---
20
+
21
+# Introduction
22
+
23
+The `r Biocpkg("mzR")` package aims at providing a common, low-level
24
+interface to several mass spectrometry data formats, namely `mzData`
25
+[@Orchard2007], `mzXML` [@Pedrioli2004], `mzML` [@Martens2010] for raw
26
+data, and `mzIdentML` [@Jones2012], somewhat similar to the
27
+Bioconductor package affyio for Affymetrix raw data. No processing is
28
+done in `r Biocpkg("mzR")`, which is left to packages such as `r
29
+Biocpkg("xcms")` [@Smith:2006; @Tautenhahn:2008] or
30
+`r Biocpkg("MSnbase")` [@Gatto:2012]. These packages also provide more
31
+convenient, high-level interfaces to raw and identification data.
32
+
33
+Most importantly, access to the data should be fast and memory
34
+efficient. This is made possible by allowing on-disk random file
35
+access, i.e. retrieving specific data of interest without having to
36
+sequentially browse the full content or load the entire data into
37
+memory.
38
+
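The idea of on-disk random access can be illustrated with a minimal base R sketch. This is not how the mzR backends are implemented (they do this work in C/C++); it only shows the general technique of seeking to a byte offset and reading a few values. The helper `read_slice` and the temporary file are hypothetical and exist only for illustration.

```r
# Write 1000 doubles to a temporary binary file (a stand-in for a large data file).
path <- tempfile(fileext = ".bin")
writeBin(as.double(seq_len(1000)), path, size = 8L)

# Hypothetical helper: read `n` values starting at the 1-based index `start`
# without reading anything that comes before it.
read_slice <- function(path, start, n, size = 8L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Jump directly to the byte offset of the first requested value.
  seek(con, where = (start - 1L) * size, origin = "start")
  readBin(con, what = "double", n = n, size = size)
}

slice <- read_slice(path, start = 501, n = 3)
slice
unlink(path)
```

Only the three requested values are read from disk; the rest of the file is never loaded into memory.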
39
+The actual work of reading and parsing the data files is handled by
40
+the included C/C++ libraries or *backends*. The `mzRramp` RAMP parser,
41
+written at the Institute for Systems Biology (ISB) is a fast and
42
+lightweight parser in pure C. Later, it gained support for the
43
+`mzData` format. The C++ reference implementation for the `mzML` format is
44
+the proteowizard library [@Kessner08] (pwiz in short), which in turn
45
+makes use of the boost C++ (<http://www.boost.org/>) library. RAMP is
46
+able to access `mzML` files by calling pwiz methods. More recently,
47
+the proteowizard (http://proteowizard.sourceforge.net/)
48
+[@Chambers2012] has been fully integrated using the `mzRpwiz` backend
49
+for raw data, and is now the default option. The `mzRnetCDF` backend
50
+provides support for `CDF`-based formats. Finally, the `mzRident`
51
+backend is available to access identification data (`mzIdentML`)
52
+through pwiz.
53
+
54
+The `r Biocpkg("mzR")` package is in essence a collection of wrappers
55
+to the C++ code, and benefits from the C++ interface provided through
56
+the Rcpp package [@Rcpp11].
57
+
58
+
59
+# Mass spectrometry raw data
60
+
61
+All the mass spectrometry file formats are organized similarly, where
62
+a set of metadata nodes about the run is followed by a list of spectra
63
+with the actual masses and intensities. In addition, each of these
64
+spectra has its own set of metadata, such as the retention time and
65
+acquisition parameters.
66
+
67
+## Spectral data access
68
+
69
+Access to the spectral data is done via the `peaks` function. The
70
+return value is a list of two-column mass-to-charge and intensity
71
+matrices or a single matrix if one spectrum is queried.
72
+
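A short sketch of the two return shapes described above, assuming the mzR and msdata packages are installed and reusing the threonine example file from msdata. The scan indices are arbitrary.

```r
library(mzR)
library(msdata)

f <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                 package = "msdata")
ms <- openMSfile(f)

# A single scan: one two-column matrix (m/z, intensity).
one <- peaks(ms, 1)
dim(one)

# Several scans: a list with one matrix per requested scan.
several <- peaks(ms, 1:3)
length(several)
sapply(several, nrow)

close(ms)
```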
73
+## Chromatogram access
74
+
75
+Access to the chromatogram(s) is done using the `chromatogram` (or
76
+`chromatograms`) function, which returns one (or a list of)
77
+data.frames. See `?chromatogram` for details. This functionality is
78
+only available with the `pwiz` backend.
79
+
80
+## Identification result access
81
+
82
+Identification results are accessed mainly via `psms`, `score`
83
+and `modifications`. `psms` and `score` return the detailed
84
+information and scores for each PSM. `modifications` returns the
85
+details of each modification found in the peptides.
86
+
87
+## Metadata access
88
+
89
+**Run metadata** is available via several functions such as
90
+`instrumentInfo()` or `runInfo()`. The individual fields can be
91
+accessed via e.g. `detector()` etc.
92
+
93
+**Spectrum metadata** is available via `header()`, which will return a
94
+list (for single scans) or a data.frame with information such as the
95
+`basePeakMZ`, `peaksCount`, ... or, for higher-order MS the `msLevel`
96
+and precursor information.
97
+
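As a sketch of how the spectrum metadata might be used, the following reads the full header table and keeps only the MS2 scans. It assumes mzR and msdata are installed, and that the header contains the columns `seqNum`, `msLevel`, `retentionTime` and `precursorMZ`; which columns are populated depends on the file and on the mzR version.

```r
library(mzR)
library(msdata)

f <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                 package = "msdata")
ms <- openMSfile(f)

# Without a scan index, header() returns one row per spectrum.
hdr <- header(ms)
table(hdr$msLevel)

# Keep only the MS2 scans and a few columns of interest.
ms2 <- hdr[hdr$msLevel == 2,
           c("seqNum", "msLevel", "retentionTime", "precursorMZ")]
head(ms2)

close(ms)
```

The `seqNum` values of the selected rows can then be passed to `peaks()` to retrieve only those spectra.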
98
+**Identification metadata** is available via `mzidInfo()`, which will
99
+return a list with information such as the `software`,
100
+`ModificationSearched`, `enzymes`, `SpectraSource` and other
101
+information for this identification result.
102
+
103
+The availability of this metadata cannot always be guaranteed, and
104
+depends on the MS software which converted the data.
105
+
106
+# Example
107
+
108
+## `mzXML`/`mzML`/`mzData` files
109
+
110
+A short example sequence to read data from a mass spectrometer. 
111
+First open the file.
112
+
113
+```{r openraw}
114
+library(mzR)
115
+library(msdata)
116
+
117
+mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", 
118
+                     package = "msdata")
119
+aa <- openMSfile(mzxml) 
120
+``` 
121
+
122
+We can obtain different kinds of header information.
123
+
124
+```{r get header information}
125
+runInfo(aa)
126
+instrumentInfo(aa)
127
+header(aa,1)
128
+``` 
129
+
130
+Read a single spectrum from the file.
131
+
132
+```{r plotspectrum}
133
+pl <- peaks(aa,10)
134
+peaksCount(aa,10)
135
+head(pl)
136
+plot(pl[,1], pl[,2], type="h", lwd=1)
137
+``` 
138
+
139
+One should always close the file when it is no longer needed. This will
140
+release the memory of cached content.
141
+
142
+```{r close the file}
143
+close(aa)
144
+``` 
145
+
146
+## `mzIdentML` files
147
+
148
+You can use `openIDfile` to read a `mzIdentML` file (version 1.1),
149
+which uses the pwiz backend.
150
+
151
+```{r openid}
152
+library(mzR)
153
+library(msdata)
154
+
155
+file <- system.file("mzid", "Tandem.mzid.gz", package="msdata")
156
+x <- openIDfile(file)
157
+``` 
158
+
159
+The `mzidInfo` function will return general information about this
160
+identification result.
161
+
162
+```{r metadata}
163
+mzidInfo(x)
164
+``` 
165
+
166
+`psms` will return the detailed information on each
167
+peptide-spectrum-match, including `spectrumID`, `chargeState`,
168
+`sequence`, `modNum` and others.
169
+
170
+```{r psms0}
171
+p <- psms(x)
172
+colnames(p)
173
+```
174
+
175
+The modifications information can be accessed using `modifications`,
176
+which will return the `spectrumID`, `sequence`, `name`, `mass` and
177
+`location`.
178
+
179
+```{r psms1}
180
+m <- modifications(x)
181
+head(m)
182
+```
183
+
184
+Since different software uses different scoring functions, we
185
+provide a `score` function to extract the scores for each PSM. It returns a
186
+data.frame whose columns depend on the software that generated
187
+the file.
188
+
189
+```{r psms2}
190
+scr <- score(x)
191
+colnames(scr)
192
+```
193
+
194
+# Future plans
195
+
196
+Support for other file formats provided by HUPO, such as `mzQuantML` for
197
+quantitative data [@Walzer:2013], may be added in the future.
198
+
199
+# Session information {#sec:sessionInfo} 
200
+
201
+```{r sessioninfo, echo=FALSE}
202
+sessionInfo()
203
+``` 
0 204
deleted file mode 100644
... ...
@@ -1,228 +0,0 @@
1
-%\VignetteEngine{knitr::knitr}
2
-%\VignetteIndexEntry{Accessin raw mass spectrometry and identification data}
3
-%\VignetteKeywords{mzXML, mzData, netCDF, mzML, mzIdentML, mass spectrometry, proteomics, metabolomics}
4
-%\VignettePackage{mzR}
5
-
6
-\documentclass{article}
7
-
8
-<<style, eval=TRUE, echo=FALSE, results="asis">>=
9
-BiocStyle::latex()
10
-@
11
-
12
-\title{A parser for raw and identification mass-spectrometry data}
13
-
14
-\author{Bernd Fischer\footnote{\email{bernd.fischer@embl.de}} \\
15
-  Steffen Neumann\footnote{\email{sneumann@ipb-halle.de}} \\
16
-  Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}\\
17
-  Qiang Kou\footnote{\email{qkou@umail.iu.edu}}}
18
-
19
-
20
-\begin{document}
21
-
22
-
23
-\maketitle
24
-
25
-\tableofcontents
26
-
27
-\section{Introduction}
28
-
29
-The \Biocpkg{mzR} package aims at providing a common interface to
30
-several mass spectrometry data formats, namely \texttt{mzData}
31
-\cite{Orchard2007}, \texttt{mzXML} \cite{Pedrioli2004}, \texttt{mzML}
32
-\cite{Martens2010} for raw data, and \texttt{mzIdentML}
33
-\cite{Jones2012}, somewhat similar to the Bioconductor package affyio
34
-for affymetrix raw data. No processing is done in \Biocpkg{mzR}, which
35
-is left to packages such as \Biocpkg{XCMS} \cite{Smith:2006,
36
-  Tautenhahn:2008} or \Biocpkg{MSnbase} \cite{Gatto:2012}.
37
-
38
-\bigskip
39
-
40
-Most importantly, access to the data should be fast and memory
41
-efficient. This is made possible by allowing on-disk random file
42
-access, i.e. retrieving specific data of interest without having to
43
-sequentially browser the full content nor loading the entire data into
44
-memory.
45
-
46
-The actual work of reading and parsing the data files is handled by
47
-the included C/C++ libraries or ``backends''. The \texttt{mzRramp}
48
-RAMP parser, written at the Institute for Systems Biology (ISB) is a
49
-fast and lightweight parser in pure C. Later, it gained support for
50
-the \texttt{mzData} format. The C++ reference implementation for the
51
-\texttt{mzML} is the proteowizard library \cite{Kessner08} (pwiz in
52
-short), which in turn makes use of the boost C++
53
-(\url{http://www.boost.org/}) library.  RAMP is able to access
54
-\texttt{mzML} files by calling pwiz methods. More recently, the
55
-proteowizard\footnote{\url{http://proteowizard.sourceforge.net/}}
56
-\cite{Chambers2012} has been fully integrated using the
57
-\texttt{mzRpwiz} backend for raw data. The \texttt{mzRnetCDF} backend
58
-provides support to \texttt{CDF}-based formats. Finally, the
59
-\texttt{mzRident} backend is available to access identification data
60
-(\texttt{mzIdentML}) through pwiz.
61
-
62
-\warning{It is anticipated to switch to the \texttt{mzRpwiz} backend
63
-  in Bioconductor 3.1. We advise users and developers to test it and
64
-  report any issues on the github issue tracker
65
-  \url{https://github.com/sneumann/mzR/issues}.}
66
-
67
-The \Biocpkg{mzR} package is in essence a collection of wrappers to
68
-the C++ code, and benefits from the C++ interface provided through the
69
-Rcpp package \cite{Rcpp11}.
70
-
71
-
72
-\section{Mass spectrometry raw data}
73
-
74
-All the mass spectrometry file formats are organized similarly, where
75
-a set of metadata nodes about the run is followed by a list of spectra
76
-with the actual masses and intensities. In addition, each of these
77
-spectra has its own set of metadata, such as the retention time and
78
-acquisition parameters.
79
-
80
-\subsection{Spectral data access}
81
-
82
-Access to the spectral data is done via the \Rfunction{peaks}
83
-function. The return value is a list of two-column mass-to-charge and
84
-intensity matrices or a single matrix if one spectrum is queried.
85
-
86
-\subsection{Chromatogram access}
87
-
88
-Access to the chromatogram(s) is done using the
89
-\Rfunction{chromatogram} (or \Rfunction{chromatograms}) function, that
90
-return one (or a list of) data.frames. See \texttt{?chromatogram} for
91
-details. This functionality is only available with the \texttt{pwiz}
92
-backend.
93
-
94
-\subsection{Identification result access}
95
-
96
-The main access to identification result is done via \Rfunction{psms},
97
-\Rfunction{score} and \Rfunction{modifications}.  \Rfunction{psms} and
98
-\Rfunction{score} will return the detailed information on each psm and
99
-scores.  \Rfunction{modifications} will return the details on each
100
-modification found in peptide.
101
-
102
-\subsection{Metadata access}
103
-
104
-\paragraph{Run metadata} is available via several functions such as
105
-\Rfunction{instrumentInfo()} or \Rfunction{runInfo()}. The individual
106
-fields can be accessed via e.g. \Rfunction{detector()} etc.
107
-
108
-\paragraph{Spectrum metadata} is available via \Rfunction{header()},
109
-which will return a list (for single scans) or a dataframe with
110
-information such as the \Rfunction{basePeakMZ},
111
-\Rfunction{peaksCount}, \ldots or, for higher-order MS the
112
-\Rfunction{msLevel} and precursor information.
113
-
114
-\paragraph{Identification metadata} is available via
115
-\Rfunction{mzidInfo()}, which will return a list with information such
116
-as the \Rfunction{software}, \Rfunction{ModificationSearched},
117
-\Rfunction{enzymes}, \Rfunction{SpectraSource} and other information
118
-for this identification result.
119
-
120
-\bigskip
121
-
122
-The availability of this metadata can not always be guaranteed, and
123
-depends on the MS software which converted the data.
124
-
125
-\section{Example}
126
-
127
-\subsection{\texttt{mzXML}/\texttt{mzML}/\texttt{mzData} files}
128
-
129
-A short example sequence to read data from a mass spectrometer. 
130
-First open the file.
131
-
132
-<<openraw>>=
133
-library(mzR)
134
-library(msdata)
135
-
136
-mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", 
137
-                     package = "msdata")
138
-aa <- openMSfile(mzxml) ## ramp, default backend
139
-@ 
140
-
141
-We can obtain different kind of header information.
142
-<<get header information>>=
143
-runInfo(aa)
144
-instrumentInfo(aa)
145
-header(aa,1)
146
-@ 
147
-
148
-Read a single spectrum from the file.
149
-<<plotspectrum>>=
150
-pl <- peaks(aa,10)
151
-peaksCount(aa,10)
152
-head(pl)
153
-plot(pl[,1], pl[,2], type="h", lwd=1)
154
-@ 
155
-
156
-One should always close the file when not needed any more if you are using RAMP backend.
157
-This will release the memory of cached content.
158
-
159
-<<close the file>>=
160
-close(aa)
161
-@ 
162
-
163
-\subsection{\texttt{mzIdentML} files}
164
-
165
-You can use \Rfunction{openIDfile} to read a \texttt{mzIdentML} file
166
-(version 1.1), which use the pwiz backend.
167
-
168
-<<openid>>=
169
-library(mzR)
170
-library(msdata)
171
-
172
-file <- system.file("mzid", "Tandem.mzid.gz", package="msdata")
173
-x <- openIDfile(file)
174
-@ 
175
-
176
-\Rfunction{mzidInfo} function will return general information about
177
-this identification result.
178
-
179
-<<metadata>>=
180
-mzidInfo(x)
181
-@ 
182
-
183
-\Rfunction{psms} will return the detailed information on each
184
-peptide-spectrum-match, include \texttt{spectrumID},
185
-\texttt{chargeState}, \texttt{sequence}. \texttt{modNum} and others.
186
-
187
-<<psms0>>=
188
-p <- psms(x)
189
-colnames(p)
190
-@
191
-
192
-The modifications information can be accessed using
193
-\Rfunction{modifications}, which will return the \texttt{spectrumID},
194
-\texttt{sequence}, \texttt{name}, \texttt{mass} and \texttt{location}.
195
-
196
-<<psms1>>=
197
-m <- modifications(x)
198
-head(m)
199
-@
200
-
201
-Since different software will use different scoring function, we
202
-provide a \texttt{score} to extract the scores for each psm. It will
203
-return a data.frame with different columns depending on software
204
-generating this file.
205
-
206
-<<psms2>>=
207
-scr <- score(x)
208
-colnames(scr)
209
-@
210
-
211
-\section{Future plans}
212
-
213
-Other file formats provided by HUPO, such as \texttt{mzQuantML} for
214
-quantitative data \cite{Walzer:2013} are also possible in the future.
215
-
216
-\section{Session information}\label{sec:sessionInfo} 
217
-
218
-<<label=sessioninfo, results='asis', echo=FALSE>>=
219
-toLatex(sessionInfo())
220
-@ 
221
-
222
-
223
-\bibliography{mzR}
224
-
225
-\end{document}
226
-
227
-
228
-