... | ... |
@@ -19,7 +19,7 @@ License: Artistic-2.0 |
19 | 19 |
LazyLoad: yes |
20 | 20 |
Depends: Rcpp (>= 0.10.1), methods, utils |
21 | 21 |
Imports: Biobase, BiocGenerics (>= 0.13.6), ProtGenerics |
22 |
-Suggests: msdata (>= 0.15.1), RUnit, mzID, BiocStyle, knitr, XML |
|
22 |
+Suggests: msdata (>= 0.15.1), RUnit, mzID, BiocStyle (>= 2.5.19), knitr, XML |
|
23 | 23 |
VignetteBuilder: knitr |
24 | 24 |
LinkingTo: Rcpp, zlibbioc |
25 | 25 |
RcppModules: Ramp, Pwiz, Ident |
26 | 26 |
new file mode 100644 |
... | ... |
@@ -0,0 +1,203 @@ |
1 |
+--- |
|
2 |
+title: "A parser for raw and identification mass-spectrometry data" |
|
3 |
+author: |
|
4 |
+- name: Bernd Fischer |
|
5 |
+- name: Steffen Neumann |
|
6 |
+- name: Laurent Gatto |
|
7 |
+- name: Qiang Kou |
|
8 |
+package: mzR |
|
9 |
+output: |
|
10 |
+ BiocStyle::html_document: |
|
11 |
+ toc_float: true |
|
12 |
+bibliography: mzR.bib |
|
13 |
+vignette: > |
|
14 |
+ %\VignetteEngine{knitr::rmarkdown} |
|
15 |
+ %\VignetteIndexEntry{Accessing raw mass spectrometry and identification data} |
|
16 |
+ %\VignetteKeywords{mzXML, mzData, netCDF, mzML, mzIdentML, mass spectrometry, proteomics, metabolomics} |
|
17 |
+ %\VignetteEncoding{UTF-8} |
|
18 |
+ %\VignettePackage{mzR} |
|
19 |
+--- |
|
20 |
+ |
|
21 |
+# Introduction |
|
22 |
+ |
|
23 |
+The `r Biocpkg("mzR")` package aims at providing a common, low-level |
|
24 |
+interface to several mass spectrometry data formats, namely `mzData` |
|
25 |
+[@Orchard2007], `mzXML` [@Pedrioli2004], `mzML` [@Martens2010] for raw |
|
26 |
+data, and `mzIdentML` [@Jones2012] for identification data. It is |
|
27 |
+somewhat similar to the Bioconductor package affyio for Affymetrix raw |
|
28 |
+data. No processing is done in `r Biocpkg("mzR")`; this is left to |
|
29 |
+packages such as `r Biocpkg("xcms")` [@Smith:2006; @Tautenhahn:2008] or |
|
30 |
+`r Biocpkg("MSnbase")` [@Gatto:2012]. These packages also provide more |
|
31 |
+convenient, high-level interfaces to raw and identification data. |
|
32 |
+ |
|
33 |
+Most importantly, access to the data should be fast and memory |
|
34 |
+efficient. This is made possible by allowing on-disk random file |
|
35 |
+access, i.e. retrieving specific data of interest without having to |
|
36 |
+sequentially browse the full content or load the entire data into |
|
37 |
+memory. |
|
38 |
+ |
|
39 |
+The actual work of reading and parsing the data files is handled by |
|
40 |
+the included C/C++ libraries or *backends*. The `mzRramp` RAMP parser, |
|
41 |
+written at the Institute for Systems Biology (ISB), is a fast and |
|
42 |
+lightweight parser in pure C. Later, it gained support for the |
|
43 |
+`mzData` format. The C++ reference implementation for `mzML` is |
|
44 |
+the proteowizard library [@Kessner08] (pwiz in short), which in turn |
|
45 |
+makes use of the boost C++ (<http://www.boost.org/>) library. RAMP is |
|
46 |
+able to access `mzML` files by calling pwiz methods. More recently, |
|
47 |
+the proteowizard (<http://proteowizard.sourceforge.net/>) |
|
48 |
+[@Chambers2012] has been fully integrated using the `mzRpwiz` backend |
|
49 |
+for raw data, and is now the default option. The `mzRnetCDF` backend |
|
50 |
+provides support for `CDF`-based formats. Finally, the `mzRident` |
|
51 |
+backend is available to access identification data (`mzIdentML`) |
|
52 |
+through pwiz. |
|
53 |
+ |
|
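
As a hedged illustration of the backends described above, the sketch below opens one file with the pwiz backend and one with the netCDF backend. It assumes that `openMSfile()` accepts a `backend` argument with the values `"pwiz"` and `"netCDF"`, and that `msdata` ships the `cdf/ko15.CDF` example file. Neither assumption is stated in this vignette, so check `?openMSfile` for the installed version.

```r
library(mzR)
library(msdata)

## Example files from msdata (the CDF path is an assumption, see above).
mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                     package = "msdata")
cdf <- system.file("cdf/ko15.CDF", package = "msdata")

## Request a backend explicitly instead of relying on the default.
ms <- openMSfile(mzxml, backend = "pwiz")
nc <- openMSfile(cdf, backend = "netCDF")

## The class of the returned object reflects the backend in use.
class(ms)
class(nc)

close(ms)
close(nc)
```

Naming the backend makes the choice visible in the script, which may help when results need to be compared across package versions.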
54 |
+The `r Biocpkg("mzR")` package is in essence a collection of wrappers |
|
55 |
+to the C++ code, and benefits from the C++ interface provided through |
|
56 |
+the Rcpp package [@Rcpp11]. |
|
57 |
+ |
|
58 |
+ |
|
59 |
+# Mass spectrometry raw data |
|
60 |
+ |
|
61 |
+All the mass spectrometry file formats are organized similarly, where |
|
62 |
+a set of metadata nodes about the run is followed by a list of spectra |
|
63 |
+with the actual masses and intensities. In addition, each of these |
|
64 |
+spectra has its own set of metadata, such as the retention time and |
|
65 |
+acquisition parameters. |
|
66 |
+ |
|
67 |
+## Spectral data access |
|
68 |
+ |
|
69 |
+Access to the spectral data is done via the `peaks` function. The |
|
70 |
+return value is a list of two-column mass-to-charge and intensity |
|
71 |
+matrices or a single matrix if one spectrum is queried. |
|
72 |
+ |
|
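
The two return shapes of `peaks` described above can be sketched as follows. This is a minimal example that assumes the `msdata` threonine file used later in this vignette and relies on the default backend.

```r
library(mzR)
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                     package = "msdata")
ms <- openMSfile(mzxml)

## A single scan yields one two-column matrix (m/z, intensity).
one <- peaks(ms, 1)
dim(one)

## Several scans yield a list with one such matrix per scan.
many <- peaks(ms, 1:3)
length(many)
sapply(many, nrow)

close(ms)
```

Code that handles an arbitrary number of scans may want to wrap a single-scan result in `list()` so that both cases can be processed with the same loop.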
73 |
+## Chromatogram access |
|
74 |
+ |
|
75 |
+Access to the chromatogram(s) is done using the `chromatogram` (or |
|
76 |
+`chromatograms`) function, which returns one (or a list of) |
|
77 |
+data.frames. See `?chromatogram` for details. This functionality is |
|
78 |
+only available with the `pwiz` backend. |
|
79 |
+ |
|
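
A hedged sketch of chromatogram access is given below. It assumes that the installed `msdata` provides the `proteomics()` helper and an MRM example file matching the pattern `"MRM"`, and that `nChrom()` is available in `mzR`. These names are not mentioned in this vignette and follow the example in `?chromatogram`.

```r
library(mzR)
library(msdata)

## Locate the MRM example file (assumed to ship with msdata).
f <- proteomics(full.names = TRUE, pattern = "MRM")
x <- openMSfile(f)

## Number of chromatograms stored in the file.
nChrom(x)

## One index returns a single data.frame.
chr1 <- chromatogram(x, 1L)
head(chr1)

## Several indices return a list of data.frames.
chrs <- chromatograms(x, 10:12)
length(chrs)

close(x)
```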
80 |
+## Identification result access |
|
81 |
+ |
|
82 |
+Identification results are accessed via `psms`, `score` and |
|
83 |
+`modifications`. `psms` and `score` return detailed information on |
|
84 |
+each peptide-spectrum match (PSM) and its scores. `modifications` |
|
85 |
+returns the details of each modification found in the peptides. |
|
86 |
+ |
|
87 |
+## Metadata access |
|
88 |
+ |
|
89 |
+**Run metadata** is available via several functions such as |
|
90 |
+`instrumentInfo()` or `runInfo()`. The individual fields can be |
|
91 |
+accessed via dedicated accessors, e.g. `detector()`. |
|
92 |
+ |
|
93 |
+**Spectrum metadata** is available via `header()`, which will return a |
|
94 |
+list (for single scans) or a data.frame with information such as the |
|
95 |
+`basePeakMZ`, `peaksCount`, ... or, for higher-order MS, the `msLevel` |
|
96 |
+and precursor information. |
|
97 |
+ |
|
98 |
+**Identification metadata** is available via `mzidInfo()`, which will |
|
99 |
+return a list with information such as the `software`, |
|
100 |
+`ModificationSearched`, `enzymes`, `SpectraSource` and other |
|
101 |
+information for this identification result. |
|
102 |
+ |
|
103 |
+The availability of this metadata cannot always be guaranteed, and |
|
104 |
+depends on the MS software which converted the data. |
|
105 |
+ |
|
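
The spectrum metadata described above can be used to select scans before any peak data is read. The sketch below is a minimal example under the assumption that the header of the `msdata` threonine file contains the `msLevel`, `acquisitionNum`, `retentionTime` and `precursorMZ` columns. As noted, the fields that are actually filled in depend on the software that produced the file.

```r
library(mzR)
library(msdata)

mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML",
                     package = "msdata")
ms <- openMSfile(mzxml)

## Header for all scans: one row per spectrum.
hd <- header(ms)

## Number of spectra at each MS level.
table(hd$msLevel)

## Keep only the MS2 scans and inspect their precursor information.
ms2 <- hd[hd$msLevel == 2, ]
head(ms2[, c("acquisitionNum", "retentionTime", "precursorMZ")])

close(ms)
```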
106 |
+# Example |
|
107 |
+ |
|
108 |
+## `mzXML`/`mzML`/`mzData` files |
|
109 |
+ |
|
110 |
+The following short example reads data from a mass spectrometer. |
|
111 |
+First open the file. |
|
112 |
+ |
|
113 |
+```{r openraw} |
|
114 |
+library(mzR) |
|
115 |
+library(msdata) |
|
116 |
+ |
|
117 |
+mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", |
|
118 |
+ package = "msdata") |
|
119 |
+aa <- openMSfile(mzxml) |
|
120 |
+``` |
|
121 |
+ |
|
122 |
+We can obtain different kinds of header information. |
|
123 |
+ |
|
124 |
+```{r get header information} |
|
125 |
+runInfo(aa) |
|
126 |
+instrumentInfo(aa) |
|
127 |
+header(aa,1) |
|
128 |
+``` |
|
129 |
+ |
|
130 |
+Read a single spectrum from the file. |
|
131 |
+ |
|
132 |
+```{r plotspectrum} |
|
133 |
+pl <- peaks(aa,10) |
|
134 |
+peaksCount(aa,10) |
|
135 |
+head(pl) |
|
136 |
+plot(pl[,1], pl[,2], type="h", lwd=1) |
|
137 |
+``` |
|
138 |
+ |
|
139 |
+One should always close the file when it is no longer needed. This will |
|
140 |
+release the memory of cached content. |
|
141 |
+ |
|
142 |
+```{r close the file} |
|
143 |
+close(aa) |
|
144 |
+``` |
|
145 |
+ |
|
146 |
+## `mzIdentML` files |
|
147 |
+ |
|
148 |
+You can use `openIDfile` to read an `mzIdentML` file (version 1.1), |
|
149 |
+which uses the pwiz backend. |
|
150 |
+ |
|
151 |
+```{r openid} |
|
152 |
+library(mzR) |
|
153 |
+library(msdata) |
|
154 |
+ |
|
155 |
+file <- system.file("mzid", "Tandem.mzid.gz", package="msdata") |
|
156 |
+x <- openIDfile(file) |
|
157 |
+``` |
|
158 |
+ |
|
159 |
+The `mzidInfo` function returns general information about this |
|
160 |
+identification result. |
|
161 |
+ |
|
162 |
+```{r metadata} |
|
163 |
+mzidInfo(x) |
|
164 |
+``` |
|
165 |
+ |
|
166 |
+`psms` will return the detailed information on each |
|
167 |
+peptide-spectrum-match, including `spectrumID`, `chargeState`, |
|
168 |
+`sequence`, `modNum` and others. |
|
169 |
+ |
|
170 |
+```{r psms0} |
|
171 |
+p <- psms(x) |
|
172 |
+colnames(p) |
|
173 |
+``` |
|
174 |
+ |
|
175 |
+Modification information can be accessed using `modifications`, |
|
176 |
+which will return the `spectrumID`, `sequence`, `name`, `mass` and |
|
177 |
+`location`. |
|
178 |
+ |
|
179 |
+```{r psms1} |
|
180 |
+m <- modifications(x) |
|
181 |
+head(m) |
|
182 |
+``` |
|
183 |
+ |
|
184 |
+Since different software uses different scoring functions, we |
|
185 |
+provide a `score` function to extract the scores for each PSM. It returns a |
|
186 |
+data.frame with different columns depending on the software that generated |
|
187 |
+this file. |
|
188 |
+ |
|
189 |
+```{r psms2} |
|
190 |
+scr <- score(x) |
|
191 |
+colnames(scr) |
|
192 |
+``` |
|
193 |
+ |
|
194 |
+# Future plans |
|
195 |
+ |
|
196 |
+Support for other file formats provided by HUPO, such as `mzQuantML` for |
|
197 |
+quantitative data [@Walzer:2013], may be added in the future. |
|
198 |
+ |
|
199 |
+# Session information {#sec:sessionInfo} |
|
200 |
+ |
|
201 |
+```{r sessioninfo, echo=FALSE} |
|
202 |
+sessionInfo() |
|
203 |
+``` |
0 | 204 |
deleted file mode 100644 |
... | ... |
@@ -1,228 +0,0 @@ |
1 |
-%\VignetteEngine{knitr::knitr} |
|
2 |
-%\VignetteIndexEntry{Accessin raw mass spectrometry and identification data} |
|
3 |
-%\VignetteKeywords{mzXML, mzData, netCDF, mzML, mzIdentML, mass spectrometry, proteomics, metabolomics} |
|
4 |
-%\VignettePackage{mzR} |
|
5 |
- |
|
6 |
-\documentclass{article} |
|
7 |
- |
|
8 |
-<<style, eval=TRUE, echo=FALSE, results="asis">>= |
|
9 |
-BiocStyle::latex() |
|
10 |
-@ |
|
11 |
- |
|
12 |
-\title{A parser for raw and identification mass-spectrometry data} |
|
13 |
- |
|
14 |
-\author{Bernd Fischer\footnote{\email{bernd.fischer@embl.de}} \\ |
|
15 |
- Steffen Neumann\footnote{\email{sneumann@ipb-halle.de}} \\ |
|
16 |
- Laurent Gatto\footnote{\email{lg390@cam.ac.uk}}\\ |
|
17 |
- Qiang Kou\footnote{\email{qkou@umail.iu.edu}}} |
|
18 |
- |
|
19 |
- |
|
20 |
-\begin{document} |
|
21 |
- |
|
22 |
- |
|
23 |
-\maketitle |
|
24 |
- |
|
25 |
-\tableofcontents |
|
26 |
- |
|
27 |
-\section{Introduction} |
|
28 |
- |
|
29 |
-The \Biocpkg{mzR} package aims at providing a common interface to |
|
30 |
-several mass spectrometry data formats, namely \texttt{mzData} |
|
31 |
-\cite{Orchard2007}, \texttt{mzXML} \cite{Pedrioli2004}, \texttt{mzML} |
|
32 |
-\cite{Martens2010} for raw data, and \texttt{mzIdentML} |
|
33 |
-\cite{Jones2012}, somewhat similar to the Bioconductor package affyio |
|
34 |
-for affymetrix raw data. No processing is done in \Biocpkg{mzR}, which |
|
35 |
-is left to packages such as \Biocpkg{XCMS} \cite{Smith:2006, |
|
36 |
- Tautenhahn:2008} or \Biocpkg{MSnbase} \cite{Gatto:2012}. |
|
37 |
- |
|
38 |
-\bigskip |
|
39 |
- |
|
40 |
-Most importantly, access to the data should be fast and memory |
|
41 |
-efficient. This is made possible by allowing on-disk random file |
|
42 |
-access, i.e. retrieving specific data of interest without having to |
|
43 |
-sequentially browser the full content nor loading the entire data into |
|
44 |
-memory. |
|
45 |
- |
|
46 |
-The actual work of reading and parsing the data files is handled by |
|
47 |
-the included C/C++ libraries or ``backends''. The \texttt{mzRramp} |
|
48 |
-RAMP parser, written at the Institute for Systems Biology (ISB) is a |
|
49 |
-fast and lightweight parser in pure C. Later, it gained support for |
|
50 |
-the \texttt{mzData} format. The C++ reference implementation for the |
|
51 |
-\texttt{mzML} is the proteowizard library \cite{Kessner08} (pwiz in |
|
52 |
-short), which in turn makes use of the boost C++ |
|
53 |
-(\url{http://www.boost.org/}) library. RAMP is able to access |
|
54 |
-\texttt{mzML} files by calling pwiz methods. More recently, the |
|
55 |
-proteowizard\footnote{\url{http://proteowizard.sourceforge.net/}} |
|
56 |
-\cite{Chambers2012} has been fully integrated using the |
|
57 |
-\texttt{mzRpwiz} backend for raw data. The \texttt{mzRnetCDF} backend |
|
58 |
-provides support to \texttt{CDF}-based formats. Finally, the |
|
59 |
-\texttt{mzRident} backend is available to access identification data |
|
60 |
-(\texttt{mzIdentML}) through pwiz. |
|
61 |
- |
|
62 |
-\warning{It is anticipated to switch to the \texttt{mzRpwiz} backend |
|
63 |
- in Bioconductor 3.1. We advise users and developers to test it and |
|
64 |
- report any issues on the github issue tracker |
|
65 |
- \url{https://github.com/sneumann/mzR/issues}.} |
|
66 |
- |
|
67 |
-The \Biocpkg{mzR} package is in essence a collection of wrappers to |
|
68 |
-the C++ code, and benefits from the C++ interface provided through the |
|
69 |
-Rcpp package \cite{Rcpp11}. |
|
70 |
- |
|
71 |
- |
|
72 |
-\section{Mass spectrometry raw data} |
|
73 |
- |
|
74 |
-All the mass spectrometry file formats are organized similarly, where |
|
75 |
-a set of metadata nodes about the run is followed by a list of spectra |
|
76 |
-with the actual masses and intensities. In addition, each of these |
|
77 |
-spectra has its own set of metadata, such as the retention time and |
|
78 |
-acquisition parameters. |
|
79 |
- |
|
80 |
-\subsection{Spectral data access} |
|
81 |
- |
|
82 |
-Access to the spectral data is done via the \Rfunction{peaks} |
|
83 |
-function. The return value is a list of two-column mass-to-charge and |
|
84 |
-intensity matrices or a single matrix if one spectrum is queried. |
|
85 |
- |
|
86 |
-\subsection{Chromatogram access} |
|
87 |
- |
|
88 |
-Access to the chromatogram(s) is done using the |
|
89 |
-\Rfunction{chromatogram} (or \Rfunction{chromatograms}) function, that |
|
90 |
-return one (or a list of) data.frames. See \texttt{?chromatogram} for |
|
91 |
-details. This functionality is only available with the \texttt{pwiz} |
|
92 |
-backend. |
|
93 |
- |
|
94 |
-\subsection{Identification result access} |
|
95 |
- |
|
96 |
-The main access to identification result is done via \Rfunction{psms}, |
|
97 |
-\Rfunction{score} and \Rfunction{modifications}. \Rfunction{psms} and |
|
98 |
-\Rfunction{score} will return the detailed information on each psm and |
|
99 |
-scores. \Rfunction{modifications} will return the details on each |
|
100 |
-modification found in peptide. |
|
101 |
- |
|
102 |
-\subsection{Metadata access} |
|
103 |
- |
|
104 |
-\paragraph{Run metadata} is available via several functions such as |
|
105 |
-\Rfunction{instrumentInfo()} or \Rfunction{runInfo()}. The individual |
|
106 |
-fields can be accessed via e.g. \Rfunction{detector()} etc. |
|
107 |
- |
|
108 |
-\paragraph{Spectrum metadata} is available via \Rfunction{header()}, |
|
109 |
-which will return a list (for single scans) or a dataframe with |
|
110 |
-information such as the \Rfunction{basePeakMZ}, |
|
111 |
-\Rfunction{peaksCount}, \ldots or, for higher-order MS the |
|
112 |
-\Rfunction{msLevel} and precursor information. |
|
113 |
- |
|
114 |
-\paragraph{Identification metadata} is available via |
|
115 |
-\Rfunction{mzidInfo()}, which will return a list with information such |
|
116 |
-as the \Rfunction{software}, \Rfunction{ModificationSearched}, |
|
117 |
-\Rfunction{enzymes}, \Rfunction{SpectraSource} and other information |
|
118 |
-for this identification result. |
|
119 |
- |
|
120 |
-\bigskip |
|
121 |
- |
|
122 |
-The availability of this metadata can not always be guaranteed, and |
|
123 |
-depends on the MS software which converted the data. |
|
124 |
- |
|
125 |
-\section{Example} |
|
126 |
- |
|
127 |
-\subsection{\texttt{mzXML}/\texttt{mzML}/\texttt{mzData} files} |
|
128 |
- |
|
129 |
-A short example sequence to read data from a mass spectrometer. |
|
130 |
-First open the file. |
|
131 |
- |
|
132 |
-<<openraw>>= |
|
133 |
-library(mzR) |
|
134 |
-library(msdata) |
|
135 |
- |
|
136 |
-mzxml <- system.file("threonine/threonine_i2_e35_pH_tree.mzXML", |
|
137 |
- package = "msdata") |
|
138 |
-aa <- openMSfile(mzxml) ## ramp, default backend |
|
139 |
-@ |
|
140 |
- |
|
141 |
-We can obtain different kind of header information. |
|
142 |
-<<get header information>>= |
|
143 |
-runInfo(aa) |
|
144 |
-instrumentInfo(aa) |
|
145 |
-header(aa,1) |
|
146 |
-@ |
|
147 |
- |
|
148 |
-Read a single spectrum from the file. |
|
149 |
-<<plotspectrum>>= |
|
150 |
-pl <- peaks(aa,10) |
|
151 |
-peaksCount(aa,10) |
|
152 |
-head(pl) |
|
153 |
-plot(pl[,1], pl[,2], type="h", lwd=1) |
|
154 |
-@ |
|
155 |
- |
|
156 |
-One should always close the file when not needed any more if you are using RAMP backend. |
|
157 |
-This will release the memory of cached content. |
|
158 |
- |
|
159 |
-<<close the file>>= |
|
160 |
-close(aa) |
|
161 |
-@ |
|
162 |
- |
|
163 |
-\subsection{\texttt{mzIdentML} files} |
|
164 |
- |
|
165 |
-You can use \Rfunction{openIDfile} to read a \texttt{mzIdentML} file |
|
166 |
-(version 1.1), which use the pwiz backend. |
|
167 |
- |
|
168 |
-<<openid>>= |
|
169 |
-library(mzR) |
|
170 |
-library(msdata) |
|
171 |
- |
|
172 |
-file <- system.file("mzid", "Tandem.mzid.gz", package="msdata") |
|
173 |
-x <- openIDfile(file) |
|
174 |
-@ |
|
175 |
- |
|
176 |
-\Rfunction{mzidInfo} function will return general information about |
|
177 |
-this identification result. |
|
178 |
- |
|
179 |
-<<metadata>>= |
|
180 |
-mzidInfo(x) |
|
181 |
-@ |
|
182 |
- |
|
183 |
-\Rfunction{psms} will return the detailed information on each |
|
184 |
-peptide-spectrum-match, include \texttt{spectrumID}, |
|
185 |
-\texttt{chargeState}, \texttt{sequence}. \texttt{modNum} and others. |
|
186 |
- |
|
187 |
-<<psms0>>= |
|
188 |
-p <- psms(x) |
|
189 |
-colnames(p) |
|
190 |
-@ |
|
191 |
- |
|
192 |
-The modifications information can be accessed using |
|
193 |
-\Rfunction{modifications}, which will return the \texttt{spectrumID}, |
|
194 |
-\texttt{sequence}, \texttt{name}, \texttt{mass} and \texttt{location}. |
|
195 |
- |
|
196 |
-<<psms1>>= |
|
197 |
-m <- modifications(x) |
|
198 |
-head(m) |
|
199 |
-@ |
|
200 |
- |
|
201 |
-Since different software will use different scoring function, we |
|
202 |
-provide a \texttt{score} to extract the scores for each psm. It will |
|
203 |
-return a data.frame with different columns depending on software |
|
204 |
-generating this file. |
|
205 |
- |
|
206 |
-<<psms2>>= |
|
207 |
-scr <- score(x) |
|
208 |
-colnames(scr) |
|
209 |
-@ |
|
210 |
- |
|
211 |
-\section{Future plans} |
|
212 |
- |
|
213 |
-Other file formats provided by HUPO, such as \texttt{mzQuantML} for |
|
214 |
-quantitative data \cite{Walzer:2013} are also possible in the future. |
|
215 |
- |
|
216 |
-\section{Session information}\label{sec:sessionInfo} |
|
217 |
- |
|
218 |
-<<label=sessioninfo, results='asis', echo=FALSE>>= |
|
219 |
-toLatex(sessionInfo()) |
|
220 |
-@ |
|
221 |
- |
|
222 |
- |
|
223 |
-\bibliography{mzR} |
|
224 |
- |
|
225 |
-\end{document} |
|
226 |
- |
|
227 |
- |
|
228 |
- |