...
|
...
|
@@ -1,498 +1,498 @@
|
1
|
|
-%\VignetteIndexEntry{MSnID Package for Handling MS/MS Identifications}
|
2
|
|
-%\VignetteDepends{BiocStyle, msmsTests, ggplot2}
|
3
|
|
-%\VignetteKeywords{Documentation}
|
4
|
|
-%\VignettePackage{MSnID}
|
5
|
|
-\documentclass[11pt]{article}
|
6
|
|
-
|
7
|
|
-
|
8
|
|
-<<style, eval=TRUE, echo=FALSE, results=tex>>=
|
9
|
|
-BiocStyle::latex(use.unsrturl=FALSE)
|
10
|
|
-@
|
11
|
|
-
|
12
|
|
-
|
13
|
|
-
|
14
|
|
-\usepackage[authoryear,round]{natbib}
|
15
|
|
-
|
16
|
|
-
|
17
|
|
-\title{\Rpackage{MSnID} Package for Handling MS/MS Identifications}
|
18
|
|
-\author{Vladislav A. Petyuk}
|
19
|
|
-
|
20
|
|
-\begin{document}
|
21
|
|
-\SweaveOpts{concordance=TRUE, eval=TRUE, prefix.string=graphics}
|
22
|
|
-
|
23
|
|
-\maketitle
|
24
|
|
-\tableofcontents
|
25
|
|
-
|
26
|
|
-
|
27
|
|
-
|
28
|
|
-\section{Introduction}
|
29
|
|
-MS/MS identification is a process with some uncertainty. Some peptide or
|
30
|
|
-protein to spectrum matches are true and some are not. There are ways to
|
31
|
|
-score how well the peptide/protein fragmentation pattern observed in
|
32
|
|
-MS/MS spectrum matches the theoretical amino acid sequence. Other ways to
|
33
|
|
-assess confidence of identification are:
|
34
|
|
-\begin{enumerate}
|
35
|
|
- \item the difference in theoretical and experimental masses
|
36
|
|
- \item frequency of observation (true identifications tend to be more
|
37
|
|
- consistent)
|
38
|
|
- \item peptide sequence properties (only in the case of technique involving
|
39
|
|
- protein digestion) such as missed cleavages or presence of cleavages not
|
40
|
|
- typical for a given protease and chemical reagent.
|
41
|
|
-\end{enumerate}
|
42
|
|
-A typical and currently most reliable way to quantify uncertainty in the list
|
43
|
|
-of identify spectra, peptides or proteins relies on so-called decoy database.
|
44
|
|
-For bottom-up (i.e. involving protein digestion) approaches a common way to
|
45
|
|
-construct a decoy database is simple inversion of protein amino-acid sequences.
|
46
|
|
-The other approach commonly used in top-down (that is intact protein)
|
47
|
|
-approaches is based on shuffling of amino acids within protein sequence.
|
48
|
|
-Typically normal and decoy sequences are concatenated into one FASTA file.
|
49
|
|
-Some software tools (e.g. MS-GF+) perform the task of constructing and
|
50
|
|
-appending the decoy sequence internally on the fly. If the spectrum matches
|
51
|
|
-to normal protein sequence it can be true or false match. Matches to decoy
|
52
|
|
-part of the database are false only (excluding the palindromes). Therefore
|
53
|
|
-the false discovery rate (FDR) of identifications can be estimated as ratio
|
54
|
|
-of hits to decoy over normal parts of the protein sequence database.
|
55
|
|
-\\
|
56
|
|
-There are multiple levels of identification that FDR can be estimated for.
|
57
|
|
-First, is at the level of peptide/protein-to-spectrum matches. Second is at
|
58
|
|
-the level of unique peptide sequences. Note, true peptides tend to be
|
59
|
|
-identified by more then one spectrum. False peptide tend to be sporadic.
|
60
|
|
-Therefore, after collapsing the redundant peptide identifications from
|
61
|
|
-multiple spectra to the level of unique peptide sequence, the FDR typically
|
62
|
|
-increases. The extend of FDR increase depends on the type and complexity of
|
63
|
|
-the sample. The same trend is true for estimating the identification FDR at
|
64
|
|
-the protein level. True proteins tend to be identified with multiple peptides,
|
65
|
|
-while false protein identifications are commonly covered only by one peptide.
|
66
|
|
-Therefore FDR estimate tend to be even higher for protein level compare to
|
67
|
|
-peptide level.
|
68
|
|
-\\
|
69
|
|
-The estimation of the FDR is also affected by the number of LC-MS (runs)
|
70
|
|
-datasets in the experiment. Again, true identifications tend to be more
|
71
|
|
-consistent from run to run, while false are sporadic. After collapsing the
|
72
|
|
-redundancy across the runs, the number of true identification reduces much
|
73
|
|
-stronger compare to false identifications. Therefore, the peptide and
|
74
|
|
-protein FDR estimates need to be re-evaluated.
|
75
|
|
-\\
|
76
|
|
-The main objective of the MSnID package is to provide convenience tools for
|
77
|
|
-handling tasks on estimation of FDR, defining and optimizing the filtering
|
78
|
|
-criteria and ensuring confidence in MS/MS identification data. The user can
|
79
|
|
-specify the criteria for filtering the data (e.g. goodness or p-value of
|
80
|
|
-matching of experimental and theoretical fragmentation mass spectrum,
|
81
|
|
-deviation of theoretical from experimentally measured mass, presence of
|
82
|
|
-missed cleavages in the peptide sequence, etc), evaluate the performance of
|
83
|
|
-the filter judging by FDRs at spectrum, peptide and protein levels, and
|
84
|
|
-finally optimize the filter to achieve the maximum number of identifications
|
85
|
|
-while not exceeding maximally allowed FDR upper threshold.
|
86
|
|
-
|
87
|
|
-
|
88
|
|
-\section{Starting the project}
|
89
|
|
-First, the \Robject{MSnID} object has to be initialized. The main argument is
|
90
|
|
-path to the working directory. This directory will be used for storing cached
|
91
|
|
-analysis results. Caching/memoisation mechanism is based on \CRANpkg{R.cache}.
|
92
|
|
-<<>>=
|
93
|
|
-library("MSnID")
|
94
|
|
-msnid <- MSnID(".")
|
95
|
|
-@
|
96
|
|
-
|
97
|
|
-\section{Reading MS/MS data}
|
98
|
|
-One way to read in peptide/protein to MS/MS spectrum matching (PSM) results as a
|
99
|
|
-table from a text file and assing the \Rclass{data.frame} object.
|
100
|
|
-<<>>=
|
101
|
|
-PSMresults <- read.delim(system.file("extdata", "human_brain.txt",
|
102
|
|
- package="MSnID"),
|
103
|
|
- stringsAsFactors=FALSE)
|
104
|
|
-psms(msnid) <- PSMresults
|
105
|
|
-show(msnid)
|
106
|
|
-@
|
107
|
|
-Alternative and currently the preferred way to read MS/MS results is by
|
108
|
|
-parsing mzIdentML files (*.mzid or *.mzid.gz extensions). The \Rcode{read\_mzIDs}
|
109
|
|
-function leverages \Biocpkg{mzID} package facilities.
|
110
|
|
-<<>>=
|
111
|
|
-mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
|
112
|
|
-msnid <- read_mzIDs(msnid, mzids)
|
113
|
|
-show(msnid)
|
114
|
|
-@
|
115
|
|
-Internally PSMs stored as \CRANpkg{data.table} object.
|
116
|
|
-\\
|
117
|
|
-The example file \texttt{"c\_elegans.mzid.gz"} is based on MS-GF+ search engine.
|
118
|
|
-The \Rcode{read\_mzIDs} function reads results of any MS/MS search engine as
|
119
|
|
-long as it compliant with mzIdentML standard. In general case, use aforementioned
|
120
|
|
-\Rcode{psms<-} function.
|
121
|
|
-
|
122
|
|
-
|
123
|
|
-\section{Updating columns}
|
124
|
|
-Note, to take a full advantage of the \Biocpkg{MSnID}, the
|
125
|
|
-the following columns have to be present. Checking of columns happens
|
126
|
|
-internally.
|
127
|
|
-<<echo=FALSE>>=
|
128
|
|
-sort(MSnID:::.mustBeColumns)
|
129
|
|
-@
|
130
|
|
-Check what are the current column names in the MS/MS search results table.
|
131
|
|
-<<>>=
|
132
|
|
-names(msnid)
|
133
|
|
-@
|
134
|
|
-
|
135
|
|
-
|
136
|
|
-\section{Basic info on the \Robject{MSnID} object instance}
|
137
|
|
-Printing the \Robject{MSnID} object returns some basic information such as
|
138
|
|
-\begin{itemize}
|
139
|
|
- \item Working directory.
|
140
|
|
- \item Number of spectrum files used to generate data.
|
141
|
|
- \item Number of peptide-to-spectrum matches and corresponding FDR.
|
142
|
|
- \item Number of unique peptide sequences and corresponding FDR.
|
143
|
|
- \item Number of unique proteins or amino acid sequence accessions and
|
144
|
|
- corresponding FDR.
|
145
|
|
-\end{itemize}
|
146
|
|
-False discovery rate or FDR is defined here as a ratio of hits to decoy
|
147
|
|
-accessions to the non-decoy (normal) accessions. In terms of forward and revese
|
148
|
|
-protein sequences that would mean ratio of \#reverse/\#forward.
|
149
|
|
-While computing FDRs of PSMs and unique peptide sequences is trivial,
|
150
|
|
-definition of protein (accession) FDR is a subject for discussion in the field
|
151
|
|
-of proteomics. Here, protein (accession) FDR is computed the same way as in
|
152
|
|
-IDPicker software \cite{Zhang2007} and simply constitutes a ratio of
|
153
|
|
-unique accessions from decoy component to non-decoy component of the
|
154
|
|
-sequence database.
|
155
|
|
-<<>>=
|
156
|
|
-show(msnid)
|
157
|
|
-@
|
158
|
|
-
|
159
|
|
-
|
160
|
|
-
|
161
|
|
-\section{Analysis of peptide sequences}
|
162
|
|
-A particular properties of peptide sequences we are interested in are
|
163
|
|
-\begin{enumerate}
|
164
|
|
- \item irregular cleavages at the termini of the peptides and
|
165
|
|
- \item missing cleavage site within the peptide sequences.
|
166
|
|
-\end{enumerate}
|
167
|
|
-The default regular expressions of valid and missed cleavage
|
168
|
|
-patterns correspond to trypsin.
|
169
|
|
-Counting the number of irregular cleavage termimi (0,1 or 2) in peptides
|
170
|
|
-sequence creates a new column \texttt{numIrregCleavages}.
|
171
|
|
-<<>>=
|
172
|
|
-msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
|
173
|
|
-@
|
174
|
|
-Counting the number of missed cleavages in peptides sequence creates a new
|
175
|
|
-column \texttt{numMissCleavages}.
|
176
|
|
-<<>>=
|
177
|
|
-msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
|
178
|
|
-@
|
179
|
|
-Now the object has two more columns, \texttt{numIrregCleavages} and
|
180
|
|
-\texttt{numMissCleavages}, evidently corresponding to the number
|
181
|
|
-of termini with irregular cleavages and number of missed cleavages
|
182
|
|
-within the peptide sequence.
|
183
|
|
-<<label=missedCleavages, fig=TRUE, include=FALSE, width=9>>=
|
184
|
|
-pepCleav <- unique(psms(msnid)[,c("numMissCleavages", "isDecoy", "peptide")])
|
185
|
|
-pepCleav <- as.data.frame(table(pepCleav[,c("numMissCleavages", "isDecoy")]))
|
186
|
|
-library("ggplot2")
|
187
|
|
-ggplot(pepCleav, aes(x=numMissCleavages, y=Freq, fill=isDecoy)) +
|
188
|
|
- geom_bar(stat='identity', position='dodge') +
|
189
|
|
- ggtitle("Number of Missed Cleavages")
|
190
|
|
-@
|
191
|
|
-\begin{center}
|
192
|
|
-\includegraphics[width=0.8\textwidth]{graphics-missedCleavages}
|
193
|
|
-\end{center}
|
194
|
|
-\pagebreak[0]
|
195
|
|
-Peptide sequences, as any other column, can by accessed by
|
196
|
|
-directly using \Rcode{\$} operator. For example:
|
197
|
|
-Counting number of cysteins per peptide sequence
|
198
|
|
-<<>>==
|
199
|
|
-msnid$numCys <- sapply(lapply(strsplit(msnid$peptide,''),'==','C'),sum)
|
200
|
|
-@
|
201
|
|
-Calculating peptide lengths. Note, -4 decrements the AA count by two
|
202
|
|
-the flanking AAs and the two dots separating them from the
|
203
|
|
-actual peptide sequence.
|
204
|
|
-<<label=lengths, fig=TRUE, include=FALSE, width=9>>=
|
205
|
|
-msnid$PepLength <- nchar(msnid$peptide) - 4
|
206
|
|
-pepLen <- unique(psms(msnid)[,c("PepLength", "isDecoy", "peptide")])
|
207
|
|
-ggplot(pepLen, aes(x=PepLength, fill=isDecoy)) +
|
208
|
|
- geom_histogram(position='dodge', binwidth=3) +
|
209
|
|
- ggtitle("Distribution on of Peptide Lengths")
|
210
|
|
-@
|
211
|
|
-\begin{center}
|
212
|
|
-\includegraphics[width=0.8\textwidth]{graphics-lengths}
|
213
|
|
-\end{center}
|
214
|
|
-\pagebreak[0]
|
215
|
|
-
|
216
|
|
-
|
217
|
|
-
|
218
|
|
-\section{Trimming the data}
|
219
|
|
-The main way for trimming or filtering the data is
|
220
|
|
-\Rfunction{apply\_filter} function. The second argument can be either
|
221
|
|
-1) a string representing expression that will be evaluated in the context of
|
222
|
|
-data.frame containing MS/MS results or 2) \Rclass{MSnFilter} class object
|
223
|
|
-(explained below). Note, the reduction in FDR. Assuming that the sample
|
224
|
|
-has been digested with trypsin, the true identifications tend to be
|
225
|
|
-fully tryptic and contain fewer missed cleavages.
|
226
|
|
-Original FDRs.
|
227
|
|
-<<>>=
|
228
|
|
-show(msnid)
|
229
|
|
-@
|
230
|
|
-Leaving only fully tryptic peptides.
|
231
|
|
-<<>>=
|
232
|
|
-msnid <- apply_filter(msnid, "numIrregCleavages == 0")
|
233
|
|
-show(msnid)
|
234
|
|
-@
|
235
|
|
-Filtering out peptides with more then 2 missed cleavages.
|
236
|
|
-<<>>=
|
237
|
|
-msnid <- apply_filter(msnid, "numMissCleavages <= 2")
|
238
|
|
-show(msnid)
|
239
|
|
-@
|
240
|
|
-
|
241
|
|
-
|
242
|
|
-
|
243
|
|
-\section{Parent ion mass measurement error}
|
244
|
|
-Assuming both \Rcode{calculatedMassToCharge} and
|
245
|
|
-\Rcode{experimentalMassToCharge} are present in \Rcode{names(msnid)},
|
246
|
|
-one can access parent ion mass measurement in points per million (ppm) units.
|
247
|
|
-<<label=ppmOriginal, fig=TRUE, include=FALSE, width=9>>=
|
248
|
|
-ppm <- mass_measurement_error(msnid)
|
249
|
|
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
250
|
|
- geom_histogram(binwidth=100)
|
251
|
|
-@
|
252
|
|
-\begin{center}
|
253
|
|
-\includegraphics[width=0.8\textwidth]{graphics-ppmOriginal}
|
254
|
|
-\end{center}
|
255
|
|
-\pagebreak[0]
|
256
|
|
-Note, although the MS/MS search was done with $\pm$ 20ppm parent ion mass
|
257
|
|
-tolerance, error stretch over 1000 in ppm units. The reason is that the
|
258
|
|
-settings of the MS/MS search engine MS-GF+ (used for the analysis of this
|
259
|
|
-LC-MS dataset) fairly assumed that the instrument could have picked
|
260
|
|
-non-monoisotopic peaks of parent ion for fragmentation
|
261
|
|
-and thus considered peptides that were off by $\pm$ 1 Da
|
262
|
|
-(\textsuperscript{13}C-\textsuperscript{12}C to be exact). Similar settings
|
263
|
|
-can be found in other search engines (e.g X!Tandem).
|
264
|
|
-<<label=deltaMass, fig=TRUE, include=FALSE, width=9>>=
|
265
|
|
-dM <- with(psms(msnid),
|
266
|
|
- (experimentalMassToCharge-calculatedMassToCharge)*chargeState)
|
267
|
|
-x <- data.frame(dM, isDecoy=msnid$isDecoy)
|
268
|
|
-ggplot(x, aes(x=dM, fill=isDecoy)) +
|
269
|
|
- geom_histogram(position='stack', binwidth=0.1)
|
270
|
|
-@
|
271
|
|
-\begin{center}
|
272
|
|
-\includegraphics[width=0.8\textwidth]{graphics-deltaMass}
|
273
|
|
-\end{center}
|
274
|
|
-\pagebreak[0]
|
275
|
|
-Ideally, to avoid this problem, the MS/MS datasets have to be either aquired
|
276
|
|
-in MIPS (monoisotopic ion precurson selection) mode or preprocessed with
|
277
|
|
-DeconMSn \cite{Mayampurath2008} tools that identifies the monoisotipic peaks
|
278
|
|
-post-experimentally. The \Biocpkg{MSnID} package provide a simple
|
279
|
|
-\Rcode{correct\_peak\_selection} function that simply adds or subtracts
|
280
|
|
-the difference between \textsuperscript{13}C and \textsuperscript{12}C
|
281
|
|
-to make the error less then 1 Dalton.
|
282
|
|
-<<label=ppmCorrectedMass, fig=TRUE, include=FALSE, width=9>>=
|
283
|
|
-msnid.fixed <- correct_peak_selection(msnid)
|
284
|
|
-ppm <- mass_measurement_error(msnid.fixed)
|
285
|
|
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
286
|
|
- geom_histogram(binwidth=0.25)
|
287
|
|
-@
|
288
|
|
-\begin{center}
|
289
|
|
-\includegraphics[width=0.8\textwidth]{graphics-ppmCorrectedMass}
|
290
|
|
-\end{center}
|
291
|
|
-\pagebreak[0]
|
292
|
|
-Alternatively, one can simply apply a filter to remove any matches that
|
293
|
|
-do not fit the $\pm$ 20 ppm tolerance.
|
294
|
|
-<<label=ppmFiltered20, fig=TRUE, include=FALSE, width=9>>=
|
295
|
|
-msnid.chopped <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
|
296
|
|
-ppm <- mass_measurement_error(msnid.chopped)
|
297
|
|
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
298
|
|
- geom_histogram(binwidth=0.25)
|
299
|
|
-@
|
300
|
|
-\begin{center}
|
301
|
|
-\includegraphics[width=0.8\textwidth]{graphics-ppmFiltered20}
|
302
|
|
-\end{center}
|
303
|
|
-\pagebreak[0]
|
304
|
|
-
|
305
|
|
-For further processing we'll consider the \Rcode{msnid.chopped}
|
306
|
|
-data that ignores matches with 1 Da errors. Note, if the center of
|
307
|
|
-the histogram is significantly shifted from zero,
|
308
|
|
-\Rcode{experimentalMassToCharge} can be post-experimentally recalibrated.
|
309
|
|
-This MS/MS data was preprocessed with
|
310
|
|
-DtaRefinery tool \cite{Petyuk2010} that post-experimentally eliminates
|
311
|
|
-any systematic mass measurement error. At this point, the \Rcode{recalibrate}
|
312
|
|
-function implements the most simplistic algorithm avalable in the
|
313
|
|
-DtaRefinery tool.
|
314
|
|
-<<label=ppmRecalibrated, fig=TRUE, include=FALSE, width=9>>=
|
315
|
|
-msnid <- recalibrate(msnid.chopped)
|
316
|
|
-ppm <- mass_measurement_error(msnid)
|
317
|
|
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
318
|
|
- geom_histogram(binwidth=0.25)
|
319
|
|
-@
|
320
|
|
-\begin{center}
|
321
|
|
-\includegraphics[width=0.8\textwidth]{graphics-ppmRecalibrated}
|
322
|
|
-\end{center}
|
323
|
|
-\pagebreak[0]
|
324
|
|
-
|
325
|
|
-
|
326
|
|
-
|
327
|
|
-\section{\Robject{MSnIDFilter} object for filtering MS/MS identifications}
|
328
|
|
-The criteria that will be used for filtering the MS/MS data has to be present
|
329
|
|
-in the \Robject{MSnID} object. We will use -log10 transformed MS-GF+
|
330
|
|
-Spectrum E-value, reflecting the goodness of match experimental and
|
331
|
|
-theoretical fragmentation patterns as one the filtering criteria.
|
332
|
|
-Let's store it under the "msmsScore" name. The score density distribution
|
333
|
|
-shows that it is a good discriminant between non-decoy (red)
|
334
|
|
-and decoy hits (green).
|
335
|
|
-\\
|
336
|
|
-For alternative MS/MS search engines refer to the engine-specific manual for
|
337
|
|
-the names of parameters reflecting the quality of MS/MS spectra matching.
|
338
|
|
-Examples of such parameters are \Rcode{E-Value} for X!Tandem
|
339
|
|
-and \Rcode{XCorr} and \Rcode{$\Delta$Cn2} for SEQUEST.
|
340
|
|
-<<label=msmsScoreDistribution, fig=TRUE, include=FALSE, width=9>>=
|
341
|
|
-msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
|
342
|
|
-params <- psms(msnid)[,c("msmsScore","isDecoy")]
|
343
|
|
-ggplot(params) +
|
344
|
|
- geom_density(aes(x = msmsScore, color = isDecoy, ..count..))
|
345
|
|
-@
|
346
|
|
-\begin{center}
|
347
|
|
-\includegraphics[width=0.8\textwidth]{graphics-msmsScoreDistribution}
|
348
|
|
-\end{center}
|
349
|
|
-\pagebreak[0]
|
350
|
|
-As a second criterion we will be using the absolute mass measurement
|
351
|
|
-error (in ppm units) of the parent ion. The mass measurement errors tend to
|
352
|
|
-be small for non-decoy (enriched with real identificaiton) hits (red line) and
|
353
|
|
-is effectively uniformly distributed for decoy hits.
|
354
|
|
-<<label=absPpmDistribution, fig=TRUE, include=FALSE, width=9>>=
|
355
|
|
-msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
|
356
|
|
-params <- psms(msnid)[,c("absParentMassErrorPPM","isDecoy")]
|
357
|
|
-ggplot(params) +
|
358
|
|
- geom_density(aes(x = absParentMassErrorPPM, color = isDecoy, ..count..))
|
359
|
|
-@
|
360
|
|
-\begin{center}
|
361
|
|
-\includegraphics[width=0.8\textwidth]{graphics-absPpmDistribution}
|
362
|
|
-\end{center}
|
363
|
|
-\pagebreak[0]
|
364
|
|
-MS/MS fiters are handled by a special \Rclass{MSnIDFilter} class objects.
|
365
|
|
-Individual filtering criteria can be set by name
|
366
|
|
-(that is present in \Rcode{names(msnid)}), comparison operator (>, <, = , ...)
|
367
|
|
-defining if we should retain hits with higher or lower given the threshold and
|
368
|
|
-finally the threshold value itself.
|
369
|
|
-<<>>=
|
370
|
|
-filtObj <- MSnIDFilter(msnid)
|
371
|
|
-filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
|
372
|
|
-filtObj$msmsScore <- list(comparison=">", threshold=10.0)
|
373
|
|
-show(filtObj)
|
374
|
|
-@
|
375
|
|
-Let's evaluate the performace of the filter at three different levels of
|
376
|
|
-confidence assessment.
|
377
|
|
-<<>>=
|
378
|
|
-evaluate_filter(msnid, filtObj, level="PSM")
|
379
|
|
-evaluate_filter(msnid, filtObj, level="peptide")
|
380
|
|
-evaluate_filter(msnid, filtObj, level="accession")
|
381
|
|
-@
|
382
|
|
-
|
383
|
|
-
|
384
|
|
-\section{Optimizing the MS/MS filter to achieve the maximum number of
|
385
|
|
-identifications within a given FDR upper limit threshold}
|
386
|
|
-The threshold values in the example above are not necessarily optimal and set
|
387
|
|
-just be in the range of probable values. Filters can be optimized to ensure
|
388
|
|
-maximum number of identifications (peptide-to-spectrum matches,
|
389
|
|
-unique peptide sequences or proteins) within a given FDR upper limit.
|
390
|
|
-\\
|
391
|
|
-First, the filter can be optimized simply by stepping through
|
392
|
|
-individual parameters and their combinations. The idea has been described in
|
393
|
|
-\cite{Piehowski2013a}. The resulting \Robject{MSnIDFilter} object can be
|
394
|
|
-used for final data filtering or can be used as a good starting parameters for
|
395
|
|
-follow-up refining optimizations with more advanced algorithms.
|
396
|
|
-<<>>=
|
397
|
|
-filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
|
398
|
|
- method="Grid", level="peptide",
|
399
|
|
- n.iter=500)
|
400
|
|
-show(filtObj.grid)
|
401
|
|
-@
|
402
|
|
-%# (absParentMassErrorPPM < 2) & (msmsScore > 7.8)
|
403
|
|
-
|
404
|
|
-The resulting \Rcode{filtObj.grid} can be further fine tuned with such
|
405
|
|
-optimization routines as simulated annealing or Nelder-Mead optimization.
|
406
|
|
-<<>>=
|
407
|
|
-filtObj.nm <- optimize_filter(filtObj.grid, msnid, fdr.max=0.01,
|
408
|
|
- method="Nelder-Mead", level="peptide",
|
409
|
|
- n.iter=500)
|
410
|
|
-show(filtObj.nm)
|
411
|
|
-@
|
412
|
|
-%# (absParentMassErrorPPM < 3) & (msmsScore > 7.8)
|
413
|
|
-
|
414
|
|
-Let's compare the original (good guess) and optimized fileters. Obviously the
|
415
|
|
-latter yields much more peptide identifications, while not exceeding
|
416
|
|
-the maximally allowed FDR threshold of 1%.
|
417
|
|
-<<>>=
|
418
|
|
-evaluate_filter(msnid, filtObj, level="peptide")
|
419
|
|
-evaluate_filter(msnid, filtObj.nm, level="peptide")
|
420
|
|
-@
|
421
|
|
-Finally we'll apply the optimized filter to proceed with further
|
422
|
|
-steps in the analysis pipeline.
|
423
|
|
-<<>>=
|
424
|
|
-msnid <- apply_filter(msnid, filtObj.nm)
|
425
|
|
-show(msnid)
|
426
|
|
-@
|
427
|
|
-Identifications that matched decoy and contaminant protein sequences can be
|
428
|
|
-removed by providing filters in the forms of text strings that will be
|
429
|
|
-evaluated in the context of PSM table.
|
430
|
|
-<<>>=
|
431
|
|
-msnid <- apply_filter(msnid, "isDecoy == FALSE")
|
432
|
|
-show(msnid)
|
433
|
|
-msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
|
434
|
|
-show(msnid)
|
435
|
|
-@
|
436
|
|
-
|
437
|
|
-
|
438
|
|
-
|
439
|
|
-\section{Data output and interface with other Bioconductor packages}
|
440
|
|
-One can extract the entire PSMs tables as
|
441
|
|
-\Rcode{data.frame} or \Rcode{data.table}
|
442
|
|
-<<>>=
|
443
|
|
-psm.df <- psms(msnid)
|
444
|
|
-psm.dt <- as(msnid, "data.table")
|
445
|
|
-@
|
446
|
|
-If only interested in the non-redundant list of confidently identified
|
447
|
|
-peptides or proteins
|
448
|
|
-<<>>=
|
449
|
|
-peps <- peptides(msnid)
|
450
|
|
-head(peps)
|
451
|
|
-prots <- accessions(msnid)
|
452
|
|
-head(prots)
|
453
|
|
-prots <- proteins(msnid) # may be more intuitive then accessions
|
454
|
|
-head(prots)
|
455
|
|
-@
|
456
|
|
-The \Biocpkg{MSnID} package is aimed at providing convenience functionality
|
457
|
|
-to handle MS/MS identifications. Quantification \textit{per se} is outside of
|
458
|
|
-the scope of the package. The only type of quantitation that can be seamlessly
|
459
|
|
-tied with MS/MS identification analysis is so-called
|
460
|
|
-\emph{spectral counting} approach. In such an approach a peptide abundance is
|
461
|
|
-considered to be directly proportional to the number of matched MS/MS spectra.
|
462
|
|
-In its turn protein abunance is proportional to the sum of the number of
|
463
|
|
-spectra of the matching peptides. The \Rclass{MSnID} object can be converted
|
464
|
|
-to an \Rclass{MSnSet} object defined in \Biocpkg{MSnbase} that extends generic
|
465
|
|
-Bioconductor \Rclass{eSet} class to quantitative proteomics data.
|
466
|
|
-The spectral count data can be analyzed with \Biocpkg{msmsEDA},
|
467
|
|
-\Biocpkg{msmsTests} or \Biocpkg{DESeq} packages.
|
468
|
|
-<<label=convertingToMSnSet>>=
|
469
|
|
-msnset <- as(msnid, "MSnSet")
|
470
|
|
-library("MSnbase")
|
471
|
|
-head(fData(msnset))
|
472
|
|
-head(exprs(msnset))
|
473
|
|
-@
|
474
|
|
-Note, the convertion from \Robject{MSnID} to \Robject{MSnSet} uses peptides
|
475
|
|
-as features. The number of redundant peptide observations represent so-called
|
476
|
|
-spectral count that can be used for rough quantitative analysis. Summing of
|
477
|
|
-all of the peptide counts to a proteins level can be done with
|
478
|
|
-\Rcode{combineFeatures} function from \Biocpkg{MSnbase} package.
|
479
|
|
-<<>>=
|
480
|
|
-msnset <- combineFeatures(msnset,
|
481
|
|
- fData(msnset)$accession,
|
482
|
|
- redundancy.handler="unique",
|
483
|
|
- fun="sum",
|
484
|
|
- cv=FALSE)
|
485
|
|
-head(fData(msnset))
|
486
|
|
-head(exprs(msnset))
|
487
|
|
-@
|
488
|
|
-
|
489
|
|
-
|
490
|
|
-% clean-up
|
491
|
|
-<<eval=TRUE, echo=FALSE, results=hide>>=
|
492
|
|
-unlink(".Rcache", recursive=TRUE)
|
493
|
|
-@
|
494
|
|
-
|
495
|
|
-
|
496
|
|
-\bibliographystyle{plainnat}
|
497
|
|
-\bibliography{msnid}
|
498
|
|
-\end{document}
|
|
1
|
+%\VignetteIndexEntry{MSnID Package for Handling MS/MS Identifications}
|
|
2
|
+%\VignetteDepends{BiocStyle, msmsTests, ggplot2}
|
|
3
|
+%\VignetteKeywords{Documentation}
|
|
4
|
+%\VignettePackage{MSnID}
|
|
5
|
+\documentclass[11pt]{article}
|
|
6
|
+
|
|
7
|
+
|
|
8
|
+<<style, eval=TRUE, echo=FALSE, results=tex>>=
|
|
9
|
+BiocStyle::latex(use.unsrturl=FALSE)
|
|
10
|
+@
|
|
11
|
+
|
|
12
|
+
|
|
13
|
+
|
|
14
|
+\usepackage[authoryear,round]{natbib}
|
|
15
|
+
|
|
16
|
+
|
|
17
|
+\title{\Rpackage{MSnID} Package for Handling MS/MS Identifications}
|
|
18
|
+\author{Vladislav A. Petyuk}
|
|
19
|
+
|
|
20
|
+\begin{document}
|
|
21
|
+\SweaveOpts{concordance=TRUE, eval=TRUE, prefix.string=graphics}
|
|
22
|
+
|
|
23
|
+\maketitle
|
|
24
|
+\tableofcontents
|
|
25
|
+
|
|
26
|
+
|
|
27
|
+
|
|
28
|
+\section{Introduction}
|
|
29
|
+MS/MS identification is a process with some uncertainty. Some peptide or
|
|
30
|
+protein to spectrum matches are true and some are not. There are ways to
|
|
31
|
+score how well the peptide/protein fragmentation pattern observed in
|
|
32
|
+MS/MS spectrum matches the theoretical amino acid sequence. Other ways to
|
|
33
|
+assess confidence of identification are:
|
|
34
|
+\begin{enumerate}
|
|
35
|
+ \item the difference in theoretical and experimental masses
|
|
36
|
+ \item frequency of observation (true identifications tend to be more
|
|
37
|
+ consistent)
|
|
38
|
+ \item peptide sequence properties (only in the case of technique involving
|
|
39
|
+ protein digestion) such as missed cleavages or presence of cleavages not
|
|
40
|
+ typical for a given protease and chemical reagent.
|
|
41
|
+\end{enumerate}
|
|
42
|
+A typical and currently most reliable way to quantify uncertainty in the list
|
|
43
|
+of identify spectra, peptides or proteins relies on so-called decoy database.
|
|
44
|
+For bottom-up (i.e. involving protein digestion) approaches a common way to
|
|
45
|
+construct a decoy database is simple inversion of protein amino-acid sequences.
|
|
46
|
+The other approach commonly used in top-down (that is intact protein)
|
|
47
|
+approaches is based on shuffling of amino acids within protein sequence.
|
|
48
|
+Typically normal and decoy sequences are concatenated into one FASTA file.
|
|
49
|
+Some software tools (e.g. MS-GF+) perform the task of constructing and
|
|
50
|
+appending the decoy sequence internally on the fly. If the spectrum matches
|
|
51
|
+to normal protein sequence it can be true or false match. Matches to decoy
|
|
52
|
+part of the database are false only (excluding the palindromes). Therefore
|
|
53
|
+the false discovery rate (FDR) of identifications can be estimated as ratio
|
|
54
|
+of hits to decoy over normal parts of the protein sequence database.
|
|
55
|
+\\
|
|
56
|
+There are multiple levels of identification that FDR can be estimated for.
|
|
57
|
+First, is at the level of peptide/protein-to-spectrum matches. Second is at
|
|
58
|
+the level of unique peptide sequences. Note, true peptides tend to be
|
|
59
|
+identified by more then one spectrum. False peptide tend to be sporadic.
|
|
60
|
+Therefore, after collapsing the redundant peptide identifications from
|
|
61
|
+multiple spectra to the level of unique peptide sequence, the FDR typically
|
|
62
|
+increases. The extend of FDR increase depends on the type and complexity of
|
|
63
|
+the sample. The same trend is true for estimating the identification FDR at
|
|
64
|
+the protein level. True proteins tend to be identified with multiple peptides,
|
|
65
|
+while false protein identifications are commonly covered only by one peptide.
|
|
66
|
+Therefore FDR estimate tend to be even higher for protein level compare to
|
|
67
|
+peptide level.
|
|
68
|
+\\
|
|
69
|
+The estimation of the FDR is also affected by the number of LC-MS (runs)
|
|
70
|
+datasets in the experiment. Again, true identifications tend to be more
|
|
71
|
+consistent from run to run, while false are sporadic. After collapsing the
|
|
72
|
+redundancy across the runs, the number of true identification reduces much
|
|
73
|
+stronger compare to false identifications. Therefore, the peptide and
|
|
74
|
+protein FDR estimates need to be re-evaluated.
|
|
75
|
+\\
|
|
76
|
+The main objective of the MSnID package is to provide convenience tools for
|
|
77
|
+handling tasks on estimation of FDR, defining and optimizing the filtering
|
|
78
|
+criteria and ensuring confidence in MS/MS identification data. The user can
|
|
79
|
+specify the criteria for filtering the data (e.g. goodness or p-value of
|
|
80
|
+matching of experimental and theoretical fragmentation mass spectrum,
|
|
81
|
+deviation of theoretical from experimentally measured mass, presence of
|
|
82
|
+missed cleavages in the peptide sequence, etc), evaluate the performance of
|
|
83
|
+the filter judging by FDRs at spectrum, peptide and protein levels, and
|
|
84
|
+finally optimize the filter to achieve the maximum number of identifications
|
|
85
|
+while not exceeding maximally allowed FDR upper threshold.
|
|
86
|
+
|
|
87
|
+
|
|
88
|
+\section{Starting the project}
|
|
89
|
+First, the \Robject{MSnID} object has to be initialized. The main argument is
|
|
90
|
+path to the working directory. This directory will be used for storing cached
|
|
91
|
+analysis results. Caching/memoisation mechanism is based on \CRANpkg{R.cache}.
|
|
92
|
+<<>>=
|
|
93
|
+library("MSnID")
|
|
94
|
+msnid <- MSnID(".")
|
|
95
|
+@
|
|
96
|
+
|
|
97
|
+\section{Reading MS/MS data}
|
|
98
|
+One way to read in peptide/protein to MS/MS spectrum matching (PSM) results as a
|
|
99
|
+table from a text file and assing the \Rclass{data.frame} object.
|
|
100
|
+<<>>=
|
|
101
|
+PSMresults <- read.delim(system.file("extdata", "human_brain.txt",
|
|
102
|
+ package="MSnID"),
|
|
103
|
+ stringsAsFactors=FALSE)
|
|
104
|
+psms(msnid) <- PSMresults
|
|
105
|
+show(msnid)
|
|
106
|
+@
|
|
107
|
+Alternative and currently the preferred way to read MS/MS results is by
|
|
108
|
+parsing mzIdentML files (*.mzid or *.mzid.gz extensions). The \Rcode{read\_mzIDs}
|
|
109
|
+function leverages \Biocpkg{mzID} package facilities.
|
|
110
|
+<<>>=
|
|
111
|
+mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
|
|
112
|
+msnid <- read_mzIDs(msnid, mzids)
|
|
113
|
+show(msnid)
|
|
114
|
+@
|
|
115
|
+Internally PSMs stored as \CRANpkg{data.table} object.
|
|
116
|
+\\
|
|
117
|
+The example file \texttt{"c\_elegans.mzid.gz"} is based on MS-GF+ search engine.
|
|
118
|
+The \Rcode{read\_mzIDs} function reads results of any MS/MS search engine as
|
|
119
|
+long as it compliant with mzIdentML standard. In general case, use aforementioned
|
|
120
|
+\Rcode{psms<-} function.
|
|
121
|
+
|
|
122
|
+
|
|
123
|
+\section{Updating columns}
|
|
124
|
+Note, to take a full advantage of the \Biocpkg{MSnID}, the
|
|
125
|
+the following columns have to be present. Checking of columns happens
|
|
126
|
+internally.
|
|
127
|
+<<echo=FALSE>>=
|
|
128
|
+sort(MSnID:::.mustBeColumns)
|
|
129
|
+@
|
|
130
|
+Check what are the current column names in the MS/MS search results table.
|
|
131
|
+<<>>=
|
|
132
|
+names(msnid)
|
|
133
|
+@
|
|
134
|
+
|
|
135
|
+
|
|
136
|
+\section{Basic info on the \Robject{MSnID} object instance}
|
|
137
|
+Printing the \Robject{MSnID} object returns some basic information such as
|
|
138
|
+\begin{itemize}
|
|
139
|
+ \item Working directory.
|
|
140
|
+ \item Number of spectrum files used to generate data.
|
|
141
|
+ \item Number of peptide-to-spectrum matches and corresponding FDR.
|
|
142
|
+ \item Number of unique peptide sequences and corresponding FDR.
|
|
143
|
+ \item Number of unique proteins or amino acid sequence accessions and
|
|
144
|
+ corresponding FDR.
|
|
145
|
+\end{itemize}
|
|
146
|
+False discovery rate or FDR is defined here as a ratio of hits to decoy
|
|
147
|
+accessions to the non-decoy (normal) accessions. In terms of forward and revese
|
|
148
|
+protein sequences that would mean ratio of \#reverse/\#forward.
|
|
149
|
+While computing FDRs of PSMs and unique peptide sequences is trivial,
|
|
150
|
+definition of protein (accession) FDR is a subject for discussion in the field
|
|
151
|
+of proteomics. Here, protein (accession) FDR is computed the same way as in
|
|
152
|
+IDPicker software \cite{Zhang2007} and simply constitutes a ratio of
|
|
153
|
+unique accessions from decoy component to non-decoy component of the
|
|
154
|
+sequence database.
|
|
155
|
+<<>>=
|
|
156
|
+show(msnid)
|
|
157
|
+@
|
|
158
|
+
|
|
159
|
+
|
|
160
|
+
|
|
161
|
+\section{Analysis of peptide sequences}
|
|
162
|
+A particular properties of peptide sequences we are interested in are
|
|
163
|
+\begin{enumerate}
|
|
164
|
+ \item irregular cleavages at the termini of the peptides and
|
|
165
|
+ \item missing cleavage site within the peptide sequences.
|
|
166
|
+\end{enumerate}
|
|
167
|
+The default regular expressions of valid and missed cleavage
|
|
168
|
+patterns correspond to trypsin.
|
|
169
|
+Counting the number of irregular cleavage termimi (0,1 or 2) in peptides
|
|
170
|
+sequence creates a new column \texttt{numIrregCleavages}.
|
|
171
|
+<<>>=
|
|
172
|
+msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
|
|
173
|
+@
|
|
174
|
+Counting the number of missed cleavages in peptides sequence creates a new
|
|
175
|
+column \texttt{numMissCleavages}.
|
|
176
|
+<<>>=
|
|
177
|
+msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
|
|
178
|
+@
|
|
179
|
+Now the object has two more columns, \texttt{numIrregCleavages} and
|
|
180
|
+\texttt{numMissCleavages}, evidently corresponding to the number
|
|
181
|
+of termini with irregular cleavages and number of missed cleavages
|
|
182
|
+within the peptide sequence.
|
|
183
|
+<<label=missedCleavages, fig=TRUE, include=FALSE, width=9>>=
|
|
184
|
+pepCleav <- unique(psms(msnid)[,c("numMissCleavages", "isDecoy", "peptide")])
|
|
185
|
+pepCleav <- as.data.frame(table(pepCleav[,c("numMissCleavages", "isDecoy")]))
|
|
186
|
+library("ggplot2")
|
|
187
|
+ggplot(pepCleav, aes(x=numMissCleavages, y=Freq, fill=isDecoy)) +
|
|
188
|
+ geom_bar(stat='identity', position='dodge') +
|
|
189
|
+ ggtitle("Number of Missed Cleavages")
|
|
190
|
+@
|
|
191
|
+\begin{center}
|
|
192
|
+\includegraphics[width=0.8\textwidth]{graphics-missedCleavages}
|
|
193
|
+\end{center}
|
|
194
|
+\pagebreak[0]
|
|
195
|
+Peptide sequences, as any other column, can by accessed by
|
|
196
|
+directly using \Rcode{\$} operator. For example:
|
|
197
|
+Counting number of cysteins per peptide sequence
|
|
198
|
+<<>>==
|
|
199
|
+msnid$numCys <- sapply(lapply(strsplit(msnid$peptide,''),'==','C'),sum)
|
|
200
|
+@
|
|
201
|
+Calculating peptide lengths. Note, -4 decrements the AA count by two
|
|
202
|
+the flanking AAs and the two dots separating them from the
|
|
203
|
+actual peptide sequence.
|
|
204
|
+<<label=lengths, fig=TRUE, include=FALSE, width=9>>=
|
|
205
|
+msnid$PepLength <- nchar(msnid$peptide) - 4
|
|
206
|
+pepLen <- unique(psms(msnid)[,c("PepLength", "isDecoy", "peptide")])
|
|
207
|
+ggplot(pepLen, aes(x=PepLength, fill=isDecoy)) +
|
|
208
|
+ geom_histogram(position='dodge', binwidth=3) +
|
|
209
|
+ ggtitle("Distribution on of Peptide Lengths")
|
|
210
|
+@
|
|
211
|
+\begin{center}
|
|
212
|
+\includegraphics[width=0.8\textwidth]{graphics-lengths}
|
|
213
|
+\end{center}
|
|
214
|
+\pagebreak[0]
|
|
215
|
+
|
|
216
|
+
|
|
217
|
+
|
|
218
|
+\section{Trimming the data}
|
|
219
|
+The main way for trimming or filtering the data is
|
|
220
|
+\Rfunction{apply\_filter} function. The second argument can be either
|
|
221
|
+1) a string representing expression that will be evaluated in the context of
|
|
222
|
+data.frame containing MS/MS results or 2) \Rclass{MSnFilter} class object
|
|
223
|
+(explained below). Note, the reduction in FDR. Assuming that the sample
|
|
224
|
+has been digested with trypsin, the true identifications tend to be
|
|
225
|
+fully tryptic and contain fewer missed cleavages.
|
|
226
|
+Original FDRs.
|
|
227
|
+<<>>=
|
|
228
|
+show(msnid)
|
|
229
|
+@
|
|
230
|
+Leaving only fully tryptic peptides.
|
|
231
|
+<<>>=
|
|
232
|
+msnid <- apply_filter(msnid, "numIrregCleavages == 0")
|
|
233
|
+show(msnid)
|
|
234
|
+@
|
|
235
|
+Filtering out peptides with more then 2 missed cleavages.
|
|
236
|
+<<>>=
|
|
237
|
+msnid <- apply_filter(msnid, "numMissCleavages <= 2")
|
|
238
|
+show(msnid)
|
|
239
|
+@
|
|
240
|
+
|
|
241
|
+
|
|
242
|
+
|
|
243
|
+\section{Parent ion mass measurement error}
|
|
244
|
+Assuming both \Rcode{calculatedMassToCharge} and
|
|
245
|
+\Rcode{experimentalMassToCharge} are present in \Rcode{names(msnid)},
|
|
246
|
+one can access parent ion mass measurement in points per million (ppm) units.
|
|
247
|
+<<label=ppmOriginal, fig=TRUE, include=FALSE, width=9>>=
|
|
248
|
+ppm <- mass_measurement_error(msnid)
|
|
249
|
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
|
250
|
+ geom_histogram(binwidth=100)
|
|
251
|
+@
|
|
252
|
+\begin{center}
|
|
253
|
+\includegraphics[width=0.8\textwidth]{graphics-ppmOriginal}
|
|
254
|
+\end{center}
|
|
255
|
+\pagebreak[0]
|
|
256
|
+Note, although the MS/MS search was done with $\pm$ 20ppm parent ion mass
|
|
257
|
+tolerance, error stretch over 1000 in ppm units. The reason is that the
|
|
258
|
+settings of the MS/MS search engine MS-GF+ (used for the analysis of this
|
|
259
|
+LC-MS dataset) fairly assumed that the instrument could have picked
|
|
260
|
+non-monoisotopic peaks of parent ion for fragmentation
|
|
261
|
+and thus considered peptides that were off by $\pm$ 1 Da
|
|
262
|
+(\textsuperscript{13}C-\textsuperscript{12}C to be exact). Similar settings
|
|
263
|
+can be found in other search engines (e.g X!Tandem).
|
|
264
|
+<<label=deltaMass, fig=TRUE, include=FALSE, width=9>>=
|
|
265
|
+dM <- with(psms(msnid),
|
|
266
|
+ (experimentalMassToCharge-calculatedMassToCharge)*chargeState)
|
|
267
|
+x <- data.frame(dM, isDecoy=msnid$isDecoy)
|
|
268
|
+ggplot(x, aes(x=dM, fill=isDecoy)) +
|
|
269
|
+ geom_histogram(position='stack', binwidth=0.1)
|
|
270
|
+@
|
|
271
|
+\begin{center}
|
|
272
|
+\includegraphics[width=0.8\textwidth]{graphics-deltaMass}
|
|
273
|
+\end{center}
|
|
274
|
+\pagebreak[0]
|
|
275
|
+Ideally, to avoid this problem, the MS/MS datasets have to be either aquired
|
|
276
|
+in MIPS (monoisotopic ion precurson selection) mode or preprocessed with
|
|
277
|
+DeconMSn \cite{Mayampurath2008} tools that identifies the monoisotipic peaks
|
|
278
|
+post-experimentally. The \Biocpkg{MSnID} package provide a simple
|
|
279
|
+\Rcode{correct\_peak\_selection} function that simply adds or subtracts
|
|
280
|
+the difference between \textsuperscript{13}C and \textsuperscript{12}C
|
|
281
|
+to make the error less then 1 Dalton.
|
|
282
|
+<<label=ppmCorrectedMass, fig=TRUE, include=FALSE, width=9>>=
|
|
283
|
+msnid.fixed <- correct_peak_selection(msnid)
|
|
284
|
+ppm <- mass_measurement_error(msnid.fixed)
|
|
285
|
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
|
286
|
+ geom_histogram(binwidth=0.25)
|
|
287
|
+@
|
|
288
|
+\begin{center}
|
|
289
|
+\includegraphics[width=0.8\textwidth]{graphics-ppmCorrectedMass}
|
|
290
|
+\end{center}
|
|
291
|
+\pagebreak[0]
|
|
292
|
+Alternatively, one can simply apply a filter to remove any matches that
|
|
293
|
+do not fit the $\pm$ 20 ppm tolerance.
|
|
294
|
+<<label=ppmFiltered20, fig=TRUE, include=FALSE, width=9>>=
|
|
295
|
+msnid.chopped <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
|
|
296
|
+ppm <- mass_measurement_error(msnid.chopped)
|
|
297
|
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
|
298
|
+ geom_histogram(binwidth=0.25)
|
|
299
|
+@
|
|
300
|
+\begin{center}
|
|
301
|
+\includegraphics[width=0.8\textwidth]{graphics-ppmFiltered20}
|
|
302
|
+\end{center}
|
|
303
|
+\pagebreak[0]
|
|
304
|
+
|
|
305
|
+For further processing we'll consider the \Rcode{msnid.chopped}
|
|
306
|
+data that ignores matches with 1 Da errors. Note, if the center of
|
|
307
|
+the histogram is significantly shifted from zero,
|
|
308
|
+\Rcode{experimentalMassToCharge} can be post-experimentally recalibrated.
|
|
309
|
+This MS/MS data was preprocessed with
|
|
310
|
+DtaRefinery tool \cite{Petyuk2010} that post-experimentally eliminates
|
|
311
|
+any systematic mass measurement error. At this point, the \Rcode{recalibrate}
|
|
312
|
+function implements the most simplistic algorithm avalable in the
|
|
313
|
+DtaRefinery tool.
|
|
314
|
+<<label=ppmRecalibrated, fig=TRUE, include=FALSE, width=9>>=
|
|
315
|
+msnid <- recalibrate(msnid.chopped)
|
|
316
|
+ppm <- mass_measurement_error(msnid)
|
|
317
|
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
|
|
318
|
+ geom_histogram(binwidth=0.25)
|
|
319
|
+@
|
|
320
|
+\begin{center}
|
|
321
|
+\includegraphics[width=0.8\textwidth]{graphics-ppmRecalibrated}
|
|
322
|
+\end{center}
|
|
323
|
+\pagebreak[0]
|
|
324
|
+
|
|
325
|
+
|
|
326
|
+
|
|
327
|
+\section{\Robject{MSnIDFilter} object for filtering MS/MS identifications}
|
|
328
|
+The criteria that will be used for filtering the MS/MS data has to be present
|
|
329
|
+in the \Robject{MSnID} object. We will use -log10 transformed MS-GF+
|
|
330
|
+Spectrum E-value, reflecting the goodness of match experimental and
|
|
331
|
+theoretical fragmentation patterns as one the filtering criteria.
|
|
332
|
+Let's store it under the "msmsScore" name. The score density distribution
|
|
333
|
+shows that it is a good discriminant between non-decoy (red)
|
|
334
|
+and decoy hits (green).
|
|
335
|
+\\
|
|
336
|
+For alternative MS/MS search engines refer to the engine-specific manual for
|
|
337
|
+the names of parameters reflecting the quality of MS/MS spectra matching.
|
|
338
|
+Examples of such parameters are \Rcode{E-Value} for X!Tandem
|
|
339
|
+and \Rcode{XCorr} and \Rcode{$\Delta$Cn2} for SEQUEST.
|
|
340
|
+<<label=msmsScoreDistribution, fig=TRUE, include=FALSE, width=9>>=
|
|
341
|
+msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
|
|
342
|
+params <- psms(msnid)[,c("msmsScore","isDecoy")]
|
|
343
|
+ggplot(params) +
|
|
344
|
+ geom_density(aes(x = msmsScore, color = isDecoy, ..count..))
|
|
345
|
+@
|
|
346
|
+\begin{center}
|
|
347
|
+\includegraphics[width=0.8\textwidth]{graphics-msmsScoreDistribution}
|
|
348
|
+\end{center}
|
|
349
|
+\pagebreak[0]
|
|
350
|
+As a second criterion we will be using the absolute mass measurement
|
|
351
|
+error (in ppm units) of the parent ion. The mass measurement errors tend to
|
|
352
|
+be small for non-decoy (enriched with real identificaiton) hits (red line) and
|
|
353
|
+is effectively uniformly distributed for decoy hits.
|
|
354
|
+<<label=absPpmDistribution, fig=TRUE, include=FALSE, width=9>>=
|
|
355
|
+msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
|
|
356
|
+params <- psms(msnid)[,c("absParentMassErrorPPM","isDecoy")]
|
|
357
|
+ggplot(params) +
|
|
358
|
+ geom_density(aes(x = absParentMassErrorPPM, color = isDecoy, ..count..))
|
|
359
|
+@
|
|
360
|
+\begin{center}
|
|
361
|
+\includegraphics[width=0.8\textwidth]{graphics-absPpmDistribution}
|
|
362
|
+\end{center}
|
|
363
|
+\pagebreak[0]
|
|
364
|
+MS/MS fiters are handled by a special \Rclass{MSnIDFilter} class objects.
|
|
365
|
+Individual filtering criteria can be set by name
|
|
366
|
+(that is present in \Rcode{names(msnid)}), comparison operator (>, <, = , ...)
|
|
367
|
+defining if we should retain hits with higher or lower given the threshold and
|
|
368
|
+finally the threshold value itself.
|
|
369
|
+<<>>=
|
|
370
|
+filtObj <- MSnIDFilter(msnid)
|
|
371
|
+filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
|
|
372
|
+filtObj$msmsScore <- list(comparison=">", threshold=10.0)
|
|
373
|
+show(filtObj)
|
|
374
|
+@
|
|
375
|
+Let's evaluate the performace of the filter at three different levels of
|
|
376
|
+confidence assessment.
|
|
377
|
+<<>>=
|
|
378
|
+evaluate_filter(msnid, filtObj, level="PSM")
|
|
379
|
+evaluate_filter(msnid, filtObj, level="peptide")
|
|
380
|
+evaluate_filter(msnid, filtObj, level="accession")
|
|
381
|
+@
|
|
382
|
+
|
|
383
|
+
|
|
384
|
+\section{Optimizing the MS/MS filter to achieve the maximum number of
|
|
385
|
+identifications within a given FDR upper limit threshold}
|
|
386
|
+The threshold values in the example above are not necessarily optimal and set
|
|
387
|
+just be in the range of probable values. Filters can be optimized to ensure
|
|
388
|
+maximum number of identifications (peptide-to-spectrum matches,
|
|
389
|
+unique peptide sequences or proteins) within a given FDR upper limit.
|
|
390
|
+\\
|
|
391
|
+First, the filter can be optimized simply by stepping through
|
|
392
|
+individual parameters and their combinations. The idea has been described in
|
|
393
|
+\cite{Piehowski2013a}. The resulting \Robject{MSnIDFilter} object can be
|
|
394
|
+used for final data filtering or can be used as a good starting parameters for
|
|
395
|
+follow-up refining optimizations with more advanced algorithms.
|
|
396
|
+<<>>=
|
|
397
|
+filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
|
|
398
|
+ method="Grid", level="peptide",
|
|
399
|
+ n.iter=500)
|
|
400
|
+show(filtObj.grid)
|
|
401
|
+@
|
|
402
|
+%# (absParentMassErrorPPM < 2) & (msmsScore > 7.8)
|
|
403
|
+
|
|
404
|
+The resulting \Rcode{filtObj.grid} can be further fine tuned with such
|
|
405
|
+optimization routines as simulated annealing or Nelder-Mead optimization.
|
|
406
|
+<<>>=
|
|
407
|
+filtObj.nm <- optimize_filter(filtObj.grid, msnid, fdr.max=0.01,
|
|
408
|
+ method="Nelder-Mead", level="peptide",
|
|
409
|
+ n.iter=500)
|
|
410
|
+show(filtObj.nm)
|
|
411
|
+@
|
|
412
|
+%# (absParentMassErrorPPM < 3) & (msmsScore > 7.8)
|
|
413
|
+
|
|
414
|
+Let's compare the original (good guess) and optimized fileters. Obviously the
|
|
415
|
+latter yields much more peptide identifications, while not exceeding
|
|
416
|
+the maximally allowed FDR threshold of 1%.
|
|
417
|
+<<>>=
|
|
418
|
+evaluate_filter(msnid, filtObj, level="peptide")
|
|
419
|
+evaluate_filter(msnid, filtObj.nm, level="peptide")
|
|
420
|
+@
|
|
421
|
+Finally we'll apply the optimized filter to proceed with further
|
|
422
|
+steps in the analysis pipeline.
|
|
423
|
+<<>>=
|
|
424
|
+msnid <- apply_filter(msnid, filtObj.nm)
|
|
425
|
+show(msnid)
|
|
426
|
+@
|
|
427
|
+Identifications that matched decoy and contaminant protein sequences can be
|
|
428
|
+removed by providing filters in the forms of text strings that will be
|
|
429
|
+evaluated in the context of PSM table.
|
|
430
|
+<<>>=
|
|
431
|
+msnid <- apply_filter(msnid, "isDecoy == FALSE")
|
|
432
|
+show(msnid)
|
|
433
|
+msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
|
|
434
|
+show(msnid)
|
|
435
|
+@
|
|
436
|
+
|
|
437
|
+
|
|
438
|
+
|
|
439
|
+\section{Data output and interface with other Bioconductor packages}
|
|
440
|
+One can extract the entire PSMs tables as
|
|
441
|
+\Rcode{data.frame} or \Rcode{data.table}
|
|
442
|
+<<>>=
|
|
443
|
+psm.df <- psms(msnid)
|
|
444
|
+psm.dt <- as(msnid, "data.table")
|
|
445
|
+@
|
|
446
|
+If only interested in the non-redundant list of confidently identified
|
|
447
|
+peptides or proteins
|
|
448
|
+<<>>=
|
|
449
|
+peps <- peptides(msnid)
|
|
450
|
+head(peps)
|
|
451
|
+prots <- accessions(msnid)
|
|
452
|
+head(prots)
|
|
453
|
+prots <- proteins(msnid) # may be more intuitive then accessions
|
|
454
|
+head(prots)
|
|
455
|
+@
|
|
456
|
+The \Biocpkg{MSnID} package is aimed at providing convenience functionality
|
|
457
|
+to handle MS/MS identifications. Quantification \textit{per se} is outside of
|
|
458
|
+the scope of the package. The only type of quantitation that can be seamlessly
|
|
459
|
+tied with MS/MS identification analysis is so-called
|
|
460
|
+\emph{spectral counting} approach. In such an approach a peptide abundance is
|
|
461
|
+considered to be directly proportional to the number of matched MS/MS spectra.
|
|
462
|
+In its turn protein abunance is proportional to the sum of the number of
|
|
463
|
+spectra of the matching peptides. The \Rclass{MSnID} object can be converted
|
|
464
|
+to an \Rclass{MSnSet} object defined in \Biocpkg{MSnbase} that extends generic
|
|
465
|
+Bioconductor \Rclass{eSet} class to quantitative proteomics data.
|
|
466
|
+The spectral count data can be analyzed with \Biocpkg{msmsEDA},
|
|
467
|
+\Biocpkg{msmsTests} or \Biocpkg{DESeq} packages.
|
|
468
|
+<<label=convertingToMSnSet>>=
|
|
469
|
+msnset <- as(msnid, "MSnSet")
|
|
470
|
+library("MSnbase")
|
|
471
|
+head(fData(msnset))
|
|
472
|
+head(exprs(msnset))
|
|
473
|
+@
|
|
474
|
+Note, the convertion from \Robject{MSnID} to \Robject{MSnSet} uses peptides
|
|
475
|
+as features. The number of redundant peptide observations represent so-called
|
|
476
|
+spectral count that can be used for rough quantitative analysis. Summing of
|
|
477
|
+all of the peptide counts to a proteins level can be done with
|
|
478
|
+\Rcode{combineFeatures} function from \Biocpkg{MSnbase} package.
|
|
479
|
+<<>>=
|
|
480
|
+msnset <- combineFeatures(msnset,
|
|
481
|
+ fData(msnset)$accession,
|
|
482
|
+ redundancy.handler="unique",
|
|
483
|
+ fun="sum",
|
|
484
|
+ cv=FALSE)
|
|
485
|
+head(fData(msnset))
|
|
486
|
+head(exprs(msnset))
|
|
487
|
+@
|
|
488
|
+
|
|
489
|
+
|
|
490
|
+% clean-up
|
|
491
|
+<<eval=TRUE, echo=FALSE, results=hide>>=
|
|
492
|
+unlink(".Rcache", recursive=TRUE)
|
|
493
|
+@
|
|
494
|
+
|
|
495
|
+
|
|
496
|
+\bibliographystyle{plainnat}
|
|
497
|
+\bibliography{msnid}
|
|
498
|
+\end{document}
|