Browse code

fix eol

Vlad Petyuk authored on 22/07/2020 00:11:15
Showing 1 changed files

... ...
@@ -1,498 +1,498 @@
1
-%\VignetteIndexEntry{MSnID Package for Handling MS/MS Identifications}
2
-%\VignetteDepends{BiocStyle, msmsTests, ggplot2}
3
-%\VignetteKeywords{Documentation}
4
-%\VignettePackage{MSnID}
5
-\documentclass[11pt]{article}
6
-
7
-
8
-<<style, eval=TRUE, echo=FALSE, results=tex>>=
9
-BiocStyle::latex(use.unsrturl=FALSE)
10
-@
11
-
12
-
13
-
14
-\usepackage[authoryear,round]{natbib}
15
-
16
-
17
-\title{\Rpackage{MSnID} Package for Handling MS/MS Identifications}
18
-\author{Vladislav A. Petyuk}
19
-
20
-\begin{document}
21
-\SweaveOpts{concordance=TRUE, eval=TRUE, prefix.string=graphics}
22
-
23
-\maketitle
24
-\tableofcontents
25
-
26
-
27
-
28
-\section{Introduction}
29
-MS/MS identification is a process with some uncertainty. Some peptide or
30
-protein to spectrum matches are true and some are not. There are ways to
31
-score how well the peptide/protein fragmentation pattern observed in
32
-MS/MS spectrum matches the theoretical amino acid sequence. Other ways to
33
-assess confidence of identification are:
34
-\begin{enumerate}
35
-    \item the difference in theoretical and experimental masses
36
-    \item frequency of observation (true identifications tend to be more
37
-    consistent)
38
-    \item peptide sequence properties (only in the case of technique involving
39
-    protein digestion) such as missed cleavages or presence of cleavages not
40
-    typical for a given protease and chemical reagent.
41
-\end{enumerate}
42
-A typical and currently most reliable way to quantify uncertainty in the list
43
-of identify spectra, peptides or proteins relies on so-called decoy database.
44
-For bottom-up (i.e. involving protein digestion) approaches a common way to
45
-construct a decoy database is simple inversion of protein amino-acid sequences.
46
-The other approach commonly used in top-down (that is intact protein)
47
-approaches is based on shuffling of amino acids within protein sequence.
48
-Typically normal and decoy sequences are concatenated into one FASTA file.
49
-Some software tools (e.g. MS-GF+) perform the task of constructing and
50
-appending the decoy sequence internally on the fly. If the spectrum matches
51
-to normal protein sequence it can be true or false match. Matches to decoy
52
-part of the database are false only (excluding the palindromes). Therefore
53
-the false discovery rate (FDR) of identifications can be estimated as ratio
54
-of hits to decoy over normal parts of the protein sequence database.
55
-\\
56
-There are multiple levels of identification that FDR can be estimated for.
57
-First, is at the level of peptide/protein-to-spectrum matches. Second is at
58
-the level of unique peptide sequences. Note, true peptides tend to be
59
-identified by more then one spectrum. False peptide tend to be sporadic.
60
-Therefore, after collapsing the redundant peptide identifications from
61
-multiple spectra to the level of unique peptide sequence, the FDR typically
62
-increases. The extend of FDR increase depends on the type and complexity of
63
-the sample. The same trend is true for estimating the identification FDR at
64
-the protein level. True proteins tend to be identified with multiple peptides,
65
-while false protein identifications are commonly covered only by one peptide.
66
-Therefore FDR estimate tend to be even higher for protein level compare to
67
-peptide level.
68
-\\
69
-The estimation of the FDR is also affected by the number of LC-MS (runs)
70
-datasets in the experiment. Again, true identifications tend to be more
71
-consistent from run to run, while false are sporadic. After collapsing the
72
-redundancy across the runs, the number of true identification reduces much
73
-stronger compare to false identifications. Therefore, the peptide and
74
-protein FDR estimates need to be re-evaluated.
75
-\\
76
-The main objective of the MSnID package is to provide convenience tools for
77
-handling tasks on estimation of FDR, defining and optimizing the filtering
78
-criteria and ensuring confidence in MS/MS identification data. The user can
79
-specify the criteria for filtering the data (e.g. goodness or p-value of
80
-matching of experimental and theoretical fragmentation mass spectrum,
81
-deviation of theoretical from experimentally measured mass, presence of
82
-missed cleavages in the peptide sequence, etc), evaluate the performance of
83
-the filter judging by FDRs at spectrum, peptide and protein levels, and
84
-finally optimize the filter to achieve the maximum number of identifications
85
-while not exceeding maximally allowed FDR upper threshold.
86
-
87
-
88
-\section{Starting the project}
89
-First, the \Robject{MSnID} object has to be initialized. The main argument is
90
-path to the working directory. This directory will be used for storing cached
91
-analysis results. Caching/memoisation mechanism is based on \CRANpkg{R.cache}.
92
-<<>>=
93
-library("MSnID")
94
-msnid <- MSnID(".")
95
-@
96
-
97
-\section{Reading MS/MS data}
98
-One way to read in peptide/protein to MS/MS spectrum matching (PSM) results as a
99
-table from a text file and assing the \Rclass{data.frame} object.
100
-<<>>=
101
-PSMresults <- read.delim(system.file("extdata", "human_brain.txt",
102
-                                            package="MSnID"),
103
-                                stringsAsFactors=FALSE)
104
-psms(msnid) <- PSMresults
105
-show(msnid)
106
-@
107
-Alternative and currently the preferred way to read MS/MS results is by
108
-parsing mzIdentML files (*.mzid or *.mzid.gz extensions). The \Rcode{read\_mzIDs}
109
-function leverages \Biocpkg{mzID} package facilities.
110
-<<>>=
111
-mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
112
-msnid <- read_mzIDs(msnid, mzids)
113
-show(msnid)
114
-@
115
-Internally PSMs stored as \CRANpkg{data.table} object.
116
-\\
117
-The example file \texttt{"c\_elegans.mzid.gz"} is based on MS-GF+ search engine.
118
-The \Rcode{read\_mzIDs} function reads results of any MS/MS search engine as
119
-long as it compliant with mzIdentML standard. In general case, use aforementioned
120
-\Rcode{psms<-} function.
121
-
122
-
123
-\section{Updating columns}
124
-Note, to take a full advantage of the \Biocpkg{MSnID}, the
125
-the following columns have to be present. Checking of columns happens
126
-internally.
127
-<<echo=FALSE>>=
128
-sort(MSnID:::.mustBeColumns)
129
-@
130
-Check what are the current column names in the MS/MS search results table.
131
-<<>>=
132
-names(msnid)
133
-@
134
-
135
-
136
-\section{Basic info on the \Robject{MSnID} object instance}
137
-Printing the \Robject{MSnID} object returns some basic information such as
138
-\begin{itemize}
139
-    \item Working directory.
140
-    \item Number of spectrum files used to generate data.
141
-    \item Number of peptide-to-spectrum matches and corresponding FDR.
142
-    \item Number of unique peptide sequences and corresponding FDR.
143
-    \item Number of unique proteins or amino acid sequence accessions and
144
-            corresponding FDR.
145
-\end{itemize}
146
-False discovery rate or FDR is defined here as a ratio of hits to decoy
147
-accessions to the non-decoy (normal) accessions. In terms of forward and revese
148
-protein sequences that would mean ratio of \#reverse/\#forward.
149
-While computing FDRs of PSMs and unique peptide sequences is trivial,
150
-definition of protein (accession) FDR is a subject for discussion in the field
151
-of proteomics. Here, protein (accession) FDR is computed the same way as in
152
-IDPicker software \cite{Zhang2007} and simply constitutes a ratio of
153
-unique accessions from decoy component to non-decoy component of the
154
-sequence database.
155
-<<>>=
156
-show(msnid)
157
-@
158
-
159
-
160
-
161
-\section{Analysis of peptide sequences}
162
-A particular properties of peptide sequences we are interested in are
163
-\begin{enumerate}
164
-    \item irregular cleavages at the termini of the peptides and
165
-    \item missing cleavage site within the peptide sequences.
166
-\end{enumerate}
167
-The default regular expressions of valid and missed cleavage
168
-patterns correspond to trypsin.
169
-Counting the number of irregular cleavage termimi (0,1 or 2) in peptides
170
-sequence creates a new column \texttt{numIrregCleavages}.
171
-<<>>=
172
-msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
173
-@
174
-Counting the number of missed cleavages in peptides sequence creates a new
175
-column \texttt{numMissCleavages}.
176
-<<>>=
177
-msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
178
-@
179
-Now the object has two more columns, \texttt{numIrregCleavages} and
180
-\texttt{numMissCleavages}, evidently corresponding to the number
181
-of termini with irregular cleavages and number of missed cleavages
182
-within the peptide sequence.
183
-<<label=missedCleavages, fig=TRUE, include=FALSE, width=9>>=
184
-pepCleav <- unique(psms(msnid)[,c("numMissCleavages", "isDecoy", "peptide")])
185
-pepCleav <- as.data.frame(table(pepCleav[,c("numMissCleavages", "isDecoy")]))
186
-library("ggplot2")
187
-ggplot(pepCleav, aes(x=numMissCleavages, y=Freq, fill=isDecoy)) +
188
-    geom_bar(stat='identity', position='dodge') +
189
-    ggtitle("Number of Missed Cleavages")
190
-@
191
-\begin{center}
192
-\includegraphics[width=0.8\textwidth]{graphics-missedCleavages}
193
-\end{center}
194
-\pagebreak[0]
195
-Peptide sequences, as any other column, can by accessed by
196
-directly using \Rcode{\$} operator. For example:
197
-Counting number of cysteins per peptide sequence
198
-<<>>==
199
-msnid$numCys <- sapply(lapply(strsplit(msnid$peptide,''),'==','C'),sum)
200
-@
201
-Calculating peptide lengths. Note, -4 decrements the AA count by two
202
-the flanking AAs and the two dots separating them from the
203
-actual peptide sequence.
204
-<<label=lengths, fig=TRUE, include=FALSE, width=9>>=
205
-msnid$PepLength <- nchar(msnid$peptide) - 4
206
-pepLen <- unique(psms(msnid)[,c("PepLength", "isDecoy", "peptide")])
207
-ggplot(pepLen, aes(x=PepLength, fill=isDecoy)) +
208
-    geom_histogram(position='dodge', binwidth=3) +
209
-    ggtitle("Distribution on of Peptide Lengths")
210
-@
211
-\begin{center}
212
-\includegraphics[width=0.8\textwidth]{graphics-lengths}
213
-\end{center}
214
-\pagebreak[0]
215
-
216
-
217
-
218
-\section{Trimming the data}
219
-The main way for trimming or filtering the data is
220
-\Rfunction{apply\_filter} function. The second argument can be either
221
-1) a string representing expression that will be evaluated in the context of
222
-data.frame containing MS/MS results or 2) \Rclass{MSnFilter} class object
223
-(explained below). Note, the reduction in FDR. Assuming that the sample
224
-has been digested with trypsin, the true identifications tend to be
225
-fully tryptic and contain fewer missed cleavages.
226
-Original FDRs.
227
-<<>>=
228
-show(msnid)
229
-@
230
-Leaving only fully tryptic peptides.
231
-<<>>=
232
-msnid <- apply_filter(msnid, "numIrregCleavages == 0")
233
-show(msnid)
234
-@
235
-Filtering out peptides with more then 2 missed cleavages.
236
-<<>>=
237
-msnid <- apply_filter(msnid, "numMissCleavages <= 2")
238
-show(msnid)
239
-@
240
-
241
-
242
-
243
-\section{Parent ion mass measurement error}
244
-Assuming both \Rcode{calculatedMassToCharge} and
245
-\Rcode{experimentalMassToCharge} are present in \Rcode{names(msnid)},
246
-one can access parent ion mass measurement in points per million (ppm) units.
247
-<<label=ppmOriginal, fig=TRUE, include=FALSE, width=9>>=
248
-ppm <- mass_measurement_error(msnid)
249
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
250
-    geom_histogram(binwidth=100)
251
-@
252
-\begin{center}
253
-\includegraphics[width=0.8\textwidth]{graphics-ppmOriginal}
254
-\end{center}
255
-\pagebreak[0]
256
-Note, although the MS/MS search was done with $\pm$ 20ppm parent ion mass
257
-tolerance, error stretch over 1000 in ppm units. The reason is that the
258
-settings of the MS/MS search engine MS-GF+ (used for the analysis of this
259
-LC-MS dataset) fairly assumed that the instrument could have picked
260
-non-monoisotopic peaks of parent ion for fragmentation
261
-and thus considered peptides that were off by $\pm$ 1 Da
262
-(\textsuperscript{13}C-\textsuperscript{12}C to be exact). Similar settings
263
-can be found in other search engines (e.g X!Tandem).
264
-<<label=deltaMass, fig=TRUE, include=FALSE, width=9>>=
265
-dM <- with(psms(msnid),
266
-    (experimentalMassToCharge-calculatedMassToCharge)*chargeState)
267
-x <- data.frame(dM, isDecoy=msnid$isDecoy)
268
-ggplot(x, aes(x=dM, fill=isDecoy)) +
269
-    geom_histogram(position='stack', binwidth=0.1)
270
-@
271
-\begin{center}
272
-\includegraphics[width=0.8\textwidth]{graphics-deltaMass}
273
-\end{center}
274
-\pagebreak[0]
275
-Ideally, to avoid this problem, the MS/MS datasets have to be either aquired
276
-in MIPS (monoisotopic ion precurson selection) mode or preprocessed with
277
-DeconMSn \cite{Mayampurath2008} tools that identifies the monoisotipic peaks
278
-post-experimentally. The \Biocpkg{MSnID} package provide a simple
279
-\Rcode{correct\_peak\_selection} function that simply adds or subtracts
280
-the difference between \textsuperscript{13}C and \textsuperscript{12}C
281
-to make the error less then 1 Dalton.
282
-<<label=ppmCorrectedMass, fig=TRUE, include=FALSE, width=9>>=
283
-msnid.fixed <- correct_peak_selection(msnid)
284
-ppm <- mass_measurement_error(msnid.fixed)
285
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
286
-    geom_histogram(binwidth=0.25)
287
-@
288
-\begin{center}
289
-\includegraphics[width=0.8\textwidth]{graphics-ppmCorrectedMass}
290
-\end{center}
291
-\pagebreak[0]
292
-Alternatively, one can simply apply a filter to remove any matches that
293
-do not fit the $\pm$ 20 ppm tolerance.
294
-<<label=ppmFiltered20, fig=TRUE, include=FALSE, width=9>>=
295
-msnid.chopped <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
296
-ppm <- mass_measurement_error(msnid.chopped)
297
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
298
-    geom_histogram(binwidth=0.25)
299
-@
300
-\begin{center}
301
-\includegraphics[width=0.8\textwidth]{graphics-ppmFiltered20}
302
-\end{center}
303
-\pagebreak[0]
304
-
305
-For further processing we'll consider the \Rcode{msnid.chopped}
306
-data that ignores matches with 1 Da errors. Note, if the center of
307
-the histogram is significantly shifted from zero,
308
-\Rcode{experimentalMassToCharge} can be post-experimentally recalibrated.
309
-This MS/MS data was preprocessed with
310
-DtaRefinery tool \cite{Petyuk2010} that post-experimentally eliminates
311
-any systematic mass measurement error. At this point, the \Rcode{recalibrate}
312
-function implements the most simplistic algorithm avalable in the
313
-DtaRefinery tool.
314
-<<label=ppmRecalibrated, fig=TRUE, include=FALSE, width=9>>=
315
-msnid <- recalibrate(msnid.chopped)
316
-ppm <- mass_measurement_error(msnid)
317
-ggplot(as.data.frame(ppm), aes(x=ppm)) +
318
-    geom_histogram(binwidth=0.25)
319
-@
320
-\begin{center}
321
-\includegraphics[width=0.8\textwidth]{graphics-ppmRecalibrated}
322
-\end{center}
323
-\pagebreak[0]
324
-
325
-
326
-
327
-\section{\Robject{MSnIDFilter} object for filtering MS/MS identifications}
328
-The criteria that will be used for filtering the MS/MS data has to be present
329
-in the \Robject{MSnID} object. We will use -log10 transformed MS-GF+
330
-Spectrum E-value, reflecting the goodness of match experimental and
331
-theoretical fragmentation patterns as one the filtering criteria.
332
-Let's store it under the "msmsScore" name. The score density distribution
333
-shows that it is a good discriminant between non-decoy (red)
334
-and decoy hits (green).
335
-\\
336
-For alternative MS/MS search engines refer to the engine-specific manual for
337
-the names of parameters reflecting the quality of MS/MS spectra matching.
338
-Examples of such parameters are \Rcode{E-Value} for X!Tandem
339
-and \Rcode{XCorr} and \Rcode{$\Delta$Cn2} for SEQUEST.
340
-<<label=msmsScoreDistribution, fig=TRUE, include=FALSE, width=9>>=
341
-msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
342
-params <- psms(msnid)[,c("msmsScore","isDecoy")]
343
-ggplot(params) +
344
-    geom_density(aes(x = msmsScore, color = isDecoy, ..count..))
345
-@
346
-\begin{center}
347
-\includegraphics[width=0.8\textwidth]{graphics-msmsScoreDistribution}
348
-\end{center}
349
-\pagebreak[0]
350
-As a second criterion we will be using the absolute mass measurement
351
-error (in ppm units) of the parent ion. The mass measurement errors tend to
352
-be small for non-decoy (enriched with real identificaiton) hits (red line) and
353
-is effectively uniformly distributed for decoy hits.
354
-<<label=absPpmDistribution, fig=TRUE, include=FALSE, width=9>>=
355
-msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
356
-params <- psms(msnid)[,c("absParentMassErrorPPM","isDecoy")]
357
-ggplot(params) +
358
-    geom_density(aes(x = absParentMassErrorPPM, color = isDecoy, ..count..))
359
-@
360
-\begin{center}
361
-\includegraphics[width=0.8\textwidth]{graphics-absPpmDistribution}
362
-\end{center}
363
-\pagebreak[0]
364
-MS/MS fiters are handled by a special \Rclass{MSnIDFilter} class objects.
365
-Individual filtering criteria can be set by name
366
-(that is present in \Rcode{names(msnid)}), comparison operator (>, <, = , ...)
367
-defining if we should retain hits with higher or lower given the threshold and
368
-finally the threshold value itself.
369
-<<>>=
370
-filtObj <- MSnIDFilter(msnid)
371
-filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
372
-filtObj$msmsScore <- list(comparison=">", threshold=10.0)
373
-show(filtObj)
374
-@
375
-Let's evaluate the performace of the filter at three different levels of
376
-confidence assessment.
377
-<<>>=
378
-evaluate_filter(msnid, filtObj, level="PSM")
379
-evaluate_filter(msnid, filtObj, level="peptide")
380
-evaluate_filter(msnid, filtObj, level="accession")
381
-@
382
-
383
-
384
-\section{Optimizing the MS/MS filter to achieve the maximum number of
385
-identifications within a given FDR upper limit threshold}
386
-The threshold values in the example above are not necessarily optimal and set
387
-just be in the range of probable values. Filters can be optimized to ensure
388
-maximum number of identifications (peptide-to-spectrum matches,
389
-unique peptide sequences or proteins) within a given FDR upper limit.
390
-\\
391
-First, the filter can be optimized simply by stepping through
392
-individual parameters and their combinations. The idea has been described in
393
-\cite{Piehowski2013a}. The resulting \Robject{MSnIDFilter} object can be
394
-used for final data filtering or can be used as a good starting parameters for
395
-follow-up refining optimizations with more advanced algorithms.
396
-<<>>=
397
-filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
398
-                                method="Grid", level="peptide",
399
-                                n.iter=500)
400
-show(filtObj.grid)
401
-@
402
-%# (absParentMassErrorPPM < 2) & (msmsScore > 7.8)
403
-
404
-The resulting \Rcode{filtObj.grid} can be further fine tuned with such
405
-optimization routines as simulated annealing or Nelder-Mead optimization.
406
-<<>>=
407
-filtObj.nm <- optimize_filter(filtObj.grid, msnid, fdr.max=0.01,
408
-                                method="Nelder-Mead", level="peptide",
409
-                                n.iter=500)
410
-show(filtObj.nm)
411
-@
412
-%# (absParentMassErrorPPM < 3) & (msmsScore > 7.8)
413
-
414
-Let's compare the original (good guess) and optimized fileters. Obviously the
415
-latter yields much more peptide identifications, while not exceeding
416
-the maximally allowed FDR threshold of 1%.
417
-<<>>=
418
-evaluate_filter(msnid, filtObj, level="peptide")
419
-evaluate_filter(msnid, filtObj.nm, level="peptide")
420
-@
421
-Finally we'll apply the optimized filter to proceed with further
422
-steps in the analysis pipeline.
423
-<<>>=
424
-msnid <- apply_filter(msnid, filtObj.nm)
425
-show(msnid)
426
-@
427
-Identifications that matched decoy and contaminant protein sequences can be
428
-removed by providing filters in the forms of text strings that will be
429
-evaluated in the context of PSM table.
430
-<<>>=
431
-msnid <- apply_filter(msnid, "isDecoy == FALSE")
432
-show(msnid)
433
-msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
434
-show(msnid)
435
-@
436
-
437
-
438
-
439
-\section{Data output and interface with other Bioconductor packages}
440
-One can extract the entire PSMs tables as
441
-\Rcode{data.frame} or \Rcode{data.table}
442
-<<>>=
443
-psm.df <- psms(msnid)
444
-psm.dt <- as(msnid, "data.table")
445
-@
446
-If only interested in the non-redundant list of confidently identified
447
-peptides or proteins
448
-<<>>=
449
-peps <- peptides(msnid)
450
-head(peps)
451
-prots <- accessions(msnid)
452
-head(prots)
453
-prots <- proteins(msnid) # may be more intuitive then accessions
454
-head(prots)
455
-@
456
-The \Biocpkg{MSnID} package is aimed at providing convenience functionality
457
-to handle MS/MS identifications. Quantification \textit{per se} is outside of
458
-the scope of the package. The only type of quantitation that can be seamlessly
459
-tied with MS/MS identification analysis is so-called
460
-\emph{spectral counting} approach. In such an approach a peptide abundance is
461
-considered to be directly proportional to the number of matched MS/MS spectra.
462
-In its turn protein abunance is proportional to the sum of the number of
463
-spectra of the matching peptides. The \Rclass{MSnID} object can be converted
464
-to an \Rclass{MSnSet} object defined in \Biocpkg{MSnbase} that extends generic
465
-Bioconductor \Rclass{eSet} class to quantitative proteomics data.
466
-The spectral count data can be analyzed with \Biocpkg{msmsEDA},
467
-\Biocpkg{msmsTests} or \Biocpkg{DESeq} packages.
468
-<<label=convertingToMSnSet>>=
469
-msnset <- as(msnid, "MSnSet")
470
-library("MSnbase")
471
-head(fData(msnset))
472
-head(exprs(msnset))
473
-@
474
-Note, the convertion from \Robject{MSnID} to \Robject{MSnSet} uses peptides
475
-as features. The number of redundant peptide observations represent so-called
476
-spectral count that can be used for rough quantitative analysis. Summing of
477
-all of the peptide counts to a proteins level can be done with
478
-\Rcode{combineFeatures} function from \Biocpkg{MSnbase} package.
479
-<<>>=
480
-msnset <- combineFeatures(msnset,
481
-                            fData(msnset)$accession,
482
-                            redundancy.handler="unique",
483
-                            fun="sum",
484
-                            cv=FALSE)
485
-head(fData(msnset))
486
-head(exprs(msnset))
487
-@
488
-
489
-
490
-% clean-up
491
-<<eval=TRUE, echo=FALSE, results=hide>>=
492
-unlink(".Rcache", recursive=TRUE)
493
-@
494
-
495
-
496
-\bibliographystyle{plainnat}
497
-\bibliography{msnid}
498
-\end{document}
1
+%\VignetteIndexEntry{MSnID Package for Handling MS/MS Identifications}
2
+%\VignetteDepends{BiocStyle, msmsTests, ggplot2}
3
+%\VignetteKeywords{Documentation}
4
+%\VignettePackage{MSnID}
5
+\documentclass[11pt]{article}
6
+
7
+
8
+<<style, eval=TRUE, echo=FALSE, results=tex>>=
9
+BiocStyle::latex(use.unsrturl=FALSE)
10
+@
11
+
12
+
13
+
14
+\usepackage[authoryear,round]{natbib}
15
+
16
+
17
+\title{\Rpackage{MSnID} Package for Handling MS/MS Identifications}
18
+\author{Vladislav A. Petyuk}
19
+
20
+\begin{document}
21
+\SweaveOpts{concordance=TRUE, eval=TRUE, prefix.string=graphics}
22
+
23
+\maketitle
24
+\tableofcontents
25
+
26
+
27
+
28
+\section{Introduction}
29
+MS/MS identification is a process with some uncertainty. Some peptide or
30
+protein to spectrum matches are true and some are not. There are ways to
31
+score how well the peptide/protein fragmentation pattern observed in
32
+MS/MS spectrum matches the theoretical amino acid sequence. Other ways to
33
+assess confidence of identification are:
34
+\begin{enumerate}
35
+    \item the difference in theoretical and experimental masses
36
+    \item frequency of observation (true identifications tend to be more
37
+    consistent)
38
+    \item peptide sequence properties (only in the case of technique involving
39
+    protein digestion) such as missed cleavages or presence of cleavages not
40
+    typical for a given protease and chemical reagent.
41
+\end{enumerate}
42
+A typical and currently most reliable way to quantify uncertainty in the list
43
+of identify spectra, peptides or proteins relies on so-called decoy database.
44
+For bottom-up (i.e. involving protein digestion) approaches a common way to
45
+construct a decoy database is simple inversion of protein amino-acid sequences.
46
+The other approach commonly used in top-down (that is intact protein)
47
+approaches is based on shuffling of amino acids within protein sequence.
48
+Typically normal and decoy sequences are concatenated into one FASTA file.
49
+Some software tools (e.g. MS-GF+) perform the task of constructing and
50
+appending the decoy sequence internally on the fly. If the spectrum matches
51
+to normal protein sequence it can be true or false match. Matches to decoy
52
+part of the database are false only (excluding the palindromes). Therefore
53
+the false discovery rate (FDR) of identifications can be estimated as ratio
54
+of hits to decoy over normal parts of the protein sequence database.
55
+\\
56
+There are multiple levels of identification that FDR can be estimated for.
57
+First, is at the level of peptide/protein-to-spectrum matches. Second is at
58
+the level of unique peptide sequences. Note, true peptides tend to be
59
+identified by more then one spectrum. False peptide tend to be sporadic.
60
+Therefore, after collapsing the redundant peptide identifications from
61
+multiple spectra to the level of unique peptide sequence, the FDR typically
62
+increases. The extend of FDR increase depends on the type and complexity of
63
+the sample. The same trend is true for estimating the identification FDR at
64
+the protein level. True proteins tend to be identified with multiple peptides,
65
+while false protein identifications are commonly covered only by one peptide.
66
+Therefore FDR estimate tend to be even higher for protein level compare to
67
+peptide level.
68
+\\
69
+The estimation of the FDR is also affected by the number of LC-MS (runs)
70
+datasets in the experiment. Again, true identifications tend to be more
71
+consistent from run to run, while false are sporadic. After collapsing the
72
+redundancy across the runs, the number of true identification reduces much
73
+stronger compare to false identifications. Therefore, the peptide and
74
+protein FDR estimates need to be re-evaluated.
75
+\\
76
+The main objective of the MSnID package is to provide convenience tools for
77
+handling tasks on estimation of FDR, defining and optimizing the filtering
78
+criteria and ensuring confidence in MS/MS identification data. The user can
79
+specify the criteria for filtering the data (e.g. goodness or p-value of
80
+matching of experimental and theoretical fragmentation mass spectrum,
81
+deviation of theoretical from experimentally measured mass, presence of
82
+missed cleavages in the peptide sequence, etc), evaluate the performance of
83
+the filter judging by FDRs at spectrum, peptide and protein levels, and
84
+finally optimize the filter to achieve the maximum number of identifications
85
+while not exceeding maximally allowed FDR upper threshold.
86
+
87
+
88
+\section{Starting the project}
89
+First, the \Robject{MSnID} object has to be initialized. The main argument is
90
+path to the working directory. This directory will be used for storing cached
91
+analysis results. Caching/memoisation mechanism is based on \CRANpkg{R.cache}.
92
+<<>>=
93
+library("MSnID")
94
+msnid <- MSnID(".")
95
+@
96
+
97
+\section{Reading MS/MS data}
98
+One way to read in peptide/protein to MS/MS spectrum matching (PSM) results as a
99
+table from a text file and assing the \Rclass{data.frame} object.
100
+<<>>=
101
+PSMresults <- read.delim(system.file("extdata", "human_brain.txt",
102
+                                            package="MSnID"),
103
+                                stringsAsFactors=FALSE)
104
+psms(msnid) <- PSMresults
105
+show(msnid)
106
+@
107
+Alternative and currently the preferred way to read MS/MS results is by
108
+parsing mzIdentML files (*.mzid or *.mzid.gz extensions). The \Rcode{read\_mzIDs}
109
+function leverages \Biocpkg{mzID} package facilities.
110
+<<>>=
111
+mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
112
+msnid <- read_mzIDs(msnid, mzids)
113
+show(msnid)
114
+@
115
+Internally PSMs stored as \CRANpkg{data.table} object.
116
+\\
117
+The example file \texttt{"c\_elegans.mzid.gz"} is based on MS-GF+ search engine.
118
+The \Rcode{read\_mzIDs} function reads results of any MS/MS search engine as
119
+long as it compliant with mzIdentML standard. In general case, use aforementioned
120
+\Rcode{psms<-} function.
121
+
122
+
123
+\section{Updating columns}
124
+Note, to take a full advantage of the \Biocpkg{MSnID}, the
125
+the following columns have to be present. Checking of columns happens
126
+internally.
127
+<<echo=FALSE>>=
128
+sort(MSnID:::.mustBeColumns)
129
+@
130
+Check what are the current column names in the MS/MS search results table.
131
+<<>>=
132
+names(msnid)
133
+@
134
+
135
+
136
+\section{Basic info on the \Robject{MSnID} object instance}
137
+Printing the \Robject{MSnID} object returns some basic information such as
138
+\begin{itemize}
139
+    \item Working directory.
140
+    \item Number of spectrum files used to generate data.
141
+    \item Number of peptide-to-spectrum matches and corresponding FDR.
142
+    \item Number of unique peptide sequences and corresponding FDR.
143
+    \item Number of unique proteins or amino acid sequence accessions and
144
+            corresponding FDR.
145
+\end{itemize}
146
+False discovery rate or FDR is defined here as a ratio of hits to decoy
147
+accessions to the non-decoy (normal) accessions. In terms of forward and revese
148
+protein sequences that would mean ratio of \#reverse/\#forward.
149
+While computing FDRs of PSMs and unique peptide sequences is trivial,
150
+definition of protein (accession) FDR is a subject for discussion in the field
151
+of proteomics. Here, protein (accession) FDR is computed the same way as in
152
+IDPicker software \cite{Zhang2007} and simply constitutes a ratio of
153
+unique accessions from decoy component to non-decoy component of the
154
+sequence database.
155
+<<>>=
156
+show(msnid)
157
+@
158
+
159
+
160
+
161
+\section{Analysis of peptide sequences}
162
+A particular properties of peptide sequences we are interested in are
163
+\begin{enumerate}
164
+    \item irregular cleavages at the termini of the peptides and
165
+    \item missing cleavage site within the peptide sequences.
166
+\end{enumerate}
167
+The default regular expressions of valid and missed cleavage
168
+patterns correspond to trypsin.
169
+Counting the number of irregular cleavage termimi (0,1 or 2) in peptides
170
+sequence creates a new column \texttt{numIrregCleavages}.
171
+<<>>=
172
+msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
173
+@
174
+Counting the number of missed cleavages in peptides sequence creates a new
175
+column \texttt{numMissCleavages}.
176
+<<>>=
177
+msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
178
+@
179
+Now the object has two more columns, \texttt{numIrregCleavages} and
180
+\texttt{numMissCleavages}, evidently corresponding to the number
181
+of termini with irregular cleavages and number of missed cleavages
182
+within the peptide sequence.
183
+<<label=missedCleavages, fig=TRUE, include=FALSE, width=9>>=
184
+pepCleav <- unique(psms(msnid)[,c("numMissCleavages", "isDecoy", "peptide")])
185
+pepCleav <- as.data.frame(table(pepCleav[,c("numMissCleavages", "isDecoy")]))
186
+library("ggplot2")
187
+ggplot(pepCleav, aes(x=numMissCleavages, y=Freq, fill=isDecoy)) +
188
+    geom_bar(stat='identity', position='dodge') +
189
+    ggtitle("Number of Missed Cleavages")
190
+@
191
+\begin{center}
192
+\includegraphics[width=0.8\textwidth]{graphics-missedCleavages}
193
+\end{center}
194
+\pagebreak[0]
195
+Peptide sequences, as any other column, can by accessed by
196
+directly using \Rcode{\$} operator. For example:
197
+Counting number of cysteins per peptide sequence
198
+<<>>==
199
+msnid$numCys <- sapply(lapply(strsplit(msnid$peptide,''),'==','C'),sum)
200
+@
201
+Calculating peptide lengths. Note, -4 decrements the AA count by two
202
+the flanking AAs and the two dots separating them from the
203
+actual peptide sequence.
204
+<<label=lengths, fig=TRUE, include=FALSE, width=9>>=
205
+msnid$PepLength <- nchar(msnid$peptide) - 4
206
+pepLen <- unique(psms(msnid)[,c("PepLength", "isDecoy", "peptide")])
207
+ggplot(pepLen, aes(x=PepLength, fill=isDecoy)) +
208
+    geom_histogram(position='dodge', binwidth=3) +
209
+    ggtitle("Distribution on of Peptide Lengths")
210
+@
211
+\begin{center}
212
+\includegraphics[width=0.8\textwidth]{graphics-lengths}
213
+\end{center}
214
+\pagebreak[0]
215
+
216
+
217
+
218
+\section{Trimming the data}
219
+The main way for trimming or filtering the data is
220
+\Rfunction{apply\_filter} function. The second argument can be either
221
+1) a string representing expression that will be evaluated in the context of
222
+data.frame containing MS/MS results or 2) \Rclass{MSnFilter} class object
223
+(explained below). Note, the reduction in FDR. Assuming that the sample
224
+has been digested with trypsin, the true identifications tend to be
225
+fully tryptic and contain fewer missed cleavages.
226
+Original FDRs.
227
+<<>>=
228
+show(msnid)
229
+@
230
+Leaving only fully tryptic peptides.
231
+<<>>=
232
+msnid <- apply_filter(msnid, "numIrregCleavages == 0")
233
+show(msnid)
234
+@
235
+Filtering out peptides with more then 2 missed cleavages.
236
+<<>>=
237
+msnid <- apply_filter(msnid, "numMissCleavages <= 2")
238
+show(msnid)
239
+@
240
+
241
+
242
+
243
+\section{Parent ion mass measurement error}
244
+Assuming both \Rcode{calculatedMassToCharge} and
245
+\Rcode{experimentalMassToCharge} are present in \Rcode{names(msnid)},
246
+one can access parent ion mass measurement in points per million (ppm) units.
247
+<<label=ppmOriginal, fig=TRUE, include=FALSE, width=9>>=
248
+ppm <- mass_measurement_error(msnid)
249
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
250
+    geom_histogram(binwidth=100)
251
+@
252
+\begin{center}
253
+\includegraphics[width=0.8\textwidth]{graphics-ppmOriginal}
254
+\end{center}
255
+\pagebreak[0]
256
+Note, although the MS/MS search was done with $\pm$ 20ppm parent ion mass
257
+tolerance, error stretch over 1000 in ppm units. The reason is that the
258
+settings of the MS/MS search engine MS-GF+ (used for the analysis of this
259
+LC-MS dataset) fairly assumed that the instrument could have picked
260
+non-monoisotopic peaks of parent ion for fragmentation
261
+and thus considered peptides that were off by $\pm$ 1 Da
262
+(\textsuperscript{13}C-\textsuperscript{12}C to be exact). Similar settings
263
+can be found in other search engines (e.g X!Tandem).
264
+<<label=deltaMass, fig=TRUE, include=FALSE, width=9>>=
265
+dM <- with(psms(msnid),
266
+    (experimentalMassToCharge-calculatedMassToCharge)*chargeState)
267
+x <- data.frame(dM, isDecoy=msnid$isDecoy)
268
+ggplot(x, aes(x=dM, fill=isDecoy)) +
269
+    geom_histogram(position='stack', binwidth=0.1)
270
+@
271
+\begin{center}
272
+\includegraphics[width=0.8\textwidth]{graphics-deltaMass}
273
+\end{center}
274
+\pagebreak[0]
275
+Ideally, to avoid this problem, the MS/MS datasets have to be either aquired
276
+in MIPS (monoisotopic ion precurson selection) mode or preprocessed with
277
+DeconMSn \cite{Mayampurath2008} tools that identifies the monoisotipic peaks
278
+post-experimentally. The \Biocpkg{MSnID} package provide a simple
279
+\Rcode{correct\_peak\_selection} function that simply adds or subtracts
280
+the difference between \textsuperscript{13}C and \textsuperscript{12}C
281
+to make the error less then 1 Dalton.
282
+<<label=ppmCorrectedMass, fig=TRUE, include=FALSE, width=9>>=
283
+msnid.fixed <- correct_peak_selection(msnid)
284
+ppm <- mass_measurement_error(msnid.fixed)
285
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
286
+    geom_histogram(binwidth=0.25)
287
+@
288
+\begin{center}
289
+\includegraphics[width=0.8\textwidth]{graphics-ppmCorrectedMass}
290
+\end{center}
291
+\pagebreak[0]
292
+Alternatively, one can simply apply a filter to remove any matches that
293
+do not fit the $\pm$ 20 ppm tolerance.
294
+<<label=ppmFiltered20, fig=TRUE, include=FALSE, width=9>>=
295
+msnid.chopped <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
296
+ppm <- mass_measurement_error(msnid.chopped)
297
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
298
+    geom_histogram(binwidth=0.25)
299
+@
300
+\begin{center}
301
+\includegraphics[width=0.8\textwidth]{graphics-ppmFiltered20}
302
+\end{center}
303
+\pagebreak[0]
304
+
305
+For further processing we'll consider the \Rcode{msnid.chopped}
306
+data that ignores matches with 1 Da errors. Note, if the center of
307
+the histogram is significantly shifted from zero,
308
+\Rcode{experimentalMassToCharge} can be post-experimentally recalibrated.
309
+This MS/MS data was preprocessed with
310
+DtaRefinery tool \cite{Petyuk2010} that post-experimentally eliminates
311
+any systematic mass measurement error. At this point, the \Rcode{recalibrate}
312
+function implements the most simplistic algorithm avalable in the
313
+DtaRefinery tool.
314
+<<label=ppmRecalibrated, fig=TRUE, include=FALSE, width=9>>=
315
+msnid <- recalibrate(msnid.chopped)
316
+ppm <- mass_measurement_error(msnid)
317
+ggplot(as.data.frame(ppm), aes(x=ppm)) +
318
+    geom_histogram(binwidth=0.25)
319
+@
320
+\begin{center}
321
+\includegraphics[width=0.8\textwidth]{graphics-ppmRecalibrated}
322
+\end{center}
323
+\pagebreak[0]
324
+
325
+
326
+
327
+\section{\Robject{MSnIDFilter} object for filtering MS/MS identifications}
328
+The criteria that will be used for filtering the MS/MS data has to be present
329
+in the \Robject{MSnID} object. We will use -log10 transformed MS-GF+
330
+Spectrum E-value, reflecting the goodness of match experimental and
331
+theoretical fragmentation patterns as one the filtering criteria.
332
+Let's store it under the "msmsScore" name. The score density distribution
333
+shows that it is a good discriminant between non-decoy (red)
334
+and decoy hits (green).
335
+\\
336
+For alternative MS/MS search engines refer to the engine-specific manual for
337
+the names of parameters reflecting the quality of MS/MS spectra matching.
338
+Examples of such parameters are \Rcode{E-Value} for X!Tandem
339
+and \Rcode{XCorr} and \Rcode{$\Delta$Cn2} for SEQUEST.
340
+<<label=msmsScoreDistribution, fig=TRUE, include=FALSE, width=9>>=
341
+msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
342
+params <- psms(msnid)[,c("msmsScore","isDecoy")]
343
+ggplot(params) +
344
+    geom_density(aes(x = msmsScore, color = isDecoy, ..count..))
345
+@
346
+\begin{center}
347
+\includegraphics[width=0.8\textwidth]{graphics-msmsScoreDistribution}
348
+\end{center}
349
+\pagebreak[0]
350
+As a second criterion we will be using the absolute mass measurement
351
+error (in ppm units) of the parent ion. The mass measurement errors tend to
352
+be small for non-decoy (enriched with real identificaiton) hits (red line) and
353
+is effectively uniformly distributed for decoy hits.
354
+<<label=absPpmDistribution, fig=TRUE, include=FALSE, width=9>>=
355
+msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
356
+params <- psms(msnid)[,c("absParentMassErrorPPM","isDecoy")]
357
+ggplot(params) +
358
+    geom_density(aes(x = absParentMassErrorPPM, color = isDecoy, ..count..))
359
+@
360
+\begin{center}
361
+\includegraphics[width=0.8\textwidth]{graphics-absPpmDistribution}
362
+\end{center}
363
+\pagebreak[0]
364
+MS/MS fiters are handled by a special \Rclass{MSnIDFilter} class objects.
365
+Individual filtering criteria can be set by name
366
+(that is present in \Rcode{names(msnid)}), comparison operator (>, <, = , ...)
367
+defining if we should retain hits with higher or lower given the threshold and
368
+finally the threshold value itself.
369
+<<>>=
370
+filtObj <- MSnIDFilter(msnid)
371
+filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
372
+filtObj$msmsScore <- list(comparison=">", threshold=10.0)
373
+show(filtObj)
374
+@
375
+Let's evaluate the performace of the filter at three different levels of
376
+confidence assessment.
377
+<<>>=
378
+evaluate_filter(msnid, filtObj, level="PSM")
379
+evaluate_filter(msnid, filtObj, level="peptide")
380
+evaluate_filter(msnid, filtObj, level="accession")
381
+@
382
+
383
+
384
+\section{Optimizing the MS/MS filter to achieve the maximum number of
385
+identifications within a given FDR upper limit threshold}
386
+The threshold values in the example above are not necessarily optimal and set
387
+just be in the range of probable values. Filters can be optimized to ensure
388
+maximum number of identifications (peptide-to-spectrum matches,
389
+unique peptide sequences or proteins) within a given FDR upper limit.
390
+\\
391
+First, the filter can be optimized simply by stepping through
392
+individual parameters and their combinations. The idea has been described in
393
+\cite{Piehowski2013a}. The resulting \Robject{MSnIDFilter} object can be
394
+used for final data filtering or can be used as a good starting parameters for
395
+follow-up refining optimizations with more advanced algorithms.
396
+<<>>=
397
+filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
398
+                                method="Grid", level="peptide",
399
+                                n.iter=500)
400
+show(filtObj.grid)
401
+@
402
+%# (absParentMassErrorPPM < 2) & (msmsScore > 7.8)
403
+
404
+The resulting \Rcode{filtObj.grid} can be further fine tuned with such
405
+optimization routines as simulated annealing or Nelder-Mead optimization.
406
+<<>>=
407
+filtObj.nm <- optimize_filter(filtObj.grid, msnid, fdr.max=0.01,
408
+                                method="Nelder-Mead", level="peptide",
409
+                                n.iter=500)
410
+show(filtObj.nm)
411
+@
412
+%# (absParentMassErrorPPM < 3) & (msmsScore > 7.8)
413
+
414
+Let's compare the original (good guess) and optimized fileters. Obviously the
415
+latter yields much more peptide identifications, while not exceeding
416
+the maximally allowed FDR threshold of 1%.
417
+<<>>=
418
+evaluate_filter(msnid, filtObj, level="peptide")
419
+evaluate_filter(msnid, filtObj.nm, level="peptide")
420
+@
421
+Finally we'll apply the optimized filter to proceed with further
422
+steps in the analysis pipeline.
423
+<<>>=
424
+msnid <- apply_filter(msnid, filtObj.nm)
425
+show(msnid)
426
+@
427
+Identifications that matched decoy and contaminant protein sequences can be
428
+removed by providing filters in the forms of text strings that will be
429
+evaluated in the context of PSM table.
430
+<<>>=
431
+msnid <- apply_filter(msnid, "isDecoy == FALSE")
432
+show(msnid)
433
+msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
434
+show(msnid)
435
+@
436
+
437
+
438
+
439
+\section{Data output and interface with other Bioconductor packages}
440
+One can extract the entire PSMs tables as
441
+\Rcode{data.frame} or \Rcode{data.table}
442
+<<>>=
443
+psm.df <- psms(msnid)
444
+psm.dt <- as(msnid, "data.table")
445
+@
446
+If only interested in the non-redundant list of confidently identified
447
+peptides or proteins
448
+<<>>=
449
+peps <- peptides(msnid)
450
+head(peps)
451
+prots <- accessions(msnid)
452
+head(prots)
453
+prots <- proteins(msnid) # may be more intuitive then accessions
454
+head(prots)
455
+@
456
+The \Biocpkg{MSnID} package is aimed at providing convenience functionality
457
+to handle MS/MS identifications. Quantification \textit{per se} is outside of
458
+the scope of the package. The only type of quantitation that can be seamlessly
459
+tied with MS/MS identification analysis is so-called
460
+\emph{spectral counting} approach. In such an approach a peptide abundance is
461
+considered to be directly proportional to the number of matched MS/MS spectra.
462
+In its turn protein abunance is proportional to the sum of the number of
463
+spectra of the matching peptides. The \Rclass{MSnID} object can be converted
464
+to an \Rclass{MSnSet} object defined in \Biocpkg{MSnbase} that extends generic
465
+Bioconductor \Rclass{eSet} class to quantitative proteomics data.
466
+The spectral count data can be analyzed with \Biocpkg{msmsEDA},
467
+\Biocpkg{msmsTests} or \Biocpkg{DESeq} packages.
468
+<<label=convertingToMSnSet>>=
469
+msnset <- as(msnid, "MSnSet")
470
+library("MSnbase")
471
+head(fData(msnset))
472
+head(exprs(msnset))
473
+@
474
+Note, the convertion from \Robject{MSnID} to \Robject{MSnSet} uses peptides
475
+as features. The number of redundant peptide observations represent so-called
476
+spectral count that can be used for rough quantitative analysis. Summing of
477
+all of the peptide counts to a proteins level can be done with
478
+\Rcode{combineFeatures} function from \Biocpkg{MSnbase} package.
479
+<<>>=
480
+msnset <- combineFeatures(msnset,
481
+                            fData(msnset)$accession,
482
+                            redundancy.handler="unique",
483
+                            fun="sum",
484
+                            cv=FALSE)
485
+head(fData(msnset))
486
+head(exprs(msnset))
487
+@
488
+
489
+
490
+% clean-up
491
+<<eval=TRUE, echo=FALSE, results=hide>>=
492
+unlink(".Rcache", recursive=TRUE)
493
+@
494
+
495
+
496
+\bibliographystyle{plainnat}
497
+\bibliography{msnid}
498
+\end{document}