
added cisbp 2014 reference

paul-shannon authored on 09/04/2018 05:47:55
Showing 2 changed files

... ...
@@ -1,8 +1,8 @@
1 1
 Package: MotifDb
2 2
 Type: Package
3 3
 Title: An Annotated Collection of Protein-DNA Binding Sequence Motifs
4
-Version: 1.21.2
5
-Date: 2018-01-31
4
+Version: 1.21.3
5
+Date: 2018-04-08
6 6
 Author: Paul Shannon, Matt Richards
7 7
 Maintainer: Paul Shannon <pshannon@systemsbiology.org>
8 8
 Depends: R (>= 2.15.0), methods, BiocGenerics, S4Vectors, IRanges, Biostrings
... ...
@@ -2,7 +2,7 @@
2 2
 %% %\VignetteIndexEntry{MotifDb Overview}
3 3
 %% %\VignettePackage{MotifDb}
4 4
 \usepackage[noae]{Sweave}
5
-\usepackage[left=0.5in,top=0.5in,right=0.5in,bottom=0.75in,nohead,nofoot]{geometry} 
5
+\usepackage[left=0.5in,top=0.5in,right=0.5in,bottom=0.75in,nohead,nofoot]{geometry}
6 6
 \usepackage{hyperref}
7 7
 \usepackage[noae]{Sweave}
8 8
 \usepackage{color}
... ...
@@ -30,16 +30,16 @@
30 30
 
31 31
 \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
32 32
 \fvset{listparameters={\setlength{\topsep}{6pt}}}
33
-% These determine the rules used to place floating objects like figures 
33
+% These determine the rules used to place floating objects like figures
34 34
 % They are only guides, but read the manual to see the effect of each.
35 35
 \renewcommand{\topfraction}{.99}
36 36
 \renewcommand{\bottomfraction}{.99}
37 37
 \renewcommand{\textfraction}{0.0}
38 38
 
39
-\title{MotifDb} 
39
+\title{MotifDb}
40 40
 \author{Paul Shannon}
41 41
 
42
-\begin{document} 
42
+\begin{document}
43 43
 
44 44
 \maketitle
45 45
 \begin{abstract}
... ...
@@ -52,7 +52,7 @@ described as \emph{motifs} represented numerically as frequency matrices, and vi
52 52
 in current research, there has been until now no single, annotated, comprehensive collection of publicly available motifs.
53 53
 The current package provides such a collection, offering more than two thousand annotated matrices from multiple organisms, within the
54 54
 context of the Bioconductor project.  The matrices can be filtered and selected on the basis of their metadata, used with other
55
-Bioconductor packages (MotIV for motif comparison, seqLogo for visualization) or easily exported for use with 
55
+Bioconductor packages (MotIV for motif comparison, seqLogo for visualization) or easily exported for use with
56 56
 standard software and websites such as those provided by the MEME Suite\footnote{http://meme.sdsc.edu/meme/doc/meme.html}.
57 57
 \end{abstract}
58 58
 
... ...
@@ -66,13 +66,13 @@ The first step is to load the necessary packages:
66 66
 library (MotifDb)
67 67
 library (MotIV)
68 68
 library (seqLogo)
69
-@ 
69
+@
70 70
 
71 71
 <<hiddenCode results=hide, echo=FALSE>>=
72 72
 MotIV.toTable = function (match) {
73 73
   if (length (match@bestMatch) == 0)
74 74
     return (NA)
75
-  
75
+
76 76
   alignments = match@bestMatch[[1]]@aligns
77 77
 
78 78
   df = data.frame (stringsAsFactors=FALSE)
... ...
@@ -83,58 +83,58 @@ MotIV.toTable = function (match) {
83 83
     sequence = x@sequence
84 84
     match = x@match
85 85
     strand = x@strand
86
-    df = rbind (df, data.frame (name=name, eVal=eVal, sequence=sequence, 
86
+    df = rbind (df, data.frame (name=name, eVal=eVal, sequence=sequence,
87 87
                                 match=match, strand=strand, stringsAsFactors=FALSE))
88 88
     } # for alignment
89 89
   return (df)
90
-  } # MotIV.toTable 
91
-@ 
90
+  } # MotIV.toTable
91
+@
92 92
 
93 93
 %% MotifDb provides two kinds of loosely linked data:  position frequency matrices, and metadata about each matrix.  The matrix
94 94
 %% names, and the rownames of the metadata table, are identical, so it is easy to map back and forth between
95 95
 %% the two.  Some measure of convenience is gained by extracting these two kinds of data into separate variables,
96 96
 %% as we shall see.  The cost in extra memory should not significant.
97
-%% 
97
+%%
98 98
 %% <<all.matrices>>=
99 99
 %% matrices.all = as.list (MotifDb)
100 100
 %% metadata <- values (MotifDb)
101
-%% @ 
101
+%% @
102 102
 There are  more than two thousand  matrices, from five sources:
103 103
 <<sources>>=
104 104
 length (MotifDb)
105 105
 sort (table (values (MotifDb)$dataSource), decreasing=TRUE)
106
-@ 
106
+@
107 107
 And 22 organisms (though the majority of the matrices come from just four):
108 108
 <<organisms>>=
109 109
 sort (table (values (MotifDb)$organism), decreasing=TRUE)
110
-@ 
110
+@
111 111
 
112 112
 With these categories of metadata
113 113
 <<metadata>>=
114 114
 colnames (values (MotifDb))
115
-@ 
115
+@
116 116
 \section{Selection}
117 117
 
118 118
 There are three ways to extract subsets of interest from the MotifDb collection.  All three operate upon the MotifDb metadata,
119
-matching values in one or more of those fifteen attributes (listed just above), and returning the subset of MotifDb  which 
119
+matching values in one or more of those fifteen attributes (listed just above), and returning the subset of MotifDb  which
120 120
 meet the specified criteria.  The three techniques:  \emph{query}, \emph{subset} and \emph{grep}
121 121
 
122 122
 \subsection{query}
123
-This is the simplest technique to use, and will suffice in many circumstances.  For example, if you want 
123
+This is the simplest technique to use, and will suffice in many circumstances.  For example, if you want
124 124
 all of the human matrices:
125 125
 <<queryHuman>>=
126 126
 query (MotifDb, 'hsapiens')
127
-@ 
127
+@
128 128
 If you want all matrices associated with \textbf{\emph{Sox}} transcription factors, regardless of dataSource or organism:
129 129
 <<querySox>>=
130 130
 query (MotifDb, 'sox')
131
-@ 
131
+@
132 132
 For all yeast transcription factors with a homeo domain
133 133
 <<queryYeastHomeo>>=
134 134
 query (query (MotifDb, 'cerevisiae'), 'homeo')
135
-@ 
135
+@
136 136
 The last example may inspire more confidence in the precision of the result than is justified, and for a couple
137
-of reasons.  First, the assignment of  protein binding domains to specific categories is, as of 2012, an ad hoc 
137
+of reasons.  First, the assignment of  protein binding domains to specific categories is, as of 2012, an ad hoc
138 138
 and incomplete process.  Second, the query commands matches the supplied character string to \emph{all} metadata
139 139
 columns.  In this case, 'homeo' appears both in the \emph{bindingDomain} column and the \emph{tfFamily} column,
140 140
 and the above \emph{query} will return matches from both.
... ...
@@ -144,7 +144,7 @@ illustrate:
144 144
 <<homeoVariety>>=
145 145
 unique (grep ('homeo', values(MotifDb)$bindingDomain, ignore.case=T, v=T))
146 146
 unique (grep ('homeo', values(MotifDb)$tfFamily, ignore.case=T, v=T))
147
-@ 
147
+@
148 148
 \subsection{grep}
149 149
 This selection method (and the next, \emph{subset}) require that you address metadata columns explicitly.  This is a little more
150 150
 work, but the requisite direct engagement with the metadata is worthwhile.  Repeating the 'query' examples from above,
... ...
@@ -157,7 +157,7 @@ homeo.indices.domain = grepl ('homeo', values (MotifDb)$bindingDomain, ignore.ca
157 157
 homeo.indices.family = grepl ('homeo', values (MotifDb)$tfFamily, ignore.case=TRUE)
158 158
 yeast.homeo.indices = yeast.indices & (homeo.indices.domain | homeo.indices.family)
159 159
 yeast.homeoDb = MotifDb [yeast.homeo.indices]
160
-@ 
160
+@
161 161
 
162 162
 An alternate and somewhat more compact approach:
163 163
 <<withHomeo>>=
... ...
@@ -166,7 +166,7 @@ yeast.homeo.indices <- with(values(MotifDb),
166 166
     (grepl('homeo', bindingDomain, ignore.case=TRUE) |
167 167
      grepl('homeo', tfFamily, ignore.case=TRUE)))
168 168
 
169
-@ 
169
+@
170 170
 \subsection{subset}
171 171
 MotifDb::subset emulates the R base data.frame \emph{subset} command, which is not unlike an SQL select function.
172 172
 Unfortunately -- and just like the R base subset function -- this MotifDb method cannot be used reliably  within a script:
... ...
@@ -176,19 +176,19 @@ reproduce the \emph{query} and \emph{grep} selections shown above.
176 176
 <<subsetHuman>>=
177 177
 if (interactive ())
178 178
   subset (MotifDb, organism=='Hsapiens')
179
-@ 
179
+@
180 180
 One can easily find all the 'sox' genes with the subset command, avoiding possible upper/lower case conflicts by passing
181 181
 the metadata's geneSymbol column through the function 'tolower':
182 182
 <<subsetSox>>=
183 183
 if (interactive ())
184 184
   subset (MotifDb, tolower (geneSymbol) == 'sox4')
185
-@ 
185
+@
186 186
 Similarly, subset has limited application for a permissive 'homeo' search.
187 187
 But for the retrieval by explicitly specified search terms, subset works very well:
188 188
 <<subsetYeastHomeo>>=
189 189
 if (interactive ())
190 190
   subset (MotifDb, organism=='Scerevisiae' & bindingDomain=='Homeo')
191
-@ 
191
+@
192 192
 
193 193
 \subsection{The Egr1 Case Study}
194 194
 
... ...
@@ -197,27 +197,27 @@ associated with the well-known and highly conserved zinc-finger transcription fa
197 197
 There are two of these in MotifDb, both from mouse, and each from a different data source.
198 198
 
199 199
 <<findEgr1>>=
200
-  # subset is convenient: 
200
+  # subset is convenient:
201 201
 if (interactive ())
202 202
   as.list (subset (MotifDb, tolower (geneSymbol) == 'egr1'))
203 203
   # grep returns indices which allow for more flexibility
204
-indices = grep ('egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)  
204
+indices = grep ('egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)
205 205
 length (indices)
206
-@ 
207
-There are a variety of ways to examine and extract data from this object, a MotifList of length 2.  
206
+@
207
+There are a variety of ways to examine and extract data from this object, a MotifList of length 2.
208 208
 <<MotifDbViews>>=
209 209
 MotifDb [indices]
210
-@ 
210
+@
211 211
 
212 212
 Now view the matrices as a named list:
213 213
 <<as.list>>=
214 214
 as.list (MotifDb [indices])
215
-@ 
215
+@
216 216
 and finally, the metadata associated with these two matrices, transposed, for easy reading and comparison:
217 217
 <<as.metadata>>=
218 218
 noquote (t (as.data.frame (values (MotifDb [indices]))))
219
-@ 
220
- 
219
+@
220
+
221 221
 We used the \emph{grep} function above to find rows in the metadata table whose \emph{geneSymbol} column includes the string 'Egr1'.
222 222
 If you wish to identify matrices (and/or their attendant metadata) based upon a richer combination of criteria, for instance:
223 223
 
... ...
@@ -233,47 +233,47 @@ the grep solution, while serviceable, becomes a little awkward:
233 233
 geneSymbol.rows = grep ('Egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)
234 234
 organism.rows = grep ('Mmusculus', values (MotifDb)$organism, ignore.case=TRUE)
235 235
 source.rows = grep ('JASPAR', values (MotifDb)$dataSource, ignore.case=TRUE)
236
-egr1.mouse.jaspar.rows = intersect (geneSymbol.rows, 
236
+egr1.mouse.jaspar.rows = intersect (geneSymbol.rows,
237 237
                            intersect (organism.rows, source.rows))
238 238
 print (egr1.mouse.jaspar.rows)
239 239
 egr1.motif <- MotifDb [egr1.mouse.jaspar.rows]
240
-@ 
240
+@
241 241
 <<MotifDbViews>>=
242
-@ 
242
+@
243 243
 
244
-Far more concise, and fully reliable as an interactive command (though \emph{not} if used in a 
244
+Far more concise, and fully reliable as an interactive command (though \emph{not} if used in a
245 245
 script\footnote{See the help page of the base R command subset for detail), is the \emph{subset} command}):
246 246
 <<subsetSearchForEgr1>>=
247 247
 if (interactive ()) {
248
-  egr1.motif <- subset (MotifDb, organism=='Mmusculus' & 
249
-                        dataSource=='JASPAR_CORE' & 
248
+  egr1.motif <- subset (MotifDb, organism=='Mmusculus' &
249
+                        dataSource=='JASPAR_CORE' &
250 250
                         geneSymbol=='Egr1')
251 251
   }
252
-@ 
252
+@
253 253
 Whichever method you use, this next chunk of code displays the matrix, and then the metadata for mouse JASPAR Egr1, the latter
254 254
 textually-transformed for easy reading within the size constraints of this page.
255 255
 <<examine-egr1>>=
256 256
 egr1.motif
257 257
 as.list (egr1.motif)
258 258
 noquote (t (as.data.frame (values (egr1.motif))))
259
-@ 
259
+@
260 260
 
261 261
 
262 262
 Next we use the bioconductor \emph{seqLogo} package to display this motif.
263 263
 
264 264
 <<egr1, fig=TRUE, include=FALSE>>=
265 265
 seqLogo (as.list (egr1.motif)[[1]])
266
-@ 
266
+@
267 267
 
268 268
 \begin{figure}[htpb!]
269 269
   \centering
270 270
   \includegraphics[width=0.3\textwidth]{MotifDb-egr1}
271 271
   \caption{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}
272 272
 \end{figure}
273
-  
273
+
274 274
 \section{Motif Matching}
275 275
 We will look for the ten position frequency matrices which are the best match to JASPAR's mouse EGR1, using
276
-the MotIV package.  We actually request the top eleven hits from the entire MotifDb, since the first hit 
276
+the MotIV package.  We actually request the top eleven hits from the entire MotifDb, since the first hit
277 277
 should be the target matrix itself, since that is of necessity found in the full MotifDb.
278 278
 
279 279
 <<motifmatch>>=
... ...
@@ -281,10 +281,10 @@ egr1.hits <- motifMatch (as.list (egr1.motif) [1], as.list (MotifDb), top=11)
281 281
 # 'MotIV.toTable' -- defined above (and hidden) -- will become part of MotIV in the upcoming release
282 282
 tbl.hits <- MotIV.toTable (egr1.hits)
283 283
 print (tbl.hits)
284
-@ 
284
+@
285 285
 
286
-The \emph{sequence} column in this table is the \emph{consensus sequence} -- with heterogeneity left out -- for the 
287
-matrix it describes.   
286
+The \emph{sequence} column in this table is the \emph{consensus sequence} -- with heterogeneity left out -- for the
287
+matrix it describes.
288 288
 
289 289
 \vspace{10 mm}
290 290
 
... ...
@@ -299,10 +299,10 @@ all three (geneSymbol differences aside) describe the same protein:
299 299
 <<three.mice.metadata>>=
300 300
 if (interactive ())
301 301
   noquote (t (as.data.frame (subset (values (MotifDb), geneId=='13653'))))
302
-@ 
302
+@
303 303
 Zinc finger protein domains are classified into many \emph{fold groups}; their respective cognate DNA sequence may classify similarly.
304
-That two fly matrices significantly match three reports of the mouse Egr1 motif suggests impressive conservation of this 
305
-binding pattern, or convergent evolution.  
304
+That two fly matrices significantly match three reports of the mouse Egr1 motif suggests impressive conservation of this
305
+binding pattern, or convergent evolution.
306 306
 
307 307
 Let us look at the metadata for the first fly match, whose geneId is \textbf{FBgn0003499}:
308 308
 <<fly.Sr.metadata>>=
... ...
@@ -397,19 +397,19 @@ matrix.output.file = tempfile ()   # substitute your preferred filename here
397 397
 meme.text = export (MotifDb, matrix.output.file, 'meme')
398 398
 
399 399
 metadata.output.file = tempfile () # substitute your preferred filename here
400
-write.table (as.data.frame (values (MotifDb)), file=metadata.output.file, sep='\t', 
400
+write.table (as.data.frame (values (MotifDb)), file=metadata.output.file, sep='\t',
401 401
              row.names=TRUE, col.names=TRUE, quote=FALSE)
402
-@ 
402
+@
403 403
 
404 404
 \section{Future Work}
405 405
 This first version of MotifDb collects into one R package all of the best-known public domain protein-DNA binding matrices, with
406 406
 as much metadata as could be gleaned from the five providers.  However, not all of these matrices are equally supported by data
407 407
 and by no means are all are accompanied by complete metadata.
408 408
 
409
-With the passage of time our knowledge of protein-DNA binding sequence motifs will improve.  They will be derived from more 
410
-binding events, with more precision and specificity, and accompanied by more (and better understood) contextual detail.  Cooperative binding, 
409
+With the passage of time our knowledge of protein-DNA binding sequence motifs will improve.  They will be derived from more
410
+binding events, with more precision and specificity, and accompanied by more (and better understood) contextual detail.  Cooperative binding,
411 411
 mentioned only in a few times in the current (July 2012) version of this package, will be well-represented.  Metadata will improve.
412
-Better assignment of binding domains to consensus categories will be especially useful when it is available.  Three-dimensional 
412
+Better assignment of binding domains to consensus categories will be especially useful when it is available.  Three-dimensional
413 413
 models of specific proteins binding to specific DNA may someday become commonplace.
414 414
 
415 415
 
... ...
@@ -427,6 +427,9 @@ models of specific proteins binding to specific DNA may someday become commonpla
427 427
 
428 428
 \item Zhu LJ, et al. 2011. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011 Jan;39(Database issue):D111-7. Epub 2010 Nov 19.
429 429
 
430
+\item Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, Zheng H, Goity A, van Bakel H, Lozano JC, Galli M, Lewsey MG, Huang E, Mukherjee T, Chen X, Reece-Hoyes JS, Govindarajan S, Shaulsky G, Walhout AJ, Bouget FY, Ratsch G, Larrondo LF, Ecker JR, Hughes TR.
431
+Determination and inference of eukaryotic transcription factor sequence specificity.
432
+Cell. 2014 Sep 11;158(6):1431-43. doi:10.1016/j.cell.2014.08.009. PMID: 25215497
430 433
 
431 434
 \end{itemize}
432 435