new knitr vignette

paul-shannon authored on 14/04/2020 16:54:50
Showing 8 changed files

deleted file mode 100644
Binary files a/vignettes/MotifDb-logo1.pdf and /dev/null differ
deleted file mode 100644
Binary files a/vignettes/MotifDb-logo2.pdf and /dev/null differ
deleted file mode 100644
Binary files a/vignettes/MotifDb-logo3.pdf and /dev/null differ
deleted file mode 100644
Binary files a/vignettes/MotifDb-logo4.pdf and /dev/null differ
deleted file mode 100644
Binary files a/vignettes/MotifDb-logo5.pdf and /dev/null differ
deleted file mode 100644
Binary files a/vignettes/MotifDb-logo6.pdf and /dev/null differ
@@ -14,18 +14,21 @@ vignette: >
 Many kinds of biological activity are regulated by the binding of proteins to their cognate
 substrates.  Of particular interest is the sequence-specific binding of transcription factors to
 DNA, often in regulatory regions just upstream of the transcription start site of a gene.  These
-binding events play a pivotal role in regulating gene expression.  Sequence specificity among
-closely related binding sites is nearly always incomplete: some variety in the DNA sequence is
-routinely observed.  For this reason, these inexact binding sequence patterns are commonly described
-as **motifs**, represented numerically as frequency matrices, and visualized as sequence logos.
-
-Despite their importance in current research, there has been until now no single, annotated,
-comprehensive collection of publicly available motifs. The current package attempts to provide such
-a collection, offering more than ten thousand annotated matrices from multiple organisms, within the
-context of the Bioconductor project.  The matrices can be filtered and selected on the basis of
-their metadata, used with other Bioconductor packages (for instance, seqLogo can be used for for
-visualization) or easily exported for use with standard software and websites such as those provided
-by the [MEME Suite](http://meme.sdsc.edu/meme/doc/meme.html).
+binding events play a pivotal role in regulating gene expression.
+
+Sequence specificity among closely related binding sites is nearly always incomplete: some variety
+in the DNA sequence is routinely observed.  For this reason, these inexact binding sequence patterns
+are commonly described as **motifs**, represented numerically as frequency matrices, and visualized
+as sequence logos.
+
+Despite their importance in current research, there has been until now, to the best of our
+knowledge, no single, annotated, comprehensive collection of publicly available motifs. The current
+package attempts to provide such a collection, offering more than ten thousand annotated matrices
+from multiple organisms, within the context of the Bioconductor project.  The matrices can be
+filtered and selected on the basis of their metadata, used with other Bioconductor packages (for
+instance, seqLogo can be used for for visualization) or easily exported for use with standard
+software and websites such as those provided by the [MEME
+Suite](http://meme.sdsc.edu/meme/doc/meme.html).
 
 Transcription factor binding sites (TFBS) can only be imperfectly predicted from sequence matching
 of motif to DNA sequence.  When using MotifDb, please keep in mind that actual and functional TF
deleted file mode 100644
@@ -1,473 +0,0 @@
-\documentclass{article}
-%% %\VignetteIndexEntry{MotifDb Overview}
-%% %\VignettePackage{MotifDb}
-\usepackage[noae]{Sweave}
-\usepackage[left=0.5in,top=0.5in,right=0.5in,bottom=0.75in,nohead,nofoot]{geometry}
-\usepackage{hyperref}
-\usepackage[noae]{Sweave}
-\usepackage{color}
-\usepackage{graphicx}
-\usepackage{caption}
-\usepackage{subcaption}
-
-\definecolor{Blue}{rgb}{0,0,0.5}
-\definecolor{Green}{rgb}{0,0.5,0}
-
-\RecustomVerbatimEnvironment{Sinput}{Verbatim}{%
-  xleftmargin=1em,%
-  fontsize=\small,%
-  fontshape=sl,%
-  formatcom=\color{Blue}%
-  }
-\RecustomVerbatimEnvironment{Soutput}{Verbatim}{%
-  xleftmargin=0em,%
-  fontsize=\scriptsize,%
-  formatcom=\color{Blue}%
-  }
-\RecustomVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
-
-
-
-\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
-\fvset{listparameters={\setlength{\topsep}{6pt}}}
-% These determine the rules used to place floating objects like figures
-% They are only guides, but read the manual to see the effect of each.
-\renewcommand{\topfraction}{.99}
-\renewcommand{\bottomfraction}{.99}
-\renewcommand{\textfraction}{0.0}
-
-\title{MotifDb}
-\author{Paul Shannon}
-
-\begin{document}
-
-\maketitle
-\begin{abstract}
-Many kinds of biological activity are regulated by the binding of proteins to their cognate
-substrates.  Of particular interest is the sequence-specific binding of transcription factors to DNA, often in
-regulatory regions just upstream of the transcription start site of a gene.  These binding events play a pivotal
-role in regulating gene expression.  Sequence specificity among closely related binding sites is nearly always incomplete: some variety
-in the DNA sequence is routinely observed.  For this reason, these inexact binding sequence patterns are commonly
-described as \emph{motifs} represented numerically as frequency matrices, and visualized as sequence logos.  Despite their importance
-in current research, there has been until now no single, annotated, comprehensive collection of publicly available motifs.
-The current package provides such a collection, offering more than two thousand annotated matrices from multiple organisms, within the
-context of the Bioconductor project.  The matrices can be filtered and selected on the basis of their metadata, used with other
-Bioconductor packages (MotIV for motif comparison, seqLogo for visualization) or easily exported for use with
-standard software and websites such as those provided by the MEME Suite\footnote{http://meme.sdsc.edu/meme/doc/meme.html}.
-\end{abstract}
-
-\tableofcontents
-
-\section{Introduction and Basic Operations}
-
-The first step is to load the necessary packages:
-
-<<libraries>>=
-library (MotifDb)
-library (MotIV)
-library (seqLogo)
-@
-
-<<hiddenCode results=hide, echo=FALSE>>=
-MotIV.toTable = function (match) {
-  if (length (match@bestMatch) == 0)
-    return (NA)
-
-  alignments = match@bestMatch[[1]]@aligns
-
-  df = data.frame (stringsAsFactors=FALSE)
-  for (alignment in alignments) {
-    x = alignment
-    name = x@TF@name
-    eVal = x@evalue
-    sequence = x@sequence
-    match = x@match
-    strand = x@strand
-    df = rbind (df, data.frame (name=name, eVal=eVal, sequence=sequence,
-                                match=match, strand=strand, stringsAsFactors=FALSE))
-    } # for alignment
-  return (df)
-  } # MotIV.toTable
-@
-
-%% MotifDb provides two kinds of loosely linked data:  position frequency matrices, and metadata about each matrix.  The matrix
-%% names, and the rownames of the metadata table, are identical, so it is easy to map back and forth between
-%% the two.  Some measure of convenience is gained by extracting these two kinds of data into separate variables,
-%% as we shall see.  The cost in extra memory should not significant.
-%%
-%% <<all.matrices>>=
-%% matrices.all = as.list (MotifDb)
-%% metadata <- values (MotifDb)
-%% @
-There are  more than two thousand  matrices, from five sources:
-<<sources>>=
-length (MotifDb)
-sort (table (values (MotifDb)$dataSource), decreasing=TRUE)
-@
-And 22 organisms (though the majority of the matrices come from just four):
-<<organisms>>=
-sort (table (values (MotifDb)$organism), decreasing=TRUE)
-@
-
-With these categories of metadata
-<<metadata>>=
-colnames (values (MotifDb))
-@
-\section{Selection}
-
-There are three ways to extract subsets of interest from the MotifDb collection.  All three operate upon the MotifDb metadata,
-matching values in one or more of those fifteen attributes (listed just above), and returning the subset of MotifDb  which
-meet the specified criteria.  The three techniques:  \emph{query}, \emph{subset} and \emph{grep}
-
-\subsection{query}
-This is the simplest technique to use, and will suffice in many circumstances.  For example, if you want
-all of the human matrices:
-<<queryHuman>>=
-query (MotifDb, 'hsapiens')
-@
-If you want all matrices associated with \textbf{\emph{Sox}} transcription factors, regardless of dataSource or organism:
-<<querySox>>=
-query (MotifDb, 'sox')
-@
-For all yeast transcription factors with a homeo domain
-<<queryYeastHomeo>>=
-query (query (MotifDb, 'cerevisiae'), 'homeo')
-@
-The last example may inspire more confidence in the precision of the result than is justified, and for a couple
-of reasons.  First, the assignment of  protein binding domains to specific categories is, as of 2012, an ad hoc
-and incomplete process.  Second, the query commands matches the supplied character string to \emph{all} metadata
-columns.  In this case, 'homeo' appears both in the \emph{bindingDomain} column and the \emph{tfFamily} column,
-and the above \emph{query} will return matches from both.
-Searching and filtering should always be accompanined by close scrutiny of the data, such as these commands
-illustrate:
-
-<<homeoVariety>>=
-unique (grep ('homeo', values(MotifDb)$bindingDomain, ignore.case=T, v=T))
-unique (grep ('homeo', values(MotifDb)$tfFamily, ignore.case=T, v=T))
-@
-\subsection{grep}
-This selection method (and the next, \emph{subset}) require that you address metadata columns explicitly.  This is a little more
-work, but the requisite direct engagement with the metadata is worthwhile.  Repeating the 'query' examples from above,
-you can see how more knowedge of MotifDb metadata is required.
-<<grepHuman>>=
-mdb.human <- MotifDb [grep ('Hsapiens', values (MotifDb)$organism)]
-mdb.sox <- MotifDb [grep ('sox', values (MotifDb)$geneSymbol, ignore.case=TRUE)]
-yeast.indices = grepl ('scere', values (MotifDb)$organism, ignore.case=TRUE)
-homeo.indices.domain = grepl ('homeo', values (MotifDb)$bindingDomain, ignore.case=TRUE)
-homeo.indices.family = grepl ('homeo', values (MotifDb)$tfFamily, ignore.case=TRUE)
-yeast.homeo.indices = yeast.indices & (homeo.indices.domain | homeo.indices.family)
-yeast.homeoDb = MotifDb [yeast.homeo.indices]
-@
-
-An alternate and somewhat more compact approach:
-<<withHomeo>>=
-yeast.homeo.indices <- with(values(MotifDb),
-  grepl('scere', organism, ignore.case=TRUE) &
-    (grepl('homeo', bindingDomain, ignore.case=TRUE) |
-     grepl('homeo', tfFamily, ignore.case=TRUE)))
-
-@
-\subsection{subset}
-MotifDb::subset emulates the R base data.frame \emph{subset} command, which is not unlike an SQL select function.
-Unfortunately -- and just like the R base subset function -- this MotifDb method cannot be used reliably  within a script:
-\emph{It is only reliable when called interactively.}  Here, with mixed success (as you will see) , we use MotifDb::subset to
-reproduce the \emph{query} and \emph{grep} selections shown above.
-
-<<subsetHuman>>=
-if (interactive ())
-  subset (MotifDb, organism=='Hsapiens')
-@
-One can easily find all the 'sox' genes with the subset command, avoiding possible upper/lower case conflicts by passing
-the metadata's geneSymbol column through the function 'tolower':
-<<subsetSox>>=
-if (interactive ())
-  subset (MotifDb, tolower (geneSymbol) == 'sox4')
-@
-Similarly, subset has limited application for a permissive 'homeo' search.
-But for the retrieval by explicitly specified search terms, subset works very well:
-<<subsetYeastHomeo>>=
-if (interactive ())
-  subset (MotifDb, organism=='Scerevisiae' & bindingDomain=='Homeo')
-@
-
-\subsection{The Egr1 Case Study}
-
-We now do a simple geneSymbol search, followed by an examination of the sub-MotifDb the search returns.  We are looking for all matrices
-associated with the well-known and highly conserved zinc-finger transcription factor, Egr1.
-There are two of these in MotifDb, both from mouse, and each from a different data source.
-
-<<findEgr1>>=
-  # subset is convenient:
-if (interactive ())
-  as.list (subset (MotifDb, tolower (geneSymbol) == 'egr1'))
-  # grep returns indices which allow for more flexibility
-indices = grep ('egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)
-length (indices)
-@
-There are a variety of ways to examine and extract data from this object, a MotifList of length 2.
-<<MotifDbViews>>=
-MotifDb [indices]
-@
-
-Now view the matrices as a named list:
-<<as.list>>=
-as.list (MotifDb [indices])
-@
-and finally, the metadata associated with these two matrices, transposed, for easy reading and comparison:
-<<as.metadata>>=
-noquote (t (as.data.frame (values (MotifDb [indices]))))
-@
-
-We used the \emph{grep} function above to find rows in the metadata table whose \emph{geneSymbol} column includes the string 'Egr1'.
-If you wish to identify matrices (and/or their attendant metadata) based upon a richer combination of criteria, for instance:
-
-\begin{enumerate}
-  \item organism  (\emph{Mmusculus})
-  \item gene symbol  (\emph{Egr1})
-  \item data source  (\emph{JASPAR\_CORE})
-\end{enumerate}
-
-the grep solution, while serviceable, becomes a little awkward:
-<<egr1-multi-grep>>=
-
-geneSymbol.rows = grep ('Egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)
-organism.rows = grep ('Mmusculus', values (MotifDb)$organism, ignore.case=TRUE)
-source.rows = grep ('JASPAR', values (MotifDb)$dataSource, ignore.case=TRUE)
-egr1.mouse.jaspar.rows = intersect (geneSymbol.rows,
-                           intersect (organism.rows, source.rows))
-print (egr1.mouse.jaspar.rows)
-egr1.motif <- MotifDb [egr1.mouse.jaspar.rows]
-@
-<<MotifDbViews>>=
-@
-
-Far more concise, and fully reliable as an interactive command (though \emph{not} if used in a
-script\footnote{See the help page of the base R command subset for detail), is the \emph{subset} command}):
-<<subsetSearchForEgr1>>=
-if (interactive ()) {
-  egr1.motif <- subset (MotifDb, organism=='Mmusculus' &
-                        dataSource=='JASPAR_CORE' &
-                        geneSymbol=='Egr1')
-  }
-@
-Whichever method you use, this next chunk of code displays the matrix, and then the metadata for mouse JASPAR Egr1, the latter
-textually-transformed for easy reading within the size constraints of this page.
-<<examine-egr1>>=
-egr1.motif
-as.list (egr1.motif)
-noquote (t (as.data.frame (values (egr1.motif))))
-@
-
-
-Next we use the bioconductor \emph{seqLogo} package to display this motif.
-
-<<egr1, fig=TRUE, include=FALSE>>=
-seqLogo (as.list (egr1.motif)[[1]])
-@
-
-\begin{figure}[htpb!]
-  \centering
-  \includegraphics[width=0.3\textwidth]{MotifDb-egr1}
-  \caption{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}
-\end{figure}
-
-\section{Motif Matching}
-Note: this section is disabled for now (2 april 2020) due to the
-breakage and perhaps end-of-life status of the Bioconductor package MotIV.
-
-An alternative Bioconductor package for motif matching is motifmatchr,
-which uses the C++ moods library.  Unfortunately, the inclusion of
-motifmatchr and supporting packages needed to render it interoperable
-with MotifDb requires the further inclusion of very many dependencies.
-
-For that reason, we present only non-executable code here:
-
-<<motifmatchr, eval=FALSE>>=
-
- gr.region <- GRanges(seqnames="chr1", IRanges(start=47229520, end=47229560))
- motifs <- query(MotifDb, c("jaspar2018", "ZNF263"))
-
- motifs.pfmatrix <- lapply(motifs,
-                           function(motif) convert_motifs(motif, "TFBSTools-PFMatrix"))
-
- motifs.pfmList <- do.call(PFMatrixList, motifs.pfmatrix)
- gr.list <- motifmatchr::matchMotifs(motifs.pfmList, regions, genome=genomeName,
-                                     out="positions", p.cutoff=1e-5)
- gr <- unlist(gr.list)
- motif.names <- names(gr)
- names(gr) <- NULL
- tbl <- as.data.frame(gr)
- tbl$motif_id <- motif.names
- colnames(tbl)[1] <- "chrom"
- tbl$chrom <- as.character(tbl$chrom)
- colnames(tbl)[grep("score", colnames(tbl))] <- "mood.score"
- new.order <- order(tbl$start, decreasing=FALSE)
- tbl <- tbl[new.order,]
-
-@
-
-%% We will look for the ten position frequency matrices which are the best match to JASPAR's mouse EGR1, using
-%% the MotIV package.  We actually request the top eleven hits from the entire MotifDb, since the first hit
-%% should be the target matrix itself, since that is of necessity found in the full MotifDb.
-
-%% <<motifmatch>>=
-%% egr1.hits <- motifMatch (as.list (egr1.motif) [1], as.list (MotifDb), top=11)
-%% # 'MotIV.toTable' -- defined above (and hidden) -- will become part of MotIV in the upcoming release
-%% tbl.hits <- MotIV.toTable (egr1.hits)
-%% print (tbl.hits)
-%% @
-%%
-%% The \emph{sequence} column in this table is the \emph{consensus sequence} -- with heterogeneity left out -- for the
-%% matrix it describes.
-%%
-%% \vspace{10 mm}
-%%
-%% \textbf{\emph{Puzzling: the strand of the match reported above is opposite of what I expected, and opposite of what seqLogo displays.
-%%   This is a question for the MotIV developers.}}
-%%
-%% \vspace{10 mm}
-%%
-%% The six logos appear below, beginning with the logo of the query matrix, \emph{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}, including
-%% two other mouse matrices, and two zinc-finger fly matrices.  Examining the three mouse matrices and their metadata reveals that
-%% all three (geneSymbol differences aside) describe the same protein:
-%% <<three.mice.metadata>>=
-%% if (interactive ())
-%%   noquote (t (as.data.frame (subset (values (MotifDb), geneId=='13653'))))
-%% @
-%% Zinc finger protein domains are classified into many \emph{fold groups}; their respective cognate DNA sequence may classify similarly.
-%% That two fly matrices significantly match three reports of the mouse Egr1 motif suggests impressive conservation of this
-%% binding pattern, or convergent evolution.
-%%
-%% Let us look at the metadata for the first fly match, whose geneId is \textbf{FBgn0003499}:
-%% <<fly.Sr.metadata>>=
-%% noquote (t (as.data.frame (values (MotifDb)[grep ('FBgn0003499', values (MotifDb)$geneId),])))
-%% @ Note that the SANGER motif, based on 18 sequences, had a high fidelity match to mouse Egr1 (see above, 10e-12), but
-%% that the SOLEXA motif, based upon 2316 sequences, did not (in work not shown, it appears 22nd in the an expanded
-%% motifMatch hit list, with a eval of 10e-5).  It is possible that the SOLEXA motif is more accurate, and that a close
-%% examination of this case, including sequence logos, position frequency matrices, and the search parameters of
-%% motifMatch, will be instructive.  Repeating the search with \emph{tomtom} might also be illuminating -- either as
-%% confirmation of MotIV and the default parameterization we used, or as a correction to it.  Here we see the facilities
-%% for exploratory data analysis MotifDb provides, and the opportunities for data analysis which result.
-%%
-%%
-%% <<logo1, fig=TRUE, include=FALSE, echo=FALSE>>=
-%%   seqLogo (MotifDb [[tbl.hits$name[1]]])
-%% @
-%%
-%% <<logo2, fig=TRUE, include=FALSE, echo=FALSE>>=
-%%   seqLogo (MotifDb [[tbl.hits$name[2]]])
-%% @
-%%
-%% <<logo3, fig=TRUE, include=FALSE, echo=FALSE>>=
-%%   seqLogo (MotifDb [[tbl.hits$name[3]]])
-%% @
-%%
-%% <<logo4, fig=TRUE, include=FALSE, echo=FALSE>>=
-%%   seqLogo (MotifDb [[tbl.hits$name[4]]])
-%% @
-%%
-%% <<logo5, fig=TRUE, include=FALSE, echo=FALSE>>=
-%%   seqLogo (MotifDb [[tbl.hits$name[5]]])
-%% @
-%%
-%% <<logo6, fig=TRUE, include=FALSE, echo=FALSE>>=
-%%   seqLogo (MotifDb [[tbl.hits$name[6]]])
-%% @
-%%
-%% \begin{figure}[htpb!]
-%%   \centering
-%%   \begin{subfigure}[b]{0.38\textwidth}
-%%     \includegraphics[width=\textwidth]{MotifDb-logo1}
-%%     \caption{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}
-%%     \label{fig:Egr1-MA0162.1}
-%%     \end{subfigure}%
-%%   \begin{subfigure}[b]{0.38\textwidth}
-%%     \includegraphics[width=\textwidth]{MotifDb-logo2}
-%%     \caption{Dme-FFS-sr\_SANGER\_5\_FBgn0003499\\(abbreviated)}
-%%     \label{fig:Egr1-logo2}
-%%     \end{subfigure}%
-%% \end{figure}
-%%
-%% \begin{figure}[htpb!]
-%%   \centering
-%%   \begin{subfigure}[b]{0.38\textwidth}
-%%     \includegraphics[width=\textwidth]{MotifDb-logo3}
-%%     \caption{Mmusculus-UniPROBE-Zif268.UP00400}
-%%     \label{fig:Egr1-logo3}
-%%     \end{subfigure}%
-%%   \begin{subfigure}[b]{0.38\textwidth}
-%%     \includegraphics[width=\textwidth]{MotifDb-logo4}
-%%     \caption{Dme-FFS-klu\_SANGER\_10\_FBgn0013469}
-%%     \label{fig:Egr1-logo4}
-%%     \end{subfigure}%
-%% \end{figure}
-%%
-%%
-%% \begin{figure}[htpb!]
-%%   \centering
-%%   \begin{subfigure}[b]{0.38\textwidth}
-%%     \includegraphics[width=\textwidth]{MotifDb-logo5}
-%%     \caption{Mmusculus-UniPROBE-Egr1.UP00007}
-%%     \label{fig:Egr1-logo5}
-%%     \end{subfigure}%
-%%   \begin{subfigure}[b]{0.38\textwidth}
-%%     \centering
-%%     \includegraphics[width=\textwidth]{MotifDb-logo6}
-%%     \caption{Dme-FFS-klu\_SOLEXA\_5\_FBgn0013469}
-%%     \label{fig:Egr1-logo6}
-%%     \end{subfigure}%
-%% \end{figure}
-%%
-%% \newpage
-
-\section{Exporting to the MEME Suite}
-Some users of this package may wish to export the data -- both matrices and metadata -- so that they may be used in
-other programs.  The MEME suite, among others, is broadly useful, continuously improved and well-regarded throughout the
-bioinformatics community.  The code below exports all of the MotifDb matrices as a text file in the MEME format, and all
-of the metadata as a tab-delimited text file.
-
-<<export>>=
-matrix.output.file = tempfile ()   # substitute your preferred filename here
-meme.text = export (MotifDb, matrix.output.file, 'meme')
-
-metadata.output.file = tempfile () # substitute your preferred filename here
-write.table (as.data.frame (values (MotifDb)), file=metadata.output.file, sep='\t',
-             row.names=TRUE, col.names=TRUE, quote=FALSE)
-@
-
-\section{Future Work}
-This first version of MotifDb collects into one R package all of the best-known public domain protein-DNA binding matrices, with
-as much metadata as could be gleaned from the five providers.  However, not all of these matrices are equally supported by data
-and by no means are all are accompanied by complete metadata.
-
-With the passage of time our knowledge of protein-DNA binding sequence motifs will improve.  They will be derived from more
-binding events, with more precision and specificity, and accompanied by more (and better understood) contextual detail.  Cooperative binding,
-mentioned only in a few times in the current (July 2012) version of this package, will be well-represented.  Metadata will improve.
-Better assignment of binding domains to consensus categories will be especially useful when it is available.  Three-dimensional
-models of specific proteins binding to specific DNA may someday become commonplace.
-
-
-\section{References}
-
-\begin{itemize}
-
-\item Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010 Jan;38(Database issue):D105-10. Epub 2009 Nov 11.
-
-\item Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2011 Jan;39(Database issue):D124-8. Epub 2010 Oct 30.
-
-\item Spivak AT, Stormo GD. ScerTF: a comprehensive database of benchmarked position weight matrices for Saccharomyces species. Nucleic Acids Res. 2012 Jan;40(Database issue):D162-8. Epub 2011 Dec 2.
-
-\item Xie Z, Hu S, Blackshaw S, Zhu H, Qian J. hPDI: a database of experimental human protein-DNA interactions. Bioinformatics. 2010 Jan 15;26(2):287-9. Epub 2009 Nov 9.
-
-\item Zhu LJ, et al. 2011. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011 Jan;39(Database issue):D111-7. Epub 2010 Nov 19.
-
-\item Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi HS, Lambert SA, Mann I, Cook K, Zheng H, Goity A, van Bakel H, Lozano JC, Galli M, Lewsey MG, Huang E, Mukherjee T, Chen X, Reece-Hoyes JS, Govindarajan S, Shaulsky G, Walhout AJ, Bouget FY, Ratsch G, Larrondo LF, Ecker JR, Hughes TR.
-Determination and inference of eukaryotic transcription factor sequence specificity.
-Cell. 2014 Sep 11;158(6):1431-43. doi:10.1016/j.cell.2014.08.009. PMID: 25215497
-
-\end{itemize}
-
-
-\end{document}
-
-
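
Note on the removed `grep` selection examples: the deleted vignette combined one logical mask per metadata column to pick out yeast homeodomain matrices. The sketch below reproduces that pattern in base R only, so it runs without MotifDb installed. The `metadata` data frame is a hypothetical stand-in for `values(MotifDb)`, and its rows (`geneA` to `geneD`) are invented for illustration; only the column names and the `grepl` logic come from the removed code.

```r
# Hypothetical stand-in for values(MotifDb); the rows are invented.
metadata <- data.frame(
  organism      = c("Hsapiens", "Scerevisiae", "Scerevisiae", "Mmusculus"),
  geneSymbol    = c("geneA", "geneB", "geneC", "geneD"),
  bindingDomain = c("HMG", "Homeo", NA, "Homeo"),
  tfFamily      = c("Sox", NA, "Homeobox", NA),
  stringsAsFactors = FALSE
)

# One logical mask per criterion. grepl() returns FALSE (not NA) for
# missing values, so rows with an NA column are simply not matched.
is.yeast <- grepl("scere", metadata$organism, ignore.case = TRUE)
is.homeo <- grepl("homeo", metadata$bindingDomain, ignore.case = TRUE) |
            grepl("homeo", metadata$tfFamily, ignore.case = TRUE)

# Keep rows that are yeast AND match 'homeo' in either column.
yeast.homeo <- metadata[is.yeast & is.homeo, ]
print(yeast.homeo$geneSymbol)
```

With the real package, the same mask would presumably index the collection directly, as in the removed line `MotifDb [yeast.homeo.indices]`.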