\documentclass{article}
%% %\VignetteIndexEntry{MotifDb Overview}
%% %\VignettePackage{MotifDb}
\usepackage[noae]{Sweave}
\usepackage[left=0.5in,top=0.5in,right=0.5in,bottom=0.75in,nohead,nofoot]{geometry} 
\usepackage{hyperref}
\usepackage[noae]{Sweave}
\usepackage{color}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}

\definecolor{Blue}{rgb}{0,0,0.5}
\definecolor{Green}{rgb}{0,0.5,0}

\RecustomVerbatimEnvironment{Sinput}{Verbatim}{%
  xleftmargin=1em,%
  fontsize=\small,%
  fontshape=sl,%
  formatcom=\color{Blue}%
  }
\RecustomVerbatimEnvironment{Soutput}{Verbatim}{%
  xleftmargin=0em,%
  fontsize=\scriptsize,%
  formatcom=\color{Blue}%
  }
\RecustomVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}



\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\fvset{listparameters={\setlength{\topsep}{6pt}}}
% These determine the rules used to place floating objects like figures 
% They are only guides, but read the manual to see the effect of each.
\renewcommand{\topfraction}{.99}
\renewcommand{\bottomfraction}{.99}
\renewcommand{\textfraction}{0.0}

\title{MotifDb} 
\author{Paul Shannon}

\begin{document} 

\maketitle
\begin{abstract}
Many kinds of biological activity are regulated by the binding of proteins to their cognate
substrates.  Of particular interest is the sequence-specific binding of transcription factors to DNA, often in
regulatory regions just upstream of the transcription start site of a gene.  These binding events play a pivotal
role in regulating gene expression.  Sequence specificity among closely related binding sites is nearly always incomplete: some variety
in the DNA sequence is routinely observed.  For this reason, these inexact binding sequence patterns are commonly
described as \emph{motifs} represented numerically as frequency matrices, and visualized as sequence logos.  Despite their importance
in current research, there has been until now no single, annotated, comprehensive collection of publicly available motifs.
The current package provides such a collection, offering more than two thousand annotated matrices from multiple organisms, within the
context of the Bioconductor project.  The matrices can be filtered and selected on the basis of their metadata, used with other
Bioconductor packages (MotIV for motif comparison, seqLogo for visualization) or easily exported for use with 
standard software and websites such as those provided by the MEME Suite\footnote{http://meme.sdsc.edu/meme/doc/meme.html}.
\end{abstract}

\tableofcontents

\section{Introduction and Basic Operations}

The first step is to load the necessary packages:

\begin{Schunk}
\begin{Sinput}
> library (MotifDb)
> library (MotIV)
> library (seqLogo)
\end{Sinput}
\end{Schunk}


%% MotifDb provides two kinds of loosely linked data:  position frequency matrices, and metadata about each matrix.  The matrix
%% names, and the rownames of the metadata table, are identical, so it is easy to map back and forth between
%% the two.  Some measure of convenience is gained by extracting these two kinds of data into separate variables,
%% as we shall see.  The cost in extra memory should not significant.
%% 
%% <<all.matrices>>=
%% matrices.all = as.list (MotifDb)
%% metadata <- values (MotifDb)
%% @ 
There are  more than two thousand  matrices, from five sources:
\begin{Schunk}
\begin{Sinput}
> length (MotifDb)
\end{Sinput}
\begin{Soutput}
[1] 2086
\end{Soutput}
\begin{Sinput}
> sort (table (values (MotifDb)$dataSource), decreasing=TRUE)
\end{Sinput}
\begin{Soutput}
FlyFactorSurvey     JASPAR_CORE            hPDI        UniPROBE          ScerTF 
            614             459             437             380             196 
\end{Soutput}
\end{Schunk}
And 22 organisms (though the majority of the matrices come from just four):
\begin{Schunk}
\begin{Sinput}
> sort (table (values (MotifDb)$organism), decreasing=TRUE)
\end{Sinput}
\begin{Soutput}
Dmelanogaster      Hsapiens   Scerevisiae     Mmusculus   Rnorvegicus 
          739           505           464           329             8 
     Celegans         Zmays     Athaliana    Vertebrata        Amajus 
            7             6             5             4             3 
     Psativum        Gallus   Pfalciparum       Cparvum      Hroretzi 
            3             2             2             1             1 
     Hvulgare   Nsylvestris    Ocuniculus      Phybrida       Rrattus 
            1             1             1             1             1 
    Taestivam       Xlaevis 
            1             1 
\end{Soutput}
\end{Schunk}

With these categories of metadata
\begin{Schunk}
\begin{Sinput}
> colnames (values (MotifDb))
\end{Sinput}
\begin{Soutput}
 [1] "providerName"    "providerId"      "dataSource"      "geneSymbol"     
 [5] "geneId"          "geneIdType"      "proteinId"       "proteinIdType"  
 [9] "organism"        "sequenceCount"   "bindingSequence" "bindingDomain"  
[13] "tfFamily"        "experimentType"  "pubmedID"       
\end{Soutput}
\end{Schunk}
\section{Selection}

There are three ways to extract subsets of interest from the MotifDb collection.  All three operate upon the MotifDb metadata,
matching values in one or more of those fifteen attributes (listed just above), and returning the subset of MotifDb  which 
meet the specified criteria.  The three techniques:  \emph{query}, \emph{subset} and \emph{grep}

\subsection{query}
This is the simplest technique to use, and will suffice in many circumstances.  For example, if you want 
all of the human matrices:
\begin{Schunk}
\begin{Sinput}
> query (MotifDb, 'hsapiens')
\end{Sinput}
\begin{Soutput}
MotifDb object of length 505
| Created from downloaded public sources: 2012-Jul6
| 505 position frequency matrices from 3 sources:
|        JASPAR_CORE:   66
|           UniPROBE:    2
|               hPDI:  437
| 1 organism/s
|           Hsapiens:  505
Hsapiens-hPDI-ABCF2 
Hsapiens-hPDI-ACF 
Hsapiens-hPDI-ACO1 
Hsapiens-hPDI-ADARB1 
Hsapiens-hPDI-AFF4 
...
Hsapiens-JASPAR_CORE-SP1-MA0079.2 
Hsapiens-JASPAR_CORE-ESR2-MA0258.1 
Hsapiens-JASPAR_CORE-HIF1A::ARNT-MA0259.1 
Hsapiens-UniPROBE-Sox4.UP00401 
Hsapiens-UniPROBE-Oct_1.UP00399 
\end{Soutput}
\end{Schunk}
If you want all matrices associated with \textbf{\emph{Sox}} transcription factors, regardless of dataSource or organism:
\begin{Schunk}
\begin{Sinput}
> query (MotifDb, 'sox')
\end{Sinput}
\begin{Soutput}
MotifDb object of length 24
| Created from downloaded public sources: 2012-Jul6
| 24 position frequency matrices from 4 sources:
|    FlyFactorSurvey:    2
|        JASPAR_CORE:    5
|           UniPROBE:   15
|               hPDI:    2
| 3 organism/s
|          Mmusculus:   18
|           Hsapiens:    4
|      Dmelanogaster:    2
Dmelanogaster-FlyFactorSurvey-Sox14_SANGER_10_FBgn0005612 
Dmelanogaster-FlyFactorSurvey-Sox15_SANGER_5_FBgn0005613 
Hsapiens-hPDI-SOX13 
Hsapiens-hPDI-SOX14 
Hsapiens-JASPAR_CORE-SOX9-MA0077.1 
...
Mmusculus-UniPROBE-Sox30.UP00023 
Mmusculus-UniPROBE-Sox4.UP00062 
Mmusculus-UniPROBE-Sox5.UP00091 
Mmusculus-UniPROBE-Sox7.UP00034 
Mmusculus-UniPROBE-Sox8.UP00051 
\end{Soutput}
\end{Schunk}
For all yeast transcription factors with a homeo domain
\begin{Schunk}
\begin{Sinput}
> query (query (MotifDb, 'cerevisiae'), 'homeo')
\end{Sinput}
\begin{Soutput}
MotifDb object of length 14
| Created from downloaded public sources: 2012-Jul6
| 14 position frequency matrices from 2 sources:
|        JASPAR_CORE:   10
|           UniPROBE:    4
| 1 organism/s
|        Scerevisiae:   14
Scerevisiae-UniPROBE-Cup9.UP00308 
Scerevisiae-UniPROBE-Matalpha2.UP00307 
Scerevisiae-UniPROBE-Pho2.UP00268 
Scerevisiae-UniPROBE-Yox1.UP00274 
Scerevisiae-JASPAR_CORE-CUP9-MA0288.1 
...
Scerevisiae-JASPAR_CORE-STE12-MA0393.1 
Scerevisiae-JASPAR_CORE-TEC1-MA0406.1 
Scerevisiae-JASPAR_CORE-TOS8-MA0408.1 
Scerevisiae-JASPAR_CORE-YHP1-MA0426.1 
Scerevisiae-JASPAR_CORE-YOX1-MA0433.1 
\end{Soutput}
\end{Schunk}
The last example may inspire more confidence in the precision of the result than is justified, and for a couple
of reasons.  First, the assignment of  protein binding domains to specific categories is, as of 2012, an ad hoc 
and incomplete process.  Second, the query commands matches the supplied character string to \emph{all} metadata
columns.  In this case, 'homeo' appears both in the \emph{bindingDomain} column and the \emph{tfFamily} column,
and the above \emph{query} will return matches from both.
Searching and filtering should always be accompanined by close scrutiny of the data, such as these commands
illustrate:

\begin{Schunk}
\begin{Sinput}
> unique (grep ('homeo', values(MotifDb)$bindingDomain, ignore.case=T, v=T))
\end{Sinput}
\begin{Soutput}
 [1] "Homeobox"                          "Hox9_act;Homeobox"                
 [3] "LIM;Homeobox"                      "PAX;Homeobox"                     
 [5] "OAR;Homeobox"                      "Pou;Homeobox"                     
 [7] "Distant similarity to homeodomain" "Homeo"                            
 [9] "Homeo, PAX"                        "Homeo, POU"                       
\end{Soutput}
\begin{Sinput}
> unique (grep ('homeo', values(MotifDb)$tfFamily, ignore.case=T, v=T))
\end{Sinput}
\begin{Soutput}
[1] "Homeo"                                
[2] "Homeo::Nuclear Factor I-CCAAT-binding"
\end{Soutput}
\end{Schunk}
\subsection{grep}
This selection method (and the next, \emph{subset}) require that you address metadata columns explicitly.  This is a little more
work, but the requisite direct engagement with the metadata is worthwhile.  Repeating the 'query' examples from above,
you can see how more knowedge of MotifDb metadata is required.
\begin{Schunk}
\begin{Sinput}
> mdb.human <- MotifDb [grep ('Hsapiens', values (MotifDb)$organism)]
> mdb.sox <- MotifDb [grep ('sox', values (MotifDb)$geneSymbol, ignore.case=TRUE)]
> yeast.indices = grepl ('scere', values (MotifDb)$organism, ignore.case=TRUE)
> homeo.indices.domain = grepl ('homeo', values (MotifDb)$bindingDomain, ignore.case=TRUE)
> homeo.indices.family = grepl ('homeo', values (MotifDb)$tfFamily, ignore.case=TRUE)
> yeast.homeo.indices = yeast.indices & (homeo.indices.domain | homeo.indices.family)
> yeast.homeoDb = MotifDb [yeast.homeo.indices]
\end{Sinput}
\end{Schunk}

An alternate and somewhat more compact approach:
\begin{Schunk}
\begin{Sinput}
> yeast.homeo.indices <- with(values(MotifDb),
+   grepl('scere', organism, ignore.case=TRUE) &
+     (grepl('homeo', bindingDomain, ignore.case=TRUE) |
+      grepl('homeo', tfFamily, ignore.case=TRUE)))
> 
\end{Sinput}
\end{Schunk}
\subsection{subset}
MotifDb::subset emulates the R base data.frame \emph{subset} command, which is not unlike an SQL select function.
Unfortunately -- and just like the R base subset function -- this MotifDb method cannot be used reliably  within a script:
\emph{It is only reliable when called interactively.}  Here, with mixed success (as you will see) , we use MotifDb::subset to
reproduce the \emph{query} and \emph{grep} selections shown above.

\begin{Schunk}
\begin{Sinput}
> if (interactive ())
+   subset (MotifDb, organism=='Hsapiens')
\end{Sinput}
\end{Schunk}
One can easily find all the 'sox' genes with the subset command, avoiding possible upper/lower case conflicts by passing
the metadata's geneSymbol column through the function 'tolower':
\begin{Schunk}
\begin{Sinput}
> if (interactive ())
+   subset (MotifDb, tolower (geneSymbol) == 'sox4')
\end{Sinput}
\end{Schunk}
Similarly, subset has limited application for a permissive 'homeo' search.
But for the retrieval by explicitly specified search terms, subset works very well:
\begin{Schunk}
\begin{Sinput}
> if (interactive ())
+   subset (MotifDb, organism=='Scerevisiae' & bindingDomain=='Homeo')
\end{Sinput}
\end{Schunk}

\subsection{The Egr1 Case Study}

We now do a simple geneSymbol search, followed by an examination of the sub-MotifDb the search returns.  We are looking for all matrices
associated with the well-known and highly conserved zinc-finger transcription factor, Egr1.
There are two of these in MotifDb, both from mouse, and each from a different data source.

\begin{Schunk}
\begin{Sinput}
>   # subset is convenient: 
> if (interactive ())
+   as.list (subset (MotifDb, tolower (geneSymbol) == 'egr1'))
>   # grep returns indices which allow for more flexibility
> indices = grep ('egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)  
> length (indices)
\end{Sinput}
\begin{Soutput}
[1] 2
\end{Soutput}
\end{Schunk}
There are a variety of ways to examine and extract data from this object, a MotifList of length 2.  
\begin{Schunk}
\begin{Sinput}
> MotifDb [indices]
\end{Sinput}
\begin{Soutput}
MotifDb object of length 2
| Created from downloaded public sources: 2012-Jul6
| 2 position frequency matrices from 2 sources:
|        JASPAR_CORE:    1
|           UniPROBE:    1
| 1 organism/s
|          Mmusculus:    2
Mmusculus-JASPAR_CORE-Egr1-MA0162.1 
Mmusculus-UniPROBE-Egr1.UP00007 
\end{Soutput}
\end{Schunk}

Now view the matrices as a named list:
\begin{Schunk}
\begin{Sinput}
> as.list (MotifDb [indices])
\end{Sinput}
\begin{Soutput}
$`Mmusculus-JASPAR_CORE-Egr1-MA0162.1`
           1          2         3 4   5   6          7 8         9 10
A 0.20000000 0.13333333 0.0000000 0 0.0 0.2 0.06666667 0 0.1333333  0
C 0.26666667 0.06666667 0.8666667 0 0.0 0.0 0.00000000 0 0.6666667  0
G 0.06666667 0.80000000 0.0000000 1 0.2 0.8 0.93333333 1 0.0000000  1
T 0.46666667 0.00000000 0.1333333 0 0.8 0.0 0.00000000 0 0.2000000  0
          11
A 0.06666667
C 0.00000000
G 0.46666667
T 0.46666667

$`Mmusculus-UniPROBE-Egr1.UP00007`
          1          2          3          4           5           6
A 0.2115466 0.14198757 0.03260499 0.11512588 0.003516173 0.004715059
C 0.2827083 0.72243721 0.87717185 0.07060553 0.990021152 0.982482238
G 0.2034722 0.05485440 0.01243161 0.78128969 0.002264928 0.009896878
T 0.3022730 0.08072082 0.07779155 0.03297890 0.004197748 0.002905824
            7           8           9         10         11         12
A 0.001626612 0.262351637 0.005889514 0.02289301 0.02303758 0.56763334
C 0.975937323 0.731731673 0.985755764 0.09046006 0.85994854 0.05739392
G 0.001661635 0.002729558 0.002081402 0.64932246 0.03791264 0.16679165
T 0.020774430 0.003187133 0.006273319 0.23732447 0.07910124 0.20818108
         13        14
A 0.1765973 0.1830489
C 0.3312648 0.1837744
G 0.1253083 0.2267928
T 0.3668295 0.4063840
\end{Soutput}
\end{Schunk}
and finally, the metadata associated with these two matrices, transposed, for easy reading and comparison:
\begin{Schunk}
\begin{Sinput}
> noquote (t (as.data.frame (values (MotifDb [indices]))))
\end{Sinput}
\begin{Soutput}
                Mmusculus-JASPAR_CORE-Egr1-MA0162.1
providerName    Egr1                               
providerId      MA0162.1                           
dataSource      JASPAR_CORE                        
geneSymbol      Egr1                               
geneId          13653                              
geneIdType      ENTREZ                             
proteinId       P08046                             
proteinIdType   UNIPROT                            
organism        Mmusculus                          
sequenceCount   15                                 
bindingSequence <NA>                               
bindingDomain   Zinc-coordinating                  
tfFamily        BetaBetaAlpha-zinc finger          
experimentType  bacterial 1-hybrid                 
pubmedID        16041365                           
                Mmusculus-UniPROBE-Egr1.UP00007
providerName    SCI09/Egr1_pwm_primary.txt     
providerId      UP00007                        
dataSource      UniPROBE                       
geneSymbol      Egr1                           
geneId          13653                          
geneIdType      ENTREZ                         
proteinId       P08046                         
proteinIdType   UNIPROT                        
organism        Mmusculus                      
sequenceCount   <NA>                           
bindingSequence <NA>                           
bindingDomain   ZnF_C2H2                       
tfFamily        <NA>                           
experimentType  protein binding microarray     
pubmedID        19443739                       
\end{Soutput}
\end{Schunk}
 
We used the \emph{grep} function above to find rows in the metadata table whose \emph{geneSymbol} column includes the string 'Egr1'.
If you wish to identify matrices (and/or their attendant metadata) based upon a richer combination of criteria, for instance:

\begin{enumerate}
  \item organism  (\emph{Mmusculus})
  \item gene symbol  (\emph{Egr1})
  \item data source  (\emph{JASPAR\_CORE})
\end{enumerate}

the grep solution, while serviceable, becomes a little awkward:
\begin{Schunk}
\begin{Sinput}
> geneSymbol.rows = grep ('Egr1', values (MotifDb)$geneSymbol, ignore.case=TRUE)
> organism.rows = grep ('Mmusculus', values (MotifDb)$organism, ignore.case=TRUE)
> source.rows = grep ('JASPAR', values (MotifDb)$dataSource, ignore.case=TRUE)
> egr1.mouse.jaspar.rows = intersect (geneSymbol.rows, 
+                            intersect (organism.rows, source.rows))
> print (egr1.mouse.jaspar.rows)
\end{Sinput}
\begin{Soutput}
[1] 1402
\end{Soutput}
\begin{Sinput}
> egr1.motif <- MotifDb [egr1.mouse.jaspar.rows]
\end{Sinput}
\end{Schunk}

Far more concise, and fully reliable as an interactive command (though \emph{not} if used in a 
script\footnote{See the help page of the base R command subset for detail), is the \emph{subset} command}):
\begin{Schunk}
\begin{Sinput}
> if (interactive ()) {
+   egr1.motif <- subset (MotifDb, organism=='Mmusculus' & 
+                         dataSource=='JASPAR_CORE' & 
+                         geneSymbol=='Egr1')
+   }
\end{Sinput}
\end{Schunk}
Whichever method you use, this next chunk of code displays the matrix, and then the metadata for mouse JASPAR Egr1, the latter
textually-transformed for easy reading within the size constraints of this page.
\begin{Schunk}
\begin{Sinput}
> egr1.motif
\end{Sinput}
\begin{Soutput}
MotifDb object of length 1
| Created from downloaded public sources: 2012-Jul6
| 1 position frequency matrices from 1 source:
|        JASPAR_CORE:    1
| 1 organism/s
|          Mmusculus:    1
Mmusculus-JASPAR_CORE-Egr1-MA0162.1 
\end{Soutput}
\begin{Sinput}
> as.list (egr1.motif)
\end{Sinput}
\begin{Soutput}
$`Mmusculus-JASPAR_CORE-Egr1-MA0162.1`
           1          2         3 4   5   6          7 8         9 10
A 0.20000000 0.13333333 0.0000000 0 0.0 0.2 0.06666667 0 0.1333333  0
C 0.26666667 0.06666667 0.8666667 0 0.0 0.0 0.00000000 0 0.6666667  0
G 0.06666667 0.80000000 0.0000000 1 0.2 0.8 0.93333333 1 0.0000000  1
T 0.46666667 0.00000000 0.1333333 0 0.8 0.0 0.00000000 0 0.2000000  0
          11
A 0.06666667
C 0.00000000
G 0.46666667
T 0.46666667
\end{Soutput}
\begin{Sinput}
> noquote (t (as.data.frame (values (egr1.motif))))
\end{Sinput}
\begin{Soutput}
                Mmusculus-JASPAR_CORE-Egr1-MA0162.1
providerName    Egr1                               
providerId      MA0162.1                           
dataSource      JASPAR_CORE                        
geneSymbol      Egr1                               
geneId          13653                              
geneIdType      ENTREZ                             
proteinId       P08046                             
proteinIdType   UNIPROT                            
organism        Mmusculus                          
sequenceCount   15                                 
bindingSequence <NA>                               
bindingDomain   Zinc-coordinating                  
tfFamily        BetaBetaAlpha-zinc finger          
experimentType  bacterial 1-hybrid                 
pubmedID        16041365                           
\end{Soutput}
\end{Schunk}


Next we use the bioconductor \emph{seqLogo} package to display this motif.

\begin{Schunk}
\begin{Sinput}
> seqLogo (as.list (egr1.motif)[[1]])
\end{Sinput}
\end{Schunk}

\begin{figure}[htpb!]
  \centering
  \includegraphics[width=0.3\textwidth]{MotifDb-egr1}
  \caption{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}
\end{figure}
  
\section{Motif Matching}
We will look for the ten position frequency matrices which are the best match to JASPAR's mouse EGR1, using
the MotIV package.  We actually request the top eleven hits from the entire MotifDb, since the first hit 
should be the target matrix itself, since that is of necessity found in the full MotifDb.

\begin{Schunk}
\begin{Sinput}
> egr1.hits <- motifMatch (as.list (egr1.motif) [1], as.list (MotifDb), top=11)
\end{Sinput}
\begin{Soutput}
	Ungapped Alignment
	Scores read
	Database read
	Motif matches : 11
\end{Soutput}
\begin{Sinput}
> # 'MotIV.toTable' -- defined above (and hidden) -- will become part of MotIV in the upcoming release
> tbl.hits <- MotIV.toTable (egr1.hits)
> print (tbl.hits)
\end{Sinput}
\begin{Soutput}
                                                               name
1                               Mmusculus-JASPAR_CORE-Egr1-MA0162.1
2             Dmelanogaster-FlyFactorSurvey-sr_SANGER_5_FBgn0003499
3                                 Mmusculus-UniPROBE-Zif268.UP00400
4           Dmelanogaster-FlyFactorSurvey-klu_SANGER_10_FBgn0013469
5                                   Mmusculus-UniPROBE-Egr1.UP00007
6            Dmelanogaster-FlyFactorSurvey-klu_SOLEXA_5_FBgn0013469
7            Dmelanogaster-FlyFactorSurvey-Sp1_SANGER_5_FBgn0020378
8        Dmelanogaster-FlyFactorSurvey-Bteb2_SANGER_2.5_FBgn0025679
9  Dmelanogaster-FlyFactorSurvey-CG3065_F1.3_SANGER_2.5_FBgn0034946
10                                 Scerevisiae-ScerTF-SUT1-harbison
11         Dmelanogaster-FlyFactorSurvey-Sp1_SOLEXA_2.5_FBgn0020378
           eVal                sequence                   match strand
1  1.110223e-16             NGCGTGGGCGK             NGCGTGGGCGK      +
2  2.638001e-12          -NGCGTGGGCGK--          NKRNGKGGGCGKNN      +
3  1.550782e-11 ------NGCGTGGGCGK------ NNNNNNTGCGTGGGCGNNNNNTN      -
4  2.341696e-10             NGCGTGGGCGK             TGYGTGGGTGK      +
5  8.184576e-10          --NGCGTGGGCGK-          NNNGCGGGGGCGGN      -
6  1.812032e-08         NGCGTGGGCGK----         NGYGTGGGTGKNNNN      +
7  1.998650e-08          NGCGTGGGCGK---          ---GKGGGCGKRNN      +
8  1.632763e-07             NGCGTGGGCGK             ---GGGGGCGT      +
9  2.892135e-07          NGCGTGGGCGK---          ---GKGGGCGTGGC      +
10 6.714194e-07             NGCGTGGGCGK             -GCSGSGNNSG      +
11 1.033784e-06         NGCGTGGGCGK----         -NNGKGGGCGKRNCN      +
\end{Soutput}
\end{Schunk}

The \emph{sequence} column in this table is the \emph{consensus sequence} -- with heterogeneity left out -- for the 
matrix it describes.   

\vspace{10 mm}

\textbf{\emph{Puzzling: the strand of the match reported above is opposite of what I expected, and opposite of what seqLogo displays.
  This is a question for the MotIV developers.}}

\vspace{10 mm}

The six logos appear below, beginning with the logo of the query matrix, \emph{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}, including
two other mouse matrices, and two zinc-finger fly matrices.  Examining the three mouse matrices and their metadata reveals that
all three (geneSymbol differences aside) describe the same protein:
\begin{Schunk}
\begin{Sinput}
> if (interactive ())
+   noquote (t (as.data.frame (subset (values (MotifDb), geneId=='13653'))))
\end{Sinput}
\end{Schunk}
Zinc finger protein domains are classified into many \emph{fold groups}; their respective cognate DNA sequence may classify similarly.
That two fly matrices significantly match three reports of the mouse Egr1 motif suggests impressive conservation of this 
binding pattern, or convergent evolution.  

Let us look at the metadata for the first fly match, whose geneId is \textbf{FBgn0003499}:
\begin{Schunk}
\begin{Sinput}
> noquote (t (as.data.frame (values (MotifDb)[grep ('FBgn0003499', values (MotifDb)$geneId),])))
\end{Sinput}
\begin{Soutput}
                Dmelanogaster-FlyFactorSurvey-sr_SANGER_5_FBgn0003499
providerName    sr_SANGER_5_FBgn0003499                              
providerId      FBgn0003499                                          
dataSource      FlyFactorSurvey                                      
geneSymbol      Sr                                                   
geneId          FBgn0003499                                          
geneIdType      FLYBASE                                              
proteinId       E1JIP0                                               
proteinIdType   UNIPROT                                              
organism        Dmelanogaster                                        
sequenceCount     18                                                 
bindingSequence <NA>                                                 
bindingDomain   zf-C2H2                                              
tfFamily        <NA>                                                 
experimentType  bacterial 1-hybrid, SANGER sequencing                
pubmedID        <NA>                                                 
                Dmelanogaster-FlyFactorSurvey-sr_SOLEXA_5_FBgn0003499
providerName    sr_SOLEXA_5_FBgn0003499                              
providerId      FBgn0003499                                          
dataSource      FlyFactorSurvey                                      
geneSymbol      Sr                                                   
geneId          FBgn0003499                                          
geneIdType      FLYBASE                                              
proteinId       E1JIP0                                               
proteinIdType   UNIPROT                                              
organism        Dmelanogaster                                        
sequenceCount   2316                                                 
bindingSequence <NA>                                                 
bindingDomain   zf-C2H2                                              
tfFamily        <NA>                                                 
experimentType  bacterial 1-hybrid, SOLEXA sequencing                
pubmedID        <NA>                                                 
\end{Soutput}
\end{Schunk}
that the SOLEXA motif, based upon 2316 sequences, did not (in work not shown, it appears 22nd in the an expanded
motifMatch hit list, with a eval of 10e-5).  It is possible that the SOLEXA motif is more accurate, and that a close
examination of this case, including sequence logos, position frequency matrices, and the search parameters of
motifMatch, will be instructive.  Repeating the search with \emph{tomtom} might also be illuminating -- either as
confirmation of MotIV and the default parameterization we used, or as a correction to it.  Here we see the facilities
for exploratory data analysis MotifDb provides, and the opportunities for data analysis which result.








\begin{figure}[htpb!]
  \centering
  \begin{subfigure}[b]{0.38\textwidth}
    \includegraphics[width=\textwidth]{MotifDb-logo1}
    \caption{Mmusculus-JASPAR\_CORE-Egr1-MA0162.1}
    \label{fig:Egr1-MA0162.1}
    \end{subfigure}%
  \begin{subfigure}[b]{0.38\textwidth}
    \includegraphics[width=\textwidth]{MotifDb-logo2}
    \caption{Dme-FFS-sr\_SANGER\_5\_FBgn0003499\\(abbreviated)}
    \label{fig:Egr1-logo2}
    \end{subfigure}%
\end{figure}

\begin{figure}[htpb!]
  \centering
  \begin{subfigure}[b]{0.38\textwidth}
    \includegraphics[width=\textwidth]{MotifDb-logo3}
    \caption{Mmusculus-UniPROBE-Zif268.UP00400}
    \label{fig:Egr1-logo3}
    \end{subfigure}%
  \begin{subfigure}[b]{0.38\textwidth}
    \includegraphics[width=\textwidth]{MotifDb-logo4}
    \caption{Dme-FFS-klu\_SANGER\_10\_FBgn0013469}
    \label{fig:Egr1-logo4}
    \end{subfigure}%
\end{figure}


\begin{figure}[htpb!]
  \centering
  \begin{subfigure}[b]{0.38\textwidth}
    \includegraphics[width=\textwidth]{MotifDb-logo5}
    \caption{Mmusculus-UniPROBE-Egr1.UP00007}
    \label{fig:Egr1-logo5}
    \end{subfigure}%
  \begin{subfigure}[b]{0.38\textwidth}
    \centering
    \includegraphics[width=\textwidth]{MotifDb-logo6}
    \caption{Dme-FFS-klu\_SOLEXA\_5\_FBgn0013469}
    \label{fig:Egr1-logo6}
    \end{subfigure}%
\end{figure}

\newpage

\section{Exporting to the MEME Suite}
Some users of this package may wish to export the data -- both matrices and metadata -- so that they may be used in
other programs.  The MEME suite, among others, is broadly useful, continuously improved and well-regarded throughout the
bioinformatics community.  The code below exports all of the MotifDb matrices as a text file in the MEME format, and all
of the metadata as a tab-delimited text file.

\begin{Schunk}
\begin{Sinput}
> matrix.output.file = tempfile ()   # substitute your preferred filename here
> meme.text = export (MotifDb, matrix.output.file, 'meme')
> metadata.output.file = tempfile () # substitute your preferred filename here
> write.table (as.data.frame (values (MotifDb)), file=metadata.output.file, sep='\t', 
+              row.names=TRUE, col.names=TRUE, quote=FALSE)
\end{Sinput}
\end{Schunk}

\section{Future Work}
This first version of MotifDb collects into one R package all of the best-known public domain protein-DNA binding matrices, with
as much metadata as could be gleaned from the five providers.  However, not all of these matrices are equally supported by data
and by no means are all are accompanied by complete metadata.

With the passage of time our knowledge of protein-DNA binding sequence motifs will improve.  They will be derived from more 
binding events, with more precision and specificity, and accompanied by more (and better understood) contextual detail.  Cooperative binding, 
mentioned only in a few times in the current (July 2012) version of this package, will be well-represented.  Metadata will improve.
Better assignment of binding domains to consensus categories will be especially useful when it is available.  Three-dimensional 
models of specific proteins binding to specific DNA may someday become commonplace.


\section{References}

\begin{itemize}

\item Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010 Jan;38(Database issue):D105-10. Epub 2009 Nov 11.

\item Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2011 Jan;39(Database issue):D124-8. Epub 2010 Oct 30.

\item Spivak AT, Stormo GD. ScerTF: a comprehensive database of benchmarked position weight matrices for Saccharomyces species. Nucleic Acids Res. 2012 Jan;40(Database issue):D162-8. Epub 2011 Dec 2.

\item Xie Z, Hu S, Blackshaw S, Zhu H, Qian J. hPDI: a database of experimental human protein-DNA interactions. Bioinformatics. 2010 Jan 15;26(2):287-9. Epub 2009 Nov 9.

\item Zhu LJ, et al. 2011. FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011 Jan;39(Database issue):D111-7. Epub 2010 Nov 19.


\end{itemize}


\end{document}