
first version

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MotifDb@107918 bc3139a8-67e5-0310-9ffc-ced21a209358

p.shannon authored on 29/08/2015 00:56:57
Showing 1 changed files

new file mode 100644
@@ -0,0 +1,123 @@
1
+Importing Frequency Matrices and Metadata into MotifDb
2
+
3
+Paul Shannon
4
+pshannon@fredhutch.org
5
+28 August 2015
6
+
7
+MotifDb currently collects nine sources of protein-binding DNA sequences, with metadata, into
8
+one common format.   The heterogeneity of these sources is large, and sometimes daunting, but
9
+by transforming them into a standard form (a list of position weight matrices and a standard
10
+fifteen-column metadata DataFrame) they may be easily used together.
11
+
12
+In the inst/scripts/import directories you will find one subdirectory for each data source.
13
+Within each subdirectory you will find a data-source-specific "import.R" which knows
14
+the details of the raw data each source provides, and which transforms the raw data into 
15
+two data structures:  matrices, and tbl.md
16
+
17
+The best way to learn the ropes is to run, study, and thoroughly understand 
18
+
19
+   inst/scripts/import/demo/import.R
20
+
21
+In the demo directory you will also find the raw files, each describing four binding site motifs.
22
+This particular data is extracted from the much larger
23
+JASPAR core data set.  It is presented here only as an example.
24
+
25
+The first of four matrices:
26
+
27
+    >MA0004.1 Arnt
28
+    4	19	0	0	0	0
29
+    16	0	20	0	0	0
30
+    0	1	0	20	0	20
31
+    0	0	0	0	20	0
32
+
33
+Its metadata:
34
+
35
+id	jaspar.class	ma.name	unknown	gene.symbol	uniprot	ncbi.tax.code	class	comment	family	medline	tax_group	type	pazar_tf_id	description	tfbs_shape_id	consensus	jaspar	mcs	transfac	end_relative_to_tss	start_relative_to_tss	included_models	source	centrality_logp	tfe_id	symbol	alias
36
+9232	CORE	MA0004	1	Arnt	P53762	10090	Zipper-Type	-	Helix-Loop-Helix	7592839	vertebrates	SELEX	TF0000003	aryl hydrocarbon receptor nuclear translocator	11	NA	NA	NA	NA	NA	NA	NA	NA	NA	580	ARNT	HIF-1beta,bHLHe2
37
+
38
+
39
+Once standardized:
40
+
41
+    $`Mmusculus-jaspar2014-Arnt-MA0004`
42
+        1    2 3 4 5 6
43
+    A 0.2 0.95 0 0 0 0
44
+    C 0.8 0.00 1 0 0 0
45
+    G 0.0 0.05 0 1 0 1
46
+    T 0.0 0.00 0 0 1 0
47
+
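The step from the raw counts to the standardized matrix above can be reproduced with a few lines of base R. This is only a minimal sketch, not the code in import.R: it assumes the four raw rows are in A, C, G, T order and that standardizing means dividing each column by its total. The variable names `counts` and `pfm` are made up for illustration.

```r
# Raw counts for MA0004.1 Arnt, copied from the demo file.
# Assumption: rows are in A, C, G, T order.
counts <- matrix(c( 4, 19,  0,  0,  0,  0,
                   16,  0, 20,  0,  0,  0,
                    0,  1,  0, 20,  0, 20,
                    0,  0,  0,  0, 20,  0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("A", "C", "G", "T"), as.character(1:6)))

# Divide every column by its own total so that each position sums to 1.
pfm <- sweep(counts, 2, colSums(counts), "/")

print(colSums(counts))  # the per-position totals
print(pfm)              # the normalized matrix
```

Every column of the raw matrix sums to 20, which agrees with the sequenceCount value shown in the metadata for this motif.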
48
+Transpose the corresponding row of the metadata for easy reading:
49
+
50
+    t(tbl.md[1,])
51
+                    Mmusculus-jaspar2014-Arnt-MA0004
52
+    providerName    "MA0004.1 Arnt"                 
53
+    providerId      "MA0004.1 Arnt"                 
54
+    dataSource      "jaspar2014"                    
55
+    geneSymbol      "Arnt"                          
56
+    geneId          NA                              
57
+    geneIdType      NA                              
58
+    proteinId       "P53762"                        
59
+    proteinIdType   "UNIPROT"                       
60
+    organism        "Mmusculus"                     
61
+    sequenceCount   "20"                            
62
+    bindingSequence NA                              
63
+    bindingDomain   NA                              
64
+    tfFamily        "Helix-Loop-Helix"              
65
+    experimentType  "SELEX"                         
66
+    pubmedID        "24194598"                      
67
+
68
+
69
+Each import.R file has a "run" function which requires one argument: the path to the parent data directory 
70
+below which is found the raw data, one subdirectory per source.
71
+
72
+Since inst/scripts/import/demo/ seeks to be self-contained, and has its own small raw data files, you 
73
+can run it like this:
74
+
75
+cd inst/scripts/import/demo
76
+source("import.R")
77
+run("..")   # points to the immediate parent
78
+
79
+The result of "run" is to create tbl.md and the matrices list, then save them both together
80
+in a single serialized RData file (e.g., "demo.RData").
81
+
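Before saving or copying the result, it can help to check that the two structures agree with each other. The function below is a hypothetical helper, not part of MotifDb. It assumes what the example output suggests: that the names of the matrices list match the row names of tbl.md in the same order, that every matrix has rows A, C, G, T, and that every column sums to 1.

```r
# Hypothetical sanity check (checkImport is an illustrative name, not a MotifDb function).
checkImport <- function(matrices, tbl.md) {
  # Assumption: one metadata row per matrix, in the same order, sharing names.
  stopifnot(identical(names(matrices), rownames(tbl.md)))
  # Every matrix should have exactly the four nucleotide rows.
  rows.ok <- vapply(matrices,
                    function(m) identical(rownames(m), c("A", "C", "G", "T")),
                    logical(1))
  stopifnot(all(rows.ok))
  # Every position (column) should sum to 1, within rounding error.
  sums.ok <- vapply(matrices,
                    function(m) all(abs(colSums(m) - 1) < 1e-8),
                    logical(1))
  stopifnot(all(sums.ok))
  TRUE
}

# Exercise the check with the standardized Arnt matrix and a cut-down metadata row.
arnt <- matrix(c(0.2, 0.95, 0, 0, 0, 0,
                 0.8, 0.00, 1, 0, 0, 0,
                 0.0, 0.05, 0, 1, 0, 1,
                 0.0, 0.00, 0, 0, 1, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "C", "G", "T"), as.character(1:6)))
matrices <- setNames(list(arnt), "Mmusculus-jaspar2014-Arnt-MA0004")
tbl.md <- data.frame(dataSource = "jaspar2014", geneSymbol = "Arnt",
                     organism = "Mmusculus",
                     row.names = names(matrices), stringsAsFactors = FALSE)
ok <- checkImport(matrices, tbl.md)
```

The call stops with an error as soon as one condition fails, so a mismatch between the matrices and their metadata is caught before the file reaches extdata.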
82
+If you wanted to include these 4 matrices in the next build of MotifDb, our convention is
83
+to simply copy demo.RData into MotifDb/inst/extdata.  When a user loads MotifDb, every
84
+serialized object in extdata is read and concatenated into one large tbl.md, and one large
85
+list of matrices.  (You will never want to leave demo.RData in extdata/ -- the matrices
86
+are duplicates of those found in jaspar2014.)
87
+
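The read-and-concatenate behavior described above might look roughly like the following. This is a sketch under stated assumptions, not MotifDb's actual loading code: it assumes every file in the directory stores a list named `matrices` and a table named `tbl.md`, and it uses a plain data.frame in place of the DataFrame class. The function name `loadAllSources` and the two made-up sources are purely illustrative.

```r
# Hypothetical helper: read every .RData file in a directory and merge the contents.
# Assumes each file holds objects named 'matrices' and 'tbl.md'.
loadAllSources <- function(dir) {
  files <- list.files(dir, pattern = "\\.RData$", full.names = TRUE)
  matrices.all <- list()
  md.all <- list()
  for (f in files) {
    env <- new.env()
    load(f, envir = env)            # load into a private environment
    matrices.all <- c(matrices.all, env$matrices)
    md.all[[f]] <- env$tbl.md
  }
  list(matrices = matrices.all, tbl.md = do.call(rbind, unname(md.all)))
}

# Tiny made-up inputs, written to a temporary directory, to exercise the helper.
tmp <- tempfile("extdata")
dir.create(tmp)
for (src in c("sourceA", "sourceB")) {
  matrices <- setNames(list(matrix(0.25, nrow = 4, ncol = 3)), paste0(src, "-m1"))
  tbl.md <- data.frame(dataSource = src, row.names = names(matrices),
                       stringsAsFactors = FALSE)
  save(matrices, tbl.md, file = file.path(tmp, paste0(src, ".RData")))
}

merged <- loadAllSources(tmp)
print(names(merged$matrices))
print(merged$tbl.md)
```

Because every file is merged without any check for duplicates, a leftover demo.RData would add a second copy of its motifs, which is why it should not stay in extdata.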
88
+In the parent directory of all the data-source-specific import directories (demo, uniprobe, ScerTF, ...)
89
+is a simple master script, importAll.R.  You will typically invoke this from the command line:
90
+
91
+   R -f importAll.R
92
+
93
+and watch as it runs all of the nested import.R scripts.
94
+
95
+Important point:  importAll.R must be told where to find all of the raw data.  And there is a LOT
96
+of it.  I keep a copy on my laptop during development, but it is only a duplicate of the permanent
97
+raw data repository at the Hutch:
98
+
99
+  rhino:/fh/fast/morgan_m/BioC/MotifDb-raw-data/
100
+
101
+Thus the standard way to assemble all the data for a MotifDb package build is
102
+
103
+ 1) log in to one of the rhinos
104
+ 2) check out the latest MotifDb
105
+ 3) cd MotifDb/inst/scripts/import
106
+ 4) make sure that the "dataDir" variable is assigned
107
+     "/fh/fast/morgan_m/BioC/MotifDb-raw-data/"
108
+ 5) R -f importAll.R
109
+
110
+When the script is complete:
111
+
112
+  6) cp */*Rdata ../../extdata/
113
+  7) build MotifDb
114
+  8) run the unit tests
115
+
116
+
117
+
118
+
119
+
120
+
121
+
122
+
123
+