git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MotifDb@107918 bc3139a8-67e5-0310-9ffc-ced21a209358
1 | 1 |
new file mode 100644 |
... | ... |
@@ -0,0 +1,123 @@ |
1 |
+Importing Frequency Matrices and Metadata into MotifDb |
|
2 |
+ |
|
3 |
+Paul Shannon |
|
4 |
+pshannon@fredhutch.org |
|
5 |
+28 August 2015 |
|
6 |
+ |
|
7 |
+MotifDb currently collects nine sources of protein-binding DNA sequences, with metadata, into |
|
8 |
+one common format. The heterogeneity of these sources is large, and sometimes daunting, but |
|
9 |
+by transforming them into a standard form (a list of position weight matrices and a standard |
|
10 |
+fifteen-column metadata DataFrame) they may be easily used together. |
|
11 |
+ |
|
12 |
+In the inst/scripts/import directory you will find one subdirectory for each data source. |
|
13 |
+Within each subdirectory you will find a data-source-specific "import.R" which knows |
|
14 |
+the details of the raw data each source provides, and which transforms the raw data into |
|
15 |
+two data structures: matrices, and tbl.md |
|
16 |
+ |
|
17 |
+The best way to learn the ropes is to run, study, and thoroughly understand |
|
18 |
+ |
|
19 |
+ inst/scripts/import/demo/import.R |
|
20 |
+ |
|
21 |
+In the demo directory you will also find the raw files, each describing four binding site motifs. |
|
22 |
+This particular data is extracted from the much larger |
|
23 |
+JASPAR core data set. It is presented here only as an example. |
|
24 |
+ |
|
25 |
+The first of four matrices: |
|
26 |
+ |
|
27 |
+ >MA0004.1 Arnt |
|
28 |
+ 4 19 0 0 0 0 |
|
29 |
+ 16 0 20 0 0 0 |
|
30 |
+ 0 1 0 20 0 20 |
|
31 |
+ 0 0 0 0 20 0 |
|
32 |
+ |
|
33 |
+Its metadata: |
|
34 |
+ |
|
35 |
+id jaspar.class ma.name unknown gene.symbol uniprot ncbi.tax.code class comment family medline tax_group type pazar_tf_id description tfbs_shape_id consensus jaspar mcs transfac end_relative_to_tss start_relative_to_tss included_models source centrality_logp tfe_id symbol alias |
|
36 |
+9232 CORE MA0004 1 Arnt P53762 10090 Zipper-Type - Helix-Loop-Helix 7592839 vertebrates SELEX TF0000003 aryl hydrocarbon receptor nuclear translocator 11 NA NA NA NA NA NA NA NA NA 580 ARNT HIF-1beta,bHLHe2 |
|
37 |
+ |
|
38 |
+ |
|
39 |
+Once standardized: |
|
40 |
+ |
|
41 |
+ $`Mmusculus-jaspar2014-Arnt-MA0004` |
|
42 |
+ 1 2 3 4 5 6 |
|
43 |
+ A 0.2 0.95 0 0 0 0 |
|
44 |
+ C 0.8 0.00 1 0 0 0 |
|
45 |
+ G 0.0 0.05 0 1 0 1 |
|
46 |
+ T 0.0 0.00 0 0 1 0 |
|
47 |
+ |
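The step from raw counts to the standardized matrix can be sketched in a few lines of R. This is only an illustration built from the Arnt numbers shown above; the variable names (counts, freqs, sequence.count) are my own, and the demo import.R may do the conversion differently.

```r
# Raw counts for MA0004.1 Arnt as listed above; rows are A, C, G, T.
counts <- matrix(c( 4, 19,  0,  0,  0,  0,
                   16,  0, 20,  0,  0,  0,
                    0,  1,  0, 20,  0, 20,
                    0,  0,  0,  0, 20,  0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("A", "C", "G", "T"), as.character(1:6)))

# Divide every column by its total so that each position sums to 1.
freqs <- sweep(counts, 2, colSums(counts), "/")

# Every column of this motif has the same total, which matches the
# sequenceCount value reported in the metadata.
sequence.count <- unique(colSums(counts))

print(freqs)
```

Keeping the column total around before normalizing is useful, since the counts cannot be recovered from the frequencies alone.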
|
48 |
+Transpose the corresponding row of the metadata for easy reading: |
|
49 |
+ |
|
50 |
+ t(tbl.md[1,]) |
|
51 |
+ Mmusculus-jaspar2014-Arnt-MA0004 |
|
52 |
+ providerName "MA0004.1 Arnt" |
|
53 |
+ providerId "MA0004.1 Arnt" |
|
54 |
+ dataSource "jaspar2014" |
|
55 |
+ geneSymbol "Arnt" |
|
56 |
+ geneId NA |
|
57 |
+ geneIdType NA |
|
58 |
+ proteinId "P53762" |
|
59 |
+ proteinIdType "UNIPROT" |
|
60 |
+ organism "Mmusculus" |
|
61 |
+ sequenceCount "20" |
|
62 |
+ bindingSequence NA |
|
63 |
+ bindingDomain NA |
|
64 |
+ tfFamily "Helix-Loop-Helix" |
|
65 |
+ experimentType "SELEX" |
|
66 |
+ pubmedID "24194598" |
|
67 |
+ |
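A quick consistency check before saving might look like the sketch below. It rebuilds the single metadata row shown above by hand, pairs it with a placeholder matrix, and confirms that the table has fifteen columns and that its row names agree with the names of the matrix list. The placeholder matrix values are made up for illustration, and the checks themselves are a suggestion of mine, not something the import scripts are known to perform.

```r
motif.name <- "Mmusculus-jaspar2014-Arnt-MA0004"

# One metadata row with the fifteen standard columns, values as shown above.
tbl.md <- data.frame(providerName    = "MA0004.1 Arnt",
                     providerId      = "MA0004.1 Arnt",
                     dataSource      = "jaspar2014",
                     geneSymbol      = "Arnt",
                     geneId          = NA_character_,
                     geneIdType      = NA_character_,
                     proteinId       = "P53762",
                     proteinIdType   = "UNIPROT",
                     organism        = "Mmusculus",
                     sequenceCount   = "20",
                     bindingSequence = NA_character_,
                     bindingDomain   = NA_character_,
                     tfFamily        = "Helix-Loop-Helix",
                     experimentType  = "SELEX",
                     pubmedID        = "24194598",
                     stringsAsFactors = FALSE)
rownames(tbl.md) <- motif.name

# Placeholder matrix (uniform frequencies); only its name matters here.
matrices <- list(matrix(0.25, nrow = 4, ncol = 6,
                        dimnames = list(c("A", "C", "G", "T"),
                                        as.character(1:6))))
names(matrices) <- motif.name

# The two structures should describe the same motifs, in the same order.
stopifnot(ncol(tbl.md) == 15)
stopifnot(identical(rownames(tbl.md), names(matrices)))
```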
|
68 |
+ |
|
69 |
+Each import.R file has a "run" function which requires one argument: the path to the parent data directory |
|
70 |
+below which is found the raw data, one subdirectory per source. |
|
71 |
+ |
|
72 |
+Since inst/scripts/import/demo/ seeks to be self-contained, and has its own small raw data files, you |
|
73 |
+can run it like this: |
|
74 |
+ |
|
75 |
+cd inst/scripts/import/demo |
|
76 |
+source("import.R") |
|
77 |
+run("..")   # points to the immediate parent |
|
78 |
+ |
|
79 |
+The result of "run" is to create tbl.md and the matrices list, then save them both together |
|
80 |
+in a single serialized RData file (e.g., "demo.RData"). |
|
81 |
+ |
|
82 |
+If you want to include these four matrices in the next build of MotifDb, our convention is |
|
83 |
+to simply copy demo.RData into MotifDb/inst/extdata. When a user loads MotifDb, every |
|
84 |
+serialized object in extdata is read and concatenated into one large tbl.md, and one large |
|
85 |
+list of matrices. (You will never want to leave demo.RData in extdata/ -- the matrices |
|
86 |
+are duplicates of those found in jaspar2014.) |
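To make the save-then-concatenate idea concrete, here is a minimal, self-contained sketch. It writes two tiny RData files to a temporary directory, each holding a matrices list and a tbl.md, and then reads them back and combines them. The helper names (write.source, combine.sources), the source labels, and the second motif name are hypothetical, and I am assuming plain save() and load(); the actual loading code inside MotifDb may be organized differently.

```r
# Hypothetical helper: save one placeholder motif as <source.label>.RData.
write.source <- function(motif.name, source.label) {
  matrices <- list(matrix(0.25, nrow = 4, ncol = 6))
  names(matrices) <- motif.name
  tbl.md <- data.frame(dataSource = source.label, stringsAsFactors = FALSE)
  rownames(tbl.md) <- motif.name
  path <- file.path(tempdir(), paste0(source.label, ".RData"))
  save(matrices, tbl.md, file = path)   # both objects go into one file
  path
}

# Hypothetical helper: load each file into its own environment, then
# concatenate the matrix lists and stack the metadata rows.
combine.sources <- function(files) {
  loaded <- lapply(files, function(f) {
    env <- new.env()
    load(f, envir = env)
    list(matrices = env$matrices, tbl.md = env$tbl.md)
  })
  list(matrices = do.call(c, lapply(loaded, `[[`, "matrices")),
       tbl.md   = do.call(rbind, lapply(loaded, `[[`, "tbl.md")))
}

files <- c(write.source("Mmusculus-jaspar2014-Arnt-MA0004", "demoA"),
           write.source("Hsapiens-example-GeneX-0001", "demoB"))
combined <- combine.sources(files)
```

Loading each file into a fresh environment keeps the identically named objects (matrices, tbl.md) from one source from overwriting those of another before they are combined.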
|
87 |
+ |
|
88 |
+In the parent directory of all the data-source-specific import directories (demo, uniprobe, ScerTF, ...) |
|
89 |
+is a simple master script, importAll.R. You will typically invoke this from the command line: |
|
90 |
+ |
|
91 |
+ R -f importAll.R |
|
92 |
+ |
|
93 |
+and watch as it runs all of the nested import.R scripts. |
|
94 |
+ |
|
95 |
+Important point: importAll.R must be told where to find all of the raw data. And there is a LOT |
|
96 |
+of it. I keep a copy on my laptop during development, but it is only a duplicate of the permanent |
|
97 |
+raw data repository at the Hutch: |
|
98 |
+ |
|
99 |
+ rhino:/fh/fast/morgan_m/BioC/MotifDb-raw-data/ |
|
100 |
+ |
|
101 |
+Thus the standard way to assemble all the data for a MotifDb package build is |
|
102 |
+ |
|
103 |
+ 1) log in to one of the rhinos |
|
104 |
+ 2) checkout the latest MotifDb |
|
105 |
+ 3) cd MotifDb/inst/scripts/import |
|
106 |
+ 4) make sure that the "dataDir" variable is assigned |
|
107 |
+ "/fh/fast/morgan_m/BioC/MotifDb-raw-data/" |
|
108 |
+ 5) R -f importAll.R |
|
109 |
+ |
|
110 |
+When the script is complete: |
|
111 |
+ |
|
112 |
+ 6) cp */*.RData ../../extdata/ |
|
113 |
+ 7) build MotifDb |
|
114 |
+ 8) run the unit tests |
|
115 |
+ |
|
116 |
+ |
|
117 |
+ |
|
118 |
+ |
|
119 |
+ |
|
120 |
+ |
|
121 |
+ |
|
122 |
+ |
|
123 |
+ |