change filter doc

Simone authored on 26/04/2019 09:07:25
Showing 4 changed files

... ...
@@ -1,2 +1,4 @@
1
+^Meta$
2
+^doc$
1 3
 ^.*\.Rproj$
2 4
 ^\.Rproj\.user$
... ...
@@ -1,3 +1,5 @@
1
+Meta
2
+doc
1 3
 .DS_Store
2 4
 **/.DS_Store
3 5
 .Rhistory
... ...
@@ -6,4 +8,4 @@
6 8
 .Rapp.history
7 9
 .RData
8 10
 .*.zip
9
-*.Rproj
10 11
\ No newline at end of file
12
+*.Rproj
... ...
@@ -35,7 +35,8 @@
35 35
 #'
36 36
 #' @details
37 37
 #' This function works only with dataset or GRangesList all whose samples or 
38
-#' Granges have the same region coordinates (chr, ranges, strand)
38
+#' GRanges have the same region coordinates (chr, ranges, strand) ordered in 
39
+#' the same way for each sample 
39 40
 #' 
40 41
 #' In case of GRangesList data input, the function searches for metadata
41 42
 #' into metadata() function associated to GRangesList.
... ...
@@ -52,10 +53,14 @@
52 53
 #' filter_and_extract(test_path, region_attributes = c("pvalue", "peak"))
53 54
 #' 
54 55
 #' ## This statement imports a GMQL dataset as GRangesList and filters it 
55
-#' ## including at output only "pvalue" and "peak" region attributes
56
+#' ## including at output only "pvalue" and "peak" region attributes; the sort
57
+#' ## function makes sure that the region coordinates (chr, ranges, strand) 
58
+#' ## of all samples are ordered correctly
59
+#' 
56 60
 #' 
57 61
 #' grl = import_gmql(test_path, TRUE)
58
-#' filter_and_extract(grl, region_attributes = c("pvalue", "peak"))
62
+#' sorted_grl = sort(grl)
63
+#' filter_and_extract(sorted_grl, region_attributes = c("pvalue", "peak"))
59 64
 #'
60 65
 #'
61 66
 #' @export
62 67
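The ordering requirement documented in the hunk above (all samples share the same region coordinates, in the same order) can be checked before filtering. The following is a minimal base-R sketch, not RGMQL or GenomicRanges code: plain data frames stand in for the samples, and the helper names `region_key`, `sort_regions` and `same_coordinates`, as well as all sample values, are hypothetical.

```r
# Hypothetical stand-ins for two samples: data frames with the coordinate
# columns (chr, start, end, strand) plus one region attribute.
s1 <- data.frame(chr = c("chr1", "chr1", "chr2"),
                 start = c(100L, 500L, 50L),
                 end = c(200L, 650L, 90L),
                 strand = c("+", "-", "+"),
                 pvalue = c(0.01, 0.20, 0.05),
                 stringsAsFactors = FALSE)

# Same regions as s1, stored in a different row order.
s2 <- s1[c(3, 1, 2), ]
s2$pvalue <- c(0.30, 0.02, 0.10)
rownames(s2) <- NULL

# One string per region, built from its coordinates.
region_key <- function(df) {
  paste(df$chr, df$start, df$end, df$strand, sep = ":")
}

# Reorder a sample by its region coordinates.
sort_regions <- function(df) {
  ord <- order(df$chr, df$start, df$end, df$strand)
  out <- df[ord, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# TRUE only when every sample lists identical coordinates in identical order.
same_coordinates <- function(samples) {
  ref <- region_key(samples[[1]])
  all(vapply(samples, function(s) identical(region_key(s), ref), logical(1)))
}

unsorted_ok <- same_coordinates(list(s1, s2))
sorted_ok <- same_coordinates(lapply(list(s1, s2), sort_regions))
```

Here the two samples fail the check as stored and pass it once each is sorted, which is the situation the added `sort(grl)` call in the example is meant to handle.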
deleted file mode 100644
... ...
@@ -1,303 +0,0 @@
1
-title: "RGMQL: GenoMetric Query Language for R/Bioconductor"
2
-author: 
3
-- "Simone Pallotta" 
4
-- "Marco Masseroli"
5
-date: "2017-11-14"
6
-bibliography: bibliography.bib
7
-output: BiocStyle::pdf_document
8
-vignette: >
9
-    %\VignetteIndexEntry{Vignette Title}
10
-    %\VignetteEngine{knitr::rmarkdown}
11
-    %\VignetteEncoding{UTF-8}
12
-link-citations: true
13
-
14
-# Introduction
15
-
16
-Recent years have seen a tremendous increase in the volume of data generated 
17
-in the life sciences, especially propelled by the rapid progress of 
18
-Next Generation Sequencing (NGS) technologies. 
19
-These high-throughput technologies can produce billions of short DNA or RNA 
20
-fragments in excess of a few terabytes of data in a single run.
21
-Next-generation sequencing refers to the deep, in-parallel DNA sequencing 
22
-technologies providing massively parallel analysis and extremely 
23
-high-throughput from multiple samples at much reduced cost. 
24
-Improvement of sequencing technologies and data processing pipelines 
25
-is rapidly providing sequencing data, with associated high-level features, 
26
-of many individual genomes in multiple biological and clinical conditions. 
27
-To make effective use of the produced data, the design of big data algorithms 
28
-and their efficient implementation on modern high performance 
29
-computing infrastructures, such as clouds, CPU clusters 
30
-and network infrastructures, is required in order to achieve scalability 
31
-and performance. 
32
-For this purpose the GenoMetric Query Language (GMQL) has been proposed 
33
-as a high-level, declarative language to process, query, 
34
-and compare multiple and heterogeneous genomic datasets for biomedical 
35
-knowledge discovery [@Bioinformatics2015].
36
-
37
-## Purpose
38
-
39
-A very important emerging problem is to make sense of the enormous amount and 
40
-variety of NGS data becoming available, i.e. to discover how different genomic 
41
-regions and their products interact and cooperate with each other. 
42
-To this aim, the integration of several heterogeneous DNA feature data 
43
-is required.
44
-Such big genomic feature data are collected within numerous and 
45
-heterogeneous files, usually distributed within different repositories, 
46
-lacking an attribute-based organization and a systematic description 
47
-of their metadata. 
48
-These heterogeneous data can contain the hidden answer to very important 
49
-biomedical questions.
50
-To unveil them, standard tools already available for knowledge extraction 
51
-are too specialized or present powerful features, but have a rough interface 
52
-not well-suited for scientists/biologists.
53
-GMQL addresses these aspects using cloud-based technologies 
54
-(including Apache Hadoop, MapReduce, and Spark), and focusing on genomic data 
55
-operations written as simple queries with implicit iterations over thousands 
56
-of heterogeneous samples, computed efficiently [@IEEE7484654].
57
-This RGMQL package makes it easy to take advantage of GMQL functionalities, 
58
-even for scientists and biologists with limited knowledge of query and 
59
-programming languages, but used to the R/Bioconductor environment. 
60
-This package is built over a GMQL scalable data management engine 
61
-written in the Scala programming language, released as a Scala API [@githubrepo] 
62
-providing a set of functions to combine, manipulate, compare, and extract 
63
-genomic data from different datasources both from local and remote datasets.
64
-These functions allow performing complex GMQL processing and queries without 
65
-knowledge of GMQL syntax, leveraging instead R's idiomatic paradigm and logic.
66
-
67
-
68
-# Genomic Data Model
69
-
70
-The Genomic Data Model (GDM) is based on the notions of datasets 
71
-and samples [@modeling2016]. 
72
-Datasets are collections of samples, and each sample consists of two parts, 
73
-the region data, which describe portions of the genome, and the metadata, 
74
-which describe sample general properties and how observations are collected.
75
-In contrast to other data models, it clearly divides, and comprehensively 
76
-manages, observations about genomic regions and metadata.
77
-GDM provides a flat attribute based organization, just requiring that 
78
-each dataset is associated with a given data schema, which specifies 
79
-the attributes of the region data and their types.
80
-The first attributes of such schema are fixed (chr, start, end, strand); 
81
-they represent the genomic region identifying coordinates.
82
-In addition, metadata have free attribute-value pair format.
83
-
84
-## Genomic Region 
85
-
86
-Genomic region data describe a broad variety of biomolecular aspects and are 
87
-very valuable for biomolecular investigation.
88
-A genomic region is a portion of a genome, qualified by a quadruple of values 
89
-called region coordinates:
90
-$$< chr, left, right, strand >$$
91
-Regions can have an arbitrary number of associated values, according to 
92
-the processing of DNA, RNA or epigenomic sequencing reads that determined 
93
-the region.
94
-
95
-## Metadata
96
-
97
-Metadata describe the biological and clinical properties associated with 
98
-each sample.
99
-They are usually collected in a broad variety of data structures and formats 
100
-that constitute barriers to their use and comparison. GDM models metadata 
101
-simply as arbitrary semi-structured attribute-value pairs, 
102
-where attributes may have multiple values.
103
-
104
-## Genomic Sample
105
-
106
-Formally, a sample s is a collection of genomic regions modeled as 
107
-the following triple: $$< id, {< r_i,v_i >}, {m_j} >$$ where:
108
-
109
-* id is the sample identifier
110
-* Each region is a pair of coordinates $r_i$ and values $v_i$
111
-* Metadata $m_j$ are attribute-value pairs $< a_j,v_j >$
112
-
113
-Note that the sample id attribute provides a many-to-many connection between 
114
-regions and metadata of a sample.
115
-Through the use of a data type system to express region data, and of arbitrary 
116
-attribute-value pairs for metadata, GDM provides interoperability across 
117
-datasets in multiple formats produced by different experimental techniques.
118
-
119
-## Dataset
120
-
121
-A dataset is a collection of samples uniquely identified, with the same region 
122
-schema and with each sample consisting of two parts:
123
-
124
-* region data: describing characteristics and location of genomic portions
125
-* metadata: expressing general properties of the sample
126
-
127
-Each dataset is typically produced within the same project by using the same 
128
-or equivalent technology and tools, but with different experimental 
129
-conditions, described by metadata.
130
-
131
-Datasets contain a large amount of information describing regions of a genome, 
132
-with data encoded in human readable format using plain text files.
133
-
134
-GMQL datasets are materialized in a standard layout composed of three 
135
-types of files:
136
-
137
-1. genomic region tab-delimited text files with extension .gdm, or .gtf 
138
-if in standard GTF format
139
-2. metadata attribute-value tab-delimited text files with the same full name 
140
-(name and extension) as the corresponding region file, plus the extension .meta
141
-3. schema XML file containing region attribute names and types
142
-
143
-All these files reside in a single folder called files.
144
-
145
-<!-- ![GMQL dataset folder](dataset_gmql.png) -->
146
-
147
-In the RGMQL package, dataset files are considered read-only.
148
-Once read, genomic information is represented in an abstract structure inside 
149
-the package, mapped to an R GRanges data structure when needed.
150
-
151
-
152
-# GenoMetric Query Language
153
-
154
-The GenoMetric Query Language name stems from the language's ability to deal 
155
-with genomic distances, which are measured as number of nucleotide bases 
156
-between genomic regions (aligned to the same reference genome) and computed 
157
-using arithmetic operations between region coordinates.
158
-GMQL is a high-level, declarative language that allows expressing queries 
159
-easily over genomic regions and their metadata, in a way similar to what can 
160
-be done with the Structured Query Language (SQL) over a relational database.
161
-The GMQL approach exhibits two main differences with respect to other tools 
162
-based on Hadoop, the MapReduce framework, and Spark engine technologies 
163
-to address similar biomedical problems:\newline
164
-
165
-* GMQL:
166
-
167
-    1. reads from processed datasets
168
-    2. supports metadata management
169
-    
170
-* Others:
171
-
172
-    1. read generally from raw or aligned data from NGS machines
173
-    2. provide no support for metadata management
174
-
175
-GMQL is the appropriate tool for querying numerous processed genomic datasets 
176
-and very many samples that are becoming available.
177
-Note however that GMQL performs worse than some other available systems on a 
178
-small number of small-scale datasets, but these other systems are not 
179
-cloud-based; hence, they are not adequate for efficient big data processing 
180
-and, in some cases, they are inherently limited in their 
181
-data management capacity, as they only work as RAM memory resident processes.
182
-
183
-## Query structure
184
-
185
-A GMQL query is expressed as a sequence of GMQL operations with the 
186
-following structure:
187
-$$< variable > = operator(< parameters >) < variable >;$$
188
-where each $< variable >$ stands for a GDM dataset.
189
-
190
-This RGMQL package brings GMQL functionalities into the R environment, 
191
-allowing users to build directly a GMQL query without knowing the GMQL syntax.
192
-In RGMQL, every GMQL operation is translated into an R function 
193
-and expressed as:
194
-$$ variable = operator(variable, parameters)$$
195
-
196
-It is very similar to the GMQL syntax for operation expression although 
197
-expressed with the R idiomatic paradigm and logic, with parameters built 
198
-entirely from R native data structures such as lists, matrices, 
199
-vectors or R logic conditions.
200
-
201
-
202
-# Processing Environments
203
-
204
-In this section, we show how GMQL processing is built in R, which operations 
205
-are available in RGMQL, and the difference between local 
206
-and remote dataset processing.
207
-
208
-## Local Processing
209
-
210
-RGMQL local processing consumes computational power directly from local 
211
-CPUs/system while managing datasets (either GMQL or generic plain-text datasets).
212
-
213
-### Initialization
214
-
215
-Load and attach the RGMQL package in an R session using the library function:
216
-
217
-```r
218
-library('RGMQL')
219
-```
220
-Before using any GMQL operation, we need to initialise the GMQL 
221
-context with the following code:
222
-
223
-```r
224
-init_gmql()
225
-```
226
-The function *init_gmql()* initializes the context of the scalable data 
227
-management engine built on Spark and Hadoop.
228
-Details on this and all other functions are provided in the R documentation 
229
-for this package (e.g., help(RGMQL)).
230
-
231
-### Read Dataset
232
-
233
-After initialization we need to read datasets.
234
-We have already given the formal definition of a dataset and noted the power of 
235
-GMQL to deal with data in a variety of standard tab-delimited text formats.
236
-In the following, we show how to get data from different sources.\newline
237
-We distinguish two different cases:
238
-
239
-1. Local dataset:\newline
240
-A local dataset is a folder with sample files (region files and corresponding 
241
-metadata files) on the user's computer.
242
-As the data are already on the user's computer, we simply execute:
243
-
244
-
245
-```r
246
-gmql_dataset_path <- system.file("example", "EXON", package = "RGMQL")
247
-data_out = read_dataset(gmql_dataset_path)
248
-```
249
-In this case we are reading a dataset named EXON specified by path.
250
-It doesn't matter what format the data are in: *read_dataset()* reads many 
251
-standard tab-delimited text formats without any parameter specified at input.
252
-
253
-2. GRangesList:\newline
254
-For better integration in the R environment and with other packages, 
255
-we provide a *read()* function to read directly from R memory/environment 
256
-using GRangesList as input.
257
-
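The sample triple $< id, {< r_i,v_i >}, {m_j} >$ and the shared region schema described in the deleted vignette's Genomic Data Model section can be pictured with ordinary R structures. This is a base-R sketch under stated assumptions, not how RGMQL stores data internally: the constructor `make_sample`, the helper `region_schema`, and every identifier and value below are hypothetical.

```r
# Hypothetical constructor for one sample: an id, a table of regions
# (coordinates r_i plus values v_i) and attribute-value metadata pairs m_j.
make_sample <- function(id, regions, metadata) {
  list(id = id, regions = regions, metadata = metadata)
}

sample_1 <- make_sample(
  id = "S_00001",
  regions = data.frame(chr = c("chr1", "chr2"),
                       left = c(100L, 50L),
                       right = c(200L, 90L),
                       strand = c("+", "-"),
                       pvalue = c(0.01, 0.05),
                       stringsAsFactors = FALSE),
  # An attribute may carry several values (here "antibody" appears twice).
  metadata = data.frame(attribute = c("cell", "antibody", "antibody"),
                        value = c("HeLa", "CTCF", "POL2"),
                        stringsAsFactors = FALSE)
)

sample_2 <- make_sample(
  id = "S_00002",
  regions = data.frame(chr = "chr3",
                       left = 10L,
                       right = 40L,
                       strand = "+",
                       pvalue = 0.20,
                       stringsAsFactors = FALSE),
  metadata = data.frame(attribute = "cell",
                        value = "K562",
                        stringsAsFactors = FALSE)
)

# A dataset is a collection of samples that share one region schema.
dataset <- list(sample_1, sample_2)

# Region schema of a sample: attribute names mapped to their types.
region_schema <- function(s) vapply(s$regions, class, character(1))

schemas <- lapply(dataset, region_schema)
same_schema <- all(vapply(schemas, identical, logical(1), schemas[[1]]))

# Values recorded for one metadata attribute of a sample.
antibodies <- sample_1$metadata$value[sample_1$metadata$attribute == "antibody"]
```

The first four columns of each region table play the role of the fixed coordinates, while the remaining columns and the metadata pairs are free to vary from dataset to dataset.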