... | ... |
@@ -35,7 +35,8 @@ |
35 | 35 |
#' |
36 | 36 |
#' @details |
37 | 37 |
#' This function works only with a dataset or GRangesList in which all samples or |
38 |
-#' Granges have the same region coordinates (chr, ranges, strand) |
|
38 |
+#' GRanges have the same region coordinates (chr, ranges, strand) ordered in |
|
39 |
+#' the same way for each sample |
|
39 | 40 |
#' |
40 | 41 |
#' In case of GRangesList data input, the function searches for metadata |
41 | 42 |
#' using the metadata() function associated with the GRangesList. |
... | ... |
@@ -52,10 +53,14 @@ |
52 | 53 |
#' filter_and_extract(test_path, region_attributes = c("pvalue", "peak")) |
53 | 54 |
#' |
54 | 55 |
#' ## This statement imports a GMQL dataset as GRangesList and filters it |
55 |
-#' ## including at output only "pvalue" and "peak" region attributes |
|
56 |
+#' ## including at output only "pvalue" and "peak" region attributes; the sort |
|
57 |
+#' ## function makes sure that the region coordinates (chr, ranges, strand) |
|
58 |
+#' ## of all samples are ordered correctly |
|
59 |
+#' |
|
56 | 60 |
#' |
57 | 61 |
#' grl = import_gmql(test_path, TRUE) |
58 |
-#' filter_and_extract(grl, region_attributes = c("pvalue", "peak")) |
|
62 |
+#' sorted_grl = sort(grl) |
|
63 |
+#' filter_and_extract(sorted_grl, region_attributes = c("pvalue", "peak")) |
|
59 | 64 |
#' |
60 | 65 |
#' |
61 | 66 |
#' @export |
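The ordering requirement stated in @details can be checked before calling the function. The sketch below is a base-R illustration and not part of RGMQL: it models each sample as a plain data frame with assumed columns `chr`, `start`, `end` and `strand`, and the helper names `same_coordinates()` and `sort_sample()` are hypothetical. With a real GRangesList, `sort(grl)` as used in the @examples plays the role of the sorting step.

```r
# Hypothetical helper: TRUE when every sample has the same region
# coordinates, row by row, in the same order as the first sample.
same_coordinates <- function(samples) {
    coord_cols <- c("chr", "start", "end", "strand")
    key <- function(df) do.call(paste, c(df[coord_cols], sep = ":"))
    reference <- key(samples[[1]])
    all(vapply(samples, function(df) identical(key(df), reference),
               logical(1)))
}

# Hypothetical helper: order one sample by chr, start, end, strand.
sort_sample <- function(df) {
    ord <- order(df$chr, df$start, df$end, df$strand)
    df[ord, , drop = FALSE]
}

# Two toy samples with the same regions in a different order.
s1 <- data.frame(chr = c("chr1", "chr2"), start = c(10L, 5L),
                 end = c(20L, 15L), strand = c("+", "-"),
                 stringsAsFactors = FALSE)
s2 <- s1[c(2, 1), ]

unsorted_ok <- same_coordinates(list(s1, s2))
sorted_ok <- same_coordinates(lapply(list(s1, s2), sort_sample))
```

Here the unsorted pair fails the check while the sorted pair passes, which is the situation the added `sort()` call in the example is meant to prevent.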
62 | 67 |
deleted file mode 100644 |
... | ... |
@@ -1,303 +0,0 @@ |
1 |
-title: "RGMQL: GenoMetric Query Language for R/Bioconductor" |
|
2 |
-author: |
|
3 |
-- "Simone Pallotta" |
|
4 |
-- "Marco Masseroli" |
|
5 |
-date: "2017-11-14" |
|
6 |
-bibliography: bibliography.bib |
|
7 |
-output: BiocStyle::pdf_document |
|
8 |
-vignette: > |
|
9 |
- %\VignetteIndexEntry{Vignette Title} |
|
10 |
- %\VignetteEngine{knitr::rmarkdown} |
|
11 |
- %\VignetteEncoding{UTF-8} |
|
12 |
-link-citations: true |
|
13 |
- |
|
14 |
-# Introduction |
|
15 |
- |
|
16 |
-Recent years have seen a tremendous increase in the volume of data generated |
|
17 |
-in the life sciences, especially propelled by the rapid progress of |
|
18 |
-Next Generation Sequencing (NGS) technologies. |
|
19 |
-These high-throughput technologies can produce billions of short DNA or RNA |
|
20 |
-fragments in excess of a few terabytes of data in a single run. |
|
21 |
-Next-generation sequencing refers to the deep, in-parallel DNA sequencing |
|
22 |
-technologies providing massively parallel analysis and extremely |
|
23 |
-high-throughput from multiple samples at much reduced cost. |
|
24 |
-Improvement of sequencing technologies and data processing pipelines |
|
25 |
-is rapidly providing sequencing data, with associated high-level features, |
|
26 |
-of many individual genomes in multiple biological and clinical conditions. |
|
27 |
-To make effective use of the produced data, the design of big data algorithms |
|
28 |
-and their efficient implementation on modern high performance |
|
29 |
-computing infrastructures, such as clouds, CPU clusters |
|
30 |
-and network infrastructures, is required in order to achieve scalability |
|
31 |
-and performance. |
|
32 |
-For this purpose the GenoMetric Query Language (GMQL) has been proposed |
|
33 |
-as a high-level, declarative language to process, query, |
|
34 |
-and compare multiple and heterogeneous genomic datasets for biomedical |
|
35 |
-knowledge discovery [@Bioinformatics2015]. |
|
36 |
- |
|
37 |
-## Purpose |
|
38 |
- |
|
39 |
-A very important emerging problem is to make sense of the enormous amount and |
|
40 |
-variety of NGS data becoming available, i.e. to discover how different genomic |
|
41 |
-regions and their products interact and cooperate with each other. |
|
42 |
-To this aim, the integration of several heterogeneous DNA feature data |
|
43 |
-is required. |
|
44 |
-Such big genomic feature data are collected within numerous and |
|
45 |
-heterogeneous files, usually distributed within different repositories, |
|
46 |
-lacking an attribute-based organization and a systematic description |
|
47 |
-of their metadata. |
|
48 |
-These heterogeneous data can contain the hidden answer to very important |
|
49 |
-biomedical questions. |
|
50 |
-To unveil them, standard tools already available for knowledge extraction |
|
51 |
-are too specialized or present powerful features, but have a rough interface |
|
52 |
-not well-suited for scientists/biologists. |
|
53 |
-GMQL addresses these aspects using cloud-based technologies |
|
54 |
-(including Apache Hadoop, MapReduce, and Spark), and focusing on genomic data |
|
55 |
-operations written as simple queries with implicit iterations over thousands |
|
56 |
-of heterogeneous samples, computed efficiently [@IEEE7484654]. |
|
57 |
-This RGMQL package makes it easy to take advantage of GMQL functionalities also |
|
58 |
-for scientists and biologists with limited knowledge of query and |
|
59 |
-programming languages, but used to the R/Bioconductor environment. |
|
60 |
-This package is built over a GMQL scalable data management engine |
|
61 |
-written in the Scala programming language, released as a Scala API [@githubrepo] |
|
62 |
-providing a set of functions to combine, manipulate, compare, and extract |
|
63 |
-genomic data from different data sources, both local and remote datasets. |
|
64 |
-These functions allow performing complex GMQL processing and queries without |
|
65 |
-knowledge of GMQL syntax, leveraging instead the idiomatic R paradigm and logic. |
|
66 |
- |
|
67 |
- |
|
68 |
-# Genomic Data Model |
|
69 |
- |
|
70 |
-The Genomic Data Model (GDM) is based on the notions of datasets |
|
71 |
-and samples [@modeling2016]. |
|
72 |
-Datasets are collections of samples, and each sample consists of two parts, |
|
73 |
-the region data, which describe portions of the genome, and the metadata, |
|
74 |
-which describe sample general properties and how observations are collected. |
|
75 |
-In contrast to other data models, it clearly divides, and comprehensively |
|
76 |
-manages, observations about genomic regions and metadata. |
|
77 |
-GDM provides a flat attribute-based organization, just requiring that |
|
78 |
-each dataset is associated with a given data schema, which specifies |
|
79 |
-the attributes of region data and their types. |
|
80 |
-The first attributes of such schema are fixed (chr, start, end, strand); |
|
81 |
-they represent the genomic region identifying coordinates. |
|
82 |
-In addition, metadata have a free attribute-value pair format. |
|
83 |
- |
|
84 |
-## Genomic Region |
|
85 |
- |
|
86 |
-Genomic region data describe a broad variety of biomolecular aspects and are |
|
87 |
-very valuable for biomolecular investigation. |
|
88 |
-A genomic region is a portion of a genome, qualified by a quadruple of values |
|
89 |
-called region coordinates: |
|
90 |
-$$< chr, left, right, strand >$$ |
|
91 |
-Regions can have an arbitrary number of associated values, according to |
|
92 |
-the processing of DNA, RNA or epigenomic sequencing reads that determined |
|
93 |
-the region. |
|
94 |
- |
|
95 |
-## Metadata |
|
96 |
- |
|
97 |
-Metadata describe the biological and clinical properties associated with |
|
98 |
-each sample. |
|
99 |
-They are usually collected in a broad variety of data structures and formats |
|
100 |
-that constitute barriers to their use and comparison. GDM models metadata |
|
101 |
-simply as arbitrary semi-structured attribute-value pairs, |
|
102 |
-where attributes may have multiple values. |
|
103 |
- |
|
104 |
-## Genomic Sample |
|
105 |
- |
|
106 |
-Formally, a sample s is a collection of genomic regions modeled as |
|
107 |
-the following triple: $$< id, \{< r_i,v_i >\}, \{m_j\} >$$ where: |
|
108 |
- |
|
109 |
-* id is the sample identifier |
|
110 |
-* Each region is a pair of coordinates $r_i$ and values $v_i$ |
|
111 |
-* Metadata $m_j$ are attribute-value pairs $< a_j,v_j >$ |
|
112 |
- |
|
113 |
-Note that the sample id attribute provides a many-to-many connection between |
|
114 |
-regions and metadata of a sample. |
|
115 |
-Through the use of a data type system to express region data, and of arbitrary |
|
116 |
-attribute-value pairs for metadata, GDM provides interoperability across |
|
117 |
-datasets in multiple formats produced by different experimental techniques. |
|
118 |
- |
|
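The triple above can be pictured with plain R structures. This is only an illustrative sketch: the field names (`id`, `regions`, `metadata`), the `pvalue` attribute and all example values are assumptions made up for this example, and RGMQL itself works with GRanges-based objects rather than this list.

```r
# A sample as the triple < id, {< r_i, v_i >}, {m_j} > (illustrative only).
sample_s <- list(
    id = 1L,
    # Each row is one region: coordinates r_i plus its values v_i.
    regions = data.frame(
        chr = c("chr1", "chr1"),
        left = c(100L, 500L),
        right = c(200L, 650L),
        strand = c("+", "-"),
        pvalue = c(0.01, 0.20),      # made-up region value
        stringsAsFactors = FALSE
    ),
    # Metadata m_j as attribute-value pairs; an attribute may repeat,
    # which models an attribute with multiple values.
    metadata = data.frame(
        attribute = c("cell", "antibody", "antibody"),
        value = c("cellA", "targetX", "targetY"),
        stringsAsFactors = FALSE
    )
)

# All values of one metadata attribute for this sample.
is_antibody <- sample_s$metadata$attribute == "antibody"
antibodies <- sample_s$metadata$value[is_antibody]
```

Keeping metadata in a long attribute/value table, rather than one column per attribute, is one simple way to allow multi-valued attributes.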
119 |
-## Dataset |
|
120 |
- |
|
121 |
-A dataset is a collection of samples uniquely identified, with the same region |
|
122 |
-schema and with each sample consisting of two parts: |
|
123 |
- |
|
124 |
-* region data: describing characteristics and location of genomic portions |
|
125 |
-* metadata: expressing general properties of the sample |
|
126 |
- |
|
127 |
-Each dataset is typically produced within the same project by using the same |
|
128 |
-or equivalent technology and tools, but with different experimental |
|
129 |
-conditions, described by metadata. |
|
130 |
- |
|
131 |
-Datasets contain a large amount of information describing regions of a genome, |
|
132 |
-with data encoded in human-readable format using plain text files. |
|
133 |
- |
|
134 |
-GMQL datasets are materialized in a standard layout composed of three |
|
135 |
-types of files: |
|
136 |
- |
|
137 |
-1. genomic region tab-delimited text files with extension .gdm, or .gtf |
|
138 |
-if in standard GTF format |
|
139 |
-2. metadata attribute-value tab-delimited text files with the same full name |
|
140 |
-(name and extension) as the corresponding region file, plus the extension .meta |
|
141 |
-3. schema XML file containing region attribute names and types |
|
142 |
- |
|
143 |
-All these files reside in a single folder called files. |
|
144 |
- |
|
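As a rough illustration of this layout, the following base-R sketch builds a toy dataset folder in a temporary directory and pairs each region file with its metadata file. The sample file names, the `schema.xml` name and the helper `pair_region_meta()` are assumptions made for the example; only the `files` folder name and the `.gdm`, `.gtf` and `.meta` extensions come from the description above.

```r
# Build a toy dataset folder (all file names below are made up).
dataset_dir <- file.path(tempfile("toy_dataset_"), "files")
dir.create(dataset_dir, recursive = TRUE)
invisible(file.create(file.path(dataset_dir, c(
    "S_00001.gdm", "S_00001.gdm.meta",
    "S_00002.gdm", "S_00002.gdm.meta",
    "schema.xml"
))))

# Hypothetical helper: match every region file with "<full name>.meta".
pair_region_meta <- function(dir) {
    all_files <- list.files(dir)
    regions <- grep("\\.(gdm|gtf)$", all_files, value = TRUE)
    meta <- paste0(regions, ".meta")
    data.frame(
        region = regions,
        meta = meta,
        has_meta = meta %in% all_files,
        stringsAsFactors = FALSE
    )
}

pairs <- pair_region_meta(dataset_dir)
```

A check like `all(pairs$has_meta)` is a quick way to spot region files whose metadata file is missing before reading the dataset.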
145 |
-<!--  --> |
|
146 |
- |
|
147 |
-In the RGMQL package, dataset files are considered read-only. |
|
148 |
-Once read, genomic information is represented in an abstract structure inside |
|
149 |
-the package, mapped to an R GRanges data structure when needed. |
|
150 |
- |
|
151 |
- |
|
152 |
-# GenoMetric Query Language |
|
153 |
- |
|
154 |
-The GenoMetric Query Language name stems from the language's ability to deal |
|
155 |
-with genomic distances, which are measured as the number of nucleotide bases |
|
156 |
-between genomic regions (aligned to the same reference genome) and computed |
|
157 |
-using arithmetic operations between region coordinates. |
|
158 |
-GMQL is a high-level, declarative language that allows expressing queries |
|
159 |
-easily over genomic regions and their metadata, in a way similar to what can |
|
160 |
-be done with the Structured Query Language (SQL) over a relational database. |
|
161 |
-The GMQL approach exhibits two main differences with respect to other tools |
|
162 |
-based on Hadoop, the MapReduce framework, and Spark engine technologies |
|
163 |
-to address similar biomedical problems:\newline |
|
164 |
- |
|
165 |
-* GMQL: |
|
166 |
- |
|
167 |
- 1. reads from processed datasets |
|
168 |
- 2. supports metadata management |
|
169 |
- |
|
170 |
-* Others: |
|
171 |
- |
|
172 |
- 1. generally read from raw or aligned data from NGS machines |
|
173 |
- 2. provide no support for metadata management |
|
174 |
- |
|
175 |
-GMQL is the appropriate tool for querying numerous processed genomic datasets |
|
176 |
-and very many samples that are becoming available. |
|
177 |
-Note however that GMQL performs worse than some other available systems on a |
|
178 |
-small number of small-scale datasets, but these other systems are not |
|
179 |
-cloud-based; hence, they are not adequate for efficient big data processing |
|
180 |
-and, in some cases, they are inherently limited in their |
|
181 |
-data management capacity, as they only work as RAM-resident processes. |
|
182 |
- |
|
183 |
-## Query structure |
|
184 |
- |
|
185 |
-A GMQL query is expressed as a sequence of GMQL operations with the |
|
186 |
-following structure: |
|
187 |
-$$< variable > = operator(< parameters >) < variable >;$$ |
|
188 |
-where each $< variable >$ stands for a GDM dataset. |
|
189 |
- |
|
190 |
-This RGMQL package brings GMQL functionalities into the R environment, |
|
191 |
-allowing users to directly build a GMQL query without knowing the GMQL syntax. |
|
192 |
-In RGMQL, every GMQL operation is translated into an R function |
|
193 |
-and expressed as: |
|
194 |
-$$ variable = operator(variable, parameters)$$ |
|
195 |
- |
|
196 |
-It is very similar to the GMQL syntax for operation expression although |
|
197 |
-expressed with the idiomatic R paradigm and logic, with parameters entirely |
|
198 |
-built using native R data structures such as lists, matrices, |
|
199 |
-vectors or R logic conditions. |
|
200 |
- |
|
201 |
- |
|
202 |
-# Processing Environments |
|
203 |
- |
|
204 |
-In this section, we show how GMQL processing is built in R, which operations |
|
205 |
-are available in RGMQL, and the difference between local |
|
206 |
-and remote dataset processing. |
|
207 |
- |
|
208 |
-## Local Processing |
|
209 |
- |
|
210 |
-RGMQL local processing consumes computational power directly from local |
|
211 |
-CPUs/system while managing datasets (either GMQL or generic plain-text datasets). |
|
212 |
- |
|
213 |
-### Initialization |
|
214 |
- |
|
215 |
-Load and attach the RGMQL package in an R session using the library function: |
|
216 |
- |
|
217 |
-```r |
|
218 |
-library('RGMQL') |
|
219 |
-``` |
|
220 |
-Before using any GMQL operation, we need to initialise the GMQL |
|
221 |
-context with the following code: |
|
222 |
- |
|
223 |
-```r |
|
224 |
-init_gmql() |
|
225 |
-``` |
|
226 |
-The function *init_gmql()* initializes the context of the scalable data management |
|
227 |
-engine laid upon Spark and Hadoop. |
|
228 |
-Details on this and all other functions are provided in the R documentation |
|
229 |
-for this package (e.g., help(RGMQL)). |
|
230 |
- |
|
231 |
-### Read Dataset |
|
232 |
- |
|
233 |
-After initialization we need to read datasets. |
|
234 |
-We have already given the formal definition of a dataset and described the power of |
|
235 |
-GMQL to deal with data in a variety of standard tab-delimited text formats. |
|
236 |
-In the following, we show how to get data from different sources.\newline |
|
237 |
-We distinguish two different cases: |
|
238 |
- |
|
239 |
-1. Local dataset:\newline |
|
240 |
-A local dataset is a folder with sample files (region files and corresponding |
|
241 |
-metadata files) on the user's computer. |
|
242 |
-As the data are already on the user's computer, we simply execute: |
|
243 |
- |
|
244 |
- |
|
245 |
-```r |
|
246 |
-gmql_dataset_path <- system.file("example", "EXON", package = "RGMQL") |
|
247 |
-data_out = read_dataset(gmql_dataset_path) |
|
248 |
-``` |
|
249 |
-In this case we are reading a dataset named EXON specified by path. |
|
250 |
-It doesn't matter what format the data are in; *read_dataset()* reads many |
|
251 |
-standard tab-delimited text formats without any parameter specified at input. |
|
252 |
- |
|
253 |
-2. GRangesList:\newline |
|
254 |
-For better integration in the R environment and with other packages, |
|
255 |
-we provide a *read()* function to read directly from R memory/environment |
|
256 |
-using GRangesList as input. |
|
257 |
- |
|
258 |
- |
|
259 |
- |
|
260 |
- |
|
261 |
- |
|
262 |
- |
|
263 |
- |
|
264 |
- |
|
265 |
- |
|
266 |
- |
|
267 |
- |
|
268 |
- |
|
269 |
- |
|
270 |
- |
|
271 |
- |
|
272 |
- |
|
273 |
- |
|
274 |
- |
|
275 |
- |
|
276 |
- |
|
277 |
- |
|
278 |
- |
|
279 |
- |
|
280 |
- |
|
281 |
- |
|
282 |
- |
|
283 |
- |
|
284 |
- |
|
285 |
- |
|
286 |
- |
|
287 |
- |
|
288 |
- |
|
289 |
- |
|
290 |
- |
|
291 |
- |
|
292 |
- |
|
293 |
- |
|
294 |
- |
|
295 |
- |
|
296 |
- |
|
297 |
- |
|
298 |
- |
|
299 |
- |
|
300 |
- |
|
301 |
- |