...
@@ -1,48 +1,24 @@
 Package: RGMQL
 Type: Package
 Title: GenoMetric Query Language for R/Bioconductor
-Version: 0.99.38
+Version: 0.99.40
 Author: Simone Pallotta, Marco Masseroli
 Maintainer: Simone Pallotta <simonepallotta@hotmail.com>
 Description: This package brings the GenoMetric Query Language (GMQL)
     functionalities into the R environment. GMQL is a high-level, declarative
-    language to query and compare multiple and heterogeneous genomic datasets
-    for biomedical knowledge discovery. It allows expressing easily queries and
-    processing over genomic regions and their metadata, to extract genomic regions
-    of interest and compute their properties. GMQL adopts algorithms designed
-    for big data and their efficient implementation using cloud-computing
-    technologies, including Apache Hadoop framework and Spark engine;
-    these make GMQL able to run on modern high performance computing
-    infrastructures, CPU clusters and network infrastructures, in order to achieve
-    scalability and performance on big data. This RGMQL package is built over
-    a scalable data management engine written in Scala programming language,
-    released as Scala API; it provides a set of functions to create,
-    manipulate and extract genomic data from different data sources both
-    from local and remote datasets. These RGMQL functions allow performing
-    complex queries and processing without knowing the GMQL syntax,
-    but leveraging on R idiomatic paradigm and logic. RGMQL provides two different
-    approaches in writing GMQL queries and processing scripts: a) REST calls b)
-    standard R APIs The REST approach let users to log into a remote infrastructure
-    where a GMQL system is installed, and manage remote big genomic datasets hosted
-    in cluster-based repository. User can download an entire remote dataset into
-    local folder, upload local datasets into the remote repository or compiling
-    and running a textual query or processing script just invoking the right RGMQL
-    functions. Multiple REST invocations can be invoked and run concurrently on
-    remote infrastructure allowing user to monitor the progress status of every
-    call. Many other REST functionalities are available in order to allow a complete
-    interaction with remote infrastructure. The R APIs approach lets user work with
-    local or remote datasets using batch-like style where single invocations must
-    be invoked sequentially; with this approach all GMQL queries and processing
-    can be written as a sequence of RGMQL functions. Unlike other similar packages,
-    every RGMQL function simply builds a query, with no intermediate result shown
-    (except for a few functions that execute queries and for some utility functions
-    for interoperability with other packages) The RGMQL package also provides a rich
-    set of ancillary classes that allow sophisticated input/output management and
-    sorting, such as ASC, DESC, BAG, MIN, MAX, SUM, AVG, MEDIAN, STD, Q1, Q2, Q3,
-    and several others; these classes are used only to build predicates and complex
-    conditions taken as input by RGMQL functions; Note that many RGMQL functions are
-    not directly executed in R environment, but are deferred until real execution is
-    issued.
+    language to manage heterogeneous genomic datasets for biomedical purposes,
+    using simple queries to process genomic regions and their metadata and
+    properties. GMQL adopts algorithms efficiently designed for big data using
+    cloud-computing technologies (like Apache Hadoop and Spark), allowing GMQL
+    to run on modern infrastructures and achieve scalability and high
+    performance. It allows users to create, manipulate and extract genomic
+    data from different data sources, both locally and remotely. Our RGMQL
+    functions allow complex queries and processing, leveraging the R idiomatic
+    paradigm. The RGMQL package also provides a rich set of ancillary classes
+    that allow sophisticated input/output management and sorting, such as:
+    ASC, DESC, BAG, MIN, MAX, SUM, AVG, MEDIAN, STD, Q1, Q2, Q3 (and many
+    others). Note that many RGMQL functions are not directly executed in the R
+    environment, but are deferred until real execution is issued.
 License: Artistic-2.0
 URL: http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL/
 Encoding: UTF-8
...
@@ -64,7 +40,7 @@ Imports:
     glue,
     BiocGenerics
 Depends:
-    R(<= 3.4.2), RGMQLlib
+    R(>= 3.4.2), RGMQLlib
 VignetteBuilder: knitr
 Suggests:
     BiocStyle,
...
@@ -40,7 +40,7 @@ import_gmql <- function(dataset_path, is_gtf)
 {
     datasetName <- sub("/*[/]$","",datasetName)
     if(basename(datasetName) !="files")
-        datasetName <- paste0(datasetName,"/files")
+        datasetName <- file.path(datasetName,"files")
 
     if(!dir.exists(datasetName))
         stop("Directory does not exists")
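Every R hunk in this change applies the same substitution, `paste0(dir, "/files")` → `file.path(dir, "files")`. A minimal sketch of the behavioral difference, in plain base R with nothing RGMQL-specific assumed:

```r
# file.path() builds paths from components and exposes the separator as an
# fsep argument, instead of hard-coding "/" inside a string paste.
p1 <- paste0("mydata", "/files")     # "mydata/files"
p2 <- file.path("mydata", "files")   # "mydata/files"
identical(p1, p2)                    # TRUE

# An empty final component leaves a trailing separator, which is how the
# download_dataset hunk below derives dir_out via file.path(path, "").
file.path("mydata", "")              # "mydata/"
```

On all platforms R's default `fsep` is `"/"`, so the two forms produce identical strings here; the gain is idiomatic, component-wise path construction.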
...
@@ -78,8 +78,8 @@ import_gmql <- function(dataset_path, is_gtf)
 {
     datasetName <- sub("/*[/]$","",datasetName)
     if(basename(datasetName) !="files")
-        datasetName <- paste0(datasetName,"/files")
-
+        datasetName <- file.path(datasetName,"files")
+
     if(!dir.exists(datasetName))
         stop("Directory does not exists")
 
...
@@ -94,7 +94,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
     if(!dir.exists(dir_out))
         dir.create(dir_out)
 
-    files_sub_dir <- paste0(dir_out,"/files")
+    files_sub_dir <- file.path(dir_out,"files")
     dir.create(files_sub_dir)
     cnt = .counter()
     #col_names <- .get_schema_names(samples)
...
@@ -102,7 +102,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
 {
     #write region
     lapply(samples,function(x,dir){
-        sample_name = paste0(dir,"/S_",cnt(),".gtf")
+        sample_name <- file.path(dir,paste0("S_",cnt(),".gtf"))
         g <- rtracklayer::export(x,sample_name,format = "gtf")
     },files_sub_dir)
     cnt = .counter(0)
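The export loops name each sample file `S_1`, `S_2`, … through the internal `.counter()` closure, reset with `.counter(0)` between the region and metadata passes. `make_counter()` below is a hypothetical stand-in (RGMQL's actual `.counter()` is not exported; only its observed call pattern is mimicked):

```r
# Hypothetical re-implementation of the counter-closure pattern used above:
# a function whose captured state increments on every call.
make_counter <- function(start = 0) {
    i <- start
    function() {
        i <<- i + 1   # update the enclosing environment, not a local copy
        i
    }
}

cnt <- make_counter()
file.path("out/files", paste0("S_", cnt(), ".gtf"))  # "out/files/S_1.gtf"
file.path("out/files", paste0("S_", cnt(), ".gtf"))  # "out/files/S_2.gtf"
```

The `<<-` assignment is what makes the counter persist across calls; a plain `<-` would reset `i` each time.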
...
@@ -110,7 +110,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
 
     #write metadata
     lapply(meta,function(x,dir){
-        sample_name = paste0(dir,"/S_",cnt(),".gtf")
+        sample_name <- file.path(dir,paste0("S_",cnt(),".gtf"))
         .write_metadata(x,sample_name)
     },files_sub_dir)
 }
...
@@ -118,7 +118,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
 {
     #write region
     lapply(samples,function(x,dir){
-        sample_name = paste0(dir,"/S_",cnt(),".gdm")
+        sample_name <- file.path(dir,paste0("S_",cnt(),".gdm"))
         region_frame <- data.frame(x)
         write.table(region_frame,sample_name,col.names = FALSE,
             row.names = FALSE, sep = '\t',quote = FALSE)
...
@@ -129,7 +129,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
 
     #write metadata
     lapply(meta,function(x,dir){
-        sample_name = paste0(dir,"/S_",cnt(),".gdm")
+        sample_name <- file.path(dir,paste0("S_",cnt(),".gdm"))
         .write_metadata(x,sample_name)
     },files_sub_dir)
 }
...
@@ -177,7 +177,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
         node_list <- c(fixed_element, columns)
     }
 
-    schema <- paste0(directory,"/granges.schema")
+    schema <- file.path(directory,"granges.schema")
     root <- xml2::xml_new_root("gmqlSchemaCollection")
     xml2::xml_attr(root,"name") <- "DatasetName_SCHEMAS"
     xml2::xml_attr(root,"xmlns") <- "http://genomic.elet.polimi.it/entities"
...
@@ -76,8 +76,8 @@ filter_and_extract <- function(data, metadata = NULL,
 {
     datasetName <- sub("/*[/]$","",datasetName)
     if(basename(datasetName) !="files")
-        datasetName <- paste0(datasetName,"/files")
-
+        datasetName <- file.path(datasetName,"files")
+
     if(!dir.exists(datasetName))
         stop("Directory does not exists")
 
...
@@ -165,7 +165,7 @@ gmql_cover <- function(input_data, min_acc, max_acc, groupBy,aggregates,flag)
     }
     else
         metadata_matrix <- .jnull("java/lang/String")
-
+
     WrappeR <- J("it/polimi/genomics/r/Wrapper")
     response <- switch(flag,
         "COVER" = WrappeR$cover(min_acc, max_acc, join_matrix,
...
@@ -137,7 +137,7 @@ gmql_materialize <- function(input_data, dir_out, name)
     if(!remote_proc)
     {
         dir_out <- sub("/*[/]$","",dir_out)
-        res_dir_out <- paste0(dir_out,"/",name)
+        res_dir_out <- file.path(dir_out,name)
         if(!dir.exists(res_dir_out))
             dir.create(res_dir_out)
     }
...
@@ -81,7 +81,7 @@ read_GMQL <- function(dataset, parser = "CustomParser", is_local = TRUE,
 
     dataset <- sub("/*[/]$","",dataset)
     if(basename(dataset) !="files")
-        dataset <- paste0(dataset,"/files")
+        dataset <- file.path(dataset,"files")
 
     schema_XML <- list.files(dataset, pattern = "*.schema$",
         full.names = TRUE)
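Each of these path-handling hunks first normalizes its input with `sub("/*[/]$", "", x)` before appending `files`. In plain-R terms, that regex strips any run of trailing slashes in a single `sub()` call:

```r
# "/*[/]$" matches zero or more slashes followed by one final slash at the
# end of the string, so an entire trailing run of "/" is removed at once.
sub("/*[/]$", "", "mydata/")     # "mydata"
sub("/*[/]$", "", "mydata///")   # "mydata"
sub("/*[/]$", "", "mydata")      # "mydata" (no trailing slash: unchanged)
```

This is why the subsequent `basename(...) != "files"` check sees a clean last path component regardless of how the caller wrote the path.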
...
@@ -791,7 +791,7 @@ upload_dataset <- function(url, datasetName, folderPath, schemaName = NULL,
     isGMQL = TRUE)
 {
     if(isGMQL)
-        folderPath <- paste0(folderPath,"/files")
+        folderPath <- file.path(folderPath,"files")
 
     files <- list.files(folderPath,full.names = TRUE)
     if(!length(files))
...
@@ -961,8 +961,8 @@ download_dataset <- function(url, datasetName, path = getwd())
         print(content)
     else
     {
-        zip_path <- paste0(path,"/",datasetName,".zip")
-        dir_out <- paste0(path,"/")
+        zip_path <- file.path(path,paste0(datasetName,".zip"))
+        dir_out <- file.path(path,"")
         writeBin(content, zip_path)
         unzip(zip_path,exdir = dir_out)
         print("Download Complete")
...
@@ -63,9 +63,10 @@ This package is built over a GMQL scalable data management engine
 written in Scala programming language, released as Scala API [@githubrepo]
 providing a set of functions to combine, manipulate, compare, and extract
 genomic data from different data sources both from local and remote datasets.
-These functions, built extending functionalities available in the R/Bioconductor
-framework, allow performing complex GMQL processing and queries without
-knowledge of GMQL syntax, but leveraging on R idiomatic paradigm and logic.
+These functions, built extending functionalities available in the
+R/Bioconductor framework, allow performing complex GMQL processing and queries
+without knowledge of GMQL syntax, leveraging the R idiomatic paradigm
+and logic.
 
 
 # Genomic Data Model
...
@@ -140,7 +141,8 @@ types of files:
 1. genomic region tab-delimited text files with extension .gdm, or .gtf
 if in standard GTF format
 2. metadata attribute-value tab-delimited text files with the same fullname
-(name and extension) of the correspondent genomic region file and extension .meta
+(name and extension) of the correspondent genomic region file and extension
+.meta
 3. schema XML file containing region attribute names and types
 
 All these files reside in a unique folder called $files$.
...
@@ -257,17 +259,17 @@ additional input parameter.
 
 2. GRangesList:\newline
 For better integration in the R environment and with other R packages,
-we provide the *read_GRangesList()* function to read directly from R memory/environment
+we provide the *read_GRangesList()* function to read directly from R memory
 using GRangesList as input.
 
 ```{r, read GRangesList}
 library("GenomicRanges")
 gr1 <- GRanges(seqnames = "chr2",
-        ranges = IRanges(103, 106), strand = "+", score = 5L, GC = 0.45)
+    ranges = IRanges(103, 106), strand = "+", score = 5L, GC = 0.45)
 
 gr2 <- GRanges(seqnames = c("chr1", "chr1"),
-        ranges = IRanges(c(107, 113), width = 3), strand = c("+", "-"),
-        score = 3:4, GC = c(0.3, 0.5))
+    ranges = IRanges(c(107, 113), width = 3), strand = c("+", "-"),
+    score = 3:4, GC = c(0.3, 0.5))
 
 grl <- GRangesList("txA" = gr1, "txB" = gr2)
 data_out <- read_GRangesList(grl)
...
@@ -372,10 +374,10 @@ each sample.
 NOTE: GRangesList are contained in the R environment and are not saved on disk.
 
 With the *rows* parameter it is possible to specify how many rows, for each
-sample inside the input dataset, are extracted; by default, the *rows* parameter
-value is $0$, that means all rows are extracted.
-Note that, since we are working with big data, to extract all rows could be very
-time and space consuming.
+sample inside the input dataset, are extracted; by default, the *rows*
+parameter value is $0$, which means all rows are extracted.
+Note that, since we are working with big data, extracting all rows could be
+very time and space consuming.
 
 ## Remote Processing
 
...
@@ -463,9 +465,10 @@ logout_gmql(test_url)
 ### Batch Execution
 
 This execution type is similar to local processing (syntax, functions, and
-so on ...) except that materialized data are stored only on the remote repository,
-from where they can be downloaded locally and imported in GRangesList
-using the functions in this RGMQL package [(see Import/Export)](#utilities).
+so on ...) except that materialized data are stored only on the remote
+repository, from where they can be downloaded locally and imported in
+GRangesList using the functions in this RGMQL package
+[(see Import/Export)](#utilities).
 
 Before starting with an example, note that we have to log into remote
 infrastructure with login function:
...
@@ -533,7 +536,8 @@ The processing flavour can be switched using the function:
 ```{r, switch mode}
 remote_processing(TRUE)
 ```
-An user can switch processing mode until the first *collect()* has been performed.
+A user can switch processing mode until the first *collect()* has been
+performed.
 
 This kind of processing comes from the fact that the *read_GMQL()* function can
 accept either a local dataset or a remote repository dataset,
...
@@ -592,23 +596,25 @@ collect(exon_res)
 execute()
 ```
 
-As we can see, the two *read_GMQL()* functions above read from different sources:
-*mut_ds* from local dataset, *HG19_bed_ann* from remote repository.
+As we can see, the two *read_GMQL()* functions above read from different
+sources: *mut_ds* from a local dataset, *HG19_bed_ann* from a remote repository.
 
 If we set remote processing to false (*remote_processing(FALSE)*),
 the execution is performed locally downloading all needed datasets from remote
-repositories, otherwise all local datasets are automatically uploaded to the remote
-GMQL repository associated with the remote system where the processing is performed.
+repositories, otherwise all local datasets are automatically uploaded to the
+remote GMQL repository associated with the remote system where the processing
+is performed.
 
-NOTE: The public datasets cannot be downloaded from a remote GMQL repository by design.
+NOTE: The public datasets cannot be downloaded from a remote GMQL repository
+by design.
 
 # Utilities
 
-The RGMQL package contains functions that allow the user to interface with other
-packages available in R/Bioconductor repository, e.g., GenomicRanges, and TFARM.
-These functions return GRangesList or GRanges with metadata associated,
-if present, as data structure suitable to further processing in other R/Bioconductor
-packages.
+The RGMQL package contains functions that allow the user to interface with
+other packages available in the R/Bioconductor repository, e.g.,
+GenomicRanges and TFARM. These functions return GRangesList or GRanges with
+associated metadata, if present, as data structures suitable for further
+processing in other R/Bioconductor packages.
 
 ## Import/Export
 
...
@@ -655,9 +661,9 @@ matrix
 
 ```
 *filter_and_extract()* filters the samples in the input dataset based on their
-specified *metadata*, and then extracts as metadata columns of GRanges the vector
-of region attributes you specify to retrieve from each filtered sample from the
-input dataset.
+specified *metadata*, and then extracts as metadata columns of GRanges the
+vector of region attributes you specify to retrieve from each filtered sample
+from the input dataset.
 If the $metadata$ argument is NULL, all samples are taken.
 The number of obtained columns is equal to the number of samples left after
 filtering, multiplied by the number of specified region attributes.