
minor fix

Simone authored on 05/01/2018 08:31:03
Showing 9 changed files

... ...
@@ -1,48 +1,24 @@
 Package: RGMQL
 Type: Package
 Title: GenoMetric Query Language for R/Bioconductor
-Version: 0.99.38
+Version: 0.99.40
 Author: Simone Pallotta, Marco Masseroli
 Maintainer: Simone Pallotta <simonepallotta@hotmail.com>
 Description: This package brings the GenoMetric Query Language (GMQL)
     functionalities into the R environment. GMQL is a high-level, declarative
-    language to query and compare multiple and heterogeneous genomic datasets
-    for biomedical knowledge discovery. It allows expressing easily queries and
-    processing over genomic regions and their metadata, to extract genomic regions
-    of interest and compute their properties. GMQL adopts algorithms designed
-    for big data and their efficient implementation using cloud-computing
-    technologies, including Apache Hadoop framework and Spark engine;
-    these make GMQL able to run on modern high performance computing
-    infrastructures, CPU clusters and network infrastructures, in order to achieve
-    scalability and performance on big data. This RGMQL package is built over
-    a scalable data management engine written in Scala programming language,
-    released as Scala API; it provides a set of functions to create,
-    manipulate and extract genomic data from different data sources both
-    from local and remote datasets. These RGMQL functions allow performing
-    complex queries and processing without knowing the GMQL syntax,
-    but leveraging on R idiomatic paradigm and logic. RGMQL provides two different
-    approaches in writing GMQL queries and processing scripts: a) REST calls b)
-    standard R APIs The REST approach let users to log into a remote infrastructure
-    where a GMQL system is installed, and manage remote big genomic datasets hosted
-    in cluster-based repository. User can download an entire remote dataset into
-    local folder, upload local datasets into the remote repository or compiling
-    and running a textual query or processing script just invoking the right RGMQL
-    functions. Multiple REST invocations can be invoked and run concurrently on
-    remote infrastructure allowing user to monitor the progress status of every
-    call. Many other REST functionalities are available in order to allow a complete
-    interaction with remote infrastructure. The R APIs approach lets user work with
-    local or remote datasets using batch-like style where single invocations must
-    be invoked sequentially; with this approach all GMQL queries and processing
-    can be written as a sequence of RGMQL functions. Unlike other similar packages,
-    every RGMQL function simply builds a query, with no intermediate result shown
-    (except for a few functions that execute queries and for some utility functions
-    for interoperability with other packages) The RGMQL package also provides a rich
-    set of ancillary classes that allow sophisticated input/output management and
-    sorting, such as ASC, DESC, BAG, MIN, MAX, SUM, AVG, MEDIAN, STD, Q1, Q2, Q3,
-    and several others; these classes are used only to build predicates and complex
-    conditions taken as input by RGMQL functions; Note that many RGMQL functions are
-    not directly executed in R environment, but are deferred until real execution is
-    issued.
+    language to manage heterogeneous genomic datasets for biomedical purposes,
+    using simple queries to process genomic regions and their metadata and
+    properties. GMQL adopts algorithms efficiently designed for big data, using
+    cloud-computing technologies (like Apache Hadoop and Spark) that allow GMQL
+    to run on modern infrastructures and achieve scalability and high
+    performance. It allows users to create, manipulate and extract genomic data
+    from different data sources, both locally and remotely. Our RGMQL functions
+    allow complex queries and processing, leveraging the R idiomatic paradigm.
+    The RGMQL package also provides a rich set of ancillary classes that allow
+    sophisticated input/output management and sorting, such as: ASC, DESC, BAG,
+    MIN, MAX, SUM, AVG, MEDIAN, STD, Q1, Q2, Q3 (and many others). Note that
+    many RGMQL functions are not directly executed in the R environment, but
+    are deferred until real execution is issued.
 License: Artistic-2.0
 URL: http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL/
 Encoding: UTF-8
... ...
@@ -64,7 +40,7 @@ Imports:
     glue,
     BiocGenerics
 Depends:
-    R(<= 3.4.2), RGMQLlib
+    R(>= 3.4.2), RGMQLlib
 VignetteBuilder: knitr
 Suggests:
     BiocStyle,
... ...
@@ -40,7 +40,7 @@ import_gmql <- function(dataset_path, is_gtf)
 {
     datasetName <- sub("/*[/]$","",datasetName)
     if(basename(datasetName) !="files")
-        datasetName <- paste0(datasetName,"/files")
+        datasetName <- file.path(datasetName,"files")
     
     if(!dir.exists(datasetName))
         stop("Directory does not exists")
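The recurring change in this commit swaps manual string concatenation for `file.path()` when assembling dataset paths. A minimal sketch of the difference, using a hypothetical directory name:

```r
dir_out <- "my_dataset"   # hypothetical directory, for illustration only

# paste0() concatenates strings literally, hard-coding the "/" separator
paste0(dir_out, "/files")

# file.path() joins components with .Platform$file.sep, the idiomatic way
# to build paths in R; it avoids hand-written separators when paths are
# assembled programmatically
file.path(dir_out, "files")
```

On Unix-like systems both calls yield `"my_dataset/files"`; `file.path()` simply expresses the intent (joining path components) rather than the punctuation.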
... ...
@@ -78,8 +78,8 @@ import_gmql <- function(dataset_path, is_gtf)
 {
     datasetName <- sub("/*[/]$","",datasetName)
     if(basename(datasetName) !="files")
-        datasetName <- paste0(datasetName,"/files")
-
+        datasetName <- file.path(datasetName,"files")
+    
     if(!dir.exists(datasetName))
         stop("Directory does not exists")
 
... ...
@@ -94,7 +94,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
     if(!dir.exists(dir_out))
         dir.create(dir_out)
     
-    files_sub_dir <- paste0(dir_out,"/files")
+    files_sub_dir <- file.path(dir_out,"files")
     dir.create(files_sub_dir)
     cnt = .counter()
     #col_names <- .get_schema_names(samples)
... ...
@@ -102,7 +102,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
     {
         #write region
         lapply(samples,function(x,dir){
-            sample_name = paste0(dir,"/S_",cnt(),".gtf")
+            sample_name <- file.path(dir,paste0("S_",cnt(),".gtf"))
             g <- rtracklayer::export(x,sample_name,format = "gtf")
         },files_sub_dir)
         cnt = .counter(0)
... ...
@@ -110,7 +110,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
 
         #write metadata
         lapply(meta,function(x,dir){
-            sample_name = paste0(dir,"/S_",cnt(),".gtf")
+            sample_name <- file.path(dir,paste0("S_",cnt(),".gtf"))
             .write_metadata(x,sample_name)
         },files_sub_dir)
     }
... ...
@@ -118,7 +118,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
     {
         #write region
         lapply(samples,function(x,dir){
-            sample_name = paste0(dir,"/S_",cnt(),".gdm")
+            sample_name <- file.path(dir,paste0("S_",cnt(),".gdm"))
             region_frame <- data.frame(x)
             write.table(region_frame,sample_name,col.names = FALSE,
                             row.names = FALSE, sep = '\t',quote = FALSE)
... ...
@@ -129,7 +129,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
 
         #write metadata
         lapply(meta,function(x,dir){
-            sample_name = paste0(dir,"/S_",cnt(),".gdm")
+            sample_name <- file.path(dir,paste0("S_",cnt(),".gdm"))
             .write_metadata(x,sample_name)
         },files_sub_dir)
     }
... ...
@@ -177,7 +177,7 @@ export_gmql <- function(samples, dir_out, is_gtf)
         node_list <- c(fixed_element, columns)
     }
 
-    schema <- paste0(directory,"/granges.schema")
+    schema <- file.path(directory,"granges.schema")
     root <- xml2::xml_new_root("gmqlSchemaCollection")
     xml2::xml_attr(root,"name") <- "DatasetName_SCHEMAS"
     xml2::xml_attr(root,"xmlns") <- "http://genomic.elet.polimi.it/entities"
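The `granges.schema` file above is built with the xml2 package. A small self-contained sketch of that pattern — attribute values mirror the code shown; the output path here is a temporary directory, not the package's real layout:

```r
library(xml2)

# Build the schema root node as in the function above
root <- xml_new_root("gmqlSchemaCollection")
xml_attr(root, "name") <- "DatasetName_SCHEMAS"
xml_attr(root, "xmlns") <- "http://genomic.elet.polimi.it/entities"

# Write it out using the same file.path() idiom introduced by this commit
schema <- file.path(tempdir(), "granges.schema")
write_xml(root, schema)
```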
... ...
@@ -76,8 +76,8 @@ filter_and_extract <- function(data, metadata = NULL,
 {
     datasetName <- sub("/*[/]$","",datasetName)
     if(basename(datasetName) !="files")
-        datasetName <- paste0(datasetName,"/files")
-    
+        datasetName <- file.path(datasetName,"files")
+
     if(!dir.exists(datasetName))
         stop("Directory does not exists")
     
... ...
@@ -165,7 +165,7 @@ gmql_cover <- function(input_data, min_acc, max_acc, groupBy,aggregates,flag)
     }
     else
         metadata_matrix <- .jnull("java/lang/String")
-
+    
     WrappeR <- J("it/polimi/genomics/r/Wrapper")
     response <- switch(flag,
         "COVER" = WrappeR$cover(min_acc, max_acc, join_matrix,
... ...
@@ -137,7 +137,7 @@ gmql_materialize <- function(input_data, dir_out, name)
     if(!remote_proc)
     {
         dir_out <- sub("/*[/]$","",dir_out)
-        res_dir_out <- paste0(dir_out,"/",name)
+        res_dir_out <- file.path(dir_out,name)
         if(!dir.exists(res_dir_out))
             dir.create(res_dir_out)
     }
... ...
@@ -81,7 +81,7 @@ read_GMQL <- function(dataset, parser = "CustomParser", is_local = TRUE,
         
         dataset <- sub("/*[/]$","",dataset)
         if(basename(dataset) !="files")
-            dataset <- paste0(dataset,"/files")
+            dataset <- file.path(dataset,"files")
         
         schema_XML <- list.files(dataset, pattern = "*.schema$",
                                     full.names = TRUE)
... ...
@@ -791,7 +791,7 @@ upload_dataset <- function(url, datasetName, folderPath, schemaName = NULL,
                                     isGMQL = TRUE)
 {
     if(isGMQL)
-        folderPath <- paste0(folderPath,"/files")
+        folderPath <- file.path(folderPath,"files")
     
     files <- list.files(folderPath,full.names = TRUE)
     if(!length(files))
... ...
@@ -961,8 +961,8 @@ download_dataset <- function(url, datasetName, path = getwd())
         print(content)
     else
     {
-        zip_path <- paste0(path,"/",datasetName,".zip")
-        dir_out <- paste0(path,"/")
+        zip_path <- file.path(path,paste0(datasetName,".zip"))
+        dir_out <- file.path(path,"")
         writeBin(content, zip_path)
         unzip(zip_path,exdir = dir_out)
         print("Download Complete")
... ...
@@ -63,9 +63,10 @@ This package is built over a GMQL scalable data management engine
63 63
 written in Scala programming language, released as Scala API [@githubrepo] 
64 64
 providing a set of functions to combine, manipulate, compare, and extract 
65 65
 genomic data from different data sources both from local and remote datasets.
66
-These functions, built extending functionalities available in the R/Bioconductor 
67
-framework, allow performing complex GMQL processing and queries without 
68
-knowledge of GMQL syntax, but leveraging on R idiomatic paradigm and logic.
66
+These functions, built extending functionalities available in the 
67
+R/Bioconductor framework, allow performing complex GMQL processing and queries 
68
+without knowledge of GMQL syntax, but leveraging on R idiomatic paradigm 
69
+and logic.
69 70
 
70 71
 
71 72
 # Genomic Data Model
... ...
@@ -140,7 +141,8 @@ types of files:
140 141
 1. genomic region tab-delimited text files with extension .gdm, or .gtf 
141 142
 if in standard GTF format
142 143
 2. metadata attribute-value tab-delimited text files with the same fullname 
143
-(name and extension) of the correspondent genomic region file and extension .meta
144
+(name and extension) of the correspondent genomic region file and extension 
145
+.meta
144 146
 3. schema XML file containing region attribute names and types
145 147
 
146 148
 All these files reside in a unique folder called $files$.
... ...
@@ -257,17 +259,17 @@ additional input parameter.
257 259
 
258 260
 2. GRangesList:\newline
259 261
 For better integration in the R environment and with other R packages, 
260
-we provide the *read_GRangesList()* function to read directly from R memory/environment
262
+we provide the *read_GRangesList()* function to read directly from R memory
261 263
 using GRangesList as input.
262 264
 
263 265
 ```{r, read GRangesList}
264 266
 library("GenomicRanges")
265 267
 gr1 <- GRanges(seqnames = "chr2",
266
-	ranges = IRanges(103, 106), strand = "+", score = 5L, GC = 0.45)
268
+    ranges = IRanges(103, 106), strand = "+", score = 5L, GC = 0.45)
267 269
 
268 270
 gr2 <- GRanges(seqnames = c("chr1", "chr1"),
269
-	ranges = IRanges(c(107, 113), width = 3), strand = c("+", "-"),
270
-	score = 3:4, GC = c(0.3, 0.5))
271
+    ranges = IRanges(c(107, 113), width = 3), strand = c("+", "-"),
272
+    score = 3:4, GC = c(0.3, 0.5))
271 273
 
272 274
 grl <- GRangesList("txA" = gr1, "txB" = gr2)
273 275
 data_out <- read_GRangesList(grl)
... ...
@@ -372,10 +374,10 @@ each sample.
372 374
 NOTE: GRangesList are contained in the R environment and are not saved on disk.
373 375
 
374 376
 With the *rows* parameter it is possible to specify how many rows, for each 
375
-sample inside the input dataset, are extracted; by default, the *rows* parameter 
376
-value is $0$, that means all rows are extracted.
377
-Note that, since we are working with big data, to extract all rows could be very 
378
-time and space consuming.
377
+sample inside the input dataset, are extracted; by default, the *rows* 
378
+parameter value is $0$, that means all rows are extracted.
379
+Note that, since we are working with big data, to extract all rows could be 
380
+very time and space consuming.
379 381
 
380 382
 ## Remote Processing
381 383
 
... ...
@@ -463,9 +465,10 @@ logout_gmql(test_url)
463 465
 ### Batch Execution
464 466
 
465 467
 This execution type is similar to local processing (syntax, functions, and 
466
-so on ...) except that materialized data are stored only on the remote repository, 
467
-from where they can be downloaded locally and imported in GRangesList 
468
-using the functions in this RGMQL package [(see Import/Export)](#utilities).
468
+so on ...) except that materialized data are stored only on the remote 
469
+repository, from where they can be downloaded locally and imported in 
470
+GRangesList using the functions in this RGMQL package 
471
+[(see Import/Export)](#utilities).
469 472
 
470 473
 Before starting with an example, note that we have to log into remote 
471 474
 infrastructure with login function:
... ...
@@ -533,7 +536,8 @@ The processing flavour can be switched using the function:
533 536
 ```{r, switch mode}
534 537
 remote_processing(TRUE)
535 538
 ```
536
-An user can switch processing mode until the first *collect()* has been performed.
539
+An user can switch processing mode until the first *collect()* has been 
540
+performed.
537 541
 
538 542
 This kind of processing comes from the fact that the *read_GMQL()* function can 
539 543
 accept either a local dataset or a remote repository dataset, 
... ...
@@ -592,23 +596,25 @@ collect(exon_res)
 execute()
 ```
 
-As we can see, the two *read_GMQL()* functions above read from different sources: 
-*mut_ds* from local dataset, *HG19_bed_ann* from remote repository.
+As we can see, the two *read_GMQL()* functions above read from different 
+sources: *mut_ds* from a local dataset, *HG19_bed_ann* from the remote repository.
 
 If we set remote processing to false (*remote_processing(FALSE)*), 
 the execution is performed locally downloading all needed datasets from remote 
-repositories, otherwise all local datasets are automatically uploaded to the remote 
-GMQL repository associated with the remote system where the processing is performed.
+repositories; otherwise, all local datasets are automatically uploaded to the 
+remote GMQL repository associated with the remote system where the processing 
+is performed.
 
-NOTE: The public datasets cannot be downloaded from a remote GMQL repository by design.
+NOTE: Public datasets cannot be downloaded from a remote GMQL repository 
+by design.
 
 # Utilities
 
-The RGMQL package contains functions that allow the user to interface with other 
-packages available in R/Bioconductor repository, e.g., GenomicRanges, and TFARM.
-These functions return GRangesList or GRanges with metadata associated, 
-if present, as data structure suitable to further processing in other R/Bioconductor 
-packages.
+The RGMQL package contains functions that allow the user to interface with 
+other packages available in the R/Bioconductor repository, e.g., GenomicRanges 
+and TFARM. These functions return a GRangesList or GRanges with associated 
+metadata, if present, as a data structure suitable for further processing 
+in other R/Bioconductor packages.
 
 ## Import/Export
 
... ...
@@ -655,9 +661,9 @@ matrix
 
 ```
 *filter_and_extract()* filters the samples in the input dataset based on their 
-specified *metadata*, and then extracts as metadata columns of GRanges the vector 
-of region attributes you specify to retrieve from each filtered sample from the 
-input dataset.
+specified *metadata*, and then extracts, as metadata columns of the GRanges, 
+the vector of region attributes you specify to retrieve from each filtered 
+sample of the input dataset.
 If the $metadata$ argument is NULL, all samples are taken.
 The number of obtained columns is equal to the number of samples left after 
 filtering, multiplied by the number of specified region attributes.
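A hypothetical call sketching the usage described above — the dataset path and attribute names are placeholders, not values from this document, and the parameter names (*metadata*, *region_attributes*) are assumed from the function's documented interface:

```r
# 'path_to_dataset' is a placeholder for a local GMQL dataset folder
matrix_gr <- filter_and_extract(
    path_to_dataset,
    metadata = c("antibody_target", "cell"),  # keep only matching samples
    region_attributes = c("score")            # one column per sample/attribute
)
```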