# Rcollectl: simple interfaces to collectl output

Profiling R computations is supported by `Rprof` and `profvis`. These measure memory consumption and function execution counts and timings. In workflow design we are also interested in CPU load, disk I/O, and network traffic, and there is no standard portable approach to measuring these. In this package, we focus on Linux systems measurement with the collectl suite of tools.

## Notes from the AnVIL workspace description

Our objective in this workspace is demonstration of instrumentation of bioinformatic tasks. By "instrumentation" we mean the definition and use of tools that measure computational resource consumption: memory, disk, CPU, and network traffic. (On AnVIL, we do not know how to identify the network device, and data on network traffic are not obtained as of 1/1/2021.) This Dashboard Description discusses details of an example using salmon to carry out quantification of an RNA-seq experiment.

## Basic resources

### Workspace backbone

This workspace was initialized using AnVILPublish. The runtime/cloud environment used as of 2020-01-04 is dockerhub vjcitn/instr:0.0.3.

### Raw data

The FASTQ files underlying the "airway" RNA-seq workflow were collected using commands like

```
gsutil -u landmarkanvil2 ls gs://sra-pub-run-3/SRR1039512/SRR1039512.1
gsutil -u landmarkanvil2 cp gs://sra-pub-run-4/SRR1039513/SRR1039513.1 .
gsutil -u landmarkanvil2 cp gs://sra-pub-run-4/SRR1039516/SRR1039516.1 .
gsutil -u landmarkanvil2 cp gs://sra-pub-run-2/SRR1039517/SRR1039517.1 .
gsutil -u landmarkanvil2 cp gs://sra-pub-run-2/SRR1039520/SRR1039520.1 .
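# (Sketch, not from the original workspace: the per-run copies above can be
# scripted. The shard bucket sra-pub-run-2/3/4 varies by accession, so pair
# shard and run in a table; `echo` previews each command -- remove it to
# actually perform the copies.)
printf '3 SRR1039512\n4 SRR1039513\n4 SRR1039516\n2 SRR1039517\n2 SRR1039520\n3 SRR1039521\n' |
while read shard run; do
  echo gsutil -u landmarkanvil2 cp "gs://sra-pub-run-${shard}/${run}/${run}.1" .
done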
gsutil -u landmarkanvil2 cp gs://sra-pub-run-3/SRR1039521/SRR1039521.1 .
```

A public bucket with the extracted fastq was then produced via

```
fastq-dump --split-3 *512.1 &
fastq-dump --split-3 *513.1 &
fastq-dump --split-3 *516.1 &
fastq-dump --split-3 *517.1 &
fastq-dump --split-3 *520.1 &
fastq-dump --split-3 *521.1 &
gzip *fastq
gsutil ls gs://bioc-airway-fastq
gsutil cp SRR1039508.1_2.fastq.gz gs://bioc-airway-fastq/
```

## Quantification process

### Software and indexing resources

Our aim is to use Rcollectl to analyze the resource consumption of processing this sort of data with R. We'll describe how to run salmon to generate quantifications, and we will attempt to give estimates for processing various volumes of data by various approaches.

We installed snakemake with pip3 and salmon via

```
wget
```

We obtained GENCODE transcript sequences:

```
wget
```

And built the index `gencode.v36_salmon_1.4.0` via

```
salmon index --gencode -t gencode.v36.transcripts.fa.gz -i gencode.v36_salmon_1.4.0
```

### Snakemake for workflow definition

For a single sample (paired-end) we have a snakemake file:

```
rstudio@saturn-1f2f18e5-4182-40c0-8449-1f301b5c3b03:~$ cat snakemake_one
DATASETS = ["SRR1039508.1"]
SALMON = "/home/rstudio/bin/salmon"

rule all:
    input:
        expand("quants/{dataset}/quant.sf", dataset=DATASETS)

rule salmon_quant:
    input:
        r1 = "fastq/{sample}_1.fastq.gz",
        r2 = "fastq/{sample}_2.fastq.gz",
        index = "/home/rstudio/gencode.v36_salmon_1.4.0"
    output:
        "quants/{sample}/quant.sf"
    params:
        dir = "quants/{sample}"
    shell:
        "{SALMON} quant -i {input.index} -l A -p 6 --validateMappings \
         --gcBias --numGibbsSamples 20 -o {params.dir} \
         -1 {input.r1} -2 {input.r2}"
```

When used with

```
snakemake -j1 --snakefile snakemake_one
```

it takes 6 minutes to produce a folder `quants` with content

```
└── SRR1039508.1
    ├── aux_info
    │   ├── ambig_info.tsv
    │   ├── bootstrap
    │   │   ├── bootstraps.gz
    │   │   └── names.tsv.gz
    │   ├── expected_bias.gz
    │   ├── exp_gc.gz
    │   ├── fld.gz
    │   ├── meta_info.json
    │   ├── observed_bias_3p.gz
    │   ├── observed_bias.gz
    │   └── obs_gc.gz
    ├── cmd_info.json
    ├── lib_format_counts.json
    ├── libParams
    │   └── flenDist.txt
    ├── logs
    │   └── salmon_quant.log
    └── quant.sf
```

## Instrumentation of this process

We use the Rcollectl package (which must be installed from its development repository) to monitor resource consumption while the snakemake process is running.

### Single sample, 6 threads

Here is a display of the resource consumption for the single (paired-end) sample:

![single sample usage profile]()

That run uses a setting of `-p 6` for `salmon quant`, which allows the algorithm to use 6 threads.

### Single sample, 12 threads

We used a 16-core machine, and raised the value of `-p` to 12 to obtain the following profile:

![single sample usage, 12 threads]()

### Eight samples, 12 threads

![eight samples 12 threads]()

It takes about 40 minutes to process the eight samples, using about 80% of the available CPUs and about 16 GB RAM overall.

### Four samples, 3 threads, 4 processes (-j4 for snakemake)

Finally, we obtain some data comparing thread-based parallelism to process-based parallelism. We reduce the number of threads per sample, but run 4 samples at once via the snakemake -j argument.

![four samples 3 threads -j4]()

Throughput seems somewhat better here, but we should check that the different approaches produce comparable results before drawing that conclusion.

## Conclusions

Rcollectl can be used to obtain useful profiling information on AnVIL interactive work. We saw that the salmon quantification could be tuned to consume 80-90% of available CPUs. The RAM requirements and disk consumption are measurable. It does not seem possible to decouple the number of CPUs from the RAM allocation in a fine-grained way in AnVIL interactive cloud environments: for 16 cores we could choose between 14.4 and 64 GB RAM, and 64 GB seems too much for the task we are doing.

Additional work of interest: snakemake is used here only because a working snakefile was published.
We would like to be able to conduct these exercises in a more Bioconductor-native way. Specifically, we should have an object representing the read data in a self-describing way, and use BiocParallel to manage the execution approach. Furthermore, as we define the self-describing representation of the read data in Bioconductor, we should consider how it maps to the AnVIL/Terra-native structures related to workspace data tables and DRS.
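Before declaring one thread/process setting faster, the quantifications themselves should be checked for comparability. Each `quants/<sample>/quant.sf` written by salmon is a tab-separated table with columns Name, Length, EffectiveLength, TPM and NumReads, so per-sample summaries are easy to script. A minimal sketch of such a sanity check follows; the two-transcript `demo_quant.sf` is a hypothetical stand-in for a real salmon output:

```shell
# Build a tiny stand-in for a salmon quant.sf file (hypothetical values).
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  demo_quant.sf
printf 'ENST00000000001\t1000\t800\t600000\t10\n'       >> demo_quant.sf
printf 'ENST00000000002\t500\t300\t400000\t5\n'         >> demo_quant.sf

# TPM values are scaled so each sample sums to 1e6; summing column 4 is a
# cheap check before comparing runs made with different -p/-j settings.
awk -F'\t' 'NR > 1 { s += $4 } END { print s }' demo_quant.sf
# prints 1000000 for this demo file
```

In the real comparison one would run the `awk` line over each `quants/*/quant.sf` from the different runs and then diff the per-transcript TPM columns.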