title: "README"
author: "Pourya Naderi, Alan Teo, Ilya Sytchev, and Winston Hide"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
bibliography: vignettes/references.bib  
    variant: gfm

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

<!-- badges: start -->
<!-- badges: end -->

## Introduction

PanomiR is a package for pathway and microRNA Analysis of gene expression data.
This document provides details about how to install and utilize various 
functionality in PanomiR.

For questions, comments, and other queries, contact pouryany@gmail.com

## Installation

PanomiR can be accessed via Bioconductor. To install, start R
and run the following code.

```{r installation_bioc, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))


You can also install the latest development version of PanomiR using GitHub.

```{r installation_git, eval = FALSE}

## Overview

PanomiR is a pipeline to prioritize disease-associated miRNAs based on activity 
of disease-associated pathways. The input datasets for PanomiR are (a) a gene 
expression disease dataset along with covariates, (b) a background collection
of pathways/genesets, and (c) a collection of miRNAs containing gene targets.

The general workflow of PanomiR is (a) generation of pathway summary statistics 
from gene expression data, (b) detection of differentially activated pathways,
(c) finding coherent groups, or clusters, of differentially activated pathways,
and (d) detecting miRNAs targeting each group of pathways. 

Individual steps of the workflow can be used in isolation to carry out different
analyses. The following sections outline each step and material needed to
execute PanomiR. 

## 1. Pathway summarization

PanomiR can generate pathway activity profiles given a gene expression dataset
and a list of pathways.Pathway summaries are numbers that represent the overall
activity of genes that 
belong to each pathway. These numbers are calculated based on a methodology
previously described in part in  [@altschuler2013pathprinting;
Briefly, genes in each sample are ranked by their expression values and then
pathway summaries are calculated as the average rank-squared of genes within a 
pathway. The summaries are then center and scaled (zNormalized) across samples.

The default list of background pathways in PanomiR is formatted into a table
(`data("path_gene_table")`). The table is based on canonical pathways collection
of Molecular Signatures Database (MSigDB) V6.2 and it contains annotated 
pathways from a variety of sources  [@liberzon2011molecular].

** Users interested in using other pathway/geneset backgrounds, such as newer 
versions of MSigDB or KEGG, should refer to the [appendix](#geneset) of
this manual.

This section uses a reduced example dataset from The Cancer Genome Atlas (TCGA)
Liver Hepatocellular Carcinoma (LIHC) dataset to generate
Pathway summary statistics [@ally2017comprehensive]. **Note:** Make sure that you
select gene representation type that matches the rownames of your expression 
data. The type can be modified using the `id` argument in the function below.
The default value for this argument is `ENSEMBL`. 

```{r load_package}

# Pathway reference from the PanomiR package
# Generating pathway summary statistics 

summaries <- pathwaySummary(miniTestsPanomiR$mini_LIHC_Exp,
                            path_gene_table, method = "x2",
                            zNormalize = TRUE, id = "ENSEMBL")


## 2. Differential Pathway activation

Once you generate the pathway activity profiles, as discussed in the last
section, there are several analysis that you can perform. We have bundled some 
of the most important ones into standalone functions. Here, we describe
differential pathway activation profiling, which is examining differences in
pathway activity profiles in user-determined conditions.

At this stage you need to provide a pathway-gene association table, an
expression dataset, and a covariates table. You need to specity what covariates
you would like to contrast. You also need to provide a contrast, as formatted in
limma. If the contrast is not provided, the function assumes the first two 
levels of the provided contrast covariate. **Note:** make sure the contrast 
covariate is formatted as factor.

```{r differential}

output0 <- differentialPathwayAnalysis(
                        geneCounts = miniTestsPanomiR$mini_LIHC_Exp,
                        pathways =  path_gene_table,
                        covariates = miniTestsPanomiR$mini_LIHC_Cov,
                        condition = 'shortLetterCode')

de.paths <- output0$DEP


## 3. Finding clusters of pathways

PanomiR provides a function to find groups coordinated differentially 
activated pathways based on a pathway co-expression network (PCxN) previously
described in [@pita2018pathway]. Briefly, PCxN
is a network where nodes are pathways and links are co-expression between the
nodes. It is formatted into a table were rows represent edges. The edges of PCxN
are marked by two numbers, 1- a correlation co-efficient and 2- a significance
adjusted p-value. Cut-offs for both of these numbers can be manually set using 
PanomiR functions. See function manuals for more info. 

PCxN and its associated genesets are already released and can be accessed
through following Bioconductor packages:
[pcxn](http://bioconductor.org/packages/release/bioc/html/pcxn.html) and

Here we have provided a small version of PCxN for tutorial purposes. A more 
recent version of PCxN based on MSigDB V6.2 is available through the 
data repository accompanying PanomiR manuscript, which can be found

```{r pcxn}

# using an updated version of pcxn 

pathwayClustsLIHC <- mappingPathwaysClusters(
                            pcxn = miniTestsPanomiR$miniPCXN, 
                            dePathways = de.paths[1:300,],
                            topPathways = 200,
                            plot = FALSE,
                            subplot = FALSE,
                            clusteringFunction = "cluster_louvain",
                            correlationCutOff = 0.1)



## 4. Prioritizing miRNAs per cluster of pathways.

PanomiR identifies miRNAs that target clusters of pathways, as defined in the 
last section. In order to this, you would need a reference table of
miRNA-Pathway association score (enrichment). We recommend using a customized
miRNA-Pathway association table, tailored to your experimental data.
This section provides an overview of prioritization process. Readers interested
in knowing more about the technical details of PanomiR are refered to
accompaniying publication (Work under preparation).

### Enrichment reference
Here, we provide a preprocessed small example table of miRNA-pathway enrichment
in `miniTestsPanomiR$miniEnrich` object. This table contains enrichment analysis
results using Fisher's Exact Test between MSigDB pathways and TargetScan miRNA
targets. The individual components are  accessible via `data(msigdb_c2)` and
`data(targetScan_03)` [@agarwal2015predicting; @liberzon2011molecular]. This
example table is contains only a full subset of the full pairwise enrichment. 
You can refer to [section 5](#geneset) of this manual on how to create full 
tables and how to customize them to your specific gene expression data.

### Generating targeting scores
PanomiR generates a score for individual miRNAs targeting a group of pathways.
These scores are generated based on the reference enrichment table.
We are interested in knowing to what extent each miRNA targets pathway clusters
identified in the last step (see previous section). 
PanomiR constructs a null distribution of this targeting score for each miRNA.
The significance of observed scores from a given group of pathways (clusters
in this case) is contrasted against the null distribution to generate a
targeting p-value. These p-values are used to rank miRNAs per cluster.

### Sampling parameter
The above described process requires repeated sampling to empirically obtain the
null distribution. The argument `sampRate` denotes the number of repeats in the
process. Note that in the example below, we use a sampling rate of 50, the
recommended rate is between 500-1000. Also, we set the saveSampling argument to
FALSE. This argument, if set TRUE, ensures that the null distribution is obtain
only once. This argument should be set to TRUE if you wish to save your sampling
and check for different outputs from the clustering algorithms or pathway

```{r miRNA}

output2 <- prioritizeMicroRNA(enriches0 = miniTestsPanomiR$miniEnrich,
                    pathClust = miniTestsPanomiR$miniPathClusts$Clustering,
                    topClust = 1,
                    sampRate = 50, 
                    method = c("aggInv"),
                    outDir = "Output/",
                    dataDir = "outData/",
                    saveSampling = FALSE,
                    runJackKnife = FALSE,
                    numCores = 1,
                    prefix = "outmiR",
                    saveCSV = FALSE)



## 5. miRNA-Pathway enrichment tables 

PanomiR best performs on tissue/experiment-customized datasets. In order to do
this, you need to create a customized enrichment table. You can simply do so by
using the pathway and miRNA list that we have provided as a part of the package.
simply, plug in the name of the genes present (expressed) in your experiment in
the following code

```{r customized_mir, eval=FALSE}

# using an updated version of pcxn 

customeTableEnrich <- miRNAPathwayEnrichment(mirSets = targetScan_03,
                                              pathwaySets = msigdb_c2,
                                              geneSelection = yourGenes,
                                              mirSelection = yourMicroRNAs,
                                              fromID = "ENSEMBL",
                                              toID = "ENTREZID",
                                              minPathSize = 9,
                                              numCores = 1,
                                              outDir = ".",
                                              saveOutName = NULL)


In the above section, the field ```fromID``` denotes the gene representation 
format of your input list. Here is a quick example that runs fast. Note that 
the `miRNAPathwayEnrichment()` function creates a detailed report with
parameters that are used internally. To get a smaller table that is suitable 
for publication purposes, use `reportEnrichment()` function.
```{r customized_mir2}

# using an updated version of pcxn 

tempEnrich <-miRNAPathwayEnrichment(targetScan_03[1:30],msigdb_c2[1:30])


## 6. Customized genesets and recommendations {#geneset}

PanomiR can integrate genesets and pathways from external sources including
those annotated in MSigDB. In order to do so, you need to provide a 
`GeneSetCollection` object as defined in the `GSEABase` package. 

The example below illustrates how to use external sources to create your 
own customized pathway-gene association table. This customized can then
replaced `path_gene_table` input in functions described in sections 1,2, and 5
of this manual.

```{r customized_gsc}

newPathGeneTable <-tableFromGSC(gscExample)


The the pathway correlation network in section 3 is build upon an MSigDB V6.2, 
canonical pathways (cp) collection dataset that includes KEGG Pathways.
KEGG prohibits distribution of its pathways by third parties. However, use can
access desired versions of MSigDB in gmt format via
[this link](https://www.gsea-msigdb.org/gsea/downloads_archive.jsp)

The library `msigdb` provides an programmatic interface to download different
geneset collections. Including how to add KEGG pathways or download mouse 
genesets. Use the this [MSigDB tutorial](https://bioconductor.org/packages/release/data/experiment/vignettes/msigdb/inst/doc/msigdb.html)
to create your desired gene sets.

You can also use the following code chunk to create pathway-gene association
tables from gmt files.

```{r customized_gsc2, eval=FALSE}


yourGeneSetCollection <- getGmt("YOUR GMT FILE")
newPathGeneTable      <- tableFromGSC(yourGeneSetCollection)


## Session info
```{r sessionInfo}


## References