Bioconductor Code: YAPSA

Name	Mode	Size
R	040000
README_files	040000
data	040000
inst	040000
man	040000
tests	040000
vignettes	040000
.Rbuildignore	100644	0 kb
.gitignore	100644	0 kb
DESCRIPTION	100644	2 kb
NAMESPACE	100644	4 kb
NEWS	100755	3 kb
README.md	100755	15 kb

README.md

# YAPSA Daniel Huebschmann and Lea Jopp-Saile 20/05/2019 # Citation The package is currently being submitted to [Bioconductor](https://www.bioconductor.org/). Please use it once it is accepted there and a suitable citation is provided. # General design ## Introduction The concept of mutational signatures was introduced in a series of papers by Ludmil Alexandrov, Serena Nik-Zainal, Michael Stratton and others (for precise citations please refer to the vignette of the package). The concept was initialy only valid for SNV signatures and in 2018 extedended to allow also the analysis of INDEL mutational sigantures. The general approach is as follows: 1. *The SNVs are categorized by their nucleotide exchange. In total there are `4 x 3 = 12` different nucleotide exchanges, but if summing over reverse complements only `12 / 2 = 6` different categories are left. For every SNV detected, the motif context around the position of the SNV is extracted. This may be a trinucleotide context if taking one base upstream and one base downstream of the position of the SNV, but larger motifs may be taken as well (e.g. pentamers). Taking into account the motif context increases combinatorial complexity: in the case of the trinucleotide context, there are `4 x 6 x 4 = 96` different variant categories. These categories are called **features** in the following text. The number of features will be called `n`. *The INDELs are categorized bt their insertion and deletion and the immediate sequence context. In total there are 83 differnet categories, features. In total there are 47 delition freatures and 36 inserion freatures. For each INDEL the sequence context is extracted and the number of motive repetition determined. In the case of microhomology delition only a partial repetition of the sequence context is considert. 2. A cohort consists of different samples with the number of samples denoted by `m`. For each sample we can count the occurences of each feature, yielding an `n`-dimensional vector (`n` being the number of features) per sample. For a cohort, we thus get an `n x m` -dimensional matrix, called the **mutational catalogue** `V`. It can be understood as a summary indicating which sample has how many variants of which category, but omitting the information of the genomic coordinates of the variants. 3. The mutational catalogue `V` is quite big and still carries a lot of complexity. For many analyses a reduction of complexity is desirable. One way to achieve such a complexity reduction is a matrix decomposition: we would like to find two smaller matrices `W` and `H` which if multiplied would span a high fraction of the complexity of the big matrix `V` (the mutational catalogue). Remember that `V` is an `n x m` -dimensional matrix, `n` being the number of features and `m` being the number of samples. `W` in this setting is an `n x l` -dimensional matrix and `H` is an `l x m` -dimensional matrix. The columns of `W` are called the **mutational signatures** and the columns of `H` are called **exposures**. `l` denotes the number of mutational signatures. Hence the signatures are `n`-dimensional vectors (with `n` being the number of features), while the exposures are `l`-dimensional vectors (`l` being the number of signatures). Note that as we are dealing with count data, we would like to have only positive entries in `W` and `H`. A mathematical method which is able to do such a decomposition is the **NMF** (**nonnegative matrix factorization**). It basically solves the problem as illustrated in the following figure: ![NMF, Image taken from <https://en.wikipedia.org/wiki/Non-negative_matrix_factorization>](https://upload.wikimedia.org/wikipedia/commons/f/f9/NMF.png) Of course the whole concept only leads to a reduction in complexity if `l < n`, i.e. if the number of signatures is smaller than the number of features, as indicated in the above figure. Note that the NMF itself solves the above problem for a given number of signatures `l`. Addinional criteria exist to evaluate the true number of signatures. ## The YAPSA package In a context where mutational signatures `W` are already known (because they were decribed and published or they are available in a database as under <http://cancer.sanger.ac.uk/cosmic/signatures>), we might want to just find the exposures `H` for these known signatures in the mutational catalogue `V` of a given cohort. The **YAPSA**-package (Yet Another Package for Signature Analysis) presented here provides the function `LCD` (**l**inear **c**ombination **d**ecomposition) to perform this task. The advantage of this method is that there are **no constraints on the cohort size**, so `LCD` can be run for as little as one sample and thus be used e.g. for signature analysis in personalized oncology. In contrast to NMF, `LCD` is very fast and requires very little computational resources. The YAPSA package provides additional functions for signature analysis, e.g. for stratifying the mutational catalogue to determine signature exposures in different strata, part of which is discussed in the vignette of the package. # Install As long as `YAPSA` is not yet accepted on [Bioconductor](https://www.bioconductor.org/) it may be downloaded and installed from [github](https://github.com/): ```r library(devtools) install_github("huebschm/YAPSA") ``` Of course, `devtools` has to be installed: ```r install.packages("devtools") ``` `YAPSA` already has preparations to use the newest versions of the pacakges `circlize` and `ComplexHeatmap` by Zuguang Gu. These are currently not in the release branch of Bioconductor. If you want your system to be ready for the next coming update of `YAPSA` you may already now install the newest versions of these packages from [github](https://github.com/) as well: ```r install_github("jokergoo/circlize") install_github("jokergoo/ComplexHeatmap") ``` If you ran into dependency conflicts before, try rerunning `install_github("huebschm/YAPSA")` now. # Usage ## Example: a cohort of B-cell lymphomas ```r library(YAPSA) library(knitr) opts_chunk$set(echo=TRUE) opts_chunk$set(fig.show='asis') ``` ### Loading example data Load data in a vcf-like format: ```r data("lymphoma_Nature2013_raw") ``` Adapt the data structure: ```r names(lymphoma_Nature2013_raw_df) <- c("PID","TYPE","CHROM","START", "STOP","REF","ALT","FLAG") lymphoma_Nature2013_df <- subset(lymphoma_Nature2013_raw_df,TYPE=="subs", select=c(CHROM,START,REF,ALT,PID)) names(lymphoma_Nature2013_df)[2] <- "POS" head(lymphoma_Nature2013_df) ``` ``` ## CHROM POS REF ALT PID ## 1 1 183502381 G A 07-35482 ## 2 18 60985506 T A 07-35482 ## 3 18 60985748 G T 07-35482 ## 4 18 60985799 T C 07-35482 ## 5 2 242077457 A G 07-35482 ## 6 6 13470412 C T 07-35482 ``` Note that there are 48 different samples: ```r unique(lymphoma_Nature2013_df$PID) ``` ``` ## [1] 07-35482 1060 1061 1065 ## [5] 1093 1096 1102 4101316 ## [9] 4105105 4108101 4112512 4116738 ## [13] 4119027 4121361 4125240 4133511 ## [17] 4135350 4142267 4158726 4159170 ## [21] 4163639 4175837 4177856 4182393 ## [25] 4189200 4189998 4190495 4193278 ## [29] 4194218 4194891 515 DLBCL-PatientA ## [33] DLBCL-PatientB DLBCL-PatientC DLBCL-PatientD DLBCL-PatientE ## [37] DLBCL-PatientF DLBCL-PatientG DLBCL-PatientH DLBCL-PatientI ## [41] DLBCL-PatientJ DLBCL-PatientK DLBCL-PatientL DLBCL-PatientM ## [45] EB2 FL009 FL-PatientA G1 ## 48 Levels: 07-35482 1060 1061 1065 1093 1096 1102 4101316 ... G1 ``` For convenience later on, we annotate subgroup information to every variant (indirectly through the sample it occurs in). For reasons of simplicity, we also restrict the analysis to the Whole Genome Sequencing (WGS) datasets: ```r lymphoma_Nature2013_df$SUBGROUP <- "unknown" DLBCL_ind <- grep("^DLBCL.*",lymphoma_Nature2013_df$PID) lymphoma_Nature2013_df$SUBGROUP[DLBCL_ind] <- "DLBCL_other" MMML_ind <- grep("^41[0-9]+$",lymphoma_Nature2013_df$PID) lymphoma_Nature2013_df <- lymphoma_Nature2013_df[MMML_ind,] data(lymphoma_PID) for(my_PID in rownames(lymphoma_PID_df)) { PID_ind <- which(as.character(lymphoma_Nature2013_df$PID)==my_PID) lymphoma_Nature2013_df$SUBGROUP[PID_ind] <- lymphoma_PID_df$subgroup[which(rownames(lymphoma_PID_df)==my_PID)] } lymphoma_Nature2013_df$SUBGROUP <- factor(lymphoma_Nature2013_df$SUBGROUP) unique(lymphoma_Nature2013_df$SUBGROUP) ``` ``` ## [1] WGS_D WGS_F WGS_B WGS_I ## Levels: WGS_B WGS_D WGS_F WGS_I ``` As stated [above](#LCD), one of the functions in the YAPSA package (`LCD`) is designed to do mutational signatures analysis with known signatures. There are (at least) two possible sources for signature data: i) the ones published initially by Alexandrov, and ii) an updated and curated current set of mutational signatures is maintained by Ludmil Alexandrov at <http://cancer.sanger.ac.uk/cosmic/signatures>. ```r data(sigs) current_sig_df <- AlexCosmicValid_sig_df current_sigInd_df <- AlexCosmicValid_sigInd_df ``` Now we can start using main functions of the YAPSA package: `LCD` and `LCD_complex_cutoff`. ### Building a mutational catalogue Prepare. ```r library(BSgenome.Hsapiens.UCSC.hg19) lymphoma_Nature2013_df <- translate_to_hg19(lymphoma_Nature2013_df,"CHROM") lymphoma_Nature2013_df$change <- attribute_nucleotide_exchanges(lymphoma_Nature2013_df) lymphoma_Nature2013_df <- lymphoma_Nature2013_df[order(lymphoma_Nature2013_df$PID, lymphoma_Nature2013_df$CHROM, lymphoma_Nature2013_df$POS),] ``` This section uses functions which are to a large extent wrappers for functions in the package `SomaticSignatures` by Julian Gehring. ```r word_length <- 3 lymphomaNature2013_mutCat_list <- create_mutation_catalogue_from_df( lymphoma_Nature2013_df, this_seqnames.field = "CHROM", this_start.field = "POS", this_end.field = "POS", this_PID.field = "PID", this_subgroup.field = "SUBGROUP", this_refGenome = BSgenome.Hsapiens.UCSC.hg19, this_wordLength = word_length) ``` The function `create_mutation_catalogue_from_df` returns a list object with several entries. We will use the one called `matrix`. ```r lymphomaNature2013_mutCat_df <- as.data.frame( lymphomaNature2013_mutCat_list$matrix) ``` ### LCD analysis with signature-specific cutoffs When using `LCD_complex_cutoff`, we have to supply a vector of cutoffs with as many entries as there are signatures. It may make sense to provide different cutoffs for different signatures. ```r my_cutoff <- 0.06 general_cutoff_vector <- rep(my_cutoff,dim(current_sig_df)[2]) specific_cutoff_vector <- general_cutoff_vector specific_cutoff_vector[c(1,5)] <- 0 specific_cutoff_vector ``` ``` ## [1] 0.00 0.06 0.06 0.06 0.00 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 ## [15] 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 ## [29] 0.06 0.06 ``` In this example, the cutoff for signatures AC1 and AC5 is thus set to 0, whereas the cutoffs for all other signatures remains at 0.06. Running the function `LCD_complex_cutoff`: ```r CosmicValid_cutoffSpec_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = specific_cutoff_vector, in_sig_ind_df = current_sigInd_df) ``` Some adaptation (extracting and reformatting the information which sample belongs to which subgroup): ```r COSMIC_subgroups_df <- make_subgroups_df(CosmicValid_cutoffSpec_LCDlist$exposures, lymphoma_Nature2013_df) ``` Plotting absolute exposures for visualization: ```r exposures_barplot( in_exposures_df = CosmicValid_cutoffSpec_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` ![Absoute exposures of the COSMIC signatures in the lymphoma mutational catalogues, cutoff of 6% for the LCD (Linear Combination Decomposition).](README_files/figure-html/exposures_barplot_abs_cutoffSpec-1.png) And relative exposures: ```r exposures_barplot( in_exposures_df = CosmicValid_cutoffSpec_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df) ``` ![Relative exposures of the COSMIC signatures in the lymphoma mutational catalogues, cutoff of 6% for the LCD (Linear Combination Decomposition).](README_files/figure-html/exposures_barplot_rel_cutoffSpec-1.png) Note that the signatures extracted with the signature-specific cutoffs are the same in the example displayed here. Depending on the analyzed cohort and the choice of cutoffs, the extracted signatures may vary considerably. ## Cluster samples based on their signature exposures To identify groups of samples which were exposed to similar mutational processes, the exposure vectors of the samples can be compared. The YAPSA package provides a custom function for this task: `complex_heatmap_exposures`, which uses the package *[ComplexHeatmap](http://bioconductor.org/packages/ComplexHeatmap)* by Zuguang Gu. It produces output as follows: ```r complex_heatmap_exposures(CosmicValid_cutoffSpec_LCDlist$norm_exposures, COSMIC_subgroups_df, CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_data_type="norm exposures", in_subgroup_colour_column="col", in_method="manhattan", in_subgroup_column="subgroup") ``` ![Clustering of Samples and Signatures based on the relative exposures of the COSMIC signatures in the lymphoma mutational catalogues.](README_files/figure-html/apply_heatmap_exposures-1.png) If you are interested only in the clustering and not in the heatmap information, you could also use `hclust_exposures`: ```r hclust_list <- hclust_exposures(CosmicValid_cutoffSpec_LCDlist$norm_exposures, COSMIC_subgroups_df, in_method="manhattan", in_subgroup_column="subgroup") ``` The dendrogram produced by either the function `complex_heatmap_exposures` or the function `hclust_exposures` can be cut to yield signature exposure specific subgroups of the PIDs. ```r cluster_vector <- cutree(hclust_list$hclust,k=4) COSMIC_subgroups_df$cluster <- cluster_vector subgroup_colour_vector <- rainbow(length(unique(COSMIC_subgroups_df$cluster))) COSMIC_subgroups_df$cluster_col <- subgroup_colour_vector[factor(COSMIC_subgroups_df$cluster)] complex_heatmap_exposures(CosmicValid_cutoffSpec_LCDlist$norm_exposures, COSMIC_subgroups_df, CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_data_type="norm exposures", in_subgroup_colour_column="cluster_col", in_method="manhattan", in_subgroup_column="cluster") ``` ![PIDs labelled by the clusters extracted from the signature analysis.](README_files/figure-html/cluster_PIDs_sig_info-1.png) ## Performing a stratification based on mutation density This type of analysis is performed using the function `run_SMC` where SMC stands for __stratification of the mutational catalogue__. For details on this function please consult the vignette.