vignettes/OncoSimulR.Rmd
 6610ac26  --- title: "OncoSimulR: forward genetic simulation in asexual populations with arbitrary epistatic interactions and a focus on modeling tumor progression." author: " Ramon Diaz-Uriarte\\ Dept. Biochemistry, Universidad Autónoma de Madrid, Instituto de Investigaciones Biomédicas 'Alberto Sols' (UAM-CSIC), Madrid, Spain.\\ , "  bbd76086  date: "r paste0(Sys.Date(),'. OncoSimulR version ', packageVersion('OncoSimulR'), suppressWarnings(ifelse(length(try(system('git rev-parse --short HEAD', ignore.stderr = TRUE, intern = TRUE))), paste0('. Revision: ', system('git rev-parse --short HEAD', intern = TRUE)), '')))" header-includes: - \input{preamble.tex}  6610ac26  output:  08de3ba7  bookdown::html_document2: css: custom4.css  6610ac26  toc: yes toc_float: true  34c1b54f  fig_retina: null  6610ac26  classoption: a4paper geometry: margin=3cm fontsize: 12pt bibliography: OncoSimulR.bib  bbd76086  biblio-style: "apalike"  6610ac26  link-citations: true vignette: > %\VignetteIndexEntry{OncoSimulR: forward genetic simulation in asexual populations with arbitrary epistatic interactions and a focus on modeling tumor progression.} %\VignetteEngine{knitr::rmarkdown} %\VignettePackage{OncoSimulR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteKeywords{OncoSimulR simulation cancer oncogenetic trees} %\VignetteDepends{OncoSimulR} ---  08de3ba7   6610ac26   08de3ba7   6610ac26  {r setup, include=FALSE}  08de3ba7  ## use collapse for bookdown, to collapse all the source and output  6610ac26  ## blocks from one code chunk into a single block knitr::opts_chunk$set(echo = TRUE, collapse = TRUE) options(width = 70) require(BiocStyle)  bbd76086  require(pander)  6610ac26    bbd76086  \clearpage  6610ac26   bbd76086  # Introduction {#introdd}  6610ac26   74449e3b  OncoSimulR is an individual- or clone-based forward-time genetic simulator for biallelic markers (wildtype vs. mutated) in asexually reproducing populations without spatial structure (perfect mixing). Its design emphasizes flexible specification of fitness and mutator effects.  fc38a875   f9f9ab96  OncoSimulR was originally developed to simulate tumor progression with emphasis on allowing users to set restrictions in the accumulation of mutations as specified, for example, by Oncogenetic Trees [OT: @Desper1999JCB; @Szabo2008] or Conjunctive Bayesian Networks [CBN: @Beerenwinkel2007; @Gerstung2009; @Gerstung2011], with the  fc38a875  possibility of adding passenger mutations to the simulations and allowing for several types of sampling. Since then, OncoSimulR has been vastly extended to allow you to specify other types of restrictions in the accumulation of genes, such as the XOR models of @Korsunsky2014 or the "semimonotone" model of @Farahani2013. Moreover, different fitness effects related to the order in which mutations appear can also be incorporated, involving arbitrary numbers of genes. This is *very* different from "restrictions in the order of accumulation of mutations". With order effects, described in a recent cancer paper by Ortmann and collaborators [@Ortmann2015], the effect of having both mutations "A" and "B" differs depending on whether "A" appeared before or after "B" (the actual case involves genes JAK2 and TET2). More generally, OncoSimulR now also allows you to specify arbitrary epistatic interactions between arbitrary collections of genes and to model, for example, synthetic mortality or synthetic viability (again, involving an arbitrary number of genes, some of which might also depend on other genes, or show order effects with other genes). Moreover, it is possible to specify the above interactions in terms of modules, not genes. This idea is discussed in, for example, @Raphael2014a and @Gerstung2011: the restrictions encoded in, say, CBNs or OT can be considered to apply not to genes, but to modules, where each module is a set of genes (and the intersection between modules is the empty set) that performs a specific biological function. Modules, then, play the role of a "union operation" over the set of genes in a module. In addition, arbitrary numbers of genes without interactions (and with fitness effects coming from any distribution you might want) are also possible.  f9f9ab96  Mutator/antimutator genes, genes that alter the mutation rate of other genes [@gerrish_complete_2007; @tomlinson_mutation_1996], can also be simulated with OncoSimulR and specified with most of the mechanisms above (you can have, for instance, interactions between mutator genes). And, regardless of the presence or not of other mutator/antimutator genes, different genes can have different mutation rates.  fc38a875   f9f9ab96  Simulations can be stopped as a function of total population size, number of mutated driver genes, or number of time periods. Simulations can also be stopped with a stochastic detection mechanism where the probability of  fc38a875  detecting a tumor increases with total population size. Simulations return  f9f9ab96  the number of cells of every genotype/clone at each of the sampling periods and we can take samples from the former with single-cell or whole- tumor resolution, adding noise if we want. If we ask for them, simulations also store and return the genealogical relationships of all clones generated during the simulation.  fc38a875  The models so far implemented are all continuous time models, which are simulated using the BNB algorithm of @Mather2012. The core of the code is implemented in C++, providing for fast execution. To help with simulation studies, code to simulate random graphs of the kind often seen in CBNs, OTs, etc, is also available. Finally, OncoSimulR also allows for the generation of random fitness landscapes and the representation of fitness landscapes  f9f9ab96  and provides statistics of evolutionary predictability.  fc38a875   6610ac26   a9ecbe56   6610ac26  ## Key features of OncoSimulR {#key} As mentioned above, OncoSimulR is now a very general package for forward genetic simulation, with applicability well beyond tumor progression. This  f9f9ab96  is a summary of some of its key features:  6610ac26   fc38a875  * You can specify arbitrary interactions between genes, with arbitrary fitness effects, with explicit support for:  6610ac26  - Restrictions in the accumulations of mutations, as specified by Oncogenetic Trees (OTs), Conjunctive Bayesian Networks (CBNs), semimonotone progression networks, and XOR relationships.  f9f9ab96  - Epistatic interactions including, but not limited to, synthetic  6610ac26  viability and synthetic lethality. - Order effects. * You can add passenger mutations. * You can add mutator/antimutator effects. * Fitness and mutation rates can be gene-specific.  f9f9ab96  * You can add arbitrary numbers of non-interacting  6610ac26  genes with arbitrary fitness effects.  fc38a875  * you can allow for deviations from the OT, CBN, semimonotone, and  6610ac26  XOR models, specifying a penalty for such deviations (the$s_h$parameter). * You can conduct multiple simulations, and sample from them with different temporal schemes and using both whole tumor or single cell sampling.  fc38a875  * You can stop the simulations using a flexible combination of  f9f9ab96  conditions: final time, number of drivers, population size, fixation of certain genotypes, and a stochastic stopping mechanism that depends on population size.  6610ac26  * Right now, three different models are available, two that lead to exponential growth, one of them loosely based on @Bozic2010, and another that leads to logistic-like growth, based on @McFarland2013.  f9f9ab96  * You can use large numbers of genes (e.g., see an example of  6610ac26  50000 in section \@ref(mcf50070) ).  fc38a875  * Simulations are generally very fast: I use C++ to implement the BNB algorithm (see sections \@ref(bnbmutation) and \@ref(bnbdensdep) for more detailed comments on the usage of this algorithm).  6610ac26  * You can obtain the true sequence of events and the phylogenetic  fc38a875  relationships between clones (see section \@ref(meaningclone) for the details of what we mean by "clone").  6610ac26  * You can generate random fitness landscapes (under the House of Cards, Rough Mount Fuji, or additive models, or combinations of the former) and use those landscapes as input to the simulation functions. * You can plot fitness landscapes.  f9f9ab96  * You can obtain statistics of evolutionary predictability  fc38a875  from the simulations.  6610ac26   bbd76086   fc38a875  The table below, modified from the table at the  f9f9ab96  [Genetics Simulation Resources (GSR) page](https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed),  fc38a875  provides a summary of the key features of OncoSimulR. (An explanation of the meaning of terms specific to the GSR table is available from https://popmodels.cancercontrol.cancer.gov/gsr/search/ or from the [Genetics Simulation Resources table itself](https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed), by moving the mouse over each term).  bbd76086  \clearpage  fc38a875   f9f9ab96  |Attribute Category | Attribute | |-------------------------------|-----------------------------------------------| |**Target** | | | Type of Simulated Data | Haploid DNA Sequence| | Variations | Biallelic Marker, Genotype or Sequencing Error| |**Simulation Method** | Forward-time| | Type of Dynamical Model | Continuous time| | Entities Tracked | Clones (see \@ref(trackindivs))|  fc38a875  |**Input** | Program specific (R data frames and matrices specifying genotypes' fitness, gene effects, and starting genotype) | |**Output**|| | Data Type| Genotype or Sequence, Individual Relationship (complete parent-child relationships between clones), Demographic (populations sizes of all clones at sampling times), Diversity Measures (LOD, POM, diversity of genotypes), Fitness| | Sample Type| Random or Independent, Longitudinal, Other (proportional to population size)| |**Evolutionary Features** || | Mating Scheme| Asexual Reproduction |  f9f9ab96  | Demographic ||  fc38a875  | Population Size Changes| Exponential (two models), Logistic (McFarland et al., 2013)| | Fitness Components|| | Birth Rate| Individually Determined from Genotype (models "Exp" and "McFL")| | Death Rate| Individually Determined from Genotype (model "Bozic"), Influenced by Environment ---population size (model "McFL")| | Natural Selection|| | Determinant| Single and Multi-locus, Fitness of Offspring, Environmental Factors (population size)| | Models| Directional Selection, Multi-locus models, Epistasis, Random Fitness Effects| | Mutation Models| Two-allele Mutation Model (wildtype, mutant), without back mutation| | Events Allowed| Varying Genetic Features: change of individual mutation rates (mutator/antimutator genes)| | Spatial Structure| No Spatial Structure (perfectly mixed and no migration)|  bbd76086  Table:(\#tab:osrfeatures) Key features of OncoSimulR. Modified from  fc38a875  the original table from https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed . Further details about the original motivation for wanting to simulate data this way in the context of tumor progression can be found in @Diaz-Uriarte2015, where additional comments about model parameters and caveats are discussed.  f9f9ab96  Are there similar programs? The Java program by @Reiter2013a, TTP, offers somewhat similar functionality to the previous version of OncoSimulR, but it is restricted to at most four drivers (whereas v.1 of OncoSimulR allowed for up to 64), you cannot use arbitrary CBNs or OTs (or XORs or semimonotone graphs) to specify restrictions, there is no allowance for passengers, and a single type of model (a discrete time Galton-Watson process) is implemented. The current functionality of OncoSimulR goes well beyond the the previous version (and, thus, also the TPT of @Reiter2013a). We now allow you to specify all types of fitness effects in other general forward genetic simulators such as FFPopSim [@Zanini2012], and some that, to our knowledge (e.g., order effects) are not available from any genetics simulator. In addition, the "Lego system" to flexibly combine different fitness specifications is also unique; by "Lego system" I mean that we can combine different pieces and blocks, similarly to what we do with Lego bricks. (I find this an intuitive and very graphical analogy, which I have copied from @Hothorn_2006 and @Hothorn_2008). In a nutshell, salient features of OncoSimulR compared to other simulators are the unparalleled flexibility to specify fitness and mutator effects, with modules and order effects as particularly unique, and the options for sampling and stopping the simulations, particularly convenient in cancer evolution models. Also unique in this type of software is the addition of functions for simulating fitness landscapes and assessing evolutionary predictability.  fc38a875  ## What kinds of questions is OncoSimulR suited for? {#generalwhatfor} OncoSimulR can be used to address questions that span from the effect of mutator genes in cancer to the interplay between fitness landscapes and mutation rates. The main types of questions that OncoSimulR can help address involve combinations of: * Simulating asexual evolution (the oncoSimul* functions) where: - Fitness is: - A function of specific epistatic effects between genes - A function of order effects - A function of epistatic effects specified using DAGs/posets where these DAGs/posets: - Are user-specified - Generated randomly (simOGraph)  f9f9ab96  - Any mapping between genotypes and fitness where this mapping is:  fc38a875  - User-specified - Generated randomly from families of random fitness landscapes (rfitness) - Mutation rates can: - Vary between genes - Be affected by other genes * Examining times to evolutionarily or biomedically relevant events (fixation of genotypes, reaching a minimal size, acquiring a minimal number of driver genes, etc ---specified with the stopping conditions to the oncoSimul* functions). * Using different sampling schemes (samplePop) that are related to: - Assessing genotypes from single-cell vs. whole tumor (or whole population) with the typeSample argument - Genotyping error (propError argument) - Timing of samples (timeSample argument) - ... and assessing the consequences of those on the observed genotypes and their diversity (sampledGenotypes) and any other inferences that depend on the observational process.  f9f9ab96  - (OncoSimulR returns the abundances of all genotypes at each of the sampling points, so you are not restricted by what the samplePop function provides.)  fc38a875   6610ac26   f9f9ab96  * Tracking the genealogical relationships of clones  fc38a875  (plotClonePhylog) and assessing evolutionary predictability (LOD, POM).  6610ac26   fc38a875  Some specific questions that you can address with the help of OncoSimulR are discussed in section \@ref(whatfor).  6610ac26   fc38a875  -----  6610ac26   fc38a875  A quick overview of the main functions and their relationships is shown in  f9f9ab96  Figure \@ref(fig:frelats), where we use italics for the type/class of R object and courier font for the name of the functions.  fc38a875   6610ac26   fc38a875  {r frelats, eval=TRUE,echo=FALSE, fig.cap="Relationships between the main functions in OncoSimulR."} knitr::include_graphics("relfunct.png")   f9f9ab96  \clearpage  fc38a875  ## Examples of questions that can be addressed with OncoSimulR {#whatfor} Most of the examples in the rest of this vignette, starting with those in \@ref(quickexample), focus on the mechanics. Here, we will illustrate some  f9f9ab96  problems in cancer genomics and evolutionary genetics where OncoSimulR could be of help. This section does not try to provide an answer to any of these questions (those would be full papers by themselves). Instead, this section simply tries to illustrate some kinds of questions where you can use OncoSimulR; of course, the possible uses of OncoSimulR are only limited by your ingenuity. Here, I will only use short snippets of working code as we are limited by time of execution; for real work you would want to use many more scenarios and many more simulations, you would use appropriate statistical methods to compare the output of runs, etc, etc, etc.  fc38a875  {r firstload} ## Load the package library(OncoSimulR)  ### Recovering restrictions in the order of accumulation of mutations {#ex-order}  f9f9ab96  This is a question that was addressed, for instance, in  fc38a875  @Diaz-Uriarte2015: do methods that try to infer restrictions in the  f9f9ab96  order of accumulation of mutations [e.g., @Szabo2008; @Gerstung2009; @ramazzotti_capri_2015] work well under different evolutionary  fc38a875  models and with different sampling schemes? (This issue is also touched upon in section \@ref(sample-1)). A possible way to examine that question would involve: - generating random DAGs that encode restrictions; - simulating cancer evolution using those DAGs; - sampling the data and adding different levels of noise to the sampled data; - running the inferential method; - comparing the inferred DAG with the original, true, one. {r, echo=FALSE} set.seed(2) RNGkind("L'Ecuyer-CMRG")  {r ex-dag-inf}  f9f9ab96  ## For reproducibility set.seed(2) RNGkind("L'Ecuyer-CMRG") ## Simulate a DAG  fc38a875  g1 <- simOGraph(4, out = "rT")  f9f9ab96  ## Simulate 10 evolutionary trajectories  fc38a875  s1 <- oncoSimulPop(10, allFitnessEffects(g1, drvNames = 1:4), mc.cores = 2, ## adapt to your hardware  bbd76086  seed = NULL) ## for reproducibility of vignette  fc38a875   f9f9ab96  ## Sample those data uniformly, and add noise  fc38a875  d1 <- samplePop(s1, timeSample = "unif", propError = 0.1)  f9f9ab96  ## You would now run the appropriate inferential method and ## compare observed and true. For example  bbd76086   f9f9ab96  ## require(Oncotree) ## fit1 <- oncotree.fit(d1)  bbd76086   f9f9ab96  ## Now, you'd compare fitted and original. This is well beyond ## the scope of this document (and OncoSimulR itself).  fc38a875    f9f9ab96  {r hidden-rng-exochs, echo = FALSE} set.seed(NULL)  fc38a875    f9f9ab96   74449e3b  ### Sign epistasis and probability of crossing fitness valleys {#ex-ochs}  fc38a875   f9f9ab96  This question, and the question in the next section (\@ref(ex-predict)), encompass a wide range of issues that have been addressed in evolutionary genetics studies and which include from detailed analysis of simple models with a few uphill paths and valleys as in @Weissman2009 or @Ochs2015, to questions that refer to larger, more complex fitness landscapes as in @szendro_predictability_2013 or @franke_evolutionary_2011 (see below).  e719a81d  Using as an example @Ochs2015 (we will see this example again in section \@ref(ochsdesai), where we cover different ways of specifying fitness), we could specify the fitness landscape and run simulations until fixation (with argument fixation to oncoSimulPop ---see more details in section \@ref(fixation) and \@ref(fixationG), again with this example). We would then examine the proportion of genotypes fixed under different scenarios. And we can extend this example by adding mutator genes:  fc38a875  {r hiddenochs, echo=FALSE} set.seed(2) RNGkind("L'Ecuyer-CMRG")  {r exochs}  f9f9ab96  ## For reproducibility set.seed(2) RNGkind("L'Ecuyer-CMRG")  fc38a875  ## Specify fitness effects.  bbd76086  ## Numeric values arbitrary, but set the intermediate genotype en ## route to ui as mildly deleterious so there is a valley.  fc38a875  ## As in Ochs and Desai, the ui and uv genotypes ## can never appear. u <- 0.2; i <- -0.02; vi <- 0.6; ui <- uv <- -Inf od <- allFitnessEffects( epistasis = c("u" = u, "u:i" = ui, "u:v" = uv, "i" = i, "v:-i" = -Inf, "v:i" = vi))  bbd76086  ## For the sake of extending this example, also turn i into a ## mutator gene  fc38a875  odm <- allMutatorEffects(noIntGenes = c("i" = 50)) ## How do mutation and fitness look like for each genotype? evalAllGenotypesFitAndMut(od, odm, addwt = TRUE)  Ochs and Desai explicitly say "Each simulated population was evolved until either the uphill genotype or valley-crossing genotype fixed." So we will use fixation. {r exochsb} ## Set a small initSize, as o.w. unlikely to pass the valley initS <- 10 ## The number of replicates is tiny, 10, for the sake of speed ## of creation of the vignette od_sim <- oncoSimulPop(10, od, muEF = odm, fixation = c("u", "i, v"), initSize = initS, model = "McFL",  f9f9ab96  mu = 1e-4, detectionDrivers = NA, finalTime = NA,  fc38a875  detectionSize = NA, detectionProb = NA, onlyCancer = TRUE, mc.cores = 2, ## adapt to your hardware seed = NULL) ## for reproducibility ## What is the frequency of each final genotype? sampledGenotypes(samplePop(od_sim))   f9f9ab96  {r hidden-rng-exochs33, echo = FALSE}  fc38a875  set.seed(NULL)   74449e3b  ### Predictability of evolution in complex fitness landscapes {#ex-predict}  fc38a875   74449e3b  Focusing now on predictability in more general fitness landscapes, we would run simulations under random fitness landscapes with varied ruggedness, and would then examine the evolutionary predictability of the trajectories with measures such as "Lines of Descent" and "Path of the Maximum" [@szendro_predictability_2013] and the  f9f9ab96  diversity of the sampled genotypes under different sampling regimes (see details in section \@ref(evolpredszend)).  fc38a875  {r hiddenrng0szen, echo=FALSE} set.seed(7) RNGkind("L'Ecuyer-CMRG")  {r exszendro}  f9f9ab96  ## For reproducibility set.seed(7) RNGkind("L'Ecuyer-CMRG")  fc38a875  ## Repeat the following loop for different combinations of whatever ## interests you, such as number of genes, or distribution of the ## c and sd (which affect how rugged the landscape is), or  f9f9ab96  ## reference genotype, or evolutionary model, or stopping criterion, ## or sampling procedure, or ...  fc38a875   f9f9ab96  ## Generate a random fitness landscape, from the Rough Mount ## Fuji model, with g genes, and c ("slope" constant) and ## reference chosen randomly (reference is random by default and ## thus not specified below). Require a minimal number of ## accessible genotypes  bbd76086   fc38a875  g <- 6 c <- runif(1, 1/5, 5) rl <- rfitness(g, c = c, min_accessible_genotypes = g)  bbd76086  ## Plot it if you want; commented here as it takes long for a ## vignette  fc38a875  ## plot(rl) ## Obtain landscape measures from Magellan. Export to Magellan to_Magellan(rl, file = "rl1.txt")  bbd76086   f9f9ab96  ## (Getting the statistics from Magellan requires either ## calling the web app (http://wwwabi.snv.jussieu.fr/public/Magellan/) ## or asking Magellan's authors for the software. This is of course ## beyond the scope of the example and the package.)  fc38a875   f9f9ab96  ## Simulate evolution in that landscape many times (here just 10)  fc38a875  simulrl <- oncoSimulPop(10, allFitnessEffects(genotFitness = rl), keepPhylog = TRUE, keepEvery = 1, initSize = 4000, seed = NULL, ## for reproducibility mc.cores = 2) ## adapt to your hardware  f9f9ab96  ## Obtain measures of evolutionary predictability  fc38a875  diversityLOD(LOD(simulrl)) diversityPOM(POM(simulrl)) sampledGenotypes(samplePop(simulrl, typeSample = "whole"))  {r hidden-rng-exszend, echo = FALSE} set.seed(NULL)   74449e3b   f9f9ab96  ### Mutator and antimutator genes {#exmutantimut}  74449e3b  The effects of mutator and antimutator genes have been examined both  f9f9ab96  in cancer genetics [@nowak_evolutionary_2006;  74449e3b  @tomlinson_mutation_1996] and in evolutionary genetics [@gerrish_complete_2007], and are related to wider issues such as Muller's ratchet and the evolution of sex. There are, thus, a large range of questions related to mutator and antimutator genes. One question addressed in @tomlinson_mutation_1996 concerns under what circumstances mutator genes are likely to play a role in cancer progression. For instance, @tomlinson_mutation_1996 find that an increased mutation rate is more likely to matter if the number of required mutations in driver genes needed to reach cancer is large and if the mutator effect is large.  f9f9ab96  We might want to ask, then, how long it takes before to reach cancer under different scenarios. Time to reach cancer is stored in the component FinalTime of the output. We would specify different numbers and effects of mutator genes (argument muEF). We would also change the criteria for  74449e3b  reaching cancer and in our case we can easily do that by specifying  f9f9ab96  different numbers in detectionDrivers. Of course, we would also want to examine the effects of varying numbers of mutators, drivers, and possibly fitness consequences of mutators. Below we assume mutators are neutral and we assume there are no additional genes with deleterious mutations, but this need not be so, of course [see also @tomlinson_mutation_1996; @gerrish_complete_2007; @McFarland2014].  74449e3b  Let us run an example. For the sake of simplicity, we assume no epistatic interactions. {r ex-tomlin1} sd <- 0.1 ## fitness effect of drivers sm <- 0 ## fitness effect of mutator nd <- 20 ## number of drivers nm <- 5 ## number of mutators mut <- 10 ## mutator effect fitnessGenesVector <- c(rep(sd, nd), rep(sm, nm)) names(fitnessGenesVector) <- 1:(nd + nm) mutatorGenesVector <- rep(mut, nm) names(mutatorGenesVector) <- (nd + 1):(nd + nm) ft <- allFitnessEffects(noIntGenes = fitnessGenesVector, drvNames = 1:nd) mt <- allMutatorEffects(noIntGenes = mutatorGenesVector)  Now, simulate using the fitness and mutator specification. We fix the number of drivers to cancer, and we stop when those numbers of drivers are reached. Since we only care about the time it takes to reach cancer, not the actual trajectories, we set keepEvery = NA: {r hiddentom, echo=FALSE} set.seed(2) RNGkind("L'Ecuyer-CMRG")  {r ex-tomlin2}  f9f9ab96  ## For reproducibility set.seed(2) RNGkind("L'Ecuyer-CMRG")  74449e3b  ddr <- 4 st <- oncoSimulPop(4, ft, muEF = mt, detectionDrivers = ddr, finalTime = NA, detectionSize = NA, detectionProb = NA, onlyCancer = TRUE, keepEvery = NA, mc.cores = 2, ## adapt to your hardware seed = NULL) ## for reproducibility ## How long did it take to reach cancer? unlist(lapply(st, function(x) x$FinalTime))  {r hidden-rng-tom, echo = FALSE} set.seed(NULL)  (Incidentally, notice that it is easy to get OncoSimulR to throw an  f9f9ab96  exception if you accidentally specify a huge mutation rate when all mutator genes are mutated: see section \@ref(tomlinexcept).)  74449e3b  ### Epistatic interactions between drivers and passengers in cancer and the consequences of order effects {#exbauer} #### Epistatic interactions between drivers and passengers  fc38a875  @Bauer2014 have examined the effects of epistatic relationships between drivers and passengers in cancer initiation. We could use their model as a starting point, and examine how likely cancer is to develop under different variations of their model and different evolutionary scenarios (e.g., initial sample size, mutation rates, evolutionary model, etc). There are several ways to specify their model, as we discuss in section \@ref(bauer). We will use one based on DAGs here: {r exusagebau} K <- 4 sp <- 1e-5 sdp <- 0.015 sdplus <- 0.05 sdminus <- 0.1 cnt <- (1 + sdplus)/(1 + sdminus) prod_cnt <- cnt - 1 bauer <- data.frame(parent = c("Root", rep("D", K)), child = c("D", paste0("s", 1:K)), s = c(prod_cnt, rep(sdp, K)), sh = c(0, rep(sp, K)), typeDep = "MN") fbauer <- allFitnessEffects(bauer) (b1 <- evalAllGenotypes(fbauer, order = FALSE, addwt = TRUE)) ## How does the fitness landscape look like? plot(b1, use_ggrepel = TRUE) ## avoid overlapping labels   f9f9ab96  Now run simulations and examine how frequently the runs end up with population sizes larger than a pre-specified threshold; for instance, below we look at increasing population size 4x in the default maximum number of 2281 time periods (for real, you would of course increase the number of total populations, the range of initial population sizes, model, mutation rate, required population size or number of drivers, etc):  fc38a875  {r hiddenbau, echo=FALSE} set.seed(2) RNGkind("L'Ecuyer-CMRG")  {r exusagebau2}  f9f9ab96  ## For reproducibility set.seed(2) RNGkind("L'Ecuyer-CMRG")  fc38a875  totalpops <- 5  f9f9ab96  initSize <- 100  fc38a875  sb1 <- oncoSimulPop(totalpops, fbauer, model = "Exp",  f9f9ab96  initSize = initSize,  fc38a875  onlyCancer = FALSE, mc.cores = 2, ## adapt to your hardware seed = NULL) ## for reproducibility  f9f9ab96  ## What proportion of the simulations reach 4x initSize? sum(summary(sb1)[, "TotalPopSize"] > (4 * initSize))/totalpops  fc38a875   {r hidden-rng-exbau, echo = FALSE} set.seed(NULL)   74449e3b   f9f9ab96  Alternatively, to examine how long it takes to reach cancer for a pre-specified size, you could look at the value of FinalTime as we did above (section \@ref(exmutantimut)) after running simulations with onlyCancer = TRUE and detectionSize set to some reasonable value: {r hiddenbau22, echo=FALSE} set.seed(2) RNGkind("L'Ecuyer-CMRG")  {r hhhhbbbb22} totalpops <- 5 initSize <- 100 sb2 <- oncoSimulPop(totalpops, fbauer, model = "Exp", initSize = initSize, onlyCancer = TRUE, detectionSize = 10 * initSize, mc.cores = 2, ## adapt to your hardware seed = NULL) ## for reproducibility ## How long did it take to reach cancer? unlist(lapply(sb2, function(x) x$FinalTime))  {r hidden-rng-exbau22, echo = FALSE} set.seed(NULL)   74449e3b  #### Consequences of order effects for cancer initiation {#exorder1intro}  fc38a875  Instead of focusing on different models for epistatic interactions, you might want to examine the consequences of order effects [@Ortmann2015]. You would proceed as above, but using models that differ by, say, the presence or absence of order effects. Details on their specification are provided in section \@ref(oe). Here is one particular model (you would, of course, want to compare this to models without order effects or with other magnitudes and types of order effects):  6610ac26   fc38a875  {r oex1intro} ## Order effects involving three genes.  f9f9ab96  ## Genotype "D, M" has different fitness effects  fc38a875  ## depending on whether M or D mutated first.  f9f9ab96  ## Ditto for genotype "F, D, M". ## Meaning of specification: X > Y means ## that X is mutated before Y.  fc38a875  o3 <- allFitnessEffects(orderEffects = c( "F > D > M" = -0.3, "D > F > M" = 0.4, "D > M > F" = 0.2, "D > M" = 0.1, "M > D" = 0.5)) ## With the above specification, let's double check ## the fitness of the possible genotypes (oeag <- evalAllGenotypes(o3, addwt = TRUE, order = TRUE))  Now, run simulations and examine how frequently the runs do not end up in extinction. As above, for real, you would of course increase the number of total populations, the range of initial population sizes, mutation rate, etc: {r hiddoef, echo=FALSE} set.seed(2) RNGkind("L'Ecuyer-CMRG")  {r exusageoe2}  f9f9ab96  ## For reproducibility set.seed(2) RNGkind("L'Ecuyer-CMRG")  fc38a875  totalpops <- 5 soe1 <- oncoSimulPop(totalpops, o3, model = "Exp", initSize = 500, onlyCancer = FALSE, mc.cores = 2, ## adapt to your hardware seed = NULL) ## for reproducibility ## What proportion of the simulations do not end up extinct? sum(summary(soe1)[, "TotalPopSize"] > 0)/totalpops  {r hidden-rng-exoef, echo = FALSE} set.seed(NULL)   f9f9ab96  As we just said, alternatively, to examine how long it takes to reach cancer you could run simulations with onlyCancer = TRUE and look at the value of FinalTime as we did above (section \@ref(exmutantimut)).  fc38a875  ## Trade-offs and what is OncoSimulR not well suited for {#whatnotfor} OncoSimulR is designed for complex fitness specifications and selection scenarios and uses forward-time simulations; the types of questions where OncoSimulR can be of help are discussed in sections \@ref(generalwhatfor) and \@ref(whatfor) and running time and space consumption of OncoSimulR are addressed in section \@ref(timings). You should be aware that **coalescent simulations**, sometimes also called backward-time simulations, are much more efficient for simulating neutral data as well as some special selection  f9f9ab96  scenarios [@Yuan2012; @Carvajal-Rodriguez2010; @Hoban2011]. In addition, since OncoSimulR allows you to specify fitness with arbitrary epistatic and order effects, as well as mutator effects, you need to learn the syntax of how to specify those effects and you might be paying a performance penalty if your scenario does not require this complexity. For instance, in the model of @Beerenwinkel2007b, the fitness of a genotype depends only on the total number of drivers mutated, but not on which drivers are mutated (and, thus, not on the epistatic interactions nor the order of accumulation of the drivers). This means that the syntax for specifying that model could probably be a lot simpler (e.g., specify$s$per driver). But it also means that code written for just that case could probably run much faster. First, because fitness evaluation is easier. Second, and possibly much more important, because what we need to keep track of leads to much simpler and economic structures: we do not need to keep track of clones (where two cells are regarded as different clones if they differ anywhere in their genotype), but only of clone types or clone classes as defined by the number of mutated drivers, and keeping track of clones can be expensive ---see sections \@ref(timings) and \@ref(trackindivs).  fc38a875  So for those cases where you do not need the full flexibility of OncoSimulR, special purpose software might be easier to use and faster to run. Of course, for some types of problems this special purpose software might not be available, though.  f9f9ab96  ## Steps for using OncoSimulR {#steps}  6610ac26  Using this package will often involve the following steps:  f9f9ab96  1. Specify fitness effects: sections \@ref(specfit) and \@ref(litex).  6610ac26   f9f9ab96  2. Simulate cancer progression: section \@ref(simul). You can simulate for a single individual or subject or for a set of subjects. You will need to:  6610ac26  - Decide on a model. This basically amounts to choosing a model with  fc38a875  exponential growth ("Exp" or "Bozic") or a model with carrying capacity ("McFL"). If exponential growth, you can choose whether the the effects of mutations operate on the death rate ("Bozic") or the birth rate ("Exp")[^1].  6610ac26  - Specify other parameters of the simulation. In particular, decide when to stop the simulation, mutation rates, etc. Of course, at least for initial playing around, you can use the defaults.  fc38a875  3. Sample from the simulated data and do something with those simulated data (e.g., fit an OT model to them, examine diversity or time until cancer, etc). Most of what you do with the data, however, is outside the scope of this package and this vignette.  6610ac26   fc38a875  [^1]:It is of course possible to do this with the carrying capacity (or gompertz-like) models, but there probably is little reason to do it. @McFarland2013 discuss this has little effect on their results, for example. In addition, decreasing the death rate will more easily lead to numerical problems as shown in section \@ref(ex-0-death).  6610ac26   fc38a875  Before anything else, let us load the package in case it was not yet loaded. We also explicitly load r Biocpkg("graph") and r CRANpkg("igraph") for the vignette to work (you do not need that for your usual interactive work). And I set the default color for vertices in igraph.  6610ac26  {r, results="hide",message=FALSE, echo=TRUE, include=TRUE} library(OncoSimulR) library(graph) library(igraph) igraph_options(vertex.color = "SkyBlue2")  {r, echo=FALSE, results='hide'} options(width = 68)  To be explicit, what version are we running? {r} packageVersion("OncoSimulR")   fc38a875  ## Two quick examples of fitness specifications {#quickexample}  6610ac26   fc38a875  Following \@ref(steps) we will run two very minimal examples. First a model with a few genes and **epistasis**:  6610ac26  {r, fig.width=6.5, fig.height=10} ## 1. Fitness effects: here we specify an ## epistatic model with modules. sa <- 0.1 sb <- -0.2 sab <- 0.25 sac <- -0.1 sbc <- 0.25 sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb, "A : -B" = sa, "A : C" = sac, "A:B" = sab, "-A:B:C" = sbc), geneToModule = c( "A" = "a1, a2", "B" = "b", "C" = "c"), drvNames = c("a1", "a2", "b", "c")) evalAllGenotypes(sv2, addwt = TRUE) ## 2. Simulate the data. Here we use the "McFL" model and set  f9f9ab96  ## explicitly parameters for mutation rate, initial size, size  6610ac26  ## of the population that will end the simulations, etc RNGkind("Mersenne-Twister") set.seed(983) ep1 <- oncoSimulIndiv(sv2, model = "McFL", mu = 5e-6,  fc38a875  sampleEvery = 0.025,  6610ac26  keepEvery = 0.5, initSize = 2000, finalTime = 3000, onlyCancer = FALSE)   f9f9ab96  {r iep1x1,fig.width=6.5, fig.height=4.5, fig.cap="Plot of drivers of an epistasis simulation."}  6610ac26  ## 3. We will not analyze those data any further. We will only plot ## them. For the sake of a small plot, we thin the data. plot(ep1, show = "drivers", xlim = c(0, 1500), thinData = TRUE, thinData.keep = 0.5)   34c1b54f   6610ac26  As a second example, we will use a model where we specify  fc38a875  **restrictions in the order of accumulation of mutations using a DAG** with the  6610ac26  pancreatic cancer poset in @Gerstung2011 (see more details in section \@ref(pancreas)):  f9f9ab96  {r fepancr1, fig.width=5}  6610ac26  ## 1. Fitness effects: pancr <- allFitnessEffects( data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", "TP53", "TP53", "MLL3"), child = c("KRAS","SMAD4", "CDNK2A", "TP53", "MLL3", rep("PXDN", 3), rep("TGFBR2", 2)), s = 0.1, sh = -0.9, typeDep = "MN"), drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", "MLL3", "TGFBR2", "PXDN"))  f9f9ab96    6610ac26   f9f9ab96  {r figfpancr1, fig.width=5, fig.cap="Plot of DAG corresponding to fitnessEffects object."}  6610ac26  ## Plot the DAG of the fitnessEffects object plot(pancr)   f9f9ab96   34c1b54f  {r}  fc38a875  ## 2. Simulate from it. We change several possible options.  6610ac26  set.seed(4) ## Fix the seed, so we can repeat it ep2 <- oncoSimulIndiv(pancr, model = "McFL", mu = 1e-6, sampleEvery = 0.02, keepEvery = 1, initSize = 1000, finalTime = 10000, onlyCancer = FALSE)   fc38a875  {r iep2x2, fig.width=6.5, fig.height=5, fig.cap= "Plot of genotypes of a simulation from a DAG."}  6610ac26  ## 3. What genotypes and drivers we get? And play with limits  34c1b54f  ## to show only parts of the data. We also aggressively thin ## the data.  6610ac26  par(cex = 0.7) plot(ep2, show = "genotypes", xlim = c(1000, 8000), ylim = c(0, 2400),  34c1b54f  thinData = TRUE, thinData.keep = 0.03)  6610ac26    fc38a875  The rest of this vignette explores all of those functions and arguments in much more detail. ## Citing OncoSimulR and other documentation {#citing} In R, you can do {r} citation("OncoSimulR")   b137a4a3  which will tell you how to cite the package. Please, do cite the Bionformatics paper if you use the package in publications.  fc38a875   b137a4a3  This is the URL for the Bioinformatics paper: [https://doi.org/10.1093/bioinformatics/btx077](https://doi.org/10.1093/bioinformatics/btx077) (there is also an early preprint at [bioRxiv](http://biorxiv.org/content/early/2016/08/14/069500), but it should now point to the Bioinformatics paper).  fc38a875  ### HTML and PDF versions of the vignette {#pdfvignette} A PDF version of this vignette is available from . And an HTML version from . These  f9f9ab96  files should correspond to the most recent, GitHub version, of the package  fc38a875  (i.e., they might include changes not yet available from the BioConductor package).  a9ecbe56  ## Testing, code coverage, and other examples {#codecover} OncoSimulR includes more than 2000 tests that are run at every check cycle. These tests provide a code coverage of more than 90% including both the C++ and R code. Another set of over 500 long-running (several hours) tests can be run on demand (see directory '/tests/manual'). In addition to serving as test cases, some of that code also provides further examples of usage.  fc38a875  ## Versions {#versions} In this vignette and the documentation I often refer to version 1 (v.1) and version 2 of OncoSimulR. Version 1 is the version available up to, and including, BioConductor v. 3.1. Version 2 of OncoSimulR is available starting from BioConductor 3.2 (and, of course, available too from development versions of BioC). So, if you are using the current stable or development version of BioConductor, or you grab the sources from GitHub () you are using what we call version 2. Please note that **the functionality of version 1 will soon be removed.**  bbd76086  \clearpage  6610ac26   fc38a875  # Running time and space consumption of OncoSimulR {#timings}  6610ac26   74449e3b  Time to complete the simulations and size of returned objects (space  fc38a875  consumption) depend on several, interacting factors. The usual rule of "experiment before launching a large number of simulations" applies, but here we will walk through several cases to get a feeling for the major factors that affect speed and size. Many of the comments on this section need to use ideas discussed in other  f9f9ab96  places of this document; if you read this section first, you might  fc38a875  want to come back after reading the relevant parts.  6610ac26   fc38a875  Speed will depend on:  6610ac26  * Your hardware, of course.  f9f9ab96  * The evolutionary model.  6610ac26  * The granularity of how often you keep data (keepEvery argument). Note that the default, which is to keep as often as you  f9f9ab96  sample (so that we preserve all history) can lead to slow execution  6610ac26  times. * The mutation rate, because higher mutation rates lead to more  fc38a875  clones, and more clones means we need to iterate over, well, more clones,  6610ac26  and keep larger data structures.  fc38a875  * The fitness specification: more complex fitness specifications tend to be slightly slower but specially different fitness specifications can have radically different effects on the evolutionary trajectories, accessibility of fast growing genotypes and, generally, the evolutionary dynamics. * The stopping conditions (detectionProb, detectionDrivers, detectionSize arguments) and whether or not simulations are run until cancer is reached (onlyCancer argument).  f9f9ab96  * Most of the above factors can interact in complex ways.  6610ac26   fc38a875  Size of returned objects will depend on:  6610ac26   fc38a875  * Any factor that affects the number of clones tracked/returned, in particular: initial sizes and stopping conditions, mutation rate, and how often you keep data (the keepEvery argument can make a huge difference here). * Whether or not you keep the complete genealogy of clones (this affects slightly the size of returned object, not speed).  6610ac26   fc38a875  In the sections that follow, we go over several cases to understand some of the main settings that affect running time (or execution time) and space consumption (the size of returned objects). It should be understood, however, that many of the examples shown below do not represent typical use cases of OncoSimulR and are used only to identify what and how affects running time and space consumption. As we will see in most examples in this vignette, typical use cases of OncoSimulR involve hundreds to thousands of genes on population sizes up to$10^5$to$10^7$.  6610ac26   fc38a875  Note that most of the code in this section is not executed during the building of the vignette to keep vignette build time reasonable and prevent using huge amounts of RAM. All of the code, ready to be sourced and run, is available from the 'inst/miscell' directory (and the summary output from some of the benchmarks is available from the 'miscell-files/vignette_bench_Rout' directory of the main OncoSimul repository at https://github.com/rdiaz02/OncoSimul).  bbd76086  {r colnames_benchmarks, echo = FALSE, eval = TRUE} data(benchmark_1) data(benchmark_1_0.05) data(benchmark_2) data(benchmark_3) colnames(benchmark_1)[ match(c( "time_per_simul", "size_mb_per_simul", "NumClones.Median", "NumIter.Median", "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", "TotalPopSize.Max.", "keepEvery", "Attempts.Median", "Attempts.Mean", "Attempts.Max.", "PDBaseline", "n2", "onlyCancer"), colnames(benchmark_1) )] <- c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max.", "keepEvery", "Attempts until Cancer, median", "Attempts until Cancer, mean", "Attempts until Cancer, max.", "PDBaseline", "n2", "onlyCancer" ) colnames(benchmark_1_0.05)[ match(c("time_per_simul", "size_mb_per_simul", "NumClones.Median", "NumIter.Median", "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", "TotalPopSize.Max.", "keepEvery", "PDBaseline", "n2", "onlyCancer", "Attempts.Median"), colnames(benchmark_1_0.05))] <- c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max.", "keepEvery", "PDBaseline", "n2", "onlyCancer", "Attempts until Cancer, median" ) colnames(benchmark_2)[match(c("Model", "fitness", "time_per_simul", "size_mb_per_simul", "NumClones.Median", "NumIter.Median", "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", "TotalPopSize.Max."), colnames(benchmark_2))] <- c("Model", "Fitness", "Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max." ) colnames(benchmark_3)[match(c("Model", "fitness", "time_per_simul", "size_mb_per_simul", "NumClones.Median", "NumIter.Median", "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", "TotalPopSize.Max."), colnames(benchmark_3))] <- c("Model", "Fitness", "Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max." )   fc38a875  ## Exp and McFL with "detectionProb" and pancreas example {#bench1} To get familiar with some of they factors that affect time and size, we will use the fitness specification from section \@ref(quickexample), with the detectionProb stopping mechanism (see \@ref(detectprob)). We will use the two main growth models (exponential and McFarland). Each model will be run with two settings of keepEvery. With keepEvery = 1 (runs exp1 and mc1), population samples are stored at time intervals of 1 (even if most of the clones in those samples later become extinct). With keepEvery = NA (runs exp2 and mc2) no intermediate population samples are stored, so clones that become extinct at any sampling period are pruned and only the existing clones at the end of the simulation are returned (see details in \@ref(prune)). Will run r unique(benchmark_1$Numindiv) simulations. The results I show are for a laptop with an 8-core Intel Xeon E3-1505M CPU, running Debian GNU/Linux (the results from these benchmarks are available as data(benchmark_1)). {r timing1, eval=FALSE} ## Specify fitness pancr <- allFitnessEffects( data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", "TP53", "TP53", "MLL3"), child = c("KRAS","SMAD4", "CDNK2A", "TP53", "MLL3", rep("PXDN", 3), rep("TGFBR2", 2)), s = 0.1, sh = -0.9, typeDep = "MN"), drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", "MLL3", "TGFBR2", "PXDN")) Nindiv <- 100 ## Number of simulations run. ## Increase this number to decrease sampling variation ## keepEvery = 1 t_exp1 <- system.time( exp1 <- oncoSimulPop(Nindiv, pancr,  6610ac26  detectionProb = "default", detectionSize = NA, detectionDrivers = NA, finalTime = NA,  fc38a875  keepEvery = 1,  6610ac26  model = "Exp",  fc38a875  mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875  t_mc1 <- system.time( mc1 <- oncoSimulPop(Nindiv, pancr,  6610ac26  detectionProb = "default", detectionSize = NA, detectionDrivers = NA, finalTime = NA,  fc38a875  keepEvery = 1,  6610ac26  model = "McFL",  fc38a875  mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875  ## keepEvery = NA t_exp2 <- system.time( exp2 <- oncoSimulPop(Nindiv, pancr,  6610ac26  detectionProb = "default", detectionSize = NA, detectionDrivers = NA, finalTime = NA,  fc38a875  keepEvery = NA,  6610ac26  model = "Exp",  fc38a875  mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875  t_mc2 <- system.time( mc2 <- oncoSimulPop(Nindiv, pancr,  6610ac26  detectionProb = "default", detectionSize = NA, detectionDrivers = NA, finalTime = NA,  fc38a875  keepEvery = NA,  6610ac26  model = "McFL",  fc38a875  mc.cores = 1))["elapsed"]/Nindiv  6610ac26    f9f9ab96  We can obtain times, sizes of objects, and summaries of numbers  fc38a875  of clones, iterations, and final times doing, for instance:  6610ac26   {r, eval=FALSE}  fc38a875  cat("\n\n\n t_exp1 = ", t_exp1, "\n") object.size(exp1)/(Nindiv * 1024^2) cat("\n\n") summary(unlist(lapply(exp1, "[[", "NumClones"))) summary(unlist(lapply(exp1, "[[", "NumIter"))) summary(unlist(lapply(exp1, "[[", "FinalTime"))) summary(unlist(lapply(exp1, "[[", "TotalPopSize")))  6610ac26   The above runs yield the following:  bbd76086  \blandscape Table: (\#tab:bench1) Benchmarks of Exp and McFL models using the default detectionProb with two settings of keepEvery.  fc38a875  {r bench1, eval=TRUE, echo = FALSE}  bbd76086   a9ecbe56  panderOptions('table.split.table', 99999999) panderOptions('table.split.cells', 900) ## For HTML ## panderOptions('table.split.cells', 8) ## For PDF  bbd76086  set.alignment('right')  f9f9ab96  panderOptions('round', 2) panderOptions('big.mark', ',') panderOptions('digits', 2)  bbd76086  pander(benchmark_1[1:4, c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, max.", "keepEvery")],  f9f9ab96  justify = c('left', rep('right', 8)), ## o.w. hlines not right  bbd76086  ## caption = "\\label{tab:bench1}Benchmarks of Exp and McFL models using the default detectionProb with two settings of keepEvery." )  fc38a875    6610ac26   bbd76086  \elandscape \clearpage The above table shows that a naive comparison (looking simply at execution time) might conclude that the McFL model is much, much slower than the Exp model. But that is not the complete story: using the detectionProb stopping mechanism (see \@ref(detectprob)) will lead to stopping the simulations very quickly in the exponential model because as soon as a clone with fitness $>1$ appears it starts growing exponentially. In fact, we can see that the number of iterations and the final time are much smaller in the Exp than in the McFL model. We will elaborate on this point below (section \@ref(common1)), when we discuss the setting for checkSizePEvery (here left at its default value of 20): checking the exiting condition more often (smaller checkSizePEvery) would probably be justified here (notice also the very large final times) and would lead to a sharp decrease in number of iterations and, thus, running time.  fc38a875   f9f9ab96  This table also shows that the keepEvery = NA setting, which was in effect in simulations exp2 and mc2, can make a difference especially for the McFL models, as seen by the median number of clones and the size of the returned object. Models exp2 and mc2 do not store any intermediate population samples so clones that become extinct at any sampling period are pruned and only the existing clones at the end of the simulation are returned. In contrast, models exp1 and mc1 store population samples at time intervals of 1 (keepEvery = 1), even if many of those clones eventually become extinct. We will return to this issue below as execution time and object size depend strongly on the number of  fc38a875  clones tracked. We can run the exponential model again modifying the arguments of the detectionProb mechanism; in two of the models below (exp3 and exp4) no detection can take place unless populations are at least 100 times larger than the initial population size, and probability of detection is 0.1 with a population size 1,000 times larger than the initial one (PDBaseline = 5e4, n2 = 5e5). In the other two models (exp5 and exp6), no detection can take place unless populations are at least 1,000 times larger than the initial population size, and probability of detection is 0.1 with a population size 100,000 times larger than the initial one (PDBaseline =  f9f9ab96  5e5, n2 = 5e7)[^rva]. In runs exp3 and exp5 we set keepEvery = 1 and in  fc38a875  runs exp4 and exp6 we set keepEvery = NA.  f9f9ab96  [^rva]:Again, these are not necessarily reasonable or common settings. We are using them to understand what and how affects running time and space consumption.  fc38a875  {r timing2, eval = FALSE} t_exp3 <- system.time( exp3 <- oncoSimulPop(Nindiv, pancr, detectionProb = c(PDBaseline = 5e4, p2 = 0.1, n2 = 5e5, checkSizePEvery = 20), detectionSize = NA, detectionDrivers = NA, finalTime = NA, keepEvery = 1, model = "Exp", mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875  t_exp4 <- system.time( exp4 <- oncoSimulPop(Nindiv, pancr, detectionProb = c(PDBaseline = 5e4, p2 = 0.1, n2 = 5e5, checkSizePEvery = 20), detectionSize = NA, detectionDrivers = NA, finalTime = NA, keepEvery = NA, model = "Exp", mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875  t_exp5 <- system.time( exp5 <- oncoSimulPop(Nindiv, pancr, detectionProb = c(PDBaseline = 5e5, p2 = 0.1, n2 = 5e7), detectionSize = NA, detectionDrivers = NA, finalTime = NA, keepEvery = 1, model = "Exp", mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875  t_exp6 <- system.time( exp6 <- oncoSimulPop(Nindiv, pancr, detectionProb = c(PDBaseline = 5e5, p2 = 0.1, n2 = 5e7), detectionSize = NA, detectionDrivers = NA, finalTime = NA, keepEvery = NA, model = "Exp", mc.cores = 1))["elapsed"]/Nindiv  6610ac26   fc38a875    6610ac26   bbd76086  \blandscape  6610ac26   bbd76086  Table: (\#tab:bench1b) Benchmarks of Exp models modifying the default detectionProb with two settings of keepEvery.  fc38a875  {r bench1b, eval=TRUE, echo = FALSE}  a9ecbe56  panderOptions('table.split.table', 99999999) panderOptions('table.split.cells', 900) ## For HTML ## panderOptions('table.split.cells', 8) ## For PDF  f9f9ab96   bbd76086  set.alignment('right') panderOptions('round', 2) panderOptions('big.mark', ',') panderOptions('digits', 2) pander(benchmark_1[5:8, c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, max.", "keepEvery", "PDBaseline", "n2")],  f9f9ab96  justify = c('left', rep('right', 10)), ## o.w. hlines not right  bbd76086  ## round = c(rep(2, 3), rep(0, 7)), ## digits = c(rep(2, 3), rep(1, 7)), ## caption = "\\label{tab:bench1b}Benchmarks of Exp and McFL models modifying the default detectionProb with two settings of keepEvery." )  fc38a875    6610ac26   bbd76086  \elandscape \clearpage  6610ac26   fc38a875  As above, keepEvery = NA (in exp4 and exp6) leads to much smaller object sizes and slightly smaller numbers of clones and execution times. Changing the exiting conditions (by changing detectionProb arguments) leads to large increases in number of iterations (in this case by factors of about 15x to 25x) and a  f9f9ab96  corresponding increase in execution time as well as much larger population sizes (in some cases $>10^{10}$).  fc38a875  In some of the runs of exp5 and exp6 we get the (recoverable) exception message from the C++ code: Recoverable exception ti set to DBL_MIN. Rerunning, which is related to those simulations reaching total population sizes $>10^{10}$; we return to this below (section \@ref(popgtzx)). You might also wonder why total and median population sizes are so large in these two runs, given the exiting conditions. One of the reasons is that we are using the default checkSizePEvery = 20, so the interval between successive checks of the exiting condition is large; this is discussed at greater length in section \@ref(common1). All the runs above used the default value onlyCancer = TRUE. This means that simulations will be repeated until the exiting conditions are reached (see details in section \@ref(endsimul)) and, therefore, any simulation that ends up in extinction will be repeated. This setting can thus have a large effect on the exponential models, because when the initial population size is not very large and we start from the wildtype, it is  f9f9ab96  not uncommon for simulations to become extinct (when birth and death  fc38a875  rates are equal and the population size is small, it is easy to reach extinction before a mutation in a gene that increases fitness occurs). But  f9f9ab96  this is rarely the case in the McFarland model (unless we use really tiny  fc38a875  initial population sizes) because of the dependency of death rate on total population size (see section \@ref(mcfl)).  bbd76086  The number of attempts until cancer was reached in the above  f9f9ab96  models is shown in Table \@ref(tab:bench1c) (the values can be obtained from  bbd76086  any of the above runs doing, for instance, median(unlist(lapply(exp1, function(x) x$other$attemptsUsed))) ):  fc38a875   bbd76086  Table: (\#tab:bench1c) Number of attempts until cancer.  fc38a875  {r bench1c, eval=TRUE, echo = FALSE}  a9ecbe56  panderOptions('table.split.table', 99999999) panderOptions('table.split.cells', 900) ## For HTML ## panderOptions('table.split.cells', 12) ## For PDF  bbd76086  set.alignment('right') panderOptions('round', 2) panderOptions('big.mark', ',') panderOptions('digits', 2) pander(benchmark_1[1:8, c( "Attempts until Cancer, median", "Attempts until Cancer, mean", "Attempts until Cancer, max.", "PDBaseline", "n2")],  f9f9ab96  justify = c('left', rep('right', 5)), ## o.w. hlines not right  bbd76086  ## round = c(rep(2, 3), rep(0, 7)), ## digits = c(rep(2, 3), rep(1, 7)),  f9f9ab96  ## caption = "\\label{tab:bench1c}Number of attempts until cancer."  bbd76086  ) ## ## data(benchmark_1) ## knitr::kable(benchmark_1[1:8, c("Attempts.Median", ## "PDBaseline", "n2"), drop = FALSE], ## booktabs = TRUE, ## row.names = TRUE, ## col.names = c("Attempts until cancer", "PDBaseline", "n2"), ## caption = "Median number of attempts until cancer.", ## align = "r")  fc38a875    6610ac26   bbd76086  The McFL models finish in a single attempt. The exponential model simulations where we can exit with small population sizes (exp1, exp2) need many fewer attempts to reach cancer than those where large population sizes are required (exp3 to exp6). There is no relevant different  f9f9ab96  among those last four, which is what we would expect: a population that has  bbd76086  already reached a size of 50,000 cells from an initial population size of 500 is obviously a growing population where there is at least one mutant with positive fitness; thus, it unlikely to go extinct and therefore having to grow up to at least 500,000 will not significantly increase the risk of extinction.  6610ac26   fc38a875  We will now rerun all of the above models with argument onlyCancer =  bbd76086  FALSE. The results are shown in Table \@ref(tab:timing3) (note that the  f9f9ab96  differences between this table and Table \@ref(tab:bench1) for the McFL  bbd76086  models are due only to sampling variation). \bslandscape  fc38a875   bbd76086  Table: (\#tab:timing3) Benchmarks of models in Table \@ref(tab:bench1) and \@ref(tab:bench1b) when run with onlyCancer = FALSE  fc38a875  {r bench1d, eval=TRUE, echo = FALSE}  a9ecbe56  panderOptions('table.split.table', 99999999) panderOptions('table.split.cells', 900) ## For HTML ## panderOptions('table.split.cells', 8) ## For PDF panderOptions('table.split.cells', 15) ## does not fit otherwise  bbd76086  set.alignment('right') panderOptions('round', 3) pander(benchmark_1[9:16, c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median",  fc38a875  "Total Population Size, mean",  bbd76086  "Total Population Size, max.", "keepEvery", "PDBaseline", "n2")],  f9f9ab96  justify = c('left', rep('right', 11)), ## o.w. hlines not right  bbd76086  ## caption = "\\label{tab:timing3} Benchmarks of models in Table \\@ref(tab:bench1) and \\@ref(tab:bench1b) when run with onlyCancer = FALSE." )  fc38a875    6610ac26   bbd76086  \eslandscape \clearpage  6610ac26   fc38a875  Now most simulations under the exponential model end up in extinction, as seen by the median population size of 0 (but not all, as the mean and max. population size are clearly away from zero). Consequently, simulations under the exponential model are now faster (and the size of  74449e3b  the average returned object is smaller). Of course, whether one should run  fc38a875  simulations with onlyCancer = TRUE or onlyCancer = FALSE will depend on the question being asked (see, for example, section \@ref(exbauer) for a question where we will naturally want to use onlyCancer = FALSE).  bbd76086  To make it easier to compare results with those of the next section, Table \@ref(tab:allr1bck) shows all the runs so far. \bslandscape  f9f9ab96  Table: (\#tab:allr1bck) Benchmarks of all models in Tables \@ref(tab:bench1), \@ref(tab:bench1b), and \@ref(tab:timing3).  bbd76086  {r bench1dx0, eval=TRUE, echo = FALSE}  a9ecbe56  panderOptions('table.split.table', 99999999) ## panderOptions('table.split.cells', 900) ## For HTML panderOptions('table.split.cells', 19)  bbd76086  set.alignment('right') panderOptions('round', 3) pander(benchmark_1[ , c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max.", "keepEvery", "PDBaseline", "n2", "onlyCancer")],  f9f9ab96  justify = c('left', rep('right', 12)), ## o.w. hlines not right ## caption = "\\label{tab:allr1bck}Benchmarks of all models in Tables \\@ref(tab:bench1), \\@ref(tab:bench1b), and \\@ref(tab:timing3)."  bbd76086  )  \eslandscape \clearpage  fc38a875  ### Changing fitness: $s=0.1$ and $s=0.05$ {#bench1xf}  f9f9ab96  In the above fitness specification the fitness effect of each gene (when its restrictions are satisfied) is $s = 0.1$ (see section \@ref(numfit) for details). Here we rerun all the above benchmarks using $s= 0.05$ (the results from these benchmarks are available as data(benchmark_1_0.05)) and results are shown below in Table \@ref(tab:timing3xf).  fc38a875   bbd76086  \bslandscape  fc38a875   bbd76086  Table: (\#tab:timing3xf) Benchmarks of all models in Table \@ref(tab:allr1bck) using $s=0.05$ (instead of $s=0.1$).  fc38a875  {r bench1dx, eval=TRUE, echo = FALSE}  bbd76086  ## data(benchmark_1_0.05) ## knitr::kable(benchmark_1_0.05[, c("time_per_simul", ## "size_mb_per_simul", "NumClones.Median", "NumIter.Median", ## "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", ## "TotalPopSize.Max.", ## "keepEvery", ## "PDBaseline", "n2", "onlyCancer")], ## booktabs = TRUE, ## col.names = c("Elapsed Time, average per simulation (s)", ## "Object Size, average per simulation (MB)", ## "Number of Clones, median", ## "Number of Iterations, median", ## "Final Time, median", ## "Total Population Size, median", ## "Total Population Size, mean", ## "Total Population Size, max.", ## "keepEvery", ## "PDBaseline", "n2", "onlyCancer" ## ), ## ## caption = "Benchmarks of models in Table \@ref(tab:bench1) and ## ## \@ref(tab:bench1b) when run with onlyCancer = FALSE", ## align = "c")  a9ecbe56  panderOptions('table.split.table', 99999999) ## panderOptions('table.split.cells', 900) ## For HTML panderOptions('table.split.cells', 19)  bbd76086  set.alignment('right') panderOptions('round', 3) pander(benchmark_1_0.05[ , c("Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max.", "keepEvery", "PDBaseline", "n2", "onlyCancer")],  f9f9ab96  justify = c('left', rep('right', 12)), ## o.w. hlines not right  bbd76086  ## caption = "\\label{tab:timing3xf}Benchmarks of all models in Table \\@ref(tab:allr1bck) using $s=0.05$ (instead of $s=0.1$)." )  fc38a875    6610ac26   bbd76086  \eslandscape \clearpage  fc38a875  As expected, having a smaller $s$ leads to slower processes in most cases, since it takes longer to reach the exiting conditions sooner. Particularly noticeable are the runs for the McFL models (notice the increases in  bbd76086  population size and number of iterations ---see also below).  6610ac26   fc38a875  That is not the case, however, for exp5 and exp6 (and exp5_noc and exp6_noc). When running with $s=0.05$ the simulations exit at a later time (see column "Final Time") but they exit with smaller population sizes. Here we have an interaction between sampling frequency, speed of growth of the population, mutation events and number of clones. In populations that grow much faster mutation events will happen more often (which will trigger further iterations of the algorithm); in addition, more new clones will be created, even if they only exist for short times and become extinct by the following sampling period (so they are not reflected in the pops.by.time matrix). These differences are proportionally larger the larger the rate of growth of the population. Thus, they are larger between, say, the exp5 at $s=0.1$ and $s=0.05$ than between the exp4 at the two different $s$: the exp5 exit conditions can only be satisfied at much larger population sizes so at populations sizes when growth is much faster (recall we are dealing with exponential growth).  6610ac26   fc38a875  Recall also that with the default settings in detectionProb, we assess the exiting condition every 20 time periods (argument checkSizePEvery); this means that for fast growing populations, the increase in population size between successive checks of the exit conditions will be much larger (this phenomenon is also discussed in section \@ref(common1)).  6610ac26   08de3ba7   fc38a875  Thus, what is happening in the exp5 and exp6 with $s=0.1$ is that close to the time the exit conditions could be satisfied, they are growing very fast, accumulating mutants, and incurring in additional iterations. They exit sooner in terms of time periods, but they do much more work before arriving there.  6610ac26   bbd76086  The setting of checkSizePEvery is also having a huge effect on the McFL model simulations (the number of iterations is $>10^6$). Even more than in the previous section, checking the exiting condition more often (smaller checkSizePEvery) would probably be justified here (notice also the very large final times) and would lead to a sharp decrease in number of iterations and, thus, running time.  f9f9ab96  The moral here is that in complex simulations like this (and most simulations are complex), the effects of some parameters ($s$ in this case) might look counter-intuitive at first. Thus the need to "experiment before launching a large number of simulations".  6610ac26   fc38a875  ## Several "common use cases" runs {#benchusual} Let us now execute some simulations under more usual conditions. We will use seven different fitness specifications: the pancreas example, two random fitness landscapes, and four sets of independent genes (200 to 4000 genes) with fitness effects randomly drawn from exponential distributions: {r fitusualb, echo = TRUE, eval = FALSE} pancr <- allFitnessEffects( data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", "TP53", "TP53", "MLL3"), child = c("KRAS","SMAD4", "CDNK2A", "TP53", "MLL3", rep("PXDN", 3), rep("TGFBR2", 2)), s = 0.1, sh = -0.9, typeDep = "MN"), drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", "MLL3", "TGFBR2", "PXDN"))  bbd76086  ## Random fitness landscape with 6 genes ## At least 50 accessible genotypes  fc38a875  rfl6 <- rfitness(6, min_accessible_genotypes = 50)  bbd76086  attributes(rfl6)$accessible_genotypes ## How many accessible  fc38a875  rf6 <- allFitnessEffects(genotFitness = rfl6) ## Random fitness landscape with 12 genes  bbd76086  ## At least 200 accessible genotypes  fc38a875  rfl12 <- rfitness(12, min_accessible_genotypes = 200)  bbd76086  attributes(rfl12)$accessible_genotypes ## How many accessible  fc38a875  rf12 <- allFitnessEffects(genotFitness = rfl12) ## Independent genes; positive fitness from exponential distribution  f9f9ab96  ## with mean around 0.1, and negative from exponential with mean ## around -0.02. Half of genes positive fitness effects, half ## negative.  bbd76086   f9f9ab96  ng <- 200 re_200 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), -rexp(ng/2, 50)))  fc38a875  ng <- 500  bbd76086  re_500 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), -rexp(ng/2, 50)))  fc38a875  ng <- 2000  bbd76086  re_2000 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), -rexp(ng/2, 50)))  fc38a875  ng <- 4000  bbd76086  re_4000 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), -rexp(ng/2, 50)))  fc38a875   ### Common use cases, set 1. {#common1} We will use the Exp and the McFL models, run with different parameters. The script is provided as 'benchmark_2.R', under '/inst/miscell', with output in the 'miscell-files/vignette_bench_Rout' directory of the main OncoSimul repository at https://github.com/rdiaz02/OncoSimul. The data are available as data(benchmark_2). For the Exp model the call will be {r exp-usual-r, eval = FALSE, echo = TRUE} oncoSimulPop(Nindiv, fitness, detectionProb = NA, detectionSize = 1e6, initSize = 500, detectionDrivers = NA, keepPhylog = TRUE, model = "Exp", errorHitWallTime = FALSE, errorHitMaxTries = FALSE, finalTime = 5000, onlyCancer = FALSE, mc.cores = 1, sampleEvery = 0.5, keepEvery = 1)  And for McFL: {r mc-usual-r, eval = FALSE, echo = TRUE} initSize <- 1000 oncoSimulPop(Nindiv, fitness, detectionProb = c( PDBaseline = 1.4 * initSize, n2 = 2 * initSize, p2 = 0.1, checkSizePEvery = 4), initSize = initSize, detectionSize = NA, detectionDrivers = NA, keepPhylog = TRUE, model = "McFL", errorHitWallTime = FALSE, errorHitMaxTries = FALSE, finalTime = 5000, max.wall.time = 10, onlyCancer = FALSE, mc.cores = 1, keepEvery = 1)   f9f9ab96  For the exponential model we will stop simulations when populations  bbd76086  have $>10^6$ cells (simulations start from 500 cells). For the McFarland  fc38a875  model we will use the detectionProb mechanism (see section \@ref(detectprob) for details); we could have used as stopping mechanism detectionSize = 2 * initSize (which would be basically equivalent to reaching cancer, as argued in [@McFarland2013]) but we want to provide  bbd76086  further examples under the detectionProb mechanism. We will start from 1000  fc38a875  cells, not 500 (starting from 1000 we almost always reach cancer in a  f9f9ab96  single run).  fc38a875  Why not use the detectionProb mechanism with the Exp models? Because it can be hard to intuitively understand what are reasonable settings for the parameters of the detectionProb mechanism when used in a population that is growing exponentially, especially if different genes have very  f9f9ab96  different effects on fitness. Moreover, we are using fitness specifications that are very different (compare the fitness landscape of six genes, the pancreas specification, and the fitness specification with 4000 genes with fitness effects drawn from an exponential distribution ---re_4000). In contrast, the detectionProb mechanism might be simpler to reason about in a population that is growing under a model of carrying capacity with possibly large periods of stasis. Let us emphasize that it is not that the detectionProb mechanism does not make sense with the Exp model; it is simply that the parameters might need finer adjustment for them to make sense, and in these benchmarks we are dealing with widely different fitness specifications.  fc38a875  Note also that we specify checkSizePEvery = 4 (instead of the default, which is 20). Why? Because the fitness specifications where fitness effects are drawn from exponential distributions (re_200 to re_4000 above) include many genes (well, up to 4000) some of them with possibly  f9f9ab96  very large effects. In these conditions, simulations can run very fast in  fc38a875  the sense of "units of time". If we check exiting conditions every 20  f9f9ab96  units the population could have increased its size several orders of magnitude in between checks (this is also discussed in sections \@ref(bench1xf) and \@ref(detectprob)). You can verify this by running the script with other settings for checkSizePEvery (and being aware that large settings might require you to wait for a long time). To ensure that populations have really grown, we have increased the setting of PDBaseline so that no simulation can be considered for stopping unless its size is 1.4 times larger than initSize.  fc38a875   f9f9ab96  In all cases we use keepEvery = 1 and keepPhylog = TRUE (so we store the population sizes of all clones every 1 time unit and we keep the complete genealogy of clones). Finally, we run all models with errorHitWallTime = FALSE and errorHitMaxTries = FALSE so that we can see results even if stopping conditions are not met.  fc38a875   bbd76086   fc38a875  The results of the benchmarks, using r unique(benchmark_2Numindiv) individual simulations, are shown in Table \@ref(tab:timingusual).  bbd76086  \blandscape  fc38a875   bbd76086  Table: (\#tab:timingusual) Benchmarks under some common use cases, set 1.  fc38a875  {r benchustable, eval=TRUE, echo = FALSE}  bbd76086  ## data(benchmark_2) ## knitr::kable(benchmark_2[, c("Model", "fitness", "time_per_simul", ## "size_mb_per_simul", "NumClones.Median", "NumIter.Median", ## "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", ## "TotalPopSize.Max.")], ## booktabs = TRUE, ## col.names = c("Model", ## "Fitness", ## "Elapsed Time, average per simulation (s)", ## "Object Size, average per simulation (MB)", ## "Number of Clones, median", ## "Number of Iterations, median", ## "Final Time, median", ## "Total Population Size, median", ## "Total Population Size, mean", ## "Total Population Size, max." ## ), ## align = "c")  a9ecbe56  panderOptions('table.split.table', 99999999) panderOptions('table.split.cells', 900) ## For HTML ## panderOptions('table.split.cells', 8) ## For PDF  bbd76086  ## set.alignment('right', row.names = 'center') panderOptions('table.alignment.default', 'right') panderOptions('round', 3) pander(benchmark_2[ , c( "Model", "Fitness", "Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max.")], justify = c('left', 'left', rep('right', 8)), ## caption = "\\label{tab:timingusual}Benchmarks under some common use cases, set 1." )  fc38a875    bbd76086  \elandscape \clearpage  fc38a875  In most cases, simulations run reasonably fast (under 0.1 seconds per  74449e3b  individual simulation) and the returned objects are small. I will only  fc38a875  focus on a few cases. The McFL model with random fitness landscape rf12 and with pancr does not satisfy the conditions of detectionProb in most cases: its median final time is 5000, which was the maximum final time specified. This suggests that the fitness landscape is such that it is unlikely that we  f9f9ab96  will reach population sizes> 1400$(remember we the setting for  fc38a875  PDBaseline) before 5000 time units. There is nothing particular about using a fitness landscape of 12 genes and other runs in other 12-gene random fitness landscapes do not show this pattern. However, complex fitness landscapes might be such that genotypes of high fitness (those that allow reaching a large population size quickly) are not easily accessible[^access] so reaching them might take a long time. This does not affect the exponential model in the same way because, well, because there is exponential growth in that model: any genotype with fitness$>1$will grow exponentially (of course, at possibly very different rates). You might want to play with the script and modify the call to rfitness (using different values of reference and c, for instance) to have simpler paths to a maximum or modify the call to oncoSimulPop (with, say, finalTime to much larger values). Some of these issues are related to more general questions about fitness landscapes and accessibility (see section \@ref(ex-ochs) and references therein). [^access]:By easily accessible I mean that there are many, preferably short, paths of non-decreasing fitness from the wildtype to this genotype. See definitions and discussion in, e.g., @franke_evolutionary_2011. You could also set onlyCancer = TRUE. This might make sense if you are interested in only seeing simulations that "reach cancer" (where "reach cancer" means reaching a state you define as a function of population size or drivers). However, if you are exploring fitness landscapes, onlyCancer = TRUE might not always be reasonable as reaching a particular population size, for instance, might just not be possible under some fitness landscapes (this phenomenon is of course not restricted to random fitness landscapes ---see also section \@ref(largegenes005)). As we anticipated above, the detectionProb mechanism has to be used with care: some of the simulations run in very short "time units", such as those for the fitness specifications with 2000 and 4000 genes. Having used a checkSizePEvery = 20 probably would not have made sense. Finally, it is interesting that in the cases examined here, the two  f9f9ab96  slowest running simulations are from "Exp", with fitnesses re_2000 and re_4000 (and the third slowest is also Exp, under re_500). These are also the cases with the largest number of clones. Why? In the "Exp" model there is no competition, and fitness specifications re_2000 and re_4000 have genomes with many genes with positive fitness contributions. It is thus very easy to obtain, from the wildtype ancestor, a large number of clones all of which have birth rates$>1$and, thus, clones that are unlikely to become extinct.  fc38a875  ### Common use cases, set 2. {#common2} We will now rerun the simulations above changing the following:  f9f9ab96  - finalTime set to 25000. - onlyCancer set to TRUE.  fc38a875  - The "Exp" models will stop when population size$> 10^5. This is in script 'benchmark_3.R', under '/inst/miscell', with output in the 'miscell-files/vignette_bench_Rout' directory of the main OncoSimul repository at https://github.com/rdiaz02/OncoSimul. The data are available as data(benchmark_3).  bbd76086  \blandscape Table: (\#tab:timingusual2) Benchmarks under some common use cases, set 2.  fc38a875  {r benchustable2, eval=TRUE, echo = FALSE}  bbd76086  ## data(benchmark_3) ## knitr::kable(benchmark_3[, c("Model", "fitness", "time_per_simul", ## "size_mb_per_simul", "NumClones.Median", "NumIter.Median", ## "FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", ## "TotalPopSize.Max.")], ## booktabs = TRUE, ## col.names = c("Model", ## "Fitness", "Elapsed Time, average per simulation (s)", ## "Object Size, average per simulation (MB)", ## "Number of Clones, median", ## "Number of Iterations, median", ## "Final Time, median", ## "Total Population Size, median", ## "Total Population Size, mean", ## "Total Population Size, max." ## ), ## align = "c")  a9ecbe56  panderOptions('table.split.table', 99999999) panderOptions('table.split.cells', 900) ## For HTML ## panderOptions('table.split.cells', 8) ## For PDF  bbd76086  panderOptions('round', 3) panderOptions('table.alignment.default', 'right') pander(benchmark_3[ , c( "Model", "Fitness", "Elapsed Time, average per simulation (s)", "Object Size, average per simulation (MB)", "Number of Clones, median", "Number of Iterations, median", "Final Time, median", "Total Population Size, median", "Total Population Size, mean", "Total Population Size, max.")], justify = c('left', 'left', rep('right', 8)), ## caption = "\\label{tab:timingusual2}Benchmarks under some common use cases, set 2." )   fc38a875   bbd76086  \elandscape \clearpage  fc38a875  Since we increased the maximum final time and forced runs to "reach cancer" the McFL run with the pancreas fitness specification takes a bit longer because it also has to do a larger number of iterations. Interestingly, notice that the median final time is close to 10000, so the runs in \@ref(common1) with maximum final time of 5000 would have had a hard time finishing with onlyCancer = TRUE.  f9f9ab96  Forcing simulations to "reach cancer" and just random differences between  fc38a875  the random fitness landscape also affect the McFL run under rf12: final time is below 5000 and the median number of iterations is about half of what was above. Finally, by stopping the Exp simulations at10^5$, simulations with re_2000 and re_4000 finish now in much shorter times (but they still take longer than their McFL counterparts) and the number of clones created is much smaller. ## Can we use a large number of genes? {#lnum} Yes. In fact, in OncoSimulR there is no pre-set limit on genome  f9f9ab96  size. However, large numbers of genes can lead to unacceptably large returned object sizes and/or running time. We discuss several examples next that illustrate some of the major issues to consider. Another example  fc38a875  with 50,000 genes is shown in section \@ref(mcf50070). We have seen in \@ref(bench1) and \@ref(common1) that for the Exp model, benchmark results using detectionProb require a lot of care and can be  f9f9ab96  misleading. Here, we will fix initial population sizes (to 500) and all final population sizes will be set to$\geq 10^6$. In addition, to avoid the confounding factor of the onlyCancer = TRUE argument, we will set it to FALSE, so we measure directly the time of individual runs.  fc38a875  ### Exponential model with 10,000 and 50,000 genes {#exp50000} #### Exponential, 10,000 genes, example 1 {#exp100001} We will start with 10000 genes and an exponential model, where we  f9f9ab96  stop when the population grows over$10^6$individuals:  fc38a875  {r exp10000, echo = TRUE, eval = FALSE} ng <- 10000  74449e3b  u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), rep(-0.1, ng/2)))  bbd76086  t_e_10000 <- system.time( e_10000 <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, mutationPropGrowth = TRUE, mc.cores = 1))  fc38a875   {r exp10000-out, echo = TRUE, eval = FALSE} t_e_10000 ## user system elapsed ## 4.368 0.196 4.566 summary(e_10000)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 5017 1180528 415116 143 7547 ## 2 3726 1052061 603612 131 5746 ## 3 4532 1100721 259510 132 6674 ## 4 4150 1283115 829728 99 6646 ## 5 4430 1139185 545958 146 6748 print(object.size(e_10000), units = "MB") ## 863.9 Mb  Each simulation takes about 1 second but note that the number of clones  74449e3b  for most simulations is already over 4000 and that the size of the returned  fc38a875  object is close to 1 GB (a more detailed explanation of where this 1 GB comes from is deferred until section \@ref(wheresizefrom)). #### Exponential, 10,000 genes, example 2 {#exp10000_2}  74449e3b  We can decrease the size of the returned object if we use the keepEvery =  fc38a875  NA argument (this setting was explained in detail in section \@ref(bench1)): {r exp10000b, eval = FALSE, echo = TRUE}  bbd76086  t_e_10000b <- system.time( e_10000b <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = TRUE, mc.cores = 1 ))  fc38a875   {r exp10000b-out, echo = TRUE, eval = FALSE} t_e_10000b ## user system elapsed ## 5.484 0.100 5.585 summary(e_10000b)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 2465 1305094 727989 91 6447 ## 2 2362 1070225 400329 204 8345 ## 3 2530 1121164 436721 135 8697 ## 4 2593 1206293 664494 125 8149 ## 5 2655 1186994 327835 191 8572 print(object.size(e_10000b), units = "MB") ## 488.3 Mb  #### Exponential, 50,000 genes, example 1 {#exp500001} Let's use 50,000 genes. To keep object sizes reasonable we use keepEvery = NA. For now, we also set mutationPropGrowth = FALSE so that the mutation rate does not become really large in clones with many mutations but, of course, whether or not this is a reasonable decision depends on the problem; see also below. {r exp50000, echo = TRUE, eval = FALSE} ng <- 50000  74449e3b  u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), rep(-0.1, ng/2)))  bbd76086  t_e_50000 <- system.time( e_50000 <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_e_50000 ## user system elapsed ## 44.192 1.684 45.891 summary(e_50000)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 7367 1009949 335455 75.00 18214 ## 2 8123 1302324 488469 63.65 17379 ## 3 8408 1127261 270690 72.57 21144 ## 4 8274 1138513 318152 80.59 20994 ## 5 7520 1073131 690814 70.00 18569 print(object.size(e_50000), units = "MB") ## 7598.6 Mb   74449e3b  Of course, simulations now take longer and the size of the returned  f9f9ab96  object is over 7 GB (we are keeping more than 7,000 clones, even if when  fc38a875  we prune all those that went extinct).  f9f9ab96  #### Exponential, 50,000 genes, example 2 {#exp50000_2}  fc38a875  What if we had not pruned? {r exp50000np, echo = TRUE, eval = FALSE} ng <- 50000  74449e3b  u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), rep(-0.1, ng/2)))  bbd76086  t_e_50000np <- system.time( e_50000np <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = 1, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_e_50000np ## user system elapsed ## 42.316 2.764 45.079 summary(e_50000np)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 13406 1027949 410074 71.97 19469 ## 2 12469 1071325 291852 66.00 17834 ## 3 11821 1089834 245720 90.00 16711 ## 4 14008 1165168 505607 77.61 19675 ## 5 14759 1074621 205954 87.68 20597 print(object.size(e_50000np), units = "MB") ## 12748.4 Mb  The main effect is not on execution time but on object size (it has  f9f9ab96  grown by 5 GB). We are tracking more than 10,000 clones.  fc38a875   f9f9ab96  #### Exponential, 50,000 genes, example 3 {#exp50000_3}  fc38a875  What about the mutationPropGrowth setting? We will rerun the example in \@ref(exp500001) leaving keepEvery = NA but with the default mutationPropGrowth: {r exp50000mpg, echo = TRUE, eval = FALSE} ng <- 50000  74449e3b  u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), rep(-0.1, ng/2)))  fc38a875   bbd76086  t_e_50000c <- system.time( e_50000c <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = TRUE, mc.cores = 1 ))  fc38a875  t_e_50000c ## user system elapsed ## 84.228 2.416 86.665 summary(e_50000c)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 11178 1241970 344479 84.74 27137 ## 2 12820 1307086 203544 91.94 33448 ## 3 10592 1126091 161057 83.81 26064 ## 4 11883 1351114 148986 65.68 25396 ## 5 10518 1101392 253523 99.79 26082 print(object.size(e_50000c), units = "MB") ## 10904.9 Mb  As expected (because the mutation rate per unit time is increasing in the fastest growing clones), we have many more clones, larger objects, and longer times of execution here: we almost double the time and the size of the object increases by almost 3 GB. What about larger population sizes or larger mutation rates? The number of clones starts growing fast, which means much slower  74449e3b  execution times and much larger returned objects (see also the examples  fc38a875  below).  f9f9ab96  #### Interlude: where is that 1 GB coming from? {#wheresizefrom} In section \@ref(exp100001) we have seen an apparently innocuous simulation producing a returned object of almost 1 GB. Where is that coming from? It means that each simulation produced almost 200 MB of output. Let us look at one simulation in more detail: {r sizedetail, eval = FALSE, echo = TRUE} r1 <- oncoSimulIndiv(u, model = "Exp", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, mutationPropGrowth = TRUE ) summary(r1)[c(1, 8)] ## NumClones FinalTime ## 1 3887 345 print(object.size(r1), units = "MB") ## 160 Mb ## Size of the two largest objects inside: sizes <- lapply(r1, function(x) object.size(x)/(1024^2)) sort(unlist(sizes), decreasing = TRUE)[1:2] ## Genotypes pops.by.time ## 148.28 10.26 dim(r1$Genotypes) ## [1] 10000 3887  The above shows the reason: the Genotypes matrix is a 10,000 by 3,887 integer matrix (with a 0 and 1 indicating not-mutated/mutated for each gene in each genotype) and in R integers use 4 bytes each. The pops.by.time matrix is 346 by 3,888 (the 1 in $346 = 345 + 1$ comes from starting at 0 and going up to the final time, both included; the 1 in $3888 = 3887 + 1$ is from the column of time) double matrix and doubles use 8 bytes[^popsbytime]. [^popsbytime]:These matrices do not exist during most of the execution of the C++ code; they are generated right before returning from the C++ code.  fc38a875  ### McFarland model with 50,000 genes; the effect of keepEvery {#mc50000} We show an example of McFarland's model with 50,000 genes in section \@ref(mcf50070). We will show here a few more examples with those many genes but with a different fitness specification and changing several other settings. #### McFarland, 50,000 genes, example 1 {#mc50000ex1}  f9f9ab96  Let's start with mutationPropGrowth = FALSE and keepEvery = NA. Simulations end when population size $\geq 10^6$.  fc38a875  {r mc50000_1, echo = TRUE, eval = FALSE} ng <- 50000  74449e3b  u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), rep(-0.1, ng/2)))  fc38a875   bbd76086  t_mc_50000_nmpg <- system.time( mc_50000_nmpg <- oncoSimulPop(5, u, model = "McFL", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg ## user system elapsed ## 30.46 0.54 31.01 summary(mc_50000_nmpg)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 1902 1002528 582752 284.2 31137 ## 2 2159 1002679 404858 274.8 36905 ## 3 2247 1002722 185678 334.5 42429 ## 4 2038 1009606 493574 218.4 32519 ## 5 2222 1004661 162628 291.0 38470 print(object.size(mc_50000_nmpg), units = "MB") ## 2057.6 Mb  We are already dealing with 2000 clones. #### McFarland, 50,000 genes, example 2 {#mc50000ex2} Setting keepEvery = 1 (i.e., keeping track of clones with an interval of 1): {r mc50000_kp, echo = TRUE, eval = FALSE}  bbd76086  t_mc_50000_nmpg_k <- system.time( mc_50000_nmpg_k <- oncoSimulPop(5, u, model = "McFL", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = 1, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg_k ## user system elapsed ## 30.000 1.712 31.714 summary(mc_50000_nmpg_k)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 8779 1000223 136453 306.7 38102 ## 2 7442 1006563 428150 345.3 35139 ## 3 8710 1003509 224543 252.3 35659 ## 4 8554 1002537 103889 273.7 36783 ## 5 8233 1003171 263005 301.8 35236 print(object.size(mc_50000_nmpg_k), units = "MB") ## 8101.4 Mb   f9f9ab96  Computing time increases slightly but the major effect is seen on the size of the returned object, that increases by a factor of about 4x, up to 8 GB, corresponding to the increase in about 4x in the number of clones being tracked (see details of where the size of this object comes from in  fc38a875  section \@ref(wheresizefrom)). #### McFarland, 50,000 genes, example 3 {#mc50000ex3} We will set keepEvery = NA again, but we will now increase  f9f9ab96  detection size by a factor of 3 (so we stop when total population size becomes $\geq 3 * 10^6$).  fc38a875  {r mc50000_popx, echo = TRUE, eval = FALSE} ng <- 50000  74449e3b  u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), rep(-0.1, ng/2)))  fc38a875   bbd76086  t_mc_50000_nmpg_3e6 <- system.time( mc_50000_nmpg_3e6 <- oncoSimulPop(5, u, model = "McFL", mu = 1e-7, detectionSize = 3e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg_3e6 ## user system elapsed ## 77.240 1.064 78.308 summary(mc_50000_nmpg_3e6)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 5487 3019083 836793 304.5 65121 ## 2 4812 3011816 789146 286.3 53087 ## 3 4463 3016896 1970957 236.6 45918 ## 4 5045 3028142 956026 360.3 63464 ## 5 4791 3029720 916692 358.1 55012 print(object.size(mc_50000_nmpg_3e6), units = "MB") ## 4759.3 Mb  Compared with the first run (\@ref(mc50000ex1)) we have approximately doubled computing time, number of iterations, number of clones, and object size. #### McFarland, 50,000 genes, example 4 {#mc50000ex4} Let us use the same detectionSize = 1e6 as in the first example (\@ref(mc50000ex1)), but with 5x the mutation rate: {r mc50000_mux, echo = TRUE, eval = FALSE}  bbd76086  t_mc_50000_nmpg_5mu <- system.time( mc_50000_nmpg_5mu <- oncoSimulPop(5, u, model = "McFL", mu = 5e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg_5mu ## user system elapsed ## 167.332 1.796 169.167 summary(mc_50000_nmpg_5mu)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 7963 1004415 408352 99.03 57548 ## 2 8905 1010751 120155 130.30 74738 ## 3 8194 1005465 274661 96.98 58546 ## 4 9053 1014049 119943 112.23 75379 ## 5 8982 1011817 95047 99.95 76757 print(object.size(mc_50000_nmpg_5mu), units = "MB") ## 8314.4 Mb   74449e3b  The number of clones we are tracking is about 4x the number of clones of the first example (\@ref(mc50000ex1)), and roughly similar to the number of clones of the second example (\@ref(mc50000ex2)), and size of the returned object is similar to that of the second example. But computing time has increased by a factor of about 5x and iterations have increased by a factor of about 2x. Iterations increase because mutation is more frequent; in addition, at each sampling period each iteration needs to do more work as it needs to loop over a larger number of clones and this larger number includes clones that are not shown here, because they are pruned (they are extinct by the time we exit the simulation ---again, pruning is discussed with further details in \@ref(prune)).  fc38a875  #### McFarland, 50,000 genes, example 5 {#mc50000ex5}  74449e3b  Now let's run the above example but with keepEvery = 1:  fc38a875  {r mcf5muk, echo = TRUE, eval = FALSE}  bbd76086  t_mc_50000_nmpg_5mu_k <- system.time( mc_50000_nmpg_5mu_k <- oncoSimulPop(5, u, model = "McFL", mu = 5e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = 1, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg_5mu_k ## user system elapsed ## 174.404 5.068 179.481 summary(mc_50000_nmpg_5mu_k)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 25294 1001597 102766 123.4 74524 ## 2 23766 1006679 223010 124.3 71808 ## 3 21755 1001379 203638 114.8 62609 ## 4 24889 1012103 161003 119.3 75031 ## 5 21844 1002927 255388 108.8 64556 print(object.size(mc_50000_nmpg_5mu_k), units = "MB") ## 22645.8 Mb  We have already seen these effects before in section \@ref(mc50000ex2): using keepEvery = 1 leads to a slight increase  f9f9ab96  in execution time. What is really affected is the size of the returned object which increases by a factor of about 3x (and is now over 20GB). That 3x corresponds, of course, to the increase in the number of clones being tracked (now over 20,000). This, by the way, also allows us to understand the comment above, where we said that in these two cases (where we have increased mutation rate) at each iteration we need to do more work as at every update of the population the algorithm needs to loop over a much larger number of clones (even if many of those are eventually pruned).  fc38a875  #### McFarland, 50,000 genes, example 6 {#mc50000ex6} Finally, we will run the example in section \@ref(mc50000ex1) with the default of mutationPropGrowth = TRUE: {r mc50000_2, echo = TRUE, eval = FALSE}  bbd76086  t_mc_50000 <- system.time( mc_50000 <- oncoSimulPop(5, u, model = "McFL", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = TRUE, mc.cores = 1 ))  fc38a875  t_mc_50000 ## user system elapsed ## 303.352 2.808 306.223 summary(mc_50000)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 13928 1010815 219814 210.9 91255 ## 2 12243 1003267 214189 178.1 67673 ## 3 13880 1014131 124354 161.4 88322 ## 4 14104 1012941 75521 205.7 98583 ## 5 12428 1005594 232603 167.4 70359 print(object.size(mc_50000), units = "MB") ## 12816.6 Mb  Note the huge increase in computing time (related of course to the huge  74449e3b  increase in number of iterations) and in the size of the returned object: we  fc38a875  have gone from having to track about 2000 clones to tracking over 12000 clones even when we prune all clones without descendants. ### Examples with $s = 0.05$ {#largegenes005} A script with the above runs but using $s=0.05$ instead of $s=0.1$ is available from the repository ('miscell-files/vignette_bench_Rout/large_num_genes_0.05.Rout'). I will single out a couple of cases here. First, we repeat the run shown in section \@ref(mc50000ex5): {r mcf5muk005, echo = TRUE, eval = FALSE}  bbd76086  t_mc_50000_nmpg_5mu_k <- system.time( mc_50000_nmpg_5mu_k <- oncoSimulPop(2, u, model = "McFL", mu = 5e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = 1, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg_5mu_k ## user system elapsed ## 305.512 5.164 310.711 summary(mc_50000_nmpg_5mu_k)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 61737 1003273 104460 295.8731 204214 ## 2 65072 1000540 133068 296.6243 210231 print(object.size(mc_50000_nmpg_5mu_k), units = "MB") ## 24663.6 Mb  Note we use only two replicates, since those two already lead to a  74449e3b  24 GB returned object as we are tracking more than 60,000 clones, more  fc38a875  than twice those with $s=0.1$. The reason for the difference in number of clones and iterations is of course the change from $s=0.1$ to $s=0.05$: under the McFarland model to reach population sizes of $10^6$ starting from an equilibrium population of 500 we need about 43 mutations (whereas only about 22 are needed if $s=0.1$[^mcnum]). [^mcnum]: Given the dependence of death rates on population size in  f9f9ab96  McFarland's model (section \@ref(mcfl)), if all mutations have the same fitness effects we can calculate the equilibrium population size (where birth and death rates are equal) for a given number of mutated genes as: $K * (e^{(1 + s)^p} - 1)$, where $K$ is the initial equilibrium size, $s$ the fitness effect of each mutation, and $p$ the number of mutated genes.  fc38a875  Next, let us rerun \@ref(mc50000ex1): {r mc50000_1_005, echo = TRUE, eval = FALSE}  bbd76086  t_mc_50000_nmpg <- system.time( mc_50000_nmpg <- oncoSimulPop(5, u, model = "McFL", mu = 1e-7, detectionSize = 1e6, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, keepEvery = NA, mutationPropGrowth = FALSE, mc.cores = 1 ))  fc38a875  t_mc_50000_nmpg ## user system elapsed ## 111.236 0.596 111.834 summary(mc_50000_nmpg)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 2646 1000700 217188 734.475 108566 ## 2 2581 1001626 209873 806.500 107296 ## 3 2903 1001409 125148 841.700 120859 ## 4 2310 1000146 473948 906.300 91519 ## 5 2704 1001290 448409 838.800 103556 print(object.size(mc_50000_nmpg), units = "MB") ## 2638.3 Mb  Using $s=0.05$ leads to a large increase in final time and number of iterations. However, as we are using the keepEvery = NA setting,  74449e3b  the increase in number of clones tracked and in size of returned  fc38a875  object is relatively small. ### The different consequences of keepEvery = NA in the Exp and McFL models {#kpexpmc}  74449e3b  We have seen that keepEvery = NA often leads to much smaller returned  fc38a875  objects when using the McFarland model than when using the Exp model. Why? Because in the McFarland model there is strong competition and there can be complete clonal sweeps so that in extreme cases a single clone might be all that is left after some time. This is not the case in the exponential models.  f9f9ab96  Of course, the details depend on the difference in fitness effects between different genotypes (or clones). In particular, we have seen several examples where even with keepEvery=NA there are a lot of clones in the McFL models. In those examples many clones had identical fitness (the fitness effects of all genes with positive fitness was the same, and ditto for the genes with negative fitness effects), so no clone ends up displacing all the others.  fc38a875  ### Are we keeping the complete history (genealogy) of the clones? {#histlargegenes}  f9f9ab96  Yes we are if we run with keepPhylog = TRUE, regardless of the setting for keepEvery. As explained in section \@ref(trackindivs), OncoSimulR prunes clones that never had a population size larger than zero at any sampling period (so they are not reflected in the pops.by.time matrix in the output). And when we set keepEvery = NA we are telling OncoSimulR to discard all sampling periods except the very last one (i.e., the pops.by.time matrix contains only the clones with 1 or more cells at the end of the simulation).  fc38a875  keepPhylog operates differently: it records the exact time at which a clone appeared and the clone that gave rise to it. This information is kept regardless of whether or not those clones appear in the pops.by.time matrix. Keeping the complete genealogy might be of limited use if the  f9f9ab96  pops.by.time matrix only contains the very last period. However, you can use plotClonePhylog and ask to be shown only clones that exist in the very last period (while of course showing all of their ancestors, even if those are now extinct ---i.e., regardless of their abundance).  fc38a875  For instance, in run \@ref(exp500001) we could have looked at the  f9f9ab96  information stored about the genealogy of clones by doing (we look at the first "individual" of the simulation, of the five "individuals" we simulated):  fc38a875  {r filog_exp50000_1, echo = TRUE, eval = FALSE} head(e_50000[[1]]$other$PhylogDF) ## parent child time ## 1 3679 0.8402 ## 2 4754 1.1815 ## 3 20617 1.4543 ## 4 15482 2.3064 ## 5 4431 3.7130 ## 6 41915 4.0628 tail(e_50000[[1]]$other$PhylogDF) ## parent child time ## 20672 3679, 20282 3679, 20282, 22359 75.0 ## 20673 3679, 17922, 22346 3679, 17922, 22346, 35811 75.0 ## 20674 2142, 3679 2142, 3679, 25838 75.0 ## 20675 3679, 17922, 19561 3679, 17922, 19561, 43777 75.0 ## 20676 3679, 15928, 19190, 20282 3679, 15928, 19190, 20282, 49686 75.0 ## 20677 2142, 3679, 16275 2142, 3679, 16275, 24201 75.0  where each row corresponds to one event of appearance of a new clone, the column labeled "parent" are the mutated genes in the parent, and the column labeled "child" are the mutated genes in the child.  f9f9ab96  And we could plot the genealogical relationships of clones that have a population size of at least one in the last period (again, while of course showing all of their ancestors, even if those are now extinct ---i.e., regardless of their current numbers) doing:  fc38a875  {r noplotlconephylog, echo = TRUE, eval = FALSE} plotClonePhylog(e_50000[[1]]) ## plot not shown  What is the cost of keep the clone genealogies? In terms of time it is minor. In terms of space, and as shown in the example above, we can end up storing a data frame with tends of thousands of rows and three columns (two factors, one float). In the example above the size of that data frame is approximately 2 MB for a single simulation. This is much smaller than the pops.by.time or Genotypes matrices, but it can quickly build up if you routinely launch, say, 1000 simulations via oncoSimulPop. That is why the default is keepPhylog = FALSE as this information is not needed as often as that in the other two matrices (pops.by.time and  ab439943  Genotypes).  fc38a875  ## Population sizes $\geq 10^{10}$ {#popgtzx} We have already seen examples where population sizes reach $10^{8}$ to $10^{10}$, as in Tables \@ref(tab:bench1b), \@ref(tab:timing3), \@ref(tab:timing3xf). What about even larger population sizes? The C++ code will unconditionally alert if population sizes exceed $4*10^{15}$ as in those cases loosing precision (as we are using doubles) would be unavoidable, and we would also run into problems with the generation of binomial random variates (code that illustrates and discusses this problem is available in file "example-binom-problems.cpp", in directory "/inst/miscell"). However, well before we reach $4*10^{15}$ we loose precision from other sources. One of the most noticeable ones is that when we reach population sizes around $10^{11}$ the C++ code will often alert us by throwing exceptions with the message  f9f9ab96  Recoverable exception ti set to DBL_MIN. Rerunning. I throw this  fc38a875  exception because $t_i$, the random variable for time to next mutation, is less than DBL_MIN, the minimum representable floating-point number. This happens because, unless we use really tiny mutation rates, the time to a mutation starts getting closer to zero as population sizes grow very large. It might be possible to ameliorate these problems somewhat by using long doubles (instead of doubles) or special purpose libraries that provide more precision. However, this would make it harder to run the same code in different operating systems and would likely decrease execution speed on the rest of the common scenarios for which OncoSimulR has been designed. The following code shows some examples where we use population sizes of $10^{10}$ or larger. Since we do not want simulations in the exponential model to end because of extinction, I use a fitness specification where all genes have a positive fitness effect and we start all simulations from a large population (to make it unlikely that the population will become extinct before cells mutate and start increasing in numbers). We set the maximum running time to 10 minutes. We keep the genealogy of the clones and use keepEvery = 1. {r ex-large-pop-size, eval = FALSE, echo = TRUE} ng <- 50 u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng)))  {r ex-large-mf, eval = FALSE, echo = TRUE}  bbd76086  t_mc_k_50_1e11 <- system.time( mc_k_50_1e11 <- oncoSimulPop(5, u, model = "McFL", mu = 1e-7, detectionSize = 1e11, initSize = 1e5, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, mutationPropGrowth = FALSE, keepEvery = 1, finalTime = 5000, mc.cores = 1, max.wall.time = 600 ))  fc38a875  ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. t_mc_k_50_1e11 ## user system elapsed ## 613.612 0.040 613.664 summary(mc_k_50_1e11)[, c(1:3, 8, 9)] ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 5491 100328847809 44397848771 1019.950 942764 ## 2 3194 100048090441 34834178374 789.675 888819 ## 3 5745 100054219162 24412502660 927.950 929231 ## 4 4017 101641197799 60932177160 750.725 480938 ## 5 5393 100168156804 41659212367 846.250 898245 ## print(object.size(mc_k_50_1e11), units = "MB") ## 177.8 Mb  We get to $10^{11}$. But notice the exception with the warning about $t_i$. Notice also that this takes a long time and we run a very large number of iterations (getting close to one million in some cases). Now the exponential model with detectionSize = 1e11: {r ex-large-exp, eval = FALSE, echo = TRUE}  bbd76086  t_exp_k_50_1e11 <- system.time( exp_k_50_1e11 <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7, detectionSize = 1e11, initSize = 1e5, detectionDrivers = NA, detectionProb = NA, keepPhylog = TRUE, onlyCancer = FALSE, mutationPropGrowth = FALSE, keepEvery = 1, finalTime = 5000, mc.cores = 1, max.wall.time = 600, errorHitWallTime = FALSE, errorHitMaxTries = FALSE ))  fc38a875  ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Hitted wall time. Exiting. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Hitted wall time. Exiting. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Recoverable exception ti set to DBL_MIN. Rerunning. ## Hitted wall time. Exiting. ## Hitted wall time. Exiting. t_exp_k_50_1e11 ## user system elapsed ## 2959.068 0.128 2959.556 try(summary(exp_k_50_1e11)[, c(1:3, 8, 9)]) ## NumClones TotalPopSize LargestClone FinalTime NumIter ## 1 6078 65172752616 16529682757 235.7590 1883438 ## 2 5370 106476643712 24662446729 232.0000 2516675 ## 3 2711 21911284363 17945303353 224.8608 543698 ## 4 2838 13241462284 2944300245 216.8091 372298 ## 5 7289 76166784312 10941729810 240.0217 1999489 print(object.size(exp_k_50_1e11), units = "MB") ## 53.5 Mb  Note that we almost reached max.wall.time (600 * 5 = 3000). What if we  f9f9ab96  wanted to go up to $10^{12}$? We would not be able to do it in 10 minutes. We could set max.wall.time to a value larger than 600 to allow us to reach larger sizes but then we would be waiting for a possibly unacceptable time for simulations to finish. Moreover, this would eventually fail as simulations would keep hitting the $t_i$ exception without ever being able to complete. Finally, even if we were very patient, hitting that $t_i$ exception should make us worry about possible biases in the samples.  fc38a875  ## A summary of some determinants of running time and space consumption To summarize this section, we have seen:  f9f9ab96  - Both McFL and Exp can be run in short times over a range of sizes for the detectionProb and detectionSize mechanisms using a complex fitness specification with moderate numbers of genes. These are the typical or common use cases of OncoSimulR.  fc38a875  - The keepEvery argument can have a large effect on time in the McFL models and specially on object sizes. If only the end result of the simulation is to be used, you should set keepEvery = NA. - The distribution of fitness effects and the fitness landscape can have large effects on running times. Sometimes these are intuitive and simple to reason about, sometimes they are not as they interact with other factors (e.g., stopping mechanism, numbers of clones, etc). In general, there can be complex interactions between different settings, from mutation rate to fitness effects to initial size. As usual, test before launching a massive simulation. - Simulations start to slow down and lead to a very large object size when we keep track of around 6000 to 10000 clones. Anything that leads to these patterns will slow down the simulations.  f9f9ab96  - OncoSimulR needs to keep track of genotypes (or clones), not just numbers of drivers and passengers, because it allows you to use complex fitness and mutation specifications that depend on specific genotypes. The keepEvery = NA is an approach to store only the minimal information needed, but it is unavoidable that during the simulations we might be forced to deal with many thousands of different clones.  fc38a875   6610ac26   bbd76086  \clearpage  6610ac26  # Specifying fitness effects {#specfit}  fc38a875  OncoSimulR uses a standard continuous time model, where individual cells divide, die, and mutate with rates that can depend on genotype and population size; over time the abundance of the different genotypes changes by the action of selection (due to differences in  f9f9ab96  net growth rates among genotypes), drift, and mutation. As a  fc38a875  result of a mutation in a pre-existing clone new clones arise, and the birth rate of a newly arisen clone is determined at the time of its emergence as a function of its genotype. Simulations can use an use exponential growth model or a model with carrying capacity that follows @McFarland2013. For the exponential growth model, the death rate is fixed at one whereas in the model with carrying capacity death rate increases with population size. In both cases, therefore,  f9f9ab96  fitness differences among genotypes in a given population at a  fc38a875  given time are due to differences in the mapping between genotype and birth rate. There is second exponential model (called "Bozic") where birth rate is fixed at one, and genotype determines death rate instead of birth rate (see details in \@ref(numfit)). So when we discuss specifying fitness effects or the effects of genes on fitness, we are actually referring to specifying effects on birth (or death) rates, which then translate into differences in fitness (since the other rate, death or birth, is either fixed, as in the Exp and Bozic models, or depends on the population size). This is also shown in Table \@ref(tab:osrfeatures), in the rows for "Fitness components", under "Evolutionary Features".  6610ac26  ## Introduction to the specification of fitness effects {#introfit}  fc38a875   6610ac26  With OncoSimulR you can specify different types of effects on fitness:  f9f9ab96  * A special type of epistatic effect that is particularly amenable to be represented as a graph (a DAG). In this graph having, say, "B" be a child of "A" means that a mutation in B can only accumulate if a mutation in A is already present. This is what OT [@Desper1999JCB; @Szabo2008], CBN [@Beerenwinkel2007; @Gerstung2009; @Gerstung2011], progression networks [@Farahani2013], and other similar models [@Korsunsky2014] generally mean. Details are provided in section \@ref(posetslong). Note that this is not an order effect (discussed below): the fitness of a genotype from this DAGs is a function of whether or not the restrictions in the graph are  6610ac26  satisfied, not the historical sequence of how they were satisfied. * Effects where the order in which mutations are acquired matters, as  fc38a875  illustrated in section \@ref(oe). There is, in fact, empirical evidence of these effects [@Ortmann2015]. For instance, the fitness of genotype "A, B" would differ depending on whether A or B was acquired first (or, as in the actual example in [@Ortmann2015], the fitness of the mutant with JAK2 and TET2 mutated will depend on which of the genes was mutated first).  6610ac26  * General epistatic effects (e.g., section \@ref(epi)), including synthetic viability (e.g., section \@ref(sv)) and synthetic lethality/mortality (e.g., section \@ref(sl)). * Genes that have independent effects on fitness (section \@ref(noint)).  fc38a875  * Modules (see section \@ref(modules0)) allow you to specify any of the above effects (except those for genes without interactions, as it would not make sense there) in terms of modules (sets of genes), not individual genes. We will introduce them right after \@ref(posetslong), and we will continue using them thereafter.  6610ac26  A guiding design principle of OncoSimulR is to try to make the specification of those effects as simple as possible but also as flexible as possible. Thus, there are two main ways of specifying fitness effects: * Combining different types of effects in a single specification. For  fc38a875  instance, you can combine epistasis with order effects with no interaction genes with modules. What you would do here is specify the effects that different mutations (or their combinations) have on fitness (the fitness effects) and then have OncoSimulR take care of combining them as if each of these were lego pieces. We will refer to this as the **lego system of fitness effects**. (As explained above, I find this an intuitive and very graphical analogy, which I have copied from @Hothorn_2006 and @Hothorn_2008).  6610ac26  * Explicitly passing to OncoSimulR a mapping of genotypes to fitness. Here you specify the fitness of each genotype. We will refer to this as the **explicit mapping of genotypes to fitness**. Both approaches have advantages and disadvantages. Here I emphasize some relevant differences. * With the lego system you can specify huge genomes with an enormous variety of interactions, since the possible genotypes are not constructed in advance. You would not be able to do this with the explicit mapping of genotypes to fitness if you wanted to, say, construct that mapping for a modest genotype of 500 genes (you'd have  fc38a875  more genotypes than particles in the observable Universe).  6610ac26   f9f9ab96  * For many models/data you often intuitively start with the fitness of the genotypes, not the fitness consequences of the different mutations. In these cases, you'd need to do the math to specify the terms you want if you used the lego system so you'll probably use the specification with the direct mapping genotype $\rightarrow$ fitness.  6610ac26   f9f9ab96  * Likewise, sometimes you already have a moderate size genotype $\rightarrow$ fitness mapping and you certainly do not want to do the math by hand: here the lego system would be painful to use.  6610ac26  * But sometimes we do think in terms of "the effects on fitness of such and such mutations are" and that immediately calls for the lego system, where you focus on the effects, and let OncoSimulR take care of doing the math of combining. * If you want to use order effects, you must use the lego system (at least for now). * If you want to specify modules, you must use the lego system (the explicit mapping of genotypes is, by its very nature, ill-suited for this). * The lego system might help you see what your model really means: in  f9f9ab96  many cases, you can obtain fairly succinct specifications of complex  6610ac26  fitness models with just a few terms. Similarly, depending on what your emphasis is, you can often specify the same fitness landscape in several different ways. Regardless of the route, you need to get that information into OncoSimulR's functions. The main function we will use is  fc38a875  allFitnessEffects: this is the function in charge of reading  6610ac26  the fitness specifications. We also need to discuss how, what, and where  fc38a875  you have to pass to allFitnessEffects.  6610ac26  ### Explicit mapping of genotypes to fitness {#explicitmap} Conceptually, the simplest way to specify fitness is to specify the mapping of all genotypes to fitness explicitly. An example will make this clear. Let's suppose you have a simple two-gene scenario, so a total of four genotypes, and you have a data frame with genotypes and fitness, where genoytpes are specified as character vectors, with mutated genes separated by commas: {r} m4 <- data.frame(G = c("WT", "A", "B", "A, B"), F = c(1, 2, 3, 4))   fc38a875  Now, let's give that to the allFitnessEffects function:  6610ac26  {r} fem4 <- allFitnessEffects(genotFitness = m4)  (The message is just telling you what the program guessed you wanted.)  ab439943  That's it. You can try to plot that fitnessEffects object  08de3ba7  {r}  ab439943  try(plot(fem4))  6610ac26    ab439943  In this case, you probably want to plot the fitness landscape. {r, fig.width=6.5, fig.height = 6.5} plotFitnessLandscape(evalAllGenotypes(fem4))   6610ac26  You can also check what OncoSimulR thinks the fitnesses are, with the  fc38a875  evalAllGenotypes function that we will use repeatedly below  6610ac26  (of course, here we should see the same fitnesses we entered): {r} evalAllGenotypes(fem4, addwt = TRUE)  And you can plot the fitness landscape: {r} plotFitnessLandscape(evalAllGenotypes(fem4))  To specify the mapping you can also use a matrix (or data frame) with $g + 1$ columns; each of the first $g$ columns contains a 1 or a 0 indicating that the gene of that column is mutated or not. Column $g+ 1$ contains the fitness values. And you do not even need to specify all the  4ac900b7  genotypes: the missing genotypes are assigned a fitness 0 ---except for the WT genotype which, if missing, is assigned a fitness of 1:  6610ac26   08de3ba7  {r}  6610ac26  m6 <- cbind(c(1, 1), c(1, 0), c(2, 3)) fem6 <- allFitnessEffects(genotFitness = m6) evalAllGenotypes(fem6, addwt = TRUE)  ab439943  ## plot(fem6)  6610ac26   {r, fig.width=6.5, fig.height = 6.5} plotFitnessLandscape(evalAllGenotypes(fem6))  This way of giving a fitness specification to OncoSimulR might be ideal if you directly generate random mappings of genotypes to fitness (or random fitness landscapes), as we will do in section  ab439943  \@ref(gener-fit-land). Specially when the fitness landscape contains many non-viable genotypes (which are considered those with fitness ---birth rate--- $<1e-9$) this can result in considerable savings as we only need to check the fitness of the viable genotypes in a table (a C++ map). Note, however, that using the Bozic model with the fitness landscape specification is not tested. In addition, for speed, missing genotypes from the fitness landscape specification are taken to be non-viable genotypes (**beware!! this is a breaking change relative to versions < 2.9.1**)[^flfast]. [^flfast]: Note for curious readers: it used to be the case that we converted the table of fitness of genotypes to a fitness specification with all possible epistatic interactions; you can take a look at the test file test.genot_fitness_to_epistasis.R that uses the fem6 object. We no longer do that but instead pass directly the fitness landscape.  6610ac26   ab439943   6610ac26   ab439943   6610ac26   fc38a875   6610ac26   f9a38e24   6610ac26  ### How to specify fitness effects with the lego system {#howfit}  fc38a875  An alternative general approach followed in many genetic simulators is to specify how particular combinations of alleles modify the wildtype genotype or the genotype that contains the individual effects of the interacting genes (e.g., see equation 1 in the supplementary material for FFPopSim [@Zanini2012]). For example, if we specify that a mutation in "A" contributes 0.04, a mutation in "B" contributes 0.03, and the double mutation "A:B" contributes 0.1, that means that the fitness of the "A, B" genotype (the genotype with A and B mutated) is that of the wildtype (1, by default), plus (actually, times ---see section \@ref(numfit)--- but plus on the log scale) the effects of having A mutated, plus (times) the effects of having B mutated, plus (times) the effects of "A:B" both being mutated.  6610ac26  We will see below that with the "lego system" it is possible to do something very similar to the explicit mapping of section \@ref(explicitmap). But this will sometimes require a more cumbersome notation (and sometimes also will require your doing some math). We will see examples in sections \@ref(e2), \@ref(e3) and \@ref(theminus) or the example in \@ref(weis1b). But then, if we can be explicit about (at least some of) the mappings $genotype \rightarrow fitness$, how are these procedures different? When you use the "lego system" you can combine both a partial explicit mapping of genotypes to fitness with arbitrary fitness effects of other genes/modules. In other words, with the "lego system" OncoSimulR makes it simple to be explicit about the mapping of specific genotypes, while also using the "how this specific effects modifies previous effects" logic, leading to a flexible specification. This also means that in many cases the same fitness effects can be specified in several different ways. Most of the rest of this section is devoted to explaining how to combine those pieces. Before that, however, we need to discuss the fitness model we use.  4ac900b7   6610ac26  ## Numeric values of fitness effects {#numfit} We evaluate fitness using the usual [@Zanini2012; @Gillespie1993; @Beerenwinkel2007; @Datta2013] multiplicative model: fitness is $\prod (1 + s_i)$ where $s_i$ is the fitness effect of gene (or gene interaction) $i$. In all models except Bozic, this fitness refers to the growth rate (the death rate being fixed to 1[^2]). The original model of @McFarland2013 has a slightly different parameterization, but you can go easily from one to the other (see section \@ref(mcfl)). For the Bozic model [@Bozic2010], however, the birth rate is set to 1, and the death rate then becomes $\prod (1 - s_i)$. [^2]: You can change this if you really want to.  ab439943   6610ac26  ### McFarland parameterization {#mcfl} In the original model of @McFarland2013, the effects of drivers contribute to the numerator of the birth rate, and those of the (deleterious) passengers to the denominator as: $\frac{(1 +  c3eeacb4  s)^d}{(1 + s_p)^p}$, where $d$ and $p$ are, respectively, the total  6610ac26  number of drivers and passengers in a genotype, and here the fitness effects of all drivers is the same ($s$) and that of all passengers  c3eeacb4  the same too ($s_p$). Note that, as written above, and as explicitly said in @McFarland2013 (see p. 2911) and @McFarland2014-phd (see p. 9), "(...) $s_p$ is the fitness disadvantage conferred by a passenger". In other words, the larger the $s_p$ the more deleterious the passenger. This is obvious, but I make it explicit because in our parameterization a positive $s$ means fitness advantage, whereas fitness disadvantages are associated with negative $s$. Of course, if you rewrite the above expression as $\frac{(1 + s)^d}{(1 - s_p)^p}$ then we are back to the "positive means fitness advantage and negative means fitness disadvantage". As @McFarland2014-phd explains (see p.\ 9, bottom), we can rewrite the above expression so that there are no terms in the denominator. McFarland writes it as (I copy verbatim from the fourth and fifth lines from the bottom on his p.\ 9) $(1 + s_d)^{n_d} (1 - s_p^{'})^{n_p}$ where $s_p^{'} = s_p/(1 + s_p)$. However, if we want to express everything as products (no ratios) and use the "positive s means advantage and negative s means disadvantage" rule, we want to write the above expression as $(1 + s_d)^{n_d} (1 + s_{pp})^{n_p}$ where $s_{pp} = -s_p/(1 + s_p)$. And this is actually what we do in v.2. There is an example, for instance, in section \@ref(mcf5070) where you will see: {r mcflparam} sp <- 1e-3 spp <- -sp/(1 + sp)  so we are going from the "(...) $s_p$ is the fitness disadvantage conferred by a passenger" in @McFarland2013 (p. 2911) and @McFarland2014-phd (p. 9) to the expression where we have a product $\prod (1 + s_i)$, with the "positive s means advantage and negative s means disadvantage" rule. This reparameterization applies to v.2. In v.1 we use the same parameterization as in the original one in @McFarland2013, but with the "positive s means advantage and negative s means disadvantage" rule (so we are using expression $\frac{(1 + s)^d}{(1 - s_p)^p}$).  6610ac26   fc38a875  For death rate, we use the expression that @McFarland2013 (see their p. 2911) use "(...) for large cancers (grown to $10^6$ cells)": $D(N) = \log(1 + N/K)$ where $K$ is the initial equilibrium population size. As the authors explain, for large N/K the above expression "(...) recapitulates Gompertzian dynamics observed experimentally for large tumors". By default, OncoSimulR uses a value of $K=initSize/(e^{1} - 1)$ so that the starting population is at equilibrium.  6610ac26  ### No viability of clones and types of models {#noviab}  fc38a875  For all models where fitness affects directly the birth rate (all except Bozic), if you specify that some event (say, mutating gene A) has $s_A \le -1$, if that event happens then birth rate becomes zero. This is taken to indicate that the clone is not even viable and thus disappears immediately without any chance for mutation[^3].  6610ac26  [^3]:This is a shortcut that we take because we think that it is what you mean. Note, however, that technically a clone with birth rate of 0 might have a non-zero probability of mutating before becoming extinct because in the continuous time model we use mutation is not linked to reproduction. In the present code, we are not allowing for any mutation when birth rate is 0. There are other options, but none which I find really better. An alternative implementation makes a clone immediately extinct if and only if any of the $s_i = -\infty$. However, we still need to handle the case with $s_i < -1$ as a special case. We either make it identical to the case with any $s_i = -\infty$ or for any $s_i > -\infty$ we set $(1 + s_i) = \max(0, 1 + s_i)$ (i.e., if $s_i < -1$ then $(1 + s_i) = 0$), to avoid obtaining negative birth rates (that make no sense) and the problem of multiplying an even number of negative numbers. I think only the second would make sense as an alternative. Models based on Bozic, however, have a birth rate of 1 and mutations affect the death rate. In this case, a death rate larger than birth rate,  fc38a875  *per se*, does not signal immediate extinction and, moreover, even for death  6610ac26  rates that are a few times larger than birth rates, the clone could mutate before becoming extinct[^4]. [^4]:We said "a few times". For a clone of population size 1 ---which is the size at which all clones start from mutation---, if death rate is, say, 90 but birth rate is 1, the probability of mutating before becoming extinct is very, very close to zero for all reasonable values of mutation rate}. How do we signal immediate extinction or no viability in this case? You can set the value of $s = -\infty$. In general, if you want to identify some mutations or some combinations of mutations as leading to immediate extinction (i.e., no viability), of the affected clone, set it to $-\infty$ as this would work even if how birth rates of 0 are handled changes. Most examples below evaluate fitness by its effects on the birth rate. You can see one where we do it both ways in Section \@ref(fit-neg-pos). ## Genes without interactions {#noint}  fc38a875  This is a simple scenario. Each gene $i$ has a fitness effect $s_i$ if  6610ac26  mutated. The $s_i$ can come from any distribution you want. As an example let's use three genes. We know there are no order effects, but we will also see what happens if we examine genotypes as ordered. {r} ai1 <- evalAllGenotypes(allFitnessEffects( noIntGenes = c(0.05, -.2, .1)), order = FALSE)  We can easily verify the first results: {r} ai1  {r} all(ai1[, "Fitness"] == c( (1 + .05), (1 - .2), (1 + .1), (1 + .05) * (1 - .2), (1 + .05) * (1 + .1), (1 - .2) * (1 + .1), (1 + .05) * (1 - .2) * (1 + .1)))  And we can see that considering the order of mutations (see section \@ref(oe)) makes no difference: {r} (ai2 <- evalAllGenotypes(allFitnessEffects( noIntGenes = c(0.05, -.2, .1)), order = TRUE, addwt = TRUE))  (The meaning of the notation in the output table is as follows: "WT" denotes the wild-type, or non-mutated clone. The notation $x > y$ means that a mutation in "x" happened before a mutation in "y". A genotype $x > y\ \_\ z$ means that a mutation in "x" happened before a mutation in "y"; there is also a mutation in "z", but that is a gene for which order does not matter). And what if I want genes without interactions but I want modules (see section \@ref(modules0))? Go to section \@ref(mod-no-epi). ## Using DAGs: Restrictions in the order of mutations as extended posets {#posetslong} ### AND, OR, XOR relationships {#andorxor}  fc38a875  The literature on Oncogenetic trees, CBNs, etc, has used graphs as a way  6610ac26  of showing the restrictions in the order in which mutations can accumulate. The meaning of "convergent arrows" in these graphs, however, differs. In Figure 1 of @Korsunsky2014 we are shown a simple diagram that illustrates the three basic different meanings of convergent arrows using two parental nodes. We will illustrate it here with three. Suppose we focus on node "g" in the following figure (we will create it shortly) {r, fig.height=4} data(examplesFitnessEffects) plot(examplesFitnessEffects[["cbn1"]])  * In relationships of the type used in **Conjunctive Bayesian Networks (CBN)** [e.g., @Gerstung2009], we are modeling an **AND** relationship, also called **CMPN** by @Korsunsky2014 or **monotone** relationship by @Farahani2013. If the relationship in the graph is fully respected, then "g" will only appear if all of "c", "d", and "e" are already mutated. * **Semimonotone** relationships *sensu* @Farahani2013 or **DMPN** *sensu* @Korsunsky2014 are **OR** relationships: "g" will appear if one or more of "c", "d", or "e" are already mutated. * **XMPN** relationships [@Korsunsky2014] are **XOR** relationships: "g" will be present only if exactly one of "c", "d", or "e" is present.  fc38a875  Note that Oncogenetic trees [@Desper1999JCB; @Szabo2008] need not  6610ac26  deal with the above distinctions, since the DAGs are trees: no node has more than one incoming connection or more than one parent[^5]. [^5]: OTs and CBNs have some other technical differences about the underlying model they assume, such as the exponential waiting time in CBNs. We will not discuss them here. To have a flexible way of specifying all of these restrictions, we will want to be able to say what kind of dependency each child node has on its parents. ### Fitness effects {#fitnessposets} Those DAGs specify dependencies and, as explained in @Diaz-Uriarte2015, it is simple to map them to a simple evolutionary model: any set of mutations that does not conform to the restrictions encoded in the graph will have a fitness of 0. However, we might not want to require absolute compliance with the DAG. This means we might want to allow deviations from the DAG with a corresponding penalization that is, however, not identical to setting fitness to 0 [again, see @Diaz-Uriarte2015]. This we can do by being explicit about the fitness effects of the deviations from the restrictions encoded in the DAG. We will use below a column of s for the fitness effect when the restrictions are satisfied and a column of sh when they are not. (See also \@ref(numfit) for the details about the meaning of the fitness effects). That way of specifying fitness effects makes it also trivial to use the model in @Hjelm2006 where all mutations might be allowed to occur, but the presence of some mutations increases the probability of occurrence of other mutations. For example, the values of sh could be all small positive ones (or for mildly deleterious effects, small negative numbers), while the values of s are much larger positive numbers. ### Extended posets In version 1 of this package we used posets in the sense of @Beerenwinkel2007 and @Gerstung2009, as explained in section  fc38a875  \@ref(poset) and in the help for poset. Here, we continue using  6610ac26  two columns, that specify parents and children, but we add columns for the specific values of fitness effects (both s and sh ---i.e., fitness effects for what happens when restrictions are and are not satisfied) and for the type of dependency as explained in section \@ref(andorxor). We can now illustrate the specification of different fitness effects using DAGs. ### DAGs: A first conjunction (AND) example {#cbn1} {r} cs <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"), child = c("a", "b", "d", "e", "c", "c", rep("g", 3)), s = 0.1, sh = -0.9, typeDep = "MN") cbn1 <- allFitnessEffects(cs)  (We skip one letter, just to show that names need not be consecutive or have any particular order.) We can get a graphical representation using the default "graphNEL" {r, fig.height=3} plot(cbn1)  or one using "igraph": {r, fig.height=5} plot(cbn1, "igraph")  Since we have a parent and children, the reingold.tilford layout is probably the best here, so you might want to use that: {r, fig.height=5} library(igraph) ## to make the reingold.tilford layout available plot(cbn1, "igraph", layout = layout.reingold.tilford)  And what is the fitness of all genotypes? {r} gfs <- evalAllGenotypes(cbn1, order = FALSE, addwt = TRUE) gfs[1:15, ]  You can verify that for each genotype, if a mutation is present without all of its dependencies present, you get a $(1 - 0.9)$ multiplier, and you get a $(1 + 0.1)$ multiplier for all the rest with its direct parents satisfied. For example, genotypes "a", or "b", or "d", or "e" have fitness $(1 + 0.1)$, genotype "a, b, c" has fitness $(1 + 0.1)^3$, but genotype "a, c" has fitness $(1 + 0.1) (1 - 0.9) = 0.11$. ### DAGs: A second conjunction example {#cbn2} Let's try a first attempt at a somewhat more complex example, where the fitness consequences of different genes differ. {r} c1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"), child = c("a", "b", "d", "e", "c", "c", rep("g", 3)), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)), sh = c(rep(0, 4), c(-.1, -.2), c(-.05, -.06, -.07)), typeDep = "MN") try(fc1 <- allFitnessEffects(c1))  If you try this, you'll get an error. There is an error because the "sh" varies within a child, and we do not allow that for a poset-type specification, as it is ambiguous. If you need arbitrary fitness values for arbitrary combinations of genotypes, you can specify them using epistatic effects as in section \@ref(epi) and order effects as in section \@ref(oe). Why do we need to specify as many "s" and "sh" as there are rows (or a single one, that gets expanded to those many) when the "s" and "sh" are properties of the child node, not of the edges? Because, for ease, we use a data.frame. We fix the error in our specification. Notice that the "sh" is not set to $-1$ in these examples. If you want strict compliance with the poset restrictions, you should set $sh = -1$ or, better yet, $sh = -\infty$ (see section \@ref(noviab)), but having an $sh > -1$ will lead to fitnesses that are $> 0$ and, thus, is a way of modeling small deviations from the poset [see discussion in @Diaz-Uriarte2015]. Note that for those nodes that depend only on "Root" the type of dependency is irrelevant. {r} c1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"), child = c("a", "b", "d", "e", "c", "c", rep("g", 3)), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)), sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)), typeDep = "MN") cbn2 <- allFitnessEffects(c1)  We could get graphical representations but the figures would be the same as in the example in section \@ref(cbn1), since the structure has not changed, only the numeric values. What is the fitness of all possible genotypes? Here, order of events *per se* does not matter, beyond that considered in the poset. In other words, the fitness of genotype "a, b, c" is the same no matter how we got to "a, b, c". What matters is whether or not the genes on which each of "a", "b", and "c" depend are present or not (I only show the first 10 genotypes) {r} gcbn2 <- evalAllGenotypes(cbn2, order = FALSE) gcbn2[1:10, ]  Of course, if we were to look at genotypes but taking into account order of occurrence of mutations, we would see no differences {r} gcbn2o <- evalAllGenotypes(cbn2, order = TRUE, max = 1956) gcbn2o[1:10, ]  (The $max = 1956$ is there so that we show all the genotypes, even if they are more than 256, the default.) You can check the output and verify things are as they should. For instance: {r} all.equal( gcbn2[c(1:21, 22, 28, 41, 44, 56, 63 ) , "Fitness"], c(1.01, 1.02, 0.1, 1.03, 1.04, 0.05, 1.01 * c(1.02, 0.1, 1.03, 1.04, 0.05), 1.02 * c(0.10, 1.03, 1.04, 0.05), 0.1 * c(1.03, 1.04, 0.05), 1.03 * c(1.04, 0.05), 1.04 * 0.05, 1.01 * 1.02 * 1.1, 1.01 * 0.1 * 0.05, 1.03 * 1.04 * 0.05, 1.01 * 1.02 * 1.1 * 0.05, 1.03 * 1.04 * 1.2 * 0.1, ## notice this 1.01 * 1.02 * 1.03 * 1.04 * 1.1 * 1.2 ))  A particular one that is important to understand is genotype with mutated genes "c, d, e, g": {r} gcbn2[56, ] all.equal(gcbn2[56, "Fitness"], 1.03 * 1.04 * 1.2 * 0.10)  where "g" is taken as if its dependencies are satisfied (as "c",  4ac900b7  "d", and "e" are present) even when the dependencies of "c" are not  6610ac26  satisfied (and that is why the term for "c" is 0.9). ### DAGs: A semimonotone or "OR" example {#mn1} We will reuse the above example, changing the type of relationship: {r} s1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"), child = c("a", "b", "d", "e", "c", "c", rep("g", 3)), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)), sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)), typeDep = "SM") smn1 <- allFitnessEffects(s1)  It looks like this (where edges are shown in blue to denote the semimonotone relationship): {r, fig.height=3} plot(smn1)  {r} gsmn1 <- evalAllGenotypes(smn1, order = FALSE)  Having just one parental dependency satisfied is now enough, in contrast to what happened before. For instance: {r} gcbn2[c(8, 12, 22), ] gsmn1[c(8, 12, 22), ] gcbn2[c(20:21, 28), ] gsmn1[c(20:21, 28), ]  ### An "XMPN" or "XOR" example {#xor1} Again, we reuse the example above, changing the type of relationship: {r} x1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"), child = c("a", "b", "d", "e", "c", "c", rep("g", 3)), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)), sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)), typeDep = "XMPN") xor1 <- allFitnessEffects(x1)  It looks like this (edges in red to denote the "XOR" relationship): {r, fig.height=3} plot(xor1)  {r} gxor1 <- evalAllGenotypes(xor1, order = FALSE)  Whenever "c" is present with both "a" and "b", the fitness component for "c" will be $(1 - 0.1)$. Similarly for "g" (if more than one of  4ac900b7  "d", "e", or "c" is present, it will show as $(1 - 0.05)$). For example:  6610ac26  {r} gxor1[c(22, 41), ] c(1.01 * 1.02 * 0.1, 1.03 * 1.04 * 0.05)  However, having just both "a" and "b" is identical to the case with CBN and the monotone relationship (see sections \@ref(cbn2) and \@ref(mn1)). If you want the joint presence of "a" and "b" to result in different fitness than the product of the individual terms, without considering the presence of "c", you can specify that using general epistatic effects (section \@ref(epi)). We also see a very different pattern compared to CBN (section \@ref(cbn2)) here: {r} gxor1[28, ] 1.01 * 1.1 * 1.2  as exactly one of the dependencies for both "c" and "g" are satisfied. But {r} gxor1[44, ] 1.01 * 1.02 * 0.1 * 1.2  is the result of a $0.1$ for "c" (and a $1.2$ for "g" that has exactly one of its dependencies satisfied). ### Posets: the three types of relationships {#p3} {r} p3 <- data.frame( parent = c(rep("Root", 4), "a", "b", "d", "e", "c", "f"), child = c("a", "b", "d", "e", "c", "c", "f", "f", "g", "g"), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3), sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)), typeDep = c(rep("--", 4), "XMPN", "XMPN", "MN", "MN", "SM", "SM")) fp3 <- allFitnessEffects(p3)  This is how it looks like: {r, fig.height=3} plot(fp3)  We can also use "igraph": {r, fig.height=6} plot(fp3, "igraph", layout.reingold.tilford)  {r} gfp3 <- evalAllGenotypes(fp3, order = FALSE)  Let's look at a few: {r} gfp3[c(9, 24, 29, 59, 60, 66, 119, 120, 126, 127), ] c(1.01 * 1.1, 1.03 * .05, 1.01 * 1.02 * 0.1, 0.1 * 0.05 * 1.3, 1.03 * 1.04 * 1.2, 1.01 * 1.02 * 0.1 * 0.05, 0.1 * 1.03 * 1.04 * 1.2 * 1.3, 1.01 * 1.02 * 0.1 * 1.03 * 1.04 * 1.2, 1.02 * 1.1 * 1.03 * 1.04 * 1.2 * 1.3, 1.01 * 1.02 * 1.03 * 1.04 * 0.1 * 1.2 * 1.3)  As before, looking at the order of mutations makes no difference (look at the test directory to see a test that verifies this assertion). ## Modules {#modules0}  f9f9ab96  As already mentioned, we can think of all the effects of fitness in terms  6610ac26  not of individual genes but, rather, modules. This idea is discussed in, for example, @Raphael2014a, @Gerstung2011: the restrictions encoded in, say, the DAGs can be considered to apply not to genes, but to modules, where each module is a set of genes (and the intersection between modules is the empty set). Modules, then, play the role of a "union operation" over sets of genes. Of course, if we can use modules for the restrictions in the DAGs we should also be able to use them for epistasis and order effects, as we will see later (e.g., \@ref(oemod)). ### What does a module provide {#module-what-for} Modules can provide very compact ways of specifying relationships when you want to, well, model the existence of modules. For simplicity suppose there is a module, "A", made of genes "a1" and "a2", and a module "B", made of a single gene "b1". Module "B" can mutate if module "A" is mutated, but mutating both "a1" and "a2" provides no additional fitness advantage compared to mutating only a single one of them. We can specify this as: {r} s <- 0.2 sboth <- (1/(1 + s)) - 1 m0 <- allFitnessEffects(data.frame( parent = c("Root", "Root", "a1", "a2"), child = c("a1", "a2", "b", "b"), s = s, sh = -1, typeDep = "OR"), epistasis = c("a1:a2" = sboth)) evalAllGenotypes(m0, order = FALSE, addwt = TRUE)  Note that we need to add an epistasis term, with value "sboth" to capture the idea of "mutating both "a1" and "a2" provides no additional fitness advantage compared to mutating only a single one of them"; see details in section \@ref(epi). Now, specify it using modules: {r} s <- 0.2 m1 <- allFitnessEffects(data.frame( parent = c("Root", "A"), child = c("A", "B"), s = s, sh = -1, typeDep = "OR"), geneToModule = c("Root" = "Root", "A" = "a1, a2", "B" = "b1")) evalAllGenotypes(m1, order = FALSE, addwt = TRUE)  This captures the ideas directly. The typing savings here are small, but they can be large with modules with many genes. ### Specifying modules {#modules} How do you specify modules? The general procedure is simple: you pass a vector that makes explicit the mapping from modules to sets of genes. We just saw an example. There are several additional examples such as \@ref(pm3), \@ref(oemod), \@ref(epimod). It is important to note that, once you specify modules, we expect all of the relationships (except those that involve the non interacting genes) to be specified as modules. Thus, all elements of the epistasis, posets (the DAGs) and order effects components should be specified in terms of modules. But you can, of course, specify a module as containing a single gene (and a single gene with the same name as the module). What about the "Root" node? If you use a "restriction table", that restriction table (that DAG) must have a node named "Root" and in the mapping of genes to module there **must** be a first entry that has a module and gene named "Root", as we saw above with geneToModule = c("Root" = "Root", .... We force you to do this to be explicit about the "Root" node. This is not needed (thought it does not hurt) with other fitness specifications. For instance, if we have a model with two modules, one of them with two genes (see details in section \@ref(mod-no-epi)) we do not need to pass a "Root" as in {r} fnme <- allFitnessEffects(epistasis = c("A" = 0.1, "B" = 0.2), geneToModule = c("A" = "a1, a2", "B" = "b1")) evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)  but it is also OK to have a "Root" in the geneToModule: {r} fnme2 <- allFitnessEffects(epistasis = c("A" = 0.1, "B" = 0.2), geneToModule = c( "Root" = "Root", "A" = "a1, a2", "B" = "b1")) evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)  ### Modules and posets again: the three types of relationships and modules {#pm3} We use the same specification of poset, but add modules. To keep it manageable, we only add a few genes for some modules, and have some modules with a single gene. Beware that the number of genotypes is starting to grow quite fast, though. We capitalize to differentiate modules (capital letters) from genes (lowercase with a number), but this is not needed. {r} p4 <- data.frame( parent = c(rep("Root", 4), "A", "B", "D", "E", "C", "F"), child = c("A", "B", "D", "E", "C", "C", "F", "F", "G", "G"), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3), sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)), typeDep = c(rep("--", 4), "XMPN", "XMPN", "MN", "MN", "SM", "SM")) fp4m <- allFitnessEffects( p4, geneToModule = c("Root" = "Root", "A" = "a1", "B" = "b1, b2", "C" = "c1", "D" = "d1, d2", "E" = "e1", "F" = "f1, f2", "G" = "g1"))  By default, plotting shows the modules: {r, fig.height=3} plot(fp4m)  but we can show the gene names instead of the module names: {r, fig.height=3} plot(fp4m, expandModules = TRUE)  or  34c1b54f  {r, fig.height=6}  6610ac26  plot(fp4m, "igraph", layout = layout.reingold.tilford, expandModules = TRUE)  We obtain the fitness of all genotypes in the usual way: {r} gfp4 <- evalAllGenotypes(fp4m, order = FALSE, max = 1024)  Let's look at a few of those: {r} gfp4[c(12, 20, 21, 40, 41, 46, 50, 55, 64, 92, 155, 157, 163, 372, 632, 828), ] c(1.01 * 1.02, 1.02, 1.02 * 1.1, 0.1 * 1.3, 1.03, 1.03 * 1.04, 1.04 * 0.05, 0.05 * 1.3, 1.01 * 1.02 * 0.1, 1.02 * 1.1, 0.1 * 0.05 * 1.3, 1.03 * 0.05, 1.03 * 0.05, 1.03 * 1.04 * 1.2, 1.03 * 1.04 * 1.2, 1.02 * 1.1 * 1.03 * 1.04 * 1.2 * 1.3)  ## Order effects {#oe}  bbd76086  As explained in the introduction (section \@ref(introdd)), by order effects we  6610ac26  mean a phenomenon such as the one shown empirically by @Ortmann2015: the fitness of a double mutant "A", "B" is different depending on whether "A" was acquired before "B" or "B" before "A". This, of course, can be generalized to more than two genes.  f9f9ab96  Note that order effects are different from the restrictions in the order of accumulation of mutations discussed in section \@ref(posetslong). With restrictions in the order of accumulation of mutations we might say that acquiring "B" depends or is facilitated by having "A" mutated (and, unless we allowed for multiple mutations, having "A" mutated means having "A" mutated before "B"). However, once you have the genotype "A, B", its fitness does not depend on the order in which "A" and "B" appeared.  6610ac26   fc38a875  ### Order effects: three-gene orders {#oeftres}  6610ac26  Consider this case, where three specific three-gene orders and two two-gene orders (one of them a subset of one of the three) lead to different fitness compared to the wild-type. We add also modules, to show its usage (but just limit ourselves to using one gene per module here).  f9f9ab96  Order effects are specified using a $x > y$, which means that that order  6610ac26  effect is satisfied when module $x$ is mutated before module $y$. {r} o3 <- allFitnessEffects(orderEffects = c( "F > D > M" = -0.3, "D > F > M" = 0.4, "D > M > F" = 0.2, "D > M" = 0.1, "M > D" = 0.5), geneToModule = c("M" = "m", "F" = "f", "D" = "d") ) (ag <- evalAllGenotypes(o3, addwt = TRUE, order = TRUE))  (The meaning of the notation in the output table is as follows: "WT" denotes the wild-type, or non-mutated clone. The notation $x > y$ means that a mutation in "x" happened before a mutation in "y". A genotype $x > y\ \_\ z$ means that a mutation in "x" happened before a mutation in "y"; there is also a mutation in "z", but that is a gene for which order does not matter). The values for the first nine genotypes come directly from the fitness specifications. The 10th genotype matches $D > F > M$ ($= (1 + 0.4)$) but also $D > M$ ($(1 + 0.1)$). The 11th matches $D > M > F$ and $D > M$. The 12th matches $F > D > M$ but also $D > M$. Etc. ### Order effects and modules with multiple genes {#oemod} Consider the following case: {r} ofe1 <- allFitnessEffects( orderEffects = c("F > D" = -0.3, "D > F" = 0.4), geneToModule = c("F" = "f1, f2", "D" = "d1, d2") ) ag <- evalAllGenotypes(ofe1, order = TRUE)  There are four genes, $d1, d2, f1, f2$, where each $d$ belongs to module $D$ and each $f$ belongs to module $F$. What to expect for cases such as $d1 > f1$ or $f1 > d1$ is clear, as shown in {r} ag[5:16,]  Likewise, cases such as $d1 > d2 > f1$ or $f2 > f1 > d1$ are clear, because in terms of modules they map to $D > F$ or $F > D$: the observed order of mutation $d1 > d2 > f1$ means that module $D$ was mutated first and module $F$ was mutated second. Similar for $d1 > f1 > f2$ or $f1 > d1 > d2$: those map to $D > F$ and $F > D$. We can see the fitness of those four case in: {r} ag[c(17, 39, 19, 29), ]  and they correspond to the values of those order effects, where $F > D = (1 - 0.3)$ and $D > F = (1 + 0.4)$: {r} ag[c(17, 39, 19, 29), "Fitness"] == c(1.4, 0.7, 1.4, 0.7)  What if we match several patterns? For example, $d1 > f1 > d2 > f2$ and $d1 > f1 > f2 > d2$? The first maps to $D > F > D > F$ and the second to $D > F > D$. But since we are concerned with which one happened first and which happened second we should expect those two to correspond to the same fitness, that of pattern $D > F$, as is the case: {r} ag[c(43, 44),] ag[c(43, 44), "Fitness"] == c(1.4, 1.4)  More generally, that applies to all the patterns that start with one of the "d" genes: {r} all(ag[41:52, "Fitness"] == 1.4)  Similar arguments apply to the opposite pattern, $F > D$, which apply to all the possible gene mutation orders that start with one of the "f" genes. For example: {r} all(ag[53:64, "Fitness"] == 0.7)  ### Order and modules with 325 genotypes We can of course have more than two genes per module. This just repeats the above, with five genes (there are 325 genotypes, and that is why we  fc38a875  pass the "max" argument to evalAllGenotypes, to allow for  6610ac26  more than the default 256). {r} ofe2 <- allFitnessEffects( orderEffects = c("F > D" = -0.3, "D > F" = 0.4), geneToModule = c("F" = "f1, f2, f3", "D" = "d1, d2") ) ag2 <- evalAllGenotypes(ofe2, max = 325, order = TRUE)  We can verify that any combination that starts with a "d" gene and then contains at least one "f" gene will have a fitness of $1+0.4$. And any combination that starts with an "f" gene and contains at least one "d" genes will have a fitness of $1 - 0.3$. All other genotypes have a fitness of 1: {r} all(ag2[grep("^d.*f.*", ag2[, 1]), "Fitness"] == 1.4) all(ag2[grep("^f.*d.*", ag2[, 1]), "Fitness"] == 0.7) oe <- c(grep("^f.*d.*", ag2[, 1]), grep("^d.*f.*", ag2[, 1])) all(ag2[-oe, "Fitness"] == 1)  ### Order effects and genes without interactions We will now look at both order effects and interactions. To make things more interesting, we name genes so that the ordered names do split nicely between those with and those without order effects (this, thus, also serves as a test of messy orders of names). {r} foi1 <- allFitnessEffects( orderEffects = c("D>B" = -0.2, "B > D" = 0.3), noIntGenes = c("A" = 0.05, "C" = -.2, "E" = .1))  You can get a verbose view of what the gene names and modules are (and their automatically created numeric codes) by: {r} foi1[c("geneModule", "long.geneNoInt")]  We can get the fitness of all genotypes (we set $max = 325$ because that is the number of possible genotypes): {r} agoi1 <- evalAllGenotypes(foi1, max = 325, order = TRUE) head(agoi1)  Now: {r} rn <- 1:nrow(agoi1) names(rn) <- agoi1[, 1] agoi1[rn[LETTERS[1:5]], "Fitness"] == c(1.05, 1, 0.8, 1, 1.1)  According to the fitness effects we have specified, we also know that any genotype with only two mutations, one of which is either "A", "C" "E" and the other is "B" or "D" will have the fitness corresponding to "A", "C" or "E", respectively: {r} agoi1[grep("^A > [BD]$", names(rn)), "Fitness"] == 1.05 agoi1[grep("^C > [BD]$", names(rn)), "Fitness"] == 0.8 agoi1[grep("^E > [BD]$", names(rn)), "Fitness"] == 1.1 agoi1[grep("^[BD] > A$", names(rn)), "Fitness"] == 1.05 agoi1[grep("^[BD] > C$", names(rn)), "Fitness"] == 0.8 agoi1[grep("^[BD] > E$", names(rn)), "Fitness"] == 1.1  We will not be playing many additional games with regular expressions, but let us check those that start with "D" and have all the other mutations, which occupy rows 230 to 253; fitness should be equal (within numerical error, because of floating point arithmetic) to the order effect of "D" before "B" times the other effects $(1 - 0.3) * 1.05 * 0.8 * 1.1 = 0.7392$ {r} all.equal(agoi1[230:253, "Fitness"] , rep((1 - 0.2) * 1.05 * 0.8 * 1.1, 24))  and that will also be the value of any genotype with the five mutations where "D" comes before "B" such as those in rows 260 to 265, 277, or 322 and 323, but it will be equal to $(1 + 0.3) * 1.05 * 0.8 * 1.1 = 1.2012$ in those where "B" comes before "D". Analogous arguments apply to four, three, and two mutation genotypes.  f9f9ab96  ## Epistasis {#epi}  6610ac26   f9f9ab96  ### Epistasis: two alternative specifications {#e2}  6610ac26   f9f9ab96  We want the following mapping of genotypes to fitness:  6610ac26   f9f9ab96  ------------------------ A B Fitness -- -- -------------- wt wt 1  6610ac26   f9f9ab96  wt M $1 + s_b$  6610ac26   f9f9ab96  M wt $1 + s_a$  6610ac26   f9f9ab96  M M $1 + s_{ab}$ -- -- --------------- Suppose that the actual numerical values are $s_a = 0.2, s_b = 0.3, s_{ab} = 0.7$.  6610ac26   f9f9ab96  We specify the above as follows: {r} sa <- 0.2 sb <- 0.3 sab <- 0.7  6610ac26   f9f9ab96  e2 <- allFitnessEffects(epistasis = c("A: -B" = sa, "-A:B" = sb, "A : B" = sab)) evalAllGenotypes(e2, order = FALSE, addwt = TRUE)  6610ac26   f9f9ab96    6610ac26   f9f9ab96  That uses the "-" specification, so we explicitly exclude some patterns: with "A:-B" we say "A when there is no B".  6610ac26   f9f9ab96  But we can also use a specification where we do not use the "-". That requires a different numerical value of the interaction, because now, as we are rewriting the interaction term as genotype "A is mutant, B is mutant" the double mutant will incorporate the effects of "A mutant",  4ac900b7  "B mutant" and "both A and B mutants". We can define a new $s_2$ that  f9f9ab96  satisfies $(1 + s_{ab}) = (1 + s_a) (1 + s_b) (1 + s_2)$ so $(1 + s_2) = (1 + s_{ab})/((1 + s_a) (1 + s_b))$ and therefore specify as:  6610ac26   f9f9ab96  {r} s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1  6610ac26   f9f9ab96  e3 <- allFitnessEffects(epistasis = c("A" = sa, "B" = sb, "A : B" = s2)) evalAllGenotypes(e3, order = FALSE, addwt = TRUE)  6610ac26    f9f9ab96  Note that this is the way you would specify effects with FFPopsim [@Zanini2012]. Whether this specification or the previous one with "-" is simpler will depend on the model. For synthetic mortality and viability, I think the one using "-" is simpler to map genotype tables to fitness effects. See also section \@ref(e3) and \@ref(theminus) and the example in section \@ref(weis1b).  6610ac26   f9f9ab96  Finally, note that we can also specify some of these effects by combining the graph and the epistasis, as shown in section \@ref(misra1a) or \@ref(weis1b).  6610ac26   f9f9ab96  ### Epistasis with three genes and two alternative specifications {#e3}  6610ac26   f9f9ab96  Suppose we have  6610ac26   f9f9ab96  ------------------------------------- A B C Fitness -- -- -- -------------------------- M wt wt $1 + s_a$ wt M wt $1 + s_b$  6610ac26   f9f9ab96  wt wt M $1 + s_c$  6610ac26   f9f9ab96  M M wt $1 + s_{ab}$  6610ac26   f9f9ab96  wt M M $1 + s_{bc}$  6610ac26   f9f9ab96  M wt M $(1 + s_a) (1 + s_c)$  6610ac26   f9f9ab96  M M M $1 + s_{abc}$ -- -- -- --------------------------  6610ac26   f9f9ab96  where missing rows have a fitness of 1 (they have been deleted for conciseness). Note that the mutant for exactly A and C has a fitness that is the product of the individual terms (so there is no epistasis in that case).  6610ac26  {r}  f9f9ab96  sa <- 0.1 sb <- 0.15 sc <- 0.2 sab <- 0.3 sbc <- -0.25 sabc <- 0.4  6610ac26   f9f9ab96  sac <- (1 + sa) * (1 + sc) - 1  6610ac26   f9f9ab96  E3A <- allFitnessEffects(epistasis = c("A:-B:-C" = sa, "-A:B:-C" = sb, "-A:-B:C" = sc, "A:B:-C" = sab, "-A:B:C" = sbc, "A:-B:C" = sac, "A : B : C" = sabc) )  6610ac26   f9f9ab96  evalAllGenotypes(E3A, order = FALSE, addwt = FALSE)  6610ac26   f9f9ab96    6610ac26   f9f9ab96  We needed to pass the $s_{ac}$ coefficient explicitly, even if it that term was just the product. We can try to avoid using the "-", however (but we will need to do other calculations). For simplicity, I use capital  4ac900b7  "S" in what follows where the letters differ from the previous  f9f9ab96  specification:  6610ac26   f9f9ab96  {r}  6610ac26   f9f9ab96  sa <- 0.1 sb <- 0.15 sc <- 0.2 sab <- 0.3 Sab <- ( (1 + sab)/((1 + sa) * (1 + sb))) - 1 Sbc <- ( (1 + sbc)/((1 + sb) * (1 + sc))) - 1 Sabc <- ( (1 + sabc)/ ( (1 + sa) * (1 + sb) * (1 + sc) * (1 + Sab) * (1 + Sbc) ) ) - 1  6610ac26   f9f9ab96  E3B <- allFitnessEffects(epistasis = c("A" = sa, "B" = sb, "C" = sc, "A:B" = Sab, "B:C" = Sbc, ## "A:C" = sac, ## not needed now "A : B : C" = Sabc) ) evalAllGenotypes(E3B, order = FALSE, addwt = FALSE)  6610ac26   f9f9ab96    6610ac26   f9f9ab96  The above two are, of course, identical:  6610ac26   f9f9ab96  {r} all(evalAllGenotypes(E3A, order = FALSE, addwt = FALSE) == evalAllGenotypes(E3B, order = FALSE, addwt = FALSE))   6610ac26   f9f9ab96  We avoid specifying the "A:C", as it just follows from the individual "A" and "C" terms, but given a specified genotype table, we need to do a little bit of addition and multiplication to get the coefficients.  6610ac26   f9f9ab96  ### Why can we specify some effects with a "-"? {#theminus}  6610ac26   f9f9ab96  Let's suppose we want to specify the synthetic viability example seen before:  6610ac26   f9f9ab96  ----------------- A B Fitness -- -- ---------- wt wt 1  6610ac26   f9f9ab96  wt M 0  6610ac26   f9f9ab96  M wt 0  6610ac26   f9f9ab96  M M (1 + s) -- -- --------  6610ac26   f9f9ab96  where "wt" denotes wild type and "M" denotes mutant. If you want to directly map the above table to the fitness table for the program, to specify the genotype "A is wt, B is a mutant" you can specify it as "-A,B", not just as "B". Why? Because just the presence of a "B" is also compatible with genotype "A is mutant and B is mutant". If you use "-" you are explicitly saying what should not be there so that "-A,B" is NOT compatible with "A, B". Otherwise, you need to carefully add coefficients. Depending on what you are trying to model, different specifications might be simpler. See the examples in section \@ref(e2) and \@ref(e3). You have both options.  6610ac26   f9f9ab96  ### Epistasis: modules {#epimod} There is nothing conceptually new, but we will show an example here:  6610ac26  {r}  f9f9ab96  sa <- 0.2 sb <- 0.3 sab <- 0.7  6610ac26   f9f9ab96  em <- allFitnessEffects(epistasis = c("A: -B" = sa, "-A:B" = sb, "A : B" = sab), geneToModule = c("A" = "a1, a2", "B" = "b1, b2")) evalAllGenotypes(em, order = FALSE, addwt = TRUE)  6610ac26    f9f9ab96  Of course, we can do the same thing without using the "-", as in section \@ref(e2):  6610ac26   f9f9ab96  {r} s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1  6610ac26   f9f9ab96  em2 <- allFitnessEffects(epistasis = c("A" = sa, "B" = sb, "A : B" = s2), geneToModule = c("A" = "a1, a2", "B" = "b1, b2") ) evalAllGenotypes(em2, order = FALSE, addwt = TRUE)  6610ac26   f9f9ab96    6610ac26   f9f9ab96  ## I do not want epistasis, but I want modules! {#mod-no-epi}  6610ac26   f9f9ab96  Sometimes you might want something like having several modules, say "A" and "B", each with a number of genes, but with "A" and "B" showing no interaction.  6610ac26   f9f9ab96  It is a terminological issue whether we should allow noIntGenes (no interaction genes), as explained in section \@ref(noint) to actually be modules. The reasoning for not allowing them is that the situation depicted above (several genes in module A, for example) actually is one of interaction: the members of "A" are combined using an "OR" operator (i.e., the fitness consequences of having one or more genes of A mutated are the same), not just simply multiplying their fitness; similarly for  4ac900b7  "B". This is why no interaction genes also mean no modules allowed.  6610ac26   f9f9ab96  So how do you get what you want in this case? Enter the names of the modules in the epistasis component but have no term for ":" (the colon). Let's see an example:  6610ac26  {r}  f9f9ab96  fnme <- allFitnessEffects(epistasis = c("A" = 0.1, "B" = 0.2), geneToModule = c("A" = "a1, a2", "B" = "b1, b2, b3")) evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)  6610ac26    f9f9ab96  In previous versions these was possible using the longer, still accepted way of specifying a : with a value of 0, but this is no longer needed:  6610ac26  {r}  f9f9ab96  fnme <- allFitnessEffects(epistasis = c("A" = 0.1, "B" = 0.2, "A : B" = 0.0), geneToModule = c("A" = "a1, a2", "B" = "b1, b2, b3")) evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)  6610ac26    f9f9ab96  This can, of course, be extended to more modules.  6610ac26   f9f9ab96  ## Synthetic viability {#sv}  6610ac26   f9f9ab96  Synthetic viability and synthetic lethality [e.g., @Ashworth2011; @Hartman2001] are just special cases of epistasis (section \@ref(epi)) but we deal with them here separately.  6610ac26   f9f9ab96  ### A simple synthetic viability example A simple and extreme example of synthetic viability is shown in the following table, where the joint mutant has fitness larger than the wild type, but each single mutant is lethal.  6610ac26   f9f9ab96  --------------- A B Fitness -- -- ---------- wt wt 1  6610ac26   f9f9ab96  wt M 0  6610ac26   f9f9ab96  M wt 0  6610ac26   f9f9ab96  M M (1 + s) -- -- --------  6610ac26   f9f9ab96  where "wt" denotes wild type and "M" denotes mutant.  6610ac26   f9f9ab96  We can specify this (setting $s = 0.2$) as (I play around with spaces, to show there is a certain flexibility with them): {r} s <- 0.2 sv <- allFitnessEffects(epistasis = c("-A : B" = -1, "A : -B" = -1, "A:B" = s))  6610ac26    f9f9ab96  Now, let's look at all the genotypes (we use "addwt" to also get the wt, which by decree has fitness of 1), and disregard order:  6610ac26   f9f9ab96  {r} (asv <- evalAllGenotypes(sv, order = FALSE, addwt = TRUE))   6610ac26   f9f9ab96  Asking the program to consider the order of mutations of course makes no difference:  6610ac26   f9f9ab96  {r} evalAllGenotypes(sv, order = TRUE, addwt = TRUE)   6610ac26   f9f9ab96  Another example of synthetic viability is shown in section \@ref(misra1b).  6610ac26   f9f9ab96  Of course, if multiple simultaneous mutations are not possible in the simulations, it is not possible to go from the wildtype to the double mutant in this model where the single mutants are not viable.  6610ac26   f9f9ab96  ### Synthetic viability, non-zero fitness, and modules  6610ac26   f9f9ab96  This is a slightly more elaborate case, where there is one module and the single mutants have different fitness between themselves, which is non-zero. Without the modules, this is the same as in @Misra2014, Figure 1b, which we go over in section \@ref(misra).  6610ac26   f9f9ab96  --------------------- A B Fitness -- -- ------------- wt wt 1  6610ac26   f9f9ab96  wt M $1 + s_b$  6610ac26   f9f9ab96  M wt $1 + s_a$  6610ac26   f9f9ab96  M M $1 + s_{ab}$ -- -- ----------------  6610ac26   f9f9ab96  where $s_a, s_b < 0$ but $s_{ab} > 0$.  6610ac26   f9f9ab96  {r} sa <- -0.1 sb <- -0.2 sab <- 0.25 sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb, "A : -B" = sa, "A:B" = sab), geneToModule = c( "A" = "a1, a2", "B" = "b")) evalAllGenotypes(sv2, order = FALSE, addwt = TRUE)  6610ac26    f9f9ab96  And if we look at order, of course it makes no difference:  6610ac26  {r}  f9f9ab96  evalAllGenotypes(sv2, order = TRUE, addwt = TRUE)   6610ac26   f9f9ab96   6610ac26   f9f9ab96  ## Synthetic mortality or synthetic lethality {#sl} In contrast to section \@ref(sv), here the joint mutant has decreased viability: ------------------------- A B Fitness --- --- ------------- wt wt 1  6610ac26   f9f9ab96  wt M $1 + s_b$  6610ac26   f9f9ab96  M wt $1 + s_a$  6610ac26   f9f9ab96  M M $1 + s_{ab}$ -- -- ----------------  6610ac26   f9f9ab96  where $s_a, s_b > 0$ but $s_{ab} < 0$.  6610ac26   f9f9ab96  {r} sa <- 0.1 sb <- 0.2 sab <- -0.8 sm1 <- allFitnessEffects(epistasis = c("-A : B" = sb, "A : -B" = sa, "A:B" = sab)) evalAllGenotypes(sm1, order = FALSE, addwt = TRUE)  6610ac26   f9f9ab96    6610ac26   f9f9ab96  And if we look at order, of course it makes no difference:  6610ac26   f9f9ab96  {r} evalAllGenotypes(sm1, order = TRUE, addwt = TRUE)   6610ac26   f9f9ab96  ## Possible issues with Bozic model {#boznumissues}  6610ac26   f9f9ab96  ### Synthetic viability using Bozic model {#fit-neg-pos}  6610ac26   f9f9ab96  If we were to use the above specification with Bozic's models, we might not get what we think we should get:  6610ac26   f9f9ab96  {r} evalAllGenotypes(sv, order = FALSE, addwt = TRUE, model = "Bozic")   6610ac26   f9f9ab96  What gives here? The simulation code would alert you of this (see section  4ac900b7  \@ref(ex-0-death)) in this particular case because there are "-1",  f9f9ab96  which might indicate that this is not what you want. The problem is that you probably want the Death rate to be infinity (the birth rate was 0, so no clone viability, when we used birth rates ---section \@ref(noviab)).  6610ac26   f9f9ab96  Let us say so explicitly:  6610ac26  {r}  f9f9ab96  s <- 0.2 svB <- allFitnessEffects(epistasis = c("-A : B" = -Inf, "A : -B" = -Inf, "A:B" = s)) evalAllGenotypes(svB, order = FALSE, addwt = TRUE, model = "Bozic")   6610ac26   f9f9ab96  Likewise, values of $s$ larger than one have no effect beyond setting $s = 1$ (a single term of $(1 - 1)$ will drive the product to 0, and as we cannot allow negative death rates negative values are set to 0):  6610ac26  {r}  f9f9ab96  s <- 1 svB1 <- allFitnessEffects(epistasis = c("-A : B" = -Inf, "A : -B" = -Inf, "A:B" = s))  6610ac26   f9f9ab96  evalAllGenotypes(svB1, order = FALSE, addwt = TRUE, model = "Bozic")  6610ac26   f9f9ab96  s <- 3 svB3 <- allFitnessEffects(epistasis = c("-A : B" = -Inf, "A : -B" = -Inf, "A:B" = s))  6610ac26   f9f9ab96  evalAllGenotypes(svB3, order = FALSE, addwt = TRUE, model = "Bozic")  6610ac26   f9f9ab96    6610ac26   f9f9ab96  Of course, death rates of 0.0 are likely to lead to trouble down the road, when we actually conduct simulations (see section \@ref(ex-0-death)).  6610ac26   f9f9ab96  ### Numerical issues with death rates of 0 in Bozic model {#ex-0-death}  6610ac26   f9f9ab96  As we mentioned above (section \@ref(fit-neg-pos)) death rates of 0 can lead to trouble when using Bozic's model:  6610ac26  {r}  f9f9ab96  i1 <- allFitnessEffects(noIntGenes = c(1, 0.5)) evalAllGenotypes(i1, order = FALSE, addwt = TRUE, model = "Bozic") i1_b <- oncoSimulIndiv(i1, model = "Bozic")  6610ac26   f9f9ab96   Of course, there is no problem in using the above with other models:  6610ac26   f9f9ab96  {r} evalAllGenotypes(i1, order = FALSE, addwt = TRUE, model = "Exp") i1_e <- oncoSimulIndiv(i1, model = "Exp") summary(i1_e)  6610ac26   ## A longer example: Poset, epistasis, synthetic mortality and viability, order effects and genes without interactions, with some modules {#exlong} We will now put together a complex example. We will use the poset from section \@ref(pm3) but will also add: * Order effects that involve genes in the poset. In this case, if C happens before F, fitness decreases by $1 - 0.1$. If it happens the other way around, there is no effect on fitness beyond their individual contributions. * Order effects that involve two new modules, "H" and "I" (with genes "h1, h2" and "i1", respectively), so that if H happens before I fitness increases by $1 + 0.12$. * Synthetic mortality between modules "I" (already present in the epistatic interaction) and "J" (with genes "j1" and "j2"): the joint presence of these modules leads to cell death (fitness of 0). * Synthetic viability between modules "K" and "M" (with genes "k1", "k2" and "m1", respectively), so that their joint presence is viable but adds nothing to fitness (i.e., mutation of both has fitness $1$), whereas each single mutant has a fitness of $1 - 0.5$. * A set of 5 driver genes ($n1, \ldots, n5$) with fitness that comes from an exponential distribution with rate of 10. As we are specifying many different things, we will start by writing each set of effects separately: {r} p4 <- data.frame( parent = c(rep("Root", 4), "A", "B", "D", "E", "C", "F"), child = c("A", "B", "D", "E", "C", "C", "F", "F", "G", "G"), s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3), sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)), typeDep = c(rep("--", 4), "XMPN", "XMPN", "MN", "MN", "SM", "SM")) oe <- c("C > F" = -0.1, "H > I" = 0.12) sm <- c("I:J" = -1) sv <- c("-K:M" = -.5, "K:-M" = -.5) epist <- c(sm, sv) modules <- c("Root" = "Root", "A" = "a1", "B" = "b1, b2", "C" = "c1", "D" = "d1, d2", "E" = "e1", "F" = "f1, f2", "G" = "g1", "H" = "h1, h2", "I" = "i1", "J" = "j1, j2", "K" = "k1, k2", "M" = "m1")  fc38a875  set.seed(1) ## for reproducibility  6610ac26  noint <- rexp(5, 10) names(noint) <- paste0("n", 1:5) fea <- allFitnessEffects(rT = p4, epistasis = epist, orderEffects = oe, noIntGenes = noint, geneToModule = modules)  How does it look?  34c1b54f  {r, fig.height=5.5}  6610ac26  plot(fea)  or  34c1b54f  {r, fig.height=5.5}  6610ac26  plot(fea, "igraph")  We can, if we want, expand the modules using a "graphNEL" graph  34c1b54f  {r, fig.height=5.5}  6610ac26  plot(fea, expandModules = TRUE)  or an "igraph" one  34c1b54f  {r, fig.height=6.5}  6610ac26  plot(fea, "igraph", expandModules = TRUE)  We will not evaluate the fitness of all genotypes, since the number of all ordered genotypes is $> 7*10^{22}$. We will look at some specific genotypes: {r} evalGenotype("k1 > i1 > h2", fea) ## 0.5 evalGenotype("k1 > h1 > i1", fea) ## 0.5 * 1.12 evalGenotype("k2 > m1 > h1 > i1", fea) ## 1.12 evalGenotype("k2 > m1 > h1 > i1 > c1 > n3 > f2", fea) ## 1.12 * 0.1 * (1 + noint[3]) * 0.05 * 0.9  Finally, let's generate some ordered genotypes randomly: {r} randomGenotype <- function(fe, ns = NULL) { gn <- setdiff(c(fe$geneModule$Gene, fe$long.geneNoInt$Gene), "Root") if(is.null(ns)) ns <- sample(length(gn), 1) return(paste(sample(gn, ns), collapse = " > ")) } set.seed(2) ## for reproducibility evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: k2 > i1 > c1 > n1 > m1 ## Individual s terms are : 0.0755182 -0.9 ## Fitness: 0.107552 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: n2 > h1 > h2 ## Individual s terms are : 0.118164 ## Fitness: 1.11816 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: d2 > k2 > c1 > f2 > n4 > m1 > n3 > f1 > b1 > g1 > n5 > h1 > j2 ## Individual s terms are : 0.0145707 0.0139795 0.0436069 0.02 0.1 0.03 -0.95 0.3 -0.1 ## Fitness: 0.0725829 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: h2 > c1 > f1 > n2 > b2 > a1 > n1 > i1 ## Individual s terms are : 0.0755182 0.118164 0.01 0.02 -0.9 -0.95 -0.1 0.12 ## Fitness: 0.00624418 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: h2 > j1 > m1 > d2 > i1 > b2 > k2 > d1 > b1 > n3 > n1 > g1 > h1 > c1 > k1 > e1 > a1 > f1 > n5 > f2 ## Individual s terms are : 0.0755182 0.0145707 0.0436069 0.01 0.02 -0.9 0.03 0.04 0.2 0.3 -1 -0.1 0.12 ## Fitness: 0 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: n1 > m1 > n3 > i1 > j1 > n5 > k1 ## Individual s terms are : 0.0755182 0.0145707 0.0436069 -1 ## Fitness: 0 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: d2 > n1 > g1 > f1 > f2 > c1 > b1 > d1 > k1 > a1 > b2 > i1 > n4 > h2 > n2 ## Individual s terms are : 0.0755182 0.118164 0.0139795 0.01 0.02 -0.9 0.03 -0.95 0.3 -0.5 ## Fitness: 0.00420528 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: j1 > f1 > j2 > a1 > n4 > c1 > n3 > k1 > d1 > h1 ## Individual s terms are : 0.0145707 0.0139795 0.01 0.1 0.03 -0.95 -0.5 ## Fitness: 0.0294308 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: n5 > f2 > f1 > h2 > n4 > c1 > n3 > b1 ## Individual s terms are : 0.0145707 0.0139795 0.0436069 0.02 0.1 -0.95 ## Fitness: 0.0602298 evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE) ## Genotype: h1 > d1 > f2 ## Individual s terms are : 0.03 -0.95 ## Fitness: 0.0515  ## Homozygosity, heterozygosity, oncogenes, tumor suppressors {#oncog} We are using what is conceptually a single linear chromosome. However, you can use it to model scenarios where the numbers of copies affected matter, by properly duplicating the genes. Suppose we have a tumor suppressor gene, G, with two copies, one from Mom and one from Dad. We can have a table like: ---------------------------- $O_M$ $O_D$ Fitness ----- ----- -------------- wt wt 1 wt M 1 M wt 1 M M $(1 + s)$ ---- ----- --------------- where $s > 0$, meaning that you need two hits, one in each copy, to trigger the clonal expansion. What about oncogenes? A simple model is that one single hit leads to clonal expansion and additional hits lead to no additional changes, as in this table for gene O, where again the M or D subscript denotes the copy from Mom or from Dad: ---------------------------- $O_M$ $O_D$ Fitness ----- ----- -------------- wt wt 1 wt M $(1 + s)$ M wt $(1 + s)$ M M $(1 + s)$ ---- ----- -------------- If you have multiple copies you can proceed similarly. As you can see, these are nothing but special cases of synthetic mortality (\@ref(sl)), synthetic viability (\@ref(sv)) and epistasis (\@ref(epi)). ## Gene-specific mutation rates {#per-gene-mutation} You can specify gene-specific mutation rates. Instead of passing a scalar value for mu, you pass a named vector. (This does not work with the old v. 1 format, though; yet another reason to stop using that format). This is a simple example (many more are available in the tests, see file ./tests/testthat/test.per-gene-mutation-rates.R). {r} muvar2 <- c("U" = 1e-6, "z" = 5e-5, "e" = 5e-4, "m" = 5e-3, "D" = 1e-4) ni1 <- rep(0, 5) names(ni1) <- names(muvar2) ## We use the same names, of course fe1 <- allFitnessEffects(noIntGenes = ni1) bb <- oncoSimulIndiv(fe1, mu = muvar2, onlyCancer = FALSE, initSize = 1e5, finalTime = 25, seed =NULL)  ## Mutator genes {#mutator} You can specify mutator/antimutator genes [e.g. @gerrish_complete_2007; @tomlinson_mutation_1996]. These are genes that, when mutated, lead to an increase/decrease in the mutation rate all over the genome (similar to what happens with, say, mutations in mismatch-repair genes or microsatellite instability in cancer).  f9f9ab96  The specification is very similar to that for fitness effects, except we do not (at least for now) allow the use of DAGs nor of order effects (we have seen no reference in the literature to  6610ac26  suggest any of these would be relevant). You can, however, specify  f9f9ab96  epistasis and use modules. Note that the mutator genes must be a subset of the genes in the fitness effects; if you want to have mutator genes that have no direct fitness effects, give them a fitness effect of 0.  6610ac26  This first is a very simple example with simple fitness effects and modules for mutators. We will specify the fitness and mutator effects and evaluate the fitness and mutator effects: {r} fe2 <- allFitnessEffects(noIntGenes = c(a1 = 0.1, a2 = 0.2, b1 = 0.01, b2 = 0.3, b3 = 0.2, c1 = 0.3, c2 = -0.2)) fm2 <- allMutatorEffects(epistasis = c("A" = 5, "B" = 10, "C" = 3), geneToModule = c("A" = "a1, a2", "B" = "b1, b2, b3", "C" = "c1, c2")) ## Show the fitness effect of a specific genotype evalGenotype("a1, c2", fe2, verbose = TRUE) ## Show the mutator effect of a specific genotype evalGenotypeMut("a1, c2", fm2, verbose = TRUE) ## Fitness and mutator of a specific genotype evalGenotypeFitAndMut("a1, c2", fe2, fm2, verbose = TRUE)  You can also use the evalAll functions. We do not show the output here to avoid cluttering the vignette: {r, eval=FALSE} ## Show only all the fitness effects evalAllGenotypes(fe2, order = FALSE) ## Show only all mutator effects evalAllGenotypesMut(fm2) ## Show all fitness and mutator evalAllGenotypesFitAndMut(fe2, fm2, order = FALSE)  Building upon the above, the next is an example where we have a bunch of no interaction genes that affect fitness, and a small set of genes that affect the mutation rate (but have no fitness effects). {r} set.seed(1) ## for reproducibility ## 17 genes, 7 with no direct fitness effects ni <- c(rep(0, 7), runif(10, min = -0.01, max = 0.1)) names(ni) <- c("a1", "a2", "b1", "b2", "b3", "c1", "c2", paste0("g", 1:10)) fe3 <- allFitnessEffects(noIntGenes = ni) fm3 <- allMutatorEffects(epistasis = c("A" = 5, "B" = 10, "C" = 3, "A:C" = 70), geneToModule = c("A" = "a1, a2", "B" = "b1, b2, b3", "C" = "c1, c2"))  Let us check what the effects are of a few genotypes: {r} ## These only affect mutation, not fitness evalGenotypeFitAndMut("a1, a2", fe3, fm3, verbose = TRUE) evalGenotypeFitAndMut("a1, b3", fe3, fm3, verbose = TRUE) ## These only affect fitness: the mutator multiplier is 1 evalGenotypeFitAndMut("g1", fe3, fm3, verbose = TRUE) evalGenotypeFitAndMut("g3, g9", fe3, fm3, verbose = TRUE) ## These affect both evalGenotypeFitAndMut("g3, g9, a2, b3", fe3, fm3, verbose = TRUE)  Finally, we will do a simulation with those data {r} set.seed(1) ## so that it is easy to reproduce mue1 <- oncoSimulIndiv(fe3, muEF = fm3, mu = 1e-6, initSize = 1e5, model = "McFL", detectionSize = 5e6, finalTime = 500, onlyCancer = FALSE)  {r, eval=FALSE} ## We do not show this in the vignette to avoid cluttering it ## with output mue1   f9f9ab96  Of course, it is up to you to keep things reasonable: mutator effects are multiplicative, so if you specify, say, 20 genes (without modules), or 20 modules, each with a mutator effect of 50, the overall mutation rate can be increased by a factor of $50^{20}$ and that is unlikely to be what you really want (see also section \@ref(tomlinexcept)).  6610ac26   f9f9ab96  You can play with the following case (an extension of the example above), where a clone with a mutator phenotype and some fitness enhancing mutations starts giving rise to many other clones, some with additional mutator effects, and thus leading to the number of clones blowing up (as some also accumulate additional fitness-enhancing mutations). Things start getting out of hand shortly after time 250. The code below takes a few minutes to run and is not executed here, but you can run it to get an idea of the increase in the number of clones and their relationships (the usage of plotClonePhylog is explained in section \@ref(phylog)).  6610ac26  {r, eval=FALSE} set.seed(1) ## for reproducibility ## 17 genes, 7 with no direct fitness effects ni <- c(rep(0, 7), runif(10, min = -0.01, max = 0.1)) names(ni) <- c("a1", "a2", "b1", "b2", "b3", "c1", "c2", paste0("g", 1:10)) ## Next is for nicer figure labeling.  f9f9ab96  ## Consider as drivers genes with s >0  6610ac26  gp <- which(ni > 0) fe3 <- allFitnessEffects(noIntGenes = ni, drvNames = names(ni)[gp]) set.seed(12) mue1 <- oncoSimulIndiv(fe3, muEF = fm3, mu = 1e-6, initSize = 1e5, model = "McFL", detectionSize = 5e6, finalTime = 270, keepPhylog = TRUE, onlyCancer = FALSE) mue1 ## If you decrease N even further it gets even more cluttered op <- par(ask = TRUE) plotClonePhylog(mue1, N = 10, timeEvents = TRUE) plot(mue1, plotDrivers = TRUE, addtot = TRUE, plotDiversity = TRUE)  f9f9ab96   6610ac26  ## The stacked plot is slow; be patient ## Most clones have tiny population sizes, and their lines ## are piled on top of each other plot(mue1, addtot = TRUE, plotDiversity = TRUE, type = "stacked") par(op)   bbd76086  \clearpage  6610ac26  # Plotting fitness landscapes {#plot-fit-land}  fc38a875  The evalAllGenotypes and related functions allow you to obtain  6610ac26  tables of the genotype to fitness mappings. It might be more convenient to actually plot that, allowing us to quickly identify local minima and maxima and get an idea of how the fitness landscape looks.  fc38a875  In plotFitnessLandscape I have blatantly and shamelessly copied most  6610ac26  of the looks of the plots of MAGELLAN [@brouillet_magellan:_2015] (see also ), a very nice web-based tool for fitness landscape plotting and analysis (MAGELLAN provides some other extra functionality and epistasis statistics not provided here).  f9f9ab96  As an example, let us show the example of Weissman et al. we saw in \@ref(weissmanex):  6610ac26  {r} d1 <- -0.05 ## single mutant fitness 0.95 d2 <- -0.08 ## double mutant fitness 0.92 d3 <- 0.2 ## triple mutant fitness 1.2 s2 <- ((1 + d2)/(1 + d1)^2) - 1 s3 <- ( (1 + d3)/((1 + d1)^3 * (1 + s2)^3) ) - 1 wb <- allFitnessEffects( epistasis = c( "A" = d1, "B" = d1, "C" = d1, "A:B" = s2, "A:C" = s2, "B:C" = s2, "A:B:C" = s3))   34c1b54f  {r, fig.width=6.5, fig.height=5}  6610ac26  plotFitnessLandscape(wb, use_ggrepel = TRUE)  We have set use_ggrepel = TRUE to avoid overlap of labels.  f9f9ab96  \clearpage  fc38a875  For some types of objects, directly invoking plot will give  f9f9ab96  you the fitness landscape plot:  6610ac26   34c1b54f  {r, fig.width=6.5, fig.height=5}  6610ac26  (ewb <- evalAllGenotypes(wb, order = FALSE)) plot(ewb, use_ggrepel = TRUE)   f9f9ab96  \clearpage This is example (section \@ref(pancreas)) will give a very busy plot:  6610ac26   34c1b54f  {r wasthis111, fig.width=9.5, fig.height=9.5} par(cex = 0.7)  6610ac26  pancr <- allFitnessEffects( data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", "TP53", "TP53", "MLL3"), child = c("KRAS","SMAD4", "CDNK2A", "TP53", "MLL3", rep("PXDN", 3), rep("TGFBR2", 2)), s = 0.1, sh = -0.9, typeDep = "MN")) plot(evalAllGenotypes(pancr, order = FALSE), use_ggrepel = TRUE)   bbd76086  \clearpage  6610ac26  # Specifying fitness effects: some examples from the literature {#litex} ## Bauer et al., 2014 {#bauer}  34c1b54f  In the model of Bauer and collaborators [@Bauer2014, pp. 54] we have "For cells without the primary driver mutation, each secondary driver mutation leads to a change in the cell's fitness by $s_P$. For cells with the primary driver mutation, the fitness advantage obtained with each secondary driver mutation is $s_{DP}$."  6610ac26   08de3ba7  The proliferation probability is given as:  6610ac26  * $\frac{1}{2}(1 + s_p)^k$ when there are $k$ secondary drivers mutated and no primary diver; * $\frac{1}{2}\frac{1+S_D^+}{1+S_D^-} (1 + S_{DP})^k$ when the primary driver is mutated; apoptosis is one minus the proliferation rate. ### Using a DAG We cannot find a simple mapping from their expressions to our fitness parameterization, but we can get fairly close by using a DAG; in this one,  08de3ba7  note the unusual feature of having one of the "s" terms (that for the  6610ac26  driver dependency on root) be negative. Using the parameters given in the legend of their Figure 3 for $s_p, S_D^+, S_D^-, S_{DP}$ and obtaining that negative value for the dependency of the driver on root we can do: {r} K <- 4 sp <- 1e-5 sdp <- 0.015 sdplus <- 0.05 sdminus <- 0.1 cnt <- (1 + sdplus)/(1 + sdminus) prod_cnt <- cnt - 1  fc38a875  bauer <- data.frame(parent = c("Root", rep("D", K)), child = c("D", paste0("s", 1:K)),  6610ac26  s = c(prod_cnt, rep(sdp, K)), sh = c(0, rep(sp, K)), typeDep = "MN") fbauer <- allFitnessEffects(bauer) (b1 <- evalAllGenotypes(fbauer, order = FALSE, addwt = TRUE))   fc38a875  (We use "D" for "driver" or "primary driver", as is it is called in the original paper, and "s" for secondary drivers, somewhat similar to passengers).  6610ac26  Note that what we specify as "typeDep" is irrelevant (MN, SMN, or XMPN make no difference). This is the DAG: {r, fig.height=3} plot(fbauer)   fc38a875  And if you compare the tabular output of evalAllGenotypes you can  6610ac26  see that the values of fitness reproduces the fitness landscape that they show in their Figure 1. We can also use our plot for fitness landscapes: {r, fig.width=6, fig.height=6} plot(b1, use_ggrepel = TRUE)   08de3ba7  ### Specifying fitness of genotypes directly  6610ac26  An alternative approach to specify the fitness, if the number of genotypes is reasonably small, is to directly evaluate fitness as given by their expressions. Then, use the genotFitness argument to  fc38a875  allFitnessEffects.  6610ac26   08de3ba7  We will create all possible genotypes; then we will write a function that gives the fitness of each genotype according to their expression; finally, we will call this function on the data frame of  fc38a875  genotypes, and pass this data frame to allFitnessEffects.  6610ac26  {r}  fc38a875  m1 <- expand.grid(D = c(1, 0), s1 = c(1, 0), s2 = c(1, 0),  6610ac26  s3 = c(1, 0), s4 = c(1, 0))  fc38a875  fitness_bauer <- function(D, s1, s2, s3, s4,  6610ac26  sp = 1e-5, sdp = 0.015, sdplus = 0.05, sdminus = 0.1) {  fc38a875  if(!D) {  6610ac26  b <- 0.5 * ( (1 + sp)^(sum(c(s1, s2, s3, s4)))) } else { b <- 0.5 * (((1 + sdplus)/(1 + sdminus) * (1 + sdp)^(sum(c(s1, s2, s3, s4))))) } fitness <- b - (1 - b) our_fitness <- 1 + fitness ## prevent negative fitness and ## make wt fitness = 1 return(our_fitness) } m1$Fitness <- apply(m1, 1, function(x) do.call(fitness_bauer, as.list(x))) bauer2 <- allFitnessEffects(genotFitness = m1)  Now, show the fitness of all genotypes: {r} evalAllGenotypes(bauer2, order = FALSE, addwt = TRUE)  Can we use modules in this example, if we use the "lego system"? Sure, as in any other case. ## Misra et al., 2014 {#misra} Figure 1 of @Misra2014 presents three scenarios which are different types of epistasis. ### Example 1.a {#misra1a} {r, echo=FALSE, fig.height=4, fig.width=4} df1 <- data.frame(x = c(1, 1.2, 1.4), f = c(1, 1.2, 1.2), names = c("wt", "A", "B")) plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab= "", ylab = "Fitness", xaxt = "n", yaxt = "n", ylim = c(1, 1.21)) segments(1, 1, 1.2, 1.2) segments(1, 1, 1.4, 1.2) text(1, 1, "wt", pos = 4) text(1.2, 1.2, "A", pos = 2) text(1.4, 1.2, "B", pos = 2) ## axis(1, tick = FALSE, labels = FALSE) ## axis(2, tick = FALSE, labels = FALSE)  In that figure it is evident that the fitness effect of "A" and "B" are the same. There are two different models depending on whether "AB" is just the product of both, or there is epistasis. In the first case probably the simplest is: {r} s <- 0.1 ## or whatever number m1a1 <- allFitnessEffects(data.frame(parent = c("Root", "Root"), child = c("A", "B"), s = s, sh = 0, typeDep = "MN")) evalAllGenotypes(m1a1, order = FALSE, addwt = TRUE)  If the double mutant shows epistasis, as we saw before (section \@ref(e2)) we have a range of options. For example: {r} s <- 0.1 sab <- 0.3 m1a2 <- allFitnessEffects(epistasis = c("A:-B" = s, "-A:B" = s, "A:B" = sab)) evalAllGenotypes(m1a2, order = FALSE, addwt = TRUE)  But we could also modify the graph dependency structure, and we have to change the value of the coefficient, since that is what multiplies each of  f9f9ab96  the terms for "A" and "B":$(1 + s_{AB}) = (1 + s)^2(1 + s_{AB3})$ 6610ac26  {r} sab3 <- ((1 + sab)/((1 + s)^2)) - 1 m1a3 <- allFitnessEffects(data.frame(parent = c("Root", "Root"), child = c("A", "B"), s = s, sh = 0, typeDep = "MN"), epistasis = c("A:B" = sab3)) evalAllGenotypes(m1a3, order = FALSE, addwt = TRUE)  And, obviously {r} all.equal(evalAllGenotypes(m1a2, order = FALSE, addwt = TRUE), evalAllGenotypes(m1a3, order = FALSE, addwt = TRUE))  ### Example 1.b {#misra1b} This is a specific case of synthetic viability (see also section \@ref(sv)): {r, echo=FALSE, fig.width=4, fig.height=4} df1 <- data.frame(x = c(1, 1.2, 1.2, 1.4), f = c(1, 0.4, 0.3, 1.3), names = c("wt", "A", "B", "AB")) plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab= "", ylab = "Fitness", xaxt = "n", yaxt = "n", ylim = c(0.29, 1.32)) segments(1, 1, 1.2, 0.4) segments(1, 1, 1.2, 0.3) segments(1.2, 0.4, 1.4, 1.3) segments(1.2, 0.3, 1.4, 1.3) text(x = df1[, 1], y = df1[, 2], labels = df1[, "names"], pos = c(4, 2, 2, 2)) ## text(1, 1, "wt", pos = 4) ## text(1.2, 1.2, "A", pos = 2) ## text(1.4, 1.2, "B", pos = 2)  Here,$S_A, S_B < 0$,$S_B < 0$,$S_{AB} > 0$and$(1 + S_{AB}) (1 + S_A) (1 + S_B) > 1$. As before, we can specify this in several different ways. The simplest is to specify all genotypes: {r} sa <- -0.6 sb <- -0.7 sab <- 0.3 m1b1 <- allFitnessEffects(epistasis = c("A:-B" = sa, "-A:B" = sb, "A:B" = sab)) evalAllGenotypes(m1b1, order = FALSE, addwt = TRUE)  We could also use a tree and modify the "sab" for the epistasis, as before (\@ref(misra1a)). ### Example 1.c {#misra1c} The final case, in figure 1.c of Misra et al., is just epistasis, where a mutation in one of the genes is deleterious (possibly only mildly), in the other is beneficial, and the double mutation has fitness larger than any of the other two. {r, echo=FALSE, fig.width=4, fig.height=4} df1 <- data.frame(x = c(1, 1.2, 1.2, 1.4), f = c(1, 1.2, 0.7, 1.5), names = c("wt", "A", "B", "AB")) plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab = "", ylab = "Fitness", xaxt = "n", yaxt = "n", ylim = c(0.69, 1.53)) segments(1, 1, 1.2, 1.2) segments(1, 1, 1.2, 0.7) segments(1.2, 1.2, 1.4, 1.5) segments(1.2, 0.7, 1.4, 1.5) text(x = df1[, 1], y = df1[, 2], labels = df1[, "names"], pos = c(3, 3, 3, 2)) ## text(1, 1, "wt", pos = 4) ## text(1.2, 1.2, "A", pos = 2) ## text(1.4, 1.2, "B", pos = 2)  Here we have that$s_A > 0$,$s_B < 0$,$(1 + s_{AB}) (1 + s_A) (1 + s_B) > (1 + s_{AB})$so$s_{AB} > \frac{-s_B}{1 + s_B}$As before, we can specify this in several different ways. The simplest is to specify all genotypes: {r} sa <- 0.2 sb <- -0.3 sab <- 0.5 m1c1 <- allFitnessEffects(epistasis = c("A:-B" = sa, "-A:B" = sb, "A:B" = sab)) evalAllGenotypes(m1c1, order = FALSE, addwt = TRUE)  We could also use a tree and modify the "sab" for the epistasis, as before (\@ref(misra1a)). ## Ochs and Desai, 2015 {#ochsdesai} In @Ochs2015 the authors present a model shown graphically as (the actual numerical values are arbitrarily set by me): {r, echo=FALSE, fig.width=4.5, fig.height=3.5} df1 <- data.frame(x = c(1, 2, 3, 4), f = c(1.1, 1, 0.95, 1.2), names = c("u", "wt", "i", "v")) plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "") par(las = 1) axis(2) axis(1, at = c(1, 2, 3, 4), labels = df1[, "names"], ylab = "") box() arrows(c(2, 2, 3), c(1, 1, 0.95), c(1, 3, 4), c(1.1, 0.95, 1.2)) ## text(1, 1, "wt", pos = 4) ## text(1.2, 1.2, "A", pos = 2) ## text(1.4, 1.2, "B", pos = 2)  In their model,$s_u > 0$,$s_v > s_u$,$s_i < 0$, we can only arrive at$v$from$i$, and the mutants "ui" and "uv" can never appear as their fitness is 0, or$-\infty$, so$s_{ui} = s_{uv} = -1$(or$-\infty$). We can specify this combining a graph and epistasis specifications: {r} su <- 0.1 si <- -0.05  fc38a875  fvi <- 1.2 ## the fitness of the vi mutant  6610ac26  sv <- (fvi/(1 + si)) - 1 sui <- suv <- -1 od <- allFitnessEffects( data.frame(parent = c("Root", "Root", "i"), child = c("u", "i", "v"), s = c(su, si, sv), sh = -1, typeDep = "MN"), epistasis = c( "u:i" = sui, "u:v" = suv))  A figure showing that model is {r, fig.width=3, fig.height=3} plot(od)  And the fitness of all genotype is {r} evalAllGenotypes(od, order = FALSE, addwt = TRUE)   c95df82d  We could alternatively have specified fitness either directly specifying the fitness of each genotype or specifying epistatic effects. Let us use the second approach: {r} u <- 0.2; i <- -0.02; vi <- 0.6; ui <- uv <- -Inf od2 <- allFitnessEffects( epistasis = c("u" = u, "u:i" = ui, "u:v" = uv, "i" = i, "v:-i" = -Inf, "v:i" = vi)) evalAllGenotypes(od, addwt = TRUE)  We will return to this model when we explain the usage of fixation  e719a81d  for stopping the simulations (see \@ref(fixation) and \@ref(fixationG)).  c95df82d   6610ac26  ## Weissman et al., 2009 {#weissmanex} In their figure 1a, @Weissman2009 present this model (actual numeric values are set arbitrarily)  f9f9ab96  ### Figure 1.a {#weiss1a}  6610ac26  {r, echo=FALSE, fig.width=4, fig.height=3} df1 <- data.frame(x = c(1, 2, 3), f = c(1, 0.95, 1.2), names = c("wt", "1", "2")) plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "") par(las = 1) axis(2) axis(1, at = c(1, 2, 3), labels = df1[, "names"], ylab = "") box() segments(c(1, 2), c(1, 0.95), c(2, 3), c(0.95, 1.2)) ## text(1, 1, "wt", pos = 4) ## text(1.2, 1.2, "A", pos = 2) ## text(1.4, 1.2, "B", pos = 2)  c95df82d   6610ac26    f9f9ab96  where the "1" and "2" in the figure refer to the total number of mutations in two different loci. This is, therefore, very similar to the example in section \@ref(misra1b). Here we have, in their notation,$\delta_1 < 0$, fitness of single "A" or single "B" =$1 + \delta_1$,$S_{AB} > 0$,$(1 + S_{AB})(1 + \delta_1)^2 > 1$.  6610ac26  ### Figure 1.b {#weis1b} In their figure 1b they show {r, echo=FALSE, fig.width=4, fig.height=3} df1 <- data.frame(x = c(1, 2, 3, 4), f = c(1, 0.95, 0.92, 1.2), names = c("wt", "1", "2", "3")) plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "") par(las = 1) axis(2) axis(1, at = c(1, 2, 3, 4), labels = df1[, "names"], ylab = "") box() segments(c(1, 2, 3), c(1, 0.95, 0.92), c(2, 3, 4), c(0.95, 0.92, 1.2)) ## text(1, 1, "wt", pos = 4) ## text(1.2, 1.2, "A", pos = 2) ## text(1.4, 1.2, "B", pos = 2)  Where, as before, 1, 2, 3, denote the total number of mutations over three different loci and$\delta_1 < 0$,$\delta_2 < 0$, fitness of single mutant is$(1 + \delta_1)$, of double mutant is$(1 + \delta_2)$so that$(1 + \delta_2) = (1 + \delta_1)^2 (1 + s_2)$and of triple mutant is$(1 + \delta_3)$, so that$(1 + \delta_3) = (1 + \delta_1)^3 (1 + s_2)^3 (1 + s_3)$. We can specify this combining a graph with epistasis: {r} d1 <- -0.05 ## single mutant fitness 0.95 d2 <- -0.08 ## double mutant fitness 0.92 d3 <- 0.2 ## triple mutant fitness 1.2 s2 <- ((1 + d2)/(1 + d1)^2) - 1 s3 <- ( (1 + d3)/((1 + d1)^3 * (1 + s2)^3) ) - 1 w <- allFitnessEffects( data.frame(parent = c("Root", "Root", "Root"), child = c("A", "B", "C"), s = d1, sh = -1, typeDep = "MN"), epistasis = c( "A:B" = s2, "A:C" = s2, "B:C" = s2, "A:B:C" = s3))  The model can be shown graphically as: {r, fig.width=4, fig.height=4} plot(w)  And fitness of all genotypes is: {r} evalAllGenotypes(w, order = FALSE, addwt = TRUE)  Alternatively, we can directly specify what each genotype adds to the fitness, given the included genotype. This is basically replacing the graph by giving each of "A", "B", and "C" directly: {r} wb <- allFitnessEffects( epistasis = c( "A" = d1, "B" = d1, "C" = d1, "A:B" = s2, "A:C" = s2, "B:C" = s2, "A:B:C" = s3)) evalAllGenotypes(wb, order = FALSE, addwt = TRUE)  The plot, of course, is not very revealing and we cannot show that there is a three-way interaction (only all three two-way interactions): {r, , fig.width=3, fig.height=3} plot(wb)  As we have seen several times already (sections \@ref(e2), \@ref(e3), \@ref(theminus)) we can also give the genotypes directly and, consequently, the fitness of each genotype (not the added contribution): {r} wc <- allFitnessEffects( epistasis = c( "A:-B:-C" = d1, "B:-C:-A" = d1, "C:-A:-B" = d1, "A:B:-C" = d2, "A:C:-B" = d2, "B:C:-A" = d2, "A:B:C" = d3)) evalAllGenotypes(wc, order = FALSE, addwt = TRUE)  ## Gerstung et al., 2011, pancreatic cancer poset {#pancreas} Similar to what we did in v.1 (see section \@ref(poset)) we can specify the pancreatic cancer poset in @Gerstung2011 (their figure 2B, left). We use directly the names of the genes, since that is immediately supported by the new version. {r, fig.width=4} pancr <- allFitnessEffects( data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", "TP53", "TP53", "MLL3"), child = c("KRAS","SMAD4", "CDNK2A", "TP53", "MLL3", rep("PXDN", 3), rep("TGFBR2", 2)), s = 0.1, sh = -0.9, typeDep = "MN")) plot(pancr)  Of course the "s" and "sh" are set arbitrarily here.  08de3ba7  ## Raphael and Vandin's 2014 modules {#raphael-ex}  6610ac26  In @Raphael2014a, the authors show several progression models in terms of modules. We can code the extended poset for the colorectal cancer model in their Figure 4.a is (s and sh are arbitrary): {r, fig.height = 4} rv1 <- allFitnessEffects(data.frame(parent = c("Root", "A", "KRAS"), child = c("A", "KRAS", "FBXW7"), s = 0.1, sh = -0.01, typeDep = "MN"), geneToModule = c("Root" = "Root", "A" = "EVC2, PIK3CA, TP53", "KRAS" = "KRAS", "FBXW7" = "FBXW7")) plot(rv1, expandModules = TRUE, autofit = TRUE)  We have used the (experimental) autofit option to fit the labels to the edges. Note how we can use the same name for genes and modules, but we need to specify all the modules. Their Figure 5b is {r, fig.height=6} rv2 <- allFitnessEffects( data.frame(parent = c("Root", "1", "2", "3", "4"), child = c("1", "2", "3", "4", "ELF3"), s = 0.1, sh = -0.01, typeDep = "MN"), geneToModule = c("Root" = "Root", "1" = "APC, FBXW7", "2" = "ATM, FAM123B, PIK3CA, TP53", "3" = "BRAF, KRAS, NRAS", "4" = "SMAD2, SMAD4, SOX9", "ELF3" = "ELF3")) plot(rv2, expandModules = TRUE, autofit = TRUE)   bbd76086  \clearpage  6610ac26  # Running and plotting the simulations: starting, ending, and examples {#simul} ## Starting and ending {#starting-ending} After you have decided the specifics of the fitness effects and the model, you need to decide: * Where will you start your simulation from. This involves deciding the initial population size (argument initSize) and, possibly, the genotype of the initial population; the later is covered in section \@ref(initmut). * When will you stop it: how long to run it, and whether or not to require simulations to reach cancer (under some definition of what it means to reach cancer). This is covered in \@ref(endsimul). ## Can I start the simulation from a specific mutant? {#initmut}  4ac900b7  You bet. In version 2 you can specify the genotype for the  fc38a875  initial mutant with the same flexibility as in evalGenotype. Here we show a couple of examples (we use the representation of the parent-child relationships ---discussed in section \@ref(phylog)--- of the clones so that you can see which clones appear, and from which, and check that we are not making mistakes).  6610ac26   4ac900b7   6610ac26   34c1b54f  {r prbau003, fig.height=6}  6610ac26  o3init <- allFitnessEffects(orderEffects = c( "M > D > F" = 0.99, "D > M > F" = 0.2, "D > M" = 0.1, "M > D" = 0.9), noIntGenes = c("u" = 0.01, "v" = 0.01, "w" = 0.001, "x" = 0.0001, "y" = -0.0001, "z" = -0.001), geneToModule = c("M" = "m", "F" = "f", "D" = "d") ) oneI <- oncoSimulIndiv(o3init, model = "McFL", mu = 5e-5, finalTime = 500, detectionDrivers = 3, onlyCancer = FALSE, initSize = 1000, keepPhylog = TRUE, initMutant = c("m > u > d") ) plotClonePhylog(oneI, N = 0)  34c1b54f    6610ac26   34c1b54f  {r prbau003bb, fig.height=6}  6610ac26  ## Note we also disable the stopping stochastically as a function of size ## to allow the population to grow large and generate may different ## clones.  34c1b54f  ospI <- oncoSimulPop(2,  6610ac26  o3init, model = "Exp", mu = 5e-5, finalTime = 500, detectionDrivers = 3, onlyCancer = TRUE, initSize = 10, keepPhylog = TRUE, initMutant = c("d > m > z"), mc.cores = 2 )  34c1b54f  op <- par(mar = rep(0, 4), mfrow = c(1, 2))  6610ac26  plotClonePhylog(ospI[[1]]) plotClonePhylog(ospI[[2]]) par(op)  34c1b54f  ossI <- oncoSimulSample(2,  6610ac26  o3init, model = "Exp", mu = 5e-5, finalTime = 500, detectionDrivers = 2, onlyCancer = TRUE, initSize = 10, initMutant = c("z > d"), ## check presence of initMutant: thresholdWhole = 1 ) ## No phylogeny is kept with oncoSimulSample, but look at the ## OcurringDrivers and the sample ossI$popSample ossI$popSummary[, "OccurringDrivers", drop = FALSE]  ## Ending the simulations {#endsimul} OncoSimulR provides very flexible ways to decide when to stop a simulation. Here we focus on a single simulation; see further options with multiple simulations in \@ref(sample).  f9f9ab96  ### Ending the simulations: conditions  6610ac26  * **onlyCancer = TRUE**. A simulation will be repeated until any one of the "reach cancer" conditions is met, if this happens before the simulation reaches finalTime[^6]. These conditions are: [^6]: Of course, the "reach cancer" idea and the onlyCancer argument are generic names; this could have been labeled "reach whatever interests me". + Total population size becomes larger than detectionSize. + The number of drivers in any one genotype or clone becomes equal to, or larger than, detectionDrivers; note that this allows you to stop the simulation as soon as a **specific genotype** is found, by using exactly and only the genes that make that genotype as the drivers.  f9f9ab96  + A gene or gene combination among those listed in fixation becomes fixed in the population (i.e., has a frequency is 1)  e719a81d  (see details in (\@ref(fixation) and \@ref(fixationG)).  6610ac26  + The tumor is detected according to a stochastic detection mechanism, where the probability of "detecting the tumor" increases with population size; this is explained below (\@ref(detectprob)) and is controlled by argument detectionProb.  f9f9ab96   6610ac26  As we exit as soon as any of the exiting conditions is reached,  c95df82d  if you only care about one condition, set the other to NA (see also section \@ref(anddrvprob)).  6610ac26  * **onlyCancer = FALSE**. A simulation will run only once, and will exit as soon as any of the above conditions are met or as soon as the total population size becomes zero or we reach finalTime.  c95df82d   6610ac26  As an example of onlyCancer = TRUE, focusing on the first two mechanisms, suppose you give detectionSize = 1e4 and detectionDrivers =3 (and you have detectionProb = NA). A simulation will exit as soon  f9f9ab96  as it reaches a total population size of$10^4$or any clone has four  6610ac26  drivers, whichever comes first (if any of these happen before finalTime). In the onlyCancer = TRUE case, what happens if we reach finalTime (or the population size becomes zero) before any of the  4ac900b7  "reach cancer" conditions have been fulfilled? The simulation will be  6610ac26  repeated again, within the following limits: * max.wall.time: the total wall time we allow an individual simulation to run; * max.num.tries: the maximum number of times we allow a simulation to be repeated to reach cancer; * max.wall.time.total and max.num.tries.total, similar to the above but over a set of simulations in function  fc38a875  oncoSimulSample.  6610ac26  Incidentally, we keep track of the number of attempts used (the component other$attemptsUsed$) before we reach cancer, so you can estimate (as from a negative binomial sampling) the probability of reaching your desired end point under different scenarios. The onlyCancer = FALSE case might be what you want to do when you examine general population genetics scenarios without focusing on possible sampling issues. To do this, set finalTime to the value you want and set onlyCancer = FALSE; in addition, set detectionProb to "NA" and detectionDrivers and detectionSize to "NA" or to huge numbers^[Setting detectionDrivers and detectionSize to "NA" is in fact equivalent to setting them to the largest possible numbers for these variables:$2^{32} -1$and$\infty$, respectively.]. In this scenario you simply collect the simulation output at the end of the run, regardless of what happened with the population (it became extinct, it did not reach a large size, it did not accumulate drivers, etc).  fc38a875  ### Stochastic detection mechanism: "detectionProb" {#detectprob}  6610ac26  This is the process that is controlled by the argument detectionProb. Here the probability of tumor detection increases with the total population size. This is biologically a reasonable assumption: the larger the tumor, the more likely it is it will be detected. At regularly spaced times during the simulation, we compute the probability of detection as a function of size and determine (by comparing against a random uniform number) if the simulation should finish. For simplicity, and to make sure the probability is bounded between 0 and 1, we use the function  e719a81d   6610ac26  P(N) = \begin{cases}  e719a81d  1 - e^{ -c ( (N - B)/B)} & \text{if } N > B \\ 0 & \text{if } N \leq B  6610ac26  \end{cases} \label{eq:2} where$P(N)$is the probability that a tumor with a population size$N$ e719a81d  will be detected, and$c$(argument$cPDetect$in the oncoSimul* functions) controls how fast$P(N)$increases with increasing population size relative to a baseline,$B$($PDBaseline$in the oncoSimul* functions); with$B$we both control the minimal population size at which this mechanism stats operating (because we will rarely want detection unless there is some meaningful increase of population size over initSize) and we model the increase in$P(N)$as a function of relative differences with respect to$B$. (Note that this is **a major change** in version 2.9.9. Before version 2.9.9, the expression used was$P(N) = 1 - e^{ -c ( N - B)}$, so we did not make the increase relative to$B$; of course, you can choose an appropriate$c$to make different models comparable, but the expression used before 2.9.9 made it much harder to compare simulations with very different initial population sizes, as baselines are often naturall a function of initial population sizes.)  6610ac26  The$P(N)$refers to the probability of detection at each one of the occasions when we assess the probability of exiting. When, or how often, do we do that? When we assess probability of exiting is controlled by checkSizePEvery, which will often be much larger than sampleEvery^[We assess probability of exiting at every sampling time, as given by sampleEvery, that is the smallest possible sampling time that is separated from the previous time of assessment by at least checkSizePEvery. In other words, the interval between successive assessments will be the smallest multiple integer of sampleEvery that is larger than checkSizePEvery. For example, suppose sampleEvery = 2 and checkSizePEvery = 3: we will assess exiting at times$4, 8, 12, 16, \ldots$. If sampleEvery = 3 and checkSizePEvery = 3: we will assess exiting at times$6, 12, 18, \ldots$.]. Biologically, a way to think of checkSizePEvery is "time between doctor appointments".  fc38a875  An important **warning**, though: for populations that are growing very, very fast or where some genes might have very large effects on fitness even a moderate checkSizePEvery of, say, 10, might be inappropriate, since populations could have increased by several orders of magnitude between successive checks. This issue is also discussed in section \@ref(bench1xf) and \@ref(benchusual).  e719a81d  Finally, you can specify$c$($cPDetect$) directly (you will need to set n2 and p2 to NA). However, it might be more intuitive to specify the pair n2, p2, such that$P(n2) = p2$(and from that pair we solve for the value of$cPDetect$).  6610ac26  You can get a feeling for the effects of these arguments by playing with the following code, that we do not execute here for the sake of speed. Here no mutation has any effect, but there is a non-zero probability of exiting as soon as the total population size becomes larger than the initial population size. So, eventually, all simulations will exit and, as we are using the McFarland model, population size will vary slightly around the initial population size.  34c1b54f  {r prbaux002, eval=FALSE}  6610ac26  gi2 <- rep(0, 5) names(gi2) <- letters[1:5] oi2 <- allFitnessEffects(noIntGenes = gi2) s5 <- oncoSimulPop(200, oi2, model = "McFL", initSize = 1000, detectionProb = c(p2 = 0.1, n2 = 2000, PDBaseline = 1000, checkSizePEvery = 2), detectionSize = NA, finalTime = NA, keepEvery = NA, detectionDrivers = NA) s5 hist(unlist(lapply(s5, function(x) x$FinalTime)))  As you decrease checkSizePEvery the distribution of "FinalTime" will resemble more and more an exponential distribution. In this vignette, there are some further examples of using this mechanism in \@ref(s-cbn1) and \@ref(mcf5070), with the default arguments.  fc38a875  #### Stochastic detection mechanism and minimum number of drivers {#anddrvprob}  c95df82d  We said above that we exit as soon as any of the conditions is reached (i.e., we use an OR operation over the exit  fc38a875  conditions). There is a special exception to this procedure: if you  c95df82d  set AND_DrvProbExit = TRUE, both the number of drivers and the detectionProb mechanism condition must fulfilled. This means that the detectionProb mechanism not assessed unless the detectionDrivers condition is. Using AND_DrvProbExit = TRUE allows to run simulations and ensure that all of the returned simulations will have at least some cells with the number of drivers as specified by detectionDrivers. Note, though, that this does not guarantee that when you sample the population, all those drivers will be detected (as this depends on the actual proportion of cells with the drivers and the settings of samplePop).  fc38a875   f9f9ab96  ### Fixation of genes/gene combinations {#fixation}  c95df82d  In some cases we might be interested in running simulations until a particular set of genes, or gene combinations, reaches fixation. This exit condition might be more relevant than some of the above in many non-cancer-related evolutionary genetics scenarios.  f9f9ab96  Simulations will stop as soon as any of the genes or gene combinations in the vector (or list) fixation reaches a frequency of 1. These gene combinations might have non-zero intersection (i.e., they might share genes), and those genes need not be drivers. If we want simulations to only stop when fixation of those genes/gene combinations is reached, we will set all other stopping conditions to NA. It is, of course, up to you to ensure that those stopping conditions are reasonable (that they can be reached) and to use, or not, finalTime; otherwise, simulations will eventually abort (e.g., when max.wall.time or max.num.tries are reached). Since we are asking for fixation, the Exp or Bozic models will often not be appropriate here; instead, models with competition such as McFL are more appropriate.  c95df82d   f9f9ab96  We return here to the example from section \@ref(ochsdesai).  c95df82d  {r} u <- 0.2; i <- -0.02; vi <- 0.6; ui <- uv <- -Inf od2 <- allFitnessEffects( epistasis = c("u" = u, "u:i" = ui, "u:v" = uv, "i" = i, "v:-i" = -Inf, "v:i" = vi))  Ochs and Desai explain that "Each simulated population was evolved until either the uphill genotype or valley-crossing genotype fixed." (see @Ochs2015, p.2, section "Simulations"). We will do the same here. We specify that we want to end the simulation when either the "u" or the "v, i" genotypes have reached fixation, by passing those genotype combinations as the fixation argument (in this example using fixation = c("u", "v") would have been equivalent, since the  f9f9ab96  "v" genotype by itself has fitness of 0).  c95df82d  We want to be explicit that fixation will be the one and only condition for ending the simulations, and thus we set arguments detectionDrivers, finalTime, detectionSize and detectionProb  f9f9ab96  explicitly to NA. (We set the number of repetitions only to 10 for  c95df82d  the sake of speed when creating the vignette). {r simul-ochs} initS <- 20  fc38a875  ## We use only a small number of repetitions for the sake ## of speed. od100 <- oncoSimulPop(10, od2,  c95df82d  fixation = c("u", "v, i"), model = "McFL", mu = 1e-4, detectionDrivers = NA, finalTime = NA, detectionSize = NA, detectionProb = NA, onlyCancer = TRUE, initSize = initS,  fc38a875  mc.cores = 2)  c95df82d    fc38a875   c95df82d  What is the frequency of each genotype among the simulations? (or, what is the frequency of fixation of each genotype?) {r ochs-freq-genots} sampledGenotypes(samplePop(od100))   fc38a875  Note the very large variability in reaching fixation  c95df82d  {r sum-simul-ochs}  fc38a875  head(summary(od100)[, c(1:3, 8:9)])  c95df82d    e719a81d  ### Fixation of genotypes {#fixationG} Section \@ref(fixation) deals with the fixation of gene/gene combinations. What if you want fixation on specific genotypes? To give an example, suppose we have a five loci genotype and suppose that you want to stop the simulations only if genotypes "A", "B, E", or "A, B, C, D, E" reach fixation. You do not want to stop it if, say, genotype "A, B, E" reaches fixation. To specify genotypes, you prepend the genotype  4ac900b7  combinations with a "_,", and that tells OncoSimulR that you want  e719a81d  fixation of **genotypes**, not just gene combinations. An example of the differences between the mechanisms can be seen from this code: {r fixationG1} ## Create a simple fitness landscape rl1 <- matrix(0, ncol = 6, nrow = 9) colnames(rl1) <- c(LETTERS[1:5], "Fitness") rl1[1, 6] <- 1 rl1[cbind((2:4), c(1:3))] <- 1 rl1[2, 6] <- 1.4 rl1[3, 6] <- 1.32 rl1[4, 6] <- 1.32 rl1[5, ] <- c(0, 1, 0, 0, 1, 1.5) rl1[6, ] <- c(0, 0, 1, 1, 0, 1.54) rl1[7, ] <- c(1, 0, 1, 1, 0, 1.65) rl1[8, ] <- c(1, 1, 1, 1, 0, 1.75) rl1[9, ] <- c(1, 1, 1, 1, 1, 1.85) class(rl1) <- c("matrix", "genotype_fitness_matrix") ## plot(rl1) ## to see the fitness landscape ## Gene combinations local_max_g <- c("A", "B, E", "A, B, C, D, E") ## Specify the genotypes local_max <- paste0("_,", local_max_g) fr1 <- allFitnessEffects(genotFitness = rl1, drvNames = LETTERS[1:5]) initS <- 2000 ######## Stop on gene combinations ##### r1 <- oncoSimulPop(10, fp = fr1, model = "McFL", initSize = initS, mu = 1e-4, detectionSize = NA, sampleEvery = .03, keepEvery = 1, finalTime = 50000, fixation = local_max_g, detectionDrivers = NA, detectionProb = NA, onlyCancer = TRUE, max.num.tries = 500, max.wall.time = 20, errorHitMaxTries = TRUE, keepPhylog = FALSE, mc.cores = 2) sp1 <- samplePop(r1, "last", "singleCell") sgsp1 <- sampledGenotypes(sp1) ## often you will stop on gene combinations that ## are not local maxima in the fitness landscape sgsp1 sgsp1$Genotype %in% local_max_g ####### Stop on genotypes #### r2 <- oncoSimulPop(10, fp = fr1, model = "McFL", initSize = initS, mu = 1e-4, detectionSize = NA, sampleEvery = .03, keepEvery = 1, finalTime = 50000, fixation = local_max, detectionDrivers = NA, detectionProb = NA, onlyCancer = TRUE, max.num.tries = 500, max.wall.time = 20, errorHitMaxTries = TRUE, keepPhylog = FALSE, mc.cores = 2) ## All final genotypes should be local maxima sp2 <- samplePop(r2, "last", "singleCell") sgsp2 <- sampledGenotypes(sp2) sgsp2$Genotype %in% local_max_g  ### Fixation: tolerance, number of periods, minimal size In particular if you specify stopping on genotypes, you might want to think about three additional parameters: fixation_tolerance, min_successive_fixation, and fixation_min_size. fixation_tolerance: fixation is considered to have happened if the genotype/gene combinations specified as genotypes/gene combinations for fixation have reached a frequency $> 1 - fixation\_tolerance$. (The default is 0, so we ask for genotypes/gene combinations with a frequency of 1, which might not be what you want with large mutation rates and complex fitness landscape with genotypes of similar fitness.) min_successive_fixation: during how many successive sampling periods the conditions of fixation need to be fulfilled before declaring fixation. These must be successive sampling periods without interruptions (i.e., a single period when the condition is not fulfilled will set the counter to 0). This can help to exclude short, transitional, local maxima that are quickly replaced by other genotypes. (The default is 50, but this  4ac900b7  is probably too small for "real life" usage).  e719a81d  fixation_min_size: you might only want to consider fixation to have happened if a minimal size has been reached (this can help weed out local maxima that have fitness that is barely above that of the wild-type genotype). (The default is 0). An example of using those options: {r fixationG2} ## Create a simple fitness landscape rl1 <- matrix(0, ncol = 6, nrow = 9) colnames(rl1) <- c(LETTERS[1:5], "Fitness") rl1[1, 6] <- 1 rl1[cbind((2:4), c(1:3))] <- 1 rl1[2, 6] <- 1.4 rl1[3, 6] <- 1.32 rl1[4, 6] <- 1.32 rl1[5, ] <- c(0, 1, 0, 0, 1, 1.5) rl1[6, ] <- c(0, 0, 1, 1, 0, 1.54) rl1[7, ] <- c(1, 0, 1, 1, 0, 1.65) rl1[8, ] <- c(1, 1, 1, 1, 0, 1.75) rl1[9, ] <- c(1, 1, 1, 1, 1, 1.85) class(rl1) <- c("matrix", "genotype_fitness_matrix") ## plot(rl1) ## to see the fitness landscape ## The local fitness maxima are ## c("A", "B, E", "A, B, C, D, E") fr1 <- allFitnessEffects(genotFitness = rl1, drvNames = LETTERS[1:5]) initS <- 2000 ## Stop on genotypes r3 <- oncoSimulPop(10, fp = fr1, model = "McFL", initSize = initS, mu = 1e-4, detectionSize = NA, sampleEvery = .03, keepEvery = 1, finalTime = 50000, fixation = c(paste0("_,", c("A", "B, E", "A, B, C, D, E")), fixation_tolerance = 0.1, min_successive_fixation = 200, fixation_min_size = 3000), detectionDrivers = NA, detectionProb = NA, onlyCancer = TRUE, max.num.tries = 500, max.wall.time = 20, errorHitMaxTries = TRUE, keepPhylog = FALSE, mc.cores = 2)  ### Mixing stopping on gene combinations and genotypes {#fixationmix} This would probably be awfully confusing and is not tested formally (though it should work). Let me know if you think this is an important feature. (Pull requests with tests welcome.)  fc38a875  ## Plotting genotype/driver abundance over time; plotting the simulated trajectories {#plotraj}  f9f9ab96  We have seen many of these plots already, starting with Figure \@ref(fig:iep1x1) and Figure \@ref(fig:iep2x2) and we will see many more below, in the examples, starting with section \@ref(bauer2) such as in figures \@ref(fig:baux1) and \@ref(fig:baux2). In a nutshell, what we are plotting is the information contained in the  fc38a875  pops.by.time matrix, the matrix that contains the abundances of all the clones (or genotypes) at each of the sampling periods. The functions that do the work are called plot and these are actually methods for objects of class "oncosimul" and "oncosimulpop". You can access the help by doing ?plot.oncosimul, for example. What entities are shown in the plot? You can show the trajectories of: - numbers of drivers (e.g., \@ref(fig:baux1)); - genotypes or clones (e.g., \@ref(fig:baux2)). (Of course, showing "drivers" requires that you have specified  f9f9ab96  certain genes as drivers.)  c95df82d   fc38a875  What types of plots are available?  6610ac26   fc38a875  - line plots; - stacked plots; - stream plots.  f9f9ab96  All those three are shown in both of Figure \@ref(fig:baux1) and Figure \@ref(fig:baux2).  fc38a875  If you run multiple simulations using oncoSimulPop you can plot the trajectories of all of the simulations. ## Several examples of simulations and plotting simulation trajectories {#severalexplot} ### Bauer's example again {#bauer2}  6610ac26  We will use the model of @Bauer2014 that we saw in section \@ref(bauer).  34c1b54f  {r prbaux001}  6610ac26  K <- 5 sd <- 0.1 sdp <- 0.15 sp <- 0.05 bauer <- data.frame(parent = c("Root", rep("p", K)), child = c("p", paste0("s", 1:K)), s = c(sd, rep(sdp, K)), sh = c(0, rep(sp, K)), typeDep = "MN") fbauer <- allFitnessEffects(bauer, drvNames = "p") set.seed(1) ## Use fairly large mutation rate b1 <- oncoSimulIndiv(fbauer, mu = 5e-5, initSize = 1000, finalTime = NA, onlyCancer = TRUE, detectionProb = "default")  We will now use a variety of plots  fc38a875  {r baux1,fig.width=6.5, fig.height=10, fig.cap="Three drivers' plots of a simulation of Bauer's model"}  6610ac26  par(mfrow = c(3, 1)) ## First, drivers plot(b1, type = "line", addtot = TRUE) plot(b1, type = "stacked") plot(b1, type = "stream")   fc38a875  {r baux2,fig.width=6.5, fig.height=10, fig.cap="Three genotypes' plots of a simulation of Bauer's model"}  6610ac26  par(mfrow = c(3, 1)) ## Next, genotypes plot(b1, show = "genotypes", type = "line") plot(b1, show = "genotypes", type = "stacked") plot(b1, show = "genotypes", type = "stream")  In this case, probably the stream plots are most helpful. Note, however, that (in contrast to some figures in the literature showing models of clonal expansion) the stream plot (or the stacked plot) does not try to explicitly show parent-descendant relationships, which would hardly be realistically possible in these plots (although the plots of phylogenies in section \@ref(phylog) could be of help).  fc38a875  ### McFarland model with 5000 passengers and 70 drivers {#mcf5070}  6610ac26  {r, fig.width=6} set.seed(678) nd <- 70 np <- 5000 s <- 0.1 sp <- 1e-3 spp <- -sp/(1 + sp) mcf1 <- allFitnessEffects(noIntGenes = c(rep(s, nd), rep(spp, np)), drvNames = seq.int(nd)) mcf1s <- oncoSimulIndiv(mcf1, model = "McFL", mu = 1e-7, detectionProb = "default",  c95df82d  detectionSize = NA, detectionDrivers = NA,  fc38a875  sampleEvery = 0.025,  6610ac26  keepEvery = 8, initSize = 2000, finalTime = 4000, onlyCancer = FALSE) summary(mcf1s)  {r mcf1sx1,fig.width=6.5, fig.height=10} par(mfrow = c(2, 1)) ## I use thinData to make figures smaller and faster plot(mcf1s, addtot = TRUE, lwdClone = 0.9, log = "", thinData = TRUE, thinData.keep = 0.5) plot(mcf1s, show = "drivers", type = "stacked", thinData = TRUE, thinData.keep = 0.3, legend.ncols = 2)  With the above output (where we see there are over 500 different genotypes) trying to represent the genotypes makes no sense.  fc38a875  ### McFarland model with 50,000 passengers and 70 drivers: clonal competition {#mcf50070}  6610ac26  The next is too slow (takes a couple of minutes in an i5 laptop) and too big to run in a vignette, because we keep track of over 4000 different clones (which leads to a result object of over 800 MB): {r, eval=FALSE} set.seed(123) nd <- 70 np <- 50000 s <- 0.1 sp <- 1e-4 ## as we have many more passengers spp <- -sp/(1 + sp) mcfL <- allFitnessEffects(noIntGenes = c(rep(s, nd), rep(spp, np)), drvNames = seq.int(nd)) mcfLs <- oncoSimulIndiv(mcfL, model = "McFL", mu = 1e-7, detectionSize = 1e8, detectionDrivers = 100, sampleEvery = 0.02, keepEvery = 2, initSize = 1000, finalTime = 2000, onlyCancer = FALSE)  But you can access the pre-stored results and plot them (beware: this object has been trimmed by removing empty passenger rows in the Genotype matrix) {r, mcflsx2,fig.width=6} data(mcfLs) plot(mcfLs, addtot = TRUE, lwdClone = 0.9, log = "",  34c1b54f  thinData = TRUE, thinData.keep = 0.3,  6610ac26  plotDiversity = TRUE)  The argument plotDiversity = TRUE asks to show a small plot on top with Shannon's diversity index. {r} summary(mcfLs) ## number of passengers per clone summary(colSums(mcfLs$Genotypes[-(1:70), ]))  Note that we see clonal competition between clones with the same number of drivers (and with different drivers, of course). We will return to this (section \@ref(clonalint)).  34c1b54f  A stacked plot might be better to show the extent of clonal competition (plotting takes some time ---a stream plot reveals similar patterns and is also slower than the line plot). I will aggressively thin the data for this plot so it is faster and smaller (but we miss some of the fine grain, of course):  6610ac26  {r mcflsx3} plot(mcfLs, type = "stacked", thinData = TRUE,  34c1b54f  thinData.keep = 0.2,  6610ac26  plotDiversity = TRUE, xlim = c(0, 1000))   f9f9ab96  ### Simulation with a conjunction example {#s-cbn1}  6610ac26  We will use several of the previous examples. Most of them are in file examplesFitnessEffects, where they are stored inside a list, with named components (names the same as in the examples above): {r} data(examplesFitnessEffects) names(examplesFitnessEffects)  We will simulate using the simple CBN-like restrictions of  34c1b54f  section \@ref(cbn1) with two different models.  6610ac26   34c1b54f  {r smmcfls}  6610ac26  data(examplesFitnessEffects) evalAllGenotypes(examplesFitnessEffects$cbn1, order = FALSE)[1:10, ] sm <- oncoSimulIndiv(examplesFitnessEffects$cbn1, model = "McFL", mu = 5e-7, detectionSize = 1e8, detectionDrivers = 2, detectionProb = "default", sampleEvery = 0.025, keepEvery = 5, initSize = 2000, onlyCancer = TRUE) summary(sm)   34c1b54f  {r smbosb1}  6610ac26  set.seed(1234) evalAllGenotypes(examplesFitnessEffects$cbn1, order = FALSE, model = "Bozic")[1:10, ] sb <- oncoSimulIndiv(examplesFitnessEffects$cbn1, model = "Bozic", mu = 5e-6, detectionProb = "default", detectionSize = 1e8, detectionDrivers = 4, sampleEvery = 2, initSize = 2000, onlyCancer = TRUE) summary(sb)  As usual, we will use several plots here. {r sbx1,fig.width=6.5, fig.height=3.3} ## Show drivers, line plot par(cex = 0.75, las = 1) plot(sb,show = "drivers", type = "line", addtot = TRUE, plotDiversity = TRUE)  {r sbx2,fig.width=6.5, fig.height=3.3} ## Drivers, stacked par(cex = 0.75, las = 1) plot(sb,show = "drivers", type = "stacked", plotDiversity = TRUE)  {r sbx3,fig.width=6.5, fig.height=3.3} ## Drivers, stream par(cex = 0.75, las = 1) plot(sb,show = "drivers", type = "stream", plotDiversity = TRUE)  {r sbx4,fig.width=6.5, fig.height=3.3} ## Genotypes, line plot par(cex = 0.75, las = 1) plot(sb,show = "genotypes", type = "line", plotDiversity = TRUE)  {r sbx5,fig.width=6.5, fig.height=3.3} ## Genotypes, stacked par(cex = 0.75, las = 1) plot(sb,show = "genotypes", type = "stacked", plotDiversity = TRUE)  {r sbx6,fig.width=6.5, fig.height=3.3} ## Genotypes, stream par(cex = 0.75, las = 1) plot(sb,show = "genotypes", type = "stream", plotDiversity = TRUE)  The above illustrates again that different types of plots can be useful to reveal different patterns in the data. For instance, here, because of the huge relative frequency of one of the clones/genotypes, the stacked and stream plots do not reveal the other clones/genotypes as we cannot use a log-transformed y-axis, even if there are other clones/genotypes present.  fc38a875  ### Simulation with order effects and McFL model {#clonalint}  6610ac26  (We use a somewhat large mutation rate than usual, so that the simulation runs quickly.) {r, fig.width=6} set.seed(4321) tmp <- oncoSimulIndiv(examplesFitnessEffects[["o3"]], model = "McFL", mu = 5e-5, detectionSize = 1e8, detectionDrivers = 3, sampleEvery = 0.025, max.num.tries = 10, keepEvery = 5, initSize = 2000, finalTime = 6000, onlyCancer = FALSE)  We show a stacked and a line plot of the drivers: {r tmpmx1,fig.width=6.5, fig.height=4.1} par(las = 1, cex = 0.85)  34c1b54f  plot(tmp, addtot = TRUE, log = "", plotDiversity = TRUE, thinData = TRUE, thinData.keep = 0.2)  6610ac26   {r tmpmx2,fig.width=6.5, fig.height=4.1} par(las = 1, cex = 0.85) plot(tmp, type = "stacked", plotDiversity = TRUE,  34c1b54f  ylim = c(0, 5500), legend.ncols = 4, thinData = TRUE, thinData.keep = 0.2)  6610ac26    f9f9ab96  In this example (and at least under Linux, with both GCC and clang ---random number streams in C++, and thus simulations, can differ between combinations of operating system and compiler), we can see that the mutants with three drivers do not get established when we  6610ac26  stop the simulation at time 6000. This is one case where the summary  f9f9ab96  statistics about number of drivers says little of value, as fitness is very different for genotypes with the same number of mutations, and does not increase in a simple way with drivers:  6610ac26  {r} evalAllGenotypes(examplesFitnessEffects[["o3"]], addwt = TRUE, order = TRUE)  A few figures could help:  34c1b54f  {r tmpmx3,fig.width=6.5, fig.height=5.5} plot(tmp, show = "genotypes", ylim = c(0, 5500), legend.ncols = 3, thinData = TRUE, thinData.keep = 0.5)  6610ac26   (When reading the figure legends, recall that genotype$x > y\ \_\ z$is one where a mutation in "x" happened before a mutation in "y", and there is also a mutation in "z" for which order does not matter. Here, there are no genes for which order does not matter and thus there is nothing after the "_"). In this case, the clones with three drivers end up displacing those with two by the time we stop; moreover, notice how those with one driver never really grow to a large population size, so we basically go from a population with clones with zero drivers to a population made of clones with two or three drivers:  fc38a875   6610ac26  {r} set.seed(15) tmp <- oncoSimulIndiv(examplesFitnessEffects[["o3"]], model = "McFL", mu = 5e-5, detectionSize = 1e8, detectionDrivers = 3,  fc38a875  sampleEvery = 0.025,  6610ac26  max.num.tries = 10, keepEvery = 5, initSize = 2000, finalTime = 20000, onlyCancer = FALSE, extraTime = 1500) tmp  use a drivers plot: {r tmpmdx5,fig.width=6.5, fig.height=4} par(las = 1, cex = 0.85)  34c1b54f  plot(tmp, addtot = TRUE, log = "", plotDiversity = TRUE, thinData = TRUE, thinData.keep = 0.5)  6610ac26   {r tmpmdx6,fig.width=6.5, fig.height=4} par(las = 1, cex = 0.85) plot(tmp, type = "stacked", plotDiversity = TRUE,  fc38a875  legend.ncols = 4, ylim = c(0, 5200), xlim = c(3400, 5000),  34c1b54f  thinData = TRUE, thinData.keep = 0.5)  6610ac26   Now show the genotypes explicitly: {r tmpmdx7,fig.width=6.5, fig.height=5.3}  fc38a875  ## Improve telling apart the most abundant  6610ac26  ## genotypes by sorting colors ## differently via breakSortColors ## Modify ncols of legend, so it is legible by not overlapping ## with plot par(las = 1, cex = 0.85) plot(tmp, show = "genotypes", breakSortColors = "distave", plotDiversity = TRUE, legend.ncols = 4,  fc38a875  ylim = c(0, 5300), xlim = c(3400, 5000),  34c1b54f  thinData = TRUE, thinData.keep = 0.5)  6610ac26   As before, the argument plotDiversity = TRUE asks to show a small plot on top with Shannon's diversity index. Here, as before, the quick clonal expansion of the clone with two drivers leads to a sudden drop in diversity (for a while, the population is made virtually of a single clone). Note, however, that compared to section \@ref(mcf50070), we are modeling here a scenario with very few genes, and correspondingly very few possible genotypes, and thus it is not strange that we observe very little diversity. (We have used extraTime to continue the simulation well past the point of detection, here specified as three drivers. Instead of specifying extraTime we can set the detectionDrivers value to a number larger than the number of existing possible drivers, and the simulation will run until finalTime if onlyCancer = FALSE.) ## Interactive graphics {#interactive} It is possible to create interactive stacked area and stream plots using the r Githubpkg("hrbrmstr/streamgraph") package, available from . However, that package is not available as a CRAN or BioConductor package, and thus we cannot depend on it for this vignette (or this package). You can, however, paste the code below and make it run locally.  fc38a875  Before calling the streamgraph function, though, we need to  6610ac26  convert the data from the original format in which it is stored into "long format". A simple convenience function is provided as  fc38a875  OncoSimulWide2Long in r Biocpkg("OncoSimulR").  6610ac26  As an example, we will use the data we generated above for section \@ref(bauer2). {r, eval=FALSE} ## Convert the data lb1 <- OncoSimulWide2Long(b1)  fc38a875  ## Install the streamgraph package from GitHub and load  6610ac26  library(devtools) devtools::install_github("hrbrmstr/streamgraph") library(streamgraph) ## Stream plot for Genotypes sg_legend(streamgraph(lb1, Genotype, Y, Time, scale = "continuous"), show=TRUE, label="Genotype: ") ## Staked area plot and we use the pipe streamgraph(lb1, Genotype, Y, Time, scale = "continuous", offset = "zero") %>% sg_legend(show=TRUE, label="Genotype: ")   4ac900b7   6610ac26   bbd76086  \clearpage  6610ac26  # Sampling multiple simulations {#sample} Often, you will want to simulate multiple runs of the same scenario, and then obtain the matrix of runs by mutations (a matrix of individuals/samples by genes or, equivalently, a vector of "genotypes"), and do something with them. OncoSimulR offers several ways of doing this.  fc38a875  The key function here is samplePop, either called explicitly after oncoSimulPop (or oncoSimulIndiv), or implicitly as part of a call to oncoSimulSample. With samplePop you can use **single cell** or **whole tumor** sampling (for details see the help of samplePop). Depending on how the simulations were conducted, you might also sample at different times, or as a function of population sizes. A major difference between procedures has to do with whether or not you want to keep the complete history of the simulations.  6610ac26   f9f9ab96  **You want to keep the complete history of population sizes of clones during the simulations**. You will  6610ac26  simulate using:  fc38a875  - oncoSimulIndiv repeatedly (maybe within mclapply, to parallelize the run). - oncoSimulPop. oncoSimulPop is basically a thin wrapper around oncoSimulIndiv that uses mclapply.  6610ac26   fc38a875  In both cases, you specify the conditions for ending the simulations (as explained in \@ref(endsimul)). Then, you use function samplePop to obtain the matrix of samples by mutations.  6610ac26   f9f9ab96  **You do not want to keep the complete history of population sizes of clones during the simulations**. You will simulate using:  6610ac26   fc38a875  - oncoSimulIndiv repeatedly, with argument keepEvery = NA.  6610ac26   fc38a875  - oncoSimulPop, with argument keepEvery = NA. In both cases you specify the conditions for ending the simulations (as explained in \@ref(endsimul)). Then, you use function samplePop.  6610ac26   fc38a875  - oncoSimulSample, specifying the conditions for ending the simulations (as explained in \@ref(endsimul)). In this case, you will not use samplePop, as that is implicitly called by  6610ac26  oncoSimulSample. The output is directly the matrix (and a little bit of summary from each run), and during the simulation it only stores one time point. Why the difference between the above cases? If you keep the complete  f9f9ab96  history of population sizes, you can take samples at any of the times between the beginning and the end of the simulations. If you do not keep the history, you can only sample at the time the simulation exited (see section \@ref(trackindivs)). Why would you want to use the second route? If we are only interested in the final matrix of individuals by mutations, keeping the complete history above is wasteful because we store fully all of the simulations (for example in the call to oncoSimulPop) and then sample (in the call to samplePop). Further criteria to use when choosing between sampling procedures is whether you need detectionSize and detectionDrivers do differ between simulations: if you use oncoSimulPop the arguments for detectionSize and detectionDrivers must be the same for all simulations but this is not the case for oncoSimulSample. See further comments in \@ref(diffsample). Finally, parallelized execution is available for oncoSimulPop but, by design, not for oncoSimulSample.  6610ac26   fc38a875  The following are a few examples. First we run oncoSimulPop to obtain 4 simulations and in the last line we sample from them:  6610ac26   fc38a875  {r pancrpopcreate}  6610ac26  pancrPop <- oncoSimulPop(4, pancr, detectionSize = 1e7, keepEvery = 10, mc.cores = 2) summary(pancrPop) samplePop(pancrPop)  Now a simple multiple call to oncoSimulIndiv wrapped inside  fc38a875  mclapply; this is basically the same we just did above. We set  6610ac26  the class of the object to allow direct usage of  fc38a875  samplePop. (Note: in Windows mc.cores > 1 is not  6610ac26  supported, so for the vignette to run in Windows, Linux, and Mac we  fc38a875  explicitly set it here in the call to mclapply. For regular  6610ac26  usage, you will not need to do this; just use whatever is appropriate for your operating system and number of cores. As well, we do not need any of  fc38a875  this with oncoSimulPop because the code inside oncoSimulPop already takes care of setting mc.cores  6610ac26  to 1 in Windows). {r} library(parallel) if(.Platform$OS.type == "windows") { mc.cores <- 1 } else { mc.cores <- 2 } p2 <- mclapply(1:4, function(x) oncoSimulIndiv(pancr, detectionSize = 1e7, keepEvery = 10), mc.cores = mc.cores) class(p2) <- "oncosimulpop" samplePop(p2)  Above, we have kept the complete history of the simulations as you can check by doing, for instance {r} tail(pancrPop[[1]]$pops.by.time)  If we were not interested in the complete history of simulations we could have done instead (note the argument keepEvery = NA) {r} pancrPopNH <- oncoSimulPop(4, pancr, detectionSize = 1e7, keepEvery = NA, mc.cores = 2) summary(pancrPopNH) samplePop(pancrPopNH)  which only keeps the very last sample: {r} pancrPopNH[[1]]$pops.by.time   fc38a875  Or we could have used oncoSimulSample:  6610ac26  {r} pancrSamp <- oncoSimulSample(4, pancr) pancrSamp$popSamp  Again, why the above differences? If we are only interested in the final matrix of populations by mutations, keeping the complete history the above is wasteful, because we store fully all of the  fc38a875  simulations (in the call to oncoSimulPop) and then sample (in the call to samplePop).  6610ac26   fc38a875  ## Whole-tumor and single-cell sampling, and do we always want to sample? {#alwayssamp}  6610ac26   fc38a875  samplePop is designed to emulate the process of obtaining a sample from a (set of) "patient(s)". But there is no need to sample. The history of the population, with a granularity that is controlled by argument keepEvery, is kept in the matrix pops.by.time which contains the number of cells of every clone at every sampling point (see further details in \@ref(trackindivs)). This is the information  f9f9ab96  used in the plots that show the trajectory of a simulation: the  fc38a875  plots that show the change in genotype or driver abundance over time (see section \@ref(plotraj) and examples mentioned there).  6610ac26   fc38a875  Regardless of whether and how you plot the information in pops.by.time, you can also sample one or multiple simulations using samplePop. In **whole-tumor** sampling the resolution is the whole tumor (or the whole population). Thus, a key argument is thresholdWhole, the threshold for detecting a mutation: a gene is considered mutated if it is altered in at least "thresholdWhole" proportion of the cells in that simulation (at a particular time point). This of course means that your "sampled genotype" might not correspond to any existing genotype because we are summing over  f9f9ab96  all cells in the population. For instance, suppose that at the time we take  fc38a875  the sample there are only two clones in the population, one clone with a frequency of 0.4 that has gene A mutated, and a second clone one with a frequency of 0.6 that has gene B mutated. If you set thresholdWhole to values$\leq 0.4$the sampled genotype will show both A and B mutated. **Single-cell** sampling is provided as an option in contrast to whole-tumor sampling. Here any sampled genotype will correspond to an existing genotype as you are sampling with single-cell resolution.  6610ac26   fc38a875  When samplePop is run on a set of simulated data of, say, 100 simulated trajectories (100 "subjects"), it will produce a matrix with 100 rows (100 "subjects"). But if it makes sense in the context of your problem (e.g., multiple samples per patient?) you can of course run samplePop repeatedly.  6610ac26  ## Differences between "samplePop" and "oncoSimulSample" {#diffsample}  fc38a875  samplePop provides two sampling times: "last" and  6610ac26  "uniform". It also allows you to sample at the first sample time(s) at which the population(s) reaches a given size, which can be either the same or different for each simulation (with argument popSizeSample). "last" means to sample each individual in the very last time period of the  fc38a875  simulation. "uniform" means sampling each individual at a time chosen  6610ac26  uniformly from all the times recorded in the simulation between the time when the first driver appeared and the final time period. "unif" means that it is almost sure that different individuals will be sampled at different times. "last" does not guarantee that different individuals will be sampled at the same time unit, only that all will be sampled in the last time unit of their simulation.  fc38a875  With oncoSimulSample we obtain samples that correspond to timeSample = "last" in samplePop by specifying a unique value for detectionSize and detectionDrivers. The data from each simulation will  6610ac26  correspond to the time point at which those are reached (analogous to timeSample = "last"). How about uniform sampling? We pass a  fc38a875  vector of detectionSize and detectionDrivers,  6610ac26  where each value of the vector comes from a uniform distribution. This is  fc38a875  not identical to the "uniform" sampling of oncoSimulSample,  6610ac26  as we are not sampling uniformly over all time periods, but are stopping at uniformly distributed values over the stopping conditions. Arguably,  fc38a875  however, the procedure in samplePop might be closer to what we  6610ac26  mean with "uniformly sampled over the course of the disease" if that course is measured in terms of drivers or size of tumor.  fc38a875  An advantage of oncoSimulSample is that we can specify  6610ac26  arbitrary sampling schemes, just by passing the appropriate vector  fc38a875  detectionSize and detectionDrivers. A disadvantage  6610ac26  is that if we change the stopping conditions we can not just resample the data, but we need to run it again.  fc38a875  There is no difference between oncoSimulSample and oncoSimulPop + samplePop in terms of the  6610ac26  typeSample argument (whole tumor or single cell). Finally, there are some additional differences between the two  fc38a875  functions. oncoSimulPop can run parallelized (it uses mclapply). This is not done with oncoSimulSample  6610ac26  because this function is designed for simulation experiments where you want to examine many different scenarios simultaneously. Thus, we provide additional stopping criteria (max.wall.time.total and max.num.tries.total) to determine whether to continue running the simulations, that bounds the total running time of all the simulations in  fc38a875  a call to oncoSimulSample. And, if you are running multiple  6610ac26  different scenarios, you might want to make multiple, separate,  fc38a875  independent calls (e.g., from different R processes) to oncoSimulSample, instead of relying in mclapply, since this is likely to lead to better usage of multiple cores/CPUs if you are examining a large number of different scenarios.  6610ac26   c95df82d   fc38a875   c95df82d   6610ac26   bbd76086  \clearpage  6610ac26   fc38a875  # Showing the genealogical relationships of clones {#phylog}  6610ac26   fc38a875  If you run simulations with keepPhylog = TRUE, the simulations keep track of when every clone is generated, and that will allow us to see the  f9f9ab96  parent-child relationships between clones. (This is disabled by default).  6610ac26   fc38a875   6610ac26  Let us re-run a previous example: {r} set.seed(15) tmp <- oncoSimulIndiv(examplesFitnessEffects[["o3"]], model = "McFL", mu = 5e-5, detectionSize = 1e8, detectionDrivers = 3,  fc38a875  sampleEvery = 0.025,  6610ac26  max.num.tries = 10, keepEvery = 5, initSize = 2000, finalTime = 20000, onlyCancer = FALSE, extraTime = 1500, keepPhylog = TRUE) tmp  We can plot the parent-child relationships^[There are several packages in R devoted to phylogenetic inference and related issues. For instance, r CRANpkg("ape"). I have not used that infrastructure because of our very specific needs and circumstances; for instance, internal nodes are observed, we can have networks instead of trees, and we have no uncertainty about when events occurred.] of every clone ever created (with fitness larger than 0 ---clones without viability are never shown): {r} plotClonePhylog(tmp, N = 0)  However, we often only want to show clones that exist (have number of cells$>0$) at a certain time (while of course showing all of their ancestors, even if those are now extinct ---i.e., regardless of their current numbers). {r} plotClonePhylog(tmp, N = 1)  If we set keepEvents = TRUE the arrows show how many times each clone appeared: (The next can take a while) {r pcpkeepx1} plotClonePhylog(tmp, N = 1, keepEvents = TRUE)  And we can show the plot so that the vertical axis is proportional to time (though you might see overlap of nodes if a child node appeared shortly after the parent): {r} plotClonePhylog(tmp, N = 1, timeEvents = TRUE)  We can obtain the adjacency matrix doing {r, fig.keep="none"} get.adjacency(plotClonePhylog(tmp, N = 1, returnGraph = TRUE))  We can see another example here: {r} set.seed(456) mcf1s <- oncoSimulIndiv(mcf1, model = "McFL", mu = 1e-7, detectionSize = 1e8, detectionDrivers = 100,  fc38a875  sampleEvery = 0.025,  6610ac26  keepEvery = 2, initSize = 2000, finalTime = 1000, onlyCancer = FALSE, keepPhylog = TRUE)  Showing only clones that exist at the end of the simulation (and all their parents): {r} plotClonePhylog(mcf1s, N = 1)  Notice that the labels here do not have a "_", since there were no order effects in fitness. However, the labels show the genes that are mutated, just as before. Similar, but with vertical axis proportional to time: {r} par(cex = 0.7) plotClonePhylog(mcf1s, N = 1, timeEvents = TRUE)  What about those that existed in the last 200 time units? {r} par(cex = 0.7) plotClonePhylog(mcf1s, N = 1, t = c(800, 1000))  And try now to show also when the clones appeared (we restrict the time to between 900 and 1000, to avoid too much clutter): {r} par(cex = 0.7) plotClonePhylog(mcf1s, N = 1, t = c(900, 1000), timeEvents = TRUE)  (By playing with t, it should be possible to obtain animations of the phylogeny. We will not pursue it here.) If the previous graph seems cluttered, we can represent it in a different way by calling r CRANpkg("igraph") directly after storing the graph and using the default layout: {r fig.keep="none"} g1 <- plotClonePhylog(mcf1s, N = 1, t = c(900, 1000), returnGraph = TRUE)  {r} plot(g1)  which might be easier to show complex relationships or identify central or key clones. It is of course quite possible that, especially if we consider few genes, the parent-child relationships will form a network, not a tree, as the same child node can have multiple parents. You can play with this example, modified from one we saw before (section \@ref(mn1)): {r, eval=FALSE} op <- par(ask = TRUE) while(TRUE) { tmp <- oncoSimulIndiv(smn1, model = "McFL", mu = 5e-5, finalTime = 500, detectionDrivers = 3, onlyCancer = FALSE, initSize = 1000, keepPhylog = TRUE) plotClonePhylog(tmp, N = 0) } par(op)  ## Parent-child relationships from multiple runs {#phylogmult}  fc38a875  If you use oncoSimulPop you can store and plot the  4ac900b7  "phylogenies" of the different runs:  6610ac26  {r} oi <- allFitnessEffects(orderEffects = c("F > D" = -0.3, "D > F" = 0.4), noIntGenes = rexp(5, 10), geneToModule = c("F" = "f1, f2, f3", "D" = "d1, d2") ) oiI1 <- oncoSimulIndiv(oi, model = "Exp") oiP1 <- oncoSimulPop(4, oi, keepEvery = 10, mc.cores = 2, keepPhylog = TRUE)  We will plot the first two: {r, fig.height=9} op <- par(mar = rep(0, 4), mfrow = c(2, 1)) plotClonePhylog(oiP1[[1]]) plotClonePhylog(oiP1[[2]]) par(op)   fc38a875  This is so far disabled in function oncoSimulSample, since  6610ac26  that function is optimized for other uses. This might change in the future.  bbd76086  \clearpage  6610ac26  # Generating random fitness landscapes {#gener-fit-land}  c95df82d  In most of the examples seen above, we have fully specified the fitness of the different genotypes (either by providing directly the  fc38a875  full mapping genotypes to fitness, or by providing that mapping by  c95df82d  specifying the effects of the different gene combinations). In some cases, however, we might want to specify a particular model that generates the fitness landscape, and then have fitnesses be random  fc38a875  variables obtained under this model. In other words, in this random fitness landscape the fitness of the genotypes is a random variable  c95df82d  generated under some specific model. Random fitness landscapes are used extensively, for instance, to understand the evolutionary  f9f9ab96  consequences of different types of epistatic interactions [e.g., @szendro_quantitative_2013; @franke_evolutionary_2011] and there are  c95df82d  especially developed tools for plotting and analyzing random fitness  f9f9ab96  landscapes [e.g., @brouillet_magellan:_2015].  c95df82d  With OncoSimulR it is possible to generate mappings of genotype to  fc38a875  fitness using the function rfitness that allows you to use from  c95df82d  a pure House of Cards model to a purely additive model. I have  fc38a875  followed @szendro_quantitative_2013 and @franke_evolutionary_2011 and  c95df82d  model fitness as  6610ac26  $$f_i = -c\ d(i, reference) + x_i \label{eq:1}$$ where$d(i, j)$is the Hamming distance between genotypes$i$and$j$(the number of positions that differ),$c$is the decrease in fitness of a genotype per each unit increase in Hamming distance from the reference genotype, and$x_i$is a random variable (in this case, a normal deviate of mean 0 and standard deviation$sd$). You can change the reference genotype to any of the genotypes: for the deterministic part, you make the fittest genotype be the one with all positions mutated by setting reference = "max", or use the wildtype by using a string of 0s, or randomly select a genotype as a  fc38a875  reference by using reference = "random" or reference = "random2". And by changing$c$and$sd$you can flexibly modify the relative weight of the purely House of Cards vs. additive component. The expression used above is also very similar to the one on @greene_changing_2014 if you use rfitness with the argument reference = "max".  6610ac26  What can you do with these genotype to fitness mappings? You could plot  fc38a875  them, you could use them as input for oncoSimulIndiv and related functions, or you could export them (to_Magellan) and plot them externally  6610ac26  (e.g., in MAGELLAN: , @brouillet_magellan:_2015). {r} ## A small example rfitness(3)  f9f9ab96  ## A 5-gene example, where the reference genotype is the  bbd76086  ## one with all positions mutated, similar to Greene and Crona, ## 2014. We will plot the landscape and use it for simulations ## We downplay the random component with a sd = 0.5  6610ac26  r1 <- rfitness(5, reference = rep(1, 5), sd = 0.6) plot(r1) oncoSimulIndiv(allFitnessEffects(genotFitness = r1))   bbd76086  \clearpage  c95df82d   f9f9ab96  # Measures of evolutionary predictability and genotype diversity {#evolpredszend}  fc38a875  Several measures of evolutionary predictability have been proposed in the literature (see, e.g., @szendro_predictability_2013 and references  f9f9ab96  therein). We provide two, Lines of Descent (LOD) and Path of the Maximum  fc38a875  (POM), following @szendro_predictability_2013; we also provide a simple measure of diversity of the actual genotypes sampled. In @szendro_predictability_2013 "(...) paths defined as the time ordered sets of genotypes that at some time contain the largest  f9f9ab96  subpopulation" are called "Path of the Maximum" (POM) (see their  ab439943  p. 572). In our case, POM are obtained by finding the clone with largest population size whenever we sample and, thus, the POMs will be affected by how often we sample (argument sampleEvery), since we  fc38a875  are running a continuous time process.  ab439943  @szendro_predictability_2013 also define Lines of Descent (LODs) which "(...) represent the lineages that arrive at the most populated genotype at the final time". In that same page (572) they provide the details on how the LODs are obtained. Starting with version 2.9.2 of OncoSimulR I only provide an implementation where a single LOD per simulation is returned, with the same meaning as in @szendro_predictability_2013.  fc38a875  To briefly show some output, we will use again the \@ref(pancreas)  ab439943  example.  fc38a875  {r lod_pom_ex}  bbd76086  pancr <- allFitnessEffects( data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", "TP53", "TP53", "MLL3"), child = c("KRAS","SMAD4", "CDNK2A", "TP53", "MLL3", rep("PXDN", 3), rep("TGFBR2", 2)), s = 0.05, sh = -0.3, typeDep = "MN"))  fc38a875   ab439943  pancr16 <- oncoSimulPop(16, pancr, model = "Exp",  fc38a875  mc.cores = 2) ## Look a the first POM str(POM(pancr16)[1:3]) LOD(pancr16)[1:2]  bbd76086  ## The diversity of LOD (lod_single) and POM might or might not ## be identical  fc38a875  diversityPOM(POM(pancr16)) diversityLOD(LOD(pancr16))  bbd76086  ## Show the genotypes and their diversity (which might, or might ## not, differ from the diversity of LOD and POM)  fc38a875  sampledGenotypes(samplePop(pancr16))   bbd76086  \clearpage  fc38a875   f9f9ab96  # Generating random DAGs for restrictions {#simo} You might want to randomly generate DAGs like those often found in the literature on Oncogenetic trees et al. Function simOGraph might help here. {r} ## No seed fixed, so reruns will give different DAGs. (a1 <- simOGraph(10)) library(graph) ## for simple plotting plot(as(a1, "graphNEL"))  Once you obtain the adjacency matrices, it is for now up to you to convert them into appropriate posets or fitnessEffects objects. Why this function? I searched for, and could not find any that did what I wanted, in particular bounding the number of parents, being able to specify the approximate depth^[Where depth is defined in the usual way to mean smallest number of nodes ---or edges--- to traverse to get from the bottom to the top of the DAG.] of the graph, and optionally being able to have DAGs where no node is connected to another both directly (an edge between the two) and indirectly (there is a path between the two through other nodes). So I wrote my own code. The code is fairly simple to understand (all in file generate-random-trees.R). I would not be surprised if this way of generating random graphs has been proposed and named before; please let me know, best if with a reference. Should we remove direct connections if there are indirect? Or,  f9a38e24  should we set removeDirectIndirect = TRUE? Setting removeDirectIndirect = TRUE is basically asking for the [transitive reduction](https://en.wikipedia.org/wiki/Transitive_reduction) of the generated DAG. Except for @Farahani2013 and @ramazzotti_capri_2015, none of the DAGs I've seen in the context of CBNs, Oncogenetic trees, etc, include both direct and indirect connections between nodes. If these exist, reasoning about the model can be harder. For example, with CBN (AND or CMPN or monotone relationships) adding a direct connection makes no difference iff we assume that the relationships encoded in the DAG are fully respected (e.g., all$s_h = -\infty$). But it can make a difference if we allow for deviations from the monotonicity, specially if we only check for the satisfaction of the presence of the immediate ancestors. And things get even trickier if we combine XOR with AND. Thus, I strongly suggest you leave the default removeDirectIndirect = TRUE. If you change it, you should double check that the fitnesses of the possible genotypes are what you expect. In fact, I would suggest that, to be sure you get what you think you should get, you convert the fitness from the DAG to a fitness table, and pass that to the simulations, and this requires using non-exposed user functions; to give you an idea, this could work (but you've been warned: this is dangerous!) {r simographindirect, eval=FALSE,echo=TRUE} g2 <- simOGraph(4, out = "rT", removeDirectIndirect = FALSE) fe_from_d <- allFitnessEffects(g2) fitness_d <- evalAllGenotypes(fe_from_d) fe_from_t <- allFitnessEffects(genotFitness = OncoSimulR:::allGenotypes_to_matrix(fitness_d)) ## Compare fitness_d (fitness_t <- evalAllGenotypes(fe_from_t)) identical(fitness_d, fitness_t) ## ... but to be safe use fe_from_t as the fitnessEffects object for simulations   f9f9ab96  \clearpage  fc38a875  # FAQ, odds and ends  f9f9ab96  ## What we mean by "clone"; and "I want clones disregarding passengers" {#meaningclone}  fc38a875  In this vignette we often use "clone" or "genotype" interchangeably. A clone denotes a set of cells that have identical genotypes. So if you are using a fitness specification with four genes (i.e., your genome has only four loci), there can be up to$16 = 2^4$different genotypes or clones. Any two entities that differ in the genotype are different clones. And this applies regardless of whether or not you declare that some genes (loci) are drivers or not. So if you have four genes, it does not matter whether only the first or all four are regarded as drivers; you will always have at most 16 different clones or 16 different genotypes. Of course you can arrive at the same clone/genotype by different routes. Just think about loci A and B in our four-loci genome, and how you can end up with a cell with both A and B mutated. Analogously, if you have 100 genes, 10 drivers and 90 passengers, you can have up to$2^{100}$different clones or genotypes. Sure, one cell might have driver A mutated and passenger B mutated, and another cell might have driver A mutated and passenger C mutated. So if you only look at drivers you might be tempted to say that they are "the same clone for all practical purposes"; but they really are not the same clone as they differ in their genotype and this makes a lot of difference computationally.  f9f9ab96  If you want summaries of simulations that collapse over some genes (say, some "passengers", the 90 passengers we just mentioned) look at the help for samplePop, argument geneNames. This would allow you, for instance, to look at the diversity of clones/genotypes, considering as identical those genotypes that only differ in genes you deem relevant; something similar to defining a "drivers' clone" as the set formed from the union of all sets of cells that have identical genotype with respect to only the drivers (so that in the example of "A, B" and "A, C" just mentioned both cells would be considered "the same clone" as they only differ with respect to  fc38a875  passengers). However, this "disregard some genes" only applies to  f9f9ab96  summaries of simulations once we are done simulating data. OncoSimulR will always track clones, as defined above, regardless of whether many of those clones have the same genotype if you were to only focus on driver genes; see also section \@ref(trackindivs).  fc38a875  Labeling something as a "driver", therefore, does not affect what we mean by clone. Yes, labeling something as a driver can affect when you stop simulations if you use detectionDrivers as a stopping mechanism (see section \@ref(endsimul)). But, again, this has nothing to do with the definition of "clone". If this is all obvious to you, ignore it. I am adding it here because I've seen strange misunderstandings that eventually could be traced to the apparently multiple meanings of clone. (And to make the story complete, @Mather2012 use the expression "class" ---e.g., Algorithm 4 in the paper, Algorithm 5 in the supplementary material). ## Does OncoSimulR keep track of individuals or of clones? And how can it keep track of such large populations? {#trackindivs}  74449e3b  OncoSimulR keeps track of clones, where a clone is a set of cells that are genetically identical (note that this means completely identical over the whole set of genes/markers you are using; see section \@ref(meaningclone)). We do not need to keep track of individual cells because, for all purposes, and since we do not consider spatial structure, two or more cells that are genetically identical are interchangeable. This means, for instance, that the computational cost of keeping a population of a single clone with 1 individual or with$10^9$individuals is exactly the same: we just keep track of the genotype and the number of cells. (Sure, it is much more likely we will see a mutation soon in a clone with$10^9$cells than in a clone with 1, but that is a different issue.) Of course, the entities that die, reproduce, and mutate are individual cells. This is of course dealt with by tracking clones (as is clearly shown by Algorithms 4 and 5 in @Mather2012). Tracking individuals, as individuals, would provide no advantage, but would increase the computational burden by many orders of magnitude.  fc38a875   4ac900b7   fc38a875  ### sampleEvery, keepPhylog, and pruning {#prune} At each sampling time (where sampleEvery determines the time units between sampling times) the abundance of all the clones with number of cells$>0$is recorded. This is the structure that at the end of the run is converted into the pops.by.time matrix.  f9f9ab96  Now, some clones might arise from mutation between successive population samples but these clones might be extinct by the time we take a population sample. These clones do not appear in the pops.by.time matrix because, as we just said, they have 0 cells at the time of sampling. Of course, some of these clones might appear again later and reach a size larger than 0 at some posterior sampling time; it is at this time when this/these clone(s) will appear in the pops.by.time matrix. This pruning of clones with 0 cells can allow considerable savings in computing time (OncoSimulR needs to track the genotype of clones, their population sizes, their birth, death, and mutation rates, their next mutation time and the last time they were updated and thus it is important that we only loop over structures with information that is really needed). However, we still need to track clones as clones, not simply as classes such as "number of mutated genes". Therefore, very large genomes can represent a problem if they lead to the creation and tracking of many different clones (even if they have the same number of mutated genes), as we have seen, for instance, in section \@ref(lnum). In this case, programs that only keep track of numbers of mutated genes or of drivers, not individual clones, can of course achieve better speed.  fc38a875  What about the genealogy? If you ask OncoSimulR to keep track of the  f9f9ab96  complete parent-child relationships (keepPhylog = TRUE), you might see in the genealogy clones that are not present in pops.by.time if these are clones that never had a population size larger than 0 at any sampling time. To give an example, suppose that we will take population samples at times 0, 1, and 2. Clone A, with a population size larger than 0 at time 1, gives rise at time 1.5 to clone B; clone B then gives rise to clone C at time 1.8. Finally, suppose that at time 2 only clone C is alive. In other words, when we carry out the update of the population with Algorithm 5 from @Mather2012, clones A and B have size 0. Now, at time 1 clones B and C did not yet exist, and clone B is never alive at times 1 or 2. Thus, clone B is not present in pops.by.time. But we cannot remove clone B from our genealogy if we want to reflect the complete genealogy of C. Thus, pops.by.time will show only clones A and C (not B) but the complete genealogy will show clones A, B, C (and will show that B appeared from A at time 1.5 and C appeared from B at time 1.8). Since function plotClonePhylog offers a lot of flexibility with respect to what clones to show depending on their population sizes at different times, you can prevent being shown B, but its existence is there should you need it (see also \@ref(histlargegenes)).  fc38a875  ## Dealing with errors in "oncoSimulPop" {#errorosp} When running OncoSimulR under Windows mclapply does not use multiple cores, and errors from oncoSimulPop are reported directly. For example: {r} ## This code will only be evaluated under Windows if(.Platform$OS.type == "windows") try(pancrError <- oncoSimulPop(10, pancr, initSize = 1e-5, detectionSize = 1e7, keepEvery = 10, mc.cores = 2))  Under POSIX operating systems (e.g., GNU/Linux or Mac OSX) oncoSimulPop can ran parallelized by calling mclapply. Now, suppose you did something like {r} ## Do not run under Windows if(.Platform$OS.type != "windows") pancrError <- oncoSimulPop(10, pancr, initSize = 1e-5, detectionSize = 1e7, keepEvery = 10, mc.cores = 2)  The warning you are seeing tells you there was an error in the functions called by mclapply. If you check the help for mclpapply you'll see that it returns a try-error object, so we can inspect it. For instance, we could do: {r, eval=FALSE} pancrError[[1]]  But the output of this call might be easier to read: {r, eval=FALSE} pancrError[[1]][1]  And from here you could see the error that was returned by oncoSimulIndiv: initSize < 1 (which is indeed true: we pass initSize = 1e-5).  f9f9ab96  ## Whole tumor sampling, genotypes, and allele counts: what gives? And what about order? {#wtsampl}  fc38a875  You are obtaining genotypes, regardless of order. When we use "whole tumor sampling", it is the frequency of the mutations in each gene that counts, not the order. So, for instance, "c, d" and "c, d" both contribute to the counts of "c" and "d". Similarly, when we use single cell sampling, we obtain a genotype defined in terms of mutations, but there might be multiple orders that give this genotype. For example,$d > c$and$c > d$both give you a genotype with "c" and "d" mutated, and thus in the output you can have two columns with both genes mutated. ## Doesn't the BNB algorithm require small mutation rates for it to be advantageous? {#bnbmutation} As discussed in the original paper by @Mather2012 (see also their supplementary material), the BNB algorithm can achieve considerable speed advantages relative to other algorithms especially when mutation events are rare relative to birth and death events; the larger the mutation rate, the smaller the gains compared to other algorithms. As mentioned in their supplementary material (see p.5) "Note that the 'cost' of each step in BNB is somewhat higher than in  f9f9ab96  SSA [SSA is the original Gillespie's Stochastic Simulation Algorithm]  fc38a875  since it requires generation of several random numbers as compared to only two uniform random numbers for SSA. However this cost increase is small compared with significant benefits of jumping over birth and death reactions for the case of rare mutations." Since the earliest versions, OncoSimulR has provided information to assess these issues. The output of function oncoSimulIndiv includes a list called *"other"* that itself includes two lists named *"minDMratio"* and *"minBMratio"*, the smallest ratio, over all simulations, of death rate to mutation rate or birth rate to mutation rate, respectively. As explained above, the BNB algorithm thrives when those are large. Note, though, we say "it thrives": these ratios being large is not required for the BNB algorithm to be an exact simulation algorithm; these ratios being large make BNB comparatively much faster than other algorithms. ## Can we use the BNB algorithm with state-dependent birth or death rates? {#bnbdensdep} As discussed in the original paper by @Mather2012 (see sections 2.6 and 3.2 of the paper and section E of the supplementary material), the BNB algorithm can be used as an approximate stochastic simulation algorithm "(...) with non-constant birth, death, and mutation rates by evolving the system with a BNB step restricted to a short duration t." (p. 9 in supplementary material). The justification is that "(...) the propensities for reactions can be considered approximately constant during some short interval." (p. 1234). This is the reason why, when we use McFarland's model, we set a very short sampleEvery. In addition, the output of the simulation functions contains the simple summary statistic errorMF that can be used to assess the quality of the approximation[^errorMF]. [^errorMF]: Death rates are affected by density dependence and, thus, it is on the death rates where the approximation that they are constant over a short interval plays a role. Thus, we examine how large the difference between successive death rates is. More precisely, let$A$and$C$denote two successive sampling periods, with$D_A = log(1 + N_A/K)$and$D_C= log(1 + N_C/K)$their death rates. errorMF_size stores the largest$abs(D_C - D_A)$between any two sampling periods ever seen during a simulation. errorMF stores the largest$abs(D_C - D_A)/D_A$. Additionally, a simple procedure to use is to run the simulations with different values of sampleEvery, say the default value of 0.025 and values that are 10, 20, and 50 times larger or smaller, and assess their effects on the output of the simulations and the errorMF statistic itself. You can check that using a sampleEvery much smaller than 0.025 rarely makes any difference in errorMF or in the simulation output (though it increases computing time significantly). And, just for the fun of it, you can also check that using huge values for sampleEvery can lead to trouble and will be manifested too in the simulation output with large and unreasonable jumps in total population sizes and sudden extinctions. Note that, as the authors point out, approximations are common with stochastic simulation algorithms when there is density dependence, but the advantage of the BNB algorithm compared to, say, most tau-leap methods is that clones of different population sizes are treated uniformly. @Mather2012 further present results from simulations comparing the BNB algorithm with the original direct SSA method and the tau-leaps (see their Fig. 5), which shows that the approximation is very accurate as soon as the interval between samples becomes reasonably short.  f9f9ab96  ## Sometimes I get exceptions when running with mutator genes {#tomlinexcept}  fc38a875  Yes, sure, the following will cause an exception; this is similar to  f9f9ab96  the example used in \@ref(exmutantimut) but there is one crucial  fc38a875  difference: {r ex-tomlin1exc} sd <- 0.1 ## fitness effect of drivers sm <- 0 ## fitness effect of mutator nd <- 20 ## number of drivers nm <- 5 ## number of mutators mut <- 50 ## mutator effect THIS IS THE DIFFERENCE fitnessGenesVector <- c(rep(sd, nd), rep(sm, nm)) names(fitnessGenesVector) <- 1:(nd + nm) mutatorGenesVector <- rep(mut, nm) names(mutatorGenesVector) <- (nd + 1):(nd + nm) ft <- allFitnessEffects(noIntGenes = fitnessGenesVector, drvNames = 1:nd) mt <- allMutatorEffects(noIntGenes = mutatorGenesVector)  Now, simulate using the fitness and mutator specification. We fix the number of drivers to cancer, and we stop when those numbers of drivers are reached. Since we only care about the time it takes to reach cancer, not the actual trajectories, we set keepEvery = NA: {r ex-tomlinexc2} ddr <- 4 set.seed(2) RNGkind("L'Ecuyer-CMRG") st <- oncoSimulPop(4, ft, muEF = mt, detectionDrivers = ddr, finalTime = NA, detectionSize = NA, detectionProb = NA, onlyCancer = TRUE, keepEvery = NA, mc.cores = 2, ## adapt to your hardware seed = NULL) ## for reproducibility set.seed(NULL) ## return things to their "usual state"  What happened? That you are using five mutator genes, each with an effect of multiplying by 50 the mutation rate. So the genotype with all those five genes mutated will have an increased mutation rate of$50^5 = 312500000$. If you set the mutation rate to the default of$1e-6$you have a mutation rate of 312 which makes no sense (and leads to all sorts of numerical issues down the road and an early warning). Oh, but you want to accumulate mutator effects and have some, or the early ones, have a large effects and the rest progressively smaller effects? You can do that using epistatic effects for mutator effects. ## What are good values of sampleEvery? {#whatgoodsampleevery} First, we need to differentiate between the McFarland and the exponential models. If you use the McFarland model, you should read section \@ref(bnbdensdep) but, briefly, the small default is probably a good choice. With the exponential model, however, simulations can often be much faster if sampleEvery is large. How large? As large as you can make it. sampleEvery should not be larger than your desired keepEvery, where keepEvery determines the resolution or granularity of your samples (i.e., how often you take a snapshot of the population). If you only care about the final state, then set keepEvery = NA. The other factors that affects choosing a reasonable sampleEvery are mutation rate and population size. If population growth is very fast or mutation rate very large, you need to sample frequently to avoid the "Recoverable exception ti set to DBL_MIN. Rerunning." issue (see discussion in section \@ref(popgtzx)). ## What can you do with the simulations? This is up to you. In section \@ref(sample-1) we show an example where we infer an Oncogenetic tree from simulated data and in section \@ref(whatfor) we go over a varied set of scientific questions where OncoSimulR could help.  bbd76086  \clearpage  c95df82d   6610ac26  # Using v.1 posets and simulations {#v1} It is strongly recommended that you use the new (v.2) procedures for specifying fitness effects. However, the former v.1 procedures are still available, with only very minor changes to function calls. What follows below is the former vignette. You might want to use v.1 because for certain models (e.g., small number of genes, with restrictions as specified by a simple poset) simulations might be faster with v.1 (fitness evaluation is much simpler ---we are working on further improving speed). ## Specifying restrictions: posets {#poset} How to specify the restrictions is shown in the help for  fc38a875  poset. It is often useful, to make sure you did not make any  6610ac26  mistakes, to plot the poset. This is from the examples (we use an "L" after a number so that the numbers are integers, not doubles; we could alternatively have modified storage.mode). {r, fig.height=3} ## Node 2 and 3 depend on 1, and 4 depends on no one p1 <- cbind(c(1L, 1L, 0L), c(2L, 3L, 4L)) plotPoset(p1, addroot = TRUE)  {r, fig.height=3}  bbd76086  ## A simple way to create a poset where no gene (in a set of 15) ## depends on any other.  6610ac26  p4 <- cbind(0L, 15L) plotPoset(p4, addroot = TRUE)  Specifying posets is actually straightforward. For instance, we can specify the pancreatic cancer poset in @Gerstung2011 (their figure 2B, left). We specify the poset using numbers, but for nicer plotting we will use names (KRAS is 1, SMAD4 is 2,  fc38a875  etc). This example is also in the help for poset:  6610ac26  {r, fig.height=3} pancreaticCancerPoset <- cbind(c(1, 1, 1, 1, 2, 3, 4, 4, 5), c(2, 3, 4, 5, 6, 6, 6, 7, 7)) storage.mode(pancreaticCancerPoset) <- "integer" plotPoset(pancreaticCancerPoset, names = c("KRAS", "SMAD4", "CDNK2A", "TP53", "MLL3","PXDN", "TGFBR2"))  ## Simulating cancer progression {#simul1} We can simulate the progression in a single subject. Using an example very similar to the one in the help: {r, echo=FALSE,results='hide',error=FALSE} options(width=60)  {r} ## use poset p1101 data(examplePosets) p1101 <- examplePosets[["p1101"]] ## Bozic Model b1 <- oncoSimulIndiv(p1101, keepEvery = 15) summary(b1)  The first thing we do is make it simpler (for future examples) to use a set of restrictions. In this case, those encoded in poset p1101. Then, we run the simulations and look at a simple summary and a plot. If you want to plot the trajectories, it is better to keep more frequent samples, so you can see when clones appear: {r pb2bothx1,fig.height=5.5, fig.width=5.5} b2 <- oncoSimulIndiv(p1101, keepEvery = 1) summary(b2) plot(b2)  As we have seen before, the stacked plot here is less useful and that is why I do not evaluate that code for this vignette. {r pbssttx1,eval=FALSE} plot(b2, type = "stacked")  The following is an example where we do not care about passengers, but we want to use a different graph, and we want a few more drivers before considering cancer has been reached. And we allow it to run for longer.  fc38a875  Note that in the McFL model detectionSize really plays no  6610ac26  role. Note also how we pass the poset: it is the same as before, but now we directly access the poset in the list of posets. {r, echo=FALSE,eval=TRUE}  fc38a875  set.seed(1) ## for reproducibility. Once I saw it not reach cancer.  6610ac26   {r} m2 <- oncoSimulIndiv(examplePosets[["p1101"]], model = "McFL", numPassengers = 0, detectionDrivers = 8, mu = 5e-7, initSize = 4000, sampleEvery = 0.025, finalTime = 25000, keepEvery = 5, detectionSize = 1e6)  (Very rarely the above run will fail to reach cancer. If that happens, execute it again.) As usual, we will plot using both a line and a stacked plot: {r m2x1,fig.width=6.5, fig.height=10} par(mfrow = c(2, 1)) plot(m2, addtot = TRUE, log = "", thinData = TRUE, thinData.keep = 0.5) plot(m2, type = "stacked", thinData = TRUE, thinData.keep = 0.5)  The default is to simulate progression until a simulation reaches cancer (i.e., only simulations that satisfy the detectionDrivers or the  fc38a875  detectionSize will be returned). If you use the McFL model with large  6610ac26  enough initSize this will often be the case but not if you use very small initSize. Likewise, most of the Bozic runs do not reach cancer. Lets try a few: {r} b3 <- oncoSimulIndiv(p1101, onlyCancer = FALSE) summary(b3) b4 <- oncoSimulIndiv(p1101, onlyCancer = FALSE) summary(b4)  Plot those runs: {r b3b4x1ch1, fig.width=8, fig.height=4} par(mfrow = c(1, 2)) par(cex = 0.8) ## smaller font plot(b3) plot(b4)  ### Simulating progression in several subjects To simulate the progression in a bunch of subjects (we will use only four, so as not to fill the vignette with plots) we can do, with the same settings as above: {r ch2} p1 <- oncoSimulPop(4, p1101, mc.cores = 2) par(mfrow = c(2, 2)) plot(p1, ask = FALSE)  We can also use stream and stacked plots, though they might not be as useful in this case. For the sake of keeping the vignette small, these are commented out. {r p1multx1,eval=FALSE} par(mfrow = c(2, 2)) plot(p1, type = "stream", ask = FALSE)  {r p1multstx1,eval=FALSE} par(mfrow = c(2, 2)) plot(p1, type = "stacked", ask = FALSE)  ## Sampling from a set of simulated subjects {#sample-1} You will often want to do something with the simulated data. For instance, sample the simulated data. Here we will obtain the trajectories for 100 subjects in a scenario without passengers. Then we will sample with the default options and store that as a vector of genotypes (or a matrix of subjects by genes): {r} m1 <- oncoSimulPop(100, examplePosets[["p1101"]], numPassengers = 0, mc.cores = 2)   fc38a875  The function samplePop samples that object, and also gives you  6610ac26  some information about the output: {r} genotypes <- samplePop(m1)  What can you do with it? That is up to you. As an example, let us try to  fc38a875  infer an Oncogenetic tree (and plot it) using the r CRANpkg("Oncotree")  6610ac26  package @Oncotree after getting a quick look at the marginal frequencies of events: {r fxot1,fig.width=4, fig.height=4} colSums(genotypes)/nrow(genotypes) require(Oncotree) ot1 <- oncotree.fit(genotypes) plot(ot1)  Your run will likely differ from mine, but with the defaults (detection size of$10^8\$) it is likely that events down the tree will never appear. You can set detectionSize = 1e9 and you will see that events down the tree are now found in the cross-sectional sample. Alternatively, you can use single cell sampling and that, sometimes, recovers one or a couple more events. {r fxot2,fig.width=4, fig.height=4} genotypesSC <- samplePop(m1, typeSample = "single") colSums(genotypesSC)/nrow(genotypesSC) ot2 <- oncotree.fit(genotypesSC) plot(ot2)  You can of course rename the columns of the output matrix to something else if you want so the names of the nodes will reflect those potentially more meaningful names.  f9f9ab96  \clearpage  6610ac26  # Session info and packages used This is the information about the version of R and packages used: {r} sessionInfo()  # Funding Supported by BFU2015-67302-R (MINECO/FEDER, EU). {r, echo=FALSE, eval=TRUE} ## reinitialize the seed set.seed(NULL)  # References  fc38a875   4ac900b7