---
title: "OncoSimulR: forward genetic simulation in asexual populations with arbitrary epistatic interactions and a focus on modeling tumor progression."
author: "

         Ramon Diaz-Uriarte\\

         Dept. Biochemistry, Universidad Autónoma de Madrid, Instituto de
         Investigaciones Biomédicas 'Alberto Sols' (UAM-CSIC), Madrid,
         Spain.\\ 
		 
		 <rdiaz02@gmail.com>, <http://ligarto.org/rdiaz>
		 "
date: "`r paste0(Sys.Date(),'. OncoSimulR version ', packageVersion('OncoSimulR'), suppressWarnings(ifelse(length(try(system('git rev-parse --short HEAD', ignore.stderr = TRUE, intern = TRUE))), paste0('. Revision: ', system('git rev-parse --short HEAD', intern = TRUE)), '')))`"
header-includes:
    - \input{preamble.tex}
output: 
  bookdown::html_document2:
    css: custom4.css
    toc: yes
    toc_float: true
    fig_retina: null
classoption: a4paper
geometry: margin=3cm
fontsize: 12pt
bibliography: OncoSimulR.bib
biblio-style: "apalike"
link-citations: true
vignette: >
  %\VignetteIndexEntry{OncoSimulR: forward genetic simulation in asexual populations with arbitrary epistatic interactions and a focus on modeling tumor progression.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignettePackage{OncoSimulR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{OncoSimulR simulation cancer oncogenetic trees}
  %\VignetteDepends{OncoSimulR}
---

<!-- html_book seems to ignore the toc_float -->

  <!-- bookdown::gitbook: -->
  <!--   self_contained: false -->
  <!--   split_by: rmd -->
  <!--   css: custom4.css -->
  <!--   github-repo: "rdiaz02/OncoSimul" -->
  <!--   config: -->
  <!--     toc: -->
  <!--       collapse: subsection -->
  <!--     download: null -->
  <!--     sharing: null -->


<!-- A great one? Well... Slow, and no subsubsection and strange scrolling-->
<!-- render("OncoSimulR.Rmd", bookdown::gitbook(self_contained = FALSE, split_by = "none")) -->
<!-- Slow rendering, and the clicking on TOC to get expanded entries does -->
<!-- not work. -->
<!-- And it creates a libs dir that takes 5 MB. -->
<!-- What I like is the possibility to hide the TOC, change rendering -->
<!-- (color, font size and type) and download.  -->

<!-- This did not work well either.  -->
  <!-- bookdown::gitbook: -->
  <!--   toc-depth: 4  -->
  <!--   self_contained: false -->
  <!--   split_by: none -->
  <!--   css: custom4.css -->
  <!--   github-repo: "rdiaz02/OncoSimul" -->
  <!--   config: -->
  <!--     toc: -->
  <!--       collapse: section -->
  <!--       scroll_highlight: true -->
  <!--       after: null -->
  <!--       before: null -->
  <!--       toc-depth: 4		 -->
  <!--     toolbar: -->
  <!--       position: fixed -->
  <!--     edit: -->
  <!--       link: null -->
  <!--       text: null -->
  <!--     download: null -->
  <!--     search: no -->
  <!--     fontsettings: -->
  <!--       theme: white -->
  <!--       family: sans -->
  <!--       size: 2 -->
  <!--     sharing: null -->

<!-- download, only on my web site -->
<!-- EPUB and MOBI and PDF -->
<!-- url: "https\://" -->

<!-- output:  -->
<!--   bookdown::html_document2:  -->
<!--     toc: yes -->
<!--     toc_float: true -->
<!--     css: custom4.css -->


<!-- ##    <rdiaz02@gmail.com>, <http://ligarto.org/rdiaz> -->
<!-- ## date: "`r Sys.Date()`" -->

<!-- ## Bioc html_document no refs -->
<!-- ## Bioc html_document2 too wide margins and really ugly -->
<!-- ## Simplest is to use bookdown and add BioC CSS -->
<!-- ## Rmdformats is really neat, but no support for \@ref -->
<!-- ## PDF as: 

<!-- render("OncoSimulR.Rmd", output_format = bookdown::pdf_document2(toc = TRUE, toc_depth = 4, keep_tex = TRUE)  -->

<!-- -->


<!-- After the fix in BiocStyle 2.3.24 can use this too, which gives the  -->
<!-- BiocStyle output, but it is not the same as the one for Rnw and -->
<!-- breaks refs, etc. I do not really like it-->
<!-- ## PDF as: render("OncoSimulR.Rmd", output_format = BiocStyle::pdf_document2(toc = TRUE, toc_depth = 4, keep_tex = TRUE)) -->





<!-- Fomr https://github.com/rstudio/bookdown/issues/153 -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>


```{r setup, include=FALSE}
## use collapse for bookdown, to collapse all the source and output
## blocks from one code chunk into a single block
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
options(width = 70)
require(BiocStyle)
require(pander)
```

<!-- bookdown::pdf_document2 seems to produce the tex even if failures -->


<!-- This did not work -->
<!--     bookdown::pdf_document2: -->
<!--       template: template2.tex	 -->
<!--       keep_tex: true -->
<!--       toc: true -->

<!-- I convert Rfunction to italics on typewriter -->
<!-- sed -i 's/\\Rfunction{\([^}]\+\)}/*`\1`*/g' p22.md -->


\clearpage

# Introduction {#introdd}
 
OncoSimulR is an individual- or clone-based forward-time genetic
simulator for biallelic markers (wildtype vs. mutated) in asexually
reproducing populations without spatial structure (perfect
mixing). Its design emphasizes flexible specification of fitness and
mutator effects.
 
 
 
OncoSimulR was originally developed to simulate tumor progression with
emphasis on allowing users to set restrictions in the accumulation of
mutations as specified, for example, by Oncogenetic Trees
[OT: @Desper1999JCB; @Szabo2008] or Conjunctive Bayesian Networks
[CBN: @Beerenwinkel2007; @Gerstung2009; @Gerstung2011], with the
possibility of adding passenger mutations to the simulations and allowing
for several types of sampling.


Since then, OncoSimulR has been vastly extended to allow you to specify
other types of restrictions in the accumulation of genes, such as the XOR
models of @Korsunsky2014 or the "semimonotone" model of
@Farahani2013. Moreover, different fitness effects related to the order in
which mutations appear can also be incorporated, involving arbitrary numbers
of genes. This is *very* different from "restrictions in the order of
accumulation of mutations". With order effects, described in a recent cancer
paper by Ortmann and collaborators [@Ortmann2015], the effect of having both
mutations "A" and "B" differs depending on whether "A" appeared before or
after "B" (the actual case involves genes JAK2 and TET2).

More generally, OncoSimulR now also allows you to specify arbitrary
epistatic interactions between arbitrary collections of genes and to model,
for example, synthetic mortality or synthetic viability (again, involving an
arbitrary number of genes, some of which might also depend on other genes,
or show order effects with other genes). Moreover, it is possible to specify
the above interactions in terms of modules, not genes. This idea is
discussed in, for example, @Raphael2014a and @Gerstung2011: the restrictions
encoded in, say, CBNs or OT can be considered to apply not to genes, but to
modules, where each module is a set of genes (and the intersection between
modules is the empty set) that performs a specific biological
function. Modules, then, play the role of a "union operation" over the set
of genes in a module. In addition, arbitrary numbers of genes without
interactions (and with fitness effects coming from any distribution you
might want) are also possible.

Mutator/antimutator genes, genes that alter the mutation rate of
other genes [@gerrish_complete_2007; @tomlinson_mutation_1996], can
also be simulated with OncoSimulR and specified with most of the
mechanisms above (you can have, for instance, interactions between
mutator genes). And, regardless of the presence or not of other
mutator/antimutator genes, different genes can have different
mutation rates.


Simulations can be stopped as a function of total population size, number
of mutated driver genes, or number of time periods. Simulations can also
be stopped with a stochastic detection mechanism where the probability of
detecting a tumor increases with total population size. Simulations return
the number of cells of every genotype/clone at each of the sampling
periods and we can take samples from the former with single-cell or whole-
tumor resolution, adding noise if we want. If we ask for them, simulations
also store and return the genealogical relationships of all clones
generated during the simulation.


The models so far implemented are all continuous time models, which are
simulated using the BNB algorithm of @Mather2012. The core of the code is
implemented in C++, providing for fast execution.  To help with simulation
studies, code to simulate random graphs of the kind often seen in CBNs, OTs,
etc, is also available. Finally, OncoSimulR also allows for the generation
of random fitness landscapes and the representation of fitness landscapes
and provides statistics of evolutionary predictability.






## Key features of OncoSimulR {#key}

As mentioned above, OncoSimulR is now a very general package for forward
genetic simulation, with applicability well beyond tumor progression. This
is a summary of some of its key features:

<!-- FIXME: add the tables of the poster -->


 
* You can specify arbitrary interactions between genes, with
  arbitrary fitness effects, with explicit support for:
    - Restrictions in the accumulations of mutations, as specified by
      Oncogenetic Trees (OTs), Conjunctive Bayesian Networks (CBNs),
      semimonotone progression networks, and XOR relationships.
		  
    - Epistatic interactions including, but not limited to, synthetic
       viability and synthetic lethality.
    - Order effects.
	
* You can add passenger mutations.
* You can add mutator/antimutator effects.
* Fitness and mutation rates can be gene-specific.
* You can add arbitrary numbers of non-interacting
      genes with arbitrary fitness effects.
  
* you can allow for deviations from the OT, CBN, semimonotone, and
      XOR models, specifying a penalty for such deviations (the $s_h$
      parameter).
      
* You can conduct multiple simulations, and sample from them with
      different temporal schemes and using both whole tumor or single cell
      sampling. 
	  
* You can stop the simulations using a flexible combination of
      conditions: final time, number of drivers, population
      size, fixation of certain genotypes, and a stochastic
      stopping mechanism that depends on population size.
  
* Right now, three different models are available, two that lead
      to exponential growth, one of them loosely based on @Bozic2010, and
      another that leads to logistic-like growth, based on @McFarland2013.
      
<!-- * Code in C++ is available (though not yet callable from R) for -->
<!--       using several other models, including the one from @Beerenwinkel2007b. -->
      
* You can use large numbers of genes (e.g., see an example of
      50000 in section \@ref(mcf50070) ).
      
* Simulations are generally very fast: I use C++ to implement the BNB
      algorithm (see sections \@ref(bnbmutation) and \@ref(bnbdensdep) for
      more detailed comments on the usage of this algorithm).

* You can obtain the true sequence of events and the phylogenetic
      relationships between clones (see section \@ref(meaningclone)
      for the details of what we mean by "clone").
    
* You can generate random fitness landscapes (under the House of Cards,
      Rough Mount Fuji, or additive models, or combinations of the former
      and under the NK model) and use those landscapes as input to the
      simulation functions.
      
* You can plot fitness landscapes.
      
* You can obtain statistics of evolutionary predictability
  from the simulations.
  


The table below, modified from the table at the
[Genetics Simulation Resources (GSR) page](https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed), 
provides a summary of the key features of OncoSimulR. (An
explanation of the meaning of terms specific to the GSR table is
available from
https://popmodels.cancercontrol.cancer.gov/gsr/search/ or from the
[Genetics Simulation Resources table itself](https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed),
by moving the mouse over each term).

\clearpage

|Attribute Category                  | Attribute                                     |
|-------------------------------|-----------------------------------------------|
|**Target**                           |  |
|&nbsp; Type of Simulated Data        |           Haploid DNA Sequence|
|&nbsp; Variations                    |           Biallelic Marker, Genotype or Sequencing Error|
|**Simulation Method**                |           Forward-time|
|&nbsp; Type of Dynamical Model       | Continuous time|
|&nbsp; Entities Tracked              | Clones (see \@ref(trackindivs))|
|**Input** | Program specific (R data frames and matrices specifying genotypes' fitness, gene effects, and starting genotype) |
|**Output**||
|&nbsp; Data Type| Genotype or Sequence, Individual Relationship (complete parent-child relationships between clones), Demographic (populations sizes of all clones at sampling times), Diversity Measures (LOD, POM, diversity of genotypes), Fitness|
|&nbsp; Sample Type|	Random or Independent, Longitudinal, Other (proportional to population size)|
|**Evolutionary Features**	||
|&nbsp; Mating Scheme| Asexual Reproduction |
|&nbsp; Demographic                   ||	
|&nbsp; &nbsp; Population Size Changes|	Exponential (two models), Logistic (McFarland et al., 2013)|
|&nbsp; Fitness Components||	
|&nbsp; &nbsp; Birth Rate|	Individually Determined from Genotype (models "Exp" and "McFL")|
|&nbsp; &nbsp; Death Rate|	Individually Determined from Genotype (model "Bozic"), Influenced by Environment ---population size (model "McFL")|
|&nbsp;Natural Selection||	
|&nbsp; &nbsp; Determinant|	Single and Multi-locus, Fitness of Offspring,  Environmental Factors (population size)|
|&nbsp; &nbsp; Models|	Directional Selection, Multi-locus models, Epistasis, Random Fitness Effects|
|&nbsp; Mutation Models|	Two-allele Mutation Model (wildtype, mutant), without back mutation|
|&nbsp; Events Allowed|	Varying Genetic Features: change of individual mutation rates (mutator/antimutator genes)|
|&nbsp; Spatial Structure| No Spatial Structure (perfectly mixed and no migration)|
Table:(\#tab:osrfeatures) Key features of OncoSimulR. Modified from
the original table from
https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed
.


<!-- Why not "Carrying cappacity" instead of logistic? Both are very -->
<!-- similar, but the GSR page says, for carrying capacity "This includes models with age or stage-specific carrying capacities" -->


Further details about the original motivation for wanting to
simulate data this way in the context of tumor progression can be
found in @Diaz-Uriarte2015, where additional comments about model
parameters and caveats are discussed.


Are there similar programs? The Java program by @Reiter2013a, TTP, offers
somewhat similar functionality to the previous version of OncoSimulR, but
it is restricted to at most four drivers (whereas v.1 of OncoSimulR
allowed for up to 64), you cannot use arbitrary CBNs or OTs (or XORs or
semimonotone graphs) to specify restrictions, there is no allowance for
passengers, and a single type of model (a discrete time Galton-Watson
process) is implemented. The current functionality of OncoSimulR goes well
beyond the the previous version (and, thus, also the TPT of
@Reiter2013a). We now allow you to specify all types of fitness effects
in other general forward genetic simulators such as FFPopSim
[@Zanini2012], and some that, to our knowledge (e.g., order effects) are
not available from any genetics simulator. In addition, the "Lego system"
to flexibly combine different fitness specifications is also unique; by
"Lego system" I mean that we can combine different pieces and blocks,
similarly to what we do with Lego bricks. (I find this an intuitive and
very graphical analogy, which I have copied from @Hothorn_2006 and
@Hothorn_2008). In a nutshell, salient features of OncoSimulR compared to
other simulators are the unparalleled flexibility to specify fitness and
mutator effects, with modules and order effects as particularly unique,
and the options for sampling and stopping the simulations, particularly
convenient in cancer evolution models. Also unique in this type of
software is the addition of functions for simulating fitness landscapes
and assessing evolutionary predictability.







## What kinds of questions is OncoSimulR suited for? {#generalwhatfor}

OncoSimulR can be used to address questions that span from the
effect of mutator genes in cancer to the interplay between fitness
landscapes and mutation rates. The main types of questions that
OncoSimulR can help address involve combinations of:


* Simulating asexual evolution (the `oncoSimul*` functions) where:
     - Fitness is:
        - A function of specific epistatic effects between genes
        - A function of order effects
        - A function of epistatic effects specified using
          DAGs/posets where these DAGs/posets:
		     - Are user-specified
	         - Generated randomly (`simOGraph`)
        - Any mapping between genotypes and fitness where this mapping is:
            - User-specified
            - Generated randomly from families of random fitness landscapes (`rfitness`)     
     - Mutation rates can:
          - Vary between genes
          - Be affected by other genes
		 
* Examining times to evolutionarily or biomedically relevant events
  (fixation of genotypes, reaching a minimal size, acquiring a
  minimal number of driver genes, etc ---specified with the stopping
  conditions to the `oncoSimul*` functions).

* Using different sampling schemes (`samplePop`) that are related
  to:
    - Assessing genotypes from single-cell vs. whole tumor (or whole
      population) with the  `typeSample` argument
    - Genotyping error (`propError` argument)
    - Timing of samples (`timeSample` argument)
    - ... and assessing the consequences of those on the observed
      genotypes and their diversity (`sampledGenotypes`) and any other
      inferences that depend on the observational process.
	- (OncoSimulR returns the abundances of all genotypes at each of
	  the sampling points, so you are not restricted by what the
	  `samplePop` function provides.)
  

* Tracking the genealogical relationships of clones
  (`plotClonePhylog`) and assessing evolutionary predictability
  (`LOD`, `POM`).



Some specific questions that you can address with the help of
OncoSimulR are discussed in section \@ref(whatfor).

-----

A quick overview of the main functions and their relationships is shown in
Figure \@ref(fig:frelats), where we use italics for the type/class of R
object and courier font for the name of the functions.
<!-- Note: figure1.png, and how to create it, explained in miscell-files -->
<!-- in the repo -->
<!-- ![Relationship between the main functions in OncoSimulR.](figure1.png) -->



```{r frelats, eval=TRUE,echo=FALSE, fig.cap="Relationships between the main functions in OncoSimulR."}
knitr::include_graphics("relfunct.png")
```

\clearpage

## Examples of questions that can be addressed with OncoSimulR {#whatfor}

Most of the examples in the rest of this vignette, starting with those in
\@ref(quickexample), focus on the mechanics. Here, we will illustrate some
problems in cancer genomics and evolutionary genetics where OncoSimulR
could be of help. This section does not try to provide an answer to any of
these questions (those would be full papers by themselves). Instead, this
section simply tries to illustrate some kinds of questions where you can
use OncoSimulR; of course, the possible uses of OncoSimulR are only
limited by your ingenuity. Here, I will only use short snippets of working
code as we are limited by time of execution; for real work you would want
to use many more scenarios and many more simulations, you would use
appropriate statistical methods to compare the output of runs, etc, etc,
etc.


```{r firstload}
## Load the package
library(OncoSimulR) 
```

### Recovering restrictions in the order of accumulation of mutations {#ex-order}

This is a question that was addressed, for instance, in
@Diaz-Uriarte2015: do methods that try to infer restrictions in the
order of accumulation of mutations [e.g., @Szabo2008; @Gerstung2009;
@ramazzotti_capri_2015] work well under different evolutionary
models and with different sampling schemes? (This issue is also
touched upon in section \@ref(sample-1)).

A possible way to examine that question would involve:

 - generating random DAGs that encode restrictions;
 - simulating cancer evolution using those DAGs;
 - sampling the data and adding different levels of noise to the sampled data;
 - running the inferential method;
 - comparing the inferred DAG with the original, true, one.


```{r, echo=FALSE}
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
```

```{r ex-dag-inf}
## For reproducibility
set.seed(2)
RNGkind("L'Ecuyer-CMRG")

## Simulate a DAG
g1 <- simOGraph(4, out = "rT")

## Simulate 10 evolutionary trajectories
s1 <- oncoSimulPop(10, allFitnessEffects(g1, drvNames = 1:4),
                   mc.cores = 2, ## adapt to your hardware
                   seed = NULL) ## for reproducibility of vignette

## Sample those data uniformly, and add noise
d1 <- samplePop(s1, timeSample = "unif", propError = 0.1)

## You would now run the appropriate inferential method and
## compare observed and true. For example

## require(Oncotree)
## fit1 <- oncotree.fit(d1)

## Now, you'd compare fitted and original. This is well beyond 
## the scope of this document (and OncoSimulR itself).

```

```{r hidden-rng-exochs, echo = FALSE}
set.seed(NULL)
```



### Sign epistasis and probability of crossing fitness valleys {#ex-ochs}

This question, and the question in the next section (\@ref(ex-predict)),
encompass a wide range of issues that have been addressed in evolutionary
genetics studies and which include from detailed analysis of simple models
with a few uphill paths and valleys as in @Weissman2009 or @Ochs2015, to
questions that refer to larger, more complex fitness landscapes as in
@szendro_predictability_2013 or @franke_evolutionary_2011 (see below).


Using as an example @Ochs2015 (we will see this example again in section
\@ref(ochsdesai), where we cover different ways of specifying fitness), we
could specify the fitness landscape and run simulations until fixation
(with argument `fixation` to `oncoSimulPop` ---see more details in section
\@ref(fixation) and \@ref(fixationG), again with this example). We would
then examine the proportion of genotypes fixed under different
scenarios. And we can extend this example by adding mutator genes:


```{r hiddenochs, echo=FALSE}
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
```

```{r exochs}
## For reproducibility
set.seed(2)
RNGkind("L'Ecuyer-CMRG")

## Specify fitness effects. 

## Numeric values arbitrary, but set the intermediate genotype en
## route to ui as mildly deleterious so there is a valley.

## As in Ochs and Desai, the ui and uv genotypes
## can never appear. 

u <- 0.2; i <- -0.02; vi <- 0.6; ui <- uv <- -Inf

od <- allFitnessEffects(
    epistasis = c("u" = u,  "u:i" = ui,
                  "u:v" = uv, "i" = i,
                  "v:-i" = -Inf, "v:i" = vi))

## For the sake of extending this example, also turn i into a
## mutator gene

odm <- allMutatorEffects(noIntGenes = c("i" = 50))

## How do mutation and fitness look like for each genotype?
evalAllGenotypesFitAndMut(od, odm, addwt = TRUE)
```

Ochs and Desai explicitly say "Each simulated population was evolved
until either the uphill genotype or valley-crossing genotype fixed."
So we will use `fixation`.


```{r exochsb}
## Set a small initSize, as o.w. unlikely to pass the valley
initS <- 10
## The number of replicates is tiny, 10, for the sake of speed
## of creation of the vignette
od_sim <- oncoSimulPop(10, od, muEF = odm,
                       fixation = c("u", "i, v"), initSize = initS,
                       model = "McFL",
                       mu = 1e-4, detectionDrivers = NA, 
					   finalTime = NA,
                       detectionSize = NA, detectionProb = NA,
                       onlyCancer = TRUE, 
					   mc.cores = 2, ## adapt to your hardware
                       seed = NULL) ## for reproducibility
## What is the frequency of each final genotype?
sampledGenotypes(samplePop(od_sim))
```

```{r hidden-rng-exochs33, echo = FALSE}
set.seed(NULL)
```

### Predictability of evolution in complex fitness landscapes {#ex-predict}

Focusing now on predictability in more general fitness landscapes,
we would run simulations under random fitness landscapes with varied
ruggedness, and would then examine the evolutionary predictability
of the trajectories with measures such as "Lines of Descent" and
"Path of the Maximum" [@szendro_predictability_2013] and the
diversity of the sampled genotypes under different sampling regimes (see
details in section \@ref(evolpredszend)).



```{r hiddenrng0szen, echo=FALSE}
set.seed(7)
RNGkind("L'Ecuyer-CMRG")
```

```{r exszendro}
## For reproducibility
set.seed(7)
RNGkind("L'Ecuyer-CMRG")


## Repeat the following loop for different combinations of whatever
## interests you, such as number of genes, or distribution of the
## c and sd (which affect how rugged the landscape is), or 
## reference genotype, or evolutionary model, or stopping criterion, 
## or sampling procedure, or ...

##  Generate a random fitness landscape, from the Rough Mount
##  Fuji model, with g genes, and c ("slope" constant) and
##  reference chosen randomly (reference is random by default and 
##  thus not specified below). Require a minimal number of  
##  accessible genotypes

g <- 6
c <- runif(1, 1/5, 5)
rl <- rfitness(g, c = c, min_accessible_genotypes = g)

## Plot it if you want; commented here as it takes long for a
## vignette

## plot(rl)

## Obtain landscape measures from MAGELLAN. Export to MAGELLAN and
## call your own copy of MAGELLAN's binary
to_Magellan(rl, file = "rl1.txt")

## or use the binary copy provided with OncoSimulR
## see also below.
Magellan_stats(rl)

## Simulate evolution in that landscape many times (here just 10)
simulrl <- oncoSimulPop(10, allFitnessEffects(genotFitness = rl),
                        keepPhylog = TRUE, keepEvery = 1,
                        initSize = 4000,
                        seed = NULL, ## for reproducibility
                        mc.cores = 2) ## adapt to your hardware

## Obtain measures of evolutionary predictability
diversityLOD(LOD(simulrl))
diversityPOM(POM(simulrl))
sampledGenotypes(samplePop(simulrl, typeSample = "whole"))
```

```{r hidden-rng-exszend, echo = FALSE}
set.seed(NULL)
```



<!-- ## (Getting the statistics from MAGELLAN requires either  -->
<!-- ## calling the web app (http://wwwabi.snv.jussieu.fr/public/MAGELLAN/)  -->
<!-- ## or asking MAGELLAN's authors for the software. This is of course -->
<!-- ## beyond the scope of the example and the package.) -->


### Mutator and antimutator genes {#exmutantimut}

The effects of mutator and antimutator genes have been examined both
in cancer genetics [@nowak_evolutionary_2006;
@tomlinson_mutation_1996] and in evolutionary genetics
[@gerrish_complete_2007], and are related to wider issues such as
Muller's ratchet and the evolution of sex. There are, thus, a large
range of questions related to mutator and antimutator genes.


One question addressed in @tomlinson_mutation_1996 concerns under what
circumstances mutator genes are likely to play a role in cancer
progression. For instance, @tomlinson_mutation_1996 find that an increased
mutation rate is more likely to matter if the number of required mutations
in driver genes needed to reach cancer is large and if the mutator effect is
large.


We might want to ask, then, how long it takes before to reach cancer under
different scenarios. Time to reach cancer is stored in the component
`FinalTime` of the output. We would specify different numbers and effects
of mutator genes (argument `muEF`). We would also change the criteria for
reaching cancer and in our case we can easily do that by specifying
different numbers in `detectionDrivers`. Of course, we would also want to
examine the effects of varying numbers of mutators, drivers, and possibly
fitness consequences of mutators. Below we assume mutators are neutral and
we assume there are no additional genes with deleterious mutations, but
this need not be so, of course [see also
@tomlinson_mutation_1996; @gerrish_complete_2007; @McFarland2014].


Let us run an example. For the sake of simplicity, we assume no
epistatic interactions.

```{r ex-tomlin1}
sd <- 0.1 ## fitness effect of drivers
sm <- 0 ## fitness effect of mutator
nd <- 20 ## number of drivers
nm <- 5  ## number of mutators
mut <- 10 ## mutator effect

fitnessGenesVector <- c(rep(sd, nd), rep(sm, nm))
names(fitnessGenesVector) <- 1:(nd + nm)
mutatorGenesVector <- rep(mut, nm)
names(mutatorGenesVector) <- (nd + 1):(nd + nm)

ft <- allFitnessEffects(noIntGenes = fitnessGenesVector,
                        drvNames = 1:nd)
mt <- allMutatorEffects(noIntGenes = mutatorGenesVector)

```


Now, simulate using the fitness and mutator specification. We fix
the number of drivers to cancer, and we stop when those numbers of
drivers are reached. Since we only care about the time it takes to
reach cancer, not the actual trajectories, we set `keepEvery = NA`:

```{r hiddentom, echo=FALSE}
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
```

```{r ex-tomlin2}
## For reproducibility
set.seed(2)
RNGkind("L'Ecuyer-CMRG")


ddr <- 4
st <- oncoSimulPop(4, ft, muEF = mt,
                   detectionDrivers = ddr,
                   finalTime = NA,
                   detectionSize = NA,
                   detectionProb = NA,
                   onlyCancer = TRUE,
                   keepEvery = NA, 
                   mc.cores = 2, ## adapt to your hardware
                   seed = NULL) ## for reproducibility

## How long did it take to reach cancer?
unlist(lapply(st, function(x) x$FinalTime))

```
```{r hidden-rng-tom, echo = FALSE}
set.seed(NULL)
```



(Incidentally, notice that it is easy to get OncoSimulR to throw an
exception if you accidentally specify a huge mutation rate when all
mutator genes are mutated: see section \@ref(tomlinexcept).)


<!-- More complex example where we look at accumulation of deleterious -->

<!-- ```{r r ex-tomlin3} -->
<!-- set.seed(2) ## for reproducibility -->
<!-- RNGkind("L'Ecuyer-CMRG") -->

<!-- sd <- 0.1 ## fitness effect of drivers -->
<!-- sp <- -0.01 ## fitness effect of mildly deleterious passengers -->
<!-- sm <- 0 ## fitness effect of mutator -->
<!-- nd <- 20 ## number of drivers -->
<!-- nm <- 5  ## number of mutators -->
<!-- np <- 50 -->
<!-- mut <- 5 ## mutator effect -->

<!-- fitnessGenesVector <- c(rep(sd, nd), rep(sm, nm), rep(sp, np)) -->
<!-- names(fitnessGenesVector) <- 1:(nd + nm + np) -->
<!-- mutatorGenesVector <- rep(mut, nm) -->
<!-- names(mutatorGenesVector) <- (nd + np + 1):(nd + np + nm) -->

<!-- ft <- allFitnessEffects(noIntGenes = fitnessGenesVector, -->
<!--                         drvNames = 1:nd) -->
<!-- mt <- allMutatorEffects(noIntGenes = mutatorGenesVector) -->

<!-- ddr <- 4 -->
<!-- st <- oncoSimulPop(4, ft, muEF = mt, -->
<!--                    detectionDrivers = ddr, -->
<!--                    finalTime = NA, -->
<!--                    detectionSize = NA, -->
<!--                    detectionProb = NA, -->
<!--                    onlyCancer = FALSE, -->
<!--                    keepEvery = NA,  -->
<!--                    mc.cores = 2, ## adapt to your hardware -->
<!--                    seed = NULL) ## for reproducibility -->
<!-- colSums(samplePop(st, timeSample = "last", typeSample = "single"), na.rm = TRUE) -->
<!-- ``` -->





### Epistatic interactions between drivers and passengers in cancer and the consequences of order effects {#exbauer}

#### Epistatic interactions between drivers and passengers

@Bauer2014 have examined the effects of epistatic relationships
between drivers and passengers in cancer initiation. We could use
their model as a starting point, and examine how likely cancer is to
develop under different variations of their model and different
evolutionary scenarios (e.g., initial sample size, mutation rates,
evolutionary model, etc).


There are several ways to specify their model, as we discuss in
section \@ref(bauer). We will use one based on DAGs here:


```{r exusagebau}
K <- 4
sp <- 1e-5
sdp <- 0.015
sdplus <- 0.05
sdminus <- 0.1

cnt <- (1 + sdplus)/(1 + sdminus)
prod_cnt <- cnt - 1
bauer <- data.frame(parent = c("Root", rep("D", K)),
                    child = c("D", paste0("s", 1:K)),
                    s = c(prod_cnt, rep(sdp, K)),
                    sh = c(0, rep(sp, K)),
                    typeDep = "MN")
fbauer <- allFitnessEffects(bauer)
(b1 <- evalAllGenotypes(fbauer, order = FALSE, addwt = TRUE))

## How does the fitness landscape look like?
plot(b1, use_ggrepel = TRUE) ## avoid overlapping labels
```

Now run simulations and examine how frequently the runs end up with
population sizes larger than a pre-specified threshold; for
instance, below we look at increasing population size 4x in the
default maximum number of 2281 time periods (for real, you would of
course increase the number of total populations, the range of
initial population sizes, model, mutation rate, required population
size or number of drivers, etc):


```{r hiddenbau, echo=FALSE}
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
```

```{r exusagebau2}
## For reproducibility
set.seed(2)
RNGkind("L'Ecuyer-CMRG")

totalpops <- 5
initSize <- 100
sb1 <- oncoSimulPop(totalpops, fbauer, model = "Exp",
                    initSize = initSize,
                    onlyCancer = FALSE, 
					mc.cores = 2, ## adapt to your hardware
                    seed = NULL) ## for reproducibility
					
## What proportion of the simulations reach 4x initSize?
sum(summary(sb1)[, "TotalPopSize"] > (4 * initSize))/totalpops

```
```{r hidden-rng-exbau, echo = FALSE}
set.seed(NULL)
```


Alternatively, to examine how long it takes to reach cancer for a
pre-specified size, you could look at the value of `FinalTime` as we
did above (section \@ref(exmutantimut)) after running simulations
with `onlyCancer = TRUE` and `detectionSize` set to some reasonable value:


```{r hiddenbau22, echo=FALSE}
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
```

```{r hhhhbbbb22}

totalpops <- 5
initSize <- 100
sb2 <- oncoSimulPop(totalpops, fbauer, model = "Exp",
                    initSize = initSize,
                    onlyCancer = TRUE,
					detectionSize = 10 * initSize,
					mc.cores = 2, ## adapt to your hardware
                    seed = NULL) ## for reproducibility
				
## How long did it take to reach cancer?
unlist(lapply(sb2, function(x) x$FinalTime))
```

```{r hidden-rng-exbau22, echo = FALSE}
set.seed(NULL)
```

#### Consequences of order effects for cancer initiation {#exorder1intro}

Instead of focusing on different models for epistatic interactions,
you might want to examine the consequences of order effects
[@Ortmann2015]. You would proceed as above, but using models that
differ by, say, the presence or absence of order effects. Details on
their specification are provided in section \@ref(oe). Here is one
particular model (you would, of course, want to compare this to
models without order effects or with other magnitudes and types of
order effects):


```{r oex1intro}
## Order effects involving three genes.

## Genotype "D, M" has different fitness effects
## depending on whether M or D mutated first.
## Ditto for genotype "F, D, M".

## Meaning of specification: X > Y means
## that X is mutated before Y.


o3 <- allFitnessEffects(orderEffects = c(
                            "F > D > M" = -0.3,
                            "D > F > M" = 0.4,
                            "D > M > F" = 0.2,
                            "D > M"     = 0.1,
                            "M > D"     = 0.5))

## With the above specification, let's double check
## the fitness of the possible genotypes

(oeag <- evalAllGenotypes(o3, addwt = TRUE, order = TRUE))

``` 

Now, run simulations and examine how frequently the runs do not end
up in extinction. As above, for real, you would of course increase
the number of total populations, the range of initial population
sizes, mutation rate, etc:



```{r hiddoef, echo=FALSE}
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
```

```{r exusageoe2}
## For reproducibility
set.seed(2)
RNGkind("L'Ecuyer-CMRG")

totalpops <- 5
soe1 <- oncoSimulPop(totalpops, o3, model = "Exp",
                    initSize = 500,
                    onlyCancer = FALSE,
					mc.cores = 2, ## adapt to your hardware
                    seed = NULL) ## for reproducibility
					
## What proportion of the simulations do not end up extinct?
sum(summary(soe1)[, "TotalPopSize"] > 0)/totalpops

```

```{r hidden-rng-exoef, echo = FALSE}
set.seed(NULL)
```


As we just said, alternatively, to examine how long it takes to
reach cancer you could run simulations with `onlyCancer = TRUE` and
look at the value of `FinalTime` as we did above (section
\@ref(exmutantimut)).




## Trade-offs and what is OncoSimulR not well suited for {#whatnotfor}

OncoSimulR is designed for complex fitness specifications and selection
scenarios and uses forward-time simulations; the types of questions where
OncoSimulR can be of help are discussed in sections \@ref(generalwhatfor)
and \@ref(whatfor) and running time and space consumption of OncoSimulR are
addressed in section \@ref(timings). You should be aware that **coalescent
simulations**, sometimes also called backward-time simulations, are much
more efficient for simulating neutral data as well as some special selection
scenarios [@Yuan2012; @Carvajal-Rodriguez2010; @Hoban2011].  

In addition, since OncoSimulR allows you to specify fitness with
arbitrary epistatic and order effects, as well as mutator effects,
you need to learn the syntax of how to specify those effects and you
might be paying a performance penalty if your scenario does not
require this complexity. For instance, in the model of
@Beerenwinkel2007b, the fitness of a genotype depends only on the
total number of drivers mutated, but not on which drivers are
mutated (and, thus, not on the epistatic interactions nor the order
of accumulation of the drivers). This means that the syntax for
specifying that model could probably be a lot simpler
(e.g., specify $s$ per driver).

But it also means that code written for just that case could
probably run much faster. First, because fitness evaluation is
easier. Second, and possibly much more important, because what we
need to keep track of leads to much simpler and economic structures:
we do not need to keep track of clones (where two cells are regarded
as different clones if they differ anywhere in their genotype), but
only of clone types or clone classes as defined by the number of
mutated drivers, and keeping track of clones can be expensive ---see
sections \@ref(timings) and \@ref(trackindivs).

So for those cases where you do not need the full flexibility of OncoSimulR,
special purpose software might be easier to use and faster to run. Of
course, for some types of problems this special purpose software might not
be available, though.




## Steps for using OncoSimulR {#steps}


Using this package will often involve the following steps:


1. Specify fitness effects: sections \@ref(specfit) and \@ref(litex).

2. Simulate cancer progression: section \@ref(simul). You can
  simulate for a single individual or subject or for a set of
  subjects. You will need to:
  
    - Decide on a model. This basically amounts to choosing a model with
    exponential growth ("Exp" or "Bozic") or a model with carrying capacity
    ("McFL"). If exponential growth, you can choose whether the the effects
    of mutations operate on the death rate ("Bozic") or the birth rate
    ("Exp")[^1].
  
    - Specify other parameters of the simulation. In particular, decide
    when to stop the simulation, mutation rates, etc.

     Of course, at least for initial playing around, you can use the
     defaults.
  
3. Sample from the simulated data and do something with those
  simulated data (e.g., fit an OT model to them, examine diversity
  or time until cancer, etc). Most of what you do with the data,
  however, is outside the scope of this package and this vignette.

        

[^1]:It is of course possible to do this with the carrying capacity (or
      gompertz-like) models, but there probably is little reason to do
      it. @McFarland2013 discuss this has little effect on their results,
      for example. In addition, decreasing the death rate will more easily
      lead to numerical problems as shown in section \@ref(ex-0-death).




Before anything else, let us load the package in case it was not yet
loaded. We also explicitly load `r Biocpkg("graph")` and 
`r CRANpkg("igraph")` for the vignette to work (you do not need that
for your usual interactive work). And I set the default color for
vertices in igraph.

```{r, results="hide",message=FALSE, echo=TRUE, include=TRUE}
library(OncoSimulR)
library(graph)
library(igraph)
igraph_options(vertex.color = "SkyBlue2")
``` 

```{r, echo=FALSE, results='hide'}
options(width = 68)
``` 

To be explicit, what version are we running?
```{r}
packageVersion("OncoSimulR")
``` 



## Two quick examples of fitness specifications {#quickexample}

Following \@ref(steps) we will run two very minimal examples. First a
model with a few genes and **epistasis**:

```{r, fig.width=6.5, fig.height=10}
## 1. Fitness effects: here we specify an 
##    epistatic model with modules.
sa <- 0.1
sb <- -0.2
sab <- 0.25
sac <- -0.1
sbc <- 0.25
sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb,
                                       "A : -B" = sa,
                                       "A : C" = sac,
                                       "A:B" = sab,
                                       "-A:B:C" = sbc),
                         geneToModule = c(
                             "A" = "a1, a2",
                             "B" = "b",
                             "C" = "c"),
                         drvNames = c("a1", "a2", "b", "c"))
evalAllGenotypes(sv2, addwt = TRUE)

## 2. Simulate the data. Here we use the "McFL" model and set
##    explicitly parameters for mutation rate, initial size, size
##    of the population that will end the simulations, etc

RNGkind("Mersenne-Twister")
set.seed(983)
ep1 <- oncoSimulIndiv(sv2, model = "McFL",
                      mu = 5e-6,
                      sampleEvery = 0.025,
                      keepEvery = 0.5,
                      initSize = 2000,
                      finalTime = 3000,
                      onlyCancer = FALSE)
``` 



<!-- % set.seed(9) -->
<!-- % ep1 <- oncoSimulIndiv(sv2, model = "McFL", -->
<!-- %                      mu = 5e-6, -->
<!-- %                      sampleEvery = 0.02, -->
<!-- %                      keepEvery = 0.5, -->
<!-- %                      initSize = 2000, -->
<!-- %                      finalTime = 3000, -->
<!-- %                      onlyCancer = FALSE) -->

<!-- %% <<fig.width=6.5, fig.height=10>>= -->
<!-- %%  ## 1. Fitness effects: here we specify a  -->
<!-- %%  ##    epistatic model with modules. -->
<!-- %%  sa <- 0.1 -->
<!-- %%  sb <- -0.2 -->
<!-- %%  sab <- 0.25 -->
<!-- %%  sac <- -0.1 -->
<!-- %%  sbc <- 0.25 -->
<!-- %%  sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb, -->
<!-- %%                                         "A : -B" = sa, -->
<!-- %%                                         "A : C" = sac, -->
<!-- %%                                         "A:B" = sab, -->
<!-- %%                                         "-A:B:C" = sbc), -->
<!-- %%                           geneToModule = c( -->
<!-- %%                               "Root" = "Root", -->
<!-- %%                               "A" = "a1, a2", -->
<!-- %%                               "B" = "b", -->
<!-- %%                               "C" = "c")) -->
<!-- %%  evalAllGenotypes(sv2, order = FALSE, addwt = TRUE) -->

<!-- %%  ## 2. Simulate the data. Here we use the "McFL" model and set explicitly -->
<!-- %%  ##    parameters for mutation rate, final and initial sizes, etc. -->
<!-- %%  RNGkind("Mersenne-Twister") -->
<!-- %%  set.seed(983) -->
<!-- %%  ep1 <- oncoSimulIndiv(sv2, model = "McFL", -->
<!-- %%                       mu = 5e-6, -->
<!-- %%                       sampleEvery = 0.02, -->
<!-- %%                       keepEvery = 0.5, -->
<!-- %%                       initSize = 2000, -->
<!-- %%                       finalTime = 3000, -->
<!-- %%                       onlyCancer = FALSE) -->
<!-- %% @  -->


```{r iep1x1,fig.width=6.5, fig.height=4.5, fig.cap="Plot of drivers of an epistasis simulation."}
## 3. We will not analyze those data any further. We will only plot
## them.  For the sake of a small plot, we thin the data.
plot(ep1, show = "drivers", xlim = c(0, 1500),
     thinData = TRUE, thinData.keep = 0.5)
``` 

<!-- ## Increase ylim and legend.ncols to avoid overlap of  -->
<!-- ## legend with rest of figure -->
<!-- plot(ep1, show = "genotypes", ylim = c(0, 4500),  -->
<!--      legend.ncols = 4, -->
<!--      xlim = c(0, 1500), -->
<!--      thinData = TRUE, thinData.keep = 0.5) -->



As a second example, we will use a model where we specify
**restrictions in the order of accumulation of mutations using a
DAG** with the
pancreatic cancer poset in @Gerstung2011 (see more
details in section \@ref(pancreas)):

```{r fepancr1, fig.width=5}
## 1. Fitness effects: 
pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDNK2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDNK2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"),
    drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", 
	             "MLL3", "TGFBR2", "PXDN"))
```

```{r figfpancr1, fig.width=5, fig.cap="Plot of DAG corresponding to fitnessEffects object."}
## Plot the DAG of the fitnessEffects object
plot(pancr)
``` 


```{r}
## 2. Simulate from it. We change several possible options. 

set.seed(4) ## Fix the seed, so we can repeat it
ep2 <- oncoSimulIndiv(pancr, model = "McFL",
                     mu = 1e-6,
                     sampleEvery = 0.02,
                     keepEvery = 1,
                     initSize = 1000,
                     finalTime = 10000,
                     onlyCancer = FALSE)
``` 

<!-- % set.seed(6) ## Fix the seed, so we can repeat it -->
<!-- % ep2 <- oncoSimulIndiv(pancr, model = "McFL", -->
<!-- %                      mu = 1e-6, -->
<!-- %                      sampleEvery = 0.02, -->
<!-- %                      keepEvery = 1, -->
<!-- %                      initSize = 2000, -->
<!-- %                      finalTime = 10000, -->
<!-- %                      onlyCancer = FALSE)
-->
```{r iep2x2, fig.width=6.5, fig.height=5, fig.cap= "Plot of genotypes of a simulation from a DAG."}
## 3. What genotypes and drivers we get? And play with limits
##    to show only parts of the data. We also aggressively thin
##    the data.
par(cex = 0.7)
plot(ep2, show = "genotypes", xlim = c(1000, 8000), 
     ylim = c(0, 2400),
     thinData = TRUE, thinData.keep = 0.03)
``` 


The rest of this vignette explores all of those functions and arguments in
much more detail. 

## Citing OncoSimulR and other documentation {#citing}

In R, you can do
```{r}
citation("OncoSimulR")
``` 
which will tell you how to cite the package. Please, do cite the
Bionformatics paper if you use the package in publications.


This is the URL for the Bioinformatics paper: [https://doi.org/10.1093/bioinformatics/btx077](https://doi.org/10.1093/bioinformatics/btx077)
(there is also an early preprint
at [bioRxiv](http://biorxiv.org/content/early/2016/08/14/069500),
but it should now point to the Bioinformatics paper). 

<!-- This no longer helps that much with the many changes -->
<!-- You -->
<!-- can also take a look at this -->
<!-- poster, [http://dx.doi.org/10.7490/f1000research.1112860.1](http://dx.doi.org/10.7490/f1000research.1112860.1), -->
<!-- presented at ECCB 2016. -->


### HTML and PDF versions of the vignette {#pdfvignette}

A PDF version of this vignette is available
from <https://rdiaz02.github.io/OncoSimul/pdfs/OncoSimulR.pdf>.  And an HTML
version from <https://rdiaz02.github.io/OncoSimul/OncoSimulR.html>. These
files should correspond to the most recent, GitHub version, of the package
(i.e., they might include changes not yet available from the BioConductor
package).



## Testing, code coverage, and other examples {#codecover}

OncoSimulR includes more than 2000 tests that are run at every check
cycle. These tests provide a code coverage of more than 90%
including both the C++ and R code. Another set of over 500
long-running (several hours) tests can be run on demand (see
directory '/tests/manual'). In addition to serving as test cases,
some of that code also provides further examples of usage.





## Versions {#versions}

In this vignette and the documentation I often refer to version 1 (v.1) and
version 2 of OncoSimulR. Version 1 is the version available up to, and
including, BioConductor v. 3.1. Version 2 of OncoSimulR is available
starting from BioConductor 3.2 (and, of course, available too from
development versions of BioC).  So, if you are using the current stable or
development version of BioConductor, or you grab the sources from GitHub
(<https://github.com/rdiaz02/OncoSimul>) you are using what we call
version 2. Please note that **the functionality of version 1 will soon be removed.**



\clearpage

# Running time and space consumption of OncoSimulR {#timings}

Time to complete the simulations and size of returned objects (space
consumption) depend on several, interacting factors. The usual rule
of "experiment before launching a large number of simulations"
applies, but here we will walk through several cases to get a
feeling for the major factors that affect speed and size. Many of
the comments on this section need to use ideas discussed in other
places of this document; if you read this section first, you might
want to come back after reading the relevant parts.


Speed will depend on:

* Your hardware, of course.
* The evolutionary model.
* The granularity of how often you keep data (`keepEvery`
  argument). Note that the default, which is to keep as often as you
  sample (so that we preserve all history) can lead to slow execution
  times.
* The mutation rate, because higher mutation rates lead to more
  clones, and more clones means we need to iterate over, well, more clones,
  and keep larger data structures.
* The fitness specification: more complex fitness specifications tend
  to be slightly slower but specially different fitness specifications can
  have radically different effects on the evolutionary trajectories,
  accessibility of fast growing genotypes and, generally, the evolutionary dynamics.
* The stopping conditions (`detectionProb`, `detectionDrivers`,
  `detectionSize` arguments) and whether or not simulations are run until
  cancer is reached (`onlyCancer` argument).
* Most of the above factors can interact in complex ways.


Size of returned objects will depend on:

* Any factor that affects the number of clones tracked/returned, in
  particular: initial sizes and stopping conditions, mutation rate,
  and how often you keep data (the `keepEvery` argument can make a
  huge difference here).
* Whether or not you keep the complete genealogy of clones (this affects
  slightly the size of returned object, not speed).


In the sections that follow, we go over several cases to understand
some of the main settings that affect running time (or execution
time) and space consumption (the size of returned objects). It
should be understood, however, that many of the examples shown below
do not represent typical use cases of OncoSimulR and are used only
to identify what and how affects running time and space
consumption. As we will see in most examples in this vignette,
typical use cases of OncoSimulR involve hundreds to thousands of
genes on population sizes up to $10^5$ to $10^7$. 


Note that most of the code in this section is not executed during the
building of the vignette to keep vignette build time reasonable and prevent
using huge amounts of RAM. All of the code, ready to be sourced and run, is
available from the 'inst/miscell' directory (and the summary output from
some of the benchmarks is available from the
'miscell-files/vignette_bench_Rout' directory of the main OncoSimul
repository at https://github.com/rdiaz02/OncoSimul).


```{r colnames_benchmarks, echo = FALSE, eval = TRUE}

data(benchmark_1)
data(benchmark_1_0.05)
data(benchmark_2)
data(benchmark_3)

colnames(benchmark_1)[
    match(c(
	"time_per_simul",
    "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean",
	"TotalPopSize.Max.", "keepEvery",  "Attempts.Median",
	"Attempts.Mean", "Attempts.Max.",
	"PDBaseline", "n2", "onlyCancer"),
	 colnames(benchmark_1)
	)] <- c("Elapsed Time, average per simulation (s)",
	              "Object Size, average per simulation (MB)",
				  "Number of Clones, median",
				  "Number of Iterations, median",
				  "Final Time, median",
				  "Total Population Size, median",
				   "Total Population Size, mean",
				  "Total Population Size, max.",
				  "keepEvery",
				  "Attempts until Cancer, median",
				  "Attempts until Cancer, mean",
				  "Attempts until Cancer, max.",
				  "PDBaseline", "n2", "onlyCancer"
				  )
				  
	
colnames(benchmark_1_0.05)[
    match(c("time_per_simul",
    "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", 
	"TotalPopSize.Max.",
	"keepEvery",
	"PDBaseline", "n2", "onlyCancer", "Attempts.Median"),
	colnames(benchmark_1_0.05))] <- c("Elapsed Time, average per simulation (s)",
	              "Object Size, average per simulation (MB)",
				  "Number of Clones, median",
				  "Number of Iterations, median",
				  "Final Time, median",
				  "Total Population Size, median",
				  "Total Population Size, mean",
				  "Total Population Size, max.",
				  "keepEvery",
				  "PDBaseline", "n2", "onlyCancer",
				  "Attempts until Cancer, median"
				  )


colnames(benchmark_2)[match(c("Model", "fitness", "time_per_simul",
    "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", 
	"TotalPopSize.Max."), colnames(benchmark_2))] <-  c("Model",
				  "Fitness",
	"Elapsed Time, average per simulation (s)",
	              "Object Size, average per simulation (MB)",
				  "Number of Clones, median",
				  "Number of Iterations, median",
				  "Final Time, median",
				  "Total Population Size, median",
				  "Total Population Size, mean",
				  "Total Population Size, max."
				  )	
				  
colnames(benchmark_3)[match(c("Model", "fitness", "time_per_simul",
    "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", 
	"TotalPopSize.Max."), colnames(benchmark_3))] <-  c("Model",
				  "Fitness",
	"Elapsed Time, average per simulation (s)",
	              "Object Size, average per simulation (MB)",
				  "Number of Clones, median",
				  "Number of Iterations, median",
				  "Final Time, median",
				  "Total Population Size, median",
				  "Total Population Size, mean",
				  "Total Population Size, max."
				  )					  
```


## Exp and McFL with "detectionProb" and pancreas example {#bench1}

To get familiar with some of they factors that affect time and size,
we will use the fitness specification from section
\@ref(quickexample), with the `detectionProb` stopping mechanism
(see \@ref(detectprob)).  We will use the two main growth models
(exponential and McFarland). Each model will be run with two
settings of `keepEvery`. With `keepEvery = 1` (runs `exp1` and
`mc1`), population samples are stored at time intervals of 1 (even
if most of the clones in those samples later become extinct). With
`keepEvery = NA` (runs `exp2` and `mc2`) no intermediate population
samples are stored, so clones that become extinct at any sampling
period are pruned and only the existing clones at the end of the
simulation are returned (see details in \@ref(prune)).



Will run `r unique(benchmark_1$Numindiv)` simulations.  The results
I show are for a laptop with an 8-core Intel Xeon E3-1505M CPU,
running Debian GNU/Linux (the results from these benchmarks are
available as `data(benchmark_1)`).



```{r timing1, eval=FALSE}
## Specify fitness
pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDNK2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDNK2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"),
    drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", 
	             "MLL3", "TGFBR2", "PXDN"))

Nindiv <- 100 ## Number of simulations run.
              ## Increase this number to decrease sampling variation

## keepEvery = 1
t_exp1 <- system.time(
    exp1 <- oncoSimulPop(Nindiv, pancr, 
                            detectionProb = "default", 
                            detectionSize = NA,
                            detectionDrivers = NA,
                            finalTime = NA,
                            keepEvery = 1,
                            model = "Exp", 
                            mc.cores = 1))["elapsed"]/Nindiv


t_mc1 <- system.time(
    mc1 <- oncoSimulPop(Nindiv, pancr, 
                           detectionProb = "default", 
                           detectionSize = NA,
                           detectionDrivers = NA,
                           finalTime = NA,
                           keepEvery = 1,                                  
                           model = "McFL", 
                           mc.cores = 1))["elapsed"]/Nindiv

## keepEvery = NA
t_exp2 <- system.time(
    exp2 <- oncoSimulPop(Nindiv, pancr, 
                            detectionProb = "default", 
                            detectionSize = NA,
                            detectionDrivers = NA,
                            finalTime = NA,
                            keepEvery = NA, 
                            model = "Exp", 
                            mc.cores = 1))["elapsed"]/Nindiv


t_mc2 <- system.time(
    mc2 <- oncoSimulPop(Nindiv, pancr, 
                           detectionProb = "default", 
                           detectionSize = NA,
                           detectionDrivers = NA,
                           finalTime = NA,
                           keepEvery = NA,
                           model = "McFL", 
                           mc.cores = 1))["elapsed"]/Nindiv


``` 

We can obtain times, sizes of objects, and summaries of numbers
of clones, iterations, and final times doing, for instance:


``` {r, eval=FALSE}
cat("\n\n\n t_exp1 = ", t_exp1, "\n")
object.size(exp1)/(Nindiv * 1024^2)
cat("\n\n")
summary(unlist(lapply(exp1, "[[", "NumClones")))
summary(unlist(lapply(exp1, "[[", "NumIter")))
summary(unlist(lapply(exp1, "[[", "FinalTime")))
summary(unlist(lapply(exp1, "[[", "TotalPopSize")))
```


The above runs yield the following:


\blandscape

Table: (\#tab:bench1) Benchmarks of Exp and McFL models using the default `detectionProb` with two settings of `keepEvery`. 
```{r bench1, eval=TRUE, echo = FALSE}

panderOptions('table.split.table', 99999999)
panderOptions('table.split.cells', 900)  ## For HTML
## panderOptions('table.split.cells', 8) ## For PDF

set.alignment('right')
panderOptions('round', 2)
panderOptions('big.mark', ',')
panderOptions('digits', 2)
				          
pander(benchmark_1[1:4, c("Elapsed Time, average per simulation (s)", 
 	              "Object Size, average per simulation (MB)",
 				  "Number of Clones, median",
 				  "Number of Iterations, median",
 				  "Final Time, median",
 				  "Total Population Size, median",
 				  "Total Population Size, max.",
 				  "keepEvery")],
				  justify = c('left', rep('right', 8)), ##  o.w. hlines not right
				  ## caption = "\\label{tab:bench1}Benchmarks of Exp and McFL  models using the default `detectionProb` with two settings of `keepEvery`."
				  )
```

\elandscape

\clearpage



The above table shows that a naive comparison (looking simply at execution
time) might conclude that the McFL model is much, much slower than the Exp
model. But that is not the complete story: using the `detectionProb`
stopping mechanism (see \@ref(detectprob)) will lead to stopping the
simulations very quickly in the exponential model because as soon as a
clone with fitness $>1$ appears it starts growing exponentially. In fact,
we can see that the number of iterations and the final time are much
smaller in the Exp than in the McFL model.  We will elaborate on this
point below (section \@ref(common1)), when we discuss the setting for
`checkSizePEvery` (here left at its default value of 20): checking the
exiting condition more often (smaller `checkSizePEvery`) would probably be
justified here (notice also the very large final times) and would lead to
a sharp decrease in number of iterations and, thus, running time.



This table also shows that the `keepEvery = NA` setting, which was
in effect in simulations `exp2` and `mc2`, can make a difference
especially for the McFL models, as seen by the median number of
clones and the size of the returned object. Models `exp2` and `mc2`
do not store any intermediate population samples so clones that
become extinct at any sampling period are pruned and only the
existing clones at the end of the simulation are returned. In
contrast, models `exp1` and `mc1` store population samples at time
intervals of 1 (`keepEvery = 1`), even if many of those clones
eventually become extinct. We will return to this issue below as
execution time and object size depend strongly on the number of
clones tracked.



We can run the exponential model again modifying the arguments of the
`detectionProb` mechanism; in two of the models below (`exp3` and `exp4`) no
detection can take place unless populations are at least 100 times larger
than the initial population size, and probability of detection is 0.1 with a
population size 1,000 times larger than the initial one (`PDBaseline = 5e4`,
`n2 = 5e5`). In the other two models (`exp5` and `exp6`), no detection can
take place unless populations are at least 1,000 times larger than the
initial population size, and probability of detection is 0.1 with a
population size 100,000 times larger than the initial one (`PDBaseline =
5e5`, `n2 = 5e7`)[^rva]. In runs `exp3` and `exp5` we set `keepEvery = 1` and in
runs `exp4` and `exp6` we set `keepEvery = NA`.

[^rva]:Again, these are not necessarily reasonable or common
settings. We are using them to understand what and how affects
running time and space consumption.


```{r timing2, eval = FALSE}
t_exp3 <- system.time(
    exp3 <- oncoSimulPop(Nindiv, pancr, 
                            detectionProb = c(PDBaseline = 5e4,
                                              p2 = 0.1, n2 = 5e5,
                                              checkSizePEvery = 20), 
                            detectionSize = NA,
                            detectionDrivers = NA,
                            finalTime = NA,
                            keepEvery = 1, 
                            model = "Exp", 
                            mc.cores = 1))["elapsed"]/Nindiv

t_exp4 <- system.time(
    exp4 <- oncoSimulPop(Nindiv, pancr, 
                            detectionProb = c(PDBaseline = 5e4,
                                              p2 = 0.1, n2 = 5e5,
                                              checkSizePEvery = 20), 
                            detectionSize = NA,
                            detectionDrivers = NA,
                            finalTime = NA,
                            keepEvery = NA, 
                            model = "Exp", 
                            mc.cores = 1))["elapsed"]/Nindiv



t_exp5 <- system.time(
    exp5 <- oncoSimulPop(Nindiv, pancr, 
                            detectionProb = c(PDBaseline = 5e5,
                                              p2 = 0.1, n2 = 5e7), 
                            detectionSize = NA,
                            detectionDrivers = NA,
                            finalTime = NA,
                            keepEvery = 1, 
                            model = "Exp", 
                            mc.cores = 1))["elapsed"]/Nindiv

t_exp6 <- system.time(
    exp6 <- oncoSimulPop(Nindiv, pancr, 
                            detectionProb = c(PDBaseline = 5e5,
                                              p2 = 0.1, n2 = 5e7), 
                            detectionSize = NA,
                            detectionDrivers = NA,
                            finalTime = NA,
                            keepEvery = NA, 
                            model = "Exp", 
                            mc.cores = 1))["elapsed"]/Nindiv

```

\blandscape

Table: (\#tab:bench1b) Benchmarks of Exp models modifying the default `detectionProb` with two settings of `keepEvery`.
```{r bench1b, eval=TRUE, echo = FALSE}
panderOptions('table.split.table', 99999999)
panderOptions('table.split.cells', 900)  ## For HTML
## panderOptions('table.split.cells', 8) ## For PDF

set.alignment('right')
panderOptions('round', 2)
panderOptions('big.mark', ',')
panderOptions('digits', 2)

pander(benchmark_1[5:8, c("Elapsed Time, average per simulation (s)",
 	              "Object Size, average per simulation (MB)",
 				  "Number of Clones, median",
 				  "Number of Iterations, median",
 				  "Final Time, median",
 				  "Total Population Size, median",
 				  "Total Population Size, max.",
 				  "keepEvery",
				  "PDBaseline",
				  "n2")], 
				  justify = c('left', rep('right', 10)), ##  o.w. hlines not right
## 				  round = c(rep(2, 3), rep(0, 7)),
## 				  digits = c(rep(2, 3), rep(1, 7)),
	  ## caption = "\\label{tab:bench1b}Benchmarks of Exp and McFL models modifying the default `detectionProb` with two settings of `keepEvery`."
    )

```

\elandscape

\clearpage

As above,  `keepEvery = NA` (in `exp4` and `exp6`) leads to much
smaller object sizes and slightly smaller numbers of clones and
execution times. Changing the exiting conditions (by changing
`detectionProb` arguments) leads to large increases in number of
iterations (in this case by factors of about 15x to 25x) and a
corresponding increase in execution time as well as much larger
population sizes (in some cases $>10^{10}$).


In some of the runs of `exp5` and `exp6` we get the (recoverable)
exception message from the C++ code: `Recoverable exception ti set to
DBL_MIN. Rerunning`, which is related to those simulations reaching total
population sizes $>10^{10}$; we return to this below (section
\@ref(popgtzx)). You might also wonder why total and median population
sizes are so large in these two runs, given the exiting conditions. One of
the reasons is that we are using the default `checkSizePEvery = 20`, so
the interval between successive checks of the exiting condition is large;
this is discussed at greater length in section \@ref(common1).


All the runs above used the default value `onlyCancer = TRUE`. This means
that simulations will be repeated until the exiting conditions are reached
(see details in section \@ref(endsimul)) and, therefore, any simulation
that ends up in extinction will be repeated. This setting can thus have a
large effect on the exponential models, because when the initial
population size is not very large and we start from the wildtype, it is
not uncommon for simulations to become extinct (when birth and death
rates are equal and the population size is small, it is easy to reach
extinction before a mutation in a gene that increases fitness occurs). But
this is rarely the case in the McFarland model (unless we use really tiny
initial population sizes) because of the dependency of death rate on total
population size (see section \@ref(mcfl)).


The number of attempts until cancer was reached in the above
models is shown in  Table \@ref(tab:bench1c) (the values can be obtained from
any of the above runs doing, for instance, `median(unlist(lapply(exp1,
function(x) x$other$attemptsUsed)))` ):

Table: (\#tab:bench1c) Number of attempts until cancer.
```{r bench1c, eval=TRUE, echo = FALSE}
panderOptions('table.split.table', 99999999)
panderOptions('table.split.cells', 900)  ## For HTML
## panderOptions('table.split.cells', 12) ## For PDF
set.alignment('right')
panderOptions('round', 2)
panderOptions('big.mark', ',')
panderOptions('digits', 2)

pander(benchmark_1[1:8, c(
"Attempts until Cancer, median", 
"Attempts until Cancer, mean", 
"Attempts until Cancer, max.", 
				  "PDBaseline",
				  "n2")], 
				  justify = c('left', rep('right', 5)), ##  o.w. hlines not right
## 				  round = c(rep(2, 3), rep(0, 7)),
## 				  digits = c(rep(2, 3), rep(1, 7)),
	  ## caption = "\\label{tab:bench1c}Number of attempts until cancer."
    )
## ## data(benchmark_1)
## knitr::kable(benchmark_1[1:8, c("Attempts.Median",
##                                 "PDBaseline", "n2"), drop = FALSE], 
##     booktabs = TRUE,
## 	row.names = TRUE,
## 	col.names = c("Attempts until cancer", "PDBaseline", "n2"),
##     caption = "Median number of attempts until cancer.", 
## 	align = "r")
	
```


The McFL models finish in a single attempt. The exponential model
simulations where we can exit with small population sizes (`exp1`, `exp2`)
need many fewer attempts to reach cancer than those where large population
sizes are required (`exp3` to `exp6`). There is no relevant different
among those last four, which is what we would expect: a population that has
already reached a size of 50,000 cells from an initial population size of
500 is obviously a growing population where there is at least one mutant
with positive fitness; thus, it unlikely to go extinct and therefore
having to grow up to at least 500,000 will not significantly increase the
risk of extinction.


We will now rerun all of the above models with argument `onlyCancer =
FALSE`.  The results are shown in Table \@ref(tab:timing3) (note that the
differences between this table and Table \@ref(tab:bench1) for the McFL
models are due only to sampling variation).

\bslandscape


Table: (\#tab:timing3) Benchmarks of models in Table \@ref(tab:bench1) and \@ref(tab:bench1b) when run with `onlyCancer = FALSE`
```{r bench1d, eval=TRUE, echo = FALSE}
panderOptions('table.split.table', 99999999)
panderOptions('table.split.cells', 900)  ## For HTML
## panderOptions('table.split.cells', 8) ## For PDF
panderOptions('table.split.cells', 15) ## does not fit otherwise
set.alignment('right')
panderOptions('round', 3)

pander(benchmark_1[9:16, 
    c("Elapsed Time, average per simulation (s)",
 	              "Object Size, average per simulation (MB)",
 				  "Number of Clones, median",
 				  "Number of Iterations, median",
 				  "Final Time, median",
 				  "Total Population Size, median",
				  "Total Population Size, mean",
 				  "Total Population Size, max.",
 				  "keepEvery",
				  "PDBaseline",
				  "n2")],
				  justify = c('left', rep('right', 11)), ##  o.w. hlines not right
## caption = "\\label{tab:timing3} Benchmarks of models in Table \\@ref(tab:bench1) and \\@ref(tab:bench1b) when run with `onlyCancer = FALSE`."
				  )	
	
```

\eslandscape

\clearpage




Now most simulations under the exponential model end up in extinction, as
seen by the median population size of 0 (but not all, as the mean and
max. population size are clearly away from zero). Consequently,
simulations under the exponential model are now faster (and the size of
the average returned object is smaller). Of course, whether one should run
simulations with `onlyCancer = TRUE` or `onlyCancer = FALSE` will depend
on the question being asked (see, for example, section \@ref(exbauer) for
a question where we will naturally want to use `onlyCancer = FALSE`).


To make it easier to compare results with those of the next section, Table
\@ref(tab:allr1bck) shows all the runs so far.


\bslandscape

Table: (\#tab:allr1bck) Benchmarks of all models in Tables \@ref(tab:bench1), \@ref(tab:bench1b), and \@ref(tab:timing3).  
```{r bench1dx0, eval=TRUE, echo = FALSE}
panderOptions('table.split.table', 99999999)
## panderOptions('table.split.cells', 900)  ## For HTML
panderOptions('table.split.cells', 19)

set.alignment('right') 
panderOptions('round', 3)
	
pander(benchmark_1[ , c("Elapsed Time, average per simulation (s)",
 	              "Object Size, average per simulation (MB)", 
				  "Number of Clones, median", 
				  "Number of Iterations, median", 
				  "Final Time, median", "Total Population Size, median", 
				  "Total Population Size, mean", "Total Population Size, max.",
 	              "keepEvery", "PDBaseline", "n2", "onlyCancer")], 
				  justify = c('left', rep('right', 12)), ##  o.w. hlines not right
				  ## caption = "\\label{tab:allr1bck}Benchmarks of all models in Tables \\@ref(tab:bench1), \\@ref(tab:bench1b),  and \\@ref(tab:timing3)."  
				  )  
```

\eslandscape

\clearpage



### Changing fitness: $s=0.1$ and $s=0.05$ {#bench1xf}

In the above fitness specification the fitness effect of each gene
(when its restrictions are satisfied) is $s = 0.1$ (see section
\@ref(numfit) for details). Here we rerun all the above benchmarks
using $s= 0.05$ (the results from these benchmarks are available as
`data(benchmark_1_0.05)`) and results are shown below in Table
\@ref(tab:timing3xf).

\bslandscape

Table: (\#tab:timing3xf) Benchmarks of all models in Table \@ref(tab:allr1bck) using $s=0.05$ (instead of $s=0.1$).
```{r bench1dx, eval=TRUE, echo = FALSE}
## data(benchmark_1_0.05)
## knitr::kable(benchmark_1_0.05[, c("time_per_simul",
##     "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
## 	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", 
## 	"TotalPopSize.Max.",
## 	"keepEvery",
## 	"PDBaseline", "n2", "onlyCancer")], 
##     booktabs = TRUE,
## 	col.names = c("Elapsed Time, average per simulation (s)",
## 	              "Object Size, average per simulation (MB)",
## 				  "Number of Clones, median",
## 				  "Number of Iterations, median",
## 				  "Final Time, median",
## 				  "Total Population Size, median",
## 				  "Total Population Size, mean",
## 				  "Total Population Size, max.",				  
## 				  "keepEvery",
## 				  "PDBaseline", "n2", "onlyCancer"
## 				  ),
## ##    caption = "Benchmarks of models in Table \@ref(tab:bench1) and
## ##   \@ref(tab:bench1b) when run with `onlyCancer = FALSE`", 
## 	align = "c")
	
panderOptions('table.split.table', 99999999)
## panderOptions('table.split.cells', 900)  ## For HTML
panderOptions('table.split.cells', 19)

set.alignment('right') 
panderOptions('round', 3)
	
pander(benchmark_1_0.05[ , c("Elapsed Time, average per simulation (s)",
 	              "Object Size, average per simulation (MB)", 
				  "Number of Clones, median", 
				  "Number of Iterations, median", 
				  "Final Time, median", 
				  "Total Population Size, median", 
				  "Total Population Size, mean", "Total Population Size, max.",
 	              "keepEvery", "PDBaseline", "n2", "onlyCancer")], 
				  justify = c('left', rep('right', 12)), ##  o.w. hlines not right
 	              ## caption = "\\label{tab:timing3xf}Benchmarks of all models in Table \\@ref(tab:allr1bck) using $s=0.05$ (instead of $s=0.1$)."  
)  
				  
```

\eslandscape

\clearpage

As expected, having a smaller $s$ leads to slower processes in most cases,
since it takes longer to reach the exiting conditions sooner. Particularly
noticeable are the runs for the McFL models (notice the increases in
population size and number of iterations ---see also below). 


That is not the case, however, for `exp5` and `exp6` (and `exp5_noc` and
`exp6_noc`). When running with $s=0.05$ the simulations exit at a later
time (see column "Final Time") but they exit with smaller population
sizes. Here we have an interaction between sampling frequency, speed of
growth of the population, mutation events and number of clones. In
populations that grow much faster mutation events will happen more often
(which will trigger further iterations of the algorithm); in addition,
more new clones will be created, even if they only exist for short times
and become extinct by the following sampling period (so they are not
reflected in the `pops.by.time` matrix). These differences are
proportionally larger the larger the rate of growth of the
population. Thus, they are larger between, say, the `exp5` at $s=0.1$ and
$s=0.05$ than between the `exp4` at the two different $s$: the `exp5` exit
conditions can only be satisfied at much larger population sizes so at
populations sizes when growth is much faster (recall we are dealing with
exponential growth).

Recall also that with the default settings in `detectionProb`, we
assess the exiting condition every 20 time periods (argument
`checkSizePEvery`); this means that for fast growing populations,
the increase in population size between successive checks of the
exit conditions will be much larger (this phenomenon is also discussed in
section \@ref(common1)).


Thus, what is happening in the `exp5` and `exp6` with $s=0.1$ is
that close to the time the exit conditions could be satisfied, they
are growing very fast, accumulating mutants, and incurring in
additional iterations. They exit sooner in terms of time periods,
but they do much more work before arriving there.


The setting of `checkSizePEvery` is also having a huge effect on the McFL
model simulations (the number of iterations is $>10^6$). Even more than in
the previous section, checking the exiting condition more often (smaller
`checkSizePEvery`) would probably be justified here (notice also the very
large final times) and would lead to a sharp decrease in number of
iterations and, thus, running time.



The moral here is that in complex simulations like this (and most
simulations are complex), the effects of some parameters ($s$ in this
case) might look counter-intuitive at first. Thus the need to "experiment
before launching a large number of simulations".


## Several "common use cases" runs {#benchusual}

Let us now execute some simulations under more usual conditions. We will use
seven different fitness specifications: the pancreas example, two random
fitness landscapes, and four sets of independent genes (200 to 4000 genes)
with fitness effects randomly drawn from exponential distributions:


```{r fitusualb, echo = TRUE, eval = FALSE}
pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDNK2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDNK2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"),
    drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", 
	             "MLL3", "TGFBR2", "PXDN"))


## Random fitness landscape with 6 genes
## At least 50 accessible genotypes
rfl6 <- rfitness(6, min_accessible_genotypes = 50)
attributes(rfl6)$accessible_genotypes ## How many accessible
rf6 <- allFitnessEffects(genotFitness = rfl6)


## Random fitness landscape with 12 genes
## At least 200 accessible genotypes
rfl12 <- rfitness(12, min_accessible_genotypes = 200)
attributes(rfl12)$accessible_genotypes ## How many accessible
rf12 <- allFitnessEffects(genotFitness = rfl12)




## Independent genes; positive fitness from exponential distribution
## with mean around 0.1, and negative from exponential with mean 
## around -0.02. Half of genes positive fitness effects, half 
## negative.

ng <- 200 re_200 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10),
-rexp(ng/2, 50)))

ng <- 500
re_500 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), 
                                           -rexp(ng/2, 50)))

ng <- 2000
re_2000 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), 
                                            -rexp(ng/2, 50)))

ng <- 4000
re_4000 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), 
                                            -rexp(ng/2, 50)))

```


### Common use cases, set 1. {#common1}

We will use the Exp and the McFL models, run with different parameters. The
script is provided as 'benchmark_2.R', under '/inst/miscell', with output in
the 'miscell-files/vignette_bench_Rout' directory of the main OncoSimul
repository at https://github.com/rdiaz02/OncoSimul. The data are available
as `data(benchmark_2)`.

For the Exp model the call will be

```{r exp-usual-r, eval = FALSE, echo = TRUE}

oncoSimulPop(Nindiv,
            fitness,
            detectionProb = NA, 
            detectionSize = 1e6,
            initSize = 500,
            detectionDrivers = NA,
            keepPhylog = TRUE,
            model = "Exp",
            errorHitWallTime = FALSE,
            errorHitMaxTries = FALSE,
            finalTime = 5000,
            onlyCancer = FALSE,
            mc.cores = 1,
            sampleEvery = 0.5,
			keepEvery = 1)
```

And for McFL:

```{r mc-usual-r, eval = FALSE, echo = TRUE}
initSize <- 1000
oncoSimulPop(Nindiv,
              fitness,
               detectionProb = c(
                   PDBaseline = 1.4 * initSize,
                   n2 = 2 * initSize,
                   p2 = 0.1,
                   checkSizePEvery = 4),
               initSize = initSize,
               detectionSize = NA,
               detectionDrivers = NA,
               keepPhylog = TRUE,
               model = "McFL",
               errorHitWallTime = FALSE,
               errorHitMaxTries = FALSE,
               finalTime = 5000,
               max.wall.time = 10,
               onlyCancer = FALSE,
               mc.cores = 1,
			   keepEvery = 1)

```


For the exponential model we will stop simulations when populations
have $>10^6$ cells (simulations start from 500 cells). For the McFarland
model we will use the `detectionProb` mechanism (see section
\@ref(detectprob) for details); we could have used as stopping mechanism
`detectionSize = 2 * initSize` (which would be basically equivalent to
reaching cancer, as argued in [@McFarland2013]) but we want to provide
further examples under the `detectionProb` mechanism. We will start from 1000
cells, not 500 (starting from 1000 we almost always reach cancer in a
single run).


Why not use the `detectionProb` mechanism with the `Exp` models?  Because
it can be hard to intuitively understand what are reasonable settings for
the parameters of the `detectionProb` mechanism when used in a population
that is growing exponentially, especially if different genes have very
different effects on fitness. Moreover, we are using fitness
specifications that are very different (compare the fitness landscape of
six genes, the pancreas specification, and the fitness specification with
4000 genes with fitness effects drawn from an exponential distribution
---`re_4000`). In contrast, the `detectionProb` mechanism might be simpler
to reason about in a population that is growing under a model of carrying
capacity with possibly large periods of stasis. Let us emphasize that it
is not that the `detectionProb` mechanism does not make sense with the Exp
model; it is simply that the parameters might need finer adjustment for
them to make sense, and in these benchmarks we are dealing with widely
different fitness specifications.


Note also that we specify `checkSizePEvery = 4` (instead of the default,
which is 20). Why? Because the fitness specifications where fitness
effects are drawn from exponential distributions (`re_200` to `re_4000`
above) include many genes (well, up to 4000) some of them with possibly
very large effects. In these conditions, simulations can run very fast in
the sense of "units of time". If we check exiting conditions every 20
units the population could have increased its size several orders of
magnitude in between checks (this is also discussed in sections
\@ref(bench1xf) and \@ref(detectprob)). You can verify this by running the
script with other settings for `checkSizePEvery` (and being aware that
large settings might require you to wait for a long time). To ensure that
populations have really grown, we have increased the setting of
`PDBaseline` so that no simulation can be considered for stopping unless
its size is 1.4 times larger than `initSize`.

In all cases we use `keepEvery = 1` and `keepPhylog = TRUE` (so we store
the population sizes of all clones every 1 time unit and we keep the
complete genealogy of clones). Finally, we run all models with
`errorHitWallTime = FALSE` and `errorHitMaxTries = FALSE` so that we can
see results even if stopping conditions are not met.



<!-- ```{r loadbench2usual, echo = FALSE, eval = TRUE}  -->
<!-- data(benchmark_2)  -->
<!-- ```  -->

The results of the benchmarks, using `r unique(benchmark_2$Numindiv)`
individual simulations, are shown in Table \@ref(tab:timingusual).


\blandscape

Table: (\#tab:timingusual) Benchmarks under some common use cases, set 1.
```{r benchustable, eval=TRUE, echo = FALSE}
## data(benchmark_2)

## knitr::kable(benchmark_2[, c("Model", "fitness", "time_per_simul",
##     "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
## 	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", 
## 	"TotalPopSize.Max.")], 
##     booktabs = TRUE,
## 	col.names = c("Model",
## 				  "Fitness",
## 	"Elapsed Time, average per simulation (s)",
## 	              "Object Size, average per simulation (MB)",
## 				  "Number of Clones, median",
## 				  "Number of Iterations, median",
## 				  "Final Time, median",
## 				  "Total Population Size, median",
## 				  "Total Population Size, mean",
## 				  "Total Population Size, max."
## 				  ),
## 	align = "c")

panderOptions('table.split.table', 99999999)
panderOptions('table.split.cells', 900)  ## For HTML
## panderOptions('table.split.cells', 8) ## For PDF

## set.alignment('right', row.names = 'center')
panderOptions('table.alignment.default', 'right')

panderOptions('round', 3)

pander(benchmark_2[ , c(
    "Model", "Fitness",
    "Elapsed Time, average per simulation (s)",
 	              "Object Size, average per simulation (MB)",
 				  "Number of Clones, median",
 				  "Number of Iterations, median",
 				  "Final Time, median",
 				  "Total Population Size, median",
 				  "Total Population Size, mean",				  
 				  "Total Population Size, max.")], 
				  justify = c('left', 'left', rep('right', 8)),
				  ## caption = "\\label{tab:timingusual}Benchmarks under some common use cases, set 1." 
				  )	
	
```

\elandscape

\clearpage

In most cases, simulations run reasonably fast (under 0.1 seconds per
individual simulation) and the returned objects are small. I will only
focus on a few cases.

The McFL model with random fitness landscape `rf12` and with `pancr` does
not satisfy the conditions of `detectionProb` in most cases: its median
final time is 5000, which was the maximum final time specified. This
suggests that the fitness landscape is such that it is unlikely that we
will reach population sizes $> 1400$ (remember we the setting for
`PDBaseline`) before 5000 time units. There is nothing particular about
using a fitness landscape of 12 genes and other runs in other 12-gene
random fitness landscapes do not show this pattern. However, complex
fitness landscapes might be such that genotypes of high fitness (those
that allow reaching a large population size quickly) are not easily
accessible[^access] so reaching them might take a long time. This does not
affect the exponential model in the same way because, well, because there
is exponential growth in that model: any genotype with fitness $>1$ will
grow exponentially (of course, at possibly very different rates). You
might want to play with the script and modify the call to `rfitness`
(using different values of `reference` and `c`, for instance) to have
simpler paths to a maximum or modify the call to `oncoSimulPop` (with,
say, `finalTime` to much larger values). Some of these issues are related
to more general questions about fitness landscapes and accessibility (see
section \@ref(ex-ochs) and references therein).

[^access]:By easily accessible I mean that there are many, preferably
    short, paths of non-decreasing fitness from the wildtype to this
    genotype. See definitions and discussion in, e.g.,
    @franke_evolutionary_2011.


You could also set `onlyCancer = TRUE`. This might make sense if you are
interested in only seeing simulations that "reach cancer" (where "reach
cancer" means reaching a state you define as a function of population size
or drivers). However, if you are exploring fitness landscapes, `onlyCancer
= TRUE` might not always be reasonable as reaching a particular population
size, for instance, might just not be possible under some fitness
landscapes (this phenomenon is of course not restricted to random fitness
landscapes ---see also section \@ref(largegenes005)).



As we anticipated above, the `detectionProb` mechanism has to be used with
care: some of the simulations run in very short "time units", such as
those for the fitness specifications with 2000 and 4000 genes. Having used
a `checkSizePEvery = 20`  probably would not have made sense.


Finally, it is interesting that in the cases examined here, the two
slowest running simulations are from "Exp", with fitnesses `re_2000`
and `re_4000` (and the third slowest is also Exp, under
`re_500`). These are also the cases with the largest number of
clones. Why? In the "Exp" model there is no competition, and fitness
specifications `re_2000` and `re_4000` have genomes with many genes
with positive fitness contributions. It is thus very easy to obtain,
from the wildtype ancestor, a large number of clones all of which
have birth rates $>1$ and, thus, clones that are unlikely to become
extinct.




### Common use cases, set 2. {#common2}

We will now rerun the simulations above changing the following:

- `finalTime`  set to 25000.
- `onlyCancer`  set to TRUE.
- The "Exp" models will stop when population size $> 10^5$.


This is in script 'benchmark_3.R', under '/inst/miscell', with
output in the 'miscell-files/vignette_bench_Rout' directory of the
main OncoSimul repository at https://github.com/rdiaz02/OncoSimul.
The data are available as `data(benchmark_3)`.
	
\blandscape

Table: (\#tab:timingusual2) Benchmarks under some common use cases, set 2.	
```{r benchustable2, eval=TRUE, echo = FALSE}
## data(benchmark_3)

## knitr::kable(benchmark_3[, c("Model", "fitness", "time_per_simul",
##     "size_mb_per_simul", "NumClones.Median", "NumIter.Median",
## 	"FinalTime.Median", "TotalPopSize.Median", "TotalPopSize.Mean", 
## 	"TotalPopSize.Max.")], 
##     booktabs = TRUE,
## 	col.names = c("Model",
## 				  "Fitness", "Elapsed Time, average per simulation (s)",
## 	              "Object Size, average per simulation (MB)",
## 				  "Number of Clones, median",
## 				  "Number of Iterations, median",
## 				  "Final Time, median",
## 				  "Total Population Size, median",
## 				  "Total Population Size, mean",
## 				  "Total Population Size, max."
## 				  ),
## 	align = "c")

panderOptions('table.split.table', 99999999)
panderOptions('table.split.cells', 900)  ## For HTML
## panderOptions('table.split.cells', 8) ## For PDF


panderOptions('round', 3)
panderOptions('table.alignment.default', 'right')

pander(benchmark_3[ , c(
    "Model", "Fitness",
    "Elapsed Time, average per simulation (s)",
 	              "Object Size, average per simulation (MB)",
 				  "Number of Clones, median",
 				  "Number of Iterations, median",
 				  "Final Time, median",
 				  "Total Population Size, median",
 				  "Total Population Size, mean",				  
 				  "Total Population Size, max.")],
				  justify = c('left', 'left', rep('right', 8)),
				  ## caption = "\\label{tab:timingusual2}Benchmarks under some common use cases, set 2."
				  )	
```

\elandscape

\clearpage


Since we increased the maximum final time and forced runs to "reach
cancer" the McFL run with the pancreas fitness specification takes a bit
longer because it also has to do a larger number of
iterations. Interestingly, notice that the median final time is close to
10000, so the runs in \@ref(common1) with maximum final time of 5000 would
have had a hard time finishing with `onlyCancer = TRUE`.

Forcing simulations to "reach cancer" and just random differences between
the random fitness landscape also affect the McFL run under `rf12`: final
time is below 5000 and the median number of iterations is about half of
what was above.

Finally, by stopping the Exp simulations at $10^5$, simulations with
`re_2000` and `re_4000` finish now in much shorter times (but they still
take longer than their McFL counterparts) and the number of clones created
is much smaller.


## Can we use a large number of genes? {#lnum}

Yes. In fact, in OncoSimulR there is no pre-set limit on genome
size. However, large numbers of genes can lead to unacceptably large
returned object sizes and/or running time. We discuss several examples
next that illustrate some of the major issues to consider. Another example
with 50,000 genes is shown in section \@ref(mcf50070).

We have seen in \@ref(bench1) and \@ref(common1) that for the Exp model,
benchmark results using `detectionProb` require a lot of care and can be
misleading. Here, we will fix initial population sizes (to 500) and all
final population sizes will be set to $\geq 10^6$. In addition, to avoid
the confounding factor of the `onlyCancer = TRUE` argument, we will set it
to FALSE, so we measure directly the time of individual runs.


### Exponential model with 10,000 and 50,000 genes {#exp50000}

#### Exponential, 10,000 genes, example 1  {#exp100001}

We will start with 10000 genes and an exponential model, where we
stop when the population grows over $10^6$ individuals:

```{r exp10000, echo = TRUE, eval = FALSE}
ng <- 10000
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), 
                                      rep(-0.1, ng/2)))

t_e_10000 <- system.time(
    e_10000 <- oncoSimulPop(5, u, model = "Exp", mu = 1e-7,
                            detectionSize = 1e6,
                            detectionDrivers = NA,
                            detectionProb = NA,
                            keepPhylog = TRUE,
                            onlyCancer = FALSE,
                            mutationPropGrowth = TRUE,
                            mc.cores = 1))
```


```{r exp10000-out, echo = TRUE, eval = FALSE}
t_e_10000
##    user  system elapsed 
##   4.368   0.196   4.566 
  
summary(e_10000)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      5017      1180528       415116       143    7547
## 2      3726      1052061       603612       131    5746
## 3      4532      1100721       259510       132    6674
## 4      4150      1283115       829728        99    6646
## 5      4430      1139185       545958       146    6748

print(object.size(e_10000), units = "MB")
## 863.9 Mb

```

Each simulation takes about 1 second but note that the number of clones
for most simulations is already over 4000 and that the size of the returned
object is close to 1 GB (a more detailed explanation of where this 1 GB
comes from is deferred until section \@ref(wheresizefrom)).


#### Exponential, 10,000 genes, example 2 {#exp10000_2}

We can decrease the size of the returned object if we use the `keepEvery =
NA` argument (this setting was explained in detail in section
\@ref(bench1)):



```{r exp10000b, eval = FALSE, echo = TRUE}
t_e_10000b <- system.time(
    e_10000b <- oncoSimulPop(5,
                             u,
                             model = "Exp",
                             mu = 1e-7,
                             detectionSize = 1e6,
                             detectionDrivers = NA,
                             detectionProb = NA,
                             keepPhylog = TRUE,
                             onlyCancer = FALSE,
                             keepEvery = NA,
                             mutationPropGrowth = TRUE,
                             mc.cores = 1
                             ))

```

```{r exp10000b-out, echo = TRUE, eval = FALSE}
t_e_10000b
##    user  system elapsed 
##   5.484   0.100   5.585
  
summary(e_10000b)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      2465      1305094       727989        91    6447
## 2      2362      1070225       400329       204    8345
## 3      2530      1121164       436721       135    8697
## 4      2593      1206293       664494       125    8149
## 5      2655      1186994       327835       191    8572

print(object.size(e_10000b), units = "MB")
## 488.3 Mb

```

#### Exponential,  50,000 genes, example 1 {#exp500001}

Let's use 50,000 genes. To keep object sizes reasonable we use
`keepEvery = NA`. For now, we also set `mutationPropGrowth = FALSE`
so that the mutation rate does not become really large in clones
with many mutations but, of course, whether or not this is a
reasonable decision depends on the problem; see also below.

```{r exp50000, echo = TRUE, eval = FALSE}
ng <- 50000
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), 
                                      rep(-0.1, ng/2)))
t_e_50000 <- system.time(
    e_50000 <- oncoSimulPop(5,
                            u,
                            model = "Exp",
                            mu = 1e-7,
                            detectionSize = 1e6,
                            detectionDrivers = NA,
                            detectionProb = NA,
                            keepPhylog = TRUE,
                            onlyCancer = FALSE,
                            keepEvery = NA,
                            mutationPropGrowth = FALSE,
                            mc.cores = 1
                            ))


t_e_50000
##    user  system elapsed
##  44.192   1.684  45.891 

summary(e_50000)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      7367      1009949       335455     75.00   18214
## 2      8123      1302324       488469     63.65   17379
## 3      8408      1127261       270690     72.57   21144
## 4      8274      1138513       318152     80.59   20994
## 5      7520      1073131       690814     70.00   18569

print(object.size(e_50000), units = "MB")
## 7598.6 Mb
```

Of course, simulations now take longer and the size of the returned
object is over 7 GB (we are keeping more than 7,000 clones, even if when
we prune all those that went extinct). 

#### Exponential, 50,000 genes, example 2 {#exp50000_2}

What if we had not pruned? 

```{r exp50000np, echo = TRUE, eval = FALSE}
ng <- 50000
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), 
                                      rep(-0.1, ng/2)))
t_e_50000np <- system.time(
    e_50000np <- oncoSimulPop(5,
                              u,
                              model = "Exp",
                              mu = 1e-7,
                              detectionSize = 1e6,
                              detectionDrivers = NA,
                              detectionProb = NA,
                              keepPhylog = TRUE,
                              onlyCancer = FALSE,
                              keepEvery = 1,
                              mutationPropGrowth = FALSE,
                              mc.cores = 1
                              ))

t_e_50000np
##   user  system elapsed
## 42.316   2.764  45.079 

summary(e_50000np)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1     13406      1027949       410074     71.97   19469
## 2     12469      1071325       291852     66.00   17834
## 3     11821      1089834       245720     90.00   16711
## 4     14008      1165168       505607     77.61   19675
## 5     14759      1074621       205954     87.68   20597

print(object.size(e_50000np), units = "MB")
## 12748.4 Mb

```
The main effect is not on execution time but on object size (it has
grown by 5 GB). We are tracking more than 10,000 clones.


#### Exponential,  50,000 genes, example 3 {#exp50000_3}

What about the `mutationPropGrowth` setting? We will rerun the example in
\@ref(exp500001) leaving `keepEvery = NA` but with the default
`mutationPropGrowth`:

```{r exp50000mpg, echo = TRUE, eval = FALSE}

ng <- 50000
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), 
                                      rep(-0.1, ng/2)))

t_e_50000c <- system.time(
    e_50000c <- oncoSimulPop(5,
                             u,
                             model = "Exp",
                             mu = 1e-7,
                             detectionSize = 1e6,
                             detectionDrivers = NA,
                             detectionProb = NA,
                             keepPhylog = TRUE,
                             onlyCancer = FALSE,
                             keepEvery = NA,
                             mutationPropGrowth = TRUE,
                             mc.cores = 1
                             ))

t_e_50000c
##    user  system elapsed 
## 84.228   2.416  86.665

summary(e_50000c)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1     11178      1241970       344479     84.74   27137
## 2     12820      1307086       203544     91.94   33448
## 3     10592      1126091       161057     83.81   26064
## 4     11883      1351114       148986     65.68   25396
## 5     10518      1101392       253523     99.79   26082

print(object.size(e_50000c), units = "MB")
## 10904.9 Mb

```

As expected (because the mutation rate per unit time is increasing
in the fastest growing clones), we have many more clones, larger
objects, and longer times of execution here: we almost double the
time and the size of the object increases by almost 3 GB.


What about larger population sizes or larger mutation rates? The
number of clones starts growing fast, which means much slower
execution times and much larger returned objects (see also the examples
below).



#### Interlude: where is that 1 GB coming from? {#wheresizefrom}

In section \@ref(exp100001) we have seen an apparently innocuous
simulation producing a returned object of almost 1 GB. Where is that coming
from? It means that each simulation produced almost 200 MB of output.

Let us look at one simulation in more detail:

```{r sizedetail, eval = FALSE, echo = TRUE}
r1 <- oncoSimulIndiv(u,
                     model = "Exp",
                     mu = 1e-7,
                     detectionSize = 1e6,
                     detectionDrivers = NA,
                     detectionProb = NA,
                     keepPhylog = TRUE,
                     onlyCancer = FALSE,
                     mutationPropGrowth = TRUE
                     )
summary(r1)[c(1, 8)]
##   NumClones  FinalTime
## 1      3887        345

print(object.size(r1), units = "MB")
## 160 Mb

## Size of the two largest objects inside:
sizes <- lapply(r1, function(x) object.size(x)/(1024^2))
sort(unlist(sizes), decreasing = TRUE)[1:2]
## Genotypes pops.by.time 
##       148.28        10.26 

dim(r1$Genotypes)
## [1] 10000  3887
```

The above shows the reason: the `Genotypes` matrix is a 10,000 by 3,887
integer matrix (with a 0 and 1 indicating not-mutated/mutated for each
gene in each genotype) and in R integers use 4 bytes each. The
`pops.by.time` matrix is 346 by 3,888 (the 1 in $346 = 345 + 1$ comes from
starting at 0 and going up to the final time, both included; the 1 in
$3888 = 3887 + 1$ is from the column of time) double matrix and doubles
use 8 bytes[^popsbytime].

[^popsbytime]:These matrices do not exist during most of the
execution of the C++ code; they are generated right before returning
from the C++ code.


### McFarland model with 50,000 genes; the effect of `keepEvery` {#mc50000}

We show an example of McFarland's model with 50,000 genes in section
\@ref(mcf50070). We will show here a few more examples with those many
genes but with a different fitness specification and changing several
other settings.



#### McFarland, 50,000 genes, example 1 {#mc50000ex1}

Let's start with  `mutationPropGrowth = FALSE` and `keepEvery =
NA`. Simulations end when population size $\geq 10^6$.

```{r mc50000_1, echo = TRUE, eval = FALSE}
ng <- 50000
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), 
                                      rep(-0.1, ng/2)))

t_mc_50000_nmpg <- system.time(
    mc_50000_nmpg <- oncoSimulPop(5,
                                  u,
                                  model = "McFL",
                                  mu = 1e-7,
                                  detectionSize = 1e6,
                                  detectionDrivers = NA,
                                  detectionProb = NA,
                                  keepPhylog = TRUE,
                                  onlyCancer = FALSE,
                                  keepEvery = NA,
                                  mutationPropGrowth = FALSE,
                                  mc.cores = 1
                                  ))
t_mc_50000_nmpg
##   user  system elapsed 
##  30.46    0.54   31.01 


summary(mc_50000_nmpg)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      1902      1002528       582752     284.2   31137
## 2      2159      1002679       404858     274.8   36905
## 3      2247      1002722       185678     334.5   42429
## 4      2038      1009606       493574     218.4   32519
## 5      2222      1004661       162628     291.0   38470

print(object.size(mc_50000_nmpg), units = "MB")
## 2057.6 Mb

```

We are already dealing with 2000 clones.

#### McFarland, 50,000 genes, example 2 {#mc50000ex2}

Setting `keepEvery = 1` (i.e., keeping track of clones with an
interval of 1):

```{r mc50000_kp, echo = TRUE, eval = FALSE}
t_mc_50000_nmpg_k <- system.time(
    mc_50000_nmpg_k <- oncoSimulPop(5,
                                    u,
                                    model = "McFL",
                                    mu = 1e-7,
                                    detectionSize = 1e6,
                                    detectionDrivers = NA,
                                    detectionProb = NA,
                                    keepPhylog = TRUE,
                                    onlyCancer = FALSE,
                                    keepEvery = 1,
                                    mutationPropGrowth = FALSE,
                                    mc.cores = 1
                                    ))

t_mc_50000_nmpg_k
##    user  system elapsed 
##  30.000   1.712  31.714 

summary(mc_50000_nmpg_k)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      8779      1000223       136453     306.7   38102
## 2      7442      1006563       428150     345.3   35139
## 3      8710      1003509       224543     252.3   35659
## 4      8554      1002537       103889     273.7   36783
## 5      8233      1003171       263005     301.8   35236

print(object.size(mc_50000_nmpg_k), units = "MB")
## 8101.4 Mb
```

Computing time increases slightly but the major effect is seen on the size
of the returned object, that increases by a factor of about 4x, up to 8
GB, corresponding to the increase in about 4x in the number of clones
being tracked (see details of where the size of this object comes from in
section \@ref(wheresizefrom)).


#### McFarland, 50,000 genes, example 3 {#mc50000ex3}

We will set `keepEvery = NA` again, but we will now increase
detection size by a factor of 3 (so we stop when total population
size becomes $\geq 3 * 10^6$).

```{r mc50000_popx, echo = TRUE, eval = FALSE}
ng <- 50000
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng/2), 
                                      rep(-0.1, ng/2)))

t_mc_50000_nmpg_3e6 <- system.time(
    mc_50000_nmpg_3e6 <- oncoSimulPop(5,
                                      u,
                                      model = "McFL",
                                      mu = 1e-7,
                                      detectionSize = 3e6,
                                      detectionDrivers = NA,
                                      detectionProb = NA,
                                      keepPhylog = TRUE,
                                      onlyCancer = FALSE,
                                      keepEvery = NA,
                                      mutationPropGrowth = FALSE,
                                      mc.cores = 1
                                      ))
t_mc_50000_nmpg_3e6
##    user  system elapsed 
##  77.240   1.064  78.308 

summary(mc_50000_nmpg_3e6)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      5487      3019083       836793     304.5   65121
## 2      4812      3011816       789146     286.3   53087
## 3      4463      3016896      1970957     236.6   45918
## 4      5045      3028142       956026     360.3   63464
## 5      4791      3029720       916692     358.1   55012

print(object.size(mc_50000_nmpg_3e6), units = "MB")
## 4759.3 Mb
												   
```

Compared with the first run (\@ref(mc50000ex1)) we have
approximately doubled computing time, number of iterations, number
of clones, and object size.


#### McFarland, 50,000 genes, example 4 {#mc50000ex4}

Let us use the same `detectionSize = 1e6` as in the first example
(\@ref(mc50000ex1)), but with 5x the mutation rate:

```{r mc50000_mux, echo = TRUE, eval = FALSE}

t_mc_50000_nmpg_5mu <- system.time(
    mc_50000_nmpg_5mu <- oncoSimulPop(5,
                                      u,
                                      model = "McFL",
                                      mu = 5e-7,
                                      detectionSize = 1e6,
                                      detectionDrivers = NA,
                                      detectionProb = NA,
                                      keepPhylog = TRUE,
                                      onlyCancer = FALSE,
                                      keepEvery = NA,
                                      mutationPropGrowth = FALSE,
                                      mc.cores = 1
                                      ))

t_mc_50000_nmpg_5mu
##    user  system elapsed 
## 167.332   1.796 169.167 

summary(mc_50000_nmpg_5mu)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      7963      1004415       408352     99.03   57548
## 2      8905      1010751       120155    130.30   74738
## 3      8194      1005465       274661     96.98   58546
## 4      9053      1014049       119943    112.23   75379
## 5      8982      1011817        95047     99.95   76757

print(object.size(mc_50000_nmpg_5mu), units = "MB")
## 8314.4 Mb
``` 

The number of clones we are tracking is about 4x the number of
clones of the first example (\@ref(mc50000ex1)), and roughly similar
to the number of clones of the second example (\@ref(mc50000ex2)),
and size of the returned object is similar to that of the second
example.  But computing time has increased by a factor of about 5x
and iterations have increased by a factor of about 2x. Iterations
increase because mutation is more frequent; in addition, at each
sampling period each iteration needs to do more work as it needs to
loop over a larger number of clones and this larger number includes
clones that are not shown here, because they are pruned (they are
extinct by the time we exit the simulation ---again, pruning is
discussed with further details in \@ref(prune)).


#### McFarland, 50,000 genes, example 5 {#mc50000ex5}

Now let's run the above example but with `keepEvery = 1`:

```{r mcf5muk, echo = TRUE, eval = FALSE}
t_mc_50000_nmpg_5mu_k <- system.time(
    mc_50000_nmpg_5mu_k <- oncoSimulPop(5,
                                        u,
                                        model = "McFL",
                                        mu = 5e-7,
                                        detectionSize = 1e6,
                                        detectionDrivers = NA,
                                        detectionProb = NA,
                                        keepPhylog = TRUE,
                                        onlyCancer = FALSE,
                                        keepEvery = 1,
                                        mutationPropGrowth = FALSE,
                                        mc.cores = 1
                                        ))
												   
t_mc_50000_nmpg_5mu_k
##    user  system elapsed 
## 174.404   5.068 179.481 

summary(mc_50000_nmpg_5mu_k)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1     25294      1001597       102766     123.4   74524
## 2     23766      1006679       223010     124.3   71808
## 3     21755      1001379       203638     114.8   62609
## 4     24889      1012103       161003     119.3   75031
## 5     21844      1002927       255388     108.8   64556

print(object.size(mc_50000_nmpg_5mu_k), units = "MB")
## 22645.8 Mb
 
```

We have already seen these effects before in section
\@ref(mc50000ex2): using `keepEvery = 1` leads to a slight increase
in execution time. What is really affected is the size of the
returned object which increases by a factor of about 3x (and is now
over 20GB). That 3x corresponds, of course, to the increase in the
number of clones being tracked (now over 20,000). This, by the way,
also allows us to understand the comment above, where we said that
in these two cases (where we have increased mutation rate) at each
iteration we need to do more work as at every update of the
population the algorithm needs to loop over a much larger number of
clones (even if many of those are eventually pruned).



#### McFarland, 50,000 genes, example 6 {#mc50000ex6}

Finally, we will run the example in section \@ref(mc50000ex1) with the
default of `mutationPropGrowth = TRUE`:


```{r mc50000_2, echo = TRUE, eval = FALSE}

t_mc_50000 <- system.time(
    mc_50000 <- oncoSimulPop(5,
                             u,
                             model = "McFL",
                             mu = 1e-7,
                             detectionSize = 1e6,
                             detectionDrivers = NA,
                             detectionProb = NA,
                             keepPhylog = TRUE,
                             onlyCancer = FALSE,
                             keepEvery = NA,
                             mutationPropGrowth = TRUE,
                             mc.cores = 1
                             ))

t_mc_50000
##    user  system elapsed 
## 303.352   2.808 306.223 

summary(mc_50000)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1     13928      1010815       219814     210.9   91255
## 2     12243      1003267       214189     178.1   67673
## 3     13880      1014131       124354     161.4   88322
## 4     14104      1012941        75521     205.7   98583
## 5     12428      1005594       232603     167.4   70359

print(object.size(mc_50000), units = "MB")
## 12816.6 Mb

```

Note the huge increase in computing time (related of course to the huge
increase in number of iterations) and in the size of the returned object: we
have gone from having to track about 2000 clones to tracking over 12000
clones even when we prune all clones without descendants.



### Examples with $s = 0.05$ {#largegenes005}

A script with the above runs but using $s=0.05$ instead of $s=0.1$ is
available from the repository
('miscell-files/vignette_bench_Rout/large_num_genes_0.05.Rout'). I will
single out a couple of cases here.

First, we repeat the run shown in section \@ref(mc50000ex5):

```{r mcf5muk005, echo = TRUE, eval = FALSE}
t_mc_50000_nmpg_5mu_k <- system.time(
    mc_50000_nmpg_5mu_k <- oncoSimulPop(2,
                                        u,
                                        model = "McFL",
                                        mu = 5e-7,
                                        detectionSize = 1e6,
                                        detectionDrivers = NA,
                                        detectionProb = NA,
                                        keepPhylog = TRUE,
                                        onlyCancer = FALSE,
                                        keepEvery = 1,
                                        mutationPropGrowth = FALSE,
                                        mc.cores = 1
                                        ))
t_mc_50000_nmpg_5mu_k
##    user  system elapsed 
## 305.512   5.164 310.711 
 
summary(mc_50000_nmpg_5mu_k)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1     61737      1003273       104460  295.8731  204214
## 2     65072      1000540       133068  296.6243  210231
 
print(object.size(mc_50000_nmpg_5mu_k), units = "MB")
## 24663.6 Mb

```

Note we use only two replicates, since those two already lead to a
24 GB returned object as we are tracking more than 60,000 clones, more
than twice those with $s=0.1$.  The reason for the difference in
number of clones and iterations is of course the change from $s=0.1$
to $s=0.05$: under the McFarland model to reach population sizes of
$10^6$ starting from an equilibrium population of 500 we need about
43 mutations (whereas only about 22 are needed if $s=0.1$[^mcnum]).


[^mcnum]: Given the dependence of death rates on population size in
    McFarland's model (section \@ref(mcfl)), if all mutations have
    the same fitness effects we can calculate the equilibrium
    population size (where birth and death rates are equal) for a
    given number of mutated genes as: $K * (e^{(1 + s)^p} - 1)$,
    where $K$ is the initial equilibrium size, $s$ the fitness
    effect of each mutation, and $p$ the number of mutated genes.


Next, let us rerun \@ref(mc50000ex1):

```{r mc50000_1_005, echo = TRUE, eval = FALSE}
t_mc_50000_nmpg <- system.time(
    mc_50000_nmpg <- oncoSimulPop(5,
                                  u,
                                  model = "McFL",
                                  mu = 1e-7,
                                  detectionSize = 1e6,
                                  detectionDrivers = NA,
                                  detectionProb = NA,
                                  keepPhylog = TRUE,
                                  onlyCancer = FALSE,
                                  keepEvery = NA,
                                  mutationPropGrowth = FALSE,
                                  mc.cores = 1
                                  ))
t_mc_50000_nmpg
##    user  system elapsed 
## 111.236   0.596 111.834 
 
summary(mc_50000_nmpg)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      2646      1000700       217188   734.475  108566
## 2      2581      1001626       209873   806.500  107296
## 3      2903      1001409       125148   841.700  120859
## 4      2310      1000146       473948   906.300   91519
## 5      2704      1001290       448409   838.800  103556
 
print(object.size(mc_50000_nmpg), units = "MB")
## 2638.3 Mb

```
Using $s=0.05$ leads to a large increase in final time and number of
iterations. However, as we are using the `keepEvery = NA` setting,
the increase in number of clones tracked and in size of returned
object is relatively small.




### The different consequences of `keepEvery = NA` in the Exp and McFL models {#kpexpmc}

We have seen that `keepEvery = NA` often leads to much smaller returned
objects when using the McFarland model than when using the Exp model. Why?
Because in the McFarland model there is strong competition and there can
be complete clonal sweeps so that in extreme cases a single clone might be
all that is left after some time. This is not the case in the exponential
models. 

Of course, the details depend on the difference in fitness effects
between different genotypes (or clones). In particular, we have seen
several examples where even with `keepEvery=NA` there are a lot of
clones in the McFL models. In those examples many clones had
identical fitness (the fitness effects of all genes with positive
fitness was the same, and ditto for the genes with negative fitness
effects), so no clone ends up displacing all the others.




### Are we keeping the complete history (genealogy) of the clones? {#histlargegenes}

Yes we are if we run with `keepPhylog = TRUE`, regardless of the
setting for `keepEvery`. As explained in section \@ref(trackindivs),
OncoSimulR prunes clones that never had a population size larger
than zero at any sampling period (so they are not reflected in the
`pops.by.time` matrix in the output). And when we set `keepEvery =
NA` we are telling OncoSimulR to discard all sampling periods except
the very last one (i.e., the `pops.by.time` matrix contains only the
clones with 1 or more cells at the end of the simulation).

`keepPhylog` operates differently: it records the exact time at
which a clone appeared and the clone that gave rise to it. This
information is kept regardless of whether or not those clones appear
in the `pops.by.time` matrix. 

Keeping the complete genealogy might be of limited use if the
`pops.by.time` matrix only contains the very last period. However,
you can use `plotClonePhylog` and ask to be shown only clones that
exist in the very last period (while of course showing all of their
ancestors, even if those are now extinct ---i.e., regardless of
their abundance).

For instance, in run \@ref(exp500001) we could have looked at the
information stored about the genealogy of clones by doing (we look at the
first "individual" of the simulation, of the five "individuals" we
simulated):

```{r filog_exp50000_1, echo = TRUE, eval = FALSE}
head(e_50000[[1]]$other$PhylogDF)
##   parent child   time
## 1         3679 0.8402
## 2         4754 1.1815
## 3        20617 1.4543
## 4        15482 2.3064
## 5         4431 3.7130
## 6        41915 4.0628

tail(e_50000[[1]]$other$PhylogDF)
##                          parent                            child time
## 20672               3679, 20282               3679, 20282, 22359 75.0
## 20673        3679, 17922, 22346        3679, 17922, 22346, 35811 75.0
## 20674                2142, 3679                2142, 3679, 25838 75.0
## 20675        3679, 17922, 19561        3679, 17922, 19561, 43777 75.0
## 20676 3679, 15928, 19190, 20282 3679, 15928, 19190, 20282, 49686 75.0
## 20677         2142, 3679, 16275         2142, 3679, 16275, 24201 75.0
```

where each row corresponds to one event of appearance of a new
clone, the column labeled "parent" are the mutated genes in the
parent, and the column labeled "child" are the mutated genes in the child.


And we could plot the genealogical relationships of clones that have
a population size of at least one in the last period (again, while
of course showing all of their ancestors, even if those are now
extinct ---i.e., regardless of their current numbers) doing:

```{r noplotlconephylog, echo = TRUE, eval = FALSE}
plotClonePhylog(e_50000[[1]]) ## plot not shown

```

What is the cost of keep the clone genealogies? In terms of time it
is minor. In terms of space, and as shown in the example above, we
can end up storing a data frame with tends of thousands of rows and
three columns (two factors, one float). In the example above the
size of that data frame is approximately 2 MB for a single
simulation. This is much smaller than the `pops.by.time` or
`Genotypes` matrices, but it can quickly build up if you routinely
launch, say, 1000 simulations via `oncoSimulPop`. That is why the
default is `keepPhylog = FALSE` as this information is not needed as
often as that in the other two matrices (`pops.by.time` and
`Genotypes`). <!-- However, if you plan to measure evolutionary predictability -->
<!-- using Lines of Descent, or LOD (section \@ref(evolpredszend)) you should -->
<!-- run simulations with `keepPhylog = TRUE`. -->





## Population sizes $\geq 10^{10}$ {#popgtzx}

We have already seen examples where population sizes reach $10^{8}$
to $10^{10}$, as in Tables \@ref(tab:bench1b), \@ref(tab:timing3),
\@ref(tab:timing3xf). What about even larger population sizes?

The C++ code will unconditionally alert if population sizes exceed
$4*10^{15}$ as in those cases loosing precision (as we are using
doubles) would be unavoidable, and we would also run into problems
with the generation of binomial random variates (code that
illustrates and discusses this problem is available in file
"example-binom-problems.cpp", in directory
"/inst/miscell"). However, well before we reach $4*10^{15}$ we loose
precision from other sources. One of the most noticeable ones is
that when we reach population sizes around $10^{11}$ the C++ code
will often alert us by throwing exceptions with the message
`Recoverable exception ti set to DBL_MIN. Rerunning.` I throw this
exception because $t_i$, the random variable for time to next
mutation, is less than `DBL_MIN`, the minimum representable
floating-point number. This happens because, unless we use really
tiny mutation rates, the time to a mutation starts getting closer to
zero as population sizes grow very large. It might be possible to
ameliorate these problems somewhat by using long doubles (instead of
doubles) or special purpose libraries that provide more
precision. However, this would make it harder to run the same code
in different operating systems and would likely decrease execution
speed on the rest of the common scenarios for which OncoSimulR has
been designed.





The following code shows some examples where we use population sizes
of $10^{10}$ or larger. Since we do not want simulations in the
exponential model to end because of extinction, I use a fitness
specification where all genes have a positive fitness effect and we
start all simulations from a large population (to make it unlikely
that the population will become extinct before cells mutate and
start increasing in numbers). We set the maximum running time to 10
minutes. We keep the genealogy of the clones and use `keepEvery = 1`.


```{r ex-large-pop-size, eval = FALSE, echo = TRUE}
ng <- 50 
u <- allFitnessEffects(noIntGenes = c(rep(0.1, ng)))
```


```{r ex-large-mf, eval = FALSE, echo = TRUE}
t_mc_k_50_1e11 <- system.time(
    mc_k_50_1e11 <- oncoSimulPop(5,
                                 u,
                                 model = "McFL",
                                 mu = 1e-7,
                                 detectionSize = 1e11,
                                 initSize = 1e5,
                                 detectionDrivers = NA,
                                 detectionProb = NA,
                                 keepPhylog = TRUE,
                                 onlyCancer = FALSE,
                                 mutationPropGrowth = FALSE,
                                 keepEvery = 1,
                                 finalTime = 5000,
                                 mc.cores = 1,
                                 max.wall.time = 600
                                 ))

## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.

t_mc_k_50_1e11
## user  system elapsed 
## 613.612   0.040 613.664 

summary(mc_k_50_1e11)[, c(1:3, 8, 9)]
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      5491 100328847809  44397848771  1019.950  942764
## 2      3194 100048090441  34834178374   789.675  888819
## 3      5745 100054219162  24412502660   927.950  929231
## 4      4017 101641197799  60932177160   750.725  480938
## 5      5393 100168156804  41659212367   846.250  898245

## print(object.size(mc_k_50_1e11), units = "MB")
## 177.8 Mb

```

We get to $10^{11}$. But notice the exception with the warning about
$t_i$. Notice also that this takes a long time and we run a very
large number of iterations (getting close to one million in some
cases).


Now the exponential model with `detectionSize = 1e11`:

```{r ex-large-exp, eval = FALSE, echo = TRUE}
t_exp_k_50_1e11 <- system.time(
    exp_k_50_1e11 <- oncoSimulPop(5,
                                  u,
                                  model = "Exp",
                                  mu = 1e-7,
                                  detectionSize = 1e11,
                                  initSize = 1e5,
                                  detectionDrivers = NA,
                                  detectionProb = NA,
                                  keepPhylog = TRUE,
                                  onlyCancer = FALSE,
                                  mutationPropGrowth = FALSE,
                                  keepEvery = 1,
                                  finalTime = 5000,
                                  mc.cores = 1,
                                  max.wall.time = 600,
                                  errorHitWallTime = FALSE,
                                  errorHitMaxTries = FALSE
                                  ))

## Recoverable exception ti set to DBL_MIN. Rerunning.
## Hitted wall time. Exiting.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Hitted wall time. Exiting.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Recoverable exception ti set to DBL_MIN. Rerunning.
## Hitted wall time. Exiting.
## Hitted wall time. Exiting.
 
t_exp_k_50_1e11
##     user   system  elapsed 
## 2959.068    0.128 2959.556 
try(summary(exp_k_50_1e11)[, c(1:3, 8, 9)])
##   NumClones TotalPopSize LargestClone FinalTime NumIter
## 1      6078  65172752616  16529682757  235.7590 1883438
## 2      5370 106476643712  24662446729  232.0000 2516675
## 3      2711  21911284363  17945303353  224.8608  543698
## 4      2838  13241462284   2944300245  216.8091  372298
## 5      7289  76166784312  10941729810  240.0217 1999489

print(object.size(exp_k_50_1e11), units = "MB")
## 53.5 Mb

``` 

Note that we almost reached `max.wall.time` (600 * 5 = 3000). What if we
wanted to go up to $10^{12}$? We would not be able to do it in 10
minutes. We could set `max.wall.time` to a value larger than 600 to allow
us to reach larger sizes but then we would be waiting for a possibly
unacceptable time for simulations to finish. Moreover, this would
eventually fail as simulations would keep hitting the $t_i$ exception
without ever being able to complete. Finally, even if we were very
patient, hitting that $t_i$ exception should make us worry about possible
biases in the samples.




## A summary of some determinants of running time and space consumption 

To summarize this section, we have seen: 

  - Both McFL and Exp can be run in short times over a range of
      sizes for the `detectionProb` and `detectionSize` mechanisms
      using a complex fitness specification with  moderate numbers
      of genes. These are the typical or common use cases of
      OncoSimulR.
	  
  - The `keepEvery` argument can have a large effect on time in the
    McFL models and specially on object sizes. If only the end
    result of the simulation is to be used, you should set
    `keepEvery = NA`.
	
  - The distribution of fitness effects and the fitness landscape
    can have large effects on running times. Sometimes these are
    intuitive and simple to reason about, sometimes they are not as
    they interact with other factors (e.g., stopping mechanism,
    numbers of clones, etc). In general, there can be complex
    interactions between different settings, from mutation rate to
    fitness effects to initial size. As usual, test before launching
    a massive simulation.

  - Simulations start to slow down and lead to a very large object size
    when we keep track of around 6000 to 10000 clones. Anything that leads
    to these patterns will slow down the simulations.
	
  - OncoSimulR needs to keep track of genotypes (or clones), not
    just numbers of drivers and passengers, because it allows you to
    use complex fitness and mutation specifications that depend on
    specific genotypes. The `keepEvery = NA` is an approach to store
    only the minimal information needed, but it is unavoidable that
    during the simulations we might be forced to deal with many
    thousands of different clones.
 



\clearpage


# Specifying fitness effects {#specfit}

OncoSimulR uses a standard continuous time model, where individual
cells divide, die, and mutate with rates that can depend on genotype
and population size; over time the abundance of the different
genotypes changes by the action of selection (due to differences in
net growth rates among genotypes), drift, and mutation. As a
result of a mutation in a pre-existing clone new clones arise, and
the birth rate of a newly arisen clone is determined at the time of
its emergence as a function of its genotype.  Simulations can use an
use exponential growth model or a model with carrying capacity that
follows @McFarland2013. For the exponential growth model, the death
rate is fixed at one whereas in the model with carrying capacity
death rate increases with population size. In both cases, therefore,
fitness differences among genotypes in a given population at a
given time are due to differences in the mapping between genotype
and birth rate. There is second exponential model (called "Bozic")
where birth rate is fixed at one, and genotype determines death rate
instead of birth rate (see details in \@ref(numfit)). So when we
discuss specifying fitness effects or the effects of genes on
fitness, we are actually referring to specifying effects on birth
(or death) rates, which then translate into differences in fitness
(since the other rate, death or birth, is either fixed, as in the
Exp and Bozic models, or depends on the population size). This is
also shown in Table \@ref(tab:osrfeatures), in the rows for "Fitness
components", under "Evolutionary Features".




## Introduction to the specification of fitness effects {#introfit}



With OncoSimulR you can specify different types of effects on fitness:



* A special type of epistatic effect that is particularly amenable
  to be represented as a graph (a DAG). In this graph having, say,
  "B" be a child of "A" means that a mutation in B can only
  accumulate if a mutation in A is already present.  This is what OT
  [@Desper1999JCB; @Szabo2008], CBN [@Beerenwinkel2007;
  @Gerstung2009; @Gerstung2011], progression networks
  [@Farahani2013], and other similar models [@Korsunsky2014]
  generally mean. Details are provided in section
  \@ref(posetslong). Note that this is not an order effect
  (discussed below): the fitness of a genotype from this DAGs is a
  function of whether or not the restrictions in the graph are
  satisfied, not the historical sequence of how they were satisfied.

* Effects where the order in which mutations are acquired matters, as
  illustrated in section \@ref(oe). There is, in fact, empirical evidence of
  these effects [@Ortmann2015]. For instance, the fitness of genotype "A, B"
  would differ depending on whether A or B was acquired first (or, as in the
  actual example in [@Ortmann2015], the fitness of the mutant with JAK2 and
  TET2 mutated will depend on which of the genes was mutated first).

  
* General epistatic effects (e.g., section \@ref(epi)), including
  synthetic viability (e.g., section \@ref(sv)) and synthetic
  lethality/mortality (e.g., section \@ref(sl)).


* Genes that have independent effects on fitness (section \@ref(noint)).
  

* Modules (see section \@ref(modules0)) allow you to specify any of the
  above effects (except those for genes without interactions, as it would
  not make sense there) in terms of modules (sets of genes), not individual
  genes. We will introduce them right after \@ref(posetslong), and we will
  continue using them thereafter.


A guiding design principle of OncoSimulR is to try to make the
specification of those effects as simple as possible but also as flexible
as possible. Thus, there are two main ways of specifying fitness effects:


* Combining different types of effects in a single specification. For
  instance, you can combine epistasis with order effects with no interaction
  genes with modules. What you would do here is specify the effects that
  different mutations (or their combinations) have on fitness (the fitness
  effects) and then have OncoSimulR take care of combining them as if each
  of these were lego pieces. We will refer to this as the **lego system of
  fitness effects**. (As explained above, I find this an intuitive and very
  graphical analogy, which I have copied from @Hothorn_2006 and
  @Hothorn_2008).
  
  
* Explicitly passing to OncoSimulR a mapping of genotypes to
  fitness. Here you specify the fitness of each genotype. We will refer to
  this as the **explicit mapping of genotypes to fitness**.



Both approaches have advantages and disadvantages. Here I emphasize some
relevant differences.

* With the lego system you can specify huge genomes with an enormous
  variety of interactions, since the possible genotypes are not
  constructed in advance. You would not be able to do this with the
  explicit mapping of genotypes to fitness if you wanted to, say,
  construct that mapping for a modest genotype of 500 genes (you'd have
  more genotypes than particles in the observable Universe).
  
* For many models/data you often intuitively start with the fitness
  of the genotypes, not the fitness consequences of the different
  mutations. In these cases, you'd need to do the math to specify
  the terms you want if you used the lego system so you'll probably
  use the specification with the direct mapping genotype
  $\rightarrow$ fitness.
  
* Likewise, sometimes you already have a moderate size genotype
  $\rightarrow$ fitness mapping and you certainly do not want to do
  the math by hand: here the lego system would be painful to use.
  
* But sometimes we do think in terms of "the effects on fitness of
  such and such mutations are" and that immediately calls for the lego
  system, where you focus on the effects, and let OncoSimulR take care of
  doing the math of combining.
  
* If you want to use order effects, you must use the lego system (at
  least for now).
    
* If you want to specify modules, you must use the lego system (the
  explicit mapping of genotypes is, by its very nature, ill-suited for
  this).
  
* The lego system might help you see what your model really means: in
  many cases, you can obtain fairly succinct specifications of complex
  fitness models with just a few terms. Similarly, depending on what your
  emphasis is, you can often specify the same fitness landscape in several
  different ways.
  



Regardless of the route, you need to get that information into
  OncoSimulR's functions. The main function we will use is
  `allFitnessEffects`: this is the function in charge of reading
  the fitness specifications. We also need to discuss how, what, and where
  you have to pass to `allFitnessEffects`.
 

### Explicit mapping of genotypes to fitness {#explicitmap}

Conceptually, the simplest way to specify fitness is to specify the
mapping of all genotypes to fitness explicitly. An example will make
this clear. Let's suppose you have a simple two-gene scenario, so a
total of four genotypes, and you have a data frame with genotypes
and fitness, where genoytpes are specified as character vectors,
with mutated genes separated by commas:

```{r}
m4 <- data.frame(G = c("WT", "A", "B", "A, B"), F = c(1, 2, 3, 4))
``` 

Now, let's give that to the `allFitnessEffects` function:

```{r}
fem4 <- allFitnessEffects(genotFitness = m4)
``` 
(The message is just telling you what the program guessed you
wanted.)


That's it. You can try to plot that fitnessEffects object
```{r}
try(plot(fem4))
``` 

In this case, you probably want to plot the fitness landscape. <!--  that plot is not very interesting (compare with the -->
<!-- `plot(pancr)` we saw in \@ref(quickexample) or the plots in -->
<!-- \@ref(posetslong)). --> 


```{r, fig.width=6.5, fig.height = 6.5}
plotFitnessLandscape(evalAllGenotypes(fem4))
``` 

You can also check what OncoSimulR thinks the fitnesses are, with the
`evalAllGenotypes` function that we will use repeatedly below
(of course, here we should see the same fitnesses we entered):

```{r}
evalAllGenotypes(fem4, addwt = TRUE)
``` 


And you can plot the fitness landscape:

```{r}
plotFitnessLandscape(evalAllGenotypes(fem4))
``` 

To specify the mapping you can also use a matrix (or data frame) with
$g + 1$ columns; each of the first $g$ columns contains a 1 or a 0
indicating that the gene of that column is mutated or not. Column $g+ 1$
contains the fitness values. And you do not even need to specify all the
genotypes: the missing genotypes are assigned a fitness 0 ---except
for the WT genotype which, if missing, is assigned a fitness of 1:

```{r}
m6 <- cbind(c(1, 1), c(1, 0), c(2, 3))
fem6 <- allFitnessEffects(genotFitness = m6)
evalAllGenotypes(fem6, addwt = TRUE)
## plot(fem6)
```

```{r, fig.width=6.5, fig.height = 6.5}
plotFitnessLandscape(evalAllGenotypes(fem6))
``` 


This way of giving a fitness specification to OncoSimulR might be
ideal if you directly generate random mappings of genotypes to
fitness (or random fitness landscapes), as we will do in section
\@ref(gener-fit-land). Specially when the fitness landscape contains
many non-viable genotypes (which are considered those with fitness
---birth rate--- $<1e-9$) this can result in considerable savings as
we only need to check the fitness of the viable genotypes in a table
(a C++ map). Note, however, that using the Bozic model with the
fitness landscape specification is not tested. In addition, for
speed, missing genotypes from the fitness landscape specification
are taken to be non-viable genotypes (**beware!! this is a breaking
change relative to versions < 2.9.1**)[^flfast].

[^flfast]: Note for curious readers: it used to be the case that we
converted the table of fitness of genotypes to a fitness
specification with all possible epistatic interactions; you can take
a look at the test file `test.genot_fitness_to_epistasis.R` that
uses the `fem6` object. We no longer do that but instead pass
directly the fitness landscape.


<!-- How to simulate Big Bang: start from some initMutant, say S, so that -->
<!-- S A B C ...  Fitness -->
<!-- 1 0 0 0 0    >1 -->

<!-- and add that column of S with 1 to all genotypes from fitness -->

<!-- * What happens with mutator?   -->
<!--   - the fl specification? now tested-->
<!--   - do we have modules? nope in fitness yes in mutator -->


<!-- Bozic and fitness landscape specification -->
<!-- if(nrow(rFE$fitnessLandscape_df) > 0) -->
<!--             warning("Bozic model passing a fitness landscape will not work", -->
<!--                     " for now.") -->
<!--         ## FIXME: bozic and fitness landscape -->
<!--         ## the issue is that in the C++ code we directly do -->
<!--         ## s = birth rate - 1 -->
<!--         ## but we would need something different -->
<!--         ## Can be done going through epistasis, etc -->

<!-- We will see an example of this way of passing fitness again in -->
<!-- \@ref(bauer), where we will compare it with the lego system. -->


<!-- % Please see the documentation of `allFitnessEffects` for further -->
<!-- % details and examples. -->


<!-- % This can be done with OncoSimulR (e.g., see sections \@ref(e2), \@ref(e3) -->
<!-- % and \@ref(theminus) or the example in \@ref(weis1b)), but this only makes -->
<!-- % sense for subsets of the genes or for very small genotypes, as you -->
<!-- % probably do not want to be explicit about the mapping of $2^k$ genotypes -->
<!-- % to fitness when $k$ is larger than, say, four or five, and definitely not -->
<!-- % when $k$ is 10. -->


<!-- Not for self: actually, what I really do is mapping -->
<!-- the genotypes to epistasis effects. -->

<!-- So we either use a succint description, or we map all -->
<!-- to epistasis. -->




### How to specify fitness effects with the lego system {#howfit}

<!-- % A guiding design principle of OncoSimulR is to try to make the -->
<!-- % specification of those effects as simple as possible but also as flexible -->
<!-- % as possible.  -->


An alternative general approach followed in many genetic simulators is to
specify how particular combinations of alleles modify the wildtype genotype
or the genotype that contains the individual effects of the interacting
genes (e.g., see equation 1 in the supplementary material for FFPopSim
[@Zanini2012]).  For example, if we specify that a mutation in "A"
contributes 0.04, a mutation in "B" contributes 0.03, and the double
mutation "A:B" contributes 0.1, that means that the fitness of the "A, B"
genotype (the genotype with A and B mutated) is that of the wildtype (1, by
default), plus (actually, times ---see section \@ref(numfit)--- but plus on
the log scale) the effects of having A mutated, plus (times) the effects of
having B mutated, plus (times) the effects of "A:B" both being mutated.


We will see below that with the "lego system" it is possible to do
something very similar to the explicit mapping of section
\@ref(explicitmap).  But this will sometimes require a more cumbersome
notation (and sometimes also will require your doing some math). We will see
examples in sections \@ref(e2), \@ref(e3) and \@ref(theminus) or the example
in \@ref(weis1b). But then, if we can be explicit about (at least some of)
the mappings $genotype \rightarrow fitness$, how are these procedures
different? When you use the "lego system" you can combine both a partial
explicit mapping of genotypes to fitness with arbitrary fitness effects of
other genes/modules. In other words, with the "lego system" OncoSimulR
makes it simple to be explicit about the mapping of specific genotypes,
while also using the "how this specific effects modifies previous
effects" logic, leading to a flexible specification. This also means that
in many cases the same fitness effects can be specified in several
different ways.

Most of the rest of this section is devoted to explaining how to combine
those pieces. Before that, however, we need to discuss the fitness model
we use.

<!-- % As we will see in the examples (e.g., see sections \@ref(e2), \@ref(e3), -->
<!-- % \@ref(exlong)) OncoSimulR makes it simple to be explicit about the mapping -->
<!-- % of specific genotypes, while also using the "how this specific effects -->
<!-- % modifies previous effects" logic, leading to a flexible -->
<!-- % specification. This also means that in many cases the same fitness -->
<!-- % effects can be specified in several different ways. -->


## Numeric values of fitness effects {#numfit}

We evaluate fitness using the usual  [@Zanini2012; @Gillespie1993;
  @Beerenwinkel2007; @Datta2013] multiplicative model: fitness is
$\prod (1 + s_i)$ where $s_i$ is the fitness effect of gene (or gene
interaction) $i$.  In all models except Bozic, this fitness refers to the
growth rate (the death rate being fixed to 1[^2]).
The original model of @McFarland2013 has a slightly different
parameterization, but you can go easily from one to the other (see section
\@ref(mcfl)).

For the Bozic model [@Bozic2010], however, the birth rate is set to
1, and the death rate then becomes $\prod (1 - s_i)$.

[^2]: You can change this if you really want to. 

<!-- On their interpretation/naming: -->
<!-- I do \prod (1 + s_i). The s_i I could call "selection coefficients", -->
<!--      as Gillespie, 1993; or "mutation effects", as in Fogle et al., 2008, -->
<!--      or Dayarian and Shraiman, 2014. -->
<!--    - The multiplicative \Prod (1+ s_j) is typicalone, as in Beerenwinkel -->
<!--      2007, PLoS Comp Biol -->


### McFarland parameterization {#mcfl}

In the original model of @McFarland2013, the effects of drivers
contribute to the numerator of the birth rate, and those of the
(deleterious) passengers to the denominator as: $\frac{(1 +
s)^d}{(1 + s_p)^p}$, where $d$ and $p$ are, respectively, the total
number of drivers and passengers in a genotype, and here the fitness
effects of all drivers is the same ($s$) and that of all passengers
the same too ($s_p$). Note that, as written above, and as explicitly
said in @McFarland2013 (see p. 2911) and @McFarland2014-phd (see
p. 9), "(...)  $s_p$ is the fitness disadvantage conferred by a
passenger". In other words, the larger the $s_p$ the more
deleterious the passenger.

This is obvious, but I make it explicit because in our
parameterization a positive $s$ means fitness advantage, whereas
fitness disadvantages are associated with negative $s$. Of course,
if you rewrite the above expression as $\frac{(1 + s)^d}{(1 -
s_p)^p}$ then we are back to the "positive means fitness advantage
and negative means fitness disadvantage".


As @McFarland2014-phd explains (see p.\ 9, bottom), we can rewrite
the above expression so that there are no terms in the
denominator. McFarland writes it as (I copy verbatim from the fourth
and fifth lines from the bottom on his p.\ 9) 
$(1 + s_d)^{n_d} (1 - s_p^{'})^{n_p}$ where $s_p^{'} = s_p/(1 + s_p)$.


However, if we want to express everything as products (no ratios)
and use the "positive s means advantage and negative s means
disadvantage" rule, we want to write the above expression as 
$(1 + s_d)^{n_d} (1 + s_{pp})^{n_p}$ where $s_{pp} = -s_p/(1 + s_p)$. And
this is actually what we do in v.2. There is an example, for
instance, in section \@ref(mcf5070) where you will see:

```{r mcflparam}
sp <- 1e-3 
spp <- -sp/(1 + sp)
```

so we are going from the "(...) $s_p$ is the fitness disadvantage
conferred by a passenger" in @McFarland2013 (p. 2911) and
@McFarland2014-phd (p. 9) to the expression where we have a product
$\prod (1 + s_i)$, with the "positive s means advantage and negative
s means disadvantage" rule. This reparameterization applies to
v.2. In v.1 we use the same parameterization as in the original one
in @McFarland2013, but with the "positive s means advantage and negative
s means disadvantage" rule (so we are using expression 
$\frac{(1 + s)^d}{(1 - s_p)^p}$).


<!-- However, we can map from this ratio to the usual product of terms by -->
<!-- using a different value of $s_p$, that we will call $s_{pp} = -->
<!-- -s_p/(1 + s_p)$ (see @McFarland2014-phd, his eq. 2.1 in p.9).  -->

For death rate, we use the expression that @McFarland2013 (see their
p. 2911) use "(...) for large cancers (grown to $10^6$ cells)": $D(N) =
\log(1 + N/K)$ where $K$ is the initial equilibrium population size. As
the authors explain, for large N/K the above expression "(...)
recapitulates Gompertzian dynamics observed experimentally for large
tumors".

By default, OncoSimulR uses a value of $K=initSize/(e^{1} - 1)$ so that
the starting population is at equilibrium. 


### No viability of clones and types of models {#noviab}

For all models where fitness affects directly the birth rate (all except
Bozic), if you specify that some event (say, mutating gene A) has $s_A \le
-1$, if that event happens then birth rate becomes zero. This is taken to
indicate that the clone is not even viable and thus disappears immediately
without any chance for mutation[^3].

[^3]:This is a shortcut that we take because we think that it is
  what you mean. Note, however, that technically a clone with birth
  rate of 0 might have a non-zero probability of mutating before
  becoming extinct because in the continuous time model we use
  mutation is not linked to reproduction. In the present code, we
  are not allowing for any mutation when birth rate is 0. There are
  other options, but none which I find really better. An alternative
  implementation makes a clone immediately extinct if and only if
  any of the $s_i = -\infty$.  However, we still need to handle the
  case with $s_i < -1$ as a special case. We either make it
  identical to the case with any $s_i = -\infty$ or for any $s_i >
  -\infty$ we set $(1 + s_i) = \max(0, 1 + s_i)$ (i.e., if $s_i <
  -1$ then $(1 + s_i) = 0$), to avoid obtaining negative birth rates
  (that make no sense) and the problem of multiplying an even number
  of negative numbers. I think only the second would make sense as
  an alternative.

Models based on Bozic, however, have a birth rate of 1 and mutations
affect the death rate. In this case, a death rate larger than birth rate,
*per se*, does not signal immediate extinction and, moreover, even for death
rates that are a few times larger than birth rates, the clone could mutate
before becoming extinct[^4].

[^4]:We said "a few times". For a clone of population size 1 ---which is
the size at which all clones start from mutation---, if death rate is,
say, 90 but birth rate is 1, the probability of mutating before becoming
extinct is very, very close to zero for all reasonable values of mutation
rate}. How do we signal immediate extinction or no viability in this case?
You can set the value of $s = -\infty$.


In general, if you want to identify some mutations or some
combinations of mutations as leading to immediate extinction (i.e.,
no viability), of the affected clone, set it to $-\infty$ as this
would work even if how birth rates of 0 are handled changes. Most
examples below evaluate fitness by its effects on the birth
rate. You can see one where we do it both ways in Section
\@ref(fit-neg-pos).



## Genes without interactions {#noint}

This is a simple scenario. Each gene $i$ has a fitness effect $s_i$ if
mutated. The $s_i$ can come from any distribution you want. As an example
let's use three genes. We know there are no order effects, but we will
also see what happens if we examine genotypes as ordered.

```{r}

ai1 <- evalAllGenotypes(allFitnessEffects(
    noIntGenes = c(0.05, -.2, .1)), order = FALSE)
``` 


We can easily verify the first results:

```{r}
ai1
``` 

```{r}
all(ai1[, "Fitness"]  == c( (1 + .05), (1 - .2), (1 + .1),
       (1 + .05) * (1 - .2),
       (1 + .05) * (1 + .1),
       (1 - .2) * (1 + .1),
       (1 + .05) * (1 - .2) * (1 + .1)))

``` 

And we can see that considering the order of mutations (see section
\@ref(oe)) makes no difference:

```{r}
(ai2 <- evalAllGenotypes(allFitnessEffects(
    noIntGenes = c(0.05, -.2, .1)), order = TRUE,
    addwt = TRUE))

``` 

(The meaning of the notation in the output table is as follows: "WT"
denotes the wild-type, or non-mutated clone. The notation $x > y$ means
that a mutation in "x" happened before a mutation in "y". A genotype
$x > y\ \_\ z$ means that a mutation in "x" happened before a
mutation in "y"; there is also a mutation in "z", but that is a gene
for which order does not matter).



And what if I want genes without interactions but I want modules (see
section \@ref(modules0))? Go to section \@ref(mod-no-epi).



## Using DAGs: Restrictions in the order of mutations as extended posets {#posetslong}


### AND, OR, XOR relationships {#andorxor}

The literature on Oncogenetic trees, CBNs, etc, has used graphs as a way
of showing the restrictions in the order in which mutations can
accumulate. The meaning of "convergent arrows" in these graphs, however,
differs. In Figure 1 of @Korsunsky2014 we are shown a simple diagram
that illustrates the three basic different meanings of convergent arrows
using two parental nodes. We will illustrate it here with three. Suppose
we focus on node "g" in the following figure (we will create it shortly)

```{r, fig.height=4}
data(examplesFitnessEffects)
plot(examplesFitnessEffects[["cbn1"]])
``` 


* In relationships of the type used in **Conjunctive Bayesian
  Networks (CBN)** [e.g., @Gerstung2009], we are modeling an **AND**
  relationship, also called **CMPN** by @Korsunsky2014 or
  **monotone** relationship by @Farahani2013. If the relationship in
  the graph is fully respected, then "g" will only appear if all of
  "c", "d", and "e" are already mutated.
  
* **Semimonotone** relationships *sensu*
  @Farahani2013 or **DMPN** *sensu*
  @Korsunsky2014 are **OR** relationships: "g" will appear if
  one or more of "c", "d", or "e" are already mutated.

* **XMPN** relationships [@Korsunsky2014] are **XOR**
  relationships: "g" will be present only if exactly one of "c",
  "d", or "e" is present. 



Note that Oncogenetic trees [@Desper1999JCB; @Szabo2008] need not
deal with the above distinctions, since the DAGs are trees: no node has
more than one incoming connection or more than one parent[^5].

[^5]: OTs and CBNs have some other technical differences about the
underlying model they assume, such as the exponential waiting time in
CBNs. We will not discuss them here.


To have a flexible way of specifying all of these restrictions, we will
want to be able to say what kind of dependency each child
node has on its parents.


### Fitness effects {#fitnessposets}

Those DAGs specify dependencies and, as explained in
@Diaz-Uriarte2015, it is simple to map them to a simple evolutionary
model: any set of mutations that does not conform to the restrictions
encoded in the graph will have a fitness of 0. However, we might not want
to require absolute compliance with the DAG. This means we might want to
allow deviations from the DAG with a corresponding penalization that is,
however, not identical to setting fitness to 0 [again, see
@Diaz-Uriarte2015]. This we can do by being explicit about the
fitness effects of the deviations from the restrictions encoded in the
DAG. We will use below a column of `s` for the fitness effect when
the restrictions are satisfied and a column of `sh` when they are
not. (See also \@ref(numfit) for the details about the meaning of the
fitness effects).


That way of specifying fitness effects makes it also trivial to use the
model in @Hjelm2006 where all mutations might be allowed to occur,
but the presence of some mutations increases the probability of occurrence
of other mutations. For example, the values of `sh` could be all
small positive ones (or for mildly deleterious effects, small negative
numbers), while the values of `s` are much larger positive numbers.


### Extended posets

In version 1 of this package we used posets in the sense of
@Beerenwinkel2007 and @Gerstung2009, as explained in section
\@ref(poset) and in the help for `poset`. Here, we continue using
two columns, that specify parents and children, but we add columns
for the specific values of fitness effects (both s and sh ---i.e.,
fitness effects for what happens when restrictions are and are not
satisfied) and for the type of dependency as explained in section
\@ref(andorxor).



We can now illustrate the specification of different fitness effects
using DAGs.

### DAGs: A first conjunction (AND) example {#cbn1}

```{r}

cs <-  data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = 0.1,
                 sh = -0.9,
                 typeDep = "MN")

cbn1 <- allFitnessEffects(cs)

``` 

(We skip one letter, just to show that names need not be consecutive or
have any particular order.)


We can get a graphical representation using the default "graphNEL"
```{r, fig.height=3}
plot(cbn1)
``` 

or one using "igraph":
```{r, fig.height=5}
plot(cbn1, "igraph")
``` 

<!-- %% The vignette crashes if I try to use the layout. -->

Since we have a parent and children, the reingold.tilford layout is
probably the best here, so you might want to use that:

```{r, fig.height=5}
library(igraph) ## to make the reingold.tilford layout available
plot(cbn1, "igraph", layout = layout.reingold.tilford)
``` 



And what is the fitness of all genotypes?

```{r}
gfs <- evalAllGenotypes(cbn1, order = FALSE, addwt = TRUE)

gfs[1:15, ]
```

You can verify that for each genotype, if a mutation is present without
all of its dependencies present, you get a $(1 - 0.9)$ multiplier, and you
get a $(1 + 0.1)$ multiplier for all the rest with its direct parents
satisfied. For example, genotypes "a", or "b", or "d", or "e" have
fitness $(1 + 0.1)$, genotype "a, b, c" has fitness $(1 + 0.1)^3$, but
genotype "a, c" has fitness $(1 + 0.1) (1 - 0.9) = 0.11$.



### DAGs: A second conjunction example {#cbn2}


Let's try a first attempt at a somewhat more complex example, where the
fitness consequences of different genes differ.
```{r}

c1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.1, -.2), c(-.05, -.06, -.07)),
                 typeDep = "MN")

try(fc1 <- allFitnessEffects(c1))

``` 
If you try this, you'll get an error. There is an error because the
"sh" varies within a child, and we do not allow that for a
poset-type specification, as it is ambiguous. If you need arbitrary
fitness values for arbitrary combinations of genotypes, you can
specify them using epistatic effects as in section \@ref(epi) and
order effects as in section \@ref(oe).

Why do we need to specify as many "s" and "sh" as there are rows (or a
single one, that gets expanded to those many) when the "s" and "sh"
are properties of the child node, not of the edges? Because, for ease, we
use a data.frame.

<!-- %% (By the way, yes, we convert all factors to strings in the parent, child, -->
<!-- %% and typeDep columns, so no need to specify `stringsAsFactor = TRUE`). -->



We fix the error in our specification. Notice that the "sh" is not set
to $-1$ in these examples. If you want strict compliance with the poset
restrictions, you should set $sh = -1$ or, better yet, $sh = -\infty$ (see
section \@ref(noviab)), but having an $sh > -1$ will lead to fitnesses that
are $> 0$ and, thus, is a way of modeling small deviations from the poset
[see discussion in @Diaz-Uriarte2015].

<!-- %% In these examples, the reason to set "sh" to values larger than $-1$ and -->
<!-- %% different among the genes is to allow us to easily see the actual, -->
<!-- %% different, terms that enter into the multiplication of the fitness effects -->
<!-- %% (and, also, to make it easier to catch bugs). -->


Note that for those nodes that depend only on "Root" the type of
dependency is irrelevant.

```{r}
c1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)),
                 typeDep = "MN")

cbn2 <- allFitnessEffects(c1)

``` 

<!-- %% We can get a graphical representation using the default "graphNEL" -->
<!-- %% <<fig.height=3>>= -->
<!-- %% plot(cbn2) -->
<!-- %% @  -->

<!-- %% or one using "igraph": -->
<!-- %% <<fig.height=5>>= -->
<!-- %%plot(cbn2, "igraph", layout = layout.reingold.tilford) -->
<!-- %% @  -->

<!-- %% (since this is a tree, the reingold.tilford layout is probably the best here). -->

We could get graphical representations but the figures would
be the same as in the example in section \@ref(cbn1), since the structure
has not changed, only the numeric values.

What is the fitness of all possible genotypes? Here, order of events
*per se* does not matter, beyond that considered in the poset. In
other words, the fitness of genotype "a, b, c" is the same no matter how
we got to "a, b, c". What matters is whether or not the genes on which
each of "a", "b", and "c" depend are present or not (I only show the first
10 genotypes)

```{r}
gcbn2 <- evalAllGenotypes(cbn2, order = FALSE)
gcbn2[1:10, ]
``` 


Of course, if we were to look at genotypes but taking into account order
of occurrence of mutations, we would see no differences

```{r}
gcbn2o <- evalAllGenotypes(cbn2, order = TRUE, max = 1956)
gcbn2o[1:10, ]
``` 

(The $max = 1956$ is there so that we show all the genotypes, even
if they are more than 256, the default.)

You can check the output and verify things are as they should. For instance:

```{r}
all.equal(
        gcbn2[c(1:21, 22, 28, 41, 44, 56, 63 ) , "Fitness"],
        c(1.01, 1.02, 0.1, 1.03, 1.04, 0.05,
          1.01 * c(1.02, 0.1, 1.03, 1.04, 0.05),
          1.02 * c(0.10, 1.03, 1.04, 0.05),
          0.1 * c(1.03, 1.04, 0.05),
          1.03 * c(1.04, 0.05),
          1.04 * 0.05,
          1.01 * 1.02 * 1.1,
          1.01 * 0.1 * 0.05,
          1.03 * 1.04 * 0.05,
          1.01 * 1.02 * 1.1 * 0.05,
          1.03 * 1.04 * 1.2 * 0.1, ## notice this
          1.01 * 1.02 * 1.03 * 1.04 * 1.1 * 1.2
          ))
``` 

A particular one that is important to understand is genotype with
mutated genes "c, d, e, g":

```{r}
gcbn2[56, ] 
all.equal(gcbn2[56, "Fitness"], 1.03 * 1.04 * 1.2 * 0.10)
``` 

where "g" is taken as if its dependencies are satisfied (as "c",
"d", and "e" are present) even when the dependencies of "c" are not
satisfied (and that is why the term for "c" is 0.9).


### DAGs: A semimonotone or "OR" example {#mn1}

We will reuse the above example, changing the type of relationship:
```{r}

s1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)),
                 typeDep = "SM")

smn1 <- allFitnessEffects(s1)

``` 

It looks like this (where edges are shown in blue to denote the
semimonotone relationship):
```{r, fig.height=3}
plot(smn1)
``` 


```{r}
gsmn1 <- evalAllGenotypes(smn1, order = FALSE)

``` 

Having just one parental dependency satisfied is now enough, in contrast
to what happened before. For instance:

```{r}
gcbn2[c(8, 12, 22), ]
gsmn1[c(8, 12, 22), ]

gcbn2[c(20:21, 28), ]
gsmn1[c(20:21, 28), ]
``` 


### An "XMPN" or "XOR" example {#xor1}

Again, we reuse the example above, changing the type of relationship:

```{r}

x1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)),
                 typeDep = "XMPN")

xor1 <- allFitnessEffects(x1)

``` 


It looks like this (edges in red to denote the "XOR" relationship):
```{r, fig.height=3}
plot(xor1)
``` 

```{r}

gxor1 <- evalAllGenotypes(xor1, order = FALSE)

``` 


Whenever "c" is present with both "a" and "b", the fitness component
for "c" will be $(1 - 0.1)$. Similarly for "g" (if more than one of
"d", "e", or "c" is present, it will show as $(1 - 0.05)$). For example:

```{r}
gxor1[c(22, 41), ] 
c(1.01 * 1.02 * 0.1, 1.03 * 1.04 * 0.05)
``` 

However, having just both "a" and "b" is identical to the case with
CBN and the monotone relationship (see sections \@ref(cbn2) and
\@ref(mn1)). If you want the joint presence of "a" and "b" to result in
different fitness than the product of the individual terms, without
considering the presence of "c", you can specify that using general
epistatic effects (section
\@ref(epi)).

<!-- %% ; XOR relationships of these kind are, actually, -->
<!-- %% examples of synthetic lethality, which are shown in section \@ref(sl). -->


We also see a very different pattern compared to CBN (section \@ref(cbn2))
here:
```{r}
gxor1[28, ] 
1.01 * 1.1 * 1.2
``` 

as exactly one of the dependencies for both "c" and "g" are satisfied.

But 
```{r}
gxor1[44, ] 
1.01 * 1.02 * 0.1 * 1.2
``` 
is the result of a $0.1$ for "c" (and a $1.2$ for "g" that has exactly
one of its dependencies satisfied).


### Posets: the three types of relationships {#p3}

```{r}

p3 <- data.frame(
    parent = c(rep("Root", 4), "a", "b", "d", "e", "c", "f"),
    child = c("a", "b", "d", "e", "c", "c", "f", "f", "g", "g"),
    s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3),
    sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)),
    typeDep = c(rep("--", 4), 
                "XMPN", "XMPN", "MN", "MN", "SM", "SM"))
fp3 <- allFitnessEffects(p3)
``` 

This is how it looks like:
```{r, fig.height=3}
plot(fp3)
``` 

We can also use "igraph":

```{r, fig.height=6}
plot(fp3, "igraph", layout.reingold.tilford)
``` 


```{r}

gfp3 <- evalAllGenotypes(fp3, order = FALSE)

``` 



Let's look at a few:

```{r}
gfp3[c(9, 24, 29, 59, 60, 66, 119, 120, 126, 127), ]

c(1.01 * 1.1, 1.03 * .05, 1.01 * 1.02 * 0.1, 0.1 * 0.05 * 1.3,
  1.03 * 1.04 * 1.2, 1.01 * 1.02 * 0.1 * 0.05,
  0.1 * 1.03 * 1.04 * 1.2 * 1.3,
  1.01 * 1.02 * 0.1 * 1.03 * 1.04 * 1.2,
  1.02 * 1.1 * 1.03 * 1.04 * 1.2 * 1.3,
  1.01 * 1.02 * 1.03 * 1.04 * 0.1 * 1.2 * 1.3)

``` 

As before, looking at the order of mutations makes no difference (look at
the test directory to see a test that verifies this assertion).



## Modules {#modules0}
As already mentioned, we can think of all the effects of fitness in terms
not of individual genes but, rather, modules. This idea is discussed in,
for example, @Raphael2014a, @Gerstung2011: the restrictions encoded
in, say, the DAGs can be considered to apply not to genes, but to
modules, where each module is a set of genes (and the intersection between
modules is the empty set). Modules, then, play the role of a "union
operation" over sets of genes. Of course, if we can use modules for the
restrictions in the DAGs we should also be able to use them for epistasis
and order effects, as we will see later (e.g., \@ref(oemod)).







### What does a module provide {#module-what-for}

Modules can provide very compact ways of specifying relationships when you
want to, well, model the existence of modules. For simplicity suppose
there is a module, "A", made of genes "a1" and "a2", and a module
"B", made of a single gene "b1". Module "B" can mutate if module
"A" is mutated, but mutating both "a1" and "a2" provides no
additional fitness advantage compared to mutating only a single one of
them.  We can specify this as:

```{r}
s <- 0.2
sboth <- (1/(1 + s)) - 1
m0 <- allFitnessEffects(data.frame(
    parent = c("Root", "Root", "a1", "a2"),
    child = c("a1", "a2", "b", "b"),
    s = s,
    sh = -1,
    typeDep = "OR"),
                        epistasis = c("a1:a2" = sboth))
evalAllGenotypes(m0, order = FALSE, addwt = TRUE)
``` 

Note that we need to add an epistasis term, with value "sboth"
to capture the idea of "mutating both "a1" and "a2"
provides no additional fitness advantage compared to mutating only a
single one of them"; see details in section \@ref(epi).



Now, specify it using modules:

```{r}
s <- 0.2
m1 <- allFitnessEffects(data.frame(
    parent = c("Root", "A"),
    child = c("A", "B"),
    s = s,
    sh = -1,
    typeDep = "OR"),
                        geneToModule = c("Root" = "Root",
                                         "A" = "a1, a2",
                                         "B" = "b1"))
evalAllGenotypes(m1, order = FALSE, addwt = TRUE)
``` 

This captures the ideas directly. The typing savings here are small, but
they can be large with modules with many genes.


<!-- %% %% \begin{tabular} {c c c} -->
<!-- %% %%   A & B & Fitness \\ -->
<!-- %% %%   \hline -->
<!-- %% %%   wt&wt& 1 \\ -->
<!-- %% %%   wt&M& sb \\ -->
<!-- %% %%   M&wt& sa\\ -->
<!-- %% %%   M&M& sab)\\ -->
<!-- %% %%   \hline -->
<!-- %% %% \end{tabular} -->

<!-- %% with A being 1, 2, and B 3, 4. -->

<!-- %% and having in a tree A depends on Root and B depends on A -->



<!-- %% \begin{tabular} {c c c} -->
<!-- %%   model & Fitness satisfied & fitness not satisf\\ -->
<!-- %%   \hline -->
<!-- %%   0 , 1 & s \\ -->
<!-- %%   0 , 2 & s \\ -->

<!-- %%   1 , 3 & s3 & sm \\ -->
<!-- %%   2 , 3 & s3 & sm \\ -->

<!-- %%   1 , 4 & s3 & sm \\ -->
<!-- %%   2 , 4 & s3 & sm \\ -->
  
<!-- %%   1 : 2 & s12 \\ -->
<!-- %%   3:  4 & s34 \\ -->
  

<!-- %%   \hline -->
<!-- %% \end{tabular} -->

<!-- %% just give the specification, the full one. -->


<!-- %% and write equivalendes of s12 as a function of Sa, S34 as a function of -->
<!-- %% Sb, etc. -->

### Specifying modules {#modules}

How do you specify modules? The general procedure is simple: you pass a
vector that makes explicit the mapping from modules to sets of genes. We
just saw an example. There are several additional examples such as
\@ref(pm3), \@ref(oemod), \@ref(epimod).

<!-- %% Why do we force you to specify "Root" = "Root"? We could check for it, -->
<!-- %% and add it if it is not present. But we want you to be explicit (and we -->
<!-- %% want to avoid you shooting yourself in the foot having a gene that is not -->
<!-- %% the root of the tree but is called "Root", etc). -->


It is important to note that, once you specify modules, we expect all of
the relationships (except those that involve the non interacting genes) to
be specified as modules. Thus, all elements of the epistasis, posets (the
DAGs) and order effects components should be specified in terms of
modules. But you can, of course, specify a module as containing a single
gene (and a single gene with the same name as the module).


What about the "Root" node? If you use a "restriction table", that
restriction table (that DAG) must have a node named "Root" and in the
mapping of genes to module there **must** be a first entry that has a
module and gene named "Root", as we saw above with 
`geneToModule  = c("Root" = "Root", ...`. 
We force you to do this to be explicit about
the "Root" node. This is not needed (thought it does not hurt) with
other fitness specifications. For instance, if we have a model with two
modules, one of them with two genes (see details in section
\@ref(mod-no-epi)) we do not need to pass a "Root" as in


```{r}
fnme <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2),
                          geneToModule = c("A" = "a1, a2",
                                           "B" = "b1"))
evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)
``` 

but it is also OK to have a "Root" in the `geneToModule`:

```{r}
fnme2 <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2),
                          geneToModule = c(
                              "Root" = "Root",
                              "A" = "a1, a2",
                              "B" = "b1"))
evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)
``` 


### Modules and posets again: the three types of relationships and modules {#pm3}


We use the same specification of poset, but add modules. To keep it
manageable, we only add a few genes for some modules, and have some
modules with a single gene. Beware that the number of genotypes is
starting to grow quite fast, though.  We capitalize to differentiate
modules (capital letters) from genes (lowercase with a number), but this
is not needed.


```{r}
p4 <- data.frame(
    parent = c(rep("Root", 4), "A", "B", "D", "E", "C", "F"),
    child = c("A", "B", "D", "E", "C", "C", "F", "F", "G", "G"),
    s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3),
    sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)),
    typeDep = c(rep("--", 4), 
                "XMPN", "XMPN", "MN", "MN", "SM", "SM"))

fp4m <- allFitnessEffects(
    p4,
    geneToModule = c("Root" = "Root", "A" = "a1",
                     "B" = "b1, b2", "C" = "c1",
                     "D" = "d1, d2", "E" = "e1",
                     "F" = "f1, f2", "G" = "g1"))
``` 

By default, plotting shows the modules:

```{r, fig.height=3}
plot(fp4m)
``` 

but we can show the gene names instead of the module names:

```{r, fig.height=3}
plot(fp4m, expandModules = TRUE)
``` 

or

```{r, fig.height=6}
plot(fp4m, "igraph", layout = layout.reingold.tilford, 
     expandModules = TRUE)

``` 

We obtain the fitness of all genotypes in the usual way:

```{r}
gfp4 <- evalAllGenotypes(fp4m, order = FALSE, max = 1024)
``` 

Let's look at a few of those:

```{r}
gfp4[c(12, 20, 21, 40, 41, 46, 50, 55, 64, 92,
       155, 157, 163, 372, 632, 828), ]

c(1.01 * 1.02, 1.02, 1.02 * 1.1, 0.1 * 1.3, 1.03, 
  1.03 * 1.04, 1.04 * 0.05, 0.05 * 1.3,  
  1.01 * 1.02 * 0.1, 1.02 * 1.1, 0.1 * 0.05 * 1.3,
  1.03 * 0.05, 1.03 * 0.05, 1.03 * 1.04 * 1.2, 1.03 * 1.04 * 1.2, 
  1.02 * 1.1 * 1.03 * 1.04 * 1.2 * 1.3)

``` 



## Order effects {#oe}

As explained in the introduction (section \@ref(introdd)), by order effects we
mean a phenomenon such as the one shown empirically by @Ortmann2015:
the fitness of a double mutant "A", "B" is different depending on
whether "A" was acquired before "B" or "B" before "A". This, of
course, can be generalized to more than two genes.

Note that order effects are different from the restrictions in the
order of accumulation of mutations discussed in section
\@ref(posetslong). With restrictions in the order of accumulation of
mutations we might say that acquiring "B" depends or is facilitated
by having "A" mutated (and, unless we allowed for multiple
mutations, having "A" mutated means having "A" mutated before
"B"). However, once you have the genotype "A, B", its fitness does
not depend on the order in which "A" and "B" appeared.



### Order effects: three-gene orders {#oeftres}

Consider this case, where three specific three-gene orders and two
two-gene orders (one of them a subset of one of the three) lead to
different fitness compared to the wild-type. We add also modules, to show
its usage (but just limit ourselves to using one gene per module here). 

Order effects are specified using a $x > y$, which means that that order
effect is satisfied when module $x$ is mutated before module $y$.

```{r}
o3 <- allFitnessEffects(orderEffects = c(
                            "F > D > M" = -0.3,
                            "D > F > M" = 0.4,
                            "D > M > F" = 0.2,
                            "D > M"     = 0.1,
                            "M > D"     = 0.5),
                        geneToModule =
                            c("M" = "m",
                              "F" = "f",
                              "D" = "d") )


(ag <- evalAllGenotypes(o3, addwt = TRUE, order = TRUE))
``` 


<!-- %% <<>>= -->
<!-- %% o3 <- allFitnessEffects(orderEffects = c( -->
<!-- %%                             "F > D > M" = -0.3, -->
<!-- %%                             "D > F > M" = 0.4, -->
<!-- %%                             "D > M > F" = 0.2, -->
<!-- %%                             "D > M"     = 0.1, -->
<!-- %%                             "M > D"     = 0.5), -->
<!-- %%                         geneToModule = -->
<!-- %%                             c("Root" = "Root", -->
<!-- %%                               "M" = "m", -->
<!-- %%                               "F" = "f", -->
<!-- %%                               "D" = "d") ) -->
<!-- %% (ag <- evalAllGenotypes(o3, addwt = TRUE)) -->
<!-- %% @  -->

(The meaning of the notation in the output table is as follows: "WT"
denotes the wild-type, or non-mutated clone. The notation $x > y$ means
that a mutation in "x" happened before a mutation in "y". A genotype
$x > y\ \_\ z$ means that a mutation in "x" happened before a
mutation in "y"; there is also a mutation in "z", but that is a gene
for which order does not matter).


The values for the first nine genotypes come directly from the fitness
specifications. The 10th genotype matches $D > F > M$ ($= (1 + 0.4)$)
but also $D > M$ ($(1 + 0.1)$). The 11th matches $D > M > F$ and $D >
M$. The 12th matches $F > D > M$ but also $D > M$. Etc.




### Order effects and modules with multiple genes {#oemod}

Consider the following case:

```{r}

ofe1 <- allFitnessEffects(
    orderEffects = c("F > D" = -0.3, "D > F" = 0.4),
    geneToModule =
        c("F" = "f1, f2",
          "D" = "d1, d2") )

ag <- evalAllGenotypes(ofe1, order = TRUE)

``` 

There are four genes, $d1, d2, f1, f2$, where each $d$ belongs to module
$D$ and each $f$ belongs to module $F$.

What to expect for cases such as $d1 > f1$ or $f1 > d1$ is clear, as shown in

```{r}
ag[5:16,]
``` 

Likewise, cases such as $d1 > d2 > f1$ or $f2 > f1 > d1$ are clear,
because in terms of modules they map to $ D > F$ or $F > D$: the observed
order of mutation $d1 > d2 > f1$ means that module $D$ was mutated first
and module $F$ was mutated second. Similar for $d1 > f1 > f2$ or
$f1 > d1 > d2$: those map to $D > F$ and $F > D$. We can see the fitness
of those four case in:

```{r}
ag[c(17, 39, 19, 29), ]
``` 

and they correspond to the values of those order effects, where $F > D =
(1 - 0.3)$ and $D > F = (1 + 0.4)$:

```{r}
ag[c(17, 39, 19, 29), "Fitness"] == c(1.4, 0.7, 1.4, 0.7)
``` 

What if we match several patterns? For example, $d1 > f1 > d2 > f2$ and
$d1 > f1 > f2 > d2$? The first maps to $D > F > D > F$ and the second to
$D > F > D$. But since we are concerned with which one happened first and
which happened second we should expect those two to correspond to the same
fitness, that of pattern $D > F$, as is the case:

```{r}
ag[c(43, 44),]
ag[c(43, 44), "Fitness"] == c(1.4, 1.4)
``` 
More generally, that applies to all the patterns that start with one of
the "d" genes:

```{r}
all(ag[41:52, "Fitness"] == 1.4)
``` 

Similar arguments apply to the opposite pattern, $F > D$, which apply to
all the possible gene mutation orders that start with one of the "f"
genes. For example:

```{r}
all(ag[53:64, "Fitness"] == 0.7)
``` 



### Order and modules with 325 genotypes

We can of course have more than two genes per module. This just repeats
the above, with five genes (there are 325 genotypes, and that is why we
pass the "max" argument to `evalAllGenotypes`, to allow for
more than the default 256).

```{r}

ofe2 <- allFitnessEffects(
    orderEffects = c("F > D" = -0.3, "D > F" = 0.4),
    geneToModule =
        c("F" = "f1, f2, f3",
          "D" = "d1, d2") )
ag2 <- evalAllGenotypes(ofe2, max = 325, order = TRUE)

``` 

We can verify that any combination that starts with a "d" gene and then
contains at least one "f" gene will have  a fitness of $1+0.4$.  And any
combination that starts with an "f" gene and contains at least one "d"
genes will have a fitness of $1 - 0.3$.  All other genotypes have a
fitness of 1:

```{r}
all(ag2[grep("^d.*f.*", ag2[, 1]), "Fitness"] == 1.4)
all(ag2[grep("^f.*d.*", ag2[, 1]), "Fitness"] == 0.7)
oe <- c(grep("^f.*d.*", ag2[, 1]), grep("^d.*f.*", ag2[, 1]))
all(ag2[-oe, "Fitness"] == 1)
``` 



### Order effects and genes without interactions

We will now look at both order effects and interactions. To make things
more interesting, we name genes so that the ordered names do split nicely
between those with and those without order effects (this, thus, also
serves as a test of messy orders of names).

```{r}

foi1 <- allFitnessEffects(
    orderEffects = c("D>B" = -0.2, "B > D" = 0.3),
    noIntGenes = c("A" = 0.05, "C" = -.2, "E" = .1))

``` 

You can get a verbose view of what the gene names and modules are (and
their automatically created numeric codes) by:

```{r}
foi1[c("geneModule", "long.geneNoInt")]
``` 

We can get the fitness of all genotypes (we set $max = 325$ because that
is the number of possible genotypes):

```{r}
agoi1 <- evalAllGenotypes(foi1,  max = 325, order = TRUE)
head(agoi1)
``` 



Now:
```{r}
rn <- 1:nrow(agoi1)
names(rn) <- agoi1[, 1]

agoi1[rn[LETTERS[1:5]], "Fitness"] == c(1.05, 1, 0.8, 1, 1.1)

``` 

According to the fitness effects we have specified, we also know that any
genotype with only two mutations, one of which is either "A", "C" "E" and
the other is "B" or "D" will have the fitness corresponding to "A", "C" or
"E", respectively:

```{r}
agoi1[grep("^A > [BD]$", names(rn)), "Fitness"] == 1.05
agoi1[grep("^C > [BD]$", names(rn)), "Fitness"] == 0.8
agoi1[grep("^E > [BD]$", names(rn)), "Fitness"] == 1.1
agoi1[grep("^[BD] > A$", names(rn)), "Fitness"] == 1.05
agoi1[grep("^[BD] > C$", names(rn)), "Fitness"] == 0.8
agoi1[grep("^[BD] > E$", names(rn)), "Fitness"] == 1.1
``` 


We will not be playing many additional games with regular expressions, but
let us check those that start with "D" and have all the other mutations,
which occupy rows 230 to 253; fitness should be equal (within numerical
error, because of floating point arithmetic) to the order effect of "D"
before "B" times the other effects $(1 - 0.3) * 1.05 * 0.8 * 1.1 = 0.7392$

```{r}
all.equal(agoi1[230:253, "Fitness"] ,
          rep((1 - 0.2) * 1.05 * 0.8 * 1.1, 24))
``` 
and that will also be the value of any genotype with the five mutations
where "D" comes before "B" such as those in rows 260 to 265, 277, or
322 and 323, but it will be equal to $(1 + 0.3) * 1.05 * 0.8 * 1.1 =
1.2012$ in those where "B" comes before "D". Analogous arguments apply
to four, three, and two mutation genotypes.



## Epistasis {#epi}

### Epistasis: two alternative specifications {#e2}

We want the following mapping of genotypes to fitness:

------------------------
A    B       Fitness 
--   --   --------------
wt   wt    1 

wt   M     $1 + s_b$ 

M    wt    $1 + s_a$

M    M     $1 + s_{ab}$
--  --    --------------- 
 
Suppose that the actual numerical values are $s_a = 0.2, s_b = 0.3, s_{ab}
= 0.7$.

We specify the above as follows: 
```{r}
sa <- 0.2
sb <- 0.3
sab <- 0.7

e2 <- allFitnessEffects(epistasis =
                            c("A: -B" = sa,
                              "-A:B" = sb,
                              "A : B" = sab))
evalAllGenotypes(e2, order = FALSE, addwt = TRUE)

``` 

That uses the "-" specification, so we explicitly exclude some patterns:
with "A:-B" we say "A when there is no B".

But we can also use a specification where we do not use the "-". That
requires a different numerical value of the interaction, because now, as
we are rewriting the interaction term as genotype "A is mutant, B is
mutant" the double mutant will incorporate the effects of "A mutant",
"B mutant" and "both A and B mutants". We can define a new $s_2$ that
satisfies $(1 + s_{ab}) = (1 + s_a) (1 + s_b) (1 + s_2)$ so
$(1 + s_2) = (1 + s_{ab})/((1 + s_a) (1 + s_b))$ and therefore specify as:


```{r}
s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1

e3 <- allFitnessEffects(epistasis =
                            c("A" = sa,
                              "B" = sb,
                              "A : B" = s2))
evalAllGenotypes(e3, order = FALSE, addwt = TRUE)

``` 

Note that this is the way you would specify effects with FFPopsim
[@Zanini2012]. Whether this specification or the previous one with
"-" is simpler will depend on the model. For synthetic mortality and
viability, I think the one using "-" is simpler to map genotype tables
to fitness effects. See also section \@ref(e3) and \@ref(theminus) and the
example in section \@ref(weis1b).


Finally, note that we can also specify some of these effects by combining
the graph and the epistasis, as shown in section \@ref(misra1a) or
\@ref(weis1b).

### Epistasis with three genes and two alternative specifications {#e3}

Suppose we have 

-------------------------------------
A  B   C     Fitness 
-- --  --  --------------------------
M  wt  wt   $1 + s_a$ 

wt M   wt   $1 + s_b$ 

wt wt  M    $1 + s_c$ 

M  M   wt   $1 + s_{ab}$ 

wt M   M    $1 + s_{bc}$ 

M  wt  M    $(1 + s_a) (1 + s_c)$ 

M  M   M    $1 + s_{abc}$ 
-- --  --  -------------------------- 

where missing rows have a fitness of 1 (they have been deleted for
conciseness). Note that the mutant for exactly A and C has a fitness that
is the product of the individual terms (so there is no epistasis in that case).


```{r}
sa <- 0.1
sb <- 0.15
sc <- 0.2
sab <- 0.3
sbc <- -0.25
sabc <- 0.4

sac <- (1 + sa) * (1 + sc) - 1

E3A <- allFitnessEffects(epistasis =
                            c("A:-B:-C" = sa,
                              "-A:B:-C" = sb,
                              "-A:-B:C" = sc,
                              "A:B:-C" = sab,
                              "-A:B:C" = sbc,
                              "A:-B:C" = sac,
                              "A : B : C" = sabc)
                                                )

evalAllGenotypes(E3A, order = FALSE, addwt = FALSE)

``` 

We needed to pass the $s_{ac}$ coefficient explicitly, even if it that
term was just the product. We can try to avoid using the "-", however
(but we will need to do other calculations). For simplicity, I use capital
"S" in what follows where the letters differ from the previous
specification:


```{r}

sa <- 0.1
sb <- 0.15
sc <- 0.2
sab <- 0.3
Sab <- ( (1 + sab)/((1 + sa) * (1 + sb))) - 1
Sbc <- ( (1 + sbc)/((1 + sb) * (1 + sc))) - 1
Sabc <- ( (1 + sabc)/
          ( (1 + sa) * (1 + sb) * (1 + sc) *
            (1 + Sab) * (1 + Sbc) ) ) - 1

E3B <- allFitnessEffects(epistasis =
                             c("A" = sa,
                               "B" = sb,
                               "C" = sc,
                               "A:B" = Sab,
                               "B:C" = Sbc,
                               ## "A:C" = sac, ## not needed now
                               "A : B : C" = Sabc)
                                                )
evalAllGenotypes(E3B, order = FALSE, addwt = FALSE)

``` 

The above two are, of course, identical:

```{r}
all(evalAllGenotypes(E3A, order = FALSE, addwt = FALSE) == 
    evalAllGenotypes(E3B, order = FALSE, addwt = FALSE))
``` 

We avoid specifying the "A:C", as it just follows from the individual
"A" and "C" terms, but given a specified genotype table, we need to do
a little bit of addition and multiplication to get the coefficients. 


### Why can we specify some effects with a "-"? {#theminus}

Let's suppose we want to specify the synthetic viability example seen
before:

-----------------
A  B   Fitness
-- -- ----------
wt wt  1

wt M   0

M  wt  0

M  M   (1 + s)
-- -- --------



where "wt" denotes wild type and "M" denotes mutant.

If you want to directly map the above table to the fitness table for the
program, to specify the genotype "A is wt, B is a mutant" you can
specify it as "-A,B", not just as "B". Why? Because
just the presence of a "B" is also compatible with genotype "A is
mutant and B is mutant".  If you use "-" you are explicitly saying what
should not be there so that "-A,B" is NOT compatible with
"A, B". Otherwise, you need to carefully add coefficients.
Depending on what you are trying to model, different specifications might
be simpler. See the examples in section \@ref(e2) and \@ref(e3). You have
both options.


### Epistasis: modules {#epimod}
There is nothing conceptually new, but we will show an example here:

```{r}

sa <- 0.2
sb <- 0.3
sab <- 0.7

em <- allFitnessEffects(epistasis =
                            c("A: -B" = sa,
                              "-A:B" = sb,
                              "A : B" = sab),
                        geneToModule = c("A" = "a1, a2",
                                         "B" = "b1, b2"))
evalAllGenotypes(em, order = FALSE, addwt = TRUE)
``` 


Of course, we can do the same thing without using the "-", as in section \@ref(e2):

```{r}
s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1

em2 <- allFitnessEffects(epistasis =
                            c("A" = sa,
                              "B" = sb,
                              "A : B" = s2),
                         geneToModule = c("A" = "a1, a2",
                                         "B" = "b1, b2")
                         )
evalAllGenotypes(em2, order = FALSE, addwt = TRUE)

``` 


## I do not want epistasis, but I want modules! {#mod-no-epi}

Sometimes you might want something like having several modules, say "A"
and "B", each with a number of genes, but with "A" and "B" showing
no interaction. 

It is a terminological issue whether we should allow `noIntGenes`
(no interaction genes), as explained in section \@ref(noint) to actually be
modules. The reasoning for not allowing them is that the situation
depicted above (several genes in module A, for example) actually is one of
interaction: the members of "A" are combined using an "OR" operator
(i.e., the fitness consequences of having one or more genes of A mutated
are the same), not just simply multiplying their fitness; similarly for
"B". This is why no interaction genes also mean no modules allowed.

So how do you get what you want in this case?  Enter the names of
the modules in the `epistasis` component but have no term for ":"
(the colon). Let's see an example:


```{r}

fnme <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2),
                          geneToModule = c("A" = "a1, a2",
                                           "B" = "b1, b2, b3"))

evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)

``` 

In previous versions these was possible using the longer, still accepted
way of specifying a `:` with a value of 0, but this is no longer
needed:

```{r}
fnme <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2,
                                        "A : B" = 0.0),
                          geneToModule = c("A" = "a1, a2",
                                           "B" = "b1, b2, b3"))

evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)

``` 

This can, of course, be extended to more modules.



## Synthetic viability {#sv}

Synthetic viability and synthetic lethality
  [e.g., @Ashworth2011; @Hartman2001] are just special cases of epistasis
  (section \@ref(epi)) but we deal with them here separately.


### A simple synthetic viability example
A simple and extreme example of synthetic viability is shown in the
following table, where the joint mutant has fitness larger than the wild
type, but each single mutant is lethal.



---------------
A  B   Fitness
-- -- ----------
wt wt  1

wt M   0

M  wt  0

M  M   (1 + s)
-- -- --------


where "wt" denotes wild type and "M" denotes mutant.


We can specify this (setting $s = 0.2$) as (I play around with spaces, to
show there is a certain flexibility with them):

```{r}
s <- 0.2
sv <- allFitnessEffects(epistasis = c("-A : B" = -1,
                                      "A : -B" = -1,
                                      "A:B" = s))
``` 

Now, let's look at all the genotypes (we use "addwt" to also get the wt,
which by decree has fitness of 1), and disregard order:

```{r}
(asv <- evalAllGenotypes(sv, order = FALSE, addwt = TRUE))
``` 

Asking the program to consider the order of mutations of course makes no
difference:

```{r}
evalAllGenotypes(sv, order = TRUE, addwt = TRUE)
``` 

Another example of synthetic viability is shown in section \@ref(misra1b).

Of course, if multiple simultaneous mutations are not possible in the
simulations, it is not possible to go from the wildtype to the double
mutant in this model where the single mutants are not viable.


### Synthetic viability, non-zero fitness, and modules

This is a slightly more elaborate case, where there is one module and the
single mutants have different fitness between themselves, which is
non-zero. Without the modules, this is the same as in @Misra2014, Figure
1b, which we go over in section \@ref(misra).


---------------------
A  B       Fitness 
-- --   -------------
wt wt     1 

wt M     $1 + s_b$ 

M  wt     $1 + s_a$

M  M     $1 + s_{ab}$
-- -- ----------------


where $s_a, s_b < 0$ but $s_{ab} > 0$. 



```{r}
sa <- -0.1
sb <- -0.2
sab <- 0.25
sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb,
                             "A : -B" = sa,
                             "A:B" = sab),
                         geneToModule = c(
                             "A" = "a1, a2",
                             "B" = "b"))
evalAllGenotypes(sv2, order = FALSE, addwt = TRUE)
``` 

And if we look at order, of course it makes no difference:

```{r}
evalAllGenotypes(sv2, order = TRUE, addwt = TRUE)
``` 

<!-- %% And it looks like: -->
<!-- %% <<>>= -->
<!-- %% plot(sv2) -->
<!-- %% @  -->
<!-- %% a fairly simple plot. -->


## Synthetic mortality or synthetic lethality {#sl}

In contrast to section \@ref(sv), here the joint mutant has decreased viability:

-------------------------
A    B        Fitness 
--- ---   -------------
wt  wt                1 

wt  M         $1 + s_b$ 

M   wt        $1 + s_a$

M   M      $1 + s_{ab}$
--  --  ----------------

where $s_a, s_b > 0$ but $s_{ab} < 0$. 



```{r}
sa <- 0.1
sb <- 0.2
sab <- -0.8
sm1 <- allFitnessEffects(epistasis = c("-A : B" = sb,
                             "A : -B" = sa,
                             "A:B" = sab))
evalAllGenotypes(sm1, order = FALSE, addwt = TRUE)

``` 

And if we look at order, of course it makes no difference:

```{r}
evalAllGenotypes(sm1, order = TRUE, addwt = TRUE)
``` 

## Possible issues with Bozic model {#boznumissues}

### Synthetic viability using Bozic model {#fit-neg-pos}

If we were to use the above specification with Bozic's models, we might
not get what we think we should get:

```{r}
evalAllGenotypes(sv, order = FALSE, addwt = TRUE, model = "Bozic")
```

What gives here? The simulation code would alert you of this (see section
\@ref(ex-0-death)) in this particular case because there are "-1",
which might indicate that this is not what you want. The problem is that
you probably want the Death rate to be infinity (the birth rate was 0, so
no clone viability, when we used birth rates ---section \@ref(noviab)).

Let us say so explicitly:

```{r}
s <- 0.2
svB <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
                                      "A : -B" = -Inf,
                                      "A:B" = s))
evalAllGenotypes(svB, order = FALSE, addwt = TRUE, model = "Bozic")
```


Likewise, values of $s$ larger than one have no effect beyond setting $s =
1$ (a single term of $(1 - 1)$ will drive the product to 0, and as we
cannot allow negative death rates negative values are set to 0):


```{r}

s <- 1
svB1 <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
                                       "A : -B" = -Inf,
                                       "A:B" = s))

evalAllGenotypes(svB1, order = FALSE, addwt = TRUE, model = "Bozic")


s <- 3
svB3 <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
                                       "A : -B" = -Inf,
                                       "A:B" = s))

evalAllGenotypes(svB3, order = FALSE, addwt = TRUE, model = "Bozic")


```

Of course, death rates of 0.0 are likely to lead to trouble down the road,
when we actually conduct simulations (see section \@ref(ex-0-death)).


### Numerical issues with death rates of 0 in Bozic model {#ex-0-death}

As we mentioned above (section \@ref(fit-neg-pos)) death rates of 0 can
lead to trouble when using Bozic's model:


```{r}
i1 <- allFitnessEffects(noIntGenes = c(1, 0.5))
evalAllGenotypes(i1, order = FALSE, addwt = TRUE, 
                 model = "Bozic")
				 
i1_b <- oncoSimulIndiv(i1, model = "Bozic")

``` 


Of course, there is no problem in using the above with other models:

```{r}
evalAllGenotypes(i1, order = FALSE, addwt = TRUE, 
                 model = "Exp")
i1_e <- oncoSimulIndiv(i1, model = "Exp")
summary(i1_e)
``` 



## A longer example: Poset, epistasis, synthetic mortality and viability, order effects and genes without interactions, with some modules {#exlong}

We will now put together a complex example. We will use the poset from
section \@ref(pm3) but will also add:

* Order effects that involve genes in the poset. In this case, if C
  happens before F, fitness decreases by $1 - 0.1$. If it happens the
  other way around, there is no effect on fitness beyond their individual
  contributions.
  <!-- but if it happens the   other way around it increases by $1 + 0.13$. -->
* Order effects that involve two new modules, "H" and "I" (with
  genes "h1, h2" and "i1", respectively), so that if H happens before
  I fitness increases by $1 + 0.12$.
* Synthetic mortality between modules "I" (already present in the
  epistatic interaction) and "J" (with genes "j1" and "j2"): the
  joint presence of these modules leads to cell death (fitness of 0).
* Synthetic viability between modules "K" and "M" (with genes
  "k1", "k2" and "m1", respectively), so that their joint presence
  is viable but adds nothing to fitness (i.e., mutation of both has
  fitness $1$), whereas each single mutant has a fitness of $1 - 0.5$.
* A set of 5 driver genes ($n1, \ldots, n5$) with fitness that comes
  from an exponential distribution with rate of 10.




As we are specifying many different things, we will start by writing each
set of effects separately:

```{r}
p4 <- data.frame(
    parent = c(rep("Root", 4), "A", "B", "D", "E", "C", "F"),
    child = c("A", "B", "D", "E", "C", "C", "F", "F", "G", "G"),
    s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3),
    sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)),
    typeDep = c(rep("--", 4), 
                "XMPN", "XMPN", "MN", "MN", "SM", "SM"))

oe <- c("C > F" = -0.1, "H > I" = 0.12)
sm <- c("I:J"  = -1)
sv <- c("-K:M" = -.5, "K:-M" = -.5)
epist <- c(sm, sv)

modules <- c("Root" = "Root", "A" = "a1",
             "B" = "b1, b2", "C" = "c1",
             "D" = "d1, d2", "E" = "e1",
             "F" = "f1, f2", "G" = "g1",
             "H" = "h1, h2", "I" = "i1",
             "J" = "j1, j2", "K" = "k1, k2", "M" = "m1")

set.seed(1) ## for reproducibility
noint <- rexp(5, 10)
names(noint) <- paste0("n", 1:5)

fea <- allFitnessEffects(rT = p4, epistasis = epist,
                         orderEffects = oe,
                         noIntGenes = noint,
                         geneToModule = modules)
``` 


How does it look?

```{r, fig.height=5.5}
plot(fea)
``` 

or

```{r, fig.height=5.5}
plot(fea, "igraph")
``` 


We can, if we want, expand the modules using a "graphNEL" graph
```{r, fig.height=5.5}
plot(fea, expandModules = TRUE)
``` 

or an "igraph" one
```{r, fig.height=6.5}
plot(fea, "igraph", expandModules = TRUE)
``` 


We will not evaluate the fitness of all genotypes, since the number of all
ordered genotypes is $> 7*10^{22}$. We will look at some specific genotypes:


```{r}

evalGenotype("k1 > i1 > h2", fea) ## 0.5
evalGenotype("k1 > h1 > i1", fea) ## 0.5 * 1.12

evalGenotype("k2 > m1 > h1 > i1", fea) ## 1.12

evalGenotype("k2 > m1 > h1 > i1 > c1 > n3 > f2", fea) 
## 1.12 * 0.1 * (1 + noint[3]) * 0.05 * 0.9

``` 

Finally, let's generate some ordered genotypes randomly:

```{r}

randomGenotype <- function(fe, ns = NULL) {
    gn <- setdiff(c(fe$geneModule$Gene,
                    fe$long.geneNoInt$Gene), "Root")
    if(is.null(ns)) ns <- sample(length(gn), 1)
    return(paste(sample(gn, ns), collapse = " > "))
}

set.seed(2) ## for reproducibility

evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  k2 > i1 > c1 > n1 > m1
##  Individual s terms are : 0.0755182 -0.9
##  Fitness:  0.107552 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  n2 > h1 > h2
##  Individual s terms are : 0.118164
##  Fitness:  1.11816 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  d2 > k2 > c1 > f2 > n4 > m1 > n3 > f1 > b1 > g1 > n5 > h1 > j2
##  Individual s terms are : 0.0145707 0.0139795 0.0436069 0.02 0.1 0.03 -0.95 0.3 -0.1
##  Fitness:  0.0725829 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  h2 > c1 > f1 > n2 > b2 > a1 > n1 > i1
##  Individual s terms are : 0.0755182 0.118164 0.01 0.02 -0.9 -0.95 -0.1 0.12
##  Fitness:  0.00624418 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  h2 > j1 > m1 > d2 > i1 > b2 > k2 > d1 > b1 > n3 > n1 > g1 > h1 > c1 > k1 > e1 > a1 > f1 > n5 > f2
##  Individual s terms are : 0.0755182 0.0145707 0.0436069 0.01 0.02 -0.9 0.03 0.04 0.2 0.3 -1 -0.1 0.12
##  Fitness:  0 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  n1 > m1 > n3 > i1 > j1 > n5 > k1
##  Individual s terms are : 0.0755182 0.0145707 0.0436069 -1
##  Fitness:  0 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  d2 > n1 > g1 > f1 > f2 > c1 > b1 > d1 > k1 > a1 > b2 > i1 > n4 > h2 > n2
##  Individual s terms are : 0.0755182 0.118164 0.0139795 0.01 0.02 -0.9 0.03 -0.95 0.3 -0.5
##  Fitness:  0.00420528 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  j1 > f1 > j2 > a1 > n4 > c1 > n3 > k1 > d1 > h1
##  Individual s terms are : 0.0145707 0.0139795 0.01 0.1 0.03 -0.95 -0.5
##  Fitness:  0.0294308 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  n5 > f2 > f1 > h2 > n4 > c1 > n3 > b1
##  Individual s terms are : 0.0145707 0.0139795 0.0436069 0.02 0.1 -0.95
##  Fitness:  0.0602298 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  h1 > d1 > f2
##  Individual s terms are : 0.03 -0.95
##  Fitness:  0.0515 


``` 

## Homozygosity, heterozygosity, oncogenes, tumor suppressors {#oncog}

We are using what is conceptually a single linear chromosome. However, you
can use it to model scenarios where the numbers of copies affected matter,
by properly duplicating the genes. 

Suppose we have a tumor suppressor gene, G, with two copies, one from Mom
and one from Dad. We can have a table like:

----------------------------
$O_M$  $O_D$     Fitness 
-----  -----  -------------- 
wt      wt       1 

wt      M        1

M       wt       1

M       M        $(1 + s)$
----   -----  ---------------


where $s > 0$, meaning that you need two hits, one in each copy, to
trigger the clonal expansion.


What about oncogenes? A simple model is that one single hit leads to
clonal expansion and additional hits lead to no additional changes, as in
this table for gene O, where again the M or D subscript denotes the copy
from Mom or from Dad:

----------------------------
$O_M$  $O_D$     Fitness 
-----  -----  -------------- 
wt      wt        1 

wt      M        $(1 + s)$

M       wt       $(1 + s)$

M       M        $(1 + s)$
----  -----   --------------


If you have multiple copies you can proceed similarly. As you can see,
these are nothing but special cases of synthetic mortality (\@ref(sl)),
synthetic viability (\@ref(sv)) and epistasis (\@ref(epi)).


## Gene-specific mutation rates {#per-gene-mutation}

You can specify gene-specific mutation rates. Instead of passing a scalar
value for `mu`, you pass a named vector. (This does not work with
the old v. 1 format, though; yet another reason to stop using that
format). This is a simple example (many more are available in the tests,
see file `./tests/testthat/test.per-gene-mutation-rates.R`).

```{r}

muvar2 <- c("U" = 1e-6, "z" = 5e-5, "e" = 5e-4, "m" = 5e-3,
            "D" = 1e-4)
ni1 <- rep(0, 5)
names(ni1) <- names(muvar2) ## We use the same names, of course
fe1 <- allFitnessEffects(noIntGenes = ni1)
bb <- oncoSimulIndiv(fe1, 
                     mu = muvar2, onlyCancer = FALSE,
                     initSize = 1e5,
                     finalTime = 25,
                     seed =NULL)

``` 

## Mutator genes {#mutator}

You can specify mutator/antimutator genes
  [e.g. @gerrish_complete_2007; @tomlinson_mutation_1996]. These are genes
  that, when mutated, lead to an increase/decrease in the mutation rate
  all over the genome (similar to what happens with, say, mutations in
  mismatch-repair genes or microsatellite instability in cancer).

The specification is very similar to that for fitness effects,
except we do not (at least for now) allow the use of DAGs nor of
order effects (we have seen no reference in the literature to
suggest any of these would be relevant). You can, however, specify
epistasis and use modules. Note that the mutator genes must be a
subset of the genes in the fitness effects; if you want to have
mutator genes that have no direct fitness effects, give them a
fitness effect of 0.


This first is a very simple example with simple fitness effects and
modules for mutators. We will specify the fitness and mutator effects and
evaluate the fitness and mutator effects:

```{r}
fe2 <- allFitnessEffects(noIntGenes =
                         c(a1 = 0.1, a2 = 0.2,
                           b1 = 0.01, b2 = 0.3, b3 = 0.2,
                           c1 = 0.3, c2 = -0.2))

fm2 <- allMutatorEffects(epistasis = c("A" = 5,
                                       "B" = 10,
                                       "C" = 3),
                         geneToModule = c("A" = "a1, a2",
                                          "B" = "b1, b2, b3",
                                          "C" = "c1, c2"))

## Show the fitness effect of a specific genotype
evalGenotype("a1, c2", fe2, verbose = TRUE)

## Show the mutator effect of a specific genotype
evalGenotypeMut("a1, c2", fm2, verbose = TRUE)

## Fitness and mutator of a specific genotype
evalGenotypeFitAndMut("a1, c2", fe2, fm2, verbose = TRUE)
``` 

You can also use the `evalAll` functions. We do not show the output
here to avoid cluttering the vignette:

```{r, eval=FALSE}
## Show only all the fitness effects
evalAllGenotypes(fe2, order = FALSE)

## Show only all mutator effects
evalAllGenotypesMut(fm2)

## Show all fitness and mutator
evalAllGenotypesFitAndMut(fe2, fm2, order = FALSE)
``` 


Building upon the above, the next is an example where we have a bunch of
no interaction genes that affect fitness, and a small set of genes that
affect the mutation rate (but have no fitness effects).


```{r}

set.seed(1) ## for reproducibility
## 17 genes, 7 with no direct fitness effects
ni <- c(rep(0, 7), runif(10, min = -0.01, max = 0.1))
names(ni) <- c("a1", "a2", "b1", "b2", "b3", "c1", "c2",
               paste0("g", 1:10))

fe3 <- allFitnessEffects(noIntGenes = ni)

fm3 <- allMutatorEffects(epistasis = c("A" = 5,
                                       "B" = 10,
                                       "C" = 3,
                                       "A:C" = 70),
                         geneToModule = c("A" = "a1, a2",
                                          "B" = "b1, b2, b3",
                                          "C" = "c1, c2"))
``` 

Let us check what the effects are of a few genotypes:

```{r}
## These only affect mutation, not fitness
evalGenotypeFitAndMut("a1, a2", fe3, fm3, verbose = TRUE)
evalGenotypeFitAndMut("a1, b3", fe3, fm3, verbose = TRUE)

## These only affect fitness: the mutator multiplier is 1
evalGenotypeFitAndMut("g1", fe3, fm3, verbose = TRUE)                      
evalGenotypeFitAndMut("g3, g9", fe3, fm3, verbose = TRUE)

## These affect both
evalGenotypeFitAndMut("g3, g9, a2, b3", fe3, fm3, verbose = TRUE)
``` 


Finally, we will do a simulation with those data

```{r}
set.seed(1) ## so that it is easy to reproduce
mue1 <- oncoSimulIndiv(fe3, muEF = fm3, 
                       mu = 1e-6,
                       initSize = 1e5,
                       model = "McFL",
                       detectionSize = 5e6,
                       finalTime = 500,
                       onlyCancer = FALSE)
``` 

```{r, eval=FALSE}
## We do not show this in the vignette to avoid cluttering it
## with output
mue1
``` 


Of course, it is up to you to keep things reasonable: mutator
effects are multiplicative, so if you specify, say, 20 genes
(without modules), or 20 modules, each with a mutator effect of 50,
the overall mutation rate can be increased by a factor of $50^{20}$
and that is unlikely to be what you really want (see also section
\@ref(tomlinexcept)).

You can play with the following case (an extension of the example
above), where a clone with a mutator phenotype and some fitness
enhancing mutations starts giving rise to many other clones, some
with additional mutator effects, and thus leading to the number of
clones blowing up (as some also accumulate additional
fitness-enhancing mutations). Things start getting out of hand
shortly after time 250. The code below takes a few minutes to run
and is not executed here, but you can run it to get an idea of the
increase in the number of clones and their relationships (the usage
of `plotClonePhylog` is explained in section \@ref(phylog)).

```{r, eval=FALSE}

set.seed(1) ## for reproducibility
## 17 genes, 7 with no direct fitness effects
ni <- c(rep(0, 7), runif(10, min = -0.01, max = 0.1))
names(ni) <- c("a1", "a2", "b1", "b2", "b3", "c1", "c2",
               paste0("g", 1:10))

## Next is for nicer figure labeling.
## Consider as drivers genes with s >0
gp <- which(ni > 0) 

fe3 <- allFitnessEffects(noIntGenes = ni,
                         drvNames = names(ni)[gp])


set.seed(12)
mue1 <- oncoSimulIndiv(fe3, muEF = fm3, 
                       mu = 1e-6,
                       initSize = 1e5,
                       model = "McFL",
                       detectionSize = 5e6,
                       finalTime = 270,
                       keepPhylog = TRUE,
                       onlyCancer = FALSE)
mue1
## If you decrease N even further it gets even more cluttered
op <- par(ask = TRUE)
plotClonePhylog(mue1, N = 10, timeEvents = TRUE) 
plot(mue1, plotDrivers = TRUE, addtot = TRUE,
     plotDiversity = TRUE)
	 
## The stacked plot is slow; be patient
## Most clones have tiny population sizes, and their lines
## are piled on top of each other
plot(mue1, addtot = TRUE,
     plotDiversity = TRUE, type = "stacked")
par(op)
``` 

\clearpage

# Plotting fitness landscapes {#plot-fit-land}


The `evalAllGenotypes` and related functions allow you to obtain
tables of the genotype to fitness mappings. It might be more convenient to
actually plot that, allowing us to quickly identify local minima and
maxima and get an idea of how the fitness landscape looks.

In `plotFitnessLandscape` I have blatantly and shamelessly copied most
of the looks of the plots of MAGELLAN [@brouillet_magellan:_2015] (see
also <http://wwwabi.snv.jussieu.fr/public/Magellan/>), a very nice
web-based tool for fitness landscape plotting and analysis (MAGELLAN
provides some other extra functionality and epistasis statistics not
provided here).

As an example, let us show the example of Weissman et al. we saw in
\@ref(weissmanex):

```{r}

d1 <- -0.05 ## single mutant fitness 0.95
d2 <- -0.08 ## double mutant fitness 0.92
d3 <- 0.2   ## triple mutant fitness 1.2
s2 <- ((1 + d2)/(1 + d1)^2) - 1
s3 <- ( (1 + d3)/((1 + d1)^3 * (1 + s2)^3) ) - 1

wb <- allFitnessEffects(
    epistasis = c(
        "A" = d1,
        "B" = d1,
        "C" = d1,
        "A:B" = s2,
        "A:C" = s2,
        "B:C" = s2,
        "A:B:C" = s3))
``` 

<!-- \clearpage -->

```{r, fig.width=6.5, fig.height=5}
plotFitnessLandscape(wb, use_ggrepel = TRUE) 
``` 

We have set `use_ggrepel = TRUE` to avoid overlap of labels.

<!-- \clearpage -->

\clearpage

For some types of objects, directly invoking `plot` will give
you the fitness landscape plot:

```{r, fig.width=6.5, fig.height=5}
(ewb <- evalAllGenotypes(wb, order = FALSE))
plot(ewb, use_ggrepel = TRUE) 

``` 

\clearpage

This is example (section \@ref(pancreas)) will give a very busy plot:

```{r wasthis111, fig.width=9.5, fig.height=9.5}
par(cex = 0.7)
pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDNK2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDNK2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"))
plot(evalAllGenotypes(pancr, order = FALSE), use_ggrepel = TRUE)

``` 


<!-- \clearpage -->

\clearpage

# Specifying fitness effects: some examples from the literature {#litex}

## Bauer et al., 2014 {#bauer}

In the model of Bauer and collaborators [@Bauer2014, pp. 54] we have
"For cells without the primary driver mutation, each secondary
driver mutation leads to a change in the cell's fitness by
$s_P$. For cells with the primary driver mutation, the fitness
advantage obtained with each secondary driver mutation is $s_{DP}$."

The proliferation probability is given as:


* $\frac{1}{2}(1 + s_p)^k$ when there are $k$ secondary drivers mutated and no primary diver;
* $\frac{1}{2}\frac{1+S_D^+}{1+S_D^-} (1 + S_{DP})^k$ when the primary driver is mutated;


apoptosis is one minus the proliferation rate.

### Using a DAG

We cannot find a simple mapping from their expressions to our fitness
parameterization, but we can get fairly close by using a DAG; in this one,
note the unusual feature of having one of the "s" terms (that for the
driver dependency on root) be negative. Using the parameters given in the
legend of their Figure 3 for $s_p, S_D^+, S_D^-, S_{DP}$ and obtaining
that negative value for the dependency of the driver on root we can do:


```{r}
K <- 4
sp <- 1e-5
sdp <- 0.015
sdplus <- 0.05
sdminus <- 0.1
cnt <- (1 + sdplus)/(1 + sdminus)
prod_cnt <- cnt - 1
bauer <- data.frame(parent = c("Root", rep("D", K)),
                    child = c("D", paste0("s", 1:K)),
                    s = c(prod_cnt, rep(sdp, K)),
                    sh = c(0, rep(sp, K)),
                    typeDep = "MN")
fbauer <- allFitnessEffects(bauer)
(b1 <- evalAllGenotypes(fbauer, order = FALSE, addwt = TRUE))

``` 

(We use "D" for "driver" or "primary driver", as is it is called in
the original paper, and "s" for secondary drivers, somewhat similar
to passengers).

Note that what we specify as "typeDep" is irrelevant (MN, SMN, or XMPN
make no difference).

This is the DAG:
```{r, fig.height=3}
plot(fbauer)
``` 


And if you compare the tabular output of `evalAllGenotypes` you can
see that the values of fitness reproduces the fitness landscape that they
show in their Figure 1. We can also use our plot for fitness landscapes:

```{r, fig.width=6, fig.height=6}
plot(b1, use_ggrepel = TRUE)
``` 


### Specifying fitness of genotypes directly

An alternative approach to specify the fitness, if the number of
genotypes is reasonably small, is to directly evaluate fitness as
given by their expressions. Then, use the `genotFitness` argument to
`allFitnessEffects`. 

We will create all possible genotypes; then we will write a function
that gives the fitness of each genotype according to their
expression; finally, we will call this function on the data frame of
genotypes, and pass this data frame to `allFitnessEffects`.


```{r}
m1 <- expand.grid(D = c(1, 0), s1 = c(1, 0), s2 = c(1, 0),
                  s3 = c(1, 0), s4 = c(1, 0))

fitness_bauer <- function(D, s1, s2, s3, s4, 
                          sp = 1e-5, sdp = 0.015, sdplus = 0.05,
                          sdminus = 0.1) {
    if(!D) {
        b <- 0.5 * ( (1 + sp)^(sum(c(s1, s2, s3, s4))))
    } else {
        b <- 0.5 * 
            (((1 + sdplus)/(1 + sdminus)  *
              (1 + sdp)^(sum(c(s1, s2, s3, s4)))))
    }
    fitness <- b - (1 - b)
    our_fitness <- 1 + fitness ## prevent negative fitness and
    ## make wt fitness = 1
    return(our_fitness)
}

m1$Fitness <- 
    apply(m1, 1, function(x) do.call(fitness_bauer, as.list(x)))

bauer2 <- allFitnessEffects(genotFitness = m1)
``` 

Now, show the fitness of all genotypes:

```{r}
evalAllGenotypes(bauer2, order = FALSE, addwt = TRUE)
``` 



<!-- % If the primary driver is -->
<!-- % mutated, then the expression is $\frac{1+S_D^+}{1+S_D^-} (1 + S_{DP})^k$. -->
<!-- % They set apoptosis as $1 - proliferation$.  So, ignoring constants such as -->
<!-- % $1/2$, and setting $ P = \frac{1+S_D^+}{1+S_D^-}$ we can prepare a table -->
<!-- % as (for a largest $k$ of 5 in this example, but can make it arbitrarily -->
<!-- % large): -->

<!-- % <<>>= -->

<!-- % K <- 5 -->
<!-- % sd <- 0.1 -->
<!-- % sdp <- 0.15 -->
<!-- % sp <- 0.05 -->
<!-- % bauer <- data.frame(parent = c("Root", rep("p", K)), -->
<!-- %                     child = c("p", paste0("s", 1:K)), -->
<!-- %                     s = c(sd, rep(sdp, K)), -->
<!-- %                     sh = c(0, rep(sp, K)), -->
<!-- %                     typeDep = "MN") -->
<!-- % fbauer <- allFitnessEffects(bauer) -->


<!-- % @  -->




<!-- % <<>>= -->
<!-- % (b1 <- evalAllGenotypes(fbauer, order = FALSE))[1:10, ] -->
<!-- % @  -->

<!-- % Order makes no difference -->

<!-- % <<>>= -->
<!-- % (b2 <- evalAllGenotypes(fbauer, order = TRUE, max = 2000))[1:15, ] -->
<!-- % @  -->

<!-- % And the number of levels is the right one: 11 -->
<!-- % <<>>= -->
<!-- % length(table(b1$Fitness)) -->
<!-- % length(table(b2$Fitness)) -->
<!-- % @  -->


<!-- %% ### Bauer et al.\ specified only via epistatic interactions} -->
<!-- %% Yes, do it: as -p,s1, and -p,s2, etc. But much more of a mess. -->

<!-- %% ### Adding modules to Bauer et al.} -->

Can we use modules in this example, if we use the "lego system"? Sure,
as in any other case.

<!-- \clearpage -->

## Misra et al., 2014 {#misra}

Figure 1 of @Misra2014 presents three scenarios which
are different types of epistasis. 

<!-- %% (I show the fitness scenarios without -->
<!-- %% axis, to replicate as close as possible what they show in their -->
<!-- paper) -->


### Example 1.a {#misra1a}
```{r, echo=FALSE, fig.height=4, fig.width=4}

df1 <- data.frame(x = c(1, 1.2, 1.4), f = c(1, 1.2, 1.2),
                 names = c("wt", "A", "B"))
plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab= "", 
     ylab = "Fitness", xaxt = "n", yaxt = "n", ylim = c(1, 1.21))
segments(1, 1, 1.2, 1.2)
segments(1, 1, 1.4, 1.2)
text(1, 1, "wt", pos = 4)
text(1.2, 1.2, "A", pos = 2)
text(1.4, 1.2, "B", pos = 2)
## axis(1,  tick = FALSE, labels = FALSE)
## axis(2,  tick = FALSE, labels = FALSE)
``` 


In that figure it is evident that the fitness effect of "A" and "B"
are the same. There are two different models depending on whether "AB"
is just the product of both, or there is epistasis. In the first case
probably the simplest is:

```{r}
s <- 0.1 ## or whatever number
m1a1 <- allFitnessEffects(data.frame(parent = c("Root", "Root"),
                                     child = c("A", "B"),
                                     s = s,
                                     sh = 0,
                                     typeDep = "MN"))
evalAllGenotypes(m1a1, order = FALSE, addwt = TRUE)
``` 


If the double mutant shows epistasis, as we saw before (section \@ref(e2))
we have a range of options. For example:

```{r}
s <- 0.1
sab <- 0.3
m1a2 <- allFitnessEffects(epistasis = c("A:-B" = s,
                                        "-A:B" = s,
                                        "A:B" = sab))
evalAllGenotypes(m1a2, order = FALSE, addwt = TRUE)
``` 

But we could also modify the graph dependency structure, and we have to
change the value of the coefficient, since that is what multiplies each of
the terms for "A" and "B": $(1 + s_{AB}) = (1 + s)^2(1 + s_{AB3})$


```{r}
sab3 <- ((1 + sab)/((1 + s)^2)) - 1
m1a3 <- allFitnessEffects(data.frame(parent = c("Root", "Root"),
                                     child = c("A", "B"),
                                     s = s,
                                     sh = 0,
                                     typeDep = "MN"),
                          epistasis = c("A:B" = sab3))
evalAllGenotypes(m1a3, order = FALSE, addwt = TRUE)
``` 

And, obviously
```{r}
all.equal(evalAllGenotypes(m1a2, order = FALSE, addwt = TRUE),
          evalAllGenotypes(m1a3, order = FALSE, addwt = TRUE))
``` 


### Example 1.b {#misra1b}

This is a specific case of synthetic viability (see also section \@ref(sv)):

```{r, echo=FALSE, fig.width=4, fig.height=4}

df1 <- data.frame(x = c(1, 1.2, 1.2, 1.4), f = c(1, 0.4, 0.3, 1.3),
                 names = c("wt", "A", "B", "AB"))
plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab= "", ylab = "Fitness",
     xaxt = "n", yaxt = "n", ylim = c(0.29, 1.32))
segments(1, 1, 1.2, 0.4)
segments(1, 1, 1.2, 0.3)
segments(1.2, 0.4, 1.4, 1.3)
segments(1.2, 0.3, 1.4, 1.3)
text(x = df1[, 1], y = df1[, 2], labels = df1[, "names"], pos = c(4, 2, 2, 2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
``` 


Here, $S_A, S_B < 0$, $S_B < 0$, $S_{AB} > 0$ and $(1 + S_{AB}) (1 + S_A) (1 +
S_B) > 1$.

As before, we can specify this in several different ways. The simplest is
to specify all genotypes:
```{r}
sa <- -0.6
sb <- -0.7
sab <- 0.3
m1b1 <- allFitnessEffects(epistasis = c("A:-B" = sa,
                                        "-A:B" = sb,
                                        "A:B" = sab))
evalAllGenotypes(m1b1, order = FALSE, addwt = TRUE)
``` 

We could also use a tree and modify the "sab" for the epistasis, as
before (\@ref(misra1a)).



### Example 1.c {#misra1c}

The final case, in figure 1.c of Misra et al., is just epistasis, where a
mutation in one of the genes is deleterious (possibly only mildly), in the
other is beneficial, and the double mutation has fitness larger than any
of the other two.


```{r, echo=FALSE, fig.width=4, fig.height=4}

df1 <- data.frame(x = c(1, 1.2, 1.2, 1.4), f = c(1, 1.2, 0.7, 1.5),
                 names = c("wt", "A", "B", "AB"))
plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab = "", ylab = "Fitness",
     xaxt = "n", yaxt = "n", ylim = c(0.69, 1.53))
segments(1, 1, 1.2, 1.2)
segments(1, 1, 1.2, 0.7)
segments(1.2, 1.2, 1.4, 1.5)
segments(1.2, 0.7, 1.4, 1.5)
text(x = df1[, 1], y = df1[, 2], labels = df1[, "names"], pos = c(3, 3, 3, 2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)

``` 

Here we have that $s_A > 0$, $s_B < 0$, $(1 + s_{AB}) (1 + s_A) (1 +
s_B) > (1 + s_{AB})$ so $s_{AB} > \frac{-s_B}{1 + s_B}$


As before, we can specify this in several different ways. The simplest is
to specify all genotypes:
```{r}
sa <- 0.2
sb <- -0.3
sab <- 0.5
m1c1 <- allFitnessEffects(epistasis = c("A:-B" = sa,
                                        "-A:B" = sb,
                                        "A:B" = sab))
evalAllGenotypes(m1c1, order = FALSE, addwt = TRUE)
``` 

We could also use a tree and modify the "sab" for the epistasis, as
before (\@ref(misra1a)).


<!-- \clearpage -->
## Ochs and Desai, 2015 {#ochsdesai}

In @Ochs2015 the authors present a model shown graphically as (the
actual numerical values are arbitrarily set by me):


```{r, echo=FALSE, fig.width=4.5, fig.height=3.5}

df1 <- data.frame(x = c(1, 2, 3, 4), f = c(1.1, 1, 0.95, 1.2),
                 names = c("u", "wt", "i", "v"))
plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "")
par(las = 1)
axis(2)
axis(1, at = c(1, 2, 3, 4), labels = df1[, "names"], ylab = "")
box()
arrows(c(2, 2, 3), c(1, 1, 0.95),
       c(1, 3, 4), c(1.1, 0.95, 1.2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
```

In their model, $s_u > 0$, $s_v > s_u$, $s_i < 0$, we can only arrive at
$v$ from $i$, and the mutants "ui" and "uv" can never appear as their
fitness is 0, or $-\infty$, so $s_{ui} = s_{uv} = -1$ (or $-\infty$).

We can specify this combining a graph and epistasis specifications:

```{r}
su <- 0.1
si <- -0.05
fvi <- 1.2 ## the fitness of the vi mutant
sv <- (fvi/(1 + si)) - 1
sui <- suv <- -1
od <- allFitnessEffects(
    data.frame(parent = c("Root", "Root", "i"),
               child = c("u", "i", "v"),
               s = c(su, si, sv),
               sh = -1,
               typeDep = "MN"),
    epistasis = c(
        "u:i" = sui,
        "u:v" = suv))
``` 

A figure showing that model is
```{r, fig.width=3, fig.height=3}
plot(od)
``` 

And the fitness of all genotype is
```{r}
evalAllGenotypes(od, order = FALSE, addwt = TRUE)
``` 


We could alternatively have specified fitness either directly
specifying the fitness of each genotype or specifying epistatic
effects. Let us use the second approach:

%% this was wrong
%% u <- 0.2; i <- -0.02; vi <- 0.6; ui <- uv <- -Inf

```{r}
u <- 0.1; i <- -0.05; vi <- (1.2/0.95) - 1; ui <- uv <- -Inf
od2 <- allFitnessEffects(
    epistasis = c("u" = u,  "u:i" = ui,
                  "u:v" = uv, "i" = i,
                  "v:-i" = -Inf, "v:i" = vi))
evalAllGenotypes(od2, addwt = TRUE)

```

We will return to this model when we explain the usage of `fixation`
for stopping the simulations (see \@ref(fixation) and \@ref(fixationG)).




## Weissman et al., 2009 {#weissmanex}
In their figure 1a, @Weissman2009 present this model
(actual numeric values are set arbitrarily)

### Figure 1.a {#weiss1a}

```{r, echo=FALSE, fig.width=4, fig.height=3}

df1 <- data.frame(x = c(1, 2, 3), f = c(1, 0.95, 1.2),
                 names = c("wt", "1", "2"))
plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "")
par(las = 1)
axis(2)
axis(1, at = c(1, 2, 3), labels = df1[, "names"], ylab = "")
box()
segments(c(1, 2), c(1, 0.95),
       c(2, 3), c(0.95, 1.2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)


```

where the "1" and "2" in the figure refer to the total number of mutations
in two different loci. This is, therefore, very similar to the example in
section \@ref(misra1b). Here we have, in their notation, $\delta_1 < 0$,
fitness of single "A" or single "B" = $1 + \delta_1$, $S_{AB} > 0$, $(1 +
S_{AB})(1 + \delta_1)^2 > 1$.


### Figure 1.b {#weis1b}

In their figure 1b they show

```{r, echo=FALSE, fig.width=4, fig.height=3}

df1 <- data.frame(x = c(1, 2, 3, 4), f = c(1, 0.95, 0.92, 1.2),
                 names = c("wt", "1", "2", "3"))
plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "")
par(las = 1)
axis(2)
axis(1, at = c(1, 2, 3, 4), labels = df1[, "names"], ylab = "")
box()
segments(c(1, 2, 3), c(1, 0.95, 0.92),
       c(2, 3, 4), c(0.95, 0.92, 1.2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
```

Where, as before, 1, 2, 3, denote the total number of mutations over three
different loci and $\delta_1 < 0$, $\delta_2 < 0$, fitness of single
mutant is $(1 + \delta_1)$, of double mutant is $(1 + \delta_2)$ so that
$(1 + \delta_2) = (1 + \delta_1)^2 (1 + s_2)$ and of triple mutant is
$(1 + \delta_3)$, so that
$(1 + \delta_3) = (1 + \delta_1)^3 (1 + s_2)^3 (1 + s_3)$.


We can specify this combining a graph with epistasis:

```{r}

d1 <- -0.05 ## single mutant fitness 0.95
d2 <- -0.08 ## double mutant fitness 0.92
d3 <- 0.2   ## triple mutant fitness 1.2

s2 <- ((1 + d2)/(1 + d1)^2) - 1
s3 <- ( (1 + d3)/((1 + d1)^3 * (1 + s2)^3) ) - 1

w <- allFitnessEffects(
    data.frame(parent = c("Root", "Root", "Root"),
               child = c("A", "B", "C"),
               s = d1,
               sh = -1,
               typeDep = "MN"),
    epistasis = c(
        "A:B" = s2,
        "A:C" = s2,
        "B:C" = s2,
        "A:B:C" = s3))
``` 

The model can be shown graphically as:
```{r, fig.width=4, fig.height=4}
plot(w)
``` 

And fitness of all genotypes is:

```{r}
evalAllGenotypes(w, order = FALSE, addwt = TRUE)
``` 



Alternatively, we can directly specify what each genotype adds to the
fitness, given the included genotype. This is basically replacing the
graph by giving each of "A", "B", and "C" directly:

```{r}
wb <- allFitnessEffects(
    epistasis = c(
        "A" = d1,
        "B" = d1,
        "C" = d1,
        "A:B" = s2,
        "A:C" = s2,
        "B:C" = s2,
        "A:B:C" = s3))

evalAllGenotypes(wb, order = FALSE, addwt = TRUE)
``` 

The plot, of course, is not very revealing and we cannot show that there
is a three-way interaction (only all three two-way interactions):

```{r, , fig.width=3, fig.height=3}
plot(wb)
``` 

As we have seen several times already (sections \@ref(e2), \@ref(e3),
\@ref(theminus)) we can also give the genotypes directly and, consequently,
the fitness of each genotype (not the added contribution):

```{r}
wc <- allFitnessEffects(
    epistasis = c(
        "A:-B:-C" = d1,
        "B:-C:-A" = d1,
        "C:-A:-B" = d1,
        "A:B:-C" = d2,
        "A:C:-B" = d2,
        "B:C:-A" = d2,
        "A:B:C" = d3))
evalAllGenotypes(wc, order = FALSE, addwt = TRUE)
``` 


## Gerstung et al., 2011, pancreatic cancer poset {#pancreas}

Similar to what we did in v.1 (see section \@ref(poset)) we can specify the
pancreatic cancer poset in @Gerstung2011 (their
figure 2B, left). We use directly the names of the genes, since that is
immediately supported by the new version.

```{r, fig.width=4}

pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDNK2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDNK2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"))

plot(pancr)
``` 

Of course the "s" and "sh" are set arbitrarily here.



## Raphael and Vandin's 2014 modules {#raphael-ex}

In @Raphael2014a, the authors show several progression models
in terms of modules. We can code the extended poset for the colorectal
cancer model in their Figure 4.a is (s and sh are arbitrary):


```{r, fig.height = 4}
rv1 <- allFitnessEffects(data.frame(parent = c("Root", "A", "KRAS"),
                                    child = c("A", "KRAS", "FBXW7"),
                                    s = 0.1,
                                    sh = -0.01,
                                    typeDep = "MN"),
                         geneToModule = c("Root" = "Root",
                             "A" = "EVC2, PIK3CA, TP53",
                             "KRAS" = "KRAS",
                             "FBXW7" = "FBXW7"))

plot(rv1, expandModules = TRUE, autofit = TRUE)

``` 

We have used the (experimental) `autofit` option to fit the labels to the
edges. Note how we can use the same name for genes and modules, but we
need to specify all the modules. 

<!-- \clearpage -->
Their Figure 5b is

```{r, fig.height=6}
rv2 <- allFitnessEffects(
    data.frame(parent = c("Root", "1", "2", "3", "4"),
               child = c("1", "2", "3", "4", "ELF3"),
               s = 0.1,
               sh = -0.01,
               typeDep = "MN"),
    geneToModule = c("Root" = "Root",
                     "1" = "APC, FBXW7",
                     "2" = "ATM, FAM123B, PIK3CA, TP53",
                     "3" = "BRAF, KRAS, NRAS",
                     "4" = "SMAD2, SMAD4, SOX9",
                     "ELF3" = "ELF3"))

plot(rv2, expandModules = TRUE,   autofit = TRUE)
``` 

<!-- %%very poor rendering in the PDF, in separate page, et c. -->
<!-- %% plot(rv2, "igraph", expandModules = TRUE,  -->
<!-- %%       layout = layout.reingold.tilford, -->
<!-- %%       autofit = TRUE, -->
<!-- %%       scale_char = 8) -->

\clearpage

# Running and plotting the simulations: starting, ending, and examples {#simul}

## Starting and ending {#starting-ending}


After you have decided the specifics of the fitness effects and the model,
you need to decide:


* Where will you start your simulation from. This involves deciding
  the initial population size (argument `initSize`) and, possibly,
  the genotype of the initial population; the later is covered in section
  \@ref(initmut).

* When will you stop it: how long to run it, and whether or not to
  require simulations to reach cancer (under some definition of what it
  means to reach cancer). This is covered in \@ref(endsimul).



## Can I start the simulation from a specific mutant? {#initmut}

You bet. In version 2 you can specify the genotype for the
initial mutant with the same flexibility as in `evalGenotype`. Here we show
a couple of examples (we use the representation of the parent-child
relationships ---discussed in section \@ref(phylog)--- of the clones so that
you can see which clones appear, and from which, and check that we are not
making mistakes).

<!-- In v.1 you can only give the initial mutant as one with a single -->
<!-- mutated gene. -->

<!-- %% o3init <- allFitnessEffects(orderEffects = c( -->
<!-- %%                             "M > D > F" = 0.99, -->
<!-- %%                             "D > M > F" = 0.2, -->
<!-- %%                             "D > M"     = 0.1, -->
<!-- %%                             "M > D"     = 0.9), -->
<!-- %%                         noIntGenes = c("u" = 0.01,  -->
<!-- %%                                        "v" = 0.01, -->
<!-- %%                                        "w" = 0.001, -->
<!-- %%                                        "x" = 0.0001, -->
<!-- %%                                        "y" = -0.0001, -->
<!-- %%                                        "z" = -0.001), -->
<!-- %%                         geneToModule = -->
<!-- %%                             c("Root" = "Root", -->
<!-- %%                               "M" = "m", -->
<!-- %%                               "F" = "f", -->
<!-- %%                               "D" = "d") ) -->

```{r prbau003, fig.height=6}

o3init <- allFitnessEffects(orderEffects = c(
                            "M > D > F" = 0.99,
                            "D > M > F" = 0.2,
                            "D > M"     = 0.1,
                            "M > D"     = 0.9),
                        noIntGenes = c("u" = 0.01, 
                                       "v" = 0.01,
                                       "w" = 0.001,
                                       "x" = 0.0001,
                                       "y" = -0.0001,
                                       "z" = -0.001),
                        geneToModule =
                            c("M" = "m",
                              "F" = "f",
                              "D" = "d") )

oneI <- oncoSimulIndiv(o3init, model = "McFL",
                       mu = 5e-5, finalTime = 500,
                       detectionDrivers = 3,
                       onlyCancer = FALSE,
                       initSize = 1000,
                       keepPhylog = TRUE,
                       initMutant = c("m > u > d")
                       )
plotClonePhylog(oneI, N = 0)
```


```{r prbau003bb, fig.height=6}
## Note we also disable the stopping stochastically as a function of size
## to allow the population to grow large and generate may different
## clones.
ospI <- oncoSimulPop(2, 
                    o3init, model = "Exp",
                    mu = 5e-5, finalTime = 500,
                    detectionDrivers = 3,
                    onlyCancer = TRUE,
                    initSize = 10,
                    keepPhylog = TRUE,
                    initMutant = c("d > m > z"),
                    mc.cores = 2
                    )
op <- par(mar = rep(0, 4), mfrow = c(1, 2))
plotClonePhylog(ospI[[1]])
plotClonePhylog(ospI[[2]])
par(op)

ossI <- oncoSimulSample(2, 
                        o3init, model = "Exp",
                        mu = 5e-5, finalTime = 500,
                        detectionDrivers = 2,
                        onlyCancer = TRUE,
                        initSize = 10,
                        initMutant = c("z > d"),
                        ## check presence of initMutant:
                        thresholdWhole = 1 
                    )

## No phylogeny is kept with oncoSimulSample, but look at the 
## OcurringDrivers and the sample
ossI$popSample
ossI$popSummary[, "OccurringDrivers", drop = FALSE]
``` 

## Ending the simulations {#endsimul}

OncoSimulR provides very flexible ways to decide when to stop a
simulation. Here we focus on a single simulation; see further options with
multiple simulations in \@ref(sample).

### Ending the simulations: conditions

* **`onlyCancer = TRUE`**. A simulation will be repeated until any
  one of the "reach cancer" conditions is met, if this happens before
  the simulation reaches `finalTime`[^6]. These conditions are:

    [^6]: Of course, the "reach cancer" idea and the `onlyCancer` argument
    are generic names; this could have been labeled "reach whatever
    interests me".
  
    + Total population size becomes larger than `detectionSize`.
    + The number of drivers in any one genotype or clone becomes equal
    to, or larger than, `detectionDrivers`; note that this allows you
    to stop the simulation as soon as a **specific genotype** is
    found, by using exactly and only the genes that make that genotype as
    the drivers.
	+ A gene or gene combination among those listed in `fixation`
    becomes fixed in the population (i.e., has a frequency is 1)
    (see details in (\@ref(fixation) and \@ref(fixationG)). 
    + The tumor is detected according to a stochastic detection
    mechanism, where the probability of "detecting the tumor" increases
    with population size; this is explained below (\@ref(detectprob)) and
    is controlled by argument `detectionProb`.
	
  
  
    As we exit as soon as any of the exiting conditions is reached,
    if you only care about one condition, set the other to `NA` (see
    also section \@ref(anddrvprob)).
	

* **`onlyCancer = FALSE`**. A simulation will run only once, and
  will exit as soon as any of the above conditions are met or as soon as
  the total population size becomes zero or we reach `finalTime`.




As an example of `onlyCancer = TRUE`, focusing on the first two
mechanisms, suppose you give `detectionSize = 1e4` and `detectionDrivers
=3` (and you have `detectionProb = NA`).  A simulation will exit as soon
as it reaches a total population size of $10^4$ or any clone has four
drivers, whichever comes first (if any of these happen before
`finalTime`).

In the `onlyCancer = TRUE` case, what happens if we reach
`finalTime` (or the population size becomes zero) before any of the
"reach cancer" conditions have been fulfilled?  The simulation will be
repeated again, within the following limits:


* `max.wall.time`: the total wall time we allow an individual
  simulation to run;
* `max.num.tries`: the maximum number of times we allow a
  simulation to be repeated to reach cancer;
* `max.wall.time.total` and `max.num.tries.total`,
  similar to the above but over a set of simulations in function
  `oncoSimulSample`.



Incidentally, we keep track of the number of attempts used (the component
`other$attemptsUsed$`) before we reach cancer, so you can estimate
(as from a negative binomial sampling) the probability of reaching your
desired end point under different scenarios.


The `onlyCancer = FALSE` case might be what you want to do when you
examine general population genetics scenarios without focusing on possible
sampling issues. To do this, set `finalTime` to the value you want
and set `onlyCancer = FALSE`; in addition, set
`detectionProb` to "NA" and `detectionDrivers` and
`detectionSize` to "NA" or to huge numbers^[Setting
  `detectionDrivers` and `detectionSize` to "NA" is in
  fact equivalent to setting them to the largest possible numbers for
  these variables: $2^{32} -1$ and $\infty$, respectively.]. In this
scenario you simply collect the simulation output at the end of the run,
regardless of what happened with the population (it became extinct, it did
not reach a large size, it did not accumulate drivers, etc).


### Stochastic detection mechanism: "detectionProb"  {#detectprob}

This is the process that is controlled by the argument
`detectionProb`. Here the probability of tumor detection increases
with the total population size. This is biologically a reasonable
assumption: the larger the tumor, the more likely it is it will be
detected.

<!-- % The third mechanism for stopping the simulation assumes that the -->
<!-- % probability of tumor detection increases with the total population -->
<!-- % size. This is biologically a reasonable assumption: the larger the tumor, -->
<!-- % the more likely it is it will be detected.  -->

At regularly spaced times during the simulation, we compute the
probability of detection as a function of size and determine (by comparing
against a random uniform number) if the simulation should finish. For
simplicity, and to make sure the probability is bounded between 0 and 1,
we use the function


 <!-- \begin{equation} -->
 <!--   P(N) =  -->
 <!--   \begin{cases} -->
 <!--   1 - e^{ -cPDetect (N - PDBaseline)} & \text{if } N > PDBaseline \\ -->
 <!--   0 & \text{if } N \leq PDBaseline -->
 <!--   \end{cases} -->
 <!--   \label{eq:2} -->
 <!-- \end{equation} -->




\begin{equation}
  P(N) = 
  \begin{cases}
  1 - e^{ -c ( (N - B)/B)} & \text{if } N > B \\
  0 & \text{if } N \leq B
  \end{cases}
  \label{eq:2}
\end{equation}


where $P(N)$ is the probability that a tumor with a population size $N$
will be detected, and $c$ (argument $cPDetect$ in the `oncoSimul*`
functions) controls how fast $P(N)$ increases with increasing population
size relative to a baseline, $B$ ($PDBaseline$ in the `oncoSimul*`
functions); with $B$ we both control the minimal population size at which
this mechanism stats operating (because we will rarely want detection
unless there is some meaningful increase of population size over
`initSize`) and we model the increase in $P(N)$ as a function of relative
differences with respect to $B$. (Note that this is **a major change** in
version 2.9.9. Before version 2.9.9, the expression used was
$P(N) = 1 - e^{ -c ( N - B)}$, so we did not make the increase relative to
$B$; of course, you can choose an appropriate $c$ to make different models
comparable, but the expression used before 2.9.9 made it much harder to
compare simulations with very different initial population sizes, as
baselines are often naturall a function of initial population sizes.)


The $P(N)$ refers to the probability of detection at each one of the
occasions when we assess the probability of exiting. When, or how often,
do we do that? When we assess probability of exiting is controlled by
`checkSizePEvery`, which will often be much larger than
`sampleEvery`^[We assess probability of exiting at every
  sampling time, as given by `sampleEvery`, that is the smallest
  possible sampling time that is separated from the previous time of
  assessment by at least `checkSizePEvery`. In other words, the
  interval between successive assessments will be the smallest multiple
  integer of `sampleEvery` that is larger than
  `checkSizePEvery`. For example, suppose `sampleEvery = 2`
  and `checkSizePEvery = 3`: we will assess exiting at times
  $4, 8, 12, 16, \ldots$. If `sampleEvery = 3` and
  `checkSizePEvery = 3`: we will assess exiting at times
  $6, 12, 18, \ldots$.]. Biologically, a way to think of
`checkSizePEvery` is "time between doctor appointments".


An important **warning**, though: for populations that are growing
very, very fast or where some genes might have very large effects on
fitness even a moderate `checkSizePEvery` of, say, 10, might be
inappropriate, since populations could have increased by several
orders of magnitude between successive checks. This issue is also
discussed in section \@ref(bench1xf) and \@ref(benchusual).


Finally, you can specify $c$ ($cPDetect$) directly (you will need to set
`n2` and `p2` to NA). However, it might be more intuitive to specify the
pair `n2`, `p2`, such that $P(n2) = p2$ (and from that pair we solve for
the value of $cPDetect$).


You can get a feeling for the effects of these arguments by playing with
the following code, that we do not execute here for the sake of
speed. Here no mutation has any effect, but there is a non-zero
probability of exiting as soon as the total population size becomes larger
than the initial population size. So, eventually, all simulations will
exit and, as we are using the McFarland model, population size will vary
slightly around the initial population size.


```{r prbaux002, eval=FALSE}
gi2 <- rep(0, 5)
names(gi2) <- letters[1:5]
oi2 <- allFitnessEffects(noIntGenes = gi2)
s5 <- oncoSimulPop(200,
                   oi2,
                   model = "McFL",
                   initSize = 1000,
                   detectionProb = c(p2 = 0.1, 
                                     n2 = 2000, 
                                     PDBaseline = 1000,
                                     checkSizePEvery = 2), 
                   detectionSize = NA,
                   finalTime = NA, 
                   keepEvery = NA,
                   detectionDrivers = NA)
s5
hist(unlist(lapply(s5, function(x) x$FinalTime)))
``` 

As you decrease `checkSizePEvery` the distribution of "FinalTime"
will resemble more and more an exponential distribution.


In this vignette, there are some further examples of using this mechanism
in \@ref(s-cbn1) and \@ref(mcf5070), with the default arguments.



#### Stochastic detection mechanism and minimum number of drivers {#anddrvprob}

We said above that we exit as soon as any of the conditions is
reached (i.e., we use an OR operation over the exit
conditions). There is a special exception to this procedure: if you
set `AND_DrvProbExit = TRUE`, both the number of drivers and the
`detectionProb` mechanism condition must fulfilled. This means that
the `detectionProb` mechanism not assessed unless the
`detectionDrivers` condition is. Using `AND_DrvProbExit = TRUE`
allows to run simulations and ensure that all of the returned
simulations will have at least some cells with the number of drivers
as specified by `detectionDrivers`. Note, though, that this does not
guarantee that when you sample the population, all those drivers
will be detected (as this depends on the actual proportion of cells
with the drivers and the settings of `samplePop`).






### Fixation of genes/gene combinations {#fixation}

In some cases we might be interested in running simulations until a
particular set of genes, or gene combinations, reaches
fixation. This exit condition might be more relevant than some of
the above in many non-cancer-related evolutionary genetics
scenarios.

Simulations will stop as soon as any of the genes or gene combinations in
the vector (or list) `fixation` reaches a frequency of 1.  These gene
combinations might have non-zero intersection (i.e., they might share
genes), and those genes need not be drivers. If we want simulations to
only stop when fixation of those genes/gene combinations is reached, we
will set all other stopping conditions to `NA`. It is, of course, up to
you to ensure that those stopping conditions are reasonable (that they can
be reached) and to use, or not, `finalTime`; otherwise, simulations will
eventually abort (e.g., when `max.wall.time` or `max.num.tries` are
reached). Since we are asking for fixation, the `Exp` or `Bozic` models
will often not be appropriate here; instead, models with competition such
as `McFL` are more appropriate.


We return here to the example from section \@ref(ochsdesai). 

```{r}
u <- 0.2; i <- -0.02; vi <- 0.6; ui <- uv <- -Inf
od2 <- allFitnessEffects(
    epistasis = c("u" = u,  "u:i" = ui,
                  "u:v" = uv, "i" = i,
                  "v:-i" = -Inf, "v:i" = vi))
```

Ochs and Desai explain that "Each simulated population was evolved
until either the uphill genotype or valley-crossing genotype fixed."
(see @Ochs2015, p.2, section "Simulations"). We will do the same
here.  We specify that we want to end the simulation when either the
"u" or the "v, i" genotypes have reached fixation, by passing those
genotype combinations as the `fixation` argument (in this example
using `fixation = c("u", "v")` would have been equivalent, since the
"v" genotype by itself has fitness of 0).


We want to be explicit that fixation will be the one and only
condition for ending the simulations, and thus we set arguments
`detectionDrivers`, `finalTime`, `detectionSize` and `detectionProb`
explicitly to `NA`. (We set the number of repetitions only to 10 for
the sake of speed when creating the vignette).


```{r simul-ochs}
initS <- 20
## We use only a small number of repetitions for the sake
## of speed.
od100 <- oncoSimulPop(10, od2,
                      fixation = c("u", "v, i"),
                      model = "McFL",
                      mu = 1e-4,
                      detectionDrivers = NA,
                      finalTime = NA,
                      detectionSize = NA,
                      detectionProb = NA,
                      onlyCancer = TRUE,
                      initSize = initS, 
                      mc.cores = 2)
```


What is the frequency of each genotype among the simulations? (or,
what is the frequency of fixation of each genotype?)

```{r ochs-freq-genots}
sampledGenotypes(samplePop(od100))                      
```

Note the very large variability in reaching fixation
```{r sum-simul-ochs}
head(summary(od100)[, c(1:3, 8:9)])
```


### Fixation of genotypes {#fixationG}

Section \@ref(fixation) deals with the fixation of gene/gene
combinations. What if you want fixation on specific genotypes? To give an
example, suppose we have a five loci genotype and suppose that you want to
stop the simulations only if genotypes "A", "B, E", or "A, B, C, D, E"
reach fixation. You do not want to stop it if, say, genotype "A, B, E"
reaches fixation. To specify genotypes, you prepend the genotype
combinations with a "_,", and that tells OncoSimulR that you want
fixation of **genotypes**, not just gene combinations.

An example of the differences between the mechanisms can be seen from
this code:

```{r fixationG1}
## Create a simple fitness landscape
rl1 <- matrix(0, ncol = 6, nrow = 9)
colnames(rl1) <- c(LETTERS[1:5], "Fitness")
rl1[1, 6] <- 1
rl1[cbind((2:4), c(1:3))] <- 1
rl1[2, 6] <- 1.4
rl1[3, 6] <- 1.32
rl1[4, 6] <- 1.32
rl1[5, ] <- c(0, 1, 0, 0, 1, 1.5)
rl1[6, ] <- c(0, 0, 1, 1, 0, 1.54)
rl1[7, ] <- c(1, 0, 1, 1, 0, 1.65)
rl1[8, ] <- c(1, 1, 1, 1, 0, 1.75)
rl1[9, ] <- c(1, 1, 1, 1, 1, 1.85)
class(rl1) <- c("matrix", "genotype_fitness_matrix")
## plot(rl1) ## to see the fitness landscape

## Gene combinations
local_max_g <- c("A", "B, E", "A, B, C, D, E")
## Specify the genotypes
local_max <- paste0("_,", local_max_g)

fr1 <- allFitnessEffects(genotFitness = rl1, drvNames = LETTERS[1:5])
initS <- 2000


######## Stop on gene combinations   #####
r1 <- oncoSimulPop(10,
                       fp = fr1,
                       model = "McFL",
                       initSize = initS,
                       mu = 1e-4,
                       detectionSize = NA,
                       sampleEvery = .03,
                       keepEvery = 1, 
                       finalTime = 50000,
                       fixation = local_max_g, 
                       detectionDrivers = NA,
                       detectionProb = NA,
                       onlyCancer = TRUE,
                       max.num.tries = 500,
                       max.wall.time = 20, 
                       errorHitMaxTries = TRUE,
                       keepPhylog = FALSE,
                       mc.cores = 2)
sp1 <- samplePop(r1, "last", "singleCell")
sgsp1 <- sampledGenotypes(sp1)
## often you will stop on gene combinations that
## are not local maxima in the fitness landscape
sgsp1
sgsp1$Genotype %in% local_max_g

####### Stop on genotypes   ####

r2 <- oncoSimulPop(10,
                       fp = fr1,
                       model = "McFL",
                       initSize = initS,
                       mu = 1e-4,
                       detectionSize = NA,
                       sampleEvery = .03,
                       keepEvery = 1, 
                       finalTime = 50000,
                       fixation = local_max, 
                       detectionDrivers = NA,
                       detectionProb = NA,
                       onlyCancer = TRUE,
                       max.num.tries = 500,
                       max.wall.time = 20, 
                       errorHitMaxTries = TRUE,
                       keepPhylog = FALSE,
                       mc.cores = 2)
## All final genotypes should be local maxima                       
sp2 <- samplePop(r2, "last", "singleCell")
sgsp2 <- sampledGenotypes(sp2)
sgsp2$Genotype %in% local_max_g



```

### Fixation: tolerance, number of periods, minimal size

In particular if you specify stopping on genotypes, you might want to
think about three additional parameters: `fixation_tolerance`,
`min_successive_fixation`, and `fixation_min_size`.

`fixation_tolerance`: fixation is considered to have happened if the
genotype/gene combinations specified as genotypes/gene combinations for
fixation have reached a frequency $> 1 - fixation\_tolerance$. (The
default is 0, so we ask for genotypes/gene combinations with a frequency
of 1, which might not be what you want with large mutation rates and
complex fitness landscape with genotypes of similar fitness.)


`min_successive_fixation`: during how many successive sampling periods the
conditions of fixation need to be fulfilled before declaring
fixation. These must be successive sampling periods without interruptions
(i.e., a single period when the condition is not fulfilled will set the
counter to 0). This can help to exclude short, transitional, local maxima
that are quickly replaced by other genotypes. (The default is 50, but this
is probably too small for "real life" usage).

`fixation_min_size`: you might only want to consider fixation to have
happened if a minimal size has been reached (this can help weed out local
maxima that have fitness that is barely above that of the wild-type
genotype). (The default is 0).


An example of using those options:

```{r fixationG2}
## Create a simple fitness landscape
rl1 <- matrix(0, ncol = 6, nrow = 9)
colnames(rl1) <- c(LETTERS[1:5], "Fitness")
rl1[1, 6] <- 1
rl1[cbind((2:4), c(1:3))] <- 1
rl1[2, 6] <- 1.4
rl1[3, 6] <- 1.32
rl1[4, 6] <- 1.32
rl1[5, ] <- c(0, 1, 0, 0, 1, 1.5)
rl1[6, ] <- c(0, 0, 1, 1, 0, 1.54)
rl1[7, ] <- c(1, 0, 1, 1, 0, 1.65)
rl1[8, ] <- c(1, 1, 1, 1, 0, 1.75)
rl1[9, ] <- c(1, 1, 1, 1, 1, 1.85)
class(rl1) <- c("matrix", "genotype_fitness_matrix")
## plot(rl1) ## to see the fitness landscape

## The local fitness maxima are
## c("A", "B, E", "A, B, C, D, E")

fr1 <- allFitnessEffects(genotFitness = rl1, drvNames = LETTERS[1:5])
initS <- 2000

## Stop on genotypes

r3 <- oncoSimulPop(10,
                  fp = fr1,
                  model = "McFL",
                  initSize = initS,
                  mu = 1e-4,
                  detectionSize = NA,
                  sampleEvery = .03,
                  keepEvery = 1, 
                  finalTime = 50000,
                  fixation = c(paste0("_,",
                                   c("A", "B, E", "A, B, C, D, E")),
                               fixation_tolerance = 0.1,
                               min_successive_fixation = 200,
                               fixation_min_size = 3000),
                  detectionDrivers = NA,
                  detectionProb = NA,
                  onlyCancer = TRUE,
                  max.num.tries = 500,
                  max.wall.time = 20, 
                  errorHitMaxTries = TRUE,
                  keepPhylog = FALSE,
                  mc.cores = 2)

```





### Mixing stopping on gene combinations and genotypes {#fixationmix}

This would probably be awfully confusing and is not tested formally
(though it should work). Let me know if you think this is an important
feature. (Pull requests with tests welcome.)


## Plotting genotype/driver abundance over time; plotting the simulated trajectories {#plotraj}

We have seen many of these plots already, starting with Figure
\@ref(fig:iep1x1) and Figure \@ref(fig:iep2x2) and we will see many
more below, in the examples, starting with section \@ref(bauer2)
such as in figures \@ref(fig:baux1) and \@ref(fig:baux2). In a
nutshell, what we are plotting is the information contained in the
`pops.by.time` matrix, the matrix that contains the abundances of
all the clones (or genotypes) at each of the sampling periods.


The functions that do the work are called `plot` and these are actually
methods for objects of class "oncosimul" and "oncosimulpop". You can access
the help by doing `?plot.oncosimul`, for example.

What entities are shown in the plot? You can show the trajectories of:

- numbers of drivers (e.g., \@ref(fig:baux1));

- genotypes or clones (e.g., \@ref(fig:baux2)).


(Of course, showing "drivers" requires that you have specified
certain genes as drivers.)


What types of plots are available?

- line plots;

- stacked plots;

- stream plots.

All those three are shown in both of Figure \@ref(fig:baux1) and
Figure \@ref(fig:baux2).


If you run multiple simulations using `oncoSimulPop` you can plot
the trajectories of all of the simulations.

## Several examples of simulations and plotting simulation trajectories {#severalexplot}

### Bauer's example again {#bauer2} 

We will use the model of @Bauer2014 that we saw in section \@ref(bauer).


```{r prbaux001}
K <- 5
sd <- 0.1
sdp <- 0.15
sp <- 0.05
bauer <- data.frame(parent = c("Root", rep("p", K)),
                    child = c("p", paste0("s", 1:K)),
                    s = c(sd, rep(sdp, K)),
                    sh = c(0, rep(sp, K)),
                    typeDep = "MN")
fbauer <- allFitnessEffects(bauer, drvNames = "p")
set.seed(1)
## Use fairly large mutation rate
b1 <- oncoSimulIndiv(fbauer, mu = 5e-5, initSize = 1000,
                     finalTime = NA,
                     onlyCancer = TRUE,
                     detectionProb = "default")

``` 


We will now use a variety of plots

```{r baux1,fig.width=6.5, fig.height=10, fig.cap="Three drivers' plots of a simulation of Bauer's model"}
par(mfrow = c(3, 1))
## First, drivers
plot(b1, type = "line", addtot = TRUE)
plot(b1, type = "stacked")
plot(b1, type = "stream")
``` 

```{r baux2,fig.width=6.5, fig.height=10, fig.cap="Three genotypes' plots of a simulation of Bauer's model"}
par(mfrow = c(3, 1))
## Next, genotypes
plot(b1, show = "genotypes", type = "line")
plot(b1, show = "genotypes", type = "stacked")
plot(b1, show = "genotypes", type = "stream")
``` 


<!-- % %% For poster -->
<!-- % <<>>= -->

<!-- % pdf(file = "b1-traj.pdf", width = 8, height = 8) -->
<!-- % par(cex = 1.55) -->
<!-- % par(cex.axis= 0.9) -->
<!-- % par(las = 1) -->
<!-- % plot(b1, show = "genotypes", type = "stacked",  -->
<!-- %      plotDiversity = TRUE, legend.ncols = 3) -->
<!-- % dev.off() -->

<!-- % @  -->

In this case, probably the stream plots are most helpful. Note, however,
that (in contrast to some figures in the literature showing models of
clonal expansion) the stream plot (or the stacked plot) does not try to
explicitly show parent-descendant relationships, which would hardly be
realistically possible in these plots (although the plots of phylogenies
in section \@ref(phylog) could be of help).

### McFarland model with 5000 passengers and 70 drivers {#mcf5070}

```{r, fig.width=6}

set.seed(678)
nd <- 70  
np <- 5000 
s <- 0.1  
sp <- 1e-3 
spp <- -sp/(1 + sp)
mcf1 <- allFitnessEffects(noIntGenes = c(rep(s, nd), rep(spp, np)),
                          drvNames = seq.int(nd))
mcf1s <-  oncoSimulIndiv(mcf1,
                         model = "McFL", 
                         mu = 1e-7,
                         detectionProb = "default",
                         detectionSize = NA, 
                         detectionDrivers = NA,
                         sampleEvery = 0.025,
                         keepEvery = 8,
                         initSize = 2000,
                         finalTime = 4000,
                         onlyCancer = FALSE)
summary(mcf1s)

``` 

<!-- %% -->
<!-- %% set.seed(456) -->

```{r mcf1sx1,fig.width=6.5, fig.height=10}
par(mfrow  = c(2, 1))
## I use thinData to make figures smaller and faster
plot(mcf1s, addtot = TRUE, lwdClone = 0.9, log = "", 
     thinData = TRUE, thinData.keep = 0.5)
plot(mcf1s, show = "drivers", type = "stacked",
     thinData = TRUE, thinData.keep = 0.3,
     legend.ncols = 2)
``` 


With the above output (where we see there are over 500 different
genotypes) trying to represent the genotypes makes no sense. 



### McFarland model with 50,000 passengers and 70 drivers: clonal competition {#mcf50070}

The next is too slow (takes a couple of minutes in an i5 laptop) and too
big to run in a vignette, because we keep track of over 4000 different
clones (which leads to a result object of over 800 MB):

```{r, eval=FALSE}

set.seed(123)
nd <- 70  
np <- 50000 
s <- 0.1  
sp <- 1e-4 ## as we have many more passengers
spp <- -sp/(1 + sp)
mcfL <- allFitnessEffects(noIntGenes = c(rep(s, nd), rep(spp, np)),
                          drvNames = seq.int(nd))
mcfLs <-  oncoSimulIndiv(mcfL,
                         model = "McFL", 
                         mu = 1e-7,
                         detectionSize = 1e8, 
                         detectionDrivers = 100,
                         sampleEvery = 0.02,
                         keepEvery = 2,
                         initSize = 1000,
                         finalTime = 2000,
                         onlyCancer = FALSE)
``` 

But you can access the pre-stored results and plot them (beware: this
object has been trimmed by removing empty passenger rows in the Genotype matrix)

```{r, mcflsx2,fig.width=6}
data(mcfLs)
plot(mcfLs, addtot = TRUE, lwdClone = 0.9, log = "",
     thinData = TRUE, thinData.keep = 0.3,
     plotDiversity = TRUE)
``` 


The argument `plotDiversity = TRUE` asks to show a small plot on top
with Shannon's diversity index.


```{r}
summary(mcfLs)
## number of passengers per clone
summary(colSums(mcfLs$Genotypes[-(1:70), ]))
``` 


Note that we see clonal competition between clones with the same number of
drivers (and with different drivers, of course). We will return to this
(section \@ref(clonalint)).

A stacked plot might be better to show the extent of clonal
competition (plotting takes some time ---a stream plot reveals
similar patterns and is also slower than the line plot). I will
aggressively thin the data for this plot so it is faster and smaller
(but we miss some of the fine grain, of course):


```{r mcflsx3}
plot(mcfLs, type = "stacked", thinData = TRUE, 
     thinData.keep = 0.2,
     plotDiversity = TRUE,
     xlim = c(0, 1000))
``` 


<!-- %% %% The problem is the Genotype matrix. We remove empty passenger rows. -->
<!-- %% <<>>= -->
<!-- %% g1 <- mcfLs$Genotypes[1:nd, ] -->
<!-- %% g2 <- mcfLs$Genotypes[(nd+1):(nd+np), ] -->
<!-- %% rs <- rowSums(g2) -->
<!-- %% g3 <- g2[which(rs == 0), ] -->
<!-- %% g4 <- rbind(g1, g3) -->
<!-- %% @  -->








### Simulation with a conjunction example {#s-cbn1}

<!-- #### Loading fitnessEffects data for simulation examples {#fedata} -->
We will use several of the previous examples. Most of them are in file
`examplesFitnessEffects`, where they are stored inside a list,
with named components (names the same as in the examples above):

```{r}
data(examplesFitnessEffects)
names(examplesFitnessEffects)
``` 



We will simulate using the simple CBN-like restrictions of
section \@ref(cbn1) with two different models. 

```{r smmcfls}
data(examplesFitnessEffects)
evalAllGenotypes(examplesFitnessEffects$cbn1, order = FALSE)[1:10, ]
sm <-  oncoSimulIndiv(examplesFitnessEffects$cbn1,
                      model = "McFL", 
                      mu = 5e-7,
                      detectionSize = 1e8, 
                      detectionDrivers = 2,
                      detectionProb = "default",
                      sampleEvery = 0.025,
                      keepEvery = 5,
                      initSize = 2000,
                      onlyCancer = TRUE)
summary(sm)
``` 



```{r smbosb1}
set.seed(1234)
evalAllGenotypes(examplesFitnessEffects$cbn1, order = FALSE, 
                 model = "Bozic")[1:10, ]
sb <-  oncoSimulIndiv(examplesFitnessEffects$cbn1,
                       model = "Bozic", 
                       mu = 5e-6,
                      detectionProb = "default",
                       detectionSize = 1e8, 
                       detectionDrivers = 4,
                       sampleEvery = 2,
                       initSize = 2000,
                       onlyCancer = TRUE)
summary(sb)
``` 

As usual, we will use several plots here.
<!-- \clearpage -->

```{r sbx1,fig.width=6.5, fig.height=3.3}
## Show drivers, line plot
par(cex = 0.75, las = 1)
plot(sb,show = "drivers", type = "line", addtot = TRUE,
     plotDiversity = TRUE)
``` 
```{r sbx2,fig.width=6.5, fig.height=3.3}
## Drivers, stacked
par(cex = 0.75, las = 1)
plot(sb,show = "drivers", type = "stacked", plotDiversity = TRUE)
``` 
```{r sbx3,fig.width=6.5, fig.height=3.3}
## Drivers, stream
par(cex = 0.75, las = 1)
plot(sb,show = "drivers", type = "stream", plotDiversity = TRUE)
``` 
<!-- \clearpage -->
```{r sbx4,fig.width=6.5, fig.height=3.3}
## Genotypes, line plot
par(cex = 0.75, las = 1)
plot(sb,show = "genotypes", type = "line", plotDiversity = TRUE)
``` 
```{r sbx5,fig.width=6.5, fig.height=3.3}
## Genotypes, stacked
par(cex = 0.75, las = 1)
plot(sb,show = "genotypes", type = "stacked", plotDiversity = TRUE)
``` 
```{r sbx6,fig.width=6.5, fig.height=3.3}
## Genotypes, stream
par(cex = 0.75, las = 1)
plot(sb,show = "genotypes", type = "stream", plotDiversity = TRUE)
``` 

The above illustrates again that different types of plots can be useful to
reveal different patterns in the data. For instance, here, because of the
huge relative frequency of one of the clones/genotypes, the stacked and
stream plots do not reveal the other clones/genotypes as we cannot use a
log-transformed y-axis, even if there are other clones/genotypes present.




### Simulation with order effects and McFL model {#clonalint}

<!-- %% Interesting to show effects of order: o3 -->

<!-- %% Increase mutation rate, so does not take forever -->
<!-- %% <<>>= -->

<!-- %% tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]], -->
<!-- %%                        model = "McFL",  -->
<!-- %%                        mu = 5e-5, -->
<!-- %%                        detectionSize = 1e8,  -->
<!-- %%                        detectionDrivers = 3, -->
<!-- %%                        sampleEvery = 0.025, -->
<!-- %%                        max.num.tries = 10, -->
<!-- %%                        keepEvery = -9, -->
<!-- %%                        initSize = 2000, -->
<!-- %%                        finalTime = 8000, -->
<!-- %%                        onlyCancer = TRUE);  -->

<!-- %% tmp -->

<!-- %% tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]], -->
<!-- %%                        model = "Bozic",  -->
<!-- %%                        mu = 5e-5, -->
<!-- %%                        detectionSize = 1e6,  -->
<!-- %%                        detectionDrivers = 4, -->
<!-- %%                        sampleEvery = 2, -->
<!-- %%                        max.num.tries = 100, -->
<!-- %%                        keepEvery = -9, -->
<!-- %%                        initSize = 2000, -->
<!-- %%                        onlyCancer = TRUE) -->
<!-- %% tmp -->

<!-- %% @  -->

(We use a somewhat large mutation rate than usual, so that the simulation
runs quickly.)


```{r, fig.width=6}

set.seed(4321)
tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
                       model = "McFL", 
                       mu = 5e-5,
                       detectionSize = 1e8, 
                       detectionDrivers = 3,
                       sampleEvery = 0.025,
                       max.num.tries = 10,
                       keepEvery = 5,
                       initSize = 2000,
                       finalTime = 6000,
                       onlyCancer = FALSE) 
``` 

We show a stacked and a line plot of the drivers:

<!-- \clearpage -->

```{r tmpmx1,fig.width=6.5, fig.height=4.1}
par(las = 1, cex = 0.85)
plot(tmp, addtot = TRUE, log = "", plotDiversity = TRUE,
     thinData = TRUE, thinData.keep = 0.2)
``` 
```{r tmpmx2,fig.width=6.5, fig.height=4.1}
par(las = 1, cex = 0.85)
plot(tmp, type = "stacked", plotDiversity = TRUE, 
     ylim = c(0, 5500), legend.ncols = 4,
     thinData = TRUE, thinData.keep = 0.2)
``` 

In this example (and at least under Linux, with both GCC and clang
---random number streams in C++, and thus simulations, can differ
between combinations of operating system and compiler), we can see
that the mutants with three drivers do not get established when we
stop the simulation at time 6000. This is one case where the summary
statistics about number of drivers says little of value, as fitness
is very different for genotypes with the same number of mutations,
and does not increase in a simple way with drivers:

```{r}
evalAllGenotypes(examplesFitnessEffects[["o3"]], addwt = TRUE,
                 order = TRUE)
``` 

A few figures could help:

```{r tmpmx3,fig.width=6.5, fig.height=5.5}
plot(tmp, show = "genotypes", ylim = c(0, 5500), legend.ncols = 3,
     thinData = TRUE, thinData.keep = 0.5)
``` 

(When reading the figure legends, recall that genotype  $x > y\ \_\ z$ is
one where a mutation in "x" happened before a mutation in "y", and
there is also a mutation in "z" for which order does not matter. Here,
there are no genes for which order does not matter and thus there is
nothing after the "_").


In this case, the clones with three drivers end up displacing those with
two by the time we stop; moreover, notice how those with one driver never
really grow to a large population size, so we basically go from a
population with clones with zero drivers to a population made of clones
with two or three drivers:

<!-- %%<<fig.width=6} -->


<!-- Setting sampleEvery to 0.015, we get really nice figures too, smooth -->
<!-- and nice colors between 500 and 1500 time units. -->

```{r}
set.seed(15)
tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
                       model = "McFL", 
                       mu = 5e-5,
                       detectionSize = 1e8, 
                       detectionDrivers = 3,
                       sampleEvery = 0.025,
                       max.num.tries = 10,
                       keepEvery = 5,
                       initSize = 2000,
                       finalTime = 20000,
                       onlyCancer = FALSE,
                       extraTime = 1500)
tmp
``` 

<!-- \clearpage -->

use a drivers plot:
```{r tmpmdx5,fig.width=6.5, fig.height=4}
par(las = 1, cex = 0.85)
plot(tmp, addtot = TRUE, log = "", plotDiversity = TRUE,
     thinData = TRUE, thinData.keep = 0.5)
``` 
```{r tmpmdx6,fig.width=6.5, fig.height=4}
par(las = 1, cex = 0.85)
plot(tmp, type = "stacked", plotDiversity = TRUE,
     legend.ncols = 4, ylim = c(0, 5200), xlim = c(3400, 5000),
     thinData = TRUE, thinData.keep = 0.5)
``` 

<!-- \clearpage -->

Now show the genotypes explicitly:
```{r tmpmdx7,fig.width=6.5, fig.height=5.3}
## Improve telling apart the most abundant 
## genotypes by sorting colors
## differently via breakSortColors
## Modify ncols of legend, so it is legible by not overlapping
## with plot
par(las = 1, cex = 0.85)
plot(tmp, show = "genotypes", breakSortColors = "distave",
     plotDiversity = TRUE, legend.ncols = 4,
     ylim = c(0, 5300), xlim = c(3400, 5000),
     thinData = TRUE, thinData.keep = 0.5)
```



As before, the argument `plotDiversity = TRUE` asks to show a small
plot on top with Shannon's diversity index. Here, as before, the quick
clonal expansion of the clone with two drivers leads to a sudden drop in
diversity (for a while, the population is made virtually of a single
clone). Note, however, that compared to section \@ref(mcf50070), we are
modeling here a scenario with very few genes, and correspondingly very few
possible genotypes, and thus it is not strange that we observe very little
diversity.

<!-- %% These patterns, however, are not always present -->

<!-- %% <<fig.width=6>>= -->

<!-- %% set.seed(7654)  -->
<!-- %% tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]], -->
<!-- %%                        model = "McFL",  -->
<!-- %%                        mu = 5e-5, -->
<!-- %%                        detectionSize = 1e8,  -->
<!-- %%                        detectionDrivers = 3, -->
<!-- %%                        sampleEvery = 0.015, -->
<!-- %%                        max.num.tries = 10, -->
<!-- %%                        keepEvery = 5, -->
<!-- %%                        initSize = 2000, -->
<!-- %%                        finalTime = 10000, -->
<!-- %%                        onlyCancer = FALSE, -->
<!-- %%                        extraTime = 10) -->
<!-- %% tmp -->
<!-- %% plot(tmp, addtot = TRUE, log = "") -->

<!-- %% @  -->


<!-- %% Although in other runs we do not reach the three gene mutant and continue -->
<!-- %% with clone competition for a long time: -->


(We have used `extraTime` to continue the simulation well past the
point of detection, here specified as three drivers. Instead of specifying
`extraTime` we can set the `detectionDrivers` value to a
number larger than the number of existing possible drivers, and the
simulation will run until `finalTime` if `onlyCancer = FALSE`.)





<!-- \clearpage -->





## Interactive graphics {#interactive}

It is possible to create interactive stacked area and stream plots using
the `r Githubpkg("hrbrmstr/streamgraph")` package, available from
<https://github.com/hrbrmstr/streamgraph>.  However, that package is
not available as a CRAN or BioConductor package, and thus we cannot depend
on it for this vignette (or this package). You can, however, paste the
code below and make it run locally.

Before calling the `streamgraph` function, though, we need to
convert the data from the original format in which it is stored into
"long format". A simple convenience function is provided as
`OncoSimulWide2Long` in `r Biocpkg("OncoSimulR")`.


As an example, we will use the data we generated above for section
\@ref(bauer2).


```{r, eval=FALSE}
## Convert the data
lb1 <- OncoSimulWide2Long(b1)

## Install the streamgraph package from GitHub and load
library(devtools)
devtools::install_github("hrbrmstr/streamgraph")
library(streamgraph)

## Stream plot for Genotypes
sg_legend(streamgraph(lb1, Genotype, Y, Time, scale = "continuous"),
              show=TRUE, label="Genotype: ")

## Staked area plot and we use the pipe
streamgraph(lb1, Genotype, Y, Time, scale = "continuous", 
            offset = "zero") %>%
    sg_legend(show=TRUE, label="Genotype: ")
``` 


<!-- %% (Note: the idiomatic way of doing the above with `r CRANpkg("tidyr")` is using  -->
<!-- %% \verb= %>% =, the pipe operator. Something like  -->
<!-- %% \begin{verbatim} -->
<!-- %% streamgraph(lb1, Genotype, Y, Time, scale = "continuous",   -->
<!-- %%            offset = "zero") \%>\%                                                                                                                                sg_legend(show=TRUE, label="Genotype: ")  -->
<!-- %% \end{verbatim} -->

<!-- %% but it gives me problems with knitr, etc). -->


\clearpage

# Sampling multiple simulations {#sample}

Often, you will want to simulate multiple runs of the same scenario, and
then obtain the matrix of runs by mutations (a matrix of
individuals/samples by genes or, equivalently, a vector of "genotypes"),
and do something with them.  OncoSimulR offers several ways of doing this.

The key function here is `samplePop`, either called explicitly
after `oncoSimulPop` (or `oncoSimulIndiv`), or implicitly as
part of a call to `oncoSimulSample`. With `samplePop` you can
use **single cell** or **whole tumor** sampling (for details see the
help of `samplePop`). Depending on how the simulations were
conducted, you might also sample at different times, or as a
function of population sizes. A major difference between procedures
has to do with whether or not you want to keep the complete history
of the simulations.

**You want to keep the complete history of population sizes of
clones during the simulations**. You will
simulate using:


  - `oncoSimulIndiv` repeatedly (maybe within `mclapply`, to  parallelize the run).
  
  - `oncoSimulPop`. `oncoSimulPop` is basically a thin wrapper around `oncoSimulIndiv` that uses  `mclapply`.

In both cases, you specify the conditions for ending the simulations
(as explained in \@ref(endsimul)). Then, you use function
`samplePop` to obtain the matrix of samples by mutations.

  
**You do not want to keep the complete history of population sizes
of clones during the simulations**. You will simulate using:


- `oncoSimulIndiv` repeatedly,  with argument `keepEvery = NA`. 


- `oncoSimulPop`, with argument `keepEvery = NA`. 

     In both cases you specify the conditions for ending the
     simulations (as explained in \@ref(endsimul)).  Then, you 
     use function `samplePop`.


  
- `oncoSimulSample`, specifying the conditions for ending the
  simulations (as explained in \@ref(endsimul)). In this case, you
  will not use `samplePop`, as that is implicitly called by
  `oncoSimulSample`.  The output is directly the matrix (and a
  little bit of summary from each run), and during the simulation it
  only stores one time point.
  

Why the difference between the above cases? If you keep the complete
history of population sizes, you can take samples at any of the
times between the beginning and the end of the simulations. If you
do not keep the history, you can only sample at the time the
simulation exited (see section \@ref(trackindivs)). Why would you
want to use the second route? If we are only interested in the final
matrix of individuals by mutations, keeping the complete history
above is wasteful because we store fully all of the simulations (for
example in the call to `oncoSimulPop`) and then sample (in the call
to `samplePop`). <!-- Other reasons for choosing one over the other
--> <!-- might have to do with flexibility (e.g., if you use -->
<!-- `oncoSimulPop` the arguments for `detectionSize`, --> <!--
`detectionDrivers` must be the same for all simulations but this is
--> <!-- not the case for `oncoSimulSample`) and parallelized -->
<!-- execution.  --> 

Further criteria to use when choosing between sampling procedures is
whether you need `detectionSize` and `detectionDrivers` do differ
between simulations: if you use `oncoSimulPop` the arguments for
`detectionSize` and `detectionDrivers` must be the same for all
simulations but this is not the case for `oncoSimulSample`. See
further comments in \@ref(diffsample). Finally, parallelized
execution is available for `oncoSimulPop` but, by design, not for
`oncoSimulSample`.




The following are a few examples. First we run `oncoSimulPop` to obtain 4
simulations and in the last line we sample from them: 


```{r pancrpopcreate}

pancrPop <- oncoSimulPop(4, pancr,
                         detectionSize = 1e7,
                         keepEvery = 10,
                         mc.cores = 2)

summary(pancrPop)
samplePop(pancrPop)

``` 
  

Now a simple multiple call to `oncoSimulIndiv` wrapped inside
`mclapply`; this is basically the same we just did above. We set
the class of the object to allow direct usage of
`samplePop`. (Note: in Windows `mc.cores > 1` is not
supported, so for the vignette to run in Windows, Linux, and Mac we
explicitly set it here in the call to `mclapply`. For regular
usage, you will not need to do this; just use whatever is appropriate for
your operating system and number of cores. As well, we do not need any of
this with `oncoSimulPop` because the code inside
`oncoSimulPop` already takes care of setting `mc.cores`
to 1 in Windows).

```{r}
library(parallel)

if(.Platform$OS.type == "windows") {
    mc.cores <- 1
} else {
    mc.cores <- 2
}

p2 <- mclapply(1:4, function(x) oncoSimulIndiv(pancr,
                                               detectionSize = 1e7,
                                               keepEvery = 10),
                                               mc.cores = mc.cores)
class(p2) <- "oncosimulpop"
samplePop(p2)
``` 
  

Above, we have kept the complete history of the simulations as you can
check by doing, for instance

```{r}
tail(pancrPop[[1]]$pops.by.time)
``` 


If we were not interested in the complete history of simulations we could
have done instead (note the argument `keepEvery = NA`)

```{r}
pancrPopNH <- oncoSimulPop(4, pancr,
                           detectionSize = 1e7,
                           keepEvery = NA,
                           mc.cores = 2)

summary(pancrPopNH)
samplePop(pancrPopNH)
``` 

which only keeps the very last sample:
```{r}
pancrPopNH[[1]]$pops.by.time
``` 

Or we could have used `oncoSimulSample`:

```{r}
pancrSamp <- oncoSimulSample(4, pancr)
pancrSamp$popSamp

``` 


Again, why the above differences? If we are only interested in the
final matrix of populations by mutations, keeping the complete
history the above is wasteful, because we store fully all of the
simulations (in the call to `oncoSimulPop`) and then sample (in
the call to `samplePop`).


## Whole-tumor and single-cell sampling, and do we always want to sample? {#alwayssamp}

`samplePop` is designed to emulate the process of obtaining a sample
from a (set of) "patient(s)". But there is no need to sample. The
history of the population, with a granularity that is controlled by
argument `keepEvery`, is kept in the matrix `pops.by.time` which
contains the number of cells of every clone at every sampling point
(see further details in \@ref(trackindivs)). This is the information
used in the plots that show the trajectory of a simulation: the
plots that show the change in genotype or driver abundance over time
(see section \@ref(plotraj) and examples mentioned there).

Regardless of whether and how you plot the information in `pops.by.time`,
you can also sample one or multiple simulations using `samplePop`. In
**whole-tumor** sampling the resolution is the whole tumor (or the whole
population). Thus, a key argument is `thresholdWhole`, the threshold for
detecting a mutation: a gene is considered mutated if it is altered in at
least "thresholdWhole" proportion of the cells in that simulation (at a
particular time point). This of course means that your "sampled genotype"
might not correspond to any existing genotype because we are summing over
all cells in the population. For instance, suppose that at the time we take
the sample there are only two clones in the population, one clone with a
frequency of 0.4 that has gene A mutated, and a second clone one with a
frequency of 0.6 that has gene B mutated. If you set `thresholdWhole` to
values $\leq 0.4$ the sampled genotype will show both A and B
mutated. **Single-cell** sampling is provided as an option in contrast to
whole-tumor sampling. Here any sampled genotype will correspond to an
existing genotype as you are sampling with single-cell resolution.


When `samplePop` is run on a set of simulated data of, say, 100 simulated
trajectories (100 "subjects"), it will produce a matrix with 100 rows (100
"subjects"). But if it makes sense in the context of your problem (e.g.,
multiple samples per patient?) you can of course run `samplePop` repeatedly.





## Differences between "samplePop" and "oncoSimulSample" {#diffsample}


`samplePop` provides two sampling times: "last" and
"uniform". It also allows you to sample at the first sample time(s) at
which the population(s) reaches a given size, which can be either the same
or different for each simulation (with argument `popSizeSample`).
"last" means to sample each individual in the very last time period of the
simulation. "uniform" means sampling each individual at a time chosen
uniformly from all the times recorded in the simulation between the time
when the first driver appeared and the final time period. "unif" means
that it is almost sure that different individuals will be sampled at
different times. "last" does not guarantee that different individuals will
be sampled at the same time unit, only that all will be sampled in the
last time unit of their simulation.



With `oncoSimulSample` we obtain samples that correspond to
`timeSample = "last"` in `samplePop` by specifying a
unique value for `detectionSize` and
`detectionDrivers`. The data from each simulation will
correspond to the time point at which those are reached (analogous to
`timeSample = "last"`). How about uniform sampling? We pass a
vector of `detectionSize` and `detectionDrivers`,
where each value of the vector comes from a uniform distribution. This is
not identical to the "uniform" sampling of  `oncoSimulSample`,
as we are not sampling uniformly over all time periods, but are stopping
at uniformly distributed values over the stopping conditions. Arguably,
however, the procedure in `samplePop` might be closer to what we
mean with "uniformly sampled over the course of the disease" if that
course is measured in terms of drivers or size of tumor.


<!-- % As an example, if you look at the output above, the object "pancrSamp" -->
<!-- % contains some simulations that have only a few drivers because those -->
<!-- % simulations were set to run only until they had just a small number of -->
<!-- % cells. -->

An advantage of `oncoSimulSample` is that we can specify
arbitrary sampling schemes, just by passing the appropriate vector
`detectionSize` and `detectionDrivers`. A disadvantage
is that if we change the stopping conditions we can not just resample the
data, but we need to run it again.

There is no difference between `oncoSimulSample` and
`oncoSimulPop` + `samplePop` in terms of the
`typeSample` argument (whole tumor or single cell).


Finally, there are some additional differences between the two
functions. `oncoSimulPop` can run parallelized (it uses
`mclapply`). This is not done with `oncoSimulSample`
because this function is designed for simulation experiments where you
want to examine many different scenarios simultaneously. Thus, we provide
additional stopping criteria (`max.wall.time.total` and
`max.num.tries.total`) to determine whether to continue running the
simulations, that bounds the total running time of all the simulations in
a call to `oncoSimulSample`. And, if you are running multiple
different scenarios, you might want to make multiple, separate,
independent calls (e.g., from different R processes) to
`oncoSimulSample`, instead of relying in `mclapply`,
since this is likely to lead to better usage of multiple cores/CPUs if you
are examining a large number of different scenarios.






<!-- % ### `popSizeSample` and `oncoSimulPop`} -->
<!-- % \label{popss-osp} -->


<!-- % Suppose you run a set of simulations with a large `finalTime` with -->
<!-- % `oncoSimulPop`. Since you can pass to `samplePop` a -->
<!-- % vector for `popSizeSample` this might seem to resemble a -->
<!-- % deterministic version of the stochastic detection mechanism -->
<!-- % (\@ref(detectprob)): for each simulation, you would take a sample at the -->
<!-- % first time a given population size is reached. You should notice, however, -->
<!-- % that for any simulation you are not sampling with a probability of -->
<!-- % detection that increases with size. Moreover, you can end up with "NA" -->
<!-- % in your sample. Suppose you have a set of fitness such that a given -->
<!-- % simulation, say number 1, never went beyond total population size -->
<!-- % 1000. Now, if you pass a vector of `popSizeSample` that has a value -->
<!-- % larger than a 1000 for the first position, you will get an "NA": there -->
<!-- % never is a value where for population size where that size is reached.  -->


<!-- %% in an attempt to explain it, this just makes it too confusing. The -->
<!-- %% above is enough. -->
<!-- %% ## What if there is order? {#sim-order} -->

<!-- %% Consider the following example (I fix the seed and use a single core, so -->
<!-- %% no parallelization, to make sure we can reproduce the results) -->

<!-- %% <<>>= -->

<!-- %% oe8 <- allFitnessEffects(orderEffects = c( -->
<!-- %%                              "M > F > M" = 0, -->
<!-- %%                              "D > F > M" = 0.1, -->
<!-- %%                              "F > D > M" = 0.2 -->
<!-- %% ), -->
<!-- %%                       epistasis = c("D" = 0.02, "M" = 0.02, "F" = 0.02), -->
<!-- %%                         geneToModule = -->
<!-- %%                             c("Root" = "Root", -->
<!-- %%                               "M" = "m", -->
<!-- %%                               "F" = "f", -->
<!-- %%                               "D" = "d") ) -->

<!-- %% evalAllGenotypes(oe8) -->

<!-- %% set.seed(678)  -->
<!-- %% oe8P1 <- oncoSimulPop(8, oe8, -->
<!-- %%                      model = "Exp",  -->
<!-- %%                       detectionSize = 1e8, keepEvery = 10, mc.cores = 1) -->
<!-- %% lapply(oe8P1, print) -->




<!-- %% set.seed(678)  -->
<!-- %% oe8P1 <- oncoSimulPop(1, oe8, -->
<!-- %%                      model = "McFL",  -->
<!-- %%                       detectionDrivers = 2,  -->
<!-- %%                       keepEvery = 10, mc.cores = 1) -->
<!-- %% lapply(oe8P1, print) -->

<!-- %% @  -->




<!-- %% <<>>= -->

<!-- %% o8 <- allFitnessEffects(orderEffects = c( -->
<!-- %%                             "F > D" = 0, -->
<!-- %%                             "D > F" = 0.14, -->
<!-- %%                             "D > M" = 0.13, -->
<!-- %%                             "F > M" = 0.12, -->
<!-- %%                             "M > D" = 0.15), -->
<!-- %%                       epistasis = c("D" = 0.01, "M" = 0.01, "F" = 0.02), -->
<!-- %%                         geneToModule = -->
<!-- %%                             c("Root" = "Root", -->
<!-- %%                               "M" = "m", -->
<!-- %%                               "F" = "f", -->
<!-- %%                               "D" = "d") ) -->

<!-- %% evalAllGenotypes(o8) -->

<!-- %% set.seed(678)  -->
<!-- %% o8P1 <- oncoSimulPop(8, o8, -->
<!-- %%                      model = "Exp", keepEvery = 10, mc.cores = 1) -->
<!-- %% ## lapply(o8P1, print) -->
<!-- %% @  -->

<!-- %% Now, if we look at the sixth population we see -->

<!-- %% <<>>= -->
<!-- %% o8P1[[6]] -->
<!-- %% @  -->

<!-- %% Obviously, in terms of the genes that are mutated, both "d, f, m" and -->
<!-- %% "d, m, f" have the same genes mutated so if we sample, for instance doing -->

<!-- %% <<>>= -->

<!-- %% @  -->





<!-- %% o9 <- allFitnessEffects(orderEffects = c( -->
<!-- %%                             "F > D > M" = 0, -->
<!-- %%                             "D > F > M" = 0.14, -->
<!-- %%                             "D > M > F" = 0.13, -->
<!-- %%                             "D > M"     = 0.12, -->
<!-- %%                             "M > D"     = 0.15), -->
<!-- %%                       epistasis = c("D:-M" = 0.05, "M:-D" = 0.04), -->
<!-- %%                         geneToModule = -->
<!-- %%                             c("Root" = "Root", -->
<!-- %%                               "M" = "m", -->
<!-- %%                               "F" = "f", -->
<!-- %%                               "D" = "d") ) -->


<!-- %% set.seed(11) -->
<!-- %% o9P1 <- oncoSimulPop(8, o9, -->
<!-- %%                      model = "Exp", keepEvery = 10, mc.cores = 1) -->
<!-- %% lapply(o9P1, print) -->


<!-- %% @  -->



<!-- %% ## Testing of mappings} -->

<!-- %% The mapping of restriction tables, epistasis, and order effects to -->
<!-- %% fitness, especially when there are modules, is a delicate part of the -->
<!-- %% code: reasonable cases are straightforward to deal with, but there are -->
<!-- %% many ways to shoot oneself in the foot. That is why we have placed lots of -->
<!-- %% pre- and post-condition checks in the code (both R and C++), and we have a -->
<!-- %% comprehensive set of tests in file zz. You are welcome to suggest more -->
<!-- %% tricky scenarios (and tests for them). -->



\clearpage

# Showing the genealogical relationships of clones {#phylog}

If you run simulations with `keepPhylog = TRUE`, the simulations keep track
of when every clone is generated, and that will allow us to see the
parent-child relationships between clones. (This is disabled by default).

<!-- In an abuse of terminology, we will use functions with the "phylog" -->
<!-- term, but note these are not proper phylogenies. (Though you could -->
<!-- construct proper phylogenies from the information kept). -->


Let us re-run a previous example:

```{r}

set.seed(15)
tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
                       model = "McFL", 
                       mu = 5e-5,
                       detectionSize = 1e8, 
                       detectionDrivers = 3,
                       sampleEvery = 0.025,
                       max.num.tries = 10,
                       keepEvery = 5,
                       initSize = 2000,
                       finalTime = 20000,
                       onlyCancer = FALSE,
                       extraTime = 1500,
                       keepPhylog = TRUE)
tmp
``` 

We can plot the parent-child relationships^[There are several
  packages in R devoted to phylogenetic inference and related issues. For
  instance, `r CRANpkg("ape")`. I have not used that infrastructure because of
  our very specific needs and circumstances; for instance, internal nodes
  are observed, we can have networks instead of trees, and we have no
  uncertainty about when events occurred.] of every clone ever created
(with fitness larger than 0 ---clones without viability are never shown):

```{r}
plotClonePhylog(tmp, N = 0)
``` 

However, we often only want to show clones that exist (have number of
cells $>0$) at a certain time (while of course showing all of their
ancestors, even if those are now extinct ---i.e., regardless of their
current numbers).

```{r}
plotClonePhylog(tmp, N = 1)
``` 

If we set `keepEvents = TRUE` the arrows show how many times each
clone appeared:

(The next can take a while)
```{r pcpkeepx1}
plotClonePhylog(tmp, N = 1, keepEvents = TRUE)
``` 

And we can show the plot so that the vertical axis is proportional to time
(though you might see overlap of nodes if a child node appeared shortly
after the parent):

```{r}
plotClonePhylog(tmp, N = 1, timeEvents = TRUE)
``` 

We can obtain the adjacency matrix doing

```{r, fig.keep="none"}
get.adjacency(plotClonePhylog(tmp, N = 1, returnGraph = TRUE))

``` 


We can see another example here:

```{r}

set.seed(456)
mcf1s <-  oncoSimulIndiv(mcf1,
                         model = "McFL", 
                         mu = 1e-7,
                         detectionSize = 1e8, 
                         detectionDrivers = 100,
                         sampleEvery = 0.025,
                         keepEvery = 2,
                         initSize = 2000,
                         finalTime = 1000,
                         onlyCancer = FALSE,
                         keepPhylog = TRUE)

``` 

Showing only clones that exist at the end of the simulation (and all their
parents):

```{r}
plotClonePhylog(mcf1s, N = 1)
``` 

Notice that the labels here do not have a "_", since there were no order
effects in fitness. However, the labels show the genes that are
mutated, just as before.

Similar, but with vertical axis proportional to time:


```{r}
par(cex = 0.7)
plotClonePhylog(mcf1s, N = 1, timeEvents = TRUE)
``` 

What about those that existed in the last 200 time units?
```{r}
par(cex = 0.7)
plotClonePhylog(mcf1s, N = 1, t = c(800, 1000))
``` 

And try now to show also when the clones appeared (we restrict the time
to between 900 and 1000, to avoid too much clutter):
```{r}
par(cex = 0.7)
plotClonePhylog(mcf1s, N = 1, t = c(900, 1000), timeEvents = TRUE)
``` 

<!-- % % %% For poster -->

<!-- % <<>>= -->

<!-- % pdf("phylog-clone1.pdf", width = 4, height = 3.4) -->
<!-- % par(cex = 0.4); plotClonePhylog(mcf1s, N = 1, t = c(900, 1000), timeEvents = TRUE) -->
<!-- % dev.off() -->

<!-- % @  -->


(By playing with `t`, it should be possible to obtain animations of
the phylogeny. We will not pursue it here.)


If the previous graph seems cluttered, we can represent it in a different
way by calling `r CRANpkg("igraph")` directly after storing the graph and using
the default layout:

```{r fig.keep="none"}
g1 <- plotClonePhylog(mcf1s, N = 1, t = c(900, 1000),
                      returnGraph = TRUE)
``` 

```{r}
plot(g1)
``` 

which might be easier to show complex relationships or identify central or
key clones.


It is of course quite possible that, especially if we consider few genes,
the parent-child relationships will form a network, not a tree, as the
same child node can have multiple parents. You can play with this example,
modified from one we saw before (section \@ref(mn1)):

```{r, eval=FALSE}
op <- par(ask = TRUE)
while(TRUE) {
    tmp <- oncoSimulIndiv(smn1, model = "McFL",
                          mu = 5e-5, finalTime = 500,
                          detectionDrivers = 3,
                          onlyCancer = FALSE,
                          initSize = 1000, keepPhylog = TRUE)
    plotClonePhylog(tmp, N = 0)
}
par(op)
``` 


## Parent-child relationships from multiple runs {#phylogmult}

If you use `oncoSimulPop` you can store and plot the
"phylogenies" of the different runs:

```{r}

oi <- allFitnessEffects(orderEffects =
               c("F > D" = -0.3, "D > F" = 0.4),
               noIntGenes = rexp(5, 10),
                          geneToModule =
                              c("F" = "f1, f2, f3",
                                "D" = "d1, d2") )
oiI1 <- oncoSimulIndiv(oi, model = "Exp")
oiP1 <- oncoSimulPop(4, oi,
                     keepEvery = 10,
                     mc.cores = 2,
                     keepPhylog = TRUE)

``` 

We will plot the first two:
```{r, fig.height=9}

op <- par(mar = rep(0, 4), mfrow = c(2, 1))
plotClonePhylog(oiP1[[1]])
plotClonePhylog(oiP1[[2]])
par(op)

``` 


This is so far disabled in function `oncoSimulSample`, since
that function is optimized for other uses. This might change in the future.


\clearpage

# Generating random fitness landscapes {#gener-fit-land}

## Random fitness landscapes from a Rough Mount Fuji model {#rmfmodel}
In most of the examples seen above, we have fully specified the
fitness of the different genotypes (either by providing directly the
full mapping genotypes to fitness, or by providing that mapping by
specifying the effects of the different gene combinations). In some
cases, however, we might want to specify a particular model that
generates the fitness landscape, and then have fitnesses be random
variables obtained under this model. In other words, in this random
fitness landscape the fitness of the genotypes is a random variable
generated under some specific model.  Random fitness landscapes are
used extensively, for instance, to understand the evolutionary
consequences of different types of epistatic interactions 
[e.g., @szendro_quantitative_2013; @franke_evolutionary_2011] and there are
especially developed tools for plotting and analyzing random fitness
landscapes [e.g., @brouillet_magellan:_2015].



With OncoSimulR it is possible to generate mappings of genotype to fitness
using the function `rfitness` that allows you to use from a pure House of
Cards model to a purely additive model (see \@ref(nkmodel) for NK model). I
have followed @szendro_quantitative_2013 and @franke_evolutionary_2011 and
model fitness as

<!-- $$ -->
<!--   f_i = -c d(i, reference) + x_i -->
<!-- $$ {#eq:1} -->


\begin{equation}
  f_i = -c\ d(i, reference) + x_i
  \label{eq:1}
\end{equation} 


where $d(i, j)$ is the Hamming distance between genotypes $i$ and $j$ (the
number of positions that differ), $c$ is the decrease in fitness of a
genotype per each unit increase in Hamming distance from the reference
genotype, and $x_i$ is a random variable (in this
case, a normal deviate of mean 0 and standard deviation $sd$).  You can
change the reference genotype to any of the genotypes: for the
deterministic part, you make the fittest genotype be the one with all
positions mutated by setting `reference = "max"`, or use the
wildtype by using a string of 0s, or randomly select a genotype as a
reference by using `reference = "random"` or `reference = "random2"`.  
And by changing $c$ and $sd$ you can flexibly modify the relative weight
of the purely House of Cards vs. additive component. The expression used
above is also very similar to the one on @greene_changing_2014 if you use
`rfitness` with the argument `reference = "max"`.


What can you do with these genotype to fitness mappings? You could plot
them, you could use them as input for `oncoSimulIndiv` and related
functions, or you could export them (`to_Magellan`) and plot them externally
(e.g., in MAGELLAN: <http://wwwabi.snv.jussieu.fr/public/Magellan/>,
@brouillet_magellan:_2015).


```{r}
## A small example
rfitness(3)

## A 5-gene example, where the reference genotype is the
## one with all positions mutated, similar to Greene and Crona,
## 2014.  We will plot the landscape and use it for simulations
## We downplay the random component with a sd = 0.5

r1 <- rfitness(5, reference = rep(1, 5), sd = 0.6)
plot(r1)
oncoSimulIndiv(allFitnessEffects(genotFitness = r1))
``` 

<!-- % % % % %% For poster -->

<!-- % <<>>= -->

<!-- % pdf(file = "fl1.pdf", height = 5.8, width = 5) -->
<!-- % plot(r1, use_ggrepel = TRUE) -->
<!-- % dev.off() -->

<!-- % @  -->


## Random fitness landscapes from Kauffman's NK model {#nkmodel}

You can also use Kauffman's NK model [e.g., @ferretti_measuring_2016;
@brouillet_magellan:_2015]. We call the function `fl_generate` from
MAGELLAN [@brouillet_magellan:_2015].


```{r nkex1}
rnk <- rfitness(5, K = 3, model = "NK")
plot(rnk)
oncoSimulIndiv(allFitnessEffects(genotFitness = rnk))
```

## Epistasis and fitness landscape statistics {#epistats}

We can call MAGELLAN's [@brouillet_magellan:_2015] `fl_statistics` to
obtain fitness landscape statistics, including measures of sign and
reciprocal sign epistasis. See the help of `Magellan_stats` for further
details on output format. For example:

```{r magstats1}
rnk1 <- rfitness(6, K = 1, model = "NK")
Magellan_stats(rnk1)

rnk2 <- rfitness(6, K = 4, model = "NK")
Magellan_stats(rnk2)
```

\clearpage

# Measures of evolutionary predictability and genotype diversity {#evolpredszend}

Several measures of evolutionary predictability have been proposed in the
literature (see, e.g., @szendro_predictability_2013 and references
therein). We provide two, Lines of Descent (LOD) and Path of the Maximum
(POM), following @szendro_predictability_2013; we also provide a simple
measure of diversity of the actual genotypes sampled.

In @szendro_predictability_2013 "(...) paths defined as the time
ordered sets of genotypes that at some time contain the largest
subpopulation" are called "Path of the Maximum" (POM) (see their
p. 572). In our case, POM are obtained by finding the clone with
largest population size whenever we sample and, thus, <!-- from the
`pops.by.time` --> <!-- returned object (i.e., from the genotypes at
each of the sampling --> <!-- times) and, thus, --> the POMs will be
affected by how often we sample (argument `sampleEvery`), since we
are running a continuous time process.


@szendro_predictability_2013 also define Lines of Descent (LODs)
which "(...)  represent the lineages that arrive at the most
populated genotype at the final time". In that same page (572) they
provide the details on how the LODs are obtained. Starting with
version 2.9.2 of OncoSimulR I only provide an implementation where a
single LOD per simulation is returned, with the same meaning as in
@szendro_predictability_2013.

<!-- I provide two -->
<!-- implementations, one where a single LOD per simulation is returned, -->
<!-- with the same meaning as in @szendro_predictability_2013, and -->
<!-- another where I keep all the paths that "(...) arrive at the most -->
<!-- populated genotype at the final time" (first paragraph in p. 572 of -->
<!-- Szendro et al.). We can also obtain a single LOD that is the first -->
<!-- path to arrive at the genotype that eventually becomes the most -->
<!-- populated genotype at the final time (and, in this sense, agrees -->
<!-- with the LOD of Szendro et al.). See the help file for details. -->
<!-- Obtaining LOD requires that the simulations be run with `keepPhylog -->
<!-- = TRUE` (we need the genealogy of clones). -->


<!-- My implementation is not exactly identical to --> <!-- the
definition given in p. 572 of Szendro et al. First, in case this
might --> <!-- be useful, for each simulation I keep all the paths
that "(...) arrive at --> <!-- the most populated genotype at the
final time" (first paragraph in p. 572 --> <!-- of Szendro et al.),
whereas they only keep one (see second column of --> <!--
p. 572). However, I do provide a single LOD for each run, too. This
is the --> <!-- first path to arrive at the genotype that eventually
becomes the most --> <!-- populated genotype at the final time (and,
in this sense, agrees with the --> <!-- LOD of Szendro et
al.). However, in contrast to what is apparently done in --> <!--
Szendro ("A given genotype may undergo several episodes of
colonization --> <!-- and extinction that are stored by the
algorithm, and the last episode --> <!-- before the colonization of
the final state is used to construct the --> <!-- step."), I do not
check that this genotype (which is the one that will --> <!-- become
the most populated at final time) does not become extinct before -->
<!-- the final colonization. So there could be other paths (all in
`all_paths`) --> <!-- that are actually the one(s) that are
colonizers of the most populated --> <!-- genotype (with no
extinction before the final --> <!-- colonization). -->


<!-- My implementation is not exactly identical to their -->
<!-- definition. My `lod_single` is the first path to arrive at the genotype that -->
<!-- eventually becomes the most populated genotype at the final time (and, in -->
<!-- this sense, agrees with the LOD of Szendro et al.). However, in contrast to -->
<!-- what is apparently done in Szendro et al. ("A given genotype may undergo -->
<!-- several episodes of colonization and extinction that are stored by the -->
<!-- algorithm, and the last episode before the colinization of the final state -->
<!-- is used to construct the step."), I do not check that this genotype (which -->
<!-- is the one that will become the most populated at final time) does not -->
<!-- become extinct before the final colonization. So there could be other paths -->
<!-- that are actually the one(s) that are colonizers of the most populated -->
<!-- genotype (with no extinction before the final colonization). Nevertheless, -->
<!-- all the paths that "(...) arrive at the most populated genotype at the final -->
<!-- time" (first paragraph in p. 572 of Szendro et al.), are stored (in the -->
<!-- `all_paths` list returned). (This differs from Szendro et al., as they only -->
<!-- keep one ---see second column of p. 572). -->

<!-- As just said, in case this might be useful, for each simulation I keep all -->
<!-- the paths that "(...) arrive at the most populated genotype at the final -->
<!-- time" (first paragraph in p. 572 of Szendro et al.), whereas they only . -->


To briefly show some output, we will use again the \@ref(pancreas)
example. <!-- Note that using LOD requires running the simulations with -->
<!-- `keepPhylog = TRUE`. -->


```{r lod_pom_ex}
pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), "SMAD4", "CDNK2A", 
                          "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDNK2A", 
                         "TP53", "MLL3",
                         rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.05, sh = -0.3, typeDep = "MN"))

pancr16 <- oncoSimulPop(16, pancr, model = "Exp",
                        mc.cores = 2)

## Look a the first POM 
str(POM(pancr16)[1:3])

LOD(pancr16)[1:2]

## The diversity of LOD (lod_single) and POM might or might not
## be identical
diversityPOM(POM(pancr16))
diversityLOD(LOD(pancr16))

## Show the genotypes and their diversity (which might, or might
## not, differ from the diversity of LOD and POM)
sampledGenotypes(samplePop(pancr16))

```


\clearpage


# Generating random DAGs for restrictions {#simo}

You might want to randomly generate DAGs like those often found in the
literature on Oncogenetic trees et al. Function `simOGraph`
might help here. 

```{r}
## No seed fixed, so reruns will give different DAGs.
(a1 <- simOGraph(10))
library(graph) ## for simple plotting
plot(as(a1, "graphNEL"))
``` 

Once you obtain the adjacency matrices, it is for now up to you to convert
them into appropriate posets or fitnessEffects objects.



Why this function? I searched for, and could not find any that did what I
wanted, in particular bounding the number of parents, being able to
specify the approximate depth^[Where depth is defined in the usual
  way to mean smallest number of nodes ---or edges--- to traverse to get
  from the bottom to the top of the DAG.] of the graph, and optionally
being able to have DAGs where no node is connected to another both
directly (an edge between the two) and indirectly (there is a path between
the two through other nodes). So I wrote my own code. The code is fairly
simple to understand (all in file `generate-random-trees.R`). I
would not be surprised if this way of generating random graphs has been
proposed and named before; please let me know, best if with a reference.


Should we remove direct connections if there are indirect? Or,
should we set `removeDirectIndirect = TRUE`? Setting
`removeDirectIndirect = TRUE` is basically asking for
the
[transitive reduction](https://en.wikipedia.org/wiki/Transitive_reduction) of
the generated DAG. Except for @Farahani2013 and
@ramazzotti_capri_2015, none of the DAGs I've seen in the context of
CBNs, Oncogenetic trees, etc, include both direct and indirect
connections between nodes. If these exist, reasoning about the model
can be harder. For example, with CBN (AND or CMPN or monotone
relationships) adding a direct connection makes no difference iff we
assume that the relationships encoded in the DAG are fully respected
(e.g., all $s_h = -\infty$). But it can make a difference if we
allow for deviations from the monotonicity, specially if we only
check for the satisfaction of the presence of the immediate
ancestors. And things get even trickier if we combine XOR with
AND. <!-- The code for computing fitness, however, should deal with all -->
<!-- of this just fine. --> Thus, I strongly suggest you leave the
default `removeDirectIndirect = TRUE`. If you change it, you should
double check that the fitnesses of the possible genotypes are what
you expect. In fact, I would suggest that, to be sure you get what
you think you should get, you convert the fitness from the DAG to a
fitness table, and pass that to the simulations, and this requires
using non-exposed user functions; to give you an idea, this could
work (but you've been warned: this is dangerous!)


```{r simographindirect, eval=FALSE,echo=TRUE}
g2 <- simOGraph(4, out = "rT", removeDirectIndirect = FALSE)

fe_from_d <- allFitnessEffects(g2)
fitness_d <- evalAllGenotypes(fe_from_d)

fe_from_t <- allFitnessEffects(genotFitness =
                          OncoSimulR:::allGenotypes_to_matrix(fitness_d))
						  
## Compare
fitness_d
(fitness_t <- evalAllGenotypes(fe_from_t))

identical(fitness_d, fitness_t)
						  

## ... but to be safe use fe_from_t as the fitnessEffects object for simulations

```




\clearpage


# FAQ, odds and ends 

## What we mean by "clone"; and "I want clones disregarding passengers" {#meaningclone}

In this vignette we often use "clone" or "genotype" interchangeably. A
clone denotes a set of cells that have identical genotypes. So if you are
using a fitness specification with four genes (i.e., your genome has only
four loci), there can be up to $16 = 2^4$ different genotypes or
clones. Any two entities that differ in the genotype are different
clones. And this applies regardless of whether or not you declare that
some genes (loci) are drivers or not. So if you have four genes, it does
not matter whether only the first or all four are regarded as drivers; you
will always have at most 16 different clones or 16 different genotypes.
Of course you can arrive at the same clone/genotype by different
routes. Just think about loci A and B in our four-loci genome, and how you
can end up with a cell with both A and B mutated.

Analogously, if you have 100 genes, 10 drivers and 90 passengers, you can
have up to $2^{100}$ different clones or genotypes. Sure, one cell might
have driver A mutated and passenger B mutated, and another cell might have
driver A mutated and passenger C mutated. So if you only look at drivers
you might be tempted to say that they are "the same clone for all
practical purposes"; but they really are not the same clone as they differ
in their genotype and this makes a lot of difference computationally. 

If you want summaries of simulations that collapse over some genes
(say, some "passengers", the 90 passengers we just mentioned) look
at the help for `samplePop`, argument `geneNames`.  This would allow
you, for instance, to look at the diversity of clones/genotypes,
considering as identical those genotypes that only differ in genes
you deem relevant; something similar to defining a "drivers' clone"
as the set formed from the union of all sets of cells that have
identical genotype with respect to only the drivers (so that in the
example of "A, B" and "A, C" just mentioned both cells would be
considered "the same clone" as they only differ with respect to
passengers). However, this "disregard some genes" only applies to
summaries of simulations once we are done simulating
data. OncoSimulR will always track clones, as defined above,
regardless of whether many of those clones have the same genotype if
you were to only focus on driver genes; see also section
\@ref(trackindivs).

Labeling something as a "driver", therefore, does not affect what we mean
by clone. Yes, labeling something as a driver can affect when you stop
simulations if you use `detectionDrivers` as a stopping mechanism (see
section \@ref(endsimul)). But, again, this has nothing to do with the
definition of "clone".

<!-- Below we will see the expression haplotype. I am hesitant to use it -->
<!-- because haplotypes often have implicit the idea of having been in -->
<!-- Strictly, we keep track of clones, not haplotypes as -->
<!-- the haplotype refers to inheritance. But you can arrive at the same -->
<!-- clone through different routes so that two cells that are the same -->
<!-- clone could have originated from different immediate parents. Just -->
<!-- think of the ways you can get to a cell with both genes A and B -->
<!-- mutated.  -->

If this is all obvious to you, ignore it. I am adding it here because I've
seen strange misunderstandings that eventually could be traced to the
apparently multiple meanings of clone. (And to make the story complete,
@Mather2012 use the expression "class" ---e.g., Algorithm 4 in the paper,
Algorithm 5 in the supplementary material).



## Does OncoSimulR keep track of individuals or of clones? And how can it keep track of such large populations? {#trackindivs}

OncoSimulR keeps track of clones, where a clone is a set of cells
that are genetically identical (note that this means completely
identical over the whole set of genes/markers you are using; see
section \@ref(meaningclone)). We do not need to keep track of
individual cells because, for all purposes, and since we do not
consider spatial structure, two or more cells that are genetically
identical are interchangeable. This means, for instance, that the
computational cost of keeping a population of a single clone with 1
individual or with $10^9$ individuals is exactly the same: we just
keep track of the genotype and the number of cells. (Sure, it is
much more likely we will see a mutation soon in a clone with $10^9$
cells than in a clone with 1, but that is a different issue.)


Of course, the entities that die, reproduce, and mutate are
individual cells. This is of course dealt with by tracking clones
(as is clearly shown by Algorithms 4 and 5 in @Mather2012). Tracking
individuals, as individuals, would provide no advantage, but would
increase the computational burden by many orders of magnitude.




<!-- As we are interested in examining the effects of selection, mutation, -->
<!-- keeping track of the parent-child relationships between clones, etc, and -->
<!-- thus we must keep track of the complete set of clones. -->



<!-- (or haplotype if -->
<!-- you want, as that is the term used in the reference below ---but see -->
<!-- \@(meaningclone)) -->



<!-- What we have explained is an obvious "trick" that has been used for -->
<!-- a long time.  For instance, John H. Gillespie in his classical 1993 -->
<!-- paper @Gillespie1993 writes in p. 972: -->



<!-- > Allelic genealogies are represented in the computer by a rooted -->
<!-- > tree, each node of which is a unique haplotype.  A haplotype node is -->
<!-- > a data structure with pointers to parent and sibling nodes and with -->
<!-- > values of the current abundance and selection coefficient of the -->
<!-- > haplotype.  Each node also records the generation at which the -->
<!-- > haplotype first appeared in the population with its mutant site -->
<!-- > (the origination time of the site) and, should the mutation become -->
<!-- > fixed-the node becomes the root node for all alleles in the -->
<!-- > population-the fixation time of the site.  When haplotypes without -->
<!-- > descendents are lost from the population, the allelic genealogy is -->
<!-- > pruned to free computer memory.  When the simulation is completed, -->
<!-- > the properties of the origination and fixation point processes may -->
<!-- > be studied by 'climbing' the tree and recording the origination and -->
<!-- > fixation times of sites. -->

### `sampleEvery`, `keepPhylog`, and pruning {#prune}

At each sampling time (where `sampleEvery` determines the time units
between sampling times) the abundance of all the clones with number of
cells $>0$ is recorded. This is the structure that at the end of the run
is converted into the `pops.by.time` matrix.

Now, some clones might arise from mutation between successive
population samples but these clones might be extinct by the time we
take a population sample. These clones do not appear in the
`pops.by.time` matrix because, as we just said, they have 0 cells at
the time of sampling. Of course, some of these clones might appear
again later and reach a size larger than 0 at some posterior
sampling time; it is at this time when this/these clone(s) will
appear in the `pops.by.time` matrix. This pruning of clones with 0
cells can allow considerable savings in computing time (OncoSimulR
needs to track the genotype of clones, their population sizes,
their birth, death, and mutation rates, their next mutation time and
the last time they were updated and thus it is important that we
only loop over structures with information that is really needed).

However, we still need to track clones as clones, not simply as
classes such as "number of mutated genes". Therefore, very large
genomes can represent a problem if they lead to the creation and
tracking of many different clones (even if they have the same number
of mutated genes), as we have seen, for instance, in section
\@ref(lnum). In this case, programs that only keep track of numbers
of mutated genes or of drivers, not individual clones, can of course
achieve better speed.

<!-- Again, e that we never prune any -->
<!-- clone that had a population size larger than zero at any sampling -->
<!-- period (so they are reflected in the `pops.by.time` matrix in the -->
<!-- output).  -->

What about the genealogy? If you ask OncoSimulR to keep track of the
complete parent-child relationships (`keepPhylog = TRUE`), you might
see in the genealogy clones that are not present in `pops.by.time`
if these are clones that never had a population size larger than 0
at any sampling time. To give an example, suppose that we will take
population samples at times 0, 1, and 2. Clone A, with a population
size larger than 0 at time 1, gives rise at time 1.5 to clone B;
clone B then gives rise to clone C at time 1.8. Finally, suppose
that at time 2 only clone C is alive. In other words, when we carry
out the update of the population with Algorithm 5 from @Mather2012,
clones A and B have size 0. Now, at time 1 clones B and C did not
yet exist, and clone B is never alive at times 1 or 2. Thus, clone B
is not present in `pops.by.time`. But we cannot remove clone B from
our genealogy if we want to reflect the complete genealogy of C. Thus,
`pops.by.time` will show only clones A and C (not B) but the
complete genealogy will show clones A, B, C (and will show that B
appeared from A at time 1.5 and C appeared from B at time
1.8). Since function `plotClonePhylog` offers a lot of flexibility
with respect to what clones to show depending on their population
sizes at different times, you can prevent being shown B, but its
existence is there should you need it (see also
\@ref(histlargegenes)).



<!-- Keeping track of clones (not individuals) and erasing clones that have no -->
<!-- descendants from the `pops.by.time` matrix (and other internal structures -->
<!-- in the C++ code) therefore leads to important memory and time -->
<!-- savings.  -->





## Dealing with errors in "oncoSimulPop" {#errorosp}

When running OncoSimulR under Windows `mclapply` does not use
multiple cores, and errors from `oncoSimulPop` are reported
directly. For example:

```{r}
## This code will only be evaluated under Windows
if(.Platform$OS.type == "windows")
    try(pancrError <- oncoSimulPop(10, pancr,
                               initSize = 1e-5,
                               detectionSize = 1e7,
                               keepEvery = 10,
                               mc.cores = 2))
``` 


Under POSIX operating systems (e.g., GNU/Linux or Mac OSX)
`oncoSimulPop` can ran parallelized by calling
`mclapply`. Now, suppose you did something like

```{r}
## Do not run under Windows
if(.Platform$OS.type != "windows")
    pancrError <- oncoSimulPop(10, pancr,
                               initSize = 1e-5,
                               detectionSize = 1e7,
                               keepEvery = 10,
                               mc.cores = 2)
``` 

The warning you are seeing tells you there was an error in the functions
called by `mclapply`. If you check the help for
`mclpapply` you'll see that it returns a try-error object, so we
can inspect it. For instance, we could do:

```{r, eval=FALSE}
pancrError[[1]]
``` 

But the output of this call might be easier to read:

```{r, eval=FALSE}
pancrError[[1]][1]
``` 
 
And from here you could see the error that was returned by
`oncoSimulIndiv`: `initSize < 1` (which is indeed true:
we pass `initSize = 1e-5`).


## Whole tumor sampling, genotypes, and allele counts: what  gives? And what about order? {#wtsampl}

You are obtaining genotypes, regardless of order.  When we use "whole
tumor sampling", it is the frequency of the mutations in each gene that
counts, not the order. So, for instance, "c, d" and "c, d" both
contribute to the counts of "c" and "d". Similarly, when we use single
cell sampling, we obtain a genotype defined in terms of mutations, but
there might be multiple orders that give this genotype. For example, $d >
c$ and $c > d$ both  give you a genotype with "c" and "d" mutated, and
thus in the output you can have two columns with both genes mutated.


## Doesn't the BNB algorithm require small mutation rates for it to be advantageous? {#bnbmutation}

As discussed in the original paper by @Mather2012 (see also their
supplementary material), the BNB algorithm can achieve considerable
speed advantages relative to other algorithms especially when
mutation events are rare relative to birth and death events; the
larger the mutation rate, the smaller the gains compared to other
algorithms. As mentioned in their supplementary material (see p.5)
"Note that the 'cost' of each step in BNB is somewhat higher than in
SSA [SSA is the original Gillespie's Stochastic Simulation Algorithm]
since it requires generation of several random numbers as compared
to only two uniform random numbers for SSA. However this cost
increase is small compared with significant benefits of jumping over
birth and death reactions for the case of rare mutations."

Since the earliest versions, OncoSimulR has provided information to
assess these issues. The output of function `oncoSimulIndiv`
includes a list called *"other"* that itself includes two lists
named *"minDMratio"* and *"minBMratio"*, the smallest ratio, over
all simulations, of death rate to mutation rate or birth rate to
mutation rate, respectively. As explained above, the BNB algorithm
thrives when those are large. Note, though, we say "it thrives":
these ratios being large is not required for the BNB algorithm to be
an exact simulation algorithm; these ratios being large make BNB
comparatively much faster than other algorithms.



## Can we use the BNB algorithm with state-dependent birth or death rates? {#bnbdensdep}

As discussed in the original paper by @Mather2012 (see sections 2.6
and 3.2 of the paper and section E of the supplementary material),
the BNB algorithm can be used as an approximate stochastic
simulation algorithm "(...) with non-constant birth, death, and
mutation rates by evolving the system with a BNB step restricted to
a short duration t." (p. 9 in supplementary material). The
justification is that "(...) the propensities for reactions can be
considered approximately constant during some short interval."
(p. 1234). This is the reason why, when we use McFarland's model, we
set a very short `sampleEvery`. In addition, the output of the
simulation functions contains the simple summary statistic `errorMF`
that can be used to assess the quality of the
approximation[^errorMF].

[^errorMF]: Death rates are affected by density dependence and,
    thus, it is on the death rates where the approximation that they
    are constant over a short interval plays a role. Thus, we
    examine how large the difference between successive death rates
    is. More precisely, let $A$ and $C$ denote two successive
    sampling periods, with $D_A = log(1 + N_A/K)$ and $D_C= log(1 +
    N_C/K)$ their death rates. `errorMF_size` stores the largest
    $abs(D_C - D_A)$ between any two sampling periods ever seen
    during a simulation. `errorMF` stores the largest $abs(D_C -
    D_A)/D_A$.  Additionally, a simple procedure to use is to run
    the simulations with different values of `sampleEvery`, say the
    default value of 0.025 and values that are 10, 20, and 50 times
    larger or smaller, and assess their effects on the output of the
    simulations and the `errorMF` statistic itself. You can check
    that using a `sampleEvery` much smaller than 0.025 rarely makes
    any difference in `errorMF` or in the simulation output (though
    it increases computing time significantly). And, just for the
    fun of it, you can also check that using huge values for
    `sampleEvery` can lead to trouble and will be manifested too in
    the simulation output with large and unreasonable jumps in total
    population sizes and sudden extinctions.


Note that, as the authors point out, approximations are common with
stochastic simulation algorithms when there is density dependence,
but the advantage of the BNB algorithm compared to, say, most
tau-leap methods is that clones of different population sizes are
treated uniformly. @Mather2012 further present results from
simulations comparing the BNB algorithm with the original direct SSA
method and the tau-leaps (see their Fig. 5), which shows that the
approximation is very accurate as soon as the interval between
samples becomes reasonably short. 


## Sometimes I get exceptions when running with mutator genes {#tomlinexcept}

Yes, sure, the following will cause an exception; this is similar to
the example used in \@ref(exmutantimut) but there is one crucial
difference:

```{r ex-tomlin1exc}
sd <- 0.1 ## fitness effect of drivers
sm <- 0 ## fitness effect of mutator
nd <- 20 ## number of drivers
nm <- 5  ## number of mutators
mut <- 50 ## mutator effect  THIS IS THE DIFFERENCE

fitnessGenesVector <- c(rep(sd, nd), rep(sm, nm))
names(fitnessGenesVector) <- 1:(nd + nm)
mutatorGenesVector <- rep(mut, nm)
names(mutatorGenesVector) <- (nd + 1):(nd + nm)

ft <- allFitnessEffects(noIntGenes = fitnessGenesVector,
                        drvNames = 1:nd)
mt <- allMutatorEffects(noIntGenes = mutatorGenesVector)

```


Now, simulate using the fitness and mutator specification. We fix
the number of drivers to cancer, and we stop when those numbers of
drivers are reached. Since we only care about the time it takes to
reach cancer, not the actual trajectories, we set `keepEvery = NA`:


```{r ex-tomlinexc2}
ddr <- 4
set.seed(2)
RNGkind("L'Ecuyer-CMRG")
st <- oncoSimulPop(4, ft, muEF = mt,
                   detectionDrivers = ddr,
                   finalTime = NA,
                   detectionSize = NA,
                   detectionProb = NA,
                   onlyCancer = TRUE,
                   keepEvery = NA, 
                   mc.cores = 2, ## adapt to your hardware
                   seed = NULL) ## for reproducibility
set.seed(NULL) ## return things to their "usual state"

```

What happened? That you are using five mutator genes, each with an
effect of multiplying by 50 the mutation rate. So the genotype with
all those five genes mutated will have an increased mutation rate of
$50^5 = 312500000$. If you set the mutation rate to the default of
$1e-6$ you have a mutation rate of 312 which makes no sense (and
leads to all sorts of numerical issues down the road and an early
warning).


Oh, but you want to accumulate mutator effects and have some, or the
early ones, have a large effects and the rest progressively smaller
effects? You can do that using epistatic effects for mutator effects.


## What are good values of `sampleEvery`? {#whatgoodsampleevery}

First, we need to differentiate between the McFarland and the
exponential models. If you use the McFarland model, you should read
section \@ref(bnbdensdep) but, briefly, the small default is
probably a good choice. 

With the exponential model, however, simulations can often be much
faster if `sampleEvery` is large. How large? As large as you can
make it. `sampleEvery` should not be larger than your desired
`keepEvery`, where `keepEvery` determines the resolution or
granularity of your samples (i.e., how often you take a snapshot of
the population). If you only care about the final state, then set
`keepEvery = NA`.

The other factors that affects choosing a reasonable `sampleEvery`
are mutation rate and population size. If population growth is very
fast or mutation rate very large, you need to sample frequently to
avoid the "Recoverable exception ti set to DBL_MIN. Rerunning."
issue (see discussion in section \@ref(popgtzx)).


## What can you do with the simulations?

This is up to you. In section \@ref(sample-1) we show an example
where we infer an Oncogenetic tree from simulated data and in
section \@ref(whatfor) we go over a varied set of scientific
questions where OncoSimulR could help.


<!-- %% As an example, we can try to infer an oncogenetic tree -->
<!-- %% (and plot it) using the `r CRANpkg("Oncotree")` package @Oncotree after -->
<!-- %% getting a quick look at the marginal frequencies of events: -->

<!-- %% <<fig.width=4, fig.height=4>>= -->
<!-- %% colSums(pancrSamp)/nrow(pancrSamp) -->

<!-- %% require(Oncotree) -->
<!-- %% otp <- oncotree.fit(pancrSamp) -->
<!-- %% plot(otp) -->
<!-- %% @  -->

\clearpage

# Using v.1 posets and simulations {#v1}

It is strongly recommended that you use the new (v.2) procedures for
specifying fitness effects. However, the former v.1 procedures are still
available, with only very minor changes to function calls. What follows
below is the former vignette. You might want to use v.1 because for
certain models (e.g., small number of genes, with restrictions as
specified by a simple poset) simulations might be faster with v.1 (fitness
evaluation is much simpler ---we are working on further improving speed).

## Specifying restrictions: posets {#poset}

How to specify the restrictions is shown in the help for
`poset`. It is often useful, to make sure you did not make any
mistakes, to plot the poset. This is from the examples (we use an "L"
after a number so that the numbers are integers, not doubles; we could
alternatively have modified `storage.mode`).

```{r, fig.height=3}
## Node 2 and 3 depend on 1, and 4 depends on no one
p1 <- cbind(c(1L, 1L, 0L), c(2L, 3L, 4L))
plotPoset(p1, addroot = TRUE)
``` 

```{r, fig.height=3}
## A simple way to create a poset where no gene (in a set of 15)
## depends on any other.
p4 <- cbind(0L, 15L)
plotPoset(p4, addroot = TRUE)
``` 



Specifying posets is actually straightforward. For instance, we can
specify the pancreatic cancer poset in 
@Gerstung2011 (their figure 2B, left). We specify the poset using
numbers, but for nicer plotting we will use names (KRAS is 1, SMAD4 is 2,
etc). This example is also in the help for `poset`:

```{r, fig.height=3}
pancreaticCancerPoset <- cbind(c(1, 1, 1, 1, 2, 3, 4, 4, 5),
                               c(2, 3, 4, 5, 6, 6, 6, 7, 7))
storage.mode(pancreaticCancerPoset) <- "integer"
plotPoset(pancreaticCancerPoset,
          names = c("KRAS", "SMAD4", "CDNK2A", "TP53",
                    "MLL3","PXDN", "TGFBR2"))

```
## Simulating cancer progression {#simul1}


We can simulate the progression in a single subject. Using an example
very similar to the one in the help:


```{r, echo=FALSE,results='hide',error=FALSE}
options(width=60)
``` 

```{r}
## use poset p1101
data(examplePosets)
p1101 <- examplePosets[["p1101"]]

## Bozic Model
b1 <- oncoSimulIndiv(p1101, keepEvery = 15)
summary(b1)
``` 


The first thing we do is make it simpler (for future examples) to use a
set of restrictions. In this case, those encoded in poset p1101. Then, we
run the simulations and look at a simple summary and a plot. 

<!-- %% We explicitly -->
<!-- %% set `silent = TRUE` to prevent the vignette from filling up with -->
<!-- %% intermediate output. -->

If you want to plot the trajectories, it is better to keep more frequent
samples,  so you can see when clones appear:

```{r pb2bothx1,fig.height=5.5, fig.width=5.5}
b2 <- oncoSimulIndiv(p1101, keepEvery = 1)
summary(b2)
plot(b2)
``` 

As we have seen before, the stacked plot here is less useful and that is
why I do not evaluate that code for this vignette.

```{r pbssttx1,eval=FALSE}
plot(b2, type = "stacked")
``` 




The following is an example where we do not care about passengers, but we
want to use a different graph, and we want a few more drivers before
considering cancer has been reached. And we allow it to run for longer.
Note that in the McFL model `detectionSize` really plays no
role. Note also how we pass the poset: it is the same as before, but now
we directly access the poset in the list of posets.

```{r, echo=FALSE,eval=TRUE}
set.seed(1) ## for reproducibility. Once I saw it not reach cancer.
``` 
```{r}

m2 <- oncoSimulIndiv(examplePosets[["p1101"]], model = "McFL", 
                     numPassengers = 0, detectionDrivers = 8, 
                     mu = 5e-7, initSize = 4000, 
                     sampleEvery = 0.025,
                     finalTime = 25000, keepEvery = 5, 
                     detectionSize = 1e6) 
``` 

(Very rarely the above run will fail to reach cancer. If that
happens, execute it again.)


As usual, we will plot using both a line and a stacked plot:

```{r m2x1,fig.width=6.5, fig.height=10}
par(mfrow = c(2, 1))
plot(m2, addtot = TRUE, log = "",
     thinData = TRUE, thinData.keep = 0.5)
plot(m2, type = "stacked",
     thinData = TRUE, thinData.keep = 0.5)
``` 

The default is to simulate progression until a simulation reaches cancer
(i.e., only simulations that satisfy the detectionDrivers or the
detectionSize will be returned). If you use the McFL model with large
enough `initSize` this will often be the case but not if you use
very small `initSize`. Likewise, most of the Bozic runs do not
reach cancer. Lets try a few:

```{r}
b3 <- oncoSimulIndiv(p1101, onlyCancer = FALSE)
summary(b3)

b4 <- oncoSimulIndiv(p1101, onlyCancer = FALSE)
summary(b4)
``` 

Plot those runs:

```{r b3b4x1ch1, fig.width=8, fig.height=4}
par(mfrow = c(1, 2))
par(cex = 0.8) ## smaller font
plot(b3)
plot(b4)
``` 


### Simulating progression in several subjects

To simulate the progression in a bunch of subjects (we will use only
four, so as not to fill the vignette with plots) we can do, with the same
settings as above:

```{r ch2}
p1 <- oncoSimulPop(4, p1101, mc.cores = 2)
par(mfrow = c(2, 2))
plot(p1, ask = FALSE)
``` 

We can also use stream and stacked plots, though they might not be as
useful in this case. For the sake of keeping the vignette small, these are
commented out.
```{r p1multx1,eval=FALSE}
par(mfrow = c(2, 2))
plot(p1, type = "stream", ask = FALSE)
```

```{r p1multstx1,eval=FALSE}
par(mfrow = c(2, 2))
plot(p1, type = "stacked", ask = FALSE)
```


## Sampling from a set of simulated subjects {#sample-1}


You will often want to do something with the simulated data. For instance,
sample the simulated data. Here we will obtain the trajectories for 100
subjects in a scenario without passengers. Then we will sample with the
default options and store that as a vector of genotypes (or a matrix of
subjects by genes):


```{r}

m1 <- oncoSimulPop(100, examplePosets[["p1101"]], 
                   numPassengers = 0, mc.cores = 2)

``` 

The function `samplePop` samples that object, and also gives you
some information about the output:

```{r}
genotypes <- samplePop(m1)
``` 



What can you do with it? That is up to you. As an example, let us try to
infer an Oncogenetic tree (and plot it) using the `r CRANpkg("Oncotree")`
package @Oncotree after getting a quick look at the marginal
frequencies of events:

```{r fxot1,fig.width=4, fig.height=4}
colSums(genotypes)/nrow(genotypes)

require(Oncotree)
ot1 <- oncotree.fit(genotypes)
plot(ot1)
``` 

Your run will likely differ from mine, but with the defaults (detection
size of $10^8$) it is likely that events down the tree will never
appear. You can set `detectionSize = 1e9` and you will see that
events down the tree are now found in the cross-sectional sample.


Alternatively, you can use single cell sampling and that, sometimes,
recovers one or a couple more events.

```{r fxot2,fig.width=4, fig.height=4}
genotypesSC <- samplePop(m1, typeSample = "single")
colSums(genotypesSC)/nrow(genotypesSC)

ot2 <- oncotree.fit(genotypesSC)
plot(ot2)
``` 

You can of course rename the columns of the output matrix to something
else if you want so the names of the nodes will reflect those potentially
more meaningful names.

\clearpage



# Session info and packages used

This is the information about the version of R and packages used:
```{r}
sessionInfo()
``` 

# Funding

Supported by BFU2015-67302-R (MINECO/FEDER, EU).

```{r, echo=FALSE, eval=TRUE}
## reinitialize the seed
set.seed(NULL)
``` 

<!-- %\newpage -->
<!-- %%\bibliographystyle{apalike} %% or agsm or natbib? or apalike; maybe agsm -->
<!-- %% does the URL without turning into note? -->
<!-- %\bibliographystyle{apalike} %% or agsm or natbib? or apalike; maybe agsm -->

<!-- \bibliography{OncoSimulR} -->

# References


<!-- %% remember to use bibexport to keep just the minimal bib needed -->
<!-- %% bibexport -o extracted.bib OncoSimulR.aux -->
<!-- %% rm OncoSimulR.bib -->
<!-- %% mv extracted.bib OncoSimulR.bib -->
<!-- %% and then turn URL of packages into notes -->




<!-- Local Variables: -->
<!-- fill-column: 68 -->
<!-- End: -->

<!--  LocalWords:  genotype Genotyping -->