Browse code

2.5.5; vignette fixes

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/OncoSimulR@125127 bc3139a8-67e5-0310-9ffc-ced21a209358

Ramon Diaz-Uriarte authored on 14/12/2016 12:53:22
Showing6 changed files

... ...
@@ -1,8 +1,8 @@
1 1
 Package: OncoSimulR
2 2
 Type: Package
3 3
 Title: Forward Genetic Simulation of Cancer Progression with Epistasis 
4
-Version: 2.5.4
5
-Date: 2016-12-12
4
+Version: 2.5.5
5
+Date: 2016-12-14
6 6
 Authors@R: c(person("Ramon", "Diaz-Uriarte", role = c("aut", "cre"),
7 7
 		     email = "rdiaz02@gmail.com"),
8 8
 	      person("Mark", "Taylor", role = "ctb", email = "ningkiling@gmail.com"))
... ...
@@ -1,3 +1,6 @@
1
+Changes in version 2.5.5 (2016-12-14):
2
+	- Vignette: miscell changes (typos, etc)
3
+	
1 4
 Changes in version 2.5.4 (2016-12-12):
2 5
 	- Vignette: miscell changes (order of examples, typos, etc)
3 6
 
... ...
@@ -160,11 +160,11 @@ mutator effects.
160 160
  
161 161
  
162 162
  
163
-OncoSimulR was originally developed to simulate tumor progression using
164
-several models of tumor progression with emphasis on allowing users to set
165
-restrictions in the accumulation of mutations as specified, for example, by
166
-Oncogenetic Trees [OT: @Desper1999JCB; @Szabo2008] or Conjunctive Bayesian
167
-Networks [CBN: @Beerenwinkel2007; @Gerstung2009; @Gerstung2011], with the
163
+OncoSimulR was originally developed to simulate tumor progression with
164
+emphasis on allowing users to set restrictions in the accumulation of
165
+mutations as specified, for example, by Oncogenetic Trees
166
+[OT: @Desper1999JCB; @Szabo2008] or Conjunctive Bayesian Networks
167
+[CBN: @Beerenwinkel2007; @Gerstung2009; @Gerstung2011], with the
168 168
 possibility of adding passenger mutations to the simulations and allowing
169 169
 for several types of sampling.
170 170
 
... ...
@@ -195,22 +195,24 @@ of genes in a module. In addition, arbitrary numbers of genes without
195 195
 interactions (and with fitness effects coming from any distribution you
196 196
 might want) are also possible.
197 197
 
198
-Mutator/antimutator genes, genes that alter the mutation rate of other genes
199
-[@gerrish_complete_2007, @tomlinson_mutation_1996], can also be simulated
200
-with OncoSimulR and specified with most of the mechanisms above (you can
201
-have, for instance, interactions between mutator genes). And, with regards
202
-to mutation rates, different genes can have different mutation rates
203
-(regardless of the presence or not of other mutator/antimutator genes).
198
+Mutator/antimutator genes, genes that alter the mutation rate of
199
+other genes [@gerrish_complete_2007; @tomlinson_mutation_1996], can
200
+also be simulated with OncoSimulR and specified with most of the
201
+mechanisms above (you can have, for instance, interactions between
202
+mutator genes). And, regardless of the presence or not of other
203
+mutator/antimutator genes, different genes can have different
204
+mutation rates.
204 205
 
205 206
 
206
-Simulations can be stopped as a function of total population size, number of
207
-mutated driver genes, or number of time periods. Simulations can also be
208
-stopped with a stochastic detection mechanism where the probability of
207
+Simulations can be stopped as a function of total population size, number
208
+of mutated driver genes, or number of time periods. Simulations can also
209
+be stopped with a stochastic detection mechanism where the probability of
209 210
 detecting a tumor increases with total population size. Simulations return
210
-the number of cells of every genotype/clone at each of the sampling periods
211
-and the genealogical relationships of all clones generated during the
212
-simulation. We can take samples from those data with single-cell or whole-
213
-tumor resolution, adding noise if we want.
211
+the number of cells of every genotype/clone at each of the sampling
212
+periods and we can take samples from the former with single-cell or whole-
213
+tumor resolution, adding noise if we want. If we ask for them, simulations
214
+also store and return the genealogical relationships of all clones
215
+generated during the simulation.
214 216
 
215 217
 
216 218
 The models so far implemented are all continuous time models, which are
... ...
@@ -219,7 +221,7 @@ implemented in C++, providing for fast execution.  To help with simulation
219 221
 studies, code to simulate random graphs of the kind often seen in CBNs, OTs,
220 222
 etc, is also available. Finally, OncoSimulR also allows for the generation
221 223
 of random fitness landscapes and the representation of fitness landscapes
222
-and provides some statistics of of evolutionary predictability.
224
+and provides statistics of evolutionary predictability.
223 225
 
224 226
 
225 227
 
... ...
@@ -228,7 +230,7 @@ and provides some statistics of of evolutionary predictability.
228 230
 
229 231
 As mentioned above, OncoSimulR is now a very general package for forward
230 232
 genetic simulation, with applicability well beyond tumor progression. This
231
-is a summary of some of the key features:
233
+is a summary of some of its key features:
232 234
 
233 235
 <!-- FIXME: add the tables of the poster -->
234 236
 
... ...
@@ -240,14 +242,14 @@ is a summary of some of the key features:
240 242
       Oncogenetic Trees (OTs), Conjunctive Bayesian Networks (CBNs),
241 243
       semimonotone progression networks, and XOR relationships.
242 244
 		  
243
-    - Epistatic interactions, including, but not limited to, synthetic
245
+    - Epistatic interactions including, but not limited to, synthetic
244 246
        viability and synthetic lethality.
245 247
     - Order effects.
246 248
 	
247 249
 * You can add passenger mutations.
248 250
 * You can add mutator/antimutator effects.
249 251
 * Fitness and mutation rates can be gene-specific.
250
-* More generally, you can add arbitrary numbers of non-interacting
252
+* You can add arbitrary numbers of non-interacting
251 253
       genes with arbitrary fitness effects.
252 254
   
253 255
 * you can allow for deviations from the OT, CBN, semimonotone, and
... ...
@@ -259,9 +261,9 @@ is a summary of some of the key features:
259 261
       sampling. 
260 262
 	  
261 263
 * You can stop the simulations using a flexible combination of
262
-      conditions, from final time to number of drivers to population
263
-      size, to fixation of certain genotypes, to a stochastic
264
-      stopping mechanism that depends on size, etc.  
264
+      conditions: final time, number of drivers, population
265
+      size, fixation of certain genotypes, and a stochastic
266
+      stopping mechanism that depends on population size.
265 267
   
266 268
 * Right now, three different models are available, two that lead
267 269
       to exponential growth, one of them loosely based on @Bozic2010, and
... ...
@@ -270,7 +272,7 @@ is a summary of some of the key features:
270 272
 <!-- * Code in C++ is available (though not yet callable from R) for -->
271 273
 <!--       using several other models, including the one from @Beerenwinkel2007b. -->
272 274
       
273
-* You can use very large numbers of genes (e.g., see an example of
275
+* You can use large numbers of genes (e.g., see an example of
274 276
       50000 in section \@ref(mcf50070) ).
275 277
       
276 278
 * Simulations are generally very fast: I use C++ to implement the BNB
... ...
@@ -288,13 +290,13 @@ is a summary of some of the key features:
288 290
       
289 291
 * You can plot fitness landscapes.
290 292
       
291
-* You can obtain some basic measures of evolutionary predictability
293
+* You can obtain statistics of evolutionary predictability
292 294
   from the simulations.
293 295
   
294 296
 
295 297
 
296 298
 The table below, modified from the table at the
297
-[Genetics Simulation Resources (GSR) page](https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed)
299
+[Genetics Simulation Resources (GSR) page](https://popmodels.cancercontrol.cancer.gov/gsr/packages/oncosimulr/#detailed), 
298 300
 provides a summary of the key features of OncoSimulR. (An
299 301
 explanation of the meaning of terms specific to the GSR table is
300 302
 available from
... ...
@@ -304,21 +306,21 @@ by moving the mouse over each term).
304 306
 
305 307
 \clearpage
306 308
 
307
-|Attribute Category     | Attribute                                     |
308
-|-----------------------|-----------------------------------------------|
309
-|**Target**             |  |
310
-|&nbsp; Type of Simulated Data|           Haploid DNA Sequence|
311
-|&nbsp; variations            |           Biallelic Marker, Genotype or Sequencing Error|
312
-|**Simulation Method**            |           Forward-time|
313
-|&nbsp; Type of Dynamical Model | Continuous time|
314
-|&nbsp; Entities Tracked | Clones (see \@ref(trackindivs))|
309
+|Attribute Category                  | Attribute                                     |
310
+|-------------------------------|-----------------------------------------------|
311
+|**Target**                           |  |
312
+|&nbsp; Type of Simulated Data        |           Haploid DNA Sequence|
313
+|&nbsp; Variations                    |           Biallelic Marker, Genotype or Sequencing Error|
314
+|**Simulation Method**                |           Forward-time|
315
+|&nbsp; Type of Dynamical Model       | Continuous time|
316
+|&nbsp; Entities Tracked              | Clones (see \@ref(trackindivs))|
315 317
 |**Input** | Program specific (R data frames and matrices specifying genotypes' fitness, gene effects, and starting genotype) |
316 318
 |**Output**||
317 319
 |&nbsp; Data Type| Genotype or Sequence, Individual Relationship (complete parent-child relationships between clones), Demographic (populations sizes of all clones at sampling times), Diversity Measures (LOD, POM, diversity of genotypes), Fitness|
318 320
 |&nbsp; Sample Type|	Random or Independent, Longitudinal, Other (proportional to population size)|
319 321
 |**Evolutionary Features**	||
320 322
 |&nbsp; Mating Scheme| Asexual Reproduction |
321
-|&nbsp; Demographic||	
323
+|&nbsp; Demographic                   ||	
322 324
 |&nbsp; &nbsp; Population Size Changes|	Exponential (two models), Logistic (McFarland et al., 2013)|
323 325
 |&nbsp; Fitness Components||	
324 326
 |&nbsp; &nbsp; Birth Rate|	Individually Determined from Genotype (models "Exp" and "McFL")|
... ...
@@ -345,28 +347,29 @@ found in @Diaz-Uriarte2015, where additional comments about model
345 347
 parameters and caveats are discussed.
346 348
 
347 349
 
348
-Are there similar programs? The Java program by @Reiter2013a offers somewhat
349
-similar functionality to the previous version of OncoSimulR, but it is
350
-restricted to at most four drivers (whereas v.1 of OncoSimulR allowed for up
351
-to 64), you cannot use arbitrary CBNs or OTs (or XORs or semimonotone
352
-graphs) to specify restrictions, there is no allowance for passengers, and a
353
-single type of model (a discrete time Galton-Watson process) is
354
-implemented. The current functionality of OncoSimulR goes well beyond the
355
-the previous version (and, thus, also the TPT of [@Reiter2013a]). We now
356
-allow you to specify all types of fitness effects in other general forward
357
-genetic simulators such as FFPopSim [@Zanini2012], and some that, to our
358
-knowledge (e.g., order effects) are not available from any genetics
359
-simulator. In addition, the "Lego system" to flexibly combine different
360
-fitness specifications is also unique; by "Lego system" I mean that we can
361
-combine different pieces and blocks, similarly to what we do with Lego
362
-bricks. (I find this an intuitive and very graphical analogy, which I have
363
-copied from @Hothorn_2006 and @Hothorn_2008). In a nutshell, salient
364
-features of OncoSimulR compared to other simulators are the unparalleled
365
-flexibility to specify fitness and mutator effects, with modules and order
366
-effects as particularly unique, and the options for sampling and stopping
367
-the simulations, particularly convenient in cancer evolution models. Also
368
-unique in this type of software is the addition of functions for simulating
369
-fitness landscapes and assessing evolutionary predictability.
350
+Are there similar programs? The Java program by @Reiter2013a, TTP, offers
351
+somewhat similar functionality to the previous version of OncoSimulR, but
352
+it is restricted to at most four drivers (whereas v.1 of OncoSimulR
353
+allowed for up to 64), you cannot use arbitrary CBNs or OTs (or XORs or
354
+semimonotone graphs) to specify restrictions, there is no allowance for
355
+passengers, and a single type of model (a discrete time Galton-Watson
356
+process) is implemented. The current functionality of OncoSimulR goes well
357
+beyond the the previous version (and, thus, also the TPT of
358
+@Reiter2013a). We now allow you to specify all types of fitness effects
359
+in other general forward genetic simulators such as FFPopSim
360
+[@Zanini2012], and some that, to our knowledge (e.g., order effects) are
361
+not available from any genetics simulator. In addition, the "Lego system"
362
+to flexibly combine different fitness specifications is also unique; by
363
+"Lego system" I mean that we can combine different pieces and blocks,
364
+similarly to what we do with Lego bricks. (I find this an intuitive and
365
+very graphical analogy, which I have copied from @Hothorn_2006 and
366
+@Hothorn_2008). In a nutshell, salient features of OncoSimulR compared to
367
+other simulators are the unparalleled flexibility to specify fitness and
368
+mutator effects, with modules and order effects as particularly unique,
369
+and the options for sampling and stopping the simulations, particularly
370
+convenient in cancer evolution models. Also unique in this type of
371
+software is the addition of functions for simulating fitness landscapes
372
+and assessing evolutionary predictability.
370 373
 
371 374
 
372 375
 
... ...
@@ -390,7 +393,7 @@ OncoSimulR can help address involve combinations of:
390 393
           DAGs/posets where these DAGs/posets:
391 394
 		     - Are user-specified
392 395
 	         - Generated randomly (`simOGraph`)
393
-        - Any mapping between genotypes and fitness:
396
+        - Any mapping between genotypes and fitness where this mapping is:
394 397
             - User-specified
395 398
             - Generated randomly from families of random fitness landscapes (`rfitness`)     
396 399
      - Mutation rates can:
... ...
@@ -411,12 +414,12 @@ OncoSimulR can help address involve combinations of:
411 414
     - ... and assessing the consequences of those on the observed
412 415
       genotypes and their diversity (`sampledGenotypes`) and any other
413 416
       inferences that depend on the observational process.
414
-	- OncoSimulR returns the genotypes at each of the sampling
415
-	  points, so you are not restricted by what the
416
-	  `samplePop` function provides.
417
+	- (OncoSimulR returns the abundances of all genotypes at each of
418
+	  the sampling points, so you are not restricted by what the
419
+	  `samplePop` function provides.)
417 420
   
418 421
 
419
-* Tracing the genealogical relationships of clones
422
+* Tracking the genealogical relationships of clones
420 423
   (`plotClonePhylog`) and assessing evolutionary predictability
421 424
   (`LOD`, `POM`).
422 425
 
... ...
@@ -428,8 +431,8 @@ OncoSimulR are discussed in section \@ref(whatfor).
428 431
 -----
429 432
 
430 433
 A quick overview of the main functions and their relationships is shown in
431
-the figure \@ref(fig:frelats), where we use italics for the type/class of R
432
-object and courier for the name of the functions.
434
+Figure \@ref(fig:frelats), where we use italics for the type/class of R
435
+object and courier font for the name of the functions.
433 436
 <!-- Note: figure1.png, and how to create it, explained in miscell-files -->
434 437
 <!-- in the repo -->
435 438
 <!-- ![Relationship between the main functions in OncoSimulR.](figure1.png) -->
... ...
@@ -440,21 +443,22 @@ object and courier for the name of the functions.
440 443
 knitr::include_graphics("relfunct.png")
441 444
 ```
442 445
 
443
-
446
+\clearpage
444 447
 
445 448
 ## Examples of questions that can be addressed with OncoSimulR {#whatfor}
446 449
 
447 450
 Most of the examples in the rest of this vignette, starting with those in
448 451
 \@ref(quickexample), focus on the mechanics. Here, we will illustrate some
449
-problems in cancer genomics and evolutionary genetics where OncoSimulR could
450
-be of help. This section does not try to provide an answer to any of these
451
-questions (those would be full papers by themselves) but simply to
452
-illustrate some kinds of questions where you can use OncoSimulR; of course,
453
-the possible uses of OncoSimulR are only limited by your ingenuity. Finally,
454
-I will only use short snippets of working code as we are limited by time of
455
-execution; for real work you would want to use many more scenarios and many
456
-more simulations, you would use appropriate statistical methods to compare
457
-the output of runs, etc, etc, etc.
452
+problems in cancer genomics and evolutionary genetics where OncoSimulR
453
+could be of help. This section does not try to provide an answer to any of
454
+these questions (those would be full papers by themselves). Instead, this
455
+section simply tries to illustrate some kinds of questions where you can
456
+use OncoSimulR; of course, the possible uses of OncoSimulR are only
457
+limited by your ingenuity. Here, I will only use short snippets of working
458
+code as we are limited by time of execution; for real work you would want
459
+to use many more scenarios and many more simulations, you would use
460
+appropriate statistical methods to compare the output of runs, etc, etc,
461
+etc.
458 462
 
459 463
 
460 464
 ```{r firstload}
... ...
@@ -464,10 +468,10 @@ library(OncoSimulR)
464 468
 
465 469
 ### Recovering restrictions in the order of accumulation of mutations {#ex-order}
466 470
 
467
-This is a question that was addressed, for instance in
471
+This is a question that was addressed, for instance, in
468 472
 @Diaz-Uriarte2015: do methods that try to infer restrictions in the
469
-order of accumulation of mutations (e.g., @Szabo2008, @Gerstung2009,
470
-@ramazzotti_capri_2015) work well under different evolutionary
473
+order of accumulation of mutations [e.g., @Szabo2008; @Gerstung2009;
474
+@ramazzotti_capri_2015] work well under different evolutionary
471 475
 models and with different sampling schemes? (This issue is also
472 476
 touched upon in section \@ref(sample-1)).
473 477
 
... ...
@@ -486,52 +490,55 @@ RNGkind("L'Ecuyer-CMRG")
486 490
 ```
487 491
 
488 492
 ```{r ex-dag-inf}
489
-## 1. Simulate a DAG
493
+## For reproducibility
494
+set.seed(2)
495
+RNGkind("L'Ecuyer-CMRG")
496
+
497
+## Simulate a DAG
490 498
 g1 <- simOGraph(4, out = "rT")
491 499
 
492
-## 2. Simulate two evolutionary trajectories
500
+## Simulate 10 evolutionary trajectories
493 501
 s1 <- oncoSimulPop(10, allFitnessEffects(g1, drvNames = 1:4),
494 502
                    mc.cores = 2, ## adapt to your hardware
495 503
                    seed = NULL) ## for reproducibility of vignette
496 504
 
497
-## 3. Sample those data uniformly, and add noise
505
+## Sample those data uniformly, and add noise
498 506
 d1 <- samplePop(s1, timeSample = "unif", propError = 0.1)
499 507
 
500
-## 4. You would now run the appropriate inferential method and
501
-## compare observed and true
508
+## You would now run the appropriate inferential method and
509
+## compare observed and true. For example
502 510
 
503
-require(Oncotree)
504
-fit1 <- oncotree.fit(d1)
511
+## require(Oncotree)
512
+## fit1 <- oncotree.fit(d1)
505 513
 
506
-## Now, compare fitted and original. This is well beyond the
507
-## scope of this document (and OncoSimulR itself).
514
+## Now, you'd compare fitted and original. This is well beyond 
515
+## the scope of this document (and OncoSimulR itself).
508 516
 
509 517
 ```
510 518
 
511
-```{r, echo=FALSE}
512
-set.seed(7)
513
-RNGkind("L'Ecuyer-CMRG")
519
+```{r hidden-rng-exochs, echo = FALSE}
520
+set.seed(NULL)
514 521
 ```
515 522
 
523
+
524
+
516 525
 ### Sign epistasis and probability of crossing fitness valleys {#ex-ochs}
517 526
 
518
-These questions, together with the next one (\@ref(ex-predict)),
519
-encompass a wide range of specific issues that have been addressed
520
-in evolutionary genetics studies and which include from detailed
521
-analysis of simple models with a few uphill paths and valleys as in
522
-@Weissman2009 or @Ochs2015, to questions that refer to larger, more
523
-complex fitness landscapes as in @szendro_predictability_2013 or
524
-@franke_evolutionary_2011 (see below).
525
-
526
-
527
-Using as an example @Ochs2015, we could specify the fitness
528
-landscape and run simulations until fixation (with argument
529
-`fixation` to `oncoSimulPop` ---see more details in section
530
-\@ref(fixation) where we also use this same model). We would then
531
-examine the proportion of genotypes fixed under different
532
-scenarios. For instance, we can use the example from @Ochs2015 (we
533
-will see this example also in section \@ref(ochsdesai), where we
534
-cover different ways of specifying fitness). And we can extend this
527
+This question, and the question in the next section (\@ref(ex-predict)),
528
+encompass a wide range of issues that have been addressed in evolutionary
529
+genetics studies and which include from detailed analysis of simple models
530
+with a few uphill paths and valleys as in @Weissman2009 or @Ochs2015, to
531
+questions that refer to larger, more complex fitness landscapes as in
532
+@szendro_predictability_2013 or @franke_evolutionary_2011 (see below).
533
+
534
+
535
+Using as an example @Ochs2015 (we will see this example again in
536
+section \@ref(ochsdesai), where we cover different ways of
537
+specifying fitness), we could specify the fitness landscape and run
538
+simulations until fixation (with argument `fixation` to
539
+`oncoSimulPop` ---see more details in section \@ref(fixation), again
540
+with this example). We would then examine the proportion
541
+of genotypes fixed under different scenarios. And we can extend this
535 542
 example by adding mutator genes:
536 543
 
537 544
 
... ...
@@ -541,6 +548,10 @@ RNGkind("L'Ecuyer-CMRG")
541 548
 ```
542 549
 
543 550
 ```{r exochs}
551
+## For reproducibility
552
+set.seed(2)
553
+RNGkind("L'Ecuyer-CMRG")
554
+
544 555
 ## Specify fitness effects. 
545 556
 
546 557
 ## Numeric values arbitrary, but set the intermediate genotype en
... ...
@@ -578,7 +589,8 @@ initS <- 10
578 589
 od_sim <- oncoSimulPop(10, od, muEF = odm,
579 590
                        fixation = c("u", "i, v"), initSize = initS,
580 591
                        model = "McFL",
581
-                       mu = 1e-4, detectionDrivers = NA, finalTime = NA,
592
+                       mu = 1e-4, detectionDrivers = NA, 
593
+					   finalTime = NA,
582 594
                        detectionSize = NA, detectionProb = NA,
583 595
                        onlyCancer = TRUE, 
584 596
 					   mc.cores = 2, ## adapt to your hardware
... ...
@@ -587,7 +599,7 @@ od_sim <- oncoSimulPop(10, od, muEF = odm,
587 599
 sampledGenotypes(samplePop(od_sim))
588 600
 ```
589 601
 
590
-```{r hidden-rng-exochs, echo = FALSE}
602
+```{r hidden-rng-exochs33, echo = FALSE}
591 603
 set.seed(NULL)
592 604
 ```
593 605
 
... ...
@@ -598,7 +610,8 @@ we would run simulations under random fitness landscapes with varied
598 610
 ruggedness, and would then examine the evolutionary predictability
599 611
 of the trajectories with measures such as "Lines of Descent" and
600 612
 "Path of the Maximum" [@szendro_predictability_2013] and the
601
-diversity of the sampled genotypes under different sampling regimes.
613
+diversity of the sampled genotypes under different sampling regimes (see
614
+details in section \@ref(evolpredszend)).
602 615
 
603 616
 
604 617
 
... ...
@@ -608,17 +621,22 @@ RNGkind("L'Ecuyer-CMRG")
608 621
 ```
609 622
 
610 623
 ```{r exszendro}
624
+## For reproducibility
625
+set.seed(7)
626
+RNGkind("L'Ecuyer-CMRG")
627
+
611 628
 
612 629
 ## Repeat the following loop for different combinations of whatever
613 630
 ## interests you, such as number of genes, or distribution of the
614 631
 ## c and sd (which affect how rugged the landscape is), or 
615
-## reference genotype or evolutionary model, or stopping criterion, 
616
-## sampling procedure, or ...
632
+## reference genotype, or evolutionary model, or stopping criterion, 
633
+## or sampling procedure, or ...
617 634
 
618
-## 1. Generate a random fitness landscape, from the Rough Mount
619
-##    Fuji model, with g genes, and c ("slope" constant) and
620
-##    reference chosen randomly. Ask for a minimal number of
621
-##    accessible genotypes
635
+##  Generate a random fitness landscape, from the Rough Mount
636
+##  Fuji model, with g genes, and c ("slope" constant) and
637
+##  reference chosen randomly (reference is random by default and 
638
+##  thus not specified below). Require a minimal number of  
639
+##  accessible genotypes
622 640
 
623 641
 g <- 6
624 642
 c <- runif(1, 1/5, 5)
... ...
@@ -632,20 +650,19 @@ rl <- rfitness(g, c = c, min_accessible_genotypes = g)
632 650
 ## Obtain landscape measures from Magellan. Export to Magellan
633 651
 to_Magellan(rl, file = "rl1.txt")
634 652
 
635
-## (Getting the statistics requires you to install Magellan,
636
-## and that requires either calling the web app 
637
-## (http://wwwabi.snv.jussieu.fr/public/Magellan/) or
638
-## asking Magellan's authors for the software. This is of course
639
-## beyond the scope of the example and the package.
653
+## (Getting the statistics from Magellan requires either 
654
+## calling the web app (http://wwwabi.snv.jussieu.fr/public/Magellan/) 
655
+## or asking Magellan's authors for the software. This is of course
656
+## beyond the scope of the example and the package.)
640 657
 
641
-## 3. Simulate evolution in that landscape many times (here just 5)
658
+## Simulate evolution in that landscape many times (here just 10)
642 659
 simulrl <- oncoSimulPop(10, allFitnessEffects(genotFitness = rl),
643 660
                         keepPhylog = TRUE, keepEvery = 1,
644 661
                         initSize = 4000,
645 662
                         seed = NULL, ## for reproducibility
646 663
                         mc.cores = 2) ## adapt to your hardware
647 664
 
648
-## 4. Obtain measures of evolutionary predictability
665
+## Obtain measures of evolutionary predictability
649 666
 diversityLOD(LOD(simulrl))
650 667
 diversityPOM(POM(simulrl))
651 668
 sampledGenotypes(samplePop(simulrl, typeSample = "whole"))
... ...
@@ -657,10 +674,10 @@ set.seed(NULL)
657 674
 
658 675
 
659 676
 
660
-### Mutator and antimutator genes {#ex-mut-antimut}
677
+### Mutator and antimutator genes {#exmutantimut}
661 678
 
662 679
 The effects of mutator and antimutator genes have been examined both
663
-in cancer genetics [@nowak_evolutionary_2006,
680
+in cancer genetics [@nowak_evolutionary_2006;
664 681
 @tomlinson_mutation_1996] and in evolutionary genetics
665 682
 [@gerrish_complete_2007], and are related to wider issues such as
666 683
 Muller's ratchet and the evolution of sex. There are, thus, a large
... ...
@@ -675,17 +692,17 @@ in driver genes needed to reach cancer is large and if the mutator effect is
675 692
 large.
676 693
 
677 694
 
678
-We might want to ask, then, how long it takes before we reach
679
-cancer. This is stored in the component `FinalTime` of the
680
-output. We would specify different numbers and effects of mutator
681
-genes (argument `muEF`). We would also change the criteria for
695
+We might want to ask, then, how long it takes before to reach cancer under
696
+different scenarios. Time to reach cancer is stored in the component
697
+`FinalTime` of the output. We would specify different numbers and effects
698
+of mutator genes (argument `muEF`). We would also change the criteria for
682 699
 reaching cancer and in our case we can easily do that by specifying
683
-different numbers in `detectionDrivers`. Of course, we would also
684
-want to examine the effects of varying numbers of mutators, drivers,
685
-and possibly fitness consequences of mutators (below we assume
686
-mutators are neutral and we assume there are no additional genes
687
-with deleterious mutations, but this need not be so, of course; see
688
-also [@tomlinson_mutation_1996, @gerrish_complete_2007, @McFarland2014]).
700
+different numbers in `detectionDrivers`. Of course, we would also want to
701
+examine the effects of varying numbers of mutators, drivers, and possibly
702
+fitness consequences of mutators. Below we assume mutators are neutral and
703
+we assume there are no additional genes with deleterious mutations, but
704
+this need not be so, of course [see also
705
+@tomlinson_mutation_1996; @gerrish_complete_2007; @McFarland2014].
689 706
 
690 707
 
691 708
 Let us run an example. For the sake of simplicity, we assume no
... ...
@@ -721,6 +738,11 @@ RNGkind("L'Ecuyer-CMRG")
721 738
 ```
722 739
 
723 740
 ```{r ex-tomlin2}
741
+## For reproducibility
742
+set.seed(2)
743
+RNGkind("L'Ecuyer-CMRG")
744
+
745
+
724 746
 ddr <- 4
725 747
 st <- oncoSimulPop(4, ft, muEF = mt,
726 748
                    detectionDrivers = ddr,
... ...
@@ -743,9 +765,46 @@ set.seed(NULL)
743 765
 
744 766
 
745 767
 (Incidentally, notice that it is easy to get OncoSimulR to throw an
746
-exception if you do not think twice about the mutator effects you are using
747
-and specify a huge mutation rate if all mutator genes are mutated: see
748
-\@(tomlin-except).)
768
+exception if you accidentally specify a huge mutation rate when all
769
+mutator genes are mutated: see section \@ref(tomlinexcept).)
770
+
771
+
772
+<!-- More complex example where we look at accumulation of deleterious -->
773
+
774
+<!-- ```{r r ex-tomlin3} -->
775
+<!-- set.seed(2) ## for reproducibility -->
776
+<!-- RNGkind("L'Ecuyer-CMRG") -->
777
+
778
+<!-- sd <- 0.1 ## fitness effect of drivers -->
779
+<!-- sp <- -0.01 ## fitness effect of mildly deleterious passengers -->
780
+<!-- sm <- 0 ## fitness effect of mutator -->
781
+<!-- nd <- 20 ## number of drivers -->
782
+<!-- nm <- 5  ## number of mutators -->
783
+<!-- np <- 50 -->
784
+<!-- mut <- 5 ## mutator effect -->
785
+
786
+<!-- fitnessGenesVector <- c(rep(sd, nd), rep(sm, nm), rep(sp, np)) -->
787
+<!-- names(fitnessGenesVector) <- 1:(nd + nm + np) -->
788
+<!-- mutatorGenesVector <- rep(mut, nm) -->
789
+<!-- names(mutatorGenesVector) <- (nd + np + 1):(nd + np + nm) -->
790
+
791
+<!-- ft <- allFitnessEffects(noIntGenes = fitnessGenesVector, -->
792
+<!--                         drvNames = 1:nd) -->
793
+<!-- mt <- allMutatorEffects(noIntGenes = mutatorGenesVector) -->
794
+
795
+<!-- ddr <- 4 -->
796
+<!-- st <- oncoSimulPop(4, ft, muEF = mt, -->
797
+<!--                    detectionDrivers = ddr, -->
798
+<!--                    finalTime = NA, -->
799
+<!--                    detectionSize = NA, -->
800
+<!--                    detectionProb = NA, -->
801
+<!--                    onlyCancer = FALSE, -->
802
+<!--                    keepEvery = NA,  -->
803
+<!--                    mc.cores = 2, ## adapt to your hardware -->
804
+<!--                    seed = NULL) ## for reproducibility -->
805
+<!-- colSums(samplePop(st, timeSample = "last", typeSample = "single"), na.rm = TRUE) -->
806
+<!-- ``` -->
807
+
749 808
 
750 809
 
751 810
 
... ...
@@ -787,10 +846,13 @@ fbauer <- allFitnessEffects(bauer)
787 846
 plot(b1, use_ggrepel = TRUE) ## avoid overlapping labels
788 847
 ```
789 848
 
790
-Now run simulations and examine how frequently the runs do not end
791
-up in extinction (for real, you would of course increase the number
792
-of total populations, the range of initial population sizes, model,
793
-mutation rate, etc):
849
+Now run simulations and examine how frequently the runs end up with
850
+population sizes larger than a pre-specified threshold; for
851
+instance, below we look at increasing population size 4x in the
852
+default maximum number of 2281 time periods (for real, you would of
853
+course increase the number of total populations, the range of
854
+initial population sizes, model, mutation rate, required population
855
+size or number of drivers, etc):
794 856
 
795 857
 
796 858
 ```{r hiddenbau, echo=FALSE}
... ...
@@ -799,14 +861,20 @@ RNGkind("L'Ecuyer-CMRG")
799 861
 ```
800 862
 
801 863
 ```{r exusagebau2}
864
+## For reproducibility
865
+set.seed(2)
866
+RNGkind("L'Ecuyer-CMRG")
867
+
802 868
 totalpops <- 5
869
+initSize <- 100
803 870
 sb1 <- oncoSimulPop(totalpops, fbauer, model = "Exp",
804
-                    initSize = 100,
871
+                    initSize = initSize,
805 872
                     onlyCancer = FALSE, 
806 873
 					mc.cores = 2, ## adapt to your hardware
807 874
                     seed = NULL) ## for reproducibility
808
-## What proportion of the simulations do not end up extinct?
809
-sum(summary(sb1)[, "TotalPopSize"] > 0)/totalpops
875
+					
876
+## What proportion of the simulations reach 4x initSize?
877
+sum(summary(sb1)[, "TotalPopSize"] > (4 * initSize))/totalpops
810 878
 
811 879
 ```
812 880
 ```{r hidden-rng-exbau, echo = FALSE}
... ...
@@ -814,6 +882,36 @@ set.seed(NULL)
814 882
 ```
815 883
 
816 884
 
885
+Alternatively, to examine how long it takes to reach cancer for a
886
+pre-specified size, you could look at the value of `FinalTime` as we
887
+did above (section \@ref(exmutantimut)) after running simulations
888
+with `onlyCancer = TRUE` and `detectionSize` set to some reasonable value:
889
+
890
+
891
+```{r hiddenbau22, echo=FALSE}
892
+set.seed(2)
893
+RNGkind("L'Ecuyer-CMRG")
894
+```
895
+
896
+```{r hhhhbbbb22}
897
+
898
+totalpops <- 5
899
+initSize <- 100
900
+sb2 <- oncoSimulPop(totalpops, fbauer, model = "Exp",
901
+                    initSize = initSize,
902
+                    onlyCancer = TRUE,
903
+					detectionSize = 10 * initSize,
904
+					mc.cores = 2, ## adapt to your hardware
905
+                    seed = NULL) ## for reproducibility
906
+				
907
+## How long did it take to reach cancer?
908
+unlist(lapply(sb2, function(x) x$FinalTime))
909
+```
910
+
911
+```{r hidden-rng-exbau22, echo = FALSE}
912
+set.seed(NULL)
913
+```
914
+
817 915
 #### Consequences of order effects for cancer initiation {#exorder1intro}
818 916
 
819 917
 Instead of focusing on different models for epistatic interactions,
... ...
@@ -829,12 +927,13 @@ order effects):
829 927
 ```{r oex1intro}
830 928
 ## Order effects involving three genes.
831 929
 
832
-## Genotype D, M has different fitness effects
930
+## Genotype "D, M" has different fitness effects
833 931
 ## depending on whether M or D mutated first.
932
+## Ditto for genotype "F, D, M".
933
+
934
+## Meaning of specification: X > Y means
935
+## that X is mutated before Y.
834 936
 
835
-## Genotype F, D, M has different fitness
836
-## depending on which of the order in which the genes
837
-## mutated.
838 937
 
839 938
 o3 <- allFitnessEffects(orderEffects = c(
840 939
                             "F > D > M" = -0.3,
... ...
@@ -863,6 +962,10 @@ RNGkind("L'Ecuyer-CMRG")
863 962
 ```
864 963
 
865 964
 ```{r exusageoe2}
965
+## For reproducibility
966
+set.seed(2)
967
+RNGkind("L'Ecuyer-CMRG")
968
+
866 969
 totalpops <- 5
867 970
 soe1 <- oncoSimulPop(totalpops, o3, model = "Exp",
868 971
                     initSize = 500,
... ...
@@ -880,6 +983,10 @@ set.seed(NULL)
880 983
 ```
881 984
 
882 985
 
986
+As we just said, alternatively, to examine how long it takes to
987
+reach cancer you could run simulations with `onlyCancer = TRUE` and
988
+look at the value of `FinalTime` as we did above (section
989
+\@ref(exmutantimut)).
883 990
 
884 991
 
885 992
 
... ...
@@ -893,27 +1000,29 @@ and \@ref(whatfor) and running time and space consumption of OncoSimulR are
893 1000
 addressed in section \@ref(timings). You should be aware that **coalescent
894 1001
 simulations**, sometimes also called backward-time simulations, are much
895 1002
 more efficient for simulating neutral data as well as some special selection
896
-scenarios [@Yuan2012,@Carvajal-Rodriguez2010,@Hoban2011].  
897
-
898
-In addition, since OncoSimulR allows you to specify fitness with arbitrary
899
-epistatic and order effects, as well as mutator effects, you need to learn
900
-the syntax of how to specify those effects and you might be paying a
901
-performance penalty if your scenario does not require this complexity. For
902
-instance, in the models in @Beerenwinkel2007b or @Bozic2010, the fitness of
903
-a genotype depends only on the total number of drivers mutated, but not on
904
-which drivers are mutated (and, thus, not on the epistatic interactions nor
905
-the order of accumulation of the drivers). This means that the syntax for
906
-specifying those fitness models could probably be a lot simpler (e.g.,
907
-specify $s$ per driver). 
908
-
909
-But it also means that code written for just that case could probably run
910
-much faster. First, because fitness evaluation is easier. Second, and
911
-possibly much more important, because what we need to keep track of leads to
912
-much simpler and economic structures: we do not need to keep track of clones
913
-(where two cells are regarded as different clones if they differ anywhere in
914
-their genotype), but only of clone types as defined by the number of mutated
915
-drivers, and keeping track of clones can be expensive ---see sections
916
-\@ref(timings) and \@ref(trackindivs). 
1003
+scenarios [@Yuan2012; @Carvajal-Rodriguez2010; @Hoban2011].  
1004
+
1005
+In addition, since OncoSimulR allows you to specify fitness with
1006
+arbitrary epistatic and order effects, as well as mutator effects,
1007
+you need to learn the syntax of how to specify those effects and you
1008
+might be paying a performance penalty if your scenario does not
1009
+require this complexity. For instance, in the model of
1010
+@Beerenwinkel2007b, the fitness of a genotype depends only on the
1011
+total number of drivers mutated, but not on which drivers are
1012
+mutated (and, thus, not on the epistatic interactions nor the order
1013
+of accumulation of the drivers). This means that the syntax for
1014
+specifying that model could probably be a lot simpler
1015
+(e.g., specify $s$ per driver).
1016
+
1017
+But it also means that code written for just that case could
1018
+probably run much faster. First, because fitness evaluation is
1019
+easier. Second, and possibly much more important, because what we
1020
+need to keep track of leads to much simpler and economic structures:
1021
+we do not need to keep track of clones (where two cells are regarded
1022
+as different clones if they differ anywhere in their genotype), but
1023
+only of clone types or clone classes as defined by the number of
1024
+mutated drivers, and keeping track of clones can be expensive ---see
1025
+sections \@ref(timings) and \@ref(trackindivs).
917 1026
 
918 1027
 So for those cases where you do not need the full flexibility of OncoSimulR,
919 1028
 special purpose software might be easier to use and faster to run. Of
... ...
@@ -923,16 +1032,17 @@ be available, though.
923 1032
 
924 1033
 
925 1034
 
926
-## Steps in using OncoSimulR {#steps}
1035
+## Steps for using OncoSimulR {#steps}
927 1036
 
928 1037
 
929 1038
 Using this package will often involve the following steps:
930 1039
 
931 1040
 
932
-1. Specify the fitness effects: sections \@ref(specfit) and \@ref(litex).
1041
+1. Specify fitness effects: sections \@ref(specfit) and \@ref(litex).
933 1042
 
934
-2. Simulate cancer progression: section \@ref(simul). You can simulate
935
-  for a single subject or for a set of subjects. You will need to:
1043
+2. Simulate cancer progression: section \@ref(simul). You can
1044
+  simulate for a single individual or subject or for a set of
1045
+  subjects. You will need to:
936 1046
   
937 1047
     - Decide on a model. This basically amounts to choosing a model with
938 1048
     exponential growth ("Exp" or "Bozic") or a model with carrying capacity
... ...
@@ -1012,7 +1122,7 @@ sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb,
1012 1122
 evalAllGenotypes(sv2, addwt = TRUE)
1013 1123
 
1014 1124
 ## 2. Simulate the data. Here we use the "McFL" model and set
1015
-##    explicitly parameters for mutation rate, initial size and size
1125
+##    explicitly parameters for mutation rate, initial size, size
1016 1126
 ##    of the population that will end the simulations, etc
1017 1127
 
1018 1128
 RNGkind("Mersenne-Twister")
... ...
@@ -1071,7 +1181,7 @@ ep1 <- oncoSimulIndiv(sv2, model = "McFL",
1071 1181
 <!-- %% @  -->
1072 1182
 
1073 1183
 
1074
-```{r iep1x1,fig.width=6.5, fig.height=4.5, fig.cap="Plot of drivers of an epistasis simulation"}
1184
+```{r iep1x1,fig.width=6.5, fig.height=4.5, fig.cap="Plot of drivers of an epistasis simulation."}
1075 1185
 ## 3. We will not analyze those data any further. We will only plot
1076 1186
 ## them.  For the sake of a small plot, we thin the data.
1077 1187
 plot(ep1, show = "drivers", xlim = c(0, 1500),
... ...
@@ -1093,7 +1203,7 @@ DAG** with the
1093 1203
 pancreatic cancer poset in @Gerstung2011 (see more
1094 1204
 details in section \@ref(pancreas)):
1095 1205
 
1096
-```{r, fig.width=5}
1206
+```{r fepancr1, fig.width=5}
1097 1207
 ## 1. Fitness effects: 
1098 1208
 pancr <- allFitnessEffects(
1099 1209
     data.frame(parent = c("Root", rep("KRAS", 4), 
... ...
@@ -1107,10 +1217,14 @@ pancr <- allFitnessEffects(
1107 1217
                typeDep = "MN"),
1108 1218
     drvNames = c("KRAS", "SMAD4", "CDNK2A", "TP53", 
1109 1219
 	             "MLL3", "TGFBR2", "PXDN"))
1220
+```
1110 1221
 
1222
+```{r figfpancr1, fig.width=5, fig.cap="Plot of DAG corresponding to fitnessEffects object."}
1111 1223
 ## Plot the DAG of the fitnessEffects object
1112 1224
 plot(pancr)
1113 1225
 ``` 
1226
+
1227
+
1114 1228
 ```{r}
1115 1229
 ## 2. Simulate from it. We change several possible options. 
1116 1230
 
... ...
@@ -1172,7 +1286,7 @@ paper:
1172 1286
 A PDF version of this vignette is available
1173 1287
 from <https://rdiaz02.github.io/OncoSimul/pdfs/OncoSimulR.pdf>.  And an HTML
1174 1288
 version from <https://rdiaz02.github.io/OncoSimul/OncoSimulR.html>. These
1175
-files should correspond to the most recent, github version, of the package
1289
+files should correspond to the most recent, GitHub version, of the package
1176 1290
 (i.e., they might include changes not yet available from the BioConductor
1177 1291
 package).
1178 1292
 
... ...
@@ -1201,18 +1315,17 @@ of "experiment before launching a large number of simulations"
1201 1315
 applies, but here we will walk through several cases to get a
1202 1316
 feeling for the major factors that affect speed and size. Many of
1203 1317
 the comments on this section need to use ideas discussed in other
1204
-places of this document. If you read this section first, you might
1318
+places of this document; if you read this section first, you might
1205 1319
 want to come back after reading the relevant parts.
1206 1320
 
1207 1321
 
1208 1322
 Speed will depend on:
1209 1323
 
1210 1324
 * Your hardware, of course.
1211
-* The evolutionary model: using the "McFL" model is often slower
1212
-  (this model includes density dependence and is more complex).
1325
+* The evolutionary model.
1213 1326
 * The granularity of how often you keep data (`keepEvery`
1214 1327
   argument). Note that the default, which is to keep as often as you
1215
-  sample (so that we preserve all history) can lead to very slow execution
1328
+  sample (so that we preserve all history) can lead to slow execution
1216 1329
   times.
1217 1330
 * The mutation rate, because higher mutation rates lead to more
1218 1331
   clones, and more clones means we need to iterate over, well, more clones,
... ...
@@ -1224,6 +1337,7 @@ Speed will depend on:
1224 1337
 * The stopping conditions (`detectionProb`, `detectionDrivers`,
1225 1338
   `detectionSize` arguments) and whether or not simulations are run until
1226 1339
   cancer is reached (`onlyCancer` argument).
1340
+* Most of the above factors can interact in complex ways.
1227 1341
 
1228 1342
 
1229 1343
 Size of returned objects will depend on:
... ...
@@ -1430,7 +1544,7 @@ t_mc2 <- system.time(
1430 1544
 
1431 1545
 ``` 
1432 1546
 
1433
-And we can obtain times, sizes of objects, and summaries of numbers
1547
+We can obtain times, sizes of objects, and summaries of numbers
1434 1548
 of clones, iterations, and final times doing, for instance:
1435 1549
 
1436 1550
 
... ...
@@ -1458,7 +1572,9 @@ panderOptions("table.split.cells", 900)  ## For HTML
1458 1572
 ## panderOptions("table.split.cells", 8) ## For PDF
1459 1573
 
1460 1574
 set.alignment('right')
1461
-panderOptions('round', 3)
1575
+panderOptions('round', 2)
1576
+panderOptions('big.mark', ',')
1577
+panderOptions('digits', 2)
1462 1578
 				          
1463 1579
 pander(benchmark_1[1:4, c("Elapsed Time, average per simulation (s)", 
1464 1580
  	              "Object Size, average per simulation (MB)",
... ...
@@ -1468,6 +1584,7 @@ pander(benchmark_1[1:4, c("Elapsed Time, average per simulation (s)",
1468 1584
  				  "Total Population Size, median",
1469 1585
  				  "Total Population Size, max.",
1470 1586
  				  "keepEvery")],
1587
+				  justify = c('left', rep('right', 8)), ##  o.w. hlines not right
1471 1588
 				  ## caption = "\\label{tab:bench1}Benchmarks of Exp and McFL  models using the default `detectionProb` with two settings of `keepEvery`."
1472 1589
 				  )
1473 1590
 ```
... ...
@@ -1494,16 +1611,17 @@ a sharp decrease in number of iterations and, thus, running time.
1494 1611
 
1495 1612
 
1496 1613
 
1497
-This table also shows that the `keepEvery = NA` setting, which was in effect
1498
-in simulations `exp2` and `mc2`, can make a difference especially for the
1499
-McFL models, as seen by the median number of clones and the size of the
1500
-returned object. Models `exp2` and `mc2` do not store any intermediate
1501
-population samples so clones that become extinct at any sampling period are
1502
-pruned and only the existing clones at the end of the simulation are
1503
-returned. In contrast, models `exp1` and `mc1` store population samples at
1504
-time intervals of 1 (`keepEvery = 1`), even if many of those clones
1505
-eventually become extinct. We will return to this issue below as execution
1506
-time and object size (which affects RAM usage) depend strongly on number of
1614
+This table also shows that the `keepEvery = NA` setting, which was
1615
+in effect in simulations `exp2` and `mc2`, can make a difference
1616
+especially for the McFL models, as seen by the median number of
1617
+clones and the size of the returned object. Models `exp2` and `mc2`
1618
+do not store any intermediate population samples so clones that
1619
+become extinct at any sampling period are pruned and only the
1620
+existing clones at the end of the simulation are returned. In
1621
+contrast, models `exp1` and `mc1` store population samples at time
1622
+intervals of 1 (`keepEvery = 1`), even if many of those clones
1623
+eventually become extinct. We will return to this issue below as
1624
+execution time and object size depend strongly on the number of
1507 1625
 clones tracked.
1508 1626
 
1509 1627
 
... ...
@@ -1517,9 +1635,12 @@ population size 1,000 times larger than the initial one (`PDBaseline = 5e4`,
1517 1635
 take place unless populations are at least 1,000 times larger than the
1518 1636
 initial population size, and probability of detection is 0.1 with a
1519 1637
 population size 100,000 times larger than the initial one (`PDBaseline =
1520
-5e5`, `n2 = 5e7`). In runs `exp3` and `exp5` we set `keepEvery = 1` and in
1638
+5e5`, `n2 = 5e7`)[^rva]. In runs `exp3` and `exp5` we set `keepEvery = 1` and in
1521 1639
 runs `exp4` and `exp6` we set `keepEvery = NA`.
1522 1640
 
1641
+[^rva]:Again, these are not necessarily reasonable or common
1642
+settings. We are using them to understand what and how affects
1643
+running time and space consumption.
1523 1644
 
1524 1645
 
1525 1646
 ```{r timing2, eval = FALSE}
... ...
@@ -1580,6 +1701,7 @@ Table: (\#tab:bench1b) Benchmarks of Exp models modifying the default `detection
1580 1701
 panderOptions("table.split.table", 99999999)
1581 1702
 panderOptions("table.split.cells", 900)  ## For HTML
1582 1703
 ## panderOptions("table.split.cells", 8) ## For PDF
1704
+
1583 1705
 set.alignment('right')
1584 1706
 panderOptions('round', 2)
1585 1707
 panderOptions('big.mark', ',')
... ...
@@ -1595,6 +1717,7 @@ pander(benchmark_1[5:8, c("Elapsed Time, average per simulation (s)",
1595 1717
  				  "keepEvery",
1596 1718
 				  "PDBaseline",
1597 1719
 				  "n2")], 
1720
+				  justify = c('left', rep('right', 10)), ##  o.w. hlines not right
1598 1721
 ## 				  round = c(rep(2, 3), rep(0, 7)),
1599 1722
 ## 				  digits = c(rep(2, 3), rep(1, 7)),
1600 1723
 	  ## caption = "\\label{tab:bench1b}Benchmarks of Exp and McFL models modifying the default `detectionProb` with two settings of `keepEvery`."
... ...
@@ -1611,7 +1734,8 @@ smaller object sizes and slightly smaller numbers of clones and
1611 1734
 execution times. Changing the exiting conditions (by changing
1612 1735
 `detectionProb` arguments) leads to large increases in number of
1613 1736
 iterations (in this case by factors of about 15x to 25x) and a
1614
-corresponding increase in execution time.
1737
+corresponding increase in execution time as well as much larger
1738
+population sizes (in some cases $>10^{10}$).
1615 1739
 
1616 1740
 
1617 1741
 In some of the runs of `exp5` and `exp6` we get the (recoverable)
... ...
@@ -1631,16 +1755,16 @@ that simulations will be repeated until the exiting conditions are reached
1631 1755
 that ends up in extinction will be repeated. This setting can thus have a
1632 1756
 large effect on the exponential models, because when the initial
1633 1757
 population size is not very large and we start from the wildtype, it is
1634
-not infrequent for simulations to become extinct (when birth and death
1758
+not uncommon for simulations to become extinct (when birth and death
1635 1759
 rates are equal and the population size is small, it is easy to reach
1636 1760
 extinction before a mutation in a gene that increases fitness occurs). But
1637
-this is rarely the case in the McFarland model (unless we use really small
1761
+this is rarely the case in the McFarland model (unless we use really tiny
1638 1762
 initial population sizes) because of the dependency of death rate on total
1639 1763
 population size (see section \@ref(mcfl)).
1640 1764
 
1641 1765
 
1642 1766
 The number of attempts until cancer was reached in the above
1643
-models is shown in the Table \@ref(tab:bench1c) (the values can be obtained from
1767
+models is shown in  Table \@ref(tab:bench1c) (the values can be obtained from
1644 1768
 any of the above runs doing, for instance, `median(unlist(lapply(exp1,
1645 1769
 function(x) x$other$attemptsUsed)))` ):
1646 1770
 
... ...
@@ -1648,7 +1772,7 @@ Table: (\#tab:bench1c) Number of attempts until cancer.
1648 1772
 ```{r bench1c, eval=TRUE, echo = FALSE}
1649 1773
 panderOptions("table.split.table", 99999999)
1650 1774
 panderOptions("table.split.cells", 900)  ## For HTML
1651
-## panderOptions("table.split.cells", 8) ## For PDF
1775
+## panderOptions("table.split.cells", 12) ## For PDF
1652 1776
 set.alignment('right')
1653 1777
 panderOptions('round', 2)
1654 1778
 panderOptions('big.mark', ',')
... ...
@@ -1660,9 +1784,10 @@ pander(benchmark_1[1:8, c(
1660 1784
 "Attempts until Cancer, max.", 
1661 1785
 				  "PDBaseline",
1662 1786
 				  "n2")], 
1787
+				  justify = c('left', rep('right', 5)), ##  o.w. hlines not right
1663 1788
 ## 				  round = c(rep(2, 3), rep(0, 7)),
1664 1789
 ## 				  digits = c(rep(2, 3), rep(1, 7)),
1665
-	  ## caption = "\\label{tab:bench1c}Median number of attempts until cancer."
1790
+	  ## caption = "\\label{tab:bench1c}Number of attempts until cancer."
1666 1791
     )
1667 1792
 ## ## data(benchmark_1)
1668 1793
 ## knitr::kable(benchmark_1[1:8, c("Attempts.Median",
... ...
@@ -1680,7 +1805,7 @@ The McFL models finish in a single attempt. The exponential model
1680 1805
 simulations where we can exit with small population sizes (`exp1`, `exp2`)
1681 1806
 need many fewer attempts to reach cancer than those where large population
1682 1807
 sizes are required (`exp3` to `exp6`). There is no relevant different
1683
-among those four, which is what we would expect: a population that has
1808
+among those last four, which is what we would expect: a population that has
1684 1809
 already reached a size of 50,000 cells from an initial population size of
1685 1810
 500 is obviously a growing population where there is at least one mutant
1686 1811
 with positive fitness; thus, it unlikely to go extinct and therefore
... ...
@@ -1690,7 +1815,7 @@ risk of extinction.
1690 1815
 
1691 1816
 We will now rerun all of the above models with argument `onlyCancer =
1692 1817
 FALSE`.  The results are shown in Table \@ref(tab:timing3) (note that the
1693
-differences between this table and table \@ref(tab:bench1) for the McFL
1818
+differences between this table and Table \@ref(tab:bench1) for the McFL
1694 1819
 models are due only to sampling variation).
1695 1820
 
1696 1821
 \bslandscape
... ...
@@ -1717,6 +1842,7 @@ pander(benchmark_1[9:16,
1717 1842
  				  "keepEvery",
1718 1843
 				  "PDBaseline",
1719 1844
 				  "n2")],
1845
+				  justify = c('left', rep('right', 11)), ##  o.w. hlines not right
1720 1846
 ## caption = "\\label{tab:timing3} Benchmarks of models in Table \\@ref(tab:bench1) and \\@ref(tab:bench1b) when run with `onlyCancer = FALSE`."
1721 1847
 				  )	
1722 1848
 	
... ...
@@ -1745,7 +1871,7 @@ To make it easier to compare results with those of the next section, Table
1745 1871
 
1746 1872
 \bslandscape
1747 1873
 
1748
-Table: (\#tab:allr1bck) Benchmarks of all models in Tables \@ref(tab:bench1), \@ref(tab:bench1b) and \@ref(tab:timing3).  
1874
+Table: (\#tab:allr1bck) Benchmarks of all models in Tables \@ref(tab:bench1), \@ref(tab:bench1b), and \@ref(tab:timing3).  
1749 1875
 ```{r bench1dx0, eval=TRUE, echo = FALSE}
1750 1876
 panderOptions("table.split.table", 99999999)
1751 1877
 ## panderOptions("table.split.cells", 900)  ## For HTML
... ...
@@ -1761,7 +1887,8 @@ pander(benchmark_1[ , c("Elapsed Time, average per simulation (s)",
1761 1887
 				  "Final Time, median", "Total Population Size, median", 
1762 1888
 				  "Total Population Size, mean", "Total Population Size, max.",
1763 1889
  	              "keepEvery", "PDBaseline", "n2", "onlyCancer")], 
1764
-				  ## caption = "\\label{tab:allr1bck}Benchmarks of all models in Tables \\@ref(tab:bench1), \\@ref(tab:bench1b)  and \\@ref(tab:timing3)."  
1890
+				  justify = c('left', rep('right', 12)), ##  o.w. hlines not right
1891
+				  ## caption = "\\label{tab:allr1bck}Benchmarks of all models in Tables \\@ref(tab:bench1), \\@ref(tab:bench1b),  and \\@ref(tab:timing3)."  
1765 1892
 				  )  
1766 1893
 ```
1767 1894
 
... ...
@@ -1773,11 +1900,12 @@ pander(benchmark_1[ , c("Elapsed Time, average per simulation (s)",
1773 1900
 
1774 1901
 ### Changing fitness: $s=0.1$ and $s=0.05$ {#bench1xf}
1775 1902
 
1776
-In the above fitness specification the fitness effect of each gene (when
1777
-its restrictions are satisfied) is $s = 0.1$ (see section \@ref(numfit)
1778
-for details). Here we rerun all the above benchmarks using $s= 0.05$ and
1779
-results are shown below in Table \@ref(tab:timing3xf); the results from these
1780
-benchmarks are available as `data(benchmark_1_0.05)`:
1903
+In the above fitness specification the fitness effect of each gene
1904
+(when its restrictions are satisfied) is $s = 0.1$ (see section
1905
+\@ref(numfit) for details). Here we rerun all the above benchmarks
1906
+using $s= 0.05$ (the results from these benchmarks are available as
1907
+`data(benchmark_1_0.05)`) and results are shown below in Table
1908
+\@ref(tab:timing3xf).
1781 1909
 
1782 1910
 \bslandscape
1783 1911
 
... ...
@@ -1821,6 +1949,7 @@ pander(benchmark_1_0.05[ , c("Elapsed Time, average per simulation (s)",
1821 1949
 				  "Total Population Size, median", 
1822 1950
 				  "Total Population Size, mean", "Total Population Size, max.",
1823 1951
  	              "keepEvery", "PDBaseline", "n2", "onlyCancer")], 
1952
+				  justify = c('left', rep('right', 12)), ##  o.w. hlines not right
1824 1953
  	              ## caption = "\\label{tab:timing3xf}Benchmarks of all models in Table \\@ref(tab:allr1bck) using $s=0.05$ (instead of $s=0.1$)."  
1825 1954
 )  
1826 1955
 				  
... ...
@@ -1877,10 +2006,10 @@ iterations and, thus, running time.
1877 2006
 
1878 2007
 
1879 2008
 
1880
-The moral here is that in complex simulations like this, the effects
1881
-of some parameters ($s$ in this case) might look counter-intuitive
1882
-at first. Thus the need to "experiment before launching a large
1883
-number of simulations". 
2009
+The moral here is that in complex simulations like this (and most
2010
+simulations are complex), the effects of some parameters ($s$ in this
2011
+case) might look counter-intuitive at first. Thus the need to "experiment
2012
+before launching a large number of simulations".
1884 2013
 
1885 2014
 
1886 2015
 ## Several "common use cases" runs {#benchusual}
... ...
@@ -1923,12 +2052,12 @@ rf12 <- allFitnessEffects(genotFitness = rfl12)
1923 2052
 
1924 2053
 
1925 2054
 ## Independent genes; positive fitness from exponential distribution
1926
-## mean around 0.1, and negative from exponential with mean around 
1927
-## 0.02. Half of genes positive fitness effects, half negative.
2055
+## with mean around 0.1, and negative from exponential with mean 
2056
+## around -0.02. Half of genes positive fitness effects, half 
2057
+## negative.
1928 2058
 
1929
-ng <- 200
1930
-re_200 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), 
1931
-                                           -rexp(ng/2, 50)))
2059
+ng <- 200 re_200 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10),
2060
+-rexp(ng/2, 50)))
1932 2061
 
1933 2062
 ng <- 500
1934 2063
 re_500 <- allFitnessEffects(noIntGenes = c(rexp(ng/2, 10), 
... ...
@@ -2001,7 +2130,7 @@ oncoSimulPop(Nindiv,
2001 2130
 ```
2002 2131
 
2003 2132
 
2004
-For the exponential model we will stop simulations when the populations
2133
+For the exponential model we will stop simulations when populations
2005 2134
 have $>10^6$ cells (simulations start from 500 cells). For the McFarland
2006 2135
 model we will use the `detectionProb` mechanism (see section
2007 2136
 \@ref(detectprob) for details); we could have used as stopping mechanism
... ...
@@ -2009,45 +2138,46 @@ model we will use the `detectionProb` mechanism (see section
2009 2138
 reaching cancer, as argued in [@McFarland2013]) but we want to provide
2010 2139
 further examples under the `detectionProb` mechanism. We will start from 1000
2011 2140
 cells, not 500 (starting from 1000 we almost always reach cancer in a
2012
-single execution).
2141
+single run).
2013 2142
 
2014 2143
 
2015 2144
 Why not use the `detectionProb` mechanism with the `Exp` models?  Because
2016 2145
 it can be hard to intuitively understand what are reasonable settings for
2017 2146
 the parameters of the `detectionProb` mechanism when used in a population
2018 2147
 that is growing exponentially, especially if different genes have very
2019
-different effects on fitness and we are using fitness specifications that
2020
-are so different (compare the fitness landscape of six genes, the pancreas
2021
-specification, and the fitness specification with 4000 genes with fitness
2022
-effects drawn from an exponential distribution ---`re_4000`). In contrast,
2023
-the `detectionProb` mechanism might be simpler to reason about in a
2024
-population that is growing under a model of carrying capacity with
2025
-possibly large periods of stasis. Let us emphasize that it is not that the
2026
-`detectionProb` mechanism does not make sens with the Exp model; it is
2027
-simply that the parameters might need finer adjustment for them to make
2028
-sense, and in these benchmarks we are dealing with widely different
2029
-fitness specifications.
2148
+different effects on fitness. Moreover, we are using fitness
2149
+specifications that are very different (compare the fitness landscape of
2150
+six genes, the pancreas specification, and the fitness specification with
2151
+4000 genes with fitness effects drawn from an exponential distribution
2152
+---`re_4000`). In contrast, the `detectionProb` mechanism might be simpler
2153
+to reason about in a population that is growing under a model of carrying
2154
+capacity with possibly large periods of stasis. Let us emphasize that it
2155
+is not that the `detectionProb` mechanism does not make sense with the Exp
2156
+model; it is simply that the parameters might need finer adjustment for
2157
+them to make sense, and in these benchmarks we are dealing with widely
2158
+different fitness specifications.
2030 2159
 
2031 2160
 
2032 2161
 Note also that we specify `checkSizePEvery = 4` (instead of the default,
2033 2162
 which is 20). Why? Because the fitness specifications where fitness
2034 2163
 effects are drawn from exponential distributions (`re_200` to `re_4000`
2035 2164
 above) include many genes (well, up to 4000) some of them with possibly
2036
-very large effects. In these conditions, simulations will run very fast in
2165
+very large effects. In these conditions, simulations can run very fast in
2037 2166
 the sense of "units of time". If we check exiting conditions every 20
2038
-units the population could have increased size several orders of magnitude
2039
-in between checks (this is also discussed in sections \@ref(detectprob)
2040
-and \@ref(bench1xf)). You can verify this by running the script with other
2041
-settings for `checkSizePEvery` (and being aware that large settings might
2042
-require you to wait for a long time). To ensure that populations have
2043
-really grown, we have increased the setting of `PDBaseline` so that no
2044
-simulation can be considered for stopping unless its size is 1.4 times
2045
-larger than `initSize`.
2167
+units the population could have increased its size several orders of
2168
+magnitude in between checks (this is also discussed in sections
2169
+\@ref(bench1xf) and \@ref(detectprob)). You can verify this by running the
2170
+script with other settings for `checkSizePEvery` (and being aware that
2171
+large settings might require you to wait for a long time). To ensure that
2172
+populations have really grown, we have increased the setting of
2173
+`PDBaseline` so that no simulation can be considered for stopping unless
2174
+its size is 1.4 times larger than `initSize`.
2046 2175
 
2047
-In all cases we use `keepEvery = 1` and `keepPhylog = TRUE`. Finally, we
2048
-run all models with `errorHitWallTime = FALSE` and `errorHitMaxTries =
2049
-FALSE` so that we can see results even if stopping conditions are not
2050
-met.
2176
+In all cases we use `keepEvery = 1` and `keepPhylog = TRUE` (so we store
2177
+the population sizes of all clones every 1 time unit and we keep the
2178
+complete genealogy of clones). Finally, we run all models with
2179
+`errorHitWallTime = FALSE` and `errorHitMaxTries = FALSE` so that we can
2180
+see results even if stopping conditions are not met.
2051 2181
 
2052 2182
 
2053 2183
 
... ...
@@ -2120,7 +2250,7 @@ The McFL model with random fitness landscape `rf12` and with `pancr` does
2120 2250
 not satisfy the conditions of `detectionProb` in most cases: its median
2121 2251
 final time is 5000, which was the maximum final time specified. This
2122 2252
 suggests that the fitness landscape is such that it is unlikely that we
2123
-will reach population sizes $> 1400$ (remember we increase the setting for
2253
+will reach population sizes $> 1400$ (remember we the setting for
2124 2254
 `PDBaseline`) before 5000 time units. There is nothing particular about
2125 2255
 using a fitness landscape of 12 genes and other runs in other 12-gene
2126 2256
 random fitness landscapes do not show this pattern. However, complex
... ...
@@ -2161,14 +2291,15 @@ a `checkSizePEvery = 20`  probably would not have made sense.
2161 2291
 
2162 2292
 
2163 2293
 Finally, it is interesting that in the cases examined here, the two
2164
-slowest running simulations are from "Exp", with fitnesses `re_2000` and
2165
-`re_4000` (and the third slowest is also Exp, under `re_500`). These are
2166
-the cases with the largest number of clones. Why? In the "Exp" model there
2167
-is no competition, and fitness specifications `re_2000` and `re_4000` have
2168
-genomes with many genes with positive fitness contributions. It is thus
2169
-very easy to obtain, from the wildtype ancestor, a large number of clones
2170
-all of which have birth rates $>1$ and, thus, clones that are unlikely to
2171
-become extinct.
2294
+slowest running simulations are from "Exp", with fitnesses `re_2000`
2295
+and `re_4000` (and the third slowest is also Exp, under
2296
+`re_500`). These are also the cases with the largest number of
2297
+clones. Why? In the "Exp" model there is no competition, and fitness
2298
+specifications `re_2000` and `re_4000` have genomes with many genes
2299
+with positive fitness contributions. It is thus very easy to obtain,
2300
+from the wildtype ancestor, a large number of clones all of which
2301
+have birth rates $>1$ and, thus, clones that are unlikely to become
2302
+extinct.
2172 2303
 
2173 2304
 
2174 2305
 
... ...
@@ -2177,8 +2308,8 @@ become extinct.
2177 2308
 
2178 2309
 We will now rerun the simulations above changing the following:
2179 2310
 
2180
-- `finalTime` is  set to 25000.
2181
-- `onlyCancer` is  set to TRUE.
2311
+- `finalTime`  set to 25000.
2312
+- `onlyCancer`  set to TRUE.
2182 2313
 - The "Exp" models will stop when population size $> 10^5$.
2183 2314
 
2184 2315
 
... ...
@@ -2245,7 +2376,7 @@ iterations. Interestingly, notice that the median final time is close to
2245 2376
 10000, so the runs in \@ref(common1) with maximum final time of 5000 would
2246 2377
 have had a hard time finishing with `onlyCancer = TRUE`.
2247 2378
 
2248
-Forcing simulations to "reach cancer", and just random differences between
2379
+Forcing simulations to "reach cancer" and just random differences between
2249 2380
 the random fitness landscape also affect the McFL run under `rf12`: final
2250 2381
 time is below 5000 and the median number of iterations is about half of
2251 2382
 what was above.
... ...
@@ -2259,17 +2390,17 @@ is much smaller.
2259 2390
 ## Can we use a large number of genes? {#lnum}
2260 2391
 
2261 2392
 Yes. In fact, in OncoSimulR there is no pre-set limit on genome
2262
-size. However, large numbers of genes can lead to unacceptable
2263
-memory usage and/or running time. We discuss several examples next that
2264
-illustrate some of the major issues to consider. Another example
2393
+size. However, large numbers of genes can lead to unacceptably large
2394
+returned object sizes and/or running time. We discuss several examples
2395
+next that illustrate some of the major issues to consider. Another example
2265 2396
 with 50,000 genes is shown in section \@ref(mcf50070).
2266 2397
 
2267 2398
 We have seen in \@ref(bench1) and \@ref(common1) that for the Exp model,
2268 2399
 benchmark results using `detectionProb` require a lot of care and can be
2269
-misleading. Here, we will fix initial and final population
2270
-sizes. Moreover, to avoid the confounding factor of the `onlyCancer =
2271
-TRUE` argument, we will set it to FALSE, so we measure directly the time
2272
-of individual runs.
2400
+misleading. Here, we will fix initial population sizes (to 500) and all
2401
+final population sizes will be set to $\geq 10^6$. In addition, to avoid
2402
+the confounding factor of the `onlyCancer = TRUE` argument, we will set it
2403
+to FALSE, so we measure directly the time of individual runs.
2273 2404
 
2274 2405
 
2275 2406
 ### Exponential model with 10,000 and 50,000 genes {#exp50000}
... ...
@@ -2277,7 +2408,7 @@ of individual runs.
2277 2408
 #### Exponential, 10,000 genes, example 1  {#exp100001}
2278 2409
 
2279 2410
 We will start with 10000 genes and an exponential model, where we
2280
-stop when the population grows over $1e6$ individuals:
2411
+stop when the population grows over $10^6$ individuals:
2281 2412
 
2282 2413
 ```{r exp10000, echo = TRUE, eval = FALSE}
2283 2414
 ng <- 10000
... ...
@@ -2409,10 +2540,10 @@ print(object.size(e_50000), units = "MB")
2409 2540
 ```
2410 2541
 
2411 2542
 Of course, simulations now take longer and the size of the returned
2412
-object is over 7 GB (we are keeping over 7000 clones, even if when
2543
+object is over 7 GB (we are keeping more than 7,000 clones, even if when
2413 2544
 we prune all those that went extinct). 
2414 2545
 
2415
-### Exponential, example 4: 50,000 genes {#exp50000_2}
2546
+#### Exponential, 50,000 genes, example 2 {#exp50000_2}
2416 2547
 
2417 2548
 What if we had not pruned? 
2418 2549
 
... ...
@@ -2452,10 +2583,10 @@ print(object.size(e_50000np), units = "MB")
2452 2583
 
2453 2584
 ```
2454 2585
 The main effect is not on execution time but on object size (it has
2455
-grown by 5 GB).
2586
+grown by 5 GB). We are tracking more than 10,000 clones.
2456 2587
 
2457 2588
 
2458
-#### Exponential,  50,000 genes, example 2 {#exp50000_3}
2589
+#### Exponential,  50,000 genes, example 3 {#exp50000_3}
2459 2590
 
2460 2591
 What about the `mutationPropGrowth` setting? We will rerun the example in
2461 2592
 \@ref(exp500001) leaving `keepEvery = NA` but with the default
... ...
@@ -2512,6 +2643,54 @@ below).
2512 2643
 
2513 2644
 
2514 2645
 
2646
+#### Interlude: where is that 1 GB coming from? {#wheresizefrom}
2647
+
2648
+In section \@ref(exp100001) we have seen an apparently innocuous
2649
+simulation producing a returned object of almost 1 GB. Where is that coming
2650
+from? It means that each simulation produced almost 200 MB of output.
2651
+
2652
+Let us look at one simulation in more detail:
2653
+
2654
+```{r sizedetail, eval = FALSE, echo = TRUE}
2655
+r1 <- oncoSimulIndiv(u,
2656
+                     model = "Exp",
2657
+                     mu = 1e-7,
2658
+                     detectionSize = 1e6,
2659
+                     detectionDrivers = NA,
2660
+                     detectionProb = NA,
2661
+                     keepPhylog = TRUE,
2662
+                     onlyCancer = FALSE,
2663
+                     mutationPropGrowth = TRUE
2664
+                     )
2665
+summary(r1)[c(1, 8)]
2666
+##   NumClones  FinalTime
2667
+## 1      3887        345
2668
+
2669
+print(object.size(r1), units = "MB")
2670
+## 160 Mb
2671
+
2672
+## Size of the two largest objects inside:
2673
+sizes <- lapply(r1, function(x) object.size(x)/(1024^2))
2674
+sort(unlist(sizes), decreasing = TRUE)[1:2]
2675
+## Genotypes pops.by.time 
2676
+##       148.28        10.26 
2677
+
2678
+dim(r1$Genotypes)
2679
+## [1] 10000  3887
2680
+```
2681
+
2682
+The above shows the reason: the `Genotypes` matrix is a 10,000 by 3,887
2683
+integer matrix (with a 0 and 1 indicating not-mutated/mutated for each
2684
+gene in each genotype) and in R integers use 4 bytes each. The
2685
+`pops.by.time` matrix is 346 by 3,888 (the 1 in $346 = 345 + 1$ comes from
2686
+starting at 0 and going up to the final time, both included; the 1 in
2687
+$3888 = 3887 + 1$ is from the column of time) double matrix and doubles
2688
+use 8 bytes[^popsbytime].
2689
+
2690
+[^popsbytime]:These matrices do not exist during most of the
2691
+execution of the C++ code; they are generated right before returning
2692
+from the C++ code.
2693
+
2515 2694
 
2516 2695
 ### McFarland model with 50,000 genes; the effect of `keepEvery` {#mc50000}
2517 2696
 
... ...
@@ -2524,7 +2703,8 @@ other settings.
2524 2703
 
2525 2704
 #### McFarland, 50,000 genes, example 1 {#mc50000ex1}
2526 2705
 
2527
-Let's start with  `mutationPropGrowth = FALSE` and `keepEvery = NA`:
2706
+Let's start with  `mutationPropGrowth = FALSE` and `keepEvery =
2707
+NA`. Simulations end when population size $\geq 10^6$.
2528 2708
 
2529 2709
 ```{r mc50000_1, echo = TRUE, eval = FALSE}
2530 2710
 ng <- 50000
... ...
@@ -2602,17 +2782,18 @@ print(object.size(mc_50000_nmpg_k), units = "MB")
2602 2782
 ## 8101.4 Mb
2603 2783
 ```
2604 2784
 
2605
-Computing time increases slightly but the major effect is seen on
2606
-the size of the returned object, that increases by a factor of about
2607
-4x, up to 8 GB, corresponding to the increase in about 4x in the
2608
-number of clones being tracked (see details of where size comes from in
2785
+Computing time increases slightly but the major effect is seen on the size
2786
+of the returned object, that increases by a factor of about 4x, up to 8
2787
+GB, corresponding to the increase in about 4x in the number of clones
2788
+being tracked (see details of where the size of this object comes from in
2609 2789
 section \@ref(wheresizefrom)).
2610 2790
 
2611 2791
 
2612 2792
 #### McFarland, 50,000 genes, example 3 {#mc50000ex3}
2613 2793
 
2614 2794
 We will set `keepEvery = NA` again, but we will now increase
2615
-detection size by a factor of 3:
2795
+detection size by a factor of 3 (so we stop when total population
2796
+size becomes $\geq 3 * 10^6$).
2616 2797
 
2617 2798
 ```{r mc50000_popx, echo = TRUE, eval = FALSE}
2618 2799
 ng <- 50000
... ...
@@ -2746,15 +2927,15 @@ print(object.size(mc_50000_nmpg_5mu_k), units = "MB")
2746 2927
 
2747 2928
 We have already seen these effects before in section
2748 2929
 \@ref(mc50000ex2): using `keepEvery = 1` leads to a slight increase
2749
-in execution time. What is really affected is the size of the returned
2750
-object which increases by a factor of about 3x (and is now over
2751
-20GB). That 3x corresponds, of course, to the increase in the number
2752
-of clones being tracked. This, by the way, also allows us to
2753
-understand the comment above, where we said that in these two cases
2754
-(where we have increased mutation rate) at each iteration we need to
2755
-do more work as at every update of the population the algorithm
2756
-needs to loop over a much larger number of clones (even if many of
2757
-those are eventually pruned).
2930
+in execution time. What is really affected is the size of the
2931
+returned object which increases by a factor of about 3x (and is now
2932
+over 20GB). That 3x corresponds, of course, to the increase in the
2933
+number of clones being tracked (now over 20,000). This, by the way,
2934
+also allows us to understand the comment above, where we said that
2935
+in these two cases (where we have increased mutation rate) at each
2936
+iteration we need to do more work as at every update of the
2937
+population the algorithm needs to loop over a much larger number of
2938
+clones (even if many of those are eventually pruned).
2758 2939
 
2759 2940
 
2760 2941
 
... ...
@@ -2853,12 +3034,12 @@ $10^6$ starting from an equilibrium population of 500 we need about
2853 3034
 
2854 3035
 
2855 3036
 [^mcnum]: Given the dependence of death rates on population size in
2856
-    McFarland's model (\section \@ref(mcfl)), if all mutations have the
2857
-    same fitness effects we can calculate the equilibrium population size
2858
-    (where birth and death rates are equal) for a given number of mutated
2859
-    genes: $K * (e^{(1 + s)^p} - 1)$ where $K$ is the initial equilibrium
2860
-    size, $s$ the fitness effect of each mutation and $p$ the number of
2861
-    mutated genes.
3037
+    McFarland's model (section \@ref(mcfl)), if all mutations have
3038
+    the same fitness effects we can calculate the equilibrium
3039
+    population size (where birth and death rates are equal) for a
3040
+    given number of mutated genes as: $K * (e^{(1 + s)^p} - 1)$,
3041
+    where $K$ is the initial equilibrium size, $s$ the fitness
3042
+    effect of each mutation, and $p$ the number of mutated genes.
2862 3043
 
2863 3044
 
2864 3045
 Next, let us rerun \@ref(mc50000ex1):
... ...
@@ -2911,25 +3092,27 @@ be complete clonal sweeps so that in extreme cases a single clone might be
2911 3092
 all that is left after some time. This is not the case in the exponential
2912 3093
 models. 
2913 3094
 
2914
-Of course, the details depend on the difference in fitness effects between
2915
-different genotypes (or clones). In particular, we have seen several
2916
-examples where even with `keepEvery=NA` there are a lot of clones in the
2917
-McFL models. In those examples many clones had identical fitness (the
2918
-fitness effects of all genes with positive fitness is the same, and ditto
2919
-for the genes with negative fitness effects), so no clone ends up
2920
-displacing all the others.
3095
+Of course, the details depend on the difference in fitness effects
3096
+between different genotypes (or clones). In particular, we have seen
3097
+several examples where even with `keepEvery=NA` there are a lot of
3098
+clones in the McFL models. In those examples many clones had
3099
+identical fitness (the fitness effects of all genes with positive
3100
+fitness was the same, and ditto for the genes with negative fitness
3101
+effects), so no clone ends up displacing all the others.
2921 3102
 
2922 3103
 
2923 3104
 
2924 3105
 
2925 3106
 ### Are we keeping the complete history (genealogy) of the clones? {#histlargegenes}
2926 3107
 
2927
-Yes we are if we run with `keepPhylog = TRUE`, regardless of the setting
2928
-for `keepEvery`. As explained in section \@ref(trackindivs), OncoSimulR
2929
-prunes clones that never had a population size larger than zero at any
2930
-sampling period (so they are not reflected in the `pops.by.time` matrix in
2931
-the output). And when set `keepEvery = NA` we are telling OncoSimulR to
2932
-discard any sampling period except the very last one.
3108
+Yes we are if we run with `keepPhylog = TRUE`, regardless of the
3109
+setting for `keepEvery`. As explained in section \@ref(trackindivs),
3110
+OncoSimulR prunes clones that never had a population size larger
3111
+than zero at any sampling period (so they are not reflected in the
3112
+`pops.by.time` matrix in the output). And when we set `keepEvery =
3113
+NA` we are telling OncoSimulR to discard all sampling periods except
3114
+the very last one (i.e., the `pops.by.time` matrix contains only the
3115
+clones with 1 or more cells at the end of the simulation).
2933 3116
 
2934 3117
 `keepPhylog` operates differently: it records the exact time at
2935 3118
 which a clone appeared and the clone that gave rise to it. This
... ...
@@ -2937,12 +3120,16 @@ information is kept regardless of whether or not those clones appear
2937 3120
 in the `pops.by.time` matrix. 
2938 3121
 
2939 3122
 Keeping the complete genealogy might be of limited use if the
2940
-`pops.by.time` matrix only contains the very last period. However, you can
2941
-use `plotClonePhylog` and ask to be shown only clones that exist in the
2942
-very last period.
3123
+`pops.by.time` matrix only contains the very last period. However,
3124
+you can use `plotClonePhylog` and ask to be shown only clones that
3125
+exist in the very last period (while of course showing all of their
3126
+ancestors, even if those are now extinct ---i.e., regardless of
3127
+their abundance).
2943 3128
 
2944 3129
 For instance, in run \@ref(exp500001) we could have looked at the
2945
-information stored about the genealogy of clones by doing:
3130
+information stored about the genealogy of clones by doing (we look at the
3131
+first "individual" of the simulation, of the five "individuals" we
3132
+simulated):
2946 3133
 
2947 3134
 ```{r filog_exp50000_1, echo = TRUE, eval = FALSE}
2948 3135
 head(e_50000[[1]]$other$PhylogDF)
... ...
@@ -2969,8 +3156,10 @@ clone, the column labeled "parent" are the mutated genes in the
2969 3156
 parent, and the column labeled "child" are the mutated genes in the child.
2970 3157
 
2971 3158
 
2972
-And we could plot the clones that have a population size of at least
2973
-one in the last period doing:
3159
+And we could plot the genealogical relationships of clones that have
3160
+a population size of at least one in the last period (again, while
3161
+of course showing all of their ancestors, even if those are now
3162
+extinct ---i.e., regardless of their current numbers) doing:
2974 3163
 
2975 3164
 ```{r noplotlconephylog, echo = TRUE, eval = FALSE}
2976 3165
 plotClonePhylog(e_50000[[1]]) ## plot not shown
... ...
@@ -2987,55 +3176,11 @@ simulation. This is much smaller than the `pops.by.time` or
2987 3176
 launch, say, 1000 simulations via `oncoSimulPop`. That is why the
2988 3177
 default is `keepPhylog = FALSE` as this information is not needed as
2989 3178
 often as that in the other two matrices (`pops.by.time` and
2990
-`Genotypes`).
2991
-
2992
-
2993
-#### Interlude: where is that 1 GB coming from? {#wheresizefrom}
2994
-
2995
-In section \@ref(exp100001) we have seen an apparently innocuous
2996
-simulation producing a returned object of almost 1 GB. Where is that coming
2997
-from? It means that each simulation produced almost 200 MB of output.
2998
-
2999
-Let us look at one simulation in more detail:
3000
-
3001
-```{r sizedetail, eval = FALSE, echo = TRUE}
3002
-r1 <- oncoSimulIndiv(u,
3003
-                     model = "Exp",
3004
-                     mu = 1e-7,
3005
-                     detectionSize = 1e6,
3006
-                     detectionDrivers = NA,
3007
-                     detectionProb = NA,
3008
-                     keepPhylog = TRUE,
3009
-                     onlyCancer = FALSE,
3010
-                     mutationPropGrowth = TRUE
3011
-                     )
3012
-summary(r1)[c(1, 8)]
3013
-##   NumClones  FinalTime
3014
-## 1      3887        345
3015
-
3016
-print(object.size(r1), units = "MB")
3017
-## 160 Mb
3018
-
3019
-sizes <- lapply(r1, function(x) object.size(x)/(1024^2))
3020
-sort(unlist(sizes), decreasing = TRUE)[1:2]
3021
-## Genotypes pops.by.time 
3022
-##       148.28        10.26 
3023
-
3024
-dim(r1$Genotypes)
3025
-## [1] 10000  3887
3026
-```
3179
+`Genotypes`). However, if you plan to measure evolutionary predictability
3180
+using Lines of Descent, or LOD (section \@ref(evolpredszend)) you should
3181
+run simulations with `keepPhylog = TRUE`.
3027 3182
 
3028
-The above shows the reason: the `Genotypes` matrix is a 10,000 by 3,887
3029
-integer matrix (with a 0 and 1 indicating not-mutated/mutated for each
3030
-gene in each genotype) and in R integers use 4 bytes each. The
3031
-`pops.by.time` matrix is 346 by 3,888 (the 1 in $346 = 345 + 1$ comes from
3032
-starting at 0 and going up to the final time, both included; the 1 in
3033
-$3888 = 3887 + 1$ is from the column of time) double matrix and doubles
3034
-use 8 bytes[^popsbytime].
3035 3183
 
3036
-[^popsbytime]:These matrices do not exist during most of the
3037
-execution of the C++ code; they are generated right before returning
3038
-from the C++ code.
3039 3184
 
3040 3185
 
3041 3186
 
... ...
@@ -3055,7 +3200,7 @@ illustrates and discusses this problem is available in file
3055 3200
 precision from other sources. One of the most noticeable ones is
3056 3201
 that when we reach population sizes around $10^{11}$ the C++ code
3057 3202
 will often alert us by throwing exceptions with the message
3058
-`Recoverable exception ti set to DBL_MIN. Rerunning.`; I throw this
3203
+`Recoverable exception ti set to DBL_MIN. Rerunning.` I throw this
3059 3204
 exception because $t_i$, the random variable for time to next
3060 3205
 mutation, is less than `DBL_MIN`, the minimum representable
3061 3206
 floating-point number. This happens because, unless we use really
... ...
@@ -3187,14 +3332,14 @@ print(object.size(exp_k_50_1e11), units = "MB")
3187 3332
 ``` 
3188 3333
 
3189 3334
 Note that we almost reached `max.wall.time` (600 * 5 = 3000). What if we
3190
-wanted to go up to $10^{12}$? We could not do it in 10 minutes. We could
3191
-set `max.wall.time` to a value larger than 600 to allow us to reach larger
3192
-sizes but then we would be waiting for a possibly unacceptable time for
3193
-simulations to finish. Moreover, this would eventually fail as simulations
3194
-would keep hitting the $t_i$ exception without ever being able to
3195
-complete. Finally, even if we were very patient, hitting that $t_i$
3196
-exception should make us worry about possible biases in the samples.
3197
-
3335
+wanted to go up to $10^{12}$? We would not be able to do it in 10
3336
+minutes. We could set `max.wall.time` to a value larger than 600 to allow
3337
+us to reach larger sizes but then we would be waiting for a possibly
3338
+unacceptable time for simulations to finish. Moreover, this would
3339
+eventually fail as simulations would keep hitting the $t_i$ exception
3340
+without ever being able to complete. Finally, even if we were very
3341
+patient, hitting that $t_i$ exception should make us worry about possible
3342
+biases in the samples.
3198 3343
 
3199 3344
 
3200 3345
 
... ...
@@ -3203,10 +3348,11 @@ exception should make us worry about possible biases in the samples.
3203 3348
 
3204 3349
 To summarize this section, we have seen: 
3205 3350
 
3206
-  - Both McFL and Exp can be run in short times over a range of sizes for
3207
-      the `detectionProb` mechanism using a complex fitness specification
3208
-      with a small number of genes. These are the typical or common use
3209
-      cases of OncoSimulR.
3351
+  - Both McFL and Exp can be run in short times over a range of
3352
+      sizes for the `detectionProb` and `detectionSize` mechanisms
3353
+      using a complex fitness specification with  moderate numbers
3354
+      of genes. These are the typical or common use cases of
3355
+      OncoSimulR.
3210 3356
 	  
3211 3357
   - The `keepEvery` argument can have a large effect on time in the
3212 3358
     McFL models and specially on object sizes. If only the end
... ...
@@ -3226,12 +3372,13 @@ To summarize this section, we have seen:
3226 3372
     when we keep track of around 6000 to 10000 clones. Anything that leads
3227 3373
     to these patterns will slow down the simulations.
3228 3374
 	
3229
-  - OncoSimulR needs to keep track of genotypes (or clones), not just
3230
-    numbers of drivers and passengers, because it allows you to use
3231
-    complex fitness and mutation specifications that depend on specific
3232
-    genotypes. The `keepEvery = NA` is an approach to keep the minimal
3233
-    history needed, but it is unavoidable that during the simulations we
3234
-    might be forced to deal with many thousands of different clones.
3375
+  - OncoSimulR needs to keep track of genotypes (or clones), not
3376
+    just numbers of drivers and passengers, because it allows you to
3377
+    use complex fitness and mutation specifications that depend on
3378
+    specific genotypes. The `keepEvery = NA` is an approach to store
3379
+    only the minimal information needed, but it is unavoidable that
3380
+    during the simulations we might be forced to deal with many
3381
+    thousands of different clones.
3235 3382
  
3236 3383
 
3237 3384
 
... ...
@@ -3245,7 +3392,7 @@ OncoSimulR uses a standard continuous time model, where individual
3245 3392
 cells divide, die, and mutate with rates that can depend on genotype
3246 3393
 and population size; over time the abundance of the different
3247 3394
 genotypes changes by the action of selection (due to differences in
3248
-net growth rates between genotypes), drift, and mutation. As a
3395
+net growth rates among genotypes), drift, and mutation. As a
3249 3396
 result of a mutation in a pre-existing clone new clones arise, and
3250 3397
 the birth rate of a newly arisen clone is determined at the time of
3251 3398
 its emergence as a function of its genotype.  Simulations can use an
... ...
@@ -3253,7 +3400,7 @@ use exponential growth model or a model with carrying capacity that
3253 3400
 follows @McFarland2013. For the exponential growth model, the death
3254 3401
 rate is fixed at one whereas in the model with carrying capacity
3255 3402
 death rate increases with population size. In both cases, therefore,
3256
-fitness differences between genotypes in a given population at a
3403
+fitness differences among genotypes in a given population at a
3257 3404
 given time are due to differences in the mapping between genotype
3258 3405
 and birth rate. There is second exponential model (called "Bozic")
3259 3406
 where birth rate is fixed at one, and genotype determines death rate
... ...
@@ -3277,15 +3424,17 @@ With OncoSimulR you can specify different types of effects on fitness:
3277 3424
 
3278 3425
 
3279 3426
 
3280
-* A special type of epistatic effect that is particularly amenable to be
3281
-  represented as a graph. In this graph, having, say, "B" be a child of "A"
3282
-  means that a mutation in B can only accumulate if a mutation in A is
3283
-  already present.  This is what OT [@Desper1999JCB; @Szabo2008], CBN
3284
-  [@Beerenwinkel2007; @Gerstung2009; @Gerstung2011], progression networks
3285
-  [@Farahani2013], and other similar models [@Korsunsky2014] generally
3286
-  mean. Details are provided in section \@ref(posetslong). Note that this is
3287
-  not an order effect (discussed below): the fitness of a genotype from this
3288
-  DAGs is a function of whether or not the restrictions in the graph are
3427
+* A special type of epistatic effect that is particularly amenable
3428
+  to be represented as a graph (a DAG). In this graph having, say,
3429
+  "B" be a child of "A" means that a mutation in B can only
3430
+  accumulate if a mutation in A is already present.  This is what OT
3431
+  [@Desper1999JCB; @Szabo2008], CBN [@Beerenwinkel2007;
3432
+  @Gerstung2009; @Gerstung2011], progression networks
3433
+  [@Farahani2013], and other similar models [@Korsunsky2014]
3434
+  generally mean. Details are provided in section
3435
+  \@ref(posetslong). Note that this is not an order effect
3436
+  (discussed below): the fitness of a genotype from this DAGs is a
3437
+  function of whether or not the restrictions in the graph are
3289 3438
   satisfied, not the historical sequence of how they were satisfied.
3290 3439
 
3291 3440
 * Effects where the order in which mutations are acquired matters, as
... ...
@@ -3343,14 +3492,16 @@ relevant differences.
3343 3492
   construct that mapping for a modest genotype of 500 genes (you'd have
3344 3493
   more genotypes than particles in the observable Universe).
3345 3494
   
3346
-* For many models/data you often intuitively start with the fitness of
3347
-  the genotypes, not the fitness consequences of the different
3348
-  mutations. In these cases, you'd need to do the math to specify the
3349
-  terms you want if you used the lego system.
3495
+* For many models/data you often intuitively start with the fitness
3496
+  of the genotypes, not the fitness consequences of the different
3497
+  mutations. In these cases, you'd need to do the math to specify
3498
+  the terms you want if you used the lego system so you'll probably
3499
+  use the specification with the direct mapping genotype
3500
+  $\rightarrow$ fitness.
3350 3501
   
3351
-* Sometimes you already have a moderate size genotype $\rightarrow$
3352
-  fitness mapping and you certainly do not want to do the math by 
3353
-  hand: here the lego system would be painful to use.
3502
+* Likewise, sometimes you already have a moderate size genotype
3503
+  $\rightarrow$ fitness mapping and you certainly do not want to do
3504
+  the math by hand: here the lego system would be painful to use.
3354 3505
   
3355 3506
 * But sometimes we do think in terms of "the effects on fitness of
3356 3507
   such and such mutations are" and that immediately calls for the lego
... ...
@@ -3365,7 +3516,7 @@ relevant differences.
3365 3516
   this).
3366 3517
   
3367 3518
 * The lego system might help you see what your model really means: in
3368
-  many cases, you can obtain fairly succinct specification of complex
3519
+  many cases, you can obtain fairly succinct specifications of complex
3369 3520
   fitness models with just a few terms. Similarly, depending on what your
3370 3521
   emphasis is, you can often specify the same fitness landscape in several
3371 3522
   different ways.
... ...
@@ -3452,7 +3603,7 @@ fitness (or random fitness landscapes), as we will do in section
3452 3603
 
3453 3604
 
3454 3605
 We will see an example of this way of passing fitness again in
3455
-\@ref(bauer), where will compare it with the lego system.
3606
+\@ref(bauer), where we will compare it with the lego system.
3456 3607
 
3457 3608
 
3458 3609
 <!-- % Please see the documentation of `allFitnessEffects` for further -->
... ...
@@ -3659,31 +3810,6 @@ And what if I want genes without interactions but I want modules (see
3659 3810
 section \@ref(modules0))? Go to section \@ref(mod-no-epi).
3660 3811
 
3661 3812
 
3662
-## Numerical issues with death rates of 0 in Bozic model {#ex-0-death}
3663
-
3664
-As we mentioned above (section \@ref(fit-neg-pos)) death rates of 0 can
3665
-lead to trouble when using Bozic's model:
3666
-
3667
-
3668
-```{r}
3669
-i1 <- allFitnessEffects(noIntGenes = c(1, 0.5))
3670
-evalAllGenotypes(i1, order = FALSE, addwt = TRUE, 
3671
-                 model = "Bozic")
3672
-i1_b <- oncoSimulIndiv(i1, model = "Bozic")
3673
-
3674
-``` 
3675
-
3676
-
3677
-Of course, there is no problem in using the above with other models:
3678
-
3679
-```{r}
3680
-evalAllGenotypes(i1, order = FALSE, addwt = TRUE, 
3681
-                 model = "Exp")
3682
-i1_e <- oncoSimulIndiv(i1, model = "Exp")
3683
-summary(i1_e)
3684
-``` 
3685
-
3686
-
3687 3813
 
3688 3814
 ## Using DAGs: Restrictions in the order of mutations as extended posets {#posetslong}
3689 3815
 
... ...
@@ -4128,7 +4254,7 @@ the test directory to see a test that verifies this assertion).
4128 4254
 
4129 4255
 
4130 4256
 ## Modules {#modules0}
4131
-As already mentioned, we can think in all the effects of fitness in terms
4257
+As already mentioned, we can think of all the effects of fitness in terms
4132 4258
 not of individual genes but, rather, modules. This idea is discussed in,
4133 4259
 for example, @Raphael2014a, @Gerstung2011: the restrictions encoded
4134 4260
 in, say, the DAGs can be considered to apply not to genes, but to
... ...
@@ -4367,13 +4493,14 @@ the fitness of a double mutant "A", "B" is different depending on
4367 4493
 whether "A" was acquired before "B" or "B" before "A". This, of
4368 4494
 course, can be generalized to more than two genes.
4369 4495
 
4370
-Note that these order effects are different from the order
4371
-restrictions discussed in section \@ref(posetslong). In there we
4372
-might say that acquiring "B" depends or is facilitated by having "A"
4373
-mutated (and, unless we allowed for multiple mutations, having "A"
4374
-mutated means having "A" mutated before "B"). However, once you have
4375
-the genotype "A, B", its fitness does not depend on the order in
4376
-which "A" and "B" appeared.
4496
+Note that order effects are different from the restrictions in the
4497
+order of accumulation of mutations discussed in section
4498
+\@ref(posetslong). With restrictions in the order of accumulation of
4499
+mutations we might say that acquiring "B" depends or is facilitated
4500
+by having "A" mutated (and, unless we allowed for multiple
4501
+mutations, having "A" mutated means having "A" mutated before
4502
+"B"). However, once you have the genotype "A, B", its fitness does
4503
+not depend on the order in which "A" and "B" appeared.
4377 4504
 
4378 4505
 
4379 4506
 
... ...
@@ -4384,7 +4511,7 @@ two-gene orders (one of them a subset of one of the three) lead to
4384 4511
 different fitness compared to the wild-type. We add also modules, to show
4385 4512
 its usage (but just limit ourselves to using one gene per module here). 
4386 4513
 
4387
-Order effects are specified using a $x > y$, that means that that order
4514
+Order effects are specified using a $x > y$, which means that that order
4388 4515
 effect is satisfied when module $x$ is mutated before module $y$.
4389 4516
 
4390 4517
 ```{r}
... ...
@@ -4612,496 +4739,521 @@ to four, three, and two mutation genotypes.
4612 4739
 
4613 4740
 
4614 4741
 
4742
+## Epistasis {#epi}
4615 4743
 
4744
+### Epistasis: two alternative specifications {#e2}
4616 4745
 
4617
-## Synthetic viability {#sv}
4618
-
4619
-Synthetic viability and synthetic lethality
4620
-  [e.g., @Ashworth2011; @Hartman2001] are just special cases of epistasis
4621
-  (section \@ref(epi)) but we deal with them here separately.
4746
+We want the following mapping of genotypes to fitness:
4622 4747
 
4748
+------------------------
4749
+A    B       Fitness 
4750
+--   --   --------------
4751
+wt   wt    1 
4623 4752
 
4624
-### A simple synthetic viability example
4625
-A simple and extreme example of synthetic viability is shown in the
4626
-following table, where the joint mutant has fitness larger than the wild
4627
-type, but each single mutant is lethal.
4753
+wt   M     $1 + s_b$ 
4628 4754
 
4755
+M    wt    $1 + s_a$
4629 4756
 
4757
+M    M     $1 + s_{ab}$
4758
+--  --    --------------- 
4759
+ 
4760
+Suppose that the actual numerical values are $s_a = 0.2, s_b = 0.3, s_{ab}
4761
+= 0.7$.
4630 4762
 
4631
-A  B   Fitness
4632
-wt wt  1
4763
+We specify the above as follows: 
4764
+```{r}
4765
+sa <- 0.2
4766
+sb <- 0.3
4767
+sab <- 0.7
4633 4768
 
4634
-wt M   0
4769
+e2 <- allFitnessEffects(epistasis =
4770
+                            c("A: -B" = sa,
4771
+                              "-A:B" = sb,
4772
+                              "A : B" = sab))
4773
+evalAllGenotypes(e2, order = FALSE, addwt = TRUE)
4635 4774
 
4636
-M  wt  0
4775
+``` 
4637 4776
 
4638
-M  M   (1 + s)
4777
+That uses the "-" specification, so we explicitly exclude some patterns:
4778
+with "A:-B" we say "A when there is no B".
4639 4779
 
4780
+But we can also use a specification where we do not use the "-". That
4781
+requires a different numerical value of the interaction, because now, as
4782
+we are rewriting the interaction term as genotype "A is mutant, B is
4783
+mutant" the double mutant will incorporate the effects of "A mutant",
4784
+``B mutant" and "both A and B mutants". We can define a new $s_2$ that
4785
+satisfies $(1 + s_{ab}) = (1 + s_a) (1 + s_b) (1 + s_2)$ so
4786
+$(1 + s_2) = (1 + s_{ab})/((1 + s_a) (1 + s_b))$ and therefore specify as:
4640 4787
 
4641
-where "wt" denotes wild type and "M" denotes mutant.
4642 4788
 
4789
+```{r}
4790
+s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1
4643 4791
 
4644
-We can specify this (setting $s = 0.2$) as (I play around with spaces, to
4645
-show there is a certain flexibility with them):
4792
+e3 <- allFitnessEffects(epistasis =
4793
+                            c("A" = sa,
4794
+                              "B" = sb,
4795
+                              "A : B" = s2))
4796
+evalAllGenotypes(e3, order = FALSE, addwt = TRUE)
4646 4797
 
4647
-```{r}
4648
-s <- 0.2
4649
-sv <- allFitnessEffects(epistasis = c("-A : B" = -1,
4650
-                                      "A : -B" = -1,
4651
-                                      "A:B" = s))
4652 4798
 ``` 
4653 4799
 
4654
-Now, let's look at all the genotypes (we use "addwt" to also get the wt,
4655
-which by decree has fitness of 1), and disregard order:
4656
-
4657
-```{r}
4658
-(asv <- evalAllGenotypes(sv, order = FALSE, addwt = TRUE))
4659
-``` 
4800
+Note that this is the way you would specify effects with FFPopsim
4801
+[@Zanini2012]. Whether this specification or the previous one with
4802
+"-" is simpler will depend on the model. For synthetic mortality and
4803
+viability, I think the one using "-" is simpler to map genotype tables
4804
+to fitness effects. See also section \@ref(e3) and \@ref(theminus) and the
4805
+example in section \@ref(weis1b).
4660 4806
 
4661
-Asking the program to consider the order of mutations of course makes no
4662
-difference:
4663 4807
 
4664
-```{r}
4665
-evalAllGenotypes(sv, order = TRUE, addwt = TRUE)
4666
-``` 
4808
+Finally, note that we can also specify some of these effects by combining
4809
+the graph and the epistasis, as shown in section \@ref(misra1a) or
4810
+\@ref(weis1b).
4667 4811
 
4668
-Another example of synthetic viability is shown in section \@ref(misra1b).
4812
+### Epistasis with three genes and two alternative specifications {#e3}
4669 4813
 
4670
-Of course, if multiple simultaneous mutations are not possible in the
4671
-simulations, it is not possible to go from the wildtype to the double
4672
-mutant in this model where the single mutants are not viable.
4814
+Suppose we have 
4673 4815
 
4816
+-------------------------------------
4817
+A  B   C     Fitness 
4818
+-- --  --  --------------------------
4819
+M  wt  wt   $1 + s_a$ 
4820
+
4821
+wt M   wt   $1 + s_b$ 
4674 4822
 
4823
+wt wt  M    $1 + s_c$ 
4675 4824
 
4825
+M  M   wt   $1 + s_{ab}$ 
4676 4826
 
4677
-### Synthetic viability using Bozic model {#fit-neg-pos}
4827
+wt M   M    $1 + s_{bc}$ 
4678 4828
 
4679
-If we were to use the above specification with Bozic's models, we might
4680
-not get what we think we should get:
4829
+M  wt  M    $(1 + s_a) (1 + s_c)$ 
4681 4830
 
4682
-```{r}
4683
-evalAllGenotypes(sv, order = FALSE, addwt = TRUE, model = "Bozic")
4684
-```
4831
+M  M   M    $1 + s_{abc}$ 
4832
+-- --  --  -------------------------- 
4685 4833
 
4686
-What gives here? The simulation code would alert you of this (see section
4687
-\@ref(ex-0-death)) in this particular case because there are ``-1",
4688
-which might indicate that this is not what you want. The problem is that
4689
-you probably want the Death rate to be infinity (the birth rate was 0, so
4690
-no clone viability, when we used birth rates ---section \@ref(noviab)).
4834
+where missing rows have a fitness of 1 (they have been deleted for
4835
+conciseness). Note that the mutant for exactly A and C has a fitness that
4836
+is the product of the individual terms (so there is no epistasis in that case).
4691 4837
 
4692
-Let us say so explicitly:
4693 4838
 
4694 4839
 ```{r}
4695
-s <- 0.2
4696
-svB <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
4697
-                                      "A : -B" = -Inf,
4698
-                                      "A:B" = s))
4699
-evalAllGenotypes(svB, order = FALSE, addwt = TRUE, model = "Bozic")
4700
-```
4840
+sa <- 0.1
4841
+sb <- 0.15
4842
+sc <- 0.2
4843
+sab <- 0.3