DESCRIPTION000066400000000000000000000031701453240703200125650ustar00rootroot00000000000000Package: DNABarcodeCompatibility
Type: Package
Title: A Tool for Optimizing Combinations of DNA Barcodes Used in Multiplexed Experiments on Next Generation Sequencing Platforms
Version: 1.19.0
Maintainer: Céline Trébeau
Authors@R: c(
person("Céline", "Trébeau", email = "ctrebeau@pasteur.fr", role = "cre", comment = c(ORCID = "0000-0001-6795-5379")),
person("Jacques", "Boutet de Monvel", email = "jacques.boutet-de-monvel@pasteur.fr", role = "aut", comment = c(ORCID = "0000-0001-6182-3527")),
person("Fabienne", "Wong Jun Tai", email = "virginie.wong-jun-tai@pasteur.fr", role = "ctb"),
person("Raphaël", "Etournay", email = "raphael.etournay@pasteur.fr", role = "aut", comment = c(ORCID = "0000-0002-2441-9274")))
Description: The package allows one to obtain optimised combinations of DNA barcodes to be used for multiplex sequencing. In each barcode combination, barcodes are pooled with respect to Illumina chemistry constraints. Combinations can be filtered to keep those that are robust against substitution and insertion/deletion errors thereby facilitating the demultiplexing step. In addition, the package provides an optimiser function to further favor the selection of barcode combinations with least heterogeneity in barcode usage.
License: file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.1
Imports: dplyr, tidyr, numbers, purrr, stringr, DNABarcodes, stats, utils, methods
Depends: R (>= 3.6.0)
Suggests: knitr, rmarkdown, BiocStyle, testthat
VignetteBuilder: knitr
biocViews: Preprocessing, Sequencing
BugReports: https://github.com/comoto-pasteur-fr/DNABarcodeCompatibility/issues
LICENSE000066400000000000000000000432541453240703200120730ustar00rootroot00000000000000 GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
Copyright (C)
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.
NAMESPACE000066400000000000000000000006671453240703200123060ustar00rootroot00000000000000# Generated by roxygen2: do not edit by hand
export(distance_filter)
export(experiment_design)
export(file_loading_and_checking)
export(get_all_combinations)
export(get_random_combinations)
export(optimize_combinations)
import(DNABarcodes)
import(dplyr)
import(methods)
import(numbers)
import(purrr)
import(stringr)
import(tidyr)
importFrom(stats,na.omit)
importFrom(utils,combn)
importFrom(utils,read.table)
importFrom(utils,write.csv2)
NEWS000066400000000000000000000023321453240703200115550ustar00rootroot00000000000000Changes in version 0.99.10 (2018-12-01)
+ Rewording of a few terms in the documentation
+ Notify the Package release in the Bioconductor dev branch
Changes in version 0.99.9 (2018-11-25)
+ Notify that the package was approved by Bioconductor
Changes in version 0.99.8 (2018-11-25)
+ Fixed all Bioconductor reviewer's comments and updated documentation section accordingly. Passed BioCheck 1.18 locally
Changes in version 0.99.7 (2018-11-19)
+ Updated documentation section. Passed BioCheck 1.18 locally
Changes in version 0.99.6 (2018-11-19)
+ Change installation procedure to allow package install on R<3.6. Passed BioCheck 1.18 locally
Changes in version 0.99.5 (2018-11-19)
+ Remote BiocCheck imposes R >=3.6. Passed BioCheck 1.18 locally
Changes in version 0.99.4 (2018-11-19)
+ Improved error handling. Passed BioCheck 1.18 locally
Changes in version 0.99.3 (2018-11-13)
+ Fixed R version to 3.5. Passed BioCheck 1.18 locally
Changes in version 0.99.2 (2018-11-13)
+ New package features for the Shiny interface. Passed BioCheck 1.18 locally
Changes in version 0.99.1 (2018-10-15)
+ Passed BioCheck 1.17 locally
Changes in version 0.99.0 (2018-10-15)
+ Prepared package for submission to Bioconductor repository (BioCheck 1.16)R/000077500000000000000000000000001453240703200112575ustar00rootroot00000000000000R/DNABarcodeCompatibility-package.R000066400000000000000000000020441453240703200174070ustar00rootroot00000000000000#' DNABarcodeCompatibility:
#' to find optimised sets of compatible barcodes with least heterogeneity in
#' barcode usage for multiplex
#' experiments performed on next generation sequencing platforms.
#'
#'
#' The DNABarcodeCompatibility package provides six functions to load DNA
#' barcodes, and to generate, filter and optimise sets of barcode combinations
#' for multiplex sequencing experiments.
#' In particular, barcode combinations are selected to be compatible with
#' respect to Illumina chemistry constraints, and can be filtered to keep those
#' that are robust against substitution and insertion/deletion errors
#' thereby facilitating the demultiplexing step.
#' In addition, the package provides an optimiser function to further favor
#' the selection of compatible barcode combinations with least
#' heterogeneity in barcode usage.
#'
#' @docType package
#' @name DNABarcodeCompatibility-package
#' @import dplyr
#' @import tidyr
#' @import numbers
#' @import purrr
#' @import stringr
#' @import methods
#' @import DNABarcodes
NULLR/data.R000066400000000000000000000013441453240703200123150ustar00rootroot00000000000000#' Barcode dataset from Illumina.
#'
#' 48 barcodes from Illumina TruSeq Small RNA kits
#'
#' @format A data frame with 48 rows and 2 variables:
#' \describe{
#' \item{V1}{barcode identifier}
#' \item{V2}{DNA sequence}
#' ...
#' }
#'
"IlluminaIndexesRaw"
#' Barcode dataset from Illumina with features.
#'
#' 48 barcodes from Illumina TruSeq Small RNA kits along
#' with percentage in CG content and presence of homopolymers of size >= 3
#'
#' @format A data frame with 48 rows and 4 variables:
#' \describe{
#' \item{Id}{barcode identifier}
#' \item{sequence}{DNA sequence}
#' \item{GC_content}{percentage of G and C relative to A and T}
#' \item{homopolymer}{presence of nucleotide repetitions (>= 3)}
#' ...
#' }
#'
"IlluminaIndexes"
R/distance_filter.R000066400000000000000000000072651453240703200145530ustar00rootroot00000000000000#' @title
#' Select barcode combinations with error correction properties
#'
#' @description
#' Filters a list of barcode combinations for a given distance metric
#' (hamming or seqlev) and threshold in order to produce a list of barcodes
#' satisfying the distance constraints.
#'
#' @usage
#' distance_filter(index_df, combinations_m, metric, d)
#'
#' @param index_df A dataframe containing barcodes identifiers,
#' corresponding DNA sequences along with GC content and presence
#' of homopolymers.
#'
#' @param combinations_m A matrix of compatible barcode combinations.
#' @param metric The type of distance (hamming or seqlev or phaseshift).
#' @param d The minimum value of the distance.
#'
#' @details
#' The "hamming" distance is suitable for correcting substitution errors.
#' The "seqlev" distance is suitable for correcting both
#' substitution and insertion/deletion errors.
#'
#' @return
#' A filtered matrix containing the identifiers of the barcodes
#' satisfying the distance constraints.
#'
#' @examples
#' barcodes <- DNABarcodeCompatibility::IlluminaIndexes
#' m <- get_all_combinations(barcodes, 2, 4)
#' distance_filter(barcodes, m, "hamming", 3)
#'
#' @seealso
#' \code{\link{get_all_combinations}},
#' \code{\link{get_random_combinations}}
#'
#' @references Buschmann, T. 2015.
#' The Systematic Design and Application of Robust DNA Barcodes.
#' @references Buschmann, T. 2017.
#' DNABarcodes: an R package for the systematic construction of
#' DNA sample tags. Bioinformatics 33, 920–922.
#'
#' @export
#'
distance_filter = function(index_df, combinations_m, metric, d) {
check_input_dataframe( index_df,
c("Id", "sequence", "GC_content", "homopolymer"))
if (is.numeric(d)) {
if (d <= nchar(index_df$sequence[1])){
index_distance_df = index_distance(index_df)
if (metric == "hamming" ||
metric == "seqlev" ||
metric =="phaseshift") {
if (metric == "hamming") {
hamming_rejection_table = low_hamming_distance(
index_df,
index_distance_df,
d)
filtered_combinations_m = filter_combinations(
combinations_m,
hamming_rejection_table)
} else if (metric == "seqlev") {
seqlev_rejection_table = low_seqlev_distance(
index_df,
index_distance_df,
d)
filtered_combinations_m = filter_combinations(
combinations_m,
seqlev_rejection_table)
} else if (metric == "phaseshift") {
phaseshift_rejection_table = low_phaseshift_distance(
index_df,
index_distance_df,
d)
filtered_combinations_m = filter_combinations(
combinations_m,
phaseshift_rejection_table)
}
if(nrow(filtered_combinations_m) < 1){
display_message("No combination feats your research criteria")
return(NULL)
} else {
return (filtered_combinations_m)
}
} else {
display_message(paste(
"metric should be 'hamming',",
"'seqlev' or 'phaseshift'"))
return(NULL)
}
} else {
display_message("Please enter a d value <= sequence length")
return(NULL)
}
} else{
display_message("Please enter a number as d")
return(NULL)
}
}
R/experiment_design.R000066400000000000000000000162161453240703200151210ustar00rootroot00000000000000#' @title
#' Find a set of barcode combinations with least heterogeneity in
#' barcode usage for single and dual indexing
#'
#' @description
#' This function uses the Shannon Entropy to identify a set of compatible
#' barcode combinations with least heterogeneity in barcode usage,
#' in the context of single and dual indexing.
#' It performs either an exhaustive or a random-greedy search of compatible
#' DNA-barcode combinations depending on the size of the
#' DNA-barcode population and of the number of samples to be multiplexed.
#'
#' @usage
#' experiment_design(file1, sample_number, mplex_level, platform = 4,
#' file2 = NULL, export = NULL, metric = NULL, d = 3, thrs_size_comb,
#' max_iteration, method)
#'
#' @param file1 The input data file that contains 2 columns separated by a
#' space or a tabulation, namely the sequence identifiers and corresponding
#' DNA sequence
#' @param sample_number Number of libraries to be sequenced.
#' @param mplex_level The number at which the barcodes will be multiplexed.
#' @param file2 The input data file that contains 2 columns separated by
#' a space or a tabulation, namely the sequence identifiers and
#' corresponding DNA sequence; used for dual-indexing, see details below.
#' @param platform An integer representing the number of channels (1, 2, 4)
#' of the desired Illumina platform: 1 for iSeq; 2 for NextSeq, NovaSeq,
#' MiniSeq; 4 for HiSeq and MiSeq. 0 represents any other platform than
#' Illumina.
#' @param export If not NULL, results are saved in a csv file at the
#' specified path.
#' @param metric The type of distance (hamming or seqlev).
#' @param d The minimum value of the distance.
#' @param thrs_size_comb The maximum size of the set of compatible
#' combinations to be used for the greedy optimization.
#' @param max_iteration The maximum number of iterations during the
#' optimizing step.
#' @param method The choice of the greedy search: 'greedy_exchange' or
#' 'greedy_descent'.
#'
#' @details
#' By specifying the total number of libraries and the number of libraries
#' to be multiplexed, this function returns an optimal set of DNA-barcode
#' combinations to be used for sequencing.
#'
#' In the case of **single indexing**, one must only provide one input file
#' containing a list of DNA barcodes (file1 argument). The file2 argument being
#' optional, the program runs the optimisation for single indexing if this
#' argument is left empty. The output shows the sample ID with its respective
#' single barcode.
#'
#' In the case of **dual indexing**, one must provide two input files
#' containing DNA barcodes as two separate sets of barcodes. The program will
#' detect these two files and automatically switch to the 'dual indexing' mode.
#' The program then runs the optimisation for each barcode set separately.
#' The output shows the sample ID with its respective pair of barcodes.
#'
#' The inputs of the algorithm are a list of n distinct barcodes,
#' the number N of required libraries, and the multiplex level k; N = ak,
#' where a is the number of lanes of the flow cells to be used for the
#' experiment.
#'
#'
#' * Step 1:
#'
#' This step consists of identifying a set of compatible barcode
#' combinations. Given the number of barcodes and the multiplex level,
#' the total number of barcode combinations (compatible or not) reads:
#' \deqn{{n}\choose{k}}
#' If this number is not too large, the algorithm will perform an
#' exhaustive search and output all compatible combinations of k barcodes.
#' Otherwise, it will proceed by picking up combinations at random, in
#' order to identify a large enough set of compatible barcode combinations.
#'
#' * Step 2:
#'
#' Finds an optimized set of barcode combinations in which barcode
#' redundancy is minimized (see details in
#' \code{\link{optimize_combinations}})
#'
#'
#' @md
#'
#' @return
#' A dataframe containing compatible DNA-barcode combinations organized
#' by lanes of the flow cell.
#'
#' @examples
#' write.table(DNABarcodeCompatibility::IlluminaIndexesRaw,
#' txtfile <- tempfile(), row.names = FALSE, col.names = FALSE, quote=FALSE)
#' experiment_design(file1=txtfile, sample_number=18, mplex_level=3,
#' platform=4)
#'
#'
#' @importFrom stats na.omit
#' @importFrom utils combn read.table write.csv2
#'
#' @export
#'
#'
experiment_design = function (
file1,
sample_number,
mplex_level,
platform = 4,
file2 = NULL,
export = NULL,
metric = NULL,
d = 3,
thrs_size_comb = 120,
max_iteration = 50,
method = "greedy_exchange") {
if (is.null(file2)) {
index_df_1 = file_loading_and_checking(file1)
if (!is.null(index_df_1)) {
if (sample_and_multiplexing_level_check(
sample_number,
mplex_level) == TRUE) {
# print("mlx and sample ok")
result1 = final_result( index_df = index_df_1,
sample_number = sample_number,
mplex_level = mplex_level ,
platform = platform,
metric = metric,
d = d,
thrs_size_comb = thrs_size_comb,
max_iteration = max_iteration,
method = method)
if (!is.null(export)) {
write.csv2(result1, file = export)
}
return(result1)
} else{
stop("The multiplexing level or the sample number is wrong")
}
} else{
stop("An error occured on the file")
}
} else{
index_df_1 = file_loading_and_checking(file1)
if (!is.null(index_df_1)) {
index_df_2 = file_loading_and_checking(file2)
if (!is.null(index_df_2)) {
if (sample_and_multiplexing_level_check(
sample_number,
mplex_level)) {
result = final_result_dual(index_df_1 = index_df_1,
index_df_2 = index_df_2,
sample_number = sample_number,
mplex_level = mplex_level,
platform = platform,
metric = metric,
d = d,
thrs_size_comb = thrs_size_comb,
max_iteration = max_iteration,
method = method)
if (!is.null(export)) {
write.csv2(result, file = export)
}
return(result)
} else{
stop(paste(
"The multiplexing level",
"or the sample number is wrong"))
}
} else{
stop("An error occured on the second file")
}
} else{
stop("An error occured on the first file")
}
}
}
R/file_loading_and_checking.R000066400000000000000000000026051453240703200164760ustar00rootroot00000000000000#' @title
#' Loading and checking DNA barcodes.
#'
#' @description
#' Loads the file containing DNA barcodes and analyze barcode content.
#'
#' @usage
#' file_loading_and_checking(file)
#'
#' @param file The input data file that contains 2 columns separated by a space
#' or a tabulation, namely the sequence identifiers and
#' corresponding DNA sequence.
#'
#' @details
#' This function loads the DNA barcodes from the input file and checks barcodes
#' for unicity (identifier and sequence), DNA content, and equal size.
#' It also calculates the fraction of G and C relative to A and T, as referred
#' to as "GC content", and it detects the presence of
#' homopolymers of length >= 3.
#'
#' @return
#' A dataframe containing sequence identifiers, nucleotide sequence, GC content,
#' presence of homopolymers.
#'
#' @examples
#' write.table(DNABarcodeCompatibility::IlluminaIndexesRaw,
#' txtfile <- tempfile(), row.names = FALSE, col.names = FALSE, quote=FALSE)
#' file_loading_and_checking(txtfile)
#'
#' @export
#'
file_loading_and_checking = function(file){
index = read_index(file)
if (!is.null(index) && index_check(index)){# if no issue
index_number <- nrow(index)
index = index %>% mutate (GC_content = get_index_GC_content(index),
homopolymer = get_index_homopolymer(index))
return (index)
}else{
return(NULL)
}
}R/get_all_combinations.R000066400000000000000000000065531453240703200155670ustar00rootroot00000000000000#' @title
#' Get all compatible combinations.
#'
#' @description
#' Finds the exhaustive set of compatible barcode combinations.
#'
#' @usage
#' get_all_combinations(index_df, mplex_level, platform)
#'
#' @param index_df A dataframe containing barcodes identifiers, corresponding
#' DNA sequences along with GC content and presence of homopolymers.
#' @param mplex_level The number at which the barcodes will be multiplexed.
#' Illumina recommends to not multiplex more than 96 libraries.
#' @param platform An integer representing the number of channels (1, 2, 4)
#' of the desired Illumina platform: 1 for iSeq; 2 for NextSeq, NovaSeq,
#' MiniSeq; 4 for HiSeq and MiSeq. 0 represents any other platform than
#' Illumina.
#'
#' @details
#' Be aware that the total number of combinations may become prohibitively
#' large for large barcode sets and large multiplexing numbers.
#'
#' @return
#' A matrix containing the identifiers of compatible barcode combinations.
#'
#' @examples
#' get_all_combinations(DNABarcodeCompatibility::IlluminaIndexes, 2, 4)
#'
#'
#' @seealso
#' \code{\link{get_random_combinations}}
#'
#' @export
#'
get_all_combinations = function(index_df, mplex_level, platform){
check_input_dataframe( index_df,
c("Id", "sequence", "GC_content", "homopolymer"))
if (is.numeric(mplex_level)){
if(mplex_level <= 96){
if (mplex_level <= nrow(index_df)){
combination_m = NULL
if (platform == 4 ){
combinations_m =
get_all_combinations_4_channel( index_df,
mplex_level)
} else if (platform == 2){
combinations_m =
get_all_combinations_2_channel( index_df,
mplex_level)
} else if (platform == 1){
combinations_m =
get_all_combinations_1_channel( index_df,
mplex_level)
} else if (platform == 0){
combinations_m =
get_all_combinations_0_channel( index_df,
mplex_level)
} else {
display_message("Please choose a correct platform
for your experiment ")
}
if(nrow(as.matrix(combinations_m)) == 0){
display_message(
paste( "The programm didn't find any",
"compatible combination, please check",
"your index list and your search",
"parameters"))
return (NULL)
}else{
return (combinations_m)
}
} else {display_message(paste( "The value of mplex_level",
" should not be higher than the",
"number of available barcodes"))
}
} else {display_message(
"The value of mplex_level should not be higher than 96")
}
}else{
display_message("Please enter a number as mplex_level")
return(NULL)
}
}
R/get_random_combinations.R000066400000000000000000000067201453240703200162730ustar00rootroot00000000000000#' @title
#' Get a large set of compatible combinations.
#'
#' @description
#' Finds a randomly generated set of at most 1000 combinations
#' of compatible barcodes.
#'
#' @usage
#' get_random_combinations(index_df, mplex_level, platform)
#'
#' @param index_df A dataframe containing barcodes identifiers, corresponding
#' DNA sequences along with GC content and presence of homopolymers.
#' @param mplex_level The number at which the barcodes will be multiplexed.
#' Illumina recommends to not multiplex more than 96 libraries.
#' @param platform An integer representing the number of channels (1, 2, 4)
#' of the desired Illumina platform: 1 for iSeq; 2 for NextSeq, NovaSeq,
#' MiniSeq; 4 for HiSeq and MiSeq. 0 represents any other platform than
#' Illumina.
#'
#' @details
#' This function is suited if the total number of possible combinations is too
#' high for an exhaustive search to be possible in a reasonable amount of time.
#'
#' @return
#' A matrix containing the identifiers of compatible barcode combinations.
#'
#' @examples
#' get_random_combinations(DNABarcodeCompatibility::IlluminaIndexes, 3, 4)
#'
#'
#' @seealso
#' \code{\link{get_all_combinations}}
#'
#' @export
#'
get_random_combinations = function(index_df, mplex_level, platform){
check_input_dataframe( index_df,
c("Id", "sequence", "GC_content", "homopolymer"))
if (is.numeric(mplex_level)){
if(mplex_level <= 96){
if (mplex_level <= nrow(index_df)){
combinations_m = NULL
if (platform == 4 ){
combinations_m =
get_random_combinations_4_channel( index_df,
mplex_level)
} else if (platform == 2){
combinations_m =
get_random_combinations_2_channel( index_df,
mplex_level)
} else if (platform == 1){
combinations_m =
get_random_combinations_1_channel( index_df,
mplex_level)
} else if (platform == 0){
combinations_m =
get_random_combinations_0_channel( index_df,
mplex_level)
} else {
display_message(paste( "Please choose a correct platform",
"for your experiment"))
return(NULL)
}
if(nrow(as.matrix(combinations_m)) == 0){
display_message(paste(
"The programm didn't find any compatible",
"combination, please check your index list",
"and your search parameters"))
return (NULL)
}else{
return (combinations_m)
}
} else {display_message(paste( "The value of mplex_level",
" should not be higher than the",
"number of available barcodes"))
}
}else{display_message(
"The value of mplex_level should not be higher than 96")
}
}else{
display_message("Please enter a number as mplex_level")
return (NULL)
}
}
R/optimize_combinations.R000066400000000000000000000212451453240703200160130ustar00rootroot00000000000000#' @title
#' Find a set of barcode combinations with least heterogeneity
#' in barcode usage
#'
#' @description
#' This function uses the Shannon Entropy to identify a set of compatible
#' barcode combinations with least heterogeneity in barcode usage.
#'
#' @usage
#' optimize_combinations(combination_m, nb_lane, index_number,
#' thrs_size_comb, max_iteration, method)
#'
#' @param combination_m A matrix of compatible barcode combinations.
#' @param nb_lane The number of lanes to be use for sequencing
#' (i.e. the number of libraries divided by the multiplex level).
#' @param index_number The total number of distinct DNA barcodes in the
#' dataset.
#' @param thrs_size_comb The maximum size of the set of compatible
#' combinations to be used for the greedy optimization.
#' @param max_iteration The maximum number of iterations during
#' the optimizing step.
#' @param method The choice of the greedy search: 'greedy_exchange'
#' or 'greedy_descent'.
#'
#' @details
#' N/k compatible combinations are then selected using a Shannon entropy
#' maximization approach.
#' It can be shown that the maximum value of the entropy that can be attained
#' for a selection of N barcodes among n, with possible repetitions, reads:
#' \deqn{S_{max}=-(n-r)\frac{\lfloor N/n\rfloor}{N}
#' \log(\frac{\lfloor N/n\rfloor}{N})-r\frac{\lceil N/n\rceil}{N}
#' \log(\frac{\lceil N/n\rceil}{N})}
#'
#' where r denotes the rest of the division of N by n, while
#' \deqn{\lfloor N/n\rfloor} and \deqn{\lceil N/n\rceil} denote
#' the lower and upper integer parts of N/n, respectively.
#'
#' **Case 1:** number of lanes < number of compatible DNA-barcode
#' combinations
#'
#' This function seeks for compatible DNA-barcode combinations of
#' highest entropy. In brief this function uses a randomized greedy descent
#' algorithm to find an optimized selection.
#' Note that the resulting optimized selection may not be globally optimal.
#' It is actually close to optimal and much improved in terms of
#' non-redundancy of DNA barcodes used, compared to a randomly chosen set of
#' combinations of compatible barcodes.
#'
#' **Case 2:** number of lanes >= number of
#' compatible DNA-barcode combinations
#'
#' In such a case, there are not enough compatible DNA-barcode combinations
#' and redundancy is inevitable.
#'
#' @md
#'
#' @return
#' A matrix containing an optimized set of combinations
#' of compatible barcodes.
#'
#' @examples
#' m <- get_random_combinations(DNABarcodeCompatibility::IlluminaIndexes, 3, 4)
#' optimize_combinations(m, 12, 48)
#'
#'
#' @seealso
#' \code{\link{get_all_combinations}},
#' \code{\link{get_random_combinations}},
#' \code{\link{experiment_design}}
#'
#' @export
#'
optimize_combinations = function (combination_m,
nb_lane,
index_number,
thrs_size_comb = 120,
max_iteration = 50,
method = "greedy_exchange") {
if (!"matrix" %in% is(combination_m)) {
display_message(paste( "Wrong input format, the combinations should",
"be given as a matrix. The number of columns",
"should be equal to the multiplex level and",
"the number of rows should reflect the number",
"of compatible combinations"))
}
if (nrow(as.matrix(combination_m)) == 0) {
display_message("No combination have been found")
} else {
if (is.numeric(index_number)) {
if (is.numeric(nb_lane)) {
max = entropy_max(index_number, ncol(combination_m) * nb_lane)
print(paste("Theoretical max entropy:", round(max, 5)))
if (nb_lane < nrow(combination_m)) {
if (nrow(combination_m) > thrs_size_comb) {
a_combination = recursive_entropy(
combination_m[sample(seq(1,nrow(combination_m)),
thrs_size_comb), ],
nb_lane,
method = method)
i = 0
while ((i < max_iteration) &&
(entropy_result(a_combination) < max)) {
temp_combination =
recursive_entropy(
combination_m[
sample(seq(1,nrow(combination_m)),
thrs_size_comb), ],
nb_lane,
method = method)
if (entropy_result(temp_combination) >
entropy_result(a_combination)) {
a_combination = temp_combination
}
i = i + 1
}
} else {
if (nrow(combination_m) > 30) {
a_combination = recursive_entropy(combination_m,
nb_lane,
method = method)
i = 0
while ((i < max_iteration) &&
(entropy_result(a_combination) < max)) {
temp_combination = recursive_entropy(combination_m,
nb_lane,
method = method)
if (entropy_result(temp_combination) >
entropy_result(a_combination)) {
a_combination = temp_combination
}
i = i + 1
}
} else {
a_combination = recursive_entropy(combination_m,
nb_lane,
method = "greedy_descent")
i = 0
while ((i < max_iteration) &&
(entropy_result(a_combination) < max)) {
temp_combination = recursive_entropy(combination_m,
nb_lane,
method = "greedy_descent")
if (entropy_result(temp_combination) >
entropy_result(a_combination)) {
a_combination = temp_combination
}
i = i + 1
}
}
}
} else {
n = nb_lane - nrow(combination_m)
combination_m = combination_m[
sample(seq(1,nrow(combination_m)),
nrow(combination_m)), ,
drop = FALSE]
part_combination = combination_m[
sample(seq(1,nrow(combination_m)),
n,
replace = TRUE), ]
a_combination = rbind(combination_m, part_combination)
i = 0
while ((i < 30) &&
(entropy_result(a_combination) < max) && n > 0) {
part_combination = combination_m[
sample(seq(1,nrow(combination_m)),
n,
replace = TRUE), ]
temp_combination = rbind(combination_m, part_combination)
if (entropy_result(temp_combination) >
entropy_result(a_combination)) {
a_combination = temp_combination
}
i = i + 1
}
a_combination = a_combination[
sample(seq(1,nrow(a_combination)),
nrow(a_combination)), ]
}
print(paste(
"Entropy of the optimized set:",
round(entropy_result(a_combination), 5)
))
return(a_combination)
} else {
display_message("Please enter a number as nb_lane")
}
} else {
display_message("Please enter a number as index_number")
}
}
}
R/utils.R000066400000000000000000001042521453240703200125460ustar00rootroot00000000000000
library("dplyr")
library("tidyr")
library("numbers")
library("purrr")
library ("stringr")
library("DNABarcodes")
# initialize variable for CRAN/Bioc repository compatibility
globalVariables(
c(
"Id",
"Id1",
"Id2",
"Lane",
"V1",
"V2",
"barcodes",
"bcID",
"bc_gp",
"combID",
"hamming",
"index",
"isDropping",
"matchOcc",
"pairID",
"seqlev",
"phaseshift",
"."
)
)
# Inputs ------------------------------------------------------------------
index_env <- new.env() # ?bindenv() for help
read_index = function(file) {
if (!file.exists(file)) {
display_message("Your file doesn't exist, please check the path")
return(NULL)
assign("index", NULL, envir = index_env) #index <<- NULL
} else{
input <- try(as.data.frame(
read.table(
file,
header = FALSE,
sep = "",
col.names = c("Id", "sequence"),
colClasses = c("character", "character"),
stringsAsFactors = FALSE
)
), silent = TRUE)
if (exists("input")) {
if (! "try-error" %in% is(input)) {
assign("index", input, envir = index_env)
return(input)
} else {
assign("index", NULL, envir = index_env) #index <<- NULL
display_message(
"An error occurred, please check the content of your file")
return(NULL)
}
}
}
}
unicity_check = function(index) {
index$sequence = toupper(index$sequence)
assign("index_df", index, envir = index_env)
if (index$Id %>% anyDuplicated() != 0) {
#checks if the index Ids are unique
v = paste("two duplicated indexes IDs,check row number",
anyDuplicated(index$Id))
display_message(v)
return(FALSE)
}
else if (index$sequence %>% anyDuplicated() != 0) {
#checks if the index sequences are unique
v = paste(
"At least one of your sequences is not unique, check row number",
anyDuplicated(index$sequence)
)
display_message(v)
return(FALSE)
}
else{
return (TRUE)
}
}
#check the character for one sequence
character_check = function(sequence) {
wrong_letters = LETTERS[!(LETTERS) %in% c("A", "G", "C", "T")]
check = all(!str_detect(sequence, wrong_letters))
return(check)
}
#checks the characters of each sequence
sequences_character_check = function(sequences) {
return(map_lgl(sequences, character_check))
}
#checks the length of each sequence
length_check = function(sequences) {
return(str_length(sequences) == str_length(sequences[1]))
}
#checks for character type and sequence length
character_and_length_check = function(index) {
c_check = sequences_character_check(index$sequence)
if (length (which(c_check == FALSE)) > 0) {
display_message(paste(
"Your sequence contains a wrong character, row number : ",
which(c_check == FALSE)
))
return(FALSE)
}
else {
l_check = length_check(index$sequence)
if (length(which(l_check == FALSE)) > 0) {
wrong_length = as.numeric(
names(table(nchar(index$sequence)))[
as.numeric(
which(table(nchar(index$sequence)) ==
min(table(nchar(index$sequence)))))
])
c = which(nchar(index$sequence) == wrong_length)
display_message(paste("the indexes have not the same size,
check out row number", c))
return(FALSE)
} else{
return(TRUE)
}
}
}
## Check the index for possible issues
index_check = function(index) {
return (all(unicity_check(index),
character_and_length_check(index)))
}
get_sequence_GC_content = function (sequence) {
GC_content = str_count( sequence,
pattern = ("G|C")) / nchar(sequence) * 100
return(round(GC_content, digits = 2))
}
get_index_GC_content = function(index) {
index_GC_content = map_dbl(index$sequence, get_sequence_GC_content)
return (index_GC_content)
}
get_sequence_homopolymer = function(sequence) {
if (length(str_which(sequence, "A{3,}|G{3,}|C{3,}|T{3,}")) > 0) {
return (TRUE)
} else{
return (FALSE)
}
}
get_index_homopolymer = function(index) {
homopolymer = map_lgl(index$sequence, get_sequence_homopolymer)
}
sample_number_check = function (sample_number) {
if (is.na(sample_number)) {
display_message("you have to enter an integer value")
return(FALSE)
}
else if (sample_number != floor(sample_number)) {
display_message("you have to enter an integer value")
return(FALSE)
}
else if (sample_number < 2) {
display_message(paste("You need at least 2 samples in order to",
"use multiplexing"))
return(FALSE)
} else if (sample_number > 1000) {
display_message(paste("The sample number is too high,",
"please enter a value under 1000"))
return(FALSE)
}
else {
return(TRUE)
}
}
multiplexing_level_set = function (sample_number) {
v = 2:(sample_number)
multiplexing_level_choices = v[sample_number %% v == 0]
return (multiplexing_level_choices [multiplexing_level_choices < 96])
}
sample_and_multiplexing_level_check = function(sample_number, mplex_level) {
if (sample_number_check(sample_number)) {
possible_multiplexing_level = multiplexing_level_set(sample_number)
if (!(mplex_level %in% possible_multiplexing_level)) {
display_message(paste(
"The sample number isn't a multiple of the multiplexing level.",
"Here are the possible multiplexing levels :"
))
display_message(possible_multiplexing_level)
return(FALSE)
} else {
return (TRUE)
}
} else {
return(FALSE)
}
}
# sequence traduction for a 4_channel platform :
sequence_binary_conversion_4_channel = function(sequence) {
binary_sequence =
toupper(sequence) %>%
gsub("A|C", "0", .) %>%
gsub("G|T", "1", .)
return(binary_sequence)
}
# index traduction for a 4_channel platform :
index_binary_conversion_4_channel = function(index) {
index = index %>%
mutate (binary_4 = sequence_binary_conversion_4_channel(sequence))
return (index)
}
# sequence traduction for a 2_channel platform for image 1 :
sequence_binary_conversion_2_channel_1 = function(sequence) {
binary_sequence =
toupper(sequence) %>%
gsub("G|C", "0", .) %>%
gsub("A|T", "1", .)
return(binary_sequence)
}
# sequence traduction for a 2_channel platform for image 2 :
sequence_binary_conversion_2_channel_2 = function(sequence) {
binary_sequence =
toupper(sequence) %>%
gsub("G|T", "0", .) %>%
gsub("A|C", "1", .)
return(binary_sequence)
}
# index traduction for a 2_channel platform :
index_binary_conversion_2_channel = function(index) {
index = index %>%
mutate (
binary_2_image_1 = sequence_binary_conversion_2_channel_1(sequence),
binary_2_image_2 = sequence_binary_conversion_2_channel_2(sequence)
)
return (index)
}
# sequence traduction for a 1_channel platform for image 1 :
sequence_binary_conversion_1_channel_1 = function(sequence) {
binary_sequence =
toupper(sequence) %>%
gsub("G|C", "0", .) %>%
gsub("A|T", "1", .)
return(binary_sequence)
}
# sequence traduction for a 1_channel platform for image 2 :
sequence_binary_conversion_1_channel_2 = function(sequence) {
binary_sequence =
toupper(sequence) %>%
gsub("A|G", "0", .) %>%
gsub("C|T", "1", .)
return(binary_sequence)
}
index_binary_conversion_1_channel = function(index) {
index = index %>%
mutate (
binary_1_image_1 = sequence_binary_conversion_1_channel_1(sequence),
binary_1_image_2 = sequence_binary_conversion_1_channel_2(sequence)
)
return (index)
}
# Compatibility -----------------------------------------------------------
#conversion of the string sequence into a vector of numeric
binary_word_into_numeric = function (binary_word) {
as.numeric(unlist(strsplit(as.character(binary_word), "")))
}
#conversion of each index sequence combination into a matrix
vectors_into_matrix = function (binary_word) {
m = mapply(binary_word_into_numeric, binary_word)
return(m)
}
#test if a column/line of a index combination is correct
any_different = function(binary_sequence) {
if (length(unique(binary_sequence)) > 1) {
return (TRUE)
} else{
return (FALSE)
}
}
has_signal_in_both_channels = function(colored_sequence) {
if (any(as.logical(colored_sequence))) {
return (TRUE)
} else{
return (FALSE)
}
}
# check if a combination of indexes is correct
is_a_good_combination = function (combination_matrix) {
all_combinations = vectors_into_matrix(combination_matrix)
results = prod(apply(all_combinations, 1, any_different))
return(results)
}
is_a_good_combination_2 = function (combination_matrix) {
all_combinations = vectors_into_matrix(combination_matrix)
results = prod(apply(all_combinations, 1, has_signal_in_both_channels))
return(results)
}
# keeps only the good ones :
list_of_good_combinations = function (matrix_id) {
list = apply(matrix_id, 2, is_a_good_combination)
return(list)
}
list_of_good_combinations_2 = function (matrix_id) {
list = apply(matrix_id, 2, is_a_good_combination_2)
return(list)
}
##super fast and furious
#matches an id to its binary_sequence
id_into_4_channel_binary_sequence = function (index_id_combination, index_df) {
index_rows = subset(index_df, Id == index_id_combination)
index_binary_sequence_combination = as.character(index_rows$binary_4)
return (index_binary_sequence_combination)
}
# matches an id to its binary_sequence for 2_channel image 1
id_into_2_channel_image_1_binary_sequence = function (index_id_combination,
index_df) {
index_rows = subset(index_df, Id == index_id_combination)
index_binary_sequence_combination =
as.character(index_rows$binary_2_image_1)
return (index_binary_sequence_combination)
}
# matches an id to its binary_sequence for 2_channel image 2
id_into_2_channel_image_2_binary_sequence = function (index_id_combination,
index_df) {
index_rows = subset(index_df, Id == index_id_combination)
index_binary_sequence_combination =
as.character(index_rows$binary_2_image_2)
return (index_binary_sequence_combination)
}
# matches an id to its binary_sequence for 1_channel image 1
id_into_1_channel_image_1_binary_sequence = function (index_id_combination,
index_df) {
index_rows = subset(index_df, Id == index_id_combination)
index_binary_sequence_combination =
as.character(index_rows$binary_1_image_1)
return (index_binary_sequence_combination)
}
# matches an id to its binary_sequence for 1_channel image 2
id_into_1_channel_image_2_binary_sequence = function (index_id_combination,
index_df) {
index_rows = subset(index_df, Id == index_id_combination)
index_binary_sequence_combination =
as.character(index_rows$binary_1_image_2)
return (index_binary_sequence_combination)
}
# matches the matrix_id to the matrix_binary_sequence
matrix_id_to_binary_sequence = function(matrix_id, index_df) {
m = matrix_id
n = nrow(m)
# m[1:n, ] = sapply(X = m[1:n, ], FUN = id_into_4_channel_binary_sequence,
# index_df)
m[seq(1,n), ] = vapply(X = m[seq(1,n), ],
FUN = id_into_4_channel_binary_sequence,
"",
index_df)
return (m)
}
# matches the matrix_id to the matrix_2_channel_immage_1_binary_sequence
matrix_id_to_2_channel_image_1_binary_sequence = function(matrix_id, index_df){
m = matrix_id
n = nrow(m)
m[seq(1,n), ] = vapply(X = m[seq(1,n), ],
FUN = id_into_2_channel_image_1_binary_sequence,
"",
index_df)
return (m)
}
# matches the matrix_id to the matrix_2_channel_immage_2_binary_sequence
matrix_id_to_2_channel_image_2_binary_sequence = function(matrix_id, index_df){
m = matrix_id
n = nrow(m)
m[seq(1,n), ] = vapply(X = m[seq(1,n), ],
FUN = id_into_2_channel_image_2_binary_sequence,
"",
index_df)
return (m)
}
# matches the matrix_id to the matrix_1_channel_immage_1_binary_sequence
matrix_id_to_1_channel_image_1_binary_sequence = function(matrix_id, index_df){
m = matrix_id
n = nrow(m)
m[seq(1,n), ] = vapply(X = m[seq(1,n), ],
FUN = id_into_1_channel_image_1_binary_sequence,
"",
index_df)
return (m)
}
# matches the matrix_id to the matrix_1_channel_immage_2_binary_sequence
matrix_id_to_1_channel_image_2_binary_sequence = function(matrix_id, index_df){
m = matrix_id
n = nrow(m)
m[seq(1,n), ] = vapply(X = m[seq(1,n), ],
FUN = id_into_1_channel_image_2_binary_sequence,
"",
index_df)
return (m)
}
# get all compatible combinations of an index for 4_channel platform
get_all_combinations_4_channel = function(index_df, mplex_level) {
index_df = index_binary_conversion_4_channel(index_df)
index_df = index_df %>% arrange(Id)
matrix_id = combn(x = index_df$Id, m = mplex_level)
matrix_binary_sequence = matrix_id_to_binary_sequence(matrix_id, index_df)
logical_rigth_combination = as.logical (
x = list_of_good_combinations(matrix_id = matrix_binary_sequence)
)
list_of_all_combinations = matrix_id[, logical_rigth_combination] %>% t()
return(list_of_all_combinations)
}
# get all compatible combinations of an index for 2_channel platform
get_all_combinations_2_channel = function(index_df, mplex_level) {
index_df = index_binary_conversion_2_channel(index_df)
index_df = index_df %>% arrange(Id)
matrix_id = combn(x = index_df$Id, m = mplex_level)
#matches Ids to binary sequences
image_1_matrix_binary_sequence =
matrix_id_to_2_channel_image_1_binary_sequence(matrix_id = matrix_id,
index_df = index_df)
image_1_logical_rigth_combination = as.logical (
x = list_of_good_combinations_2(image_1_matrix_binary_sequence)
)
#matches Ids to binary sequences
image_2_matrix_binary_sequence =
matrix_id_to_2_channel_image_2_binary_sequence(matrix_id = matrix_id,
index_df = index_df)
image_2_logical_rigth_combination = as.logical (
x = list_of_good_combinations_2(image_2_matrix_binary_sequence)
)
logical_rigth_combination = as.logical(
image_1_logical_rigth_combination * image_2_logical_rigth_combination)
list_of_all_combinations = matrix_id[, (logical_rigth_combination)] %>%
t()
return(list_of_all_combinations)
}
# get all compatible combinations of an index for 1_channel platform
get_all_combinations_1_channel = function(index_df, mplex_level) {
index_df = index_binary_conversion_1_channel(index_df)
index_df = index_df %>% arrange(Id)
matrix_id = combn(x = index_df$Id, m = mplex_level)
#matches Ids to binary sequences
image_1_matrix_binary_sequence =
matrix_id_to_1_channel_image_1_binary_sequence(matrix_id = matrix_id,
index_df = index_df)
image_1_logical_rigth_combination = as.logical(
x = list_of_good_combinations_2(image_1_matrix_binary_sequence)
)
#matches Ids to binary sequences
image_2_matrix_binary_sequence =
matrix_id_to_1_channel_image_2_binary_sequence(matrix_id = matrix_id,
index_df = index_df)
image_2_logical_rigth_combination = as.logical(
x = list_of_good_combinations_2(image_2_matrix_binary_sequence)
)
logical_rigth_combination = as.logical(
image_1_logical_rigth_combination * image_2_logical_rigth_combination)
list_of_all_combinations = matrix_id[, (logical_rigth_combination)] %>%
t()
return(list_of_all_combinations)
}
#get all combination for other platform than Illumina (no constraint)
get_all_combinations_0_channel = function(index_df, mplex_level) {
list_of_all_combinations = combn(x = index_df$Id, m = mplex_level) %>%
t()
return(list_of_all_combinations)
}
# For a random search
get_random_combinations_4_channel = function (index_df, mplex_level) {
index_df = index_binary_conversion_4_channel(index_df)
list_of_good_combs = matrix(nrow = 1000, ncol = mplex_level)
j = 0
for (i in seq(1,1000)) {
combination = index_df[sample(
x = seq(1,nrow(index_df)),
size = (mplex_level),
replace = FALSE
), ]
if (is_a_good_combination(combination$binary_4)) {
j = j + 1
#facilitates the distance filter
combination = arrange(combination, Id)
list_of_good_combs [j, ] = combination$Id
}
}
M = list_of_good_combs %>% unique() %>% na.omit()
#facilitates the distance filter
M = M[order(M[, 1]), ]
return (M)
}
get_random_combinations_2_channel = function (index_df, mplex_level) {
index_df = index_binary_conversion_2_channel(index_df)
list_of_good_combs = matrix(nrow = 1000, ncol = mplex_level)
j = 0
for (i in seq(1,1000)) {
combination = index_df[sample(
x = seq(1,nrow(index_df)),
size = (mplex_level),
replace = FALSE
), ]
if (is_a_good_combination_2(combination$binary_2_image_1) &&
is_a_good_combination_2(combination$binary_2_image_2)) {
j = j + 1
#facilitates the distance filter
combination = arrange(combination, Id)
list_of_good_combs [j, ] = combination$Id
}
}
M = list_of_good_combs %>% unique() %>% na.omit()
#facilitates the distance filter
M = M[order(M[, 1]), ]
return (M)
}
get_random_combinations_1_channel = function (index_df, mplex_level) {
index_df = index_binary_conversion_1_channel(index_df)
list_of_good_combs = matrix(nrow = 1000, ncol = mplex_level)
j = 0
for (i in seq(1,1000)) {
combination = index_df[sample(
x = seq(1,nrow(index_df)),
size = (mplex_level),
replace = FALSE
), ]
if (is_a_good_combination_2(combination$binary_1_image_1) &&
is_a_good_combination_2(combination$binary_1_image_2)) {
j = j + 1
#facilitates the distance filter
combination = arrange(combination, Id)
list_of_good_combs [j, ] = combination$Id
}
}
M = list_of_good_combs %>% unique() %>% na.omit()
#facilitates the distance filter
M = M[order(M[, 1]), ]
return (M)
}
get_random_combinations_0_channel = function (index_df, mplex_level) {
list_of_good_combs = matrix(nrow = 1000, ncol = mplex_level)
for (i in seq(1,1000)) {
combination = index_df[sample(
x = seq(1,nrow(index_df)),
size = (mplex_level),
replace = FALSE
), ]
combination = arrange(combination, Id)
list_of_good_combs [i, ] = combination$Id
}
M = list_of_good_combs %>% unique()
#facilitates the distance filter
M = M[order(M[, 1]), ]
return (M)
}
# gets the rights combinations according to the number of possible combinations
get_combinations = function (index_df, mplex_level, platform){
if (choose(nrow(index_df),mplex_level) <= 2024 || mplex_level == 2){
return(get_all_combinations(index_df, mplex_level, platform))
} else {
return(get_random_combinations(index_df, mplex_level, platform))
}
}
# Filtering --------------------------------------------------------------
# generates all possible couples
# and calculates DNAbarcodes's hamming and seqlev distances
index_distance = function (index_df) {
index_distance_df = combn(index_df$sequence, 2) %>% t() %>%
as.data.frame(stringsAsFactors = FALSE)
index_distance_df = index_distance_df %>%
mutate(n = seq(1,nrow(index_distance_df)))
index_distance_df = index_distance_df %>% group_by(n) %>% mutate (
hamming = distance(V1, V2, metric = "hamming"),
seqlev = distance (V1, V2, metric = "seqlev"),
phaseshift = distance(V1, V2, metric = "phaseshift")
)
return(index_distance_df)
}
# couples with hamming distance under threshold
low_hamming_distance = function(index_df, index_distance_df, d) {
i_d = index_distance_df %>% filter(hamming < d)
i_d = i_d %>% group_by(n) %>%
# matches the sequence to the id
mutate(Id1 = index_df$Id[which(V1 == index_df$sequence)],
Id2 = index_df$Id[which(V2 == index_df$sequence)])
low_distance_tab = i_d %>% ungroup() %>% select(Id1, Id2)
return(low_distance_tab)
}
#couples with seqlev distance under threshold
low_seqlev_distance = function(index_df, index_distance_df, d) {
i_d = index_distance_df %>% filter(seqlev < d)
i_d = i_d %>% group_by(n) %>%
# matches the sequence to the id
mutate(Id1 = index_df$Id[which(V1 == index_df$sequence)],
Id2 = index_df$Id[which(V2 == index_df$sequence)])
low_distance_tab = i_d %>% ungroup() %>% select(Id1, Id2)
return(low_distance_tab)
}
#couples with phaseshift distance under threshold
low_phaseshift_distance = function(index_df, index_distance_df, d) {
i_d = index_distance_df %>% filter(phaseshift < d)
i_d = i_d %>% group_by(n) %>%
# matches the sequence to the id
mutate( Id1 = index_df$Id[which(V1 == index_df$sequence)],
Id2 = index_df$Id[which(V2 == index_df$sequence)])
low_distance_tab = i_d %>% ungroup() %>% select(Id1, Id2)
return(low_distance_tab)
}
# low distance tab = hamming rejection table or seqlev rejection table
filter_combinations = function(combinations_m, low_distance_tab) {
# browser()
combinations_df <-
combinations_m %>% as.data.frame(., stringsAsFactors = FALSE) %>%
mutate(combID =
0:((nrow(.) > 0) * (nrow(.) - 1)))
# head(combinations_df)
combinations_df_long <- gather(combinations_df,
bc_gp,
barcodes,
-combID) %>%
select(-bc_gp) %>% arrange(combID)
# head(combinations_df_long)
dist_tab_long <- gather(
low_distance_tab %>% mutate(pairID = 0:((nrow(low_distance_tab) > 0)
*(nrow(low_distance_tab) - 1))),
bcID,
barcodes,
-pairID) %>%
select(-bcID) %>% arrange(pairID)
droppedComb <- combinations_df_long %>% group_by(combID) %>%
do ({
single_comb <- .
dist_tab_long %>% group_by(pairID) %>%
summarise(
matchOcc = sum(barcodes %in% single_comb$barcodes)
) %>%
ungroup() %>%
mutate(isDropping = as.logical(matchOcc == 2)) %>%
summarise(isDropping = any(isDropping))
}) %>% ungroup()
keptComb <-
inner_join(combinations_df, droppedComb, by = "combID") %>%
filter(!isDropping) %>% select(-c(combID, isDropping)) %>% as.matrix()
# if (nrow(keptComb)<1) stop("No combination left after filtering,
# please reduce the threshold.")
return(keptComb)
}
# Result ------------------------------------------------------------------
# Shannon's entropy
shannon_entropy = function(frequency) {
return(-1 * sum(frequency * log(frequency)))
}
#for a matrix of combination
entropy_result = function (index_combination) {
d = index_combination %>% table()
d = d / sum(d)
entropy = shannon_entropy(d)
return (entropy)
}
# Celine's entropy for given parameters
entropy_n_k = function (index_number, sample_number) {
k = index_number
n = sample_number
entropy =
-(k - (n %% k)) * (floor(n / k) / n) * log(floor(n / k) / n) -
(n %% k) * (ceiling(n / k) / n) * log(ceiling(n / k) / n)
return(entropy)
}
entropy_max = function (index_number, sample_number) {
if (sample_number > index_number) {
return (entropy_n_k(index_number, sample_number))
}
else {
return(log(sample_number))
}
}
recursive_entropy = function(combination_m,
nb_lane,
method = "greedy_exchange") {
if (!method %in% c("greedy_exchange", "greedy_descent")) {
display_message(
paste0(
"recursive_entropy(): the selected method '",
method,
"' does not exist. Please select 'greedy_exchange'
or 'greedy_descent'"
)
)
return(combination_m)
}
if (method == "greedy_descent") {
# Celine's implementation
while (nrow(combination_m) > nb_lane) {
ind = combn(nrow(combination_m), nrow(combination_m) - 1)
a = vector(length = nrow(combination_m))
for (i in seq(1,nrow(combination_m))) {
a[i] = entropy_result(combination_m[ind[, i], ])
}
x = which.max(a)
z = which(a == a[x])
combination_m = combination_m[ind[, z[sample(seq(1,length(z)), 1)]], ]
}
return (combination_m)
}
if (method == "greedy_exchange") {
# Jacques' implementation from his Matlab code
clist = combination_m
ncomb = nb_lane
c0 = combination_m[sample(seq(1,nrow(combination_m)), nb_lane), ]
N = nrow(combination_m)
c = c0
ctest = c
Sc = entropy_result(c)
test = 1
while (test) {
test = 0
Stest = Sc
n = 1
i = 1
while ((n <= ncomb) && (i <= N) && (test == 0)) {
ctp = c
ctp[n, ] = clist[i, ]
Stp = entropy_result(ctp)
if (Stp > Sc) {
test = 1
Stest = Stp
ctest = ctp
}
if (i < N) {
i = i + 1
} else {
i = 1
n = n + 1
}
}
if (Stest > Sc) {
c = ctest
Sc = Stest
}
}
return (ctest)
}
}
# gets the result
get_result = function ( index_df,
sample_number,
mplex_level,
platform,
metric = NULL,
d = 3,
thrs_size_comb = 120,
max_iteration = 50,
method = "greedy_exchange") {
combinations_m = get_combinations(index_df, mplex_level, platform)
if (!is.null(metric)) {
combinations_m = distance_filter (index_df, combinations_m, metric, d)
}
if(!is.null(combinations_m)){
if (sample_number == mplex_level){
result = combinations_m[sample(seq(1,nrow(combinations_m)), 1),]
result = data.frame(Id = result,
Lane = (rep(
seq(1,1),
length.out = sample_number,
each = mplex_level
)))
result$Id = as.character(result$Id)
return (result)
}else{
nb_lane = sample_number %>%
as.numeric() / mplex_level %>% as.numeric()
index_number = nrow(index_df)
cb = optimize_combinations(combinations_m,
nb_lane,
index_number,
thrs_size_comb,
max_iteration,
method) %>% as.data.frame()
result = data.frame(Id = as.vector(cb %>% t() %>% as.vector),
Lane = (rep(
seq(1,nb_lane),
length.out = sample_number,
each = mplex_level
)))
result$Id = as.character(result$Id)
return(result)}
} else {
return(NULL)
}
}
# Experiment Design (single or dual) -----------------------------------------
# change the position of index if there is any duplicate
# Illumina 770-2017-004-C|page 3
# Using unique dual index combinations is a
# best practice to make sure that reads with incorrect indexes do not
# impact variant calling or assignment of gene expression counts.
check_for_duplicate = function(result1, result2) {
check = data.frame(Id1 = result1$Id, Id2 = result2$Id)
if (anyDuplicated(check) != 0) {
d = anyDuplicated(check)
for (i in seq(1,length(d))) {
id = result2[d[i], ]$Id
lane_to_change = result2 %>% filter(Lane == result2[d[i], ]$Lane)
lanes_to_keep = result2 %>% filter(Lane != result2[d[i], ]$Lane)
j = which(lane_to_change$Id == id) %>% as.numeric()
if (j < nrow(lane_to_change)) {
temp = lane_to_change[j, ]
lane_to_change[j, ] = lane_to_change[j + 1, ]
lane_to_change[j + 1, ] = temp
} else {
temp = lane_to_change[j, ]
print(temp)
lane_to_change[j, ] = lane_to_change[1, ]
lane_to_change[1, ] = temp
}
result = bind_rows(lane_to_change, lanes_to_keep) %>% arrange(Lane)
}
return (result)
} else{
return (result2)
}
}
is_a_prime_number = function (sample_number) {
result = isPrime(sample_number) %>% as.numeric()
return(result)
}
final_result = function(index_df,
sample_number,
mplex_level,
platform,
metric,
d,
thrs_size_comb = 120,
max_iteration = 50,
method = "greedy_exchange") {
if (mplex_level<= nrow(index_df)){
result1 = get_result(
index_df,
sample_number,
mplex_level,
platform,
metric,
d,
thrs_size_comb,
max_iteration,
method
)}else {
result1 = NULL
display_message(
"the number of barcodes is not sufficient for the multiplexing level")
}
if(!is.null(result1)){
result1 = data.frame(
sample = seq(1,sample_number) %>% as.character(),
Lane = result1$Lane %>% as.character(),
Id = result1$Id %>% as.character(),
stringsAsFactors = FALSE
)
result1 = left_join(result1, select(index_df, Id, sequence), by = "Id")
return (result1)
} else {
return(NULL)
}
}
final_result_dual = function( index_df_1,
index_df_2,
sample_number,
mplex_level,
platform = 4,
metric = NULL,
d = 3,
thrs_size_comb = 120,
max_iteration = 50,
method = "greedy_exchange") {
result1 = get_result(
index_df_1,
sample_number,
mplex_level,
platform,
metric,
d,
thrs_size_comb,
max_iteration,
method
)
print(result1)
result2 = get_result(
index_df_2,
sample_number,
mplex_level,
platform,
metric,
d,
thrs_size_comb,
max_iteration,
method
)
print(result2)
if(!is.null(result1) && !is.null(result2)){
result2 = check_for_duplicate(result1, result2)
result1 = left_join(result1,
select(index_df_1, Id, sequence),
by = "Id")
print(result1)
result2 = left_join(result2,
select(index_df_2, Id, sequence),
by = "Id")
print(result2)
result = data.frame(
sample = seq(1,sample_number) %>% as.numeric(),
Lane = result1$Lane %>% as.character(),
Id1 = result1$Id %>% as.character(),
sequence1 = result1$sequence %>% as.character(),
Id2 = result2$Id %>% as.character(),
sequence2 = result2$sequence %>% as.character(),
stringsAsFactors = FALSE
) %>% arrange(sample)
result$sample = as.character(result$sample)
return(result)
} else {
return(NULL)
}
}
display_message <- function (a_message) {
assign("error_message", a_message, envir = .GlobalEnv)
print(a_message)
}
check_input_dataframe <- function(df, column_names){
if ( !("data.frame" %in% is(df))) {
display_message(paste( "wrong input format: a dataframe is",
"expected"))
} else if (!all(column_names %in% names(df))) {
missing_column_names <- setdiff(column_names, names(df))
display_message(paste( "wrong input format,",
"missing column", missing_column_names))
}
}
README.md000066400000000000000000000046121453240703200123400ustar00rootroot00000000000000
About
=================
DNABarcodeCompatibility: R-package to find optimised sets of compatible barcodes for multiplex experiments.
News
=================
* The DNABarcodeCompatibility package has been approved by Bioconductor and was added to the Bioconductor release 3.9: [https://bioconductor.org/packages/DNABarcodeCompatibility](https://bioconductor.org/packages/DNABarcodeCompatibility)
* The DNABarcodeCompatibility package is also directly usable through a Shiny web application hosted by the Institut Pasteur: [https://dnabarcodecompatibility.pasteur.fr](https://dnabarcodecompatibility.pasteur.fr)
Installation
================
* Requirements
+ Install [R](https://www.r-project.org/) if not yet installed (R >= 3.6 is required).
* Within a R console, type in the following commands:
```
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DNABarcodeCompatibility")
```
Documentation
================
[Introduction](https://comoto-pasteur-fr.github.io/DNABarcodeCompatibility/)
[API documentation](https://comoto-pasteur-fr.github.io/DNABarcodeCompatibility/DNABarcodeCompatibility-manual.pdf)
* Related tools:
In addition, DNABarcodeCompatibility can be used from a graphical user interface. We now provide two different interfaces:
1) [Standalone Java-based graphical user interface](https://github.com/comoto-pasteur-fr/DNABarcodeCompatibility_GUI). In such a case, additional dependencies must be installed: [Java (JDK 8 - 64 bits)](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) and `rJava` R-package (`install.packages("rJava")`).
2) [Shiny web-based graphical user interface](https://github.com/comoto-pasteur-fr/DNABarcodeCompatibility_Shiny). This interface can be run locally within a web browser or deployed on a web server.
Support
=========
Please use the [github ticket system](https://github.com/comoto-pasteur-fr/DNABarcodeCompatibility/issues) to report issues or suggestions.
We also welcome pull requests.
Reference
==========
Trébeau C, Boutet de Monvel J, Wong Jun Tai F, Petit C, Etournay R. DNABarcodeCompatibility: an R-package for optimizing DNA-barcode combinations in multiplex sequencing experiments. Bioinformatics. 2018 Dec 21. [doi: 10.1093/bioinformatics/bty1030](https://doi.org/10.1093/bioinformatics/bty1030). [Epub ahead of print] PubMed PMID: 30576403.
data/000077500000000000000000000000001453240703200117675ustar00rootroot00000000000000data/IlluminaIndexes.rda000066400000000000000000000010721453240703200155510ustar00rootroot00000000000000BZh91AY&SYU8={́lt@gB@B ]6@ja4
hPS)(Chh
42dшbi4&&hɦF@@4
4=M
:j̞!h|'nR+؉(I 32I-LʋV,Zu6s UTM6$l Tv2NjRҁPoF"!HM ]-۷7
R:z9sM4+9s%9m[l~075ccb-V7#m<w"D|TnG
W8Gc`jP!
-v0V"
ZD
HڭAFSQ[XIl%XKI;dƮ&{*=uDj (K)j$i|:|+Tw43f9p:
SwoNaKZ, o~%vQ-?P7D"(H*fdata/IlluminaIndexesRaw.rda000066400000000000000000000011051453240703200162200ustar00rootroot00000000000000BZh91AY&SYk,U@/'@@!nli=6MGzQ1= #@'"G`ɓ&&4ME#C@&CLF@)60 L l03LvV9&[UKHL,G ؆ZAmbj$#ri.%xc/(SO 8(h yK)_~T)DyjzvvJR剐nێx44DH,0$I` @; K>QckP!JR&c*Dfg(m㉂cD!FQ#lHqSv6(drVdrFՈV0cɓ*hdZ$v Fj;-I,z99%$g8a;,"v5br%i_}P S08SBgzٞg2p稗qH]B@Cinst/000077500000000000000000000000001453240703200120335ustar00rootroot00000000000000inst/CITATION000066400000000000000000000042551453240703200131760ustar00rootroot00000000000000bibentry(bibtype = "Article",
key = "trebeau_dnabarcodecompatibility:_2018",
title = "{DNABarcodeCompatibility}: an {R}-package for optimizing {DNA}-barcode combinations in multiplex sequencing experiments",
issn = "1367-4811",
shorttitle = "{DNABarcodeCompatibility}",
doi = "10.1093/bioinformatics/bty1030",
abstract = "Summary: Using adequate DNA barcodes is essential to unambiguously identify each DNA library within a multiplexed set of libraries sequenced using next-generation sequencers. We introduce DNABarcodeCompatibility, an R-package that allows one to design single or dual-barcoding multiplex experiments by imposing desired constraints on the barcodes (including sequencer chemistry, barcode pairwise minimal distance, and nucleotide content), while optimizing barcode frequency usage, thereby allowing one to both facilitate the demultiplexing step and spare expensive library-preparation kits. The package comes with a user-friendly interface and a web app developed in Java and Shiny, respectively, with the aim to help bridge the expertise of core facilities with the experimental needs of non-experienced users.\nAvailability and implementation: DNABarcodeCompatibility can be easily extended to fulfil specific project needs. The source codes of the R-package and its user interfaces are publicly available along with documentation at [https://github.com/comoto-pasteur-fr] under the GPL-2 licence.\nSUPPLEMENTARY INFORMATION: SUPPLEMENTARY DATA ARE AVAILABLE AT BIOINFORMATICS ONLINE.",
language = "eng",
journal = "Bioinformatics (Oxford, England)",
author = c(person(given = "Céline",
family = "Trébeau"),
person(given = "Jacques",
family = "Boutet de Monvel"),
person(given = "Fabienne",
family = "Wong Jun Tai"),
person(given = "Christine",
family = "Petit"),
person(given = "Raphaël",
family = "Etournay")),
month = "dec",
year = "2018",
pmid = "30576403",
)man/000077500000000000000000000000001453240703200116315ustar00rootroot00000000000000man/DNABarcodeCompatibility-package.Rd000066400000000000000000000020211453240703200201200ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DNABarcodeCompatibility-package.R
\docType{package}
\name{DNABarcodeCompatibility-package}
\alias{DNABarcodeCompatibility-package}
\title{DNABarcodeCompatibility:
to find optimised sets of compatible barcodes with least heterogeneity in
barcode usage for multiplex
experiments performed on next generation sequencing platforms.}
\description{
The DNABarcodeCompatibility package provides six functions to load DNA
barcodes, and to generate, filter and optimise sets of barcode combinations
for multiplex sequencing experiments.
In particular, barcode combinations are selected to be compatible with
respect to Illumina chemistry constraints, and can be filtered to keep those
that are robust against substitution and insertion/deletion errors
thereby facilitating the demultiplexing step.
In addition, the package provides an optimiser function to further favor
the selection of compatible barcode combinations with least
heterogeneity in barcode usage.
}
man/IlluminaIndexes.Rd000066400000000000000000000012001453240703200152030ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data.R
\docType{data}
\name{IlluminaIndexes}
\alias{IlluminaIndexes}
\title{Barcode dataset from Illumina with features.}
\format{A data frame with 48 rows and 4 variables:
\describe{
\item{Id}{barcode identifier}
\item{sequence}{DNA sequence}
\item{GC_content}{percentage of G and C relative to A and T}
\item{homopolymer}{presence of nucleotide repetitions (>= 3)}
...
}}
\usage{
IlluminaIndexes
}
\description{
48 barcodes from Illumina TruSeq Small RNA kits along
with percentage in CG content and presence of homopolymers of size >= 3
}
\keyword{datasets}
man/IlluminaIndexesRaw.Rd000066400000000000000000000006531453240703200156700ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data.R
\docType{data}
\name{IlluminaIndexesRaw}
\alias{IlluminaIndexesRaw}
\title{Barcode dataset from Illumina.}
\format{A data frame with 48 rows and 2 variables:
\describe{
\item{V1}{barcode identifier}
\item{V2}{DNA sequence}
...
}}
\usage{
IlluminaIndexesRaw
}
\description{
48 barcodes from Illumina TruSeq Small RNA kits
}
\keyword{datasets}
man/distance_filter.Rd000066400000000000000000000030451453240703200152610ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/distance_filter.R
\name{distance_filter}
\alias{distance_filter}
\title{Select barcode combinations with error correction properties}
\usage{
distance_filter(index_df, combinations_m, metric, d)
}
\arguments{
\item{index_df}{A dataframe containing barcodes identifiers,
corresponding DNA sequences along with GC content and presence
of homopolymers.}
\item{combinations_m}{A matrix of compatible barcode combinations.}
\item{metric}{The type of distance (hamming or seqlev or phaseshift).}
\item{d}{The minimum value of the distance.}
}
\value{
A filtered matrix containing the identifiers of the barcodes
satisfying the distance constraints.
}
\description{
Filters a list of barcode combinations for a given distance metric
(hamming or seqlev) and threshold in order to produce a list of barcodes
satisfying the distance constraints.
}
\details{
The "hamming" distance is suitable for correcting substitution errors.
The "seqlev" distance is suitable for correcting both
substitution and insertion/deletion errors.
}
\examples{
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
m <- get_all_combinations(barcodes, 2, 4)
distance_filter(barcodes, m, "hamming", 3)
}
\references{
Buschmann, T. 2015.
The Systematic Design and Application of Robust DNA Barcodes.
Buschmann, T. 2017.
DNABarcodes: an R package for the systematic construction of
DNA sample tags. Bioinformatics 33, 920–922.
}
\seealso{
\code{\link{get_all_combinations}},
\code{\link{get_random_combinations}}
}
man/experiment_design.Rd000066400000000000000000000100711453240703200156300ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/experiment_design.R
\name{experiment_design}
\alias{experiment_design}
\title{Find a set of barcode combinations with least heterogeneity in
barcode usage for single and dual indexing}
\usage{
experiment_design(file1, sample_number, mplex_level, platform = 4,
file2 = NULL, export = NULL, metric = NULL, d = 3, thrs_size_comb,
max_iteration, method)
}
\arguments{
\item{file1}{The input data file that contains 2 columns separated by a
space or a tabulation, namely the sequence identifiers and corresponding
DNA sequence}
\item{sample_number}{Number of libraries to be sequenced.}
\item{mplex_level}{The number at which the barcodes will be multiplexed.}
\item{platform}{An integer representing the number of channels (1, 2, 4)
of the desired Illumina platform: 1 for iSeq; 2 for NextSeq, NovaSeq,
MiniSeq; 4 for HiSeq and MiSeq. 0 represents any other platform than
Illumina.}
\item{file2}{The input data file that contains 2 columns separated by
a space or a tabulation, namely the sequence identifiers and
corresponding DNA sequence; used for dual-indexing, see details below.}
\item{export}{If not NULL, results are saved in a csv file at the
specified path.}
\item{metric}{The type of distance (hamming or seqlev).}
\item{d}{The minimum value of the distance.}
\item{thrs_size_comb}{The maximum size of the set of compatible
combinations to be used for the greedy optimization.}
\item{max_iteration}{The maximum number of iterations during the
optimizing step.}
\item{method}{The choice of the greedy search: 'greedy_exchange' or
'greedy_descent'.}
}
\value{
A dataframe containing compatible DNA-barcode combinations organized
by lanes of the flow cell.
}
\description{
This function uses the Shannon Entropy to identify a set of compatible
barcode combinations with least heterogeneity in barcode usage,
in the context of single and dual indexing.
It performs either an exhaustive or a random-greedy search of compatible
DNA-barcode combinations depending on the size of the
DNA-barcode population and of the number of samples to be multiplexed.
}
\details{
By specifying the total number of libraries and the number of libraries
to be multiplexed, this function returns an optimal set of DNA-barcode
combinations to be used for sequencing.
In the case of \strong{single indexing}, one must only provide one input file
containing a list of DNA barcodes (file1 argument). The file2 argument being
optional, the program runs the optimisation for single indexing if this
argument is left empty. The output shows the sample ID with its respective
single barcode.
In the case of \strong{dual indexing}, one must provide two input files
containing DNA barcodes as two separate sets of barcodes. The program will
detect these two files and automatically switch to the 'dual indexing' mode.
The program then runs the optimisation for each barcode set separately.
The output shows the sample ID with its respective pair of barcodes.
The inputs of the algorithm are a list of n distinct barcodes,
the number N of required libraries, and the multiplex level k; N = ak,
where a is the number of lanes of the flow cells to be used for the
experiment.
\itemize{
\item Step 1:
}
This step consists of identifying a set of compatible barcode
combinations. Given the number of barcodes and the multiplex level,
the total number of barcode combinations (compatible or not) reads:
\deqn{{n}\choose{k}}
If this number is not too large, the algorithm will perform an
exhaustive search and output all compatible combinations of k barcodes.
Otherwise, it will proceed by picking up combinations at random, in
order to identify a large enough set of compatible barcode combinations.
\itemize{
\item Step 2:
}
Finds an optimized set of barcode combinations in which barcode
redundancy is minimized (see details in
\code{\link{optimize_combinations}})
}
\examples{
write.table(DNABarcodeCompatibility::IlluminaIndexesRaw,
txtfile <- tempfile(), row.names = FALSE, col.names = FALSE, quote=FALSE)
experiment_design(file1=txtfile, sample_number=18, mplex_level=3,
platform=4)
}
man/file_loading_and_checking.Rd000066400000000000000000000021421453240703200172100ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/file_loading_and_checking.R
\name{file_loading_and_checking}
\alias{file_loading_and_checking}
\title{Loading and checking DNA barcodes.}
\usage{
file_loading_and_checking(file)
}
\arguments{
\item{file}{The input data file that contains 2 columns separated by a space
or a tabulation, namely the sequence identifiers and
corresponding DNA sequence.}
}
\value{
A dataframe containing sequence identifiers, nucleotide sequence, GC content,
presence of homopolymers.
}
\description{
Loads the file containing DNA barcodes and analyze barcode content.
}
\details{
This function loads the DNA barcodes from the input file and checks barcodes
for unicity (identifier and sequence), DNA content, and equal size.
It also calculates the fraction of G and C relative to A and T, as referred
to as "GC content", and it detects the presence of
homopolymers of length >= 3.
}
\examples{
write.table(DNABarcodeCompatibility::IlluminaIndexesRaw,
txtfile <- tempfile(), row.names = FALSE, col.names = FALSE, quote=FALSE)
file_loading_and_checking(txtfile)
}
man/get_all_combinations.Rd000066400000000000000000000023021453240703200162710ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/get_all_combinations.R
\name{get_all_combinations}
\alias{get_all_combinations}
\title{Get all compatible combinations.}
\usage{
get_all_combinations(index_df, mplex_level, platform)
}
\arguments{
\item{index_df}{A dataframe containing barcodes identifiers, corresponding
DNA sequences along with GC content and presence of homopolymers.}
\item{mplex_level}{The number at which the barcodes will be multiplexed.
Illumina recommends to not multiplex more than 96 libraries.}
\item{platform}{An integer representing the number of channels (1, 2, 4)
of the desired Illumina platform: 1 for iSeq; 2 for NextSeq, NovaSeq,
MiniSeq; 4 for HiSeq and MiSeq. 0 represents any other platform than
Illumina.}
}
\value{
A matrix containing the identifiers of compatible barcode combinations.
}
\description{
Finds the exhaustive set of compatible barcode combinations.
}
\details{
Be aware that the total number of combinations may become prohibitively
large for large barcode sets and large multiplexing numbers.
}
\examples{
get_all_combinations(DNABarcodeCompatibility::IlluminaIndexes, 2, 4)
}
\seealso{
\code{\link{get_random_combinations}}
}
man/get_random_combinations.Rd000066400000000000000000000024031453240703200170030ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/get_random_combinations.R
\name{get_random_combinations}
\alias{get_random_combinations}
\title{Get a large set of compatible combinations.}
\usage{
get_random_combinations(index_df, mplex_level, platform)
}
\arguments{
\item{index_df}{A dataframe containing barcodes identifiers, corresponding
DNA sequences along with GC content and presence of homopolymers.}
\item{mplex_level}{The number at which the barcodes will be multiplexed.
Illumina recommends to not multiplex more than 96 libraries.}
\item{platform}{An integer representing the number of channels (1, 2, 4)
of the desired Illumina platform: 1 for iSeq; 2 for NextSeq, NovaSeq,
MiniSeq; 4 for HiSeq and MiSeq. 0 represents any other platform than
Illumina.}
}
\value{
A matrix containing the identifiers of compatible barcode combinations.
}
\description{
Finds a randomly generated set of at most 1000 combinations
of compatible barcodes.
}
\details{
This function is suited if the total number of possible combinations is too
high for an exhaustive search to be possible in a reasonable amount of time.
}
\examples{
get_random_combinations(DNABarcodeCompatibility::IlluminaIndexes, 3, 4)
}
\seealso{
\code{\link{get_all_combinations}}
}
man/optimize_combinations.Rd000066400000000000000000000051731453240703200165330ustar00rootroot00000000000000% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/optimize_combinations.R
\name{optimize_combinations}
\alias{optimize_combinations}
\title{Find a set of barcode combinations with least heterogeneity
in barcode usage}
\usage{
optimize_combinations(combination_m, nb_lane, index_number,
thrs_size_comb, max_iteration, method)
}
\arguments{
\item{combination_m}{A matrix of compatible barcode combinations.}
\item{nb_lane}{The number of lanes to be use for sequencing
(i.e. the number of libraries divided by the multiplex level).}
\item{index_number}{The total number of distinct DNA barcodes in the
dataset.}
\item{thrs_size_comb}{The maximum size of the set of compatible
combinations to be used for the greedy optimization.}
\item{max_iteration}{The maximum number of iterations during
the optimizing step.}
\item{method}{The choice of the greedy search: 'greedy_exchange'
or 'greedy_descent'.}
}
\value{
A matrix containing an optimized set of combinations
of compatible barcodes.
}
\description{
This function uses the Shannon Entropy to identify a set of compatible
barcode combinations with least heterogeneity in barcode usage.
}
\details{
N/k compatible combinations are then selected using a Shannon entropy
maximization approach.
It can be shown that the maximum value of the entropy that can be attained
for a selection of N barcodes among n, with possible repetitions, reads:
\deqn{S_{max}=-(n-r)\frac{\lfloor N/n\rfloor}{N}
\log(\frac{\lfloor N/n\rfloor}{N})-r\frac{\lceil N/n\rceil}{N}
\log(\frac{\lceil N/n\rceil}{N})}
where r denotes the rest of the division of N by n, while
\deqn{\lfloor N/n\rfloor} and \deqn{\lceil N/n\rceil} denote
the lower and upper integer parts of N/n, respectively.
\strong{Case 1:} number of lanes < number of compatible DNA-barcode
combinations
This function seeks for compatible DNA-barcode combinations of
highest entropy. In brief this function uses a randomized greedy descent
algorithm to find an optimized selection.
Note that the resulting optimized selection may not be globally optimal.
It is actually close to optimal and much improved in terms of
non-redundancy of DNA barcodes used, compared to a randomly chosen set of
combinations of compatible barcodes.
\strong{Case 2:} number of lanes >= number of
compatible DNA-barcode combinations
In such a case, there are not enough compatible DNA-barcode combinations
and redundancy is inevitable.
}
\examples{
m <- get_random_combinations(DNABarcodeCompatibility::IlluminaIndexes, 3, 4)
optimize_combinations(m, 12, 48)
}
\seealso{
\code{\link{get_all_combinations}},
\code{\link{get_random_combinations}},
\code{\link{experiment_design}}
}
tests/000077500000000000000000000000001453240703200122205ustar00rootroot00000000000000tests/testthat.R000066400000000000000000000001321453240703200141770ustar00rootroot00000000000000library(testthat)
library(DNABarcodeCompatibility)
test_check("DNABarcodeCompatibility")
tests/testthat/000077500000000000000000000000001453240703200140605ustar00rootroot00000000000000tests/testthat/test_file_loading_and_checking.R000066400000000000000000000004431453240703200223340ustar00rootroot00000000000000context("")
write.table(DNABarcodeCompatibility::IlluminaIndexesRaw ,
textfile <- tempfile(),
row.names = FALSE, col.names = FALSE, quote=FALSE)
test_that("file_loading_and_checking loads file", {
expect_is(file_loading_and_checking(textfile), "data.frame")
})
vignettes/000077500000000000000000000000001453240703200130665ustar00rootroot00000000000000vignettes/introduction.Rmd000066400000000000000000000304051453240703200162550ustar00rootroot00000000000000---
title: "Introduction to DNABarcodeCompatibility"
author: "Céline Trébeau, Jacques Boutet de Monvel,
Fabienne Wong Jun Tai, Raphaël Etournay"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc: true
BiocStyle::pdf_document:
toc: true
vignette: >
%\VignetteIndexEntry{Introduction to DNABarcodeCompatibility}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
\usepackage[utf8]{inputenc}
---
```{r style, echo = FALSE, results = 'asis', warnings=FALSE, messages=FALSE}
#
BiocStyle::markdown()
```
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE #,
# comment = "#>"
)
options(width=120)
```
This document gives an overview of the DNABarcodeCompatibility R package with a
brief description of the set of tools that it contains. The package includes
six main functions that are briefly described below with examples.
These functions allow one to load a list of DNA barcodes (such as the Illumina
TruSeq small RNA kits), to filter these barcodes according to distance and
nucleotide content criteria, to generate sets of compatible barcode
combinations out of the filtered barcode list, and finally to generate an
optimized selection of barcode combinations for multiplex sequencing
experiments. In particular, the package provides an optimizer function to
favour the selection of compatible barcode combinations with
**least heterogeneity in the frequencies of DNA barcodes**, and allows one to
keep barcodes that are **robust against substitution and insertion/deletion
errors**, thereby facilitating the demultiplexing step.
The DNABarcodeCompatibility package also contains:
* one workflow called `experiment_design()` allowing one to perform all steps
in one go.
* two data sets called `IlluminaIndexesRaw` and `IlluminaIndexes` for running
and testing examples.
* a series of API to build your own workflow.
The package deals with the three existing sequencing-by-synthesis chemistries
from Illumina:
* Four-Channel SBS Chemistry: MiSeq, HiSeq systems
* Two-Channel SBS Chemistry: MiniSeq, NextSeq, NovaSeq systems
* One-Channel SBS Chemistry: iSeq system
# Load the package
```{r, echo=TRUE}
library("DNABarcodeCompatibility")
```
# Define a helper function to save the raw dataset as a temporary text file
```{r, echo=TRUE}
# This function is created for the purpose of the documentation
export_dataset_to_file =
function(dataset = DNABarcodeCompatibility::IlluminaIndexesRaw) {
if ("data.frame" %in% is(dataset)) {
write.table(dataset,
textfile <- tempfile(),
row.names = FALSE, col.names = FALSE, quote=FALSE)
return(textfile)
} else print(paste("The input dataset isn't a data.frame:",
"NOT exported into file"))
}
```
# Design an experiment
The function `experiment_design()` uses a Shannon-entropy maximization approach
to identify a set of compatible barcode combinations in which the frequencies
of occurrences of the various DNA barcodes are as uniform as possible.
The optimization can be performed in the contexts of single and dual barcoding.
It performs either an exhaustive or a random search of compatible DNA-barcode
combinations, depending on the size of the DNA-barcode set used, and on the
number of samples to be multiplexed.
## Examples for single indexing
* 12 libraries sequenced in multiplex of 3 on a HiSeq (4 channels) platform
```{r, echo=TRUE}
txtfile <- export_dataset_to_file (
dataset = DNABarcodeCompatibility::IlluminaIndexesRaw
)
experiment_design(file1=txtfile,
sample_number=12,
mplex_level=3,
platform=4)
```
* 12 libraries sequenced in multiplex of 3 on a NextSeq (2 channels) platform
```{r, echo=TRUE}
txtfile <- export_dataset_to_file (
dataset = DNABarcodeCompatibility::IlluminaIndexesRaw
)
experiment_design(file1=txtfile,
sample_number=12,
mplex_level=3,
platform=2)
```
* 12 libraries sequenced in multiplex of 3 on a iSeq (1 channels) platform
```{r, echo=TRUE}
txtfile <- export_dataset_to_file (
dataset = DNABarcodeCompatibility::IlluminaIndexesRaw
)
experiment_design(file1=txtfile,
sample_number=12,
mplex_level=3,
platform=1)
```
* 12 libraries sequenced in multiplex of 3 on a HiSeq platform using barcodes
robust against 1 substitution error
```{r, echo=TRUE}
txtfile <- export_dataset_to_file (
dataset = DNABarcodeCompatibility::IlluminaIndexesRaw
)
experiment_design(file1=txtfile,
sample_number=12,
mplex_level=3,
platform=4,
metric = "hamming",
d = 3)
```
## Examples for dual indexing
* 12 libraries sequenced in multiplex of 3 on a HiSeq platform
```{r, echo=TRUE}
# Select the first half of barcodes from the dataset
txtfile1 <- export_dataset_to_file (
DNABarcodeCompatibility::IlluminaIndexesRaw[1:24,]
)
# Select the second half of barcodes from the dataset
txtfile2 <- export_dataset_to_file (
DNABarcodeCompatibility::IlluminaIndexesRaw[25:48,]
)
# Get compatibles combinations of least redundant barcodes
experiment_design(file1=txtfile1,
sample_number=12,
mplex_level=3,
platform=4,
file2=txtfile2)
```
* 12 libraries sequenced in multiplex of 3 on a HiSeq platform using barcodes
robust against 1 substitution error
```{r, echo=TRUE}
# Select the first half of barcodes from the dataset
txtfile1 <- export_dataset_to_file (
DNABarcodeCompatibility::IlluminaIndexesRaw[1:24,]
)
# Select the second half of barcodes from the dataset
txtfile2 <- export_dataset_to_file (
DNABarcodeCompatibility::IlluminaIndexesRaw[25:48,]
)
# Get compatibles combinations of least redundant barcodes
experiment_design(file1=txtfile1, sample_number=12, mplex_level=3, platform=4,
file2=txtfile2, metric="hamming", d=3)
```
# Build your own workflow
This section guides you through the detailed API of the package with the aim to
help you build your own workflow. The package is designed to be flexible and
should be easily adaptable to most experimental contexts, using the
`experiment_design()` function as a template, or building your own workflow
from scratch.
## Load and check a dataset of barcodes
The `file_loading_and_checking()` function loads the file containing the DNA
barcodes set and analyzes its content. In particular, it checks that each
barcode in the set is unique and uniquely identified (removing any repetition
that occurs). It also checks the homogeneity of size of the barcodes,
calculates their GC content and detects the presence of homopolymers of
length >= 3.
```{r, echo=TRUE}
file_loading_and_checking(
file = export_dataset_to_file(
dataset = DNABarcodeCompatibility::IlluminaIndexesRaw
)
)
```
## Examples of an exhaustive search of compatible barcode combinations
The total number of combinations depends on the number of available barcodes
and of the multiplex level. For 48 barcodes and a multiplex level of 3, the
total number of combinations (compatible or not) can be calculated using
`choose(48,3)`, which gives `r format(choose(48,3))` combinations. In many
cases the total number of combinations can become much larger (even gigantic),
and one cannot perform an exhaustive search
(see `get_random_combinations()` below).
* 48 barcodes, multiplex level of 2, HiSeq platform
```{r, echo=TRUE}
# Total number of combinations
choose(48,2)
# Load barcodes
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
# Time for an exhaustive search
system.time(m <- get_all_combinations(index_df = barcodes,
mplex_level = 2,
platform = 4))
# Each line represents a compatible combination of barcodes
head(m)
```
* 48 barcodes, multiplex level of 3, HiSeq platform
```{r, echo=TRUE}
# Total number of combinations
choose(48,3)
# Load barcodes
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
# Time for an exhaustive search
system.time(m <- get_all_combinations(index_df = barcodes,
mplex_level = 3,
platform = 4))
# Each line represents a compatible combination of barcodes
head(m)
```
## Examples of a random search of compatible barcode combinations
When the total number of combinations is too high, it is recommended to pick
combinations at random and then select those that are compatible.
* 48 barcodes, multiplex level of 3, HiSeq platform
```{r, echo=TRUE}
# Total number of combinations
choose(48,3)
# Load barcodes
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
# Time for a random search
system.time(m <- get_random_combinations(index_df = barcodes,
mplex_level = 2,
platform = 4))
# Each line represents a compatible combination of barcodes
head(m)
```
* 48 barcodes, multiplex level of 4, HiSeq platform
```{r, echo=TRUE}
# Total number of combinations
choose(48,4)
# Load barcodes
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
# Time for a random search
system.time(m <- get_random_combinations(index_df = barcodes,
mplex_level = 4,
platform = 4))
# Each line represents a compatible combination of barcodes
head(m)
```
* 48 barcodes, multiplex level of 6, HiSeq platform
```{r, echo=TRUE}
# Total number of combinations
choose(48,6)
# Load barcodes
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
# Time for a random search
system.time(m <- get_random_combinations(index_df = barcodes,
mplex_level = 6,
platform = 4))
# Each line represents a compatible combination of barcodes
head(m)
```
## Constrain barcodes to be robust against one substitution error
```{r, echo=TRUE}
# Load barcodes
barcodes <- DNABarcodeCompatibility::IlluminaIndexes
# Perform a random search of compatible combinations
m <- get_random_combinations(index_df = barcodes,
mplex_level = 3,
platform = 4)
# Keep barcodes that are robust against one substitution error
filtered_m <- distance_filter(index_df = barcodes,
combinations_m = m,
metric = "hamming",
d = 3)
# Each line represents a compatible combination of barcodes
head(filtered_m)
```
## Optimize the set of compatible combinations to reduce barcode redundancy
```{r, echo=TRUE}
# Keep set of compatible barcodes that are robust against one substitution
# error
filtered_m <- distance_filter(
index_df = DNABarcodeCompatibility::IlluminaIndexes,
combinations_m = get_random_combinations(index_df = barcodes,
mplex_level = 3,
platform = 4),
metric = "hamming", d = 3)
# Use a Shannon-entropy maximization approach to reduce barcode redundancy
df <- optimize_combinations(combination_m = filtered_m,
nb_lane = 12,
index_number = 48)
# Each line represents a compatible combination of barcodes and each row a lane
# of the flow cell
df
```
## The optimized result isn't an optimum when filtering out too many barcodes
* Increased distance between barcode sequences: redundancy may become
inevitable
```{r, echo=TRUE}
# Keep set of compatible barcodes that are robust against multiple substitution
# and insertion/deletion errors
filtered_m <- distance_filter(
index_df = DNABarcodeCompatibility::IlluminaIndexes,
combinations_m = get_random_combinations(index_df = barcodes,
mplex_level = 3,
platform = 4),
metric = "seqlev", d = 4)
# Use a Shannon-entropy maximization approach to reduce barcode redundancy
df <- optimize_combinations(combination_m = filtered_m,
nb_lane = 12,
index_number = 48)
# Each line represents a compatible combination of barcodes and each row a
# lane of the flow cell
df
```