<!-- README.md is generated from README.Rmd. Please edit that file -->
<!-- badges: start
<!-- badges: end -->
Metagenomics holds great promises for deepening our knowledge of key
bacterial driven processes, but metagenome assembly remains problematic,
typically resulting in representation biases and discarding significant
amounts of non-redundant sequence information. In order to alleviate
constraints assembly can impose on downstream analyses, and/or to
increase the fraction of raw reads assembled via targeted assemblies
relying on pre-assembly binning steps, we developed a set of binning
modules and evaluated their combination in a new “assembly-free” binning
The goal of metabinR is to provide functions for performing abundance
and composition based binning on metagenomic samples, directly from
FASTA or FASTQ files.
Abundance based binning is performed by analyzing sequences with long
kmers (k\>8), whereas composition based binning is performed by
utilizing short kmers (k\<8).
A JDK, at least 8, is required and needs to be present before installing
If you find my work useful, please cite the following publication : A
scalable assembly-free variable selection algorithm for biomarker
discovery from metagenomes.
To install `metabinR` package:
if (!requireNamespace("BiocManager", quietly=TRUE))
You can install the development version of `metabinR` like so:
## Sample data
Toy gzip compressed fasta simulated metagenome file is provided in
We use [CAMISIM](https://github.com/CAMI-challenge/CAMISIM) and simulate
a toy metagenome. See file `CAMISIM_config.ini` for parameters.
Paired-end 2x150bp Illumina reads are sampled from 10 bacterial genomes
(see file `genome_to_id.tsv`) with abundances drawn from a log-normal
distribution (see file `distribution_0.txt`).
In total 4Mbases are sampled, yielding 26664 reads (13332 pairs) (see
Original mapping information of each simulated read (from which genome
each read is originating from) is available in `reads_mapping.tsv.gz`.
Parameters used in `CAMISIM` to generate the simulated metagenome.
Bacterial genomes and their corresponding id that have been used to
generate the simulated metagenome.
Abundances of sampled genomes and the Abundance class each one belongs
to. For this toy simulated metagenome, we assume 2 Abundance classes.
Class 1 of most abundant taxa and class 2 of less abundant ones.
Gzip compressed fasta file of the simulated 26664 reads (13332 pairs,
Gzip compressed tabular file of the original mapping information of each
Read id from `reads.metagenome.fasta.gz` matches column
`anonymous_read_id` and read genome id matches column `genome_id`.
## Memory requirements
In order to allocate RAM, a special parameter needs to be passed while
JVM initializes. JVM parameters can be passed by setting
`java.parameters` option. The `-Xmx` parameter, followed (without space)
by an integer value and a letter, is used to tell JVM what is the
maximum amount of heap RAM that it can use. The letter in the parameter
(uppercase or lowercase), indicates RAM units. For example, parameters
`-Xmx1024m` or `-Xmx1024M` or `-Xmx1g` or `-Xmx1G`, all allocate 1
Gigabyte or 1024 Megabytes of maximum RAM for JVM.
In order to allocate 3GB of RAM for the JVM, through R code, use:
## Abundance based Binning
The Abundance based Binning module analyzes input fasta/fastq sequences
by long kmers (k\>8), in order to capture differences in the abundance
signal of organisms in the input data.
Initially a kmer count dictionary is created. An efficient kmer counter
is implemented in the java backend. It supports multi-threaded counting
and small memory footprint. This is the most expensive operation of the
module, due to the exponential size of kmer space (number of distinct
kmers is 4^k). For example the number of distinct 20mers is one
Next an EM routine tries to identify Poisson mixtures in the kmer count
spectrum. A Poisson distribution is fitted for each abundance cluster.
The number of abundance clusters cannot be automatically determined and
therefore needs to be user defined. The characteristics of each fitted
Poisson distribution are linked to the abundance level of the taxa
belonging to this abundance class and the their total genome size.
For each abundance class and fitted Poisson distribution, a vector
structure in the kmer spectrum space is constructed. A vector in the
same kmer space is also constructed for each input fasta sequence,
utilizing a tf\*idf weighting scheme.
Finally, cosine distance is calculated between input sequence vectors
and abundance class vectors. Each input sequence is assigned to the
abundance class of least distance.
This module outputs a fasta/fastq for each final cluster
`prefix__AB.n.fasta`, containing sequences that were assigned to it.
It outputs as well, a file `prefix__AB.assignments` with the information
of the assignment of each sequence (column `read_id`) to which cluster
(column `AB`) and the distance of the sequence to each cluster (columns
## Composition based Binning
The Composition based Binning module analyzes input sequences by short
kmers (k\<8), to capture differences in the composition signal and finer
details between similar organisms in the input data.
This module does not involve an expensive kmer counting step, since
short kmer space is relatively small. For example the number of distinct
8mers is around 60 thousands.
For every input sequence a vector is constructed in the kmer space. A
kmeans clustering algorithms is run on the input vectors. Distance
measure is calculated with Spearman footrule. An efficient
multi-threaded implementation of the kmeans algorithm is provided by the
java backend. The number of composition clusters cannot be automatically
determined and therefore needs to be user defined.
Finally, input sequences are assigned to the closest composition
This module outputs, as the AB module, a fasta/fastq for each final
cluster `prefix__CB.n.fasta`, containing sequences that were assigned to
It outputs as well, a file `prefix__CB.assignments` with the information
of the assignment of each sequence (column `read_id`) to which cluster
(column `CB`) and the distance of the sequence to each cluster (columns
## Hierarchical (2step ABxCB) Binning
This module performs hierarchical binning on metagenomic samples,
First it analyzes sequences by long kmer analysis (k\>8), as in .
Then for each AB bin, it guesses the number of composition bins in it
and performs composition based binning by short kmer analysis (k\<8), as
This module outputs, as with the independent AB and CB modules, a
fasta/fastq for each final cluster `prefix__ABxCB.n.fasta`, containing
sequences that were assigned to it.
Similarly to AB and CB module, it outputs as well, a file
`prefix__ABxCB.assignments` with the information of the assignment of
each sequence (column `read_id`) to which cluster (column `ABxCB`) and
the distance of the sequence to each cluster (columns `ABxCB.n`).
It also outputs a file `prefix__ABxCB.clustersDesign.tsv` with 3
columns. Column `ABxCB` is the id of the final hierarchical (step AB and
then CB) bin. Column `size` is the number of sequences that were
assigned to each ABxCB bin. Column `AB` is the id of the parent
Abundance based bin (AB) that each final ABxCB bin originated from.