Name Mode Size
R 040000
data 040000
inst 040000
man 040000
src 040000
tests 040000
vignettes 040000
.gitignore 100644 0 kb
DESCRIPTION 100644 1 kb
NAMESPACE 100644 2 kb
NEWS 100644 1 kb
README.md 100644 8 kb
README.md
# BANDITS: Bayesian ANalysis of DIfferenTial Splicing <img src="inst/extdata/BANDITS.png" width="200" align="right"/> `BANDITS` is a Bayesian hierarchical model for detecting differential splicing of genes and transcripts, via differential transcript usage (DTU), between two or more conditions. The method uses a Bayesian hierarchical framework, which allows for sample specific proportions in a Dirichlet-multinomial model, and samples the allocation of fragments to the transcripts. Parameters are inferred via Markov chain Monte Carlo (MCMC) techniques and a DTU test is performed via a multivariate Wald test on the posterior densities for the average relative abundance of transcripts. > Simone Tiberi and Mark D Robinson (2020). BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty. > > *Genome Biology* **21 (69)**. doi: [10.1186/s13059-020-01967-8](https://doi.org/10.1186/s13059-020-01967-8) ## Bioconductor installation `BANDITS` is available on [Bioconductor](https://bioconductor.org/packages/BANDITS) and can be installed with the command: ``` r if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("BANDITS") ``` ## Vignette The vignette illustrating how to use the package can be accessed on [Bioconductor](https://bioconductor.org/packages/BANDITS) or from R via: ``` r vignette("BANDITS") ``` or ``` r browseVignettes("BANDITS") ``` ## Alignment The package inputs the equivalence classes and respective counts, representing what transcripts each read is compatible with. These can be obtained by aligning reads either directly to a reference transcriptome with pseudo-alignmers, via `salmon` or `kallisto`, or to a reference genome with splice-aware genome alignment algorithms, via `STAR`, and checking the transcripts compatible with each genome alignment with `salmon` NOTE: when using `salmon`, use the option `--dumpEq` to obtain the equivalence classes, when using `STAR`, use the option `--quantMode TranscriptomeSAM` to obtain alignments translated into transcript coordinates, and when using `kallisto`, run both the `quant` and `pseudo` modes to obtain the transcript estimated counts and equivalence classes, respectively. Below we show three pipelines for aligning reads with `salmon`, `kallisto` and `STAR`. ### Download the example input data To obtain the example raw data, download or clone the ARMOR github repository: ``` bash git clone https://github.com/csoneson/ARMOR.git # set a base_dir variable to the downloaded repo base_dir="~/ARMOR" # input reads: fastq_files=$base_dir/example_data/FASTQ ``` The example data consits of four paired-end RNA-seq reads of 63 base pairs. ### Align reads to the transcriptome with salmon Create a variable for the fasta format reference transcriptome (cDNA): ``` bash fasta_tr=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz ``` Make the directory for the salmon output and the genome index ``` bash # create a directory for salmon mkdir $base_dir/salmon # create a directory for the genome index mkdir $base_dir/salmon/Salmon_index idx=$base_dir/salmon/Salmon_index # create a directory for the output of the alignment mkdir $base_dir/salmon/alignment out_Salmon=$base_dir/salmon/alignment ``` Build salmon index ``` bash salmon index -i $idx -t $fasta_tr -p 4 --type quasi -k 31 ``` Align reads and quantify transcript abundance with salmon: ``` bash salmon quant -i $idx -l A -1 $fastq_files/SRR1039508_R1.fastq.gz -2 $fastq_files/SRR1039508_R2.fastq.gz \ -p 4 -o $out_Salmon/sample1 --seqBias --gcBias --dumpEq ``` The option `--dumpEq` is essential to obtain the equivalence classes from salmon. In the output folder (`$out_Salmon/sample1`), the file `quant.sf` contains the estimated transcripts abundances, while the equivalence classes (and respective counts) are stored in `aux_info/eq_classes.txt`. ### Align reads to the transcriptome with kallisto Create a variable for the fasta format reference transcriptome (cDNA): ``` bash fasta_tr=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz ``` Make the directory for the kallisto output and the genome index ``` bash # create a directory for kallisto mkdir $base_dir/kallisto # create a directory for the genome index mkdir $base_dir/kallisto/kallisto_index idx=$base_dir/kallisto/kallisto_index # create a directory for the output of the alignment mkdir $base_dir/kallisto/alignment out_kallisto_alignment=$base_dir/kallisto/alignment # create a directory for the output of the equivalence classes mkdir $base_dir/kallisto/pseudo out_kallisto_pseudo=$base_dir/kallisto/pseudo ``` Build kallisto index ``` bash kallisto index -i $idx/transcripts.idx $fasta_tr ``` Align reads and quantify transcript abundance with kallisto: ``` bash kallisto quant -i $idx/transcripts.idx -o $out_kallisto_alignment/sample1 --bias --threads 4 \ $fastq_files/SRR1039508_R1.fastq.gz $fastq_files/SRR1039508_R2.fastq.gz ``` In the output folder (`$out_kallisto_alignment/sample1`), the file `abundance.tsv` contains the estimated transcripts abundances. Compute the equivalence classes and respective counts with kallisto: ``` bash kallisto pseudo -i $idx/transcripts.idx -o $out_kallisto_pseudo/sample1 \ $fastq_files/SRR1039508_R1.fastq.gz $fastq_files/SRR1039508_R2.fastq.gz ``` In the output folder (`$out_kallisto_pseudo/sample1`), the file `pseudoalignments.ec` contains the transcripts forming each equivalence class, while `pseudoalignments.tsv` contains the equivalence classes counts. ### Align reads to the genome with STAR, and compute the equivalence classes with salmon Create a variable for the fasta format reference genome (DNA) and gtf file: ``` bash fasta=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa gtf=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.93.1.1.10M.gtf ``` Make the directory for the STAR output and the genome index ``` bash # create a directory for STAR mkdir $base_dir/STAR # create a directory for the genome index mkdir $base_dir/STAR/genome_index GDIR=$base_dir/STAR/genome_index # create a directory for the output of the alignment mkdir $base_dir/STAR/alignment outDir=$base_dir/STAR/alignment ``` Generate the Genome index: ``` bash STAR --runMode genomeGenerate --runThreadN 4 --genomeDir $GDIR \ --genomeFastaFiles $fasta --sjdbGTFfile $gtf --sjdbOverhang 62 ``` Note that sjdbOverhang ideally should be set to the lenght of the reads -1 (our reads are 63 bps). Align reads with STAR: ``` bash cd $outDir STAR --runMode alignReads --runThreadN 4 --genomeDir $GDIR \ --readFilesIn <(zcat $fastq_files/SRR1039508_R1.fastq.gz) <(zcat $fastq_files/SRR1039508_R2.fastq.gz) \ --outFileNamePrefix sample1 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM ``` When running STAR `--quantMode TranscriptomeSAM` is essential to obtain the transcript alignments. Use gffread to build a reference transcriptome (fasta format) compatible with the DNA fasta and gtf files used for STAR: ``` bash gffread -w cDNA.fa -g $fasta $gtf cdna=$outDir/cDNA.fa ``` Use salmon on the transcript alignments to compute the equivalence classes: ``` bash salmon quant -t $cdna -l A -a sample1Aligned.toTranscriptome.out.bam -o sample1 -p 4 --dumpEq ``` The option `--dumpEq` is essential to obtain the equivalence classes from salmon. In the output folder (`$outDir/sample1`), the file `quant.sf` contains the estimated transcripts abundances, while the equivalence classes (and respective counts) are stored in `aux_info/eq_classes.txt`.