NOTE: The context of this R notebook includes all the R code required for the analysis of Differential splicing analysis of bulk RNA-Seq data. Refer to the PART 1- Bulk RNA-Seq analysis of the Protocol section.
Conda is a popular and flexible package manager that allows convenient installation of packages with their dependencies across all platforms. Use ‘Anaconda’ (conda package manager) to install ‘conda’ which can be used to install the tools/packages required for the analysis.
Download ‘Anaconda’ according to the system requirements from https://www.anaconda.com/products/individual#Downloads and install it by following the prompts in graphical installer. Install all the packages using ‘conda’ by typing the following commands on linux command-line.
conda install -c daler sratoolkit
conda install -c conda-forge parallel
conda install -c bioconda star fastqc rmats rmats2sashimiplot
To download all the R packages used in the protocol, type the following code in R console or R-studio.
bioc_packages <- c("DEXSeq", "Rsubread", "EnhancedVolcano", "rtracklayer", "edgeR", "limma", "maser")
packages <- c("magrittr", "tidyverse","openxlsx", "BiocManager", "GenomicRanges")
#Install if not already installed
installed_packages <- packages %in% rownames(installed.packages())
installed_bioc_packages <- bioc_packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages],dependencies = TRUE)
BiocManager::install(packages[!installed_bioc_packages], dependencies = TRUE)
}
In this computational protocol, commands will be given as either R Notebook files (files with extension “.Rmd”), R code files (files with extension “.R”), or Linux Bash shell scripts (files with extension “.sh”). R Notebook (Rmd) files should be opened in RStudio using File|Open File…, and individual code chunks (which may be R commands or Bash shell commands) then run interactively by clicking the green arrow at the upper right. R code files can be run by opening in RStudio, or on the Linux command-line by prefacing with “Rscript”, e.g. Rscript example.R. Shell scripts are run on the Linux command-line by prefacing the script with the “sh” command e.g. sh example.sh.
The code snippets annotated below are available in the supplementary code file “Downloading_data_preprocessing.Rmd”, to follow the individual steps interactively, and are also provided as a bash script to be run in batch on the Linux command-line (sh downloading_data_preprocessing.sh).
Download the raw data from Sequence Read Archive [SRA] using the ‘prefetch’ command from SRA toolkit. Give the SRA Accession Ids in sequence in the following command to download them in parallel using GNU parallel using GNU parallel utility. Ensure SRA toolkit and GNU parallel is installed before proceeding this step. To download SRA files of accession ids from SRR10261601 to SRR10261606 in parallel, type the following commands on the linux command-line-
seq 10261601 10261606 | parallel prefetch SRR{}
Extract the fastq files from the archive using ‘fastq-dump’ function of SRA toolkit. Use GNU parallel and give the names of all SRA files together.
parallel -j 3 fastq-dump --gzip --skip-technical --read-filter pass --dumpbase --split-e --clip --origfmt {} ::: <name of sra files together>
Download the reference genome and annotations for Mouse (Genome assembly GRCm39) from www.ensembl.org using the following at the Linux command-line. The ‘wget’ function downloads the destination file in the provided link and ‘gunzip’ will decompress the file. Store both genome and annotation file in specific variables for further use.
wget -nv -O annotation.gtf.gz http://ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz \ && gunzip -f annotation.gtf.gz
wget -nv -O genome.fa.gz http://ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \ && gunzip -f genome.fa.gz
GTF=$(readlink -f annotation.gtf)
GENOME=$(readlink -f genome.fa)
Assess the quality of raw reads with FASTQC (v0.11.9). Create an output folder and run fastqc with parallel on multiple input fastq files. Ensure the path of raw input files is correct before running the command. This step will generate quality report for each sample. Examine the reports to ensure the quality of reads is acceptable before doing further analysis.(Refer to the user manual for understanding the reports at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
mkdir fastqc_out
parallel "fastqc {} -o fastqc_out" ::: $RAW_DATA/*.fastq.gz
NOTE: If necessary, perform adapter trimming with ‘cutadapt’ or ‘trimmomatic’ to remove sequencing into flanking adapters, which varies based on RNA fragment size and read length. In this analysis we skipped this step as the fraction of reads affected was minimal.
The next step in pre-processing includes mapping the reads to the reference genome. Firstly, build the index for the reference genome using ‘genomeGenerate’ function of STAR (Spliced Transcripts Alignment to a Reference v2.7.5c) and then align the raw reads to the reference (alternatively prebuilt indexes are available on the STAR website and can be used directly for read alignment).
Run the following commands as a bash script on the linux command-line.
#Build STAR index
GDIR=STAR_indices
mkdir $GDIR
STAR --runMode genomeGenerate --genomeFastaFiles $GENOME --sjdbGTFfile $GTF --runThreadN 8 --genomeDir $GDIR
ODIR=results/mapping
mkdir -p $ODIR
#Align reads to the genome
for fq1 in $RAW_DATA/*R1.fastq.gz;
do
fq2=$(echo $fq1 | sed 's/1.fastq.gz/2.fastq.gz/g');
OUTPUT=$(basename ${fq1}| sed 's/R1.fastq.gz//g');
STAR --genomeDir $GDIR \
--runThreadN 12 \
--readFilesCommand zcat \
--readFilesIn ${fq1} ${fq2}\
--outFileNamePrefix $ODIR\/${OUTPUT} \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard
Done
STAR aligner will generate and sort BAM files for each sample after read alignments. Bam files must be sorted before proceeding to further steps. Create a folder ‘bams’ and copy all the bam files to it.
Run the supplementary file- “prepare_mm_exon_annotation.R” with the downloaded annotation in GTF format on the linux command-line to prepare the annotations. Save the given code as ‘prepare_mm_exon_annotation.R’ file and run it on the linux command-line as-
Rscript prepare_mm_exon_annotation.R annotation.gtf
Or run it on RStudio as:
#! /usr/bin/env Rscript
#To run- Rscript prepare_mm_annotation.R <GTF_file>
library(rtracklayer)
library(tidyverse)
args = commandArgs(trailingOnly = TRUE)
gff_mm <- args[1]
mgg <- import(gff_mm)
anno = mgg %>% data.frame %>%
mutate(seqnames = as.character(seqnames)) %>%
dplyr::filter(!grepl("chr[1-19]|X|Y", seqnames)) %>%
filter(type %in% "exon") %>%
select(seqnames, start, end ,width, strand, gene_name, transcript_id) %>% distinct() %>%
group_by(gene_name, seqnames, start, end, width, strand) %>%
summarise(transcript_ids = paste(transcript_id, sep="",collapse=",")) %>%
ungroup() %>%
mutate(number = 1) %>%
arrange( seqnames, start , end) %>% group_by(gene_name, seqnames) %>% #
mutate(ticker = cumsum(number)) %>%
mutate(ExonID = paste(gene_name, ".",seqnames,".", ticker, sep="")) %>%
dplyr::select(-number, ticker) %>%
add_count(gene_name) %>%
data.frame
colnames(anno) <- c("GeneID", "Chr", "Start", "End", "Width", "Strand", "TranscriptIDs", "Ticker", "ExonID", "n")
save(anno, file=paste("mm_exon_anno.RData", sep=""))
The GTF file contains multiple exon entries for different isoforms. This file is used to ‘collapse’ the multiple transcript IDs for each exon using a comma. It is an important step to define exon counting bins.
The next step is to count the number of reads mapped to different transcripts/exons. Use RStudio to run the following commands or call R from command-line by typing ‘R’.
The following packages are required for this part of the analysis. Make sure to install the packages if they are not already installed.
packages <- c("DEXSeq","Rsubread","tidyverse", "magrittr","EnhancedVolcano", "edgeR","openxlsx")
#Load libraries
invisible(lapply(packages, library, character.only = TRUE))
load("mm_exon_anno.RData")
Read the folder containing bam files by first listing each file from the directory that ends with .bam. Use ‘featureCounts’ from the Rsubread package which takes bam files and processed GTF annotation (reference) as input to generate a matrix of counts associated with each feature with rows representing exons(features) and columns representing samples. Before running this step, make sure you have the folder containing bam files in the current working directory. Next, perform non-specific filtering to remove lowly expressed exons. Transform the data from raw scale to counts per million (cpm) using the cpm function from ‘edgeR’ package and keep the counts greater than one in at least three samples. Remove the genes with only one exon.
# Read all bam files as input for featureCounts
countData <- dir("bams", pattern=".bam$", full.names=T) %>%
featureCounts(filesToCount,
annot.ext=anno,
isGTFAnnotationFile=FALSE,
minMQS=0, useMetaFeatures=FALSE,
allowMultiOverlap=TRUE,
largestOverlap = TRUE,
countMultiMappingReads=FALSE,
primaryOnly=TRUE,
isPairedEnd=TRUE,
nthreads = 12)
# Non-specific filtering: Remove the exons with low counts
isexpr <- rownames(countData$counts)[rowSums(cpm(countData$counts) > 1) >= 3]
countData$counts <- countData$counts[rownames(countData$counts) %in% isexpr, ]
anno <- anno %>% filter(GeneID %in% rownames(countData$counts))
# Remove genes with only 1 site and NA in geneIDs
dn <- anno %>% group_by(GeneID) %>% summarise(nsites=n()) %>% filter(nsites > 1 & !is.na(GeneID))
anno <- anno %>% filter(GeneID %in% dn$GeneID)
countData$counts <- countData$counts[rownames(countData$counts) %in% anno$GeneID, ]
NOTE: Examine the library type(single-end or paired-end) and modify the parameters in featureCounts respectively.
We describe two alternatives for this step: DEXSeq and DiffSplice. Either can be used and give similar results. For consistency, select DEXSeq if you prefer a DESeq2 package for DGE and use DiffSplice for a Limma-based DGE package.
sampleTable = data.frame(
row.names = c( "Mbnl1KO_Thymus_1", "Mbnl1KO_Thymus_2", "Mbnl1KO_Thymus_3",
"WT_Thymus_1", "WT_Thymus_2", "WT_Thymus_3" ),
condition = rep(c("Mbnl1_KO", "WT"),c(3,3)), libType = rep(c("paired-end")))
NOTE: The row names should be consistent with the bam file names used by featureCounts to count the reads. sampleTable consists of details of each sample which includes- library-type and condition. This is required to define the contrasts or test group for detecting differential usage.
Exon information in form of GRanges (genomic ranges) object is required as an input to create DEXSeq object. Match the gene Ids with the read counts and to create exoninfo object.
exoninfo = anno[anno$GeneID %in% rownames(countData$counts),]
exoninfo <- GRanges (seqnames= anno$Chr,
ranges= IRanges (start=anno$Start, end=anno$End, width = anno$Width), strand = Rle(anno$Strand))
mcols(exoninfo)$TranscriptIDs <- anno$TranscriptIDs
mcols(exoninfo)$Ticker <- anno$Ticker
mcols(exoninfo)$ExonID <- anno$ExonID
mcols(exoninfo)$n <- anno$n
mcols(exoninfo)$GeneID <- anno$GeneID
transcripts_l = strsplit(exoninfo$TranscriptIDs, "\\,")
#Save the countData, sampleTable and exoninfo and transcripts in one object for further use
save(countData, sampleTable, exoninfo, transcripts_l, file="AS_countdata.RData")
NOTE: Be careful with the names of the columns of the processed annotation file. With a different GTF file, the headers can change and cause a problem while running this step.
Create DEXSeq object using DEXSeqDataSet function. The DEXSeq object collects together read counts, exon feature information, and sample information. Use the read counts generated in step 3 and the exon information obtained from the previous step to create the DEXSeq object from the count matrix. The sampleData argument takes a data frame input defining the samples (and their attributes: library type and condition), ‘design’ uses sampleData to generate a design matrix for the differential testing using model formula notation. Note that a significant interaction term, condition:exon, indicates that the fraction of reads over a gene falling on a particular exon depends on the experimental condition i.e. there is AS. See DEXSeq documentation for a full description of setting the model formula for more complex experimental designs. For feature information- exon ids, corresponding gene and transcripts are required from the processed annotation file.
# Explore/Inspect the DEXSeq object
head(counts(dxd), 5)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
Mrpl15:Mrpl15.1.1 96 121 70 79 90 96 3399 4376 3223 3149 3340
Mrpl15:Mrpl15.1.2 697 862 636 616 699 657 2798 3635 2657 2612 2731
Mrpl15:Mrpl15.1.3 11 8 8 10 5 17 3484 4489 3285 3218 3425
Mrpl15:Mrpl15.1.4 498 642 498 434 516 465 2997 3855 2795 2794 2914
Mrpl15:Mrpl15.1.5 194 243 185 159 189 175 3301 4254 3108 3069 3241
[,12]
Mrpl15:Mrpl15.1.1 3080
Mrpl15:Mrpl15.1.2 2519
Mrpl15:Mrpl15.1.3 3159
Mrpl15:Mrpl15.1.4 2711
Mrpl15:Mrpl15.1.5 3001
colData(dxd)
DataFrame with 12 rows and 4 columns
sample condition libType exon
<factor> <factor> <factor> <factor>
1 Mbnl1KO_Thymus_1 Mbnl1_KO paired-end this
2 Mbnl1KO_Thymus_2 Mbnl1_KO paired-end this
3 Mbnl1KO_Thymus_3 Mbnl1_KO paired-end this
4 WT_Thymus_1 WT paired-end this
5 WT_Thymus_2 WT paired-end this
... ... ... ... ...
8 Mbnl1KO_Thymus_2 Mbnl1_KO paired-end others
9 Mbnl1KO_Thymus_3 Mbnl1_KO paired-end others
10 WT_Thymus_1 WT paired-end others
11 WT_Thymus_2 WT paired-end others
12 WT_Thymus_3 WT paired-end others
split(seq_len(ncol(dxd)), colData(dxd)$exon)
$this
[1] 1 2 3 4 5 6
$others
[1] 7 8 9 10 11 12
head(rowRanges(dxd),3)
GRanges object with 3 ranges and 10 metadata columns:
seqnames ranges strand | TranscriptIDs
<Rle> <IRanges> <Rle> | <character>
Mrpl15:Mrpl15.1.1 1 4843429-4844739 - | ENSMUST00000130201
Mrpl15:Mrpl15.1.2 1 4843434-4847024 - | ENSMUST00000156816
Mrpl15:Mrpl15.1.3 1 4844659-4844739 - | ENSMUST00000045689
Ticker ExonID n GeneID featureID
<numeric> <character> <integer> <character> <character>
Mrpl15:Mrpl15.1.1 1 Mrpl15.1.1 15 Mrpl15 Mrpl15.1.1
Mrpl15:Mrpl15.1.2 2 Mrpl15.1.2 15 Mrpl15 Mrpl15.1.2
Mrpl15:Mrpl15.1.3 3 Mrpl15.1.3 15 Mrpl15 Mrpl15.1.3
groupID exonBaseMean exonBaseVar
<character> <numeric> <numeric>
Mrpl15:Mrpl15.1.1 Mrpl15 92 306
Mrpl15:Mrpl15.1.2 Mrpl15 694.5 7814.7
Mrpl15:Mrpl15.1.3 Mrpl15 9.83333333333333 16.5666666666667
transcripts
<list>
Mrpl15:Mrpl15.1.1 ENSMUST00000130201
Mrpl15:Mrpl15.1.2 ENSMUST00000156816
Mrpl15:Mrpl15.1.3 ENSMUST00000045689
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
sampleAnnotation(dxd)
DataFrame with 6 rows and 3 columns
sample condition libType
<factor> <factor> <factor>
1 Mbnl1KO_Thymus_1 Mbnl1_KO paired-end
2 Mbnl1KO_Thymus_2 Mbnl1_KO paired-end
3 Mbnl1KO_Thymus_3 Mbnl1_KO paired-end
4 WT_Thymus_1 WT paired-end
5 WT_Thymus_2 WT paired-end
6 WT_Thymus_3 WT paired-end
NOTE: This step might take some time. Once the object is created, explore the columns for initial understanding/exploration of the data.
# Explore/Inspect the DEXSeq object
head(counts(dxd), 5)
colData(dxd)
split(seq_len(ncol(dxd)), colData(dxd)$exon)
head(rowRanges(dxd),3)
sampleAnnotation(dxd)
Next, perform normalization between samples and estimate the variance of the data, due to both Poisson count noise from the discrete nature of RNA-seq and biological variability, using the following commands.
dxd %<>% estimateSizeFactors %>% estimateDispersions %T>% plotDispEsts
After the estimation of variation, test for differential exon usage for each gene and generate the results. The results are calculated at FDR 10%.
plotDEXSeq(dxr, "Wnk1", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2, FDR = 0.1)
plotDEXSeq(dxr, "Tcf7", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Mbnl1", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Lef1", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Ncor2", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Mbnl2", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
Visualize splicing events for selected genes using the following command
plotDEXSeq(dxr, "Lef1", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Mbnl1", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
To display transcripts use displayTranscripts=TRUE
plotDEXSeq(dxr, "Rps3a1", displayTranscripts=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Lef1", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
plotDEXSeq(dxr, "Mbnl1", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2)
Generate plots for all the significant genes
save.image("Mbnl1_KO_Dexseq.RData")
Volcano plot to visualize differentially expressed genes
EnhancedVolcano(dexon_sig, lab = dexon_sig$featureID, x = 'log2fold_WT_Mbnl1_KO', y = 'pvalue', title = 'Volcano Plot', subtitle = 'Mbnl1_KO vs WT (DEXSeq)', FCcutoff = 1, labSize = 4,legendPosition = "right",xlim= c(-4,4),caption = bquote(~Log[2]~ "Fold change cutoff, 2; FDR 10%"))
NOTE: EnhancedVolcano package was used to generate the above plot, install the package it if not already installed.
save.image("Mbnl1_KO_Dexseq.RData")
Follow the R Notebook file “AS_analysis_RNASeq.Rmd”. Ensure steps 1-3 have been followed to prepare input files before proceeding further.
library(limma)
library(edgeR)
Extract the matrix of read counts obtained in 3. Create a list of features using ‘DGEList’ function from edgeR package, where rows represent genes and columns represent samples. NOTE: As a non-specific filtering step, counts are filtered by cpm < 1 in x out of n samples, where x is the minimum number of replicates in any condition. n = 6 and x = 3 for this example data.
mycounts = countData$counts
#Change the rownames of the countdata to exon Ids instead of genes for unique rownames.
rownames(mycounts) = exoninfo$ExonID
dge <- DGEList(counts=mycounts)
#Non-specific Filtering
isexpr <- rowSums(cpm(dge) > 1) >=3
dge <- dge[isexpr,,keep.lib.sizes=FALSE]
#Extract the annotations for only filtered counts
exoninfo = anno %>% filter(ExonID %in% rownames(dge$counts))
#Convert the exoninfo into GRanges object
exoninfo1 <- GRanges (seqnames= exoninfo$Chr,
ranges= IRanges (start=exoninfo$Start, end=exoninfo$End, width = exoninfo$Width), strand = Rle(exoninfo$Strand))
mcols(exoninfo1)$TranscriptIDs <- exoninfo$TranscriptIDs
mcols(exoninfo1)$Ticker <- exoninfo$Ticker
mcols(exoninfo1)$ExonID <- exoninfo$ExonID
mcols(exoninfo1)$n <- exoninfo$n
mcols(exoninfo1)$GeneID <- exoninfo$GeneID
transcripts_l = strsplit(exoninfo1$TranscriptIDs, "\\,")
Normalize the counts across samples, with ‘calcNormFactors’ function from ‘edgeR’ package using Trimmed Mean of M values (TMM normalization method). It will compute scaling factors to adjust library sizes.
dge <- calcNormFactors(dge)
Use sampleTable as generated in step 4.1.1 and create the design matrix. The design matrix characterizes the design. See the Limma User Guide [ref] chapters 8 & 9 for details on design matrices for more advanced experimental designs.
Treat <- factor(sampleTable$condition)
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
Run ‘voom’ function of ‘limma’ package to process RNA-seq data to estimate variance and generate precision weights to correct for Poisson count noise, and the exon-level counts to log2-counts per million (logCPM). Then run linear modelling using ‘lmfit’ function to fit linear models to the expression data for each exon. Compute empirical Bayes statistics for fitted model using ‘eBayes’ function to detect differential exon expression by ranking the exons. Next, define a contrast matrix for the experimental comparisons of interest. Use ‘contrasts.fit’ to obtain coefficients and standard errors for each pair of comparison.
v <- voom(dge, design, plot=FALSE)
fit <- lmFit(v,design)
fit <- eBayes(fit)
colnames(fit)
summary(decideTests(fit))
cont.matrix <- makeContrasts(
Mbnl1_KO_WT = Mbnl1_KO - WT,
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
summary(decideTests(fit2))
Run ‘diffSplice’ on the fitted model to test the differences in exon usage of genes between wild-type and knockout and explore the top ranked results using ‘topSplice’ fuinction.: test=”t” gives a ranking of AS exons, test=”simes” gives a ranking of genes. Run diffSplice on the fitted model to test the differences in exon retention between the wild-type and knockout samples.
ex <- diffSplice(fit2, geneid = exoninfo1$GeneID, exonid = exoninfo1$ExonID)
#Check the top splicing results with topSplice
topSplice(ex)
#Exon-level statistics for splicing activity
ts <-topSplice(ex, n=Inf, FDR=0.1, test= "t", sort.by = "logFC")
#Gene-level statistics for splicing activity
tg <- topSplice(ex, n=Inf, FDR=0.1, test = "simes")
ts %$% GeneID %>% unique %>% length
ts %$% ExonID %>% unique %>% length
NOTE: This step uses the exon information object created in 4.1.2. Make sure it is loaded in the current working environment.
Plot the results with the ‘plotSplice’ function, giving gene of interest in the geneid argument. Save the top results sorted by log Fold change and also calculate per gene statictics using the “simes method”. First create a function to add different sheets as workbooks in excel. Generate a volcano plot to exhibit the exons.
Volcano plot showing up and downregulated genes at pCutoff = FDR 10% and FC 2.
EnhancedVolcano(ts, lab = ts$ExonID, selectLab = head((ts$ExonID),3000),xlab = bquote(~Log[2]~ 'fold change'), x = 'logFC', y = 'P.Value', title = 'Volcano Plot', subtitle = 'Mbnl1_KO vs WT (Limma_diffSplice)', FCcutoff = 1, labSize = 4,legendPosition = "right",caption = bquote(~Log[2]~ "Fold change cutoff, 1; FDR 10%"))
save.image("Mbnl1_KO_Limma_diffSplice.RData")
Ensure the latest version of rMATS v4.1.1 (also known as rMATS turbo due to the increased processing time and less requirements of memory) is installed either using conda or github in the working directory.
For pairwise splicing analysis, we will need sorted bam files for all the samples generated after the read alignment step using STAR. prepare text files for two conditions by copying the name of bam files (along with the path) separated by ‘,’ comma. Following commands should be run on the linux command-line-
mkdir rMATS_analysis
cd bams/
ls -pd "$PWD"/* | grep "WT" | tr '\n' ',' > Wt.txt
ls -pd "$PWD"/* | grep "Mb" | tr '\n' ',' > KO.txt
#Move the files to the main working directory
mv *.txt /rMATS_analysis
cd rMATS_analysis
Run the python script rmats.py with the two input files generated in the previous step, along with the GTF annotation obtained in 1.1.3 by typing the following command on linux command-line.
python rmats_turbo/rmats.py --b1 KO.txt --b2 Wt.txt --gtf annotation.gtf -t paired --readLength 56 --nthread 8 --od rmats_out/ --tmp rmats_tmp
NOTE:The reference annotation in the form of a GTF file is also required. Make sure both the folder containing bam files (bams/) and annotation file (annotation.gtf) is present in the current path/provide the correct path in the above command. (AS_analysis) Check the parameters if the data is single-end, change the -t option.
The results will be in the specified output directory by the -o option(rmats_out). The –tmp option creates a temporary subdirectory to store all the intermediate files in the analysis. The output folder contains text files for different splicing events with p-values. The summary.txt file presents the table of all the events falling in different categories of AS events. The main output file that we will focus on is *.MATS.JCEC.txt, this includes both reads that span junctions defined by rmats (Junction Counts) and reads that do not cross an exon boundary (Exon Counts).
Use Bioconductor package called ‘maser’ to explore the rMATS results. Load the *.JCEC files for each of the 5 events and filter by coverage including an average of 5 reads.
library(maser) %>% invisible
Welcome to maser
mbnl1 <- maser("rMATS_analysis/rmats_out/", c("WT", "Mbnl1_KO"), ftype = "JCEC")
#Filtering out events
mbnl1_filt <- filterByCoverage(mbnl1, avg_reads = 5)
Select the significant splicing events at False Discovery Rate (FDR) 10% and minimum 10% change in Percent Spliced In (deltaPSI) using ‘topEvents’ function from ‘maser’ package. Next , check the gene events for individual genes of interest (sample gene-Wnk1) and plot PSI values for each splicing event of that gene. Generate a volcano plot by specifying the event type.
#Top splicing events at 10% FDR
mbnl1_top <- topEvents(mbnl1_filt, fdr = 0.1, deltaPSI = 0.1)
mbnl1_top
A Maser object with 1012 splicing events.
Samples description:
Label=WT n=3 replicates
Label=Mbnl1_KO n=3 replicates
Splicing events:
A3SS.......... 33 events
A5SS.......... 69 events
SE.......... 706 events
RI.......... 123 events
MXE.......... 81 events
#Check the gene events for a particular gene
## Retrieve Wnk1 splicing events
mbnl1_wnk1 <- geneEvents(mbnl1_filt, geneS = "Wnk1", fdr = 0.1, deltaPSI = 0.1)
maser::display(mbnl1_wnk1, "SE")
plotGenePSI(mbnl1_wnk1, type = "SE", show_replicates = TRUE)
volcano(mbnl1_filt, fdr = 0.1, deltaPSI = 0.1, type = "SE") + xlab("deltaPSI") +ylab("Log10 Adj. Pvalue")+ ggtitle("Volcano Plot of exon skipping events")
NOTE: By default the “summary.txt” obtained in rMATS output folder is generated for FDR 5%. To change the FDR or other parameters run the summary.py script located in rMATS source folder (rmats_turbo/rMATS_P/summary.py) as-
python rmats_turbo/rMATS_P/summary.py rmats_out --p-cutoff 0.1 --summary-prefix rmats_summary
Generate an excel table containing significant events (FDR 10%) of different types of AS events from rMATS ranalysis. Use “add.xlsx” function to generate an excel sheet summarizing all the rMATS result.
# Read rMATS summary and individual text files for all events and combine them into one table
read.table('rMATS_analysis/rmats_summary_FDR_0.1_IncLevelDifference_0.txt', header = T) %T>%
write.xlsx(file = "rMATS_output_summary.xlsx", sheetName = "rMATS_Summary", firstRow = TRUE, headerStyle = createStyle(textDecoration = "BOLD", halign = "center"))
read.table('rMATS_analysis/rmats_out/SE.MATS.JCEC.txt', header = T) %>%
filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>%
add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "SE.MATS.JCEC", .)
read.table('rMATS_analysis/rmats_out/A5SS.MATS.JCEC.txt', header = T) %>%
filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>%
add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "A5SS.MATS.JCEC", .)
read.table('rMATS_analysis/rmats_out/A3SS.MATS.JCEC.txt', header = T) %>%
filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>%
add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "A3SS.MATS.JCEC", .)
read.table('rMATS_analysis/rmats_out/MXE.MATS.JCEC.txt', header = T) %>%
filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>%
add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "MXE.MATS.JCEC", .)
read.table('rMATS_analysis/rmats_out/RI.MATS.JCEC.txt', header = T) %>%
filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>%
add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "RI.MATS.JCEC", .)
NA
Generate Sashimi plot for the splicing events result obtained with rMATS in form of text files using ‘rmats2shahimiplot’ package. Ensure the working directory is set to the rmats2sashimiplot folder, to run the python script by typing the following command on linux command-line.
python ./src/rmats2sashimiplot/rmats2sashimiplot.py --b1 ../bams/WT_Thymus_1.bam,../bams/WT_Thymus_2.bam,../bams/WT_Thymus_3.bam --b2 ../bams/Mbnl1KO_Thymus_1.bam,../bams/Mbnl1KO_Thymus_2.bam,../bams/Mbnl1KO_Thymus_3.bam -t SE -e ../rMATS_analysis/rmats_out/SE.MATS.JC.txt --l1 WT --l2 Mbnl1_KO --exon_s 1 --intron_s 5 -o ../rMATS_analysis/rmats2shasmi_output
NOTE: This process can be time-consuming as it will generate the Sashimi plot for all the results in the events file. Choose the top results (gene names and exons) as displayed by the topEvents function from ‘maser’ and visualize the corresponding Sashimi plot.
NOTE: The results of rMATS for example, the chromosomes in SE.MATS.JC.txt are named with a ‘chr’ prefix by default, however the chromosomes in the input bam file varies according to given annotation file and are sometimes represented without the ‘chr’ prefix (Ensembl convention). This might cause errors when running rmats2sashimi. Hence, modify the Bam input using samtools to add ‘chr’ prefix before running rmats2sashimi for individual genes. Run the following command on the Linux command-line to modify all bam files and extract rMATS result for a particular gene, e.g., “Wnk1”. Replace Wnk1 with your gene of interest.
cd bams
for bam in ./*bam; do
echo $bam
samtools view -h $bam | sed -e '/^@SQ/s/SN\:/SN\:chr/' -e '/^[^@]/s/\t/\tchr/2' | samtools view -bS - > temporary.bam
mv -f temporary.bam $bam
samtools index $bam
done
#Extract rMATS output for 'Wnk1' gene
sed -n "1p;/Wnk1/p" SE.MATS.JCEC.txt > wnk1_SE_JCEC.txt
#Replace Wnk1 with your gene of interest.
Run the python command on the linux command-line to get the pdf files for each event of the sample gene wnk1.
python ./src/rmats2sashimiplot/rmats2sashimiplot.py --b1 ../bams/WT_Thymus_1.bam,../bams/WT_Thymus_2.bam,../bams/WT_Thymus_3.bam --b2 ../bams/Mbnl1KO_Thymus_1.bam,../bams/Mbnl1KO_Thymus_2.bam,../bams/Mbnl1KO_Thymus_3.bam -t SE -e ../rMATS_analysis/rmats_out/wnk1_SE_JCEC.txt --l1 WT --l2 Mbnl1_KO --exon_s 1 --intron_s 5 -o ../rMATS_analysis/rmats2shasmi_output_wnk1
Note:Ensure the working directory is correct. This step will generate several plots for all the events of ‘Wnk1’ gene. Check the event id to select a specific output.
The Sashimi plot for each event type obtained from rmats2sashimi are shown here-
#Example gene 'Wnk1"
knitr::include_graphics("Wnk1-1.png")
#SE example gene 'Add3'
knitr::include_graphics("SE-1.png")
#A5SS example gene 'Baz2b'
knitr::include_graphics("A5SS-1.png")
#A3SS example gene 'Lsm14b'
knitr::include_graphics("A3SS-1.png")
#MXE example gene 'Mta1'
knitr::include_graphics("MXE-1.png")
#RI example gene 'Arpp21'
knitr::include_graphics("RI-1.png")
NOTE: The pdf was converted to image to include them in the RMarkdown file using ‘pdftoppm’ on the linux command-line. (Eg: pdftoppm A5SS_Baz2b.pdf A5SS -png)
print(sessionInfo())
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.31 maser_1.4.0 openxlsx_4.2.3
[4] edgeR_3.28.1 limma_3.42.2 EnhancedVolcano_1.4.0
[7] ggrepel_0.9.1 magrittr_2.0.1 forcats_0.5.1
[10] stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
[13] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6
[16] ggplot2_3.3.3 tidyverse_1.3.0 Rsubread_2.0.1
[19] DEXSeq_1.32.0 RColorBrewer_1.1-2 AnnotationDbi_1.48.0
[22] DESeq2_1.26.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.3
[25] matrixStats_0.58.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[28] IRanges_2.20.2 S4Vectors_0.24.4 Biobase_2.46.0
[31] BiocGenerics_0.32.0 BiocParallel_1.20.1
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.2.1 Hmisc_4.4-2
[4] BiocFileCache_1.10.2 plyr_1.8.6 lazyeval_0.2.2
[7] splines_3.6.3 crosstalk_1.1.1 digest_0.6.27
[10] ensembldb_2.10.2 htmltools_0.5.1.1 checkmate_2.0.0
[13] memoise_2.0.0 BSgenome_1.54.0 cluster_2.1.0
[16] Biostrings_2.54.0 annotate_1.64.0 modelr_0.1.8
[19] askpass_1.1 prettyunits_1.1.1 jpeg_0.1-8.1
[22] colorspace_2.0-0 blob_1.2.1 rvest_0.3.6
[25] rappdirs_0.3.3 haven_2.3.1 xfun_0.20
[28] crayon_1.4.0 RCurl_1.98-1.2 jsonlite_1.7.2
[31] genefilter_1.68.0 survival_3.2-7 VariantAnnotation_1.32.0
[34] glue_1.4.2 gtable_0.3.0 zlibbioc_1.32.0
[37] XVector_0.26.0 scales_1.1.1 DBI_1.1.1
[40] Rcpp_1.0.6 xtable_1.8-4 progress_1.2.2
[43] htmlTable_2.1.0 foreign_0.8-76 bit_4.0.4
[46] Formula_1.2-4 DT_0.17 htmlwidgets_1.5.3
[49] httr_1.4.2 ellipsis_0.3.1 farver_2.0.3
[52] pkgconfig_2.0.3 XML_3.99-0.3 Gviz_1.30.3
[55] nnet_7.3-14 dbplyr_2.1.0 locfit_1.5-9.4
[58] labeling_0.4.2 reshape2_1.4.4 tidyselect_1.1.0
[61] rlang_0.4.10 munsell_0.5.0 cellranger_1.1.0
[64] tools_3.6.3 cachem_1.0.3 cli_2.3.0
[67] generics_0.1.0 RSQLite_2.2.3 broom_0.7.4
[70] fastmap_1.1.0 yaml_2.2.1 bit64_4.0.5
[73] fs_1.5.0 zip_2.1.1 AnnotationFilter_1.10.0
[76] xml2_1.3.2 biomaRt_2.42.1 compiler_3.6.3
[79] rstudioapi_0.13 curl_4.3 png_0.1-7
[82] reprex_1.0.0 statmod_1.4.35 geneplotter_1.64.0
[85] stringi_1.5.3 GenomicFeatures_1.38.2 lattice_0.20-41
[88] ProtGenerics_1.18.0 Matrix_1.2-18 vctrs_0.3.6
[91] pillar_1.4.7 lifecycle_0.2.0 BiocManager_1.30.10
[94] data.table_1.13.6 bitops_1.0-6 rtracklayer_1.46.0
[97] R6_2.5.0 latticeExtra_0.6-29 hwriter_1.3.2
[100] gridExtra_2.3 dichromat_2.0-0 assertthat_0.2.1
[103] openssl_1.4.3 withr_2.4.1 GenomicAlignments_1.22.1
[106] Rsamtools_2.2.3 GenomeInfoDbData_1.2.2 hms_1.0.0
[109] grid_3.6.3 rpart_4.1-15 biovizBase_1.34.1
[112] lubridate_1.7.9.2 base64enc_0.1-3