--- title: "Identification of Alternative Splicing and polyadenylation in bulk RNA-seq data" author: "Gunjan Dixit" date: "22/01/2021" output: html_notebook: theme: united toc: yes editor_options: chunk_output_type: inline --- # Introduction NOTE: The context of this R notebook includes all the R code required for the analysis of Differential splicing analysis of bulk RNA-Seq data. Refer to the PART 1- Bulk RNA-Seq analysis of the Protocol section. # Installation of tools and R packages used in the analysis- Conda is a popular and flexible package manager that allows convenient installation of packages with their dependencies across all platforms. Use 'Anaconda' (conda package manager) to install 'conda' which can be used to install the tools/packages required for the analysis. Download 'Anaconda' according to the system requirements from https://www.anaconda.com/products/individual#Downloads and install it by following the prompts in graphical installer. Install all the packages using 'conda' by typing the following commands on linux command-line. ```{bash} conda install -c daler sratoolkit conda install -c conda-forge parallel conda install -c bioconda star fastqc rmats rmats2sashimiplot ``` To download all the R packages used in the protocol, type the following code in R console or R-studio. ```{r} bioc_packages <- c("DEXSeq", "Rsubread", "EnhancedVolcano", "rtracklayer", "edgeR", "limma", "maser") packages <- c("magrittr", "tidyverse","openxlsx", "BiocManager", "GenomicRanges") #Install if not already installed installed_packages <- packages %in% rownames(installed.packages()) installed_bioc_packages <- bioc_packages %in% rownames(installed.packages()) if (any(installed_packages == FALSE)) { install.packages(packages[!installed_packages],dependencies = TRUE) BiocManager::install(packages[!installed_bioc_packages], dependencies = TRUE) } ``` In this computational protocol, commands will be given as either R Notebook files (files with extension “.Rmd”), R code files (files with extension “.R”), or Linux Bash shell scripts (files with extension “.sh”). R Notebook (Rmd) files should be opened in RStudio using File|Open File…, and individual code chunks (which may be R commands or Bash shell commands) then run interactively by clicking the green arrow at the upper right. R code files can be run by opening in RStudio, or on the Linux command-line by prefacing with “Rscript”, e.g. Rscript example.R. Shell scripts are run on the Linux command-line by prefacing the script with the “sh” command e.g. sh example.sh. # 1. Data downloading and pre-processing The code snippets annotated below are available in the supplementary code file “Downloading_data_preprocessing.Rmd”, to follow the individual steps interactively, and are also provided as a bash script to be run in batch on the Linux command-line (sh downloading_data_preprocessing.sh). ### 1.1 Downloading the raw data. Download the raw data from Sequence Read Archive [SRA] using the 'prefetch' command from SRA toolkit. Give the SRA Accession Ids in sequence in the following command to download them in parallel using GNU parallel using GNU parallel utility. Ensure SRA toolkit and GNU parallel is installed before proceeding this step. To download SRA files of accession ids from SRR10261601 to SRR10261606 in parallel, type the following commands on the linux command-line- ```{bash} seq 10261601 10261606 | parallel prefetch SRR{} ``` ### 1.1.2 Extract the fastq files. Extract the fastq files from the archive using 'fastq-dump' function of SRA toolkit. Use GNU parallel and give the names of all SRA files together. ```{bash} parallel -j 3 fastq-dump --gzip --skip-technical --read-filter pass --dumpbase --split-e --clip --origfmt {} ::: ``` ### 1.1.3 Download reference genome and annotation Download the reference genome and annotations for Mouse (Genome assembly GRCm39) from www.ensembl.org using the following at the Linux command-line. The 'wget' function downloads the destination file in the provided link and 'gunzip' will decompress the file. Store both genome and annotation file in specific variables for further use. ```{bash} wget -nv -O annotation.gtf.gz http://ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz \ && gunzip -f annotation.gtf.gz wget -nv -O genome.fa.gz http://ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \ && gunzip -f genome.fa.gz GTF=$(readlink -f annotation.gtf) GENOME=$(readlink -f genome.fa) ``` ## 1.2 Pre-processing ### 1.2.1 Quality Control Assess the quality of raw reads with FASTQC (v0.11.9). Create an output folder and run fastqc with parallel on multiple input fastq files. Ensure the path of raw input files is correct before running the command. This step will generate quality report for each sample. Examine the reports to ensure the quality of reads is acceptable before doing further analysis.(Refer to the user manual for understanding the reports at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) ```{bash} mkdir fastqc_out parallel "fastqc {} -o fastqc_out" ::: $RAW_DATA/*.fastq.gz ``` NOTE: If necessary, perform adapter trimming with 'cutadapt' or 'trimmomatic' to remove sequencing into flanking adapters, which varies based on RNA fragment size and read length. In this analysis we skipped this step as the fraction of reads affected was minimal. ### 1.2.2 Read alignment The next step in pre-processing includes mapping the reads to the reference genome. Firstly, build the index for the reference genome using 'genomeGenerate' function of STAR (Spliced Transcripts Alignment to a Reference v2.7.5c) and then align the raw reads to the reference (alternatively prebuilt indexes are available on the STAR website and can be used directly for read alignment). Run the following commands as a bash script on the linux command-line. ```{bash} #Build STAR index GDIR=STAR_indices mkdir $GDIR STAR --runMode genomeGenerate --genomeFastaFiles $GENOME --sjdbGTFfile $GTF --runThreadN 8 --genomeDir $GDIR ODIR=results/mapping mkdir -p $ODIR #Align reads to the genome for fq1 in $RAW_DATA/*R1.fastq.gz; do fq2=$(echo $fq1 | sed 's/1.fastq.gz/2.fastq.gz/g'); OUTPUT=$(basename ${fq1}| sed 's/R1.fastq.gz//g'); STAR --genomeDir $GDIR \ --runThreadN 12 \ --readFilesCommand zcat \ --readFilesIn ${fq1} ${fq2}\ --outFileNamePrefix $ODIR\/${OUTPUT} \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard Done ``` STAR aligner will generate and sort BAM files for each sample after read alignments. Bam files must be sorted before proceeding to further steps. Create a folder 'bams' and copy all the bam files to it. # 2. Preparing Exon annotations Run the supplementary file- "prepare_mm_exon_annotation.R" with the downloaded annotation in GTF format on the linux command-line to prepare the annotations. Save the given code as 'prepare_mm_exon_annotation.R' file and run it on the linux command-line as- ```{bash} Rscript prepare_mm_exon_annotation.R annotation.gtf ``` Or run it on RStudio as: ```{r} #! /usr/bin/env Rscript #To run- Rscript prepare_mm_annotation.R library(rtracklayer) library(tidyverse) args = commandArgs(trailingOnly = TRUE) gff_mm <- args[1] mgg <- import(gff_mm) anno = mgg %>% data.frame %>% mutate(seqnames = as.character(seqnames)) %>% dplyr::filter(!grepl("chr[1-19]|X|Y", seqnames)) %>% filter(type %in% "exon") %>% select(seqnames, start, end ,width, strand, gene_name, transcript_id) %>% distinct() %>% group_by(gene_name, seqnames, start, end, width, strand) %>% summarise(transcript_ids = paste(transcript_id, sep="",collapse=",")) %>% ungroup() %>% mutate(number = 1) %>% arrange( seqnames, start , end) %>% group_by(gene_name, seqnames) %>% # mutate(ticker = cumsum(number)) %>% mutate(ExonID = paste(gene_name, ".",seqnames,".", ticker, sep="")) %>% dplyr::select(-number, ticker) %>% add_count(gene_name) %>% data.frame colnames(anno) <- c("GeneID", "Chr", "Start", "End", "Width", "Strand", "TranscriptIDs", "Ticker", "ExonID", "n") save(anno, file=paste("mm_exon_anno.RData", sep="")) ``` The GTF file contains multiple exon entries for different isoforms. This file is used to ‘collapse’ the multiple transcript IDs for each exon using a comma. It is an important step to define exon counting bins. # 3 Counting Reads The next step is to count the number of reads mapped to different transcripts/exons. Use RStudio to run the following commands or call R from command-line by typing 'R'. ### 3.1 Load required libraries- The following packages are required for this part of the analysis. Make sure to install the packages if they are not already installed. ```{r} packages <- c("DEXSeq","Rsubread","tidyverse", "magrittr","EnhancedVolcano", "edgeR","openxlsx") #Load libraries invisible(lapply(packages, library, character.only = TRUE)) ``` ### 3.2 Load the processed annotation file- ```{r} load("mm_exon_anno.RData") ``` ### 3.3 Read the bam files obtained in step 1.2.2(after mapping) as input for 'featureCounts'. Read the folder containing bam files by first listing each file from the directory that ends with .bam. Use 'featureCounts' from the Rsubread package which takes bam files and processed GTF annotation (reference) as input to generate a matrix of counts associated with each feature with rows representing exons(features) and columns representing samples. Before running this step, make sure you have the folder containing bam files in the current working directory. Next, perform non-specific filtering to remove lowly expressed exons. Transform the data from raw scale to counts per million (cpm) using the cpm function from 'edgeR' package and keep the counts greater than one in at least three samples. Remove the genes with only one exon. ```{r} # Read all bam files as input for featureCounts countData <- dir("bams", pattern=".bam$", full.names=T) %>% featureCounts(filesToCount, annot.ext=anno, isGTFAnnotationFile=FALSE, minMQS=0, useMetaFeatures=FALSE, allowMultiOverlap=TRUE, largestOverlap = TRUE, countMultiMappingReads=FALSE, primaryOnly=TRUE, isPairedEnd=TRUE, nthreads = 12) # Non-specific filtering: Remove the exons with low counts isexpr <- rownames(countData$counts)[rowSums(cpm(countData$counts) > 1) >= 3] countData$counts <- countData$counts[rownames(countData$counts) %in% isexpr, ] anno <- anno %>% filter(GeneID %in% rownames(countData$counts)) # Remove genes with only 1 site and NA in geneIDs dn <- anno %>% group_by(GeneID) %>% summarise(nsites=n()) %>% filter(nsites > 1 & !is.na(GeneID)) anno <- anno %>% filter(GeneID %in% dn$GeneID) countData$counts <- countData$counts[rownames(countData$counts) %in% anno$GeneID, ] ``` NOTE: Examine the library type(single-end or paired-end) and modify the parameters in featureCounts respectively. # 4. Differential Splicing and Exon usage analysis We describe two alternatives for this step: DEXSeq and DiffSplice. Either can be used and give similar results. For consistency, select DEXSeq if you prefer a DESeq2 package for DGE and use DiffSplice for a Limma-based DGE package. ## 4.1 Using DEXSeq pipeline for differential exon analysis ### 4.1.1 Create a sample table to define the experimental design ```{r} sampleTable = data.frame( row.names = c( "Mbnl1KO_Thymus_1", "Mbnl1KO_Thymus_2", "Mbnl1KO_Thymus_3", "WT_Thymus_1", "WT_Thymus_2", "WT_Thymus_3" ), condition = rep(c("Mbnl1_KO", "WT"),c(3,3)), libType = rep(c("paired-end"))) ``` NOTE: The row names should be consistent with the bam file names used by featureCounts to count the reads. sampleTable consists of details of each sample which includes- library-type and condition. This is required to define the contrasts or test group for detecting differential usage. ### 4.1.2 Prepare the exon information file Exon information in form of GRanges (genomic ranges) object is required as an input to create DEXSeq object. Match the gene Ids with the read counts and to create exoninfo object. ```{r} exoninfo = anno[anno$GeneID %in% rownames(countData$counts),] exoninfo <- GRanges (seqnames= anno$Chr, ranges= IRanges (start=anno$Start, end=anno$End, width = anno$Width), strand = Rle(anno$Strand)) mcols(exoninfo)$TranscriptIDs <- anno$TranscriptIDs mcols(exoninfo)$Ticker <- anno$Ticker mcols(exoninfo)$ExonID <- anno$ExonID mcols(exoninfo)$n <- anno$n mcols(exoninfo)$GeneID <- anno$GeneID transcripts_l = strsplit(exoninfo$TranscriptIDs, "\\,") #Save the countData, sampleTable and exoninfo and transcripts in one object for further use save(countData, sampleTable, exoninfo, transcripts_l, file="AS_countdata.RData") ``` NOTE: Be careful with the names of the columns of the processed annotation file. With a different GTF file, the headers can change and cause a problem while running this step. ### 4.1.3 Create DEXSeq object Create DEXSeq object using DEXSeqDataSet function. The DEXSeq object collects together read counts, exon feature information, and sample information. Use the read counts generated in step 3 and the exon information obtained from the previous step to create the DEXSeq object from the count matrix. The sampleData argument takes a data frame input defining the samples (and their attributes: library type and condition), 'design' uses sampleData to generate a design matrix for the differential testing using model formula notation. Note that a significant interaction term, condition:exon, indicates that the fraction of reads over a gene falling on a particular exon depends on the experimental condition i.e. there is AS. See DEXSeq documentation for a full description of setting the model formula for more complex experimental designs. For feature information- exon ids, corresponding gene and transcripts are required from the processed annotation file. ```{r} dxd = DEXSeqDataSet( countData$counts, sampleData=sampleTable, design= ~ sample + exon + condition:exon, featureID = exoninfo$ExonID, groupID = exoninfo$GeneID, featureRanges = exoninfo, transcripts = transcripts_l) ``` NOTE: This step might take some time. Once the object is created, explore the columns for initial understanding/exploration of the data. ```{r} # Explore/Inspect the DEXSeq object head(counts(dxd), 5) colData(dxd) split(seq_len(ncol(dxd)), colData(dxd)$exon) head(rowRanges(dxd),3) sampleAnnotation(dxd) ``` ### 4.1.4 Normalization and Dispersion Estimation Next, perform normalization between samples and estimate the variance of the data, due to both Poisson count noise from the discrete nature of RNA-seq and biological variability, using the following commands. ```{r} dxd %<>% estimateSizeFactors %>% estimateDispersions %T>% plotDispEsts ``` ### 4.1.5 Testing for Differential exon Usage After the estimation of variation, test for differential exon usage for each gene and generate the results. The results are calculated at FDR 10%. ```{r} dxd %<>% testForDEU %>% estimateExonFoldChanges(fitExpToVar = "condition") #Estimate fold changes dxr = DEXSeqResults(dxd) dxr mcols(dxr)$description ## ----tallyExons------------------------------------------------------------ table(dxr$padj < 0.1) ## ----tallyGenes------------------------------------------------------------ table(tapply(dxr$padj < 0.1, dxr$groupID, any)) ``` ### 4.1.6 Visualization of splicing events Visualize splicing events for selected genes using the following command ```{r} plotDEXSeq(dxr, "Wnk1", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) plotDEXSeq(dxr, "Tcf7", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) plotDEXSeq(dxr, "Mbnl1", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) plotDEXSeq(dxr, "Lef1", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) plotDEXSeq(dxr, "Ncor2", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) plotDEXSeq(dxr, "Mbnl2", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) ``` To display transcripts use displayTranscripts=TRUE ```{r} plotDEXSeq(dxr, "Rps3a1", displayTranscripts=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) ``` ```{r} plotDEXSeq(dxr, "Lef1", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) plotDEXSeq(dxr, "Mbnl1", expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2) ``` Generate plots for all the significant genes ```{r} dxr = dxr[!is.na(dxr$padj),] dgene = data.frame(perGeneQValue=perGeneQValue(dxr)) %>% rownames_to_column("groupID") dexon = dxr %>% data.frame() %>% dplyr::select(-matches("dispersion|stat|countData|genomicData")) %>% inner_join(dgene) %>% arrange(perGeneQValue) %>% distinct() "openxlsx" %>% lapply(library, character.only = TRUE) %>% invisible dexon_sig = dexon %>% filter(padj < 0.1) %>% dplyr::select(-transcripts) %>% .[order(.$pvalue),] %T>% write.xlsx(file = "RNASeq_DEXSeq_significant_genes.xlsx", sheetName = "Mbnl1_KO vs WT", firstRow = TRUE, headerStyle = createStyle(textDecoration = "BOLD", halign = "center")) pdf(file= "DEXSeq_sig_Mbnl1KO_FDR_10.pdf", height = 15, width = 15) sig_genes = dexon_sig %>% filter(perGeneQValue <= 0.1) %$% groupID %>% unique for (geneid in sig_genes ) { nn = nrow(dxr[dxr$groupID %in% geneid, ]) if (nn >1 ) { plotDEXSeq(dxr, geneid, legend=TRUE,displayTranscripts=TRUE, splicing=TRUE, cex.axis=1.2, cex=1.3, lwd=2 ) } } dev.off() ``` Volcano plot to visualize differentially expressed genes ```{r} EnhancedVolcano(dexon_sig, lab = dexon_sig$featureID, x = 'log2fold_WT_Mbnl1_KO', y = 'pvalue', title = 'Volcano Plot', subtitle = 'Mbnl1_KO vs WT (DEXSeq)', FCcutoff = 1, labSize = 4,legendPosition = "right",xlim= c(-4,4),caption = bquote(~Log[2]~ "Fold change cutoff, 2; FDR 10%")) ``` NOTE: EnhancedVolcano package was used to generate the above plot, install the package it if not already installed. ### Save Rdata objects ```{r} save.image("Mbnl1_KO_Dexseq.RData") ``` ## 4.2 Using Limma diffSplice for differential splicing analysis Follow the R Notebook file “AS_analysis_RNASeq.Rmd”. Ensure steps 1-3 have been followed to prepare input files before proceeding further. ### 4.2.1 Load libraries ```{r} library(limma) library(edgeR) ``` ### 4.2.2 Non-specific Filtering- Extract the matrix of read counts obtained in 3. Create a list of features using ‘DGEList’ function from edgeR package, where rows represent genes and columns represent samples. NOTE: As a non-specific filtering step, counts are filtered by cpm < 1 in x out of n samples, where x is the minimum number of replicates in any condition. n = 6 and x = 3 for this example data. ```{r} mycounts = countData$counts #Change the rownames of the countdata to exon Ids instead of genes for unique rownames. rownames(mycounts) = exoninfo$ExonID dge <- DGEList(counts=mycounts) #Non-specific Filtering isexpr <- rowSums(cpm(dge) > 1) >=3 dge <- dge[isexpr,,keep.lib.sizes=FALSE] #Extract the annotations for only filtered counts exoninfo = anno %>% filter(ExonID %in% rownames(dge$counts)) #Convert the exoninfo into GRanges object exoninfo1 <- GRanges (seqnames= exoninfo$Chr, ranges= IRanges (start=exoninfo$Start, end=exoninfo$End, width = exoninfo$Width), strand = Rle(exoninfo$Strand)) mcols(exoninfo1)$TranscriptIDs <- exoninfo$TranscriptIDs mcols(exoninfo1)$Ticker <- exoninfo$Ticker mcols(exoninfo1)$ExonID <- exoninfo$ExonID mcols(exoninfo1)$n <- exoninfo$n mcols(exoninfo1)$GeneID <- exoninfo$GeneID transcripts_l = strsplit(exoninfo1$TranscriptIDs, "\\,") ``` ### 4.2.3 Normalize the counts- Normalize the counts across samples, with ‘calcNormFactors’ function from ‘edgeR’ package using Trimmed Mean of M values (TMM normalization method). It will compute scaling factors to adjust library sizes. ```{r} dge <- calcNormFactors(dge) ``` ### 4.2.4 Create design matrix for comparisons Use sampleTable as generated in step 4.1.1 and create the design matrix. The design matrix characterizes the design. See the Limma User Guide [ref] chapters 8 & 9 for details on design matrices for more advanced experimental designs. ```{r} Treat <- factor(sampleTable$condition) design <- model.matrix(~0+Treat) colnames(design) <- levels(Treat) ``` ### 4.2.5 Differential expression testing Run ‘voom’ function of ‘limma’ package to process RNA-seq data to estimate variance and generate precision weights to correct for Poisson count noise, and the exon-level counts to log2-counts per million (logCPM). Then run linear modelling using ‘lmfit’ function to fit linear models to the expression data for each exon. Compute empirical Bayes statistics for fitted model using ‘eBayes’ function to detect differential exon expression by ranking the exons. Next, define a contrast matrix for the experimental comparisons of interest. Use ‘contrasts.fit’ to obtain coefficients and standard errors for each pair of comparison. ```{r} v <- voom(dge, design, plot=FALSE) fit <- lmFit(v,design) fit <- eBayes(fit) colnames(fit) summary(decideTests(fit)) cont.matrix <- makeContrasts( Mbnl1_KO_WT = Mbnl1_KO - WT, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) summary(decideTests(fit2)) ``` ### 4.2.6 Differential Splicing analysis Run ‘diffSplice’ on the fitted model to test the differences in exon usage of genes between wild-type and knockout and explore the top ranked results using ‘topSplice’ fuinction.: test=”t” gives a ranking of AS exons, test=”simes” gives a ranking of genes. Run diffSplice on the fitted model to test the differences in exon retention between the wild-type and knockout samples. ```{r} ex <- diffSplice(fit2, geneid = exoninfo1$GeneID, exonid = exoninfo1$ExonID) #Check the top splicing results with topSplice topSplice(ex) #Exon-level statistics for splicing activity ts <-topSplice(ex, n=Inf, FDR=0.1, test= "t", sort.by = "logFC") #Gene-level statistics for splicing activity tg <- topSplice(ex, n=Inf, FDR=0.1, test = "simes") ts %$% GeneID %>% unique %>% length ts %$% ExonID %>% unique %>% length ``` NOTE: This step uses the exon information object created in 4.1.2. Make sure it is loaded in the current working environment. ### 4.2.7 Visualization Plot the results with the ‘plotSplice’ function, giving gene of interest in the geneid argument. Save the top results sorted by log Fold change and also calculate per gene statictics using the "simes method". First create a function to add different sheets as workbooks in excel. Generate a volcano plot to exhibit the exons. ```{r} #Plot top 3 genes plotSplice(ex, geneid="Mbnl1", FDR = 0.1) plotSplice(ex, geneid="Ccdc88c", FDR = 0.1) plotSplice(ex, geneid="Wnk1", FDR = 0.1) #Plot experimentally validated genes plotSplice(ex, geneid="Lef1", FDR = 0.1) plotSplice(ex, geneid="Tcf7", FDR = 0.1) plotSplice(ex, geneid="Ncor2", FDR = 0.1) #Save the list of top differentially spliced exons at FDR 10% sig <- ts[ts$FDR<0.1,] %>% .[order(.$P.Value),] write.xlsx(sig, file = "RNASeq_diffSplice_significant_genes.xlsx", sheetName = "Mbnl1_KO vs WT", firstRow = TRUE, headerStyle = createStyle(textDecoration = "BOLD", halign = "center")) # Function to add rMATS output into one excel add.xlsx.sheet <- function(fileName, sheetName, data){ wb <- loadWorkbook(fileName) addWorksheet(wb, sheetName = sheetName) freezePane(wb, sheet = sheetName, firstRow = TRUE) writeData(wb, sheet = sheetName, data, headerStyle = createStyle(textDecoration = "BOLD", halign = "center")) saveWorkbook(wb, file = fileName, overwrite = TRUE) } add.xlsx.sheet(file = "RNASeq_diffSplice_significant_genes.xlsx", "Gene-level Statistics", ts1) ``` ### Volcano plot- Volcano plot showing up and downregulated genes at pCutoff = FDR 10% and FC 2. ```{r} EnhancedVolcano(ts, lab = ts$ExonID, selectLab = head((ts$ExonID),3000),xlab = bquote(~Log[2]~ 'fold change'), x = 'logFC', y = 'P.Value', title = 'Volcano Plot', subtitle = 'Mbnl1_KO vs WT (Limma_diffSplice)', FCcutoff = 1, labSize = 4,legendPosition = "right",caption = bquote(~Log[2]~ "Fold change cutoff, 1; FDR 10%")) ``` ### Save the R object ```{r} save.image("Mbnl1_KO_Limma_diffSplice.RData") ``` ## 4.3 Using rMATS to identify different types of splicing events ### 4.3.1 Download and Install rMATS Ensure the latest version of rMATS v4.1.1 (also known as rMATS turbo due to the increased processing time and less requirements of memory) is installed either using conda or github in the working directory. ### 4.3.2 Required Files For pairwise splicing analysis, we will need sorted bam files for all the samples generated after the read alignment step using STAR. prepare text files for two conditions by copying the name of bam files (along with the path) separated by ‘,’ comma. Following commands should be run on the linux command-line- ```{bash} mkdir rMATS_analysis cd bams/ ls -pd "$PWD"/* | grep "WT" | tr '\n' ',' > Wt.txt ls -pd "$PWD"/* | grep "Mb" | tr '\n' ',' > KO.txt #Move the files to the main working directory mv *.txt /rMATS_analysis cd rMATS_analysis ``` ### 4.3.3 Run rMATS Run the python script rmats.py with the two input files generated in the previous step, along with the GTF annotation obtained in 1.1.3 by typing the following command on linux command-line. ```{python} python rmats_turbo/rmats.py --b1 KO.txt --b2 Wt.txt --gtf annotation.gtf -t paired --readLength 56 --nthread 8 --od rmats_out/ --tmp rmats_tmp ``` NOTE:The reference annotation in the form of a GTF file is also required. Make sure both the folder containing bam files (bams/) and annotation file (annotation.gtf) is present in the current path/provide the correct path in the above command. (AS_analysis) Check the parameters if the data is single-end, change the -t option. The results will be in the specified output directory by the -o option(rmats_out). The --tmp option creates a temporary subdirectory to store all the intermediate files in the analysis. The output folder contains text files for different splicing events with p-values. The summary.txt file presents the table of all the events falling in different categories of AS events. The main output file that we will focus on is *.MATS.JCEC.txt, this includes both reads that span junctions defined by rmats (Junction Counts) and reads that do not cross an exon boundary (Exon Counts). ### 4.3.4 Exploring rMATS results Use Bioconductor package called 'maser' to explore the rMATS results. Load the *.JCEC files for each of the 5 events and filter by coverage including an average of 5 reads. ```{r} library(maser) %>% invisible mbnl1 <- maser("rMATS_analysis/rmats_out/", c("WT", "Mbnl1_KO"), ftype = "JCEC") #Filtering out events mbnl1_filt <- filterByCoverage(mbnl1, avg_reads = 5) ``` ### 4.3.4 Visualizing rMATS results Select the significant splicing events at False Discovery Rate (FDR) 10% and minimum 10% change in Percent Spliced In (deltaPSI) using ‘topEvents’ function from ‘maser’ package. Next , check the gene events for individual genes of interest (sample gene-Wnk1) and plot PSI values for each splicing event of that gene. Generate a volcano plot by specifying the event type. ```{r} #Top splicing events at 10% FDR mbnl1_top <- topEvents(mbnl1_filt, fdr = 0.1, deltaPSI = 0.1) mbnl1_top #Check the gene events for a particular gene ## Retrieve Wnk1 splicing events mbnl1_wnk1 <- geneEvents(mbnl1_filt, geneS = "Wnk1", fdr = 0.1, deltaPSI = 0.1) maser::display(mbnl1_wnk1, "SE") plotGenePSI(mbnl1_wnk1, type = "SE", show_replicates = TRUE) volcano(mbnl1_filt, fdr = 0.1, deltaPSI = 0.1, type = "SE") + xlab("deltaPSI") +ylab("Log10 Adj. Pvalue")+ ggtitle("Volcano Plot of exon skipping events") ``` NOTE: By default the "summary.txt" obtained in rMATS output folder is generated for FDR 5%. To change the FDR or other parameters run the summary.py script located in rMATS source folder (rmats_turbo/rMATS_P/summary.py) as- ```{python} python rmats_turbo/rMATS_P/summary.py rmats_out --p-cutoff 0.1 --summary-prefix rmats_summary ``` Generate an excel table containing significant events (FDR 10%) of different types of AS events from rMATS ranalysis. Use "add.xlsx" function to generate an excel sheet summarizing all the rMATS result. ```{r} # Read rMATS summary and individual text files for all events and combine them into one table read.table('rMATS_analysis/rmats_summary_FDR_0.1_IncLevelDifference_0.txt', header = T) %T>% write.xlsx(file = "rMATS_output_summary.xlsx", sheetName = "rMATS_Summary", firstRow = TRUE, headerStyle = createStyle(textDecoration = "BOLD", halign = "center")) read.table('rMATS_analysis/rmats_out/SE.MATS.JCEC.txt', header = T) %>% filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>% add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "SE.MATS.JCEC", .) read.table('rMATS_analysis/rmats_out/A5SS.MATS.JCEC.txt', header = T) %>% filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>% add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "A5SS.MATS.JCEC", .) read.table('rMATS_analysis/rmats_out/A3SS.MATS.JCEC.txt', header = T) %>% filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>% add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "A3SS.MATS.JCEC", .) read.table('rMATS_analysis/rmats_out/MXE.MATS.JCEC.txt', header = T) %>% filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>% add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "MXE.MATS.JCEC", .) read.table('rMATS_analysis/rmats_out/RI.MATS.JCEC.txt', header = T) %>% filter(FDR <= 0.1) %>% .[order(.$PValue),] %T>% add.xlsx.sheet(file = "rMATS_output_summary.xlsx", "RI.MATS.JCEC", .) ``` Generate Sashimi plot for the splicing events result obtained with rMATS in form of text files using ‘rmats2shahimiplot’ package. Ensure the working directory is set to the rmats2sashimiplot folder, to run the python script by typing the following command on linux command-line. ```{python} python ./src/rmats2sashimiplot/rmats2sashimiplot.py --b1 ../bams/WT_Thymus_1.bam,../bams/WT_Thymus_2.bam,../bams/WT_Thymus_3.bam --b2 ../bams/Mbnl1KO_Thymus_1.bam,../bams/Mbnl1KO_Thymus_2.bam,../bams/Mbnl1KO_Thymus_3.bam -t SE -e ../rMATS_analysis/rmats_out/SE.MATS.JC.txt --l1 WT --l2 Mbnl1_KO --exon_s 1 --intron_s 5 -o ../rMATS_analysis/rmats2shasmi_output ``` NOTE: This process can be time-consuming as it will generate the Sashimi plot for all the results in the events file. Choose the top results (gene names and exons) as displayed by the topEvents function from 'maser' and visualize the corresponding Sashimi plot. NOTE: The results of rMATS for example, the chromosomes in SE.MATS.JC.txt are named with a ‘chr’ prefix by default, however the chromosomes in the input bam file varies according to given annotation file and are sometimes represented without the ‘chr’ prefix (Ensembl convention). This might cause errors when running rmats2sashimi. Hence, modify the Bam input using samtools to add 'chr' prefix before running rmats2sashimi for individual genes. Run the following command on the Linux command-line to modify all bam files and extract rMATS result for a particular gene, e.g., "Wnk1". Replace Wnk1 with your gene of interest. ```{bash} cd bams for bam in ./*bam; do echo $bam samtools view -h $bam | sed -e '/^@SQ/s/SN\:/SN\:chr/' -e '/^[^@]/s/\t/\tchr/2' | samtools view -bS - > temporary.bam mv -f temporary.bam $bam samtools index $bam done #Extract rMATS output for 'Wnk1' gene sed -n "1p;/Wnk1/p" SE.MATS.JCEC.txt > wnk1_SE_JCEC.txt #Replace Wnk1 with your gene of interest. ``` Run the python command on the linux command-line to get the pdf files for each event of the sample gene wnk1. ```{python} python ./src/rmats2sashimiplot/rmats2sashimiplot.py --b1 ../bams/WT_Thymus_1.bam,../bams/WT_Thymus_2.bam,../bams/WT_Thymus_3.bam --b2 ../bams/Mbnl1KO_Thymus_1.bam,../bams/Mbnl1KO_Thymus_2.bam,../bams/Mbnl1KO_Thymus_3.bam -t SE -e ../rMATS_analysis/rmats_out/wnk1_SE_JCEC.txt --l1 WT --l2 Mbnl1_KO --exon_s 1 --intron_s 5 -o ../rMATS_analysis/rmats2shasmi_output_wnk1 ``` Note:Ensure the working directory is correct. This step will generate several plots for all the events of 'Wnk1' gene. Check the event id to select a specific output. The Sashimi plot for each event type obtained from rmats2sashimi are shown here- ```{r} #Example gene 'Wnk1" knitr::include_graphics("Wnk1-1.png") #SE example gene 'Add3' knitr::include_graphics("SE-1.png") #A5SS example gene 'Baz2b' knitr::include_graphics("A5SS-1.png") #A3SS example gene 'Lsm14b' knitr::include_graphics("A3SS-1.png") #MXE example gene 'Mta1' knitr::include_graphics("MXE-1.png") #RI example gene 'Arpp21' knitr::include_graphics("RI-1.png") ``` NOTE: The pdf was converted to image to include them in the RMarkdown file using 'pdftoppm' on the linux command-line. (Eg: pdftoppm A5SS_Baz2b.pdf A5SS -png) ## Print Session Info- ```{r} print(sessionInfo()) ```