---
title: "Identification of Alternative PolyAdenylation (APA) in 3p-seq data"
author: "Ying Zheng"
date: "24/02/2021"
output:
  html_notebook:
    theme: united
    toc: yes
editor_options:
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction
The content of this R notebook file includes all the codes requied for the Alternative PolyAdenylation (APA) analysis using 3p-Seq data. Refer to 'PART 2- Alternative polyadenylation (APA) analysis using 3p-seq' of the Protocol section.
NOTE: To proceed with each step below to get the results, type the bash commands on Linux command-line then press Enter, or type the R codes in R console then press Enter. 

# 0. Preparation: Installation of tools and packages used in this analysis
Ensure the related tools and packages used for APA analysis are installed correctly to guarantee the successful running of the following pipeline. As the analysis is required to be performed both on the Linux command-line and in the R console, and perl is expected to be used for data processing, ensure all of the environments are prepared at the beginning.

## 0.1 Installation of bash tools and packages
### 0.1.1 Installation of `conda`
`conda` is a popular and flexible package manager that allows convenient installation of packages with their dependencies across all platforms. To assist the further installation of related bash tools/packages required in this analysis, install `conda` at first by following the installing instruction of `Anaconda` (conda package manager) document at https://docs.anaconda.com/anaconda/install/. Ensure the system requirements and the installer are selected properly. 

### 0.1.2 Installation of bash tools and packages using `conda`
Install all the required packages in this analysis using `conda` on Linux command-line.
```{bash, eval=FALSE}
conda install -c daler sratoolkit 
conda install -c conda-forge parallel
conda install -c bioconda bowtie sratoolkit cutadapt samtools bedtools deeptools
```

## 0.2 Installation of R packages
Install all the R packages required in this analysis in R console if not installed.
```{r, eval=FALSE}
bioc_packages <- c("DEXSeq", "Rsubread", "EnhancedVolcano", "edgeR", "limma", "GenomicRanges") 
packages <- c("tidyverse", "BiocManager", "openxlsx") 
#Install if not already installed 
installed_packages <- packages %in% rownames(installed.packages()) 
installed_bioc_packages <- bioc_packages %in% rownames(installed.packages()) 
if (any(installed_packages == FALSE)) { 
  install.packages(packages[!installed_packages], dependencies = TRUE) 
  BiocManager::install(packages[!installed_bioc_packages], dependencies = TRUE) 
}
```

# 1. Data downloading and pre-processing 
## 1.1 Preparation of the project directory
Create the project directory `APA_analysis` and the raw data directory `raw_data`.
```{bash, eval=FALSE}
# Create and open the project directory
PROJ_DIR=APA_analysis/
mkdir $PROJ_DIR
cd $PROJ_DIR

# Create the raw data directory 
RAW_DATA=raw_data
mkdir -p $RAW_DATA
```

## 1.2 Raw data downloading and fastq files preparation
### 1.2.1 Raw data downloading
Given the Sequence Read Archive [SRA] Accession Ids (1553129 to 1553136) in sequence, download the raw data in parallel from SRA using the `prefetch` command from SRA toolkit (v2.10.8) and the `parallel` command from GNU parallel utility. Save the rsa files in the directory `raw_data`.
```{bash, eval=FALSE}
# Open the raw_data directory for raw data storage
cd $RAW_DATA

# Download SRA files using accession numbers
seq 1553129 1553136 | parallel prefetch SRR{}
```

### 1.2.2 fastq files preparation
Extract fastq.gz files from the archive using the `fastq-dump` command from the SRA toolkit. Ensure all the paths and names of the rsa files are listed correctly.
```{bash, eval=FALSE}
# Convert sra files to fastq files for alignment
parallel -j 3 fastq-dump --gzip --skip-technical --read-filter pass --dumpbase --split-e --clip --origfmt {} ::: SRR1553129/SRR1553129.sra SRR1553130/SRR1553130.sra SRR1553131/SRR1553131.sra SRR1553132/SRR1553132.sra SRR1553133/SRR1553133.sra SRR1553134/SRR1553134.sra SRR1553135/SRR1553135.sra SRR1553136/SRR1553136.sra
```

Unzip the compressed fastq.gz files in parallel using the command `gunzip` to obtain fastq files. 
```{bash, eval=FALSE}
# Unzip the fastq.gz data to gain fastq files
parallel “gunzip {}” :::$RAW_DATA/*.fastq.gz

# Return to the project directory for reference downloading
cd .. 
```

## 1.3 Reference genome and annotations downloading
Download the reference genome and annotations for Mouse (Genome assembly mm10) in the project directory using the command `wget`, then decompress the genome annotation file using the command `unzip`. Download the reference file for the mouse chromosome sizes from UCSC using the command `wget`. The reference genome file and its annotations (the bowtie index of mm10) here are saved in the directory `mm10`, and the the reference file for the mouse chromosome sizes is named as `mm10.chrom.sizes.txt`.
```{bash, eval=FALSE}
# Download the bowtie index of mm10
wget https://genome-idx.s3.amazonaws.com/bt/mm10.zip
unzip mm10.zip

# Download the mm10.chrom.sizes
wget https://hgdownload-test.gi.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes
```

## 1.4 Data pre-processing, read alignment, files split and conversion
### 1.4.1 Data pre-processing
#### 1.4.1.1 Illumina Adapter trimming
To remove the influence of adapter sequence on short reads (~35 bases) gained in polyA-seq, perform Illumina adapter trimming with the command `cutadapt`. Save the results in the newly created directory `adapter.trimming.results`.
```{bash, eval=FALSE}
TRIM_DIR=adapter.trimming.results
mkdir $TRIM_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
cutadapt -a "AGATCGGAAGAGC"  -m 15 -n 3 ${fq} | fastx_reverse_complement -i > $TRIM_DIR\/${OUTPUT}_trimmed.fastq
done
```

#### 1.4.1.2 Obtaining sense strand sequence
Ensure the read sequence used for genome alignment is consistent with the sense strand sequence. According to the description of library preparation steps in the original article, polyA-seq was performed on the antisense strand. Therefore, after adapter trimming, perform reverse complementary operation using the `fastx_reverse_complement` command from FASTX toolkit to obtain the sense strand sequence.
```{bash, eval=FALSE}
TRIMRC_DIR=trimmed.rc.data
mkdir $TRIMRC_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
fastx_reverse_complement -i $TRIM_DIR\/${OUTPUT}_trimmed.fastq -o $TRIMRC_DIR\/${OUTPUT}_trimmed_rc.fastq
done
```
NOTE: This step is specific to the PolyA-seq assay used. 

### 1.4.2 Read alignment
Map reads to mouse genome assembly using bowtie aligner and save the aligned SAM files in the directory `bamfiles`.
```{bash, eval=FALSE}
SAM_DIR=samfiles
mkdir $SAM_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
bowtie -n 2 -k 100 --best --strata -p 10  -x mm10/genome  --un unmapped_out $TRIMRC_DIR\/${OUTPUT}_trimmed_rc.fastq -S > $SAM_DIR\/${OUTPUT}.sam
done
```

### 1.4.3 File split, normalization and coversion
To prepare for genome coverage visualization, convert SAM files to BAM files, then sort and index BAM files for each sample using `samtools` commands before proceeding to further steps. Save the results in the directory `bamfiles` by typing the following commands and press Enter.
```{bash, eval=FALSE}
BAM_DIR=bamfiles
mkdir $BAM_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
samtools view -bS $SAM_DIR\/${OUTPUT}.sam -o $BAM_DIR\/${OUTPUT}.bam
samtools sort $BAM_DIR\/${OUTPUT}.bam -o $BAM_DIR\/${OUTPUT}.sorted.bam
samtools index $BAM_DIR\/${OUTPUT}.sorted.bam
done
```

To generate BigWig files for the read coverage visualization from BAM files, firstly apply `genomeCoverageBed` command from `bedtools` to generate bedgraph files, and use the parameter `split` and `strand` to split files based on their strand. Save the results in the directory `bedgraphfiles` by typing the following commands and press Enter.
```{bash, eval=FALSE}
BDG_DIR=bedgraphfiles
mkdir $BDG_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
# Generate genomeCoverageBed files
genomeCoverageBed -split -strand + -bga -ibam $BAM_DIR\/${OUTPUT}.sorted.bam | sort -k 1,1 -k2,2n > $BDG_DIR\/${OUTPUT}.bedgraph_fwd
genomeCoverageBed -split -strand - -bga -ibam $BAM_DIR\/${OUTPUT}.sorted.bam | sort -k 1,1 -k2,2n > $BDG_DIR\/${OUTPUT}.bedgraph_rev
done
```

If the counts are used for the read coverage visualization, the height of read peaks is influenced by library size. Therefore, normalization of read counts by sequencing depth/library size is a critical step in the read coverage visualization. Calculate the library size of each sample using the `samtools` command, and save the results as txt files in the directory `librarysizefiles`. Then use `perl -lane` command to calculate the counts per million (CPM) number normalized by the sample library size. Save the results in the directory `bedgraphfiles.norm` by typing the following commands and press Enter.
```{bash, eval=FALSE}
LBSIZE_DIR=librarysizefiles
mkdir $LBSIZE_DIR

BDGNORM_DIR=bedgraphfiles.norm
mkdir $BDGNORM_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
# Calculate the library size
samtools view -F 0x105 -c  $BAM_DIR\/${OUTPUT}.sorted.bam > $LBSIZE_DIR\/${OUTPUT}.temp

# Generate the normalized genomeCoverageBed files for the coverage visualisation to decide the flanking regions included in the annotation file
libsize=$(cat $LBSIZE_DIR\/${OUTPUT}.temp)
perl -lane '$xx=$F[3]*1e6/'$libsize'; print "$F[0]\t$F[1]\t$F[2]\t$xx"' $BDG_DIR\/${OUTPUT}.bedgraph_fwd > $BDGNORM_DIR\/${OUTPUT}.bedgraph_fwd_norm
perl -lane '$xx=$F[3]*1e6/'$libsize'; print "$F[0]\t$F[1]\t$F[2]\t$xx"' $BDG_DIR\/${OUTPUT}.bedgraph_rev > $BDGNORM_DIR\/${OUTPUT}.bedgraph_rev_norm
done
```

Generate BigWig files using the UCSC `wigToBigWig` command, and save the results in the directory `wigToBigWig`.
```{bash, eval=FALSE}
BWNORM_DIR=normalized.BigWigfiles
mkdir $BWNORM_DIR
for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
# Generate bigwig files for the check in the genome browser/IGV
wigToBigWig $BDGNORM_DIR\/${OUTPUT}.bedgraph_fwd_norm mm10.chrom.sizes.txt $BWNORM_DIR\/${OUTPUT}.fwd.bw
wigToBigWig $BDGNORM_DIR\/${OUTPUT}.bedgraph_rev_norm mm10.chrom.sizes.txt $BWNORM_DIR\/${OUTPUT}.rev.bw
done
```

# 2. Preparation of pA sites annotations
NOTE: The processing steps of the pA sites annotation file is preformed firstly in R console, then on Linux command-line for the read coverage visualzation at pA sites and the generation of final annotation file. 

## 2.1 pA sites annotations downloading 
Download the pA site annotation file "atlas.clusters.2.0.GRCm38.96.tsv" from the PolyASite 2.0 database at https://polyasite.unibas.ch. To find the link for the download, select 'Atlas' on the top bar of the main website, then click the 'Atlas BED file' under the subtitle 'Mus musculus: v2.0 (GRCm38.96)'. The tsv file is expected to start downloading automatically.

## 2.2 Loading required libraries
Type the following commands in R console and press Enter to load the required libraries in R.
```{r message=FALSE, warning=FALSE}
library(magrittr)
```
                                      
## 2.3 Selection of analysis-related information in the annotation file
Read the unprocessed annotation file in R as an data frame using the `read.delim` function. As only a fraction of the information included in the unprocessed annotation file is required for the APA analysis, perform the information selection on both column-level and row-level. On column-level, select the ten columns of chromosome name, genome coordinates (start and end positions on chromosome), unique pA site ID, average expression across all samples (tags per million, tpm), encoded strand, representative pA site (namely cleavage site), pA site annotation, gene symbol and Ensembl gene Id. On row-level, retain 3'UTR pA sites on the main 20 pairs of mouse chromosomes, which are annotated as Terminal Exon (TE) or 1000 nt downstream of an annotated terminal exon (DS) for analysis.
NOTE: Before running this step, ensure the path of the tsv file is correct as the input of the `read.delim` function.
```{r message=FALSE, warning=FALSE}
anno <- read.delim("/Users/evelynzheng/3PSeq_files/atlas.clusters.2.0.GRCm38.96.tsv", 
                   stringsAsFactors = FALSE, check.names = FALSE, header = TRUE) %>% 
  dplyr::select(chrom, chromStart, chromEnd, name, score, strand, rep, annotation, gene_name, gene_id) %$%
  .[(.$annotation %in% "TE") | (.$annotation %in% "DS"), ] %$%
  .[.$chrom %in% as.character(c(1:19, "X", "Y")), ]
# Select our intesrested pA sites annotated as Terminal Exon (TE), 1,000 nt DownStream of an annotated terminal exon (DS)
```

## 2.4 Modifying the annotation format for adaption
As the packages used here for APA analysis are designed for differential exon usage and splicing analysis, the format of the annotations needs to be modified in two aspects for the adaption to APA analysis: 
1) Change the format of unique feature IDs (namely unique pA site cluster IDs) from ‘group ID: feature ID: strand’ to group ID: feature ID’ (namely 'gene ID: pA site cluster ID'). For the convenience of downstream analysis and to avoid duplicates in the pA site cluster IDs, remove the last common in each original cluster ID and replace the strand information `+` and `-` with letters `F` (Forward) and `R`(Reverse) individually; 
2) Add 'chr' before each chromosome number;
In addition to the changes above, for the visualization of reads coverage and pA cleavage sites:
3) Replace the pA site coordinates by their corresponding cleavage sites.
```{r}
anno$name <- anno$strand %>% replace(. %in% "+", "F") %>% # Replace the strand information '+' with 'F'
  replace(. %in% "-", "R") %>% # Replace the strand information '-' with 'R'
  paste0(anno$gene_name, ":", anno$rep, .) # Generate new unique pA site cluster IDs

anno$chrom <- paste0("chr", anno$chrom) # change the format of chromosome column for downstream analysis

# Replace coordinates of pA sites with their corresponding cleavage sites
anno$chromStart <-  anno$rep -1
anno$chromEnd <-  anno$rep
```

## 2.5 Generation of annotation files for each strand
To prepare for the read coverage visualization at pA sites, split the processed annotation file into two sperate files according to the strand information, then save the two files separatly in bed format as 'pA_annotation.fwd.bed' and 'pA_annotation.rev.bed' in the current working directory.
```{r}
# Generate annotation files for each strand for genomeCoverage plotting
anno %T>% 
  {
    anno.fwd = .[.$strand %in% "+", ]
    write.table(anno.fwd, file = "pA_annotation.fwd.bed", row.names = FALSE, col.names = FALSE, sep = "\t", quote = FALSE)
    } %>% {
    anno.rev <- .[.$strand %in% "-", ]
    write.table(anno.rev, file = "pA_annotation.rev.bed", row.names = FALSE, col.names = FALSE, sep = "\t", quote = FALSE)
  }
```

## 2.6 Saving the annotation file in bed format
Save the processed annotation file in bed format for further modification after pA sites read coverage visualization.
```{r}
anno %>% write.table(file = "pA_annotation.bed", row.names = FALSE, col.names = FALSE, sep = "\t", quote = FALSE)
```

## 2.7 Read coverage visualization at pA sites
According to the 3p-seq protocol, the reads are expected to be mapped to the region around, rather than specifically at pA cleavage sites. Therefore, visualization to check the peak dispersion of mapped reads at pA sites to optimize the annotation file for this data is a critical step in the annotation file processing. 

Visualize the peaks of aligned reads at pA cleavage sites using `computeMatrix` and `plotHeatmap` from deepTools by typing the following commands on Linux command-line and press Enter. Remeber to scale the distance upstream (-b) and downstream (-a) of the cleavage sites if the peak cannot be visualized completely. Save matrix results in the created directory `compute.matrix.results`, and the plots in pdf in the created directory `plotHeatmap.results`.
```{bash eval=FALSE}
CM_DIR=compute.matrix.results
mkdir $CM_DIR

HM_DIR=plotHeatmap.results
mkdir $HM_DIR

for fq in $RAW_DATA/*.fastq
do
OUTPUT=$(basename ${fq}| sed 's/_pass.fastq//g');
# Visualizing the read coverage at pA sites
computeMatrix reference-point -S $BWNORM_DIR\/${OUTPUT}.fwd.bw -R pA_annotation.fwd.bed -a 100 -b 100 -o $CM_DIR\/${OUTPUT}.fwd.ud100.gz --referencePoint center --sortRegions no
plotHeatmap -m $CM_DIR\/${OUTPUT}.fwd.ud100.gz -o $HM_DIR\/${OUTPUT}.fwd.ud100.pdf

computeMatrix reference-point -S $BWNORM_DIR\/${OUTPUT}.rev.bw -R pA_annotation.rev.bed -a 100 -b 100 -o $CM_DIR\/${OUTPUT}.rev.ud100.gz --referencePoint center --sortRegions no
plotHeatmap -m $CM_DIR\/${OUTPUT}.rev.ud100.gz -o $HM_DIR\/${OUTPUT}.rev.ud100.pdf
done
```

## 2.8 Modifying pA site coordinates to generate final annotation file
NOTE: Through the visualization of read coverage at pA sites, it was found that the peaks of the mapped reads were mainly dispersed within ~60 bp upstream of the cleavage sites (Refer to the supplementary figure “merge.heatmap.png”). Therefore, coordinates of pA sites from the annotation file are supposed to be extended to 60 bp upstream of their cleavage sites in the analysis of the PolyA-seq data. Remember that depending on the specific 3’ end sequencing protocol used, this step will need to be optimized for assays other than PolyA-seq.

Extend the coordinates of pA sites to 60 bp upstream of their cleavage sites in the annotation file by typing the following command on the Linux command-line. Download the chromosome size annotation file from UCSC using the command `wget`. Save the processed annotation file in bed format for downstream analysis as "flanking60added.pA_annotation.bed".
```{bash, eval=FALSE}
# Download the chromosome size annotation file 'mm10.chrom.sizes.txt'
wget https://hgdownload-test.gi.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes

# Change the added value based on the genomeCov results
bedtools slop -i pA_annotation.bed -g mm10.chrom.sizes.txt -l 60 -r 0 -s > flanking60added.pA_annotation.bed
```
NOTE: All the steps in the following sections are perfomed in R console. 

# 3 Counting Reads
## 3.1 Loading required libraries
Type the following commands in R console and press Enter to load the required libraries in R.
```{r message=FALSE, warning=FALSE}
c("Rsubread","tidyverse") %>% lapply(library, character.only = TRUE) %>% invisible
```

## 3.2 Preparation of the annotation file in R
Load the pA site annotation information from the bed file using the `read.table` function, then select eight columns used for the differential APA analysis, including unique pA site cluster ID, chromosome name, start and end positions of the pA site cluster, strand on which the cluster is encoded, the overlapping gene ensembles and their corresponding gene symbols, and the representative pA site of the cluster. 
NOTE: Before running this step, ensure the path of the bed file is correct as the input of the `read.table` function.
```{r}
anno <- read.table(file = "/Users/evelynzheng/3PSeq_files/flanking60added.pA_annotation.bed", stringsAsFactors = FALSE, check.names = FALSE, header = FALSE, sep = "")
colnames(anno) <- c("Chr", "Start", "End", "GeneID", "Score", "Strand", "repID", "Anno", "Symbol", "Ensembl")
anno <- dplyr::select(anno, GeneID, Chr, Start, End, Strand, Ensembl, Symbol, repID)
```

## 3.3 Applying `feactureCounts` to acquire the raw read counts
Read all bam files obtained as input for `featureCounts()` from `Rsubread` package to acquire the raw counts of reads mapped to each annotated pA site, the format of which is expected to be a matrix of counts associated with each feature with rows representing exons(features) and columns representing samples. Save the count table as the RData file "APA_countData.RData" for APA analysis using different tools.
NOTE: Before running this step, ensure the folder containing bam files is in the current working directory.
```{r, eval=FALSE}
# As the running of featureCounts requires around 25 G memory, during the generation of this Rmd file, this step was running on server and the result was saved as "countData.RData" and loaded in the next R session.
countData <- dir("bamfiles", pattern="sorted.bam$", full.names = TRUE) %>%  
  # Read all bam files as input for featureCounts
  featureCounts(annot.ext = anno, isGTFAnnotationFile = FALSE, minMQS = 0, useMetaFeatures = TRUE, 
                allowMultiOverlap = TRUE, largestOverlap = TRUE, strandSpecific = 1,
                countMultiMappingReads = TRUE, primaryOnly = TRUE, isPairedEnd = FALSE, nthreads = 12) %T>% 
save(file = "APA_countData.RData")
```
NOTE: Be conscious to change any of the parameters listed in the `featureCounts` function. Modify the `strandSpecific` parameter to ensure it is consistent with the sequencing direction of the 3’ end sequencing assay used (empirically, visualizing the data in a genome browser over genes on plus and minus strands will clarify this).

## 3.4 Applying non-specific filtering of countData
As filtering of the raw read counts can significantly improve the statistical robustness in differential pA site usage tests, apply read counts filtering on two levels:
1) On gene-level, remove genes with only one pA site, on which differential pA site usage cannot be defined;
2) On gene-level, remove genes annotated as NA, which will hinder the downstream processing and analysis;
3) On pA site-level, apply non-specific filtering based on coverage: filter counts by cpm (counts per million) less than 1 in x out of n samples, where x is the minimum number of replicates in any condition. n = 8 and x = 2 for this example data.
NOTE: Before running this step, ensure the path of the RData file is correct as the input of the `load` function.
```{r}
load(file = "/Users/evelynzheng/3PSeq_files/APA_countData.RData") # Skip this step if it is already loaded 

# Non-specific filtering: Remove the pA sites not differentially expressed in the samples
countData <-  countData$counts %>% as.data.frame %>% .[rowSums(edgeR::cpm(.) > 1) >= 2, ]
anno %<>% .[.$GeneID %in% rownames(countData), ] 

# Remove genes with only 1 site and  NA in geneIDs
dnsites <- anno %>% group_by(Symbol) %>% summarise(nsites=n()) %>% filter(nsites > 1 & !is.na(Symbol))
anno %<>% filter(Symbol %in% dnsites$Symbol)
countData %<>% .[rownames(.) %in% anno$GeneID, ]
```


# 4 Differential polyadenylation sites usage analysis using DEXSeq and diffSplice pipelines
## 4.1 Using DEXSeq package 
NOTE: As a contrast matrix cannot be defined for the DEXSeq pipeline, the differential APA analysis of each two experimental conditions has to be developed separatly. Here the differential APA analysis of the condition wild type (WT) and Mbnl 1/2 double knockout (DKO) is developed as an example to explain the procedure. The analysis of the contrast pairs WT and KD, WT and Ctrl are attached at the end.

### 4.1.1 Loading required libraries
Type the following commands in R console and press Enter to load the required libraries in R.
```{r message=FALSE, warning=FALSE}
c("DEXSeq", "GenomicRanges") %>% lapply(library, character.only = TRUE) %>% invisible
```

### 4.1.2 Creating the sample table to define the experimental design
Create the `sampleTable`, which consists of details of each sample with its library-type and condition. Ensure the order of row names is consistent with the order of bam file names used by `featureCounts` to count the reads. 
```{r message=FALSE, warning=FALSE}
sampleTable1 <- data.frame(row.names = c("WT_1","WT_2","DKO_1","DKO_2"), 
                           condition = c(rep("WT", 2), rep("DKO", 2)), 
                           libType = rep("single-end", 4))
```

### 4.1.3 Preparing the pA sites information file using Bioconductor GRanges
pA site information in the form of GRanges objects (https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) is required as an input to create DEXSeq object. Extract the annotation from the processed annotation object to create PASinfo object for the preparation of DEXSeq object construction.
```{r}
# Prepare the GRanges object for DEXSeqDataSet object construction
PASinfo <- GRanges(seqnames = anno$Chr,
                   ranges = IRanges(start = anno$Start, end = anno$End),
                   strand = Rle(anno$Strand))
mcols(PASinfo)$PASID <- anno$repID
mcols(PASinfo)$GeneEns <- anno$Ensembl
mcols(PASinfo)$GeneID <- anno$Symbol

# Prepare the new feature IDs, replace the strand information with letters to match the current pA site clusterID
new.featureID <- anno$Strand %>% as.character %>% replace(. %in% "+", "F") %>% replace(. %in% "-", "R") %>% paste0(as.character(anno$repID), .)
```

### 4.1.4 Creating the DEXSeq object
Create the DEXSeq object using the function `DEXSeqDataSet` to collect: 1) the filtered read counts generated in step 3; 2) sample information from the `sampleTable`; 3) design matrix which is generated based on the sampleData and a model formula notation for the differential APA testing; 4) feature information including pA site IDs and their corresponding gene symbols.
```{r}
countData1 <- dplyr::select(countData, SRR1553129.sorted.bam, SRR1553130.sorted.bam, 
                            SRR1553131.sorted.bam, SRR1553132.sorted.bam) # Select the read counts of the condtion WT and DKO
colnames(countData1) <- rownames(sampleTable1) # Rename the columns of countData using sample names in sampleTable
dxd1 <- DEXSeqDataSet(countData = countData1,
                     sampleData = sampleTable1, 
                     design = ~ sample + exon + condition:exon,
                     featureID = new.featureID,
                     groupID = anno$Symbol,
                     featureRanges = PASinfo)
```
NOTE: This step might take some time. Once the object is created, explore the columns for initial understanding of data.

```{r}
# Explore/Inspect the DEXSeq object
head(counts(dxd1), 5)
colData(dxd1)
split(seq_len(ncol(dxd1)), colData(dxd1)$exon)
#head(featureCounts(dxd1), 5)
head(rowRanges(dxd1), 3)
sampleAnnotation(dxd1)
```

### 4.1.5 Defining the contrast pair in the countData
Define the contrast pair in the following analysis through defining the levels of conditions in DEXSeq object. If not defined, the conditions saved in DEXSeq object will be transferred to alphabetical order-based factors, and the contrast pairs will be "'the condition with higher level' - 'the condition with lower level'". 
```{r}
dxd1$condition <- factor(dxd1$condition, levels = c("WT", "DKO")) # The contrast pair will be "DKO - WT"
```

### 4.1.6 Normalization and dispersion Estimation
Similar to RNA-seq data, for 3’ end sequencing data perform normalization between samples (column-wise median of ratios for each sample) using the `estimateSizeFactors` function, and estimate the variation of the data using with the `estimateDispersions` function, then visualize the dispersion estimation result using the `plotDispEsts` function.
```{r, fig.align="center"}
dxd1 %<>% estimateSizeFactors %>% estimateDispersions %T>% plotDispEsts # Visualization of the dispersion estimation result
```

### 4.1.7 Differential pA site usage testing
Test for differential pA site usage for each gene and generate the results using the function `testForDEU`, then estimate the fold change of pA site usage using the function `estimateExonFoldChanges`. Check the results using the function `DEXSeqResults` and set ‘FDR < 10%’ as the criterion for significantly differential pA sites.
```{r}
dxd1 %<>% testForDEU %>% estimateExonFoldChanges(fitExpToVar = "condition") #Estimate fold changes
dxr1 <- DEXSeqResults(dxd1)
dxr1
#dxr  <-  na.omit(dxr)
mcols(dxr1)$description
## ----tallyPASs------------------------------------------------------------
table(dxr1$padj < 0.1) # Check the number of differential pA sites (FDR < 0.1)
## ----tallyGenes------------------------------------------------------------
table(tapply(dxr1$padj < 0.1, dxr1$groupID, any)) # Check the number of gene overlapped with differential pA site clusters (FDR < 0.1)
```

### 4.1.8 Visualization of the differential pA site usage results
```{r}
# Select the top 100 significant differntial pA site usage
topdiff.PAS <- dxr1 %>% as.data.frame %>% rownames_to_column %>% arrange(padj) %$% groupID[1:100]
head(topdiff.PAS)
```

Use the `plotDEXSeq` function for visualization of differential pA usage result. Use different gene names to explore the results:
```{r, fig.align="center"}
plotDEXSeq(dxr1, "S100a7a", legend = TRUE, expression = FALSE, splicing = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
```

To display differential ployA usage use `splicing = TRUE`:
```{r, fig.align="center"}
plotDEXSeq(dxr1, "Fosl2", legend = TRUE, expression = FALSE, splicing = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
```

To display both experssion and differential ployA usage use `splicing = TRUE` and `expression = TRUE`:
```{r, fig.align="center"}
plotDEXSeq(dxr1, "Papola", legend = TRUE, expression = FALSE, splicing = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
```

Generate plots for all the top 300 significant genes using `perGeneQValue` function:
```{r}
dxr1 %<>% .[!is.na(.$padj), ]	
dgene <- data.frame(perGeneQValue=perGeneQValue(dxr1)) %>% rownames_to_column("groupID")

dePAS1 <- dxr1 %>% data.frame() %>% 
  dplyr::select(-matches("dispersion|stat|countData|genomicData")) %>% 
  inner_join(dgene) %>% arrange(perGeneQValue) %>% distinct()

# Save the significant differential pA sites (order by perGeneQValue)
"openxlsx" %>% lapply(library, character.only = TRUE) %>% invisible
dePAS_sig1 <- dePAS1 %>% filter(padj < 0.1) %T>%
  write.xlsx(file = "3PSeq_DEXSeq_significant_genes.xlsx", sheetName = "DKO vs WT", firstRow = TRUE, headerStyle = createStyle(textDecoration = "BOLD", halign = "center"))
```
Check the information of the top 300 differntial pA sites.
```{r message=TRUE, paged.print=TRUE}
DT::datatable(dePAS_sig1[1:300, ])
```
Generate plots of the top 300 differntial pA sites.
```{r}
pdf(file= "3PSeq_DEXSeq_significant_genes.DKO_WT.top300.pdf")
sig_genes <- dePAS_sig1 %>% filter(perGeneQValue <= 0.01) %$% groupID %>% unique %>% .[1:300]

for (geneid in sig_genes) {
  nn <- dxr1 %>% .[.$groupID %in% geneid, ] %>% nrow
  if (nn > 1) {
    plotDEXSeq(dxr1, geneid, legend = TRUE, expression = FALSE, splicing = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
  }
}
dev.off()
```

Use the `EnhancedVolcano` package to visualise differential pA site usage of the contrast pair.
```{r message=FALSE, warning=FALSE}
"EnhancedVolcano" %>% lapply(library, character.only = TRUE) %>% invisible
```

```{r fig.align="center"}
EnhancedVolcano(dePAS_sig1, lab = dePAS_sig1$groupID, x = 'log2fold_DKO_WT', y = 'pvalue', title = 'Volcano Plot', 
                subtitle = 'DKO vs WT', FCcutoff = 1, labSize = 4, legendPosition = "right", 
                caption = bquote(~Log[2]~ "Fold change cutoff, 1; FDR 10%"))
```
NOTE: EnhancedVolcano package was used to generate the above plot, you will need to install it if not already installed.

### 4.1.9 Developing differential pA site usage testing of the contrast pair "KD - WT"
```{r}
countData2 <- dplyr::select(countData, SRR1553129.sorted.bam, SRR1553130.sorted.bam, 
                            SRR1553133.sorted.bam, SRR1553134.sorted.bam) # Select the read counts of the condtion WT and KD 
sampleTable2 <- data.frame(row.names = c("WT_1", "WT_2", "KD_1", "KD_2"), 
                          condition = c(rep("WT", 2), rep("KD", 2)), 
                          libType = rep("single-end", 4))
dxd2 <- DEXSeqDataSet(countData = countData2,
                     sampleData = sampleTable2, 
                     design = ~ sample + exon + condition:exon,
                     featureID = new.featureID,
                     groupID = anno$Symbol,
                     featureRanges = PASinfo)

dxd2$condition <- factor(dxd2$condition, levels = c("WT", "KD"))
dxd2 %<>% estimateSizeFactors %>% estimateDispersions %>% testForDEU %>% estimateExonFoldChanges(fitExpToVar = "condition")
dxr2 <- DEXSeqResults(dxd2)

mcols(dxr2)$description
## ----tallyPASs------------------------------------------------------------
table(dxr2$padj < 0.1) # Check the number of differential pA sites (FDR < 0.1)
## ----tallyGenes------------------------------------------------------------
table(tapply(dxr2$padj < 0.1, dxr2$groupID, any)) # Check the number of gene overlapped with differential pA site clusters (FDR < 0.1)
```

```{r, fig.align="center"}
# Selection and visualization of the results
topdiff.PAS2 <- dxr2 %>% as.data.frame %>% rownames_to_column %>% arrange(padj) %$% groupID[1:100]
head(topdiff.PAS2)         

plotDEXSeq(dxr2, "Tpm1", legend = TRUE, expression = FALSE, splicing = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
```

```{r}
dxr2 %<>% .[!is.na(.$padj), ]
dgene2 <- data.frame(perGeneQValue=perGeneQValue(dxr2)) %>% rownames_to_column("groupID")

dePAS2 <- dxr2 %>% data.frame() %>% 
  dplyr::select(-matches("dispersion|stat|countData|genomicData")) %>% 
  inner_join(dgene2) %>% arrange(perGeneQValue) %>% distinct()

# Save the significant differential pA sites (order by perGeneQValue)
add.xlsx.sheet <- function(fileName, sheetName, data){
  wb <- loadWorkbook(fileName)
  addWorksheet(wb, sheetName = sheetName)
  freezePane(wb, sheet = sheetName, firstRow = TRUE)
  writeData(wb, sheet = sheetName, data, headerStyle = createStyle(textDecoration = "BOLD", halign = "center"))
  saveWorkbook(wb, file = fileName, overwrite = TRUE)
}
dePAS_sig2 <- dePAS2 %>% filter(padj < 0.1) %T>%
  add.xlsx.sheet(file = "3PSeq_DEXSeq_significant_genes.xlsx", "KD vs WT", .)
```
```{r message=TRUE, paged.print=TRUE}
DT::datatable(dePAS_sig2[1:300, ])
```

```{r, fig.align="center"}
EnhancedVolcano(dePAS_sig2, lab = dePAS_sig2$groupID, x = 'log2fold_KD_WT', y = 'pvalue', title = 'Volcano Plot', 
                subtitle = 'KD vs WT', FCcutoff = 1, labSize = 4, legendPosition = "right", 
                caption = bquote(~Log[2]~ "Fold change cutoff, 1; FDR 10%"))
```

### 4.1.10 Developing differential pA site usage testing of the contrast pair "Ctrl - WT"
```{r}
countData3 <- dplyr::select(countData, SRR1553129.sorted.bam, SRR1553130.sorted.bam, 
                            SRR1553135.sorted.bam, SRR1553136.sorted.bam) # Select the read counts of the condtion WT and KD 
sampleTable3 <- data.frame(row.names = c("WT_1", "WT_2", "Ctrl_1", "Ctrl_2"), 
                          condition = c(rep("WT", 2), rep("Ctrl", 2)), 
                          libType = rep("single-end", 4))
dxd3 <- DEXSeqDataSet(countData = countData3,
                     sampleData = sampleTable3, 
                     design = ~ sample + exon + condition:exon,
                     featureID = new.featureID,
                     groupID = anno$Symbol,
                     featureRanges = PASinfo)

dxd3$condition <- factor(dxd3$condition, levels = c("WT", "Ctrl"))
dxd3 %<>% estimateSizeFactors %>% estimateDispersions %>% testForDEU %>% estimateExonFoldChanges(fitExpToVar = "condition")
dxr3 <- DEXSeqResults(dxd3)

mcols(dxr3)$description
## ----tallyPASs------------------------------------------------------------
table(dxr3$padj < 0.1) # Check the number of differential pA sites (FDR < 0.1)
## ----tallyGenes------------------------------------------------------------
table(tapply(dxr3$padj < 0.1, dxr3$groupID, any)) # Check the number of gene overlapped with differential pA site clusters (FDR < 0.1)
```

```{r, fig.align="center"}
# Selection and visualization of the results
topdiff.PAS3 <- dxr3 %>% as.data.frame %>% rownames_to_column %>% arrange(padj) %$% groupID[1:100]
head(topdiff.PAS3)         

plotDEXSeq(dxr3, "Tpm1", legend = TRUE, expression = FALSE, splicing = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
```

```{r}
dxr3 %<>% .[!is.na(.$padj), ]
dgene3 <- data.frame(perGeneQValue=perGeneQValue(dxr3)) %>% rownames_to_column("groupID")

dePAS3 <- dxr3 %>% data.frame() %>% 
  dplyr::select(-matches("dispersion|stat|countData|genomicData")) %>% 
  inner_join(dgene3) %>% arrange(perGeneQValue) %>% distinct()

# Save the significant differential pA sites (order by perGeneQValue)
dePAS_sig3 <- dePAS3 %>% filter(padj < 0.1) %T>%
  add.xlsx.sheet(file = "3PSeq_DEXSeq_significant_genes.xlsx", "Ctrl vs WT", .)
```

```{r message=TRUE, paged.print=TRUE}
DT::datatable(dePAS_sig3[1:300, ])
```

```{r, fig.align="center"}
EnhancedVolcano(dePAS_sig3, lab = dePAS_sig3$groupID, x = 'log2fold_Ctrl_WT', y = 'pvalue', title = 'Volcano Plot', 
                subtitle = 'Ctrl vs WT', FCcutoff = 1, labSize = 4, legendPosition = "right", 
                caption = bquote(~Log[2]~ "Fold change cutoff, 1; FDR 10%"))
```

## 4.2 Using diffSplice package
### 4.2.1 Loading required libraries
Type the following commands in R console and press Enter to load the required libraries in R.
```{r message=FALSE, warning=FALSE}
c("limma", "edgeR") %>% lapply(library, character.only = TRUE) %>% invisible
```

### 4.2.2 Creating the sample table to define the experimental design
Create the `sampleTable`, which consists of details of each sample with its library-type and condition. Ensure the order of row names is consistent with the order of bam file names used by `featureCounts` to count the reads. Remember that as the contrast pairs can be defined directly in the diffSplice function, the `sampleTable` created in this step includes all the read counts of all conditions.
```{r message=FALSE, warning=FALSE}
sampleTable <- data.frame(row.names = c("WT_1","WT_2","DKO_1","DKO_2", "KD_1", "KD_2", "Ctrl_1", "Ctrl_2"), 
                          condition = c(rep("WT", 2), rep("DKO", 2), rep("KD", 2), rep("Ctrl", 2)), 
                          libType = rep("single-end", 4))
```

### 4.2.3 Creating the DGEList object and processing
Create the DGEList using the function `DGEList` to collect the filtered read counts generated in step 3. Remember that as the contrast pairs can be defined directly in the next step, the DGEList object created in this step include all the read counts of all conditions. Then develop processing steps of the DGEList object includes: 
1) Normalization of the read counts across samples using Trimmed Meand of M values (TMM normalization) by the function `calcNormFactors`; 
2) Differential expression testing across all conditions using the function `voomWithQualityWeights`;
3) Linear modelling based on the count data using the function `limFit`;
4) Computing moderated statistical results using the function `eBayes` given the linear model fit from the `limFit`.
```{r}
fit <- {
  dge <- DGEList(counts = countData) %>% calcNormFactors
  Treat <- factor(sampleTable$condition, levels = c("WT", "DKO", "KD", "Ctrl"))
  design <- model.matrix(~0 + Treat)
  colnames(design) <- levels(Treat)
  voomWithQualityWeights(dge, design, plot = FALSE)
} %>% lmFit(design) %>% eBayes
```

### 4.2.4 Defining contrasts and processing
Define the contrasts of interest for differential pA usage analysis using `makeContrasts` function. Then process the DGEList object `fit` generated in the previous step using the function `contrasts.fit` and `eBayes` to acquire the differential expression results of the contrast pairs.
```{r}
contrast.matrix <- makeContrasts(DKO_vs_WT = DKO-WT, KD_vs_WT = KD-WT, Ctrl_vs_WT = Ctrl-WT, levels = design)
fit2 <- fit %>% contrasts.fit(contrast.matrix) %>% eBayes
summary(decideTests(fit2))
```

### 4.2.5 Differential pA site usage analysis
Develop the differential pA site usage analysis for each defined contrasct pair on the fitted model using the function `diffSplice`.
```{r}
ex <- diffSplice(fit2, geneid = anno$Symbol, exonid = new.featureID)

#Check the top significant results with topSplice
topSplice(ex)
```
NOTE: This step uses the PAS information object created in 4.1.3. Ensure it is loaded in the current working environment.

### 4.2.6 Visualizing the result of the contrast pair "DKO - WT"
Check the information of the top 300 differntial pA sites.
```{r message=TRUE, paged.print=TRUE}
sig1 <-topSplice(ex, n = Inf, FDR = 0.1, coef = 1, test= "t", sort.by = "logFC") # Save the results sorted by logFC with FDR < 0.1
DT::datatable(sig1[1:300, ])
```

Check the information of 300 genes with the top differntial APA.
```{r message=TRUE, paged.print=TRUE}
sig1.genes <-topSplice(ex, n = Inf, FDR = 0.1, coef = 1, test= "simes") # Save the results with FDR < 0.1
DT::datatable(sig1.genes[1:300, ])
```

Visualize the differential APA result using the `plotSplice` function and volcano plots.
```{r, fig.align="center"}
pdf(file = "3PSeq_diffSplice_plotSplice_DKO_WT.pdf")
plotSplice(ex, coef = 1, geneid = "S100a7a", FDR = 0.1)
plotSplice(ex, coef = 1, geneid = "Tpm1", FDR = 0.1)
plotSplice(ex, coef = 1, geneid = "Smc6", FDR = 0.1)
dev.off()

pdf(file = "3PSeq_diffSplice_Volcano_DKO_WT.pdf", height = 10, width = 12)
EnhancedVolcano(sig1, lab = sig1$GeneID, xlab = bquote(~Log[2]~ 'fold change'), 
                x = 'logFC', y = 'P.Value', title = 'Volcano Plot', subtitle = 'DKO vs WT', 
                # pCutoff = 10e-16, 
                FCcutoff = 1, labSize = 8, legendPosition = "right")
dev.off()

# Save the list of top differential pA sites at FDR 10%
write.xlsx(sig1, file = "3PSeq_diffSplice_significant_genes.xlsx", sheetName = "DKO vs WT (pA site-level)", 
           firstRow = TRUE, headerStyle = createStyle(textDecoration = "BOLD", halign = "center"))

# Save the list of genes with top differential APA at FDR 10%
add.xlsx.sheet(file = "3PSeq_diffSplice_significant_genes.xlsx", "DKO vs WT (gene-level)", sig1.genes)

```

### 4.2.7 Visualizing the result of the contrast pair "KD - WT"
Check the information of the top 300 differntial pA sites.
```{r message=TRUE, paged.print=TRUE}
sig2 <-topSplice(ex, n = Inf, FDR = 0.1, coef = 2, test= "t", sort.by = "logFC")
DT::datatable(sig2[1:300, ])
```

Check the information of 300 genes with the top differntial APA.
```{r message=TRUE, paged.print=TRUE}
sig2.genes <-topSplice(ex, n = Inf, FDR = 0.1, coef = 2, test= "simes") # Save the results with FDR < 0.1
DT::datatable(sig2.genes[1:300, ])
```

Visualize the differential APA result using the `plotSplice` function and volcano plots.
```{r, fig.align="center"}
pdf(file = "3PSeq_diffSplice_plotSplice_KD_WT.pdf")
plotSplice(ex, coef = 2, geneid = "Mbd2", FDR = 0.1)
plotSplice(ex, coef = 2, geneid = "Ssbp1", FDR = 0.1)
plotSplice(ex, coef = 2, geneid = "Pfn1", FDR = 0.1)
dev.off()

pdf(file = "3PSeq_diffSplice_Volcano_KD_WT.pdf", height = 10, width = 12)
EnhancedVolcano(sig2, lab = sig2$GeneID, xlab = bquote(~Log[2]~ 'fold change'), 
                x = 'logFC', y = 'P.Value', title = 'Volcano Plot', subtitle = 'KD vs WT', 
                # pCutoff = 10e-16, 
                FCcutoff = 1, labSize = 8, legendPosition = "right")
dev.off()

#Save the list of top differential pA sites at FDR 10%
add.xlsx.sheet(file = "3PSeq_diffSplice_significant_genes.xlsx", "KD vs WT (pA site-level)", sig2)

# Save the list of genes with top differential APA at FDR 10%
add.xlsx.sheet(file = "3PSeq_diffSplice_significant_genes.xlsx", "KD vs WT (gene-level)", sig2.genes)
```

### 4.2.8 Visualizing the result of the contrast pair "Ctrl - WT"
Check the information of the top 300 differntial pA sites.
```{r message=TRUE, paged.print=TRUE}
sig3 <-topSplice(ex, n = Inf, FDR = 0.1, coef = 3, test= "t", sort.by = "logFC")
DT::datatable(sig3[1:300, ])
```

Check the information of 300 genes with the top differntial APA.
```{r message=TRUE, paged.print=TRUE}
sig3.genes <-topSplice(ex, n = Inf, FDR = 0.1, coef = 3, test= "simes") # Save the results with FDR < 0.1
DT::datatable(sig3.genes[1:300, ])
```

Visualize the differential APA result using the `plotSplice` function and volcano plots.
```{r, fig.align="center"}
pdf(file = "3PSeq_diffSplice_plotSplice_Ctrl_WT.pdf")
plotSplice(ex, coef = 3, geneid = "Mbd2", FDR = 0.1)
plotSplice(ex, coef = 3, geneid = "Ift81", FDR = 0.1)
plotSplice(ex, coef = 3, geneid = "Rps24", FDR = 0.1)
dev.off()

pdf(file = "3PSeq_diffSplice_Volcano_Ctrl_WT.pdf", height = 10, width = 12)
EnhancedVolcano(sig3, lab = sig3$GeneID, xlab = bquote(~Log[2]~ 'fold change'), 
                x = 'logFC', y = 'P.Value', title = 'Volcano Plot', subtitle = 'Ctrl vs WT', 
                # pCutoff = 10e-16, 
                FCcutoff = 1, labSize = 8, legendPosition = "right")
dev.off()

#Save the list of top differential pA sites at FDR 10%
add.xlsx.sheet(file = "3PSeq_diffSplice_significant_genes.xlsx", "Ctrl vs WT (pA site-level)", sig3)

# Save the list of genes with top differential APA at FDR 10%
add.xlsx.sheet(file = "3PSeq_diffSplice_significant_genes.xlsx", "Ctrl vs WT (gene-level)", sig3.genes)
```

## Print Session Info-
```{r}
  print(sessionInfo())
```