RNA-seq analysis with Bioconductor: Key Points

Introduction to RNA-seq

Where are we heading in this workshop?

RNA-seq is a technique for measuring the amount of RNA expressed within a cell or tissue in a given state at a given time.
Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.

Raw reads are typically provided in FASTQ format, which contains the nucleotide sequences and quality scores of the reads.
Reference genome and annotation files are essential for mapping and quantification of RNA-seq data.
Organizing data files in a structured manner facilitates efficient data handling and analysis.
Tools such as wget and fasterq-dump can be used to download data files from public repositories.
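To make the FASTQ layout described above concrete, here is a minimal Python sketch of a reader. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; both are assumptions of this example, as are the function name `parse_fastq` and the two toy reads. It is an illustration, not a replacement for a dedicated parser.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so the score is the ASCII code minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Two made-up reads, purely for illustration.
example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(parse_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

Each record carries one quality score per base, which is what quality-control tools summarize before mapping.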

Mapping Raw Reads: The process of mapping involves aligning short reads from sequencing data to a reference genome to determine their locations. STAR, a splice-aware RNA-seq aligner, is commonly used due to its speed and accuracy in handling exon-exon junctions.
Impact of Read Quality: Poor read quality, such as over-represented sequences or low-quality bases, can lead to unmapped or multi-mapped reads. It is important to check the quality of reads before mapping and review alignment statistics after mapping to ensure data quality.
Reference Genome Indexing: Before mapping, the reference genome needs to be indexed using tools like STAR. Indexing creates a data structure to facilitate rapid alignment, and additional files like GTF/GFF annotations can improve alignment accuracy.
Optimizing Mapping: Fine-tuning mapping parameters (e.g., handling multi-mapped reads, controlling sensitivity) based on the dataset is essential. The STAR aligner offers numerous options that allow adjustment for specific genome types and read qualities.
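The idea that indexing "creates a data structure to facilitate rapid alignment" can be illustrated with a toy suffix array in Python. This is only a sketch under simplifying assumptions: it finds exact matches only, ignores splicing, mismatches and strands, and is not how STAR is implemented internally. The reference string and function names are hypothetical.

```python
def build_suffix_array(reference):
    """Return the start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(reference)), key=lambda i: reference[i:])


def find_exact(reference, suffix_array, read):
    """Return all 0-based positions where `read` matches the reference exactly."""
    width = len(read)
    low, high = 0, len(suffix_array)
    # Binary search for the first suffix whose prefix is not smaller than the read.
    while low < high:
        mid = (low + high) // 2
        start = suffix_array[mid]
        if reference[start:start + width] < read:
            low = mid + 1
        else:
            high = mid
    # Collect the run of consecutive suffixes that start with the read.
    positions = []
    while low < len(suffix_array):
        start = suffix_array[low]
        if reference[start:start + width] != read:
            break
        positions.append(start)
        low += 1
    return sorted(positions)


reference = "ACGTACGTTAGC"  # toy "genome"
index = build_suffix_array(reference)  # built once, reused for every read

unique_hit = find_exact(reference, index, "TAGC")
multi_hit = find_exact(reference, index, "ACGT")
no_hit = find_exact(reference, index, "GGG")
print(unique_hit, multi_hit, no_hit)
```

The second lookup returns more than one position, which is the toy analogue of a multi-mapped read; the third returns none, the analogue of an unmapped read.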

Quantifying gene expression involves counting the number of reads mapped to genomic features like genes or exons, which provides data for downstream analyses, such as differential gene expression.
Tools like featureCounts, RSEM, HTSeq-count, Salmon, and kallisto are commonly used for quantifying expression from reads, each offering different advantages based on feature type and data structure.
Depending on the research question, reads can be counted at various feature levels, such as genes, exons, or transcripts, to capture overall gene expression or more granular details like splicing events.
The output of a read counting tool typically includes read counts for each feature across all samples, serving as input for further analysis, such as identifying differentially expressed genes.
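As a rough illustration of what "counting reads mapped to genomic features" means, the Python sketch below assigns each aligned read to a gene when it overlaps exactly one gene and skips ambiguous reads. It is a toy under stated assumptions (1-based closed intervals, no strand information, whole-gene features rather than exons, made-up coordinates) and does not reproduce the behaviour of any specific counting tool.

```python
def count_reads(genes, reads):
    """Count reads per gene.

    genes: dict mapping gene name -> (chromosome, start, end)
    reads: list of (chromosome, start, end) alignments
    Reads overlapping zero genes or more than one gene are not counted.
    """
    counts = {name: 0 for name in genes}
    for chrom, read_start, read_end in reads:
        hits = [
            name
            for name, (gene_chrom, gene_start, gene_end) in genes.items()
            if gene_chrom == chrom and read_start <= gene_end and gene_start <= read_end
        ]
        if len(hits) == 1:  # unambiguous assignment only
            counts[hits[0]] += 1
    return counts


# Hypothetical annotation and alignments.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),  # geneA
    ("chr1", 300, 350),  # geneA
    ("chr1", 470, 520),  # overlaps geneA and geneB: ambiguous, skipped
    ("chr1", 600, 650),  # geneB
    ("chr2", 120, 170),  # no annotated gene on this chromosome
]
counts = count_reads(genes, reads)
print(counts)
```

Running this for every sample and stacking the results gives the features-by-samples count table that downstream analyses start from.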

Proper organisation of the files required for your project in a working directory is crucial for maintaining order and ensuring easy access in the future.
An RStudio project serves as a valuable tool for managing your project’s working directory and facilitating analysis.
The download.file function in R can be used for downloading datasets from the internet.

Depending on the gene expression quantification tool used, there are different ways (often distributed in Bioconductor packages) to read the output into a SummarizedExperiment or DGEList object for further processing in R.
Stable gene identifiers such as Ensembl or Entrez IDs should preferably be used as the main identifiers throughout an RNA-seq analysis, with gene symbols added for easier interpretation.

Exploratory analysis is essential for quality control and to detect potential problems with a data set.
Different classes of exploratory analysis methods expect differently preprocessed data. The most commonly used methods expect counts to be normalized and log-transformed (or processed with a similar but more sophisticated approach, such as a variance-stabilizing transformation) so that the data are closer to homoskedastic. Other methods work directly on the raw counts.
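A minimal Python sketch of "normalize, then log-transform" is shown below. It uses counts per million with a pseudocount as a simple stand-in; this is an assumption of the example, not the transformation that any particular Bioconductor function applies, and the function name and counts are hypothetical.

```python
import math


def log_cpm(counts, pseudocount=1.0):
    """Scale one sample's counts to counts per million, then take log2."""
    total = sum(counts)
    return [math.log2(count / total * 1e6 + pseudocount) for count in counts]


# The same three genes in two samples; the second was sequenced twice as deeply.
shallow = [10, 90, 900]
deep = [20, 180, 1800]

shallow_log = log_cpm(shallow)
deep_log = log_cpm(deep)
print(shallow_log)
print(deep_log)
```

After scaling by library size the two samples give the same values, so a difference in sequencing depth no longer looks like a difference in expression, and the log scale compresses the large spread of highly expressed genes.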

With DESeq2, the main steps of a differential expression analysis (size factor estimation, dispersion estimation, calculation of test statistics) are wrapped in a single function: DESeq().
Independent filtering of lowly expressed genes is often beneficial.
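The size factor estimation step mentioned above can be sketched in Python using the median-of-ratios idea. This is a simplified illustration, assuming a small count matrix given as a list of gene rows; it skips genes with a zero count in any sample and is not a reimplementation of DESeq2. The function name and the counts are hypothetical.

```python
import math
from statistics import median


def size_factors(count_rows):
    """Median-of-ratios size factors.

    count_rows: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(count_rows[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in count_rows:
        if any(count <= 0 for count in row):
            continue  # the geometric mean would be zero, so the gene is skipped
        log_geometric_mean = sum(math.log(count) for count in row) / n_samples
        for sample, count in enumerate(row):
            log_ratios[sample].append(math.log(count) - log_geometric_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Sample 2 has exactly twice the counts of sample 1, except for one gene with a zero.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],
]
factors = size_factors(counts)
normalized = [[count / factor for count, factor in zip(row, factors)] for row in counts]
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so the first three genes end up with matching normalized values across the two samples.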

The formula framework in R allows creation of design matrices, which detail the variables expected to be associated with systematic differences in gene expression levels.
Comparisons of interest can be defined using contrasts, which are linear combinations of the model coefficients.
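To illustrate how a design matrix and a contrast fit together, the Python sketch below builds a treatment-coded design matrix for a single grouping variable (in the spirit of what an R formula such as `~ group` produces) and evaluates a contrast as a linear combination of coefficients. The column names, group labels and coefficient values are hypothetical and chosen only for the example.

```python
def design_matrix(groups, reference):
    """Treatment-coded design matrix: an intercept plus one indicator per non-reference level."""
    other_levels = sorted(set(groups) - {reference})
    columns = ["Intercept"] + [f"group{level}" for level in other_levels]
    rows = [[1] + [1 if group == level else 0 for level in other_levels] for group in groups]
    return columns, rows


def apply_contrast(coefficients, contrast):
    """Evaluate a contrast, i.e. a weighted sum of the model coefficients."""
    return sum(weight * beta for weight, beta in zip(contrast, coefficients))


groups = ["control", "control", "drugA", "drugA", "drugB", "drugB"]
columns, rows = design_matrix(groups, reference="control")
print(columns)
for row in rows:
    print(row)

# Hypothetical fitted coefficients for one gene (log2 scale), in column order.
coefficients = [5.0, 1.5, 0.5]
# drugB versus drugA: (Intercept + drugB) - (Intercept + drugA) = drugB - drugA.
drugB_vs_drugA = apply_contrast(coefficients, [0, -1, 1])
print(drugB_vs_drugA)
```

The intercept cancels in the comparison, which is why a contrast between two non-reference groups has a zero weight on the intercept and opposite weights on the two group coefficients.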

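Over-representation analysis (ORA) is based on gene counts, that is, the number of genes of interest that fall in each gene set, and uses Fisher’s exact test or the hypergeometric distribution to assess significance.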
In R, it is easy to obtain gene sets from a large number of sources.
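The hypergeometric calculation behind an over-representation test can be sketched in a few lines of Python. The function below computes the upper-tail probability of seeing at least the observed overlap; the function name and all numbers in the usage example are hypothetical, and a real analysis would also correct for testing many gene sets.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, of which n_in_set belong to the gene set."""
    largest_possible = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return tail / total


# Small case that can be checked by hand: (6*6 + 4*1) / 120 = 1/3.
small = ora_p_value(n_universe=10, n_in_set=4, n_selected=3, n_overlap=2)
print(small)

# Hypothetical larger case: 20000 genes tested, a 200-gene set,
# 500 differentially expressed genes, 15 of which are in the set.
p_value = ora_p_value(20000, 200, 500, 15)
print(p_value)
```

Only the four counts enter the calculation, which is why ORA needs a well-defined gene universe and a clear rule for selecting the genes of interest.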

RNA-seq data is very versatile and can be used for a number of different purposes. It is important, however, to carefully plan one’s analyses, to make sure that enough data is available and that abundances for appropriate features (e.g., genes, transcripts, or exons) are quantified.