Slides notes

This document contains notes about the course slides.

This is a somewhat internal document, so expect a fairly draft-ish and concise style.

01 Overview

slides 3 - 7

the central dogma of molecular biology is known to be an oversimplification
RNA is not only a messenger but may have other roles
most of these elements interact with and regulate one another
alternative splicing in eukaryotes adds a layer of possibilities to all this
main takeaway, maybe: measuring RNA is a proxy for protein levels, which are a proxy for protein activity, which is a proxy for the physiological state of the cell

slides 8 - 9

non-exhaustive list of sequencing possibilities : list on Lior Pachter’s blog

slides 10-12

RNAseq, the challenges
slide 10 : from human gff, includes ncRNAs
slide 11 :
- data source: GTEx (V8 gene TPMs)
- left: 1 sample -> expression ranges from 1 to $10^5$ TPM
- right: 50 random samples -> 10% of genes contribute 90% of the transcripts
- NB: a mammalian cell contains 10-30 pg of RNA, around 360 000 mRNA molecules (source)
slide 12 : important considerations as well

slide 13

Illumina : market leader, 50-600bp (generally 50-100), 0.1 (NextSeq) to 3 (HiSeq) billion reads
Ion Torrent : 600bp, 260M reads
PacBio : 10-30kb N50, 4M CCS reads
Nanopore : theory single molecule, practice: variable (N50 >100kb on ultra-long kit, up to 4.2Mb)

slides 14-22 : describe different technologies

Ion Torrent :

cell sequentially flooded with A T G C

PacBio SMRT

DNApol at bottom of Zero-Mode-Waveguide
fluorescent dye on dNTPs

Illumina seq :

formation of clusters with the same sequence
SBS : labelled nucleotides have reversible terminators, so only 1 base is incorporated at a time.
slide 23: paired end sequencing
slide 24: stranded sequencing
slide 25-26: RIN , RNA purification
slide 27-34: sequencing depth and replicates
slide 33-34: this pattern applies to low, mid and high expressors (see their supp doc)
slide 35-42: schematic analysis
slide 35 : before the sequencing
slide 36 : basic analysis
slide 37 : basic analysis with trimmed reads
slide 38 : main QC steps
slide 39 : QC steps provide feedback on previous steps
slide 40 : analysis for variant calling / isoform descriptions / …
slide 41 : when no reference genome: de novo assembly
slide 42 : the analysis which we’ll do during this course

02 Quality control

slide 04 : illumina doc

control number: 0 when none of the control bits are on, otherwise it is an even number. On HiSeq X and NextSeq systems, control specification is not performed and this number is always 0.
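For reference, a minimal sketch of where this field sits in a read header. It assumes the usual Casava 1.8+ layout `@instrument:run:flowcell:lane:tile:x:y read:is_filtered:control_number:index`; the class and function names are ours (hypothetical helpers), and headers carrying extra fields such as UMIs are not handled.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool    # True ("Y") when the read did NOT pass the filter
    control_number: int  # 0 when none of the control bits are on
    index: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Parse one FASTQ header line (hypothetical helper, assumed layout)."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The header has two space-separated parts: cluster location, then read info.
    location, description = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":", 3)
    return IlluminaHeader(instrument, run, flowcell, int(lane), int(tile),
                          int(x), int(y), int(read), filtered == "Y",
                          int(control), index)


header = parse_illumina_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(header.control_number, header.is_filtered)
```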

slide 16-… : interpretation of fastQC report

Some additional help for specific problems:
- https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/
- https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
- https://sequencing.qcfail.com/articles/read-through-adapters-can-appear-at-the-ends-of-sequencing-reads/
- https://sequencing.qcfail.com/articles/sudden-loss-of-base-call-quality/

03 trimming

slide 02-05 : enumerating reasons we may want to trim
slide 04 : adapter can be present if, for instance, insert size is shorter than sequenced length. cf. this post
slide 06-11 : different cases where we may trim
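
To illustrate the slide 04 case (insert shorter than the sequenced length), here is a toy sketch of adapter read-through and of a naive exact-match 3' trimmer. The function names are ours, the adapter value is only an example, and real trimmers additionally tolerate mismatches and use base qualities.

```python
ADAPTER = "AGATCGGAAGAGC"  # example value only; use the adapter of your own kit


def sequence_read(insert: str, adapter: str, read_length: int) -> str:
    """Toy model: the sequencer reads through the insert into the adapter."""
    return (insert + adapter)[:read_length]


def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a full adapter, or an adapter prefix sitting at the 3' end."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise look for the longest adapter prefix that ends the read.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


short_insert = "ACGTACGTAC"            # 10 bp insert, read length 15
read = sequence_read(short_insert, ADAPTER, 15)
print(read, "->", trim_adapter(read, ADAPTER))
```

With an insert longer than the read length, the read contains no adapter and is returned unchanged (barring chance matches of at least `min_overlap` bases).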

04 mapping

slides 4-8:

05 DE

slide 3-9 : challenges for RNAseq
slide 4: sequencing depth varies across libraries
- left : 100 samples from the GTEx V8 dataset
- right : samples from a random binomial with and without a library size factor applied
slide 5: most of the expression is taken by very few genes + a lot of genes have 0 reads (plots: data from 100 samples from the GTEx V8 dataset)
slide 6: small number of samples. 10k simulations of negative binomial draws
slide 7-9 : xkcd.com/882
slide 10-11 : input
slide 12-14 : very good blog post on RPKM and TPM
slide 15: filtering:
- image source ; max and mean refer to CPM thresholds ; CPM 1: genes with a CPM below 1 in more than half of the samples are filtered out
- DESeq2 filtering
- edgeR filtering: section 2.7 of edgeR doc
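
The units of slides 12-14 and the "CPM 1" rule of slide 15 can be sketched as follows. This is a minimal NumPy illustration under our own conventions (genes in rows, samples in columns, hypothetical function names); it is not the filtering code of DESeq2 or edgeR.

```python
import numpy as np


def cpm(counts):
    """Counts per million: scale each sample (column) by its library size."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0) * 1e6


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1e3
    rate = counts / lengths_kb            # reads per kilobase
    return rate / rate.sum(axis=0) * 1e6  # each column sums to 1e6


def keep_genes(counts, cpm_threshold=1.0, min_fraction=0.5):
    """'CPM 1' rule: drop genes with CPM < threshold in more than half the samples."""
    above = cpm(counts) >= cpm_threshold
    return above.mean(axis=1) >= min_fraction


counts = np.array([[1, 1],
                   [999_999, 1_000_000],
                   [1_000_000, 999_999]])
print(keep_genes(counts))                        # first gene has CPM 0.5
print(tpm(counts, [1000, 2000, 500]).sum(axis=0))
```

Note that TPM columns always sum to one million, which is what makes TPM values comparable as proportions within a sample.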

slide 16 : source

* TC: Total Count (CPM) ; UQ: Upper Quartile ; Med: median ; Q: quantile
* top left: coefficient of variation of housekeeping genes in H. sapiens data
* top right: average false-positive rate over 10 independent datasets simulated with varying proportions of differentially expressed genes (from 0% to 30%), for each normalization method.
* bottom:
    * distribution: the count distributions look similar between samples
    * intra-variance: intra-group variance
    * housekeeping: coefficient of variation of 30 housekeeping genes, which are presumed to be similarly expressed across conditions
    * clustering: similarity of DE genes with other methods
    * false positive rate: see above

slide 17: normalization

From this biostar post

edgeR: Trimmed Mean of M-values (TMM):

Based on the hypothesis that most genes are not DE.

The TMM factor is computed for each lane, with one lane being considered as a reference sample and the others as test samples. For each test sample, TMM is computed as the weighted mean of log ratios between this test and the reference, after exclusion of the most expressed genes and the genes with the largest log ratios.

According to the hypothesis of low DE, this TMM should be close to 1. If it is not, its value provides an estimate of the correction factor that must be applied to the library sizes (and not the raw counts) in order to fulfill the hypothesis. [source: https://www.ncbi.nlm.nih.gov/pubmed/22988256]
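
A simplified sketch of the TMM computation described above, for one test sample against one reference. Assumptions: default-like trimming fractions (30% on the log ratios M, 5% on the mean log expression A), quantile-based trimming, and inverse-variance weights; the function name is ours. edgeR's `calcNormFactors` does more (choice of the reference sample, rank-based trimming, rescaling of the factors across samples), so treat this as an illustration only.

```python
import numpy as np


def tmm_factor(test, ref, trim_m=0.30, trim_a=0.05):
    """Simplified TMM factor of `test` relative to `ref` (raw count vectors)."""
    test = np.asarray(test, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n_test, n_ref = test.sum(), ref.sum()     # library sizes
    keep = (test > 0) & (ref > 0)             # log ratios need non-zero counts
    t, r = test[keep], ref[keep]
    m = np.log2((t / n_test) / (r / n_ref))          # M: log ratio
    a = 0.5 * np.log2((t / n_test) * (r / n_ref))    # A: mean log expression
    # Inverse of the approximate variance of M, used as weight.
    w = 1.0 / ((n_test - t) / (n_test * t) + (n_ref - r) / (n_ref * r))
    # Trim the largest log ratios (both tails) and the extreme expression levels.
    lo_m, hi_m = np.quantile(m, [trim_m, 1 - trim_m])
    lo_a, hi_a = np.quantile(a, [trim_a, 1 - trim_a])
    inside = (m >= lo_m) & (m <= hi_m) & (a >= lo_a) & (a <= hi_a)
    return 2 ** (np.sum(w[inside] * m[inside]) / np.sum(w[inside]))


ref = np.arange(1, 101) * 10.0
test = ref.copy()
test[-5:] *= 10          # 5 highly expressed genes are strongly up in `test`
factor = tmm_factor(test, ref)
print(factor, test.sum() * factor)   # effective library size of `test`
```

In this toy example only 5 genes are DE, but they inflate the library size of the test sample; multiplying the library size by the factor brings the effective library size back to that of the reference.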

DESeq2

DESeq is also based on the hypothesis that most genes are not DE.

Its size factor for a lane is the median, over all genes, of the ratio of each gene's read count in that lane over its geometric mean across all lanes.

The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1.

Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis. [source: https://www.ncbi.nlm.nih.gov/pubmed/22988256]
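
The median-of-ratios idea can be sketched in a few lines of NumPy. Assumptions: genes in rows, lanes/samples in columns, and genes with a zero count in any sample are skipped (their geometric mean is zero); the function name is ours, not the DESeq2 API.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Size factor per sample: median over genes of count / geometric mean."""
    counts = np.asarray(counts, dtype=float)
    usable = (counts > 0).all(axis=1)          # geometric mean must be non-zero
    log_counts = np.log(counts[usable])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)   # per gene
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


counts = np.array([[10, 20],
                   [100, 200],
                   [50, 100],
                   [0, 5],       # skipped: zero count in sample 1
                   [30, 600]])   # a DE gene, ignored thanks to the median
size_factors = median_of_ratios_size_factors(counts)
print(size_factors)
print(counts / size_factors)     # normalized counts
```

Here the second sample is sequenced twice as deep as the first, and the ratio between the two size factors is 2 despite the DE gene in the last row.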

slide 19: NB model

* [image source](https://doi.org/10.1186/gb-2010-11-10-r106)
* orange line is the fit w(q)
* purple line shows the variance implied by the Poisson distribution
* dashed orange line is the variance estimate used by edgeR.
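
To make the comparison between the Poisson and NB variances concrete, here is a small sketch. It assumes the common mean/dispersion parameterization, variance = mu + dispersion * mu^2 (the form used by edgeR; DESeq fits a more flexible variance function), and the function names are ours.

```python
import numpy as np


def poisson_variance(mu):
    """Under a Poisson model the variance equals the mean."""
    return np.asarray(mu, dtype=float)


def nb_variance(mu, dispersion):
    """NB variance: Poisson (shot) noise plus a quadratic overdispersion term."""
    mu = np.asarray(mu, dtype=float)
    return mu + dispersion * mu ** 2


def simulate_nb(mu, dispersion, size, rng):
    """Draw NB counts with a given mean and dispersion using NumPy's (n, p) form."""
    n = 1.0 / dispersion
    p = n / (n + mu)
    return rng.negative_binomial(n, p, size=size)


rng = np.random.default_rng(0)
draws = simulate_nb(mu=100, dispersion=0.1, size=200_000, rng=rng)
print(draws.mean(), draws.var())                    # empirical mean and variance
print(poisson_variance(100), nb_variance(100, 0.1))  # theoretical values
```

At low means the two variances are close, while at high means the quadratic term dominates, which is why the Poisson line falls below the data for highly expressed genes.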

06 Enrichment

slide 08: geneontology
slide 09: reactome
slide 10: GSEA-msigdb
slide 11: KEGG
slides 15 to 21 : GSEA

GSEA is used a lot, but it comes in many flavours, with a large number of options, and it is not always easy to understand what is happening.

“The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not encountered. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov(KS)-like statistic (Subramanian et al. 2005).” (from clusterProfiler book)
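
The running-sum walk in the quote can be sketched as follows. Assumptions: the genes are already sorted by decreasing ranking metric, hits are incremented by |metric|^weight normalized over the set, and misses are decremented uniformly; the function name is ours.

```python
import numpy as np


def enrichment_score(ranked_scores, in_set, weight=1.0):
    """Weighted KS-like ES for a gene set along a ranked list.

    ranked_scores: ranking metric, sorted from top to bottom of the list L
    in_set: boolean mask, True where the gene belongs to the set S
    """
    scores = np.asarray(ranked_scores, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    hit = np.cumsum(np.where(in_set, np.abs(scores) ** weight, 0.0))
    miss = np.cumsum(np.where(in_set, 0.0, 1.0))
    # Running sum: goes up on genes in S, down on genes not in S.
    running = hit / hit[-1] - miss / miss[-1]
    # ES is the maximum deviation from zero, keeping its sign.
    return running[np.argmax(np.abs(running))], running


scores = [4, 3, 2, 1]
es_top, walk = enrichment_score(scores, [True, True, False, False])
es_bottom, _ = enrichment_score(scores, [False, False, True, True])
print(es_top, es_bottom)   # set at the top vs. at the bottom of the list
print(walk)
```

A set concentrated at the top of the list gets a positive ES and a set concentrated at the bottom gets a negative one; with `weight=0` the statistic reduces to the classical (unweighted) KS-like walk.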

Then the ES is normalized (NES) to account for the size of the gene set, and its significance (p-value) is assessed against a null distribution of ES.

In the base method (Subramanian et al. 2005), a number of permuted datasets are created for each gene set and an ES is computed on each of them; the ES of the original data is then compared to the distribution of permuted ES for that set. 2 flavours of permutation are described : “sample permutation” and “gene permutation”.
- The sample permutation is only possible if the expression data for each sample is given to the method. It is recommended to have at least 7 samples per condition for it to make sense.
- The gene permutation is performed from the ranking metric directly. Hence it is sometimes called “preranked-GSEA” and to my knowledge this is the most often used permutation scheme of the 2.
In fGSEA (Korotkevich et al. 2021) several tricks are used to make the p-value computation of “preranked-GSEA” faster and more accurate for low p-values. Consequently it has become the default GSEA method in some libraries such as clusterProfiler.
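
A naive sketch of the gene-permutation scheme (“preranked-GSEA”): the set membership is shuffled along the ranked list to build a null distribution of ES. Assumptions: weight 1 on the ranking metric, a null restricted to permuted ES of the same sign as the observed one, a +1 correction in the p-value, and NES taken as the ES divided by the mean absolute null ES; the function names are ours. This brute-force loop is precisely what fGSEA avoids, so it is an illustration and not the fGSEA algorithm.

```python
import numpy as np


def _es(abs_scores, in_set):
    """Compact weighted KS-like enrichment score (signed maximum deviation)."""
    hit = np.cumsum(np.where(in_set, abs_scores, 0.0))
    miss = np.cumsum(~in_set)
    running = hit / hit[-1] - miss / miss[-1]
    return running[np.argmax(np.abs(running))]


def gene_permutation_test(ranked_scores, in_set, n_perm=1000, seed=0):
    """Return (ES, NES, p-value) for one gene set, by gene permutation."""
    abs_scores = np.abs(np.asarray(ranked_scores, dtype=float))
    in_set = np.asarray(in_set, dtype=bool)
    rng = np.random.default_rng(seed)
    observed = _es(abs_scores, in_set)
    # Each permutation is a random gene set of the same size.
    null = np.array([_es(abs_scores, rng.permutation(in_set))
                     for _ in range(n_perm)])
    same_sign = null[np.sign(null) == np.sign(observed)]
    if len(same_sign) == 0:
        return observed, float("nan"), 1.0
    extreme = np.sum(np.abs(same_sign) >= abs(observed))
    p_value = (extreme + 1) / (len(same_sign) + 1)
    nes = observed / np.mean(np.abs(same_sign))
    return observed, nes, p_value


scores = np.linspace(3, -3, 50)      # toy ranking metric, already sorted
gene_set = np.arange(50) < 5         # the 5 top-ranked genes
print(gene_permutation_test(scores, gene_set, n_perm=500, seed=1))
```

The smallest p-value this scheme can return is about 1 / (number of permutations), which is the accuracy limit for low p-values that fGSEA addresses.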

Finally, a parameter to consider when performing (preranked-)(f)GSEA is which metric to use to rank genes. The most common are logFC, -log10(p-value) * sign(logFC), or some signed test statistic (e.g. t-test or Wald statistic).