Reads mapping

Once you are happy with your read sequences in your FASTQ files, you can use a mapper software to align the reads to the genome and thereby find where they originated from.

At the end of this lesson, you will be able to :

identify the differences between a local aligner and a pseudo aligner.
perform genome indexing appropriate to your data.
map your RNA-seq data onto a genome.

Material

Download the presentation

STAR website

Building a reference genome index

Before any mapping can be achieved, you must first index the genome want to map to.

To do this with STAR, you need two files:

a fasta file containing the sequences of the chromosome (or genome contigs)
a gtf file containing annotations (ie. where the genes and exons are)

We will be using the Ensembl references, with their accompanying GTF annotations.

Note

While the data are already on the server here, in practice or if you are following this course without a teacher, you can grab the reference genome data from the Ensembl ftp website.

In particular, you will want a mouse DNA fasta file and gtf file.

Take note of the genome sequence and annotation versions, you will need this in your paper’s methods section!

Task : Using STAR, build a genome index for the mouse mitochondrial chromosome.

.fasta and .gtf files are in : /data/DATA/Mouse_MT_genome/.
create the index in the folder 041_d_STAR_mouseMT_reference
this job should require less than 4Gb and 10min to run.

STAR basic parameter for genome index generation

From the manual. Refer to it for more details

--runMode genomeGenerate : running STAR in index generation mode
--genomeDir </path/to/genomeDir> : output folder for the index
--genomeFastaFiles </path/to/genome/fasta1> : chromosome sequences fasta file (can be several files)
--sjdbGTFfile </path/to/annotations.gtf> : annotation gtf file
--runThreadN <NumberOfThreads> : number of threads to run on
--sjdbOverhang <ReadLength-1> : length of the genomic sequence around the annotated junctions to be used in constructing the splice junctions database. Ideally : read length - 1.

Additionally, because the genome is so small here (we only use the mitochondrial chromosome after all), you will need the following advanced option:

--genomeSAindexNbases 5 : must be scaled to min(14, log2(GenomeLength)/2 - 1), so 5 in our case

Note

While your indexing job is running, you can read ahead in STAR’s manual to prepare the next step : mapping your reads onto the indexed reference genome.

STAR indexing script

#!/usr/bin/bash
# indexing the mouse mitochondrial genome

G_FASTA=/data/DATA/Mouse_MT_genome/Mus_musculus.GRCm39.dna.chromosome.MT.fa
G_GTF=/data/DATA/Mouse_MT_genome/Mus_musculus.GRCm39.MT.gtf


mkdir -p 041_d_STAR_mouseMT_reference

STAR --runMode genomeGenerate \
     --genomeDir 041_d_STAR_mouseMT_reference \
     --genomeFastaFiles $G_FASTA \
     --sjdbGTFfile $G_GTF \
     --runThreadN 4 \
     --genomeSAindexNbases 5 \
     --sjdbOverhang 99

It can be found on the server at /data/Solutions/mouseMT/041_s_star_index.sh

Extra task : Determine how you would add an additional feature to your reference, for example for a novel transcript not described by the standard reference.

Answer

You would edit the gtf file to add your additional feature(s), following the proper format.

Note

In case you’ve got multiple FASTA files for your genome (eg, 1 per chromosome), you may just list them with the genomeFastaFiles option as follow:

--genomeFastaFiles /path/to/genome/fasta1.fa /path/to/genome/fasta2.fa /path/to/genome/fasta3.fa ...

Mapping reads onto the reference

Task : Using STAR, align the raw FASTQ files of the mouseMT dataset against the mouse mitochondrial reference you just created.

if were not able to complete the previous task, you can use the index in /data/Solutions/mouseMT/041_d_STAR_mouseMT_reference .
search the STAR manual for the option to output a BAM file sorted by coordinate.
search the STAR manual for the option to output a geneCounts file.
put the results in folder 042_d_STAR_map_raw/ .

STAR basic parameters for mapping

Taken again from the manual:

--genomeDir </path/to/genomeDir> : folder where you have put the genome index
--readFilesIn </path/to/read1> : path to a fastq file. If the reads are paired, then also include the path to the second fastq file
--runThreadN <NumberOfThreads>: number of threads to run on.
--outFileNamePrefix <prefix> : prefix of the output files, typically something like output_directory/sampleName .

Note

Take the time to read the parts of the STAR manual which concern you: a bit of planning ahead can save you a lot of time-consuming/headache-inducing trial-and-error on your script.

Warning

Mapping reads and generating a sorted BAM from one of the mouseMT FASTQ file will take less than a minute and very little RAM, but on a real dataset it should take from 15 minutes to an hour per sample and require at least 30GB of RAM.

STAR mapping script

We will be using a job array to map each file in different job that will run at the same time.

First create a file named sampleNames.txt, containing the sample names:

sample_a1
sample_a2
sample_a3
sample_a4
sample_b1
sample_b2
sample_b3
sample_b4

it can also be found in the server at /data/Solutions/mouseMT/sampleNames.txt

Then for our script:

#!/usr/bin/bash
# aligning mouseMT reads with STAR


mkdir -p 042_d_STAR_map_raw

for SAMPLE in `cat sampleNames.txt`
do
 FASTQ_NAME=/data/DATA/mouseMT/${SAMPLE}.fastq

 STAR --runThreadN 4 --genomeDir 041_d_STAR_mouseMT_reference \
               --outSAMtype BAM SortedByCoordinate \
               --outFileNamePrefix  042_d_STAR_map_raw/${SAMPLE}. \
               --quantMode GeneCounts \
               --readFilesIn $FASTQ_NAME
done

it can also be found in the server at /data/Solutions/mouseMT/042_s_STAR_map_raw.sh

and its results can be found at /data/Solutions/mouseMT/042_d_STAR_map_raw/

The options of STAR are :

--runThreadN 4 : 4 threads to go faster.
--genomeDir 041_STAR_reference : path of the genome to map to.
--outSAMtype BAM SortedByCoordinate : output a coordinate-sorted BAM file.
--outFileNamePrefix 042_STAR_map_raw/${SAMPLE}. : prefix of output files.
--quantMode GeneCounts : will create a file with counts of reads per gene.
--readFilesIn $FASTQ_NAME : input read file.

QC on the aligned reads

You can call MultiQC on the STAR output folder to gather a report on the individual alignments.

Task : use multiqc to generate a QC report on the results of your mapping.

Evaluate the alignment statistics. Do you consider this to be a good alignment?
How many unmapped reads are there? Where might this come from, and how would you determine this?
What could you say about library strandedness ?

script and answers

#!/usr/bin/bash
# multiqc on the mapping results

multiqc -n 043_r_multiqc_mouseMT_mapped_raw.html -f --title mapped_raw 042_d_STAR_map_raw/

it can also be found in the server at /data/Solutions/mouseMT/043_s_multiqc_map_raw.sh

Download the report

Comparison of mapping the trimmed reads

After having mapped the raw reads, we also map the trimmed reads and then compare the results to decide which one we want to use for the rest of our analysis.

We will spare you the mapping of the trimmed reads, and let you directly download the mapping multiqc report:

trimmed reads mapping report

For the curious: scripts for the mapping of trimmed reads

#!/usr/bin/bash
# mapping trimmed mouseMT reads



mkdir -p 044_d_STAR_map_trimmed

for SAMPLE in `cat sampleNames.txt`
do

 FASTQ_NAME=030_d_trim/${SAMPLE}.trimmed.fastq

 STAR --runThreadN 4 --genomeDir 041_d_STAR_mouseMT_reference \
                  --outSAMtype BAM SortedByCoordinate \
                  --outFileNamePrefix  044_d_STAR_map_trimmed/${SAMPLE}_trimmed. \
                  --quantMode GeneCounts \
                  --readFilesIn $FASTQ_NAME
done

it can also be found in the server at /data/Solutions/mouseMT/044_s_STAR_map_trimmed.sh

#!/usr/bin/bash
# multiqc of the mapped trimmed read

multiqc -n 045_r_multiqc_mouseMT_mapped_trimmed.html -f --title mapped_trimmed 044_d_STAR_map_trimmed/

it can also be found in the server at /data/Solutions/mouseMT/045_s_multiqc_mouseMT_mapped_trimmed.sh

QC report of mapping for the Liu2015 and Ruhland2016 dataset

Liu2015

Take the time to look at the following reports:

Liu2015 raw reads mapping report Liu2015 trimmed reads mapping report

Which one would you choose?

Ruhland

Ruhland2016 raw reads mapping report

ADDITIONAL : pseudo-aligning with salmon

salmon website

Salmon can allow you to quantify transcript expression without explicitly aligning the sequenced reads onto the reference genome with its gene and splice junction annotations, but instead to a simplification of the corresponding transcriptome, thus saving computational resources.

We refer you to the tool’s documentation in order to see how the reference index is computed.

Task : run salmon to quantify the expression of either the Ruhland or Liu dataset.

Use the tool documentation to craft your command line.
precomputed indices can be found in /data/DATA/Mouse_salmon_index and /data/DATA/Human_salmon_index.

Warning

Please check with the teacher before you launch these tasks, because they require intensive resources (~6G of RAM and 40 minutes per fastq file).

script

#!/usr/bin/bash
# pseudo alignment of the Ruhland2016 reads with salmon


dataDIR=/data/DATA/Ruhland2016

sourceFILE=Ruhland2016.fastqFiles.txt


genomeDIR=/data/DATA/Mouse_salmon_index

for fastqFILE in `cat $sourceFILE`
do
outDIR=033_d_salmon_Ruhland2016_${fastqFILE%.*}

 mkdir -p $outDIR

 salmon quant -i $genomeDIR -l A \
              -r $dataDIR/$fastqFILE \
              -p 4 --validateMappings --gcBias --seqBias \
              -o $outDIR
done

it can also be found in the server at /data/Solutions/Ruhland2016/033_s_salmon_Ruhland2016.sh