Reads mapping
Once you are happy with your read sequences in your FASTQ files, you can use a mapper software to align the reads to the genome and thereby find where they originated from.
At the end of this lesson, you will be able to :
- identify the differences between a local aligner and a pseudo aligner.
- perform genome indexing appropriate to your data.
- map your RNA-seq data onto a genome.
Material
Building a reference genome index
Before any mapping can be achieved, you must first index the genome want to map to.
To do this with STAR, you need two files:
- a fasta file containing the sequences of the chromosome (or genome contigs)
- a gtf file containing annotations (ie. where the genes and exons are)
We will be using the Ensembl references, with their accompanying GTF annotations.
Note
While the data are already on the server here, in practice or if you are following this course without a teacher, you can grab the reference genome data from the Ensembl ftp website.
In particular, you will want a mouse DNA fasta file and gtf file.
Take note of the genome sequence and annotation versions, you will need this in your paper’s methods section!
Task : Using STAR, build a genome index for the mouse mitochondrial chromosome.
- .fasta and .gtf files are in :
/data/DATA/Mouse_MT_genome/
. - create the index in the folder
041_d_STAR_mouseMT_reference
- this job should require less than 4Gb and 10min to run.
STAR basic parameter for genome index generation
From the manual. Refer to it for more details
--runMode genomeGenerate
: running STAR in index generation mode--genomeDir </path/to/genomeDir>
: output folder for the index--genomeFastaFiles </path/to/genome/fasta1>
: chromosome sequences fasta file (can be several files)--sjdbGTFfile </path/to/annotations.gtf>
: annotation gtf file--runThreadN <NumberOfThreads>
: number of threads to run on--sjdbOverhang <ReadLength-1>
: length of the genomic sequence around the annotated junctions to be used in constructing the splice junctions database. Ideally : read length - 1.
Additionally, because the genome is so small here (we only use the mitochondrial chromosome after all), you will need the following advanced option:
--genomeSAindexNbases 5
: must be scaled tomin(14, log2(GenomeLength)/2 - 1)
, so 5 in our case
Note
While your indexing job is running, you can read ahead in STAR’s manual to prepare the next step : mapping your reads onto the indexed reference genome.
STAR indexing script
#!/usr/bin/bash
# indexing the mouse mitochondrial genome
G_FASTA=/data/DATA/Mouse_MT_genome/Mus_musculus.GRCm39.dna.chromosome.MT.fa
G_GTF=/data/DATA/Mouse_MT_genome/Mus_musculus.GRCm39.MT.gtf
mkdir -p 041_d_STAR_mouseMT_reference
STAR --runMode genomeGenerate \
--genomeDir 041_d_STAR_mouseMT_reference \
--genomeFastaFiles $G_FASTA \
--sjdbGTFfile $G_GTF \
--runThreadN 4 \
--genomeSAindexNbases 5 \
--sjdbOverhang 99
It can be found on the server at /data/Solutions/mouseMT/041_s_star_index.sh
Extra task : Determine how you would add an additional feature to your reference, for example for a novel transcript not described by the standard reference.
Answer
You would edit the gtf file to add your additional feature(s), following the proper format.
Note
In case you’ve got multiple FASTA files for your genome (eg, 1 per chromosome), you may just list them with the genomeFastaFiles
option as follow:
--genomeFastaFiles /path/to/genome/fasta1.fa /path/to/genome/fasta2.fa /path/to/genome/fasta3.fa ...
Mapping reads onto the reference
Task : Using STAR, align the raw FASTQ files of the mouseMT dataset against the mouse mitochondrial reference you just created.
- if were not able to complete the previous task, you can use the index in
/data/Solutions/mouseMT/041_d_STAR_mouseMT_reference
. - search the STAR manual for the option to output a BAM file sorted by coordinate.
- search the STAR manual for the option to output a geneCounts file.
- put the results in folder
042_d_STAR_map_raw/
.
STAR basic parameters for mapping
Taken again from the manual:
--genomeDir </path/to/genomeDir>
: folder where you have put the genome index--readFilesIn </path/to/read1>
: path to a fastq file. If the reads are paired, then also include the path to the second fastq file--runThreadN <NumberOfThreads>
: number of threads to run on.--outFileNamePrefix <prefix>
: prefix of the output files, typically something likeoutput_directory/sampleName
.
Note
Take the time to read the parts of the STAR manual which concern you: a bit of planning ahead can save you a lot of time-consuming/headache-inducing trial-and-error on your script.
Warning
Mapping reads and generating a sorted BAM from one of the mouseMT FASTQ file will take less than a minute and very little RAM, but on a real dataset it should take from 15 minutes to an hour per sample and require at least 30GB of RAM.
STAR mapping script
We will be using a job array to map each file in different job that will run at the same time.
First create a file named sampleNames.txt
, containing the sample names:
sample_a1
sample_a2
sample_a3
sample_a4
sample_b1
sample_b2
sample_b3
sample_b4
/data/Solutions/mouseMT/sampleNames.txt
Then for our script:
#!/usr/bin/bash
# aligning mouseMT reads with STAR
mkdir -p 042_d_STAR_map_raw
for SAMPLE in `cat sampleNames.txt`
do
FASTQ_NAME=/data/DATA/mouseMT/${SAMPLE}.fastq
STAR --runThreadN 4 --genomeDir 041_d_STAR_mouseMT_reference \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix 042_d_STAR_map_raw/${SAMPLE}. \
--quantMode GeneCounts \
--readFilesIn $FASTQ_NAME
done
/data/Solutions/mouseMT/042_s_STAR_map_raw.sh
and its results can be found at /data/Solutions/mouseMT/042_d_STAR_map_raw/
The options of STAR are :
--runThreadN 4
: 4 threads to go faster.--genomeDir 041_STAR_reference
: path of the genome to map to.--outSAMtype BAM SortedByCoordinate
: output a coordinate-sorted BAM file.--outFileNamePrefix 042_STAR_map_raw/${SAMPLE}.
: prefix of output files.--quantMode GeneCounts
: will create a file with counts of reads per gene.--readFilesIn $FASTQ_NAME
: input read file.
QC on the aligned reads
You can call MultiQC on the STAR output folder to gather a report on the individual alignments.
Task : use multiqc
to generate a QC report on the results of your mapping.
- Evaluate the alignment statistics. Do you consider this to be a good alignment?
- How many unmapped reads are there? Where might this come from, and how would you determine this?
- What could you say about library strandedness ?
script and answers
#!/usr/bin/bash
# multiqc on the mapping results
multiqc -n 043_r_multiqc_mouseMT_mapped_raw.html -f --title mapped_raw 042_d_STAR_map_raw/
/data/Solutions/mouseMT/043_s_multiqc_map_raw.sh
Comparison of mapping the trimmed reads
After having mapped the raw reads, we also map the trimmed reads and then compare the results to decide which one we want to use for the rest of our analysis.
We will spare you the mapping of the trimmed reads, and let you directly download the mapping multiqc report:
For the curious: scripts for the mapping of trimmed reads
#!/usr/bin/bash
# mapping trimmed mouseMT reads
mkdir -p 044_d_STAR_map_trimmed
for SAMPLE in `cat sampleNames.txt`
do
FASTQ_NAME=030_d_trim/${SAMPLE}.trimmed.fastq
STAR --runThreadN 4 --genomeDir 041_d_STAR_mouseMT_reference \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix 044_d_STAR_map_trimmed/${SAMPLE}_trimmed. \
--quantMode GeneCounts \
--readFilesIn $FASTQ_NAME
done
/data/Solutions/mouseMT/044_s_STAR_map_trimmed.sh
#!/usr/bin/bash
# multiqc of the mapped trimmed read
multiqc -n 045_r_multiqc_mouseMT_mapped_trimmed.html -f --title mapped_trimmed 044_d_STAR_map_trimmed/
/data/Solutions/mouseMT/045_s_multiqc_mouseMT_mapped_trimmed.sh
QC report of mapping for the Liu2015 and Ruhland2016 dataset
Liu2015
Take the time to look at the following reports:
Liu2015 raw reads mapping report Liu2015 trimmed reads mapping report
Which one would you choose?
Ruhland
Ruhland2016 raw reads mapping report
ADDITIONAL : pseudo-aligning with salmon
Salmon can allow you to quantify transcript expression without explicitly aligning the sequenced reads onto the reference genome with its gene and splice junction annotations, but instead to a simplification of the corresponding transcriptome, thus saving computational resources.
We refer you to the tool’s documentation in order to see how the reference index is computed.
Task : run salmon to quantify the expression of either the Ruhland or Liu dataset.
- Use the tool documentation to craft your command line.
- precomputed indices can be found in
/data/DATA/Mouse_salmon_index
and/data/DATA/Human_salmon_index
.
Warning
Please check with the teacher before you launch these tasks, because they require intensive resources (~6G of RAM and 40 minutes per fastq file).
script
#!/usr/bin/bash
# pseudo alignment of the Ruhland2016 reads with salmon
dataDIR=/data/DATA/Ruhland2016
sourceFILE=Ruhland2016.fastqFiles.txt
genomeDIR=/data/DATA/Mouse_salmon_index
for fastqFILE in `cat $sourceFILE`
do
outDIR=033_d_salmon_Ruhland2016_${fastqFILE%.*}
mkdir -p $outDIR
salmon quant -i $genomeDIR -l A \
-r $dataDIR/$fastqFILE \
-p 4 --validateMappings --gcBias --seqBias \
-o $outDIR
done
/data/Solutions/Ruhland2016/033_s_salmon_Ruhland2016.sh