Reads mapping
Once you are happy with your read sequences in your FASTQ files, you can use a mapper software to align the reads to the genome and thereby find where they originated from.
At the end of this lesson, you will be able to :
- identify the differences between a local aligner and a pseudo aligner.
- perform genome indexing appropriate to your data.
- map your RNA-seq data onto the genome.
Material
Building a reference genome index
Before any mapping can be achieved, you must first index the genome want to map to.
To do this with STAR, you need two files:
- a fasta file containing the sequences of the chromosome (or genome contigs)
- a gtf file containing annotations (ie. where the genes and exons are)
We will be using the Ensembl references, with their accompanying GTF annotations.
Note
While the data are already on the server here, in practice or if you are following this course without a teacher, you can grab the reference genome data from the Ensembl ftp website.
In particular, you will want a mouse DNA fasta file and gtf file [release-104 at the time we are linking this. Checking for more recent release is recommended, but may slightly alter the results].
Task : Using STAR, build a genome index for chromosome 19 of Mus musculus using the associated GTF
Important notes :
- the module name for this aligner is
star
. - .fasta and .gtf files are in :
/shared/data/DATA/Mouse_chr19/
. - refer to the manual to determine which options to use.
- the
--genomeDir
parameter is the folder where the indexed genome will be output to. - this job should require less than 4Gb and 30min to run.
Note
While your indexing job is running, you can read ahead in STAR’s manual to prepare the next step : mapping your reads onto the indexed reference genome.
STAR indexing script
#!/usr/bin/bash
#SBATCH --job-name=star-build
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH -o star-build.o
#SBATCH -e star-build.e
G_FASTA=/shared/data/DATA/Mouse_chr19/Mus_musculus.GRCm38.dna.chromosome.19.fa
G_GTF=/shared/data/DATA/Mouse_chr19/Mus_musculus.GRCm38.101.chr19.gtf
ml star
mkdir -p STAR_references
STAR --runMode genomeGenerate \
--genomeDir STAR_references \
--genomeFastaFiles $G_FASTA \
--sjdbGTFfile $G_GTF \
--runThreadN 4 \
--genomeSAindexNbases 11 \
--sjdbOverhang 49
Extra task : Determine how you would add an additional feature to your reference, for example for a novel transcript not described by the standard reference.
Answer
Edit the gtf file to add your additional feature(s), following the proper format.
Mapping reads onto the reference
Task : Using STAR, align ONE of the FASTQ files from the Ruhland2016 study against the mouse genome.
- Use the full indexed genome at
/shared/data/DATA/Mouse_STAR_index/
, rather than the one we just made. - IMPORTANT: use the following option in your STAR command:
--outTmpDir /tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}/
. You can use the manual to look up what this option does. The slurm variables ensure a distinct directory is created in/tmp/
for each user and for each job. - Generate a BAM file sorted by coordinate.
- Generate a geneCounts file.
- Mapping reads and generating a sorted BAM from one of the Ruhland2016 et al. FASTQ files should take about 20 minutes.
Note
Take the time to read the parts of the STAR manual which concern you : a bit of planning ahead can save you a lot of time-consuming/headache-inducing trial and error on your script.
Warning
Remember : request a maximum of 30G and 8 CPUs for 1 hour.
STAR mapping script
#!/usr/bin/bash
#SBATCH --job-name=star-aln
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G
#SBATCH -o star-aln.o
#SBATCH -e star-aln.e
ml star
outDIR=STAR_Ruhland2016
mkdir -p $outDIR
dataDIR=/shared/data/DATA/Ruhland2016
genomeDIR=/shared/data/DATA/Mouse_STAR_index
STAR --runThreadN 8 --genomeDir $genomeDIR \
--outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx \
--outFileNamePrefix $outDIR/SRR3180535_EtOH1_1 \
--quantMode GeneCounts \
--readFilesIn $dataDIR/SRR3180535_EtOH1_1.fastq.gz --readFilesCommand zcat \
The options of STAR are :
- –runThreadN 8 : 8 threads to go faster.
- –genomeDir $genomeDIR : path of the genome to map to.
- –outSAMtype BAM SortedByCoordinate : output a coordinate-sorted BAM file.
- –outReadsUnmapped Fastx : output the non-mapping reads (in case we want to analyse them).
- –outFileNamePrefix $outDIR/$fastqFILE : prefix of output files.
- –quantMode GeneCounts : will create a file with counts of reads per gene.
- –readFilesIn $dataDIR/$fastqFILE : input read file.
- –readFilesCommand zcat : command to unzip the input file.
advanced : STAR mapping script with array job
The following sets up an array of tasks to align all samples.
Source file : Ruhland2016.fastqFiles.txt
:
SRR3180535_EtOH1_1.fastq.gz
SRR3180536_EtOH2_1.fastq.gz
SRR3180537_EtOH3_1.fastq.gz
SRR3180538_TAM1_1.fastq.gz
SRR3180539_TAM2_1.fastq.gz
SRR3180540_TAM3_1.fastq.gz
sbatch script :
#!/usr/bin/bash
#SBATCH --job-name=star-aln
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G
#SBATCH -o star-aln.%a.o
#SBATCH -e star-aln.%a.e
#SBATCH --array 1-1%1
ml star
outDIR=STAR_Ruhland2016
mkdir -p $outDIR
dataDIR=/shared/data/DATA/Ruhland2016
sourceFILE=Ruhland2016.fastqFiles.txt
fastqFILE=`sed -n ${SLURM_ARRAY_TASK_ID}p $sourceFILE`
genomeDIR=/shared/data/DATA/Mouse_STAR_index
STAR --runThreadN 8 --genomeDir $genomeDIR \
--outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx \
--outFileNamePrefix $outDIR/$fastqFILE \
--quantMode GeneCounts \
--readFilesIn $dataDIR/$fastqFILE --readFilesCommand zcat \
The options of STAR are :
- –runThreadN 8 : 8 threads to go faster.
- –genomeDir $genomeDIR : path of the genome to map to.
- –outSAMtype BAM SortedByCoordinate : output a coordinate-sorted BAM file.
- –outReadsUnmapped Fastx : output the non-mapping reads (in case we want to analyse them).
- –outFileNamePrefix $outDIR/$fastqFILE : prefix of output files.
- –quantMode GeneCounts : will create a file with counts of reads per gene.
- –readFilesIn $dataDIR/$fastqFILE : input read file.
- –readFilesCommand zcat : command to unzip the input file.
QC on the aligned reads
You can call MultiQC on the STAR output folder to gather a report on the individual alignments.
Here we’ve aligned a single sample, but usually this would cover all your samples.
Task : use multiqc
to generate a QC report on the results of your mapping.
- Evaluate the alignment statistics. Do you consider this to be a good alignment?
- How many unmapped reads are there? Where might this come from, and how would you determine this?
- What could you say about library strandedness ?
script and answers
#!/usr/bin/bash
#SBATCH --job-name=multiqc
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH -o multiqc_star_Ruhland2016.o
#SBATCH -e multiqc_star_Ruhland2016.e
mkdir -p STAR_MULTIQC_Ruhland2016/
multiqc -o STAR_MULTIQC_Ruhland2016/ STAR_Ruhland2016/
Result :
ADDITIONNAL : STAR 2-Pass
Genome annotations are incomplete, particularly for complex eukaryotes : there are many as-of-yet unannotated splice junctions.
The first pass of STAR can create a splice junction database, containing both known and novel junctions. This splice junction database can, in turn, be used to guide an improved second round of alignment, using a command like:
STAR <1st round options> --sjdbFileChrStartEnd sample_SJ.out.tab
Task : run STAR in this STAR-2pass mode on the same sample as before and evaluate the results.
script
#!/usr/bin/bash
#SBATCH --job-name=star-aln2
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G
#SBATCH -o star-aln-2pass.%a.o
#SBATCH -e star-aln-2pass.%a.e
#SBATCH --array 1-1%1
ml star
outDIR=STAR_Ruhland2016
mkdir -p $outDIR
dataDIR=/shared/data/DATA/Ruhland2016
sourceFILE=Ruhland2016.fastqFiles.txt
fastqFILE=`sed -n ${SLURM_ARRAY_TASK_ID}p $sourceFILE`
genomeDIR=/shared/data/DATA/Mouse_STAR_index
STAR --runThreadN 8 --genomeDir $genomeDIR \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix $outDIR/$fastqFILE.2Pass. \
--outReadsUnmapped Fastx --quantMode GeneCounts \
--sjdbFileChrStartEnd $outDIR/${fastqFILE}SJ.out.tab \
--readFilesIn $dataDIR/$fastqFILE --readFilesCommand zcat \
ADDITIONAL : pseudo-aligning with salmon
salmon can allow you to quantify transcript expression without explicitly aligning the sequenced reads onto the reference genome with its gene and splice junction annotations, but to a simplification of the corresponding transcriptome, thus saving computational resources.
We refer you to the tool’s documentation in order to see how the reference index is computed.
Task : run salmon to quantify the expression of either the Ruhland or Liu dataset.
- Use the tool documentation to craft your command line.
- precomputed indices can be found in
/shared/data/Mouse_salmon_index
and/shared/data/Human_salmon_index
.
script
#!/usr/bin/bash
#SBATCH --job-name=salmonRuhland
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G
#SBATCH -o salmon_ruhland2016.%a.o
#SBATCH -e salmon_ruhland2016.%a.e
#SBATCH --array 1-6%1
ml salmon
outDIR=salmon_Ruhland2016
mkdir -p $outDIR
dataDIR=/shared/data/DATA/Ruhland2016
sourceFILE=Ruhland2016.fastqFiles.txt
fastqFILE=`sed -n ${SLURM_ARRAY_TASK_ID}p $sourceFILE`
genomeDIR=/shared/data/DATA/Mouse_salmon_index
salmon quant -i $genomeDIR -l A \
-r $dataDIR/$fastqFILE \
-p 8 --validateMappings --gcBias --seqBias \
-o $outDIR