Variant calling

Learning outcomes

After having completed this chapter you will be able to:

Describe how variant information is stored in a variant call format (.vcf) file
Describe the ‘missing genotype problem’ when calling variants of multiple samples, and the different methods on how this can be solved
Follow gatk best practices workflow to perform a variant analysis by:
- Applying Base Quality Score Recalibration on an alignment file
- Calling variants with gatk HaplotypeCaller
- Combining multiple vcf files into a single vcf file
Perform basic operations to get statistics of a vcf file

Material

Download the presentation

VCF format description

The paper on genomic variant call format (gVCF)

GATK best practices germline short variant workflow:

Exercises

1. Indexing, indexing, indexing

Many algorithms work faster, or only work with an index of their (large) input files. In that sense, gatk is no different from other tools. The index for a reference needs to be created in two steps:

cd ~/workdir/data/reference
samtools faidx <reference.fa>
gatk CreateSequenceDictionary --REFERENCE <reference.fa>

Also input vcf files need to be indexed. This will create a .idx file associated with the .vcf. You can do this like this:

gatk IndexFeatureFile --input <variants.vcf>

Exercise: Create the required gatk indexes for:

The reference genome reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa
A part of the dbsnp database: variants/GCF.38.filtered.renamed.vcf
A part of the 1000 genomes indel golden standard: variants/1000g_gold_standard.indels.filtered.vcf

Answer

Creating the index for the reference genome:

cd ~/workdir/data
samtools faidx reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa
gatk CreateSequenceDictionary --REFERENCE reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa

Creating the indices for the vcfs:

gatk IndexFeatureFile --input variants/1000g_gold_standard.indels.filtered.vcf
gatk IndexFeatureFile --input variants/GCF.38.filtered.renamed.vcf

Chromosome names

Unlike IGV, gatk requires equal chromosome names for all its input files and indexes, e.g. in .fasta, .bam and .vcf files. In general, for the human genome there are three types of chromosome names:

Just a number, e.g. 20
Prefixed by chr. e.g. chr20
Refseq name, e.g. NC_000020.11

Before you start the alignment, it’s wise to check out what chromosome naming your input files are using, because changing chromosome names in a .fasta file is easier than in a .bam file.

If your fasta titles are e.g. starting with a number you can add chr to it with sed:

sed s/^>/>chr/g <reference.fasta>

You can change chromsome names in a vcf with bcftools annotate:

bcftools annotate --rename-chrs <tab-delimited-renaming> <input.vcf>

2. Base Quality Score Recalibration (BQSR)

BQSR evaluates the base qualities on systematic error. It can ignore sites with known variants. BQSR helps to identify faulty base calls, and therefore reduces the chance on discovering false positive variant positions.

BQSR is done in two steps:

Recalibration with gatk BaseRecalibrator
By using the output of gatk BaseRecalibrator, the application to the bam file with gatk ApplyBQSR

Exercise: Check out the documentation of the tools. Which options are required?

Answer

For gatk BaseRecalibrator:

--reference
--input
--known-sites
--output

For gatk ApplyBQSR:

--bqsr-recal-file
--input
--output

Exercise: Run the two commands with the required options on mother.rg.md.bam, with --known-sites variants/1000g_gold_standard.indels.filtered.vcf and variants/GCF.38.filtered.renamed.vcf.

Multiple inputs for same argument

In some cases you need to add multiple inputs (e.g. multiple vcf files) into the same argument (e.g. --known-sites). To provide multiple inputs for the same argument in gatk, you can use the same argument multiple times, e.g.:

gatk BaseRecalibrator \
--reference <reference.fa> \
--input <alignment.bam> \
--known-sites <variants1.vcf> \
--known-sites <variants2.vcf> \
--output <output.table>

Answer

cd ~/workdir

mkdir bqsr

gatk BaseRecalibrator \
--reference data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
--input alignment/mother.rg.md.bam \
--known-sites data/variants/GCF.38.filtered.renamed.vcf \
--known-sites data/variants/1000g_gold_standard.indels.filtered.vcf \
--output bqsr/mother.recal.table

gatk ApplyBQSR \
--input alignment/mother.rg.md.bam \
--bqsr-recal-file bqsr/mother.recal.table \
--output bqsr/mother.recal.bam

Exercise: Place these commands in a ‘for loop’, that performs the BQSR for mother, father and son.

Answer

cd ~/workdir

for sample in mother father son
do
  gatk BaseRecalibrator \
  --reference data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
  --input alignment/$sample.rg.md.bam \
  --known-sites data/variants/GCF.38.filtered.renamed.vcf \
  --known-sites data/variants/1000g_gold_standard.indels.filtered.vcf \
  --output bqsr/$sample.recal.table

  gatk ApplyBQSR \
  --input alignment/$sample.rg.md.bam \
  --bqsr-recal-file bqsr/$sample.recal.table \
  --output bqsr/$sample.recal.bam
done

3. Variant calling

Calculating PL and GQ by hand

Here’s a function in R to calculate genotype likelihoods as described in Li H. Bioinformatics. 2011;27:2987–93 (assuming equal base error probabilities for all reads):

genotype_likelihood <- function(m,g,e,ref,alt){
  (((m-g)*e+g*(1-e))^alt * ((m-g)*(1-e)+g*e)^ref)/(m^(ref+alt))
}

Where:

m : ploidy
g : number of alternative alleles
e : base error probability
ref : number of reference alleles counted
alt : number of alternative alleles counted

Exercise: In a local R session, calculate the three genotype likelihoods (for g = 0, g = 1 and g = 2) for a case where we count 22 reference alleles and 4 alternative alleles (so a coverage of 26), and base error probability of 0.01. Calculate the PL values (-10*log10(likelihood)) for each genotype.

No local R installation?

If you don’t have access to an R installation, you can also install it in the jupyter environment by running inside the terminal:

conda install r-base

After installation, start an interactive R session by typing R.

Answer

# For g = 0 (i.e. 0 reference alleles)
-10*log10(genotype_likelihood(m = 2, g= 0, e = 0.01, ref = 22, alt = 4))
# [1] 80.96026 
-10*log10(genotype_likelihood(m = 2, g= 1, e = 0.01, ref = 22, alt = 4))
# [1] 78.2678
-10*log10(genotype_likelihood(m = 2, g= 2, e = 0.01, ref = 22, alt = 4))
# [1] 440.1746

Exercise: What is the most likely genotype? What is the genotype quality (GQ)? Do you think we should be confident about this genotype call?

Answer

The most likely genotype has the lowest PL, so where g=1 (heterozygous). GL is calculated by subtracting the lowest PL from the second lowest PL, so 80.96 - 78.27 = 2.69.

This is a low genotype quality (note that we’re in the phred scale), i.e. an error probability of 0.54. This makes sense, if the genotype is heterozygous we would roughly expect to count as many reference as alternative alleles, and our example quite strongly deviates from this expectation.

Calling variants with GATK

The command gatk HaplotypeCaller is the core command of gatk. It performs the actual variant calling.

Exercise: Check out the gatk HaplotypeCaller documentation, and find out which arguments are required.

Answer

Required arguments are:

--input
--ouput
--reference

Exercise: Make a directory ~/workdir/variants to write the output vcf. After that, run gatk HaplotypeCaller with required options on the recalibrated alignment file of the mother (bqsr/mother.recal.bam). We’ll focus on a small region, so add --intervals chr20:10018000-10220000.

Answer

cd ~/workdir
mkdir variants

gatk HaplotypeCaller \
--reference data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
--input bqsr/mother.recal.bam \
--output variants/mother.HC.vcf \
--intervals chr20:10018000-10220000

Exercise: You can get the number of records in a vcf with piping the output of grep -v '^#' to wc -l. Get the number of variants in the vcf.

Answer

grep -v '^#' variants/mother.HC.vcf | wc -l

Shows you that there are 411 variants in there.

You can get some more statistics with gatk VariantsToTable. The output can be used to easily query things in R or MS Excel.

Here’s an example:

gatk VariantsToTable \
--variant variants/mother.HC.vcf \
--fields CHROM -F POS -F TYPE -GF GT \
--output variants/mother.HC.table

Exercise: Run the command and have a look at the first few records (use e.g. head or less). After that, report the number of SNPs and INDELs.

Answer

You can get the number of SNPs with:

grep -c "SNP" variants/mother.HC.table

which will give 326

And the number of INDELs with:

grep -c "INDEL" variants/mother.HC.table

that outputs 84

A more fancy way to this would be:

cut -f 3 variants/mother.HC.table | tail -n +2 | sort | uniq -c

Giving:

84 INDEL
1 MIXED
326 SNP

We will do the variant calling on all three samples. Later we want to combine the variant calls. For efficient merging of vcfs, we will need to output the variants as a GVCF. To do that, we will use the option --emit-ref-confidence GVCF. Also, we’ll visualise the haplotype phasing with IGV in the next section. For that we’ll need a phased bam. You can get this output with the argument --bam-output.

Exercise: Run gatk HaplotypeCaller for mother, father and son by using a loop, and by using the arguments in the previous exercise. On top of that add the arguments --emit-ref-confidence GVCF and --bamoutput <phased.bam>.

Answer

cd ~/workdir

for sample in mother father son
do
  gatk HaplotypeCaller \
  --reference data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
  --input bqsr/$sample.recal.bam \
  --output variants/$sample.HC.g.vcf \
  --bam-output variants/$sample.phased.bam \
  --intervals chr20:10018000-10220000 \
  --emit-ref-confidence GVCF
done

4. Combining GVCFs

Now that we have all three GVCFs of the mother, father and son, we can combine them into a database. We do this because it enables us to later add GVCFs (with the option --genomicsdb-update-workspace-path), and to efficiently combine them into a single vcf.

You can generate a GenomicsDB on our three samples like this:

cd ~/workdir

gatk GenomicsDBImport \
--variant variants/mother.HC.g.vcf \
--variant variants/father.HC.g.vcf \
--variant variants/son.HC.g.vcf \
--intervals chr20:10018000-10220000 \
--genomicsdb-workspace-path genomicsdb

Exercise: Run this command to generate the database.

You can retrieve the combined vcf from the database with gatk GenotypeGVCFs.

gatk GenotypeGVCFs \
--reference data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
--variant gendb://genomicsdb \
--intervals chr20:10018000-10220000 \
--output variants/trio.vcf

Exercise: Run this command to generate the combined vcf.