Variant calling

Short variant calling

Now that we have quality-controlled the BAM files, we can go ahead with the variant calling itself. For this, we used mutect2 which is a somatic variant caller from GATK based on HaplotypeCaller. In addition to the expected bam files and references genome, the mutect2 command requires some additional input:

  • --intervals: the intervals of our target regions. This is in the interval_list format.
  • -normal: the sample name of the normal sample. So, the SM tag of the read group of the normal sample.
  • --germline-resource: know sites of germline variants with their allele frequencies in the population. These are used to estimate the confidence of a germline variant in the normal sample.
  • --panel-of-normals: A VCF generated from normal samples that contain sites with known technical artifacts. Ideally, you create a PON from your own normal samples, but this is typically recommended if you have more than 40 normal samples. Therefore, here, we use a pre-generated PON from the 1000 genomes project. More information on PON in this article.
Exercise

Create a script called 03_run_mutect2.sh. After that, check out the manual of Mutect2 and replace the placeholders FIXME with the required values of the Mutect2 below command and run it:


#!/usr/bin/env bash

ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants

mkdir -p $VARIANTDIR

gatk Mutect2 \
-R FIXME \
--intervals "$REFDIR"/exome_regions.bed.interval_list \
-I FIXME \
-I FIXME \
-normal FIXME \
--germline-resource "$RESOURCEDIR"/af-only-gnomad.hg38.subset.vcf.gz \
--panel-of-normals "$RESOURCEDIR"/1000g_pon.hg38.subset.vcf.gz \
-O "$VARIANTDIR"/somatic.vcf.gz
Caution

This takes a while to run. Have a break!

#!/usr/bin/env bash

ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants

mkdir -p $VARIANTDIR

gatk Mutect2 \
-R "$REFDIR"/ref_genome.fa \
--intervals "$REFDIR"/exome_regions.bed.interval_list \
-I "$ALIGNDIR"/tumor.recal.bam \
-I "$ALIGNDIR"/normal.recal.bam \
-normal normal \
--germline-resource "$RESOURCEDIR"/af-only-gnomad.hg38.subset.vcf.gz \
--panel-of-normals "$RESOURCEDIR"/1000g_pon.hg38.subset.vcf.gz \
-O "$VARIANTDIR"/somatic.vcf.gz

After calling the variants we can do an initial filtering step. We do this with FilterMutectCalls. This method takes technical artifacts, possibility of germline variants, and sequencing error in account, calculates an error probability and tries to optimize between recall and precision.

Exercise

Create a script called 04_filter_mutect_calls.sh in ~/project/scripts. From the script, run the command to filter the somatic variants:

#!/usr/bin/env bash

REFDIR=~/project/course_data/reference
VARIANTDIR=~/project/course_data/variants

# fill the filter column
gatk FilterMutectCalls \
-R "$REFDIR"/ref_genome.fa \
-V "$VARIANTDIR"/somatic.vcf.gz \
-O "$VARIANTDIR"/somatic.filtered.vcf.gz

# create a vcf with variants passing filters (PASS)
bcftools view -f PASS "$VARIANTDIR"/somatic.filtered.vcf.gz -Oz \
> "$VARIANTDIR"/somatic.filtered.PASS.vcf.gz

bcftools index --tbi "$VARIANTDIR"/somatic.filtered.PASS.vcf.gz

How many variants were kept? What were the main reason for filtering them out?

Note

The filtering information is in the 7th column of the vcf, so you can do something like:

bcftools view -H my_variants.vcf.gz | cut -f 7 | sort | uniq -c | sort -nr | head -n 10

To get the number of variants in the unfiltered vcf:

bcftools view -H somatic.vcf.gz | wc -l

Resulting in 467 unfiltered variants.

To get the 10 most frequently occurring filters in the filter column of the filtered vcf:

bcftools view -H somatic.filtered.vcf.gz | cut -f 7 | sort | uniq -c | sort -nr | head -n 10

Resulting in:

133 PASS
48 weak_evidence
26 normal_artifact;strand_bias
21 panel_of_normals
17 normal_artifact
13 normal_artifact;slippage;weak_evidence
13 clustered_events;normal_artifact;strand_bias
12 germline;multiallelic;normal_artifact;panel_of_normals
12 base_qual;normal_artifact;strand_bias
9 strand_bias;weak_evidence

So most variants were filtered out, and most of the variants were filtered out because of ‘weak evidence’.

In order to figure out what that means we can check out the vcf header:

bcftools view -h somatic.filtered.vcf.gz | grep "^##FILTER"

Resulting in:

##FILTER=<ID=FAIL,Description="Fail the site if all alleles fail but for different reasons.">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=possible_numt,Description="Allele depth is below expected coverage of NuMT in autosome">
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">

Showing us that these mutations do not meet the likelihood threshold, basically telling us that these are the variants filtered out because of the combined error probability based on technical artifacts, possibility of germline variants, and sequencing error. The other filters are so called ‘hard filters’, meaning that by themselves they do not meet a fixed threshold.

EXTRA: Summarizing the VCF

Only when you have time

Do the exercises about summarizing the VCF only if you have time. Otherwise, continue to Copy number variation calling.

Let’s check out the VCF in a bit more detail. We’d like to have an idea what kind of variants are in there, and what for example their variant allele frequency is.

First, we can check what the variants look like:

bcftools view -H somatic.filtered.vcf.gz | head

Showing us that both the FORMAT and INFO fields are filled with information. To check out the meaning of what is in the INFO field you can run:

bcftools view -h somatic.filtered.vcf.gz | grep "^##INFO"

And you can do the same for the format fields:

bcftools view -h somatic.filtered.vcf.gz | grep "^##FORMAT"

To summarize the vcf for information in the format or info fields you can use bcftools query. For example, for getting the likelihood of variant and depth of each sample, you can do the following (for the first 20 variants):

bcftools query -f '%CHROM\t%POS\t%FILTER\t\t%INFO/TLOD\t[%SAMPLE=%DP;]\n' somatic.filtered.vcf.gz | head -20
Exercise

Instead of the depth DP for each sample get the variant allele frequency for each sample, and compare that with the FILTER column.

We can check which tag contains the variant allele frequency in the vcf:

bcftools view -h somatic.filtered.vcf.gz | grep "^##FORMAT"

Returning:

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">

Where AF contains ‘Allele fractions of alternate alleles in the tumor’, so that is the tag we are looking for. So, we replace DP with AF:

bcftools query -f '%CHROM\t%POS\t%FILTER\t\t%INFO/TLOD\t[%SAMPLE=%AF;]\n' somatic.filtered.vcf.gz | head -30

Returning:

chr6    106265  map_qual;strand_bias;weak_evidence              3.99    normal=0.00671;tumor=0.01;
chr6    325357  PASS            719.91  normal=0.001054;tumor=0.161;
chr6    325783  PASS            1292.31 normal=0.0009626;tumor=0.224;
chr6    325901  PASS            1030.6  normal=0.005447;tumor=0.183;
chr6    325916  PASS            937.7   normal=0.004142;tumor=0.116;
chr6    1742560 panel_of_normals                235.6   normal=0.02;tumor=0.41;
chr6    1916336 PASS            187.23  normal=0.022;tumor=0.482;
chr6    1930220 PASS            152.54  normal=0.016;tumor=0.207;
chr6    2674999 PASS            190.32  normal=0.028;tumor=0.514;
chr6    2834013 PASS            245.04  normal=0.02;tumor=0.494;
chr6    2840548 PASS            274.4   normal=0.021;tumor=0.522;
chr6    3110986 weak_evidence           3.03    normal=0.017;tumor=0.019;
chr6    3723708 PASS            266.67  normal=0.017;tumor=0.42;
chr6    4892169 weak_evidence           4.43    normal=0.024;tumor=0.02;
chr6    5103987 germline;multiallelic;normal_artifact;panel_of_normals          13.57,257.27    normal=0.037,0.825;tumor=0.077,0.834;
chr6    6250887 PASS            260.99  normal=0.018;tumor=0.459;
chr6    7176847 PASS            107.39  normal=0.026;tumor=0.319;
chr6    7585458 PASS            171.88  normal=0.015;tumor=0.283;
chr6    7833697 PASS            122.24  normal=0.057;tumor=0.513;
chr6    7883281 germline                99.5    normal=0.05;tumor=0.476;
chr6    8054329 PASS            420.89  normal=0.045;tumor=0.446;
chr6    9900366 panel_of_normals                525.07  normal=0.024;tumor=0.992;
chr6    10397660        germline;multiallelic;slippage          3.21,50.94      normal=0.042,0.043;tumor=0.098,0.798;
chr6    10398435        PASS            365.76  normal=0.013;tumor=0.943;
chr6    10801908        PASS            91.41   normal=0.02;tumor=0.312;
chr6    10989772        PASS            71.25   normal=0.041;tumor=0.49;
chr6    11306007        PASS            209.24  normal=0.009359;tumor=0.4;
chr6    13479062        germline;multiallelic;normal_artifact;panel_of_normals          4.63,5.86,13.96 normal=0.022,0.027,0.258;tumor=0.043,0.068,0.165;
chr6    13977507        weak_evidence           3.4     normal=0.012;tumor=0.027;
chr6    16147945        PASS            331.72  normal=0.014;tumor=0.978;

Where we see that if there’s the filter weak_evidence the VAF of the tumor is typically low.

Copy number variation calling

Variation of copy number in genes can have a large effect on the phenotype of a tumor. Therefore, we will also estimate the copy number variation occurring on chromosome 6 and 17. For that, we use CNVkit.

Exercise

Check out the documentation of CNVkit.py bash, and the helper (cnvkit.py batch -h), create a script called 05_run_cnvkit.sh, and replace the missing values at FIXME.

After that, checkout the visualizations (tumor-scatter.png and tumor-scatter.png) at ~/project/course_data/variants/cnvkit. Do you see any evidence for copy number variation?

#!/usr/bin/env bash

ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants

cnvkit.py batch FIXME \
--normal FIXME \
--targets FIXME \
--fasta FIXME \
--annotate "$RESOURCEDIR"/refFlat.txt \
--output-reference "$VARIANTDIR"/reference.cnn \
--output-dir "$VARIANTDIR"/cnvkit/ \
--processes 4 \
--scatter \
--diagram
Note

The scale on the scatter plot is log2(copy ratio). So, if there is a duplication at one chromsome you would expect a copy ratio of 1.5 (3 chromsomes/2 chromsomes). The log2 of 1.5 is 0.58. So estimates at 0.58 mean a gain of one copy, and estimates at -0.5 (log2(0.5)) a loss of one copy.

We provide the tumor bam, normal bam, the interval list (can also be a bed file) and our reference genome:

#!/usr/bin/env bash

ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants

cnvkit.py batch "$ALIGNDIR"/tumor.recal.bam \
--normal "$ALIGNDIR"/normal.recal.bam \
--targets "$REFDIR"/exome_regions.bed.interval_list \
--fasta "$REFDIR"/ref_genome.fa \
--annotate "$RESOURCEDIR"/refFlat.txt \
--output-reference "$VARIANTDIR"/reference.cnn \
--output-dir "$VARIANTDIR"/cnvkit/ \
--processes 4 \
--scatter \
--diagram

After running cnvkit, multiple files are created in ~/project/course_data/variants/cnvkit. The scatter plot is typically interesting to investigate at first:

Here we see that there is evidence for both losses and gains on both chromosomes.