Variant calling
Short variant calling
Now that we have quality-controlled the BAM files, we can go ahead with the variant calling itself. For this, we used mutect2
which is a somatic variant caller from GATK based on HaplotypeCaller
. In addition to the expected bam files and references genome, the mutect2
command requires some additional input:
--intervals
: the intervals of our target regions. This is in theinterval_list
format.-normal
: the sample name of the normal sample. So, theSM
tag of the read group of the normal sample.--germline-resource
: know sites of germline variants with their allele frequencies in the population. These are used to estimate the confidence of a germline variant in the normal sample.--panel-of-normals
: A VCF generated from normal samples that contain sites with known technical artifacts. Ideally, you create a PON from your own normal samples, but this is typically recommended if you have more than 40 normal samples. Therefore, here, we use a pre-generated PON from the 1000 genomes project. More information on PON in this article.
Create a script called 03_run_mutect2.sh
. After that, check out the manual of Mutect2
and replace the placeholders FIXME
with the required values of the Mutect2
below command and run it:
#!/usr/bin/env bash
ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants
mkdir -p $VARIANTDIR
gatk Mutect2 \
\
-R FIXME "$REFDIR"/exome_regions.bed.interval_list \
--intervals \
-I FIXME \
-I FIXME \
-normal FIXME "$RESOURCEDIR"/af-only-gnomad.hg38.subset.vcf.gz \
--germline-resource "$RESOURCEDIR"/1000g_pon.hg38.subset.vcf.gz \
--panel-of-normals "$VARIANTDIR"/somatic.vcf.gz
-O
This takes a while to run. Have a break!
#!/usr/bin/env bash
ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants
mkdir -p $VARIANTDIR
gatk Mutect2 \
"$REFDIR"/ref_genome.fa \
-R "$REFDIR"/exome_regions.bed.interval_list \
--intervals "$ALIGNDIR"/tumor.recal.bam \
-I "$ALIGNDIR"/normal.recal.bam \
-I \
-normal normal "$RESOURCEDIR"/af-only-gnomad.hg38.subset.vcf.gz \
--germline-resource "$RESOURCEDIR"/1000g_pon.hg38.subset.vcf.gz \
--panel-of-normals "$VARIANTDIR"/somatic.vcf.gz -O
After calling the variants we can do an initial filtering step. We do this with FilterMutectCalls
. This method takes technical artifacts, possibility of germline variants, and sequencing error in account, calculates an error probability and tries to optimize between recall and precision.
Create a script called 04_filter_mutect_calls.sh
in ~/project/scripts
. From the script, run the command to filter the somatic variants:
#!/usr/bin/env bash
REFDIR=~/project/course_data/reference
VARIANTDIR=~/project/course_data/variants
# fill the filter column
gatk FilterMutectCalls \
"$REFDIR"/ref_genome.fa \
-R "$VARIANTDIR"/somatic.vcf.gz \
-V "$VARIANTDIR"/somatic.filtered.vcf.gz
-O
# create a vcf with variants passing filters (PASS)
bcftools view -f PASS "$VARIANTDIR"/somatic.filtered.vcf.gz -Oz \
> "$VARIANTDIR"/somatic.filtered.PASS.vcf.gz
bcftools index --tbi "$VARIANTDIR"/somatic.filtered.PASS.vcf.gz
How many variants were kept? What were the main reason for filtering them out?
The filtering information is in the 7th column of the vcf, so you can do something like:
bcftools view -H my_variants.vcf.gz | cut -f 7 | sort | uniq -c | sort -nr | head -n 10
To get the number of variants in the unfiltered vcf:
bcftools view -H somatic.vcf.gz | wc -l
Resulting in 467 unfiltered variants.
To get the 10 most frequently occurring filters in the filter column of the filtered vcf:
bcftools view -H somatic.filtered.vcf.gz | cut -f 7 | sort | uniq -c | sort -nr | head -n 10
Resulting in:
133 PASS
48 weak_evidence
26 normal_artifact;strand_bias
21 panel_of_normals
17 normal_artifact
13 normal_artifact;slippage;weak_evidence
13 clustered_events;normal_artifact;strand_bias
12 germline;multiallelic;normal_artifact;panel_of_normals
12 base_qual;normal_artifact;strand_bias
9 strand_bias;weak_evidence
So most variants were filtered out, and most of the variants were filtered out because of ‘weak evidence’.
In order to figure out what that means we can check out the vcf header:
bcftools view -h somatic.filtered.vcf.gz | grep "^##FILTER"
Resulting in:
##FILTER=<ID=FAIL,Description="Fail the site if all alleles fail but for different reasons.">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=possible_numt,Description="Allele depth is below expected coverage of NuMT in autosome">
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">
Showing us that these mutations do not meet the likelihood threshold, basically telling us that these are the variants filtered out because of the combined error probability based on technical artifacts, possibility of germline variants, and sequencing error. The other filters are so called ‘hard filters’, meaning that by themselves they do not meet a fixed threshold.
EXTRA: Summarizing the VCF
Do the exercises about summarizing the VCF only if you have time. Otherwise, continue to Copy number variation calling.
Let’s check out the VCF in a bit more detail. We’d like to have an idea what kind of variants are in there, and what for example their variant allele frequency is.
First, we can check what the variants look like:
bcftools view -H somatic.filtered.vcf.gz | head
Showing us that both the FORMAT
and INFO
fields are filled with information. To check out the meaning of what is in the INFO
field you can run:
bcftools view -h somatic.filtered.vcf.gz | grep "^##INFO"
And you can do the same for the format fields:
bcftools view -h somatic.filtered.vcf.gz | grep "^##FORMAT"
To summarize the vcf for information in the format or info fields you can use bcftools query
. For example, for getting the likelihood of variant and depth of each sample, you can do the following (for the first 20 variants):
bcftools query -f '%CHROM\t%POS\t%FILTER\t\t%INFO/TLOD\t[%SAMPLE=%DP;]\n' somatic.filtered.vcf.gz | head -20
Instead of the depth DP
for each sample get the variant allele frequency for each sample, and compare that with the FILTER column.
We can check which tag contains the variant allele frequency in the vcf:
bcftools view -h somatic.filtered.vcf.gz | grep "^##FORMAT"
Returning:
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
Where AF
contains ‘Allele fractions of alternate alleles in the tumor’, so that is the tag we are looking for. So, we replace DP
with AF
:
bcftools query -f '%CHROM\t%POS\t%FILTER\t\t%INFO/TLOD\t[%SAMPLE=%AF;]\n' somatic.filtered.vcf.gz | head -30
Returning:
chr6 106265 map_qual;strand_bias;weak_evidence 3.99 normal=0.00671;tumor=0.01;
chr6 325357 PASS 719.91 normal=0.001054;tumor=0.161;
chr6 325783 PASS 1292.31 normal=0.0009626;tumor=0.224;
chr6 325901 PASS 1030.6 normal=0.005447;tumor=0.183;
chr6 325916 PASS 937.7 normal=0.004142;tumor=0.116;
chr6 1742560 panel_of_normals 235.6 normal=0.02;tumor=0.41;
chr6 1916336 PASS 187.23 normal=0.022;tumor=0.482;
chr6 1930220 PASS 152.54 normal=0.016;tumor=0.207;
chr6 2674999 PASS 190.32 normal=0.028;tumor=0.514;
chr6 2834013 PASS 245.04 normal=0.02;tumor=0.494;
chr6 2840548 PASS 274.4 normal=0.021;tumor=0.522;
chr6 3110986 weak_evidence 3.03 normal=0.017;tumor=0.019;
chr6 3723708 PASS 266.67 normal=0.017;tumor=0.42;
chr6 4892169 weak_evidence 4.43 normal=0.024;tumor=0.02;
chr6 5103987 germline;multiallelic;normal_artifact;panel_of_normals 13.57,257.27 normal=0.037,0.825;tumor=0.077,0.834;
chr6 6250887 PASS 260.99 normal=0.018;tumor=0.459;
chr6 7176847 PASS 107.39 normal=0.026;tumor=0.319;
chr6 7585458 PASS 171.88 normal=0.015;tumor=0.283;
chr6 7833697 PASS 122.24 normal=0.057;tumor=0.513;
chr6 7883281 germline 99.5 normal=0.05;tumor=0.476;
chr6 8054329 PASS 420.89 normal=0.045;tumor=0.446;
chr6 9900366 panel_of_normals 525.07 normal=0.024;tumor=0.992;
chr6 10397660 germline;multiallelic;slippage 3.21,50.94 normal=0.042,0.043;tumor=0.098,0.798;
chr6 10398435 PASS 365.76 normal=0.013;tumor=0.943;
chr6 10801908 PASS 91.41 normal=0.02;tumor=0.312;
chr6 10989772 PASS 71.25 normal=0.041;tumor=0.49;
chr6 11306007 PASS 209.24 normal=0.009359;tumor=0.4;
chr6 13479062 germline;multiallelic;normal_artifact;panel_of_normals 4.63,5.86,13.96 normal=0.022,0.027,0.258;tumor=0.043,0.068,0.165;
chr6 13977507 weak_evidence 3.4 normal=0.012;tumor=0.027;
chr6 16147945 PASS 331.72 normal=0.014;tumor=0.978;
Where we see that if there’s the filter weak_evidence
the VAF of the tumor is typically low.
Copy number variation calling
Variation of copy number in genes can have a large effect on the phenotype of a tumor. Therefore, we will also estimate the copy number variation occurring on chromosome 6 and 17. For that, we use CNVkit.
Check out the documentation of CNVkit.py bash
, and the helper (cnvkit.py batch -h
), create a script called 05_run_cnvkit.sh
, and replace the missing values at FIXME
.
After that, checkout the visualizations (tumor-scatter.png
and tumor-scatter.png
) at ~/project/course_data/variants/cnvkit
. Do you see any evidence for copy number variation?
#!/usr/bin/env bash
ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants
cnvkit.py batch FIXME \
\
--normal FIXME \
--targets FIXME \
--fasta FIXME "$RESOURCEDIR"/refFlat.txt \
--annotate "$VARIANTDIR"/reference.cnn \
--output-reference "$VARIANTDIR"/cnvkit/ \
--output-dir \
--processes 4 \
--scatter --diagram
The scale on the scatter plot is log2(copy ratio). So, if there is a duplication at one chromsome you would expect a copy ratio of 1.5 (3 chromsomes/2 chromsomes). The log2 of 1.5 is 0.58. So estimates at 0.58 mean a gain of one copy, and estimates at -0.5 (log2(0.5)) a loss of one copy.
We provide the tumor bam, normal bam, the interval list (can also be a bed file) and our reference genome:
#!/usr/bin/env bash
ALIGNDIR=~/project/course_data/alignments
REFDIR=~/project/course_data/reference
RESOURCEDIR=~/project/course_data/resources
VARIANTDIR=~/project/course_data/variants
cnvkit.py batch "$ALIGNDIR"/tumor.recal.bam \
"$ALIGNDIR"/normal.recal.bam \
--normal "$REFDIR"/exome_regions.bed.interval_list \
--targets "$REFDIR"/ref_genome.fa \
--fasta "$RESOURCEDIR"/refFlat.txt \
--annotate "$VARIANTDIR"/reference.cnn \
--output-reference "$VARIANTDIR"/cnvkit/ \
--output-dir \
--processes 4 \
--scatter --diagram
After running cnvkit, multiple files are created in ~/project/course_data/variants/cnvkit
. The scatter plot is typically interesting to investigate at first:
Here we see that there is evidence for both losses and gains on both chromosomes.