Filtering & evaluation

Learning outcomes

After having completed this chapter you will be able to:

Explain why using Variant Quality Score Recalibration (VQSR) for filtering variants can outperform hard filtering
Perform hard filtering on both SNPs and INDELs separately by using gatk SelectVariants in combination with gatk VariantFiltration
Perform concordance between called variants and a truth set and evaluate performance of a variant calling workflow

Material

Exercises

1. Hard filtering

The developers of gatk strongly advise to do Variant Quality Score Recalibration (VQSR) for filtering SNPs and INDELs. However, this is not always possible. For example, in the case of limited data availability and/or in the case you are working with non-model organisms and/or in the case you are a bit lazy and okay with a number of false positives.

Our dataset is too small to apply VQSR. We will therefore do hard filtering instead.

Splitting SNPs and INDELs

First, filtering thresholds are usually different for SNPs and INDELs. You can extract all the SNP records in our trio vcf like this:

cd ~/workdir

gatk SelectVariants \
--variant variants/trio.vcf \
--select-type SNP \
--output variants/trio.SNP.vcf

Exercise: Check out the documentation of gatk SelectVariants, and:

Figure out what you’ll need to fill in at --select-type if you want to select only INDELS.
Generate a vcf with only the SNPs and a second vcf with only the INDELs from trio.vcf.

Answer

You will need to fill in INDEL at --select-type to filter for INDELs.

To get the SNPs you can run the command above. To get the INDELs you’ll need to change --select-type to INDEL:

gatk SelectVariants \
--variant variants/trio.vcf \
--select-type INDEL \
--output variants/trio.INDEL.vcf

Filtering SNPs

The command gatk VariantFiltration enables you to filter for both the INFO field (per variant) and FORMAT field (per genotype). For now we’re only interested in filtering variants. Below you can find the command to hard-filter the SNP variants on some sensible thresholds (that are explained here).

gatk VariantFiltration \
--variant variants/trio.SNP.vcf \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
--output variants/trio.SNP.filtered.vcf

Exercise: Run the filtering command above. Did it affect the number of records in the vcf?

Hint

You can check out the number of records in a vcf with:

grep -v "^#" <variants.vcf> | wc -l

Answer

There are no differences in the number of records:

grep -v "^#" variants/trio.SNP.vcf | wc -l

and

grep -v "^#" variants/trio.SNP.filtered.vcf | wc -l

both give 446.

However, there are SNPs filtered out, by changing the FILTER column. You can check the number of records with PASS by:

grep -v "^#" variants/trio.SNP.filtered.vcf | cut -f 7 | sort | uniq -c

Giving:

441 PASS
2 QD2;SOR3
3 SOR3

Filtering INDELs

A command with sensible parameters to do a first iteration of hard filtering the INDELs would be:

gatk VariantFiltration \
--variant variants/trio.INDEL.vcf \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
--output variants/trio.INDEL.filtered.vcf

Exercise: Run the command and figure out how many variants are filtered out.

Hint

You can use this command from the answer to the previous exercise:

grep -v "^#" <variants.vcf> | cut -f 7 | sort | uniq -c

to see how many INDELs were filtered out.

Answer

grep -v "^#" variants/trio.INDEL.filtered.vcf | cut -f 7 | sort | uniq -c

gives:

110 PASS

So no variants are filtered out.

Merging filtered SNPs and INDELs

Now that we have filtered the INDELs and SNPs separately, we can merge them again with this command:

gatk MergeVcfs \
--INPUT variants/trio.SNP.filtered.vcf \
--INPUT variants/trio.INDEL.filtered.vcf \
--OUTPUT variants/trio.filtered.vcf

Exercise: Run the command to merge the vcfs.

2. Evaluation by concordance

For this region we have a highly curated truth set for the mother available. It originates from the Illumina Platinum truth set. You can find it at data/variants/NA12878.vcf.gz

To check how well we did, we’d first need to extract a vcf with only the information of the mother.

Exercise: To extract variants that have at least one alternative allele in the mother from variants/trio.filtered.vcf, use gatk SelectVariants with the arguments:

--sample-name mother
--exclude-non-variants
--remove-unused-alternates

In addition to the required arguments.

Answer

gatk SelectVariants \
--variant variants/trio.filtered.vcf \
--sample-name mother \
--exclude-non-variants \
--remove-unused-alternates \
--output variants/mother.trio.filtered.vcf

Exercise:

A. How many variants are in mother.trio.filtered.vcf? How many of those are filtered out?

B. Compare our vcf with the curated truth set with the command below. How many SNPs didn’t we detect?

gatk Concordance \
--evaluation variants/mother.trio.filtered.vcf \
--truth data/variants/NA12878.vcf.gz \
--intervals chr20:10018000-10220000 \
--summary variants/concordance.mother.trio.filtered

Answer

To get the number of records per FILTER, we run:

grep -v "^#" variants/mother.trio.filtered.vcf | cut -f 7 | sort | uniq -c

gives:

407 PASS
2 SOR3

So two records were filtered out, based on the Symmetric Odds Ratio (issues with strand bias).

Check out the output of gatk Concordance with cat:

cat variants/concordance.mother.trio.filtered

gives:

type    TP      FP      FN      RECALL  PRECISION
SNP     319     5       9       0.973   0.985
INDEL   63      20      6       0.913   0.759

Showing that there were 9 false negatives, i.e. SNPs we didn’t detect.

Recall & precision

More info on the definition of recall and precision on this wikipedia page

Exercise: Check out the concordance of the mother with the truth set before filtering. Did filtering improve the recall or precision?

Note

We did the filtering on trio.vcf, therefore, you first have to extract the records that only apply to the mother by using gatk SelectVariants.

Also note that trio.vcf contains records other than SNPs and INDELs. Use --select-type to select only SNPs and INDELs.

Answer

First select only SNPs and INDELs from the mother from the unfiltered vcf:

gatk SelectVariants \
--variant variants/trio.vcf \
--sample-name mother \
--exclude-non-variants \
--remove-unused-alternates \
--select-type INDEL \
--select-type SNP \
--output variants/mother.trio.vcf

Get the concordance with the truth set:

gatk Concordance \
--evaluation variants/mother.trio.vcf \
--truth data/variants/NA12878.vcf.gz \
--intervals chr20:10018000-10220000 \
--summary variants/concordance.mother.trio

Which gives:

type    TP      FP      FN      RECALL  PRECISION
SNP     319     7       9       0.973   0.979
INDEL   63      20      6       0.913   0.759

The precision for SNPs is slightly lower. Due to filtering, we removed two false positives.