Skip to content

Annotation

Learning outcomes

After having completed this chapter you will be able to:

  • Describe the aims of variant annotation
  • Explain how variants are ranked in order of importance
  • Explain how splice variation affects variant annotation
  • Perform a variant annotation with snpEff
  • Interpret the report generated by snpEff
  • Explain how variant annotation can be added to a vcf file

Material

Presentation will be sent to you by e-mail.

Exercises

To use the human genome as a reference, we have downloaded the database with:

No need to download, it’s already downloaded for you

# don't run this. It's already downloaded for you
snpEff download -v GRCh38.99

You can run snpEff like so:

mkdir annotation

snpEff -Xmx4g \
-v \
-o gatk \
GRCh38.99 \
variants/trio.filtered.vcf > annotation/trio.filtered.snpeff.vcf

Output -o gatk is deprecated for gatk4

Here, we use output -o gatk for readability reasons (only one effect per variant is reported). With gatk3 you could use gatk VariantAnnotator with input from snpEff. In gatk4 that is not supported anymore.

Exercise: Run the command, and check out the html file (snpEff_summary.html). Try to answer these questions:

A. How many effects were calculated?

B. How many variants are in the vcf?

C. Why is this different?

D. How many effects result in a missense mutation?

Answer

A. There were 10,357 effects calculated.

B. There are only 556 variants in the vcf.

C. This means that there are multiple effects per variant. snpEff calculates effects for each splice variant, and therefore the number of effects are a multitude of the number of variants.

D. Two effects result in a missense mutation.

You can (quick and dirty) query the annotation vcf (trio.filtered.snpeff.vcf) for the missense mutation with grep.

Exercise: Find the variant causing the missense mutation (the line contains the string MISSENSE). And answer the following questions:

Hint
grep MISSENSE annotation/trio.filtered.snpeff.vcf

Only one effect per SNP in the vcf

In the vcf we have created you can only find one effect per SNP. If you would run snpEff without -o gatk, you would get all effects per variant.

A. How are the SNP annotations stored in the vcf?

B. What are the genotypes of the individuals?

C. Which amino acid change does it cause?

Answer

Find the line with the missense mutation like this:

grep MISSENSE annotation/trio.filtered.snpeff.vcf

This results in (long line, scroll to the right to see more):

chr20   10049540        .       T       A       220.29  PASS    AC=1;AF=0.167;AN=6;BaseQRankSum=-6.040e-01;DP=85;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=8.16;ReadPosRankSum=0.226;SOR=0.951;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cTg/cAg|L324Q|ANKEF1|protein_coding|CODING|ENST00000378392|7)      GT:AD:DP:GQ:PL   0/0:34,0:34:99:0,102,1163  0/1:17,10:27:99:229,0,492       0/0:24,0:24:72:0,72,811

A. SNP annotations are stored in the INFO field, starting with EFF=

B. The genotypes are homozygous reference for the father and son, and heterozygous for the mother. (find the order of the samples with grep ^#CHROM)

C. The triplet changes from cTg to cAg, resulting in a change from L (Leucine) to Q (Glutamine).