A Practical Introduction
There are many mechanisms of DNA repair. Some of the most common are:
NER (Nucleotide Excision Repair): Repairs bulky lesions such as thymine dimers caused by UV radiation. This mechanism involves the removal of a short single-stranded DNA segment containing the damage, followed by DNA synthesis using the complementary strand as a template.
MMR (Mismatch Repair): Fixes replication errors such as base mismatches or insertion/deletion loops. MMR recognizes the newly synthesized strand and corrects errors by removing the incorrect nucleotides and replacing them.
BER (Base Excision Repair): Repairs single-base lesions such as oxidative damage, alkylation, or deamination. This involves the removal of damaged bases by specific glycosylases, followed by excision of the resulting abasic site, gap filling by a DNA polymerase, and ligation.
HR (Homologous Recombination): Repairs double-strand breaks using a homologous sequence as a template, typically from a sister chromatid. This is a largely error-free repair mechanism.
NHEJ (Non-Homologous End Joining): Repairs double-strand breaks without the need for a homologous template. While quicker, this method is error-prone and can lead to insertions or deletions.
Some mutations are more important than others for tumor progression, typically because they are more disruptive and detrimental for the cell that harbors them. Such mutations can be observed in multiple cancer types (think of TP53 or BRCA1/2).
Not all driver mutations are known, and a mutation that acts as a driver in one context is not necessarily a driver in every context.
Not all mutations are driving the tumorigenesis!
Functional Impact:
Recurrent Patterns Across Tumors:
COSMIC
What it tells us: Somatic mutations found in cancer
Example output: COSV59384583; OCCURENCE=1(skin)
Interpretation: Mutation seen in skin cancer; frequency helps assess if likely driver
ClinVar
What it tells us: Clinical significance of variants
Example output: Pathogenic/Likely_pathogenic
Interpretation: Strong evidence for disease causation
gnomAD
What it tells us: Population frequency
Example output: AF=0.00001
Interpretation: Very rare variant (<0.01% frequency)
AlphaMissense
What it tells us: Predicted functional impact
Example output: likely_pathogenic
Interpretation: AI model predicts damaging effect
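As a quick arithmetic check on the gnomAD example above: an allele frequency is a fraction, so multiplying by 100 gives a percentage, and 0.00001 × 100 = 0.001%, which is below the 0.01% mark quoted in the interpretation. The sketch below does this conversion with awk. The function name af_to_percent and the use of 0.01% as the cut-off for the "very rare" label are assumptions made for illustration, not gnomAD conventions.

```shell
# Convert an allele frequency (a fraction) to a percentage and label it.
# Hypothetical helper; the 0.01% threshold is only the one quoted in the text.
af_to_percent() {
  awk -v af="$1" 'BEGIN {
    pct = af * 100
    label = (pct < 0.01) ? "very rare" : "not very rare"
    printf "%.4f%% (%s)\n", pct, label
  }'
}

af_to_percent 0.00001
af_to_percent 0.05
```

The first call reports the example variant as very rare; the second shows a common variant for contrast.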
Imagine having thousands of variants to annotate manually to understand what each of them does, searching in different databases and in the literature. It would take weeks and it would be error-prone. Automating these tasks reduces the time to discovery.
VEP is a great tool, but it does not remove the work from the scientists. Automation simplifies the process but it can give false positives and false negatives. If the experimental setup is low quality, the results most likely will be low quality too.
There are no databases investigating those aspects of biology that might play an important role in cancer development. After all, it is estimated that >97% of all mutations are “passenger events” that do not have a direct impact on tumor growth.
We can answer questions like:
# Installing Conda (see https://docs.anaconda.com/miniconda/)
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda init --all
# Using Conda
conda create -n vep -c conda-forge -c bioconda samtools python=3.10 ensembl-vep=113
# conda install -c bioconda ensembl-vep # if installing in current env
# Activate VEP's env
conda activate vep
# Test installation
vep --help
# Download cache files into ~/.vep; this takes a long time and needs ~25GB of disk space
vep_install -a cf -s homo_sapiens -y GRCh38 # download precomputed human data
# Optional: install official plugins
vep_install -a p --PLUGINS list # -a p selects the plugin action; "list" lists all available plugins
# vep_install -a p --PLUGINS all # install all plugins
# vep_install -a p --PLUGINS dbNSFP,... # install specific plugins
If you want to store VEP’s cache in a specific folder, add the -c option followed by the path, for example: -c ~/vep_cache
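A minimal sketch of working with a custom cache folder, assuming ~/vep_cache as the location. The -c option of vep_install sets the cache directory at download time, and --dir_cache points vep at the same folder at run time. The wrapper function names and the file names in the usage comment are hypothetical.

```shell
VEP_CACHE="${VEP_CACHE:-$HOME/vep_cache}"   # assumed location; change as needed

# Download the human GRCh38 cache and FASTA into the custom folder.
install_cache() {
  vep_install -a cf -s homo_sapiens -y GRCh38 -c "$VEP_CACHE"
}

# Annotate using the same folder; $1 = input VCF, $2 = output file.
annotate_with_cache() {
  vep -i "$1" --cache --dir_cache "$VEP_CACHE" --assembly GRCh38 \
      --format vcf --symbol --pick --output_file "$2"
}

# Usage (inside the activated vep environment):
#   install_cache
#   annotate_with_cache somatic.vcf output.txt
```

Keeping the path in one variable avoids the common mistake of downloading the cache to one folder and then running vep against the default ~/.vep.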
Some plugins, such as SubsetVCF, do not need additional data and work directly on VEP’s output:
- installing "SubsetVCF"
- add "--plugin SubsetVCF" to your VEP command to use this plugin
- OK
Others, like REVEL, need additional data to work properly:
- installing "REVEL"
- This plugin requires data
- See Plugins/REVEL.pm for details
- OK
Plugins can enhance VEP’s capabilities and add depth to the annotations (at the price of more complexity).
If this command does not work
Please use this manual option
VEP options for handling multiple transcripts:
--pick: One consequence per variant
--pick_allele: One per variant allele
--pick_allele_gene: One per variant allele per gene
--per_gene: One per gene
--flag_pick: Flags selected while keeping others
Top 10 most severe consequences:
HIGH: Likely to cause a severe effect on protein structure/function.
MODERATE: Non-disruptive changes that might affect protein structure or function.
LOW: Variants that are less likely to have a significant effect on protein function.
MODIFIER: Usually non-coding or intergenic variants with no expected impact on the protein, but which might influence gene regulation.
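The impact categories described above appear in VEP's default tab-delimited output as an IMPACT=... entry (HIGH, MODERATE, LOW or MODIFIER) inside the last column, Extra. A quick way to get a feel for an annotated file is to tally them. The following is a minimal sketch, assuming the default output format; the function name count_impacts is hypothetical.

```shell
# Tally IMPACT values from the Extra column (the last column) of a
# default-format VEP output file. $1 = VEP output file.
count_impacts() {
  grep -v '^#' "$1" | awk -F'\t' '
    {
      n = split($NF, kv, ";")          # Extra looks like KEY=VALUE;KEY=VALUE
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^IMPACT=/) {
          sub(/^IMPACT=/, "", kv[i])   # keep only the value
          count[kv[i]]++
        }
    }
    END { for (k in count) printf "%s\t%d\n", k, count[k] }
  ' | sort
}

# Usage: count_impacts output.txt
```

A file dominated by MODIFIER entries is expected for whole-genome data; a large share of HIGH entries usually deserves a second look at the input.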
Input VCF:
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr7 55242465 rs121913529 A T . PASS .
chr17 7577121 rs28934578 C T . PASS .
chr13 32936646 rs28897743 C T . PASS .
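VCF columns must be separated by tabs, and copying the records above from a web page often turns the tabs into spaces. The sketch below writes the same three records with explicit tab characters and checks that every non-meta line has the eight expected columns. The function names are hypothetical, and the file name input.vcf is just an example.

```shell
# Write the example VCF with real tab separators; $1 = output path.
write_example_vcf() {
  {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr7\t55242465\trs121913529\tA\tT\t.\tPASS\t.\n'
    printf 'chr17\t7577121\trs28934578\tC\tT\t.\tPASS\t.\n'
    printf 'chr13\t32936646\trs28897743\tC\tT\t.\tPASS\t.\n'
  } > "$1"
}

# Exit status 0 only if every non-meta line has exactly 8 tab-separated fields.
check_vcf_columns() {
  awk -F'\t' '!/^##/ && NF != 8 { bad = 1 } END { exit bad + 0 }' "$1"
}

write_example_vcf input.vcf
check_vcf_columns input.vcf && echo "input.vcf: 8 tab-separated columns per line"
```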
Results:
Then we get something more informative like:
For ClinVar
# Download latest ClinVar VCF
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
For COSMIC
We cannot provide those files (a license is required). You need to register on the COSMIC website and download the files yourself, for example CosmicMutantExport.tsv.gz.
vep -i somatic.vcf \
--cache \
--assembly GRCh38 \
--format vcf \
--symbol \
--check_existing \
--pick \
--output_file output.txt
Instead of --symbol you can use --hgvs, or pass both: --hgvs --symbol
Adding more specific annotations:
vep -i input.vcf \
--cache \
--assembly GRCh38 \
--format vcf \
--symbol \
--pick \
--sift b \
--polyphen b \
--force_overwrite \
--output_file output.txt
Essential plugins for cancer analysis:
REVEL paper: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants
You can download the data (~6.5GB) here: https://sites.google.com/site/revelgenomics/downloads or https://zenodo.org/records/7072866
unzip revel-v1.3_all_chromosomes.zip
cat revel_with_transcript_ids | tr "," "\t" > tabbed_revel.tsv
sed '1s/.*/#&/' tabbed_revel.tsv > new_tabbed_revel.tsv
bgzip new_tabbed_revel.tsv
Prepare for GRCh38
zcat new_tabbed_revel.tsv.gz | head -n1 > h
zgrep -h -v ^#chr new_tabbed_revel.tsv.gz | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat h - | bgzip -c > new_tabbed_revel_grch38.tsv.gz
tabix -f -s 1 -b 3 -e 3 new_tabbed_revel_grch38.tsv.gz
Usage:
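A minimal sketch of running VEP with the REVEL plugin, assuming the file prepared above (new_tabbed_revel_grch38.tsv.gz) and the same base options used earlier in this tutorial. Recent plugin versions take the path as file=...; older ones may expect the bare path as the first argument, so check Plugins/REVEL.pm if the option is rejected. The function name annotate_revel is hypothetical.

```shell
# $1 = input VCF, $2 = prepared REVEL file (bgzipped + tabix-indexed), $3 = output
annotate_revel() {
  vep -i "$1" --cache --assembly GRCh38 --format vcf --symbol --pick \
      --force_overwrite \
      --plugin "REVEL,file=$2" \
      --output_file "$3"
}

# Usage:
#   annotate_revel input.vcf /full/path/to/new_tabbed_revel_grch38.tsv.gz output_revel.txt
```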
AlphaMissense’s Paper: Accurate proteome-wide missense variant effect prediction with AlphaMissense
Prepare for GRCh38
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gz
Run it with VEP
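A minimal sketch, assuming the indexed AlphaMissense_hg38.tsv.gz from the previous step and the plugin's file= argument, which expects the full path to the file. The plugin adds the am_class and am_pathogenicity fields, so the second function keeps only rows classed as likely_pathogenic using filter_vep. Both function names are hypothetical.

```shell
# $1 = input VCF, $2 = tabix-indexed AlphaMissense file, $3 = output file
annotate_alphamissense() {
  vep -i "$1" --cache --assembly GRCh38 --format vcf --symbol --pick \
      --force_overwrite \
      --plugin "AlphaMissense,file=$2" \
      --output_file "$3"
}

# Keep only rows whose am_class is likely_pathogenic; $1 = VEP output file.
keep_am_likely_pathogenic() {
  filter_vep -i "$1" --filter "am_class is likely_pathogenic"
}

# Usage:
#   annotate_alphamissense input.vcf /full/path/to/AlphaMissense_hg38.tsv.gz output_am.txt
#   keep_am_likely_pathogenic output_am.txt | head
```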
dbNSFP v4 paper: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs
A VEP plugin that retrieves data for missense variants from a tabix-indexed dbNSFP file.
Prepare the data
version=4.7c
wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP${version}.zip
unzip dbNSFP${version}.zip
zcat dbNSFP${version}_variant.chr1.gz | head -n1 > h
Prepare for GRCh38
zgrep -h -v ^#chr dbNSFP${version}_variant.chr* | sort -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP${version}_grch38.gz
tabix -s 1 -b 2 -e 2 dbNSFP${version}_grch38.gz
Run it with VEP
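A minimal sketch, assuming the merged and indexed dbNSFP file built above. The dbNSFP plugin takes the file path followed by the column names to report, and the names must match the header of your dbNSFP release exactly. The columns in the usage comment (SIFT_score, REVEL_score, CADD_phred) are examples to verify against the header saved in the file h; the function name annotate_dbnsfp is hypothetical.

```shell
# $1 = input VCF, $2 = tabix-indexed dbNSFP file,
# $3 = comma-separated dbNSFP column names, $4 = output file
annotate_dbnsfp() {
  vep -i "$1" --cache --assembly GRCh38 --format vcf --symbol --pick \
      --force_overwrite \
      --plugin "dbNSFP,$2,$3" \
      --output_file "$4"
}

# List the available column names (one per line) from the saved header:
#   tr '\t' '\n' < h | head -n 40
# Usage:
#   annotate_dbnsfp input.vcf /full/path/to/dbNSFP4.7c_grch38.gz SIFT_score,REVEL_score,CADD_phred output_dbnsfp.txt
```

Requesting only the columns you need keeps the output readable, since dbNSFP carries several hundred columns.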
You can use filter_vep, the filtering tool included in VEP’s suite of tools. It generally works very well with data that have been VEP-annotated.
Filter SIFT deleterious events:
filter_vep -i variant_effect_output.txt -filter "SIFT is deleterious" | grep -v "##" | head -n5
It can also be used with pipes, which avoids writing an intermediate file (and might save memory).
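A minimal sketch of the piped form: vep writes its annotations to standard output (by giving STDOUT as the output file name) and filter_vep reads them from standard input, so the unfiltered annotation file is never written to disk. The base options mirror the earlier commands in this tutorial, and the function name annotate_and_filter is hypothetical.

```shell
# $1 = input VCF, $2 = output file that receives only the filtered annotations
annotate_and_filter() {
  vep -i "$1" --cache --assembly GRCh38 --format vcf --symbol --sift b \
      --output_file STDOUT \
    | filter_vep --filter "SIFT is deleterious" --output_file "$2"
}

# Usage:
#   annotate_and_filter input.vcf sift_deleterious.txt
```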
There are many other plugins that can be used depending on the context and the specific biological question at hand. You can have a look here: Plugins for VEP
You can run multiple plugins at the same time.
The more plugins you use, the more computationally expensive the run can become.
There is a nice help page (vep --help) included in VEP that can be useful for consultation.
Feel free to drop a line in the chat or to contact us.