A Practical Introduction
2025-12-12

Yes, there are many mechanisms of DNA repair. Some of the most common are:
NER (Nucleotide Excision Repair): Repairs bulky lesions such as thymine dimers caused by UV radiation. This mechanism involves the removal of a short single-stranded DNA segment containing the damage, followed by DNA synthesis using the complementary strand as a template.
MMR (Mismatch Repair): Fixes replication errors such as base mismatches or insertion/deletion loops. MMR recognizes the newly synthesized strand and corrects errors by removing the incorrect nucleotides and replacing them.
BER (Base Excision Repair): Repairs single-base lesions such as oxidative damage, alkylation, or deamination. This involves the removal of damaged bases by specific glycosylases, followed by excision of the resulting abasic site, gap filling by a DNA polymerase, and ligation.
HR (Homologous Recombination): Repairs double-strand breaks using a homologous sequence as a template, typically from a sister chromatid. This is a high-fidelity, largely error-free repair mechanism.
NHEJ (Non-Homologous End Joining): Repairs double-strand breaks without the need for a homologous template. While quicker, this method is error-prone and can lead to insertions or deletions.
Figure 3: Cancer somatic mutations, The cancer genome, Stratton et al, 2009
Some mutations matter more than others for tumor progression, perhaps because they are more disruptive to normal cell function. These can be observed across multiple cancer types (think of TP53 or BRCA1/2).
Not all driver mutations are known, and a mutation that acts as a driver in one context is not necessarily a driver in every context.
Not all mutations drive tumorigenesis!
Functional Impact:
Recurrent Patterns Across Tumors:
(a) Ensembl-VEP
COSMIC (Catalogue of Somatic Mutations in Cancer)
| Field | Content |
|---|---|
| Purpose | Somatic mutations in cancer |
| Example | COSV59384583; OCCURENCE=1(skin) |
| Interpretation | Mutation seen in skin cancer; frequency indicates driver likelihood |
ClinVar (Clinical Variant Database)
| Field | Content |
|---|---|
| Purpose | Clinical significance assessment |
| Example | Pathogenic/Likely_pathogenic |
| Interpretation | Strong evidence for disease causation |
gnomAD (Genome Aggregation Database)
| Field | Content |
|---|---|
| Purpose | Population allele frequency |
| Example | AF=0.00001 |
| Interpretation | Very rare (<0.01% frequency) |
AlphaMissense
| Field | Content |
|---|---|
| Purpose | AI-predicted functional impact |
| Example | likely_pathogenic |
| Interpretation | Model predicts damaging effect |
Filtering Strategy
Combine databases for variant prioritization:
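As a rough illustration of combining these four sources, the Python sketch below counts how many lines of evidence support a variant. The function name, the 0.01% allele-frequency cutoff, and the rule "seen at least once in COSMIC" are assumptions made for this example, not a recommended clinical threshold.

```python
from typing import Optional

# ClinVar labels counted as supporting evidence (an assumption for this sketch).
PATHOGENIC_LABELS = {"pathogenic", "likely_pathogenic", "pathogenic/likely_pathogenic"}


def evidence_count(gnomad_af: Optional[float],
                   clinvar: Optional[str],
                   cosmic_occurrences: int,
                   alphamissense: Optional[str],
                   max_af: float = 0.0001) -> int:
    """Count supporting lines of evidence (0 to 4) for one variant."""
    count = 0
    # Rare, or absent from gnomAD altogether.
    if gnomad_af is None or gnomad_af < max_af:
        count += 1
    # ClinVar asserts pathogenicity.
    if clinvar is not None and clinvar.lower() in PATHOGENIC_LABELS:
        count += 1
    # Observed in at least one COSMIC tumor sample.
    if cosmic_occurrences > 0:
        count += 1
    # AlphaMissense predicts a damaging effect.
    if alphamissense == "likely_pathogenic":
        count += 1
    return count


# Values taken from the example rows of the tables above.
score = evidence_count(gnomad_af=0.00001,
                       clinvar="Pathogenic/Likely_pathogenic",
                       cosmic_occurrences=1,
                       alphamissense="likely_pathogenic")
print(score)
```

Variants with a higher count would be reviewed first; a low count does not prove that a variant is a passenger.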
The Challenge: Imagine annotating thousands of variants manually - searching databases, reviewing literature, interpreting functional impact. This would take weeks and be highly error-prone.
VEP automates annotation to accelerate discovery:
What VEP Does NOT Do
VEP is powerful, but it’s not magic:
Remember: Automation simplifies the process, but critical thinking and validation remain essential!
There are no databases that investigate those aspects of biology that might play an important role in cancer development. After all, it is estimated that >97% of all mutations are “passenger events” that have no direct impact on tumor growth.
We can answer questions like:
The Challenge: A single variant can have 4-12+ annotations due to multiple transcript isoforms. Which one matters?
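To see where the multiple annotations come from, here is a minimal Python sketch that splits a VEP-style CSQ value into one record per transcript. In VCF output, entries are separated by commas and fields by pipes, with the field order given in the VCF header. The shortened field list and the transcript identifiers (ENST_A, ENST_B, ENST_C) are invented for illustration.

```python
from typing import Dict, List


def parse_csq(csq_value: str, csq_format: str) -> List[Dict[str, str]]:
    """Turn a CSQ string into a list of {field: value} records."""
    fields = csq_format.split("|")
    records = []
    for entry in csq_value.split(","):      # one entry per transcript/allele
        values = entry.split("|")           # one value per header field
        records.append(dict(zip(fields, values)))
    return records


# Hypothetical, shortened header format and CSQ value.
CSQ_FORMAT = "Allele|Consequence|IMPACT|SYMBOL|Feature|CANONICAL"
CSQ_VALUE = (
    "T|missense_variant|MODERATE|TP53|ENST_A|YES,"
    "T|missense_variant|MODERATE|TP53|ENST_B|,"
    "T|downstream_gene_variant|MODIFIER|WRAP53|ENST_C|"
)

records = parse_csq(CSQ_VALUE, CSQ_FORMAT)
for record in records:
    print(record["SYMBOL"], record["Feature"], record["Consequence"])
```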
VEP options to simplify output when a gene has multiple transcripts:
--pick: One consequence per variant (Best for simple filtering)
--pick_allele: One per variant allele
--per_gene: One per gene
--flag_pick: Flags the “chosen” one but keeps others
--pick_allele_gene: Most comprehensive filtering
Selection hierarchy:
Top 10 Most Severe:
Cancer examples:
Practical Tip
For clinical reporting, use --pick with --transcript_version to ensure reproducibility. Always document which transcript was used.
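The idea behind picking a single annotation can be sketched in Python as follows. This is a deliberately simplified stand-in: it prefers the canonical transcript and then the most severe consequence, whereas VEP's own selection uses a longer ordered list of criteria. The severity list is a partial subset, and the record layout and transcript identifiers are assumptions for this example.

```python
from typing import Dict, List

# Partial severity ranking, most severe first (illustrative subset only).
SEVERITY = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]


def severity_rank(consequence: str) -> int:
    """Rank of the worst term; a field may hold several terms joined by '&'."""
    ranks = [SEVERITY.index(t) for t in consequence.split("&") if t in SEVERITY]
    return min(ranks) if ranks else len(SEVERITY)


def pick_one(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Prefer canonical transcripts, then the most severe consequence."""
    return min(
        records,
        key=lambda r: (r.get("CANONICAL") != "YES", severity_rank(r["Consequence"])),
    )


annotations = [
    {"Feature": "ENST_A", "Consequence": "stop_gained", "CANONICAL": ""},
    {"Feature": "ENST_B", "Consequence": "missense_variant", "CANONICAL": "YES"},
]
chosen = pick_one(annotations)
print(chosen["Feature"], chosen["Consequence"])
```

Note that the canonical missense annotation wins over the non-canonical stop gained here, which is exactly why the chosen transcript should be documented.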

| Impact | Description | Examples |
|---|---|---|
| HIGH | Disruptive. Likely loss of function. | Frameshift, Stop Gained, Splice acceptor/donor. |
| MODERATE | Non-disruptive change to protein. | Missense, In-frame indel. |
| LOW | Unlikely to change function. | Synonymous, Splice region (non-canonical). |
| MODIFIER | Non-coding / Regulatory. | Intronic, UTRs, Intergenic. |
Important Exceptions
Not all HIGH impact = pathogenic
Not all MODERATE impact = benign (e.g., IDH1 p.R132H is a missense variant, so only MODERATE, but a critical driver mutation)
Always consider biological context!
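The impact table can be expressed as a small lookup, sketched here in Python. The mapping covers only the consequence terms named in the table (written as Sequence Ontology terms), and the helper names are assumptions for this example. Keeping MODERATE variants in the review set reflects the exception above: missense drivers would be lost by a HIGH-only filter.

```python
# Consequence term -> impact class, limited to the examples in the table.
IMPACT_BY_CONSEQUENCE = {
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "splice_acceptor_variant": "HIGH",
    "splice_donor_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_insertion": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "splice_region_variant": "LOW",
    "intron_variant": "MODIFIER",
    "5_prime_UTR_variant": "MODIFIER",
    "3_prime_UTR_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]


def impact_of(consequence: str) -> str:
    """Most severe impact among '&'-joined terms; unknown terms raise KeyError."""
    impacts = [IMPACT_BY_CONSEQUENCE[term] for term in consequence.split("&")]
    return min(impacts, key=IMPACT_ORDER.index)


def keep_for_review(consequence: str) -> bool:
    """Keep HIGH and MODERATE variants so that missense drivers are not dropped."""
    return impact_of(consequence) in {"HIGH", "MODERATE"}


print(impact_of("missense_variant"), keep_for_review("missense_variant"))
```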
Purpose: Predicts if a missense mutation is damaging based on protein structure and evolutionary conservation.
Scoring Categories:
Usage
PolyPhen-2 is specific to human proteins.
It should be used in conjunction with other tools (like SIFT) to build a consensus on missense variant interpretation.
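A minimal Python sketch of turning a PolyPhen-2 score into a category is shown below. The cutoffs (0.446 and 0.908) are the ones commonly quoted for the predictions shipped with Ensembl; treat them as an assumption and check the values used by your own annotation release.

```python
def polyphen_category(score: float) -> str:
    """Map a PolyPhen-2 score in [0, 1] to a category (higher = more damaging)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen-2 scores are expected to lie between 0 and 1")
    if score > 0.908:          # assumed cutoff
        return "probably_damaging"
    if score > 0.446:          # assumed cutoff
        return "possibly_damaging"
    return "benign"


for value in (0.95, 0.50, 0.10):
    print(value, polyphen_category(value))
```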
Purpose: Predicts if an amino acid substitution is deleterious based on sequence conservation across species.
Scoring Categories:
Note: Lower scores = More damaging (opposite to PolyPhen-2!)
Usage
SIFT is available for 10 species in Ensembl (including human, mouse, zebrafish).
All possible amino acid substitutions are pre-calculated, making annotation very fast.
Use in combination with PolyPhen-2 for consensus prediction.
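The inverted direction of SIFT scores is easy to get wrong, so here is a small Python sketch. It assumes the widely used 0.05 cutoff and the `label(score)` text format in which VEP can report predictions (for example `deleterious(0.01)`); the function names are made up for this example.

```python
import re
from typing import Tuple


def sift_category(score: float) -> str:
    """Map a SIFT score in [0, 1] to a category (lower = more damaging)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores are expected to lie between 0 and 1")
    return "deleterious" if score < 0.05 else "tolerated"   # assumed cutoff


def parse_prediction(value: str) -> Tuple[str, float]:
    """Split a 'label(score)' string such as 'deleterious(0.01)'."""
    match = re.fullmatch(r"([A-Za-z_]+)\(([0-9.]+)\)", value)
    if match is None:
        raise ValueError(f"unexpected prediction format: {value!r}")
    return match.group(1), float(match.group(2))


label, score = parse_prediction("deleterious(0.01)")
print(label, score, sift_category(score))
```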
Interpretation Guide
| SIFT | PolyPhen-2 | Confidence |
|---|---|---|
| Deleterious | Probably Damaging | High - prioritize |
| Tolerated | Benign | High - likely benign |
| Deleterious | Benign | Mixed - manual review |
| Tolerated | Damaging | Mixed - manual review |
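The interpretation guide can be written as a short Python function. Only the two concordant rows are treated as high confidence; sending every other combination (including "possibly damaging" and low-confidence SIFT calls) to manual review is an assumption of this sketch.

```python
def consensus(sift: str, polyphen: str) -> str:
    """Combine SIFT and PolyPhen-2 category labels into a confidence call."""
    # Normalize labels such as "Probably Damaging" to "probably_damaging".
    sift_label = sift.strip().lower().replace(" ", "_")
    polyphen_label = polyphen.strip().lower().replace(" ", "_")

    if sift_label == "deleterious" and polyphen_label == "probably_damaging":
        return "High - prioritize"
    if sift_label == "tolerated" and polyphen_label == "benign":
        return "High - likely benign"
    # Discordant or intermediate predictions need a human look.
    return "Mixed - manual review"


print(consensus("Deleterious", "Probably Damaging"))
print(consensus("Deleterious", "Benign"))
```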
Essential plugins for cancer analysis:
REVEL paper: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants
You can download the data (~6.5 GB) here: https://sites.google.com/site/revelgenomics/downloads or https://zenodo.org/records/7072866
Prepare for GRCh38
Usage:
AlphaMissense paper: Accurate proteome-wide missense variant effect prediction with AlphaMissense
Prepare for GRCh38
Run it with VEP
dbNSFP v4 paper: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs
A VEP plugin that retrieves data for missense variants from a tabix-indexed dbNSFP file.
Prepare the data
Prepare for GRCh38
Run it with VEP
You can use filter_vep, the filtering tool included in VEP’s suite of tools. It generally works very well with VEP-annotated data.
Filter SIFT deleterious events
It can also be used in a pipe, which might save memory, for example:
Operators:
is: Exact match
match: Regex pattern
<, >, <=, >=: Numeric comparison
in: List membership
and, or: Combine conditions
Example 1: High-confidence damaging
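A high-confidence damaging filter of this kind can be mimicked in Python to make the operator semantics concrete. The sketch below is an emulation written for this tutorial, not the filtering tool's implementation or its expression syntax; the field names (SIFT_score, PolyPhen_score) and values are invented, and the cutoffs are assumptions.

```python
import re
from typing import Dict


def check(record: Dict[str, str], field: str, op: str, value: str) -> bool:
    """Evaluate one condition against an annotation record."""
    actual = record.get(field, "")
    if op == "is":                       # exact match
        return actual == value
    if op == "match":                    # regular expression found anywhere
        return re.search(value, actual) is not None
    if op == "in":                       # membership in a comma-separated list
        return actual in value.split(",")
    if op in ("<", ">", "<=", ">="):     # numeric comparison
        if actual == "":
            return False                 # a missing value never passes
        a, b = float(actual), float(value)
        return {"<": a < b, ">": a > b, "<=": a <= b, ">=": a >= b}[op]
    raise ValueError(f"unknown operator: {op}")


variant = {
    "SYMBOL": "TP53",
    "Consequence": "missense_variant",
    "SIFT_score": "0.01",
    "PolyPhen_score": "0.98",
}

# Conditions are combined with Python's own `and` / `or`.
high_confidence = (
    check(variant, "Consequence", "is", "missense_variant")
    and check(variant, "SIFT_score", "<", "0.05")
    and check(variant, "PolyPhen_score", ">", "0.908")
)
print(high_confidence)
```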
There are many other plugins that can be used depending on the context and the specific biological question at hand. You can have a look here: Plugins for VEP
You can run multiple plugins at the same time
The more plugins you add, the more computationally expensive the run can become.
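As a sketch of what a multi-plugin run looks like, the Python helper below only assembles and prints a command line; it does not execute anything. The file paths are placeholders, and the `file=` argument style is an assumption that depends on the plugin version, so check the documentation at the top of each plugin file before running.

```python
import shlex
from typing import Dict, List


def build_vep_command(input_vcf: str, output_vcf: str,
                      plugins: Dict[str, str]) -> List[str]:
    """Assemble a VEP command with one --plugin option per plugin."""
    cmd = [
        "vep",
        "--input_file", input_vcf,
        "--output_file", output_vcf,
        "--vcf",                  # write VCF output
        "--cache", "--offline",   # use a local cache
        "--assembly", "GRCh38",
        "--sift", "b",            # prediction term and score
        "--polyphen", "b",
    ]
    for name, args in plugins.items():
        cmd += ["--plugin", f"{name},{args}" if args else name]
    return cmd


command = build_vep_command(
    "tumor.vcf",
    "tumor.annotated.vcf",
    {
        # Placeholder paths; the argument syntax is assumed.
        "REVEL": "file=/path/to/revel_grch38.tsv.gz",
        "AlphaMissense": "file=/path/to/AlphaMissense_hg38.tsv.gz",
    },
)
print(shlex.join(command))
```

Each additional entry in the dictionary adds one more `--plugin` option, and with it one more data file to load per run.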
There is a nice help included in VEP that can be useful for consultation
Include MODERATE impact variants
Many drivers remain unknown
CNV calling limitations
No Tool is Perfect
Always validate important findings visually (IGV) and consider orthogonal methods for clinical decisions.
Feel free to drop a line in the chat or to contact us.
Cancer Variant Analysis - SIB