Introduction

The 1000 Vietnamese Genomes Project has contributed to the discovery of tens of millions of genetic variations existing in the Vietnamese population, including variations that have never been discovered in previous studies. In such a huge “sea” of data, it is difficult to study each variation in turn and how it affects the body. To help scientists complete this workload, the project used a tool called Variant Effect Predictor, from Ensembl, a powerful annotation tool that can evaluate, classify, compare or even predict the impact of variations.

The Birth of VEP

In recent years, continuous advances in science and technology, along with many notable biomedical informatics research projects, have brought us closer to precision medicine, in which basic research results are applied to medical examination and treatment. In precision medicine, genetics plays an extremely important role alongside external environmental factors. Conditions often mentioned in medicine, such as Down syndrome, color blindness and Turner syndrome, all originate from genetic changes. Analyzing the impact of each variation in the human genome is therefore essential.

Many organizations have sequenced genomes and studied the functional effects of variations at each stage of gene expression. Variations are also named according to several different conventions, based on two main reference transcript sets: GENCODE and Reference Sequence (RefSeq), the latter managed and updated by the National Center for Biotechnology Information, USA (NCBI) [1]. However, this diversity and abundance of information leads to inconsistencies when combining and interpreting results. To address this problem, several groups have developed variant annotation tools (e.g. ANNOVAR, SnpEff, SnpSift, FUMA) that draw on research results from many databases and assign each variant its corresponding nomenclature and functional information. Variant Effect Predictor (VEP) is one of the most prominent. It is one of the few such tools that is free to use in both commercial and non-commercial research. In addition to the human genome, VEP can annotate variants in more than 80 vertebrate and invertebrate species whose genomes are in the Ensembl database.

How VEP works

VEP offers several interfaces: a web interface, a Perl command-line script and a REST API. One advantage of VEP over other tools is that it accepts input in many different formats, the most common being VCF (Variant Call Format). A VCF file is the output of variant calling, the final step in the variant detection and filtering process. The results are organized in tab-separated columns. The required columns (Figure 1) include CHROM (chromosome), POS (position), ID, REF (the allele in the reference genome) and ALT (the alternate allele). The annotations produced by VEP are then added to the INFO column of the VCF file.
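To make the column layout concrete, here is a minimal sketch of reading one VCF data line in Python. It assumes the eight fixed columns defined by the VCF specification (the five named above plus QUAL, FILTER and INFO). The function name `parse_vcf_line` and the sample line are hypothetical and only for illustration; production code would normally rely on a dedicated VCF library.

```python
# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Return a dict of the eight fixed VCF columns for one data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # ALT may list several alternate alleles separated by commas.
    record["ALT"] = record["ALT"].split(",")
    # INFO holds semicolon-separated key=value pairs or bare flags.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True
    record["INFO"] = info
    return record


# Illustrative line: the QUAL, FILTER and INFO values are made up.
example = "1\t935833\t.\tC\tG\t.\tPASS\tDP=30"
record = parse_vcf_line(example)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["INFO"])
```

Keeping INFO as a dictionary makes it easy to look up whatever keys an annotation tool adds to that column later.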

Figure 1. Basic components of a VCF file

Before variants are annotated, VEP normalizes insertions and deletions in repetitive sequences and splits complex variants with multiple changes at the same position into separate records. Variants that fail quality checks on length, position, the given reference allele and the corresponding allele in the reference genome are removed. VEP then evaluates each variant and labels its consequence type (CSQ) using attribute functions. In many cases, a variant overlaps several transcripts of the same gene and has a different effect on each. For example, the variant at position 935833 on chromosome 1 (C>G) is annotated both as a missense variant in transcript ENST00000618779 and as an intron variant in ENST00000620200. VEP also lets users customize the criteria used to select the most suitable annotation.
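The per-transcript behaviour can be illustrated with a small sketch. In VEP's VCF output, annotations are stored under the CSQ key of the INFO column: entries for different transcripts are separated by commas, and fields within an entry by "|", with the field order declared in the file header. The three-field layout and the helper `parse_csq` below are assumptions made for illustration; a real file lists many more fields.

```python
def parse_csq(csq_value, field_names):
    """Split a CSQ value into one dict per annotated transcript."""
    entries = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        entries.append(dict(zip(field_names, values)))
    return entries


# Hypothetical, shortened field list; a real VCF declares its own in the header.
CSQ_FIELDS = ["Allele", "Consequence", "Feature"]

# The chromosome 1, position 935833 (C>G) example from the text.
csq = "G|missense_variant|ENST00000618779,G|intron_variant|ENST00000620200"

by_transcript = {e["Feature"]: e["Consequence"] for e in parse_csq(csq, CSQ_FIELDS)}
for transcript, consequence in by_transcript.items():
    print(transcript, consequence)
```

Indexing the result by transcript makes the point of the example explicit: the same allele carries a different consequence term depending on which transcript is considered.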

On chromosome 21 (67,416 variants), VEP runs at 1,428 variants/second, about 2 times faster than SnpEff (635 variants/second) and about 1.2 times slower than ANNOVAR (1,732 variants/second). On the entire human genome (4,474,140 variants), however, VEP (1,200 variants/second) is slower than both ANNOVAR (3,415 variants/second) and SnpEff (1,598 variants/second).
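The ratios above follow directly from the quoted throughput figures. The short sketch below recomputes them and converts the whole-genome rates into approximate running times. It uses only the numbers given in the text; the helper name `runtime_seconds` is made up for this example.

```python
# Throughput figures quoted in the text (variants per second).
chr21 = {"VEP": 1428, "SnpEff": 635, "ANNOVAR": 1732}
genome = {"VEP": 1200, "SnpEff": 1598, "ANNOVAR": 3415}
GENOME_VARIANTS = 4_474_140


def runtime_seconds(n_variants, rate):
    """Time needed to annotate n_variants at a constant rate."""
    return n_variants / rate


vep_vs_snpeff = chr21["VEP"] / chr21["SnpEff"]
annovar_vs_vep = chr21["ANNOVAR"] / chr21["VEP"]
print(f"chr21: VEP is {vep_vs_snpeff:.2f}x SnpEff, ANNOVAR is {annovar_vs_vep:.2f}x VEP")

for tool, rate in genome.items():
    minutes = runtime_seconds(GENOME_VARIANTS, rate) / 60
    print(f"whole genome, {tool}: about {minutes:.0f} min")
```

At these rates, a whole-genome run takes roughly an hour with VEP, so the difference between tools is a matter of tens of minutes per genome.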

VEP is released in step with the other Ensembl tools. At the time of writing, the latest VEP release is Ensembl 104.

Databases used by VEP

In addition to predicting the impact of variation, VEP can look up and compare data from other databases. These annotations (Table 1) cover transcripts, proteins, non-coding regions, allele frequencies in the population, phenotypes and other information. VEP also supports plugins that annotate variants with two further databases, dbscSNV and dbNSFP.

Table 1. Annotations in VEP and the corresponding databases

Annotation type in VEP	Database
Transcripts	GENCODE, RefSeq, Ensembl, APPRIS (GRCh38/hg38 only)
Protein	SIFT, PolyPhen-2
Non-coding (regulatory) regions	ENCODE, BLUEPRINT, NIH Epigenomics Roadmap
Allele frequencies in the population, phenotypes and other annotations	dbSNP, COSMIC, Human Gene Mutation Database (HGMD), Database of Genomic Variants

The dbscSNV database [2] provides information on all potential single nucleotide variants in splice site regions and uses machine learning to predict whether a variant alters splicing. The features used in the models include predictions from Position Weight Matrix (PWM), MaxEntScan (MES), Splice Site Prediction (NNSplice), GeneSplicer, Human Splicing Finder (HSF), CADD_phred and PhyloP46way. Two machine learning methods are used: AdaBoost and random forests. The most recent version of dbscSNV, still in use, is v1.1 (April 2015).
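As a rough illustration of how the two ensemble scores might be used downstream, the sketch below flags a variant when either the AdaBoost or the random forest score exceeds a cutoff. The cutoff of 0.6, the use of "." for a missing score and the function names are assumptions for this example and should be checked against the dbscSNV documentation before use.

```python
def to_score(value):
    """Convert a score string to float; '.', '' or None means missing."""
    if value in (".", "", None):
        return None
    return float(value)


def likely_splice_altering(ada_score, rf_score, cutoff=0.6):
    """True if either ensemble score exceeds the (assumed) cutoff."""
    scores = [to_score(ada_score), to_score(rf_score)]
    return any(s is not None and s > cutoff for s in scores)


# Made-up score pairs for illustration only.
print(likely_splice_altering("0.98", "0.91"))  # both scores high
print(likely_splice_altering("0.02", "."))     # low AdaBoost score, RF missing
```

Treating a missing score as "no evidence" rather than as zero keeps variants that only one model scored from being silently misclassified.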

The dbNSFP database [3] contains data from 37 impact prediction algorithms (including SIFT, SIFT4G, Polyphen2-HDIV, Polyphen2-HVAR, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, CADD, CADD_hg19, VEST4, PROVEAN, FATHMM-MKL coding, FATHMM-XF coding, fitCons, LINSIGHT, DANN, GenoCanyon, Eigen, Eigen-PC, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, BayesDel_addAF, BayesDel_noAF, ClinPred, LIST-S2 and ALoFT) and 9 conservation scores (from PhyloP, phastCons, GERP++, SiPhy and bStatistic). It also includes allele frequencies from the 1000 Genomes Project (phase 3), UK10K, ExAC, gnomAD and ESP6500, as well as alternative nomenclatures and descriptions of gene function, expression and interactions from various databases. In dbNSFP v4 (2020), which is based on GENCODE release 29 and Ensembl version 94, the average missing rate across the deleteriousness prediction algorithms was 11%.
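The "average missing rate" can be understood as the fraction of variants for which a predictor returns no score, averaged over predictors. The sketch below computes it for a toy table. The records are made up, the column names are illustrative, and the "." placeholder for a missing value is an assumption about the file format.

```python
def missing_rate(records, score_columns, missing="."):
    """Per-column fraction of records that have no prediction."""
    rates = {}
    for col in score_columns:
        n_missing = sum(1 for r in records if r.get(col, missing) == missing)
        rates[col] = n_missing / len(records)
    return rates


# Toy records with made-up values; "." marks a missing prediction.
records = [
    {"SIFT_score": "0.01", "REVEL_score": "0.85", "CADD_phred": "24.1"},
    {"SIFT_score": ".", "REVEL_score": "0.12", "CADD_phred": "8.3"},
    {"SIFT_score": "0.40", "REVEL_score": ".", "CADD_phred": "15.0"},
    {"SIFT_score": ".", "REVEL_score": "0.55", "CADD_phred": "19.7"},
]

rates = missing_rate(records, ["SIFT_score", "REVEL_score", "CADD_phred"])
average = sum(rates.values()) / len(rates)
print(rates, f"average missing rate: {average:.0%}")
```

A per-predictor breakdown like this is useful when choosing which scores to rely on, since a predictor with many gaps will leave a large share of variants unannotated.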

Conclusion

To date, VEP remains one of the most powerful tools for variant annotation. It has been used to annotate variants in the 1000 Genomes Project [4], in the first study to detect genome-wide significant risk loci for attention deficit/hyperactivity disorder [5], in Combined Annotation-Dependent Depletion (CADD) [6] and in ExAC, a browser for searching reference data from more than 60,000 exomes [7]. With the continuous development and updating of Ensembl, VEP promises to remain an essential annotation tool that contributes to the development of bioinformatics research.

References

[1] McLaren, W., Gil, L., Hunt, S.E. et al. The Ensembl Variant Effect Predictor. Genome Biol 17, 122 (2016).

[2] Xueqiu Jian, Eric Boerwinkle, Xiaoming Liu, In silico prediction of splice-altering single nucleotide variants in the human genome, Nucleic Acids Research, Volume 42, Issue 22, 16 December 2014, Pages 13534–13544.

[3] Jpopgen – dbNSFP

[4] 1000 Genomes | A Deep Catalog of Human Genetic Variation

[5] Demontis, D., Walters, R.K., Martin, J. et al. Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat Genet 51, 63–75 (2019). https://doi.org/10.1038/s41588-018-0269-7

[6] Philipp Rentzsch, Daniela Witten, Gregory M Cooper, Jay Shendure, Martin Kircher, CADD: predicting the deleteriousness of variants throughout the human genome, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D886–D894.

[7] Konrad J. Karczewski, Ben Weisburd, Brett Thomas, Matthew Solomonson, Douglas M. Ruderfer, David Kavanagh, Tymor Hamamsy, Monkol Lek, Kaitlin E. Samocha, Beryl B. Cummings, Daniel Birnbaum, The Exome Aggregation Consortium, Mark J. Daly, Daniel G. MacArthur, The ExAC browser: displaying reference data information from over 60 000 exomes, Nucleic Acids Research, Volume 45, Issue D1, January 2017, Pages D840–D845.

Variant Effect Predictor - Ensembl's powerful variation annotation tool