June 10, 2025
Why Parse VEP-Annotated VCFs?
Variant Effect Predictor (VEP) is a powerful tool by Ensembl that annotates VCF files with rich biological insights, such as predicted consequences, gene names, protein changes, and population allele frequencies. However, the annotation is packed into a single CSQ tag in the INFO field: each entry is a string of pipe-delimited fields, with one comma-separated entry per annotated transcript or feature. That makes it difficult to filter, group, or visualize directly.
To get the most out of VEP annotations, it’s essential to parse these fields into a clean, tabular structure.
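To make the CSQ layout concrete, here is a minimal base R sketch of what "parsing" means for a single record. The header line, INFO string, gene name, and transcript IDs are all made up for illustration, and a real VEP header usually lists many more fields.

```r
# Hypothetical header line and INFO value; every value here is invented.
csq_header <- '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Feature|HGVSp">'
info <- "DP=50;CSQ=T|missense_variant|GENE1|ENST0001|p.Ala10Val,T|intron_variant|GENE1|ENST0002|"

# Field names are the pipe-delimited list after "Format: " in the header.
csq_fields <- strsplit(
  sub('.*Format: ([^"]+)".*', "\\1", csq_header),
  "|", fixed = TRUE
)[[1]]

# Isolate the CSQ value, then split it into one entry per transcript.
csq_value <- sub(".*CSQ=([^;]*).*", "\\1", info)
entries <- strsplit(csq_value, ",", fixed = TRUE)[[1]]

# strsplit() drops one trailing empty field, so append a "|" first
# to keep empty values at the end of an entry.
parts <- strsplit(paste0(entries, "|"), "|", fixed = TRUE)

# Stack the entries into a data frame with one column per CSQ field.
csq_table <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
colnames(csq_table) <- csq_fields
print(csq_table)
```

This only handles one record and assumes every entry has the same number of fields as the header, which is why a dedicated parser is the more practical choice for whole files.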
R Script Overview
Here’s an example script I use to turn a VEP-annotated VCF into a single flat table:
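# Keep strings as character vectors (already the default since R 4.0)
options(stringsAsFactors = FALSE)
library(vcfR)               # read.vcfR(), extract.gt()
library(VariantAnnotation)  # readVcf()
library(ensemblVEP)         # parseCSQToGRanges()
library(dplyr)
library(tidyr)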
# list.files() takes a regular expression, not a shell glob
files <- list.files(pattern = "\\.vcf\\.gz$")
vcf_db <- read.vcfR(files[1], verbose = FALSE)
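# Pull the header lines, the fixed columns (CHROM to INFO), and the genotype columns out of the vcfR object
vep_header <- data.frame(vcf_db@meta)
vep_variants <- data.frame(vcf_db@fix)
vep_gt <- data.frame(vcf_db@gt)
# Allele fraction from the FORMAT AF field; the single column name assumes a one-sample VCF
VAF <- data.frame(extract.gt(vcf_db, element = "AF", as.numeric = TRUE))
colnames(VAF) <- "VAF"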
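# Read the same file again with VariantAnnotation so the CSQ field can be parsed
vcf_ens <- readVcf(files[1], "hg38")
# One row per CSQ entry; this lines up with the variant table only if VEP wrote one annotation per variant
csq_vep <- parseCSQToGRanges(vcf_ens)
# Drop the genomic range columns (1:5) plus columns 8 and 9, keeping the remaining annotation fields
csq_vep <- data.frame(csq_vep)[ , -c(1:5, 8, 9)]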
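# Keep the part of INFO before ";CSQ"; assumes every record has a CSQ tag and that it is not the first INFO key
info <- strsplit(vep_variants$INFO, ";CSQ")
info_df <- as.data.frame(matrix(unlist(info), ncol = 2, byrow = TRUE))
INFO <- info_df[1]
colnames(INFO) <- "INFO"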
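# Keep CHROM, POS, REF, ALT, QUAL, FILTER (drops ID and the raw INFO column)
vep_variants_filt <- vep_variants[ , c(1:2, 4:7)]
VEP <- cbind.data.frame(vep_variants_filt, INFO, csq_vep, VAF, vep_gt)
# Tab-separated output, matching the .tsv extension
write.table(VEP, "VEP_annotated_table.tsv", sep = "\t", row.names = FALSE, quote = FALSE)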
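The script binds the parsed CSQ rows directly onto the variant rows, which only works when there is exactly one annotation per variant (for example, when VEP was run with --pick). When a variant carries several transcript annotations, one possible alternative is to expand the raw CSQ string into long format with tidyr and filter with dplyr. The sketch below uses a toy table with made-up values, and the four field names are an assumption; in practice they would come from the CSQ header line.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a per-variant table with the raw CSQ string; all values are invented.
variants <- data.frame(
  CHROM = c("chr1", "chr2"),
  POS = c(100L, 200L),
  CSQ = c(
    "T|missense_variant|GENE1|ENST0001,T|intron_variant|GENE1|ENST0002",
    "A|synonymous_variant|GENE2|ENST0003"
  )
)

# Assumed field order; read this from the "Format:" part of the VCF header for real data.
csq_fields <- c("Allele", "Consequence", "SYMBOL", "Feature")

long <- variants %>%
  separate_rows(CSQ, sep = ",") %>%              # one row per transcript annotation
  separate(CSQ, into = csq_fields, sep = "\\|")  # one column per CSQ field

# Example filter: keep annotations whose consequence includes missense_variant.
missense <- long %>%
  filter(grepl("missense_variant", Consequence))

print(long)
print(missense)
```

Long format repeats CHROM and POS for every transcript, so variant-level summaries need a group_by(CHROM, POS) step afterwards.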
Output Columns