Parsing VEP-Annotated VCFs with R

Parsing VEP-Annotated VCF Files into Tabular Format with R

June 10, 2025

Why Parse VEP-Annotated VCFs?

Variant Effect Predictor (VEP) is Ensembl's tool for annotating the variants in a VCF file with predicted consequences, gene names, protein changes, and population allele frequencies. In VCF output, however, these annotations are packed into the CSQ tag of the INFO field: one comma-separated entry per transcript or allele, each a pipe-delimited string whose field order is declared in the VCF header. That makes them difficult to filter, group, or visualize directly.

To get the most out of VEP annotations, it’s essential to parse these fields into a clean, tabular structure.
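To make the condensed format concrete, here is a minimal base R sketch that unpacks a single CSQ value by hand. The header line and INFO string are made-up toy values, and the field list (Allele, Consequence, IMPACT, SYMBOL) is an assumption for illustration; a real VEP run usually declares many more fields.

```r
# Toy VEP header line and INFO string (invented for illustration only).
header_line <- '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'
info <- "DP=50;CSQ=T|missense_variant|MODERATE|TP53,T|intron_variant|MODIFIER|WRAP53"

# Field names are the pipe-separated list after "Format: " in the description.
format_string <- sub('.*Format: ([^"]+)".*', "\\1", header_line)
csq_fields <- strsplit(format_string, "|", fixed = TRUE)[[1]]

# Isolate the CSQ value, then split on "," (one entry per transcript/allele).
csq_value <- sub(".*CSQ=([^;]*).*", "\\1", info)
entries <- strsplit(csq_value, ",", fixed = TRUE)[[1]]

# Split each entry on "|". strsplit() drops a trailing empty field, so a "|"
# is appended first to keep every entry at the same number of fields.
values <- strsplit(paste0(entries, "|"), "|", fixed = TRUE)
csq_table <- as.data.frame(do.call(rbind, values), stringsAsFactors = FALSE)
colnames(csq_table) <- csq_fields

print(csq_table)
```

Each row of csq_table is one annotation, so a single variant can produce several rows.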

R Script Overview

Here’s an example script I use to:

Load a VEP-annotated VCF file
Extract and parse the CSQ field
Combine variant info, CSQ annotations, VAF, and genotype data
Output a final tidy table for downstream analysis

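options(stringsAsFactors = FALSE)
library(vcfR)        # read.vcfR(), extract.gt()
library(ensemblVEP)  # parseCSQToGRanges(); readVcf() comes from VariantAnnotation, which ensemblVEP loads
library(dplyr)
library(tidyr)

# list.files() expects a regular expression, not a shell glob
files <- list.files(pattern = "\\.vcf\\.gz$")
vcf_db <- read.vcfR(files[1], verbose = FALSE)

# Header lines, fixed columns (CHROM to INFO), and FORMAT plus per-sample genotype columns
vep_header <- data.frame(vcf_db@meta)
vep_variants <- data.frame(vcf_db@fix)
vep_gt <- data.frame(vcf_db@gt)

# Allele fraction from the FORMAT AF field; the single column name assumes a single-sample VCF
VAF <- data.frame(extract.gt(vcf_db, element = "AF", as.numeric = TRUE))
colnames(VAF) <- "VAF"

# Parse CSQ into one row per annotation, then drop the genomic-range columns (1:5)
# and two annotation columns by position
vcf_ens <- readVcf(files[1], "hg38")
csq_vep <- parseCSQToGRanges(vcf_ens)
csq_vep <- data.frame(csq_vep)[ , -c(1:5, 8, 9)]

# Keep the part of INFO that precedes the CSQ tag (one value per variant, even if CSQ is absent)
info <- strsplit(vep_variants$INFO, ";CSQ", fixed = TRUE)
INFO <- data.frame(INFO = vapply(info, function(x) x[1], character(1)))

# cbind() pairs rows by position, so stop unless there is exactly one CSQ row per variant
stopifnot(nrow(csq_vep) == nrow(vep_variants))
vep_variants_filt <- vep_variants[ , c(1:2,4:7) ]
VEP <- cbind.data.frame(vep_variants_filt, INFO, csq_vep, VAF, vep_gt)
# Tab-separated output to match the .tsv extension
write.table(VEP, "VEP_annotated_table.tsv", sep = "\t", row.names = FALSE, quote = FALSE)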
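The cbind step in the script pairs rows by position, which only lines up when every variant has exactly one CSQ entry (for example, when VEP was run so that it reports a single consequence per variant). When a variant carries several transcript annotations, a key-based join is safer. The sketch below uses toy data frames and a hypothetical row_id column as the key; as far as I know, parseCSQToGRanges() accepts a VCFRowID argument that can supply such a key, but check the ensemblVEP documentation before relying on it.

```r
# Toy variant table: one row per variant. "row_id" is a hypothetical key column.
variants <- data.frame(
  row_id = c("v1", "v2"),
  CHROM  = c("chr17", "chr17"),
  POS    = c(7674220, 7675088),
  stringsAsFactors = FALSE
)

# Toy annotation table: v1 has two transcript-level annotations, v2 has one.
csq <- data.frame(
  row_id      = c("v1", "v1", "v2"),
  Consequence = c("missense_variant", "intron_variant", "missense_variant"),
  SYMBOL      = c("TP53", "WRAP53", "TP53"),
  stringsAsFactors = FALSE
)

# One-to-many join: variant columns are repeated for each annotation row.
joined <- merge(variants, csq, by = "row_id")
print(joined)
```

The result has one row per variant-annotation pair, so per-variant columns such as VAF would need to be joined on the same key rather than bound by position.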

Output Columns

CHROM, POS, REF, ALT, QUAL, FILTER
INFO field (excluding CSQ)
VEP annotations (e.g., Consequence, Gene, Feature_type)
VAF (Variant Allele Fraction, taken from the FORMAT AF field)
Genotype-level fields
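With the annotations in columns, downstream filtering becomes a one-liner. Here is a small sketch of that, using a toy table with a few of the columns listed above; the values, the 0.05 VAF cutoff, and the temporary file are all invented for illustration.

```r
# Toy stand-in for the output table (values are made up).
VEP_toy <- data.frame(
  CHROM       = c("chr17", "chr17", "chr7"),
  POS         = c(7674220, 7675088, 55191822),
  Consequence = c("missense_variant", "intron_variant", "missense_variant"),
  VAF         = c(0.42, 0.31, 0.02),
  stringsAsFactors = FALSE
)

# Round-trip through a tab-separated file, as the script's output would be read back.
tsv_path <- tempfile(fileext = ".tsv")
write.table(VEP_toy, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
VEP_in <- read.delim(tsv_path, stringsAsFactors = FALSE)

# Keep missense variants at or above an arbitrary example VAF cutoff.
hits <- subset(VEP_in, Consequence == "missense_variant" & VAF >= 0.05)
print(hits)
```

The same filter could be written with dplyr::filter(), which the script already loads.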

Zeeshan Fazal's Bioinformatics Blog
