Import variation data

Variants can be imported using a wrapper around the Ensembl Import VCF Script, which exposes a subset of the script's full functionality.

The input must be a bgzipped VCF file. Running bgzip replaces variants.vcf with variants.vcf.gz:

$ bgzip variants.vcf
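
A quick way to confirm that a file was compressed with bgzip rather than plain gzip is to inspect its first bytes. The sketch below is a heuristic and not part of GenomeHubs: the function name `is_bgzf` is hypothetical, and it only checks that the file starts with the four bytes (1f 8b 08 04) that open a BGZF block.

```shell
# Hypothetical helper (not part of GenomeHubs): succeed if the file starts with
# the BGZF magic bytes 1f 8b 08 04 (a gzip header with the extra-field flag set).
# Heuristic only: it does not validate the rest of the file.
is_bgzf() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    [ "$magic" = "1f8b0804" ]
}

# Example use:
#   is_bgzf variants.vcf.gz && echo "looks bgzipped"
```

A file produced by plain gzip normally fails this check, which is a hint to decompress it and run bgzip instead.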

Create and edit a panel file to associate samples with populations:

This file has two tab-separated columns: sample name and population.
Sample names must match the sample names in the VCF file.
Only samples listed in this file will be imported.

$ nano /path/to/data/panel.tsv
sample_1           population 1
sample_2           population 1
sample_3           population 2
sample_4           population 2
...
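
Because a malformed panel file or a mismatched sample name may only surface during a long import, it can be worth checking the file first. The helpers below are a minimal sketch under stated assumptions, not part of GenomeHubs: the function names are hypothetical, and sample names are read from column 10 onwards of the `#CHROM` header line of the bgzipped VCF.

```shell
# Hypothetical helpers (not part of GenomeHubs) for checking a panel file.

# Report every line that does not have exactly two tab-separated columns.
# The exit status is non-zero if any such line is found.
check_two_columns() {
    awk -F'\t' '
        NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Print panel samples that do not appear in the #CHROM header line of a
# bgzipped VCF (sample names start at column 10 of that line).
missing_samples() {
    panel=$1
    vcf=$2
    samples=$(mktemp)
    gzip -dc "$vcf" |
        awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' > "$samples"
    cut -f1 "$panel" | grep -Fxv -f "$samples" || true
    rm -f "$samples"
}

# Example use:
#   check_two_columns /path/to/data/panel.tsv
#   missing_samples /path/to/data/panel.tsv /path/to/data/variants.vcf.gz
```

If `missing_samples` prints nothing, every sample in the panel was found in the VCF header.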

Create and edit a description file to add a description for each sample:

This file has two tab-separated columns: sample name and description.
Sample names must match the sample names in the VCF file.
HTML markup is supported and can be used to add a link to a related SRA accession, if available.

$ nano /path/to/data/description.tsv
sample_1           description of sample 1
sample_2           description of sample 2
sample_3           description of sample 3
sample_4           description of sample 4
...
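
When every sample has an SRA accession, the description file can be generated rather than typed by hand. The following is a sketch only: `make_descriptions` and the input file `accessions.tsv` (sample name, tab, accession) are hypothetical, the description wording is a placeholder, and how the importer renders the embedded HTML has not been verified here.

```shell
# Hypothetical helper (not part of GenomeHubs): read "sample<TAB>accession"
# lines and print "sample<TAB>description" lines with an HTML link to the SRA.
make_descriptions() {
    awk -F'\t' -v OFS='\t' '
        NF == 2 {
            link = "<a href=\"https://www.ncbi.nlm.nih.gov/sra/" $2 "\">" $2 "</a>"
            print $1, "Reads available from SRA: " link
        }
    ' "$1"
}

# Example use (accessions.tsv is an assumed input file):
#   make_descriptions accessions.tsv > /path/to/data/description.tsv
```

Lines that do not have exactly two columns are skipped, so the output always keeps the two-column layout described above.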

Create and edit a configuration file to set database and variant details:

As with the core database import, common settings can be specified in a default.ini file and passwords can be set in an overwrite.ini file.
If database connection settings are not set in a [DATABASE_VARIATION] section, values from [DATABASE_CORE] will be reused.
The variation database name must match the corresponding core database name with "variation" in place of "core".
When importing local files, specify the path to the file as mounted in the container.
The FILTER value is passed to the bcftools view command with the -i flag; it is not needed if your SNP data are already filtered.

$ nano /path/to/conf/example_variants.ini
[DATABASE_CORE]
    NAME = heliconius_erato_demophoon_v1_core_32_85_1
    HOST = genomehubs-mysql
    PORT = 3306
    RW_USER = importer
    RW_PASS = CHANGEME
    RO_USER = anonymous
[DATABASE_VARIATION]
    NAME = genus_species_assembly_variation_32_85_1
[META]
    SPECIES.PRODUCTION_NAME = genus_species_assembly
    SPECIES.SCIENTIFIC_NAME = Genus species
[FILES]
    VCF = [ vcf /import/data/variants.vcf.gz ]
    PANEL = [ tsv /import/data/panel.tsv ]
[STUDY]
    SOURCE = Anonymous 2017
[BCFTOOLS]
    FILTER = QUAL>=30 & FMT/DP>=10 & FMT/DP<=100 & SUM(FMT/DP)<=N_SAMPLES*100 & FMT/SB<200 & MIN(FMT/GQ)>=30

$ nano /path/to/conf/example_variants.ini
[DATABASE_CORE]
    NAME = heliconius_erato_demophoon_v1_core_36_89_1
    HOST = genomehubs-mysql
    PORT = 3306
    RW_USER = importer
    RW_PASS = CHANGEME
    RO_USER = anonymous
[DATABASE_VARIATION]
    NAME = genus_species_assembly_variation_36_89_1
[META]
    SPECIES.PRODUCTION_NAME = genus_species_assembly
    SPECIES.SCIENTIFIC_NAME = Genus species
    SPECIES.DIVISION = EnsemblMetazoa
[FILES]
    VCF = [ vcf /import/data/example_variants/variants.vcf.gz ]
    PANEL = [ tsv /import/data/example_variants/panel.tsv ]
    DESCRIPTION = [ tsv /import/data/example_variants/description.tsv ]
[STUDY]
    SOURCE = Anonymous 2017
    DESCRIPTION = Anonymous 2017. Article title. Journal. Vol:pages
[BCFTOOLS]
    FILTER = QUAL>=30 & FMT/DP>=10 & FMT/DP<=100 & SUM(FMT/DP)<=N_SAMPLES*100 & FMT/SB<200 & MIN(FMT/GQ)>=30
[MODIFY]
    OVERWRITE_DB = 1
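
The naming rule for the variation database can be applied mechanically. As a minimal sketch, assuming the core name contains a single `_core_` component, a hypothetical helper (not part of GenomeHubs) could derive the value for the [DATABASE_VARIATION] NAME setting:

```shell
# Hypothetical helper (not part of GenomeHubs): derive the variation database
# name from a core database name by swapping the "_core_" component.
variation_db_name() {
    printf '%s\n' "$1" | sed 's/_core_/_variation_/'
}

variation_db_name heliconius_erato_demophoon_v1_core_36_89_1
```

For the core database name used in the example configuration, this prints `heliconius_erato_demophoon_v1_variation_36_89_1`.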

Run the GenomeHubs variation container:

Depending on the number of SNPs in your VCF file after filtering, this step is likely to take several hours to run.

docker run --rm \
           -d \
           --name genomehubs-variation \
           -u $UID:$GROUPS \
           -v /path/to/conf:/import/conf \
           -v /path/to/data:/import/data \
           -e FLAGS="-i" \
           -e VARIANTS=example_variants \
           genomehubs/variation:17.03

docker run --rm \
           -d \
           --name genomehubs-variation \
           -u $UID:$GROUPS \
           -v /path/to/conf:/import/conf \
           -v /path/to/data:/import/data \
           -e FLAGS="-i" \
           -e VARIANTS=example_variants \
           genomehubs/variation:17.06
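
Since the run time depends on how many variants pass the filter, it may help to count them before starting the container. The sketch below assumes bcftools is installed on the host. The function name `count_passing_variants` and the `BCFTOOLS` override variable are hypothetical, and the wrapper inside the container may invoke bcftools with additional options.

```shell
# Hypothetical helper (not part of GenomeHubs): count the records that pass a
# bcftools include expression. BCFTOOLS can be set to use a different binary.
count_passing_variants() {
    vcf=$1
    filter=$2
    tool=${BCFTOOLS:-bcftools}
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "bcftools not found on PATH" >&2
        return 127
    fi
    # -H drops the header, so each output line is one variant record.
    "$tool" view -H -i "$filter" "$vcf" | wc -l | tr -d ' '
}

# Example use, with part of the FILTER from the configuration file:
#   count_passing_variants /path/to/data/variants.vcf.gz 'QUAL>=30'
```

While the import is running, `docker logs -f genomehubs-variation` follows the container output.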

Modify the EasyMirror configuration to load variation databases:

EasyMirror will attempt to load the database types listed in SPECIES_DB_AUTOEXPAND, so the same setting can also be used to load funcgen and other database types mirrored from Ensembl.

$ nano ~/genomehubs/v1/ensembl/conf/setup.ini
[DATA_SOURCE]
  SPECIES_DB_AUTOEXPAND = [ variation ]

Restart your Ensembl site to load the newly created variation database:

$ docker rm -f genomehubs-ensembl
$ docker run -d \
             --name genomehubs-ensembl \
             -v ~/genomehubs/v1/ensembl/conf:/ensembl/conf:ro \
             --link genomehubs-mysql \
             -p 8081:8080 \
             genomehubs/easy-mirror:17.03

$ docker rm -f genomehubs-ensembl
$ docker run -d \
             --name genomehubs-ensembl \
             -v ~/genomehubs/v1/ensembl/conf:/conf:ro \
             --link genomehubs-mysql \
             -p 8081:8080 \
             genomehubs/easy-mirror:17.06
