The first step in adding new data to a GenomeHubs Ensembl site is to import the assembled genome sequence and gene models from FASTA and GFF files into an Ensembl MySQL database using an EasyImport Docker container.
Parameters for the import scripts within the EasyImport container are controlled in assembly-specific configuration files. These offer a wide range of options for setting passwords and assembly-specific metadata, for accommodating the diversity of real-world GFF files, and for importing files from any location on the local filesystem or accessible via http/ftp. In practice, only a small number of these parameters need to be altered for a given assembly import, so many of them can be set in a default configuration file that remains unchanged across all imported assemblies.
The complexity of running this step is largely determined by the validity of the input files. GFF3 files in particular can differ from published standards in many ways, and there are configuration options to accommodate much of this diversity (see easy-import.readme.io). If you prefer to ensure the validity of your files using external tools then, apart from accommodating conflicting definitions of phase, the default import settings should be sufficient.
Set general import parameters
Create and edit a default.ini file to set database connection parameters that are likely to remain constant across all assembly imports:
if the databases are hosted in a MySQL Docker container on the same machine that the import will run on, then HOST should be the name of the container; otherwise it should be the name/IP address of the host
values in this file can be overwritten by entries in an assembly-specific configuration file described below
this file also includes default GFF parsing parameters that are unlikely to need changing
$ cd ~/genomehubs/v1/import/conf/
$ cp default.ini default.ini.original
$ nano default.ini
# update values to match your database connection details
[DATABASE_CORE]
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
[DATABASE_SEARCH]
NAME = genomehubs_search_40_93
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
RO_PASS =
[DATABASE_TAXONOMY]
NAME = ncbi_taxonomy
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
[DATABASE_TEMPLATE]
NAME = melitaea_cinxia_core_40_93_1
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
$ cd ~/genomehubs/v1/import/conf/
$ cp default.ini default.ini.original
$ nano default.ini
# update values to match your database connection details
[DATABASE_CORE]
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
[DATABASE_SEARCH]
NAME = genomehubs_search_36_89
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
RO_PASS =
[DATABASE_TAXONOMY]
NAME = ncbi_taxonomy
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
[DATABASE_TEMPLATE]
NAME = melitaea_cinxia_core_36_89_1
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
$ cd ~/genomehubs/v1/import/conf/
$ cp default.ini default.ini.original
$ nano default.ini
# update values to match your database connection details
[DATABASE_CORE]
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
[DATABASE_SEARCH]
NAME = genomehubs_search_32_85
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
RO_PASS =
[DATABASE_TAXONOMY]
NAME = ncbi_taxonomy
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
[DATABASE_TEMPLATE]
NAME = melitaea_cinxia_core_32_85_1
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
Update the passwords in overwrite.ini:
values in this file will overwrite entries in default.ini and the assembly-specific file, so it is a convenient way to keep passwords separate from the remaining configuration details
$ cp overwrite.ini overwrite.ini.original
$ nano overwrite.ini
# update values to match your database connection details
[DATABASE_CORE]
RW_USER = importer
RW_PASS = CHANGEME
[DATABASE_SEARCH]
RW_USER = importer
RW_PASS = CHANGEME
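The precedence described above (default.ini first, then the assembly-specific file, then overwrite.ini) can be illustrated with a small shell sketch. The ini_lookup function is a hypothetical helper written for this illustration and is not part of EasyImport; the demo files, the assembly.ini file name and the example password are likewise made up. It simply reports the last value found for a key when the files are read in order.

```shell
# Hypothetical helper (not part of EasyImport): print the value of KEY in
# [SECTION], reading the given files in order so that later files win.
ini_lookup() {
  lookup_section=$1
  lookup_key=$2
  shift 2
  awk -v section="$lookup_section" -v key="$lookup_key" '
    FNR == 1 { current = "" }                 # new file: forget the previous section
    /^[ \t]*\[.*\][ \t]*$/ {                  # a [SECTION] header line
      current = $0
      gsub(/^[ \t]*\[|\][ \t]*$/, "", current)
      next
    }
    current == section && index($0, "=") > 0 {
      k = substr($0, 1, index($0, "=") - 1)
      v = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim spaces around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)          # trim spaces around the value
      if (k == key) { value = v; found = 1 }
    }
    END { if (found) print value; else exit 1 }
  ' "$@"
}

# Demo with made-up files: overwrite.ini replaces the password from default.ini,
# while HOST is only set in default.ini and is therefore kept.
demo=$(mktemp -d)
printf '[DATABASE_CORE]\nHOST = genomehubs-mysql\nRW_PASS = CHANGEME\n' > "$demo/default.ini"
printf '[DATABASE_CORE]\nNAME = operophtera_brumata_obru1_core_40_93_1\n' > "$demo/assembly.ini"
printf '[DATABASE_CORE]\nRW_PASS = example-password\n' > "$demo/overwrite.ini"
ini_lookup DATABASE_CORE RW_PASS "$demo/default.ini" "$demo/assembly.ini" "$demo/overwrite.ini"
ini_lookup DATABASE_CORE HOST "$demo/default.ini" "$demo/assembly.ini" "$demo/overwrite.ini"
rm -r "$demo"
```

The function exits with a non-zero status when the key is not found in any file, which makes it easy to spot a setting that was never defined.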
Choose a name for your new assembly database
Each imported assembly must be stored in a uniquely named database. GenomeHubs follows the Ensembl naming conventions with the addition of an assembly name to allow alternate assemblies for a single species to be hosted in a single site. Database names should be all lower case and contain only letters and numbers, with the components of the name separated by underscores. For Ensembl release 40/93 (which is currently the version supported by GenomeHubs) the database name for Operophtera brumata assembly Obru1 would be:
operophtera_brumata_obru1_core_40_93_1
operophtera_brumata_obru1_core_36_89_1
operophtera_brumata_obru1_core_32_85_1
The first two words should match the genus and species name being imported
It is useful to include an assembly name that will be unique across all species in a given GenomeHub, e.g. obru1 rather than v1
A subspecies/strain name can optionally be included before the assembly name
The word core must be present to denote the type of database
The first two numbers must match the Ensembl Genomes and Ensembl versions, in that order
The final number can be used to denote the genebuild version; typically this should be '1'
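The naming rules above can be sketched as two small shell functions. Both make_db_name and is_valid_db_name are hypothetical helpers invented for this illustration (GenomeHubs does not provide them), and the pattern used for validation is only one reading of the conventions listed above.

```shell
# Hypothetical helper: assemble a core database name from its components.
# Usage: make_db_name GENUS SPECIES ASSEMBLY EG_VERSION ENSEMBL_VERSION [GENEBUILD]
make_db_name() {
  # join the components with underscores, lower-case the result and drop
  # anything that is not a letter, number or underscore
  printf '%s_%s_%s_core_%s_%s_%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-1}" |
    tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9_\n'
}

# Hypothetical check: succeeds only for names of the form
# genus_species[_strain]_assembly_core_N_N_N
is_valid_db_name() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-z0-9]+_[a-z0-9]+(_[a-z0-9]+)+_core_[0-9]+_[0-9]+_[0-9]+$'
}

make_db_name Operophtera brumata Obru1 40 93
# prints operophtera_brumata_obru1_core_40_93_1
```

The genebuild number defaults to 1 when the sixth argument is omitted, matching the typical case described above.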
Set assembly metadata
Create and edit a <database name>.ini file in the import/conf directory to set assembly-specific metadata using the genus_species_asm_core_xx_xx_1.ini template file:
pay attention to the case of the default values you are replacing and the use of spaces/underscores
the value for ASSEMBLY.NAME should show the assembly name as you would like it to be displayed and may contain dots, but any dots should be omitted from the assembly name portion of SPECIES.PRODUCTION_NAME and SPECIES.URL
$ cd ~/genomehubs/v1/import/conf/
$ cp genus_species_asm_core_40_93_1.ini operophtera_brumata_obru1_core_40_93_1.ini
$ nano operophtera_brumata_obru1_core_40_93_1.ini
# update values to match your species/assembly name and other details
[DATABASE_CORE]
NAME = operophtera_brumata_obru1_core_40_93_1
[META]
SPECIES.PRODUCTION_NAME = operophtera_brumata_obru1
SPECIES.SCIENTIFIC_NAME = Operophtera brumata
SPECIES.COMMON_NAME = Winter moth
SPECIES.DISPLAY_NAME = Operophtera brumata
SPECIES.DIVISION = EnsemblMetazoa
SPECIES.URL = Operophtera_brumata_obru1
SPECIES.TAXONOMY_ID = 104452
SPECIES.ALIAS = [ ]
ASSEMBLY.NAME = Obru1
ASSEMBLY.DATE = 2015-08-11
ASSEMBLY.ACCESSION =
ASSEMBLY.DEFAULT = Obru1
PROVIDER.NAME = Wageningen University
PROVIDER.URL = http://www.bioinformatics.nl/wintermoth/portal/
GENEBUILD.ID = 1
GENEBUILD.START_DATE = 2017-05
GENEBUILD.VERSION = 1
GENEBUILD.METHOD = import
$ cd ~/genomehubs/v1/import/conf/
$ cp genus_species_assembly_core_36_89_1.ini operophtera_brumata_obru1_core_36_89_1.ini
$ nano operophtera_brumata_obru1_core_36_89_1.ini
# update values to match your species/assembly name and other details
[DATABASE_CORE]
NAME = operophtera_brumata_obru1_core_36_89_1
[META]
SPECIES.PRODUCTION_NAME = operophtera_brumata_obru1
SPECIES.SCIENTIFIC_NAME = Operophtera brumata
SPECIES.COMMON_NAME = Winter moth
SPECIES.DISPLAY_NAME = Operophtera brumata
SPECIES.DIVISION = EnsemblMetazoa
SPECIES.URL = Operophtera_brumata_obru1
SPECIES.TAXONOMY_ID = 104452
SPECIES.ALIAS = [ ]
ASSEMBLY.NAME = Obru1
ASSEMBLY.DATE = 2015-08-11
ASSEMBLY.ACCESSION =
ASSEMBLY.DEFAULT = Obru1
PROVIDER.NAME = Wageningen University
PROVIDER.URL = http://www.bioinformatics.nl/wintermoth/portal/
GENEBUILD.ID = 1
GENEBUILD.START_DATE = 2017-05
GENEBUILD.VERSION = 1
GENEBUILD.METHOD = import
$ cd ~/genomehubs/v1/import/conf/
$ cp genus_species_asm_core_32_85_1.ini operophtera_brumata_obru1_core_32_85_1.ini
$ nano operophtera_brumata_obru1_core_32_85_1.ini
# update values to match your species/assembly name and other details
[DATABASE_CORE]
NAME = operophtera_brumata_obru1_core_32_85_1
[META]
SPECIES.PRODUCTION_NAME = operophtera_brumata_obru1
SPECIES.SCIENTIFIC_NAME = Operophtera brumata
SPECIES.COMMON_NAME = Winter moth
SPECIES.DISPLAY_NAME = Operophtera brumata
SPECIES.DIVISION = EnsemblMetazoa
SPECIES.URL = Operophtera_brumata_obru1
SPECIES.TAXONOMY_ID = 104452
SPECIES.ALIAS = [ ]
ASSEMBLY.NAME = Obru1
ASSEMBLY.DATE = 2015-08-11
ASSEMBLY.ACCESSION =
ASSEMBLY.DEFAULT = Obru1
PROVIDER.NAME = Wageningen University
PROVIDER.URL = http://www.bioinformatics.nl/wintermoth/portal/
GENEBUILD.ID = 1
GENEBUILD.START_DATE = 2017-05
GENEBUILD.VERSION = 1
GENEBUILD.METHOD = import
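The relationship between the displayed names and the SPECIES.PRODUCTION_NAME and SPECIES.URL values in the examples above can be sketched in shell. The production_name and species_url functions are hypothetical helpers for illustration only; they reproduce the pattern seen in the example values (lower case with underscores, dots removed, and a capitalised first letter for the URL) and are not taken from EasyImport.

```shell
# Hypothetical helper: derive a SPECIES.PRODUCTION_NAME-style value.
# Usage: production_name "Scientific name" "AssemblyName"
production_name() {
  # lower-case, replace spaces with underscores and drop any dots
  printf '%s_%s\n' "$1" "$2" |
    tr '[:upper:]' '[:lower:]' | tr ' ' '_' | tr -d '.'
}

# Hypothetical helper: derive a SPECIES.URL-style value, which in the
# examples above is the production name with its first letter capitalised.
species_url() {
  url_base=$(production_name "$1" "$2")
  url_first=$(printf '%s' "$url_base" | cut -c1 | tr '[:lower:]' '[:upper:]')
  url_rest=$(printf '%s' "$url_base" | cut -c2-)
  printf '%s%s\n' "$url_first" "$url_rest"
}

production_name "Operophtera brumata" "Obru1"   # prints operophtera_brumata_obru1
species_url "Operophtera brumata" "Obru.1"      # prints Operophtera_brumata_obru1
```

The second call uses a made-up assembly name containing a dot to show that the dot is dropped from the derived value while ASSEMBLY.NAME itself could keep it.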
Set assembly-specific file locations and gff parameters
Edit <database name>.ini to set paths to files to import, locations of identifiers in the files and settings to control the way the gff file is processed:
files can be in any location accessible on the local filesystem or via ftp/http
in this example names are provided in the gff file, so stable IDs can be set using the Name attribute; more commonly only ID is available, and this would then be used as the source of the stable IDs (see full documentation at easy-import.readme.io)
this guide assumes you will be importing valid gff3; full details of the syntax to repair invalid gff files during import are available at easy-import.readme.io
Run an EasyImport Docker container with flags to import sequences (-s), prepare (-p) and import (-g) the gff, and verify (-v) the import using the provided protein FASTA file:
verification compares imported gene models against an expected protein FASTA sequence; if you do not have a file with predicted protein sequences you may wish to export sequences from the database (replacing the -v flag with -e) to check the translations manually
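The choice between the -v and -e flags described above can be written down as a short shell sketch. The import_flags function and the proteins.fa.gz file name are hypothetical and only illustrate the decision; how the flag string is handed to the container is not shown here and should be taken from the EasyImport documentation.

```shell
# Hypothetical helper: choose the flag string described above.
# Pass the location of a protein FASTA file, or nothing if none is available.
import_flags() {
  if [ -n "${1:-}" ]; then
    echo "-s -p -g -v"   # verify against the provided protein sequences
  else
    echo "-s -p -g -e"   # export sequences so translations can be checked manually
  fi
}

import_flags "proteins.fa.gz"   # prints -s -p -g -v
import_flags                    # prints -s -p -g -e
```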
Check the import and verification log files for errors:
local files for each assembly will be written to a <database name> directory
there are likely to be a number of warnings, which can be ignored
if there are errors, update the configuration files and rerun the step above
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_40_93_1
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_40_93_1/log
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_40_93_1/summary
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_36_89_1
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_36_89_1/logs
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_36_89_1/summary
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_32_85_1
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_32_85_1/logs
$ ls ~/genomehubs/v1/import/data/operophtera_brumata_obru1_core_32_85_1/summary
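One way to get a first impression of the log files listed above is to count the lines that mention errors. This is a minimal sketch that assumes errors are reported on lines containing the word "error" in any case; the log format is not described here, so treat the count as a hint rather than a verdict. The count_error_lines function is a hypothetical helper.

```shell
# Hypothetical helper: count lines mentioning "error" (any case) in all files
# below a directory. The log format is an assumption, so the count is a hint only.
count_error_lines() {
  # -r recurse, -i ignore case, -h omit file names; wc -l counts matching lines
  grep -rih 'error' "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Example using the log directory of the 40/93 import shown above
logdir="$HOME/genomehubs/v1/import/data/operophtera_brumata_obru1_core_40_93_1/log"
if [ -d "$logdir" ]; then
  echo "lines mentioning errors: $(count_error_lines "$logdir")"
fi
```

Warnings are deliberately not counted, since the text above notes that they can usually be ignored.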
Process exceptions
If it is not possible to process all features in a GFF3 file with the current settings (e.g. if some features lack Name attributes), features not written to the prepared gff will be written to a .exception.gff file for processing in a second pass. See easy-import.readme.io for details.
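Before planning a second pass it can help to see which kinds of features ended up in the exception file. The following is a minimal sketch that relies only on the GFF3 convention that the feature type is stored in the third tab-separated column; summarise_gff_types is a hypothetical helper and the path in the usage comment is a placeholder.

```shell
# Hypothetical helper: tabulate feature types (column 3) in a GFF3 file.
summarise_gff_types() {
  # skip comment and directive lines, which start with "#"; count the rest by type
  awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
               END { for (t in n) print t "\t" n[t] }' "$1" | LC_ALL=C sort
}

# Example (path is a placeholder):
#   summarise_gff_types /path/to/your.exception.gff
```

Each output line holds a feature type and the number of features of that type, separated by a tab and sorted by type.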