Run comparative analyses

A powerful Ensembl feature supported by GenomeHubs is the ability to incorporate orthology information in the Compara database schema, which can be displayed in gene trees on the web site. GenomeHubs does not use the Ensembl Compara pipeline but provides an analysis containers and import scripts to populate a Compara database using an OrthoFinder-based approach.

Create a config file with the same name as your Compara database in the import/conf directory. This will have the form sitename_compara_40_93.ini. The examples below are based on mealybug.org so substitute mealybug for your site name. The initial sections in this file are similar to the assembly config files described elsewhere:

nano ~/genomehubs/v1/import/conf/mealybug-compara_40_93.ini
[ENSEMBL]
LOCAL = /ensembl
[DATABASE_COMPARA]
NAME = mealybug_compara_40_93
HOST = genomehubs-mysql
PORT = 3306
RW_USER = importer
RW_PASS = importpassword
RO_USER = anonymous
RO_PASS =
[DATABASE_TEMPLATE]
NAME = ensembl_compara_metazoa_40_93
URL = ftp://ftp.ensemblgenomes.org/pub/release-40/metazoa/mysql/
[DATABASE_CORE]
HOST = genomehubs-mysql
PORT = 3306
RO_USER = anonymous
RO_PASS =

The [SETUP] section contains some paths to define where files will be written relative to the container filesystem. FASTA_DIR will contain sequence files exported from the core databases ready for analysis, TMP_DIR is needed to retain untrimmed alignments that would otherwise be discarded by OrthoFinder, ORTHOFINDER_DIR will contain the OrthoFinder results and ORTHOGROUP_DIR is used for processed orthogroup files ready for import into the Compara database. REMOVE is probably not needed but is listed here in case it is still used somewhere.

[SETUP]
FASTA_DIR = /import/data/mealybug_compara_40_93/exported
TMP_DIR = /import/data/tmp
ORTHOFINDER_DIR = /import/data/mealybug_compara_40_93/OrthoFinder/Results_v1
ORTHOGROUP_DIR = /import/data/mealybug_compara_40_93/orthogroups
REMOVE = [ ]

The [TAXA] section contains a mapping between the assembly database names to be used in the analysis and a set of short names for use in files. There may be some constraints on these names so it is probably best to use six upper case letters to represent each assembly.

The [SPECIES_SET] section defines a name for the species set (in this case HEMIPTERA - currently only one species set is supported) and the assemblies that should be included in the set. The TREE_FILE needs to be the same file used by OrthoFinder. OrthoFinder supports a user specified tree, but currently the only supported option is to use the species tree inferred by OrthoFinder, which will be located at a path similar to the example. The Tree_Label should match one of the ranks in the NCBI taxonomy-based classification of the assemblies in your GenomeHub.

[TAXA]
ACYPIS = acyrthosiphon_pisum_core_40_93_2
BEMTAB = bemisia_tabaci_meam1v1x1_core_40_93_1
MYZPER = myzus_persicae_cloneg006v2_core_40_93_1
PLACIT = planococcus_citri_pcitriv1_core_40_93_1
PLAFIC = planococcus_ficus_pficusv0_core_40_93_1
PSELON = pseudococcus_longispinus_plongv1_core_40_93_1
RHOPRO = rhodnius_prolixus_core_40_93_3
[SPECIES_SET]
HEMIPTERA = [ ACYPIS BEMTAB MYZPER PLACIT PLAFIC PSELON RHOPRO ]
TREE_FILE = /import/data/mealybug_compara_40_93/OrthoFinder/Results_v1/Species_Tree/SpeciesTree_rooted_node_labels.txt
TREE_LABEL = Hemiptera

The [ORTHOGROUP] section defines a set of file types and suffixes to use when preparing files for import in the ORTHOGROUP_DIR. Most of these should not need to be altered but the PREFIX should be changed to math your sit name. The example PREFIX is for MealyBug Gene Tree, version 01.

[ORTHOGROUP]
PREFIX = MBGT01
PROTEIN = .faa
PROTEIN_ALIGN = [ .maaft.faa Mafft ]
PROTEIN_TRIMMED = .maaft.trimmed.faa
FNAFILE = .fna
BOUNDEDFILE = .fba
TREE = .raxml-ng.nhx
HOMOLOG = .homologs.txt

A number of fields need to be defined in the [METHOD_LINK] section. This probably doesn't belong here but it is needed to populate a table in the Compara database and should be copied and pasted directly into your configuration file.

[METHOD_LINK]
; Names and classes of methods used with the main table in which results are stored
; Method class must be one of DNA-DNA_ALIGNMENT, GENOMIC_SYNTENY, PROTEIN_HOMOLOGY, PROTEIN_FAMILY or PROTEIN-NCRNA_TREE
; KEY = [ METHOD_CLASS DATABASE_TABLE.DATA_TYPE ]
BLASTZ_NET = [ DNA-DNA_ALIGNMENT GenomicAlignBlock.pairwise_alignment ]
TRANSLATED_BLAT = [ DNA-DNA_ALIGNMENT GenomicAlignBlock.pairwise_alignment ]
TRANSLATED_BLAT_NET = [ DNA-DNA_ALIGNMENT GenomicAlignBlock.pairwise_alignment ]
PECAN = [ DNA-DNA_ALIGNMENT GenomicAlignBlock.multiple_alignment ]
ORTHEUS = [ DNA-DNA_ALIGNMENT GenomicAlignTree.tree_alignment ]
LASTZ_NET = [ DNA-DNA_ALIGNMENT GenomicAlignBlock.pairwise_alignment ]
ATAC = [ DNA-DNA_ALIGNMENT GenomicAlignBlock.pairwise_alignment ]
SYNTENY = [ GENOMIC_SYNTENY SyntenyRegion.synteny ]
ENSEMBL_ORTHOLOGUES = [ PROTEIN_HOMOLOGY Homology.homology ]
ENSEMBL_PARALOGUES = [ PROTEIN_HOMOLOGY Homology.homology ]
ENSEMBL_PROJECTIONS = [ PROTEIN_HOMOLOGY Homology.homology ]
ENSEMBL_HOMOEOLOGUES = [ PROTEIN_HOMOLOGY Homology.homology ]
FAMILY = [ PROTEIN_FAMILY Family.family ]
PROTEIN_TREES = [ PROTEIN-NCRNA_TREE ProteinTree.protein_tree_node ]
GERP_CONSTRAINED_ELEMENT = [ PROTEIN-NCRNA_TREE GenomicAlignBlock.constrained_element ]

The remaining steps can all be run in a single step using the genomehubs/compara container, however it may be useful to run the individual steps separately to debug any problems with the configuration. The steps to run are controlled by the FLAGS:

  • -e - Export sequence files from the core databases listed in the [TAXA] section. Three files are written for each assembly, representing a single, canonical transcript per gene in files of protein sequence, protein sequence showing exon boundaries and CDS sequence.

  • -o - Run OrthoFinder. The command to OrthoFinder is of the form orthofinder -f $FASTA_DIR -M msa -S diamond -A mafft_and_trim -T raxml-ng -o $ORTHOFINDER_DIR -n $VERSION -t $THREADS. $VERSION is determined based on the version number at the end of the ORTHOFINDER_DIR parameter. The handler scripts are not yet sophisticated to allow OrthoFinder to be resumed so the analysis needs to be run in a single step.

  • -m - Make orthogroup files. Processes the OrthoFinder output to generate a set of files per orthogroup fready for import into the Compara database.

  • -s - Setup database. Creates a new/replaces and existing compara database and populates some tables in preparation fo importing the orthogroup files.

  • -i - Import orthogroup files. Files are imported for each orthogroup in turn.

docker run --rm \
-u $UID:$GROUPS \
--name compara-import \
--network genomehubs-network
-v ~/genomehubs/v1/import/conf:/import/conf \
-v ~/genomehubs/v1/import/data:/import/data \
-e FLAGS="-e -o -m -s -i" \
-e DATABASE=mealybug_compara_40_93 \
-e THREADS=16 \
genomehubs/compara:19.05

Orthofinder can be configured further by setting environment variables to determine the similarity search (SEARCH), multiple sequence alignment (ALIGN) and tree reconstruction (TREE) options to use.

running with default settings is equivalent to specifying

-e SEARCH=diamond
-e ALIGN=mafft_and_trim
-e TREE=fasttree

alternate options can be selected from the default config.json file:

{
"muscle":{
"program_type": "msa",
"cmd_line": "muscle -in INPUT -out OUTPUT"
},
"raxml":{
"program_type": "tree",
"cmd_line": "raxmlHPC-AVX -m PROTGAMMALG -p 12345 -s INPUT -n IDENTIFIER -w PATH > /dev/null",
"ouput_filename": "PATH/RAxML_bestTree.IDENTIFIER"
},
"raxml-ng":{
"program_type": "tree",
"cmd_line": "raxml-ng --msa INPUT --model LG+G4 --seed 12345 --threads 1 --blopt nr_safe",
"ouput_filename": "INPUT.raxml.bestTree"
},
"iqtree":{
"program_type": "tree",
"cmd_line": "iqtree -s INPUT -pre PATH/IDENTIFIER > /dev/null",
"ouput_filename": "PATH/IDENTIFIER.treefile"
},
"fasttree_lg_gamma":{
"program_type": "tree",
"cmd_line": "fasttree -lg -gamma < INPUT > OUTPUT"
},
"diamond":{
"program_type": "search",
"db_cmd": "diamond makedb --in INPUT -d OUTPUT",
"search_cmd": "diamond blastp -d DATABASE -q INPUT -o OUTPUT --more-sensitive -p 1 --quiet -e 0.001 --compress 1"
},
"blast_gz":{
"program_type": "search",
"db_cmd": "makeblastdb -dbtype prot -in INPUT -out OUTPUT",
"search_cmd": "blastp -outfmt 6 -evalue 0.001 -query INPUT -db DATABASE | gzip > OUTPUT.gz"
},
"mmseqs":{
"program_type": "search",
"db_cmd": "mmseqs createdb INPUT OUTPUT.fa ; mmseqs createindex OUTPUT.fa /tmp",
"search_cmd": "mmseqs search PATH/mmseqsDBBASENAME DATABASE.fa OUTPUT.db /tmp/tmpBASEOUTNAME --threads 1 ; mmseqs convertalis PATH/mmseqsDBBASENAME DATABASE.fa OUTPUT.db OUTPUT"
},
"mafft_and_trim":{
"program_type": "msa",
"cmd_line": "mafft --auto --anysymbol INPUT > /import/data/tmp/BASENAME_pre_trim 2> /dev/null; trimal -in /import/data/tmp/BASENAME_pre_trim -out OUTPUT -gappyout "
}
}

To load the Compara analyses into your Ensembl site, add your database name to the Ensembl setup.ini file then remove and restart your Ensembl container.

nano ~/genomehubs/v1/ensembl/conf/setup.ini
...
COMPARA_DBS = [ mealybug_compara_40_93 ensembl_compara_metazoa_40_93 ]
docker rm -f genomehubs-ensembl
docker run -d \
--name genomehubs-ensembl \
-v ~/genomehubs/v1/ensembl/conf:/conf:ro \
--network genomehubs-network \
-p 8881:8080 \
genomehubs/easy-mirror:19.05