import/conf directory. This will have the form sitename_compara_40_93.ini. The examples below are based on mealybug.org so substitute mealybug
for your site name. The initial sections in this file are similar to the assembly config files described elsewhere.

The [SETUP] section contains some paths to define where files will be written relative to the container filesystem. FASTA_DIR will contain sequence files exported from the core databases ready for analysis, TMP_DIR is needed to retain untrimmed alignments that would otherwise be discarded by OrthoFinder, ORTHOFINDER_DIR will contain the OrthoFinder results and ORTHOGROUP_DIR is used for processed orthogroup files ready for import into the Compara database. REMOVE is probably not needed but is listed here in case it is still used somewhere.

The [TAXA]
section contains a mapping between the assembly database names to be used in the analysis and a set of short names for use in files. There may be some constraints on these names so it is probably best to use six upper case letters to represent each assembly.

The [SPECIES_SET] section defines a name for the species set (in this case HEMIPTERA; currently only one species set is supported) and the assemblies that should be included in the set. The TREE_FILE needs to be the same file used by OrthoFinder. OrthoFinder supports a user-specified tree, but the only option currently supported here is to use the species tree inferred by OrthoFinder, which will be located at a path similar to the example. The Tree_Label should match one of the ranks in the NCBI taxonomy-based classification of the assemblies in your GenomeHub.

The [ORTHOGROUP]
section defines a set of file types and suffixes to use when preparing files for import in the ORTHOGROUP_DIR. Most of these should not need to be altered but the PREFIX should be changed to match your site name. The example PREFIX is for MealyBug Gene Tree, version 01.

[METHOD_LINK]
section. This probably doesn't belong here but it is needed to populate a table in the Compara database and should be copied and pasted directly into your configuration file.

genomehubs/compara container, however it may be useful to run the individual steps separately to debug any problems with the configuration. The steps to run are controlled by the FLAGS:

-e - Export sequence files from the core databases listed in the [TAXA] section. Three files are written for each assembly, each representing a single, canonical transcript per gene: protein sequence, protein sequence showing exon boundaries, and CDS sequence.
-o - Run OrthoFinder. The OrthoFinder command is of the form orthofinder -f $FASTA_DIR -M msa -S diamond -A mafft_and_trim -T raxml-ng -o $ORTHOFINDER_DIR -n $VERSION -t $THREADS. $VERSION is determined from the version number at the end of the ORTHOFINDER_DIR parameter. The handler scripts are not yet sophisticated enough to allow OrthoFinder to be resumed, so the analysis needs to be run in a single step.
-m - Make orthogroup files. Processes the OrthoFinder output to generate a set of files per orthogroup ready for import into the Compara database.
-s - Set up database. Creates a new Compara database, or replaces an existing one, and populates some tables in preparation for importing the orthogroup files.
-i - Import orthogroup files. Files are imported for each orthogroup in turn.

(SEARCH
), multiple sequence alignment (ALIGN) and tree reconstruction (TREE) options to use.

config.json file:

setup.ini file then remove and restart your Ensembl container.
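
Because the OrthoFinder step cannot be resumed, it may be worth checking the configuration before starting a run. The following is a minimal Python sketch of such a check, not part of the GenomeHubs scripts. It tests the [TAXA] short names against the six upper case letter suggestion and prints a command of the form quoted for the -o step. The sample paths and database names are hypothetical. The sketch also assumes that the short names are the keys in [TAXA], that $VERSION is the run of digits at the end of ORTHOFINDER_DIR, and that the thread count is supplied by the caller.

```python
import configparser
import re
import shlex

# Hypothetical sample values; only the section and key names come from the text.
SAMPLE_INI = """
[SETUP]
FASTA_DIR = /import/compara/fasta
ORTHOFINDER_DIR = /import/compara/orthofinder_01

[TAXA]
PLACIT = example_species_one_core_40_93_1
PSLONG = example_species_two_core_40_93_1
"""


def load_config(text):
    """Parse ini text, preserving the case of key names."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # configparser lower-cases keys by default
    parser.read_string(text)
    return parser


def bad_short_names(names):
    """Return the names that are not exactly six upper-case ASCII letters."""
    return [name for name in names if not re.fullmatch(r"[A-Z]{6}", name)]


def version_from_dir(orthofinder_dir):
    """Assumed rule: the version is the run of digits that ends the path."""
    match = re.search(r"(\d+)/*$", orthofinder_dir)
    if match is None:
        raise ValueError(f"no trailing version number in {orthofinder_dir!r}")
    return match.group(1)


def orthofinder_command(setup, threads):
    """Build the argument list for the command form quoted in the -o step."""
    return [
        "orthofinder",
        "-f", setup["FASTA_DIR"],
        "-M", "msa",
        "-S", "diamond",
        "-A", "mafft_and_trim",
        "-T", "raxml-ng",
        "-o", setup["ORTHOFINDER_DIR"],
        "-n", version_from_dir(setup["ORTHOFINDER_DIR"]),
        "-t", str(threads),
    ]


if __name__ == "__main__":
    config = load_config(SAMPLE_INI)
    # Assumption: the keys of [TAXA] are the short names.
    print("invalid short names:", bad_short_names(config["TAXA"]))
    print(shlex.join(orthofinder_command(config["SETUP"], threads=4)))
```

The command is built as an argument list rather than a single string so that paths containing spaces are passed through intact. If your configuration file uses a different key layout or version suffix, adjust the two assumed rules accordingly.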