import/conf directory. This will have the form
sitename_compara_40_93.ini. The examples below are based on mealybug.org so substitute
mealybug for your site name. The initial sections in this file are similar to the assembly config files described elsewhere:
[SETUP] section contains some paths to define where files will be written relative to the container filesystem.
FASTA_DIR will contain sequence files exported from the core databases ready for analysis,
TMP_DIR is needed to retain untrimmed alignments that would otherwise be discarded by OrthoFinder,
ORTHOFINDER_DIR will contain the OrthoFinder results and
ORTHOGROUP_DIR is used for processed orthogroup files ready for import into the Compara database.
REMOVE is probably not needed but is listed here in case it is still used somewhere.
[TAXA] section contains a mapping between the assembly database names to be used in the analysis and a set of short names for use in files. There may be some constraints on these names, so it is probably best to use six upper-case letters to represent each assembly.
[SPECIES_SET] section defines a name for the species set (in this case
HEMIPTERA; currently only one species set is supported) and the assemblies that should be included in the set. The
TREE_FILE needs to be the same file used by OrthoFinder. OrthoFinder supports a user-specified tree, but currently the only option supported here is to use the species tree inferred by OrthoFinder, which will be located at a path similar to the example. The
Tree_Label should match one of the ranks in the NCBI taxonomy-based classification of the assemblies in your GenomeHub.
[ORTHOGROUP] section defines a set of file types and suffixes to use when preparing files for import in the
ORTHOGROUP_DIR. Most of these should not need to be altered but the
PREFIX should be changed to match your site name. The example
PREFIX is for MealyBug Gene Tree, version 01.
[METHOD_LINK] section. This probably doesn't belong here, but it is needed to populate a table in the Compara database and should be copied and pasted directly into your configuration file.
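Before running the import it may help to sanity-check the configuration file. The following is a minimal sketch rather than part of GenomeHubs: it assumes that each [TAXA] entry has the form `database_name = SHORTNAME`, treats the four directory keys named above as required, and applies the six upper-case letter suggestion as a check. Every path, database name and short name in the embedded example is a hypothetical placeholder, and `check_config` is a name invented here for illustration.

```python
import configparser
import re

# Hypothetical config contents: every path, database name and short name
# below is a placeholder, not a value taken from a real GenomeHubs site.
EXAMPLE_INI = """
[SETUP]
FASTA_DIR = /example/fasta
TMP_DIR = /example/tmp
ORTHOFINDER_DIR = /example/orthofinder_v01
ORTHOGROUP_DIR = /example/orthogroups

[TAXA]
example_species_one_core_40_93_1 = EXSPON
example_species_two_core_40_93_1 = EXSPTW
"""

# Directory keys described for the [SETUP] section.
REQUIRED_SETUP_KEYS = ("FASTA_DIR", "TMP_DIR", "ORTHOFINDER_DIR", "ORTHOGROUP_DIR")

# Assumed rule: a short name is exactly six upper-case ASCII letters.
SHORT_NAME = re.compile(r"[A-Z]{6}")


def check_config(text):
    """Return a list of human-readable problems found in the config text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)

    problems = []
    for key in REQUIRED_SETUP_KEYS:
        if not parser.has_option("SETUP", key):
            problems.append(f"[SETUP] is missing {key}")

    if not parser.has_section("TAXA"):
        problems.append("[TAXA] section is missing")
    else:
        for db_name, short_name in parser.items("TAXA"):
            if not SHORT_NAME.fullmatch(short_name):
                problems.append(
                    f"[TAXA] short name {short_name!r} for {db_name} "
                    "is not six upper-case letters"
                )
    return problems


if __name__ == "__main__":
    for problem in check_config(EXAMPLE_INI):
        print(problem)
```

An empty list means that none of these particular checks failed; it does not guarantee that the import scripts will accept the file.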
genomehubs/compara container; however, it may be useful to run the individual steps separately to debug any problems with the configuration. The steps to run are controlled by the
-e - Export sequence files from the core databases listed in the
[TAXA] section. Three files are written for each assembly, representing a single, canonical transcript per gene in files of protein sequence, protein sequence showing exon boundaries and CDS sequence.
-o - Run OrthoFinder. The command to OrthoFinder is of the form
orthofinder -f $FASTA_DIR -M msa -S diamond -A mafft_and_trim -T raxml-ng -o $ORTHOFINDER_DIR -n $VERSION -t $THREADS.
$VERSION is determined based on the version number at the end of the
ORTHOFINDER_DIR parameter. The handler scripts are not yet sophisticated enough to allow OrthoFinder to be resumed, so the analysis needs to be run in a single step.
-m - Make orthogroup files. Processes the OrthoFinder output to generate a set of files per orthogroup ready for import into the Compara database.
-s - Setup database. Creates a new Compara database, or replaces an existing one, and populates some tables in preparation for importing the orthogroup files.
-i - Import orthogroup files. Files are imported for each orthogroup in turn.
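To make the shape of the OrthoFinder command in the -o step concrete, here is a hedged sketch that assembles the same argument list from configuration values. It is not the handler script itself. In particular, the rule used to derive the version (take the trailing run of digits in ORTHOFINDER_DIR) is an assumption, since the text only says the version comes from the end of that parameter, and the function names and example paths are hypothetical.

```python
import re
import shlex


def orthofinder_version(orthofinder_dir):
    """Assumed rule: use the trailing run of digits in the directory path."""
    match = re.search(r"(\d+)/*$", orthofinder_dir)
    if match is None:
        raise ValueError(f"no trailing version number in {orthofinder_dir!r}")
    return match.group(1)


def orthofinder_command(fasta_dir, orthofinder_dir, threads,
                        search="diamond", align="mafft_and_trim",
                        tree="raxml-ng"):
    """Build the argument list in the form quoted for the -o step."""
    return [
        "orthofinder",
        "-f", fasta_dir,
        "-M", "msa",
        "-S", search,
        "-A", align,
        "-T", tree,
        "-o", orthofinder_dir,
        "-n", orthofinder_version(orthofinder_dir),
        "-t", str(threads),
    ]


if __name__ == "__main__":
    # Placeholder paths; the command is printed here, not executed.
    command = orthofinder_command("/example/fasta", "/example/orthofinder_v01", 4)
    print(shlex.join(command))
```

Printing the command instead of running it makes it easy to compare against what the container reports while debugging a configuration.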
SEARCH), multiple sequence alignment (
ALIGN) and tree reconstruction (
TREE) options to use.
setup.ini file, then remove and restart your Ensembl container.