MAKER is a gene annotation pipeline for both eukaryotic and prokaryotic genome projects. The HPCC build of MAKER uses the following external applications: NCBI BLAST, Augustus, TRF, RepeatMasker, Exonerate, tRNAscan, SNAP and GeneMark-ES. RepeatMasker in the HPCC build uses RMBlast as its default search engine, and the full version of the RepeatMasker libraries from GIRI (Repbase) has been installed.
The HPCC build of MAKER has both MPI (distributed memory) and non-MPI multi-threaded (shared memory, single node) capabilities.
Several of the external applications required for the MAKER 2.31 pipeline are self-contained within the MAKER build. A few others must be downloaded, built and configured separately. The table below presents the versions and source of each external application used by the current MAKER build:
|Application||Version||Source||Notes|
|BLAST+||2.2.27||internal||blastn, blastx, tblastx, makeblastdb|
|RepeatMasker||4.0.5||internal||includes full RepBase libraries for RepeatMasker|
|RMBlast||2.2.28||external||default search for RepeatMasker|
Running MAKER (>= 2.31)
MAKER versions >= 2.31 were built with MPICH2 to provide full MPI-capable processing on one or more nodes. MAKER versions 2.28 and 2.10 (also available) have also supported MPI since November 2013. To use the latest MPI-capable MAKER:
Loading the MAKER module also loads the necessary prerequisites and sets all of the required environment variables and paths. MAKER is run from inside the data directory and is guided by a set of control files, which must be generated before the main maker command is executed.
Generating Control Files
Before you can begin running MAKER, you will need to generate run-specific control (CTL) files. In this tutorial, we are going to load the MPI-capable build of MAKER, generate the control files, and then run MAKER on the example data provided in the MAKER source package. To facilitate this, we are going to copy the example data into a subdirectory of our scratch space:
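A minimal sketch of the copy step. The source and destination paths are not known here, so both are passed in as arguments; the `stage_maker_example` helper and the `MAKER_EXAMPLE_DATA` and `MAKER_RUN_DIR` variables are hypothetical names used only for illustration and are not set by the MAKER module:

```shell
# Copy MAKER's example data into a working directory.
# Both arguments are placeholders; nothing here is specific to the HPCC install.
stage_maker_example() {
    src=$1     # directory holding the example FASTA files
    dest=$2    # working directory to create (e.g. under your scratch space)
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    cp -R "$src"/. "$dest"/
}

# Hypothetical usage: runs only when both variables have been set by hand.
if [ -n "${MAKER_EXAMPLE_DATA:-}" ] && [ -n "${MAKER_RUN_DIR:-}" ]; then
    stage_maker_example "$MAKER_EXAMPLE_DATA" "$MAKER_RUN_DIR"
fi
```

Copying the whole directory rather than individual files avoids depending on the exact names of the example files.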
You should now have a directory called:
Containing the following files:
Now let's generate the CTL files:
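MAKER writes its control files into the current directory when called with the `-CTL` option. The sketch below runs that step only if `maker` is on the PATH, then reports any control file that is still missing. The `missing_ctl_files` helper is a hypothetical convenience, and the three file names it checks are the ones MAKER 2.x normally generates:

```shell
# Generate the control files in the current directory (requires maker on PATH).
if command -v maker >/dev/null 2>&1; then
    maker -CTL
else
    echo "maker not found on PATH; load the MAKER module first" >&2
fi

# Print the name of any expected control file that is missing from DIR.
missing_ctl_files() {
    dir=${1:-.}
    for f in maker_opts.ctl maker_bopts.ctl maker_exe.ctl; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

# No output here means all three files are present.
missing_ctl_files .
```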
We should now have the following three (3) control files:
maker_exe.ctl - contains the path information for the underlying executables.
maker_bopts.ctl - contains filtering statistics for BLAST and Exonerate.
maker_opts.ctl - contains all other information for MAKER, including the location of the input genome file.
It should NOT be necessary to change the contents of the first file (maker_exe.ctl) - all entries should be pre-populated with the correct executable paths. Most users would likely only need to concentrate on making changes to the BLAST/Exonerate and MAKER run options.
For our example, we are going to change the "maker_opts.ctl" file to tell MAKER where to find our input files. Fire up your favorite text editor (in this example, we'll use nano) and change the following lines as shown:
Save the file.
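As an alternative to editing by hand, the same kind of change can be scripted. This is a sketch under stated assumptions: the entries being set are taken to be `genome`, `est` and `protein` in maker_opts.ctl, and the FASTA names are assumed names for the example data (only `dpp_contig` is implied by the output directory name used later). `set_ctl_option` is a hypothetical helper, not part of MAKER:

```shell
# Set KEY=VALUE in a MAKER control file, keeping any trailing "#comment".
# VALUE must not contain "|", "&" or "\", which are special in the sed expression.
set_ctl_option() {
    file=$1 key=$2 value=$3
    tmp="$file.tmp.$$"
    sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Assumed input file names; adjust to match the files in your run directory.
if [ -f maker_opts.ctl ]; then
    set_ctl_option maker_opts.ctl genome  dpp_contig.fasta
    set_ctl_option maker_opts.ctl est     dpp_est.fasta
    set_ctl_option maker_opts.ctl protein dpp_protein.fasta
    # Show the edited entries for a quick visual check.
    grep -E '^(genome|est|protein)=' maker_opts.ctl
fi
```

Because the pattern requires `=` directly after the key, setting `est` leaves entries such as `est2genome` untouched.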
Running as a Cluster Job
Now we're going to launch the job on the HPCC cluster using the application's MPI capabilities. For this example, let's use the following in a job script called "makerTest.sh":
In this example, we are going to use 2 nodes with 2 cores each (4 cores in total) and a total memory allocation of 20GB for the whole job, or about 5GB per core on average. Your data set may need different resources; these values are for illustration only. The maker command is run under "mpiexec", and the flag "-n 4" tells MPICH2 to start 4 processes, one per requested core.
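A job script consistent with that description might look like the sketch below, which writes "makerTest.sh" into the current directory. It assumes a Torque/PBS scheduler, as suggested by the `qstat -f` call and the `.e`/`.o` output files mentioned in this document. The module name and the walltime are assumptions, not values from the original example:

```shell
# Write the job script; the quoted 'EOF' keeps ${...} unexpanded until the job runs.
cat > makerTest.sh <<'EOF'
#!/bin/bash
# 2 nodes x 2 cores = 4 MPI processes; 20 GB total, about 5 GB per process.
# The walltime below is an assumed limit; adjust it for your data set.
#PBS -l nodes=2:ppn=2
#PBS -l mem=20gb
#PBS -l walltime=01:00:00

# Hypothetical module name; use the one listed by "module avail" on your system.
module load MAKER

# Run from the directory the job was submitted from (where the CTL files live).
cd "${PBS_O_WORKDIR}"

# Start 4 MAKER processes, matching nodes x ppn above.
mpiexec -n 4 maker

# Record the job's details and resource usage in the .o file.
qstat -f "${PBS_JOBID}"
EOF
```

The value given to `-n` should match nodes multiplied by ppn; requesting more processes than cores would oversubscribe the allocation.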
Now launch the job:
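A sketch of the submission step, assuming the cluster uses the Torque/PBS `qsub` command (inferred from the `.e`/`.o` output file names described later, not stated explicitly in this document). `submit_maker_job` is a hypothetical wrapper that fails with a clear message when `qsub` is not available:

```shell
# Submit a job script with qsub, which prints the new job's identifier.
submit_maker_job() {
    script=${1:-makerTest.sh}
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; run this on a cluster login node" >&2
        return 127
    fi
    qsub "$script"
}

# Example call (commented out so nothing is submitted by accident):
# submit_maker_job makerTest.sh
```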
This run should finish pretty quickly. If you scan your run directory, you should see a new subdirectory called "dpp_contig.maker.output". This will contain your results.
Older Versions of MAKER (2.10)
If using MAKER 2.10 on the HPCC, there is a separate executable file for MPI runs. The process for running MAKER is nearly identical to the instructions provided above, except for the following command line argument:
Inside the "dpp_contig.maker.output" subdirectory created by the run, you should see the following:
- The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
- The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
- The dpp_contig_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
- The dpp_contig_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic FASTA file.
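The master datastore index can be used for a quick check of run status. The sketch below assumes each line of the log is tab-separated with three fields (contig name, datastore path, status) and that a contig may appear more than once as its status changes, for example STARTED followed by FINISHED. `summarize_maker_index` is a hypothetical helper:

```shell
# Count contigs by their most recent status in a MAKER master datastore index.
summarize_maker_index() {
    awk -F '\t' '
        NF >= 3 { last[$1] = $3 }          # later lines overwrite earlier ones
        END {
            for (c in last) count[last[c]]++
            for (s in count) print s, count[s]
        }
    ' "$1" | sort
}

# Run against the example output if it exists in the current directory.
index=dpp_contig.maker.output/dpp_contig_master_datastore_index.log
if [ -f "$index" ]; then
    summarize_maker_index "$index"
fi
```

Any status other than FINISHED in the summary points to contigs worth checking in the job's error output.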
You should also see two (2) other files containing standard program output. Let's assume our job number was: "234567". These two output files would be named:
- makerTest.sh.e234567 - browse this to see details of the MAKER run which would normally stream to the screen.
- makerTest.sh.o234567 - browse this to see the output of the "qstat -f" command at the end of the job script above, or to view errors (if any).