Mothur is an open-source bioinformatics toolkit aimed primarily at the needs of the microbial ecology community. The Mothur application can be built to provide either single-node multi-threaded functionality for certain tools or MPI-capable functionality via OpenMPI, BUT NOT BOTH. This tutorial explains the differences between the two builds, how to use each, and the potential advantages and costs of each approach.
MPI Run Characteristics
As discussed in other Mothur tutorials, MPI runs of Mothur are initiated by using something like the following:
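A minimal sketch of such a launch, assuming the MPI build's executable is named `mothur` and is on your PATH once the module is loaded. The process count and the batch file name `analysis.batch` are illustrative assumptions, not values taken from this tutorial.

```shell
# Sketch only: NP and BATCH are illustrative values, not site defaults.
NP=4                    # number of MPI processes to start
BATCH=analysis.batch    # hypothetical file containing mothur commands, one per line

# Assemble the launch line and print it for inspection before running it.
LAUNCH="mpirun -np ${NP} mothur ${BATCH}"
echo "${LAUNCH}"
```

Leaving off the batch file (`mpirun -np 4 mothur`) should start an interactive mothur session under MPI instead.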
On MPI builds of Mothur, if you omit the "mpirun -np #" prefix, setting the "processors=#" option on an eligible command WILL NOT GIVE YOU MULTI-PROCESSOR PERFORMANCE. In other words, an MPI build does not fall back to single-node multi-threaded mode when it is launched without the "mpirun -np #" prefix.
For example, suppose you load the default MPI build of Mothur and attempt to run the "cluster.split" function using multiple processors (4 in the following example):
In this case Mothur uses only 1 CPU, not 4, because the MPI build does not provide single-node multi-threaded functionality outside of MPI.
To run the above example on 4 CPUs, you must launch mothur under MPI by changing the initial launch line as follows:
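The two launch lines contrasted above can be sketched as follows, using mothur's command-line mode (a quoted command beginning with `#`). The input file names `final.dist` and `final.names` are hypothetical, and the sketch assumes the MPI build is the one currently loaded.

```shell
# Hypothetical input files; substitute your own distance and names files.
CMD='#cluster.split(column=final.dist, name=final.names, processors=4)'

# Without mpirun: on an MPI build this runs on 1 CPU despite processors=4.
SERIAL_LAUNCH="mothur \"${CMD}\""

# With mpirun: 4 MPI processes are started, matching processors=4.
MPI_LAUNCH="mpirun -np 4 ${SERIAL_LAUNCH}"

# Print both lines for comparison rather than executing them.
echo "${SERIAL_LAUNCH}"
echo "${MPI_LAUNCH}"
```

The only difference between the two lines is the `mpirun -np 4` prefix; the mothur command itself is unchanged.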
Although the examples above illustrate interactive job runs, the same rule applies to scheduled jobs run from job scripts on the HPCC.
The following Mothur installs on the HPCC have been designed to accommodate MPI-capable runs:
- 1.31.2 (default)
- 1.24.0, 1.23.1, 1.21.1, 1.20.3, 1.18.1
Multi-Processor, Non-MPI Runs
The most recent version of Mothur as of this writing has been installed as a build for non-MPI multi-threaded runs only:
To run Mothur on a single node with multiple processors, you must load the version above; the MPI builds will not work correctly for this purpose.
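If you are unsure which kind of build a loaded module provides, one possible check is to look for an MPI library among the executable's shared-library dependencies. This is only a sketch: it assumes a dynamically linked Linux executable and an MPI library whose name contains `libmpi` (as OpenMPI's does). The helper name `links_mpi` is made up for illustration.

```shell
# Reads `ldd` output on stdin; succeeds if an MPI library is listed.
links_mpi() {
    grep -qi 'libmpi'
}

# Usage (assumes `mothur` is on PATH after loading a module):
#   if ldd "$(command -v mothur)" | links_mpi; then
#       echo "MPI build: launch with mpirun -np #"
#   else
#       echo "non-MPI build: use processors=# directly"
#   fi
```

A statically linked executable would show no MPI library here even if it is an MPI build, so treat a negative result as a hint rather than proof.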
MPI versus Non-MPI Multiple CPU Runs
The primary advantages of MPI-capable runs over multi-processor single-node runs may be summarized as follows:
- More processors can be dedicated to the task than are available on a single node
- Jobs may be easier to schedule, because a few processors on each of several nodes often become available sooner than most or all of a single node (potentially shorter queue wait time)
The primary disadvantages of MPI-capable Mothur runs may be summarized as:
- Slower on single node runs
- Higher overhead and lower memory efficiency
- The additional processor advantages offered by MPI may be cancelled out by I/O waits to disk
Preliminary run testing is recommended to determine which approach is best for your data set and selected analyses. Remember that not all functions in the Mothur tool set are multiprocessor- or MPI-capable. Check the functions you plan to use to determine whether multiprocessor capabilities apply to any portion of your run. See the next section for more information.
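For the preliminary testing recommended above, a small timing helper is one way to compare the two builds on the same input. This is a sketch under stated assumptions: the helper name `time_run` and the batch file `test.batch` are hypothetical, and `date +%s` is assumed to be available (it is on common Linux systems).

```shell
# Run a command, discard its output, and print elapsed wall-clock seconds.
time_run() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage (hypothetical batch file; run each line with the matching build loaded):
#   time_run mpirun -np 4 mothur test.batch   # MPI build
#   time_run mothur test.batch                # non-MPI build, processors=4 set inside the batch file
```

Timing a reduced subset of your data first keeps the comparison cheap, though I/O effects may differ on the full data set.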
Multi-Processor Capable Mothur Functions
The following table lists the multi-processor capable mothur tools and indicates which of them are additionally MPI-capable (as of this writing).