Some of the most common questions I get from users are
- "How can I get my job to run faster?"
- "Why has my job been in the queue for so long?"
- "I set my ppn=20 for this job run, why didn't it finish any quicker?"
To answer those questions you need to better understand how the HPCC works, and how programs can (or cannot) use more than one processor. This tutorial attempts to address those knowledge gaps.
Different Types of Jobs
Serial versus Parallel
So what are the major types of jobs? For the purposes of this discussion, I'm going to classify them one of two (2) ways:
- Serial Jobs
- Parallel Jobs (which include):
- Multi-threaded jobs, and
- MPI jobs
Cores, CPUs, and Nodes
Simply stated, Serial Jobs are ones where only one (1) processing core is used (the most common job); and Parallel Jobs, are ones where more than one processing core is used. So what do we mean by "processing core?" Is that the same as a "CPU?" In the early days of microcomputer technology, a CPU was a fairly monolithic processing unit. We could say that most conventional CPUs during those days, were "single core". If engineers wanted to boost processing power, they could add multiple CPUs to a computer build (most commonly in enterprise grade servers). However, as technology has improved, this has become less, and less the case, even in personal workstations and laptops.
Most modern CPUs now possess multiple processing "cores", each being capable of handling its own process "thread" of instructions. Moreover, modern enterprise servers still commonly have more than one CPU, but each CPU will likely contain multiple processing cores. Below please find an illustration which shows this relationship:
In the diagram above, there is one CPU in each "socket" on the computer's main board (or motherboard). Each CPU has 4 cores (labeled Core 0 through Core 3). Also note that in this example, a "node" is a single computer (generally this term is used in high-performance clusters). So we have 1 node, 2 CPUs and 8 cores in total.
Multi-Processor, Multi-Core, & Hyperthreaded
We've already briefly discussed the differences between cores and CPUs, and between single and multiple CPUs. Another term some might have heard is "hyperthreading". Hyperthreading is interesting in its own right, but also highlights some functional differences - architecturally and practically - from multi-processor and multi-core units. Hyperthreaded processors contain some separate and parallel sub-processing components, but do not contain a fully independent set of processing capabilities. To put it another way, CPUs with hyperthreaded cores are processors where each core has access to its own multiple parallel sub-processing units and a shared cache and bus connection (among all cores on the CPU die). Note that a single CPU may contain multiple hyperthreaded cores. In contrast, simple multi-core units contain more or less independent processing hardware paths, with less "sharing" (save the cache and connection to the system bus), and multiple CPUs represent the ultimate in hardware separation (each CPU has an its own cache and bus connection, but may have multiple cores).
Of course one can have a computer with multiple-CPUs, each containing multiple hyperthreaded cores! A picture is worth a thousand words, and the following should do a good job of highlighting the major differences:
What shoud be important to the computational biologist is that hyperthreaded cores can have better performance than non-hyperthreaded cores, because a portion of the process execution thread work passing through that core can be handled in parallel. Multiple hyperthreaded cores can similarly process more work than non-hyperthreaded multi-core processors, and so on. It is also important to note that while multiple core CPUs can theoretically almost achieve fully-scaled performance improvements for each added core, adding hyperthreading to those cores does not (i.e., scaling improvements are quite fractional). So hyperthreading a single core will undoubtedly not double your workload throughput. How much can it improve things? Ballpark rule-of-thumb estimates range from about 30%-50%.
Ironically it is for this reason that many cluster system administrators will switch off hyperthreading on the machines under their management. More specifically, because it is difficult to precisely scale and control the added fractional performance boost of a hyperthreaded processor in a load-managed and job-scheduled cluster, administrators would often rather forego the potential benefits to avoid the possibility of overestimating the load carrying capacity of their nodes. On the MSU HPCC, this is exactly the case - hyperthreading capability exists on many of the cluster machines, but it is disabled.
The more processing cores and CPUs a computer has, the more work can be done in parallel on that machine. This can translate into running more than one job at the same time, or running one job faster by dividing the job task into "threads" of execution so that work can be done in parallel instead one after the other (i.e., in serial). It is important to note that parallelism doesn't magically happen when you assign more cores ("ppn" in the parlance of HPCC) to your job. An application must be parallelized algorithmically, which can be quite challenging.
As an example, consider a case where we are running BLAST to search for hits against some reference database. Let us assume that we have a very large set of reads we want to BLAST. We could parallelize this work fairly easily, since whether or not any particular read "hits" or "misses" on our database does not depend on whether any other read "hits" or "misses". More specifically, the task of BLAST searching a series of reads against a common database are independent of one another. To that end when NCBI developed the BLAST program, they provided multi-threading functionality wherein the user may specify any number of processors up to the maximum available on the runtime machine.
Our BLASTing example could also be implemented using "Embarrassingly Parallel" techniques if we really wanted. "Embarrassingly Parallel" is a term used by computer scientists which means little or no effort is required to subdivide a computational task into parallel work loads. So we could actually subdivide our read set into several subsets (let us say "8") and run each of these as part of a single threaded BLAST against the reference database. Then at the end, simply concatenate the results together and run any statistics we like. Fortunately, you do not have to do this since the developers at NCBI have taken care of this for you, but it does illustrate how some work loads can be parallelized fairly easily, while others cannot.
If your application is not parallelized algorithmically (i.e., in code), then you must run it as a serial job, or figure out a way to embarrassingly parallelize it yourself. Throwing multiple "ppn" or "nodes" at a serial application does not help you, and it wastes system resources.
More on Embarrassingly Parallel
While many algorithms that can be easily parallelized generally have been by someone, somewhere; this is not always the case. Many times, dividing up a programatic task using embarrassingly parallel techniques can be advantageous, not just in terms of total processing throughput, but in making more efficient use of available resources. Imagine you are running BLAST once again and are using the parallel multi-threading option for your job run. However, your data set is very, very large and the computers available to you have a maximum of 2 processing cores only. One could use a combination of embarrassingly parallel and multi-threading to subdivide the task amongst multiple machines. Let us say we have 3 machines available - all with 2 cores each. We could divide our data set into 3 equal parts and run a 2 core BLAST job on each of the available machines. All other factors remaining equal, we should be able to finish our job in 1/3 the time of running it on a single 2 core machine alone. We need only subdivide the data set, place a reference database copy on each machine, setup the jobs, and then concatenate our results at the end.
Why aren't there more Parallel Bioinformatics Applications?
While there are some problems that are strictly sequential in nature, and thus, cannot be parallelized successfully; there are numerous others that are members of the set "NC" that should be quite amenable to parallelization. So why hasn't that yet happened?
Asanovic et al (2006) point out that one of the chief obstacles to developing parallel algorithms for problem solving is the very human nature of scientists and computer programmers, stating: "Programmers have a difficult time determining when to synchronize in parallel code, and often get it wrong" (pp. 33). From a psychological standpoint, human programmers have trouble parallelizing code for a variety of reasons, including (but not limited to):
- The history and occupational practice of computer programming has tended to be very sequential in nature
- Sequential problem solving is "intuitive"
- For complex problems, programmers and scientists worry that they will break the algorithm by parallelizing it in some incorrect (but unintentional) way
- Because programmers do not want to break proven sequential algorithms, they may add thousands of lines of code to ensure proper correctness and consistency
- Cautionary programming practices can lead to code bloat, increased debugging times, and potentially wasted man-hours if the programmers miss something or just plain get it wrong
- Many programmers would rather just leave their serial code be, than invest enormous amounts of time in proofing parallelized variants (perhaps unsuccessfully), and writing thousands of lines of "guard" code to insure accuracy
- Many (most?) bioinformatics applications are written in academic research labs where the funding for parallel algorithm development simply isn't available
There are several proposals aimed at trying to accelerate the development of parallel problem solving. It certainly would be convenient to make good use of all of those new cores, especially since biological data sets can be so ridiculously large. A proper and thorough exploration of these ideas are beyond the scope of this tutorial. However, it is important to note the following key points from this discussion:
- There are important differences between serial and parallel code
- Cluster resource reservations should match the type of application you're running and the way you're running it
- The development of parallelized algorithms for solving biological problems is proceeding at a sluggish pace
- Parallelizing proven algorithms is a tricky and potentially expensive business
- If your data can be parallelized "embarrassingly", consider that as an option to speed processing and make more efficient use of available resources
Multi-Threading versus MPI
Multi-threaded jobs run multiple processors on a single node using shared memory space, whereas MPI jobs run multiple processors on more than one node using distributed memory space. Provided below are two memory schematics illustrating the differences.
How does this practically impact the way we work with these types of applications?
- Throwing multiple processors at serial jobs does you no good
- Throwing multiple nodes at a multi-threaded application does you no good
- Don't waste your time with the overhead of a MPI application if you're only going to assign one node to the job
- Don't use threads in your parallel application unless you have reserved sufficient resources in you job script
- Failure to adhere to these guidelines can result in wasted resources, poor performance, and/or over-utilized CPU resources.
More Information on Parallelism
- Wikipedia Article on NC Complexity
- The Landscape of Parallel Computing Research: A View from Berkeley - Asanovic et al, 2006
- Parallelized short read assembly of large genomes using de Bruijn graphs - Liu, Schmidt, and Maskell, 2011
- What makes parallel programming hard?
- Writing and Optimizing Parallel Programs — A complete example - A straight forward example with cautionary notes on "overhead"
- Programming for Multicore and Many-core Products - Reinders (Intel), 2012 - Good overview of OpenMP, Intel MPI, TBB, etc.