Parallel Programming

Introduction

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

  • To be run using multiple CPUs

  • A problem is broken into discrete parts that can be solved concurrently

  • Each part is further broken down to a series of instructions

  • Instructions from each part execute simultaneously on different CPUs

The compute resources might be:

  • A single computer with multiple processors (using threads);

  • An arbitrary number of computers connected by a network (using a parallel library such as MPI);

  • A combination of both.

The computational problem should be able to:

  • Be broken apart into discrete pieces of work that can be solved simultaneously;

  • Execute multiple program instructions at any moment in time;

  • Be solved in less time with multiple compute resources than with a single compute resource.

In Xmipp 3.0 we have developed several classes to make it easier to use threads and MPI when developing parallel programs. If you want to read more about parallel computing, there is a nice introduction here.

Task distribution

One of the first steps in designing a parallel program is to break the problem into discrete “chunks” of work that can be distributed to multiple tasks. This is known as decomposition or partitioning. There are two basic ways to partition computational work among parallel tasks: partitioning the data or partitioning the work to be done.

In single-particle electron microscopy image processing we usually deal with several thousand images, so the most common and easiest way to distribute work is by data: each CPU takes care of some part of the data to perform the processing tasks.

We have created the class [[ParallelTaskDistributor]], whose job is to distribute N tasks (without knowing what each “task” really means) among a group of workers (threads or MPI processes). Each worker asks for some tasks, processes them, and asks for more until there is nothing left to be done. Several subclasses of [[ParallelTaskDistributor]] exist; the following example uses the thread-based one:

//...
// Create a task distributor with N total tasks, serving 10 on each request
ParallelTaskDistributor * td = new ThreadTaskDistributor(N, 10);
//...
// Function that performs some operation
// on N images, executed in parallel
void processSeveralImages()
{
    size_t firstImage, lastImage;
    while (td->getTasks(firstImage, lastImage))
        for (size_t image = firstImage; image <= lastImage; ++image)
        {
            //...
            processOneImage(image);
            //...
        }
}
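
This dynamic scheme balances the load automatically: workers that finish early simply come back for more tasks. The block size (10 in the example above) is a trade-off; larger blocks reduce the overhead of requesting work, while smaller blocks keep the load more evenly spread near the end of the run.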

Using threads

Technically, a thread is defined as an independent stream of instructions that can be scheduled to run as such by the operating system. To understand a thread, one first needs to understand a UNIX process. A process is created by the operating system and requires a fair amount of “overhead”. Processes contain information about program resources and program execution state, including:

  • Process ID, process group ID, user ID, and group ID

  • Environment and working directory

  • Program instructions

  • Registers, stack, and heap

  • File descriptors and signal actions

  • Shared libraries

  • Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory)

Threads use and exist within these process resources, yet they can be scheduled by the operating system and run as independent entities, largely because they duplicate only the bare essential resources that enable them to exist as executable code.

So, in summary, in the UNIX environment a thread:

  • Exists within a process and uses the process resources

  • Has its own independent flow of control as long as its parent process exists and the OS supports it

  • Duplicates only the essential resources it needs to be independently schedulable

  • May share the process resources with other threads that act equally independently (and dependently)

  • Dies if the parent process dies - or something similar

  • Is “lightweight” because most of the overhead has already been accomplished through the creation of its process.

Because threads within the same process share resources:

  • Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads.

  • Two pointers having the same value point to the same data.

  • Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.
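
As a minimal illustration of that last point, here is a plain POSIX-threads sketch (independent of the Xmipp classes) in which two threads write to the same memory location without synchronization; the final count is usually wrong, which is exactly the kind of problem the synchronization tools described below solve:

#include <pthread.h>
#include <cstdio>

// Counter shared by both threads (they live in the same process)
long counter = 0;

void * increment(void *)
{
    // ++counter is a read-modify-write; two threads can interleave
    // here and lose updates
    for (int i = 0; i < 1000000; ++i)
        ++counter;
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // Will often print less than 2000000 because of the race
    printf("counter = %ld\n", counter);
    return 0;
}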

A more detailed explanation of the use of POSIX threads can be found here.

Creating threads and passing parameters

Imagine that you have a program that performs tasks A, B and C, where tasks A and C can be threaded. Task A can thus be split into several concurrent tasks A1, A2, A3… An, and the same holds for C. In the following figure you can see the serial and threaded versions of the program execution:

This type of threading can now be done easily using the following classes:

  • [[ThreadManager]] will create the threads and run different functions in parallel

  • [[ThreadFunction]]: prototype of the functions that can be run by [[ThreadManager]].

  • Its definition is typedef void( [[ThreadFunction]] )(ThreadArgument &arg)

  • [[ThreadArgument]]: Argument type that is passed to [[ThreadFunction]]. It contains:

  • thread_id: number identifying each thread

  • data: void * pointer to pass additional information

  • workClass: void * pointer to hold a reference to the working class

The previous example can be coded as follows:

void functionA(ThreadArgument &arg)
{
    //...
}

void functionB()
{
    //...
}

void functionC(ThreadArgument &arg)
{
    //...
}

int main()
{
    // Start 4 threads to work
    ThreadManager * tm = new ThreadManager(4);
    // Run functionA in parallel
    tm->run(functionA);
    // All threads are synchronized at this point; run the serial task B
    functionB();
    // If you need to pass some additional information
    // to work on functionC you can do:
    tm->setData(myData);
    // Put the threads to work on functionC
    tm->run(functionC);
    delete tm;
}
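
As a sketch of what a threaded function could look like, the hypothetical functionA below splits a range of images among the workers using the thread_id field described above. The image count and thread count are hard-coded for illustration; in a real program they would be passed through the data pointer or the working class:

const size_t NUM_THREADS = 4; // must match the ThreadManager size above
const size_t N_IMAGES = 1000; // example value

void functionA(ThreadArgument &arg)
{
    // Give each thread a contiguous slice of the range [0, N_IMAGES)
    size_t chunk = (N_IMAGES + NUM_THREADS - 1) / NUM_THREADS;
    size_t first = arg.thread_id * chunk;
    size_t last = first + chunk;
    if (last > N_IMAGES)
        last = N_IMAGES;
    for (size_t image = first; image < last; ++image)
        processOneImage(image); // same per-image routine as before
}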

Synchronizing threads

Synchronization is vital for almost all parallel programs. We want things done faster, but we also want them done well. Through synchronization we can guarantee that things are done in the correct order and produce the same results as if the work were done sequentially.

Synchronization between threads is done primarily through mutexes. A mutex allows you to protect a portion of code so that only one thread can execute it at a time. We have created the Mutex class, which encapsulates mutex creation, initialization, and clean-up through the pthreads library.

Mutex mutexUpdate;
//....
// Inside some threaded function:
mutexUpdate.lock();
// Perform the update
mutexUpdate.unlock();
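
For instance, here is a sketch that reuses processOneImage from the task-distribution example, with computeValue standing in as a hypothetical per-image computation; the mutex protects a shared statistic that every worker updates:

Mutex mutexUpdate;
double globalSum = 0; // shared by all worker threads

void processOneImage(size_t image)
{
    double value = computeValue(image); // thread-local work, no lock needed
    // Keep the critical region as small as possible
    mutexUpdate.lock();
    globalSum += value;
    mutexUpdate.unlock();
}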

Other synchronization structures exist that can adapt better to different circumstances. For example, a barrier is used when we want to synchronize a number of threads at a point in the code, so that none can continue working until all of them have reached that point. Barriers are not present on all computing platforms; for example, old Unix implementations do not define such a structure in the pthreads library. To avoid problems of this type, a Barrier class has been implemented based on mutexes, as sketched below.
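
As a sketch of its use (assuming the class offers a constructor taking the number of threads to wait for and a wait() method; check the class header for the exact interface), a barrier can separate two phases of a threaded function:

Barrier barrier(4); // one slot per worker thread (assumed constructor)

void twoPhaseFunction(ThreadArgument &arg)
{
    doPhaseOne(arg.thread_id); // hypothetical first phase
    // No thread starts phase two until all have finished phase one
    barrier.wait();
    doPhaseTwo(arg.thread_id); // hypothetical second phase
}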

Example

Here you will find a complete example of a parallel program using all of these elements together. The example estimates the value of PI.

Some Tips

Programming threads is easy… but debugging threads can be a nightmare. So take note of these tips:

  • Do not use static variables in threaded code. Such variables are shared between all threads and can lead to unexpected results.

  • Do not use threads for everything. Use them when it is clear they will provide an advantage. Using too many threads will decrease performance.

  • Try to create threads once and reuse them. Creating and destroying threads incurs some overhead; in some applications this can translate into lower performance. (Create just one [[ThreadManager]] and use it to run several functions.)

  • Be careful with critical regions and the use of Mutex and Barrier. Misuse can lead to race conditions (bad results) or deadlocks (the program runs forever).

Programming with MPI

The Message Passing Interface Standard (MPI) is a message passing library standard based on the consensus of the MPI Forum, which has over 40 participating organizations, including vendors, researchers, software library developers, and users. The goal of the Message Passing Interface is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing message passing programs. As such, MPI is the first standardized, vendor-independent message passing library. The advantages of developing message passing software using MPI closely match the design goals of portability, efficiency, and flexibility. MPI is not an IEEE or ISO standard, but it has, in fact, become the “industry standard” for writing message passing programs on HPC platforms. You can find more about MPI here.

We have created some useful classes such as [[MpiNode]], which takes care of MPI initialization and cleaning. This class also provides a synchronization method, barrierWait, among other utilities. The same task-distribution concepts can be used with MPI through the [[MpiTaskDistributor]] class, as sketched below.
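
As a rough sketch of how these pieces could fit together (the constructor signatures here are assumptions based on the description above, so check the class headers before relying on them), an MPI worker might look like this:

int main(int argc, char **argv)
{
    MpiNode node(argc, argv); // assumed to handle MPI initialization and cleaning
    size_t N = 1000; // total number of images (example value)
    // Same distribution idea as the threaded example: blocks of 10 tasks
    MpiTaskDistributor td(N, 10, &node); // assumed constructor
    size_t firstImage, lastImage;
    while (td.getTasks(firstImage, lastImage))
        for (size_t image = firstImage; image <= lastImage; ++image)
            processOneImage(image);
    node.barrierWait(); // synchronize all MPI processes at the end
    return 0;
}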

A complete example using the MPI tools is available here.