OpenMP slows down program instead of speeding it up: a bug in gcc?

I will first give some background about the problem I'm having so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found out that we could benefit greatly from using OpenMP to parallelize some of the biggest loops in this software. We actually parallelized the loops successfully, and with just two cores the loops executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP on, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization and commented out the OpenMP pragmas in this function. This appeared to have no effect on the execution time, though. We are currently using GCC 4.4.6 with the -fopenmp flag for OpenMP support. And here is the current problem:

If we don't use any omp pragmas in the code, everything runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:

//beginning of main function
...
#pragma omp parallel
{
#pragma omp single
{}
}
//main function continues
...

Does anyone have any clue about why this happens? I don't understand why the program slows down so much just from using the OpenMP pragmas. If we remove all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds. I would guess that this is some sort of compiler bug, as I have no other explanation in mind right now.


Not everything that can be parallelized should be parallelized. If you are using a single, then only one thread executes it and the rest have to wait until the region is done. They can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and that the waiting threads will see the completion faster than if they were sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive, but this was only added in OpenMP V3.0 and is only a hint to the implementation (so it might have no effect). Basically, unless you have a lot of work in the parallel region that can compensate for the single, the single is going to increase the parallel overhead substantially and may well make it too expensive to parallelize.
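
To see this overhead directly, you can time the empty parallel/single region from the question with omp_get_wtime. A minimal sketch (compile with -fopenmp; OMP_WAIT_POLICY is the OpenMP V3.0 hint mentioned above, so it is worth running both with and without OMP_WAIT_POLICY=passive set in the environment):

#include <stdio.h>
#include <omp.h>

int main()
{
  // Time the "empty" parallel region from the question. With most
  // implementations the worker threads created here are kept around
  // afterwards and spin-wait or sleep, which is what can slow down
  // the rest of the program.
  double t0 = omp_get_wtime();
  #pragma omp parallel
  {
    #pragma omp single
    {} // one thread runs this; the rest wait at the implicit barrier
  }
  double t1 = omp_get_wtime();
  printf("parallel/single overhead: %f seconds\n", t1 - t0);
  return 0;
}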


First, OpenMP often reduces performance on the first try. It can be tricky to use omp parallel if you don't understand it inside out. I may be able to help if you can tell me a little more about the program structure, specifically the following questions annotated with ????.

//beginning of main function
...
#pragma omp parallel
{

???? What goes here: is this a loop? If so, a for loop or a while loop?

#pragma omp single
  {
    ???? What goes here, and how long does it run?
  }
}

//main function continues
...
???? Does the performance drop in this code, or somewhere else?

Thanks.


Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately, and we got around a 40% improvement in execution time over the non-threaded version, using 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory allocator. Otherwise the program may just slow down.
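
For reference, linking in TCMalloc is just an extra library on the link line, assuming gperftools is installed (the library name and availability vary by distribution, and program.cpp here is a placeholder for your own sources):

g++ -O2 -fopenmp program.cpp -o program -ltcmalloc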


I did some more testing and wrote a small test program to check whether the issue could be related to memory operations. I was unable to replicate the slowdown caused by an empty parallel-single region in my small test program, but I was able to replicate the slowdown by parallelizing some malloc calls.

When running the test program on Windows 7 64-bit with 2 CPU cores, compiling with gcc (g++) and the -fopenmp flag caused no noticeable slowdown compared to running the program compiled without OpenMP support.

Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. The issue seems to appear only on Unix systems and not on Windows.

The source of my test program is below. I have also uploaded the zipped source for the Windows and Unix versions, as well as the assembly output for each, both with and without OpenMP support. The zip can be downloaded here: http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html

#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <vector>
#ifdef _WIN32
#include <windows.h>  // GetTickCount
#else
#include <sys/time.h> // gettimeofday
#endif

using namespace std;

// Millisecond timestamp that works on both Windows and Unix.
static long get_millis()
{
#ifdef _WIN32
  return (long)GetTickCount();
#else
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec * 1000L + tv.tv_usec / 1000L;
#endif
}

int main(int argc, char* argv[])
{
//  #pragma omp parallel
//  #pragma omp single
//  {}

  long start = get_millis();

  // A pre-sized vector written by index avoids the data race that
  // concurrent push_back calls on a std::list would cause, while the
  // malloc calls themselves still run in parallel, which is the point
  // of the test.
  vector<void *> pointers(10000);

  #pragma omp parallel for default(shared)
  for(int i = 0; i < 10000; i++)
    //pointers[i] = calloc(20000, sizeof(void *));
    pointers[i] = malloc(20000);

  for(size_t i = 0; i < pointers.size(); i++)
    free(pointers[i]);

  printf("It took %ld milliseconds to finish the memory operations\n",
         get_millis() - start);

  return 0;
}

What remains unanswered now is: what can I do to avoid issues like these on the Unix platform?
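
One option, per the TCMalloc fix above: on Linux an alternative allocator can also be swapped in without relinking, by preloading its shared library at run time. The path below is only an example and depends on where your distribution installs gperftools:

LD_PRELOAD=/usr/lib/libtcmalloc.so ./test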
