C++ OpenMP directives

https://www.devze.com 2023-04-11 03:28 Source: web
I have a loop that I'm trying to parallelize and in it I am filling a container, say an STL map. Consider the simple pseudocode below, where T1 and T2 are some arbitrary types, while f and g are some functions of an integer argument, returning types T1 and T2 respectively:

#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
   c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}

This looks rather straightforward and seems like it should be trivially parallelizable, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code due to unexpected values being filled in the container, likely caused by race conditions. I've even tried putting barriers and what-not, but all to no avail. The only thing that allows it to work is to use a critical directive as below:

#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
#pragma omp critical
   {
      c.insert(std::make_pair<T1,T2>(f(i), g(i)));
   }
}

But this sort of defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop body (the container insert). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?


This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.

  1. STL containers are not thread-safe. That's why you're getting the race conditions. So accessing them needs to be synchronized - which makes your insertion process inherently sequential.

  2. As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset the overhead of parallelism.

Now assuming f() and g() are extremely expensive calls, then your loop can be parallelized like this:

#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));

#pragma omp critical
    {
        c.insert(p);
    }
}


Running multithreaded code makes you think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection must be prepared to handle such simultaneous calls and keep its data consistent; are you sure it was designed that way?

Another thing is that parallelization has its own overhead, and you will not gain anything by running a very small task on multiple threads: with the cost of splitting and synchronization, you might even end up with a higher total execution time for the task.


  1. c will obviously have data races, as you guessed. STL map is not thread-safe. Calling its insert method concurrently from multiple threads results in very unpredictable behavior, most often a crash.

  2. Yes, to avoid the data races, you must use either (1) a mutex, such as #pragma omp critical, or (2) a concurrent data structure (a.k.a. a lock-free data structure). However, not all data structures can be made lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need ordering of the keys, you may use it and could get some speedup, as it does not rely on a conventional mutex.

  3. In the case where you can use just a hash table and the table is very large, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about the order of insertion. In this case, you allocate a separate hash table for each thread and let each thread insert N/#threads items in parallel, which will give a speedup. Lookups can also easily be done by accessing these tables in parallel.
