Parallelizing large loops and improving cache accesses

I have code like the following which I am using to find prime numbers (with the sieve of Eratosthenes) within a range, using OpenMP to parallelize. Before this, I have a preprocessing stage where I flag off all even numbers and the multiples of 3 and 5, so that there is less work to do in this stage. The shared L3 cache of the testbed is 12 MB, and the physical memory is 32 GB. I am using 12 threads. The flag array is unsigned char.

#pragma omp parallel for
for (i = 0; i < range; i++)
{
    for (j = 5; j < range; j += 2)
    {
        if (flag[i] == 1 && i*j < range)
            if (flag[i*j] == 1)
                flag[i*j] = 0;
    }
}

This program works well for ranges less than 1,000,000, but beyond that the execution time shoots up; e.g., for range = 10,000,000 the program takes around 70 minutes (not fitting in cache?). I have modified the program to incorporate loop tiling so that it can utilize the cache for any range, but even the blocked version is time consuming. Interchanging the loops also does not help for large ranges.

How do I modify the above code to handle large ranges? And how can I rewrite it so that it is fully parallel? (range and flag are shared; the flag array is too large to declare private.)


Actually, I just noticed a few easy speedups in your code. So I'll mention these before I get into the fast algorithm:

  1. Use a bit-field instead of a char array. You can save a factor of 8 in memory.
  2. Your outer loop runs over all integers, not just the primes. After each iteration, continue from the first number that hasn't been crossed off yet (that number will be prime). A minimal sketch combining both fixes follows this list.
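
For illustration, here is a minimal single-threaded sketch that combines both fixes: a bit-field instead of an unsigned char array, and an outer loop that only advances over numbers still marked prime. The helper macros and the name sieve() are my own, not from your code.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bit-field helpers: bit k represents the integer k (1 = "still prime"). */
#define GET_BIT(a, k)   (((a)[(k) >> 3] >> ((k) & 7)) & 1u)
#define CLEAR_BIT(a, k) ((a)[(k) >> 3] &= (uint8_t)~(1u << ((k) & 7)))

/* Plain sieve of Eratosthenes over [0, range) using one bit per number. */
static uint8_t *sieve(uint64_t range)
{
    uint8_t *flag = malloc((range + 7) / 8);
    if (!flag) return NULL;
    memset(flag, 0xFF, (range + 7) / 8);   /* start with everything marked prime */
    CLEAR_BIT(flag, 0);
    CLEAR_BIT(flag, 1);

    for (uint64_t i = 2; i * i < range; i++) {
        if (!GET_BIT(flag, i))             /* already crossed off: not a prime, skip */
            continue;
        for (uint64_t j = i * i; j < range; j += i)
            CLEAR_BIT(flag, j);            /* cross off multiples of the prime i */
    }
    return flag;
}

Note that the inner loop can start at i*i, since smaller multiples were already crossed off by smaller primes.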

I'm suggesting this because you mentioned that it takes 70 min. on a (pretty powerful) machine to run N = 10,000,000. That didn't look right, since my own trivial implementation can do N = 2^32 in under 20 seconds on a laptop - single-threaded, no source-level optimizations. So then I noticed that you missed a few basic optimizations.

Here's the efficient solution. But it takes some work.

The key is to recognize that the Eratosthenes sieve only needs its outer loop to go up to sqrt(N), where N is your target size. In other words, once you have sieved with every prime up to sqrt(N), you are done.

So the trick is to first run the algorithm up to sqrt(N). Then dump all the primes into a dense data structure. By pre-computing all the needed primes, you break the dependency on the outer loop.
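
A sketch of that pre-computation step, reusing sieve() and GET_BIT from the snippet above (the name collect_primes and its signature are my own):

#include <math.h>

/* Sieve up to sqrt(n) and pack the surviving primes into a dense array,
   so the later (parallel) stage can scan the table sequentially. */
static uint64_t collect_primes(uint64_t n, uint64_t **primes_out)
{
    uint64_t  limit = (uint64_t)sqrt((double)n) + 1;
    uint8_t  *flag  = sieve(limit);
    uint64_t *table = malloc(limit * sizeof *table);   /* generous upper bound */
    uint64_t  count = 0;

    for (uint64_t i = 2; i < limit; i++)
        if (GET_BIT(flag, i))
            table[count++] = i;

    free(flag);
    *primes_out = table;
    return count;               /* number of primes <= sqrt(n) */
}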

Now, for the rest of the numbers from sqrt(N) to N, you can cross off all numbers that are divisible by any prime in your pre-computed table. Note that this is independent for all the remaining numbers. So the algorithm is now embarrassingly parallel.

To be efficient, this needs to be done using "mini"-sieves on blocks that fit in cache. To be even more efficient, you should compute and cache the reciprocals of all the primes in the table. This will help you efficiently find the "initial offsets" of each prime when you fill out each "mini-sieve".

The initial step of running the algorithm sequentially up to sqrt(N) will be very fast, since it's only sqrt(N). The rest of the work is completely parallelizable.
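
Putting it together, here is a rough sketch of the parallel segmented stage: each OpenMP thread sieves cache-sized blocks against the pre-computed prime table. For simplicity it finds each prime's initial offset with a division rather than the cached reciprocals mentioned above, and it reuses CLEAR_BIT and the prime table from the earlier snippets; the block size and function name are my own choices.

#include <omp.h>

/* 256 K numbers per block. BLOCK is a multiple of 8 and blocks start at 0,
   so no two threads ever touch the same byte of the shared bit-field. */
#define BLOCK 262144ULL

/* flag is a bit-field over [0, n), pre-set to all ones (the caller should
   also clear bits 0 and 1). Each block is sieved independently against the
   prime table, so blocks can be handed to threads with no synchronization. */
static void segmented_sieve(uint64_t n, const uint64_t *primes, uint64_t nprimes,
                            uint8_t *flag)
{
    #pragma omp parallel for schedule(dynamic)
    for (uint64_t lo = 0; lo < n; lo += BLOCK) {
        uint64_t hi = (lo + BLOCK < n) ? lo + BLOCK : n;

        for (uint64_t p = 0; p < nprimes; p++) {
            uint64_t prime = primes[p];
            if (prime * prime >= hi) break;      /* table is sorted ascending */

            /* First multiple of prime inside [lo, hi), but never below prime^2. */
            uint64_t start = (lo + prime - 1) / prime * prime;
            if (start < prime * prime) start = prime * prime;

            for (uint64_t j = start; j < hi; j += prime)
                CLEAR_BIT(flag, j);              /* cross off composites in this block */
        }
    }
}

With your 12 threads you would call collect_primes(n, &primes) once, sequentially, and then pass flag, the table, and its length to segmented_sieve().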

In the fully general case, this algorithm can be applied recursively on the initial sieve, but that's generally overkill.
