OpenMP C++ parallel performance better on dual-core laptop than on eight-core cluster

Developer (https://www.devze.com), 2023-03-08 21:44. Source: the web

First of all, OpenMP obviously only runs on one of the motherboards in the cluster. Each motherboard has two quad-core Xeon E5405 processors at 2 GHz and runs Scientific Linux 5.3 (released in 2009, Red Hat based). My laptop, on the other hand, has a Core 2 Duo T7300 at 2 GHz and runs Windows 7. Neither machine has hyper-threading.

The main problem is that I have OOP code that generally runs for around 2 min in serial on both systems. When I add OpenMP to a nested loop, the run time drops as expected on my laptop (with 2 threads) but increases significantly on the server (around 5 min with two threads, for example).

There are two classes, "cube" and "space". Space contains a three-dimensional array (20x20x20) of cubes, and the code that I am trying to parallelise is a triply nested loop that calls a member function of cube for each of the cubes. This member function takes three arguments (doubles) and does some calculations based on the private variables of each cube.

inline void space::cubes_refresh(const double vsx, const double vsy, const double vsz) {
    int loopx, loopy, loopz;
    #pragma omp parallel private(loopx, loopy, loopz)
    {
        // Only the outer (x) loop is divided among the threads
        #pragma omp for schedule(guided,1) nowait
        for(loopx=0 ; loopx<cubes_w ; loopx++) {
            for(loopy=0 ; loopy<cubes_h ; loopy++) {
                for(loopz=0 ; loopz<cubes_d ; loopz++) {
                    // Refreshing the values in source
                    if ( (loopx==source_x)&&(loopy==source_y)&&(loopz==source_z) )
                        cube_array[loopx][loopy][loopz].refresh(0.0,0.0,vsz);
                    // refresh everything else
                    else
                        cube_array[loopx][loopy][loopz].refresh(0.0,0.0,0.0);
                }
            }
        }   // End of loop
    }   // End of parallel region
}

I don't know where the problem could be. As I said before, on my laptop I see the expected improvement in performance, but exactly the same code on the server does significantly worse. These are the flags I use on my laptop (I have tried using exactly the same flags on both machines, but it made no difference):

g++ -std=c++98 -fopenmp -O3 -Wl,--enable-auto-import -pedantic main.cpp -o parallel_openmp

And on the server:

g++ -std=c++98 -fopenmp -O3 -W -pedantic main.cpp -o parallel_openmp

I'm running gcc version 4.5.0 on my laptop and the server is running 4.1.2. I don't know the OpenMP version on the server, as I don't know how to check it; I think it is a version before 3.0, as the collapse clause on loops does not work. Could this be the problem?
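Since collapse is not available before OpenMP 3.0, one workaround is to flatten the three loops into a single index by hand, which gives the runtime 8000 iterations to distribute instead of 20. The following is only a sketch under assumptions: Cube and Space are hypothetical stand-ins for the real classes (the actual refresh() is not shown in the post), the cubes are assumed to be stored contiguously, and schedule(static) is a choice made for this sketch so that each thread gets one contiguous slice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the "cube" class. The real refresh() is not
// shown in the post, so this one only accumulates its arguments.
struct Cube {
    double state;
    Cube() : state(0.0) {}
    void refresh(double vx, double vy, double vz) { state += vx + vy + vz; }
};

// Hypothetical stand-in for "space": cubes stored contiguously, x-major,
// so cube (x, y, z) lives at index (x * h + y) * d + z.
struct Space {
    int w, h, d;
    int source_x, source_y, source_z;
    std::vector<Cube> cubes;

    Space(int w_, int h_, int d_)
        : w(w_), h(h_), d(d_), source_x(0), source_y(0), source_z(0),
          cubes(static_cast<std::size_t>(w_) * h_ * d_) {}

    // Manual replacement for collapse(3): one flat loop over all cubes.
    void cubes_refresh(double vsz) {
        const int total = w * h * d;
        assert(cubes.size() == static_cast<std::size_t>(total));
        const int source = (source_x * h + source_y) * d + source_z;

        // A signed int loop variable keeps this valid for OpenMP 2.5.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < total; ++i) {
            // Only the source cube receives vsz; all others get zeros.
            cubes[i].refresh(0.0, 0.0, i == source ? vsz : 0.0);
        }
    }
};
```

Something like `Space s(20, 20, 20); s.cubes_refresh(vsz);` would then replace the nested version. Without -fopenmp the pragma is ignored and the loop simply runs serially, which makes it easy to compare results between the two builds.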

gcc did not support OpenMP until 4.2, and OpenMP 3.0 was supported starting in gcc 4.4. Your operating system vendor may have backported the changes to 4.1.2.
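To find out which OpenMP version a compiler actually implements, the standard `_OPENMP` macro can be inspected: a conforming compiler defines it as the yyyymm release date of the specification it supports when OpenMP is enabled. A minimal sketch, where the two helper names are made up for illustration:

```cpp
#include <string>

// Value of the standard _OPENMP macro, or 0 when this file was compiled
// without OpenMP support (for gcc, without -fopenmp).
inline long openmp_macro_value() {
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0L;
#endif
}

// Translate the yyyymm date into a specification version.
// Release dates: 2.5 = 200505, 3.0 = 200805, 3.1 = 201107, 4.0 = 201307.
inline std::string openmp_spec_version(long yyyymm) {
    if (yyyymm == 0)      return "not enabled";
    if (yyyymm < 200505)  return "older than 2.5";
    if (yyyymm < 200805)  return "2.5";
    if (yyyymm < 201107)  return "3.0";
    if (yyyymm < 201307)  return "3.1";
    return "4.0 or newer";
}
```

Printing `openmp_spec_version(openmp_macro_value())` from main on the server should settle the question. With gcc, running `g++ -fopenmp -dM -E -x c++ /dev/null | grep _OPENMP` should show the same number without compiling a program, assuming the vendor's backport defines the macro in the usual way.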

The only thing I can think of that may be causing the problem is that, for some reason, on the server all the threads accessing the cube member array cause a lot of cache misses. But wouldn't this also happen when the program runs on my laptop?
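One way to narrow this down is to time just the parallel region with a wall-clock timer at several thread counts on both machines. The sketch below is not the original code: touch_all and its vector of doubles are hypothetical placeholders for one pass over the cubes, and wall_seconds is a helper name made up here. It uses omp_get_wtime() when OpenMP is enabled and falls back to std::clock() otherwise.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
#include <ctime>
#endif

// Seconds from a timer suitable for comparing runs.
// With OpenMP this is wall-clock time; the fallback is CPU time, which is
// only a reasonable substitute because that build runs single-threaded.
inline double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
#endif
}

// Hypothetical placeholder workload: touches every cell once using the
// requested number of threads and returns the elapsed seconds.
inline double touch_all(std::vector<double>& cells, int threads) {
#ifdef _OPENMP
    omp_set_num_threads(threads);
#else
    (void)threads;  // no threading without OpenMP
#endif
    const int n = static_cast<int>(cells.size());
    const double start = wall_seconds();

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        cells[i] += 1.0;
    }
    return wall_seconds() - start;
}
```

Calling touch_all with 1, 2, 4 and 8 threads and printing the results would show whether the slowdown grows with the thread count. Swapping the placeholder loop body for the real refresh() call is the obvious next step.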