MPI Error: Out of Memory - What are some solution options

I am trying to resolve "Fatal Error in MPI_Irecv: Aborting Job" and have received mixed (useful, but incomplete) responses to that query.

The error message is the following:

aborting job:

> Fatal error in MPI_Irecv: Other MPI error, error stack:
> MPI_Irecv(143): MPI_Irecv(buf=0x8294a60, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd6ac) failed
> MPID_Irecv(64): Out of memory

I am seeking help to answer these questions (I need guidance to help debug and resolve this deadlock):

  1. At the end of an MPI non-blocking send/receive, is the memory freed by itself once the send/receive has completed, or does it have to be freed explicitly?

  2. Will the "Out of memory" issue be resolved if I use multiple cores instead of a single one? We presently have 4 processes on 1 core, and I submit my job using the following command: mpirun -np 4 <file>. I tried using mpirun n -4 <file> but it still ran 4 threads on the same core.

  3. How do I figure out how much "Shared memory" is required for my program?

The MPI_Isend/MPI_Irecv calls are inside a recursive loop in my code, so it is not clear whether the source of the error lies there (if I use the send/receive calls just once or twice, the system computes fine with no "Out of memory" issues). If so, how does one check for and release such resources?

#include <mpi.h>  

#define Rows 48 

double *A = new double[Rows];
double *AA = new double[Rows];
....
....

int main (int argc, char *argv[])
{
    MPI_Status status[8]; 
    MPI_Request request[8];
    MPI_Init (&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);   
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    while (time < final_time){
    ...
    ...

    for (i=0; i<Columns; i++) 
    {
        for (y=0; y<Rows; y++) 
        {
            if ((my_rank) == 0)
            {
                MPI_Isend(A, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[1]);
                MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[3]);
                MPI_Wait(&request[3], &status[3]);  

                MPI_Isend(B, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[5]);
                MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[7]);
                MPI_Wait(&request[7], &status[7]);
            }

            if ((my_rank) == 1)
            {
                MPI_Irecv(CC, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[1]);
                MPI_Wait(&request[1], &status[1]); 
                MPI_Isend(Cmpi, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[3]);

                MPI_Isend(D, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[6]); 
                MPI_Irecv(DD, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[8]);
                MPI_Wait(&request[8], &status[8]);
            }

            if ((my_rank) == 2)
            {
                MPI_Isend(E, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[2]);
                MPI_Irecv(EE, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[4]);
                MPI_Wait(&request[4], &status[4]);

                MPI_Irecv(FF, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[5]);
                MPI_Wait(&request[5], &status[5]);
                MPI_Isend(Fmpi, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[7]);
            }

            if ((my_rank) == 3)
            {
                MPI_Irecv(GG, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[2]);
                MPI_Wait(&request[2], &status[2]);
                MPI_Isend(G, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[4]);

                MPI_Irecv(HH, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
                MPI_Wait(&request[6], &status[6]);
                MPI_Isend(H, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[8]);
            }
        }
    }
}

Thanks!


You have a memory leak in your program; this:

MPI_Isend(A, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[1]);
MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[3]);
MPI_Wait(&request[3], &status[3]);

leaks resources associated with the MPI_Isend request. You call this Rows*Columns times per iteration, over presumably many iterations; but you're only calling Wait for one of the requests. You presumably need to be doing an MPI_Waitall() for the two requests.
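A minimal sketch of that fix, reusing your A/AA buffers and ranks (the reqs/stats names are just for the sketch): both requests go into one array, and a single MPI_Waitall completes, and frees, them both.

MPI_Request reqs[2];
MPI_Status  stats[2];

MPI_Isend(A,  Rows, MPI_DOUBLE, my_rank+1, 0,           MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[1]);

/* Waits on both the send and the receive, releasing both request objects. */
MPI_Waitall(2, reqs, stats);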

But beyond that, your program is very confusing. No sensible MPI program should have such a series of if (rank == ...) statements. And since you're not doing any real work between the nonblocking send/receives and the Waits, I don't understand why you're not just using MPI_Sendrecv or something. What is your program trying to accomplish?
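For example (a sketch only, reusing the A/AA buffers from your rank-0 branch), the Isend/Irecv/Wait triple for the exchange with rank my_rank+1 collapses to one call that leaves no request behind:

MPI_Status status;

/* Send A to the neighbour and receive AA from it in one matched call;
   there are no request objects to wait on or leak. */
MPI_Sendrecv(A,  Rows, MPI_DOUBLE, my_rank+1, 0,
             AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);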

UPDATE

Ok, so it looks like you're doing the standard halo-filling thing. A few things:

  1. Each task does not need its own arrays - A/AA for rank 0, B/BB for rank 1, etc. The memory is distributed, not shared; no rank can see the others' arrays, so there's no need to worry about overwriting them. (If it were shared, you wouldn't need to send messages.) Besides, think how much harder this makes running on different numbers of processes - you'd have to add new arrays to the code each time you changed the number of processors you use. (See the sketch after this list.)

  2. You can read/write directly into the V array rather than using copies, although the copies may be easiest to understand initially.
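To make point 1 concrete, here's a toy sketch for a 1-d chain of ranks (the sendbuf/recvbuf names are just for illustration; Rows, p, and my_rank are from your code): every rank runs the identical exchange once its neighbours are computed from my_rank, and MPI_PROC_NULL turns the off-end transfers into no-ops.

double sendbuf[Rows], recvbuf[Rows];
MPI_Status status;

int left  = (my_rank == 0)   ? MPI_PROC_NULL : my_rank - 1;
int right = (my_rank == p-1) ? MPI_PROC_NULL : my_rank + 1;

/* Same line on every rank: send our edge to the right, receive the
   left neighbour's edge; no if (rank == ...) chains needed. */
MPI_Sendrecv(sendbuf, Rows, MPI_DOUBLE, right, 0,
             recvbuf, Rows, MPI_DOUBLE, left,  0,
             MPI_COMM_WORLD, &status);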

I've written here a little version of a halo-filling code using your variable names (Tmyo, Nmyo, V, indices i and y, etc). Each task has only its piece of the wider V array, and exchanges its edge data with only its neighbours. It uses characters so you can see what's going on. It fills in its part of the V array with its rank #, and then exchanges its edge data with its neighbours.

I'd STRONGLY encourage you to sit down with an MPI book and work through its examples. I'm fond of Using MPI, but there are many others. There are also a lot of good MPI tutorials out there. I think it's no exaggeration to say that 95% of MPI books and tutorials (e.g., ours here - see parts 5 and 6) will go through exactly this procedure as one of their first big worked examples. They will call it halo filling or guard-cell filling or boundary exchange or something, but it all comes down to passing edge data.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

char **alloc_2d_char(const int rows, const int cols) {
    char *data = (char *)malloc(rows*cols*sizeof(char));
    char **array= (char **)malloc(rows*sizeof(char*));
    for (int i=0; i<rows; i++)
        array[i] = &(data[cols*i]);

    return array;
}

void edgeDataFill(char **locV, const int locNmyo, const int locTmyo,
                  const int ncols, const int myrow, const int mycol,
                  const int size, const int rank) {

    MPI_Datatype leftright, updown;
    int left, right, up, down;
    int lefttag = 1, righttag = 2;
    int uptag = 3, downtag = 4;
    MPI_Status status;

    /* figure out our neighbours */
    left = rank-1;
    if (mycol == 0) left = MPI_PROC_NULL;

    right = rank+1;
    if (mycol == ncols-1) right = MPI_PROC_NULL;

    up = rank - ncols;
    if (myrow == 0) up = MPI_PROC_NULL;

    down = rank + ncols;
    if (down >= size) down = MPI_PROC_NULL;

    /* create data type for sending/receiving data left/right */
    MPI_Type_vector(locNmyo, 1, locTmyo+2, MPI_CHAR, &leftright);
    MPI_Type_commit(&leftright);

    /* create data type for sending/receiving data up/down */
    MPI_Type_contiguous(locTmyo, MPI_CHAR, &updown);
    MPI_Type_commit(&updown);

    /* Send edge data to our right neighbour, receive from left.
       We are sending the edge (locV[1][locTmyo]..locV[locNmyo][locTmyo]),
       and receiving into edge (locV[0][1]..locV[locNmyo][locTmyo]) */

    MPI_Sendrecv(&(locV[1][locTmyo]), 1, leftright, right, righttag,
                 &(locV[1][0]),       1, leftright, left, righttag,
                 MPI_COMM_WORLD, &status);


    /* Send edge data to our left neighbour, receive from right.
       We are sending the edge (locV[1][1]..locV[locNmyo][1]),
       and receiving into edge (locV[1][locTmyo+1]..locV[locNmyo][locTmyo+1]) */

    MPI_Sendrecv(&(locV[1][1]),         1, leftright, left,  lefttag,
                 &(locV[1][locTmyo+1]), 1, leftright, right, lefttag,
                 MPI_COMM_WORLD, &status);

    /* Send edge data to our up neighbour, receive from down.
       We are sending the edge (locV[1][1]..locV[1][locTmyo]),
       and receiving into edge (locV[locNmyo+1][1]..locV[locNmyo+1][locTmyo]) */

    MPI_Sendrecv(&(locV[1][1]),         1, updown, up,   uptag,
                 &(locV[locNmyo+1][1]), 1, updown, down, uptag,
                 MPI_COMM_WORLD, &status);

    /* Send edge data to our down neighbour, receive from up.
       We are sending the edge (locV[locNmyo][1]..locV[locNmyo][locTmyo]),
       and receiving into edge (locV[0][1]..locV[0][locTmyo]) */

    MPI_Sendrecv(&(locV[locNmyo][1]),1, updown, down, downtag,
                 &(locV[0][1]),      1, updown, up,   downtag,
                 MPI_COMM_WORLD, &status);

    /* Release the resources associated with the type-creation calls. */

    MPI_Type_free(&updown);
    MPI_Type_free(&leftright);

}

void printArrays(char **locV, const int locNmyo, const int locTmyo,
                 const int size, const int rank) {

    /* all these barriers are a terrible idea, but it's just
       for controlling output to the screen as a demo.  You'd 
       really do something smarter here... */

    for (int task=0; task<size; task++) {
        if (rank == task) {
            printf("\nTask %d's local array:\n", rank);
            for (int i=0; i<locNmyo+2; i++) {
                putc('[', stdout);
                for (int y=0; y<locTmyo+2; y++) {
                    putc(locV[i][y], stdout);
                }
                printf("]\n");
            }
        }
        fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);
    }
}

int main(int argc, char **argv) {
    int ierr, size, rank;
    char **locV;
    const int Nmyo=12;  /* horizontal */
    const int Tmyo=12;  /* vertical */
    const int ncols=2;  /* n procs in horizontal direction */ 
    int nrows;   
    int myrow, mycol;
    int locNmyo, locTmyo;

    ierr = MPI_Init(&argc, &argv);
    ierr|= MPI_Comm_size(MPI_COMM_WORLD, &size);
    ierr|= MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nrows = size/ncols;
    if (nrows*ncols !=  size) {
        fprintf(stderr,"Size %d does not divide number of columns %d!\n",
                size, ncols);
        MPI_Abort(MPI_COMM_WORLD,-1);
    }

    /* where are we? */
    mycol = rank % ncols;
    myrow = rank / ncols;

    /* figure out how many Tmyo we have */
    locTmyo  = (Tmyo / ncols);
    /* in case it doesn't divide evenly... */
    if (mycol == ncols-1) locTmyo = Tmyo - (ncols-1)*locTmyo;

    /* figure out how many Nmyo we have */
    locNmyo  = (Nmyo / nrows);
    /* in case it doesn't divide evenly... */
    if (myrow == nrows-1) locNmyo = Nmyo - (nrows-1)*locNmyo;

    /* allocate our local array, with space for edge data */
    locV = alloc_2d_char(locNmyo+2, locTmyo+2);

    /* fill in our local data - first spaces everywhere */
    for (int i=0; i<locNmyo+2; i++) 
        for (int y=0; y<locTmyo+2; y++) 
                locV[i][y] = ' ';

    /* then the inner regions have our rank # */
    for (int i=1; i<locNmyo+1; i++)
        for (int y=1; y<locTmyo+1; y++)
                locV[i][y] = '0' + rank;

    /* The "before" picture: */
    if (rank==0) printf("###BEFORE###\n");
    printArrays(locV, locNmyo, locTmyo, size, rank);

    /* Now do edge filling.  Ignore corners for now; 
       the right way to do that depends on your algorithm */

    edgeDataFill(locV, locNmyo, locTmyo, ncols, myrow, mycol, size, rank);

    /* The "after" picture: */
    if (rank==0) printf("###AFTER###\n");
    printArrays(locV, locNmyo, locTmyo, size, rank);

    MPI_Finalize();
}

The above program can be simplified still further using MPI_Cart_create to create your multidimensional domain and calculate your neighbours for you automatically, but I wanted to show you the logic so you see what's going on.
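As a rough sketch of that route (the dims/periods/cart names here are mine, not part of the code above), the hand-rolled neighbour bookkeeping in edgeDataFill could be replaced with something like:

int dims[2]    = {nrows, ncols};
int periods[2] = {0, 0};              /* non-periodic in both directions */
int up, down, left, right;
MPI_Comm cart;

/* Let MPI lay the ranks out on an nrows x ncols grid... */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

/* ...and compute the neighbours; off-grid neighbours come back
   as MPI_PROC_NULL automatically. */
MPI_Cart_shift(cart, 0, 1, &up,   &down);
MPI_Cart_shift(cart, 1, 1, &left, &right);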

Also, if you can take some advice from someone who's done this for a long time:

Any time you have line after line of repeated code - like 60 (!!) lines of this:

Vmax =V[i][y]-Vold; updateMaxStateChange(Vmax / dt);

mmax=m[i][y]-mold; updateMaxStateChange(mmax / dt);
hmax=h[i][y]-hold; updateMaxStateChange(hmax / dt);
jmax=j[i][y]-jold; updateMaxStateChange(jmax / dt);

mLmax=mL[i][y]-mLold; updateMaxStateChange(mLmax / dt);
hLmax=hL[i][y]-hLold; updateMaxStateChange(hLmax / dt);
hLBmax=hLB[i][y]-hLBold; updateMaxStateChange(hLBmax / dt);
hLSmax=hLS[i][y]-hLSold; updateMaxStateChange(hLSmax / dt);

amax=a[i][y]-aold; updateMaxStateChange(amax / dt);
i1fmax=i1f[i][y]-i1fold; updateMaxStateChange(i1fmax / dt);
i1smax=i1s[i][y]-i1sold; updateMaxStateChange(i1smax / dt);

Xrmax=Xr[i][y]-Xrold; updateMaxStateChange(Xrmax / dt);

i2max=i2[i][y]-i2old; updateMaxStateChange(i2max / dt);

that's a sign you aren't using the right data structures. Here, you almost certainly want to have a 3d array of state variables, with (probably) the 3rd index being the species or local state variable or whatever you want to call i2, i1f, i1s, etc. Then all these lines can be replaced with a loop, and adding a new local state variable becomes much simpler.
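A minimal sketch of that idea, with illustrative names (NSPECIES, state, and oldval are mine, not from your code; i, y, dt, and updateMaxStateChange are yours):

#define NSPECIES 13          /* V, m, h, j, mL, hL, hLB, hLS, a, i1f, i1s, Xr, i2 */

double state[NSPECIES][Columns][Rows];   /* state[s][i][y] replaces V[i][y], m[i][y], ... */
double oldval[NSPECIES];                 /* previous value of each variable for this cell */

for (int s = 0; s < NSPECIES; s++) {
    double change = state[s][i][y] - oldval[s];
    updateMaxStateChange(change / dt);
}

Adding a new local state variable then means adding one slot to the array, not another dozen hand-written lines.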

Similarly, having essentially all your state defined as global variables is going to make your life much tougher when it comes to updating and maintaining the code. Again, this is probably partly related to having things in zillions of independent state variables instead of having structures or higher-dimensional arrays grouping all relevant data together.


I'm not familiar with the library, but...

1) You should not delete the buffer after the read. You have allocated the buffer (dynamically) at program startup. As long as you delete it (once) at termination, you should be fine. Actually, even if you don't delete it, it should get cleaned up when the program exits (but that's sloppy).

2) Multiple cores should have no effect on a memory problem.

3) Not sure. MPI should have some documentation to help you.
