MPI_Bcast: Efficiency advantages?

In MPI, is MPI_Bcast purely a convenience function or is there an efficiency advantage to using it instead of just looping over all ranks and sending the same message to all of them?

Rationale: MPI_Bcast's behavior of sending the message to everyone, including the root, is inconvenient for me, so I'd rather not use it unless there's a good reason, or it can be made to not send the message to root.


Using MPI_Bcast will definitely be more efficient than rolling your own. A lot of work has been done in all MPI implementations to optimise collective operations based on factors such as the message size and the communication architecture.
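
For comparison, the hand-rolled alternative described in the question would look roughly like the following sketch (the function name, the integer payload, and the tag are made up for illustration):

#include "mpi.h"

/* Naive "broadcast": the root loops over every other rank and sends the
 * same buffer with blocking sends. Placeholder names throughout. */
void naive_bcast(int *data, int count, int root, MPI_Comm comm)
{
    int rank, size, i;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (i = 0; i < size; i++)
            if (i != root)
                MPI_Send(data, count, MPI_INT, i, 0, comm);
    } else {
        MPI_Recv(data, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

The root issues size-1 sends one after another here, whereas a tree-based MPI_Bcast finishes in roughly log2(size) communication steps.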

For example, MPI_Bcast in MPICH2 uses a different algorithm depending on the message size. For short messages, a binomial tree is used to minimise processing load and latency. For long messages, it is implemented as a scatter followed by an allgather.

In addition, HPC vendors often provide MPI implementations that make efficient use of the underlying interconnect, especially for collective operations. For example, they may use hardware-supported multicast or bespoke algorithms tuned to the interconnect topology.


The collective communications can be much faster than rolling your own. All of the MPI implementations put a lot of effort into making those routines fast.

If you routinely want to do collective-type things but only on a subset of tasks, then you probably want to create your own sub-communicators and use MPI_Bcast, etc., on those communicators.
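
For example, MPI_Comm_split is a simple way to carve out such sub-communicators and then call MPI_Bcast on them. Here is a minimal sketch, not taken from the answer above; the even/odd split and the values are purely illustrative:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, color, value = 0;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks with the same color end up in the same sub-communicator */
    color = rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub_comm);

    /* Local rank 0 of each sub-communicator acts as the broadcast root */
    if (rank < 2) value = 100 + color;
    MPI_Bcast(&value, 1, MPI_INT, 0, sub_comm);

    printf("world rank %d got %d\n", rank, value);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}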


MPI_Bcast sends the message from one process (the 'root') to all others, by definition. It will probably also be a little faster than just looping over all processes. The MPICH2 implementation, for instance, uses a binomial tree to distribute the message.

In case you don't want to broadcast to MPI_COMM_WORLD, but you want to define subgroups, you can go about it like this:

#include <stdio.h>
#include "mpi.h"

#define NPROCS 8

int main(int argc, char **argv)
{
    int rank, new_rank, sendbuf, recvbuf;
    int ranks1[4] = {0, 1, 2, 3}, ranks2[4] = {4, 5, 6, 7};

    MPI_Group orig_group, new_group;
    MPI_Comm new_comm;

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sendbuf = rank;

    /* Extract the original group handle */
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

    /* Divide tasks into two groups based on rank */
    if (rank < NPROCS/2) {
        MPI_Group_incl(orig_group, NPROCS/2, ranks1, &new_group);
    } else {
        MPI_Group_incl(orig_group, NPROCS/2, ranks2, &new_group);
    }

    /* Create new communicator and then perform some comm
     * Here, MPI_Allreduce, but you can MPI_Bcast at will
     */
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
    MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);
    MPI_Group_rank(new_group, &new_rank);

    printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);

    MPI_Finalize();
    return 0;
}

This might produce output like the following:

rank= 7 newrank= 3 recvbuf= 22
rank= 0 newrank= 0 recvbuf= 6 
rank= 1 newrank= 1 recvbuf= 6 
rank= 2 newrank= 2 recvbuf= 6 
rank= 6 newrank= 2 recvbuf= 22
rank= 3 newrank= 3 recvbuf= 6
rank= 4 newrank= 0 recvbuf= 22
rank= 5 newrank= 1 recvbuf= 22
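
To broadcast within those halves instead of reducing, the MPI_Allreduce line in the program above could be swapped for something like this (reusing sendbuf purely for illustration):

    MPI_Bcast(&sendbuf, 1, MPI_INT, 0, new_comm);

Here 0 is the root within each new communicator, i.e. world ranks 0 and 4 in this example.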


The answer is that MPI_Bcast is probably faster than a loop in the general case. In general, the MPI collectives are optimized over a wide range of message sizes, communicator sizes, and specific rank layouts.

That said, it may be possible to beat a collective at specific message sizes, communicator sizes, and rank layouts. For instance, a loop over non-blocking point-to-point calls (e.g. MPI_Isend and MPI_Recv/MPI_Irecv) may be faster... but probably only at a few specific message sizes, communicator sizes, and rank layouts.
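
A sketch of such a loop, assuming an integer payload and made-up names (not from the original answer):

#include <stdlib.h>
#include "mpi.h"

/* Hand-rolled broadcast: the root posts non-blocking sends to every other
 * rank and then waits on all of them. Placeholder names throughout. */
void isend_bcast(int *data, int count, int root, MPI_Comm comm)
{
    int rank, size, i, n = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        MPI_Request *reqs = malloc(size * sizeof(MPI_Request));
        for (i = 0; i < size; i++)
            if (i != root)
                MPI_Isend(data, count, MPI_INT, i, 0, comm, &reqs[n++]);
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    } else {
        MPI_Recv(data, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

Whether this actually beats MPI_Bcast is something to measure, not assume.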

If the specific algorithm that you are coding needs the pattern of a Bcast (e.g. all ranks get the same data payload from a root), then use the Bcast collective. In general, it is not worth adding complication by rolling your own "collective replacements".

If there is some other message pattern that the algorithm needs, and a Bcast is only a partial fit... then it may be worth rolling your own... but personally I set that bar rather high.
