Benchmarking processor affinity impact

I'm working on a NUMA architecture where each compute node has 2 sockets and 4 cores per socket, for a total of 8 cores per compute node, and 24 GB of RAM per node. I have to prove that setting processor affinity can have a significant impact on performance.

Do you have any program to suggest that I could use as a benchmark to show the difference in impact between using processor affinity and not using it? I could also write a simple C test program using MPI, OpenMP, or pthreads, but what operation would be best for that test? It must be something that takes advantage of cache locality, but that also triggers context switching (blocking operations), so a process could potentially migrate to another core, or worse, to another socket. It must run on a multiple of 8 cores.


I tried to write a program that benchmarks asymmetry in memory latency on a NUMA architecture, and with the help of the StackOverflow community, I succeeded. You can get the program from my StackOverflow post:

Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?

When I run my benchmark program on hardware very similar to yours, I see about a 30% performance penalty when a core is reading/writing to memory that is not in the core's NUMA node (region of affinity). The program has to read and write in a pattern that deliberately defeats caching and pre-fetching, otherwise there's no observable asymmetry.
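The essence of that access pattern is a dependent pointer chase, in randomized order, over a buffer larger than the last-level cache, so that every load misses and the prefetcher cannot guess the next line. This is not my exact program (see the linked post for that); a minimal sketch of the idea, with an illustrative 64 MB buffer and iteration count, is:

    /* Pointer chase in randomized order over a buffer larger than the
     * last-level cache: each load depends on the previous one, so caching
     * and prefetching are defeated. Compile with: cc -O2 -o chase chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024 / sizeof(size_t))   /* 64 MB, > typical LLC */
    #define ITERS (100 * 1000 * 1000)

    int main(void) {
        size_t i, j, tmp;
        size_t *next = malloc(N * sizeof *next);
        size_t *perm = malloc(N * sizeof *perm);

        /* Build a random permutation, then link it into one long cycle. */
        for (i = 0; i < N; i++) perm[i] = i;
        srand(42);
        for (i = N - 1; i > 0; i--) {
            j = (size_t)rand() % (i + 1);
            tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        for (i = 0; i < N - 1; i++) next[perm[i]] = perm[i + 1];
        next[perm[N - 1]] = perm[0];

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0, j = 0; i < ITERS; i++) j = next[j];   /* the chase */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
                     + (t1.tv_nsec - t0.tv_nsec)) / (double)ITERS;
        printf("%.1f ns per dependent load (j=%zu)\n", ns, j);
        free(next); free(perm);
        return 0;
    }

To expose the asymmetry, run it with execution and memory pinned first to the same node and then to different nodes, e.g. numactl --cpunodebind=0 --membind=0 ./chase versus numactl --cpunodebind=0 --membind=1 ./chase.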


Try the ASC Sequoia benchmark CLOMP, which is designed for measuring threading overheads.


You can just use a simple single-threaded process that writes and then repeatedly reads a modest data set. The process needs to run for much longer than a single time slice, and long enough for the scheduler to migrate it from one core to another, e.g. 100 seconds.
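A minimal sketch of such a process (the 4 MB working set and the 100-second duration are placeholders to tune for your machine) could be:

    /* my_process.c: write a modest data set once, then re-read it for ~100 s.
     * If the scheduler migrates the process, the warm cache (and, across
     * sockets, the node-local memory pages) is left behind and throughput
     * drops. Compile with: cc -O2 -o my_process my_process.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (512 * 1024)   /* 512 Ki longs = 4 MB, small enough to cache */
    #define SECONDS 100

    int main(void) {
        long *data = malloc(N * sizeof *data);
        long sum = 0;
        size_t i, passes = 0;
        time_t end = time(NULL) + SECONDS;

        for (i = 0; i < N; i++)             /* the initial writes */
            data[i] = (long)i;

        while (time(NULL) < end) {          /* the repeated reads */
            for (i = 0; i < N; i++)
                sum += data[i];
            passes++;
        }
        printf("%zu passes, sum=%ld\n", passes, sum);
        free(data);
        return 0;
    }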

You can then run two test cases:

  1. run 8 instances of the process without CPU affinity

    $ for p in 0 1 2 3 4 5 6 7 ; do time ./my_process & done

  2. run 8 instances of the process with CPU affinity

    $ for p in 0 1 2 3 4 5 6 7 ; do time taskset -c $p ./my_process & done
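If you prefer to pin from inside the program rather than with taskset, Linux also offers sched_setaffinity(2); a minimal sketch (here the CPU number comes from the command line) is:

    /* Pin the calling process to one CPU before running the benchmark loop.
     * Linux-specific; _GNU_SOURCE is needed for the CPU_* macros. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int cpu = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g. 0..7 */
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* 0 = self */
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to CPU %d\n", cpu);
        /* ... benchmark loop goes here ... */
        return 0;
    }

In case 1 the scheduler is free to bounce the instances between cores and sockets; in case 2 each instance keeps its warm cache and node-local memory, so the time outputs should show the difference.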
