
Why does replacing if-else with bit operations turn out to be slower in CUDA?

Developer https://www.devze.com 2023-03-30 07:14 Source: Internet

I replace

if((nMark >> tempOffset) & 1){nDuplicate++;}
else{nMark = (nMark | (1 << tempOffset));}

with

nDuplicate += ((nMark >> tempOffset) & 1);
nMark = (nMark | (1 << tempOffset));

This replacement turns out to be 5 ms slower on a GT 520 graphics card.

Could you tell me why? Or do you have any ideas to help me improve it?


The native instruction set for the GPU deals with small conditions very efficiently via predication. Additionally, the ISET instruction converts a condition code register into an integer with the value 0 or 1, which naturally fits with your conditional increment.

My guess is that the key difference between the first and second formulations is that you've effectively hidden the fact that it's an if/else.
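For reference, here is a minimal host-side sketch (plain C, not a kernel) that checks the two formulations really do compute the same result, so any timing difference comes from the generated code rather than from the logic. The types are an assumption, since the question does not say how nMark, nDuplicate, and tempOffset are declared; the function names and count_mismatches are hypothetical helpers made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Version 1 from the question: explicit if/else.
   Assumes 32-bit unsigned nMark/nDuplicate (not stated in the question). */
static void mark_branchy(uint32_t *nMark, uint32_t *nDuplicate, unsigned tempOffset)
{
    assert(tempOffset < 32);  /* shifting a 32-bit value by >= 32 is undefined */
    if ((*nMark >> tempOffset) & 1u) {
        (*nDuplicate)++;
    } else {
        *nMark |= (1u << tempOffset);
    }
}

/* Version 2 from the question: unconditional add and OR. */
static void mark_branchless(uint32_t *nMark, uint32_t *nDuplicate, unsigned tempOffset)
{
    assert(tempOffset < 32);
    *nDuplicate += (*nMark >> tempOffset) & 1u;  /* adds 1 only if the bit was set */
    *nMark |= (1u << tempOffset);                /* no-op if the bit was already set */
}

/* Hypothetical helper: counts (mark, offset) pairs on which the versions disagree. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    uint32_t seed = 12345u;
    for (int i = 0; i < 1000; i++) {
        seed = seed * 1664525u + 1013904223u;  /* simple LCG for pseudo-random marks */
        for (unsigned off = 0; off < 32; off++) {
            uint32_t m1 = seed, m2 = seed, d1 = 0, d2 = 0;
            mark_branchy(&m1, &d1, off);
            mark_branchless(&m2, &d2, off);
            if (m1 != m2 || d1 != d2) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main and checking that it returns 0 confirms the equivalence on the host. It says nothing about which form is faster on the GPU; that still needs the disassembly or a timing run.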

To tell for sure, you can use cuobjdump to look at the microcode generated for the two cases: specify --keep to nvcc and use cuobjdump on the .cubin file to see the disassembled microcode.


Shot in the dark, but in the latter implementation you're always incrementing/re-assigning the nDuplicate variable, whereas previously you weren't incrementing/assigning to it when the test in the if statement was false. I'm guessing the overhead comes from that, but you don't describe your test data set, so I don't know if that was already the case.


Does your program exhibit significant branch divergence? If you're running, e.g., 100 warps of which only 5 have divergent behavior, and they run on 5 SMs, you would see only 21 time units instead of the expected 20. That is a 5% increase, which could easily be outweighed by the cost of doing 2x the work in each thread just to avoid rare divergence.
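The arithmetic behind that estimate can be sketched as below. This is a toy model, not a description of how the hardware schedules warps: it assumes warps are split evenly across SMs, each SM runs its warps one after another, a non-divergent warp costs one time unit, and a two-way divergent warp costs two. The function names are made up for illustration.

```c
#include <assert.h>

/* Toy model: time units taken by the busiest SM.
   Assumes an even split of warps and of divergent warps across SMs. */
static unsigned slowest_sm_time(unsigned warps, unsigned sms, unsigned divergent_warps)
{
    assert(sms > 0);
    assert(divergent_warps <= warps);
    unsigned per_sm = (warps + sms - 1) / sms;                 /* ceil(warps / sms) */
    unsigned divergent_per_sm = (divergent_warps + sms - 1) / sms;
    if (divergent_per_sm > per_sm) {
        divergent_per_sm = per_sm;                             /* cannot exceed warps on the SM */
    }
    /* each two-way divergent warp costs one extra time unit */
    return per_sm + divergent_per_sm;
}

/* Percentage slowdown relative to the divergence-free case (integer division). */
static unsigned slowdown_percent(unsigned warps, unsigned sms, unsigned divergent_warps)
{
    unsigned base = slowest_sm_time(warps, sms, 0);
    unsigned with_divergence = slowest_sm_time(warps, sms, divergent_warps);
    return (with_divergence - base) * 100u / base;
}
```

With 100 warps, 5 SMs, and 5 divergent warps, the model gives 20 units without divergence and 21 with it, i.e. the 5% figure above.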

Barring that, the 520 is a fairly modern graphics card, and might incorporate modern SIMT scheduling techniques, e.g. Dynamic Warp Formation and Thread Block Compaction, to hide SIMT stalls. Maybe look into architectural features (specs) or write a simple benchmark to generate n-way branch divergence and measure slowdown?

Barring that, check where your variables live. Does making them shared affect performance/results? Since the second version always accesses all variables while the first can avoid accessing nDuplicate, slow (uncoalesced global?) memory accesses could explain it.

Just some things to think about.


For low-level optimization, it is often helpful to look at the low-level assembly (SASS) of the kernel directly. You can do this with the cuobjdump tool distributed as part of the CUDA Toolkit. Basic usage is to compile with -keep in nvcc then do:

cuobjdump -sass mykernel.cubin

Then you can see the exact sequence of instructions and compare them. I'm not sure why version 1 would be faster than version 2 of the code, but the SASS listings might give you a clue.
