OpenCL - Impact of barrier on performance

Developer https://www.devze.com 2023-04-12 15:32 Source: web
In OpenCL, all the threads need to compute a few common values. Which of the following two cases is faster?

1. All the threads compute the values and store them in private memory; no synchronization is required among threads.
2. One thread computes the values and stores them in local memory, followed by a barrier. All the threads of the work-group then read the values from local memory.

Thanks.
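
For concreteness, the two cases might be sketched as the kernel sources below, held as C string constants the way a host program would hand them to clCreateProgramWithSource. The kernel names, the `seed` argument, and the placeholder computation `seed * seed + 1.0f` are assumptions for illustration, not code from the question.

```c
/* Case 1: every work-item recomputes the common value into a private variable.
 * No synchronization is needed. (Hypothetical kernel, for illustration only.) */
static const char *const CASE1_SRC =
    "__kernel void case1(__global const float *in,\n"
    "                    __global float *out, float seed)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    float common = seed * seed + 1.0f; /* placeholder work, private copy */\n"
    "    out[gid] = in[gid] * common;\n"
    "}\n";

/* Case 2: one work-item per work-group computes the value into local memory;
 * the rest wait at the barrier and then read it. */
static const char *const CASE2_SRC =
    "__kernel void case2(__global const float *in,\n"
    "                    __global float *out, float seed)\n"
    "{\n"
    "    __local float common;\n"
    "    size_t gid = get_global_id(0);\n"
    "    if (get_local_id(0) == 0)\n"
    "        common = seed * seed + 1.0f; /* placeholder work, done once */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);    /* make the local write visible */\n"
    "    out[gid] = in[gid] * common;\n"
    "}\n";
```

Every work-item in the work-group has to reach the barrier, so it sits outside the `if`.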


Is there any correct answer to this, given the range of devices on which OpenCL executes (e.g. various GPUs, various CPUs, and the Cell BE)? The performance characteristics will vary greatly between CPU and GPU, and potentially also between GPU vendors and models.

You will have to measure, on the platforms and implementations of interest to you or your users.

Is it possible in your case to pre-compute the few common values on the host and pass them to the OpenCL kernel, either as kernel arguments at run time or as compile-time definitions?
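
As a rough sketch of the compile-time route, the host can format the precomputed values into `-D name=value` definitions and pass the resulting string as the options argument of clBuildProgram. The helper name and the macro names `COMMON_SCALE` and `COMMON_COUNT` are hypothetical, chosen only for this example.

```c
#include <stdio.h>

/* Writes build options such as
 *   "-D COMMON_SCALE=2.500000000e+00f -D COMMON_COUNT=4"
 * into buf. The float is printed in exponent form with an 'f' suffix so the
 * kernel sees a valid single-precision literal.
 * Returns the snprintf result: the caller should check it is >= 0 and < len. */
int format_build_options(char *buf, size_t len,
                         float common_scale, int common_count)
{
    return snprintf(buf, len,
                    "-D COMMON_SCALE=%.9ef -D COMMON_COUNT=%d",
                    (double)common_scale, common_count);
}
```

The kernel would then use `COMMON_SCALE` and `COMMON_COUNT` as ordinary constants. The trade-off is that the program has to be rebuilt whenever the values change, so this route suits values that change rarely.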


It depends on the complexity of calculating those common values, and on how many work items are able to run in parallel.

Let's say the time to calculate the common values is A, the time to do the rest of the calculation is B, and the overhead of the barrier is AO for the A part and BO for the B part. We can then estimate the time for each option.

  • Option 1, a single thread, 1000 work items: 1000A + 1000B
  • Option 2, a single thread, 1000 work items: A + AO + 1000B + 1000BO
  • Option 1, 1000 threads, 1000 work items: A + B
  • Option 2, 1000 threads, 1000 work items: A + AO + B + BO

When you've got as many threads as work items, option 2 is obviously slower, by AO + BO. When you've got a single thread, option 2 is quicker whenever 999A exceeds AO + 1000BO, which is roughly whenever BO is small compared to A.

The truth is probably somewhere in the middle.
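
To explore the cases in between, the estimates above can be turned into a small cost model. This is only a sketch: it assumes work items run in ceil(items / threads) sequential rounds, which reproduces the four estimates listed above but is a simplification of how a real device schedules work. The function names are made up for this example.

```c
#include <stddef.h>

/* Sequential rounds needed when `items` work items share `threads` threads
 * (assumed model: ceiling division, threads must be nonzero). */
static double rounds(size_t items, size_t threads)
{
    return (double)((items + threads - 1) / threads);
}

/* Option 1: every work item pays for both A and B. */
double option1_time(size_t items, size_t threads, double a, double b)
{
    return rounds(items, threads) * (a + b);
}

/* Option 2: A and its overhead AO are paid once; each round pays B plus BO. */
double option2_time(size_t items, size_t threads,
                    double a, double ao, double b, double bo)
{
    return a + ao + rounds(items, threads) * (b + bo);
}
```

Plugging in measured values of A, B, AO and BO for a given device gives an estimate of the thread count at which the two options break even.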

Option 3 is to have the host calculate these values and put the results in constant memory. If you do this and use a little double buffering, you can probably hide the time needed to calculate the next set of common values while you're waiting for OpenCL to finish the current calculation.
