
How to get a list (or subset) from an OpenCL Kernel?

I have a large array with 2^20 ulongs in it. This little OpenCL kernel runs through it like a charm. Yet, I have absolutely no idea (and Google hasn't helped here) how to return a small number of items (2^10) from it.

What I'm looking for is a fixed-size list with at most 1024 items that have a Hamming distance (popcount) smaller than a given number. The list order doesn't matter, so perhaps I should be asking for a subset of these 2^20 items.
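
A rough sketch of the per-item test involved, for reference only; the kernel below is illustrative and not from the question (the names data, ref, maxDist and selectedFlags are made up). OpenCL's built-in popcount (available since OpenCL 1.2) applied to the XOR of two ulongs gives their Hamming distance:

__kernel void hamming_flag(__global const ulong *data,
                           ulong ref,        // pattern to compare against
                           uint maxDist,     // the "given number" from the question
                           __global uint *selectedFlags)
{
  size_t gid = get_global_id(0);
  // Hamming distance = number of differing bits = popcount of the XOR.
  uint dist = (uint)popcount(data[gid] ^ ref);
  selectedFlags[gid] = (dist < maxDist) ? 1u : 0u;
}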


Since the output is expected to be much smaller than the input, maintaining a global index into the output buffer via atomic access will not be too inefficient. You need to pass a buffer containing a single uint, initially set to 0:

__kernel void K(...,__global uint * outIndex,...)
{
  ...
  if (selected)
  {
    uint index = atomic_inc(outIndex);  // or atom_inc if using OpenCL 1.0 extension
    out[index] = value;
  }
}
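
On the host side, a minimal sketch (plain C, OpenCL 1.1 API) of how that single-uint counter could be set up, the kernel launched, and only the items actually written read back; the function and variable names, the kernel argument order and the 1024-item capacity are assumptions for illustration, not part of the answer:

#include <CL/cl.h>

#define MAX_OUT 1024u

/* Runs an already-built selection kernel over n ulongs in inBuf and returns
   how many items it appended to the output via the atomic counter. */
static cl_uint run_selection(cl_context ctx, cl_command_queue queue,
                             cl_kernel kernel, cl_mem inBuf, size_t n,
                             cl_ulong *results)  /* room for MAX_OUT items */
{
  cl_int err = CL_SUCCESS;
  cl_uint zero = 0, count = 0;

  /* Counter buffer holding a single uint, set to 0 before the launch. */
  cl_mem outIndexBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                      sizeof(cl_uint), NULL, &err);
  clEnqueueWriteBuffer(queue, outIndexBuf, CL_TRUE, 0,
                       sizeof(cl_uint), &zero, 0, NULL, NULL);

  /* Output buffer sized for the worst case. */
  cl_mem outBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                 MAX_OUT * sizeof(cl_ulong), NULL, &err);

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf);       /* input data    */
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &outIndexBuf); /* outIndex      */
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &outBuf);      /* out (results) */

  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

  /* Read the counter first, then only the items that were actually written.
     If more than MAX_OUT items can match, the kernel should also check
     index < MAX_OUT before writing. */
  clEnqueueReadBuffer(queue, outIndexBuf, CL_TRUE, 0,
                      sizeof(cl_uint), &count, 0, NULL, NULL);
  if (count > MAX_OUT)
    count = MAX_OUT;
  clEnqueueReadBuffer(queue, outBuf, CL_TRUE, 0,
                      count * sizeof(cl_ulong), results, 0, NULL, NULL);

  clReleaseMemObject(outIndexBuf);
  clReleaseMemObject(outBuf);
  return count;
}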


A list as such is not supported in OpenCL. OpenCL C is essentially standard C with some extensions and some limitations. You can only operate on buffers (i.e., arrays).

What you are looking for is a global memory buffer, which you need to allocate before you run the kernel. You can write your results into it and then retrieve them on the host with clEnqueueReadBuffer.


Well, there is a way, through some hacks. I forked pyopencl.algorithm and created a new method, sparse_copy_if(), that returns the exact-sized buffer I need, as if it were a list that items had been appended to. I will document it and submit a patch to Andreas.

If your buffers are too large, though, there is a way to improve performance even more: I followed Rick's suggestion above, created a hash table, and threw the desired results in there. (Note that there is always a risk of collisions, so the hash-table buffer/array has to be orders of magnitude larger than your expected output.)
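
A minimal sketch of that hash-table scatter, assuming the same Hamming-distance test as in the question; the hash function, the sentinel value and the table size are illustrative assumptions, not the author's actual code:

#define EMPTY_SLOT 0xFFFFFFFFFFFFFFFFUL

__kernel void scatter_to_hash(__global const ulong *data,
                              ulong ref,
                              uint maxDist,
                              __global ulong *table,  // pre-filled with EMPTY_SLOT
                              uint tableSize)         // e.g. ~25,000, much larger than the expected hits
{
  size_t gid = get_global_id(0);
  ulong v = data[gid];
  if (popcount(v ^ ref) < maxDist)
  {
    // Cheap multiplicative hash; colliding items simply overwrite each other,
    // which is why the table is made orders of magnitude larger than the
    // expected number of hits.
    uint slot = (uint)((v * 0x9E3779B97F4A7C15UL) % tableSize);
    table[slot] = v;
  }
}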

Then, I run sparse_copy_if() on the hash-table buffer and receive nothing but a perfectly sized buffer.

In conclusion:

I have a kernel scanning a 1,000,000-element buffer. It computes results for all elements but doesn't separate out the ones I want.

These desired results are then thrown into a ~25,000-element buffer (a hash table, significantly smaller than the original data).

Then, by running sparse_copy_if() on the hash-table buffer, you get the desired output, almost as if it were a list to which items had been appended.

sparse_copy_if(), of course, has the overhead of creating the perfectly sized buffers and copying data into them. But I've found that this overhead generally pays off, since you are now making low-latency transfers of small buffers/arrays from device back to host.

Code for testing sparse_copy_if() performance versus copy_if().
