Parallelize while loop with OpenMP_问答_开发者

I have a very large data file, and each record in this data file has 4 lines. I have written a very simple C program开发者_如何学Go to analyze files of this type and print out some useful information. The basic idea of the program is this.

int main()
{
  char buffer[BUFFER_SIZE];
  while(fgets(buffer, BUFFER_SIZE, stdin))
  {
    fgets(buffer, BUFFER_SIZE, stdin);
    do_some_simple_processing_on_the_second_line_of_the_record(buffer);
    fgets(buffer, BUFFER_SIZE, stdin);
    fgets(buffer, BUFFER_SIZE, stdin);
  }
  print_out_result();
}

This of course leaves out some details (sanity/error checking, etc), but that is not relevant to the question.

The program works fine, but the data files I'm working with are huge. I figured I would try to speed up the program by parallelizing the loop with OpenMP. After a bit of searching, though, it appears that OpenMP can only handle for loops where the number of iterations is know beforehand. Since I don't know the size of the files beforehand, and even simple commands like wc -l take a long time to run, how can I parallelize this program?

As thiton mentioned, this code could be I/O bounded. However, these days many computers may have SSDs and high-throughput RAID disks. In such case, you can get speedup from parallelization. Moreover, if the computation is not trivial, then parallelize wins. Even if the I/O is effectively serialized due to saturated bandwidth, you can still get speedup by distributing the computation to multicore.

Back to the question itself, you can parallelize this loop by OpenMP. With stdin, I have no idea to parallelize because it needs to read sequentially and no prior information of the end. However, if you're working a typical file, you can do it.

Here is my code with omp parallel. I used some Win32 API and MSVC CRT:

void test_io2()
{
  const static int BUFFER_SIZE = 1024;
  const static int CONCURRENCY = 4;

  uint64_t local_checksums[CONCURRENCY];
  uint64_t local_reads[CONCURRENCY];

  DWORD start = GetTickCount();

  omp_set_num_threads(CONCURRENCY);

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    FILE* file = fopen("huge_file.dat", "rb");
    _fseeki64(file, 0, SEEK_END);
    uint64_t total_size = _ftelli64(file);

    uint64_t my_start_pos = total_size/CONCURRENCY * tid;
    uint64_t my_end_pos   = min((total_size/CONCURRENCY * (tid + 1)), total_size);
    uint64_t my_read_size = my_end_pos - my_start_pos;
    _fseeki64(file, my_start_pos, SEEK_SET);

    char* buffer = new char[BUFFER_SIZE];

    uint64_t local_checksum = 0;
    uint64_t local_read = 0;
    size_t read_bytes;
    while ((read_bytes = fread(buffer, 1, min(my_read_size, BUFFER_SIZE), file)) != 0 &&
      my_read_size != 0)
    {
      local_read += read_bytes;
      my_read_size -= read_bytes;
      for (int i = 0; i < read_bytes; ++i)
        local_checksum += (buffer[i]);
    }

    local_checksums[tid] = local_checksum;
    local_reads[tid]     = local_read;

    fclose(file);
  }

  uint64_t checksum = 0;
  uint64_t total_read = 0;
  for (int i = 0; i < CONCURRENCY; ++i)
    checksum += local_checksums[i], total_read += local_reads[i];

  std::cout << checksum << std::endl
    << total_read << std::endl
    << double(GetTickCount() - start)/1000. << std::endl;
}

This code looks a bit dirty because I needed to precisely distribute the amount of the file to be read. However, the code is fairly straightforward. One thing keep in mind is that you need to have a per-thread file pointer. You can't simply share a file pointer because the internal data structure may not be thread-safe. Also, this code can be parallelized by parallel for. But, I think this approach is more natural.

Simple experimental results

I have tested this code to read a 10GB file on either a HDD (WD Green 2TB) and a SSD (Intel 120GB).

With a HDD, yes, no speedups were obtained. Even slowdown was observed. This clearly shows that this code is I/O bounded. This code virtually has no computation. Just I/O.

However, with a SSD, I had a speedup of 1.2 with 4 cores. Yes, the speedup is small. But, you still can get it with SSD. And, if the computation becomes a bit more (I just put a very short busy-waiting loop), speedups would be significant. I was able to get speedup of 2.5.

In sum, I'd like to recommend that you try to parallelize this code.

Also, if the computation is not trivial, I would recommend pipelining. The above code simply divides into several big chunks, resulting in poor cache efficiency. However, pipeline parallelization may yield better cache utilization. Try to use TBB for pipeline parallelization. They provide a simple pipeline construct.

Have you checked that your process is actually CPU-bound and not I/O-bound? Your code looks very much like I/O-bound code, which would gain nothing from parallelization.

In response to "minding", I don't think your code actually optimize anything here. There are a lot of common misunderstanding about this statement "#pragma omp parallel", this one would actually just spawn the threads, without the "for" key word, all the threads will just execute whatever codes that are following. So your code would actually be duplicating the computation on each thread. In response to Daniel, you were right, OpenMP can't optimize while loop, the only way to optimize it is by restructuring the code so that iteration is known in advance (such as while loop it once with a counter). Sorry about posting another answer, as I can't comment yet, but hopefully, this clears out the common misunderstandings.