What is the best way to avoid overloading a parallel file-system when running embarrassingly parallel jobs?_问答_开发者

We have a problem which is embarrassingly parallel - we run a large number of instances of a single program with a different data set for each; we do this simply by submitting the application many times to the batch queue with different parameters each time.

However with a large number of jobs, not all of them complete. It does not appear to be a problem in the queue - all of the jobs are started.

The issue appears to be that with a large number of instances of the application running, lots of jobs finish at roughly the same time and thus all try to write out their data to the parallel file-system at pretty much the same time.

The issue then seems to be that either the program is unable to write to the file-system and crashes in some manner, or just sits there waiting to write and the batch queue system kills the job after it's been sat waiting too long. (From what I have gathered on the pr开发者_运维知识库oblem, most of the jobs that fail to complete, if not all, do not leave core files)

What is the best way to schedule disk-writes to avoid this problem? I mention our program is embarrassingly parallel to highlight the fact the each process is not aware of the others - they cannot talk to each other to schedule their writes in some manner.

Although I have the source-code for the program, we'd like to solve the problem without having to modify this if possible as we don't maintain or develop it (plus most of the comments are in Italian).

I have had some thoughts on the matter:

Each job write to the local (scratch) disk of the node at first. We can then run another job which checks every now and then what jobs have completed and moves the files from the local disks to the parallel file-system.
Use an MPI wrapper around the program in master/slave system, where the master manages a queue of jobs and farms these off to each slave; and the slave wrapper runs the applications and catches the exception (could I do this reliably for a file-system timeout in C++, or possibly Java?), and sends a message back to the master to re-run the job

In the meantime I need to pester my supervisors for more information on the error itself - I've never run into it personally, but I haven't had to use the program for a very large number of datasets (yet).

In case it's useful: we run Solaris on our HPC system with the SGE (Sun GridEngine) batch queue system. The file-system is NFS4, and the storage servers also run Solaris. The HPC nodes and storage servers communicate over fibre channel links.

Most parallel file systems, particularly those at supercomputing centres, are targetted for HPC applications, rather than serial-farm type stuff. As a result, they're painstakingly optimized for bandwidth, not for IOPs (I/O operations per sec) - that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny little files. It is all to easy for users to run something that runs fine(ish) on their desktop and naively scale up to hundreds of simultaneous jobs to starve the system of IOPs, hanging their jobs and typically others on the same systems.

The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. But some tried-and-true strategies:

If you are outputting many files per job, change your output strategy so that each job writes out one file which contains all the others. If you have local ramdisk, you can do something as simple as writing them to ramdisk, then tar-gzing them out to the real filesystem.
Write in binary, not in ascii. Big data never goes in ascii. Binary formats are ~10x faster to write, somewhat smaller, and you can write big chunks at a time rather than a few numbers in a loop, which leads to:
Big writes are better than little writes. Every IO operation is something the file system has to do. Make few, big, writes rather than looping over tiny writes.
Similarly, don't write in formats which require you to seek around to write in different parts of the file at different times. Seeks are slow and useless.
If you're running many jobs on a node, you can use the same ramdisk trick as above (or local disk) to tar up all the jobs' outputs and send them all out to the parallel file system at once.

The above suggestions will benefit the I/O performance of your code everywhere, not juston parallel file systems. IO is slow everywhere, and the more you can do in memory and the fewer actual IO operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.

Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.

It is hard to decide if you don't know what exactly causes the crash. If you think it is an error related to the filesystem performance, you can try an distributed filesystem: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html

If you want to implement Master/Slave system, maybe Hadoop can be the answer.

But first of all I would try to find out what causes the crash...

OSes don't alway behave nicely when they run out of resources; sometimes they simply abort the process that asks for the first unit of resource the OS can't provide. Many OSes have file handle resource limits (Windows I think has a several-thousand handle resource, which you can bump up against in circumstances like yours), and failure to find a free handle usually means the OS does bad things to the requesting process.

One simple solution requiring a program change, is to agree that no more than N of your many jobs can be writing at once. You'll need a shared semaphore that all jobs can see; most OSes will provide you with facilities for one, often as a named resource (!). Initialize the semaphore to N before you launch any job. Have each writing job acquire a resource unit from the semaphore when the job is about to write, and release that resource unit when it is done. The amount of code to accomplish this should be a handful of lines inserted once into your highly parallel application. Then you tune N until you no longer have the problem. N==1 will surely solve it, and you can presumably do lots better than that.