
Writing to a remote file: When does write() really return?

Source: https://www.devze.com, 2023-04-11 17:13 (reposted from the web)

I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).

What I want to understand is:

When I write() (or pwrite()), when exactly does the write call return?

I see three possibilities:

  1. write returns immediately after queueing the I/O operation on the client side:

    In this case, write can return before the data has actually left the client node. When you write to a local hard drive, the write call uses delayed writes: the data is simply queued for writing later. Does the same happen when you write to a remote hard disk? I wrote a test case that writes a large matrix (1 GByte) to a file. Without fsync it showed very high bandwidth values, whereas with fsync the results looked more realistic. So it looks like delayed writes could be in use.

  2. write returns after the data has been transferred to the server buffer:

    Now the data is on the server, but it sits in a buffer in the server's main memory and is not yet permanently stored on the hard drive. In this case, I/O time should be dominated by the time it takes to transfer the data over the network.

  3. write returns after data has been actually stored on the hard drive:

    I am fairly sure this does not happen by default (unless you write really large files, which fill up your RAM so that data eventually has to be flushed out).
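A test of the kind described under option 1 could look roughly like the sketch below. It is not the original test case: the names `timed_write` and `compare`, the file name `write_test.bin` and the 16 MiB payload size are all assumptions chosen for illustration. It times the same write once without and once with `fsync()`.

```python
import os
import tempfile
import time


def timed_write(path, payload, sync):
    """Write payload to path and return the elapsed seconds.

    With sync=True the measured time also includes os.fsync().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        offset = 0
        start = time.perf_counter()
        while offset < len(view):
            # os.write() may accept fewer bytes than requested, so loop.
            offset += os.write(fd, view[offset:])
        if sync:
            # Blocks until the file system reports the data as flushed.
            os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory, size=16 * 1024 * 1024):
    """Return (MB/s without fsync, MB/s with fsync) for one file in directory."""
    payload = bytes(size)
    path = os.path.join(directory, "write_test.bin")  # hypothetical file name
    try:
        buffered = timed_write(path, payload, sync=False)
        synced = timed_write(path, payload, sync=True)
    finally:
        if os.path.exists(path):
            os.remove(path)
    return size / buffered / 1e6, size / synced / 1e6


if __name__ == "__main__":
    # A temporary local directory is used here; point it at the remote
    # mount to measure the file system in question.
    with tempfile.TemporaryDirectory() as d:
        fast, slow = compare(d)
        print(f"without fsync: {fast:.0f} MB/s, with fsync: {slow:.0f} MB/s")
```

A large gap between the two figures suggests that the unsynced write returned while the data was still cached, although on its own it does not tell you whether the cache was on the client or on the server.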

Additionally, what I would like to be sure about is:

Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency and bandwidth, and the hard drive bandwidth, do not feature in the program's execution time at all? Assume we do not call fsync or anything similar.

EDIT: I am using the pvfs2 parallel file system


Option 3 is of course simple and safe. However, a production-quality, POSIX-compatible parallel file system that performs well enough for anyone to actually want to use it will typically use option 1, combined with some more or less involved mechanism to avoid conflicts when, for example, several clients cache the same file.

As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".

If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (or, alternatively, which parts of POSIX semantics it skips, a la NFS). What makes this, err, interesting is that the POSIX fs semantics hark back to the 1970s, with little to no thought given to how to support network filesystems.

I don't know about pvfs2 specifically, but typically, in order to conform to POSIX and provide decent performance, option 1 is used together with some kind of cache coherency protocol (which is what Lustre does, for example). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write() calls to the file until some pages have been transferred to the server.
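To illustrate the fsync() point, here is a minimal sketch of a write helper that only reports success after fsync() has returned. The names `write_all` and `write_durably` are hypothetical, and how far the guarantee extends still depends on the file system and server configuration.

```python
import os


def write_all(fd, data):
    """Call os.write() until the kernel has accepted every byte."""
    view = memoryview(data)
    offset = 0
    while offset < len(view):
        offset += os.write(fd, view[offset:])
    return offset


def write_durably(path, data):
    """Write data to path and return the byte count only after fsync() succeeds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # At this point the data may still sit in the client's cache.
        total = write_all(fd, data)
        # fsync() raises OSError if the flush fails, so a write-back error
        # that write() never reported can surface here.
        os.fsync(fd)
        return total
    finally:
        os.close(fd)
```

A caller that needs the data to survive a crash would use `write_durably(path, data)` and treat any `OSError` as "not written", rather than trusting the return value of `write()` alone.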


You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.

The following are all taken from Linux. Solaris and others may differ.

Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
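As a small sketch of how two of these flags are passed, the helper below opens a file with either `O_SYNC` or `O_DSYNC`. The function name `open_synchronous` and the file name are hypothetical, and the code assumes Linux, where Python exposes both constants in the `os` module.

```python
import os
import tempfile


def open_synchronous(path, data_only=False):
    """Open path for writing with O_SYNC, or O_DSYNC if data_only is True.

    With O_SYNC each write() behaves as if it were followed by fsync();
    with O_DSYNC, as if it were followed by fdatasync().
    """
    sync_flag = os.O_DSYNC if data_only else os.O_SYNC
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag
    return os.open(path, flags, 0o644)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        fd = open_synchronous(os.path.join(d, "sync_test.bin"))
        try:
            # This call should not return until the data has been flushed.
            os.write(fd, b"hello")
        finally:
            os.close(fd)
```

`O_DIRECT` is a different kind of flag: it bypasses the page cache and requires suitably aligned buffers, and on its own it does not promise that the data has reached stable storage.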

Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.

Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.
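To see which mount options are in effect on the client, you can read `/proc/mounts` on Linux. The sketch below parses that format and finds the mount that covers a given path. The names `parse_mounts` and `mount_for` are hypothetical, and the parser ignores the octal escapes (such as `\040` for a space) that `/proc/mounts` uses in mount points.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type, options) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        entries.append((fields[1], fields[2], fields[3].split(",")))
    return entries


def mount_for(path, entries):
    """Return the entry whose mount point is the longest prefix of path."""
    path = os.path.abspath(path)
    best = None
    for entry in entries:
        mount_point = entry[0]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = entry
    return best


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            entries = parse_mounts(f.read())
        print(mount_for(os.getcwd(), entries))
```

If the option list contains `sync` or `noac`, for instance, writes through that mount behave differently from the default, which is worth checking before interpreting any timing numbers. The export options on the server are not visible this way and have to be checked on the server itself.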
