[mdlug] stale NFS file handles - Achilles' heel of Linux?

Sat Jun 14 23:52:41 EDT 2008

Dean,

> Hello, can anyone answer this question:
> stale NFS file handles, is there an equivalent problem
> in every OS or just those that rely on NFS?
> Is there a better way to share files over a network?  

If you reboot the file server while another host still has files
on it in use, that host is likely to get a stale file handle.
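
For what it's worth, here is a minimal sketch of how a client program
might cope, assuming a Linux client where the failure shows up as an
OSError with errno set to ESTALE. The function name read_with_reopen
and its parameters are made up for illustration. Reopening by path
can only help once the server is back and the mount has recovered.

```python
import errno
import time


def read_with_reopen(path, opener=open, retries=3, delay=0.0):
    """Read a whole file, reopening it by path if the handle is stale.

    Hypothetical helper: 'opener' exists only so the behaviour can be
    exercised without a real NFS mount.
    """
    for attempt in range(retries + 1):
        try:
            # A fresh open() asks the server for a new file handle.
            with opener(path, "rb") as f:
                return f.read()
        except OSError as exc:
            # Only ESTALE is retried; any other error, or running out
            # of attempts, is passed on to the caller.
            if exc.errno != errno.ESTALE or attempt == retries:
                raise
            time.sleep(delay)
```

A long job would wrap its reads (and, with more care, its writes)
this way, with a delay long enough to cover a planned reboot.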

Analogously, if you turn off the company's power at the main
while an important conference call is going on,
you should expect the participants to stop and report that they
missed some of the proceedings.

NFS is one of the most reliable and easy to manage network file systems.

> I had a situation recently
> where there was a 38-hour job running on a cluster.
> We needed to add a SCSI disk to the file server (separate from the cluster).
> The users were screaming about space for their results.

> We had to reboot the file server to add the disk.
> All the NFS mounts were dead and somehow the big job got killed.

Rebooting the server means shutting down all of its services,
so every NFS mount served from it went dead,
along with any cluster components running on the file server.

AFAIK, most cluster jobs will not tolerate rebooting one of their
nodes, and almost certainly will not tolerate doing so without
notice to the cluster controller.

If the big job used files on the server, and found they weren't there
(because the server was down during reboot), most big jobs would
report the problem and quit (with some good data) rather than
produce a big garbage heap.
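
The "report the problem and quit" behaviour can be sketched roughly as
follows. This is only an illustration, not how any particular cluster
scheduler or CAE package works; run_job, steps and results_path are
hypothetical names. Each result is forced to disk before the next step
starts, so whatever was written before the failure is still good.

```python
import os


def run_job(steps, results_path):
    """Run steps in order, appending each result to results_path.

    Stops at the first I/O error and returns (completed_count, error);
    error is None if every step was written successfully.
    """
    done = 0
    for step in steps:
        line = str(step()) + "\n"
        try:
            with open(results_path, "a") as out:
                out.write(line)
                out.flush()
                # Push the data to the server before moving on.
                os.fsync(out.fileno())
        except OSError as exc:
            # The file server is unreachable or the handle is stale:
            # report how far we got instead of carrying on blindly.
            return done, exc
        done += 1
    return done, None
```

The caller can log the error and the count of completed steps, and
the job can later be restarted from that point.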

> This turned out to be very bad and the CAE manager
> who hates linux blamed it all on linux.
> (The CAE workstations, the file server, and the cluster all run linux).
> I didn't know what to say.
> The NFS stale file handle issue seems to be a sticky one.

Assuming it was his users who were screaming about space for their results,
ask him why he didn't provide the SCSI disk in time to install it before
the 38-hour job began,
or why his users demanded that you interrupt the job to
install the disk
rather than wait until the job was done
and the server was idle.

Hopefully helpful,
-- 
Bob

  "Evolution is an obfuscated (GTA) C contest."