Linux NFS Overview, FAQ and HOWTO Documents
This document provides an introduction to NFS as implemented in the Linux
kernel. It links to developers' sites, mailing list archives, and relevant
RFCs, and provides guidance for quickly configuring and getting started with
NFS on Linux. A Frequently Asked Questions section is also included.
This document assumes the reader is already familiar with generic NFS
terminology.
General Information
Quick Overview
- NFS Versions 2, 3, and 4 are supported on 2.6 and later kernels.
- NFS over UDP and TCP on IPv4 are supported on the latest 2.4 and 2.6 kernels.
- Linux NFS clients and servers have been tested against many non-Linux implementations.
- Version 1.0.1 of the NFS utilities tarball changed the default
server export behavior to "sync". If no behavior is specified in
the export list (thus assuming the default behavior), a warning will be
generated at export time.
- If you plan to deploy NFS extensively, consider subscribing
to one of these mailing lists:
NFS Mailing List, or
the AutoFS Mailing List.
Before reporting problems, you should search for similar issues in the
searchable mail archive.
Another searchable archive for NFS, supported by Google, is
here.
The searchable mail archive for AutoFS is
here.
- A useful set of generic NFS references includes the following:
- "NFS Illustrated," by Brent Callaghan; Addison-Wesley, 2000.
- "Managing NFS and NIS, 2nd edition," by Hal Stern, Mike Eisler, Ricardo Labiaga; O'Reilly, 2001.
- "Linux NFS and Automounter Administration," by Erez Zadok; Sybex, 2001.
- "Using the Linux NFS Client with Network Appliance Filers," by Charles Lever; Netapp TR-3183, 2004.
- "Mike Eisler's NFS blog."
- "Eric Kustarz's blog."
- "NFS version 4 home page."
- Finally, the "linux.org online library" has many references.
Quick Server Setup Guide
- Acquire and install a recent distribution of Linux.
- Set up your /etc/exports file
(man exports for details); a sample exports file is sketched after this list.
- Consult your distribution's documentation to determine which
/etc/init.d start-up script is used to start your server.
Start NFS services by invoking this script as root, using the
"start" parameter.
Consider adding this script to the list of scripts that are automatically
run at system start-up. (Red Hat uses the chkconfig command for this
purpose).
- Read the NFS How-To for advice on tuning and securing your server.
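Here is a minimal sketch of the server-side setup described above; the export
path, host pattern, and start-up commands are examples only and should be
adapted to your site and distribution:
# /etc/exports: export /export/home read-write to hosts in example.com,
# using the protocol-compliant "sync" behavior
/export/home    *.example.com(rw,sync)

# tell a running server about changed exports, then display what is exported
exportfs -ra
exportfs -v

# on Red Hat-style systems, for example, start the service and enable it at boot
/etc/init.d/nfs start
chkconfig nfs on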
Quick Client Setup Guide
- Acquire and install a recent distribution of Linux.
To enable NLM lock recovery, ensure your client's host name,
as returned by uname -n, matches the host name returned
by DNS.
- The NLM protocol is handled by an in-kernel service in modern kernels,
but the user-level rpc.statd program must be running to enable NLM
lock recovery. Consult your distribution's documentation to determine which
/etc/init.d start-up script is used to start it. Start the
NSM daemon by invoking this script as root, using the "start" parameter.
Consider adding this script to the list of scripts that are automatically
run at system start-up. (Red Hat uses the chkconfig command for this
purpose).
- Create the directories on your client where you will mount the NFS
shares.
- Add entries in /etc/fstab corresponding to your mount points
(man nfs for details); a sample entry is sketched after this list.
- Use mount -a -t nfs to mount the NFS shares.
- During system boot-up, most distributions automatically mount NFS shares
that are listed in /etc/fstab. If yours doesn't, check your
distribution's documentation for instructions on how to configure your
client to do this.
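As a sketch, the client-side pieces described above might look like the
following; the server name, export path, and local mount point are placeholders:
# /etc/fstab entry:  server:/export                   mount point  type  options       dump pass
server.example.com:/export/home   /mnt/home    nfs   rw,hard,intr  0    0

# mount every NFS entry listed in /etc/fstab
mount -a -t nfs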
Frequently Asked Questions:
The Questions and Answers section is divided into categories
as follows:
- Section A:
About the NFS protocol
- Section B:
Performance
- Section C:
Common export configuration errors
- Section D:
Commonly occurring error messages
- Section E:
Using Linux NFS with alternate platforms
A. About the NFS protocol
-
A1. What are the primary differences between NFS Versions 2 and 3?
-
A. From the system point of view, the primary differences are
these:
Version 2 clients can access only the lowest 2GB of a file
(signed 32 bit offset). Version 3 clients support larger
files (up to 64 bit offsets). Maximum file size depends on
the NFS server's local file systems.
NFS Version 2 limits the maximum size of an on-the-wire
NFS read or write operation to 8KB (8192 bytes).
NFS Version 3 over UDP theoretically supports much larger transfers
(the maximum size of a UDP datagram is 64KB, so after allowing room
for the NFS, RPC, and UDP headers, the largest on-the-wire
NFS read or write size for NFS over UDP is around 60KB).
For NFS Version 3 over TCP, the limit depends on the
implementation. Most implementations don't support more than
32KB.
NFS Version 3 introduces the concept of Weak Cache
Consistency. Weak Cache Consistency helps NFS Version 3
clients more quickly detect changes to files that are
modified by other clients. This is done by returning extra
attribute information in a server's reply to a read or write
operation. A client can use this information to decide
whether its data and attribute caches are stale.
Version 2 clients interpret a file's mode bits themselves to
determine whether a user has access to a file. Version 3
clients can use a new operation (called ACCESS) to ask the
server to decide access rights. This allows a client that
doesn't support Access Control Lists to interact correctly
with a server that does.
NFS Version 2 requires that a server save all
the data in a write operation to disk before it replies to a
client that the write operation has completed. This can be
expensive because it breaks write requests into small chunks
(8KB or less) that must each be written to disk before the
next chunk can be written. Disks work best when they can
write large amounts of data all at once.
NFS Version 3 introduces the concept of "safe
asynchronous writes." A Version 3 client can specify that
the server is allowed to reply before it has saved the
requested data to disk, permitting the server to gather
small NFS write operations into a single efficient disk
write operation.
A Version 3 client can also specify that the data
must be written to disk before the server replies,
just like a Version 2 write.
The client specifies the type of write by setting the
stable_how field in the arguments of each write
operation to UNSTABLE to request a safe asynchronous write,
and FILE_SYNC for an NFS Version 2 style write.
Servers indicate whether the requested data is permanently
stored by setting a corresponding field in the response
to each NFS write operation.
A server can respond to an UNSTABLE write request with an
UNSTABLE reply or a FILE_SYNC reply, depending on whether
or not the requested data resides on permanent storage yet.
An NFS protocol-compliant server must respond to a FILE_SYNC
request only with a FILE_SYNC reply.
Clients ensure that data that was written using a safe
asynchronous write has been written onto permanent storage
using a new operation available in Version 3 called a
COMMIT.
Servers do not send a response to a COMMIT operation
until all data specified in the request has been written
to permanent storage.
NFS Version 3 clients must protect buffered data that has
been written using a safe asynchronous write but not yet
committed.
If a server reboots before a client has sent an
appropriate COMMIT, the server can reply to the eventual
COMMIT request in a way that forces the client to resend
the original write operation.
Version 3 clients use COMMIT operations when flushing
safe asynchronous writes to the server during a close(2)
or fsync(2) system call, or when encountering
memory pressure.
For more information on the NFS Version 3 protocol, read
RFC 1813.
-
A2. Can I run NFS across the TCP/IP Transport Protocol?
-
A. Client support for NFS over TCP is integrated into all 2.4
and later kernels.
Server support for TCP appears in 2.4.19 and later 2.4 kernels, and
in 2.6 and later kernels. Not all 2.4-based distributions support NFS over TCP in
the Linux NFS server.
-
A3. Are there any other versions of NFS under development?
-
A. Yes. NFS Version 4 is being developed
under the supervision of the
Internet Engineering Task Force (IETF).
The IETF hosts
several documents
that describe the NFS Version 4 working group's efforts to date.
Several commercial vendors have already released NFS clients and servers
that support the new version of NFS.
A Linux implementation of NFS Version 4 is under development at the
University of Michigan's
Center for Information Technology Integration
under the direction of Andy Adamson.
This version is available now in the Linux 2.6 kernel.
Although this is a reference implementation of an NFS Version 4 client and
server,
one of two such implementations required as part of the IETF's standards
process, it is still missing some features.
These features are currently under development and should appear soon.
For more information, visit CITI U-M's
NFSv4 project web site.
-
A4. How can I prevent the use of NFS Version 2, or of
other NFS versions?
-
A. The protocol version is determined at mount time,
and can be modified
by specifying the version of the NFS protocol, or the version of the transport
protocol, supported by the server. For example, the client mount command
mount -o vers=3 foo:/ /bar will request that the server
use NFS Version 3 when granting a mount request (note that "vers"
and "nfsvers" have the same meaning in the mount command; the string "vers"
is compatible with NFS implementations from Solaris and other vendors). If
you wish to prevent use of NFS Version 2 in all cases, then you must restart
rpc.mountd on the server, with the option "-N 1 -N 2". The best
way to do this is to modify the nfs rpc.mountd configuration on the server
by modifying the NFS startup script options, and then shutting down and restarting
NFS as a whole:
- cd /etc/rc.d/init.d
- Modify RPCMOUNTDOPTS in the nfs script to include
"-N 1 -N 2"
- Restart nfs (you must have root access) with "./nfs
restart"
You will now get the following error when attempting to NFS mount
a file system using NFS Version 2 (now unrecognized) after
restarting rpc.mountd:
mount: RPC: Unable to receive; errno = Connection refused
You will also subsequently get the following (non-fatal) warning
when you unmount any nfs mounted file system at all, regardless of when
it was mounted:
Bad UMNT RPC: RPC: Program/version mismatch; low version
= 3, high version = 3
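To see which NFS and mount protocol versions a server actually advertises, you
can query its portmapper with rpcinfo; the server name below is a placeholder
and the output shown is only illustrative:
rpcinfo -p server.example.com | egrep 'nfs|mountd'
#    100003    2   udp   2049  nfs
#    100003    3   udp   2049  nfs
#    100005    3   udp    892  mountd    (the mountd port varies)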
-
A5. Can I use Kerberos authentication with NFS on Linux?
-
A. Sun defined a new interface called RPCSEC GSSAPI that
allows the use of authentication plug-ins for protocols, like
NFS, that ride on top of RPC.
This is the standard way of providing Kerberos authentication support
for NFS.
Support for NFS security mechanisms using RPCSEC GSSAPI is now under
development in Linux, based on work that is already in the 2.6 kernel.
When completed, RPCSEC GSSAPI will work with all versions of the NFS
protocol.
In addition to the three flavors of Kerberos security (authentication,
integrity checking, and full privacy), RPCSEC GSSAPI will eventually support
other security flavors such as SPKM3, and will be fully compatible with
other implementations such as the one in Solaris.
Besides kernel support for RPCSEC GSSAPI,
additional support is required in the form of various user-level changes
(the mount command, and a pair of rpcgss daemons, for example).
Currently, only Fedora Core 2 has RPCSEC GSSAPI enabled in its kernels
and user-level support integrated into its standard distribution.
We expect that, as this work matures, it will be adopted by all 2.6-based
distributions.
Currently Fedora Core 2 supports only the use of Kerberos 5 authentication
with NFS Version 4.
Because of bugs and missing features, for now support for Linux NFS with
Kerberos is appropriate only for early adopters, and not for production
use.
For more information on RPCSEC GSS, read
RFC 2203.
Information on the Linux implementation of RPCSEC GSSAPI is available
here.
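As a sketch of what a Kerberos-protected mount looks like on a distribution
that ships this support (Fedora Core 2 at the time of writing), assuming the
rpcgss daemons are running and your Kerberos realm is already configured; the
server name and mount point are placeholders:
mount -t nfs4 -o sec=krb5 server.example.com:/ /mnt/krb5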
-
A6. What are the main new features in version 4 of the NFS protocol?
-
A. Here is a short summary of new features.
For a complete discussion of these features, see the
documentation provided by the NFSv4 Working Group.
NFS Versions 2 and 3 are stateless protocols, but NFS Version 4 introduces
state.
An NFS Version 4 client uses state to notify an NFS Version 4 server of its
intentions on a file: locking, reading, writing, and so on. An NFS Version
4 server can return information to a client about what other clients have
intentions on a file to allow a client to cache file data more aggressively
via delegation. To help keep state consistent, more sophisticated client
and server reboot recovery mechanisms are built in to the NFS Version 4
protocol.
NFS Version 4 introduces support for byte-range locking and share reservation.
Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must
maintain contact with an NFS Version 4 server to continue extending its open
and lock leases.
NFS Version 4 introduces file delegation. An NFS Version 4 server can allow
an NFS Version 4 client to access and modify a file in its own cache without
sending any network requests to the server, until the server indicates
via a callback that another client wishes to access a file. This reduces
the amount of traffic between NFS Version 4 client and server considerably in
cases where no other clients wish to access a set of files
concurrently.
NFS Version 4 uses compound RPCs. An NFS Version 4 client can combine several
traditional NFS operations (LOOKUP, OPEN, and READ, for example)
into a single RPC request to carry out a complex operation in one
network round trip.
NFS Version 4 specifies a number of sophisticated security mechanisms, and
mandates their implementation by all conforming clients. These
mechanisms include Kerberos 5 and SPKM3, in addition to traditional
AUTH_SYS security. A new API is provided to allow easy addition of
new security mechanisms in the future.
NFS Version 4 standardizes the use and interpretation of ACLs across Posix
and Windows environments. It also supports named attributes. User
and group information is stored in the form of strings, not as
numeric values. ACLs, user names, group names, and named attributes
are stored with UTF-8 encoding.
NFS Version 4 combines the disparate NFS protocols (stat, NLM, mount, ACL,
and NFS) into a single protocol specification to allow better
compatibility with network firewalls.
NFS Version 4 introduces protocol support for file migration
and replication.
NFS Version 4 requires support of RPC over streaming network transport
protocols such as TCP. Although many NFS Version 4 clients continue to
support RPC via datagrams, this support may be phased out over time in
favor of more reliable stream transport protocols.
For more information on the NFS Version 4 protocol, read
RFC 3530.
-
A7. I've heard NFS Version 4 is not interoperable with earlier
versions of NFS. What's the real deal?
-
A.
In the same way that an NFS Version 3-only client cannot communicate
with an NFS Version 2-only server, an NFS Version 4-only client or server
cannot communicate with clients and servers that only support earlier
versions of NFS. NFS Version 4 uses a different version number in RPC
headers to distinguish the new protocol version. Thus, clients that
support only NFS Version 4 cannot communicate with servers that
support only versions 2 and 3. True interoperability is achieved
by implementing clients and servers that can communicate using
all three protocol versions: NFS Versions 2, 3, and 4.
Early versions of the Linux NFS Version 4 prototype used two separate
clients: the original client that supported NFS Versions 2 and 3,
and a new separate client that supported only NFS Version 4. For
various reasons this made it impossible to mount NFS Version 4 servers
at the same time as NFS Version 2 and 3 servers. This
was an implementation choice, not a protocol limitation. This
is no longer the case: the Linux 2.5 NFS client, and all future
versions of the Linux NFS client, support all three versions
seamlessly, and can concurrently mount servers that export version
2, version 3, and version 4.
The goal is that NFS Version 4 will coexist with versions 2 and 3 in
much the same way as NFS Version 3 coexists with NFS Version 2 today.
Upgrading should be nearly transparent.
There are some minor interoperability issues when applications
running on clients make use of some of the new features of NFS Version 4
such as mandatory locking, share reservations, and delegations.
These features help make NFS Version 4 more compatible with traditional
Windows file systems like CIFS. Network Appliance, which makes
file servers that can export file systems via both CIFS and
NFS concurrently, has published papers describing some of these
issues.
-
A8. What is close-to-open cache consistency?
-
A.
Perfect cache coherency among disparate NFS clients is very expensive
to achieve, so NFS settles for something weaker that satisfies the
requirements of most everyday types of file sharing. Everyday file
sharing is most often completely sequential: first client A
opens a file, writes something to it, then closes it; then client
B opens the same file, and reads the changes.
So, when an application opens a file stored on NFS, the NFS client
checks that the file still exists on the server and that the opener is
permitted to access it, by sending a GETATTR or ACCESS operation. When the application
closes the file, the NFS client writes back any pending changes to the
file so that the next opener can view the changes. This also gives
the NFS client an opportunity to report any server write errors to
the application via the return code from close().
This behavior is referred to as close-to-open cache consistency.
Linux implements close-to-open cache consistency by comparing
the results of a GETATTR operation done just after the file is closed
to the results of a GETATTR operation done when the file is next opened.
If the results are the same, the client will assume its data cache
is still valid; otherwise, the cache is purged.
Close-to-open cache consistency was introduced to the Linux NFS
client in 2.4.20. If for some reason you have applications that
depend on the old behavior, you can disable close-to-open support by
using the "nocto" mount option.
There are still opportunities for a client's data cache to
contain stale data. The NFS version 3 protocol introduced "weak
cache consistency" (also known as WCC) which provides a way of
checking a file's attributes before and after an operation to
allow a client to identify changes that could have been made by
other clients. Unfortunately when a client is using many
concurrent operations that update the same file at the same time,
it is impossible to tell whether it was that client's updates or
some other client's updates that changed the file.
For this reason, some versions of the Linux 2.6 NFS client
abandon WCC checking entirely, and simply trust their own data cache.
On these versions, the client can maintain a cache full of stale file
data if a file is opened for write. In this case, using file locking
is the best way to ensure that all clients see the latest version of
a file's data.
A system administrator can try using the "noac" mount option
to achieve attribute cache coherency among multiple clients.
Almost every client operation checks file attribute information. Usually
the client keeps this information cached for a period of time to reduce
network and server load. When "noac" is in effect, a client's file
attribute cache is disabled, so each operation that needs to check a
file's attributes is forced to go back to the server. This permits
a client to see changes to a file very quickly, at the cost of many
extra network operations.
Be careful not to confuse "noac" with "no data caching."
The "noac" mount option will keep file attributes up-to-date with
the server, but there are still races that may result in data
incoherency between client and server. If you need absolute cache
coherency among clients, applications can use file locking, where
a client purges file data when a file is locked, and flushes changes
back to the server before unlocking a file; or applications can open
their files with the O_DIRECT flag to disable data caching entirely.
For a better understanding of the compromises faced in the design
of NFS caching, see Callaghan's
"NFS Illustrated."
-
A9. Why does opening files with O_APPEND on multiple clients cause
the files to become corrupted?
-
A.
The NFS protocol does not support atomic append writes, so
append writes are never atomic on NFS for any platform.
Most NFS clients, including the Linux NFS client in kernels newer than 2.4.20,
support "close to open" cache consistency, which provides good performance
and meets the sharing needs of most applications.
This style of cache consistency does not provide strict coherence of the
file size attribute among multiple clients, which would be necessary to ensure
that append writes are always placed at the end of a file.
Read all about the NFS cache consistency model here.
Alternately, the NFS protocol could include a specific atomic append write
operation, but today's versions of the protocol do not. The designers
of the NFS protocol felt that atomic append writes would be rarely used,
so they never added the feature. Even with such a feature, keeping the
file size attribute up to date would be challenging.
-
A10. What does it mean when my application fails because of
an ESTALE error?
-
A.
The NFS protocol does not refer to files and directories by name or
by path; it uses an opaque binary value called a file handle.
In NFSv3 this file handle can be up to 64 bytes long; NFSv4
allows file handles to be even larger. A file's file handle is assigned by an
NFS server, and is supposed to be unique on that server for the life
of that file. Clients discover the value of a file's file handle by
doing a LOOKUP operation, or by using part of the results of a
READDIRPLUS operation. A special procedure is usually performed
while mounting an NFS file system to determine the file handle of
the file system's root directory.
ESTALE is an error reported by a server when a file handle is not
valid. Here are some common reasons why a file handle is not valid:
- The file resides in an export that is not accessible.
It could have been unexported, the export's access list may have changed,
or the server could be up but simply not exporting its shares.
- The file handle refers to a deleted file.
After a file is deleted on the server, clients don't find out
until they try to access the file with a file handle they had cached
from a previous LOOKUP.
Using rsync or mv to replace a file while it is in use on
another client is a common scenario that results in an ESTALE error.
- The file was renamed to another directory, and subtree checking
is enabled on a share exported by a Linux NFS server.
See question C7 for more details
on subtree checking on Linux NFS servers.
- The device ID of the partition that holds your exported files has
changed.
File handles often contain all or part of a physical device ID, and
that ID can change after a reboot, RAID-related changes, or a hardware
hot-swap event on your server.
Using the "fsid" export option on Linux will force the fsid of an exported
partition to remain the same. See the "exports" man page for more details.
- The exported file system doesn't support permanent inode numbers.
Exporting FAT file systems via NFS is problematic for this reason. This
problem can be avoided by exporting only local filesystems which have good
NFS support. See question C6 for more information.
A client can recover when it encounters an ESTALE error during a pathname
resolution, but not during a READ or WRITE operation. An NFS client
prevents data corruption by notifying applications immediately when a
file has been replaced during a read or write request. After all, it is
usually catastrophic if an application writes to or reads from the wrong
file.
Thus in general, to recover from an ESTALE error, an application must close
the file or directory where the error occurred, and reopen it so the NFS
client can resolve the pathname again and retrieve the new file handle.
Older Linux NFS clients do not recover from an ESTALE error, even during
pathname resolution. In the 2.6.12 kernel and later, the Linux VFS layer
can redrive pathname resolution when an ESTALE is encountered to recover
appropriately.
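As a sketch of the "fsid" workaround mentioned above, an export line might look
like the following; the path, host pattern, and fsid value are examples only,
and each fsid must be unique among your exports:
# /etc/exports
/export/data    *.example.com(rw,sync,fsid=1)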
B. Performance
-
B1. What can I do to improve NFS performance in general?
-
A. Review the
performance section
of the NFS Howto doc and then look at several things:
- How fast is the disk IO speed on your server(s)? That will have a
big impact on overall NFS performance for both Version 2 and Version 3.
- Does your application open its files with the O_SYNC option?
That will force NFS Version 3 to behave exactly like (synchronous) NFS
Version 2.
- UDP requires IP fragment reassembly. If you see fragmentation errors
indicated in the output of netstat -s you may want to increase the
size of your socket buffers.
- Have you started enough NFS daemons?
Review the contents of /proc/net/rpc/nfsd, especially the line
that begins with "th".
The first number on that line is the total number of NFS server threads
that are started and waiting for NFS requests.
The second number is the number of times that all of the threads have been
in use at once. The remaining numbers are a thread count time histogram.
See the NFS How-To for details on tuning your server based on the data
in this histogram; a quick way to inspect this line is sketched after this list.
- Do your NICs and switches/hubs/routers autonegotiate down
to 10baseT or half duplex? Half duplex will give you many more
network collisions, which are the worst thing possible for NFS
performance over UDP.
- Are you running ext3 or ReiserFS? You might look at placing
the journal on a separate disk, or on NVRAM. As of January 2002, ext3
allows this, and Reiser has a patch available.
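As a quick way to inspect the thread histogram mentioned in the list above
(the numbers shown are only illustrative):
grep ^th /proc/net/rpc/nfsd
# th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150
If the second number grows quickly, all of your server threads are frequently
busy at once; many distributions let you raise the thread count in their NFS
start-up configuration (for example, RPCNFSDCOUNT in /etc/sysconfig/nfs on
Red Hat-style systems).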
-
B2. Everything seems so slow and I think the default rsize
and wsize are set to 1024 - what's going on?
-
A. Normally, the Linux NFS client uses read-ahead and delayed
writes to hide the latency of NFS read and write operations. However, the
client can cache only a single read or write request per page. Thus, if
reading or writing a whole page requires more than one on-the-wire read or
write operation (which it certainly does if rsize or wsize is 1024), each of
these operations must complete before the next one can be issued. In the case
of small NFS Version 3 write operations, the write must be FILE_SYNC because
the client must fully complete each write before it issues the next one.
Note that this limitation becomes especially significant for hardware
that supports larger pages. For instance, many distributors provide a Linux
kernel built for Itanium processors that uses 16KB pages rather than 4KB pages
normally found on 32-bit x86 systems. On such a system, if wsize is smaller
than 16KB, the client always sends write operations serially, if they occur in
the same page.
Finally, note that the maximum transfer size permitted by the Linux
server (NFSSVC_MAXBLKSIZE) is raised to 32KB by the patches that implement
NFS over TCP for the 2.4 kernels.
The latest 2.4 kernels have TCP support integrated, and allow transfer sizes
up to 32KB.
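For example, to request 32KB transfers at mount time (the size actually used is
negotiated down to the server's maximum, as described in question B8; the
server and paths are placeholders):
mount -o rsize=32768,wsize=32768 server.example.com:/export /mnt/export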
-
B3. Why can't I mount more than 255 NFS file systems
on my client? Why is it sometimes even less than 255?
-
A. On Linux, each mounted file system is assigned a major
number, which indicates what file system type it is (eg. ext3, nfs, isofs);
and a minor number, which makes it unique among the file systems
of the same type. In kernels prior to 2.6, Linux major and minor numbers
have only 8 bits, so they may range numerically from zero to 255. Because
a minor number has only 8 bits, a system can mount only 255 file systems
of the same type. So a system can mount up to 255 NFS file systems, another
255 ext3 file systems, 255 more isofs file systems, and so on. Kernels after
2.6 have 20-bit wide minor numbers, which alleviates this restriction.
For the Linux NFS client, however, the problem is somewhat worse
because it is an anonymous file system. Local disk-based file
systems have a block device associated with them, but anonymous file
systems do not.
/proc, for example, is an anonymous file system, and so are
other network file systems like AFS.
All anonymous file systems share the same major number, so there can be
a maximum of only 255 anonymous file systems mounted on a single host.
Usually you won't need more than ten or twenty total NFS mounts on
any given client. In some large enterprises, though, your work and users
might be spread across hundreds of NFS file servers. To work around the
limitation on the number of NFS file systems you can mount on a single host,
we recommend that you set up and run one of the automounter daemons for
Linux. An automounter finds and mounts file systems as they are needed,
and unmounts any that it finds are inactive. You can find more
information on Linux automounters
here.
You may also run into a limit on the number of privileged network
ports on your system. The NFS client uses a unique socket with its own port
number for each NFS mount point. Using an automounter helps address
the limited number of available ports by automatically unmounting
file systems that are not in use, thus freeing their network ports.
NFS version 4 support in the Linux NFS client uses a single socket per
client-server pair, which also helps increase the allowable number of
NFS mount points on a client.
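A minimal autofs sketch, assuming your distribution ships the autofs package;
the map names, keys, and servers below are placeholders:
# /etc/auto.master: mounts under /data are managed by the map /etc/auto.data
/data    /etc/auto.data

# /etc/auto.data:  key   options      server:/export
home    -rw,hard     server1.example.com:/export/home
src     -ro          server2.example.com:/export/src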
-
B4. Why does NFS Version 2 seem so much faster than Version 3?
-
A. There are actually two problems here, plus a feature.
First, some background; the NFS Version 2 protocol specification
requires a server to record each write to permanent storage
before it sends a reply to a client. This makes server and client
reboot recovery very simple, and provides a good guarantee that
data sent to the server is permanently stored.
Linux servers (although not the Solaris reference implementation) allow
this requirement to be relaxed by setting a per-export option in
/etc/exports.
The name of this export option is "[a]sync"
(note that there is also a client-side mount option by the same name, but
it has a different function, and does not defeat NFS protocol
compliance).
When set to "sync," Linux server behavior strictly conforms to
the NFS protocol. This is default behavior in most other server
implementations. When set to "async," the Linux server replies to
NFS clients before flushing data or metadata modifying operations to
permanent storage, thus improving performance, but breaking all
guarantees about server reboot recovery.
First problem:
The default value of this export option on Linux NFS servers before
nfs-utils-1.0.1 was "async". If a system administrator did not
specify either "sync" or "async" in /etc/exports,
exportfs used "async" by default.
This allowed the server to reply to Version 2 write operations
and metadata update operations (such as CREATE or MKDIR)
before the requested data was written to the server's disk, thereby
greatly improving the performance of write operations as well as
introducing the possibility of undetectable data corruption.
Releases of nfs-utils starting with version 1.0.1 use a default value of
"sync," which causes the Linux server to conform properly to the NFS
protocol specification.
Second problem:
Support for NFS Version 3 in Linux 2.2's NFS server does not honor
the "async" export option. Thus, by default on a system running Linux
2.2 with an old version of the nfs-utils package, NFS Version 2 writes
are fast and unsafe, but Version 3 write and commit operations are safe,
although slower, since they always follow the client's request
for either UNSTABLE or FILE_SYNC (see question A1).
Feature:
When you use the exportfs command with its verbose option
set, it displays the various export options in effect for each
exported file system. If the "async" export option is set, it appears
in the option list, but if "sync" is requested, it will not appear in
the exportfs parameter list. This reflects the common usage of "sync"
as the default on other platforms, but can be somewhat confusing.
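For example (the export line and output below are only illustrative; on older
nfs-utils releases "sync" may be silently omitted from the option list even
though it is in effect):
exportfs -v
# /export/home    *.example.com(rw,wdelay,root_squash)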
-
B5. Why does default NFS Version 2 performance seem equivalent
to NFS Version 3 performance in 2.4 kernels?
-
A. See B4 for background information
on how export options affect the Linux NFS server's write behavior.
Since Linux 2.4, the NFS Version 3 server recognizes the
"async" export option. When this option is set, the server replies
to clients before data has been written to permanent storage. The
server also sends a FILE_SYNC response to the client, indicating that
the client need not retain buffered data or send a subsequent COMMIT
operation. This exposes the client to the same undetectable corruption
as exists for NFS Version 2 (with "async") if the server crashes before
it has actually written data to stable storage.
(See question B6 for further discussion of this
behavior and its consequences.)
Note that even if a client sends a Version 3 COMMIT operation, the
server replies immediately if the file system has been exported with
the "async" option.
Conversely, when the "sync" export option is used on a Linux 2.4
server, both Version 2 and Version 3 writes behave as required by the
NFS protocol specification. In this case, NFS Version 3 has a performance
advantage over NFS Version 2, while maintaining data resilience during a
server crash.
Note well that "[a]sync" also affects some metadata operations
on the server.
-
B6. Why is the "async" export option unsafe, and is that
really a serious problem?
-
A. The biggest problem is not just that it is unsafe,
but that corruption may not be detected.
In the Linux implementation of NFS Version 2, when the "async"
export option is in effect, a Linux NFS server may crash before posting
all NFS write requests to disk. A Version 2 client, however, always
assumes data is permanently written to stable storage, and that
it is safe to discard buffers containing the written data.
After a server crash, the Version 2 client cannot know that
unwritten data is lost; this is why Version 2 writes are supposed to
be permanent before the server replies. Even if a client still has
the modified data in its cache, the data on the server no longer
matches what is cached on the client (since some or all of the writes
did not complete before the server crashed). This may cause applications
to make future decisions based on data cached by the client rather than
what is on the server, thus further corrupting the file.
For the Linux implementation of NFS Version 3, using the "async"
export option to allow faster writes is no longer necessary. NFS
Version 3 explicitly allows a server to reply before writing
data to disk, under controlled circumstances. It allows clients and
servers to communicate about the disposition of written data so that
in the event of a server reboot, a Version 3 client can detect the
reboot and resend the data.
In summary, be sure all exports on your Linux NFS servers use
the "sync" option by setting it explicitly or by upgrading your nfs-utils
package to version 1.0.1 or later. If you need fast writes, be sure
your clients mount using NFS Version 3. You may also improve
write performance by adding the "wdelay" option to your exports.
-
B7. I have achieved pretty fast speeds in some client benchmarks,
but when my client is heavily loaded, it slows down considerably.
Why does that happen?
-
A. The Linux client limits the total number of pending
read or write operations per mount point. This prevents the client
from exhausting its memory with cached read or write requests when
the network or server is slow. The hard limit is 256 outstanding
read or write operations per mount point. When that limit is reached,
the client does not issue a new read or write operation until at least
one outstanding read or write operation completes, thus serializing
all reads and writes on that mount point until load is reduced.
Two ways of mitigating this effect are to:
- Increase rsize and wsize on your client's mount points.
This increases the amount of data that can be involved in
outstanding reads or writes at any given time.
- Mount the same server partition multiple times on your clients,
and spread your applications among the mount points.
This limit has been removed in 2.6 and later kernels.
-
B8. Why won't my client let me use rsize or wsize larger than 8KB
when I mount my Linux NFS server?
-
A.
NFS Version 2 supports up to 8KB reads and writes.
NFS Version 3 allows larger reads and writes (see question
A1).
Stock 2.4 kernels earlier than 2.4.20 do not support read or write
operations larger than 8192 bytes for either NFS Version 2 or 3.
Server-side TCP support, introduced as an experimental compile-time
option in 2.4.20, increases the server's maximum I/O size to 32KB
by increasing the value of NFSSVC_MAXBLKSIZE
(see question B2).
When a client mounts a file server, the file server advertises the
largest number of bytes it can read or write in a single operation.
Clients always use the smaller of the server's maximum and the
rsize and wsize values specified by the client in
the mount command.
Large values of rsize and wsize may inhibit performance when using UDP.
UDP datagrams must be separated into fragments that fit within
your network's Maximum Transfer Unit.
The loss of any of these fragments requires retransmission
of the whole datagram. This may have a particularly adverse impact
on client performance if your network is congested.
TCP is considerably better at recovering one or two lost segments
and managing network congestion, so larger I/O operations are
usually more effective at reliably boosting performance when using
NFS over TCP.
-
B9. I use the "sync" or "noac" mount options. I've increased
my wsize, but write throughput is lower than I expect. Why is this?
-
A.
Normally, an NFS client delays sending application write requests,
allowing application processing to overlap with NFS write operations.
An NFS client only causes an application to wait for writes to
complete when the application closes or flushes a file.
When a client sends write operations synchronously, however,
the client causes applications to wait for each write operation
to complete at the server. This results in much lower performance.
The Linux NFS client uses synchronous writes under many circumstances,
some of which are obvious, and some of which you may not expect.
Applications enable synchronous writes for a single file by opening
a file with the O_SYNC or O_DSYNC flags. System administrators enable
synchronous writes for all files in a file system by mounting that
file system with the "sync" option. The "noac" mount option also
enables synchronous writes. If it didn't, applications running on
other clients would have a difficult time retrieving file modifications
if a client delayed writes.
Currently the Linux NFS client has a limitation which prevents it
from safely generating large synchronous writes. The client breaks
large write requests into on-the-wire write operations that are
no larger than a single page to guarantee that write requests arrive
on the server's disk in byte order (some applications depend on
this behavior). Even if you set wsize larger than a page, the
client will break any application write request into page-sized
NFS write operations to meet this guarantee.
In addition, if the server's page size is larger than the client's
page size, the server is forced to do additional work when the client
writes in small chunks. NFS clients normally align reads and writes
to their own page size, which then may be unaligned on the server if
it uses larger pages. Depending on the server OS and filesystem,
this could result in a number of performance limiting problems.
-
B10. Sometimes my server gets slow or becomes unresponsive, then
comes back to life. I'm using NFS over UDP, and I've noticed a lot of
IP fragmentation on my network. Is there anything I can do?
-
A.
UDP datagrams larger than the IP Maximum Transfer Unit (MTU) must be divided
into pieces that are small enough to be transmitted. If, for example, your
network's MTU is 1524 bytes, the Linux IP layer must break UDP datagrams
larger than 1524 bytes into separate packets, all of which must be smaller
than the MTU. These separated packets are called fragments.
The Linux IP layer transmits each fragment as it is breaking up a UDP
datagram, encoding enough information in each fragment so that the receiving
end can reassemble the individual fragments into the original UDP datagram.
If something happens that prevents a client from continuing to fragment a
packet (e.g., the output socket buffer space in the IP layer is exceeded),
the IP layer stops sending fragments. In this case, the receiving end has a
set of fragments that is incomplete, and after a certain time window, it will
drop the fragments if it does not receive enough to assemble a complete
datagram. When this occurs, the UDP datagram is lost. Clients detect this
loss when they have not received a reply from the server after a certain time
interval, and recover by retransmitting the datagram.
Under heavy write loads, the Linux NFS client can generate many large UDP
datagrams. This can quickly exhaust output socket buffer space on the
client. If this occurs many times in a short period, the client sends the
server a large number of fragments, but a whole datagram's worth of
fragments almost never arrives intact. This fills the server's IP reassembly
queue, causing it to become unreachable via UDP until it expels the useless
fragments from the queue.
Note that the same thing can occur on servers that are under a heavy read
load. If the server's output socket buffers are too small, large reads
will cause them to overflow during IP fragmentation. The client's IP
reassembly queue then fills with worthless fragments, and little UDP
traffic can get to the client.
Here are some symptoms of this problem:
- You use NFS over UDP with a large wsize (relative to the network's
MTU) and a write-intensive application workload, or with a large
rsize and a read-intensive application workload.
- You may see many fragmentation errors on your server or clients
(netstat -s will tell the story).
- Your server may periodically become very slow or unreachable.
- Increasing the number of threads on your server has no effect on
performance.
- One or a small number of clients seem to make the server unusable.
- The network path between your client and server may have a router or
switch with small port buffers, or the path may contain links that run at
different speeds (100Mb/s and GbE).
The fix is to make Linux's IP fragmentation logic continue
fragmenting a datagram even when output socket buffer space is
over its limit. This fix appears in kernels newer than 2.4.20.
You can work around this problem in one of several ways:
- Use NFS over TCP. TCP does not use fragmentation, so it
does not suffer from this problem. Using TCP may not be
possible with older Linux NFS clients and servers that only
support NFS over UDP.
- If you can't use NFS over TCP, upgrade your clients
to 2.4.20 or later.
- If you can't upgrade your clients, increase the default
size of your client's socket buffers (see below). 2.4.20
and later kernels do this automatically for the NFS client's
socket buffers. See
Section 5.3ff
of the NFS How-To for more information.
- If your rsize or wsize is very large, reduce it. This will
reduce the load on your client's and server's output socket buffers.
- Reduce network congestion by ensuring your GbE links
use full flow control, that your switch and router ports
use adequate buffer sizes, and that all links are negotiating
their fastest settings.
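A sketch of the first and third workarounds; the mount option and /proc paths
are standard, but the buffer size shown is only an example:
# prefer TCP, which avoids IP fragmentation of large NFS requests
mount -o tcp server.example.com:/export /mnt/export

# or, for UDP-only clients older than 2.4.20, enlarge the default socket buffers
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/wmem_default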
-
B11. Why does my server see so many ACCESS calls when using Linux
clients?
-
A.
Default NFS server behavior is to prevent root on client machines from
having privileged access to exported files. Servers do this by mapping
the "root" user to some unprivileged user (usually the user "nobody") on
the server side. This is known as root squashing. Most servers,
including the Linux NFS server, provide an export option to disable
this behaviour and allow root on selected clients to enjoy full root
privileges on exported file systems.
Unfortunately, an NFS client has no way to determine that a server is
squashing root. Thus the Linux client uses NFS Version 3 ACCESS operations
when an application is running on a client as root. If an application
runs as a normal user, a client uses its own authentication checking,
and doesn't bother to contact the server.
The Linux NFS client should cache the results of these ACCESS operations.
In fact, in the new 2.6.x kernels, it does this and it extends ACCESS
checking to all users to allow for generic uid/gid mapping on the server.
This also enables proper support for Access Control Lists in the server's
local file system. In pre-2.6 kernels, the stock NFS client does not
cache the results of ACCESS operations.
C. Common export configuration errors
-
C1. How are exported file systems and client mount points
tracked on the server?
-
A. /etc/exports contains information about how file
systems should normally be exported. This is only read by exportfs.
- /var/lib/nfs/etab contains information about what filesystems
should be exported to whom at the moment.
- /var/lib/nfs/rmtab contains a list of which filesystems
actually are mounted by certain clients at the moment.
- /proc/fs/nfs/exports contains information about
which filesystems are exported to actual individual clients (not to a subnet,
netgroup, or other wildcard) at the moment.
- /var/lib/nfs/xtab is the same information as
/proc/fs/nfs/exports but is maintained by nfs-utils instead of
directly by the kernel. It is only used if /proc isn't mounted.
-
C2. Can I modify export permissions without needing to
remount clients in order to have them take effect?
-
A. Yes. The safest thing to do is edit /etc/exports
and run "exportfs -r".
Note that when a mount request arrives, mountd checks .../etab
to see if that host is allowed access. If it is, an entry is placed in
.../rmtab and the filesystem is exported, thus creating an entry in
/proc/fs/nfs/exports.
When you run "exportfs -io <options> host:/dir",
the entry in ../etab is changed, or a new one is added. If
it is a subnet/wildcard/netgroup entry, then every line in ../rmtab
is checked to see if it matches. When a match is found, a host-specific
entry is given to (or changed in) the kernel. When you run "exportfs
-a", it makes sure that all entries in /etc/exports are properly reflected
in ../etab. Any extra entries in etab are left alone. Once the
correct content of etab has been determined, rmtab is examined to create a
list of specific-host entries for any new entries in etab. These host-specific
entries are given to the kernel.
When you run "exportfs -r", it ignores the prior contents
of ../etab and initializes etab to the contents of /etc/exports.
Then it inspects rmtab and makes any changes to /proc/fs/nfs/exports
that are necessary.
-
C3. My exports seem to be readable by everyone - or /etc/exports
is not giving the intended permissions
-
A. /etc/exports is VERY sensitive to whitespace -
so the following two lines are not the same, due to the space between
the host name and the opening parenthesis:
/export/dir hostname(rw,no_root_squash)
/export/dir hostname (rw,no_root_squash)
The first will grant hostname read and write access to
/export/dir without squashing root privileges. The second will
grant hostname read and write privileges with root squash,
and it will grant everyone else read and write access, without
squashing root privileges.
-
C4. I believe the Linux NFS server will not export a fat32 partition.
Is that correct?
-
A. FAT file systems can be exported, starting with the
early 2.4 kernels, but if used extensively, they may cause grief. First, only
those operations supported by the exported file system will be honoured.
Operations such as "chown", "link", and "symlink" are not supported by these
file systems, and will fail. Read, write, create, and so on should be fine, as long as
the files remain relatively unchanged.
The most serious problem is that the FAT filesystem layout does not contain
enough information to create a lasting identity needed for NFS to create
persistent filehandles. For example, if you take a file, rename it to another
directory, truncate it, and write new data to it, there is nothing stored in
the filesystem that can be used to show that the resulting file is, in any
sense, the "same" as the original file, and there is no way to find the new
file given any details about the original file. Therefore, the Linux NFS
server cannot guarantee that once you have opened a file, you can continue to
have access to that file, if the file is modified in the ways given above.
NFS may then be unable to locate or identify the file correctly, and so may
return ESTALE errors.
-
C5. Sometimes my client gets a "permission denied" error
when attempting to mount a file system, even though it managed it a few
hours earlier with no change to the configuration on the server.
-
A. Your server's /etc/exports is probably
misconfigured. If the exports file contains both domain names and
IP addresses, it can result in random client behavior when mounting,
especially if your clients have multiple IP addresses registered with
DNS.
If you export a directory and one of its ancestors, and both
reside on the same physical file system on the server, it can result
in random client behavior when mounting.
-
C6. Which local file systems can I export with the Linux NFS
server?
-
A. We expect the following local file systems to work,
as they are tested often: ext2, ext3, jfs, reiserfs, xfs.
These local file systems may work or may have a few minor-ish
issues: iso9660, ntfs, reiser4, udf. Ask on the
NFS mailing list
for details.
Any file system based on FAT or not having the ability to
provide permanent inode numbers will have trouble with NFS versions
2 and 3 (see question C4).
Local file systems that are known not to work with the Linux
NFS server are: procfs, sysfs, tmpfs (and friends).
-
C7. Why should I disable subtree checking on my NFS server exports?
-
A. When an NFS server exports a subdirectory of a local file
system, but leaves the rest unexported, the NFS server must check whether
each NFS request is against a file residing in the area that is exported.
This check is called the subtree check.
To perform this check, the server includes information about the parent
directory of each file in NFS file handles that are handed out to NFS
clients. If the file is renamed to a different directory, for example,
this changes the file handle, even though the file itself is still the
same file. This breaks NFS protocol-compliance, often causing
misbehavior on clients such as ESTALE errors, inappropriate access to
renamed or deleted files, broken hard links, and so on.
In the opinion of many, subtree checking causes much more trouble than it
saves, and should be avoided in most cases.
The subtree_check option is necessary only when you want
to prevent a file handle guessing attack from gaining access to files
that fall outside the exported part of your server's local file systems.
If you need to be certain that no one can access files
outside the exported part of a local file system, set up the partitions
on your server so that you only export whole file systems.
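For example, an export with subtree checking disabled might look like this
(the path and host pattern are placeholders):
# /etc/exports
/export/home    *.example.com(rw,sync,no_subtree_check)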
D. Commonly occurring error messages
-
D1. I keep getting permission failure messages at my
NFS server. What are they?
-
A. The messages you are mentioning take the following format:
Jan 7 09:15:29 server kernel: fh_verify: mail/guest permission failure,
acc=4, error=13
Jan 7 09:23:51 server kernel: fh_verify: ekonomi/test permission failure,
acc=4, error=13
They happen when an NFS setattr operation is attempted on a file you don't
have write access to. These messages are harmless.
-
D2. What is a "silly rename"? Why do these .nfsXXXXX files keep
showing up?
-
A. Unix applications often open a scratch file and then unlink
it. They do this so that the file is not visible in the file system name
space to any other applications, and so that the system will automatically
clean up (delete) the file when the application exits. This is known as
"delete on last close", and is a tradition among Unix applications.
Because of the design of the NFS protocol, there is no way for a file to
be deleted from the name space but still remain in use by an application.
Thus NFS clients have to emulate this using what already exists in the
protocol. If an open file is unlinked, an NFS client renames it to a
special name that looks like ".nfsXXXXX". This "hides" the file while
it remains in use. This is known as a "silly rename." Note that NFS
servers have nothing to do with this behavior.
After all applications on a client have closed the silly-renamed file,
the client automatically finishes the unlink by deleting the file on the
server. Generally this is effective, but if the client crashes before
the file is removed, it will leave the .nfsXXXXX file. If you are sure
that the applications using these files are no longer running, it is
safe to delete these files manually.
The NFS version 4 protocol is stateful, and could actually support
delete-on-last-close. Unfortunately there isn't an easy way to do this
and remain backwards-compatible with version 2 and 3 accessors.
-
D3. What does this mean: svc: unknown program
100227 (me 100003)
-
A. It refers to a mount request by an NFS client which
supports the Solaris NFS_ACL side-band protocol. The Linux NFS server
in the mainline kernels does not support this protocol, but many
distributions include patches that provide NFS_ACL support in their
NFS implementation. The message can be ignored safely.
-
D4. I frequently see this in my logs:
-
kernel: nfs: server server.domain.name not responding, still trying
kernel: nfs: task 10754 can't get a request slot
kernel: nfs: server server.domain.name OK
A. The "can't get a request slot" message means that the
client-side
RPC code has detected a lot of timeouts (perhaps due to network congestion,
perhaps due to an overloaded server), and is throttling back the number of
concurrent outstanding requests in an attempt to lighten the load. Some possible
causes:
- Network congestion
- Overloaded server
- Packets (input or output) dropped by a bad NIC or driver....
-
D5. I just upgraded to the latest nfs-utils and now NLM locking
no longer works on files residing on my NFS server. What's up?
-
A. The permissions on the /var/lib/nfs/sm and
/var/lib/nfs/sm.bak
directories must be addressed. Whoever rpc.statd runs as must have
ownership of, and read and write access to, those directories. The permissions
should be set to 700 for both. In addition, etab, rmtab, and xtab all must
exist and be writable by root.
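A sketch of the repair, assuming rpc.statd runs as the user "rpcuser" (the
actual account name varies between distributions; check your start-up scripts):
chown rpcuser /var/lib/nfs/sm /var/lib/nfs/sm.bak
chmod 700 /var/lib/nfs/sm /var/lib/nfs/sm.bak
touch /var/lib/nfs/etab /var/lib/nfs/rmtab /var/lib/nfs/xtab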
-
D6. I've mounted with the "intr" option but processes still become
unkillable when my server is unavailable. How do I kill the processes so
I can unmount them?
-
A.
It is true that even when using the "intr" mount option, you will not
always succeed in killing a task that is hanging on NFS. In these
instances, the task is usually waiting in the kernel on some
semaphore that is held by another process. Since signals cannot interrupt
semaphores, the signal will have no effect on the hanging task.
There have been some suggested solutions, but none have
been implemented. One is to set up a special class of semaphores
which are killable with 'SIGKILL', but replacing the relevant semaphores
in the VFS and VM layers will not be possible before the 2.7 kernels
at the earliest. Another solution under consideration is to cause rpciod
to awaken all waiting requests when a user requests an unmount, allowing
them to exit with an error.
Until these are implemented, you can work around this problem by
killing all processes waiting for I/O to complete in a given file system:
- use 'lsof' or some other means to identify processes waiting
on files in the target file system,
- kill -9 all the processes, then
- kill -9 rpciod.
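For example (the mount point is a placeholder, and kill -9 cannot be caught,
so save any work in those processes first):
lsof /mnt/stuck          # identify processes with files open under the mount
kill -9 <pid> ...        # kill each hung process
killall -9 rpciod        # then kill rpciod
umount /mnt/stuck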
Another, less desirable, workaround is to use "soft" mounts.
This will cause processes to stop retrying I/O after a time.
Eventually processes become unstuck and your file system can be
unmounted. However, soft mounts are not completely safe.
See question E4 for a description of the
risks of using "soft" mounts.
-
D7. How come lock recovery doesn't work for me?
-
A.
When a client reboots, it should notify any servers it had previously
mounted to release all locks that were held.
It does this by invoking rpc.statd during system start up.
There are several common problems that can prevent rpc.statd
from working.
First, be sure that your client has the appropriate startup script
enabled (/etc/rc.d/init.d/nfslock on Red Hat distributions).
Next, make certain that when rpc.statd starts up, the network is
already available for it to work (some DHCP-configured hosts may have
a problem with this, for example).
Make sure that the client's nodename (uname -n) is
the same as what is returned by gethostbyname(3) on your client.
These can differ because of your nsswitch configuration, the contents of
/etc/hosts, your client's DHCP configuration, or a DNS
misconfiguration.
The in-kernel lockd process uses a client's nodename to identify its
locks when sending lock requests.
Rpc.statd must send an identical string when it sends a recovery
notification, otherwise the server has no way to match the notification
to any locks it may still hold for the client.
It is also recommended that the nodenames of your NFS clients be fully
qualified domain names, not just bare hostnames. If another client in a
different domain with the same hostname contacts your server, fully
qualified nodenames on both clients allow the server to distinguish
between the locks set by each client.
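As a quick check, a small diagnostic program along these lines (a sketch
only, not part of nfs-utils) can compare the nodename reported by
uname(2) with the name returned by gethostbyname(3):
/* Sketch: compare the nodename lockd uses with what the resolver returns. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname un;
    struct hostent *he;

    if (uname(&un) < 0) {
        perror("uname");
        return 1;
    }
    he = gethostbyname(un.nodename);
    if (he == NULL) {
        fprintf(stderr, "cannot resolve %s\n", un.nodename);
        return 1;
    }
    printf("uname -n:         %s\n", un.nodename);
    printf("gethostbyname(3): %s\n", he->h_name);
    if (strcmp(un.nodename, he->h_name) != 0)
        printf("Mismatch: NLM lock recovery may not work.\n");
    return 0;
}
If the two names differ, adjust /etc/hosts, DNS, or your nsswitch
configuration until they match.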
If a firewall sits between your clients and server, bi-directional
RPC traffic must be allowed through it for lock recovery to work, because
the NLM protocol is callback-based. Two important facilities break if the
server cannot call back to the client:
-
Blocking locks (F_SETLKW) will not work properly, since the client expects
the server's lockd daemon to call it back as soon as any conflicting locks
have been released and the lock can be granted.
-
Server reboot recovery will be broken, since the server's rpc.statd
daemon will be incapable of notifying the clients that their locks have been
lost and need to be recovered.
-
D8. When my application uses memory-mapped NFS files, it breaks. Why?
-
A.
Usually this is because application developers rely on certain local
file system behaviors to guarantee data consistency, rather than
reading the mmap man pages carefully to understand what behavior is
required by all file system implementations. Some examples:
Although some implementations of munmap(2) happen to write dirty
pages to local file systems, the NFS version of munmap(2) does not.
An msync(2) call is always required to guarantee that dirty
mapped data is written to permanent storage.
A subtle ramification of the Linux NFS client's treatment of
munmap(2) is that the client does not consider munmap(2) to be a
close operation for the purposes of
close-to-open cache coherency.
The distinction between the MS_SYNC and MS_ASYNC flags
is also important. MS_ASYNC will force dirty mapped pages to
permanent storage eventually. Only MS_SYNC guarantees
that the pages are written before msync(2) returns to
your application.
Therefore applications should use msync(MS_SYNC) to serialize
data writes to mapped files.
Finally, the Linux NFS client may not flush dirty mapped pages when
a file descriptor is closed via close(2). During close processing
the client often does flush mapped pages along with pages dirtied by a
write(2) call, but this behavior is not guaranteed.
Many applications will open a file, map it, then close it and continue
using the map. The behavior described above is an attempt to optimize
the performance of this use case.
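To illustrate, here is a minimal sketch of the msync(MS_SYNC) usage
described above; the file name is a hypothetical example, and the file is
assumed to be at least one page long:
/* Sketch: ensuring dirty mapped pages reach the server before continuing. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *map;
    int fd = open("/mnt/nfs/data.bin", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memcpy(map, "hello", 5);    /* dirty the mapping */

    /* Only MS_SYNC guarantees the pages are written before msync(2)
     * returns; munmap(2) alone does not write them back. */
    if (msync(map, len, MS_SYNC) < 0)
        perror("msync");

    munmap(map, len);
    close(fd);
    return 0;
}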
-
D9. When I update shared executable files on my NFS exports, programs
running on my clients all segfault. How come?
-
A.
If you simply copy the new executable or library over an old version,
you are violating the NFS cache consistency rules
(described here)
by changing a file that is being held open on your clients.
Copying over executables creates a window during which an NFS client's
cache may hold parts of the old version and parts of the new version,
all combined in the same file.
The correct way to update executables and shared libraries on your NFS
shares is to use the install program with the '-b' option.
This renames the version of the executable that is in use, then
creates a new file to contain the new version of the executable.
-
D10. I'm trying to use flock()/BSD locks to lock files used on
multiple clients, but the files become corrupted. How come?
-
A.
flock()/BSD locks act only locally on Linux NFS clients prior
to 2.6.12. Use fcntl()/POSIX locks to ensure that file locks
are visible to other clients.
Here are some ways to serialize access to an NFS file.
- Use the fcntl()/POSIX locking API.
This type of locking provides byte-range locking across multiple clients
via the NLM protocol, or via NFSv4 (see the sketch at the end of this
answer).
- Use a separate lockfile, and create hard links to it.
See the description in the O_EXCL section of the creat(2)
man page.
It's worth noting that until early 2.6 kernels, O_EXCL creates were not atomic
on Linux NFS clients. Don't use O_EXCL creates and expect atomic behavior
among multiple NFS clients unless you are running a kernel newer than 2.6.5.
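Here is a minimal sketch of the hard-link lockfile technique just
described; the function and file names are hypothetical, and the unique
file must be created on the same exported file system as the lockfile:
/* Sketch: lock by creating a unique file and hard-linking it to the
 * agreed-upon lockfile.  link(2) is performed atomically by the server,
 * so this works even where O_EXCL creates are not atomic over NFS. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Returns 0 if the lock was acquired, -1 otherwise. */
int nfs_lockfile(const char *lockfile, const char *uniquefile)
{
    struct stat st;
    int fd;

    /* Create a file whose name is unique to this host and process. */
    fd = open(uniquefile, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* Try to link it to the shared lockfile name. */
    if (link(uniquefile, lockfile) == 0)
        return 0;

    /* Even if link(2) reported failure, the lock was acquired if the
     * unique file's link count rose to 2. */
    if (stat(uniquefile, &st) == 0 && st.st_nlink == 2)
        return 0;

    return -1;
}
A real application would also remove the lockfile and the unique file
once it is finished with the lock.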
It's a known issue that Perl uses flock()/BSD locking by default.
This can break programs ported from other operating systems, such as Solaris,
that expect flock/BSD locks to work like POSIX locks.
On Linux, using file locking instead of a hard link has the added benefit
of checkpointing the client's cache with the server. When a file lock is
acquired, the client will flush the page cache for that file so that
any subsequent reads get new data from the server. When a file lock
is released, any changes to the file on that client are flushed back
to the server before the lock is released so that other clients waiting
to lock that file can see the changes.
The NFS client in 2.6.12 provides support for flock()/BSD locks
on NFS files by emulating the BSD-style locks in terms of POSIX byte
range locks.
Other NFS clients that use the same emulation mechanism, or that use
fcntl()/POSIX locks, will then see the same locks that the
Linux NFS client sees.
On local Linux filesystems, POSIX locks and BSD locks are invisible
to one another.
Thus, due to this emulation, applications running on a Linux NFS server
will still see files locked by NFS clients as being locked with an
fcntl()/POSIX lock, whether the application on the client is
using a BSD-style or a POSIX-style lock. If the server application uses
flock()/BSD locks, it will not see the locks the NFS clients use.
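Here is the promised minimal sketch of fcntl()/POSIX locking on an NFS
file; the path is a hypothetical example:
/* Sketch: take an exclusive POSIX whole-file lock, which the client
 * propagates to the server via NLM or NFSv4. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct flock fl;
    int fd = open("/mnt/nfs/shared.dat", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;     /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* zero length means "to end of file" */

    /* F_SETLKW waits until the lock can be granted.  Acquiring the lock
     * also flushes the client's cached pages for this file, as noted
     * above, so subsequent reads fetch fresh data from the server. */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        perror("fcntl(F_SETLKW)");
        return 1;
    }

    /* ... read or update the file here ... */

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl); /* releasing writes back any dirty data */
    close(fd);
    return 0;
}
The whole-file lock shown here can be narrowed to a byte range by
setting l_start and l_len appropriately.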
-
D11. Why doesn't "mount -oremount,tcp" convert
an NFS-mounted file system mounted with UDP to one mounted with TCP?
-
A.
The "remount" option on the mount command only affects the generic mount
options, such as ro/rw, sync, and so on (see man mount for a
complete list of generic mount command options).
The NFS-specific mount options listed on the nfs man page can't
be changed with a "mount -oremount" style mount command.
You must unmount your file system and mount it again with new options
in order to modify the NFS-specific settings.
Note that the mount command may update the contents of /etc/mtab
whether or not the actual mount settings have changed in the kernel.
So when you try mount -oremount with an NFS-specific mount option,
subsequent mount commands may report that the setting is in effect.
This is only because the mount command is reading /etc/mtab.
The /proc/mounts file reflects the true mount options that the
kernel is using.
-
D12. I didn't mount with "intr" (the default is "nointr") and
some processes are unkillable when my server becomes unavailable.
What can I do?
-
A.
Use the umount command's "-f" flag to force an unmount. There
will be a brief pause while the umount command attempts to
contact the server, and then all outstanding requests to the server will
be failed, thus making the processes killable and responsive in other
ways.
Some programs, on receiving an I/O error, will simply retry the I/O, making
them unkillable again. For this reason, kill all processes on the
stuck mount first, and then run "umount -f". When the I/O requests
fail, the processes will become killable, will see the signal, and will
die. Sometimes it can take a couple of iterations of the "kill processes,
then umount -f" cycle before the file system is unmounted, but it
usually works.
If all else fails, you can still unmount the file system on which the
processes are hanging by using the "umount -l" command. This detaches the
stuck mount from the file system name space hierarchy on your client, so
it is no longer visible to other processes.
You can replace that mount point with another mount to the same server
when it becomes available again, or to some other server if the remote
data has moved. Note, though, that the old mount point will continue to
consume client memory until the stuck processes have all died.
|
|
|
E. Using Linux NFS with alternate platforms
|
-
E1. I use a Tru64 Unix 4.x or SunOS 4.1.x client.
NFS file locking does not seem to work unless I give all users permission
to access the file.
-
A. The default specifications for NFS Versions 2 and 3 allow
any user to lock a file regardless of whether that user has permission to
access the file. The writers of the Linux NFS server regarded this behavior
as insecure, and chose to allow only users who have access to a file to
lock it. However, older SunOS and Tru64 clients, and some HP/UX clients,
take advantage of the NFS specification by making all NFS file lock requests
with the credentials of the daemon. This means that if the daemon does not
have access to the files, the server will refuse to lock them.
The export option no_auth_nlm is designed to alleviate this
problem. Set it on any shares you wish to export to these clients.
This will disable the authorization check on file lock requests.
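For example, an /etc/exports entry along these lines (the export path and
client name are hypothetical) would allow such clients to lock files:
/export/data    client.example.com(rw,sync,no_auth_nlm)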
-
E2. I'm not using a Red Hat or VA Linux distribution, so the nfs-utils
start-up script in the RPM is broken. What do I do?
-
A. You should comment out the following line in your
/etc/rc.d/init.d/nfs script:
. /etc/rc.d/init.d/functions
-
E3. I'm using an IRIX client and I'm seeing a variety
of problems with file listings and the current working directory when using
a Linux server. The server is running NFS Version 3. Is this a Linux bug?
-
A. IRIX mishandles file handles shorter than 32
bytes, which the NFS server in Linux 2.4.x uses. SGI addressed this
problem in IRIX 6.5.13, released in 2001.
A workaround to this problem is to use NFS Version 2. On the IRIX
client, use vers=2 in your mount options.
-
E4. Why do I get NFS timeouts when I mount a Linux NFS server
from my Solaris NFS client?
-
A. You get NFS timeouts because you are using soft mounts.
Normally, mounts are hard, which means the client continues
attempting to reach the server indefinitely.
A soft mount allows the client to give up on an operation after
a period of time.
A soft timeout may cause silent data corruption if it occurs during data
or metadata transmissions, so you should only use soft mounts in the cases
where client responsiveness is more important than data integrity.
If you require soft mounts over an unreliable link such as DSL, try
using TCP, which is what Solaris uses by default. This will help manage the
impact of brief network interruptions. If using TCP is not possible, then you
should reduce the risk of using soft mounts with UDP by specifying a long
retransmission timeout and a relatively large number of retries in
the mount command options (e.g., timeo=30, retrans=10).
Note that NFS over UDP now uses a retransmit timeout estimation algorithm in
the latest 2.4 and 2.6 kernels, which means the timeo= mount option is less
effective at preventing data corruption due to a soft timeout.
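As an example, the two approaches described above might be combined like
this; the server name and mount point are hypothetical, and you should
consult your client's mount man page for the exact option syntax:
mount -o soft,proto=tcp server.example.com:/export /mnt/export
mount -o soft,proto=udp,timeo=30,retrans=10 server.example.com:/export /mnt/export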
|
|
FAQ maintained by Chuck Lever
cel at citi dot umich dot edu
|
|
|
The Linux
NFS project is hosted by the good people at VA Linux Systems and their
sourceforge.net project.
|
|