Inotify

Inotify implements the like-named filesystem event notification system for the sentry; see inotify(7).

Architecture

For the most part, the sentry implementation of inotify mirrors the Linux architecture. Inotify instances (i.e. the fd returned by inotify_init(2)) are backed by a pseudo-filesystem. Events are generated from various places in the sentry, including the syscall layer, the vfs layer and the process fd table. Watches are stored in inodes and generated events are queued to the inotify instance owning the watches for delivery to the user.

Objects

Here is a brief description of the existing and new objects involved in the sentry inotify mechanism, and how they interact:

fs.Inotify

  • An inotify instance, created by inotify_init(2)/inotify_init1(2).
  • The inotify fd has an fs.Dirent and supports filesystem syscalls to read events.
  • Has multiple fs.Watches, with at most one watch per target inode, per inotify instance.
  • Has an instance id which is globally unique. This is not the fd number for this instance, since the fd can be duped. This id is not externally visible.

fs.Watch

  • An inotify watch, created/deleted by inotify_add_watch(2)/inotify_rm_watch(2).
  • Owned by an fs.Inotify instance; each watch keeps a pointer to its owner.
  • Associated with a single fs.Inode, which is the watch target. While the watch is active, it indirectly pins the target in memory. See the "Reference Model" section for a detailed explanation.
  • Filesystem operations on target generate fs.Events.

fs.Event

  • A simple struct encapsulating all the fields for an inotify event.
  • Generated by fs.Watches and forwarded to the watches' owners.
  • Serialized to the user during read(2) syscalls on the associated fs.Inotify's fd.
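The serialization step can be sketched as follows. This is a minimal illustration, not the sentry's code: the `event` type and the `serialize` and `paddedNameLen` functions are hypothetical names, the field layout follows struct inotify_event as described in inotify(7), and both the little-endian byte order and the rounding of the name to a multiple of 16 bytes are assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// event is a hypothetical stand-in for fs.Event. The fields mirror
// struct inotify_event from inotify(7).
type event struct {
	wd     int32  // watch descriptor
	mask   uint32 // event bits
	cookie uint32 // ties related events together (e.g. rename pairs)
	name   string // optional name of the subject node
}

// headerSize covers wd, mask, cookie and len, 4 bytes each.
const headerSize = 16

// paddedNameLen returns the number of bytes reserved for the name,
// including at least one terminating NUL. Rounding up to a multiple of
// headerSize is an assumption made for this sketch.
func paddedNameLen(name string) int {
	if name == "" {
		return 0
	}
	n := len(name) + 1
	return (n + headerSize - 1) / headerSize * headerSize
}

// serialize encodes e as a fixed header followed by the NUL-padded name.
// Little-endian byte order is assumed.
func serialize(e event) []byte {
	nameLen := paddedNameLen(e.name)
	buf := make([]byte, headerSize+nameLen)
	binary.LittleEndian.PutUint32(buf[0:], uint32(e.wd))
	binary.LittleEndian.PutUint32(buf[4:], e.mask)
	binary.LittleEndian.PutUint32(buf[8:], e.cookie)
	binary.LittleEndian.PutUint32(buf[12:], uint32(nameLen))
	copy(buf[headerSize:], e.name) // the remaining bytes stay zero
	return buf
}

func main() {
	const inCreate = 0x100 // IN_CREATE, per inotify(7)
	b := serialize(event{wd: 1, mask: inCreate, name: "a.txt"})
	fmt.Printf("%d bytes: % x\n", len(b), b)
}
```

A reader of the fd would walk the returned buffer by reading a 16-byte header and then skipping `len` bytes to reach the next event.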

fs.Dirent

  • Many inotify events are generated inside dirent methods. Events are generated in the dirent methods rather than fs.Inode methods because some events carry the name of the subject node, and node names are generally unavailable in an fs.Inode.
  • Dirents do not directly contain state for any watches. Instead, they forward notifications to the underlying fs.Inode.

fs.Inode

  • Interacts with inotify through fs.Watches.
  • Inodes contain a map of all active fs.Watches on them.
  • An fs.Inotify instance can have at most one fs.Watch per inode. fs.Watches on an inode are indexed by their owner's id.
  • All inotify logic is encapsulated in the Watches struct in an inode. Logically, Watches is the set of inotify watches on the inode.
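The relationship between an inode's watch set and the owning instances can be sketched as below. The types `inotify`, `watch` and `watches` are hypothetical stand-ins for fs.Inotify, fs.Watch and fs.Watches, and the method names are illustrative. Replacing the mask of an existing watch follows the behavior inotify_add_watch(2) documents when IN_MASK_ADD is not given; whether the sentry structures its code this way is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// inotify is a hypothetical stand-in for fs.Inotify.
type inotify struct {
	id uint64 // globally unique instance id, distinct from any fd number
}

// watch is a hypothetical stand-in for fs.Watch.
type watch struct {
	owner *inotify
	mask  uint32
}

// watches is a hypothetical stand-in for the per-inode fs.Watches set.
type watches struct {
	mu sync.RWMutex
	ws map[uint64]*watch // keyed by the owning instance's id
}

// add returns the watch owned by owner on this inode. If one already
// exists its mask is replaced, so each instance has at most one watch.
func (w *watches) add(owner *inotify, mask uint32) *watch {
	w.mu.Lock()
	defer w.mu.Unlock()
	if existing, ok := w.ws[owner.id]; ok {
		existing.mask = mask
		return existing
	}
	if w.ws == nil {
		w.ws = make(map[uint64]*watch)
	}
	nw := &watch{owner: owner, mask: mask}
	w.ws[owner.id] = nw
	return nw
}

// remove deletes the watch owned by the instance with the given id, if any.
func (w *watches) remove(id uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.ws, id)
}

// matching reports how many watches on this inode are interested in ev.
func (w *watches) matching(ev uint32) int {
	w.mu.RLock()
	defer w.mu.RUnlock()
	n := 0
	for _, wt := range w.ws {
		if wt.mask&ev != 0 {
			n++
		}
	}
	return n
}

func main() {
	a, b := &inotify{id: 1}, &inotify{id: 2}
	var ws watches
	ws.add(a, 0x2) // IN_MODIFY, per inotify(7)
	ws.add(b, 0x2)
	fmt.Println("watches interested in IN_MODIFY:", ws.matching(0x2))
}
```

Keying the map by owner id makes the "at most one watch per instance" rule a property of the data structure rather than something each caller has to check.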

Reference Model

The sentry inotify implementation has a complex reference model. An inotify watch observes a single inode. For efficient lookup, the state for a watch is stored directly on the target inode. This state needs to persist for the lifetime of the watch. Unlike usual filesystem metadata, the watch state has no "on-disk" representation, so it cannot be reconstructed by the filesystem if the inode is flushed from memory. This effectively means we need to keep any inodes with active watches pinned in memory.

We can't just hold an extra ref on the inode to pin it to memory because some filesystems (such as gofer-based filesystems) don't have persistent inodes. In such a filesystem, if we just pin the inode, nothing prevents the enclosing dirent from being GCed. Once the dirent is GCed, the pinned inode is unreachable -- these filesystems generate a new inode by re-reading the node state on the next walk. Incidentally, hardlinks also don't work on these filesystems for this reason.

To prevent the above scenario, when a new watch is added on an inode, we pin the dirent we used to reach the inode. Note that due to hardlinks, this dirent may not be the only dirent pointing to the inode. Attempting to set an inotify watch via multiple hardlinks to the same file results in the same watch being returned for both links. However, for each new dirent we use to reach the same inode, we add a new pin. We need a new pin for each new dirent used to reach the inode because we have no guarantees about the deletion order of the different links to the inode.
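The per-dirent pinning described above can be sketched as follows. The `dirent` and `watch` types, the explicit `refs` counter, and the `pin`, `unpin` and `destroy` methods are all hypothetical; in particular, dropping a pin when one link goes away is an assumption about how pins are released, not something this document specifies.

```go
package main

import (
	"fmt"
	"sync"
)

// dirent is a hypothetical stand-in for fs.Dirent with a bare
// reference count; a real dirent would use atomic reference counting.
type dirent struct {
	name string
	refs int
}

// watch is a hypothetical stand-in for fs.Watch, reduced to its pins.
type watch struct {
	mu   sync.Mutex
	pins map[*dirent]bool // dirents this watch holds a reference on
}

// pin takes one reference on d the first time the watch target is
// reached through d. Reaching it again through d adds nothing.
func (w *watch) pin(d *dirent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pins[d] {
		return
	}
	if w.pins == nil {
		w.pins = make(map[*dirent]bool)
	}
	w.pins[d] = true
	d.refs++
}

// unpin drops the reference held on d, if the watch holds one.
func (w *watch) unpin(d *dirent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.pins[d] {
		return
	}
	delete(w.pins, d)
	d.refs--
}

// destroy releases every remaining pin, as when the watch is removed.
func (w *watch) destroy() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for d := range w.pins {
		d.refs--
	}
	w.pins = nil
}

func main() {
	// Two hard links to the same inode, each with its own dirent.
	link1, link2 := &dirent{name: "a"}, &dirent{name: "b"}
	w := &watch{}
	w.pin(link1)
	w.pin(link2)
	w.pin(link1) // same dirent again: no extra pin
	fmt.Println(link1.refs, link2.refs)
}
```

Because every dirent used to reach the inode carries its own pin, the watch state stays reachable regardless of which link is deleted first.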

Lock Ordering

There are 4 locks related to the inotify implementation:

  • Inotify.mu: the inotify instance lock.
  • Inotify.evMu: the inotify event queue lock.
  • Watch.mu: the watch lock, used to protect pins.
  • fs.Watches.mu: the inode watch set lock, used to protect the collection of watches on the inode.

The correct lock ordering for inotify code is:

Inotify.mu -> fs.Watches.mu -> Watch.mu -> Inotify.evMu.

We need a distinct lock for the event queue because by the time a goroutine attempts to queue a new event, it is already holding fs.Watches.mu. If we used Inotify.mu to also protect the event queue, the goroutine would have to acquire Inotify.mu while holding fs.Watches.mu, which would violate the above lock ordering.
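The two code paths that motivate this ordering can be sketched as below. All type and method names are hypothetical. As a simplification, Watch.mu guards the watch mask in this sketch so that the full chain is visible; the list above only says it protects pins.

```go
package main

import (
	"fmt"
	"sync"
)

// inotify is a hypothetical stand-in for fs.Inotify.
type inotify struct {
	mu     sync.Mutex // instance lock: first in the ordering
	evMu   sync.Mutex // event queue lock: last in the ordering
	events []uint32
}

// watch is a hypothetical stand-in for fs.Watch.
type watch struct {
	mu    sync.Mutex // guards mask in this sketch (an assumption)
	owner *inotify
	mask  uint32
}

// inodeWatches is a hypothetical stand-in for fs.Watches.
type inodeWatches struct {
	mu sync.Mutex
	ws map[*inotify]*watch
}

// queueEvent runs with the inode watch set lock already held, so it
// may only take evMu, never the instance lock.
func (i *inotify) queueEvent(ev uint32) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// addWatch takes Inotify.mu, then fs.Watches.mu, then Watch.mu.
func (i *inotify) addWatch(target *inodeWatches, mask uint32) *watch {
	i.mu.Lock()
	defer i.mu.Unlock()
	target.mu.Lock()
	defer target.mu.Unlock()
	if w, ok := target.ws[i]; ok {
		w.mu.Lock()
		w.mask = mask
		w.mu.Unlock()
		return w
	}
	if target.ws == nil {
		target.ws = make(map[*inotify]*watch)
	}
	w := &watch{owner: i, mask: mask}
	target.ws[i] = w
	return w
}

// notify takes fs.Watches.mu, then Watch.mu, then Inotify.evMu.
func (t *inodeWatches) notify(ev uint32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, w := range t.ws {
		w.mu.Lock()
		if w.mask&ev != 0 {
			w.owner.queueEvent(ev)
		}
		w.mu.Unlock()
	}
}

func main() {
	in := &inotify{}
	var target inodeWatches
	in.addWatch(&target, 0x2) // IN_MODIFY, per inotify(7)
	target.notify(0x2)
	fmt.Println("queued events:", len(in.events))
}
```

If `queueEvent` locked `mu` instead of `evMu`, a goroutine in `notify` holding the watch set lock could deadlock against a goroutine in `addWatch` that holds `mu` and is waiting for the same watch set lock.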