# Inotify

Inotify implements the like-named filesystem event notification system for the
sentry, see `inotify(7)`.

## Architecture

For the most part, the sentry implementation of inotify mirrors the Linux
architecture. Inotify instances (i.e. the fd returned by inotify_init(2)) are
backed by a pseudo-filesystem. Events are generated from various places in the
sentry, including the [syscall layer][syscall_dir], the [vfs layer][dirent] and
the [process fd table][fd_table]. Watches are stored in inodes and generated
events are queued to the inotify instance owning the watches for delivery to the
user.

## Objects

Here is a brief description of the existing and new objects involved in the
sentry inotify mechanism, and how they interact:

### [`fs.Inotify`][inotify]

- An inotify instance, created by inotify_init(2)/inotify_init1(2).
- The inotify fd has an `fs.Dirent` and supports filesystem syscalls to read
  events.
- Has multiple `fs.Watch`es, with at most one watch per target inode, per
  inotify instance.
- Has an instance `id` which is globally unique. This is *not* the fd number
  for this instance, since the fd can be duped. This `id` is not externally
  visible.

### [`fs.Watch`][watch]

- An inotify watch, created/deleted by
  inotify_add_watch(2)/inotify_rm_watch(2).
- Owned by an `fs.Inotify` instance; each watch keeps a pointer to its
  `owner`.
- Associated with a single `fs.Inode`, which is the watch `target`. While the
  watch is active, it indirectly pins `target` in memory. See the "Reference
  Model" section for a detailed explanation.
- Filesystem operations on `target` generate `fs.Event`s.

### [`fs.Event`][event]

- A simple struct encapsulating all the fields for an inotify event.
- Generated by `fs.Watch`es and forwarded to the watches' `owner`s.
- Serialized to the user during read(2) syscalls on the associated
  `fs.Inotify`'s fd.

### [`fs.Dirent`][dirent]

- Many inotify events are generated inside dirent methods. Events are
  generated in the dirent methods rather than `fs.Inode` methods because some
  events carry the name of the subject node, and node names are generally
  unavailable in an `fs.Inode`.
- Dirents do not directly contain state for any watches. Instead, they forward
  notifications to the underlying `fs.Inode`.

### [`fs.Inode`][inode]

- Interacts with inotify through `fs.Watch`es.
- Inodes contain a map of all active `fs.Watch`es on them.
- An `fs.Inotify` instance can have at most one `fs.Watch` per inode.
  `fs.Watch`es on an inode are indexed by their `owner`'s `id`.
- All inotify logic is encapsulated in the [`Watches`][inode_watches] struct
  in an inode. Logically, `Watches` is the set of inotify watches on the
  inode.
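
The relationship between these objects can be sketched as a per-inode map keyed
by the owning instance's `id`, which is what enforces "at most one watch per
inode, per instance". This is a minimal sketch under assumptions: the types are
simplified stand-ins, and the `Add` and `Len` methods are hypothetical names
chosen for the example. Replacing the mask on a repeated add follows the
default behavior of inotify_add_watch(2) without `IN_MASK_ADD`.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a hypothetical, stripped-down inotify instance.
type Inotify struct {
	id uint64 // globally unique; unrelated to any fd number
}

// Watch ties one inotify instance (owner) to one target inode.
type Watch struct {
	owner *Inotify
	mask  uint32
}

// Watches is the set of watches on a single inode, keyed by owner id.
type Watches struct {
	mu sync.RWMutex
	ws map[uint64]*Watch
}

// Add returns the watch owned by owner on this inode, creating it if
// needed. If a watch already exists, its mask is replaced and the same
// watch is returned.
func (w *Watches) Add(owner *Inotify, mask uint32) *Watch {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.ws == nil {
		w.ws = make(map[uint64]*Watch)
	}
	if existing, ok := w.ws[owner.id]; ok {
		existing.mask = mask
		return existing
	}
	nw := &Watch{owner: owner, mask: mask}
	w.ws[owner.id] = nw
	return nw
}

// Len reports how many instances currently watch this inode.
func (w *Watches) Len() int {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return len(w.ws)
}

func main() {
	var ws Watches
	in := &Inotify{id: 1}
	first := ws.Add(in, 0x2)    // 0x2 is IN_MODIFY
	second := ws.Add(in, 0x100) // 0x100 is IN_CREATE; same owner, same watch
	fmt.Println(first == second, ws.Len())
}
```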

## Reference Model

The sentry inotify implementation has a complex reference model. An inotify
watch observes a single inode. For efficient lookup, the state for a watch is
stored directly on the target inode. This state needs to be persistent for the
lifetime of the watch. Unlike usual filesystem metadata, the watch state has no
"on-disk" representation, so it cannot be reconstructed by the filesystem if
the inode is flushed from memory. This effectively means we need to keep any
inodes with active watches pinned in memory.

We can't just hold an extra ref on the inode to pin it in memory because some
filesystems (such as gofer-based filesystems) don't have persistent inodes. In
such a filesystem, if we just pin the inode, nothing prevents the enclosing
dirent from being GCed. Once the dirent is GCed, the pinned inode is
unreachable -- these filesystems generate a new inode by re-reading the node
state on the next walk. Incidentally, hardlinks also don't work on these
filesystems for this reason.

To prevent the above scenario, when a new watch is added on an inode, we *pin*
the dirent we used to reach the inode. Note that due to hardlinks, this dirent
may not be the only dirent pointing to the inode. Attempting to set an inotify
watch via multiple hardlinks to the same file results in the same watch being
returned for both links. However, for each new dirent we use to reach the same
inode, we add a new pin. We need a new pin for each new dirent used to reach the
inode because we have no guarantees about the deletion order of the different
links to the inode.
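
The per-dirent pinning described above can be sketched as a set of dirents held
by the watch, with one extra reference taken per distinct dirent. This is only
an illustration: the `Dirent` type with its plain `refs` counter and the `Pin`
and `Unpin` method names are assumptions for the example and do not reflect the
sentry's real reference-counting code.

```go
package main

import (
	"fmt"
	"sync"
)

// Dirent is a hypothetical reference-counted directory entry.
type Dirent struct {
	name string
	refs int
}

func (d *Dirent) IncRef() { d.refs++ }
func (d *Dirent) DecRef() { d.refs-- }

// Watch holds one pin (one extra reference) for each distinct dirent
// that was used to reach the target inode.
type Watch struct {
	mu   sync.Mutex
	pins map[*Dirent]bool
}

// Pin takes a reference on d unless this watch has already pinned it.
func (w *Watch) Pin(d *Dirent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pins == nil {
		w.pins = make(map[*Dirent]bool)
	}
	if !w.pins[d] {
		w.pins[d] = true
		d.IncRef()
	}
}

// Unpin drops the reference taken by Pin, for example when the link
// behind d is deleted. Unpinning a dirent that was never pinned is a no-op.
func (w *Watch) Unpin(d *Dirent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pins[d] {
		delete(w.pins, d)
		d.DecRef()
	}
}

func main() {
	// Two hardlinks to the same file, both used to add the same watch.
	linkA, linkB := &Dirent{name: "a"}, &Dirent{name: "b"}
	var w Watch
	w.Pin(linkA)
	w.Pin(linkB)
	w.Pin(linkA) // repeated use of the same dirent adds no new pin
	fmt.Println(linkA.refs, linkB.refs)
}
```

Tracking pins per dirent means either link can be removed first without
dropping the reference that keeps the other link, and therefore the inode,
reachable.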

## Lock Ordering

There are four locks related to the inotify implementation:

- `Inotify.mu`: the inotify instance lock.
- `Inotify.evMu`: the inotify event queue lock.
- `Watch.mu`: the watch lock, used to protect pins.
- `fs.Watches.mu`: the inode watch set lock, used to protect the collection of
  watches on the inode.

The correct lock ordering for inotify code is:

`Inotify.mu` -> `fs.Watches.mu` -> `Watch.mu` -> `Inotify.evMu`.

We need a distinct lock for the event queue because by the time a goroutine
attempts to queue a new event, it is already holding `fs.Watches.mu`. If we used
`Inotify.mu` to also protect the event queue, this would violate the above lock
ordering.
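
To make the reasoning concrete, here is a minimal sketch of two code paths that
respect this ordering. The types are simplified stand-ins, and the `AddWatch`,
`Notify` and `queueEvent` names are hypothetical. The add path takes
`Inotify.mu` and then `Watches.mu`, while the notify path takes `Watches.mu`
and then `Inotify.evMu`. If `queueEvent` took `Inotify.mu` instead, the notify
path would acquire the two locks in the opposite order from the add path, and
two goroutines could deadlock.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a simplified instance with separate instance and queue locks.
type Inotify struct {
	mu     sync.Mutex // instance lock; first in the ordering
	evMu   sync.Mutex // event queue lock; last in the ordering
	events []string
}

// Watch is a simplified watch that only remembers its owner.
type Watch struct {
	owner *Inotify
}

// Watches is the watch set of one inode.
type Watches struct {
	mu sync.RWMutex
	ws []*Watch
}

// queueEvent only takes evMu, so it is safe to call while Watches.mu is held.
func (i *Inotify) queueEvent(ev string) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// AddWatch locks in the order Inotify.mu -> Watches.mu.
func (i *Inotify) AddWatch(target *Watches) *Watch {
	i.mu.Lock()
	defer i.mu.Unlock()
	target.mu.Lock()
	defer target.mu.Unlock()
	w := &Watch{owner: i}
	target.ws = append(target.ws, w)
	return w
}

// Notify locks in the order Watches.mu -> Inotify.evMu. It never touches
// Inotify.mu, so it cannot invert the order used by AddWatch.
func (ws *Watches) Notify(ev string) {
	ws.mu.RLock()
	defer ws.mu.RUnlock()
	for _, w := range ws.ws {
		w.owner.queueEvent(ev)
	}
}

func main() {
	in := &Inotify{}
	var target Watches
	in.AddWatch(&target)
	target.Notify("IN_MODIFY")
	fmt.Println(in.events)
}
```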

[dirent]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/fs/dirent.go
[event]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/fs/inotify_event.go
[fd_table]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/kernel/fd_table.go
[inode]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/fs/inode.go
[inode_watches]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/fs/inode_inotify.go
[inotify]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/fs/inotify.go
[syscall_dir]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/syscalls/linux/
[watch]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/fs/inotify_watch.go