gvisor/pkg/sentry/vfs
Jamie Liu 163ab5e9ba Sentry virtual filesystem, v2
Major differences from the current ("v1") sentry VFS:

- Path resolution is Filesystem-driven (FilesystemImpl methods call
vfs.ResolvingPath methods) rather than VFS-driven (fs package owns a
Dirent tree and calls fs.InodeOperations methods to populate it). This
drastically improves performance, primarily by reducing overhead from
inefficient synchronization and indirection. It also makes it possible
to implement remote filesystem protocols that translate FS system calls
into single RPCs, rather than having to make (at least) one RPC per path
component, significantly reducing the latency of remote filesystems
(especially during cold starts and for uncacheable shared filesystems).

- Mounts are correctly represented as a separate check based on
contextual state (current mount) rather than direct replacement in a
fs.Dirent tree. This makes it possible to support (non-recursive) bind
mounts and mount namespaces.
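
As a rough illustration of the second point, the following minimal Go sketch keeps mounts in a table keyed by (current mount, dentry) instead of replacing nodes in the tree. It is not the vfs package's actual data structure; `dentry`, `mount`, `mountKey`, `mountTable`, and `cross` are hypothetical names chosen for this example.

```go
package main

import "fmt"

// dentry is a hypothetical stand-in for a filesystem tree node.
type dentry struct {
	name string
}

// mount is a hypothetical stand-in for a mounted filesystem instance.
type mount struct {
	id   int
	root *dentry
}

// mountKey identifies a mount point by the mount it is reached through and
// the dentry it covers. The same dentry can therefore be a mount point in
// one context and an ordinary directory in another.
type mountKey struct {
	parent *mount
	point  *dentry
}

// mountTable maps mount points to the mounts attached there.
type mountTable map[mountKey]*mount

// cross returns the (mount, dentry) pair that a walk should continue from
// after arriving at d via mnt. If nothing is mounted at that position, it
// returns its arguments unchanged.
func (tbl mountTable) cross(mnt *mount, d *dentry) (*mount, *dentry) {
	for {
		child, ok := tbl[mountKey{parent: mnt, point: d}]
		if !ok {
			return mnt, d
		}
		mnt, d = child, child.root
	}
}

func main() {
	dir := &dentry{name: "dir"}
	rootMnt := &mount{id: 1, root: &dentry{name: "/"}}
	bindMnt := &mount{id: 2, root: rootMnt.root} // second view of the same tree
	overMnt := &mount{id: 3, root: &dentry{name: "other-root"}}

	// Attach overMnt on dir, but only as seen through rootMnt.
	tbl := mountTable{{parent: rootMnt, point: dir}: overMnt}

	m, d := tbl.cross(rootMnt, dir)
	fmt.Println(m.id, d.name) // crossed into overMnt
	m, d = tbl.cross(bindMnt, dir)
	fmt.Println(m.id, d.name) // same dentry, no mount in this context
}
```

Because the lookup depends on the current mount as well as the dentry, two views of the same tree can disagree about what is mounted where, which is the property that bind mounts and mount namespaces need.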

Included in this CL is fsimpl/memfs, an incomplete in-memory filesystem
that exists primarily to demonstrate intended filesystem implementation
patterns and for benchmarking:

BenchmarkVFS1TmpfsStat/1-6           3000000      497 ns/op
BenchmarkVFS1TmpfsStat/2-6           2000000      676 ns/op
BenchmarkVFS1TmpfsStat/3-6           2000000      904 ns/op
BenchmarkVFS1TmpfsStat/8-6           1000000     1944 ns/op
BenchmarkVFS1TmpfsStat/64-6           100000    14067 ns/op
BenchmarkVFS1TmpfsStat/100-6           50000    21700 ns/op
BenchmarkVFS2MemfsStat/1-6          10000000      197 ns/op
BenchmarkVFS2MemfsStat/2-6           5000000      233 ns/op
BenchmarkVFS2MemfsStat/3-6           5000000      268 ns/op
BenchmarkVFS2MemfsStat/8-6           3000000      477 ns/op
BenchmarkVFS2MemfsStat/64-6           500000     2592 ns/op
BenchmarkVFS2MemfsStat/100-6          300000     4045 ns/op
BenchmarkVFS1TmpfsMountStat/1-6      2000000      679 ns/op
BenchmarkVFS1TmpfsMountStat/2-6      2000000      912 ns/op
BenchmarkVFS1TmpfsMountStat/3-6      1000000     1113 ns/op
BenchmarkVFS1TmpfsMountStat/8-6      1000000     2118 ns/op
BenchmarkVFS1TmpfsMountStat/64-6      100000    14251 ns/op
BenchmarkVFS1TmpfsMountStat/100-6     100000    22397 ns/op
BenchmarkVFS2MemfsMountStat/1-6      5000000      317 ns/op
BenchmarkVFS2MemfsMountStat/2-6      5000000      361 ns/op
BenchmarkVFS2MemfsMountStat/3-6      5000000      387 ns/op
BenchmarkVFS2MemfsMountStat/8-6      3000000      582 ns/op
BenchmarkVFS2MemfsMountStat/64-6      500000     2699 ns/op
BenchmarkVFS2MemfsMountStat/100-6     300000     4133 ns/op

From this we can infer that, on this machine:

- Constant cost for tmpfs stat() is ~160ns in VFS2 and ~280ns in VFS1.

- Per-path-component cost is ~35ns in VFS2 and ~215ns in VFS1, a
difference of about 6x.

- The cost of crossing a mount boundary is about 80ns in VFS2
(MemfsMountStat/1 does approximately the same amount of work as
MemfsStat/2, except that it also crosses a mount boundary). This is an
inescapable cost of the separate mount lookup needed to support bind
mounts and mount namespaces.
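
The estimates above can be approximately reproduced from the benchmark rows. The sketch below fits a straight line through the 1-component and 100-component Stat measurements; `linearCost` is a hypothetical helper, and a two-point fit is only one of several reasonable ways to read the data, so the results land near the rounded figures quoted above rather than matching them exactly.

```go
package main

import "fmt"

// linearCost estimates a fixed cost and a per-component cost (both in ns)
// from two measurements taken at path lengths n1 and n2.
func linearCost(n1 int, ns1 float64, n2 int, ns2 float64) (fixed, perComponent float64) {
	perComponent = (ns2 - ns1) / float64(n2-n1)
	fixed = ns1 - perComponent*float64(n1)
	return fixed, perComponent
}

func main() {
	// Values copied from the Stat/1 and Stat/100 rows of the benchmark output.
	f1, p1 := linearCost(1, 497, 100, 21700)
	f2, p2 := linearCost(1, 197, 100, 4045)
	fmt.Printf("VFS1: fixed ~%.0fns, per component ~%.0fns\n", f1, p1)
	fmt.Printf("VFS2: fixed ~%.0fns, per component ~%.0fns\n", f2, p2)
	fmt.Printf("per-component ratio ~%.1fx\n", p1/p2)

	// Mount crossing in VFS2: MemfsMountStat/1 minus MemfsStat/2.
	fmt.Printf("VFS2 mount crossing ~%dns\n", 317-233)
}
```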

PiperOrigin-RevId: 258853946
2019-07-18 15:10:29 -07:00
BUILD Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
README.md Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
context.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
debug.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
dentry.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
file_description.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
file_description_impl_util.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
filesystem.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
filesystem_type.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
mount.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
mount_test.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
mount_unsafe.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
options.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
permissions.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
resolving_path.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
syscalls.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00
vfs.go Sentry virtual filesystem, v2 2019-07-18 15:10:29 -07:00

README.md

The gVisor Virtual Filesystem

THIS PACKAGE IS CURRENTLY EXPERIMENTAL AND NOT READY OR ENABLED FOR PRODUCTION USE. For the filesystem implementation currently used by gVisor, see the fs package.

Implementation Notes

Reference Counting

Filesystem, Dentry, Mount, MountNamespace, and FileDescription are all reference-counted. Mount and MountNamespace are exclusively VFS-managed; when their reference count reaches zero, VFS releases their resources. Filesystem and FileDescription management is shared between VFS and filesystem implementations; when their reference count reaches zero, VFS notifies the implementation by calling FilesystemImpl.Release() or FileDescriptionImpl.Release(), respectively, and then releases VFS-owned resources. Dentries are exclusively managed by filesystem implementations; reference count changes are abstracted through DentryImpl, which should release resources when the reference count reaches zero.

Filesystem references are held by:

  • Mount: Each referenced Mount holds a reference on the mounted Filesystem.

Dentry references are held by:

  • FileDescription: Each referenced FileDescription holds a reference on the Dentry through which it was opened, via FileDescription.vd.dentry.

  • Mount: Each referenced Mount holds a reference on its mount point and on the mounted filesystem root. The mount point is mutable (mount(MS_MOVE)).

Mount references are held by:

  • FileDescription: Each referenced FileDescription holds a reference on the Mount on which it was opened, via FileDescription.vd.mount.

  • Mount: Each referenced Mount holds a reference on its parent, which is the mount containing its mount point.

  • VirtualFilesystem: A reference is held on all Mounts that are attached (reachable by Mount traversal).

MountNamespace and FileDescription references are held by users of VFS. The expectation is that each kernel.Task holds a reference on its corresponding MountNamespace, and each file descriptor holds a reference on its represented FileDescription.

Notes:

  • Dentries do not hold a reference on their owning Filesystem. Instead, all uses of a Dentry occur in the context of a Mount, which holds a reference on the relevant Filesystem (see e.g. the VirtualDentry type). As a corollary, when releasing references on both a Dentry and its corresponding Mount, the Dentry's reference must be released first (because releasing the Mount's reference may release the last reference on the Filesystem, whose state may be required to release the Dentry reference).
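
The release ordering in the note above can be sketched as follows. This is a simplified, single-goroutine model rather than the vfs package's implementation: `counter`, `filesystem`, `mount`, `dentry`, `virtualDentry`, and `newVirtualDentry` are hypothetical types and helpers written for this example, and real code would use atomic operations.

```go
package main

import "fmt"

// counter is a hypothetical, single-goroutine reference count with a
// release hook that runs when the count reaches zero.
type counter struct {
	refs    int64
	release func()
}

func (c *counter) IncRef() { c.refs++ }

func (c *counter) DecRef() {
	c.refs--
	if c.refs == 0 {
		c.release()
	}
}

type filesystem struct {
	counter
	alive bool
}

type mount struct {
	counter
	fs *filesystem // counted: each mount holds a filesystem reference
}

type dentry struct {
	counter
	fs *filesystem // NOT counted: dentries rely on the mount's reference
}

// virtualDentry pairs a dentry with the mount it was reached through.
type virtualDentry struct {
	mnt *mount
	d   *dentry
}

// DecRef drops the dentry reference first, because dropping the mount
// reference may release the filesystem that the dentry still needs.
func (vd virtualDentry) DecRef() {
	vd.d.DecRef()
	vd.mnt.DecRef()
}

// newVirtualDentry builds one filesystem, mount, and dentry, each with a
// single reference, and appends release events to *events.
func newVirtualDentry(events *[]string) virtualDentry {
	fs := &filesystem{alive: true}
	fs.refs = 1 // held by the mount below
	fs.release = func() {
		fs.alive = false
		*events = append(*events, "filesystem released")
	}
	mnt := &mount{fs: fs}
	mnt.refs = 1
	mnt.release = func() {
		*events = append(*events, "mount released")
		fs.DecRef()
	}
	d := &dentry{fs: fs}
	d.refs = 1
	d.release = func() {
		*events = append(*events,
			fmt.Sprintf("dentry released (filesystem alive: %v)", d.fs.alive))
	}
	return virtualDentry{mnt: mnt, d: d}
}

func main() {
	var events []string
	newVirtualDentry(&events).DecRef()
	for _, e := range events {
		fmt.Println(e)
	}
}
```

Swapping the two calls in `virtualDentry.DecRef` would make the dentry's release hook run after the filesystem has already been torn down, which is the hazard the note describes.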

The Inheritance Pattern

Filesystem, Dentry, and FileDescription are all concepts featuring both state that must be shared between VFS and filesystem implementations, and operations that are implementation-defined. To facilitate this, each of these three concepts follows the same pattern, shown below for Dentry:

// Dentry represents a node in a filesystem tree.
type Dentry struct {
  // VFS-required dentry state.
  parent *Dentry
  // ...

  // impl is the DentryImpl associated with this Dentry. impl is immutable.
  // This should be the last field in Dentry.
  impl DentryImpl
}

// Init must be called before first use of d.
func (d *Dentry) Init(impl DentryImpl) {
  d.impl = impl
}

// Impl returns the DentryImpl associated with d.
func (d *Dentry) Impl() DentryImpl {
  return d.impl
}

// DentryImpl contains implementation-specific details of a Dentry.
// Implementations of DentryImpl should contain their associated Dentry by
// value as their first field.
type DentryImpl interface {
  // VFS-required implementation-defined dentry operations.
  IncRef()
  // ...
}

This construction, which is essentially a type-safe analogue to Linux's container_of pattern, has the following properties:

  • VFS works almost exclusively with pointers to Dentry rather than DentryImpl interface objects, such as in the type of Dentry.parent. This avoids interface method calls (which are somewhat expensive to perform, and defeat inlining and escape analysis), reduces the size of VFS types (since an interface object is two pointers in size), and allows pointers to be loaded and stored atomically using sync/atomic. Implementation-defined behavior is accessed via Dentry.impl when required.

  • Filesystem implementations can access the implementation-defined state associated with objects of VFS types by type-asserting or type-switching (e.g. Dentry.Impl().(*myDentry)). Type assertions to a concrete type require only an equality comparison of the interface object's type pointer to a static constant, and are consequently very fast.

  • Filesystem implementations can access the VFS state associated with objects of implementation-defined types directly.

  • VFS and implementation-defined state for a given type occupy the same object, minimizing memory allocations and maximizing memory locality. impl is the last field in Dentry, and Dentry is the first field in DentryImpl implementations, for similar reasons: this tends to cause fetching of the Dentry.impl interface object to also fetch DentryImpl fields, either because they are in the same cache line or via next-line prefetching.
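
To make the properties above concrete, here is a minimal, self-contained sketch of how a filesystem implementation might use the pattern. The `Dentry` and `DentryImpl` declarations are trimmed copies of the listing above; `myDentry`, its `vfsd` field name, `newMyDentry`, and `parentName` are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// Dentry is a trimmed copy of the VFS-side type from the listing above.
type Dentry struct {
	parent *Dentry
	impl   DentryImpl
}

func (d *Dentry) Init(impl DentryImpl) { d.impl = impl }
func (d *Dentry) Impl() DentryImpl     { return d.impl }

// DentryImpl is a trimmed copy of the implementation-side interface.
type DentryImpl interface {
	IncRef()
}

// myDentry is a hypothetical filesystem-specific dentry. It contains the
// VFS Dentry by value as its first field, so both share one allocation.
type myDentry struct {
	vfsd Dentry
	refs int64
	name string
}

// IncRef is not thread-safe; a real implementation would use sync/atomic.
func (d *myDentry) IncRef() { d.refs++ }

// newMyDentry allocates a myDentry and links the embedded Dentry back to it.
func newMyDentry(name string, parent *myDentry) *myDentry {
	d := &myDentry{refs: 1, name: name}
	d.vfsd.Init(d)
	if parent != nil {
		d.vfsd.parent = &parent.vfsd
	}
	return d
}

// parentName goes from a VFS *Dentry to implementation-defined state by
// type-asserting the result of Impl().
func parentName(vfsd *Dentry) string {
	if vfsd.parent == nil {
		return ""
	}
	return vfsd.parent.Impl().(*myDentry).name
}

func main() {
	root := newMyDentry("root", nil)
	child := newMyDentry("child", root)

	// VFS-style code holds *Dentry and reaches implementation state on demand.
	fmt.Println(parentName(&child.vfsd))

	// Implementation-defined behavior is invoked through the interface.
	child.vfsd.Impl().IncRef()
	fmt.Println(child.refs)
}
```

Here the sketch lives in a single package so that it can touch the unexported `parent` field directly; in a real filesystem implementation, access to VFS state would go through whatever exported methods the vfs package provides.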

Future Work

  • Most mount(2) features, and unmounting, are incomplete.

  • VFS1 filesystems are not directly compatible with VFS2. It may be possible to write shims that implement vfs.FilesystemImpl for fs.MountNamespace, vfs.DentryImpl for fs.Dirent, and vfs.FileDescriptionImpl for fs.File, which may be adequate for filesystems that are not performance-critical (e.g. sysfs); however, it is not clear that this will be less effort than simply porting the filesystems in question. Practically speaking, the following filesystems will probably need to be ported or made compatible through a shim to evaluate filesystem performance on realistic workloads:

    • devfs/procfs/sysfs, which will realistically be necessary to execute most applications. (Note that procfs and sysfs do not support hard links, so they do not require the complexity of separate inode objects. Also note that Linux's /dev is actually a variant of tmpfs called devtmpfs.)

    • tmpfs. This should be relatively straightforward: copy/paste memfs, store regular file contents in pgalloc-allocated memory instead of []byte, and add support for file timestamps. (In fact, it probably makes more sense to convert memfs to tmpfs and not keep the former.)

    • A remote filesystem, either lisafs (if it is ready by the time that other benchmarking prerequisites are) or v9fs (aka 9P, aka gofers).

    • epoll files.

    Filesystems that will need to be ported before switching to VFS2, but can probably be skipped for early testing:

    • overlayfs, which is needed for (at least) synthetic mount points.

    • Support for host ttys.

    • timerfd files.

    Filesystems that can probably be dropped:

    • ashmem, which is far too incomplete to use.

    • binder, which is similarly far too incomplete to use.

    • whitelistfs, which we are already actively attempting to remove.

  • Save/restore. For instance, it is unclear if the current implementation of the state package supports the inheritance pattern described above.

  • Many features that were previously implemented by VFS must now be implemented by individual filesystems (though, in most cases, this should consist of calls to hooks or libraries provided by vfs or other packages). This includes, but is not necessarily limited to:

    • Block and character device special files

    • Inotify

    • File locking

    • O_ASYNC

  • Reference counts in the vfs package do not use the refs package, since refs.AtomicRefCount adds 64 bytes of overhead to each 8-byte reference count, resulting in considerable cache bloat. 24 bytes of this overhead are for weak reference support, which has poor performance and will not be used by VFS2. The remaining 40 bytes store a descriptive string and stack trace for reference leak checking; we can support reference leak checking without incurring this space overhead by including the applicable information directly in finalizers for applicable types.
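
One possible shape for the finalizer-based approach in the last item is sketched below, assuming a bare 8-byte counter updated with sync/atomic and a leak message that lives in the finalizer closure rather than in each object. The `file` type and `newFile` are hypothetical; this is not the vfs package's code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// file is a hypothetical reference-counted type. Its count is a plain
// int64, so the only per-object cost of reference counting is 8 bytes.
type file struct {
	refs int64 // accessed atomically; kept first for 64-bit alignment
	name string
}

// newFile returns a file holding one reference. The leak check is attached
// as a finalizer, so no descriptive string or stack trace is stored inline.
func newFile(name string) *file {
	f := &file{refs: 1, name: name}
	runtime.SetFinalizer(f, func(f *file) {
		if n := atomic.LoadInt64(&f.refs); n != 0 {
			fmt.Printf("leak: file %q collected with %d references\n", f.name, n)
		}
	})
	return f
}

// IncRef takes an additional reference.
func (f *file) IncRef() { atomic.AddInt64(&f.refs, 1) }

// DecRef drops a reference and reports whether it was the last one.
func (f *file) DecRef() bool {
	n := atomic.AddInt64(&f.refs, -1)
	if n < 0 {
		panic("DecRef called on a file with no references")
	}
	return n == 0
}

func main() {
	f := newFile("example")
	f.IncRef()
	fmt.Println(f.DecRef()) // one reference still held
	fmt.Println(f.DecRef()) // last reference dropped
}
```

Go does not guarantee that finalizers run before the program exits, so a check like this is best treated as a debugging aid and not as a correctness mechanism.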