# Replacing 9P

## Background

The Linux filesystem model consists of the following key aspects (modulo mounts,
which are outside the scope of this discussion):

-   A `struct inode` represents a "filesystem object", such as a directory or a
    regular file. "Filesystem object" is most precisely defined by the practical
    properties of an inode, such as an immutable type (regular file, directory,
    symbolic link, etc.) and its independence from the path originally used to
    obtain it.

-   A `struct dentry` represents a node in a filesystem tree. Semantically, each
    dentry is immutably associated with an inode representing the filesystem
    object at that position. (Linux implements optimizations involving reuse of
    unreferenced dentries, which allows their associated inodes to change, but
    this is outside the scope of this discussion.)

-   A `struct file` represents an open file description (hereafter FD) and is
    needed to perform I/O. Each FD is immutably associated with the dentry
    through which it was opened.

The current gVisor virtual filesystem implementation (hereafter VFS1) closely
imitates the Linux design:

-   `struct inode` => `fs.Inode`

-   `struct dentry` => `fs.Dirent`

-   `struct file` => `fs.File`

gVisor accesses most external filesystems through a variant of the 9P2000.L
protocol, including extensions for performance (`walkgetattr`) and for features
not supported by vanilla 9P2000.L (`flushf`, `lconnect`). The 9P protocol family
is inode-based; each 9P fid represents a file (equivalently, a "filesystem
object"), and the protocol is structured around alternately obtaining fids to
represent files (with `walk` and, in gVisor, `walkgetattr`) and performing
operations on those fids.

In the sections below, a **shared** filesystem is a filesystem that is *mutably*
accessible by multiple concurrent clients, such that a **non-shared** filesystem
is a filesystem that is either read-only or accessible by only a single client.

## Problems

### Serialization of Path Component RPCs

Broadly speaking, VFS1 traverses each path component in a pathname, alternating
between verifying that each traversed dentry represents an inode that represents
a searchable directory and moving to the next dentry in the path.

In the context of a remote filesystem, the structure of this traversal means
that, modulo caching, a path involving N components requires at least N-1
*sequential* RPCs to obtain metadata for intermediate directories, incurring
significant latency. (In vanilla 9P2000.L, 2(N-1) RPCs are required: N-1 `walk`
and N-1 `getattr`. We added the `walkgetattr` RPC to reduce this overhead.) On
non-shared filesystems, this overhead is primarily significant during
application startup; caching mitigates much of this overhead at steady state. On
shared filesystems, where correct caching requires revalidation (requiring RPCs
for each revalidated directory anyway), this overhead is consistently ruinous.
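
The RPC counts above can be made concrete with a small calculation. This is an
illustrative sketch only: `sequentialRPCs` is a hypothetical helper, not part of
gVisor, and it ignores caching entirely.

```go
package main

import "fmt"

// sequentialRPCs returns the minimum number of sequential RPCs needed to
// obtain metadata for the intermediate directories of a path with n
// components, ignoring caching. Vanilla 9P2000.L needs one walk plus one
// getattr per intermediate directory; walkgetattr combines them into one RPC.
func sequentialRPCs(n int, walkgetattr bool) int {
	if n < 1 {
		return 0
	}
	intermediate := n - 1
	if walkgetattr {
		return intermediate
	}
	return 2 * intermediate
}

func main() {
	for _, n := range []int{1, 4, 8} {
		fmt.Printf("components=%d vanilla=%d walkgetattr=%d\n",
			n, sequentialRPCs(n, false), sequentialRPCs(n, true))
	}
}
```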

### Inefficient RPCs

9P is not exceptionally economical with RPCs in general. In addition to the
issue described above:

-   Opening an existing file in 9P involves at least 2 RPCs: `walk` to produce
    an unopened fid representing the file, and `lopen` to open the fid.

-   Creating a file also involves at least 2 RPCs: `walk` to produce an unopened
    fid representing the parent directory, and `lcreate` to create the file and
    convert the fid to an open fid representing the created file. In practice,
    both the Linux and gVisor 9P clients expect to have an unopened fid for the
    created file (necessitating an additional `walk`), as well as attributes for
    the created file (necessitating an additional `getattr`), for a total of 4
    RPCs. (In a shared filesystem, where whether a file already exists can
    change between RPCs, a correct implementation of `open(O_CREAT)` would have
    to alternate between these two paths (plus `clunk`ing the temporary fid
    between alternations, since the nature of the `fid` differs between the two
    paths). Neither Linux nor gVisor implements the required alternation, so
    `open(O_CREAT)` without `O_EXCL` can spuriously fail with `EEXIST` on both.)

-   Closing (`clunk`ing) a fid requires an RPC. VFS1 issues this RPC
    asynchronously in an attempt to reduce critical path latency, but scheduling
    overhead makes this not clearly advantageous in practice.

-   `read` and `readdir` can return partial reads without a way to indicate EOF,
    necessitating an additional final read to detect EOF.

-   Operations that affect filesystem state do not consistently return updated
    filesystem state. In gVisor, the client implementation attempts to handle
    this by tracking what it thinks updated state "should" be; this is complex,
    and especially brittle for timestamps (which are often not arbitrarily
    settable). In Linux, the client implementation invalidates cached metadata
    whenever it performs such an operation, and reloads it when a dentry
    corresponding to an inode with no valid cached metadata is revalidated; this
    is simple, but necessitates an additional `getattr`.
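
To illustrate the cost of the missing EOF indication, the following sketch
models a file served in fixed-size chunks and counts the read RPCs needed to
consume it, with and without an EOF flag in the response. The `fakeFile` type
and the EOF flag are assumptions made for illustration; they do not describe the
9P wire protocol or any gVisor API.

```go
package main

import "fmt"

// fakeFile models a remote file that serves at most chunk bytes per read RPC.
type fakeFile struct {
	size  int
	chunk int
	rpcs  int // number of read RPCs issued so far
}

// read models a read RPC at offset off. It returns the number of bytes read
// and whether this read reached the end of the file.
func (f *fakeFile) read(off int) (n int, eof bool) {
	f.rpcs++
	n = f.size - off
	if n > f.chunk {
		n = f.chunk
	}
	if n < 0 {
		n = 0
	}
	return n, off+n >= f.size
}

// readAll reads the whole file and returns the number of bytes read. Without
// an EOF flag, a short read is not conclusive (partial reads are allowed), so
// the client can only stop once it receives a zero-length read.
func readAll(f *fakeFile, useEOFFlag bool) int {
	off := 0
	for {
		n, eof := f.read(off)
		off += n
		if n == 0 || (useEOFFlag && eof) {
			return off
		}
	}
}

func main() {
	without := &fakeFile{size: 10, chunk: 4}
	with := &fakeFile{size: 10, chunk: 4}
	readAll(without, false)
	readAll(with, true)
	fmt.Printf("RPCs without EOF flag: %d, with EOF flag: %d\n", without.rpcs, with.rpcs)
}
```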

### Dentry/Inode Ambiguity

As noted above, 9P's documentation tends to imply that unopened fids represent
an inode. In practice, most filesystem APIs present very limited interfaces for
working with inodes at best, such that the interpretation of unopened fids
varies:

-   Linux's 9P client associates unopened fids with (dentry, uid) pairs. When
    caching is enabled, it also associates each inode with the first fid opened
    writably that references that inode, in order to support page cache
    writeback.

-   gVisor's 9P client associates unopened fids with inodes, and also caches
    opened fids in inodes in a manner similar to Linux.

-   The runsc fsgofer associates unopened fids with both "dentries" (host
    filesystem paths) and "inodes" (host file descriptors); which is used
    depends on the operation invoked on the fid.

For non-shared filesystems, this confusion has resulted in correctness issues
that are (in gVisor) currently handled by a number of coarse-grained locks that
serialize renames with all other filesystem operations. For shared filesystems,
this means inconsistent behavior in the presence of concurrent mutation.

## Design

Almost all Linux filesystem syscalls describe filesystem resources in one of two
ways:

-   Path-based: A filesystem position is described by a combination of a
    starting position and a sequence of path components relative to that
    position, where the starting position is one of:

    -   The VFS root (defined by mount namespace and chroot), for absolute paths

    -   The VFS position of an existing FD, for relative paths passed to `*at`
        syscalls (e.g. `fstatat`)

    -   The current working directory, for relative paths passed to non-`*at`
        syscalls and `*at` syscalls with `AT_FDCWD`

-   File-description-based: A filesystem object is described by an existing FD,
    passed to a `f*` syscall (e.g. `fstat`).

Many of our issues with 9P arise from its (and VFS1's) interposition of a model
based on inodes between the filesystem syscall API and filesystem
implementations. We propose to replace 9P with a protocol that does not feature
inodes at all, and instead closely follows the filesystem syscall API by
featuring only path-based and FD-based operations, with minimal deviations as
necessary to ameliorate deficiencies in the syscall interface (see below). This
approach addresses the issues described above:

-   Even on shared filesystems, most application filesystem syscalls are
    translated to a single RPC (possibly excepting special cases described
    below), which is a logical lower bound.

-   The behavior of application syscalls on shared filesystems is
    straightforwardly predictable: path-based syscalls are translated to
    path-based RPCs, which will re-lookup the file at that path, and FD-based
    syscalls are translated to FD-based RPCs, which use an existing open file
    without performing another lookup. (This is at least true on gofers that
    proxy the host local filesystem; other filesystems that lack support for
    e.g. certain operations on FDs may have different behavior, but this
    divergence is at least still predictable and inherent to the underlying
    filesystem implementation.)
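
The difference between the two kinds of RPC can be sketched with a toy
in-memory server. Everything below (`server`, `openPath`, `statPath`, `statFD`)
is hypothetical and exists only to show that a path-based operation performs a
fresh lookup while an FD-based operation keeps using the object it already has
open.

```go
package main

import (
	"errors"
	"fmt"
)

// object stands in for a filesystem object on the server.
type object struct{ id int }

// server is a toy in-memory gofer.
type server struct {
	tree   map[string]*object // current path -> object bindings
	open   map[int]*object    // open FDs -> objects
	nextFD int
}

var errNotFound = errors.New("no such file or FD")

func newServer() *server {
	return &server{tree: map[string]*object{}, open: map[int]*object{}}
}

// openPath models a path-based RPC: it looks up path and returns a new FD.
func (s *server) openPath(path string) (int, error) {
	obj, ok := s.tree[path]
	if !ok {
		return -1, errNotFound
	}
	fd := s.nextFD
	s.nextFD++
	s.open[fd] = obj
	return fd, nil
}

// statPath models a path-based RPC: it re-looks up path on every call.
func (s *server) statPath(path string) (int, error) {
	obj, ok := s.tree[path]
	if !ok {
		return 0, errNotFound
	}
	return obj.id, nil
}

// statFD models an FD-based RPC: it uses the open file without any lookup.
func (s *server) statFD(fd int) (int, error) {
	obj, ok := s.open[fd]
	if !ok {
		return 0, errNotFound
	}
	return obj.id, nil
}

func main() {
	s := newServer()
	s.tree["/a"] = &object{id: 1}
	fd, _ := s.openPath("/a")
	// Another client replaces the file at /a.
	s.tree["/a"] = &object{id: 2}
	byPath, _ := s.statPath("/a")
	byFD, _ := s.statFD(fd)
	fmt.Printf("stat by path sees object %d, stat by FD sees object %d\n", byPath, byFD)
}
```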

Note that this approach is only feasible in gVisor's next-generation virtual
filesystem (VFS2), which does not assume the existence of inodes and allows the
remote filesystem client to translate whole path-based syscalls into RPCs. Thus
one of the unavoidable tradeoffs associated with such a protocol vs. 9P is the
inability to construct a Linux client that is performance-competitive with
gVisor.

### File Permissions

Many filesystem operations are side-effectual, such that file permissions must
be checked before such operations take effect. The simplest approach to file
permission checking is for the sentry to obtain permissions from the remote
filesystem, then apply permission checks in the sentry before performing the
application-requested operation. However, this requires an additional RPC per
application syscall (which can't be mitigated by caching on shared filesystems).
Alternatively, we may delegate file permission checking to gofers. In general,
file permission checks depend on the following properties of the accessor:

-   Filesystem UID/GID

-   Supplementary GIDs

-   Effective capabilities in the accessor's user namespace (i.e. the accessor's
    effective capability set)

-   All UIDs and GIDs mapped in the accessor's user namespace (which determine
    if the accessor's capabilities apply to accessed files)

We may choose to delay implementation of file permission checking delegation,
although this is potentially costly since it doubles the number of required RPCs
for most operations on shared filesystems. We may also consider compromise
options, such as only delegating file permission checks for accessors in the
root user namespace.
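
As a rough sketch of what a gofer would need to receive in order to check
permissions itself, the struct below carries the four accessor properties listed
above, together with a deliberately simplified check. All type and function
names are hypothetical. The check only covers classic mode bits and
`CAP_DAC_OVERRIDE`; it ignores ACLs, other capabilities, and the special cases
Linux applies to execute permission.

```go
package main

import "fmt"

// capDACOverride is the Linux capability number of CAP_DAC_OVERRIDE.
const capDACOverride = 1

// idRange is a contiguous range of IDs mapped in a user namespace.
type idRange struct{ first, count uint32 }

func mapped(ranges []idRange, id uint32) bool {
	for _, r := range ranges {
		if id >= r.first && id-r.first < r.count {
			return true
		}
	}
	return false
}

// accessor carries the properties that permission checks depend on.
type accessor struct {
	fsuid, fsgid           uint32
	supplementaryGIDs      []uint32
	effectiveCaps          uint64    // bitmask indexed by capability number
	mappedUIDs, mappedGIDs []idRange // IDs mapped in the accessor's user namespace
}

// fileMeta is the subset of file metadata used by the check.
type fileMeta struct {
	mode     uint32 // low 9 bits: rwxrwxrwx
	uid, gid uint32
}

func (a accessor) inGroup(gid uint32) bool {
	if a.fsgid == gid {
		return true
	}
	for _, g := range a.supplementaryGIDs {
		if g == gid {
			return true
		}
	}
	return false
}

// allowed reports whether a may access f with the rights in want (r=4, w=2,
// x=1). Capabilities only apply if the file's owner and group are both mapped
// in the accessor's user namespace.
func (a accessor) allowed(f fileMeta, want uint32) bool {
	bits := f.mode // "other" class by default
	switch {
	case a.fsuid == f.uid:
		bits = f.mode >> 6
	case a.inGroup(f.gid):
		bits = f.mode >> 3
	}
	if bits&want == want {
		return true
	}
	hasCap := a.effectiveCaps&(1<<capDACOverride) != 0
	return hasCap && mapped(a.mappedUIDs, f.uid) && mapped(a.mappedGIDs, f.gid)
}

func main() {
	f := fileMeta{mode: 0o600, uid: 1000, gid: 1000}
	owner := accessor{fsuid: 1000, fsgid: 1000}
	other := accessor{fsuid: 2000, fsgid: 2000}
	fmt.Println(owner.allowed(f, 4|2), other.allowed(f, 4))
}
```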

### Symbolic Links

gVisor usually interprets symbolic link targets in its VFS rather than on the
filesystem containing the symbolic link; thus e.g. a symlink to
"/proc/self/maps" on a remote filesystem resolves to said file in the sentry's
procfs rather than the host's. This implies that:

-   Remote filesystem servers that proxy filesystems supporting symlinks must
    check if each path component is a symlink during path traversal.

-   Absolute symlinks require that the sentry restart the operation at its
    contextual VFS root (which is task-specific and may not be on a remote
    filesystem at all), so if a remote filesystem server encounters an absolute
    symlink during path traversal on behalf of a path-based operation, it must
    terminate path traversal and return the symlink target.

-   Relative symlinks begin target resolution in the parent directory of the
    symlink, so in theory most relative symlinks can be handled automatically
    during the path traversal that encounters the symlink, provided that said
    traversal is supplied with the number of remaining symlinks before `ELOOP`.
    However, the new path traversed by the symlink target may cross VFS mount
    boundaries, such that it's only safe for remote filesystem servers to
    speculatively follow relative symlinks for side-effect-free operations such
    as `stat` (where the sentry can simply ignore results that are inapplicable
    due to crossing mount boundaries). We may choose to delay implementation of
    this feature, at the cost of an additional RPC per relative symlink (note
    that even if the symlink target crosses a mount boundary, the sentry will
    need to `stat` the path to the mount boundary to confirm that each traversed
    component is an accessible directory); until it is implemented, relative
    symlinks may be handled like absolute symlinks, by terminating path
    traversal and returning the symlink target.
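
The conservative behavior described above (stop at any symlink and hand the
target back to the sentry) might look like the following sketch over a toy
in-memory tree. The `node`, `walkResult`, and `walk` names are hypothetical, and
for simplicity the sketch stops at a symlink even when it is the final path
component.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy filesystem object.
type node struct {
	children map[string]*node // non-nil for directories
	target   string           // non-empty for symlinks
}

// walkResult describes where traversal stopped.
type walkResult struct {
	node          *node    // final object, or the directory containing the symlink
	symlinkTarget string   // set if traversal stopped at a symlink
	remaining     []string // components after the symlink, still to be resolved
}

var (
	errNotFound = errors.New("no such file or directory")
	errNotDir   = errors.New("not a directory")
)

// walk traverses components starting at root. On encountering a symlink it
// terminates and returns the target so that the caller can restart resolution.
func walk(root *node, components []string) (walkResult, error) {
	cur := root
	for i, name := range components {
		if cur.children == nil {
			return walkResult{}, errNotDir
		}
		child, ok := cur.children[name]
		if !ok {
			return walkResult{}, errNotFound
		}
		if child.target != "" {
			return walkResult{node: cur, symlinkTarget: child.target, remaining: components[i+1:]}, nil
		}
		cur = child
	}
	return walkResult{node: cur}, nil
}

// exampleTree builds /a/f (regular file) and /a/link -> /proc/self/maps.
func exampleTree() *node {
	return &node{children: map[string]*node{
		"a": {children: map[string]*node{
			"f":    {},
			"link": {target: "/proc/self/maps"},
		}},
	}}
}

func main() {
	res, err := walk(exampleTree(), []string{"a", "link", "c"})
	fmt.Println(res.symlinkTarget, res.remaining, err)
}
```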

The possibility of symlinks (and the possibility of a compromised sentry) means
that the sentry may issue RPCs with paths that, in the absence of symlinks,
would traverse beyond the root of the remote filesystem. For example, the sentry
may issue an RPC with a path like "/foo/../..", on the premise that if "/foo" is
a symlink then the resulting path may be elsewhere on the remote filesystem. To
handle this, path traversal must also track its current depth below the remote
filesystem root, and terminate path traversal if it would ascend beyond this
point.
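
A minimal sketch of the depth tracking follows. It is purely lexical: it treats
every named component as one level of descent, whereas a real server would have
to update the depth as it follows symlinks. `checkDepth` and `ErrEscapesRoot`
are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrEscapesRoot is returned when a path would ascend above the filesystem root.
var ErrEscapesRoot = errors.New("path ascends beyond the remote filesystem root")

// checkDepth walks the components of path, tracking the current depth below
// the remote filesystem root. It returns the final depth, or ErrEscapesRoot
// as soon as a ".." component would ascend beyond the root.
func checkDepth(path string) (int, error) {
	depth := 0
	for _, c := range strings.Split(path, "/") {
		switch c {
		case "", ".":
			// Empty and "." components do not change the depth.
		case "..":
			if depth == 0 {
				return 0, ErrEscapesRoot
			}
			depth--
		default:
			depth++
		}
	}
	return depth, nil
}

func main() {
	for _, p := range []string{"/foo/bar/..", "/foo/../.."} {
		depth, err := checkDepth(p)
		fmt.Printf("%q: depth=%d err=%v\n", p, depth, err)
	}
}
```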

### Path Traversal

Since path-based VFS operations will translate to path-based RPCs, filesystem
servers will need to handle path traversal. From the perspective of a given
filesystem implementation in the server, there are two basic approaches to path
traversal:

-   Inode-walk: For each path component, obtain a handle to the underlying
    filesystem object (e.g. with `open(O_PATH)`), check if that object is a
    symlink (as described above) and that that object is accessible by the
    caller (e.g. with `fstat()`), then continue to the next path component (e.g.
    with `openat()`). This ensures that the checked filesystem object is the one
    used to obtain the next object in the traversal, which is intuitively
    appealing. However, while this approach works for host local filesystems, it
    requires features that are not widely supported by other filesystems.

-   Path-walk: For each path component, use a path-based operation to determine
    whether the filesystem object currently referred to by that path component
    is a symlink and whether it is accessible. This is highly portable, but
    suffers from quadratic behavior (at the level of the underlying filesystem
    implementation, the first path component will be traversed a number of times
    equal to the number of path components in the path).
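
The quadratic cost of path-walk can be seen by listing the prefixes that must be
checked: each prefix is resolved from the top by the underlying filesystem, so a
path with N components costs 1 + 2 + ... + N = N(N+1)/2 component lookups. The
helpers below are hypothetical and only illustrate this arithmetic.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixes returns the paths that a path-walk must check, one per component:
// for [a b c] it returns /a, /a/b, and /a/b/c.
func prefixes(components []string) []string {
	out := make([]string, 0, len(components))
	for i := range components {
		out = append(out, "/"+strings.Join(components[:i+1], "/"))
	}
	return out
}

// componentLookups returns the total number of component lookups the
// underlying filesystem performs when each of the n prefixes is resolved
// from the root.
func componentLookups(n int) int {
	return n * (n + 1) / 2
}

func main() {
	components := []string{"a", "b", "c", "d"}
	for _, p := range prefixes(components) {
		fmt.Println("check:", p)
	}
	fmt.Println("total component lookups:", componentLookups(len(components)))
}
```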

The implementation should support either option by delegating path traversal to
filesystem implementations within the server (like VFS and the remote filesystem
protocol itself), as inode-walking is still safe, efficient, amenable to FD
caching, and implementable on non-shared host local filesystems (a sufficiently
common case as to be worth considering in the design).

Both approaches are susceptible to race conditions that may permit sandboxed
filesystem escapes:

-   Under inode-walk, a malicious application may cause a directory to be moved
    (with `rename`) during path traversal, such that the filesystem
    implementation incorrectly determines whether subsequent inodes are located
    in paths that should be visible to sandboxed applications.

-   Under path-walk, a malicious application may cause a non-symlink file to be
    replaced with a symlink during path traversal, such that following path
    operations will incorrectly follow the symlink.

Both race conditions can, to some extent, be mitigated in filesystem server
implementations by synchronizing path traversal with the hazardous operations in
question. However, shared filesystems are frequently used to share data between
sandboxed and unsandboxed applications in a controlled way, and in some cases a
malicious sandboxed application may be able to take advantage of a hazardous
filesystem operation performed by an unsandboxed application. In some cases,
filesystem features may be available to ensure safety even in such cases (e.g.
[the new openat2() syscall](https://man7.org/linux/man-pages/man2/openat2.2.html)),
but it is not clear how to solve this problem in general. (Note that this issue
is not specific to our design; rather, it is a fundamental limitation of
filesystem sandboxing.)

### Filesystem Multiplexing

A given sentry may need to access multiple distinct remote filesystems (e.g.
different volumes for a given container). In many cases, there is no advantage
to serving these filesystems from distinct filesystem servers, or accessing them
through distinct connections (factors such as maximum RPC concurrency should be
based on available host resources). Therefore, the protocol should support
multiplexing of distinct filesystem trees within a single session. 9P supports
this by allowing multiple calls to the `attach` RPC to produce fids representing
distinct filesystem trees, but this is somewhat clunky; we propose a much
simpler mechanism wherein each message that conveys a path also conveys a
numeric filesystem ID that identifies a filesystem tree.
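
One way to picture the proposed mechanism is a request that carries a filesystem
ID alongside its path, with the server keeping a table from IDs to tree roots.
The types and field names below are assumptions made for illustration, not a
message layout defined by this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// fsid identifies one filesystem tree within a session.
type fsid uint32

// pathRequest is a hypothetical path-carrying message.
type pathRequest struct {
	fs   fsid   // which filesystem tree the path is relative to
	path string // path within that tree
}

// server multiplexes several filesystem trees over one session.
type server struct {
	roots map[fsid]string // filesystem ID -> host directory backing the tree
}

var errUnknownFS = errors.New("unknown filesystem ID")

// rootFor returns the host directory backing the tree named in req.
func (s *server) rootFor(req pathRequest) (string, error) {
	root, ok := s.roots[req.fs]
	if !ok {
		return "", errUnknownFS
	}
	return root, nil
}

func main() {
	s := &server{roots: map[fsid]string{
		0: "/srv/rootfs",
		1: "/srv/volumes/data",
	}}
	for _, req := range []pathRequest{{fs: 1, path: "logs/app.log"}, {fs: 7, path: "x"}} {
		root, err := s.rootFor(req)
		fmt.Printf("fs=%d path=%q root=%q err=%v\n", req.fs, req.path, root, err)
	}
}
```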

## Alternatives Considered

### Additional Extensions to 9P

There are at least three conceptual aspects to 9P:

-   Wire format: messages with a 4-byte little-endian size prefix, strings with
    a 2-byte little-endian size prefix, etc. Whether the wire format is worth
    retaining is unclear; in particular, it's unclear that the 9P wire format
    has a significant advantage over protobufs, which are substantially easier
    to extend. Note that the official Go protobuf implementation is widely known
    to suffer from a significant number of performance deficiencies, so if we
    choose to switch to protobuf, we may need to use an alternative toolchain
    such as `gogo/protobuf` (which is also widely used in the Go ecosystem, e.g.
    by Kubernetes).

-   Filesystem model: fids, qids, etc. Discarding this is one of the motivations
    for this proposal.

-   RPCs: Twalk, Tlopen, etc. In addition to previously-described
    inefficiencies, most of these are dependent on the filesystem model and
    therefore must be discarded.
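
For reference, the two wire-format conventions mentioned above can be sketched
as follows. The helper names are hypothetical; the sketch only shows the length
prefixes (in 9P the 4-byte message size counts the size field itself) and omits
message types, tags, and all other fields.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

var errStringTooLong = errors.New("string does not fit a 2-byte length prefix")

// appendString appends s to buf as a 2-byte little-endian length followed by
// the string bytes.
func appendString(buf []byte, s string) ([]byte, error) {
	if len(s) > math.MaxUint16 {
		return nil, errStringTooLong
	}
	var n [2]byte
	binary.LittleEndian.PutUint16(n[:], uint16(len(s)))
	buf = append(buf, n[:]...)
	return append(buf, s...), nil
}

// frame prefixes body with a 4-byte little-endian size. The size includes the
// four bytes of the prefix itself.
func frame(body []byte) []byte {
	msg := make([]byte, 4, 4+len(body))
	binary.LittleEndian.PutUint32(msg, uint32(4+len(body)))
	return append(msg, body...)
}

func main() {
	body, err := appendString(nil, "9P2000.L")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", frame(body))
}
```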

### FUSE

The FUSE (Filesystem in Userspace) protocol is frequently used to provide
arbitrary userspace filesystem implementations to a host Linux kernel.
Unfortunately, FUSE is also inode-based, and therefore doesn't address any of
the problems we have with 9P.

### virtio-fs

virtio-fs is an ongoing project aimed at improving Linux VM filesystem
performance when accessing Linux host filesystems (vs. virtio-9p). In brief, it
is based on:

-   Using a FUSE client in the guest that communicates over virtio with a FUSE
    server in the host.

-   Using DAX to map the host page cache into the guest.

-   Using a file metadata table in shared memory to avoid VM exits for metadata
    updates.

None of these improvements seem applicable to gVisor:

-   As explained above, FUSE is still inode-based, so it is still susceptible to
    most of the problems we have with 9P.

-   Our use of host file descriptors already allows us to leverage the host page
    cache for file contents.

-   Our need for shared filesystem coherence is usually based on a user
    requirement that an out-of-sandbox filesystem mutation is guaranteed to be
    visible by all subsequent observations from within the sandbox, or vice
    versa; it's not clear that this can be guaranteed without a synchronous
    signaling mechanism like an RPC.