# Replacing 9P

## Background

The Linux filesystem model consists of the following key aspects (modulo
mounts, which are outside the scope of this discussion):

- A `struct inode` represents a "filesystem object", such as a directory or a
  regular file. "Filesystem object" is most precisely defined by the practical
  properties of an inode, such as an immutable type (regular file, directory,
  symbolic link, etc.) and its independence from the path originally used to
  obtain it.

- A `struct dentry` represents a node in a filesystem tree. Semantically, each
  dentry is immutably associated with an inode representing the filesystem
  object at that position. (Linux implements optimizations involving reuse of
  unreferenced dentries, which allows their associated inodes to change, but
  this is outside the scope of this discussion.)

- A `struct file` represents an open file description (hereafter FD) and is
  needed to perform I/O. Each FD is immutably associated with the dentry
  through which it was opened.

The current gVisor virtual filesystem implementation (hereafter VFS1) closely
imitates the Linux design:

- `struct inode` => `fs.Inode`
- `struct dentry` => `fs.Dirent`
- `struct file` => `fs.File`

gVisor accesses most external filesystems through a variant of the 9P2000.L
protocol, including extensions for performance (`walkgetattr`) and for
features not supported by vanilla 9P2000.L (`flushf`, `lconnect`). The 9P
protocol family is inode-based; 9P fids represent a file (equivalently
"filesystem object"), and the protocol is structured around alternately
obtaining fids to represent files (with `walk` and, in gVisor, `walkgetattr`)
and performing operations on those fids.

In the sections below, a **shared** filesystem is a filesystem that is
*mutably* accessible by multiple concurrent clients, while a **non-shared**
filesystem is a filesystem that is either read-only or accessible by only a
single client.

## Problems

### Serialization of Path Component RPCs

Broadly speaking, VFS1 traverses each path component in a pathname,
alternating between verifying that each traversed dentry represents an inode
that represents a searchable directory and moving to the next dentry in the
path.

In the context of a remote filesystem, the structure of this traversal means
that, modulo caching, a path involving N components requires at least N-1
*sequential* RPCs to obtain metadata for intermediate directories, incurring
significant latency. (In vanilla 9P2000.L, 2(N-1) RPCs are required: N-1
`walk` and N-1 `getattr`. We added the `walkgetattr` RPC to reduce this
overhead.) On non-shared filesystems, this overhead is primarily significant
during application startup; caching mitigates much of this overhead at steady
state. On shared filesystems, where correct caching requires revalidation
(requiring RPCs for each revalidated directory anyway), this overhead is
consistently ruinous.

### Inefficient RPCs

9P is not exceptionally economical with RPCs in general. In addition to the
issue described above:

- Opening an existing file in 9P involves at least 2 RPCs: `walk` to produce
  an unopened fid representing the file, and `lopen` to open the fid.

- Creating a file also involves at least 2 RPCs: `walk` to produce an unopened
  fid representing the parent directory, and `lcreate` to create the file and
  convert the fid to an open fid representing the created file. In practice,
  both the Linux and gVisor 9P clients expect to have an unopened fid for the
  created file (necessitating an additional `walk`), as well as attributes for
  the created file (necessitating an additional `getattr`), for a total of 4
  RPCs (see the sketch after this list). (In a shared filesystem, where
  whether a file already exists can change between RPCs, a correct
  implementation of `open(O_CREAT)` would have to alternate between these two
  paths (plus `clunk`ing the temporary fid between alternations, since the
  nature of the fid differs between the two paths). Neither Linux nor gVisor
  implements the required alternation, so `open(O_CREAT)` without `O_EXCL` can
  spuriously fail with `EEXIST` on both.)

- Closing (`clunk`ing) a fid requires an RPC. VFS1 issues this RPC
  asynchronously in an attempt to reduce critical path latency, but scheduling
  overhead makes this not clearly advantageous in practice.

- `read` and `readdir` can return partial reads without a way to indicate EOF,
  necessitating an additional final read to detect EOF.

- Operations that affect filesystem state do not consistently return updated
  filesystem state. In gVisor, the client implementation attempts to handle
  this by tracking what it thinks updated state "should" be; this is complex,
  and especially brittle for timestamps (which are often not arbitrarily
  settable). In Linux, the client implementation invalidates cached metadata
  whenever it performs such an operation, and reloads it when a dentry
  corresponding to an inode with no valid cached metadata is revalidated; this
  is simple, but necessitates an additional `getattr`.
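To make the `open(O_CREAT)` cost concrete, the sketch below enumerates the
four round trips as Go code against a hypothetical client interface; the
`p9Client` type, its method signatures, and the fid-allocation scheme are
invented for illustration and do not correspond to any real client API.

```go
package ninedemo

// p9Client is a hypothetical stand-in for a 9P2000.L client; each method
// issues exactly one RPC (one round trip to the filesystem server).
type p9Client interface {
	Walk(fid, newFID uint32, names []string) error            // Twalk
	Create(fid uint32, name string, flags, mode uint32) error // Tlcreate
	GetAttr(fid uint32, mask uint64) (Attr, error)            // Tgetattr
	Clunk(fid uint32) error                                   // Tclunk
}

// Attr holds file attributes (sizes, timestamps, ownership, ...).
type Attr struct{}

// createAndOpen shows the 4 sequential RPCs needed to create and open
// name in the directory represented by dirFID.
func createAndOpen(c p9Client, dirFID uint32, name string) error {
	const (
		openFID     = 1 // converted to an open fid by Tlcreate
		unopenedFID = 2 // retained as an unopened fid for the new file
	)
	// RPC 1: clone the parent directory's fid.
	if err := c.Walk(dirFID, openFID, nil); err != nil {
		return err
	}
	// RPC 2: create the file; openFID now represents the created, open file.
	if err := c.Create(openFID, name, 0o2 /* O_RDWR */, 0o644); err != nil {
		return err
	}
	// RPC 3: obtain an unopened fid representing the created file.
	if err := c.Walk(dirFID, unopenedFID, []string{name}); err != nil {
		return err
	}
	// RPC 4: fetch the created file's attributes.
	_, err := c.GetAttr(unopenedFID, ^uint64(0))
	return err
}
```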
### Dentry/Inode Ambiguity

As noted above, 9P's documentation tends to imply that unopened fids represent
an inode. In practice, most filesystem APIs present very limited interfaces
for working with inodes at best, such that the interpretation of unopened fids
varies:

- Linux's 9P client associates unopened fids with (dentry, uid) pairs. When
  caching is enabled, it also associates each inode with the first fid opened
  writably that references that inode, in order to support page cache
  writeback.

- gVisor's 9P client associates unopened fids with inodes, and also caches
  opened fids in inodes in a manner similar to Linux.

- The runsc fsgofer associates unopened fids with both "dentries" (host
  filesystem paths) and "inodes" (host file descriptors); which is used
  depends on the operation invoked on the fid.

For non-shared filesystems, this confusion has resulted in correctness issues
that are (in gVisor) currently handled by a number of coarse-grained locks
that serialize renames with all other filesystem operations. For shared
filesystems, this means inconsistent behavior in the presence of concurrent
mutation.

## Design

Almost all Linux filesystem syscalls describe filesystem resources in one of
two ways:

- Path-based: A filesystem position is described by a combination of a
  starting position and a sequence of path components relative to that
  position, where the starting position is one of:

    - The VFS root (defined by mount namespace and chroot), for absolute paths

    - The VFS position of an existing FD, for relative paths passed to `*at`
      syscalls (e.g. `fstatat`)

    - The current working directory, for relative paths passed to non-`*at`
      syscalls and `*at` syscalls with `AT_FDCWD`

- File-description-based: A filesystem object is described by an existing FD,
  passed to a `f*` syscall (e.g. `fstat`).
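For concreteness, the following minimal fragment shows the same query
expressed both ways at the syscall level, using `golang.org/x/sys/unix` (the
function and its parameters are illustrative):

```go
package fsdemo

import "golang.org/x/sys/unix"

// statBothWays stats a file first by path, then by FD.
func statBothWays(dirFD int, path string, fd int) error {
	var st unix.Stat_t

	// Path-based: a starting position (dirFD here; unix.AT_FDCWD would name
	// the current working directory) plus path components to resolve.
	if err := unix.Fstatat(dirFD, path, &st, 0); err != nil {
		return err
	}

	// FD-based: an existing open file description; no path resolution occurs.
	return unix.Fstat(fd, &st)
}
```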
Many of our issues with 9P arise from its (and VFS') interposition of a model
based on inodes between the filesystem syscall API and filesystem
implementations. We propose to replace 9P with a protocol that does not
feature inodes at all, and instead closely follows the filesystem syscall API
by featuring only path-based and FD-based operations, with minimal deviations
as necessary to ameliorate deficiencies in the syscall interface (see below).
This approach addresses the issues described above:

- Even on shared filesystems, most application filesystem syscalls are
  translated to a single RPC (possibly excepting special cases described
  below), which is a logical lower bound.

- The behavior of application syscalls on shared filesystems is
  straightforwardly predictable: path-based syscalls are translated to
  path-based RPCs, which look up the file at that path anew, and FD-based
  syscalls are translated to FD-based RPCs, which use an existing open file
  without performing another lookup. (This is at least true on gofers that
  proxy the host local filesystem; other filesystems that lack support for
  e.g. certain operations on FDs may have different behavior, but this
  divergence is at least still predictable and inherent to the underlying
  filesystem implementation.)

Note that this approach is only feasible in gVisor's next-generation virtual
filesystem (VFS2), which does not assume the existence of inodes and allows
the remote filesystem client to translate whole path-based syscalls into RPCs.
Thus one of the unavoidable tradeoffs associated with such a protocol vs. 9P
is the inability to construct a Linux kernel client that is
performance-competitive with gVisor, since Linux's VFS is inode-based.

### File Permissions

Many filesystem operations are side-effectual, such that file permissions must
be checked before such operations take effect. The simplest approach to file
permission checking is for the sentry to obtain permissions from the remote
filesystem, then apply permission checks in the sentry before performing the
application-requested operation. However, this requires an additional RPC per
application syscall (which can't be mitigated by caching on shared
filesystems). Alternatively, we may delegate file permission checking to
gofers. In general, file permission checks depend on the following properties
of the accessor:

- Filesystem UID/GID

- Supplementary GIDs

- Effective capabilities in the accessor's user namespace (i.e. the accessor's
  effective capability set)

- All UIDs and GIDs mapped in the accessor's user namespace (which determine
  if the accessor's capabilities apply to accessed files)

We may choose to delay implementation of file permission checking delegation,
although this is potentially costly since it doubles the number of required
RPCs for most operations on shared filesystems. We may also consider
compromise options, such as only delegating file permission checks for
accessors in the root user namespace.

### Symbolic Links

gVisor usually interprets symbolic link targets in its VFS rather than on the
filesystem containing the symbolic link; thus e.g. a symlink to
"/proc/self/maps" on a remote filesystem resolves to said file in the sentry's
procfs rather than the host's. This implies that:

- Remote filesystem servers that proxy filesystems supporting symlinks must
  check if each path component is a symlink during path traversal (see the
  sketch after this list).

- Absolute symlinks require that the sentry restart the operation at its
  contextual VFS root (which is task-specific and may not be on a remote
  filesystem at all), so if a remote filesystem server encounters an absolute
  symlink during path traversal on behalf of a path-based operation, it must
  terminate path traversal and return the symlink target.

- Relative symlinks begin target resolution in the parent directory of the
  symlink, so in theory most relative symlinks can be handled automatically
  during the path traversal that encounters the symlink, provided that said
  traversal is supplied with the number of remaining symlinks before `ELOOP`.
  However, the new path traversed by the symlink target may cross VFS mount
  boundaries, such that it's only safe for remote filesystem servers to
  speculatively follow relative symlinks for side-effect-free operations such
  as `stat` (where the sentry can simply ignore results that are inapplicable
  due to crossing mount boundaries). We may choose to delay implementation of
  this feature, at the cost of an additional RPC per relative symlink (note
  that even if the symlink target crosses a mount boundary, the sentry will
  need to `stat` the path to the mount boundary to confirm that each traversed
  component is an accessible directory); until it is implemented, relative
  symlinks may be handled like absolute symlinks, by terminating path
  traversal and returning the symlink target.
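As referenced in the first item above, here is a minimal sketch of
per-component symlink handling for a server proxying a host local filesystem;
the function name and error handling are illustrative, not part of the
proposal.

```go
package fsdemo

import "golang.org/x/sys/unix"

// walkComponent advances one path component from parentFD. If the component
// is a symlink, traversal terminates and the target is returned so that the
// sentry can restart resolution (at its contextual VFS root for absolute
// targets).
func walkComponent(parentFD int, name string) (childFD int, target string, err error) {
	// O_PATH|O_NOFOLLOW obtains a handle to the component itself, without
	// following it if it is a symlink.
	childFD, err = unix.Openat(parentFD, name, unix.O_PATH|unix.O_NOFOLLOW, 0)
	if err != nil {
		return -1, "", err
	}
	var st unix.Stat_t
	if err := unix.Fstat(childFD, &st); err != nil {
		unix.Close(childFD)
		return -1, "", err
	}
	if st.Mode&unix.S_IFMT == unix.S_IFLNK {
		// readlinkat() with an empty path operates on the O_PATH FD itself.
		buf := make([]byte, unix.PathMax)
		n, err := unix.Readlinkat(childFD, "", buf)
		unix.Close(childFD)
		if err != nil {
			return -1, "", err
		}
		return -1, string(buf[:n]), nil
	}
	return childFD, "", nil
}
```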
The possibility of symlinks (and the possibility of a compromised sentry)
means that the sentry may issue RPCs with paths that, in the absence of
symlinks, would traverse beyond the root of the remote filesystem. For
example, the sentry may issue an RPC with a path like "/foo/../..", on the
premise that if "/foo" is a symlink then the resulting path may be elsewhere
on the remote filesystem. To handle this, path traversal must also track its
current depth below the remote filesystem root, and terminate path traversal
if it would ascend beyond this point.
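A minimal sketch of that depth accounting, under the simplifying assumption
that symlink targets encountered mid-traversal are pushed through the same
accounting (the function name and the choice of `EPERM` are illustrative):

```go
package fsdemo

import "golang.org/x/sys/unix"

// checkDepth consumes path components in traversal order, tracking depth
// below the remote filesystem root, and fails any ".." that would ascend
// above it.
func checkDepth(components []string) error {
	depth := 0
	for _, c := range components {
		switch c {
		case "", ".":
			// No change in depth.
		case "..":
			if depth == 0 {
				// Terminate traversal: the path would escape the
				// filesystem root.
				return unix.EPERM
			}
			depth--
		default:
			depth++
		}
	}
	return nil
}
```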
### Path Traversal

Since path-based VFS operations will translate to path-based RPCs, filesystem
servers will need to handle path traversal. From the perspective of a given
filesystem implementation in the server, there are two basic approaches to
path traversal:

- Inode-walk: For each path component, obtain a handle to the underlying
  filesystem object (e.g. with `open(O_PATH)`), check if that object is a
  symlink (as sketched in the previous section) and that that object is
  accessible by the caller (e.g. with `fstat()`), then continue to the next
  path component (e.g. with `openat()`). This ensures that the checked
  filesystem object is the one used to obtain the next object in the
  traversal, which is intuitively appealing. However, while this approach
  works for host local filesystems, it requires features that are not widely
  supported by other filesystems.

- Path-walk: For each path component, use a path-based operation to determine
  if the filesystem object currently referred to by that path component is a
  symlink / is accessible. This is highly portable, but suffers from quadratic
  behavior (at the level of the underlying filesystem implementation, the
  first path component will be traversed a number of times equal to the number
  of path components in the path).

The implementation should support either option by delegating path traversal
to filesystem implementations within the server (like VFS and the remote
filesystem protocol itself), as inode-walking is still safe, efficient,
amenable to FD caching, and implementable on non-shared host local filesystems
(a sufficiently common case as to be worth considering in the design).

Both approaches are susceptible to race conditions that may permit sandboxed
filesystem escapes:

- Under inode-walk, a malicious application may cause a directory to be moved
  (with `rename`) during path traversal, such that the filesystem
  implementation incorrectly determines whether subsequent inodes are located
  in paths that should be visible to sandboxed applications.

- Under path-walk, a malicious application may cause a non-symlink file to be
  replaced with a symlink during path traversal, such that following path
  operations will incorrectly follow the symlink.

Both race conditions can, to some extent, be mitigated in filesystem server
implementations by synchronizing path traversal with the hazardous operations
in question. However, shared filesystems are frequently used to share data
between sandboxed and unsandboxed applications in a controlled way, and in
some cases a malicious sandboxed application may be able to take advantage of
a hazardous filesystem operation performed by an unsandboxed application. In
some cases, filesystem features may be available to ensure safety even in such
cases (e.g.
[the new openat2() syscall](https://man7.org/linux/man-pages/man2/openat2.2.html)),
but it is not clear how to solve this problem in general. (Note that this
issue is not specific to our design; rather, it is a fundamental limitation of
filesystem sandboxing.)

### Filesystem Multiplexing

A given sentry may need to access multiple distinct remote filesystems (e.g.
different volumes for a given container). In many cases, there is no advantage
to serving these filesystems from distinct filesystem servers, or accessing
them through distinct connections (factors such as maximum RPC concurrency
should be based on available host resources). Therefore, the protocol should
support multiplexing of distinct filesystem trees within a single session. 9P
supports this by allowing multiple calls to the `attach` RPC to produce fids
representing distinct filesystem trees, but this is somewhat clunky; we
propose a much simpler mechanism wherein each message that conveys a path also
conveys a numeric filesystem ID that identifies a filesystem tree, as sketched
below.
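A hypothetical shape for such messages and their server-side dispatch (all
names are invented for illustration and are not part of the proposed
protocol):

```go
package fsdemo

import "errors"

var errUnknownFS = errors.New("unknown filesystem ID")

// statAtReq sketches a path-carrying request: every such message names the
// filesystem tree it applies to via a numeric ID, allowing one connection to
// multiplex many trees.
type statAtReq struct {
	FSID uint32   // identifies a filesystem tree within this session
	Path []string // path components relative to that tree's root
}

// filesystem is a server-side handle to one attached filesystem tree.
type filesystem struct {
	rootFD int // e.g. an O_PATH FD for the tree's root directory
}

// server dispatches each request to the tree named by its FSID.
type server struct {
	trees map[uint32]*filesystem
}

func (s *server) statAt(req statAtReq) (*filesystem, error) {
	fs, ok := s.trees[req.FSID]
	if !ok {
		return nil, errUnknownFS
	}
	// Traversal of req.Path would begin at fs.rootFD.
	return fs, nil
}
```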
## Alternatives Considered

### Additional Extensions to 9P

There are at least three conceptual aspects to 9P:

- Wire format: messages with a 4-byte little-endian size prefix, strings with
  a 2-byte little-endian size prefix, etc. Whether the wire format is worth
  retaining is unclear; in particular, it's unclear that the 9P wire format
  has a significant advantage over protobufs, which are substantially easier
  to extend. Note that the official Go protobuf implementation is widely known
  to suffer from a significant number of performance deficiencies, so if we
  choose to switch to protobuf, we may need to use an alternative toolchain
  such as `gogo/protobuf` (which is also widely used in the Go ecosystem, e.g.
  by Kubernetes).

- Filesystem model: fids, qids, etc. Discarding this is one of the motivations
  for this proposal.

- RPCs: Twalk, Tlopen, etc. In addition to previously-described
  inefficiencies, most of these are dependent on the filesystem model and
  therefore must be discarded.

### FUSE

The FUSE (Filesystem in Userspace) protocol is frequently used to provide
arbitrary userspace filesystem implementations to a host Linux kernel.
Unfortunately, FUSE is also inode-based, and therefore doesn't address any of
the problems we have with 9P.

### virtio-fs

virtio-fs is an ongoing project aimed at improving Linux VM filesystem
performance when accessing Linux host filesystems (vs. virtio-9p). In brief,
it is based on:

- Using a FUSE client in the guest that communicates over virtio with a FUSE
  server in the host.

- Using DAX to map the host page cache into the guest.

- Using a file metadata table in shared memory to avoid VM exits for metadata
  updates.

None of these improvements seem applicable to gVisor:

- As explained above, FUSE is still inode-based, so it is still susceptible to
  most of the problems we have with 9P.

- Our use of host file descriptors already allows us to leverage the host page
  cache for file contents.

- Our need for shared filesystem coherence is usually based on a user
  requirement that an out-of-sandbox filesystem mutation is guaranteed to be
  visible to all subsequent observations from within the sandbox, or vice
  versa; it's not clear that this can be guaranteed without a synchronous
  signaling mechanism like an RPC.