gvisor/pkg/sentry/fs/fsutil
Jamie Liu 492229d017 VFS2 gofer client
Updates #1198

Opening host pipes (by spinning in fdpipe) and host sockets is not yet
complete, and will be done in a future CL.

Major differences from VFS1 gofer client (sentry/fs/gofer), with varying levels
of backportability:

- "Cache policies" are replaced by InteropMode, which control the behavior of
  timestamps in addition to caching. Under InteropModeExclusive (analogous to
  cacheAll) and InteropModeWritethrough (analogous to cacheAllWritethrough),
  client timestamps are *not* written back to the server (it is not possible in
  9P or Linux for clients to set ctime, so writing back client-authoritative
  timestamps results in incoherence between atime/mtime and ctime). Under
  InteropModeShared (analogous to cacheRemoteRevalidating), client timestamps
  are not used at all (remote filesystem clocks are authoritative). cacheNone
  is translated to InteropModeShared + new option
  filesystemOptions.specialRegularFiles.

- Under InteropModeShared, "unstable attribute" reloading for permission
  checks, lookup, and revalidation are fused, which is feasible in VFS2 since
  gofer.filesystem controls path resolution. This results in a ~33% reduction
  in RPCs for filesystem operations compared to cacheRemoteRevalidating. For
  example, consider stat("/foo/bar/baz") where "/foo/bar/baz" fails
  revalidation, resulting in the instantiation of a new dentry:

  VFS1 RPCs:
  getattr("/")                          // fs.MountNamespace.FindLink() => fs.Inode.CheckPermission() => gofer.inodeOperations.check() => gofer.inodeOperations.UnstableAttr()
  walkgetattr("/", "foo") = fid1        // fs.Dirent.walk() => gofer.session.Revalidate() => gofer.cachePolicy.Revalidate()
  clunk(fid1)
  getattr("/foo")                       // CheckPermission
  walkgetattr("/foo", "bar") = fid2     // Revalidate
  clunk(fid2)
  getattr("/foo/bar")                   // CheckPermission
  walkgetattr("/foo/bar", "baz") = fid3 // Revalidate
  clunk(fid3)
  walkgetattr("/foo/bar", "baz") = fid4 // fs.Dirent.walk() => gofer.inodeOperations.Lookup
  getattr("/foo/bar/baz")               // linux.stat() => gofer.inodeOperations.UnstableAttr()

  VFS2 RPCs:
  getattr("/")                          // gofer.filesystem.walkExistingLocked()
  walkgetattr("/", "foo") = fid1        // gofer.filesystem.stepExistingLocked()
  clunk(fid1)
                                        // No getattr: walkgetattr already updated metadata for permission check
  walkgetattr("/foo", "bar") = fid2
  clunk(fid2)
  walkgetattr("/foo/bar", "baz") = fid3
                                        // No clunk: fid3 used for new gofer.dentry
                                        // No getattr: walkgetattr already updated metadata for stat()

- gofer.filesystem.unlinkAt() does not require instantiation of a dentry that
  represents the file to be deleted. Updates #898.

- gofer.regularFileFD.OnClose() skips Tflushf for regular files under
  InteropModeExclusive, as it's nonsensical to request a remote file flush
  without flushing locally-buffered writes to that remote file first.

- Symlink targets are cached when InteropModeShared is not in effect.

- p9.QID.Path (which is already required to be unique for each file within a
  server, and is accordingly already synthesized from device/inode numbers in
  all known gofers) is used as-is for inode numbers, rather than being mapped
  along with attr.RDev in the client to yet another synthetic inode number.

- Relevant parts of fsutil.CachingInodeOperations are inlined directly into
  gofer package code. This avoids having to duplicate part of its functionality
  in fsutil.HostMappable.

PiperOrigin-RevId: 293190213
2020-02-04 11:29:22 -08:00
..
BUILD VFS2 gofer client 2020-02-04 11:29:22 -08:00
README.md Decouple filemem from platform and move it to pgalloc.MemoryFile. 2019-03-14 08:12:48 -07:00
dirty_set.go Update package locations. 2020-01-27 15:31:32 -08:00
dirty_set_test.go Update package locations. 2020-01-27 15:31:32 -08:00
file.go Update package locations. 2020-01-27 15:31:32 -08:00
file_range_set.go Update package locations. 2020-01-27 15:31:32 -08:00
frame_ref_set.go VFS2 gofer client 2020-02-04 11:29:22 -08:00
fsutil.go Change copyright notice to "The gVisor Authors" 2019-04-29 14:26:23 -07:00
host_file_mapper.go Update package locations. 2020-01-27 15:31:32 -08:00
host_file_mapper_state.go Change copyright notice to "The gVisor Authors" 2019-04-29 14:26:23 -07:00
host_file_mapper_unsafe.go Update package locations. 2020-01-27 15:31:32 -08:00
host_mappable.go Update package locations. 2020-01-27 15:31:32 -08:00
inode.go Update package locations. 2020-01-27 15:31:32 -08:00
inode_cached.go VFS2 gofer client 2020-02-04 11:29:22 -08:00
inode_cached_test.go Update package locations. 2020-01-27 15:31:32 -08:00

README.md

This package provides utilities for implementing virtual filesystem objects.

[TOC]

Page cache

CachingInodeOperations implements a page cache for files that cannot use the host page cache. Normally these are files that store their data in a remote filesystem. This also applies to files that are accessed on a platform that does not support directly memory mapping host file descriptors (e.g. the ptrace platform).

An CachingInodeOperations buffers regions of a single file into memory. It is owned by an fs.Inode, the in-memory representation of a file (all open file descriptors are backed by an fs.Inode). The fs.Inode provides operations for reading memory into an CachingInodeOperations, to represent the contents of the file in-memory, and for writing memory out, to relieve memory pressure on the kernel and to synchronize in-memory changes to filesystems.

An CachingInodeOperations enables readable and/or writable memory access to file content. Files can be mapped shared or private, see mmap(2). When a file is mapped shared, changes to the file via write(2) and truncate(2) are reflected in the shared memory region. Conversely, when the shared memory region is modified, changes to the file are visible via read(2). Multiple shared mappings of the same file are coherent with each other. This is consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to other memory mappings. Updates to the mapped memory are also not reflected in the file content as seen by read(2). If the file is changed after a private mapping is created, for instance by write(2), the change to the file may or may not be reflected in the private mapping. This is consistent with Linux.

An CachingInodeOperations keeps track of ranges of memory that were modified (or "dirtied"). When the file is explicitly synced via fsync(2), only the dirty ranges are written out to the filesystem. Any error returned indicates a failure to write all dirty memory of an CachingInodeOperations to the filesystem. In this case the filesystem may be in an inconsistent state. The same operation can be performed on the shared memory itself using msync(2). If neither fsync(2) nor msync(2) is performed, then the dirty memory is written out in accordance with the CachingInodeOperations eviction strategy (see below) and there is no guarantee that memory will be written out successfully in full.

Memory allocation and eviction

An CachingInodeOperations implements the following allocation and eviction strategy:

  • Memory is allocated and brought up to date with the contents of a file when a region of mapped memory is accessed (or "faulted on").

  • Dirty memory is written out to filesystems when an fsync(2) or msync(2) operation is performed on a memory mapped file, for all memory mapped files when saved, and/or when there are no longer any memory mappings of a range of a file, see munmap(2). As the latter implies, in the absence of a panic or SIGKILL, dirty memory is written out for all memory mapped files when an application exits.

  • Memory is freed when there are no longer any memory mappings of a range of a file (e.g. when an application exits). This behavior is consistent with Linux for shared memory that has been locked via mlock(2).

Notably, memory is not allocated for read(2) or write(2) operations. This means that reads and writes to the file are only accelerated by an CachingInodeOperations if the file being read or written has been memory mapped and if the shared memory has been accessed at the region being read or written. This diverges from Linux which buffers memory into a page cache on read(2) proactively (i.e. readahead) and delays writing it out to filesystems on write(2) (i.e. writeback). The absence of these optimizations is not visible to applications beyond less than optimal performance when repeatedly reading and/or writing to same region of a file. See Future Work for plans to implement these optimizations.

Additionally, memory held by CachingInodeOperationss is currently unbounded in size. An CachingInodeOperations does not write out dirty memory and free it under system memory pressure. This can cause pathological memory usage.

When memory is written back, an CachingInodeOperations may write regions of shared memory that were never modified. This is due to the strategy of minimizing page faults (see below) and handling only a subset of memory write faults. In the absence of an application or sentry crash, it is guaranteed that if a region of shared memory was written to, it is written back to a filesystem.

Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if A is an address, an application may execute:

mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

This creates a shared mapping of fd that reflects 4k of the contents of fd starting at offset 0, accessible at address A. This in turn creates a virtual memory area region ("vma") which indicates that [A, A+0x1000) is now a valid address range for this application to access.

At this point, memory has not been allocated in the file's CachingInodeOperations. It is also the case that the address range [A, A+0x1000) has not been mapped on the host on behalf of the application. If the application then tries to modify 8 bytes of the shared memory:

char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);

The host then sends a SIGSEGV to the sentry because the address range [A, A+8) is not mapped on the host. The SIGSEGV indicates that the memory was accessed writable. The sentry looks up the vma associated with [A, A+8), finds the file that was mapped and its CachingInodeOperations. It then calls CachingInodeOperations.Translate which allocates memory to back [A, A+8). It may choose to allocate more memory (i.e. do "readahead") to minimize subsequent faults.

Memory that is allocated comes from a host tmpfs file (see pgalloc.MemoryFile). The host tmpfs file memory is brought up to date with the contents of the mapped file on its filesystem. The region of the host tmpfs file that reflects the mapped file is then mapped into the host address space of the application so that subsequent memory accesses do not repeatedly generate a SIGSEGV.

The range that was allocated, including any extra memory allocation to minimize faults, is marked dirty due to the write fault. This overcounts dirty memory if the extra memory allocated is never modified.

To make the scenario more interesting, imagine that this application spawns another process and maps the same file in the exact same way:

mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Imagine that this process then tries to modify the file again but with only 4 bytes:

char buffer[] = "bbbb";
memcpy(A, buffer, 4);

Since the first process has already mapped and accessed the same region of the file writable, CachingInodeOperations.Translate is called but returns the memory that has already been allocated rather than allocating new memory. The address range [A, A+0x1000) reflects the same cached view of the file as the first process sees. For example, reading 8 bytes from the file from either process via read(2) starting at offset 0 returns a consistent "bbbbaaaa".

When this process no longer needs the shared memory, it may do:

munmap(A, 0x1000);

At this point, the modified memory cached by the CachingInodeOperations is not written back to the file because it is still in use by the first process that mapped it. When the first process also does:

munmap(A, 0x1000);

Then the last memory mapping of the file at the range [0, 0x1000) is gone. The file's CachingInodeOperations then starts writing back memory marked dirty to the file on its filesystem. Once writing completes, regardless of whether it was successful, the CachingInodeOperations frees the memory cached at the range [0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the filesystem since there no longer exists memory for it in its CachingInodeOperations.

Future Work

Page cache

The sentry does not yet implement the readahead and writeback optimizations for read(2) and write(2) respectively. To do so, on read(2) and/or write(2) the sentry must ensure that memory is allocated in a page cache to read or write into. However, the sentry cannot boundlessly allocate memory. If it did, the host would eventually OOM-kill the sentry+application process. This means that the sentry must implement a page cache memory allocation strategy that is bounded by a global user or container imposed limit. When this limit is approached, the sentry must decide from which page cache memory should be freed so that it can allocate more memory. If it makes a poor decision, the sentry may end up freeing and re-allocating memory to back regions of files that are frequently used, nullifying the optimization (and in some cases causing worse performance due to the overhead of memory allocation and general management). This is a form of "cache thrashing".

In Linux, much research has been done to select and implement a lightweight but optimal page cache eviction algorithm. Linux makes use of hardware page bits to keep track of whether memory has been accessed. The sentry does not have direct access to hardware. Implementing a similarly lightweight and optimal page cache eviction algorithm will need to either introduce a kernel interface to obtain these page bits or find a suitable alternative proxy for access events.

In Linux, readahead happens by default but is not always ideal. For instance, for files that are not read sequentially, it would be more ideal to simply read from only those regions of the file rather than to optimistically cache some number of bytes ahead of the read (up to 2MB in Linux) if the bytes cached won't be accessed. Linux implements the fadvise64(2) system call for applications to specify that a range of a file will not be accessed sequentially. The advice bit FADV_RANDOM turns off the readahead optimization for the given range in the given file. However fadvise64 is rarely used by applications so Linux implements a readahead backoff strategy if reads are not sequential. To ensure that application performance is not degraded, the sentry must implement a similar backoff strategy.