gvisor/pkg/sentry/fs/fsutil/README.md

This package provides utilities for implementing virtual filesystem objects.

[TOC]

## Page cache

`CachingInodeOperations` implements a page cache for files that cannot use the
host page cache. Normally these are files that store their data in a remote
filesystem. This also applies to files that are accessed on a platform that does
not support directly memory mapping host file descriptors (e.g. the ptrace
platform).

An `CachingInodeOperations` buffers regions of a single file into memory. It is
owned by an `fs.Inode`, the in-memory representation of a file (all open file
descriptors are backed by an `fs.Inode`). The `fs.Inode` provides operations for
reading memory into an `CachingInodeOperations`, to represent the contents of
the file in-memory, and for writing memory out, to relieve memory pressure on
the kernel and to synchronize in-memory changes to filesystems.

An `CachingInodeOperations` enables readable and/or writable memory access to
file content. Files can be mapped shared or private, see mmap(2). When a file is
mapped shared, changes to the file via write(2) and truncate(2) are reflected in
the shared memory region. Conversely, when the shared memory region is modified,
changes to the file are visible via read(2). Multiple shared mappings of the
same file are coherent with each other. This is consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to
other memory mappings. Updates to the mapped memory are also not reflected in
the file content as seen by read(2). If the file is changed after a private
mapping is created, for instance by write(2), the change to the file may or may
not be reflected in the private mapping. This is consistent with Linux.

An `CachingInodeOperations` keeps track of ranges of memory that were modified
(or "dirtied"). When the file is explicitly synced via fsync(2), only the dirty
ranges are written out to the filesystem. Any error returned indicates a failure
to write all dirty memory of an `CachingInodeOperations` to the filesystem. In
this case the filesystem may be in an inconsistent state. The same operation can
be performed on the shared memory itself using msync(2). If neither fsync(2) nor
msync(2) is performed, then the dirty memory is written out in accordance with
the `CachingInodeOperations` eviction strategy (see below) and there is no
guarantee that memory will be written out successfully in full.

### Memory allocation and eviction

An `CachingInodeOperations` implements the following allocation and eviction
strategy:

-   Memory is allocated and brought up to date with the contents of a file when
    a region of mapped memory is accessed (or "faulted on").

-   Dirty memory is written out to filesystems when an fsync(2) or msync(2)
    operation is performed on a memory mapped file, for all memory mapped files
    when saved, and/or when there are no longer any memory mappings of a range
    of a file, see munmap(2). As the latter implies, in the absence of a panic
    or SIGKILL, dirty memory is written out for all memory mapped files when an
    application exits.

-   Memory is freed when there are no longer any memory mappings of a range of a
    file (e.g. when an application exits). This behavior is consistent with
    Linux for shared memory that has been locked via mlock(2).

Notably, memory is not allocated for read(2) or write(2) operations. This means
that reads and writes to the file are only accelerated by an
`CachingInodeOperations` if the file being read or written has been memory
mapped *and* if the shared memory has been accessed at the region being read or
written. This diverges from Linux which buffers memory into a page cache on
read(2) proactively (i.e. readahead) and delays writing it out to filesystems on
write(2) (i.e. writeback). The absence of these optimizations is not visible to
applications beyond less than optimal performance when repeatedly reading and/or
writing to same region of a file. See [Future Work](#future-work) for plans to
implement these optimizations.

Additionally, memory held by `CachingInodeOperationss` is currently unbounded in
size. An `CachingInodeOperations` does not write out dirty memory and free it
under system memory pressure. This can cause pathological memory usage.

When memory is written back, an `CachingInodeOperations` may write regions of
shared memory that were never modified. This is due to the strategy of
minimizing page faults (see below) and handling only a subset of memory write
faults. In the absence of an application or sentry crash, it is guaranteed that
if a region of shared memory was written to, it is written back to a filesystem.

### Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if `A` is an address, an
application may execute:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

This creates a shared mapping of fd that reflects 4k of the contents of fd
starting at offset 0, accessible at address `A`. This in turn creates a virtual
memory area region ("vma") which indicates that [`A`, `A`+0x1000) is now a valid
address range for this application to access.

At this point, memory has not been allocated in the file's
`CachingInodeOperations`. It is also the case that the address range [`A`,
`A`+0x1000) has not been mapped on the host on behalf of the application. If the
application then tries to modify 8 bytes of the shared memory:

```
char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);
```

The host then sends a `SIGSEGV` to the sentry because the address range [`A`,
`A`+8) is not mapped on the host. The `SIGSEGV` indicates that the memory was
accessed writable. The sentry looks up the vma associated with [`A`, `A`+8),
finds the file that was mapped and its `CachingInodeOperations`. It then calls
`CachingInodeOperations.Translate` which allocates memory to back [`A`, `A`+8).
It may choose to allocate more memory (i.e. do "readahead") to minimize
subsequent faults.

Memory that is allocated comes from a host tmpfs file (see
`pgalloc.MemoryFile`). The host tmpfs file memory is brought up to date with the
contents of the mapped file on its filesystem. The region of the host tmpfs file
that reflects the mapped file is then mapped into the host address space of the
application so that subsequent memory accesses do not repeatedly generate a
`SIGSEGV`.

The range that was allocated, including any extra memory allocation to minimize
faults, is marked dirty due to the write fault. This overcounts dirty memory if
the extra memory allocated is never modified.

To make the scenario more interesting, imagine that this application spawns
another process and maps the same file in the exact same way:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

Imagine that this process then tries to modify the file again but with only 4
bytes:

```
char buffer[] = "bbbb";
memcpy(A, buffer, 4);
```

Since the first process has already mapped and accessed the same region of the
file writable, `CachingInodeOperations.Translate` is called but returns the
memory that has already been allocated rather than allocating new memory. The
address range [`A`, `A`+0x1000) reflects the same cached view of the file as the
first process sees. For example, reading 8 bytes from the file from either
process via read(2) starting at offset 0 returns a consistent "bbbbaaaa".

When this process no longer needs the shared memory, it may do:

```
munmap(A, 0x1000);
```

At this point, the modified memory cached by the `CachingInodeOperations` is not
written back to the file because it is still in use by the first process that
mapped it. When the first process also does:

```
munmap(A, 0x1000);
```

Then the last memory mapping of the file at the range [0, 0x1000) is gone. The
file's `CachingInodeOperations` then starts writing back memory marked dirty to
the file on its filesystem. Once writing completes, regardless of whether it was
successful, the `CachingInodeOperations` frees the memory cached at the range
[0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the
filesystem since there no longer exists memory for it in its
`CachingInodeOperations`.

## Future Work

### Page cache

The sentry does not yet implement the readahead and writeback optimizations for
read(2) and write(2) respectively. To do so, on read(2) and/or write(2) the
sentry must ensure that memory is allocated in a page cache to read or write
into. However, the sentry cannot boundlessly allocate memory. If it did, the
host would eventually OOM-kill the sentry+application process. This means that
the sentry must implement a page cache memory allocation strategy that is
bounded by a global user or container imposed limit. When this limit is
approached, the sentry must decide from which page cache memory should be freed
so that it can allocate more memory. If it makes a poor decision, the sentry may
end up freeing and re-allocating memory to back regions of files that are
frequently used, nullifying the optimization (and in some cases causing worse
performance due to the overhead of memory allocation and general management).
This is a form of "cache thrashing".

In Linux, much research has been done to select and implement a lightweight but
optimal page cache eviction algorithm. Linux makes use of hardware page bits to
keep track of whether memory has been accessed. The sentry does not have direct
access to hardware. Implementing a similarly lightweight and optimal page cache
eviction algorithm will need to either introduce a kernel interface to obtain
these page bits or find a suitable alternative proxy for access events.

In Linux, readahead happens by default but is not always ideal. For instance,
for files that are not read sequentially, it would be more ideal to simply read
from only those regions of the file rather than to optimistically cache some
number of bytes ahead of the read (up to 2MB in Linux) if the bytes cached won't
be accessed. Linux implements the fadvise64(2) system call for applications to
specify that a range of a file will not be accessed sequentially. The advice bit
FADV_RANDOM turns off the readahead optimization for the given range in the
given file. However fadvise64 is rarely used by applications so Linux implements
a readahead backoff strategy if reads are not sequential. To ensure that
application performance is not degraded, the sentry must implement a similar
backoff strategy.