This package provides utilities for implementing virtual filesystem objects.

[TOC]

## Page cache

`CachingInodeOperations` implements a page cache for files that cannot use the
host page cache. Normally these are files that store their data in a remote
filesystem. This also applies to files that are accessed on a platform that does
not support memory mapping host file descriptors directly (e.g. the ptrace
platform).

A `CachingInodeOperations` buffers regions of a single file into memory. It is
owned by an `fs.Inode`, the in-memory representation of a file (all open file
descriptors are backed by an `fs.Inode`). The `fs.Inode` provides operations for
reading memory into a `CachingInodeOperations`, to represent the contents of
the file in memory, and for writing memory out, to relieve memory pressure on
the kernel and to synchronize in-memory changes to filesystems.

A `CachingInodeOperations` enables readable and/or writable memory access to
file content. Files can be mapped shared or private; see mmap(2). When a file is
mapped shared, changes to the file via write(2) and truncate(2) are reflected in
the shared memory region. Conversely, when the shared memory region is modified,
changes to the file are visible via read(2). Multiple shared mappings of the
same file are coherent with each other. This is consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to
other memory mappings. Updates to the mapped memory are also not reflected in
the file content as seen by read(2). If the file is changed after a private
mapping is created, for instance by write(2), the change to the file may or may
not be reflected in the private mapping. This is consistent with Linux.

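
The shared and private semantics described above can be observed from an
application with a short C sketch. It assumes a Linux-compatible environment and
a writable `/tmp`; the helper names (`make_temp_file`,
`shared_mapping_is_coherent`, `private_write_is_invisible_to_read`) are
hypothetical and exist only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates an unlinked 4 KiB temporary file filled with 'x'.
 * Returns a file descriptor, or -1 on failure. */
static int make_temp_file(void) {
    char path[] = "/tmp/mapdemoXXXXXX";
    char page[4096];
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);
    memset(page, 'x', sizeof(page));
    if (pwrite(fd, page, sizeof(page), 0) != (ssize_t)sizeof(page)) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns 1 if a MAP_SHARED mapping and read(2)/write(2) observe each
 * other's changes, 0 if not, and -1 if the setup failed. */
int shared_mapping_is_coherent(void) {
    char buf[4];
    int ok;
    int fd = make_temp_file();
    if (fd < 0) return -1;
    char *m = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { close(fd); return -1; }

    /* A write to the file shows up in the mapping. */
    ok = pwrite(fd, "file", 4, 0) == 4 && memcmp(m, "file", 4) == 0;

    /* A store to the mapping shows up in a read of the file. */
    memcpy(m + 4, "mmap", 4);
    ok = ok && pread(fd, buf, 4, 4) == 4 && memcmp(buf, "mmap", 4) == 0;

    munmap(m, 4096);
    close(fd);
    return ok;
}

/* Returns 1 if a store through a MAP_PRIVATE mapping leaves the file
 * content seen by read(2) unchanged, 0 if not, -1 if the setup failed. */
int private_write_is_invisible_to_read(void) {
    char buf[4];
    int ok;
    int fd = make_temp_file();
    if (fd < 0) return -1;
    char *m = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (m == MAP_FAILED) { close(fd); return -1; }

    memcpy(m, "priv", 4); /* modifies only this mapping's private copy */
    ok = pread(fd, buf, 4, 0) == 4 && memcmp(buf, "xxxx", 4) == 0;

    munmap(m, 4096);
    close(fd);
    return ok;
}
```

Both functions are expected to return 1. The sketch deliberately does not test
the reverse private case (a write(2) after the private mapping is created),
because that outcome is unspecified.
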
A `CachingInodeOperations` keeps track of ranges of memory that were modified
(or "dirtied"). When the file is explicitly synced via fsync(2), only the dirty
ranges are written out to the filesystem. Any error returned indicates a failure
to write all dirty memory of a `CachingInodeOperations` to the filesystem. In
this case the filesystem may be in an inconsistent state. The same operation can
be performed on the shared memory itself using msync(2). If neither fsync(2) nor
msync(2) is performed, then the dirty memory is written out in accordance with
the `CachingInodeOperations` eviction strategy (see below) and there is no
guarantee that memory will be written out successfully in full.

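
The idea of tracking dirty ranges and writing out only those ranges on sync can
be sketched as follows. This is a simplified illustration, not the data
structure that `CachingInodeOperations` actually uses: the fixed-size per-page
flag array and every name in it (`dirty_tracker`, `mark_dirty`, `sync_dirty`,
`count_writes`) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_MAX_PAGES 64u

/* Hypothetical per-file cache state: one dirty flag per cached page. */
struct dirty_tracker {
    bool dirty[SKETCH_MAX_PAGES];
};

/* Callback that writes one page back; returns 0 on success. */
typedef int (*write_page_fn)(uint64_t page_index, void *ctx);

/* Marks every page overlapping the byte range [off, off+len) as dirty. */
void mark_dirty(struct dirty_tracker *t, uint64_t off, uint64_t len) {
    if (len == 0) return;
    uint64_t first = off / SKETCH_PAGE_SIZE;
    uint64_t last = (off + len - 1) / SKETCH_PAGE_SIZE;
    assert(last < SKETCH_MAX_PAGES);
    for (uint64_t p = first; p <= last; p++) t->dirty[p] = true;
}

/* Writes back only the dirty pages. A page stays dirty if its write fails,
 * so a later sync can retry it. Returns the number of failed pages; a
 * nonzero result means not all dirty memory reached the filesystem. */
size_t sync_dirty(struct dirty_tracker *t, write_page_fn write_page, void *ctx) {
    size_t failed = 0;
    for (uint64_t p = 0; p < SKETCH_MAX_PAGES; p++) {
        if (!t->dirty[p]) continue;
        if (write_page(p, ctx) == 0) {
            t->dirty[p] = false;
        } else {
            failed++;
        }
    }
    return failed;
}

/* Example callback: counts the pages it is asked to write. */
int count_writes(uint64_t page_index, void *ctx) {
    (void)page_index;
    (*(size_t *)ctx)++;
    return 0;
}
```

Marking the 10 bytes at offset 4090 dirties two pages because the range
straddles a page boundary; a following `sync_dirty` call writes exactly those
pages and a second call writes nothing.
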
### Memory allocation and eviction

A `CachingInodeOperations` implements the following allocation and eviction
strategy:

- Memory is allocated and brought up to date with the contents of a file when
  a region of mapped memory is accessed (or "faulted on").

- Dirty memory is written out to filesystems when an fsync(2) or msync(2)
  operation is performed on a memory mapped file, for all memory mapped files
  when saved, and/or when there are no longer any memory mappings of a range
  of a file, see munmap(2). As the latter implies, in the absence of a panic
  or SIGKILL, dirty memory is written out for all memory mapped files when an
  application exits.

- Memory is freed when there are no longer any memory mappings of a range of a
  file (e.g. when an application exits). This behavior is consistent with
  Linux for shared memory that has been locked via mlock(2).

Notably, memory is not allocated for read(2) or write(2) operations. This means
that reads and writes to the file are only accelerated by a
`CachingInodeOperations` if the file being read or written has been memory
mapped *and* if the shared memory has been accessed at the region being read or
written. This diverges from Linux, which buffers memory into a page cache on
read(2) proactively (i.e. readahead) and delays writing it out to filesystems on
write(2) (i.e. writeback). The absence of these optimizations is not visible to
applications beyond less than optimal performance when repeatedly reading and/or
writing to the same region of a file. See [Future Work](#future-work) for plans
to implement these optimizations.

Additionally, memory held by `CachingInodeOperations` objects is currently
unbounded in size. A `CachingInodeOperations` does not write out dirty memory
and free it under system memory pressure. This can cause pathological memory
usage.

When memory is written back, a `CachingInodeOperations` may write regions of
shared memory that were never modified. This is due to the strategy of
minimizing page faults (see below) and handling only a subset of memory write
faults. In the absence of an application or sentry crash, it is guaranteed that
if a region of shared memory was written to, it is written back to a filesystem.

### Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if `A` is an address, an
application may execute:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

This creates a shared mapping of `fd` that reflects 4 KiB of the contents of
`fd` starting at offset 0, accessible at address `A`. This in turn creates a
virtual memory area ("vma") which indicates that [`A`, `A`+0x1000) is now a
valid address range for this application to access.

At this point, memory has not been allocated in the file's
`CachingInodeOperations`. It is also the case that the address range [`A`,
`A`+0x1000) has not been mapped on the host on behalf of the application. If the
application then tries to modify 8 bytes of the shared memory:

```
char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);
```

The host then sends a `SIGSEGV` to the sentry because the address range [`A`,
`A`+8) is not mapped on the host. The `SIGSEGV` indicates that the memory was
accessed for writing. The sentry looks up the vma associated with [`A`, `A`+8)
and finds the file that was mapped and its `CachingInodeOperations`. It then
calls `CachingInodeOperations.Translate`, which allocates memory to back [`A`,
`A`+8). It may choose to allocate more memory (i.e. do "readahead") to minimize
subsequent faults.

Memory that is allocated comes from a host tmpfs file (see
`pgalloc.MemoryFile`). The host tmpfs file memory is brought up to date with the
contents of the mapped file on its filesystem. The region of the host tmpfs file
that reflects the mapped file is then mapped into the host address space of the
application so that subsequent memory accesses do not repeatedly generate a
`SIGSEGV`.

The range that was allocated, including any extra memory allocation to minimize
faults, is marked dirty due to the write fault. This overcounts dirty memory if
the extra memory allocated is never modified.

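
To see why dirty memory is overcounted, consider how an 8-byte write fault
grows into a page-granular (or larger) allocation. The sketch below is only an
illustration of that arithmetic: the 4 KiB page size, the `readahead_pages`
parameter, and the names `byte_range` and `expand_fault` are assumptions, and
the policy that `CachingInodeOperations.Translate` really applies may differ.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_BYTES 4096ull

/* Half-open byte range [start, end) within a file. */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/* Rounds x up to the next multiple of the page size. */
static uint64_t page_round_up(uint64_t x) {
    return (x + SKETCH_PAGE_BYTES - 1) & ~(SKETCH_PAGE_BYTES - 1);
}

/* Expands a faulting range outward to page boundaries, then extends it by
 * readahead_pages extra pages, without going past the last page of the file
 * (unless the fault itself already does). */
struct byte_range expand_fault(struct byte_range fault,
                               uint64_t readahead_pages,
                               uint64_t file_size) {
    assert(fault.start < fault.end);
    uint64_t aligned_end = page_round_up(fault.end);
    uint64_t limit = page_round_up(file_size);
    struct byte_range r;
    r.start = fault.start & ~(SKETCH_PAGE_BYTES - 1);
    r.end = aligned_end + readahead_pages * SKETCH_PAGE_BYTES;
    if (r.end > limit) {
        r.end = limit > aligned_end ? limit : aligned_end;
    }
    return r;
}
```

With no readahead, a fault on [0, 8) of a 4 KiB file expands to [0, 4096). If
the whole expanded range is marked dirty, 4096 bytes are later written back
even though the application modified only 8 of them.
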
To make the scenario more interesting, imagine that this application spawns
another process and maps the same file in the exact same way:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

Imagine that this process then tries to modify the file again but with only 4
bytes:

```
char buffer[] = "bbbb";
memcpy(A, buffer, 4);
```

Since the first process has already mapped and accessed the same region of the
file writable, `CachingInodeOperations.Translate` is called but returns the
memory that has already been allocated rather than allocating new memory. The
address range [`A`, `A`+0x1000) reflects the same cached view of the file as the
first process sees. For example, reading 8 bytes from the file from either
process via read(2) starting at offset 0 returns a consistent "bbbbaaaa".

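
The two-process scenario above can be reproduced from the application's side
with the following sketch. It assumes a Linux-compatible environment and a
writable `/tmp`; the function name `two_process_demo` is hypothetical, and the
mappings are placed at kernel-chosen addresses rather than at a fixed address
`A`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Writes "aaaaaaaa" through a shared mapping in the parent, then "bbbb"
 * through a second shared mapping in a child process, and finally reads the
 * first 8 bytes of the file into out (NUL-terminated).
 * Returns 0 on success and -1 on failure. */
int two_process_demo(char out[9]) {
    char path[] = "/tmp/sharedXXXXXX";
    int status;
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);
    if (ftruncate(fd, 0x1000) != 0) { close(fd); return -1; }

    char *a = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) { close(fd); return -1; }
    memcpy(a, "aaaaaaaa", 8); /* first process dirties the shared memory */

    pid_t pid = fork();
    if (pid < 0) { munmap(a, 0x1000); close(fd); return -1; }
    if (pid == 0) {
        /* Second process: maps the same file range again and overwrites
         * the first 4 bytes, then drops its mapping. */
        char *b = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (b == MAP_FAILED) _exit(1);
        memcpy(b, "bbbb", 4);
        munmap(b, 0x1000);
        _exit(0);
    }

    int ok = waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
             WEXITSTATUS(status) == 0;
    /* read(2)-style access sees both writes, even though the first
     * process still holds its mapping. */
    ok = ok && pread(fd, out, 8, 0) == 8;
    out[8] = '\0';

    munmap(a, 0x1000);
    close(fd);
    return ok ? 0 : -1;
}
```

On success, `out` holds "bbbbaaaa", matching the result described above.
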
When this process no longer needs the shared memory, it may do:

```
munmap(A, 0x1000);
```

At this point, the modified memory cached by the `CachingInodeOperations` is not
written back to the file because it is still in use by the first process that
mapped it. When the first process also does:

```
munmap(A, 0x1000);
```

Then the last memory mapping of the file at the range [0, 0x1000) is gone. The
file's `CachingInodeOperations` then starts writing back memory marked dirty to
the file on its filesystem. Once writing completes, regardless of whether it was
successful, the `CachingInodeOperations` frees the memory cached at the range
[0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the
filesystem since memory for it no longer exists in its
`CachingInodeOperations`.

## Future Work

### Page cache

The sentry does not yet implement the readahead and writeback optimizations for
read(2) and write(2) respectively. To do so, on read(2) and/or write(2) the
sentry must ensure that memory is allocated in a page cache to read or write
into. However, the sentry cannot boundlessly allocate memory. If it did, the
host would eventually OOM-kill the sentry+application process. This means that
the sentry must implement a page cache memory allocation strategy that is
bounded by a global user or container imposed limit. When this limit is
approached, the sentry must decide which page cache memory should be freed so
that it can allocate more memory. If it makes a poor decision, the sentry may
end up freeing and re-allocating memory to back regions of files that are
frequently used, nullifying the optimization (and in some cases causing worse
performance due to the overhead of memory allocation and general management).
This is a form of "cache thrashing".

In Linux, much research has been done to select and implement a lightweight but
optimal page cache eviction algorithm. Linux makes use of hardware page bits to
keep track of whether memory has been accessed. The sentry does not have direct
access to hardware. Implementing a similarly lightweight and optimal page cache
eviction algorithm will need to either introduce a kernel interface to obtain
these page bits or find a suitable alternative proxy for access events.

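
To make the role of access bits concrete, here is a minimal sketch of the
classic CLOCK (second-chance) replacement policy over a bounded set of cached
pages. It is a textbook illustration only, not the algorithm Linux uses and not
a design the sentry has committed to. The `referenced` flag stands in for
whatever access signal would be available, and all names (`clock_cache`,
`clock_init`, `clock_access`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CLOCK_SLOTS 4

struct clock_cache {
    long page[CLOCK_SLOTS];       /* cached page index, or -1 if empty */
    bool referenced[CLOCK_SLOTS]; /* set on access, cleared by the hand */
    size_t hand;                  /* next slot the clock hand examines */
};

void clock_init(struct clock_cache *c) {
    for (size_t i = 0; i < CLOCK_SLOTS; i++) {
        c->page[i] = -1;
        c->referenced[i] = false;
    }
    c->hand = 0;
}

/* Records an access to page. On a hit, only the reference flag is set.
 * On a miss, the hand sweeps forward, giving referenced pages a second
 * chance, and the page replaces the first unreferenced slot it finds.
 * Returns the evicted page index, or -1 if nothing was evicted. */
long clock_access(struct clock_cache *c, long page) {
    assert(page >= 0);
    for (size_t i = 0; i < CLOCK_SLOTS; i++) {
        if (c->page[i] == page) {
            c->referenced[i] = true;
            return -1;
        }
    }
    for (;;) {
        size_t i = c->hand;
        c->hand = (c->hand + 1) % CLOCK_SLOTS;
        if (c->page[i] != -1 && c->referenced[i]) {
            c->referenced[i] = false; /* second chance */
            continue;
        }
        long victim = c->page[i];
        c->page[i] = page;
        c->referenced[i] = true;
        return victim;
    }
}
```

The cache never holds more than `CLOCK_SLOTS` pages, and a page that was
touched since the hand last passed it survives the next sweep. Without a cheap
access signal, the `referenced` flag cannot be maintained, which is the gap the
paragraph above describes.
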
In Linux, readahead happens by default but is not always ideal. For instance,
for files that are not read sequentially, it would be better to read only
those regions of the file rather than to optimistically cache some number of
bytes ahead of the read (up to 2MB in Linux) if the bytes cached won't be
accessed. Linux implements the fadvise64(2) system call for applications to
specify that a range of a file will not be accessed sequentially. The advice bit
FADV_RANDOM turns off the readahead optimization for the given range in the
given file. However, fadvise64 is rarely used by applications, so Linux
implements a readahead backoff strategy if reads are not sequential. To ensure
that application performance is not degraded, the sentry must implement a
similar backoff strategy.

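
For reference, applications usually reach fadvise64(2) through the C library
wrapper posix_fadvise(3), where the advice is spelled `POSIX_FADV_RANDOM`. A
minimal sketch follows; the helper names `advise_random_access` and
`advise_random_on_temp_file` are hypothetical, and a writable `/tmp` is
assumed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Advises the kernel that fd will be accessed in random order, from offset 0
 * to the end of the file (a length of 0 means "until end of file").
 * Returns 0 on success or an error number; posix_fadvise does not set
 * errno. */
int advise_random_access(int fd) {
    return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}

/* Demo helper: applies the advice to a fresh, unlinked temporary file.
 * Returns the posix_fadvise result, or -1 if the file could not be
 * created. */
int advise_random_on_temp_file(void) {
    char path[] = "/tmp/fadviseXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);
    int rc = advise_random_access(fd);
    close(fd);
    return rc;
}
```

The call is only a hint: it affects readahead behavior, not the data returned
by later reads.
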