gvisor/pkg/sentry/fs/fsutil
Cyrille Hemidy 04b79137ba Fix misspellings.
PiperOrigin-RevId: 195307689
Change-Id: I499f19af49875a43214797d63376f20ae788d2f4
2018-05-03 14:06:13 -07:00
..
BUILD Check in gVisor. 2018-04-28 01:44:26 -04:00
README.md Check in gVisor. 2018-04-28 01:44:26 -04:00
dirty_set.go Check in gVisor. 2018-04-28 01:44:26 -04:00
dirty_set_test.go Check in gVisor. 2018-04-28 01:44:26 -04:00
file.go Fix misspellings. 2018-05-03 14:06:13 -07:00
file_range_set.go Check in gVisor. 2018-04-28 01:44:26 -04:00
frame_ref_set.go Check in gVisor. 2018-04-28 01:44:26 -04:00
fsutil.go Check in gVisor. 2018-04-28 01:44:26 -04:00
handle.go Check in gVisor. 2018-04-28 01:44:26 -04:00
handle_test.go Check in gVisor. 2018-04-28 01:44:26 -04:00
host_file_mapper.go Check in gVisor. 2018-04-28 01:44:26 -04:00
host_file_mapper_state.go Check in gVisor. 2018-04-28 01:44:26 -04:00
host_file_mapper_unsafe.go Check in gVisor. 2018-04-28 01:44:26 -04:00
inode.go Check in gVisor. 2018-04-28 01:44:26 -04:00
inode_cached.go Check in gVisor. 2018-04-28 01:44:26 -04:00
inode_cached_test.go Check in gVisor. 2018-04-28 01:44:26 -04:00

README.md

This package provides utilities for implementing virtual filesystem objects.

[TOC]

Page cache

CachingInodeOperations implements a page cache for files that cannot use the host page cache. Normally these are files that store their data in a remote filesystem. This also applies to files that are accessed on a platform that does not support directly memory mapping host file descriptors (e.g. the ptrace platform).

An CachingInodeOperations buffers regions of a single file into memory. It is owned by an fs.Inode, the in-memory representation of a file (all open file descriptors are backed by an fs.Inode). The fs.Inode provides operations for reading memory into an CachingInodeOperations, to represent the contents of the file in-memory, and for writing memory out, to relieve memory pressure on the kernel and to synchronize in-memory changes to filesystems.

An CachingInodeOperations enables readable and/or writable memory access to file content. Files can be mapped shared or private, see mmap(2). When a file is mapped shared, changes to the file via write(2) and truncate(2) are reflected in the shared memory region. Conversely, when the shared memory region is modified, changes to the file are visible via read(2). Multiple shared mappings of the same file are coherent with each other. This is consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to other memory mappings. Updates to the mapped memory are also not reflected in the file content as seen by read(2). If the file is changed after a private mapping is created, for instance by write(2), the change to the file may or may not be reflected in the private mapping. This is consistent with Linux.

An CachingInodeOperations keeps track of ranges of memory that were modified (or "dirtied"). When the file is explicitly synced via fsync(2), only the dirty ranges are written out to the filesystem. Any error returned indicates a failure to write all dirty memory of an CachingInodeOperations to the filesystem. In this case the filesystem may be in an inconsistent state. The same operation can be performed on the shared memory itself using msync(2). If neither fsync(2) nor msync(2) is performed, then the dirty memory is written out in accordance with the CachingInodeOperations eviction strategy (see below) and there is no guarantee that memory will be written out successfully in full.

Memory allocation and eviction

An CachingInodeOperations implements the following allocation and eviction strategy:

  • Memory is allocated and brought up to date with the contents of a file when a region of mapped memory is accessed (or "faulted on").

  • Dirty memory is written out to filesystems when an fsync(2) or msync(2) operation is performed on a memory mapped file, for all memory mapped files when saved, and/or when there are no longer any memory mappings of a range of a file, see munmap(2). As the latter implies, in the absence of a panic or SIGKILL, dirty memory is written out for all memory mapped files when an application exits.

  • Memory is freed when there are no longer any memory mappings of a range of a file (e.g. when an application exits). This behavior is consistent with Linux for shared memory that has been locked via mlock(2).

Notably, memory is not allocated for read(2) or write(2) operations. This means that reads and writes to the file are only accelerated by an CachingInodeOperations if the file being read or written has been memory mapped and if the shared memory has been accessed at the region being read or written. This diverges from Linux which buffers memory into a page cache on read(2) proactively (i.e. readahead) and delays writing it out to filesystems on write(2) (i.e. writeback). The absence of these optimizations is not visible to applications beyond less than optimal performance when repeatedly reading and/or writing to same region of a file. See Future Work for plans to implement these optimizations.

Additionally, memory held by CachingInodeOperationss is currently unbounded in size. An CachingInodeOperations does not write out dirty memory and free it under system memory pressure. This can cause pathological memory usage.

When memory is written back, an CachingInodeOperations may write regions of shared memory that were never modified. This is due to the strategy of minimizing page faults (see below) and handling only a subset of memory write faults. In the absence of an application or sentry crash, it is guaranteed that if a region of shared memory was written to, it is written back to a filesystem.

Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if A is an address, an application may execute:

mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

This creates a shared mapping of fd that reflects 4k of the contents of fd starting at offset 0, accessible at address A. This in turn creates a virtual memory area region ("vma") which indicates that [A, A+0x1000) is now a valid address range for this application to access.

At this point, memory has not been allocated in the file's CachingInodeOperations. It is also the case that the address range [A, A+0x1000) has not been mapped on the host on behalf of the application. If the application then tries to modify 8 bytes of the shared memory:

char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);

The host then sends a SIGSEGV to the sentry because the address range [A, A+8) is not mapped on the host. The SIGSEGV indicates that the memory was accessed writable. The sentry looks up the vma associated with [A, A+8), finds the file that was mapped and its CachingInodeOperations. It then calls CachingInodeOperations.MapInto which allocates memory to back [A, A+8). It may choose to allocate more memory (i.e. do "readahead") to minimize subsequent faults.

Memory that is allocated comes from a host tmpfs file (see filemem.FileMem). The host tmpfs file memory is brought up to date with the contents of the mapped file on its filesystem. The region of the host tmpfs file that reflects the mapped file is then mapped into the host address space of the application so that subsequent memory accesses do not repeatedly generate a SIGSEGV.

The range that was allocated, including any extra memory allocation to minimize faults, is marked dirty due to the write fault. This overcounts dirty memory if the extra memory allocated is never modified.

To make the scenario more interesting, imagine that this application spawns another process and maps the same file in the exact same way:

mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Imagine that this process then tries to modify the file again but with only 4 bytes:

char buffer[] = "bbbb";
memcpy(A, buffer, 4);

Since the first process has already mapped and accessed the same region of the file writable, CachingInodeOperations.MapInto is called but re-maps the memory that has already been allocated (because the host mapping can be invalidated at any time) rather than allocating new memory. The address range [A, A+0x1000) reflects the same cached view of the file as the first process sees. For example, reading 8 bytes from the file from either process via read(2) starting at offset 0 returns a consistent "bbbbaaaa".

When this process no longer needs the shared memory, it may do:

munmap(A, 0x1000);

At this point, the modified memory cached by the CachingInodeOperations is not written back to the file because it is still in use by the first process that mapped it. When the first process also does:

munmap(A, 0x1000);

Then the last memory mapping of the file at the range [0, 0x1000) is gone. The file's CachingInodeOperations then starts writing back memory marked dirty to the file on its filesystem. Once writing completes, regardless of whether it was successful, the CachingInodeOperations frees the memory cached at the range [0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the filesystem since there no longer exists memory for it in its CachingInodeOperations.

Future Work

Page cache

The sentry does not yet implement the readahead and writeback optimizations for read(2) and write(2) respectively. To do so, on read(2) and/or write(2) the sentry must ensure that memory is allocated in a page cache to read or write into. However, the sentry cannot boundlessly allocate memory. If it did, the host would eventually OOM-kill the sentry+application process. This means that the sentry must implement a page cache memory allocation strategy that is bounded by a global user or container imposed limit. When this limit is approached, the sentry must decide from which page cache memory should be freed so that it can allocate more memory. If it makes a poor decision, the sentry may end up freeing and re-allocating memory to back regions of files that are frequently used, nullifying the optimization (and in some cases causing worse performance due to the overhead of memory allocation and general management). This is a form of "cache thrashing".

In Linux, much research has been done to select and implement a lightweight but optimal page cache eviction algorithm. Linux makes use of hardware page bits to keep track of whether memory has been accessed. The sentry does not have direct access to hardware. Implementing a similarly lightweight and optimal page cache eviction algorithm will need to either introduce a kernel interface to obtain these page bits or find a suitable alternative proxy for access events.

In Linux, readahead happens by default but is not always ideal. For instance, for files that are not read sequentially, it would be more ideal to simply read from only those regions of the file rather than to optimistically cache some number of bytes ahead of the read (up to 2MB in Linux) if the bytes cached won't be accessed. Linux implements the fadvise64(2) system call for applications to specify that a range of a file will not be accessed sequentially. The advice bit FADV_RANDOM turns off the readahead optimization for the given range in the given file. However fadvise64 is rarely used by applications so Linux implements a readahead backoff strategy if reads are not sequential. To ensure that application performance is not degraded, the sentry must implement a similar backoff strategy.