gvisor/pkg/sentry/mm/README.md

14 KiB

This package provides an emulation of Linux semantics for application virtual memory mappings.

For completeness, this document also describes aspects of the memory management subsystem defined outside this package.

Background

We begin by describing semantics for virtual memory in Linux.

A virtual address space is defined as a collection of mappings from virtual addresses to physical memory. However, userspace applications do not configure mappings to physical memory directly. Instead, applications configure memory mappings from virtual addresses to offsets into a file using the mmap system call.1 For example, a call to:

mmap(
    /* addr = */ 0x400000,
    /* length = */ 0x1000,
    PROT_READ | PROT_WRITE,
    MAP_SHARED,
    /* fd = */ 3,
    /* offset = */ 0);

creates a mapping of length 0x1000 bytes, starting at virtual address (VA) 0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within the Linux kernel, virtual memory mappings are represented by virtual memory areas (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the virtual memory subsystem after the mmap call may be depicted as:

VMA:     VA:0x400000 -> /tmp/foo:0x0

Establishing a virtual memory area does not necessarily establish a mapping to a physical address, because Linux has not necessarily provisioned physical memory to store the file's contents. Thus, if the application attempts to read the contents of VA 0x400000, it may incur a page fault, a CPU exception that forces the kernel to create such a mapping to service the read.

For a file, doing so consists of several logical phases:

  1. The kernel allocates physical memory to store the contents of the required part of the file, and copies file contents to the allocated memory. Supposing that the kernel chooses the physical memory at physical address (PA) 0x2fb000, the resulting state of the system is:

    VMA:     VA:0x400000 -> /tmp/foo:0x0
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    

    (In Linux the state of the mapping from file offset to physical memory is stored in struct address_space, but to avoid confusion with other notions of address space we will refer to this system as filemap, named after Linux kernel source file mm/filemap.c.)

  2. The kernel stores the effective mapping from virtual to physical address in a page table entry (PTE) in the application's page tables, which are used by the CPU's virtual memory hardware to perform address translation. The resulting state of the system is:

    VMA:     VA:0x400000 -> /tmp/foo:0x0
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x2fb000
    

    The PTE is required for the application to actually use the contents of the mapped file as virtual memory. However, the PTE is derived from the VMA and filemap state, both of which are independently mutable, such that mutations to either will affect the PTE. For example:

    • The application may remove the VMA using the munmap system call. This breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently the mapping from VA:0x400000 to PA:0x2fb000. However, it does not necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a future mapping of the same file offset may reuse this physical memory.

    • The application may invalidate the file's contents by passing a length of 0 to the ftruncate system call. This breaks the mapping from /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents may again be made visible at VA:0x400000 after another page fault results in the allocation of a new physical address.

    Note that, in order to correctly break the mapping from VA:0x400000 to PA:0x2fb000 in the latter case, filemap must also store a reverse mapping from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.

Private Mappings

The preceding example considered VMAs created using the MAP_SHARED flag, which means that PTEs derived from the mapping should always use physical memory that represents the current state of the mapped file.2 Applications can alternatively pass the MAP_PRIVATE flag to create a private mapping. Private mappings are copy-on-write.

Suppose that the application instead created a private mapping in the previous example. In Linux, the state of the system after a read page fault would be:

VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)

Now suppose the application attempts to write to VA:0x400000. For a shared mapping, the write would be propagated to PA:0x2fb000, and the kernel would be responsible for ensuring that the write is later propagated to the mapped file. For a private mapping, the write incurs another page fault since the PTE is marked read-only. In response, the kernel allocates physical memory to store the mapping's private copy of the file's contents, copies file contents to the allocated memory, and changes the PTE to map to the private copy. Supposing that the kernel chooses the physical memory at physical address (PA) 0x5ea000, the resulting state of the system is:

VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
PTE:     VA:0x400000 -----------------> PA:0x5ea000

Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist, but is now irrelevant to this mapping.

Anonymous Mappings

Instead of passing a file to the mmap system call, applications can instead request an anonymous mapping by passing the MAP_ANONYMOUS flag. Semantically, an anonymous mapping is essentially a mapping to an ephemeral file initially filled with zero bytes. Practically speaking, this is how shared anonymous mappings are implemented, but private anonymous mappings do not result in the creation of an ephemeral file; since there would be no way to modify the contents of the underlying file through a private mapping, all private anonymous mappings use a single shared page filled with zero bytes until copy-on-write occurs.

Virtual Memory in the Sentry

The sentry implements application virtual memory atop a host kernel, introducing an additional level of indirection to the above.

Consider the same scenario as in the previous section. Since the sentry handles application system calls, the effect of an application mmap system call is to create a VMA in the sentry (as opposed to the host kernel):

Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0

When the application first incurs a page fault on this address, the host kernel delivers information about the page fault to the sentry in a platform-dependent manner, and the sentry handles the fault:

  1. The sentry allocates memory to store the contents of the required part of the file, and copies file contents to the allocated memory. However, since the sentry is implemented atop a host kernel, it does not configure mappings to physical memory directly. Instead, mappable "memory" in the sentry is represented by a host file descriptor and offset, since (as noted in "Background") this is the memory mapping primitive provided by the host kernel. In general, memory is allocated from a temporary host file using the filemem package. Supposing that the sentry allocates offset 0x3000 from host file "memory-file", the resulting state is:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
    
  2. The sentry stores the effective mapping from virtual address to host file in a host VMA by invoking the mmap system call:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
      Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000
    
  3. The sentry returns control to the application, which immediately incurs the page fault again.3 However, since a host VMA now exists for the faulting virtual address, the host kernel now handles the page fault as described in "Background":

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
      Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000
      Host filemap:                                host:memory-file:0x3000 -> PA:0x2fb000
      Host PTE:     VA:0x400000 --------------------------------------------> PA:0x2fb000
    

Thus, from an implementation standpoint, host VMAs serve the same purpose in the sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is independently mutable, and the desired state of host VMAs is derived from that state.

Private Mappings

The sentry implements private mappings consistently with Linux. Before copy-on-write, the private mapping example given in the Background results in:

Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
  Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
  Host filemap:                                host:memory-file:0x3000 -> PA:0x2fb000
  Host PTE:     VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)

When the application attempts to write to this address, the host kernel delivers information about the resulting page fault to the sentry. Analogous to Linux, the sentry allocates memory to store the mapping's private copy of the file's contents, copies file contents to the allocated memory, and changes the host VMA to map to the private copy. Supposing that the sentry chooses the offset 0x4000 in host file memory-file to store the private copy, the state of the system after copy-on-write is:

Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
  Host VMA:     VA:0x400000 -----------------> host:memory-file:0x4000
  Host filemap:                                host:memory-file:0x4000 -> PA:0x5ea000
  Host PTE:     VA:0x400000 --------------------------------------------> PA:0x5ea000

However, this highlights an important difference between Linux and the sentry. In Linux, page tables are concrete (architecture-dependent) data structures owned by the kernel. Conversely, the sentry has the ability to create and destroy host VMAs using host system calls, but it does not have direct access to their state. Thus, as written, if the application invokes the munmap system call to remove the sentry VMA, it is non-trivial for the sentry to determine that it should deallocate host:memory-file:0x4000. This implies that the sentry must retain information about the host VMAs that it has created.

Anonymous Mappings

The sentry implements anonymous mappings consistently with Linux, except that there is no shared zero page.

Implementation Constructs

In Linux:

  • A virtual address space is represented by struct mm_struct.

  • VMAs are represented by struct vm_area_struct, stored in struct mm_struct::mmap.

  • Mappings from file offsets to physical memory are stored in struct address_space.

  • Reverse mappings from file offsets to virtual mappings are stored in struct address_space::i_mmap.

  • Physical memory pages are represented by a pointer to struct page or an index called a page frame number (PFN), represented by pfn_t.

  • PTEs are represented by architecture-dependent type pte_t, stored in a table hierarchy rooted at struct mm_struct::pgd.

In the sentry:


  1. Memory mappings to non-files are discussed in later sections. ↩︎

  2. Modulo files with special mmap semantics such as /dev/zero. ↩︎

  3. The sentry could force the host kernel to establish PTEs when it creates the host VMA by passing the MAP_POPULATE flag to the mmap system call, but usually does not. This is because, to reduce the number of page faults that require handling by the sentry and (correspondingly) the number of host mmap system calls, the sentry usually creates host VMAs that are much larger than the single faulting page. ↩︎