281 lines
14 KiB
Markdown
281 lines
14 KiB
Markdown
This package provides an emulation of Linux semantics for application virtual
|
|
memory mappings.
|
|
|
|
For completeness, this document also describes aspects of the memory management
|
|
subsystem defined outside this package.
|
|
|
|
# Background
|
|
|
|
We begin by describing semantics for virtual memory in Linux.
|
|
|
|
A virtual address space is defined as a collection of mappings from virtual
|
|
addresses to physical memory. However, userspace applications do not configure
|
|
mappings to physical memory directly. Instead, applications configure memory
|
|
mappings from virtual addresses to offsets into a file using the `mmap` system
|
|
call.[^mmap-anon] For example, a call to:
|
|
|
|
mmap(
|
|
/* addr = */ 0x400000,
|
|
/* length = */ 0x1000,
|
|
PROT_READ | PROT_WRITE,
|
|
MAP_SHARED,
|
|
/* fd = */ 3,
|
|
/* offset = */ 0);
|
|
|
|
creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
|
|
0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within
|
|
the Linux kernel, virtual memory mappings are represented by *virtual memory
|
|
areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
|
|
virtual memory subsystem after the `mmap` call may be depicted as:
|
|
|
|
VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
|
|
Establishing a virtual memory area does not necessarily establish a mapping to a
|
|
physical address, because Linux has not necessarily provisioned physical memory
|
|
to store the file's contents. Thus, if the application attempts to read the
|
|
contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
|
|
forces the kernel to create such a mapping to service the read.
|
|
|
|
For a file, doing so consists of several logical phases:
|
|
|
|
1. The kernel allocates physical memory to store the contents of the required
|
|
part of the file, and copies file contents to the allocated memory.
|
|
Supposing that the kernel chooses the physical memory at physical address
|
|
(PA) 0x2fb000, the resulting state of the system is:
|
|
|
|
VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
Filemap: /tmp/foo:0x0 -> PA:0x2fb000
|
|
|
|
(In Linux the state of the mapping from file offset to physical memory is
|
|
stored in `struct address_space`, but to avoid confusion with other notions
|
|
of address space we will refer to this system as filemap, named after Linux
|
|
kernel source file `mm/filemap.c`.)
|
|
|
|
2. The kernel stores the effective mapping from virtual to physical address in
|
|
a *page table entry* (PTE) in the application's *page tables*, which are
|
|
used by the CPU's virtual memory hardware to perform address translation.
|
|
The resulting state of the system is:
|
|
|
|
VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
Filemap: /tmp/foo:0x0 -> PA:0x2fb000
|
|
PTE: VA:0x400000 -----------------> PA:0x2fb000
|
|
|
|
The PTE is required for the application to actually use the contents of the
|
|
mapped file as virtual memory. However, the PTE is derived from the VMA and
|
|
filemap state, both of which are independently mutable, such that mutations
|
|
to either will affect the PTE. For example:
|
|
|
|
- The application may remove the VMA using the `munmap` system call. This
|
|
breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
|
|
the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
|
|
necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
|
|
future mapping of the same file offset may reuse this physical memory.
|
|
|
|
- The application may invalidate the file's contents by passing a length
|
|
of 0 to the `ftruncate` system call. This breaks the mapping from
|
|
/tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
|
|
VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
|
|
VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
|
|
may again be made visible at VA:0x400000 after another page fault
|
|
results in the allocation of a new physical address.
|
|
|
|
Note that, in order to correctly break the mapping from VA:0x400000 to
|
|
PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
|
|
from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
|
|
|
|
[^mmap-anon]: Memory mappings to non-files are discussed in later sections.
|
|
|
|
## Private Mappings
|
|
|
|
The preceding example considered VMAs created using the `MAP_SHARED` flag, which
|
|
means that PTEs derived from the mapping should always use physical memory that
|
|
represents the current state of the mapped file.[^mmap-dev-zero] Applications
|
|
can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
|
|
Private mappings are *copy-on-write*.
|
|
|
|
Suppose that the application instead created a private mapping in the previous
|
|
example. In Linux, the state of the system after a read page fault would be:
|
|
|
|
VMA: VA:0x400000 -> /tmp/foo:0x0 (private)
|
|
Filemap: /tmp/foo:0x0 -> PA:0x2fb000
|
|
PTE: VA:0x400000 -----------------> PA:0x2fb000 (read-only)
|
|
|
|
Now suppose the application attempts to write to VA:0x400000. For a shared
|
|
mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
|
|
responsible for ensuring that the write is later propagated to the mapped file.
|
|
For a private mapping, the write incurs another page fault since the PTE is
|
|
marked read-only. In response, the kernel allocates physical memory to store the
|
|
mapping's *private copy* of the file's contents, copies file contents to the
|
|
allocated memory, and changes the PTE to map to the private copy. Supposing that
|
|
the kernel chooses the physical memory at physical address (PA) 0x5ea000, the
|
|
resulting state of the system is:
|
|
|
|
VMA: VA:0x400000 -> /tmp/foo:0x0 (private)
|
|
Filemap: /tmp/foo:0x0 -> PA:0x2fb000
|
|
PTE: VA:0x400000 -----------------> PA:0x5ea000
|
|
|
|
Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
|
|
but is now irrelevant to this mapping.
|
|
|
|
[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.
|
|
|
|
## Anonymous Mappings
|
|
|
|
Instead of passing a file to the `mmap` system call, applications can instead
|
|
request an *anonymous* mapping by passing the `MAP_ANONYMOUS` flag.
|
|
Semantically, an anonymous mapping is essentially a mapping to an ephemeral file
|
|
initially filled with zero bytes. Practically speaking, this is how shared
|
|
anonymous mappings are implemented, but private anonymous mappings do not result
|
|
in the creation of an ephemeral file; since there would be no way to modify the
|
|
contents of the underlying file through a private mapping, all private anonymous
|
|
mappings use a single shared page filled with zero bytes until copy-on-write
|
|
occurs.
|
|
|
|
# Virtual Memory in the Sentry
|
|
|
|
The sentry implements application virtual memory atop a host kernel, introducing
|
|
an additional level of indirection to the above.
|
|
|
|
Consider the same scenario as in the previous section. Since the sentry handles
|
|
application system calls, the effect of an application `mmap` system call is to
|
|
create a VMA in the sentry (as opposed to the host kernel):
|
|
|
|
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
|
|
When the application first incurs a page fault on this address, the host kernel
|
|
delivers information about the page fault to the sentry in a platform-dependent
|
|
manner, and the sentry handles the fault:
|
|
|
|
1. The sentry allocates memory to store the contents of the required part of
|
|
the file, and copies file contents to the allocated memory. However, since
|
|
the sentry is implemented atop a host kernel, it does not configure mappings
|
|
to physical memory directly. Instead, mappable "memory" in the sentry is
|
|
represented by a host file descriptor and offset, since (as noted in
|
|
"Background") this is the memory mapping primitive provided by the host
|
|
kernel. In general, memory is allocated from a temporary host file using the
|
|
`pgalloc` package. Supposing that the sentry allocates offset 0x3000 from
|
|
host file "memory-file", the resulting state is:
|
|
|
|
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
|
|
|
|
2. The sentry stores the effective mapping from virtual address to host file in
|
|
a host VMA by invoking the `mmap` system call:
|
|
|
|
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
|
|
Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000
|
|
|
|
3. The sentry returns control to the application, which immediately incurs the
|
|
page fault again.[^mmap-populate] However, since a host VMA now exists for
|
|
the faulting virtual address, the host kernel now handles the page fault as
|
|
described in "Background":
|
|
|
|
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
|
|
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
|
|
Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000
|
|
Host filemap: host:memory-file:0x3000 -> PA:0x2fb000
|
|
Host PTE: VA:0x400000 --------------------------------------------> PA:0x2fb000
|
|
|
|
Thus, from an implementation standpoint, host VMAs serve the same purpose in the
|
|
sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
|
|
independently mutable, and the desired state of host VMAs is derived from that
|
|
state.
|
|
|
|
[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
|
|
it creates the host VMA by passing the `MAP_POPULATE` flag to
|
|
the `mmap` system call, but usually does not. This is because,
|
|
to reduce the number of page faults that require handling by
|
|
the sentry and (correspondingly) the number of host `mmap`
|
|
system calls, the sentry usually creates host VMAs that are
|
|
much larger than the single faulting page.
|
|
|
|
## Private Mappings
|
|
|
|
The sentry implements private mappings consistently with Linux. Before
|
|
copy-on-write, the private mapping example given in the Background results in:
|
|
|
|
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 (private)
|
|
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
|
|
Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
|
|
Host filemap: host:memory-file:0x3000 -> PA:0x2fb000
|
|
Host PTE: VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)
|
|
|
|
When the application attempts to write to this address, the host kernel delivers
|
|
information about the resulting page fault to the sentry. Analogous to Linux,
|
|
the sentry allocates memory to store the mapping's private copy of the file's
|
|
contents, copies file contents to the allocated memory, and changes the host VMA
|
|
to map to the private copy. Supposing that the sentry chooses the offset 0x4000
|
|
in host file `memory-file` to store the private copy, the state of the system
|
|
after copy-on-write is:
|
|
|
|
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 (private)
|
|
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
|
|
Host VMA: VA:0x400000 -----------------> host:memory-file:0x4000
|
|
Host filemap: host:memory-file:0x4000 -> PA:0x5ea000
|
|
Host PTE: VA:0x400000 --------------------------------------------> PA:0x5ea000
|
|
|
|
However, this highlights an important difference between Linux and the sentry.
|
|
In Linux, page tables are concrete (architecture-dependent) data structures
|
|
owned by the kernel. Conversely, the sentry has the ability to create and
|
|
destroy host VMAs using host system calls, but it does not have direct access to
|
|
their state. Thus, as written, if the application invokes the `munmap` system
|
|
call to remove the sentry VMA, it is non-trivial for the sentry to determine
|
|
that it should deallocate `host:memory-file:0x4000`. This implies that the
|
|
sentry must retain information about the host VMAs that it has created.
|
|
|
|
## Anonymous Mappings
|
|
|
|
The sentry implements anonymous mappings consistently with Linux, except that
|
|
there is no shared zero page.
|
|
|
|
# Implementation Constructs
|
|
|
|
In Linux:
|
|
|
|
- A virtual address space is represented by `struct mm_struct`.
|
|
|
|
- VMAs are represented by `struct vm_area_struct`, stored in `struct
|
|
mm_struct::mmap`.
|
|
|
|
- Mappings from file offsets to physical memory are stored in `struct
|
|
address_space`.
|
|
|
|
- Reverse mappings from file offsets to virtual mappings are stored in `struct
|
|
address_space::i_mmap`.
|
|
|
|
- Physical memory pages are represented by a pointer to `struct page` or an
|
|
index called a *page frame number* (PFN), represented by `pfn_t`.
|
|
|
|
- PTEs are represented by architecture-dependent type `pte_t`, stored in a
|
|
table hierarchy rooted at `struct mm_struct::pgd`.
|
|
|
|
In the sentry:
|
|
|
|
- A virtual address space is represented by type [`mm.MemoryManager`][mm].
|
|
|
|
- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
|
|
`mm.MemoryManager.vmas`.
|
|
|
|
- Mappings from sentry file offsets to host file offsets are abstracted
|
|
through interface method [`memmap.Mappable.Translate`][memmap].
|
|
|
|
- Reverse mappings from sentry file offsets to virtual mappings are abstracted
|
|
through interface methods
|
|
[`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].
|
|
|
|
- Host files that may be mapped into host VMAs are represented by type
|
|
[`platform.File`][platform].
|
|
|
|
- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
|
|
mapping area"), stored in `mm.MemoryManager.pmas`.
|
|
|
|
- Creation and destruction of host VMAs is abstracted through interface
|
|
methods
|
|
[`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].
|
|
|
|
[memmap]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/memmap/memmap.go
|
|
[mm]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/mm/mm.go
|
|
[pgalloc]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/pgalloc/pgalloc.go
|
|
[platform]: https://github.com/google/gvisor/blob/master/+/master/pkg/sentry/platform/platform.go
|