gvisor/pkg/sentry/kernel
Jamie Liu ff8b308a30 Remove call to Notify from pipe.VFSPipeFD.CopyOutFrom.
This was missed in cl/351911375; pipe.VFSPipeFD.SpliceFromNonPipe already calls
Notify.

PiperOrigin-RevId: 355246655
2021-02-02 14:01:22 -08:00
auth Fix simple mistakes identified by goreportcard. 2021-01-12 12:38:22 -08:00
contexttest Update package locations. 2020-01-27 15:31:32 -08:00
epoll Implement `fcntl` options `F_GETSIG` and `F_SETSIG`. 2020-12-03 06:20:29 -08:00
eventfd Plumbing context.Context to DecRef() and Release(). 2020-08-03 13:36:05 -07:00
fasync Implement `fcntl` options `F_GETSIG` and `F_SETSIG`. 2020-12-03 06:20:29 -08:00
futex Plumbing context.Context to DecRef() and Release(). 2020-08-03 13:36:05 -07:00
g3doc
memevent Standardize on tools directory. 2020-01-27 12:21:00 -08:00
pipe Remove call to Notify from pipe.VFSPipeFD.CopyOutFrom. 2021-02-02 14:01:22 -08:00
sched Standardize on tools directory. 2020-01-27 12:21:00 -08:00
semaphore Implement command SEM_INFO and SEM_STAT for semctl. 2020-12-15 16:06:06 -08:00
shm Do not check for reference leaks after saving. 2020-12-14 10:47:01 -08:00
signalfd Remove existing nogo exceptions. 2020-12-11 12:06:49 -08:00
time Consistent precondition formatting 2020-08-20 13:32:24 -07:00
BUILD Implement F_GETLK fcntl. 2021-01-22 13:58:16 -08:00
README.md
abstract_socket_namespace.go Rewrite reference leak checker without finalizers. 2020-10-23 09:17:02 -07:00
aio.go Check for misuse of kernel.Task as context.Context. 2020-11-13 14:47:47 -08:00
context.go Check for misuse of kernel.Task as context.Context. 2020-11-13 14:47:47 -08:00
fd_table.go Do not generate extraneous IN_CLOSE inotify events. 2021-01-26 00:02:52 -08:00
fd_table_test.go Fix panic when calling dup2(). 2020-09-01 13:41:01 -07:00
fd_table_unsafe.go Do not unconditionally allocate in kernel.FDTable.setAll(). 2020-12-02 19:40:43 -08:00
fs_context.go Initialize references with a value of 1. 2020-11-09 08:33:17 -08:00
ipc_namespace.go Initialize references with a value of 1. 2020-11-09 08:33:17 -08:00
kcov.go Fix reference counting on kcov mappings. 2020-10-19 18:09:39 -07:00
kcov_unsafe.go Fix reference counting on kcov mappings. 2020-10-19 18:09:39 -07:00
kernel.go Implement error on pointers 2021-01-26 13:03:40 -08:00
kernel_opts.go Add notes to relevant tests. 2020-02-05 22:46:35 -08:00
kernel_state.go Update canonical repository. 2019-06-13 16:50:15 -07:00
pending_signals.go Update canonical repository. 2019-06-13 16:50:15 -07:00
pending_signals_state.go Update canonical repository. 2019-06-13 16:50:15 -07:00
posixtimer.go Disable cpuClockTicker when app is idle 2019-10-01 12:21:01 -07:00
ptrace.go Remove existing nogo exceptions. 2020-12-11 12:06:49 -08:00
ptrace_amd64.go Enable automated marshalling for the syscall package. 2020-09-15 23:38:57 -07:00
ptrace_arm64.go Update package locations. 2020-01-27 15:31:32 -08:00
rseq.go Consistent precondition formatting 2020-08-20 13:32:24 -07:00
seccomp.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
sessions.go Initialize references with a value of 1. 2020-11-09 08:33:17 -08:00
signal.go Remove existing nogo exceptions. 2020-12-11 12:06:49 -08:00
signal_handlers.go New sync package. 2020-01-09 22:02:24 -08:00
syscalls.go Add support for OCI seccomp filters in the sandbox. 2020-09-15 23:19:17 -07:00
syscalls_state.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
syslog.go Add a few syslog messages. 2020-11-18 11:46:23 -08:00
table_test.go Update canonical repository. 2019-06-13 16:50:15 -07:00
task.go Check for misuse of kernel.Task as context.Context. 2020-11-13 14:47:47 -08:00
task_acct.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
task_block.go Define tcpip.Payloader in terms of io.Reader 2021-01-22 12:26:09 -08:00
task_clone.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
task_context.go Check for misuse of kernel.Task as context.Context. 2020-11-13 14:47:47 -08:00
task_exec.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
task_exit.go Remove existing nogo exceptions. 2020-12-11 12:06:49 -08:00
task_futex.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
task_identity.go Add logging message for noNewPrivileges OCI option. 2020-04-10 20:32:23 -07:00
task_image.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
task_log.go Fix misuses of kernel.Task as context.Context. 2020-11-12 18:22:40 -08:00
task_net.go Initial network namespace support. 2020-02-20 15:20:40 -08:00
task_run.go Log task goroutine IDs in the sentry watchdog. 2020-11-13 18:10:55 -08:00
task_sched.go Reset watchdog timer between sendfile() iterations. 2020-11-16 18:55:24 -08:00
task_signals.go Remove existing nogo exceptions. 2020-12-11 12:06:49 -08:00
task_start.go Rename kernel.TaskContext to kernel.TaskImage. 2020-11-12 17:39:19 -08:00
task_stop.go Consistent precondition formatting 2020-08-20 13:32:24 -07:00
task_syscall.go Enable automated marshalling for the syscall package. 2020-09-15 23:38:57 -07:00
task_test.go Update canonical repository. 2019-06-13 16:50:15 -07:00
task_usermem.go Separate kernel.Task.AsCopyContext() into CopyContext() and OwnCopyContext(). 2020-10-30 13:54:47 -07:00
task_work.go Add task work mechanism. 2020-07-23 16:25:34 -07:00
thread_group.go [vfs2] Fix fork reference leaks. 2020-10-19 13:20:13 -07:00
threads.go Implement F_GETLK fcntl. 2021-01-22 13:58:16 -08:00
timekeeper.go Move platform.File in memmap 2020-07-27 11:59:10 -07:00
timekeeper_state.go Update canonical repository. 2019-06-13 16:50:15 -07:00
timekeeper_test.go Update package locations. 2020-01-27 15:31:32 -08:00
tty.go Fix "unlock of unlocked mutex" crash when getting tty 2020-01-15 13:00:59 +08:00
uncaught_signal.proto Change copyright notice to "The gVisor Authors" 2019-04-29 14:26:23 -07:00
uts_namespace.go New sync package. 2020-01-09 22:02:24 -08:00
vdso.go Fix more nogo tests 2020-11-03 15:23:32 -08:00
version.go Change copyright notice to "The gVisor Authors" 2019-04-29 14:26:23 -07:00

README.md

This package contains:

  • A (partial) emulation of the "core Linux kernel", which governs task execution and scheduling, system call dispatch, and signal handling. See below for details.

  • The top-level interface for the sentry's Linux kernel emulation in general, used by the main function of all versions of the sentry. This interface revolves around the Env type (defined in kernel.go).

Background

In Linux, each schedulable context is referred to interchangeably as a "task" or "thread". Tasks can be divided into userspace and kernel tasks. In the sentry, scheduling is managed by the Go runtime, so each schedulable context is a goroutine; only "userspace" (application) contexts are referred to as tasks, and these are represented by Task objects. (From this point forward, "task" refers to the sentry's notion of a task unless otherwise specified.)

At a high level, Linux application threads can be thought of as repeating a "run loop":

  • Some amount of application code is executed in userspace.

  • A trap (explicit syscall invocation, hardware interrupt or exception, etc.) causes control flow to switch to the kernel.

  • Some amount of kernel code is executed in kernelspace, e.g. to handle the cause of the trap.

  • The kernel "returns from the trap" into application code.

Analogously, each task in the sentry is associated with a task goroutine that executes that task's run loop (Task.run in task_run.go). However, the sentry's task run loop differs in structure in order to support saving execution state to, and resuming execution from, checkpoints.

While in kernelspace, a Linux thread can be descheduled (cease execution) in a variety of ways:

  • It can yield or be preempted, becoming temporarily descheduled but still runnable. At present, the sentry delegates scheduling of runnable threads to the Go runtime.

  • It can exit, becoming permanently descheduled. The sentry's equivalent is returning from Task.run, terminating the task goroutine.

  • It can enter interruptible sleep, a state in which it can be woken by a caller-defined wakeup or the receipt of a signal. In the sentry, interruptible sleep (which is ambiguously referred to as blocking) is implemented by communicating all events that can end blocking (including signal notifications) via Go channels and using select to multiplex wakeup sources; see task_block.go.

  • It can enter uninterruptible sleep, a state in which it can only be woken by a caller-defined wakeup. Killable sleep is a closely related variant in which the task can also be woken by SIGKILL. (These definitions also include Linux's "group-stopped" (TASK_STOPPED) and "ptrace-stopped" (TASK_TRACED) states.)
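The select-based approach to interruptible sleep described above can be sketched roughly as follows. This is an illustration only, not the code in task_block.go: the function block, the error errInterrupted, and both channels are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errInterrupted is a hypothetical error reported when an interrupt
// (e.g. a signal notification) ends the sleep before the awaited event.
var errInterrupted = errors.New("interrupted")

// block sleeps until either the caller-defined wakeup (event) or an
// interrupt notification arrives, whichever is ready first.
func block(event <-chan struct{}, interrupt <-chan struct{}) error {
	select {
	case <-event:
		return nil
	case <-interrupt:
		return errInterrupted
	}
}

func main() {
	event := make(chan struct{}, 1)
	interrupt := make(chan struct{}, 1)

	// Only the caller-defined wakeup is pending.
	event <- struct{}{}
	fmt.Println(block(event, interrupt)) // <nil>

	// Only the interrupt is pending.
	interrupt <- struct{}{}
	fmt.Println(block(event, interrupt)) // interrupted
}
```

Because every wakeup source is a channel, adding another source (a timeout, for instance) would only mean adding another case to the select.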

To maximize compatibility with Linux, sentry checkpointing appears as a spurious signal-delivery interrupt on all tasks; interrupted system calls return EINTR or are automatically restarted as usual. However, these semantics require that uninterruptible and killable sleeps do not appear to be interrupted. In other words, the state of the task, including its progress through the interrupted operation, must be preserved by checkpointing. For many such sleeps, the wakeup condition is application-controlled, making it infeasible to wait for the sleep to end before checkpointing. Instead, we must support checkpointing progress through sleeping operations.

Implementation

We break the task's control flow graph into states, delimited by:

  1. Points where uninterruptible and killable sleeps may occur. For example, there exists a state boundary between signal dequeueing and signal delivery because there may be an intervening ptrace signal-delivery-stop.

  2. Points where sleep-induced branches may "rejoin" normal execution. For example, the syscall exit state exists because it can be reached immediately following a synchronous syscall, or after a task that is sleeping in execve() or vfork() resumes execution.

  3. Points containing large branches. This is strictly for organizational purposes. For example, the state that processes interrupt-signaled conditions is kept separate from the main "app" state to reduce the size of the latter.

  4. SyscallReinvoke, which does not correspond to anything in Linux, and exists solely to serve the autosave feature.

[Figure: task run-state graph, rendered with: dot -Tpng -Goverlap=false -orun_states.png run_states.dot]

States before which a stop may occur are represented as implementations of the taskRunState interface named run(state), allowing them to be saved and restored. States that cannot be immediately preceded by a stop are simply Task methods named do(state).
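A minimal sketch of this split between run(state) types and do(state) methods is shown below. It assumes a single-method interface in which each state returns its successor; the runState interface, the task type, and the two states here are simplified, hypothetical stand-ins rather than the sentry's actual definitions.

```go
package main

import "fmt"

// task is a hypothetical, heavily simplified stand-in for the sentry's Task.
type task struct {
	trace        []string // record of the states visited
	syscallsLeft int      // syscalls the pretend application will make
}

// runState is modeled on the taskRunState interface named in the text.
// The method name and signature are assumptions made for this sketch.
type runState interface {
	execute(t *task) runState
}

// runApp and runSyscallExit are states that a stop may precede, so they
// are reified as values that could be saved and restored.
type runApp struct{}
type runSyscallExit struct{}

func (*runApp) execute(t *task) runState {
	t.trace = append(t.trace, "app")
	if t.syscallsLeft == 0 {
		return nil // a nil state ends the run loop
	}
	t.syscallsLeft--
	t.doSyscall()
	return &runSyscallExit{}
}

// doSyscall cannot be immediately preceded by a stop in this sketch, so it
// is a plain method rather than a reified state.
func (t *task) doSyscall() {
	t.trace = append(t.trace, "syscall")
}

func (*runSyscallExit) execute(t *task) runState {
	t.trace = append(t.trace, "syscall-exit")
	return &runApp{}
}

// run advances from state to state until a state returns nil.
func (t *task) run() {
	for s := runState(&runApp{}); s != nil; s = s.execute(t) {
	}
}

func main() {
	tk := &task{syscallsLeft: 1}
	tk.run()
	fmt.Println(tk.trace) // [app syscall syscall-exit app]
}
```

Since the loop holds only the current state value between iterations, recording that value is enough to resume from the same point later, which is the property the reified states exist to provide.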

Conditions that can require task goroutines to cease execution for unknown lengths of time are called stops. Stops are divided into internal stops, which are stops whose start and end conditions are implemented within the sentry, and external stops, which are stops whose start and end conditions are not known to the sentry. Hence all uninterruptible and killable sleeps are internal stops, and the existence of a pending checkpoint operation is an external stop. Internal stops are reified into instances of the TaskStop type, while external stops are merely counted. The task run loop alternates between checking for stops and advancing the task's state. This allows checkpointing to hold tasks in a stopped state while waiting for all tasks in the system to stop.
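The alternation between checking for stops and advancing state might look roughly like the sketch below. It assumes a mutex and condition variable for the bookkeeping; the task fields, the stop interface, vforkStop, and every method name are hypothetical and chosen only to mirror the description above (internal stops held as values, external stops merely counted).

```go
package main

import (
	"fmt"
	"sync"
)

// stop is a hypothetical stand-in for a reified internal stop (the text's
// TaskStop type); its method set here is an assumption.
type stop interface {
	name() string
}

// vforkStop is an illustrative internal stop.
type vforkStop struct{}

func (vforkStop) name() string { return "vfork" }

// task is a simplified, hypothetical model of a task's stop bookkeeping.
type task struct {
	mu       sync.Mutex
	cond     *sync.Cond
	internal stop // current internal stop, or nil if none
	external int  // number of external stops in effect (only counted)
	steps    int  // number of state transitions performed
}

func newTask() *task {
	t := &task{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

func (t *task) beginInternalStop(s stop) {
	t.mu.Lock()
	t.internal = s
	t.mu.Unlock()
}

func (t *task) endInternalStop() {
	t.mu.Lock()
	t.internal = nil
	t.mu.Unlock()
	t.cond.Broadcast()
}

func (t *task) beginExternalStop() {
	t.mu.Lock()
	t.external++
	t.mu.Unlock()
}

func (t *task) endExternalStop() {
	t.mu.Lock()
	t.external--
	t.mu.Unlock()
	t.cond.Broadcast()
}

// waitForStops blocks the task goroutine while any stop is in effect.
func (t *task) waitForStops() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.internal != nil || t.external > 0 {
		t.cond.Wait()
	}
}

// run alternates between checking for stops and advancing the task's state.
func (t *task) run(n int) {
	for i := 0; i < n; i++ {
		t.waitForStops()
		t.steps++ // stands in for executing one run state
	}
}

func main() {
	t := newTask()
	t.beginExternalStop() // e.g. a pending checkpoint

	done := make(chan struct{})
	go func() {
		t.run(3)
		close(done)
	}()

	t.endExternalStop() // the task may now advance
	<-done
	fmt.Println(t.steps) // 3
}
```

In this sketch a checkpoint would call beginExternalStop on every task and then wait for each task goroutine to reach waitForStops; that second half is omitted here.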