# `runtime.DedicateOSThread`

Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that this may be difficult to support in the Go runtime due to issues with GC.

## Summary

Allow goroutines to bind to kernel threads in a way that allows their scheduling to be kernel-managed rather than runtime-managed.

## Objectives

*   Reduce Go runtime overhead in the gVisor sentry (#2184).

*   Minimize intrusiveness of changes to the Go runtime.

## Background

In Go, execution contexts are referred to as goroutines, which the runtime calls Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the runtime) on which Gs are executed, as well as a pool of "virtual processors" (called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually, each M requires a P in order to execute Gs, limiting the number of concurrently executing goroutines to `runtime.GOMAXPROCS()`.
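
For concreteness, the following minimal sketch (illustrative only, not part of the proposal) shows the G/M/P relationship from the API's point of view: far more Gs may exist than Ps, and the runtime multiplexes them onto the Ms that hold the available Ps.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// GOMAXPROCS(0) queries the current limit without changing it; it bounds
	// the number of Ps, and hence the number of Gs that can execute Go code
	// simultaneously.
	numP := runtime.GOMAXPROCS(0)
	fmt.Println("Ps available:", numP)

	// Many more Gs than Ps may exist; the runtime scheduler multiplexes them
	// onto the Ms that hold the available Ps.
	var wg sync.WaitGroup
	for i := 0; i < 8*numP; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
		}()
	}
	wg.Wait()
}
```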

The `runtime.LockOSThread` function temporarily locks the invoking goroutine to its current thread. It is primarily useful for interacting with OS or non-Go library facilities that are per-thread. It does not reduce interactions with the Go runtime scheduler: locked Ms relinquish their P when they become blocked, and only continue execution after another M "chooses" their locked G to run and donates its P to the locked M instead.
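
As an illustration of the existing API (not sentry code), the sketch below uses `runtime.LockOSThread` to keep per-thread OS state, here just the thread ID returned by `gettid(2)`, stable for the duration of a critical section:

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// lockedTID locks the calling goroutine to its current M so that per-thread
// OS state (here, just the thread ID) cannot change underneath it, then
// unlocks on return. Linux-only, since it uses gettid(2).
func lockedTID() int {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	return syscall.Gettid()
}

func main() {
	fmt.Println("running on thread", lockedTID())
}
```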

## Problems

### Context Switch Overhead

Most goroutines in the gVisor sentry are task goroutines, which back application threads. Task goroutines spend large amounts of time blocked on syscalls that execute untrusted application code. When invoking said syscall (which varies by gVisor platform), the task goroutine may interact with the Go runtime in one of three ways:

*   It can invoke the syscall without informing the runtime. In this case, the task goroutine will continue to hold its P during the syscall, limiting the number of application threads that can run concurrently to `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler is known to scale poorly with `GOMAXPROCS`; see #1942 and https://github.com/golang/go/issues/28808. It also means that preemption of application threads must be driven by sentry or runtime code, which is strictly slower than kernel-driven preemption (since the sentry must invoke another syscall to preempt the application thread).

*   It can call `runtime.entersyscallblock` before invoking the syscall, and `runtime.exitsyscall` after the syscall returns. In this case, the task goroutine will release its P while the syscall is executing. This allows the number of threads concurrently executing application code to exceed `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)` syscall) and on syscall exit (to acquire a new P). It also drastically increases the number of threads that concurrently interact with the runtime scheduler, which is also problematic for performance (both in terms of CPU utilization and in terms of context switch latency); see #205.

*   It can call `runtime.entersyscall` before invoking the syscall, and `runtime.exitsyscall` after the syscall returns. In this case, the task goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to steal it on behalf of another M after a 20us delay. This mitigates the context switch latency problem when there are few task goroutines and the interval between switches to application code (i.e. the interval between application syscalls, page faults, or signal delivery) is short. (Cynically, this means that it's most effective in microbenchmarks.) However, the delay before a P is stolen can also be problematic for performance when there are both many task goroutines switching to application code (lazily releasing their Ps) and many task goroutines switching to sentry code (contending for Ps), which is likely in larger heterogeneous workloads. A short code sketch contrasting this option with the first follows this list.
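
To make the first and third options concrete, the standard `syscall` package already exposes both behaviors. The sketch below is illustrative only (these are not the platform syscalls the sentry actually issues): `syscall.RawSyscall` does not inform the runtime, while `syscall.Syscall` brackets the syscall with `entersyscall`/`exitsyscall`.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// RawSyscall does not notify the Go runtime: the calling M keeps its P
	// for the duration of the syscall (the first option above).
	pid, _, _ := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)

	// Syscall calls runtime.entersyscall/exitsyscall around the syscall, so
	// the P is "lazily released" and may be stolen by sysmon if the syscall
	// blocks for more than ~20us (the third option above).
	tid, _, _ := syscall.Syscall(syscall.SYS_GETTID, 0, 0, 0)

	fmt.Println("pid:", pid, "tid:", tid)
}
```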

### Blocking Overhead

Task goroutines block on behalf of application syscalls like `futex` and `epoll_wait` by receiving from a Go channel. (Future work may convert task goroutine blocking to use the `syncevent` package to avoid overhead associated with channels and `select`, but this does not change how blocking interacts with the Go runtime scheduler.)
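
A minimal sketch of this blocking pattern follows; the `waiter` type and its methods are illustrative stand-ins, not the sentry's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// waiter models how a task goroutine blocks on behalf of an application
// syscall: it parks on a channel until some other goroutine wakes it.
type waiter struct {
	wake chan struct{}
}

func newWaiter() *waiter {
	return &waiter{wake: make(chan struct{}, 1)}
}

// block receives from the channel, parking the calling goroutine in the Go
// runtime scheduler until wakeup() sends (analogous to FUTEX_WAIT).
func (w *waiter) block() {
	<-w.wake
}

// wakeup performs a non-blocking send, making the blocked goroutine runnable
// again (analogous to FUTEX_WAKE).
func (w *waiter) wakeup() {
	select {
	case w.wake <- struct{}{}:
	default:
	}
}

func main() {
	w := newWaiter()
	go func() {
		time.Sleep(10 * time.Millisecond)
		w.wakeup()
	}()
	w.block()
	fmt.Println("woken")
}
```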

If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then when the task goroutine is unblocked (e.g. by an application `FUTEX_WAKE`, signal delivery, or a timeout) by a send on the channel it is blocked on, `runtime.ready` migrates the unblocked G to the unblocking P. In most cases, this implies that every application thread block/unblock cycle results in a migration of the task goroutine between Ps, and therefore Ms, and therefore cores, resulting in reduced application performance due to loss of CPU caches. Furthermore, in most cases, the unblocking P cannot immediately switch to the unblocked G (instead resuming execution of its current application thread after completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often requiring that another P steal the unblocked G before it can resume execution.

If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the G will remain locked to its M, avoiding the core migration described above; however, wakeup latency is significantly increased since, as described in "Background", the G still needs to be selected by the scheduler before it can run, and the M that selects the G then needs to transfer its P to the locked M, incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.

## Proposal

We propose to add a function, tentatively called `DedicateOSThread`, to the Go `runtime` package, documented as follows:

```go
// DedicateOSThread wires the calling goroutine to its current operating system
// thread, and exempts it from counting against GOMAXPROCS. The calling
// goroutine will always execute in that thread, and no other goroutine will
// execute in it, until the calling goroutine has made as many calls to
// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
// without unlocking the thread, the thread will be terminated.
//
// DedicateOSThread should only be used by long-lived goroutines that usually
// block due to blocking system calls, rather than interaction with other
// goroutines.
func DedicateOSThread()
```
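
For illustration, a task goroutine using the proposed API might be structured as in the sketch below. This is purely hypothetical: `DedicateOSThread`/`UndedicateOSThread` do not exist in any Go runtime today, and `task`, `runApp`, and `handle` are stand-ins for the sentry's task loop and platform-specific switch to application code.

```go
// Hypothetical sketch only; it does not compile against any released Go
// runtime, since DedicateOSThread/UndedicateOSThread are the proposed
// additions, and task/runApp/handle are placeholders.
func (t *task) run() {
	runtime.DedicateOSThread()
	defer runtime.UndedicateOSThread()

	for {
		// Switch to untrusted application code (a platform-dependent
		// syscall). The dedicated M and P stay bound to this goroutine while
		// it is blocked, rather than being handed off to other Gs.
		ev := t.runApp()

		// Handle the resulting syscall, fault, or signal in the sentry.
		if done := t.handle(ev); done {
			return
		}
	}
}
```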

Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the invoking G to an M), but additionally locks the invoking M to a P. Ps locked by `DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number of locked Ps (plus some slack to avoid frequent changes to `runtime.allp`). Corollaries:

*   If `runtime.ready` observes that a readied G is locked to an M that is itself locked to a P, it immediately wakes the locked M without migrating the G to the readying P or waiting for a future call to `runtime.schedule` to select the readied G in `runtime.findrunnable`.

*   `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.

*   sysmon does not attempt to preempt Gs that are locked to Ps, avoiding fruitless overhead from `tgkill` syscalls and signal delivery.

*   `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that unlocked Ps be tracked in a separate array). `runtime.findrunnable` on locked Ps skips the global run queue, work stealing, and possibly netpoll.

*   New goroutines created by goroutines with locked Ps are enqueued on the global run queue rather than the invoking P's local run queue.

While gVisor's use case does not strictly require that the association be reversible (via `runtime.UndedicateOSThread`), such a feature is required to allow reuse of locked Ms, which is likely to be critical for performance.

## Alternatives Considered

*   Make the runtime scale well with `GOMAXPROCS`. While we are also concurrently investigating this problem, this would not address the issues of increased preemption cost or blocking overhead.

*   Make the runtime scale well with the number of Ms. It is unclear whether this is actually feasible, and it would not address blocking overhead.

*   Make P-locking part of `LockOSThread`'s behavior. This would likely introduce performance regressions in existing uses of `LockOSThread` that do not fit this usage pattern. In particular, since `DedicateOSThread` transitions the invoker's P from "counted against `GOMAXPROCS`" to "not counted against `GOMAXPROCS`", it may need to wake another M to run a new P (that is counted against `GOMAXPROCS`), and the converse applies to `UndedicateOSThread`.

*   Rewrite the gVisor sentry in a language that does not force userspace scheduling. This is a last resort due to the amount of code involved.

The proposed functionality is directly analogous to `spawn_blocking` in the Rust async runtimes `async_std` and `tokio`.

Outside of gVisor: