diff --git a/website/_config.yml b/website/_config.yml
index 5f1cbbdeb..e394f39b4 100644
--- a/website/_config.yml
+++ b/website/_config.yml
@@ -47,3 +47,9 @@ authors:
   nybidari:
     name: Nayana Bidari
     email: nybidari@google.com
+  jianfengt:
+    name: Jianfeng Tan
+    email: henry.tjf@antgroup.com
+  yonghe:
+    name: Yong He
+    email: chenglang.hy@antgroup.com
diff --git a/website/assets/images/2021-11-11-flamegraph-figure2.png b/website/assets/images/2021-11-11-flamegraph-figure2.png
new file mode 100644
index 000000000..eb8314975
Binary files /dev/null and b/website/assets/images/2021-11-11-flamegraph-figure2.png differ
diff --git a/website/assets/images/2021-11-11-syscall-figure1.png b/website/assets/images/2021-11-11-syscall-figure1.png
new file mode 100644
index 000000000..07d33e636
Binary files /dev/null and b/website/assets/images/2021-11-11-syscall-figure1.png differ
diff --git a/website/blog/2021-11-11-running-gvisor-in-production-at-scale-in-ant.md b/website/blog/2021-11-11-running-gvisor-in-production-at-scale-in-ant.md
new file mode 100644
index 000000000..5c2e451b3
--- /dev/null
+++ b/website/blog/2021-11-11-running-gvisor-in-production-at-scale-in-ant.md
@@ -0,0 +1,413 @@
# Running gVisor in Production at Scale in Ant

>This post was contributed by [Ant Group](https://www.antgroup.com/), a
>large-scale digital payment platform. Jianfeng and Yong are engineers at Ant
>Group working on infrastructure systems, and contributors to gVisor.

At Ant Group, we are committed to keeping online transactions safe and
efficient. Continuously hardening our systems against potential system-level
attacks is one of many measures we take. As a container runtime, gVisor provides
container-native security without sacrificing resource efficiency, so it has
been on our radar since it was released.

However, performance concerns have been raised by members of
[academia](https://www.usenix.org/system/files/hotcloud19-paper-young.pdf)
and [industry](https://news.ycombinator.com/item?id=19924036). Users of gVisor
tend to accept the extra overhead as the tax of security, but we agree
that [security is no excuse for poor performance (see Chapter 6!)](https://sel4.systems/About/seL4-whitepaper.pdf).

In this article, we present how we identified bottlenecks in gVisor and
unblocked large-scale production adoption. Our main focus is the CPU
utilization and latency overhead gVisor brings; a small memory footprint is also
a valued goal, but it is not discussed in this blog. As a result of these
efforts and community improvements, 70% of our applications running on runsc
have <1% overhead; another 25% have <3% overhead. Some of our most valued
applications are the focus of our optimization, and achieve even better
performance than with runc.

The rest of this blog is organized as follows:
- First, we analyze the cost of the different syscall paths in gVisor.
- Then, we propose a way to profile the whole picture of a running instance to
  find out whether slow syscall paths are being hit.
- Next, some invisible overhead in the Go runtime is discussed.
- Finally, we give a short summary on performance optimization, along with some
  other factors to consider for production adoption.

For convenience of discussion, we target KVM-based, or hypervisor-based,
platforms, unless explicitly stated otherwise.

## Cost of different syscall paths

[Defense-in-depth](../../../../2019/11/18/gvisor-security-basics-part-1/#defense-in-depth)
is the key design principle of gVisor.
In gVisor, different syscalls take different paths, leading to costs that differ
by orders of magnitude in latency and CPU consumption. Here are the syscall
paths in gVisor:

![Figure 1](/assets/images/2021-11-11-syscall-figure1.png "Sentry syscall paths.")

### Path 1: User-space vDSO

Sentry provides a [vDSO library](https://github.com/google/gvisor/tree/master/vdso)
for its sandboxed processes. Several syscalls are short-circuited and
implemented in user space, and cost almost as much as on native Linux. Note,
however, that the vDSO library is only partially implemented. We once noticed
that some [syscalls](https://github.com/google/gvisor/issues/3101) in our
environment were not terminated in user space as expected. We added some
implementations to the vDSO, and aim to push these improvements upstream when
possible.

### Path 2: Sentry contained

Most syscalls, e.g., clone(2), are implemented in Sentry. They are basic
abstractions of an operating system, such as process/thread lifecycle,
scheduling, IPC, memory management, etc. These syscalls, and all of the paths
below, suffer from the structural cost of syscall interception. That overhead is
about 800ns, while a native syscall takes about 70ns; we'll dig into it further
below. Syscalls of this kind take several microseconds, which is competitive
with the corresponding native Linux syscalls.

### Path 3: Host-kernel involved

Some resource-related syscalls, e.g., read(2)/write(2), are redirected into the
host kernel. Note that gVisor never passes application syscalls directly through
to the host kernel, for functional and security reasons. So compared to native
Linux, the time spent in Sentry is extra overhead. Another overhead is the way a
host kernel syscall is issued. Let's use the KVM platform on x86_64 as an
example. After Sentry issues the syscall instruction, if it is in GR0, it first
goes to the syscall entrypoint defined in LSTAR, then halts to HR3 (a vmexit
happens here), exits from a signal handler, and executes the syscall instruction
again. We could avoid the "halt to HR3" step by introducing a vmcall here, but
there would still be a syscall trampoline, and the vmexit/vmentry overhead is
not trivial. Nevertheless, this overhead is not that significant.

For some Sentry-contained syscalls in Path 2, although the syscall semantics
terminate in Sentry, they may introduce one or more unexpected exits to the host
kernel. These can be page faults while Sentry runs, or, more likely, schedule
events in the Go runtime, e.g., M idle/wakeup. One example at hand:
futex(FUTEX_WAIT) and epoll_wait(2) can make an M go idle and issue a further
futex call into the host kernel if it does not find any runnable Gs. (See the
comments in https://go.dev/src/runtime/proc.go for further explanation of the
GMP scheduler.)

### Path 4: Gofer involved

Other IO-related syscalls, especially security-sensitive ones, go through
another layer of protection - Gofer. Such a syscall usually involves one or
more Sentry/Gofer inter-process communications. Even with the recent
optimization that replaces P9 with lisafs, this is still the slowest path, and
one we try our best to avoid.

As shown above, some syscall paths are slow by design, and should be identified
and reduced as much as possible. We will come back to identifying them in the
next section; first, let's dig into the details of the structural and
implementation-specific costs of syscalls, because the performance of some
Sentry-contained syscalls is not good enough.

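To get a rough feel for how these paths compare in your own environment, a small
harness along the lines of the sketch below can be run under both runc and runsc
and the numbers compared. This is just an illustration for this post, not part
of gVisor; the file path and iteration count are placeholders, and which path
stat(2) and pread(2) actually take depends on your mount and flag configuration.

```
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// measure reports the average wall-clock latency of fn over n calls.
func measure(name string, n int, fn func()) {
	start := time.Now()
	for i := 0; i < n; i++ {
		fn()
	}
	fmt.Printf("%-8s %8.0f ns/op\n", name,
		float64(time.Since(start).Nanoseconds())/float64(n))
}

func main() {
	const n = 100000
	// Placeholder path: point it at a file on a Gofer-backed mount to
	// exercise the Gofer path.
	path := "/mnt/data/testfile"
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	buf := make([]byte, 4096)
	var st syscall.Stat_t

	measure("getpid", n, func() { syscall.Getpid() })                  // interception + Sentry only
	measure("stat", n, func() { syscall.Stat(path, &st) })             // may reach Gofer
	measure("pread", n, func() { syscall.Pread(int(f.Fd()), buf, 0) }) // may reach the host kernel
}
```

Numbers from such a micro-benchmark only hint at where time goes; the profiling
approach described later in this post gives the full picture.
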
### The structural cost

The first kind of cost is comparatively stable and is introduced by syscall
interception. It is platform-specific, depending on the way syscalls are
intercepted, and whether it matters also depends on the syscall rate of the
sandboxed applications.

Here are benchmark results for the structural cost of a syscall. We collected
the data on an Intel(R) Xeon(R) CPU E5-2650 v2 platform, using the
[getpid benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/getpid_benchmark.cc).
As we can see, on the KVM platform, syscall interception costs more than 10x a
native Linux syscall.

|            |getpid benchmark (ns)|
|------------|---------------------|
|Native      |62                   |
|Native-KPTI |236                  |
|runsc-KVM   |830                  |
|runsc-ptrace|6249                 |

* "Native" stands for a vanilla Linux kernel.

To understand the structural cost of syscall interception, we did a
[quantitative analysis](https://github.com/google/gvisor/issues/2354) on the KVM
platform. According to the analysis, the overhead mainly comes from:

1. KPTI-like CR3 switches: to maintain the address equation of Sentry running
in HR3 and GR0, it has to switch the CR3 register twice, on each user/kernel
switch;

2. The platform's Switch(): Linux is very efficient here, as it just switches to
a per-thread kernel stack and calls the corresponding syscall entry function.
But in Sentry, each task is represented by a goroutine; before calling into
syscall entry functions, it needs to pop the stack to recover the big while
loop, i.e., kernel.(*Task).run.

Can we save the structural cost of syscall interception? This cost is by
design. We can optimize it, for example, by avoiding allocation and map
operations in the switch process, but it cannot be eliminated.

Does the structural cost of syscall interception really matter? It depends on
the syscall rate. Firstly, most applications in our case have a syscall rate
< 200K/sec, and according to flame graphs (which will be described later in this
blog), we see only 2~3% of samples in the switch path. Secondly, most syscalls,
except those as simple as getpid(2), take several microseconds, so in proportion
this is not a significant overhead. However, if you have an elephant RPC (one
that involves many rounds of DB access), or a service behind a long RPC chain,
it brings nontrivial latency overhead.

### The implementation-specific cost

The other kind of cost is implementation-specific. For example, a syscall may
involve heavy malloc operations, or use defer on a frequent syscall path (defer
was optimized in Go 1.14); worse, the application process may trigger a
long-path syscall that involves the host kernel or Gofer.

When we try to optimize the gVisor runtime, we need information about the
sandboxed applications, pod configurations, and runsc internals. But most people
act either as platform engineers or as application engineers, so we need an
easier way to understand the whole picture.

## Performance profile of a running instance

To quickly understand the whole picture of performance, we need a way to
profile a running gVisor instance. As the gVisor sandbox process is essentially
a Go process, Go's standard tooling is a starting point:

* [Go pprof](https://golang.org/pkg/runtime/pprof/) - provides CPU and heap
  profiles through [runsc debug subcommands](https://gvisor.dev/docs/user_guide/debugging/#profiling).
* [Go trace](https://golang.org/pkg/runtime/trace/) - provides more internal
  profile types like synchronization blocking and scheduler latency.

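For orientation, the sketch below shows what these profiles boil down to for a
plain Go program, using runtime/pprof directly; runsc exposes them through the
debug subcommands listed above, so you would not add code like this to the
sandbox yourself. The file names and the busy-work loop are placeholders.

```
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// CPU profile: sampled via the SIGPROF signal while enabled.
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer cpu.Close()
	if err := pprof.StartCPUProfile(cpu); err != nil {
		log.Fatal(err)
	}
	work(5 * time.Second) // stand-in for the workload being profiled
	pprof.StopCPUProfile()

	// Heap profile: a snapshot of live allocations at this point.
	heap, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer heap.Close()
	if err := pprof.WriteHeapProfile(heap); err != nil {
		log.Fatal(err)
	}
}

// work burns CPU and allocates so the profiles have something to show.
func work(d time.Duration) {
	deadline := time.Now().Add(d)
	sink := 0
	for time.Now().Before(deadline) {
		buf := make([]byte, 64<<10)
		for i := range buf {
			sink += int(buf[i])
		}
	}
	_ = sink
}
```

The resulting cpu.pprof and heap.pprof files can then be explored with
go tool pprof.
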
Unfortunately, Go pprof and Go trace only provide hot spots in Sentry, not the
whole picture (how much time is spent in GR3 and HR0). In addition, the CPU
profile relies on the [SIGPROF signal](https://golang.org/pkg/runtime/pprof/),
which may not be accurate enough.

[perf-kvm](https://www.linux-kvm.org/page/Perf_events) cannot provide what we
need either. It can top/record/stat some information in the guest with the help
of the [--guestkallsyms] option, but it cannot analyze the call chain (which is
not supported in the host kernel; see Linux's perf_callchain_kernel).

### Perf the sandbox process like a normal process

So we turn to a nice virtual address equation in Sentry: [(GR0 VA) = (HR3 VA)].
It makes sure that any pointer in HR3 can be used directly in GR0.

This equation helps solve the problem: we can profile Sentry just like a normal
HR3 process, with a little hack on kvm.

- First, as said above, Linux does not support analyzing the call chain of the
guest. So change [is_in_guest] to pretend that it runs in host mode even when
it's in guest mode. This can be done in
[kvm_is_in_guest](https://github.com/torvalds/linux/blob/v4.19/arch/x86/kvm/x86.c#L6560):

```
 int kvm_is_in_guest(void)
 {
-	return __this_cpu_read(current_vcpu) != NULL;
+	return 0;
 }
```

- Secondly, change the guest profiling process. Previously, after the PMU
counter overflows and triggers an NMI interrupt, the vCPU is forced to exit to
the host, which calls [int $2] immediately for later recording. Now, instead of
calling [int $2], we call **do_nmi** directly with the correct registers (i.e.,
pt_regs):

```
+void (*fn_do_nmi)(struct pt_regs *, long);
+
+#define HIGHER_HALF_CANONICAL_ADDR 0xFFFF800000000000
+
+void make_pt_regs(struct kvm_vcpu *vcpu, struct pt_regs *regs)
+{
+	/* In Sentry GR0, we will use address among
+	 * [HIGHER_HALF_CANONICAL_ADDR, 2^64-1)
+	 * when syscall just happens. To avoid conflicting with HR0,
+	 * we correct these addresses into HR3 addresses.
+	 */
+	regs->bp = vcpu->arch.regs[VCPU_REGS_RBP] & ~HIGHER_HALF_CANONICAL_ADDR;
+	regs->ip = vmcs_readl(GUEST_RIP) & ~HIGHER_HALF_CANONICAL_ADDR;
+	regs->sp = vmcs_readl(GUEST_RSP) & ~HIGHER_HALF_CANONICAL_ADDR;
+
+	regs->flags = (vmcs_readl(GUEST_RFLAGS) & 0xFF) |
+		X86_EFLAGS_IF | 0x2;
+	regs->cs = __USER_CS;
+	regs->ss = __USER_DS;
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -8943,7 +8965,14 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 	/* We need to handle NMIs before interrupts are enabled */
 	if (is_nmi(exit_intr_info)) {
 		kvm_before_handle_nmi(&vmx->vcpu);
-		asm("int $2");
+		if (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)
+			asm("int $2");
+		else {
+			struct pt_regs regs;
+			memset((void *)&regs, 0, sizeof(regs));
+			make_pt_regs(&vmx->vcpu, &regs);
+			fn_do_nmi(&regs, 0);
+		}
 		kvm_after_handle_nmi(&vmx->vcpu);
 	}
 }
@@ -11881,6 +11927,10 @@ static int __init vmx_init(void)
 		}
 	}
 
+	fn_do_nmi = (void *) kallsyms_lookup_name("do_nmi");
+	if (!fn_do_nmi)
+		printk(KERN_ERR "kvm: lookup do_nmi fail\n");
+
```

As shown above, we properly handle samples in GR3 and the GR0 trampoline.

### An example of profiling

First, make sure runsc is compiled with symbols not stripped:
```
bazel build runsc --strip=never
```

As an example, run the command below inside the gVisor container to make it
busy:
```
stress -i 1 -c 1 -m 1
```

Profile the instance with this command:
```
perf kvm --host --guest record -a -g -e cycles -G -- sleep 10 >/dev/null
```

Note that we still need to profile the instance with 'perf kvm' and '--guest',
because kvm-intel requires this to keep the PMU hardware events enabled in guest
mode.

Then generate a flame graph using
[Brendan Gregg's tool](https://github.com/brendangregg/FlameGraph); we get this
[flame graph](https://raw.githubusercontent.com/zhuangel/gvisor/zhuangel_blog/website/blog/blog-kvm-stress.svg).

Let's roughly divide it to differentiate GR3 and GR0 like this:

![Figure 2](/assets/images/2021-11-11-flamegraph-figure2.png "Flamegraph of stress.")

### Optimize based on flame graphs

Now we can get clear information like:

1. The bottleneck syscall(s): the above flame graph shows that sync(2) accounts
for a relatively large block of samples. If we cannot avoid such syscalls in
user space, they are worth the time to optimize. Some real cases we found and
optimized: superseding CopyIn/CopyOut with CopyInBytes/CopyOutBytes to avoid
reflection; avoiding defer in some frequent syscalls, in which case you can see
deferreturn() in the flame graph (not needed if you have already upgraded to a
newer Go version). Another optimization: after finding in the flame graph that
append writes to a shared volume spend a lot of time querying Gofer for the
current file length, we proposed adding
[a handle only for append write](https://github.com/google/gvisor/issues/1792).

2. Whether GC is a real problem: we can barely see samples related to GC in this
case. But if we do, we can further search for mallocgc() to see where heap
allocation is frequent, and perform a heap profile to see the allocated objects.
We can also consider adjusting the
[GC percent](https://golang.org/pkg/runtime/debug/#SetGCPercent), 100% by
default, to sacrifice memory for less CPU utilization (see the sketch after this
list). We once found that allocating an object > 32 KB also triggers GC; see
[this commit](https://github.com/google/gvisor/commit/f697d1a33e4e7cefb4164ec977c38ccc2a228099).

3. The percentage of time spent in the GR3 app vs. Sentry: this tells us whether
it is worth continuing the optimization. If most of the samples are in GR3, we
had better turn to optimizing the application code instead.

4. A rather large chunk of samples lies in EPT violations and fallocate(2) (into
HR0). This is caused by frequent memory allocation and freeing. We can either
optimize the application to avoid this, or add a memory buffer layer in memfile
management to relieve it.

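To make the GC knob in item 2 concrete, here is a minimal sketch of the
underlying runtime/debug call. The value 300 is only an example, and how such a
setting is actually applied to the sandbox process is deployment-specific and
not shown here.

```
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// The default is 100: a collection is triggered when the heap has grown
	// by 100% since the previous GC. Raising it (300 here, as an example)
	// makes GC run less often, trading memory for lower CPU utilization.
	old := debug.SetGCPercent(300)
	fmt.Printf("GC percent changed from %d to 300\n", old)

	// ... run the allocation-heavy workload here ...
}
```
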
As a short summary, we now have a tool to get a visible graph of what's going
on in a running gVisor instance. Unfortunately, we cannot get the details of
the application processes in the above flame graph because of the semantic gap.
To get a flame graph of the application processes, we have prototyped a way in
Sentry; hopefully, we'll discuss it in later blogs.

Such visibility is very helpful when we try to optimize a new application on
gVisor. However, there's another kind of overhead that is invisible, like "dark
matter".

## Invisible overhead in Go runtime

Sentry inherits the timer, scheduler, channels, and heap allocator from the Go
runtime. While this saves a lot of code when building a kernel, it also
introduces some unpleasant overhead. The Go runtime, after all, is designed for
and massively used by general-purpose Go applications. When it is used as part
of, or the basis of, a kernel, we have to be very careful with the
implementation and overhead of this syntactic sugar.

Unfortunately, we did not find a universal method to identify this kind of
overhead. The only way seems to be getting your hands dirty with the Go runtime.
We'll show some examples from our use case.

### Timer

It's known that the Go timer (before Go 1.14) suffers from
[lock contention and context switches](https://github.com/golang/go/issues/27707).
What's worse, statistics of Sentry syscalls show that a lot of futex() calls are
introduced by timers (64 timer buckets), and the fact that Sentry syscalls walk
a much longer path (redpill) makes it worse.

We have two optimizations here: 1. decrease the number of timer buckets from 64
to 4; 2. decrease the timer precision from ns to ms. You may worry about the
decreased timer precision, but as we observe, most applications are event-based
and not affected by a coarse-grained timer.

However, Go changed the timer implementation in 1.14; how to port this
optimization remains an open question.

### Scheduler

gVisor introduces an extra level of scheduling along with the host Linux
scheduler (usually CFS). An L2 scheduler sometimes brings a positive impact, as
it saves the heavy context switches of the L1 scheduler. There are many
two-level scheduling cases, for example, coroutines, virtual machines, etc.

gVisor reuses Go's work-stealing scheduler, originally designed for goroutines,
as the L2 scheduler. They share the same goal:

"We need to balance between keeping enough running worker threads to utilize
available hardware parallelism and parking excessive running worker threads
to conserve CPU resources and power." -- From
[Go scheduler code](https://golang.org/src/runtime/proc.go).

If not properly tuned, the L2 scheduler may leak scheduling pressure to the L1
scheduler. According to Go's G-P-M model, parallelism is closely related to the
GOMAXPROCS limit. Upstream gVisor by default uses the number of host cores,
which leads to a lot of wasted M wake/stops. By properly configuring GOMAXPROCS
for pods of 4/8/16 cores, we find it can save some CPU cycles without hurting
workload latency.

To further restrict extra M wake/stops, before wakep() we calculate the number
of running Gs and running Ps to decide whether it is necessary to wake an M. We
also find it's better to steal first from the longest local run queue, compared
to the previous random-sequential way. Another related optimization: we find
that most applications get back to Sentry very soon, so it's not necessary to
hand off its P when a task leaves for user space and then find an idle P when it
gets back.

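For illustration, the snippet below shows the GOMAXPROCS knob itself in a plain
Go program. The value 4 is an example for a 4-core pod; the way we actually
apply the limit to the runsc sandbox process is deployment-specific and not
shown here.

```
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// By default GOMAXPROCS equals the number of host cores, which lets the
	// scheduler keep more Ps (and wake more Ms) than a small pod can use.
	prev := runtime.GOMAXPROCS(4) // cap parallelism; returns the old value
	fmt.Printf("GOMAXPROCS: %d -> 4 (host cores: %d)\n", prev, runtime.NumCPU())
}
```
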
Some of our optimizations to Go are published
[here](https://github.com/zhuangel/go/tree/go1.13.4.blog). What we learned from
the optimization process is to dig into the Go runtime and understand what's
going on there. And it's normal that some ideas work while others fail.

## Summary

We introduced how we profiled gVisor for production-ready performance. Using
this methodology, along with some other aggressive measures, we finally got
gVisor running with acceptable overhead, and even better performance than runc
for some workloads. We also absorbed a lot of optimization progress from the
community, e.g., VFS2.

So far, we have deployed more than 100K gVisor instances in the production
environment, and they have supported the transactions of
[Singles' Day Global Shopping Festivals](https://en.wikipedia.org/wiki/Singles%27_Day)
very well.

Along with performance, there are also some other important aspects for
production adoption. For example, generating a core dump after a Sentry panic is
helpful for debugging, and a coverage tool is necessary to make sure new changes
are properly covered by test cases. We'll leave these topics to later
discussions.
diff --git a/website/blog/BUILD b/website/blog/BUILD
index 0384b9ba9..f53c7e8b9 100644
--- a/website/blog/BUILD
+++ b/website/blog/BUILD
@@ -59,6 +59,17 @@ doc(
     permalink = "/blog/2021/08/31/gvisor-rack/",
 )
 
+doc(
+    name = "tune_gvisor_for_production_adoption",
+    src = "2021-11-11-running-gvisor-in-production-at-scale-in-ant.md",
+    authors = [
+        "jianfengt",
+        "yonghe",
+    ],
+    layout = "post",
+    permalink = "/blog/2021/11/11/running-gvisor-in-production-at-scale-in-ant/",
+)
+
 docs(
     name = "posts",
     deps = [
diff --git a/website/index.md b/website/index.md
index c6cd477c2..68a6098d0 100644
--- a/website/index.md
+++ b/website/index.md
@@ -45,6 +45,13 @@ having to rearchitect your infrastructure.
       </p>
       <p><a class="btn" href="/docs/architecture_guide/platforms/">Read More &raquo;</a></p>
     </div>
+    <div class="col-md-4">
+      <h4>Tune gVisor for Production Adoption</h4>
+      <p>How Ant Group profiled and tuned gVisor to run at scale in production
+      with low overhead.
+      </p>
+      <p><a class="btn" href="/blog/2021/11/11/running-gvisor-in-production-at-scale-in-ant/">Read More &raquo;</a></p>
+    </div>
   </div>
 </div>
 