# Running gVisor in Production at Scale in Ant

> This post was contributed by [Ant Group](https://www.antgroup.com/), a
> large-scale digital payment platform. Jianfeng and Yong are engineers at Ant
> Group working on infrastructure systems, and contributors to gVisor.

At Ant Group, we are committed to keeping online transactions safe and
efficient. Continuously improving security against potential system-level
attacks is one of many measures. As a container runtime, gVisor provides
container-native security without sacrificing resource efficiency. Therefore,
it has been on our radar since it was released.

However, there have been performance concerns raised by members of
[academia](https://www.usenix.org/system/files/hotcloud19-paper-young.pdf) and
[industry](https://news.ycombinator.com/item?id=19924036). Users of gVisor tend
to bear the extra overhead as the tax of security. But we tend to agree that
[security is no excuse for poor performance (see Chapter 6!)](https://sel4.systems/About/seL4-whitepaper.pdf).

In this article, we will present how we identified bottlenecks in gVisor and
unblocked large-scale production adoption. Our main focus is the CPU
utilization and latency overhead it brings. A small memory footprint is also a
valued goal, but it is not discussed in this blog. As a result of these efforts
and community improvements, 70% of our applications running on runsc have <1%
overhead; another 25% have <3% overhead. Some of our most valued applications
are the focus of our optimization, and they get even better performance
compared with runc.

The rest of this blog is organized as follows:

* First, we analyze the cost of different syscall paths in gVisor.
* Then, we propose a way to profile the whole picture of a running instance,
  to find out whether any slow syscall paths are being hit. Some invisible
  overhead in the Go runtime is also discussed.
* At last, we give a short summary of performance optimization, along with
  some other factors in production adoption.

For convenience of discussion, we are targeting KVM-based, or
hypervisor-based, platforms, unless explicitly stated.

## Cost of different syscall paths

[Defense-in-depth](../../../../2019/11/18/gvisor-security-basics-part-1/#defense-in-depth)
is the key design principle of gVisor. In gVisor, different syscalls take
different paths, which in turn differ in latency and CPU consumption by orders
of magnitude. Here are the syscall paths in gVisor.

![Figure 1](/assets/images/2021-12-02-syscall-figure1.png "Sentry syscall paths.")

### Path 1: User-space vDSO

Sentry provides a
[vDSO library](https://github.com/google/gvisor/tree/master/vdso) for its
sandboxed processes. Several syscalls are short-circuited and implemented in
user space. These syscalls cost almost as much as native Linux. But note that
the vDSO library is only partially implemented. We once noticed that some
[syscalls](https://github.com/google/gvisor/issues/3101) in our environment
were not properly terminated in user space. We created some additional
implementations in the vDSO, and aim to push these improvements upstream when
possible.

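A quick, hedged way to feel this difference on a Linux box is to compare a
vDSO-served call with a real syscall from Go. On Linux, `time.Now()` is
normally answered by the vDSO's `clock_gettime`, while `syscall.Getpid()`
enters the kernel, so under gVisor only the latter pays interception cost.
The absolute numbers from this sketch depend entirely on your platform:

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	const n = 1_000_000

	// time.Now() is typically served by the vDSO: no kernel entry,
	// and hence no interception cost under gVisor either.
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = time.Now()
	}
	vdso := time.Since(start)

	// syscall.Getpid() issues a real syscall on every call, so under
	// gVisor it pays the full interception cost each time.
	start = time.Now()
	for i := 0; i < n; i++ {
		_ = syscall.Getpid()
	}
	sys := time.Since(start)

	fmt.Printf("time.Now: %v/call, getpid: %v/call\n", vdso/n, sys/n)
}
```
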
### Path 2: Sentry contained

Most syscalls, e.g., <code>clone(2)</code>, are implemented in Sentry. They
provide the basic abstractions of an operating system, such as process/thread
lifecycle, scheduling, IPC, memory management, etc. These syscalls, and all
paths below, suffer from a structural cost of syscall interception. The
overhead is about 800ns, while that of a native syscall is about 70ns. We'll
dig into it further below. Syscalls of this kind take several microseconds,
which is competitive with the corresponding native Linux syscalls.

### Path 3: Host-kernel involved

Some resource-related syscalls, e.g., read/write, are redirected into the host
kernel. Note that gVisor never passes application syscalls directly through to
the host kernel, for functional and security reasons. So compared to native
Linux, the time spent in Sentry is extra overhead. Another overhead is the way
a host kernel syscall is issued. Let's use the KVM platform on x86_64 as an
example. After Sentry issues the syscall instruction, if it is in GR0, it
first goes to the syscall entrypoint defined in LSTAR, then halts to HR3 (a
vmexit happens here), exits from a signal handler, and executes the syscall
instruction again. We can save the "halt to HR3" by introducing vmcall here,
but there's still a syscall trampoline, and the vmexit/vmentry overhead is not
trivial. Nevertheless, this overhead is not that significant.

For some Sentry-contained syscalls in Path 2, although the syscall semantic is
terminated in Sentry, it may further introduce one or many unexpected exits to
the host kernel. It could be a page fault while Sentry runs, or, more likely,
a schedule event in the Go runtime, e.g., an M idle/wakeup. An example at hand
is that <code>futex(FUTEX_WAIT)</code> and <code>epoll_wait(2)</code> can lead
to an M going idle and a further futex call into the host kernel if the
scheduler does not find any runnable Gs. (See the comments in
https://go.dev/src/runtime/proc.go for further explanation of the Go
scheduler.)

### Path 4: Gofer involved

Other IO-related syscalls, especially security-sensitive ones, go through
another layer of protection: Gofer. Such a syscall usually involves one or
more Sentry/Gofer inter-process communications. Even with the recent
optimization that uses lisafs to supersede P9, it's still the slowest path,
which we should try our best to avoid.

As shown above, some syscall paths are slow by design, and they should be
identified and reduced as much as possible. Let's leave that to the next
section, and first dig into the details of the structural and
implementation-specific costs of syscalls, because the performance of some
Sentry-contained syscalls is not good enough.

### The structural cost

The first kind of cost is comparatively stable and is introduced by syscall
interception. It is platform-specific, depending on the way syscalls are
intercepted. And whether this cost matters also depends on the syscall rate of
the sandboxed applications.

Here's a benchmark result on the structural cost of a syscall. We got the data
on an Intel(R) Xeon(R) CPU E5-2650 v2 platform, using the
[getpid benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/getpid_benchmark.cc).
As we can see, on the KVM platform, syscall interception costs more than 10x a
native Linux syscall.

getpid       | benchmark (ns)
------------ | --------------
Native       | 62
Native-KPTI  | 236
runsc-KVM    | 830
runsc-ptrace | 6249

\* "Native" stands for using a vanilla Linux kernel.

To understand the structural cost of syscall interception, we did a
[quantitative analysis](https://github.com/google/gvisor/issues/2354) on the
KVM platform. According to the analysis, the overhead mainly comes from:

1. KPTI-like CR3 switches: to maintain the address equation of Sentry running
   in HR3 and GR0, it has to switch the CR3 register twice, on each
   user/kernel switch;

2. Platform's Switch(): Linux is very efficient here, just switching to a
   per-thread kernel stack and calling the corresponding syscall entry
   function. But in Sentry, each task is represented by a goroutine; before
   calling into syscall entry functions, it needs to pop the stack to recover
   the big while loop, i.e., kernel.(*Task).run (a schematic sketch of this
   loop follows the list).

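To make item 2 concrete, here is a schematic, hypothetical sketch of the
goroutine-per-task pattern; it is not Sentry's actual code, just the shape of
the big loop that every interception has to re-enter (the event and helper
names are made up for illustration):

```go
package main

// Schematic sketch only: the real loop is Sentry's kernel.(*Task).run,
// which is far more involved. All names below are illustrative.

type event int

const (
	eventSyscall event = iota
	eventPageFault
	eventExit
)

type task struct{ id int }

// switchToApp stands in for the platform's Switch(): it resumes the
// application and blocks until the next interception.
func (t *task) switchToApp() event { return eventExit }

func (t *task) handleSyscall()   {}
func (t *task) handlePageFault() {}

// run is the "big while loop": one goroutine per task. Unlike native
// Linux, which enters a syscall on a per-thread kernel stack, each
// interception must unwind back to the top of this loop first.
func (t *task) run() {
	for {
		switch t.switchToApp() {
		case eventSyscall:
			t.handleSyscall()
		case eventPageFault:
			t.handlePageFault()
		case eventExit:
			return
		}
	}
}

func main() {
	t := &task{id: 1}
	t.run()
}
```
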
Can we save the structural cost of syscall interception? This cost is
by-design. We can optimize it, for example, by avoiding allocation and map
operations in the switch process, but it cannot be eliminated.

Does the structural cost of syscall interception really matter? It depends on
the syscall rate. Most applications in our case have a syscall rate <
200K/sec, and according to flame graphs (which will be described later in this
blog), we see 2~3% of samples in the switch path; as a rough sanity check,
200K syscalls/sec at an extra ~770ns each (830ns vs. 62ns) is about 0.15
CPU-seconds per second, i.e., around 2% of an 8-core Pod. Secondly, most
syscalls, except those as simple as <code>getpid(2)</code>, take several
microseconds, so in proportion the interception is not a significant overhead.
However, if you have an elephant RPC (which involves many DB accesses), or a
service served by a long-snake RPC chain, this brings nontrivial overhead on
latency.

### The implementation-specific cost

The other kind of cost is implementation-specific. For example, a syscall path
may involve heavy malloc operations, or defer may be used in frequent syscall
paths (defer was optimized in Go 1.14); what's worse, the application process
may trigger a long-path syscall with the host kernel or Gofer involved.

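As a small, hypothetical illustration of the defer point (the `fdTable` type
below is ours, not Sentry code): before Go 1.14's open-coded defers, replacing
a deferred unlock with an explicit one in a hot path was a cheap, measurable
win.

```go
package main

import (
	"fmt"
	"sync"
)

// fdTable mimics a structure consulted on every read/write-style
// syscall; it is illustrative only.
type fdTable struct {
	mu    sync.RWMutex
	files map[int32]string
}

// get avoids defer on the hot path: before Go 1.14, a deferred
// unlock cost noticeably more than the explicit call below and
// showed up as deferreturn() in flame graphs.
func (t *fdTable) get(fd int32) string {
	t.mu.RLock()
	f := t.files[fd]
	t.mu.RUnlock()
	return f
}

func main() {
	t := &fdTable{files: map[int32]string{0: "stdin"}}
	fmt.Println(t.get(0))
}
```
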
When we try to optimize the gVisor runtime, we need information on the
sandboxed applications, Pod configurations, and runsc internals. But most
people play only one role, either platform engineer or application engineer.
So we need an easier way to understand the whole picture.

## Performance profile of a running instance

To quickly understand the whole picture of performance, we need some way to
profile a running gVisor instance. As the gVisor sandbox process is
essentially a Go process, Go pprof is an existing way:

* [Go pprof](https://golang.org/pkg/runtime/pprof/) - provides CPU and heap
  profiles through
  [runsc debug subcommands](https://gvisor.dev/docs/user_guide/debugging/#profiling).
* [Go trace](https://golang.org/pkg/runtime/trace/) - provides more internal
  profile types like synchronization blocking and scheduler latency.

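For readers less familiar with Go's profiler, here is a minimal,
self-contained sketch of the kind of CPU profile collection that `runsc debug`
drives for a running sandbox; the busy loop is purely illustrative load, and
the resulting `cpu.prof` can be inspected with `go tool pprof`.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// CPU samples are collected via SIGPROF (see the caveat below)
	// until StopCPUProfile is called.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Illustrative busy work so the profile has samples.
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
	}
}
```
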
Unfortunately, the above tools only provide hot spots in Sentry, instead of
the whole picture (how much time is spent in GR3 and HR0). Moreover, the CPU
profile relies on the [SIGPROF signal](https://golang.org/pkg/runtime/pprof/),
which may not be accurate enough.

[perf-kvm](https://www.linux-kvm.org/page/Perf_events) cannot provide what we
need either. It may help to top/record/stat some information in the guest with
the help of the option [--guestkallsyms], but it cannot analyze the call chain
(which is not supported in the host kernel; see Linux's
perf_callchain_kernel).

### Perf the sandbox process like a normal process

Then we turn to a nice virtual address equation in Sentry: [(GR0 VA) = (HR3
VA)]. This ensures that any pointer in HR3 can be directly used in GR0.

The equation helps solve our problem: we can profile Sentry just as a normal
HR3 process, with a little hack on KVM.

- First, as said above, Linux does not support analyzing the call chain of the
  guest. So change [is_in_guest] to pretend that the vCPU runs in host mode
  even when it's in guest mode. This can be done in
  [kvm_is_in_guest](https://github.com/torvalds/linux/blob/v4.19/arch/x86/kvm/x86.c#L6560):

```
int kvm_is_in_guest(void)
{
-	return __this_cpu_read(current_vcpu) != NULL;
+	return 0;
}
```

- Secondly, change the guest profiling flow. Previously, after the PMU counter
  overflowed and triggered an NMI interrupt, the vCPU was forced to exit to
  the host, and called [int $2] immediately for later recording. Now, instead
  of calling [int $2], we call **do_nmi** directly with the correct registers
  (i.e., pt_regs):

```
+void (*fn_do_nmi)(struct pt_regs *, long);
+
+#define HIGHER_HALF_CANONICAL_ADDR 0xFFFF800000000000
+
+void make_pt_regs(struct kvm_vcpu *vcpu, struct pt_regs *regs)
+{
+	/* In Sentry GR0, we will use addresses in
+	 * [HIGHER_HALF_CANONICAL_ADDR, 2^64-1)
+	 * when a syscall just happens. To avoid conflicting with HR0,
+	 * we correct these addresses into HR3 addresses.
+	 */
+	regs->bp = vcpu->arch.regs[VCPU_REGS_RBP] & ~HIGHER_HALF_CANONICAL_ADDR;
+	regs->ip = vmcs_readl(GUEST_RIP) & ~HIGHER_HALF_CANONICAL_ADDR;
+	regs->sp = vmcs_readl(GUEST_RSP) & ~HIGHER_HALF_CANONICAL_ADDR;
+
+	regs->flags = (vmcs_readl(GUEST_RFLAGS) & 0xFF) |
+		X86_EFLAGS_IF | 0x2;
+	regs->cs = __USER_CS;
+	regs->ss = __USER_DS;
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -8943,7 +8965,14 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 	/* We need to handle NMIs before interrupts are enabled */
 	if (is_nmi(exit_intr_info)) {
 		kvm_before_handle_nmi(&vmx->vcpu);
-		asm("int $2");
+		if (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)
+			asm("int $2");
+		else {
+			struct pt_regs regs;
+			memset((void *)&regs, 0, sizeof(regs));
+			make_pt_regs(&vmx->vcpu, &regs);
+			fn_do_nmi(&regs, 0);
+		}
 		kvm_after_handle_nmi(&vmx->vcpu);
 	}
 }
@@ -11881,6 +11927,10 @@ static int __init vmx_init(void)
 		}
 	}

+	fn_do_nmi = (void *) kallsyms_lookup_name("do_nmi");
+	if (!fn_do_nmi)
+		printk(KERN_ERR "kvm: lookup do_nmi fail\n");
+
```

As shown above, we can now properly handle samples in GR3 and the GR0
trampoline.

### An example of profile

First, make sure runsc is compiled with symbols not stripped: `bazel build
runsc --strip=never`

As an example, run the script below inside the gVisor container to make it
busy: `stress -i 1 -c 1 -m 1`

Perf the instance with the command: `perf kvm --host --guest record -a -g -e
cycles -G <path/to/cgroup> -- sleep 10 >/dev/null`

Note that we still need to perf the instance with 'perf kvm' and '--guest',
because kvm-intel requires this to keep the PMU hardware event enabled in
guest mode.

Then generate a flame graph using
[Brendan's tool](https://github.com/brendangregg/FlameGraph), and we get this
[flame graph](https://raw.githubusercontent.com/zhuangel/gvisor/zhuangel_blog/website/blog/blog-kvm-stress.svg).

Let's roughly divide it to differentiate GR3 and GR0 like this:

![Figure 2](/assets/images/2021-12-02-flamegraph-figure2.png "Flamegraph of stress.")

### Optimize based on flame graphs

Now we can get clear information like:

1. The bottleneck syscall(s): the above flame graph shows <code>sync(2)</code>
   as a relatively large block of samples. If we cannot avoid such syscalls in
   user space, they are worth the time for optimization. Some real cases we
   found and optimized: superseding CopyIn/CopyOut with
   CopyInBytes/CopyOutBytes to avoid reflection; avoiding defer in some
   frequent syscalls, in which case you can see <code>deferreturn()</code> in
   the flame graph (not needed if you have already upgraded to a newer Go
   version). Another optimization: after finding in the flame graph that
   append writes to a shared volume spend a lot of time querying Gofer for the
   current file length, we proposed adding
   [a handle only for append writes](https://github.com/google/gvisor/issues/1792).

2. Whether GC is a real problem: we can barely see GC-related samples in this
   case. But if we do, we can further search for <code>mallocgc()</code> to
   see where heap allocation is frequent. We can perform a heap profile to see
   the allocated objects. And we can consider adjusting the
   [GC percent](https://golang.org/pkg/runtime/debug/#SetGCPercent), 100% by
   default, to sacrifice memory for less CPU utilization (a sketch of this
   knob follows the list). We once found that allocating an object > 32 KB
   also triggers GC; see
   [this commit](https://github.com/google/gvisor/commit/f697d1a33e4e7cefb4164ec977c38ccc2a228099).

3. Percentage of time spent in the GR3 app and Sentry: we can determine
   whether it is worth continuing the optimization. If most of the samples are
   in GR3, then we'd better turn to optimizing the application code instead.

4. A rather large chunk of samples lies in EPT violations and
   <code>fallocate(2)</code> (into HR0). This is caused by frequent memory
   allocation and freeing. We can either optimize the application to avoid
   this, or add a memory buffer layer in memfile management to relieve it.

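As a concrete sketch of the GC knob from item 2, using the standard
`runtime/debug` API; the 400% value is only an example of trading memory
footprint for fewer collections, not a recommendation:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetGCPercent returns the previous value (100 by default).
	// A larger value lets the heap grow more between collections,
	// spending memory to save GC CPU time.
	old := debug.SetGCPercent(400)
	fmt.Printf("GC percent raised from %d to 400\n", old)
}
```
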
As a short summary, we now have a tool to get a visible graph of what's going
on in a running gVisor instance. Unfortunately, we cannot get the details of
the application processes in the above flame graph because of the semantic
gap. To get a flame graph of the application processes, we have prototyped a
way in Sentry. Hopefully, we'll discuss it in later blogs.

A visual approach is very helpful when we try to optimize a new application on
gVisor. However, there's another kind of overhead, invisible like "dark
matter".

## Invisible overhead in Go runtime

Sentry inherits the timer, scheduler, channel, and heap allocator of the Go
runtime. While this saves a lot of the code needed to build a kernel, it also
introduces some unpleasant overhead. The Go runtime, after all, is designed
for and massively used by general-purpose Go applications. When it's used as a
part or the basis of a kernel, we must be very careful with the implementation
and overhead of this syntactic sugar.

Unfortunately, we did not find a universal method to identify this kind of
overhead. The only way seems to be to get your hands dirty with the Go
runtime. We'll show some examples from our use case.

### Timer

It's known that the Go timer (before Go 1.14) suffers from
[lock contention and context switches](https://github.com/golang/go/issues/27707).
What's worse, statistics of Sentry syscalls show that a lot of
<code>futex()</code> calls are introduced by timers (64 timer buckets), and
Sentry syscalls walk a much longer path (redpill), which makes it worse.

We have two optimizations here: 1. decrease the number of timer buckets from
64 to 4; 2. decrease the timer precision from ns to ms. You may worry about
the loss of timer precision, but as we observe, most applications are
event-based and are not affected by a coarse-grained timer.

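The bucket-count change lives inside the Go runtime itself, but the precision
change can be sketched independently: round deadlines up to a coarse grid so
that nearby timers coalesce into shared wakeups. A minimal sketch (the 1ms
grain is our choice for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// coarsen rounds a timeout up to the given grain so that timers
// expiring close together share wakeups instead of each arming a
// fine-grained runtime timer (and its futex wakeup).
func coarsen(d, grain time.Duration) time.Duration {
	return (d + grain - 1) / grain * grain
}

func main() {
	fmt.Println(coarsen(1234*time.Microsecond, time.Millisecond)) // 2ms
}
```
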
However, Go changed the timer implementation in 1.14; how to port this
optimization remains an open question.

### Scheduler

gVisor introduces an extra level of scheduling along with the host Linux
scheduler (usually CFS). An L2 scheduler sometimes brings a positive impact,
as it saves the heavy context switches of the L1 scheduler. We can find many
two-level scheduler cases, for example, coroutines, virtual machines, etc.

gVisor reuses Go's work-stealing scheduler, which was originally designed for
goroutines, as the L2 scheduler. They share the same goal:

"We need to balance between keeping enough running worker threads to utilize
available hardware parallelism and parking excessive running worker threads to
conserve CPU resources and power." -- From
[Go scheduler code](https://golang.org/src/runtime/proc.go).

If not properly tuned, the L2 scheduler may leak scheduling pressure to the L1
scheduler. According to the G-P-M model of Go, the parallelism is closely
related to the GOMAXPROCS limit. Upstream gVisor by default uses the number of
host cores, which leads to a lot of wasted M wake/stops. By properly
configuring GOMAXPROCS for Pods of 4/8/16 cores, we find it can save some CPU
cycles without worsening the workload latency.

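The effect can be approximated without patching anything: cap the sandbox's
parallelism at the Pod's CPU allotment rather than the host's core count. A
hypothetical sketch, where `podCPUs` stands in for however the quota is
discovered (e.g., from cgroup limits):

```go
package main

import (
	"fmt"
	"runtime"
)

// podCPUs is a stand-in for reading the Pod's CPU quota (e.g., from
// cgroup cpu.max); hardcoded to keep the sketch self-contained.
func podCPUs() int { return 4 }

func main() {
	// The default is the host's core count, which over-provisions
	// Ps (and thus Ms) for a small Pod; cap it at the Pod size.
	prev := runtime.GOMAXPROCS(podCPUs())
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, runtime.GOMAXPROCS(0))
}
```
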
To further restrict extra M wake/stops, before wakep(), we calculate the
number of running Gs and running Ps to decide whether it is necessary to wake
an M (a sketch of this heuristic follows). And we find it's better to steal
first from the longest local run queue, compared to the previously
random-sequential way. Another related optimization: we find that most
applications get back to Sentry very soon, so it's not necessary to hand off
the task's P when it leaves for user space and to find an idle P when it gets
back.

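In sketch form, the gate we added before `wakep()` looks roughly like this;
the real change patches the runtime scheduler, so treat the function below as
the shape of the heuristic, not the actual diff:

```go
package main

import "fmt"

// shouldWakeM sketches the check run before waking a parked M: only
// wake one when there are strictly more runnable Gs than busy Ps,
// i.e., when the existing Ps cannot absorb the pending work.
func shouldWakeM(runnableGs, busyPs, idlePs int) bool {
	return runnableGs > busyPs && idlePs > 0
}

func main() {
	fmt.Println(shouldWakeM(3, 2, 1)) // true: a runnable G has no P
	fmt.Println(shouldWakeM(2, 2, 1)) // false: all runnable Gs covered
}
```
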
Some of our optimizations to Go are available
[here](https://github.com/zhuangel/go/tree/go1.13.4.blog). What we learned
from the optimization process of gVisor is to dig into the Go runtime to
understand what's going on there. And it's normal that some ideas work while
others fail.

## Summary

We introduced how we profiled gVisor for production-ready performance. Using
this methodology, along with some other aggressive measures, we finally got to
run gVisor with an acceptable overhead, and even better overhead than runc for
some workloads. We also absorbed a lot of optimization progress from the
community, e.g., VFS2.

So far, we have deployed more than 100K gVisor instances in the production
environment. And they have supported the transactions of the
[Singles' Day Global Shopping Festivals](https://en.wikipedia.org/wiki/Singles%27_Day)
very well.

Along with performance, there are also some other important aspects of
production adoption. For example, generating a core dump after a Sentry panic
is helpful for debugging, and a coverage tool is necessary to make sure new
changes are properly covered by test cases. We'll leave these topics to later
discussions.