diff --git a/website/_config.yml b/website/_config.yml
index 5f1cbbdeb..e394f39b4 100644
--- a/website/_config.yml
+++ b/website/_config.yml
@@ -47,3 +47,9 @@ authors:
nybidari:
name: Nayana Bidari
email: nybidari@google.com
+ jianfengt:
+ name: Jianfeng Tan
+ email: henry.tjf@antgroup.com
+ yonghe:
+ name: Yong He
+ email: chenglang.hy@antgroup.com
diff --git a/website/assets/images/2021-11-11-flamegraph-figure2.png b/website/assets/images/2021-11-11-flamegraph-figure2.png
new file mode 100644
index 000000000..eb8314975
Binary files /dev/null and b/website/assets/images/2021-11-11-flamegraph-figure2.png differ
diff --git a/website/assets/images/2021-11-11-syscall-figure1.png b/website/assets/images/2021-11-11-syscall-figure1.png
new file mode 100644
index 000000000..07d33e636
Binary files /dev/null and b/website/assets/images/2021-11-11-syscall-figure1.png differ
diff --git a/website/blog/2021-11-11-running-gvisor-in-production-at-scale-in-ant.md b/website/blog/2021-11-11-running-gvisor-in-production-at-scale-in-ant.md
new file mode 100644
index 000000000..5c2e451b3
--- /dev/null
+++ b/website/blog/2021-11-11-running-gvisor-in-production-at-scale-in-ant.md
@@ -0,0 +1,413 @@
+# Running gVisor in Production at Scale in Ant
+
+>This post was contributed by [Ant Group](https://www.antgroup.com/), a
+>large-scale digital payment platform. Jianfeng and Yong are engineers at Ant
+>Group working on infrastructure systems, and contributors to gVisor.
+
+At Ant Group, we are committed to keeping online transactions safe and
+efficient. Continuously improving our defenses against potential system-level
+attacks is one of many measures we take. As a container runtime, gVisor
+provides container-native security without sacrificing resource efficiency, so
+it has been on our radar since it was released.
+
+However, performance concerns have been raised by members of
+[academia](https://www.usenix.org/system/files/hotcloud19-paper-young.pdf)
+and [industry](https://news.ycombinator.com/item?id=19924036). Users of gVisor
+often accept the extra overhead as the price of security. But we agree that
+[security is no excuse for poor performance (See Chapter 6!)](https://sel4.systems/About/seL4-whitepaper.pdf).
+
+In this article, we present how we identified bottlenecks in gVisor and
+unblocked large-scale production adoption. Our main focus is the CPU
+utilization and latency overhead it brings. A small memory footprint is also a
+valued goal, but it is not discussed in this blog. As a result of these efforts
+and community improvements, 70% of our applications running on runsc have <1%
+overhead; another 25% have <3% overhead. Some of our most valued applications
+are the focus of our optimization, and they get even better performance than
+runc.
+
+The rest of this blog is organized as follows:
+- First, we analyze the cost of different syscall paths in gVisor.
+- Then, we propose a way to profile the whole picture of a running instance to
+  find out whether any slow syscall paths are being hit.
+- Next, we discuss some invisible overhead in the Go runtime.
+- Finally, we give a short summary of performance optimization along with some
+  other factors in production adoption.
+
+For convenience of discussion, we target KVM-based (or hypervisor-based)
+platforms, unless explicitly stated otherwise.
+
+## Cost of different syscall paths
+
+[Defense-in-depth](../../../../2019/11/18/gvisor-security-basics-part-1/#defense-in-depth)
+is the key design principle of gVisor. In gVisor, different syscalls take
+different paths, leading to costs that differ by orders of magnitude in latency
+and CPU consumption. Here are the syscall paths in gVisor.
+
+![Figure 1](/assets/images/2021-11-11-syscall-figure1.png "Sentry syscall paths.")
+
+### Path 1: User-space vDSO
+
+Sentry provides a [vDSO library](https://github.com/google/gvisor/tree/master/vdso)
+for its sandboxed processes. Several syscalls are short-circuited and
+implemented in user space. These syscalls cost about the same as on native
+Linux. Note, however, that the vDSO library is only partially implemented. We
+once noticed that some [syscalls](https://github.com/google/gvisor/issues/3101)
+in our environment were not properly terminated in user space. We created some
+additional implementations in the vDSO, and aim to push these improvements
+upstream when possible.
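+
+As a rough check of which calls are short-circuited, a probe like the sketch
+below (purely illustrative, ours) can be run inside the sandbox to compare a
+vDSO-served call with an intercepted one (see Path 2):
+
+```
+// Illustrative latency probe: time.Now() is served by the Sentry-provided
+// vDSO (Path 1), while getpid(2) is intercepted and handled in Sentry
+// (Path 2), so it pays the interception cost discussed below. Numbers are
+// only meaningful for relative comparison.
+package main
+
+import (
+	"fmt"
+	"syscall"
+	"time"
+)
+
+func measure(name string, n int, f func()) {
+	start := time.Now()
+	for i := 0; i < n; i++ {
+		f()
+	}
+	fmt.Printf("%-22s %v/call\n", name, time.Since(start)/time.Duration(n))
+}
+
+func main() {
+	const n = 1000000
+	measure("clock_gettime (vDSO):", n, func() { _ = time.Now() })
+	measure("getpid (intercepted):", n, func() { _ = syscall.Getpid() })
+}
+```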
+
+### Path 2: Sentry contained
+
+Most syscalls, e.g., clone(2), are implemented in Sentry. They are the basic
+abstractions of an operating system, such as process/thread lifecycle,
+scheduling, IPC, memory management, etc. These syscalls, and all of the paths
+below, suffer from the structural cost of syscall interception. That overhead
+is about 800ns, while a native syscall takes about 70ns; we'll dig into it
+further below. Still, syscalls of this kind take several microseconds in total,
+which is competitive with the corresponding native Linux syscalls.
+
+### Path 3: Host-kernel involved
+
+Some syscalls, typically resource-related ones, e.g., read/write, are
+redirected into the host kernel. Note that gVisor never passes application
+syscalls directly through to the host kernel, for functional and security
+reasons. So compared to native Linux, the time spent in Sentry is extra
+overhead. Another overhead is the way a host kernel syscall is issued. Let's
+use the KVM platform on x86_64 as an example. After Sentry issues the syscall
+instruction, if it is in GR0, execution first goes to the syscall entrypoint
+defined in LSTAR, then halts to HR3 (a vmexit happens here), exits from a
+signal handler, and executes the syscall instruction again. We can save the
+"halt to HR3" step by introducing vmcall here, but there's still a syscall
+trampoline, and the vmexit/vmentry overhead is not trivial. Nevertheless, this
+overhead is not that significant.
+
+For some Sentry-contained syscalls in Path 2, although the syscall semantics
+are terminated in Sentry, they may further introduce one or more unexpected
+exits to the host kernel. It could be a page fault when Sentry runs, or, more
+likely, a scheduling event in the Go runtime, e.g., an M going idle or waking
+up. An example at hand is that futex(FUTEX_WAIT) and epoll_wait(2) could lead
+to an M going idle and a further futex call into the host kernel if it does not
+find any runnable Gs. (See the comments in https://go.dev/src/runtime/proc.go
+for further explanation of the GMP scheduler.)
+
+### Path 4: Gofer involved
+
+Other IO-related syscalls, especially security-sensitive ones, go through
+another layer of protection - Gofer. Such a syscall usually involves one or
+more Sentry/Gofer inter-process communications. Even with the recent
+optimization that uses lisafs to supersede P9, it's still the slowest path, and
+we try our best to avoid it.
+
+As shown above, some syscall paths are slow by design, and they should be
+identified and reduced as much as possible. Let's hold that until the next
+section, and first dig into the details of the structural and
+implementation-specific costs of syscalls, because the performance of some
+Sentry-contained syscalls is not good enough.
+
+### The structural cost
+
+The first kind of cost is comparatively stable; it is introduced by syscall
+interception. It is platform-specific, depending on the way syscalls are
+intercepted. Whether this cost matters also depends on the syscall rate of the
+sandboxed applications.
+
+Here's the benchmark result on the structural cost of a syscall. We got the
+data on an Intel(R) Xeon(R) CPU E5-2650 v2 platform, using the
+[getpid benchmark](https://github.com/google/gvisor/blob/master/test/perf/linux/getpid_benchmark.cc).
+As we can see, on the KVM platform, syscall interception costs more than 10x as
+much as a native Linux syscall.
+
+| |getpid benchmark (ns)|
+|------------|---------------------|
+|Native |62 |
+|Native-KPTI |236 |
+|runsc-KVM |830 |
+|runsc-ptrace|6249 |
+
+* "Native" stands for using vanilla linux kernel.
+
+To understand the structural cost of syscall interception, we did a
+[quantitative analysis](https://github.com/google/gvisor/issues/2354)
+on the KVM platform. According to the analysis, the overhead mainly comes from:
+
+1. KPTI-like CR3 switches: to maintain the address equation of Sentry running
+in HR3 and GR0, it has to switch the CR3 register twice on each user/kernel
+switch;
+
+2. Platform's Switch(): Linux is very efficient, simply switching to a
+per-thread kernel stack and calling the corresponding syscall entry function.
+But in Sentry, each task is represented by a goroutine; before calling into
+the syscall entry functions, it needs to pop the stack to recover the big while
+loop, i.e., kernel.(*Task).run.
+
+Can we save the structural cost of syscall interception? This cost is actually
+by design. We can optimize it, for example, by avoiding allocation and map
+operations in the switch process, but it cannot be eliminated.
+
+Does the structural cost of syscall interception really matter? It depends on
+the syscall rate. Most applications in our case have a syscall rate < 200K/sec,
+and according to flame graphs (which will be described later in this blog), we
+see that 2~3% of samples are in the switch process. Secondly, most syscalls,
+except those as simple as getpid(2), take several microseconds, so
+proportionally the interception is not a significant overhead. However, if you
+have an elephant RPC (one that involves many DB accesses), or a service served
+by a long chain of RPCs, this brings nontrivial overhead on latency.
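+
+As a back-of-the-envelope check of the CPU side (the 8-core pod size below is
+an assumption for illustration), the interception cost alone accounts for only
+a few percent of a pod, which matches the 2~3% of samples we observe:
+
+```
+// Back-of-the-envelope estimate using the numbers above: ~800ns per
+// intercepted syscall vs ~70ns native, at a 200K/sec syscall rate.
+package main
+
+import "fmt"
+
+func main() {
+	const (
+		interceptedNs = 800.0    // measured interception cost above
+		nativeNs      = 70.0     // native syscall cost above
+		rate          = 200000.0 // syscalls per second
+		podCores      = 8.0      // assumed pod size, illustration only
+	)
+	// Extra CPU-seconds burned per wall-clock second by interception alone.
+	extra := (interceptedNs - nativeNs) * 1e-9 * rate
+	fmt.Printf("~%.0f%% of one core, ~%.1f%% of an %d-core pod\n",
+		extra*100, extra*100/podCores, int(podCores))
+}
+```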
+
+### The implementation-specific cost
+
+The other kind of cost is implementation-specific. For example, a syscall may
+involve heavy malloc operations, or defer may be used in some frequent syscall
+paths (defer is optimized in Go 1.14); what's worse, the application process
+may trigger a long-path syscall with the host kernel or Gofer involved.
+
+When we try to optimize the gVisor runtime, we need information on the
+sandboxed applications, the POD configurations, and runsc internals. But most
+people act only as a platform engineer or an application engineer, so we need
+an easier way to understand the whole picture.
+
+## Performance profile of a running instance
+
+To quickly understand the whole picture of performance, we need a way to
+profile a running gVisor instance. As the gVisor sandbox process is essentially
+a Go process, Go pprof is an existing option:
+
+* [Go pprof](https://golang.org/pkg/runtime/pprof/) - provides CPU and heap
+ profile through [runsc debug subcommands](https://gvisor.dev/docs/user_guide/debugging/#profiling).
+* [Go trace](https://golang.org/pkg/runtime/trace/) - provides more
+ internal profile types like synchronization blocking and scheduler latency.
+
+Unfortunately, the above tools only show hot spots in Sentry, not the whole
+picture (how much time is spent in GR3 and HR0). And the CPU profile relies on
+the [SIGPROF signal](https://golang.org/pkg/runtime/pprof/), which may not be
+accurate enough.
+
+[perf-kvm](https://www.linux-kvm.org/page/Perf_events) cannot provide what we
+need either. It may help to top/record/stat some information in the guest with
+the help of the [--guestkallsyms] option, but it cannot analyze the call chain
+(which is not supported in the host kernel; see Linux's perf_callchain_kernel).
+
+### Perf the sandbox process like a normal process
+
+We then turn to a nice virtual address equation in Sentry: [(GR0 VA) = (HR3 VA)].
+It makes sure that any pointer in HR3 can be directly used in GR0.
+
+This equation helps solve our problem: we can profile Sentry just like a normal
+HR3 process, with a little hack on KVM.
+
+- First, as said above, Linux does not support analyzing the call chain of the
+guest. So change [is_in_guest] to pretend that it runs in host mode even when
+it's in guest mode. This can be done in
+[kvm_is_in_guest](https://github.com/torvalds/linux/blob/v4.19/arch/x86/kvm/x86.c#L6560):
+
+```
+int kvm_is_in_guest(void)
+ {
+- return __this_cpu_read(current_vcpu) != NULL;
++ return 0;
+ }
+```
+
+- Secondly, change the way the guest is profiled. Previously, after the PMU
+counter overflows and triggers an NMI interrupt, the vCPU is forced to exit to
+the host, and [int $2] is called immediately for later recording. Now, instead
+of calling [int $2], we call **do_nmi** directly with the correct registers
+(i.e., pt_regs):
+
+```
++void (*fn_do_nmi)(struct pt_regs *, long);
++
++#define HIGHER_HALF_CANONICAL_ADDR 0xFFFF800000000000
++
++void make_pt_regs(struct kvm_vcpu *vcpu, struct pt_regs *regs)
++{
++ /* In Sentry GR0, we will use address among
++ * [HIGHER_HALF_CANONICAL_ADDR, 2^64-1)
++ * when syscall just happens. To avoid conflicting with HR0,
++ * we correct these addresses into HR3 addresses.
++ */
++ regs->bp = vcpu->arch.regs[VCPU_REGS_RBP] & ~HIGHER_HALF_CANONICAL_ADDR;
++ regs->ip = vmcs_readl(GUEST_RIP) & ~HIGHER_HALF_CANONICAL_ADDR;
++ regs->sp = vmcs_readl(GUEST_RSP) & ~HIGHER_HALF_CANONICAL_ADDR;
++
++ regs->flags = (vmcs_readl(GUEST_RFLAGS) & 0xFF) |
++ X86_EFLAGS_IF | 0x2;
++ regs->cs = __USER_CS;
++ regs->ss = __USER_DS;
++}
++
+ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
+ {
+ u32 exit_intr_info;
+@@ -8943,7 +8965,14 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
+ /* We need to handle NMIs before interrupts are enabled */
+ if (is_nmi(exit_intr_info)) {
+ kvm_before_handle_nmi(&vmx->vcpu);
+- asm("int $2");
++ if (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF)
++ asm("int $2");
++ else {
++ struct pt_regs regs;
++ memset((void *)&regs, 0, sizeof(regs));
++ make_pt_regs(&vmx->vcpu, &regs);
++ fn_do_nmi(&regs, 0);
++ }
+ kvm_after_handle_nmi(&vmx->vcpu);
+ }
+ }
+@@ -11881,6 +11927,10 @@ static int __init vmx_init(void)
+ }
+ }
+
++ fn_do_nmi = (void *) kallsyms_lookup_name("do_nmi");
++ if (!fn_do_nmi)
++ printk(KERN_ERR "kvm: lookup do_nmi fail\n");
++
+```
+
+As shown above, we now properly handle samples in GR3 and in the GR0
+trampoline.
+
+### An example of profiling
+
+First, make sure runsc is compiled with symbols not stripped:
+```
+bazel build runsc --strip=never
+```
+
+As an example, run the script below inside the gVisor container to make it
+busy:
+```
+stress -i 1 -c 1 -m 1
+```
+
+Perf the instance with a command like the following, where the -G option
+selects the sandbox's cgroup:
+```
+perf kvm --host --guest record -a -g -e cycles -G <sandbox-cgroup>
+```
+
+The recorded samples can then be rendered as a flame graph:
+
+![Figure 2](/assets/images/2021-11-11-flamegraph-figure2.png "Flame graph of a running gVisor instance.")
+
+Here are some things we can read from a flame graph like this:
+
+1. sync(2) is a relatively large block of samples. If we cannot avoid
+them in user space, they are worth spending time to optimize. Some real cases
+we found and optimized: superseding CopyIn/CopyOut with CopyInBytes/CopyOutBytes
+to avoid reflection; avoiding defer in some frequent syscalls, in which case you
+can see deferreturn() in the flame graph (not needed if you have already
+upgraded to a newer Go version). Another optimization: after finding in the
+flame graph that append writes to a shared volume spend a lot of time querying
+Gofer for the current file length, we proposed adding
+[a handle only for append write](https://github.com/google/gvisor/issues/1792).
+
+2. If GC is a real problem: we can barely see samples related to GC in this
+case. But if we do, we can further search for mallocgc() to see where heap
+allocation is frequent. We can perform a heap profile to see the allocated
+objects, and we can consider adjusting the
+[GC percent](https://golang.org/pkg/runtime/debug/#SetGCPercent) (100% by
+default) to trade memory for less CPU utilization (a minimal sketch follows
+this list). We once found that allocating an object > 32 KB also triggers GC,
+referring to
+[this](https://github.com/google/gvisor/commit/f697d1a33e4e7cefb4164ec977c38ccc2a228099).
+
+3. Percentage of time spent in the GR3 app and in Sentry: this tells us whether
+it is worth continuing the optimization. If most of the samples are in GR3, we
+had better turn to optimizing the application code instead.
+
+4. A rather large chunk of samples lies in EPT violations and fallocate(2)
+(into HR0). This is caused by frequent memory allocation and freeing. We can
+either optimize the application to avoid this, or add a memory buffer layer in
+memfile management to relieve it.
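+
+For the GC tuning mentioned in item 2, here is that minimal sketch (400 is only
+an example value, not something we measured):
+
+```
+// Minimal sketch for item 2: trade memory for fewer GC cycles by raising the
+// GC percent. The default is 100; 400 is only an example value.
+package main
+
+import (
+	"fmt"
+	"runtime/debug"
+)
+
+func main() {
+	old := debug.SetGCPercent(400) // GC when heap grows 400% over live data
+	fmt.Printf("GC percent raised from %d to 400\n", old)
+}
+```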
+
+As a short summary, we now have a tool to get a visual picture of what's going
+on in a running gVisor instance. Unfortunately, we cannot get the details of
+the application processes in the above flame graph because of the semantic gap.
+To get a flame graph of the application processes, we have prototyped a way in
+Sentry. Hopefully, we'll discuss it in later blogs.
+
+Such a visual approach is very helpful when we try to optimize a new
+application on gVisor. However, there's another kind of overhead that is
+invisible, like "dark matter".
+
+## Invisible overhead in Go runtime
+
+Sentry inherits the timer, scheduler, channel, and heap allocator of the Go
+runtime. While this saves a lot of code when building a kernel, it also
+introduces some unpleasant overhead. The Go runtime, after all, is designed for
+and massively used by general-purpose Go applications. When it is used as part
+of, or the basis of, a kernel, we have to be very careful with the
+implementation and overhead of this syntactic sugar.
+
+Unfortunately, we did not find a universal method to identify this kind of
+overhead. The only way seems to be getting your hands dirty with the Go
+runtime. We'll show some examples from our use case.
+
+### Timer
+
+It's known that the Go timer (before 1.14) suffers from
+[lock contention and context switches](https://github.com/golang/go/issues/27707).
+What's worse, statistics of Sentry syscalls show that a lot of futex() calls
+are introduced by the timers (64 timer buckets), and the fact that Sentry
+syscalls walk a much longer path (redpill) makes it worse.
+
+We made two optimizations here: 1. decrease the number of timer buckets from 64
+to 4; 2. decrease the timer precision from ns to ms. You may worry about the
+reduced timer precision, but as far as we can see, most of our applications are
+event-based and are not affected by a coarse-grained timer.
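+
+Below is a conceptual sketch of the idea (our illustration only, not the actual
+Go runtime patch): fewer buckets and coarser deadlines mean fewer timer
+wakeups, and hence fewer futex calls into the host kernel.
+
+```
+// Conceptual sketch of the timer changes described above; an illustration of
+// the idea, not the actual Go runtime patch.
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+const (
+	numBuckets  = 4                // reduced from Go 1.13's 64 timer buckets
+	granularity = time.Millisecond // deadlines rounded from ns to ms
+)
+
+// bucketFor spreads timers over fewer buckets, so fewer bucket goroutines need
+// to be woken (each wakeup is a futex, which is expensive in Sentry).
+func bucketFor(id uint64) int {
+	return int(id % numBuckets)
+}
+
+// coarsen rounds a deadline so that nearby timers expire together and can be
+// handled by a single wakeup.
+func coarsen(d time.Duration) time.Duration {
+	return d.Round(granularity)
+}
+
+func main() {
+	fmt.Println(bucketFor(12345), coarsen(1500*time.Microsecond))
+}
+```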
+
+However, Go changed the implementation of timers in 1.14; how to port this
+optimization remains an open question.
+
+### Scheduler
+
+gVisor introduces an extra level of scheduling alongside the host Linux
+scheduler (usually CFS). An L2 scheduler sometimes has a positive impact, as it
+saves the heavy context switches of the L1 scheduler. We can find many
+two-level-scheduler cases, for example, coroutines, virtual machines, etc.
+
+gVisor reuses Go's work-stealing scheduler, originally designed for goroutines,
+as its L2 scheduler. They share the same goal:
+
+"We need to balance between keeping enough running worker threads to utilize
+available hardware parallelism and parking excessive running worker threads
+to conserve CPU resources and power." -- From
+[Go scheduler code](https://golang.org/src/runtime/proc.go).
+
+If not properly tuned, the L2 scheduler may leak scheduling pressure to the L1
+scheduler. According to the G-P-M model of Go, the parallelism is closely
+related to the GOMAXPROCS limit. Upstream gVisor by default uses the number of
+host cores, which leads to a lot of wasted M wakes/stops. By properly
+configuring GOMAXPROCS for PODs of 4/8/16 cores, we find we can save some CPU
+cycles without worsening workload latency.
+
+To further restrict extra M wakes/stops, before wakep() we calculate the number
+of running Gs and running Ps to decide whether it is necessary to wake an M. We
+also find it is better to steal first from the longest local run queue, rather
+than the previous random-sequential way. Another related optimization: we find
+that most applications get back into Sentry very soon, so it is not necessary
+to hand off the task's P when it leaves for user space and to find an idle P
+again when it gets back.
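+
+Here is an illustrative sketch of that wake-up gating (the names and exact
+condition are ours for illustration, not the actual runtime code):
+
+```
+// Illustrative sketch of the wakep() gating described above; names and logic
+// are ours, not the actual Go runtime code.
+package main
+
+import "fmt"
+
+// shouldWakeM reports whether it is worth waking another M: only when there
+// are more runnable Gs than Ps currently running them, and we are still below
+// the GOMAXPROCS limit.
+func shouldWakeM(runnableGs, runningPs, maxProcs int) bool {
+	if runningPs >= maxProcs {
+		return false
+	}
+	return runnableGs > runningPs
+}
+
+func main() {
+	fmt.Println(shouldWakeM(3, 2, 4)) // true: work is piling up
+	fmt.Println(shouldWakeM(1, 2, 4)) // false: avoid a wasted wake/stop
+}
+```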
+
+Some of our optimizations to Go are available
+[here](https://github.com/zhuangel/go/tree/go1.13.4.blog). What we learned from
+the optimization process of gVisor is to dig into the Go runtime and understand
+what is going on there. And it is normal that some ideas work while others
+fail.
+
+## Summary
+
+We introduced how we profiled gVisor for production-ready performance. Using
+this methodology, along with some other aggressive measures, we finally got
+gVisor running with an acceptable overhead, and even better than runc in some
+workloads. We also absorbed a lot of optimization progress from the community,
+e.g., VFS2.
+
+So far, we have deployed more than 100K gVisor instances in our production
+environment, and they have supported the transactions of
+[Singles' Day Global Shopping Festivals](https://en.wikipedia.org/wiki/Singles%27_Day)
+very well.
+
+Along with performance, there are some other important aspects of production
+adoption. For example, generating a core dump after a Sentry panic is helpful
+for debugging, and a coverage tool is necessary to make sure new changes are
+properly covered by test cases. We'll leave these topics to later discussions.
diff --git a/website/blog/BUILD b/website/blog/BUILD
index 0384b9ba9..f53c7e8b9 100644
--- a/website/blog/BUILD
+++ b/website/blog/BUILD
@@ -59,6 +59,17 @@ doc(
permalink = "/blog/2021/08/31/gvisor-rack/",
)
+doc(
+ name = "tune_gvisor_for_production_adoption",
+ src = "2021-11-11-running-gvisor-in-production-at-scale-in-ant.md",
+ authors = [
+ "jianfengt",
+ "yonghe",
+ ],
+ layout = "post",
+ permalink = "/blog/2021/11/11/running-gvisor-in-production-at-scale-in-ant/",
+)
+
docs(
name = "posts",
deps = [
diff --git a/website/index.md b/website/index.md
index c6cd477c2..68a6098d0 100644
--- a/website/index.md
+++ b/website/index.md
@@ -45,6 +45,13 @@
having to rearchitect your infrastructure.