# Containing a Real Vulnerability

In the previous two posts we talked about gVisor's
[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
as well as how those are applied in the
[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
Recently, a new container escape vulnerability
([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
was announced that ties these topics together well. gVisor is
[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
issue, but it provides an interesting case study to continue our exploration of
gVisor's security. While gVisor is not immune to vulnerabilities,
[we take several steps](https://gvisor.dev/security/) to minimize the impact and
remediate if a vulnerability is found.

## Escaping the Container

First, let’s describe how the discovered vulnerability works. There are numerous
ways one can send and receive bytes over the network with Linux. One of the most
performant ways is to use a ring buffer, which is a memory region shared by the
application and the kernel. These rings are created by calling
[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
receiving and
[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
sending packets.
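
As a rough sketch of what that setup looks like from the application side, the C fragment below fills in a `struct tpacket_req` and attaches a receive ring to a packet socket. The helper names `make_ring_request` and `open_rx_ring` are hypothetical, the choice of `TPACKET_V2` is an assumption, and the size checks only mirror the layout rules the kernel is understood to enforce (page-aligned blocks, frame sizes that are a multiple of `TPACKET_ALIGNMENT`).

```c
#include <assert.h>
#include <errno.h>
#include <linux/if_packet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: describe a ring of block_nr blocks.
 * Returns 0 on success, or -1 if the sizes break the layout rules
 * (block size a multiple of the page size, frame size a multiple of
 * TPACKET_ALIGNMENT, at least one frame per block). */
int make_ring_request(struct tpacket_req *req, unsigned int page_size,
                      unsigned int block_size, unsigned int block_nr,
                      unsigned int frame_size)
{
    assert(req != NULL);
    if (page_size == 0 || block_size == 0 || block_size % page_size != 0)
        return -1;
    if (frame_size == 0 || frame_size % TPACKET_ALIGNMENT != 0)
        return -1;
    if (frame_size > block_size || block_nr == 0)
        return -1;

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = (block_size / frame_size) * block_nr;
    return 0;
}

/* Hypothetical helper: open a packet socket and attach a receive ring.
 * Requires CAP_NET_RAW. Returns the socket fd, or -1 with errno set. */
int open_rx_ring(const struct tpacket_req *req, unsigned int reserve)
{
    int version = TPACKET_V2; /* assumption: any TPACKET version could be used */
    /* Protocol 0: the socket receives nothing until it is bound. */
    int fd = socket(AF_PACKET, SOCK_RAW, 0);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) < 0 ||
        setsockopt(fd, SOL_PACKET, PACKET_RESERVE, &reserve, sizeof(reserve)) < 0 ||
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0) {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

After a successful call, the application would typically `mmap(2)` the socket to obtain the shared region and read frames from it directly.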

The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
enabled. There is another option
([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
asks the kernel to leave some space in the ring buffer before each packet for
anything the application needs, e.g. control structures. When a packet is
received, the kernel calculates where to copy the packet to, taking the amount
reserved before each packet into consideration. If the amount reserved is large,
the vulnerable code performs an incorrect calculation that can overflow, leading
to an out-of-bounds write of up to 10 bytes. An attacker can easily control the
data being written by sending a crafted packet over the loopback interface and
receiving it on a `PACKET_RX_RING` with a carefully selected `PACKET_RESERVE`
size.

```c
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
               struct packet_type *pt, struct net_device *orig_dev)
{
// ...
    if (sk->sk_type == SOCK_DGRAM) {
        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
                  po->tp_reserve;
    } else {
        unsigned int maclen = skb_network_offset(skb);
        // tp_reserve is unsigned int, netoff is unsigned short.
        // Addition can overflow netoff
        netoff = TPACKET_ALIGN(po->tp_hdrlen +
                       (maclen < 16 ? 16 : maclen)) +
                       po->tp_reserve;
        if (po->has_vnet_hdr) {
            netoff += sizeof(struct virtio_net_hdr);
            do_vnet = true;
        }
        // Attacker controls netoff and can make macoff be smaller
        // than sizeof(struct virtio_net_hdr)
        macoff = netoff - maclen;
    }
// ...
    // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
    // resulting in a pointer before h.raw
    if (do_vnet &&
        virtio_net_hdr_from_skb(skb, h.raw + macoff -
                    sizeof(struct virtio_net_hdr),
                    vio_le(), true, 0)) {
// ...
```

The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
capability is required to create the socket above. However, in order to support
common debugging tools like `ping` and `tcpdump`, Docker containers, including
those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
able to trigger this vulnerability to elevate privileges and escape the
container.

Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
how gVisor could prevent the escape even if a similar vulnerability existed
inside gVisor’s kernel.

## Default Protections

gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets
which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
to support in a sandbox environment. While they allow great customization for
essential tools like `ping`, they may also allow packets to be written to the
network without any validation. In general, allowing an untrusted application to
write crafted packets to the network is a questionable idea and a historical
source of vulnerabilities. With that in mind, if `CAP_NET_RAW` were enabled by
default, running untrusted applications would not be _secure by default_.

After multiple discussions when raw sockets were first implemented, we decided
to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
application**. Instead, enabling raw sockets in gVisor requires the admin to set
the `--net-raw` flag on runsc when configuring the runtime, in addition to
requiring the `CAP_NET_RAW` capability in the application. This comes at the
expense that some tools may not work out of the box, but as part of our
[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
principle, we felt that it was important for the “less secure” configuration to
be explicit.
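
The resulting policy is a simple conjunction, sketched below in C. This is only a model of the rule described above, not gVisor's actual code; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: whether the runtime was configured with --net-raw. */
struct sandbox_config {
    bool net_raw;
};

/* Hypothetical: the capabilities granted to the application. */
struct task_creds {
    bool cap_net_raw;
};

/* Raw sockets are available only when the admin opted in AND the
 * application holds CAP_NET_RAW. Either one alone is not enough. */
bool raw_socket_allowed(const struct sandbox_config *cfg,
                        const struct task_creds *creds)
{
    assert(cfg != NULL && creds != NULL);
    return cfg->net_raw && creds->cap_net_raw;
}
```

A container that is handed `CAP_NET_RAW` by Docker's defaults therefore still gets a refusal unless the runtime itself was started with the flag.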

Since this bug was due to an overflow in the specific Linux implementation of
the packet ring, gVisor's raw socket implementation is not affected. However, if
there were a vulnerability in gVisor, containers would not be allowed to exploit
it by default.

As an alternative way to implement this same constraint, Kubernetes allows
[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
to be configured to customize requests. Cloud providers can use this to
implement more stringent policies. For example, GKE implements an admission
controller for gVisor that
[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
unless it has been explicitly set in the pod spec.

## Isolated Kernel

gVisor has its own application kernel, called the Sentry, that is distinct from
the host kernel. Just like what you would expect from a kernel, gVisor has a
memory management subsystem, virtual file system, and a full network stack. The
host network is only used as a transport to carry packets in and out of the
sandbox[^1]. The loopback interface which is used in the exploit stays
completely inside the sandbox, never reaching the host.

Therefore, even if the Sentry were vulnerable to the attack, there would be two
factors that would prevent a container escape from happening. First, the
vulnerability would be limited to the Sentry, and the attacker would compromise
only the application kernel, bound by a restricted set of
[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in
depth below. Second, the Sentry is a distinct implementation of the API, written
in Go, which provides bounds checking that would have likely prevented access
past the bounds of the shared region (e.g. see
[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
or
[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
which have similar shared regions).
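
For readers more familiar with C, the sketch below shows the kind of check that Go performs implicitly on every slice access and that C code has to spell out by hand. The `struct region` type and `region_write` function are hypothetical. One difference is worth noting: Go panics on an out-of-range index, whereas this sketch returns an error code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a shared memory region. */
struct region {
    unsigned char *base;
    size_t len;
};

/* Copy n bytes from src into the region at signed offset off.
 * Returns 0 on success, or -1 if any byte would land outside the
 * region, including offsets before its start. */
int region_write(struct region *r, long off, const void *src, size_t n)
{
    assert(r != NULL && src != NULL);
    if (off < 0 || (size_t)off > r->len || n > r->len - (size_t)off)
        return -1;
    memcpy(r->base + off, src, n);
    return 0;
}
```

A write at offset `-10`, like the one produced by the overflow in `tpacket_rcv`, would be rejected by the first condition instead of corrupting whatever precedes the buffer.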

Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
of isolation and a pod can run multiple containers. In other words, each pod is
a gVisor instance, and each container is a set of processes running inside
gVisor, isolated via Sentry-internal namespaces like regular containers inside a
pod. If there were a vulnerability in gVisor, the privilege escalation would
allow a container inside the pod to break out to other **containers inside the
same pod**, but the container still **cannot break out of the pod**.

## Defense in Depth

gVisor follows a
[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
that the system should have two layers of protection, and those layers should
require different compromises to be broken. We apply this principle by assuming
that the Sentry (first layer of defense)
[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
In order to protect the host kernel from a compromised Sentry, we wrap it in
many security and isolation features to ensure only the minimal set of
functionality from the host kernel is exposed.

![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")

First, the sandbox runs inside a cgroup that can limit and throttle host
resources being used. Second, the sandbox joins empty namespaces, including user
and mount, to further isolate from the host. Next, it changes the process root
to a read-only directory that contains only `/proc` and nothing else. Then, it
executes with the unprivileged user/group
[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
capabilities stripped. Last and most importantly, a seccomp filter is added to
tightly restrict which parts of the Linux syscall surface gVisor is allowed to
access. The allowed host surface is a far smaller set of syscalls than the
Sentry implements for applications to use. The filter not only restricts which
syscalls can be called, but also checks that the arguments to these syscalls are
within the expected set. Dangerous syscalls like <code>execve(2)</code>,
<code>open(2)</code>, and <code>socket(2)</code> are prohibited, thus an
attacker isn’t able to execute binaries or acquire new resources on the host.
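
The idea of filtering on both the syscall number and its arguments can be modeled in a few lines of C. Real seccomp filters are BPF programs evaluated by the host kernel, so the snippet below is only a userspace illustration of the decision logic. The rule table is invented for the example and is **not** gVisor's actual allowlist; `struct rule` and `syscall_allowed` are hypothetical names.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

struct rule {
    long nr;         /* syscall number */
    bool check_arg1; /* whether the second argument is constrained */
    long arg1;       /* required value when check_arg1 is set */
};

/* Illustrative allowlist only; not gVisor's real filter table. */
static const struct rule allowlist[] = {
    { SYS_read,  false, 0 },
    { SYS_write, false, 0 },
    { SYS_fcntl, true,  F_GETFL }, /* fcntl is allowed only for these commands */
    { SYS_fcntl, true,  F_SETFL },
};

/* Default-deny check: a call passes only if some rule matches both the
 * syscall number and, where constrained, the argument value. */
bool syscall_allowed(long nr, const long args[6])
{
    assert(args != NULL);
    for (size_t i = 0; i < sizeof(allowlist) / sizeof(allowlist[0]); i++) {
        const struct rule *r = &allowlist[i];
        if (r->nr != nr)
            continue;
        if (!r->check_arg1 || args[1] == r->arg1)
            return true;
    }
    return false;
}
```

Under this model, `fcntl` with `F_GETFL` passes while `fcntl` with `F_DUPFD` does not, and `execve` and `socket` are rejected simply because no rule mentions them.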

If there were a vulnerability in gVisor that allowed an attacker to execute code
inside the Sentry, the attacker would still have extremely limited privileges on
the host. In fact, a compromised Sentry is much more restricted than a
non-compromised regular container. For CVE-2020-14386 in particular, the attack
would be blocked by more than one security layer: non-privileged user, no
capabilities, and seccomp filters.

Although the surface is drastically reduced, there is still a chance that there
is a vulnerability in one of the allowed syscalls. That’s why it’s important to
keep the surface small and carefully consider what syscalls are allowed. You can
find the full set of allowed syscalls
[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).

Another possible attack vector is resources that are present in the Sentry, like
open file descriptors. The Sentry has file descriptors that an attacker could
potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
endpoint that allows external communication with the Sentry, and a Netstack
endpoint that connects the sandbox to the network. The Netstack endpoint in
particular is a concern because it gives direct access to the network. It’s an
`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
network. In the normal case, Netstack assembles packets that go out to the
network, giving the container control over only the payload. But if the Sentry
is compromised, an attacker can write crafted packets to the network. In many ways this is
similar to anyone sending random packets over the internet, but still this is a
place where the host kernel surface exposed is larger than we would like it to
be.

## Conclusion

Security comes with many tradeoffs that are often hard to make, such as the
decision to disable raw sockets by default. However, these tradeoffs have served
us well and have paid off over time. CVE-2020-14386 offers
great insight into how multiple layers of protection can be effective against
such an attack.

We cannot guarantee that a container escape will never happen in gVisor, but we
do our best to make it as hard as we possibly can.

If you have not tried gVisor yet, it’s easier than you think. Just follow the
steps [here](https://gvisor.dev/docs/user_guide/install/).
<br>
<br>

--------------------------------------------------------------------------------

[^1]: Those packets are eventually handled by the host, as it needs to route
    them to local containers or send them out the NIC. The packet will be
    handled by many switches, routers, proxies, servers, etc. along the way,
    which may be subject to their own vulnerabilities.