# Containing a Real Vulnerability

In the previous two posts we talked about gVisor's
[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
as well as how those are applied in the
[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
Recently, a new container escape vulnerability
([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
was announced that ties these topics together well. gVisor is
[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
issue, but it provides an interesting case study to continue our exploration of
gVisor's security. While gVisor is not immune to vulnerabilities,
[we take several steps](https://gvisor.dev/security/) to minimize the impact of
a vulnerability and to remediate it if one is found.

## Escaping the Container

First, let’s describe how the discovered vulnerability works. There are numerous
ways one can send and receive bytes over the network with Linux. One of the most
performant ways is to use a ring buffer, which is a memory region shared by the
application and the kernel. These rings are created by calling
[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
receiving and
[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
sending packets.

The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
enabled. There is another option
([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
asks the kernel to leave some space in the ring buffer before each packet for
anything the application needs, e.g. control structures. When a packet is
received, the kernel calculates where to copy the packet to, taking the amount
reserved before each packet into consideration. If the amount reserved is large,
the vulnerable code computes the offset incorrectly: the addition overflows,
leading to an out-of-bounds write of up to 10 bytes that the attacker controls.
The data in the write is easily controlled by using the loopback interface to
send a crafted packet and receiving it through a `PACKET_RX_RING` with a
carefully selected `PACKET_RESERVE` size.

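To make the setup concrete, here is a minimal sketch of how an application could request such a ring. It is an illustration under assumptions, not code from the exploit or from gVisor: the function names `ring_geometry` and `open_rx_ring` are hypothetical, and the block and frame sizes are left to the caller. The socket options themselves (`PACKET_RESERVE`, `PACKET_RX_RING`) and `struct tpacket_req` come from `packet(7)`.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_* options */
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Hypothetical helper: describe a ring of block_nr blocks, each holding
 * block_size / frame_size frames. The kernel expects tp_frame_nr to equal
 * frames-per-block times the number of blocks. */
struct tpacket_req ring_geometry(unsigned int block_size,
                                 unsigned int block_nr,
                                 unsigned int frame_size)
{
    struct tpacket_req req = {
        .tp_block_size = block_size,
        .tp_block_nr = block_nr,
        .tp_frame_size = frame_size,
        .tp_frame_nr = (block_size / frame_size) * block_nr,
    };
    return req;
}

/* Hypothetical helper: open a packet socket, reserve `reserve` bytes in
 * front of every packet, then ask for a receive ring. Returns the socket
 * or -1. socket() fails with EPERM when the caller lacks CAP_NET_RAW. */
int open_rx_ring(unsigned int reserve, const struct tpacket_req *req)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    /* PACKET_RESERVE has to be set before the ring exists. */
    if (setsockopt(fd, SOL_PACKET, PACKET_RESERVE,
                   &reserve, sizeof(reserve)) < 0 ||
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* the caller would mmap(2) this fd to read frames */
}
```

A caller might use `ring_geometry(4096, 4, 2048)` to describe eight 2 KiB frames and pass the result to `open_rx_ring` together with the reserve size. The reserve value is the input that feeds the kernel's offset calculation on the receive path.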
```c
static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
    // ...
    if (sk->sk_type == SOCK_DGRAM) {
        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
                          po->tp_reserve;
    } else {
        unsigned int maclen = skb_network_offset(skb);
        // tp_reserve is unsigned int, netoff is unsigned short.
        // Addition can overflow netoff
        netoff = TPACKET_ALIGN(po->tp_hdrlen +
                               (maclen < 16 ? 16 : maclen)) +
                 po->tp_reserve;
        if (po->has_vnet_hdr) {
            netoff += sizeof(struct virtio_net_hdr);
            do_vnet = true;
        }
        // Attacker controls netoff and can make macoff be smaller
        // than sizeof(struct virtio_net_hdr)
        macoff = netoff - maclen;
    }
    // ...
    // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
    // resulting in a pointer before h.raw
    if (do_vnet &&
        virtio_net_hdr_from_skb(skb, h.raw + macoff -
                                sizeof(struct virtio_net_hdr),
                                vio_le(), true, 0)) {
    // ...
```

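The arithmetic in the excerpt can be replayed in user space to see the wraparound. The sketch below is a stand-alone model, not kernel code: the function names are hypothetical, the header length and MAC length used in the usage note are illustrative values, and the guarded variant is only one way such a check could be written.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define ALIGN16(x) (((x) + 15u) & ~15u) /* same rounding as TPACKET_ALIGN */
#define VNET_HDR_LEN 10u                /* sizeof(struct virtio_net_hdr) */

/* Model of the vulnerable path: every intermediate result is squeezed
 * into a 16-bit variable, as with the kernel's unsigned short netoff. */
uint16_t vulnerable_macoff(unsigned int hdrlen, unsigned int maclen,
                           unsigned int reserve)
{
    uint16_t netoff =
        (uint16_t)(ALIGN16(hdrlen + (maclen < 16 ? 16 : maclen)) + reserve);
    netoff = (uint16_t)(netoff + VNET_HDR_LEN);
    return (uint16_t)(netoff - maclen);
}

/* True when "macoff - sizeof(struct virtio_net_hdr)" would point
 * before the start of the frame. */
bool vnet_hdr_before_buffer(uint16_t macoff)
{
    return macoff < VNET_HDR_LEN;
}

/* Guarded variant (an assumption about what a safe version needs):
 * do the math in 32 bits and refuse offsets that do not fit in 16. */
bool checked_macoff(unsigned int hdrlen, unsigned int maclen,
                    unsigned int reserve, unsigned int *macoff)
{
    if (reserve > INT_MAX)
        return false;
    unsigned int netoff =
        ALIGN16(hdrlen + (maclen < 16 ? 16 : maclen)) + reserve + VNET_HDR_LEN;
    if (netoff > USHRT_MAX)
        return false;
    *macoff = netoff - maclen;
    return true;
}
```

With a header length of 52 and a MAC length of 14, a reserve of 0 yields a `macoff` of 76, which leaves room for the 10-byte virtio header. A reserve of 65460 wraps `netoff` around and yields a `macoff` of 0, so the header would be written 10 bytes before the buffer. The guarded variant rejects that second input.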

The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
capability is required to create the packet socket that the ring is attached
to. However, in order to support common debugging tools like `ping` and
`tcpdump`, Docker containers, including those created for Kubernetes, are given
`CAP_NET_RAW` by default and thus may be able to trigger this vulnerability to
elevate privileges and escape the container.

Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
how gVisor could prevent the escape even if a similar vulnerability existed
inside gVisor’s kernel.

## Default Protections

gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets,
which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
to support in a sandbox environment. While they allow great customization for
essential tools like `ping`, they may allow packets to be written to the network
without any validation. In general, allowing an untrusted application to write
crafted packets to the network is a questionable idea and a historical source of
vulnerabilities. With that in mind, enabling `CAP_NET_RAW` by default would not
be _secure by default_ for running untrusted applications.

After multiple discussions when raw sockets were first implemented, we decided
to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
application**. Instead, enabling raw sockets in gVisor requires the admin to
pass the `--net-raw` flag to runsc when configuring the runtime, in addition to
requiring the `CAP_NET_RAW` capability in the application. This comes at a cost:
some tools may not work out of the box. But as part of our
[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
principle, we felt that it was important for the “less secure” configuration to
be explicit.

Since this bug was due to an overflow in the specific Linux implementation of
the packet ring, gVisor's raw socket implementation is not affected. However,
even if gVisor's raw socket code had a similar vulnerability, containers would
not be able to reach it by default.

As an alternative way to implement this same constraint, Kubernetes allows
[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
to be configured to customize requests. Cloud providers can use this to
implement more stringent policies. For example, GKE implements an admission
controller for gVisor that
[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
unless it has been explicitly set in the pod spec.

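As an illustration of what "explicitly set in the pod spec" can look like, here is a sketch of a pod that opts back in to the capability. The pod name, container name, image, and command are placeholders chosen for this example; `runtimeClassName: gvisor` is the RuntimeClass that GKE Sandbox uses, and Kubernetes writes capability names without the `CAP_` prefix.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: net-debug            # placeholder name
spec:
  runtimeClassName: gvisor   # run this pod inside gVisor
  containers:
  - name: tools              # placeholder container
    image: busybox           # placeholder image
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add: ["NET_RAW"]     # explicit opt-in to CAP_NET_RAW
```

On a self-managed setup this is only half of the opt-in: as described above, runsc must also be started with `--net-raw` before the capability has any effect inside the sandbox.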
## Isolated Kernel

gVisor has its own application kernel, called the Sentry, that is distinct from
the host kernel. Just as you would expect from a kernel, the Sentry has a
memory management subsystem, virtual file system, and a full network stack. The
host network is only used as a transport to carry packets in and out of the
sandbox[^1]. The loopback interface, which is used in the exploit, stays
completely inside the sandbox, never reaching the host.

Therefore, even if the Sentry were vulnerable to the attack, there would be two
factors that would prevent a container escape from happening. First, the
vulnerability would be limited to the Sentry, and the attacker would compromise
only the application kernel, bound by a restricted set of
[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed in more
depth below. Second, the Sentry is a distinct implementation of the API, written
in Go, which provides bounds checking that would likely have prevented access
past the bounds of the shared region (e.g. see
[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
or
[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
which have similar shared regions).

Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
of isolation and a pod can run multiple containers. In other words, each pod is
a gVisor instance, and each container is a set of processes running inside
gVisor, isolated via Sentry-internal namespaces like regular containers inside a
pod. If there were a vulnerability in gVisor, the privilege escalation would
allow a container inside the pod to break out to other **containers inside the
same pod**, but the container still **cannot break out of the pod**.

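To make the bounds-checking point from this section concrete: in Go, indexing a slice outside its range panics at run time instead of touching neighboring memory. C has no such check, so the equivalent has to be written by hand. The sketch below shows that explicit check for a write into a shared region; `struct region` and `region_write` are hypothetical names for this example and do not correspond to anything in the Sentry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a shared memory region. */
struct region {
    unsigned char *base;
    size_t len;
};

/* Copy n bytes to offset off inside the region, refusing any write
 * that would start before the region or run past its end. This is
 * the check a Go slice access performs implicitly. */
bool region_write(struct region r, ptrdiff_t off, const void *src, size_t n)
{
    if (off < 0 || (size_t)off > r.len || n > r.len - (size_t)off)
        return false;
    memcpy(r.base + off, src, n);
    return true;
}
```

With a 16-byte region, a 4-byte write at offset 12 succeeds, while a write at offset -10 (the shape of the bug above) or a 4-byte write at offset 14 is rejected rather than corrupting adjacent memory.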
## Defense in Depth

gVisor follows a
[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
that the system should have two layers of protection, and those layers should
require different compromises to be broken. We apply this principle by assuming
that the Sentry (first layer of defense)
[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
In order to protect the host kernel from a compromised Sentry, we wrap it in
many security and isolation features to ensure only the minimal set of
functionality from the host kernel is exposed.

![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")

First, the sandbox runs inside a cgroup that can limit and throttle host
resources being used. Second, the sandbox joins empty namespaces, including user
and mount, to further isolate from the host. Next, it changes the process root
to a read-only directory that contains only `/proc` and nothing else. Then, it
executes with the unprivileged user/group
[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
capabilities stripped. Last and most importantly, a seccomp filter is added to
tightly restrict which parts of the Linux syscall surface gVisor is allowed to
access. The allowed host surface is a far smaller set of syscalls than the
Sentry implements for applications to use. The filter not only restricts which
syscalls can be called, but also checks that the arguments to those syscalls are
within the expected set. Dangerous syscalls like `execve(2)`, `open(2)`, and
`socket(2)` are prohibited, thus an attacker isn’t able to execute binaries or
acquire new resources on the host.

If there were a vulnerability in gVisor that allowed an attacker to execute code
inside the Sentry, the attacker would still have extremely limited privileges on
the host. In fact, a compromised Sentry is much more restricted than a
non-compromised regular container. For CVE-2020-14386 in particular, the attack
would be blocked by more than one security layer: non-privileged user, no
capabilities, and seccomp filters.

Although the surface is drastically reduced, there is still a chance that there
is a vulnerability in one of the allowed syscalls. That’s why it’s important to
keep the surface small and carefully consider what syscalls are allowed. You can
find the full set of allowed syscalls
[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).

Another possible attack vector is resources that are present in the Sentry, like
open file descriptors. The Sentry has file descriptors that an attacker could
potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
endpoint that allows external communication with the Sentry, and a Netstack
endpoint that connects the sandbox to the network. The Netstack endpoint in
particular is a concern because it gives direct access to the network. It’s an
`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
network. In the normal case, Netstack assembles packets that go out to the
network, giving the container control over only the payload. But if the Sentry
is compromised, an attacker can craft arbitrary packets and write them to the
network. In many ways this is similar to anyone sending random packets over the
internet, but still this is a place where the host kernel surface exposed is
larger than we would like it to be.

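The seccomp layer described in this section can be illustrated with a very small filter. The sketch below is deliberately simplified and is not gVisor's filter: it denies a single syscall, `socket(2)`, and allows everything else, whereas gVisor uses an allow list with argument checks. It also skips the architecture check that a production filter should perform. The function names are hypothetical; the `prctl(2)` and BPF constants are the standard Linux ones.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter on the calling process that makes socket(2) fail
 * with EPERM. Simplification: no AUDIT_ARCH check, single deny rule. */
int deny_socket_syscall(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is socket(2), fall through; otherwise skip one. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Apply the filter in a child process so the caller stays unfiltered.
 * Returns 1 if socket(2) was blocked with EPERM, 0 if it was not,
 * and -1 if the filter could not be set up. */
int socket_blocked_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_socket_syscall() != 0)
            _exit(2);
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        _exit(fd < 0 && errno == EPERM ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;
    case 1:
        return 0;
    default:
        return -1;
    }
}
```

Once the filter is installed it cannot be removed, which is what makes it useful as a last layer: code running in a compromised process inherits the same restrictions. Inverting the sketch into an allow list (return an error by default, allow only listed syscalls) is closer in spirit to what the text describes.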
## Conclusion

Security comes with many tradeoffs that are often hard to make, such as the
decision to disable raw sockets by default. However, these tradeoffs have served
us well, and we've found them to have paid off over time. CVE-2020-14386 offers
great insight into how multiple layers of protection can be effective against
such an attack.

We cannot guarantee that a container escape will never happen in gVisor, but we
do our best to make it as hard as we possibly can.

If you have not tried gVisor yet, it’s easier than you think. Just follow the
steps [here](https://gvisor.dev/docs/user_guide/install/).

<br>
<br>

--------------------------------------------------------------------------------

[^1]: Those packets are eventually handled by the host, as it needs to route
    them to local containers or send them out the NIC. The packet will be
    handled by many switches, routers, proxies, servers, etc. along the way,
    which may be subject to their own vulnerabilities.