Commit Graph

245 Commits

Author SHA1 Message Date
Jamie Liu 48961d27a8 Move //pkg/sentry/memutil to //pkg/memutil.
PiperOrigin-RevId: 252124156
2019-06-07 14:52:27 -07:00
Jamie Liu a26043ee53 Implement reclaim-driven MemoryFile eviction.
PiperOrigin-RevId: 251950660
2019-06-06 16:27:55 -07:00
Bhasker Hariharan 85be01b42d Add multi-fd support to fdbased endpoint.
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.

This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.

Updates #231

PiperOrigin-RevId: 251852825
2019-06-06 08:07:02 -07:00
Michael Pratt 57772db2e7 Shutdown host sockets on internal shutdown
This is required to make the shutdown visible to peers outside the
sandbox.

The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.

The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.

Updates #273

PiperOrigin-RevId: 251763060
2019-06-05 18:40:37 -07:00
Fabricio Voznika f1aee6a7ad Refactor container FS setup
No change in functionaly. Added containerMounter object
to keep state while the mounts are processed. This will
help upcoming changes to share mounts per-pod.

PiperOrigin-RevId: 251350096
2019-06-03 18:20:57 -07:00
Fabricio Voznika d28f71adcf Remove 'clearStatus' option from container.Wait*PID()
clearStatus was added to allow detached execution to wait
on the exec'd process and retrieve its exit status. However,
it's not currently used. Both docker and gvisor-containerd-shim
wait on the "shim" process and retrieve the exit status from
there. We could change gvisor-containerd-shim to use waits, but
it will end up also consuming a process for the wait, which is
similar to having the shim process.

Closes #234

PiperOrigin-RevId: 251349490
2019-06-03 18:16:09 -07:00
Michael Pratt 955685845e Remove spurious period
PiperOrigin-RevId: 251288885
2019-06-03 12:48:24 -07:00
Bhasker Hariharan 035a8fa38e Add support for collecting execution trace to runsc.
Updates #220

PiperOrigin-RevId: 250532302
2019-05-30 12:07:11 -07:00
Fabricio Voznika c091e62369 Set sticky bit to /tmp
This is generally done for '/tmp' to prevent accidental
deletion of files. More details here:
http://man7.org/linux/man-pages/man1/chmod.1.html#RESTRICTED_DELETION_FLAG_OR_STICKY_BIT

PiperOrigin-RevId: 249633207
Change-Id: I444a5b406fdef664f5677b2f20f374972613a02b
2019-05-23 06:48:00 -07:00
Fabricio Voznika 9006304dfe Initial support for bind mounts
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.

PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
2019-05-23 04:16:10 -07:00
Fabricio Voznika ecb0f00e10 Cleanup around urpc file payload handling
urpc always closes all files once the RPC function returns.

PiperOrigin-RevId: 248406857
Change-Id: I400a8562452ec75c8e4bddc2154948567d572950
2019-05-15 14:36:28 -07:00
Andrei Vagin 85380ff03d gvisor/runsc: use a veth link address instead of generating a new one
PiperOrigin-RevId: 248367340
Change-Id: Id792afcfff9c9d2cfd62cae21048316267b4a924
2019-05-15 11:11:58 -07:00
Fabricio Voznika 1bee43be13 Implement fallocate(2)
Closes #225

PiperOrigin-RevId: 247508791
Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-09 15:35:49 -07:00
Andrei Vagin bf0ac565d2 Fix runsc restore to be compatible with docker start --checkpoint ...
Change-Id: I02b30de13f1393df66edf8829fedbf32405d18f8
PiperOrigin-RevId: 246621192
2019-05-03 21:41:45 -07:00
Jamie Liu 8bfb83d0ac Implement async MemoryFile eviction, and use it in CachingInodeOperations.
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.

Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.

As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.

PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
2019-04-30 13:56:41 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Kevin Krakauer 43dff57b87 Make raw sockets a toggleable feature disabled by default.
PiperOrigin-RevId: 245511019
Change-Id: Ia9562a301b46458988a6a1f0bbd5f07cbfcb0615
2019-04-26 16:51:46 -07:00
Bhasker Hariharan 99b877fa1d Revert runsc to use RecvMMsg packet dispatcher.
PacketMMap mode has issues due to a kernel bug. This change
reverts us to using recvmmsg instead of a shared ring buffer to
dispatch inbound packets. This will reduce performance but should
be more stable under heavy load till PacketMMap is updated to
use TPacketv3.

See #210 for details.

Perf difference between recvmmsg vs packetmmap.

RecvMMsg :
iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43478 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   778 MBytes  6.53 Gbits/sec  4349    188 KBytes
[  4]   1.00-2.00   sec   786 MBytes  6.59 Gbits/sec  4395    212 KBytes
[  4]   2.00-3.00   sec   756 MBytes  6.34 Gbits/sec  3655    161 KBytes
[  4]   3.00-4.00   sec   782 MBytes  6.56 Gbits/sec  4419    175 KBytes
[  4]   4.00-5.00   sec   755 MBytes  6.34 Gbits/sec  4317    187 KBytes
[  4]   5.00-6.00   sec   774 MBytes  6.49 Gbits/sec  4002    173 KBytes
[  4]   6.00-7.00   sec   737 MBytes  6.18 Gbits/sec  3904    191 KBytes
[  4]   7.00-8.00   sec   530 MBytes  4.44 Gbits/sec  3318    189 KBytes
[  4]   8.00-9.00   sec   487 MBytes  4.09 Gbits/sec  2627    188 KBytes
[  4]   9.00-10.00  sec   770 MBytes  6.46 Gbits/sec  4221    170 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec  39207             sender
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec                  receiver

iperf Done.

PacketMMap:

bhaskerh@gvisor-bench:~/tensorflow$ iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43496 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   657 MBytes  5.51 Gbits/sec    0   1.01 MBytes
[  4]   1.00-2.00   sec  1021 MBytes  8.56 Gbits/sec    0   1.01 MBytes
[  4]   2.00-3.00   sec  1.21 GBytes  10.4 Gbits/sec   45   1.01 MBytes
[  4]   3.00-4.00   sec  1018 MBytes  8.54 Gbits/sec   15   1.01 MBytes
[  4]   4.00-5.00   sec  1.28 GBytes  11.0 Gbits/sec   45   1.01 MBytes
[  4]   5.00-6.00   sec  1.38 GBytes  11.9 Gbits/sec    0   1.01 MBytes
[  4]   6.00-7.00   sec  1.34 GBytes  11.5 Gbits/sec   45    856 KBytes
[  4]   7.00-8.00   sec  1.23 GBytes  10.5 Gbits/sec    0    901 KBytes
[  4]   8.00-9.00   sec  1010 MBytes  8.48 Gbits/sec    0    923 KBytes
[  4]   9.00-10.00  sec  1.39 GBytes  11.9 Gbits/sec    0    960 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec  150             sender
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec                  receiver

Updates #210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
2019-04-23 19:07:06 -07:00
Fabricio Voznika c8cee7108f Use FD limit and file size limit from host
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.

PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17 12:57:40 -07:00
Fabricio Voznika 9f8c89fc7f Return error from fdbased.New
RELNOTES: n/a
PiperOrigin-RevId: 244031742
Change-Id: Id0cdb73194018fb5979e67b58510ead19b5a2b81
2019-04-17 11:16:35 -07:00
Bhasker Hariharan eaac2806ff Add TCP checksum verification.
PiperOrigin-RevId: 242704699
Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f
2019-04-09 11:23:47 -07:00
Andrei Vagin 88409e983c gvisor: Add support for the MS_NOEXEC mount option
https://github.com/google/gvisor/issues/145

PiperOrigin-RevId: 242044115
Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
2019-04-04 17:43:53 -07:00
Kevin Krakauer f9431fb20f Remove obsolete TODO.
PiperOrigin-RevId: 241637164
Change-Id: I65476a739cf38f1818dc47f6ce60638dec8b77a8
2019-04-02 17:27:05 -07:00
Kevin Krakauer a40ee4f4b8 Change bug number for duplicate bug.
PiperOrigin-RevId: 241567897
Change-Id: I580eac04f52bb15f4aab7df9822c4aa92e743021
2019-04-02 11:28:06 -07:00
Andrei Vagin a046054ba3 gvisor/runsc: enable generic segmentation offload (GSO)
The linux packet socket can handle GSO packets, so we can segment packets to
64K instead of the MTU which is usually 1500.

Here are numbers for the nginx-1m test:
runsc:		579330.01 [Kbytes/sec] received
runsc-gso:	1794121.66 [Kbytes/sec] received
runc:		2122139.06 [Kbytes/sec] received

and for tcp_benchmark:

$ tcp_benchmark  --duration 15   --ideal
[  4]  0.0-15.0 sec  86647 MBytes  48456 Mbits/sec

$ tcp_benchmark --client --duration 15   --ideal
[  4]  0.0-15.0 sec  2173 MBytes  1214 Mbits/sec

$ tcp_benchmark --client --duration 15   --ideal --gso 65536
[  4]  0.0-15.0 sec  19357 MBytes  10825 Mbits/sec

PiperOrigin-RevId: 241072403
Change-Id: I20b03063a1a6649362b43609cbbc9b59be06e6d5
2019-03-29 16:27:38 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Nicolas Lacasse 2512cc5617 Allow filesystem.Mount to take an optional interface argument.
PiperOrigin-RevId: 238360231
Change-Id: I5eaf8d26f8892f77d71c7fbd6c5225ef471cedf1
2019-03-13 19:24:03 -07:00
Ian Gudger a16f6e50c5 Make HandleLocal apply to all non-loopback interfaces.
HandleLocal is very similar conceptually to MULTICAST_LOOP, so we can unify
the implementations. This has the benefit of making HandleLocal apply even when
the fdbased link endpoint isn't in use.

In addition, move looping logic to route creation so that it doesn't need to be
run for each packet. This should improve performance.

PiperOrigin-RevId: 238099480
Change-Id: I72839f16f25310471453bc9d3fb8544815b25c23
2019-03-12 14:37:56 -07:00
Fabricio Voznika bc9b979b94 Add profiling commands to runsc
Example:
  runsc debug --root=<dir> \
      --profile-heap=/tmp/heap.prof \
      --profile-cpu=/tmp/cpu.prod --profile-delay=30 \
      <container ID>
PiperOrigin-RevId: 237848456
Change-Id: Icff3f20c1b157a84d0922599eaea327320dad773
2019-03-11 11:47:30 -07:00
Ian Gudger 56a6128295 Implement IP_MULTICAST_LOOP.
IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default
route are looped back. In order to implement this switch, support for sending
and looping back multicast packets on the default route had to be implemented.

For now we only support IPv4 multicast.

PiperOrigin-RevId: 237534603
Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c
2019-03-08 15:49:17 -08:00
Fabricio Voznika 0b76887147 Priority-inheritance futex implementation
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.

PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
2019-03-05 23:40:18 -08:00
Fabricio Voznika fcba4e8f04 Add uncaught signal message to the user log
This help troubleshoot cases where the container is killed and the
app logs don't show the reason.

PiperOrigin-RevId: 236982883
Change-Id: I361892856a146cea5b04abaa3aedbf805e123724
2019-03-05 22:20:17 -08:00
Fabricio Voznika 3dbd4a16f8 Add semctl(GETPID) syscall
Also added unimplemented notification for semctl(2)
commands.

PiperOrigin-RevId: 236340672
Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478
2019-03-01 10:57:02 -08:00
Kevin Krakauer b75aa51504 Rename ping endpoints to icmp endpoints.
PiperOrigin-RevId: 235248572
Change-Id: I5b0538b6feb365a98712c2a2d56d856fe80a8a09
2019-02-22 13:34:47 -08:00
Nicolas Lacasse 0a41ea72c1 Don't allow writing or reading to TTY unless process group is in foreground.
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.

See drivers/tty/n_tty.c:n_tty_read()=>job_control().

If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.

See drivers/tty/tty_io.c:tty_check_change().

PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-14 15:47:31 -08:00
Bhasker Hariharan e0b3d3323f Add support for using PACKET_RX_RING to receive packets.
PACKET_RX_RING allows the use of an mmapped buffer to receive packets from the
kernel. This should cut down the number of host syscalls that need to be made
to receive packets when the underlying fd is a socket of the AF_PACKET type.

PiperOrigin-RevId: 233834998
Change-Id: I8060025c6ced206986e94cc46b8f382b81bfa47f
2019-02-13 14:53:03 -08:00
Nicolas Lacasse 92e85623a0 Factor the subtargets method into a helper method with tests.
PiperOrigin-RevId: 232047515
Change-Id: I00f036816e320356219be7b2f2e6d5fe57583a60
2019-02-01 15:23:43 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Andrei Vagin 7e8a56087b runsc: check whether a container is deleted or not before setupContainerFS
PiperOrigin-RevId: 231811387
Change-Id: Ib143fb9a4d0fa1f105d1a3a3bd533dfc44e792af
2019-01-31 10:34:15 -08:00
Andrei Vagin dd577f5410 runsc: reap a sandbox process only in sandbox.Wait()
PiperOrigin-RevId: 231504064
Change-Id: I585b769aef04a3ad7e7936027958910a6eed9c8d
2019-01-29 17:15:56 -08:00
Bhasker Hariharan 24cb2c0a72 Use recvmmsg() instead of readv() to read packets from NIC.
This should reduce the number of syscalls required to process packets
significantly and improve throughputs.

PiperOrigin-RevId: 231366886
Change-Id: I8b38077262bf9c53176bc4a94b530188d3d7c0ca
2019-01-29 01:39:01 -08:00
Fabricio Voznika c1be25b78d Scrub runsc error messages
Removed "error" and "failed to" prefix that don't add value
from messages. Adjusted a few other messages.  In particular,
when the container fail to start, the message returned is easier
for humans to read:

$ docker run --rm --runtime=runsc alpine foobar
docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory

Closes #77

PiperOrigin-RevId: 230022798
Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1
2019-01-18 17:36:02 -08:00
Fabricio Voznika e4d3ca7263 Prevent internal tmpfs mount to override files in /tmp
Runsc wants to mount /tmp using internal tmpfs implementation for
performance. However, it risks hiding files that may exist under
/tmp in case it's present in the container. Now, it only mounts
over /tmp iff:
  - /tmp was not explicitly asked to be mounted
  - /tmp is empty

If any of this is not true, then /tmp maps to the container's
image /tmp.

Note: checkpoint doesn't have sentry FS mounted to check if /tmp
is empty. It simply looks for explicit mounts right now.
PiperOrigin-RevId: 229607856
Change-Id: I10b6dae7ac157ef578efc4dfceb089f3b94cde06
2019-01-16 12:48:32 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Michael Pratt 33191e1cc4 Automated rollback of changelist 225089593
PiperOrigin-RevId: 227595007
Change-Id: If14cc5aab869c5fd7a4ebd95929c887ab690e94c
2019-01-02 15:48:00 -08:00
Fabricio Voznika a891afad6d Simplify synchronization between runsc and sandbox process
Make 'runsc create' join cgroup before creating sandbox process.
This removes the need to synchronize platform creation and ensure
that sandbox process is charged to the right cgroup from the start.

PiperOrigin-RevId: 227166451
Change-Id: Ieb4b18e6ca0daf7b331dc897699ca419bc5ee3a2
2018-12-28 13:48:24 -08:00
Jamie Liu 194ef586fc Rename limits.MemoryPagesLocked to limits.MemoryLocked.
"RLIMIT_MEMLOCK: This is the maximum number of bytes of memory that may
be locked into RAM." - getrlimit(2)

PiperOrigin-RevId: 226384346
Change-Id: Iefac4a1bb69f7714dc813b5b871226a8344dc800
2018-12-20 13:28:46 -08:00
Googler 86c9bd2547 Automated rollback of changelist 225861605
PiperOrigin-RevId: 226224230
Change-Id: Id24c7d3733722fd41d5fe74ef64e0ce8c68f0b12
2018-12-19 13:30:08 -08:00
Michael Pratt b62591e6a8 Expose internal testing flag
Never to used outside of runsc tests!

PiperOrigin-RevId: 225919013
Change-Id: Ib3b14aa2a2564b5246fb3f8933d95e01027ed186
2018-12-17 17:35:06 -08:00
Jamie Liu 2421006426 Implement mlock(), kind of.
Currently mlock() and friends do nothing whatsoever. However, mlocking
is directly application-visible in a number of ways; for example,
madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked
regions. We handle this inconsistently: MADV_DONTNEED is too important
to not work, but MS_INVALIDATE is rejected.

Change MM to track mlocked regions in a manner consistent with Linux.
It still will not actually pin pages into host physical memory, but:

- mlock() will now cause sentry memory management to precommit mlocked
pages.

- MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as
described above.

PiperOrigin-RevId: 225861605
Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee
2018-12-17 11:38:59 -08:00
Michael Pratt 24c1158b9c Add "trace signal" option
This option is effectively equivalent to -panic-signal, except that the
sandbox does not die after logging the traceback.

PiperOrigin-RevId: 225089593
Change-Id: Ifb1c411210110b6104613f404334bd02175e484e
2018-12-11 16:12:41 -08:00
Brian Geffon d3bc79bc84 Open source system call tests.
PiperOrigin-RevId: 224886231
Change-Id: I0fccb4d994601739d8b16b1d4e6b31f40297fb22
2018-12-10 14:42:34 -08:00
Zhaozhong Ni 9984138abe sentry: turn "dynamically-created" procfs files into static creation.
PiperOrigin-RevId: 224600982
Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34
2018-12-07 17:03:54 -08:00
Brian Geffon 82719be42e Max link traversals should be for an entire path.
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.

PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-12-04 14:32:03 -08:00
Fabricio Voznika eaac94d91c Use RET_KILL_PROCESS if available in kernel
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.

For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).

PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20 22:56:51 -08:00
Fabricio Voznika fadffa2ff8 Add unsupported syscall events for get/setsockopt
PiperOrigin-RevId: 222148953
Change-Id: I21500a9f08939c45314a6414e0824490a973e5aa
2018-11-20 14:04:12 -08:00
Fabricio Voznika 237f9c7a5e Don't fail when destroyContainerFS is called more than once
This can happen when destroy is called multiple times or when destroy
failed previously and is being called again.

PiperOrigin-RevId: 221882034
Change-Id: I8d069af19cf66c4e2419bdf0d4b789c5def8d19e
2018-11-20 14:03:42 -08:00
Nicolas Lacasse 845836c578 Internal change.
PiperOrigin-RevId: 221848471
Change-Id: I882fbe5ce7737048b2e1f668848e9c14ed355665
2018-11-20 14:03:11 -08:00
Nicolas Lacasse 40f843fc78 Internal change.
PiperOrigin-RevId: 221343626
Change-Id: I03d57293a555cf4da9952a81803b9f8463173c89
2018-11-13 15:18:17 -08:00
Nicolas Lacasse 6c2d320138 Internal change.
PiperOrigin-RevId: 221299066
Change-Id: I8ae352458f9976c329c6946b1efa843a3de0eaa4
2018-11-13 11:14:28 -08:00
Fabricio Voznika d97ccfa346 Close donated files if containerManager.Start() fails
PiperOrigin-RevId: 220869535
Change-Id: I9917e5daf02499f7aab6e2aa4051c54ff4461b9a
2018-11-09 14:54:34 -08:00
Nicolas Lacasse 13b48f2e6a AsyncBarrier should be run after all defers in destroyContainerFS.
destroyContainerFS must wait for all async operations to finish before
returning. In an attempt to do this, we call fs.AsyncBarrier() at the end of
the function. However, there are many defer'd DecRefs which end up running
AFTER the AsyncBarrier() call.

This CL fixes this by calling fs.AsyncBarrier() in the first defer statement,
thus ensuring that it runs at the end of the function, after all other defers.

PiperOrigin-RevId: 220523545
Change-Id: I5e96ee9ea6d86eeab788ff964484c50ef7f64a2f
2018-11-07 13:55:36 -08:00
Fabricio Voznika c92b9b7086 Add more logging to controller.go
PiperOrigin-RevId: 220519632
Change-Id: Iaeec007fc1aa3f0b72569b288826d45f2534c4bf
2018-11-07 13:33:19 -08:00
Fabricio Voznika 86b3f0cd24 Fix race between start and destroy
Before this change, a container starting up could race with
destroy (aka delete) and leave processes behind.

Now, whenever a container is created, Loader.processes gets
a new entry. Start now expects the entry to be there, and if
it's not it means that the container was deleted.

I've also fixed Loader.waitPID to search for the process using
the init process's PID namespace.

We could use a few more tests for signal and wait. I'll send
them in another cl.

PiperOrigin-RevId: 220224290
Change-Id: I15146079f69904dc07d43c3b66cc343a2dab4cc4
2018-11-05 21:29:37 -08:00
Fabricio Voznika a467f09261 Log when external signal is received
PiperOrigin-RevId: 220204591
Change-Id: I21a9c6f5c12a376d18da5d10c1871837c4f49ad2
2018-11-05 17:42:24 -08:00
Fabricio Voznika 5cd55cd90f Use spec with clean paths for gofer
Otherwise the gofer's attach point may be different from sandbox when there
symlinks in the path.

PiperOrigin-RevId: 219730492
Change-Id: Ia9c4c2d16228c6a1a9e790e0cb673fd881003fe1
2018-11-01 17:52:11 -07:00
Fabricio Voznika b6b81fd04b Add new log format that is compatible with Kubernetes
Fluentd configuration uses 'log' for the log message
while containerd uses 'msg'. Since we can't have a single
JSON format for both, add another log format and make
debug log configurable.

PiperOrigin-RevId: 219729658
Change-Id: I2a6afc4034d893ab90bafc63b394c4fb62b2a7a0
2018-11-01 17:44:58 -07:00
Ian Lewis 9d69d85bc1 Make error messages a bit more user friendly.
Updated error messages so that it doesn't print full Go struct representations
when running a new container in a sandbox. For example, this occurs frequently
when commands are not found when doing a 'kubectl exec'.

PiperOrigin-RevId: 219729141
Change-Id: Ic3a7bc84cd7b2167f495d48a1da241d621d3ca09
2018-11-01 17:40:09 -07:00
Adin Scannell 0091db9cbd kvm: use private futexes.
Use private futexes for performance and to align with other runtime uses.

PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
2018-10-30 22:46:42 -07:00
Kevin Krakauer b42a2a3203 Removes outdated TODO.
PiperOrigin-RevId: 219151173
Change-Id: I73014ea648ae485692ea0d44860c87f4365055cb
2018-10-29 10:31:56 -07:00
Adin Scannell 75cd70ecc9 Track paths and provide a rename hook.
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.

PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
2018-10-23 00:20:15 -07:00
Fabricio Voznika b2068cf5a5 Add more unimplemented syscall events
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.

PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
2018-10-20 11:14:23 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Fabricio Voznika f3ffa4db52 Resolve mount paths while setting up root fs mount
It's hard to resolve symlinks inside the sandbox because rootfs and mounts
may be read-only, forcing us to create mount points inside lower layer of an
overlay, **before** the volumes are mounted.

Since the destination must already be resolved outside the sandbox when creating
mounts, take this opportunity to rewrite the spec with paths resolved.
"runsc boot" will use the "resolved" spec to load mounts. In addition, symlink
traversals were disabled while mounting containers inside the sandbox.

It haven't been able to write a good test for it. So I'm relying on manual tests
for now.

PiperOrigin-RevId: 217749904
Change-Id: I7ac434d5befd230db1488446cda03300cc0751a9
2018-10-18 12:42:24 -07:00
Nicolas Lacasse e0bb94201f Close the gofer socket gracefully in boot:boot_test.
We were closing the FD directly. If the test then created a new socket pair
with the same FD, in-flight RPCs would get directed to the new socket and break
the test.

Instead, we should use unet.Socket.Close(), which allows any in-flight RPCs to
finish.

PiperOrigin-RevId: 217608491
Change-Id: I8c5a76638899ba30f33ca976e6fac967fa0aadbf
2018-10-17 16:18:39 -07:00
Nicolas Lacasse 4e6f0892c9 runsc: Support job control signals for the root container.
Now containers run with "docker run -it" support control characters like ^C and
^Z.

This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.

PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-17 12:29:05 -07:00
Kevin Krakauer 8cbca46b6d Remove incorrect TODO.
PiperOrigin-RevId: 217548429
Change-Id: Ie640c881fdc4fc70af58c8ca834df1ac531e519a
2018-10-17 10:55:34 -07:00
Kevin Krakauer 9b3550f70b runsc: Add --pid flag to runsc kill.
--pid allows specific processes to be signalled rather than the container root
process or all processes in the container. containerd needs to SIGKILL exec'd
processes that timeout and check whether processes are still alive.

PiperOrigin-RevId: 217547636
Change-Id: I2058ebb548b51c8eb748f5884fb88bad0b532e45
2018-10-17 10:51:39 -07:00
Nicolas Lacasse 3bc5e6482b Fix reference leak in tests.
PiperOrigin-RevId: 216780438
Change-Id: Ide637fe36f8d2a61fea9e5b16d1b3401f2540416
2018-10-11 16:23:54 -07:00
Fabricio Voznika e68d86e1bd Make debug log file name configurable
This is a breaking change if you're using --debug-log-dir.
The fix is to replace it with --debug-log and add a '/' at
the end:
  --debug-log-dir=/tmp/runsc ==> --debug-log=/tmp/runsc/

PiperOrigin-RevId: 216761212
Change-Id: I244270a0a522298c48115719fa08dad55e34ade1
2018-10-11 14:29:37 -07:00
Fabricio Voznika f413e4b117 Add bare bones unsupported syscall logging
This change introduces a new flags to create/run called
--user-log. Logs to this files are visible to users and
are meant to help debugging problems with their images
and containers.

For now only unsupported syscalls are sent to this log,
and only minimum support was added. We can build more
infrastructure around it as needed.

PiperOrigin-RevId: 216735977
Change-Id: I54427ca194604991c407d49943ab3680470de2d0
2018-10-11 11:56:54 -07:00
Kevin Krakauer e21ba16d9c Removes irrelevant TODO.
PiperOrigin-RevId: 216616873
Change-Id: I4d974ab968058eadd01542081e18a987ef08f50a
2018-10-10 16:50:59 -07:00
Jonathan Giannuzzi 8388a505e7 Support for older Linux kernels without getrandom
Change-Id: I1fb9f5b47a264a7617912f6f56f995f3c4c5e578
PiperOrigin-RevId: 216591484
2018-10-10 14:18:47 -07:00
Michael Pratt ddb34b3690 Enforce message size limits and avoid host calls with too many iovecs
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.

When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.

To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.

PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
2018-10-10 14:10:17 -07:00
Fabricio Voznika 29cd05a7c6 Add sandbox to cgroup
Sandbox creation uses the limits and reservations configured in the
OCI spec and set cgroup options accordinly. Then it puts both the
sandbox and gofer processes inside the cgroup.

It also allows the cgroup to be pre-configured by the caller. If the
cgroup already exists, sandbox and gofer processes will join the
cgroup but it will not modify the cgroup with spec limits.

PiperOrigin-RevId: 216538209
Change-Id: If2c65ffedf55820baab743a0edcfb091b89c1019
2018-10-10 09:00:42 -07:00
Ian Gudger c36d2ef373 Add new netstack metrics to the sentry
PiperOrigin-RevId: 216431260
Change-Id: Ia6e5c8d506940148d10ff2884cf4440f470e5820
2018-10-09 15:12:44 -07:00
Nicolas Lacasse ae5122eb87 Job control signals must be sent to all processes in the FG process group.
We were previously only sending to the originator of the process group.

Integration test was changed to test this behavior. It fails without the
corresponding code change.

PiperOrigin-RevId: 216297263
Change-Id: I7e41cfd6bdd067f4b9dc215e28f555fb5088916f
2018-10-08 20:48:54 -07:00
Michael Pratt b8048f75da Uncapitalize error
PiperOrigin-RevId: 216281263
Change-Id: Ie0c189e7f5934b77c6302336723bc1181fd2866c
2018-10-08 17:44:39 -07:00
Nicolas Lacasse e215b9970a runsc: Pass root container's stdio via FD.
We were previously using the sandbox process's stdio as the root container's
stdio. This makes it difficult/impossible to distinguish output application
output from sandbox output, such as panics, which are always written to stderr.

Also close the console socket when we are done with it.

PiperOrigin-RevId: 215585180
Change-Id: I980b8c69bd61a8b8e0a496fd7bc90a06446764e0
2018-10-03 10:32:03 -07:00
Fabricio Voznika 77e43adeab Add TIOCINQ to allowed seccomp when hostinet is used
PiperOrigin-RevId: 215574070
Change-Id: Ib36e804adebaf756adb9cbc2752be9789691530b
2018-10-03 09:32:54 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Fabricio Voznika a2ad8fef13 Make multi-container the default mode for runsc
And remove multicontainer option.

PiperOrigin-RevId: 215236981
Change-Id: I9fd1d963d987e421e63d5817f91a25c819ced6cb
2018-10-01 10:31:17 -07:00
Fabricio Voznika 2496d9b4b6 Make runsc kill and delete more conformant to the "spec"
PiperOrigin-RevId: 214976251
Change-Id: I631348c3886f41f63d0e77e7c4f21b3ede2ab521
2018-09-28 12:22:21 -07:00
Googler fb65b0b471 Change tcpip.Route.Mask to tcpip.AddressMask.
PiperOrigin-RevId: 214975659
Change-Id: I7bd31a2c54f03ff52203109da312e4206701c44c
2018-09-28 12:18:15 -07:00
Fabricio Voznika 6779bd1187 Merge Loader.containerRootTGs and execProcess into a single map
It's easier to manage a single map with processes that we're interested
to track. This will make the next change to clean up the map on destroy
easier.

PiperOrigin-RevId: 214894210
Change-Id: I099247323a0487cd0767120df47ba786fac0926d
2018-09-27 23:55:05 -07:00
Nicolas Lacasse b709d23987 Forward ioctl(TCSETSF) calls on host ttys to the host kernel.
We already forward TCSETS and TCSETSW.  TCSETSF is roughly equivalent but
discards pending input.

The filters were relaxed to allow host ioctls with TCSETSF argument.

This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.

Before:
	root@b8a0240fc836:/# passwd
	Enter new UNIX password: 123
	Retype new UNIX password: 123
	passwd: password updated successfully

After:
	root@ae6f5dabe402:/# passwd
	Enter new UNIX password:
	Retype new UNIX password:
	passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
2018-09-27 18:17:38 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Fabricio Voznika b514ab0589 Refactor 'runsc boot' to take container ID as argument
This makes the flow slightly simpler (no need to call
Loader.SetRootContainer). And this is required change to tag
tasks with container ID inside the Sentry.

PiperOrigin-RevId: 214795210
Change-Id: I6ff4af12e73bb07157f7058bb15fd5bb88760884
2018-09-27 10:26:34 -07:00
Nicolas Lacasse d489336784 runsc: All non-root bind mounts should be shared.
This CL changes the semantics of the "--file-access" flag so that it only
affects the root filesystem.  The default remains "exclusive" which is the
common use case, as neither Docker nor K8s supports sharing the root.

Keeping the root fs as "exclusive" means that the fs-intensive work done during
application startup will mostly be cacheable, and thus faster.

Non-root bind mounts will always be shared.

This CL also removes some redundant FSAccessType validations.  We validate this
flag in main(), so we can assume it is valid afterwards.

PiperOrigin-RevId: 214359936
Change-Id: I7e75d7bf52dbd7fa834d0aacd4034868314f3b51
2018-09-24 17:22:15 -07:00