61 Commits

Author SHA1 Message Date
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctls for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.
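
The filtering rule described above can be sketched roughly as follows. This is an illustration only, not the runsc seccomp code: the rule table and the ioctlAllowed helper are hypothetical names, the constants are the Linux x86-64 values written out by hand so the sketch builds anywhere, and the real filter list may permit more termios requests than the two shown.

```go
package main

import "fmt"

// Linux x86-64 values, written out so the sketch does not depend on the
// host platform.
const (
	sysIoctl   = 16     // SYS_IOCTL
	tcgets     = 0x5401 // TCGETS: read the termios struct
	tcsets     = 0x5402 // TCSETS: write the termios struct
	tiocsti    = 0x5412 // TIOCSTI: used below as an example of a rejected request
	tiocgwinsz = 0x5413 // TIOCGWINSZ: read the window size
	tiocswinsz = 0x5414 // TIOCSWINSZ: write the window size
)

// allowedIoctlRequests is a hypothetical rule table: ioctl is permitted only
// when its request argument is one of these values.
var allowedIoctlRequests = map[uintptr]bool{
	tcgets:     true,
	tcsets:     true,
	tiocgwinsz: true,
	tiocswinsz: true,
}

// ioctlAllowed reports whether the (syscall number, request) pair matches the
// ioctl rule. Other syscalls are out of scope for this sketch.
func ioctlAllowed(sysno, request uintptr) bool {
	return sysno == sysIoctl && allowedIoctlRequests[request]
}

func main() {
	fmt.Println(ioctlAllowed(sysIoctl, tiocgwinsz)) // window-size query matches the rule
	fmt.Println(ioctlAllowed(sysIoctl, tiocsti))    // any other request does not
}
```

Matching on the request argument rather than on the syscall number alone is what lets the rule be installed for every container without opening up ioctl in general.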

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminal:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Kevin Krakauer 02dfceab6d runsc: Allow runsc to properly search the PATH for executable name.
Previously, runsc improperly attempted to find an executable in the container's
PATH.

We now search the PATH via the container's fsgofer rather than the host FS,
eliminating the confusing differences between paths on the host and within a
container.

PiperOrigin-RevId: 210159488
Change-Id: I228174dbebc4c5356599036d6efaa59f28ff28d2
2018-08-24 14:42:40 -07:00
Fabricio Voznika 001a4c2493 Clean up syscall filters
Removed syscalls that are only used by whitelistfs
which has its own set of filters.

PiperOrigin-RevId: 209967259
Change-Id: Idb2e1b9d0201043d7cd25d96894f354729dbd089
2018-08-23 11:15:07 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Nicolas Lacasse 2033f61aae runsc: Fix instances of file access "proxy".
This file access type is actually called "proxy-shared", but I forgot to update
all locations.

PiperOrigin-RevId: 208832491
Change-Id: I7848bc4ec2478f86cf2de1dcd1bfb5264c6276de
2018-08-15 09:34:18 -07:00
Nicolas Lacasse e8a4f2e133 runsc: Change cache policy for root fs and volume mounts.
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively.  While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.

This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.

This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.

A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.

All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely be served by the
root fs.
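
The resulting mapping from flag value to cache policy can be summarized with a small sketch. The cachePolicy function and its signature are hypothetical; only the flag values and policy names come from this CL.

```go
package main

import "fmt"

// cachePolicy returns the cache policy name for a mount, given the value of
// the "--file-access" flag and whether the mount is a volume (as opposed to
// the root filesystem). Hypothetical helper for illustration.
func cachePolicy(fileAccess string, isVolume bool) (string, error) {
	var rootPolicy string
	switch fileAccess {
	case "proxy":
		// Default: keep caching but revalidate across the sandbox boundary.
		rootPolicy = "remote-revalidate"
	case "proxy-exclusive":
		// The sandbox has exclusive access, so aggressive caching is safe.
		rootPolicy = "fscache"
	default:
		return "", fmt.Errorf("unhandled file access type %q", fileAccess)
	}
	if isVolume {
		// Volumes always use the new policy, regardless of the flag.
		return "remote-revalidate", nil
	}
	return rootPolicy, nil
}

func main() {
	for _, access := range []string{"proxy", "proxy-exclusive"} {
		root, _ := cachePolicy(access, false)
		vol, _ := cachePolicy(access, true)
		fmt.Printf("%s: root=%s volume=%s\n", access, root, vol)
	}
}
```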

PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
2018-08-14 16:25:58 -07:00
Fabricio Voznika 4e171f7590 Basic support for ip link/addr and ifconfig
Closes #94

PiperOrigin-RevId: 207997580
Change-Id: I19b426f1586b5ec12f8b0cd5884d5b401d334924
2018-08-08 22:39:58 -07:00
Fabricio Voznika ea1e39a314 Resend packets back to netstack if destined to itself
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080

PiperOrigin-RevId: 207995083
Change-Id: I17adc2a04df48bfea711011a5df206326a1fb8ef
2018-08-08 22:03:35 -07:00
Fabricio Voznika 0d350aac7f Enable SACK in runsc
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.

PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
2018-08-08 10:26:18 -07:00
Ian Gudger 3cd7824410 Move stack clock to options struct
PiperOrigin-RevId: 207039273
Change-Id: Ib8f55a6dc302052ab4a10ccd70b07f0d73b373df
2018-08-01 20:22:02 -07:00
Michael Pratt 6cad96f38a Drop dup2 filter
It is unused.

PiperOrigin-RevId: 206798328
Change-Id: I2d7d27c0e4a0ef51264b900f14f1b3fdad17f2c4
2018-07-31 11:38:57 -07:00
Justine Olshan c05660373e Moved restore code out of create and made to be called after create.
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specific actions regarding the kernel
and the file system. These actions were originally in booting the sandbox.

Now setting up the file system is deferred until a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.

These changes required careful execution of concurrent processes which
required the use of a channel.

Full docker integration still needs the ability to restore into the same
container.

PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
2018-07-18 16:58:30 -07:00
Nicolas Lacasse 9059983fdb runsc: Fix map access race in boot.Loader.waitContainer.
PiperOrigin-RevId: 204522004
Change-Id: I4819dc025f0a1df03ceaaba7951b1902d44562b3
2018-07-13 13:46:14 -07:00
Nicolas Lacasse 67507bd579 runsc: Don't close the control server in a defer.
Closing the control server will block until all open requests have completed.
If a control server method panics, we end up stuck because the defer'd Destroy
function will never return.

PiperOrigin-RevId: 204354676
Change-Id: I6bb1d84b31242d7c3f20d5334b1c966bd6a61dbf
2018-07-12 13:36:57 -07:00
Bhasker Hariharan c15cb8d432 Automated rollback of changelist 203157739
PiperOrigin-RevId: 204196916
Change-Id: If632750fc6368acb835e22cfcee0ae55c8a04d16
2018-07-11 15:07:19 -07:00
Michael Pratt 660f1203ff Fix runsc VDSO mapping
80bdf8a406 accidentally moved vdso into an
inner scope, never assigning the vdso variable passed to the Kernel and
thus skipping VDSO mappings.

Fix this and remove the ability for loadVDSO to skip VDSO mappings,
since tests that do so are gone.

PiperOrigin-RevId: 203169135
Change-Id: Ifd8cadcbaf82f959223c501edcc4d83d05327eba
2018-07-03 12:53:39 -07:00
Fabricio Voznika 52ddb8571c Skip overlay on root when it's readonly
PiperOrigin-RevId: 203161098
Change-Id: Ia1904420cb3ee830899d24a4fe418bba6533be64
2018-07-03 12:01:09 -07:00
Fabricio Voznika 0ef6066167 Resend packets back to netstack if destined to itself
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080

PiperOrigin-RevId: 203157739
Change-Id: I31c9f7c501e3f55007f25e1852c27893a16ac6c4
2018-07-03 11:39:17 -07:00
Nicolas Lacasse 4500155ffc runsc: Mount "mandatory" mounts right after mounting the root.
The /proc and /sys mounts are "mandatory" in the sense that they should be
mounted in the sandbox even when they are not included in the spec. Runsc
treats /tmp similarly, because it is faster to use the internal tmpfs
implementation instead of proxying to the host.

However, the spec may contain submounts of these mandatory mounts (particularly
for /tmp). In those cases, we must mount our mandatory mounts before the
submount, otherwise the submount will be masked.

Since the mandatory mounts are all top-level directories, we can mount them
right after the root.

PiperOrigin-RevId: 203145635
Change-Id: Id69bae771d32c1a5b67e08c8131b73d9b42b2fbf
2018-07-03 10:36:22 -07:00
Dmitry Vyukov 6144751962 runsc/boot/filter: permit SYS_TIME for race
glibc's malloc also uses SYS_TIME. Permit it.

#0  0x0000000000de6267 in time ()
#1  0x0000000000db19d8 in get_nprocs ()
#2  0x0000000000d8a31a in arena_get2.part ()
#3  0x0000000000d8ab4a in malloc ()
#4  0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) ()
#5  0x0000000000d4cd70 in __tsan_go_start ()
#6  0x00000000004617a3 in racecall ()
#7  0x00000000010f4ea0 in runtime.findfunctab ()
#8  0x000000000043f193 in runtime.racegostart ()

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
[mpratt@google.com: updated comments and commit message]
Signed-off-by: Michael Pratt <mpratt@google.com>

Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a
PiperOrigin-RevId: 203042627
2018-07-02 17:47:32 -07:00
Fabricio Voznika fa64c2a151 Make default limits the same as with runc
Closes #2

PiperOrigin-RevId: 202997196
Change-Id: I0c9f6f5a8a1abe1ae427bca5f590bdf9f82a6675
2018-07-02 12:51:38 -07:00
Justine Olshan 80bdf8a406 Sets the restore environment for restoring a container.
Updated how restoring occurs through boot.go with a separate Restore function.
This prevents a new process and new mounts from being created.
Added tests to ensure the container is restored.
Registered checkpoint and restore commands so they can be used.
Docker support for these commands is still limited.
Working on #80.

PiperOrigin-RevId: 202710950
Change-Id: I2b893ceaef6b9442b1ce3743bd112383cb92af0c
2018-06-29 14:47:40 -07:00
Kevin Krakauer 16d37973eb runsc: Add the "wait" subcommand.
Users can now call "runsc wait <container id>" to wait on a particular process
inside the container. -pid can also be used to wait on a specific PID.

Manually tested the wait subcommand for a single waiter and multiple waiters
(simultaneously 2 processes waiting on the container and 2 processes waiting on
a PID within the container).

PiperOrigin-RevId: 202548978
Change-Id: Idd507c2cdea613c3a14879b51cfb0f7ea3fb3d4c
2018-06-28 14:56:36 -07:00
Fabricio Voznika 8459390cdd Error out if spec is invalid
Closes #66

PiperOrigin-RevId: 202496258
Change-Id: Ib9287c5bf1279ffba1db21ebd9e6b59305cddf34
2018-06-28 09:57:27 -07:00
Fabricio Voznika 1f207de315 Add option to configure watchdog action
PiperOrigin-RevId: 202494747
Change-Id: I4d4a18e71468690b785060e580a5f83c616bd90f
2018-06-28 09:46:50 -07:00
Lantao Liu 000fd8d1e4 runsc: set gofer umask to 0.
PiperOrigin-RevId: 202185642
Change-Id: I2eefcc0b2ffadc6ef21d177a8a4ab0cda91f3399
2018-06-26 13:40:04 -07:00
Lantao Liu e8ae2b85e9 runsc: add a `multi-container` flag to enable multi-container support.
PiperOrigin-RevId: 201995800
Change-Id: I770190d135e14ec7da4b3155009fe10121b2a502
2018-06-25 12:08:44 -07:00
Fabricio Voznika cecc1e472c Fix lint errors
PiperOrigin-RevId: 201978212
Change-Id: Ie3df1fd41d5293fff66b546a0c68c3bf98126067
2018-06-25 10:41:27 -07:00
Kevin Krakauer 04bdcc7b65 runsc: Enable waiting on individual containers within a sandbox.
PiperOrigin-RevId: 201742160
Change-Id: Ia9fa1442287c5f9e1196fb117c41536a80f6bb31
2018-06-22 14:31:25 -07:00
Fabricio Voznika f6be5fe619 Forward SIGUSR2 to the sandbox too
SIGUSR2 was being masked out to be used as a way to dump sentry
stacks. This could cause compatibility problems in case anyone
uses SIGUSR2 to communicate with the container init process.

PiperOrigin-RevId: 201575374
Change-Id: I312246e828f38ad059139bb45b8addc2ed055d74
2018-06-21 13:22:18 -07:00
Justine Olshan f2a687001d Added functionality to create a RestoreEnvironment.
Before a container can be restored, the mounts must be configured.
The root and submounts and their key information are compiled into a
RestoreEnvironment.
Future code will be added to set this created environment before
restoring a container.
Tests to ensure the correct environment were added.

PiperOrigin-RevId: 201544637
Change-Id: Ia894a8b0f80f31104d1c732e113b1d65a4697087
2018-06-21 10:18:11 -07:00
Nicolas Lacasse 81d13fbd4d runsc: Default umask should be 0.
PiperOrigin-RevId: 201539050
Change-Id: I36cbf270fa5ad25de507ecb919e4005eda6aa16d
2018-06-21 09:43:15 -07:00
Fabricio Voznika 4ad7315b67 Add 'runsc debug' command
It prints sandbox stacks to the log to help debug stuckness. I expect
that many more options will be added in the future.

PiperOrigin-RevId: 201405931
Change-Id: I87e560800cd5a5a7b210dc25a5661363c8c3a16e
2018-06-20 13:31:31 -07:00
Kevin Krakauer 5397963b5d runsc: Enable container creation within existing sandboxes.
Containers are created as processes in the sandbox. Of the many things that
don't work yet, the biggest issue is that the fsgofer is launched with its root
as the sandbox's root directory. Thus, when a container is started and wants to
read anything (including the init binary of the container), the gofer tries to
serve from the sandbox's root (which basically just has pause), not the container's.

PiperOrigin-RevId: 201294560
Change-Id: I6423aa8830538959c56ae908ce067e4199d627b1
2018-06-19 21:44:33 -07:00
Kevin Krakauer 3ebd0e35f4 runsc: Whitelist lstat, as it is now used in specutils.
When running multi-container, child containers are added after the filters have
been installed. Thus, lstat must be in the set of allowed syscalls.

PiperOrigin-RevId: 201269550
Change-Id: I03f2e6675a53d462ed12a0f651c10049b76d4c52
2018-06-19 17:17:41 -07:00
Justine Olshan a6dbef045f Added a resume command to unpause a paused container.
Resume checks the status of the container and unpauses the kernel
if its status is paused. Otherwise nothing happens.
Tests were added to ensure that the process is in the correct state
after various commands.
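
The status check amounts to a small state machine, sketched here with hypothetical names (the container type and its pause and resume methods are not the runsc API). The sketch also assumes that pause only applies to a running container.

```go
package main

import "fmt"

type status string

const (
	running status = "running"
	paused  status = "paused"
)

// container holds only the field needed to illustrate the status check.
type container struct {
	status status
}

// pause marks a running container as paused (assumed precondition).
func (c *container) pause() {
	if c.status == running {
		c.status = paused
	}
}

// resume unpauses the container only if it is currently paused; in any other
// state it does nothing.
func (c *container) resume() {
	if c.status == paused {
		c.status = running
	}
}

func main() {
	c := &container{status: running}
	c.resume() // no effect: the container is not paused
	c.pause()
	fmt.Println(c.status)
	c.resume()
	fmt.Println(c.status)
}
```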

PiperOrigin-RevId: 201251234
Change-Id: Ifd11b336c33b654fea6238738f864fcf2bf81e19
2018-06-19 15:23:36 -07:00
Justine Olshan 873ec0c414 Modified boot.go to allow for restores.
A file descriptor was added as a flag to boot so a state file can restore a
container that was checkpointed.

PiperOrigin-RevId: 201068699
Change-Id: I18e96069488ffa3add468861397f3877725544aa
2018-06-18 15:20:36 -07:00
Lantao Liu 821aaf531d runsc: support "rw" mount option.
PiperOrigin-RevId: 201018483
Change-Id: I52fe3d01c83c8a2f0e9275d9d88c37e46fa224a2
2018-06-18 10:34:11 -07:00
Justine Olshan 0786707cd9 Added code for a pause command for a container process.
Like runc, the pause command will pause the processes of the given container.
It will set that container's status to "paused."
A resume command will be added to unpause and continue running the process.

PiperOrigin-RevId: 200789624
Change-Id: I72a5d7813d90ecfc4d01cc252d6018855016b1ea
2018-06-15 16:09:09 -07:00
Lantao Liu 2081c5e7f7 runsc: support /dev bind mount which does not conflict with default /dev mount.
PiperOrigin-RevId: 200768923
Change-Id: I4b8da10bcac296e8171fe6754abec5aabfec5e65
2018-06-15 13:58:39 -07:00
Fabricio Voznika ef5dd4df9b Set kernel.applicationCores to the number of processors on the host
The right number to use is the number of processors assigned to the cgroup. But until
we make the sandbox join the respective cgroup, just use the number of processors on
the host.

Closes #65, closes #66

PiperOrigin-RevId: 200725483
Change-Id: I34a566b1a872e26c66f56fa6e3100f42aaf802b1
2018-06-15 09:19:04 -07:00
Michael Pratt d71f5ef688 Add nanosleep filter for Go 1.11 support
golang.org/cl/108538 replaces pselect6 with nanosleep in runtime.usleep. Update
the filters accordingly.

PiperOrigin-RevId: 200574612
Change-Id: Ifb2296fcb3781518fc047aabbbffedb9ae488cd7
2018-06-14 10:11:05 -07:00
Fabricio Voznika 717f2501c9 Fix failure to mount volume that sandbox process has no access to
Boot loader tries to stat mount to determine whether it's a file or not. This
may fail if the sandbox process doesn't have access to the file. Instead, add
overlay on top of file, which is better anyway since we don't want to propagate
changes to the host.

PiperOrigin-RevId: 200411261
Change-Id: I14222410e8bc00ed037b779a1883d503843ffebb
2018-06-13 10:20:06 -07:00
Lantao Liu 2506b9b11f runsc: do not include sub target if it is not started with '/'.
PiperOrigin-RevId: 200274828
Change-Id: I956703217df08d8650a881479b7ade8f9f119912
2018-06-12 13:54:54 -07:00
Brielle Broder 711a9869e5 Runsc checkpoint works.
This is the first iteration of checkpoint that actually saves to a file.
Tests for checkpoint are included.

Ran into an issue when private unix sockets are enabled. An error message
was added for this case and the mutex state was set.

PiperOrigin-RevId: 200269470
Change-Id: I28d29a9f92c44bf73dc4a4b12ae0509ee4070e93
2018-06-12 13:25:23 -07:00
Kevin Krakauer 2dc9cd7bf7 runsc: enable terminals in the sandbox.
runsc now mounts the devpts filesystem, so you get a real terminal using
ssh+sshd.

PiperOrigin-RevId: 200244830
Change-Id: If577c805ad0138fda13103210fa47178d8ac6605
2018-06-12 11:03:25 -07:00
Fabricio Voznika 48335318a2 Enable debug logging in tests
Unit tests call runsc directly now, so all command line arguments
are valid. On the other hand, enabling debug in the test binary
doesn't affect runsc. It needs to be set in the config.

PiperOrigin-RevId: 200237706
Change-Id: I0b5922db17f887f58192dbc2f8dd2fd058b76ec7
2018-06-12 10:25:55 -07:00
Fabricio Voznika 5c51bc51e4 Drop capabilities not needed by Gofer
PiperOrigin-RevId: 199808391
Change-Id: Ib37a4fb6193dc85c1f93bc16769d6aa41854b9d4
2018-06-08 09:59:26 -07:00
Googler 722275c3d1 Added a function to the controller to checkpoint a container.
Functionality for checkpoint is not complete, more to come.

PiperOrigin-RevId: 199500803
Change-Id: Iafb0fcde68c584270000fea898e6657a592466f7
2018-06-06 11:43:55 -07:00
Fabricio Voznika 6c585b8eb6 Create destination mount dir if it doesn't exist
PiperOrigin-RevId: 199175296
Change-Id: I694ad1cfa65572c92f77f22421fdcac818f44630
2018-06-04 12:31:35 -07:00