Commit Graph

67 Commits

Author SHA1 Message Date
Fabricio Voznika 3f46f2e501 Fix sandbox chroot
Sandbox was setting chroot, but was not chaging the working
dir. Added test to ensure this doesn't happen in the future.

PiperOrigin-RevId: 215676270
Change-Id: I14352d3de64a4dcb90e50948119dc8328c9c15e1
2018-10-03 20:44:20 -07:00
Nicolas Lacasse e215b9970a runsc: Pass root container's stdio via FD.
We were previously using the sandbox process's stdio as the root container's
stdio. This makes it difficult/impossible to distinguish output application
output from sandbox output, such as panics, which are always written to stderr.

Also close the console socket when we are done with it.

PiperOrigin-RevId: 215585180
Change-Id: I980b8c69bd61a8b8e0a496fd7bc90a06446764e0
2018-10-03 10:32:03 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Fabricio Voznika a2ad8fef13 Make multi-container the default mode for runsc
And remove multicontainer option.

PiperOrigin-RevId: 215236981
Change-Id: I9fd1d963d987e421e63d5817f91a25c819ced6cb
2018-10-01 10:31:17 -07:00
Fabricio Voznika 2496d9b4b6 Make runsc kill and delete more conformant to the "spec"
PiperOrigin-RevId: 214976251
Change-Id: I631348c3886f41f63d0e77e7c4f21b3ede2ab521
2018-09-28 12:22:21 -07:00
Fabricio Voznika cf226d48ce Switch to root in userns when CAP_SYS_CHROOT is also missing
Some tests check current capabilities and re-run the tests as root inside
userns if required capabibilities are missing. It was checking for
CAP_SYS_ADMIN only, CAP_SYS_CHROOT is also required now.

PiperOrigin-RevId: 214949226
Change-Id: Ic81363969fa76c04da408fae8ea7520653266312
2018-09-28 09:44:13 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Fabricio Voznika b514ab0589 Refactor 'runsc boot' to take container ID as argument
This makes the flow slightly simpler (no need to call
Loader.SetRootContainer). And this is required change to tag
tasks with container ID inside the Sentry.

PiperOrigin-RevId: 214795210
Change-Id: I6ff4af12e73bb07157f7058bb15fd5bb88760884
2018-09-27 10:26:34 -07:00
Fabricio Voznika b63c4bfe02 Set Sandbox.Chroot so it gets cleaned up upon destruction
I've made several attempts to create a test, but the lack of
permission from the test user makes it nearly impossible to
test anything useful.

PiperOrigin-RevId: 213922174
Change-Id: I5b502ca70cb7a6645f8836f028fb203354b4c625
2018-09-20 18:54:09 -07:00
Kevin Krakauer ffb5fdd690 runsc: Fix stdin/stdout/stderr in multi-container mode.
The issue with the previous change was that the stdin/stdout/stderr passed to
the sentry were dup'd by host.ImportFile. This left a dangling FD that by never
closing caused containerd to timeout waiting on container stop.

PiperOrigin-RevId: 213753032
Change-Id: Ia5e4c0565c42c8610d3b59f65599a5643b0901e4
2018-09-19 22:20:41 -07:00
Nicolas Lacasse 915d76aa92 Add container.Destroy urpc method.
This method will:
1. Stop the container process if it is still running.
2. Unmount all sanadbox-internal mounts for the container.
3. Delete the contaner root directory inside the sandbox.

Destroy is idempotent, and safe to call concurrantly.

This fixes a bug where after stopping a container, we cannot unmount the
container root directory on the host. This bug occured because the sandbox
dirent cache was holding a dirent with a host fd corresponding to a file inside
the container root on the host. The dirent cache did not know that the
container had exited, and kept the FD open, preventing us from unmounting on
the host.

Now that we unmount (and flush) all container mounts inside the sandbox, any
host FDs donated by the gofer will be closed, and we can unmount the container
root on the host.

PiperOrigin-RevId: 213737693
Change-Id: I28c0ff4cd19a08014cdd72fec5154497e92aacc9
2018-09-19 18:54:14 -07:00
Fabricio Voznika 7967d8ecd5 Handle children processes better in tests
Reap children more systematically in container tests. Previously,
container_test was taking ~5 mins to run because constainer.Destroy()
would timeout waiting for the sandbox process to exit. Now the test
running in less than a minute.

Also made the contract around Container and Sandbox destroy clearer.

PiperOrigin-RevId: 213527471
Change-Id: Icca84ee1212bbdcb62bdfc9cc7b71b12c6d1688d
2018-09-18 15:21:28 -07:00
Kevin Krakauer 7e00f37054 Automated rollback of changelist 213307171
PiperOrigin-RevId: 213504354
Change-Id: Iadd42f0ca4b7e7a9eae780bee9900c7233fb4f3f
2018-09-18 13:22:26 -07:00
Kevin Krakauer bb88c187c5 runsc: Enable waiting on exited processes.
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
  code.
- Processes not waited on will consume memory (like a zombie process)

PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17 16:25:24 -07:00
Kevin Krakauer 25add7b22b runsc: Fix stdin/out/err in multi-container mode.
Stdin/out/err weren't being sent to the sentry.

PiperOrigin-RevId: 213307171
Change-Id: Ie4b634a58b1b69aa934ce8597e5cc7a47a2bcda2
2018-09-17 11:31:28 -07:00
Lantao Liu bde2a91433 runsc: Support container signal/wait.
This CL:
1) Fix `runsc wait`, it now also works after the container exits;
2) Generate correct container state in Load;
2) Make sure `Destory` cleanup everything before successfully return.

PiperOrigin-RevId: 212900107
Change-Id: Ie129cbb9d74f8151a18364f1fc0b2603eac4109a
2018-09-13 16:38:03 -07:00
Kevin Krakauer 2eff1fdd06 runsc: Add exec flag that specifies where to save the sandbox-internal pid.
This is different from the existing -pid-file flag, which saves a host pid.

PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
2018-09-12 15:23:35 -07:00
Nicolas Lacasse 6cc9b311af platform: Pass device fd into platform constructor.
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New).  This requires that we have RW access to
the platform device when constructing the platform.

However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.

This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.

PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
2018-09-11 13:09:46 -07:00
Nicolas Lacasse e198f9ab02 runsc: Chmod all mounted files to 777 inside chroot.
Inside the chroot, we run as user nobody, so all mounted files and directories
must be accessible to all users.

PiperOrigin-RevId: 212284805
Change-Id: I705e0dbbf15e01e04e0c7f378a99daffe6866807
2018-09-10 10:00:16 -07:00
Nicolas Lacasse 9751b800a6 runsc: Support multi-container exec.
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.

Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute.  So we have to
do the path lookup much later than we previously were.

PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
2018-09-07 17:39:54 -07:00
Fabricio Voznika bc81f3fe4a Remove '--file-access=direct' option
It was used before gofer was implemented and it's not
supported anymore.
BREAKING CHANGE: proxy-shared and proxy-exclusive options
are now: shared and exclusive.

PiperOrigin-RevId: 212017643
Change-Id: If029d4073fe60583e5ca25f98abb2953de0d78fd
2018-09-07 12:28:48 -07:00
Nicolas Lacasse 210c252089 runsc: Run sandbox process inside minimal chroot.
We construct a dir with the executable bind-mounted at /exe, and proc mounted
at /proc.  Runsc now executes the sandbox process inside this chroot, thus
limiting access to the host filesystem.  The mounts and chroot dir are removed
when the sandbox is destroyed.

Because this requires bind-mounts, we can only do the chroot if we have
CAP_SYS_ADMIN.

PiperOrigin-RevId: 211994001
Change-Id: Ia71c515e26085e0b69b833e71691830148bc70d1
2018-09-07 10:16:39 -07:00
Fabricio Voznika efac28976c Enable network for multi-container
PiperOrigin-RevId: 211834411
Change-Id: I52311a6c5407f984e5069359d9444027084e4d2a
2018-09-06 11:00:08 -07:00
Nicolas Lacasse 0a9a40abcd runsc: Run sandbox as user nobody.
When starting a sandbox without direct file or network access, we create an
empty user namespace and run the sandbox in there.  However, the root user in
that namespace is still mapped to the root user in the parent namespace.

This CL maps the "nobody" user from the parent namespace into the child
namespace, and runs the sandbox process as user "nobody" inside the new
namespace.

PiperOrigin-RevId: 211572223
Change-Id: I1b1f9b1a86c0b4e7e5ca7bc93be7d4887678bab6
2018-09-04 20:33:05 -07:00
Nicolas Lacasse ad8648c634 runsc: Pass log and config files to sandbox process by FD.
This is a prereq for running the sandbox process as user "nobody", when it may
not have permissions to open these files.

Instead, we must open then before starting the sandbox process, and pass them
by FD.

The specutils.ReadSpecFromFile method was fixed to always seek to the beginning
of the file before reading. This allows Files from the same FD to be read
multiple times, as we do in the boot command when the apply-caps flag is set.

Tested with --network=host.

PiperOrigin-RevId: 211570647
Change-Id: I685be0a290aa7f70731ebdce82ebc0ebcc9d475c
2018-09-04 20:10:01 -07:00
Fabricio Voznika 7e18f158b2 Automated rollback of changelist 210995199
PiperOrigin-RevId: 211116429
Change-Id: I446d149c822177dc9fc3c64ce5e455f7f029aa82
2018-08-31 11:30:47 -07:00
Nicolas Lacasse 5ade9350ad runsc: Pass log and config files to sandbox process by FD.
This is a prereq for running the sandbox process as user "nobody", when it may
not have permissions to open these files.

Instead, we must open then before starting the sandbox process, and pass them
by FD.

PiperOrigin-RevId: 210995199
Change-Id: I715875a9553290b4a49394a8fcd93be78b1933dd
2018-08-30 15:47:18 -07:00
Fabricio Voznika db81c0b02f Put fsgofer inside chroot
Now each container gets its own dedicated gofer that is chroot'd to the
rootfs path. This is done to add an extra layer of security in case the
gofer gets compromised.

PiperOrigin-RevId: 210396476
Change-Id: Iba21360a59dfe90875d61000db103f8609157ca0
2018-08-27 11:10:14 -07:00
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminial:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Fabricio Voznika a81a4402a2 Add option to panic gofer if writes are attempted over RO mounts
This is used when '--overlay=true' to guarantee writes are not sent to gofer.

PiperOrigin-RevId: 210116288
Change-Id: I7616008c4c0e8d3668e07a205207f46e2144bf30
2018-08-24 10:17:42 -07:00
Fabricio Voznika d6d165cb0b Initial change for multi-gofer support
PiperOrigin-RevId: 209647293
Change-Id: I980fca1257ea3fcce796388a049c353b0303a8a5
2018-08-21 13:14:43 -07:00
Fabricio Voznika 0fc7b30695 Standardize mounts in tests
Tests get a readonly rootfs mapped to / (which was the case before)
and writable TEST_TMPDIR. This makes it easier to setup containers to
write to files and to share state between test and containers.

PiperOrigin-RevId: 209453224
Change-Id: I4d988e45dc0909a0450a3bb882fe280cf9c24334
2018-08-20 11:26:39 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Nicolas Lacasse e8a4f2e133 runsc: Change cache policy for root fs and volume mounts.
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively.  While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.

This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.

This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.

A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.

All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely served by the
root fs.

PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
2018-08-14 16:25:58 -07:00
Fabricio Voznika bc9a1fca23 Tiny reordering to network code
PiperOrigin-RevId: 207581723
Change-Id: I6e4eb1227b5ed302de5e6c891040b670955f1eea
2018-08-06 11:48:29 -07:00
Fabricio Voznika d7a34790a0 Add KVM and overlay dimensions to container_test
PiperOrigin-RevId: 205714667
Change-Id: I317a2ca98ac3bdad97c4790fcc61b004757d99ef
2018-07-23 13:31:42 -07:00
Justine Olshan f543ada150 Removed a now incorrect reference to restoreFile.
PiperOrigin-RevId: 205470108
Change-Id: I226878a887fe1133561005357a9e3b09428b06b6
2018-07-20 16:18:07 -07:00
Lantao Liu f62d6dd453 runsc: copy gateway from the pod network interface.
PiperOrigin-RevId: 205334841
Change-Id: Ia60d486f9aae70182fdc4af50cf7c915986126d7
2018-07-19 18:09:56 -07:00
Justine Olshan c05660373e Moved restore code out of create and made to be called after create.
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specificactions regarding the kernel
and the file system. These actions were originally in booting the sandbox.

Now setting up the file system is deferred until a call to a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.

These changes required careful execution of concurrent processes which
required the use of a channel.

Full docker integration still needs the ability to restore into the same
container.

PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
2018-07-18 16:58:30 -07:00
Kevin Krakauer 16d37973eb runsc: Add the "wait" subcommand.
Users can now call "runsc wait <container id>" to wait on a particular process
inside the container. -pid can also be used to wait on a specific PID.

Manually tested the wait subcommand for a single waiter and multiple waiters
(simultaneously 2 processes waiting on the container and 2 processes waiting on
a PID within the container).

PiperOrigin-RevId: 202548978
Change-Id: Idd507c2cdea613c3a14879b51cfb0f7ea3fb3d4c
2018-06-28 14:56:36 -07:00
Fabricio Voznika bb31a11903 Wait for sandbox process when waiting for root container
Closes #71

PiperOrigin-RevId: 202532762
Change-Id: I80a446ff638672ff08e6fd853cd77e28dd05d540
2018-06-28 13:23:04 -07:00
Kevin Krakauer 04bdcc7b65 runsc: Enable waiting on individual containers within a sandbox.
PiperOrigin-RevId: 201742160
Change-Id: Ia9fa1442287c5f9e1196fb117c41536a80f6bb31
2018-06-22 14:31:25 -07:00
Brielle Broder 7d6149063a Restore implementation added to runsc.
Restore creates a new container and uses the given image-path to load a saved
image of a previous container. Restore command is plumbed through container
and sandbox. This command does not work yet - more to come.

PiperOrigin-RevId: 201541229
Change-Id: I864a14c799ce3717d99bcdaaebc764281863d06f
2018-06-21 09:58:24 -07:00
Fabricio Voznika 4ad7315b67 Add 'runsc debug' command
It prints sandbox stacks to the log to help debug stuckness. I expect
that many more options will be added in the future.

PiperOrigin-RevId: 201405931
Change-Id: I87e560800cd5a5a7b210dc25a5661363c8c3a16e
2018-06-20 13:31:31 -07:00
Kevin Krakauer 5397963b5d runsc: Enable container creation within existing sandboxes.
Containers are created as processes in the sandbox. Of the many things that
don't work yet, the biggest issue is that the fsgofer is launched with its root
as the sandbox's root directory. Thus, when a container is started and wants to
read anything (including the init binary of the container), the gofer tries to
serve from sandbox's root (which basically just has pause), not the container's.

PiperOrigin-RevId: 201294560
Change-Id: I6423aa8830538959c56ae908ce067e4199d627b1
2018-06-19 21:44:33 -07:00
Justine Olshan a6dbef045f Added a resume command to unpause a paused container.
Resume checks the status of the container and unpauses the kernel
if its status is paused. Otherwise nothing happens.
Tests were added to ensure that the process is in the correct state
after various commands.

PiperOrigin-RevId: 201251234
Change-Id: Ifd11b336c33b654fea6238738f864fcf2bf81e19
2018-06-19 15:23:36 -07:00
Fabricio Voznika 775982ed4b Automated rollback of changelist 200770591
PiperOrigin-RevId: 201012131
Change-Id: I5cd69e795555129319eb41135ecf26db9a0b1fcb
2018-06-18 10:00:04 -07:00
Justine Olshan 0786707cd9 Added code for a pause command for a container process.
Like runc, the pause command will pause the processes of the given container.
It will set that container's status to "paused."
A resume command will be be added to unpause and continue running the process.

PiperOrigin-RevId: 200789624
Change-Id: I72a5d7813d90ecfc4d01cc252d6018855016b1ea
2018-06-15 16:09:09 -07:00
Kevin Krakauer 437890dc4b runsc: Make gofer logs show up in test output.
PiperOrigin-RevId: 200770591
Change-Id: Ifc096d88615b63135210d93c2b4cee2eaecf1eee
2018-06-15 14:07:54 -07:00
Brielle Broder 711a9869e5 Runsc checkpoint works.
This is the first iteration of checkpoint that actually saves to a file.
Tests for checkpoint are included.

Ran into an issue when private unix sockets are enabled. An error message
was added for this case and the mutex state was set.

PiperOrigin-RevId: 200269470
Change-Id: I28d29a9f92c44bf73dc4a4b12ae0509ee4070e93
2018-06-12 13:25:23 -07:00