Commit Graph

58 Commits

Author SHA1 Message Date
Nicolas Lacasse aaaefdf9ca Remove kernel.mounts.
We can get the mount namespace from the CreateProcessArgs in all cases where we
need it. This also gets rid of kernel.Destroy method, since the only thing it
was doing was DecRefing the mounts.

Removing the need to call kernel.SetRootMountNamespace also allowed for some
more simplifications in the container fs setup code.

PiperOrigin-RevId: 261357060
2019-08-02 11:23:11 -07:00
Fabricio Voznika b21b1db700 Allow to change logging options using 'runsc debug'
New options are:
  runsc debug --strace=off|all|function1,function2
  runsc debug --log-level=warning|info|debug
  runsc debug --log-packets=true|false
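
The `--strace` values listed above could be interpreted roughly as in the following sketch. It is only an illustration, not runsc's code: `straceConfig` and `parseStrace` are hypothetical names, and treating an empty value the same as `off` is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// straceConfig is a hypothetical representation of the parsed option.
type straceConfig struct {
	enabled  bool
	all      bool
	syscalls []string // only set when a comma-separated list was given
}

// parseStrace interprets "off", "all", or a comma-separated list of names.
// Treating the empty string like "off" is an assumption of this sketch.
func parseStrace(v string) straceConfig {
	switch v {
	case "", "off":
		return straceConfig{}
	case "all":
		return straceConfig{enabled: true, all: true}
	default:
		return straceConfig{enabled: true, syscalls: strings.Split(v, ",")}
	}
}

func main() {
	for _, v := range []string{"off", "all", "open,read"} {
		fmt.Printf("%q -> %+v\n", v, parseStrace(v))
	}
}
```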

Updates #407

PiperOrigin-RevId: 254843128
2019-06-24 15:03:02 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Andrei Vagin bb849bad29 gvisor/runsc: apply seccomp filters before parsing a state file
PiperOrigin-RevId: 252869983
2019-06-12 11:55:24 -07:00
Fabricio Voznika fc746efa9a Add support to mount pod shared tmpfs mounts
Parse annotations containing 'gvisor.dev/spec/mount' that give
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.

For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.

PiperOrigin-RevId: 252704037
2019-06-11 14:54:31 -07:00
Fabricio Voznika f1aee6a7ad Refactor container FS setup
No change in functionality. Added containerMounter object
to keep state while the mounts are processed. This will
help upcoming changes to share mounts per-pod.

PiperOrigin-RevId: 251350096
2019-06-03 18:20:57 -07:00
Fabricio Voznika d28f71adcf Remove 'clearStatus' option from container.Wait*PID()
clearStatus was added to allow detached execution to wait
on the exec'd process and retrieve its exit status. However,
it's not currently used. Both docker and gvisor-containerd-shim
wait on the "shim" process and retrieve the exit status from
there. We could change gvisor-containerd-shim to use waits, but
it will end up also consuming a process for the wait, which is
similar to having the shim process.

Closes #234

PiperOrigin-RevId: 251349490
2019-06-03 18:16:09 -07:00
Bhasker Hariharan 035a8fa38e Add support for collecting execution trace to runsc.
Updates #220

PiperOrigin-RevId: 250532302
2019-05-30 12:07:11 -07:00
Fabricio Voznika ecb0f00e10 Cleanup around urpc file payload handling
urpc always closes all files once the RPC function returns.

PiperOrigin-RevId: 248406857
Change-Id: I400a8562452ec75c8e4bddc2154948567d572950
2019-05-15 14:36:28 -07:00
Andrei Vagin bf0ac565d2 Fix runsc restore to be compatible with docker start --checkpoint ...
Change-Id: I02b30de13f1393df66edf8829fedbf32405d18f8
PiperOrigin-RevId: 246621192
2019-05-03 21:41:45 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Kevin Krakauer f9431fb20f Remove obsolete TODO.
PiperOrigin-RevId: 241637164
Change-Id: I65476a739cf38f1818dc47f6ce60638dec8b77a8
2019-04-02 17:27:05 -07:00
Kevin Krakauer a40ee4f4b8 Change bug number for duplicate bug.
PiperOrigin-RevId: 241567897
Change-Id: I580eac04f52bb15f4aab7df9822c4aa92e743021
2019-04-02 11:28:06 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Fabricio Voznika bc9b979b94 Add profiling commands to runsc
Example:
  runsc debug --root=<dir> \
      --profile-heap=/tmp/heap.prof \
      --profile-cpu=/tmp/cpu.prof --profile-delay=30 \
      <container ID>
PiperOrigin-RevId: 237848456
Change-Id: Icff3f20c1b157a84d0922599eaea327320dad773
2019-03-11 11:47:30 -07:00
Andrei Vagin dd577f5410 runsc: reap a sandbox process only in sandbox.Wait()
PiperOrigin-RevId: 231504064
Change-Id: I585b769aef04a3ad7e7936027958910a6eed9c8d
2019-01-29 17:15:56 -08:00
Fabricio Voznika c1be25b78d Scrub runsc error messages
Removed "error" and "failed to" prefixes that don't add value
from messages. Adjusted a few other messages.  In particular,
when the container fails to start, the message returned is easier
for humans to read:

$ docker run --rm --runtime=runsc alpine foobar
docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory

Closes #77

PiperOrigin-RevId: 230022798
Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1
2019-01-18 17:36:02 -08:00
Fabricio Voznika a891afad6d Simplify synchronization between runsc and sandbox process
Make 'runsc create' join cgroup before creating sandbox process.
This removes the need to synchronize platform creation and ensure
that sandbox process is charged to the right cgroup from the start.

PiperOrigin-RevId: 227166451
Change-Id: Ieb4b18e6ca0daf7b331dc897699ca419bc5ee3a2
2018-12-28 13:48:24 -08:00
Zhaozhong Ni 9984138abe sentry: turn "dynamically-created" procfs files into static creation.
PiperOrigin-RevId: 224600982
Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34
2018-12-07 17:03:54 -08:00
Fabricio Voznika d97ccfa346 Close donated files if containerManager.Start() fails
PiperOrigin-RevId: 220869535
Change-Id: I9917e5daf02499f7aab6e2aa4051c54ff4461b9a
2018-11-09 14:54:34 -08:00
Fabricio Voznika c92b9b7086 Add more logging to controller.go
PiperOrigin-RevId: 220519632
Change-Id: Iaeec007fc1aa3f0b72569b288826d45f2534c4bf
2018-11-07 13:33:19 -08:00
Fabricio Voznika 86b3f0cd24 Fix race between start and destroy
Before this change, a container starting up could race with
destroy (aka delete) and leave processes behind.

Now, whenever a container is created, Loader.processes gets
a new entry. Start now expects the entry to be there, and if
it's not it means that the container was deleted.

I've also fixed Loader.waitPID to search for the process using
the init process's PID namespace.

We could use a few more tests for signal and wait. I'll send
them in another cl.

PiperOrigin-RevId: 220224290
Change-Id: I15146079f69904dc07d43c3b66cc343a2dab4cc4
2018-11-05 21:29:37 -08:00
Fabricio Voznika a467f09261 Log when external signal is received
PiperOrigin-RevId: 220204591
Change-Id: I21a9c6f5c12a376d18da5d10c1871837c4f49ad2
2018-11-05 17:42:24 -08:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Nicolas Lacasse 4e6f0892c9 runsc: Support job control signals for the root container.
Now containers run with "docker run -it" support control characters like ^C and
^Z.

This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.

PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-17 12:29:05 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and sends signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
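
As a rough illustration of the lookup described above, the sketch below delivers a signal to whichever process group is currently in the foreground of a container's tty. All types and names here (`process`, `tty`, `sandbox`, `signalForeground`) are hypothetical and much simpler than the real sentry structures; signal delivery is modelled by appending to a slice.

```go
package main

import "fmt"

// process is a hypothetical stand-in for a sandbox process.
type process struct {
	name    string
	pgid    int
	signals []int // signals delivered so far
}

// tty tracks its foreground process group, which applications change
// through an ioctl (TIOCSPGRP on Linux).
type tty struct {
	foregroundPGID int
}

// sandbox caches one tty per container, mirroring the idea of exposing
// the tty file to the component that handles incoming signal requests.
type sandbox struct {
	procs []*process
	ttys  map[string]*tty // container ID -> tty
}

// signalForeground sends sig to every process in the tty's foreground
// process group and returns how many processes were signalled.
func (s *sandbox) signalForeground(containerID string, sig int) (int, error) {
	t, ok := s.ttys[containerID]
	if !ok {
		return 0, fmt.Errorf("no tty cached for container %q", containerID)
	}
	n := 0
	for _, p := range s.procs {
		if p.pgid == t.foregroundPGID {
			p.signals = append(p.signals, sig)
			n++
		}
	}
	return n, nil
}

func main() {
	bash := &process{name: "bash", pgid: 10}
	sleep := &process{name: "sleep", pgid: 11}
	term := &tty{foregroundPGID: bash.pgid}
	s := &sandbox{
		procs: []*process{bash, sleep},
		ttys:  map[string]*tty{"c1": term},
	}

	// bash runs "sleep 100" and puts its group in the foreground.
	term.foregroundPGID = sleep.pgid

	const sigint = 2 // SIGINT on Linux
	if _, err := s.signalForeground("c1", sigint); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(bash.signals), len(sleep.signals))
}
```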

Note that "runsc exec" now handles signals properly, but "runsc run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Fabricio Voznika 2496d9b4b6 Make runsc kill and delete more conformant to the "spec"
PiperOrigin-RevId: 214976251
Change-Id: I631348c3886f41f63d0e77e7c4f21b3ede2ab521
2018-09-28 12:22:21 -07:00
Fabricio Voznika 6779bd1187 Merge Loader.containerRootTGs and execProcess into a single map
It's easier to manage a single map with processes that we're interested
in tracking. This will make the next change to clean up the map on destroy
easier.

PiperOrigin-RevId: 214894210
Change-Id: I099247323a0487cd0767120df47ba786fac0926d
2018-09-27 23:55:05 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.
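
The tagging scheme described above can be sketched roughly as follows. This is a minimal illustration, not the Sentry's code: `task`, `taskSet`, `fork` and `killAll` are hypothetical names, and signalling is modelled by recording the signal number on the task.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a Sentry task.
type task struct {
	pid         int
	containerID string
	lastSignal  int // 0 means no signal has been delivered
}

// taskSet is a hypothetical registry of every task in the sandbox.
type taskSet struct {
	nextPID int
	tasks   []*task
}

// add creates a task tagged with the given container ID.
func (ts *taskSet) add(containerID string) *task {
	ts.nextPID++
	t := &task{pid: ts.nextPID, containerID: containerID}
	ts.tasks = append(ts.tasks, t)
	return t
}

// newRoot creates the first task of a container.
func (ts *taskSet) newRoot(containerID string) *task {
	return ts.add(containerID)
}

// fork creates a child that inherits the parent's container ID.
func (ts *taskSet) fork(parent *task) *task {
	return ts.add(parent.containerID)
}

// killAll walks all tasks, signals those whose container ID matches,
// and returns the number of tasks signalled.
func (ts *taskSet) killAll(containerID string, sig int) int {
	n := 0
	for _, t := range ts.tasks {
		if t.containerID == containerID {
			t.lastSignal = sig
			n++
		}
	}
	return n
}

func main() {
	ts := &taskSet{}
	a := ts.newRoot("container-a")
	ts.fork(ts.fork(a)) // child and grandchild inherit "container-a"
	ts.newRoot("container-b")

	const sigkill = 9 // SIGKILL on Linux
	fmt.Println(ts.killAll("container-a", sigkill))
}
```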

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Fabricio Voznika b514ab0589 Refactor 'runsc boot' to take container ID as argument
This makes the flow slightly simpler (no need to call
Loader.SetRootContainer). This is also a required change to tag
tasks with container ID inside the Sentry.

PiperOrigin-RevId: 214795210
Change-Id: I6ff4af12e73bb07157f7058bb15fd5bb88760884
2018-09-27 10:26:34 -07:00
Nicolas Lacasse cbaec4d614 Wait for all async fs operations to complete before returning from Destroy.
Destroy flushes dirent references, which triggers many async close operations.
We must wait for those to finish before returning from Destroy, otherwise we
may kill the gofer, causing a cascade of failing RPCs and leading to an
inconsistent FS state.

PiperOrigin-RevId: 213884637
Change-Id: Id054b47fc0f97adc5e596d747c08d3b97a1d1f71
2018-09-20 14:37:53 -07:00
Kevin Krakauer ffb5fdd690 runsc: Fix stdin/stdout/stderr in multi-container mode.
The issue with the previous change was that the stdin/stdout/stderr passed to
the sentry were dup'd by host.ImportFile. This left a dangling FD that was never
closed, which caused containerd to time out waiting on container stop.

PiperOrigin-RevId: 213753032
Change-Id: Ia5e4c0565c42c8610d3b59f65599a5643b0901e4
2018-09-19 22:20:41 -07:00
Nicolas Lacasse 915d76aa92 Add container.Destroy urpc method.
This method will:
1. Stop the container process if it is still running.
2. Unmount all sandbox-internal mounts for the container.
3. Delete the container root directory inside the sandbox.

Destroy is idempotent, and safe to call concurrently.

This fixes a bug where after stopping a container, we cannot unmount the
container root directory on the host. This bug occurred because the sandbox
dirent cache was holding a dirent with a host fd corresponding to a file inside
the container root on the host. The dirent cache did not know that the
container had exited, and kept the FD open, preventing us from unmounting on
the host.

Now that we unmount (and flush) all container mounts inside the sandbox, any
host FDs donated by the gofer will be closed, and we can unmount the container
root on the host.

PiperOrigin-RevId: 213737693
Change-Id: I28c0ff4cd19a08014cdd72fec5154497e92aacc9
2018-09-19 18:54:14 -07:00
Kevin Krakauer 7e00f37054 Automated rollback of changelist 213307171
PiperOrigin-RevId: 213504354
Change-Id: Iadd42f0ca4b7e7a9eae780bee9900c7233fb4f3f
2018-09-18 13:22:26 -07:00
Kevin Krakauer bb88c187c5 runsc: Enable waiting on exited processes.
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
  code.
- Processes not waited on will consume memory (like a zombie process)

PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17 16:25:24 -07:00
Kevin Krakauer 25add7b22b runsc: Fix stdin/out/err in multi-container mode.
Stdin/out/err weren't being sent to the sentry.

PiperOrigin-RevId: 213307171
Change-Id: Ie4b634a58b1b69aa934ce8597e5cc7a47a2bcda2
2018-09-17 11:31:28 -07:00
Lantao Liu bde2a91433 runsc: Support container signal/wait.
This CL:
1) Fixes `runsc wait` so that it also works after the container exits;
2) Generates the correct container state in Load;
3) Makes sure `Destroy` cleans up everything before returning successfully.

PiperOrigin-RevId: 212900107
Change-Id: Ie129cbb9d74f8151a18364f1fc0b2603eac4109a
2018-09-13 16:38:03 -07:00
Kevin Krakauer 2eff1fdd06 runsc: Add exec flag that specifies where to save the sandbox-internal pid.
This is different from the existing -pid-file flag, which saves a host pid.

PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
2018-09-12 15:23:35 -07:00
Nicolas Lacasse 6cc9b311af platform: Pass device fd into platform constructor.
We were previously opening the platform device (i.e. /dev/kvm) inside the
platform constructor (i.e. kvm.New).  This requires that we have RW access to
the platform device when constructing the platform.

However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.

This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.

PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
2018-09-11 13:09:46 -07:00
Nicolas Lacasse 9751b800a6 runsc: Support multi-container exec.
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.

Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute.  So we have to
do the path lookup much later than we previously were.

PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
2018-09-07 17:39:54 -07:00
Kevin Krakauer 8f0b6e7fc0 runsc: Support runsc kill multi-container.
Now, we can kill individual containers rather than the entire sandbox.

PiperOrigin-RevId: 211748106
Change-Id: Ic97e91db33d53782f838338c4a6d0aab7a313ead
2018-09-05 21:14:56 -07:00
Nicolas Lacasse f96b33c73c runsc: Promote getExecutablePathInternal to getExecutablePath.
Remove GetExecutablePath (the non-internal version).  This makes path handling
more consistent between exec, root, and child containers.

The new getExecutablePath now uses MountNamespace.FindInode, which is more
robust than Walking the Dirent tree ourselves.

This also removes the last use of lstat(2) in the sentry, so that can be
removed from the filters.

PiperOrigin-RevId: 211683110
Change-Id: Ic8ec960fc1c267aa7d310b8efe6e900c88a9207a
2018-09-05 13:01:21 -07:00
Fabricio Voznika db81c0b02f Put fsgofer inside chroot
Now each container gets its own dedicated gofer that is chroot'd to the
rootfs path. This is done to add an extra layer of security in case the
gofer gets compromised.

PiperOrigin-RevId: 210396476
Change-Id: Iba21360a59dfe90875d61000db103f8609157ca0
2018-08-27 11:10:14 -07:00
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctls for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminal:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Fabricio Voznika 0d350aac7f Enable SACK in runsc
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.

PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
2018-08-08 10:26:18 -07:00
Justine Olshan c05660373e Moved restore code out of create and made to be called after create.
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specific actions regarding the kernel
and the file system. These actions were originally in booting the sandbox.

Now setting up the file system is deferred until a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.

These changes required careful execution of concurrent processes which
required the use of a channel.

Full docker integration still needs the ability to restore into the same
container.

PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
2018-07-18 16:58:30 -07:00
Kevin Krakauer 16d37973eb runsc: Add the "wait" subcommand.
Users can now call "runsc wait <container id>" to wait on a particular process
inside the container. -pid can also be used to wait on a specific PID.

Manually tested the wait subcommand for a single waiter and multiple waiters
(simultaneously 2 processes waiting on the container and 2 processes waiting on
a PID within the container).

PiperOrigin-RevId: 202548978
Change-Id: Idd507c2cdea613c3a14879b51cfb0f7ea3fb3d4c
2018-06-28 14:56:36 -07:00
Kevin Krakauer 04bdcc7b65 runsc: Enable waiting on individual containers within a sandbox.
PiperOrigin-RevId: 201742160
Change-Id: Ia9fa1442287c5f9e1196fb117c41536a80f6bb31
2018-06-22 14:31:25 -07:00