Commit Graph

64 Commits

Author SHA1 Message Date
Kevin Krakauer b42a2a3203 Removes outdated TODO.
PiperOrigin-RevId: 219151173
Change-Id: I73014ea648ae485692ea0d44860c87f4365055cb
2018-10-29 10:31:56 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Nicolas Lacasse 4e6f0892c9 runsc: Support job control signals for the root container.
Now containers run with "docker run -it" support control characters like ^C and
^Z.

This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.

PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-17 12:29:05 -07:00
Kevin Krakauer 9b3550f70b runsc: Add --pid flag to runsc kill.
--pid allows specific processes to be signalled rather than the container root
process or all processes in the container. containerd needs to SIGKILL exec'd
processes that timeout and check whether processes are still alive.

PiperOrigin-RevId: 217547636
Change-Id: I2058ebb548b51c8eb748f5884fb88bad0b532e45
2018-10-17 10:51:39 -07:00
Fabricio Voznika f413e4b117 Add bare bones unsupported syscall logging
This change introduces a new flags to create/run called
--user-log. Logs to this files are visible to users and
are meant to help debugging problems with their images
and containers.

For now only unsupported syscalls are sent to this log,
and only minimum support was added. We can build more
infrastructure around it as needed.

PiperOrigin-RevId: 216735977
Change-Id: I54427ca194604991c407d49943ab3680470de2d0
2018-10-11 11:56:54 -07:00
Kevin Krakauer e21ba16d9c Removes irrelevant TODO.
PiperOrigin-RevId: 216616873
Change-Id: I4d974ab968058eadd01542081e18a987ef08f50a
2018-10-10 16:50:59 -07:00
Jonathan Giannuzzi 8388a505e7 Support for older Linux kernels without getrandom
Change-Id: I1fb9f5b47a264a7617912f6f56f995f3c4c5e578
PiperOrigin-RevId: 216591484
2018-10-10 14:18:47 -07:00
Fabricio Voznika 29cd05a7c6 Add sandbox to cgroup
Sandbox creation uses the limits and reservations configured in the
OCI spec and set cgroup options accordinly. Then it puts both the
sandbox and gofer processes inside the cgroup.

It also allows the cgroup to be pre-configured by the caller. If the
cgroup already exists, sandbox and gofer processes will join the
cgroup but it will not modify the cgroup with spec limits.

PiperOrigin-RevId: 216538209
Change-Id: If2c65ffedf55820baab743a0edcfb091b89c1019
2018-10-10 09:00:42 -07:00
Ian Gudger c36d2ef373 Add new netstack metrics to the sentry
PiperOrigin-RevId: 216431260
Change-Id: Ia6e5c8d506940148d10ff2884cf4440f470e5820
2018-10-09 15:12:44 -07:00
Nicolas Lacasse ae5122eb87 Job control signals must be sent to all processes in the FG process group.
We were previously only sending to the originator of the process group.

Integration test was changed to test this behavior. It fails without the
corresponding code change.

PiperOrigin-RevId: 216297263
Change-Id: I7e41cfd6bdd067f4b9dc215e28f555fb5088916f
2018-10-08 20:48:54 -07:00
Michael Pratt b8048f75da Uncapitalize error
PiperOrigin-RevId: 216281263
Change-Id: Ie0c189e7f5934b77c6302336723bc1181fd2866c
2018-10-08 17:44:39 -07:00
Nicolas Lacasse e215b9970a runsc: Pass root container's stdio via FD.
We were previously using the sandbox process's stdio as the root container's
stdio. This makes it difficult/impossible to distinguish output application
output from sandbox output, such as panics, which are always written to stderr.

Also close the console socket when we are done with it.

PiperOrigin-RevId: 215585180
Change-Id: I980b8c69bd61a8b8e0a496fd7bc90a06446764e0
2018-10-03 10:32:03 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Fabricio Voznika a2ad8fef13 Make multi-container the default mode for runsc
And remove multicontainer option.

PiperOrigin-RevId: 215236981
Change-Id: I9fd1d963d987e421e63d5817f91a25c819ced6cb
2018-10-01 10:31:17 -07:00
Fabricio Voznika 2496d9b4b6 Make runsc kill and delete more conformant to the "spec"
PiperOrigin-RevId: 214976251
Change-Id: I631348c3886f41f63d0e77e7c4f21b3ede2ab521
2018-09-28 12:22:21 -07:00
Fabricio Voznika 6779bd1187 Merge Loader.containerRootTGs and execProcess into a single map
It's easier to manage a single map with processes that we're interested
to track. This will make the next change to clean up the map on destroy
easier.

PiperOrigin-RevId: 214894210
Change-Id: I099247323a0487cd0767120df47ba786fac0926d
2018-09-27 23:55:05 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Fabricio Voznika b514ab0589 Refactor 'runsc boot' to take container ID as argument
This makes the flow slightly simpler (no need to call
Loader.SetRootContainer). And this is required change to tag
tasks with container ID inside the Sentry.

PiperOrigin-RevId: 214795210
Change-Id: I6ff4af12e73bb07157f7058bb15fd5bb88760884
2018-09-27 10:26:34 -07:00
Lantao Liu 8a938a3f9d runsc: allow `runsc wait` on a container for multiple times.
PiperOrigin-RevId: 213908919
Change-Id: I74eff99a5360bb03511b946f4cb5658bb5fc40c7
2018-09-20 16:59:42 -07:00
Kevin Krakauer ffb5fdd690 runsc: Fix stdin/stdout/stderr in multi-container mode.
The issue with the previous change was that the stdin/stdout/stderr passed to
the sentry were dup'd by host.ImportFile. This left a dangling FD that by never
closing caused containerd to timeout waiting on container stop.

PiperOrigin-RevId: 213753032
Change-Id: Ia5e4c0565c42c8610d3b59f65599a5643b0901e4
2018-09-19 22:20:41 -07:00
Lingfu f0a92b6b67 Add docker command line args support for --cpuset-cpus and --cpus
`docker run --cpuset-cpus=/--cpus=` will generate cpu resource info in config.json
(runtime spec file). When nginx worker_connections is configured as auto, the worker is
generated according to the number of CPUs. If the cgroup is already set on the host, but
it is not displayed correctly in the sandbox, performance may be degraded.

This patch can get cpus info from spec file and apply to sentry on bootup, so the
/proc/cpuinfo can show the correct cpu numbers. `lscpu` and other commands rely on
`/sys/devices/system/cpu/online` are also affected by this patch.

e.g.

--cpuset-cpus=2,3   ->  cpu number:2
--cpuset-cpus=4-7   ->  cpu number:4
--cpus=2.8          ->  cpu number:3
--cpus=0.5          ->  cpu number:1
Change-Id: Ideb22e125758d4322a12be7c51795f8018e3d316
PiperOrigin-RevId: 213685199
2018-09-19 13:35:42 -07:00
Kevin Krakauer 7e00f37054 Automated rollback of changelist 213307171
PiperOrigin-RevId: 213504354
Change-Id: Iadd42f0ca4b7e7a9eae780bee9900c7233fb4f3f
2018-09-18 13:22:26 -07:00
Fabricio Voznika 5d9816be41 Remove memory usage static init
panic() during init() can be hard to debug.

Updates #100

PiperOrigin-RevId: 213391932
Change-Id: Ic103f1981c5b48f1e12da3b42e696e84ffac02a9
2018-09-17 21:34:37 -07:00
Kevin Krakauer bb88c187c5 runsc: Enable waiting on exited processes.
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
  code.
- Processes not waited on will consume memory (like a zombie process)

PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17 16:25:24 -07:00
Kevin Krakauer 25add7b22b runsc: Fix stdin/out/err in multi-container mode.
Stdin/out/err weren't being sent to the sentry.

PiperOrigin-RevId: 213307171
Change-Id: Ie4b634a58b1b69aa934ce8597e5cc7a47a2bcda2
2018-09-17 11:31:28 -07:00
Lantao Liu bde2a91433 runsc: Support container signal/wait.
This CL:
1) Fix `runsc wait`, it now also works after the container exits;
2) Generate correct container state in Load;
2) Make sure `Destory` cleanup everything before successfully return.

PiperOrigin-RevId: 212900107
Change-Id: Ie129cbb9d74f8151a18364f1fc0b2603eac4109a
2018-09-13 16:38:03 -07:00
Nicolas Lacasse 6cc9b311af platform: Pass device fd into platform constructor.
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New).  This requires that we have RW access to
the platform device when constructing the platform.

However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.

This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.

PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
2018-09-11 13:09:46 -07:00
Fabricio Voznika 8ce3fbf9f8 Only start signal forwarding after init process is created
PiperOrigin-RevId: 212028121
Change-Id: If9c2c62f3be103e2bb556b8d154c169888e34369
2018-09-07 13:39:12 -07:00
Fabricio Voznika bc81f3fe4a Remove '--file-access=direct' option
It was used before gofer was implemented and it's not
supported anymore.
BREAKING CHANGE: proxy-shared and proxy-exclusive options
are now: shared and exclusive.

PiperOrigin-RevId: 212017643
Change-Id: If029d4073fe60583e5ca25f98abb2953de0d78fd
2018-09-07 12:28:48 -07:00
Fabricio Voznika f895cb4d8b Use root abstract socket namespace for exec
PiperOrigin-RevId: 211999211
Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d
2018-09-07 10:45:55 -07:00
Kevin Krakauer 8f0b6e7fc0 runsc: Support runsc kill multi-container.
Now, we can kill individual containers rather than the entire sandbox.

PiperOrigin-RevId: 211748106
Change-Id: Ic97e91db33d53782f838338c4a6d0aab7a313ead
2018-09-05 21:14:56 -07:00
Nicolas Lacasse f96b33c73c runsc: Promote getExecutablePathInternal to getExecutablePath.
Remove GetExecutablePath (the non-internal version).  This makes path handling
more consistent between exec, root, and child containers.

The new getExecutablePath now uses MountNamespace.FindInode, which is more
robust than Walking the Dirent tree ourselves.

This also removes the last use of lstat(2) in the sentry, so that can be
removed from the filters.

PiperOrigin-RevId: 211683110
Change-Id: Ic8ec960fc1c267aa7d310b8efe6e900c88a9207a
2018-09-05 13:01:21 -07:00
Fabricio Voznika 30c025f3ef Add argument checks to seccomp
This is required to increase protection when running in GKE.

PiperOrigin-RevId: 210635123
Change-Id: Iaaa8be49e73f7a3a90805313885e75894416f0b5
2018-08-28 17:10:03 -07:00
Fabricio Voznika ae648bafda Add command-line parameter to trigger panic on signal
This is to troubleshoot problems with a hung process that is
not responding to 'runsc debug --stack' command.

PiperOrigin-RevId: 210483513
Change-Id: I4377b210b4e51bc8a281ad34fd94f3df13d9187d
2018-08-27 20:36:10 -07:00
Fabricio Voznika db81c0b02f Put fsgofer inside chroot
Now each container gets its own dedicated gofer that is chroot'd to the
rootfs path. This is done to add an extra layer of security in case the
gofer gets compromised.

PiperOrigin-RevId: 210396476
Change-Id: Iba21360a59dfe90875d61000db103f8609157ca0
2018-08-27 11:10:14 -07:00
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminial:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Kevin Krakauer 02dfceab6d runsc: Allow runsc to properly search the PATH for executable name.
Previously, runsc improperly attempted to find an executable in the container's
PATH.

We now search the PATH via the container's fsgofer rather than the host FS,
eliminating the confusing differences between paths on the host and within a
container.

PiperOrigin-RevId: 210159488
Change-Id: I228174dbebc4c5356599036d6efaa59f28ff28d2
2018-08-24 14:42:40 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Fabricio Voznika 0d350aac7f Enable SACK in runsc
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.

PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
2018-08-08 10:26:18 -07:00
Ian Gudger 3cd7824410 Move stack clock to options struct
PiperOrigin-RevId: 207039273
Change-Id: Ib8f55a6dc302052ab4a10ccd70b07f0d73b373df
2018-08-01 20:22:02 -07:00
Justine Olshan c05660373e Moved restore code out of create and made to be called after create.
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specificactions regarding the kernel
and the file system. These actions were originally in booting the sandbox.

Now setting up the file system is deferred until a call to a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.

These changes required careful execution of concurrent processes which
required the use of a channel.

Full docker integration still needs the ability to restore into the same
container.

PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
2018-07-18 16:58:30 -07:00
Nicolas Lacasse 9059983fdb runsc: Fix map access race in boot.Loader.waitContainer.
PiperOrigin-RevId: 204522004
Change-Id: I4819dc025f0a1df03ceaaba7951b1902d44562b3
2018-07-13 13:46:14 -07:00
Nicolas Lacasse 67507bd579 runsc: Don't close the control server in a defer.
Closing the control server will block until all open requests have completed.
If a control server method panics, we end up stuck because the defer'd Destroy
function will never return.

PiperOrigin-RevId: 204354676
Change-Id: I6bb1d84b31242d7c3f20d5334b1c966bd6a61dbf
2018-07-12 13:36:57 -07:00
Michael Pratt 660f1203ff Fix runsc VDSO mapping
80bdf8a406 accidentally moved vdso into an
inner scope, never assigning the vdso variable passed to the Kernel and
thus skipping VDSO mappings.

Fix this and remove the ability for loadVDSO to skip VDSO mappings,
since tests that do so are gone.

PiperOrigin-RevId: 203169135
Change-Id: Ifd8cadcbaf82f959223c501edcc4d83d05327eba
2018-07-03 12:53:39 -07:00
Justine Olshan 80bdf8a406 Sets the restore environment for restoring a container.
Updated how restoring occurs through boot.go with a separate Restore function.
This prevents a new process and new mounts from being created.
Added tests to ensure the container is restored.
Registered checkpoint and restore commands so they can be used.
Docker support for these commands is still limited.
Working on #80.

PiperOrigin-RevId: 202710950
Change-Id: I2b893ceaef6b9442b1ce3743bd112383cb92af0c
2018-06-29 14:47:40 -07:00
Kevin Krakauer 16d37973eb runsc: Add the "wait" subcommand.
Users can now call "runsc wait <container id>" to wait on a particular process
inside the container. -pid can also be used to wait on a specific PID.

Manually tested the wait subcommand for a single waiter and multiple waiters
(simultaneously 2 processes waiting on the container and 2 processes waiting on
a PID within the container).

PiperOrigin-RevId: 202548978
Change-Id: Idd507c2cdea613c3a14879b51cfb0f7ea3fb3d4c
2018-06-28 14:56:36 -07:00
Fabricio Voznika 8459390cdd Error out if spec is invalid
Closes #66

PiperOrigin-RevId: 202496258
Change-Id: Ib9287c5bf1279ffba1db21ebd9e6b59305cddf34
2018-06-28 09:57:27 -07:00
Fabricio Voznika 1f207de315 Add option to configure watchdog action
PiperOrigin-RevId: 202494747
Change-Id: I4d4a18e71468690b785060e580a5f83c616bd90f
2018-06-28 09:46:50 -07:00
Lantao Liu 000fd8d1e4 runsc: set gofer umask to 0.
PiperOrigin-RevId: 202185642
Change-Id: I2eefcc0b2ffadc6ef21d177a8a4ab0cda91f3399
2018-06-26 13:40:04 -07:00
Kevin Krakauer 04bdcc7b65 runsc: Enable waiting on individual containers within a sandbox.
PiperOrigin-RevId: 201742160
Change-Id: Ia9fa1442287c5f9e1196fb117c41536a80f6bb31
2018-06-22 14:31:25 -07:00