This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
code.
- Processes not waited on will consume memory (like a zombie process)
PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
This CL:
1) Fix `runsc wait`, it now also works after the container exits;
2) Generate correct container state in Load;
2) Make sure `Destory` cleanup everything before successfully return.
PiperOrigin-RevId: 212900107
Change-Id: Ie129cbb9d74f8151a18364f1fc0b2603eac4109a
This is different from the existing -pid-file flag, which saves a host pid.
PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New). This requires that we have RW access to
the platform device when constructing the platform.
However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.
This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.
PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
Inside the chroot, we run as user nobody, so all mounted files and directories
must be accessible to all users.
PiperOrigin-RevId: 212284805
Change-Id: I705e0dbbf15e01e04e0c7f378a99daffe6866807
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.
Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute. So we have to
do the path lookup much later than we previously were.
PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
It was used before gofer was implemented and it's not
supported anymore.
BREAKING CHANGE: proxy-shared and proxy-exclusive options
are now: shared and exclusive.
PiperOrigin-RevId: 212017643
Change-Id: If029d4073fe60583e5ca25f98abb2953de0d78fd
We construct a dir with the executable bind-mounted at /exe, and proc mounted
at /proc. Runsc now executes the sandbox process inside this chroot, thus
limiting access to the host filesystem. The mounts and chroot dir are removed
when the sandbox is destroyed.
Because this requires bind-mounts, we can only do the chroot if we have
CAP_SYS_ADMIN.
PiperOrigin-RevId: 211994001
Change-Id: Ia71c515e26085e0b69b833e71691830148bc70d1
Docker and containerd do not expose runsc's stderr, so tracking down sentry
panics can be painful.
If we have a debug log file, we should send panics (and all stderr data) to the
log file.
PiperOrigin-RevId: 211992321
Change-Id: I5f0d2f45f35c110a38dab86bafc695aaba42f7a3
Now, we can kill individual containers rather than the entire sandbox.
PiperOrigin-RevId: 211748106
Change-Id: Ic97e91db33d53782f838338c4a6d0aab7a313ead
When no capabilities are specified in exec, use the
container's capabilities to match runc's behavior.
PiperOrigin-RevId: 211735186
Change-Id: Icd372ed64410c81144eae94f432dffc9fe3a86ce
With multi-gofers, bind mounts in sub-containers should
just work. Removed restrictions and added test. There are
also a few cleanups along the way, e.g. retry unmounting
in case cleanup races with gofer teardown.
PiperOrigin-RevId: 211699569
Change-Id: Ic0a69c29d7c31cd7e038909cc686c6ac98703374
Remove GetExecutablePath (the non-internal version). This makes path handling
more consistent between exec, root, and child containers.
The new getExecutablePath now uses MountNamespace.FindInode, which is more
robust than Walking the Dirent tree ourselves.
This also removes the last use of lstat(2) in the sentry, so that can be
removed from the filters.
PiperOrigin-RevId: 211683110
Change-Id: Ic8ec960fc1c267aa7d310b8efe6e900c88a9207a
When starting a sandbox without direct file or network access, we create an
empty user namespace and run the sandbox in there. However, the root user in
that namespace is still mapped to the root user in the parent namespace.
This CL maps the "nobody" user from the parent namespace into the child
namespace, and runs the sandbox process as user "nobody" inside the new
namespace.
PiperOrigin-RevId: 211572223
Change-Id: I1b1f9b1a86c0b4e7e5ca7bc93be7d4887678bab6
This is a prereq for running the sandbox process as user "nobody", when it may
not have permissions to open these files.
Instead, we must open then before starting the sandbox process, and pass them
by FD.
The specutils.ReadSpecFromFile method was fixed to always seek to the beginning
of the file before reading. This allows Files from the same FD to be read
multiple times, as we do in the boot command when the apply-caps flag is set.
Tested with --network=host.
PiperOrigin-RevId: 211570647
Change-Id: I685be0a290aa7f70731ebdce82ebc0ebcc9d475c
GoCompile: missing strict dependencies:
/tmpfs/tmp/bazel/sandbox/linux-sandbox/1744/execroot/__main__/runsc/main.go:
import of "gvisor.googlesource.com/gvisor/runsc/specutils"
This was broken in 210995199.
PiperOrigin-RevId: 211086595
Change-Id: I166b9a2ed8e4d6e624def944b720190940d7537c
This is a prereq for running the sandbox process as user "nobody", when it may
not have permissions to open these files.
Instead, we must open then before starting the sandbox process, and pass them
by FD.
PiperOrigin-RevId: 210995199
Change-Id: I715875a9553290b4a49394a8fcd93be78b1933dd
This is to keep it consistent with other test, and
it's easier to maintain them in single file.
Also increase python test timeout to deflake it.
PiperOrigin-RevId: 210575042
Change-Id: I2ef5bcd5d97c08549f0c5f645c4b694253ef0b4d
This is to troubleshoot problems with a hung process that is
not responding to 'runsc debug --stack' command.
PiperOrigin-RevId: 210483513
Change-Id: I4377b210b4e51bc8a281ad34fd94f3df13d9187d
For readonly filesystems specified via relative path, we were forgetting to
mount relative to the container's bundle directory.
PiperOrigin-RevId: 210483388
Change-Id: I84809fce4b1f2056d0e225547cb611add5f74177
When revalidating a Dirent, if the inode id is the same, then we don't need to
throw away the entire Dirent. We can just update the unstable attributes in
place.
If the inode id has changed, then the remote file has been deleted or moved,
and we have no choice but to throw away the dirent we have a look up another.
In this case, we may still end up losing a mounted dirent that is a child of
the revalidated dirent. However, that seems appropriate here because the entire
mount point has been pulled out from underneath us.
Because gVisor's overlay is at the Inode level rather than the Dirent level, we
must pass the parent Inode and name along with the Inode that is being
revalidated.
PiperOrigin-RevId: 210431270
Change-Id: I705caef9c68900234972d5aac4ae3a78c61c7d42
Previously, we were only using the host inode id as the QID path. But the host
filesystem can have multiple devices with conflicting inode ids. This resulted
in duplicate inode ids in the sentry.
This CL generates a unique QID for each <host inode, host device> pair.
PiperOrigin-RevId: 210424813
Change-Id: I16d106f61c7c8f910c0da4ceec562a010ffca2fb
Now each container gets its own dedicated gofer that is chroot'd to the
rootfs path. This is done to add an extra layer of security in case the
gofer gets compromised.
PiperOrigin-RevId: 210396476
Change-Id: Iba21360a59dfe90875d61000db103f8609157ca0
This CL adds terminal support for "docker exec". We previously only supported
consoles for the container process, but not exec processes.
The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.
Note that control-character signals are still not properly supported.
Tested with:
$ docker run --runtime=runsc -it alpine
In another terminial:
$ docker exec -it <containerid> /bin/sh
PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
Previously, runsc improperly attempted to find an executable in the container's
PATH.
We now search the PATH via the container's fsgofer rather than the host FS,
eliminating the confusing differences between paths on the host and within a
container.
PiperOrigin-RevId: 210159488
Change-Id: I228174dbebc4c5356599036d6efaa59f28ff28d2
This is used when '--overlay=true' to guarantee writes are not sent to gofer.
PiperOrigin-RevId: 210116288
Change-Id: I7616008c4c0e8d3668e07a205207f46e2144bf30
Removed syscalls that are only used by whitelistfs
which has its own set of filters.
PiperOrigin-RevId: 209967259
Change-Id: Idb2e1b9d0201043d7cd25d96894f354729dbd089
The bug was caused by os.File's finalizer, which closes the file. Because
fsgofer.serve() was passed a file descriptor as an int rather than a os.File,
callers would pass os.File.Fd(), and the os.File would go out of scope. Thus,
the file would get GC'd and finalized nondeterministically, causing failures
when the file was used.
PiperOrigin-RevId: 209861834
Change-Id: Idf24d5c1f04c9b28659e62c97202ab3b4d72e994
UDS has a lower size limit than regular files. When running under bazel
this limit is exceeded. Test was changed to always mount /tmp and use
it for the test.
PiperOrigin-RevId: 209717830
Change-Id: I1dbe19fe2051ffdddbaa32b188a9167f446ed193
Tests get a readonly rootfs mapped to / (which was the case before)
and writable TEST_TMPDIR. This makes it easier to setup containers to
write to files and to share state between test and containers.
PiperOrigin-RevId: 209453224
Change-Id: I4d988e45dc0909a0450a3bb882fe280cf9c24334
Bazel adds the build type in front of directories making it hard to
refer to binaries in code.
PiperOrigin-RevId: 209010854
Change-Id: I6c9da1ac3bbe79766868a3b14222dd42d03b4ec5
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.
The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.
This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
serve new content.
* Enables the sentry, when starting a new container, to add the new container's
filesystem.
* Mounts those new filesystems at separate roots within the sentry.
PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
This file access type is actually called "proxy-shared", but I forgot to update
all locations.
PiperOrigin-RevId: 208832491
Change-Id: I7848bc4ec2478f86cf2de1dcd1bfb5264c6276de
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively. While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.
This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.
This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.
A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.
All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely served by the
root fs.
PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
Previously, processes which used file-system Unix Domain Sockets could not be
checkpoint-ed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.
PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
inethost doesn't support netlink and 'ifconfig' call to retrieve IP address
fails. Look up IP address in /etc/hosts instead.
PiperOrigin-RevId: 208135641
Change-Id: I3c2ce15db6fc7c3306a45e4bfb9cc5d4423ffad3
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080
PiperOrigin-RevId: 207995083
Change-Id: I17adc2a04df48bfea711011a5df206326a1fb8ef
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.
PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
integration_test runs manually and breakage wasn't detected. Added test to
kokoro to ensure breakages are detected in the future.
PiperOrigin-RevId: 207772835
Change-Id: Iada81b579b558477d4db3516b38366ef6a2e933d
mysql image test is timing out sporadically and it's hard to tell
where the slow down in coming from.
PiperOrigin-RevId: 207147237
Change-Id: I05a4d2c116292695d63cf861f3b89cd1c54b6106
When "exec" command is called without the "--detach" flag, we spawn a second
"exec" command and wait for that one to start. We use the pid file passed in
--pid-file to detect when this second command has started running.
However if "exec" is called with no --pid-file flag, this system breaks down,
as we don't have a pid file to wait for.
This CL ensures that the second instance of the "exec" command always writes a
pid-file, so the wait is successful.
PiperOrigin-RevId: 206002403
Change-Id: If9f2be31eb6e831734b1b833f25054ec71ab94a6
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specificactions regarding the kernel
and the file system. These actions were originally in booting the sandbox.
Now setting up the file system is deferred until a call to a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.
These changes required careful execution of concurrent processes which
required the use of a channel.
Full docker integration still needs the ability to restore into the same
container.
PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
We must delete the output file at the beginning of the test, otherwise the test
fails immediately.
Also some minor cleanups in readOutputFile.
PiperOrigin-RevId: 205150525
Change-Id: I6bae1acd5b315320a2c6e25a59afcfc06267fb17
Closing the control server will block until all open requests have completed.
If a control server method panics, we end up stuck because the defer'd Destroy
function will never return.
PiperOrigin-RevId: 204354676
Change-Id: I6bb1d84b31242d7c3f20d5334b1c966bd6a61dbf
Moved some of the docker image functions to testutil.go.
Test runsc commands create, start, stop, pause, and resume.
PiperOrigin-RevId: 204138452
Change-Id: Id00bc58d2ad230db5e9e905eed942187e68e7c7b
Previously, error message only showed "<nil>" when child and pid were the
same (since no error is returned by the Wait4 syscall in this case) which
occurs when the process has incorrectly terminated. A new error message
was added to improve clarity for such a case. Tests for this function were
modified to reflect the improved distinction between process termination
and error.
PiperOrigin-RevId: 204018107
Change-Id: Ib38481c9590405e5bafcb6efe27fd49b3948910c