Properly handle propagation options for root and mounts. Now usage of
mount options shared, rshared, and noexec cause error to start. shared/
rshared breaks sandbox=>host isolation. slave however can be supported
because changes propagate from host to sandbox.
Root FS setup moved inside the gofer. Apart from simplifying the code,
it keeps all mounts inside the namespace. And they are torn down when
the namespace is destroyed (DestroyFS is no longer needed).
PiperOrigin-RevId: 239037661
Change-Id: I8b5ee4d50da33c042ea34fa68e56514ebe20e6e0
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.
PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
HandleLocal is very similar conceptually to MULTICAST_LOOP, so we can unify
the implementations. This has the benefit of making HandleLocal apply even when
the fdbased link endpoint isn't in use.
In addition, move looping logic to route creation so that it doesn't need to be
run for each packet. This should improve performance.
PiperOrigin-RevId: 238099480
Change-Id: I72839f16f25310471453bc9d3fb8544815b25c23
IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default
route are looped back. In order to implement this switch, support for sending
and looping back multicast packets on the default route had to be implemented.
For now we only support IPv4 multicast.
PiperOrigin-RevId: 237534603
Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.
PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
This help troubleshoot cases where the container is killed and the
app logs don't show the reason.
PiperOrigin-RevId: 236982883
Change-Id: I361892856a146cea5b04abaa3aedbf805e123724
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.
See drivers/tty/n_tty.c:n_tty_read()=>job_control().
If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.
See drivers/tty/tty_io.c:tty_check_change().
PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
PACKET_RX_RING allows the use of an mmapped buffer to receive packets from the
kernel. This should cut down the number of host syscalls that need to be made
to receive packets when the underlying fd is a socket of the AF_PACKET type.
PiperOrigin-RevId: 233834998
Change-Id: I8060025c6ced206986e94cc46b8f382b81bfa47f
Nothing reads them and they can simply get stale.
Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD
PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
This should reduce the number of syscalls required to process packets
significantly and improve throughputs.
PiperOrigin-RevId: 231366886
Change-Id: I8b38077262bf9c53176bc4a94b530188d3d7c0ca
When file size changes outside the sandbox, page cache was not
refreshing file size which is required for cacheRemoteRevalidating.
In fact, cacheRemoteRevalidating should be skipping the cache
completely since it's not really benefiting from it. The cache is
cache is already bypassed for unstable attributes (see
cachePolicy.cacheUAttrs). And althought the cache is called to
map pages, they will always miss the cache and map directly from
the host.
Created a HostMappable struct that maps directly to the host and
use it for files with cacheRemoteRevalidating.
Closes#124
PiperOrigin-RevId: 230998440
Change-Id: Ic5f632eabe33b47241e05e98c95e9b2090ae08fc
In Container.Destroy(), we call c.stop() before calling
executeHooksBestEffort(), therefore, when we call
executeHooksBestEffort(c.Spec.Hooks.Poststop, c.State()) to execute
the poststop hook, it results in a nil pointer dereference since it
reads c.Sandbox.Pid in c.State() after the sandbox has been destroyed.
To fix this bug, we can change container's status to "stopped" before
executing the poststop hook.
Signed-off-by: ShiruRen <renshiru2000@gmail.com>
Change-Id: I4d835e430066fab7e599e188f945291adfc521ef
PiperOrigin-RevId: 230975505
Mounting lib and lib64 are not necessary anymore and simplifies the test.
PiperOrigin-RevId: 230971195
Change-Id: Ib91a3ffcec4b322cd3687c337eedbde9641685ed
Removed "error" and "failed to" prefix that don't add value
from messages. Adjusted a few other messages. In particular,
when the container fail to start, the message returned is easier
for humans to read:
$ docker run --rm --runtime=runsc alpine foobar
docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory
Closes#77
PiperOrigin-RevId: 230022798
Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1
In addition, it fixes a race condition in TestMultiContainerGoferStop.
There are two scripts copy the same set of files into the same directory
and sometime one of this command fails with EXIST.
PiperOrigin-RevId: 230011247
Change-Id: I9289f72e65dc407cdcd0e6cd632a509e01f43e9c
Runsc wants to mount /tmp using internal tmpfs implementation for
performance. However, it risks hiding files that may exist under
/tmp in case it's present in the container. Now, it only mounts
over /tmp iff:
- /tmp was not explicitly asked to be mounted
- /tmp is empty
If any of this is not true, then /tmp maps to the container's
image /tmp.
Note: checkpoint doesn't have sentry FS mounted to check if /tmp
is empty. It simply looks for explicit mounts right now.
PiperOrigin-RevId: 229607856
Change-Id: I10b6dae7ac157ef578efc4dfceb089f3b94cde06
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.
PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
In this case, new mounts are not created in the host mount namspaces, so
tearDownChroot isn't needed, because chroot will be destroyed with a
sandbox mount namespace.
In additional, pivot_root can't be called instead of chroot.
PiperOrigin-RevId: 229250871
Change-Id: I765bdb587d0b8287a6a8efda8747639d37c7e7b6
And we need to wait a gofer process before cgroup.Uninstall,
because it is running in the sandbox cgroups.
PiperOrigin-RevId: 228904020
Change-Id: Iaf8826d5b9626db32d4057a1c505a8d7daaeb8f9
The original code assumed that it was safe to join and not restore cgroup,
but Container.Run will not exit after calling start, making cgroup cleanup
fail because there were still processes inside the cgroup.
PiperOrigin-RevId: 228529199
Change-Id: I12a48d9adab4bbb02f20d71ec99598c336cbfe51
File/dir/symlink creation is multi-step and may leave state behind in
case of failure in one of the steps. Added best effort attempt to
clean up.
PiperOrigin-RevId: 228286612
Change-Id: Ib03c27cd3d3e4f44d0352edc6ee212a53412d7f1
Make 'runsc create' join cgroup before creating sandbox process.
This removes the need to synchronize platform creation and ensure
that sandbox process is charged to the right cgroup from the start.
PiperOrigin-RevId: 227166451
Change-Id: Ieb4b18e6ca0daf7b331dc897699ca419bc5ee3a2
"RLIMIT_MEMLOCK: This is the maximum number of bytes of memory that may
be locked into RAM." - getrlimit(2)
PiperOrigin-RevId: 226384346
Change-Id: Iefac4a1bb69f7714dc813b5b871226a8344dc800
Currently mlock() and friends do nothing whatsoever. However, mlocking
is directly application-visible in a number of ways; for example,
madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked
regions. We handle this inconsistently: MADV_DONTNEED is too important
to not work, but MS_INVALIDATE is rejected.
Change MM to track mlocked regions in a manner consistent with Linux.
It still will not actually pin pages into host physical memory, but:
- mlock() will now cause sentry memory management to precommit mlocked
pages.
- MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as
described above.
PiperOrigin-RevId: 225861605
Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee
If the sandbox process is dead (because of a panic or some other problem),
container.Destroy will never remove the container metadata file, since it will
always fail when calling container.stop().
This CL changes container.Destroy() to always perform the three necessary
cleanup operations:
* Stop the sandbox and gofer processes.
* Remove the container fs on the host.
* Delete the container metadata directory.
Errors from these three operations will be concatenated and returned from
Destroy().
PiperOrigin-RevId: 225448164
Change-Id: I99c6311b2e4fe5f6e2ca991424edf1ebeae9df32
This option is effectively equivalent to -panic-signal, except that the
sandbox does not die after logging the traceback.
PiperOrigin-RevId: 225089593
Change-Id: Ifb1c411210110b6104613f404334bd02175e484e