Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.
PiperOrigin-RevId: 291811289
Linux kernel before 4.19 doesn't implement a feature that updates
open FD after a file is open for write (and is copied to the upper
layer). Already open FD will continue to read the old file content
until they are reopened. This is especially problematic for gVisor
because it caches open files.
Flag was added to force readonly files to be reopenned when the
same file is open for write. This is only needed if using kernels
prior to 4.19.
Closes#1006
It's difficult to really test this because we never run on tests
on older kernels. I'm adding a test in GKE which uses kernels
with the overlayfs problem for 1.14 and lower.
PiperOrigin-RevId: 275115289
Options that do not change mount behavior inside the Sentry are
irrelevant and should not be used when looking for possible
incompatibilities between master and slave mounts.
PiperOrigin-RevId: 273593486
This used to be the case, but regressed after a recent change.
Also made a few fixes around it and clean up the code a bit.
Closes#720
PiperOrigin-RevId: 265717496
Each gofer now has a goroutine that polls on the FDs used
to communicate with the sandbox. The respective gofer is
destroyed if any of the FDs is closed.
Closes#601
PiperOrigin-RevId: 261383725
We can get the mount namespace from the CreateProcessArgs in all cases where we
need it. This also gets rid of kernel.Destroy method, since the only thing it
was doing was DecRefing the mounts.
Removing the need to call kernel.SetRootMountNamespace also allowed for some
more simplifications in the container fs setup code.
PiperOrigin-RevId: 261357060
The different containers in a sandbox used only one pid
namespace before. This results in that a container can see
the processes in another container in the same sandbox.
This patch use different pid namespace for different containers.
Signed-off-by: chris.zn <chris.zn@antfin.com>
This keeps all container filesystem completely separate from eachother
(including from the root container filesystem), and allows us to get rid of the
"__runsc_containers__" directory.
It also simplifies container startup/teardown as we don't have to muck around
in the root container's filesystem.
PiperOrigin-RevId: 259613346
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)
Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.
This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)
PiperOrigin-RevId: 256285890
Currently, the overlay dirCache is only used for a single logical use of
getdents. i.e., it is discard when the FD is closed or seeked back to
the beginning.
But the initial work of getting the directory contents can be quite
expensive (particularly sorting large directories), so we should keep it
as long as possible.
This is very similar to the readdirCache in fs/gofer.
Since the upper filesystem does not have to allow caching readdir
entries, the new CacheReaddir MountSourceOperations method controls this
behavior.
This caching should be trivially movable to all Inodes if desired,
though that adds an additional copy step for non-overlay Inodes.
(Overlay Inodes already do the extra copy).
PiperOrigin-RevId: 255477592
The code was wrongly assuming that only read access was
required from the lower overlay when checking for permissions.
This allowed non-writable files to be writable in the overlay.
Fixes#316
PiperOrigin-RevId: 255263686
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.
For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.
PiperOrigin-RevId: 252704037
No change in functionaly. Added containerMounter object
to keep state while the mounts are processed. This will
help upcoming changes to share mounts per-pod.
PiperOrigin-RevId: 251350096
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.
PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes#209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.
PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
Removed "error" and "failed to" prefix that don't add value
from messages. Adjusted a few other messages. In particular,
when the container fail to start, the message returned is easier
for humans to read:
$ docker run --rm --runtime=runsc alpine foobar
docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory
Closes#77
PiperOrigin-RevId: 230022798
Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1
Runsc wants to mount /tmp using internal tmpfs implementation for
performance. However, it risks hiding files that may exist under
/tmp in case it's present in the container. Now, it only mounts
over /tmp iff:
- /tmp was not explicitly asked to be mounted
- /tmp is empty
If any of this is not true, then /tmp maps to the container's
image /tmp.
Note: checkpoint doesn't have sentry FS mounted to check if /tmp
is empty. It simply looks for explicit mounts right now.
PiperOrigin-RevId: 229607856
Change-Id: I10b6dae7ac157ef578efc4dfceb089f3b94cde06
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.
PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.
PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
This can happen when destroy is called multiple times or when destroy
failed previously and is being called again.
PiperOrigin-RevId: 221882034
Change-Id: I8d069af19cf66c4e2419bdf0d4b789c5def8d19e
destroyContainerFS must wait for all async operations to finish before
returning. In an attempt to do this, we call fs.AsyncBarrier() at the end of
the function. However, there are many defer'd DecRefs which end up running
AFTER the AsyncBarrier() call.
This CL fixes this by calling fs.AsyncBarrier() in the first defer statement,
thus ensuring that it runs at the end of the function, after all other defers.
PiperOrigin-RevId: 220523545
Change-Id: I5e96ee9ea6d86eeab788ff964484c50ef7f64a2f
It's hard to resolve symlinks inside the sandbox because rootfs and mounts
may be read-only, forcing us to create mount points inside lower layer of an
overlay, **before** the volumes are mounted.
Since the destination must already be resolved outside the sandbox when creating
mounts, take this opportunity to rewrite the spec with paths resolved.
"runsc boot" will use the "resolved" spec to load mounts. In addition, symlink
traversals were disabled while mounting containers inside the sandbox.
It haven't been able to write a good test for it. So I'm relying on manual tests
for now.
PiperOrigin-RevId: 217749904
Change-Id: I7ac434d5befd230db1488446cda03300cc0751a9
This CL changes the semantics of the "--file-access" flag so that it only
affects the root filesystem. The default remains "exclusive" which is the
common use case, as neither Docker nor K8s supports sharing the root.
Keeping the root fs as "exclusive" means that the fs-intensive work done during
application startup will mostly be cacheable, and thus faster.
Non-root bind mounts will always be shared.
This CL also removes some redundant FSAccessType validations. We validate this
flag in main(), so we can assume it is valid afterwards.
PiperOrigin-RevId: 214359936
Change-Id: I7e75d7bf52dbd7fa834d0aacd4034868314f3b51
The issue with the previous change was that the stdin/stdout/stderr passed to
the sentry were dup'd by host.ImportFile. This left a dangling FD that by never
closing caused containerd to timeout waiting on container stop.
PiperOrigin-RevId: 213753032
Change-Id: Ia5e4c0565c42c8610d3b59f65599a5643b0901e4
This method will:
1. Stop the container process if it is still running.
2. Unmount all sanadbox-internal mounts for the container.
3. Delete the contaner root directory inside the sandbox.
Destroy is idempotent, and safe to call concurrantly.
This fixes a bug where after stopping a container, we cannot unmount the
container root directory on the host. This bug occured because the sandbox
dirent cache was holding a dirent with a host fd corresponding to a file inside
the container root on the host. The dirent cache did not know that the
container had exited, and kept the FD open, preventing us from unmounting on
the host.
Now that we unmount (and flush) all container mounts inside the sandbox, any
host FDs donated by the gofer will be closed, and we can unmount the container
root on the host.
PiperOrigin-RevId: 213737693
Change-Id: I28c0ff4cd19a08014cdd72fec5154497e92aacc9
Capabilities.Set() adds capabilities,
but doesn't remove existing ones that might have been loaded. Fixed
the code and added tests.
PiperOrigin-RevId: 213726369
Change-Id: Id7fa6fce53abf26c29b13b9157bb4c6616986fba
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.
Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute. So we have to
do the path lookup much later than we previously were.
PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
It was used before gofer was implemented and it's not
supported anymore.
BREAKING CHANGE: proxy-shared and proxy-exclusive options
are now: shared and exclusive.
PiperOrigin-RevId: 212017643
Change-Id: If029d4073fe60583e5ca25f98abb2953de0d78fd
With multi-gofers, bind mounts in sub-containers should
just work. Removed restrictions and added test. There are
also a few cleanups along the way, e.g. retry unmounting
in case cleanup races with gofer teardown.
PiperOrigin-RevId: 211699569
Change-Id: Ic0a69c29d7c31cd7e038909cc686c6ac98703374