syscall.POLL is not supported on arm64, using syscall.PPOLL
to support both the x86 and arm64. refs #63
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I2c81a063d3ec4e7e6b38fe62f17a0924977f505e
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/543 from xiaobo55x:master ba598263fd3748d1addd48e4194080aa12085164
PiperOrigin-RevId: 260752049
By following the directions in the README file, these Dockerfiles can be
built and used to run native language tests for their respective runtimes.
PiperOrigin-RevId: 260174430
The different containers in a sandbox used only one pid
namespace before. This results in that a container can see
the processes in another container in the same sandbox.
This patch use different pid namespace for different containers.
Signed-off-by: chris.zn <chris.zn@antfin.com>
This keeps all container filesystem completely separate from eachother
(including from the root container filesystem), and allows us to get rid of the
"__runsc_containers__" directory.
It also simplifies container startup/teardown as we don't have to muck around
in the root container's filesystem.
PiperOrigin-RevId: 259613346
This change has the listTests() function return
a string slice of all the tests. Originally, I
planned not to modify the listTests() function
and instead capture the output of it and then
iterate through the captured output. I decided
against this approach as most of the test binaries
already produce a slice as they collect tests
through filepath.Walk(). Now I use this slice
and return it so that I can iterate through in
runAllTests() and also when printing out the tests.
PiperOrigin-RevId: 259599782
This is the first version of a testing program to be used by gVisor for
including language testing into their presubmits. It works when ran in
the same manor the image and integration tests are ran in as described in
their README file.
PiperOrigin-RevId: 259392416
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)
Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.
This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)
PiperOrigin-RevId: 256285890
Addresses obvious typos, in the documentation only.
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
Currently, the overlay dirCache is only used for a single logical use of
getdents. i.e., it is discard when the FD is closed or seeked back to
the beginning.
But the initial work of getting the directory contents can be quite
expensive (particularly sorting large directories), so we should keep it
as long as possible.
This is very similar to the readdirCache in fs/gofer.
Since the upper filesystem does not have to allow caching readdir
entries, the new CacheReaddir MountSourceOperations method controls this
behavior.
This caching should be trivially movable to all Inodes if desired,
though that adds an additional copy step for non-overlay Inodes.
(Overlay Inodes already do the extra copy).
PiperOrigin-RevId: 255477592
The code was wrongly assuming that only read access was
required from the lower overlay when checking for permissions.
This allowed non-writable files to be writable in the overlay.
Fixes#316
PiperOrigin-RevId: 255263686
Go was going to change the behavior of SysProcAttr.Ctty such that it must be an
FD in the *parent* FD table:
https://go-review.googlesource.com/c/go/+/178919/
However, after some debate, it was decided that this change was too
backwards-incompatible, and so it was reverted.
https://github.com/golang/go/issues/29458
The behavior going forward is unchanged: the Ctty FD must be an FD in the
*child* FD table.
PiperOrigin-RevId: 255228476
An upcoming change in Go 1.13 [1] changes the semantics of the SysProcAttr.Ctty
field. Prior to the change, the FD must be an FD in the child process's FD
table (aka "post-shuffle"). After the change, the FD must be an FD in the
current process's FD table (aka "pre-shuffle").
To be compatible with both versions this CL introduces a new boolean
"CttyFdIsPostShuffle" which indicates whether a pre- or post-shuffle FD should
be provided. We use build tags to chose the correct one.
1: https://go-review.googlesource.com/c/go/+/178919/
PiperOrigin-RevId: 255015303
When we reopen file by path, we can't be sure that
we will open exactly the same file. The file can be
deleted and another one with the same name can be
created.
PiperOrigin-RevId: 254898594
There were 3 string arguments that could be easily misplaced
and it makes it easier to add new arguments, especially for
Container that has dozens of callers.
PiperOrigin-RevId: 253872074
$ bazel build runsc:runsc-debian
File ".../bazel_tools/tools/build_defs/pkg/make_deb.py", line 311,
in GetFlagValue:
flagvalue = flagvalue.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
make_deb.py is incompatible with Python3.
https://github.com/bazelbuild/bazel/issues/8443
PiperOrigin-RevId: 253691923
runsc will now set the HOME environment variable as required by POSIX. The
user's home directory is retrieved from the /etc/passwd file located on the
container's file system during boot.
PiperOrigin-RevId: 253120627
'--rootless' flag lets a non-root user execute 'runsc do'.
The drawback is that the sandbox and gofer processes will
run as root inside a user namespace that is mapped to the
caller's user, intead of nobody. And network is defaulted
to '--network=host' inside the root network namespace. On
the bright side, it's very convenient for testing:
runsc --rootless do ls
runsc --rootless do curl www.google.com
PiperOrigin-RevId: 252840970
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.
For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.
PiperOrigin-RevId: 252704037
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.
Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this info to debug and get info
on the compatibility of the version of runsc they are running or to generate
documentation.
PiperOrigin-RevId: 252558304
Overlayfs was expecting the parent to exist when bind(2)
was called, which may not be the case. The fix is to copy
the parent directory to the upper layer before binding
the UDS.
There is not good place to add tests for it. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus copied the parent to the upper layer).
Added it to runsc integration tests for now. If it turns
out we have lots of these kind of tests, we can consider
moving them somewhere more appropriate.
PiperOrigin-RevId: 251954156
Containerd uses the last error message sent to the log to
print as failure cause for create/exec. This required a
few changes in the logging logic for runsc:
- cmd.Errorf/Fatalf: now writes a message with 'error'
level to containerd log, in addition to stderr and
debug logs, like before.
- log.Infof/Warningf/Fatalf: are not sent to containerd
log anymore. They are mostly used for debugging and not
useful to containerd. In most cases, --debug-log is
enabled and this avoids the logs messages from being
duplicated.
- stderr is not used as default log destination anymore.
Some commands assume stdio is for the container/process
running inside the sandbox and it's better to never use
it for logging. By default, logs are supressed now.
PiperOrigin-RevId: 251881815
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.
This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.
Updates #231
PiperOrigin-RevId: 251852825
This is required to make the shutdown visible to peers outside the
sandbox.
The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.
The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.
Updates #273
PiperOrigin-RevId: 251763060
No change in functionaly. Added containerMounter object
to keep state while the mounts are processed. This will
help upcoming changes to share mounts per-pod.
PiperOrigin-RevId: 251350096
clearStatus was added to allow detached execution to wait
on the exec'd process and retrieve its exit status. However,
it's not currently used. Both docker and gvisor-containerd-shim
wait on the "shim" process and retrieve the exit status from
there. We could change gvisor-containerd-shim to use waits, but
it will end up also consuming a process for the wait, which is
similar to having the shim process.
Closes#234
PiperOrigin-RevId: 251349490
Fatalf calls os.Exit and a process exits without calling defer callbacks.
Should we do this for other runsc commands?
PiperOrigin-RevId: 249776310
Change-Id: If9d8b54d0ae37db443895906eb33bd9e9b600cc9
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.
PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
We want to know that our environment set up properly
and docker tests pass with a native runtime.
PiperOrigin-RevId: 248229294
Change-Id: I06c221e5eeed6e01bdd1aa935333c57e8eadc498
WaitForHTTP tries GET requests on a port until the call succeeds or timeout.
But we want to be sure that one of our attempts will not stuck for
the whole timeout.
All timeouts are increased to 30 seconds, because test cases with smaller
timeouts fail sometimes even for the native container runtime (runc).
PiperOrigin-RevId: 247888467
Change-Id: I03cfd3275286bc686a78fd26da43231d20667851
And stop storing the Filesystem in the MountSource.
This allows us to decouple the MountSource filesystem type from the name of the
filesystem.
PiperOrigin-RevId: 247292982
Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8
$ dpkg -s runsc
Package: runsc
Status: install ok installed
Priority: optional
Section: contrib/devel
Maintainer: The gVisor Authors <gvisor-dev@googlegroups.com>
Architecture: amd64
Version: 20190304.1-123-g861434f612ce-dirty
Description: gVisor is a user-space kernel, written in Go, that
implements a substantial portion of the Linux system surface. It
includes an Open Container Initiative (OCI) runtime called runsc that
provides an isolation boundary between the application and the host
kernel. The runsc runtime integrates with Docker and Kubernetes,
making it simple to run sandboxed containers.
Homepage: https://gvisor.dev/
Built-Using: Bazel
Change-Id: I6f161de8fba649f12272a87b99529ccfd22e499a
PiperOrigin-RevId: 246546294
With this change, we will be able to run runsc do in a host network namespace.
PiperOrigin-RevId: 246436660
Change-Id: I8ea18b1053c88fe2feed74239b915fe7a151ce34
Opensource tools (e. g. https://github.com/fatih/vim-go) can't hanlde more than
one golang package in one directory.
PiperOrigin-RevId: 246435962
Change-Id: I67487915e3838762424b2d168efc54ae34fb801f
Sandbox always runsc with IP 192.168.10.2 and the peer
network adds 1 to the address (192.168.10.3). Sandbox
IP can be changed using --ip flag.
Here a few examples:
sudo runsc do curl www.google.com
sudo runsc do --ip=10.10.10.2 bash -c "echo 123 | netcat -l -p 8080"
PiperOrigin-RevId: 246421277
Change-Id: I7b3dce4af46a57300350dab41cb27e04e4b6e9da
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.
Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.
As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.
PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes#209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
The caller must call Readdir() at least twice to detect
EOF. The old code was always restarting the directory
search and then skipping elements already seen, effectively
doubling the cost to read a directory. The code now
remembers the last offset and doesn't reposition the cursor
if next request comes at the same offset.
PiperOrigin-RevId: 244957816
Change-Id: If21a8dc68b76614adbcf4301439adfda40f2643f
os.NewFile() accounts for 38% of CPU time in localFile.Walk().
This change switchs to use fd.FD which is much cheaper to create.
Now, fd.New() in localFile.Walk() accounts for only 4%.
PiperOrigin-RevId: 244944983
Change-Id: Ic892df96cf2633e78ad379227a213cb93ee0ca46
Create, Start, and Destroy were racing to create and destroy the
metadata directory of containers.
This is a re-upload of
https://gvisor-review.googlesource.com/c/gvisor/+/16260, but with the
correct account.
Change-Id: I16b7a9d0971f0df873e7f4145e6ac8f72730a4f1
PiperOrigin-RevId: 244892991
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.
PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
It provides an easy way to run commands to quickly test gVisor.
By default it maps the host root as the container root with a
writable overlay on top (so the host root is not modified).
Example:
sudo runsc do ls -lh --color
sudo runsc do ~/src/test/my-test.sh
PiperOrigin-RevId: 243178711
Change-Id: I05f3d6ce253fe4b5f1362f4a07b5387f6ddb5dd9
Otherwise, we will not have capabilities in the user namespace.
And this patch adds the noexec option for mounts.
https://github.com/google/gvisor/issues/145
PiperOrigin-RevId: 242706519
Change-Id: I1b78b77d6969bd18038c71616e8eb7111b71207c
'ls' will hang if there is any FIFO in this path. So
return EPERM if unsupported file occurs and add NONBLOCK flag
when opening file to avoid blocking on FIFO read.
Signed-off-by: Liu Hua <sdu.liu@huawei.com>
Change-Id: I8b9a2a48322118d8ad531dd226395438123eb047
PiperOrigin-RevId: 241406726