gvisor

Commit Graph

Author	SHA1	Message	Date
Ridwan Sharif	a63db7d903	Moved FUSE device under the fuse directory	2020-06-25 14:22:21 -04:00
Nicolas Lacasse	58880bf551	Port /dev/net/tun device to VFS2. Updates #2912 #1035 PiperOrigin-RevId: 318162565	2020-06-24 16:23:44 -07:00
Bhasker Hariharan	b070e218c6	Add support for Stack level options. Linux controls socket send/receive buffers using a few sysctl variables - net.core.rmem_default - net.core.rmem_max - net.core.wmem_max - net.core.wmem_default - net.ipv4.tcp_rmem - net.ipv4.tcp_wmem The first 4 control the default socket buffer sizes for all sockets raw/packet/tcp/udp and also the maximum permitted socket buffer that can be specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...). The last two control the TCP auto-tuning limits and override the default specified in rmem_default/wmem_default as well as the max limits. Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it to limit the maximum size in setsockopt() as well as uses it for raw/udp sockets. This changelist introduces the other 4 and updates the udp/raw sockets to use the newly introduced variables. The values for min/max match the current tcp_rmem/wmem values and the default buffer size for UDP/RAW sockets is updated to match the Linux value of 212KiB up from the really low current value of 32 KiB. Updates #3043 Fixes #3043 PiperOrigin-RevId: 318089805	2020-06-24 10:24:20 -07:00
Nicolas Lacasse	0f328beb0d	Port /dev/tty device to VFS2. Support is limited to the functionality that exists in VFS1. Updates #2923 #1035 PiperOrigin-RevId: 317981417	2020-06-23 18:48:37 -07:00
Kevin Krakauer	28b8a5cc3a	iptables: remove metadata struct Metadata was useful for debugging and safety, but enough tests exist that we should see failures when (de)serialization is broken. It made stack initialization more cumbersome and it's also getting in the way of ip6tables. PiperOrigin-RevId: 317210653	2020-06-18 17:02:16 -07:00
Bhasker Hariharan	07ff909e76	Support setsockopt SO_SNDBUF/SO_RCVBUF for raw/udp sockets. Updates #173,#6 Fixes #2888 PiperOrigin-RevId: 317087652	2020-06-18 06:07:20 -07:00
gVisor bot	dbf786c6b3	Add runsc options to set checksum offloading status --tx-checksum-offload=<true|false> enable TX checksum offload (default: false) --rx-checksum-offload=<true|false> enable RX checksum offload (default: true) Fixes #2989 PiperOrigin-RevId: 316781309	2020-06-16 16:34:26 -07:00
Ian Lewis	8ea99d58ff	Set the HOME environment variable for sub-containers. Fixes #701 PiperOrigin-RevId: 316025635	2020-06-11 19:31:24 -07:00
Jamie Liu	77c206e371	Add //pkg/sentry/fsimpl/overlay. Major differences from existing overlay filesystems: - Linux allows lower layers in an overlay to require revalidation, but not the upper layer. VFS1 allows the upper layer in an overlay to require revalidation, but not the lower layer. VFS2 does not allow any layers to require revalidation. (Now that vfs.MkdirOptions.ForSyntheticMountpoint exists, no uses of overlay in VFS1 are believed to require upper layer revalidation; in particular, the requirement that the upper layer support the creation of "trusted." extended attributes for whiteouts effectively required the upper filesystem to be tmpfs in most cases.) - Like VFS1, but unlike Linux, VFS2 overlay does not attempt to make mutations of the upper layer atomic using a working directory and features like RENAME_WHITEOUT. (This may change in the future, since not having a working directory makes error recovery for some operations, e.g. rmdir, particularly painful.) - Like Linux, but unlike VFS1, VFS2 represents whiteouts using character devices with rdev == 0; the equivalent of the whiteout attribute on directories is xattr trusted.overlay.opaque = "y"; and there is no equivalent to the whiteout attribute on non-directories since non-directories are never merged with lower layers. - Device and inode numbers work as follows: - In Linux, modulo the xino feature and a special case for when all layers are the same filesystem: - Directories use the overlay filesystem's device number and an ephemeral inode number assigned by the overlay. - Non-directories that have been copied up use the device and inode number assigned by the upper filesystem. - Non-directories that have not been copied up use a per-(overlay, layer)-pair device number and the inode number assigned by the lower filesystem. 
- In VFS1, device and inode numbers always come from the lower layer unless "whited out"; this has the adverse effect of requiring interaction with the lower filesystem even for non-directory files that exist on the upper layer. - In VFS2, device and inode numbers are assigned as in Linux, except that xino and the samefs special case are not supported. - Like Linux, but unlike VFS1, VFS2 does not attempt to maintain memory mapping coherence across copy-up. (This may have to change in the future, as users may be dependent on this property.) - Like Linux, but unlike VFS1, VFS2 uses the overlayfs mounter's credentials when interacting with the overlay's layers, rather than the caller's. - Like Linux, but unlike VFS1, VFS2 permits multiple lower layers in an overlay. - Like Linux, but unlike VFS1, VFS2's overlay filesystem is application-mountable. Updates #1199 PiperOrigin-RevId: 316019067	2020-06-11 18:34:53 -07:00
Fabricio Voznika	4e96b94915	Combine executable lookup code The executable lookup paths for run vs. exec and for VFS1 vs. VFS2 were slightly different from each other. Combine them all into the same logic. PiperOrigin-RevId: 315426443	2020-06-08 23:08:23 -07:00
Rahat Mahmood	21b6bc7280	Implement mount(2) and umount2(2) for VFS2. This is mostly syscall plumbing, VFS2 already implements the internals of mounts. In addition to the syscall definitions, the following mount-related mechanisms are updated: - Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2 filesystems don't implement node-level timestamps yet. - Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs. - Plumb mount namespace ownership, which is necessary for checking appropriate capabilities during mount(2). Updates #1035 PiperOrigin-RevId: 315035352	2020-06-05 19:12:03 -07:00
Nicolas Lacasse	e4e11f2798	Expand syscall filters to support MSAN. PiperOrigin-RevId: 314997564	2020-06-05 14:33:50 -07:00
Ting-Yu Wang	41da7a568b	Fix copylocks error about copying IPTables. IPTables.connections contains a sync.RWMutex. Copying it will trigger copylocks analysis. Tested by manually enabling nogo tests. sync.RWMutex is added to IPTables for the additional race condition discovered. PiperOrigin-RevId: 314817019	2020-06-05 11:29:09 -07:00
Fabricio Voznika	ca5912d13c	More runsc changes for VFS2 - Add /tmp handling - Apply mount options - Enable more container_test tests - Forward signals to child process when test respawns process to run as root inside namespace. Updates #1487 PiperOrigin-RevId: 314263281	2020-06-01 21:32:09 -07:00
Jamie Liu	3a987160aa	Handle gofer blocking opens of host named pipes in VFS2. Using tee instead of read to detect when a O_RDONLY|O_NONBLOCK pipe FD has a writer circumvents the problem of what to do with the byte read from the pipe, avoiding much of the complexity of the fdpipe package. PiperOrigin-RevId: 314216146	2020-06-01 15:33:30 -07:00
Nicolas Lacasse	93edb36cbb	Refactor the ResolveExecutablePath logic. PiperOrigin-RevId: 313871804	2020-05-29 16:35:21 -07:00
Fabricio Voznika	a8c1b32660	Automated rollback of changelist 309082540 PiperOrigin-RevId: 313636920	2020-05-28 12:25:57 -07:00
Fabricio Voznika	32ab382c80	Improve unsupported syscall message PiperOrigin-RevId: 312104899	2020-05-18 10:23:22 -07:00
Jamie Liu	64afaf0e9b	Fix runsc association of gofers and FDs on VFS2. Updates #1487 PiperOrigin-RevId: 311443628	2020-05-13 18:18:09 -07:00
Jamie Liu	d846077628	Enable overlayfs_stale_read by default for runsc. Linux 4.18 and later make reads and writes coherent between pre-copy-up and post-copy-up FDs representing the same file on an overlay filesystem. However, memory mappings remain incoherent: - Documentation/filesystems/overlayfs.rst, "Non-standard behavior": "If a file residing on a lower layer is opened for read-only and then memory mapped with MAP_SHARED, then subsequent changes to the file are not reflected in the memory mapping." - fs/overlay/file.c:ovl_mmap() passes through to the underlying FD without any management of coherence in the overlay. - Experimentally on Linux 5.2: ``` $ cat mmap_cat_page.c #include <err.h> #include <fcntl.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <unistd.h> int main(int argc, char **argv) { if (argc < 2) { errx(1, "syntax: %s [FILE]", argv[0]); } const int fd = open(argv[1], O_RDONLY); if (fd < 0) { err(1, "open(%s)", argv[1]); } const size_t page_size = sysconf(_SC_PAGE_SIZE); void *page = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); if (page == MAP_FAILED) { err(1, "mmap"); } for (;;) { write(1, page, strnlen(page, page_size)); if (getc(stdin) == EOF) { break; } } return 0; } $ gcc -O2 -o mmap_cat_page mmap_cat_page.c $ mkdir lowerdir upperdir workdir overlaydir $ echo old > lowerdir/file $ sudo mount -t overlay -o "lowerdir=lowerdir,upperdir=upperdir,workdir=workdir" none overlaydir $ ./mmap_cat_page overlaydir/file old ^Z [1]+ Stopped ./mmap_cat_page overlaydir/file $ echo new > overlaydir/file $ cat overlaydir/file new $ fg ./mmap_cat_page overlaydir/file old ``` Therefore, while the VFS1 gofer client's behavior of reopening read FDs is only necessary pre-4.18, replacing existing memory mappings (in both sentry and application address spaces) with mappings of the new FD is required regardless of kernel version, and this latter behavior is common to both VFS1 and VFS2. 
Re-document accordingly, and change the runsc flag to enabled by default. New test: - Before this CL: https://source.cloud.google.com/results/invocations/5b222d2c-e918-4bae-afc4-407f5bac509b - After this CL: https://source.cloud.google.com/results/invocations/f28c747e-d89c-4d8c-a461-602b33e71aab PiperOrigin-RevId: 311361267	2020-05-13 10:53:37 -07:00
Fabricio Voznika	18cb3d24cb	Use VFS2 mount names Updates #1487 PiperOrigin-RevId: 311356385	2020-05-13 10:31:29 -07:00
Fabricio Voznika	305f786e51	Adjust a few log messages PiperOrigin-RevId: 311234146	2020-05-12 17:26:07 -07:00
Nicolas Lacasse	c52195d258	Stop avoiding preadv2 and pwritev2, and add them to the filters. Some code paths needed these syscalls anyways, so they should be included in the filters. Given that we depend on these syscalls in some cases, there's no real reason to avoid them any more. PiperOrigin-RevId: 310829126	2020-05-10 17:52:20 -07:00
Jamie Liu	9115f26851	Allocate device numbers for VFS2 filesystems. Updates #1197, #1198, #1672 PiperOrigin-RevId: 310432006	2020-05-07 14:01:53 -07:00
Dean Deng	16da7e790f	Update privateunixsocket TODOs. Synthetic sockets do not have the race condition issue in VFS2, and we will get rid of privateunixsocket as well. Fixes #1200. PiperOrigin-RevId: 310386474	2020-05-07 10:20:48 -07:00
Adin Scannell	279f1eb7ab	Fix runsc syscall documentation generation. We can register any number of tables with any number of architectures, and need not limit the definitions to the architecture in question. This allows runsc to generate documentation for all architectures simultaneously. Similarly, this simplifies the VFSv2 patching process. PiperOrigin-RevId: 310224827	2020-05-06 14:13:48 -07:00
Fabricio Voznika	0a307d0072	Mount VFS2 filesystem using root credentials PiperOrigin-RevId: 309787938	2020-05-04 11:48:00 -07:00
Fabricio Voznika	cbc5bef2a6	Add TTY support on VFS2 to runsc Updates #1623, #1487 PiperOrigin-RevId: 309777922	2020-05-04 10:59:20 -07:00
Bhasker Hariharan	ae15d90436	FIFO QDisc implementation Updates #231 PiperOrigin-RevId: 309323808	2020-04-30 16:41:00 -07:00
moricho	fc53d64367	refactor and add test for bindmount Signed-off-by: moricho <ikeda.morito@gmail.com>	2020-04-26 17:24:34 +09:00
moricho	0b3166f624	add bind/rbind options for mount Signed-off-by: moricho <ikeda.morito@gmail.com>	2020-04-25 22:04:39 +09:00
moricho	93e510e26f	fix behavior of `getMountNameAndOptions` when options include either bind or rbind Signed-off-by: moricho <ikeda.morito@gmail.com>	2020-04-25 22:04:39 +09:00
Zach Koopmans	15a822a193	VFS2: Get HelloWorld image tests to pass with VFS2 This change includes: - Modifications to loader_test.go to get TestCreateMountNamespace to pass with VFS2. - Changes necessary to get TestHelloWorld in image tests to pass with VFS2. This means runsc can run the hello-world container with docker on VFS2. Note: Containers that use sockets will not run with these changes. See "//test/image/...". Any tests here with sockets currently fail (which is all of them but HelloWorld). PiperOrigin-RevId: 308363072	2020-04-24 18:23:37 -07:00
Dean Deng	632b104aff	Plumb context.Context into kernfs.Inode.Open(). PiperOrigin-RevId: 308304793	2020-04-24 12:37:49 -07:00
Dean Deng	1b88c63b3e	Move hostfs mount to Kernel struct. This is needed to set up host fds passed through a Unix socket. Note that the host package depends on kernel, so we cannot set up the hostfs mount directly in Kernel.Init as we do for sockfs and pipefs. Also, adjust sockfs to make its setup look more like hostfs's and pipefs's. PiperOrigin-RevId: 308274053	2020-04-24 10:03:43 -07:00
Jamie Liu	5042ea7e2c	Add vfs.MkdirOptions.ForSyntheticMountpoint. PiperOrigin-RevId: 308143529	2020-04-23 15:37:10 -07:00
Adin Scannell	1481499fe2	Simplify Docker test infrastructure. This change adds a layer of abstraction around the internal Docker APIs, and eliminates all direct dependencies on Dockerfiles in the infrastructure. A subsequent change will automate the generation of local images (with efficient caching). Note that this change drops the use of bazel container rules, as that experiment does not seem to be viable. PiperOrigin-RevId: 308095430	2020-04-23 11:33:30 -07:00
Nicolas Lacasse	e69a871c7b	Move user home detection to its own library. PiperOrigin-RevId: 307977689	2020-04-22 22:18:21 -07:00
Zach Koopmans	12bde95635	Get /bin/true to run on VFS2 Included: - loader_test.go RunTest and TestStartSignal VFS2 - container_test.go TestAppExitStatus on VFS2 - experimental flag added to runsc to turn on VFS2 Note: shared mounts are not yet supported. PiperOrigin-RevId: 307070753	2020-04-17 10:39:19 -07:00
gVisor bot	ac9b32c36b	Merge pull request #2212 from aaronlu:dup_stdioFDs PiperOrigin-RevId: 306477639	2020-04-14 11:20:11 -07:00
Fabricio Voznika	96f9142959	Use O_CLOEXEC when dup'ing FDs The sentry doesn't allow execve, but it's a good defense in-depth measure. PiperOrigin-RevId: 305958737	2020-04-10 15:47:23 -07:00
Adin Scannell	94b793262d	Fix all copy locks violations. This required minor restructuring of how system call tables were saved and restored, but it makes way more sense this way. Updates #2243	2020-04-08 10:00:14 -07:00
Ian Lewis	56054fc1fb	Add friendlier messages for frequently encountered errors. Issue #2270 Issue #1765 PiperOrigin-RevId: 305385436	2020-04-07 18:51:01 -07:00
Aaron Lu	0cfdd47391	checkpoint/restore: make sure the donated stdioFDs have the same value Suppose I start a runsc container using kvm platform like this: $ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash The donating FD and the corresponding cmdline for runsc-sandbox is: D0313 17:50:12.608203 44389 x:0] Donating FD 3: "1.txt" D0313 17:50:12.608214 44389 x:0] Donating FD 4: "control_server_socket" D0313 17:50:12.608224 44389 x:0] Donating FD 5: "|0" D0313 17:50:12.608229 44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json" D0313 17:50:12.608234 44389 x:0] Donating FD 7: "|1" D0313 17:50:12.608238 44389 x:0] Donating FD 8: "sandbox IO FD" D0313 17:50:12.608242 44389 x:0] Donating FD 9: "/dev/kvm" D0313 17:50:12.608246 44389 x:0] Donating FD 10: "/dev/stdin" D0313 17:50:12.608249 44389 x:0] Donating FD 11: "/dev/stdout" D0313 17:50:12.608253 44389 x:0] Donating FD 12: "/dev/stderr" D0313 17:50:12.608257 44389 x:0] Starting sandbox: /proc/self/exe [runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log= --max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt --debug-log-format=text --file-access=exclusive --overlay=false --fsgofer-host-uds=false --network=sandbox --log-packets=false --platform=kvm --strace=false --strace-syscalls= --strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true --num-network-channels=1 --rootless=false --alsologtostderr=false --ref-leak-mode=disabled --gso=true --software-gso=true --overlayfs-stale-read=false --shared-volume= --debug-log-fd=3 --panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true --setup-root --cpu-num 32 --total-memory 4294967296 rootbash] Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12. 
If I restore a container from the checkpoint image which is derived by checkpointing the above rootbash container, but either omit the platform switch or specify to use ptrace platform explicitly: $ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash the donating FD and corresponding cmdline for runsc-sandbox is: D0313 17:50:15.258632 44452 x:0] Donating FD 3: "1.txt" D0313 17:50:15.258640 44452 x:0] Donating FD 4: "control_server_socket" D0313 17:50:15.258645 44452 x:0] Donating FD 5: "|0" D0313 17:50:15.258648 44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json" D0313 17:50:15.258653 44452 x:0] Donating FD 7: "|1" D0313 17:50:15.258657 44452 x:0] Donating FD 8: "sandbox IO FD" D0313 17:50:15.258661 44452 x:0] Donating FD 9: "/dev/stdin" D0313 17:50:15.258675 44452 x:0] Donating FD 10: "/dev/stdout" D0313 17:50:15.258680 44452 x:0] Donating FD 11: "/dev/stderr" D0313 17:50:15.258684 44452 x:0] Starting sandbox: /proc/self/exe [runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log= --max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt --debug-log-format=text --file-access=exclusive --overlay=false --fsgofer-host-uds=false --network=sandbox --log-packets=false --platform=ptrace --strace=false --strace-syscalls= --strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true --num-network-channels=1 --rootless=false --alsologtostderr=false --ref-leak-mode=disabled --gso=true --software-gso=true --overlayfs-stale-read=false --shared-volume= --debug-log-fd=3 --panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9 --stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory 4294967296 restored_rootbash] Note this time, stdioFDs starts from 9 and stderr's FD is 11 (so the saved host.descriptor.origFD which is 12 for stderr is no longer 
valid). For the three host FD based files, the s.Dev and s.Ino derived from fstat(fd) shall all be the same and since the two fields are used as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId which is the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same. Note that for MultiDevice m, m.cache records the mapping of key to value and m.rcache records the mapping of value to key. If the same value doesn't map to the same key, it will panic on restore. Now that stderr's origFD 12 is no longer valid (it happens to be /memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived from fstat(fd=12) in host.inodeFileState.afterLoad() will not be correct either. But since its InodeID is still the same as saved, MultiDevice.Load() will complain about the same value (InodeID) being mapped to different keys (different from stdin and stdout's) and panic with: "MultiDevice's caches are inconsistent". Solve this problem by making sure stdioFDs for root container's init task are always the same on initial start and on restore time, no matter what cmdline user has used: debug log specified or not, platform changed or not etc. shall not affect the ability to restore. Fixes #1844.	2020-03-31 11:37:11 +08:00
Dean Deng	137f361400	Use host-defined file owner and mode, when possible, for imported fds. Using the host-defined file owner matches VFS1. It is more correct to use the host-defined mode, since the cached value may become out of date. However, kernfs.Inode.Mode() does not return an error--other filesystems on kernfs are in-memory so retrieving mode should not fail. Therefore, if the host syscall fails, we rely on a cached value instead. Updates #1672. PiperOrigin-RevId: 303220864	2020-03-26 16:47:20 -07:00
Dean Deng	248e46f320	Whitelist utimensat(2). utimensat is used by hostfs for setting timestamps on imported fds. Previously, this would crash the sandbox since utimensat was not allowed. Correct the VFS2 version of hostfs to match the call in VFS1. PiperOrigin-RevId: 301970121	2020-03-19 23:30:21 -07:00
Dean Deng	5e413cad10	Plumb VFS2 imported fds into virtual filesystem. - When setting up the virtual filesystem, mount a host.filesystem to contain all files that need to be imported. - Make read/preadv syscalls to the host in cases where preadv2 may not be supported yet (likewise for writing). - Make save/restore functions in kernel/kernel.go return early if vfs2 is enabled. PiperOrigin-RevId: 300922353	2020-03-14 07:14:33 -07:00
gVisor bot	6367963c14	Merge pull request #1951 from moricho:moricho/add-profiler-option PiperOrigin-RevId: 299233818	2020-03-05 17:16:54 -08:00
Andrei Vagin	322dbfe06b	Allow specifying a separate log for Go's runtime messages Go's runtime calls the write system call twice to print "panic:" and "the reason of this panic", so there is a race window when other threads can print something to the log and we will see something like this: panic: log messages from another thread The reason of the panic. This confuses the syzkaller blacklist and dedup detection. It also makes the logs generally difficult to read. e.g., data races often have one side of the race, followed by a large "diagnosis" dump, finally followed by the other side of the race. PiperOrigin-RevId: 297887895	2020-02-28 11:24:11 -08:00
moricho	d8ed784311	add profile option	2020-02-26 16:49:51 +09:00
Jamie Liu	471b15b212	Port most syscalls to VFS2. pipe and pipe2 aren't ported, pending a slight rework of pipe FDs for VFS2. mount and umount2 aren't ported out of temporary laziness. access and faccessat need additional FSImpl methods to implement properly, but are stubbed to prevent googletest from CHECK-failing. Other syscalls require additional plumbing. Updates #1623 PiperOrigin-RevId: 297188448	2020-02-25 13:37:34 -08:00
gVisor bot	4a73bae269	Initial network namespace support. TCP/IP will work with netstack networking. hostinet doesn't work, and sockets will have the same behavior as it is now. Before the userspace is able to create device, the default loopback device can be used to test. /proc/net and /sys/net will still be connected to the root network stack; this is the same behavior now. Issue #1833 PiperOrigin-RevId: 296309389	2020-02-20 15:20:40 -08:00
gVisor bot	5baf9dc2fb	Synchronize signalling with S/R This is to fix a data race between sending an external signal to a ThreadGroup and kernel saving state for S/R. PiperOrigin-RevId: 295244281	2020-02-14 15:49:09 -08:00
gVisor bot	4075de11be	Plumb VFS2 inside the Sentry - Added fsbridge package with interface that can be used to open and read from VFS1 and VFS2 files. - Converted ELF loader to use fsbridge - Added VFS2 types to FSContext - Added vfs.MountNamespace to ThreadGroup Updates #1623 PiperOrigin-RevId: 295183950	2020-02-14 11:12:47 -08:00
Adin Scannell	1b6a12a768	Add notes to relevant tests. These were out-of-band notes that can help provide additional context and simplify automated imports. PiperOrigin-RevId: 293525915	2020-02-05 22:46:35 -08:00
Michael Pratt	4d1a648c7c	Allow mlock in system call filters Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g stack to prevent register corruption. We need to allow this syscall until it is removed from Go. PiperOrigin-RevId: 292967478	2020-02-03 11:39:51 -08:00
Fabricio Voznika	437c986c6a	Add vfs.FileDescription to FD table FD table now holds both VFS1 and VFS2 types and uses the correct one based on what's set. Parts of this CL are just initial changes (e.g. sys_read.go, runsc/main.go) to serve as a template for the remaining changes. Updates #1487 Updates #1623 PiperOrigin-RevId: 292023223	2020-01-28 15:31:03 -08:00
Adin Scannell	253c9e666c	Cleanup glog and add real caller information. In general, we've learned that logging must be avoided at all costs in the hot path. It's unlikely that the optimizations here were significant in any case, since buffer would certainly escape. This also adds a test to ensure that the caller identification works as expected, and so that logging can be benchmarked. Original: BenchmarkGoogleLogging-6 1222255 949 ns/op With this change: BenchmarkGoogleLogging-6 517323 2346 ns/op Fixes #184 PiperOrigin-RevId: 291815420	2020-01-27 16:08:35 -08:00
Adin Scannell	0e2f1b7abd	Update package locations. Because the abi will depend on the core types for marshalling (usermem, context, safemem, safecopy), these need to be flattened from the sentry directory. These packages contain no sentry-specific details. PiperOrigin-RevId: 291811289	2020-01-27 15:31:32 -08:00
Adin Scannell	d29e59af9f	Standardize on tools directory. PiperOrigin-RevId: 291745021	2020-01-27 12:21:00 -08:00
Ian Gudger	27500d529f	New sync package. * Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387	2020-01-09 22:02:24 -08:00
Bert Muthalaly	e21c584056	Combine various Create*NIC methods into CreateNICWithOptions. PiperOrigin-RevId: 288779416	2020-01-08 14:50:49 -08:00
Bert Muthalaly	0cc1e74b57	Add NIC.isLoopback() ...enabling us to remove the "CreateNamedLoopbackNIC" variant of CreateNIC and all the plumbing to connect it through to where the value is read in FindRoute. PiperOrigin-RevId: 288713093	2020-01-08 09:30:20 -08:00
Aleksandr Razumov	67f678be27	Leave minimum CPU number as a constant Remove introduced CPUNumMin config and hard-code it as 2.	2019-12-17 20:41:02 +03:00
Aleksandr Razumov	b661434202	Add minimum CPU number and only lower CPUs on --cpu-num-from-quota * Add `--cpu-num-min` flag to control minimum CPUs * Only lower CPU count * Fix comments	2019-12-17 13:27:13 +03:00
Aleksandr Razumov	8782f0e287	Set CPU number to CPU quota When an application is not cgroups-aware, it can spawn excessive threads, since the thread count often defaults to the CPU number. Introduce an opt-in flag that will set CPU number according to CPU quota (if available). Fixes #1391	2019-12-15 21:12:43 +03:00
Bhasker Hariharan	b9aa62b9f9	Enable IPv6 in runsc Fixes #1341 PiperOrigin-RevId: 285108973	2019-12-11 19:14:26 -08:00
Fabricio Voznika	01eadf51ea	Bump up Go 1.13 as minimum requirement PiperOrigin-RevId: 284320186	2019-12-06 23:10:15 -08:00
gVisor bot	e70636d7f1	Merge pull request #1233 from xiaobo55x:compatLog PiperOrigin-RevId: 284305935	2019-12-06 19:41:39 -08:00
Adin Scannell	371e210b83	Add runtime tracing. This adds meaningful annotations to the trace generated by the runtime/trace package. PiperOrigin-RevId: 284290115	2019-12-06 17:00:07 -08:00
Fabricio Voznika	ea7a100202	Make annotations OCI compliant Changed annotation to follow the standard defined here: https://github.com/opencontainers/image-spec/blob/master/annotations.md PiperOrigin-RevId: 284254847	2019-12-06 13:51:38 -08:00
Dean Deng	19b2d997ec	Support IP_TOS and IPV6_TCLASS socket options for hostinet sockets. There are two potential ways of sending a TOS byte with outgoing packets: including a control message in sendmsg, or setting the IP_TOS/IPV6_TCLASS socket options (for IPV4 and IPV6 respectively). This change lets hostinet support the latter. Fixes #1188 PiperOrigin-RevId: 283550925	2019-12-03 08:33:22 -08:00
Haibo Xu	61f2274cb6	Enable runsc compatLog support on arm64. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I3fd5e552f5f03b5144ed52647f75af3b8253b1d6	2019-12-03 03:25:54 +00:00
Dean Deng	684f757a22	Add support for receiving TOS and TCLASS control messages in hostinet. This involves allowing getsockopt/setsockopt for the corresponding socket options, as well as allowing hostinet to process control messages received from the actual recvmsg syscall. PiperOrigin-RevId: 282851425	2019-11-27 16:21:05 -08:00
Fabricio Voznika	97d2c9a94e	Use mount hints to determine FileAccessType PiperOrigin-RevId: 282401165	2019-11-25 11:43:05 -08:00
gVisor bot	0416c247ec	Merge pull request #1176 from xiaobo55x:runsc_boot PiperOrigin-RevId: 282382564	2019-11-25 11:01:22 -08:00
Haibo Xu	05871a1cdc	Enable runsc/boot support on arm64. This patch also includes a minor change to replace syscall.Dup2 with syscall.Dup3 which was missed in a previous commit (ref `a25a976`). Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3	2019-11-13 06:39:11 +00:00
Michael Pratt	b23b36e701	Add NETLINK_KOBJECT_UEVENT socket support NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events. gVisor doesn't have any device events, so our sockets don't need to do anything once created. systemd's device manager needs to be able to create one of these sockets. It also wants to install a BPF filter on the socket. Since we'll never send any messages, the filter would never be invoked, thus we just fake it out. Fixes #1117 Updates #1119 PiperOrigin-RevId: 278405893	2019-11-04 10:07:52 -08:00
Nicolas Lacasse	e70f28664a	Allow the watchdog to detect when the sandbox is stuck during setup. The watchdog currently can find stuck tasks, but has no way to tell if the sandbox is stuck before the application starts executing. This CL adds a startup timeout and action to the watchdog. If Start() is not called before the given timeout (if non-zero), then the watchdog will take the action. PiperOrigin-RevId: 277970577	2019-11-01 11:49:31 -07:00
gVisor bot	0202be1ba5	Merge pull request #1058 from cmingxu:master PiperOrigin-RevId: 277623766	2019-10-31 11:26:45 -07:00
Andrei Vagin	db37483cb6	Store endpoints inside multiPortEndpoint in a sorted order It is required to guarantee the same order of endpoints after save/restore. PiperOrigin-RevId: 277598665	2019-10-30 15:33:41 -07:00
kevin.xu	1f19624fa1	fix typo fix a typo	2019-10-23 15:21:50 +08:00
kevin.xu	3edbdcc191	remove duplicated period remove a duplicated period	2019-10-23 14:56:44 +08:00
Andrei Vagin	8720bd643e	netstack/tcp: software segmentation offload Right now, we send each tcp packet separately, we call one system call per-packet. This patch allows generating multiple tcp packets and sending them by sendmmsg. The arguable part of this CL is how to handle multiple headers. This CL adds the next field to the Prependable buffer. Nginx test results: Server Software: nginx/1.15.9 Server Hostname: 10.138.0.2 Server Port: 8080 Document Path: /10m.txt Document Length: 10485760 bytes w/o gso: Concurrency Level: 5 Time taken for tests: 5.491 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 18.21 [#/sec] (mean) Time per request: 274.525 [ms] (mean) Time per request: 54.905 [ms] (mean, across all concurrent requests) Transfer rate: 186508.03 [Kbytes/sec] received sw-gso: Concurrency Level: 5 Time taken for tests: 3.852 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 25.96 [#/sec] (mean) Time per request: 192.576 [ms] (mean) Time per request: 38.515 [ms] (mean, across all concurrent requests) Transfer rate: 265874.92 [Kbytes/sec] received w/o gso: $ ./tcp_benchmark --client --duration 15 --ideal [SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec software gso: $ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso [SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec PiperOrigin-RevId: 276112677	2019-10-22 11:55:56 -07:00
Kevin Krakauer	12235d533a	AF_PACKET support for netstack (aka epsocket). Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With runsc, you'll need to pass `--net-raw=true` to enable them. Binding isn't supported yet. PiperOrigin-RevId: 275909366	2019-10-21 13:23:18 -07:00
Fabricio Voznika	9fb562234e	Fix problem with open FD when copy up is triggered in overlayfs Linux kernels before 4.19 don't implement a feature that updates open FDs after a file is opened for write (and is copied to the upper layer). Already open FDs will continue to read the old file content until they are reopened. This is especially problematic for gVisor because it caches open files. A flag was added to force readonly files to be reopened when the same file is opened for write. This is only needed if using kernels prior to 4.19. Closes #1006 It's difficult to really test this because we never run tests on older kernels. I'm adding a test in GKE which uses kernels with the overlayfs problem for 1.14 and lower. PiperOrigin-RevId: 275115289	2019-10-16 15:06:24 -07:00
Fabricio Voznika	a357fe427b	Remove stale TODO PiperOrigin-RevId: 273630282	2019-10-08 16:23:41 -07:00
Fabricio Voznika	b9cdbc26bc	Ignore mount options that are not supported in shared mounts Options that do not change mount behavior inside the Sentry are irrelevant and should not be used when looking for possible incompatibilities between master and slave mounts. PiperOrigin-RevId: 273593486	2019-10-08 13:36:16 -07:00
Ian Gudger	7c1587e340	Implement IP_TTL. Also change the default TTL to 64 to match Linux. PiperOrigin-RevId: 273430341	2019-10-07 19:29:51 -07:00
Kevin Krakauer	6a98237949	Rename epsocket to netstack. PiperOrigin-RevId: 273365058	2019-10-07 13:57:59 -07:00
gVisor bot	dd0e5eedae	Merge pull request #765 from trailofbits:uds_support PiperOrigin-RevId: 271235134	2019-09-25 16:44:22 -07:00
Kevin Krakauer	59ccbb1044	Remove centralized registration of protocols. Also removes the need for protocol names. PiperOrigin-RevId: 271186030	2019-09-25 12:57:05 -07:00
Robert Tonic	7810b30983	Refactor command line options and remove the allowed terminology for uds	2019-09-24 18:24:10 -04:00
Nicolas Lacasse	f2ea8e6b24	Always set HOME env var with `runsc exec`. We already do this for `runsc run`, but need to do the same for `runsc exec`. PiperOrigin-RevId: 270793459	2019-09-23 17:06:02 -07:00
Robert Tonic	46beb91912	Fix documentation, clean up seccomp filter installation, rename helpers. Filter installation has been streamlined and functions renamed. Documentation has been fixed to be standards compliant, and missing documentation added. gofmt has also been applied to modified files.	2019-09-19 17:10:50 -04:00
Robert Tonic	ac38a7ead0	Place the host UDS mounting behind --fsgofer-host-uds-allowed. This commit allows the use of the `--fsgofer-host-uds-allowed` flag to enable mounting sockets and add the appropriate seccomp filters.	2019-09-19 12:37:15 -04:00
Fabricio Voznika	010b093258	Bring back to life features lost in recent refactor - Sandbox logs are generated when running tests - Kokoro uploads the sandbox logs - Supports multiple parallel runs - Revive script to install locally built runsc with docker PiperOrigin-RevId: 269337274	2019-09-16 08:17:00 -07:00
Adin Scannell	a8834fc555	Update p9 to support flipcall. PiperOrigin-RevId: 268845090	2019-09-12 23:37:31 -07:00
Ian Gudger	fe1f521077	Remove redundant global tcpip.LinkEndpointID. PiperOrigin-RevId: 267709597	2019-09-06 18:01:14 -07:00
Fabricio Voznika	0f5cdc1e00	Resolve flakes with TestMultiContainerDestroy Some processes are reparented to the root container depending on the kill order and the root container would not reap in time. So some zombie processes were still present when the test checked. Fix it by running the second container inside a PID namespace. PiperOrigin-RevId: 267278591	2019-09-04 18:56:49 -07:00

384 Commits