gvisor

Commit Graph

Author	SHA1	Message	Date
Jamie Liu	8bfb83d0ac	Implement async MemoryFile eviction, and use it in CachingInodeOperations. This feature allows MemoryFile to delay eviction of "optional" allocations, such as unused cached file pages. Note that this incidentally makes CachingInodeOperations writeback asynchronous, in the sense that it doesn't occur until eviction; this is necessary because between when a cached page becomes evictable and when it's evicted, file writes (via CachingInodeOperations.Write) may dirty the page. As currently implemented, this feature won't meaningfully impact steady-state memory usage or caching; the reclaimer goroutine will schedule eviction as soon as it runs out of other work to do. Future CLs increase caching by adding constraints on when eviction is scheduled. PiperOrigin-RevId: 246014822 Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb	2019-04-30 13:56:41 -07:00
Michael Pratt	4d52a55201	Change copyright notice to "The gVisor Authors" Based on the guidelines at https://opensource.google.com/docs/releasing/authors/. 1. $ rg -l "Google LLC" \| xargs sed -i 's/Google LLC.*/The gVisor Authors./' 2. Manual fixup of "Google Inc" references. 3. Add AUTHORS file. Authors may request to be added to this file. 4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS. Fixes #209 PiperOrigin-RevId: 245823212 Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9	2019-04-29 14:26:23 -07:00
Nicolas Lacasse	f4ce43e1f4	Allow and document bug ids in gVisor codebase. PiperOrigin-RevId: 245818639 Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789	2019-04-29 14:04:14 -07:00
Kevin Krakauer	43dff57b87	Make raw sockets a toggleable feature disabled by default. PiperOrigin-RevId: 245511019 Change-Id: Ia9562a301b46458988a6a1f0bbd5f07cbfcb0615	2019-04-26 16:51:46 -07:00
Bhasker Hariharan	99b877fa1d	Revert runsc to use RecvMMsg packet dispatcher. PacketMMap mode has issues due to a kernel bug. This change reverts us to using recvmmsg instead of a shared ring buffer to dispatch inbound packets. This will reduce performance but should be more stable under heavy load till PacketMMap is updated to use TPacketv3. See #210 for details. Perf difference between recvmmsg vs packetmmap. RecvMMsg : iperf3 -c 172.17.0.2 Connecting to host 172.17.0.2, port 5201 [ 4] local 172.17.0.1 port 43478 connected to 172.17.0.2 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 778 MBytes 6.53 Gbits/sec 4349 188 KBytes [ 4] 1.00-2.00 sec 786 MBytes 6.59 Gbits/sec 4395 212 KBytes [ 4] 2.00-3.00 sec 756 MBytes 6.34 Gbits/sec 3655 161 KBytes [ 4] 3.00-4.00 sec 782 MBytes 6.56 Gbits/sec 4419 175 KBytes [ 4] 4.00-5.00 sec 755 MBytes 6.34 Gbits/sec 4317 187 KBytes [ 4] 5.00-6.00 sec 774 MBytes 6.49 Gbits/sec 4002 173 KBytes [ 4] 6.00-7.00 sec 737 MBytes 6.18 Gbits/sec 3904 191 KBytes [ 4] 7.00-8.00 sec 530 MBytes 4.44 Gbits/sec 3318 189 KBytes [ 4] 8.00-9.00 sec 487 MBytes 4.09 Gbits/sec 2627 188 KBytes [ 4] 9.00-10.00 sec 770 MBytes 6.46 Gbits/sec 4221 170 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.00 sec 6.99 GBytes 6.00 Gbits/sec 39207 sender [ 4] 0.00-10.00 sec 6.99 GBytes 6.00 Gbits/sec receiver iperf Done. PacketMMap: bhaskerh@gvisor-bench:~/tensorflow$ iperf3 -c 172.17.0.2 Connecting to host 172.17.0.2, port 5201 [ 4] local 172.17.0.1 port 43496 connected to 172.17.0.2 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 657 MBytes 5.51 Gbits/sec 0 1.01 MBytes [ 4] 1.00-2.00 sec 1021 MBytes 8.56 Gbits/sec 0 1.01 MBytes [ 4] 2.00-3.00 sec 1.21 GBytes 10.4 Gbits/sec 45 1.01 MBytes [ 4] 3.00-4.00 sec 1018 MBytes 8.54 Gbits/sec 15 1.01 MBytes [ 4] 4.00-5.00 sec 1.28 GBytes 11.0 Gbits/sec 45 1.01 MBytes [ 4] 5.00-6.00 sec 1.38 GBytes 11.9 Gbits/sec 0 1.01 MBytes [ 4] 6.00-7.00 sec 1.34 GBytes 11.5 Gbits/sec 45 856 KBytes [ 4] 7.00-8.00 sec 1.23 GBytes 10.5 Gbits/sec 0 901 KBytes [ 4] 8.00-9.00 sec 1010 MBytes 8.48 Gbits/sec 0 923 KBytes [ 4] 9.00-10.00 sec 1.39 GBytes 11.9 Gbits/sec 0 960 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.00 sec 11.4 GBytes 9.83 Gbits/sec 150 sender [ 4] 0.00-10.00 sec 11.4 GBytes 9.83 Gbits/sec receiver Updates #210 PiperOrigin-RevId: 244968438 Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b	2019-04-23 19:07:06 -07:00
Fabricio Voznika	c8cee7108f	Use FD limit and file size limit from host FD limit and file size limit is read from the host, instead of using hard-coded defaults, given that they effect the sandbox process. Also limit the direct cache to use no more than half if the available FDs. PiperOrigin-RevId: 244050323 Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6	2019-04-17 12:57:40 -07:00
Fabricio Voznika	9f8c89fc7f	Return error from fdbased.New RELNOTES: n/a PiperOrigin-RevId: 244031742 Change-Id: Id0cdb73194018fb5979e67b58510ead19b5a2b81	2019-04-17 11:16:35 -07:00
Bhasker Hariharan	eaac2806ff	Add TCP checksum verification. PiperOrigin-RevId: 242704699 Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f	2019-04-09 11:23:47 -07:00
Andrei Vagin	88409e983c	gvisor: Add support for the MS_NOEXEC mount option https://github.com/google/gvisor/issues/145 PiperOrigin-RevId: 242044115 Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878	2019-04-04 17:43:53 -07:00
Kevin Krakauer	f9431fb20f	Remove obsolete TODO. PiperOrigin-RevId: 241637164 Change-Id: I65476a739cf38f1818dc47f6ce60638dec8b77a8	2019-04-02 17:27:05 -07:00
Kevin Krakauer	a40ee4f4b8	Change bug number for duplicate bug. PiperOrigin-RevId: 241567897 Change-Id: I580eac04f52bb15f4aab7df9822c4aa92e743021	2019-04-02 11:28:06 -07:00
Andrei Vagin	a046054ba3	gvisor/runsc: enable generic segmentation offload (GSO) The linux packet socket can handle GSO packets, so we can segment packets to 64K instead of the MTU which is usually 1500. Here are numbers for the nginx-1m test: runsc: 579330.01 [Kbytes/sec] received runsc-gso: 1794121.66 [Kbytes/sec] received runc: 2122139.06 [Kbytes/sec] received and for tcp_benchmark: $ tcp_benchmark --duration 15 --ideal [ 4] 0.0-15.0 sec 86647 MBytes 48456 Mbits/sec $ tcp_benchmark --client --duration 15 --ideal [ 4] 0.0-15.0 sec 2173 MBytes 1214 Mbits/sec $ tcp_benchmark --client --duration 15 --ideal --gso 65536 [ 4] 0.0-15.0 sec 19357 MBytes 10825 Mbits/sec PiperOrigin-RevId: 241072403 Change-Id: I20b03063a1a6649362b43609cbbc9b59be06e6d5	2019-03-29 16:27:38 -07:00
Jamie Liu	8f4634997b	Decouple filemem from platform and move it to pgalloc.MemoryFile. This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4	2019-03-14 08:12:48 -07:00
Nicolas Lacasse	2512cc5617	Allow filesystem.Mount to take an optional interface argument. PiperOrigin-RevId: 238360231 Change-Id: I5eaf8d26f8892f77d71c7fbd6c5225ef471cedf1	2019-03-13 19:24:03 -07:00
Ian Gudger	a16f6e50c5	Make HandleLocal apply to all non-loopback interfaces. HandleLocal is very similar conceptually to MULTICAST_LOOP, so we can unify the implementations. This has the benefit of making HandleLocal apply even when the fdbased link endpoint isn't in use. In addition, move looping logic to route creation so that it doesn't need to be run for each packet. This should improve performance. PiperOrigin-RevId: 238099480 Change-Id: I72839f16f25310471453bc9d3fb8544815b25c23	2019-03-12 14:37:56 -07:00
Fabricio Voznika	bc9b979b94	Add profiling commands to runsc Example: runsc debug --root=<dir> \ --profile-heap=/tmp/heap.prof \ --profile-cpu=/tmp/cpu.prod --profile-delay=30 \ <container ID> PiperOrigin-RevId: 237848456 Change-Id: Icff3f20c1b157a84d0922599eaea327320dad773	2019-03-11 11:47:30 -07:00
Ian Gudger	56a6128295	Implement IP_MULTICAST_LOOP. IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default route are looped back. In order to implement this switch, support for sending and looping back multicast packets on the default route had to be implemented. For now we only support IPv4 multicast. PiperOrigin-RevId: 237534603 Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c	2019-03-08 15:49:17 -08:00
Fabricio Voznika	0b76887147	Priority-inheritance futex implementation It is Implemented without the priority inheritance part given that gVisor defers scheduling decisions to Go runtime and doesn't have control over it. PiperOrigin-RevId: 236989545 Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560	2019-03-05 23:40:18 -08:00
Fabricio Voznika	fcba4e8f04	Add uncaught signal message to the user log This help troubleshoot cases where the container is killed and the app logs don't show the reason. PiperOrigin-RevId: 236982883 Change-Id: I361892856a146cea5b04abaa3aedbf805e123724	2019-03-05 22:20:17 -08:00
Fabricio Voznika	3dbd4a16f8	Add semctl(GETPID) syscall Also added unimplemented notification for semctl(2) commands. PiperOrigin-RevId: 236340672 Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478	2019-03-01 10:57:02 -08:00
Kevin Krakauer	b75aa51504	Rename ping endpoints to icmp endpoints. PiperOrigin-RevId: 235248572 Change-Id: I5b0538b6feb365a98712c2a2d56d856fe80a8a09	2019-02-22 13:34:47 -08:00
Nicolas Lacasse	0a41ea72c1	Don't allow writing or reading to TTY unless process group is in foreground. If a background process tries to read from a TTY, linux sends it a SIGTTIN unless the signal is blocked or ignored, or the process group is an orphan, in which case the syscall returns EIO. See drivers/tty/n_tty.c:n_tty_read()=>job_control(). If a background process tries to write a TTY, set the termios, or set the foreground process group, linux then sends a SIGTTOU. If the signal is ignored or blocked, linux allows the write. If the process group is an orphan, the syscall returns EIO. See drivers/tty/tty_io.c:tty_check_change(). PiperOrigin-RevId: 234044367 Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9	2019-02-14 15:47:31 -08:00
Bhasker Hariharan	e0b3d3323f	Add support for using PACKET_RX_RING to receive packets. PACKET_RX_RING allows the use of an mmapped buffer to receive packets from the kernel. This should cut down the number of host syscalls that need to be made to receive packets when the underlying fd is a socket of the AF_PACKET type. PiperOrigin-RevId: 233834998 Change-Id: I8060025c6ced206986e94cc46b8f382b81bfa47f	2019-02-13 14:53:03 -08:00
Nicolas Lacasse	92e85623a0	Factor the subtargets method into a helper method with tests. PiperOrigin-RevId: 232047515 Change-Id: I00f036816e320356219be7b2f2e6d5fe57583a60	2019-02-01 15:23:43 -08:00
Michael Pratt	2a0c69b19f	Remove license comments Nothing reads them and they can simply get stale. Generated with: $ sed -i "s/licenses($.$)./licenses(\1)/" **/BUILD PiperOrigin-RevId: 231818945 Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0	2019-01-31 11:12:53 -08:00
Andrei Vagin	7e8a56087b	runsc: check whether a container is deleted or not before setupContainerFS PiperOrigin-RevId: 231811387 Change-Id: Ib143fb9a4d0fa1f105d1a3a3bd533dfc44e792af	2019-01-31 10:34:15 -08:00
Andrei Vagin	dd577f5410	runsc: reap a sandbox process only in sandbox.Wait() PiperOrigin-RevId: 231504064 Change-Id: I585b769aef04a3ad7e7936027958910a6eed9c8d	2019-01-29 17:15:56 -08:00
Bhasker Hariharan	24cb2c0a72	Use recvmmsg() instead of readv() to read packets from NIC. This should reduce the number of syscalls required to process packets significantly and improve throughputs. PiperOrigin-RevId: 231366886 Change-Id: I8b38077262bf9c53176bc4a94b530188d3d7c0ca	2019-01-29 01:39:01 -08:00
Fabricio Voznika	c1be25b78d	Scrub runsc error messages Removed "error" and "failed to" prefix that don't add value from messages. Adjusted a few other messages. In particular, when the container fail to start, the message returned is easier for humans to read: $ docker run --rm --runtime=runsc alpine foobar docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory Closes #77 PiperOrigin-RevId: 230022798 Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1	2019-01-18 17:36:02 -08:00
Fabricio Voznika	e4d3ca7263	Prevent internal tmpfs mount to override files in /tmp Runsc wants to mount /tmp using internal tmpfs implementation for performance. However, it risks hiding files that may exist under /tmp in case it's present in the container. Now, it only mounts over /tmp iff: - /tmp was not explicitly asked to be mounted - /tmp is empty If any of this is not true, then /tmp maps to the container's image /tmp. Note: checkpoint doesn't have sentry FS mounted to check if /tmp is empty. It simply looks for explicit mounts right now. PiperOrigin-RevId: 229607856 Change-Id: I10b6dae7ac157ef578efc4dfceb089f3b94cde06	2019-01-16 12:48:32 -08:00
Nicolas Lacasse	dc8450b567	Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations. More helper structs have been added to the fsutil package to make it easier to implement fs.InodeOperations and fs.FileOperations. PiperOrigin-RevId: 229305982 Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b	2019-01-14 20:34:28 -08:00
Michael Pratt	33191e1cc4	Automated rollback of changelist 225089593 PiperOrigin-RevId: 227595007 Change-Id: If14cc5aab869c5fd7a4ebd95929c887ab690e94c	2019-01-02 15:48:00 -08:00
Fabricio Voznika	a891afad6d	Simplify synchronization between runsc and sandbox process Make 'runsc create' join cgroup before creating sandbox process. This removes the need to synchronize platform creation and ensure that sandbox process is charged to the right cgroup from the start. PiperOrigin-RevId: 227166451 Change-Id: Ieb4b18e6ca0daf7b331dc897699ca419bc5ee3a2	2018-12-28 13:48:24 -08:00
Jamie Liu	194ef586fc	Rename limits.MemoryPagesLocked to limits.MemoryLocked. "RLIMIT_MEMLOCK: This is the maximum number of bytes of memory that may be locked into RAM." - getrlimit(2) PiperOrigin-RevId: 226384346 Change-Id: Iefac4a1bb69f7714dc813b5b871226a8344dc800	2018-12-20 13:28:46 -08:00
Googler	86c9bd2547	Automated rollback of changelist 225861605 PiperOrigin-RevId: 226224230 Change-Id: Id24c7d3733722fd41d5fe74ef64e0ce8c68f0b12	2018-12-19 13:30:08 -08:00
Michael Pratt	b62591e6a8	Expose internal testing flag Never to used outside of runsc tests! PiperOrigin-RevId: 225919013 Change-Id: Ib3b14aa2a2564b5246fb3f8933d95e01027ed186	2018-12-17 17:35:06 -08:00
Jamie Liu	2421006426	Implement mlock(), kind of. Currently mlock() and friends do nothing whatsoever. However, mlocking is directly application-visible in a number of ways; for example, madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked regions. We handle this inconsistently: MADV_DONTNEED is too important to not work, but MS_INVALIDATE is rejected. Change MM to track mlocked regions in a manner consistent with Linux. It still will not actually pin pages into host physical memory, but: - mlock() will now cause sentry memory management to precommit mlocked pages. - MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as described above. PiperOrigin-RevId: 225861605 Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee	2018-12-17 11:38:59 -08:00
Michael Pratt	24c1158b9c	Add "trace signal" option This option is effectively equivalent to -panic-signal, except that the sandbox does not die after logging the traceback. PiperOrigin-RevId: 225089593 Change-Id: Ifb1c411210110b6104613f404334bd02175e484e	2018-12-11 16:12:41 -08:00
Brian Geffon	d3bc79bc84	Open source system call tests. PiperOrigin-RevId: 224886231 Change-Id: I0fccb4d994601739d8b16b1d4e6b31f40297fb22	2018-12-10 14:42:34 -08:00
Zhaozhong Ni	9984138abe	sentry: turn "dynamically-created" procfs files into static creation. PiperOrigin-RevId: 224600982 Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34	2018-12-07 17:03:54 -08:00
Brian Geffon	82719be42e	Max link traversals should be for an entire path. The number of symbolic links that are allowed to be followed are for a full path and not just a chain of symbolic links. PiperOrigin-RevId: 224047321 Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40	2018-12-04 14:32:03 -08:00
Fabricio Voznika	eaac94d91c	Use RET_KILL_PROCESS if available in kernel RET_KILL_THREAD doesn't work well for Go because it will kill only the offending thread and leave the process hanging. RET_TRAP can be masked out and it's not guaranteed to kill the process. RET_KILL_PROCESS is available since 4.14. For older kernel, continue to use RET_TRAP as this is the best option (likely to kill process, easy to debug). PiperOrigin-RevId: 222357867 Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd	2018-11-20 22:56:51 -08:00
Fabricio Voznika	fadffa2ff8	Add unsupported syscall events for get/setsockopt PiperOrigin-RevId: 222148953 Change-Id: I21500a9f08939c45314a6414e0824490a973e5aa	2018-11-20 14:04:12 -08:00
Fabricio Voznika	237f9c7a5e	Don't fail when destroyContainerFS is called more than once This can happen when destroy is called multiple times or when destroy failed previously and is being called again. PiperOrigin-RevId: 221882034 Change-Id: I8d069af19cf66c4e2419bdf0d4b789c5def8d19e	2018-11-20 14:03:42 -08:00
Nicolas Lacasse	845836c578	Internal change. PiperOrigin-RevId: 221848471 Change-Id: I882fbe5ce7737048b2e1f668848e9c14ed355665	2018-11-20 14:03:11 -08:00
Nicolas Lacasse	40f843fc78	Internal change. PiperOrigin-RevId: 221343626 Change-Id: I03d57293a555cf4da9952a81803b9f8463173c89	2018-11-13 15:18:17 -08:00
Nicolas Lacasse	6c2d320138	Internal change. PiperOrigin-RevId: 221299066 Change-Id: I8ae352458f9976c329c6946b1efa843a3de0eaa4	2018-11-13 11:14:28 -08:00
Fabricio Voznika	d97ccfa346	Close donated files if containerManager.Start() fails PiperOrigin-RevId: 220869535 Change-Id: I9917e5daf02499f7aab6e2aa4051c54ff4461b9a	2018-11-09 14:54:34 -08:00
Nicolas Lacasse	13b48f2e6a	AsyncBarrier should be run after all defers in destroyContainerFS. destroyContainerFS must wait for all async operations to finish before returning. In an attempt to do this, we call fs.AsyncBarrier() at the end of the function. However, there are many defer'd DecRefs which end up running AFTER the AsyncBarrier() call. This CL fixes this by calling fs.AsyncBarrier() in the first defer statement, thus ensuring that it runs at the end of the function, after all other defers. PiperOrigin-RevId: 220523545 Change-Id: I5e96ee9ea6d86eeab788ff964484c50ef7f64a2f	2018-11-07 13:55:36 -08:00
Fabricio Voznika	c92b9b7086	Add more logging to controller.go PiperOrigin-RevId: 220519632 Change-Id: Iaeec007fc1aa3f0b72569b288826d45f2534c4bf	2018-11-07 13:33:19 -08:00

1 2 3 4

181 Commits