gvisor

Commit Graph

Author	SHA1	Message	Date
Michael Pratt	5b41ba5d0e	Fix various spelling issues in the documentation Addresses obvious typos, in the documentation only. COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65 PiperOrigin-RevId: 255477779	2019-06-27 14:25:50 -07:00
Michael Pratt	085a907565	Cache directory entries in the overlay Currently, the overlay dirCache is only used for a single logical use of getdents, i.e., it is discarded when the FD is closed or seeked back to the beginning. But the initial work of getting the directory contents can be quite expensive (particularly sorting large directories), so we should keep it as long as possible. This is very similar to the readdirCache in fs/gofer. Since the upper filesystem does not have to allow caching readdir entries, the new CacheReaddir MountSourceOperations method controls this behavior. This caching should be trivially movable to all Inodes if desired, though that adds an additional copy step for non-overlay Inodes. (Overlay Inodes already do the extra copy). PiperOrigin-RevId: 255477592	2019-06-27 14:24:03 -07:00
Andrei Vagin	e276083903	gvisor/ptrace: grab initial thread registers only once PiperOrigin-RevId: 255465635	2019-06-27 13:59:57 -07:00
Fabricio Voznika	42e212f6b7	Preserve permissions when checking lower The code was wrongly assuming that only read access was required from the lower overlay when checking for permissions. This allowed non-writable files to be writable in the overlay. Fixes #316 PiperOrigin-RevId: 255263686	2019-06-26 14:24:44 -07:00
Nicolas Lacasse	857e5c47e9	Follow symlinks when creating a file, and create the target. If we have a symlink whose target does not exist, creating the symlink (either via 'creat' or 'open' with O_CREAT flag) should create the target of the symlink. Previously, gVisor would error with EEXIST in this case PiperOrigin-RevId: 255232944	2019-06-26 11:49:20 -07:00
Michael Pratt	e98ce4a2c6	Add TODO reminder to remove tmpfs caching options Updates #179 PiperOrigin-RevId: 255081565	2019-06-25 17:12:34 -07:00
Jamie Liu	ffee0f36b1	Add //pkg/fdchannel. To accompany flipcall connections in cases where passing FDs is required (as for gofers). PiperOrigin-RevId: 255062277	2019-06-25 15:38:11 -07:00
Andrei Vagin	03ae91c662	gvisor: lockless read access for task credentials Credentials are immutable and even before these changes we could read them without locks, but we needed to take a task lock to get a credential object from a task object. It is possible to avoid this lock if we guarantee that a credential object is not changed after it is set on a task. PiperOrigin-RevId: 254989492	2019-06-25 09:52:49 -07:00
Andrei Vagin	e9ea7230f7	fs: synchronize concurrent writes into files with O_APPEND For files with O_APPEND, a file write operation gets the file size and uses it as the offset to call an inode write operation. This means that all other operations which can change the file size should be blocked until the write operation completes. PiperOrigin-RevId: 254873771	2019-06-24 17:45:02 -07:00
Adin Scannell	7f5d0afe52	Add O_EXITKILL to ptrace options. This prevents a race before PDEATH_SIG can take effect during a sentry crash. Discovered and solution by avagin@. PiperOrigin-RevId: 254871534	2019-06-24 17:30:01 -07:00
Rahat Mahmood	94a6bfab5d	Implement /proc/net/tcp. PiperOrigin-RevId: 254854346	2019-06-24 15:56:36 -07:00
Andrei Vagin	c5486f5122	platform/ptrace: specify PTRACE_O_TRACEEXIT for stub-processes The tracee is stopped early during process exit, when registers are still available, allowing the tracer to see where the exit occurred, whereas the normal exit notification is done after the process is finished exiting. Without this option, dumpAndPanic fails to get registers. PiperOrigin-RevId: 254852917	2019-06-24 15:48:58 -07:00
Nicolas Lacasse	87df9aab24	Use correct statx syscall number for amd64. The previous number was for the arm architecture. Also change the statx tests to force them to run on gVisor, which would have caught this issue. PiperOrigin-RevId: 254846831	2019-06-24 15:19:36 -07:00
Fabricio Voznika	b21b1db700	Allow to change logging options using 'runsc debug' New options are: runsc debug --strace=off|all|function1,function2 runsc debug --log-level=warning|info|debug runsc debug --log-packets=true|false Updates #407 PiperOrigin-RevId: 254843128	2019-06-24 15:03:02 -07:00
chris.zn	f957fb23cf	Return ENOENT when reading /proc/{pid}/task of an exited process Using getdents to read /proc/{pid}/task of an exited process causes an infinite loop. For example: Process A is running. Process B opens /proc/{pid of A}/task. Process A exits. Process B calls getdents on /proc/{pid of A}/task. Process B then loops forever, with getdents returning "." and ".." repeatedly. This patch returns ENOENT when getdents is used to read /proc/{pid}/task of a process that has just exited. Signed-off-by: chris.zn <chris.zn@antfin.com>	2019-06-24 15:49:53 +08:00
Nicolas Lacasse	35719d52c7	Implement statx. We don't have the plumbing for btime yet, so that field is left off. The returned mask indicates that btime is absent. Fixes #343 PiperOrigin-RevId: 254575752	2019-06-22 13:29:26 -07:00
Bhasker Hariharan	c1761378a9	Fix the logic for sending zero window updates. Today we have the logic split in two places between endpoint Read() and the worker goroutine which actually sends a zero window. This change makes it so that when a zero window ACK is sent we set a flag in the endpoint which can be read by the endpoint to decide if it should notify the worker to send a nonZeroWindow update. The worker now does not do the check again but instead sends an ACK and flips the flag right away. Similarly today when SO_RECVBUF is set the SetSockOpt call has logic to decide if a zero window update is required. Rather than do that we move the logic to the worker goroutine and it can check the zeroWindow flag and send an update if required. PiperOrigin-RevId: 254505447	2019-06-21 18:31:31 -07:00
Andrei Vagin	ab6774cebf	gvisor/fs: getdents returns 0 if offset is equal to FileMaxOffset FileMaxOffset is a special case when lseek(d, 0, SEEK_END) has been called. PiperOrigin-RevId: 254498777	2019-06-21 17:25:17 -07:00
Michael Pratt	6f933a934f	Remove O(n) lookup on unlink/rename Currently, the path tracking in the gofer involves an O(n) lookup of child fidRefs. This causes a significant overhead on unlinks in directories with lots of child fidRefs (<4k). In this transition, pathNode moves from sync.Map to normal synchronized maps. There is a small chance of contention in walk, but the lock is held for a very short time (and sync.Map also had a chance of requiring locking). OTOH, sync.Map makes it very difficult to add a fidRef reverse map. PiperOrigin-RevId: 254489952	2019-06-21 16:27:26 -07:00
Brad Burlage	ae4ef32b8c	Deflake TestSimpleReceive failures due to timeouts This test will occasionally fail waiting to read a packet. From repeated runs, I've seen it up to 1.5s for waitForPackets to complete. PiperOrigin-RevId: 254484627	2019-06-21 15:56:12 -07:00
Ayush Ranjan	727375321f	ext4 block group descriptor implementation in disk layout package. PiperOrigin-RevId: 254482180	2019-06-21 15:42:46 -07:00
Jamie Liu	e806466fc5	Add //pkg/flipcall. Flipcall is a (conceptually) simple local-only RPC mechanism. Compared to unet, Flipcall does not support passing FDs (support for which will be provided out of band by another package), requires users to establish connections manually, and requires user management of concurrency since each connected Endpoint pair supports only a single RPC at a time; however, it improves performance by using shared memory for data (reducing memory copies) and using futexes for control signaling (which is much cheaper than sendto/recvfrom/sendmsg/recvmsg). PiperOrigin-RevId: 254471986	2019-06-21 14:47:04 -07:00
Fabricio Voznika	5ba16d51a9	Add list of stuck tasks to panic message PiperOrigin-RevId: 254450309	2019-06-21 12:46:53 -07:00
Michael Pratt	c0317b28cb	Update pathNode documentation to reflect reality Neither fidRefs nor children are (directly) synchronized by mu. Remove the preconditions that say so. That said, the surrounding does enforce some synchronization guarantees (e.g., fidRef.renameChildTo does not atomically replace the child in the maps). I've tried to note the need for callers to do this synchronization. I've also renamed the maps to what are (IMO) clearer names. As is, it is not obvious that pathNode.fidRefs is a map of child fidRefs rather than self fidRefs. PiperOrigin-RevId: 254446965	2019-06-21 12:26:42 -07:00
Andrei Vagin	f94653b3de	kernel: call t.mu.Unlock() explicitly in WithMuLocked defer here doesn't improve readability, and we know it is slower than the explicit call. PiperOrigin-RevId: 254441473	2019-06-21 11:55:42 -07:00
Fabricio Voznika	054b5632ef	Update comment PiperOrigin-RevId: 254428866	2019-06-21 10:56:42 -07:00
Jamie Liu	7db8685100	Preallocate auth.NewAnonymousCredentials() in contexttest.TestContext. Otherwise every call to, say, fs.ContextCanAccessFile() in a benchmark using contexttest allocates new auth.Credentials, a new auth.UserNamespace, ... PiperOrigin-RevId: 254261051	2019-06-20 13:36:14 -07:00
Michael Pratt	292f70cbf7	Add package docs to seqfile and ramfs These are the only packages missing docs: https://godoc.org/gvisor.dev/gvisor PiperOrigin-RevId: 254261022	2019-06-20 13:34:33 -07:00
Rahat Mahmood	ddc1d94a37	Unmark amutex_test as flaky. PiperOrigin-RevId: 254254058	2019-06-20 12:58:04 -07:00
Neel Natu	0b2135072d	Implement madvise(MADV_DONTFORK) PiperOrigin-RevId: 254253777	2019-06-20 12:56:00 -07:00
Ian Gudger	7e49515696	Deflake SendFileTest_Shutdown. The sendfile syscall's backing doSplice contained a race with regard to blocking. If the first attempt failed with syserror.ErrWouldBlock and then the blocking file became ready before registering a waiter, we would just return the ErrWouldBlock (even if we were supposed to block). PiperOrigin-RevId: 254114432	2019-06-19 18:40:54 -07:00
Michael Pratt	9d2efaac5a	Add renamed children pathNodes to target parent Otherwise future renames may miss Renamed calls. PiperOrigin-RevId: 254060946	2019-06-19 13:41:07 -07:00
Nicolas Lacasse	29f9e4fa87	fileOp{On,At} should pass the remaining symlink traversal count. And methods that do more traversals should use the remaining count rather than resetting. PiperOrigin-RevId: 254041720	2019-06-19 11:56:34 -07:00
Nicolas Lacasse	f7428af9c1	Add MountNamespace to task. This allows tasks to have distinct mount namespaces, instead of all sharing the kernel's root mount namespace. Currently, the only way for a task to get a different mount namespace than the kernel's root is by explicitly setting a different MountNamespace in CreateProcessArgs, and nothing does this (yet). In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a new container inside runsc. Note that "MountNamespace" is a poor term for this thing. It's more like a distinct VFS tree. When we get around to adding real mount namespaces, this will need a better name. PiperOrigin-RevId: 254009310	2019-06-19 09:21:21 -07:00
Fabricio Voznika	ca245a428b	Attempt to fix TestPipeWritesAccumulate Test fails because it's reading 4KB instead of the expected 64KB. Changed the test to read pipe buffer size instead of hardcode and added some logging in case the reason for failure was not pipe buffer size. PiperOrigin-RevId: 253916040	2019-06-18 19:16:11 -07:00
Andrei Vagin	8ab0848c70	gvisor/fs: don't update file.offset for sockets, pipes, etc sockets, pipes and other non-seekable file descriptors don't use file.offset, so we don't need to update it. With this change, we will be able to call file operations without locking the file.mu mutex. This is already used for pipes in the splice system call. PiperOrigin-RevId: 253746644	2019-06-18 01:43:29 -07:00
Yong He	0dbdca349c	Skip allocating a tid that is in use When the leader of a process group (session) exits, the process group ID (session ID) is still held by the other processes in the process group, so the process group ID (session ID) cannot be reused. Reusing the process group ID (session ID) as the process group ID of a new process causes session creation to fail, and runsc later crashes when it accesses the process group. The fix skips a tid that is in use by a process group (session) when allocating a new tid. The runsc crash can be reproduced easily with these steps: 1. Build the test program and run it inside the container: int main(int argc, char *argv[]) { pid_t cpid, spid; cpid = fork(); if (cpid == -1) { perror("fork"); exit(EXIT_FAILURE); } if (cpid == 0) { pid_t sid = setsid(); printf("Start New Session %ld\n",sid); printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n", getpid(),getppid(),getpgid(getpid()),getsid(getpid())); spid = fork(); if (spid == 0) { setpgid(getpid(), getpid()); printf("Set GrandSon as New Process Group\n"); printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n", getpid(),getppid(),getpgid(getpid()),getsid(getpid())); while(1) { usleep(1); } } sleep(3); exit(0); } else { exit(0); } return 0; } 2. Build the hello program: int main(int argc, char *argv[]) { printf("Current PID is %ld\n", (long) getpid()); return 0; } 3. Run a script on the host that runs hello inside the container; setting TasksLimit to a lower value speeds up the test. for (( i=0; i<65535; i++ )) do docker exec <container id> /test/hello done 4. When the hello process reuses the process group of the looping process, runsc crashes.
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8] goroutine 612475 [running]: gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0) pkg/sentry/kernel/sessions.go:160 +0x78 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0) pkg/sentry/kernel/task_exit.go:672 +0x2b7 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0) pkg/sentry/kernel/task_exit.go:542 +0xc4 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc) pkg/sentry/kernel/task_run.go:91 +0x194 created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start pkg/sentry/kernel/task_start.go:286 +0xfe	2019-06-14 14:05:41 +08:00
Bhasker Hariharan	3d71c627fa	Add support for TCP receive buffer auto tuning. The implementation is similar to Linux, where we track the number of bytes consumed by the application to grow the receive buffer of a given TCP endpoint. This ensures that the advertised window grows at a reasonable rate to accommodate the sender's rate and prevents large amounts of data being held in stack buffers if the application is not actively reading or not reading fast enough. The original paper that was used to implement the Linux receive buffer auto-tuning is available at https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf NOTE: Linux does not implement DRS as defined in that paper, it's just a good reference to understand the solution space. Updates #230 PiperOrigin-RevId: 253168283	2019-06-13 22:28:01 -07:00
Ian Gudger	3e9b8ecbfe	Plumb context through more layers of filesystem. All functions which allocate objects containing AtomicRefCounts will soon need a context. PiperOrigin-RevId: 253147709	2019-06-13 18:40:38 -07:00
Ian Gudger	0a5ee6f7b2	Fix deadlock in fasync. The deadlock can occur when both ends of a connected Unix socket which has FIOASYNC enabled on at least one end are closed at the same time. One end notifies that it is closing, calling (*waiter.Queue).Notify which takes waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which takes FileAsync.mu. The other end tries to unregister for notifications by calling (*FileAsync).Unregister, which takes FileAsync.mu and calls (*waiter.Queue).EventUnregister which takes waiter.Queue.mu. This is fixed by moving the calls to waiter.Waitable.EventRegister and waiter.Waitable.EventUnregister outside of the protection of any mutex used in (*FileAsync).Callback. The new test is related, but does not cover this particular situation. Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter calling (*FileAsync).Callback could not and did not. This is fixed by making FileAsync.e.Callback immutable before passing it to the waiter for the first time. Fixes #346 PiperOrigin-RevId: 253138340	2019-06-13 17:26:22 -07:00
Rahat Mahmood	05ff1ffaad	Implement getsockopt() SO_DOMAIN, SO_PROTOCOL and SO_TYPE. SO_TYPE was already implemented for everything but netlink sockets. PiperOrigin-RevId: 253138157	2019-06-13 17:24:51 -07:00
Adin Scannell	add40fd6ad	Update canonical repository. This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620	2019-06-13 16:50:15 -07:00
Jamie Liu	0c8603084d	Add p9 and unet benchmarks. PiperOrigin-RevId: 253122166	2019-06-13 15:53:43 -07:00
Adin Scannell	e352f46478	Minor BUILD file cleanup. PiperOrigin-RevId: 252918338	2019-06-12 15:59:46 -07:00
Kevin Krakauer	0bbbcafd68	Merge branch 'master' into iptables-1-pkg Change-Id: I7457a11de4725e1bf3811420c505d225b1cb6943	2019-06-12 15:21:22 -07:00
Bhasker Hariharan	70578806e8	Add support for TCP_CONGESTION socket option. This CL also cleans up the error returned for setting congestion control which was incorrectly returning EINVAL instead of ENOENT. PiperOrigin-RevId: 252889093	2019-06-12 13:35:50 -07:00
Andrei Vagin	0d05a12fd3	gvisor/ptrace: print guest registers if a stub stopped with unexpected code PiperOrigin-RevId: 252855280	2019-06-12 10:48:46 -07:00
Adin Scannell	df110ad4fe	Eat sendfile partial error For sendfile(2), we propagate a TCP error through the system call layer. This should be eaten if there is a partial result. This change also adds a test to ensure that there is no panic in this case, for both TCP sockets and unix domain sockets. PiperOrigin-RevId: 252746192	2019-06-11 19:24:35 -07:00
Fabricio Voznika	fc746efa9a	Add support to mount pod shared tmpfs mounts Parse annotations containing 'gvisor.dev/spec/mount' that gives hints about how mounts are shared between containers inside a pod. This information can be used to better inform how to mount these volumes inside gVisor. For example, a volume that is shared between containers inside a pod can be bind mounted inside the sandbox, instead of being two independent mounts. For now, this information is used to allow the same tmpfs mounts to be shared between containers which wasn't possible before. PiperOrigin-RevId: 252704037	2019-06-11 14:54:31 -07:00
Ian Lewis	74e397e39a	Add introspection for Linux/AMD64 syscalls Adds simple introspection for syscall compatibility information to Linux/AMD64. Syscalls registered in the syscall table now have associated metadata like name, support level, notes, and URLs to relevant issues. Syscall information can be exported as a table, JSON, or CSV using the new 'runsc help syscalls' command. Users can use this info to debug and get info on the compatibility of the version of runsc they are running or to generate documentation. PiperOrigin-RevId: 252558304	2019-06-10 23:38:36 -07:00
Jamie Liu	589f36ac4a	Move //pkg/sentry/platform/procid to //pkg/procid. PiperOrigin-RevId: 252501653	2019-06-10 15:47:25 -07:00
Bhasker Hariharan	3933dd5c04	Fixes to listen backlog handling. Changes netstack to conform to current Linux behaviour where if the backlog is full then we drop the SYN and do not send a SYN-ACK. Similarly we allow up to backlog connections to be in SYN-RCVD state as long as the backlog is not full. We also now drop a SYN if syn cookies are in use and the backlog for the listening endpoint is full. Added new tests to confirm the behaviour. Also reverted the change to increase the backlog in TcpPortReuseMultiThread syscall test. Fixes #236 PiperOrigin-RevId: 252500462	2019-06-10 15:40:44 -07:00
Rahat Mahmood	a00157cc0e	Store more information in the kernel socket table. Store enough information in the kernel socket table to distinguish between different types of sockets. Previously we were only storing the socket family, but this isn't enough to classify sockets. For example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets are SOCK_DGRAM sockets with a particular protocol. Instead of creating more sub-tables, flatten the socket table and provide a filtering mechanism based on the socket entry. Also generate and store a socket entry index ("sl" in linux) which allows us to output entries in a stable order from procfs. PiperOrigin-RevId: 252495895	2019-06-10 15:17:43 -07:00
Kevin Krakauer	06a83df533	Address more comments. Change-Id: I83ae1079f3dcba6b018f59ab7898decab5c211d2	2019-06-10 12:43:54 -07:00
Jamie Liu	48961d27a8	Move //pkg/sentry/memutil to //pkg/memutil. PiperOrigin-RevId: 252124156	2019-06-07 14:52:27 -07:00
Kevin Krakauer	8afbd974da	Address Ian's comments. Change-Id: I7445033b1970cbba3f2ed0682fe520dce02d8fad	2019-06-07 12:54:53 -07:00
Jamie Liu	c933f3eede	Change visibility of //pkg/sentry/time. PiperOrigin-RevId: 251965598	2019-06-06 17:58:55 -07:00
Jamie Liu	9ea248489b	Cap initial usermem.CopyStringIn buffer size. Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this size is measurably inefficient in most cases: most paths will not be this long, 4 KB is a lot of bytes to zero, and as of this writing the Go runtime allocator maps only two 4 KB objects to each 8 KB span, necessitating a call to runtime.mcache.refill() on ~every other call. Limit the initial buffer size to 256 B instead, and geometrically reallocate if necessary. PiperOrigin-RevId: 251960441	2019-06-06 17:22:00 -07:00
Rahat Mahmood	315cf9a523	Use common definition of SockType. SockType isn't specific to unix domain sockets, and the current definition basically mirrors the linux ABI's definition. PiperOrigin-RevId: 251956740	2019-06-06 17:00:27 -07:00
Fabricio Voznika	02ab1f187c	Copy up parent when binding UDS on overlayfs Overlayfs was expecting the parent to exist when bind(2) was called, which may not be the case. The fix is to copy the parent directory to the upper layer before binding the UDS. There is no good place to add tests for it. Syscall tests would be ideal, but it's hard to guarantee that the directory where the socket is created hasn't been touched before (and thus copied the parent to the upper layer). Added it to runsc integration tests for now. If it turns out we have many tests of this kind, we can consider moving them somewhere more appropriate. PiperOrigin-RevId: 251954156	2019-06-06 16:45:51 -07:00
Jamie Liu	b3f104507d	"Implement" mbind(2). We still only advertise a single NUMA node, and ignore mempolicy accordingly, but mbind() at least now succeeds and has effects reflected by get_mempolicy(). Also fix handling of nodemasks: round sizes to unsigned long (as documented and done by Linux), and zero trailing bits when copying them out. PiperOrigin-RevId: 251950859	2019-06-06 16:29:46 -07:00
Jamie Liu	a26043ee53	Implement reclaim-driven MemoryFile eviction. PiperOrigin-RevId: 251950660	2019-06-06 16:27:55 -07:00
Rahat Mahmood	2d2831e354	Track and export socket state. This is necessary for implementing network diagnostic interfaces like /proc/net/{tcp,udp,unix} and sock_diag(7). For pass-through endpoints such as hostinet, we obtain the socket state from the backend. For netstack, we add explicit tracking of TCP states. PiperOrigin-RevId: 251934850	2019-06-06 15:04:47 -07:00
Bhasker Hariharan	85be01b42d	Add multi-fd support to fdbased endpoint. This allows an fdbased endpoint to have multiple underlying fd's from which packets can be read and dispatched/written to. This should allow for higher throughput as well as better scalability of the network stack as number of connections increases. Updates #231 PiperOrigin-RevId: 251852825	2019-06-06 08:07:02 -07:00
Andrei Vagin	79f7cb6c1c	netstack/sniffer: log GSO attributes PiperOrigin-RevId: 251788534	2019-06-05 22:51:53 -07:00
Michael Pratt	57772db2e7	Shutdown host sockets on internal shutdown This is required to make the shutdown visible to peers outside the sandbox. The readClosed / writeClosed fields were dropped, as they were preventing a shutdown socket from reading the remainder of queued bytes. The host syscalls will return the appropriate errors for shutdown. The control message tests have been split out of socket_unix.cc to make the (few) remaining tests accessible to testing inherited host UDS, which don't support sending control messages. Updates #273 PiperOrigin-RevId: 251763060	2019-06-05 18:40:37 -07:00
Andrei Vagin	a12848ffeb	netstack/tcp: fix calculating the number of outstanding packets In case of GSO, a segment can contain more than one packet and we need to use the pCount() helper to get the number of packets. PiperOrigin-RevId: 251743020	2019-06-05 16:30:45 -07:00
Chris Kuiper	d18bb4f38a	Adjust route when looping multicast packets Multicast packets are special in that their destination address does not identify a specific interface. When sending out such a packet the multicast address is the remote address, but for incoming packets it is the local address. Hence, when looping a multicast packet, the route needs to be tweaked to reflect this. PiperOrigin-RevId: 251739298	2019-06-05 16:08:29 -07:00
Michael Pratt	d3ed9baac0	Implement dumpability tracking and checks We don't actually support core dumps, but some applications want to get/set dumpability, which still has an effect in procfs. Lack of support for set-uid binaries or fs creds simplifies things a bit. As-is, processes started via CreateProcess (i.e., init and sentryctl exec) have normal dumpability. I'm a bit torn on whether sentryctl exec tasks should be dumpable, but at least since they have no parent normal UID/GID checks should protect them. PiperOrigin-RevId: 251712714	2019-06-05 14:00:13 -07:00
Bhasker Hariharan	e0fb921205	Fix data race in synRcvdState. When checking the length of the acceptedChan we should hold the endpoint mutex otherwise a syn received while the listening socket is being closed can result in a data race where the cleanupLocked routine sets acceptedChan to nil while a handshake goroutine in progress could try and check it at the same time. PiperOrigin-RevId: 251537697	2019-06-04 16:17:24 -07:00
Yong He	7398f013f0	Drop one dirent reference after it is referenced by a file When a pipe is created, a dirent for the pipe is created with its initial reference count set to 0. Because a dirent is destroyed only when its reference count drops to -1, every dirent already holds an 'initial reference' after creation. To destroy the dirent once all references are released, the 'initial reference' must be dropped as soon as something else holds a reference to the dirent, such as fs.NewFile; otherwise the dirent's reference count stays at 0 forever and the dirent leaks. Besides pipe, timerfd/eventfd/epoll have the same problem. Here is a simple C program that leaks dirents for pipe/timerfd/eventfd/epoll; after running it, pprof the runsc process and you will find many pipe/timerfd/eventfd/epoll dirents that were not freed: int main(int argc, char *argv[]) { int i; int n; int pipefd[2]; if (argc != 3) { printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]); } n = strtol(argv[2], NULL, 10); if (strcmp(argv[1], "epoll") == 0) { for (i = 0; i < n; ++i) close(epoll_create(1)); } else if (strcmp(argv[1], "timerfd") == 0) { for (i = 0; i < n; ++i) close(timerfd_create(CLOCK_REALTIME, 0)); } else if (strcmp(argv[1], "eventfd") == 0) { for (i = 0; i < n; ++i) close(eventfd(0, 0)); } else if (strcmp(argv[1], "pipe") == 0) { for (i = 0; i < n; ++i) if (pipe(pipefd) == 0) { close(pipefd[0]); close(pipefd[1]); } } printf("%s %s test finished\r\n",argv[1],argv[2]); return 0; } Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2 PiperOrigin-RevId: 251531096	2019-06-04 15:40:23 -07:00
Nicolas Lacasse	0c292cdaab	Remove the Dirent field from Pipe. Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe raises difficult questions about the lifecycle of the Pipe and Dirent. Fortunately, we can side-step those questions by removing the Dirent field from Pipe entirely. We only need the Dirent when constructing fs.Files (which are ref-counted), and in GetFile (when a Dirent is passed to us anyways). PiperOrigin-RevId: 251497628	2019-06-04 12:58:56 -07:00
Andrei Vagin	90a116890f	gvisor/sock/unix: pass creds when a message is sent between unconnected sockets and don't report a sender address if it doesn't have one PiperOrigin-RevId: 251371284	2019-06-03 21:48:19 -07:00
Andrei Vagin	00f8663887	gvisor/fs: return a proper error from FileWriter.Write in case of a short-write The io.Writer contract requires that Write writes all available bytes and does not return short writes. This causes errors with io.Copy, since our own Write interface does not have this same contract. PiperOrigin-RevId: 251368730	2019-06-03 21:26:01 -07:00
Bhasker Hariharan	bfe3220992	Delete debug log lines left by mistake. Updates #236 PiperOrigin-RevId: 251337915	2019-06-03 17:00:18 -07:00
Andrei Vagin	8e926e3f74	gvisor: validate a new map region in the mremap syscall Right now, mremap allows remapping a memory region above MaxUserAddress, which means that we can change the stub region. PiperOrigin-RevId: 251266886	2019-06-03 10:59:46 -07:00
Bhasker Hariharan	3577a4f691	Disable certain tests that are flaky under race detector. PiperOrigin-RevId: 250976665	2019-05-31 16:19:49 -07:00
Bhasker Hariharan	033f96cc93	Change segment queue limit to be of fixed size. Netstack sets the unprocessed segment queue size to match the receive buffer size. This is not required as this queue only needs to hold enough for a short duration before the endpoint goroutine can process it. Updates #230 PiperOrigin-RevId: 250976323	2019-05-31 16:17:33 -07:00
Kevin Krakauer	d58eb9ce82	Add basic iptables structures to netstack. Change-Id: Ib589906175a59dae315405a28f2d7f525ff8877f	2019-05-31 16:14:04 -07:00
Nicolas Lacasse	6f73d79c32	Simplify overlayBoundEndpoint. There is no reason to do the recursion manually, since Inode.BoundEndpoint will do it for us. PiperOrigin-RevId: 250794903	2019-05-30 17:20:20 -07:00
Fabricio Voznika	38de91b028	Add build guard to files using go:linkname Function signatures are not validated during compilation. Since they are not exported, they can change at any time. The guard ensures that they are verified at least on every version upgrade. PiperOrigin-RevId: 250733742	2019-05-30 12:09:39 -07:00
Bhasker Hariharan	ae26b2c425	Fixes to TCP listen behavior. Netstack listen loop can get stuck if cookies are in-use and the app is slow to accept incoming connections. Further we continue to complete handshake for a connection even if the backlog is full. This creates a problem when lots of connections come in rapidly and we end up with lots of completed connections just hanging around to be delivered. These fixes change netstack behaviour to mirror what Linux does, as described in the following article http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html Now when cookies are not in-use Netstack will silently drop the ACK to a SYN-ACK and not complete the handshake if the backlog is full. This will result in the connection staying in a half-complete state. Eventually the sender will retransmit the ACK and if backlog has space we will transition to a connected state and deliver the endpoint. Similarly when cookies are in use we do not try and create an endpoint unless there is space in the accept queue to accept the newly created endpoint. If there is no space then we again silently drop the ACK as we can just recreate it when the ACK is retransmitted by the peer. We also now use the backlog to cap the size of the SYN-RCVD queue for a given endpoint. So at any time there can be N connections in the backlog and N in a SYN-RCVD state if the application is not accepting connections. Any new SYNs will be dropped. This CL also fixes another small bug where we mark a new endpoint which has not completed handshake as connected. We should wait till handshake successfully completes before marking it connected. Updates #236 PiperOrigin-RevId: 250717817	2019-05-30 12:08:41 -07:00
Michael Pratt	8d25cd0b40	Update procid for Go 1.13 Upstream Go has no changes here. PiperOrigin-RevId: 250602731	2019-05-30 12:08:10 -07:00
chris.zn	b18df9bed6	Add VmData field to /proc/{pid}/status VmData is the size of private data segments. It has the same meaning as in Linux. Change-Id: Iebf1ae85940a810524a6cde9c2e767d4233ddb2a PiperOrigin-RevId: 250593739	2019-05-30 12:07:40 -07:00
Bhasker Hariharan	035a8fa38e	Add support for collecting execution trace to runsc. Updates #220 PiperOrigin-RevId: 250532302	2019-05-30 12:07:11 -07:00
Andrei Vagin	4b9cb38157	gvisor: socket() returns EPROTONOSUPPORT if protocol is not supported PiperOrigin-RevId: 250426407	2019-05-30 12:06:15 -07:00
Michael Pratt	507a15dce9	Always wait on tracee children After bf959931ddb88c4e4366e96dd22e68fa0db9527c ("wait/ptrace: assume __WALL if the child is traced") (Linux 4.7), tracees are always eligible for waiting, regardless of type. PiperOrigin-RevId: 250399527	2019-05-30 12:05:46 -07:00
Adin Scannell	2165b77774	Remove obsolete bug. The original bug is no longer relevant, and the FIXME here contains lots of obsolete information. PiperOrigin-RevId: 249924036	2019-05-30 12:03:39 -07:00
Adin Scannell	ed5793808e	Remove obsolete TODO. We don't need to model internal interfaces after the system call interfaces (which are objectively worse and simply use a flag to distinguish between two logically different operations). PiperOrigin-RevId: 249916814 Change-Id: I45d02e0ec0be66b782a685b1f305ea027694cab9	2019-05-24 16:18:09 -07:00
Michael Pratt	6cdec6fadf	Wrap comments and reword in common present tense PiperOrigin-RevId: 249888234 Change-Id: Icfef32c3ed34809c34100c07e93e9581c786776e	2019-05-24 13:23:53 -07:00
Tamir Duberstein	e4b395db49	Remove unused wakers These wakers are uselessly allocated and passed around; nothing ever listens for notifications on them. The code here appears to be vestigial, so removing it and allowing a nil waker to be passed seems appropriate. PiperOrigin-RevId: 249879320 Change-Id: Icd209fb77cc0dd4e5c49d7a9f2adc32bf88b4b71	2019-05-24 12:29:14 -07:00
Andrei Vagin	a949133c4b	gvisor: interrupt the sendfile system call if a task has been interrupted sendfile can be called for a big range and it can require a significant amount of time to process it, so we need to handle task interrupts in this system call. PiperOrigin-RevId: 249781023 Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015	2019-05-23 23:21:13 -07:00
Ayush Ranjan	6240abb205	Added boilerplate code for ext4 fs. Initialized BUILD with license Mount is still unimplemented and is not meant to be part of this CL. Rest of the fs interface is implemented. Referenced the Linux kernel appropriately when needed PiperOrigin-RevId: 249741997 Change-Id: Id1e4c7c9e68b3f6946da39896fc6a0c3dcd7f98c	2019-05-23 16:55:42 -07:00
Fabricio Voznika	9006304dfe	Initial support for bind mounts Separate MountSource from Mount. This is needed to allow mounts to be shared by multiple containers within the same pod. PiperOrigin-RevId: 249617810 Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468	2019-05-23 04:16:10 -07:00
Bhasker Hariharan	022bd0fd10	Fix the signature for gopark. gopark's signature was changed from having a string reason to a uint8. See: `4d7cf3fedb` This broke execution tracing of the sentry. Switching to the right signature makes tracing work again. Updates #220 PiperOrigin-RevId: 249565311 Change-Id: If77fd276cecb37d4003c8222f6de510b8031a074	2019-05-22 18:57:15 -07:00
Adin Scannell	79738d3958	Log unhandled faults only at DEBUG level. PiperOrigin-RevId: 249561399 Change-Id: Ic73c68c8538bdca53068f38f82b7260939addac2	2019-05-22 18:18:53 -07:00
Michael Pratt	f65dfec096	Add WCLONE / WALL support to waitid The previous commit adds WNOTHREAD support to waitid, so we may as well complete the upstream change. Linux added WCLONE, WALL, WNOTHREAD support to waitid(2) in 91c4e8ea8f05916df0c8a6f383508ac7c9e10dba ("wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL"). i.e., Linux 4.7. PiperOrigin-RevId: 249560587 Change-Id: Iff177b0848a3f7bae6cb5592e44500c5a942fbeb	2019-05-22 18:11:50 -07:00
Adin Scannell	21915eb58b	Remove obsolete TODO. There is no obvious reason to require that BlockSize and StatFS are MountSource operations. Today they are in INodeOperations, and they can be moved elsewhere in the future as part of a normal refactor process. PiperOrigin-RevId: 249549982 Change-Id: Ib832e02faeaf8253674475df4e385bcc53d780f3	2019-05-22 17:00:36 -07:00
Michael Pratt	711290a7f6	Add support for wait(WNOTHREAD) PiperOrigin-RevId: 249537694 Change-Id: Iaa4bca73a2d8341e03064d59a2eb490afc3f80da	2019-05-22 15:54:23 -07:00
Kevin Krakauer	c1cdf18e7b	UDP and TCP raw socket support. PiperOrigin-RevId: 249511348 Change-Id: I34539092cc85032d9473ff4dd308fc29dc9bfd6b	2019-05-22 13:45:15 -07:00

911 Commits