gvisor

Commit Graph

Author	SHA1	Message	Date
Jamie Liu	9ca15dbf14	Avoid unnecessary slice allocation in usermem.BytesIO.blocksFromAddrRanges(). PiperOrigin-RevId: 280507239	2019-11-14 14:04:58 -08:00
Kevin Krakauer	3f7d937090	Use PacketBuffers for outgoing packets. PiperOrigin-RevId: 280455453	2019-11-14 10:15:38 -08:00
Bhasker Hariharan	6dd4c9ee74	Fix flaky behaviour during S/R. PiperOrigin-RevId: 280280156	2019-11-13 14:40:08 -08:00
Nicolas Lacasse	c2d3dc0c13	Use overlay MountSource when binding socket in overlay. PiperOrigin-RevId: 280131840	2019-11-12 23:01:47 -08:00
Haibo Xu	1d8b7292d7	Fix some build errors on arm64. Initialize the VDSO "os" and "arch" fields explicitly, or the VDSO load process would fail on the arm64 platform. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: Ic6768df88e43cd7c7956eb630511672ae11ac52f	2019-11-13 06:46:02 +00:00
Haibo Xu	c5d9b5b881	Enable sentry/fs/host support on arm64. newfstatat() syscall is not supported on arm64, so we resort to using the fstatat() syscall. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: Iea95550ea53bcf85c01f7b3b95da70ad0952177d	2019-11-13 06:46:02 +00:00
Haibo Xu	05871a1cdc	Enable runsc/boot support on arm64. This patch also includes a minor change to replace syscall.Dup2 with syscall.Dup3, which was missed in a previous commit (ref `a25a976`). Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3	2019-11-13 06:39:11 +00:00
Andrei Vagin	ca9cba66d2	seccomp: introduce the GreaterThan rule type PiperOrigin-RevId: 280075805	2019-11-12 15:59:59 -08:00
Ghanan Gowripalan	3f51bef8cd	Do not handle TCP packets that include a non-unicast IP address This change drops TCP packets with a non-unicast IP address as the source or destination address as TCP is meant for communication between two endpoints. Test: Make sure that if the source or destination address contains a non-unicast address, no TCP packet is sent in response and the packet is dropped. PiperOrigin-RevId: 280073731	2019-11-12 15:50:02 -08:00
Ghanan Gowripalan	5398530e45	Discover on-link prefixes from Router Advertisements' Prefix Information options This change allows the netstack to do NDP's Prefix Discovery as outlined by RFC 4861 section 6.3.4. If configured to do so, when a new on-link prefix is discovered, the routing table will be updated with a device route through the nic the RA arrived at. Likewise, when such a prefix gets invalidated, the device route will be removed. Note, this change will not break existing uses of netstack as the default configuration for the stack options is set in such a way that Prefix Discovery will not be performed. See `stack.Options` and `stack.NDPConfigurations` for more details. This change reuses 1 option and introduces a new one that is required to take advantage of Prefix Discovery, all available under NDPConfigurations: - HandleRAs: Whether or not NDP RAs are processed - DiscoverOnLinkPrefixes: Whether or not Prefix Discovery is performed (new) Another note: for a NIC to process Prefix Information options (in Router Advertisements), it must not be a router itself. Currently the netstack does not have per-interface routing configuration; the routing/forwarding configuration is controlled stack-wide. Therefore, if the stack is configured to enable forwarding/routing, no Router Advertisements (and by extension the Prefix Information options) will be processed. Tests: Unittest to make sure that Prefix Discovery and updates to the routing table only occur if explicitly configured to do so. Unittest to make sure at most stack.MaxDiscoveredOnLinkPrefixes discovered on-link prefixes are remembered. PiperOrigin-RevId: 280049278	2019-11-12 14:09:43 -08:00
Ian Gudger	57a2a5ea33	Add tests for SO_REUSEADDR and SO_REUSEPORT. * Basic tests for the SO_REUSEADDR and SO_REUSEPORT options. * SO_REUSEADDR functional tests for TCP and UDP. * SO_REUSEADDR and SO_REUSEPORT interaction tests for UDP. * Stubbed support for UDP getsockopt(SO_REUSEADDR). PiperOrigin-RevId: 280049265	2019-11-12 14:04:14 -08:00
gVisor bot	07f9041187	Merge pull request #918 from lubinszARM:pr_ring0 PiperOrigin-RevId: 279840214	2019-11-11 16:15:12 -08:00
Brad Burlage	e09e7bf72f	Add more extended features. PiperOrigin-RevId: 279820435	2019-11-11 14:42:57 -08:00
gVisor bot	7730716800	Make `connect` on socket returned by `accept` correctly error out with EISCONN PiperOrigin-RevId: 279814493	2019-11-11 14:15:06 -08:00
Kevin Krakauer	af58a4e3bb	Automated rollback of changelist 278417533 PiperOrigin-RevId: 279365629	2019-11-08 12:20:11 -08:00
Bhasker Hariharan	66ebb6575f	Add support for TIME_WAIT timeout. This change adds explicit support for honoring the 2MSL timeout for sockets in TIME_WAIT state. It also adds support for the TCP_LINGER2 option that allows modification of the FIN_WAIT2 state timeout duration for a given socket. It also adds an option to modify the Stack wide TIME_WAIT timeout but this is only for testing. On Linux this is fixed at 60s. Further, we also now correctly process RSTs in CLOSE_WAIT and close the socket similar to Linux without moving it to error state. We also now handle SYN in ESTABLISHED state as per RFC5961#section-4.1. Earlier we would just drop these SYNs, which can cause some tests that pass on Linux to fail on gVisor. Netstack now honors TIME_WAIT correctly as well as handles the following cases correctly. - TCP RSTs in TIME_WAIT are ignored. - A duplicate TCP FIN during TIME_WAIT extends the TIME_WAIT and a dup ACK is sent in response to the FIN as the dup FIN indicates potential loss of the original final ACK. - An out of order segment during TIME_WAIT generates a dup ACK. - A new SYN w/ a sequence number > the highest sequence number in the previous connection closes the TIME_WAIT early and opens a new connection. Further to make the SYN case work correctly the ISN (Initial Sequence Number) generation for Netstack has been updated to be as per RFC. It is not a pure random number anymore and follows the recommendation in https://tools.ietf.org/html/rfc6528#page-3. The current hash used is not a cryptographically secure hash function. A separate change will update the hash function used to Siphash similar to what is used in Linux. PiperOrigin-RevId: 279106406	2019-11-07 09:46:55 -08:00
Ghanan Gowripalan	0c424ea731	Rename nicid to nicID to follow go-readability initialisms https://github.com/golang/go/wiki/CodeReviewComments#initialisms This change does not introduce any new functionality. It just renames variables from `nicid` to `nicID`. PiperOrigin-RevId: 278992966	2019-11-06 19:41:25 -08:00
gVisor bot	adb10f4d53	Internal change. PiperOrigin-RevId: 278979065	2019-11-06 17:56:25 -08:00
Jamie Liu	f8ffadddb3	Add p9.OpenTruncate. This is required to implement O_TRUNC correctly on filesystems backed by gofers. 9P2000.L: "lopen prepares fid for file I/O. flags contains Linux open(2) flags bits, e.g. O_RDONLY, O_RDWR, O_WRONLY." open(2): "The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. ... In addition, zero or more file creation flags and file status flags can be bitwise-or'd in flags." The reference 9P2000.L implementation also appears to expect arbitrary flags, not just access modes, in Tlopen.flags: https://github.com/chaos/diod/blob/master/diod/ops.c#L703 PiperOrigin-RevId: 278972683	2019-11-06 17:11:58 -08:00
Ghanan Gowripalan	e63db5e7bb	Discover default routers from Router Advertisements This change allows the netstack to do NDP's Router Discovery as outlined by RFC 4861 section 6.3.4. Note, this change will not break existing uses of netstack as the default configuration for the stack options is set in such a way that Router Discovery will not be performed. See `stack.Options` and `stack.NDPConfigurations` for more details. This change introduces 2 options required to take advantage of Router Discovery, all available under NDPConfigurations: - HandleRAs: Whether or not NDP RAs are processed - DiscoverDefaultRouters: Whether or not Router Discovery is performed Another note: for a NIC to process Router Advertisements, it must not be a router itself. Currently the netstack does not have per-interface routing configuration; the routing/forwarding configuration is controlled stack-wide. Therefore, if the stack is configured to enable forwarding/routing, no Router Advertisements will be processed. Tests: Unittest to make sure that Router Discovery and updates to the routing table only occur if explicitly configured to do so. Unittest to make sure at most stack.MaxDiscoveredDefaultRouters discovered default routers are remembered. PiperOrigin-RevId: 278965143	2019-11-06 16:29:58 -08:00
Kevin Krakauer	e1b21f3c8c	Use PacketBuffers, rather than VectorisedViews, in netstack. PacketBuffers are analogous to Linux's sk_buff. They hold all information about a packet, headers, and payload. This is important for: * iptables to access various headers of packets * Preventing the clutter of passing different net and link headers along with VectorisedViews to packet handling functions. This change only affects the incoming packet path, and a future change will change the outgoing path. Benchmark Regular PacketBufferPtr PacketBufferConcrete -------------------------------------------------------------------------------- BM_Recvmsg 400.715MB/s 373.676MB/s 396.276MB/s BM_Sendmsg 361.832MB/s 333.003MB/s 335.571MB/s BM_Recvfrom 453.336MB/s 393.321MB/s 381.650MB/s BM_Sendto 378.052MB/s 372.134MB/s 341.342MB/s BM_SendmsgTCP/0/1k 353.711MB/s 316.216MB/s 322.747MB/s BM_SendmsgTCP/0/2k 600.681MB/s 588.776MB/s 565.050MB/s BM_SendmsgTCP/0/4k 995.301MB/s 888.808MB/s 941.888MB/s BM_SendmsgTCP/0/8k 1.517GB/s 1.274GB/s 1.345GB/s BM_SendmsgTCP/0/16k 1.872GB/s 1.586GB/s 1.698GB/s BM_SendmsgTCP/0/32k 1.017GB/s 1.020GB/s 1.133GB/s BM_SendmsgTCP/0/64k 475.626MB/s 584.587MB/s 627.027MB/s BM_SendmsgTCP/0/128k 416.371MB/s 503.434MB/s 409.850MB/s BM_SendmsgTCP/0/256k 323.449MB/s 449.599MB/s 388.852MB/s BM_SendmsgTCP/0/512k 243.992MB/s 267.676MB/s 314.474MB/s BM_SendmsgTCP/0/1M 95.138MB/s 95.874MB/s 95.417MB/s BM_SendmsgTCP/0/2M 96.261MB/s 94.977MB/s 96.005MB/s BM_SendmsgTCP/0/4M 96.512MB/s 95.978MB/s 95.370MB/s BM_SendmsgTCP/0/8M 95.603MB/s 95.541MB/s 94.935MB/s BM_SendmsgTCP/0/16M 94.598MB/s 94.696MB/s 94.521MB/s BM_SendmsgTCP/0/32M 94.006MB/s 94.671MB/s 94.768MB/s BM_SendmsgTCP/0/64M 94.133MB/s 94.333MB/s 94.746MB/s BM_SendmsgTCP/0/128M 93.615MB/s 93.497MB/s 93.573MB/s BM_SendmsgTCP/0/256M 93.241MB/s 95.100MB/s 93.272MB/s BM_SendmsgTCP/1/1k 303.644MB/s 316.074MB/s 308.430MB/s BM_SendmsgTCP/1/2k 537.093MB/s 584.962MB/s 529.020MB/s BM_SendmsgTCP/1/4k 882.362MB/s 939.087MB/s 892.285MB/s BM_SendmsgTCP/1/8k 1.272GB/s 1.394GB/s 1.296GB/s BM_SendmsgTCP/1/16k 1.802GB/s 2.019GB/s 1.830GB/s BM_SendmsgTCP/1/32k 2.084GB/s 2.173GB/s 2.156GB/s BM_SendmsgTCP/1/64k 2.515GB/s 2.463GB/s 2.473GB/s BM_SendmsgTCP/1/128k 2.811GB/s 3.004GB/s 2.946GB/s BM_SendmsgTCP/1/256k 3.008GB/s 3.159GB/s 3.171GB/s BM_SendmsgTCP/1/512k 2.980GB/s 3.150GB/s 3.126GB/s BM_SendmsgTCP/1/1M 2.165GB/s 2.233GB/s 2.163GB/s BM_SendmsgTCP/1/2M 2.370GB/s 2.219GB/s 2.453GB/s BM_SendmsgTCP/1/4M 2.005GB/s 2.091GB/s 2.214GB/s BM_SendmsgTCP/1/8M 2.111GB/s 2.013GB/s 2.109GB/s BM_SendmsgTCP/1/16M 1.902GB/s 1.868GB/s 1.897GB/s BM_SendmsgTCP/1/32M 1.655GB/s 1.665GB/s 1.635GB/s BM_SendmsgTCP/1/64M 1.575GB/s 1.547GB/s 1.575GB/s BM_SendmsgTCP/1/128M 1.524GB/s 1.584GB/s 1.580GB/s BM_SendmsgTCP/1/256M 1.579GB/s 1.607GB/s 1.593GB/s PiperOrigin-RevId: 278940079	2019-11-06 14:25:59 -08:00
Ghanan Gowripalan	d0d89ceedd	Send a TCP RST in response to a TCP SYN-ACK on a listening endpoint This change better follows what is outlined in RFC 793 section 3.4 figure 12 where a listening socket should not accept a SYN-ACK segment in response to a (potentially) old SYN segment. Tests: Test that checks the TCP RST segment sent in response to a TCP SYN-ACK segment received on a listening TCP endpoint. PiperOrigin-RevId: 278893114	2019-11-06 10:44:20 -08:00
Ghanan Gowripalan	a824b48cea	Validate incoming NDP Router Advertisements, as per RFC 4861 section 6.1.2 This change validates incoming NDP Router Advertisements as per RFC 4861 section 6.1.2. It also includes the skeleton to handle Router Advertisements that arrive on some NIC. Tests: Unittest to make sure only valid NDP Router Advertisements are received/not dropped. PiperOrigin-RevId: 278891972	2019-11-06 10:39:29 -08:00
Kevin Krakauer	4fdd69d681	Check that a file is a regular file with open(O_TRUNC). It was possible to panic the sentry by opening a cache revalidating folder with O_TRUNC|O_CREAT. PiperOrigin-RevId: 278417533	2019-11-04 10:58:29 -08:00
Michael Pratt	b23b36e701	Add NETLINK_KOBJECT_UEVENT socket support NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events. gVisor doesn't have any device events, so our sockets don't need to do anything once created. systemd's device manager needs to be able to create one of these sockets. It also wants to install a BPF filter on the socket. Since we'll never send any messages, the filter would never be invoked, thus we just fake it out. Fixes #1117 Updates #1119 PiperOrigin-RevId: 278405893	2019-11-04 10:07:52 -08:00
Michael Pratt	3b4f5445d0	Update membarrier bug Updates #267 PiperOrigin-RevId: 278402684	2019-11-04 09:55:30 -08:00
Michael Pratt	515fee5b6d	Add SO_PASSCRED support to netlink sockets Since we only support sending messages from the kernel, the peer is always the kernel, simplifying handling. There are currently no known users of SO_PASSCRED that would actually receive messages from gVisor, but adding full support is barely more work than stubbing out fake support. Updates #1117 Fixes #1119 PiperOrigin-RevId: 277981465	2019-11-01 12:45:11 -07:00
Nicolas Lacasse	e70f28664a	Allow the watchdog to detect when the sandbox is stuck during setup. The watchdog currently can find stuck tasks, but has no way to tell if the sandbox is stuck before the application starts executing. This CL adds a startup timeout and action to the watchdog. If Start() is not called before the given timeout (if non-zero), then the watchdog will take the action. PiperOrigin-RevId: 277970577	2019-11-01 11:49:31 -07:00
Jamie Liu	5694bd080e	Don't log "p9.channel.service: flipcall connection shutdown". This gets quite spammy, especially in tests. PiperOrigin-RevId: 277970468	2019-11-01 11:45:02 -07:00
Adin Scannell	a99d3479a8	Add context to state. PiperOrigin-RevId: 277840416	2019-10-31 18:03:24 -07:00
Andrei Vagin	f7dbddaf77	platform/kvm: call sigtimedwait with zero timeout sigtimedwait is used to check pending signals and it should not block. PiperOrigin-RevId: 277777269	2019-10-31 12:29:04 -07:00
Kevin Krakauer	3246040447	Deep copy dispatcher views. When VectorisedViews were passed up the stack from packet_dispatchers, we were passing a sub-slice of the dispatcher's views fields. The dispatchers then immediately set those views to nil. This wasn't caught before because every implementer copied the data in these views before returning. PiperOrigin-RevId: 277615351	2019-10-30 17:12:57 -07:00
lubinszARM	ca933329fa	support using KVM_MEM_READONLY for arm64 regions On Arm platform, "setMemoryRegion" has extra permission checks. In virt/kvm/arm/mmu.c: kvm_arch_prepare_memory_region() .... if (writable && !(vma->vm_flags & VM_WRITE)) { ret = -EPERM; break; } .... So, for Arm platform, the "flags" for kvm_memory_region is required. And on x86 platform, the "flags" can be always set as '0'. Signed-off-by: Bin Lu <bin.lu@arm.com> COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/810 from lubinszARM:pr_setregion 8c99b19cfb0c859c6630a1cfff951db65fcf87ac PiperOrigin-RevId: 277602603	2019-10-30 15:53:31 -07:00
Andrei Vagin	db37483cb6	Store endpoints inside multiPortEndpoint in a sorted order It is required to guarantee the same order of endpoints after save/restore. PiperOrigin-RevId: 277598665	2019-10-30 15:33:41 -07:00
Ian Gudger	dc21c5ca16	Add Close and Wait methods to stack. Link endpoints still don't have a unified way to be requested to stop. Updates #837 PiperOrigin-RevId: 277398952	2019-10-29 17:22:32 -07:00
Ian Gudger	a2c51efe36	Add endpoint tracking to the stack. In the future this will replace DanglingEndpoints. DanglingEndpoints must be kept for now due to issues with save/restore. This is arguably a cleaner design and allows the stack to know which transport endpoints might still be using its link endpoints. Updates #837 PiperOrigin-RevId: 277386633	2019-10-29 16:14:51 -07:00
Dean Deng	d7f5e823e2	Fix grammar in comment. Missing "for". PiperOrigin-RevId: 277358513	2019-10-29 14:05:04 -07:00
Dean Deng	38330e9377	Update symlink traversal limit when resolving interpreter path. When execveat is called on an interpreter script, the symlink count for resolving the script path should be separate from the count for resolving the corresponding interpreter. An ELOOP error should not occur if we do not hit the symlink limit along any individual path, even if the total number of symlinks encountered exceeds the limit. Closes #574 PiperOrigin-RevId: 277358474	2019-10-29 13:59:28 -07:00
Michael Pratt	c0b8fd4b6a	Update build tags to allow Go 1.14 Currently there are no ABI changes. We should check again closer to release. PiperOrigin-RevId: 277349744	2019-10-29 13:18:16 -07:00
Dean Deng	2e00771d5a	Refactor logic for loadExecutable. Separate the handling of filenames and *fs.File objects in a more explicit way for the sake of clarity. PiperOrigin-RevId: 277344203	2019-10-29 12:51:29 -07:00
Ian Gudger	7d80e85835	Allow waiting for Endpoint worker goroutines to finish. Updates #837 PiperOrigin-RevId: 277325162	2019-10-29 11:32:48 -07:00
gVisor bot	8b04e2dd8b	Merge pull request #1087 from xiaobo55x:fstat_Nlink PiperOrigin-RevId: 277324979	2019-10-29 11:27:57 -07:00
Ghanan Gowripalan	41e2df1bde	Support iterating an NDP options buffer. This change helps support iterating over an NDP options buffer so that implementations can handle all the NDP options present in an NDP packet. Note, this change does not yet actually handle these options, it just provides the tools to do so (in preparation for NDP's Prefix, Parameter, and a complete implementation of Neighbor Discovery). Tests: Unittests to make sure we can iterate over a valid NDP options buffer that may contain multiple options. Also tests to check an iterator before using it to see if the NDP options buffer is malformed. PiperOrigin-RevId: 277312487	2019-10-29 10:30:21 -07:00
Dean Deng	29273b0384	Disallow execveat on interpreter scripts with fd opened with O_CLOEXEC. When an interpreter script is opened with O_CLOEXEC and the resulting fd is passed into execveat, an ENOENT error should occur (the script would otherwise be inaccessible to the interpreter). This matches the actual behavior of Linux's execveat. PiperOrigin-RevId: 277306680	2019-10-29 10:04:39 -07:00
Ghanan Gowripalan	0864549ecc	Use the user supplied TCP MSS when creating a new active socket This change supports using a user supplied TCP MSS for new active TCP connections. Note, the user supplied MSS must be less than or equal to the maximum possible MSS for a TCP connection's route. If it is greater than the maximum possible MSS, the maximum possible MSS will be used as the connection's MSS instead. This change does not use this user supplied MSS for connections accepted from listening sockets - that will come in a later change. Test: Test that outgoing TCP SYN segments contain a TCP MSS option with the user supplied MSS if it is not greater than the maximum possible MSS for the route. PiperOrigin-RevId: 277185125	2019-10-28 18:20:36 -07:00
Michael Pratt	198f1cddb8	Update comment FDTable.GetFile doesn't exist. PiperOrigin-RevId: 277089842	2019-10-28 10:20:23 -07:00
Haibo Xu	dec831b493	Cast the Stat_t.Nlink to uint64 on arm64. Since syscall.Stat_t.Nlink is defined as different types on amd64 and arm64 (uint64 and uint32 respectively), we need to cast them to a unified uint64 type in gVisor code. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I7542b99b195c708f3fc49b1cbe6adebdd2f6e96b	2019-10-28 05:56:03 +00:00
Dean Deng	1c480abc39	Aggregate arguments for loading executables into a single struct. This change simplifies the function signatures of functions related to loading executables, such as LoadTaskImage, Load, loadBinary. PiperOrigin-RevId: 276821187	2019-10-25 22:44:19 -07:00
Ghanan Gowripalan	5a421058a0	Validate the checksum for incoming ICMPv6 packets This change validates the ICMPv6 checksum field before further processing an ICMPv6 packet. Tests: Unittests to make sure that only ICMPv6 packets with a valid checksum are accepted/processed. Existing tests using checker.ICMPv6 now also check the ICMPv6 checksum field. PiperOrigin-RevId: 276779148	2019-10-25 16:06:55 -07:00
Ian Gudger	8f029b3f82	Convert DelayOption to the newer/faster SockOpt int type. DelayOption is set on all new endpoints in gVisor. PiperOrigin-RevId: 276746791	2019-10-25 13:15:34 -07:00
Andrei Vagin	fd598912be	platform/ptrace: use tgkill instead of kill The syscall filters don't allow kill, just tgkill. PiperOrigin-RevId: 276718421	2019-10-25 11:19:20 -07:00
gVisor bot	9a726745ee	Merge pull request #1070 from lubinszARM:pr_abi PiperOrigin-RevId: 276609608	2019-10-25 10:59:42 -07:00
Ghanan Gowripalan	27e896f290	Add a type to represent the NDP Prefix Information option. This change is in preparation for NDP Prefix Discovery and SLAAC where the stack will need to handle NDP Prefix Information options. Tests: Test that given an NDP Prefix Information option buffer, correct values are returned by the field getters. PiperOrigin-RevId: 276594592	2019-10-24 16:53:08 -07:00
Ghanan Gowripalan	e50a1f5739	Remove the amss field from tcpip.tcp.handshake as it was unused The amss field in the tcpip.tcp.handshake was not used anywhere. Removed it to not cause confusion with the amss field in the tcpip.tcp.endpoint struct, which was documented to be used (and is actually being used) for the same purpose. PiperOrigin-RevId: 276577088	2019-10-24 15:23:43 -07:00
Ghanan Gowripalan	f034790ad8	Use interface-specific NDP configurations instead of the stack-wide default. This change makes it so that NDP work is done using the per-interface NDP configurations instead of the stack-wide default NDP configurations to correctly implement RFC 4861 section 6.3.2 (note here, a host is a single NIC operating as a host device), and RFC 4862 section 5.1. Test: Test that we can set NDP configurations on a per-interface basis without affecting the configurations of other interfaces or the stack-wide default. Also make sure that after the configurations are updated, the updated configurations are used for NDP processes (e.g. Duplicate Address Detection). PiperOrigin-RevId: 276525661	2019-10-24 11:09:18 -07:00
Bin Lu	7f9c391cf1	slight changes to pkg/abi In glibc, some structures are defined differently on different platforms. Such as: C.struct_stat Signed-off-by: Bin Lu <bin.lu@arm.com>	2019-10-24 09:15:29 +00:00
Dean Deng	d9fd536340	Handle AT_SYMLINK_NOFOLLOW flag for execveat. PiperOrigin-RevId: 276441249	2019-10-24 01:45:25 -07:00
Dean Deng	7ca50236c4	Handle AT_EMPTY_PATH flag in execveat. PiperOrigin-RevId: 276419967	2019-10-23 22:23:05 -07:00
gVisor bot	6d4d9564e3	Merge pull request #641 from tanjianfeng:master PiperOrigin-RevId: 276380008	2019-10-23 16:55:15 -07:00
DarcySail	fbe6b50d56	Keep minimal available fd to accelerate fd allocation Use fd.next to store the iteration start position, which can be used to accelerate allocating new FDs. And adding the corresponding gtest benchmark to measure performance. @tanjianfeng COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/758 from DarcySail:master 96685ec7886dfe1a64988406831d3bc002b438cc PiperOrigin-RevId: 276351250	2019-10-23 14:27:53 -07:00
Ghanan Gowripalan	de3dbf8a09	Inform netstack integrator when Duplicate Address Detection completes This change introduces a new interface, stack.NDPDispatcher. It can be implemented by the netstack integrator to receive NDP related events. As of this change, only DAD related events are supported. Tests: Existing tests were modified to use the NDPDispatcher's DAD events for DAD tests where it needed to wait for DAD completing (failing and resolving). PiperOrigin-RevId: 276338733	2019-10-23 13:26:35 -07:00
Bin Lu	345f140169	Optimize kvm/physical_map.go on Arm platform Signed-off-by: Bin Lu <bin.lu@arm.com>	2019-10-23 03:32:50 +00:00
Ian Lewis	ebe8001724	Update const names to be Go style. PiperOrigin-RevId: 276165962	2019-10-22 16:16:41 -07:00
Andrei Vagin	e63ff6d923	platform/ptrace: exit without panic if a stub process has been killed by SIGKILL SIGKILL can be sent only by a user or the OOM-killer. In both cases, we don't need to panic. PiperOrigin-RevId: 276150120	2019-10-22 14:57:23 -07:00
Ghanan Gowripalan	515e0558d4	Add a type to represent the NDP Router Advertisement message. This change is in preparation for NDP Router Discovery where the stack will need to handle NDP Router Advertisements. Tests: Test that given an NDP Router Advertisement buffer (body of an ICMPv6 packet), correct values are returned by the field getters. PiperOrigin-RevId: 276146817	2019-10-22 14:41:51 -07:00
Ghanan Gowripalan	c356fe2ebb	Respect new PrimaryEndpointBehavior when an address gets promoted to permanent This change makes sure that when an address which is already known by a NIC and has kind = permanentExpired gets promoted to permanent, the new PrimaryEndpointBehavior is respected. PiperOrigin-RevId: 276136317	2019-10-22 13:54:33 -07:00
Andrei Vagin	8720bd643e	netstack/tcp: software segmentation offload Right now, we send each tcp packet separately, we call one system call per-packet. This patch allows generating multiple tcp packets and sending them by sendmmsg. The arguable part of this CL is how to handle multiple headers. This CL adds the next field to the Prependable buffer. Nginx test results: Server Software: nginx/1.15.9 Server Hostname: 10.138.0.2 Server Port: 8080 Document Path: /10m.txt Document Length: 10485760 bytes w/o gso: Concurrency Level: 5 Time taken for tests: 5.491 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 18.21 [#/sec] (mean) Time per request: 274.525 [ms] (mean) Time per request: 54.905 [ms] (mean, across all concurrent requests) Transfer rate: 186508.03 [Kbytes/sec] received sw-gso: Concurrency Level: 5 Time taken for tests: 3.852 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 25.96 [#/sec] (mean) Time per request: 192.576 [ms] (mean) Time per request: 38.515 [ms] (mean, across all concurrent requests) Transfer rate: 265874.92 [Kbytes/sec] received w/o gso: $ ./tcp_benchmark --client --duration 15 --ideal [SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec software gso: $ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso [SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec PiperOrigin-RevId: 276112677	2019-10-22 11:55:56 -07:00
Ghanan Gowripalan	fb69de696b	Auto-generate an IPv6 link-local address based on the NIC's MAC Address. This change adds support for optionally auto-generating an IPv6 link-local address based on the NIC's MAC Address on NIC enable. Note, this change will not break existing uses of netstack as the default configuration for the stack options is set in such a way that a link-local address will not be auto-generated unless the stack is explicitly configured. See `stack.Options` for more details. Specifically, see `stack.Options.AutoGenIPv6LinkLocal`. Tests: Tests to make sure that the IPv6 link-local address is only auto-generated if the stack is specifically configured to do so. Also tests to make sure that an auto-generated address goes through the DAD process. PiperOrigin-RevId: 276059813	2019-10-22 07:26:54 -07:00
Bin Lu	2cee066929	enable ring0 to support arm64 This patch enabled the basic framework for arm64 guest. Several jobs were finished in this patch: 1, ring0.Vectors() 2, switchToUser() 3, basic framework for Arm64 guest. Signed-off-by: Bin Lu <bin.lu@arm.com>	2019-10-22 08:33:39 +00:00
Nicolas Lacasse	070a8c2d4c	Remove old TODO. PiperOrigin-RevId: 275956240	2019-10-21 17:04:32 -07:00
Dean Deng	0b569b7cae	Add basic implementation of execveat syscall and associated tests. Allow file descriptors of directories as well as AT_FDCWD. PiperOrigin-RevId: 275929668	2019-10-21 14:55:18 -07:00
Kevin Krakauer	12235d533a	AF_PACKET support for netstack (aka epsocket). Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With runsc, you'll need to pass `--net-raw=true` to enable them. Binding isn't supported yet. PiperOrigin-RevId: 275909366	2019-10-21 13:23:18 -07:00
Kevin Krakauer	652f7b1d0f	Add support for pipes in VFS2. PiperOrigin-RevId: 275650307	2019-10-19 11:49:38 -07:00
Tamir Duberstein	51538c973e	Store primary endpoints in a slice There's no need for a linked list here. PiperOrigin-RevId: 275565920	2019-10-18 16:14:09 -07:00
Mithun Iyer	487d3b2358	Fix typo while initializing protocol for UDP endpoints. Fixes #763 PiperOrigin-RevId: 275563222	2019-10-18 16:00:11 -07:00
Michael Pratt	49b596b98d	Cleanup host UDS support This change fixes several issues with the fsgofer host UDS support. Notably, it adds support for SOCK_SEQPACKET and SOCK_DGRAM sockets [1]. It also fixes unsafe use of unet.Socket, which could cause a panic if Socket.FD is called when err != nil, and calls to Socket.FD with nothing to prevent the garbage collector from destroying and closing the socket. A set of tests is added to exercise host UDS access. This required extracting most of the syscall test runner into a library that can be used by custom tests. Updates #235 Updates #1003 [1] N.B. SOCK_DGRAM sockets are likely not particularly useful, as a server can only reply to a client that binds first. We don't allow bind, so these are unlikely to be used. PiperOrigin-RevId: 275558502	2019-10-18 15:33:03 -07:00
Tamir Duberstein	4e6f3a0c71	Remove restrictions on the sending address It is quite legal to send from the ANY address (it is required for DHCP). I can't figure out why the broadcast address was included here, so removing that as well. PiperOrigin-RevId: 275541954	2019-10-18 14:10:30 -07:00
Kevin Krakauer	dfdbdf14fa	Refactor pipe to support VFS2. * Pulls common functionality (IO and locking on open) into pipe_util.go. * Adds pipe/vfs.go, which implements a subset of vfs.FileDescriptionImpl. A subsequent change will add support for pipes in memfs. PiperOrigin-RevId: 275322385	2019-10-17 13:11:07 -07:00
Ghanan Gowripalan	962aa235de	NDP Neighbor Solicitations sent during DAD must have an IP hop limit of 255 NDP Neighbor Solicitations sent during Duplicate Address Detection must have an IP hop limit of 255, as all NDP Neighbor Solicitations should have. Test: Test that DAD messages have the IPv6 hop limit field set to 255. PiperOrigin-RevId: 275321680	2019-10-17 13:06:15 -07:00
Ghanan Gowripalan	06ed9e329d	Do Duplicate Address Detection on permanent IPv6 addresses. This change adds support for Duplicate Address Detection on IPv6 addresses as defined by RFC 4862 section 5.4. Note, this change will not break existing uses of netstack as the default configuration for the stack options is set in such a way that DAD will not be performed. See `stack.Options` and `stack.NDPConfigurations` for more details. Tests: Tests to make sure that the DAD process properly resolves or fails. That is, tests make sure that DAD resolves only if: - No other node is performing DAD for the same address - No other node owns the same address PiperOrigin-RevId: 275189471	2019-10-16 22:54:45 -07:00
Kevin Krakauer	2a82d5ad68	Reorder BUILD license and load functions in gvisor. PiperOrigin-RevId: 275139066	2019-10-16 16:40:30 -07:00
Michael Pratt	8fe48dcb1e	Add sublevel to kernel version Standard Linux kernel versions are VERSION.PATCHLEVEL.SUBLEVEL. e.g., 4.4.0, even when the sublevel is 0. Match this standard. PiperOrigin-RevId: 275125715	2019-10-16 15:22:42 -07:00
Fabricio Voznika	9fb562234e	Fix problem with open FD when copy up is triggered in overlayfs Linux kernel before 4.19 doesn't implement a feature that updates open FD after a file is open for write (and is copied to the upper layer). Already open FD will continue to read the old file content until they are reopened. This is especially problematic for gVisor because it caches open files. Flag was added to force readonly files to be reopened when the same file is open for write. This is only needed if using kernels prior to 4.19. Closes #1006 It's difficult to really test this because we never run tests on older kernels. I'm adding a test in GKE which uses kernels with the overlayfs problem for 1.14 and lower. PiperOrigin-RevId: 275115289	2019-10-16 15:06:24 -07:00
Nicolas Lacasse	fd4e436002	Support O_SYNC and O_DSYNC flags. When any of these flags are set, all writes will trigger a subsequent fsync call. This behavior already existed for "write-through" mounts. O_DIRECT is treated as an alias for O_SYNC. Better support coming soon. PiperOrigin-RevId: 275114392	2019-10-16 15:01:23 -07:00
Michael Pratt	bbdcf44ebb	Fix syscall changes lost in rebase These syscalls were changed in the amd64 file around the time the arm64 PR was sent out, so their changes got lost. Updates #63 PiperOrigin-RevId: 275114194	2019-10-16 14:56:29 -07:00
gVisor bot	d22f0534c0	Merge pull request #736 from tanjianfeng:fix-unix PiperOrigin-RevId: 275114157	2019-10-16 14:41:43 -07:00
Jamie Liu	0457a4c4cb	Minor vfs.FileDescriptionImpl fixes. - Pass context.Context to OnClose(). - Pass memmap.MMapOpts to ConfigureMMap() by pointer so that implementations can actually mutate it as required. PiperOrigin-RevId: 274934967	2019-10-15 18:40:45 -07:00
Bhasker Hariharan	f98c3ee32c	Remove panic when reassembly fails. Reassembly can fail due to an invalid sequence of fragments being received. eg. Multiple fragments with same id which claim to be the last one by setting the more flag to 0 etc. It's safer to just drop the reassembler and increment a metric than to panic when reassembly fails. PiperOrigin-RevId: 274920901	2019-10-15 17:04:44 -07:00
Tamir Duberstein	db1ca5c786	Set NDP hop limit in accordance with RFC 4861 ...and do not populate link address cache at dispatch. This partially reverts `313c767b00`, which caused malformed packets (e.g. NDP Neighbor Adverts with incorrect hop limit values) to populate the address cache. In particular, this masked a bug that was introduced to the Neighbor Advert generation code in `7c1587e340`. PiperOrigin-RevId: 274865182	2019-10-15 12:43:25 -07:00
Jianfeng Tan	d277bfba27	epsocket: support /proc/net/snmp Netstack has its own stats, we use this to fill /proc/net/snmp. Note that some metrics are not recorded in Netstack, which will be shown as 0 in the proc file. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: Ie0089184507d16f49bc0057b4b0482094417ebe1	2019-10-15 16:38:41 +00:00
Jianfeng Tan	aee2c93366	netstack: add counters for tcp CurrEstab and EstabResets Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>	2019-10-15 16:38:40 +00:00
Jianfeng Tan	dd7d1f825d	hostinet: support /proc/net/snmp and /proc/net/dev For hostinet, we inherit the data from host procfs. To do that, we cache the fds for these files for later reads. Fixes #506 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: I2f81215477455b9c59acf67e33f5b9af28ee0165	2019-10-15 16:38:40 +00:00
Jianfeng Tan	b94505ecc0	support /proc/net/route This proc file reports routing information to applications inside the container. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: I498e47f8c4c185419befbb42d849d0b099ec71f3	2019-10-15 16:38:40 +00:00
Jianfeng Tan	e3d4a67739	support /proc/net/snmp This proc file contains statistics according to [1]. [1] https://tools.ietf.org/html/rfc2013 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: I9662132085edd8a7783d356ce4237d7ac0800d94	2019-10-15 16:38:40 +00:00
gVisor bot	bfa0bb24dd	Internal change. PiperOrigin-RevId: 274700093	2019-10-14 17:46:52 -07:00
Kevin Krakauer	2302afb53d	Reorder BUILD license and load functions in netstack. PiperOrigin-RevId: 274672346	2019-10-14 15:21:59 -07:00
Bhasker Hariharan	a296425970	Use a different fanoutID for each new fdbased endpoint. PiperOrigin-RevId: 274638272	2019-10-14 13:10:16 -07:00
Ian Lewis	470997ca99	Allow for zero byte iovec with MSG_PEEK \| MSG_TRUNC in recvmsg. This allows for peeking at the length of the next message on a netlink socket without pulling it off the socket's buffer/queue, allowing tools like 'ip' to work. This CL also fixes an issue where dump_done_errno was not included in the NLMSG_DONE messages payload. Issue #769 PiperOrigin-RevId: 274068637	2019-10-10 16:55:48 -07:00
Bhasker Hariharan	c7e901f47a	Fix bugs in fragment handling. Strengthen the header.IPv4.IsValid check to correctly check for IHL/TotalLength fields. Also add a check to make sure fragmentOffsets + size of the fragment do not cause a wrap around for the end of the fragment. PiperOrigin-RevId: 274049313	2019-10-10 15:14:55 -07:00
Adin Scannell	f8b1859319	Fix signalfd polling. The signalfd descriptors otherwise always show as available. This can lead programs to spin, assuming they are looking to see what signals are pending. Updates #139 PiperOrigin-RevId: 274017890	2019-10-10 12:51:22 -07:00
gVisor bot	14952d01fb	Merge pull request #909 from xiaobo55x:atomic_bitsops PiperOrigin-RevId: 274011064	2019-10-10 12:46:46 -07:00
gVisor bot	bf870c1a42	Internal change. PiperOrigin-RevId: 273861936	2019-10-09 17:56:05 -07:00
gVisor bot	7a2d5b2fa7	Merge pull request #811 from lubinszARM:pr_testutil PiperOrigin-RevId: 273781641	2019-10-09 12:00:53 -07:00
gVisor bot	559aba7670	Merge pull request #813 from xiaobo55x:pkg_sleep PiperOrigin-RevId: 273668431	2019-10-09 11:11:28 -07:00
Haibo Xu	ebbf2b7fbd	Enable pkg/atomicbitops support on arm64. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I1646aaa6f07b5ec31c39c318b70f48693fe59a7c	2019-10-09 03:09:52 +00:00
Ian Gudger	7c1587e340	Implement IP_TTL. Also change the default TTL to 64 to match Linux. PiperOrigin-RevId: 273430341	2019-10-07 19:29:51 -07:00
Kevin Krakauer	1de0cf3563	Remove unnecessary context parameter for new pipes. PiperOrigin-RevId: 273421634	2019-10-07 18:16:14 -07:00
Kevin Krakauer	6a98237949	Rename epsocket to netstack. PiperOrigin-RevId: 273365058	2019-10-07 13:57:59 -07:00
gVisor bot	8fce24d33a	Merge pull request #753 from lubinszARM:pr_syscall_linux PiperOrigin-RevId: 273364848	2019-10-07 13:52:19 -07:00
Nicolas Lacasse	f24c3188b5	Add sanity check that overlayCreate is called with an overlay parent inode. PiperOrigin-RevId: 272987037	2019-10-04 17:03:50 -07:00
Jamie Liu	b941e35761	Return EIO from p9 if flipcall.Endpoint.Connect() fails. Also ensure that all flipcall transport errors not returned by p9 (converted to EIO by the client, or dropped on the floor by channel server goroutines) are logged. PiperOrigin-RevId: 272963663	2019-10-04 14:56:53 -07:00
Kevin Krakauer	7ef1c44a7f	Change linux.FileMode from uint to uint16, and update VFS to use FileMode. In Linux (include/linux/types.h), mode_t is an unsigned short. PiperOrigin-RevId: 272956350	2019-10-04 14:20:32 -07:00
Chris Kuiper	4874525161	Implement proper local broadcast behavior The behavior for sending and receiving local broadcast (255.255.255.255) traffic is as follows: Outgoing -------- * A broadcast packet sent on a socket that is bound to an interface goes out that interface * A broadcast packet sent on an unbound socket follows the route table to select the outgoing interface + if an explicit route entry exists for 255.255.255.255/32, use that one + else use the default route * Broadcast packets are looped back and delivered following the rules for incoming packets (see next). This is the same behavior as for multicast packets, except that it cannot be disabled via sockopt. Incoming -------- * Sockets wishing to receive broadcast packets must bind to either INADDR_ANY (0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives broadcast packets. * Broadcast packets are multiplexed to all sockets matching it. This is the same behavior as for multicast packets. * A socket can bind to 255.255.255.255:<port> and then receive its own broadcast packets sent to 255.255.255.255:<port> In addition, this change implicitly fixes an issue with multicast reception. If two sockets want to receive a given multicast stream and one is bound to ANY while the other is bound to the multicast address, only one of them will receive the traffic. PiperOrigin-RevId: 272792377	2019-10-03 19:31:35 -07:00
gVisor bot	135aadb517	Merge pull request #757 from xiaobo55x:pkg_bits PiperOrigin-RevId: 272760964	2019-10-03 16:13:34 -07:00
Andrei Vagin	db218fdfcf	Don't report partialResult errors from sendfile The input file descriptor is always a regular file, so sendfile can't lose any data if it will not be able to write them to the output file descriptor. Reported-by: syzbot+22d22330a35fa1c02155@syzkaller.appspotmail.com PiperOrigin-RevId: 272730357	2019-10-03 13:38:30 -07:00
gVisor bot	cde7711837	Merge pull request #865 from tanjianfeng:fix-829 PiperOrigin-RevId: 272522508	2019-10-02 14:51:04 -07:00
Andrei Vagin	2016cc283c	fs/proc: report PID-s from a pid namespace of the proc mount Right now, we can find more than one process with the 1 PID in /proc. $ for i in `seq 10`; do > unshare -fp sleep 1000 & > done $ ls /proc 1 1 1 1 12 18 24 29 6 loadavg net sys version 1 1 1 1 16 20 26 32 cpuinfo meminfo self thread-self 1 1 1 1 17 21 28 36 filesystems mounts stat uptime PiperOrigin-RevId: 272506593	2019-10-02 13:29:42 -07:00
Andrei Vagin	9a875306db	Merge branch 'master' into pr_syscall_linux	2019-10-02 13:00:07 -07:00
Michael Pratt	0d483985c5	Include AT_SECURE in the aux vector gVisor does not currently implement the functionality that would result in AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we should do the same. PiperOrigin-RevId: 272311488	2019-10-01 15:43:14 -07:00
Michael Pratt	dd69b49ed1	Disable cpuClockTicker when app is idle Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to track their CPU usage. This improves latency in the syscall path by avoiding expensive monotonic clock calls on every syscall entry/exit. However, this timer fires every 10ms. Thus, when all tasks are idle (i.e., blocked or stopped), this forces a sentry wakeup every 10ms, when we may otherwise be able to sleep until the next app-relevant event. These wakeups cause the sentry to utilize approximately 2% CPU when the application is otherwise idle. Updates to clock are not strictly necessary when the app is idle, as there are no readers of cpuClock. This commit reduces idle CPU by disabling the timer when tasks are completely idle, and computing its effects at the next wakeup. Rather than disabling the timer as soon as the app goes idle, we wait until the next tick, which provides a window for short sleeps to sleep and wakeup without doing the (relatively) expensive work of disabling and enabling the timer. PiperOrigin-RevId: 272265822	2019-10-01 12:21:01 -07:00
Michael Pratt	53cc72da90	Honor X bit on extra anon pages in PT_LOAD segments Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58 (v4.11). Previously, extra pages were always mapped RW. Now, those pages will be executable if the segment specified PF_X. They still must be writeable. PiperOrigin-RevId: 272256280	2019-10-01 11:30:36 -07:00
Andrei Vagin	7a234f736f	splice: try another fallback option only if the previous one isn't supported Reported-by: syzbot+bb5ed342be51d39b0cbb@syzkaller.appspotmail.com PiperOrigin-RevId: 272110815	2019-09-30 18:23:42 -07:00
Andrei Vagin	29a1ba54ea	splice: compare inode numbers only if both ends are pipes It isn't allowed to splice data from and into the same pipe. But right now this check is broken, because we don't check that both ends are pipes. PiperOrigin-RevId: 272107022	2019-09-30 17:57:14 -07:00
Adin Scannell	20841b98e1	Update FIXME bug with GitHub issue. PiperOrigin-RevId: 272101930	2019-09-30 17:24:29 -07:00
Bhasker Hariharan	bcbb3ef317	Add a Stringer implementation to PacketDispatchMode PiperOrigin-RevId: 272083936	2019-09-30 15:52:55 -07:00
Bhasker Hariharan	61f6fbd0ce	Fix bugs in PickEphemeralPort for TCP. Netstack always picks a random start point every time PickEphemeralPort is called. While this is required for UDP so that DNS requests go out through a randomized set of ports, it is not required for TCP. In fact Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret initialized at start of the application to get a random offset. But to ensure it doesn't start from the same point on every scan it uses a static hint that is incremented by 2 in every call to pick ephemeral ports. The reason for 2 is Linux seems to split the port ranges where active connects seem to use even ones while odd ones are used by listening sockets. This CL implements a similar strategy where we use a hash + hint to generate the offset to start the search for a free Ephemeral port. This ensures that we cycle through the available port space in order for repeated connects to the same destination and significantly reduces the chance of picking a recently released port. PiperOrigin-RevId: 272058370	2019-09-30 13:55:22 -07:00
Nicolas Lacasse	3ad17ff597	Force timestamps to update when set via InodeOperations.SetTimestamps. The gofer's CachingInodeOperations implementation contains an optimization for the common open-read-close pattern when we have a host FD. In this case, the host kernel will update the timestamp for us to a reasonably close time, so we don't need an extra RPC to the gofer. However, when the app explicitly sets the timestamps (via futimes or similar) then we actually DO need to update the timestamps, because the host kernel won't do it for us. To fix this, a new boolean `forceSetTimestamps` was added to CachingInodeOperations.SetMaskedAttributes. It is only set by gofer.InodeOperations.SetTimestamps. PiperOrigin-RevId: 272048146	2019-09-30 13:08:45 -07:00
Michael Pratt	981fc188f0	Only copy out remaining time on nanosleep success It looks like the old code attempted to do this, but didn't realize that err != nil even in the happy case. PiperOrigin-RevId: 272005887	2019-09-30 13:07:32 -07:00
gVisor bot	eebc38be7a	Merge pull request #882 from DarcySail:darcy_faster_CopyStringIn PiperOrigin-RevId: 271675009	2019-09-27 17:27:13 -07:00
gVisor bot	8539abc0df	Merge pull request #864 from tanjianfeng:fix-861 PiperOrigin-RevId: 271649711	2019-09-27 15:18:09 -07:00
gVisor bot	abbee5615f	Implement SO_BINDTODEVICE sockopt PiperOrigin-RevId: 271644926	2019-09-27 14:14:04 -07:00
Kevin Krakauer	543492650d	Make raw socket tests pass in environments with or without CAP_NET_RAW. PiperOrigin-RevId: 271442321	2019-09-26 15:09:20 -07:00
gVisor bot	dd0e5eedae	Merge pull request #765 from trailofbits:uds_support PiperOrigin-RevId: 271235134	2019-09-25 16:44:22 -07:00
Kevin Krakauer	59ccbb1044	Remove centralized registration of protocols. Also removes the need for protocol names. PiperOrigin-RevId: 271186030	2019-09-25 12:57:05 -07:00
gVisor bot	99c86b8dbd	Merge pull request #863 from tanjianfeng:fix-862 PiperOrigin-RevId: 271168948	2019-09-25 11:36:06 -07:00
gVisor bot	76ff1947b6	gvisor: change syscall.RawSyscall to syscall.RawSyscall6 where required Before https://golang.org/cl/173160 syscall.RawSyscall would zero out the last three register arguments to the system call. That no longer happens. For system calls that take more than three arguments, use RawSyscall6 to ensure that we pass zero, not random data, for the additional arguments. PiperOrigin-RevId: 271062527	2019-09-24 23:47:42 -07:00
Adin Scannell	502f8f238e	Stub out readahead implementation. Closes #261 PiperOrigin-RevId: 270973347	2019-09-24 13:29:46 -07:00
Chris Kuiper	6704d625ef	Return only primary addresses in Stack.NICInfo() Non-primary addresses are used for endpoints created to accept multicast and broadcast packets, as well as "helper" endpoints (0.0.0.0) that allow sending packets when no proper address has been assigned yet (e.g., for DHCP). These addresses are not real addresses from a user point of view and should not be part of the NICInfo() value. Also see b/127321246 for more info. This switches NICInfo() to call a new NIC.PrimaryAddresses() function. To still allow an option to get all addresses (mostly for testing) I added Stack.GetAllAddresses() and NIC.AllAddresses(). In addition, the return value for GetMainNICAddress() was changed for the case where the NIC has no primary address. Instead of returning an error here, it now returns an empty AddressWithPrefix() value. The rationale for this change is that it is a valid case for a NIC to have no primary addresses. Lastly, I refactored the code based on the new additions. PiperOrigin-RevId: 270971764	2019-09-24 13:21:20 -07:00
Tamir Duberstein	bbaaa1fcc2	Simplify ICMPRateLimiter https://github.com/golang/time/commit/c4c64ca added SetBurst upstream. PiperOrigin-RevId: 270925077	2019-09-24 09:50:51 -07:00
henry.tjf	bc9de939fd	tty: fix sending SIGTTOU on tty write How to reproduce: $ echo "timeout 10 ls" > foo.sh $ chmod +x foo.sh $ ./foo.sh (will hang here for 10 secs, and the output of ls does not show) When "ls" process writes to stdout, it receives SIGTTOU signal, and hangs there until "timeout" process times out and kills "ls" process. The expected result is: "ls" writes its output into tty, and terminates immediately, then "timeout" process receives SIGCHLD and terminates. The reason for this failure is that we missed the check for TOSTOP (if set, background processes will receive the SIGTTOU signal when they do write). We use drivers/tty/n_tty.c:n_tty_write() as a reference. Fixes: #862 Reported-by: chris.zn <chris.zn@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Signed-off-by: chenglang.hy <chenglang.hy@antfin.com>	2019-09-24 14:18:22 +00:00
Haibo Xu	a26276b949	Enable pkg/bits support on arm64. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I490716f0e6204f0b3a43f71931b10d1ca541e128	2019-09-24 07:03:19 +00:00
Haibo Xu	2db866c45f	Enable pkg/sleep support on arm64. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I9071e698c1f222e0fdf3b567ec4cbd97f0a8dde9	2019-09-24 06:42:26 +00:00
Adin Scannell	6c88f674af	Add test for concurrent reads and writes. PiperOrigin-RevId: 270789146	2019-09-23 16:44:30 -07:00
Andrei Vagin	03ee55cc62	netstack: convert more socket options to {Set,Get}SockOptInt PiperOrigin-RevId: 270763208	2019-09-23 14:39:14 -07:00
gVisor bot	4aeedd47bf	internal BUILD file cleanup. PiperOrigin-RevId: 270680704	2019-09-23 08:25:13 -07:00
Jamie Liu	fb55c2bd0d	Change vfs.Dirent.Off to NextOff. "d_off is the distance from the start of the directory to the start of the next linux_dirent." - getdents(2). PiperOrigin-RevId: 270349685	2019-09-20 14:24:29 -07:00
Ian Gudger	002f1d4aae	Allow waiting for LinkEndpoint worker goroutines to finish. Previously, the only safe way to use an fdbased endpoint was to leak the FD. This change makes it possible to safely close the FD. This is the first step towards having stoppable stacks. Updates #837 PiperOrigin-RevId: 270346582	2019-09-20 14:10:02 -07:00
Jianfeng Tan	223481e927	fix set hostname Previously, when we set hostname: $ strace hostname abc ... sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long) ... According to man 2 sethostname: "The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)" We wrongly use the CopyStringIn() to check terminating zero byte in the implementation of sethostname syscall. To fix this, we use CopyInBytes() instead. Fixes: #861 Reported-by: chenglang.hy <chenglang.hy@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>	2019-09-20 17:57:25 +00:00
Jianfeng Tan	329b6653ff	Implement /proc/net/tcp6 Fixes: #829 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Signed-off-by: Jielong Zhou <jielong.zjl@antfin.com>	2019-09-20 17:20:08 +00:00
Jamie Liu	e9af227a61	Fix p9 integration of flipcall. - Do not call Rread.SetPayload(flipcall packet window) in p9.channel.recv(). - Ignore EINTR from ppoll() in p9.Client.watch(). - Clean up handling of client socket FD lifetimes so that p9.Client.watch() never ppoll()s a closed FD. - Make p9test.Harness.Finish() call clientSocket.Shutdown() instead of clientSocket.Close() for the same reason. - Rework channel reuse to avoid leaking channels in the following case (suppose we have two channels): sendRecvChannel len(channels) == 2 => idx = 1 inuse[1] = ch0 sendRecvChannel len(channels) == 1 => idx = 0 inuse[0] = ch1 inuse[1] = nil sendRecvChannel len(channels) == 1 => idx = 0 inuse[0] = ch0 inuse[0] = nil inuse[0] == nil => ch0 leaked - Avoid deadlocking p9.Client.watch() by calling channelsWg.Wait() without holding channelsMu. - Bump p9test:client_test size to medium. PiperOrigin-RevId: 270200314	2019-09-19 22:52:56 -07:00
Robert Tonic	46beb91912	Fix documentation, clean up seccomp filter installation, rename helpers. Filter installation has been streamlined and functions renamed. Documentation has been fixed to be standards compliant, and missing documentation added. gofmt has also been applied to modified files.	2019-09-19 17:10:50 -04:00
Adin Scannell	75781ab3ef	Remove defer from hot path and ensure Atomic is applied consistently. PiperOrigin-RevId: 270114317	2019-09-19 13:39:32 -07:00
gVisor bot	1c0324d5a1	Merge pull request #876 from xiaobo55x:hostcpu PiperOrigin-RevId: 270094324	2019-09-19 12:03:38 -07:00
Kevin Krakauer	0a8a75f3da	Job control: controlling TTYs and foreground process groups. Addresses a deadlock with the rolled back change: `b6a5b950d2` Creating a session from an orphaned process group was causing a lock to be acquired twice by a single goroutine. This behavior is addressed, and a test (OrphanRegression) has been added to pty.cc. Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 270088599	2019-09-19 11:36:47 -07:00
Hang Su	d72c63664b	Accelerate byte lookup in string with `bytealg/indexbyte` `bytealg/indexbyte` will use AVX or SSE instruction set, if possible, which could accelerate `CopyStringIn` function by 28%. In worst case(CPU doesn't support SSE), `bytealg/indexbyte` will degenerate to traversal lookup. When dealing with short strings, `bytealg/indexbyte` has the same performance level as before. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Signed-off-by: Hang Su <darcy.sh@antfin.com>	2019-09-19 22:16:52 +08:00
Haibo Xu	cabe10e603	Enable pkg/sentry/hostcpu support on arm64. Signed-off-by: Haibo Xu haibo.xu@arm.com Change-Id: I333872da9bdf56ddfa8ab2f034dfc1f36a7d3132	2019-09-18 23:51:42 +00:00
Adin Scannell	c98e7f0d19	Signalfd support Note that the exact semantics for these signalfds are slightly different from Linux. These signalfds are bound to the process at creation time. Reads, polls, etc. are all associated with signals directed at that task. In Linux, all signalfd operations are associated with current, regardless of where the signalfd originated. In practice, this should not be an issue given how signalfds are used. In order to fix this however, we will need to plumb the context through all the event APIs. This gets complicated really quickly, because the waiter APIs are all netstack-specific, and not generally exposed to the context. Probably not worthwhile fixing immediately. PiperOrigin-RevId: 269901749	2019-09-18 15:16:42 -07:00
Bin Lu	38bc0b6b6a	enable syscalls/linux to support arm64 Signed-off-by: Bin Lu <bin.lu@arm.com> Change-Id: I45af8a54304f8bb0e248ab15f4e20b173ea9e430	2019-09-18 10:13:06 +00:00
Bin Lu	8e73e2cec5	enable kvm/testutil to support arm64 enable kvm/testutil to support arm64 The Arm64 user-mode execution state consists of: 1, X0- X30 2, PC, SP, PSTATE 3, TPIDR_EL0, used for TLS 4, V0-V31: 32 128-bit registers for floating point and simd 5, FPSR Currently, we first try to achieve goals 1 and 2. This patch provides basic test utils for goals 1 & 2 Signed-off-by: Bin Lu <bin.lu@arm.com>	2019-09-18 09:57:59 +00:00
Ghanan Gowripalan	60fe8719e1	Automated rollback of changelist 268047073 PiperOrigin-RevId: 269658971	2019-09-17 14:47:09 -07:00
Andrei Vagin	3b7119a7c9	platform/ptrace: log exit code for stub processes PiperOrigin-RevId: 269631877	2019-09-17 12:45:22 -07:00
Ian Gudger	747320a7aa	Update remaining users of LinkEndpoints to not refer to them as an ID. PiperOrigin-RevId: 269614517	2019-09-17 11:31:00 -07:00
Andrei Vagin	239a07aabf	gvisor: return ENOTDIR from the unlink syscall ENOTDIR has to be returned when a component used as a directory in pathname is not, in fact, a directory. PiperOrigin-RevId: 269037893	2019-09-13 21:44:57 -07:00
Adin Scannell	a8834fc555	Update p9 to support flipcall. PiperOrigin-RevId: 268845090	2019-09-12 23:37:31 -07:00
Adin Scannell	7c6ab6a219	Implement splice methods for pipes and sockets. This also allows the tee(2) implementation to be enabled, since dup can now be properly supported via WriteTo. Note that this change necessitated some minor restructuring with the fs.FileOperations splice methods. If the *fs.File is passed through directly, then only public API methods are accessible, which will deadlock immediately since the locking is already done by fs.Splice. Instead, we pass through an abstract io.Reader or io.Writer, which elide locks and use the underlying fs.FileOperations directly. PiperOrigin-RevId: 268805207	2019-09-12 17:43:27 -07:00
Michael Pratt	df5d377521	Remove go_test from go_stateify and go_marshal They are no-ops, so the standard rule works fine. PiperOrigin-RevId: 268776264	2019-09-12 15:10:17 -07:00
Ghanan Gowripalan	857940d30d	Automated rollback of changelist 268047073 PiperOrigin-RevId: 268757842	2019-09-12 13:52:25 -07:00
Ian Gudger	9dfcd8b09f	Fix ephemeral port leak. Fix a bug where udp.(endpoint).Disconnect [accessible in gVisor via epsocket.(SocketOperations).Connect with AF_UNSPEC] would leak a port reservation if the socket/endpoint had an ephemeral port assigned to it. glibc's getaddrinfo uses connect with AF_UNSPEC, causing each call of getaddrinfo to leak a port. Call getaddrinfo too many times and you run out of ports (shows up as connect returning EAGAIN and getaddrinfo returning EAI_NONAME "Name or service not known"). PiperOrigin-RevId: 268071160	2019-09-09 14:02:00 -07:00
Rahat Mahmood	3733b9b893	go_marshal: Implement automatic generation of ABI marshalling code. This CL implements go_marshal, a code generation utility for automatically serializing and deserializing ABI structs. The go_marshal tool automatically generates implementations of the new marshal interface. Unlike binary.Marshal/Unmarshal, the generated interface implementations use no runtime reflection, and translates to a single memcpy for most structs. See go_marshal/README.md for details. PiperOrigin-RevId: 268065475	2019-09-09 13:36:39 -07:00
Ghanan Gowripalan	a8943325db	Join IPv6 all-nodes and solicited-node multicast addresses where appropriate. The IPv6 all-nodes multicast address will be joined on NIC enable, and the appropriate IPv6 solicited-node multicast address will be joined when IPv6 addresses are added. Tests: Test receiving packets destined to the IPv6 link-local all-nodes multicast address and the IPv6 solicited-node address of an added IPv6 address. PiperOrigin-RevId: 268047073	2019-09-09 12:06:06 -07:00
Ian Gudger	fe1f521077	Remove redundant global tcpip.LinkEndpointID. PiperOrigin-RevId: 267709597	2019-09-06 18:01:14 -07:00
Jamie Liu	9e1cbdf565	Indicate flipcall synchronization to the Go race detector. Since each Endpoint has a distinct mapping of the packet window, the Go race detector does not recognize accesses by connected Endpoints to be related. This means that this change isn't necessary for the Go race detector to accept accesses of flipcall.Endpoint.Data(), but it is necessary for it to accept accesses to shared variables outside the scope of flipcall that are synchronized by flipcall.Endpoint state; see updated test for an example. RaceReleaseMerge is needed (instead of RaceRelease) because calls to raceBecomeInactive() from unrelated Endpoints can occur in any order. (DowngradableRWMutex.RUnlock() has a similar property: calls to RUnlock() on the same DowngradableRWMutex from different goroutines can occur in any order. Remove the TODO asking to explain this now that this is understood.) PiperOrigin-RevId: 267705325	2019-09-06 17:25:07 -07:00
Nicolas Lacasse	7e94f171f4	Better strace logs for statx. PiperOrigin-RevId: 267498537	2019-09-05 18:03:53 -07:00
Robert Tonic	4573efe84b	Switch from net to unet to open Unix Domain Sockets.	2019-09-05 07:16:36 -04:00
Bhasker Hariharan	3dc3cffb2d	Fix RST generation bugs. There are a few cases addressed by this change - We no longer generate a RST in response to a RST packet. - When we receive a RST we cleanup and release all reservations immediately as the connection is now aborted. - An ACK received by a listening socket generates a RST when SYN cookies are not in-use. The only reason an ACK should land at the listening socket is if we are using SYN cookies otherwise the goroutine for the handshake in progress should have gotten the packet and it should never have arrived at the listening endpoint. - Also fixes the error returned when a connection times out due to a Keepalive timer expiration from ECONNRESET to a ETIMEDOUT. PiperOrigin-RevId: 267238427	2019-09-04 14:59:53 -07:00
Chris Kuiper	7bf1d426d5	Handle subnet and broadcast addresses correctly with NIC.subnets This also renames "subnet" to "addressRange" to avoid any more confusion with an interface IP's subnet. Lastly, this also removes the Stack.ContainsSubnet(..) API since it isn't used by anyone. Plus the same information can be obtained from Stack.NICAddressRanges(). PiperOrigin-RevId: 267229843	2019-09-04 14:19:32 -07:00
Adin Scannell	67a2ab1438	Impose order on test scripts. The simple test script has gotten out of control. Shard this script into different pieces and attempt to impose order on overall test structure. This change helps lay some of the foundations for future improvements. * The runsc/test directories are moved into just test/. * The runsc/test/testutil package is split into logical pieces. * The scripts/ directory contains new top-level targets. * Each test is now responsible for building targets it requires. * The install functionality is moved into `runsc` itself for simplicity. * The existing kokoro run_tests.sh file now just calls all (can be split). After this change is merged, I will create multiple distinct workflows for Kokoro, one for each of the scripts currently targeted by `run_tests.sh` today, which should dramatically reduce the time-to-run for the Kokoro tests, and provides a better foundation for further improvements to the infrastructure. PiperOrigin-RevId: 267081397	2019-09-03 22:02:43 -07:00
Ghanan Gowripalan	144127e5e1	Validate IPv6 Hop Limit field for received NDP packets Make sure that NDP packets are only received if their IP header's hop limit field is set to 255, as per RFC 4861. PiperOrigin-RevId: 267061457	2019-09-03 18:43:12 -07:00
Bhasker Hariharan	3789c34b22	Make UDP traceroute work. Adds support to generate Port Unreachable messages for UDP datagrams received on a port for which there is no valid endpoint. Fixes #703 PiperOrigin-RevId: 267034418	2019-09-03 16:01:17 -07:00
Jamie Liu	eb94066ef2	Ensure that flipcall.Endpoint.Shutdown() shuts down inactive peers. PiperOrigin-RevId: 267022978	2019-09-03 15:10:51 -07:00
Haibo Xu	fa151e3971	Remove duplicated file in pkg/tcpip/link/rawfile. The blockingpoll_unsafe.go was copied to blockingpoll_noyield_unsafe.go during merging commit `7206202bb9`. If it still stay here, it would cause build errors on non-amd64 platform. ERROR: pkg/tcpip/link/rawfile/BUILD:5:1: GoCompilePkg pkg/tcpip/link/rawfile.a failed (Exit 1) builder failed: error executing command bazel-out/host/bin/external/go_sdk/builder compilepkg -sdk external/go_sdk -installsuffix linux_arm64 -src pkg/tcpip/link/rawfile/blockingpoll_noyield_unsafe.go -src ... (remaining 33 argument(s) skipped) Use --sandbox_debug to see verbose messages from the sandbox compilepkg: error running subcommand: exit status 2 pkg/tcpip/link/rawfile/blockingpoll_yield_unsafe.go:35:6: BlockingPoll redeclared in this block previous declaration at pkg/tcpip/link/rawfile/blockingpoll_unsafe.go:26:78 Target //pkg/tcpip/link/rawfile:rawfile failed to build Use --verbose_failures to see the command lines of failed build steps. INFO: Elapsed time: 25.531s, Critical Path: 21.08s INFO: 262 processes: 262 linux-sandbox. FAILED: Build did NOT complete successfully Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I4e21f82984225d0aa173de456f7a7c66053a053e	2019-09-02 02:49:41 +00:00
Jamie Liu	0352cf5866	Remove support for non-incremental mapped accounting. PiperOrigin-RevId: 266496644	2019-08-30 19:06:55 -07:00
Bhasker Hariharan	54bf2e8eff	Automated rollback of changelist 261387276 PiperOrigin-RevId: 266491264	2019-08-30 18:15:32 -07:00
Chris Kuiper	afbdf2f212	Fix data race accessing referencedNetworkEndpoint.kind Wrapping "kind" into atomic access functions. Fixes #789 PiperOrigin-RevId: 266485501	2019-08-30 17:23:53 -07:00
Fabricio Voznika	502c47f7a7	Return correct buffer size for ioctl(socket, FIONREAD) Ioctl was returning just the buffer size from epsocket.endpoint and it was not considering data from epsocket.SocketOperations that was read from the endpoint, but not yet sent to the caller. PiperOrigin-RevId: 266485461	2019-08-30 17:19:09 -07:00
Rahat Mahmood	863e11ac4d	Implement /proc/net/udp. PiperOrigin-RevId: 266229756	2019-08-29 14:30:41 -07:00
gVisor bot	0789b9cc08	Merge pull request #655 from praveensastry:feature/runsc-ref-chk-leak PiperOrigin-RevId: 266226714	2019-08-29 14:17:32 -07:00
Jamie Liu	36a8949b2a	Add limit_host_fd_translation Gofer mount option. PiperOrigin-RevId: 266177409	2019-08-29 14:01:03 -07:00
Tamir Duberstein	24ecce5dbf	Export generated linkAddrEntryEntry PiperOrigin-RevId: 266000128	2019-08-28 14:56:33 -07:00
Tamir Duberstein	313c767b00	Populate link address cache at dispatch This allows the stack to learn remote link addresses on incoming packets, reducing the need to ARP to send responses. This also reduces the number of round trips to the system clock, since that may also prove to be performance-sensitive. Fixes #739. PiperOrigin-RevId: 265815816	2019-08-27 18:54:56 -07:00
Michael Pratt	9679f9891f	Fix comment typo PiperOrigin-RevId: 265731735	2019-08-27 11:44:06 -07:00
Fabricio Voznika	8fd89fd7a2	Fix sendfile(2) error code When output file is in append mode, sendfile(2) should fail with EINVAL and not EBADF. Closes #721 PiperOrigin-RevId: 265718958	2019-08-27 10:52:46 -07:00
Fabricio Voznika	c39564332b	Mount volumes as super user This used to be the case, but regressed after a recent change. Also made a few fixes around it and clean up the code a bit. Closes #720 PiperOrigin-RevId: 265717496	2019-08-27 10:47:16 -07:00
Robert Tonic	c319b360d1	First pass at implementing Unix Domain Socket support. No tests. This commit adds support for detecting the socket file type, connecting to a Unix Domain Socket, and providing bidirectional communication (without file descriptor transfer support).	2019-08-27 13:08:56 -04:00
Rahat Mahmood	1fdefd41c5	netstack/tcp: Add LastAck transition. Add missing state transition to LastAck, which should happen when the endpoint has already received a FIN from the remote side, and is sending its own FIN. PiperOrigin-RevId: 265568314	2019-08-26 16:39:13 -07:00
Michael Pratt	904b156962	Add support for Intel cache CPUID leafs This exposes L1, L2, etc. cache sizes, cache line size, etc. Across S/R, everything except cache line size can differ from the host. This is because cache line size is critical for correct use of CLFLUSH / CLFLUSHOPT, but as far as I know, the other cache parameters can only affect performance, not correctness. AMD uses different leafs for cache information, which are not yet supported. fail. There are no known cases of cache line size other than 64 in the fleet. PiperOrigin-RevId: 265544786	2019-08-26 14:47:05 -07:00
gVisor bot	7206202bb9	Merge pull request #696 from xiaobo55x:tcpip_link PiperOrigin-RevId: 265534854	2019-08-26 14:03:30 -07:00
Chris Kuiper	ac2200b8a9	Prevent a network endpoint to send/rcv if its address was removed This addresses the problem where an endpoint has its address removed but still has outstanding references held by routes used in connected TCP/UDP sockets which prevent the removal of the endpoint. The fix adds a new "expired" flag to the referenced network endpoint, which is set when an endpoint has its address removed. Incoming packets are not delivered to an expired endpoint (unless in promiscuous mode), while sending outgoing packets triggers an error to the caller (unless in spoofing mode). In addition, a few helper functions were added to stack_test.go to reduce code duplications. PiperOrigin-RevId: 265514326	2019-08-26 12:29:47 -07:00
Tamir Duberstein	e75a12e89d	Implement fmt.Stringer on Route by value This is more convenient, since it implements the interface for both value and pointer. PiperOrigin-RevId: 265086510	2019-08-23 10:44:11 -07:00
Adin Scannell	761e4bf2fe	Ensure yield-equivalent with an already-expired timeout. PiperOrigin-RevId: 264920977	2019-08-22 14:34:33 -07:00
Jianfeng Tan	2c3e2ed2bf	unix: return ECONNRESET if peer closed with data not read For SOCK_STREAM type unix socket, we shall return ECONNRESET if peer is closed with data not read. We explicitly set a flag when closing one end, to differentiate from just shutdown (where zero shall be returned). Fixes: #735 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>	2019-08-22 15:25:38 +00:00
Jianfeng Tan	96f78e2466	unix: return zero if peer is closed Previously, recvmsg() on a unix stream socket with its peer closed will never return, with goroutine call trace like this: ... 2 in gvisor.dev/gvisor/pkg/sentry/kernel.(Task).block at pkg/sentry/kernel/task_block.go:124 3 in gvisor.dev/gvisor/pkg/sentry/kernel.(Task).BlockWithDeadline at pkg/sentry/kernel/task_block.go:69 4 in gvisor.dev/gvisor/pkg/sentry/socket/unix.(SocketOperations).RecvMsg at pkg/sentry/socket/unix/unix.go:612 5 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.recvFrom at pkg/sentry/syscalls/linux/sys_socket.go:885 6 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.RecvFrom at pkg/sentry/syscalls/linux/sys_socket.go:910 ... The issue is caused by that ErrClosedForReceive returned by unix/transport.queue is turned into nil in unix.(EndpointReader).ReadToBlocks(): err.ToError() As a result, in unix.(*SocketOperations).RecvMsg(): n == 0 and err == nil We shall differentiate it from another case - no data to read where ErrWouldBlock shall be returned; and return 0 immediately. Fixes: #734 Reported-by: chenglang.hy <chenglang.hy@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>	2019-08-22 15:25:38 +00:00
praveensastry	7672eaae25	Add log prefix for better clarity	2019-08-22 22:52:43 +10:00
Chris Kuiper	8d9276ed56	Support binding to multicast and broadcast addresses This fixes the issue of not being able to bind to either a multicast or broadcast address as well as to send and receive data from it. The way to solve this is to treat these addresses similar to the ANY address and register their transport endpoint ID with the global stack's demuxer rather than the NIC's. That way there is no need to require an endpoint with that multicast or broadcast address. The stack's demuxer is in fact the only correct one to use, because neither broadcast- nor multicast-bound sockets care which NIC a packet was received on (for multicast a join is still needed to receive packets on a NIC). I also took the liberty of refactoring udp_test.go to consolidate a lot of duplicate code and make it easier to create repetitive tests that test the same feature for a variety of packet and socket types. For this purpose I created a "flowType" that represents two things: 1) the type of packet being sent or received and 2) the type of socket used for the test. E.g., a "multicastV4in6" flow represents a V4-mapped multicast packet run through a V6-dual socket. This allows writing significantly simpler tests. A nice example is testTTL(). PiperOrigin-RevId: 264766909	2019-08-21 22:54:25 -07:00
Tamir Duberstein	573e6e4bba	Use tcpip.Subnet in tcpip.Route This is the first step in replacing some of the redundant types with the standard library equivalents. PiperOrigin-RevId: 264706552	2019-08-21 15:31:18 -07:00
Chris Kuiper	7e79ca0225	Add tcpip.Route.String and tcpip.AddressMask.Prefix PiperOrigin-RevId: 264544163	2019-08-20 23:28:52 -07:00
Zach Koopmans	67d7864f83	Document RWF_HIPRI not implemented for preadv2/pwritev2. Document limitation of no reasonable implementation for RWF_HIPRI flag (High Priority Read/Write for block-based file systems). PiperOrigin-RevId: 264237589	2019-08-19 14:07:44 -07:00
gVisor bot	3ffbdffd7e	Internal change. PiperOrigin-RevId: 264218306	2019-08-19 12:43:22 -07:00
Jianfeng Tan	a63f88855f	hostinet: fix parsing route netlink message We wrongly parses output interface as gateway address. The fix is straightforward. Fixes #638 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: Ia4bab31f3c238b0278ea57ab22590fad00eaf061 COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/684 from tanjianfeng:fix-638 b940e810367ad1273519bfa594f4371bdd293e83 PiperOrigin-RevId: 264211336	2019-08-19 12:10:21 -07:00
Kevin Krakauer	bd826092fe	Read iptables via sockopts. PiperOrigin-RevId: 264180125	2019-08-19 10:05:59 -07:00
Andrei Vagin	3e4102b2ea	netstack: disconnect an unix socket only if the address family is AF_UNSPEC Linux allows to call connect for ANY and the zero port. PiperOrigin-RevId: 263892534	2019-08-16 19:32:14 -07:00
Ayush Ranjan	661b2b9f69	procfs: Migrate seqfile implementations. Migrates all (except 3) seqfile implementations to the vfs.DynamicBytesSource interface. There should not be any change in functionality due to this migration itself. Please note that the following seqfile implementations have not been migrated: - /proc/filesystems in proc/filesystems.go - /proc/[pid]/mountinfo in proc/mounts.go - /proc/[pid]/mounts in proc/mounts.go This is because these depend on pending changes in /pkg/sentry/vfs. PiperOrigin-RevId: 263880719	2019-08-16 17:36:42 -07:00
Andrei Vagin	2a1303357c	ptrace: detect if a stub process exited unexpectedly PiperOrigin-RevId: 263880577	2019-08-16 17:33:28 -07:00
Chris Kuiper	f7114e0a27	Add subnet checking to NIC.findEndpoint and consolidate with NIC.getRef This adds the same logic to NIC.findEndpoint that is already done in NIC.getRef. Since this makes the two functions very similar they were combined into one with the originals being wrappers. PiperOrigin-RevId: 263864708	2019-08-16 15:58:58 -07:00
Ayush Ranjan	4bab7d7f08	vfs: Remove vfs.DefaultDirectoryFD from embedding vfs.DefaultFD. This fixes the implementation ambiguity issues when a filesystem implementation embeds vfs.DefaultDirectoryFD to its directory FD along with an internal common fileDescription utility. For similar reasons also removes FileDescriptionDefaultImpl from DynamicBytesFileDescriptionImpl. PiperOrigin-RevId: 263795513	2019-08-16 10:20:11 -07:00
Rahat Mahmood	6cfc76798b	Document source and versioning of the TCPInfo struct. PiperOrigin-RevId: 263637194	2019-08-15 14:05:59 -07:00
Tamir Duberstein	fe74bba2bd	Don't dereference errors passed to panic() These errors are always pointers; there's no sense in dereferencing them in the panic call. Changed one false positive for clarity. PiperOrigin-RevId: 263611579	2019-08-15 11:58:16 -07:00
Tamir Duberstein	816a9211e9	netstack: move resumption logic into _state.go `13a98df` rearranged some of this code in a way that broke compilation of the netstack-only export at github.com/google/netstack because _state.go files are not included in that export. This commit moves resumption logic back into *_state.go, fixing the compilation breakage. PiperOrigin-RevId: 263601629	2019-08-15 11:13:46 -07:00
Haibo Xu	1b1e39d7a1	Enabling pkg/tcpip/link support on arm64. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: Ib6b4aa2db19032e58bf0395f714e6883caee460a	2019-08-15 03:19:30 +00:00
Haibo Xu	52843719ca	Rename fdbased/mmap.go to fdbased/mmap_stub.go. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: Id4489554b9caa332695df8793d361f8332f6a13b	2019-08-15 03:19:22 +00:00
Haibo Xu	0624858593	Rename rawfile/blockingpoll_unsafe.go to rawfile/blockingpoll_stub_unsafe.go. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I2376e502c1a860d5e624c8a8e3afab5da4c53022	2019-08-15 03:19:14 +00:00
Tamir Duberstein	d81d94ac4c	Replace uintptr with int64 when returning lengths This is in accordance with newer parts of the standard library. PiperOrigin-RevId: 263449916	2019-08-14 16:05:56 -07:00
Tamir Duberstein	69d1414a32	Add tcpip.AddressWithPrefix.String PiperOrigin-RevId: 263436592	2019-08-14 15:02:14 -07:00
Bhasker Hariharan	570fb1db6b	Improve SendMsg performance. SendMsg before this change would copy all the data over into a new slice even if the underlying socket could only accept a small amount of data. This is really inefficient with non-blocking sockets and under high throughput where large writes could get ErrWouldBlock or if there was say a timeout associated with the sendmsg() syscall. With this change we delay copying bytes in till they are needed and only copy what can be potentially sent/held in the socket buffer. Reducing the need to repeatedly copy data over. Also a minor fix to change state FIN-WAIT-1 when shutdown(..., SHUT_WR) is called instead of when we transmit the actual FIN. Otherwise the socket could remain in CONNECTED state even though the user has called shutdown() on the socket. Updates #627 PiperOrigin-RevId: 263430505	2019-08-14 14:34:27 -07:00
Jamie Liu	cee044c2ab	Add vfs.DynamicBytesFileDescriptionImpl. This replaces fs/proc/seqfile for vfs2-based filesystems. PiperOrigin-RevId: 263254647	2019-08-13 17:54:24 -07:00
Fabricio Voznika	0e907c4298	Fix file mode check in pipeOperations PiperOrigin-RevId: 263203441	2019-08-13 13:33:33 -07:00
Ian Gudger	072d941e32	Add note to name logging mentioning trace logging should be enabled to debug. PiperOrigin-RevId: 263194584	2019-08-13 12:49:18 -07:00
Ian Gudger	99bf75a6dc	gonet: Replace NewPacketConn with DialUDP. This better matches the standard library and allows creating connected PacketConns. PiperOrigin-RevId: 263187462	2019-08-13 12:11:09 -07:00
Nicolas Lacasse	9769a8eaa4	Handle ENOSPC with a partial write. Similar to the EPIPE case, we can return the number of bytes written before ENOSPC was encountered. If the app tries to write more, we can return ENOSPC on the next write. PiperOrigin-RevId: 263041648	2019-08-12 17:41:33 -07:00
Rahat Mahmood	691c2f8173	Compute size of struct tcp_info instead of hardcoding it. PiperOrigin-RevId: 263040624	2019-08-12 17:34:38 -07:00
Ian Gudger	eac690e358	Fix netstack build error on non-AMD64. This stub had the wrong function signature. PiperOrigin-RevId: 262992682	2019-08-12 13:31:16 -07:00
Andrei Vagin	af90e68623	netlink: return an error in nlmsgerr Now if a process sends an unsupported netlink requests, an error is returned from the send system call. The linux kernel works differently in this case. It returns errors in the nlmsgerr netlink message. Reported-by: syzbot+571d99510c6f935202da@syzkaller.appspotmail.com PiperOrigin-RevId: 262690453	2019-08-09 22:34:54 -07:00
Bhasker Hariharan	5a38eb120a	Add congestion control states to sender. This change just introduces different congestion control states and ensures the sender.state is updated to reflect the current state of the connection. It is not used for any decisions yet but this is required before algorithms like Eiffel/PRR can be implemented. Fixes #394 PiperOrigin-RevId: 262638292	2019-08-09 14:50:30 -07:00
Haibo Xu	1c9da886e7	Add initial ptrace stub and syscall support for arm64. Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I1dbd23bb240cca71d0cc30fc75ca5be28cb4c37c PiperOrigin-RevId: 262619519	2019-08-09 13:18:11 -07:00
Ayush Ranjan	c8961a6cbd	ext: Move to pkg/sentry/fsimpl. fsimpl is the keeper of all filesystem implementations in VFS2. PiperOrigin-RevId: 262617869	2019-08-09 13:08:28 -07:00
praveensastry	73985c6545	Fix the Stringer for leak mode	2019-08-09 17:13:06 +10:00
Ayush Ranjan	690308111c	ext: Benchmark tests. Added benchmark tests which emulate memfs benchmarks. Stat benchmarks BenchmarkVFS2Ext4fsStat/1-12 10000000 145 ns/op BenchmarkVFS2Ext4fsStat/2-12 10000000 170 ns/op BenchmarkVFS2Ext4fsStat/3-12 10000000 202 ns/op BenchmarkVFS2Ext4fsStat/8-12 3000000 374 ns/op BenchmarkVFS2Ext4fsStat/64-12 500000 2159 ns/op BenchmarkVFS2Ext4fsStat/100-12 300000 3459 ns/op BenchmarkVFS1TmpfsStat/1-12 5000000 348 ns/op BenchmarkVFS1TmpfsStat/2-12 3000000 487 ns/op BenchmarkVFS1TmpfsStat/3-12 2000000 655 ns/op BenchmarkVFS1TmpfsStat/8-12 1000000 1365 ns/op BenchmarkVFS1TmpfsStat/64-12 200000 9565 ns/op BenchmarkVFS1TmpfsStat/100-12 100000 15158 ns/op BenchmarkVFS2MemfsStat/1-12 10000000 133 ns/op BenchmarkVFS2MemfsStat/2-12 10000000 155 ns/op BenchmarkVFS2MemfsStat/3-12 10000000 182 ns/op BenchmarkVFS2MemfsStat/8-12 5000000 310 ns/op BenchmarkVFS2MemfsStat/64-12 1000000 1659 ns/op BenchmarkVFS2MemfsStat/100-12 500000 2787 ns/op Mount Stat benchmarks BenchmarkVFS2ExtfsMountStat/1-12 5000000 245 ns/op BenchmarkVFS2ExtfsMountStat/2-12 5000000 266 ns/op BenchmarkVFS2ExtfsMountStat/3-12 5000000 304 ns/op BenchmarkVFS2ExtfsMountStat/8-12 3000000 456 ns/op BenchmarkVFS2ExtfsMountStat/64-12 500000 2308 ns/op BenchmarkVFS2ExtfsMountStat/100-12 300000 3482 ns/op BenchmarkVFS1TmpfsMountStat/1-12 3000000 488 ns/op BenchmarkVFS1TmpfsMountStat/2-12 2000000 658 ns/op BenchmarkVFS1TmpfsMountStat/3-12 2000000 806 ns/op BenchmarkVFS1TmpfsMountStat/8-12 1000000 1514 ns/op BenchmarkVFS1TmpfsMountStat/64-12 100000 10037 ns/op BenchmarkVFS1TmpfsMountStat/100-12 100000 15280 ns/op BenchmarkVFS2MemfsMountStat/1-12 10000000 212 ns/op BenchmarkVFS2MemfsMountStat/2-12 5000000 232 ns/op BenchmarkVFS2MemfsMountStat/3-12 5000000 264 ns/op BenchmarkVFS2MemfsMountStat/8-12 3000000 390 ns/op BenchmarkVFS2MemfsMountStat/64-12 1000000 1813 ns/op BenchmarkVFS2MemfsMountStat/100-12 500000 2812 ns/op PiperOrigin-RevId: 262477158	2019-08-08 18:45:37 -07:00
Rahat Mahmood	7bfad8ebb6	Return a well-defined socket address type from socket functions. Previously we were representing socket addresses as an interface{}, which allowed any type which could be binary.Marshal()ed to be used as a socket address. This is fine when the address is passed to userspace via the linux ABI, but is problematic when used from within the sentry such as by networking procfs files. PiperOrigin-RevId: 262460640	2019-08-08 16:50:33 -07:00
Rahat Mahmood	13a98df49e	netstack: Don't start endpoint goroutines too soon on restore. Endpoint protocol goroutines were previously started as part of loading the endpoint. This is potentially too soon, as resources used by these goroutine may not have been loaded. Protocol goroutines may perform meaningful work as soon as they're started (ex: incoming connect) which can cause them to indirectly access resources that haven't been loaded yet. This CL defers resuming all protocol goroutines until the end of restore. PiperOrigin-RevId: 262409429	2019-08-08 12:33:11 -07:00
gVisor bot	2e45d1696e	Merge pull request #653 from xiaobo55x:dev PiperOrigin-RevId: 262402929	2019-08-08 11:58:14 -07:00
Jamie Liu	06102af65a	memfs fixes. - Unexport Filesystem/Dentry/Inode. - Support SEEK_CUR in directoryFD.Seek(). - Hold Filesystem.mu before touching directoryFD.off in directoryFD.Seek(). - Remove deleted Dentries from their parent directory.childLists. - Remove invalid FIXMEs. PiperOrigin-RevId: 262400633	2019-08-08 11:46:38 -07:00
Ayush Ranjan	08cd5e1d36	ext: Seek unit tests. PiperOrigin-RevId: 262264674	2019-08-07 19:13:41 -07:00
Ayush Ranjan	40d6d8c15b	ext: StatAt unit tests. PiperOrigin-RevId: 262249166	2019-08-07 17:21:00 -07:00
Ayush Ranjan	3b368cabf9	ext: Read unit tests. PiperOrigin-RevId: 262242410	2019-08-07 16:44:10 -07:00
Ayush Ranjan	ad67e5a7a0	ext: IterDirent unit tests. PiperOrigin-RevId: 262226761	2019-08-07 15:24:33 -07:00
Ayush Ranjan	1c9781a4ed	ext: vfs.FileDescriptionImpl and vfs.FilesystemImpl implementations. - This also gets rid of pipes for now because pipe does not have vfs2 specific support yet. - Added file path resolution logic. - Fixes testing infrastructure. - Does not include unit tests yet. PiperOrigin-RevId: 262213950	2019-08-07 14:23:42 -07:00
Tamir Duberstein	67a3f4039d	Set target address in ARP Reply PiperOrigin-RevId: 262163794	2019-08-07 10:27:43 -07:00
Bhasker Hariharan	dfbc0b0a4c	Fix for a panic due to writing to a closed accept channel. This can happen because endpoint.Close() closes the accept channel first and then drains/resets any accepted but not delivered connections. But there can be connections that are connected but not delivered to the channel as the channel was full. But closing the channel can cause these writes to fail with a write to a closed channel. The correct solution is to abort any connections in SYN-RCVD state and drain/abort all completed connections before closing the accept channel. PiperOrigin-RevId: 261951132	2019-08-06 11:01:27 -07:00
Michael Pratt	704f9610f3	Require pread/pwrite for splice file offsets If there is an offset, the file must support pread/pwrite. See fs/splice.c:do_splice. PiperOrigin-RevId: 261944932	2019-08-06 10:35:28 -07:00
Haibo Xu	83fdb7739e	Change syscall.EPOLLET to unix.EPOLLET syscall.EPOLLET has been defined with different values on amd64 and arm64(-0x80000000 on amd64, and 0x80000000 on arm64), while unix.EPOLLET has been unified this value to 0x80000000(golang/go#5328). ref #63 Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: Id97d075c4e79d86a2ea3227ffbef02d8b00ffbb8	2019-08-05 23:10:08 +00:00
praveensastry	607be0585f	Add option to configure reference leak checking	2019-08-06 01:15:48 +10:00
Kevin Krakauer	810cc07aab	Plumbing for iptables sockopts. PiperOrigin-RevId: 261413396	2019-08-02 16:26:48 -07:00
Kevin Krakauer	b6a5b950d2	Job control: controlling TTYs and foreground process groups. (Don't worry, this is mostly tests.) Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 261387276	2019-08-02 14:05:48 -07:00
Rahat Mahmood	2906dffcdb	Automated rollback of changelist 261191548 PiperOrigin-RevId: 261373749	2019-08-02 12:52:40 -07:00
Nicolas Lacasse	aaaefdf9ca	Remove kernel.mounts. We can get the mount namespace from the CreateProcessArgs in all cases where we need it. This also gets rid of kernel.Destroy method, since the only thing it was doing was DecRefing the mounts. Removing the need to call kernel.SetRootMountNamespace also allowed for some more simplifications in the container fs setup code. PiperOrigin-RevId: 261357060	2019-08-02 11:23:11 -07:00
Nicolas Lacasse	bad43772a1	Drop reference on fs.Inode if Mount goes wrong. PiperOrigin-RevId: 261203674	2019-08-01 14:57:49 -07:00
Nicolas Lacasse	f2b25aeac7	tmpfs and ramfs Dirs should drop references on children in Release(). This is the source of many warnings like: AtomicRefCount 0x7f5ff84e3500 owned by "fs.Inode" garbage collected with ref count of 1 (want 0) PiperOrigin-RevId: 261197093	2019-08-01 14:25:14 -07:00
Rahat Mahmood	79511e8a50	Implement getsockopt(TCP_INFO). Export some readily-available fields for TCP_INFO and stub out the rest. PiperOrigin-RevId: 261191548	2019-08-01 13:58:48 -07:00
Ian Lewis	0a246fab80	Basic support for 'ip route' Implements support for RTM_GETROUTE requests for netlink sockets. Fixes #507 PiperOrigin-RevId: 261051045	2019-07-31 20:30:09 -07:00
Jamie Liu	cbe145247a	Flipcall refinements. Note that some of these changes affect the protocol in backward-incompatible ways. - Replace use of "initially-active" and "initially-inactive" with "client" and "server" respectively for clarity. - Fix a race condition involving Endpoint.Shutdown() by repeatedly invoking FUTEX_WAKE until it is confirmed that no local thread is blocked in FUTEX_WAIT. - Drop flipcall.ControlMode. PiperOrigin-RevId: 260981382	2019-07-31 12:56:04 -07:00
Nicolas Lacasse	cf2b2d97d5	Initialize kernel.unimplementedSyscallEmitter with a sync.Once. This is initialized lazily on the first unimplemented syscall. Without the sync.Once, this is racy. PiperOrigin-RevId: 260971758	2019-07-31 12:00:35 -07:00
Austin Kiekintveld	12c4eb294a	Fix ICMPv4 EchoReply packet checksum The checksum was not being reset before being re-calculated and sent out. This caused the sent checksum to always be `0x0800`. Fixes #605. PiperOrigin-RevId: 260965059	2019-07-31 11:26:41 -07:00
Tamir Duberstein	c6e6d92cb1	Test connecting UDP sockets to the ANY address This doesn't currently pass on gVisor. While I'm here, fix a bug where connecting to the v6-mapped v4 address doesn't work in gVisor. PiperOrigin-RevId: 260923961	2019-07-31 07:41:20 -07:00
Jamie Liu	a7d5e0d254	Cache pages in CachingInodeOperations.Read when memory evictions are delayed. PiperOrigin-RevId: 260851452	2019-07-30 20:32:29 -07:00
Ayush Ranjan	5afa642deb	ext: Migrate from using fileReader custom interface to using io.Reader. It gets rid of holding state of the io.Reader offset (which is anyways held by the vfs.FileDescriptor struct). It is also odd using an io.Reader because we are using io.ReaderAt to interact with the device. So making an io.ReaderAt wrapper makes more sense. Most importantly, it gets rid of the complexity of extracting the file reader from a regular file implementation and then using it. Now we can just use the regular file implementation as a reader which is more intuitive. PiperOrigin-RevId: 260846927	2019-07-30 19:43:59 -07:00
Ayush Ranjan	9fbe984dc1	ext: block map file reader implementation. Also adds stress tests for block map reader and intensifies extent reader tests. PiperOrigin-RevId: 260838177	2019-07-30 18:20:31 -07:00
Tamir Duberstein	7369c63e42	Pass ProtocolAddress instead of its fields PiperOrigin-RevId: 260803517	2019-07-30 15:06:39 -07:00
gVisor bot	93b0917d23	Merge pull request #607 from DarcySail:master PiperOrigin-RevId: 260783254	2019-07-30 13:31:29 -07:00
Zach Koopmans	e511c0e05f	Add feature to launch Sentry from an open host FD. Adds feature to launch from an open host FD instead of a binary_path. The FD should point to a valid executable and most likely be statically compiled. If the executable is not statically compiled, the loader will search along the interpreter paths, which must be able to be resolved in the Sandbox's file system or start will fail. PiperOrigin-RevId: 260756825	2019-07-30 11:20:40 -07:00
Haibo Xu	1decf76471	Change syscall.POLL to syscall.PPOLL. syscall.POLL is not supported on arm64, using syscall.PPOLL to support both the x86 and arm64. refs #63 Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I2c81a063d3ec4e7e6b38fe62f17a0924977f505e COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/543 from xiaobo55x:master ba598263fd3748d1addd48e4194080aa12085164 PiperOrigin-RevId: 260752049	2019-07-30 11:01:29 -07:00
Ayush Ranjan	8da9f8a12c	Migrate from using io.ReadSeeker to io.ReaderAt. This provides the following benefits: - We can now use pkg/fd package which does not take ownership of the file descriptor. So it does not close the fd when garbage collected. This reduces scope of errors from unexpected garbage collection of io.File. - It enforces the offset parameter in every read call. It does not affect the fd offset nor is it affected by it. Hence reducing scope of error of using stale offsets when reading. - We do not need to serialize the usage of any global file descriptor anymore. So this drops the mutual exclusion req hence reducing complexity and congestion. PiperOrigin-RevId: 260635174	2019-07-29 20:12:37 -07:00
Hang Su	50f3447786	Combine multiple epoll events copies Allocate a larger memory buffer and combine multiple copies into one copy, to reduce the number of copies from kernel memory to user memory. Signed-off-by: Hang Su <darcy.sh@antfin.com>	2019-07-30 10:53:55 +08:00
Ayush Ranjan	ddf25e3331	ext: extent reader implementation. PiperOrigin-RevId: 260629559	2019-07-29 19:17:27 -07:00
Ayush Ranjan	b765eb4589	ext: inode implementations. PiperOrigin-RevId: 260624470	2019-07-29 18:33:55 -07:00
Christopher Koch	a3e9031e66	Use x/sys/unix for sentry/host interaction; abi is for guest/sentry. PiperOrigin-RevId: 260613864	2019-07-29 17:19:09 -07:00
Nicolas Lacasse	5fdb945a0d	Rate limit the unimplemented syscall event handler. This introduces two new types of Emitters: 1. MultiEmitter, which will forward events to other registered Emitters, and 2. RateLimitedEmitter, which will forward events to a wrapped Emitter, subject to given rate limits. The methods in the eventchannel package itself act like a multiEmitter, but is not actually an Emitter. Now we have a DefaultEmitter, and the methods in eventchannel simply forward calls to the DefaultEmitter. The unimplemented syscall handler now uses a RateLimitedEmitter that wraps the DefaultEmitter. PiperOrigin-RevId: 260612770	2019-07-29 17:12:50 -07:00
gVisor bot	b50122379c	Merge pull request #452 from zhangningdlut:chris_test_pidns PiperOrigin-RevId: 260220279	2019-07-26 15:00:51 -07:00
Fabricio Voznika	7052d21dc4	Automated rollback of changelist 255679453 PiperOrigin-RevId: 260047477	2019-07-25 16:48:49 -07:00
Ayush Ranjan	8376757495	ext: filesystem boilerplate code. PiperOrigin-RevId: 259865366	2019-07-24 19:08:21 -07:00
Ayush Ranjan	417096f781	ext: Add tests for root directory inode. PiperOrigin-RevId: 259856442	2019-07-24 17:59:57 -07:00
Ayush Ranjan	2ed832ff86	ext: testing environment setup with VFS2 support. PiperOrigin-RevId: 259835948	2019-07-24 16:03:30 -07:00
Chris Kuiper	40e682759f	Add support for a subnet prefix length on interface network addresses This allows the user code to add a network address with a subnet prefix length. The prefix length value is stored in the network endpoint and provided back to the user in the ProtocolAddress type. PiperOrigin-RevId: 259807693	2019-07-24 13:42:14 -07:00
chris.zn	1c5b6d9bd2	Use different pidns among different containers The different containers in a sandbox used only one pid namespace before. This results in that a container can see the processes in another container in the same sandbox. This patch use different pid namespace for different containers. Signed-off-by: chris.zn <chris.zn@antfin.com>	2019-07-24 13:38:23 +08:00
Ayush Ranjan	7e38d64333	ext: Inode creation logic. PiperOrigin-RevId: 259666476	2019-07-23 20:36:04 -07:00
Ayush Ranjan	d7bb79b6f1	ext: Add ext2 and ext3 tiny images. PiperOrigin-RevId: 259657917	2019-07-23 19:01:05 -07:00
Ayush Ranjan	bd7708956f	ext: Added extent tree building logic. PiperOrigin-RevId: 259628657	2019-07-23 15:51:50 -07:00
Nicolas Lacasse	04cbb13ce9	Give each container a distinct MountNamespace. This keeps all container filesystems completely separate from each other (including from the root container filesystem), and allows us to get rid of the "__runsc_containers__" directory. It also simplifies container startup/teardown as we don't have to muck around in the root container's filesystem. PiperOrigin-RevId: 259613346	2019-07-23 14:37:07 -07:00
Tamir Duberstein	12c256568b	Deduplicate EndpointState.connected some This fixes a bug introduced in cl/251934850 that caused connect-accept-close-connect races to result in the second connect call failiing when it should have succeeded. PiperOrigin-RevId: 259584525	2019-07-23 12:10:18 -07:00
Kevin Krakauer	5ddf9adb2b	Fix up and add some iptables ABI. PiperOrigin-RevId: 259437060	2019-07-22 17:06:18 -07:00
gVisor bot	d706922d78	Merge pull request #571 from lubinszARM:pr_loader PiperOrigin-RevId: 259427074	2019-07-22 16:12:46 -07:00
Andrei Vagin	ec906e46c0	kvm: fix race between machine.Put and machine.Get m.available.Signal() has to be called under m.mu.RLock, otherwise it can race with machine.Get: m.Get \| m.Put ------------------------------------- m.mu.Lock() \| Searching available vcpu\| \| m.available.Signal() m.available.Wait \| PiperOrigin-RevId: 259394051	2019-07-22 13:28:16 -07:00
Jamie Liu	fdac770f31	Fix struct statx field alignment. PiperOrigin-RevId: 259376740	2019-07-22 12:04:21 -07:00
Bin Lu	ffe45f38e6	Add ARM64 support to pkg/sentry/loader Signed-off-by: Bin Lu <bin.lu@arm.com>	2019-07-21 19:30:18 -07:00
gVisor bot	f544509c01	Merge pull request #450 from Pixep:feature/add-clock-boottime-as-monotonic PiperOrigin-RevId: 258996346	2019-07-19 10:44:45 -07:00
Chris Kuiper	0e040ba6e8	Handle interfaceAddr and NIC options separately for IP_MULTICAST_IF This tweaks the handling code for IP_MULTICAST_IF to ignore the InterfaceAddr if a NICID is given. PiperOrigin-RevId: 258982541	2019-07-19 09:29:04 -07:00
Andrei Vagin	eefa817cfd	net/tcp/setsockopt: implement setsockopt(fd, SOL_TCP, TCP_INQ) PiperOrigin-RevId: 258859507	2019-07-18 15:41:04 -07:00
Jamie Liu	163ab5e9ba	Sentry virtual filesystem, v2 Major differences from the current ("v1") sentry VFS: - Path resolution is Filesystem-driven (FilesystemImpl methods call vfs.ResolvingPath methods) rather than VFS-driven (fs package owns a Dirent tree and calls fs.InodeOperations methods to populate it). This drastically improves performance, primarily by reducing overhead from inefficient synchronization and indirection. It also makes it possible to implement remote filesystem protocols that translate FS system calls into single RPCs, rather than having to make (at least) one RPC per path component, significantly reducing the latency of remote filesystems (especially during cold starts and for uncacheable shared filesystems). - Mounts are correctly represented as a separate check based on contextual state (current mount) rather than direct replacement in a fs.Dirent tree. This makes it possible to support (non-recursive) bind mounts and mount namespaces. Included in this CL is fsimpl/memfs, an incomplete in-memory filesystem that exists primarily to demonstrate intended filesystem implementation patterns and for benchmarking: BenchmarkVFS1TmpfsStat/1-6 3000000 497 ns/op BenchmarkVFS1TmpfsStat/2-6 2000000 676 ns/op BenchmarkVFS1TmpfsStat/3-6 2000000 904 ns/op BenchmarkVFS1TmpfsStat/8-6 1000000 1944 ns/op BenchmarkVFS1TmpfsStat/64-6 100000 14067 ns/op BenchmarkVFS1TmpfsStat/100-6 50000 21700 ns/op BenchmarkVFS2MemfsStat/1-6 10000000 197 ns/op BenchmarkVFS2MemfsStat/2-6 5000000 233 ns/op BenchmarkVFS2MemfsStat/3-6 5000000 268 ns/op BenchmarkVFS2MemfsStat/8-6 3000000 477 ns/op BenchmarkVFS2MemfsStat/64-6 500000 2592 ns/op BenchmarkVFS2MemfsStat/100-6 300000 4045 ns/op BenchmarkVFS1TmpfsMountStat/1-6 2000000 679 ns/op BenchmarkVFS1TmpfsMountStat/2-6 2000000 912 ns/op BenchmarkVFS1TmpfsMountStat/3-6 1000000 1113 ns/op BenchmarkVFS1TmpfsMountStat/8-6 1000000 2118 ns/op BenchmarkVFS1TmpfsMountStat/64-6 100000 14251 ns/op BenchmarkVFS1TmpfsMountStat/100-6 100000 22397 ns/op BenchmarkVFS2MemfsMountStat/1-6 5000000 317 ns/op BenchmarkVFS2MemfsMountStat/2-6 5000000 361 ns/op BenchmarkVFS2MemfsMountStat/3-6 5000000 387 ns/op BenchmarkVFS2MemfsMountStat/8-6 3000000 582 ns/op BenchmarkVFS2MemfsMountStat/64-6 500000 2699 ns/op BenchmarkVFS2MemfsMountStat/100-6 300000 4133 ns/op From this we can infer that, on this machine: - Constant cost for tmpfs stat() is ~160ns in VFS2 and ~280ns in VFS1. - Per-path-component cost is ~35ns in VFS2 and ~215ns in VFS1, a difference of about 6x. - The cost of crossing a mount boundary is about 80ns in VFS2 (MemfsMountStat/1 does approximately the same amount of work as MemfsStat/2, except that it also crosses a mount boundary). This is an inescapable cost of the separate mount lookup needed to support bind mounts and mount namespaces. PiperOrigin-RevId: 258853946	2019-07-18 15:10:29 -07:00
Adrien Leravat	2d11fa05f7	sys_time: Wrap comments to 80 columns	2019-07-17 20:25:18 -07:00
Michael Pratt	6f7e2bb388	Take copyMu in Revalidate copyMu is required to read child.overlay.upper. PiperOrigin-RevId: 258662209	2019-07-17 16:12:01 -07:00
Jamie Liu	2bc398bfd8	Separate O_DSYNC and O_SYNC. PiperOrigin-RevId: 258657913	2019-07-17 15:52:38 -07:00
Ayush Ranjan	84a59de5dc	ext: disklayout: extents support. PiperOrigin-RevId: 258657776	2019-07-17 15:48:58 -07:00
Ayush Ranjan	8e3e021aca	ext: Filesystem init implementation. PiperOrigin-RevId: 258645957	2019-07-17 14:48:04 -07:00
gVisor bot	609cd91e3f	Merge pull request #355 from zhuangel:master PiperOrigin-RevId: 258643966	2019-07-17 14:38:22 -07:00
Bhasker Hariharan	542fbd01a7	Fix race in FDTable.GetFDs(). PiperOrigin-RevId: 258635459	2019-07-17 13:56:49 -07:00
Kevin Krakauer	9f1189130e	Add AF_UNIX, SOCK_RAW sockets, which exist for some reason. tcpdump creates these. PiperOrigin-RevId: 258611829	2019-07-17 11:49:16 -07:00
gVisor bot	682fd2d68f	Merge pull request #533 from kevinGC:stub-dev-tty PiperOrigin-RevId: 258607547	2019-07-17 11:28:30 -07:00
Michael Pratt	ca829158e3	Properly invalidate cache in rename and remove We were invalidating the wrong overlayEntry in rename and missing invalidation in rename and remove if lower exists. PiperOrigin-RevId: 258604685	2019-07-17 11:14:57 -07:00
gVisor bot	78a2704bde	Merge pull request #474 from zhuangel:proctasks PiperOrigin-RevId: 258479216	2019-07-16 18:12:07 -07:00
gVisor bot	74dc663bbb	Internal change. PiperOrigin-RevId: 258424489	2019-07-16 13:03:37 -07:00
Jianfeng Tan	cf4fc510fd	Support /proc/net/dev This proc file reports the stats of interfaces. We could use ifconfig command to check the result. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: Ia7c1e637f5c76c30791ffda68ee61e861b6ef827 COPYBARA_INTEGRATE_REVIEW=https://gvisor-review.googlesource.com/c/gvisor/+/18282/ PiperOrigin-RevId: 258303936	2019-07-15 22:51:05 -07:00
Andrei Vagin	6a8ff6daef	kvm: wake up all waiter of vCPU.state Now we call FUTEX_WAKE with ^uintptr(0) of waiters, but in this case only one waiter will be woken up. If we want to wake up all of them, the number of waiters has to be set to math.MaxInt32. PiperOrigin-RevId: 258285286	2019-07-15 19:27:18 -07:00
Kevin Krakauer	9b4d3280e1	Add IPPROTO_RAW, which allows raw sockets to write IP headers. iptables also relies on IPPROTO_RAW in a way. It opens such a socket to manipulate the kernel's tables, but it doesn't actually use any of the functionality. Blegh. PiperOrigin-RevId: 257903078	2019-07-12 18:09:12 -07:00
Tamir Duberstein	17bab652af	Check that IP headers contain correct version PiperOrigin-RevId: 257888338	2019-07-12 16:19:18 -07:00
Bhasker Hariharan	6116473b2f	Stub out support for TCP_MAXSEG. Adds support to set/get the TCP_MAXSEG value but does not really change the segment sizes emitted by netstack or alter the MSS advertised by the endpoint. This is currently being added only to unblock iperf3 on gVisor. Plumbing this correctly requires a bit more work which will come in separate CLs. PiperOrigin-RevId: 257859112	2019-07-12 13:35:17 -07:00
gVisor bot	eff2c264a4	Merge pull request #282 from zhangningdlut:chris_test_proc PiperOrigin-RevId: 257855479	2019-07-12 13:11:01 -07:00
Nicolas Lacasse	69e0affaec	Don't emit an event for extended attribute syscalls. These are filesystem-specific, and filesystems are allowed to return ENOTSUP if they are not supported. PiperOrigin-RevId: 257813477	2019-07-12 09:11:04 -07:00
Kevin	ddef7f8078	Fix license year and remove Read.	2019-07-11 21:31:26 -07:00
Kevin	44427d8e26	Add a stub for /dev/tty. Actual implementation to follow, but this will satisfy applications that want it to just exist.	2019-07-11 21:24:27 -07:00
Ayush Ranjan	2eeca68900	Added tiny ext4 image. The image is of size 64Kb which supports 64 1k blocks and 16 inodes. This is the smallest size mkfs.ext4 works with. Added README.md documenting how this was created and included all files on the device under assets. PiperOrigin-RevId: 257712672	2019-07-11 17:17:47 -07:00
Ayush Ranjan	5242face2e	ext: boilerplate code. Renamed ext4 to ext since we are targeting ext(2/3/4). Removed fs.go since we are targeting VFS2. Added ext.go with filesystem struct. PiperOrigin-RevId: 257689775	2019-07-11 15:05:36 -07:00
Liu Hua	7581e84cb6	tss: block userspace access to all I/O ports. A userspace process (CPL=3) can access an i/o port if the bit corresponding to the port is set to 0 in the I/O permission bitmap. Configure the I/O permission bitmap address beyond the last valid byte in the TSS so access to all i/o ports is blocked. Signed-off-by: Liu Hua <sdu.liu@huawei.com> Change-Id: I3df76980c3735491db768f7210e71703f86bb989 PiperOrigin-RevId: 257336518	2019-07-09 22:21:56 -07:00
Ayush Ranjan	7965b1272b	ext4: disklayout: Directory Entry implementation. PiperOrigin-RevId: 257314911	2019-07-09 18:36:02 -07:00
Adin Scannell	dea3cb92f2	build: add nogo for static validation PiperOrigin-RevId: 257297820	2019-07-09 16:44:06 -07:00
Adin Scannell	cceef9d2cf	Cleanup straggling syscall dependencies. PiperOrigin-RevId: 257293198	2019-07-09 16:18:02 -07:00
Nicolas Lacasse	6db3f8d54c	Don't mask errors in createAt loop. The error set in the loop in createAt was being masked by other errors declared with ":=". This allowed an ErrResolveViaReadlink error to escape, which can cause a sentry panic. Added test case which repros without the fix. PiperOrigin-RevId: 257061767	2019-07-08 14:57:15 -07:00
Nicolas Lacasse	659bebab8e	Don't try to execute a file that is not regular. PiperOrigin-RevId: 257037608	2019-07-08 12:56:48 -07:00
Ayush Ranjan	8f9b1ca8e7	ext4: disklayout: inode impl. PiperOrigin-RevId: 257010414	2019-07-08 10:44:11 -07:00
Andrei Vagin	67f2cefce0	Avoid importing platforms from many source files PiperOrigin-RevId: 256494243	2019-07-03 22:51:26 -07:00
Ian Lewis	da57fb9d25	Fix syscall doc for getresgid PiperOrigin-RevId: 256481284	2019-07-03 20:13:19 -07:00
Neel Natu	9f2f9f0cab	futex: compare keys for equality when doing a FUTEX_UNLOCK_PI. PiperOrigin-RevId: 256453827	2019-07-03 16:01:38 -07:00
Andrei Vagin	116cac053e	netstack/udp: connect with the AF_UNSPEC address family means disconnect PiperOrigin-RevId: 256433283	2019-07-03 14:19:02 -07:00
gVisor bot	f10862696c	Merge pull request #493 from ahmetb:reticulating-splines PiperOrigin-RevId: 256319059	2019-07-03 01:10:34 -07:00
Yong He	85b27a9f8f	Solve BounceToKernel may hang issue BounceToKernel makes a vCPU exit from guest ring3 to guest ring0, but vCPUWaiter is not cleared when we unlock the vCPU. The next time this vCPU enters guest ring3, it may do so with the vCPUWaiter bit still set, which causes the following BounceToKernel on this vCPU to hang at waitUntilNot. Halt may work around this issue, because the halt process resets the vCPU state to vCPUUser and notifies all waiters of the state change, but if there is no exception or syscall in this period, BounceToKernel will hang at waitUntilNot. PiperOrigin-RevId: 256299660	2019-07-02 22:03:28 -07:00
Adin Scannell	753da9604e	Remove map from fd_map, change to fd_table. This renames FDMap to FDTable and drops the kernel.FD type, which had an entire package to itself and didn't serve much use (it was freely cast between types, and served as more of an annoyance than providing any protection.) Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup operation, and 10-15 ns per concurrent lookup operation of savings. This also fixes two tangential usage issues with the FDMap. Namely, non-atomic use of NewFDFrom and associated calls to Remove (that are both racy and fail to drop the reference on the underlying file.) PiperOrigin-RevId: 256285890	2019-07-02 19:28:59 -07:00
Ian Lewis	3f14caeb99	Add documentation for remaining syscalls (fixes #197, #186) Adds support level documentation for all syscalls. Removes the Undocumented utility function to discourage usage while leaving SupportUndocumented as the default support level for Syscall structs. PiperOrigin-RevId: 256281927	2019-07-02 18:45:16 -07:00
Ayush Ranjan	d8ec2fb671	Ext4: DiskLayout: Inode interface. PiperOrigin-RevId: 256234390	2019-07-02 14:04:31 -07:00
gVisor bot	d60ae0ddee	Merge pull request #279 from kevinGC:iptables-1-pkg PiperOrigin-RevId: 256231055	2019-07-02 13:48:06 -07:00
Nicolas Lacasse	4f2f44320f	Simplify (and fix) refcounts in createAt. fileOpAt holds references on the Dirents passed as arguments to the callback, and drops refs when finished, so we don't need to DecRef those Dirents ourselves. However, all Dirents that we get from FindInode/FindLink must be DecRef'd. This CL cleans up the ref-counting logic, and fixes some refcount issues in the process. PiperOrigin-RevId: 256220882	2019-07-02 12:58:58 -07:00
Ahmet Alp Balkan	4cd28c6e27	sentry/kernel: add syslog message It feels like "reticulating splines" is missing from the list of meaningless syslog messages. Signed-off-by: Ahmet Alp Balkan <ahmetb@google.com>	2019-07-02 12:05:41 -07:00
Ian Gudger	0aa9418a77	Fix unix/transport.queue reference leaks. Fix two leaks for connectionless Unix sockets: * Double connect: Subsequent connects would leak a reference on the previously connected endpoint. * Close unconnected: Sockets which were not connected at the time of closure would leak a reference on their receiver. PiperOrigin-RevId: 256070451	2019-07-01 17:46:24 -07:00
Nicolas Lacasse	06537129a6	Check remaining traversal limit when creating a file through a symlink. This fixes the case when an app tries to create a file that already exists, and is a symlink to itself. A test was added. PiperOrigin-RevId: 256044811	2019-07-01 15:25:22 -07:00
Ian Gudger	3446f4e29b	Add stack trace printing to reference leak checking. PiperOrigin-RevId: 255759891	2019-06-29 09:23:22 -07:00
Adin Scannell	6d204f6a34	Drop local_server support. PiperOrigin-RevId: 255713414	2019-06-28 20:35:10 -07:00
Ian Gudger	45566fa4e4	Add finalizer on AtomicRefCount to check for leaks. PiperOrigin-RevId: 255711454	2019-06-28 20:07:52 -07:00
Adin Scannell	7dae043fec	Drop ashmem and binder. These are unfortunately unused and unmaintained. They can be brought back in the future if need requires it. PiperOrigin-RevId: 255697132	2019-06-28 17:20:25 -07:00
Nicolas Lacasse	d3f97aec49	Remove events from name_to_handle_at and open_by_handle_at. These syscalls require filesystem support that gVisor does not provide, and is not planning to implement. Their absence should not trigger an event. PiperOrigin-RevId: 255692871	2019-06-28 16:50:24 -07:00
Ayush Ranjan	c4da599e22	ext4: disklayout: SuperBlock interface implementations. PiperOrigin-RevId: 255687771	2019-06-28 16:18:29 -07:00
Nicolas Lacasse	295078fa7a	Automated rollback of changelist 255263686 PiperOrigin-RevId: 255679453	2019-06-28 15:28:41 -07:00
Andrei Vagin	e21d49c2d8	platform/ptrace: return more detailed errors Right now, if we can't create a stub process, we will see this error: panic: unable to activate mm: resource temporarily unavailable It would be better to know the root cause of this "resource temporarily unavailable". PiperOrigin-RevId: 255656831	2019-06-28 13:23:36 -07:00
Ayush Ranjan	7c13789818	Superblock interface in the disk layout package for ext4. PiperOrigin-RevId: 255644277	2019-06-28 12:07:28 -07:00
Yong He	c61d7761b4	Fix deadloop in proc subtask list Readdir of /proc/x/task/ gets directory entries from the tasks of the specified task group. The tasks slice is currently unsorted, so using sort.SearchInts to search it may cause infinite loops. The fix is to sort the slice before searching. This issue can be easily reproduced with the following steps: revise Readdir in pkg/sentry/fs/proc/task.go to force taskInts to the test slice []int{1, 11, 7, 5, 10, 6, 8, 3, 9, 2, 4}, then run a docker image and run ls /proc/1/task; the command will loop forever.	2019-06-28 22:20:57 +08:00
Fabricio Voznika	b2907595e5	Complete pipe support on overlayfs Get/Set pipe size and ioctl support were missing from overlayfs. It required moving the pipe.Sizer interface to fs so that overlay could get access. Fixes #318 PiperOrigin-RevId: 255511125	2019-06-27 17:22:53 -07:00
Michael Pratt	5b41ba5d0e	Fix various spelling issues in the documentation Addresses obvious typos, in the documentation only. COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65 PiperOrigin-RevId: 255477779	2019-06-27 14:25:50 -07:00
Michael Pratt	085a907565	Cache directory entries in the overlay Currently, the overlay dirCache is only used for a single logical use of getdents. i.e., it is discarded when the FD is closed or seeked back to the beginning. But the initial work of getting the directory contents can be quite expensive (particularly sorting large directories), so we should keep it as long as possible. This is very similar to the readdirCache in fs/gofer. Since the upper filesystem does not have to allow caching readdir entries, the new CacheReaddir MountSourceOperations method controls this behavior. This caching should be trivially movable to all Inodes if desired, though that adds an additional copy step for non-overlay Inodes. (Overlay Inodes already do the extra copy). PiperOrigin-RevId: 255477592	2019-06-27 14:24:03 -07:00
Andrei Vagin	e276083903	gvisor/ptrace: grab initial thread registers only once PiperOrigin-RevId: 255465635	2019-06-27 13:59:57 -07:00
Fabricio Voznika	42e212f6b7	Preserve permissions when checking lower The code was wrongly assuming that only read access was required from the lower overlay when checking for permissions. This allowed non-writable files to be writable in the overlay. Fixes #316 PiperOrigin-RevId: 255263686	2019-06-26 14:24:44 -07:00
Nicolas Lacasse	857e5c47e9	Follow symlinks when creating a file, and create the target. If we have a symlink whose target does not exist, creating the symlink (either via 'creat' or 'open' with O_CREAT flag) should create the target of the symlink. Previously, gVisor would error with EEXIST in this case PiperOrigin-RevId: 255232944	2019-06-26 11:49:20 -07:00
Michael Pratt	e98ce4a2c6	Add TODO reminder to remove tmpfs caching options Updates #179 PiperOrigin-RevId: 255081565	2019-06-25 17:12:34 -07:00
Jamie Liu	ffee0f36b1	Add //pkg/fdchannel. To accompany flipcall connections in cases where passing FDs is required (as for gofers). PiperOrigin-RevId: 255062277	2019-06-25 15:38:11 -07:00
Andrei Vagin	03ae91c662	gvisor: lockless read access for task credentials Credentials are immutable and even before these changes we could read them without locks, but we needed to take a task lock to get a credential object from a task object. It is possible to avoid this lock, if we will guarantee that a credential object will not be changed after setting it on a task. PiperOrigin-RevId: 254989492	2019-06-25 09:52:49 -07:00
Adrien Leravat	3688e6e99d	Add CLOCK_BOOTTIME as a CLOCK_MONOTONIC alias Makes CLOCK_BOOTTIME available with clock_gettime, timerfd_create, and the clock_gettime vDSO. CLOCK_BOOTTIME is implemented as an alias to CLOCK_MONOTONIC. CLOCK_MONOTONIC already keeps track of time across save and restore. This is the closest possible behavior to Linux CLOCK_BOOTTIME, as there is no concept of suspend/resume. Updates google/gvisor#218	2019-06-24 21:14:38 -07:00
Andrei Vagin	e9ea7230f7	fs: synchronize concurrent writes into files with O_APPEND For files with O_APPEND, a file write operation gets a file size and uses it as offset to call an inode write operation. This means that all other operations which can change a file size should be blocked while the write operation doesn't complete. PiperOrigin-RevId: 254873771	2019-06-24 17:45:02 -07:00
Adin Scannell	7f5d0afe52	Add O_EXITKILL to ptrace options. This prevents a race before PDEATH_SIG can take effect during a sentry crash. Discovered and solution by avagin@. PiperOrigin-RevId: 254871534	2019-06-24 17:30:01 -07:00
Rahat Mahmood	94a6bfab5d	Implement /proc/net/tcp. PiperOrigin-RevId: 254854346	2019-06-24 15:56:36 -07:00
Andrei Vagin	c5486f5122	platform/ptrace: specify PTRACE_O_TRACEEXIT for stub-processes The tracee is stopped early during process exit, when registers are still available, allowing the tracer to see where the exit occurred, whereas the normal exit notification is done after the process is finished exiting. Without this option, dumpAndPanic fails to get registers. PiperOrigin-RevId: 254852917	2019-06-24 15:48:58 -07:00
Nicolas Lacasse	87df9aab24	Use correct statx syscall number for amd64. The previous number was for the arm architecture. Also change the statx tests to force them to run on gVisor, which would have caught this issue. PiperOrigin-RevId: 254846831	2019-06-24 15:19:36 -07:00
Fabricio Voznika	b21b1db700	Allow to change logging options using 'runsc debug' New options are: runsc debug --strace=off\|all\|function1,function2 runsc debug --log-level=warning\|info\|debug runsc debug --log-packets=true\|false Updates #407 PiperOrigin-RevId: 254843128	2019-06-24 15:03:02 -07:00
chris.zn	f957fb23cf	Return ENOENT when reading /proc/{pid}/task of an exited process There will be a deadloop when we use getdents to read /proc/{pid}/task of an exited process Like this: Process A is running Process B: open /proc/{pid of A}/task Process A exits Process B: getdents /proc/{pid of A}/task Then, process B will fall into deadloop, and return "." and ".." in loops and never ends. This patch returns ENOENT when use getdents to read /proc/{pid}/task if the process is just exited. Signed-off-by: chris.zn <chris.zn@antfin.com>	2019-06-24 15:49:53 +08:00
Nicolas Lacasse	35719d52c7	Implement statx. We don't have the plumbing for btime yet, so that field is left off. The returned mask indicates that btime is absent. Fixes #343 PiperOrigin-RevId: 254575752	2019-06-22 13:29:26 -07:00
Bhasker Hariharan	c1761378a9	Fix the logic for sending zero window updates. Today we have the logic split in two places between endpoint Read() and the worker goroutine which actually sends a zero window. This change makes it so that when a zero window ACK is sent we set a flag in the endpoint which can be read by the endpoint to decide if it should notify the worker to send a nonZeroWindow update. The worker now does not do the check again but instead sends an ACK and flips the flag right away. Similarly today when SO_RECVBUF is set the SetSockOpt call has logic to decide if a zero window update is required. Rather than do that we move the logic to the worker goroutine and it can check the zeroWindow flag and send an update if required. PiperOrigin-RevId: 254505447	2019-06-21 18:31:31 -07:00
Andrei Vagin	ab6774cebf	gvisor/fs: getdents returns 0 if offset is equal to FileMaxOffset FileMaxOffset is a special case when lseek(d, 0, SEEK_END) has been called. PiperOrigin-RevId: 254498777	2019-06-21 17:25:17 -07:00
Michael Pratt	6f933a934f	Remove O(n) lookup on unlink/rename Currently, the path tracking in the gofer involves an O(n) lookup of child fidRefs. This causes a significant overhead on unlinks in directories with lots of child fidRefs (<4k). In this transition, pathNode moves from sync.Map to normal synchronized maps. There is a small chance of contention in walk, but the lock is held for a very short time (and sync.Map also had a chance of requiring locking). OTOH, sync.Map makes it very difficult to add a fidRef reverse map. PiperOrigin-RevId: 254489952	2019-06-21 16:27:26 -07:00
Brad Burlage	ae4ef32b8c	Deflake TestSimpleReceive failures due to timeouts This test will occasionally fail waiting to read a packet. From repeated runs, I've seen it up to 1.5s for waitForPackets to complete. PiperOrigin-RevId: 254484627	2019-06-21 15:56:12 -07:00
Ayush Ranjan	727375321f	ext4 block group descriptor implementation in disk layout package. PiperOrigin-RevId: 254482180	2019-06-21 15:42:46 -07:00
Jamie Liu	e806466fc5	Add //pkg/flipcall. Flipcall is a (conceptually) simple local-only RPC mechanism. Compared to unet, Flipcall does not support passing FDs (support for which will be provided out of band by another package), requires users to establish connections manually, and requires user management of concurrency since each connected Endpoint pair supports only a single RPC at a time; however, it improves performance by using shared memory for data (reducing memory copies) and using futexes for control signaling (which is much cheaper than sendto/recvfrom/sendmsg/recvmsg). PiperOrigin-RevId: 254471986	2019-06-21 14:47:04 -07:00
Fabricio Voznika	5ba16d51a9	Add list of stuck tasks to panic message PiperOrigin-RevId: 254450309	2019-06-21 12:46:53 -07:00
Michael Pratt	c0317b28cb	Update pathNode documentation to reflect reality Neither fidRefs or children are (directly) synchronized by mu. Remove the preconditions that say so. That said, the surrounding does enforce some synchronization guarantees (e.g., fidRef.renameChildTo does not atomically replace the child in the maps). I've tried to note the need for callers to do this synchronization. I've also renamed the maps to what are (IMO) clearer names. As is, it is not obvious that pathNode.fidRefs is a map of child fidRefs rather than self fidRefs. PiperOrigin-RevId: 254446965	2019-06-21 12:26:42 -07:00
Andrei Vagin	f94653b3de	kernel: call t.mu.Unlock() explicitly in WithMuLocked defer here doesn't improve readability, but we know it is slower than the explicit call. PiperOrigin-RevId: 254441473	2019-06-21 11:55:42 -07:00
Fabricio Voznika	054b5632ef	Update comment PiperOrigin-RevId: 254428866	2019-06-21 10:56:42 -07:00
Jamie Liu	7db8685100	Preallocate auth.NewAnonymousCredentials() in contexttest.TestContext. Otherwise every call to, say, fs.ContextCanAccessFile() in a benchmark using contexttest allocates new auth.Credentials, a new auth.UserNamespace, ... PiperOrigin-RevId: 254261051	2019-06-20 13:36:14 -07:00
Michael Pratt	292f70cbf7	Add package docs to seqfile and ramfs These are the only packages missing docs: https://godoc.org/gvisor.dev/gvisor PiperOrigin-RevId: 254261022	2019-06-20 13:34:33 -07:00
Rahat Mahmood	ddc1d94a37	Unmark amutex_test as flaky. PiperOrigin-RevId: 254254058	2019-06-20 12:58:04 -07:00
Neel Natu	0b2135072d	Implement madvise(MADV_DONTFORK) PiperOrigin-RevId: 254253777	2019-06-20 12:56:00 -07:00
Ian Gudger	7e49515696	Deflake SendFileTest_Shutdown. The sendfile syscall's backing doSplice contained a race with regard to blocking. If the first attempt failed with syserror.ErrWouldBlock and then the blocking file became ready before registering a waiter, we would just return the ErrWouldBlock (even if we were supposed to block). PiperOrigin-RevId: 254114432	2019-06-19 18:40:54 -07:00
Michael Pratt	9d2efaac5a	Add renamed children pathNodes to target parent Otherwise future renames may miss Renamed calls. PiperOrigin-RevId: 254060946	2019-06-19 13:41:07 -07:00
Nicolas Lacasse	29f9e4fa87	fileOp{On,At} should pass the remaining symlink traversal count. And methods that do more traversals should use the remaining count rather than resetting. PiperOrigin-RevId: 254041720	2019-06-19 11:56:34 -07:00
Nicolas Lacasse	f7428af9c1	Add MountNamespace to task. This allows tasks to have distinct mount namespace, instead of all sharing the kernel's root mount namespace. Currently, the only way for a task to get a different mount namespace than the kernel's root is by explicitly setting a different MountNamespace in CreateProcessArgs, and nothing does this (yet). In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a new container inside runsc. Note that "MountNamespace" is a poor term for this thing. It's more like a distinct VFS tree. When we get around to adding real mount namespaces, this will need a better name. PiperOrigin-RevId: 254009310	2019-06-19 09:21:21 -07:00
Fabricio Voznika	ca245a428b	Attempt to fix TestPipeWritesAccumulate Test fails because it's reading 4KB instead of the expected 64KB. Changed the test to read pipe buffer size instead of hardcode and added some logging in case the reason for failure was not pipe buffer size. PiperOrigin-RevId: 253916040	2019-06-18 19:16:11 -07:00
Andrei Vagin	8ab0848c70	gvisor/fs: don't update file.offset for sockets, pipes, etc sockets, pipes and other non-seekable file descriptors don't use file.offset, so we don't need to update it. With this change, we will be able to call file operations without locking the file.mu mutex. This is already used for pipes in the splice system call. PiperOrigin-RevId: 253746644	2019-06-18 01:43:29 -07:00
Yong He	0dbdca349c	Skip tid allocation which is using When leader of process group (session) exit, the process group ID (session ID) is holding by other processes in the process group, so the process group ID (session ID) can not be reused. If reusing the process group ID (session ID) as new process group ID for new process, this will cause session create failed, and later runsc crash when access process group. The fix skip the tid if it is using by a process group (session) when allocating a new tid. We could easily reproduce the runsc crash follow these steps: 1. build test program, and run inside container int main(int argc, char *argv[]) { pid_t cpid, spid; cpid = fork(); if (cpid == -1) { perror("fork"); exit(EXIT_FAILURE); } if (cpid == 0) { pid_t sid = setsid(); printf("Start New Session %ld\n",sid); printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n", getpid(),getppid(),getpgid(getpid()),getsid(getpid())); spid = fork(); if (spid == 0) { setpgid(getpid(), getpid()); printf("Set GrandSon as New Process Group\n"); printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n", getpid(),getppid(),getpgid(getpid()),getsid(getpid())); while(1) { usleep(1); } } sleep(3); exit(0); } else { exit(0); } return 0; } 2. build hello program int main(int argc, char *argv[]) { printf("Current PID is %ld\n", (long) getpid()); return 0; } 3. run script on host which run hello inside container, you can speed up the test with set TasksLimit as lower value. for (( i=0; i<65535; i++ )) do docker exec <container id> /test/hello done 4. when hello process reusing the process group of loop process, runsc will crash. panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8] goroutine 612475 [running]: gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0) pkg/sentry/kernel/sessions.go:160 +0x78 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0) pkg/sentry/kernel/task_exit.go:672 +0x2b7 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0) pkg/sentry/kernel/task_exit.go:542 +0xc4 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc) pkg/sentry/kernel/task_run.go:91 +0x194 created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start pkg/sentry/kernel/task_start.go:286 +0xfe	2019-06-14 14:05:41 +08:00
Bhasker Hariharan	3d71c627fa	Add support for TCP receive buffer auto tuning. The implementation is similar to linux where we track the number of bytes consumed by the application to grow the receive buffer of a given TCP endpoint. This ensures that the advertised window grows at a reasonable rate to accommodate for the sender's rate and prevents large amounts of data being held in stack buffers if the application is not actively reading or not reading fast enough. The original paper that was used to implement the linux receive buffer auto-tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf NOTE: Linux does not implement DRS as defined in that paper, it's just a good reference to understand the solution space. Updates #230 PiperOrigin-RevId: 253168283	2019-06-13 22:28:01 -07:00
Ian Gudger	3e9b8ecbfe	Plumb context through more layers of filesystem. All functions which allocate objects containing AtomicRefCounts will soon need a context. PiperOrigin-RevId: 253147709	2019-06-13 18:40:38 -07:00
Ian Gudger	0a5ee6f7b2	Fix deadlock in fasync. The deadlock can occur when both ends of a connected Unix socket which has FIOASYNC enabled on at least one end are closed at the same time. One end notifies that it is closing, calling (*waiter.Queue).Notify which takes waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which takes FileAsync.mu. The other end tries to unregister for notifications by calling (*FileAsync).Unregister, which takes FileAsync.mu and calls (*waiter.Queue).EventUnregister which takes waiter.Queue.mu. This is fixed by moving the calls to waiter.Waitable.EventRegister and waiter.Waitable.EventUnregister outside of the protection of any mutex used in (*FileAsync).Callback. The new test is related, but does not cover this particular situation. Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter calling (*FileAsync).Callback could not and did not. This is fixed by making FileAsync.e.Callback immutable before passing it to the waiter for the first time. Fixes #346 PiperOrigin-RevId: 253138340	2019-06-13 17:26:22 -07:00
Rahat Mahmood	05ff1ffaad	Implement getsockopt() SO_DOMAIN, SO_PROTOCOL and SO_TYPE. SO_TYPE was already implemented for everything but netlink sockets. PiperOrigin-RevId: 253138157	2019-06-13 17:24:51 -07:00
Adin Scannell	add40fd6ad	Update canonical repository. This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620	2019-06-13 16:50:15 -07:00
Jamie Liu	0c8603084d	Add p9 and unet benchmarks. PiperOrigin-RevId: 253122166	2019-06-13 15:53:43 -07:00
Adin Scannell	e352f46478	Minor BUILD file cleanup. PiperOrigin-RevId: 252918338	2019-06-12 15:59:46 -07:00
Kevin Krakauer	0bbbcafd68	Merge branch 'master' into iptables-1-pkg Change-Id: I7457a11de4725e1bf3811420c505d225b1cb6943	2019-06-12 15:21:22 -07:00
Bhasker Hariharan	70578806e8	Add support for TCP_CONGESTION socket option. This CL also cleans up the error returned for setting congestion control which was incorrectly returning EINVAL instead of ENOENT. PiperOrigin-RevId: 252889093	2019-06-12 13:35:50 -07:00
Andrei Vagin	0d05a12fd3	gvisor/ptrace: print guest registers if a stub stopped with unexpected code PiperOrigin-RevId: 252855280	2019-06-12 10:48:46 -07:00
Adin Scannell	df110ad4fe	Eat sendfile partial error For sendfile(2), we propagate a TCP error through the system call layer. This should be eaten if there is a partial result. This change also adds a test to ensure that there is no panic in this case, for both TCP sockets and unix domain sockets. PiperOrigin-RevId: 252746192	2019-06-11 19:24:35 -07:00
Fabricio Voznika	fc746efa9a	Add support to mount pod shared tmpfs mounts Parse annotations containing 'gvisor.dev/spec/mount' that give hints about how mounts are shared between containers inside a pod. This information can be used to better inform how to mount these volumes inside gVisor. For example, a volume that is shared between containers inside a pod can be bind mounted inside the sandbox, instead of being two independent mounts. For now, this information is used to allow the same tmpfs mounts to be shared between containers which wasn't possible before. PiperOrigin-RevId: 252704037	2019-06-11 14:54:31 -07:00
Ian Lewis	74e397e39a	Add introspection for Linux/AMD64 syscalls Adds simple introspection for syscall compatibility information to Linux/AMD64. Syscalls registered in the syscall table now have associated metadata like name, support level, notes, and URLs to relevant issues. Syscall information can be exported as a table, JSON, or CSV using the new 'runsc help syscalls' command. Users can use this info to debug and get info on the compatibility of the version of runsc they are running or to generate documentation. PiperOrigin-RevId: 252558304	2019-06-10 23:38:36 -07:00
Jamie Liu	589f36ac4a	Move //pkg/sentry/platform/procid to //pkg/procid. PiperOrigin-RevId: 252501653	2019-06-10 15:47:25 -07:00
Bhasker Hariharan	3933dd5c04	Fixes to listen backlog handling. Changes netstack to conform to current Linux behaviour where if the backlog is full then we drop the SYN and do not send a SYN-ACK. Similarly we allow up to backlog connections to be in SYN-RCVD state as long as the backlog is not full. We also now drop a SYN if syn cookies are in use and the backlog for the listening endpoint is full. Added new tests to confirm the behaviour. Also reverted the change to increase the backlog in TcpPortReuseMultiThread syscall test. Fixes #236 PiperOrigin-RevId: 252500462	2019-06-10 15:40:44 -07:00
Rahat Mahmood	a00157cc0e	Store more information in the kernel socket table. Store enough information in the kernel socket table to distinguish between different types of sockets. Previously we were only storing the socket family, but this isn't enough to classify sockets. For example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets are SOCK_DGRAM sockets with a particular protocol. Instead of creating more sub-tables, flatten the socket table and provide a filtering mechanism based on the socket entry. Also generate and store a socket entry index ("sl" in linux) which allows us to output entries in a stable order from procfs. PiperOrigin-RevId: 252495895	2019-06-10 15:17:43 -07:00
Kevin Krakauer	06a83df533	Address more comments. Change-Id: I83ae1079f3dcba6b018f59ab7898decab5c211d2	2019-06-10 12:43:54 -07:00
Jamie Liu	48961d27a8	Move //pkg/sentry/memutil to //pkg/memutil. PiperOrigin-RevId: 252124156	2019-06-07 14:52:27 -07:00
Kevin Krakauer	8afbd974da	Address Ian's comments. Change-Id: I7445033b1970cbba3f2ed0682fe520dce02d8fad	2019-06-07 12:54:53 -07:00
Jamie Liu	c933f3eede	Change visibility of //pkg/sentry/time. PiperOrigin-RevId: 251965598	2019-06-06 17:58:55 -07:00
Jamie Liu	9ea248489b	Cap initial usermem.CopyStringIn buffer size. Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this size is measurably inefficient in most cases: most paths will not be this long, 4 KB is a lot of bytes to zero, and as of this writing the Go runtime allocator maps only two 4 KB objects to each 8 KB span, necessitating a call to runtime.mcache.refill() on ~every other call. Limit the initial buffer size to 256 B instead, and geometrically reallocate if necessary. PiperOrigin-RevId: 251960441	2019-06-06 17:22:00 -07:00
Rahat Mahmood	315cf9a523	Use common definition of SockType. SockType isn't specific to unix domain sockets, and the current definition basically mirrors the linux ABI's definition. PiperOrigin-RevId: 251956740	2019-06-06 17:00:27 -07:00
Fabricio Voznika	02ab1f187c	Copy up parent when binding UDS on overlayfs Overlayfs was expecting the parent to exist when bind(2) was called, which may not be the case. The fix is to copy the parent directory to the upper layer before binding the UDS. There is no good place to add tests for it. Syscall tests would be ideal, but it's hard to guarantee that the directory where the socket is created hasn't been touched before (and thus copied the parent to the upper layer). Added it to runsc integration tests for now. If it turns out we have lots of these kinds of tests, we can consider moving them somewhere more appropriate. PiperOrigin-RevId: 251954156	2019-06-06 16:45:51 -07:00
Jamie Liu	b3f104507d	"Implement" mbind(2). We still only advertise a single NUMA node, and ignore mempolicy accordingly, but mbind() at least now succeeds and has effects reflected by get_mempolicy(). Also fix handling of nodemasks: round sizes to unsigned long (as documented and done by Linux), and zero trailing bits when copying them out. PiperOrigin-RevId: 251950859	2019-06-06 16:29:46 -07:00
Jamie Liu	a26043ee53	Implement reclaim-driven MemoryFile eviction. PiperOrigin-RevId: 251950660	2019-06-06 16:27:55 -07:00
Rahat Mahmood	2d2831e354	Track and export socket state. This is necessary for implementing network diagnostic interfaces like /proc/net/{tcp,udp,unix} and sock_diag(7). For pass-through endpoints such as hostinet, we obtain the socket state from the backend. For netstack, we add explicit tracking of TCP states. PiperOrigin-RevId: 251934850	2019-06-06 15:04:47 -07:00
Bhasker Hariharan	85be01b42d	Add multi-fd support to fdbased endpoint. This allows an fdbased endpoint to have multiple underlying fd's from which packets can be read and dispatched/written to. This should allow for higher throughput as well as better scalability of the network stack as number of connections increases. Updates #231 PiperOrigin-RevId: 251852825	2019-06-06 08:07:02 -07:00
Andrei Vagin	79f7cb6c1c	netstack/sniffer: log GSO attributes PiperOrigin-RevId: 251788534	2019-06-05 22:51:53 -07:00
Michael Pratt	57772db2e7	Shutdown host sockets on internal shutdown This is required to make the shutdown visible to peers outside the sandbox. The readClosed / writeClosed fields were dropped, as they were preventing a shutdown socket from reading the remainder of queued bytes. The host syscalls will return the appropriate errors for shutdown. The control message tests have been split out of socket_unix.cc to make the (few) remaining tests accessible to testing inherited host UDS, which don't support sending control messages. Updates #273 PiperOrigin-RevId: 251763060	2019-06-05 18:40:37 -07:00
Andrei Vagin	a12848ffeb	netstack/tcp: fix calculating a number of outstanding packets In case of GSO, a segment can contain more than one packet and we need to use the pCount() helper to get the number of packets. PiperOrigin-RevId: 251743020	2019-06-05 16:30:45 -07:00
Chris Kuiper	d18bb4f38a	Adjust route when looping multicast packets Multicast packets are special in that their destination address does not identify a specific interface. When sending out such a packet the multicast address is the remote address, but for incoming packets it is the local address. Hence, when looping a multicast packet, the route needs to be tweaked to reflect this. PiperOrigin-RevId: 251739298	2019-06-05 16:08:29 -07:00
Michael Pratt	d3ed9baac0	Implement dumpability tracking and checks We don't actually support core dumps, but some applications want to get/set dumpability, which still has an effect in procfs. Lack of support for set-uid binaries or fs creds simplifies things a bit. As-is, processes started via CreateProcess (i.e., init and sentryctl exec) have normal dumpability. I'm a bit torn on whether sentryctl exec tasks should be dumpable, but at least since they have no parent normal UID/GID checks should protect them. PiperOrigin-RevId: 251712714	2019-06-05 14:00:13 -07:00
Bhasker Hariharan	e0fb921205	Fix data race in synRcvdState. When checking the length of the acceptedChan we should hold the endpoint mutex otherwise a syn received while the listening socket is being closed can result in a data race where the cleanupLocked routine sets acceptedChan to nil while a handshake goroutine in progress could try and check it at the same time. PiperOrigin-RevId: 251537697	2019-06-04 16:17:24 -07:00
Yong He	7398f013f0	Drop one dirent reference after it is referenced by a file. When a pipe is created, a dirent for the pipe is created with its initial reference count set to 0. Because a dirent is only destroyed when its reference count drops to -1, there is already an 'initial reference' on the dirent after it is created. To destroy the dirent once all references are released, the correct approach is to drop the 'initial reference' once someone holds a reference to the dirent, such as fs.NewFile; otherwise the reference count of the dirent stays at 0 forever, which leaks the dirent. Besides pipe, timerfd/eventfd/epoll have the same problem. Here is a simple test case in C that leaks dirents for pipe/timerfd/eventfd/epoll; after running it, pprof the runsc process and you will find many pipe/timerfd/eventfd/epoll dirents that were not freed: int main(int argc, char *argv[]) { int i; int n; int pipefd[2]; if (argc != 3) { printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]); } n = strtol(argv[2], NULL, 10); if (strcmp(argv[1], "epoll") == 0) { for (i = 0; i < n; ++i) close(epoll_create(1)); } else if (strcmp(argv[1], "timerfd") == 0) { for (i = 0; i < n; ++i) close(timerfd_create(CLOCK_REALTIME, 0)); } else if (strcmp(argv[1], "eventfd") == 0) { for (i = 0; i < n; ++i) close(eventfd(0, 0)); } else if (strcmp(argv[1], "pipe") == 0) { for (i = 0; i < n; ++i) if (pipe(pipefd) == 0) { close(pipefd[0]); close(pipefd[1]); } } printf("%s %s test finished\r\n",argv[1],argv[2]); return 0; } Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2 PiperOrigin-RevId: 251531096	2019-06-04 15:40:23 -07:00
Nicolas Lacasse	0c292cdaab	Remove the Dirent field from Pipe. Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe raises difficult questions about the lifecycle of the Pipe and Dirent. Fortunately, we can side-step those questions by removing the Dirent field from Pipe entirely. We only need the Dirent when constructing fs.Files (which are ref-counted), and in GetFile (when a Dirent is passed to us anyways). PiperOrigin-RevId: 251497628	2019-06-04 12:58:56 -07:00
Andrei Vagin	90a116890f	gvisor/sock/unix: pass creds when a message is sent between unconnected sockets and don't report a sender address if it doesn't have one PiperOrigin-RevId: 251371284	2019-06-03 21:48:19 -07:00
Andrei Vagin	00f8663887	gvisor/fs: return a proper error from FileWriter.Write in case of a short-write The io.Writer contract requires that Write writes all available bytes and does not return short writes. This causes errors with io.Copy, since our own Write interface does not have this same contract. PiperOrigin-RevId: 251368730	2019-06-03 21:26:01 -07:00
Bhasker Hariharan	bfe3220992	Delete debug log lines left by mistake. Updates #236 PiperOrigin-RevId: 251337915	2019-06-03 17:00:18 -07:00
Andrei Vagin	8e926e3f74	gvisor: validate a new map region in the mremap syscall Right now, mremap allows a memory region to be remapped over MaxUserAddress, which means that we can change the stub region. PiperOrigin-RevId: 251266886	2019-06-03 10:59:46 -07:00
Bhasker Hariharan	3577a4f691	Disable certain tests that are flaky under race detector. PiperOrigin-RevId: 250976665	2019-05-31 16:19:49 -07:00
Bhasker Hariharan	033f96cc93	Change segment queue limit to be of fixed size. Netstack sets the unprocessed segment queue size to match the receive buffer size. This is not required as this queue only needs to hold enough for a short duration before the endpoint goroutine can process it. Updates #230 PiperOrigin-RevId: 250976323	2019-05-31 16:17:33 -07:00
Kevin Krakauer	d58eb9ce82	Add basic iptables structures to netstack. Change-Id: Ib589906175a59dae315405a28f2d7f525ff8877f	2019-05-31 16:14:04 -07:00
Nicolas Lacasse	6f73d79c32	Simplify overlayBoundEndpoint. There is no reason to do the recursion manually, since Inode.BoundEndpoint will do it for us. PiperOrigin-RevId: 250794903	2019-05-30 17:20:20 -07:00
Fabricio Voznika	38de91b028	Add build guard to files using go:linkname Function signatures are not validated during compilation. Since they are not exported, they can change at any time. The guard ensures that they are verified at least on every version upgrade. PiperOrigin-RevId: 250733742	2019-05-30 12:09:39 -07:00
Bhasker Hariharan	ae26b2c425	Fixes to TCP listen behavior. Netstack listen loop can get stuck if cookies are in use and the app is slow to accept incoming connections. Further we continue to complete handshake for a connection even if the backlog is full. This creates a problem when lots of connections come in rapidly and we end up with lots of completed connections just hanging around to be delivered. These fixes change netstack behaviour to mirror what Linux does as described in the following article http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html Now when cookies are not in use Netstack will silently drop the ACK to a SYN-ACK and not complete the handshake if the backlog is full. This will result in the connection staying in a half-complete state. Eventually the sender will retransmit the ACK and if backlog has space we will transition to a connected state and deliver the endpoint. Similarly when cookies are in use we do not try and create an endpoint unless there is space in the accept queue to accept the newly created endpoint. If there is no space then we again silently drop the ACK as we can just recreate it when the ACK is retransmitted by the peer. We also now use the backlog to cap the size of the SYN-RCVD queue for a given endpoint. So at any time there can be N connections in the backlog and N in a SYN-RCVD state if the application is not accepting connections. Any new SYNs will be dropped. This CL also fixes another small bug where we mark a new endpoint which has not completed handshake as connected. We should wait till handshake successfully completes before marking it connected. Updates #236 PiperOrigin-RevId: 250717817	2019-05-30 12:08:41 -07:00
Michael Pratt	8d25cd0b40	Update procid for Go 1.13 Upstream Go has no changes here. PiperOrigin-RevId: 250602731	2019-05-30 12:08:10 -07:00
chris.zn	b18df9bed6	Add VmData field to /proc/{pid}/status VmData is the size of private data segments. It has the same meaning as in Linux. Change-Id: Iebf1ae85940a810524a6cde9c2e767d4233ddb2a PiperOrigin-RevId: 250593739	2019-05-30 12:07:40 -07:00
Bhasker Hariharan	035a8fa38e	Add support for collecting execution trace to runsc. Updates #220 PiperOrigin-RevId: 250532302	2019-05-30 12:07:11 -07:00
Andrei Vagin	4b9cb38157	gvisor: socket() returns EPROTONOSUPPORT if protocol is not supported PiperOrigin-RevId: 250426407	2019-05-30 12:06:15 -07:00
Michael Pratt	507a15dce9	Always wait on tracee children After bf959931ddb88c4e4366e96dd22e68fa0db9527c ("wait/ptrace: assume __WALL if the child is traced") (Linux 4.7), tracees are always eligible for waiting, regardless of type. PiperOrigin-RevId: 250399527	2019-05-30 12:05:46 -07:00
Adin Scannell	2165b77774	Remove obsolete bug. The original bug is no longer relevant, and the FIXME here contains lots of obsolete information. PiperOrigin-RevId: 249924036	2019-05-30 12:03:39 -07:00
Adin Scannell	ed5793808e	Remove obsolete TODO. We don't need to model internal interfaces after the system call interfaces (which are objectively worse and simply use a flag to distinguish between two logically different operations). PiperOrigin-RevId: 249916814 Change-Id: I45d02e0ec0be66b782a685b1f305ea027694cab9	2019-05-24 16:18:09 -07:00
Michael Pratt	6cdec6fadf	Wrap comments and reword in common present tense PiperOrigin-RevId: 249888234 Change-Id: Icfef32c3ed34809c34100c07e93e9581c786776e	2019-05-24 13:23:53 -07:00
Tamir Duberstein	e4b395db49	Remove unused wakers These wakers are uselessly allocated and passed around; nothing ever listens for notifications on them. The code here appears to be vestigial, so removing it and allowing a nil waker to be passed seems appropriate. PiperOrigin-RevId: 249879320 Change-Id: Icd209fb77cc0dd4e5c49d7a9f2adc32bf88b4b71	2019-05-24 12:29:14 -07:00
Andrei Vagin	a949133c4b	gvisor: interrupt the sendfile system call if a task has been interrupted sendfile can be called for a big range and it can require a significant amount of time to process it, so we need to handle task interrupts in this system call. PiperOrigin-RevId: 249781023 Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015	2019-05-23 23:21:13 -07:00
Ayush Ranjan	6240abb205	Added boilerplate code for ext4 fs. Initialized BUILD with license Mount is still unimplemented and is not meant to be part of this CL. Rest of the fs interface is implemented. Referenced the Linux kernel appropriately when needed PiperOrigin-RevId: 249741997 Change-Id: Id1e4c7c9e68b3f6946da39896fc6a0c3dcd7f98c	2019-05-23 16:55:42 -07:00
Fabricio Voznika	9006304dfe	Initial support for bind mounts Separate MountSource from Mount. This is needed to allow mounts to be shared by multiple containers within the same pod. PiperOrigin-RevId: 249617810 Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468	2019-05-23 04:16:10 -07:00
Bhasker Hariharan	022bd0fd10	Fix the signature for gopark. gopark's signature was changed from having a string reason to a uint8. See: `4d7cf3fedb` This broke execution tracing of the sentry. Switching to the right signature makes tracing work again. Updates #220 PiperOrigin-RevId: 249565311 Change-Id: If77fd276cecb37d4003c8222f6de510b8031a074	2019-05-22 18:57:15 -07:00
Adin Scannell	79738d3958	Log unhandled faults only at DEBUG level. PiperOrigin-RevId: 249561399 Change-Id: Ic73c68c8538bdca53068f38f82b7260939addac2	2019-05-22 18:18:53 -07:00
Michael Pratt	f65dfec096	Add WCLONE / WALL support to waitid The previous commit adds WNOTHREAD support to waitid, so we may as well complete the upstream change. Linux added WCLONE, WALL, WNOTHREAD support to waitid(2) in 91c4e8ea8f05916df0c8a6f383508ac7c9e10dba ("wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL"). i.e., Linux 4.7. PiperOrigin-RevId: 249560587 Change-Id: Iff177b0848a3f7bae6cb5592e44500c5a942fbeb	2019-05-22 18:11:50 -07:00
Adin Scannell	21915eb58b	Remove obsolete TODO. There is no obvious reason to require that BlockSize and StatFS are MountSource operations. Today they are in INodeOperations, and they can be moved elsewhere in the future as part of a normal refactor process. PiperOrigin-RevId: 249549982 Change-Id: Ib832e02faeaf8253674475df4e385bcc53d780f3	2019-05-22 17:00:36 -07:00
Michael Pratt	711290a7f6	Add support for wait(WNOTHREAD) PiperOrigin-RevId: 249537694 Change-Id: Iaa4bca73a2d8341e03064d59a2eb490afc3f80da	2019-05-22 15:54:23 -07:00
Kevin Krakauer	c1cdf18e7b	UDP and TCP raw socket support. PiperOrigin-RevId: 249511348 Change-Id: I34539092cc85032d9473ff4dd308fc29dc9bfd6b	2019-05-22 13:45:15 -07:00
Michael Pratt	69eac1198f	Move wait constants to abi/linux package Updates #214 PiperOrigin-RevId: 249483756 Change-Id: I0d3cf4112bed75a863d5eb08c2063fbc506cd875	2019-05-22 11:15:33 -07:00
Adin Scannell	ae1bb08871	Clean up pipe internals and add fcntl support Pipe internals are made more efficient by avoiding garbage collection. A pool is now used that can be shared by all pipes, and buffers are chained via an intrusive list. The documentation for pipe structures and methods is also simplified and clarified. The pipe tests are now parameterized, so that they are run on all different variants (named pipes, small buffers, default buffers). The pipe buffer sizes are exposed by fcntl, which is now supported by this change. A size change test has been added to the suite. These new tests uncovered a bug regarding the semantics of open named pipes with O_NONBLOCK, which is also fixed by this CL. This fix also addresses the lack of the O_LARGEFILE flag for named pipes. PiperOrigin-RevId: 249375888 Change-Id: I48e61e9c868aedb0cadda2dff33f09a560dee773	2019-05-21 20:12:27 -07:00
Michael Pratt	c8857f7269	Fix inconsistencies in ELF anonymous mappings * A segment with filesz == 0, memsz > 0 should be an anonymous only mapping. We were failing to load such an ELF. * Anonymous pages are always mapped RW, regardless of the segment protections. PiperOrigin-RevId: 249355239 Change-Id: I251e5c0ce8848cf8420c3aadf337b0d77b1ad991	2019-05-21 17:06:05 -07:00
Bhasker Hariharan	2ac0aeeb42	Refactor fdbased endpoint dispatcher code. This is in preparation to support an fdbased endpoint that can read/dispatch packets from multiple underlying fds. Updates #231 PiperOrigin-RevId: 249337074 Change-Id: Id7d375186cffcf55ae5e38986e7d605a96916d35	2019-05-21 15:24:25 -07:00
Adin Scannell	9cdae51fec	Add basic plumbing for splice and stub implementation. This does not actually implement an efficient splice or sendfile. Rather, it adds a generic plumbing to the file internals so that this can be added. All file implementations use the stub fileutil.NoSplice implementation, which causes sendfile and splice to fall back to an internal copy. A basic splice system call interface is added, along with a test. PiperOrigin-RevId: 249335960 Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b	2019-05-21 15:18:12 -07:00
Neel Natu	adeb99709b	Remove unused struct member. PiperOrigin-RevId: 249300446 Change-Id: Ifb16538f684bc3200342462c3da927eb564bf52d	2019-05-21 12:20:19 -07:00
Michael Pratt	80cc2c78e5	Forward named pipe creation to the gofer The backing 9p server must allow named pipe creation, which the runsc fsgofer currently does not. There are small changes to the overlay here. GetFile may block when opening a named pipe, which can cause a deadlock: 1. open(O_RDONLY) -> copyMu.Lock() -> GetFile() 2. open(O_WRONLY) -> copyMu.Lock() -> Deadlock A named pipe usable for writing must already be on the upper filesystem, but we are still taking copyMu for write when checking for upper. That can be changed to a read lock to fix the common case. However, a named pipe on the lower filesystem would still deadlock in open(O_WRONLY) when it tries to actually perform copy up (which would simply return EINVAL). Move the copy up type check before taking copyMu for write to avoid this. p9 must be modified, as it was incorrectly removing the file mode when sending messages on the wire. PiperOrigin-RevId: 249154033 Change-Id: Id6637130e567b03758130eb6c7cdbc976384b7d6	2019-05-20 16:53:08 -07:00
Michael Pratt	6588427451	Fix incorrect tmpfs timestamp updates * Creation of files, directories (and other fs objects) in a directory should always update ctime. * Same for removal. * atime should not be updated on lookup, only readdir. I've also renamed some misleading functions that update mtime and ctime. PiperOrigin-RevId: 249115063 Change-Id: I30fa275fa7db96d01aa759ed64628c18bb3a7dc7	2019-05-20 13:35:17 -07:00
Michael Pratt	4a842836e5	Return EPERM for mknod This more directly matches what Linux does with unsupported nodes. PiperOrigin-RevId: 248780425 Change-Id: I17f3dd0b244f6dc4eb00e2e42344851b8367fbec	2019-05-17 13:47:40 -07:00
Michael Pratt	04105781ad	Fix gofer rename ctime and cleanup stat_times test There is a lot of redundancy that we can simplify in the stat_times test. This will make it easier to add new tests. However, the simplification reveals that cached uattrs on goferfs don't properly update ctime on rename. PiperOrigin-RevId: 248773425 Change-Id: I52662728e1e9920981555881f9a85f9ce04041cf	2019-05-17 13:05:47 -07:00
Andrei Vagin	2105158d4b	gofer: don't call hostfile.Close if hostFile is nil PiperOrigin-RevId: 248437159 Change-Id: Ife71f6ca032fca59ec97a82961000ed0af257101	2019-05-15 17:21:10 -07:00
Andrei Vagin	3abee2ecb9	Automated rollback of changelist 247964961 PiperOrigin-RevId: 248411456 Change-Id: I21c3767b0b7e5948536d4c0b78be46ba35cf76cb	2019-05-15 14:58:40 -07:00
Nicolas Lacasse	dd153c014d	Start of support for /proc/pid/cgroup file. PiperOrigin-RevId: 248263378 Change-Id: Ic057d2bb0b6212110f43ac4df3f0ac9bf931ab98	2019-05-14 20:34:50 -07:00
Michael Pratt	330a1bbd04	Remove false comment PiperOrigin-RevId: 248249285 Change-Id: I9b6d267baa666798b22def590ff20c9a118efd47	2019-05-14 18:06:14 -07:00
Andrei Vagin	ec248daf29	gvisor/hostnet: restart epoll_wait after epoll_ctl Otherwise changes made by epoll_ctl will not take effect. PiperOrigin-RevId: 247964961 Change-Id: I9fbb35c44766421af45d9ed53760e0c324d80d99	2019-05-13 10:38:27 -07:00
Jamie Liu	5ee8218483	Add pgalloc.DelayedEvictionManual. PiperOrigin-RevId: 247667272 Change-Id: I16b04e11bb93f50b7e05e888992303f730e4a877	2019-05-10 13:37:48 -07:00
Fabricio Voznika	1bee43be13	Implement fallocate(2) Closes #225 PiperOrigin-RevId: 247508791 Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc	2019-05-09 15:35:49 -07:00
Tamir Duberstein	0f4be95a33	Remove dhcp client This was upstreamed from Fuchsia, but it is pretty buggy and doesn't rely on any private APIs. Thus it can be checked into the Fuchsia source tree without forking netstack, where we can more easily iterate on (and eventually remove) it. PiperOrigin-RevId: 247506582 Change-Id: Ifb1b60c6c4941c374a59c5570a6a9cacf2468981	2019-05-09 15:23:03 -07:00
Nicolas Lacasse	bfd9f75ba4	Set the FilesystemType in MountSource from the Filesystem. And stop storing the Filesystem in the MountSource. This allows us to decouple the MountSource filesystem type from the name of the filesystem. PiperOrigin-RevId: 247292982 Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8	2019-05-08 14:35:06 -07:00
Googler	cbf6ab9697	Check GSO for nil in WritePacket Testing: Unit tests added PiperOrigin-RevId: 247096269 Change-Id: I849c010eadcb53caf45896a15ef38162d66a9568	2019-05-07 14:57:03 -07:00
Ian Gudger	20862f0db2	Add gonet.DialContextTCP. Allows cancellation and timeouts. PiperOrigin-RevId: 247090428 Change-Id: I91907f12e218677dcd0e0b6d72819deedbd9f20c	2019-05-07 14:27:36 -07:00
Fabricio Voznika	e5432fa1b3	Remove defers from gofer.contextFile Most are single line methods in hot paths. PiperOrigin-RevId: 247050267 Change-Id: I428d78723fe00b57483185899dc8fa9e1f01e2ea	2019-05-07 10:55:09 -07:00
Jamie Liu	14f0e7618e	Ensure all uses of MM.brk occur under MM.mappingMu in MM.Brk(). PiperOrigin-RevId: 246921386 Change-Id: I71d8908858f45a9a33a0483470d0240eaf0fd012	2019-05-06 16:39:43 -07:00
Kevin Krakauer	ff8ed5e6a5	Fix raw socket behavior and tests. Some behavior was broken due to the difficulty of running automated raw socket tests. Change-Id: I152ca53916bb24a0208f2dc1c4f5bc87f4724ff6 PiperOrigin-RevId: 246747067	2019-05-05 16:07:25 -07:00
Bin Lu	ebe2f78d9b	Add arm64 support to pkg/seccomp Signed-off-by: Bin Lu <bin.lu@arm.com> PiperOrigin-RevId: 246622505 Change-Id: I803639a0c5b0f75959c64fee5385314214834d10	2019-05-03 22:03:59 -07:00
Ian Gudger	b4a9f18687	Update tcpip Clock description. The tcpip.Clock comment stated that times provided by it should not be used for netstack internal timekeeping. This comment was from before the interface supported monotonic times. The monotonic times that it provides are now the preferred time source for netstack internal timekeeping. PiperOrigin-RevId: 246618772 Change-Id: I853b720e3d719b03fabd6156d2431da05d354bda	2019-05-03 21:01:42 -07:00
Andrei Vagin	24d8656585	gofer: don't leak file descriptors Fixes #219 PiperOrigin-RevId: 246568639 Change-Id: Ic7afd15dde922638d77f6429c508d1cbe2e4288a	2019-05-03 14:01:50 -07:00
Googler	f2699b76c8	Support IPv4 fragmentation in netstack Testing: Unit tests and also large ping in Fuchsia OS PiperOrigin-RevId: 246563592 Change-Id: Ia12ab619f64f4be2c8d346ce81341a91724aef95	2019-05-03 13:30:35 -07:00
Kevin Krakauer	264d012d81	Add netfilter ABI for iptables support. Change-Id: Ifbd2abf63ea8062a89b83e948d3e9735480d8216 PiperOrigin-RevId: 246559904	2019-05-03 13:06:09 -07:00
Tamir Duberstein	0e1cc476db	Fix transport/raw copybara export - include packet_list.go - exclude state.go (by renaming to include an underscore) Also rename raw.go to endpoint.go for consistency. PiperOrigin-RevId: 246547912 Change-Id: I19c8331c794ba683a940cc96a8be6497b53ff24d	2019-05-03 11:52:59 -07:00
Bhasker Hariharan	458fe955a7	Implement support for SACK based recovery(RFC 6675). PiperOrigin-RevId: 246536003 Change-Id: I118b745f45040be9c70cb6a1028acdb06c78d8c9	2019-05-03 10:51:18 -07:00
Chris Kuiper	2d8e90b311	Proper cleanup of sockets that used REUSEPORT Fixed a small logic error that broke proper accounting of MultiPortEndpoints. PiperOrigin-RevId: 246502126 Change-Id: I1a7d6ea134f811612e545676212899a3707bc2c2	2019-05-03 07:02:51 -07:00
Chris Kuiper	8972e47a2e	Support reception of multicast data on more than one socket This requires two changes: 1) Support for more than one socket to join a given multicast group. 2) Duplicate delivery of incoming multicast packets to all sockets listening for it. In addition, I tweaked the code (and added a test) to disallow duplicate IP_ADD_MEMBERSHIP calls for the same group and NIC. This is how Linux does it. PiperOrigin-RevId: 246437315 Change-Id: Icad8300b4a8c3f501d9b4cd283bd3beabef88b72	2019-05-02 19:41:00 -07:00
Michael Pratt	23ca9886c6	Update reference to old type PiperOrigin-RevId: 246036806 Change-Id: I5554a43a1f8146c927402db3bf98488a2da0fbe7	2019-04-30 15:42:39 -07:00
Jamie Liu	8bfb83d0ac	Implement async MemoryFile eviction, and use it in CachingInodeOperations. This feature allows MemoryFile to delay eviction of "optional" allocations, such as unused cached file pages. Note that this incidentally makes CachingInodeOperations writeback asynchronous, in the sense that it doesn't occur until eviction; this is necessary because between when a cached page becomes evictable and when it's evicted, file writes (via CachingInodeOperations.Write) may dirty the page. As currently implemented, this feature won't meaningfully impact steady-state memory usage or caching; the reclaimer goroutine will schedule eviction as soon as it runs out of other work to do. Future CLs increase caching by adding constraints on when eviction is scheduled. PiperOrigin-RevId: 246014822 Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb	2019-04-30 13:56:41 -07:00
Ian Gudger	81ecd8b6ea	Implement the MSG_CTRUNC msghdr flag for Unix sockets. Updates google/gvisor#206 PiperOrigin-RevId: 245880573 Change-Id: Ifa715e98d47f64b8a32b04ae9378d6cd6bd4025e	2019-04-29 21:21:08 -07:00
Fabricio Voznika	ddab854b9a	Reduce memory allocations on serving path Cache last used messages and reuse them for subsequent requests. If more messages are needed, they are created outside the cache on demand. PiperOrigin-RevId: 245836910 Change-Id: Icf099ddff95df420db8e09f5cdd41dcdce406c61	2019-04-29 15:33:47 -07:00
Michael Pratt	4d52a55201	Change copyright notice to "The gVisor Authors" Based on the guidelines at https://opensource.google.com/docs/releasing/authors/. 1. $ rg -l "Google LLC" \| xargs sed -i 's/Google LLC.*/The gVisor Authors./' 2. Manual fixup of "Google Inc" references. 3. Add AUTHORS file. Authors may request to be added to this file. 4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS. Fixes #209 PiperOrigin-RevId: 245823212 Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9	2019-04-29 14:26:23 -07:00
Nicolas Lacasse	f4ce43e1f4	Allow and document bug ids in gVisor codebase. PiperOrigin-RevId: 245818639 Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789	2019-04-29 14:04:14 -07:00
Nicolas Lacasse	2df64cd6d2	createAt should return all errors from FindInode except ENOENT. Previously, createAt was eating all errors from FindInode except for EACCES and proceeding with the creation. This is incorrect, as FindInode can return many other errors (like ENAMETOOLONG) that should stop creation. This CL changes createAt to return all errors encountered except for ENOENT, which we can ignore because we are about to create the thing. PiperOrigin-RevId: 245773222 Change-Id: I1b317021de70f0550fb865506f6d8147d4aebc56	2019-04-29 10:30:24 -07:00
Ben Burkert	66bca6fc22	tcpip/adapters/gonet: add CloseRead & CloseWrite methods to Conn Add the CloseRead & CloseWrite methods that perform shutdown on the corresponding Read & Write sides of a connection. Change-Id: I3996a2abdc7cd68a2becba44dc4bd9f0919d2ce1 PiperOrigin-RevId: 245537950	2019-04-26 22:46:45 -07:00
Kevin Krakauer	43dff57b87	Make raw sockets a toggleable feature disabled by default. PiperOrigin-RevId: 245511019 Change-Id: Ia9562a301b46458988a6a1f0bbd5f07cbfcb0615	2019-04-26 16:51:46 -07:00
Adin Scannell	5749f64314	kvm: remove non-sane sanity check Apparently some platforms don't have pSize < vSize. Fixes #208 PiperOrigin-RevId: 245480998 Change-Id: I2a98229912f4ccbfcd8e79dfa355104f14275a9c	2019-04-26 13:53:12 -07:00
Bhasker Hariharan	228dc15fd1	Bump the AF_PACKET socket rcv buf size to 4MB by default. Packet socket receive buffers default to the sysctl value of net.core.rmem_default and are capped by net.core.rmem_max both which are usually set to 208KB on most systems. Since we can't expect every gVisor user to bump these we use SO_RCVBUFFORCE to exceed the limit. This is possible as runsc runs with CAP_NET_ADMIN outside the sandbox and can do this before the FD is passed to the sentry inside the sandbox. Updates #211 iperf output w/ 4MB buffer. iperf3 -c 172.17.0.2 -t 100 Connecting to host 172.17.0.2, port 5201 [ 4] local 172.17.0.1 port 40378 connected to 172.17.0.2 port 5201 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-1.00 sec 1.15 GBytes 9.89 Gbits/sec 0 1.02 MBytes [ 4] 1.00-2.00 sec 1.18 GBytes 10.2 Gbits/sec 0 1.02 MBytes [ 4] 2.00-3.00 sec 965 MBytes 8.09 Gbits/sec 0 1.02 MBytes [ 4] 3.00-4.00 sec 942 MBytes 7.90 Gbits/sec 0 1.02 MBytes [ 4] 4.00-5.00 sec 952 MBytes 7.99 Gbits/sec 0 1.02 MBytes [ 4] 5.00-6.00 sec 1.14 GBytes 9.81 Gbits/sec 0 1.02 MBytes [ 4] 6.00-7.00 sec 1.13 GBytes 9.68 Gbits/sec 0 1.02 MBytes [ 4] 7.00-8.00 sec 930 MBytes 7.80 Gbits/sec 0 1.02 MBytes [ 4] 8.00-9.00 sec 1.15 GBytes 9.91 Gbits/sec 0 1.02 MBytes [ 4] 9.00-10.00 sec 938 MBytes 7.87 Gbits/sec 0 1.02 MBytes [ 4] 10.00-11.00 sec 737 MBytes 6.18 Gbits/sec 0 1.02 MBytes [ 4] 11.00-12.00 sec 1.16 GBytes 9.93 Gbits/sec 0 1.02 MBytes [ 4] 12.00-13.00 sec 917 MBytes 7.69 Gbits/sec 0 1.02 MBytes [ 4] 13.00-14.00 sec 1.19 GBytes 10.2 Gbits/sec 0 1.02 MBytes [ 4] 14.00-15.00 sec 1.01 GBytes 8.70 Gbits/sec 0 1.02 MBytes [ 4] 15.00-16.00 sec 1.20 GBytes 10.3 Gbits/sec 0 1.02 MBytes [ 4] 16.00-17.00 sec 1.14 GBytes 9.80 Gbits/sec 0 1.02 MBytes ^C[ 4] 17.00-17.60 sec 718 MBytes 10.1 Gbits/sec 0 1.02 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-17.60 sec 18.4 GBytes 8.98 Gbits/sec 0 sender [ 4] 0.00-17.60 sec 0.00 Bytes 0.00 bits/sec receiver PiperOrigin-RevId: 245470590 Change-Id: I1c08c5ee8345de6ac070513656a4703312dc3c00	2019-04-26 12:52:02 -07:00
Kevin Krakauer	5f13338d30	Fix reference counting bug in /proc/PID/fdinfo/. PiperOrigin-RevId: 245452217 Change-Id: I7164d8f57fe34c17e601079eb9410a6d95af1869	2019-04-26 11:09:55 -07:00
Michael Pratt	f17cfa4d53	Perform explicit CPUID and FP state compatibility checks on restore PiperOrigin-RevId: 245341004 Change-Id: Ic4d581039d034a8ae944b43e45e84eb2c3973657	2019-04-25 17:47:05 -07:00
Jamie Liu	6b76c172b4	Don't enforce NAME_MAX in fs.Dirent.walk(). Maximum filename length is filesystem-dependent, and obtained via statfs::f_namelen. This limit is usually 255 bytes (NAME_MAX), but not always. For example, VFAT supports filenames of up to 255... UCS-2 characters, which Linux conservatively takes to mean UTF-8-encoded bytes: fs/fat/inode.c:fat_statfs(), FAT_LFN_LEN * NLS_MAX_CHARSET_SIZE. As a result, Linux's VFS does not enforce NAME_MAX: $ rg --maxdepth=1 '\WNAME_MAX\W' fs/ include/linux/ fs/libfs.c 38: buf->f_namelen = NAME_MAX; 64: if (dentry->d_name.len > NAME_MAX) include/linux/relay.h 74: char base_filename[NAME_MAX]; /* saved base filename / include/linux/fscrypt.h 149: filenames up to NAME_MAX bytes, since base64 encoding expands the length. include/linux/exportfs.h 176: * understanding that it is already pointing to a a %NAME_MAX+1 sized Remove this check from core VFS, and add it to ramfs (and by extension tmpfs), where it is actually applicable: mm/shmem.c:shmem_dir_inode_operations.lookup == simple_lookup does enforce NAME_MAX. PiperOrigin-RevId: 245324748 Change-Id: I17567c4324bfd60e31746a5270096e75db963fac	2019-04-25 16:05:13 -07:00
Bhasker Hariharan	56cadcac4e	Fixes to PacketMMap dispatcher. This CL fixes the following bugs: - Uses atomic to set/read status instead of binary.LittleEndian.PutUint32 etc which are not atomic. - Increments ringOffsets for frames that are truncated (i.e status is tpStatusCopy) - Does not ignore frames with tpStatusLost bit set as they are valid frames and only indicate that some frames were lost before this one and metrics can be retrieved with a getsockopt call. - Adds checks to make sure blockSize is a multiple of page size. This is required as the kernel allocates in pages per block and rejects sizes that are not page aligned with an EINVAL. Updates #210 PiperOrigin-RevId: 244959464 Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f	2019-04-23 17:47:56 -07:00
Fabricio Voznika	db334f7154	Remove reflection from 9P serving path p9.messageByType was taking 7% of p9.recv before, spending time with reflection and map lookup. Now it's reduced to 1%. PiperOrigin-RevId: 244947313 Change-Id: I42813f920557b7656f8b29157eb32acd79e11fa5	2019-04-23 16:26:10 -07:00
Fabricio Voznika	908edee04f	Replace os.File with fd.FD in fsgofer os.NewFile() accounts for 38% of CPU time in localFile.Walk(). This change switches to using fd.FD, which is much cheaper to create. Now, fd.New() in localFile.Walk() accounts for only 4%. PiperOrigin-RevId: 244944983 Change-Id: Ic892df96cf2633e78ad379227a213cb93ee0ca46	2019-04-23 16:10:54 -07:00
Wei Zhang	17ff6063a3	Bugfix: fix fstatat symbol link to dir For a symbol link to some directory, eg. `/tmp/symlink -> /tmp/dir` `fstatat("/tmp/symlink")` should return symbol link data, but `fstatat("/tmp/symlink/")` (symlink with trailing slash) should return directory data it points following linux behaviour. Currently fstatat() a symlink with trailing slash will get "not a directory" error which is wrong. Signed-off-by: Wei Zhang <zhangwei198900@gmail.com> Change-Id: I63469b1fb89d083d1c1255d32d52864606fbd7e2 PiperOrigin-RevId: 244783916	2019-04-22 20:07:06 -07:00
Michael Pratt	d6aac9387f	Fix doc typo PiperOrigin-RevId: 244773890 Change-Id: I2d0cd7789771276ba545b38efff6d3e24133baaa	2019-04-22 18:22:19 -07:00
Michael Pratt	f86c35a51f	Clean up state error handling PiperOrigin-RevId: 244773836 Change-Id: I32223f79d2314fe1ac4ddfc63004fc22ff634adf	2019-04-22 18:20:51 -07:00
Ben Burkert	56927e5317	tcpip/transport/tcp: read side only shutdown of an endpoint Support shutdown on only the read side of an endpoint. Reads performed after a call to Shutdown with only the ShutdownRead flag will return ErrClosedForReceive without data. Break out the shutdown(2) with SHUT_RD syscall test into two tests. The first tests that no packets are sent when shutting down the read side of a socket. The second tests that, after shutting down the read side of a socket, unread data can still be read, or an EOF if there is no more data to read. Change-Id: I9d7c0a06937909cbb466b7591544a4bcaebb11ce PiperOrigin-RevId: 244459430	2019-04-19 19:29:05 -07:00
Ian Gudger	358eb52a76	Add support for the MSG_TRUNC msghdr flag. The MSG_TRUNC flag is set in the msghdr when a message is truncated. Fixes google/gvisor#200 PiperOrigin-RevId: 244440486 Change-Id: I03c7d5e7f5935c0c6b8d69b012db1780ac5b8456	2019-04-19 16:17:01 -07:00
Ben Burkert	cec2cdc12f	tcpip/transport/udp: add Forwarder type Add a UDP forwarder for intercepting and forwarding UDP sessions. Change-Id: I2d83c900c1931adfc59a532dd4f6b33a0db406c9 PiperOrigin-RevId: 244293576	2019-04-18 17:49:57 -07:00
Michael Pratt	c931c8e082	Format struct pollfd in poll(2)/ppoll(2) I0410 15:40:38.854295 3776 x:0] [ 1] poll_test E poll(0x2b00bfb5c020 [{FD: 0x3 anon_inode:[eventfd], Events: POLLOUT, REvents: ...}], 0x1, 0x1) I0410 15:40:38.854348 3776 x:0] [ 1] poll_test X poll(0x2b00bfb5c020 [{FD: 0x3 anon_inode:[eventfd], Events: POLLOUT\|POLLERR\|POLLHUP, REvents: POLLOUT}], 0x1, 0x1) = 0x1 (10.765µs) PiperOrigin-RevId: 244269879 Change-Id: If07ba54a486fdeaaedfc0123769b78d1da862307	2019-04-18 15:24:07 -07:00
Ian Gudger	133700007a	Only emit unimplemented syscall events for unsupported values. Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER when attempting to set unsupported values. PiperOrigin-RevId: 244229675 Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1	2019-04-18 11:51:41 -07:00
Andrei Vagin	4524790ff6	netstack: use a proper network protocol to set gso.L3HdrLen It is possible to create a listening socket which will accept IPv4 and IPv6 connections. In this case, we set IPv6ProtocolNumber for all accepted endpoints, even if they handle IPv4 connections. This means that we can't use endpoint.netProto to set gso.L3HdrLen. PiperOrigin-RevId: 244227948 Change-Id: I5e1863596cb9f3d216febacdb7dc75651882eef1	2019-04-18 11:42:23 -07:00
Michael Pratt	b52cbd6028	Don't allow sigtimedwait to catch unblockable signals The existing logic attempting to do this is incorrect. Unary ^ has higher precedence than &^, so mask always has UnblockableSignals cleared, allowing dequeueSignalLocked to dequeue unblockable signals (which allows userspace to ignore them). Switch the logic so that unblockable signals are always masked. PiperOrigin-RevId: 244058487 Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3	2019-04-17 13:43:20 -07:00
Fabricio Voznika	c8cee7108f	Use FD limit and file size limit from host FD limit and file size limit are read from the host, instead of using hard-coded defaults, given that they affect the sandbox process. Also limit the direct cache to use no more than half of the available FDs. PiperOrigin-RevId: 244050323 Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6	2019-04-17 12:57:40 -07:00
Michael Pratt	08d99c5fbe	Convert poll/select to operate more directly on linux.PollFD Currently, doPoll copies the user struct pollfd array into a []syscalls.PollFD, which contains internal kdefs.FD and waiter.EventMask types. While these are currently binary-compatible with the Linux versions, we generally discourage copying directly to internal types (someone may inadvertently change kdefs.FD to uint64). Instead, copy directly to a []linux.PollFD, which will certainly be binary compatible. Most of syscalls/polling.go is included directly into syscalls/linux/sys_poll.go, as it can then operate directly on linux.PollFD. The additional syscalls.PollFD type is providing little value. I've also added explicit conversion functions for waiter.EventMask, which creates the possibility of a different binary format. PiperOrigin-RevId: 244042947 Change-Id: I24e5b642002a32b3afb95a9dcb80d4acd1288abf	2019-04-17 12:15:01 -07:00
Googler	e091b4e7c0	Internal change. PiperOrigin-RevId: 244036529 Change-Id: I280f9632a65d2e40d844e0d5ec3a101d808434ee	2019-04-17 11:40:11 -07:00
Fabricio Voznika	9f8c89fc7f	Return error from fdbased.New RELNOTES: n/a PiperOrigin-RevId: 244031742 Change-Id: Id0cdb73194018fb5979e67b58510ead19b5a2b81	2019-04-17 11:16:35 -07:00
Michael Pratt	6b24f7ab08	Format FDs in strace logs Normal files display their path in the current mount namespace: I0410 10:57:54.964196 216336 x:0] [ 1] ls X read(0x3 /proc/filesystems, 0x55cee3bdb2c0 "nodev\t9p\nnodev\tdevpts \nnodev\tdevtmpfs\nnodev\tproc\nnodev\tramdiskfs\nnodev\tsysfs\nnodev\ttmpfs\n", 0x1000) = 0x58 (24.462µs) AT_FDCWD includes the CWD: I0411 12:58:48.278427 1526 x:0] [ 1] stat_test E newfstatat(AT_FDCWD /home/prattmic, 0x55ea719b564e /proc/self, 0x7ef5cefc2be8, 0x0) Sockets (and other non-vfs files) display an inode number (like /proc/PID/fd): I0410 10:54:38.909123 207684 x:0] [ 1] nc E bind(0x3 socket:[1], 0x55b5a1652040 {Family: AF_INET, Addr: , Port: 8080}, 0x10) I also fixed a few syscall args that should be Path. PiperOrigin-RevId: 243169025 Change-Id: Ic7dda6a82ae27062fe2a4a371557acfd6a21fa2a	2019-04-11 16:48:39 -07:00
Jamie Liu	4209edafb6	Use open fids when fstat()ing gofer files. PiperOrigin-RevId: 243018347 Change-Id: I1e5b80607c1df0747482abea61db7fcf24536d37	2019-04-11 00:43:04 -07:00
Michael Pratt	cc48969bb7	Internal change PiperOrigin-RevId: 242978508 Change-Id: I0ea59ac5ba1dd499e87c53f2e24709371048679b	2019-04-10 18:00:18 -07:00
Nicolas Lacasse	d93d19fd4e	Fix uses of RootFromContext. RootFromContext can return a dirent with reference taken, or nil. We must call DecRef if (and only if) a real dirent is returned. PiperOrigin-RevId: 242965515 Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11	2019-04-10 16:36:28 -07:00
Yong He	89cc8eef9b	DATA RACE in fs.(Dirent).fullName add renameMu.Lock when oldParent == newParent in order to avoid data race in following report: WARNING: DATA RACE Read at 0x00c000ba2160 by goroutine 405: gvisor.googlesource.com/gvisor/pkg/sentry/fs.(Dirent).fullName() pkg/sentry/fs/dirent.go:246 +0x6c gvisor.googlesource.com/gvisor/pkg/sentry/fs.(Dirent).FullName() pkg/sentry/fs/dirent.go:356 +0x8b gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(FDMap).String() pkg/sentry/kernel/fd_map.go:135 +0x1e0 fmt.(pp).handleMethods() GOROOT/src/fmt/print.go:603 +0x404 fmt.(pp).printArg() GOROOT/src/fmt/print.go:686 +0x255 fmt.(pp).doPrintf() GOROOT/src/fmt/print.go:1003 +0x33f fmt.Fprintf() GOROOT/src/fmt/print.go:188 +0x7f gvisor.googlesource.com/gvisor/pkg/log.(Writer).Emit() pkg/log/log.go:121 +0x89 gvisor.googlesource.com/gvisor/pkg/log.GoogleEmitter.Emit() pkg/log/glog.go:162 +0x1acc gvisor.googlesource.com/gvisor/pkg/log.(GoogleEmitter).Emit() <autogenerated>:1 +0xe1 gvisor.googlesource.com/gvisor/pkg/log.(BasicLogger).Debugf() pkg/log/log.go:177 +0x111 gvisor.googlesource.com/gvisor/pkg/log.Debugf() pkg/log/log.go:235 +0x66 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).Debugf() pkg/sentry/kernel/task_log.go:48 +0xfe gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).DebugDumpState() pkg/sentry/kernel/task_log.go:66 +0x11f gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(runApp).execute() pkg/sentry/kernel/task_run.go:272 +0xc80 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).run() pkg/sentry/kernel/task_run.go:91 +0x24b Previous write at 0x00c000ba2160 by goroutine 423: gvisor.googlesource.com/gvisor/pkg/sentry/fs.Rename() pkg/sentry/fs/dirent.go:1628 +0x61f gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1.1() pkg/sentry/syscalls/linux/sys_file.go:1864 +0x1f8 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt( gvisor.googlesource.com/g/linux/sys_file.go:51 +0x20f 
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1() pkg/sentry/syscalls/linux/sys_file.go:1852 +0x218 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt() pkg/sentry/syscalls/linux/sys_file.go:51 +0x20f gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt() pkg/sentry/syscalls/linux/sys_file.go:1840 +0x180 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Rename() pkg/sentry/syscalls/linux/sys_file.go:1873 +0x60 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).executeSyscall() pkg/sentry/kernel/task_syscall.go:165 +0x17a gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).doSyscallInvoke() pkg/sentry/kernel/task_syscall.go:283 +0xb4 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).doSyscallEnter() pkg/sentry/kernel/task_syscall.go:244 +0x10c gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).doSyscall() pkg/sentry/kernel/task_syscall.go:219 +0x1e3 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(runApp).execute() pkg/sentry/kernel/task_run.go:215 +0x15a9 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(Task).run() pkg/sentry/kernel/task_run.go:91 +0x24b Reported-by: syzbot+e1babbf756fab380dfff@syzkaller.appspotmail.com Change-Id: Icd2620bb3ea28b817bf0672d454a22b9d8ee189a PiperOrigin-RevId: 242938741	2019-04-10 14:17:33 -07:00
Kevin Krakauer	f7aff0aaa4	Allow threads with CAP_SYS_RESOURCE to raise hard rlimits. PiperOrigin-RevId: 242919489 Change-Id: Ie3267b3bcd8a54b54bc16a6556369a19e843376f	2019-04-10 12:36:45 -07:00
Nicolas Lacasse	0a0619216e	Start saving MountSource.DirentCache. DirentCache is already a savable type, and it ensures that it is empty at the point of Save. There is no reason not to save it along with the MountSource. This did uncover an issue where not all MountSources were properly flushed before Save. If a mount point has an open file and is then unmounted, we save the MountSource without flushing it first. This CL also fixes that by flushing all MountSources for all open FDs on Save. PiperOrigin-RevId: 242906637 Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f	2019-04-10 11:27:16 -07:00
Shiva Prasanth	7140b1fdca	Fixed /proc/cpuinfo permissions This also applies these permissions to other static proc files. Change-Id: I4167e585fed49ad271aa4e1f1260babb3239a73d PiperOrigin-RevId: 242898575	2019-04-10 10:49:43 -07:00
Li Qiang	b3b140ea4f	syscalls: sendfile: limit the count to MAX_RW_COUNT From sendfile spec and also the linux kernel code, we should limit the count arg to 'MAX_RW_COUNT'. This patch exports 'MAX_RW_COUNT' in the kernel pkg and uses it in the implementation of the sendfile syscall. Signed-off-by: Li Qiang <pangpei.lq@antfin.com> Change-Id: I1086fec0685587116984555abd22b07ac233fbd2 PiperOrigin-RevId: 242745831	2019-04-09 14:57:05 -07:00
Bhasker Hariharan	eaac2806ff	Add TCP checksum verification. PiperOrigin-RevId: 242704699 Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f	2019-04-09 11:23:47 -07:00
Tamir Duberstein	cf4ed408c3	Use (*testing.T).Helper to clean up test failures PiperOrigin-RevId: 242647530 Change-Id: I1bf9ac1d664f452dc47ca670d408a73538cb482f	2019-04-09 05:17:32 -07:00
Jamie Liu	9471c01348	Export kernel.SignalInfoPriv. Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks. PiperOrigin-RevId: 242562428 Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90	2019-04-08 16:32:11 -07:00
Nicolas Lacasse	70906f1d24	Intermediate ram fs dirs should be writable. We construct a ramfs tree of "scaffolding" directories for all mount points, so that a directory exists that each mount point can be mounted over. We were creating these directories without write permissions, which meant that they were not writable even when underlayed under a writable filesystem. They should be writable. PiperOrigin-RevId: 242507789 Change-Id: I86645e35417560d862442ff5962da211dbe9b731	2019-04-08 11:56:38 -07:00
Nicolas Lacasse	ee7e6d33b2	Use string type for extended attribute values, instead of []byte. Strings are a better fit for this usage because they are immutable in Go, and can contain arbitrary bytes. It also allows us to avoid casting bytes to string (and the associated allocation) in the hot path when checking for overlay whiteouts. PiperOrigin-RevId: 242208856 Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2	2019-04-05 15:49:39 -07:00
Michael Pratt	252f877f3d	Set fixed field in CPUID function 2 From the SDM: "The least-significant byte in register EAX (register AL) will always return 01H. Software should ignore this value and not interpret it as an informational descriptor." Unfortunately, online docs [1] [2] (likely based on an old version of the SDM) say: "The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction must be executed with an input value of 2 to get a complete description of the processor's caches and TLBs." dlang uses this second interpretation [3] and will loop 2^32 times if we return zero. Fix this by specifying the fixed value of one. We still don't support exposing the actual cache information, leaving all other bytes empty. A zero byte means: "Null descriptor, this byte contains no information." [1] http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h [2] https://c9x.me/x86/html/file_module_x86_id_45.html [3] `424640864c/src/core/cpuid.d (L533-L534)` PiperOrigin-RevId: 242046629 Change-Id: Ic0f0a5f974b20f71391cb85645bdcd4003e5fe88	2019-04-04 18:01:56 -07:00
Andrei Vagin	88409e983c	gvisor: Add support for the MS_NOEXEC mount option https://github.com/google/gvisor/issues/145 PiperOrigin-RevId: 242044115 Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878	2019-04-04 17:43:53 -07:00
Michael Pratt	75a5ccf5d9	Remove defer from trivial ThreadID methods In particular, ns.IDOfTask and tg.ID are used for gettid and getpid, respectively, where removing defer saves ~100ns. This may be a small improvement to application logging, which may call gettid/getpid frequently. PiperOrigin-RevId: 242039616 Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80	2019-04-04 17:14:27 -07:00
Adin Scannell	75c8ac38e0	BUILD: Add useful go_path target Change-Id: Ibd6d8a1a63826af6e62a0f0669f8f0866c8091b4 PiperOrigin-RevId: 242037969	2019-04-04 17:05:38 -07:00
Googler	efe4461d74	Internal change. PiperOrigin-RevId: 241867632 Change-Id: I29459f2758ac4835882b491ff25c6aca9a37d41d	2019-04-03 22:02:51 -07:00
Michael Pratt	9cf33960fc	Only CopyOut CPU when it changes This will save copies when preemption is not caused by a CPU migration. PiperOrigin-RevId: 241844399 Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851	2019-04-03 18:06:36 -07:00
Nicolas Lacasse	61d8c361c6	Don't release d.mu in checks for child-existence. Dirent.exists() is called in Create to check whether a child with the given name already exists. Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu while calling d.Inode.Lookup. During this existence check, a racing Rename() can acquire d.mu and create a new child of the dirent with the same name. (Note that the source and destination of the rename must be in the same directory, otherwise renameMu will be taken preventing the race.) In this case, d.exists() can return false, even though a child with the same name actually does exist. This CL changes d.exists() so that it does not release d.mu while walking, thus preventing the race with Rename. It also adds comments noting that lockForRename may not take renameMu if the source and destination are in the same directory, as this is a bit surprising (at least it was to me). PiperOrigin-RevId: 241842579 Change-Id: I56524870e39dfcd18cab82054eb3088846c34813	2019-04-03 17:53:56 -07:00
Michael Pratt	4968dd1341	Cache ThreadGroups in PIDNamespace If there are thousands of threads, ThreadGroupsAppend becomes very expensive as it must iterate over all Tasks to find the ThreadGroup leaders. Reduce the cost by maintaining a map of ThreadGroups which can be used to grab them all directly. The one somewhat visible change is to convert PID namespace init children zapping to a group-directed SIGKILL, as Linux did in 82058d668465 "signal: Use group_send_sig_info to kill all processes in a pid namespace". In a benchmark that creates N threads which sleep for two minutes, we see approximately this much CPU time in ThreadGroupsAppend: Before: 1 thread: 0ms 1024 threads: 30ms - 9130ms 4096 threads: 50ms - 2000ms 8192 threads: 18160ms 16384 threads: 17210ms After: 1 thread: 0ms 1024 threads: 0ms 4096 threads: 0ms 8192 threads: 0ms 16384 threads: 0ms The profiling is actually extremely noisy (likely due to cache effects), as some runs show almost no samples at 1024, 4096 threads, but obviously this does not scale to lots of threads. PiperOrigin-RevId: 241828039 Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c	2019-04-03 16:22:43 -07:00
Kevin Krakauer	82529becae	Fix index out of bounds in tty implementation. The previous implementation revolved around runes instead of bytes, which caused weird behavior when converting between the two. For example, peekRune would read the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for ÿ. However, peekRune also returned the length of the byte (1). When calling utf8.EncodeRune, we only allocated 1 byte, but tried to write the 2-byte character ÿ. tl;dr: I apparently didn't understand runes when I wrote this. PiperOrigin-RevId: 241789081 Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727	2019-04-03 13:00:34 -07:00
Kevin Krakauer	c79e81bd27	Addresses data race in tty implementation. Also makes the safemem reading and writing inline, as it makes it easier to see what locks are held. PiperOrigin-RevId: 241775201 Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c	2019-04-03 11:49:55 -07:00
Ian Lewis	77f01ee3c7	Add syscall annotations for unimplemented syscalls Added syscall annotations for unimplemented syscalls for later generation into reference docs. Annotations are of the form: @Syscall(<name>, <key:value>, ...) Supported args and values are: - arg: A syscall option. This entry only applies to the syscall when given this option. - support: Indicates support level - UNIMPLEMENTED: Unimplemented (implies returns:ENOSYS) - PARTIAL: Partial support. Details should be provided in note. - FULL: Full support - returns: Indicates a known return value. Values are syscall errors. This is treated as a string so you can use something like "returns:EPERM or ENOSYS". - issue: A Github issue number. - note: A note Example: // @Syscall(mmap, arg:MAP_PRIVATE, support:FULL, note:Private memory fully supported) // @Syscall(mmap, arg:MAP_SHARED, support:UNIMPLEMENTED, issue:123, note:Shared memory not supported) // @Syscall(setxattr, returns:ENOTSUP, note:Requires file system support) Annotations should be placed as close to their implementation as possible (preferably as part of a supporting function's Godoc) and should be updated as syscall support changes. PiperOrigin-RevId: 241697482 Change-Id: I7a846135db124e1271dc5057d788cba82ca312d4	2019-04-03 03:10:23 -07:00
Jamie Liu	c4caccd540	Set options on the correct Task in PTRACE_SEIZE. $ docker run --rm --runtime=runsc -it --cap-add=SYS_PTRACE debian bash -c "apt-get update && apt-get install strace && strace ls" ... Setting up strace (4.15-2) ... execve("/bin/ls", ["ls"], [/* 6 vars */]) = 0 brk(NULL) = 0x5646d8c1e000 uname({sysname="Linux", nodename="114ef93d2db3", ...}) = 0 ... PiperOrigin-RevId: 241643321 Change-Id: Ie4bce27a7fb147eef07bbae5895c6ef3f529e177	2019-04-02 18:13:19 -07:00
Nicolas Lacasse	1776ab28f0	Add test that symlinking over a directory returns EEXIST. Also remove comments in InodeOperations that required that implementation of some Create* operations ensure that the name does not already exist, since these checks are all centralized in the Dirent. PiperOrigin-RevId: 241637335 Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58	2019-04-02 17:28:36 -07:00
Rahat Mahmood	d14a7de658	Fix more data races in shm debug messages. PiperOrigin-RevId: 241630409 Change-Id: Ie0df5f5a2f20c2d32e615f16e2ba43c88f963181	2019-04-02 16:46:32 -07:00
Wei Zhang	1fcd40719d	device: fix device major/minor Current gvisor doesn't give devices a right major and minor number. When testing golang supporting of gvisor, I run the test case below: ``` $ docker run -ti --runtime runsc golang:1.12.1 bash -c "cd /usr/local/go/src && ./run.bash " ``` And it reports some errors, one of them is: "--- FAIL: TestDevices (0.00s) --- FAIL: TestDevices//dev/null_1:3 (0.00s) dev_linux_test.go:45: for /dev/null Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/null Minor(0x0) == 0, want 3 dev_linux_test.go:51: for /dev/null Mkdev(1, 3) == 0x103, want 0x0 --- FAIL: TestDevices//dev/zero_1:5 (0.00s) dev_linux_test.go:45: for /dev/zero Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/zero Minor(0x0) == 0, want 5 dev_linux_test.go:51: for /dev/zero Mkdev(1, 5) == 0x105, want 0x0 --- FAIL: TestDevices//dev/random_1:8 (0.00s) dev_linux_test.go:45: for /dev/random Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/random Minor(0x0) == 0, want 8 dev_linux_test.go:51: for /dev/random Mkdev(1, 8) == 0x108, want 0x0 --- FAIL: TestDevices//dev/full_1:7 (0.00s) dev_linux_test.go:45: for /dev/full Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/full Minor(0x0) == 0, want 7 dev_linux_test.go:51: for /dev/full Mkdev(1, 7) == 0x107, want 0x0 --- FAIL: TestDevices//dev/urandom_1:9 (0.00s) dev_linux_test.go:45: for /dev/urandom Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/urandom Minor(0x0) == 0, want 9 dev_linux_test.go:51: for /dev/urandom Mkdev(1, 9) == 0x109, want 0x0 " So I think we'd better assign to them correct major/minor numbers following linux spec. Signed-off-by: Wei Zhang <zhangwei198900@gmail.com> Change-Id: I4521ee7884b4e214fd3a261929e3b6dac537ada9 PiperOrigin-RevId: 241609021	2019-04-02 14:51:07 -07:00
Kevin Krakauer	52a51a8e20	Add a raw socket transport endpoint and use it for raw ICMP sockets. Having raw socket code together will make it easier to add support for other raw network protocols. Currently, only ICMP uses the raw endpoint. However, adding support for other protocols such as UDP shouldn't be much more difficult than adding a few switch cases. PiperOrigin-RevId: 241564875 Change-Id: I77e03adafe4ce0fd29ba2d5dfdc547d2ae8f25bf	2019-04-02 11:13:49 -07:00
Rahat Mahmood	7cff746ef2	Save/restore simple devices. We weren't saving simple devices' last allocated inode numbers, which caused inode number reuse across S/R. PiperOrigin-RevId: 241414245 Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284	2019-04-01 15:39:16 -07:00
Jamie Liu	b4006686d2	Don't expand COW-break on executable VMAs. PiperOrigin-RevId: 241403847 Change-Id: I4631ca05734142da6e80cdfa1a1d63ed68aa05cc	2019-04-01 14:47:31 -07:00
Andrei Vagin	a4b34e2637	gvisor: convert ilist to ilist:generic_list ilist:generic_list works faster (cl/240185278) and the code looks cleaner without type casting. PiperOrigin-RevId: 241381175 Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f	2019-04-01 12:53:27 -07:00
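
Note on commit 82529becae (tty index out of bounds): the message turns on the fact that converting a raw byte to a rune can yield a code point whose UTF-8 encoding is longer than one byte. The following is a minimal Go sketch of that behavior, not the gVisor code itself; the helper names encodedLen and encodeByteAsRune are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs for the rune obtained by
// converting a single raw byte with rune(b). (Hypothetical helper.)
func encodedLen(b byte) int {
	return utf8.RuneLen(rune(b))
}

// encodeByteAsRune converts b to a rune and encodes it into a buffer sized
// with utf8.RuneLen, so the write cannot run past the end of the buffer.
// (Hypothetical helper.)
func encodeByteAsRune(b byte) []byte {
	r := rune(b)
	buf := make([]byte, utf8.RuneLen(r))
	utf8.EncodeRune(buf, r)
	return buf
}

func main() {
	// ASCII bytes map to one-byte encodings.
	fmt.Println(encodedLen('a'))
	// 0xff becomes U+00FF, which needs two bytes in UTF-8.
	fmt.Println(encodedLen(0xff))
	fmt.Printf("% x\n", encodeByteAsRune(0xff))
}
```

Sizing the buffer from the rune rather than from the number of source bytes is the design point here: a one-byte buffer would be too small for any byte value of 0x80 or above once it has been reinterpreted as a code point.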


1714 Commits