When execveat is called on an interpreter script, the symlink count for
resolving the script path should be separate from the count for resolving the
the corresponding interpreter. An ELOOP error should not occur if we do not hit
the symlink limit along any individual path, even if the total number of
symlinks encountered exceeds the limit.
Closes#574
PiperOrigin-RevId: 277358474
This change helps support iterating over an NDP options buffer so that
implementations can handle all the NDP options present in an NDP packet.
Note, this change does not yet actually handle these options, it just provides
the tools to do so (in preparation for NDP's Prefix, Parameter, and a complete
implementation of Neighbor Discovery).
Tests: Unittests to make sure we can iterate over a valid NDP options buffer
that may contain multiple options. Also tests to check an iterator before
using it to see if the NDP options buffer is malformed.
PiperOrigin-RevId: 277312487
When an interpreter script is opened with O_CLOEXEC and the resulting fd is
passed into execveat, an ENOENT error should occur (the script would otherwise
be inaccessible to the interpreter). This matches the actual behavior of
Linux's execveat.
PiperOrigin-RevId: 277306680
This change supports using a user supplied TCP MSS for new active TCP
connections. Note, the user supplied MSS must be less than or equal to the
maximum possible MSS for a TCP connection's route. If it is greater than the
maximum possible MSS, the maximum possible MSS will be used as the connection's
MSS instead.
This change does not use this user supplied MSS for connections accepted from
listening sockets - that will come in a later change.
Test: Test that outgoing TCP SYN segments contain a TCP MSS option with the user
supplied MSS if it is not greater than the maximum possible MSS for the route.
PiperOrigin-RevId: 277185125
Since the syscall.Stat_t.Nlink is defined as different types on
amd64 and arm64(uint64 and uint32 respectively), we need to cast
them to a unified uint64 type in gVisor code.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I7542b99b195c708f3fc49b1cbe6adebdd2f6e96b
This change simplifies the function signatures of functions related to loading
executables, such as LoadTaskImage, Load, loadBinary.
PiperOrigin-RevId: 276821187
This change validates the ICMPv6 checksum field before further processing an
ICMPv6 packet.
Tests: Unittests to make sure that only ICMPv6 packets with a valid checksum
are accepted/processed. Existing tests using checker.ICMPv6 now also check the
ICMPv6 checksum field.
PiperOrigin-RevId: 276779148
This change is in preparation for NDP Prefix Discovery and SLAAC where the stack
will need to handle NDP Prefix Information options.
Tests: Test that given an NDP Prefix Information option buffer, correct values
are returned by the field getters.
PiperOrigin-RevId: 276594592
The amss field in the tcpip.tcp.handshake was not used anywhere. Removed it to
not cause confusion with the amss field in the tcpip.tcp.endpoint struct, which
was documented to be used (and is actually being used) for the same purpose.
PiperOrigin-RevId: 276577088
This change makes it so that NDP work is done using the per-interface NDP
configurations instead of the stack-wide default NDP configurations to correctly
implement RFC 4861 section 6.3.2 (note here, a host is a single NIC operating
as a host device), and RFC 4862 section 5.1.
Test: Test that we can set NDP configurations on a per-interface basis without
affecting the configurations of other interfaces or the stack-wide default. Also
make sure that after the configurations are updated, the updated configurations
are used for NDP processes (e.g. Duplicate Address Detection).
PiperOrigin-RevId: 276525661
Use fd.next to store the iteration start position, which can be used to accelerate allocating new FDs.
And adding the corresponding gtest benchmark to measure performance.
@tanjianfeng
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/758 from DarcySail:master 96685ec7886dfe1a64988406831d3bc002b438cc
PiperOrigin-RevId: 276351250
This change introduces a new interface, stack.NDPDispatcher. It can be
implemented by the netstack integrator to receive NDP related events. As of this
change, only DAD related events are supported.
Tests: Existing tests were modified to use the NDPDispatcher's DAD events for
DAD tests where it needed to wait for DAD completing (failing and resolving).
PiperOrigin-RevId: 276338733
This change is in preparation for NDP Router Discovery where the stack will need
to handle NDP Router Advertisments.
Tests: Test that given an NDP Router Advertisement buffer (body of an ICMPv6
packet, correct values are returned by the field getters).
PiperOrigin-RevId: 276146817
This change makes sure that when an address which is already known by a NIC and
has kind = permanentExpired gets promoted to permanent, the new
PrimaryEndpointBehavior is respected.
PiperOrigin-RevId: 276136317
Right now, we send each tcp packet separately, we call one system
call per-packet. This patch allows to generate multiple tcp packets
and send them by sendmmsg.
The arguable part of this CL is a way how to handle multiple headers.
This CL adds the next field to the Prepandable buffer.
Nginx test results:
Server Software: nginx/1.15.9
Server Hostname: 10.138.0.2
Server Port: 8080
Document Path: /10m.txt
Document Length: 10485760 bytes
w/o gso:
Concurrency Level: 5
Time taken for tests: 5.491 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 18.21 [#/sec] (mean)
Time per request: 274.525 [ms] (mean)
Time per request: 54.905 [ms] (mean, across all concurrent requests)
Transfer rate: 186508.03 [Kbytes/sec] received
sw-gso:
Concurrency Level: 5
Time taken for tests: 3.852 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 25.96 [#/sec] (mean)
Time per request: 192.576 [ms] (mean)
Time per request: 38.515 [ms] (mean, across all concurrent requests)
Transfer rate: 265874.92 [Kbytes/sec] received
w/o gso:
$ ./tcp_benchmark --client --duration 15 --ideal
[SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec
software gso:
$ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso
[SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec
PiperOrigin-RevId: 276112677
This change adds support for optionally auto-generating an IPv6 link-local
address based on the NIC's MAC Address on NIC enable.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that a link-local
address will not be auto-generated unless the stack is explicitly configured.
See `stack.Options` for more details. Specifically, see
`stack.Options.AutoGenIPv6LinkLocal`.
Tests: Tests to make sure that the IPb6 link-local address is only
auto-generated if the stack is specifically configured to do so. Also tests to
make sure that an auto-generated address goes through the DAD process.
PiperOrigin-RevId: 276059813
This patch enabled the basic framework for arm64 guest.
Serveral jobs were finished in this patch:
1, ring0.Vectors()
2, switchToUser()
3, basic framwork for Arm64 guest.
Signed-off-by: Bin Lu <bin.lu@arm.com>
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With
runsc, you'll need to pass `--net-raw=true` to enable them.
Binding isn't supported yet.
PiperOrigin-RevId: 275909366
This change fixes several issues with the fsgofer host UDS support. Notably, it
adds support for SOCK_SEQPACKET and SOCK_DGRAM sockets [1]. It also fixes
unsafe use of unet.Socket, which could cause a panic if Socket.FD is called
when err != nil, and calls to Socket.FD with nothing to prevent the garbage
collector from destroying and closing the socket.
A set of tests is added to exercise host UDS access. This required extracting
most of the syscall test runner into a library that can be used by custom
tests.
Updates #235
Updates #1003
[1] N.B. SOCK_DGRAM sockets are likely not particularly useful, as a server can
only reply to a client that binds first. We don't allow bind, so these are
unlikely to be used.
PiperOrigin-RevId: 275558502
It is quite legal to send from the ANY address (it is required for
DHCP). I can't figure out why the broadcast address was included here,
so removing that as well.
PiperOrigin-RevId: 275541954
* Pulls common functionality (IO and locking on open) into pipe_util.go.
* Adds pipe/vfs.go, which implements a subset of vfs.FileDescriptionImpl.
A subsequent change will add support for pipes in memfs.
PiperOrigin-RevId: 275322385
NDP Neighbor Solicitations sent during Duplicate Address Detection must have an
IP hop limit of 255, as all NDP Neighbor Solicitations should have.
Test: Test that DAD messages have the IPv6 hop limit field set to 255.
PiperOrigin-RevId: 275321680
This change adds support for Duplicate Address Detection on IPv6 addresses
as defined by RFC 4862 section 5.4.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that DAD will not be
performed. See `stack.Options` and `stack.NDPConfigurations` for more details.
Tests: Tests to make sure that the DAD process properly resolves or fails.
That is, tests make sure that DAD resolves only if:
- No other node is performing DAD for the same address
- No other node owns the same address
PiperOrigin-RevId: 275189471
Standard Linux kernel versions are VERSION.PATCHLEVEL.SUBLEVEL. e.g., 4.4.0,
even when the sublevel is 0. Match this standard.
PiperOrigin-RevId: 275125715
Linux kernel before 4.19 doesn't implement a feature that updates
open FD after a file is open for write (and is copied to the upper
layer). Already open FD will continue to read the old file content
until they are reopened. This is especially problematic for gVisor
because it caches open files.
Flag was added to force readonly files to be reopenned when the
same file is open for write. This is only needed if using kernels
prior to 4.19.
Closes#1006
It's difficult to really test this because we never run on tests
on older kernels. I'm adding a test in GKE which uses kernels
with the overlayfs problem for 1.14 and lower.
PiperOrigin-RevId: 275115289
When any of these flags are set, all writes will trigger a subsequent fsync
call. This behavior already existed for "write-through" mounts.
O_DIRECT is treated as an alias for O_SYNC. Better support coming soon.
PiperOrigin-RevId: 275114392
These syscalls were changed in the amd64 file around the time the arm64 PR was
sent out, so their changes got lost.
Updates #63
PiperOrigin-RevId: 275114194
- Pass context.Context to OnClose().
- Pass memmap.MMapOpts to ConfigureMMap() by pointer so that implementations
can actually mutate it as required.
PiperOrigin-RevId: 274934967
Reassembly can fail due to an invalid sequence of fragments
being received. eg. Multiple fragments with same id which
claim to be the last one by setting the more flag to 0 etc.
It's safer to just drop the reassembler and increment a metric
than to panic when reassembly fails.
PiperOrigin-RevId: 274920901
...and do not populate link address cache at dispatch. This partially
reverts 313c767b00, which caused malformed
packets (e.g. NDP Neighbor Adverts with incorrect hop limit values) to
populate the address cache. In particular, this masked a bug that was
introduced to the Neighbor Advert generation code in
7c1587e340.
PiperOrigin-RevId: 274865182
Netstack has its own stats, we use this to fill /proc/net/snmp.
Note that some metrics are not recorded in Netstack, which will be shown
as 0 in the proc file.
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: Ie0089184507d16f49bc0057b4b0482094417ebe1
For hostinet, we inherit the data from host procfs. To to that, we
cache the fds for these files for later reads.
Fixes#506
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: I2f81215477455b9c59acf67e33f5b9af28ee0165
This proc file reports routing information to applications inside the
container.
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: I498e47f8c4c185419befbb42d849d0b099ec71f3
This proc file contains statistics according to [1].
[1] https://tools.ietf.org/html/rfc2013
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: I9662132085edd8a7783d356ce4237d7ac0800d94
This allows for peeking at the length of the next message on a netlink socket
without pulling it off the socket's buffer/queue, allowing tools like 'ip' to
work.
This CL also fixes an issue where dump_done_errno was not included in the
NLMSG_DONE messages payload.
Issue #769
PiperOrigin-RevId: 274068637
Strengthen the header.IPv4.IsValid check to correctly check
for IHL/TotalLength fields. Also add a check to make sure
fragmentOffsets + size of the fragment do not cause a wrap
around for the end of the fragment.
PiperOrigin-RevId: 274049313
The signalfd descriptors otherwise always show as available. This can lead
programs to spin, assuming they are looking to see what signals are pending.
Updates #139
PiperOrigin-RevId: 274017890
Also ensure that all flipcall transport errors not returned by p9 (converted to
EIO by the client, or dropped on the floor by channel server goroutines) are
logged.
PiperOrigin-RevId: 272963663
The behavior for sending and receiving local broadcast (255.255.255.255)
traffic is as follows:
Outgoing
--------
* A broadcast packet sent on a socket that is bound to an interface goes out
that interface
* A broadcast packet sent on an unbound socket follows the route table to
select the outgoing interface
+ if an explicit route entry exists for 255.255.255.255/32, use that one
+ else use the default route
* Broadcast packets are looped back and delivered following the rules for
incoming packets (see next). This is the same behavior as for multicast
packets, except that it cannot be disabled via sockopt.
Incoming
--------
* Sockets wishing to receive broadcast packets must bind to either INADDR_ANY
(0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives
broadcast packets.
* Broadcast packets are multiplexed to all sockets matching it. This is the
same behavior as for multicast packets.
* A socket can bind to 255.255.255.255:<port> and then receive its own
broadcast packets sent to 255.255.255.255:<port>
In addition, this change implicitly fixes an issue with multicast reception. If
two sockets want to receive a given multicast stream and one is bound to ANY
while the other is bound to the multicast address, only one of them will
receive the traffic.
PiperOrigin-RevId: 272792377
The input file descriptor is always a regular file, so sendfile can't lose any
data if it will not be able to write them to the output file descriptor.
Reported-by: syzbot+22d22330a35fa1c02155@syzkaller.appspotmail.com
PiperOrigin-RevId: 272730357
Right now, we can find more than one process with the 1 PID in /proc.
$ for i in `seq 10`; do
> unshare -fp sleep 1000 &
> done
$ ls /proc
1 1 1 1 12 18 24 29 6 loadavg net sys version
1 1 1 1 16 20 26 32 cpuinfo meminfo self thread-self
1 1 1 1 17 21 28 36 filesystems mounts stat uptime
PiperOrigin-RevId: 272506593
gVisor does not currently implement the functionality that would result in
AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we
should do the same.
PiperOrigin-RevId: 272311488
Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to
track their CPU usage. This improves latency in the syscall path by avoid
expensive monotonic clock calls on every syscall entry/exit.
However, this timer fires every 10ms. Thus, when all tasks are idle (i.e.,
blocked or stopped), this forces a sentry wakeup every 10ms, when we may
otherwise be able to sleep until the next app-relevant event. These wakeups
cause the sentry to utilize approximately 2% CPU when the application is
otherwise idle.
Updates to clock are not strictly necessary when the app is idle, as there are
no readers of cpuClock. This commit reduces idle CPU by disabling the timer
when tasks are completely idle, and computing its effects at the next wakeup.
Rather than disabling the timer as soon as the app goes idle, we wait until the
next tick, which provides a window for short sleeps to sleep and wakeup without
doing the (relatively) expensive work of disabling and enabling the timer.
PiperOrigin-RevId: 272265822
Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58
(v4.11). Previously, extra pages were always mapped RW. Now, those pages will
be executable if the segment specified PF_X. They still must be writeable.
PiperOrigin-RevId: 272256280
It isn't allowed to splice data from and into the same pipe.
But right now this check is broken, because we don't check that both ends are
pipes.
PiperOrigin-RevId: 272107022
Netstack always picks a random start point everytime PickEphemeralPort
is called. While this is required for UDP so that DNS requests go
out through a randomized set of ports it is not required for TCP. Infact
Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret
initialized at start of the application to get a random offset. But to
ensure it doesn't start from the same point on every scan it uses a static
hint that is incremented by 2 in every call to pick ephemeral ports.
The reason for 2 is Linux seems to split the port ranges where active connects
seem to use even ones while odd ones are used by listening sockets.
This CL implements a similar strategy where we use a hash + hint to generate
the offset to start the search for a free Ephemeral port.
This ensures that we cycle through the available port space in order for
repeated connects to the same destination and significantly reduces the
chance of picking a recently released port.
PiperOrigin-RevId: 272058370
The gofer's CachingInodeOperations implementation contains an optimization for
the common open-read-close pattern when we have a host FD. In this case, the
host kernel will update the timestamp for us to a reasonably close time, so we
don't need an extra RPC to the gofer.
However, when the app explicitly sets the timestamps (via futimes or similar)
then we actually DO need to update the timestamps, because the host kernel
won't do it for us.
To fix this, a new boolean `forceSetTimestamps` was added to
CachineInodeOperations.SetMaskedAttributes. It is only set by
gofer.InodeOperations.SetTimestamps.
PiperOrigin-RevId: 272048146
Before https://golang.org/cl/173160 syscall.RawSyscall would zero out
the last three register arguments to the system call. That no longer happens.
For system calls that take more than three arguments, use RawSyscall6 to
ensure that we pass zero, not random data, for the additional arguments.
PiperOrigin-RevId: 271062527
Non-primary addresses are used for endpoints created to accept multicast and
broadcast packets, as well as "helper" endpoints (0.0.0.0) that allow sending
packets when no proper address has been assigned yet (e.g., for DHCP). These
addresses are not real addresses from a user point of view and should not be
part of the NICInfo() value. Also see b/127321246 for more info.
This switches NICInfo() to call a new NIC.PrimaryAddresses() function. To still
allow an option to get all addresses (mostly for testing) I added
Stack.GetAllAddresses() and NIC.AllAddresses().
In addition, the return value for GetMainNICAddress() was changed for the case
where the NIC has no primary address. Instead of returning an error here,
it now returns an empty AddressWithPrefix() value. The rational for this
change is that it is a valid case for a NIC to have no primary addresses.
Lastly, I refactored the code based on the new additions.
PiperOrigin-RevId: 270971764