Strengthen the header.IPv4.IsValid check to correctly check
for IHL/TotalLength fields. Also add a check to make sure
fragmentOffsets + size of the fragment do not cause a wrap
around for the end of the fragment.
PiperOrigin-RevId: 274049313
The behavior for sending and receiving local broadcast (255.255.255.255)
traffic is as follows:
Outgoing
--------
* A broadcast packet sent on a socket that is bound to an interface goes out
that interface
* A broadcast packet sent on an unbound socket follows the route table to
select the outgoing interface
+ if an explicit route entry exists for 255.255.255.255/32, use that one
+ else use the default route
* Broadcast packets are looped back and delivered following the rules for
incoming packets (see next). This is the same behavior as for multicast
packets, except that it cannot be disabled via sockopt.
Incoming
--------
* Sockets wishing to receive broadcast packets must bind to either INADDR_ANY
(0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives
broadcast packets.
* Broadcast packets are multiplexed to all sockets matching it. This is the
same behavior as for multicast packets.
* A socket can bind to 255.255.255.255:<port> and then receive its own
broadcast packets sent to 255.255.255.255:<port>
In addition, this change implicitly fixes an issue with multicast reception. If
two sockets want to receive a given multicast stream and one is bound to ANY
while the other is bound to the multicast address, only one of them will
receive the traffic.
PiperOrigin-RevId: 272792377
Netstack always picks a random start point everytime PickEphemeralPort
is called. While this is required for UDP so that DNS requests go
out through a randomized set of ports it is not required for TCP. Infact
Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret
initialized at start of the application to get a random offset. But to
ensure it doesn't start from the same point on every scan it uses a static
hint that is incremented by 2 in every call to pick ephemeral ports.
The reason for 2 is Linux seems to split the port ranges where active connects
seem to use even ones while odd ones are used by listening sockets.
This CL implements a similar strategy where we use a hash + hint to generate
the offset to start the search for a free Ephemeral port.
This ensures that we cycle through the available port space in order for
repeated connects to the same destination and significantly reduces the
chance of picking a recently released port.
PiperOrigin-RevId: 272058370
Non-primary addresses are used for endpoints created to accept multicast and
broadcast packets, as well as "helper" endpoints (0.0.0.0) that allow sending
packets when no proper address has been assigned yet (e.g., for DHCP). These
addresses are not real addresses from a user point of view and should not be
part of the NICInfo() value. Also see b/127321246 for more info.
This switches NICInfo() to call a new NIC.PrimaryAddresses() function. To still
allow an option to get all addresses (mostly for testing) I added
Stack.GetAllAddresses() and NIC.AllAddresses().
In addition, the return value for GetMainNICAddress() was changed for the case
where the NIC has no primary address. Instead of returning an error here,
it now returns an empty AddressWithPrefix() value. The rational for this
change is that it is a valid case for a NIC to have no primary addresses.
Lastly, I refactored the code based on the new additions.
PiperOrigin-RevId: 270971764
Previously, the only safe way to use an fdbased endpoint was to leak the FD.
This change makes it possible to safely close the FD.
This is the first step towards having stoppable stacks.
Updates #837
PiperOrigin-RevId: 270346582
This also allows the tee(2) implementation to be enabled, since dup can now be
properly supported via WriteTo.
Note that this change necessitated some minor restructoring with the
fs.FileOperations splice methods. If the *fs.File is passed through directly,
then only public API methods are accessible, which will deadlock immediately
since the locking is already done by fs.Splice. Instead, we pass through an
abstract io.Reader or io.Writer, which elide locks and use the underlying
fs.FileOperations directly.
PiperOrigin-RevId: 268805207
Fix a bug where udp.(*endpoint).Disconnect [accessible in gVisor via
epsocket.(*SocketOperations).Connect with AF_UNSPEC] would leak a port
reservation if the socket/endpoint had an ephemeral port assigned to it.
glibc's getaddrinfo uses connect with AF_UNSPEC, causing each call of
getaddrinfo to leak a port. Call getaddrinfo too many times and you run out of
ports (shows up as connect returning EAGAIN and getaddrinfo returning
EAI_NONAME "Name or service not known").
PiperOrigin-RevId: 268071160
The IPv6 all-nodes multicast address will be joined on NIC enable, and the
appropriate IPv6 solicited-node multicast address will be joined when IPv6
addresses are added.
Tests: Test receiving packets destined to the IPv6 link-local all-nodes
multicast address and the IPv6 solicted node address of an added IPv6 address.
PiperOrigin-RevId: 268047073
There are a few cases addressed by this change
- We no longer generate a RST in response to a RST packet.
- When we receive a RST we cleanup and release all reservations immediately as
the connection is now aborted.
- An ACK received by a listening socket generates a RST when SYN cookies are not
in-use. The only reason an ACK should land at the listening socket is if we
are using SYN cookies otherwise the goroutine for the handshake in progress
should have gotten the packet and it should never have arrived at the
listening endpoint.
- Also fixes the error returned when a connection times out due to a
Keepalive timer expiration from ECONNRESET to a ETIMEDOUT.
PiperOrigin-RevId: 267238427
This also renames "subnet" to "addressRange" to avoid any more confusion with
an interface IP's subnet.
Lastly, this also removes the Stack.ContainsSubnet(..) API since it isn't used
by anyone. Plus the same information can be obtained from
Stack.NICAddressRanges().
PiperOrigin-RevId: 267229843
Adds support to generate Port Unreachable messages for UDP
datagrams received on a port for which there is no valid
endpoint.
Fixes#703
PiperOrigin-RevId: 267034418
The blockingpoll_unsafe.go was copied to blockingpoll_noyield_unsafe.go
during merging commit 7206202bb9. If it still stay here, it would
cause build errors on non-amd64 platform.
ERROR:
pkg/tcpip/link/rawfile/BUILD:5:1:
GoCompilePkg
pkg/tcpip/link/rawfile.a
failed (Exit 1) builder failed: error executing command
bazel-out/host/bin/external/go_sdk/builder compilepkg -sdk
external/go_sdk -installsuffix linux_arm64 -src
pkg/tcpip/link/rawfile/blockingpoll_noyield_unsafe.go -src ...
(remaining 33 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox
compilepkg: error running subcommand: exit status 2
pkg/tcpip/link/rawfile/blockingpoll_yield_unsafe.go:35:6:
BlockingPoll redeclared in this block
previous declaration at
pkg/tcpip/link/rawfile/blockingpoll_unsafe.go:26:78
Target //pkg/tcpip/link/rawfile:rawfile failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 25.531s, Critical Path: 21.08s
INFO: 262 processes: 262 linux-sandbox.
FAILED: Build did NOT complete successfully
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I4e21f82984225d0aa173de456f7a7c66053a053e
This allows the stack to learn remote link addresses on incoming
packets, reducing the need to ARP to send responses.
This also reduces the number of round trips to the system clock,
since that may also prove to be performance-sensitive.
Fixes#739.
PiperOrigin-RevId: 265815816
Add missing state transition to LastAck, which should happen when the
endpoint has already recieved a FIN from the remote side, and is
sending its own FIN.
PiperOrigin-RevId: 265568314
This addresses the problem where an endpoint has its address removed but still
has outstanding references held by routes used in connected TCP/UDP sockets
which prevent the removal of the endpoint.
The fix adds a new "expired" flag to the referenced network endpoint, which is
set when an endpoint has its address removed. Incoming packets are not
delivered to an expired endpoint (unless in promiscuous mode), while sending
outgoing packets triggers an error to the caller (unless in spoofing mode).
In addition, a few helper functions were added to stack_test.go to reduce
code duplications.
PiperOrigin-RevId: 265514326
This fixes the issue of not being able to bind to either a multicast or
broadcast address as well as to send and receive data from it. The way to solve
this is to treat these addresses similar to the ANY address and register their
transport endpoint ID with the global stack's demuxer rather than the NIC's.
That way there is no need to require an endpoint with that multicast or
broadcast address. The stack's demuxer is in fact the only correct one to use,
because neither broadcast- nor multicast-bound sockets care which NIC a
packet was received on (for multicast a join is still needed to receive packets
on a NIC).
I also took the liberty of refactoring udp_test.go to consolidate a lot of
duplicate code and make it easier to create repetitive tests that test the same
feature for a variety of packet and socket types. For this purpose I created a
"flowType" that represents two things: 1) the type of packet being sent or
received and 2) the type of socket used for the test. E.g., a "multicastV4in6"
flow represents a V4-mapped multicast packet run through a V6-dual socket.
This allows writing significantly simpler tests. A nice example is testTTL().
PiperOrigin-RevId: 264766909
This adds the same logic to NIC.findEndpoint that is already done in
NIC.getRef. Since this makes the two functions very similar they were combined
into one with the originals being wrappers.
PiperOrigin-RevId: 263864708
These errors are always pointers; there's no sense in dereferencing them
in the panic call. Changed one false positive for clarity.
PiperOrigin-RevId: 263611579
13a98df rearranged some of this code in a way that broke compilation of
the netstack-only export at github.com/google/netstack because
*_state.go files are not included in that export.
This commit moves resumption logic back into *_state.go, fixing the
compilation breakage.
PiperOrigin-RevId: 263601629
SendMsg before this change would copy all the data over into a
new slice even if the underlying socket could only accept a
small amount of data. This is really inefficient with non-blocking
sockets and under high throughput where large writes could get
ErrWouldBlock or if there was say a timeout associated with the sendmsg()
syscall.
With this change we delay copying bytes in till they are needed and only
copy what can be potentially sent/held in the socket buffer. Reducing
the need to repeatedly copy data over.
Also a minor fix to change state FIN-WAIT-1 when shutdown(..., SHUT_WR) is called
instead of when we transmit the actual FIN. Otherwise the socket could remain in
CONNECTED state even though the user has called shutdown() on the socket.
Updates #627
PiperOrigin-RevId: 263430505
This change just introduces different congestion control states and
ensures the sender.state is updated to reflect the current state
of the connection.
It is not used for any decisions yet but this is required before
algorithms like Eiffel/PRR can be implemented.
Fixes#394
PiperOrigin-RevId: 262638292
Endpoint protocol goroutines were previously started as part of
loading the endpoint. This is potentially too soon, as resources used
by these goroutine may not have been loaded. Protocol goroutines may
perform meaningful work as soon as they're started (ex: incoming
connect) which can cause them to indirectly access resources that
haven't been loaded yet.
This CL defers resuming all protocol goroutines until the end of
restore.
PiperOrigin-RevId: 262409429
This can happen because endpoint.Close() closes the accept channel first and
then drains/resets any accepted but not delivered connections. But there can be
connections that are connected but not delivered to the channel as the channel
was full. But closing the channel can cause these writes to fail with a write to
a closed channel.
The correct solution is to abort any connections in SYN-RCVD state and
drain/abort all completed connections before closing the accept channel.
PiperOrigin-RevId: 261951132
The checksum was not being reset before being re-calculated and sent out.
This caused the sent checksum to always be `0x0800`.
Fixes#605.
PiperOrigin-RevId: 260965059
This doesn't currently pass on gVisor.
While I'm here, fix a bug where connecting to the v6-mapped v4 address doesn't
work in gVisor.
PiperOrigin-RevId: 260923961
syscall.POLL is not supported on arm64, using syscall.PPOLL
to support both the x86 and arm64. refs #63
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I2c81a063d3ec4e7e6b38fe62f17a0924977f505e
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/543 from xiaobo55x:master ba598263fd3748d1addd48e4194080aa12085164
PiperOrigin-RevId: 260752049
This allows the user code to add a network address with a subnet prefix length.
The prefix length value is stored in the network endpoint and provided back to
the user in the ProtocolAddress type.
PiperOrigin-RevId: 259807693
This fixes a bug introduced in cl/251934850 that caused
connect-accept-close-connect races to result in the second connect call
failiing when it should have succeeded.
PiperOrigin-RevId: 259584525
iptables also relies on IPPROTO_RAW in a way. It opens such a socket to
manipulate the kernel's tables, but it doesn't actually use any of the
functionality. Blegh.
PiperOrigin-RevId: 257903078
Adds support to set/get the TCP_MAXSEG value but does not
really change the segment sizes emitted by netstack or
alter the MSS advertised by the endpoint. This is currently
being added only to unblock iperf3 on gVisor. Plumbing
this correctly requires a bit more work which will come
in separate CLs.
PiperOrigin-RevId: 257859112
Addresses obvious typos, in the documentation only.
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
Today we have the logic split in two places between endpoint Read() and the
worker goroutine which actually sends a zero window. This change makes it so
that when a zero window ACK is sent we set a flag in the endpoint which can be
read by the endpoint to decide if it should notify the worker to send a
nonZeroWindow update.
The worker now does not do the check again but instead sends an ACK and flips
the flag right away.
Similarly today when SO_RECVBUF is set the SetSockOpt call has logic
to decide if a zero window update is required. Rather than do that we move
the logic to the worker goroutine and it can check the zeroWindow flag
and send an update if required.
PiperOrigin-RevId: 254505447
This test will occasionally fail waiting to read a packet. From repeated runs,
I've seen it up to 1.5s for waitForPackets to complete.
PiperOrigin-RevId: 254484627
The implementation is similar to linux where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.
This ensures that the advertised window grows at a reasonable rate to accomodate
for the sender's rate and prevents large amounts of data being held in stack
buffers if the application is not actively reading or not reading fast enough.
The original paper that was used to implement the linux receive buffer auto-
tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf
NOTE: Linux does not implement DRS as defined in that paper, it's just a good
reference to understand the solution space.
Updates #230
PiperOrigin-RevId: 253168283
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.
PiperOrigin-RevId: 252889093
Changes netstack to confirm to current linux behaviour where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly we allow upto
backlog connections to be in SYN-RCVD state as long as the backlog is not full.
We also now drop a SYN if syn cookies are in use and the backlog for the
listening endpoint is full.
Added new tests to confirm the behaviour.
Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.
Fixes#236
PiperOrigin-RevId: 252500462
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).
For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.
PiperOrigin-RevId: 251934850
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.
This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.
Updates #231
PiperOrigin-RevId: 251852825
In case of GSO, a segment can container more than one packet
and we need to use the pCount() helper to get a number of packets.
PiperOrigin-RevId: 251743020
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.
PiperOrigin-RevId: 251739298
When checking the length of the acceptedChan we should hold the
endpoint mutex otherwise a syn received while the listening socket
is being closed can result in a data race where the cleanupLocked
routine sets acceptedChan to nil while a handshake goroutine
in progress could try and check it at the same time.
PiperOrigin-RevId: 251537697
Netstack sets the unprocessed segment queue size to match the receive
buffer size. This is not required as this queue only needs to hold enough
for a short duration before the endpoint goroutine can process it.
Updates #230
PiperOrigin-RevId: 250976323
Funcion signatures are not validated during compilation. Since
they are not exported, they can change at any time. The guard
ensures that they are verified at least on every version upgrade.
PiperOrigin-RevId: 250733742
Netstack listen loop can get stuck if cookies are in-use and the app is slow to
accept incoming connections. Further we continue to complete handshake for a
connection even if the backlog is full. This creates a problem when a lots of
connections come in rapidly and we end up with lots of completed connections
just hanging around to be delivered.
These fixes change netstack behaviour to mirror what linux does as described
here in the following article
http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html
Now when cookies are not in-use Netstack will silently drop the ACK to a SYN-ACK
and not complete the handshake if the backlog is full. This will result in the
connection staying in a half-complete state. Eventually the sender will
retransmit the ACK and if backlog has space we will transition to a connected
state and deliver the endpoint.
Similarly when cookies are in use we do not try and create an endpoint unless
there is space in the accept queue to accept the newly created endpoint. If
there is no space then we again silently drop the ACK as we can just recreate it
when the ACK is retransmitted by the peer.
We also now use the backlog to cap the size of the SYN-RCVD queue for a given
endpoint. So at any time there can be N connections in the backlog and N in a
SYN-RCVD state if the application is not accepting connections. Any new SYNs
will be dropped.
This CL also fixes another small bug where we mark a new endpoint which has not
completed handshake as connected. We should wait till handshake successfully
completes before marking it connected.
Updates #236
PiperOrigin-RevId: 250717817
These wakers are uselessly allocated and passed around; nothing ever
listens for notifications on them. The code here appears to be
vestigial, so removing it and allowing a nil waker to be passed seems
appropriate.
PiperOrigin-RevId: 249879320
Change-Id: Icd209fb77cc0dd4e5c49d7a9f2adc32bf88b4b71
This is in preparation to support an fdbased endpoint that can read/dispatch
packets from multiple underlying fds.
Updates #231
PiperOrigin-RevId: 249337074
Change-Id: Id7d375186cffcf55ae5e38986e7d605a96916d35
And stop storing the Filesystem in the MountSource.
This allows us to decouple the MountSource filesystem type from the name of the
filesystem.
PiperOrigin-RevId: 247292982
Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8