Commit Graph

539 Commits

Author SHA1 Message Date
Arthur Sfez 2c8379d957 Expose header methods that validate checksums
This is done for IPv4, UDP and TCP headers.
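
As a rough illustration of what validating a header checksum involves, the sketch below uses the RFC 1071 ones' complement sum. The helper names (onesComplementSum, isIPv4ChecksumValid, setIPv4Checksum) are hypothetical and are not the methods this commit exposes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// onesComplementSum folds b into a 16-bit ones' complement sum (RFC 1071).
func onesComplementSum(b []byte) uint16 {
	var sum uint32
	for len(b) >= 2 {
		sum += uint32(binary.BigEndian.Uint16(b))
		b = b[2:]
	}
	if len(b) == 1 {
		sum += uint32(b[0]) << 8 // pad a trailing odd byte with zero
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return uint16(sum)
}

// isIPv4ChecksumValid is a hypothetical helper: a header is valid when the
// sum over all of its bytes, checksum field included, is 0xffff.
func isIPv4ChecksumValid(hdr []byte) bool {
	return onesComplementSum(hdr) == 0xffff
}

// setIPv4Checksum (also hypothetical) zeroes the checksum field at bytes
// 10-11, then stores the complement of the sum over the header there.
func setIPv4Checksum(hdr []byte) {
	hdr[10], hdr[11] = 0, 0
	binary.BigEndian.PutUint16(hdr[10:], ^onesComplementSum(hdr))
}

func main() {
	hdr := []byte{
		0x45, 0x00, 0x00, 0x14, // version/IHL, TOS, total length 20
		0x00, 0x00, 0x00, 0x00, // ID, flags/fragment offset
		0x40, 0x06, 0x00, 0x00, // TTL 64, protocol TCP, checksum
		192, 0, 2, 1, // source address
		192, 0, 2, 2, // destination address
	}
	setIPv4Checksum(hdr)
	fmt.Println("valid after set:", isIPv4ChecksumValid(hdr))
	hdr[8]-- // corrupt the TTL without updating the checksum
	fmt.Println("valid after corruption:", isIPv4ChecksumValid(hdr))
}
```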

This also changes the packet checkers used in tests to error on
zero-checksum; it is unclear why it was allowed before.

And while I'm here, make comments' case consistent.

RELNOTES: n/a

Fixes #5049

PiperOrigin-RevId: 369383862
2021-04-20 00:28:42 -07:00
Nick Brown 7bfc76d946 De-duplicate TCP state in TCPEndpointState vs tcp.endpoint
This change replaces individual private members in tcp.endpoint with a single
private TCPEndpointState member.

Some internal substructures within endpoint (receiver, sender) have been broken
into a public substructure (which is then copied into the TCPEndpointState
returned from completeState()) alongside other private fields.

Fixes #4466

PiperOrigin-RevId: 369329514
2021-04-19 16:43:30 -07:00
Nayana Bidari 8ad6657a22 Fix TCP RACK flaky unit tests.
- Added delay to increase the RTT: In DSACK tests with RACK enabled and low
RTT, TLP can be detected before sending ACK and the tests flake. Increasing
the RTT will ensure that TLP does not happen before the ACK is sent.
- Fix TestRACKOnePacketTailLoss: The ACK does not contain DSACK, which means
either the original or retransmission (probe) was lost and SACKRecovery count
must be incremented.

Before: http://sponge2/c9bd51de-f72f-481c-a7f3-e782e7524883
After: http://sponge2/1307a796-103a-4a45-b699-e8d239220ed1
PiperOrigin-RevId: 369305720
2021-04-19 14:44:05 -07:00
Kevin Krakauer 32c18f443f Enlarge port range and fix integer overflow
Also count failed TCP port allocations

PiperOrigin-RevId: 368939619
2021-04-16 16:28:56 -07:00
Kevin Krakauer 19dfc4f7af Reduce tcp_x_test runtime and memory usage
Reduce the ephemeral port range, which decreases the calls to makeEP.

PiperOrigin-RevId: 368748379
2021-04-15 17:16:08 -07:00
Kevin Krakauer 10de8978f9 Use nicer formatting for IP addresses in tests
This was semi-automated -- there are many addresses that were not replaced.
Future commits should clean those up.

Parse4 and Parse6 were given their own package because //pkg/test can introduce
dependency cycles, as it depends transitively on //pkg/tcpip and some other
netstack packages.

PiperOrigin-RevId: 368726528
2021-04-15 15:11:04 -07:00
Mithun Iyer 326394b79a Fix listener close, client connect race
Fix a race where the ACK completing the handshake can be dropped by
a closing listener without RST to the peer. The listener close would
reset the accepted queue and that causes the connecting endpoint
in SYNRCVD state to drop the ACK thinking the queue is filled up.

PiperOrigin-RevId: 368165509
2021-04-13 00:58:56 -07:00
Tamir Duberstein a804b42fe5 Drop locks before calling waiterQueue.Notify
Holding this lock can cause the user's callback to deadlock if it
attempts to inspect the accept queue.
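
The pattern described above might look like the following sketch. The notifyingQueue type and its notify callback are hypothetical stand-ins, not the netstack listener or waiter.Queue; the point is only that the mutex is released before user code runs, so a callback that re-enters Len does not block on a lock its caller still holds.

```go
package main

import (
	"fmt"
	"sync"
)

// notifyingQueue is a hypothetical stand-in for a listener's accept queue.
type notifyingQueue struct {
	mu     sync.Mutex
	items  []string
	notify func() // user callback; may call back into Len
}

// Len takes the lock, so it must not be reached while Push holds it.
func (q *notifyingQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

// Push appends under the lock and releases it before notifying.
func (q *notifyingQueue) Push(item string) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock() // release before running user code

	if q.notify != nil {
		q.notify()
	}
}

func main() {
	q := &notifyingQueue{}
	q.notify = func() { fmt.Println("queued:", q.Len()) }
	q.Push("conn-1")
}
```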

PiperOrigin-RevId: 368068334
2021-04-12 13:14:41 -07:00
Tamir Duberstein c84ff99124 Use the SecureRNG to generate listener nonces
Some other cleanup while I'm here:
- Remove unused arguments
- Handle some unhandled errors
- Remove redundant casts
- Remove redundant parens
- Avoid shadowing `hash` package name

PiperOrigin-RevId: 367816161
2021-04-10 14:53:55 -07:00
Tamir Duberstein 2fea7d096b Don't store accepted endpoints in a channel
Use a linked list with cached length and capacity. The current channel
is already composed with a mutex and condition variable, and is never
used for its channel-like properties. Channels also require eager
allocation equal to their capacity, which a linked list does not.
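
A minimal sketch of a bounded queue built on a linked list, assuming the standard container/list package rather than whatever list type the commit actually uses. The endpointQueue type and its string elements are hypothetical. The zero-value list allocates nothing until the first element is pushed, and Len is constant time.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// endpointQueue is a hypothetical bounded FIFO guarded by a mutex.
type endpointQueue struct {
	mu       sync.Mutex
	eps      list.List // zero value is an empty list; no up-front allocation
	capacity int
}

// push enqueues ep and reports false if the queue is already full.
func (q *endpointQueue) push(ep string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.eps.Len() >= q.capacity {
		return false
	}
	q.eps.PushBack(ep)
	return true
}

// pop dequeues the oldest element and reports false if the queue is empty.
func (q *endpointQueue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	front := q.eps.Front()
	if front == nil {
		return "", false
	}
	return q.eps.Remove(front).(string), true
}

func main() {
	q := &endpointQueue{capacity: 2}
	fmt.Println(q.push("ep-1"), q.push("ep-2"), q.push("ep-3"))
	ep, ok := q.pop()
	fmt.Println(ep, ok)
}
```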

PiperOrigin-RevId: 367766626
2021-04-10 01:00:41 -07:00
Mithun Iyer dc8f6c6914 Move maxListenBacklog check to sentry
Move maxListenBacklog check to the caller of endpoint Listen so that it
is applicable to Unix domain sockets as well.
This was changed in cl/366935921.

Reported-by: syzbot+a35ae7cdfdde0c41cf7a@syzkaller.appspotmail.com
PiperOrigin-RevId: 367728052
2021-04-09 16:53:33 -07:00
Tamir Duberstein 070b76fe7f Remove duplicate accept queue fullness check
Both code paths perform this check; extract it and remove the comment
that suggests it is unique to one of the paths.

PiperOrigin-RevId: 367666160
2021-04-09 11:08:21 -07:00
Tamir Duberstein 1fe5dd8c68 Propagate SYN handling error
Both callers of this function still drop this error on the floor, but
progress is progress.

Updates #4690.

PiperOrigin-RevId: 367604788
2021-04-09 03:43:23 -07:00
Mithun Iyer 56c69fb0e7 Fix listen backlog handling to be in parity with Linux
- Change the accept queue full condition for a listening endpoint
  to only honor completed (and delivered) connections.
- Use syncookies if the number of incomplete connections is beyond
  listen backlog. This also cleans up the SynThreshold option code
  as that is no longer used with this change.
- Added a new stack option to unconditionally generate syncookies.
  Similar to sysctl -w net.ipv4.tcp_syncookies=2 on Linux.
- Enable keeping of incomplete connections beyond listen backlog.
- Drop incoming SYNs only if the accept queue is filled up.
- Drop incoming ACKs that complete handshakes when accept queue is full
- Enable the stack to accept one more connection than programmed by
  listen backlog.
- Handle a zero or negative backlog argument to listen as Linux does.
- Add syscall and packetimpact tests to reflect the changes above.
- Remove TCPConnectBacklog test which is polling for completed
  connections on the client side which is not reflective of whether
  the accept queue is filled up by the test. The modified syscall test
  in this CL addresses testing of connecting sockets.

Fixes #3153

PiperOrigin-RevId: 366935921
2021-04-05 21:53:41 -07:00
Bhasker Hariharan e7ca2a51a8 Add POLLRDNORM/POLLWRNORM support.
On Linux these are meant to be equivalent to POLLIN/POLLOUT. Rather
than hack these on in sys_poll etc., it felt cleaner to just clean up
the call sites to notify for both events. This is what Linux does
as well.

Fixes #5544

PiperOrigin-RevId: 364859977
2021-03-24 12:11:44 -07:00
Nick Brown ec0aa657ed Unexpose immutable fields in stack.Route
This change sets the inner `routeInfo` struct to be a named private member
and replaces direct access with access through getters. Note that direct
access to the fields of `routeInfo` is still possible through the `RouteInfo`
struct.

Fixes #4902

PiperOrigin-RevId: 364822872
2021-03-24 09:38:27 -07:00
Nayana Bidari dc75f08c2a Use constant (TestInitialSequenceNumber) instead of integer (789) in tests.
PiperOrigin-RevId: 364596526
2021-03-23 10:59:57 -07:00
Ghanan Gowripalan d3a433caae Do not use martian loopback packets in tests
Transport demuxer and UDP tests should not use a loopback address as the
source address for packets injected into the stack as martian loopback
packets will be dropped in a later change.

PiperOrigin-RevId: 363479681
2021-03-17 12:29:08 -07:00
Zeling Feng 3dd7ad13b4 Fix tcp_fin_retransmission_netstack_test
Netstack does not check ACK number for FIN-ACK packets and goes into TIMEWAIT
unconditionally. Fixing the state machine will give us back the retransmission
of FIN.

PiperOrigin-RevId: 363301883
2021-03-16 16:59:26 -07:00
Mithun Iyer 5eede4e756 Fix a race with synRcvdCount and accept
There is a race in handling new incoming connections on a listening
endpoint that causes the endpoint to reply to more incoming SYNs than
what is permitted by the listen backlog.

The race occurs when there is a successful passive connection handshake
and the synRcvdCount counter is decremented, followed by the endpoint
delivered to the accept queue. In the window of time between
synRcvdCount decrementing and the endpoint being enqueued for accept,
new incoming SYNs can be handled without honoring the listen backlog
value, as the backlog could be perceived not full.

Fixes #5637

PiperOrigin-RevId: 363279372
2021-03-16 15:08:09 -07:00
Kevin Krakauer b1d5787726 Make netstack (//pkg/tcpip) buildable for 32 bit
Doing so involved breaking dependencies between //pkg/tcpip and the rest
of gVisor, which are discouraged anyways.

Tested on the Go branch via:
  gvisor.dev/gvisor/pkg/tcpip/...

Addresses #1446.

PiperOrigin-RevId: 363081778
2021-03-15 18:49:59 -07:00
Kevin Krakauer 82d7fb2cb0 improve readability of ports package
Lots of small changes:
- simplify package API via Reservation type
- rename some single-letter variable names that were hard to follow
- rename some types

PiperOrigin-RevId: 362442366
2021-03-11 21:05:32 -08:00
Zeling Feng 2a888a106d Give TCP flags a dedicated type
- Implement Stringer for it so that we can improve error messages.
- Use TCPFlags through the code base. There used to be a mixed usage of byte,
  uint8 and int as TCP flags.
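
A dedicated flag type with a Stringer could be sketched as below. The constant names and the output format are assumptions for illustration and may differ from what the commit defines; the bit positions follow the TCP header layout.

```go
package main

import (
	"fmt"
	"strings"
)

// TCPFlags holds the TCP header flag bits.
type TCPFlags uint8

// Flag constants; the names here are illustrative.
const (
	FlagFin TCPFlags = 1 << iota
	FlagSyn
	FlagRst
	FlagPsh
	FlagAck
	FlagUrg
)

// String implements fmt.Stringer so flags print by name in error messages.
func (f TCPFlags) String() string {
	names := []struct {
		flag TCPFlags
		name string
	}{
		{FlagFin, "FIN"}, {FlagSyn, "SYN"}, {FlagRst, "RST"},
		{FlagPsh, "PSH"}, {FlagAck, "ACK"}, {FlagUrg, "URG"},
	}
	var set []string
	for _, n := range names {
		if f&n.flag != 0 {
			set = append(set, n.name)
		}
	}
	if len(set) == 0 {
		return "none"
	}
	return strings.Join(set, "|")
}

func main() {
	fmt.Println(FlagSyn | FlagAck)
}
```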

PiperOrigin-RevId: 361940150
2021-03-09 18:00:03 -08:00
Kevin Krakauer abbdcebc54 Implement /proc/sys/net/ipv4/ip_local_port_range
Speeds up the socket stress tests by a couple orders of magnitude.

PiperOrigin-RevId: 361721050
2021-03-08 20:40:34 -08:00
Arthur Sfez fb733cdb8f Increment the counters when sending Echo requests
Updates #5597

PiperOrigin-RevId: 361252003
2021-03-05 16:51:45 -08:00
Ting-Yu Wang a9face757a Nit fix: Should use maxTimeout in backoffTimer
The only user is in (*handshake).complete and it specifies MaxRTO, so there are
no behavior changes.

PiperOrigin-RevId: 360954447
2021-03-04 10:54:06 -08:00
Ting-Yu Wang 1cd76d958a Make dedicated methods for data operations in PacketBuffer
One of the preparations to decouple the underlying buffer implementation.
There are still some methods that are tied to VectorisedView, and they will be
changed gradually in later CLs.

This CL also introduces a new ICMPv6ChecksumParams to replace the long list of
parameters when calling ICMPv6Checksum, aiming to be more descriptive.

PiperOrigin-RevId: 360778149
2021-03-03 16:05:16 -08:00
Bhasker Hariharan 3e69f5d088 Add checklocks analyzer.
This validates that if a struct field is annotated with "// +checklocks:mu",
where "mu" is a mutex field in the same struct, then the field is only
accessed with "mu" locked.

All types that are guarded by a mutex must be annotated with

// +checklocks:<mutex field name>

For more details please refer to README.md.
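
An annotated struct might look like the sketch below. The guardedCounter type is hypothetical; only the annotation form is taken from the message above, and the analyzer's exact rules are described in its README.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedCounter is a hypothetical type whose field n is guarded by mu.
type guardedCounter struct {
	mu sync.Mutex
	// +checklocks:mu
	n int
}

// inc touches n only while mu is held, which is the access pattern the
// annotation is meant to enforce.
func (c *guardedCounter) inc() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
	return c.n
}

func main() {
	c := &guardedCounter{}
	c.inc()
	fmt.Println(c.inc())
}
```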

PiperOrigin-RevId: 360729328
2021-03-03 12:24:21 -08:00
Andrei Vagin 865ca64ee8 tcp: endpoint.Write has to send all data that has been read from payload
io.ReadFull returns the number of bytes copied and an error if fewer
bytes were read.

PiperOrigin-RevId: 360247614
2021-03-01 12:17:20 -08:00
Bhasker Hariharan 037bb2f45a Fix panic due to zero length writes in TCP.
There is a short race where in Write an endpoint can transition from writable
to non-writable state due to say an incoming RST during the time we release
the endpoint lock and reacquire after copying the payload. In such a case
if the write happens to be a zero sized write we end up trying to call
sendData() even though nothing was queued.

This can panic when trying to enable/disable TCP timers if the endpoint had
already transitioned to a CLOSED/ERROR state due to the incoming RST as we
cleanup timers when the protocol goroutine terminates.

Sadly the race window is small enough that my attempts at reproducing the panic
in a syscall test have not been successful.

PiperOrigin-RevId: 359887905
2021-02-26 20:16:48 -08:00
Tamir Duberstein da2505df94 Use closure to avoid manual unlocking
Also increase refcount of raw.endpoint.route while in use.

Avoid allocating an array of size zero.

PiperOrigin-RevId: 359797788
2021-02-26 11:18:30 -08:00
Nayana Bidari f3de211bb7 RACK: recovery logic should check for receive window before re-transmitting.
Use maybeSendSegment while sending segments in RACK recovery which checks if
the receiver has space and splits the segments when the segment size is
greater than MSS.

PiperOrigin-RevId: 359641097
2021-02-25 16:29:28 -08:00
Kevin Krakauer 38c42bbf4a Remove deadlock in raw.endpoint caused by recursive read locking
Prevents the following deadlock:
- Raw packet is sent via e.Write(), which read locks e.mu
- Connect() is called, blocking on write locking e.mu
- The packet is routed to loopback and back to e.HandlePacket(), which read
  locks e.mu

Per the sync.RWMutex documentation, this deadlocks:

"If a goroutine holds a RWMutex for reading and another goroutine might call
Lock, no goroutine should expect to be able to acquire a read lock until the
initial read lock is released. In particular, this prohibits recursive read
locking. This is to ensure that the lock eventually becomes available; a blocked
Lock call excludes new readers from acquiring the lock."

Also, release eps.mu earlier in deliverRawPacket.

PiperOrigin-RevId: 359600926
2021-02-25 13:35:44 -08:00
Ayush Ranjan 845d0a65f4 [rack] TLP: ACK Processing and PTO scheduling.
This change implements TLP details enumerated in
https://tools.ietf.org/html/draft-ietf-tcpm-rack-08#section-7.5.3

Fixes #5085

PiperOrigin-RevId: 357125037
2021-02-11 22:06:09 -08:00
Ayush Ranjan 91cf7b3ca4 [netstack] Fix recovery entry and exit checks.
Entry check:

- Earlier implementation was preventing us from entering recovery even if
  SND.UNA is lost but dupAckCount is still below threshold. Fixed that.
- We should only enter recovery when at least one more byte of data beyond the
  highest byte that was outstanding when fast retransmit was last entered is
  acked. Added that check.

Exit check:

- Earlier we were checking if SEG.ACK is in range [SND.UNA, SND.NXT]. The
  intention was to check if any unacknowledged data was ACKed. Note that
  (SEG.ACK - 1) is actually the sequence number which was ACKed. So we were
  incorrectly including (SND.UNA - 1) in the range. Fixed the check to now be
  (SEG.ACK - 1) in range [SND.UNA, SND.NXT).
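
The corrected exit check can be sketched with wraparound-safe sequence arithmetic. The seqnum type and ackedNewData function are hypothetical names for illustration, not the netstack identifiers.

```go
package main

import "fmt"

// seqnum is a TCP sequence number; arithmetic wraps modulo 2^32.
type seqnum uint32

// inRange reports whether v lies in the half-open interval [lo, hi),
// accounting for wraparound.
func (v seqnum) inRange(lo, hi seqnum) bool {
	return v-lo < hi-lo
}

// ackedNewData applies the check described above: (SEG.ACK - 1) is the last
// sequence number acknowledged, and it must fall in [SND.UNA, SND.NXT).
func ackedNewData(segAck, sndUna, sndNxt seqnum) bool {
	return (segAck - 1).inRange(sndUna, sndNxt)
}

func main() {
	const sndUna, sndNxt = seqnum(100), seqnum(200)
	for _, ack := range []seqnum{100, 101, 200, 201} {
		fmt.Printf("SEG.ACK=%d acks new data: %v\n", ack, ackedNewData(ack, sndUna, sndNxt))
	}
}
```

With SEG.ACK equal to SND.UNA, the acknowledged byte is SND.UNA - 1, which falls outside the range, so a duplicate ACK no longer passes the check.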

Additionally, moved a RACK specific test to the rack tests file.
Added tests for the changes I made.

PiperOrigin-RevId: 357091322
2021-02-11 17:19:47 -08:00
Nayana Bidari ff04d019e3 RACK: Fix re-transmitting the segment twice when entering recovery.
TestRACKWithDuplicateACK is flaky as the reorder window can expire before
receiving three duplicate ACKs which will result in sending the first
unacknowledged segment twice: when reorder timer expired and again after
receiving the third duplicate ACK.

This CL will fix this behavior and will not resend the segment again if it was
already re-transmitted when reorder timer expired.

Update the TestRACKWithDuplicateACK to test that the first segment is
considered as lost and is re-transmitted.

PiperOrigin-RevId: 356855168
2021-02-10 16:38:55 -08:00
Bhasker Hariharan 298c129cc1 Add support for setting SO_SNDBUF for unix domain sockets.
The limits for snd/rcv buffers for unix domain sockets are controlled by the
following sysctls on Linux

 - net.core.rmem_default
 - net.core.rmem_max
 - net.core.wmem_default
 - net.core.wmem_max

Today in gVisor we do not expose these sysctls but we do support setting the
equivalent in netstack via stack.Options() method. But AF_UNIX sockets in gVisor
can be used without netstack, with hostinet or even without any networking stack
at all. Which means ideally these sysctls need to live as globals in gVisor.

But rather than make this a big change for now we hardcode the limits in the
AF_UNIX implementation itself (which in itself is better than where we were
before), where SO_SNDBUF was hardcoded to 16KiB. Further we bump the initial
limit to a default value of 208 KiB to match Linux from the paltry 16 KiB we use
today.

Updates #5132

PiperOrigin-RevId: 356665498
2021-02-09 21:55:16 -08:00
Zeling Feng 95500ece56 Allow UDP sockets connect()ing to port 0
We previously returned EINVAL when connecting to port 0, however this is not the
observed behavior on Linux. One of the observable effects after connecting to
port 0 on Linux is that getpeername() will fail with ENOTCONN.

PiperOrigin-RevId: 356413451
2021-02-08 20:13:17 -08:00
Nayana Bidari fe63db2e96 RACK: Detect loss
Detect packet loss using reorder window and re-transmit them after the reorder
timer expires.

PiperOrigin-RevId: 356321786
2021-02-08 12:09:54 -08:00
Nayana Bidari e3bce9689f Add a function to enable RACK in tests.
- Adds a function to enable RACK in tests.
- RACK update functions are guarded behind the flag tcpRecovery.

PiperOrigin-RevId: 355435973
2021-02-03 11:09:23 -08:00
Nayana Bidari 49f783fb65 Rename HandleNDupAcks in TCP.
Rename HandleNDupAcks() to HandleLossDetected() as it will enter this when
loss is detected after:
- reorder window expires and TLP (in case of RACK)
- dupAckCount >= 3

PiperOrigin-RevId: 355237858
2021-02-02 13:21:40 -08:00
Bhasker Hariharan 8c7c5abafb Add support for rate limiting out of window ACKs.
Netstack today will send dupACKs with no rate limit for incoming out of
window segments. This can result in ACK loops, for example if a TCP socket
connects to itself (actually permitted by TCP): the ACK sent in
response to packets being out of order itself gets considered as an out
of window segment, resulting in another ACK being generated.

PiperOrigin-RevId: 355206877
2021-02-02 11:05:28 -08:00
Ghanan Gowripalan ebd3912c0f Refactor HandleControlPacket/SockError
...to remove the need for the transport layer to deduce the type of
error it received.

Rename HandleControlPacket to HandleError as HandleControlPacket only
handles errors.

tcpip.SockError now holds a tcpip.SockErrorCause interface that
different errors can implement.

PiperOrigin-RevId: 354994306
2021-02-01 12:04:03 -08:00
Ghanan Gowripalan daeb06d2cb Hide neighbor table kind from NetworkEndpoint
The network endpoint should not need to have logic to handle different
kinds of neighbor tables. Network endpoints can let the NIC know about
different neighbor discovery messages and let the NIC decide which table
to update.

This allows us to remove the LinkAddressCache interface.

PiperOrigin-RevId: 354812584
2021-01-31 10:03:46 -08:00
Nayana Bidari ff4fc42784 RACK: Update reorder window.
After receiving an ACK (cumulative or selective), RACK will update the reorder
window which is used as a settling time before marking the packet as lost.
This change will add an init function to initialize the variables in RACK and
also store the reference to sender in rackControl.
The reorder window is calculated as per rfc:
https://tools.ietf.org/html/draft-ietf-tcpm-rack-08#section-7.2 Step 4.

PiperOrigin-RevId: 354453528
2021-01-28 20:08:23 -08:00
Tamir Duberstein 8d1afb4185 Change tcpip.Error to an interface
This makes it possible to add data to types that implement tcpip.Error.
ErrBadLinkEndpoint is removed as it is unused.

PiperOrigin-RevId: 354437314
2021-01-28 17:59:58 -08:00
Marina Ciocea 6012fe9b59 Respect SO_BINDTODEVICE in unconnected UDP writes
Previously, sending on an unconnected UDP socket would ignore the
SO_BINDTODEVICE option. Send on the configured interface when a UDP socket
is bound to an interface through setsockopt SO_BINDTODEVICE.

Add packetimpact tests exercising UDP reads and writes with every combination
of bound/unbound, broadcast/multicast/unicast destination, and bound/not-bound
to device.

PiperOrigin-RevId: 354299670
2021-01-28 06:24:46 -08:00
Ghanan Gowripalan b85b23e50d Confirm neighbor reachability with TCP ACKs
As per RFC 4861 section 7.3.1,
  A neighbor is considered reachable if the node has recently received
  a confirmation that packets sent recently to the neighbor were
  received by its IP layer. Positive confirmation can be gathered in
  two ways: hints from upper-layer protocols that indicate a connection
  is making "forward progress", or receipt of a Neighbor Advertisement
  message that is a response to a Neighbor Solicitation message.

This change adds support for TCP to let the IP/link layers know that a
neighbor is reachable.

Test: integration_test.TestTCPConfirmNeighborReachability
PiperOrigin-RevId: 354222833
2021-01-27 19:08:51 -08:00
Nayana Bidari 99988e45ed Add support for more fields in netstack for TCP_INFO
This CL adds support for the following fields:
- RTT, RTTVar, RTO
- send congestion window (sndCwnd) and send slow start threshold (sndSsthresh)
- congestion control state(CaState)
- ReorderSeen

PiperOrigin-RevId: 354195361
2021-01-27 16:14:50 -08:00
Nayana Bidari 8e66044741 Initialize the send buffer handler in endpoint creation.
- This CL will initialize the function handler used for getting the send
buffer size limits during endpoint creation and does not require the caller of
SetSendBufferSize(..) to know the endpoint type(tcp/udp/..)

PiperOrigin-RevId: 353992634
2021-01-26 18:05:29 -08:00