Commit Graph

111 Commits

Author SHA1 Message Date
Dean Deng 771e9ce8e1 Unlock tcp endpoint mutex before blocking forever.
This was occasionally causing tests to get stuck due to races with the save
process, during which the same mutex is acquired.

PiperOrigin-RevId: 340789616
2020-11-04 22:46:51 -08:00
Dean Deng 66d24bb692 Release mutex before blocking during TCP handshake route resolution.
Without releasing the mutex, operations on the endpoint following a
nonblocking connect will not make progress until connect is complete.

PiperOrigin-RevId: 340467654
2020-11-03 10:03:30 -08:00
Dean Deng 2eb3ee586b Automated rollback of changelist 339945377
PiperOrigin-RevId: 340274194
2020-11-02 11:09:43 -08:00
Dean Deng ba05c6845d Automated rollback of changelist 339750876
PiperOrigin-RevId: 339945377
2020-10-30 14:58:43 -07:00
Dean Deng a86f988a87 Automated rollback of changelist 339675182
PiperOrigin-RevId: 339750876
2020-10-29 14:48:08 -07:00
Dean Deng 1f0f687cbe Delay goroutine creation during TCP handshake for accept/connect.
Refactor TCP handshake code so that when connect is initiated, the initial SYN
is sent before creating a goroutine to handle the rest of the handshake (which
blocks). Similarly, the initial SYN-ACK is sent inline when SYN is received
during accept.

Some additional cleanup is done as well.

Eventually we would like to complete connections in the dispatcher without
requiring a wakeup to complete the handshake. This refactor makes that easier.

Updates #231

PiperOrigin-RevId: 339675182
2020-10-29 08:46:04 -07:00
Bhasker Hariharan db36d948fa TCP Receive window advertisement fixes.
The fix in commit 028e045da9 was incorrect as
it can cause the right edge of the window to shrink when we announce
a zero window due to receive buffer being full as its done before the check
for seeing if the window is being shrunk because of the selected window.

Further the window was calculated purely on available space but in cases where
we are getting full sized segments it makes more sense to use the actual bytes
being held. This CL changes to use the lower of the total available space vs
the available space in the maximal window we could advertise minus the actual
payload bytes being held.

This change also cleans up the code so that the window selection logic is
not duplicated between getSendParams() and windowCrossedACKThresholdLocked.

PiperOrigin-RevId: 336404827
2020-10-09 19:02:03 -07:00
Ghanan Gowripalan 91e2d15a62 Remove AssignableAddressEndpoint.NetworkEndpoint
We can get the network endpoint directly from the NIC.

This is a preparatory CL for when a Route needs to hold a dedicated NIC
as its output interface. This is because when forwarding is enabled,
packets may be sent from a NIC different from the NIC a route's local
address is associated with.

PiperOrigin-RevId: 335484500
2020-10-05 13:18:57 -07:00
Bhasker Hariharan 5d50c91c4d Change segment/pending queue to use receive buffer limits.
segment_queue today has its own standalone limit of MaxUnprocessedSegments but
this can be a problem in UnlockUser() we do not release the lock till there are
segments to be processed. What can happen is as handleSegments dequeues packets
more keep getting queued and we will never release the lock. This can keep
happening even if the receive buffer is full because nothing can read() till we
release the lock.

Further having a separate limit for pending segments makes it harder to track
memory usage etc. Unifying the limits makes it easier to reason about memory in
use and makes the overall buffer behaviour more consistent.

PiperOrigin-RevId: 333508122
2020-09-24 07:15:06 -07:00
Julian Elischer 99decaadd6 Extract ICMP error sender from UDP
Store transport protocol number on packet buffers for use in ICMP error
generation.

Updates #2211.

PiperOrigin-RevId: 333252762
2020-09-23 02:28:43 -07:00
Ghanan Gowripalan d35f07b36a Improve type safety for transport protocol options
The existing implementation for TransportProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.

This change introduces marker interfaces for transport protocol options
that may be set or queried which transport protocol option types
implement to ensure that invalid types are caught at compile time.
Different interfaces are used to allow the compiler to enforce read-only
or set-only socket options.

RELNOTES: n/a
PiperOrigin-RevId: 330559811
2020-09-08 12:17:39 -07:00
Ghanan Gowripalan dc81eb9c37 Add function to get error from a tcpip.Endpoint
In an upcoming CL, socket option types are made to implement a marker
interface with pointer receivers. Since this results in calling methods
of an interface with a pointer, we incur an allocation when attempting
to get an Endpoint's last error with the current implementation.

When calling the method of an interface, the compiler is unable to
determine what the interface implementation does with the pointer
(since calling a method on an interface uses virtual dispatch at runtime
so the compiler does not know what the interface method will do) so it
allocates on the heap to be safe incase an implementation continues to
hold the pointer after the functioon returns (the reference escapes the
scope of the object).

In the example below, the compiler does not know what b.foo does with
the reference to a it allocates a on the heap as the reference to a may
escape the scope of a.
```
var a int
var b someInterface
b.foo(&a)
```

This change removes the opportunity for that allocation.

RELNOTES: n/a
PiperOrigin-RevId: 328796559
2020-08-27 12:50:19 -07:00
Ghanan Gowripalan f1821fdb68 Automated rollback of changelist 327325153
PiperOrigin-RevId: 328259353
2020-08-24 20:41:09 -07:00
Nayana Bidari 4184a7d5f1 RACK: Create a new list for segments.
RACK requires the segments to be in the order of their transmission
or retransmission times. This cl creates a new list and moves the
retransmitted segments to the end of the list.

PiperOrigin-RevId: 327325153
2020-08-18 15:59:37 -07:00
Ting-Yu Wang 47515f4751 Migrate to PacketHeader API for PacketBuffer.
Formerly, when a packet is constructed or parsed, all headers are set by the
client code. This almost always involved prepending to pk.Header buffer or
trimming pk.Data portion. This is known to prone to bugs, due to the complexity
and number of the invariants assumed across netstack to maintain.

In the new PacketHeader API, client will call Push()/Consume() method to
construct/parse an outgoing/incoming packet. All invariants, such as slicing
and trimming, are maintained by the API itself.

NewPacketBuffer() is introduced to create new PacketBuffer. Zero value is no
longer valid.

PacketBuffer now assumes the packet is a concatenation of following portions:
* LinkHeader
* NetworkHeader
* TransportHeader
* Data
Any of them could be empty, or zero-length.

PiperOrigin-RevId: 326507688
2020-08-13 13:08:57 -07:00
Bhasker Hariharan b928d074b4 Ensure TCP TIME-WAIT is not terminated prematurely.
Netstack's TIME-WAIT state for a TCP socket could be terminated prematurely if
the socket entered TIME-WAIT using shutdown(..., SHUT_RDWR) and then was closed
using close(). This fixes that bug and updates the tests to verify that Netstack
correctly honors TIME-WAIT under such conditions.

Fixes #3106

PiperOrigin-RevId: 326456443
2020-08-13 09:04:31 -07:00
Nayana Bidari 0e6f7a12c2 Update variables for implementation of RACK in TCP
RACK (Recent Acknowledgement) is a new loss detection
algorithm in TCP. These are the fields which should be
stored on connections to implement RACK algorithm.

PiperOrigin-RevId: 324948703
2020-08-04 20:59:34 -07:00
Mithun Iyer ad8164bb50 Fix TCP CurrentConnected counter updates.
CurrentConnected counter is incorrectly decremented on close of an
endpoint which is still not connected.

Fixes #3443

PiperOrigin-RevId: 324155171
2020-07-30 22:49:30 -07:00
Kevin Krakauer fb8be7e627 make connect(2) fail when dest is unreachable
Previously, ICMP destination unreachable datagrams were ignored by TCP
endpoints. This caused connect to hang when an intermediate router
couldn't find a route to the host.

This manifested as a Kokoro error when Docker IPv6 was enabled. The Ruby
image test would try to install the sinatra gem and hang indefinitely
attempting to use an IPv6 address.

Fixes #3079.
2020-07-22 16:51:42 -07:00
Tamir Duberstein 4784ed46ef Avoid multiple atomic loads
...by calling (*tcp.endpoint).EndpointState only once when possible.

Avoid wrapping (*sleep.Waker).Assert in a useless func while I'm here.

PiperOrigin-RevId: 319074149
2020-06-30 12:25:03 -07:00
Bhasker Hariharan b070e218c6 Add support for Stack level options.
Linux controls socket send/receive buffers using a few sysctl variables
  - net.core.rmem_default
  - net.core.rmem_max
  - net.core.wmem_max
  - net.core.wmem_default
  - net.ipv4.tcp_rmem
  - net.ipv4.tcp_wmem

The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).

The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.

Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.

This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values and the default value buffers for UDP/RAW sockets is
updated to match the linux value of 212KiB up from the really low current value
of 32 KiB.

Updates #3043
Fixes #3043

PiperOrigin-RevId: 318089805
2020-06-24 10:24:20 -07:00
Bhasker Hariharan 07ff909e76 Support setsockopt SO_SNDBUF/SO_RCVBUF for raw/udp sockets.
Updates #173,#6
Fixes #2888

PiperOrigin-RevId: 317087652
2020-06-18 06:07:20 -07:00
Mithun Iyer 50afec55c7 TCP stat fixes
Ensure that CurrentConnected stat is updated on any errors and cleanups
during connected state processing.

Fixes #2968

PiperOrigin-RevId: 316919426
2020-06-17 10:47:04 -07:00
Mithun Iyer 67f261a87d TCP to honor updated window size during handshake.
In passive open cases, we transition to Established state after
initializing endpoint's sender and receiver. With this we lose out
on any updates coming from the ACK that completes the handshake.
This change ensures that we uniformly transition to Established in all
cases and does minor cleanups.

Fixes #2938

PiperOrigin-RevId: 316567014
2020-06-15 16:19:53 -07:00
Ting-Yu Wang d3a8bffe04 Pass PacketBuffer as pointer.
Historically we've been passing PacketBuffer by shallow copying through out
the stack. Right now, this is only correct as the caller would not use
PacketBuffer after passing into the next layer in netstack.

With new buffer management effort in gVisor/netstack, PacketBuffer will
own a Buffer (to be added). Internally, both PacketBuffer and Buffer may
have pointers and shallow copying shouldn't be used.

Updates #2404.

PiperOrigin-RevId: 314610879
2020-06-03 15:00:42 -07:00
Mithun Iyer 089c88f2e8 Move TCP to CLOSED from SYN-RCVD on RST.
RST handling is broken when the TCP state transitions
from SYN-SENT to SYN-RCVD in case of simultaneous open.
An incoming RST should trigger cleanup of the endpoint.
RFC793, section 3.9, page 70.

Fixes #2814

PiperOrigin-RevId: 313828777
2020-05-29 12:33:28 -07:00
Bhasker Hariharan ae15d90436 FIFO QDisc implementation
Updates #231

PiperOrigin-RevId: 309323808
2020-04-30 16:41:00 -07:00
Eyal Soha db2a60be67 Don't accept segments outside the receive window
Fixed to match RFC 793 page 69.

Fixes #1607

PiperOrigin-RevId: 307334892
2020-04-19 22:16:14 -07:00
Mithun Iyer 3b05f576d7 Reset pending connections on listener shutdown.
When the listening socket is read shutdown, we need to reset all pending
and incoming connections. Ensure that the endpoint is not cleaned up
from the demuxer and subsequent bind to same port does not go through.

PiperOrigin-RevId: 306958038
2020-04-16 17:58:08 -07:00
Mithun Iyer 9c918340e4 Reset pending connections on listener close
Attempt to redeliver TCP segments that are enqueued into a closing
TCP endpoint. This was being done for Established endpoints but not
for those that are listening or performing connection handshake.

Fixes #2417

PiperOrigin-RevId: 306598155
2020-04-15 01:11:44 -07:00
Bhasker Hariharan fc99a7ebf0 Refactor software GSO code.
Software GSO implementation currently has a complicated code path with
implicit assumptions that all packets to WritePackets carry same Data
and it does this to avoid allocations on the path etc. But this makes it
hard to reuse the WritePackets API.

This change breaks all such assumptions by introducing a new Vectorised
View API ReadToVV which can be used to cleanly split a VV into multiple
independent VVs. Further this change also makes packet buffers linkable
to form an intrusive list. This allows us to get rid of the array of
packet buffers that are passed in the WritePackets API call and replace
it with a list of packet buffers.

While this code does introduce some more allocations in the benchmarks
it doesn't cause any degradation.

Updates #231

PiperOrigin-RevId: 304731742
2020-04-03 18:35:55 -07:00
Nayana Bidari 92b9069b67 Support owner matching for iptables.
This feature will match UID and GID of the packet creator, for locally
generated packets. This match is only valid in the OUTPUT and POSTROUTING
chains. Forwarded packets do not have any socket associated with them.
Packets from kernel threads do have a socket, but usually no owner.
2020-03-26 12:21:24 -07:00
Bhasker Hariharan c8eeedcc1d Add support for setting TCP segment hash.
This allows the link layer endpoints to consistenly hash a TCP
segment to a single underlying queue in case a link layer endpoint
does support multiple underlying queues.

Updates #231

PiperOrigin-RevId: 302760664
2020-03-24 15:34:43 -07:00
Bhasker Hariharan 7e4073af12 Move tcpip.PacketBuffer and IPTables to stack package.
This is a precursor to be being able to build an intrusive list
of PacketBuffers for use in queuing disciplines being implemented.

Updates #2214

PiperOrigin-RevId: 302677662
2020-03-24 09:06:26 -07:00
Ting-Yu Wang 49aef9cee7 Remove unused variable `sndNxtList`.
PiperOrigin-RevId: 302110328
2020-03-20 15:25:15 -07:00
Bhasker Hariharan e9e399c25d Remove workMu from tcpip.Endpoint.
workMu is removed and e.mu is now a mutex that supports TryLock.  The packet
processing path tries to lock the mutex and if its locked it will just queue the
packet and move on. The endpoint.UnlockUser() will process any backlog of
packets before unlocking the socket.

This simplifies the locking inside tcp endpoints a lot. Further the
endpoint.LockUser() implements spinning as long as the lock is not held by
another syscall goroutine. This ensures low latency as not spinning leads to the
task thread being put to sleep if the lock is held by the packet dispatch
path. This is suboptimal as the lower layer rarely holds the lock for long so
implementing spinning here helps.

If the lock is held by another task goroutine then we just proceed to call
LockUser() and the task could be put to sleep.

The protocol goroutines themselves just call e.mu.Lock() and block if the
lock is currently not available.

Updates #231, #357

PiperOrigin-RevId: 301808349
2020-03-19 07:19:58 -07:00
Bhasker Hariharan 81675b850e Fix memory leak in danglingEndpoints.
Endpoints which were being terminated in an ERROR state or were moved to CLOSED
by the worker goroutine do not run cleanupLocked() as that should already be run
by the worker termination. But when making that change we made the mistake of
not removing the endpoint from the danglingEndpoints which is normally done in
cleanupLocked().

As a result these endpoints are leaked since a reference is held to them in the
danglingEndpoints array forever till Stack is torn down.

PiperOrigin-RevId: 300438426
2020-03-11 17:03:57 -07:00
Ian Gudger 9b3aad33c4 Use a pool of arrays to avoid slice headers from escaping in TCP options pool.
By putting slices into the pool, the slice header escapes. This can be avoided
by not putting the slice header into the pool.

This removes an allocation from the TCP segment send path.

PiperOrigin-RevId: 299215480
2020-03-05 15:56:42 -08:00
Bhasker Hariharan 3310175250 Fix data-race when reading/writing e.amss.
PiperOrigin-RevId: 298451319
2020-03-02 14:45:03 -08:00
Ian Gudger c6bdc6b05b Fix a race in TCP endpoint teardown and teardown the stack in tcp_test.
Call stack.Close on stacks when we are done with them in tcp_test. This avoids
leaking resources and reduces the test's flakiness when race/gotsan is enabled.
It also provides test coverage for the race also fixed in this change, which
can be reliably triggered with the stack.Close change (and without the other
changes) when race/gotsan is enabled.

The race was possible when calling Abort (via stack.Close) on an endpoint
processing a SYN segment as part of a passive connect.

Updates #1564

PiperOrigin-RevId: 297685432
2020-02-27 14:15:44 -08:00
Ian Gudger c37b196455 Add support for tearing down protocol dispatchers and TIME_WAIT endpoints.
Protocol dispatchers were previously leaked. Bypassing TIME_WAIT is required to
test this change.

Also fix a race when a socket in SYN-RCVD is closed. This is also required to
test this change.

PiperOrigin-RevId: 296922548
2020-02-24 10:32:17 -08:00
Ian Gudger a26a954946 Add socket connection stress test.
Tests 65k connection attempts on common types of sockets to check for port
leaks.

Also fixes a bug where dual-stack sockets wouldn't properly re-queue
segments received while closing.

PiperOrigin-RevId: 293241166
2020-02-04 15:54:49 -08:00
Bhasker Hariharan 51b783505b Add support for TCP_DEFER_ACCEPT.
PiperOrigin-RevId: 292233574
2020-01-29 15:53:45 -08:00
Mithun Iyer 7e6fbc6afe Add a new TCP stat for current open connections.
Such a stat accounts for all connections that are currently
established and not yet transitioned to close state.
Also fix bug in double increment of CurrentEstablished stat.

Fixes #1579

PiperOrigin-RevId: 290827365
2020-01-21 15:43:39 -08:00
Bhasker Hariharan 275ac8ce1d Bugfix to terminate the protocol loop on StateError.
The change to introduce worker goroutines can cause the endpoint
to transition to StateError and we should terminate the loop rather
than let the endpoint transition to a CLOSED state as we do
in case the endpoint enters TIME-WAIT/CLOSED. Moving to a closed
state would cause the actual error to not be propagated to
any read() calls etc.

PiperOrigin-RevId: 289923568
2020-01-15 13:21:50 -08:00
Bhasker Hariharan a611fdaee3 Changes TCP packet dispatch to use a pool of goroutines.
All inbound segments for connections in ESTABLISHED state are delivered to the
endpoint's queue but for every segment delivered we also queue the endpoint for
processing to a selected processor. This ensures that when there are a large
number of connections in ESTABLISHED state the inbound packets are all handled
by a small number of goroutines and significantly reduces the amount of work the
goscheduler has to perform.

We let connections in other states follow the current path where the
endpoint's goroutine directly handles the segments.

Updates #231

PiperOrigin-RevId: 289728325
2020-01-14 14:15:50 -08:00
Ian Gudger 27500d529f New sync package.
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.

This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.

Updates #1472

PiperOrigin-RevId: 289033387
2020-01-09 22:02:24 -08:00
gVisor bot 3f4d8fefb4 Internal change.
PiperOrigin-RevId: 286003946
2019-12-17 10:10:06 -08:00
Bhasker Hariharan 6fc9f0aefd Add support for TCP_USER_TIMEOUT option.
The implementation follows the linux behavior where specifying
a TCP_USER_TIMEOUT will cause the resend timer to honor the
user specified timeout rather than the default rto based timeout.

Further it alters when connections are timedout due to keepalive
failures. It does not alter the behavior of when keepalives are
sent. This is as per the linux behavior.

PiperOrigin-RevId: 285099795
2019-12-11 17:52:53 -08:00
Mithun Iyer b1d44be7ad Add TCP stats for connection close and keep-alive timeouts.
Fix bugs in updates to TCP CurrentEstablished stat.

Fixes #1277

PiperOrigin-RevId: 284292459
2019-12-06 17:17:33 -08:00