Commit Graph

219 Commits

Author SHA1 Message Date
Bhasker Hariharan 3d71c627fa Add support for TCP receive buffer auto tuning.
The implementation is similar to linux where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.

This ensures that the advertised window grows at a reasonable rate to accomodate
for the sender's rate and prevents large amounts of data being held in stack
buffers if the application is not actively reading or not reading fast enough.

The original paper that was used to implement the linux receive buffer auto-
tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf

NOTE: Linux does not implement DRS as defined in that paper, it's just a good
reference to understand the solution space.

Updates #230

PiperOrigin-RevId: 253168283
2019-06-13 22:28:01 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Adin Scannell e352f46478 Minor BUILD file cleanup.
PiperOrigin-RevId: 252918338
2019-06-12 15:59:46 -07:00
Bhasker Hariharan 70578806e8 Add support for TCP_CONGESTION socket option.
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.

PiperOrigin-RevId: 252889093
2019-06-12 13:35:50 -07:00
Bhasker Hariharan 3933dd5c04 Fixes to listen backlog handling.
Changes netstack to confirm to current linux behaviour where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly we allow upto
backlog connections to be in SYN-RCVD state as long as the backlog is not full.

We also now drop a SYN if syn cookies are in use and the backlog for the
listening endpoint is full.

Added new tests to confirm the behaviour.

Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.

Fixes #236

PiperOrigin-RevId: 252500462
2019-06-10 15:40:44 -07:00
Rahat Mahmood 2d2831e354 Track and export socket state.
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).

For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.

PiperOrigin-RevId: 251934850
2019-06-06 15:04:47 -07:00
Bhasker Hariharan 85be01b42d Add multi-fd support to fdbased endpoint.
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.

This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.

Updates #231

PiperOrigin-RevId: 251852825
2019-06-06 08:07:02 -07:00
Andrei Vagin 79f7cb6c1c netstack/sniffer: log GSO attributes
PiperOrigin-RevId: 251788534
2019-06-05 22:51:53 -07:00
Andrei Vagin a12848ffeb netstack/tcp: fix calculating a number of outstanding packets
In case of GSO, a segment can container more than one packet
and we need to use the pCount() helper to get a number of packets.

PiperOrigin-RevId: 251743020
2019-06-05 16:30:45 -07:00
Chris Kuiper d18bb4f38a Adjust route when looping multicast packets
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.

PiperOrigin-RevId: 251739298
2019-06-05 16:08:29 -07:00
Bhasker Hariharan e0fb921205 Fix data race in synRcvdState.
When checking the length of the acceptedChan we should hold the
endpoint mutex otherwise a syn received while the listening socket
is being closed can result in a data race where the cleanupLocked
routine sets acceptedChan to nil while a handshake goroutine
in progress could try and check it at the same time.

PiperOrigin-RevId: 251537697
2019-06-04 16:17:24 -07:00
Bhasker Hariharan bfe3220992 Delete debug log lines left by mistake.
Updates #236

PiperOrigin-RevId: 251337915
2019-06-03 17:00:18 -07:00
Bhasker Hariharan 3577a4f691 Disable certain tests that are flaky under race detector.
PiperOrigin-RevId: 250976665
2019-05-31 16:19:49 -07:00
Bhasker Hariharan 033f96cc93 Change segment queue limit to be of fixed size.
Netstack sets the unprocessed segment queue size to match the receive
buffer size. This is not required as this queue only needs to hold enough
for a short duration before the endpoint goroutine can process it.

Updates #230

PiperOrigin-RevId: 250976323
2019-05-31 16:17:33 -07:00
Fabricio Voznika 38de91b028 Add build guard to files using go:linkname
Funcion signatures are not validated during compilation. Since
they are not exported, they can change at any time. The guard
ensures that they are verified at least on every version upgrade.

PiperOrigin-RevId: 250733742
2019-05-30 12:09:39 -07:00
Bhasker Hariharan ae26b2c425 Fixes to TCP listen behavior.
Netstack listen loop can get stuck if cookies are in-use and the app is slow to
accept incoming connections. Further we continue to complete handshake for a
connection even if the backlog is full. This creates a problem when a lots of
connections come in rapidly and we end up with lots of completed connections
just hanging around to be delivered.

These fixes change netstack behaviour to mirror what linux does as described
here in the following article

http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html

Now when cookies are not in-use Netstack will silently drop the ACK to a SYN-ACK
and not complete the handshake if the backlog is full.  This will result in the
connection staying in a half-complete state. Eventually the sender will
retransmit the ACK and if backlog has space we will transition to a connected
state and deliver the endpoint.

Similarly when cookies are in use we do not try and create an endpoint unless
there is space in the accept queue to accept the newly created endpoint. If
there is no space then we again silently drop the ACK as we can just recreate it
when the ACK is retransmitted by the peer.

We also now use the backlog to cap the size of the SYN-RCVD queue for a given
endpoint. So at any time there can be N connections in the backlog and N in a
SYN-RCVD state if the application is not accepting connections. Any new SYNs
will be dropped.

This CL also fixes another small bug where we mark a new endpoint which has not
completed handshake as connected. We should wait till handshake successfully
completes before marking it connected.

Updates #236

PiperOrigin-RevId: 250717817
2019-05-30 12:08:41 -07:00
Tamir Duberstein e4b395db49 Remove unused wakers
These wakers are uselessly allocated and passed around; nothing ever
listens for notifications on them. The code here appears to be
vestigial, so removing it and allowing a nil waker to be passed seems
appropriate.

PiperOrigin-RevId: 249879320
Change-Id: Icd209fb77cc0dd4e5c49d7a9f2adc32bf88b4b71
2019-05-24 12:29:14 -07:00
Kevin Krakauer c1cdf18e7b UDP and TCP raw socket support.
PiperOrigin-RevId: 249511348
Change-Id: I34539092cc85032d9473ff4dd308fc29dc9bfd6b
2019-05-22 13:45:15 -07:00
Bhasker Hariharan 2ac0aeeb42 Refactor fdbased endpoint dispatcher code.
This is in preparation to support an fdbased endpoint that can read/dispatch
packets from multiple underlying fds.

Updates #231

PiperOrigin-RevId: 249337074
Change-Id: Id7d375186cffcf55ae5e38986e7d605a96916d35
2019-05-21 15:24:25 -07:00
Nicolas Lacasse bfd9f75ba4 Set the FilesytemType in MountSource from the Filesystem.
And stop storing the Filesystem in the MountSource.

This allows us to decouple the MountSource filesystem type from the name of the
filesystem.

PiperOrigin-RevId: 247292982
Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8
2019-05-08 14:35:06 -07:00
Googler cbf6ab9697 Check GSO for nil in WritePacket
Testing:
Unit tests added
PiperOrigin-RevId: 247096269
Change-Id: I849c010eadcb53caf45896a15ef38162d66a9568
2019-05-07 14:57:03 -07:00
Ian Gudger 20862f0db2 Add gonet.DialContextTCP.
Allows cancellation and timeouts.

PiperOrigin-RevId: 247090428
Change-Id: I91907f12e218677dcd0e0b6d72819deedbd9f20c
2019-05-07 14:27:36 -07:00
Kevin Krakauer ff8ed5e6a5 Fix raw socket behavior and tests.
Some behavior was broken due to the difficulty of running automated raw
socket tests.

Change-Id: I152ca53916bb24a0208f2dc1c4f5bc87f4724ff6
PiperOrigin-RevId: 246747067
2019-05-05 16:07:25 -07:00
Ian Gudger b4a9f18687 Update tcpip Clock description.
The tcpip.Clock comment stated that times provided by it should not be used for
netstack internal timekeeping. This comment was from before the interface
supported monotonic times. The monotonic times that it provides are now be the
preferred time source for netstack internal timekeeping.

PiperOrigin-RevId: 246618772
Change-Id: I853b720e3d719b03fabd6156d2431da05d354bda
2019-05-03 21:01:42 -07:00
Googler f2699b76c8 Support IPv4 fragmentation in netstack
Testing:
Unit tests and also large ping in Fuchsia OS
PiperOrigin-RevId: 246563592
Change-Id: Ia12ab619f64f4be2c8d346ce81341a91724aef95
2019-05-03 13:30:35 -07:00
Tamir Duberstein 0e1cc476db Fix transport/raw copybara export
- include packet_list.go
- exclude state.go (by renaming to include an underscore)

Also rename raw.go to endpoint.go for consistency.

PiperOrigin-RevId: 246547912
Change-Id: I19c8331c794ba683a940cc96a8be6497b53ff24d
2019-05-03 11:52:59 -07:00
Bhasker Hariharan 458fe955a7 Implement support for SACK based recovery(RFC 6675).
PiperOrigin-RevId: 246536003
Change-Id: I118b745f45040be9c70cb6a1028acdb06c78d8c9
2019-05-03 10:51:18 -07:00
Chris Kuiper 2d8e90b311 Proper cleanup of sockets that used REUSEPORT
Fixed a small logic error that broke proper accounting of MultiPortEndpoints.

PiperOrigin-RevId: 246502126
Change-Id: I1a7d6ea134f811612e545676212899a3707bc2c2
2019-05-03 07:02:51 -07:00
Chris Kuiper 8972e47a2e Support reception of multicast data on more than one socket
This requires two changes:
1) Support for more than one socket to join a given multicast group.

2) Duplicate delivery of incoming multicast packets to all sockets listening
for it.

In addition, I tweaked the code (and added a test) to disallow duplicates
IP_ADD_MEMBERSHIP calls for the same group and NIC. This is how Linux does
it.

PiperOrigin-RevId: 246437315
Change-Id: Icad8300b4a8c3f501d9b4cd283bd3beabef88b72
2019-05-02 19:41:00 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Ben Burkert 66bca6fc22 tcpip/adapters/gonet: add CloseRead & CloseWrite methods to Conn
Add the CloseRead & CloseWrite methods that performs shutdown on the
corresponding Read & Write sides of a connection.

Change-Id: I3996a2abdc7cd68a2becba44dc4bd9f0919d2ce1
PiperOrigin-RevId: 245537950
2019-04-26 22:46:45 -07:00
Kevin Krakauer 43dff57b87 Make raw sockets a toggleable feature disabled by default.
PiperOrigin-RevId: 245511019
Change-Id: Ia9562a301b46458988a6a1f0bbd5f07cbfcb0615
2019-04-26 16:51:46 -07:00
Bhasker Hariharan 228dc15fd1 Bump the AF_PACKET socket rcv buf size to 4MB by default.
Packet socket receive buffers default to the sysctl value of
net.core.rmem_default and are capped by net.core.rmem_max both
which are usually set to 208KB on most systems.

Since we can't expect every gVisor user to bump these we use
SO_RCVBUFFORCE to exceed the limit. This is possible as runsc runs
with CAP_NET_ADMIN outside the sandbox and can do this before
the FD is passed to the sentry inside the sandbox.

Updates #211

iperf output w/ 4MB buffer.

 iperf3 -c 172.17.0.2 -t 100
 Connecting to host 172.17.0.2, port 5201
 [  4] local 172.17.0.1 port 40378 connected to 172.17.0.2 port 5201
 [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
 [  4]   0.00-1.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.02 MBytes
 [  4]   1.00-2.00   sec  1.18 GBytes  10.2 Gbits/sec    0   1.02 MBytes
 [  4]   2.00-3.00   sec   965 MBytes  8.09 Gbits/sec    0   1.02 MBytes
 [  4]   3.00-4.00   sec   942 MBytes  7.90 Gbits/sec    0   1.02 MBytes
 [  4]   4.00-5.00   sec   952 MBytes  7.99 Gbits/sec    0   1.02 MBytes
 [  4]   5.00-6.00   sec  1.14 GBytes  9.81 Gbits/sec    0   1.02 MBytes
 [  4]   6.00-7.00   sec  1.13 GBytes  9.68 Gbits/sec    0   1.02 MBytes
 [  4]   7.00-8.00   sec   930 MBytes  7.80 Gbits/sec    0   1.02 MBytes
 [  4]   8.00-9.00   sec  1.15 GBytes  9.91 Gbits/sec    0   1.02 MBytes
 [  4]   9.00-10.00  sec   938 MBytes  7.87 Gbits/sec    0   1.02 MBytes
 [  4]  10.00-11.00  sec   737 MBytes  6.18 Gbits/sec    0   1.02 MBytes
 [  4]  11.00-12.00  sec  1.16 GBytes  9.93 Gbits/sec    0   1.02 MBytes
 [  4]  12.00-13.00  sec   917 MBytes  7.69 Gbits/sec    0   1.02 MBytes
 [  4]  13.00-14.00  sec  1.19 GBytes  10.2 Gbits/sec    0   1.02 MBytes
 [  4]  14.00-15.00  sec  1.01 GBytes  8.70 Gbits/sec    0   1.02 MBytes
 [  4]  15.00-16.00  sec  1.20 GBytes  10.3 Gbits/sec    0   1.02 MBytes
 [  4]  16.00-17.00  sec  1.14 GBytes  9.80 Gbits/sec    0   1.02 MBytes
 ^C[  4]  17.00-17.60  sec   718 MBytes  10.1 Gbits/sec    0   1.02 MBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bandwidth       Retr
 [  4]   0.00-17.60  sec  18.4 GBytes  8.98 Gbits/sec    0             sender
 [  4]   0.00-17.60  sec  0.00 Bytes  0.00 bits/sec                  receiver

PiperOrigin-RevId: 245470590
Change-Id: I1c08c5ee8345de6ac070513656a4703312dc3c00
2019-04-26 12:52:02 -07:00
Bhasker Hariharan 56cadcac4e Fixes to PacketMMap dispatcher.
This CL fixes the following bugs:

- Uses atomic to set/read status instead of binary.LittleEndian.PutUint32 etc
which are not atomic.

- Increments ringOffsets for frames that are truncated (i.e status is
  tpStatusCopy)

- Does not ignore frames with tpStatusLost bit set as they are valid frames and
  only indicate that there some frames were lost before this one and metrics can
  be retrieved with a getsockopt call.

- Adds checks to make sure blockSize is a multiple of page size. This is
  required as the kernel allocates in pages per block and rejects sizes that are
  not page aligned with an EINVAL.

Updates #210

PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
2019-04-23 17:47:56 -07:00
Ben Burkert 56927e5317 tcpip/transport/tcp: read side only shutdown of an endpoint
Support shutdown on only the read side of an endpoint. Reads performed
after a call to Shutdown with only the ShutdownRead flag will return
ErrClosedForReceive without data.

Break out the shutdown(2) with SHUT_RD syscall test into to two tests.
The first tests that no packets are sent when shutting down the read
side of a socket. The second tests that, after shutting down the read
side of a socket, unread data can still be read, or an EOF if there is
no more data to read.

Change-Id: I9d7c0a06937909cbb466b7591544a4bcaebb11ce
PiperOrigin-RevId: 244459430
2019-04-19 19:29:05 -07:00
Ben Burkert cec2cdc12f tcpip/transport/udp: add Forwarder type
Add a UDP forwarder for intercepting and forwarding UDP sessions.

Change-Id: I2d83c900c1931adfc59a532dd4f6b33a0db406c9
PiperOrigin-RevId: 244293576
2019-04-18 17:49:57 -07:00
Andrei Vagin 4524790ff6 netstack: use a proper network protocol to set gso.L3HdrLen
It is possible to create a listening socket which will accept
IPv4 and IPv6 connections. In this case, we set IPv6ProtocolNumber
for all accepted endpoints, even if they handle IPv4 connections.

This means that we can't use endpoint.netProto to set gso.L3HdrLen.

PiperOrigin-RevId: 244227948
Change-Id: I5e1863596cb9f3d216febacdb7dc75651882eef1
2019-04-18 11:42:23 -07:00
Fabricio Voznika 9f8c89fc7f Return error from fdbased.New
RELNOTES: n/a
PiperOrigin-RevId: 244031742
Change-Id: Id0cdb73194018fb5979e67b58510ead19b5a2b81
2019-04-17 11:16:35 -07:00
Bhasker Hariharan eaac2806ff Add TCP checksum verification.
PiperOrigin-RevId: 242704699
Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f
2019-04-09 11:23:47 -07:00
Kevin Krakauer 52a51a8e20 Add a raw socket transport endpoint and use it for raw ICMP sockets.
Having raw socket code together will make it easier to add support for other raw
network protocols. Currently, only ICMP uses the raw endpoint. However, adding
support for other protocols such as UDP shouldn't be much more difficult than
adding a few switch cases.

PiperOrigin-RevId: 241564875
Change-Id: I77e03adafe4ce0fd29ba2d5dfdc547d2ae8f25bf
2019-04-02 11:13:49 -07:00
Bhasker Hariharan 45c54b1f4e Fix incorrect checksums in TCP and UDP tests.
PiperOrigin-RevId: 241025361
Change-Id: I292e7aea9a4b294b11e4f736e107010d9524586b
2019-03-29 12:05:43 -07:00
Bhasker Hariharan cc0e96a4bd Fix Panic in SACKScoreboard.Delete.
The panic was caused by modifying the tree while iterating which invalidated the
iterator.

Also fixes another bug in SACKScoreboard.Insert() which was causing blocks to be
merged incorrectly.

PiperOrigin-RevId: 240895053
Change-Id: Ia72b8244297962df5c04283346da5226434740af
2019-03-28 18:18:39 -07:00
Bert Muthalaly f2e5dcf21c Add ICMP stats
PiperOrigin-RevId: 240848882
Change-Id: I23dd4599f073263437aeab357c3f767e1a432b82
2019-03-28 14:09:20 -07:00
Andrei Vagin f4105ac21a netstack/fdbased: add generic segmentation offload (GSO) support
The linux packet socket can handle GSO packets, so we can segment packets to
64K instead of the MTU which is usually 1500.

Here are numbers for the nginx-1m test:
runsc:		579330.01 [Kbytes/sec] received
runsc-gso:	1794121.66 [Kbytes/sec] received
runc:		2122139.06 [Kbytes/sec] received

and for tcp_benchmark:

$ tcp_benchmark  --duration 15   --ideal
[  4]  0.0-15.0 sec  86647 MBytes  48456 Mbits/sec

$ tcp_benchmark --client --duration 15   --ideal
[  4]  0.0-15.0 sec  2173 MBytes  1214 Mbits/sec

$ tcp_benchmark --client --duration 15   --ideal --gso 65536
[  4]  0.0-15.0 sec  19357 MBytes  10825 Mbits/sec

PiperOrigin-RevId: 240809103
Change-Id: I2637f104db28b5d4c64e1e766c610162a195775a
2019-03-28 11:03:41 -07:00
Tamir Duberstein 8406504817 Avoid mutating memory passed to DeliverTransportPacket
PiperOrigin-RevId: 240642903
Change-Id: I16625015123a827d267d60b328a202057264bbd6
2019-03-27 14:36:57 -07:00
Tamir Duberstein 9c20a88bd7 Remove polling from ICMP test
PiperOrigin-RevId: 240483396
Change-Id: Ie75d3ae38af83f1d92f167ff9ba58fa10f5b372b
2019-03-26 20:20:52 -07:00
Andrei Vagin 654e878abb netstack: Don't exclude length when a pseudo-header checksum is calculated
This is a preparation for GSO changes (cl/234508902).

RELNOTES[gofers]: Refactor checksum code to include length, which
it already did, but in a convoluted way. Should be a no-op.

PiperOrigin-RevId: 240460794
Change-Id: I537381bc670b5a9f5d70a87aa3eb7252e8f5ace2
2019-03-26 17:15:13 -07:00
Tamir Duberstein 9cd2b66f10 Remove echoReplier
Mirror the ICMPv6 echo implementation in ICMPv4 echo. This removes
unnecessary asynchrony, reduces copying, and reduces complexity.

PiperOrigin-RevId: 240394525
Change-Id: If8f53254154f86772f5e51159765aa23b3b328b8
2019-03-26 11:45:01 -07:00
Andrei Vagin 9f4e1cb797 netstack: adjust the sequence number after trimming the packet
PiperOrigin-RevId: 239417224
Change-Id: I14a9adc31a6330a79a6156c105969cd5f1f63d20
2019-03-20 09:58:10 -07:00