gVisor incorrectly returns the wrong ARP type for SIOGIFHWADDR. This breaks
tcpdump as it tries to interpret the packets incorrectly.
Similarly, SIOCETHTOOL is used by tcpdump to query interface properties which
fails with an EINVAL since we don't implement it. For now change it to return
EOPNOTSUPP to indicate that we don't support the query rather than return
EINVAL.
NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities
and NIC flags. In Linux all 3 exist eg. ARPHRD types are stored in dev->type
field while NIC capabilities are more like the device features which can be
queried using SIOCETHTOOL but not modified and NIC Flags are fields that can
be modified from user space. eg. NIC status (UP/DOWN/MULTICAST/BROADCAST) etc.
Updates #2746
PiperOrigin-RevId: 321436525
RFC 6864 imposes various restrictions on the uniqueness of the IPv4
Identification field for non-atomic datagrams, defined as an IP datagram that
either can be fragmented (DF=0) or is already a fragment (MF=1 or positive
fragment offset). In order to be compliant, the ID field is assigned for all
non-atomic datagrams.
Add a TCP unit test that induces retransmissions and checks that the IPv4
ID field is unique every time. Add basic handling of the IP_MTU_DISCOVER
socket option so that the option can be used to disable PMTU discovery,
effectively setting DF=0. Attempting to set the sockopt to anything other
than disabled will fail because PMTU discovery is currently not implemented,
and the default behavior matches that of disabled.
PiperOrigin-RevId: 320081842
Netstack has traditionally parsed headers on-demand as a packet moves up the
stack. This is conceptually simple and convenient, but incompatible with
iptables, where headers can be inspected and mangled before even a routing
decision is made.
This changes header parsing to happen early in the incoming packet path, as soon
as the NIC gets the packet from a link endpoint. Even if an invalid packet is
found (e.g. a TCP header of insufficient length), the packet is passed up the
stack for proper stats bookkeeping.
PiperOrigin-RevId: 315179302
Loopback traffic is not affected by rules in the PREROUTING chain.
This change is also necessary for istio's envoy to talk to other
components in the same pod.
Historically we've been passing PacketBuffer by shallow copying through out
the stack. Right now, this is only correct as the caller would not use
PacketBuffer after passing into the next layer in netstack.
With new buffer management effort in gVisor/netstack, PacketBuffer will
own a Buffer (to be added). Internally, both PacketBuffer and Buffer may
have pointers and shallow copying shouldn't be used.
Updates #2404.
PiperOrigin-RevId: 314610879
Connection tracking is used to track packets in prerouting and
output hooks of iptables. The NAT rules modify the tuples in
connections. The connection tracking code modifies the packets by
looking at the modified tuples.
These methods let users eaily break the VectorisedView abstraction, and
allowed netstack to slip into pseudo-enforcement of the "all headers are
in the first View" invariant. Removing them and replacing with PullUp(n)
breaks this reliance and will make it easier to add iptables support and
rework network buffer management.
The new View.PullUp(n) method is low cost in the common case, when when
all the headers fit in the first View.
PiperOrigin-RevId: 308163542
These methods let users eaily break the VectorisedView abstraction, and
allowed netstack to slip into pseudo-enforcement of the "all headers are
in the first View" invariant. Removing them and replacing with PullUp(n)
breaks this reliance and will make it easier to add iptables support and
rework network buffer management.
The new View.PullUp(n) method is low cost in the common case, when when
all the headers fit in the first View.
Better validate NDP NAs options before updating the link address cache.
Test: stack_test.TestNeighorAdvertisementWithTargetLinkLayerOption
PiperOrigin-RevId: 306962924
As per RFC 7217 section 6, attempt to regenerate IPv6 SLAAC address in response
to a DAD conflict if the address was generated with an opaque IID as outlined in
RFC 7217 section 5.
Test:
- stack_test.TestAutoGenAddrWithOpaqueIIDDADRetries
- stack_test.TestAutoGenAddrWithEUI64IIDNoDADRetries
- stack_test.TestAutoGenAddrContinuesLifetimesAfterRetry
PiperOrigin-RevId: 306555645
Better validate NDP NS messages and their options before doing work in
response to them. Also make sure that NA messages sent in response to
an NS use the correct IPv6 and link-layer addresses so they are
routed properly and received by the right node.
Test: stack_test.TestNeighorSolicitationResponse
PiperOrigin-RevId: 305799054
Software GSO implementation currently has a complicated code path with
implicit assumptions that all packets to WritePackets carry same Data
and it does this to avoid allocations on the path etc. But this makes it
hard to reuse the WritePackets API.
This change breaks all such assumptions by introducing a new Vectorised
View API ReadToVV which can be used to cleanly split a VV into multiple
independent VVs. Further this change also makes packet buffers linkable
to form an intrusive list. This allows us to get rid of the array of
packet buffers that are passed in the WritePackets API call and replace
it with a list of packet buffers.
While this code does introduce some more allocations in the benchmarks
it doesn't cause any degradation.
Updates #231
PiperOrigin-RevId: 304731742
As per RFC 6980 section 5, nodes MUST silently ignore NDP messages if
the packet carrying them include an IPv6 Fragmentation Header.
Test: ipv6_test.TestNDPValidation
PiperOrigin-RevId: 304519379
Enables handling the Hop by Hop and Destination Options extension
headers, but options are not yet supported. All options will be
treated as unknown and their respective action will be followed.
Note, the stack does not yet support sending ICMPv6 error messages in
response to options that cannot be handled/parsed. That will come
in a later change (Issue #2211).
Tests:
- header_test.TestIPv6UnknownExtHdrOption
- header_test.TestIPv6OptionsExtHdrIterErr
- header_test.TestIPv6OptionsExtHdrIter
- ipv6_test.TestReceiveIPv6ExtHdrs
PiperOrigin-RevId: 303433085
Enables the reassembly of fragmented IPv6 packets and handling of the
Routing extension header with a Segments Left value of 0. Atomic
fragments are handled as described in RFC 6946 to not interfere with
"normal" fragment traffic. No specific routing header type is supported.
Note, the stack does not yet support sending ICMPv6 error messages in
response to IPv6 packets that cannot be handled/parsed. That will come
in a later change (Issue #2211).
Test:
- header_test.TestIPv6RoutingExtHdr
- header_test.TestIPv6FragmentExtHdr
- header_test.TestIPv6ExtHdrIterErr
- header_test.TestIPv6ExtHdrIter
- ipv6_test.TestReceiveIPv6ExtHdrs
- ipv6_test.TestReceiveIPv6Fragments
RELNOTES: n/a
PiperOrigin-RevId: 303189584
This feature will match UID and GID of the packet creator, for locally
generated packets. This match is only valid in the OUTPUT and POSTROUTING
chains. Forwarded packets do not have any socket associated with them.
Packets from kernel threads do have a socket, but usually no owner.
This is a precursor to be being able to build an intrusive list
of PacketBuffers for use in queuing disciplines being implemented.
Updates #2214
PiperOrigin-RevId: 302677662
When list elements are removed from a list but not discarded, it becomes
important to invalidate the references they hold to their former
neighbors to prevent memory leaks.
PiperOrigin-RevId: 299412421
Protocol dispatchers were previously leaked. Bypassing TIME_WAIT is required to
test this change.
Also fix a race when a socket in SYN-RCVD is closed. This is also required to
test this change.
PiperOrigin-RevId: 296922548
Get the link address for the target of an NDP Neighbor Advertisement
from the NDP Target Link Layer Address option.
Tests:
- ipv6.TestNeighorAdvertisementWithTargetLinkLayerOption
- ipv6.TestNeighorAdvertisementWithInvalidTargetLinkLayerOption
PiperOrigin-RevId: 293632609
As per RFC 2464 section 7, an IPv6 packet with a multicast destination
address is transmitted to the mapped Ethernet multicast address.
Test:
- ipv6.TestLinkResolution
- stack_test.TestDADResolve
- stack_test.TestRouterSolicitation
PiperOrigin-RevId: 292610529
Update link address for senders of NDP Neighbor Solicitations when the NS
contains an NDP Source Link Layer Address option.
Tests:
- ipv6.TestNeighorSolicitationWithSourceLinkLayerOption
- ipv6.TestNeighorSolicitationWithInvalidSourceLinkLayerOption
PiperOrigin-RevId: 292028553
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
This change validates incoming NDP Router Advertisements as per RFC 4861 section
6.1.2. It also includes the skeleton to handle Router Advertiements that arrive
on some NIC.
Tests: Unittest to make sure only valid NDP Router Advertisements are received/
not dropped.
PiperOrigin-RevId: 278891972
This change validates the ICMPv6 checksum field before further processing an
ICMPv6 packet.
Tests: Unittests to make sure that only ICMPv6 packets with a valid checksum
are accepted/processed. Existing tests using checker.ICMPv6 now also check the
ICMPv6 checksum field.
PiperOrigin-RevId: 276779148
This change introduces a new interface, stack.NDPDispatcher. It can be
implemented by the netstack integrator to receive NDP related events. As of this
change, only DAD related events are supported.
Tests: Existing tests were modified to use the NDPDispatcher's DAD events for
DAD tests where it needed to wait for DAD completing (failing and resolving).
PiperOrigin-RevId: 276338733
Right now, we send each tcp packet separately, we call one system
call per-packet. This patch allows to generate multiple tcp packets
and send them by sendmmsg.
The arguable part of this CL is a way how to handle multiple headers.
This CL adds the next field to the Prepandable buffer.
Nginx test results:
Server Software: nginx/1.15.9
Server Hostname: 10.138.0.2
Server Port: 8080
Document Path: /10m.txt
Document Length: 10485760 bytes
w/o gso:
Concurrency Level: 5
Time taken for tests: 5.491 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 18.21 [#/sec] (mean)
Time per request: 274.525 [ms] (mean)
Time per request: 54.905 [ms] (mean, across all concurrent requests)
Transfer rate: 186508.03 [Kbytes/sec] received
sw-gso:
Concurrency Level: 5
Time taken for tests: 3.852 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 25.96 [#/sec] (mean)
Time per request: 192.576 [ms] (mean)
Time per request: 38.515 [ms] (mean, across all concurrent requests)
Transfer rate: 265874.92 [Kbytes/sec] received
w/o gso:
$ ./tcp_benchmark --client --duration 15 --ideal
[SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec
software gso:
$ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso
[SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec
PiperOrigin-RevId: 276112677
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With
runsc, you'll need to pass `--net-raw=true` to enable them.
Binding isn't supported yet.
PiperOrigin-RevId: 275909366
It is quite legal to send from the ANY address (it is required for
DHCP). I can't figure out why the broadcast address was included here,
so removing that as well.
PiperOrigin-RevId: 275541954
NDP Neighbor Solicitations sent during Duplicate Address Detection must have an
IP hop limit of 255, as all NDP Neighbor Solicitations should have.
Test: Test that DAD messages have the IPv6 hop limit field set to 255.
PiperOrigin-RevId: 275321680
This change adds support for Duplicate Address Detection on IPv6 addresses
as defined by RFC 4862 section 5.4.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that DAD will not be
performed. See `stack.Options` and `stack.NDPConfigurations` for more details.
Tests: Tests to make sure that the DAD process properly resolves or fails.
That is, tests make sure that DAD resolves only if:
- No other node is performing DAD for the same address
- No other node owns the same address
PiperOrigin-RevId: 275189471
Reassembly can fail due to an invalid sequence of fragments
being received. eg. Multiple fragments with same id which
claim to be the last one by setting the more flag to 0 etc.
It's safer to just drop the reassembler and increment a metric
than to panic when reassembly fails.
PiperOrigin-RevId: 274920901
...and do not populate link address cache at dispatch. This partially
reverts 313c767b00, which caused malformed
packets (e.g. NDP Neighbor Adverts with incorrect hop limit values) to
populate the address cache. In particular, this masked a bug that was
introduced to the Neighbor Advert generation code in
7c1587e340.
PiperOrigin-RevId: 274865182
Strengthen the header.IPv4.IsValid check to correctly check
for IHL/TotalLength fields. Also add a check to make sure
fragmentOffsets + size of the fragment do not cause a wrap
around for the end of the fragment.
PiperOrigin-RevId: 274049313
The behavior for sending and receiving local broadcast (255.255.255.255)
traffic is as follows:
Outgoing
--------
* A broadcast packet sent on a socket that is bound to an interface goes out
that interface
* A broadcast packet sent on an unbound socket follows the route table to
select the outgoing interface
+ if an explicit route entry exists for 255.255.255.255/32, use that one
+ else use the default route
* Broadcast packets are looped back and delivered following the rules for
incoming packets (see next). This is the same behavior as for multicast
packets, except that it cannot be disabled via sockopt.
Incoming
--------
* Sockets wishing to receive broadcast packets must bind to either INADDR_ANY
(0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives
broadcast packets.
* Broadcast packets are multiplexed to all sockets matching it. This is the
same behavior as for multicast packets.
* A socket can bind to 255.255.255.255:<port> and then receive its own
broadcast packets sent to 255.255.255.255:<port>
In addition, this change implicitly fixes an issue with multicast reception. If
two sockets want to receive a given multicast stream and one is bound to ANY
while the other is bound to the multicast address, only one of them will
receive the traffic.
PiperOrigin-RevId: 272792377
Previously, the only safe way to use an fdbased endpoint was to leak the FD.
This change makes it possible to safely close the FD.
This is the first step towards having stoppable stacks.
Updates #837
PiperOrigin-RevId: 270346582
The IPv6 all-nodes multicast address will be joined on NIC enable, and the
appropriate IPv6 solicited-node multicast address will be joined when IPv6
addresses are added.
Tests: Test receiving packets destined to the IPv6 link-local all-nodes
multicast address and the IPv6 solicted node address of an added IPv6 address.
PiperOrigin-RevId: 268047073
Adds support to generate Port Unreachable messages for UDP
datagrams received on a port for which there is no valid
endpoint.
Fixes#703
PiperOrigin-RevId: 267034418
This allows the stack to learn remote link addresses on incoming
packets, reducing the need to ARP to send responses.
This also reduces the number of round trips to the system clock,
since that may also prove to be performance-sensitive.
Fixes#739.
PiperOrigin-RevId: 265815816
The checksum was not being reset before being re-calculated and sent out.
This caused the sent checksum to always be `0x0800`.
Fixes#605.
PiperOrigin-RevId: 260965059
This allows the user code to add a network address with a subnet prefix length.
The prefix length value is stored in the network endpoint and provided back to
the user in the ProtocolAddress type.
PiperOrigin-RevId: 259807693
iptables also relies on IPPROTO_RAW in a way. It opens such a socket to
manipulate the kernel's tables, but it doesn't actually use any of the
functionality. Blegh.
PiperOrigin-RevId: 257903078
Addresses obvious typos, in the documentation only.
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.
PiperOrigin-RevId: 251739298
Some behavior was broken due to the difficulty of running automated raw
socket tests.
Change-Id: I152ca53916bb24a0208f2dc1c4f5bc87f4724ff6
PiperOrigin-RevId: 246747067
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes#209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
The linux packet socket can handle GSO packets, so we can segment packets to
64K instead of the MTU which is usually 1500.
Here are numbers for the nginx-1m test:
runsc: 579330.01 [Kbytes/sec] received
runsc-gso: 1794121.66 [Kbytes/sec] received
runc: 2122139.06 [Kbytes/sec] received
and for tcp_benchmark:
$ tcp_benchmark --duration 15 --ideal
[ 4] 0.0-15.0 sec 86647 MBytes 48456 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal
[ 4] 0.0-15.0 sec 2173 MBytes 1214 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal --gso 65536
[ 4] 0.0-15.0 sec 19357 MBytes 10825 Mbits/sec
PiperOrigin-RevId: 240809103
Change-Id: I2637f104db28b5d4c64e1e766c610162a195775a