x86 and arm64 use a different stat struct in Linux
kernel, so the stat() syscall implementation has
to handle the file stat data separately.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: If3986e915a667362257a54e7fbbcc1fe18951015
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/1493 from xiaobo55x:stat f15a216d9297eb9a96d2c483d396a9919145d7fa
PiperOrigin-RevId: 290274287
There was a very bare get/setxattr in the InodeOperations interface. Add
context.Context to both, size to getxattr, and flags to setxattr.
Note that extended attributes are passed around as strings in this
implementation, so size is automatically encoded into the value. Size is
added in getxattr so that implementations can return ERANGE if a value is larger
than can fit in the user-allocated buffer. This prevents us from unnecessarily
passing around an arbitrarily large xattr when the user buffer is actually too
small.
Don't use the existing xattrwalk and xattrcreate messages and define our
own, mainly for the sake of simplicity.
Extended attributes will be implemented in future commits.
PiperOrigin-RevId: 290121300
sys_clone has many flavors in Linux, and amd64 chose
a different one from x86(different arguments order).
Ref kernel/fork.c for more info.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I6c8cbc685f4a6e786b171715ab68292fc95cbf48
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/1545 from xiaobo55x:clone 156bd2dfbc63ef5291627b0578ddea77997393b2
PiperOrigin-RevId: 290093953
If a previously added IPv6 address (statically or via SLAAC) was removed, it
would be left in an expired state waiting to be cleaned up if any references to
it were still held. During this time, the same address could be regenerated via
SLAAC, which should be allowed. This change supports this scenario.
When upgrading an endpoint from temporary or permanentExpired to permanent,
respect the new configuration type (static or SLAAC) and deprecated status,
along with the new PrimaryEndpointBehavior (which was already supported).
Test: stack.TestAutoGenAddrAfterRemoval
PiperOrigin-RevId: 289990168
This change adds support to send NDP Router Solicitation messages when a NIC
becomes enabled as a host, as per RFC 4861 section 6.3.7.
Note, Router Solicitations will only be sent when the stack has forwarding
disabled.
Tests: Unittests to make sure that the initial Router Solicitations are sent
as configured. The tests also validate the sent Router Solicitations' fields.
PiperOrigin-RevId: 289964095
The change to introduce worker goroutines can cause the endpoint
to transition to StateError and we should terminate the loop rather
than let the endpoint transition to a CLOSED state as we do
in case the endpoint enters TIME-WAIT/CLOSED. Moving to a closed
state would cause the actual error to not be propagated to
any read() calls etc.
PiperOrigin-RevId: 289923568
Signed-off-by: Bin Lu <bin.lu@arm.com>
Change-Id: I9cce23db4e5caec82ce42b4970fdb7f7e8c08f1d
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/773 from lubinszARM:pr_arch_basic 3fe2fd8e6286766bbe489ef971dce204f924feba
PiperOrigin-RevId: 289795569
This patch holds taskset.mu when getting tty. If we don't
do this, it may cause a "unlock of unlocked mutex" problem,
since signalHandlers may be replaced by CopyForExec() in
runSyscallAfterExecStop after the signalHandlers.mu has
been holded in TTY().
The problem is easy to reproduce with keeping to do "runsc ps".
The crash log is :
fatal error: sync: unlock of unlocked mutex
goroutine 5801304 [running]:
runtime.throw(0xfd019c, 0x1e)
GOROOT/src/runtime/panic.go:774 +0x72 fp=0xc001ba47b0 sp=0xc001ba4780 pc=0x431702
sync.throw(0xfd019c, 0x1e)
GOROOT/src/runtime/panic.go:760 +0x35 fp=0xc001ba47d0 sp=0xc001ba47b0 pc=0x431685
sync.(*Mutex).unlockSlow(0xc00cf94a30, 0xc0ffffffff)
GOROOT/src/sync/mutex.go:196 +0xd6 fp=0xc001ba47f8 sp=0xc001ba47d0 pc=0x4707d6
sync.(*Mutex).Unlock(0xc00cf94a30)
GOROOT/src/sync/mutex.go:190 +0x48 fp=0xc001ba4818 sp=0xc001ba47f8 pc=0x4706e8
gvisor.dev/gvisor/pkg/sentry/kernel.(*ThreadGroup).TTY(0xc011a9e800, 0x0)
pkg/sentry/kernel/tty.go:38 +0x88 fp=0xc001ba4868 sp=0xc001ba4818 pc=0x835fa8
gvisor.dev/gvisor/pkg/sentry/control.Processes(0xc00025ae00, 0xc013e397c0, 0x40, 0xc0137b9800, 0x1, 0x7f292e9a4cc0)
pkg/sentry/control/proc.go:366 +0x355 fp=0xc001ba49a0 sp=0xc001ba4868 pc=0x9ac4a5
gvisor.dev/gvisor/runsc/boot.(*containerManager).Processes(0xc0003b62c0, 0xc0051423d0, 0xc0137b9800, 0x0, 0x0)
runsc/boot/controller.go:228 +0xdf fp=0xc001ba49e8 sp=0xc001ba49a0 pc=0xaf06cf
Signed-off-by: chris.zn <chris.zn@antfin.com>
All inbound segments for connections in ESTABLISHED state are delivered to the
endpoint's queue but for every segment delivered we also queue the endpoint for
processing to a selected processor. This ensures that when there are a large
number of connections in ESTABLISHED state the inbound packets are all handled
by a small number of goroutines and significantly reduces the amount of work the
goscheduler has to perform.
We let connections in other states follow the current path where the
endpoint's goroutine directly handles the segments.
Updates #231
PiperOrigin-RevId: 289728325
CancellableTimer tests were in a timer_test package but lived within the
tcpip directory. This caused issues with go tools.
PiperOrigin-RevId: 289166345
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
Inform the Stack's NDPDispatcher when it receives an NDP Router Advertisement
that updates the available configurations via DHCPv6. The Stack makes sure that
its NDPDispatcher isn't informed unless the avaiable configurations via DHCPv6
for a NIC is updated.
Tests: Test that a Stack's NDPDispatcher is informed when it receives an NDP
Router Advertisement that informs it of new configurations available via DHCPv6.
PiperOrigin-RevId: 289001283
Internal tools timeout after 60s during tests that are required to pass before
changes can be submitted. Separate out NDP tests into its own package to help
prevent timeouts when testing.
PiperOrigin-RevId: 288990597
...retrievable later via stack.NICInfo().
Clients of this library can use it to add metadata that should be tracked
alongside a NIC, to avoid having to keep a map[tcpip.NICID]metadata mirroring
stack.Stack's nic map.
PiperOrigin-RevId: 288924900
When PCID is disabled, there would throw a panic
when dropPageTables() access to c.PCID without check.
Signed-off-by: Lai Jiangshan <eag0628@gmail.com>
Add a new CancellableTimer type to encapsulate the work of safely stopping
timers when it fires at the same time some "related work" is being handled. The
term "related work" is some work that needs to be done while having obtained
some common lock (L).
Example: Say we have an invalidation timer that may be extended or cancelled by
some event. Creating a normal timer and simply cancelling may not be sufficient
as the timer may have already fired when the event handler attemps to cancel it.
Even if the timer and event handler obtains L before doing work, once the event
handler releases L, the timer will eventually obtain L and do some unwanted
work.
To prevent the timer from doing unwanted work, it checks if it should early
return instead of doing the normal work after obtaining L. When stopping the
timer callers must have L locked so the timer can be safely informed that it
should early return.
Test: Tests that CancellableTimer fires and resets properly. Test to make sure
the timer fn is not called after being stopped within the lock L.
PiperOrigin-RevId: 288806984
This change calls a new Truncate method on the EndpointReader in RecvMsg for
both netlink and unix sockets. This allows readers such as sockets to peek at
the length of data without actually reading it to a buffer.
Fixes#993#1240
PiperOrigin-RevId: 288800167
This gets us closer to passing the iptables tests and opens up iptables
so it can be worked on by multiple people.
A few restrictions are enforced for security (i.e. we don't want to let
users write a bunch of iptables rules and then just not enforce them):
- Only the filter table is writable.
- Only ACCEPT rules with no matching criteria can be added.
...enabling us to remove the "CreateNamedLoopbackNIC" variant of
CreateNIC and all the plumbing to connect it through to where the value
is read in FindRoute.
PiperOrigin-RevId: 288713093
Before, each of small read()'s that raises window either from zero
or above threshold of aMSS, would generate an ACK. In a classic
silly-window-syndrome scenario, we can imagine a pessimistic case
when small read()'s generate a stream of ACKs.
This PR fixes that, essentially treating window size < aMSS as zero.
We send ACK exactly in a moment when window increases to >= aMSS
or half of receive buffer size (whichever smaller).
Support deprecating network endpoints on a NIC. If an endpoint is deprecated, it
should not be used for new connections unless a more preferred endpoint is not
available, or unless the deprecated endpoint was explicitly requested.
Test: Test that deprecated endpoints are only returned when more preferred
endpoints are not available and SLAAC addresses are deprecated after its
preferred lifetime
PiperOrigin-RevId: 288562705
When receiving data, netstack avoids sending spurious acks. When
user does recv() should netstack send ack telling the sender that
the window was increased? It depends. Before this patch, netstack
_will_ send the ack in the case when window was zero or window >>
scale was zero. Basically - when recv space increased from zero.
This is not working right with silly-window-avoidance on the sender
side. Some network stacks refuse to transmit segments, that will fill
the window but are below MSS. Before this patch, this confuses
netstack. On one hand if the window was like 3 bytes, netstack
will _not_ send ack if the window increases. On the other hand
sending party will refuse to transmit 3-byte packet.
This patch changes that, making netstack will send an ACK when
the available buffer size increases to or above 1*MSS. This will
inform other party buffer is large enough, and hopefully uncork it.
Signed-off-by: Marek Majkowski <marek@cloudflare.com>
Test: Test that an IPv6 link-local address is not auto-generated for loopback
NICs, even when it is enabled for non-loopback NICS.
PiperOrigin-RevId: 288519591
Pass the NIC-internal name to the NIC name function when generating opaque IIDs
so implementations can use the name that was provided when the NIC was created.
Previously, explicit NICID to NIC name resolution was required from the netstack
integrator.
Tests: Test that the name provided when creating a NIC is passed to the NIC name
function when generating opaque IIDs.
PiperOrigin-RevId: 288395359
Right now, we need to call ptrace(PTRACE_SYSCALL) and wait() twice to execute
one system call in a stub process. With these changes, we will need to call
ptrace + wait only once.
In addition, this allows to workaround the kernel bug when a stub process
doesn't stop on syscall-exit-stop and starts executing the next system call.
Reported-by: syzbot+37143cafa8dc3b5008ee@syzkaller.appspotmail.com
PiperOrigin-RevId: 288393029
- Renamed memfs to tmpfs.
- Copied fileRangeSet bits from fs/fsutil/ to fsimpl/tmpfs/
- Changed tmpfs to be backed by filemem instead of byte slice.
- regularFileReadWriter uses a sync.Pool, similar to gofer client.
PiperOrigin-RevId: 288356380
Currently, shm.Registry.FindByID will return Shm instances without taking an
additional reference on them, making it possible for them to disappear.
More explicitly handle references. All callers hold a reference for the
duration that they hold the instance. Registry.shms may transitively hold Shms
with no references, so it must TryIncRef to determine if they are still valid.
PiperOrigin-RevId: 288314529
Some of the flags in the file system related system call
are architecture specific(O_NOFOLLOW/O_DIRECT..). Ref to
the fcntl.h file in the Linux src codes.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I354d988073bfd0c9ff5371d4e0be9da2b8fd019f
Support using opaque interface identifiers when generating IPv6 addresses via
SLAAC when configured to do so.
Note, this change does not handle retries in response to DAD conflicts yet.
That will also come in a later change.
Test: Test that when SLAAC addresses are generated, they use opaque interface
identifiers when configured to do so.
PiperOrigin-RevId: 288078605
Support generating opaque interface identifiers as defined by RFC 7217 for
auto-generated IPv6 link-local addresses. Opaque interface identifiers will also
be used for IPv6 addresses auto-generated via SLAAC in a later change.
Note, this change does not handle retries in response to DAD conflicts yet.
That will also come in a later change.
Tests: Test that when configured to generated opaque IIDs, they are properly
generated as outlined by RFC 7217.
PiperOrigin-RevId: 288035349
- Add FileDescriptionOptions.UseDentryMetadata, which reduces the amount of
boilerplate needed for device FDs and the like between filesystems.
- Switch back to having FileDescription.Init() take references on the Mount and
Dentry; otherwise managing refcounts around failed calls to
OpenDeviceSpecialFile() / Device.Open() is tricky.
PiperOrigin-RevId: 287575574
Added the ability to get/set the IP_RECVTOS socket option on UDP endpoints. If
enabled, TOS from the incoming Network Header passed as ancillary data in the
ControlMessages.
Test:
* Added unit test to udp_test.go that tests getting/setting as well as
verifying that we receive expected TOS from incoming packet.
* Added a syscall test
PiperOrigin-RevId: 287029703
There are 2 jobs have been finished in this patch:
1, a comment was added to explain the purpose of the extra NOPs in Vectors().
2, some merge errors were fixed.
Signed-off-by: Bin Lu <bin.lu@arm.com>
- Make FilesystemImpl methods that operate on parent directories require
!rp.Done() (i.e. there is at least one path component to resolve) as
precondition and postcondition (in cases where they do not finish path
resolution due to mount boundary / absolute symlink), and require that they
do not need to follow the last path component (the file being created /
deleted) as a symlink. Check for these in VFS.
- Add FilesystemImpl.GetParentDentryAt(), which is required to obtain the old
parent directory for VFS.RenameAt(). (Passing the Dentry to be renamed
instead has the wrong semantics if the file named by the old path is a mount
point since the Dentry will be on the wrong Mount.)
- Update memfs to implement these methods correctly (?), including RenameAt.
- Change fspath.Parse() to allow empty paths (to simplify implementation of
AT_EMPTY_PATH).
- Change vfs.PathOperation to take a fspath.Path instead of a raw pathname;
non-test callers will need to fspath.Parse() pathnames themselves anyway in
order to detect absolute paths and select PathOperation.Start accordingly.
PiperOrigin-RevId: 286934941
This change supports clearing all host-only NDP state when NICs become routers.
All discovered routers, discovered on-link prefixes and auto-generated addresses
will be invalidated when becoming a router. This is because normally, routers do
not process Router Advertisements to discover routers or on-link prefixes, and
do not do SLAAC.
Tests: Unittest to make sure that all discovered routers, discovered prefixes
and auto-generated addresses get invalidated when transitioning from a host to
a router.
PiperOrigin-RevId: 286902309
Linux PTRACE_SYSEMU support on arm64 was merged to mainline from
V5.3, and the corresponding support in go also enabled recently.
Since the "syscall" package is locked down from go 1.4, so the ptrace
PTRACE_SYSEMU definition can't be added to package "syscall" on arm64.
According to the golang community, updates required by new systems or
versions should use the corresponding package in the golang.org/x/sys
repository instead(https://golang.org/pkg/syscall/).
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I2f917bb2be62f990c3e158e2bb99e094ea03f751
This change is needed to be compatible with the Linux kernel.
There is no glibc wrapper for the futex system call, so it is easy to
make a mistake and call syscall(__NR_futex, FUTEX_WAKE, addr) without
the fourth argument. This works on Linux, because it wakes one waiter
even if val is nonpositive.
PiperOrigin-RevId: 286494396
Otherwise a copy happens, which triggers a data race when reading
masterInodeOperations.SimpleFileOperations.uattr, which must be accessed with a
lock held.
PiperOrigin-RevId: 286464473
When listen(2) is called on an unbound socket, the socket is
automatically bound to a random free port with the local address
set to INADDR_ANY.
PiperOrigin-RevId: 286305906
This change removes the requirement that a new routing table be provided when a
router or prefix discovery event happens so that an updated routing table may
be provided to the stack at a later time from the event.
This change is to address the use case where the netstack integrator may need to
obtain a lock before providing updated routes in response to the events above.
As an example, say we have an integrator that performs the below two operations
operations as described:
A. Normal route update:
1. Obtain integrator lock
2. Update routes in the integrator
3. Call Stack.SetRouteTable with the updated routes
3.1. Obtain Stack lock
3.2. Update routes in Stack
3.3. Release Stack lock
4. Release integrator lock
B. NDP event triggered route update:
1. Obtain Stack lock
2. Call event handler
2.1. Obtain integrator lock
2.2. Update routes in the integrator
2.3. Release integrator lock
2.4. Return updated routes to update Stack
3. Update routes in Stack
4. Release Stack lock
A deadlock may occur if a Normal route update was attemped at the same time an
NDP event triggered route update was attempted. With threads T1 and T2:
1) T1 -> A.1, A.2
2) T2 -> B.1
3) T1 -> A.3 (hangs at A.3.1 since Stack lock is taken in step 2)
4) T2 -> B.2 (hangs at B.2.1 since integrator lock is taken in step 1)
Test: Existing tests were modified to not provide or expect routing table
changes in response to Router and Prefix discovery events.
PiperOrigin-RevId: 286274712
This change makes sure that test variables are captured before running tests
in parallel, and removes unneeded buffered channel allocations. This change also
removes unnecessary timeouts.
PiperOrigin-RevId: 286255066
Several jobs were finished in this patch:
1, provide functions to get/set fpcr/fpsr/vregs
2, support lazy-fpsimd-context-switch in el1
Signed-off-by: Bin Lu <bin.lu@arm.com>
These comments provided nothing, and have been copy-pasted into all
implementations. The code is clear without them.
I considered also removing the "handle implements handler.handle" comments, but
will let those stay for now.
PiperOrigin-RevId: 285876428
Add checks for input arguments, file type, permissions, etc. that match
the Linux implementation. A call to get/setxattr that passes all the
checks will still currently return EOPNOTSUPP. Actual support will be
added in following commits.
Only allow user.* extended attributes for the time being.
PiperOrigin-RevId: 285835159
Copy up parent when binding UDS on overlayfs is supported in commit
02ab1f187c.
But the using of copyUp in overlayBind will cause sentry stuck, reason
is dead lock in renameMu.
1 [Process A] Invoke a Unix socket bind operation
renameMu is hold in fs.(*Dirent).genericCreate by process A
2 [Process B] Invoke a read syscall on /proc/task/mounts
waitng on Lock of renameMu in fs.(*MountNamespace).FindMount
3 [Process A] Continue Unix socket bind operation
wating on RLock of renameMu in fs.copyUp
Root cause is recursive reading lock of reanmeMu in bind call trace,
if there are writing lock between the two reading lock, then deadlock
occured.
Fixes#1397
After the finalizer optimize in 76039f8959
commit, clientFile needs to closed before finalizer release it.
The clientFile is not closed if it is created via
gofer.(*inodeOperations).Bind, this will cause fd leak which is hold
by gofer process.
Fixes#1396
Signed-off-by: Yong He <chenglang.hy@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
There are 4 jobs were finished in this package:
1, Virtual machine initialization.
2, Bluepill implementation.
3, Move ring0.Vectors() into the address with 11-bits alignment.
4, Basic support for "SwitchToUser".
Signed-off-by: Bin Lu <bin.lu@arm.com>
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/919 from lubinszARM:pr_kvm eedea52db451bf62722759009a9f14c54a69c55f
PiperOrigin-RevId: 285501256
Fixed a bug where the interface identifier was not properly generated from an
Ethernet address.
Tests: Unittests to make sure the functions generating the EUI64 interface
identifier are correct.
PiperOrigin-RevId: 285494562
The implementation follows the linux behavior where specifying
a TCP_USER_TIMEOUT will cause the resend timer to honor the
user specified timeout rather than the default rto based timeout.
Further it alters when connections are timedout due to keepalive
failures. It does not alter the behavior of when keepalives are
sent. This is as per the linux behavior.
PiperOrigin-RevId: 285099795
The former is needed for vfs.FileDescription to implement
memmap.MappingIdentity, and the latter is needed to implement getcwd(2).
PiperOrigin-RevId: 285051855
We're missing several packages that runsc doesn't depend on. Most notable are
several tcpip link packages.
To find packages, I looked at a diff of directories on master vs go:
$ bazel build //:gopath
$ find bazel-bin/gopath/src/gvisor.dev/gvisor/ -type d > /tmp/gopath.txt
$ find . -type d > /tmp/master.txt
$ sed 's|bazel-bin/gopath/src/gvisor.dev/gvisor/||' < /tmp/gopath.txt > /tmp/gopath.trunc.txt
$ sed 's|./||' < /tmp/master.txt > /tmp/master.trunc.txt
$ vimdiff /tmp/gopath.trunc.txt /tmp/master.trunc.txt
Testing packages are still left out because :gopath can't depend on testonly
targets...
PiperOrigin-RevId: 285049029
runsc debug --ps list all processes with all threads. This option is added to
the debug command but not to the ps command, because it is going to be used for
debug purposes and we want to add any useful information without thinking about
backward compatibility.
This will help to investigate syzkaller issues.
PiperOrigin-RevId: 285013668
This change adds support to let an integrator know when it receives an NDP
Router Advertisement message with the NDP Recursive DNS Server option with at
least one DNS server's address. The stack will not maintain any state related to
the DNS servers - the integrator is expected to maintain any required state and
invalidate the servers after its valid lifetime expires, or refresh the lifetime
when a new one is received for a known DNS server.
Test: Unittest to make sure that an event is sent to the integrator when an NDP
Recursive DNS Server option is received with at least one address.
PiperOrigin-RevId: 284890502
Package strace is missing some syscalls we actually implement (e.g.,
getrandom). We also see newer syscalls sometimes (e.g., membarrier) that would
be handy to have formatted.
Let's go ahead and add all syscalls in the latest upstream release (v5.4), even
though we only intend to implement v4.4. None of them are implemented, just
included as placeholders.
PiperOrigin-RevId: 284797577
This change marks the socket as ESTABLISHED and creates the receiver and sender
the moment we send the final ACK in case of an active TCP handshake or when we
receive the final ACK for a passive TCP handshake. Before this change there was
a short window in which an ACK can be received and processed but the state on
the socket is not yet ESTABLISHED.
This can be seen in TestConnectBindToDevice which is flaky because sometimes
the socket is in SYN-SENT and not ESTABLISHED even though the other side has
already received the final ACK of the handshake.
PiperOrigin-RevId: 284277713
This change allows the netstack to do SLAAC as outlined by RFC 4862 section 5.5.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that SLAAC
will not be performed. See `stack.Options` and `stack.NDPConfigurations` for
more details.
This change reuses 1 option and introduces a new one that is required to take
advantage of SLAAC, all available under NDPConfigurations:
- HandleRAs: Whether or not NDP RAs are processes
- AutoGenGlobalAddresses: Whether or not SLAAC is performed.
Also note, this change does not deprecate SLAAC generated addresses after the
preferred lifetime. That will come in a later change (b/143713887). Currently,
only the valid lifetime is honoured.
Tests: Unittest to make sure that SLAAC generates and adds addresses only when
configured to do so. Tests also makes sure that conflicts with static addresses
do not modify the static address.
PiperOrigin-RevId: 284265317
Threadgroups already know their TTY (if they have one), which now contains the
TTY Index, and is returned in the Processes() call.
PiperOrigin-RevId: 284263850
If the socket is bound to ANY and connected to a loopback address,
getsockname() has to return the loopback address. Without this fix,
getsockname() returns ANY.
PiperOrigin-RevId: 283647781
The code in rcv.consumeSegment incorrectly transitions to
CLOSED state from LAST-ACK before the final ACK for the FIN.
Further if receiving a segment changes a socket to a closed state
then we should not invoke the sender as the socket is now closed
and sending any segments is incorrect.
PiperOrigin-RevId: 283625300
There are two potential ways of sending a TOS byte with outgoing packets:
including a control message in sendmsg, or setting the IP_TOS/IPV6_TCLASS
socket options (for IPV4 and IPV6 respectively). This change lets hostinet
support the latter.
Fixes#1188
PiperOrigin-RevId: 283550925
There are two potential ways of sending a TOS byte with outgoing packets:
including a control message in sendmsg, or setting the IP_TOS/IPV6_TCLASS
socket options (for IPV4 and IPV6 respectively). This change lets hostinet
support the former.
PiperOrigin-RevId: 283346737
This change does not introduce any new features, or modify existing ones.
This change tests handling TCP segments right away for connections that were
completed from a listening endpoint.
PiperOrigin-RevId: 282986457
This involves allowing getsockopt/setsockopt for the corresponding socket
options, as well as allowing hostinet to process control messages received from
the actual recvmsg syscall.
PiperOrigin-RevId: 282851425
This allows writable proc and devices files to be opened with O_CREAT|O_TRUNC.
This is encountered most frequently when interacting with proc or devices files
via the command line.
e.g. $ echo 8192 1048576 4194304 > /proc/sys/net/ipv4/tcp_rmem
Also adds a test to test the behavior of open(O_TRUNC), truncate, and ftruncate
on named pipes.
Fixes#1116
PiperOrigin-RevId: 282677425
For test case "TestApplicationFault",
Memory-fault in guest user level will be trapped in el0_da.
And in el0_da, we use mmio_exit to leave the KVM guest.
Signed-off-by: Bin Lu <bin.lu@arm.com>
For test case "TestApplicationSyscall",
Syscall in guest user level will be trapped in el0_svc.
And in el0_svc, we use mmio_exit to leave the KVM guest for now.
Signed-off-by: Bin Lu <bin.lu@arm.com>
Mainly 2 jobs were finished in this patch:
1, context switching for a container application:
a, R0-R30 b, pc\pstate\sp_el0 c, pagetable_el0 for container application
This job can help us to pass the following test cases:
"TestApplicationSyscall", "TestApplicationFault"
2, checking pagetable_el0 is empty
This job can help us to pass the following test case: "TestInvalidate"
Signed-off-by: Bin Lu <bin.lu@arm.com>
- Remove the Filesystem argument from DentryImpl.*Ref(); in general DentryImpls
that need the Filesystem for reference counting will probably also need it
for other interface methods that don't plumb Filesystem, so it's easier to
just store a pointer to the filesystem in the DentryImpl.
- Add a pointer to the VirtualFilesystem to Filesystem, which is needed by the
gofer client to disown dentries for cache eviction triggered by dentry
reference count changes.
- Rename FilesystemType.NewFilesystem to GetFilesystem; in some cases (e.g.
sysfs, cgroupfs) it's much cleaner for there to be only one Filesystem that
is used by all mounts, and in at least one case (devtmpfs) it's visibly
incorrect not to do so, so NewFilesystem doesn't always actually create and
return a *new* Filesystem.
- Require callers of FileDescription.Init() to increment Mount/Dentry
references. This is because the gofer client may, in the OpenAt() path, take
a reference on a dentry with 0 references, which is safe due to
synchronization that is outside the scope of this CL, and it would be safer
to still have its implementation of DentryImpl.IncRef() check for an
increment for 0 references in other cases.
- Add FileDescription.TryIncRef. This is used by the gofer client to take
references on "special file descriptions" (FDs for files such as pipes,
sockets, and devices), which use per-FD handles (fids) instead of
dentry-shared handles, for sync() and syncfs().
PiperOrigin-RevId: 282473364
This is required to test filesystems with a non-trivial implementation of
FilesystemImpl.Release(). Propagation isn't handled yet, and umount isn't yet
plumbed out to VirtualFilesystem.UmountAt(), but otherwise the implementation
of umount is believed to be correct.
- Move entering mountTable.seq writer critical sections to callers of
mountTable.{insert,remove}Seqed. This is required since umount(2) must ensure
that no new references are taken on the candidate mount after checking that
it isn't busy, which is only possible by entering a vfs.mountTable.seq writer
critical section before the check and remaining in it until after
VFS.umountRecursiveLocked() is complete. (Linux does the same thing:
fs/namespace.c:do_umount() => lock_mount_hash(),
fs/pnode.c:propagate_mount_busy(), umount_tree(), unlock_mount_hash().)
- It's not possible for dentry deletion to umount while only holding
VFS.mountMu for reading, but it's also very unappealing to hold VFS.mountMu
exclusively around e.g. gofer unlink RPCs. Introduce dentry.mu to avoid these
problems. This means that VFS.mountMu is never acquired for reading, so
change it to a sync.Mutex.
PiperOrigin-RevId: 282444343
Packets written via SOCK_RAW are guaranteed to have network headers, but not
transport headers. Check first whether there are enough bytes left in the packet
to contain a transport header before attempting to parse it.
PiperOrigin-RevId: 282363895
Signed-off-by: Bin Lu <bin.lu@arm.com>
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/891 from lubinszARM:pr_pagetable 2385de75a8662af3ab1ae289dd74dd0e5dcfaf66
PiperOrigin-RevId: 282013224
Note that the Sentry still calls Truncate() on the file before calling Open.
A new p9 version check was added to ensure that the p9 server can handle the
the OpenTruncate flag. If not, then the flag is stripped before sending.
PiperOrigin-RevId: 281609112
As we move to CLOSE state from LAST-ACK or TIME-WAIT,
ensure that we re-match all in-flight segments to any
listening endpoint.
Also fix LISTEN state handling of any ACK segments as per RFC793.
Fixes#1153
PiperOrigin-RevId: 280703556
Aside from the performance hit, there is no guarantee that p9.ClientFile's
finalizer runs before the associated p9.Client is closed.
PiperOrigin-RevId: 280702509
Sniffer assumed that outgoing packets have transport headers, but
users can write packets via SOCK_RAW with arbitrary transport headers that
netstack doesn't know about. We now explicitly check for the presence of network
and transport headers before assuming they exist.
PiperOrigin-RevId: 280594395
It was possible to panic the sentry by opening a cache revalidating folder with
O_TRUNC|O_CREAT.
Avoids breaking php tests.
PiperOrigin-RevId: 280533213
Initialize the VDSO "os" and "arch" fields explicitly,
or the VDSO load process would failed on arm64 platform.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: Ic6768df88e43cd7c7956eb630511672ae11ac52f
newfstatat() syscall is not supported on arm64, so we resort
to use the fstatat() syscall.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: Iea95550ea53bcf85c01f7b3b95da70ad0952177d
This patch also include a minor change to replace syscall.Dup2
with syscall.Dup3 which was missed in a previous commit(ref a25a976).
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3
This change drops TCP packets with a non-unicast IP address as the source or
destination address as TCP is meant for communication between two endpoints.
Test: Make sure that if the source or destination address contains a non-unicast
address, no TCP packet is sent in response and the packet is dropped.
PiperOrigin-RevId: 280073731
This change allows the netstack to do NDP's Prefix Discovery as outlined by
RFC 4861 section 6.3.4. If configured to do so, when a new on-link prefix is
discovered, the routing table will be updated with a device route through
the nic the RA arrived at. Likewise, when such a prefix gets invalidated, the
device route will be removed.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that Prefix Discovery
will not be performed. See `stack.Options` and `stack.NDPConfigurations` for
more details.
This change reuses 1 option and introduces a new one that is required to take
advantage of Prefix Discovery, all available under NDPConfigurations:
- HandleRAs: Whether or not NDP RAs are processes
- DiscoverOnLinkPrefixes: Whether or not Prefix Discovery is performed (new)
Another note: for a NIC to process Prefix Information options (in Router
Advertisements), it must not be a router itself. Currently the netstack does not
have per-interface routing configuration; the routing/forwarding configuration
is controlled stack-wide. Therefore, if the stack is configured to enable
forwarding/routing, no router Advertisements (and by extension the Prefix
Information options) will be processed.
Tests: Unittest to make sure that Prefix Discovery and updates to the routing
table only occur if explicitly configured to do so. Unittest to make sure at
max stack.MaxDiscoveredOnLinkPrefixes discovered on-link prefixes are
remembered.
PiperOrigin-RevId: 280049278
* Basic tests for the SO_REUSEADDR and SO_REUSEPORT options.
* SO_REUSEADDR functional tests for TCP and UDP.
* SO_REUSEADDR and SO_REUSEPORT interaction tests for UDP.
* Stubbed support for UDP getsockopt(SO_REUSEADDR).
PiperOrigin-RevId: 280049265
This change adds explicit support for honoring the 2MSL timeout
for sockets in TIME_WAIT state. It also adds support for the
TCP_LINGER2 option that allows modification of the FIN_WAIT2
state timeout duration for a given socket.
It also adds an option to modify the Stack wide TIME_WAIT timeout
but this is only for testing. On Linux this is fixed at 60s.
Further, we also now correctly process RST's in CLOSE_WAIT and
close the socket similar to linux without moving it to error
state.
We also now handle SYN in ESTABLISHED state as per
RFC5961#section-4.1. Earlier we would just drop these SYNs.
Which can result in some tests that pass on linux to fail on
gVisor.
Netstack now honors TIME_WAIT correctly as well as handles the
following cases correctly.
- TCP RSTs in TIME_WAIT are ignored.
- A duplicate TCP FIN during TIME_WAIT extends the TIME_WAIT
and a dup ACK is sent in response to the FIN as the dup FIN
indicates potential loss of the original final ACK.
- An out of order segment during TIME_WAIT generates a dup ACK.
- A new SYN w/ a sequence number > the highest sequence number
in the previous connection closes the TIME_WAIT early and
opens a new connection.
Further to make the SYN case work correctly the ISN (Initial
Sequence Number) generation for Netstack has been updated to
be as per RFC. Its not a pure random number anymore and follows
the recommendation in https://tools.ietf.org/html/rfc6528#page-3.
The current hash used is not a cryptographically secure hash
function. A separate change will update the hash function used
to Siphash similar to what is used in Linux.
PiperOrigin-RevId: 279106406
This is required to implement O_TRUNC correctly on filesystems backed by
gofers.
9P2000.L: "lopen prepares fid for file I/O. flags contains Linux open(2) flags
bits, e.g. O_RDONLY, O_RDWR, O_WRONLY."
open(2): "The argument flags must include one of the following access modes:
O_RDONLY, O_WRONLY, or O_RDWR. ... In addition, zero or more file creation
flags and file status flags can be bitwise-or'd in flags."
The reference 9P2000.L implementation also appears to expect arbitrary flags,
not just access modes, in Tlopen.flags:
https://github.com/chaos/diod/blob/master/diod/ops.c#L703
PiperOrigin-RevId: 278972683
This change allows the netstack to do NDP's Router Discovery as outlined by
RFC 4861 section 6.3.4.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that Router Discovery
will not be performed. See `stack.Options` and `stack.NDPConfigurations` for
more details.
This change introduces 2 options required to take advantage of Router Discovery,
all available under NDPConfigurations:
- HandleRAs: Whether or not NDP RAs are processes
- DiscoverDefaultRouters: Whether or not Router Discovery is performed
Another note: for a NIC to process Router Advertisements, it must not be a
router itself. Currently the netstack does not have per-interface routing
configuration; the routing/forwarding configuration is controlled stack-wide.
Therefore, if the stack is configured to enable forwarding/routing, no Router
Advertisements will be processed.
Tests: Unittest to make sure that Router Discovery and updates to the routing
table only occur if explicitly configured to do so. Unittest to make sure at
max stack.MaxDiscoveredDefaultRouters discovered default routers are remembered.
PiperOrigin-RevId: 278965143
This change better follows what is outlined in RFC 793 section 3.4 figure 12
where a listening socket should not accept a SYN-ACK segment in response to a
(potentially) old SYN segment.
Tests: Test that checks the TCP RST segment sent in response to a TCP SYN-ACK
segment received on a listening TCP endpoint.
PiperOrigin-RevId: 278893114
This change validates incoming NDP Router Advertisements as per RFC 4861 section
6.1.2. It also includes the skeleton to handle Router Advertiements that arrive
on some NIC.
Tests: Unittest to make sure only valid NDP Router Advertisements are received/
not dropped.
PiperOrigin-RevId: 278891972
NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events.
gVisor doesn't have any device events, so our sockets don't need to do anything
once created.
systemd's device manager needs to be able to create one of these sockets. It
also wants to install a BPF filter on the socket. Since we'll never send any
messages, the filter would never be invoked, thus we just fake it out.
Fixes#1117
Updates #1119
PiperOrigin-RevId: 278405893
Since we only supporting sending messages from the kernel, the peer is always
the kernel, simplifying handling.
There are currently no known users of SO_PASSCRED that would actually receive
messages from gVisor, but adding full support is barely more work than stubbing
out fake support.
Updates #1117Fixes#1119
PiperOrigin-RevId: 277981465
The watchdog currently can find stuck tasks, but has no way to tell if the
sandbox is stuck before the application starts executing.
This CL adds a startup timeout and action to the watchdog. If Start() is not
called before the given timeout (if non-zero), then the watchdog will take the
action.
PiperOrigin-RevId: 277970577
When VectorisedViews were passed up the stack from packet_dispatchers, we were
passing a sub-slice of the dispatcher's views fields. The dispatchers then
immediately set those views to nil.
This wasn't caught before because every implementer copied the data in these
views before returning.
PiperOrigin-RevId: 277615351
On Arm platform, "setMemoryRegion" has extra permission checks.
In virt/kvm/arm/mmu.c: kvm_arch_prepare_memory_region()
....
if (writable && !(vma->vm_flags & VM_WRITE)) {
ret = -EPERM;
break;
}
....
So, for Arm platform, the "flags" for kvm_memory_region is required.
And on x86 platform, the "flags" can be always set as '0'.
Signed-off-by: Bin Lu <bin.lu@arm.com>
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/810 from lubinszARM:pr_setregion 8c99b19cfb0c859c6630a1cfff951db65fcf87ac
PiperOrigin-RevId: 277602603
In the future this will replace DanglingEndpoints. DanglingEndpoints must be
kept for now due to issues with save/restore.
This is arguably a cleaner design and allows the stack to know which transport
endpoints might still be using its link endpoints.
Updates #837
PiperOrigin-RevId: 277386633
When execveat is called on an interpreter script, the symlink count for
resolving the script path should be separate from the count for resolving the
the corresponding interpreter. An ELOOP error should not occur if we do not hit
the symlink limit along any individual path, even if the total number of
symlinks encountered exceeds the limit.
Closes#574
PiperOrigin-RevId: 277358474
This change helps support iterating over an NDP options buffer so that
implementations can handle all the NDP options present in an NDP packet.
Note, this change does not yet actually handle these options, it just provides
the tools to do so (in preparation for NDP's Prefix, Parameter, and a complete
implementation of Neighbor Discovery).
Tests: Unittests to make sure we can iterate over a valid NDP options buffer
that may contain multiple options. Also tests to check an iterator before
using it to see if the NDP options buffer is malformed.
PiperOrigin-RevId: 277312487
When an interpreter script is opened with O_CLOEXEC and the resulting fd is
passed into execveat, an ENOENT error should occur (the script would otherwise
be inaccessible to the interpreter). This matches the actual behavior of
Linux's execveat.
PiperOrigin-RevId: 277306680
This change supports using a user supplied TCP MSS for new active TCP
connections. Note, the user supplied MSS must be less than or equal to the
maximum possible MSS for a TCP connection's route. If it is greater than the
maximum possible MSS, the maximum possible MSS will be used as the connection's
MSS instead.
This change does not use this user supplied MSS for connections accepted from
listening sockets - that will come in a later change.
Test: Test that outgoing TCP SYN segments contain a TCP MSS option with the user
supplied MSS if it is not greater than the maximum possible MSS for the route.
PiperOrigin-RevId: 277185125
Since the syscall.Stat_t.Nlink is defined as different types on
amd64 and arm64(uint64 and uint32 respectively), we need to cast
them to a unified uint64 type in gVisor code.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I7542b99b195c708f3fc49b1cbe6adebdd2f6e96b
This change simplifies the function signatures of functions related to loading
executables, such as LoadTaskImage, Load, loadBinary.
PiperOrigin-RevId: 276821187
This change validates the ICMPv6 checksum field before further processing an
ICMPv6 packet.
Tests: Unittests to make sure that only ICMPv6 packets with a valid checksum
are accepted/processed. Existing tests using checker.ICMPv6 now also check the
ICMPv6 checksum field.
PiperOrigin-RevId: 276779148
This change is in preparation for NDP Prefix Discovery and SLAAC where the stack
will need to handle NDP Prefix Information options.
Tests: Test that given an NDP Prefix Information option buffer, correct values
are returned by the field getters.
PiperOrigin-RevId: 276594592
The amss field in the tcpip.tcp.handshake was not used anywhere. Removed it to
not cause confusion with the amss field in the tcpip.tcp.endpoint struct, which
was documented to be used (and is actually being used) for the same purpose.
PiperOrigin-RevId: 276577088
This change makes it so that NDP work is done using the per-interface NDP
configurations instead of the stack-wide default NDP configurations to correctly
implement RFC 4861 section 6.3.2 (note here, a host is a single NIC operating
as a host device), and RFC 4862 section 5.1.
Test: Test that we can set NDP configurations on a per-interface basis without
affecting the configurations of other interfaces or the stack-wide default. Also
make sure that after the configurations are updated, the updated configurations
are used for NDP processes (e.g. Duplicate Address Detection).
PiperOrigin-RevId: 276525661
Use fd.next to store the iteration start position, which can be used to accelerate allocating new FDs.
And adding the corresponding gtest benchmark to measure performance.
@tanjianfeng
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/758 from DarcySail:master 96685ec7886dfe1a64988406831d3bc002b438cc
PiperOrigin-RevId: 276351250
This change introduces a new interface, stack.NDPDispatcher. It can be
implemented by the netstack integrator to receive NDP related events. As of this
change, only DAD related events are supported.
Tests: Existing tests were modified to use the NDPDispatcher's DAD events for
DAD tests where it needed to wait for DAD completing (failing and resolving).
PiperOrigin-RevId: 276338733
This change is in preparation for NDP Router Discovery where the stack will need
to handle NDP Router Advertisments.
Tests: Test that given an NDP Router Advertisement buffer (body of an ICMPv6
packet, correct values are returned by the field getters).
PiperOrigin-RevId: 276146817
This change makes sure that when an address which is already known by a NIC and
has kind = permanentExpired gets promoted to permanent, the new
PrimaryEndpointBehavior is respected.
PiperOrigin-RevId: 276136317
Right now, we send each tcp packet separately, we call one system
call per-packet. This patch allows to generate multiple tcp packets
and send them by sendmmsg.
The arguable part of this CL is a way how to handle multiple headers.
This CL adds the next field to the Prepandable buffer.
Nginx test results:
Server Software: nginx/1.15.9
Server Hostname: 10.138.0.2
Server Port: 8080
Document Path: /10m.txt
Document Length: 10485760 bytes
w/o gso:
Concurrency Level: 5
Time taken for tests: 5.491 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 18.21 [#/sec] (mean)
Time per request: 274.525 [ms] (mean)
Time per request: 54.905 [ms] (mean, across all concurrent requests)
Transfer rate: 186508.03 [Kbytes/sec] received
sw-gso:
Concurrency Level: 5
Time taken for tests: 3.852 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 25.96 [#/sec] (mean)
Time per request: 192.576 [ms] (mean)
Time per request: 38.515 [ms] (mean, across all concurrent requests)
Transfer rate: 265874.92 [Kbytes/sec] received
w/o gso:
$ ./tcp_benchmark --client --duration 15 --ideal
[SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec
software gso:
$ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso
[SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec
PiperOrigin-RevId: 276112677
This change adds support for optionally auto-generating an IPv6 link-local
address based on the NIC's MAC Address on NIC enable.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that a link-local
address will not be auto-generated unless the stack is explicitly configured.
See `stack.Options` for more details. Specifically, see
`stack.Options.AutoGenIPv6LinkLocal`.
Tests: Tests to make sure that the IPb6 link-local address is only
auto-generated if the stack is specifically configured to do so. Also tests to
make sure that an auto-generated address goes through the DAD process.
PiperOrigin-RevId: 276059813
This patch enabled the basic framework for arm64 guest.
Serveral jobs were finished in this patch:
1, ring0.Vectors()
2, switchToUser()
3, basic framwork for Arm64 guest.
Signed-off-by: Bin Lu <bin.lu@arm.com>
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With
runsc, you'll need to pass `--net-raw=true` to enable them.
Binding isn't supported yet.
PiperOrigin-RevId: 275909366
This change fixes several issues with the fsgofer host UDS support. Notably, it
adds support for SOCK_SEQPACKET and SOCK_DGRAM sockets [1]. It also fixes
unsafe use of unet.Socket, which could cause a panic if Socket.FD is called
when err != nil, and calls to Socket.FD with nothing to prevent the garbage
collector from destroying and closing the socket.
A set of tests is added to exercise host UDS access. This required extracting
most of the syscall test runner into a library that can be used by custom
tests.
Updates #235
Updates #1003
[1] N.B. SOCK_DGRAM sockets are likely not particularly useful, as a server can
only reply to a client that binds first. We don't allow bind, so these are
unlikely to be used.
PiperOrigin-RevId: 275558502
It is quite legal to send from the ANY address (it is required for
DHCP). I can't figure out why the broadcast address was included here,
so removing that as well.
PiperOrigin-RevId: 275541954
* Pulls common functionality (IO and locking on open) into pipe_util.go.
* Adds pipe/vfs.go, which implements a subset of vfs.FileDescriptionImpl.
A subsequent change will add support for pipes in memfs.
PiperOrigin-RevId: 275322385
NDP Neighbor Solicitations sent during Duplicate Address Detection must have an
IP hop limit of 255, as all NDP Neighbor Solicitations should have.
Test: Test that DAD messages have the IPv6 hop limit field set to 255.
PiperOrigin-RevId: 275321680
This change adds support for Duplicate Address Detection on IPv6 addresses
as defined by RFC 4862 section 5.4.
Note, this change will not break existing uses of netstack as the default
configuration for the stack options is set in such a way that DAD will not be
performed. See `stack.Options` and `stack.NDPConfigurations` for more details.
Tests: Tests to make sure that the DAD process properly resolves or fails.
That is, tests make sure that DAD resolves only if:
- No other node is performing DAD for the same address
- No other node owns the same address
PiperOrigin-RevId: 275189471
Standard Linux kernel versions are VERSION.PATCHLEVEL.SUBLEVEL. e.g., 4.4.0,
even when the sublevel is 0. Match this standard.
PiperOrigin-RevId: 275125715
Linux kernel before 4.19 doesn't implement a feature that updates
open FD after a file is open for write (and is copied to the upper
layer). Already open FD will continue to read the old file content
until they are reopened. This is especially problematic for gVisor
because it caches open files.
Flag was added to force readonly files to be reopenned when the
same file is open for write. This is only needed if using kernels
prior to 4.19.
Closes#1006
It's difficult to really test this because we never run on tests
on older kernels. I'm adding a test in GKE which uses kernels
with the overlayfs problem for 1.14 and lower.
PiperOrigin-RevId: 275115289
When any of these flags are set, all writes will trigger a subsequent fsync
call. This behavior already existed for "write-through" mounts.
O_DIRECT is treated as an alias for O_SYNC. Better support coming soon.
PiperOrigin-RevId: 275114392
These syscalls were changed in the amd64 file around the time the arm64 PR was
sent out, so their changes got lost.
Updates #63
PiperOrigin-RevId: 275114194
- Pass context.Context to OnClose().
- Pass memmap.MMapOpts to ConfigureMMap() by pointer so that implementations
can actually mutate it as required.
PiperOrigin-RevId: 274934967
Reassembly can fail due to an invalid sequence of fragments
being received. eg. Multiple fragments with same id which
claim to be the last one by setting the more flag to 0 etc.
It's safer to just drop the reassembler and increment a metric
than to panic when reassembly fails.
PiperOrigin-RevId: 274920901
...and do not populate link address cache at dispatch. This partially
reverts 313c767b00, which caused malformed
packets (e.g. NDP Neighbor Adverts with incorrect hop limit values) to
populate the address cache. In particular, this masked a bug that was
introduced to the Neighbor Advert generation code in
7c1587e340.
PiperOrigin-RevId: 274865182
Netstack has its own stats, we use this to fill /proc/net/snmp.
Note that some metrics are not recorded in Netstack, which will be shown
as 0 in the proc file.
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: Ie0089184507d16f49bc0057b4b0482094417ebe1
For hostinet, we inherit the data from host procfs. To to that, we
cache the fds for these files for later reads.
Fixes#506
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: I2f81215477455b9c59acf67e33f5b9af28ee0165
This proc file reports routing information to applications inside the
container.
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: I498e47f8c4c185419befbb42d849d0b099ec71f3
This proc file contains statistics according to [1].
[1] https://tools.ietf.org/html/rfc2013
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: I9662132085edd8a7783d356ce4237d7ac0800d94
This allows for peeking at the length of the next message on a netlink socket
without pulling it off the socket's buffer/queue, allowing tools like 'ip' to
work.
This CL also fixes an issue where dump_done_errno was not included in the
NLMSG_DONE messages payload.
Issue #769
PiperOrigin-RevId: 274068637
Strengthen the header.IPv4.IsValid check to correctly check
for IHL/TotalLength fields. Also add a check to make sure
fragmentOffsets + size of the fragment do not cause a wrap
around for the end of the fragment.
PiperOrigin-RevId: 274049313
The signalfd descriptors otherwise always show as available. This can lead
programs to spin, assuming they are looking to see what signals are pending.
Updates #139
PiperOrigin-RevId: 274017890
Also ensure that all flipcall transport errors not returned by p9 (converted to
EIO by the client, or dropped on the floor by channel server goroutines) are
logged.
PiperOrigin-RevId: 272963663
The behavior for sending and receiving local broadcast (255.255.255.255)
traffic is as follows:
Outgoing
--------
* A broadcast packet sent on a socket that is bound to an interface goes out
that interface
* A broadcast packet sent on an unbound socket follows the route table to
select the outgoing interface
+ if an explicit route entry exists for 255.255.255.255/32, use that one
+ else use the default route
* Broadcast packets are looped back and delivered following the rules for
incoming packets (see next). This is the same behavior as for multicast
packets, except that it cannot be disabled via sockopt.
Incoming
--------
* Sockets wishing to receive broadcast packets must bind to either INADDR_ANY
(0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives
broadcast packets.
* Broadcast packets are multiplexed to all sockets matching it. This is the
same behavior as for multicast packets.
* A socket can bind to 255.255.255.255:<port> and then receive its own
broadcast packets sent to 255.255.255.255:<port>
In addition, this change implicitly fixes an issue with multicast reception. If
two sockets want to receive a given multicast stream and one is bound to ANY
while the other is bound to the multicast address, only one of them will
receive the traffic.
PiperOrigin-RevId: 272792377
The input file descriptor is always a regular file, so sendfile can't lose any
data if it will not be able to write them to the output file descriptor.
Reported-by: syzbot+22d22330a35fa1c02155@syzkaller.appspotmail.com
PiperOrigin-RevId: 272730357
Right now, we can find more than one process with the 1 PID in /proc.
$ for i in `seq 10`; do
> unshare -fp sleep 1000 &
> done
$ ls /proc
1 1 1 1 12 18 24 29 6 loadavg net sys version
1 1 1 1 16 20 26 32 cpuinfo meminfo self thread-self
1 1 1 1 17 21 28 36 filesystems mounts stat uptime
PiperOrigin-RevId: 272506593
gVisor does not currently implement the functionality that would result in
AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we
should do the same.
PiperOrigin-RevId: 272311488
Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to
track their CPU usage. This improves latency in the syscall path by avoid
expensive monotonic clock calls on every syscall entry/exit.
However, this timer fires every 10ms. Thus, when all tasks are idle (i.e.,
blocked or stopped), this forces a sentry wakeup every 10ms, when we may
otherwise be able to sleep until the next app-relevant event. These wakeups
cause the sentry to utilize approximately 2% CPU when the application is
otherwise idle.
Updates to clock are not strictly necessary when the app is idle, as there are
no readers of cpuClock. This commit reduces idle CPU by disabling the timer
when tasks are completely idle, and computing its effects at the next wakeup.
Rather than disabling the timer as soon as the app goes idle, we wait until the
next tick, which provides a window for short sleeps to sleep and wakeup without
doing the (relatively) expensive work of disabling and enabling the timer.
PiperOrigin-RevId: 272265822
Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58
(v4.11). Previously, extra pages were always mapped RW. Now, those pages will
be executable if the segment specified PF_X. They still must be writeable.
PiperOrigin-RevId: 272256280
It isn't allowed to splice data from and into the same pipe.
But right now this check is broken, because we don't check that both ends are
pipes.
PiperOrigin-RevId: 272107022
Netstack always picks a random start point everytime PickEphemeralPort
is called. While this is required for UDP so that DNS requests go
out through a randomized set of ports it is not required for TCP. Infact
Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret
initialized at start of the application to get a random offset. But to
ensure it doesn't start from the same point on every scan it uses a static
hint that is incremented by 2 in every call to pick ephemeral ports.
The reason for 2 is Linux seems to split the port ranges where active connects
seem to use even ones while odd ones are used by listening sockets.
This CL implements a similar strategy where we use a hash + hint to generate
the offset to start the search for a free Ephemeral port.
This ensures that we cycle through the available port space in order for
repeated connects to the same destination and significantly reduces the
chance of picking a recently released port.
PiperOrigin-RevId: 272058370
The gofer's CachingInodeOperations implementation contains an optimization for
the common open-read-close pattern when we have a host FD. In this case, the
host kernel will update the timestamp for us to a reasonably close time, so we
don't need an extra RPC to the gofer.
However, when the app explicitly sets the timestamps (via futimes or similar)
then we actually DO need to update the timestamps, because the host kernel
won't do it for us.
To fix this, a new boolean `forceSetTimestamps` was added to
CachineInodeOperations.SetMaskedAttributes. It is only set by
gofer.InodeOperations.SetTimestamps.
PiperOrigin-RevId: 272048146
Before https://golang.org/cl/173160 syscall.RawSyscall would zero out
the last three register arguments to the system call. That no longer happens.
For system calls that take more than three arguments, use RawSyscall6 to
ensure that we pass zero, not random data, for the additional arguments.
PiperOrigin-RevId: 271062527
Non-primary addresses are used for endpoints created to accept multicast and
broadcast packets, as well as "helper" endpoints (0.0.0.0) that allow sending
packets when no proper address has been assigned yet (e.g., for DHCP). These
addresses are not real addresses from a user point of view and should not be
part of the NICInfo() value. Also see b/127321246 for more info.
This switches NICInfo() to call a new NIC.PrimaryAddresses() function. To still
allow an option to get all addresses (mostly for testing) I added
Stack.GetAllAddresses() and NIC.AllAddresses().
In addition, the return value for GetMainNICAddress() was changed for the case
where the NIC has no primary address. Instead of returning an error here,
it now returns an empty AddressWithPrefix() value. The rational for this
change is that it is a valid case for a NIC to have no primary addresses.
Lastly, I refactored the code based on the new additions.
PiperOrigin-RevId: 270971764
How to reproduce:
$ echo "timeout 10 ls" > foo.sh
$ chmod +x foo.sh
$ ./foo.sh
(will hang here for 10 secs, and the output of ls does not show)
When "ls" process writes to stdout, it receives SIGTTOU signal, and
hangs there. Until "timeout" process timeouts, and kills "ls" process.
The expected result is: "ls" writes its output into tty, and terminates
immdedately, then "timeout" process receives SIGCHLD and terminates.
The reason for this failure is that we missed the check for TOSTOP (if
set, background processes will receive the SIGTTOU signal when they do
write).
We use drivers/tty/n_tty.c:n_tty_write() as a reference.
Fixes: #862
Reported-by: chris.zn <chris.zn@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Signed-off-by: chenglang.hy <chenglang.hy@antfin.com>
Previously, the only safe way to use an fdbased endpoint was to leak the FD.
This change makes it possible to safely close the FD.
This is the first step towards having stoppable stacks.
Updates #837
PiperOrigin-RevId: 270346582
Previously, when we set hostname:
$ strace hostname abc
...
sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long)
...
According to man 2 sethostname:
"The len argument specifies the number of bytes in name. (Thus, name
does not require a terminating null byte.)"
We wrongly use the CopyStringIn() to check terminating zero byte in
the implementation of sethostname syscall.
To fix this, we use CopyInBytes() instead.
Fixes: #861
Reported-by: chenglang.hy <chenglang.hy@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
- Do not call Rread.SetPayload(flipcall packet window) in p9.channel.recv().
- Ignore EINTR from ppoll() in p9.Client.watch().
- Clean up handling of client socket FD lifetimes so that p9.Client.watch()
never ppoll()s a closed FD.
- Make p9test.Harness.Finish() call clientSocket.Shutdown() instead of
clientSocket.Close() for the same reason.
- Rework channel reuse to avoid leaking channels in the following case (suppose
we have two channels):
sendRecvChannel
len(channels) == 2 => idx = 1
inuse[1] = ch0
sendRecvChannel
len(channels) == 1 => idx = 0
inuse[0] = ch1
inuse[1] = nil
sendRecvChannel
len(channels) == 1 => idx = 0
inuse[0] = ch0
inuse[0] = nil
inuse[0] == nil => ch0 leaked
- Avoid deadlocking p9.Client.watch() by calling channelsWg.Wait() without
holding channelsMu.
- Bump p9test:client_test size to medium.
PiperOrigin-RevId: 270200314