From RFC 793 s3.9 p58 Event Processing:
If RECEIVE Call arrives in CLOSED state and the user has access to such a
connection, the return should be "error: connection does not exist"
Fixes#1598
PiperOrigin-RevId: 293494287
Updates #1198
Opening host pipes (by spinning in fdpipe) and host sockets is not yet
complete, and will be done in a future CL.
Major differences from VFS1 gofer client (sentry/fs/gofer), with varying levels
of backportability:
- "Cache policies" are replaced by InteropMode, which control the behavior of
timestamps in addition to caching. Under InteropModeExclusive (analogous to
cacheAll) and InteropModeWritethrough (analogous to cacheAllWritethrough),
client timestamps are *not* written back to the server (it is not possible in
9P or Linux for clients to set ctime, so writing back client-authoritative
timestamps results in incoherence between atime/mtime and ctime). Under
InteropModeShared (analogous to cacheRemoteRevalidating), client timestamps
are not used at all (remote filesystem clocks are authoritative). cacheNone
is translated to InteropModeShared + new option
filesystemOptions.specialRegularFiles.
- Under InteropModeShared, "unstable attribute" reloading for permission
checks, lookup, and revalidation are fused, which is feasible in VFS2 since
gofer.filesystem controls path resolution. This results in a ~33% reduction
in RPCs for filesystem operations compared to cacheRemoteRevalidating. For
example, consider stat("/foo/bar/baz") where "/foo/bar/baz" fails
revalidation, resulting in the instantiation of a new dentry:
VFS1 RPCs:
getattr("/") // fs.MountNamespace.FindLink() => fs.Inode.CheckPermission() => gofer.inodeOperations.check() => gofer.inodeOperations.UnstableAttr()
walkgetattr("/", "foo") = fid1 // fs.Dirent.walk() => gofer.session.Revalidate() => gofer.cachePolicy.Revalidate()
clunk(fid1)
getattr("/foo") // CheckPermission
walkgetattr("/foo", "bar") = fid2 // Revalidate
clunk(fid2)
getattr("/foo/bar") // CheckPermission
walkgetattr("/foo/bar", "baz") = fid3 // Revalidate
clunk(fid3)
walkgetattr("/foo/bar", "baz") = fid4 // fs.Dirent.walk() => gofer.inodeOperations.Lookup
getattr("/foo/bar/baz") // linux.stat() => gofer.inodeOperations.UnstableAttr()
VFS2 RPCs:
getattr("/") // gofer.filesystem.walkExistingLocked()
walkgetattr("/", "foo") = fid1 // gofer.filesystem.stepExistingLocked()
clunk(fid1)
// No getattr: walkgetattr already updated metadata for permission check
walkgetattr("/foo", "bar") = fid2
clunk(fid2)
walkgetattr("/foo/bar", "baz") = fid3
// No clunk: fid3 used for new gofer.dentry
// No getattr: walkgetattr already updated metadata for stat()
- gofer.filesystem.unlinkAt() does not require instantiation of a dentry that
represents the file to be deleted. Updates #898.
- gofer.regularFileFD.OnClose() skips Tflushf for regular files under
InteropModeExclusive, as it's nonsensical to request a remote file flush
without flushing locally-buffered writes to that remote file first.
- Symlink targets are cached when InteropModeShared is not in effect.
- p9.QID.Path (which is already required to be unique for each file within a
server, and is accordingly already synthesized from device/inode numbers in
all known gofers) is used as-is for inode numbers, rather than being mapped
along with attr.RDev in the client to yet another synthetic inode number.
- Relevant parts of fsutil.CachingInodeOperations are inlined directly into
gofer package code. This avoids having to duplicate part of its functionality
in fsutil.HostMappable.
PiperOrigin-RevId: 293190213
A couple other things that changed:
- There's a proper extension registration system for matchers. Anyone
adding another matcher can use tcp_matcher.go or udp_matcher.go as a
template.
- All logging and use of syserr.Error in the netfilter package happens at the
highest possible level (public functions). Lower-level functions just
return normal, descriptive golang errors.
Splice must not allow negative offsets. Writes also must not allow offset +
size to overflow int64. Reads are similarly broken, but not just in splice
(b/148095030).
Reported-by: syzbot+0e1ff0b95fb2859b4190@syzkaller.appspotmail.com
PiperOrigin-RevId: 292361208
Currently, Send() will copy data into a new byte slice without regard to the
original size. Size checks should be performed before the allocation takes
place.
Note that for the sake of performance, we avoid putting the buffer
allocation into the critical section. As a result, the size checks need to be
performed again within Enqueue() in case the limit has changed.
PiperOrigin-RevId: 292058147
WritableSource is a convenience interface used for files that can
be written to, e.g. /proc/net/ipv4/tpc_sack. It reads max of 4KB
and only from offset 0 which should cover most cases. It can be
extended as neeed.
Updates #1195
PiperOrigin-RevId: 292056924
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.
Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.
Updates #1487
Updates #1623
PiperOrigin-RevId: 292023223
Special files can have additional requirements for granularity.
For example, read from eventfd returns EINVAL if a size is less 8 bytes.
Reported-by: syzbot+3905f5493bec08eb7b02@syzkaller.appspotmail.com
PiperOrigin-RevId: 292002926
Test command:
$ ip route get 1.1.1.1
Fixes: #1099
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/1121 from tanjianfeng:fix-1099 e6919f3d4ede5aa51a48b3d2be0d7a4b482dd53d
PiperOrigin-RevId: 291990716
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.
PiperOrigin-RevId: 291811289
Note that in VFS2, filesystem device numbers are per-vfs.FilesystemImpl rather
than global, avoiding the need for a "registry" type to handle save/restore.
(This is more consistent with Linux anyway: compare e.g.
mm/shmem.c:shmem_mount() => fs/super.c:mount_nodev() => (indirectly)
set_anon_super().)
PiperOrigin-RevId: 291425193
Go 1.14+ sends SIGURG to Ms to attempt asynchronous preemption of a G. Since it
can't guarantee that a SIGURG is only related to preemption, it continues to
forward them to signal.Notify (see runtime.sighandler).
We should ignore these signals, as applications shouldn't receive them. Note
that this means that truly external SIGURG can no longer be sent to the
application (as with SIGCHLD).
PiperOrigin-RevId: 291415357
The kernel may return EINTR from:
kvm_create_vm
kvm_init_mmu_notifier
mmu_notifier_register
do_mmu_notifier_register
mm_take_all_locks
Go 1.14's preemptive scheduling signals make hitting this much more likely.
PiperOrigin-RevId: 291212669
The iptables binary is looking for libxt_.so when it should be looking
for libxt_udp.so, so it's having an issue reading the data in
xt_match_entry. I think it may be an alignment issue.
Trying to fix this is leading to me fighting with the metadata struct,
so I'm gonna go kill that.
Also renames TMutex to Mutex.
These custom mutexes aren't any worse than the standard library versions (same
code), so having both seems redundant.
PiperOrigin-RevId: 290873587
Such a stat accounts for all connections that are currently
established and not yet transitioned to close state.
Also fix bug in double increment of CurrentEstablished stat.
Fixes#1579
PiperOrigin-RevId: 290827365
Note that these simply will use the same logic as getxattr and setxattr, which
is not yet implemented for most filesystems.
PiperOrigin-RevId: 290800960
Java 11 parses /proc/self/mountinfo for cgroup information. Java 11.0.4 uses
the mount path to determine what cgroups existed, but Java 11.0.5 reads the
cgroup names from the superblock options.
This CL adds the cgroup name to the superblock options if the filesystem type
is "cgroup". Since gVisor doesn't actually support cgroups yet, we just infer
the cgroup name from the path.
PiperOrigin-RevId: 290434323
CERT Advisory CA-96.21 III. Solution advises that devices drop packets which
could not have correctly arrived on the wire, such as receiving a packet where
the source IP address is owned by the device that sent it.
Fixes#1507
PiperOrigin-RevId: 290378240
x86 and arm64 use a different stat struct in Linux
kernel, so the stat() syscall implementation has
to handle the file stat data separately.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: If3986e915a667362257a54e7fbbcc1fe18951015
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/1493 from xiaobo55x:stat f15a216d9297eb9a96d2c483d396a9919145d7fa
PiperOrigin-RevId: 290274287
There was a very bare get/setxattr in the InodeOperations interface. Add
context.Context to both, size to getxattr, and flags to setxattr.
Note that extended attributes are passed around as strings in this
implementation, so size is automatically encoded into the value. Size is
added in getxattr so that implementations can return ERANGE if a value is larger
than can fit in the user-allocated buffer. This prevents us from unnecessarily
passing around an arbitrarily large xattr when the user buffer is actually too
small.
Don't use the existing xattrwalk and xattrcreate messages and define our
own, mainly for the sake of simplicity.
Extended attributes will be implemented in future commits.
PiperOrigin-RevId: 290121300
sys_clone has many flavors in Linux, and amd64 chose
a different one from x86(different arguments order).
Ref kernel/fork.c for more info.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I6c8cbc685f4a6e786b171715ab68292fc95cbf48
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/1545 from xiaobo55x:clone 156bd2dfbc63ef5291627b0578ddea77997393b2
PiperOrigin-RevId: 290093953
Signed-off-by: Bin Lu <bin.lu@arm.com>
Change-Id: I9cce23db4e5caec82ce42b4970fdb7f7e8c08f1d
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/773 from lubinszARM:pr_arch_basic 3fe2fd8e6286766bbe489ef971dce204f924feba
PiperOrigin-RevId: 289795569
This patch holds taskset.mu when getting tty. If we don't
do this, it may cause a "unlock of unlocked mutex" problem,
since signalHandlers may be replaced by CopyForExec() in
runSyscallAfterExecStop after the signalHandlers.mu has
been holded in TTY().
The problem is easy to reproduce with keeping to do "runsc ps".
The crash log is :
fatal error: sync: unlock of unlocked mutex
goroutine 5801304 [running]:
runtime.throw(0xfd019c, 0x1e)
GOROOT/src/runtime/panic.go:774 +0x72 fp=0xc001ba47b0 sp=0xc001ba4780 pc=0x431702
sync.throw(0xfd019c, 0x1e)
GOROOT/src/runtime/panic.go:760 +0x35 fp=0xc001ba47d0 sp=0xc001ba47b0 pc=0x431685
sync.(*Mutex).unlockSlow(0xc00cf94a30, 0xc0ffffffff)
GOROOT/src/sync/mutex.go:196 +0xd6 fp=0xc001ba47f8 sp=0xc001ba47d0 pc=0x4707d6
sync.(*Mutex).Unlock(0xc00cf94a30)
GOROOT/src/sync/mutex.go:190 +0x48 fp=0xc001ba4818 sp=0xc001ba47f8 pc=0x4706e8
gvisor.dev/gvisor/pkg/sentry/kernel.(*ThreadGroup).TTY(0xc011a9e800, 0x0)
pkg/sentry/kernel/tty.go:38 +0x88 fp=0xc001ba4868 sp=0xc001ba4818 pc=0x835fa8
gvisor.dev/gvisor/pkg/sentry/control.Processes(0xc00025ae00, 0xc013e397c0, 0x40, 0xc0137b9800, 0x1, 0x7f292e9a4cc0)
pkg/sentry/control/proc.go:366 +0x355 fp=0xc001ba49a0 sp=0xc001ba4868 pc=0x9ac4a5
gvisor.dev/gvisor/runsc/boot.(*containerManager).Processes(0xc0003b62c0, 0xc0051423d0, 0xc0137b9800, 0x0, 0x0)
runsc/boot/controller.go:228 +0xdf fp=0xc001ba49e8 sp=0xc001ba49a0 pc=0xaf06cf
Signed-off-by: chris.zn <chris.zn@antfin.com>
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
When PCID is disabled, there would throw a panic
when dropPageTables() access to c.PCID without check.
Signed-off-by: Lai Jiangshan <eag0628@gmail.com>
This change calls a new Truncate method on the EndpointReader in RecvMsg for
both netlink and unix sockets. This allows readers such as sockets to peek at
the length of data without actually reading it to a buffer.
Fixes#993#1240
PiperOrigin-RevId: 288800167
This gets us closer to passing the iptables tests and opens up iptables
so it can be worked on by multiple people.
A few restrictions are enforced for security (i.e. we don't want to let
users write a bunch of iptables rules and then just not enforce them):
- Only the filter table is writable.
- Only ACCEPT rules with no matching criteria can be added.
Right now, we need to call ptrace(PTRACE_SYSCALL) and wait() twice to execute
one system call in a stub process. With these changes, we will need to call
ptrace + wait only once.
In addition, this allows to workaround the kernel bug when a stub process
doesn't stop on syscall-exit-stop and starts executing the next system call.
Reported-by: syzbot+37143cafa8dc3b5008ee@syzkaller.appspotmail.com
PiperOrigin-RevId: 288393029
- Renamed memfs to tmpfs.
- Copied fileRangeSet bits from fs/fsutil/ to fsimpl/tmpfs/
- Changed tmpfs to be backed by filemem instead of byte slice.
- regularFileReadWriter uses a sync.Pool, similar to gofer client.
PiperOrigin-RevId: 288356380
Currently, shm.Registry.FindByID will return Shm instances without taking an
additional reference on them, making it possible for them to disappear.
More explicitly handle references. All callers hold a reference for the
duration that they hold the instance. Registry.shms may transitively hold Shms
with no references, so it must TryIncRef to determine if they are still valid.
PiperOrigin-RevId: 288314529
- Add FileDescriptionOptions.UseDentryMetadata, which reduces the amount of
boilerplate needed for device FDs and the like between filesystems.
- Switch back to having FileDescription.Init() take references on the Mount and
Dentry; otherwise managing refcounts around failed calls to
OpenDeviceSpecialFile() / Device.Open() is tricky.
PiperOrigin-RevId: 287575574
Added the ability to get/set the IP_RECVTOS socket option on UDP endpoints. If
enabled, TOS from the incoming Network Header passed as ancillary data in the
ControlMessages.
Test:
* Added unit test to udp_test.go that tests getting/setting as well as
verifying that we receive expected TOS from incoming packet.
* Added a syscall test
PiperOrigin-RevId: 287029703
There are 2 jobs have been finished in this patch:
1, a comment was added to explain the purpose of the extra NOPs in Vectors().
2, some merge errors were fixed.
Signed-off-by: Bin Lu <bin.lu@arm.com>
- Make FilesystemImpl methods that operate on parent directories require
!rp.Done() (i.e. there is at least one path component to resolve) as
precondition and postcondition (in cases where they do not finish path
resolution due to mount boundary / absolute symlink), and require that they
do not need to follow the last path component (the file being created /
deleted) as a symlink. Check for these in VFS.
- Add FilesystemImpl.GetParentDentryAt(), which is required to obtain the old
parent directory for VFS.RenameAt(). (Passing the Dentry to be renamed
instead has the wrong semantics if the file named by the old path is a mount
point since the Dentry will be on the wrong Mount.)
- Update memfs to implement these methods correctly (?), including RenameAt.
- Change fspath.Parse() to allow empty paths (to simplify implementation of
AT_EMPTY_PATH).
- Change vfs.PathOperation to take a fspath.Path instead of a raw pathname;
non-test callers will need to fspath.Parse() pathnames themselves anyway in
order to detect absolute paths and select PathOperation.Start accordingly.
PiperOrigin-RevId: 286934941
Linux PTRACE_SYSEMU support on arm64 was merged to mainline from
V5.3, and the corresponding support in go also enabled recently.
Since the "syscall" package is locked down from go 1.4, so the ptrace
PTRACE_SYSEMU definition can't be added to package "syscall" on arm64.
According to the golang community, updates required by new systems or
versions should use the corresponding package in the golang.org/x/sys
repository instead(https://golang.org/pkg/syscall/).
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I2f917bb2be62f990c3e158e2bb99e094ea03f751
This change is needed to be compatible with the Linux kernel.
There is no glibc wrapper for the futex system call, so it is easy to
make a mistake and call syscall(__NR_futex, FUTEX_WAKE, addr) without
the fourth argument. This works on Linux, because it wakes one waiter
even if val is nonpositive.
PiperOrigin-RevId: 286494396