Commit Graph

728 Commits

Author SHA1 Message Date
Jianfeng Tan 96f78e2466 unix: return zero if peer is closed
Previously, recvmsg() on a unix stream socket with its peer closed will
never return, with goroutine call trace like this:

  ...
  2  in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).block
     at pkg/sentry/kernel/task_block.go:124
  3  in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).BlockWithDeadline
     at pkg/sentry/kernel/task_block.go:69
  4  in gvisor.dev/gvisor/pkg/sentry/socket/unix.(*SocketOperations).RecvMsg
     at pkg/sentry/socket/unix/unix.go:612
  5  in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.recvFrom
     at pkg/sentry/syscalls/linux/sys_socket.go:885
  6  in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.RecvFrom
     at pkg/sentry/syscalls/linux/sys_socket.go:910
  ...

The issue is caused by that ErrClosedForReceive returned by
unix/transport.queue is turned into nil in
unix.(*EndpointReader).ReadToBlocks():

  err.ToError()

As a result, in unix.(*SocketOperations).RecvMsg():

  n == 0 and err == nil

We shall differentiate it from another case - no data to read where
ErrWouldBlock shall be returned; and return 0 immediately.

Fixes: #734

Reported-by: chenglang.hy <chenglang.hy@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-08-22 15:25:38 +00:00
Tamir Duberstein 573e6e4bba Use tcpip.Subnet in tcpip.Route
This is the first step in replacing some of the redundant types with the
standard library equivalents.

PiperOrigin-RevId: 264706552
2019-08-21 15:31:18 -07:00
Zach Koopmans 67d7864f83 Document RWF_HIPRI not implemented for preadv2/pwritev2.
Document limitation of no reasonable implementation for RWF_HIPRI
flag (High Priority Read/Write for block-based file systems).

PiperOrigin-RevId: 264237589
2019-08-19 14:07:44 -07:00
Jianfeng Tan a63f88855f hostinet: fix parsing route netlink message
We wrongly parses output interface as gateway address.
The fix is straightforward.

Fixes #638

Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: Ia4bab31f3c238b0278ea57ab22590fad00eaf061
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/684 from tanjianfeng:fix-638 b940e810367ad1273519bfa594f4371bdd293e83
PiperOrigin-RevId: 264211336
2019-08-19 12:10:21 -07:00
Kevin Krakauer bd826092fe Read iptables via sockopts.
PiperOrigin-RevId: 264180125
2019-08-19 10:05:59 -07:00
Andrei Vagin 3e4102b2ea netstack: disconnect an unix socket only if the address family is AF_UNSPEC
Linux allows to call connect for ANY and the zero port.

PiperOrigin-RevId: 263892534
2019-08-16 19:32:14 -07:00
Ayush Ranjan 661b2b9f69 procfs: Migrate seqfile implementations.
Migrates all (except 3) seqfile implementations to the vfs.DynamicBytesSource
interface. There should not be any change in functionality due to this migration
itself.

Please note that the following seqfile implementations have not been migrated:
- /proc/filesystems in proc/filesystems.go
- /proc/[pid]/mountinfo in proc/mounts.go
- /proc/[pid]/mounts in proc/mounts.go
This is because these depend on pending changes in /pkg/senty/vfs.

PiperOrigin-RevId: 263880719
2019-08-16 17:36:42 -07:00
Andrei Vagin 2a1303357c ptrace: detect if a stub process exited unexpectedly
PiperOrigin-RevId: 263880577
2019-08-16 17:33:28 -07:00
Ayush Ranjan 4bab7d7f08 vfs: Remove vfs.DefaultDirectoryFD from embedding vfs.DefaultFD.
This fixes the implementation ambiguity issues when a filesystem
implementation embeds vfs.DefaultDirectoryFD to its directory FD along
with an internal common fileDescription utility.

For similar reasons also removes FileDescriptionDefaultImpl from
DynamicBytesFileDescriptionImpl.

PiperOrigin-RevId: 263795513
2019-08-16 10:20:11 -07:00
Tamir Duberstein d81d94ac4c Replace uinptr with int64 when returning lengths
This is in accordance with newer parts of the standard library.

PiperOrigin-RevId: 263449916
2019-08-14 16:05:56 -07:00
Bhasker Hariharan 570fb1db6b Improve SendMsg performance.
SendMsg before this change would copy all the data over into a
new slice even if the underlying socket could only accept a
small amount of data. This is really inefficient with non-blocking
sockets and under high throughput where large writes could get
ErrWouldBlock or if there was say a timeout associated with the sendmsg()
syscall.

With this change we delay copying bytes in till they are needed and only
copy what can be potentially sent/held in the socket buffer. Reducing
the need to repeatedly copy data over.

Also a minor fix to change state FIN-WAIT-1 when shutdown(..., SHUT_WR) is called
instead of when we transmit the actual FIN. Otherwise the socket could remain in
CONNECTED state even though the user has called shutdown() on the socket.

Updates #627

PiperOrigin-RevId: 263430505
2019-08-14 14:34:27 -07:00
Jamie Liu cee044c2ab Add vfs.DynamicBytesFileDescriptionImpl.
This replaces fs/proc/seqfile for vfs2-based filesystems.

PiperOrigin-RevId: 263254647
2019-08-13 17:54:24 -07:00
Fabricio Voznika 0e907c4298 Fix file mode check in pipeOperations
PiperOrigin-RevId: 263203441
2019-08-13 13:33:33 -07:00
Nicolas Lacasse 9769a8eaa4 Handle ENOSPC with a partial write.
Similar to the EPIPE case, we can return the number of bytes written before
ENOSPC was encountered. If the app tries to write more, we can return ENOSPC on
the next write.

PiperOrigin-RevId: 263041648
2019-08-12 17:41:33 -07:00
Andrei Vagin af90e68623 netlink: return an error in nlmsgerr
Now if a process sends an unsupported netlink requests,
an error is returned from the send system call.

The linux kernel works differently in this case. It returns errors in the
nlmsgerr netlink message.

Reported-by: syzbot+571d99510c6f935202da@syzkaller.appspotmail.com
PiperOrigin-RevId: 262690453
2019-08-09 22:34:54 -07:00
Haibo Xu 1c9da886e7 Add initial ptrace stub and syscall support for arm64.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I1dbd23bb240cca71d0cc30fc75ca5be28cb4c37c
PiperOrigin-RevId: 262619519
2019-08-09 13:18:11 -07:00
Ayush Ranjan c8961a6cbd ext: Move to pkg/sentry/fsimpl.
fsimpl is the keeper of all filesystem implementations in VFS2.

PiperOrigin-RevId: 262617869
2019-08-09 13:08:28 -07:00
Ayush Ranjan 690308111c ext: Benchmark tests.
Added benchmark tests which emulate memfs benchmarks.

Stat benchmarks
BenchmarkVFS2Ext4fsStat/1-12      	10000000	       145 ns/op
BenchmarkVFS2Ext4fsStat/2-12      	10000000	       170 ns/op
BenchmarkVFS2Ext4fsStat/3-12      	10000000	       202 ns/op
BenchmarkVFS2Ext4fsStat/8-12      	 3000000	       374 ns/op
BenchmarkVFS2Ext4fsStat/64-12     	  500000	      2159 ns/op
BenchmarkVFS2Ext4fsStat/100-12    	  300000	      3459 ns/op

BenchmarkVFS1TmpfsStat/1-12       	 5000000	       348 ns/op
BenchmarkVFS1TmpfsStat/2-12       	 3000000	       487 ns/op
BenchmarkVFS1TmpfsStat/3-12       	 2000000	       655 ns/op
BenchmarkVFS1TmpfsStat/8-12       	 1000000	      1365 ns/op
BenchmarkVFS1TmpfsStat/64-12      	  200000	      9565 ns/op
BenchmarkVFS1TmpfsStat/100-12     	  100000	     15158 ns/op

BenchmarkVFS2MemfsStat/1-12       	10000000	       133 ns/op
BenchmarkVFS2MemfsStat/2-12       	10000000	       155 ns/op
BenchmarkVFS2MemfsStat/3-12       	10000000	       182 ns/op
BenchmarkVFS2MemfsStat/8-12       	 5000000	       310 ns/op
BenchmarkVFS2MemfsStat/64-12      	 1000000	      1659 ns/op
BenchmarkVFS2MemfsStat/100-12     	  500000	      2787 ns/op

Mount Stat benchmarks
BenchmarkVFS2ExtfsMountStat/1-12  	 5000000	       245 ns/op
BenchmarkVFS2ExtfsMountStat/2-12  	 5000000	       266 ns/op
BenchmarkVFS2ExtfsMountStat/3-12  	 5000000	       304 ns/op
BenchmarkVFS2ExtfsMountStat/8-12  	 3000000	       456 ns/op
BenchmarkVFS2ExtfsMountStat/64-12 	  500000	      2308 ns/op
BenchmarkVFS2ExtfsMountStat/100-12   300000	      3482 ns/op

BenchmarkVFS1TmpfsMountStat/1-12  	 3000000	       488 ns/op
BenchmarkVFS1TmpfsMountStat/2-12  	 2000000	       658 ns/op
BenchmarkVFS1TmpfsMountStat/3-12  	 2000000	       806 ns/op
BenchmarkVFS1TmpfsMountStat/8-12  	 1000000	      1514 ns/op
BenchmarkVFS1TmpfsMountStat/64-12 	  100000	     10037 ns/op
BenchmarkVFS1TmpfsMountStat/100-12        100000	     15280 ns/op

BenchmarkVFS2MemfsMountStat/1-12           	10000000	       212 ns/op
BenchmarkVFS2MemfsMountStat/2-12           	 5000000	       232 ns/op
BenchmarkVFS2MemfsMountStat/3-12           	 5000000	       264 ns/op
BenchmarkVFS2MemfsMountStat/8-12           	 3000000	       390 ns/op
BenchmarkVFS2MemfsMountStat/64-12          	 1000000	      1813 ns/op
BenchmarkVFS2MemfsMountStat/100-12         	  500000	      2812 ns/op

PiperOrigin-RevId: 262477158
2019-08-08 18:45:37 -07:00
Rahat Mahmood 7bfad8ebb6 Return a well-defined socket address type from socket funtions.
Previously we were representing socket addresses as an interface{},
which allowed any type which could be binary.Marshal()ed to be used as
a socket address. This is fine when the address is passed to userspace
via the linux ABI, but is problematic when used from within the sentry
such as by networking procfs files.

PiperOrigin-RevId: 262460640
2019-08-08 16:50:33 -07:00
Rahat Mahmood 13a98df49e netstack: Don't start endpoint goroutines too soon on restore.
Endpoint protocol goroutines were previously started as part of
loading the endpoint. This is potentially too soon, as resources used
by these goroutine may not have been loaded. Protocol goroutines may
perform meaningful work as soon as they're started (ex: incoming
connect) which can cause them to indirectly access resources that
haven't been loaded yet.

This CL defers resuming all protocol goroutines until the end of
restore.

PiperOrigin-RevId: 262409429
2019-08-08 12:33:11 -07:00
gVisor bot 2e45d1696e Merge pull request #653 from xiaobo55x:dev
PiperOrigin-RevId: 262402929
2019-08-08 11:58:14 -07:00
Jamie Liu 06102af65a memfs fixes.
- Unexport Filesystem/Dentry/Inode.

- Support SEEK_CUR in directoryFD.Seek().

- Hold Filesystem.mu before touching directoryFD.off in
directoryFD.Seek().

- Remove deleted Dentries from their parent directory.childLists.

- Remove invalid FIXMEs.

PiperOrigin-RevId: 262400633
2019-08-08 11:46:38 -07:00
Ayush Ranjan 08cd5e1d36 ext: Seek unit tests.
PiperOrigin-RevId: 262264674
2019-08-07 19:13:41 -07:00
Ayush Ranjan 40d6d8c15b ext: StatAt unit tests.
PiperOrigin-RevId: 262249166
2019-08-07 17:21:00 -07:00
Ayush Ranjan 3b368cabf9 ext: Read unit tests.
PiperOrigin-RevId: 262242410
2019-08-07 16:44:10 -07:00
Ayush Ranjan ad67e5a7a0 ext: IterDirent unit tests.
PiperOrigin-RevId: 262226761
2019-08-07 15:24:33 -07:00
Ayush Ranjan 1c9781a4ed ext: vfs.FileDescriptionImpl and vfs.FilesystemImpl implementations.
- This also gets rid of pipes for now because pipe does not have vfs2 specific
  support yet.
- Added file path resolution logic.
- Fixes testing infrastructure.
- Does not include unit tests yet.

PiperOrigin-RevId: 262213950
2019-08-07 14:23:42 -07:00
Michael Pratt 704f9610f3 Require pread/pwrite for splice file offsets
If there is an offset, the file must support pread/pwrite. See
fs/splice.c:do_splice.

PiperOrigin-RevId: 261944932
2019-08-06 10:35:28 -07:00
Haibo Xu 83fdb7739e Change syscall.EPOLLET to unix.EPOLLET
syscall.EPOLLET has been defined with different values on amd64 and
arm64(-0x80000000 on amd64, and 0x80000000 on arm64), while unix.EPOLLET
has been unified this value to 0x80000000(golang/go#5328). ref #63

Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: Id97d075c4e79d86a2ea3227ffbef02d8b00ffbb8
2019-08-05 23:10:08 +00:00
Kevin Krakauer 810cc07aab Plumbing for iptables sockopts.
PiperOrigin-RevId: 261413396
2019-08-02 16:26:48 -07:00
Kevin Krakauer b6a5b950d2 Job control: controlling TTYs and foreground process groups.
(Don't worry, this is mostly tests.)

Implemented the following ioctls:
- TIOCSCTTY - set controlling TTY
- TIOCNOTTY - remove controlling tty, maybe signal some other processes
- TIOCGPGRP - get foreground process group. Also enables tcgetpgrp().
- TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp().

Next steps are to actually turn terminal-generated control characters (e.g. C^c)
into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when
appropriate.

PiperOrigin-RevId: 261387276
2019-08-02 14:05:48 -07:00
Rahat Mahmood 2906dffcdb Automated rollback of changelist 261191548
PiperOrigin-RevId: 261373749
2019-08-02 12:52:40 -07:00
Nicolas Lacasse aaaefdf9ca Remove kernel.mounts.
We can get the mount namespace from the CreateProcessArgs in all cases where we
need it. This also gets rid of kernel.Destroy method, since the only thing it
was doing was DecRefing the mounts.

Removing the need to call kernel.SetRootMountNamespace also allowed for some
more simplifications in the container fs setup code.

PiperOrigin-RevId: 261357060
2019-08-02 11:23:11 -07:00
Nicolas Lacasse bad43772a1 Drop reference on fs.Inode if Mount goes wrong.
PiperOrigin-RevId: 261203674
2019-08-01 14:57:49 -07:00
Nicolas Lacasse f2b25aeac7 tmpfs and ramfs Dirs should drop references on children in Release().
This is the source of many warnings like:
AtomicRefCount 0x7f5ff84e3500 owned by "fs.Inode" garbage collected with ref count of 1 (want 0)

PiperOrigin-RevId: 261197093
2019-08-01 14:25:14 -07:00
Rahat Mahmood 79511e8a50 Implement getsockopt(TCP_INFO).
Export some readily-available fields for TCP_INFO and stub out the rest.

PiperOrigin-RevId: 261191548
2019-08-01 13:58:48 -07:00
Ian Lewis 0a246fab80 Basic support for 'ip route'
Implements support for RTM_GETROUTE requests for netlink sockets.

Fixes #507

PiperOrigin-RevId: 261051045
2019-07-31 20:30:09 -07:00
Nicolas Lacasse cf2b2d97d5 Initialize kernel.unimplementedSyscallEmitter with a sync.Once.
This is initialized lazily on the first unimplemented
syscall. Without the sync.Once, this is racy.

PiperOrigin-RevId: 260971758
2019-07-31 12:00:35 -07:00
Jamie Liu a7d5e0d254 Cache pages in CachingInodeOperations.Read when memory evictions are delayed.
PiperOrigin-RevId: 260851452
2019-07-30 20:32:29 -07:00
Ayush Ranjan 5afa642deb ext: Migrate from using fileReader custom interface to using io.Reader.
It gets rid of holding state of the io.Reader offset (which is anyways held by
the vfs.FileDescriptor struct. It is also odd using a io.Reader becuase we
using io.ReaderAt to interact with the device. So making a io.ReaderAt wrapper
makes more sense.

Most importantly, it gets rid of the complexity of extracting the file reader
from a regular file implementation and then using it. Now we can just use the
regular file implementation as a reader which is more intuitive.

PiperOrigin-RevId: 260846927
2019-07-30 19:43:59 -07:00
Ayush Ranjan 9fbe984dc1 ext: block map file reader implementation.
Also adds stress tests for block map reader and intensifies extent reader tests.

PiperOrigin-RevId: 260838177
2019-07-30 18:20:31 -07:00
gVisor bot 93b0917d23 Merge pull request #607 from DarcySail:master
PiperOrigin-RevId: 260783254
2019-07-30 13:31:29 -07:00
Zach Koopmans e511c0e05f Add feature to launch Sentry from an open host FD.
Adds feature to launch from an open host FD instead of a binary_path.
The FD should point to a valid executable and most likely be statically
compiled. If the executable is not statically compiled, the loader will
search along the interpreter paths, which must be able to be resolved in
the Sandbox's file system or start will fail.

PiperOrigin-RevId: 260756825
2019-07-30 11:20:40 -07:00
Ayush Ranjan 8da9f8a12c Migrate from using io.ReadSeeker to io.ReaderAt.
This provides the following benefits:
- We can now use pkg/fd package which does not take ownership
  of the file descriptor. So it does not close the fd when garbage collected.
  This reduces scope of errors from unexpected garbage collection of io.File.
- It enforces the offset parameter in every read call.
  It does not affect the fd offset nor is it affected by it. Hence reducing
  scope of error of using stale offsets when reading.
- We do not need to serialize the usage of any global file descriptor anymore.
  So this drops the mutual exclusion req hence reducing complexity and
  congestion.

PiperOrigin-RevId: 260635174
2019-07-29 20:12:37 -07:00
Hang Su 50f3447786 Combine multiple epoll events copies
Allocate a larger memory buffer and combine multiple copies into one copy,
to reduce the number of copies from kernel memory to user memory.

Signed-off-by: Hang Su <darcy.sh@antfin.com>
2019-07-30 10:53:55 +08:00
Ayush Ranjan ddf25e3331 ext: extent reader implementation.
PiperOrigin-RevId: 260629559
2019-07-29 19:17:27 -07:00
Ayush Ranjan b765eb4589 ext: inode implementations.
PiperOrigin-RevId: 260624470
2019-07-29 18:33:55 -07:00
Nicolas Lacasse 5fdb945a0d Rate limit the unimplemented syscall event handler.
This introduces two new types of Emitters:
1. MultiEmitter, which will forward events to other registered Emitters, and
2. RateLimitedEmitter, which will forward events to a wrapped Emitter, subject
	to given rate limits.

The methods in the eventchannel package itself act like a multiEmitter, but is
not actually an Emitter. Now we have a DefaultEmitter, and the methods in
eventchannel simply forward calls to the DefaultEmitter.

The unimplemented syscall handler now uses a RateLimetedEmitter that wraps the
DefaultEmitter.

PiperOrigin-RevId: 260612770
2019-07-29 17:12:50 -07:00
gVisor bot b50122379c Merge pull request #452 from zhangningdlut:chris_test_pidns
PiperOrigin-RevId: 260220279
2019-07-26 15:00:51 -07:00
Fabricio Voznika 7052d21dc4 Automated rollback of changelist 255679453
PiperOrigin-RevId: 260047477
2019-07-25 16:48:49 -07:00