Commit Graph

85 Commits

Author SHA1 Message Date
Rahat Mahmood 7bfad8ebb6 Return a well-defined socket address type from socket funtions.
Previously we were representing socket addresses as an interface{},
which allowed any type which could be binary.Marshal()ed to be used as
a socket address. This is fine when the address is passed to userspace
via the linux ABI, but is problematic when used from within the sentry
such as by networking procfs files.

PiperOrigin-RevId: 262460640
2019-08-08 16:50:33 -07:00
Ian Lewis 0a246fab80 Basic support for 'ip route'
Implements support for RTM_GETROUTE requests for netlink sockets.

Fixes #507

PiperOrigin-RevId: 261051045
2019-07-31 20:30:09 -07:00
Kevin Krakauer 5ddf9adb2b Fix up and add some iptables ABI.
PiperOrigin-RevId: 259437060
2019-07-22 17:06:18 -07:00
Jamie Liu fdac770f31 Fix struct statx field alignment.
PiperOrigin-RevId: 259376740
2019-07-22 12:04:21 -07:00
Jamie Liu 163ab5e9ba Sentry virtual filesystem, v2
Major differences from the current ("v1") sentry VFS:

- Path resolution is Filesystem-driven (FilesystemImpl methods call
vfs.ResolvingPath methods) rather than VFS-driven (fs package owns a
Dirent tree and calls fs.InodeOperations methods to populate it). This
drastically improves performance, primarily by reducing overhead from
inefficient synchronization and indirection. It also makes it possible
to implement remote filesystem protocols that translate FS system calls
into single RPCs, rather than having to make (at least) one RPC per path
component, significantly reducing the latency of remote filesystems
(especially during cold starts and for uncacheable shared filesystems).

- Mounts are correctly represented as a separate check based on
contextual state (current mount) rather than direct replacement in a
fs.Dirent tree. This makes it possible to support (non-recursive) bind
mounts and mount namespaces.

Included in this CL is fsimpl/memfs, an incomplete in-memory filesystem
that exists primarily to demonstrate intended filesystem implementation
patterns and for benchmarking:

BenchmarkVFS1TmpfsStat/1-6               3000000               497 ns/op
BenchmarkVFS1TmpfsStat/2-6               2000000               676 ns/op
BenchmarkVFS1TmpfsStat/3-6               2000000               904 ns/op
BenchmarkVFS1TmpfsStat/8-6               1000000              1944 ns/op
BenchmarkVFS1TmpfsStat/64-6               100000             14067 ns/op
BenchmarkVFS1TmpfsStat/100-6               50000             21700 ns/op
BenchmarkVFS2MemfsStat/1-6              10000000               197 ns/op
BenchmarkVFS2MemfsStat/2-6               5000000               233 ns/op
BenchmarkVFS2MemfsStat/3-6               5000000               268 ns/op
BenchmarkVFS2MemfsStat/8-6               3000000               477 ns/op
BenchmarkVFS2MemfsStat/64-6               500000              2592 ns/op
BenchmarkVFS2MemfsStat/100-6              300000              4045 ns/op
BenchmarkVFS1TmpfsMountStat/1-6          2000000               679 ns/op
BenchmarkVFS1TmpfsMountStat/2-6          2000000               912 ns/op
BenchmarkVFS1TmpfsMountStat/3-6          1000000              1113 ns/op
BenchmarkVFS1TmpfsMountStat/8-6          1000000              2118 ns/op
BenchmarkVFS1TmpfsMountStat/64-6                  100000             14251 ns/op
BenchmarkVFS1TmpfsMountStat/100-6                 100000             22397 ns/op
BenchmarkVFS2MemfsMountStat/1-6                  5000000               317 ns/op
BenchmarkVFS2MemfsMountStat/2-6                  5000000               361 ns/op
BenchmarkVFS2MemfsMountStat/3-6                  5000000               387 ns/op
BenchmarkVFS2MemfsMountStat/8-6                  3000000               582 ns/op
BenchmarkVFS2MemfsMountStat/64-6                  500000              2699 ns/op
BenchmarkVFS2MemfsMountStat/100-6                 300000              4133 ns/op

From this we can infer that, on this machine:

- Constant cost for tmpfs stat() is ~160ns in VFS2 and ~280ns in VFS1.

- Per-path-component cost is ~35ns in VFS2 and ~215ns in VFS1, a
difference of about 6x.

- The cost of crossing a mount boundary is about 80ns in VFS2
(MemfsMountStat/1 does approximately the same amount of work as
MemfsStat/2, except that it also crosses a mount boundary). This is an
inescapable cost of the separate mount lookup needed to support bind
mounts and mount namespaces.

PiperOrigin-RevId: 258853946
2019-07-18 15:10:29 -07:00
Jamie Liu 2bc398bfd8 Separate O_DSYNC and O_SYNC.
PiperOrigin-RevId: 258657913
2019-07-17 15:52:38 -07:00
Ayush Ranjan 8e3e021aca ext: Filesystem init implementation.
PiperOrigin-RevId: 258645957
2019-07-17 14:48:04 -07:00
Adin Scannell cceef9d2cf Cleanup straggling syscall dependencies.
PiperOrigin-RevId: 257293198
2019-07-09 16:18:02 -07:00
Adin Scannell 7dae043fec Drop ashmem and binder.
These are unfortunately unused and unmaintained. They can be brought back in
the future if need requires it.

PiperOrigin-RevId: 255697132
2019-06-28 17:20:25 -07:00
Michael Pratt 5b41ba5d0e Fix various spelling issues in the documentation
Addresses obvious typos, in the documentation only.

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
2019-06-27 14:25:50 -07:00
Nicolas Lacasse 35719d52c7 Implement statx.
We don't have the plumbing for btime yet, so that field is left off. The
returned mask indicates that btime is absent.

Fixes #343

PiperOrigin-RevId: 254575752
2019-06-22 13:29:26 -07:00
Fabricio Voznika 054b5632ef Update comment
PiperOrigin-RevId: 254428866
2019-06-21 10:56:42 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Ian Lewis 74e397e39a Add introspection for Linux/AMD64 syscalls
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.

Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this info to debug and get info
on the compatibility of the version of runsc they are running or to generate
documentation.

PiperOrigin-RevId: 252558304
2019-06-10 23:38:36 -07:00
Rahat Mahmood 315cf9a523 Use common definition of SockType.
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.

PiperOrigin-RevId: 251956740
2019-06-06 17:00:27 -07:00
Jamie Liu b3f104507d "Implement" mbind(2).
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().

Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.

PiperOrigin-RevId: 251950859
2019-06-06 16:29:46 -07:00
Rahat Mahmood 2d2831e354 Track and export socket state.
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).

For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.

PiperOrigin-RevId: 251934850
2019-06-06 15:04:47 -07:00
Michael Pratt d3ed9baac0 Implement dumpability tracking and checks
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.

Lack of support for set-uid binaries or fs creds simplifies things a
bit.

As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but at least since they have no parent normal
UID/GID checks should protect them.

PiperOrigin-RevId: 251712714
2019-06-05 14:00:13 -07:00
Michael Pratt 6cdec6fadf Wrap comments and reword in common present tense
PiperOrigin-RevId: 249888234
Change-Id: Icfef32c3ed34809c34100c07e93e9581c786776e
2019-05-24 13:23:53 -07:00
Michael Pratt 69eac1198f Move wait constants to abi/linux package
Updates #214

PiperOrigin-RevId: 249483756
Change-Id: I0d3cf4112bed75a863d5eb08c2063fbc506cd875
2019-05-22 11:15:33 -07:00
Adin Scannell ae1bb08871 Clean up pipe internals and add fcntl support
Pipe internals are made more efficient by avoiding garbage collection.
A pool is now used that can be shared by all pipes, and buffers are
chained via an intrusive list. The documentation for pipe structures
and methods is also simplified and clarified.

The pipe tests are now parameterized, so that they are run on all
different variants (named pipes, small buffers, default buffers).

The pipe buffer sizes are exposed by fcntl, which is now supported
by this change. A size change test has been added to the suite.

These new tests uncovered a bug regarding the semantics of open
named pipes with O_NONBLOCK, which is also fixed by this CL. This
fix also addresses the lack of the O_LARGEFILE flag for named pipes.

PiperOrigin-RevId: 249375888
Change-Id: I48e61e9c868aedb0cadda2dff33f09a560dee773
2019-05-21 20:12:27 -07:00
Adin Scannell 9cdae51fec Add basic plumbing for splice and stub implementation.
This does not actually implement an efficient splice or sendfile. Rather, it
adds a generic plumbing to the file internals so that this can be added. All
file implementations use the stub fileutil.NoSplice implementation, which
causes sendfile and splice to fall back to an internal copy.

A basic splice system call interface is added, along with a test.

PiperOrigin-RevId: 249335960
Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b
2019-05-21 15:18:12 -07:00
Fabricio Voznika 1bee43be13 Implement fallocate(2)
Closes #225

PiperOrigin-RevId: 247508791
Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-09 15:35:49 -07:00
Kevin Krakauer 264d012d81 Add netfilter ABI for iptables support.
Change-Id: Ifbd2abf63ea8062a89b83e948d3e9735480d8216
PiperOrigin-RevId: 246559904
2019-05-03 13:06:09 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Ian Gudger 133700007a Only emit unimplemented syscall events for unsupported values.
Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER
when attempting to set unsupported values.

PiperOrigin-RevId: 244229675
Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1
2019-04-18 11:51:41 -07:00
Kevin Krakauer 82529becae Fix index out of bounds in tty implementation.
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for
?. However, peekRune also returned the length of the byte (1). When calling
utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte
character ?.

tl;dr: I apparently didn't understand runes when I wrote this.

PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
2019-04-03 13:00:34 -07:00
Rahat Mahmood 06ec97a3f8 Implement memfd_create.
Memfds are simply anonymous tmpfs files with no associated
mounts. Also implementing file seals, which Linux only implements for
memfds at the moment.

PiperOrigin-RevId: 240450031
Change-Id: I31de78b950101ae8d7a13d0e93fe52d98ea06f2f
2019-03-26 16:16:57 -07:00
Fabricio Voznika 0b76887147 Priority-inheritance futex implementation
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.

PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
2019-03-05 23:40:18 -08:00
Fabricio Voznika 3dbd4a16f8 Add semctl(GETPID) syscall
Also added unimplemented notification for semctl(2)
commands.

PiperOrigin-RevId: 236340672
Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478
2019-03-01 10:57:02 -08:00
Nicolas Lacasse e884168e1e Encode stat to bytes manually, instead of calling CopyObjectOut.
CopyObjectOut grows its destination byte slice incrementally, causing
many small slice allocations on the heap. This leads to increased GC and
noticeably slower stat calls.

PiperOrigin-RevId: 233140904
Change-Id: Ieb90295dd8dd45b3e56506fef9d7f86c92e97d97
2019-02-08 15:48:23 -08:00
Ian Gudger 80f901b16b Plumb IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP to netstack.
Also includes a few fixes for IPv4 multicast support. IPv6 support is coming in
a followup CL.

PiperOrigin-RevId: 233008638
Change-Id: If7dae6222fef43fda48033f0292af77832d95e82
2019-02-07 23:15:23 -08:00
Rahat Mahmood 2ba74f84be Implement /proc/net/unix.
PiperOrigin-RevId: 232948478
Change-Id: Ib830121e5e79afaf5d38d17aeef5a1ef97913d23
2019-02-07 14:44:21 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Ian Gudger b709997d78 Rename linux.Errno.Error to linux.Errno.String.
Using linux.Errno as an error doesn't work very well as none of the sentry code
expects error to contain a linux.Errno.

This moves using syserr.Error.ToLinux as an error in a syscall handler from a
runtime error to a compile error.

PiperOrigin-RevId: 227744312
Change-Id: Iea63108a5b198296c908614e09c01733dd684da0
2019-01-03 13:53:43 -08:00
Ian Gudger b515556519 Implement SO_KEEPALIVE, TCP_KEEPIDLE, and TCP_KEEPINTVL.
Within gVisor, plumb new socket options to netstack.

Within netstack, fix GetSockOpt and SetSockOpt return value logic.

PiperOrigin-RevId: 226532229
Change-Id: If40734e119eed633335f40b4c26facbebc791c74
2018-12-21 13:13:45 -08:00
Jamie Liu 9a442fa4b5 Automated rollback of changelist 226224230
PiperOrigin-RevId: 226493053
Change-Id: Ia98d1cb6dd0682049e4d907ef69619831de5c34a
2018-12-21 08:23:34 -08:00
Googler 86c9bd2547 Automated rollback of changelist 225861605
PiperOrigin-RevId: 226224230
Change-Id: Id24c7d3733722fd41d5fe74ef64e0ce8c68f0b12
2018-12-19 13:30:08 -08:00
Zach Koopmans ff7178a4d1 Implement pwritev2.
Implement pwritev2 and associated unit tests.
Clean up preadv2 unit tests.
Tag RWF_ flags in both preadv2 and pwritev2 with associated bug tickets.

PiperOrigin-RevId: 226222119
Change-Id: Ieb22672418812894ba114bbc88e67f1dd50de620
2018-12-19 13:16:06 -08:00
Fabricio Voznika 03226cd950 Add BPFAction type with Stringer
PiperOrigin-RevId: 226018694
Change-Id: I98965e26fe565f37e98e5df5f997363ab273c91b
2018-12-18 10:28:28 -08:00
Jamie Liu 2421006426 Implement mlock(), kind of.
Currently mlock() and friends do nothing whatsoever. However, mlocking
is directly application-visible in a number of ways; for example,
madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked
regions. We handle this inconsistently: MADV_DONTNEED is too important
to not work, but MS_INVALIDATE is rejected.

Change MM to track mlocked regions in a manner consistent with Linux.
It still will not actually pin pages into host physical memory, but:

- mlock() will now cause sentry memory management to precommit mlocked
pages.

- MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as
described above.

PiperOrigin-RevId: 225861605
Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee
2018-12-17 11:38:59 -08:00
Michael Pratt 42e2e5cae9 Format sigaction in strace
Sample:

I1206 14:24:56.768520    3700 x:0] [   1] ioctl_test E rt_sigaction(SIGSEGV, 0x7ee6edb0c590 {Handler: 0x559c6d915cf0, Flags: SA_SIGINFO|SA_RESTORER|SA_ONSTACK|SA_NODEFER, Restorer: 0x2a9901a259a0, Mask: []}, 0x7ee6edb0c630)
I1206 14:24:56.768530    3700 x:0] [   1] ioctl_test X rt_sigaction(SIGSEGV, 0x7ee6edb0c590 {Handler: 0x559c6d915cf0, Flags: SA_SIGINFO|SA_RESTORER|SA_ONSTACK|SA_NODEFER, Restorer: 0x2a9901a259a0, Mask: []}, 0x7ee6edb0c630 {Handler: SIG_DFL, Flags: 0x0, Restorer: 0x0, Mask: []}) = 0x0 (2.701?s)

PiperOrigin-RevId: 224596606
Change-Id: I3512493aed99d3d75600249263da46686b1dc0e7
2018-12-07 16:28:54 -08:00
Michael Pratt 51900fe3a4 Format signals, signal masks in strace
Sample:

I1205 16:51:49.869701    2492 x:0] [   1] ioctl_test E rt_sigaction(SIGIO, 0x7e0e5b5e8500, 0x7e0e5b5e85a0)
I1205 16:51:49.869766    2492 x:0] [   1] ioctl_test X rt_sigaction(SIGIO, 0x7e0e5b5e8500, 0x7e0e5b5e85a0) = 0x0 (44.336?s)
I1205 16:51:49.869831    2492 x:0] [   1] ioctl_test E rt_sigprocmask(SIG_UNBLOCK, 0x7e0e5b5e8878 [SIGIO], 0x7e0e5b5e87c0, 0x8)
I1205 16:51:49.869866    2492 x:0] [   1] ioctl_test X rt_sigprocmask(SIG_UNBLOCK, 0x7e0e5b5e8878 [SIGIO], 0x7e0e5b5e87c0 [SIGIO 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64], 0x8) = 0x0 (2.575?s)

PiperOrigin-RevId: 224422404
Change-Id: I3ed3f2ec6b1a639baa9cacd37ce7ee325c3703e4
2018-12-06 15:47:06 -08:00
Michael Pratt 666db00c26 Convert ValueSet to a map
Unlike FlagSet, order doesn't matter here, so it can simply be a map.

PiperOrigin-RevId: 224377910
Change-Id: I15810c698a7f02d8614bf09b59583ab73cba0514
2018-12-06 11:43:11 -08:00
Bin Lu c3dd68cea7 Add ARM64 support to pkg/abi/linux
Signed-off-by: Bin Lu <bin.lu@arm.com>
Change-Id: I73cc4c406fadccb054e8e83c9464f6bef6280b0f
PiperOrigin-RevId: 224025309
2018-12-04 12:24:07 -08:00
Zach Koopmans b3b60ea29a Implementation of preadv2 for Linux 4.4 support
Implement RWF_HIPRI (4.6) silently passes the read call.
Implement -1 offset calls readv.

PiperOrigin-RevId: 222840324
Change-Id: If9ddc1e8d086e1a632bdf5e00bae08205f95b6b0
2018-11-26 09:50:47 -08:00
Fabricio Voznika eaac94d91c Use RET_KILL_PROCESS if available in kernel
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.

For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).

PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20 22:56:51 -08:00
Fabricio Voznika fadffa2ff8 Add unsupported syscall events for get/setsockopt
PiperOrigin-RevId: 222148953
Change-Id: I21500a9f08939c45314a6414e0824490a973e5aa
2018-11-20 14:04:12 -08:00
Andrei Vagin 2ef122da35 Implement sync_file_range()
sync_file_range - sync a file segment with disk

In Linux, sync_file_range() accepts three flags:

       SYNC_FILE_RANGE_WAIT_BEFORE
              Wait  upon  write-out  of  all pages in the specified range that
              have already been submitted to the device driver  for  write-out
              before performing any write.

       SYNC_FILE_RANGE_WRITE
              Initiate  write-out  of  all  dirty pages in the specified range
              which are not presently submitted  write-out.   Note  that  even
              this  may  block if you attempt to write more than request queue
              size.

       SYNC_FILE_RANGE_WAIT_AFTER
              Wait upon write-out of all pages in the range  after  performing
              any write.

In this implementation:

SYNC_FILE_RANGE_WAIT_BEFORE without SYNC_FILE_RANGE_WAIT_AFTER isn't
supported right now.

SYNC_FILE_RANGE_WRITE is skipped. It should initiate write-out of  all
dirty pages, but it doesn't wait, so it should be safe to do nothing
while nobody uses SYNC_FILE_RANGE_WAIT_BEFORE.

SYNC_FILE_RANGE_WAIT_AFTER is equal to fdatasync(). In Linux,
sync_file_range() doesn't writes out the  file's  meta-data, but
fdatasync() does if a file size is changed.

PiperOrigin-RevId: 220730840
Change-Id: Iae5dfb23c2c916967d67cf1a1ad32f25eb3f6286
2018-11-08 17:39:51 -08:00