Commit Graph

46 Commits

Author SHA1 Message Date
Dean Deng 5d885d7fb2 Port socket-related syscalls to VFS2.
Note that most kinds of sockets are not yet supported in VFS2
(only Unix sockets are partially supported at the moment), so
these syscalls will still generally fail. Enabling them allows
us to begin running socket tests for VFS2 as more features are
ported over.

Updates #1476, #1478, #1484, #1485.

PiperOrigin-RevId: 306292294
2020-04-13 13:02:34 -07:00
Nayana Bidari 92b9069b67 Support owner matching for iptables.
This feature will match UID and GID of the packet creator, for locally
generated packets. This match is only valid in the OUTPUT and POSTROUTING
chains. Forwarded packets do not have any socket associated with them.
Packets from kernel threads do have a socket, but usually no owner.
2020-03-26 12:21:24 -07:00
Jamie Liu 1c05352970 Fix oom_score_adj.
- Make oomScoreAdj a ThreadGroup field (Linux: signal_struct::oom_score_adj).

- Avoid deadlock caused by Task.OOMScoreAdj()/SetOOMScoreAdj() locking Task.mu
  and TaskSet.mu in the wrong order (via Task.ExitState()).

PiperOrigin-RevId: 300814698
2020-03-13 13:19:13 -07:00
Ian Lewis da48fc6cca Stub oom_score_adj and oom_score.
Adds an oom_score_adj and oom_score proc file stub. oom_score_adj accepts
writes of values -1000 to 1000 and persists the value with the task. New tasks
inherit the parent's oom_score_adj.

oom_score is a read-only stub that always returns the value '0'.

Issue #202

PiperOrigin-RevId: 299245355
2020-03-05 18:23:01 -08:00
Jamie Liu 471b15b212 Port most syscalls to VFS2.
pipe and pipe2 aren't ported, pending a slight rework of pipe FDs for VFS2.
mount and umount2 aren't ported out of temporary laziness. access and faccessat
need additional FSImpl methods to implement properly, but are stubbed to
prevent googletest from CHECK-failing. Other syscalls require additional
plumbing.

Updates #1623

PiperOrigin-RevId: 297188448
2020-02-25 13:37:34 -08:00
gVisor bot 4a73bae269 Initial network namespace support.
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will have the same behavior as it is now.

Before the userspace is able to create device, the default loopback device can
be used to test.

/proc/net and /sys/net will still be connected to the root network stack; this
is the same behavior now.

Issue #1833

PiperOrigin-RevId: 296309389
2020-02-20 15:20:40 -08:00
gVisor bot 4075de11be Plumb VFS2 inside the Sentry
- Added fsbridge package with interface that can be used to open
  and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup

Updates #1623

PiperOrigin-RevId: 295183950
2020-02-14 11:12:47 -08:00
Fabricio Voznika 437c986c6a Add vfs.FileDescription to FD table
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.

Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.

Updates #1487
Updates #1623

PiperOrigin-RevId: 292023223
2020-01-28 15:31:03 -08:00
Adin Scannell 0e2f1b7abd Update package locations.
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.

PiperOrigin-RevId: 291811289
2020-01-27 15:31:32 -08:00
Ian Gudger 27500d529f New sync package.
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.

This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.

Updates #1472

PiperOrigin-RevId: 289033387
2020-01-09 22:02:24 -08:00
Michael Pratt 354a15a234 Implement rseq(2)
PiperOrigin-RevId: 288342928
2020-01-06 11:42:44 -08:00
Adin Scannell 371e210b83 Add runtime tracing.
This adds meaningful annotations to the trace generated by the runtime/trace
package.

PiperOrigin-RevId: 284290115
2019-12-06 17:00:07 -08:00
Adin Scannell c0f89eba6e Import and structure cleanup.
PiperOrigin-RevId: 281795269
2019-11-21 11:41:30 -08:00
Dean Deng d7f5e823e2 Fix grammar in comment.
Missing "for".

PiperOrigin-RevId: 277358513
2019-10-29 14:05:04 -07:00
Michael Pratt 198f1cddb8 Update comment
FDTable.GetFile doesn't exist.

PiperOrigin-RevId: 277089842
2019-10-28 10:20:23 -07:00
Adin Scannell c98e7f0d19 Signalfd support
Note that the exact semantics for these signalfds are slightly different from
Linux. These signalfds are bound to the process at creation time. Reads, polls,
etc. are all associated with signals directed at that task. In Linux, all
signalfd operations are associated with current, regardless of where the
signalfd originated.

In practice, this should not be an issue given how signalfds are used. In order
to fix this however, we will need to plumb the context through all the event
APIs. This gets complicated really quickly, because the waiter APIs are all
netstack-specific, and not generally exposed to the context.  Probably not
worthwhile fixing immediately.

PiperOrigin-RevId: 269901749
2019-09-18 15:16:42 -07:00
Adin Scannell 753da9604e Remove map from fd_map, change to fd_table.
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)

Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.

This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)

PiperOrigin-RevId: 256285890
2019-07-02 19:28:59 -07:00
Andrei Vagin 03ae91c662 gvisor: lockless read access for task credentials
Credentials are immutable and even before these changes we could read them
without locks, but we needed to take a task lock to get a credential object
from a task object.

It is possible to avoid this lock, if we will guarantee that a credential
object will not be changed after setting it on a task.

PiperOrigin-RevId: 254989492
2019-06-25 09:52:49 -07:00
Andrei Vagin f94653b3de kernel: call t.mu.Unlock() explicitly in WithMuLocked
defer here doesn't improve readability, but we know it slower that
the explicit call.

PiperOrigin-RevId: 254441473
2019-06-21 11:55:42 -07:00
Nicolas Lacasse f7428af9c1 Add MountNamespace to task.
This allows tasks to have distinct mount namespace, instead of all sharing the
kernel's root mount namespace.

Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).

In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.

Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better naem.

PiperOrigin-RevId: 254009310
2019-06-19 09:21:21 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Jamie Liu b3f104507d "Implement" mbind(2).
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().

Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.

PiperOrigin-RevId: 251950859
2019-06-06 16:29:46 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Fabricio Voznika c8cee7108f Use FD limit and file size limit from host
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.

PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17 12:57:40 -07:00
Jamie Liu 26e8d9981f Use kernel.Task.CopyScratchBuffer in syscalls/linux where possible.
PiperOrigin-RevId: 241072126
Change-Id: Ib4d9f58f550732ac4c5153d3cf159a5b1a9749da
2019-03-29 16:25:33 -07:00
Jamie Liu 3d0b960112 Implement PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN.
PiperOrigin-RevId: 239803092
Change-Id: I42d612ed6a889e011e8474538958c6de90c6fcab
2019-03-22 08:55:44 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Michael Pratt fe1369ac98 Move package sync to third_party
PiperOrigin-RevId: 231889261
Change-Id: I482f1df055bcedf4edb9fe3fe9b8e9c80085f1a0
2019-01-31 17:49:14 -08:00
Fabricio Voznika 8b314b0bf4 Fix recursive read lock taken on TaskSet
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock
in R mode and could deadlock if a writer comes in between.

PiperOrigin-RevId: 222313551
Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
2018-11-20 15:07:56 -08:00
Fabricio Voznika b2068cf5a5 Add more unimplemented syscall events
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.

PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
2018-10-20 11:14:23 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Ian Gudger 167f2401c4 Merge host.endpoint into host.ConnectedEndpoint
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.

PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
2018-10-15 17:48:11 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Jamie Liu 0380bcb3a4 Fix interaction between rt_sigtimedwait and ignored signals.
PiperOrigin-RevId: 213011782
Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af
2018-09-14 11:10:50 -07:00
Jamie Liu f8ccfbbed4 Document more task-goroutine-owned fields in kernel.Task.
Task.creds can only be changed by the task's own set*id and execve
syscalls, and Task namespaces can only be changed by the task's own
unshare/setns syscalls.

PiperOrigin-RevId: 211156279
Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
2018-08-31 15:44:40 -07:00
Jamie Liu 098046ba19 Disintegrate kernel.TaskResources.
This allows us to call kernel.FDMap.DecRef without holding mutexes
cleanly.

PiperOrigin-RevId: 211139657
Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
2018-08-31 13:58:04 -07:00
Jamie Liu b1c1afa3cc Delete the long-obsolete kernel.TaskMaybe interface.
PiperOrigin-RevId: 211131855
Change-Id: Ia7799561ccd65d16269e0ae6f408ab53749bca37
2018-08-31 13:07:34 -07:00
Ian Gudger 9c407382b0 Fix races in kernel.(*Task).Value()
PiperOrigin-RevId: 209627180
Change-Id: Idc84afd38003427e411df6e75abfabd9174174e1
2018-08-21 11:16:17 -07:00
Zhaozhong Ni 57d0fcbdbf Automated rollback of changelist 207037226
PiperOrigin-RevId: 207125440
Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02 10:42:48 -07:00
Brian Geffon cf44aff6e0 Add seccomp(2) support.
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.

PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
2018-08-02 08:10:30 -07:00
Michael Pratt 60add78980 Automated rollback of changelist 207007153
PiperOrigin-RevId: 207037226
Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01 19:57:32 -07:00
Zhaozhong Ni b9e1cf8404 stateify: convert all packages to use explicit mode.
PiperOrigin-RevId: 207007153
Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-08-01 15:43:24 -07:00
Adin Scannell 8b8aad91d5 kernel: mutations on creds now require a copy.
PiperOrigin-RevId: 205315612
Change-Id: I9a0a1e32c8abfb7467a38743b82449cc92830316
2018-07-19 15:48:56 -07:00
Rahat Mahmood 8878a66a56 Implement sysv shm.
PiperOrigin-RevId: 197058289
Change-Id: I3946c25028b7e032be4894d61acb48ac0c24d574
2018-05-17 15:06:19 -07:00
Kevin Krakauer 96c28a4368 sentry: Replaces saving of inet.Stack with retrieval via context.
Previously, inet.Stack was referenced in 2 structs in sentry/socket that can be
saved/restored.  If an app is saved and restored on another machine, it may try
to use the old stack, which will have been replaced by a new stack on the new
machine.

PiperOrigin-RevId: 196733985
Change-Id: I6a8cfe73b5d7a90749734677dada635ab3389cb9
2018-05-15 14:56:18 -07:00
Googler d02b74a5dc Check in gVisor.
PiperOrigin-RevId: 194583126
Change-Id: Ica1d8821a90f74e7e745962d71801c598c652463
2018-04-28 01:44:26 -04:00