Commit Graph

68 Commits

Author SHA1 Message Date
Ian Gudger 167f2401c4 Merge host.endpoint into host.ConnectedEndpoint
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.

PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
2018-10-15 17:48:11 -07:00
Nicolas Lacasse b78552d30e When creating a new process group, add it to the session.
PiperOrigin-RevId: 216554791
Change-Id: Ia6b7a2e6eaad80a81b2a8f2e3241e93ebc2bda35
2018-10-10 10:42:11 -07:00
Jamie Liu e9e8be6613 Implement shared futexes.
- Shared futex objects on shared mappings are represented by Mappable +
  offset, analogous to Linux's use of inode + offset. Add type
  futex.Key, and change the futex.Manager bucket API to use futex.Keys
  instead of addresses.

- Extend the futex.Checker interface to be able to return Keys for
  memory mappings. It returns Keys rather than just mappings because
  whether the address or the target of the mapping is used in the Key
  depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this
  matters because using mapping target for a futex on a MAP_PRIVATE
  mapping causes it to stop working across COW-breaking.

- futex.Manager.WaitComplete depends on atomic updates to
  futex.Waiter.addr to determine when it has locked the right bucket,
  which is much less straightforward for struct futex.Waiter.key. Switch
  to an atomically-accessed futex.Waiter.bucket pointer.

- futex.Manager.Wake now needs to take a futex.Checker to resolve
  addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit
  path to perform a shared futex wakeup (Linux:
  kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid,
  FUTEX_WAKE, ...)). This is a problem because futexChecker is in the
  syscalls/linux package. Move it to kernel.

PiperOrigin-RevId: 216207039
Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
2018-10-08 10:20:38 -07:00
Ian Gudger beac59b37a Fix panic if FIOASYNC callback is registered and triggered without target
PiperOrigin-RevId: 215674589
Change-Id: I4f8871b64c570dc6da448d2fe351cec8a406efeb
2018-10-03 20:22:31 -07:00
Ian Gudger 4fef31f96c Add S/R support for FIOASYNC
PiperOrigin-RevId: 215655197
Change-Id: I668b1bc7c29daaf2999f8f759138bcbb09c4de6f
2018-10-03 17:03:09 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Kevin Krakauer bb88c187c5 runsc: Enable waiting on exited processes.
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
  code.
- Processes not waited on will consume memory (like a zombie process)

PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17 16:25:24 -07:00
Ian Gudger ab6fa44588 Allow kernel.(*Task).Block to accept an extract only channel
PiperOrigin-RevId: 213328293
Change-Id: I4164133e6f709ecdb89ffbb5f7df3324c273860a
2018-09-17 13:35:54 -07:00
Jamie Liu 0380bcb3a4 Fix interaction between rt_sigtimedwait and ignored signals.
PiperOrigin-RevId: 213011782
Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af
2018-09-14 11:10:50 -07:00
Ian Gudger 29a7271f5d Plumb monotonic time to netstack
Netstack needs to be portable, so this seems to be preferable to using raw
system calls.

PiperOrigin-RevId: 212917409
Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c
2018-09-13 19:12:15 -07:00
Rahat Mahmood adf8f33970 Extend memory usage events to report mapped memory usage.
PiperOrigin-RevId: 212887555
Change-Id: I3545383ce903cbe9f00d9b5288d9ef9a049b9f4f
2018-09-13 15:16:47 -07:00
Fabricio Voznika 172860a059 Add 'Starting gVisor...' message to syslog
This allows applications to verify they are running with gVisor. It
also helps debugging when running with a mix of container runtimes.

Closes #54

PiperOrigin-RevId: 212059457
Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec
2018-09-07 16:59:27 -07:00
Fabricio Voznika f895cb4d8b Use root abstract socket namespace for exec
PiperOrigin-RevId: 211999211
Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d
2018-09-07 10:45:55 -07:00
Nicolas Lacasse 6516b5648b createProcessArgs.RootFromContext should return process Root if it exists.
It was always returning the MountNamespace root, which may be different from
the process Root if the process is in a chroot environment.

PiperOrigin-RevId: 211862181
Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480
2018-09-06 13:47:49 -07:00
Adin Scannell c09f9acd7c Distinguish Element and Linker for ilist.
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.

This is a requisite change for supporting a per-Inode list of Dirents.

PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
2018-09-04 09:19:11 -07:00
Jamie Liu f8ccfbbed4 Document more task-goroutine-owned fields in kernel.Task.
Task.creds can only be changed by the task's own set*id and execve
syscalls, and Task namespaces can only be changed by the task's own
unshare/setns syscalls.

PiperOrigin-RevId: 211156279
Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
2018-08-31 15:44:40 -07:00
Jamie Liu 098046ba19 Disintegrate kernel.TaskResources.
This allows us to call kernel.FDMap.DecRef without holding mutexes
cleanly.

PiperOrigin-RevId: 211139657
Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
2018-08-31 13:58:04 -07:00
Jamie Liu b1c1afa3cc Delete the long-obsolete kernel.TaskMaybe interface.
PiperOrigin-RevId: 211131855
Change-Id: Ia7799561ccd65d16269e0ae6f408ab53749bca37
2018-08-31 13:07:34 -07:00
Ian Gudger 52e6714146 fasync: don't keep mutex after return
PiperOrigin-RevId: 210637533
Change-Id: I3536c3f9efb54732a0d8ada8bc299142b2c1682f
2018-08-28 17:26:26 -07:00
Brian Geffon f0492d45aa Add /proc/sys/kernel/shm[all,max,mni].
PiperOrigin-RevId: 210459956
Change-Id: I51859b90fa967631e0a54a390abc3b5541fbee66
2018-08-27 17:21:37 -07:00
Jamie Liu 64403265a0 Implement POSIX per-process interval timers.
PiperOrigin-RevId: 210021612
Change-Id: If7c161e6fd08cf17942bfb6bc5a8d2c4e271c61e
2018-08-23 16:32:36 -07:00
Ian Gudger 9c407382b0 Fix races in kernel.(*Task).Value()
PiperOrigin-RevId: 209627180
Change-Id: Idc84afd38003427e411df6e75abfabd9174174e1
2018-08-21 11:16:17 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Michael Pratt 2e06b23aa6 Fix missing O_LARGEFILE from O_CREAT files
Cleanup some more syscall.O_* references while we're here.

PiperOrigin-RevId: 208133460
Change-Id: I48db71a38f817e4f4673977eafcc0e3874eb9a25
2018-08-09 16:50:37 -07:00
Jamie Liu c036da5dff Hold TaskSet.mu in Task.Parent.
PiperOrigin-RevId: 207766238
Change-Id: Id3b66d8fe1f44c3570f67fa5ae7ba16021e35be1
2018-08-07 13:09:42 -07:00
Zhaozhong Ni c348d07863 sentry: make epoll.pollEntry wait for the file operation in restore.
PiperOrigin-RevId: 207737935
Change-Id: I3a301ece1f1d30909715f36562474e3248b6a0d5
2018-08-07 10:27:37 -07:00
Zhaozhong Ni 57d0fcbdbf Automated rollback of changelist 207037226
PiperOrigin-RevId: 207125440
Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02 10:42:48 -07:00
Brian Geffon cf44aff6e0 Add seccomp(2) support.
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.

PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
2018-08-02 08:10:30 -07:00
Michael Pratt 60add78980 Automated rollback of changelist 207007153
PiperOrigin-RevId: 207037226
Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01 19:57:32 -07:00
Zhaozhong Ni b9e1cf8404 stateify: convert all packages to use explicit mode.
PiperOrigin-RevId: 207007153
Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-08-01 15:43:24 -07:00
Andrei Vagin a7a0167716 proc: show file flags in fdinfo
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:

/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}

Actually, fdinfo has to contain file flags.

Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
2018-07-31 11:19:15 -07:00
Zhaozhong Ni be7fcbc558 stateify: support explicit annotation mode; convert refs and stack packages.
We have been unnecessarily creating too many savable types implicitly.

PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
2018-07-27 10:17:21 -07:00
Adin Scannell 8b8aad91d5 kernel: mutations on creds now require a copy.
PiperOrigin-RevId: 205315612
Change-Id: I9a0a1e32c8abfb7467a38743b82449cc92830316
2018-07-19 15:48:56 -07:00
Adin Scannell 29e00c943a Add CPUID faulting for ptrace and KVM.
PiperOrigin-RevId: 204858314
Change-Id: I8252bf8de3232a7a27af51076139b585e73276d4
2018-07-16 22:02:58 -07:00
Neel Natu 8f21c0bb28 Add EventOperations.HostFD()
This method allows an eventfd inside the Sentry to be registered with with
the host kernel.

Update comment about memory mapping host fds via CachingInodeOperations.

PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
2018-07-16 12:20:05 -07:00
Zhaozhong Ni 1cd46c8dd1 sentry: wait for restore clock instead of panicing in Timekeeper.
PiperOrigin-RevId: 204372296
Change-Id: If1ed9843b93039806e0c65521f30177dc8036979
2018-07-12 15:09:02 -07:00
Michael Pratt 41e0b977e5 Format documentation
PiperOrigin-RevId: 204323728
Change-Id: I1ff9aa062ffa12583b2e38ec94c87db7a3711971
2018-07-12 10:37:21 -07:00
Jamie Liu b9c469f372 Move ptrace constants to abi/linux.
PiperOrigin-RevId: 204188763
Change-Id: I5596ab7abb3ec9e210a7f57b3fc420e836fa43f3
2018-07-11 14:24:19 -07:00
Zhaozhong Ni b1683df90b netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 203958972
Change-Id: Ia6fe16547539296d48e2c6731edacdd96bd6e93c
2018-07-10 09:23:35 -07:00
Jamie Liu 41aeb680b1 Inherit parent in clone(CLONE_THREAD) under TaskSet.mu.
PiperOrigin-RevId: 203849534
Change-Id: I4d81513bfd32e0b7fc40c8a4c194eba7abc35a83
2018-07-09 16:16:19 -07:00
Nicolas Lacasse f93bd2cbe6 Hold t.mu while calling t.FSContext().
PiperOrigin-RevId: 202562686
Change-Id: I0f5be7cc9098e86fa31d016251c127cb91084b05
2018-06-28 16:11:19 -07:00
Fabricio Voznika 6b6852bceb Fix semaphore data races
PiperOrigin-RevId: 202371908
Change-Id: I72603b1d321878cae6404987c49e64732b676331
2018-06-27 14:41:32 -07:00
Nicolas Lacasse 99afc982f1 Call mm.CheckIORange() when copying in IOVecs.
CheckIORange is analagous to Linux's access_ok() method, which is checked when
copying in IOVecs in both lib/iov_iter.c:import_single_range() and
lib/iov_iter.c:import_iovec() => fs/read_write.c:rw_copy_check_uvector().

gVisor copies in IOVecs via Task.SingleIOSequence() and Task.CopyInIovecs().
We were checking the address range bounds, but not whether the address is
valid. To conform with linux, we should also check that the address is valid.

For usual preadv/pwritev syscalls, the effect of this change is not noticeable,
since we find out that the address is invalid before the syscall completes.

For vectorized async-IO operations, however, this change is necessary because
Linux returns EFAULT when the operation is submitted, but before it executes.
Thus, we must validate the iovecs when copying them in.

PiperOrigin-RevId: 202370092
Change-Id: I8759a63ccf7e6b90d90d30f78ab8935a0fcf4936
2018-06-27 14:31:35 -07:00
Brian Geffon 51c1e510ab Automated rollback of changelist 201596247
PiperOrigin-RevId: 202151720
Change-Id: I0491172c436bbb32b977f557953ba0bc41cfe299
2018-06-26 10:33:24 -07:00
Michael Pratt 4ac79312b0 Don't read cwd or root without holding mu
PiperOrigin-RevId: 202043090
Change-Id: I3c47fb3413ca8615d50d8a0503d72fcce9b09421
2018-06-25 16:46:29 -07:00
Michael Pratt 478f0ac003 Don't read FSContext.root without holding FSContext.mu
IsChrooted still has the opportunity to race with another thread
entering the FSContext into a chroot, but that is unchanged (and
fine, AFAIK).

PiperOrigin-RevId: 202029117
Change-Id: I38bce763b3a7715fa6ae98aa200a19d51a0235f1
2018-06-25 15:22:56 -07:00
Zhaozhong Ni 0e434b66a6 netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 201596247
Change-Id: Id22f47b2cdcbe14aa0d930f7807ba75f91a56724
2018-06-21 15:19:45 -07:00
Michael Pratt 2dedbc7211 Drop return from SendExternalSignal
SendExternalSignal is no longer called before CreateProcess, so it can
enforce this simplified precondition.

StartForwarding, and after Kernel.Start.

PiperOrigin-RevId: 201591170
Change-Id: Ib7022ef7895612d7d82a00942ab59fa433c4d6e9
2018-06-21 14:53:55 -07:00
Ian Gudger d571a4359c Implement ioctl(FIOASYNC)
FIOASYNC and friends are used to send signals when a file is ready for IO.

This may or may not be needed by Nginx. While Nginx does use it, it is unclear
if the code that uses it has any effect.

PiperOrigin-RevId: 201550828
Change-Id: I7ba05a7db4eb2dfffde11e9bd9a35b65b98d7f50
2018-06-21 10:53:21 -07:00