See modified comment in auth.NewUserCredentials(); compare to the
behavior of setresuid(2) as implemented by
//pkg/sentry/kernel/task_identity.go:kernel.Task.setKUIDsUncheckedLocked().
PiperOrigin-RevId: 228381765
Change-Id: I45238777c8f63fcf41b99fce3969caaf682fe408
Platform objects are not savable, storing references to them in
filesystem datastructures would cause save to fail if someone actually
passed in a Platform.
Current implementations work because everywhere a Platform is
expected, we currently pass in a Kernel object which embeds Platform
and thus satisfies the interface.
Eliminate this indirection and save pointers to Kernel directly.
PiperOrigin-RevId: 225288336
Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
arch_prctl already verified that the new FS_BASE was canonical, but
Task.Clone did not. Centralize these checks in the arch packages.
Failure to validate could cause an error in PTRACE_SET_REGS when we try
to switch to the app.
PiperOrigin-RevId: 224862398
Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.
PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.
For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).
PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock
in R mode and could deadlock if a writer comes in between.
PiperOrigin-RevId: 222313551
Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
Also update test utilities for probing vsyscall support and add a
metric to see if vsyscalls are actually used in sandboxes.
PiperOrigin-RevId: 221698834
Change-Id: I57870ecc33ea8c864bd7437833f21aa1e8117477
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise
a kernel version of 4.4.
PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
Shm segments can be marked for lazy destruction via shmctl(IPC_RMID),
which destroys a segment once it is no longer attached to any
processes. We were unconditionally decrementing the segment refcount
on shmctl(IPC_RMID) which allowed a user to force a segment to be
destroyed by repeatedly calling shmctl(IPC_RMID), with outstanding
memory maps to the segment.
This is problematic because the memory released by a segment destroyed
this way can be reused by a different process while remaining
accessible by the process with outstanding maps to the segment.
PiperOrigin-RevId: 219713660
Change-Id: I443ab838322b4fb418ed87b2722c3413ead21845
Attempting to create a zero-len shm segment causes a panic since we
try to allocate a zero-len filemem region. The existing code had a
guard to disallow this, but the check didn't encode the fact that
requesting a private segment implies a segment creation regardless of
whether IPC_CREAT is explicitly specified.
PiperOrigin-RevId: 218405743
Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.
PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.
PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
This reduces the number of goroutines and runtime timers when
ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set.
This also ensures that thread group CPU timers only advance if running
tasks are observed at the time the CPU clock advances, mostly
eliminating the possibility that a CPU timer expiration observes no
running tasks and falls back to the group leader.
PiperOrigin-RevId: 217603396
Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
Now containers run with "docker run -it" support control characters like ^C and
^Z.
This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.
PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
The existing logic is backwards and writes iov_len == 0 for a full write.
PiperOrigin-RevId: 217560377
Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.
PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
- Shared futex objects on shared mappings are represented by Mappable +
offset, analogous to Linux's use of inode + offset. Add type
futex.Key, and change the futex.Manager bucket API to use futex.Keys
instead of addresses.
- Extend the futex.Checker interface to be able to return Keys for
memory mappings. It returns Keys rather than just mappings because
whether the address or the target of the mapping is used in the Key
depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this
matters because using mapping target for a futex on a MAP_PRIVATE
mapping causes it to stop working across COW-breaking.
- futex.Manager.WaitComplete depends on atomic updates to
futex.Waiter.addr to determine when it has locked the right bucket,
which is much less straightforward for struct futex.Waiter.key. Switch
to an atomically-accessed futex.Waiter.bucket pointer.
- futex.Manager.Wake now needs to take a futex.Checker to resolve
addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit
path to perform a shared futex wakeup (Linux:
kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid,
FUTEX_WAKE, ...)). This is a problem because futexChecker is in the
syscalls/linux package. Move it to kernel.
PiperOrigin-RevId: 216207039
Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.
However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.
This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.
One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.
To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.
Example:
root@:/usr/local/apache2# sleep 100
^C
root@:/usr/local/apache2# sleep 100
^Z
[1]+ Stopped sleep 100
root@:/usr/local/apache2# fg
sleep 100
^C
root@:/usr/local/apache2#
PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.
PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
code.
- Processes not waited on will consume memory (like a zombie process)
PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
Netstack needs to be portable, so this seems to be preferable to using raw
system calls.
PiperOrigin-RevId: 212917409
Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c
This allows applications to verify they are running with gVisor. It
also helps debugging when running with a mix of container runtimes.
Closes#54
PiperOrigin-RevId: 212059457
Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec
It was always returning the MountNamespace root, which may be different from
the process Root if the process is in a chroot environment.
PiperOrigin-RevId: 211862181
Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.
This is a requisite change for supporting a per-Inode list of Dirents.
PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
Task.creds can only be changed by the task's own set*id and execve
syscalls, and Task namespaces can only be changed by the task's own
unshare/setns syscalls.
PiperOrigin-RevId: 211156279
Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
This allows us to call kernel.FDMap.DecRef without holding mutexes
cleanly.
PiperOrigin-RevId: 211139657
Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.
The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.
This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
serve new content.
* Enables the sentry, when starting a new container, to add the new container's
filesystem.
* Mounts those new filesystems at separate roots within the sentry.
PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.
PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:
/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}
Actually, fdinfo has to contain file flags.
Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
We have been unnecessarily creating too many savable types implicitly.
PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
This method allows an eventfd inside the Sentry to be registered with with
the host kernel.
Update comment about memory mapping host fds via CachingInodeOperations.
PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
CheckIORange is analagous to Linux's access_ok() method, which is checked when
copying in IOVecs in both lib/iov_iter.c:import_single_range() and
lib/iov_iter.c:import_iovec() => fs/read_write.c:rw_copy_check_uvector().
gVisor copies in IOVecs via Task.SingleIOSequence() and Task.CopyInIovecs().
We were checking the address range bounds, but not whether the address is
valid. To conform with linux, we should also check that the address is valid.
For usual preadv/pwritev syscalls, the effect of this change is not noticeable,
since we find out that the address is invalid before the syscall completes.
For vectorized async-IO operations, however, this change is necessary because
Linux returns EFAULT when the operation is submitted, but before it executes.
Thus, we must validate the iovecs when copying them in.
PiperOrigin-RevId: 202370092
Change-Id: I8759a63ccf7e6b90d90d30f78ab8935a0fcf4936
IsChrooted still has the opportunity to race with another thread
entering the FSContext into a chroot, but that is unchanged (and
fine, AFAIK).
PiperOrigin-RevId: 202029117
Change-Id: I38bce763b3a7715fa6ae98aa200a19d51a0235f1
SendExternalSignal is no longer called before CreateProcess, so it can
enforce this simplified precondition.
StartForwarding, and after Kernel.Start.
PiperOrigin-RevId: 201591170
Change-Id: Ib7022ef7895612d7d82a00942ab59fa433c4d6e9
FIOASYNC and friends are used to send signals when a file is ready for IO.
This may or may not be needed by Nginx. While Nginx does use it, it is unclear
if the code that uses it has any effect.
PiperOrigin-RevId: 201550828
Change-Id: I7ba05a7db4eb2dfffde11e9bd9a35b65b98d7f50
Almost all of the hundreds of pending signal queues are empty upon save.
PiperOrigin-RevId: 201380318
Change-Id: I40747072435299de681d646e0862efac0637e172
Walking off the bottom of the sigaltstack, for example with recursive faults,
results in forced signal delivery, not resetting the stack or pushing signal
stack to whatever happens to lie below the signal stack.
PiperOrigin-RevId: 199856085
Change-Id: I0004d2523f0df35d18714de2685b3eaa147837e0
Capabilities for sysv sem operations were being checked against the
current task's user namespace. They should be checked against the user
namespace owning the ipc namespace for the sems instead, per
ipc/util.c:ipcperms().
PiperOrigin-RevId: 197063111
Change-Id: Iba29486b316f2e01ee331dda4e48a6ab7960d589
Previously, inet.Stack was referenced in 2 structs in sentry/socket that can be
saved/restored. If an app is saved and restored on another machine, it may try
to use the old stack, which will have been replaced by a new stack on the new
machine.
PiperOrigin-RevId: 196733985
Change-Id: I6a8cfe73b5d7a90749734677dada635ab3389cb9