Use private futexes for performance and to align with other runtime uses.
PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
Previously this code used the tcpip error space. Since it is no longer part of
netstack, it can use the sentry's error space (except for a few cases where
there is still some shared code. This reduces the number of error space
conversions required for hot Unix socket operations.
PiperOrigin-RevId: 218541611
Change-Id: I3d13047006a8245b5dfda73364d37b8a453784bb
Pseudoterminal job control signals are meant to be received and handled by the
sandbox process, but if the ptrace stubs are running in the same process group,
they will receive the signals as well and inject then into the sentry kernel.
This can result in duplicate signals being delivered (often to the wrong
process), or a sentry panic if the ptrace stub is inactive.
This CL makes the ptrace stub run in a new session.
PiperOrigin-RevId: 218536851
Change-Id: Ie593c5687439bbfbf690ada3b2197ea71ed60a0e
Attempting to create a zero-len shm segment causes a panic since we
try to allocate a zero-len filemem region. The existing code had a
guard to disallow this, but the check didn't encode the fact that
requesting a private segment implies a segment creation regardless of
whether IPC_CREAT is explicitly specified.
PiperOrigin-RevId: 218405743
Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.
PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
This allows us to release messages in the queue when all users close.
PiperOrigin-RevId: 218033550
Change-Id: I2f6e87650fced87a3977e3b74c64775c7b885c1b
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.
PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
This reduces the number of goroutines and runtime timers when
ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set.
This also ensures that thread group CPU timers only advance if running
tasks are observed at the time the CPU clock advances, mostly
eliminating the possibility that a CPU timer expiration observes no
running tasks and falls back to the group leader.
PiperOrigin-RevId: 217603396
Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
This queue only has a single user, so there is no need for it to use an
interface. Merging it into the same package as its sole user allows us to avoid
a circular dependency.
This simplifies the code and should slightly improve performance.
PiperOrigin-RevId: 217595889
Change-Id: Iabbd5164240b935f79933618c61581bc8dcd2822
Now containers run with "docker run -it" support control characters like ^C and
^Z.
This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.
PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
The existing logic is backwards and writes iov_len == 0 for a full write.
PiperOrigin-RevId: 217560377
Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
* Integrate recvMsg and sendMsg functions into Recv and Send respectively as
they are no longer shared.
* Clean up partial read/write error handling code.
* Re-order code to make sense given that there is no longer a host.endpoint
type.
PiperOrigin-RevId: 217255072
Change-Id: Ib43fe9286452f813b8309d969be11f5fa40694cd
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.
PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
- Change Dirent.Busy => Dirent.isMountPoint. The function body is unchanged,
and it is no longer exported.
- fs.MayDelete now checks that the victim is not the process root. This aligns
with Linux's namei.c:may_delete().
- Fix "is-ancestor" checks to actually compare all ancestors, not just the
parents.
- Fix handling of paths that end in dots, which are handled differently in
Rename vs. Unlink.
PiperOrigin-RevId: 217239274
Change-Id: I7a0eb768e70a1b2915017ce54f7f95cbf8edf1fb
This is a defense-in-depth measure. If the sentry is compromised, this prevents
system call injection to the stubs. There is some complexity with respect to
ptrace and seccomp interactions, so this protection is not really available
for kernel versions < 4.8; this is detected dynamically.
Note that this also solves the vsyscall emulation issue by adding in
appropriate trapping for those system calls. It does mean that a compromised
sentry could theoretically inject these into the stub (ignoring the trap and
resume, thereby allowing execution), but they are harmless.
PiperOrigin-RevId: 216647581
Change-Id: Id06c232cbac1f9489b1803ec97f83097fcba8eb8
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.
When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.
To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.
PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
Also properly add padding after Procs in the linux.Sysinfo
structure. This will be implicitly padded to 64bits so we
need to do the same.
PiperOrigin-RevId: 216372907
Change-Id: I6eb6a27800da61d8f7b7b6e87bf0391a48fdb475
We accidentally set the wrong maximum. I've also added PATH_MAX and
NAME_MAX to the linux abi package.
PiperOrigin-RevId: 216221311
Change-Id: I44805fcf21508831809692184a0eba4cee469633
- Shared futex objects on shared mappings are represented by Mappable +
offset, analogous to Linux's use of inode + offset. Add type
futex.Key, and change the futex.Manager bucket API to use futex.Keys
instead of addresses.
- Extend the futex.Checker interface to be able to return Keys for
memory mappings. It returns Keys rather than just mappings because
whether the address or the target of the mapping is used in the Key
depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this
matters because using mapping target for a futex on a MAP_PRIVATE
mapping causes it to stop working across COW-breaking.
- futex.Manager.WaitComplete depends on atomic updates to
futex.Waiter.addr to determine when it has locked the right bucket,
which is much less straightforward for struct futex.Waiter.key. Switch
to an atomically-accessed futex.Waiter.bucket pointer.
- futex.Manager.Wake now needs to take a futex.Checker to resolve
addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit
path to perform a shared futex wakeup (Linux:
kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid,
FUTEX_WAKE, ...)). This is a problem because futexChecker is in the
syscalls/linux package. Move it to kernel.
PiperOrigin-RevId: 216207039
Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.
However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.
This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.
One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.
To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.
Example:
root@:/usr/local/apache2# sleep 100
^C
root@:/usr/local/apache2# sleep 100
^Z
[1]+ Stopped sleep 100
root@:/usr/local/apache2# fg
sleep 100
^C
root@:/usr/local/apache2#
PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
There was a race where we checked task.Parent() != nil, and then later called
task.Parent() again, assuming that it is not nil. If the task is exiting, the
parent may have been set to nil in between the two calls, causing a panic.
This CL changes the code to only call task.Parent() once.
PiperOrigin-RevId: 215274456
Change-Id: Ib5a537312c917773265ec72016014f7bc59a5f59
host.endpoint already has the check, but it is missing from
host.ConnectedEndpoint.
PiperOrigin-RevId: 214962762
Change-Id: I88bb13a5c5871775e4e7bf2608433df8a3d348e6
Previously, if address resolution for UDP or Ping sockets required sending
packets using Write in Transport layer, Resolve would return ErrWouldBlock
and Write would return ErrNoLinkAddress. Meanwhile startAddressResolution
would run in background. Further calls to Write using same address would also
return ErrNoLinkAddress until resolution has been completed successfully.
Since Write is not allowed to block and System Calls need to be
interruptible in System Call layer, the caller to Write is responsible for
blocking upon return of ErrWouldBlock.
Now, when startAddressResolution is called a notification channel for
the completion of the address resolution is returned.
The channel will traverse up to the calling function of Write as well as
ErrNoLinkAddress. Once address resolution is complete (success or not) the
channel is closed. The caller would call Write again to send packets and
check if address resolution was compeleted successfully or not.
Fixesgoogle/gvisor#5
Change-Id: Idafaf31982bee1915ca084da39ae7bd468cebd93
PiperOrigin-RevId: 214962200
We already forward TCSETS and TCSETSW. TCSETSF is roughly equivalent but
discards pending input.
The filters were relaxed to allow host ioctls with TCSETSF argument.
This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.
Before:
root@b8a0240fc836:/# passwd
Enter new UNIX password: 123
Retype new UNIX password: 123
passwd: password updated successfully
After:
root@ae6f5dabe402:/# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.
PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
Old code was returning ID of the thread that created
the child process. It should be returning the ID of
the parent process instead.
PiperOrigin-RevId: 214720910
Change-Id: I95715c535bcf468ecf1ae771cccd04a4cd345b36
If we have an overlay file whose corresponding Dirent is frozen, then we should
not bother calling Readdir on the upper or lower files, since DirentReaddir
will calculate children based on the frozen Dirent tree.
A test was added that fails without this change.
PiperOrigin-RevId: 213531215
Change-Id: I4d6c98f1416541a476a34418f664ba58f936a81d
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
code.
- Processes not waited on will consume memory (like a zombie process)
PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558