Commit Graph

208 Commits

Author SHA1 Message Date
Fabricio Voznika 054b5632ef Update comment
PiperOrigin-RevId: 254428866
2019-06-21 10:56:42 -07:00
Nicolas Lacasse f7428af9c1 Add MountNamespace to task.
This allows tasks to have distinct mount namespace, instead of all sharing the
kernel's root mount namespace.

Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).

In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.

Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better naem.

PiperOrigin-RevId: 254009310
2019-06-19 09:21:21 -07:00
Andrei Vagin 8ab0848c70 gvisor/fs: don't update file.offset for sockets, pipes, etc
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.

With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.

PiperOrigin-RevId: 253746644
2019-06-18 01:43:29 -07:00
Yong He 0dbdca349c Skip tid allocation which is using
When leader of process group (session) exit, the process
group ID (session ID) is holding by other processes in
the process group, so the process group ID (session ID)
can not be reused.
If reusing the process group ID (seession ID) as new process
group ID for new process, this will cause session create
failed, and later runsc crash when access process group.
The fix skip the tid if it is using by a process group
(session) when allocating a new tid.

We could easily reproduce the runsc crash follow
these steps:

1. build test program, and run inside container

int main(int argc, char *argv[])
{
	pid_t cpid, spid;

	cpid = fork();
	if (cpid == -1) {
		perror("fork");
		exit(EXIT_FAILURE);
	}
	if (cpid == 0) {
		pid_t sid = setsid();
		printf("Start New Session %ld\n",sid);
		printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n",
				getpid(),getppid(),getpgid(getpid()),getsid(getpid()));
		spid = fork();
		if (spid == 0) {
			setpgid(getpid(), getpid());
			printf("Set GrandSon as New Process Group\n");
			printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n",
					getpid(),getppid(),getpgid(getpid()),getsid(getpid()));
			while(1) {
				usleep(1);
			}
		}
		sleep(3);
		exit(0);
	} else {
		exit(0);
	}

	return 0;
}

2. build hello program

int main(int argc, char *argv[])
{
	printf("Current PID is %ld\n", (long) getpid());
	return 0;
}

3. run script on host which run hello inside container, you can
speed up the test with set TasksLimit as lower value.
for (( i=0; i<65535; i++ ))
do
	docker exec <container id> /test/hello
done

4. when hello process reusing the process group of loop process,
runsc will crash.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8]

goroutine 612475 [running]:
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0)
        pkg/sentry/kernel/sessions.go:160 +0x78
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0)
        pkg/sentry/kernel/task_exit.go:672 +0x2b7
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0)
        pkg/sentry/kernel/task_exit.go:542 +0xc4
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc)
        pkg/sentry/kernel/task_run.go:91 +0x194
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
        pkg/sentry/kernel/task_start.go:286 +0xfe
2019-06-14 14:05:41 +08:00
Ian Gudger 3e9b8ecbfe Plumb context through more layers of filesytem.
All functions which allocate objects containing AtomicRefCounts will soon need
a context.

PiperOrigin-RevId: 253147709
2019-06-13 18:40:38 -07:00
Ian Gudger 0a5ee6f7b2 Fix deadlock in fasync.
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.

This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.

The new test is related, but does not cover this particular situation.

Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.

Fixes #346

PiperOrigin-RevId: 253138340
2019-06-13 17:26:22 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Ian Lewis 74e397e39a Add introspection for Linux/AMD64 syscalls
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.

Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this info to debug and get info
on the compatibility of the version of runsc they are running or to generate
documentation.

PiperOrigin-RevId: 252558304
2019-06-10 23:38:36 -07:00
Rahat Mahmood a00157cc0e Store more information in the kernel socket table.
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.

Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.

Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.

PiperOrigin-RevId: 252495895
2019-06-10 15:17:43 -07:00
Jamie Liu b3f104507d "Implement" mbind(2).
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().

Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.

PiperOrigin-RevId: 251950859
2019-06-06 16:29:46 -07:00
Michael Pratt d3ed9baac0 Implement dumpability tracking and checks
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.

Lack of support for set-uid binaries or fs creds simplifies things a
bit.

As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but at least since they have no parent normal
UID/GID checks should protect them.

PiperOrigin-RevId: 251712714
2019-06-05 14:00:13 -07:00
Yong He 7398f013f0 Drop one dirent reference after referenced by file
When pipe is created, a dirent of pipe will be
created and its initial reference is set as 0.
Cause all dirent will only be destroyed when
the reference decreased to -1, so there is already
a 'initial reference' of dirent after it created.
For destroying dirent after all reference released,
the correct way is to drop the 'initial reference'
once someone hold a reference to the dirent, such
as fs.NewFile, otherwise the reference of dirent
will stay 0 all the time, and will cause memory
leak of dirent.
Except pipe, timerfd/eventfd/epoll has the same
problem

Here is a simple case to create memory leak of dirent
for pipe/timerfd/eventfd/epoll in C langange, after
run the case, pprof the runsc process, you will
find lots dirents of pipe/timerfd/eventfd/epoll not
freed:

int main(int argc, char *argv[])
{
	int i;
	int n;
	int pipefd[2];

	if (argc != 3) {
		printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
	}

	n = strtol(argv[2], NULL, 10);

	if (strcmp(argv[1], "epoll") == 0) {
		for (i = 0; i < n; ++i)
			close(epoll_create(1));
	} else if (strcmp(argv[1], "timerfd") == 0) {
		for (i = 0; i < n; ++i)
			close(timerfd_create(CLOCK_REALTIME, 0));
	} else if (strcmp(argv[1], "eventfd") == 0) {
		for (i = 0; i < n; ++i)
			close(eventfd(0, 0));
	} else if (strcmp(argv[1], "pipe") == 0) {
		for (i = 0; i < n; ++i)
			if (pipe(pipefd) == 0) {
				close(pipefd[0]);
				close(pipefd[1]);
			}
	}

	printf("%s %s test finished\r\n",argv[1],argv[2]);
	return 0;
}

Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2
PiperOrigin-RevId: 251531096
2019-06-04 15:40:23 -07:00
Nicolas Lacasse 0c292cdaab Remove the Dirent field from Pipe.
Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe
raises difficult questions about the lifecycle of the Pipe and Dirent.

Fortunately, we can side-step those questions by removing the Dirent field from
Pipe entirely. We only need the Dirent when constructing fs.Files (which are
ref-counted), and in GetFile (when a Dirent is passed to us anyways).

PiperOrigin-RevId: 251497628
2019-06-04 12:58:56 -07:00
Michael Pratt 507a15dce9 Always wait on tracee children
After bf959931ddb88c4e4366e96dd22e68fa0db9527c ("wait/ptrace: assume
__WALL if the child is traced") (Linux 4.7), tracees are always eligible
for waiting, regardless of type.

PiperOrigin-RevId: 250399527
2019-05-30 12:05:46 -07:00
Andrei Vagin a949133c4b gvisor: interrupt the sendfile system call if a task has been interrupted
sendfile can be called for a big range and it can require significant
amount of time to process it, so we need to handle task interrupts in
this system call.

PiperOrigin-RevId: 249781023
Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015
2019-05-23 23:21:13 -07:00
Adin Scannell 79738d3958 Log unhandled faults only at DEBUG level.
PiperOrigin-RevId: 249561399
Change-Id: Ic73c68c8538bdca53068f38f82b7260939addac2
2019-05-22 18:18:53 -07:00
Michael Pratt 711290a7f6 Add support for wait(WNOTHREAD)
PiperOrigin-RevId: 249537694
Change-Id: Iaa4bca73a2d8341e03064d59a2eb490afc3f80da
2019-05-22 15:54:23 -07:00
Adin Scannell ae1bb08871 Clean up pipe internals and add fcntl support
Pipe internals are made more efficient by avoiding garbage collection.
A pool is now used that can be shared by all pipes, and buffers are
chained via an intrusive list. The documentation for pipe structures
and methods is also simplified and clarified.

The pipe tests are now parameterized, so that they are run on all
different variants (named pipes, small buffers, default buffers).

The pipe buffer sizes are exposed by fcntl, which is now supported
by this change. A size change test has been added to the suite.

These new tests uncovered a bug regarding the semantics of open
named pipes with O_NONBLOCK, which is also fixed by this CL. This
fix also addresses the lack of the O_LARGEFILE flag for named pipes.

PiperOrigin-RevId: 249375888
Change-Id: I48e61e9c868aedb0cadda2dff33f09a560dee773
2019-05-21 20:12:27 -07:00
Adin Scannell 9cdae51fec Add basic plumbing for splice and stub implementation.
This does not actually implement an efficient splice or sendfile. Rather, it
adds a generic plumbing to the file internals so that this can be added. All
file implementations use the stub fileutil.NoSplice implementation, which
causes sendfile and splice to fall back to an internal copy.

A basic splice system call interface is added, along with a test.

PiperOrigin-RevId: 249335960
Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b
2019-05-21 15:18:12 -07:00
Jamie Liu 5ee8218483 Add pgalloc.DelayedEvictionManual.
PiperOrigin-RevId: 247667272
Change-Id: I16b04e11bb93f50b7e05e888992303f730e4a877
2019-05-10 13:37:48 -07:00
Fabricio Voznika 1bee43be13 Implement fallocate(2)
Closes #225

PiperOrigin-RevId: 247508791
Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-09 15:35:49 -07:00
Jamie Liu 8bfb83d0ac Implement async MemoryFile eviction, and use it in CachingInodeOperations.
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.

Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.

As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.

PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
2019-04-30 13:56:41 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Michael Pratt f17cfa4d53 Perform explicit CPUID and FP state compatibility checks on restore
PiperOrigin-RevId: 245341004
Change-Id: Ic4d581039d034a8ae944b43e45e84eb2c3973657
2019-04-25 17:47:05 -07:00
Michael Pratt f86c35a51f Clean up state error handling
PiperOrigin-RevId: 244773836
Change-Id: I32223f79d2314fe1ac4ddfc63004fc22ff634adf
2019-04-22 18:20:51 -07:00
Michael Pratt b52cbd6028 Don't allow sigtimedwait to catch unblockable signals
The existing logic attempting to do this is incorrect. Unary ^ has
higher precedence than &^, so mask always has UnblockableSignals
cleared, allowing dequeueSignalLocked to dequeue unblockable signals
(which allows userspace to ignore them).

Switch the logic so that unblockable signals are always masked.

PiperOrigin-RevId: 244058487
Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3
2019-04-17 13:43:20 -07:00
Fabricio Voznika c8cee7108f Use FD limit and file size limit from host
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.

PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17 12:57:40 -07:00
Jamie Liu 4209edafb6 Use open fids when fstat()ing gofer files.
PiperOrigin-RevId: 243018347
Change-Id: I1e5b80607c1df0747482abea61db7fcf24536d37
2019-04-11 00:43:04 -07:00
Michael Pratt cc48969bb7 Internal change
PiperOrigin-RevId: 242978508
Change-Id: I0ea59ac5ba1dd499e87c53f2e24709371048679b
2019-04-10 18:00:18 -07:00
Nicolas Lacasse d93d19fd4e Fix uses of RootFromContext.
RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.

PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
2019-04-10 16:36:28 -07:00
Kevin Krakauer f7aff0aaa4 Allow threads with CAP_SYS_RESOURCE to raise hard rlimits.
PiperOrigin-RevId: 242919489
Change-Id: Ie3267b3bcd8a54b54bc16a6556369a19e843376f
2019-04-10 12:36:45 -07:00
Nicolas Lacasse 0a0619216e Start saving MountSource.DirentCache.
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save.  There is no reason not to save it along with the MountSource.

This did uncover an issue where not all MountSources were properly flushed
before Save.  If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first.  This CL also fixes that by flushing
all MountSources for all open FDs on Save.

PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
2019-04-10 11:27:16 -07:00
Li Qiang b3b140ea4f syscalls: sendfile: limit the count to MAX_RW_COUNT
From sendfile spec and also the linux kernel code, we should
limit the count arg to 'MAX_RW_COUNT'. This patch export
'MAX_RW_COUNT' in kernel pkg and use it in the implementation
of sendfile syscall.

Signed-off-by: Li Qiang <pangpei.lq@antfin.com>
Change-Id: I1086fec0685587116984555abd22b07ac233fbd2
PiperOrigin-RevId: 242745831
2019-04-09 14:57:05 -07:00
Jamie Liu 9471c01348 Export kernel.SignalInfoPriv.
Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks.

PiperOrigin-RevId: 242562428
Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90
2019-04-08 16:32:11 -07:00
Michael Pratt 75a5ccf5d9 Remove defer from trivial ThreadID methods
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.

PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
2019-04-04 17:14:27 -07:00
Michael Pratt 9cf33960fc Only CopyOut CPU when it changes
This will save copies when preemption is not caused by a CPU migration.

PiperOrigin-RevId: 241844399
Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
2019-04-03 18:06:36 -07:00
Michael Pratt 4968dd1341 Cache ThreadGroups in PIDNamespace
If there are thousands of threads, ThreadGroupsAppend becomes very
expensive as it must iterate over all Tasks to find the ThreadGroup
leaders.

Reduce the cost by maintaining a map of ThreadGroups which can be used
to grab them all directly.

The one somewhat visible change is to convert PID namespace init
children zapping to a group-directed SIGKILL, as Linux did in
82058d668465 "signal: Use group_send_sig_info to kill all processes in a
pid namespace".

In a benchmark that creates N threads which sleep for two minutes, we
see approximately this much CPU time in ThreadGroupsAppend:

Before:

1 thread: 0ms
1024 threads: 30ms - 9130ms
4096 threads: 50ms - 2000ms
8192 threads: 18160ms
16384 threads: 17210ms

After:

1 thread: 0ms
1024 threads: 0ms
4096 threads: 0ms
8192 threads: 0ms
16384 threads: 0ms

The profiling is actually extremely noisy (likely due to cache effects),
as some runs show almost no samples at 1024, 4096 threads, but obviously
this does not scale to lots of threads.

PiperOrigin-RevId: 241828039
Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
2019-04-03 16:22:43 -07:00
Jamie Liu c4caccd540 Set options on the correct Task in PTRACE_SEIZE.
$ docker run --rm --runtime=runsc -it --cap-add=SYS_PTRACE debian bash -c "apt-get update && apt-get install strace && strace ls"
...
Setting up strace (4.15-2) ...
execve("/bin/ls", ["ls"], [/* 6 vars */]) = 0
brk(NULL)                               = 0x5646d8c1e000
uname({sysname="Linux", nodename="114ef93d2db3", ...}) = 0
...

PiperOrigin-RevId: 241643321
Change-Id: Ie4bce27a7fb147eef07bbae5895c6ef3f529e177
2019-04-02 18:13:19 -07:00
Rahat Mahmood d14a7de658 Fix more data races in shm debug messages.
PiperOrigin-RevId: 241630409
Change-Id: Ie0df5f5a2f20c2d32e615f16e2ba43c88f963181
2019-04-02 16:46:32 -07:00
Rahat Mahmood 7cff746ef2 Save/restore simple devices.
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.

PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-04-01 15:39:16 -07:00
Andrei Vagin a4b34e2637 gvisor: convert ilist to ilist:generic_list
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
2019-04-01 12:53:27 -07:00
Jamie Liu 26e8d9981f Use kernel.Task.CopyScratchBuffer in syscalls/linux where possible.
PiperOrigin-RevId: 241072126
Change-Id: Ib4d9f58f550732ac4c5153d3cf159a5b1a9749da
2019-03-29 16:25:33 -07:00
Nicolas Lacasse e8fef3d873 Treat fsync errors during save as SaveRejection errors.
PiperOrigin-RevId: 241055485
Change-Id: I70259e9fef59bdf9733b35a2cd3319359449dd45
2019-03-29 14:48:16 -07:00
Nicolas Lacasse ed23f54709 Treat ENOSPC as a state-file error during save.
PiperOrigin-RevId: 241028806
Change-Id: I770bf751a2740869a93c3ab50370a727ae580470
2019-03-29 12:26:25 -07:00
chris.zn 31c2236e97 set task's name when fork
When fork a child process, the name filed of TaskContext is not set.
It results in that when we cat /proc/{pid}/status, the name filed is
null.

Like this:
Name:
State:  S (sleeping)
Tgid:   28
Pid:    28
PPid:   26
TracerPid:      0
FDSize: 8
VmSize: 89712 kB
VmRSS:  6648 kB
Threads:        1
CapInh: 00000000a93d35fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a93d35fb
Seccomp:        0
Change-Id: I5d469098c37cedd19da16b7ffab2e546a28a321e
PiperOrigin-RevId: 240893304
2019-03-28 18:05:42 -07:00
Jamie Liu f3723f8059 Call memmap.Mappable.Translate with more conservative usermem.AccessType.
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate
(although it unsets AccessType.Write if the vma is private). This
somewhat simplifies handling of pmas, since it means only COW-break
needs to replace existing pmas. However, it also means that a MAP_SHARED
mapping of a file opened O_RDWR dirties the file, regardless of the
mapping's permissions and whether or not the mapping is ever actually
written to with I/O that ignores permissions (e.g.
ptrace(PTRACE_POKEDATA)).

To fix this:

- Change the pma-getting path to request only the permissions that are
required for the calling access.

- Change memmap.Mappable.Translate to take requested permissions, and
return allowed permissions. This preserves the existing behavior in the
common cases where the memmap.Mappable isn't
fsutil.CachingInodeOperations and doesn't care if the translated
platform.File pages are written to.

- Change the MM.getPMAsLocked path to support permission upgrading of
pmas outside of copy-on-write.

PiperOrigin-RevId: 240196979
Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
2019-03-25 12:42:43 -07:00
Andrei Vagin ddc05e3053 epoll: use ilist:generic_list instead of ilist:ilist
ilist:generic_list works faster than ilist:ilist.

Here is a beanchmark test to measure performance of epoll_wait, when readyList
isn't empty. It shows about 30% better performance with these changes.

Benchmark           Time(ns)        CPU(ns)     Iterations
Before:
BM_EpollAllEvents      46725          46899          14286

After:
BM_EpollAllEvents      33167          33300          18919
PiperOrigin-RevId: 240185278
Change-Id: I3e33f9b214db13ab840b91613400525de5b58d18
2019-03-25 11:41:50 -07:00
Jamie Liu 3d0b960112 Implement PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN.
PiperOrigin-RevId: 239803092
Change-Id: I42d612ed6a889e011e8474538958c6de90c6fcab
2019-03-22 08:55:44 -07:00
Andrei Vagin 064fda1a75 gvisor: don't allocate a new credential object on fork
A credential object is immutable, so we don't need to copy it for a new
task.

PiperOrigin-RevId: 239519266
Change-Id: I0632f641fdea9554779ac25d84bee4231d0d18f2
2019-03-20 18:41:00 -07:00
Rahat Mahmood cea1dd7d21 Remove racy access to shm fields.
PiperOrigin-RevId: 239016776
Change-Id: Ia7af4258e7c69b16a4630a6f3278aa8e6b627746
2019-03-18 10:49:03 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Fabricio Voznika 0b76887147 Priority-inheritance futex implementation
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.

PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
2019-03-05 23:40:18 -08:00
Fabricio Voznika 3dbd4a16f8 Add semctl(GETPID) syscall
Also added unimplemented notification for semctl(2)
commands.

PiperOrigin-RevId: 236340672
Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478
2019-03-01 10:57:02 -08:00
Haibo Xu 15d3189884 Make some ptrace commands x86-only
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I9751f859332d433ca772d6b9733f5a5a64398ec7
PiperOrigin-RevId: 234877624
2019-02-20 15:10:59 -08:00
Jamie Liu bed6f8534b Set rax to syscall number on SECCOMP_RET_TRAP.
PiperOrigin-RevId: 234690475
Change-Id: I1cbfb5aecd4697a4a26ec8524354aa8656cc3ba1
2019-02-19 15:49:37 -08:00
Jamie Liu bb47d8a545 Fix clone(CLONE_NEWUSER).
- Use new user namespace for namespace creation checks.

- Ensure userns is never nil since it's used by other namespaces.

PiperOrigin-RevId: 234673175
Change-Id: I4b9d9d1e63ce4e24362089793961a996f7540cd9
2019-02-19 14:20:05 -08:00
Nicolas Lacasse 0a41ea72c1 Don't allow writing or reading to TTY unless process group is in foreground.
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.

See drivers/tty/n_tty.c:n_tty_read()=>job_control().

If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.

See drivers/tty/tty_io.c:tty_check_change().

PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-14 15:47:31 -08:00
Rahat Mahmood 2ba74f84be Implement /proc/net/unix.
PiperOrigin-RevId: 232948478
Change-Id: Ib830121e5e79afaf5d38d17aeef5a1ef97913d23
2019-02-07 14:44:21 -08:00
Fabricio Voznika 9ef3427ac1 Implement semctl(2) SETALL and GETALL
PiperOrigin-RevId: 232914984
Change-Id: Id2643d7ad8e986ca9be76d860788a71db2674cda
2019-02-07 11:41:44 -08:00
Michael Pratt fe1369ac98 Move package sync to third_party
PiperOrigin-RevId: 231889261
Change-Id: I482f1df055bcedf4edb9fe3fe9b8e9c80085f1a0
2019-01-31 17:49:14 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Jamie Liu 9270d940eb Minor memevent fixes.
- Call MemoryEvents.done.Add(1) outside of MemoryEvents.run() so that if
  MemoryEvents.Stop() => MemoryEvents.done.Wait() is called before the
  goroutine starts running, it still waits for the goroutine to stop.

- Use defer to call MemoryEvents.done.Done() in MemoryEvents.run() so that it's
  called even if the goroutine panics.

PiperOrigin-RevId: 228623307
Change-Id: I1b0459e7999606c1a1a271b16092b1ca87005015
2019-01-09 17:54:40 -08:00
Brian Geffon 3676b7ff1c Improve loader related error messages returned to users.
PiperOrigin-RevId: 228382827
Change-Id: Ica1d30e0df826bdd77f180a5092b2b735ea5c804
2019-01-08 12:58:08 -08:00
Jamie Liu f95b94fbe3 Grant no initial capabilities to non-root UIDs.
See modified comment in auth.NewUserCredentials(); compare to the
behavior of setresuid(2) as implemented by
//pkg/sentry/kernel/task_identity.go:kernel.Task.setKUIDsUncheckedLocked().

PiperOrigin-RevId: 228381765
Change-Id: I45238777c8f63fcf41b99fce3969caaf682fe408
2019-01-08 12:52:24 -08:00
Brian Geffon bfa2f314ca Add EventChannel messages for uncaught signals.
PiperOrigin-RevId: 226936778
Change-Id: I2a6dda157c55d39d81e1b543ab11a58a0bfe5c05
2018-12-26 11:26:28 -08:00
Fabricio Voznika 03226cd950 Add BPFAction type with Stringer
PiperOrigin-RevId: 226018694
Change-Id: I98965e26fe565f37e98e5df5f997363ab273c91b
2018-12-18 10:28:28 -08:00
Adin Scannell 5d8cf31346 Move fdnotifier package to reduce internal confusion.
PiperOrigin-RevId: 225632398
Change-Id: I909e7e2925aa369adc28e844c284d9a6108e85ce
2018-12-14 18:05:01 -08:00
Rahat Mahmood ccce1d4281 Filesystems shouldn't be saving references to Platform.
Platform objects are not savable, storing references to them in
filesystem datastructures would cause save to fail if someone actually
passed in a Platform.

Current implementations work because everywhere a Platform is
expected, we currently pass in a Kernel object which embeds Platform
and thus satisfies the interface.

Eliminate this indirection and save pointers to Kernel directly.

PiperOrigin-RevId: 225288336
Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
2018-12-12 17:47:55 -08:00
Rahat Mahmood f93c288dd7 Fix a data race on Shm.key.
PiperOrigin-RevId: 225240907
Change-Id: Ie568ce3cd643f3e4a0eaa0444f4ed589dcf6031f
2018-12-12 13:18:48 -08:00
Rahat Mahmood 75e39eaa74 Pass information about map writableness to filesystems.
This is necessary to implement file seals for memfds.

PiperOrigin-RevId: 225239394
Change-Id: Ib3f1ab31385afc4b24e96cd81a05ef1bebbcbb70
2018-12-12 13:09:59 -08:00
Rahat Mahmood fc29770251 Add type safety to shm ids and keys.
PiperOrigin-RevId: 224864380
Change-Id: I49542279ad56bf15ba462d3de1ef2b157b31830a
2018-12-10 12:48:02 -08:00
Michael Pratt 99d5958693 Validate FS_BASE in Task.Clone
arch_prctl already verified that the new FS_BASE was canonical, but
Task.Clone did not. Centralize these checks in the arch packages.

Failure to validate could cause an error in PTRACE_SET_REGS when we try
to switch to the app.

PiperOrigin-RevId: 224862398
Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
2018-12-10 12:37:16 -08:00
Rahat Mahmood 685eaf119f Add counters for memory events.
Also ensure an event is emitted at startup.

PiperOrigin-RevId: 224372065
Change-Id: I5f642b6d6b13c6468ee8f794effe285fcbbf29cf
2018-12-06 11:15:47 -08:00
Brian Geffon 82719be42e Max link traversals should be for an entire path.
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.

PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-12-04 14:32:03 -08:00
Fabricio Voznika eaac94d91c Use RET_KILL_PROCESS if available in kernel
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.

For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).

PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20 22:56:51 -08:00
Fabricio Voznika 8b314b0bf4 Fix recursive read lock taken on TaskSet
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock
in R mode and could deadlock if a writer comes in between.

PiperOrigin-RevId: 222313551
Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
2018-11-20 15:07:56 -08:00
Adin Scannell bb9a2bb62e Update futex to use usermem abstractions.
This eliminates the indirection that existed in task_futex.

PiperOrigin-RevId: 221832498
Change-Id: Ifb4c926d493913aa6694e193deae91616a29f042
2018-11-20 14:02:07 -08:00
Rahat Mahmood f7aa937124 Advertise vsyscall support via /proc/<pid>/maps.
Also update test utilities for probing vsyscall support and add a
metric to see if vsyscalls are actually used in sandboxes.

PiperOrigin-RevId: 221698834
Change-Id: I57870ecc33ea8c864bd7437833f21aa1e8117477
2018-11-15 15:14:38 -08:00
Rahat Mahmood 5a0be6fa20 Create stubs for syscalls upto Linux 4.4.
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise
a kernel version of 4.4.

PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
2018-11-08 11:09:46 -08:00
Rahat Mahmood 0e277a39c8 Prevent premature destruction of shm segments.
Shm segments can be marked for lazy destruction via shmctl(IPC_RMID),
which destroys a segment once it is no longer attached to any
processes. We were unconditionally decrementing the segment refcount
on shmctl(IPC_RMID) which allowed a user to force a segment to be
destroyed by repeatedly calling shmctl(IPC_RMID), with outstanding
memory maps to the segment.

This is problematic because the memory released by a segment destroyed
this way can be reused by a different process while remaining
accessible by the process with outstanding maps to the segment.

PiperOrigin-RevId: 219713660
Change-Id: I443ab838322b4fb418ed87b2722c3413ead21845
2018-11-01 15:54:14 -07:00
Rahat Mahmood 46603b569c Fix panic on creation of zero-len shm segments.
Attempting to create a zero-len shm segment causes a panic since we
try to allocate a zero-len filemem region. The existing code had a
guard to disallow this, but the check didn't encode the fact that
requesting a private segment implies a segment creation regardless of
whether IPC_CREAT is explicitly specified.

PiperOrigin-RevId: 218405743
Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
2018-10-23 14:18:54 -07:00
Adin Scannell 75cd70ecc9 Track paths and provide a rename hook.
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.

PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
2018-10-23 00:20:15 -07:00
Fabricio Voznika b2068cf5a5 Add more unimplemented syscall events
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.

PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
2018-10-20 11:14:23 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Jamie Liu b2a88ff471 Check thread group CPU timers in the CPU clock ticker.
This reduces the number of goroutines and runtime timers when
ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set.
This also ensures that thread group CPU timers only advance if running
tasks are observed at the time the CPU clock advances, mostly
eliminating the possibility that a CPU timer expiration observes no
running tasks and falls back to the group leader.

PiperOrigin-RevId: 217603396
Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
2018-10-17 15:50:02 -07:00
Nicolas Lacasse 4e6f0892c9 runsc: Support job control signals for the root container.
Now containers run with "docker run -it" support control characters like ^C and
^Z.

This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.

PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-17 12:29:05 -07:00
Michael Pratt 578fe5a50d Fix PTRACE_GETREGSET write size
The existing logic is backwards and writes iov_len == 0 for a full write.

PiperOrigin-RevId: 217560377
Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
2018-10-17 11:53:04 -07:00
Ian Gudger 6cba410df0 Move Unix transport out of netstack
PiperOrigin-RevId: 217557656
Change-Id: I63d27635b1a6c12877279995d2d9847b6a19da9b
2018-10-17 11:37:51 -07:00
Ian Gudger 167f2401c4 Merge host.endpoint into host.ConnectedEndpoint
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.

PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
2018-10-15 17:48:11 -07:00
Nicolas Lacasse b78552d30e When creating a new process group, add it to the session.
PiperOrigin-RevId: 216554791
Change-Id: Ia6b7a2e6eaad80a81b2a8f2e3241e93ebc2bda35
2018-10-10 10:42:11 -07:00
Jamie Liu e9e8be6613 Implement shared futexes.
- Shared futex objects on shared mappings are represented by Mappable +
  offset, analogous to Linux's use of inode + offset. Add type
  futex.Key, and change the futex.Manager bucket API to use futex.Keys
  instead of addresses.

- Extend the futex.Checker interface to be able to return Keys for
  memory mappings. It returns Keys rather than just mappings because
  whether the address or the target of the mapping is used in the Key
  depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this
  matters because using mapping target for a futex on a MAP_PRIVATE
  mapping causes it to stop working across COW-breaking.

- futex.Manager.WaitComplete depends on atomic updates to
  futex.Waiter.addr to determine when it has locked the right bucket,
  which is much less straightforward for struct futex.Waiter.key. Switch
  to an atomically-accessed futex.Waiter.bucket pointer.

- futex.Manager.Wake now needs to take a futex.Checker to resolve
  addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit
  path to perform a shared futex wakeup (Linux:
  kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid,
  FUTEX_WAKE, ...)). This is a problem because futexChecker is in the
  syscalls/linux package. Move it to kernel.

PiperOrigin-RevId: 216207039
Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
2018-10-08 10:20:38 -07:00
Ian Gudger beac59b37a Fix panic if FIOASYNC callback is registered and triggered without target
PiperOrigin-RevId: 215674589
Change-Id: I4f8871b64c570dc6da448d2fe351cec8a406efeb
2018-10-03 20:22:31 -07:00
Ian Gudger 4fef31f96c Add S/R support for FIOASYNC
PiperOrigin-RevId: 215655197
Change-Id: I668b1bc7c29daaf2999f8f759138bcbb09c4de6f
2018-10-03 17:03:09 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Kevin Krakauer bb88c187c5 runsc: Enable waiting on exited processes.
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
  code.
- Processes not waited on will consume memory (like a zombie process)

PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17 16:25:24 -07:00
Ian Gudger ab6fa44588 Allow kernel.(*Task).Block to accept an extract only channel
PiperOrigin-RevId: 213328293
Change-Id: I4164133e6f709ecdb89ffbb5f7df3324c273860a
2018-09-17 13:35:54 -07:00
Jamie Liu 0380bcb3a4 Fix interaction between rt_sigtimedwait and ignored signals.
PiperOrigin-RevId: 213011782
Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af
2018-09-14 11:10:50 -07:00