Commit Graph

728 Commits

Author SHA1 Message Date
Bhasker Hariharan 70578806e8 Add support for TCP_CONGESTION socket option.
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.

PiperOrigin-RevId: 252889093
2019-06-12 13:35:50 -07:00
Andrei Vagin 0d05a12fd3 gvisor/ptrace: print guest registers if a stub stopped with unexpected code
PiperOrigin-RevId: 252855280
2019-06-12 10:48:46 -07:00
Adin Scannell df110ad4fe Eat sendfile partial error
For sendfile(2), we propagate a TCP error through the system call layer.
This should be eaten if there is a partial result. This change also adds
a test to ensure that there is no panic in this case, for both TCP sockets
and unix domain sockets.

PiperOrigin-RevId: 252746192
2019-06-11 19:24:35 -07:00
Fabricio Voznika fc746efa9a Add support to mount pod shared tmpfs mounts
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.

For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.

PiperOrigin-RevId: 252704037
2019-06-11 14:54:31 -07:00
Ian Lewis 74e397e39a Add introspection for Linux/AMD64 syscalls
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.

Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this info to debug and get info
on the compatibility of the version of runsc they are running or to generate
documentation.

PiperOrigin-RevId: 252558304
2019-06-10 23:38:36 -07:00
Jamie Liu 589f36ac4a Move //pkg/sentry/platform/procid to //pkg/procid.
PiperOrigin-RevId: 252501653
2019-06-10 15:47:25 -07:00
Rahat Mahmood a00157cc0e Store more information in the kernel socket table.
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.

Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.

Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.

PiperOrigin-RevId: 252495895
2019-06-10 15:17:43 -07:00
Jamie Liu 48961d27a8 Move //pkg/sentry/memutil to //pkg/memutil.
PiperOrigin-RevId: 252124156
2019-06-07 14:52:27 -07:00
Jamie Liu c933f3eede Change visibility of //pkg/sentry/time.
PiperOrigin-RevId: 251965598
2019-06-06 17:58:55 -07:00
Jamie Liu 9ea248489b Cap initial usermem.CopyStringIn buffer size.
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which
passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this
size is measurably inefficient in most cases: most paths will not be
this long, 4 KB is a lot of bytes to zero, and as of this writing the Go
runtime allocator maps only two 4 KB objects to each 8 KB span,
necessitating a call to runtime.mcache.refill() on ~every other call.
Limit the initial buffer size to 256 B instead, and geometrically
reallocate if necessary.

PiperOrigin-RevId: 251960441
2019-06-06 17:22:00 -07:00
Rahat Mahmood 315cf9a523 Use common definition of SockType.
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.

PiperOrigin-RevId: 251956740
2019-06-06 17:00:27 -07:00
Fabricio Voznika 02ab1f187c Copy up parent when binding UDS on overlayfs
Overlayfs was expecting the parent to exist when bind(2)
was called, which may not be the case. The fix is to copy
the parent directory to the upper layer before binding
the UDS.

There is not good place to add tests for it. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus copied the parent to the upper layer).
Added it to runsc integration tests for now. If it turns
out we have lots of these kind of tests, we can consider
moving them somewhere more appropriate.

PiperOrigin-RevId: 251954156
2019-06-06 16:45:51 -07:00
Jamie Liu b3f104507d "Implement" mbind(2).
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().

Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.

PiperOrigin-RevId: 251950859
2019-06-06 16:29:46 -07:00
Jamie Liu a26043ee53 Implement reclaim-driven MemoryFile eviction.
PiperOrigin-RevId: 251950660
2019-06-06 16:27:55 -07:00
Rahat Mahmood 2d2831e354 Track and export socket state.
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).

For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.

PiperOrigin-RevId: 251934850
2019-06-06 15:04:47 -07:00
Michael Pratt 57772db2e7 Shutdown host sockets on internal shutdown
This is required to make the shutdown visible to peers outside the
sandbox.

The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.

The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.

Updates #273

PiperOrigin-RevId: 251763060
2019-06-05 18:40:37 -07:00
Michael Pratt d3ed9baac0 Implement dumpability tracking and checks
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.

Lack of support for set-uid binaries or fs creds simplifies things a
bit.

As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but at least since they have no parent normal
UID/GID checks should protect them.

PiperOrigin-RevId: 251712714
2019-06-05 14:00:13 -07:00
Yong He 7398f013f0 Drop one dirent reference after referenced by file
When pipe is created, a dirent of pipe will be
created and its initial reference is set as 0.
Cause all dirent will only be destroyed when
the reference decreased to -1, so there is already
a 'initial reference' of dirent after it created.
For destroying dirent after all reference released,
the correct way is to drop the 'initial reference'
once someone hold a reference to the dirent, such
as fs.NewFile, otherwise the reference of dirent
will stay 0 all the time, and will cause memory
leak of dirent.
Except pipe, timerfd/eventfd/epoll has the same
problem

Here is a simple case to create memory leak of dirent
for pipe/timerfd/eventfd/epoll in C langange, after
run the case, pprof the runsc process, you will
find lots dirents of pipe/timerfd/eventfd/epoll not
freed:

int main(int argc, char *argv[])
{
	int i;
	int n;
	int pipefd[2];

	if (argc != 3) {
		printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
	}

	n = strtol(argv[2], NULL, 10);

	if (strcmp(argv[1], "epoll") == 0) {
		for (i = 0; i < n; ++i)
			close(epoll_create(1));
	} else if (strcmp(argv[1], "timerfd") == 0) {
		for (i = 0; i < n; ++i)
			close(timerfd_create(CLOCK_REALTIME, 0));
	} else if (strcmp(argv[1], "eventfd") == 0) {
		for (i = 0; i < n; ++i)
			close(eventfd(0, 0));
	} else if (strcmp(argv[1], "pipe") == 0) {
		for (i = 0; i < n; ++i)
			if (pipe(pipefd) == 0) {
				close(pipefd[0]);
				close(pipefd[1]);
			}
	}

	printf("%s %s test finished\r\n",argv[1],argv[2]);
	return 0;
}

Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2
PiperOrigin-RevId: 251531096
2019-06-04 15:40:23 -07:00
Nicolas Lacasse 0c292cdaab Remove the Dirent field from Pipe.
Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe
raises difficult questions about the lifecycle of the Pipe and Dirent.

Fortunately, we can side-step those questions by removing the Dirent field from
Pipe entirely. We only need the Dirent when constructing fs.Files (which are
ref-counted), and in GetFile (when a Dirent is passed to us anyways).

PiperOrigin-RevId: 251497628
2019-06-04 12:58:56 -07:00
Andrei Vagin 90a116890f gvisor/sock/unix: pass creds when a message is sent between unconnected sockets
and don't report a sender address if it doesn't have one

PiperOrigin-RevId: 251371284
2019-06-03 21:48:19 -07:00
Andrei Vagin 00f8663887 gvisor/fs: return a proper error from FileWriter.Write in case of a short-write
The io.Writer contract requires that Write writes all available
bytes and does not return short writes. This causes errors with
io.Copy, since our own Write interface does not have this same
contract.

PiperOrigin-RevId: 251368730
2019-06-03 21:26:01 -07:00
Andrei Vagin 8e926e3f74 gvisor: validate a new map region in the mremap syscall
Right now, mremap allows to remap a memory region over MaxUserAddress,
this means that we can change the stub region.

PiperOrigin-RevId: 251266886
2019-06-03 10:59:46 -07:00
Nicolas Lacasse 6f73d79c32 Simplify overlayBoundEndpoint.
There is no reason to do the recursion manually, since
Inode.BoundEndpoint will do it for us.

PiperOrigin-RevId: 250794903
2019-05-30 17:20:20 -07:00
Fabricio Voznika 38de91b028 Add build guard to files using go:linkname
Funcion signatures are not validated during compilation. Since
they are not exported, they can change at any time. The guard
ensures that they are verified at least on every version upgrade.

PiperOrigin-RevId: 250733742
2019-05-30 12:09:39 -07:00
Bhasker Hariharan ae26b2c425 Fixes to TCP listen behavior.
Netstack listen loop can get stuck if cookies are in-use and the app is slow to
accept incoming connections. Further we continue to complete handshake for a
connection even if the backlog is full. This creates a problem when a lots of
connections come in rapidly and we end up with lots of completed connections
just hanging around to be delivered.

These fixes change netstack behaviour to mirror what linux does as described
here in the following article

http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html

Now when cookies are not in-use Netstack will silently drop the ACK to a SYN-ACK
and not complete the handshake if the backlog is full.  This will result in the
connection staying in a half-complete state. Eventually the sender will
retransmit the ACK and if backlog has space we will transition to a connected
state and deliver the endpoint.

Similarly when cookies are in use we do not try and create an endpoint unless
there is space in the accept queue to accept the newly created endpoint. If
there is no space then we again silently drop the ACK as we can just recreate it
when the ACK is retransmitted by the peer.

We also now use the backlog to cap the size of the SYN-RCVD queue for a given
endpoint. So at any time there can be N connections in the backlog and N in a
SYN-RCVD state if the application is not accepting connections. Any new SYNs
will be dropped.

This CL also fixes another small bug where we mark a new endpoint which has not
completed handshake as connected. We should wait till handshake successfully
completes before marking it connected.

Updates #236

PiperOrigin-RevId: 250717817
2019-05-30 12:08:41 -07:00
Michael Pratt 8d25cd0b40 Update procid for Go 1.13
Upstream Go has no changes here.

PiperOrigin-RevId: 250602731
2019-05-30 12:08:10 -07:00
chris.zn b18df9bed6 Add VmData field to /proc/{pid}/status
VmData is the size of private data segments.
It has the same meaning as in Linux.

Change-Id: Iebf1ae85940a810524a6cde9c2e767d4233ddb2a
PiperOrigin-RevId: 250593739
2019-05-30 12:07:40 -07:00
Bhasker Hariharan 035a8fa38e Add support for collecting execution trace to runsc.
Updates #220

PiperOrigin-RevId: 250532302
2019-05-30 12:07:11 -07:00
Andrei Vagin 4b9cb38157 gvisor: socket() returns EPROTONOSUPPORT if protocol is not supported
PiperOrigin-RevId: 250426407
2019-05-30 12:06:15 -07:00
Michael Pratt 507a15dce9 Always wait on tracee children
After bf959931ddb88c4e4366e96dd22e68fa0db9527c ("wait/ptrace: assume
__WALL if the child is traced") (Linux 4.7), tracees are always eligible
for waiting, regardless of type.

PiperOrigin-RevId: 250399527
2019-05-30 12:05:46 -07:00
Adin Scannell 2165b77774 Remove obsolete bug.
The original bug is no longer relevant, and the FIXME here
contains lots of obsolete information.

PiperOrigin-RevId: 249924036
2019-05-30 12:03:39 -07:00
Adin Scannell ed5793808e Remove obsolete TODO.
We don't need to model internal interfaces after the system
call interfaces (which are objectively worse and simply use a
flag to distinguish between two logically different operations).

PiperOrigin-RevId: 249916814
Change-Id: I45d02e0ec0be66b782a685b1f305ea027694cab9
2019-05-24 16:18:09 -07:00
Andrei Vagin a949133c4b gvisor: interrupt the sendfile system call if a task has been interrupted
sendfile can be called for a big range and it can require significant
amount of time to process it, so we need to handle task interrupts in
this system call.

PiperOrigin-RevId: 249781023
Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015
2019-05-23 23:21:13 -07:00
Ayush Ranjan 6240abb205 Added boilerplate code for ext4 fs.
Initialized BUILD with license
Mount is still unimplemented and is not meant to be
part of this CL. Rest of the fs interface is implemented.
Referenced the Linux kernel appropriately when needed

PiperOrigin-RevId: 249741997
Change-Id: Id1e4c7c9e68b3f6946da39896fc6a0c3dcd7f98c
2019-05-23 16:55:42 -07:00
Fabricio Voznika 9006304dfe Initial support for bind mounts
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.

PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
2019-05-23 04:16:10 -07:00
Adin Scannell 79738d3958 Log unhandled faults only at DEBUG level.
PiperOrigin-RevId: 249561399
Change-Id: Ic73c68c8538bdca53068f38f82b7260939addac2
2019-05-22 18:18:53 -07:00
Michael Pratt f65dfec096 Add WCLONE / WALL support to waitid
The previous commit adds WNOTHREAD support to waitid, so we may as well
complete the upstream change.

Linux added WCLONE, WALL, WNOTHREAD support to waitid(2) in
91c4e8ea8f05916df0c8a6f383508ac7c9e10dba ("wait: allow sys_waitid() to
accept __WNOTHREAD/__WCLONE/__WALL"). i.e., Linux 4.7.

PiperOrigin-RevId: 249560587
Change-Id: Iff177b0848a3f7bae6cb5592e44500c5a942fbeb
2019-05-22 18:11:50 -07:00
Adin Scannell 21915eb58b Remove obsolete TODO.
There no obvious reason to require that BlockSize and StatFS
are MountSource operations. Today they are in INodeOperations,
and they can be moved elsewhere in the future as part of a
normal refactor process.

PiperOrigin-RevId: 249549982
Change-Id: Ib832e02faeaf8253674475df4e385bcc53d780f3
2019-05-22 17:00:36 -07:00
Michael Pratt 711290a7f6 Add support for wait(WNOTHREAD)
PiperOrigin-RevId: 249537694
Change-Id: Iaa4bca73a2d8341e03064d59a2eb490afc3f80da
2019-05-22 15:54:23 -07:00
Kevin Krakauer c1cdf18e7b UDP and TCP raw socket support.
PiperOrigin-RevId: 249511348
Change-Id: I34539092cc85032d9473ff4dd308fc29dc9bfd6b
2019-05-22 13:45:15 -07:00
Michael Pratt 69eac1198f Move wait constants to abi/linux package
Updates #214

PiperOrigin-RevId: 249483756
Change-Id: I0d3cf4112bed75a863d5eb08c2063fbc506cd875
2019-05-22 11:15:33 -07:00
Adin Scannell ae1bb08871 Clean up pipe internals and add fcntl support
Pipe internals are made more efficient by avoiding garbage collection.
A pool is now used that can be shared by all pipes, and buffers are
chained via an intrusive list. The documentation for pipe structures
and methods is also simplified and clarified.

The pipe tests are now parameterized, so that they are run on all
different variants (named pipes, small buffers, default buffers).

The pipe buffer sizes are exposed by fcntl, which is now supported
by this change. A size change test has been added to the suite.

These new tests uncovered a bug regarding the semantics of open
named pipes with O_NONBLOCK, which is also fixed by this CL. This
fix also addresses the lack of the O_LARGEFILE flag for named pipes.

PiperOrigin-RevId: 249375888
Change-Id: I48e61e9c868aedb0cadda2dff33f09a560dee773
2019-05-21 20:12:27 -07:00
Michael Pratt c8857f7269 Fix inconsistencies in ELF anonymous mappings
* A segment with filesz == 0, memsz > 0 should be an anonymous only
  mapping. We were failing to load such an ELF.
* Anonymous pages are always mapped RW, regardless of the segment
  protections.

PiperOrigin-RevId: 249355239
Change-Id: I251e5c0ce8848cf8420c3aadf337b0d77b1ad991
2019-05-21 17:06:05 -07:00
Adin Scannell 9cdae51fec Add basic plumbing for splice and stub implementation.
This does not actually implement an efficient splice or sendfile. Rather, it
adds a generic plumbing to the file internals so that this can be added. All
file implementations use the stub fileutil.NoSplice implementation, which
causes sendfile and splice to fall back to an internal copy.

A basic splice system call interface is added, along with a test.

PiperOrigin-RevId: 249335960
Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b
2019-05-21 15:18:12 -07:00
Neel Natu adeb99709b Remove unused struct member.
Remove unused struct member.

PiperOrigin-RevId: 249300446
Change-Id: Ifb16538f684bc3200342462c3da927eb564bf52d
2019-05-21 12:20:19 -07:00
Michael Pratt 80cc2c78e5 Forward named pipe creation to the gofer
The backing 9p server must allow named pipe creation, which the runsc
fsgofer currently does not.

There are small changes to the overlay here. GetFile may block when
opening a named pipe, which can cause a deadlock:

1. open(O_RDONLY) -> copyMu.Lock() -> GetFile()
2. open(O_WRONLY) -> copyMu.Lock() -> Deadlock

A named pipe usable for writing must already be on the upper filesystem,
but we are still taking copyMu for write when checking for upper. That
can be changed to a read lock to fix the common case.

However, a named pipe on the lower filesystem would still deadlock in
open(O_WRONLY) when it tries to actually perform copy up (which would
simply return EINVAL). Move the copy up type check before taking copyMu
for write to avoid this.

p9 must be modified, as it was incorrectly removing the file mode when
sending messages on the wire.

PiperOrigin-RevId: 249154033
Change-Id: Id6637130e567b03758130eb6c7cdbc976384b7d6
2019-05-20 16:53:08 -07:00
Michael Pratt 6588427451 Fix incorrect tmpfs timestamp updates
* Creation of files, directories (and other fs objects) in a directory
  should always update ctime.
* Same for removal.
* atime should not be updated on lookup, only readdir.

I've also renamed some misleading functions that update mtime and ctime.

PiperOrigin-RevId: 249115063
Change-Id: I30fa275fa7db96d01aa759ed64628c18bb3a7dc7
2019-05-20 13:35:17 -07:00
Michael Pratt 4a842836e5 Return EPERM for mknod
This more directly matches what Linux does with unsupported
nodes.

PiperOrigin-RevId: 248780425
Change-Id: I17f3dd0b244f6dc4eb00e2e42344851b8367fbec
2019-05-17 13:47:40 -07:00
Michael Pratt 04105781ad Fix gofer rename ctime and cleanup stat_times test
There is a lot of redundancy that we can simplify in the stat_times
test. This will make it easier to add new tests. However, the
simplification reveals that cached uattrs on goferfs don't properly
update ctime on rename.

PiperOrigin-RevId: 248773425
Change-Id: I52662728e1e9920981555881f9a85f9ce04041cf
2019-05-17 13:05:47 -07:00
Andrei Vagin 2105158d4b gofer: don't call hostfile.Close if hostFile is nil
PiperOrigin-RevId: 248437159
Change-Id: Ife71f6ca032fca59ec97a82961000ed0af257101
2019-05-15 17:21:10 -07:00
Nicolas Lacasse dd153c014d Start of support for /proc/pid/cgroup file.
PiperOrigin-RevId: 248263378
Change-Id: Ic057d2bb0b6212110f43ac4df3f0ac9bf931ab98
2019-05-14 20:34:50 -07:00
Michael Pratt 330a1bbd04 Remove false comment
PiperOrigin-RevId: 248249285
Change-Id: I9b6d267baa666798b22def590ff20c9a118efd47
2019-05-14 18:06:14 -07:00
Jamie Liu 5ee8218483 Add pgalloc.DelayedEvictionManual.
PiperOrigin-RevId: 247667272
Change-Id: I16b04e11bb93f50b7e05e888992303f730e4a877
2019-05-10 13:37:48 -07:00
Fabricio Voznika 1bee43be13 Implement fallocate(2)
Closes #225

PiperOrigin-RevId: 247508791
Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-09 15:35:49 -07:00
Nicolas Lacasse bfd9f75ba4 Set the FilesytemType in MountSource from the Filesystem.
And stop storing the Filesystem in the MountSource.

This allows us to decouple the MountSource filesystem type from the name of the
filesystem.

PiperOrigin-RevId: 247292982
Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8
2019-05-08 14:35:06 -07:00
Fabricio Voznika e5432fa1b3 Remove defers from gofer.contextFile
Most are single line methods in hot paths.

PiperOrigin-RevId: 247050267
Change-Id: I428d78723fe00b57483185899dc8fa9e1f01e2ea
2019-05-07 10:55:09 -07:00
Jamie Liu 14f0e7618e Ensure all uses of MM.brk occur under MM.mappingMu in MM.Brk().
PiperOrigin-RevId: 246921386
Change-Id: I71d8908858f45a9a33a0483470d0240eaf0fd012
2019-05-06 16:39:43 -07:00
Andrei Vagin 24d8656585 gofer: don't leak file descriptors
Fixes #219

PiperOrigin-RevId: 246568639
Change-Id: Ic7afd15dde922638d77f6429c508d1cbe2e4288a
2019-05-03 14:01:50 -07:00
Michael Pratt 23ca9886c6 Update reference to old type
PiperOrigin-RevId: 246036806
Change-Id: I5554a43a1f8146c927402db3bf98488a2da0fbe7
2019-04-30 15:42:39 -07:00
Jamie Liu 8bfb83d0ac Implement async MemoryFile eviction, and use it in CachingInodeOperations.
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.

Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.

As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.

PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
2019-04-30 13:56:41 -07:00
Ian Gudger 81ecd8b6ea Implement the MSG_CTRUNC msghdr flag for Unix sockets.
Updates google/gvisor#206

PiperOrigin-RevId: 245880573
Change-Id: Ifa715e98d47f64b8a32b04ae9378d6cd6bd4025e
2019-04-29 21:21:08 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Nicolas Lacasse 2df64cd6d2 createAt should return all errors from FindInode except ENOENT.
Previously, createAt was eating all errors from FindInode except for EACCES and
proceeding with the creation. This is incorrect, as FindInode can return many
other errors (like ENAMETOOLONG) that should stop creation.

This CL changes createAt to return all errors encountered except for ENOENT,
which we can ignore because we are about to create the thing.

PiperOrigin-RevId: 245773222
Change-Id: I1b317021de70f0550fb865506f6d8147d4aebc56
2019-04-29 10:30:24 -07:00
Adin Scannell 5749f64314 kvm: remove non-sane sanity check
Apparently some platforms don't have pSize < vSize.

Fixes #208

PiperOrigin-RevId: 245480998
Change-Id: I2a98229912f4ccbfcd8e79dfa355104f14275a9c
2019-04-26 13:53:12 -07:00
Kevin Krakauer 5f13338d30 Fix reference counting bug in /proc/PID/fdinfo/.
PiperOrigin-RevId: 245452217
Change-Id: I7164d8f57fe34c17e601079eb9410a6d95af1869
2019-04-26 11:09:55 -07:00
Michael Pratt f17cfa4d53 Perform explicit CPUID and FP state compatibility checks on restore
PiperOrigin-RevId: 245341004
Change-Id: Ic4d581039d034a8ae944b43e45e84eb2c3973657
2019-04-25 17:47:05 -07:00
Jamie Liu 6b76c172b4 Don't enforce NAME_MAX in fs.Dirent.walk().
Maximum filename length is filesystem-dependent, and obtained via
statfs::f_namelen. This limit is usually 255 bytes (NAME_MAX), but not
always. For example, VFAT supports filenames of up to 255... UCS-2
characters, which Linux conservatively takes to mean UTF-8-encoded
bytes: fs/fat/inode.c:fat_statfs(), FAT_LFN_LEN * NLS_MAX_CHARSET_SIZE.
As a result, Linux's VFS does not enforce NAME_MAX:

$ rg --maxdepth=1 '\WNAME_MAX\W' fs/ include/linux/
fs/libfs.c
38:     buf->f_namelen = NAME_MAX;
64:     if (dentry->d_name.len > NAME_MAX)

include/linux/relay.h
74:     char base_filename[NAME_MAX];   /* saved base filename */

include/linux/fscrypt.h
149: * filenames up to NAME_MAX bytes, since base64 encoding expands the length.

include/linux/exportfs.h
176: *    understanding that it is already pointing to a a %NAME_MAX+1 sized

Remove this check from core VFS, and add it to ramfs (and by extension
tmpfs), where it is actually applicable:
mm/shmem.c:shmem_dir_inode_operations.lookup == simple_lookup *does*
enforce NAME_MAX.

PiperOrigin-RevId: 245324748
Change-Id: I17567c4324bfd60e31746a5270096e75db963fac
2019-04-25 16:05:13 -07:00
Wei Zhang 17ff6063a3 Bugfix: fix fstatat symbol link to dir
For a symbol link to some directory, eg.

`/tmp/symlink -> /tmp/dir`

`fstatat("/tmp/symlink")` should return symbol link data, but
`fstatat("/tmp/symlink/")` (symlink with trailing slash) should return
directory data it points following linux behaviour.

Currently fstatat() a symlink with trailing slash will get "not a
directory" error which is wrong.

Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I63469b1fb89d083d1c1255d32d52864606fbd7e2
PiperOrigin-RevId: 244783916
2019-04-22 20:07:06 -07:00
Michael Pratt d6aac9387f Fix doc typo
PiperOrigin-RevId: 244773890
Change-Id: I2d0cd7789771276ba545b38efff6d3e24133baaa
2019-04-22 18:22:19 -07:00
Michael Pratt f86c35a51f Clean up state error handling
PiperOrigin-RevId: 244773836
Change-Id: I32223f79d2314fe1ac4ddfc63004fc22ff634adf
2019-04-22 18:20:51 -07:00
Ian Gudger 358eb52a76 Add support for the MSG_TRUNC msghdr flag.
The MSG_TRUNC flag is set in the msghdr when a message is truncated.

Fixes google/gvisor#200

PiperOrigin-RevId: 244440486
Change-Id: I03c7d5e7f5935c0c6b8d69b012db1780ac5b8456
2019-04-19 16:17:01 -07:00
Michael Pratt c931c8e082 Format struct pollfd in poll(2)/ppoll(2)
I0410 15:40:38.854295    3776 x:0] [   1] poll_test E poll(0x2b00bfb5c020 [{FD: 0x3 anon_inode:[eventfd], Events: POLLOUT, REvents: ...}], 0x1, 0x1)
I0410 15:40:38.854348    3776 x:0] [   1] poll_test X poll(0x2b00bfb5c020 [{FD: 0x3 anon_inode:[eventfd], Events: POLLOUT|POLLERR|POLLHUP, REvents: POLLOUT}], 0x1, 0x1) = 0x1 (10.765?s)

PiperOrigin-RevId: 244269879
Change-Id: If07ba54a486fdeaaedfc0123769b78d1da862307
2019-04-18 15:24:07 -07:00
Ian Gudger 133700007a Only emit unimplemented syscall events for unsupported values.
Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER
when attempting to set unsupported values.

PiperOrigin-RevId: 244229675
Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1
2019-04-18 11:51:41 -07:00
Michael Pratt b52cbd6028 Don't allow sigtimedwait to catch unblockable signals
The existing logic attempting to do this is incorrect. Unary ^ has
higher precedence than &^, so mask always has UnblockableSignals
cleared, allowing dequeueSignalLocked to dequeue unblockable signals
(which allows userspace to ignore them).

Switch the logic so that unblockable signals are always masked.

PiperOrigin-RevId: 244058487
Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3
2019-04-17 13:43:20 -07:00
Fabricio Voznika c8cee7108f Use FD limit and file size limit from host
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.

PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17 12:57:40 -07:00
Michael Pratt 08d99c5fbe Convert poll/select to operate more directly on linux.PollFD
Current, doPoll copies the user struct pollfd array into a
[]syscalls.PollFD, which contains internal kdefs.FD and
waiter.EventMask types. While these are currently binary-compatible with
the Linux versions, we generally discourage copying directly to internal
types (someone may inadvertantly change kdefs.FD to uint64).

Instead, copy directly to a []linux.PollFD, which will certainly be
binary compatible. Most of syscalls/polling.go is included directly into
syscalls/linux/sys_poll.go, as it can then operate directly on
linux.PollFD. The additional syscalls.PollFD type is providing little
value.

I've also added explicit conversion functions for waiter.EventMask,
which creates the possibility of a different binary format.

PiperOrigin-RevId: 244042947
Change-Id: I24e5b642002a32b3afb95a9dcb80d4acd1288abf
2019-04-17 12:15:01 -07:00
Michael Pratt 6b24f7ab08 Format FDs in strace logs
Normal files display their path in the current mount namespace:

I0410 10:57:54.964196  216336 x:0] [   1] ls X read(0x3 /proc/filesystems, 0x55cee3bdb2c0 "nodev\t9p\nnodev\tdevpts \nnodev\tdevtmpfs\nnodev\tproc\nnodev\tramdiskfs\nnodev\tsysfs\nnodev\ttmpfs\n", 0x1000) = 0x58 (24.462?s)

AT_FDCWD includes the CWD:

I0411 12:58:48.278427    1526 x:0] [   1] stat_test E newfstatat(AT_FDCWD /home/prattmic, 0x55ea719b564e /proc/self, 0x7ef5cefc2be8, 0x0)

Sockets (and other non-vfs files) display an inode number (like
/proc/PID/fd):

I0410 10:54:38.909123  207684 x:0] [   1] nc E bind(0x3 socket:[1], 0x55b5a1652040 {Family: AF_INET, Addr: , Port: 8080}, 0x10)

I also fixed a few syscall args that should be Path.

PiperOrigin-RevId: 243169025
Change-Id: Ic7dda6a82ae27062fe2a4a371557acfd6a21fa2a
2019-04-11 16:48:39 -07:00
Jamie Liu 4209edafb6 Use open fids when fstat()ing gofer files.
PiperOrigin-RevId: 243018347
Change-Id: I1e5b80607c1df0747482abea61db7fcf24536d37
2019-04-11 00:43:04 -07:00
Michael Pratt cc48969bb7 Internal change
PiperOrigin-RevId: 242978508
Change-Id: I0ea59ac5ba1dd499e87c53f2e24709371048679b
2019-04-10 18:00:18 -07:00
Nicolas Lacasse d93d19fd4e Fix uses of RootFromContext.
RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.

PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
2019-04-10 16:36:28 -07:00
Yong He 89cc8eef9b DATA RACE in fs.(*Dirent).fullName
add renameMu.Lock when oldParent == newParent
in order to avoid data race in following report:

WARNING: DATA RACE
Read at 0x00c000ba2160 by goroutine 405:
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).fullName()
      pkg/sentry/fs/dirent.go:246 +0x6c
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).FullName()
      pkg/sentry/fs/dirent.go:356 +0x8b
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*FDMap).String()
      pkg/sentry/kernel/fd_map.go:135 +0x1e0
  fmt.(*pp).handleMethods()
      GOROOT/src/fmt/print.go:603 +0x404
  fmt.(*pp).printArg()
      GOROOT/src/fmt/print.go:686 +0x255
  fmt.(*pp).doPrintf()
      GOROOT/src/fmt/print.go:1003 +0x33f
  fmt.Fprintf()
      GOROOT/src/fmt/print.go:188 +0x7f
  gvisor.googlesource.com/gvisor/pkg/log.(*Writer).Emit()
      pkg/log/log.go:121 +0x89
  gvisor.googlesource.com/gvisor/pkg/log.GoogleEmitter.Emit()
      pkg/log/glog.go:162 +0x1acc
  gvisor.googlesource.com/gvisor/pkg/log.(*GoogleEmitter).Emit()
      <autogenerated>:1 +0xe1
  gvisor.googlesource.com/gvisor/pkg/log.(*BasicLogger).Debugf()
      pkg/log/log.go:177 +0x111
  gvisor.googlesource.com/gvisor/pkg/log.Debugf()
      pkg/log/log.go:235 +0x66
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Debugf()
      pkg/sentry/kernel/task_log.go:48 +0xfe
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).DebugDumpState()
      pkg/sentry/kernel/task_log.go:66 +0x11f
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
      pkg/sentry/kernel/task_run.go:272 +0xc80
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
      pkg/sentry/kernel/task_run.go:91 +0x24b

Previous write at 0x00c000ba2160 by goroutine 423:
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.Rename()
      pkg/sentry/fs/dirent.go:1628 +0x61f
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1.1()
      pkg/sentry/syscalls/linux/sys_file.go:1864 +0x1f8
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt(  gvisor.googlesource.com/g/linux/sys_file.go:51 +0x20f
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1()
      pkg/sentry/syscalls/linux/sys_file.go:1852 +0x218
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt()
      pkg/sentry/syscalls/linux/sys_file.go:51 +0x20f
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt()
      pkg/sentry/syscalls/linux/sys_file.go:1840 +0x180
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Rename()
      pkg/sentry/syscalls/linux/sys_file.go:1873 +0x60
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
      pkg/sentry/kernel/task_syscall.go:165 +0x17a
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
      pkg/sentry/kernel/task_syscall.go:283 +0xb4
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
      pkg/sentry/kernel/task_syscall.go:244 +0x10c
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
      pkg/sentry/kernel/task_syscall.go:219 +0x1e3
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
      pkg/sentry/kernel/task_run.go:215 +0x15a9
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
      pkg/sentry/kernel/task_run.go:91 +0x24b

Reported-by: syzbot+e1babbf756fab380dfff@syzkaller.appspotmail.com
Change-Id: Icd2620bb3ea28b817bf0672d454a22b9d8ee189a
PiperOrigin-RevId: 242938741
2019-04-10 14:17:33 -07:00
Kevin Krakauer f7aff0aaa4 Allow threads with CAP_SYS_RESOURCE to raise hard rlimits.
PiperOrigin-RevId: 242919489
Change-Id: Ie3267b3bcd8a54b54bc16a6556369a19e843376f
2019-04-10 12:36:45 -07:00
Nicolas Lacasse 0a0619216e Start saving MountSource.DirentCache.
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save.  There is no reason not to save it along with the MountSource.

This did uncover an issue where not all MountSources were properly flushed
before Save.  If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first.  This CL also fixes that by flushing
all MountSources for all open FDs on Save.

PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
2019-04-10 11:27:16 -07:00
Shiva Prasanth 7140b1fdca Fixed /proc/cpuinfo permissions
This also applies these permissions to other static proc files.

Change-Id: I4167e585fed49ad271aa4e1f1260babb3239a73d
PiperOrigin-RevId: 242898575
2019-04-10 10:49:43 -07:00
Li Qiang b3b140ea4f syscalls: sendfile: limit the count to MAX_RW_COUNT
From sendfile spec and also the linux kernel code, we should
limit the count arg to 'MAX_RW_COUNT'. This patch export
'MAX_RW_COUNT' in kernel pkg and use it in the implementation
of sendfile syscall.

Signed-off-by: Li Qiang <pangpei.lq@antfin.com>
Change-Id: I1086fec0685587116984555abd22b07ac233fbd2
PiperOrigin-RevId: 242745831
2019-04-09 14:57:05 -07:00
Bhasker Hariharan eaac2806ff Add TCP checksum verification.
PiperOrigin-RevId: 242704699
Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f
2019-04-09 11:23:47 -07:00
Jamie Liu 9471c01348 Export kernel.SignalInfoPriv.
Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks.

PiperOrigin-RevId: 242562428
Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90
2019-04-08 16:32:11 -07:00
Nicolas Lacasse 70906f1d24 Intermediate ram fs dirs should be writable.
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.

We were creating these directories without write permissions, which meant that
they were not wribable even when underlayed under a writable filesystem. They
should be writable.

PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
2019-04-08 11:56:38 -07:00
Nicolas Lacasse ee7e6d33b2 Use string type for extended attribute values, instead of []byte.
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.

PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
2019-04-05 15:49:39 -07:00
Andrei Vagin 88409e983c gvisor: Add support for the MS_NOEXEC mount option
https://github.com/google/gvisor/issues/145

PiperOrigin-RevId: 242044115
Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
2019-04-04 17:43:53 -07:00
Michael Pratt 75a5ccf5d9 Remove defer from trivial ThreadID methods
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.

PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
2019-04-04 17:14:27 -07:00
Adin Scannell 75c8ac38e0 BUILD: Add useful go_path target
Change-Id: Ibd6d8a1a63826af6e62a0f0669f8f0866c8091b4
PiperOrigin-RevId: 242037969
2019-04-04 17:05:38 -07:00
Michael Pratt 9cf33960fc Only CopyOut CPU when it changes
This will save copies when preemption is not caused by a CPU migration.

PiperOrigin-RevId: 241844399
Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
2019-04-03 18:06:36 -07:00
Nicolas Lacasse 61d8c361c6 Don't release d.mu in checks for child-existence.
Dirent.exists() is called in Create to check whether a child with the given
name already exists.

Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu
while calling d.Inode.Lookup. During this existence check, a racing Rename()
can acquire d.mu and create a new child of the dirent with the same name.
(Note that the source and destination of the rename must be in the same
directory, otherwise renameMu will be taken preventing the race.) In this
case, d.exists() can return false, even though a child with the same name
actually does exist.

This CL changes d.exists() so that it does not release d.mu while walking, thus
preventing the race with Rename.

It also adds comments noting that lockForRename may not take renameMu if the
source and destination are in the same directory, as this is a bit surprising
(at least it was to me).

PiperOrigin-RevId: 241842579
Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
2019-04-03 17:53:56 -07:00
Michael Pratt 4968dd1341 Cache ThreadGroups in PIDNamespace
If there are thousands of threads, ThreadGroupsAppend becomes very
expensive as it must iterate over all Tasks to find the ThreadGroup
leaders.

Reduce the cost by maintaining a map of ThreadGroups which can be used
to grab them all directly.

The one somewhat visible change is to convert PID namespace init
children zapping to a group-directed SIGKILL, as Linux did in
82058d668465 "signal: Use group_send_sig_info to kill all processes in a
pid namespace".

In a benchmark that creates N threads which sleep for two minutes, we
see approximately this much CPU time in ThreadGroupsAppend:

Before:

1 thread: 0ms
1024 threads: 30ms - 9130ms
4096 threads: 50ms - 2000ms
8192 threads: 18160ms
16384 threads: 17210ms

After:

1 thread: 0ms
1024 threads: 0ms
4096 threads: 0ms
8192 threads: 0ms
16384 threads: 0ms

The profiling is actually extremely noisy (likely due to cache effects),
as some runs show almost no samples at 1024, 4096 threads, but obviously
this does not scale to lots of threads.

PiperOrigin-RevId: 241828039
Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
2019-04-03 16:22:43 -07:00
Kevin Krakauer 82529becae Fix index out of bounds in tty implementation.
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for
?. However, peekRune also returned the length of the byte (1). When calling
utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte
character ?.

tl;dr: I apparently didn't understand runes when I wrote this.

PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
2019-04-03 13:00:34 -07:00
Kevin Krakauer c79e81bd27 Addresses data race in tty implementation.
Also makes the safemem reading and writing inline, as it makes it easier to see
what locks are held.

PiperOrigin-RevId: 241775201
Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
2019-04-03 11:49:55 -07:00
Ian Lewis 77f01ee3c7 Add syscall annotations for unimplemented syscalls
Added syscall annotations for unimplemented syscalls for later generation into
reference docs. Annotations are of the form:
@Syscall(<name>, <key:value>, ...)

Supported args and values are:

- arg: A syscall option. This entry only applies to the syscall when given this
       option.
- support: Indicates support level
  - UNIMPLEMENTED: Unimplemented (implies returns:ENOSYS)
  - PARTIAL: Partial support. Details should be provided in note.
  - FULL: Full support
- returns: Indicates a known return value. Values are
           syscall errors. This is treated as a string so you can use something
           like "returns:EPERM or ENOSYS".
- issue: A Github issue number.
- note: A note

Example:
// @Syscall(mmap, arg:MAP_PRIVATE, support:FULL, note:Private memory fully supported)
// @Syscall(mmap, arg:MAP_SHARED, support:UNIMPLEMENTED, issue:123, note:Shared memory not supported)
// @Syscall(setxattr, returns:ENOTSUP, note:Requires file system support)

Annotations should be placed as close to their implementation as possible
(preferrably as part of a supporting function's Godoc) and should be updated as
syscall support changes.

PiperOrigin-RevId: 241697482
Change-Id: I7a846135db124e1271dc5057d788cba82ca312d4
2019-04-03 03:10:23 -07:00
Jamie Liu c4caccd540 Set options on the correct Task in PTRACE_SEIZE.
$ docker run --rm --runtime=runsc -it --cap-add=SYS_PTRACE debian bash -c "apt-get update && apt-get install strace && strace ls"
...
Setting up strace (4.15-2) ...
execve("/bin/ls", ["ls"], [/* 6 vars */]) = 0
brk(NULL)                               = 0x5646d8c1e000
uname({sysname="Linux", nodename="114ef93d2db3", ...}) = 0
...

PiperOrigin-RevId: 241643321
Change-Id: Ie4bce27a7fb147eef07bbae5895c6ef3f529e177
2019-04-02 18:13:19 -07:00
Nicolas Lacasse 1776ab28f0 Add test that symlinking over a directory returns EEXIST.
Also remove comments in InodeOperations that required that implementation of
some Create* operations ensure that the name does not already exist, since
these checks are all centralized in the Dirent.

PiperOrigin-RevId: 241637335
Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
2019-04-02 17:28:36 -07:00
Rahat Mahmood d14a7de658 Fix more data races in shm debug messages.
PiperOrigin-RevId: 241630409
Change-Id: Ie0df5f5a2f20c2d32e615f16e2ba43c88f963181
2019-04-02 16:46:32 -07:00
Wei Zhang 1fcd40719d device: fix device major/minor
Current gvisor doesn't give devices a right major and minor number.

When testing golang supporting of gvisor, I run the test case below:

```
$ docker run -ti --runtime runsc golang:1.12.1 bash -c "cd /usr/local/go/src && ./run.bash "
```

And it reports some errors, one of them is:

"--- FAIL: TestDevices (0.00s)
    --- FAIL: TestDevices//dev/null_1:3 (0.00s)
        dev_linux_test.go:45: for /dev/null Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/null Minor(0x0) == 0, want 3
        dev_linux_test.go:51: for /dev/null Mkdev(1, 3) == 0x103, want 0x0
    --- FAIL: TestDevices//dev/zero_1:5 (0.00s)
        dev_linux_test.go:45: for /dev/zero Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/zero Minor(0x0) == 0, want 5
        dev_linux_test.go:51: for /dev/zero Mkdev(1, 5) == 0x105, want 0x0
    --- FAIL: TestDevices//dev/random_1:8 (0.00s)
        dev_linux_test.go:45: for /dev/random Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/random Minor(0x0) == 0, want 8
        dev_linux_test.go:51: for /dev/random Mkdev(1, 8) == 0x108, want 0x0
    --- FAIL: TestDevices//dev/full_1:7 (0.00s)
        dev_linux_test.go:45: for /dev/full Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/full Minor(0x0) == 0, want 7
        dev_linux_test.go:51: for /dev/full Mkdev(1, 7) == 0x107, want 0x0
    --- FAIL: TestDevices//dev/urandom_1:9 (0.00s)
        dev_linux_test.go:45: for /dev/urandom Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/urandom Minor(0x0) == 0, want 9
        dev_linux_test.go:51: for /dev/urandom Mkdev(1, 9) == 0x109, want 0x0
"

So I think we'd better assign to them correct major/minor numbers following linux spec.

Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I4521ee7884b4e214fd3a261929e3b6dac537ada9
PiperOrigin-RevId: 241609021
2019-04-02 14:51:07 -07:00
Rahat Mahmood 7cff746ef2 Save/restore simple devices.
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.

PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-04-01 15:39:16 -07:00
Jamie Liu b4006686d2 Don't expand COW-break on executable VMAs.
PiperOrigin-RevId: 241403847
Change-Id: I4631ca05734142da6e80cdfa1a1d63ed68aa05cc
2019-04-01 14:47:31 -07:00
Andrei Vagin a4b34e2637 gvisor: convert ilist to ilist:generic_list
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
2019-04-01 12:53:27 -07:00
Jamie Liu 26e8d9981f Use kernel.Task.CopyScratchBuffer in syscalls/linux where possible.
PiperOrigin-RevId: 241072126
Change-Id: Ib4d9f58f550732ac4c5153d3cf159a5b1a9749da
2019-03-29 16:25:33 -07:00
Nicolas Lacasse e8fef3d873 Treat fsync errors during save as SaveRejection errors.
PiperOrigin-RevId: 241055485
Change-Id: I70259e9fef59bdf9733b35a2cd3319359449dd45
2019-03-29 14:48:16 -07:00
Michael Pratt d11ef20a93 Drop reference on shared anon mappable
We call NewSharedAnonMappable simply to use it for Mappable/MappingIdentity for
shared anon mmap. From MMapOpts.MappingIdentity: "If MMapOpts is used to
successfully create a memory mapping, a reference is taken on MappingIdentity."

mm.createVMALocked (below) takes this additional reference, so we don't need
the reference returned by NewSharedAnonMappable. Holding it leaks the mappable.

PiperOrigin-RevId: 241038108
Change-Id: I78ee3af78e0cc7aac4063b274b30d0e41eb5677d
2019-03-29 13:17:56 -07:00
Jamie Liu 69afd0438e Return srclen in proc.idMapFileOperations.Write.
PiperOrigin-RevId: 241037926
Change-Id: I4b0381ac1c7575e8b861291b068d3da22bc03850
2019-03-29 13:16:46 -07:00
Nicolas Lacasse ed23f54709 Treat ENOSPC as a state-file error during save.
PiperOrigin-RevId: 241028806
Change-Id: I770bf751a2740869a93c3ab50370a727ae580470
2019-03-29 12:26:25 -07:00
chris.zn 31c2236e97 set task's name when fork
When fork a child process, the name filed of TaskContext is not set.
It results in that when we cat /proc/{pid}/status, the name filed is
null.

Like this:
Name:
State:  S (sleeping)
Tgid:   28
Pid:    28
PPid:   26
TracerPid:      0
FDSize: 8
VmSize: 89712 kB
VmRSS:  6648 kB
Threads:        1
CapInh: 00000000a93d35fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a93d35fb
Seccomp:        0
Change-Id: I5d469098c37cedd19da16b7ffab2e546a28a321e
PiperOrigin-RevId: 240893304
2019-03-28 18:05:42 -07:00
Nicolas Lacasse 99195b0e16 Setting timestamps should trigger an inotify event.
PiperOrigin-RevId: 240850187
Change-Id: I1458581b771a1031e47bba439e480829794927b8
2019-03-28 14:15:23 -07:00
Bert Muthalaly f2e5dcf21c Add ICMP stats
PiperOrigin-RevId: 240848882
Change-Id: I23dd4599f073263437aeab357c3f767e1a432b82
2019-03-28 14:09:20 -07:00
Googler e373d3642e Internal change.
PiperOrigin-RevId: 240842801
Change-Id: Ibbd6f849f9613edc1b1dd7a99a97d1ecdb6e9188
2019-03-28 13:43:47 -07:00
Jamie Liu f005350c93 Clean up gofer handle caching.
- Document fsutil.CachedFileObject.FD() requirements on access
permissions, and change gofer.inodeFileState.FD() to honor them.
Fixes #147.

- Combine gofer.inodeFileState.readonly and
gofer.inodeFileState.readthrough, and simplify handle caching logic.

- Inline gofer.cachePolicy.cacheHandles into
gofer.inodeFileState.setSharedHandles, because users with access to
gofer.inodeFileState don't necessarily have access to the fs.Inode
(predictably, this is a save/restore problem).

Before this CL:

$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@34d51017ed67:/# /root/repro/runsc-b147
mmap: 0x7f3c01e45000
Segmentation fault

After this CL:

$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@d3c3cb56bbf9:/# /root/repro/runsc-b147
mmap: 0x7f78987ec000
o
PiperOrigin-RevId: 240818413
Change-Id: I49e1d4a81a0cb9177832b0a9f31a10da722a896b
2019-03-28 11:43:51 -07:00
Nicolas Lacasse 9c18897887 Add rsslim field in /proc/pid/stat.
PiperOrigin-RevId: 240681675
Change-Id: Ib214106e303669fca2d5c744ed5c18e835775161
2019-03-27 17:44:38 -07:00
Nicolas Lacasse 2d355f0e8f Add start time to /proc/<pid>/stat.
The start time is the number of clock ticks between the boot time and
application start time.

PiperOrigin-RevId: 240619475
Change-Id: Ic8bd7a73e36627ed563988864b0c551c052492a5
2019-03-27 12:41:27 -07:00
Nicolas Lacasse 645af7cdd8 Dev device methods should take pointer receiver.
PiperOrigin-RevId: 240600504
Change-Id: I7dd5f27c8da31f24b68b48acdf8f1c19dbd0c32d
2019-03-27 11:08:50 -07:00
Jamie Liu 26583e413e Convert []byte to string without copying in usermem.CopyStringIn.
This is the same technique used by Go's strings.Builder
(https://golang.org/src/strings/builder.go#L45), and for the same
reason. (We can't just use strings.Builder because there's no way to get
the underlying []byte to pass to usermem.IO.CopyIn.)

PiperOrigin-RevId: 240594892
Change-Id: Ic070e7e480aee53a71289c7c120850991358c52c
2019-03-27 10:46:28 -07:00
Rahat Mahmood 06ec97a3f8 Implement memfd_create.
Memfds are simply anonymous tmpfs files with no associated
mounts. Also implementing file seals, which Linux only implements for
memfds at the moment.

PiperOrigin-RevId: 240450031
Change-Id: I31de78b950101ae8d7a13d0e93fe52d98ea06f2f
2019-03-26 16:16:57 -07:00
Jamie Liu f3723f8059 Call memmap.Mappable.Translate with more conservative usermem.AccessType.
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate
(although it unsets AccessType.Write if the vma is private). This
somewhat simplifies handling of pmas, since it means only COW-break
needs to replace existing pmas. However, it also means that a MAP_SHARED
mapping of a file opened O_RDWR dirties the file, regardless of the
mapping's permissions and whether or not the mapping is ever actually
written to with I/O that ignores permissions (e.g.
ptrace(PTRACE_POKEDATA)).

To fix this:

- Change the pma-getting path to request only the permissions that are
required for the calling access.

- Change memmap.Mappable.Translate to take requested permissions, and
return allowed permissions. This preserves the existing behavior in the
common cases where the memmap.Mappable isn't
fsutil.CachingInodeOperations and doesn't care if the translated
platform.File pages are written to.

- Change the MM.getPMAsLocked path to support permission upgrading of
pmas outside of copy-on-write.

PiperOrigin-RevId: 240196979
Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
2019-03-25 12:42:43 -07:00
Andrei Vagin ddc05e3053 epoll: use ilist:generic_list instead of ilist:ilist
ilist:generic_list works faster than ilist:ilist.

Here is a beanchmark test to measure performance of epoll_wait, when readyList
isn't empty. It shows about 30% better performance with these changes.

Benchmark           Time(ns)        CPU(ns)     Iterations
Before:
BM_EpollAllEvents      46725          46899          14286

After:
BM_EpollAllEvents      33167          33300          18919
PiperOrigin-RevId: 240185278
Change-Id: I3e33f9b214db13ab840b91613400525de5b58d18
2019-03-25 11:41:50 -07:00
Nicolas Lacasse b81bfd6013 lstat should resolve the final path component if it ends in a slash.
PiperOrigin-RevId: 239896221
Change-Id: I0949981fe50c57131c5631cdeb10b225648575c0
2019-03-22 17:38:13 -07:00
Jamie Liu 3d0b960112 Implement PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN.
PiperOrigin-RevId: 239803092
Change-Id: I42d612ed6a889e011e8474538958c6de90c6fcab
2019-03-22 08:55:44 -07:00
Yong He 45ba52f824 Allow BP and OF can be called from user space
Change the DPL from 0 to 3 for Breakpoint and Overflow,
then user space could trigger Breakpoint and Overflow
as excepected.

Change-Id: Ibead65fb8c98b32b7737f316db93b3a8d9dcd648
PiperOrigin-RevId: 239736648
2019-03-21 22:04:50 -07:00
Kevin Krakauer 0cd5f20044 Replace manual pty copies to/from userspace with safemem operations.
Also, changing queue.writeBuf from a buffer.Bytes to a [][]byte should reduce
copying and reallocating of slices.

PiperOrigin-RevId: 239713547
Change-Id: I6ee5ff19c3ee2662f1af5749cae7b73db0569e96
2019-03-21 18:05:07 -07:00
Ian Gudger ba828233b9 Clear msghdr flags on successful recvmsg.
.net sets these flags to -1 and then uses their result, especting it to be
zero.

Does not set actual flags (e.g. MSG_TRUNC), but setting to zero is more correct
than what we did before.

PiperOrigin-RevId: 239657951
Change-Id: I89c5f84bc9b94a2cd8ff84e8ecfea09e01142030
2019-03-21 13:19:11 -07:00
Andrei Vagin 064fda1a75 gvisor: don't allocate a new credential object on fork
A credential object is immutable, so we don't need to copy it for a new
task.

PiperOrigin-RevId: 239519266
Change-Id: I0632f641fdea9554779ac25d84bee4231d0d18f2
2019-03-20 18:41:00 -07:00
Rahat Mahmood 81f4829d11 Record sockets created during accept(2) for all families.
Track new sockets created during accept(2) in the socket table for all
families. Previously we were only doing this for unix domain sockets.

PiperOrigin-RevId: 239475550
Change-Id: I16f009f24a06245bfd1d72ffd2175200f837c6ac
2019-03-20 14:31:16 -07:00
Andrei Vagin 87cce0ec08 netstack: reduce MSS from SYN to account tcp options
See: https://tools.ietf.org/html/rfc6691#section-2
PiperOrigin-RevId: 239305632
Change-Id: Ie8eb912a43332e6490045dc95570709c5b81855e
2019-03-19 17:33:20 -07:00
Fabricio Voznika 7b33df6845 Fix data race in netlink send buffer size
PiperOrigin-RevId: 239221041
Change-Id: Icc19e32a00fa89167447ab2f45e90dcfd61bea04
2019-03-19 10:38:50 -07:00
Michael Pratt 8a499ae65f Remove references to replaced child in Rename in ramfs/agentfs
In the case of a rename replacing an existing destination inode, ramfs
Rename failed to first remove the replaced inode. This caused:

1. A leak of a reference to the inode (making it live indefinitely).
2. For directories, a leak of the replaced directory's .. link to the
   parent. This would cause the parent's link count to incorrectly
   increase.

(2) is much simpler to test than (1), so that's what I've done.

agentfs has a similar bug with link count only, so the Dirent layer
informs the Inode if this is a replacing rename.

Fixes #133

PiperOrigin-RevId: 239105698
Change-Id: I4450af2462d8ae3339def812287213d2cbeebde0
2019-03-18 18:40:06 -07:00
Rahat Mahmood cea1dd7d21 Remove racy access to shm fields.
PiperOrigin-RevId: 239016776
Change-Id: Ia7af4258e7c69b16a4630a6f3278aa8e6b627746
2019-03-18 10:49:03 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Jamie Liu fb9919881c Use WalkGetAttr in gofer.inodeOperations.Create.
p9.Twalk.handle() with a non-empty path also stats the walked-to path
anyway, so the preceding GetAttr is completely wasted.

PiperOrigin-RevId: 238440645
Change-Id: I7fbc7536f46b8157639d0d1f491e6aaa9ab688a3
2019-03-14 07:43:15 -07:00
Nicolas Lacasse 2512cc5617 Allow filesystem.Mount to take an optional interface argument.
PiperOrigin-RevId: 238360231
Change-Id: I5eaf8d26f8892f77d71c7fbd6c5225ef471cedf1
2019-03-13 19:24:03 -07:00
Jamie Liu 8930e79ebf Clarify the platform.File interface.
- Redefine some memmap.Mappable, platform.File, and platform.Memory
semantics in terms of File reference counts (no functional change).

- Make AddressSpace.MapFile take a platform.File instead of a raw FD,
and replace platform.File.MapInto with platform.File.FD. This allows
kvm.AddressSpace.MapFile to always use platform.File.MapInternal instead
of maintaining its own (redundant) cache of file mappings in the sentry
address space.

PiperOrigin-RevId: 238044504
Change-Id: Ib73a11e4275c0da0126d0194aa6c6017a9cef64f
2019-03-12 10:29:16 -07:00
Adin Scannell 6e6dbf0e56 kvm: minimum guest/host timekeeping delta.
PiperOrigin-RevId: 237927368
Change-Id: I359badd1967bb118fe74eab3282c946c18937edc
2019-03-11 18:19:45 -07:00
Fabricio Voznika bc9b979b94 Add profiling commands to runsc
Example:
  runsc debug --root=<dir> \
      --profile-heap=/tmp/heap.prof \
      --profile-cpu=/tmp/cpu.prod --profile-delay=30 \
      <container ID>
PiperOrigin-RevId: 237848456
Change-Id: Icff3f20c1b157a84d0922599eaea327320dad773
2019-03-11 11:47:30 -07:00
Ian Gudger 71d53382bf Fix getsockopt(IP_MULTICAST_IF).
getsockopt(IP_MULTICAST_IF) only supports struct in_addr.

Also adds support for setsockopt(IP_MULTICAST_IF) with struct in_addr.

PiperOrigin-RevId: 237620230
Change-Id: I75e7b5b3e08972164eb1906f43ddd67aedffc27c
2019-03-09 11:40:51 -08:00
Ian Gudger 281092e842 Make IP_MULTICAST_LOOP and IP_MULTICAST_TTL allow setting int or char.
This is the correct Linux behavior, and at least PHP depends on it.

PiperOrigin-RevId: 237565639
Change-Id: I931af09c8ed99a842cf70d22bfe0b65e330c4137
2019-03-08 20:27:58 -08:00
Ian Gudger 56a6128295 Implement IP_MULTICAST_LOOP.
IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default
route are looped back. In order to implement this switch, support for sending
and looping back multicast packets on the default route had to be implemented.

For now we only support IPv4 multicast.

PiperOrigin-RevId: 237534603
Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c
2019-03-08 15:49:17 -08:00
Nicolas Lacasse fbacb35039 No need to check for negative uintptr.
Fixes #134

PiperOrigin-RevId: 237128306
Change-Id: I396e808484c18931fc5775970ec1f5ae231e1cb9
2019-03-06 15:06:46 -08:00
Fabricio Voznika 0b76887147 Priority-inheritance futex implementation
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.

PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
2019-03-05 23:40:18 -08:00
Bhasker Hariharan 1718fdd1a8 Add new retransmissions and recovery related metrics.
PiperOrigin-RevId: 236945145
Change-Id: I051760d95154ea5574c8bb6aea526f488af5e07b
2019-03-05 16:41:44 -08:00
Kevin Krakauer 23e66ee96d Remove unused commit() function argument to Bind.
PiperOrigin-RevId: 236926132
Change-Id: I5cf103f22766e6e65a581de780c7bb9ca0fa3181
2019-03-05 14:53:34 -08:00
Nicolas Lacasse 0d683c9961 Make tmpfs respect MountNoATime now that fs.Handle is gone.
PiperOrigin-RevId: 236752802
Change-Id: I9e50600b2ae25d5f2ac632c4405a7a185bdc3c92
2019-03-04 16:57:14 -08:00
Adin Scannell d811c1016d ptrace: drop old FIXME
The globalPool uses a sync.Once mechanism for initialization,
and no cleanup is strictly required. It's not really feasible
to have the platform implement a full creation -> destruction
cycle (due to the way filters are assumed to be installed), so
drop the FIXME.

PiperOrigin-RevId: 236385278
Change-Id: I98ac660ed58cc688d8a07147d16074a3e8181314
2019-03-01 15:05:18 -08:00
Nicolas Lacasse 9177bcd0ba DecRef replaced dirent in inode_overlay.
PiperOrigin-RevId: 236352158
Change-Id: Ide5104620999eaef6820917505e7299c7b0c5a03
2019-03-01 11:58:59 -08:00
Fabricio Voznika 3dbd4a16f8 Add semctl(GETPID) syscall
Also added unimplemented notification for semctl(2)
commands.

PiperOrigin-RevId: 236340672
Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478
2019-03-01 10:57:02 -08:00
Michael Pratt 7693b7469f Format capget/capset arguments
I0225 15:32:10.795034    4166 x:0] [   6]  E capget(0x7f477fdff8c8 {Version: 3, Pid: 0}, 0x7f477fdff8b0)
I0225 15:32:10.795059    4166 x:0] [   6]  X capget(0x7f477fdff8c8 {Version: 3, Pid: 0}, 0x7f477fdff8b0 {Permitted: CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP|CAP_MAC_OVERRIDE|CAP_MAC_ADMIN|CAP_SYSLOG|CAP_WAKE_ALARM|CAP_BLOCK_SUSPEND|CAP_AUDIT_READ, Inheritable: CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP|CAP_MAC_OVERRIDE|CAP_MAC_ADMIN|CAP_SYSLOG|CAP_WAKE_ALARM|CAP_BLOCK_SUSPEND|CAP_AUDIT_READ, Effective: 0x0}) = 0x0 (3.399?s)
I0225 15:32:10.795114    4166 x:0] [   6]  E capset(0x7f477fdff8c8 {Version: 3, Pid: 0}, 0x7f477fdff8b0 {Permitted: CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP|CAP_MAC_OVERRIDE|CAP_MAC_ADMIN|CAP_SYSLOG|CAP_WAKE_ALARM|CAP_BLOCK_SUSPEND|CAP_AUDIT_READ, Inheritable: CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP|CAP_MAC_OVERRIDE|CAP_MAC_ADMIN|CAP_SYSLOG|CAP_WAKE_ALARM|CAP_BLOCK_SUSPEND|CAP_AUDIT_READ, Effective: CAP_FOWNER})
I0225 15:32:10.795127    4166 x:0] [   6]  X capset(0x7f477fdff8c8 {Version: 3, Pid: 0}, 0x7f477fdff8b0 {Permitted: CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP|CAP_MAC_OVERRIDE|CAP_MAC_ADMIN|CAP_SYSLOG|CAP_WAKE_ALARM|CAP_BLOCK_SUSPEND|CAP_AUDIT_READ, Inheritable: CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_FSETID|CAP_KILL|CAP_SETGID|CAP_SETUID|CAP_SETPCAP|CAP_LINUX_IMMUTABLE|CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN|CAP_NET_RAW|CAP_IPC_LOCK|CAP_IPC_OWNER|CAP_SYS_MODULE|CAP_SYS_RAWIO|CAP_SYS_CHROOT|CAP_SYS_PTRACE|CAP_SYS_PACCT|CAP_SYS_ADMIN|CAP_SYS_BOOT|CAP_SYS_NICE|CAP_SYS_RESOURCE|CAP_SYS_TIME|CAP_SYS_TTY_CONFIG|CAP_MKNOD|CAP_LEASE|CAP_AUDIT_WRITE|CAP_AUDIT_CONTROL|CAP_SETFCAP|CAP_MAC_OVERRIDE|CAP_MAC_ADMIN|CAP_SYSLOG|CAP_WAKE_ALARM|CAP_BLOCK_SUSPEND|CAP_AUDIT_READ, Effective: CAP_FOWNER}) = 0x0 (3.062?s)

Not the most readable, but better than just a pointer.

PiperOrigin-RevId: 236338875
Change-Id: I4b83f778122ab98de3874e16f4258dae18da916b
2019-03-01 10:46:36 -08:00
Fabricio Voznika 3b44377eda Fix "-c dbg" build break
Remove allocation from vCPU.die() to save stack space.

Closes #131

PiperOrigin-RevId: 236238102
Change-Id: Iafca27a1a3a472d4cb11dcda9a2060e585139d11
2019-02-28 18:38:34 -08:00
Ruidong Cao 3851705a73 Fix procfs bugs
Current procfs has some bugs. After executing ls twice, many dirs come
out with same name like "1" or ".". Files like "cpuinfo" disappear.
Here variable names is a slice with cap() > len(). Sort after appending
to it will not alloc a new space and impact orignal slice. Same to m.

Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I83e5cd1c7968c6fe28c35ea4fee497488d4f9eef
PiperOrigin-RevId: 236222270
2019-02-28 16:44:54 -08:00
Michael Pratt f7df9d72cf Upgrade to Go 1.12
PiperOrigin-RevId: 236218980
Change-Id: I82cb4aeb2a56524ee1324bfea2ad41dce26db354
2019-02-28 16:26:14 -08:00
Jamie Liu 05d721f9ee Hold dataMu for writing in CachingInodeOperations.WriteOut.
fsutil.SyncDirtyAll mutates the DirtySet.

PiperOrigin-RevId: 236183349
Change-Id: I7e809d5b406ac843407e61eff17d81259a819b4f
2019-02-28 13:14:43 -08:00
Kevin Krakauer 121db29a93 Ping support via IPv4 raw sockets.
Broadly, this change:
* Enables sockets to be created via `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)`.
* Passes the network-layer (IP) header up the stack to the transport endpoint,
  which can pass it up to the socket layer. This allows a raw socket to return
  the entire IP packet to users.
* Adds functions to stack.TransportProtocol, stack.Stack, stack.transportDemuxer
  that enable incoming packets to be delivered to raw endpoints. New raw sockets
  of other protocols (not ICMP) just need to register with the stack.
* Enables ping.endpoint to return IP headers when created via SOCK_RAW.

PiperOrigin-RevId: 235993280
Change-Id: I60ed994f5ff18b2cbd79f063a7fdf15d093d845a
2019-02-27 14:31:21 -08:00
Nicolas Lacasse d516ee3312 Allow overlay to merge Directories and SepcialDirectories.
Needed to mount inside /proc or /sys.

PiperOrigin-RevId: 235936529
Change-Id: Iee6f2671721b1b9b58a3989705ea901322ec9206
2019-02-27 09:45:45 -08:00
Fabricio Voznika cff2c57192 Fix bad merge
PiperOrigin-RevId: 235818534
Change-Id: I99f7e3fd1dc808b35f7a08b96b7c3226603ab808
2019-02-26 16:42:06 -08:00
Ruidong Cao a2b794b30d FPE_INTOVF (integer overflow) should be 2 refer to Linux.
Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I03f8ab25cf29257b31f145cf43304525a93f3300
PiperOrigin-RevId: 235763203
2019-02-26 11:48:49 -08:00
Fabricio Voznika 23fe059761 Lazily allocate inotify map on inode
PiperOrigin-RevId: 235735865
Change-Id: I84223eb18eb51da1fa9768feaae80387ff6bfed0
2019-02-26 09:33:44 -08:00
Fabricio Voznika 10426e0f31 Handle invalid offset in sendfile(2)
PiperOrigin-RevId: 235578698
Change-Id: I608ff5e25eac97f6e1bda058511c1f82b0e3b736
2019-02-25 12:17:46 -08:00
Googler 532f4b2fba Internal change.
PiperOrigin-RevId: 235053594
Change-Id: Ie3d7b11843d0710184a2463886c7034e8f5305d1
2019-02-21 13:08:34 -08:00
Haibo Xu 15d3189884 Make some ptrace commands x86-only
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I9751f859332d433ca772d6b9733f5a5a64398ec7
PiperOrigin-RevId: 234877624
2019-02-20 15:10:59 -08:00
Amanda Tait ea070b9d5f Implement Broadcast support
This change adds support for the SO_BROADCAST socket option in gVisor Netstack.
This support includes getsockopt()/setsockopt() functionality for both UDP and
TCP endpoints (the latter being a NOOP), dispatching broadcast messages up and
down the stack, and route finding/creation for broadcast packets. Finally, a
suite of tests have been implemented, exercising this functionality through the
Linux syscall API.

PiperOrigin-RevId: 234850781
Change-Id: If3e666666917d39f55083741c78314a06defb26c
2019-02-20 12:54:13 -08:00
Kevin Krakauer ec2460b189 netstack: Add SIOCGSTAMP support.
Ping sometimes uses this instead of SO_TIMESTAMP.

PiperOrigin-RevId: 234699590
Change-Id: Ibec9c34fa0d443a931557a2b1b1ecd83effe7765
2019-02-19 16:41:32 -08:00
Jamie Liu bed6f8534b Set rax to syscall number on SECCOMP_RET_TRAP.
PiperOrigin-RevId: 234690475
Change-Id: I1cbfb5aecd4697a4a26ec8524354aa8656cc3ba1
2019-02-19 15:49:37 -08:00
Jamie Liu bb47d8a545 Fix clone(CLONE_NEWUSER).
- Use new user namespace for namespace creation checks.

- Ensure userns is never nil since it's used by other namespaces.

PiperOrigin-RevId: 234673175
Change-Id: I4b9d9d1e63ce4e24362089793961a996f7540cd9
2019-02-19 14:20:05 -08:00
Jamie Liu 22d8b6eba1 Break /proc/[pid]/{uid,gid}_map's dependence on seqfile.
In addition to simplifying the implementation, this fixes two bugs:

- seqfile.NewSeqFile unconditionally creates an inode with mode 0444,
  but {uid,gid}_map have mode 0644.

- idMapSeqFile.Write implements fs.FileOperations.Write ... but it
  doesn't implement any other fs.FileOperations methods and is never
  used as fs.FileOperations. idMapSeqFile.GetFile() =>
  seqfile.SeqFile.GetFile() uses seqfile.seqFileOperations instead,
  which rejects all writes.

PiperOrigin-RevId: 234638212
Change-Id: I4568f741ab07929273a009d7e468c8205a8541bc
2019-02-19 11:21:46 -08:00
Ian Gudger c611dbc5a7 Implement IP_MULTICAST_IF.
This allows setting a default send interface for IPv4 multicast. IPv6 support
will come later.

PiperOrigin-RevId: 234251379
Change-Id: I65922341cd8b8880f690fae3eeb7ddfa47c8c173
2019-02-15 18:40:15 -08:00
Kevin Krakauer a9cb3dcd9d Move SO_TIMESTAMP from different transport endpoints to epsocket.
SO_TIMESTAMP is reimplemented in ping and UDP sockets (and needs to be added for
TCP), but can just be implemented in epsocket for simplicity. This will also
make SIOCGSTAMP easier to implement.

PiperOrigin-RevId: 234179300
Change-Id: Ib5ea0b1261dc218c1a8b15a65775de0050fe3230
2019-02-15 11:18:44 -08:00
Fabricio Voznika e34d27e8b6 Redirect FIXME to more appropriate bug
PiperOrigin-RevId: 234147487
Change-Id: I779a6012832bb94a6b89f5bcc7d821b40ae969cc
2019-02-15 08:23:27 -08:00
Nicolas Lacasse 0a41ea72c1 Don't allow writing or reading to TTY unless process group is in foreground.
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.

See drivers/tty/n_tty.c:n_tty_read()=>job_control().

If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.

See drivers/tty/tty_io.c:tty_check_change().

PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-14 15:47:31 -08:00
Jamie Liu 0e84ae72e0 Improve safecopy sanity checks.
- Fix CopyIn/CopyOut/ZeroOut range checks.

- Include the faulting signal number in the panic message.

PiperOrigin-RevId: 233829501
Change-Id: I8959ead12d05dbd4cd63c2b908cddeb2a27eb513
2019-02-13 14:25:15 -08:00
Googler 7aaa6cf225 Internal change.
PiperOrigin-RevId: 233802562
Change-Id: I40e1b13fd571daaf241b00f8df4bcedd034dc3f1
2019-02-13 12:07:34 -08:00
Nicolas Lacasse f17692d807 Add fs.AsyncWithContext and call it in fs/gofer/inodeOperations.Release.
fs/gofer/inodeOperations.Release does some asynchronous work.  Previously it
was calling fs.Async with an anonymous function, which caused the function to
be allocated on the heap.  Because Release is relatively hot, this results in a
lot of small allocations and increased GC pressure, noticeable in perf profiles.

This CL adds a new function, AsyncWithContext, which is just like Async, but
passes a context to the async function.  It avoids the need for an extra
anonymous function in fs/gofer/inodeOperations.Release.  The Async function
itself still requires a single anonymous function.

PiperOrigin-RevId: 233141763
Change-Id: I1dce4a883a7be9a8a5b884db01e654655f16d19c
2019-02-08 15:54:15 -08:00
Nicolas Lacasse e884168e1e Encode stat to bytes manually, instead of calling CopyObjectOut.
CopyObjectOut grows its destination byte slice incrementally, causing
many small slice allocations on the heap. This leads to increased GC and
noticeably slower stat calls.

PiperOrigin-RevId: 233140904
Change-Id: Ieb90295dd8dd45b3e56506fef9d7f86c92e97d97
2019-02-08 15:48:23 -08:00
Nicolas Lacasse 9c9386d2a8 CopyObjectOut should allocate a byte slice the size of the encoded object.
This adds an extra Reflection call to CopyObjectOut, but avoids many small
slice allocations if the object is large, since without this we grow the
backing slice incrementally as we encode more data.

PiperOrigin-RevId: 233110960
Change-Id: I93569af55912391e5471277f779139c23f040147
2019-02-08 13:00:00 -08:00
Ian Gudger 80f901b16b Plumb IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP to netstack.
Also includes a few fixes for IPv4 multicast support. IPv6 support is coming in
a followup CL.

PiperOrigin-RevId: 233008638
Change-Id: If7dae6222fef43fda48033f0292af77832d95e82
2019-02-07 23:15:23 -08:00
Rahat Mahmood 2ba74f84be Implement /proc/net/unix.
PiperOrigin-RevId: 232948478
Change-Id: Ib830121e5e79afaf5d38d17aeef5a1ef97913d23
2019-02-07 14:44:21 -08:00
Nicolas Lacasse fcae058a14 Make context.Background return a global background context.
It currently allocates a new context on the heap each time it is called. Some
of these are in relatively hot paths like signal delivery and releasing gofer
inodes.  It is also called very commonly in afterLoad.  All of these should
benefit from fewer heap allocations.

PiperOrigin-RevId: 232938873
Change-Id: I53cec0ca299f56dcd4866b0b4fd2ec4938526849
2019-02-07 13:55:23 -08:00
Fabricio Voznika 9ef3427ac1 Implement semctl(2) SETALL and GETALL
PiperOrigin-RevId: 232914984
Change-Id: Id2643d7ad8e986ca9be76d860788a71db2674cda
2019-02-07 11:41:44 -08:00
Zach Koopmans 0cf7fc4e11 Change /proc/PID/cmdline to read environment vector.
- Change proc to return envp on overwrite of argv with limitations from
upstream.
- Add unit tests
- Change layout of argv/envp on the stack so that end of argv is contiguous with
beginning of envp.

PiperOrigin-RevId: 232506107
Change-Id: I993880499ab2c1220f6dc456a922235c49304dec
2019-02-05 10:02:06 -08:00
Fabricio Voznika 2d20b121d7 CachingInodeOperations was over-dirtying cached attributes
Dirty should be set only when the attribute is changed in the cache
only. Instances where the change was also sent to the backing file
doesn't need to dirty the attribute.

Also remove size update during WriteOut as writing dirty page would
naturaly grow the file if needed.

RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 232068978
Change-Id: I00ba54693a2c7adc06efa9e030faf8f2e8e7f188
2019-02-01 17:51:48 -08:00
Nicolas Lacasse 92e85623a0 Factor the subtargets method into a helper method with tests.
PiperOrigin-RevId: 232047515
Change-Id: I00f036816e320356219be7b2f2e6d5fe57583a60
2019-02-01 15:23:43 -08:00
Michael Pratt fe1369ac98 Move package sync to third_party
PiperOrigin-RevId: 231889261
Change-Id: I482f1df055bcedf4edb9fe3fe9b8e9c80085f1a0
2019-01-31 17:49:14 -08:00
Michael Pratt 88b4ce8cac Fix comment
PiperOrigin-RevId: 231861005
Change-Id: I134d4e20cc898d44844219db0a8aacda87e11ef0
2019-01-31 15:03:12 -08:00
Fabricio Voznika a497f5ed5f Invalidate COW mappings when file is truncated
This changed required making fsutil.HostMappable use
a backing file to ensure the correct FD would be used
for read/write operations.

RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 231836164
Change-Id: I8ae9639715529874ea7d80a65e2c711a5b4ce254
2019-01-31 12:54:00 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Haibo Xu cedff8d3ae Add muldiv/rd_tsc support for arm64 platform.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: If35459be78e023346a140184401172f8e023c7f9
PiperOrigin-RevId: 231638020
2019-01-30 11:49:08 -08:00
Zhaozhong Ni ae6e37df2a Convert TODO into FIXME.
PiperOrigin-RevId: 231301228
Change-Id: I3e18f3a12a35fb89a22a8c981188268d5887dc61
2019-01-28 15:34:18 -08:00
Nicolas Lacasse 09cf3b40a8 Fix data race in InodeSimpleAttributes.Unstable.
We were modifying InodeSimpleAttributes.Unstable.AccessTime without holding
the necessary lock.  Luckily for us, InodeSimpleAttributes already has a
NotifyAccess method that will do the update while holding the lock.

In addition, we were holding dfo.dir.mu.Lock while setting AccessTime, which
is unnecessary, so that lock has been removed.

PiperOrigin-RevId: 231278447
Change-Id: I81ed6d3dbc0b18e3f90c1df5e5a9c06132761769
2019-01-28 13:26:28 -08:00
Jamie Liu 1cedccf8e9 Drop the one-page limit for /proc/[pid]/{cmdline,environ}.
It never actually should have applied to environ (the relevant change in
Linux 4.2 is c2c0bb44620d "proc: fix PAGE_SIZE limit of
/proc/$PID/cmdline"), and we claim to be Linux 4.4 now anyway.

PiperOrigin-RevId: 231250661
Change-Id: I37f9c4280a533d1bcb3eebb7803373ac3c7b9f15
2019-01-28 11:00:23 -08:00
Fabricio Voznika 55e8eb775b Make cacheRemoteRevalidating detect changes to file size
When file size changes outside the sandbox, page cache was not
refreshing file size which is required for cacheRemoteRevalidating.
In fact, cacheRemoteRevalidating should be skipping the cache
completely since it's not really benefiting from it. The cache is
cache is already bypassed for unstable attributes (see
cachePolicy.cacheUAttrs). And althought the cache is called to
map pages, they will always miss the cache and map directly from
the host.

Created a HostMappable struct that maps directly to the host and
use it for files with cacheRemoteRevalidating.

Closes #124

PiperOrigin-RevId: 230998440
Change-Id: Ic5f632eabe33b47241e05e98c95e9b2090ae08fc
2019-01-25 17:23:07 -08:00
Adin Scannell b5088ba59c cleanup: extract the kernel from context
Change-Id: I94704a90beebb53164325e0cce1fcb9a0b97d65c
PiperOrigin-RevId: 230817308
2019-01-24 17:02:52 -08:00
Rahat Mahmood 8d7c10e908 Display /proc/net entries for all network configurations.
Most of the entries are stubbed out at the moment, but even those were
only displayed if IPv6 support was enabled. The entries should be
displayed with IPv4-support only, and with only loopback devices.

PiperOrigin-RevId: 229946441
Change-Id: I18afaa3af386322787f91bf9d168ab66c01d5a4c
2019-01-18 10:02:12 -08:00
Nicolas Lacasse 12bc7834dc Allow fsync on a directory.
PiperOrigin-RevId: 229781337
Change-Id: I1f946cff2771714fb1abd83a83ed454e9febda0a
2019-01-17 11:06:59 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Zach Koopmans 7f8de3bf92 Fixing select call to not enforce RLIMIT_NOFILE.
Removing check to RLIMIT_NOFILE in select call.
Adding unit test to select suite to document behavior.
Moving setrlimit class from mlock to a util file for reuse.
Fixing flaky test based on comments from Jamie.

PiperOrigin-RevId: 228726131
Change-Id: Ie9dbe970bbf835ba2cca6e17eec7c2ee6fadf459
2019-01-10 09:44:45 -08:00
Jamie Liu 9270d940eb Minor memevent fixes.
- Call MemoryEvents.done.Add(1) outside of MemoryEvents.run() so that if
  MemoryEvents.Stop() => MemoryEvents.done.Wait() is called before the
  goroutine starts running, it still waits for the goroutine to stop.

- Use defer to call MemoryEvents.done.Done() in MemoryEvents.run() so that it's
  called even if the goroutine panics.

PiperOrigin-RevId: 228623307
Change-Id: I1b0459e7999606c1a1a271b16092b1ca87005015
2019-01-09 17:54:40 -08:00