Commit Graph

412 Commits

Author SHA1 Message Date
Ian Gudger 45566fa4e4 Add finalizer on AtomicRefCount to check for leaks.
PiperOrigin-RevId: 255711454
2019-06-28 20:07:52 -07:00
Adin Scannell 7dae043fec Drop ashmem and binder.
These are unfortunately unused and unmaintained. They can be brought back in
the future if need requires it.

PiperOrigin-RevId: 255697132
2019-06-28 17:20:25 -07:00
Ayush Ranjan c4da599e22 ext4: disklayout: SuperBlock interface implementations.
PiperOrigin-RevId: 255687771
2019-06-28 16:18:29 -07:00
Nicolas Lacasse 295078fa7a Automated rollback of changelist 255263686
PiperOrigin-RevId: 255679453
2019-06-28 15:28:41 -07:00
Ayush Ranjan 7c13789818 Superblock interface in the disk layout package for ext4.
PiperOrigin-RevId: 255644277
2019-06-28 12:07:28 -07:00
Yong He c61d7761b4 Fix deadloop in proc subtask list
Readdir of /proc/x/task/ will get direntry entries
from tasks of specified taskgroup. Now the tasks
slice is unsorted, use sort.SearchInts search entry
from the slice may cause infinity loops.
The fix is sort the slice before search.
This issue could be easily reproduced via following
steps, revise Readdir in pkg/sentry/fs/proc/task.go,
force set taskInts into test slice
[]int{1, 11, 7, 5, 10, 6, 8, 3, 9, 2, 4},
then run docker image and run ls /proc/1/task, the
command will cause infinity loops.
2019-06-28 22:20:57 +08:00
Fabricio Voznika b2907595e5 Complete pipe support on overlayfs
Get/Set pipe size and ioctl support were missing from
overlayfs. It required moving the pipe.Sizer interface
to fs so that overlay could get access.

Fixes #318

PiperOrigin-RevId: 255511125
2019-06-27 17:22:53 -07:00
Michael Pratt 5b41ba5d0e Fix various spelling issues in the documentation
Addresses obvious typos, in the documentation only.

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
2019-06-27 14:25:50 -07:00
Michael Pratt 085a907565 Cache directory entries in the overlay
Currently, the overlay dirCache is only used for a single logical use of
getdents. i.e., it is discard when the FD is closed or seeked back to
the beginning.

But the initial work of getting the directory contents can be quite
expensive (particularly sorting large directories), so we should keep it
as long as possible.

This is very similar to the readdirCache in fs/gofer.

Since the upper filesystem does not have to allow caching readdir
entries, the new CacheReaddir MountSourceOperations method controls this
behavior.

This caching should be trivially movable to all Inodes if desired,
though that adds an additional copy step for non-overlay Inodes.
(Overlay Inodes already do the extra copy).

PiperOrigin-RevId: 255477592
2019-06-27 14:24:03 -07:00
Fabricio Voznika 42e212f6b7 Preserve permissions when checking lower
The code was wrongly assuming that only read access was
required from the lower overlay when checking for permissions.
This allowed non-writable files to be writable in the overlay.

Fixes #316

PiperOrigin-RevId: 255263686
2019-06-26 14:24:44 -07:00
Michael Pratt e98ce4a2c6 Add TODO reminder to remove tmpfs caching options
Updates #179

PiperOrigin-RevId: 255081565
2019-06-25 17:12:34 -07:00
Andrei Vagin e9ea7230f7 fs: synchronize concurrent writes into files with O_APPEND
For files with O_APPEND, a file write operation gets a file size and uses it as
offset to call an inode write operation. This means that all other operations
which can change a file size should be blocked while the write operation doesn't
complete.

PiperOrigin-RevId: 254873771
2019-06-24 17:45:02 -07:00
Rahat Mahmood 94a6bfab5d Implement /proc/net/tcp.
PiperOrigin-RevId: 254854346
2019-06-24 15:56:36 -07:00
chris.zn f957fb23cf Return ENOENT when reading /proc/{pid}/task of an exited process
There will be a deadloop when we use getdents to read /proc/{pid}/task
of an exited process

Like this:

Process A is running
                         Process B: open /proc/{pid of A}/task
Process A exits
                         Process B: getdents /proc/{pid of A}/task

Then, process B will fall into deadloop, and return "." and ".."
in loops and never ends.

This patch returns ENOENT when use getdents to read /proc/{pid}/task
if the process is just exited.

Signed-off-by: chris.zn <chris.zn@antfin.com>
2019-06-24 15:49:53 +08:00
Nicolas Lacasse 35719d52c7 Implement statx.
We don't have the plumbing for btime yet, so that field is left off. The
returned mask indicates that btime is absent.

Fixes #343

PiperOrigin-RevId: 254575752
2019-06-22 13:29:26 -07:00
Andrei Vagin ab6774cebf gvisor/fs: getdents returns 0 if offset is equal to FileMaxOffset
FileMaxOffset is a special case when lseek(d, 0, SEEK_END) has been called.

PiperOrigin-RevId: 254498777
2019-06-21 17:25:17 -07:00
Ayush Ranjan 727375321f ext4 block group descriptor implementation in disk layout package.
PiperOrigin-RevId: 254482180
2019-06-21 15:42:46 -07:00
Michael Pratt 292f70cbf7 Add package docs to seqfile and ramfs
These are the only packages missing docs:
https://godoc.org/gvisor.dev/gvisor

PiperOrigin-RevId: 254261022
2019-06-20 13:34:33 -07:00
Nicolas Lacasse f7428af9c1 Add MountNamespace to task.
This allows tasks to have distinct mount namespace, instead of all sharing the
kernel's root mount namespace.

Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).

In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.

Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better naem.

PiperOrigin-RevId: 254009310
2019-06-19 09:21:21 -07:00
Fabricio Voznika ca245a428b Attempt to fix TestPipeWritesAccumulate
Test fails because it's reading 4KB instead of the
expected 64KB. Changed the test to read pipe buffer
size instead of hardcode and added some logging in
case the reason for failure was not pipe buffer size.

PiperOrigin-RevId: 253916040
2019-06-18 19:16:11 -07:00
Andrei Vagin 8ab0848c70 gvisor/fs: don't update file.offset for sockets, pipes, etc
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.

With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.

PiperOrigin-RevId: 253746644
2019-06-18 01:43:29 -07:00
Ian Gudger 3e9b8ecbfe Plumb context through more layers of filesytem.
All functions which allocate objects containing AtomicRefCounts will soon need
a context.

PiperOrigin-RevId: 253147709
2019-06-13 18:40:38 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Fabricio Voznika fc746efa9a Add support to mount pod shared tmpfs mounts
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.

For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.

PiperOrigin-RevId: 252704037
2019-06-11 14:54:31 -07:00
Rahat Mahmood a00157cc0e Store more information in the kernel socket table.
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.

Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.

Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.

PiperOrigin-RevId: 252495895
2019-06-10 15:17:43 -07:00
Rahat Mahmood 315cf9a523 Use common definition of SockType.
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.

PiperOrigin-RevId: 251956740
2019-06-06 17:00:27 -07:00
Fabricio Voznika 02ab1f187c Copy up parent when binding UDS on overlayfs
Overlayfs was expecting the parent to exist when bind(2)
was called, which may not be the case. The fix is to copy
the parent directory to the upper layer before binding
the UDS.

There is not good place to add tests for it. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus copied the parent to the upper layer).
Added it to runsc integration tests for now. If it turns
out we have lots of these kind of tests, we can consider
moving them somewhere more appropriate.

PiperOrigin-RevId: 251954156
2019-06-06 16:45:51 -07:00
Rahat Mahmood 2d2831e354 Track and export socket state.
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).

For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.

PiperOrigin-RevId: 251934850
2019-06-06 15:04:47 -07:00
Michael Pratt 57772db2e7 Shutdown host sockets on internal shutdown
This is required to make the shutdown visible to peers outside the
sandbox.

The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.

The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.

Updates #273

PiperOrigin-RevId: 251763060
2019-06-05 18:40:37 -07:00
Michael Pratt d3ed9baac0 Implement dumpability tracking and checks
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.

Lack of support for set-uid binaries or fs creds simplifies things a
bit.

As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but at least since they have no parent normal
UID/GID checks should protect them.

PiperOrigin-RevId: 251712714
2019-06-05 14:00:13 -07:00
Yong He 7398f013f0 Drop one dirent reference after referenced by file
When pipe is created, a dirent of pipe will be
created and its initial reference is set as 0.
Cause all dirent will only be destroyed when
the reference decreased to -1, so there is already
a 'initial reference' of dirent after it created.
For destroying dirent after all reference released,
the correct way is to drop the 'initial reference'
once someone hold a reference to the dirent, such
as fs.NewFile, otherwise the reference of dirent
will stay 0 all the time, and will cause memory
leak of dirent.
Except pipe, timerfd/eventfd/epoll has the same
problem

Here is a simple case to create memory leak of dirent
for pipe/timerfd/eventfd/epoll in C langange, after
run the case, pprof the runsc process, you will
find lots dirents of pipe/timerfd/eventfd/epoll not
freed:

int main(int argc, char *argv[])
{
	int i;
	int n;
	int pipefd[2];

	if (argc != 3) {
		printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
	}

	n = strtol(argv[2], NULL, 10);

	if (strcmp(argv[1], "epoll") == 0) {
		for (i = 0; i < n; ++i)
			close(epoll_create(1));
	} else if (strcmp(argv[1], "timerfd") == 0) {
		for (i = 0; i < n; ++i)
			close(timerfd_create(CLOCK_REALTIME, 0));
	} else if (strcmp(argv[1], "eventfd") == 0) {
		for (i = 0; i < n; ++i)
			close(eventfd(0, 0));
	} else if (strcmp(argv[1], "pipe") == 0) {
		for (i = 0; i < n; ++i)
			if (pipe(pipefd) == 0) {
				close(pipefd[0]);
				close(pipefd[1]);
			}
	}

	printf("%s %s test finished\r\n",argv[1],argv[2]);
	return 0;
}

Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2
PiperOrigin-RevId: 251531096
2019-06-04 15:40:23 -07:00
Andrei Vagin 90a116890f gvisor/sock/unix: pass creds when a message is sent between unconnected sockets
and don't report a sender address if it doesn't have one

PiperOrigin-RevId: 251371284
2019-06-03 21:48:19 -07:00
Andrei Vagin 00f8663887 gvisor/fs: return a proper error from FileWriter.Write in case of a short-write
The io.Writer contract requires that Write writes all available
bytes and does not return short writes. This causes errors with
io.Copy, since our own Write interface does not have this same
contract.

PiperOrigin-RevId: 251368730
2019-06-03 21:26:01 -07:00
Nicolas Lacasse 6f73d79c32 Simplify overlayBoundEndpoint.
There is no reason to do the recursion manually, since
Inode.BoundEndpoint will do it for us.

PiperOrigin-RevId: 250794903
2019-05-30 17:20:20 -07:00
chris.zn b18df9bed6 Add VmData field to /proc/{pid}/status
VmData is the size of private data segments.
It has the same meaning as in Linux.

Change-Id: Iebf1ae85940a810524a6cde9c2e767d4233ddb2a
PiperOrigin-RevId: 250593739
2019-05-30 12:07:40 -07:00
Adin Scannell 2165b77774 Remove obsolete bug.
The original bug is no longer relevant, and the FIXME here
contains lots of obsolete information.

PiperOrigin-RevId: 249924036
2019-05-30 12:03:39 -07:00
Adin Scannell ed5793808e Remove obsolete TODO.
We don't need to model internal interfaces after the system
call interfaces (which are objectively worse and simply use a
flag to distinguish between two logically different operations).

PiperOrigin-RevId: 249916814
Change-Id: I45d02e0ec0be66b782a685b1f305ea027694cab9
2019-05-24 16:18:09 -07:00
Andrei Vagin a949133c4b gvisor: interrupt the sendfile system call if a task has been interrupted
sendfile can be called for a big range and it can require significant
amount of time to process it, so we need to handle task interrupts in
this system call.

PiperOrigin-RevId: 249781023
Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015
2019-05-23 23:21:13 -07:00
Ayush Ranjan 6240abb205 Added boilerplate code for ext4 fs.
Initialized BUILD with license
Mount is still unimplemented and is not meant to be
part of this CL. Rest of the fs interface is implemented.
Referenced the Linux kernel appropriately when needed

PiperOrigin-RevId: 249741997
Change-Id: Id1e4c7c9e68b3f6946da39896fc6a0c3dcd7f98c
2019-05-23 16:55:42 -07:00
Fabricio Voznika 9006304dfe Initial support for bind mounts
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.

PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
2019-05-23 04:16:10 -07:00
Adin Scannell 21915eb58b Remove obsolete TODO.
There no obvious reason to require that BlockSize and StatFS
are MountSource operations. Today they are in INodeOperations,
and they can be moved elsewhere in the future as part of a
normal refactor process.

PiperOrigin-RevId: 249549982
Change-Id: Ib832e02faeaf8253674475df4e385bcc53d780f3
2019-05-22 17:00:36 -07:00
Adin Scannell 9cdae51fec Add basic plumbing for splice and stub implementation.
This does not actually implement an efficient splice or sendfile. Rather, it
adds a generic plumbing to the file internals so that this can be added. All
file implementations use the stub fileutil.NoSplice implementation, which
causes sendfile and splice to fall back to an internal copy.

A basic splice system call interface is added, along with a test.

PiperOrigin-RevId: 249335960
Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b
2019-05-21 15:18:12 -07:00
Neel Natu adeb99709b Remove unused struct member.
Remove unused struct member.

PiperOrigin-RevId: 249300446
Change-Id: Ifb16538f684bc3200342462c3da927eb564bf52d
2019-05-21 12:20:19 -07:00
Michael Pratt 80cc2c78e5 Forward named pipe creation to the gofer
The backing 9p server must allow named pipe creation, which the runsc
fsgofer currently does not.

There are small changes to the overlay here. GetFile may block when
opening a named pipe, which can cause a deadlock:

1. open(O_RDONLY) -> copyMu.Lock() -> GetFile()
2. open(O_WRONLY) -> copyMu.Lock() -> Deadlock

A named pipe usable for writing must already be on the upper filesystem,
but we are still taking copyMu for write when checking for upper. That
can be changed to a read lock to fix the common case.

However, a named pipe on the lower filesystem would still deadlock in
open(O_WRONLY) when it tries to actually perform copy up (which would
simply return EINVAL). Move the copy up type check before taking copyMu
for write to avoid this.

p9 must be modified, as it was incorrectly removing the file mode when
sending messages on the wire.

PiperOrigin-RevId: 249154033
Change-Id: Id6637130e567b03758130eb6c7cdbc976384b7d6
2019-05-20 16:53:08 -07:00
Michael Pratt 6588427451 Fix incorrect tmpfs timestamp updates
* Creation of files, directories (and other fs objects) in a directory
  should always update ctime.
* Same for removal.
* atime should not be updated on lookup, only readdir.

I've also renamed some misleading functions that update mtime and ctime.

PiperOrigin-RevId: 249115063
Change-Id: I30fa275fa7db96d01aa759ed64628c18bb3a7dc7
2019-05-20 13:35:17 -07:00
Michael Pratt 4a842836e5 Return EPERM for mknod
This more directly matches what Linux does with unsupported
nodes.

PiperOrigin-RevId: 248780425
Change-Id: I17f3dd0b244f6dc4eb00e2e42344851b8367fbec
2019-05-17 13:47:40 -07:00
Michael Pratt 04105781ad Fix gofer rename ctime and cleanup stat_times test
There is a lot of redundancy that we can simplify in the stat_times
test. This will make it easier to add new tests. However, the
simplification reveals that cached uattrs on goferfs don't properly
update ctime on rename.

PiperOrigin-RevId: 248773425
Change-Id: I52662728e1e9920981555881f9a85f9ce04041cf
2019-05-17 13:05:47 -07:00
Andrei Vagin 2105158d4b gofer: don't call hostfile.Close if hostFile is nil
PiperOrigin-RevId: 248437159
Change-Id: Ife71f6ca032fca59ec97a82961000ed0af257101
2019-05-15 17:21:10 -07:00
Nicolas Lacasse dd153c014d Start of support for /proc/pid/cgroup file.
PiperOrigin-RevId: 248263378
Change-Id: Ic057d2bb0b6212110f43ac4df3f0ac9bf931ab98
2019-05-14 20:34:50 -07:00
Michael Pratt 330a1bbd04 Remove false comment
PiperOrigin-RevId: 248249285
Change-Id: I9b6d267baa666798b22def590ff20c9a118efd47
2019-05-14 18:06:14 -07:00
Fabricio Voznika 1bee43be13 Implement fallocate(2)
Closes #225

PiperOrigin-RevId: 247508791
Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-09 15:35:49 -07:00
Nicolas Lacasse bfd9f75ba4 Set the FilesytemType in MountSource from the Filesystem.
And stop storing the Filesystem in the MountSource.

This allows us to decouple the MountSource filesystem type from the name of the
filesystem.

PiperOrigin-RevId: 247292982
Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8
2019-05-08 14:35:06 -07:00
Fabricio Voznika e5432fa1b3 Remove defers from gofer.contextFile
Most are single line methods in hot paths.

PiperOrigin-RevId: 247050267
Change-Id: I428d78723fe00b57483185899dc8fa9e1f01e2ea
2019-05-07 10:55:09 -07:00
Andrei Vagin 24d8656585 gofer: don't leak file descriptors
Fixes #219

PiperOrigin-RevId: 246568639
Change-Id: Ic7afd15dde922638d77f6429c508d1cbe2e4288a
2019-05-03 14:01:50 -07:00
Michael Pratt 23ca9886c6 Update reference to old type
PiperOrigin-RevId: 246036806
Change-Id: I5554a43a1f8146c927402db3bf98488a2da0fbe7
2019-04-30 15:42:39 -07:00
Jamie Liu 8bfb83d0ac Implement async MemoryFile eviction, and use it in CachingInodeOperations.
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.

Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.

As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.

PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
2019-04-30 13:56:41 -07:00
Ian Gudger 81ecd8b6ea Implement the MSG_CTRUNC msghdr flag for Unix sockets.
Updates google/gvisor#206

PiperOrigin-RevId: 245880573
Change-Id: Ifa715e98d47f64b8a32b04ae9378d6cd6bd4025e
2019-04-29 21:21:08 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Kevin Krakauer 5f13338d30 Fix reference counting bug in /proc/PID/fdinfo/.
PiperOrigin-RevId: 245452217
Change-Id: I7164d8f57fe34c17e601079eb9410a6d95af1869
2019-04-26 11:09:55 -07:00
Jamie Liu 6b76c172b4 Don't enforce NAME_MAX in fs.Dirent.walk().
Maximum filename length is filesystem-dependent, and obtained via
statfs::f_namelen. This limit is usually 255 bytes (NAME_MAX), but not
always. For example, VFAT supports filenames of up to 255... UCS-2
characters, which Linux conservatively takes to mean UTF-8-encoded
bytes: fs/fat/inode.c:fat_statfs(), FAT_LFN_LEN * NLS_MAX_CHARSET_SIZE.
As a result, Linux's VFS does not enforce NAME_MAX:

$ rg --maxdepth=1 '\WNAME_MAX\W' fs/ include/linux/
fs/libfs.c
38:     buf->f_namelen = NAME_MAX;
64:     if (dentry->d_name.len > NAME_MAX)

include/linux/relay.h
74:     char base_filename[NAME_MAX];   /* saved base filename */

include/linux/fscrypt.h
149: * filenames up to NAME_MAX bytes, since base64 encoding expands the length.

include/linux/exportfs.h
176: *    understanding that it is already pointing to a a %NAME_MAX+1 sized

Remove this check from core VFS, and add it to ramfs (and by extension
tmpfs), where it is actually applicable:
mm/shmem.c:shmem_dir_inode_operations.lookup == simple_lookup *does*
enforce NAME_MAX.

PiperOrigin-RevId: 245324748
Change-Id: I17567c4324bfd60e31746a5270096e75db963fac
2019-04-25 16:05:13 -07:00
Michael Pratt d6aac9387f Fix doc typo
PiperOrigin-RevId: 244773890
Change-Id: I2d0cd7789771276ba545b38efff6d3e24133baaa
2019-04-22 18:22:19 -07:00
Fabricio Voznika c8cee7108f Use FD limit and file size limit from host
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.

PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17 12:57:40 -07:00
Jamie Liu 4209edafb6 Use open fids when fstat()ing gofer files.
PiperOrigin-RevId: 243018347
Change-Id: I1e5b80607c1df0747482abea61db7fcf24536d37
2019-04-11 00:43:04 -07:00
Nicolas Lacasse d93d19fd4e Fix uses of RootFromContext.
RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.

PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
2019-04-10 16:36:28 -07:00
Yong He 89cc8eef9b DATA RACE in fs.(*Dirent).fullName
add renameMu.Lock when oldParent == newParent
in order to avoid data race in following report:

WARNING: DATA RACE
Read at 0x00c000ba2160 by goroutine 405:
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).fullName()
      pkg/sentry/fs/dirent.go:246 +0x6c
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).FullName()
      pkg/sentry/fs/dirent.go:356 +0x8b
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*FDMap).String()
      pkg/sentry/kernel/fd_map.go:135 +0x1e0
  fmt.(*pp).handleMethods()
      GOROOT/src/fmt/print.go:603 +0x404
  fmt.(*pp).printArg()
      GOROOT/src/fmt/print.go:686 +0x255
  fmt.(*pp).doPrintf()
      GOROOT/src/fmt/print.go:1003 +0x33f
  fmt.Fprintf()
      GOROOT/src/fmt/print.go:188 +0x7f
  gvisor.googlesource.com/gvisor/pkg/log.(*Writer).Emit()
      pkg/log/log.go:121 +0x89
  gvisor.googlesource.com/gvisor/pkg/log.GoogleEmitter.Emit()
      pkg/log/glog.go:162 +0x1acc
  gvisor.googlesource.com/gvisor/pkg/log.(*GoogleEmitter).Emit()
      <autogenerated>:1 +0xe1
  gvisor.googlesource.com/gvisor/pkg/log.(*BasicLogger).Debugf()
      pkg/log/log.go:177 +0x111
  gvisor.googlesource.com/gvisor/pkg/log.Debugf()
      pkg/log/log.go:235 +0x66
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Debugf()
      pkg/sentry/kernel/task_log.go:48 +0xfe
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).DebugDumpState()
      pkg/sentry/kernel/task_log.go:66 +0x11f
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
      pkg/sentry/kernel/task_run.go:272 +0xc80
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
      pkg/sentry/kernel/task_run.go:91 +0x24b

Previous write at 0x00c000ba2160 by goroutine 423:
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.Rename()
      pkg/sentry/fs/dirent.go:1628 +0x61f
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1.1()
      pkg/sentry/syscalls/linux/sys_file.go:1864 +0x1f8
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt(  gvisor.googlesource.com/g/linux/sys_file.go:51 +0x20f
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1()
      pkg/sentry/syscalls/linux/sys_file.go:1852 +0x218
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt()
      pkg/sentry/syscalls/linux/sys_file.go:51 +0x20f
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt()
      pkg/sentry/syscalls/linux/sys_file.go:1840 +0x180
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Rename()
      pkg/sentry/syscalls/linux/sys_file.go:1873 +0x60
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
      pkg/sentry/kernel/task_syscall.go:165 +0x17a
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
      pkg/sentry/kernel/task_syscall.go:283 +0xb4
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
      pkg/sentry/kernel/task_syscall.go:244 +0x10c
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
      pkg/sentry/kernel/task_syscall.go:219 +0x1e3
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
      pkg/sentry/kernel/task_run.go:215 +0x15a9
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
      pkg/sentry/kernel/task_run.go:91 +0x24b

Reported-by: syzbot+e1babbf756fab380dfff@syzkaller.appspotmail.com
Change-Id: Icd2620bb3ea28b817bf0672d454a22b9d8ee189a
PiperOrigin-RevId: 242938741
2019-04-10 14:17:33 -07:00
Nicolas Lacasse 0a0619216e Start saving MountSource.DirentCache.
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save.  There is no reason not to save it along with the MountSource.

This did uncover an issue where not all MountSources were properly flushed
before Save.  If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first.  This CL also fixes that by flushing
all MountSources for all open FDs on Save.

PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
2019-04-10 11:27:16 -07:00
Shiva Prasanth 7140b1fdca Fixed /proc/cpuinfo permissions
This also applies these permissions to other static proc files.

Change-Id: I4167e585fed49ad271aa4e1f1260babb3239a73d
PiperOrigin-RevId: 242898575
2019-04-10 10:49:43 -07:00
Jamie Liu 9471c01348 Export kernel.SignalInfoPriv.
Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks.

PiperOrigin-RevId: 242562428
Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90
2019-04-08 16:32:11 -07:00
Nicolas Lacasse 70906f1d24 Intermediate ram fs dirs should be writable.
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.

We were creating these directories without write permissions, which meant that
they were not wribable even when underlayed under a writable filesystem. They
should be writable.

PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
2019-04-08 11:56:38 -07:00
Nicolas Lacasse ee7e6d33b2 Use string type for extended attribute values, instead of []byte.
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.

PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
2019-04-05 15:49:39 -07:00
Andrei Vagin 88409e983c gvisor: Add support for the MS_NOEXEC mount option
https://github.com/google/gvisor/issues/145

PiperOrigin-RevId: 242044115
Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
2019-04-04 17:43:53 -07:00
Nicolas Lacasse 61d8c361c6 Don't release d.mu in checks for child-existence.
Dirent.exists() is called in Create to check whether a child with the given
name already exists.

Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu
while calling d.Inode.Lookup. During this existence check, a racing Rename()
can acquire d.mu and create a new child of the dirent with the same name.
(Note that the source and destination of the rename must be in the same
directory, otherwise renameMu will be taken preventing the race.) In this
case, d.exists() can return false, even though a child with the same name
actually does exist.

This CL changes d.exists() so that it does not release d.mu while walking, thus
preventing the race with Rename.

It also adds comments noting that lockForRename may not take renameMu if the
source and destination are in the same directory, as this is a bit surprising
(at least it was to me).

PiperOrigin-RevId: 241842579
Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
2019-04-03 17:53:56 -07:00
Kevin Krakauer 82529becae Fix index out of bounds in tty implementation.
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for
?. However, peekRune also returned the length of the byte (1). When calling
utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte
character ?.

tl;dr: I apparently didn't understand runes when I wrote this.

PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
2019-04-03 13:00:34 -07:00
Kevin Krakauer c79e81bd27 Addresses data race in tty implementation.
Also makes the safemem reading and writing inline, as it makes it easier to see
what locks are held.

PiperOrigin-RevId: 241775201
Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
2019-04-03 11:49:55 -07:00
Nicolas Lacasse 1776ab28f0 Add test that symlinking over a directory returns EEXIST.
Also remove comments in InodeOperations that required that implementation of
some Create* operations ensure that the name does not already exist, since
these checks are all centralized in the Dirent.

PiperOrigin-RevId: 241637335
Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
2019-04-02 17:28:36 -07:00
Wei Zhang 1fcd40719d device: fix device major/minor
Current gvisor doesn't give devices a right major and minor number.

When testing golang supporting of gvisor, I run the test case below:

```
$ docker run -ti --runtime runsc golang:1.12.1 bash -c "cd /usr/local/go/src && ./run.bash "
```

And it reports some errors, one of them is:

"--- FAIL: TestDevices (0.00s)
    --- FAIL: TestDevices//dev/null_1:3 (0.00s)
        dev_linux_test.go:45: for /dev/null Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/null Minor(0x0) == 0, want 3
        dev_linux_test.go:51: for /dev/null Mkdev(1, 3) == 0x103, want 0x0
    --- FAIL: TestDevices//dev/zero_1:5 (0.00s)
        dev_linux_test.go:45: for /dev/zero Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/zero Minor(0x0) == 0, want 5
        dev_linux_test.go:51: for /dev/zero Mkdev(1, 5) == 0x105, want 0x0
    --- FAIL: TestDevices//dev/random_1:8 (0.00s)
        dev_linux_test.go:45: for /dev/random Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/random Minor(0x0) == 0, want 8
        dev_linux_test.go:51: for /dev/random Mkdev(1, 8) == 0x108, want 0x0
    --- FAIL: TestDevices//dev/full_1:7 (0.00s)
        dev_linux_test.go:45: for /dev/full Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/full Minor(0x0) == 0, want 7
        dev_linux_test.go:51: for /dev/full Mkdev(1, 7) == 0x107, want 0x0
    --- FAIL: TestDevices//dev/urandom_1:9 (0.00s)
        dev_linux_test.go:45: for /dev/urandom Major(0x0) == 0, want 1
        dev_linux_test.go:48: for /dev/urandom Minor(0x0) == 0, want 9
        dev_linux_test.go:51: for /dev/urandom Mkdev(1, 9) == 0x109, want 0x0
"

So I think we'd better assign to them correct major/minor numbers following linux spec.

Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I4521ee7884b4e214fd3a261929e3b6dac537ada9
PiperOrigin-RevId: 241609021
2019-04-02 14:51:07 -07:00
Andrei Vagin a4b34e2637 gvisor: convert ilist to ilist:generic_list
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
2019-04-01 12:53:27 -07:00
Jamie Liu 69afd0438e Return srclen in proc.idMapFileOperations.Write.
PiperOrigin-RevId: 241037926
Change-Id: I4b0381ac1c7575e8b861291b068d3da22bc03850
2019-03-29 13:16:46 -07:00
Googler e373d3642e Internal change.
PiperOrigin-RevId: 240842801
Change-Id: Ibbd6f849f9613edc1b1dd7a99a97d1ecdb6e9188
2019-03-28 13:43:47 -07:00
Jamie Liu f005350c93 Clean up gofer handle caching.
- Document fsutil.CachedFileObject.FD() requirements on access
permissions, and change gofer.inodeFileState.FD() to honor them.
Fixes #147.

- Combine gofer.inodeFileState.readonly and
gofer.inodeFileState.readthrough, and simplify handle caching logic.

- Inline gofer.cachePolicy.cacheHandles into
gofer.inodeFileState.setSharedHandles, because users with access to
gofer.inodeFileState don't necessarily have access to the fs.Inode
(predictably, this is a save/restore problem).

Before this CL:

$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@34d51017ed67:/# /root/repro/runsc-b147
mmap: 0x7f3c01e45000
Segmentation fault

After this CL:

$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@d3c3cb56bbf9:/# /root/repro/runsc-b147
mmap: 0x7f78987ec000
o
PiperOrigin-RevId: 240818413
Change-Id: I49e1d4a81a0cb9177832b0a9f31a10da722a896b
2019-03-28 11:43:51 -07:00
Nicolas Lacasse 9c18897887 Add rsslim field in /proc/pid/stat.
PiperOrigin-RevId: 240681675
Change-Id: Ib214106e303669fca2d5c744ed5c18e835775161
2019-03-27 17:44:38 -07:00
Nicolas Lacasse 2d355f0e8f Add start time to /proc/<pid>/stat.
The start time is the number of clock ticks between the boot time and
application start time.

PiperOrigin-RevId: 240619475
Change-Id: Ic8bd7a73e36627ed563988864b0c551c052492a5
2019-03-27 12:41:27 -07:00
Nicolas Lacasse 645af7cdd8 Dev device methods should take pointer receiver.
PiperOrigin-RevId: 240600504
Change-Id: I7dd5f27c8da31f24b68b48acdf8f1c19dbd0c32d
2019-03-27 11:08:50 -07:00
Rahat Mahmood 06ec97a3f8 Implement memfd_create.
Memfds are simply anonymous tmpfs files with no associated
mounts. Also implementing file seals, which Linux only implements for
memfds at the moment.

PiperOrigin-RevId: 240450031
Change-Id: I31de78b950101ae8d7a13d0e93fe52d98ea06f2f
2019-03-26 16:16:57 -07:00
Jamie Liu f3723f8059 Call memmap.Mappable.Translate with more conservative usermem.AccessType.
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate
(although it unsets AccessType.Write if the vma is private). This
somewhat simplifies handling of pmas, since it means only COW-break
needs to replace existing pmas. However, it also means that a MAP_SHARED
mapping of a file opened O_RDWR dirties the file, regardless of the
mapping's permissions and whether or not the mapping is ever actually
written to with I/O that ignores permissions (e.g.
ptrace(PTRACE_POKEDATA)).

To fix this:

- Change the pma-getting path to request only the permissions that are
required for the calling access.

- Change memmap.Mappable.Translate to take requested permissions, and
return allowed permissions. This preserves the existing behavior in the
common cases where the memmap.Mappable isn't
fsutil.CachingInodeOperations and doesn't care if the translated
platform.File pages are written to.

- Change the MM.getPMAsLocked path to support permission upgrading of
pmas outside of copy-on-write.

PiperOrigin-RevId: 240196979
Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
2019-03-25 12:42:43 -07:00
Kevin Krakauer 0cd5f20044 Replace manual pty copies to/from userspace with safemem operations.
Also, changing queue.writeBuf from a buffer.Bytes to a [][]byte should reduce
copying and reallocating of slices.

PiperOrigin-RevId: 239713547
Change-Id: I6ee5ff19c3ee2662f1af5749cae7b73db0569e96
2019-03-21 18:05:07 -07:00
Andrei Vagin 87cce0ec08 netstack: reduce MSS from SYN to account tcp options
See: https://tools.ietf.org/html/rfc6691#section-2
PiperOrigin-RevId: 239305632
Change-Id: Ie8eb912a43332e6490045dc95570709c5b81855e
2019-03-19 17:33:20 -07:00
Michael Pratt 8a499ae65f Remove references to replaced child in Rename in ramfs/agentfs
In the case of a rename replacing an existing destination inode, ramfs
Rename failed to first remove the replaced inode. This caused:

1. A leak of a reference to the inode (making it live indefinitely).
2. For directories, a leak of the replaced directory's .. link to the
   parent. This would cause the parent's link count to incorrectly
   increase.

(2) is much simpler to test than (1), so that's what I've done.

agentfs has a similar bug with link count only, so the Dirent layer
informs the Inode if this is a replacing rename.

Fixes #133

PiperOrigin-RevId: 239105698
Change-Id: I4450af2462d8ae3339def812287213d2cbeebde0
2019-03-18 18:40:06 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Jamie Liu fb9919881c Use WalkGetAttr in gofer.inodeOperations.Create.
p9.Twalk.handle() with a non-empty path also stats the walked-to path
anyway, so the preceding GetAttr is completely wasted.

PiperOrigin-RevId: 238440645
Change-Id: I7fbc7536f46b8157639d0d1f491e6aaa9ab688a3
2019-03-14 07:43:15 -07:00
Nicolas Lacasse 2512cc5617 Allow filesystem.Mount to take an optional interface argument.
PiperOrigin-RevId: 238360231
Change-Id: I5eaf8d26f8892f77d71c7fbd6c5225ef471cedf1
2019-03-13 19:24:03 -07:00
Jamie Liu 8930e79ebf Clarify the platform.File interface.
- Redefine some memmap.Mappable, platform.File, and platform.Memory
semantics in terms of File reference counts (no functional change).

- Make AddressSpace.MapFile take a platform.File instead of a raw FD,
and replace platform.File.MapInto with platform.File.FD. This allows
kvm.AddressSpace.MapFile to always use platform.File.MapInternal instead
of maintaining its own (redundant) cache of file mappings in the sentry
address space.

PiperOrigin-RevId: 238044504
Change-Id: Ib73a11e4275c0da0126d0194aa6c6017a9cef64f
2019-03-12 10:29:16 -07:00
Nicolas Lacasse fbacb35039 No need to check for negative uintptr.
Fixes #134

PiperOrigin-RevId: 237128306
Change-Id: I396e808484c18931fc5775970ec1f5ae231e1cb9
2019-03-06 15:06:46 -08:00
Nicolas Lacasse 0d683c9961 Make tmpfs respect MountNoATime now that fs.Handle is gone.
PiperOrigin-RevId: 236752802
Change-Id: I9e50600b2ae25d5f2ac632c4405a7a185bdc3c92
2019-03-04 16:57:14 -08:00
Nicolas Lacasse 9177bcd0ba DecRef replaced dirent in inode_overlay.
PiperOrigin-RevId: 236352158
Change-Id: Ide5104620999eaef6820917505e7299c7b0c5a03
2019-03-01 11:58:59 -08:00
Ruidong Cao 3851705a73 Fix procfs bugs
Current procfs has some bugs. After executing ls twice, many dirs come
out with same name like "1" or ".". Files like "cpuinfo" disappear.
Here variable names is a slice with cap() > len(). Sort after appending
to it will not alloc a new space and impact orignal slice. Same to m.

Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I83e5cd1c7968c6fe28c35ea4fee497488d4f9eef
PiperOrigin-RevId: 236222270
2019-02-28 16:44:54 -08:00
Jamie Liu 05d721f9ee Hold dataMu for writing in CachingInodeOperations.WriteOut.
fsutil.SyncDirtyAll mutates the DirtySet.

PiperOrigin-RevId: 236183349
Change-Id: I7e809d5b406ac843407e61eff17d81259a819b4f
2019-02-28 13:14:43 -08:00
Nicolas Lacasse d516ee3312 Allow overlay to merge Directories and SepcialDirectories.
Needed to mount inside /proc or /sys.

PiperOrigin-RevId: 235936529
Change-Id: Iee6f2671721b1b9b58a3989705ea901322ec9206
2019-02-27 09:45:45 -08:00
Fabricio Voznika 23fe059761 Lazily allocate inotify map on inode
PiperOrigin-RevId: 235735865
Change-Id: I84223eb18eb51da1fa9768feaae80387ff6bfed0
2019-02-26 09:33:44 -08:00
Googler 532f4b2fba Internal change.
PiperOrigin-RevId: 235053594
Change-Id: Ie3d7b11843d0710184a2463886c7034e8f5305d1
2019-02-21 13:08:34 -08:00
Jamie Liu 22d8b6eba1 Break /proc/[pid]/{uid,gid}_map's dependence on seqfile.
In addition to simplifying the implementation, this fixes two bugs:

- seqfile.NewSeqFile unconditionally creates an inode with mode 0444,
  but {uid,gid}_map have mode 0644.

- idMapSeqFile.Write implements fs.FileOperations.Write ... but it
  doesn't implement any other fs.FileOperations methods and is never
  used as fs.FileOperations. idMapSeqFile.GetFile() =>
  seqfile.SeqFile.GetFile() uses seqfile.seqFileOperations instead,
  which rejects all writes.

PiperOrigin-RevId: 234638212
Change-Id: I4568f741ab07929273a009d7e468c8205a8541bc
2019-02-19 11:21:46 -08:00
Nicolas Lacasse 0a41ea72c1 Don't allow writing or reading to TTY unless process group is in foreground.
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.

See drivers/tty/n_tty.c:n_tty_read()=>job_control().

If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.

See drivers/tty/tty_io.c:tty_check_change().

PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-14 15:47:31 -08:00
Googler 7aaa6cf225 Internal change.
PiperOrigin-RevId: 233802562
Change-Id: I40e1b13fd571daaf241b00f8df4bcedd034dc3f1
2019-02-13 12:07:34 -08:00
Nicolas Lacasse f17692d807 Add fs.AsyncWithContext and call it in fs/gofer/inodeOperations.Release.
fs/gofer/inodeOperations.Release does some asynchronous work.  Previously it
was calling fs.Async with an anonymous function, which caused the function to
be allocated on the heap.  Because Release is relatively hot, this results in a
lot of small allocations and increased GC pressure, noticeable in perf profiles.

This CL adds a new function, AsyncWithContext, which is just like Async, but
passes a context to the async function.  It avoids the need for an extra
anonymous function in fs/gofer/inodeOperations.Release.  The Async function
itself still requires a single anonymous function.

PiperOrigin-RevId: 233141763
Change-Id: I1dce4a883a7be9a8a5b884db01e654655f16d19c
2019-02-08 15:54:15 -08:00
Rahat Mahmood 2ba74f84be Implement /proc/net/unix.
PiperOrigin-RevId: 232948478
Change-Id: Ib830121e5e79afaf5d38d17aeef5a1ef97913d23
2019-02-07 14:44:21 -08:00
Zach Koopmans 0cf7fc4e11 Change /proc/PID/cmdline to read environment vector.
- Change proc to return envp on overwrite of argv with limitations from
upstream.
- Add unit tests
- Change layout of argv/envp on the stack so that end of argv is contiguous with
beginning of envp.

PiperOrigin-RevId: 232506107
Change-Id: I993880499ab2c1220f6dc456a922235c49304dec
2019-02-05 10:02:06 -08:00
Fabricio Voznika 2d20b121d7 CachingInodeOperations was over-dirtying cached attributes
Dirty should be set only when the attribute is changed in the cache
only. Instances where the change was also sent to the backing file
doesn't need to dirty the attribute.

Also remove size update during WriteOut as writing dirty page would
naturaly grow the file if needed.

RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 232068978
Change-Id: I00ba54693a2c7adc06efa9e030faf8f2e8e7f188
2019-02-01 17:51:48 -08:00
Nicolas Lacasse 92e85623a0 Factor the subtargets method into a helper method with tests.
PiperOrigin-RevId: 232047515
Change-Id: I00f036816e320356219be7b2f2e6d5fe57583a60
2019-02-01 15:23:43 -08:00
Michael Pratt 88b4ce8cac Fix comment
PiperOrigin-RevId: 231861005
Change-Id: I134d4e20cc898d44844219db0a8aacda87e11ef0
2019-01-31 15:03:12 -08:00
Fabricio Voznika a497f5ed5f Invalidate COW mappings when file is truncated
This changed required making fsutil.HostMappable use
a backing file to ensure the correct FD would be used
for read/write operations.

RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 231836164
Change-Id: I8ae9639715529874ea7d80a65e2c711a5b4ce254
2019-01-31 12:54:00 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Zhaozhong Ni ae6e37df2a Convert TODO into FIXME.
PiperOrigin-RevId: 231301228
Change-Id: I3e18f3a12a35fb89a22a8c981188268d5887dc61
2019-01-28 15:34:18 -08:00
Nicolas Lacasse 09cf3b40a8 Fix data race in InodeSimpleAttributes.Unstable.
We were modifying InodeSimpleAttributes.Unstable.AccessTime without holding
the necessary lock.  Luckily for us, InodeSimpleAttributes already has a
NotifyAccess method that will do the update while holding the lock.

In addition, we were holding dfo.dir.mu.Lock while setting AccessTime, which
is unnecessary, so that lock has been removed.

PiperOrigin-RevId: 231278447
Change-Id: I81ed6d3dbc0b18e3f90c1df5e5a9c06132761769
2019-01-28 13:26:28 -08:00
Jamie Liu 1cedccf8e9 Drop the one-page limit for /proc/[pid]/{cmdline,environ}.
It never actually should have applied to environ (the relevant change in
Linux 4.2 is c2c0bb44620d "proc: fix PAGE_SIZE limit of
/proc/$PID/cmdline"), and we claim to be Linux 4.4 now anyway.

PiperOrigin-RevId: 231250661
Change-Id: I37f9c4280a533d1bcb3eebb7803373ac3c7b9f15
2019-01-28 11:00:23 -08:00
Fabricio Voznika 55e8eb775b Make cacheRemoteRevalidating detect changes to file size
When file size changes outside the sandbox, page cache was not
refreshing file size which is required for cacheRemoteRevalidating.
In fact, cacheRemoteRevalidating should be skipping the cache
completely since it's not really benefiting from it. The cache is
cache is already bypassed for unstable attributes (see
cachePolicy.cacheUAttrs). And althought the cache is called to
map pages, they will always miss the cache and map directly from
the host.

Created a HostMappable struct that maps directly to the host and
use it for files with cacheRemoteRevalidating.

Closes #124

PiperOrigin-RevId: 230998440
Change-Id: Ic5f632eabe33b47241e05e98c95e9b2090ae08fc
2019-01-25 17:23:07 -08:00
Adin Scannell b5088ba59c cleanup: extract the kernel from context
Change-Id: I94704a90beebb53164325e0cce1fcb9a0b97d65c
PiperOrigin-RevId: 230817308
2019-01-24 17:02:52 -08:00
Rahat Mahmood 8d7c10e908 Display /proc/net entries for all network configurations.
Most of the entries are stubbed out at the moment, but even those were
only displayed if IPv6 support was enabled. The entries should be
displayed with IPv4-support only, and with only loopback devices.

PiperOrigin-RevId: 229946441
Change-Id: I18afaa3af386322787f91bf9d168ab66c01d5a4c
2019-01-18 10:02:12 -08:00
Nicolas Lacasse 12bc7834dc Allow fsync on a directory.
PiperOrigin-RevId: 229781337
Change-Id: I1f946cff2771714fb1abd83a83ed454e9febda0a
2019-01-17 11:06:59 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Nicolas Lacasse d321f575e2 Fix lock order violation.
overlayFileOperations.Readdir was holding overlay.copyMu while calling
DirentReaddir, which then attempts to take take the corresponding Dirent.mu,
causing a lock order violation. (See lock order documentation in
fs/copy_up.go.)

We only actually need to hold copyMu during readdirEntries(), so holding the
lock is moved in there, thus avoiding the lock order violation.

A new lock was added to protect overlayFileOperations.dirCache. We were
inadvertently relying on copyMu to protect this.  There is no reason it should
not have its own lock.

PiperOrigin-RevId: 228542473
Change-Id: I03c3a368c8cbc0b5a79d50cc486fc94adaddc1c2
2019-01-09 10:29:36 -08:00
Jamie Liu 901ed5da44 Implement /proc/[pid]/smaps.
PiperOrigin-RevId: 228245523
Change-Id: I5a4d0a6570b93958e51437e917e5331d83e23a7e
2019-01-07 15:17:44 -08:00
Fabricio Voznika 8e586db162 Add /proc/net/psched content
FIO reads this file and expects it to be well formed.

PiperOrigin-RevId: 227554483
Change-Id: Ia48ae2377626dd6a2daf17b5b4f5119f90ece55b
2019-01-02 11:39:57 -08:00
Fabricio Voznika 46e6577014 Fix deadlock between epoll_wait and getdents
epoll_wait acquires EventPoll.listsMu (in EventPoll.ReadEvents) and
then calls Inotify.Readiness which tries to acquire Inotify.evMu.

getdents acquires Inotify.evMu (in Inotify.queueEvent) and then calls
readyCallback.Callback which tries to acquire EventPoll.listsMu.

The fix is to release Inotify.evMu before calling Queue.Notify. Queue
is thread-safe and doesn't require Inotify.evMu to be held.

Closes #121

PiperOrigin-RevId: 227066695
Change-Id: Id29364bb940d1727f33a5dff9a3c52f390c15761
2018-12-27 14:59:50 -08:00
Fabricio Voznika 1679ef31ef inotify notifies watchers when control events bit are set
The code that matches the event being published with events watchers
was wronly matching all watchers in case any of the control event bits
were set.

Issue #121

PiperOrigin-RevId: 226521230
Change-Id: Ie2c42bc4366faaf59fbf80a74e9297499bd93f9e
2018-12-21 11:54:02 -08:00
Nicolas Lacasse 8ba450363f Deflake gofer_test.
We must wait for all lazy resources to be released before closing the rootFile.

PiperOrigin-RevId: 226419499
Change-Id: I1d4d961a92b3816e02690cf3eaf0a88944d730cc
2018-12-20 17:23:26 -08:00
Nicolas Lacasse d3ae74d2a5 overlayBoundEndpoint must be recursive if there is an overlay in the lower.
The old overlayBoundEndpoint assumed that the lower is not an overlay.  It
should check if the lower is an overlay and handle that case.

PiperOrigin-RevId: 225882303
Change-Id: I60660c587d91db2826e0719da0983ec8ad024cb8
2018-12-17 13:46:57 -08:00
Adin Scannell 5d8cf31346 Move fdnotifier package to reduce internal confusion.
PiperOrigin-RevId: 225632398
Change-Id: I909e7e2925aa369adc28e844c284d9a6108e85ce
2018-12-14 18:05:01 -08:00
Andrei Vagin 3cf84e3bef Mark sync.Mutex in TTYFileOperations as nosave
PiperOrigin-RevId: 225621767
Change-Id: Ie3a42cdf0b0de22a020ff43e307bf86409cff329
2018-12-14 16:24:21 -08:00
Ian Gudger e1dcf92ec5 Implement SO_SNDTIMEO
PiperOrigin-RevId: 225620490
Change-Id: Ia726107b3f58093a5f881634f90b071b32d2c269
2018-12-14 16:15:06 -08:00
Rahat Mahmood ccce1d4281 Filesystems shouldn't be saving references to Platform.
Platform objects are not savable, storing references to them in
filesystem datastructures would cause save to fail if someone actually
passed in a Platform.

Current implementations work because everywhere a Platform is
expected, we currently pass in a Kernel object which embeds Platform
and thus satisfies the interface.

Eliminate this indirection and save pointers to Kernel directly.

PiperOrigin-RevId: 225288336
Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
2018-12-12 17:47:55 -08:00
Rahat Mahmood 75e39eaa74 Pass information about map writableness to filesystems.
This is necessary to implement file seals for memfds.

PiperOrigin-RevId: 225239394
Change-Id: Ib3f1ab31385afc4b24e96cd81a05ef1bebbcbb70
2018-12-12 13:09:59 -08:00
Ian Gudger 5d87d8865f Implement MSG_WAITALL
MSG_WAITALL requests that recv family calls do not perform short reads. It only
has an effect for SOCK_STREAM sockets, other types ignore it.

PiperOrigin-RevId: 224918540
Change-Id: Id97fbf972f1f7cbd4e08eec0138f8cbdf1c94fe7
2018-12-10 17:56:34 -08:00
Zhaozhong Ni 9984138abe sentry: turn "dynamically-created" procfs files into static creation.
PiperOrigin-RevId: 224600982
Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34
2018-12-07 17:03:54 -08:00
Michael Pratt 673949048e Add period to comment
PiperOrigin-RevId: 224553291
Change-Id: I35d0772c215b71f4319c23f22df5c61c908f8590
2018-12-07 11:53:19 -08:00
Michael Pratt 9f64e64a6e Enforce directory accessibility before delete Walk
By Walking before checking that the directory is writable and
executable, MayDelete may return the Walk error (e.g., ENOENT) which
would normally be masked by a permission error (EACCES).

PiperOrigin-RevId: 224222453
Change-Id: I108a7f730e6bdaa7f277eaddb776267c00805475
2018-12-05 14:31:58 -08:00
Michael Pratt 592f5bdc67 Add context to mount errors
This makes it more obvious why a mount failed.

PiperOrigin-RevId: 224203880
Change-Id: I7961774a7b6fdbb5493a791f8b3815c49b8f7631
2018-12-05 12:46:30 -08:00
Brian Geffon 82719be42e Max link traversals should be for an entire path.
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.

PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-12-04 14:32:03 -08:00
Zhaozhong Ni adafc08d7c sentry: save / restore netstack procfs configuration.
PiperOrigin-RevId: 224047120
Change-Id: Ia6cb17fa978595cd73857b6178c4bdba401e185e
2018-12-04 14:30:42 -08:00
Brian Geffon 5a6a1eb420 Enforce name length restriction on paths.
NAME_LENGTH must be enforced per component.

PiperOrigin-RevId: 224046749
Change-Id: Iba8105b00d951f2509dc768af58e4110dafbe1c9
2018-12-04 14:28:33 -08:00
Nicolas Lacasse 54dd0d0dc5 Fix data race caused by unlocked call of Dirent.descendantOf.
PiperOrigin-RevId: 224025363
Change-Id: I98864403c779832e9e1436f7d3c3f6fb2fba9904
2018-12-04 12:24:55 -08:00
Nicolas Lacasse 573622fdca Fix data race in fs.Async.
Replaces the WaitGroup with a RWMutex. Calls to Async hold the mutex for
reading, while AsyncBarrier takes the lock for writing. This ensures that all
executing Async work finishes before AsyncBarrier returns.

Also pushes the Async() call from Inode.Release into
gofer/InodeOperations.Release(). This removes a recursive Async call which
should not have been allowed in the first place. The gofer Release call is the
slow one (since it may make RPCs to the gofer), so putting the Async call there
makes sense.

PiperOrigin-RevId: 223093067
Change-Id: I116da7b20fce5ebab8d99c2ab0f27db7c89d890e
2018-11-27 18:17:09 -08:00
Nicolas Lacasse 8c84f9a3c1 Parse the tmpfs mode before validating.
This gets rid of the problematic modeRegex.

PiperOrigin-RevId: 221835959
Change-Id: I566b8d8a43579a4c30c0a08a620a964bbcd826dd
2018-11-20 14:02:39 -08:00
Nicolas Lacasse 6ef08c2bc2 Allow setting sticky bit in tmpfs permissions.
PiperOrigin-RevId: 221683127
Change-Id: Ide6a9f41d75aa19d0e2051a05a1e4a114a4fb93c
2018-11-15 13:48:59 -08:00
Googler 25d07fbbed Internal change.
PiperOrigin-RevId: 221189534
Change-Id: Id20d318bed97d5226b454c9351df396d11251e1f
2018-11-12 17:44:46 -08:00
Rahat Mahmood 5a0be6fa20 Create stubs for syscalls upto Linux 4.4.
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise
a kernel version of 4.4.

PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
2018-11-08 11:09:46 -08:00
Juan b23cd33682 modify modeRegexp to adapt the default spec of containerd
https://github.com/containerd/containerd/blob/master/oci/spec.go#L206, the mode=755
didn't match the pattern modeRegexp = regexp.MustCompile("0[0-7][0-7][0-7]").

Closes #112

Signed-off-by: Juan <xionghuan.cn@gmail.com>
Change-Id: I469e0a68160a1278e34c9e1dbe4b7784c6f97e5a
PiperOrigin-RevId: 219672525
2018-11-01 11:57:54 -07:00
Ian Gudger 425dccdd7e Convert Unix transport to syserr
Previously this code used the tcpip error space. Since it is no longer part of
netstack, it can use the sentry's error space (except for a few cases where
there is still some shared code. This reduces the number of error space
conversions required for hot Unix socket operations.

PiperOrigin-RevId: 218541611
Change-Id: I3d13047006a8245b5dfda73364d37b8a453784bb
2018-10-24 11:05:08 -07:00
Adin Scannell 75cd70ecc9 Track paths and provide a rename hook.
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.

PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
2018-10-23 00:20:15 -07:00
Fabricio Voznika b2068cf5a5 Add more unimplemented syscall events
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.

PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
2018-10-20 11:14:23 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Ian Gudger 8c85f5e9ce Fix typos in socket_test
PiperOrigin-RevId: 217576188
Change-Id: I82e45c306c5c9161e207311c7dbb8a983820c1df
2018-10-17 13:25:45 -07:00
Ian Gudger 6cba410df0 Move Unix transport out of netstack
PiperOrigin-RevId: 217557656
Change-Id: I63d27635b1a6c12877279995d2d9847b6a19da9b
2018-10-17 11:37:51 -07:00
Ian Gudger 324ad3564b Refactor host.ConnectedEndpoint
* Integrate recvMsg and sendMsg functions into Recv and Send respectively as
  they are no longer shared.
* Clean up partial read/write error handling code.
* Re-order code to make sense given that there is no longer a host.endpoint
  type.

PiperOrigin-RevId: 217255072
Change-Id: Ib43fe9286452f813b8309d969be11f5fa40694cd
2018-10-15 20:23:18 -07:00
Ian Gudger 167f2401c4 Merge host.endpoint into host.ConnectedEndpoint
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.

PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
2018-10-15 17:48:11 -07:00
Nicolas Lacasse ecd94ea7a6 Clean up Rename and Unlink checks for EBUSY.
- Change Dirent.Busy => Dirent.isMountPoint. The function body is unchanged,
  and it is no longer exported.

- fs.MayDelete now checks that the victim is not the process root. This aligns
  with Linux's namei.c:may_delete().

- Fix "is-ancestor" checks to actually compare all ancestors, not just the
  parents.

- Fix handling of paths that end in dots, which are handled differently in
  Rename vs. Unlink.

PiperOrigin-RevId: 217239274
Change-Id: I7a0eb768e70a1b2915017ce54f7f95cbf8edf1fb
2018-10-15 17:42:30 -07:00
Zhaozhong Ni 4ea69fce8d sentry: save fs.Dirent deleted info.
PiperOrigin-RevId: 217155458
Change-Id: Id3265b1ec784787039e2131c80254ac4937330c7
2018-10-15 09:31:32 -07:00
Zhaozhong Ni 0bfa03d61c sentry: allow saving of unlinked files with open fds on virtual fs.
PiperOrigin-RevId: 216733414
Change-Id: I33cd3eb818f0c39717d6656fcdfff6050b37ebb0
2018-10-11 11:41:44 -07:00
Michael Pratt ddb34b3690 Enforce message size limits and avoid host calls with too many iovecs
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.

When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.

To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.

PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
2018-10-10 14:10:17 -07:00
Nicolas Lacasse 213f6688a5 Implement TIOCSCTTY ioctl as a noop.
PiperOrigin-RevId: 215658757
Change-Id: If63b33293f3e53a7f607ae72daa79e2b7ef6fcfd
2018-10-03 17:29:56 -07:00
Ian Gudger 4fef31f96c Add S/R support for FIOASYNC
PiperOrigin-RevId: 215655197
Change-Id: I668b1bc7c29daaf2999f8f759138bcbb09c4de6f
2018-10-03 17:03:09 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Michael Pratt 3ff24b4f2c Require AF_UNIX sockets from the gofer
host.endpoint already has the check, but it is missing from
host.ConnectedEndpoint.

PiperOrigin-RevId: 214962762
Change-Id: I88bb13a5c5871775e4e7bf2608433df8a3d348e6
2018-09-28 11:03:11 -07:00
Nicolas Lacasse b709d23987 Forward ioctl(TCSETSF) calls on host ttys to the host kernel.
We already forward TCSETS and TCSETSW.  TCSETSF is roughly equivalent but
discards pending input.

The filters were relaxed to allow host ioctls with TCSETSF argument.

This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.

Before:
	root@b8a0240fc836:/# passwd
	Enter new UNIX password: 123
	Retype new UNIX password: 123
	passwd: password updated successfully

After:
	root@ae6f5dabe402:/# passwd
	Enter new UNIX password:
	Retype new UNIX password:
	passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
2018-09-27 18:17:38 -07:00
Nicolas Lacasse fd222d62ed Short-circuit Readdir calls on overlay files when the dirent is frozen.
If we have an overlay file whose corresponding Dirent is frozen, then we should
not bother calling Readdir on the upper or lower files, since DirentReaddir
will calculate children based on the frozen Dirent tree.

A test was added that fails without this change.

PiperOrigin-RevId: 213531215
Change-Id: I4d6c98f1416541a476a34418f664ba58f936a81d
2018-09-18 15:42:22 -07:00
Ian Gudger ab6fa44588 Allow kernel.(*Task).Block to accept an extract only channel
PiperOrigin-RevId: 213328293
Change-Id: I4164133e6f709ecdb89ffbb5f7df3324c273860a
2018-09-17 13:35:54 -07:00
Michael Pratt 3aa50f18a4 Reuse readlink parameter, add sockaddr max.
PiperOrigin-RevId: 213058623
Change-Id: I522598c655d633b9330990951ff1c54d1023ec29
2018-09-14 16:00:02 -07:00
Nicolas Lacasse b84bfa570d Make gVisor hard link check match Linux's.
Linux permits hard-linking if the target is owned by the user OR the target has
Read+Write permission.

PiperOrigin-RevId: 213024613
Change-Id: If642066317b568b99084edd33ee4e8822ec9cbb3
2018-09-14 12:29:46 -07:00
Nicolas Lacasse 9751b800a6 runsc: Support multi-container exec.
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.

Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute.  So we have to
do the path lookup much later than we previously were.

PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
2018-09-07 17:39:54 -07:00
Fabricio Voznika 41b56696c4 Imported FD in exec was leaking
Imported file needs to be closed after it's
been imported.

PiperOrigin-RevId: 211732472
Change-Id: Ia9249210558b77be076bcce465b832a22eed301f
2018-09-05 18:07:11 -07:00
Michael Pratt 3944cb41cb /proc/PID/mounts is not tab-delimited
PiperOrigin-RevId: 211513847
Change-Id: Ib484dd2d921c3e5d70d0e410cd973d3bff4f6b73
2018-09-04 13:29:49 -07:00
Adin Scannell c09f9acd7c Distinguish Element and Linker for ilist.
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.

This is a requisite change for supporting a per-Inode list of Dirents.

PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
2018-09-04 09:19:11 -07:00
Jamie Liu b935311e23 Do not use fs.FileOwnerFromContext in fs/proc.file.UnstableAttr().
From //pkg/sentry/context/context.go:

// - It is *not safe* to retain a Context passed to a function beyond the scope
// of that function call.

Passing a stored kernel.Task as a context.Context to
fs.FileOwnerFromContext violates this requirement.

PiperOrigin-RevId: 211143021
Change-Id: I4c5b02bd941407be4c9cfdbcbdfe5a26acaec037
2018-08-31 14:17:56 -07:00
Nicolas Lacasse 8bfb5fa919 fs: Add empty dir at /sys/class/power_supply.
PiperOrigin-RevId: 210953512
Change-Id: I07d2d7fb0d268aa8eca26d81ef28b5b5c42289ee
2018-08-30 12:01:27 -07:00
Nicolas Lacasse 956fe64ad6 fs: Fix renameMu lock recursion.
dirent.walk() takes renameMu, but is often called with renameMu already held,
which can lead to a deadlock.

Fix this by requiring renameMu to be held for reading when dirent.walk() is
called. This causes walks and existence checks to block while a rename
operation takes place, but that is what we were already trying to enforce by
taking renameMu in walk() anyways.

PiperOrigin-RevId: 210760780
Change-Id: Id61018e6e4adbeac53b9c1b3aa24ab77f75d8a54
2018-08-29 11:47:01 -07:00
Nicolas Lacasse 1893247616 fs: Drop reference to over-written file before renaming over it.
dirent.go:Rename() walks to the file being replaced and defers
replaced.DecRef(). After the rename, the reference is dropped, triggering a
writeout and SettAttr call to the gofer. Because of lazyOpenForWrite, the gofer
opens the replaced file BY ITS OLD NAME and calls ftruncate on it.

This CL changes Remove to drop the reference on replaced (and thus trigger
writeout) before the actual rename call.

PiperOrigin-RevId: 210756097
Change-Id: I01ea09a5ee6c2e2d464560362f09943641638e0f
2018-08-29 11:22:27 -07:00
Nicolas Lacasse 3b11769c77 fs: Don't bother saving negative dirents.
PiperOrigin-RevId: 210616454
Change-Id: I3f536e2b4d603e540cdd9a67c61b8ec3351f4ac3
2018-08-28 15:18:42 -07:00
Nicolas Lacasse 515d9bf43b fs: Add tests for dirent ref counting with an overlay.
PiperOrigin-RevId: 210614669
Change-Id: I408365ff6d6c7765ed7b789446d30e7079cbfc67
2018-08-28 15:09:17 -07:00
Zhaozhong Ni d724863a31 sentry: optimize dirent weakref map save / restore.
Weak references save / restore involves multiple interface indirection
and cause material latency overhead when there are lots of dirents, each
containing a weak reference map. The nil entries in the map should also
be purged.

PiperOrigin-RevId: 210593727
Change-Id: Ied6f4c3c0726fcc53a24b983d9b3a79121b6b758
2018-08-28 13:22:07 -07:00
Brian Geffon f0492d45aa Add /proc/sys/kernel/shm[all,max,mni].
PiperOrigin-RevId: 210459956
Change-Id: I51859b90fa967631e0a54a390abc3b5541fbee66
2018-08-27 17:21:37 -07:00
Nicolas Lacasse 0b3bfe2ea3 fs: Fix remote-revalidate cache policy.
When revalidating a Dirent, if the inode id is the same, then we don't need to
throw away the entire Dirent. We can just update the unstable attributes in
place.

If the inode id has changed, then the remote file has been deleted or moved,
and we have no choice but to throw away the dirent we have a look up another.
In this case, we may still end up losing a mounted dirent that is a child of
the revalidated dirent. However, that seems appropriate here because the entire
mount point has been pulled out from underneath us.

Because gVisor's overlay is at the Inode level rather than the Dirent level, we
must pass the parent Inode and name along with the Inode that is being
revalidated.

PiperOrigin-RevId: 210431270
Change-Id: I705caef9c68900234972d5aac4ae3a78c61c7d42
2018-08-27 14:26:29 -07:00
Zhaozhong Ni bd01816c87 sentry: mark fsutil.DirFileOperations as savable.
PiperOrigin-RevId: 210405166
Change-Id: I252766015885c418e914007baf2fc058fec39b3e
2018-08-27 11:55:32 -07:00
Kevin Krakauer 2524111fc6 runsc: Terminal resizing support.
Implements the TIOCGWINSZ and TIOCSWINSZ ioctls, which allow processes to resize
the terminal. This allows, for example, sshd to properly set the window size for
ssh sessions.

PiperOrigin-RevId: 210392504
Change-Id: I0d4789154d6d22f02509b31d71392e13ee4a50ba
2018-08-27 10:49:16 -07:00
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminial:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Nicolas Lacasse c48708a041 fs: Drop unused WaitGroup in Dirent.destroy.
PiperOrigin-RevId: 210182476
Change-Id: I655a2a801e2069108d30323f7f5ae76deb3ea3ec
2018-08-24 17:15:42 -07:00
Zhaozhong Ni ba8f6ba8c8 sentry: mark idMapSeqHandle as savable.
PiperOrigin-RevId: 209994384
Change-Id: I16186cf79cb4760a134f3968db30c168a5f4340e
2018-08-23 13:59:20 -07:00
Zhaozhong Ni 6b9133ba96 sentry: mark S/R stating errors as save rejections / fs corruptions.
PiperOrigin-RevId: 209817767
Change-Id: Iddf2b8441bc44f31f9a8cf6f2bd8e7a5b824b487
2018-08-22 13:19:16 -07:00
Nicolas Lacasse 8d318aac55 fs: Hold Dirent.mu when calling Dirent.flush().
As required by the contract in Dirent.flush().

Also inline Dirent.freeze() into Dirent.Freeze(), since it is only called from
there.

PiperOrigin-RevId: 209783626
Change-Id: Ie6de4533d93dd299ffa01dabfa257c9cc259b1f4
2018-08-22 10:07:01 -07:00
Zhaozhong Ni 8bb50dab79 sentry: do not release gofer inode file state loading lock upon error.
When an inode file state failed to load asynchronuously, we want to report
the error instead of potentially panicing in another async loading goroutine
incorrectly unblocked.

PiperOrigin-RevId: 209683977
Change-Id: I591cde97710bbe3cdc53717ee58f1d28bbda9261
2018-08-21 16:52:27 -07:00
Nicolas Lacasse 0050e3e71c sysfs: Add (empty) cpu directories for each cpu in /sys/devices/system/cpu.
Numpy needs these.

Also added the "present" directory, since the contents are the same as possible
and online.

PiperOrigin-RevId: 209451777
Change-Id: I2048de3f57bf1c57e9b5421d607ca89c2a173684
2018-08-20 11:19:15 -07:00
Chenggang Qin aeec7a4c00 fs: Support possible and online knobs for cpu
Some linux commands depend on /sys/devices/system/cpu/possible, such
as 'lscpu'.

Add 2 knobs for cpu:
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/online
Both the values are '0 - Kernel.ApplicationCores()-1'.

Change-Id: Iabd8a4e559cbb630ed249686b92c22b4e7120663
PiperOrigin-RevId: 209070163
2018-08-16 16:28:14 -07:00
Nicolas Lacasse e8a4f2e133 runsc: Change cache policy for root fs and volume mounts.
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively.  While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.

This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.

This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.

A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.

All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely served by the
root fs.

PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
2018-08-14 16:25:58 -07:00
Kevin Krakauer d4939f6dc2 TTY: Fix data race where calls into tty.queue's waiter were not synchronized.
Now, there's a waiter for each end (master and slave) of the TTY, and each
waiter.Entry is only enqueued in one of the waiters.

PiperOrigin-RevId: 208734483
Change-Id: I06996148f123075f8dd48cde5a553e2be74c6dce
2018-08-14 16:22:56 -07:00
Kevin Krakauer 12a4912aed Fix `ls -laR | wc -l` hanging.
stat()-ing /proc/PID/fd/FD incremented but didn't decrement the refcount for
FD. This behavior wasn't usually noticeable, but in the above case:

- ls would never decrement the refcount of the write end of the pipe to 0.
- This caused the write end of the pipe never to close.
- wc would then hang read()-ing from the pipe.

PiperOrigin-RevId: 208728817
Change-Id: I4fca1ba5ca24e4108915a1d30b41dc63da40604d
2018-08-14 15:49:58 -07:00
Nicolas Lacasse 66b0f3e15a Fix bind() on overlays.
InodeOperations.Bind now returns a Dirent which will be cached in the Dirent
tree.

When an overlay is in-use, Bind cannot return the Dirent created by the upper
filesystem because the Dirent does not know about the overlay. Instead,
overlayBind must create a new overlay-aware Inode and Dirent and return that.
This is analagous to how Lookup and overlayLookup work.

PiperOrigin-RevId: 208670710
Change-Id: I6390affbcf94c38656b4b458e248739b4853da29
2018-08-14 10:34:56 -07:00
Adin Scannell dde836a918 Prevent renames across walk fast path.
PiperOrigin-RevId: 208533436
Change-Id: Ifc1a4e2d6438a424650bee831c301b1ac0d670a3
2018-08-13 13:31:18 -07:00
Nicolas Lacasse a2ec391dfb fs: Allow overlays to revalidate files from the upper fs.
Previously, an overlay would panic if either the upper or lower fs required
revalidation for a given Dirent. Now, we allow revalidation from the upper
file, but not the lower.

If a cached overlay inode does need revalidation (because the upper needs
revalidation), then the entire overlay Inode will be discarded and a new
overlay Inode will be built with a fresh copy of the upper file.

As a side effect of this change, Revalidate must take an Inode instead of a
Dirent, since an overlay needs to revalidate individual Inodes.

PiperOrigin-RevId: 208293638
Change-Id: Ic8f8d1ffdc09114721745661a09522b54420c5f1
2018-08-10 17:16:38 -07:00
Nicolas Lacasse 567c5eed11 cache policy: Check policy before returning a negative dirent.
The cache policy determines whether Lookup should return a negative dirent, or
just ENOENT. This CL fixes one spot where we returned a negative dirent without
first consulting the policy.

PiperOrigin-RevId: 208280230
Change-Id: I8f963bbdb45a95a74ad0ecc1eef47eff2092d3a4
2018-08-10 15:43:03 -07:00
Brielle Broder 4ececd8e8d Enable checkpoint/restore in cases of UDS use.
Previously, processes which used file-system Unix Domain Sockets could not be
checkpoint-ed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.

PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
2018-08-10 14:33:20 -07:00
Nicolas Lacasse a38f41b464 fs: Add new cache policy "remote_revalidate".
This CL adds a new cache-policy for gofer filesystems that uses the host page
cache, but causes dirents to be reloaded on each Walk, and does not cache
readdir results.

This policy is useful when the remote filesystem may change out from underneath
us, as any remote changes will be reflected on the next Walk.

Importantly, this cache policy is only consistent if we do not use gVisor's
internal page cache, since that page cache is tied to the Inode and may be
thrown away upon Revalidation.

This cache policy should only be used when the gofer supports donating host
FDs, since then gVisor will make use of the host kernel page cache, which will
be consistent for all open files in the gofer. In fact, a panic will be raised
if a file is opened without a donated FD.

PiperOrigin-RevId: 207752937
Change-Id: I233cb78b4695bbe00a4605ae64080a47629329b8
2018-08-07 11:43:41 -07:00