Commit Graph

189 Commits

Author SHA1 Message Date
Nicolas Lacasse e8a4f2e133 runsc: Change cache policy for root fs and volume mounts.
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively.  While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.

This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.

This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.

A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.

All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely be served by the
root fs.

PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
2018-08-14 16:25:58 -07:00
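
The mapping described in the commit above can be sketched as follows. This is a minimal illustration, not runsc's configuration code: the flag values and policy names come from the commit message, while the `cachePolicy` type and the `rootPolicy` and `volumePolicy` functions are hypothetical names. Flag values the message does not mention are simply rejected by the sketch.

```go
package main

import "fmt"

// cachePolicy is a hypothetical type; only the policy names come from the
// commit message.
type cachePolicy string

const (
	policyFscache          cachePolicy = "fscache"
	policyRemoteRevalidate cachePolicy = "remote-revalidate"
)

// rootPolicy returns the cache policy the root filesystem would use for a
// given --file-access value, as described in the commit message.
func rootPolicy(fileAccess string) (cachePolicy, error) {
	switch fileAccess {
	case "proxy":
		// Default value: changes stay visible across the sandbox boundary.
		return policyRemoteRevalidate, nil
	case "proxy-exclusive":
		// Previous aggressive caching; assumes exclusive filesystem access.
		return policyFscache, nil
	default:
		return "", fmt.Errorf("--file-access value %q is not covered by this sketch", fileAccess)
	}
}

// volumePolicy reflects that volume mounts always use the new policy.
func volumePolicy() cachePolicy {
	return policyRemoteRevalidate
}

func main() {
	for _, fa := range []string{"proxy", "proxy-exclusive"} {
		p, err := rootPolicy(fa)
		if err != nil {
			panic(err)
		}
		fmt.Printf("--file-access=%s: root fs uses %s, volumes use %s\n", fa, p, volumePolicy())
	}
}
```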
Kevin Krakauer d4939f6dc2 TTY: Fix data race where calls into tty.queue's waiter were not synchronized.
Now, there's a waiter for each end (master and slave) of the TTY, and each
waiter.Entry is only enqueued in one of the waiters.

PiperOrigin-RevId: 208734483
Change-Id: I06996148f123075f8dd48cde5a553e2be74c6dce
2018-08-14 16:22:56 -07:00
Kevin Krakauer 12a4912aed Fix `ls -laR | wc -l` hanging.
stat()-ing /proc/PID/fd/FD incremented but didn't decrement the refcount for
FD. This behavior wasn't usually noticeable, but in the above case:

- ls would never decrement the refcount of the write end of the pipe to 0.
- This caused the write end of the pipe never to close.
- wc would then hang read()-ing from the pipe.

PiperOrigin-RevId: 208728817
Change-Id: I4fca1ba5ca24e4108915a1d30b41dc63da40604d
2018-08-14 15:49:58 -07:00
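
The leak described above can be reproduced with a toy reference count. The `file` type and the `statLeaky` and `statFixed` functions are hypothetical and stand in for the real sentry types; the sketch only shows why a missing decrement keeps the write end of the pipe open.

```go
package main

import "fmt"

// file is a hypothetical reference-counted object standing in for the
// write end of a pipe.
type file struct {
	refs   int
	closed bool
}

func (f *file) incRef() { f.refs++ }

// decRef drops one reference and closes the file when none remain.
func (f *file) decRef() {
	f.refs--
	if f.refs == 0 {
		f.closed = true
	}
}

// statLeaky models the bug: it takes a reference and never drops it.
func statLeaky(f *file) {
	f.incRef()
	// ... read attributes ...
}

// statFixed drops the reference it took before returning.
func statFixed(f *file) {
	f.incRef()
	defer f.decRef()
	// ... read attributes ...
}

func main() {
	leaky := &file{refs: 1}
	statLeaky(leaky)
	leaky.decRef() // the owner closes its descriptor
	fmt.Println("leaky closed:", leaky.closed)

	fixed := &file{refs: 1}
	statFixed(fixed)
	fixed.decRef() // the owner closes its descriptor
	fmt.Println("fixed closed:", fixed.closed)
}
```

With the leaky variant the count never reaches zero, so a reader on the other end of the pipe would never see end-of-file.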
Ian Gudger e97717e29a Enforce Unix socket address length limit
PiperOrigin-RevId: 208720936
Change-Id: Ic943a88b6efeff49574306d4d4e1f113116ae32e
2018-08-14 15:07:05 -07:00
Nicolas Lacasse 6cf2278167 Automated rollback of changelist 208284483
PiperOrigin-RevId: 208685417
Change-Id: Ie2849c4811e3a2d14a002f521cef018ded0c6c4a
2018-08-14 11:50:49 -07:00
Nicolas Lacasse 66b0f3e15a Fix bind() on overlays.
InodeOperations.Bind now returns a Dirent which will be cached in the Dirent
tree.

When an overlay is in use, Bind cannot return the Dirent created by the upper
filesystem because the Dirent does not know about the overlay. Instead,
overlayBind must create a new overlay-aware Inode and Dirent and return that.
This is analogous to how Lookup and overlayLookup work.

PiperOrigin-RevId: 208670710
Change-Id: I6390affbcf94c38656b4b458e248739b4853da29
2018-08-14 10:34:56 -07:00
Adin Scannell dde836a918 Prevent renames across walk fast path.
PiperOrigin-RevId: 208533436
Change-Id: Ifc1a4e2d6438a424650bee831c301b1ac0d670a3
2018-08-13 13:31:18 -07:00
Nicolas Lacasse a2ec391dfb fs: Allow overlays to revalidate files from the upper fs.
Previously, an overlay would panic if either the upper or lower fs required
revalidation for a given Dirent. Now, we allow revalidation from the upper
file, but not the lower.

If a cached overlay inode does need revalidation (because the upper needs
revalidation), then the entire overlay Inode will be discarded and a new
overlay Inode will be built with a fresh copy of the upper file.

As a side effect of this change, Revalidate must take an Inode instead of a
Dirent, since an overlay needs to revalidate individual Inodes.

PiperOrigin-RevId: 208293638
Change-Id: Ic8f8d1ffdc09114721745661a09522b54420c5f1
2018-08-10 17:16:38 -07:00
Justine Olshan ae6f092fe1 Implemented the splice(2) syscall.
Currently the implementation matches the behavior of moving data
between two file descriptors. However, it does not implement this
through zero-copy movement. Thus, this code is a starting point
to build the more complex implementation.

PiperOrigin-RevId: 208284483
Change-Id: Ibde79520a3d50bc26aead7ad4f128d2be31db14e
2018-08-10 16:11:01 -07:00
Nicolas Lacasse 567c5eed11 cache policy: Check policy before returning a negative dirent.
The cache policy determines whether Lookup should return a negative dirent, or
just ENOENT. This CL fixes one spot where we returned a negative dirent without
first consulting the policy.

PiperOrigin-RevId: 208280230
Change-Id: I8f963bbdb45a95a74ad0ecc1eef47eff2092d3a4
2018-08-10 15:43:03 -07:00
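
The check described above could look roughly like this. All names here (`cachePolicy`, `cacheNegativeDirents`, `dirent`, `lookupMissing`, `errNotExist`) are hypothetical; the sketch only shows a lookup of a missing name consulting the policy before handing back a negative dirent.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotExist stands in for ENOENT in this sketch.
var errNotExist = errors.New("no such file or directory")

// dirent is a hypothetical directory entry; negative marks a cached
// "this name does not exist" result.
type dirent struct {
	name     string
	negative bool
}

// cachePolicy is a hypothetical policy with a single knob.
type cachePolicy struct {
	cacheNegativeDirents bool
}

// lookupMissing models a lookup of a name that the backing filesystem
// reports as missing. The policy decides between a negative dirent and a
// plain error.
func lookupMissing(p cachePolicy, name string) (*dirent, error) {
	if p.cacheNegativeDirents {
		return &dirent{name: name, negative: true}, nil
	}
	return nil, errNotExist
}

func main() {
	for _, p := range []cachePolicy{{cacheNegativeDirents: true}, {cacheNegativeDirents: false}} {
		d, err := lookupMissing(p, "missing.txt")
		fmt.Printf("policy=%+v negative dirent returned=%v err=%v\n", p, d != nil, err)
	}
}
```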
Brielle Broder 4ececd8e8d Enable checkpoint/restore in cases of UDS use.
Previously, processes which used file-system Unix Domain Sockets could not be
checkpointed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.

PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
2018-08-10 14:33:20 -07:00
Neel Natu d5b702b64f Validate FS.base before establishing it in the task's register set.
PiperOrigin-RevId: 208229341
Change-Id: I5d84bc52bbafa073446ef497e56958d0d7955aa8
2018-08-10 10:27:09 -07:00
Michael Pratt 2e06b23aa6 Fix missing O_LARGEFILE from O_CREAT files
Clean up some more syscall.O_* references while we're here.

PiperOrigin-RevId: 208133460
Change-Id: I48db71a38f817e4f4673977eafcc0e3874eb9a25
2018-08-09 16:50:37 -07:00
Fabricio Voznika 4e171f7590 Basic support for ip link/addr and ifconfig
Closes #94

PiperOrigin-RevId: 207997580
Change-Id: I19b426f1586b5ec12f8b0cd5884d5b401d334924
2018-08-08 22:39:58 -07:00
Adin Scannell dbbe9ec915 Protect PCIDs with a mutex.
Because the Drop method may be called across vCPUs, it is necessary to protect
the PCID database with a mutex to prevent concurrent modification. The PCID is
assigned prior to entersyscall, so it's safe to block.

PiperOrigin-RevId: 207992864
Change-Id: I8b36d55106981f51e30dcf03e12886330bb79d67
2018-08-08 21:29:19 -07:00
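
A minimal sketch of the locking described above, assuming a much simpler structure than the real PCID database. The `pcids` and `pageTables` types and the lower-case `assign` and `drop` methods are hypothetical stand-ins; the point is only that both paths take the same mutex before touching the shared map.

```go
package main

import (
	"fmt"
	"sync"
)

// pageTables is a hypothetical stand-in for a set of page tables.
type pageTables struct {
	name string
}

// pcids is a hypothetical, simplified PCID database.
type pcids struct {
	mu    sync.Mutex
	cache map[*pageTables]uint16 // assigned PCIDs
	avail []uint16               // PCIDs not currently assigned
}

func newPCIDs(n uint16) *pcids {
	p := &pcids{cache: make(map[*pageTables]uint16)}
	for id := uint16(1); id <= n; id++ {
		p.avail = append(p.avail, id)
	}
	return p
}

// assign returns the PCID for pt, allocating one if needed. The second
// result is false when no PCID is available.
func (p *pcids) assign(pt *pageTables) (uint16, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if id, ok := p.cache[pt]; ok {
		return id, true
	}
	if len(p.avail) == 0 {
		return 0, false
	}
	id := p.avail[len(p.avail)-1]
	p.avail = p.avail[:len(p.avail)-1]
	p.cache[pt] = id // remember the assignment for later calls
	return id, true
}

// drop releases the PCID held by pt, if any. It may run concurrently with
// assign, which is why both hold p.mu.
func (p *pcids) drop(pt *pageTables) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if id, ok := p.cache[pt]; ok {
		delete(p.cache, pt)
		p.avail = append(p.avail, id)
	}
}

func main() {
	p := newPCIDs(4)
	pt := &pageTables{name: "a"}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = p.assign(pt)
		}()
	}
	wg.Wait()
	id, ok := p.assign(pt)
	fmt.Println("assigned:", id, ok)
	p.drop(pt)
	fmt.Println("free PCIDs after drop:", len(p.avail))
}
```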
Fabricio Voznika 0d350aac7f Enable SACK in runsc
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.

PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
2018-08-08 10:26:18 -07:00
Jamie Liu c036da5dff Hold TaskSet.mu in Task.Parent.
PiperOrigin-RevId: 207766238
Change-Id: Id3b66d8fe1f44c3570f67fa5ae7ba16021e35be1
2018-08-07 13:09:42 -07:00
Nicolas Lacasse a38f41b464 fs: Add new cache policy "remote_revalidate".
This CL adds a new cache-policy for gofer filesystems that uses the host page
cache, but causes dirents to be reloaded on each Walk, and does not cache
readdir results.

This policy is useful when the remote filesystem may change out from underneath
us, as any remote changes will be reflected on the next Walk.

Importantly, this cache policy is only consistent if we do not use gVisor's
internal page cache, since that page cache is tied to the Inode and may be
thrown away upon Revalidation.

This cache policy should only be used when the gofer supports donating host
FDs, since then gVisor will make use of the host kernel page cache, which will
be consistent for all open files in the gofer. In fact, a panic will be raised
if a file is opened without a donated FD.

PiperOrigin-RevId: 207752937
Change-Id: I233cb78b4695bbe00a4605ae64080a47629329b8
2018-08-07 11:43:41 -07:00
Zhaozhong Ni c348d07863 sentry: make epoll.pollEntry wait for the file operation in restore.
PiperOrigin-RevId: 207737935
Change-Id: I3a301ece1f1d30909715f36562474e3248b6a0d5
2018-08-07 10:27:37 -07:00
Michael Pratt 42086fe8e1 Make ramfs.File savable
In other news, apparently proc.fdInfo is the last user of ramfs.File.

PiperOrigin-RevId: 207564572
Change-Id: I5a92515698cc89652b80bea9a32d309e14059869
2018-08-06 10:15:56 -07:00
ShiruRen 3ec074897f Fix a bug in PCIDs.Assign
Store the new assigned pcid in p.cache[pt].

Signed-off-by: ShiruRen <renshiru2000@gmail.com>

Change-Id: I4aee4e06559e429fb5e90cb9fe28b36139e3b4b6
PiperOrigin-RevId: 207563833
2018-08-06 10:11:56 -07:00
Zhaozhong Ni 25178ebdf5 stateify: make explicit mode no longer optional.
PiperOrigin-RevId: 207303405
Change-Id: I17b6433963d78e3631a862b7ac80f566c8e7d106
2018-08-03 12:09:13 -07:00
Michael Pratt a3927157c5 Copy creds in access
PiperOrigin-RevId: 207181631
Change-Id: Ic6205278715a9260fb970efb414fc758ea72c4c6
2018-08-02 16:01:31 -07:00
Michael Pratt b6a37ab9d9 Update comment reference
PiperOrigin-RevId: 207180809
Change-Id: I08c264812919e81b2c56fdd4a9ef06924de8b52f
2018-08-02 15:56:40 -07:00
Zhaozhong Ni 57d0fcbdbf Automated rollback of changelist 207037226
PiperOrigin-RevId: 207125440
Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02 10:42:48 -07:00
Brian Geffon cf44aff6e0 Add seccomp(2) support.
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.

PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
2018-08-02 08:10:30 -07:00
Michael Pratt 60add78980 Automated rollback of changelist 207007153
PiperOrigin-RevId: 207037226
Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01 19:57:32 -07:00
Zhaozhong Ni b9e1cf8404 stateify: convert all packages to use explicit mode.
PiperOrigin-RevId: 207007153
Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-08-01 15:43:24 -07:00
Brielle Broder 6b87378634 New conditional for adding key/value pairs to maps.
When adding MultiDeviceKeys and their values into MultiDevice maps, make
sure the keys and values have not already been added. This ensures that
preexisting key/value pairs are not overridden.

PiperOrigin-RevId: 206942766
Change-Id: I9d85f38eb59ba59f0305e6614a52690608944981
2018-08-01 09:44:57 -07:00
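
The conditional described above amounts to an insert-if-absent check. In this sketch the `multiDeviceKey` fields and the `addIfAbsent` helper are hypothetical; only the idea of not overriding an existing pair comes from the commit message.

```go
package main

import "fmt"

// multiDeviceKey is a hypothetical key type; the field names are
// illustrative only.
type multiDeviceKey struct {
	device uint64
	inode  uint64
}

// addIfAbsent stores v under k only when k is not already present, and
// reports whether the pair was inserted.
func addIfAbsent(m map[multiDeviceKey]uint64, k multiDeviceKey, v uint64) bool {
	if _, ok := m[k]; ok {
		return false // keep the preexisting value
	}
	m[k] = v
	return true
}

func main() {
	m := make(map[multiDeviceKey]uint64)
	k := multiDeviceKey{device: 1, inode: 42}
	fmt.Println(addIfAbsent(m, k, 100), m[k])
	fmt.Println(addIfAbsent(m, k, 200), m[k])
}
```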
Andrei Vagin a7a0167716 proc: show file flags in fdinfo
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:

/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}

Actually, fdinfo has to contain file flags.

Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
2018-07-31 11:19:15 -07:00
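
For comparison with the broken output above, here is a small sketch of formatting numeric file flags in octal. The `fdinfoFlags` helper and the constant names are hypothetical; the flag values are the Linux x86-64 ones, and the "0" prefix follows the form Linux prints in fdinfo.

```go
package main

import "fmt"

// Linux x86-64 open flag values, spelled out here so the sketch is
// self-contained.
const (
	oRDWR      = 02
	oLargefile = 0100000
)

// fdinfoFlags renders file flags as an octal number with a leading zero.
// Passing an integer, not a struct, is what makes the %o verb valid.
func fdinfoFlags(flags uint32) string {
	return fmt.Sprintf("flags:\t0%o\n", flags)
}

func main() {
	fmt.Print(fdinfoFlags(oRDWR | oLargefile))
}
```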
Justine Olshan 2793f7ac5f Added the O_LARGEFILE flag.
This flag will always be true for gVisor files.

PiperOrigin-RevId: 206355963
Change-Id: I2f03d2412e2609042df43b06d1318cba674574d0
2018-07-27 12:27:46 -07:00
Zhaozhong Ni be7fcbc558 stateify: support explicit annotation mode; convert refs and stack packages.
We have been unnecessarily creating too many savable types implicitly.

PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
2018-07-27 10:17:21 -07:00
Nicolas Lacasse 127c977ab0 Don't copy-up extended attributes that specifically configure a lower overlay.
When copying-up files from a lower fs to an upper, we also copy the extended
attributes on the file. If there is a (nested) overlay inside the lower, some
of these extended attributes configure the lower overlay, and should not be
copied-up to the upper.

In particular, whiteout attributes in the lower fs overlay should not be
copied-up, since the upper fs may actually contain the file.

PiperOrigin-RevId: 206236010
Change-Id: Ia0454ac7b99d0e11383f732a529cb195ed364062
2018-07-26 15:55:50 -07:00
Michael Pratt 7cd9405b9c Format openat flags
PiperOrigin-RevId: 206021774
Change-Id: I447b6c751c28a8d8d4d78468b756b6ad8c61e169
2018-07-25 11:07:19 -07:00
Kevin Krakauer 32aa0f5465 Typo fix.
PiperOrigin-RevId: 205880843
Change-Id: If2272b25f08a18ebe9b6309a1032dd5cdaa59866
2018-07-24 13:26:06 -07:00
Fabricio Voznika d7a34790a0 Add KVM and overlay dimensions to container_test
PiperOrigin-RevId: 205714667
Change-Id: I317a2ca98ac3bdad97c4790fcc61b004757d99ef
2018-07-23 13:31:42 -07:00
Michael Pratt 5f134b3c0a Format getcwd path
PiperOrigin-RevId: 205440332
Change-Id: I2a838f363e079164c83da88e1b0b8769844fe79b
2018-07-20 12:59:41 -07:00
Adin Scannell 8b8aad91d5 kernel: mutations on creds now require a copy.
PiperOrigin-RevId: 205315612
Change-Id: I9a0a1e32c8abfb7467a38743b82449cc92830316
2018-07-19 15:48:56 -07:00
Nicolas Lacasse be431d0934 fs: Pass context to Revalidate() function.
The current revalidation logic is very simple and does not do much
introspection of the dirent being revalidated (other than looking at the type
of file).

Fancier revalidation logic is coming soon, and we need to be able to look at
the cached and uncached attributes of a given dirent, and we need a context to
perform some of these operations.

PiperOrigin-RevId: 205307351
Change-Id: If17ea1c631d8f9489c0e05a263e23d7a8a3bf159
2018-07-19 14:57:52 -07:00
Nicolas Lacasse ea37103196 ConfigureMMap on an overlay file delegates to the upper if there is no lower.
In the general case with an overlay, all mmap calls must go through the
overlay, because in the event of a copy-up, the overlay needs to invalidate any
previously-created mappings.

If there is no lower file, however, there will never be a copy-up, so the
overlay can delegate directly to the upper file in that case.

This also allows us to correctly mmap /dev/zero when it is in an overlay. This
file has special semantics which the overlay does not know about. In
particular, it does not implement Mappable(), which (in the general case) the
overlay uses to detect if a file is mappable or not.

PiperOrigin-RevId: 205306743
Change-Id: I92331649aa648340ef6e65411c2b42c12fa69631
2018-07-19 14:53:38 -07:00
Brian Geffon df5a5d388e Add AT_UID, AT_EUID, AT_GID, AT_EGID to aux vector.
With musl libc, when these entries are missing from the aux vector,
libc.secure is forced on (effectively AT_SECURE). This mode prevents
RPATH and LD_LIBRARY_PATH from working.

https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n1488
As the first entry is a mask of all the aux fields set:
https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n187

PiperOrigin-RevId: 205284684
Change-Id: I04de7bab241043306b4f732306a81d74edfdff26
2018-07-19 12:42:05 -07:00
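
The decision described above can be paraphrased as follows. This is a sketch of the behavior as the commit message explains it, not a translation of musl's code: the `secureMode` function and the map representation of the aux vector are assumptions, while the `AT_*` numbers are the Linux auxv constants.

```go
package main

import "fmt"

// Linux aux vector keys.
const (
	atUID    = 11
	atEUID   = 12
	atGID    = 13
	atEGID   = 14
	atSecure = 23
)

// secureMode reports whether a libc following the described logic would
// enter secure mode: any of the four ID entries is missing, the real and
// effective IDs differ, or AT_SECURE is set.
func secureMode(aux map[uint64]uint64) bool {
	for _, k := range []uint64{atUID, atEUID, atGID, atEGID} {
		if _, ok := aux[k]; !ok {
			return true
		}
	}
	return aux[atUID] != aux[atEUID] || aux[atGID] != aux[atEGID] || aux[atSecure] != 0
}

func main() {
	complete := map[uint64]uint64{atUID: 1000, atEUID: 1000, atGID: 1000, atEGID: 1000}
	missing := map[uint64]uint64{atSecure: 0}
	fmt.Println("complete aux vector, secure mode:", secureMode(complete))
	fmt.Println("missing ID entries, secure mode:", secureMode(missing))
}
```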
Zhaozhong Ni a95640b1e9 sentry: save stack in proc net dev.
PiperOrigin-RevId: 205253858
Change-Id: Iccdc493b66d1b4d39de44afb1184952183b1283f
2018-07-19 09:37:32 -07:00
Nicolas Lacasse 63e2820f7b Fix lock-ordering violation in Create by logging BaseName instead of FullName.
Dirent.FullName takes the global renameMu, but can be called during Create,
which itself takes dirent.mu and dirent.dirMu, which is a lock-order violation:

Dirent.Create
  d.dirMu.Lock
  d.mu.Lock
  Inode.Create
    gofer.inodeOperations.Create
      gofer.NewFile
        Dirent.FullName
          d.renameMu.RLock

We only use the FullName here for logging, and in this case we can get by with
logging only the BaseName.

A `BaseName` method was added to Dirent, which simply returns the name, taking
d.parent.mu as required.

In the Create pathway, we can't call d.BaseName() because taking d.parent.mu
after d.mu violates the lock order. But we already know the base name of the
file we just created, so that's OK.

In the Open/GetFile pathway, we are free to call d.BaseName() because the other
dirent locks are not held.

PiperOrigin-RevId: 205112278
Change-Id: Ib45c734081aecc9b225249a65fa8093eb4995f10
2018-07-18 11:49:50 -07:00
Michael Pratt 733ebe7c09 Merge FileMem.usage in IncRef
Per the doc, usage must be kept maximally merged. Beyond that, it is simply a
good idea to keep fragmentation in usage to a minimum.

The glibc malloc allocator allocates one page at a time, potentially causing
lots of fragmentation. However, those pages are likely to have the same number
of references, often making it possible to merge ranges.

PiperOrigin-RevId: 204960339
Change-Id: I03a050cf771c29a4f05b36eaf75b1a09c9465e14
2018-07-17 13:03:59 -07:00
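
The merging described above can be sketched on a plain slice. The `segment` type and `mergeAdjacent` function are hypothetical and much simpler than a real segment set; the sketch assumes the input is sorted and non-overlapping, and merges neighbours that touch and carry the same reference count.

```go
package main

import "fmt"

// segment is a hypothetical half-open range [start, end) of pages with a
// reference count.
type segment struct {
	start, end uint64
	refs       uint64
}

// mergeAdjacent coalesces neighbouring segments that touch and have the
// same reference count. The input must be sorted and non-overlapping.
func mergeAdjacent(segs []segment) []segment {
	var out []segment
	for _, s := range segs {
		if n := len(out); n > 0 && out[n-1].end == s.start && out[n-1].refs == s.refs {
			out[n-1].end = s.end // extend the previous segment
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// Three single-page allocations; the first two share a refcount.
	segs := []segment{{0, 1, 2}, {1, 2, 2}, {2, 3, 1}}
	fmt.Println(mergeAdjacent(segs))
}
```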
Adin Scannell 29e00c943a Add CPUID faulting for ptrace and KVM.
PiperOrigin-RevId: 204858314
Change-Id: I8252bf8de3232a7a27af51076139b585e73276d4
2018-07-16 22:02:58 -07:00
Michael Pratt 14d06064d2 Start allocation and reclaim scans only where they may find a match
If usageSet is heavily fragmented, findUnallocatedRange and findReclaimable
can spend excessive cycles linearly scanning the set for unallocated/free
pages.

Improve common cases by beginning the scan only at the first page that could
possibly contain an unallocated/free page. This metadata only guarantees that
there is no lower unallocated/free page, but a scan may still be required
(especially for multi-page allocations).

That said, this heuristic can still provide significant performance
improvements for certain applications.

PiperOrigin-RevId: 204841833
Change-Id: Ic41ad33bf9537ecd673a6f5852ab353bf63ea1e6
2018-07-16 18:19:01 -07:00
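
The heuristic described above can be illustrated with a single-page allocator over a boolean slice. The `pageSet` type and its methods are hypothetical; the only idea taken from the commit message is a hint guaranteeing that no free page exists below it, so scans can start there.

```go
package main

import "fmt"

// pageSet is a hypothetical allocator. firstFree guarantees only that no
// free page has an index below it; a scan is still needed from there on.
type pageSet struct {
	used      []bool
	firstFree int
}

func newPageSet(n int) *pageSet {
	return &pageSet{used: make([]bool, n)}
}

// allocate returns the lowest free page, scanning from the hint rather
// than from index 0.
func (s *pageSet) allocate() (int, bool) {
	for i := s.firstFree; i < len(s.used); i++ {
		if !s.used[i] {
			s.used[i] = true
			s.firstFree = i + 1 // everything up to i is now in use
			return i, true
		}
	}
	s.firstFree = len(s.used)
	return 0, false
}

// free releases page i and lowers the hint if needed.
func (s *pageSet) free(i int) {
	s.used[i] = false
	if i < s.firstFree {
		s.firstFree = i
	}
}

func main() {
	s := newPageSet(4)
	for i := 0; i < 3; i++ {
		p, _ := s.allocate()
		fmt.Println("allocated page", p)
	}
	s.free(1)
	p, _ := s.allocate()
	fmt.Println("after freeing page 1, allocated page", p)
}
```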
Neel Natu 8f21c0bb28 Add EventOperations.HostFD()
This method allows an eventfd inside the Sentry to be registered with
the host kernel.

Update comment about memory mapping host fds via CachingInodeOperations.

PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
2018-07-16 12:20:05 -07:00
Neel Natu 5b09ec3b89 Allow a filesystem to control its visibility in /proc/filesystems.
PiperOrigin-RevId: 204508520
Change-Id: I09e5f8b6e69413370e1a0d39dbb7dc1ee0b6192d
2018-07-13 12:10:57 -07:00
Michael Pratt f09ebd9c71 Note that Mount errors do not require translations
PiperOrigin-RevId: 204490639
Change-Id: I0fe26306bae9320c6aa4f854fe0ef25eebd93233
2018-07-13 10:24:18 -07:00
Michael Pratt a28b274abb Fix aio eventfd lookup
We're failing to set eventFile in the outer scope.

PiperOrigin-RevId: 204392995
Change-Id: Ib9b04f839599ef552d7b5951d08223e2b1d5f6ad
2018-07-12 17:14:50 -07:00
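
Failing to set a variable in the outer scope is typically a shadowing mistake in Go, where `:=` inside a block declares a new variable. The sketch below shows that pattern under that assumption; `eventFile`, `lookup`, `submitBuggy`, and `submitFixed` are hypothetical names, not the sentry's aio code.

```go
package main

import (
	"errors"
	"fmt"
)

// eventFile is a hypothetical stand-in for the eventfd file object.
type eventFile struct {
	fd int
}

// lookup is a hypothetical descriptor lookup.
func lookup(fd int) (*eventFile, error) {
	if fd < 0 {
		return nil, errors.New("bad file descriptor")
	}
	return &eventFile{fd: fd}, nil
}

// submitBuggy uses := inside the if block, which declares a new ef that
// shadows the outer one. The outer ef stays nil.
func submitBuggy(fd int, hasEventFD bool) *eventFile {
	var ef *eventFile
	if hasEventFD {
		ef, err := lookup(fd)
		if err != nil {
			return nil
		}
		_ = ef // the inner variable is used, so the compiler accepts it
	}
	return ef
}

// submitFixed declares err separately and assigns to the outer ef.
func submitFixed(fd int, hasEventFD bool) *eventFile {
	var ef *eventFile
	if hasEventFD {
		var err error
		ef, err = lookup(fd)
		if err != nil {
			return nil
		}
	}
	return ef
}

func main() {
	fmt.Println("buggy returns nil:", submitBuggy(3, true) == nil)
	fmt.Println("fixed returns nil:", submitFixed(3, true) == nil)
}
```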