Commit Graph

395 Commits

Author SHA1 Message Date
Zach Koopmans 42d78ba61b Remove HostFS from Sentry.
PiperOrigin-RevId: 301402181
2020-03-17 10:30:32 -07:00
Dean Deng 5e413cad10 Plumb VFS2 imported fds into virtual filesystem.
- When setting up the virtual filesystem, mount a host.filesystem to contain
  all files that need to be imported.
- Make read/preadv syscalls to the host in cases where preadv2 may not be
  supported yet (likewise for writing).
- Make save/restore functions in kernel/kernel.go return early if vfs2 is
  enabled.

PiperOrigin-RevId: 300922353
2020-03-14 07:14:33 -07:00
Jamie Liu 1c05352970 Fix oom_score_adj.
- Make oomScoreAdj a ThreadGroup field (Linux: signal_struct::oom_score_adj).

- Avoid deadlock caused by Task.OOMScoreAdj()/SetOOMScoreAdj() locking Task.mu
  and TaskSet.mu in the wrong order (via Task.ExitState()).

PiperOrigin-RevId: 300814698
2020-03-13 13:19:13 -07:00
Ting-Yu Wang b36de6e7be Move /proc/net to /proc/PID/net, and make /proc/net -> /proc/self/net.
Issue #1833

PiperOrigin-RevId: 299998105
2020-03-09 19:59:09 -07:00
Dean Deng 960f6a975b Add plumbing for importing fds in VFS2, along with non-socket, non-TTY impl.
In VFS2, imported file descriptors are stored in a kernfs-based filesystem.
Upon calling ImportFD, the host fd can be accessed in two ways:
1. a FileDescription that can be added to the FDTable, and
2. a Dentry in the host.filesystem mount, which we will want to access through
magic symlinks in /proc/[pid]/fd/.

An implementation of the kernfs.Inode interface stores a unique host fd. This
inode can be inserted into file descriptions as well as dentries.

This change also plumbs in three FileDescriptionImpls corresponding to fds for
sockets, TTYs, and other files (only the latter is implemented here).
These implementations will mostly make corresponding syscalls to the host.
Where possible, the logic is ported over from pkg/sentry/fs/host.

Updates #1672

PiperOrigin-RevId: 299417263
2020-03-06 12:59:49 -08:00
Tamir Duberstein 6fa5cee82c Prevent memory leaks in ilist
When list elements are removed from a list but not discarded, it becomes
important to invalidate the references they hold to their former
neighbors to prevent memory leaks.

PiperOrigin-RevId: 299412421
2020-03-06 12:31:43 -08:00
Ian Lewis da48fc6cca Stub oom_score_adj and oom_score.
Adds an oom_score_adj and oom_score proc file stub. oom_score_adj accepts
writes of values -1000 to 1000 and persists the value with the task. New tasks
inherit the parent's oom_score_adj.

oom_score is a read-only stub that always returns the value '0'.

Issue #202

PiperOrigin-RevId: 299245355
2020-03-05 18:23:01 -08:00
Michael Pratt 62bd3ca8a3 Take write lock when removing xattr
PiperOrigin-RevId: 298380654
2020-03-02 10:07:13 -08:00
Ting-Yu Wang 6b4d36e325 Hide /dev/net/tun when using hostinet.
/dev/net/tun does not currently work with hostinet. This has caused some
program starts failing because it thinks the feature exists.

PiperOrigin-RevId: 297876196
2020-02-28 10:39:12 -08:00
Adin Scannell fba479b3c7 Fix DATA RACE in fs.MayDelete.
MayDelete must lock the directory also, otherwise concurrent renames may
race. Note that this also changes the methods to be aligned with the actual
Remove and RemoveDirectory methods to minimize confusion when reading the
code. (It was hard to see that resolution was correct.)

PiperOrigin-RevId: 297258304
2020-02-25 19:04:15 -08:00
Adin Scannell 53504e29ca Fix mount refcount issue.
Each mount is holds a reference on a root Dirent, but the mount itself may
live beyond it's own reference. This means that a call to Root() can come
after the associated reference has been dropped.

Instead of introducing a separate layer of references for mount objects,
we simply change the Root() method to use TryIncRef() and allow it to return
nil if the mount is already gone. This requires updating a small number of
callers and minimizes the change (since VFSv2 will replace this code shortly).

PiperOrigin-RevId: 297174230
2020-02-25 12:17:52 -08:00
Ting-Yu Wang b8f56c79be Implement tap/tun device in vfs.
PiperOrigin-RevId: 296526279
2020-02-21 15:42:56 -08:00
gVisor bot 4a73bae269 Initial network namespace support.
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will have the same behavior as it is now.

Before the userspace is able to create device, the default loopback device can
be used to test.

/proc/net and /sys/net will still be connected to the root network stack; this
is the same behavior now.

Issue #1833

PiperOrigin-RevId: 296309389
2020-02-20 15:20:40 -08:00
gVisor bot 4075de11be Plumb VFS2 inside the Sentry
- Added fsbridge package with interface that can be used to open
  and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup

Updates #1623

PiperOrigin-RevId: 295183950
2020-02-14 11:12:47 -08:00
gVisor bot 6dced977ea Ensure fsimpl/gofer.dentryPlatformFile.hostFileMapper is initialized.
Fixes #1812. (The more direct cause of the deadlock is panic unsafety because
the historically high cost of defer means that we avoid it in hot paths,
including much of MM; defer is much cheaper as of Go 1.14, but still a
measurable overhead.)

PiperOrigin-RevId: 294560316
2020-02-11 17:38:57 -08:00
Adin Scannell 115898e368 Prevent DATA RACE in UnstableAttr.
The slaveInodeOperations is currently copying the object when
truncate is called (which is a no-op). This may result in a
(unconsequential) data race when being modified concurrently.

PiperOrigin-RevId: 294484276
2020-02-11 11:38:08 -08:00
Adin Scannell a6f9361c2f Add context to comments.
PiperOrigin-RevId: 294295852
2020-02-10 13:52:09 -08:00
Dean Deng 17b9f5e662 Support listxattr and removexattr syscalls.
Note that these are only implemented for tmpfs, and other impls will still
return EOPNOTSUPP.

PiperOrigin-RevId: 293899385
2020-02-07 14:47:13 -08:00
Adin Scannell 1b6a12a768 Add notes to relevant tests.
These were out-of-band notes that can help provide additional context
and simplify automated imports.

PiperOrigin-RevId: 293525915
2020-02-05 22:46:35 -08:00
Jamie Liu 492229d017 VFS2 gofer client
Updates #1198

Opening host pipes (by spinning in fdpipe) and host sockets is not yet
complete, and will be done in a future CL.

Major differences from VFS1 gofer client (sentry/fs/gofer), with varying levels
of backportability:

- "Cache policies" are replaced by InteropMode, which control the behavior of
  timestamps in addition to caching. Under InteropModeExclusive (analogous to
  cacheAll) and InteropModeWritethrough (analogous to cacheAllWritethrough),
  client timestamps are *not* written back to the server (it is not possible in
  9P or Linux for clients to set ctime, so writing back client-authoritative
  timestamps results in incoherence between atime/mtime and ctime). Under
  InteropModeShared (analogous to cacheRemoteRevalidating), client timestamps
  are not used at all (remote filesystem clocks are authoritative). cacheNone
  is translated to InteropModeShared + new option
  filesystemOptions.specialRegularFiles.

- Under InteropModeShared, "unstable attribute" reloading for permission
  checks, lookup, and revalidation are fused, which is feasible in VFS2 since
  gofer.filesystem controls path resolution. This results in a ~33% reduction
  in RPCs for filesystem operations compared to cacheRemoteRevalidating. For
  example, consider stat("/foo/bar/baz") where "/foo/bar/baz" fails
  revalidation, resulting in the instantiation of a new dentry:

  VFS1 RPCs:
  getattr("/")                          // fs.MountNamespace.FindLink() => fs.Inode.CheckPermission() => gofer.inodeOperations.check() => gofer.inodeOperations.UnstableAttr()
  walkgetattr("/", "foo") = fid1        // fs.Dirent.walk() => gofer.session.Revalidate() => gofer.cachePolicy.Revalidate()
  clunk(fid1)
  getattr("/foo")                       // CheckPermission
  walkgetattr("/foo", "bar") = fid2     // Revalidate
  clunk(fid2)
  getattr("/foo/bar")                   // CheckPermission
  walkgetattr("/foo/bar", "baz") = fid3 // Revalidate
  clunk(fid3)
  walkgetattr("/foo/bar", "baz") = fid4 // fs.Dirent.walk() => gofer.inodeOperations.Lookup
  getattr("/foo/bar/baz")               // linux.stat() => gofer.inodeOperations.UnstableAttr()

  VFS2 RPCs:
  getattr("/")                          // gofer.filesystem.walkExistingLocked()
  walkgetattr("/", "foo") = fid1        // gofer.filesystem.stepExistingLocked()
  clunk(fid1)
                                        // No getattr: walkgetattr already updated metadata for permission check
  walkgetattr("/foo", "bar") = fid2
  clunk(fid2)
  walkgetattr("/foo/bar", "baz") = fid3
                                        // No clunk: fid3 used for new gofer.dentry
                                        // No getattr: walkgetattr already updated metadata for stat()

- gofer.filesystem.unlinkAt() does not require instantiation of a dentry that
  represents the file to be deleted. Updates #898.

- gofer.regularFileFD.OnClose() skips Tflushf for regular files under
  InteropModeExclusive, as it's nonsensical to request a remote file flush
  without flushing locally-buffered writes to that remote file first.

- Symlink targets are cached when InteropModeShared is not in effect.

- p9.QID.Path (which is already required to be unique for each file within a
  server, and is accordingly already synthesized from device/inode numbers in
  all known gofers) is used as-is for inode numbers, rather than being mapped
  along with attr.RDev in the client to yet another synthetic inode number.

- Relevant parts of fsutil.CachingInodeOperations are inlined directly into
  gofer package code. This avoids having to duplicate part of its functionality
  in fsutil.HostMappable.

PiperOrigin-RevId: 293190213
2020-02-04 11:29:22 -08:00
Fabricio Voznika d7cd484091 Add support for sentry internal pipe for gofer mounts
Internal pipes are supported similarly to how internal UDS is done.
It is also controlled by the same flag.

Fixes #1102

PiperOrigin-RevId: 293150045
2020-02-04 08:20:52 -08:00
gVisor bot bc3a24d627 Internal change.
PiperOrigin-RevId: 292587459
2020-01-31 13:19:42 -08:00
Michael Pratt ede8dfab37 Enforce splice offset limits
Splice must not allow negative offsets. Writes also must not allow offset +
size to overflow int64. Reads are similarly broken, but not just in splice
(b/148095030).

Reported-by: syzbot+0e1ff0b95fb2859b4190@syzkaller.appspotmail.com
PiperOrigin-RevId: 292361208
2020-01-30 09:14:31 -08:00
Andrei Vagin f263801a74 fs/splice: don't report partial errors for special files
Special files can have additional requirements for granularity.
For example, read from eventfd returns EINVAL if a size is less 8 bytes.

Reported-by: syzbot+3905f5493bec08eb7b02@syzkaller.appspotmail.com
PiperOrigin-RevId: 292002926
2020-01-28 13:37:19 -08:00
Adin Scannell 0e2f1b7abd Update package locations.
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.

PiperOrigin-RevId: 291811289
2020-01-27 15:31:32 -08:00
Adin Scannell d29e59af9f Standardize on tools directory.
PiperOrigin-RevId: 291745021
2020-01-27 12:21:00 -08:00
Ian Gudger 6a59e7f510 Rename DowngradableRWMutex to RWmutex.
Also renames TMutex to Mutex.

These custom mutexes aren't any worse than the standard library versions (same
code), so having both seems redundant.

PiperOrigin-RevId: 290873587
2020-01-21 19:36:12 -08:00
Fabricio Voznika d46c397a1c Add line break to /proc/net files
Some files were missing the last line break.

PiperOrigin-RevId: 290808898
2020-01-21 13:28:24 -08:00
Nicolas Lacasse 10401599e1 Include the cgroup name in the superblock options in /proc/self/mountinfo.
Java 11 parses /proc/self/mountinfo for cgroup information. Java 11.0.4 uses
the mount path to determine what cgroups existed, but Java 11.0.5 reads the
cgroup names from the superblock options.

This CL adds the cgroup name to the superblock options if the filesystem type
is "cgroup". Since gVisor doesn't actually support cgroups yet, we just infer
the cgroup name from the path.

PiperOrigin-RevId: 290434323
2020-01-18 09:34:04 -08:00
Nicolas Lacasse f1a5178c58 Fix data race in MountNamespace.resolve.
We must hold fs.renameMu to access Dirent.parent.

PiperOrigin-RevId: 290340804
2020-01-17 14:21:27 -08:00
Nicolas Lacasse 80d0f93044 Fix data race in tty.queue.readableSize.
We were setting queue.readable without holding the lock.

PiperOrigin-RevId: 290306922
2020-01-17 11:22:10 -08:00
Dean Deng 345df7cab4 Add explanation for implementation of BSD full file locks.
PiperOrigin-RevId: 290272560
2020-01-17 08:11:52 -08:00
Adin Scannell 19b4653147 Remove unused rpcinet.
PiperOrigin-RevId: 290198756
2020-01-16 20:21:09 -08:00
Dean Deng 7a45ae7e67 Implement setxattr for overlays.
PiperOrigin-RevId: 290186303
2020-01-16 18:15:02 -08:00
Fabricio Voznika ab48112e41 Add IfChange/ThenChange reminders in fs/proc
There is a lot of code duplication for VFSv2 and this
serves as remind to keep the copies in sync.

Updates #1195

PiperOrigin-RevId: 290139234
2020-01-16 15:05:40 -08:00
Dean Deng 07f2584979 Plumb getting/setting xattrs through InodeOperations and 9p gofer interfaces.
There was a very bare get/setxattr in the InodeOperations interface. Add
context.Context to both, size to getxattr, and flags to setxattr.
Note that extended attributes are passed around as strings in this
implementation, so size is automatically encoded into the value. Size is
added in getxattr so that implementations can return ERANGE if a value is larger
than can fit in the user-allocated buffer. This prevents us from unnecessarily
passing around an arbitrarily large xattr when the user buffer is actually too
small.

Don't use the existing xattrwalk and xattrcreate messages and define our
own, mainly for the sake of simplicity.

Extended attributes will be implemented in future commits.

PiperOrigin-RevId: 290121300
2020-01-16 12:56:33 -08:00
Fabricio Voznika 7b7c31820b Add remaining /proc/* and /proc/sys/* files
Except for one under /proc/sys/net/ipv4/tcp_sack.
/proc/pid/* is still incomplete.

Updates #1195

PiperOrigin-RevId: 290120438
2020-01-16 12:30:21 -08:00
Ian Gudger 27500d529f New sync package.
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.

This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.

Updates #1472

PiperOrigin-RevId: 289033387
2020-01-09 22:02:24 -08:00
Andrei Vagin a53ac7307a fs/splice: don't report a partialResult error if there is no data loss
PiperOrigin-RevId: 288642552
2020-01-07 23:54:14 -08:00
Nicolas Lacasse 51f3ab85e0 Convert memfs into proto-tmpfs.
- Renamed memfs to tmpfs.
- Copied fileRangeSet bits from fs/fsutil/ to fsimpl/tmpfs/
- Changed tmpfs to be backed by filemem instead of byte slice.
- regularFileReadWriter uses a sync.Pool, similar to gofer client.

PiperOrigin-RevId: 288356380
2020-01-06 12:52:55 -08:00
Nicolas Lacasse bb00438f36 Make masterInodeOperations.Truncate take a pointer receiver.
Otherwise a copy happens, which triggers a data race when reading
masterInodeOperations.SimpleFileOperations.uattr, which must be accessed with a
lock held.

PiperOrigin-RevId: 286464473
2019-12-19 14:34:53 -08:00
Michael Pratt 334a513f11 Add Mems_allowed to /proc/PID/status
PiperOrigin-RevId: 286248378
2019-12-18 13:16:28 -08:00
gVisor bot 3ab90ecf25 Merge pull request #1394 from zhuangel:bindlock
PiperOrigin-RevId: 286051631
2019-12-17 13:53:16 -08:00
gVisor bot 2e2545b458 Merge pull request #1392 from zhuangel:bindleak
PiperOrigin-RevId: 285874181
2019-12-16 16:21:17 -08:00
Dean Deng e6f4124afd Implement checks for get/setxattr at the syscall layer.
Add checks for input arguments, file type, permissions, etc. that match
the Linux implementation. A call to get/setxattr that passes all the
checks will still currently return EOPNOTSUPP. Actual support will be
added in following commits.

Only allow user.* extended attributes for the time being.

PiperOrigin-RevId: 285835159
2019-12-16 13:20:07 -08:00
Yong He bd5c7bf58d Fix deadlock in overlay bind
Copy up parent when binding UDS on overlayfs is supported in commit
02ab1f187c.
But the using of copyUp in overlayBind will cause sentry stuck, reason
is dead lock in renameMu.

1 [Process A] Invoke a Unix socket bind operation
  renameMu is hold in fs.(*Dirent).genericCreate by process A
2 [Process B] Invoke a read syscall on /proc/task/mounts
  waitng on Lock of renameMu in fs.(*MountNamespace).FindMount
3 [Process A] Continue Unix socket bind operation
  wating on RLock of renameMu in fs.copyUp

Root cause is recursive reading lock of reanmeMu in bind call trace,
if there are writing lock between the two reading lock, then deadlock
occured.

Fixes #1397
2019-12-16 18:37:35 +08:00
Yong He 8a46e83111 Fix UDS bind cause fd leak in gofer
After the finalizer optimize in 76039f8959
commit, clientFile needs to closed before finalizer release it.
The clientFile is not closed if it is created via
gofer.(*inodeOperations).Bind, this will cause fd leak which is hold
by gofer process.

Fixes #1396

Signed-off-by: Yong He <chenglang.hy@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-12-16 18:28:10 +08:00
Fabricio Voznika 898dcc2f83 Redirect TODOs to gvisor.dev
PiperOrigin-RevId: 284606233
2019-12-09 12:11:28 -08:00
Nicolas Lacasse 663fe840f7 Implement TTY field in control.Processes().
Threadgroups already know their TTY (if they have one), which now contains the
TTY Index, and is returned in the Processes() call.

PiperOrigin-RevId: 284263850
2019-12-06 14:34:13 -08:00
Zach Koopmans 0a32c02357 Create correct file for /proc/[pid]/task/[tid]/io
PiperOrigin-RevId: 284038840
2019-12-05 13:24:05 -08:00