Commit Graph

528 Commits

Author SHA1 Message Date
Michael Pratt d639c3d61b Allow NULL data in mount(2)
PiperOrigin-RevId: 213315267
Change-Id: I7562bcd81fb22e90aa9c7dd9eeb94803fcb8c5af
2018-09-17 12:16:29 -07:00
newmanwang de5a590ee2 Avoid reuse of pending SignalInfo objects
runApp.execute -> Task.SendSignal -> sendSignalLocked -> sendSignalTimerLocked
-> pendingSignals.enqueue assumes that it owns the arch.SignalInfo returned
from platform.Context.Switch.

On the other hand, ptrace.context.Switch assumes that it owns the returned
SignalInfo and can safely reuse it on the next call to Switch. The KVM platform
always returns a unique SignalInfo.

This becomes a problem when the returned signal is not immediately delivered,
allowing a future signal in Switch to change the previous pending SignalInfo.

This is noticeable in #38 when external SIGINTs are delivered from the PTY
slave FD. Note that the ptrace stubs are in the same process group as the
sentry, so they are eligible to receive the PTY signals. This should probably
change, but is not the only possible cause of this bug.

Updates #38

Original change by newmanwang <wcs1011@gmail.com>, updated by Michael Pratt
<mpratt@google.com>.

Change-Id: I5383840272309df70a29f67b25e8221f933622cd
PiperOrigin-RevId: 213071072
2018-09-14 17:39:25 -07:00
Tamir Duberstein 75c66f871b Remove buffer.Prependable.UsedBytes
It is the same as buffer.Prependable.View.

PiperOrigin-RevId: 213064166
Change-Id: Ib33b8a2c4da864209d9a0be0a1c113be10b520d3
2018-09-14 16:39:56 -07:00
Michael Pratt 3aa50f18a4 Reuse readlink parameter, add sockaddr max.
PiperOrigin-RevId: 213058623
Change-Id: I522598c655d633b9330990951ff1c54d1023ec29
2018-09-14 16:00:02 -07:00
Tamir Duberstein d7a05b4e63 Pass buffer.Prependable by value
PiperOrigin-RevId: 213053370
Change-Id: I60ea89572b4fca53fd126c870fcbde74fcf52562
2018-09-14 15:23:58 -07:00
Nicolas Lacasse b84bfa570d Make gVisor hard link check match Linux's.
Linux permits hard-linking if the target is owned by the user OR the target has
Read+Write permission.

PiperOrigin-RevId: 213024613
Change-Id: If642066317b568b99084edd33ee4e8822ec9cbb3
2018-09-14 12:29:46 -07:00
Jamie Liu 0380bcb3a4 Fix interaction between rt_sigtimedwait and ignored signals.
PiperOrigin-RevId: 213011782
Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af
2018-09-14 11:10:50 -07:00
Chenggang faa34a0738 platform/kvm: Get max vcpu number dynamically by ioctl
The old kernel version, such as 4.4, only support 255 vcpus.
While gvisor is ran on these kernels, it could panic because the
vcpu id and vcpu number beyond max_vcpus.
Use ioctl(vmfd, _KVM_CHECK_EXTENSION, _KVM_CAP_MAX_VCPUS) to get max
vcpus number dynamically.

Change-Id: I50dd859a11b1c2cea854a8e27d4bf11a411aa45c
PiperOrigin-RevId: 212929704
2018-09-13 21:47:11 -07:00
Ian Gudger 29a7271f5d Plumb monotonic time to netstack
Netstack needs to be portable, so this seems to be preferable to using raw
system calls.

PiperOrigin-RevId: 212917409
Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c
2018-09-13 19:12:15 -07:00
Rahat Mahmood adf8f33970 Extend memory usage events to report mapped memory usage.
PiperOrigin-RevId: 212887555
Change-Id: I3545383ce903cbe9f00d9b5288d9ef9a049b9f4f
2018-09-13 15:16:47 -07:00
Michael Pratt 9c6b38e295 Format struct itimerspec
PiperOrigin-RevId: 212874745
Change-Id: I0c3e8e6a9e8976631cee03bf0b8891b336ddb8c8
2018-09-13 14:07:47 -07:00
Nicolas Lacasse e2d79480f5 initArgs must hold a reference on the Root if it is not nil.
The contract in ExecArgs says that a reference on ExecArgs.Root must be held
for the lifetime of the struct, but the caller is free to drop the ref after
that.

As a result, proc.Exec must take an additional ref on Root when it constructs
the CreateProcessArgs, since that holds a pointer to Root as well. That ref is
dropped in CreateProcess.

PiperOrigin-RevId: 212828348
Change-Id: I7f44a612f337ff51a02b873b8a845d3119408707
2018-09-13 09:50:35 -07:00
Tamir Duberstein d689f8422f Always pass buffer.VectorisedView by value
PiperOrigin-RevId: 212757571
Change-Id: I04200df9e45c21eb64951cd2802532fa84afcb1a
2018-09-12 21:57:55 -07:00
Tamir Duberstein 5adb3468d4 Add multicast support
PiperOrigin-RevId: 212750821
Change-Id: I822fd63e48c684b45fd91f9ce057867b7eceb792
2018-09-12 20:39:24 -07:00
Zhaozhong Ni 9dec7a3db9 compressio: stop worker-pool reference / dependency loop.
PiperOrigin-RevId: 212732300
Change-Id: I9a0b9b7c28e7b7439d34656dd4f2f6114d173e22
2018-09-12 17:24:53 -07:00
Kevin Krakauer 2eff1fdd06 runsc: Add exec flag that specifies where to save the sandbox-internal pid.
This is different from the existing -pid-file flag, which saves a host pid.

PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
2018-09-12 15:23:35 -07:00
Tamir Duberstein cbf3980464 Prevent UDP sockets from binding to bound ports
PiperOrigin-RevId: 212653818
Change-Id: Ib4e1d754d9cdddeaa428a066cb675e6ec44d91ad
2018-09-12 09:39:01 -07:00
Nicolas Lacasse 6cc9b311af platform: Pass device fd into platform constructor.
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New).  This requires that we have RW access to
the platform device when constructing the platform.

However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.

This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.

PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
2018-09-11 13:09:46 -07:00
Jamie Liu a29c39aa62 Map committed chunks concurrently in FileMem.LoadFrom.
PiperOrigin-RevId: 212345401
Change-Id: Iac626ee87ba312df88ab1019ade6ecd62c04c75c
2018-09-10 15:23:44 -07:00
Fabricio Voznika 7e9e6745ca Allow '/dev/zero' to be mapped with unaligned length
PiperOrigin-RevId: 212321271
Change-Id: I79d71c2e6f4b8fcd3b9b923fe96c2256755f4c48
2018-09-10 13:24:55 -07:00
Bert Muthalaly da9ecb748c Simplify some code in VectorisedView#ToView.
PiperOrigin-RevId: 212317717
Change-Id: Ic77449c53bf2f8be92c9f0a7a726c45bd35ec435
2018-09-10 13:04:06 -07:00
Michael Pratt 7045828a31 Update cleanup TODO
PiperOrigin-RevId: 212068327
Change-Id: I3f360cdf7d6caa1c96fae68ae3a1caaf440f0cbe
2018-09-07 18:14:57 -07:00
Nicolas Lacasse 9751b800a6 runsc: Support multi-container exec.
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.

Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute.  So we have to
do the path lookup much later than we previously were.

PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
2018-09-07 17:39:54 -07:00
Fabricio Voznika 172860a059 Add 'Starting gVisor...' message to syslog
This allows applications to verify they are running with gVisor. It
also helps debugging when running with a mix of container runtimes.

Closes #54

PiperOrigin-RevId: 212059457
Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec
2018-09-07 16:59:27 -07:00
Adin Scannell 6cfb5cd56d Add additional sanity checks for walk.
PiperOrigin-RevId: 212058684
Change-Id: I319709b9ffcfccb3231bac98df345d2a20eca24b
2018-09-07 16:53:12 -07:00
Fabricio Voznika f895cb4d8b Use root abstract socket namespace for exec
PiperOrigin-RevId: 211999211
Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d
2018-09-07 10:45:55 -07:00
Michael Pratt 169e2efc5a Continue handling signals after disabling forwarding
Before destroying the Kernel, we disable signal forwarding,
relinquishing control to the Go runtime. External signals that arrive
after disabling forwarding but before the sandbox exits thus may use
runtime.raise (i.e., tkill(2)) and violate the syscall filters.

Adjust forwardSignals to handle signals received after disabling
forwarding the same way they are handled before starting forwarding.
i.e., by implementing the standard Go runtime behavior using tgkill(2)
instead of tkill(2).

This also makes the stop callback block until forwarding actually stops.
This isn't required to avoid tkill(2) but is a saner interface.

PiperOrigin-RevId: 211995946
Change-Id: I3585841644409260eec23435cf65681ad41f5f03
2018-09-07 10:28:25 -07:00
Nicolas Lacasse 6516b5648b createProcessArgs.RootFromContext should return process Root if it exists.
It was always returning the MountNamespace root, which may be different from
the process Root if the process is in a chroot environment.

PiperOrigin-RevId: 211862181
Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480
2018-09-06 13:47:49 -07:00
Tamir Duberstein 156b49ca85 Fix race condition introduced in 211135505
Now that it's possible to remove subnets, we must iterate over them with locks
held.

Also do the removal more efficiently while I'm here.

PiperOrigin-RevId: 211737416
Change-Id: I29025ec8b0c3ad11f22d4447e8ad473f1c785463
2018-09-05 18:59:16 -07:00
Fabricio Voznika 41b56696c4 Imported FD in exec was leaking
Imported file needs to be closed after it's
been imported.

PiperOrigin-RevId: 211732472
Change-Id: Ia9249210558b77be076bcce465b832a22eed301f
2018-09-05 18:07:11 -07:00
Bert Muthalaly 5685d6b5ad Update {LinkEndpoint,NetworkEndpoint}#WritePacket to take a VectorisedView
Makes it possible to avoid copying or allocating in cases where DeliverNetworkPacket (rx)
needs to turn around and call WritePacket (tx) with its VectorisedView.

Also removes the restriction on having VectorisedViews with multiple views in the write path.

PiperOrigin-RevId: 211728717
Change-Id: Ie03a65ecb4e28bd15ebdb9c69f05eced18fdfcff
2018-09-05 17:34:25 -07:00
Tamir Duberstein fe8ca76c22 Implement Subnet removal
This was used to implement https://fuchsia-review.googlesource.com/c/garnet/+/177771.

PiperOrigin-RevId: 211725098
Change-Id: Ib0acc7c13430b7341e8e0ec6eb5fc35f5cee5083
2018-09-05 17:06:29 -07:00
Bert Muthalaly b3b66dbd1f Enable constructing a Prependable from a View without allocating.
PiperOrigin-RevId: 211722525
Change-Id: Ie73753fd09d67d6a2ce70cfe2d4ecf7275f09ce0
2018-09-05 16:47:51 -07:00
Tamir Duberstein bc5e18c9d1 Implement TCP keepalives
PiperOrigin-RevId: 211670620
Change-Id: Ia8a3d8ae53a7fece1dee08ee9c74964bd7f71bb7
2018-09-05 11:48:23 -07:00
Brian Geffon 2b8dae0bc5 Open(2) isn't honoring O_NOFOLLOW
PiperOrigin-RevId: 211644897
Change-Id: I882ed827a477d6c03576463ca5bf2d6351892b90
2018-09-05 09:21:28 -07:00
Bhasker Hariharan 2cff07381a Automated rollback of changelist 211156845
PiperOrigin-RevId: 211525182
Change-Id: I462c20328955c77ecc7bfd8ee803ac91f15858e6
2018-09-04 14:31:52 -07:00
Michael Pratt 3944cb41cb /proc/PID/mounts is not tab-delimited
PiperOrigin-RevId: 211513847
Change-Id: Ib484dd2d921c3e5d70d0e410cd973d3bff4f6b73
2018-09-04 13:29:49 -07:00
Tamir Duberstein 3794cb6bff Expose TCP RTT
PiperOrigin-RevId: 211504634
Change-Id: I9a7bcbbdd40e5036894930f709278725ef477293
2018-09-04 12:39:47 -07:00
Adin Scannell c09f9acd7c Distinguish Element and Linker for ilist.
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.

This is a requisite change for supporting a per-Inode list of Dirents.

PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
2018-09-04 09:19:11 -07:00
Googler f0d8817654 Automated rollback of changelist 211103930
PiperOrigin-RevId: 211156845
Change-Id: Ie28011d7eb5f45f3a0158dbee2a68c5edf22f6e0
2018-08-31 15:48:50 -07:00
Jamie Liu f8ccfbbed4 Document more task-goroutine-owned fields in kernel.Task.
Task.creds can only be changed by the task's own set*id and execve
syscalls, and Task namespaces can only be changed by the task's own
unshare/setns syscalls.

PiperOrigin-RevId: 211156279
Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
2018-08-31 15:44:40 -07:00
Jamie Liu b935311e23 Do not use fs.FileOwnerFromContext in fs/proc.file.UnstableAttr().
From //pkg/sentry/context/context.go:

// - It is *not safe* to retain a Context passed to a function beyond the scope
// of that function call.

Passing a stored kernel.Task as a context.Context to
fs.FileOwnerFromContext violates this requirement.

PiperOrigin-RevId: 211143021
Change-Id: I4c5b02bd941407be4c9cfdbcbdfe5a26acaec037
2018-08-31 14:17:56 -07:00
Jamie Liu 098046ba19 Disintegrate kernel.TaskResources.
This allows us to call kernel.FDMap.DecRef without holding mutexes
cleanly.

PiperOrigin-RevId: 211139657
Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
2018-08-31 13:58:04 -07:00
Jamie Liu b1c1afa3cc Delete the long-obsolete kernel.TaskMaybe interface.
PiperOrigin-RevId: 211131855
Change-Id: Ia7799561ccd65d16269e0ae6f408ab53749bca37
2018-08-31 13:07:34 -07:00
Tamir Duberstein 625edb9f28 ipv6: ICMP support
This CL does NDP link-address discovery for IPv6.

It includes several small changes necessary to get linux to talk to
this implementation. In particular, a hop limit of 255 is necessary
for ICMPv6.

PiperOrigin-RevId: 211103930
Change-Id: If25370ab84c6b1decfb15de917f3b0020f2c4e0e
2018-08-31 10:23:32 -07:00
Nicolas Lacasse 8bfb5fa919 fs: Add empty dir at /sys/class/power_supply.
PiperOrigin-RevId: 210953512
Change-Id: I07d2d7fb0d268aa8eca26d81ef28b5b5c42289ee
2018-08-30 12:01:27 -07:00
Ian Gudger 313d4af52d ping: update comment about UDP
PiperOrigin-RevId: 210788012
Change-Id: I5ebdcf3d02bfab3484a1374fbccba870c9d68954
2018-08-29 14:15:58 -07:00
Nicolas Lacasse 956fe64ad6 fs: Fix renameMu lock recursion.
dirent.walk() takes renameMu, but is often called with renameMu already held,
which can lead to a deadlock.

Fix this by requiring renameMu to be held for reading when dirent.walk() is
called. This causes walks and existence checks to block while a rename
operation takes place, but that is what we were already trying to enforce by
taking renameMu in walk() anyways.

PiperOrigin-RevId: 210760780
Change-Id: Id61018e6e4adbeac53b9c1b3aa24ab77f75d8a54
2018-08-29 11:47:01 -07:00
Nicolas Lacasse 1893247616 fs: Drop reference to over-written file before renaming over it.
dirent.go:Rename() walks to the file being replaced and defers
replaced.DecRef(). After the rename, the reference is dropped, triggering a
writeout and SettAttr call to the gofer. Because of lazyOpenForWrite, the gofer
opens the replaced file BY ITS OLD NAME and calls ftruncate on it.

This CL changes Remove to drop the reference on replaced (and thus trigger
writeout) before the actual rename call.

PiperOrigin-RevId: 210756097
Change-Id: I01ea09a5ee6c2e2d464560362f09943641638e0f
2018-08-29 11:22:27 -07:00
Ian Gudger 52e6714146 fasync: don't keep mutex after return
PiperOrigin-RevId: 210637533
Change-Id: I3536c3f9efb54732a0d8ada8bc299142b2c1682f
2018-08-28 17:26:26 -07:00
Nicolas Lacasse 3b11769c77 fs: Don't bother saving negative dirents.
PiperOrigin-RevId: 210616454
Change-Id: I3f536e2b4d603e540cdd9a67c61b8ec3351f4ac3
2018-08-28 15:18:42 -07:00
Nicolas Lacasse 515d9bf43b fs: Add tests for dirent ref counting with an overlay.
PiperOrigin-RevId: 210614669
Change-Id: I408365ff6d6c7765ed7b789446d30e7079cbfc67
2018-08-28 15:09:17 -07:00
Zhaozhong Ni d724863a31 sentry: optimize dirent weakref map save / restore.
Weak references save / restore involves multiple interface indirection
and cause material latency overhead when there are lots of dirents, each
containing a weak reference map. The nil entries in the map should also
be purged.

PiperOrigin-RevId: 210593727
Change-Id: Ied6f4c3c0726fcc53a24b983d9b3a79121b6b758
2018-08-28 13:22:07 -07:00
Michael Pratt 25a8e13a78 Bump to Go 1.11
The procid offset is unchanged.

PiperOrigin-RevId: 210551969
Change-Id: I33ba1ce56c2f5631b712417d870aa65ef24e6022
2018-08-28 09:22:41 -07:00
Zhaozhong Ni d08ccdaaad sentry: avoid double counting map objects in save / restore stats.
PiperOrigin-RevId: 210551929
Change-Id: Idd05935bffc63b39166cc3751139aff61b689faa
2018-08-28 09:21:16 -07:00
Fabricio Voznika ae648bafda Add command-line parameter to trigger panic on signal
This is to troubleshoot problems with a hung process that is
not responding to 'runsc debug --stack' command.

PiperOrigin-RevId: 210483513
Change-Id: I4377b210b4e51bc8a281ad34fd94f3df13d9187d
2018-08-27 20:36:10 -07:00
Brian Geffon f0492d45aa Add /proc/sys/kernel/shm[all,max,mni].
PiperOrigin-RevId: 210459956
Change-Id: I51859b90fa967631e0a54a390abc3b5541fbee66
2018-08-27 17:21:37 -07:00
Tamir Duberstein 0923bcf06b Add various statistics
PiperOrigin-RevId: 210442599
Change-Id: I9498351f461dc69c77b7f815d526c5693bec8e4a
2018-08-27 15:29:55 -07:00
Nicolas Lacasse 0b3bfe2ea3 fs: Fix remote-revalidate cache policy.
When revalidating a Dirent, if the inode id is the same, then we don't need to
throw away the entire Dirent. We can just update the unstable attributes in
place.

If the inode id has changed, then the remote file has been deleted or moved,
and we have no choice but to throw away the dirent we have a look up another.
In this case, we may still end up losing a mounted dirent that is a child of
the revalidated dirent. However, that seems appropriate here because the entire
mount point has been pulled out from underneath us.

Because gVisor's overlay is at the Inode level rather than the Dirent level, we
must pass the parent Inode and name along with the Inode that is being
revalidated.

PiperOrigin-RevId: 210431270
Change-Id: I705caef9c68900234972d5aac4ae3a78c61c7d42
2018-08-27 14:26:29 -07:00
Zhaozhong Ni bd01816c87 sentry: mark fsutil.DirFileOperations as savable.
PiperOrigin-RevId: 210405166
Change-Id: I252766015885c418e914007baf2fc058fec39b3e
2018-08-27 11:55:32 -07:00
Kevin Krakauer 2524111fc6 runsc: Terminal resizing support.
Implements the TIOCGWINSZ and TIOCSWINSZ ioctls, which allow processes to resize
the terminal. This allows, for example, sshd to properly set the window size for
ssh sessions.

PiperOrigin-RevId: 210392504
Change-Id: I0d4789154d6d22f02509b31d71392e13ee4a50ba
2018-08-27 10:49:16 -07:00
Tamir Duberstein b17e80ef5a Upstreaming DHCP changes from Fuchsia
PiperOrigin-RevId: 210221388
Change-Id: Ic82d592b8c4778855fa55ba913f6b9a10b2d511f
2018-08-25 06:17:32 -07:00
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminial:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Nicolas Lacasse c48708a041 fs: Drop unused WaitGroup in Dirent.destroy.
PiperOrigin-RevId: 210182476
Change-Id: I655a2a801e2069108d30323f7f5ae76deb3ea3ec
2018-08-24 17:15:42 -07:00
Zhaozhong Ni a6b00502b0 compressio: support optional hashing and eliminate hashio.
Compared to previous compressio / hashio nesting, there is up to 100% speedup.

PiperOrigin-RevId: 210161269
Change-Id: I481aa9fe980bb817fe465fe34d32ea33fc8abf1c
2018-08-24 14:53:31 -07:00
Fabricio Voznika 7b0dfb0cdb SyscallRules merge and add were dropping AllowAny rules
PiperOrigin-RevId: 210131001
Change-Id: I285707c5143b3e4c9a6948c1d1a452b6f16e65b7
2018-08-24 11:39:21 -07:00
Jamie Liu 64403265a0 Implement POSIX per-process interval timers.
PiperOrigin-RevId: 210021612
Change-Id: If7c161e6fd08cf17942bfb6bc5a8d2c4e271c61e
2018-08-23 16:32:36 -07:00
Zhaozhong Ni e855e9cebc netstack: make listening tcp socket close state setting and cleanup atomic.
Otherwise the socket saving logic might find workers still running for closed
sockets unexpectedly.

PiperOrigin-RevId: 210018905
Change-Id: I443a04d355613f5f9983252cc6863bff6e0eda3a
2018-08-23 16:14:46 -07:00
Zhaozhong Ni ba8f6ba8c8 sentry: mark idMapSeqHandle as savable.
PiperOrigin-RevId: 209994384
Change-Id: I16186cf79cb4760a134f3968db30c168a5f4340e
2018-08-23 13:59:20 -07:00
Ian Gudger abe7764928 Encapsulate netstack metrics
PiperOrigin-RevId: 209943212
Change-Id: I96dcbc7c2ab2426e510b94a564436505256c5c79
2018-08-23 08:55:23 -07:00
Adin Scannell a7a8d07d7d Add separate Recycle method for allocator.
This improves debugging for pagetable-related issues.

PiperOrigin-RevId: 209827795
Change-Id: I4cfa11664b0b52f26f6bc90a14c5bb106f01e038
2018-08-22 14:16:04 -07:00
Googler bbee911179 Allow building on !linux
PiperOrigin-RevId: 209819644
Change-Id: I329d054bf8f4999e7db0dcd95b13f7793c65d4e2
2018-08-22 13:31:11 -07:00
Zhaozhong Ni 6b9133ba96 sentry: mark S/R stating errors as save rejections / fs corruptions.
PiperOrigin-RevId: 209817767
Change-Id: Iddf2b8441bc44f31f9a8cf6f2bd8e7a5b824b487
2018-08-22 13:19:16 -07:00
Brian Geffon 545ea7ab3f Always add AT_BASE even if there is no interpreter.
Linux will ALWAYS add AT_BASE even for a static binary, expect it
will be set to 0 [1].

1. https://github.com/torvalds/linux/blob/master/fs/binfmt_elf.c#L253

PiperOrigin-RevId: 209811129
Change-Id: I92cc66532f23d40f24414a921c030bd3481e12a0
2018-08-22 12:37:09 -07:00
Nicolas Lacasse 8d318aac55 fs: Hold Dirent.mu when calling Dirent.flush().
As required by the contract in Dirent.flush().

Also inline Dirent.freeze() into Dirent.Freeze(), since it is only called from
there.

PiperOrigin-RevId: 209783626
Change-Id: Ie6de4533d93dd299ffa01dabfa257c9cc259b1f4
2018-08-22 10:07:01 -07:00
Zhaozhong Ni 8bb50dab79 sentry: do not release gofer inode file state loading lock upon error.
When an inode file state failed to load asynchronuously, we want to report
the error instead of potentially panicing in another async loading goroutine
incorrectly unblocked.

PiperOrigin-RevId: 209683977
Change-Id: I591cde97710bbe3cdc53717ee58f1d28bbda9261
2018-08-21 16:52:27 -07:00
Ian Gudger e29a02239e binary: append slices
A new optimization in Go 1.11 improves the efficiency of slice extension:
"The compiler now optimizes slice extension of the form append(s, make([]T, n)...)."
https://tip.golang.org/doc/go1.11#performance-compiler

Before:
BenchmarkMarshalUnmarshal-12    	 2000000	       664 ns/op	       0 B/op	       0 allocs/op
BenchmarkReadWrite-12           	  500000	      2395 ns/op	     304 B/op	      24 allocs/op

After:
BenchmarkMarshalUnmarshal-12    	 2000000	       628 ns/op	       0 B/op	       0 allocs/op
BenchmarkReadWrite-12           	  500000	      2411 ns/op	     304 B/op	      24 allocs/op

BenchmarkMarshalUnmarshal benchmarks the code in this package, BenchmarkReadWrite benchmarks the code in the standard library.

PiperOrigin-RevId: 209679979
Change-Id: I51c6302e53f60bf79f84576b1ead4d36658897cb
2018-08-21 16:26:32 -07:00
Googler a316f83977 Expose route table
PiperOrigin-RevId: 209670528
Change-Id: I2890bcdef36f0b5f24b372b42cf628b38dd5764e
2018-08-21 15:27:09 -07:00
Ian Gudger 45e759a1fa Build PCAP file with atomic blocking writes
The previous use of non-blocking writes could result in corrupt PCAP files if a
partial write occurs. Using (*os.File).Write solves this problem by not
allowing partial writes. This change does not increase allocations (in one path
it actually reduces them), but does add additional copying.

PiperOrigin-RevId: 209652974
Change-Id: I4b1cf2eda4cfd7f237a4245aceb7391b3055a66c
2018-08-21 13:49:18 -07:00
Ian Gudger 9c407382b0 Fix races in kernel.(*Task).Value()
PiperOrigin-RevId: 209627180
Change-Id: Idc84afd38003427e411df6e75abfabd9174174e1
2018-08-21 11:16:17 -07:00
Ian Gudger 47d5a12ce5 Fix handling of abstract Unix socket addresses
* Don't truncate abstract addresses at second null.
* Properly handle abstract addresses with length < 108 bytes.

PiperOrigin-RevId: 209502703
Change-Id: I49053f2d18b5a78208c3f640c27dbbdaece4f1a9
2018-08-20 16:12:23 -07:00
Nicolas Lacasse 1501400d9c getdents should return type=DT_DIR for SpecialDirectories.
It was returning DT_UNKNOWN, and this was breaking numpy.

PiperOrigin-RevId: 209459351
Change-Id: Ic6f548e23aa9c551b2032b92636cb5f0df9ccbd4
2018-08-20 11:59:58 -07:00
Nicolas Lacasse 0050e3e71c sysfs: Add (empty) cpu directories for each cpu in /sys/devices/system/cpu.
Numpy needs these.

Also added the "present" directory, since the contents are the same as possible
and online.

PiperOrigin-RevId: 209451777
Change-Id: I2048de3f57bf1c57e9b5421d607ca89c2a173684
2018-08-20 11:19:15 -07:00
Chenggang Qin aeec7a4c00 fs: Support possible and online knobs for cpu
Some linux commands depend on /sys/devices/system/cpu/possible, such
as 'lscpu'.

Add 2 knobs for cpu:
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/online
Both the values are '0 - Kernel.ApplicationCores()-1'.

Change-Id: Iabd8a4e559cbb630ed249686b92c22b4e7120663
PiperOrigin-RevId: 209070163
2018-08-16 16:28:14 -07:00
Googler fbd5df9c6f Internal change.
PiperOrigin-RevId: 209060862
Change-Id: I2cd02f0032b80d0087110095548b1a8ffa696ac2
2018-08-16 15:34:00 -07:00
Ian Gudger eacbe6a678 Remove obsolete comment about panicking
PiperOrigin-RevId: 208908702
Change-Id: I6be9c765c257a9ddb1a965a03942ab3fc3a34a43
2018-08-15 17:02:15 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Ian Gudger a620bea045 Reduce map lookups in syserr
PiperOrigin-RevId: 208755352
Change-Id: Ia24630f452a4a42940ab73a8113a2fd5ea2cfca2
2018-08-14 19:03:38 -07:00
Nicolas Lacasse e8a4f2e133 runsc: Change cache policy for root fs and volume mounts.
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively.  While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.

This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.

This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.

A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.

All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely served by the
root fs.

PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
2018-08-14 16:25:58 -07:00
Kevin Krakauer d4939f6dc2 TTY: Fix data race where calls into tty.queue's waiter were not synchronized.
Now, there's a waiter for each end (master and slave) of the TTY, and each
waiter.Entry is only enqueued in one of the waiters.

PiperOrigin-RevId: 208734483
Change-Id: I06996148f123075f8dd48cde5a553e2be74c6dce
2018-08-14 16:22:56 -07:00
Kevin Krakauer 12a4912aed Fix `ls -laR | wc -l` hanging.
stat()-ing /proc/PID/fd/FD incremented but didn't decrement the refcount for
FD. This behavior wasn't usually noticeable, but in the above case:

- ls would never decrement the refcount of the write end of the pipe to 0.
- This caused the write end of the pipe never to close.
- wc would then hang read()-ing from the pipe.

PiperOrigin-RevId: 208728817
Change-Id: I4fca1ba5ca24e4108915a1d30b41dc63da40604d
2018-08-14 15:49:58 -07:00
Ian Gudger e97717e29a Enforce Unix socket address length limit
PiperOrigin-RevId: 208720936
Change-Id: Ic943a88b6efeff49574306d4d4e1f113116ae32e
2018-08-14 15:07:05 -07:00
Nicolas Lacasse 6cf2278167 Automated rollback of changelist 208284483
PiperOrigin-RevId: 208685417
Change-Id: Ie2849c4811e3a2d14a002f521cef018ded0c6c4a
2018-08-14 11:50:49 -07:00
Nicolas Lacasse 66b0f3e15a Fix bind() on overlays.
InodeOperations.Bind now returns a Dirent which will be cached in the Dirent
tree.

When an overlay is in-use, Bind cannot return the Dirent created by the upper
filesystem because the Dirent does not know about the overlay. Instead,
overlayBind must create a new overlay-aware Inode and Dirent and return that.
This is analagous to how Lookup and overlayLookup work.

PiperOrigin-RevId: 208670710
Change-Id: I6390affbcf94c38656b4b458e248739b4853da29
2018-08-14 10:34:56 -07:00
Adin Scannell dde836a918 Prevent renames across walk fast path.
PiperOrigin-RevId: 208533436
Change-Id: Ifc1a4e2d6438a424650bee831c301b1ac0d670a3
2018-08-13 13:31:18 -07:00
Adin Scannell 85235ac212 Add path sanity checks.
PiperOrigin-RevId: 208527333
Change-Id: I55291bc6b8bc6b88fdd75baf899a71854c39c1a7
2018-08-13 12:50:29 -07:00
Nicolas Lacasse a2ec391dfb fs: Allow overlays to revalidate files from the upper fs.
Previously, an overlay would panic if either the upper or lower fs required
revalidation for a given Dirent. Now, we allow revalidation from the upper
file, but not the lower.

If a cached overlay inode does need revalidation (because the upper needs
revalidation), then the entire overlay Inode will be discarded and a new
overlay Inode will be built with a fresh copy of the upper file.

As a side effect of this change, Revalidate must take an Inode instead of a
Dirent, since an overlay needs to revalidate individual Inodes.

PiperOrigin-RevId: 208293638
Change-Id: Ic8f8d1ffdc09114721745661a09522b54420c5f1
2018-08-10 17:16:38 -07:00
Justine Olshan ae6f092fe1 Implemented the splice(2) syscall.
Currently the implementation matches the behavior of moving data
between two file descriptors. However, it does not implement this
through zero-copy movement. Thus, this code is a starting point
to build the more complex implementation.

PiperOrigin-RevId: 208284483
Change-Id: Ibde79520a3d50bc26aead7ad4f128d2be31db14e
2018-08-10 16:11:01 -07:00
Nicolas Lacasse 567c5eed11 cache policy: Check policy before returning a negative dirent.
The cache policy determines whether Lookup should return a negative dirent, or
just ENOENT. This CL fixes one spot where we returned a negative dirent without
first consulting the policy.

PiperOrigin-RevId: 208280230
Change-Id: I8f963bbdb45a95a74ad0ecc1eef47eff2092d3a4
2018-08-10 15:43:03 -07:00
Brielle Broder 4ececd8e8d Enable checkpoint/restore in cases of UDS use.
Previously, processes which used file-system Unix Domain Sockets could not be
checkpoint-ed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.

PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
2018-08-10 14:33:20 -07:00
Neel Natu d5b702b64f Validate FS.base before establishing it in the task's register set.
PiperOrigin-RevId: 208229341
Change-Id: I5d84bc52bbafa073446ef497e56958d0d7955aa8
2018-08-10 10:27:09 -07:00
Michael Pratt 2e06b23aa6 Fix missing O_LARGEFILE from O_CREAT files
Cleanup some more syscall.O_* references while we're here.

PiperOrigin-RevId: 208133460
Change-Id: I48db71a38f817e4f4673977eafcc0e3874eb9a25
2018-08-09 16:50:37 -07:00
Fabricio Voznika 4e171f7590 Basic support for ip link/addr and ifconfig
Closes #94

PiperOrigin-RevId: 207997580
Change-Id: I19b426f1586b5ec12f8b0cd5884d5b401d334924
2018-08-08 22:39:58 -07:00
Adin Scannell 48b5b35b2b Fix error handling for bad message sizes.
The message size check is legitimate: the size must be negotiated, which
relies on the fixed message limit up front. Sending a message larger than
that indicates that the connection is out of sync and is considered a
socket error (disconnect).

Similarly, sending a size that is too small indicates that the stream is
out-of-sync or invalid.

PiperOrigin-RevId: 207996551
Change-Id: Icd8b513d5307e9d5953dbb957ee70ceea111098d
2018-08-08 22:23:47 -07:00
Fabricio Voznika ea1e39a314 Resend packets back to netstack if destined to itself
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080

PiperOrigin-RevId: 207995083
Change-Id: I17adc2a04df48bfea711011a5df206326a1fb8ef
2018-08-08 22:03:35 -07:00
Adin Scannell dbbe9ec915 Protect PCIDs with a mutex.
Because the Drop method may be called across vCPUs, it is necessary to protect
the PCID database with a mutex to prevent concurrent modification. The PCID is
assigned prior to entersyscall, so it's safe to block.

PiperOrigin-RevId: 207992864
Change-Id: I8b36d55106981f51e30dcf03e12886330bb79d67
2018-08-08 21:29:19 -07:00
Ian Gudger 2a44362c0b Fix data race in unix.BoundEndpoint.UnidirectionalConnect.
Data race is:

Read:
(*connectionlessEndpoint).UnidirectionalConnect:
writeQueue: e.receiver.(*queueReceiver).readQueue,

Write:
(*connectionlessEndpoint).Close:
e.receiver = nil

The problem is that (*connectionlessEndpoint).UnidirectionalConnect assumed
that baseEndpoint.receiver is immutable which is explicitly not the case.
Fixing this required two changes:
1. Add synchronization around access of baseEndpoint.receiver in
   (*connectionlessEndpoint).UnidirectionalConnect.
2. Check for baseEndpoint.receiver being nil in
   (*connectionlessEndpoint).UnidirectionalConnect.

PiperOrigin-RevId: 207984402
Change-Id: Icddeeb43805e777fa3ef874329fa704891d14181
2018-08-08 19:23:48 -07:00
Fabricio Voznika 0d350aac7f Enable SACK in runsc
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.

PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
2018-08-08 10:26:18 -07:00
Jamie Liu c036da5dff Hold TaskSet.mu in Task.Parent.
PiperOrigin-RevId: 207766238
Change-Id: Id3b66d8fe1f44c3570f67fa5ae7ba16021e35be1
2018-08-07 13:09:42 -07:00
Bhasker Hariharan 7d3684aadf Adds support to dump out cubic internal state.
PiperOrigin-RevId: 207754087
Change-Id: I83abce64348ea93f8692da81a881b364dae2158b
2018-08-07 11:49:57 -07:00
Nicolas Lacasse a38f41b464 fs: Add new cache policy "remote_revalidate".
This CL adds a new cache-policy for gofer filesystems that uses the host page
cache, but causes dirents to be reloaded on each Walk, and does not cache
readdir results.

This policy is useful when the remote filesystem may change out from underneath
us, as any remote changes will be reflected on the next Walk.

Importantly, this cache policy is only consistent if we do not use gVisor's
internal page cache, since that page cache is tied to the Inode and may be
thrown away upon Revalidation.

This cache policy should only be used when the gofer supports donating host
FDs, since then gVisor will make use of the host kernel page cache, which will
be consistent for all open files in the gofer. In fact, a panic will be raised
if a file is opened without a donated FD.

PiperOrigin-RevId: 207752937
Change-Id: I233cb78b4695bbe00a4605ae64080a47629329b8
2018-08-07 11:43:41 -07:00
Zhaozhong Ni c348d07863 sentry: make epoll.pollEntry wait for the file operation in restore.
PiperOrigin-RevId: 207737935
Change-Id: I3a301ece1f1d30909715f36562474e3248b6a0d5
2018-08-07 10:27:37 -07:00
Brian Geffon d839dc13c6 Netstack doesn't handle sending after SHUT_WR correctly.
PiperOrigin-RevId: 207715032
Change-Id: I7b6690074c5be283145192895d706a92e921b22c
2018-08-07 07:57:20 -07:00
Michael Pratt 42086fe8e1 Make ramfs.File savable
In other news, apparently proc.fdInfo is the last user of ramfs.File.

PiperOrigin-RevId: 207564572
Change-Id: I5a92515698cc89652b80bea9a32d309e14059869
2018-08-06 10:15:56 -07:00
ShiruRen 3ec074897f Fix a bug in PCIDs.Assign
Store the new assigned pcid in p.cache[pt].

Signed-off-by: ShiruRen <renshiru2000@gmail.com>

Change-Id: I4aee4e06559e429fb5e90cb9fe28b36139e3b4b6
PiperOrigin-RevId: 207563833
2018-08-06 10:11:56 -07:00
Bhasker Hariharan 56fa562dda Cubic implementation for Netstack.
This CL implements CUBIC as described in https://tools.ietf.org/html/rfc8312.

PiperOrigin-RevId: 207353142
Change-Id: I329cbf3277f91127e99e488f07d906f6779c6603
2018-08-03 17:54:42 -07:00
Zhaozhong Ni 25178ebdf5 stateify: make explicit mode no longer optional.
PiperOrigin-RevId: 207303405
Change-Id: I17b6433963d78e3631a862b7ac80f566c8e7d106
2018-08-03 12:09:13 -07:00
Michael Pratt a3927157c5 Copy creds in access
PiperOrigin-RevId: 207181631
Change-Id: Ic6205278715a9260fb970efb414fc758ea72c4c6
2018-08-02 16:01:31 -07:00
Michael Pratt b6a37ab9d9 Update comment reference
PiperOrigin-RevId: 207180809
Change-Id: I08c264812919e81b2c56fdd4a9ef06924de8b52f
2018-08-02 15:56:40 -07:00
Zhaozhong Ni 57d0fcbdbf Automated rollback of changelist 207037226
PiperOrigin-RevId: 207125440
Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02 10:42:48 -07:00
Brian Geffon cf44aff6e0 Add seccomp(2) support.
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.

PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
2018-08-02 08:10:30 -07:00
Ian Gudger 3cd7824410 Move stack clock to options struct
PiperOrigin-RevId: 207039273
Change-Id: Ib8f55a6dc302052ab4a10ccd70b07f0d73b373df
2018-08-01 20:22:02 -07:00
Michael Pratt 60add78980 Automated rollback of changelist 207007153
PiperOrigin-RevId: 207037226
Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01 19:57:32 -07:00
Zhaozhong Ni b9e1cf8404 stateify: convert all packages to use explicit mode.
PiperOrigin-RevId: 207007153
Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-08-01 15:43:24 -07:00
Brielle Broder 6b87378634 New conditional for adding key/value pairs to maps.
When adding MultiDeviceKeys and their values into MultiDevice maps, make
sure the keys and values have not already been added. This ensures that
preexisting key/value pairs are not overridden.

PiperOrigin-RevId: 206942766
Change-Id: I9d85f38eb59ba59f0305e6614a52690608944981
2018-08-01 09:44:57 -07:00
Andrei Vagin a7a0167716 proc: show file flags in fdinfo
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:

/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}

Actually, fdinfo has to contain file flags.

Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
2018-07-31 11:19:15 -07:00
Zhaozhong Ni 0a55f8c1c1 netstack: support disconnect-on-save option per fdbased link.
PiperOrigin-RevId: 206659972
Change-Id: I5e0e035f97743b6525ad36bed2c802791609beaf
2018-07-30 15:43:25 -07:00
Justine Olshan 2793f7ac5f Added the O_LARGEFILE flag.
This flag will always be true for gVisor files.

PiperOrigin-RevId: 206355963
Change-Id: I2f03d2412e2609042df43b06d1318cba674574d0
2018-07-27 12:27:46 -07:00
Zhaozhong Ni be7fcbc558 stateify: support explicit annotation mode; convert refs and stack packages.
We have been unnecessarily creating too many savable types implicitly.

PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
2018-07-27 10:17:21 -07:00
Nicolas Lacasse 127c977ab0 Don't copy-up extended attributes that specifically configure a lower overlay.
When copying-up files from a lower fs to an upper, we also copy the extended
attributes on the file. If there is a (nested) overlay inside the lower, some
of these extended attributes configure the lower overlay, and should not be
copied-up to the upper.

In particular, whiteout attributes in the lower fs overlay should not be
copied-up, since the upper fs may actually contain the file.

PiperOrigin-RevId: 206236010
Change-Id: Ia0454ac7b99d0e11383f732a529cb195ed364062
2018-07-26 15:55:50 -07:00
Michael Pratt 7cd9405b9c Format openat flags
PiperOrigin-RevId: 206021774
Change-Id: I447b6c751c28a8d8d4d78468b756b6ad8c61e169
2018-07-25 11:07:19 -07:00
Kevin Krakauer 32aa0f5465 Typo fix.
PiperOrigin-RevId: 205880843
Change-Id: If2272b25f08a18ebe9b6309a1032dd5cdaa59866
2018-07-24 13:26:06 -07:00
Bhasker Hariharan da48c04d0d Refactor new reno congestion control logic out of sender.
This CL also puts the congestion control logic behind an
interface so that we can easily swap it out for say CUBIC
in the future.

PiperOrigin-RevId: 205732848
Change-Id: I891cdfd17d4d126b658b5faa0c6bd6083187944b
2018-07-23 15:15:07 -07:00
Fabricio Voznika d7a34790a0 Add KVM and overlay dimensions to container_test
PiperOrigin-RevId: 205714667
Change-Id: I317a2ca98ac3bdad97c4790fcc61b004757d99ef
2018-07-23 13:31:42 -07:00
Michael Pratt 5f134b3c0a Format getcwd path
PiperOrigin-RevId: 205440332
Change-Id: I2a838f363e079164c83da88e1b0b8769844fe79b
2018-07-20 12:59:41 -07:00
Adin Scannell 8b8aad91d5 kernel: mutations on creds now require a copy.
PiperOrigin-RevId: 205315612
Change-Id: I9a0a1e32c8abfb7467a38743b82449cc92830316
2018-07-19 15:48:56 -07:00
Nicolas Lacasse be431d0934 fs: Pass context to Revalidate() function.
The current revalidation logic is very simple and does not do much
introspection of the dirent being revalidated (other than looking at the type
of file).

Fancier revalidation logic is coming soon, and we need to be able to look at
the cached and uncached attributes of a given dirent, and we need a context to
perform some of these operations.

PiperOrigin-RevId: 205307351
Change-Id: If17ea1c631d8f9489c0e05a263e23d7a8a3bf159
2018-07-19 14:57:52 -07:00
Nicolas Lacasse ea37103196 ConfigureMMap on an overlay file delegates to the upper if there is no lower.
In the general case with an overlay, all mmap calls must go through the
overlay, because in the event of a copy-up, the overlay needs to invalidate any
previously-created mappings.

If there if no lower file, however, there will never be a copy-up, so the
overlay can delegate directly to the upper file in that case.

This also allows us to correctly mmap /dev/zero when it is in an overlay. This
file has special semantics which the overlay does not know about. In
particular, it does not implement Mappable(), which (in the general case) the
overlay uses to detect if a file is mappable or not.

PiperOrigin-RevId: 205306743
Change-Id: I92331649aa648340ef6e65411c2b42c12fa69631
2018-07-19 14:53:38 -07:00
Brian Geffon df5a5d388e Add AT_UID, AT_EUID, AT_GID, AT_EGID to aux vector.
With musl libc when these entries are missing from the aux vector
it's forcing libc.secure (effectively AT_SECURE). This mode prevents
RPATH and LD_LIBRARY_PATH from working.

https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n1488
As the first entry is a mask of all the aux fields set:
https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n187

PiperOrigin-RevId: 205284684
Change-Id: I04de7bab241043306b4f732306a81d74edfdff26
2018-07-19 12:42:05 -07:00
Zhaozhong Ni a95640b1e9 sentry: save stack in proc net dev.
PiperOrigin-RevId: 205253858
Change-Id: Iccdc493b66d1b4d39de44afb1184952183b1283f
2018-07-19 09:37:32 -07:00
Justine Olshan c05660373e Moved restore code out of create and made to be called after create.
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specificactions regarding the kernel
and the file system. These actions were originally in booting the sandbox.

Now setting up the file system is deferred until a call to a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.

These changes required careful execution of concurrent processes which
required the use of a channel.

Full docker integration still needs the ability to restore into the same
container.

PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
2018-07-18 16:58:30 -07:00
Nicolas Lacasse 63e2820f7b Fix lock-ordering violation in Create by logging BaseName instead of FullName.
Dirent.FullName takes the global renameMu, but can be called during Create,
which itself takes dirent.mu and dirent.dirMu, which is a lock-order violation:

Dirent.Create
  d.dirMu.Lock
  d.mu.Lock
  Inode.Create
    gofer.inodeOperations.Create
      gofer.NewFile
        Dirent.FullName
          d.renameMu.RLock

We only use the FullName here for logging, and in this case we can get by with
logging only the BaseName.

A `BaseName` method was added to Dirent, which simply returns the name, taking
d.parent.mu as required.

In the Create pathway, we can't call d.BaseName() because taking d.parent.mu
after d.mu violates the lock order. But we already know the base name of the
file we just created, so that's OK.

In the Open/GetFile pathway, we are free to call d.BaseName() because the other
dirent locks are not held.

PiperOrigin-RevId: 205112278
Change-Id: Ib45c734081aecc9b225249a65fa8093eb4995f10
2018-07-18 11:49:50 -07:00
Michael Pratt 733ebe7c09 Merge FileMem.usage in IncRef
Per the doc, usage must be kept maximally merged. Beyond that, it is simply a
good idea to keep fragmentation in usage to a minimum.

The glibc malloc allocator allocates one page at a time, potentially causing
lots of fragmentation. However, those pages are likely to have the same number
of references, often making it possible to merge ranges.

PiperOrigin-RevId: 204960339
Change-Id: I03a050cf771c29a4f05b36eaf75b1a09c9465e14
2018-07-17 13:03:59 -07:00
Neel Natu ed2e03d378 Add API to decode 'stat.st_rdev' into major and minor numbers.
PiperOrigin-RevId: 204936533
Change-Id: Ib060920077fc914f97c4a0548a176d1368510c7b
2018-07-17 10:50:53 -07:00
Zhaozhong Ni beb89bb757 netstack: update goroutine save / restore safety comments.
PiperOrigin-RevId: 204930314
Change-Id: Ifc4c41ed28616cd57fafbf7c92e87141a945c41f
2018-07-17 10:15:00 -07:00
Adin Scannell 29e00c943a Add CPUID faulting for ptrace and KVM.
PiperOrigin-RevId: 204858314
Change-Id: I8252bf8de3232a7a27af51076139b585e73276d4
2018-07-16 22:02:58 -07:00
Michael Pratt 14d06064d2 Start allocation and reclaim scans only where they may find a match
If usageSet is heavily fragmented, findUnallocatedRange and findReclaimable
can spend excessive cycles linearly scanning the set for unallocated/free
pages.

Improve common cases by beginning the scan only at the first page that could
possibly contain an unallocated/free page. This metadata only guarantees that
there is no lower unallocated/free page, but a scan may still be required
(especially for multi-page allocations).

That said, this heuristic can still provide significant performance
improvements for certain applications.

PiperOrigin-RevId: 204841833
Change-Id: Ic41ad33bf9537ecd673a6f5852ab353bf63ea1e6
2018-07-16 18:19:01 -07:00
Neel Natu 8f21c0bb28 Add EventOperations.HostFD()
This method allows an eventfd inside the Sentry to be registered with with
the host kernel.

Update comment about memory mapping host fds via CachingInodeOperations.

PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
2018-07-16 12:20:05 -07:00
Neel Natu 5b09ec3b89 Allow a filesystem to control its visibility in /proc/filesystems.
PiperOrigin-RevId: 204508520
Change-Id: I09e5f8b6e69413370e1a0d39dbb7dc1ee0b6192d
2018-07-13 12:10:57 -07:00
Michael Pratt f09ebd9c71 Note that Mount errors do not require translations
PiperOrigin-RevId: 204490639
Change-Id: I0fe26306bae9320c6aa4f854fe0ef25eebd93233
2018-07-13 10:24:18 -07:00
Michael Pratt a28b274abb Fix aio eventfd lookup
We're failing to set eventFile in the outer scope.

PiperOrigin-RevId: 204392995
Change-Id: Ib9b04f839599ef552d7b5951d08223e2b1d5f6ad
2018-07-12 17:14:50 -07:00
Zhaozhong Ni 1cd46c8dd1 sentry: wait for restore clock instead of panicing in Timekeeper.
PiperOrigin-RevId: 204372296
Change-Id: If1ed9843b93039806e0c65521f30177dc8036979
2018-07-12 15:09:02 -07:00
Zhaozhong Ni bb41ad808a sentry: save inet stacks in proc files.
PiperOrigin-RevId: 204362791
Change-Id: If85ea7442741e299f0d7cddbc3d6b415e285da81
2018-07-12 14:19:04 -07:00
Zhaozhong Ni 45c50eb124 netstack: save tcp endpoint accepted channel directly.
PiperOrigin-RevId: 204356873
Change-Id: I5e2f885f58678e693aae1a69e8bf8084a685af28
2018-07-12 13:49:21 -07:00
Zhaozhong Ni cc34a90fb4 netstack: do not defer panicable logic in tcp main loop.
PiperOrigin-RevId: 204355026
Change-Id: I1a8229879ea3b58aa861a4eb4456fd7aff99863d
2018-07-12 13:39:28 -07:00
Michael Pratt 41e0b977e5 Format documentation
PiperOrigin-RevId: 204323728
Change-Id: I1ff9aa062ffa12583b2e38ec94c87db7a3711971
2018-07-12 10:37:21 -07:00
Bhasker Hariharan c15cb8d432 Automated rollback of changelist 203157739
PiperOrigin-RevId: 204196916
Change-Id: If632750fc6368acb835e22cfcee0ae55c8a04d16
2018-07-11 15:07:19 -07:00
Jamie Liu b9c469f372 Move ptrace constants to abi/linux.
PiperOrigin-RevId: 204188763
Change-Id: I5596ab7abb3ec9e210a7f57b3fc420e836fa43f3
2018-07-11 14:24:19 -07:00
Jamie Liu ee0ef506d4 Add MemoryManager.Pin.
PiperOrigin-RevId: 204162313
Change-Id: Ib0593dde88ac33e222c12d0dca6733ef1f1035dc
2018-07-11 11:52:09 -07:00
Michael Pratt 9cd69c2f3d Internal change
PiperOrigin-RevId: 204028082
Change-Id: I4251cce10aace43f9b9a80c36204ef66f1b329df
2018-07-10 15:55:10 -07:00
Jamie Liu 06920b3d1b Exit tmpfs.fileInodeOperations.Translate early if required.Start >= EOF.
Otherwise required and optional can be empty or have negative length.

PiperOrigin-RevId: 204007079
Change-Id: I59e472a87a8caac11ffb9a914b8d79bf0cd70995
2018-07-10 13:58:54 -07:00
Zhaozhong Ni bf580cf64d netstack: only do connected TCP S/R for loopback connections.
PiperOrigin-RevId: 204006237
Change-Id: Ica8402ab54d9dd7d11cc41c6d74aacef51d140b7
2018-07-10 13:54:40 -07:00
Michael Pratt 065d7cee9a Internal change
PiperOrigin-RevId: 203997995
Change-Id: I8974fe74f1582bc9b2622f18a4bc4ab47ff5d622
2018-07-10 13:09:02 -07:00
Zhaozhong Ni b1683df90b netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 203958972
Change-Id: Ia6fe16547539296d48e2c6731edacdd96bd6e93c
2018-07-10 09:23:35 -07:00
Ian Gudger afd655a5d8 Notify UDP and Ping endpoints on close
PiperOrigin-RevId: 203883138
Change-Id: I7500c0a70f5d71c3fb37e2477f7fc466fa92fd3e
2018-07-09 21:20:50 -07:00
Brian Geffon da9b5153f2 Fix two race conditions in tcp stack.
PiperOrigin-RevId: 203880278
Change-Id: I66b790a616de59142859cc12db4781b57ea626d3
2018-07-09 20:48:27 -07:00
Jamie Liu 41aeb680b1 Inherit parent in clone(CLONE_THREAD) under TaskSet.mu.
PiperOrigin-RevId: 203849534
Change-Id: I4d81513bfd32e0b7fc40c8a4c194eba7abc35a83
2018-07-09 16:16:19 -07:00
Nicolas Lacasse bf0fa09537 Switch netstack licenses to Apache 2.0.
Fixes #27

PiperOrigin-RevId: 203825288
Change-Id: Ie9f3a2b2c1e296b026b024f75c07da1a7e118633
2018-07-09 14:04:40 -07:00
Michael Pratt 0dedac637f Trim all whitespace between interpreter and arg
Multiple whitespace characters are allowed. This fixes Ubuntu's
/usr/sbin/invoke-rc.d, which has trailing whitespace after the
interpreter which we were treating as an arg.

PiperOrigin-RevId: 203802278
Change-Id: I0a6cdb0af4b139cf8abb22fa70351fe3697a5c6b
2018-07-09 11:43:56 -07:00
Ian Gudger 5c88e6a15d Add non-AMD64 support to rawfile
PiperOrigin-RevId: 203499064
Change-Id: I2cd5189638e94ce926f1e82c1264a8d3ece9dfa5
2018-07-06 10:58:37 -07:00
Rahat Mahmood 34af9a6174 Fix data race on inotify.Watch.mask.
PiperOrigin-RevId: 203180463
Change-Id: Ief50988c1c028f81ec07a26e704d893e86985bf0
2018-07-03 14:08:51 -07:00
Michael Pratt 660f1203ff Fix runsc VDSO mapping
80bdf8a406 accidentally moved vdso into an
inner scope, never assigning the vdso variable passed to the Kernel and
thus skipping VDSO mappings.

Fix this and remove the ability for loadVDSO to skip VDSO mappings,
since tests that do so are gone.

PiperOrigin-RevId: 203169135
Change-Id: Ifd8cadcbaf82f959223c501edcc4d83d05327eba
2018-07-03 12:53:39 -07:00
Fabricio Voznika 0ef6066167 Resend packets back to netstack if destined to itself
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080

PiperOrigin-RevId: 203157739
Change-Id: I31c9f7c501e3f55007f25e1852c27893a16ac6c4
2018-07-03 11:39:17 -07:00
Michael Pratt 062a6f6ec5 Handle NUL-only paths in exec
The path in execve(2), interpreter script, and ELF interpreter may all
be no more than a NUL-byte. Handle each of those cases.

PiperOrigin-RevId: 203155745
Change-Id: I1c8b1b387924b23b2cf942341dfc76c9003da959
2018-07-03 11:28:53 -07:00
Michael Pratt 2821dfe6ce Hold d.parent.mu when reading d.name
PiperOrigin-RevId: 203041657
Change-Id: I120783d91712818e600505454c9276f8d9877f37
2018-07-02 17:39:10 -07:00
Michael Pratt 7f9c822f53 Drop version option from mount command
Fun fact: in protocol version negotiation, our 9p version must be
written "9P2000.L". In the 'version' mount option, it must be
written "9p2000.L". Very consistent!

The mount command as given complains about an unknown protocol
version. Drop it entirely because Linux defaults to 9p2000.L
anyways.

PiperOrigin-RevId: 202971961
Change-Id: I5d46c83f03182476033db9c36870c68aeaf30f65
2018-07-02 10:23:27 -07:00
Justine Olshan 80bdf8a406 Sets the restore environment for restoring a container.
Updated how restoring occurs through boot.go with a separate Restore function.
This prevents a new process and new mounts from being created.
Added tests to ensure the container is restored.
Registered checkpoint and restore commands so they can be used.
Docker support for these commands is still limited.
Working on #80.

PiperOrigin-RevId: 202710950
Change-Id: I2b893ceaef6b9442b1ce3743bd112383cb92af0c
2018-06-29 14:47:40 -07:00
Brian Geffon 23f49097c7 Panic in netstack during cleanup where a FIN becomes a RST.
There is a subtle bug where during cleanup with unread data a FIN can
be converted to a RST, at that point the entire connection should be
aborted as we're not expecting any ACKs to the RST.

PiperOrigin-RevId: 202691271
Change-Id: Idae70800208ca26e07a379bc6b2b8090805d0a22
2018-06-29 12:40:26 -07:00
Nicolas Lacasse 1b5e09f968 aio: Return EINVAL if the number of events is negative.
PiperOrigin-RevId: 202671065
Change-Id: I248b74544d47ddde9cd59d89aa6ccb7dad2b6f89
2018-06-29 10:47:38 -07:00
Nicolas Lacasse f93bd2cbe6 Hold t.mu while calling t.FSContext().
PiperOrigin-RevId: 202562686
Change-Id: I0f5be7cc9098e86fa31d016251c127cb91084b05
2018-06-28 16:11:19 -07:00
Nicolas Lacasse 1ceed49ba9 Check for invalid offset when submitting an AIO read/write request.
PiperOrigin-RevId: 202528335
Change-Id: Ic32312cf4337bcb40a7155cb2174e5cd89a280f7
2018-06-28 12:55:18 -07:00
Fabricio Voznika 6b6852bceb Fix semaphore data races
PiperOrigin-RevId: 202371908
Change-Id: I72603b1d321878cae6404987c49e64732b676331
2018-06-27 14:41:32 -07:00
Nicolas Lacasse 99afc982f1 Call mm.CheckIORange() when copying in IOVecs.
CheckIORange is analagous to Linux's access_ok() method, which is checked when
copying in IOVecs in both lib/iov_iter.c:import_single_range() and
lib/iov_iter.c:import_iovec() => fs/read_write.c:rw_copy_check_uvector().

gVisor copies in IOVecs via Task.SingleIOSequence() and Task.CopyInIovecs().
We were checking the address range bounds, but not whether the address is
valid. To conform with linux, we should also check that the address is valid.

For usual preadv/pwritev syscalls, the effect of this change is not noticeable,
since we find out that the address is invalid before the syscall completes.

For vectorized async-IO operations, however, this change is necessary because
Linux returns EFAULT when the operation is submitted, but before it executes.
Thus, we must validate the iovecs when copying them in.

PiperOrigin-RevId: 202370092
Change-Id: I8759a63ccf7e6b90d90d30f78ab8935a0fcf4936
2018-06-27 14:31:35 -07:00
Jamie Liu 4215e059e2 Ignore MADV_DONTDUMP and MADV_DODUMP.
PiperOrigin-RevId: 202361912
Change-Id: I1d0ee529073954d467b870872f494cebbf8ea61a
2018-06-27 13:42:37 -07:00
Fabricio Voznika c186e408cc Add KVM, overlay and host network to image tests
PiperOrigin-RevId: 202236006
Change-Id: I4ea964a70fc49e8b51c9da27d77301c4eadaae71
2018-06-26 19:05:50 -07:00
Adin Scannell dc33d71f8c Change SIGCHLD to SIGKILL in ptrace stubs.
If the child stubs are killed by any unmaskable signal (e.g. SIGKILL), then
the parent process will similarly be killed, resulting in the death of all
other stubs.

The effect of this is that if the OOM killer selects and kills a stub, the
effect is the same as though the OOM killer selected and killed the sentry.

PiperOrigin-RevId: 202219984
Change-Id: I0b638ce7e59e0a0f4d5cde12a7d05242673049d7
2018-06-26 16:54:44 -07:00
Jamie Liu ea10949a00 Use the correct Context for /proc/[pid]/maps.
PiperOrigin-RevId: 202180487
Change-Id: I95cce41a4842ab731a4821b387b32008bfbdcb08
2018-06-26 13:09:50 -07:00
Ian Gudger 5f7f78c1d7 Fix data races in Unix sockets
PiperOrigin-RevId: 202175558
Change-Id: I0113cb9a90d7a0cd7964bf43eef67f70c92d9589
2018-06-26 12:41:22 -07:00
Jamie Liu 33041b36cb Add Context to seqfile.SeqSource.ReadSeqFileData.
PiperOrigin-RevId: 202163895
Change-Id: Ib9942fcff80c0834216f4f10780662bef5b52270
2018-06-26 11:35:20 -07:00
Brian Geffon 51c1e510ab Automated rollback of changelist 201596247
PiperOrigin-RevId: 202151720
Change-Id: I0491172c436bbb32b977f557953ba0bc41cfe299
2018-06-26 10:33:24 -07:00
Michael Pratt db94befb63 Fix panic message
The arguments are backwards from the message.

PiperOrigin-RevId: 202054887
Change-Id: Id5750a84ca091f8b8fbe15be8c648d4fa3e31eb2
2018-06-25 18:17:17 -07:00
Jamie Liu 16882484f9 Check for empty applicationAddrRange in MM.DecUsers.
PiperOrigin-RevId: 202043776
Change-Id: I4373abbcf735dc1cf4bebbbbb0c7124df36e9e78
2018-06-25 16:50:38 -07:00
Michael Pratt 4ac79312b0 Don't read cwd or root without holding mu
PiperOrigin-RevId: 202043090
Change-Id: I3c47fb3413ca8615d50d8a0503d72fcce9b09421
2018-06-25 16:46:29 -07:00
Nicolas Lacasse 1a9917d14d MountSource.Root() should return a refernce on the dirent.
PiperOrigin-RevId: 202038397
Change-Id: I074d525f2e2d9bcd43b247b62f86f9129c101b78
2018-06-25 16:17:12 -07:00
Michael Pratt 478f0ac003 Don't read FSContext.root without holding FSContext.mu
IsChrooted still has the opportunity to race with another thread
entering the FSContext into a chroot, but that is unchanged (and
fine, AFAIK).

PiperOrigin-RevId: 202029117
Change-Id: I38bce763b3a7715fa6ae98aa200a19d51a0235f1
2018-06-25 15:22:56 -07:00
Brian Geffon 7c645ac273 Add rpcinet support for SIOCGIFCONF.
The interfaces and their addresses are already available via
the stack Intefaces and InterfaceAddrs.

Also add some tests as we had no tests around SIOCGIFCONF. I also added the socket_netgofer lifecycle for IOCTL tests.

PiperOrigin-RevId: 201744863
Change-Id: Ie0a285a2a2f859fa0cafada13201d5941b95499a
2018-06-22 14:48:19 -07:00
Nicolas Lacasse e0e6409812 Simplify some handle logic.
PiperOrigin-RevId: 201738936
Change-Id: Ib75136415e28e8df0c742acd6b9512d4809fe3a8
2018-06-22 14:10:30 -07:00
Jamie Liu fe3fc44da3 Handle mremap(old_size=0).
PiperOrigin-RevId: 201729703
Change-Id: I486900b0c6ec59533b88da225a5829c474e35a70
2018-06-22 13:08:38 -07:00
Brian Geffon 5d45f88f2c Netstack should return EOF on closed read.
The shutdown behavior where we return EAGAIN for sockets
which are non-blocking is only correct for packet based sockets.
SOCK_STREAM sockets should return EOF.

PiperOrigin-RevId: 201703055
Change-Id: I20b25ceca7286c37766936475855959706fc5397
2018-06-22 10:19:25 -07:00
Zhaozhong Ni 0e434b66a6 netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 201596247
Change-Id: Id22f47b2cdcbe14aa0d930f7807ba75f91a56724
2018-06-21 15:19:45 -07:00
Michael Pratt 2dedbc7211 Drop return from SendExternalSignal
SendExternalSignal is no longer called before CreateProcess, so it can
enforce this simplified precondition.

StartForwarding, and after Kernel.Start.

PiperOrigin-RevId: 201591170
Change-Id: Ib7022ef7895612d7d82a00942ab59fa433c4d6e9
2018-06-21 14:53:55 -07:00
Fabricio Voznika f6be5fe619 Forward SIGUSR2 to the sandbox too
SIGUSR2 was being masked out to be used as a way to dump sentry
stacks. This could cause compatibility problems in cases anyone
uses SIGUSR2 to communicate with the container init process.

PiperOrigin-RevId: 201575374
Change-Id: I312246e828f38ad059139bb45b8addc2ed055d74
2018-06-21 13:22:18 -07:00
Ian Gudger d571a4359c Implement ioctl(FIOASYNC)
FIOASYNC and friends are used to send signals when a file is ready for IO.

This may or may not be needed by Nginx. While Nginx does use it, it is unclear
if the code that uses it has any effect.

PiperOrigin-RevId: 201550828
Change-Id: I7ba05a7db4eb2dfffde11e9bd9a35b65b98d7f50
2018-06-21 10:53:21 -07:00
Fabricio Voznika 4ad7315b67 Add 'runsc debug' command
It prints sandbox stacks to the log to help debug stuckness. I expect
that many more options will be added in the future.

PiperOrigin-RevId: 201405931
Change-Id: I87e560800cd5a5a7b210dc25a5661363c8c3a16e
2018-06-20 13:31:31 -07:00
Nicolas Lacasse d93f55e863 Remove some defers in hot paths in the filesystem code.
PiperOrigin-RevId: 201401727
Change-Id: Ia5589882ba58a00efb522ab372e206b7e8e62aee
2018-06-20 13:05:54 -07:00
Zhaozhong Ni 4e9f0e91d7 sentry: pending signals S/R optimization.
Almost all of the hundreds of pending signal queues are empty upon save.

PiperOrigin-RevId: 201380318
Change-Id: I40747072435299de681d646e0862efac0637e172
2018-06-20 11:02:41 -07:00
Brian Geffon db66e383c3 Epsocket has incorrect recv(2) behavior after SHUT_RD.
After shutdown(SHUT_RD) calls to recv /w MSG_DONTWAIT or with
O_NONBLOCK should result in a EAGAIN and not 0. Blocking sockets
should return 0 as they would have otherwise blocked indefinitely.

PiperOrigin-RevId: 201271123
Change-Id: If589b69c17fa5b9ff05bcf9e44024da9588c8876
2018-06-19 17:29:11 -07:00
Zhaozhong Ni 18d8992453 state: pretty-print primitive type arrays.
PiperOrigin-RevId: 201269072
Change-Id: Ia542c5a42b5b5d21c1104a003ddff5279644d309
2018-06-19 17:13:35 -07:00
Adin Scannell be76cad5bc Make KVM more scalable by removing CPU cap.
Instead, CPUs will be created dynamically. We also allow a relatively
efficient mechanism for stealing and notifying when a vCPU becomes
available via unlock.

Since the number of vCPUs is no longer fixed at machine creation time,
we make the dirtySet packing more efficient. This has the pleasant side
effect of cutting out the unsafe address space code.

PiperOrigin-RevId: 201266691
Change-Id: I275c73525a4f38e3714b9ac0fd88731c26adfe66
2018-06-19 17:00:30 -07:00
Zhaozhong Ni aa14a2c1be sentry: futex S/R optimization.
No need to save thousands of zerovalue buckets.

PiperOrigin-RevId: 201258598
Change-Id: I5d3ea7b6a5345117ab4f610332d5288ca550be33
2018-06-19 16:08:00 -07:00
Justine Olshan a6dbef045f Added a resume command to unpause a paused container.
Resume checks the status of the container and unpauses the kernel
if its status is paused. Otherwise nothing happens.
Tests were added to ensure that the process is in the correct state
after various commands.

PiperOrigin-RevId: 201251234
Change-Id: Ifd11b336c33b654fea6238738f864fcf2bf81e19
2018-06-19 15:23:36 -07:00
Brian Geffon bda2a1ed35 Rpcinet is racy around shutdown flags.
Correct a data race in rpcinet where a shutdown and recvmsg can
race around shutown flags.

PiperOrigin-RevId: 201238366
Change-Id: I5eb06df4a2b4eba331eeb5de19076213081d581f
2018-06-19 14:12:52 -07:00
Nicolas Lacasse 9db7cfad93 Add a new cache policy FSCACHE_WRITETHROUGH.
The new policy is identical to FSCACHE (which caches everything in memory), but
it also flushes writes to the backing fs agent immediately.

All gofer cache policy decisions have been moved into the cachePolicy type.
Previously they were sprinkled around the codebase.

There are many different things that we cache (page cache, negative dirents,
dirent LRU, unstable attrs, readdir results....), and I don't think we should
have individual flags to control each of these.  Instead, we should have a few
high-level cache policies that are consistent and useful to users.  This
refactoring makes it easy to add more such policies.

PiperOrigin-RevId: 201206937
Change-Id: I6e225c382b2e5e1b0ad4ccf8ca229873f4cd389d
2018-06-19 11:10:11 -07:00
Zhaozhong Ni 5581256f87 state: include I/O and protobuf time in kernel S/R timing stats.
PiperOrigin-RevId: 201205733
Change-Id: I300307b0668989ba7776ab9e3faee71efdd33f46
2018-06-19 11:04:54 -07:00
Brian Geffon 4fd1d40e1d Rpcinet needs to track shutdown state for blocking sockets.
Because rpcinet will emulate a blocking socket backed by an rpc based
non-blocking socket. In the event of a shutdown(SHUT_RD) followed by a
read a non-blocking socket is allowed to return an EWOULDBLOCK however
since a blocking socket knows it cannot receive anymore data it would
block indefinitely and in this situation linux returns 0. We have to
track this on the rpcinet sentry side so we can emulate that behavior
because the remote side has no way to know if the socket is actually
blocking within the sentry.

PiperOrigin-RevId: 201201618
Change-Id: I4ac3a7b74b5dae471ab97c2e7d33b83f425aedac
2018-06-19 10:43:30 -07:00
Brian Geffon 563a71ef24 Add rpcinet support for control messages.
Add support for control messages, but at this time the only
control message that the sentry will support here is SO_TIMESTAMP.

PiperOrigin-RevId: 200922230
Change-Id: I63a852d9305255625d9df1d989bd46a66e93c446
2018-06-17 17:06:40 -07:00
Michael Pratt bd2d1aaa16 Replace crypto/rand with internal rand package
PiperOrigin-RevId: 200784607
Change-Id: I39aa6ee632936dcbb00fc298adccffa606e9f4c0
2018-06-15 15:36:00 -07:00
Zhaozhong Ni fc8ca72a32 sentry: do not start delivering external signal immediately.
PiperOrigin-RevId: 200765756
Change-Id: Ie4266f32e4e977df3925eb29f3fbb756e0337606
2018-06-15 13:38:14 -07:00
Brian Geffon fa6db05e0c FIFOs should support O_TRUNC as a no-op.
PiperOrigin-RevId: 200759323
Change-Id: I683b2edcc2188304c4ca563e46af457e23625905
2018-06-15 12:55:29 -07:00
Adin Scannell b31ac4e1df Use notify explicitly on unlock path.
There are circumstances under which the redpill call will not generate
the appropriate action and notification. Replace this call with an
explicit notification, which is guaranteed to transition as well as
perform the futex wake.

PiperOrigin-RevId: 200726934
Change-Id: Ie19e008a6007692dd7335a31a8b59f0af6e54aaa
2018-06-15 09:30:08 -07:00
Fabricio Voznika 119a302ceb Implement /proc/thread-self
Closes #68

PiperOrigin-RevId: 200725401
Change-Id: I4827009b8aee89d22887c3af67291ccf7058d420
2018-06-15 09:18:00 -07:00
Jamie Liu 657db692b2 Ignore expiration count in kernelCPUClockListener.Notify.
PiperOrigin-RevId: 200590832
Change-Id: I35b817ecccc9414a742dee4815dfc67d0c7d0496
2018-06-14 11:35:11 -07:00
Ian Gudger f5d0c59f5c Fix reference leak in VDSO validation
PiperOrigin-RevId: 200496070
Change-Id: I33adb717c44e5b4bcadece882be3ab1ee3920556
2018-06-13 20:00:55 -07:00
Brian Geffon 1170039e78 Fix missing returns in rpcinet.
PiperOrigin-RevId: 200472634
Change-Id: I3f0fb9e3b2f8616e6aa1569188258f330bf1ed31
2018-06-13 16:21:23 -07:00
Adin Scannell 7b7b199ed0 Deflake kvm_test.
PiperOrigin-RevId: 200439846
Change-Id: I9970fe0716cb02f0f41b754891d55db7e0729f56
2018-06-13 13:05:33 -07:00
Fabricio Voznika 717f2501c9 Fix failure to mount volume that sandbox process has no access
Boot loader tries to stat mount to determine whether it's a file or not. This
may file if the sandbox process doesn't have access to the file. Instead, add
overlay on top of file, which is better anyway since we don't want to propagate
changes to the host.

PiperOrigin-RevId: 200411261
Change-Id: I14222410e8bc00ed037b779a1883d503843ffebb
2018-06-13 10:20:06 -07:00
Zhaozhong Ni 686093669e sentry: do not treat all save errors as state file errors.
PiperOrigin-RevId: 200410220
Change-Id: I6a8745e33be949e335719083501f18b24f6ba471
2018-06-13 10:14:15 -07:00
Jamie Liu 55b9058456 Log filemem state when panicing due to invalid refcount.
PiperOrigin-RevId: 200408305
Change-Id: I676ee49ec77697105723577928c7f82088cd378e
2018-06-13 10:03:54 -07:00
Ian Gudger ba426f7782 Fix reference leak for negative dirents
PiperOrigin-RevId: 200306715
Change-Id: I7c80059c77ebd3d9a5d7d48b05c8e7a597f10850
2018-06-12 17:04:20 -07:00
Brian Geffon c2b3f04d1c Rpcinet doensn't handle SO_RCVTIMEO properly.
Rpcinet already inherits socket.ReceiveTimeout; however, it's
never set on setsockopt(2). The value is currently forwarded
as an RPC and ignored as all sockets will be non-blocking
on the RPC side.

PiperOrigin-RevId: 200299260
Change-Id: I6c610ea22c808ff6420c63759dccfaeab17959dd
2018-06-12 16:16:15 -07:00
Brielle Broder 711a9869e5 Runsc checkpoint works.
This is the first iteration of checkpoint that actually saves to a file.
Tests for checkpoint are included.

Ran into an issue when private unix sockets are enabled. An error message
was added for this case and the mutex state was set.

PiperOrigin-RevId: 200269470
Change-Id: I28d29a9f92c44bf73dc4a4b12ae0509ee4070e93
2018-06-12 13:25:23 -07:00
Jamie Liu 7a10df454b Drop MMapOpts.MappingIdentity reference in loader.mapSegment.
PiperOrigin-RevId: 200261995
Change-Id: I7e460b18ceab2c23096bdeb7416159d6e774aaf7
2018-06-12 12:38:02 -07:00
Adin Scannell 41f766893a Minor ring0 interface cleanup.
- Remove unused methods.
- Provide declaration for asm function.

PiperOrigin-RevId: 200146850
Change-Id: Ic455c96ffe0d2e78ef15f824eb65d7de705b054a
2018-06-11 18:17:15 -07:00
Adin Scannell 1397a413b4 Make page tables split-safe.
In order to minimize the likelihood of exit during page table
modifications, make the full set of page table functions split-safe.
This is not strictly necessary (and you may still incur splits due to
allocations from the allocator pool) but should make retries a very rare
occurance.

PiperOrigin-RevId: 200146688
Change-Id: I8fa36aa16b807beda2f0b057be60038258e8d597
2018-06-11 18:15:14 -07:00
Adin Scannell 09b0a9c320 Handle all exception vectors.
PiperOrigin-RevId: 200144655
Change-Id: I5a753c74b75007b7714d6fe34aa0d2e845dc5c41
2018-06-11 17:57:19 -07:00
Fabricio Voznika ea4a468fba Set CLOEXEC option to sockets
hostinet/socket.go: the Sentry doesn't spawn new processes, but it doesn't hurt to protect the socket from leaking.
unet/unet.go: should be setting closing on exec. The FD is explicitly donated to children when needed.

PiperOrigin-RevId: 200135682
Change-Id: Ia8a45ced1e00a19420c8611b12e7a8ee770f89cb
2018-06-11 16:45:50 -07:00
Brian Geffon ab2c2575d6 Rpcinet is incorrectly handling MSG_TRUNC with SOCK_STREAM
SOCK_STREAM has special behavior with respect to MSG_TRUNC. Specifically,
the data isn't actually copied back out to userspace when MSG_TRUNC is
provided on a SOCK_STREAM.

According to tcp(7): "Since version 2.4, Linux supports the use of
MSG_TRUNC in the flags argument of recv(2) (and recvmsg(2)). This flag
causes the received bytes of data to be discarded, rather than passed
back in a caller-supplied buffer."

PiperOrigin-RevId: 200134860
Change-Id: I70f17a5f60ffe7794c3f0cfafd131c069202e90d
2018-06-11 16:40:38 -07:00
Brian Geffon 0412f17e06 rpcinet is treating EAGAIN and EWOULDBLOCK as different errnos.
PiperOrigin-RevId: 200124614
Change-Id: I38a7b083f1464a2a586fe24db648e624c455fec5
2018-06-11 15:34:08 -07:00
Fabricio Voznika 7260363751 Add O_TRUNC handling in openat
PiperOrigin-RevId: 200103677
Change-Id: I3efb565c30c64d35f8fd7b5c05ed78dcc2990c51
2018-06-11 13:35:21 -07:00
Kevin Krakauer 032b0398a5 Sentry: split tty.queue into its own file.
Minor refactor. line_discipline.go was home to 2 large structs (lineDiscipline
and queue), and queue is now large enough IMO to get its own file.

Also moves queue locks into the queue struct, making locking simpler.

PiperOrigin-RevId: 200080301
Change-Id: Ia75a0e9b3d9ac8d7e5a0f0099a54e1f5b8bdea34
2018-06-11 11:09:43 -07:00
Adin Scannell c0ab059e7b Fix kernel flags handling and add missing vectors.
PiperOrigin-RevId: 199877174
Change-Id: I9d19ea301608c2b989df0a6123abb1e779427853
2018-06-08 17:51:50 -07:00
Brian Geffon 2fbd1cf57c Add checks for short CopyOut in rpcinet
PiperOrigin-RevId: 199864753
Change-Id: Ibace6a1fdf99ee6ce368ac12c390aa8a02dbdfb7
2018-06-08 15:58:22 -07:00
Adin Scannell 6728f09910 Fix sigaltstack semantics.
Walking off the bottom of the sigaltstack, for example with recursive faults,
results in forced signal delivery, not resetting the stack or pushing signal
stack to whatever happens to lie below the signal stack.

PiperOrigin-RevId: 199856085
Change-Id: I0004d2523f0df35d18714de2685b3eaa147837e0
2018-06-08 15:01:21 -07:00
Bhasker Hariharan de8dba205f Add a protocol option to set congestion control algorithm.
Also adds support to query available congestion control algorithms.

PiperOrigin-RevId: 199826897
Change-Id: I2b338b709820ee9cf58bb56d83aa7b1a39f4eab2
2018-06-08 11:46:23 -07:00
Brian Geffon 2f3895d6f7 rpcinet is not correctly handling MSG_TRUNC on recvmsg(2).
MSG_TRUNC can cause recvmsg(2) to return a value larger than
the buffer size. In this situation it's an indication that the
buffer was completely filled and that the msg was truncated.
Previously in rpcinet we were returning the buffer size but we
should actually be returning the payload length as returned by
the syscall.

PiperOrigin-RevId: 199814221
Change-Id: If09aa364219c1bf193603896fcc0dc5c55e85d21
2018-06-08 10:33:25 -07:00
Brian Geffon 5c37097e34 rpcinet should not block in read(2) rpcs.
PiperOrigin-RevId: 199703609
Change-Id: I8153b0396b22a230a68d4b69c46652a5545f7630
2018-06-07 15:10:15 -07:00
Brian Geffon 7e9893eeb5 Add missing rpcinet ioctls.
PiperOrigin-RevId: 199669120
Change-Id: I0be88cdbba29760f967e9a5bb4144ca62c1ed7aa
2018-06-07 11:37:16 -07:00
Kevin Krakauer 9170303105 Sentry: very basic terminal echo support.
Adds support for echo to terminals. Echoing is just copying input back out to
the user, e.g. when I type "foo" into a terminal, I expect "foo" to be echoed
back to my terminal.

Also makes the transform function part of the queue, eliminating the need to
pass them around together and the possibility of using the wrong transform for a
queue.

PiperOrigin-RevId: 199655147
Change-Id: I37c490d4fc1ee91da20ae58ba1f884a5c14fd0d8
2018-06-07 10:21:22 -07:00
Adin Scannell d269845159 Ensure guest-mode for page table modifications.
Because of the KVM shadow page table implementation, modifications made
to guest page tables from host mode may not be syncronized correctly,
resulting in undefined behavior. This is a KVM bug: page table pages
should also be tracked for host modifications and resynced appropriately
(e.g. the guest could "DMA" into a page table page in theory).

However, since we can't rely on this being fixed everywhere, workaround
the issue by forcing page table modifications to be in guest mode. This
will generally be the case anyways, but now if an exit occurs during
modifications, we will re-enter and perform the modifications again.

PiperOrigin-RevId: 199587895
Change-Id: I83c20b4cf2a9f9fa56f59f34939601dd34538fb0
2018-06-06 23:26:14 -07:00
Adin Scannell 3374849cb5 Split PCID implementation from page tables.
Instead of associating a single PCID with each set of page tables (which
will reach the maximum quickly), allow a dynamic pool for each vCPU.
This is the same way that Linux operates. We also split management of
PCIDs out of the page tables themselves for simplicity.

PiperOrigin-RevId: 199585631
Change-Id: I42f3486ada3cb2a26f623c65ac279b473ae63201
2018-06-06 22:52:55 -07:00