Commit Graph

670 Commits

Author SHA1 Message Date
Fabricio Voznika cff2c57192 Fix bad merge
PiperOrigin-RevId: 235818534
Change-Id: I99f7e3fd1dc808b35f7a08b96b7c3226603ab808
2019-02-26 16:42:06 -08:00
Ruidong Cao a2b794b30d FPE_INTOVF (integer overflow) should be 2 refer to Linux.
Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I03f8ab25cf29257b31f145cf43304525a93f3300
PiperOrigin-RevId: 235763203
2019-02-26 11:48:49 -08:00
Fabricio Voznika 23fe059761 Lazily allocate inotify map on inode
PiperOrigin-RevId: 235735865
Change-Id: I84223eb18eb51da1fa9768feaae80387ff6bfed0
2019-02-26 09:33:44 -08:00
Fabricio Voznika 10426e0f31 Handle invalid offset in sendfile(2)
PiperOrigin-RevId: 235578698
Change-Id: I608ff5e25eac97f6e1bda058511c1f82b0e3b736
2019-02-25 12:17:46 -08:00
Googler 532f4b2fba Internal change.
PiperOrigin-RevId: 235053594
Change-Id: Ie3d7b11843d0710184a2463886c7034e8f5305d1
2019-02-21 13:08:34 -08:00
Haibo Xu 15d3189884 Make some ptrace commands x86-only
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I9751f859332d433ca772d6b9733f5a5a64398ec7
PiperOrigin-RevId: 234877624
2019-02-20 15:10:59 -08:00
Amanda Tait ea070b9d5f Implement Broadcast support
This change adds support for the SO_BROADCAST socket option in gVisor Netstack.
This support includes getsockopt()/setsockopt() functionality for both UDP and
TCP endpoints (the latter being a NOOP), dispatching broadcast messages up and
down the stack, and route finding/creation for broadcast packets. Finally, a
suite of tests have been implemented, exercising this functionality through the
Linux syscall API.

PiperOrigin-RevId: 234850781
Change-Id: If3e666666917d39f55083741c78314a06defb26c
2019-02-20 12:54:13 -08:00
Kevin Krakauer ec2460b189 netstack: Add SIOCGSTAMP support.
Ping sometimes uses this instead of SO_TIMESTAMP.

PiperOrigin-RevId: 234699590
Change-Id: Ibec9c34fa0d443a931557a2b1b1ecd83effe7765
2019-02-19 16:41:32 -08:00
Jamie Liu bed6f8534b Set rax to syscall number on SECCOMP_RET_TRAP.
PiperOrigin-RevId: 234690475
Change-Id: I1cbfb5aecd4697a4a26ec8524354aa8656cc3ba1
2019-02-19 15:49:37 -08:00
Jamie Liu bb47d8a545 Fix clone(CLONE_NEWUSER).
- Use new user namespace for namespace creation checks.

- Ensure userns is never nil since it's used by other namespaces.

PiperOrigin-RevId: 234673175
Change-Id: I4b9d9d1e63ce4e24362089793961a996f7540cd9
2019-02-19 14:20:05 -08:00
Jamie Liu 22d8b6eba1 Break /proc/[pid]/{uid,gid}_map's dependence on seqfile.
In addition to simplifying the implementation, this fixes two bugs:

- seqfile.NewSeqFile unconditionally creates an inode with mode 0444,
  but {uid,gid}_map have mode 0644.

- idMapSeqFile.Write implements fs.FileOperations.Write ... but it
  doesn't implement any other fs.FileOperations methods and is never
  used as fs.FileOperations. idMapSeqFile.GetFile() =>
  seqfile.SeqFile.GetFile() uses seqfile.seqFileOperations instead,
  which rejects all writes.

PiperOrigin-RevId: 234638212
Change-Id: I4568f741ab07929273a009d7e468c8205a8541bc
2019-02-19 11:21:46 -08:00
Ian Gudger c611dbc5a7 Implement IP_MULTICAST_IF.
This allows setting a default send interface for IPv4 multicast. IPv6 support
will come later.

PiperOrigin-RevId: 234251379
Change-Id: I65922341cd8b8880f690fae3eeb7ddfa47c8c173
2019-02-15 18:40:15 -08:00
Kevin Krakauer a9cb3dcd9d Move SO_TIMESTAMP from different transport endpoints to epsocket.
SO_TIMESTAMP is reimplemented in ping and UDP sockets (and needs to be added for
TCP), but can just be implemented in epsocket for simplicity. This will also
make SIOCGSTAMP easier to implement.

PiperOrigin-RevId: 234179300
Change-Id: Ib5ea0b1261dc218c1a8b15a65775de0050fe3230
2019-02-15 11:18:44 -08:00
Fabricio Voznika e34d27e8b6 Redirect FIXME to more appropriate bug
PiperOrigin-RevId: 234147487
Change-Id: I779a6012832bb94a6b89f5bcc7d821b40ae969cc
2019-02-15 08:23:27 -08:00
Nicolas Lacasse 0a41ea72c1 Don't allow writing or reading to TTY unless process group is in foreground.
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.

See drivers/tty/n_tty.c:n_tty_read()=>job_control().

If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.

See drivers/tty/tty_io.c:tty_check_change().

PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-14 15:47:31 -08:00
Jamie Liu 0e84ae72e0 Improve safecopy sanity checks.
- Fix CopyIn/CopyOut/ZeroOut range checks.

- Include the faulting signal number in the panic message.

PiperOrigin-RevId: 233829501
Change-Id: I8959ead12d05dbd4cd63c2b908cddeb2a27eb513
2019-02-13 14:25:15 -08:00
Googler 7aaa6cf225 Internal change.
PiperOrigin-RevId: 233802562
Change-Id: I40e1b13fd571daaf241b00f8df4bcedd034dc3f1
2019-02-13 12:07:34 -08:00
Nicolas Lacasse f17692d807 Add fs.AsyncWithContext and call it in fs/gofer/inodeOperations.Release.
fs/gofer/inodeOperations.Release does some asynchronous work.  Previously it
was calling fs.Async with an anonymous function, which caused the function to
be allocated on the heap.  Because Release is relatively hot, this results in a
lot of small allocations and increased GC pressure, noticeable in perf profiles.

This CL adds a new function, AsyncWithContext, which is just like Async, but
passes a context to the async function.  It avoids the need for an extra
anonymous function in fs/gofer/inodeOperations.Release.  The Async function
itself still requires a single anonymous function.

PiperOrigin-RevId: 233141763
Change-Id: I1dce4a883a7be9a8a5b884db01e654655f16d19c
2019-02-08 15:54:15 -08:00
Nicolas Lacasse e884168e1e Encode stat to bytes manually, instead of calling CopyObjectOut.
CopyObjectOut grows its destination byte slice incrementally, causing
many small slice allocations on the heap. This leads to increased GC and
noticeably slower stat calls.

PiperOrigin-RevId: 233140904
Change-Id: Ieb90295dd8dd45b3e56506fef9d7f86c92e97d97
2019-02-08 15:48:23 -08:00
Nicolas Lacasse 9c9386d2a8 CopyObjectOut should allocate a byte slice the size of the encoded object.
This adds an extra Reflection call to CopyObjectOut, but avoids many small
slice allocations if the object is large, since without this we grow the
backing slice incrementally as we encode more data.

PiperOrigin-RevId: 233110960
Change-Id: I93569af55912391e5471277f779139c23f040147
2019-02-08 13:00:00 -08:00
Ian Gudger 80f901b16b Plumb IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP to netstack.
Also includes a few fixes for IPv4 multicast support. IPv6 support is coming in
a followup CL.

PiperOrigin-RevId: 233008638
Change-Id: If7dae6222fef43fda48033f0292af77832d95e82
2019-02-07 23:15:23 -08:00
Rahat Mahmood 2ba74f84be Implement /proc/net/unix.
PiperOrigin-RevId: 232948478
Change-Id: Ib830121e5e79afaf5d38d17aeef5a1ef97913d23
2019-02-07 14:44:21 -08:00
Nicolas Lacasse fcae058a14 Make context.Background return a global background context.
It currently allocates a new context on the heap each time it is called. Some
of these are in relatively hot paths like signal delivery and releasing gofer
inodes.  It is also called very commonly in afterLoad.  All of these should
benefit from fewer heap allocations.

PiperOrigin-RevId: 232938873
Change-Id: I53cec0ca299f56dcd4866b0b4fd2ec4938526849
2019-02-07 13:55:23 -08:00
Fabricio Voznika 9ef3427ac1 Implement semctl(2) SETALL and GETALL
PiperOrigin-RevId: 232914984
Change-Id: Id2643d7ad8e986ca9be76d860788a71db2674cda
2019-02-07 11:41:44 -08:00
Zach Koopmans 0cf7fc4e11 Change /proc/PID/cmdline to read environment vector.
- Change proc to return envp on overwrite of argv with limitations from
upstream.
- Add unit tests
- Change layout of argv/envp on the stack so that end of argv is contiguous with
beginning of envp.

PiperOrigin-RevId: 232506107
Change-Id: I993880499ab2c1220f6dc456a922235c49304dec
2019-02-05 10:02:06 -08:00
Fabricio Voznika 2d20b121d7 CachingInodeOperations was over-dirtying cached attributes
Dirty should be set only when the attribute is changed in the cache
only. Instances where the change was also sent to the backing file
doesn't need to dirty the attribute.

Also remove size update during WriteOut as writing dirty page would
naturaly grow the file if needed.

RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 232068978
Change-Id: I00ba54693a2c7adc06efa9e030faf8f2e8e7f188
2019-02-01 17:51:48 -08:00
Nicolas Lacasse 92e85623a0 Factor the subtargets method into a helper method with tests.
PiperOrigin-RevId: 232047515
Change-Id: I00f036816e320356219be7b2f2e6d5fe57583a60
2019-02-01 15:23:43 -08:00
Michael Pratt fe1369ac98 Move package sync to third_party
PiperOrigin-RevId: 231889261
Change-Id: I482f1df055bcedf4edb9fe3fe9b8e9c80085f1a0
2019-01-31 17:49:14 -08:00
Michael Pratt 88b4ce8cac Fix comment
PiperOrigin-RevId: 231861005
Change-Id: I134d4e20cc898d44844219db0a8aacda87e11ef0
2019-01-31 15:03:12 -08:00
Fabricio Voznika a497f5ed5f Invalidate COW mappings when file is truncated
This changed required making fsutil.HostMappable use
a backing file to ensure the correct FD would be used
for read/write operations.

RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 231836164
Change-Id: I8ae9639715529874ea7d80a65e2c711a5b4ce254
2019-01-31 12:54:00 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Haibo Xu cedff8d3ae Add muldiv/rd_tsc support for arm64 platform.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: If35459be78e023346a140184401172f8e023c7f9
PiperOrigin-RevId: 231638020
2019-01-30 11:49:08 -08:00
Zhaozhong Ni ae6e37df2a Convert TODO into FIXME.
PiperOrigin-RevId: 231301228
Change-Id: I3e18f3a12a35fb89a22a8c981188268d5887dc61
2019-01-28 15:34:18 -08:00
Nicolas Lacasse 09cf3b40a8 Fix data race in InodeSimpleAttributes.Unstable.
We were modifying InodeSimpleAttributes.Unstable.AccessTime without holding
the necessary lock.  Luckily for us, InodeSimpleAttributes already has a
NotifyAccess method that will do the update while holding the lock.

In addition, we were holding dfo.dir.mu.Lock while setting AccessTime, which
is unnecessary, so that lock has been removed.

PiperOrigin-RevId: 231278447
Change-Id: I81ed6d3dbc0b18e3f90c1df5e5a9c06132761769
2019-01-28 13:26:28 -08:00
Jamie Liu 1cedccf8e9 Drop the one-page limit for /proc/[pid]/{cmdline,environ}.
It never actually should have applied to environ (the relevant change in
Linux 4.2 is c2c0bb44620d "proc: fix PAGE_SIZE limit of
/proc/$PID/cmdline"), and we claim to be Linux 4.4 now anyway.

PiperOrigin-RevId: 231250661
Change-Id: I37f9c4280a533d1bcb3eebb7803373ac3c7b9f15
2019-01-28 11:00:23 -08:00
Fabricio Voznika 55e8eb775b Make cacheRemoteRevalidating detect changes to file size
When file size changes outside the sandbox, page cache was not
refreshing file size which is required for cacheRemoteRevalidating.
In fact, cacheRemoteRevalidating should be skipping the cache
completely since it's not really benefiting from it. The cache is
cache is already bypassed for unstable attributes (see
cachePolicy.cacheUAttrs). And althought the cache is called to
map pages, they will always miss the cache and map directly from
the host.

Created a HostMappable struct that maps directly to the host and
use it for files with cacheRemoteRevalidating.

Closes #124

PiperOrigin-RevId: 230998440
Change-Id: Ic5f632eabe33b47241e05e98c95e9b2090ae08fc
2019-01-25 17:23:07 -08:00
Adin Scannell b5088ba59c cleanup: extract the kernel from context
Change-Id: I94704a90beebb53164325e0cce1fcb9a0b97d65c
PiperOrigin-RevId: 230817308
2019-01-24 17:02:52 -08:00
Rahat Mahmood 8d7c10e908 Display /proc/net entries for all network configurations.
Most of the entries are stubbed out at the moment, but even those were
only displayed if IPv6 support was enabled. The entries should be
displayed with IPv4-support only, and with only loopback devices.

PiperOrigin-RevId: 229946441
Change-Id: I18afaa3af386322787f91bf9d168ab66c01d5a4c
2019-01-18 10:02:12 -08:00
Nicolas Lacasse 12bc7834dc Allow fsync on a directory.
PiperOrigin-RevId: 229781337
Change-Id: I1f946cff2771714fb1abd83a83ed454e9febda0a
2019-01-17 11:06:59 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Zach Koopmans 7f8de3bf92 Fixing select call to not enforce RLIMIT_NOFILE.
Removing check to RLIMIT_NOFILE in select call.
Adding unit test to select suite to document behavior.
Moving setrlimit class from mlock to a util file for reuse.
Fixing flaky test based on comments from Jamie.

PiperOrigin-RevId: 228726131
Change-Id: Ie9dbe970bbf835ba2cca6e17eec7c2ee6fadf459
2019-01-10 09:44:45 -08:00
Jamie Liu 9270d940eb Minor memevent fixes.
- Call MemoryEvents.done.Add(1) outside of MemoryEvents.run() so that if
  MemoryEvents.Stop() => MemoryEvents.done.Wait() is called before the
  goroutine starts running, it still waits for the goroutine to stop.

- Use defer to call MemoryEvents.done.Done() in MemoryEvents.run() so that it's
  called even if the goroutine panics.

PiperOrigin-RevId: 228623307
Change-Id: I1b0459e7999606c1a1a271b16092b1ca87005015
2019-01-09 17:54:40 -08:00
Nicolas Lacasse d321f575e2 Fix lock order violation.
overlayFileOperations.Readdir was holding overlay.copyMu while calling
DirentReaddir, which then attempts to take take the corresponding Dirent.mu,
causing a lock order violation. (See lock order documentation in
fs/copy_up.go.)

We only actually need to hold copyMu during readdirEntries(), so holding the
lock is moved in there, thus avoiding the lock order violation.

A new lock was added to protect overlayFileOperations.dirCache. We were
inadvertently relying on copyMu to protect this.  There is no reason it should
not have its own lock.

PiperOrigin-RevId: 228542473
Change-Id: I03c3a368c8cbc0b5a79d50cc486fc94adaddc1c2
2019-01-09 10:29:36 -08:00
Brian Geffon dd761c170c Allow MSG_OOB and MSG_DONTROUTE to be no-ops on recvmsg(2).
PiperOrigin-RevId: 228428223
Change-Id: I433ba5ffc15ea4c2706ec944901b8269b1f364f8
2019-01-08 17:13:17 -08:00
Brian Geffon 3676b7ff1c Improve loader related error messages returned to users.
PiperOrigin-RevId: 228382827
Change-Id: Ica1d30e0df826bdd77f180a5092b2b735ea5c804
2019-01-08 12:58:08 -08:00
Jamie Liu f95b94fbe3 Grant no initial capabilities to non-root UIDs.
See modified comment in auth.NewUserCredentials(); compare to the
behavior of setresuid(2) as implemented by
//pkg/sentry/kernel/task_identity.go:kernel.Task.setKUIDsUncheckedLocked().

PiperOrigin-RevId: 228381765
Change-Id: I45238777c8f63fcf41b99fce3969caaf682fe408
2019-01-08 12:52:24 -08:00
Jamie Liu dc4849e49c Add usermem support for arm64 platform.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
PiperOrigin-RevId: 228249611
Change-Id: I1046e70bec4274f18b9948eefd6b0d546e4c48bb
2019-01-07 15:40:26 -08:00
Jamie Liu 901ed5da44 Implement /proc/[pid]/smaps.
PiperOrigin-RevId: 228245523
Change-Id: I5a4d0a6570b93958e51437e917e5331d83e23a7e
2019-01-07 15:17:44 -08:00
Fabricio Voznika 8e586db162 Add /proc/net/psched content
FIO reads this file and expects it to be well formed.

PiperOrigin-RevId: 227554483
Change-Id: Ia48ae2377626dd6a2daf17b5b4f5119f90ece55b
2019-01-02 11:39:57 -08:00
Andrei Vagin 652d068119 Implement SO_REUSEPORT for TCP and UDP sockets
This option allows multiple sockets to be bound to the same port.

Incoming packets are distributed to sockets using a hash based on source and
destination addresses. This means that all packets from one sender will be
received by the same server socket.

PiperOrigin-RevId: 227153413
Change-Id: I59b6edda9c2209d5b8968671e9129adb675920cf
2018-12-28 11:27:14 -08:00
Fabricio Voznika 46e6577014 Fix deadlock between epoll_wait and getdents
epoll_wait acquires EventPoll.listsMu (in EventPoll.ReadEvents) and
then calls Inotify.Readiness which tries to acquire Inotify.evMu.

getdents acquires Inotify.evMu (in Inotify.queueEvent) and then calls
readyCallback.Callback which tries to acquire EventPoll.listsMu.

The fix is to release Inotify.evMu before calling Queue.Notify. Queue
is thread-safe and doesn't require Inotify.evMu to be held.

Closes #121

PiperOrigin-RevId: 227066695
Change-Id: Id29364bb940d1727f33a5dff9a3c52f390c15761
2018-12-27 14:59:50 -08:00
Ian Gudger bce2f9751f Plumb IP_MULTICAST_TTL to netstack.
PiperOrigin-RevId: 226993086
Change-Id: I71757f231436538081d494da32ca69f709bc71c7
2018-12-26 23:52:12 -08:00
Brian Geffon bfa2f314ca Add EventChannel messages for uncaught signals.
PiperOrigin-RevId: 226936778
Change-Id: I2a6dda157c55d39d81e1b543ab11a58a0bfe5c05
2018-12-26 11:26:28 -08:00
Ian Gudger 0df0df35fc Stub out SO_OOBINLINE.
We don't explicitly support out-of-band data and treat it like normal in-band
data. This is equilivent to SO_OOBINLINE being enabled, so always report that
it is enabled.

PiperOrigin-RevId: 226572742
Change-Id: I4c30ccb83265e76c30dea631cbf86822e6ee1c1b
2018-12-21 19:46:55 -08:00
Ian Gudger b515556519 Implement SO_KEEPALIVE, TCP_KEEPIDLE, and TCP_KEEPINTVL.
Within gVisor, plumb new socket options to netstack.

Within netstack, fix GetSockOpt and SetSockOpt return value logic.

PiperOrigin-RevId: 226532229
Change-Id: If40734e119eed633335f40b4c26facbebc791c74
2018-12-21 13:13:45 -08:00
Fabricio Voznika 1679ef31ef inotify notifies watchers when control events bit are set
The code that matches the event being published with events watchers
was wronly matching all watchers in case any of the control event bits
were set.

Issue #121

PiperOrigin-RevId: 226521230
Change-Id: Ie2c42bc4366faaf59fbf80a74e9297499bd93f9e
2018-12-21 11:54:02 -08:00
Jamie Liu 9a442fa4b5 Automated rollback of changelist 226224230
PiperOrigin-RevId: 226493053
Change-Id: Ia98d1cb6dd0682049e4d907ef69619831de5c34a
2018-12-21 08:23:34 -08:00
Nicolas Lacasse 8ba450363f Deflake gofer_test.
We must wait for all lazy resources to be released before closing the rootFile.

PiperOrigin-RevId: 226419499
Change-Id: I1d4d961a92b3816e02690cf3eaf0a88944d730cc
2018-12-20 17:23:26 -08:00
Ian Gudger f6274804e1 Make read and write respect SO_RCVTIMEO and SO_SNDTIMEO
PiperOrigin-RevId: 226387521
Change-Id: I0579ab262320fde6c72d2994dd38437f01a99ea5
2018-12-20 13:48:52 -08:00
Jamie Liu 194ef586fc Rename limits.MemoryPagesLocked to limits.MemoryLocked.
"RLIMIT_MEMLOCK: This is the maximum number of bytes of memory that may
be locked into RAM." - getrlimit(2)

PiperOrigin-RevId: 226384346
Change-Id: Iefac4a1bb69f7714dc813b5b871226a8344dc800
2018-12-20 13:28:46 -08:00
Googler 86c9bd2547 Automated rollback of changelist 225861605
PiperOrigin-RevId: 226224230
Change-Id: Id24c7d3733722fd41d5fe74ef64e0ce8c68f0b12
2018-12-19 13:30:08 -08:00
Zach Koopmans ff7178a4d1 Implement pwritev2.
Implement pwritev2 and associated unit tests.
Clean up preadv2 unit tests.
Tag RWF_ flags in both preadv2 and pwritev2 with associated bug tickets.

PiperOrigin-RevId: 226222119
Change-Id: Ieb22672418812894ba114bbc88e67f1dd50de620
2018-12-19 13:16:06 -08:00
Jamie Liu 898838e34d Fix mremap expansion with mm.checkInvariants = true.
Also remove useless RSS changes in mm.movePMAsLocked().

PiperOrigin-RevId: 226052996
Change-Id: If59fd259b93238fb2f15c1c8ebfeda14cb590a87
2018-12-18 13:50:33 -08:00
Jamie Liu 3b3f026278 Truncate ar before calling mm.breakCopyOnWriteLocked().
... as required by the latter's precondition.

PiperOrigin-RevId: 226033824
Change-Id: I6bc46d0e100c61cc58cb5fc69e70c4ca905cd92d
2018-12-18 11:52:31 -08:00
Fabricio Voznika 03226cd950 Add BPFAction type with Stringer
PiperOrigin-RevId: 226018694
Change-Id: I98965e26fe565f37e98e5df5f997363ab273c91b
2018-12-18 10:28:28 -08:00
Ian Gudger 12c7430a01 Fix recv blocking for connectionless Unix sockets.
Connectionless Unix sockets (DGRAM Unix sockets created with the socket system
call) inherently only have a read queue. They do not establish bidirectional
connections, instead, the connect system call only sets a default send
location. Writes give the data to the other endpoint which has its own read
queue.

To simplify the code, connectionless Unix sockets still get read and write
queues, but the write queue is a dummy and never waited on. The read queue is
the connectionless endpoint's queue. This change fixes a bug where the dummy
queue was incorrectly set as the read queue and the endpoint's queue was
incorrectly set as the write queue. This meant that read notifications went
to the dummy queue and were black holed.

PiperOrigin-RevId: 225921042
Change-Id: I8d9059def787a2c3c305185b92d05093fbd2be2a
2018-12-17 17:53:22 -08:00
Nicolas Lacasse d3ae74d2a5 overlayBoundEndpoint must be recursive if there is an overlay in the lower.
The old overlayBoundEndpoint assumed that the lower is not an overlay.  It
should check if the lower is an overlay and handle that case.

PiperOrigin-RevId: 225882303
Change-Id: I60660c587d91db2826e0719da0983ec8ad024cb8
2018-12-17 13:46:57 -08:00
Jamie Liu 2421006426 Implement mlock(), kind of.
Currently mlock() and friends do nothing whatsoever. However, mlocking
is directly application-visible in a number of ways; for example,
madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked
regions. We handle this inconsistently: MADV_DONTNEED is too important
to not work, but MS_INVALIDATE is rejected.

Change MM to track mlocked regions in a manner consistent with Linux.
It still will not actually pin pages into host physical memory, but:

- mlock() will now cause sentry memory management to precommit mlocked
pages.

- MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as
described above.

PiperOrigin-RevId: 225861605
Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee
2018-12-17 11:38:59 -08:00
Adin Scannell 5d8cf31346 Move fdnotifier package to reduce internal confusion.
PiperOrigin-RevId: 225632398
Change-Id: I909e7e2925aa369adc28e844c284d9a6108e85ce
2018-12-14 18:05:01 -08:00
Andrei Vagin 3cf84e3bef Mark sync.Mutex in TTYFileOperations as nosave
PiperOrigin-RevId: 225621767
Change-Id: Ie3a42cdf0b0de22a020ff43e307bf86409cff329
2018-12-14 16:24:21 -08:00
Ian Gudger e1dcf92ec5 Implement SO_SNDTIMEO
PiperOrigin-RevId: 225620490
Change-Id: Ia726107b3f58093a5f881634f90b071b32d2c269
2018-12-14 16:15:06 -08:00
Ian Gudger 4659f7ed1a Fix WAITALL and RCVTIMEO interaction
PiperOrigin-RevId: 225424296
Change-Id: I60fcc2b859339dca9963cb32227a287e719ab765
2018-12-13 13:20:46 -08:00
Rahat Mahmood ccce1d4281 Filesystems shouldn't be saving references to Platform.
Platform objects are not savable, storing references to them in
filesystem datastructures would cause save to fail if someone actually
passed in a Platform.

Current implementations work because everywhere a Platform is
expected, we currently pass in a Kernel object which embeds Platform
and thus satisfies the interface.

Eliminate this indirection and save pointers to Kernel directly.

PiperOrigin-RevId: 225288336
Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
2018-12-12 17:47:55 -08:00
Rahat Mahmood f93c288dd7 Fix a data race on Shm.key.
PiperOrigin-RevId: 225240907
Change-Id: Ie568ce3cd643f3e4a0eaa0444f4ed589dcf6031f
2018-12-12 13:18:48 -08:00
Rahat Mahmood 75e39eaa74 Pass information about map writableness to filesystems.
This is necessary to implement file seals for memfds.

PiperOrigin-RevId: 225239394
Change-Id: Ib3f1ab31385afc4b24e96cd81a05ef1bebbcbb70
2018-12-12 13:09:59 -08:00
Michael Pratt 2b6df6a204 Format unshare flags
unshare actually takes a subset of clone flags, but has no unique flags,
so formatting as clone flags is close enough.

PiperOrigin-RevId: 225082774
Change-Id: I5b580f18607c7785f323e37809094115520a17c0
2018-12-11 15:33:14 -08:00
Christopher Koch 5934fad1d7 Remove unused envv variable from two funcs.
PiperOrigin-RevId: 225041520
Change-Id: Ib1afc693e592d308d60db82022c5b7743fd3c646
2018-12-11 11:40:16 -08:00
Haibo Xu 52fe3b87a4 Add safecopy support for arm64 platform.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I565214581eeb44045169da7f44d45a489082ac3a
PiperOrigin-RevId: 224938170
2018-12-10 21:35:02 -08:00
Ian Gudger 5d87d8865f Implement MSG_WAITALL
MSG_WAITALL requests that recv family calls do not perform short reads. It only
has an effect for SOCK_STREAM sockets, other types ignore it.

PiperOrigin-RevId: 224918540
Change-Id: Id97fbf972f1f7cbd4e08eec0138f8cbdf1c94fe7
2018-12-10 17:56:34 -08:00
Rahat Mahmood fc29770251 Add type safety to shm ids and keys.
PiperOrigin-RevId: 224864380
Change-Id: I49542279ad56bf15ba462d3de1ef2b157b31830a
2018-12-10 12:48:02 -08:00
Michael Pratt 99d5958693 Validate FS_BASE in Task.Clone
arch_prctl already verified that the new FS_BASE was canonical, but
Task.Clone did not. Centralize these checks in the arch packages.

Failure to validate could cause an error in PTRACE_SET_REGS when we try
to switch to the app.

PiperOrigin-RevId: 224862398
Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
2018-12-10 12:37:16 -08:00
Ian Gudger 25b8424d75 Stub out TCP_QUICKACK
PiperOrigin-RevId: 224696233
Change-Id: I45c425d9e32adee5dcce29ca7439a06567b26014
2018-12-09 00:50:33 -08:00
Zhaozhong Ni 9984138abe sentry: turn "dynamically-created" procfs files into static creation.
PiperOrigin-RevId: 224600982
Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34
2018-12-07 17:03:54 -08:00
Michael Pratt 42e2e5cae9 Format sigaction in strace
Sample:

I1206 14:24:56.768520    3700 x:0] [   1] ioctl_test E rt_sigaction(SIGSEGV, 0x7ee6edb0c590 {Handler: 0x559c6d915cf0, Flags: SA_SIGINFO|SA_RESTORER|SA_ONSTACK|SA_NODEFER, Restorer: 0x2a9901a259a0, Mask: []}, 0x7ee6edb0c630)
I1206 14:24:56.768530    3700 x:0] [   1] ioctl_test X rt_sigaction(SIGSEGV, 0x7ee6edb0c590 {Handler: 0x559c6d915cf0, Flags: SA_SIGINFO|SA_RESTORER|SA_ONSTACK|SA_NODEFER, Restorer: 0x2a9901a259a0, Mask: []}, 0x7ee6edb0c630 {Handler: SIG_DFL, Flags: 0x0, Restorer: 0x0, Mask: []}) = 0x0 (2.701?s)

PiperOrigin-RevId: 224596606
Change-Id: I3512493aed99d3d75600249263da46686b1dc0e7
2018-12-07 16:28:54 -08:00
Michael Pratt 673949048e Add period to comment
PiperOrigin-RevId: 224553291
Change-Id: I35d0772c215b71f4319c23f22df5c61c908f8590
2018-12-07 11:53:19 -08:00
Michael Pratt 51900fe3a4 Format signals, signal masks in strace
Sample:

I1205 16:51:49.869701    2492 x:0] [   1] ioctl_test E rt_sigaction(SIGIO, 0x7e0e5b5e8500, 0x7e0e5b5e85a0)
I1205 16:51:49.869766    2492 x:0] [   1] ioctl_test X rt_sigaction(SIGIO, 0x7e0e5b5e8500, 0x7e0e5b5e85a0) = 0x0 (44.336?s)
I1205 16:51:49.869831    2492 x:0] [   1] ioctl_test E rt_sigprocmask(SIG_UNBLOCK, 0x7e0e5b5e8878 [SIGIO], 0x7e0e5b5e87c0, 0x8)
I1205 16:51:49.869866    2492 x:0] [   1] ioctl_test X rt_sigprocmask(SIG_UNBLOCK, 0x7e0e5b5e8878 [SIGIO], 0x7e0e5b5e87c0 [SIGIO 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64], 0x8) = 0x0 (2.575?s)

PiperOrigin-RevId: 224422404
Change-Id: I3ed3f2ec6b1a639baa9cacd37ce7ee325c3703e4
2018-12-06 15:47:06 -08:00
Michael Pratt 666db00c26 Convert ValueSet to a map
Unlike FlagSet, order doesn't matter here, so it can simply be a map.

PiperOrigin-RevId: 224377910
Change-Id: I15810c698a7f02d8614bf09b59583ab73cba0514
2018-12-06 11:43:11 -08:00
Ian Gudger 000fa84a3b Fix tcpip.Endpoint.Write contract regarding short writes
* Clarify tcpip.Endpoint.Write contract regarding short writes.
* Enforce tcpip.Endpoint.Write contract regarding short writes.
* Update relevant users of tcpip.Endpoint.Write.

PiperOrigin-RevId: 224377586
Change-Id: I24299ecce902eb11317ee13dae3b8d8a7c5b097d
2018-12-06 11:41:33 -08:00
Rahat Mahmood 685eaf119f Add counters for memory events.
Also ensure an event is emitted at startup.

PiperOrigin-RevId: 224372065
Change-Id: I5f642b6d6b13c6468ee8f794effe285fcbbf29cf
2018-12-06 11:15:47 -08:00
Zach Koopmans 4d8c7ae869 Fixing O_TRUNC behavior to match Linux.
PiperOrigin-RevId: 224351139
Change-Id: I9453bd75e5a8d38db406bb47fdc01038ac60922e
2018-12-06 09:26:49 -08:00
Michael Pratt 9f64e64a6e Enforce directory accessibility before delete Walk
By Walking before checking that the directory is writable and
executable, MayDelete may return the Walk error (e.g., ENOENT) which
would normally be masked by a permission error (EACCES).

PiperOrigin-RevId: 224222453
Change-Id: I108a7f730e6bdaa7f277eaddb776267c00805475
2018-12-05 14:31:58 -08:00
Jamie Liu 23438b3632 Update MM.usageAS when mremap copies or moves a mapping.
PiperOrigin-RevId: 224221509
Change-Id: I7aaea74629227d682786d3e435737364921249bf
2018-12-05 14:27:23 -08:00
Michael Pratt 592f5bdc67 Add context to mount errors
This makes it more obvious why a mount failed.

PiperOrigin-RevId: 224203880
Change-Id: I7961774a7b6fdbb5493a791f8b3815c49b8f7631
2018-12-05 12:46:30 -08:00
Zach Koopmans 06131fe749 Check for CAP_SYS_RESOURCE in prctl(PR_SET_MM, ...)
If sys_prctl is called with PR_SET_MM without CAP_SYS_RESOURCE,
the syscall should return failure with errno set to EPERM.
See: http://man7.org/linux/man-pages/man2/prctl.2.html
PiperOrigin-RevId: 224182874
Change-Id: I630d1dd44af8b444dd16e8e58a0764a0cf1ad9a3
2018-12-05 10:53:51 -08:00
Michael Pratt 076f107643 Remove initRegs arg from clone
It is always the same as t.initRegs.

PiperOrigin-RevId: 224085550
Change-Id: I5cc4ddc3b481d4748c3c43f6f4bb50da1dbac694
2018-12-04 18:53:43 -08:00
Brian Geffon ffcbda0c8b Partial writes should loop in rpcinet.
FileOperations.Write should return ErrWouldBlock to allow the upper
layer to loop and sendmsg should continue writing where it left off
on a partial write.

PiperOrigin-RevId: 224081631
Change-Id: Ic61f6943ea6b7abbd82e4279decea215347eac48
2018-12-04 18:15:10 -08:00
Brian Geffon 2cab0e82ad Linkat(2) should sanity check flags.
PiperOrigin-RevId: 224047765
Change-Id: I6f3c75b33c32bf8f8910ea3fab35406d7d672d87
2018-12-04 14:34:19 -08:00
Brian Geffon 82719be42e Max link traversals should be for an entire path.
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.

PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-12-04 14:32:03 -08:00
Zhaozhong Ni adafc08d7c sentry: save / restore netstack procfs configuration.
PiperOrigin-RevId: 224047120
Change-Id: Ia6cb17fa978595cd73857b6178c4bdba401e185e
2018-12-04 14:30:42 -08:00
Brian Geffon 5a6a1eb420 Enforce name length restriction on paths.
NAME_LENGTH must be enforced per component.

PiperOrigin-RevId: 224046749
Change-Id: Iba8105b00d951f2509dc768af58e4110dafbe1c9
2018-12-04 14:28:33 -08:00
Rahat Mahmood 806e346491 Fix mempolicy_test on bazel.
Bazel runs multiple test cases on the same thread. Some of the test
cases rely on the test thread starting with the default memory policy,
while other tests modify the test thread's memory policy. This
obviously breaks when the test framework doesn't run each test case on
a new thread.

Also fixing an incompatibility where set_mempolicy(2) was prevented
from specifying an empty nodemask, which is allowed for some modes.

PiperOrigin-RevId: 224038957
Change-Id: Ibf780766f2706ebc9b129dbc8cf1b85c2a275074
2018-12-04 13:45:58 -08:00
Nicolas Lacasse 54dd0d0dc5 Fix data race caused by unlocked call of Dirent.descendantOf.
PiperOrigin-RevId: 224025363
Change-Id: I98864403c779832e9e1436f7d3c3f6fb2fba9904
2018-12-04 12:24:55 -08:00
Ian Gudger 5560615c53 Return an int32 for netlink SO_RCVBUF
Untyped integer constants default to type int and the binary package will panic
if one tries to encode an int.

PiperOrigin-RevId: 223890001
Change-Id: Iccc3afd6d74bad24c35d764508e450fd317b76ec
2018-12-03 17:03:15 -08:00
Nicolas Lacasse 573622fdca Fix data race in fs.Async.
Replaces the WaitGroup with a RWMutex. Calls to Async hold the mutex for
reading, while AsyncBarrier takes the lock for writing. This ensures that all
executing Async work finishes before AsyncBarrier returns.

Also pushes the Async() call from Inode.Release into
gofer/InodeOperations.Release(). This removes a recursive Async call which
should not have been allowed in the first place. The gofer Release call is the
slow one (since it may make RPCs to the gofer), so putting the Async call there
makes sense.

PiperOrigin-RevId: 223093067
Change-Id: I116da7b20fce5ebab8d99c2ab0f27db7c89d890e
2018-11-27 18:17:09 -08:00
Brian Geffon 5bd02b224f Save shutdown flags first.
With rpcinet if shutdown flags are not saved before making
the rpc a race is possible where blocked threads are woken
up before the flags have been persisted. This would mean
that threads can block indefinitely in a recvmsg after a
shutdown(SHUT_RD) has happened.

PiperOrigin-RevId: 223089783
Change-Id: If595e7add12aece54bcdf668ab64c570910d061a
2018-11-27 17:48:05 -08:00
Haibo Xu 9e0f132377 Add procid support for arm64 platform
Change-Id: I7c3db8dfdf95a125d7384c1d67c3300dbb99a47e
PiperOrigin-RevId: 223039923
2018-11-27 12:46:39 -08:00
Zach Koopmans b3b60ea29a Implementation of preadv2 for Linux 4.4 support
Implement RWF_HIPRI (4.6) silently passes the read call.
Implement -1 offset calls readv.

PiperOrigin-RevId: 222840324
Change-Id: If9ddc1e8d086e1a632bdf5e00bae08205f95b6b0
2018-11-26 09:50:47 -08:00
Fabricio Voznika eaac94d91c Use RET_KILL_PROCESS if available in kernel
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.

For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).

PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20 22:56:51 -08:00
Fabricio Voznika 5236b78242 Dumps stacks if watchdog thread is stuck
PiperOrigin-RevId: 222332703
Change-Id: Id5c3cf79591c5d2949895b4e323e63c48c679820
2018-11-20 17:24:19 -08:00
Fabricio Voznika 8b314b0bf4 Fix recursive read lock taken on TaskSet
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock
in R mode and could deadlock if a writer comes in between.

PiperOrigin-RevId: 222313551
Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
2018-11-20 15:07:56 -08:00
Michael Pratt 03c1eb78b5 Reference upstream licenses
Include copyright notices and the referenced LICENSE file.

PiperOrigin-RevId: 222171321
Change-Id: I0cc0b167ca51b536d1087bf1c4742fdf1430bc2a
2018-11-20 14:05:16 -08:00
Fabricio Voznika fadffa2ff8 Add unsupported syscall events for get/setsockopt
PiperOrigin-RevId: 222148953
Change-Id: I21500a9f08939c45314a6414e0824490a973e5aa
2018-11-20 14:04:12 -08:00
Nicolas Lacasse 8c84f9a3c1 Parse the tmpfs mode before validating.
This gets rid of the problematic modeRegex.

PiperOrigin-RevId: 221835959
Change-Id: I566b8d8a43579a4c30c0a08a620a964bbcd826dd
2018-11-20 14:02:39 -08:00
Adin Scannell bb9a2bb62e Update futex to use usermem abstractions.
This eliminates the indirection that existed in task_futex.

PiperOrigin-RevId: 221832498
Change-Id: Ifb4c926d493913aa6694e193deae91616a29f042
2018-11-20 14:02:07 -08:00
Rahat Mahmood f7aa937124 Advertise vsyscall support via /proc/<pid>/maps.
Also update test utilities for probing vsyscall support and add a
metric to see if vsyscalls are actually used in sandboxes.

PiperOrigin-RevId: 221698834
Change-Id: I57870ecc33ea8c864bd7437833f21aa1e8117477
2018-11-15 15:14:38 -08:00
Nicolas Lacasse 6ef08c2bc2 Allow setting sticky bit in tmpfs permissions.
PiperOrigin-RevId: 221683127
Change-Id: Ide6a9f41d75aa19d0e2051a05a1e4a114a4fb93c
2018-11-15 13:48:59 -08:00
Ian Gudger 7f60294a73 Implement TCP_NODELAY and TCP_CORK
Previously, TCP_NODELAY was always enabled and we would lie about it being
configurable. TCP_NODELAY is now disabled by default (to match Linux) in the
socket layer so that non-gVisor users don't automatically start using this
questionable optimization.

PiperOrigin-RevId: 221368472
Change-Id: Ib0240f66d94455081f4e0ca94f09d9338b2c1356
2018-11-13 18:02:43 -08:00
Googler 25d07fbbed Internal change.
PiperOrigin-RevId: 221189534
Change-Id: Id20d318bed97d5226b454c9351df396d11251e1f
2018-11-12 17:44:46 -08:00
Andrei Vagin 2ef122da35 Implement sync_file_range()
sync_file_range - sync a file segment with disk

In Linux, sync_file_range() accepts three flags:

       SYNC_FILE_RANGE_WAIT_BEFORE
              Wait  upon  write-out  of  all pages in the specified range that
              have already been submitted to the device driver  for  write-out
              before performing any write.

       SYNC_FILE_RANGE_WRITE
              Initiate  write-out  of  all  dirty pages in the specified range
              which are not presently submitted  write-out.   Note  that  even
              this  may  block if you attempt to write more than request queue
              size.

       SYNC_FILE_RANGE_WAIT_AFTER
              Wait upon write-out of all pages in the range  after  performing
              any write.

In this implementation:

SYNC_FILE_RANGE_WAIT_BEFORE without SYNC_FILE_RANGE_WAIT_AFTER isn't
supported right now.

SYNC_FILE_RANGE_WRITE is skipped. It should initiate write-out of  all
dirty pages, but it doesn't wait, so it should be safe to do nothing
while nobody uses SYNC_FILE_RANGE_WAIT_BEFORE.

SYNC_FILE_RANGE_WAIT_AFTER is equal to fdatasync(). In Linux,
sync_file_range() doesn't writes out the  file's  meta-data, but
fdatasync() does if a file size is changed.

PiperOrigin-RevId: 220730840
Change-Id: Iae5dfb23c2c916967d67cf1a1ad32f25eb3f6286
2018-11-08 17:39:51 -08:00
Rahat Mahmood 5a0be6fa20 Create stubs for syscalls upto Linux 4.4.
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise
a kernel version of 4.4.

PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
2018-11-08 11:09:46 -08:00
Ian Lewis 9d69d85bc1 Make error messages a bit more user friendly.
Updated error messages so that it doesn't print full Go struct representations
when running a new container in a sandbox. For example, this occurs frequently
when commands are not found when doing a 'kubectl exec'.

PiperOrigin-RevId: 219729141
Change-Id: Ic3a7bc84cd7b2167f495d48a1da241d621d3ca09
2018-11-01 17:40:09 -07:00
Rahat Mahmood 0e277a39c8 Prevent premature destruction of shm segments.
Shm segments can be marked for lazy destruction via shmctl(IPC_RMID),
which destroys a segment once it is no longer attached to any
processes. We were unconditionally decrementing the segment refcount
on shmctl(IPC_RMID) which allowed a user to force a segment to be
destroyed by repeatedly calling shmctl(IPC_RMID), with outstanding
memory maps to the segment.

This is problematic because the memory released by a segment destroyed
this way can be reused by a different process while remaining
accessible by the process with outstanding maps to the segment.

PiperOrigin-RevId: 219713660
Change-Id: I443ab838322b4fb418ed87b2722c3413ead21845
2018-11-01 15:54:14 -07:00
Juan b23cd33682 modify modeRegexp to adapt the default spec of containerd
https://github.com/containerd/containerd/blob/master/oci/spec.go#L206, the mode=755
didn't match the pattern modeRegexp = regexp.MustCompile("0[0-7][0-7][0-7]").

Closes #112

Signed-off-by: Juan <xionghuan.cn@gmail.com>
Change-Id: I469e0a68160a1278e34c9e1dbe4b7784c6f97e5a
PiperOrigin-RevId: 219672525
2018-11-01 11:57:54 -07:00
Adin Scannell fb613020c7 kvm: simplify floating point logic.
This reduces the number of floating point save/restore cycles required (since
we don't need to restore immediately following the switch, this always happens
in a known context) and allows the kernel hooks to capture state. This lets us
remove calls like "Current()".

PiperOrigin-RevId: 219552844
Change-Id: I7676fa2f6c18b9919718458aa888b832a7db8cab
2018-10-31 15:59:23 -07:00
Adin Scannell c4bbb54168 kvm: add detailed traces on vCPU errors.
This improves debuggability greatly.

PiperOrigin-RevId: 219551560
Change-Id: I2ecaffdd1c17b0d9f25911538ea6f693e2bc699f
2018-10-31 15:50:10 -07:00
Adin Scannell e9dbd5ab67 kvm: avoid siginfo allocations.
PiperOrigin-RevId: 219492587
Change-Id: I47f6fc0b74a4907ab0aff03d5f26453bdb983bb5
2018-10-31 10:08:06 -07:00
Adin Scannell 0091db9cbd kvm: use private futexes.
Use private futexes for performance and to align with other runtime uses.

PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
2018-10-30 22:46:42 -07:00
Adin Scannell e7191f058f Use TRAP to simplify vsyscall emulation.
PiperOrigin-RevId: 218592058
Change-Id: I373a2d813aa6cc362500dd5a894c0b214a1959d7
2018-10-24 15:52:44 -07:00
Ian Gudger 425dccdd7e Convert Unix transport to syserr
Previously this code used the tcpip error space. Since it is no longer part of
netstack, it can use the sentry's error space (except for a few cases where
there is still some shared code. This reduces the number of error space
conversions required for hot Unix socket operations.

PiperOrigin-RevId: 218541611
Change-Id: I3d13047006a8245b5dfda73364d37b8a453784bb
2018-10-24 11:05:08 -07:00
Nicolas Lacasse 4a1a2dead9 Run ptrace stubs in their own session and process group.
Pseudoterminal job control signals are meant to be received and handled by the
sandbox process, but if the ptrace stubs are running in the same process group,
they will receive the signals as well and inject then into the sentry kernel.

This can result in duplicate signals being delivered (often to the wrong
process), or a sentry panic if the ptrace stub is inactive.

This CL makes the ptrace stub run in a new session.

PiperOrigin-RevId: 218536851
Change-Id: Ie593c5687439bbfbf690ada3b2197ea71ed60a0e
2018-10-24 10:42:35 -07:00
Rahat Mahmood 46603b569c Fix panic on creation of zero-len shm segments.
Attempting to create a zero-len shm segment causes a panic since we
try to allocate a zero-len filemem region. The existing code had a
guard to disallow this, but the check didn't encode the fact that
requesting a private segment implies a segment creation regardless of
whether IPC_CREAT is explicitly specified.

PiperOrigin-RevId: 218405743
Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
2018-10-23 14:18:54 -07:00
Adin Scannell 75cd70ecc9 Track paths and provide a rename hook.
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.

PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
2018-10-23 00:20:15 -07:00
Ian Gudger d7c11c7417 Refcount Unix transport queue
This allows us to release messages in the queue when all users close.

PiperOrigin-RevId: 218033550
Change-Id: I2f6e87650fced87a3977e3b74c64775c7b885c1b
2018-10-20 17:58:26 -07:00
Fabricio Voznika b2068cf5a5 Add more unimplemented syscall events
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.

PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
2018-10-20 11:14:23 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Ian Gudger f7419fec26 Use generic ilist in Unix transport queue
This should improve performance.

PiperOrigin-RevId: 217610560
Change-Id: I370f196ea2396f1715a460b168ecbee197f94d6c
2018-10-17 16:31:15 -07:00
Jamie Liu b2a88ff471 Check thread group CPU timers in the CPU clock ticker.
This reduces the number of goroutines and runtime timers when
ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set.
This also ensures that thread group CPU timers only advance if running
tasks are observed at the time the CPU clock advances, mostly
eliminating the possibility that a CPU timer expiration observes no
running tasks and falls back to the group leader.

PiperOrigin-RevId: 217603396
Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
2018-10-17 15:50:02 -07:00
Ian Gudger 6922eee649 Merge queue into Unix transport
This queue only has a single user, so there is no need for it to use an
interface. Merging it into the same package as its sole user allows us to avoid
a circular dependency.

This simplifies the code and should slightly improve performance.

PiperOrigin-RevId: 217595889
Change-Id: Iabbd5164240b935f79933618c61581bc8dcd2822
2018-10-17 15:10:20 -07:00
Ian Gudger 8c85f5e9ce Fix typos in socket_test
PiperOrigin-RevId: 217576188
Change-Id: I82e45c306c5c9161e207311c7dbb8a983820c1df
2018-10-17 13:25:45 -07:00
Michael Pratt 8fa6f6fe76 Reflow comment to 80 columns
PiperOrigin-RevId: 217573168
Change-Id: Ic1914d0ef71bab020e3ee11cf9c4a50a702bd8dd
2018-10-17 13:06:16 -07:00
Nicolas Lacasse 4e6f0892c9 runsc: Support job control signals for the root container.
Now containers run with "docker run -it" support control characters like ^C and
^Z.

This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.

PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-17 12:29:05 -07:00
Michael Pratt 578fe5a50d Fix PTRACE_GETREGSET write size
The existing logic is backwards and writes iov_len == 0 for a full write.

PiperOrigin-RevId: 217560377
Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
2018-10-17 11:53:04 -07:00
Ian Gudger 6cba410df0 Move Unix transport out of netstack
PiperOrigin-RevId: 217557656
Change-Id: I63d27635b1a6c12877279995d2d9847b6a19da9b
2018-10-17 11:37:51 -07:00
Ian Gudger 324ad3564b Refactor host.ConnectedEndpoint
* Integrate recvMsg and sendMsg functions into Recv and Send respectively as
  they are no longer shared.
* Clean up partial read/write error handling code.
* Re-order code to make sense given that there is no longer a host.endpoint
  type.

PiperOrigin-RevId: 217255072
Change-Id: Ib43fe9286452f813b8309d969be11f5fa40694cd
2018-10-15 20:23:18 -07:00
Ian Gudger 167f2401c4 Merge host.endpoint into host.ConnectedEndpoint
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.

PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
2018-10-15 17:48:11 -07:00
Nicolas Lacasse ecd94ea7a6 Clean up Rename and Unlink checks for EBUSY.
- Change Dirent.Busy => Dirent.isMountPoint. The function body is unchanged,
  and it is no longer exported.

- fs.MayDelete now checks that the victim is not the process root. This aligns
  with Linux's namei.c:may_delete().

- Fix "is-ancestor" checks to actually compare all ancestors, not just the
  parents.

- Fix handling of paths that end in dots, which are handled differently in
  Rename vs. Unlink.

PiperOrigin-RevId: 217239274
Change-Id: I7a0eb768e70a1b2915017ce54f7f95cbf8edf1fb
2018-10-15 17:42:30 -07:00
Zhaozhong Ni 4ea69fce8d sentry: save fs.Dirent deleted info.
PiperOrigin-RevId: 217155458
Change-Id: Id3265b1ec784787039e2131c80254ac4937330c7
2018-10-15 09:31:32 -07:00
Kevin Krakauer 47d3862c33 runsc: Support retrieving MTU via netdevice ioctl.
This enables ifconfig to display MTU.

PiperOrigin-RevId: 216917021
Change-Id: Id513b23d9d76899bcb71b0b6a25036f41629a923
2018-10-12 13:58:32 -07:00
Zhaozhong Ni 0bfa03d61c sentry: allow saving of unlinked files with open fds on virtual fs.
PiperOrigin-RevId: 216733414
Change-Id: I33cd3eb818f0c39717d6656fcdfff6050b37ebb0
2018-10-11 11:41:44 -07:00
Adin Scannell 463e73d46d Add seccomp filter configuration to ptrace stubs.
This is a defense-in-depth measure. If the sentry is compromised, this prevents
system call injection to the stubs. There is some complexity with respect to
ptrace and seccomp interactions, so this protection is not really available
for kernel versions < 4.8; this is detected dynamically.

Note that this also solves the vsyscall emulation issue by adding in
appropriate trapping for those system calls. It does mean that a compromised
sentry could theoretically inject these into the stub (ignoring the trap and
resume, thereby allowing execution), but they are harmless.

PiperOrigin-RevId: 216647581
Change-Id: Id06c232cbac1f9489b1803ec97f83097fcba8eb8
2018-10-10 22:40:28 -07:00
Michael Pratt ddb34b3690 Enforce message size limits and avoid host calls with too many iovecs
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.

When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.

To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.

PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
2018-10-10 14:10:17 -07:00
Nicolas Lacasse b78552d30e When creating a new process group, add it to the session.
PiperOrigin-RevId: 216554791
Change-Id: Ia6b7a2e6eaad80a81b2a8f2e3241e93ebc2bda35
2018-10-10 10:42:11 -07:00
Ian Gudger c36d2ef373 Add new netstack metrics to the sentry
PiperOrigin-RevId: 216431260
Change-Id: Ia6e5c8d506940148d10ff2884cf4440f470e5820
2018-10-09 15:12:44 -07:00
Brian Geffon acf7a95189 Add memunit to sysinfo(2).
Also properly add padding after Procs in the linux.Sysinfo
structure. This will be implicitly padded to 64bits so we
need to do the same.

PiperOrigin-RevId: 216372907
Change-Id: I6eb6a27800da61d8f7b7b6e87bf0391a48fdb475
2018-10-09 09:52:14 -07:00
Michael Pratt 569c2b06c4 Statfs Namelen should be NAME_MAX not PATH_MAX
We accidentally set the wrong maximum. I've also added PATH_MAX and
NAME_MAX to the linux abi package.

PiperOrigin-RevId: 216221311
Change-Id: I44805fcf21508831809692184a0eba4cee469633
2018-10-08 11:39:54 -07:00
Jamie Liu e9e8be6613 Implement shared futexes.
- Shared futex objects on shared mappings are represented by Mappable +
  offset, analogous to Linux's use of inode + offset. Add type
  futex.Key, and change the futex.Manager bucket API to use futex.Keys
  instead of addresses.

- Extend the futex.Checker interface to be able to return Keys for
  memory mappings. It returns Keys rather than just mappings because
  whether the address or the target of the mapping is used in the Key
  depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this
  matters because using mapping target for a futex on a MAP_PRIVATE
  mapping causes it to stop working across COW-breaking.

- futex.Manager.WaitComplete depends on atomic updates to
  futex.Waiter.addr to determine when it has locked the right bucket,
  which is much less straightforward for struct futex.Waiter.key. Switch
  to an atomically-accessed futex.Waiter.bucket pointer.

- futex.Manager.Wake now needs to take a futex.Checker to resolve
  addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit
  path to perform a shared futex wakeup (Linux:
  kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid,
  FUTEX_WAKE, ...)). This is a problem because futexChecker is in the
  syscalls/linux package. Move it to kernel.

PiperOrigin-RevId: 216207039
Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
2018-10-08 10:20:38 -07:00
Ian Gudger beac59b37a Fix panic if FIOASYNC callback is registered and triggered without target
PiperOrigin-RevId: 215674589
Change-Id: I4f8871b64c570dc6da448d2fe351cec8a406efeb
2018-10-03 20:22:31 -07:00
Nicolas Lacasse 213f6688a5 Implement TIOCSCTTY ioctl as a noop.
PiperOrigin-RevId: 215658757
Change-Id: If63b33293f3e53a7f607ae72daa79e2b7ef6fcfd
2018-10-03 17:29:56 -07:00
Ian Gudger 4fef31f96c Add S/R support for FIOASYNC
PiperOrigin-RevId: 215655197
Change-Id: I668b1bc7c29daaf2999f8f759138bcbb09c4de6f
2018-10-03 17:03:09 -07:00
Nicolas Lacasse f1c01ed886 runsc: Support job control signals in "exec -it".
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.

However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.

This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.

One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.

To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.

Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.

Example:
	root@:/usr/local/apache2# sleep 100
	^C

	root@:/usr/local/apache2# sleep 100
	^Z
	[1]+  Stopped                 sleep 100

	root@:/usr/local/apache2# fg
	sleep 100
	^C

	root@:/usr/local/apache2#

PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-10-01 22:06:56 -07:00
Michael Pratt 0400e54592 Add itimer types to linux package, strace
PiperOrigin-RevId: 215278262
Change-Id: Icd10384c99802be6097be938196044386441e282
2018-10-01 14:16:53 -07:00
Nicolas Lacasse 07aa040842 Fix possible panic in control.Processes.
There was a race where we checked task.Parent() != nil, and then later called
task.Parent() again, assuming that it is not nil.  If the task is exiting, the
parent may have been set to nil in between the two calls, causing a panic.

This CL changes the code to only call task.Parent() once.

PiperOrigin-RevId: 215274456
Change-Id: Ib5a537312c917773265ec72016014f7bc59a5f59
2018-10-01 13:56:07 -07:00
Michael Pratt 3ff24b4f2c Require AF_UNIX sockets from the gofer
host.endpoint already has the check, but it is missing from
host.ConnectedEndpoint.

PiperOrigin-RevId: 214962762
Change-Id: I88bb13a5c5871775e4e7bf2608433df8a3d348e6
2018-09-28 11:03:11 -07:00
Sepehr Raissian c17ea8c6e2 Block for link address resolution
Previously, if address resolution for UDP or Ping sockets required sending
packets using Write in Transport layer, Resolve would return ErrWouldBlock
and Write would return ErrNoLinkAddress. Meanwhile startAddressResolution
would run in background. Further calls to Write using same address would also
return ErrNoLinkAddress until resolution has been completed successfully.

Since Write is not allowed to block and System Calls need to be
interruptible in System Call layer, the caller to Write is responsible for
blocking upon return of ErrWouldBlock.

Now, when startAddressResolution is called a notification channel for
the completion of the address resolution is returned.
The channel will traverse up to the calling function of Write as well as
ErrNoLinkAddress. Once address resolution is complete (success or not) the
channel is closed. The caller would call Write again to send packets and
check if address resolution was compeleted successfully or not.

Fixes google/gvisor#5

Change-Id: Idafaf31982bee1915ca084da39ae7bd468cebd93
PiperOrigin-RevId: 214962200
2018-09-28 11:00:16 -07:00
Nicolas Lacasse b709d23987 Forward ioctl(TCSETSF) calls on host ttys to the host kernel.
We already forward TCSETS and TCSETSW.  TCSETSF is roughly equivalent but
discards pending input.

The filters were relaxed to allow host ioctls with TCSETSF argument.

This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.

Before:
	root@b8a0240fc836:/# passwd
	Enter new UNIX password: 123
	Retype new UNIX password: 123
	passwd: password updated successfully

After:
	root@ae6f5dabe402:/# passwd
	Enter new UNIX password:
	Retype new UNIX password:
	passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
2018-09-27 18:17:38 -07:00
Fabricio Voznika 491faac03b Implement 'runsc kill --all'
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.

PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-27 15:00:58 -07:00
Zhaozhong Ni 234f36b6f2 sentry: export cpuTime function.
PiperOrigin-RevId: 214798278
Change-Id: Id59d1ceb35037cda0689d3a1c4844e96c6957615
2018-09-27 12:52:25 -07:00
Fabricio Voznika fca9a390db Return correct parent PID
Old code was returning ID of the thread that created
the child process. It should be returning the ID of
the parent process instead.

PiperOrigin-RevId: 214720910
Change-Id: I95715c535bcf468ecf1ae771cccd04a4cd345b36
2018-09-26 22:00:04 -07:00
Nicolas Lacasse fd222d62ed Short-circuit Readdir calls on overlay files when the dirent is frozen.
If we have an overlay file whose corresponding Dirent is frozen, then we should
not bother calling Readdir on the upper or lower files, since DirentReaddir
will calculate children based on the frozen Dirent tree.

A test was added that fails without this change.

PiperOrigin-RevId: 213531215
Change-Id: I4d6c98f1416541a476a34418f664ba58f936a81d
2018-09-18 15:42:22 -07:00
Brian Geffon ed08597d12 Allow for MSG_CTRUNC in input flags for recv.
PiperOrigin-RevId: 213481363
Change-Id: I8150ea20cebeb207afe031ed146244de9209e745
2018-09-18 11:14:37 -07:00
Fabricio Voznika da20559137 Provide better message when memfd_create fails with ENOSYS
Updates #100

PiperOrigin-RevId: 213414821
Change-Id: I90c2e6c18c54a6afcd7ad6f409f670aa31577d37
2018-09-18 02:09:28 -07:00
Fabricio Voznika 5d9816be41 Remove memory usage static init
panic() during init() can be hard to debug.

Updates #100

PiperOrigin-RevId: 213391932
Change-Id: Ic103f1981c5b48f1e12da3b42e696e84ffac02a9
2018-09-17 21:34:37 -07:00
Kevin Krakauer bb88c187c5 runsc: Enable waiting on exited processes.
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
  code.
- Processes not waited on will consume memory (like a zombie process)

PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17 16:25:24 -07:00
Ian Gudger ab6fa44588 Allow kernel.(*Task).Block to accept an extract only channel
PiperOrigin-RevId: 213328293
Change-Id: I4164133e6f709ecdb89ffbb5f7df3324c273860a
2018-09-17 13:35:54 -07:00
Michael Pratt d639c3d61b Allow NULL data in mount(2)
PiperOrigin-RevId: 213315267
Change-Id: I7562bcd81fb22e90aa9c7dd9eeb94803fcb8c5af
2018-09-17 12:16:29 -07:00
newmanwang de5a590ee2 Avoid reuse of pending SignalInfo objects
runApp.execute -> Task.SendSignal -> sendSignalLocked -> sendSignalTimerLocked
-> pendingSignals.enqueue assumes that it owns the arch.SignalInfo returned
from platform.Context.Switch.

On the other hand, ptrace.context.Switch assumes that it owns the returned
SignalInfo and can safely reuse it on the next call to Switch. The KVM platform
always returns a unique SignalInfo.

This becomes a problem when the returned signal is not immediately delivered,
allowing a future signal in Switch to change the previous pending SignalInfo.

This is noticeable in #38 when external SIGINTs are delivered from the PTY
slave FD. Note that the ptrace stubs are in the same process group as the
sentry, so they are eligible to receive the PTY signals. This should probably
change, but is not the only possible cause of this bug.

Updates #38

Original change by newmanwang <wcs1011@gmail.com>, updated by Michael Pratt
<mpratt@google.com>.

Change-Id: I5383840272309df70a29f67b25e8221f933622cd
PiperOrigin-RevId: 213071072
2018-09-14 17:39:25 -07:00
Michael Pratt 3aa50f18a4 Reuse readlink parameter, add sockaddr max.
PiperOrigin-RevId: 213058623
Change-Id: I522598c655d633b9330990951ff1c54d1023ec29
2018-09-14 16:00:02 -07:00
Nicolas Lacasse b84bfa570d Make gVisor hard link check match Linux's.
Linux permits hard-linking if the target is owned by the user OR the target has
Read+Write permission.

PiperOrigin-RevId: 213024613
Change-Id: If642066317b568b99084edd33ee4e8822ec9cbb3
2018-09-14 12:29:46 -07:00
Jamie Liu 0380bcb3a4 Fix interaction between rt_sigtimedwait and ignored signals.
PiperOrigin-RevId: 213011782
Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af
2018-09-14 11:10:50 -07:00
Chenggang faa34a0738 platform/kvm: Get max vcpu number dynamically by ioctl
The old kernel version, such as 4.4, only support 255 vcpus.
While gvisor is ran on these kernels, it could panic because the
vcpu id and vcpu number beyond max_vcpus.
Use ioctl(vmfd, _KVM_CHECK_EXTENSION, _KVM_CAP_MAX_VCPUS) to get max
vcpus number dynamically.

Change-Id: I50dd859a11b1c2cea854a8e27d4bf11a411aa45c
PiperOrigin-RevId: 212929704
2018-09-13 21:47:11 -07:00
Ian Gudger 29a7271f5d Plumb monotonic time to netstack
Netstack needs to be portable, so this seems to be preferable to using raw
system calls.

PiperOrigin-RevId: 212917409
Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c
2018-09-13 19:12:15 -07:00
Rahat Mahmood adf8f33970 Extend memory usage events to report mapped memory usage.
PiperOrigin-RevId: 212887555
Change-Id: I3545383ce903cbe9f00d9b5288d9ef9a049b9f4f
2018-09-13 15:16:47 -07:00
Michael Pratt 9c6b38e295 Format struct itimerspec
PiperOrigin-RevId: 212874745
Change-Id: I0c3e8e6a9e8976631cee03bf0b8891b336ddb8c8
2018-09-13 14:07:47 -07:00
Nicolas Lacasse e2d79480f5 initArgs must hold a reference on the Root if it is not nil.
The contract in ExecArgs says that a reference on ExecArgs.Root must be held
for the lifetime of the struct, but the caller is free to drop the ref after
that.

As a result, proc.Exec must take an additional ref on Root when it constructs
the CreateProcessArgs, since that holds a pointer to Root as well. That ref is
dropped in CreateProcess.

PiperOrigin-RevId: 212828348
Change-Id: I7f44a612f337ff51a02b873b8a845d3119408707
2018-09-13 09:50:35 -07:00
Kevin Krakauer 2eff1fdd06 runsc: Add exec flag that specifies where to save the sandbox-internal pid.
This is different from the existing -pid-file flag, which saves a host pid.

PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
2018-09-12 15:23:35 -07:00
Nicolas Lacasse 6cc9b311af platform: Pass device fd into platform constructor.
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New).  This requires that we have RW access to
the platform device when constructing the platform.

However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.

This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.

PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
2018-09-11 13:09:46 -07:00
Jamie Liu a29c39aa62 Map committed chunks concurrently in FileMem.LoadFrom.
PiperOrigin-RevId: 212345401
Change-Id: Iac626ee87ba312df88ab1019ade6ecd62c04c75c
2018-09-10 15:23:44 -07:00
Fabricio Voznika 7e9e6745ca Allow '/dev/zero' to be mapped with unaligned length
PiperOrigin-RevId: 212321271
Change-Id: I79d71c2e6f4b8fcd3b9b923fe96c2256755f4c48
2018-09-10 13:24:55 -07:00
Michael Pratt 7045828a31 Update cleanup TODO
PiperOrigin-RevId: 212068327
Change-Id: I3f360cdf7d6caa1c96fae68ae3a1caaf440f0cbe
2018-09-07 18:14:57 -07:00
Nicolas Lacasse 9751b800a6 runsc: Support multi-container exec.
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.

Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute.  So we have to
do the path lookup much later than we previously were.

PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
2018-09-07 17:39:54 -07:00
Fabricio Voznika 172860a059 Add 'Starting gVisor...' message to syslog
This allows applications to verify they are running with gVisor. It
also helps debugging when running with a mix of container runtimes.

Closes #54

PiperOrigin-RevId: 212059457
Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec
2018-09-07 16:59:27 -07:00
Fabricio Voznika f895cb4d8b Use root abstract socket namespace for exec
PiperOrigin-RevId: 211999211
Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d
2018-09-07 10:45:55 -07:00
Michael Pratt 169e2efc5a Continue handling signals after disabling forwarding
Before destroying the Kernel, we disable signal forwarding,
relinquishing control to the Go runtime. External signals that arrive
after disabling forwarding but before the sandbox exits thus may use
runtime.raise (i.e., tkill(2)) and violate the syscall filters.

Adjust forwardSignals to handle signals received after disabling
forwarding the same way they are handled before starting forwarding.
i.e., by implementing the standard Go runtime behavior using tgkill(2)
instead of tkill(2).

This also makes the stop callback block until forwarding actually stops.
This isn't required to avoid tkill(2) but is a saner interface.

PiperOrigin-RevId: 211995946
Change-Id: I3585841644409260eec23435cf65681ad41f5f03
2018-09-07 10:28:25 -07:00
Nicolas Lacasse 6516b5648b createProcessArgs.RootFromContext should return process Root if it exists.
It was always returning the MountNamespace root, which may be different from
the process Root if the process is in a chroot environment.

PiperOrigin-RevId: 211862181
Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480
2018-09-06 13:47:49 -07:00
Fabricio Voznika 41b56696c4 Imported FD in exec was leaking
Imported file needs to be closed after it's
been imported.

PiperOrigin-RevId: 211732472
Change-Id: Ia9249210558b77be076bcce465b832a22eed301f
2018-09-05 18:07:11 -07:00
Brian Geffon 2b8dae0bc5 Open(2) isn't honoring O_NOFOLLOW
PiperOrigin-RevId: 211644897
Change-Id: I882ed827a477d6c03576463ca5bf2d6351892b90
2018-09-05 09:21:28 -07:00
Michael Pratt 3944cb41cb /proc/PID/mounts is not tab-delimited
PiperOrigin-RevId: 211513847
Change-Id: Ib484dd2d921c3e5d70d0e410cd973d3bff4f6b73
2018-09-04 13:29:49 -07:00
Adin Scannell c09f9acd7c Distinguish Element and Linker for ilist.
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.

This is a requisite change for supporting a per-Inode list of Dirents.

PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
2018-09-04 09:19:11 -07:00
Jamie Liu f8ccfbbed4 Document more task-goroutine-owned fields in kernel.Task.
Task.creds can only be changed by the task's own set*id and execve
syscalls, and Task namespaces can only be changed by the task's own
unshare/setns syscalls.

PiperOrigin-RevId: 211156279
Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
2018-08-31 15:44:40 -07:00
Jamie Liu b935311e23 Do not use fs.FileOwnerFromContext in fs/proc.file.UnstableAttr().
From //pkg/sentry/context/context.go:

// - It is *not safe* to retain a Context passed to a function beyond the scope
// of that function call.

Passing a stored kernel.Task as a context.Context to
fs.FileOwnerFromContext violates this requirement.

PiperOrigin-RevId: 211143021
Change-Id: I4c5b02bd941407be4c9cfdbcbdfe5a26acaec037
2018-08-31 14:17:56 -07:00
Jamie Liu 098046ba19 Disintegrate kernel.TaskResources.
This allows us to call kernel.FDMap.DecRef without holding mutexes
cleanly.

PiperOrigin-RevId: 211139657
Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
2018-08-31 13:58:04 -07:00
Jamie Liu b1c1afa3cc Delete the long-obsolete kernel.TaskMaybe interface.
PiperOrigin-RevId: 211131855
Change-Id: Ia7799561ccd65d16269e0ae6f408ab53749bca37
2018-08-31 13:07:34 -07:00
Nicolas Lacasse 8bfb5fa919 fs: Add empty dir at /sys/class/power_supply.
PiperOrigin-RevId: 210953512
Change-Id: I07d2d7fb0d268aa8eca26d81ef28b5b5c42289ee
2018-08-30 12:01:27 -07:00
Nicolas Lacasse 956fe64ad6 fs: Fix renameMu lock recursion.
dirent.walk() takes renameMu, but is often called with renameMu already held,
which can lead to a deadlock.

Fix this by requiring renameMu to be held for reading when dirent.walk() is
called. This causes walks and existence checks to block while a rename
operation takes place, but that is what we were already trying to enforce by
taking renameMu in walk() anyways.

PiperOrigin-RevId: 210760780
Change-Id: Id61018e6e4adbeac53b9c1b3aa24ab77f75d8a54
2018-08-29 11:47:01 -07:00
Nicolas Lacasse 1893247616 fs: Drop reference to over-written file before renaming over it.
dirent.go:Rename() walks to the file being replaced and defers
replaced.DecRef(). After the rename, the reference is dropped, triggering a
writeout and SettAttr call to the gofer. Because of lazyOpenForWrite, the gofer
opens the replaced file BY ITS OLD NAME and calls ftruncate on it.

This CL changes Remove to drop the reference on replaced (and thus trigger
writeout) before the actual rename call.

PiperOrigin-RevId: 210756097
Change-Id: I01ea09a5ee6c2e2d464560362f09943641638e0f
2018-08-29 11:22:27 -07:00
Ian Gudger 52e6714146 fasync: don't keep mutex after return
PiperOrigin-RevId: 210637533
Change-Id: I3536c3f9efb54732a0d8ada8bc299142b2c1682f
2018-08-28 17:26:26 -07:00
Nicolas Lacasse 3b11769c77 fs: Don't bother saving negative dirents.
PiperOrigin-RevId: 210616454
Change-Id: I3f536e2b4d603e540cdd9a67c61b8ec3351f4ac3
2018-08-28 15:18:42 -07:00
Nicolas Lacasse 515d9bf43b fs: Add tests for dirent ref counting with an overlay.
PiperOrigin-RevId: 210614669
Change-Id: I408365ff6d6c7765ed7b789446d30e7079cbfc67
2018-08-28 15:09:17 -07:00
Zhaozhong Ni d724863a31 sentry: optimize dirent weakref map save / restore.
Weak references save / restore involves multiple interface indirection
and cause material latency overhead when there are lots of dirents, each
containing a weak reference map. The nil entries in the map should also
be purged.

PiperOrigin-RevId: 210593727
Change-Id: Ied6f4c3c0726fcc53a24b983d9b3a79121b6b758
2018-08-28 13:22:07 -07:00
Michael Pratt 25a8e13a78 Bump to Go 1.11
The procid offset is unchanged.

PiperOrigin-RevId: 210551969
Change-Id: I33ba1ce56c2f5631b712417d870aa65ef24e6022
2018-08-28 09:22:41 -07:00
Fabricio Voznika ae648bafda Add command-line parameter to trigger panic on signal
This is to troubleshoot problems with a hung process that is
not responding to 'runsc debug --stack' command.

PiperOrigin-RevId: 210483513
Change-Id: I4377b210b4e51bc8a281ad34fd94f3df13d9187d
2018-08-27 20:36:10 -07:00
Brian Geffon f0492d45aa Add /proc/sys/kernel/shm[all,max,mni].
PiperOrigin-RevId: 210459956
Change-Id: I51859b90fa967631e0a54a390abc3b5541fbee66
2018-08-27 17:21:37 -07:00
Nicolas Lacasse 0b3bfe2ea3 fs: Fix remote-revalidate cache policy.
When revalidating a Dirent, if the inode id is the same, then we don't need to
throw away the entire Dirent. We can just update the unstable attributes in
place.

If the inode id has changed, then the remote file has been deleted or moved,
and we have no choice but to throw away the dirent we have a look up another.
In this case, we may still end up losing a mounted dirent that is a child of
the revalidated dirent. However, that seems appropriate here because the entire
mount point has been pulled out from underneath us.

Because gVisor's overlay is at the Inode level rather than the Dirent level, we
must pass the parent Inode and name along with the Inode that is being
revalidated.

PiperOrigin-RevId: 210431270
Change-Id: I705caef9c68900234972d5aac4ae3a78c61c7d42
2018-08-27 14:26:29 -07:00
Zhaozhong Ni bd01816c87 sentry: mark fsutil.DirFileOperations as savable.
PiperOrigin-RevId: 210405166
Change-Id: I252766015885c418e914007baf2fc058fec39b3e
2018-08-27 11:55:32 -07:00
Kevin Krakauer 2524111fc6 runsc: Terminal resizing support.
Implements the TIOCGWINSZ and TIOCSWINSZ ioctls, which allow processes to resize
the terminal. This allows, for example, sshd to properly set the window size for
ssh sessions.

PiperOrigin-RevId: 210392504
Change-Id: I0d4789154d6d22f02509b31d71392e13ee4a50ba
2018-08-27 10:49:16 -07:00
Nicolas Lacasse 106de2182d runsc: Terminal support for "docker exec -ti".
This CL adds terminal support for "docker exec".  We previously only supported
consoles for the container process, but not exec processes.

The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.

Note that control-character signals are still not properly supported.

Tested with:
	$ docker run --runtime=runsc -it alpine
In another terminial:
	$ docker exec -it <containerid> /bin/sh

PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
2018-08-24 17:43:21 -07:00
Nicolas Lacasse c48708a041 fs: Drop unused WaitGroup in Dirent.destroy.
PiperOrigin-RevId: 210182476
Change-Id: I655a2a801e2069108d30323f7f5ae76deb3ea3ec
2018-08-24 17:15:42 -07:00
Jamie Liu 64403265a0 Implement POSIX per-process interval timers.
PiperOrigin-RevId: 210021612
Change-Id: If7c161e6fd08cf17942bfb6bc5a8d2c4e271c61e
2018-08-23 16:32:36 -07:00
Zhaozhong Ni ba8f6ba8c8 sentry: mark idMapSeqHandle as savable.
PiperOrigin-RevId: 209994384
Change-Id: I16186cf79cb4760a134f3968db30c168a5f4340e
2018-08-23 13:59:20 -07:00
Adin Scannell a7a8d07d7d Add separate Recycle method for allocator.
This improves debugging for pagetable-related issues.

PiperOrigin-RevId: 209827795
Change-Id: I4cfa11664b0b52f26f6bc90a14c5bb106f01e038
2018-08-22 14:16:04 -07:00
Zhaozhong Ni 6b9133ba96 sentry: mark S/R stating errors as save rejections / fs corruptions.
PiperOrigin-RevId: 209817767
Change-Id: Iddf2b8441bc44f31f9a8cf6f2bd8e7a5b824b487
2018-08-22 13:19:16 -07:00
Brian Geffon 545ea7ab3f Always add AT_BASE even if there is no interpreter.
Linux will ALWAYS add AT_BASE even for a static binary, expect it
will be set to 0 [1].

1. https://github.com/torvalds/linux/blob/master/fs/binfmt_elf.c#L253

PiperOrigin-RevId: 209811129
Change-Id: I92cc66532f23d40f24414a921c030bd3481e12a0
2018-08-22 12:37:09 -07:00
Nicolas Lacasse 8d318aac55 fs: Hold Dirent.mu when calling Dirent.flush().
As required by the contract in Dirent.flush().

Also inline Dirent.freeze() into Dirent.Freeze(), since it is only called from
there.

PiperOrigin-RevId: 209783626
Change-Id: Ie6de4533d93dd299ffa01dabfa257c9cc259b1f4
2018-08-22 10:07:01 -07:00
Zhaozhong Ni 8bb50dab79 sentry: do not release gofer inode file state loading lock upon error.
When an inode file state failed to load asynchronuously, we want to report
the error instead of potentially panicing in another async loading goroutine
incorrectly unblocked.

PiperOrigin-RevId: 209683977
Change-Id: I591cde97710bbe3cdc53717ee58f1d28bbda9261
2018-08-21 16:52:27 -07:00
Ian Gudger 9c407382b0 Fix races in kernel.(*Task).Value()
PiperOrigin-RevId: 209627180
Change-Id: Idc84afd38003427e411df6e75abfabd9174174e1
2018-08-21 11:16:17 -07:00
Ian Gudger 47d5a12ce5 Fix handling of abstract Unix socket addresses
* Don't truncate abstract addresses at second null.
* Properly handle abstract addresses with length < 108 bytes.

PiperOrigin-RevId: 209502703
Change-Id: I49053f2d18b5a78208c3f640c27dbbdaece4f1a9
2018-08-20 16:12:23 -07:00
Nicolas Lacasse 1501400d9c getdents should return type=DT_DIR for SpecialDirectories.
It was returning DT_UNKNOWN, and this was breaking numpy.

PiperOrigin-RevId: 209459351
Change-Id: Ic6f548e23aa9c551b2032b92636cb5f0df9ccbd4
2018-08-20 11:59:58 -07:00
Nicolas Lacasse 0050e3e71c sysfs: Add (empty) cpu directories for each cpu in /sys/devices/system/cpu.
Numpy needs these.

Also added the "present" directory, since the contents are the same as possible
and online.

PiperOrigin-RevId: 209451777
Change-Id: I2048de3f57bf1c57e9b5421d607ca89c2a173684
2018-08-20 11:19:15 -07:00
Chenggang Qin aeec7a4c00 fs: Support possible and online knobs for cpu
Some linux commands depend on /sys/devices/system/cpu/possible, such
as 'lscpu'.

Add 2 knobs for cpu:
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/online
Both the values are '0 - Kernel.ApplicationCores()-1'.

Change-Id: Iabd8a4e559cbb630ed249686b92c22b4e7120663
PiperOrigin-RevId: 209070163
2018-08-16 16:28:14 -07:00
Kevin Krakauer 635b0c4593 runsc fsgofer: Support dynamic serving of filesystems.
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.

The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.

This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
  serve new content.
* Enables the sentry, when starting a new container, to add the new container's
  filesystem.
* Mounts those new filesystems at separate roots within the sentry.

PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-15 16:25:22 -07:00
Ian Gudger a620bea045 Reduce map lookups in syserr
PiperOrigin-RevId: 208755352
Change-Id: Ia24630f452a4a42940ab73a8113a2fd5ea2cfca2
2018-08-14 19:03:38 -07:00
Nicolas Lacasse e8a4f2e133 runsc: Change cache policy for root fs and volume mounts.
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively.  While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.

This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.

This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.

A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.

All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely served by the
root fs.

PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
2018-08-14 16:25:58 -07:00
Kevin Krakauer d4939f6dc2 TTY: Fix data race where calls into tty.queue's waiter were not synchronized.
Now, there's a waiter for each end (master and slave) of the TTY, and each
waiter.Entry is only enqueued in one of the waiters.

PiperOrigin-RevId: 208734483
Change-Id: I06996148f123075f8dd48cde5a553e2be74c6dce
2018-08-14 16:22:56 -07:00
Kevin Krakauer 12a4912aed Fix `ls -laR | wc -l` hanging.
stat()-ing /proc/PID/fd/FD incremented but didn't decrement the refcount for
FD. This behavior wasn't usually noticeable, but in the above case:

- ls would never decrement the refcount of the write end of the pipe to 0.
- This caused the write end of the pipe never to close.
- wc would then hang read()-ing from the pipe.

PiperOrigin-RevId: 208728817
Change-Id: I4fca1ba5ca24e4108915a1d30b41dc63da40604d
2018-08-14 15:49:58 -07:00
Ian Gudger e97717e29a Enforce Unix socket address length limit
PiperOrigin-RevId: 208720936
Change-Id: Ic943a88b6efeff49574306d4d4e1f113116ae32e
2018-08-14 15:07:05 -07:00
Nicolas Lacasse 6cf2278167 Automated rollback of changelist 208284483
PiperOrigin-RevId: 208685417
Change-Id: Ie2849c4811e3a2d14a002f521cef018ded0c6c4a
2018-08-14 11:50:49 -07:00
Nicolas Lacasse 66b0f3e15a Fix bind() on overlays.
InodeOperations.Bind now returns a Dirent which will be cached in the Dirent
tree.

When an overlay is in-use, Bind cannot return the Dirent created by the upper
filesystem because the Dirent does not know about the overlay. Instead,
overlayBind must create a new overlay-aware Inode and Dirent and return that.
This is analagous to how Lookup and overlayLookup work.

PiperOrigin-RevId: 208670710
Change-Id: I6390affbcf94c38656b4b458e248739b4853da29
2018-08-14 10:34:56 -07:00
Adin Scannell dde836a918 Prevent renames across walk fast path.
PiperOrigin-RevId: 208533436
Change-Id: Ifc1a4e2d6438a424650bee831c301b1ac0d670a3
2018-08-13 13:31:18 -07:00
Nicolas Lacasse a2ec391dfb fs: Allow overlays to revalidate files from the upper fs.
Previously, an overlay would panic if either the upper or lower fs required
revalidation for a given Dirent. Now, we allow revalidation from the upper
file, but not the lower.

If a cached overlay inode does need revalidation (because the upper needs
revalidation), then the entire overlay Inode will be discarded and a new
overlay Inode will be built with a fresh copy of the upper file.

As a side effect of this change, Revalidate must take an Inode instead of a
Dirent, since an overlay needs to revalidate individual Inodes.

PiperOrigin-RevId: 208293638
Change-Id: Ic8f8d1ffdc09114721745661a09522b54420c5f1
2018-08-10 17:16:38 -07:00
Justine Olshan ae6f092fe1 Implemented the splice(2) syscall.
Currently the implementation matches the behavior of moving data
between two file descriptors. However, it does not implement this
through zero-copy movement. Thus, this code is a starting point
to build the more complex implementation.

PiperOrigin-RevId: 208284483
Change-Id: Ibde79520a3d50bc26aead7ad4f128d2be31db14e
2018-08-10 16:11:01 -07:00
Nicolas Lacasse 567c5eed11 cache policy: Check policy before returning a negative dirent.
The cache policy determines whether Lookup should return a negative dirent, or
just ENOENT. This CL fixes one spot where we returned a negative dirent without
first consulting the policy.

PiperOrigin-RevId: 208280230
Change-Id: I8f963bbdb45a95a74ad0ecc1eef47eff2092d3a4
2018-08-10 15:43:03 -07:00
Brielle Broder 4ececd8e8d Enable checkpoint/restore in cases of UDS use.
Previously, processes which used file-system Unix Domain Sockets could not be
checkpoint-ed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.

PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
2018-08-10 14:33:20 -07:00
Neel Natu d5b702b64f Validate FS.base before establishing it in the task's register set.
PiperOrigin-RevId: 208229341
Change-Id: I5d84bc52bbafa073446ef497e56958d0d7955aa8
2018-08-10 10:27:09 -07:00
Michael Pratt 2e06b23aa6 Fix missing O_LARGEFILE from O_CREAT files
Cleanup some more syscall.O_* references while we're here.

PiperOrigin-RevId: 208133460
Change-Id: I48db71a38f817e4f4673977eafcc0e3874eb9a25
2018-08-09 16:50:37 -07:00
Fabricio Voznika 4e171f7590 Basic support for ip link/addr and ifconfig
Closes #94

PiperOrigin-RevId: 207997580
Change-Id: I19b426f1586b5ec12f8b0cd5884d5b401d334924
2018-08-08 22:39:58 -07:00
Adin Scannell dbbe9ec915 Protect PCIDs with a mutex.
Because the Drop method may be called across vCPUs, it is necessary to protect
the PCID database with a mutex to prevent concurrent modification. The PCID is
assigned prior to entersyscall, so it's safe to block.

PiperOrigin-RevId: 207992864
Change-Id: I8b36d55106981f51e30dcf03e12886330bb79d67
2018-08-08 21:29:19 -07:00
Fabricio Voznika 0d350aac7f Enable SACK in runsc
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.

PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
2018-08-08 10:26:18 -07:00
Jamie Liu c036da5dff Hold TaskSet.mu in Task.Parent.
PiperOrigin-RevId: 207766238
Change-Id: Id3b66d8fe1f44c3570f67fa5ae7ba16021e35be1
2018-08-07 13:09:42 -07:00
Nicolas Lacasse a38f41b464 fs: Add new cache policy "remote_revalidate".
This CL adds a new cache-policy for gofer filesystems that uses the host page
cache, but causes dirents to be reloaded on each Walk, and does not cache
readdir results.

This policy is useful when the remote filesystem may change out from underneath
us, as any remote changes will be reflected on the next Walk.

Importantly, this cache policy is only consistent if we do not use gVisor's
internal page cache, since that page cache is tied to the Inode and may be
thrown away upon Revalidation.

This cache policy should only be used when the gofer supports donating host
FDs, since then gVisor will make use of the host kernel page cache, which will
be consistent for all open files in the gofer. In fact, a panic will be raised
if a file is opened without a donated FD.

PiperOrigin-RevId: 207752937
Change-Id: I233cb78b4695bbe00a4605ae64080a47629329b8
2018-08-07 11:43:41 -07:00
Zhaozhong Ni c348d07863 sentry: make epoll.pollEntry wait for the file operation in restore.
PiperOrigin-RevId: 207737935
Change-Id: I3a301ece1f1d30909715f36562474e3248b6a0d5
2018-08-07 10:27:37 -07:00
Michael Pratt 42086fe8e1 Make ramfs.File savable
In other news, apparently proc.fdInfo is the last user of ramfs.File.

PiperOrigin-RevId: 207564572
Change-Id: I5a92515698cc89652b80bea9a32d309e14059869
2018-08-06 10:15:56 -07:00
ShiruRen 3ec074897f Fix a bug in PCIDs.Assign
Store the new assigned pcid in p.cache[pt].

Signed-off-by: ShiruRen <renshiru2000@gmail.com>

Change-Id: I4aee4e06559e429fb5e90cb9fe28b36139e3b4b6
PiperOrigin-RevId: 207563833
2018-08-06 10:11:56 -07:00
Zhaozhong Ni 25178ebdf5 stateify: make explicit mode no longer optional.
PiperOrigin-RevId: 207303405
Change-Id: I17b6433963d78e3631a862b7ac80f566c8e7d106
2018-08-03 12:09:13 -07:00
Michael Pratt a3927157c5 Copy creds in access
PiperOrigin-RevId: 207181631
Change-Id: Ic6205278715a9260fb970efb414fc758ea72c4c6
2018-08-02 16:01:31 -07:00
Michael Pratt b6a37ab9d9 Update comment reference
PiperOrigin-RevId: 207180809
Change-Id: I08c264812919e81b2c56fdd4a9ef06924de8b52f
2018-08-02 15:56:40 -07:00
Zhaozhong Ni 57d0fcbdbf Automated rollback of changelist 207037226
PiperOrigin-RevId: 207125440
Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02 10:42:48 -07:00
Brian Geffon cf44aff6e0 Add seccomp(2) support.
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.

PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
2018-08-02 08:10:30 -07:00
Michael Pratt 60add78980 Automated rollback of changelist 207007153
PiperOrigin-RevId: 207037226
Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01 19:57:32 -07:00
Zhaozhong Ni b9e1cf8404 stateify: convert all packages to use explicit mode.
PiperOrigin-RevId: 207007153
Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-08-01 15:43:24 -07:00
Brielle Broder 6b87378634 New conditional for adding key/value pairs to maps.
When adding MultiDeviceKeys and their values into MultiDevice maps, make
sure the keys and values have not already been added. This ensures that
preexisting key/value pairs are not overridden.

PiperOrigin-RevId: 206942766
Change-Id: I9d85f38eb59ba59f0305e6614a52690608944981
2018-08-01 09:44:57 -07:00
Andrei Vagin a7a0167716 proc: show file flags in fdinfo
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:

/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}

Actually, fdinfo has to contain file flags.

Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
2018-07-31 11:19:15 -07:00
Justine Olshan 2793f7ac5f Added the O_LARGEFILE flag.
This flag will always be true for gVisor files.

PiperOrigin-RevId: 206355963
Change-Id: I2f03d2412e2609042df43b06d1318cba674574d0
2018-07-27 12:27:46 -07:00
Zhaozhong Ni be7fcbc558 stateify: support explicit annotation mode; convert refs and stack packages.
We have been unnecessarily creating too many savable types implicitly.

PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
2018-07-27 10:17:21 -07:00
Nicolas Lacasse 127c977ab0 Don't copy-up extended attributes that specifically configure a lower overlay.
When copying-up files from a lower fs to an upper, we also copy the extended
attributes on the file. If there is a (nested) overlay inside the lower, some
of these extended attributes configure the lower overlay, and should not be
copied-up to the upper.

In particular, whiteout attributes in the lower fs overlay should not be
copied-up, since the upper fs may actually contain the file.

PiperOrigin-RevId: 206236010
Change-Id: Ia0454ac7b99d0e11383f732a529cb195ed364062
2018-07-26 15:55:50 -07:00
Michael Pratt 7cd9405b9c Format openat flags
PiperOrigin-RevId: 206021774
Change-Id: I447b6c751c28a8d8d4d78468b756b6ad8c61e169
2018-07-25 11:07:19 -07:00
Kevin Krakauer 32aa0f5465 Typo fix.
PiperOrigin-RevId: 205880843
Change-Id: If2272b25f08a18ebe9b6309a1032dd5cdaa59866
2018-07-24 13:26:06 -07:00
Fabricio Voznika d7a34790a0 Add KVM and overlay dimensions to container_test
PiperOrigin-RevId: 205714667
Change-Id: I317a2ca98ac3bdad97c4790fcc61b004757d99ef
2018-07-23 13:31:42 -07:00
Michael Pratt 5f134b3c0a Format getcwd path
PiperOrigin-RevId: 205440332
Change-Id: I2a838f363e079164c83da88e1b0b8769844fe79b
2018-07-20 12:59:41 -07:00
Adin Scannell 8b8aad91d5 kernel: mutations on creds now require a copy.
PiperOrigin-RevId: 205315612
Change-Id: I9a0a1e32c8abfb7467a38743b82449cc92830316
2018-07-19 15:48:56 -07:00
Nicolas Lacasse be431d0934 fs: Pass context to Revalidate() function.
The current revalidation logic is very simple and does not do much
introspection of the dirent being revalidated (other than looking at the type
of file).

Fancier revalidation logic is coming soon, and we need to be able to look at
the cached and uncached attributes of a given dirent, and we need a context to
perform some of these operations.

PiperOrigin-RevId: 205307351
Change-Id: If17ea1c631d8f9489c0e05a263e23d7a8a3bf159
2018-07-19 14:57:52 -07:00
Nicolas Lacasse ea37103196 ConfigureMMap on an overlay file delegates to the upper if there is no lower.
In the general case with an overlay, all mmap calls must go through the
overlay, because in the event of a copy-up, the overlay needs to invalidate any
previously-created mappings.

If there if no lower file, however, there will never be a copy-up, so the
overlay can delegate directly to the upper file in that case.

This also allows us to correctly mmap /dev/zero when it is in an overlay. This
file has special semantics which the overlay does not know about. In
particular, it does not implement Mappable(), which (in the general case) the
overlay uses to detect if a file is mappable or not.

PiperOrigin-RevId: 205306743
Change-Id: I92331649aa648340ef6e65411c2b42c12fa69631
2018-07-19 14:53:38 -07:00
Brian Geffon df5a5d388e Add AT_UID, AT_EUID, AT_GID, AT_EGID to aux vector.
With musl libc when these entries are missing from the aux vector
it's forcing libc.secure (effectively AT_SECURE). This mode prevents
RPATH and LD_LIBRARY_PATH from working.

https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n1488
As the first entry is a mask of all the aux fields set:
https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n187

PiperOrigin-RevId: 205284684
Change-Id: I04de7bab241043306b4f732306a81d74edfdff26
2018-07-19 12:42:05 -07:00
Zhaozhong Ni a95640b1e9 sentry: save stack in proc net dev.
PiperOrigin-RevId: 205253858
Change-Id: Iccdc493b66d1b4d39de44afb1184952183b1283f
2018-07-19 09:37:32 -07:00
Nicolas Lacasse 63e2820f7b Fix lock-ordering violation in Create by logging BaseName instead of FullName.
Dirent.FullName takes the global renameMu, but can be called during Create,
which itself takes dirent.mu and dirent.dirMu, which is a lock-order violation:

Dirent.Create
  d.dirMu.Lock
  d.mu.Lock
  Inode.Create
    gofer.inodeOperations.Create
      gofer.NewFile
        Dirent.FullName
          d.renameMu.RLock

We only use the FullName here for logging, and in this case we can get by with
logging only the BaseName.

A `BaseName` method was added to Dirent, which simply returns the name, taking
d.parent.mu as required.

In the Create pathway, we can't call d.BaseName() because taking d.parent.mu
after d.mu violates the lock order. But we already know the base name of the
file we just created, so that's OK.

In the Open/GetFile pathway, we are free to call d.BaseName() because the other
dirent locks are not held.

PiperOrigin-RevId: 205112278
Change-Id: Ib45c734081aecc9b225249a65fa8093eb4995f10
2018-07-18 11:49:50 -07:00
Michael Pratt 733ebe7c09 Merge FileMem.usage in IncRef
Per the doc, usage must be kept maximally merged. Beyond that, it is simply a
good idea to keep fragmentation in usage to a minimum.

The glibc malloc allocator allocates one page at a time, potentially causing
lots of fragmentation. However, those pages are likely to have the same number
of references, often making it possible to merge ranges.

PiperOrigin-RevId: 204960339
Change-Id: I03a050cf771c29a4f05b36eaf75b1a09c9465e14
2018-07-17 13:03:59 -07:00
Adin Scannell 29e00c943a Add CPUID faulting for ptrace and KVM.
PiperOrigin-RevId: 204858314
Change-Id: I8252bf8de3232a7a27af51076139b585e73276d4
2018-07-16 22:02:58 -07:00
Michael Pratt 14d06064d2 Start allocation and reclaim scans only where they may find a match
If usageSet is heavily fragmented, findUnallocatedRange and findReclaimable
can spend excessive cycles linearly scanning the set for unallocated/free
pages.

Improve common cases by beginning the scan only at the first page that could
possibly contain an unallocated/free page. This metadata only guarantees that
there is no lower unallocated/free page, but a scan may still be required
(especially for multi-page allocations).

That said, this heuristic can still provide significant performance
improvements for certain applications.

PiperOrigin-RevId: 204841833
Change-Id: Ic41ad33bf9537ecd673a6f5852ab353bf63ea1e6
2018-07-16 18:19:01 -07:00
Neel Natu 8f21c0bb28 Add EventOperations.HostFD()
This method allows an eventfd inside the Sentry to be registered with with
the host kernel.

Update comment about memory mapping host fds via CachingInodeOperations.

PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
2018-07-16 12:20:05 -07:00
Neel Natu 5b09ec3b89 Allow a filesystem to control its visibility in /proc/filesystems.
PiperOrigin-RevId: 204508520
Change-Id: I09e5f8b6e69413370e1a0d39dbb7dc1ee0b6192d
2018-07-13 12:10:57 -07:00
Michael Pratt f09ebd9c71 Note that Mount errors do not require translations
PiperOrigin-RevId: 204490639
Change-Id: I0fe26306bae9320c6aa4f854fe0ef25eebd93233
2018-07-13 10:24:18 -07:00
Michael Pratt a28b274abb Fix aio eventfd lookup
We're failing to set eventFile in the outer scope.

PiperOrigin-RevId: 204392995
Change-Id: Ib9b04f839599ef552d7b5951d08223e2b1d5f6ad
2018-07-12 17:14:50 -07:00
Zhaozhong Ni 1cd46c8dd1 sentry: wait for restore clock instead of panicing in Timekeeper.
PiperOrigin-RevId: 204372296
Change-Id: If1ed9843b93039806e0c65521f30177dc8036979
2018-07-12 15:09:02 -07:00
Zhaozhong Ni bb41ad808a sentry: save inet stacks in proc files.
PiperOrigin-RevId: 204362791
Change-Id: If85ea7442741e299f0d7cddbc3d6b415e285da81
2018-07-12 14:19:04 -07:00
Michael Pratt 41e0b977e5 Format documentation
PiperOrigin-RevId: 204323728
Change-Id: I1ff9aa062ffa12583b2e38ec94c87db7a3711971
2018-07-12 10:37:21 -07:00
Jamie Liu b9c469f372 Move ptrace constants to abi/linux.
PiperOrigin-RevId: 204188763
Change-Id: I5596ab7abb3ec9e210a7f57b3fc420e836fa43f3
2018-07-11 14:24:19 -07:00
Jamie Liu ee0ef506d4 Add MemoryManager.Pin.
PiperOrigin-RevId: 204162313
Change-Id: Ib0593dde88ac33e222c12d0dca6733ef1f1035dc
2018-07-11 11:52:09 -07:00
Jamie Liu 06920b3d1b Exit tmpfs.fileInodeOperations.Translate early if required.Start >= EOF.
Otherwise required and optional can be empty or have negative length.

PiperOrigin-RevId: 204007079
Change-Id: I59e472a87a8caac11ffb9a914b8d79bf0cd70995
2018-07-10 13:58:54 -07:00
Zhaozhong Ni b1683df90b netstack: tcp socket connected state S/R support.
PiperOrigin-RevId: 203958972
Change-Id: Ia6fe16547539296d48e2c6731edacdd96bd6e93c
2018-07-10 09:23:35 -07:00
Jamie Liu 41aeb680b1 Inherit parent in clone(CLONE_THREAD) under TaskSet.mu.
PiperOrigin-RevId: 203849534
Change-Id: I4d81513bfd32e0b7fc40c8a4c194eba7abc35a83
2018-07-09 16:16:19 -07:00
Michael Pratt 0dedac637f Trim all whitespace between interpreter and arg
Multiple whitespace characters are allowed. This fixes Ubuntu's
/usr/sbin/invoke-rc.d, which has trailing whitespace after the
interpreter which we were treating as an arg.

PiperOrigin-RevId: 203802278
Change-Id: I0a6cdb0af4b139cf8abb22fa70351fe3697a5c6b
2018-07-09 11:43:56 -07:00
Rahat Mahmood 34af9a6174 Fix data race on inotify.Watch.mask.
PiperOrigin-RevId: 203180463
Change-Id: Ief50988c1c028f81ec07a26e704d893e86985bf0
2018-07-03 14:08:51 -07:00
Michael Pratt 660f1203ff Fix runsc VDSO mapping
80bdf8a406 accidentally moved vdso into an
inner scope, never assigning the vdso variable passed to the Kernel and
thus skipping VDSO mappings.

Fix this and remove the ability for loadVDSO to skip VDSO mappings,
since tests that do so are gone.

PiperOrigin-RevId: 203169135
Change-Id: Ifd8cadcbaf82f959223c501edcc4d83d05327eba
2018-07-03 12:53:39 -07:00
Michael Pratt 062a6f6ec5 Handle NUL-only paths in exec
The path in execve(2), interpreter script, and ELF interpreter may all
be no more than a NUL-byte. Handle each of those cases.

PiperOrigin-RevId: 203155745
Change-Id: I1c8b1b387924b23b2cf942341dfc76c9003da959
2018-07-03 11:28:53 -07:00
Michael Pratt 2821dfe6ce Hold d.parent.mu when reading d.name
PiperOrigin-RevId: 203041657
Change-Id: I120783d91712818e600505454c9276f8d9877f37
2018-07-02 17:39:10 -07:00
Justine Olshan 80bdf8a406 Sets the restore environment for restoring a container.
Updated how restoring occurs through boot.go with a separate Restore function.
This prevents a new process and new mounts from being created.
Added tests to ensure the container is restored.
Registered checkpoint and restore commands so they can be used.
Docker support for these commands is still limited.
Working on #80.

PiperOrigin-RevId: 202710950
Change-Id: I2b893ceaef6b9442b1ce3743bd112383cb92af0c
2018-06-29 14:47:40 -07:00
Nicolas Lacasse 1b5e09f968 aio: Return EINVAL if the number of events is negative.
PiperOrigin-RevId: 202671065
Change-Id: I248b74544d47ddde9cd59d89aa6ccb7dad2b6f89
2018-06-29 10:47:38 -07:00
Nicolas Lacasse f93bd2cbe6 Hold t.mu while calling t.FSContext().
PiperOrigin-RevId: 202562686
Change-Id: I0f5be7cc9098e86fa31d016251c127cb91084b05
2018-06-28 16:11:19 -07:00
Nicolas Lacasse 1ceed49ba9 Check for invalid offset when submitting an AIO read/write request.
PiperOrigin-RevId: 202528335
Change-Id: Ic32312cf4337bcb40a7155cb2174e5cd89a280f7
2018-06-28 12:55:18 -07:00
Fabricio Voznika 6b6852bceb Fix semaphore data races
PiperOrigin-RevId: 202371908
Change-Id: I72603b1d321878cae6404987c49e64732b676331
2018-06-27 14:41:32 -07:00
Nicolas Lacasse 99afc982f1 Call mm.CheckIORange() when copying in IOVecs.
CheckIORange is analagous to Linux's access_ok() method, which is checked when
copying in IOVecs in both lib/iov_iter.c:import_single_range() and
lib/iov_iter.c:import_iovec() => fs/read_write.c:rw_copy_check_uvector().

gVisor copies in IOVecs via Task.SingleIOSequence() and Task.CopyInIovecs().
We were checking the address range bounds, but not whether the address is
valid. To conform with linux, we should also check that the address is valid.

For usual preadv/pwritev syscalls, the effect of this change is not noticeable,
since we find out that the address is invalid before the syscall completes.

For vectorized async-IO operations, however, this change is necessary because
Linux returns EFAULT when the operation is submitted, but before it executes.
Thus, we must validate the iovecs when copying them in.

PiperOrigin-RevId: 202370092
Change-Id: I8759a63ccf7e6b90d90d30f78ab8935a0fcf4936
2018-06-27 14:31:35 -07:00