Commit Graph

625 Commits

Author SHA1 Message Date
Fabricio Voznika 5a8ee1beee Preserve log FD after execve
PiperOrigin-RevId: 306908296
2020-04-16 13:17:00 -07:00
gVisor bot ac9b32c36b Merge pull request #2212 from aaronlu:dup_stdioFDs
PiperOrigin-RevId: 306477639
2020-04-14 11:20:11 -07:00
Ian Lewis daf3322498 Add logging message for noNewPrivileges OCI option.
noNewPrivileges is ignored if set to false since gVisor assumes that
PR_SET_NO_NEW_PRIVS is always enabled.

PiperOrigin-RevId: 305991947
2020-04-10 20:32:23 -07:00
Fabricio Voznika 96f9142959 Use O_CLOEXEC when dup'ing FDs
The sentry doesn't allow execve, but it's a good defense-in-depth
measure.

PiperOrigin-RevId: 305958737
2020-04-10 15:47:23 -07:00
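As an illustration of the O_CLOEXEC change in 96f9142959, here is a minimal Linux-only sketch of duplicating an FD with close-on-exec set in one step. It uses the standard library's raw fcntl(2) call with F_DUPFD_CLOEXEC. The helper names (dupCloexec, isCloexec, dupPipeAndCheck) are hypothetical and not taken from the gVisor source, which may well do this differently.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// dupCloexec duplicates fd and sets close-on-exec on the copy in a single
// fcntl(2) call, so there is no window in which the new FD could be
// inherited across an execve. Linux-only; hypothetical helper.
func dupCloexec(fd int) (int, error) {
	newFD, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_DUPFD_CLOEXEC, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(newFD), nil
}

// isCloexec reports whether FD_CLOEXEC is set on fd.
func isCloexec(fd int) (bool, error) {
	flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_GETFD, 0)
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

// dupPipeAndCheck duplicates the read end of a fresh pipe and reports
// whether the duplicate carries the close-on-exec flag.
func dupPipeAndCheck() (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()
	defer w.Close()

	fd, err := dupCloexec(int(r.Fd()))
	if err != nil {
		return false, err
	}
	defer syscall.Close(fd)
	return isCloexec(fd)
}

func main() {
	ok, err := dupPipeAndCheck()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("close-on-exec set on duplicate:", ok)
}
```

A plain dup(2) followed by a separate fcntl(F_SETFD) would leave a short window in which the duplicate is inheritable. The single-call form avoids that.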
gVisor bot 78126611e6 Merge pull request #2253 from amscanne:nogo
PiperOrigin-RevId: 305807868
2020-04-09 19:16:46 -07:00
Fabricio Voznika 2a28e3e9c3 Don't unconditionally set --panic-signal
Closes #2393

PiperOrigin-RevId: 305793027
2020-04-09 17:20:14 -07:00
Fabricio Voznika 6dd5a1f3fe Clean up TODOs
PiperOrigin-RevId: 305592245
2020-04-08 17:58:13 -07:00
Adin Scannell 928a7c60b8 Fix all printf formatting errors.
Updates #2243
2020-04-08 10:14:34 -07:00
Adin Scannell 94b793262d Fix all copy locks violations.
This required minor restructuring of how system call tables were saved
and restored, but it makes way more sense this way.

Updates #2243
2020-04-08 10:00:14 -07:00
Ian Lewis 56054fc1fb Add friendlier messages for frequently encountered errors.
Issue #2270
Issue #1765

PiperOrigin-RevId: 305385436
2020-04-07 18:51:01 -07:00
Ian Lewis 5802051b3d Update TODO to #238
Move TODO to #238 so that proper synchronization of operations is handled
when we create the urpc client.

Issue #238
Fixes #512

PiperOrigin-RevId: 305383924
2020-04-07 18:39:33 -07:00
Andrei Vagin acf0259255 Don't map the 0 uid into a sandbox user namespace
Starting with go1.13, we can specify ambient capabilities when we execute a new
process with os/exec.Cmd.

PiperOrigin-RevId: 305366706
2020-04-07 16:46:05 -07:00
Dean Deng fc72eb3595 Remove TODOs for local gofer extended attributes.
PiperOrigin-RevId: 305344989
2020-04-07 14:48:40 -07:00
Adin Scannell 4e6a1a5adb Automated rollback of changelist 303799678
PiperOrigin-RevId: 304221302
2020-04-01 11:06:26 -07:00
Aaron Lu 0cfdd47391 checkpoint/restore: make sure the donated stdioFDs have the same value
Suppose I start a runsc container using kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donating FD and the corresponding cmdline for runsc-sandbox is:

D0313 17:50:12.608203   44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214   44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224   44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229   44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234   44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238   44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242   44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246   44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249   44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253   44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257   44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls= --strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]

Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12.

If I restore a container from the checkpoint image which is derived
by checkpointing the above rootbash container, but either omit the
platform switch or explicitly specify the ptrace platform:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash

the donating FD and corresponding cmdline for runsc-sandbox is:

D0313 17:50:15.258632   44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640   44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645   44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648   44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653   44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657   44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661   44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675   44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680   44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684   44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]

Note this time, stdioFDs starts from 9 and stderr's FD is 11 (so the
saved host.descriptor.origFD which is 12 for stderr is no longer valid).

For the three host-FD-based files, the s.Dev and s.Ino derived from
fstat(fd) shall all be the same, and since the two fields are used
as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId, which is
the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same.
Note that for MultiDevice m, m.cache records the mapping of key to value
and m.rcache records the mapping of value to key. If the same value doesn't
map to the same key, it will panic on restore.

Now that stderr's origFD 12 is no longer valid (it happens to be
/memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() will not be
correct either. Because its InodeID is still the same as saved,
MultiDevice.Load() will complain about the same value (InodeID) being
mapped to different keys (different from stdin's and stdout's) and panic
with: "MultiDevice's caches are inconsistent".

Solve this problem by making sure the stdioFDs for the root container's
init task are always the same on initial start and on restore, no matter
what cmdline the user has used: whether a debug log is specified or the
platform is changed shall not affect the ability to restore.

Fixes #1844.
2020-03-31 11:37:11 +08:00
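The cache inconsistency described in 0cfdd47391 can be modeled with a small two-way map. This is a simplified sketch, not gVisor's device.MultiDevice: the type names, the Load signature, and the boolean return (the real code is described as panicking) are all assumptions made for illustration.

```go
package main

import "fmt"

// deviceKey stands in for the (device, inode) pair obtained from fstat.
type deviceKey struct {
	Dev uint64
	Ino uint64
}

// multiDevice is a hypothetical, simplified two-way mapping between
// host (Dev, Ino) keys and sandbox-visible inode IDs.
type multiDevice struct {
	next   uint64
	cache  map[deviceKey]uint64 // key -> value
	rcache map[uint64]deviceKey // value -> key
}

func newMultiDevice() *multiDevice {
	return &multiDevice{
		cache:  make(map[deviceKey]uint64),
		rcache: make(map[uint64]deviceKey),
	}
}

// Map returns the ID for key, allocating a new one on first use.
func (m *multiDevice) Map(key deviceKey) uint64 {
	if id, ok := m.cache[key]; ok {
		return id
	}
	m.next++
	m.cache[key] = m.next
	m.rcache[m.next] = key
	return m.next
}

// Load re-registers a saved (key, id) pair on restore. It returns false
// when the pair would make the two maps disagree, which is the condition
// the commit message describes as "caches are inconsistent".
func (m *multiDevice) Load(key deviceKey, id uint64) bool {
	if prev, ok := m.rcache[id]; ok && prev != key {
		return false // same value already maps back to a different key
	}
	if prev, ok := m.cache[key]; ok && prev != id {
		return false // same key already maps to a different value
	}
	m.cache[key] = id
	m.rcache[id] = key
	if id > m.next {
		m.next = id
	}
	return true
}

func main() {
	tty := deviceKey{Dev: 1, Ino: 100}   // what all three stdio FDs referred to at save time
	memfd := deviceKey{Dev: 2, Ino: 200} // what the stale FD 12 refers to on restore

	saved := newMultiDevice().Map(tty) // inode ID stored in the checkpoint

	restored := newMultiDevice()
	fmt.Println("stdin consistent: ", restored.Load(tty, saved))
	fmt.Println("stdout consistent:", restored.Load(tty, saved))
	fmt.Println("stderr consistent:", restored.Load(memfd, saved)) // stale FD, different key
}
```

In this model, stdin and stdout load cleanly because they present the same key for the saved ID, while the stale stderr FD presents a different key for that same ID and is rejected.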
Adin Scannell 3fac85da95 kvm: handle exit reasons even under EINTR.
In the case of other signals (preemption), inject a normal bounce and
defer the signal until the vCPU has been returned from guest mode.

PiperOrigin-RevId: 303799678
2020-03-30 12:37:57 -07:00
Dean Deng 137f361400 Use host-defined file owner and mode, when possible, for imported fds.
Using the host-defined file owner matches VFS1. It is more correct to use the
host-defined mode, since the cached value may become out of date. However,
kernfs.Inode.Mode() does not return an error--other filesystems on kernfs are
in-memory so retrieving mode should not fail. Therefore, if the host syscall
fails, we rely on a cached value instead.

Updates #1672.

PiperOrigin-RevId: 303220864
2020-03-26 16:47:20 -07:00
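The fallback described in 137f361400 (ask the host for the mode, use the cached value if the host syscall fails) might look roughly like the sketch below. The importedFD type and its fields are hypothetical stand-ins, not gVisor's kernfs or hostfs types.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// importedFD is a hypothetical stand-in for an inode backed by a host FD.
type importedFD struct {
	hostFD     int
	cachedMode uint32 // last known mode, used only as a fallback
}

// Mode has no error return, mirroring the constraint mentioned in the
// commit message. It prefers the host-defined mode and falls back to the
// cached value when fstat on the host FD fails.
func (i *importedFD) Mode() uint32 {
	var st syscall.Stat_t
	if err := syscall.Fstat(i.hostFD, &st); err != nil {
		return i.cachedMode
	}
	return uint32(st.Mode)
}

func main() {
	good := &importedFD{hostFD: int(os.Stdout.Fd()), cachedMode: 0644}
	bad := &importedFD{hostFD: -1, cachedMode: 0644} // fstat fails with EBADF

	fmt.Printf("host-defined mode: %o\n", good.Mode())
	fmt.Printf("fallback mode:     %o\n", bad.Mode())
}
```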
Dean Deng 248e46f320 Whitelist utimensat(2).
utimensat is used by hostfs for setting timestamps on imported fds. Previously,
this would crash the sandbox since utimensat was not allowed.

Correct the VFS2 version of hostfs to match the call in VFS1.

PiperOrigin-RevId: 301970121
2020-03-19 23:30:21 -07:00
Fabricio Voznika 069f1edbe4 Improve error message when pivot_root fails
PiperOrigin-RevId: 301949722
2020-03-19 20:18:03 -07:00
Dean Deng 5e413cad10 Plumb VFS2 imported fds into virtual filesystem.
- When setting up the virtual filesystem, mount a host.filesystem to contain
  all files that need to be imported.
- Make read/preadv syscalls to the host in cases where preadv2 may not be
  supported yet (likewise for writing).
- Make save/restore functions in kernel/kernel.go return early if vfs2 is
  enabled.

PiperOrigin-RevId: 300922353
2020-03-14 07:14:33 -07:00
Fabricio Voznika f2e4b5ab93 Kill sandbox process when parent process terminates
When the sandbox runs in attached mode, e.g. runsc do, runsc run, the
sandbox lifetime is controlled by the parent process. This wasn't working
in all cases because the PR_SET_PDEATHSIG setting doesn't propagate through execve
when the process changes uid/gid. So it was getting dropped when the
sandbox execve's to change to user nobody.

PiperOrigin-RevId: 300601247
2020-03-12 12:32:26 -07:00
Andrei Vagin d3fa741fb5 runsc: Set asyncpreemptoff for the kvm platform
Asynchronous goroutine preemption is a new feature in Go 1.14.

When we switched to go 1.14 (cl/297915917) in the bazel config,
the kokoro syscall-kvm job started permanently failing. Let's
temporarily set asyncpreemptoff for the kvm platform to unblock tests.

PiperOrigin-RevId: 300372387
2020-03-11 11:45:50 -07:00
gVisor bot 6367963c14 Merge pull request #1951 from moricho:moricho/add-profiler-option
PiperOrigin-RevId: 299233818
2020-03-05 17:16:54 -08:00
Andrei Vagin 6ec669631f tests: Don't print log messages on stdout
A parser of test results doesn't expect to see any extra messages.

PiperOrigin-RevId: 299174138
2020-03-05 13:08:04 -08:00
Andrei Vagin 80b40bbb06 tests: Don't print log messages on stdout
A parser of test results doesn't expect to see any extra messages.

PiperOrigin-RevId: 298966577
2020-03-04 16:16:35 -08:00
Andrei Vagin 322dbfe06b Allow specifying a separate log for Go's runtime messages
Go's runtime calls the write system call twice to print "panic:"
and "the reason of this panic", so there is a race window in which
other threads can print something to the log and we will see
something like this:

panic: log messages from another thread
The reason of the panic.

This confuses the syzkaller blacklist and dedup detection.

It also makes the logs generally difficult to read. e.g.,
data races often have one side of the race, followed by
a large "diagnosis" dump, finally followed by the other
side of the race.

PiperOrigin-RevId: 297887895
2020-02-28 11:24:11 -08:00
Fabricio Voznika 88f7369922 Log oom_score_adj value on error
Updates #1873

PiperOrigin-RevId: 297695241
2020-02-27 14:59:38 -08:00
moricho d8ed784311 add profile option
2020-02-26 16:49:51 +09:00
Jamie Liu 471b15b212 Port most syscalls to VFS2.
pipe and pipe2 aren't ported, pending a slight rework of pipe FDs for VFS2.
mount and umount2 aren't ported out of temporary laziness. access and faccessat
need additional FSImpl methods to implement properly, but are stubbed to
prevent googletest from CHECK-failing. Other syscalls require additional
plumbing.

Updates #1623

PiperOrigin-RevId: 297188448
2020-02-25 13:37:34 -08:00
Fabricio Voznika 4d7db46123 Add log during process wait in tests
TestMultiContainerKillAll timed out under --race. Without logging,
we cannot tell if the process list is still increasing, but slowly,
or is stuck.

PiperOrigin-RevId: 297158834
2020-02-25 11:14:47 -08:00
gVisor bot 4a73bae269 Initial network namespace support.
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will behave the same as they do now.

Until userspace is able to create devices, the default loopback device can
be used for testing.

/proc/net and /sys/net will still be connected to the root network stack; this
is the same behavior as now.

Issue #1833

PiperOrigin-RevId: 296309389
2020-02-20 15:20:40 -08:00
Adin Scannell ec5630527b Add statefile command to runsc.
PiperOrigin-RevId: 296105337
2020-02-19 18:28:42 -08:00
gVisor bot 5baf9dc2fb Synchronize signalling with S/R
This is to fix a data race between sending an external signal to
a ThreadGroup and kernel saving state for S/R.

PiperOrigin-RevId: 295244281
2020-02-14 15:49:09 -08:00
gVisor bot 4075de11be Plumb VFS2 inside the Sentry
- Added fsbridge package with interface that can be used to open
  and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup

Updates #1623

PiperOrigin-RevId: 295183950
2020-02-14 11:12:47 -08:00
gVisor bot b8e22e241c Disallow duplicate NIC names.
PiperOrigin-RevId: 294500858
2020-02-11 12:59:11 -08:00
Adin Scannell afcab8fe9f Clean-up comments in runsc/BUILD and CONTRIBUTING.md.
PiperOrigin-RevId: 294300437
2020-02-10 14:15:36 -08:00
Adin Scannell 3e8b38d08b Add flag package to limit visibility.
PiperOrigin-RevId: 294297004
2020-02-10 13:57:01 -08:00
Dean Deng 17b9f5e662 Support listxattr and removexattr syscalls.
Note that these are only implemented for tmpfs, and other impls will still
return EOPNOTSUPP.

PiperOrigin-RevId: 293899385
2020-02-07 14:47:13 -08:00
Ting-Yu Wang 386a1a1564 Fix TestPauseResume in container test failing with connection refused.
Sometimes we get this error under TSAN:
"""
error getting process data from container: connecting to control server at PID
XXXX: connection refused
"""

The theory is that the top "sleep 20" was too short for TSAN, and the container
had already exited, so we get connection refused. This commit changes the test
so that the container signals that it is running by repeatedly touching a file
for the duration of the test.

PiperOrigin-RevId: 293710957
2020-02-06 17:07:07 -08:00
Andrei Vagin 615d661112 runsc/container_test: hide host /etc in test containers
The host /etc can contain config files which affect tests.

For example, bash reads /etc/passwd, and if it is too big,
a test can time out.

PiperOrigin-RevId: 293670637
2020-02-06 14:02:52 -08:00
Adin Scannell 1b6a12a768 Add notes to relevant tests.
These were out-of-band notes that can help provide additional context
and simplify automated imports.

PiperOrigin-RevId: 293525915
2020-02-05 22:46:35 -08:00
gVisor bot b29aeebaf6 Merge pull request #1683 from kevinGC:ipt-udp-matchers
PiperOrigin-RevId: 293243342
2020-02-04 16:20:16 -08:00
Kevin Krakauer 3f5642c5af Increase container_test size.
container_test was flaking because a small percentage of runs timed out. Tested
this fix with --runs_per_test=100.

PiperOrigin-RevId: 293240102
2020-02-04 15:38:53 -08:00
Fabricio Voznika 6d8bf405bc Allow mlock in fsgofer system call filters
Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g
stack to prevent register corruption. We need to allow this syscall until it is
removed from Go.

PiperOrigin-RevId: 293212935
2020-02-04 13:42:27 -08:00
Ting-Yu Wang e7846e50f2 Reduce run time for //test/syscalls:socket_inet_loopback_test_runsc_ptrace.
* Tests are picked for a shard differently. It now picks one test from each
  block, instead of picking the whole block. This spreads tests of the same
  kind across different shards.

* Reduce the number of connect() calls in TCPListenClose.

PiperOrigin-RevId: 293019281
2020-02-03 15:42:21 -08:00
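The sharding change in e7846e50f2 can be pictured as moving from contiguous blocks to a strided pick. The sketch below is an assumption about the general technique, not the test runner's actual code; blockShard and stridedShard are hypothetical names.

```go
package main

import "fmt"

// blockShard gives each shard one contiguous block of tests, so tests of
// the same kind (which tend to be adjacent) land on the same shard.
func blockShard(tests []string, shard, total int) []string {
	per := (len(tests) + total - 1) / total // block size, rounded up
	start := shard * per
	if start > len(tests) {
		start = len(tests)
	}
	end := start + per
	if end > len(tests) {
		end = len(tests)
	}
	return tests[start:end]
}

// stridedShard gives each shard one test out of every group of `total`
// consecutive tests, so adjacent tests are spread across shards.
func stridedShard(tests []string, shard, total int) []string {
	var picked []string
	for i := shard; i < len(tests); i += total {
		picked = append(picked, tests[i])
	}
	return picked
}

func main() {
	tests := []string{"slow1", "slow2", "fast1", "fast2"}
	for shard := 0; shard < 2; shard++ {
		fmt.Println("block:  ", shard, blockShard(tests, shard, 2))
		fmt.Println("strided:", shard, stridedShard(tests, shard, 2))
	}
}
```

With block sharding, one shard in this example receives both slow tests. With the strided pick, each shard receives one slow and one fast test, which is the balancing effect the commit message describes.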
Brad Burlage 80ce7f2537 Tag version_test as noguitar.
PiperOrigin-RevId: 292974323
2020-02-03 12:09:52 -08:00
Michael Pratt 4d1a648c7c Allow mlock in system call filters
Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g
stack to prevent register corruption. We need to allow this syscall until it is
removed from Go.

PiperOrigin-RevId: 292967478
2020-02-03 11:39:51 -08:00
Fabricio Voznika 437c986c6a Add vfs.FileDescription to FD table
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.

Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.

Updates #1487
Updates #1623

PiperOrigin-RevId: 292023223
2020-01-28 15:31:03 -08:00
Adin Scannell 253c9e666c Cleanup glog and add real caller information.
In general, we've learned that logging must be avoided at all
costs in the hot path. It's unlikely that the optimizations
here were significant in any case, since the buffer would certainly
escape.

This also adds a test to ensure that the caller identification
works as expected, and so that logging can be benchmarked.

Original:
BenchmarkGoogleLogging-6   	 1222255	       949 ns/op

With this change:
BenchmarkGoogleLogging-6   	  517323	      2346 ns/op

Fixes #184

PiperOrigin-RevId: 291815420
2020-01-27 16:08:35 -08:00
Adin Scannell 0e2f1b7abd Update package locations.
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.

PiperOrigin-RevId: 291811289
2020-01-27 15:31:32 -08:00