655 Commits

Author SHA1 Message Date
Jamie Liu 9115f26851 Allocate device numbers for VFS2 filesystems.
Updates #1197, #1198, #1672

PiperOrigin-RevId: 310432006
2020-05-07 14:01:53 -07:00
Bhasker Hariharan 28b5565fdd Automated rollback of changelist 309339316
PiperOrigin-RevId: 310417191
2020-05-07 12:48:23 -07:00
Dean Deng 16da7e790f Update privateunixsocket TODOs.
Synthetic sockets do not have the race condition issue in VFS2, and we will
get rid of privateunixsocket as well.

Fixes #1200.

PiperOrigin-RevId: 310386474
2020-05-07 10:20:48 -07:00
Adin Scannell 279f1eb7ab Fix runsc syscall documentation generation.
We can register any number of tables with any number of architectures, and
need not limit the definitions to the architecture in question. This allows
runsc to generate documentation for all architectures simultaneously.

Similarly, this simplifies the VFSv2 patching process.

PiperOrigin-RevId: 310224827
2020-05-06 14:13:48 -07:00
Fabricio Voznika e2b0e0e272 Enable TestRunNonRoot on VFS2
Also add back the default test dimension, which was
dropped in a previous refactor.

PiperOrigin-RevId: 309797327
2020-05-04 12:29:03 -07:00
Fabricio Voznika 0a307d0072 Mount VFS2 filesystem using root credentials
PiperOrigin-RevId: 309787938
2020-05-04 11:48:00 -07:00
Fabricio Voznika cbc5bef2a6 Add TTY support on VFS2 to runsc
Updates #1623, #1487

PiperOrigin-RevId: 309777922
2020-05-04 10:59:20 -07:00
Bhasker Hariharan 8962b7840f Enable FIFO QDisc by default in runsc.
Updates #231

PiperOrigin-RevId: 309339316
2020-04-30 18:29:57 -07:00
Bhasker Hariharan ae15d90436 FIFO QDisc implementation
Updates #231

PiperOrigin-RevId: 309323808
2020-04-30 16:41:00 -07:00
gVisor bot d5c34ba2ff Merge pull request #2487 from moricho:fix/bindmount
PiperOrigin-RevId: 309082540
2020-04-29 13:13:51 -07:00
gVisor bot ceb3c0e062 Merge pull request #2558 from prattmic:forward_signal
PiperOrigin-RevId: 308829800
2020-04-28 08:43:49 -07:00
gVisor bot 316394ee89 Merge pull request #2544 from prattmic:runsc_do_cleanup
PiperOrigin-RevId: 308727526
2020-04-27 17:01:33 -07:00
Michael Pratt 147c8ba1f7 runsc: extend do network cleanup
Previously we unconditionally failed to clean up the networking files
(hostname, resolv.conf, hosts), and failed to clean up the netns, etc., on
partial setup failure.

We can drop the iptables commands from cleanup, as the routes
automatically go away when the device is deleted. Those commands were
failing previously.

Forward signals to the container, allowing it to exit normally when a
signal is received, and then for runsc to run the cleanup. This doesn't
cover cleanup when runsc is signalled before the container starts, but it
covers the most common case.

Fixes #2539
Fixes #2540
2020-04-27 16:36:07 -04:00
Michael Pratt b15d49a137 container: use sighandling package
Use the sighandling package for Container.ForwardSignals, for
consistency with other signal forwarding.

Fixes #2546
2020-04-27 11:52:43 -04:00
kevin.xu 9a4ae0322e Update container.go
typo, should be `start` in comments
2020-04-27 21:53:04 +08:00
moricho fc53d64367 refactor and add test for bindmount
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-26 17:24:34 +09:00
Zach Koopmans 17ac90a203 Add container tests passing with VFS2
Several tests pass after the changes that got TestAppExitStatus (run /bin/true)
working. Add versions of them that run via VFS2 so that we know what is and
isn't working.

In addition, fix a bug in VFSFile ReadFull. For the TestExePath test in
container_test.go, the case "unmasked" returns 0 bytes read with no
EOF error, causing the ReadFull call to spin.

PiperOrigin-RevId: 308428126
2020-04-25 11:27:23 -07:00
moricho 0b3166f624 add bind/rbind options for mount
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-25 22:04:39 +09:00
moricho 93e510e26f fix behavior of `getMountNameAndOptions` when options include either bind or rbind
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-25 22:04:39 +09:00
Zach Koopmans 15a822a193 VFS2: Get HelloWorld image tests to pass with VFS2
This change includes:
- Modifications to loader_test.go to get TestCreateMountNamespace to
pass with VFS2.
- Changes necessary to get TestHelloWorld in image tests to pass with
VFS2. This means runsc can run the hello-world container with docker
on VSF2.

Note: Containers that use sockets will not run with these changes.
See "//test/image/...". Any tests here with sockets currently fail
(which is all of them but HelloWorld).
PiperOrigin-RevId: 308363072
2020-04-24 18:23:37 -07:00
Fabricio Voznika 4af39dd1c5 Propagate PID limit from OCI to sandbox cgroup
Closes #2489

PiperOrigin-RevId: 308362434
2020-04-24 18:17:01 -07:00
Dean Deng 632b104aff Plumb context.Context into kernfs.Inode.Open().
PiperOrigin-RevId: 308304793
2020-04-24 12:37:49 -07:00
Dean Deng 1b88c63b3e Move hostfs mount to Kernel struct.
This is needed to set up host fds passed through a Unix socket. Note that
the host package depends on kernel, so we cannot set up the hostfs mount
directly in Kernel.Init as we do for sockfs and pipefs.

Also, adjust sockfs to make its setup look more like hostfs's and pipefs's.

PiperOrigin-RevId: 308274053
2020-04-24 10:03:43 -07:00
Jamie Liu 5042ea7e2c Add vfs.MkdirOptions.ForSyntheticMountpoint.
PiperOrigin-RevId: 308143529
2020-04-23 15:37:10 -07:00
Adin Scannell 1481499fe2 Simplify Docker test infrastructure.
This change adds a layer of abstraction around the internal Docker APIs,
and eliminates all direct dependencies on Dockerfiles in the infrastructure.

A subsequent change will automate the generation of local images (with
efficient caching). Note that this change drops the use of bazel container
rules, as that experiment does not seem to be viable.

PiperOrigin-RevId: 308095430
2020-04-23 11:33:30 -07:00
Nicolas Lacasse e69a871c7b Move user home detection to its own library.
PiperOrigin-RevId: 307977689
2020-04-22 22:18:21 -07:00
Andrei Vagin 0c586946ea Specify a memory file in platform.New().
PiperOrigin-RevId: 307941984
2020-04-22 17:50:10 -07:00
Adin Scannell 1a597e01be Add a functional vm_test for root_test.
This change renames the tools/images directory to tools/vm for clarity, and
adds a functional vm_test. Sharding is also added to the same test, and
documentation is added for key flags and variables to describe how they work.

Subsequent changes will add vm_tests for other cases, such as the runtime tests.

PiperOrigin-RevId: 307492245
2020-04-20 15:48:27 -07:00
Fabricio Voznika a80cd43023 Add test name to boot and gofer log files
This makes it easier to find the corresponding logs in
case a test fails.

PiperOrigin-RevId: 307104283
2020-04-17 13:28:54 -07:00
Zach Koopmans 12bde95635 Get /bin/true to run on VFS2
Included:
- loader_test.go RunTest and TestStartSignal VFS2
- container_test.go TestAppExitStatus on VFS2
- experimental flag added to runsc to turn on VFS2

Note: shared mounts are not yet supported.

PiperOrigin-RevId: 307070753
2020-04-17 10:39:19 -07:00
Fabricio Voznika 5a8ee1beee Preserve log FD after execve
PiperOrigin-RevId: 306908296
2020-04-16 13:17:00 -07:00
gVisor bot ac9b32c36b Merge pull request #2212 from aaronlu:dup_stdioFDs
PiperOrigin-RevId: 306477639
2020-04-14 11:20:11 -07:00
Ian Lewis daf3322498 Add logging message for noNewPrivileges OCI option.
noNewPrivileges is ignored if set to false since gVisor assumes that
PR_SET_NO_NEW_PRIVS is always enabled.

PiperOrigin-RevId: 305991947
2020-04-10 20:32:23 -07:00
Fabricio Voznika 96f9142959 Use O_CLOEXEC when dup'ing FDs
The sentry doesn't allow execve, but it's a good
defense-in-depth measure.

PiperOrigin-RevId: 305958737
2020-04-10 15:47:23 -07:00
gVisor bot 78126611e6 Merge pull request #2253 from amscanne:nogo
PiperOrigin-RevId: 305807868
2020-04-09 19:16:46 -07:00
Fabricio Voznika 2a28e3e9c3 Don't unconditionally set --panic-signal
Closes #2393

PiperOrigin-RevId: 305793027
2020-04-09 17:20:14 -07:00
Fabricio Voznika 6dd5a1f3fe Clean up TODOs
PiperOrigin-RevId: 305592245
2020-04-08 17:58:13 -07:00
Adin Scannell 928a7c60b8 Fix all printf formatting errors.
Updates #2243
2020-04-08 10:14:34 -07:00
Adin Scannell 94b793262d Fix all copy locks violations.
This required minor restructuring of how system call tables were saved
and restored, but it makes way more sense this way.

Updates #2243
2020-04-08 10:00:14 -07:00
Ian Lewis 56054fc1fb Add friendlier messages for frequently encountered errors.
Issue #2270
Issue #1765

PiperOrigin-RevId: 305385436
2020-04-07 18:51:01 -07:00
Ian Lewis 5802051b3d Update TODO to #238
Move TODO to #238 so that proper synchronization of operations is handled
when we create the urpc client.

Issue #238
Fixes #512

PiperOrigin-RevId: 305383924
2020-04-07 18:39:33 -07:00
Andrei Vagin acf0259255 Don't map the 0 uid into a sandbox user namespace
Starting with go1.13, we can specify ambient capabilities when we execute a new
process with os/exec.Cmd.

PiperOrigin-RevId: 305366706
2020-04-07 16:46:05 -07:00
Dean Deng fc72eb3595 Remove TODOs for local gofer extended attributes.
PiperOrigin-RevId: 305344989
2020-04-07 14:48:40 -07:00
Adin Scannell 4e6a1a5adb Automated rollback of changelist 303799678
PiperOrigin-RevId: 304221302
2020-04-01 11:06:26 -07:00
Aaron Lu 0cfdd47391 checkpoint/restore: make sure the donated stdioFDs have the same value
Suppose I start a runsc container using the kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donated FDs and the corresponding cmdline for runsc-sandbox are:

D0313 17:50:12.608203   44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214   44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224   44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229   44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234   44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238   44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242   44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246   44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249   44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253   44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257   44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls= --strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]

Note that stdioFDs start from 10 with the kvm platform, and stderr's FD is 12.

If I restore a container from the checkpoint image that was created
by checkpointing the above rootbash container, but either omit the
platform switch or explicitly specify the ptrace platform:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash

the donated FDs and the corresponding cmdline for runsc-sandbox are:

D0313 17:50:15.258632   44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640   44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645   44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648   44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653   44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657   44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661   44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675   44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680   44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684   44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]

Note that this time stdioFDs start from 9 and stderr's FD is 11 (so the
saved host.descriptor.origFD for stderr, which is 12, is no longer valid).
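The shift follows from how descriptors are numbered in the child. They are assigned in donation order, so every optional file donated before stdin/stdout/stderr pushes those three up by one. The following small Go sketch shows that arithmetic. It assumes numbering follows os/exec's `ExtraFiles` rule, under which entry i becomes descriptor 3+i. `donatedFDs` is a hypothetical helper, and callers pass short labels for the entries in the logs above.

```go
package main

// donatedFDs returns the child descriptor number each named file receives when
// the files are passed in order, following os/exec's ExtraFiles rule that
// entry i becomes descriptor 3+i.
func donatedFDs(names []string) map[string]int {
	fds := make(map[string]int, len(names))
	for i, name := range names {
		fds[name] = 3 + i
	}
	return fds
}
```

Both logs share six entries, from the debug log through the sandbox IO FD. Donating /dev/kvm before the stdio files gives stderr descriptor 12, and leaving it out gives 11, matching the two logs. Donating the stdio files before any optional file would keep their numbers fixed. That is one possible approach, not necessarily the one this commit takes.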

For the three host-FD-based files, the s.Dev and s.Ino derived from
fstat(fd) should all be the same. Since these two fields are used
as the device.MultiDeviceKey, the host.inodeFileState.sattr.InodeID, which is
the value of MultiDevice.Map(MultiDeviceKey), should also be the same for all
three. Note that for a MultiDevice m, m.cache records the mapping of key to
value and m.rcache records the mapping of value to key. If the same value does
not map back to the same key, restore panics.

Now that stderr's origFD 12 is no longer valid (on restore in my test it
happens to be /memfd:runsc-memory), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() are not correct either.
However, its InodeID is still the same as the saved one, so MultiDevice.Load()
complains about the same value (InodeID) being mapped to different
keys (different from stdin's and stdout's) and panics with: "MultiDevice's
caches are inconsistent".
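The following hypothetical Go sketch shows the two-way cache and the consistency check that fails here. `deviceKey`, `multiDevice`, `Map`, and `Load` are illustrative names modeled on the description above, not the actual device package API. This version returns an error where the real code panics.

```go
package main

import "fmt"

// deviceKey identifies a host file by device and inode number, standing in
// for a key such as device.MultiDeviceKey.
type deviceKey struct {
	dev, ino uint64
}

// multiDevice is a hypothetical bidirectional map between host keys and
// sandbox inode IDs: cache maps key -> value and rcache maps value -> key.
type multiDevice struct {
	next   uint64
	cache  map[deviceKey]uint64
	rcache map[uint64]deviceKey
}

func newMultiDevice() *multiDevice {
	return &multiDevice{next: 1, cache: map[deviceKey]uint64{}, rcache: map[uint64]deviceKey{}}
}

// Map returns the inode ID for key, allocating a new one on first use.
func (m *multiDevice) Map(key deviceKey) uint64 {
	if v, ok := m.cache[key]; ok {
		return v
	}
	v := m.next
	m.next++
	m.cache[key] = v
	m.rcache[v] = key
	return v
}

// Load re-establishes a saved key -> value mapping on restore. It fails if
// either side is already bound to something else.
func (m *multiDevice) Load(key deviceKey, value uint64) error {
	if k, ok := m.rcache[value]; ok && k != key {
		return fmt.Errorf("caches are inconsistent: value %d maps to %v, not %v", value, k, key)
	}
	if v, ok := m.cache[key]; ok && v != value {
		return fmt.Errorf("caches are inconsistent: key %v maps to %d, not %d", key, v, value)
	}
	m.cache[key] = value
	m.rcache[value] = key
	if value >= m.next {
		m.next = value + 1
	}
	return nil
}
```

In the scenario above, stdin and stdout reload the same (key, ID) pair without trouble. stderr then arrives with the same ID but a key derived from the wrong host file, and the rcache check rejects it.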

Solve this problem by making sure that the stdioFDs for the root container's
init task are always the same at initial start and at restore time. The
cmdline the user passes (whether a debug log is specified, whether the
platform changes, and so on) must not affect the ability to restore.

Fixes #1844.
2020-03-31 11:37:11 +08:00
Adin Scannell 3fac85da95 kvm: handle exit reasons even under EINTR.
In the case of other signals (preemption), inject a normal bounce and
defer the signal until the vCPU has been returned from guest mode.

PiperOrigin-RevId: 303799678
2020-03-30 12:37:57 -07:00
Dean Deng 137f361400 Use host-defined file owner and mode, when possible, for imported fds.
Using the host-defined file owner matches VFS1. It is more correct to use the
host-defined mode, since the cached value may become out of date. However,
kernfs.Inode.Mode() does not return an error; other filesystems on kernfs are
in-memory, so retrieving the mode should not fail. Therefore, if the host
syscall fails, we rely on a cached value instead.

Updates #1672.

PiperOrigin-RevId: 303220864
2020-03-26 16:47:20 -07:00
Dean Deng 248e46f320 Whitelist utimensat(2).
utimensat is used by hostfs for setting timestamps on imported fds. Previously,
this would crash the sandbox since utimensat was not allowed.

Correct the VFS2 version of hostfs to match the call in VFS1.

PiperOrigin-RevId: 301970121
2020-03-19 23:30:21 -07:00
Fabricio Voznika 069f1edbe4 Improve error message when pivot_root fails
PiperOrigin-RevId: 301949722
2020-03-19 20:18:03 -07:00
Dean Deng 5e413cad10 Plumb VFS2 imported fds into virtual filesystem.
- When setting up the virtual filesystem, mount a host.filesystem to contain
  all files that need to be imported.
- Make read/preadv syscalls to the host in cases where preadv2 may not be
  supported yet (likewise for writing).
- Make save/restore functions in kernel/kernel.go return early if vfs2 is
  enabled.

PiperOrigin-RevId: 300922353
2020-03-14 07:14:33 -07:00