874 Commits

Author SHA1 Message Date
Tamir Duberstein b63e61828d Initialize Kernel.Timekeeper before network NS
PiperOrigin-RevId: 375843579
2021-05-25 18:57:38 -07:00
Tamir Duberstein a54cb9d8a2 Use specific fmt verbs (avoid %v)
Remove useless conversions. Avoid unhandled errors.

PiperOrigin-RevId: 375834275
2021-05-25 17:48:34 -07:00
Fabricio Voznika ec542dbedf Suppress log message when there is no error
PiperOrigin-RevId: 374981100
2021-05-20 17:14:19 -07:00
Dean Deng 894187b2c6 Resolve remaining O_PATH TODOs.
O_PATH is now implemented in vfs2.

Fixes #2782.

PiperOrigin-RevId: 373861410
2021-05-14 14:04:46 -07:00
gVisor bot 3894c9fcb9 Merge pull request #5983 from btw616:fix/issue-5982
PiperOrigin-RevId: 373661350
2021-05-13 14:50:03 -07:00
Fabricio Voznika f3478b7516 Fix problem with grouped cgroups
cgroup controllers can be grouped together (e.g. cpu,cpuacct) and
that was confusing Cgroup.Install() into thinking that a cgroup
directory was created by the caller, when it had been created by
another controller that is grouped together.

PiperOrigin-RevId: 373661336
2021-05-13 14:44:08 -07:00
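The controller grouping described in f3478b7516 shows up in `/proc/[pid]/cgroup`, where a single line can name several controllers that share one directory. The following is a minimal Go sketch of that relationship, not runsc's actual code: `controllerPaths` is a hypothetical helper, and the sample paths are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// controllerPaths parses lines in the /proc/[pid]/cgroup format
// ("hierarchy-ID:controller-list:path") and returns a map from each
// controller name to its cgroup path. Grouped controllers such as
// "cpu,cpuacct" share one line and therefore one path.
// This helper is hypothetical and is not part of runsc.
func controllerPaths(lines []string) map[string]string {
	paths := make(map[string]string)
	for _, line := range lines {
		// SplitN keeps any ':' inside the path intact.
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 || parts[1] == "" {
			continue // malformed line, or an entry that names no controllers
		}
		for _, ctrl := range strings.Split(parts[1], ",") {
			paths[ctrl] = parts[2]
		}
	}
	return paths
}

func main() {
	// Sample input; the paths are illustrative only.
	paths := controllerPaths([]string{
		"4:cpu,cpuacct:/docker/abc",
		"3:memory:/docker/abc",
	})
	// cpu and cpuacct resolve to the same directory, so creating it for
	// one controller means it already exists for the other.
	fmt.Println(paths["cpu"] == paths["cpuacct"])
}
```

Code that tracks which directories it created would need to key that bookkeeping by path rather than by controller name to avoid the confusion the commit describes.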
Tiwei Bie ddaa36bde5 Fix file descriptor leak in MultiGetAttr
We need to make sure that all children are closed before
returning. But the last child saved in parent isn't closed
after we successfully iterate over all the files in "names".
This patch fixes this issue.

Fixes #5982

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2021-05-13 09:08:20 +08:00
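The leak pattern described in ddaa36bde5 can be sketched in Go as follows. This is an illustration under stated assumptions, not the gofer's MultiGetAttr code: `node`, `walkAll`, and the `open` callback are hypothetical stand-ins for the real file handles and walk logic.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an open file handle.
type node struct {
	name   string
	closed bool
}

// Close marks the handle as released.
func (n *node) Close() { n.closed = true }

// walkAll opens one child per name, keeping the most recent child as the
// parent for the next step. Every child opened here is closed before the
// function returns, including the final one left in parent when the loop
// completes. root belongs to the caller and is never closed here.
func walkAll(root *node, names []string, open func(parent *node, name string) *node) {
	parent := root
	// Runs on every return path, so the last child cannot leak.
	defer func() {
		if parent != root {
			parent.Close()
		}
	}()
	for _, name := range names {
		child := open(parent, name)
		if child == nil {
			return // lookup failed; the deferred func closes the current parent
		}
		if parent != root {
			parent.Close()
		}
		parent = child
	}
}

func main() {
	var opened []*node
	open := func(_ *node, name string) *node {
		n := &node{name: name}
		opened = append(opened, n)
		return n
	}
	root := &node{name: "/"}
	walkAll(root, []string{"a", "b", "c"}, open)
	for _, n := range opened {
		fmt.Println(n.name, n.closed)
	}
}
```

Putting the final close in a deferred function covers both the early-return path and the successful end of the loop, which is the case the commit message says was missed.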
gVisor bot 6c349c675c Merge pull request #5764 from zhlhahaha:2126-2
PiperOrigin-RevId: 372993341
2021-05-10 12:59:03 -07:00
gVisor bot e691004e0c Merge pull request #5758 from zhlhahaha:2125
PiperOrigin-RevId: 372608247
2021-05-07 12:39:14 -07:00
howard zhang 0bff4afd0f Init all vCPU when initializing machine on ARM64
This patch fixes a problem where the vCPU timer becomes incorrect when
vCPUs are added dynamically on ARM64. For details, please refer to:
https://github.com/google/gvisor/issues/5739

There is no effect on x86. The main changes for ARM64 are:
1. Create maxVCPUs vCPUs during machine initialization.
2. To keep the gVisor vCPU count in sync with the host CPU count,
use the smaller of runtime.NumCPU and KVM_CAP_MAX_VCPUS as
maxVCPUs.
3. Put unused vCPUs into the architecture-specific map initialvCPUs.
4. When the machine needs to bind a new vCPU to a tid, pick a vCPU
from the map initialvCPUs rather than creating a new one.
5. Change the setSystemTime function. As the number of vCPUs
increases, the time cost of setTSC (which uses a syscall to set
cntvoff) grows linearly from around 300 ns to 100000 ns, which
prevents setSystemTimeLegacy from getting the correct offset
value.
6. Initialize StdioFDs and goferFD before the platform to avoid
StdioFDs conflicting with vCPU fds.

Signed-off-by: howard zhang <howard.zhang@arm.com>
2021-05-07 16:42:58 +08:00
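Items 1 through 4 of 0bff4afd0f amount to a pre-allocated pool. The Go sketch below shows that idea only; the `vCPU` and `machine` types, their field names, and the `bind` method are hypothetical simplifications, and the KVM limit is passed in as a plain integer rather than queried from the host.

```go
package main

import (
	"fmt"
	"runtime"
)

// vCPU is a hypothetical stand-in for the platform's vCPU type.
type vCPU struct{ id int }

// machine holds pre-created vCPUs. Field names are illustrative only.
type machine struct {
	maxVCPUs     int
	initialVCPUs map[int]*vCPU    // created at init, not yet bound to a thread
	bound        map[uint64]*vCPU // thread ID -> vCPU
}

// newMachine creates every vCPU up front. hostCPUs stands for
// runtime.NumCPU() and kvmMaxVCPUs for the value the host reports for
// KVM_CAP_MAX_VCPUS; the smaller of the two becomes maxVCPUs.
func newMachine(hostCPUs, kvmMaxVCPUs int) *machine {
	n := hostCPUs
	if kvmMaxVCPUs < n {
		n = kvmMaxVCPUs
	}
	m := &machine{
		maxVCPUs:     n,
		initialVCPUs: make(map[int]*vCPU, n),
		bound:        make(map[uint64]*vCPU),
	}
	for i := 0; i < n; i++ {
		m.initialVCPUs[i] = &vCPU{id: i}
	}
	return m
}

// bind returns the vCPU for tid, taking one from the pre-created pool
// instead of creating a new one. It returns nil when the pool is empty.
func (m *machine) bind(tid uint64) *vCPU {
	if c, ok := m.bound[tid]; ok {
		return c
	}
	for id, c := range m.initialVCPUs {
		delete(m.initialVCPUs, id)
		m.bound[tid] = c
		return c
	}
	return nil
}

func main() {
	// 512 is an arbitrary example limit, not a value read from KVM.
	m := newMachine(runtime.NumCPU(), 512)
	c := m.bind(1234)
	fmt.Println(c != nil, len(m.initialVCPUs) == m.maxVCPUs-1)
}
```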
Fabricio Voznika 9f33fe64f2 Fixes to runsc cgroups
When loading cgroups for another process, `/proc/self` was used in
a few places, causing the end state to be a mix of the process
and self. This is now fixed to always use the proper `/proc/[pid]`
path.

Added net_prio and net_cls to the list of optional controllers. This
is to allow runsc to execute when these cgroups are disabled as long
as there are no net_prio and net_cls limits that need to be applied.

Deflake TestMultiContainerEvent.

Closes #5875
Closes #5887

PiperOrigin-RevId: 372242687
2021-05-05 17:39:29 -07:00
Rahat Mahmood e00bd82816 Remove uses of the binary package from the rest of the sentry.
PiperOrigin-RevId: 372020696
2021-05-04 16:41:08 -07:00
Fabricio Voznika 95df852bf2 Make Mount.Type optional for bind mounts
According to the OCI spec Mount.Type is an optional field and it
defaults to "bind" when any of "bind" or "rbind" is included in
Mount.Options.

Also fix the shim to remove bind/rbind from options when mount is
converted from bind to tmpfs inside the Sentry.

Fixes #2330
Fixes #3274

PiperOrigin-RevId: 371996891
2021-05-04 14:36:06 -07:00
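The defaulting rule stated in 95df852bf2 is small enough to sketch directly. This Go example is an illustration, not the runsc implementation: `mount` is a local stand-in that mirrors only two fields of an OCI mount, and `effectiveType` is a hypothetical helper name.

```go
package main

import "fmt"

// mount mirrors only the fields of an OCI mount that matter here. It is a
// local stand-in, not the runtime-spec type.
type mount struct {
	Type    string
	Options []string
}

// effectiveType returns m.Type when it is set. When Type is empty and
// Options contains "bind" or "rbind", it returns "bind". Otherwise it
// returns the empty string and leaves the decision to the caller.
func effectiveType(m mount) string {
	if m.Type != "" {
		return m.Type
	}
	for _, o := range m.Options {
		if o == "bind" || o == "rbind" {
			return "bind"
		}
	}
	return ""
}

func main() {
	m := mount{Options: []string{"rbind", "ro"}}
	fmt.Println(effectiveType(m))
}
```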
Fabricio Voznika 26adb3c474 Automated rollback of changelist 369686285
PiperOrigin-RevId: 371015541
2021-04-28 17:02:33 -07:00
Nayana Bidari 0a6eaed50b Add weirdness sentry metric.
Weirdness metric contains fields to track the number of clock fallbacks,
partial results, and vsyscalls. This metric will avoid the overhead of
having three different metrics (fallbackMetric, partialResultMetric,
vsyscallCount).

PiperOrigin-RevId: 369970218
2021-04-22 16:07:15 -07:00
Michael Pratt c2955339d8 Automated rollback of changelist 369325957
PiperOrigin-RevId: 369686285
2021-04-21 10:41:28 -07:00
Adin Scannell 8192cccda6 Clean test tags.
PiperOrigin-RevId: 369505182
2021-04-20 13:11:25 -07:00
Dean Deng 20b1c3c632 Move runsc reference leak checking to better locations.
In the previous spot, there was a roughly 50% chance that leak checking would
actually run. Move it to the waitContainer() call on the root container, where
it is guaranteed to run before the sandbox process is terminated. Add it to
runsc/cli/main.go as well for good measure, in case the sandbox exit path does
not involve waitContainer().

PiperOrigin-RevId: 369329796
2021-04-19 16:48:27 -07:00
Fabricio Voznika 276ff149a4 Add MultiGetAttr message to 9P
While using remote-validation, the vast majority of time spent during
FS operations is re-walking the path to check for modifications and
then closing the file given that in most cases it has not been
modified externally.

This change introduces a new 9P message called MultiGetAttr, which bulk-queries
the attributes of several files in one shot. The returned attributes are
then used to update cached dentries before they are walked. File attributes
are updated for files that still exist. Dentries that have been deleted are
removed from the cache. And negative cache entries are removed if a new
file/directory was created externally. Similarly, synthetic dentries are
replaced if a file/directory is created externally.

The bulk update needs to be careful not to follow symlinks or cross mount
points, because the gofer doesn't know how to resolve symlinks or where
mount points are located. It also doesn't walk to the parent ("..") to
avoid deadlocks.

Here are the results:

Workload        VFS1       VFS2      Change
bazel action    115s       70s       28.8s
Stat/100        11,043us   7,623us   974us

Updates #1638

PiperOrigin-RevId: 369325957
2021-04-19 16:25:01 -07:00
Dean Deng 0c3e8daf50 Allow runsc to generate coverage reports.
Add a coverage-report flag that will cause the sandbox to generate a coverage
report (with suffix .cov) in the debug log directory upon exiting. For the
report to be generated, runsc must have been built with the following Bazel
flags: `--collect_code_coverage --instrumentation_filter=...`.

With coverage reports, we should be able to aggregate results across all tests
to surface code coverage statistics for the project as a whole.

The report is simply a text file with each line representing a covered block
as `file:start_line.start_col,end_line.end_col`. Note that this is similar to
the format of coverage reports generated with `go test -coverprofile`,
although we omit the count and number of statements, which are not useful for
us.

Some simple ways of getting coverage reports:

bazel test <some_test> --collect_code_coverage \
  --instrumentation_filter=//pkg/...

bazel build //runsc --collect_code_coverage \
  --instrumentation_filter=//pkg/...
runsc -coverage-report=dir/ <other_flags> do ...

PiperOrigin-RevId: 368952911
2021-04-16 17:56:16 -07:00
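The report line format given in 0c3e8daf50, `file:start_line.start_col,end_line.end_col`, can be read back with a few lines of Go. This is a sketch for consuming such a report, not code from runsc: `block` and `parseBlock` are hypothetical names, and the file path in `main` is a made-up example.

```go
package main

import (
	"fmt"
	"strings"
)

// block is one covered region from a report line of the form
// file:start_line.start_col,end_line.end_col.
type block struct {
	File      string
	StartLine int
	StartCol  int
	EndLine   int
	EndCol    int
}

// parseBlock splits a report line at its last ':' and scans the four
// numbers that follow. It is a hypothetical helper for illustration.
func parseBlock(line string) (block, error) {
	i := strings.LastIndex(line, ":")
	if i < 0 {
		return block{}, fmt.Errorf("missing ':' in %q", line)
	}
	b := block{File: line[:i]}
	_, err := fmt.Sscanf(line[i+1:], "%d.%d,%d.%d",
		&b.StartLine, &b.StartCol, &b.EndLine, &b.EndCol)
	if err != nil {
		return block{}, fmt.Errorf("bad range in %q: %w", line, err)
	}
	return b, nil
}

func main() {
	// The path below is an example, not taken from a real report.
	b, err := parseBlock("pkg/example/file.go:12.3,14.20")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", b)
}
```

Because each line carries no count, aggregating reports across tests reduces to taking the union of the parsed blocks.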
Zach Koopmans 025cff180c Internal change
PiperOrigin-RevId: 368919504
2021-04-16 14:28:23 -07:00
Adin Scannell cbf00d633d Clarify platform errors.
PiperOrigin-RevId: 367446222
2021-04-08 09:34:01 -07:00
Adin Scannell 192f20788b Add internal staging tags to //runsc and //shim binaries.
PiperOrigin-RevId: 367328273
2021-04-07 17:13:11 -07:00
Chong Cai e21a71bff1 Allow user mount for verity fs
Allow users to mount a verity fs on an existing mount by specifying mount
flags root_hash and lower_path.

PiperOrigin-RevId: 366843846
2021-04-05 12:01:44 -07:00
Chong Cai 58afd120d3 Set Verity bit in verity_prepare cmd
This is needed to enable Xattrs features required by verity.

PiperOrigin-RevId: 366843640
2021-04-05 11:56:59 -07:00
Rahat Mahmood 932c8abd0f Implement cgroupfs.
A skeleton implementation of cgroupfs. It supports trivial cpu and
memory controllers with no support for hierarchies.

PiperOrigin-RevId: 366561126
2021-04-02 21:10:44 -07:00
Rahat Mahmood 491b106d62 Implement the runsc verity-prepare command.
Implement a new runsc command to set up a sandbox with verityfs and
run the measure tool. This is loosely forked from the do command, and
currently requires the caller to provide the measure tool binary.

PiperOrigin-RevId: 366553769
2021-04-02 19:34:50 -07:00
Howard Zhang 73679fae2a Disable mitigate and related test on ARM64
As the MDS side channel attack does not affect ARM64, we disable
mitigate on ARM64 to prevent misuse.

For more detail, please refer to:
https://access.redhat.com/security/vulnerabilities/mds

Signed-off-by: Howard Zhang <howard.zhang@arm.com>
2021-04-01 10:56:33 +08:00
Fabricio Voznika 71f3dccbb3 Fix panic when overriding /dev files with VFS2
VFS1 skips over mounts that override files in /dev because the list of
files is hardcoded. This is not needed for VFS2 and a recent change
lifted this restriction. However, parts of the code were still skipping
/dev mounts even in VFS2, causing the loader to panic when it ran short
of FDs to connect to the gofer.

PiperOrigin-RevId: 365858436
2021-03-30 11:36:55 -07:00
Fabricio Voznika 960155cdaa Add --file-access-mounts flag
--file-access-mounts flag is similar to --file-access, but controls
non-root mounts that were previously mounted in shared mode only.
This gives more flexibility to control how mounts are shared within
a container.

PiperOrigin-RevId: 364669882
2021-03-23 16:21:12 -07:00
Kevin Krakauer 92374e5197 setgid directory support in goferfs
Also adds support for clearing the setuid bit when appropriate (writing,
truncating, changing size, changing UID, or changing GID).

VFS2 only.

PiperOrigin-RevId: 364661835
2021-03-23 15:42:12 -07:00
Chong Cai beb11cec76 Allow FSETXATTR/FGETXATTR host calls for Verity
These host calls are needed for Verity fs to generate/verify hashes.

PiperOrigin-RevId: 364598180
2021-03-23 11:06:02 -07:00
Jamie Liu 5c4f4ed9eb Skip /dev submount hack on VFS2.
containerd usually configures both /dev and /dev/shm as tmpfs mounts, e.g.:

```
  "mounts": [
    ...
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    ...
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=67108864"
      ]
    },
    ...
```

(This is mostly consistent with how Linux is usually configured, except that
/dev is conventionally devtmpfs, not regular tmpfs. runc/libcontainer
implements OCI-runtime-spec-undocumented behavior to create
/dev/{ptmx,fd,stdin,stdout,stderr} in non-bind /dev mounts. runsc silently
switches /dev to devtmpfs. In VFS1, this is necessary to get device files like
/dev/null at all, since VFS1 doesn't support real device special files, only
what is hardcoded in devfs. VFS2 does support device special files, but using
devtmpfs is the easiest way to get pre-created files in /dev.)

runsc ignores many /dev submounts in the spec, including /dev/shm. In VFS1,
this appears to be to avoid introducing a submount overlay for /dev, and is
mostly fine since the typical mode for the /dev/shm mount is ~consistent with
the mode of the /dev/shm directory provided by devfs (modulo the sticky bit).
In VFS2, this is vestigial (VFS2 does not use submount overlays), and devtmpfs'
/dev/shm mode is correct for the mount point but not the mount. So turn off
this behavior for VFS2.

After this change:

```
$ docker run --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwt 2 root root  40 Mar 18 00:16 .
drwxr-xr-x 5 root root 360 Mar 18 00:16 ..

$ docker run --runtime=runsc --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwx 1 root root 0 Mar 18 00:16 .
dr-xr-xr-x 1 root root 0 Mar 18 00:16 ..

$ docker run --runtime=runsc-vfs2 --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwt 2 root root  40 Mar 18 00:16 .
drwxr-xr-x 5 root root 320 Mar 18 00:16 ..
```

Fixes #5687

PiperOrigin-RevId: 363699385
2021-03-18 11:12:43 -07:00
Rahat Mahmood c5667022b6 Report filesystem-specific mount options.
PiperOrigin-RevId: 362406813
2021-03-11 16:49:36 -08:00
Zach Koopmans a82bd04e2a Major refactor of runsc mitigate.
PiperOrigin-RevId: 362360425
2021-03-11 13:10:08 -08:00
Fabricio Voznika 14fc2ddd6c Update flock to v0.8.0
PiperOrigin-RevId: 361962416
2021-03-09 20:54:15 -08:00
Fabricio Voznika e0e04814b4 Fix invalid interface conversion in runner
panic: interface conversion: interface {} is syscall.WaitStatus, not unix.WaitStatus

goroutine 1 [running]:
main.runTestCaseNative(0xc0001fc000, 0xe3, 0xc000119b60, 0x1, 0x1, 0x0, 0x0)
	test/runner/runner.go:185 +0xa94
main.main()
	test/runner/runner.go:118 +0x745

PiperOrigin-RevId: 361957796
2021-03-09 20:12:20 -08:00
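The panic quoted in e0e04814b4 is what Go produces when an unchecked type assertion names a different type than the one stored in the interface, even if both share an underlying type. The sketch below reproduces the situation with two local stand-in types instead of the real `syscall.WaitStatus` and `unix.WaitStatus`; `rawStatus` is a hypothetical helper and does not reflect how the test runner was actually fixed.

```go
package main

import "fmt"

// sysWaitStatus and unixWaitStatus are two distinct named types with the
// same underlying type. They stand in for syscall.WaitStatus and
// unix.WaitStatus, which are likewise separate types.
type sysWaitStatus uint32
type unixWaitStatus uint32

// rawStatus extracts the numeric status from v without panicking. A type
// switch (or the comma-ok form of a type assertion) reports a mismatch
// instead of crashing the way v.(unixWaitStatus) would.
func rawStatus(v interface{}) (uint32, bool) {
	switch ws := v.(type) {
	case sysWaitStatus:
		return uint32(ws), true
	case unixWaitStatus:
		return uint32(ws), true
	}
	return 0, false
}

func main() {
	var v interface{} = sysWaitStatus(7)

	// The comma-ok form returns ok == false rather than panicking.
	_, ok := v.(unixWaitStatus)
	fmt.Println(ok)

	status, found := rawStatus(v)
	fmt.Println(status, found)
}
```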
Chong Cai 8018bf62ba Internal change.
PiperOrigin-RevId: 361689477
2021-03-08 16:56:16 -08:00
Ayush Ranjan e668288faf [op] Replace syscall package usage with golang.org/x/sys/unix in runsc/.
The syscall package has been deprecated in favor of golang.org/x/sys.

Note that syscall is still used in some places because the following don't seem
to have an equivalent in unix package:
- syscall.SysProcIDMap
- syscall.Credential

Updates #214

PiperOrigin-RevId: 361381490
2021-03-06 22:07:07 -08:00
Zach Koopmans b8a5420f49 Add reverse flag to mitigate.
Add reverse operation to mitigate that just enables
all CPUs.

PiperOrigin-RevId: 360511215
2021-03-02 14:10:51 -08:00
gVisor bot 8f6274404a Merge pull request #5519 from dqminh:runsc-ps-pids
PiperOrigin-RevId: 359334029
2021-02-24 11:47:27 -08:00
Andrei Vagin 055073f118 runsc/filters: permit clock_nanosleep for race
Syzkaller hosts contain many audit messages showing that runsc tries
to call the clock_nanosleep syscall.

PiperOrigin-RevId: 359331413
2021-02-24 11:36:59 -08:00
Daniel Dao 306a9477da return root pids with runsc ps
`runsc ps` currently returns the pid in a task's immediate pid namespace,
which is confusing when there are multiple pid namespaces. We should
return only pids in the root namespace.

Before:

```
1000      1         0         0         ?         02:24     250ms     chrome
1000      1         0         0         ?         02:24     40ms      dumb-init
1000      1         0         0         ?         02:24     240ms     chrome
1000      2         1         0         ?         02:24     2.78s     node
```

After:

```
UID       PID       PPID      C         TTY       STIME     TIME      CMD
1000      1         0         0         ?         12:35     0s        dumb-init
1000      2         1         7         ?         12:35     240ms     node
1000      13        2         21        ?         12:35     2.33s     chrome
1000      27        13        3         ?         12:35     260ms     chrome
```

Signed-off-by: Daniel Dao <dqminh@cloudflare.com>
2021-02-24 15:20:43 +00:00
Zach Koopmans 24ea8003a4 Only detect mds for mitigate.
Only detect and mitigate on mds for the mitigate command.

PiperOrigin-RevId: 358924466
2021-02-22 16:02:32 -08:00
Fabricio Voznika 34e2cda9ad Return nicer error message when cgroups v1 isn't available
Updates #3481
Closes #5430

PiperOrigin-RevId: 358923208
2021-02-22 15:57:07 -08:00
Fabricio Voznika 19fe3a2bfb Fix `runsc kill --pid`
Previously, loader.signalProcess was inconsistently using both root and
container's PID namespace to find the process. It used root namespace
for the exec'd process and container's PID namespace for other processes.
This fixes the code to use the root PID namespace across the board, which
is the same PID reported in `runsc ps` (or soon will be, after
https://github.com/google/gvisor/pull/5519).

PiperOrigin-RevId: 358836297
2021-02-22 09:33:46 -08:00
Adin Scannell 3ef012944d Stop the control server only once.
Operations are now shut down automatically by the main Stop
command, and it is not necessary to call Stop during Destroy.

Fixes #5454

PiperOrigin-RevId: 357295930
2021-02-12 17:13:44 -08:00
Fabricio Voznika 192780946f Allow rt_sigaction in gofer seccomp
rt_sigaction may be called by Go runtime when trying to panic:

https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;drc=ed3e4afa12d655a0c5606bcf3dd4e1cdadcb1476;bpv=1;bpt=1;l=780?q=rt_sigaction&ss=go

Updates #5038

PiperOrigin-RevId: 357013186
2021-02-11 11:01:21 -08:00
Zach Koopmans 1ac58cc23e Add mitigate command to runsc
PiperOrigin-RevId: 356772367
2021-02-10 10:48:48 -08:00
Ting-Yu Wang 120c8e3468 Replace TaskFromContext(ctx).Kernel() with KernelFromContext(ctx)
A panic was seen on code paths such as control.ExecAsync, where
ctx does not have a Task.

Reported-by: syzbot+55ce727161cf94a7b7d6@syzkaller.appspotmail.com
PiperOrigin-RevId: 355960596
2021-02-05 17:28:01 -08:00