Commit Graph

384 Commits

Author SHA1 Message Date
Ridwan Sharif a63db7d903 Moved FUSE device under the fuse directory 2020-06-25 14:22:21 -04:00
Nicolas Lacasse 58880bf551 Port /dev/net/tun device to VFS2.
Updates #2912 #1035

PiperOrigin-RevId: 318162565
2020-06-24 16:23:44 -07:00
Bhasker Hariharan b070e218c6 Add support for Stack level options.
Linux controls socket send/receive buffers using a few sysctl variables
  - net.core.rmem_default
  - net.core.rmem_max
  - net.core.wmem_max
  - net.core.wmem_default
  - net.ipv4.tcp_rmem
  - net.ipv4.tcp_wmem

The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).

The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.

Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.

This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values and the default value buffers for UDP/RAW sockets is
updated to match the linux value of 212KiB up from the really low current value
of 32 KiB.

Updates #3043
Fixes #3043

PiperOrigin-RevId: 318089805
2020-06-24 10:24:20 -07:00
Nicolas Lacasse 0f328beb0d Port /dev/tty device to VFS2.
Support is limited to the functionality that exists in VFS1.

Updates #2923 #1035

PiperOrigin-RevId: 317981417
2020-06-23 18:48:37 -07:00
Kevin Krakauer 28b8a5cc3a iptables: remove metadata struct
Metadata was useful for debugging and safety, but enough tests exist that we
should see failures when (de)serialization is broken. It made stack
initialization more cumbersome and it's also getting in the way of ip6tables.

PiperOrigin-RevId: 317210653
2020-06-18 17:02:16 -07:00
Bhasker Hariharan 07ff909e76 Support setsockopt SO_SNDBUF/SO_RCVBUF for raw/udp sockets.
Updates #173,#6
Fixes #2888

PiperOrigin-RevId: 317087652
2020-06-18 06:07:20 -07:00
gVisor bot dbf786c6b3 Add runsc options to set checksum offloading status
--tx-checksum-offload=<true|false>
  enable TX checksum offload (default: false)
--rx-checksum-offload=<true|false>
  enable RX checksum offload (default: true)

Fixes #2989

PiperOrigin-RevId: 316781309
2020-06-16 16:34:26 -07:00
Ian Lewis 8ea99d58ff Set the HOME environment variable for sub-containers.
Fixes #701

PiperOrigin-RevId: 316025635
2020-06-11 19:31:24 -07:00
Jamie Liu 77c206e371 Add //pkg/sentry/fsimpl/overlay.
Major differences from existing overlay filesystems:

- Linux allows lower layers in an overlay to require revalidation, but not the
  upper layer. VFS1 allows the upper layer in an overlay to require
  revalidation, but not the lower layer. VFS2 does not allow any layers to
  require revalidation. (Now that vfs.MkdirOptions.ForSyntheticMountpoint
  exists, no uses of overlay in VFS1 are believed to require upper layer
  revalidation; in particular, the requirement that the upper layer support the
  creation of "trusted." extended attributes for whiteouts effectively required
  the upper filesystem to be tmpfs in most cases.)

- Like VFS1, but unlike Linux, VFS2 overlay does not attempt to make mutations
  of the upper layer atomic using a working directory and features like
  RENAME_WHITEOUT. (This may change in the future, since not having a working
  directory makes error recovery for some operations, e.g. rmdir, particularly
  painful.)

- Like Linux, but unlike VFS1, VFS2 represents whiteouts using character
  devices with rdev == 0; the equivalent of the whiteout attribute on
  directories is xattr trusted.overlay.opaque = "y"; and there is no equivalent
  to the whiteout attribute on non-directories since non-directories are never
  merged with lower layers.

- Device and inode numbers work as follows:

    - In Linux, modulo the xino feature and a special case for when all layers
      are the same filesystem:

        - Directories use the overlay filesystem's device number and an
          ephemeral inode number assigned by the overlay.

        - Non-directories that have been copied up use the device and inode
          number assigned by the upper filesystem.

        - Non-directories that have not been copied up use a per-(overlay,
          layer)-pair device number and the inode number assigned by the lower
          filesystem.

    - In VFS1, device and inode numbers always come from the lower layer unless
      "whited out"; this has the adverse effect of requiring interaction with
      the lower filesystem even for non-directory files that exist on the upper
      layer.

    - In VFS2, device and inode numbers are assigned as in Linux, except that
      xino and the samefs special case are not supported.

- Like Linux, but unlike VFS1, VFS2 does not attempt to maintain memory mapping
  coherence across copy-up. (This may have to change in the future, as users
  may be dependent on this property.)

- Like Linux, but unlike VFS1, VFS2 uses the overlayfs mounter's credentials
  when interacting with the overlay's layers, rather than the caller's.

- Like Linux, but unlike VFS1, VFS2 permits multiple lower layers in an
  overlay.

- Like Linux, but unlike VFS1, VFS2's overlay filesystem is
  application-mountable.

Updates #1199

PiperOrigin-RevId: 316019067
2020-06-11 18:34:53 -07:00
Fabricio Voznika 4e96b94915 Combine executable lookup code
Run vs. exec, VFS1 vs. VFS2 were executable lookup were
slightly different from each other. Combine them all
into the same logic.

PiperOrigin-RevId: 315426443
2020-06-08 23:08:23 -07:00
Rahat Mahmood 21b6bc7280 Implement mount(2) and umount2(2) for VFS2.
This is mostly syscall plumbing, VFS2 already implements the internals of
mounts. In addition to the syscall defintions, the following mount-related
mechanisms are updated:

- Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2
  filesystems don't implement node-level timestamps yet.

- Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs.

- Plumb mount namespace ownership, which is necessary for checking appropriate
  capabilities during mount(2).

Updates #1035

PiperOrigin-RevId: 315035352
2020-06-05 19:12:03 -07:00
Nicolas Lacasse e4e11f2798 Expand syscall filters to support MSAN.
PiperOrigin-RevId: 314997564
2020-06-05 14:33:50 -07:00
Ting-Yu Wang 41da7a568b Fix copylocks error about copying IPTables.
IPTables.connections contains a sync.RWMutex. Copying it will trigger copylocks
analysis. Tested by manually enabling nogo tests.

sync.RWMutex is added to IPTables for the additional race condition discovered.

PiperOrigin-RevId: 314817019
2020-06-05 11:29:09 -07:00
Fabricio Voznika ca5912d13c More runsc changes for VFS2
- Add /tmp handling
- Apply mount options
- Enable more container_test tests
- Forward signals to child process when test respaws process
  to run as root inside namespace.

Updates #1487

PiperOrigin-RevId: 314263281
2020-06-01 21:32:09 -07:00
Jamie Liu 3a987160aa Handle gofer blocking opens of host named pipes in VFS2.
Using tee instead of read to detect when a O_RDONLY|O_NONBLOCK pipe FD has a
writer circumvents the problem of what to do with the byte read from the pipe,
avoiding much of the complexity of the fdpipe package.

PiperOrigin-RevId: 314216146
2020-06-01 15:33:30 -07:00
Nicolas Lacasse 93edb36cbb Refactor the ResolveExecutablePath logic.
PiperOrigin-RevId: 313871804
2020-05-29 16:35:21 -07:00
Fabricio Voznika a8c1b32660 Automated rollback of changelist 309082540
PiperOrigin-RevId: 313636920
2020-05-28 12:25:57 -07:00
Fabricio Voznika 32ab382c80 Improve unsupported syscall message
PiperOrigin-RevId: 312104899
2020-05-18 10:23:22 -07:00
Jamie Liu 64afaf0e9b Fix runsc association of gofers and FDs on VFS2.
Updates #1487

PiperOrigin-RevId: 311443628
2020-05-13 18:18:09 -07:00
Jamie Liu d846077628 Enable overlayfs_stale_read by default for runsc.
Linux 4.18 and later make reads and writes coherent between pre-copy-up and
post-copy-up FDs representing the same file on an overlay filesystem. However,
memory mappings remain incoherent:

- Documentation/filesystems/overlayfs.rst, "Non-standard behavior": "If a file
  residing on a lower layer is opened for read-only and then memory mapped with
  MAP_SHARED, then subsequent changes to the file are not reflected in the
  memory mapping."

- fs/overlay/file.c:ovl_mmap() passes through to the underlying FD without any
  management of coherence in the overlay.

- Experimentally on Linux 5.2:

```
$ cat mmap_cat_page.c
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv) {
  if (argc < 2) {
    errx(1, "syntax: %s [FILE]", argv[0]);
  }
  const int fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    err(1, "open(%s)", argv[1]);
  }
  const size_t page_size = sysconf(_SC_PAGE_SIZE);
  void* page = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
  if (page == MAP_FAILED) {
    err(1, "mmap");
  }
  for (;;) {
    write(1, page, strnlen(page, page_size));
    if (getc(stdin) == EOF) {
      break;
    }
  }
  return 0;
}

$ gcc -O2 -o mmap_cat_page mmap_cat_page.c
$ mkdir lowerdir upperdir workdir overlaydir
$ echo old > lowerdir/file
$ sudo mount -t overlay -o "lowerdir=lowerdir,upperdir=upperdir,workdir=workdir" none overlaydir
$ ./mmap_cat_page overlaydir/file
old
^Z
[1]+  Stopped                 ./mmap_cat_page overlaydir/file
$ echo new > overlaydir/file
$ cat overlaydir/file
new
$ fg
./mmap_cat_page overlaydir/file

old
```

Therefore, while the VFS1 gofer client's behavior of reopening read FDs is only
necessary pre-4.18, replacing existing memory mappings (in both sentry and
application address spaces) with mappings of the new FD is required regardless
of kernel version, and this latter behavior is common to both VFS1 and VFS2.
Re-document accordingly, and change the runsc flag to enabled by default.

New test:
- Before this CL: https://source.cloud.google.com/results/invocations/5b222d2c-e918-4bae-afc4-407f5bac509b
- After this CL: https://source.cloud.google.com/results/invocations/f28c747e-d89c-4d8c-a461-602b33e71aab

PiperOrigin-RevId: 311361267
2020-05-13 10:53:37 -07:00
Fabricio Voznika 18cb3d24cb Use VFS2 mount names
Updates #1487

PiperOrigin-RevId: 311356385
2020-05-13 10:31:29 -07:00
Fabricio Voznika 305f786e51 Adjust a few log messages
PiperOrigin-RevId: 311234146
2020-05-12 17:26:07 -07:00
Nicolas Lacasse c52195d258 Stop avoiding preadv2 and pwritev2, and add them to the filters.
Some code paths needed these syscalls anyways, so they should be included in
the filters. Given that we depend on these syscalls in some cases, there's no
real reason to avoid them any more.

PiperOrigin-RevId: 310829126
2020-05-10 17:52:20 -07:00
Jamie Liu 9115f26851 Allocate device numbers for VFS2 filesystems.
Updates #1197, #1198, #1672

PiperOrigin-RevId: 310432006
2020-05-07 14:01:53 -07:00
Dean Deng 16da7e790f Update privateunixsocket TODOs.
Synthetic sockets do not have the race condition issue in VFS2, and we will
get rid of privateunixsocket as well.

Fixes #1200.

PiperOrigin-RevId: 310386474
2020-05-07 10:20:48 -07:00
Adin Scannell 279f1eb7ab Fix runsc syscall documentation generation.
We can register any number of tables with any number of architectures, and
need not limit the definitions to the architecture in question. This allows
runsc to generate documentation for all architectures simultaneously.

Similarly, this simplifies the VFSv2 patching process.

PiperOrigin-RevId: 310224827
2020-05-06 14:13:48 -07:00
Fabricio Voznika 0a307d0072 Mount VSFS2 filesystem using root credentials
PiperOrigin-RevId: 309787938
2020-05-04 11:48:00 -07:00
Fabricio Voznika cbc5bef2a6 Add TTY support on VFS2 to runsc
Updates #1623, #1487

PiperOrigin-RevId: 309777922
2020-05-04 10:59:20 -07:00
Bhasker Hariharan ae15d90436 FIFO QDisc implementation
Updates #231

PiperOrigin-RevId: 309323808
2020-04-30 16:41:00 -07:00
moricho fc53d64367 refactor and add test for bindmount
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-26 17:24:34 +09:00
moricho 0b3166f624 add bind/rbind options for mount
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-25 22:04:39 +09:00
moricho 93e510e26f fix behavior of `getMountNameAndOptions` when options include either bind or rbind
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-25 22:04:39 +09:00
Zach Koopmans 15a822a193 VFS2: Get HelloWorld image tests to pass with VFS2
This change includes:
- Modifications to loader_test.go to get TestCreateMountNamespace to
pass with VFS2.
- Changes necessary to get TestHelloWorld in image tests to pass with
VFS2. This means runsc can run the hello-world container with docker
on VSF2.

Note: Containers that use sockets will not run with these changes.
See "//test/image/...". Any tests here with sockets currently fail
(which is all of them but HelloWorld).
PiperOrigin-RevId: 308363072
2020-04-24 18:23:37 -07:00
Dean Deng 632b104aff Plumb context.Context into kernfs.Inode.Open().
PiperOrigin-RevId: 308304793
2020-04-24 12:37:49 -07:00
Dean Deng 1b88c63b3e Move hostfs mount to Kernel struct.
This is needed to set up host fds passed through a Unix socket. Note that
the host package depends on kernel, so we cannot set up the hostfs mount
directly in Kernel.Init as we do for sockfs and pipefs.

Also, adjust sockfs to make its setup look more like hostfs's and pipefs's.

PiperOrigin-RevId: 308274053
2020-04-24 10:03:43 -07:00
Jamie Liu 5042ea7e2c Add vfs.MkdirOptions.ForSyntheticMountpoint.
PiperOrigin-RevId: 308143529
2020-04-23 15:37:10 -07:00
Adin Scannell 1481499fe2 Simplify Docker test infrastructure.
This change adds a layer of abstraction around the internal Docker APIs,
and eliminates all direct dependencies on Dockerfiles in the infrastructure.

A subsequent change will automated the generation of local images (with
efficient caching). Note that this change drops the use of bazel container
rules, as that experiment does not seem to be viable.

PiperOrigin-RevId: 308095430
2020-04-23 11:33:30 -07:00
Nicolas Lacasse e69a871c7b Move user home detection to its own library.
PiperOrigin-RevId: 307977689
2020-04-22 22:18:21 -07:00
Zach Koopmans 12bde95635 Get /bin/true to run on VFS2
Included:
- loader_test.go RunTest and TestStartSignal VFS2
- container_test.go TestAppExitStatus on VFS2
- experimental flag added to runsc to turn on VFS2

Note: shared mounts are not yet supported.
PiperOrigin-RevId: 307070753
2020-04-17 10:39:19 -07:00
gVisor bot ac9b32c36b Merge pull request #2212 from aaronlu:dup_stdioFDs
PiperOrigin-RevId: 306477639
2020-04-14 11:20:11 -07:00
Fabricio Voznika 96f9142959 Use O_CLOEXEC when dup'ing FDs
The sentry doesn't allow execve, but it's a good defense
in-depth measure.

PiperOrigin-RevId: 305958737
2020-04-10 15:47:23 -07:00
Adin Scannell 94b793262d Fix all copy locks violations.
This required minor restructuring of how system call tables were saved
and restored, but it makes way more sense this way.

Updates #2243
2020-04-08 10:00:14 -07:00
Ian Lewis 56054fc1fb Add friendlier messages for frequently encountered errors.
Issue #2270
Issue #1765

PiperOrigin-RevId: 305385436
2020-04-07 18:51:01 -07:00
Aaron Lu 0cfdd47391 checkpoint/restore: make sure the donated stdioFDs have the same value
Suppose I start a runsc container using kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donating FD and the corresponding cmdline for runsc-sandbox is:

D0313 17:50:12.608203   44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214   44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224   44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229   44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234   44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238   44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242   44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246   44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249   44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253   44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257   44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls=--strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]

Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12.

If I restore a container from the checkpoint image which is derived
by checkpointing the above rootbash container, but either omit the
platform switch or specify to use ptrace platform explicitely:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash

the donating FD and corresponding cmdline for runsc-sandbox is:

D0313 17:50:15.258632   44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640   44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645   44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648   44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653   44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657   44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661   44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675   44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680   44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684   44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]

Note this time, stdioFDs starts from 9 and stderr's FD is 11(so the
saved host.descritor.origFD which is 12 for stderr is no longer valid).

For the three host FD based files, The s.Dev and s.Ino derived from
fstat(fd) shall all be the same and since the two fields are used
as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId which is
the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same.
Note that for MultiDevice m, m.cache records the mapping of key to value
and m.rcache records the mapping of value to key. If same value doesn't
map to the same key, it will panic on restore.

Now that stderr's origFD 12 is no longer valid(it happens to be
/memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() will neither be
correct. But its InodeID is still the same as saved, MultiDevice.Load()
will complain about the same value(InodeID) being mapped to different
keys (different from stdin and stdout's) and panic with: "MultiDevice's
caches are inconsistent".

Solve this problem by making sure stdioFDs for root container's init
task are always the same on initial start and on restore time, no matter
what cmdline user has used: debug log specified or not, platform changed
or not etc. shall not affect the ability to restore.

Fixes #1844.
2020-03-31 11:37:11 +08:00
Dean Deng 137f361400 Use host-defined file owner and mode, when possible, for imported fds.
Using the host-defined file owner matches VFS1. It is more correct to use the
host-defined mode, since the cached value may become out of date. However,
kernfs.Inode.Mode() does not return an error--other filesystems on kernfs are
in-memory so retrieving mode should not fail. Therefore, if the host syscall
fails, we rely on a cached value instead.

Updates #1672.

PiperOrigin-RevId: 303220864
2020-03-26 16:47:20 -07:00
Dean Deng 248e46f320 Whitelist utimensat(2).
utimensat is used by hostfs for setting timestamps on imported fds. Previously,
this would crash the sandbox since utimensat was not allowed.

Correct the VFS2 version of hostfs to match the call in VFS1.

PiperOrigin-RevId: 301970121
2020-03-19 23:30:21 -07:00
Dean Deng 5e413cad10 Plumb VFS2 imported fds into virtual filesystem.
- When setting up the virtual filesystem, mount a host.filesystem to contain
  all files that need to be imported.
- Make read/preadv syscalls to the host in cases where preadv2 may not be
  supported yet (likewise for writing).
- Make save/restore functions in kernel/kernel.go return early if vfs2 is
  enabled.

PiperOrigin-RevId: 300922353
2020-03-14 07:14:33 -07:00
gVisor bot 6367963c14 Merge pull request #1951 from moricho:moricho/add-profiler-option
PiperOrigin-RevId: 299233818
2020-03-05 17:16:54 -08:00
Andrei Vagin 322dbfe06b Allow to specify a separate log for GO's runtime messages
GO's runtime calls the write system call twice to print "panic:"
and "the reason of this panic", so here is a race window when
other threads can print something to the log and we will see
something like this:

panic: log messages from another thread
The reason of the panic.

This confuses the syzkaller blacklist and dedup detection.

It also makes the logs generally difficult to read. e.g.,
data races often have one side of the race, followed by
a large "diagnosis" dump, finally followed by the other
side of the race.

PiperOrigin-RevId: 297887895
2020-02-28 11:24:11 -08:00
moricho d8ed784311 add profile option 2020-02-26 16:49:51 +09:00
Jamie Liu 471b15b212 Port most syscalls to VFS2.
pipe and pipe2 aren't ported, pending a slight rework of pipe FDs for VFS2.
mount and umount2 aren't ported out of temporary laziness. access and faccessat
need additional FSImpl methods to implement properly, but are stubbed to
prevent googletest from CHECK-failing. Other syscalls require additional
plumbing.

Updates #1623

PiperOrigin-RevId: 297188448
2020-02-25 13:37:34 -08:00
gVisor bot 4a73bae269 Initial network namespace support.
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will have the same behavior as it is now.

Before the userspace is able to create device, the default loopback device can
be used to test.

/proc/net and /sys/net will still be connected to the root network stack; this
is the same behavior now.

Issue #1833

PiperOrigin-RevId: 296309389
2020-02-20 15:20:40 -08:00
gVisor bot 5baf9dc2fb Synchronize signalling with S/R
This is to fix a data race between sending an external signal to
a ThreadGroup and kernel saving state for S/R.

PiperOrigin-RevId: 295244281
2020-02-14 15:49:09 -08:00
gVisor bot 4075de11be Plumb VFS2 inside the Sentry
- Added fsbridge package with interface that can be used to open
  and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup

Updates #1623

PiperOrigin-RevId: 295183950
2020-02-14 11:12:47 -08:00
Adin Scannell 1b6a12a768 Add notes to relevant tests.
These were out-of-band notes that can help provide additional context
and simplify automated imports.

PiperOrigin-RevId: 293525915
2020-02-05 22:46:35 -08:00
Michael Pratt 4d1a648c7c Allow mlock in system call filters
Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g
stack to prevent register corruption. We need to allow this syscall until it is
removed from Go.

PiperOrigin-RevId: 292967478
2020-02-03 11:39:51 -08:00
Fabricio Voznika 437c986c6a Add vfs.FileDescription to FD table
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.

Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.

Updates #1487
Updates #1623

PiperOrigin-RevId: 292023223
2020-01-28 15:31:03 -08:00
Adin Scannell 253c9e666c Cleanup glog and add real caller information.
In general, we've learned that logging must be avoided at all
costs in the hot path. It's unlikely that the optimizations
here were significant in any case, since buffer would certainly
escape.

This also adds a test to ensure that the caller identification
works as expected, and so that logging can be benchmarked.

Original:
BenchmarkGoogleLogging-6   	 1222255	       949 ns/op

With this change:
BenchmarkGoogleLogging-6   	  517323	      2346 ns/op

Fixes #184

PiperOrigin-RevId: 291815420
2020-01-27 16:08:35 -08:00
Adin Scannell 0e2f1b7abd Update package locations.
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.

PiperOrigin-RevId: 291811289
2020-01-27 15:31:32 -08:00
Adin Scannell d29e59af9f Standardize on tools directory.
PiperOrigin-RevId: 291745021
2020-01-27 12:21:00 -08:00
Ian Gudger 27500d529f New sync package.
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.

This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.

Updates #1472

PiperOrigin-RevId: 289033387
2020-01-09 22:02:24 -08:00
Bert Muthalaly e21c584056 Combine various Create*NIC methods into CreateNICWithOptions.
PiperOrigin-RevId: 288779416
2020-01-08 14:50:49 -08:00
Bert Muthalaly 0cc1e74b57 Add NIC.isLoopback()
...enabling us to remove the "CreateNamedLoopbackNIC" variant of
CreateNIC and all the plumbing to connect it through to where the value
is read in FindRoute.

PiperOrigin-RevId: 288713093
2020-01-08 09:30:20 -08:00
Aleksandr Razumov 67f678be27
Leave minimum CPU number as a constant
Remove introduced CPUNumMin config and hard-code it as 2.
2019-12-17 20:41:02 +03:00
Aleksandr Razumov b661434202
Add minimum CPU number and only lower CPUs on --cpu-num-from-quota
* Add `--cpu-num-min` flag to control minimum CPUs
* Only lower CPU count
* Fix comments
2019-12-17 13:27:13 +03:00
Aleksandr Razumov 8782f0e287
Set CPU number to CPU quota
When application is not cgroups-aware, it can spawn excessive threads
which often defaults to CPU number.
Introduce a opt-in flag that will set CPU number accordingly to CPU
quota (if available).

Fixes #1391
2019-12-15 21:12:43 +03:00
Bhasker Hariharan b9aa62b9f9 Enable IPv6 in runsc
Fixes #1341

PiperOrigin-RevId: 285108973
2019-12-11 19:14:26 -08:00
Fabricio Voznika 01eadf51ea Bump up Go 1.13 as minimum requirement
PiperOrigin-RevId: 284320186
2019-12-06 23:10:15 -08:00
gVisor bot e70636d7f1 Merge pull request #1233 from xiaobo55x:compatLog
PiperOrigin-RevId: 284305935
2019-12-06 19:41:39 -08:00
Adin Scannell 371e210b83 Add runtime tracing.
This adds meaningful annotations to the trace generated by the runtime/trace
package.

PiperOrigin-RevId: 284290115
2019-12-06 17:00:07 -08:00
Fabricio Voznika ea7a100202 Make annotations OCI compliant
Changed annotation to follow the standard defined here:
https://github.com/opencontainers/image-spec/blob/master/annotations.md

PiperOrigin-RevId: 284254847
2019-12-06 13:51:38 -08:00
Dean Deng 19b2d997ec Support IP_TOS and IPV6_TCLASS socket options for hostinet sockets.
There are two potential ways of sending a TOS byte with outgoing packets:
including a control message in sendmsg, or setting the IP_TOS/IPV6_TCLASS
socket options (for IPV4 and IPV6 respectively). This change lets hostinet
support the latter.

Fixes #1188

PiperOrigin-RevId: 283550925
2019-12-03 08:33:22 -08:00
Haibo Xu 61f2274cb6 Enable runsc compatLog support on arm64.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I3fd5e552f5f03b5144ed52647f75af3b8253b1d6
2019-12-03 03:25:54 +00:00
Dean Deng 684f757a22 Add support for receiving TOS and TCLASS control messages in hostinet.
This involves allowing getsockopt/setsockopt for the corresponding socket
options, as well as allowing hostinet to process control messages received from
the actual recvmsg syscall.

PiperOrigin-RevId: 282851425
2019-11-27 16:21:05 -08:00
Fabricio Voznika 97d2c9a94e Use mount hints to determine FileAccessType
PiperOrigin-RevId: 282401165
2019-11-25 11:43:05 -08:00
gVisor bot 0416c247ec Merge pull request #1176 from xiaobo55x:runsc_boot
PiperOrigin-RevId: 282382564
2019-11-25 11:01:22 -08:00
Haibo Xu 05871a1cdc Enable runsc/boot support on arm64.
This patch also include a minor change to replace syscall.Dup2
with syscall.Dup3 which was missed in a previous commit(ref a25a976).

Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3
2019-11-13 06:39:11 +00:00
Michael Pratt b23b36e701 Add NETLINK_KOBJECT_UEVENT socket support
NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events.
gVisor doesn't have any device events, so our sockets don't need to do anything
once created.

systemd's device manager needs to be able to create one of these sockets. It
also wants to install a BPF filter on the socket. Since we'll never send any
messages, the filter would never be invoked, thus we just fake it out.

Fixes #1117
Updates #1119

PiperOrigin-RevId: 278405893
2019-11-04 10:07:52 -08:00
Nicolas Lacasse e70f28664a Allow the watchdog to detect when the sandbox is stuck during setup.
The watchdog currently can find stuck tasks, but has no way to tell if the
sandbox is stuck before the application starts executing.

This CL adds a startup timeout and action to the watchdog. If Start() is not
called before the given timeout (if non-zero), then the watchdog will take the
action.

PiperOrigin-RevId: 277970577
2019-11-01 11:49:31 -07:00
gVisor bot 0202be1ba5 Merge pull request #1058 from cmingxu:master
PiperOrigin-RevId: 277623766
2019-10-31 11:26:45 -07:00
Andrei Vagin db37483cb6 Store endpoints inside multiPortEndpoint in a sorted order
It is required to guarantee the same order of endpoints after save/restore.

PiperOrigin-RevId: 277598665
2019-10-30 15:33:41 -07:00
kevin.xu 1f19624fa1
fix typo
fix a typo
2019-10-23 15:21:50 +08:00
kevin.xu 3edbdcc191
remove duplicated period
remove a duplicated period
2019-10-23 14:56:44 +08:00
Andrei Vagin 8720bd643e netstack/tcp: software segmentation offload
Right now, we send each tcp packet separately, we call one system
call per-packet. This patch allows to generate multiple tcp packets
and send them by sendmmsg.

The arguable part of this CL is a way how to handle multiple headers.
This CL adds the next field to the Prepandable buffer.

Nginx test results:

Server Software:        nginx/1.15.9
Server Hostname:        10.138.0.2
Server Port:            8080

Document Path:          /10m.txt
Document Length:        10485760 bytes

w/o gso:
Concurrency Level:      5
Time taken for tests:   5.491 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      1048600200 bytes
HTML transferred:       1048576000 bytes
Requests per second:    18.21 [#/sec] (mean)
Time per request:       274.525 [ms] (mean)
Time per request:       54.905 [ms] (mean, across all concurrent requests)
Transfer rate:          186508.03 [Kbytes/sec] received

sw-gso:

Concurrency Level:      5
Time taken for tests:   3.852 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      1048600200 bytes
HTML transferred:       1048576000 bytes
Requests per second:    25.96 [#/sec] (mean)
Time per request:       192.576 [ms] (mean)
Time per request:       38.515 [ms] (mean, across all concurrent requests)
Transfer rate:          265874.92 [Kbytes/sec] received

w/o gso:
$ ./tcp_benchmark --client --duration 15  --ideal
[SUM]  0.0-15.1 sec  2.20 GBytes  1.25 Gbits/sec

software gso:
$ tcp_benchmark --client --duration 15  --ideal --gso $((1<<16)) --swgso
[SUM]  0.0-15.1 sec  3.99 GBytes  2.26 Gbits/sec

PiperOrigin-RevId: 276112677
2019-10-22 11:55:56 -07:00
Kevin Krakauer 12235d533a AF_PACKET support for netstack (aka epsocket).
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With
runsc, you'll need to pass `--net-raw=true` to enable them.

Binding isn't supported yet.

PiperOrigin-RevId: 275909366
2019-10-21 13:23:18 -07:00
Fabricio Voznika 9fb562234e Fix problem with open FD when copy up is triggered in overlayfs
Linux kernel before 4.19 doesn't implement a feature that updates
open FD after a file is open for write (and is copied to the upper
layer). Already open FD will continue to read the old file content
until they are reopened. This is especially problematic for gVisor
because it caches open files.

Flag was added to force readonly files to be reopenned when the
same file is open for write. This is only needed if using kernels
prior to 4.19.

Closes #1006

It's difficult to really test this because we never run on tests
on older kernels. I'm adding a test in GKE which uses kernels
with the overlayfs problem for 1.14 and lower.

PiperOrigin-RevId: 275115289
2019-10-16 15:06:24 -07:00
Fabricio Voznika a357fe427b Remove stale TODO
PiperOrigin-RevId: 273630282
2019-10-08 16:23:41 -07:00
Fabricio Voznika b9cdbc26bc Ignore mount options that are not supported in shared mounts
Options that do not change mount behavior inside the Sentry are
irrelevant and should not be used when looking for possible
incompatibilities between master and slave mounts.

PiperOrigin-RevId: 273593486
2019-10-08 13:36:16 -07:00
Ian Gudger 7c1587e340 Implement IP_TTL.
Also change the default TTL to 64 to match Linux.

PiperOrigin-RevId: 273430341
2019-10-07 19:29:51 -07:00
Kevin Krakauer 6a98237949 Rename epsocket to netstack.
PiperOrigin-RevId: 273365058
2019-10-07 13:57:59 -07:00
gVisor bot dd0e5eedae Merge pull request #765 from trailofbits:uds_support
PiperOrigin-RevId: 271235134
2019-09-25 16:44:22 -07:00
Kevin Krakauer 59ccbb1044 Remove centralized registration of protocols.
Also removes the need for protocol names.

PiperOrigin-RevId: 271186030
2019-09-25 12:57:05 -07:00
Robert Tonic 7810b30983 Refactor command line options and remove the allowed terminology for uds 2019-09-24 18:24:10 -04:00
Nicolas Lacasse f2ea8e6b24 Always set HOME env var with `runsc exec`.
We already do this for `runsc run`, but need to do the same for `runsc exec`.

PiperOrigin-RevId: 270793459
2019-09-23 17:06:02 -07:00
Robert Tonic 46beb91912 Fix documentation, clean up seccomp filter installation, rename helpers.
Filter installation has been streamlined and functions renamed. 
Documentation has been fixed to be standards compliant, and missing 
documentation added. gofmt has also been applied to modified files.
2019-09-19 17:10:50 -04:00
Robert Tonic ac38a7ead0 Place the host UDS mounting behind --fsgofer-host-uds-allowed.
This commit allows the use of the `--fsgofer-host-uds-allowed` flag to 
enable mounting sockets and add the appropriate seccomp filters.
2019-09-19 12:37:15 -04:00
Fabricio Voznika 010b093258 Bring back to life features lost in recent refactor
- Sandbox logs are generated when running tests
- Kokoro uploads the sandbox logs
- Supports multiple parallel runs
- Revive script to install locally built runsc with docker

PiperOrigin-RevId: 269337274
2019-09-16 08:17:00 -07:00
Adin Scannell a8834fc555 Update p9 to support flipcall.
PiperOrigin-RevId: 268845090
2019-09-12 23:37:31 -07:00
Ian Gudger fe1f521077 Remove reundant global tcpip.LinkEndpointID.
PiperOrigin-RevId: 267709597
2019-09-06 18:01:14 -07:00
Fabricio Voznika 0f5cdc1e00 Resolve flakes with TestMultiContainerDestroy
Some processes are reparented to the root container depending
on the kill order and the root container would not reap in time.
So some zombie processes were still present when the test checked.

Fix it by running the second container inside a PID namespace.

PiperOrigin-RevId: 267278591
2019-09-04 18:56:49 -07:00