Linux 4.18 and later make reads and writes coherent between pre-copy-up and
post-copy-up FDs representing the same file on an overlay filesystem. However,
memory mappings remain incoherent:
- Documentation/filesystems/overlayfs.rst, "Non-standard behavior": "If a file
residing on a lower layer is opened for read-only and then memory mapped with
MAP_SHARED, then subsequent changes to the file are not reflected in the
memory mapping."
- fs/overlayfs/file.c:ovl_mmap() passes through to the underlying FD without
any coherence management in the overlay.
- Experimentally on Linux 5.2:
```
$ cat mmap_cat_page.c
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
int main(int argc, char **argv) {
  if (argc < 2) {
    errx(1, "syntax: %s [FILE]", argv[0]);
  }
  const int fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    err(1, "open(%s)", argv[1]);
  }
  const size_t page_size = sysconf(_SC_PAGE_SIZE);
  void* page = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
  if (page == MAP_FAILED) {
    err(1, "mmap");
  }
  for (;;) {
    write(1, page, strnlen(page, page_size));
    if (getc(stdin) == EOF) {
      break;
    }
  }
  return 0;
}
$ gcc -O2 -o mmap_cat_page mmap_cat_page.c
$ mkdir lowerdir upperdir workdir overlaydir
$ echo old > lowerdir/file
$ sudo mount -t overlay -o "lowerdir=lowerdir,upperdir=upperdir,workdir=workdir" none overlaydir
$ ./mmap_cat_page overlaydir/file
old
^Z
[1]+ Stopped ./mmap_cat_page overlaydir/file
$ echo new > overlaydir/file
$ cat overlaydir/file
new
$ fg
./mmap_cat_page overlaydir/file
old
```
Therefore, while the VFS1 gofer client's behavior of reopening read FDs is only
necessary pre-4.18, replacing existing memory mappings (in both sentry and
application address spaces) with mappings of the new FD is required regardless
of kernel version, and this latter behavior is common to both VFS1 and VFS2.
Re-document accordingly, and change the runsc flag to be enabled by default.
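To make the mapping-replacement point concrete, here is a minimal C sketch
(the helper name is hypothetical, and this is not the sentry's actual code) of
replacing an existing mapping in place with a mapping of a freshly opened
post-copy-up FD using MAP_FIXED:
```
/* Hypothetical sketch: replace the mapping at `addr` with a mapping of a
 * freshly opened FD for the same overlay path, so the mapping observes the
 * post-copy-up (upper layer) contents. Not the sentry's actual code. */
#include <err.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *remap_from(void *addr, size_t len, const char *path) {
  const int fd = open(path, O_RDONLY);
  if (fd < 0) {
    err(1, "open(%s)", path);
  }
  /* MAP_FIXED atomically replaces any existing mapping in [addr, addr+len). */
  void *page = mmap(addr, len, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
  if (page == MAP_FAILED) {
    err(1, "mmap");
  }
  close(fd);
  return page;
}
```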
New test:
- Before this CL: https://source.cloud.google.com/results/invocations/5b222d2c-e918-4bae-afc4-407f5bac509b
- After this CL: https://source.cloud.google.com/results/invocations/f28c747e-d89c-4d8c-a461-602b33e71aab
PiperOrigin-RevId: 311361267
Included:
- loader_test.go RunTest and TestStartSignal VFS2
- container_test.go TestAppExitStatus on VFS2
- experimental flag added to runsc to turn on VFS2
Note: shared mounts are not yet supported.
PiperOrigin-RevId: 307070753
Go's runtime calls the write system call twice when panicking: once for
"panic:" and once for the reason of the panic. This leaves a race window in
which other threads can print something to the log, so we end up with output
like this:
panic: log messages from another thread
The reason of the panic.
This confuses syzkaller's blacklist and dedup detection.
It also makes the logs generally difficult to read; e.g., data races often
show one side of the race, followed by a large "diagnosis" dump, and only then
the other side of the race.
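For illustration only (this is not necessarily what the change does): the race
window disappears if the whole message is assembled into one buffer and
emitted with a single write(2):
```
/* Illustrative only: emit a multi-part diagnostic with a single write(2) so
 * that concurrent writers to the same log FD cannot interleave their output
 * between the "panic:" header and the reason. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void log_panic(int fd, const char *reason) {
  char buf[256];
  const int n = snprintf(buf, sizeof(buf), "panic: %s\n", reason);
  if (n > 0) {
    /* One syscall: other threads' writes land before or after, not inside. */
    write(fd, buf, (size_t)n < sizeof(buf) ? (size_t)n : sizeof(buf) - 1);
  }
}

int main(void) {
  log_panic(2, "example reason");
  return 0;
}
```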
PiperOrigin-RevId: 297887895
The FD table now holds both VFS1 and VFS2 types and uses the correct
one based on which is set.
Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.
Updates #1487
Updates #1623
PiperOrigin-RevId: 292023223
When an application is not cgroups-aware, it can spawn an excessive number of
threads, since thread counts often default to the number of host CPUs.
Introduce an opt-in flag that sets the CPU number according to the CPU quota
(if available).
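A sketch of the intended calculation; the rounding, clamping, and cgroup file
names are assumptions, and runsc's implementation may differ:
```
/* Sketch: derive a CPU count from the cgroup CPU quota and period
 * (cpu.cfs_quota_us / cpu.cfs_period_us). Rounding and clamping here are
 * illustrative; runsc's actual implementation may differ. */
#include <stdio.h>

static int cpus_from_quota(long quota_us, long period_us, int host_cpus) {
  if (quota_us <= 0 || period_us <= 0) {
    return host_cpus; /* No quota configured: fall back to the host count. */
  }
  long n = (quota_us + period_us - 1) / period_us; /* ceil(quota / period) */
  if (n < 1) n = 1;
  if (n > host_cpus) n = host_cpus;
  return (int)n;
}

int main(void) {
  /* e.g. a 150000us quota over a 100000us period on an 8-CPU host -> 2 */
  printf("%d\n", cpus_from_quota(150000, 100000, 8));
  return 0;
}
```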
Fixes #1391
Right now we send each TCP packet separately, making one system call per
packet. This patch allows generating multiple TCP packets and sending them
with a single sendmmsg call.
The debatable part of this CL is how to handle multiple headers. This CL adds
a next field to the Prependable buffer.
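For reference, syscall-level batching looks roughly like the following
standalone C sketch (not the netstack code; the batch size and the UDP/loopback
usage in main are illustrative):
```
/* Sketch: transmit several already-built packets with one sendmmsg(2) call
 * instead of one system call per packet. Illustrative only. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <err.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_BATCH 64

/* Each entry of `pkts` points at one fully formed packet. */
static int send_batch(int fd, struct iovec *pkts, unsigned int npkts) {
  struct mmsghdr msgs[MAX_BATCH];
  if (npkts > MAX_BATCH) npkts = MAX_BATCH;
  memset(msgs, 0, sizeof(msgs));
  for (unsigned int i = 0; i < npkts; i++) {
    msgs[i].msg_hdr.msg_iov = &pkts[i];
    msgs[i].msg_hdr.msg_iovlen = 1;
  }
  /* One system call sends up to npkts packets. */
  const int sent = sendmmsg(fd, msgs, npkts, 0);
  if (sent < 0) err(1, "sendmmsg");
  return sent;
}

int main(void) {
  const int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) err(1, "socket");
  struct sockaddr_in dst = {0};
  dst.sin_family = AF_INET;
  dst.sin_port = htons(9); /* discard port */
  dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) err(1, "connect");
  struct iovec pkts[2] = {
      {.iov_base = "one", .iov_len = 3},
      {.iov_base = "two", .iov_len = 3},
  };
  return send_batch(fd, pkts, 2) == 2 ? 0 : 1;
}
```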
Nginx test results:
Server Software: nginx/1.15.9
Server Hostname: 10.138.0.2
Server Port: 8080
Document Path: /10m.txt
Document Length: 10485760 bytes
w/o gso:
Concurrency Level: 5
Time taken for tests: 5.491 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 18.21 [#/sec] (mean)
Time per request: 274.525 [ms] (mean)
Time per request: 54.905 [ms] (mean, across all concurrent requests)
Transfer rate: 186508.03 [Kbytes/sec] received
sw-gso:
Concurrency Level: 5
Time taken for tests: 3.852 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 25.96 [#/sec] (mean)
Time per request: 192.576 [ms] (mean)
Time per request: 38.515 [ms] (mean, across all concurrent requests)
Transfer rate: 265874.92 [Kbytes/sec] received
w/o gso:
$ ./tcp_benchmark --client --duration 15 --ideal
[SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec
software gso:
$ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso
[SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec
PiperOrigin-RevId: 276112677
Linux kernels before 4.19 don't implement an overlayfs feature that updates
already-open FDs after a file is opened for write (and is copied to the upper
layer). Already-open FDs will continue to read the old file content until they
are reopened. This is especially problematic for gVisor because it caches open
files.
A flag was added to force read-only files to be reopened when the same file is
opened for write. This is only needed when using kernels prior to 4.19.
Closes #1006
It's difficult to really test this because we never run tests on older
kernels. I'm adding a test in GKE, which uses kernels with the overlayfs
problem for 1.14 and lower.
PiperOrigin-RevId: 275115289
Filter installation has been streamlined and functions renamed.
Documentation has been fixed to be standards compliant, and missing
documentation added. gofmt has also been applied to modified files.
The '--rootless' flag lets a non-root user execute 'runsc do'.
The drawback is that the sandbox and gofer processes will run as root inside a
user namespace that is mapped to the caller's user, instead of nobody. And
networking defaults to '--network=host' inside the root network namespace. On
the bright side, it's very convenient for testing:
runsc --rootless do ls
runsc --rootless do curl www.google.com
PiperOrigin-RevId: 252840970
This allows an fdbased endpoint to have multiple underlying FDs from which
packets can be read and dispatched/written to.
This should allow for higher throughput as well as better scalability of the
network stack as the number of connections increases.
Updates #231
PiperOrigin-RevId: 251852825
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes #209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
The Linux packet socket can handle GSO packets, so we can segment packets at
64K instead of at the MTU, which is usually 1500.
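A standalone C sketch of the mechanism (not the netstack code): with
PACKET_VNET_HDR enabled, every frame written to the packet socket is prefixed
with a virtio_net_hdr that tells the kernel how to segment it; the header
values below are illustrative:
```
/* Sketch (not the netstack code): with PACKET_VNET_HDR enabled on an
 * AF_PACKET socket, each transmitted frame is prefixed with a virtio_net_hdr,
 * letting the sender hand the kernel a TCP payload of up to ~64K and have it
 * segmented into MSS-sized packets later. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <err.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/virtio_net.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
  const int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
  if (fd < 0) err(1, "socket(AF_PACKET)"); /* needs CAP_NET_RAW */

  int one = 1;
  if (setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &one, sizeof(one)) < 0)
    err(1, "setsockopt(PACKET_VNET_HDR)");

  /* This header would be prepended to each outgoing frame. A real sender also
   * binds the socket to an interface and builds a valid ethernet/IP/TCP
   * frame; both are omitted here and the values are illustrative. */
  struct virtio_net_hdr vnet;
  memset(&vnet, 0, sizeof(vnet));
  vnet.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
  vnet.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
  vnet.gso_size = 1448;        /* MSS for each resulting segment */
  vnet.hdr_len = 14 + 20 + 20; /* ethernet + IPv4 + TCP header bytes */
  (void)vnet;

  close(fd);
  return 0;
}
```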
Here are numbers for the nginx-1m test:
runsc: 579330.01 [Kbytes/sec] received
runsc-gso: 1794121.66 [Kbytes/sec] received
runc: 2122139.06 [Kbytes/sec] received
and for tcp_benchmark:
$ tcp_benchmark --duration 15 --ideal
[ 4] 0.0-15.0 sec 86647 MBytes 48456 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal
[ 4] 0.0-15.0 sec 2173 MBytes 1214 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal --gso 65536
[ 4] 0.0-15.0 sec 19357 MBytes 10825 Mbits/sec
PiperOrigin-RevId: 241072403
Change-Id: I20b03063a1a6649362b43609cbbc9b59be06e6d5
This option is effectively equivalent to -panic-signal, except that the
sandbox does not die after logging the traceback.
PiperOrigin-RevId: 225089593
Change-Id: Ifb1c411210110b6104613f404334bd02175e484e
Fluentd configuration uses 'log' for the log message while containerd uses
'msg'. Since we can't have a single JSON format for both, add another log
format and make the debug log format configurable.
PiperOrigin-RevId: 219729658
Change-Id: I2a6afc4034d893ab90bafc63b394c4fb62b2a7a0
It's hard to resolve symlinks inside the sandbox because the rootfs and mounts
may be read-only, forcing us to create mount points inside the lower layer of
an overlay, **before** the volumes are mounted.
Since the destination must already be resolved outside the sandbox when
creating mounts, take this opportunity to rewrite the spec with the paths
resolved. "runsc boot" will use the "resolved" spec to load mounts. In
addition, symlink traversals were disabled while mounting containers inside
the sandbox.
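For example, resolving a destination path outside the sandbox boils down to
something like this standalone sketch (illustrative only; the real code
resolves paths relative to the container's rootfs):
```
/* Sketch: resolve a mount destination's symlinks outside the sandbox, so that
 * the spec handed to "runsc boot" already contains resolved paths.
 * Illustrative only. */
#include <err.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  if (argc < 2) {
    errx(1, "syntax: %s [PATH]", argv[0]);
  }
  char resolved[PATH_MAX];
  if (realpath(argv[1], resolved) == NULL) {
    err(1, "realpath(%s)", argv[1]);
  }
  printf("%s -> %s\n", argv[1], resolved);
  return 0;
}
```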
I haven't been able to write a good test for this, so I'm relying on manual
tests for now.
PiperOrigin-RevId: 217749904
Change-Id: I7ac434d5befd230db1488446cda03300cc0751a9
This is a breaking change if you're using --debug-log-dir.
The fix is to replace it with --debug-log and add a '/' at
the end:
--debug-log-dir=/tmp/runsc ==> --debug-log=/tmp/runsc/
PiperOrigin-RevId: 216761212
Change-Id: I244270a0a522298c48115719fa08dad55e34ade1
It was used before gofer was implemented and it's not
supported anymore.
BREAKING CHANGE: the proxy-shared and proxy-exclusive options are now called
shared and exclusive.
PiperOrigin-RevId: 212017643
Change-Id: If029d4073fe60583e5ca25f98abb2953de0d78fd
We construct a dir with the executable bind-mounted at /exe, and proc mounted
at /proc. Runsc now executes the sandbox process inside this chroot, thus
limiting access to the host filesystem. The mounts and chroot dir are removed
when the sandbox is destroyed.
Because this requires bind-mounts, we can only do the chroot if we have
CAP_SYS_ADMIN.
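A standalone C sketch of the idea (layout, flags, and error handling are
illustrative, not runsc's code):
```
/* Sketch: build a minimal chroot containing only the sandbox binary at /exe
 * (bind-mounted) and /proc, then enter it. Requires CAP_SYS_ADMIN. Note that
 * /exe must be a static binary, since the chroot contains no libraries. */
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
  if (argc < 3) errx(1, "syntax: %s [CHROOT_DIR] [EXE]", argv[0]);
  const char *dir = argv[1];
  char exe[4096], proc[4096];
  snprintf(exe, sizeof(exe), "%s/exe", dir);
  snprintf(proc, sizeof(proc), "%s/proc", dir);

  /* A file bind mount needs an existing file as its mount point. */
  const int fd = open(exe, O_CREAT | O_WRONLY, 0555);
  if (fd < 0) err(1, "create %s", exe);
  close(fd);
  if (mount(argv[2], exe, NULL, MS_BIND, NULL) < 0)
    err(1, "bind mount %s", argv[2]);

  if (mkdir(proc, 0555) < 0 && errno != EEXIST) err(1, "mkdir %s", proc);
  if (mount("proc", proc, "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL) < 0)
    err(1, "mount proc");

  if (chroot(dir) < 0) err(1, "chroot %s", dir);
  if (chdir("/") < 0) err(1, "chdir /");
  execl("/exe", "/exe", (char *)NULL);
  err(1, "execl /exe");
}
```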
PiperOrigin-RevId: 211994001
Change-Id: Ia71c515e26085e0b69b833e71691830148bc70d1
When starting a sandbox without direct file or network access, we create an
empty user namespace and run the sandbox in there. However, the root user in
that namespace is still mapped to the root user in the parent namespace.
This CL maps the "nobody" user from the parent namespace into the child
namespace, and runs the sandbox process as user "nobody" inside the new
namespace.
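A standalone C sketch of the technique (not runsc's code; uid 65534 and the
synchronization scheme are illustrative): the parent, which must be
privileged, writes the child's ID maps, because a process cannot map a foreign
ID into its own namespace after unshare:
```
/* Sketch: run a child as "nobody" (uid/gid 65534) inside a new user
 * namespace. The maps are written from the (root) parent after the child
 * unshares. Illustrative only. */
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void write_file(const char *path, const char *data) {
  const int fd = open(path, O_WRONLY);
  if (fd < 0) err(1, "open %s", path);
  if (write(fd, data, strlen(data)) < 0) err(1, "write %s", path);
  close(fd);
}

int main(void) {
  int unshared[2], mapped[2]; /* child->parent and parent->child signals */
  if (pipe(unshared) < 0 || pipe(mapped) < 0) err(1, "pipe");

  const pid_t pid = fork();
  if (pid < 0) err(1, "fork");

  if (pid == 0) {
    char c;
    if (unshare(CLONE_NEWUSER) < 0) err(1, "unshare(CLONE_NEWUSER)");
    write(unshared[1], "x", 1); /* tell the parent to write the maps */
    read(mapped[0], &c, 1);     /* wait until the maps are in place */
    if (setgid(65534) < 0) err(1, "setgid");
    if (setuid(65534) < 0) err(1, "setuid");
    printf("child running as uid %d in a new user namespace\n", (int)getuid());
    _exit(0);
  }

  char c, path[64];
  read(unshared[0], &c, 1);
  /* Map only "nobody" from the parent namespace into the child. */
  snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
  write_file(path, "65534 65534 1\n");
  snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
  write_file(path, "65534 65534 1\n");
  write(mapped[1], "x", 1); /* release the child */
  waitpid(pid, NULL, 0);
  return 0;
}
```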
PiperOrigin-RevId: 211572223
Change-Id: I1b1f9b1a86c0b4e7e5ca7bc93be7d4887678bab6
This is a prereq for running the sandbox process as user "nobody", when it may
not have permissions to open these files.
Instead, we must open them before starting the sandbox process, and pass them
by FD.
The specutils.ReadSpecFromFile method was fixed to always seek to the beginning
of the file before reading. This allows the spec to be read multiple times
from the same FD, as we do in the boot command when the apply-caps flag is set.
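The seek-then-read pattern, in a standalone C sketch (illustrative only; the
actual fix is in the Go specutils package):
```
/* Sketch: rewind a shared FD before each read so the same open file can be
 * parsed multiple times, as the boot command does when apply-caps is set. */
#include <err.h>
#include <unistd.h>

static ssize_t read_from_start(int fd, char *buf, size_t len) {
  if (lseek(fd, 0, SEEK_SET) == (off_t)-1) err(1, "lseek");
  return read(fd, buf, len);
}
```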
Tested with --network=host.
PiperOrigin-RevId: 211570647
Change-Id: I685be0a290aa7f70731ebdce82ebc0ebcc9d475c
This is a prereq for running the sandbox process as user "nobody", when it may
not have permissions to open these files.
Instead, we must open them before starting the sandbox process, and pass them
by FD.
PiperOrigin-RevId: 210995199
Change-Id: I715875a9553290b4a49394a8fcd93be78b1933dd
This is to troubleshoot problems with a hung process that is
not responding to the 'runsc debug --stack' command.
PiperOrigin-RevId: 210483513
Change-Id: I4377b210b4e51bc8a281ad34fd94f3df13d9187d
This file access type is actually called "proxy-shared", but I forgot to update
all locations.
PiperOrigin-RevId: 208832491
Change-Id: I7848bc4ec2478f86cf2de1dcd1bfb5264c6276de
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively. While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.
This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.
This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.
A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.
All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely be served by
the root fs.
PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9