Commit Graph

1361 Commits

Author SHA1 Message Date
Andrei Vagin 66cc0e9f92 gvisor/bazel: use python2 to build runsc-debian
$ bazel build runsc:runsc-debian
  File ".../bazel_tools/tools/build_defs/pkg/make_deb.py", line 311,
  in GetFlagValue:
    flagvalue = flagvalue.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

make_deb.py is incompatible with Python3.
https://github.com/bazelbuild/bazel/issues/8443

PiperOrigin-RevId: 253691923
2019-06-17 17:09:06 -07:00
gVisor bot 99d286370d Internal change.
PiperOrigin-RevId: 253559564
2019-06-17 05:13:55 -07:00
Bhasker Hariharan a8608c501b Enable Receive Buffer Auto-Tuning for runsc.
Updates #230

PiperOrigin-RevId: 253225078
2019-06-14 07:31:45 -07:00
Bhasker Hariharan 3d71c627fa Add support for TCP receive buffer auto tuning.
The implementation is similar to linux where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.

This ensures that the advertised window grows at a reasonable rate to accomodate
for the sender's rate and prevents large amounts of data being held in stack
buffers if the application is not actively reading or not reading fast enough.

The original paper that was used to implement the linux receive buffer auto-
tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf

NOTE: Linux does not implement DRS as defined in that paper, it's just a good
reference to understand the solution space.

Updates #230

PiperOrigin-RevId: 253168283
2019-06-13 22:28:01 -07:00
Ian Gudger 3e9b8ecbfe Plumb context through more layers of filesytem.
All functions which allocate objects containing AtomicRefCounts will soon need
a context.

PiperOrigin-RevId: 253147709
2019-06-13 18:40:38 -07:00
Ian Gudger 0a5ee6f7b2 Fix deadlock in fasync.
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.

This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.

The new test is related, but does not cover this particular situation.

Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.

Fixes #346

PiperOrigin-RevId: 253138340
2019-06-13 17:26:22 -07:00
Rahat Mahmood 05ff1ffaad Implement getsockopt() SO_DOMAIN, SO_PROTOCOL and SO_TYPE.
SO_TYPE was already implemented for everything but netlink sockets.

PiperOrigin-RevId: 253138157
2019-06-13 17:24:51 -07:00
Shentubot 7b0f068258 Merge pull request #306 from amscanne:add_readme
PiperOrigin-RevId: 253137638
2019-06-13 17:20:49 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Jamie Liu 0c8603084d Add p9 and unet benchmarks.
PiperOrigin-RevId: 253122166
2019-06-13 15:53:43 -07:00
Ian Lewis 4fdd560b76 Set the HOME environment variable (fixes #293)
runsc will now set the HOME environment variable as required by POSIX. The
user's home directory is retrieved from the /etc/passwd file located on the
container's file system during boot.

PiperOrigin-RevId: 253120627
2019-06-13 15:45:25 -07:00
Bhasker Hariharan 9f77b36fa1 Set optlen correctly when calling getsockopt.
PiperOrigin-RevId: 253096085
2019-06-13 13:41:39 -07:00
Nicolas Lacasse 94ab253d7e Bump rules_go to v0.18.6, and go toolchain to v1.12.6.
PiperOrigin-RevId: 253061432
2019-06-13 10:47:29 -07:00
Adin Scannell e352f46478 Minor BUILD file cleanup.
PiperOrigin-RevId: 252918338
2019-06-12 15:59:46 -07:00
Bhasker Hariharan 70578806e8 Add support for TCP_CONGESTION socket option.
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.

PiperOrigin-RevId: 252889093
2019-06-12 13:35:50 -07:00
Andrei Vagin bb849bad29 gvisor/runsc: apply seccomp filters before parsing a state file
PiperOrigin-RevId: 252869983
2019-06-12 11:55:24 -07:00
Andrei Vagin 0d05a12fd3 gvisor/ptrace: print guest registers if a stub stopped with unexpected code
PiperOrigin-RevId: 252855280
2019-06-12 10:48:46 -07:00
Fabricio Voznika 356d1be140 Allow 'runsc do' to run without root
'--rootless' flag lets a non-root user execute 'runsc do'.
The drawback is that the sandbox and gofer processes will
run as root inside a user namespace that is mapped to the
caller's user, intead of nobody. And network is defaulted
to '--network=host' inside the root network namespace. On
the bright side, it's very convenient for testing:

runsc --rootless do ls
runsc --rootless do curl www.google.com

PiperOrigin-RevId: 252840970
2019-06-12 09:41:50 -07:00
Adin Scannell df110ad4fe Eat sendfile partial error
For sendfile(2), we propagate a TCP error through the system call layer.
This should be eaten if there is a partial result. This change also adds
a test to ensure that there is no panic in this case, for both TCP sockets
and unix domain sockets.

PiperOrigin-RevId: 252746192
2019-06-11 19:24:35 -07:00
Andrei Vagin 69c8657a66 kokoro: don't overwrite test results for different runtimes
PiperOrigin-RevId: 252724255
2019-06-11 16:36:53 -07:00
Michael Pratt 478a0873e1 Explicitly reference workspace root in test command
oh-my-zsh aliases ... to ../.. [1]. Add an explicit reference
to workspace root to work around the alias.

[1] https://github.com/robbyrussell/oh-my-zsh/blob/master/lib/directories.zsh

Fixes #341

PiperOrigin-RevId: 252720590
2019-06-11 16:15:39 -07:00
Fabricio Voznika fc746efa9a Add support to mount pod shared tmpfs mounts
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.

For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.

PiperOrigin-RevId: 252704037
2019-06-11 14:54:31 -07:00
Fabricio Voznika 847c4b9759 Use net.HardwareAddr for FDBasedLink.LinkAddress
It prints formatted to the log.

PiperOrigin-RevId: 252699551
2019-06-11 14:31:46 -07:00
Fabricio Voznika a775ae82fe Fix broken pipe error building version file
(11:34:09) ERROR: /tmpfs/src/github/repo/runsc/BUILD:82:1: Couldn't build file runsc/version.txt: Executing genrule //runsc:deb-version failed (Broken pipe): bash failed: error executing command

PiperOrigin-RevId: 252691902
2019-06-11 13:56:32 -07:00
Andrei Vagin 307a9854ed gvisor/test: create a per-testcase directory for runsc logs
Otherwise it's hard to find a directory for a specific test case.

PiperOrigin-RevId: 252636901
2019-06-11 09:38:07 -07:00
Ian Lewis 74e397e39a Add introspection for Linux/AMD64 syscalls
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.

Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this info to debug and get info
on the compatibility of the version of runsc they are running or to generate
documentation.

PiperOrigin-RevId: 252558304
2019-06-10 23:38:36 -07:00
Jamie Liu 589f36ac4a Move //pkg/sentry/platform/procid to //pkg/procid.
PiperOrigin-RevId: 252501653
2019-06-10 15:47:25 -07:00
Bhasker Hariharan 3933dd5c04 Fixes to listen backlog handling.
Changes netstack to confirm to current linux behaviour where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly we allow upto
backlog connections to be in SYN-RCVD state as long as the backlog is not full.

We also now drop a SYN if syn cookies are in use and the backlog for the
listening endpoint is full.

Added new tests to confirm the behaviour.

Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.

Fixes #236

PiperOrigin-RevId: 252500462
2019-06-10 15:40:44 -07:00
Rahat Mahmood a00157cc0e Store more information in the kernel socket table.
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.

Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.

Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.

PiperOrigin-RevId: 252495895
2019-06-10 15:17:43 -07:00
Jamie Liu 48961d27a8 Move //pkg/sentry/memutil to //pkg/memutil.
PiperOrigin-RevId: 252124156
2019-06-07 14:52:27 -07:00
Adin Scannell e5fb3aab12 BUILD: Use runsc to generate version
This also ensures BUILD files are correctly formatted.

PiperOrigin-RevId: 251990267
2019-06-06 22:09:55 -07:00
Jamie Liu c933f3eede Change visibility of //pkg/sentry/time.
PiperOrigin-RevId: 251965598
2019-06-06 17:58:55 -07:00
Fabricio Voznika 2e43dcb26b Add alsologtostderr option
When set sends log messages to the error log:

sudo ./runsc --logtostderr do ls
I0531 17:59:58.105064  144564 x:0] ***************************
I0531 17:59:58.105087  144564 x:0] Args: [runsc --logtostderr do ls]
I0531 17:59:58.105112  144564 x:0] PID: 144564
I0531 17:59:58.105125  144564 x:0] UID: 0, GID: 0
[...]

PiperOrigin-RevId: 251964377
2019-06-06 17:50:07 -07:00
Jamie Liu 9ea248489b Cap initial usermem.CopyStringIn buffer size.
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which
passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this
size is measurably inefficient in most cases: most paths will not be
this long, 4 KB is a lot of bytes to zero, and as of this writing the Go
runtime allocator maps only two 4 KB objects to each 8 KB span,
necessitating a call to runtime.mcache.refill() on ~every other call.
Limit the initial buffer size to 256 B instead, and geometrically
reallocate if necessary.

PiperOrigin-RevId: 251960441
2019-06-06 17:22:00 -07:00
Rahat Mahmood 315cf9a523 Use common definition of SockType.
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.

PiperOrigin-RevId: 251956740
2019-06-06 17:00:27 -07:00
Ian Lewis 6a4c006564 Add the gVisor gitter badge to the README
Moves the build badge to just below the logo and adds the gitter badge next to
it for consistency.

PiperOrigin-RevId: 251956383
2019-06-06 16:58:29 -07:00
Fabricio Voznika 02ab1f187c Copy up parent when binding UDS on overlayfs
Overlayfs was expecting the parent to exist when bind(2)
was called, which may not be the case. The fix is to copy
the parent directory to the upper layer before binding
the UDS.

There is not good place to add tests for it. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus copied the parent to the upper layer).
Added it to runsc integration tests for now. If it turns
out we have lots of these kind of tests, we can consider
moving them somewhere more appropriate.

PiperOrigin-RevId: 251954156
2019-06-06 16:45:51 -07:00
Jamie Liu b3f104507d "Implement" mbind(2).
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().

Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.

PiperOrigin-RevId: 251950859
2019-06-06 16:29:46 -07:00
Jamie Liu a26043ee53 Implement reclaim-driven MemoryFile eviction.
PiperOrigin-RevId: 251950660
2019-06-06 16:27:55 -07:00
Fabricio Voznika 93aa7d1167 Remove tmpfs restriction from test
runsc supports UDS over gofer mounts and tmpfs is
not needed for this test.

PiperOrigin-RevId: 251944870
2019-06-06 15:56:20 -07:00
Rahat Mahmood 2d2831e354 Track and export socket state.
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).

For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.

PiperOrigin-RevId: 251934850
2019-06-06 15:04:47 -07:00
Fabricio Voznika bf0b1b9d76 Add overlay dimension to FS related syscall tests
PiperOrigin-RevId: 251929314
2019-06-06 14:38:47 -07:00
Rahat Mahmood 8b8bd8d5b2 Try increase listen backlog.
PiperOrigin-RevId: 251928000
2019-06-06 14:32:04 -07:00
Googler 81eafb2c5e Internal change.
PiperOrigin-RevId: 251902567
2019-06-06 12:29:12 -07:00
Fabricio Voznika 720ec3590d Send error message to docker/kubectl exec on failure
Containerd uses the last error message sent to the log to
print as failure cause for create/exec. This required a
few changes in the logging logic for runsc:
  - cmd.Errorf/Fatalf: now writes a message with 'error'
    level to containerd log, in addition to stderr and
    debug logs, like before.
  - log.Infof/Warningf/Fatalf: are not sent to containerd
    log anymore. They are mostly used for debugging and not
    useful to containerd. In most cases, --debug-log is
    enabled and this avoids the logs messages from being
    duplicated.
  - stderr is not used as default log destination anymore.
    Some commands assume stdio is for the container/process
    running inside the sandbox and it's better to never use
    it for logging. By default, logs are supressed now.

PiperOrigin-RevId: 251881815
2019-06-06 10:49:43 -07:00
Bhasker Hariharan 85be01b42d Add multi-fd support to fdbased endpoint.
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.

This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.

Updates #231

PiperOrigin-RevId: 251852825
2019-06-06 08:07:02 -07:00
Andrei Vagin 79f7cb6c1c netstack/sniffer: log GSO attributes
PiperOrigin-RevId: 251788534
2019-06-05 22:51:53 -07:00
Michael Pratt 57772db2e7 Shutdown host sockets on internal shutdown
This is required to make the shutdown visible to peers outside the
sandbox.

The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.

The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.

Updates #273

PiperOrigin-RevId: 251763060
2019-06-05 18:40:37 -07:00
Andrei Vagin a12848ffeb netstack/tcp: fix calculating a number of outstanding packets
In case of GSO, a segment can container more than one packet
and we need to use the pCount() helper to get a number of packets.

PiperOrigin-RevId: 251743020
2019-06-05 16:30:45 -07:00
Chris Kuiper d18bb4f38a Adjust route when looping multicast packets
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.

PiperOrigin-RevId: 251739298
2019-06-05 16:08:29 -07:00