Commit Graph

290 Commits

Author SHA1 Message Date
Robert Tonic 46beb91912 Fix documentation, clean up seccomp filter installation, rename helpers.
Filter installation has been streamlined and functions renamed. 
Documentation has been fixed to be standards compliant, and missing 
documentation added. gofmt has also been applied to modified files.
2019-09-19 17:10:50 -04:00
Robert Tonic ac38a7ead0 Place the host UDS mounting behind --fsgofer-host-uds-allowed.
This commit allows the use of the `--fsgofer-host-uds-allowed` flag to 
enable mounting sockets and add the appropriate seccomp filters.
2019-09-19 12:37:15 -04:00
Fabricio Voznika 010b093258 Bring back to life features lost in recent refactor
- Sandbox logs are generated when running tests
- Kokoro uploads the sandbox logs
- Supports multiple parallel runs
- Revive script to install locally built runsc with docker

PiperOrigin-RevId: 269337274
2019-09-16 08:17:00 -07:00
Adin Scannell a8834fc555 Update p9 to support flipcall.
PiperOrigin-RevId: 268845090
2019-09-12 23:37:31 -07:00
Ian Gudger fe1f521077 Remove reundant global tcpip.LinkEndpointID.
PiperOrigin-RevId: 267709597
2019-09-06 18:01:14 -07:00
Fabricio Voznika 0f5cdc1e00 Resolve flakes with TestMultiContainerDestroy
Some processes are reparented to the root container depending
on the kill order and the root container would not reap in time.
So some zombie processes were still present when the test checked.

Fix it by running the second container inside a PID namespace.

PiperOrigin-RevId: 267278591
2019-09-04 18:56:49 -07:00
gVisor bot 0789b9cc08 Merge pull request #655 from praveensastry:feature/runsc-ref-chk-leak
PiperOrigin-RevId: 266226714
2019-08-29 14:17:32 -07:00
Fabricio Voznika c39564332b Mount volumes as super user
This used to be the case, but regressed after a recent change.
Also made a few fixes around it and clean up the code a bit.

Closes #720

PiperOrigin-RevId: 265717496
2019-08-27 10:47:16 -07:00
praveensastry 7672eaae25 Add log prefix for better clarity 2019-08-22 22:52:43 +10:00
Tamir Duberstein 573e6e4bba Use tcpip.Subnet in tcpip.Route
This is the first step in replacing some of the redundant types with the
standard library equivalents.

PiperOrigin-RevId: 264706552
2019-08-21 15:31:18 -07:00
praveensastry 73985c6545 Fix the Stringer for leak mode 2019-08-09 17:13:06 +10:00
Rahat Mahmood 13a98df49e netstack: Don't start endpoint goroutines too soon on restore.
Endpoint protocol goroutines were previously started as part of
loading the endpoint. This is potentially too soon, as resources used
by these goroutine may not have been loaded. Protocol goroutines may
perform meaningful work as soon as they're started (ex: incoming
connect) which can cause them to indirectly access resources that
haven't been loaded yet.

This CL defers resuming all protocol goroutines until the end of
restore.

PiperOrigin-RevId: 262409429
2019-08-08 12:33:11 -07:00
praveensastry 8d89c0d92b Remove traces option for ref leak mode 2019-08-06 11:57:50 +10:00
praveensastry 607be0585f Add option to configure reference leak checking 2019-08-06 01:15:48 +10:00
Fabricio Voznika 960a5e5536 Remove stale TODO
This was done in commit 04cbb13ce9

PiperOrigin-RevId: 261414748
2019-08-02 16:35:05 -07:00
Kevin Krakauer 810cc07aab Plumbing for iptables sockopts.
PiperOrigin-RevId: 261413396
2019-08-02 16:26:48 -07:00
Fabricio Voznika b461be88a8 Stops container if gofer is killed
Each gofer now has a goroutine that polls on the FDs used
to communicate with the sandbox. The respective gofer is
destroyed if any of the FDs is closed.

Closes #601

PiperOrigin-RevId: 261383725
2019-08-02 13:47:55 -07:00
Nicolas Lacasse aaaefdf9ca Remove kernel.mounts.
We can get the mount namespace from the CreateProcessArgs in all cases where we
need it. This also gets rid of kernel.Destroy method, since the only thing it
was doing was DecRefing the mounts.

Removing the need to call kernel.SetRootMountNamespace also allowed for some
more simplifications in the container fs setup code.

PiperOrigin-RevId: 261357060
2019-08-02 11:23:11 -07:00
Haibo Xu 1decf76471 Change syscall.POLL to syscall.PPOLL.
syscall.POLL is not supported on arm64, using syscall.PPOLL
to support both the x86 and arm64. refs #63

Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I2c81a063d3ec4e7e6b38fe62f17a0924977f505e
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/543 from xiaobo55x:master ba598263fd3748d1addd48e4194080aa12085164
PiperOrigin-RevId: 260752049
2019-07-30 11:01:29 -07:00
Andrei Vagin 4183b9021a runsc: propagate the alsologtostderr to sub-commands
PiperOrigin-RevId: 260239119
2019-07-26 16:53:54 -07:00
gVisor bot b50122379c Merge pull request #452 from zhangningdlut:chris_test_pidns
PiperOrigin-RevId: 260220279
2019-07-26 15:00:51 -07:00
Fabricio Voznika 7052d21dc4 Automated rollback of changelist 255679453
PiperOrigin-RevId: 260047477
2019-07-25 16:48:49 -07:00
chris.zn 1c5b6d9bd2 Use different pidns among different containers
The different containers in a sandbox used only one pid
namespace before. This results in that a container can see
the processes in another container in the same sandbox.

This patch use different pid namespace for different containers.

Signed-off-by: chris.zn <chris.zn@antfin.com>
2019-07-24 13:38:23 +08:00
Nicolas Lacasse 04cbb13ce9 Give each container a distinct MountNamespace.
This keeps all container filesystem completely separate from eachother
(including from the root container filesystem), and allows us to get rid of the
"__runsc_containers__" directory.

It also simplifies container startup/teardown as we don't have to muck around
in the root container's filesystem.

PiperOrigin-RevId: 259613346
2019-07-23 14:37:07 -07:00
Nicolas Lacasse 8f11e257c9 Take a reference on the already-mounted inode before re-mounting it.
PiperOrigin-RevId: 257855777
2019-07-12 13:15:14 -07:00
gVisor bot c2cebbc8da Merge pull request #375 from jmgao:master
PiperOrigin-RevId: 257041876
2019-07-08 13:51:09 -07:00
Andrei Vagin 67f2cefce0 Avoid importing platforms from many source files
PiperOrigin-RevId: 256494243
2019-07-03 22:51:26 -07:00
Adin Scannell 753da9604e Remove map from fd_map, change to fd_table.
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)

Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.

This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)

PiperOrigin-RevId: 256285890
2019-07-02 19:28:59 -07:00
Ian Gudger 45566fa4e4 Add finalizer on AtomicRefCount to check for leaks.
PiperOrigin-RevId: 255711454
2019-06-28 20:07:52 -07:00
Nicolas Lacasse 295078fa7a Automated rollback of changelist 255263686
PiperOrigin-RevId: 255679453
2019-06-28 15:28:41 -07:00
Andrei Vagin 8a625ceeb1 runsc: allow openat for runsc-race
I see that runsc-race is killed by SIGSYS, because openat isn't
allowed by seccomp filters:
60052 openat(AT_FDCWD, "/proc/sys/vm/overcommit_memory",
			O_RDONLY|O_CLOEXEC <unfinished ...>
60052 <... openat resumed> )            = 257
60052 --- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0xfaacf1,
		si_syscall=__NR_openat, si_arch=AUDIT_ARCH_X86_64} ---

PiperOrigin-RevId: 255640808
2019-06-28 11:49:45 -07:00
Michael Pratt 5b41ba5d0e Fix various spelling issues in the documentation
Addresses obvious typos, in the documentation only.

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
2019-06-27 14:25:50 -07:00
Michael Pratt 085a907565 Cache directory entries in the overlay
Currently, the overlay dirCache is only used for a single logical use of
getdents. i.e., it is discard when the FD is closed or seeked back to
the beginning.

But the initial work of getting the directory contents can be quite
expensive (particularly sorting large directories), so we should keep it
as long as possible.

This is very similar to the readdirCache in fs/gofer.

Since the upper filesystem does not have to allow caching readdir
entries, the new CacheReaddir MountSourceOperations method controls this
behavior.

This caching should be trivially movable to all Inodes if desired,
though that adds an additional copy step for non-overlay Inodes.
(Overlay Inodes already do the extra copy).

PiperOrigin-RevId: 255477592
2019-06-27 14:24:03 -07:00
Fabricio Voznika 42e212f6b7 Preserve permissions when checking lower
The code was wrongly assuming that only read access was
required from the lower overlay when checking for permissions.
This allowed non-writable files to be writable in the overlay.

Fixes #316

PiperOrigin-RevId: 255263686
2019-06-26 14:24:44 -07:00
Andrei Vagin fd16a329ce fsgopher: reopen files via /proc/self/fd
When we reopen file by path, we can't be sure that
we will open exactly the same file. The file can be
deleted and another one with the same name can be
created.

PiperOrigin-RevId: 254898594
2019-06-24 21:44:27 -07:00
Fabricio Voznika b21b1db700 Allow to change logging options using 'runsc debug'
New options are:
  runsc debug --strace=off|all|function1,function2
  runsc debug --log-level=warning|info|debug
  runsc debug --log-packets=true|false

Updates #407

PiperOrigin-RevId: 254843128
2019-06-24 15:03:02 -07:00
Bhasker Hariharan a8608c501b Enable Receive Buffer Auto-Tuning for runsc.
Updates #230

PiperOrigin-RevId: 253225078
2019-06-14 07:31:45 -07:00
Ian Gudger 3e9b8ecbfe Plumb context through more layers of filesytem.
All functions which allocate objects containing AtomicRefCounts will soon need
a context.

PiperOrigin-RevId: 253147709
2019-06-13 18:40:38 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00
Ian Lewis 4fdd560b76 Set the HOME environment variable (fixes #293)
runsc will now set the HOME environment variable as required by POSIX. The
user's home directory is retrieved from the /etc/passwd file located on the
container's file system during boot.

PiperOrigin-RevId: 253120627
2019-06-13 15:45:25 -07:00
Josh Gao b915a25597 Fix use of "2 ^ 30".
2 ^ 30 is 28, not 1073741824.
2019-06-13 14:26:26 -07:00
Andrei Vagin bb849bad29 gvisor/runsc: apply seccomp filters before parsing a state file
PiperOrigin-RevId: 252869983
2019-06-12 11:55:24 -07:00
Fabricio Voznika 356d1be140 Allow 'runsc do' to run without root
'--rootless' flag lets a non-root user execute 'runsc do'.
The drawback is that the sandbox and gofer processes will
run as root inside a user namespace that is mapped to the
caller's user, intead of nobody. And network is defaulted
to '--network=host' inside the root network namespace. On
the bright side, it's very convenient for testing:

runsc --rootless do ls
runsc --rootless do curl www.google.com

PiperOrigin-RevId: 252840970
2019-06-12 09:41:50 -07:00
Fabricio Voznika fc746efa9a Add support to mount pod shared tmpfs mounts
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.

For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.

PiperOrigin-RevId: 252704037
2019-06-11 14:54:31 -07:00
Fabricio Voznika 847c4b9759 Use net.HardwareAddr for FDBasedLink.LinkAddress
It prints formatted to the log.

PiperOrigin-RevId: 252699551
2019-06-11 14:31:46 -07:00
Jamie Liu 48961d27a8 Move //pkg/sentry/memutil to //pkg/memutil.
PiperOrigin-RevId: 252124156
2019-06-07 14:52:27 -07:00
Jamie Liu a26043ee53 Implement reclaim-driven MemoryFile eviction.
PiperOrigin-RevId: 251950660
2019-06-06 16:27:55 -07:00
Bhasker Hariharan 85be01b42d Add multi-fd support to fdbased endpoint.
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.

This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.

Updates #231

PiperOrigin-RevId: 251852825
2019-06-06 08:07:02 -07:00
Michael Pratt 57772db2e7 Shutdown host sockets on internal shutdown
This is required to make the shutdown visible to peers outside the
sandbox.

The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.

The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.

Updates #273

PiperOrigin-RevId: 251763060
2019-06-05 18:40:37 -07:00
Fabricio Voznika f1aee6a7ad Refactor container FS setup
No change in functionaly. Added containerMounter object
to keep state while the mounts are processed. This will
help upcoming changes to share mounts per-pod.

PiperOrigin-RevId: 251350096
2019-06-03 18:20:57 -07:00
Fabricio Voznika d28f71adcf Remove 'clearStatus' option from container.Wait*PID()
clearStatus was added to allow detached execution to wait
on the exec'd process and retrieve its exit status. However,
it's not currently used. Both docker and gvisor-containerd-shim
wait on the "shim" process and retrieve the exit status from
there. We could change gvisor-containerd-shim to use waits, but
it will end up also consuming a process for the wait, which is
similar to having the shim process.

Closes #234

PiperOrigin-RevId: 251349490
2019-06-03 18:16:09 -07:00
Michael Pratt 955685845e Remove spurious period
PiperOrigin-RevId: 251288885
2019-06-03 12:48:24 -07:00
Bhasker Hariharan 035a8fa38e Add support for collecting execution trace to runsc.
Updates #220

PiperOrigin-RevId: 250532302
2019-05-30 12:07:11 -07:00
Fabricio Voznika c091e62369 Set sticky bit to /tmp
This is generally done for '/tmp' to prevent accidental
deletion of files. More details here:
http://man7.org/linux/man-pages/man1/chmod.1.html#RESTRICTED_DELETION_FLAG_OR_STICKY_BIT

PiperOrigin-RevId: 249633207
Change-Id: I444a5b406fdef664f5677b2f20f374972613a02b
2019-05-23 06:48:00 -07:00
Fabricio Voznika 9006304dfe Initial support for bind mounts
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.

PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
2019-05-23 04:16:10 -07:00
Fabricio Voznika ecb0f00e10 Cleanup around urpc file payload handling
urpc always closes all files once the RPC function returns.

PiperOrigin-RevId: 248406857
Change-Id: I400a8562452ec75c8e4bddc2154948567d572950
2019-05-15 14:36:28 -07:00
Andrei Vagin 85380ff03d gvisor/runsc: use a veth link address instead of generating a new one
PiperOrigin-RevId: 248367340
Change-Id: Id792afcfff9c9d2cfd62cae21048316267b4a924
2019-05-15 11:11:58 -07:00
Fabricio Voznika 1bee43be13 Implement fallocate(2)
Closes #225

PiperOrigin-RevId: 247508791
Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-09 15:35:49 -07:00
Andrei Vagin bf0ac565d2 Fix runsc restore to be compatible with docker start --checkpoint ...
Change-Id: I02b30de13f1393df66edf8829fedbf32405d18f8
PiperOrigin-RevId: 246621192
2019-05-03 21:41:45 -07:00
Jamie Liu 8bfb83d0ac Implement async MemoryFile eviction, and use it in CachingInodeOperations.
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.

Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.

As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.

PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
2019-04-30 13:56:41 -07:00
Michael Pratt 4d52a55201 Change copyright notice to "The gVisor Authors"
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.

1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.

Fixes #209

PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29 14:26:23 -07:00
Nicolas Lacasse f4ce43e1f4 Allow and document bug ids in gVisor codebase.
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29 14:04:14 -07:00
Kevin Krakauer 43dff57b87 Make raw sockets a toggleable feature disabled by default.
PiperOrigin-RevId: 245511019
Change-Id: Ia9562a301b46458988a6a1f0bbd5f07cbfcb0615
2019-04-26 16:51:46 -07:00
Bhasker Hariharan 99b877fa1d Revert runsc to use RecvMMsg packet dispatcher.
PacketMMap mode has issues due to a kernel bug. This change
reverts us to using recvmmsg instead of a shared ring buffer to
dispatch inbound packets. This will reduce performance but should
be more stable under heavy load till PacketMMap is updated to
use TPacketv3.

See #210 for details.

Perf difference between recvmmsg vs packetmmap.

RecvMMsg :
iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43478 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   778 MBytes  6.53 Gbits/sec  4349    188 KBytes
[  4]   1.00-2.00   sec   786 MBytes  6.59 Gbits/sec  4395    212 KBytes
[  4]   2.00-3.00   sec   756 MBytes  6.34 Gbits/sec  3655    161 KBytes
[  4]   3.00-4.00   sec   782 MBytes  6.56 Gbits/sec  4419    175 KBytes
[  4]   4.00-5.00   sec   755 MBytes  6.34 Gbits/sec  4317    187 KBytes
[  4]   5.00-6.00   sec   774 MBytes  6.49 Gbits/sec  4002    173 KBytes
[  4]   6.00-7.00   sec   737 MBytes  6.18 Gbits/sec  3904    191 KBytes
[  4]   7.00-8.00   sec   530 MBytes  4.44 Gbits/sec  3318    189 KBytes
[  4]   8.00-9.00   sec   487 MBytes  4.09 Gbits/sec  2627    188 KBytes
[  4]   9.00-10.00  sec   770 MBytes  6.46 Gbits/sec  4221    170 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec  39207             sender
[  4]   0.00-10.00  sec  6.99 GBytes  6.00 Gbits/sec                  receiver

iperf Done.

PacketMMap:

bhaskerh@gvisor-bench:~/tensorflow$ iperf3 -c 172.17.0.2
Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.1 port 43496 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   657 MBytes  5.51 Gbits/sec    0   1.01 MBytes
[  4]   1.00-2.00   sec  1021 MBytes  8.56 Gbits/sec    0   1.01 MBytes
[  4]   2.00-3.00   sec  1.21 GBytes  10.4 Gbits/sec   45   1.01 MBytes
[  4]   3.00-4.00   sec  1018 MBytes  8.54 Gbits/sec   15   1.01 MBytes
[  4]   4.00-5.00   sec  1.28 GBytes  11.0 Gbits/sec   45   1.01 MBytes
[  4]   5.00-6.00   sec  1.38 GBytes  11.9 Gbits/sec    0   1.01 MBytes
[  4]   6.00-7.00   sec  1.34 GBytes  11.5 Gbits/sec   45    856 KBytes
[  4]   7.00-8.00   sec  1.23 GBytes  10.5 Gbits/sec    0    901 KBytes
[  4]   8.00-9.00   sec  1010 MBytes  8.48 Gbits/sec    0    923 KBytes
[  4]   9.00-10.00  sec  1.39 GBytes  11.9 Gbits/sec    0    960 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec  150             sender
[  4]   0.00-10.00  sec  11.4 GBytes  9.83 Gbits/sec                  receiver

Updates #210

PiperOrigin-RevId: 244968438
Change-Id: Id461b5cbff2dea6fa55cfc108ea246d8f83da20b
2019-04-23 19:07:06 -07:00
Fabricio Voznika c8cee7108f Use FD limit and file size limit from host
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.

PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17 12:57:40 -07:00
Fabricio Voznika 9f8c89fc7f Return error from fdbased.New
RELNOTES: n/a
PiperOrigin-RevId: 244031742
Change-Id: Id0cdb73194018fb5979e67b58510ead19b5a2b81
2019-04-17 11:16:35 -07:00
Bhasker Hariharan eaac2806ff Add TCP checksum verification.
PiperOrigin-RevId: 242704699
Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f
2019-04-09 11:23:47 -07:00
Andrei Vagin 88409e983c gvisor: Add support for the MS_NOEXEC mount option
https://github.com/google/gvisor/issues/145

PiperOrigin-RevId: 242044115
Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
2019-04-04 17:43:53 -07:00
Kevin Krakauer f9431fb20f Remove obsolete TODO.
PiperOrigin-RevId: 241637164
Change-Id: I65476a739cf38f1818dc47f6ce60638dec8b77a8
2019-04-02 17:27:05 -07:00
Kevin Krakauer a40ee4f4b8 Change bug number for duplicate bug.
PiperOrigin-RevId: 241567897
Change-Id: I580eac04f52bb15f4aab7df9822c4aa92e743021
2019-04-02 11:28:06 -07:00
Andrei Vagin a046054ba3 gvisor/runsc: enable generic segmentation offload (GSO)
The linux packet socket can handle GSO packets, so we can segment packets to
64K instead of the MTU which is usually 1500.

Here are numbers for the nginx-1m test:
runsc:		579330.01 [Kbytes/sec] received
runsc-gso:	1794121.66 [Kbytes/sec] received
runc:		2122139.06 [Kbytes/sec] received

and for tcp_benchmark:

$ tcp_benchmark  --duration 15   --ideal
[  4]  0.0-15.0 sec  86647 MBytes  48456 Mbits/sec

$ tcp_benchmark --client --duration 15   --ideal
[  4]  0.0-15.0 sec  2173 MBytes  1214 Mbits/sec

$ tcp_benchmark --client --duration 15   --ideal --gso 65536
[  4]  0.0-15.0 sec  19357 MBytes  10825 Mbits/sec

PiperOrigin-RevId: 241072403
Change-Id: I20b03063a1a6649362b43609cbbc9b59be06e6d5
2019-03-29 16:27:38 -07:00
Jamie Liu 8f4634997b Decouple filemem from platform and move it to pgalloc.MemoryFile.
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.

PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-14 08:12:48 -07:00
Nicolas Lacasse 2512cc5617 Allow filesystem.Mount to take an optional interface argument.
PiperOrigin-RevId: 238360231
Change-Id: I5eaf8d26f8892f77d71c7fbd6c5225ef471cedf1
2019-03-13 19:24:03 -07:00
Ian Gudger a16f6e50c5 Make HandleLocal apply to all non-loopback interfaces.
HandleLocal is very similar conceptually to MULTICAST_LOOP, so we can unify
the implementations. This has the benefit of making HandleLocal apply even when
the fdbased link endpoint isn't in use.

In addition, move looping logic to route creation so that it doesn't need to be
run for each packet. This should improve performance.

PiperOrigin-RevId: 238099480
Change-Id: I72839f16f25310471453bc9d3fb8544815b25c23
2019-03-12 14:37:56 -07:00
Fabricio Voznika bc9b979b94 Add profiling commands to runsc
Example:
  runsc debug --root=<dir> \
      --profile-heap=/tmp/heap.prof \
      --profile-cpu=/tmp/cpu.prod --profile-delay=30 \
      <container ID>
PiperOrigin-RevId: 237848456
Change-Id: Icff3f20c1b157a84d0922599eaea327320dad773
2019-03-11 11:47:30 -07:00
Ian Gudger 56a6128295 Implement IP_MULTICAST_LOOP.
IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default
route are looped back. In order to implement this switch, support for sending
and looping back multicast packets on the default route had to be implemented.

For now we only support IPv4 multicast.

PiperOrigin-RevId: 237534603
Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c
2019-03-08 15:49:17 -08:00
Fabricio Voznika 0b76887147 Priority-inheritance futex implementation
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.

PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
2019-03-05 23:40:18 -08:00
Fabricio Voznika fcba4e8f04 Add uncaught signal message to the user log
This help troubleshoot cases where the container is killed and the
app logs don't show the reason.

PiperOrigin-RevId: 236982883
Change-Id: I361892856a146cea5b04abaa3aedbf805e123724
2019-03-05 22:20:17 -08:00
Fabricio Voznika 3dbd4a16f8 Add semctl(GETPID) syscall
Also added unimplemented notification for semctl(2)
commands.

PiperOrigin-RevId: 236340672
Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478
2019-03-01 10:57:02 -08:00
Kevin Krakauer b75aa51504 Rename ping endpoints to icmp endpoints.
PiperOrigin-RevId: 235248572
Change-Id: I5b0538b6feb365a98712c2a2d56d856fe80a8a09
2019-02-22 13:34:47 -08:00
Nicolas Lacasse 0a41ea72c1 Don't allow writing or reading to TTY unless process group is in foreground.
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.

See drivers/tty/n_tty.c:n_tty_read()=>job_control().

If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.

See drivers/tty/tty_io.c:tty_check_change().

PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-14 15:47:31 -08:00
Bhasker Hariharan e0b3d3323f Add support for using PACKET_RX_RING to receive packets.
PACKET_RX_RING allows the use of an mmapped buffer to receive packets from the
kernel. This should cut down the number of host syscalls that need to be made
to receive packets when the underlying fd is a socket of the AF_PACKET type.

PiperOrigin-RevId: 233834998
Change-Id: I8060025c6ced206986e94cc46b8f382b81bfa47f
2019-02-13 14:53:03 -08:00
Nicolas Lacasse 92e85623a0 Factor the subtargets method into a helper method with tests.
PiperOrigin-RevId: 232047515
Change-Id: I00f036816e320356219be7b2f2e6d5fe57583a60
2019-02-01 15:23:43 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Andrei Vagin 7e8a56087b runsc: check whether a container is deleted or not before setupContainerFS
PiperOrigin-RevId: 231811387
Change-Id: Ib143fb9a4d0fa1f105d1a3a3bd533dfc44e792af
2019-01-31 10:34:15 -08:00
Andrei Vagin dd577f5410 runsc: reap a sandbox process only in sandbox.Wait()
PiperOrigin-RevId: 231504064
Change-Id: I585b769aef04a3ad7e7936027958910a6eed9c8d
2019-01-29 17:15:56 -08:00
Bhasker Hariharan 24cb2c0a72 Use recvmmsg() instead of readv() to read packets from NIC.
This should reduce the number of syscalls required to process packets
significantly and improve throughputs.

PiperOrigin-RevId: 231366886
Change-Id: I8b38077262bf9c53176bc4a94b530188d3d7c0ca
2019-01-29 01:39:01 -08:00
Fabricio Voznika c1be25b78d Scrub runsc error messages
Removed "error" and "failed to" prefix that don't add value
from messages. Adjusted a few other messages.  In particular,
when the container fail to start, the message returned is easier
for humans to read:

$ docker run --rm --runtime=runsc alpine foobar
docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory

Closes #77

PiperOrigin-RevId: 230022798
Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1
2019-01-18 17:36:02 -08:00
Fabricio Voznika e4d3ca7263 Prevent internal tmpfs mount to override files in /tmp
Runsc wants to mount /tmp using internal tmpfs implementation for
performance. However, it risks hiding files that may exist under
/tmp in case it's present in the container. Now, it only mounts
over /tmp iff:
  - /tmp was not explicitly asked to be mounted
  - /tmp is empty

If any of this is not true, then /tmp maps to the container's
image /tmp.

Note: checkpoint doesn't have sentry FS mounted to check if /tmp
is empty. It simply looks for explicit mounts right now.
PiperOrigin-RevId: 229607856
Change-Id: I10b6dae7ac157ef578efc4dfceb089f3b94cde06
2019-01-16 12:48:32 -08:00
Nicolas Lacasse dc8450b567 Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.

PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-14 20:34:28 -08:00
Michael Pratt 33191e1cc4 Automated rollback of changelist 225089593
PiperOrigin-RevId: 227595007
Change-Id: If14cc5aab869c5fd7a4ebd95929c887ab690e94c
2019-01-02 15:48:00 -08:00
Fabricio Voznika a891afad6d Simplify synchronization between runsc and sandbox process
Make 'runsc create' join cgroup before creating sandbox process.
This removes the need to synchronize platform creation and ensure
that sandbox process is charged to the right cgroup from the start.

PiperOrigin-RevId: 227166451
Change-Id: Ieb4b18e6ca0daf7b331dc897699ca419bc5ee3a2
2018-12-28 13:48:24 -08:00
Jamie Liu 194ef586fc Rename limits.MemoryPagesLocked to limits.MemoryLocked.
"RLIMIT_MEMLOCK: This is the maximum number of bytes of memory that may
be locked into RAM." - getrlimit(2)

PiperOrigin-RevId: 226384346
Change-Id: Iefac4a1bb69f7714dc813b5b871226a8344dc800
2018-12-20 13:28:46 -08:00
Googler 86c9bd2547 Automated rollback of changelist 225861605
PiperOrigin-RevId: 226224230
Change-Id: Id24c7d3733722fd41d5fe74ef64e0ce8c68f0b12
2018-12-19 13:30:08 -08:00
Michael Pratt b62591e6a8 Expose internal testing flag
Never to used outside of runsc tests!

PiperOrigin-RevId: 225919013
Change-Id: Ib3b14aa2a2564b5246fb3f8933d95e01027ed186
2018-12-17 17:35:06 -08:00
Jamie Liu 2421006426 Implement mlock(), kind of.
Currently mlock() and friends do nothing whatsoever. However, mlocking
is directly application-visible in a number of ways; for example,
madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked
regions. We handle this inconsistently: MADV_DONTNEED is too important
to not work, but MS_INVALIDATE is rejected.

Change MM to track mlocked regions in a manner consistent with Linux.
It still will not actually pin pages into host physical memory, but:

- mlock() will now cause sentry memory management to precommit mlocked
pages.

- MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as
described above.

PiperOrigin-RevId: 225861605
Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee
2018-12-17 11:38:59 -08:00
Michael Pratt 24c1158b9c Add "trace signal" option
This option is effectively equivalent to -panic-signal, except that the
sandbox does not die after logging the traceback.

PiperOrigin-RevId: 225089593
Change-Id: Ifb1c411210110b6104613f404334bd02175e484e
2018-12-11 16:12:41 -08:00
Brian Geffon d3bc79bc84 Open source system call tests.
PiperOrigin-RevId: 224886231
Change-Id: I0fccb4d994601739d8b16b1d4e6b31f40297fb22
2018-12-10 14:42:34 -08:00
Zhaozhong Ni 9984138abe sentry: turn "dynamically-created" procfs files into static creation.
PiperOrigin-RevId: 224600982
Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34
2018-12-07 17:03:54 -08:00
Brian Geffon 82719be42e Max link traversals should be for an entire path.
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.

PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-12-04 14:32:03 -08:00