Commit Graph

728 Commits

Author SHA1 Message Date
Ayush Ranjan 8376757495 ext: filesystem boilerplate code.
PiperOrigin-RevId: 259865366
2019-07-24 19:08:21 -07:00
Ayush Ranjan 417096f781 ext: Add tests for root directory inode.
PiperOrigin-RevId: 259856442
2019-07-24 17:59:57 -07:00
Ayush Ranjan 2ed832ff86 ext: testing environment setup with VFS2 support.
PiperOrigin-RevId: 259835948
2019-07-24 16:03:30 -07:00
Chris Kuiper 40e682759f Add support for a subnet prefix length on interface network addresses
This allows the user code to add a network address with a subnet prefix length.
The prefix length value is stored in the network endpoint and provided back to
the user in the ProtocolAddress type.

PiperOrigin-RevId: 259807693
2019-07-24 13:42:14 -07:00
chris.zn 1c5b6d9bd2 Use different pidns among different containers
The different containers in a sandbox used only one pid
namespace before. This results in that a container can see
the processes in another container in the same sandbox.

This patch use different pid namespace for different containers.

Signed-off-by: chris.zn <chris.zn@antfin.com>
2019-07-24 13:38:23 +08:00
Ayush Ranjan 7e38d64333 ext: Inode creation logic.
PiperOrigin-RevId: 259666476
2019-07-23 20:36:04 -07:00
Ayush Ranjan d7bb79b6f1 ext: Add ext2 and ext3 tiny images.
PiperOrigin-RevId: 259657917
2019-07-23 19:01:05 -07:00
Ayush Ranjan bd7708956f ext: Added extent tree building logic.
PiperOrigin-RevId: 259628657
2019-07-23 15:51:50 -07:00
Nicolas Lacasse 04cbb13ce9 Give each container a distinct MountNamespace.
This keeps all container filesystem completely separate from eachother
(including from the root container filesystem), and allows us to get rid of the
"__runsc_containers__" directory.

It also simplifies container startup/teardown as we don't have to muck around
in the root container's filesystem.

PiperOrigin-RevId: 259613346
2019-07-23 14:37:07 -07:00
gVisor bot d706922d78 Merge pull request #571 from lubinszARM:pr_loader
PiperOrigin-RevId: 259427074
2019-07-22 16:12:46 -07:00
Andrei Vagin ec906e46c0 kvm: fix race between machine.Put and machine.Get
m.available.Signal() has to be called under m.mu.RLock, otherwise it can
race with machine.Get:

m.Get			| m.Put
-------------------------------------
m.mu.Lock()		|
Seatching available vcpu|
			| m.available.Signal()
m.available.Wait	|

PiperOrigin-RevId: 259394051
2019-07-22 13:28:16 -07:00
Bin Lu ffe45f38e6 Add ARM64 support to pkg/sentry/loader
Signed-off-by: Bin Lu <bin.lu@arm.com>
2019-07-21 19:30:18 -07:00
gVisor bot f544509c01 Merge pull request #450 from Pixep:feature/add-clock-boottime-as-monotonic
PiperOrigin-RevId: 258996346
2019-07-19 10:44:45 -07:00
Andrei Vagin eefa817cfd net/tcp/setockopt: impelment setsockopt(fd, SOL_TCP, TCP_INQ)
PiperOrigin-RevId: 258859507
2019-07-18 15:41:04 -07:00
Jamie Liu 163ab5e9ba Sentry virtual filesystem, v2
Major differences from the current ("v1") sentry VFS:

- Path resolution is Filesystem-driven (FilesystemImpl methods call
vfs.ResolvingPath methods) rather than VFS-driven (fs package owns a
Dirent tree and calls fs.InodeOperations methods to populate it). This
drastically improves performance, primarily by reducing overhead from
inefficient synchronization and indirection. It also makes it possible
to implement remote filesystem protocols that translate FS system calls
into single RPCs, rather than having to make (at least) one RPC per path
component, significantly reducing the latency of remote filesystems
(especially during cold starts and for uncacheable shared filesystems).

- Mounts are correctly represented as a separate check based on
contextual state (current mount) rather than direct replacement in a
fs.Dirent tree. This makes it possible to support (non-recursive) bind
mounts and mount namespaces.

Included in this CL is fsimpl/memfs, an incomplete in-memory filesystem
that exists primarily to demonstrate intended filesystem implementation
patterns and for benchmarking:

BenchmarkVFS1TmpfsStat/1-6               3000000               497 ns/op
BenchmarkVFS1TmpfsStat/2-6               2000000               676 ns/op
BenchmarkVFS1TmpfsStat/3-6               2000000               904 ns/op
BenchmarkVFS1TmpfsStat/8-6               1000000              1944 ns/op
BenchmarkVFS1TmpfsStat/64-6               100000             14067 ns/op
BenchmarkVFS1TmpfsStat/100-6               50000             21700 ns/op
BenchmarkVFS2MemfsStat/1-6              10000000               197 ns/op
BenchmarkVFS2MemfsStat/2-6               5000000               233 ns/op
BenchmarkVFS2MemfsStat/3-6               5000000               268 ns/op
BenchmarkVFS2MemfsStat/8-6               3000000               477 ns/op
BenchmarkVFS2MemfsStat/64-6               500000              2592 ns/op
BenchmarkVFS2MemfsStat/100-6              300000              4045 ns/op
BenchmarkVFS1TmpfsMountStat/1-6          2000000               679 ns/op
BenchmarkVFS1TmpfsMountStat/2-6          2000000               912 ns/op
BenchmarkVFS1TmpfsMountStat/3-6          1000000              1113 ns/op
BenchmarkVFS1TmpfsMountStat/8-6          1000000              2118 ns/op
BenchmarkVFS1TmpfsMountStat/64-6                  100000             14251 ns/op
BenchmarkVFS1TmpfsMountStat/100-6                 100000             22397 ns/op
BenchmarkVFS2MemfsMountStat/1-6                  5000000               317 ns/op
BenchmarkVFS2MemfsMountStat/2-6                  5000000               361 ns/op
BenchmarkVFS2MemfsMountStat/3-6                  5000000               387 ns/op
BenchmarkVFS2MemfsMountStat/8-6                  3000000               582 ns/op
BenchmarkVFS2MemfsMountStat/64-6                  500000              2699 ns/op
BenchmarkVFS2MemfsMountStat/100-6                 300000              4133 ns/op

From this we can infer that, on this machine:

- Constant cost for tmpfs stat() is ~160ns in VFS2 and ~280ns in VFS1.

- Per-path-component cost is ~35ns in VFS2 and ~215ns in VFS1, a
difference of about 6x.

- The cost of crossing a mount boundary is about 80ns in VFS2
(MemfsMountStat/1 does approximately the same amount of work as
MemfsStat/2, except that it also crosses a mount boundary). This is an
inescapable cost of the separate mount lookup needed to support bind
mounts and mount namespaces.

PiperOrigin-RevId: 258853946
2019-07-18 15:10:29 -07:00
Adrien Leravat 2d11fa05f7 sys_time: Wrap comments to 80 columns 2019-07-17 20:25:18 -07:00
Michael Pratt 6f7e2bb388 Take copyMu in Revalidate
copyMu is required to read child.overlay.upper.

PiperOrigin-RevId: 258662209
2019-07-17 16:12:01 -07:00
Jamie Liu 2bc398bfd8 Separate O_DSYNC and O_SYNC.
PiperOrigin-RevId: 258657913
2019-07-17 15:52:38 -07:00
Ayush Ranjan 84a59de5dc ext: disklayout: extents support.
PiperOrigin-RevId: 258657776
2019-07-17 15:48:58 -07:00
Ayush Ranjan 8e3e021aca ext: Filesystem init implementation.
PiperOrigin-RevId: 258645957
2019-07-17 14:48:04 -07:00
gVisor bot 609cd91e3f Merge pull request #355 from zhuangel:master
PiperOrigin-RevId: 258643966
2019-07-17 14:38:22 -07:00
Bhasker Hariharan 542fbd01a7 Fix race in FDTable.GetFDs().
PiperOrigin-RevId: 258635459
2019-07-17 13:56:49 -07:00
Kevin Krakauer 9f1189130e Add AF_UNIX, SOCK_RAW sockets, which exist for some reason.
tcpdump creates these.

PiperOrigin-RevId: 258611829
2019-07-17 11:49:16 -07:00
gVisor bot 682fd2d68f Merge pull request #533 from kevinGC:stub-dev-tty
PiperOrigin-RevId: 258607547
2019-07-17 11:28:30 -07:00
Michael Pratt ca829158e3 Properly invalidate cache in rename and remove
We were invalidating the wrong overlayEntry in rename and missing invalidation
in rename and remove if lower exists.

PiperOrigin-RevId: 258604685
2019-07-17 11:14:57 -07:00
gVisor bot 78a2704bde Merge pull request #474 from zhuangel:proctasks
PiperOrigin-RevId: 258479216
2019-07-16 18:12:07 -07:00
Jianfeng Tan cf4fc510fd Support /proc/net/dev
This proc file reports the stats of interfaces. We could use ifconfig
command to check the result.

Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Change-Id: Ia7c1e637f5c76c30791ffda68ee61e861b6ef827
COPYBARA_INTEGRATE_REVIEW=https://gvisor-review.googlesource.com/c/gvisor/+/18282/
PiperOrigin-RevId: 258303936
2019-07-15 22:51:05 -07:00
Andrei Vagin 6a8ff6daef kvm: wake up all waiter of vCPU.state
Now we call FUTEX_WAKE with ^uintptr(0) of waiters, but in this case only one
waiter will be waked up. If we want to wake up all of them, the number of
waiters has to be set to math.MaxInt32.

PiperOrigin-RevId: 258285286
2019-07-15 19:27:18 -07:00
Kevin Krakauer 9b4d3280e1 Add IPPROTO_RAW, which allows raw sockets to write IP headers.
iptables also relies on IPPROTO_RAW in a way. It opens such a socket to
manipulate the kernel's tables, but it doesn't actually use any of the
functionality. Blegh.

PiperOrigin-RevId: 257903078
2019-07-12 18:09:12 -07:00
Bhasker Hariharan 6116473b2f Stub out support for TCP_MAXSEG.
Adds support to set/get the TCP_MAXSEG value but does not
really change the segment sizes emitted by netstack or
alter the MSS advertised by the endpoint. This is currently
being added only to unblock iperf3 on gVisor. Plumbing
this correctly requires a bit more work which will come
in separate CLs.

PiperOrigin-RevId: 257859112
2019-07-12 13:35:17 -07:00
gVisor bot eff2c264a4 Merge pull request #282 from zhangningdlut:chris_test_proc
PiperOrigin-RevId: 257855479
2019-07-12 13:11:01 -07:00
Nicolas Lacasse 69e0affaec Don't emit an event for extended attribute syscalls.
These are filesystem-specific, and filesystems are allowed to return ENOTSUP if
they are not supported.

PiperOrigin-RevId: 257813477
2019-07-12 09:11:04 -07:00
Kevin ddef7f8078 Fix license year and remove Read. 2019-07-11 21:31:26 -07:00
Kevin 44427d8e26 Add a stub for /dev/tty.
Actual implementation to follow, but this will satisfy applications that
want it to just exist.
2019-07-11 21:24:27 -07:00
Ayush Ranjan 2eeca68900 Added tiny ext4 image.
The image is of size 64Kb which supports 64 1k blocks
and 16 inodes. This is the smallest size mkfs.ext4 works with.

Added README.md documenting how this was created and included
all files on the device under assets.

PiperOrigin-RevId: 257712672
2019-07-11 17:17:47 -07:00
Ayush Ranjan 5242face2e ext: boilerplate code.
Renamed ext4 to ext since we are targeting ext(2/3/4).
Removed fs.go since we are targeting VFS2.
Added ext.go with filesystem struct.

PiperOrigin-RevId: 257689775
2019-07-11 15:05:36 -07:00
Liu Hua 7581e84cb6 tss: block userspace access to all I/O ports.
A userspace process (CPL=3) can access an i/o port if the bit corresponding to
the port is set to 0 in the I/O permission bitmap.

Configure the I/O permission bitmap address beyond the last valid byte in the
TSS so access to all i/o ports is blocked.

Signed-off-by: Liu Hua <sdu.liu@huawei.com>
Change-Id: I3df76980c3735491db768f7210e71703f86bb989
PiperOrigin-RevId: 257336518
2019-07-09 22:21:56 -07:00
Ayush Ranjan 7965b1272b ext4: disklayout: Directory Entry implementation.
PiperOrigin-RevId: 257314911
2019-07-09 18:36:02 -07:00
Adin Scannell dea3cb92f2 build: add nogo for static validation
PiperOrigin-RevId: 257297820
2019-07-09 16:44:06 -07:00
Adin Scannell cceef9d2cf Cleanup straggling syscall dependencies.
PiperOrigin-RevId: 257293198
2019-07-09 16:18:02 -07:00
Nicolas Lacasse 6db3f8d54c Don't mask errors in createAt loop.
The error set in the loop in createAt was being masked
by other errors declared with ":=". This allowed an
ErrResolveViaReadlink error to escape, which can cause
a sentry panic.

Added test case which repros without the fix.

PiperOrigin-RevId: 257061767
2019-07-08 14:57:15 -07:00
Nicolas Lacasse 659bebab8e Don't try to execute a file that is not regular.
PiperOrigin-RevId: 257037608
2019-07-08 12:56:48 -07:00
Ayush Ranjan 8f9b1ca8e7 ext4: disklayout: inode impl.
PiperOrigin-RevId: 257010414
2019-07-08 10:44:11 -07:00
Andrei Vagin 67f2cefce0 Avoid importing platforms from many source files
PiperOrigin-RevId: 256494243
2019-07-03 22:51:26 -07:00
Ian Lewis da57fb9d25 Fix syscall doc for getresgid
PiperOrigin-RevId: 256481284
2019-07-03 20:13:19 -07:00
Neel Natu 9f2f9f0cab futex: compare keys for equality when doing a FUTEX_UNLOCK_PI.
PiperOrigin-RevId: 256453827
2019-07-03 16:01:38 -07:00
Andrei Vagin 116cac053e netstack/udp: connect with the AF_UNSPEC address family means disconnect
PiperOrigin-RevId: 256433283
2019-07-03 14:19:02 -07:00
gVisor bot f10862696c Merge pull request #493 from ahmetb:reticulating-splines
PiperOrigin-RevId: 256319059
2019-07-03 01:10:34 -07:00
Yong He 85b27a9f8f Solve BounceToKernel may hang issue
BounceToKernel will make vCPU quit from guest ring3 to guest ring0, but
vCPUWaiter is not cleared when we unlock the vCPU, when next time this vCPU
enter guest mode ring3, vCPU may enter guest mode with vCPUWaiter bit setted,
this will cause the following BounceToKernel to this vCPU hangs at
waitUntilNot.

Halt may workaroud this issue, because halt process will reset vCPU status into
vCPUUser, and notify all waiter for vCPU state change, but if there is no
exception or syscall in this period, BounceToKernel will hang at waitUntilNot.

PiperOrigin-RevId: 256299660
2019-07-02 22:03:28 -07:00
Adin Scannell 753da9604e Remove map from fd_map, change to fd_table.
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)

Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.

This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)

PiperOrigin-RevId: 256285890
2019-07-02 19:28:59 -07:00
Ian Lewis 3f14caeb99 Add documentation for remaining syscalls (fixes #197, #186)
Adds support level documentation for all syscalls. Removes the Undocumented
utility function to discourage usage while leaving SupportUndocumented as the
default support level for Syscall structs.

PiperOrigin-RevId: 256281927
2019-07-02 18:45:16 -07:00
Ayush Ranjan d8ec2fb671 Ext4: DiskLayout: Inode interface.
PiperOrigin-RevId: 256234390
2019-07-02 14:04:31 -07:00
Nicolas Lacasse 4f2f44320f Simplify (and fix) refcounts in createAt.
fileOpAt holds references on the Dirents passed as arguments to the callback,
and drops refs when finished, so we don't need to DecRef those Dirents
ourselves

However, all Dirents that we get from FindInode/FindLink must be DecRef'd.

This CL cleans up the ref-counting logic, and fixes some refcount issues in the
process.

PiperOrigin-RevId: 256220882
2019-07-02 12:58:58 -07:00
Ahmet Alp Balkan 4cd28c6e27
sentry/kernel: add syslog message
It feels like "reticulating splines" is missing from the list of meaningless
syslog messages.

Signed-off-by: Ahmet Alp Balkan <ahmetb@google.com>
2019-07-02 12:05:41 -07:00
Ian Gudger 0aa9418a77 Fix unix/transport.queue reference leaks.
Fix two leaks for connectionless Unix sockets:
* Double connect: Subsequent connects would leak a reference on the previously
  connected endpoint.
* Close unconnected: Sockets which were not connected at the time of closure
  would leak a reference on their receiver.

PiperOrigin-RevId: 256070451
2019-07-01 17:46:24 -07:00
Nicolas Lacasse 06537129a6 Check remaining traversal limit when creating a file through a symlink.
This fixes the case when an app tries to create a file that already exists, and
is a symlink to itself. A test was added.

PiperOrigin-RevId: 256044811
2019-07-01 15:25:22 -07:00
Ian Gudger 45566fa4e4 Add finalizer on AtomicRefCount to check for leaks.
PiperOrigin-RevId: 255711454
2019-06-28 20:07:52 -07:00
Adin Scannell 7dae043fec Drop ashmem and binder.
These are unfortunately unused and unmaintained. They can be brought back in
the future if need requires it.

PiperOrigin-RevId: 255697132
2019-06-28 17:20:25 -07:00
Nicolas Lacasse d3f97aec49 Remove events from name_to_handle_at and open_by_handle_at.
These syscalls require filesystem support that gVisor does not provide, and is
not planning to implement. Their absense should not trigger an event.

PiperOrigin-RevId: 255692871
2019-06-28 16:50:24 -07:00
Ayush Ranjan c4da599e22 ext4: disklayout: SuperBlock interface implementations.
PiperOrigin-RevId: 255687771
2019-06-28 16:18:29 -07:00
Nicolas Lacasse 295078fa7a Automated rollback of changelist 255263686
PiperOrigin-RevId: 255679453
2019-06-28 15:28:41 -07:00
Andrei Vagin e21d49c2d8 platform/ptrace: return more detailed errors
Right now, if we can't create a stub process, we will see this error:
panic: unable to activate mm: resource temporarily unavailable

It would be better to know the root cause of this "resource temporarily
unavailable".

PiperOrigin-RevId: 255656831
2019-06-28 13:23:36 -07:00
Ayush Ranjan 7c13789818 Superblock interface in the disk layout package for ext4.
PiperOrigin-RevId: 255644277
2019-06-28 12:07:28 -07:00
Yong He c61d7761b4 Fix deadloop in proc subtask list
Readdir of /proc/x/task/ will get direntry entries
from tasks of specified taskgroup. Now the tasks
slice is unsorted, use sort.SearchInts search entry
from the slice may cause infinity loops.
The fix is sort the slice before search.
This issue could be easily reproduced via following
steps, revise Readdir in pkg/sentry/fs/proc/task.go,
force set taskInts into test slice
[]int{1, 11, 7, 5, 10, 6, 8, 3, 9, 2, 4},
then run docker image and run ls /proc/1/task, the
command will cause infinity loops.
2019-06-28 22:20:57 +08:00
Fabricio Voznika b2907595e5 Complete pipe support on overlayfs
Get/Set pipe size and ioctl support were missing from
overlayfs. It required moving the pipe.Sizer interface
to fs so that overlay could get access.

Fixes #318

PiperOrigin-RevId: 255511125
2019-06-27 17:22:53 -07:00
Michael Pratt 5b41ba5d0e Fix various spelling issues in the documentation
Addresses obvious typos, in the documentation only.

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
2019-06-27 14:25:50 -07:00
Michael Pratt 085a907565 Cache directory entries in the overlay
Currently, the overlay dirCache is only used for a single logical use of
getdents. i.e., it is discard when the FD is closed or seeked back to
the beginning.

But the initial work of getting the directory contents can be quite
expensive (particularly sorting large directories), so we should keep it
as long as possible.

This is very similar to the readdirCache in fs/gofer.

Since the upper filesystem does not have to allow caching readdir
entries, the new CacheReaddir MountSourceOperations method controls this
behavior.

This caching should be trivially movable to all Inodes if desired,
though that adds an additional copy step for non-overlay Inodes.
(Overlay Inodes already do the extra copy).

PiperOrigin-RevId: 255477592
2019-06-27 14:24:03 -07:00
Andrei Vagin e276083903 gvisor/ptrace: grub initial thread registers only once
PiperOrigin-RevId: 255465635
2019-06-27 13:59:57 -07:00
Fabricio Voznika 42e212f6b7 Preserve permissions when checking lower
The code was wrongly assuming that only read access was
required from the lower overlay when checking for permissions.
This allowed non-writable files to be writable in the overlay.

Fixes #316

PiperOrigin-RevId: 255263686
2019-06-26 14:24:44 -07:00
Nicolas Lacasse 857e5c47e9 Follow symlinks when creating a file, and create the target.
If we have a symlink whose target does not exist, creating the symlink (either
via 'creat' or 'open' with O_CREAT flag) should create the target of the
symlink. Previously, gVisor would error with EEXIST in this case

PiperOrigin-RevId: 255232944
2019-06-26 11:49:20 -07:00
Michael Pratt e98ce4a2c6 Add TODO reminder to remove tmpfs caching options
Updates #179

PiperOrigin-RevId: 255081565
2019-06-25 17:12:34 -07:00
Andrei Vagin 03ae91c662 gvisor: lockless read access for task credentials
Credentials are immutable and even before these changes we could read them
without locks, but we needed to take a task lock to get a credential object
from a task object.

It is possible to avoid this lock, if we will guarantee that a credential
object will not be changed after setting it on a task.

PiperOrigin-RevId: 254989492
2019-06-25 09:52:49 -07:00
Adrien Leravat 3688e6e99d Add CLOCK_BOOTTIME as a CLOCK_MONOTONIC alias
Makes CLOCK_BOOTTIME available with
* clock_gettime
* timerfd_create
* clock_gettime vDSO

CLOCK_BOOTTIME is implemented as an alias to CLOCK_MONOTONIC.
CLOCK_MONOTONIC already keeps track of time across save
and restore. This is the closest possible behavior to Linux
CLOCK_BOOTIME, as there is no concept of suspend/resume.

Updates google/gvisor#218
2019-06-24 21:14:38 -07:00
Andrei Vagin e9ea7230f7 fs: synchronize concurrent writes into files with O_APPEND
For files with O_APPEND, a file write operation gets a file size and uses it as
offset to call an inode write operation. This means that all other operations
which can change a file size should be blocked while the write operation doesn't
complete.

PiperOrigin-RevId: 254873771
2019-06-24 17:45:02 -07:00
Adin Scannell 7f5d0afe52 Add O_EXITKILL to ptrace options.
This prevents a race before PDEATH_SIG can take effect during
a sentry crash.

Discovered and solution by avagin@.

PiperOrigin-RevId: 254871534
2019-06-24 17:30:01 -07:00
Rahat Mahmood 94a6bfab5d Implement /proc/net/tcp.
PiperOrigin-RevId: 254854346
2019-06-24 15:56:36 -07:00
Andrei Vagin c5486f5122 platform/ptrace: specify PTRACE_O_TRACEEXIT for stub-processes
The tracee is stopped early  during  process  exit,  when registers are still
available, allowing the tracer to see where the exit occurred, whereas the
normal exit  notifi? cation  is  done  after  the process is finished exiting.

Without this option, dumpAndPanic fails to get registers.

PiperOrigin-RevId: 254852917
2019-06-24 15:48:58 -07:00
Nicolas Lacasse 87df9aab24 Use correct statx syscall number for amd64.
The previous number was for the arm architecture.

Also change the statx tests to force them to run on gVisor, which would have
caught this issue.

PiperOrigin-RevId: 254846831
2019-06-24 15:19:36 -07:00
Fabricio Voznika b21b1db700 Allow to change logging options using 'runsc debug'
New options are:
  runsc debug --strace=off|all|function1,function2
  runsc debug --log-level=warning|info|debug
  runsc debug --log-packets=true|false

Updates #407

PiperOrigin-RevId: 254843128
2019-06-24 15:03:02 -07:00
chris.zn f957fb23cf Return ENOENT when reading /proc/{pid}/task of an exited process
There will be a deadloop when we use getdents to read /proc/{pid}/task
of an exited process

Like this:

Process A is running
                         Process B: open /proc/{pid of A}/task
Process A exits
                         Process B: getdents /proc/{pid of A}/task

Then, process B will fall into deadloop, and return "." and ".."
in loops and never ends.

This patch returns ENOENT when use getdents to read /proc/{pid}/task
if the process is just exited.

Signed-off-by: chris.zn <chris.zn@antfin.com>
2019-06-24 15:49:53 +08:00
Nicolas Lacasse 35719d52c7 Implement statx.
We don't have the plumbing for btime yet, so that field is left off. The
returned mask indicates that btime is absent.

Fixes #343

PiperOrigin-RevId: 254575752
2019-06-22 13:29:26 -07:00
Andrei Vagin ab6774cebf gvisor/fs: getdents returns 0 if offset is equal to FileMaxOffset
FileMaxOffset is a special case when lseek(d, 0, SEEK_END) has been called.

PiperOrigin-RevId: 254498777
2019-06-21 17:25:17 -07:00
Ayush Ranjan 727375321f ext4 block group descriptor implementation in disk layout package.
PiperOrigin-RevId: 254482180
2019-06-21 15:42:46 -07:00
Fabricio Voznika 5ba16d51a9 Add list of stuck tasks to panic message
PiperOrigin-RevId: 254450309
2019-06-21 12:46:53 -07:00
Andrei Vagin f94653b3de kernel: call t.mu.Unlock() explicitly in WithMuLocked
defer here doesn't improve readability, but we know it slower that
the explicit call.

PiperOrigin-RevId: 254441473
2019-06-21 11:55:42 -07:00
Fabricio Voznika 054b5632ef Update comment
PiperOrigin-RevId: 254428866
2019-06-21 10:56:42 -07:00
Jamie Liu 7db8685100 Preallocate auth.NewAnonymousCredentials() in contexttest.TestContext.
Otherwise every call to, say, fs.ContextCanAccessFile() in a benchmark
using contexttest allocates new auth.Credentials, a new
auth.UserNamespace, ...

PiperOrigin-RevId: 254261051
2019-06-20 13:36:14 -07:00
Michael Pratt 292f70cbf7 Add package docs to seqfile and ramfs
These are the only packages missing docs:
https://godoc.org/gvisor.dev/gvisor

PiperOrigin-RevId: 254261022
2019-06-20 13:34:33 -07:00
Neel Natu 0b2135072d Implement madvise(MADV_DONTFORK)
PiperOrigin-RevId: 254253777
2019-06-20 12:56:00 -07:00
Ian Gudger 7e49515696 Deflake SendFileTest_Shutdown.
The sendfile syscall's backing doSplice contained a race with regard to
blocking. If the first attempt failed with syserror.ErrWouldBlock and then
the blocking file became ready before registering a waiter, we would just
return the ErrWouldBlock (even if we were supposed to block).

PiperOrigin-RevId: 254114432
2019-06-19 18:40:54 -07:00
Nicolas Lacasse 29f9e4fa87 fileOp{On,At} should pass the remaning symlink traversal count.
And methods that do more traversals should use the remaining count rather than
resetting.

PiperOrigin-RevId: 254041720
2019-06-19 11:56:34 -07:00
Nicolas Lacasse f7428af9c1 Add MountNamespace to task.
This allows tasks to have distinct mount namespace, instead of all sharing the
kernel's root mount namespace.

Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).

In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.

Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better naem.

PiperOrigin-RevId: 254009310
2019-06-19 09:21:21 -07:00
Fabricio Voznika ca245a428b Attempt to fix TestPipeWritesAccumulate
Test fails because it's reading 4KB instead of the
expected 64KB. Changed the test to read pipe buffer
size instead of hardcode and added some logging in
case the reason for failure was not pipe buffer size.

PiperOrigin-RevId: 253916040
2019-06-18 19:16:11 -07:00
Andrei Vagin 8ab0848c70 gvisor/fs: don't update file.offset for sockets, pipes, etc
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.

With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.

PiperOrigin-RevId: 253746644
2019-06-18 01:43:29 -07:00
Yong He 0dbdca349c Skip tid allocation which is using
When leader of process group (session) exit, the process
group ID (session ID) is holding by other processes in
the process group, so the process group ID (session ID)
can not be reused.
If reusing the process group ID (seession ID) as new process
group ID for new process, this will cause session create
failed, and later runsc crash when access process group.
The fix skip the tid if it is using by a process group
(session) when allocating a new tid.

We could easily reproduce the runsc crash follow
these steps:

1. build test program, and run inside container

int main(int argc, char *argv[])
{
	pid_t cpid, spid;

	cpid = fork();
	if (cpid == -1) {
		perror("fork");
		exit(EXIT_FAILURE);
	}
	if (cpid == 0) {
		pid_t sid = setsid();
		printf("Start New Session %ld\n",sid);
		printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n",
				getpid(),getppid(),getpgid(getpid()),getsid(getpid()));
		spid = fork();
		if (spid == 0) {
			setpgid(getpid(), getpid());
			printf("Set GrandSon as New Process Group\n");
			printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n",
					getpid(),getppid(),getpgid(getpid()),getsid(getpid()));
			while(1) {
				usleep(1);
			}
		}
		sleep(3);
		exit(0);
	} else {
		exit(0);
	}

	return 0;
}

2. build hello program

int main(int argc, char *argv[])
{
	printf("Current PID is %ld\n", (long) getpid());
	return 0;
}

3. run script on host which run hello inside container, you can
speed up the test with set TasksLimit as lower value.
for (( i=0; i<65535; i++ ))
do
	docker exec <container id> /test/hello
done

4. when hello process reusing the process group of loop process,
runsc will crash.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8]

goroutine 612475 [running]:
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0)
        pkg/sentry/kernel/sessions.go:160 +0x78
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0)
        pkg/sentry/kernel/task_exit.go:672 +0x2b7
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0)
        pkg/sentry/kernel/task_exit.go:542 +0xc4
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc)
        pkg/sentry/kernel/task_run.go:91 +0x194
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
        pkg/sentry/kernel/task_start.go:286 +0xfe
2019-06-14 14:05:41 +08:00
Bhasker Hariharan 3d71c627fa Add support for TCP receive buffer auto tuning.
The implementation is similar to linux where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.

This ensures that the advertised window grows at a reasonable rate to accomodate
for the sender's rate and prevents large amounts of data being held in stack
buffers if the application is not actively reading or not reading fast enough.

The original paper that was used to implement the linux receive buffer auto-
tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf

NOTE: Linux does not implement DRS as defined in that paper, it's just a good
reference to understand the solution space.

Updates #230

PiperOrigin-RevId: 253168283
2019-06-13 22:28:01 -07:00
Ian Gudger 3e9b8ecbfe Plumb context through more layers of filesytem.
All functions which allocate objects containing AtomicRefCounts will soon need
a context.

PiperOrigin-RevId: 253147709
2019-06-13 18:40:38 -07:00
Ian Gudger 0a5ee6f7b2 Fix deadlock in fasync.
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.

This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.

The new test is related, but does not cover this particular situation.

Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.

Fixes #346

PiperOrigin-RevId: 253138340
2019-06-13 17:26:22 -07:00
Rahat Mahmood 05ff1ffaad Implement getsockopt() SO_DOMAIN, SO_PROTOCOL and SO_TYPE.
SO_TYPE was already implemented for everything but netlink sockets.

PiperOrigin-RevId: 253138157
2019-06-13 17:24:51 -07:00
Adin Scannell add40fd6ad Update canonical repository.
This can be merged after:
https://github.com/google/gvisor-website/pull/77
  or
https://github.com/google/gvisor-website/pull/78

PiperOrigin-RevId: 253132620
2019-06-13 16:50:15 -07:00