The caller must call Readdir() at least twice to detect
EOF. The old code was always restarting the directory
search and then skipping elements already seen, effectively
doubling the cost to read a directory. The code now
remembers the last offset and doesn't reposition the cursor
if next request comes at the same offset.
PiperOrigin-RevId: 244957816
Change-Id: If21a8dc68b76614adbcf4301439adfda40f2643f
p9.messageByType was taking 7% of p9.recv before, spending time
with reflection and map lookup. Now it's reduced to 1%.
PiperOrigin-RevId: 244947313
Change-Id: I42813f920557b7656f8b29157eb32acd79e11fa5
os.NewFile() accounts for 38% of CPU time in localFile.Walk().
This change switchs to use fd.FD which is much cheaper to create.
Now, fd.New() in localFile.Walk() accounts for only 4%.
PiperOrigin-RevId: 244944983
Change-Id: Ic892df96cf2633e78ad379227a213cb93ee0ca46
Create, Start, and Destroy were racing to create and destroy the
metadata directory of containers.
This is a re-upload of
https://gvisor-review.googlesource.com/c/gvisor/+/16260, but with the
correct account.
Change-Id: I16b7a9d0971f0df873e7f4145e6ac8f72730a4f1
PiperOrigin-RevId: 244892991
For a symbol link to some directory, eg.
`/tmp/symlink -> /tmp/dir`
`fstatat("/tmp/symlink")` should return symbol link data, but
`fstatat("/tmp/symlink/")` (symlink with trailing slash) should return
directory data it points following linux behaviour.
Currently fstatat() a symlink with trailing slash will get "not a
directory" error which is wrong.
Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I63469b1fb89d083d1c1255d32d52864606fbd7e2
PiperOrigin-RevId: 244783916
Support shutdown on only the read side of an endpoint. Reads performed
after a call to Shutdown with only the ShutdownRead flag will return
ErrClosedForReceive without data.
Break out the shutdown(2) with SHUT_RD syscall test into to two tests.
The first tests that no packets are sent when shutting down the read
side of a socket. The second tests that, after shutting down the read
side of a socket, unread data can still be read, or an EOF if there is
no more data to read.
Change-Id: I9d7c0a06937909cbb466b7591544a4bcaebb11ce
PiperOrigin-RevId: 244459430
The MSG_TRUNC flag is set in the msghdr when a message is truncated.
Fixesgoogle/gvisor#200
PiperOrigin-RevId: 244440486
Change-Id: I03c7d5e7f5935c0c6b8d69b012db1780ac5b8456
Inode ids are only guaranteed to be stable across save/restore if the file is
held open. This CL fixes a simple stat test to allow it to compare symlink and
target by inode id, as long as the link target is held open.
PiperOrigin-RevId: 244238343
Change-Id: I74c5115915b1cc032a4c16515a056a480f218f00
Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER
when attempting to set unsupported values.
PiperOrigin-RevId: 244229675
Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1
It is possible to create a listening socket which will accept
IPv4 and IPv6 connections. In this case, we set IPv6ProtocolNumber
for all accepted endpoints, even if they handle IPv4 connections.
This means that we can't use endpoint.netProto to set gso.L3HdrLen.
PiperOrigin-RevId: 244227948
Change-Id: I5e1863596cb9f3d216febacdb7dc75651882eef1
The existing logic attempting to do this is incorrect. Unary ^ has
higher precedence than &^, so mask always has UnblockableSignals
cleared, allowing dequeueSignalLocked to dequeue unblockable signals
(which allows userspace to ignore them).
Switch the logic so that unblockable signals are always masked.
PiperOrigin-RevId: 244058487
Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.
PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
Current, doPoll copies the user struct pollfd array into a
[]syscalls.PollFD, which contains internal kdefs.FD and
waiter.EventMask types. While these are currently binary-compatible with
the Linux versions, we generally discourage copying directly to internal
types (someone may inadvertantly change kdefs.FD to uint64).
Instead, copy directly to a []linux.PollFD, which will certainly be
binary compatible. Most of syscalls/polling.go is included directly into
syscalls/linux/sys_poll.go, as it can then operate directly on
linux.PollFD. The additional syscalls.PollFD type is providing little
value.
I've also added explicit conversion functions for waiter.EventMask,
which creates the possibility of a different binary format.
PiperOrigin-RevId: 244042947
Change-Id: I24e5b642002a32b3afb95a9dcb80d4acd1288abf
The file layout in the bucket is changed a little bit recently to support both v1 shim and v2 shim.
PiperOrigin-RevId: 243682904
Change-Id: Ic1373c6dc088ef41f829e7ce3ea3762e1e2b0292
It provides an easy way to run commands to quickly test gVisor.
By default it maps the host root as the container root with a
writable overlay on top (so the host root is not modified).
Example:
sudo runsc do ls -lh --color
sudo runsc do ~/src/test/my-test.sh
PiperOrigin-RevId: 243178711
Change-Id: I05f3d6ce253fe4b5f1362f4a07b5387f6ddb5dd9
Normal files display their path in the current mount namespace:
I0410 10:57:54.964196 216336 x:0] [ 1] ls X read(0x3 /proc/filesystems, 0x55cee3bdb2c0 "nodev\t9p\nnodev\tdevpts \nnodev\tdevtmpfs\nnodev\tproc\nnodev\tramdiskfs\nnodev\tsysfs\nnodev\ttmpfs\n", 0x1000) = 0x58 (24.462?s)
AT_FDCWD includes the CWD:
I0411 12:58:48.278427 1526 x:0] [ 1] stat_test E newfstatat(AT_FDCWD /home/prattmic, 0x55ea719b564e /proc/self, 0x7ef5cefc2be8, 0x0)
Sockets (and other non-vfs files) display an inode number (like
/proc/PID/fd):
I0410 10:54:38.909123 207684 x:0] [ 1] nc E bind(0x3 socket:[1], 0x55b5a1652040 {Family: AF_INET, Addr: , Port: 8080}, 0x10)
I also fixed a few syscall args that should be Path.
PiperOrigin-RevId: 243169025
Change-Id: Ic7dda6a82ae27062fe2a4a371557acfd6a21fa2a
RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.
PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
Even superuser cannot raise RLIMIT_NOFILE above /proc/sys/fs/nr_open, so
start the test by lowering the limits before raising.
Change-Id: Ied6021c64178a6cb9098088a1a3384db523a226f
PiperOrigin-RevId: 242965249
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save. There is no reason not to save it along with the MountSource.
This did uncover an issue where not all MountSources were properly flushed
before Save. If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first. This CL also fixes that by flushing
all MountSources for all open FDs on Save.
PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
From a recent test failure:
"State:\tD (disk sleep)\n"
"disk sleep" does not match \w+. We need to allow spaces.
PiperOrigin-RevId: 242762469
Change-Id: Ic8d05a16669412a72c1e76b498373e5b22fe64c4
From sendfile spec and also the linux kernel code, we should
limit the count arg to 'MAX_RW_COUNT'. This patch export
'MAX_RW_COUNT' in kernel pkg and use it in the implementation
of sendfile syscall.
Signed-off-by: Li Qiang <pangpei.lq@antfin.com>
Change-Id: I1086fec0685587116984555abd22b07ac233fbd2
PiperOrigin-RevId: 242745831
Otherwise, we will not have capabilities in the user namespace.
And this patch adds the noexec option for mounts.
https://github.com/google/gvisor/issues/145
PiperOrigin-RevId: 242706519
Change-Id: I1b78b77d6969bd18038c71616e8eb7111b71207c
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.
We were creating these directories without write permissions, which meant that
they were not wribable even when underlayed under a writable filesystem. They
should be writable.
PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.
PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
This CL merges all RBE-specific configuration from .bazelrc_rbe into .bazelrc
so that it will be picked up by default by users running bazel.
It also checks in a bazelrc from the upstream bazel-toolchains repository, and
imports that into our repo-specific .bazelrc. This makes it easier to maintain
and update the bazelrc going forward.
Documentation was added to the README.
PiperOrigin-RevId: 242208733
Change-Id: Iea32de9be85b024bd74f88909b56b2a8ab34851a
From the SDM: "The least-significant byte in register EAX (register AL)
will always return 01H. Software should ignore this value and not
interpret it as an informational descriptor."
Unfortunately, online docs [1] [2] (likely based on an old version of the SDM)
say: "The least-significant byte in register EAX (register AL) indicates
the number of times the CPUID instruction must be executed with an input
value of 2 to get a complete description of the processor's caches and
TLBs."
dlang uses this second interpretation [3] and will loop 2^32 times if we
return zero. Fix this by specifying the fixed value of one. We still
don't support exposing the actual cache information, leaving all other
bytes empty. A zero byte means: "Null descriptor, this byte contains no
information."
[1] http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h
[2] https://c9x.me/x86/html/file_module_x86_id_45.html
[3] 424640864c/src/core/cpuid.d (L533-L534)
PiperOrigin-RevId: 242046629
Change-Id: Ic0f0a5f974b20f71391cb85645bdcd4003e5fe88