Changes netstack to confirm to current linux behaviour where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly we allow upto
backlog connections to be in SYN-RCVD state as long as the backlog is not full.
We also now drop a SYN if syn cookies are in use and the backlog for the
listening endpoint is full.
Added new tests to confirm the behaviour.
Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.
Fixes#236
PiperOrigin-RevId: 252500462
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.
Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.
Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.
PiperOrigin-RevId: 252495895
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which
passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this
size is measurably inefficient in most cases: most paths will not be
this long, 4 KB is a lot of bytes to zero, and as of this writing the Go
runtime allocator maps only two 4 KB objects to each 8 KB span,
necessitating a call to runtime.mcache.refill() on ~every other call.
Limit the initial buffer size to 256 B instead, and geometrically
reallocate if necessary.
PiperOrigin-RevId: 251960441
Overlayfs was expecting the parent to exist when bind(2)
was called, which may not be the case. The fix is to copy
the parent directory to the upper layer before binding
the UDS.
There is not good place to add tests for it. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus copied the parent to the upper layer).
Added it to runsc integration tests for now. If it turns
out we have lots of these kind of tests, we can consider
moving them somewhere more appropriate.
PiperOrigin-RevId: 251954156
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().
Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.
PiperOrigin-RevId: 251950859
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).
For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.
PiperOrigin-RevId: 251934850
This allows an fdbased endpoint to have multiple underlying fd's from which
packets can be read and dispatched/written to.
This should allow for higher throughput as well as better scalability of the
network stack as number of connections increases.
Updates #231
PiperOrigin-RevId: 251852825
This is required to make the shutdown visible to peers outside the
sandbox.
The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.
The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.
Updates #273
PiperOrigin-RevId: 251763060
In case of GSO, a segment can container more than one packet
and we need to use the pCount() helper to get a number of packets.
PiperOrigin-RevId: 251743020
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.
PiperOrigin-RevId: 251739298
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.
Lack of support for set-uid binaries or fs creds simplifies things a
bit.
As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but at least since they have no parent normal
UID/GID checks should protect them.
PiperOrigin-RevId: 251712714
When checking the length of the acceptedChan we should hold the
endpoint mutex otherwise a syn received while the listening socket
is being closed can result in a data race where the cleanupLocked
routine sets acceptedChan to nil while a handshake goroutine
in progress could try and check it at the same time.
PiperOrigin-RevId: 251537697
When pipe is created, a dirent of pipe will be
created and its initial reference is set as 0.
Cause all dirent will only be destroyed when
the reference decreased to -1, so there is already
a 'initial reference' of dirent after it created.
For destroying dirent after all reference released,
the correct way is to drop the 'initial reference'
once someone hold a reference to the dirent, such
as fs.NewFile, otherwise the reference of dirent
will stay 0 all the time, and will cause memory
leak of dirent.
Except pipe, timerfd/eventfd/epoll has the same
problem
Here is a simple case to create memory leak of dirent
for pipe/timerfd/eventfd/epoll in C langange, after
run the case, pprof the runsc process, you will
find lots dirents of pipe/timerfd/eventfd/epoll not
freed:
int main(int argc, char *argv[])
{
int i;
int n;
int pipefd[2];
if (argc != 3) {
printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
}
n = strtol(argv[2], NULL, 10);
if (strcmp(argv[1], "epoll") == 0) {
for (i = 0; i < n; ++i)
close(epoll_create(1));
} else if (strcmp(argv[1], "timerfd") == 0) {
for (i = 0; i < n; ++i)
close(timerfd_create(CLOCK_REALTIME, 0));
} else if (strcmp(argv[1], "eventfd") == 0) {
for (i = 0; i < n; ++i)
close(eventfd(0, 0));
} else if (strcmp(argv[1], "pipe") == 0) {
for (i = 0; i < n; ++i)
if (pipe(pipefd) == 0) {
close(pipefd[0]);
close(pipefd[1]);
}
}
printf("%s %s test finished\r\n",argv[1],argv[2]);
return 0;
}
Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2
PiperOrigin-RevId: 251531096
Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe
raises difficult questions about the lifecycle of the Pipe and Dirent.
Fortunately, we can side-step those questions by removing the Dirent field from
Pipe entirely. We only need the Dirent when constructing fs.Files (which are
ref-counted), and in GetFile (when a Dirent is passed to us anyways).
PiperOrigin-RevId: 251497628
The io.Writer contract requires that Write writes all available
bytes and does not return short writes. This causes errors with
io.Copy, since our own Write interface does not have this same
contract.
PiperOrigin-RevId: 251368730
Netstack sets the unprocessed segment queue size to match the receive
buffer size. This is not required as this queue only needs to hold enough
for a short duration before the endpoint goroutine can process it.
Updates #230
PiperOrigin-RevId: 250976323
Funcion signatures are not validated during compilation. Since
they are not exported, they can change at any time. The guard
ensures that they are verified at least on every version upgrade.
PiperOrigin-RevId: 250733742
Netstack listen loop can get stuck if cookies are in-use and the app is slow to
accept incoming connections. Further we continue to complete handshake for a
connection even if the backlog is full. This creates a problem when a lots of
connections come in rapidly and we end up with lots of completed connections
just hanging around to be delivered.
These fixes change netstack behaviour to mirror what linux does as described
here in the following article
http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html
Now when cookies are not in-use Netstack will silently drop the ACK to a SYN-ACK
and not complete the handshake if the backlog is full. This will result in the
connection staying in a half-complete state. Eventually the sender will
retransmit the ACK and if backlog has space we will transition to a connected
state and deliver the endpoint.
Similarly when cookies are in use we do not try and create an endpoint unless
there is space in the accept queue to accept the newly created endpoint. If
there is no space then we again silently drop the ACK as we can just recreate it
when the ACK is retransmitted by the peer.
We also now use the backlog to cap the size of the SYN-RCVD queue for a given
endpoint. So at any time there can be N connections in the backlog and N in a
SYN-RCVD state if the application is not accepting connections. Any new SYNs
will be dropped.
This CL also fixes another small bug where we mark a new endpoint which has not
completed handshake as connected. We should wait till handshake successfully
completes before marking it connected.
Updates #236
PiperOrigin-RevId: 250717817
VmData is the size of private data segments.
It has the same meaning as in Linux.
Change-Id: Iebf1ae85940a810524a6cde9c2e767d4233ddb2a
PiperOrigin-RevId: 250593739
After bf959931ddb88c4e4366e96dd22e68fa0db9527c ("wait/ptrace: assume
__WALL if the child is traced") (Linux 4.7), tracees are always eligible
for waiting, regardless of type.
PiperOrigin-RevId: 250399527
We don't need to model internal interfaces after the system
call interfaces (which are objectively worse and simply use a
flag to distinguish between two logically different operations).
PiperOrigin-RevId: 249916814
Change-Id: I45d02e0ec0be66b782a685b1f305ea027694cab9
These wakers are uselessly allocated and passed around; nothing ever
listens for notifications on them. The code here appears to be
vestigial, so removing it and allowing a nil waker to be passed seems
appropriate.
PiperOrigin-RevId: 249879320
Change-Id: Icd209fb77cc0dd4e5c49d7a9f2adc32bf88b4b71
sendfile can be called for a big range and it can require significant
amount of time to process it, so we need to handle task interrupts in
this system call.
PiperOrigin-RevId: 249781023
Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015
Initialized BUILD with license
Mount is still unimplemented and is not meant to be
part of this CL. Rest of the fs interface is implemented.
Referenced the Linux kernel appropriately when needed
PiperOrigin-RevId: 249741997
Change-Id: Id1e4c7c9e68b3f6946da39896fc6a0c3dcd7f98c
Separate MountSource from Mount. This is needed to allow
mounts to be shared by multiple containers within the same
pod.
PiperOrigin-RevId: 249617810
Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
gopark's signature was changed from having a string reason to a
uint8.
See: 4d7cf3fedb
This broke execution tracing of the sentry.
Switching to the right signature makes tracing work again.
Updates #220
PiperOrigin-RevId: 249565311
Change-Id: If77fd276cecb37d4003c8222f6de510b8031a074
The previous commit adds WNOTHREAD support to waitid, so we may as well
complete the upstream change.
Linux added WCLONE, WALL, WNOTHREAD support to waitid(2) in
91c4e8ea8f05916df0c8a6f383508ac7c9e10dba ("wait: allow sys_waitid() to
accept __WNOTHREAD/__WCLONE/__WALL"). i.e., Linux 4.7.
PiperOrigin-RevId: 249560587
Change-Id: Iff177b0848a3f7bae6cb5592e44500c5a942fbeb
There no obvious reason to require that BlockSize and StatFS
are MountSource operations. Today they are in INodeOperations,
and they can be moved elsewhere in the future as part of a
normal refactor process.
PiperOrigin-RevId: 249549982
Change-Id: Ib832e02faeaf8253674475df4e385bcc53d780f3
Pipe internals are made more efficient by avoiding garbage collection.
A pool is now used that can be shared by all pipes, and buffers are
chained via an intrusive list. The documentation for pipe structures
and methods is also simplified and clarified.
The pipe tests are now parameterized, so that they are run on all
different variants (named pipes, small buffers, default buffers).
The pipe buffer sizes are exposed by fcntl, which is now supported
by this change. A size change test has been added to the suite.
These new tests uncovered a bug regarding the semantics of open
named pipes with O_NONBLOCK, which is also fixed by this CL. This
fix also addresses the lack of the O_LARGEFILE flag for named pipes.
PiperOrigin-RevId: 249375888
Change-Id: I48e61e9c868aedb0cadda2dff33f09a560dee773
* A segment with filesz == 0, memsz > 0 should be an anonymous only
mapping. We were failing to load such an ELF.
* Anonymous pages are always mapped RW, regardless of the segment
protections.
PiperOrigin-RevId: 249355239
Change-Id: I251e5c0ce8848cf8420c3aadf337b0d77b1ad991
This is in preparation to support an fdbased endpoint that can read/dispatch
packets from multiple underlying fds.
Updates #231
PiperOrigin-RevId: 249337074
Change-Id: Id7d375186cffcf55ae5e38986e7d605a96916d35
This does not actually implement an efficient splice or sendfile. Rather, it
adds a generic plumbing to the file internals so that this can be added. All
file implementations use the stub fileutil.NoSplice implementation, which
causes sendfile and splice to fall back to an internal copy.
A basic splice system call interface is added, along with a test.
PiperOrigin-RevId: 249335960
Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b
The backing 9p server must allow named pipe creation, which the runsc
fsgofer currently does not.
There are small changes to the overlay here. GetFile may block when
opening a named pipe, which can cause a deadlock:
1. open(O_RDONLY) -> copyMu.Lock() -> GetFile()
2. open(O_WRONLY) -> copyMu.Lock() -> Deadlock
A named pipe usable for writing must already be on the upper filesystem,
but we are still taking copyMu for write when checking for upper. That
can be changed to a read lock to fix the common case.
However, a named pipe on the lower filesystem would still deadlock in
open(O_WRONLY) when it tries to actually perform copy up (which would
simply return EINVAL). Move the copy up type check before taking copyMu
for write to avoid this.
p9 must be modified, as it was incorrectly removing the file mode when
sending messages on the wire.
PiperOrigin-RevId: 249154033
Change-Id: Id6637130e567b03758130eb6c7cdbc976384b7d6
* Creation of files, directories (and other fs objects) in a directory
should always update ctime.
* Same for removal.
* atime should not be updated on lookup, only readdir.
I've also renamed some misleading functions that update mtime and ctime.
PiperOrigin-RevId: 249115063
Change-Id: I30fa275fa7db96d01aa759ed64628c18bb3a7dc7
There is a lot of redundancy that we can simplify in the stat_times
test. This will make it easier to add new tests. However, the
simplification reveals that cached uattrs on goferfs don't properly
update ctime on rename.
PiperOrigin-RevId: 248773425
Change-Id: I52662728e1e9920981555881f9a85f9ce04041cf
This was upstreamed from Fuchsia, but it is pretty buggy and doesn't
rely on any private APIs. Thus it can be checked into the Fuchsia source
tree without forking netstack, where we can more easily iterate on (and
eventually remove) it.
PiperOrigin-RevId: 247506582
Change-Id: Ifb1b60c6c4941c374a59c5570a6a9cacf2468981
And stop storing the Filesystem in the MountSource.
This allows us to decouple the MountSource filesystem type from the name of the
filesystem.
PiperOrigin-RevId: 247292982
Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8
Some behavior was broken due to the difficulty of running automated raw
socket tests.
Change-Id: I152ca53916bb24a0208f2dc1c4f5bc87f4724ff6
PiperOrigin-RevId: 246747067
The tcpip.Clock comment stated that times provided by it should not be used for
netstack internal timekeeping. This comment was from before the interface
supported monotonic times. The monotonic times that it provides are now be the
preferred time source for netstack internal timekeeping.
PiperOrigin-RevId: 246618772
Change-Id: I853b720e3d719b03fabd6156d2431da05d354bda
- include packet_list.go
- exclude state.go (by renaming to include an underscore)
Also rename raw.go to endpoint.go for consistency.
PiperOrigin-RevId: 246547912
Change-Id: I19c8331c794ba683a940cc96a8be6497b53ff24d
Fixed a small logic error that broke proper accounting of MultiPortEndpoints.
PiperOrigin-RevId: 246502126
Change-Id: I1a7d6ea134f811612e545676212899a3707bc2c2
This requires two changes:
1) Support for more than one socket to join a given multicast group.
2) Duplicate delivery of incoming multicast packets to all sockets listening
for it.
In addition, I tweaked the code (and added a test) to disallow duplicates
IP_ADD_MEMBERSHIP calls for the same group and NIC. This is how Linux does
it.
PiperOrigin-RevId: 246437315
Change-Id: Icad8300b4a8c3f501d9b4cd283bd3beabef88b72
This feature allows MemoryFile to delay eviction of "optional"
allocations, such as unused cached file pages.
Note that this incidentally makes CachingInodeOperations writeback
asynchronous, in the sense that it doesn't occur until eviction; this is
necessary because between when a cached page becomes evictable and when
it's evicted, file writes (via CachingInodeOperations.Write) may dirty
the page.
As currently implemented, this feature won't meaningfully impact
steady-state memory usage or caching; the reclaimer goroutine will
schedule eviction as soon as it runs out of other work to do. Future CLs
increase caching by adding constraints on when eviction is scheduled.
PiperOrigin-RevId: 246014822
Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
Cache last used messages and reuse them for subsequent requests.
If more messages are needed, they are created outside the cache
on demand.
PiperOrigin-RevId: 245836910
Change-Id: Icf099ddff95df420db8e09f5cdd41dcdce406c61
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes#209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
Previously, createAt was eating all errors from FindInode except for EACCES and
proceeding with the creation. This is incorrect, as FindInode can return many
other errors (like ENAMETOOLONG) that should stop creation.
This CL changes createAt to return all errors encountered except for ENOENT,
which we can ignore because we are about to create the thing.
PiperOrigin-RevId: 245773222
Change-Id: I1b317021de70f0550fb865506f6d8147d4aebc56
Add the CloseRead & CloseWrite methods that performs shutdown on the
corresponding Read & Write sides of a connection.
Change-Id: I3996a2abdc7cd68a2becba44dc4bd9f0919d2ce1
PiperOrigin-RevId: 245537950
Maximum filename length is filesystem-dependent, and obtained via
statfs::f_namelen. This limit is usually 255 bytes (NAME_MAX), but not
always. For example, VFAT supports filenames of up to 255... UCS-2
characters, which Linux conservatively takes to mean UTF-8-encoded
bytes: fs/fat/inode.c:fat_statfs(), FAT_LFN_LEN * NLS_MAX_CHARSET_SIZE.
As a result, Linux's VFS does not enforce NAME_MAX:
$ rg --maxdepth=1 '\WNAME_MAX\W' fs/ include/linux/
fs/libfs.c
38: buf->f_namelen = NAME_MAX;
64: if (dentry->d_name.len > NAME_MAX)
include/linux/relay.h
74: char base_filename[NAME_MAX]; /* saved base filename */
include/linux/fscrypt.h
149: * filenames up to NAME_MAX bytes, since base64 encoding expands the length.
include/linux/exportfs.h
176: * understanding that it is already pointing to a a %NAME_MAX+1 sized
Remove this check from core VFS, and add it to ramfs (and by extension
tmpfs), where it is actually applicable:
mm/shmem.c:shmem_dir_inode_operations.lookup == simple_lookup *does*
enforce NAME_MAX.
PiperOrigin-RevId: 245324748
Change-Id: I17567c4324bfd60e31746a5270096e75db963fac
This CL fixes the following bugs:
- Uses atomic to set/read status instead of binary.LittleEndian.PutUint32 etc
which are not atomic.
- Increments ringOffsets for frames that are truncated (i.e status is
tpStatusCopy)
- Does not ignore frames with tpStatusLost bit set as they are valid frames and
only indicate that there some frames were lost before this one and metrics can
be retrieved with a getsockopt call.
- Adds checks to make sure blockSize is a multiple of page size. This is
required as the kernel allocates in pages per block and rejects sizes that are
not page aligned with an EINVAL.
Updates #210
PiperOrigin-RevId: 244959464
Change-Id: I5d61337b7e4c0f8a3063dcfc07791d4c4521ba1f
p9.messageByType was taking 7% of p9.recv before, spending time
with reflection and map lookup. Now it's reduced to 1%.
PiperOrigin-RevId: 244947313
Change-Id: I42813f920557b7656f8b29157eb32acd79e11fa5
os.NewFile() accounts for 38% of CPU time in localFile.Walk().
This change switchs to use fd.FD which is much cheaper to create.
Now, fd.New() in localFile.Walk() accounts for only 4%.
PiperOrigin-RevId: 244944983
Change-Id: Ic892df96cf2633e78ad379227a213cb93ee0ca46
For a symbol link to some directory, eg.
`/tmp/symlink -> /tmp/dir`
`fstatat("/tmp/symlink")` should return symbol link data, but
`fstatat("/tmp/symlink/")` (symlink with trailing slash) should return
directory data it points following linux behaviour.
Currently fstatat() a symlink with trailing slash will get "not a
directory" error which is wrong.
Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I63469b1fb89d083d1c1255d32d52864606fbd7e2
PiperOrigin-RevId: 244783916
Support shutdown on only the read side of an endpoint. Reads performed
after a call to Shutdown with only the ShutdownRead flag will return
ErrClosedForReceive without data.
Break out the shutdown(2) with SHUT_RD syscall test into to two tests.
The first tests that no packets are sent when shutting down the read
side of a socket. The second tests that, after shutting down the read
side of a socket, unread data can still be read, or an EOF if there is
no more data to read.
Change-Id: I9d7c0a06937909cbb466b7591544a4bcaebb11ce
PiperOrigin-RevId: 244459430
The MSG_TRUNC flag is set in the msghdr when a message is truncated.
Fixesgoogle/gvisor#200
PiperOrigin-RevId: 244440486
Change-Id: I03c7d5e7f5935c0c6b8d69b012db1780ac5b8456
Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER
when attempting to set unsupported values.
PiperOrigin-RevId: 244229675
Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1
It is possible to create a listening socket which will accept
IPv4 and IPv6 connections. In this case, we set IPv6ProtocolNumber
for all accepted endpoints, even if they handle IPv4 connections.
This means that we can't use endpoint.netProto to set gso.L3HdrLen.
PiperOrigin-RevId: 244227948
Change-Id: I5e1863596cb9f3d216febacdb7dc75651882eef1
The existing logic attempting to do this is incorrect. Unary ^ has
higher precedence than &^, so mask always has UnblockableSignals
cleared, allowing dequeueSignalLocked to dequeue unblockable signals
(which allows userspace to ignore them).
Switch the logic so that unblockable signals are always masked.
PiperOrigin-RevId: 244058487
Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.
PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
Current, doPoll copies the user struct pollfd array into a
[]syscalls.PollFD, which contains internal kdefs.FD and
waiter.EventMask types. While these are currently binary-compatible with
the Linux versions, we generally discourage copying directly to internal
types (someone may inadvertantly change kdefs.FD to uint64).
Instead, copy directly to a []linux.PollFD, which will certainly be
binary compatible. Most of syscalls/polling.go is included directly into
syscalls/linux/sys_poll.go, as it can then operate directly on
linux.PollFD. The additional syscalls.PollFD type is providing little
value.
I've also added explicit conversion functions for waiter.EventMask,
which creates the possibility of a different binary format.
PiperOrigin-RevId: 244042947
Change-Id: I24e5b642002a32b3afb95a9dcb80d4acd1288abf
Normal files display their path in the current mount namespace:
I0410 10:57:54.964196 216336 x:0] [ 1] ls X read(0x3 /proc/filesystems, 0x55cee3bdb2c0 "nodev\t9p\nnodev\tdevpts \nnodev\tdevtmpfs\nnodev\tproc\nnodev\tramdiskfs\nnodev\tsysfs\nnodev\ttmpfs\n", 0x1000) = 0x58 (24.462?s)
AT_FDCWD includes the CWD:
I0411 12:58:48.278427 1526 x:0] [ 1] stat_test E newfstatat(AT_FDCWD /home/prattmic, 0x55ea719b564e /proc/self, 0x7ef5cefc2be8, 0x0)
Sockets (and other non-vfs files) display an inode number (like
/proc/PID/fd):
I0410 10:54:38.909123 207684 x:0] [ 1] nc E bind(0x3 socket:[1], 0x55b5a1652040 {Family: AF_INET, Addr: , Port: 8080}, 0x10)
I also fixed a few syscall args that should be Path.
PiperOrigin-RevId: 243169025
Change-Id: Ic7dda6a82ae27062fe2a4a371557acfd6a21fa2a
RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.
PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save. There is no reason not to save it along with the MountSource.
This did uncover an issue where not all MountSources were properly flushed
before Save. If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first. This CL also fixes that by flushing
all MountSources for all open FDs on Save.
PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
From sendfile spec and also the linux kernel code, we should
limit the count arg to 'MAX_RW_COUNT'. This patch export
'MAX_RW_COUNT' in kernel pkg and use it in the implementation
of sendfile syscall.
Signed-off-by: Li Qiang <pangpei.lq@antfin.com>
Change-Id: I1086fec0685587116984555abd22b07ac233fbd2
PiperOrigin-RevId: 242745831
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.
We were creating these directories without write permissions, which meant that
they were not wribable even when underlayed under a writable filesystem. They
should be writable.
PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.
PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
From the SDM: "The least-significant byte in register EAX (register AL)
will always return 01H. Software should ignore this value and not
interpret it as an informational descriptor."
Unfortunately, online docs [1] [2] (likely based on an old version of the SDM)
say: "The least-significant byte in register EAX (register AL) indicates
the number of times the CPUID instruction must be executed with an input
value of 2 to get a complete description of the processor's caches and
TLBs."
dlang uses this second interpretation [3] and will loop 2^32 times if we
return zero. Fix this by specifying the fixed value of one. We still
don't support exposing the actual cache information, leaving all other
bytes empty. A zero byte means: "Null descriptor, this byte contains no
information."
[1] http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h
[2] https://c9x.me/x86/html/file_module_x86_id_45.html
[3] 424640864c/src/core/cpuid.d (L533-L534)
PiperOrigin-RevId: 242046629
Change-Id: Ic0f0a5f974b20f71391cb85645bdcd4003e5fe88
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.
PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
This will save copies when preemption is not caused by a CPU migration.
PiperOrigin-RevId: 241844399
Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
Dirent.exists() is called in Create to check whether a child with the given
name already exists.
Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu
while calling d.Inode.Lookup. During this existence check, a racing Rename()
can acquire d.mu and create a new child of the dirent with the same name.
(Note that the source and destination of the rename must be in the same
directory, otherwise renameMu will be taken preventing the race.) In this
case, d.exists() can return false, even though a child with the same name
actually does exist.
This CL changes d.exists() so that it does not release d.mu while walking, thus
preventing the race with Rename.
It also adds comments noting that lockForRename may not take renameMu if the
source and destination are in the same directory, as this is a bit surprising
(at least it was to me).
PiperOrigin-RevId: 241842579
Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
If there are thousands of threads, ThreadGroupsAppend becomes very
expensive as it must iterate over all Tasks to find the ThreadGroup
leaders.
Reduce the cost by maintaining a map of ThreadGroups which can be used
to grab them all directly.
The one somewhat visible change is to convert PID namespace init
children zapping to a group-directed SIGKILL, as Linux did in
82058d668465 "signal: Use group_send_sig_info to kill all processes in a
pid namespace".
In a benchmark that creates N threads which sleep for two minutes, we
see approximately this much CPU time in ThreadGroupsAppend:
Before:
1 thread: 0ms
1024 threads: 30ms - 9130ms
4096 threads: 50ms - 2000ms
8192 threads: 18160ms
16384 threads: 17210ms
After:
1 thread: 0ms
1024 threads: 0ms
4096 threads: 0ms
8192 threads: 0ms
16384 threads: 0ms
The profiling is actually extremely noisy (likely due to cache effects),
as some runs show almost no samples at 1024, 4096 threads, but obviously
this does not scale to lots of threads.
PiperOrigin-RevId: 241828039
Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for
?. However, peekRune also returned the length of the byte (1). When calling
utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte
character ?.
tl;dr: I apparently didn't understand runes when I wrote this.
PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
Also makes the safemem reading and writing inline, as it makes it easier to see
what locks are held.
PiperOrigin-RevId: 241775201
Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
Added syscall annotations for unimplemented syscalls for later generation into
reference docs. Annotations are of the form:
@Syscall(<name>, <key:value>, ...)
Supported args and values are:
- arg: A syscall option. This entry only applies to the syscall when given this
option.
- support: Indicates support level
- UNIMPLEMENTED: Unimplemented (implies returns:ENOSYS)
- PARTIAL: Partial support. Details should be provided in note.
- FULL: Full support
- returns: Indicates a known return value. Values are
syscall errors. This is treated as a string so you can use something
like "returns:EPERM or ENOSYS".
- issue: A Github issue number.
- note: A note
Example:
// @Syscall(mmap, arg:MAP_PRIVATE, support:FULL, note:Private memory fully supported)
// @Syscall(mmap, arg:MAP_SHARED, support:UNIMPLEMENTED, issue:123, note:Shared memory not supported)
// @Syscall(setxattr, returns:ENOTSUP, note:Requires file system support)
Annotations should be placed as close to their implementation as possible
(preferrably as part of a supporting function's Godoc) and should be updated as
syscall support changes.
PiperOrigin-RevId: 241697482
Change-Id: I7a846135db124e1271dc5057d788cba82ca312d4
Also remove comments in InodeOperations that required that implementation of
some Create* operations ensure that the name does not already exist, since
these checks are all centralized in the Dirent.
PiperOrigin-RevId: 241637335
Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
Having raw socket code together will make it easier to add support for other raw
network protocols. Currently, only ICMP uses the raw endpoint. However, adding
support for other protocols such as UDP shouldn't be much more difficult than
adding a few switch cases.
PiperOrigin-RevId: 241564875
Change-Id: I77e03adafe4ce0fd29ba2d5dfdc547d2ae8f25bf
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.
PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
We call NewSharedAnonMappable simply to use it for Mappable/MappingIdentity for
shared anon mmap. From MMapOpts.MappingIdentity: "If MMapOpts is used to
successfully create a memory mapping, a reference is taken on MappingIdentity."
mm.createVMALocked (below) takes this additional reference, so we don't need
the reference returned by NewSharedAnonMappable. Holding it leaks the mappable.
PiperOrigin-RevId: 241038108
Change-Id: I78ee3af78e0cc7aac4063b274b30d0e41eb5677d