os.NewFile() accounts for 38% of CPU time in localFile.Walk().
This change switchs to use fd.FD which is much cheaper to create.
Now, fd.New() in localFile.Walk() accounts for only 4%.
PiperOrigin-RevId: 244944983
Change-Id: Ic892df96cf2633e78ad379227a213cb93ee0ca46
For a symbol link to some directory, eg.
`/tmp/symlink -> /tmp/dir`
`fstatat("/tmp/symlink")` should return symbol link data, but
`fstatat("/tmp/symlink/")` (symlink with trailing slash) should return
directory data it points following linux behaviour.
Currently fstatat() a symlink with trailing slash will get "not a
directory" error which is wrong.
Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I63469b1fb89d083d1c1255d32d52864606fbd7e2
PiperOrigin-RevId: 244783916
Support shutdown on only the read side of an endpoint. Reads performed
after a call to Shutdown with only the ShutdownRead flag will return
ErrClosedForReceive without data.
Break out the shutdown(2) with SHUT_RD syscall test into to two tests.
The first tests that no packets are sent when shutting down the read
side of a socket. The second tests that, after shutting down the read
side of a socket, unread data can still be read, or an EOF if there is
no more data to read.
Change-Id: I9d7c0a06937909cbb466b7591544a4bcaebb11ce
PiperOrigin-RevId: 244459430
The MSG_TRUNC flag is set in the msghdr when a message is truncated.
Fixesgoogle/gvisor#200
PiperOrigin-RevId: 244440486
Change-Id: I03c7d5e7f5935c0c6b8d69b012db1780ac5b8456
Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER
when attempting to set unsupported values.
PiperOrigin-RevId: 244229675
Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1
It is possible to create a listening socket which will accept
IPv4 and IPv6 connections. In this case, we set IPv6ProtocolNumber
for all accepted endpoints, even if they handle IPv4 connections.
This means that we can't use endpoint.netProto to set gso.L3HdrLen.
PiperOrigin-RevId: 244227948
Change-Id: I5e1863596cb9f3d216febacdb7dc75651882eef1
The existing logic attempting to do this is incorrect. Unary ^ has
higher precedence than &^, so mask always has UnblockableSignals
cleared, allowing dequeueSignalLocked to dequeue unblockable signals
(which allows userspace to ignore them).
Switch the logic so that unblockable signals are always masked.
PiperOrigin-RevId: 244058487
Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3
FD limit and file size limit is read from the host, instead
of using hard-coded defaults, given that they effect the sandbox
process. Also limit the direct cache to use no more than half
if the available FDs.
PiperOrigin-RevId: 244050323
Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
Current, doPoll copies the user struct pollfd array into a
[]syscalls.PollFD, which contains internal kdefs.FD and
waiter.EventMask types. While these are currently binary-compatible with
the Linux versions, we generally discourage copying directly to internal
types (someone may inadvertantly change kdefs.FD to uint64).
Instead, copy directly to a []linux.PollFD, which will certainly be
binary compatible. Most of syscalls/polling.go is included directly into
syscalls/linux/sys_poll.go, as it can then operate directly on
linux.PollFD. The additional syscalls.PollFD type is providing little
value.
I've also added explicit conversion functions for waiter.EventMask,
which creates the possibility of a different binary format.
PiperOrigin-RevId: 244042947
Change-Id: I24e5b642002a32b3afb95a9dcb80d4acd1288abf
Normal files display their path in the current mount namespace:
I0410 10:57:54.964196 216336 x:0] [ 1] ls X read(0x3 /proc/filesystems, 0x55cee3bdb2c0 "nodev\t9p\nnodev\tdevpts \nnodev\tdevtmpfs\nnodev\tproc\nnodev\tramdiskfs\nnodev\tsysfs\nnodev\ttmpfs\n", 0x1000) = 0x58 (24.462?s)
AT_FDCWD includes the CWD:
I0411 12:58:48.278427 1526 x:0] [ 1] stat_test E newfstatat(AT_FDCWD /home/prattmic, 0x55ea719b564e /proc/self, 0x7ef5cefc2be8, 0x0)
Sockets (and other non-vfs files) display an inode number (like
/proc/PID/fd):
I0410 10:54:38.909123 207684 x:0] [ 1] nc E bind(0x3 socket:[1], 0x55b5a1652040 {Family: AF_INET, Addr: , Port: 8080}, 0x10)
I also fixed a few syscall args that should be Path.
PiperOrigin-RevId: 243169025
Change-Id: Ic7dda6a82ae27062fe2a4a371557acfd6a21fa2a
RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.
PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save. There is no reason not to save it along with the MountSource.
This did uncover an issue where not all MountSources were properly flushed
before Save. If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first. This CL also fixes that by flushing
all MountSources for all open FDs on Save.
PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
From sendfile spec and also the linux kernel code, we should
limit the count arg to 'MAX_RW_COUNT'. This patch export
'MAX_RW_COUNT' in kernel pkg and use it in the implementation
of sendfile syscall.
Signed-off-by: Li Qiang <pangpei.lq@antfin.com>
Change-Id: I1086fec0685587116984555abd22b07ac233fbd2
PiperOrigin-RevId: 242745831
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.
We were creating these directories without write permissions, which meant that
they were not wribable even when underlayed under a writable filesystem. They
should be writable.
PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.
PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
From the SDM: "The least-significant byte in register EAX (register AL)
will always return 01H. Software should ignore this value and not
interpret it as an informational descriptor."
Unfortunately, online docs [1] [2] (likely based on an old version of the SDM)
say: "The least-significant byte in register EAX (register AL) indicates
the number of times the CPUID instruction must be executed with an input
value of 2 to get a complete description of the processor's caches and
TLBs."
dlang uses this second interpretation [3] and will loop 2^32 times if we
return zero. Fix this by specifying the fixed value of one. We still
don't support exposing the actual cache information, leaving all other
bytes empty. A zero byte means: "Null descriptor, this byte contains no
information."
[1] http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h
[2] https://c9x.me/x86/html/file_module_x86_id_45.html
[3] 424640864c/src/core/cpuid.d (L533-L534)
PiperOrigin-RevId: 242046629
Change-Id: Ic0f0a5f974b20f71391cb85645bdcd4003e5fe88
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.
PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
This will save copies when preemption is not caused by a CPU migration.
PiperOrigin-RevId: 241844399
Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
Dirent.exists() is called in Create to check whether a child with the given
name already exists.
Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu
while calling d.Inode.Lookup. During this existence check, a racing Rename()
can acquire d.mu and create a new child of the dirent with the same name.
(Note that the source and destination of the rename must be in the same
directory, otherwise renameMu will be taken preventing the race.) In this
case, d.exists() can return false, even though a child with the same name
actually does exist.
This CL changes d.exists() so that it does not release d.mu while walking, thus
preventing the race with Rename.
It also adds comments noting that lockForRename may not take renameMu if the
source and destination are in the same directory, as this is a bit surprising
(at least it was to me).
PiperOrigin-RevId: 241842579
Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
If there are thousands of threads, ThreadGroupsAppend becomes very
expensive as it must iterate over all Tasks to find the ThreadGroup
leaders.
Reduce the cost by maintaining a map of ThreadGroups which can be used
to grab them all directly.
The one somewhat visible change is to convert PID namespace init
children zapping to a group-directed SIGKILL, as Linux did in
82058d668465 "signal: Use group_send_sig_info to kill all processes in a
pid namespace".
In a benchmark that creates N threads which sleep for two minutes, we
see approximately this much CPU time in ThreadGroupsAppend:
Before:
1 thread: 0ms
1024 threads: 30ms - 9130ms
4096 threads: 50ms - 2000ms
8192 threads: 18160ms
16384 threads: 17210ms
After:
1 thread: 0ms
1024 threads: 0ms
4096 threads: 0ms
8192 threads: 0ms
16384 threads: 0ms
The profiling is actually extremely noisy (likely due to cache effects),
as some runs show almost no samples at 1024, 4096 threads, but obviously
this does not scale to lots of threads.
PiperOrigin-RevId: 241828039
Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for
?. However, peekRune also returned the length of the byte (1). When calling
utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte
character ?.
tl;dr: I apparently didn't understand runes when I wrote this.
PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
Also makes the safemem reading and writing inline, as it makes it easier to see
what locks are held.
PiperOrigin-RevId: 241775201
Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
Added syscall annotations for unimplemented syscalls for later generation into
reference docs. Annotations are of the form:
@Syscall(<name>, <key:value>, ...)
Supported args and values are:
- arg: A syscall option. This entry only applies to the syscall when given this
option.
- support: Indicates support level
- UNIMPLEMENTED: Unimplemented (implies returns:ENOSYS)
- PARTIAL: Partial support. Details should be provided in note.
- FULL: Full support
- returns: Indicates a known return value. Values are
syscall errors. This is treated as a string so you can use something
like "returns:EPERM or ENOSYS".
- issue: A Github issue number.
- note: A note
Example:
// @Syscall(mmap, arg:MAP_PRIVATE, support:FULL, note:Private memory fully supported)
// @Syscall(mmap, arg:MAP_SHARED, support:UNIMPLEMENTED, issue:123, note:Shared memory not supported)
// @Syscall(setxattr, returns:ENOTSUP, note:Requires file system support)
Annotations should be placed as close to their implementation as possible
(preferrably as part of a supporting function's Godoc) and should be updated as
syscall support changes.
PiperOrigin-RevId: 241697482
Change-Id: I7a846135db124e1271dc5057d788cba82ca312d4
Also remove comments in InodeOperations that required that implementation of
some Create* operations ensure that the name does not already exist, since
these checks are all centralized in the Dirent.
PiperOrigin-RevId: 241637335
Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
Having raw socket code together will make it easier to add support for other raw
network protocols. Currently, only ICMP uses the raw endpoint. However, adding
support for other protocols such as UDP shouldn't be much more difficult than
adding a few switch cases.
PiperOrigin-RevId: 241564875
Change-Id: I77e03adafe4ce0fd29ba2d5dfdc547d2ae8f25bf
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.
PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
We call NewSharedAnonMappable simply to use it for Mappable/MappingIdentity for
shared anon mmap. From MMapOpts.MappingIdentity: "If MMapOpts is used to
successfully create a memory mapping, a reference is taken on MappingIdentity."
mm.createVMALocked (below) takes this additional reference, so we don't need
the reference returned by NewSharedAnonMappable. Holding it leaks the mappable.
PiperOrigin-RevId: 241038108
Change-Id: I78ee3af78e0cc7aac4063b274b30d0e41eb5677d
The panic was caused by modifying the tree while iterating which invalidated the
iterator.
Also fixes another bug in SACKScoreboard.Insert() which was causing blocks to be
merged incorrectly.
PiperOrigin-RevId: 240895053
Change-Id: Ia72b8244297962df5c04283346da5226434740af
When fork a child process, the name filed of TaskContext is not set.
It results in that when we cat /proc/{pid}/status, the name filed is
null.
Like this:
Name:
State: S (sleeping)
Tgid: 28
Pid: 28
PPid: 26
TracerPid: 0
FDSize: 8
VmSize: 89712 kB
VmRSS: 6648 kB
Threads: 1
CapInh: 00000000a93d35fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a93d35fb
Seccomp: 0
Change-Id: I5d469098c37cedd19da16b7ffab2e546a28a321e
PiperOrigin-RevId: 240893304
- Document fsutil.CachedFileObject.FD() requirements on access
permissions, and change gofer.inodeFileState.FD() to honor them.
Fixes#147.
- Combine gofer.inodeFileState.readonly and
gofer.inodeFileState.readthrough, and simplify handle caching logic.
- Inline gofer.cachePolicy.cacheHandles into
gofer.inodeFileState.setSharedHandles, because users with access to
gofer.inodeFileState don't necessarily have access to the fs.Inode
(predictably, this is a save/restore problem).
Before this CL:
$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@34d51017ed67:/# /root/repro/runsc-b147
mmap: 0x7f3c01e45000
Segmentation fault
After this CL:
$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@d3c3cb56bbf9:/# /root/repro/runsc-b147
mmap: 0x7f78987ec000
o
PiperOrigin-RevId: 240818413
Change-Id: I49e1d4a81a0cb9177832b0a9f31a10da722a896b
The linux packet socket can handle GSO packets, so we can segment packets to
64K instead of the MTU which is usually 1500.
Here are numbers for the nginx-1m test:
runsc: 579330.01 [Kbytes/sec] received
runsc-gso: 1794121.66 [Kbytes/sec] received
runc: 2122139.06 [Kbytes/sec] received
and for tcp_benchmark:
$ tcp_benchmark --duration 15 --ideal
[ 4] 0.0-15.0 sec 86647 MBytes 48456 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal
[ 4] 0.0-15.0 sec 2173 MBytes 1214 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal --gso 65536
[ 4] 0.0-15.0 sec 19357 MBytes 10825 Mbits/sec
PiperOrigin-RevId: 240809103
Change-Id: I2637f104db28b5d4c64e1e766c610162a195775a
The start time is the number of clock ticks between the boot time and
application start time.
PiperOrigin-RevId: 240619475
Change-Id: Ic8bd7a73e36627ed563988864b0c551c052492a5
This is the same technique used by Go's strings.Builder
(https://golang.org/src/strings/builder.go#L45), and for the same
reason. (We can't just use strings.Builder because there's no way to get
the underlying []byte to pass to usermem.IO.CopyIn.)
PiperOrigin-RevId: 240594892
Change-Id: Ic070e7e480aee53a71289c7c120850991358c52c
This is a preparation for GSO changes (cl/234508902).
RELNOTES[gofers]: Refactor checksum code to include length, which
it already did, but in a convoluted way. Should be a no-op.
PiperOrigin-RevId: 240460794
Change-Id: I537381bc670b5a9f5d70a87aa3eb7252e8f5ace2
Memfds are simply anonymous tmpfs files with no associated
mounts. Also implementing file seals, which Linux only implements for
memfds at the moment.
PiperOrigin-RevId: 240450031
Change-Id: I31de78b950101ae8d7a13d0e93fe52d98ea06f2f
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate
(although it unsets AccessType.Write if the vma is private). This
somewhat simplifies handling of pmas, since it means only COW-break
needs to replace existing pmas. However, it also means that a MAP_SHARED
mapping of a file opened O_RDWR dirties the file, regardless of the
mapping's permissions and whether or not the mapping is ever actually
written to with I/O that ignores permissions (e.g.
ptrace(PTRACE_POKEDATA)).
To fix this:
- Change the pma-getting path to request only the permissions that are
required for the calling access.
- Change memmap.Mappable.Translate to take requested permissions, and
return allowed permissions. This preserves the existing behavior in the
common cases where the memmap.Mappable isn't
fsutil.CachingInodeOperations and doesn't care if the translated
platform.File pages are written to.
- Change the MM.getPMAsLocked path to support permission upgrading of
pmas outside of copy-on-write.
PiperOrigin-RevId: 240196979
Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
ilist:generic_list works faster than ilist:ilist.
Here is a beanchmark test to measure performance of epoll_wait, when readyList
isn't empty. It shows about 30% better performance with these changes.
Benchmark Time(ns) CPU(ns) Iterations
Before:
BM_EpollAllEvents 46725 46899 14286
After:
BM_EpollAllEvents 33167 33300 18919
PiperOrigin-RevId: 240185278
Change-Id: I3e33f9b214db13ab840b91613400525de5b58d18
Change the DPL from 0 to 3 for Breakpoint and Overflow,
then user space could trigger Breakpoint and Overflow
as excepected.
Change-Id: Ibead65fb8c98b32b7737f316db93b3a8d9dcd648
PiperOrigin-RevId: 239736648
Also, changing queue.writeBuf from a buffer.Bytes to a [][]byte should reduce
copying and reallocating of slices.
PiperOrigin-RevId: 239713547
Change-Id: I6ee5ff19c3ee2662f1af5749cae7b73db0569e96
.net sets these flags to -1 and then uses their result, especting it to be
zero.
Does not set actual flags (e.g. MSG_TRUNC), but setting to zero is more correct
than what we did before.
PiperOrigin-RevId: 239657951
Change-Id: I89c5f84bc9b94a2cd8ff84e8ecfea09e01142030
A credential object is immutable, so we don't need to copy it for a new
task.
PiperOrigin-RevId: 239519266
Change-Id: I0632f641fdea9554779ac25d84bee4231d0d18f2
Track new sockets created during accept(2) in the socket table for all
families. Previously we were only doing this for unix domain sockets.
PiperOrigin-RevId: 239475550
Change-Id: I16f009f24a06245bfd1d72ffd2175200f837c6ac
In the case of a rename replacing an existing destination inode, ramfs
Rename failed to first remove the replaced inode. This caused:
1. A leak of a reference to the inode (making it live indefinitely).
2. For directories, a leak of the replaced directory's .. link to the
parent. This would cause the parent's link count to incorrectly
increase.
(2) is much simpler to test than (1), so that's what I've done.
agentfs has a similar bug with link count only, so the Dirent layer
informs the Inode if this is a replacing rename.
Fixes#133
PiperOrigin-RevId: 239105698
Change-Id: I4450af2462d8ae3339def812287213d2cbeebde0
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.
PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
p9.Twalk.handle() with a non-empty path also stats the walked-to path
anyway, so the preceding GetAttr is completely wasted.
PiperOrigin-RevId: 238440645
Change-Id: I7fbc7536f46b8157639d0d1f491e6aaa9ab688a3
Previous memory allocation was excessive (80 MB). Changed
it to use 2 MB instead. There is no drop in perfomance due
to this change:
ab -n 100 -c 10 http://server/latin10m.txt ==> 10 MB file
80 MB: 178 MB/s
2 MB: 181 MB/s
PiperOrigin-RevId: 238321594
Change-Id: I1c8aed13cad5d75f4506d2b406b305117055fbe5
gonet.PacketConn now implements net.Conn, allowing it to be returned from
net.Dial.Dialer functions.
PiperOrigin-RevId: 238111980
Change-Id: I174884385ff4d9b8e9918fac7bbb5b93ca366ba7
HandleLocal is very similar conceptually to MULTICAST_LOOP, so we can unify
the implementations. This has the benefit of making HandleLocal apply even when
the fdbased link endpoint isn't in use.
In addition, move looping logic to route creation so that it doesn't need to be
run for each packet. This should improve performance.
PiperOrigin-RevId: 238099480
Change-Id: I72839f16f25310471453bc9d3fb8544815b25c23
- Redefine some memmap.Mappable, platform.File, and platform.Memory
semantics in terms of File reference counts (no functional change).
- Make AddressSpace.MapFile take a platform.File instead of a raw FD,
and replace platform.File.MapInto with platform.File.FD. This allows
kvm.AddressSpace.MapFile to always use platform.File.MapInternal instead
of maintaining its own (redundant) cache of file mappings in the sentry
address space.
PiperOrigin-RevId: 238044504
Change-Id: Ib73a11e4275c0da0126d0194aa6c6017a9cef64f
getsockopt(IP_MULTICAST_IF) only supports struct in_addr.
Also adds support for setsockopt(IP_MULTICAST_IF) with struct in_addr.
PiperOrigin-RevId: 237620230
Change-Id: I75e7b5b3e08972164eb1906f43ddd67aedffc27c
IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default
route are looped back. In order to implement this switch, support for sending
and looping back multicast packets on the default route had to be implemented.
For now we only support IPv4 multicast.
PiperOrigin-RevId: 237534603
Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.
PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
Clean up some error handling, and add TODO explaining incorrect
behaviour with respect to broadcast on interfaces lacking an IP
address.
PiperOrigin-RevId: 236756233
Change-Id: I9662e7dc062c90565a32a3e153c4dbc98c55b522
Now that we have SO_BROADCAST, we don't need (some of) the hackery in the DHCP
client. This also fixes a bizarre regression observed in Fuchsia where DHCP
acquisition was taking over two minutes.
PiperOrigin-RevId: 236661954
Change-Id: Ibcfe5d311fa5df8ff4ff2e40ccedffe91f92daa5
The globalPool uses a sync.Once mechanism for initialization,
and no cleanup is strictly required. It's not really feasible
to have the platform implement a full creation -> destruction
cycle (due to the way filters are assumed to be installed), so
drop the FIXME.
PiperOrigin-RevId: 236385278
Change-Id: I98ac660ed58cc688d8a07147d16074a3e8181314
Current procfs has some bugs. After executing ls twice, many dirs come
out with same name like "1" or ".". Files like "cpuinfo" disappear.
Here variable names is a slice with cap() > len(). Sort after appending
to it will not alloc a new space and impact orignal slice. Same to m.
Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I83e5cd1c7968c6fe28c35ea4fee497488d4f9eef
PiperOrigin-RevId: 236222270
...in accordance with RFCs 1112 and 2464.
Fixes IPv4 multicast when IP_MULTICAST_IF is specified.
Don't return ErrNoRoute when no route is needed.
Don't set Route.NextHop when no route is needed.
PiperOrigin-RevId: 236199813
Change-Id: I48ed33e1b7f760deaa37e18ad7f1b8b62819ab43
Broadly, this change:
* Enables sockets to be created via `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)`.
* Passes the network-layer (IP) header up the stack to the transport endpoint,
which can pass it up to the socket layer. This allows a raw socket to return
the entire IP packet to users.
* Adds functions to stack.TransportProtocol, stack.Stack, stack.transportDemuxer
that enable incoming packets to be delivered to raw endpoints. New raw sockets
of other protocols (not ICMP) just need to register with the stack.
* Enables ping.endpoint to return IP headers when created via SOCK_RAW.
PiperOrigin-RevId: 235993280
Change-Id: I60ed994f5ff18b2cbd79f063a7fdf15d093d845a
Also exposes ipv4.MaxTotalSize since it is a generally useful constant.
PiperOrigin-RevId: 235799755
Change-Id: I1fa8d5294bf355acf5527cfdf274b3687d3c8b13
An earlier CL excessively minimizes the period in which it
holds a lock on NIC. This earlier CL had done this out of
the mistaken impression it fixed a broken test, when in
fact it just reduced the rate of failure of a flaky test
in tcp_test.go. This new change holds the lock on NIC
for the duration of the loop over n.endpoints.
PiperOrigin-RevId: 235732487
Change-Id: I53ee6df264f093ddc4d29e9acdcba6b4838cb112
This change does not make use of SACK information but adds support to track
SACK information and store it in the endpoint.
The actual SACK based recovery will be in a separate CL.
Part of commits to add RFC 6675 support to Netstack.
PiperOrigin-RevId: 235612264
Change-Id: I261f94844d7bad5abda803152ce6cc6125a467ff
cl/234850781 introduced a race condition in NIC.DeliverNetworkPacket
by failing to hold a lock. This change fixes this regressesion by acquiring
a read lock before iterating through n.endpoints, and then releasing the lock
once iteration is complete.
PiperOrigin-RevId: 235549770
Change-Id: Ib0133288be512d478cf759c3314dc95ec3205d4b
This change adds support for the SO_BROADCAST socket option in gVisor Netstack.
This support includes getsockopt()/setsockopt() functionality for both UDP and
TCP endpoints (the latter being a NOOP), dispatching broadcast messages up and
down the stack, and route finding/creation for broadcast packets. Finally, a
suite of tests have been implemented, exercising this functionality through the
Linux syscall API.
PiperOrigin-RevId: 234850781
Change-Id: If3e666666917d39f55083741c78314a06defb26c
tcp_proxy now uses an AF_PACKET socket as the FD for netstack link layer
endpoint instead of a tap device. It also changes the link layer endpoint to use
PacketMMap dispatch instead of Readv. This reduces overall cpu and reflects the
current runsc setup which uses PacketMMap and also uses veth devices to receive
packets.
Also fixed a bug in gonet where Read() was not doing coalescing read and would
read small amounts at a time.
PiperOrigin-RevId: 234714768
Change-Id: Idabf8e600e4512489d3ba441c4096dc74deba5d7
910448bbed066ab1082b510eef1ae61bb792d854 ("perf/x86/amd/uncore: Rename
cpufeatures macro for cache counters") in 4.14 changed the name.
We change both the internal and cpuinfo name. As the upstream commit
states, "In Family 17h, L3 is the last level cache as opposed to L2 in
previous families. Avoid this name confusion ..."
PiperOrigin-RevId: 234698034
Change-Id: Ibf2efd4c0b83c1a8b5bb123da65ea1d7c6acd778
The final merged patch in Linux 4.10,
4ab1586488cb56ed8728e54c4157cc38646874d9 ("x86/cpufeature: Add RDT CPUID
feature bits") named this feature "rdt_a". Earlier patch sets had named
this "rdt".
PiperOrigin-RevId: 234680481
Change-Id: I0cc968201ec9a2825701405e207994a7331322b7
- Use new user namespace for namespace creation checks.
- Ensure userns is never nil since it's used by other namespaces.
PiperOrigin-RevId: 234673175
Change-Id: I4b9d9d1e63ce4e24362089793961a996f7540cd9
In addition to simplifying the implementation, this fixes two bugs:
- seqfile.NewSeqFile unconditionally creates an inode with mode 0444,
but {uid,gid}_map have mode 0644.
- idMapSeqFile.Write implements fs.FileOperations.Write ... but it
doesn't implement any other fs.FileOperations methods and is never
used as fs.FileOperations. idMapSeqFile.GetFile() =>
seqfile.SeqFile.GetFile() uses seqfile.seqFileOperations instead,
which rejects all writes.
PiperOrigin-RevId: 234638212
Change-Id: I4568f741ab07929273a009d7e468c8205a8541bc
This allows setting a default send interface for IPv4 multicast. IPv6 support
will come later.
PiperOrigin-RevId: 234251379
Change-Id: I65922341cd8b8880f690fae3eeb7ddfa47c8c173
SO_TIMESTAMP is reimplemented in ping and UDP sockets (and needs to be added for
TCP), but can just be implemented in epsocket for simplicity. This will also
make SIOCGSTAMP easier to implement.
PiperOrigin-RevId: 234179300
Change-Id: Ib5ea0b1261dc218c1a8b15a65775de0050fe3230
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.
See drivers/tty/n_tty.c:n_tty_read()=>job_control().
If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.
See drivers/tty/tty_io.c:tty_check_change().
PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9