In the case of a rename replacing an existing destination inode, ramfs
Rename failed to first remove the replaced inode. This caused:
1. A leak of a reference to the inode (making it live indefinitely).
2. For directories, a leak of the replaced directory's .. link to the
parent. This would cause the parent's link count to incorrectly
increase.
(2) is much simpler to test than (1), so that's what I've done.
agentfs has a similar bug with link count only, so the Dirent layer
informs the Inode if this is a replacing rename.
Fixes#133
PiperOrigin-RevId: 239105698
Change-Id: I4450af2462d8ae3339def812287213d2cbeebde0
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.
PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
p9.Twalk.handle() with a non-empty path also stats the walked-to path
anyway, so the preceding GetAttr is completely wasted.
PiperOrigin-RevId: 238440645
Change-Id: I7fbc7536f46b8157639d0d1f491e6aaa9ab688a3
Previous memory allocation was excessive (80 MB). Changed
it to use 2 MB instead. There is no drop in perfomance due
to this change:
ab -n 100 -c 10 http://server/latin10m.txt ==> 10 MB file
80 MB: 178 MB/s
2 MB: 181 MB/s
PiperOrigin-RevId: 238321594
Change-Id: I1c8aed13cad5d75f4506d2b406b305117055fbe5
gonet.PacketConn now implements net.Conn, allowing it to be returned from
net.Dial.Dialer functions.
PiperOrigin-RevId: 238111980
Change-Id: I174884385ff4d9b8e9918fac7bbb5b93ca366ba7
HandleLocal is very similar conceptually to MULTICAST_LOOP, so we can unify
the implementations. This has the benefit of making HandleLocal apply even when
the fdbased link endpoint isn't in use.
In addition, move looping logic to route creation so that it doesn't need to be
run for each packet. This should improve performance.
PiperOrigin-RevId: 238099480
Change-Id: I72839f16f25310471453bc9d3fb8544815b25c23
- Redefine some memmap.Mappable, platform.File, and platform.Memory
semantics in terms of File reference counts (no functional change).
- Make AddressSpace.MapFile take a platform.File instead of a raw FD,
and replace platform.File.MapInto with platform.File.FD. This allows
kvm.AddressSpace.MapFile to always use platform.File.MapInternal instead
of maintaining its own (redundant) cache of file mappings in the sentry
address space.
PiperOrigin-RevId: 238044504
Change-Id: Ib73a11e4275c0da0126d0194aa6c6017a9cef64f
getsockopt(IP_MULTICAST_IF) only supports struct in_addr.
Also adds support for setsockopt(IP_MULTICAST_IF) with struct in_addr.
PiperOrigin-RevId: 237620230
Change-Id: I75e7b5b3e08972164eb1906f43ddd67aedffc27c
IP_MULTICAST_LOOP controls whether or not multicast packets sent on the default
route are looped back. In order to implement this switch, support for sending
and looping back multicast packets on the default route had to be implemented.
For now we only support IPv4 multicast.
PiperOrigin-RevId: 237534603
Change-Id: I490ac7ff8e8ebef417c7eb049a919c29d156ac1c
It is Implemented without the priority inheritance part given
that gVisor defers scheduling decisions to Go runtime and doesn't
have control over it.
PiperOrigin-RevId: 236989545
Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
Clean up some error handling, and add TODO explaining incorrect
behaviour with respect to broadcast on interfaces lacking an IP
address.
PiperOrigin-RevId: 236756233
Change-Id: I9662e7dc062c90565a32a3e153c4dbc98c55b522
Now that we have SO_BROADCAST, we don't need (some of) the hackery in the DHCP
client. This also fixes a bizarre regression observed in Fuchsia where DHCP
acquisition was taking over two minutes.
PiperOrigin-RevId: 236661954
Change-Id: Ibcfe5d311fa5df8ff4ff2e40ccedffe91f92daa5
The globalPool uses a sync.Once mechanism for initialization,
and no cleanup is strictly required. It's not really feasible
to have the platform implement a full creation -> destruction
cycle (due to the way filters are assumed to be installed), so
drop the FIXME.
PiperOrigin-RevId: 236385278
Change-Id: I98ac660ed58cc688d8a07147d16074a3e8181314
Current procfs has some bugs. After executing ls twice, many dirs come
out with same name like "1" or ".". Files like "cpuinfo" disappear.
Here variable names is a slice with cap() > len(). Sort after appending
to it will not alloc a new space and impact orignal slice. Same to m.
Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I83e5cd1c7968c6fe28c35ea4fee497488d4f9eef
PiperOrigin-RevId: 236222270
...in accordance with RFCs 1112 and 2464.
Fixes IPv4 multicast when IP_MULTICAST_IF is specified.
Don't return ErrNoRoute when no route is needed.
Don't set Route.NextHop when no route is needed.
PiperOrigin-RevId: 236199813
Change-Id: I48ed33e1b7f760deaa37e18ad7f1b8b62819ab43
Broadly, this change:
* Enables sockets to be created via `socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)`.
* Passes the network-layer (IP) header up the stack to the transport endpoint,
which can pass it up to the socket layer. This allows a raw socket to return
the entire IP packet to users.
* Adds functions to stack.TransportProtocol, stack.Stack, stack.transportDemuxer
that enable incoming packets to be delivered to raw endpoints. New raw sockets
of other protocols (not ICMP) just need to register with the stack.
* Enables ping.endpoint to return IP headers when created via SOCK_RAW.
PiperOrigin-RevId: 235993280
Change-Id: I60ed994f5ff18b2cbd79f063a7fdf15d093d845a
Also exposes ipv4.MaxTotalSize since it is a generally useful constant.
PiperOrigin-RevId: 235799755
Change-Id: I1fa8d5294bf355acf5527cfdf274b3687d3c8b13
An earlier CL excessively minimizes the period in which it
holds a lock on NIC. This earlier CL had done this out of
the mistaken impression it fixed a broken test, when in
fact it just reduced the rate of failure of a flaky test
in tcp_test.go. This new change holds the lock on NIC
for the duration of the loop over n.endpoints.
PiperOrigin-RevId: 235732487
Change-Id: I53ee6df264f093ddc4d29e9acdcba6b4838cb112
This change does not make use of SACK information but adds support to track
SACK information and store it in the endpoint.
The actual SACK based recovery will be in a separate CL.
Part of commits to add RFC 6675 support to Netstack.
PiperOrigin-RevId: 235612264
Change-Id: I261f94844d7bad5abda803152ce6cc6125a467ff
cl/234850781 introduced a race condition in NIC.DeliverNetworkPacket
by failing to hold a lock. This change fixes this regressesion by acquiring
a read lock before iterating through n.endpoints, and then releasing the lock
once iteration is complete.
PiperOrigin-RevId: 235549770
Change-Id: Ib0133288be512d478cf759c3314dc95ec3205d4b
This change adds support for the SO_BROADCAST socket option in gVisor Netstack.
This support includes getsockopt()/setsockopt() functionality for both UDP and
TCP endpoints (the latter being a NOOP), dispatching broadcast messages up and
down the stack, and route finding/creation for broadcast packets. Finally, a
suite of tests have been implemented, exercising this functionality through the
Linux syscall API.
PiperOrigin-RevId: 234850781
Change-Id: If3e666666917d39f55083741c78314a06defb26c
tcp_proxy now uses an AF_PACKET socket as the FD for netstack link layer
endpoint instead of a tap device. It also changes the link layer endpoint to use
PacketMMap dispatch instead of Readv. This reduces overall cpu and reflects the
current runsc setup which uses PacketMMap and also uses veth devices to receive
packets.
Also fixed a bug in gonet where Read() was not doing coalescing read and would
read small amounts at a time.
PiperOrigin-RevId: 234714768
Change-Id: Idabf8e600e4512489d3ba441c4096dc74deba5d7
910448bbed066ab1082b510eef1ae61bb792d854 ("perf/x86/amd/uncore: Rename
cpufeatures macro for cache counters") in 4.14 changed the name.
We change both the internal and cpuinfo name. As the upstream commit
states, "In Family 17h, L3 is the last level cache as opposed to L2 in
previous families. Avoid this name confusion ..."
PiperOrigin-RevId: 234698034
Change-Id: Ibf2efd4c0b83c1a8b5bb123da65ea1d7c6acd778
The final merged patch in Linux 4.10,
4ab1586488cb56ed8728e54c4157cc38646874d9 ("x86/cpufeature: Add RDT CPUID
feature bits") named this feature "rdt_a". Earlier patch sets had named
this "rdt".
PiperOrigin-RevId: 234680481
Change-Id: I0cc968201ec9a2825701405e207994a7331322b7
- Use new user namespace for namespace creation checks.
- Ensure userns is never nil since it's used by other namespaces.
PiperOrigin-RevId: 234673175
Change-Id: I4b9d9d1e63ce4e24362089793961a996f7540cd9
In addition to simplifying the implementation, this fixes two bugs:
- seqfile.NewSeqFile unconditionally creates an inode with mode 0444,
but {uid,gid}_map have mode 0644.
- idMapSeqFile.Write implements fs.FileOperations.Write ... but it
doesn't implement any other fs.FileOperations methods and is never
used as fs.FileOperations. idMapSeqFile.GetFile() =>
seqfile.SeqFile.GetFile() uses seqfile.seqFileOperations instead,
which rejects all writes.
PiperOrigin-RevId: 234638212
Change-Id: I4568f741ab07929273a009d7e468c8205a8541bc
This allows setting a default send interface for IPv4 multicast. IPv6 support
will come later.
PiperOrigin-RevId: 234251379
Change-Id: I65922341cd8b8880f690fae3eeb7ddfa47c8c173
SO_TIMESTAMP is reimplemented in ping and UDP sockets (and needs to be added for
TCP), but can just be implemented in epsocket for simplicity. This will also
make SIOCGSTAMP easier to implement.
PiperOrigin-RevId: 234179300
Change-Id: Ib5ea0b1261dc218c1a8b15a65775de0050fe3230
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.
See drivers/tty/n_tty.c:n_tty_read()=>job_control().
If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.
See drivers/tty/tty_io.c:tty_check_change().
PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
PACKET_RX_RING allows the use of an mmapped buffer to receive packets from the
kernel. This should cut down the number of host syscalls that need to be made
to receive packets when the underlying fd is a socket of the AF_PACKET type.
PiperOrigin-RevId: 233834998
Change-Id: I8060025c6ced206986e94cc46b8f382b81bfa47f
- Fix CopyIn/CopyOut/ZeroOut range checks.
- Include the faulting signal number in the panic message.
PiperOrigin-RevId: 233829501
Change-Id: I8959ead12d05dbd4cd63c2b908cddeb2a27eb513
Linux started doing this in b8be15d588060a03569ac85dc4a0247460988f5b
("x86/fpu/xstate: Re-enable XSAVES"), which first appeared in 4.8.
PiperOrigin-RevId: 233800931
Change-Id: Icac2c2b03ccf1a91f3070431efb5152ca619fca3
RFC7323 recommends that if the timestamp option was negotiated
then all packets should carry a TCP Timestamp and any packets that
do not should be dropped.
Netstack implemented this behaviour. Linux OTOH does not and will
accept such packets. This change makes Netstack behaviour compatible
with Linux.
Also now that we allow such packets, we do need to update RTO calculations
based on these packets even if timestamp option is enabled.
PiperOrigin-RevId: 233432268
Change-Id: I9f4742ae6b63930ac3b5e37d8c238761e6a4b29f
fs/gofer/inodeOperations.Release does some asynchronous work. Previously it
was calling fs.Async with an anonymous function, which caused the function to
be allocated on the heap. Because Release is relatively hot, this results in a
lot of small allocations and increased GC pressure, noticeable in perf profiles.
This CL adds a new function, AsyncWithContext, which is just like Async, but
passes a context to the async function. It avoids the need for an extra
anonymous function in fs/gofer/inodeOperations.Release. The Async function
itself still requires a single anonymous function.
PiperOrigin-RevId: 233141763
Change-Id: I1dce4a883a7be9a8a5b884db01e654655f16d19c
CopyObjectOut grows its destination byte slice incrementally, causing
many small slice allocations on the heap. This leads to increased GC and
noticeably slower stat calls.
PiperOrigin-RevId: 233140904
Change-Id: Ieb90295dd8dd45b3e56506fef9d7f86c92e97d97
This adds an extra Reflection call to CopyObjectOut, but avoids many small
slice allocations if the object is large, since without this we grow the
backing slice incrementally as we encode more data.
PiperOrigin-RevId: 233110960
Change-Id: I93569af55912391e5471277f779139c23f040147
Prevents URPC FDs from being closed mid-call, especially if they
are used as raw FDs.
PiperOrigin-RevId: 233087955
Change-Id: I815a2ff32cc5f03774605aef0b35a32862f8e633
Also includes a few fixes for IPv4 multicast support. IPv6 support is coming in
a followup CL.
PiperOrigin-RevId: 233008638
Change-Id: If7dae6222fef43fda48033f0292af77832d95e82
It currently allocates a new context on the heap each time it is called. Some
of these are in relatively hot paths like signal delivery and releasing gofer
inodes. It is also called very commonly in afterLoad. All of these should
benefit from fewer heap allocations.
PiperOrigin-RevId: 232938873
Change-Id: I53cec0ca299f56dcd4866b0b4fd2ec4938526849
- Change proc to return envp on overwrite of argv with limitations from
upstream.
- Add unit tests
- Change layout of argv/envp on the stack so that end of argv is contiguous with
beginning of envp.
PiperOrigin-RevId: 232506107
Change-Id: I993880499ab2c1220f6dc456a922235c49304dec
Dirty should be set only when the attribute is changed in the cache
only. Instances where the change was also sent to the backing file
doesn't need to dirty the attribute.
Also remove size update during WriteOut as writing dirty page would
naturaly grow the file if needed.
RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 232068978
Change-Id: I00ba54693a2c7adc06efa9e030faf8f2e8e7f188
This changed required making fsutil.HostMappable use
a backing file to ensure the correct FD would be used
for read/write operations.
RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 231836164
Change-Id: I8ae9639715529874ea7d80a65e2c711a5b4ce254
Nothing reads them and they can simply get stale.
Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD
PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
quoting what "rscheff@gmx.at" pointed out over email.
"IsLost in RFC3517 is defined as >= (DupThresh * SMSS) while
RFC6675 improves upon this, and defines IsLost as >
((DupThresh - 1) * SMSS + 1).
The latter addresses situations where partial segments (size < MSS)
are sent (eg. last segment of a http protocol message sent with PSH
being less than MSS is common)."
PiperOrigin-RevId: 231512331
Change-Id: I1addd4a92e3e7baeb0bdda46463ebfae435da958
This should reduce the number of syscalls required to process packets
significantly and improve throughputs.
PiperOrigin-RevId: 231366886
Change-Id: I8b38077262bf9c53176bc4a94b530188d3d7c0ca
We were modifying InodeSimpleAttributes.Unstable.AccessTime without holding
the necessary lock. Luckily for us, InodeSimpleAttributes already has a
NotifyAccess method that will do the update while holding the lock.
In addition, we were holding dfo.dir.mu.Lock while setting AccessTime, which
is unnecessary, so that lock has been removed.
PiperOrigin-RevId: 231278447
Change-Id: I81ed6d3dbc0b18e3f90c1df5e5a9c06132761769
It never actually should have applied to environ (the relevant change in
Linux 4.2 is c2c0bb44620d "proc: fix PAGE_SIZE limit of
/proc/$PID/cmdline"), and we claim to be Linux 4.4 now anyway.
PiperOrigin-RevId: 231250661
Change-Id: I37f9c4280a533d1bcb3eebb7803373ac3c7b9f15
When file size changes outside the sandbox, page cache was not
refreshing file size which is required for cacheRemoteRevalidating.
In fact, cacheRemoteRevalidating should be skipping the cache
completely since it's not really benefiting from it. The cache is
cache is already bypassed for unstable attributes (see
cachePolicy.cacheUAttrs). And althought the cache is called to
map pages, they will always miss the cache and map directly from
the host.
Created a HostMappable struct that maps directly to the host and
use it for files with cacheRemoteRevalidating.
Closes#124
PiperOrigin-RevId: 230998440
Change-Id: Ic5f632eabe33b47241e05e98c95e9b2090ae08fc
Most of the entries are stubbed out at the moment, but even those were
only displayed if IPv6 support was enabled. The entries should be
displayed with IPv4-support only, and with only loopback devices.
PiperOrigin-RevId: 229946441
Change-Id: I18afaa3af386322787f91bf9d168ab66c01d5a4c
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.
PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
Removing check to RLIMIT_NOFILE in select call.
Adding unit test to select suite to document behavior.
Moving setrlimit class from mlock to a util file for reuse.
Fixing flaky test based on comments from Jamie.
PiperOrigin-RevId: 228726131
Change-Id: Ie9dbe970bbf835ba2cca6e17eec7c2ee6fadf459
- Call MemoryEvents.done.Add(1) outside of MemoryEvents.run() so that if
MemoryEvents.Stop() => MemoryEvents.done.Wait() is called before the
goroutine starts running, it still waits for the goroutine to stop.
- Use defer to call MemoryEvents.done.Done() in MemoryEvents.run() so that it's
called even if the goroutine panics.
PiperOrigin-RevId: 228623307
Change-Id: I1b0459e7999606c1a1a271b16092b1ca87005015
overlayFileOperations.Readdir was holding overlay.copyMu while calling
DirentReaddir, which then attempts to take take the corresponding Dirent.mu,
causing a lock order violation. (See lock order documentation in
fs/copy_up.go.)
We only actually need to hold copyMu during readdirEntries(), so holding the
lock is moved in there, thus avoiding the lock order violation.
A new lock was added to protect overlayFileOperations.dirCache. We were
inadvertently relying on copyMu to protect this. There is no reason it should
not have its own lock.
PiperOrigin-RevId: 228542473
Change-Id: I03c3a368c8cbc0b5a79d50cc486fc94adaddc1c2
See modified comment in auth.NewUserCredentials(); compare to the
behavior of setresuid(2) as implemented by
//pkg/sentry/kernel/task_identity.go:kernel.Task.setKUIDsUncheckedLocked().
PiperOrigin-RevId: 228381765
Change-Id: I45238777c8f63fcf41b99fce3969caaf682fe408
Using linux.Errno as an error doesn't work very well as none of the sentry code
expects error to contain a linux.Errno.
This moves using syserr.Error.ToLinux as an error in a syscall handler from a
runtime error to a compile error.
PiperOrigin-RevId: 227744312
Change-Id: Iea63108a5b198296c908614e09c01733dd684da0
This option allows multiple sockets to be bound to the same port.
Incoming packets are distributed to sockets using a hash based on source and
destination addresses. This means that all packets from one sender will be
received by the same server socket.
PiperOrigin-RevId: 227153413
Change-Id: I59b6edda9c2209d5b8968671e9129adb675920cf
epoll_wait acquires EventPoll.listsMu (in EventPoll.ReadEvents) and
then calls Inotify.Readiness which tries to acquire Inotify.evMu.
getdents acquires Inotify.evMu (in Inotify.queueEvent) and then calls
readyCallback.Callback which tries to acquire EventPoll.listsMu.
The fix is to release Inotify.evMu before calling Queue.Notify. Queue
is thread-safe and doesn't require Inotify.evMu to be held.
Closes#121
PiperOrigin-RevId: 227066695
Change-Id: Id29364bb940d1727f33a5dff9a3c52f390c15761
We don't explicitly support out-of-band data and treat it like normal in-band
data. This is equilivent to SO_OOBINLINE being enabled, so always report that
it is enabled.
PiperOrigin-RevId: 226572742
Change-Id: I4c30ccb83265e76c30dea631cbf86822e6ee1c1b
Within gVisor, plumb new socket options to netstack.
Within netstack, fix GetSockOpt and SetSockOpt return value logic.
PiperOrigin-RevId: 226532229
Change-Id: If40734e119eed633335f40b4c26facbebc791c74
The code that matches the event being published with events watchers
was wronly matching all watchers in case any of the control event bits
were set.
Issue #121
PiperOrigin-RevId: 226521230
Change-Id: Ie2c42bc4366faaf59fbf80a74e9297499bd93f9e
We must wait for all lazy resources to be released before closing the rootFile.
PiperOrigin-RevId: 226419499
Change-Id: I1d4d961a92b3816e02690cf3eaf0a88944d730cc
"RLIMIT_MEMLOCK: This is the maximum number of bytes of memory that may
be locked into RAM." - getrlimit(2)
PiperOrigin-RevId: 226384346
Change-Id: Iefac4a1bb69f7714dc813b5b871226a8344dc800
Implement pwritev2 and associated unit tests.
Clean up preadv2 unit tests.
Tag RWF_ flags in both preadv2 and pwritev2 with associated bug tickets.
PiperOrigin-RevId: 226222119
Change-Id: Ieb22672418812894ba114bbc88e67f1dd50de620
Connectionless Unix sockets (DGRAM Unix sockets created with the socket system
call) inherently only have a read queue. They do not establish bidirectional
connections, instead, the connect system call only sets a default send
location. Writes give the data to the other endpoint which has its own read
queue.
To simplify the code, connectionless Unix sockets still get read and write
queues, but the write queue is a dummy and never waited on. The read queue is
the connectionless endpoint's queue. This change fixes a bug where the dummy
queue was incorrectly set as the read queue and the endpoint's queue was
incorrectly set as the write queue. This meant that read notifications went
to the dummy queue and were black holed.
PiperOrigin-RevId: 225921042
Change-Id: I8d9059def787a2c3c305185b92d05093fbd2be2a
The old overlayBoundEndpoint assumed that the lower is not an overlay. It
should check if the lower is an overlay and handle that case.
PiperOrigin-RevId: 225882303
Change-Id: I60660c587d91db2826e0719da0983ec8ad024cb8
Currently mlock() and friends do nothing whatsoever. However, mlocking
is directly application-visible in a number of ways; for example,
madvise(MADV_DONTNEED) and msync(MS_INVALIDATE) both fail on mlocked
regions. We handle this inconsistently: MADV_DONTNEED is too important
to not work, but MS_INVALIDATE is rejected.
Change MM to track mlocked regions in a manner consistent with Linux.
It still will not actually pin pages into host physical memory, but:
- mlock() will now cause sentry memory management to precommit mlocked
pages.
- MADV_DONTNEED and MS_INVALIDATE will interact with mlocked pages as
described above.
PiperOrigin-RevId: 225861605
Change-Id: Iee187204979ac9a4d15d0e037c152c0902c8d0ee
Same as with broadcast packets, sending of a multicast packet shouldn't require
accessing the route table. The same applies to IPv6 link-local addresses, which
aren't routable at all (they don't belong to any subnet by definition).
PiperOrigin-RevId: 225775870
Change-Id: Ic53e6560c125a83be2be9c3d112e66b36e8dfe7b
Platform objects are not savable, storing references to them in
filesystem datastructures would cause save to fail if someone actually
passed in a Platform.
Current implementations work because everywhere a Platform is
expected, we currently pass in a Kernel object which embeds Platform
and thus satisfies the interface.
Eliminate this indirection and save pointers to Kernel directly.
PiperOrigin-RevId: 225288336
Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
unshare actually takes a subset of clone flags, but has no unique flags,
so formatting as clone flags is close enough.
PiperOrigin-RevId: 225082774
Change-Id: I5b580f18607c7785f323e37809094115520a17c0
MSG_WAITALL requests that recv family calls do not perform short reads. It only
has an effect for SOCK_STREAM sockets, other types ignore it.
PiperOrigin-RevId: 224918540
Change-Id: Id97fbf972f1f7cbd4e08eec0138f8cbdf1c94fe7
arch_prctl already verified that the new FS_BASE was canonical, but
Task.Clone did not. Centralize these checks in the arch packages.
Failure to validate could cause an error in PTRACE_SET_REGS when we try
to switch to the app.
PiperOrigin-RevId: 224862398
Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
Currently sending a broadcast packet (for DHCP, e.g.) requires a "default
route" of the format "0.0.0.0/0 via 0.0.0.0 <intf>". There is no good reason
for this and on devices with several ports this creates a rather akward route
table with lots of such default routes (which defeats the purpose of a default
route).
PiperOrigin-RevId: 224378769
Change-Id: Icd7ec8a206eb08083cff9a837f6f9ab231c73a19
Unlike FlagSet, order doesn't matter here, so it can simply be a map.
PiperOrigin-RevId: 224377910
Change-Id: I15810c698a7f02d8614bf09b59583ab73cba0514
By Walking before checking that the directory is writable and
executable, MayDelete may return the Walk error (e.g., ENOENT) which
would normally be masked by a permission error (EACCES).
PiperOrigin-RevId: 224222453
Change-Id: I108a7f730e6bdaa7f277eaddb776267c00805475
If sys_prctl is called with PR_SET_MM without CAP_SYS_RESOURCE,
the syscall should return failure with errno set to EPERM.
See: http://man7.org/linux/man-pages/man2/prctl.2.html
PiperOrigin-RevId: 224182874
Change-Id: I630d1dd44af8b444dd16e8e58a0764a0cf1ad9a3
This removes code that should have never made it in in the first place, but did so due to incomplete testing. With the new tests the original code fails, the new code passes.
PiperOrigin-RevId: 224086966
Change-Id: I646fef76977f4528f3705f497b95fad6b3ec32bc
FileOperations.Write should return ErrWouldBlock to allow the upper
layer to loop and sendmsg should continue writing where it left off
on a partial write.
PiperOrigin-RevId: 224081631
Change-Id: Ic61f6943ea6b7abbd82e4279decea215347eac48
The signature of time.now has remained unchanged:
c2412a7681/src/time/time.go (L1072)
PiperOrigin-RevId: 224061160
Change-Id: Ic84bd6ee8fb9952cd9ab580bcb0892444ce7c2da
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.
PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
Bazel runs multiple test cases on the same thread. Some of the test
cases rely on the test thread starting with the default memory policy,
while other tests modify the test thread's memory policy. This
obviously breaks when the test framework doesn't run each test case on
a new thread.
Also fixing an incompatibility where set_mempolicy(2) was prevented
from specifying an empty nodemask, which is allowed for some modes.
PiperOrigin-RevId: 224038957
Change-Id: Ibf780766f2706ebc9b129dbc8cf1b85c2a275074
Untyped integer constants default to type int and the binary package will panic
if one tries to encode an int.
PiperOrigin-RevId: 223890001
Change-Id: Iccc3afd6d74bad24c35d764508e450fd317b76ec
Replaces the WaitGroup with a RWMutex. Calls to Async hold the mutex for
reading, while AsyncBarrier takes the lock for writing. This ensures that all
executing Async work finishes before AsyncBarrier returns.
Also pushes the Async() call from Inode.Release into
gofer/InodeOperations.Release(). This removes a recursive Async call which
should not have been allowed in the first place. The gofer Release call is the
slow one (since it may make RPCs to the gofer), so putting the Async call there
makes sense.
PiperOrigin-RevId: 223093067
Change-Id: I116da7b20fce5ebab8d99c2ab0f27db7c89d890e
With rpcinet if shutdown flags are not saved before making
the rpc a race is possible where blocked threads are woken
up before the flags have been persisted. This would mean
that threads can block indefinitely in a recvmsg after a
shutdown(SHUT_RD) has happened.
PiperOrigin-RevId: 223089783
Change-Id: If595e7add12aece54bcdf668ab64c570910d061a
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.
For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).
PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock
in R mode and could deadlock if a writer comes in between.
PiperOrigin-RevId: 222313551
Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
Also update test utilities for probing vsyscall support and add a
metric to see if vsyscalls are actually used in sandboxes.
PiperOrigin-RevId: 221698834
Change-Id: I57870ecc33ea8c864bd7437833f21aa1e8117477
...to (remote, local), reflecting the (correct) names in the implementation of
DeliverNetworkPacket (see tcpip/stack/nic.go).
Also trim the names in DeliverNetworkPacket and elsewhere to avoid stuttering;
since the type is tcpip.LinkAddress, there's no need to include "LinkAddr" in
the parameter names.
Note that every callsite passes arguments in the order (src, dst).
PiperOrigin-RevId: 221514396
Change-Id: I3637454ad0d6e62a19e4dcbc2a16493798bd0f09
Previously, TCP_NODELAY was always enabled and we would lie about it being
configurable. TCP_NODELAY is now disabled by default (to match Linux) in the
socket layer so that non-gVisor users don't automatically start using this
questionable optimization.
PiperOrigin-RevId: 221368472
Change-Id: Ib0240f66d94455081f4e0ca94f09d9338b2c1356
sync_file_range - sync a file segment with disk
In Linux, sync_file_range() accepts three flags:
SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that
have already been submitted to the device driver for write-out
before performing any write.
SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range
which are not presently submitted write-out. Note that even
this may block if you attempt to write more than request queue
size.
SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing
any write.
In this implementation:
SYNC_FILE_RANGE_WAIT_BEFORE without SYNC_FILE_RANGE_WAIT_AFTER isn't
supported right now.
SYNC_FILE_RANGE_WRITE is skipped. It should initiate write-out of all
dirty pages, but it doesn't wait, so it should be safe to do nothing
while nobody uses SYNC_FILE_RANGE_WAIT_BEFORE.
SYNC_FILE_RANGE_WAIT_AFTER is equal to fdatasync(). In Linux,
sync_file_range() doesn't writes out the file's meta-data, but
fdatasync() does if a file size is changed.
PiperOrigin-RevId: 220730840
Change-Id: Iae5dfb23c2c916967d67cf1a1ad32f25eb3f6286
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise
a kernel version of 4.4.
PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
Increase timeout to prevent the entry from being
found when there is delay on the address resolution
goroutine that doesn't mark the request as failed.
PiperOrigin-RevId: 220504789
Change-Id: I7e44fd95d8624bd69962f862fbf5517a81395f2a
These files were added with the wrong name after all of the existing files
were corrected.
PiperOrigin-RevId: 220202068
Change-Id: Ia0d15233c1aa69330356a7cf16b5aa00d978e09c
Fluentd configuration uses 'log' for the log message
while containerd uses 'msg'. Since we can't have a single
JSON format for both, add another log format and make
debug log configurable.
PiperOrigin-RevId: 219729658
Change-Id: I2a6afc4034d893ab90bafc63b394c4fb62b2a7a0
Updated error messages so that it doesn't print full Go struct representations
when running a new container in a sandbox. For example, this occurs frequently
when commands are not found when doing a 'kubectl exec'.
PiperOrigin-RevId: 219729141
Change-Id: Ic3a7bc84cd7b2167f495d48a1da241d621d3ca09
Shm segments can be marked for lazy destruction via shmctl(IPC_RMID),
which destroys a segment once it is no longer attached to any
processes. We were unconditionally decrementing the segment refcount
on shmctl(IPC_RMID) which allowed a user to force a segment to be
destroyed by repeatedly calling shmctl(IPC_RMID), with outstanding
memory maps to the segment.
This is problematic because the memory released by a segment destroyed
this way can be reused by a different process while remaining
accessible by the process with outstanding maps to the segment.
PiperOrigin-RevId: 219713660
Change-Id: I443ab838322b4fb418ed87b2722c3413ead21845
https://github.com/containerd/containerd/blob/master/oci/spec.go#L206, the mode=755
didn't match the pattern modeRegexp = regexp.MustCompile("0[0-7][0-7][0-7]").
Closes#112
Signed-off-by: Juan <xionghuan.cn@gmail.com>
Change-Id: I469e0a68160a1278e34c9e1dbe4b7784c6f97e5a
PiperOrigin-RevId: 219672525
This reduces the number of floating point save/restore cycles required (since
we don't need to restore immediately following the switch, this always happens
in a known context) and allows the kernel hooks to capture state. This lets us
remove calls like "Current()".
PiperOrigin-RevId: 219552844
Change-Id: I7676fa2f6c18b9919718458aa888b832a7db8cab
This field was added in the intial implementation, before Route existed
to pass the local and remote addresses to the packet-writing path.
Today, the Route's members should be respected. A similar bug was
previously fixed in 214650822.
PiperOrigin-RevId: 219474095
Change-Id: Id2a8ee4421d2841c8d88ccb3c193c455086350ee
Use private futexes for performance and to align with other runtime uses.
PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
Actually parse flags from cpuinfo to avoid mistakenly matching
substrings in cpuinfo that happen to match a flags.
Some features were only exposed in recent versions of Linux. Don't
require them to appear in cpuinfo on old versions of Linux.
Move PREFETCHWT1 back to parse only features. It isn't actually exposed
in Linux yet. Move SDBG to shown features. It has been visible since
Linux 4.3.
PiperOrigin-RevId: 219381731
Change-Id: Ied7c0ee7c8a9879683e81933de56c9074b01108f
Extend the cpuid package to parse and emulate cpuid features that exist
only on AMD and not Intel. The least straightforward part of this is
that AMD duplicates several block 1 features in block 6. Thus we ignore
those features when parsing block 6 and add them when emulating.
PiperOrigin-RevId: 218935032
Change-Id: Id41bf1c24720b0d9b968e2c19ab5bc00a6d62bd4
Linux added these block 3 features to the end of /proc/cpuinfo in
dfb4a70f20c5b3880da56ee4c9484bdb4e8f1e65.
This also fixes that block 3 features were completely missing from
FeatureSet.FlagsString(false) because FlagsString only prints Linux
blocks regardless of the cpuinfo option.
PiperOrigin-RevId: 218913816
Change-Id: I2f9c38c7c9da4b247a140877d4aca782e80684bd
Previously this code used the tcpip error space. Since it is no longer part of
netstack, it can use the sentry's error space (except for a few cases where
there is still some shared code. This reduces the number of error space
conversions required for hot Unix socket operations.
PiperOrigin-RevId: 218541611
Change-Id: I3d13047006a8245b5dfda73364d37b8a453784bb
Pseudoterminal job control signals are meant to be received and handled by the
sandbox process, but if the ptrace stubs are running in the same process group,
they will receive the signals as well and inject then into the sentry kernel.
This can result in duplicate signals being delivered (often to the wrong
process), or a sentry panic if the ptrace stub is inactive.
This CL makes the ptrace stub run in a new session.
PiperOrigin-RevId: 218536851
Change-Id: Ie593c5687439bbfbf690ada3b2197ea71ed60a0e
Attempting to create a zero-len shm segment causes a panic since we
try to allocate a zero-len filemem region. The existing code had a
guard to disallow this, but the check didn't encode the fact that
requesting a private segment implies a segment creation regardless of
whether IPC_CREAT is explicitly specified.
PiperOrigin-RevId: 218405743
Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
The channels {cancel,resCh} have roughly the same lifetime and are used for
roughly the same purpose as an entry's waiters; we can unify the state
management of the two mechanisms, while also reducing unncessary mutex locking
and unlocking.
Made some cosmetic changes while I'm here.
PiperOrigin-RevId: 218343915
Change-Id: Ic69546a2b7b390162b2231f07f335dd6199472d7
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.
PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
This allows us to release messages in the queue when all users close.
PiperOrigin-RevId: 218033550
Change-Id: I2f6e87650fced87a3977e3b74c64775c7b885c1b
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.
PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
This reduces the number of goroutines and runtime timers when
ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set.
This also ensures that thread group CPU timers only advance if running
tasks are observed at the time the CPU clock advances, mostly
eliminating the possibility that a CPU timer expiration observes no
running tasks and falls back to the group leader.
PiperOrigin-RevId: 217603396
Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
This queue only has a single user, so there is no need for it to use an
interface. Merging it into the same package as its sole user allows us to avoid
a circular dependency.
This simplifies the code and should slightly improve performance.
PiperOrigin-RevId: 217595889
Change-Id: Iabbd5164240b935f79933618c61581bc8dcd2822
Now containers run with "docker run -it" support control characters like ^C and
^Z.
This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.
PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
The existing logic is backwards and writes iov_len == 0 for a full write.
PiperOrigin-RevId: 217560377
Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
* Integrate recvMsg and sendMsg functions into Recv and Send respectively as
they are no longer shared.
* Clean up partial read/write error handling code.
* Re-order code to make sense given that there is no longer a host.endpoint
type.
PiperOrigin-RevId: 217255072
Change-Id: Ib43fe9286452f813b8309d969be11f5fa40694cd
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.
PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c