Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to
track their CPU usage. This improves latency in the syscall path by avoid
expensive monotonic clock calls on every syscall entry/exit.
However, this timer fires every 10ms. Thus, when all tasks are idle (i.e.,
blocked or stopped), this forces a sentry wakeup every 10ms, when we may
otherwise be able to sleep until the next app-relevant event. These wakeups
cause the sentry to utilize approximately 2% CPU when the application is
otherwise idle.
Updates to clock are not strictly necessary when the app is idle, as there are
no readers of cpuClock. This commit reduces idle CPU by disabling the timer
when tasks are completely idle, and computing its effects at the next wakeup.
Rather than disabling the timer as soon as the app goes idle, we wait until the
next tick, which provides a window for short sleeps to sleep and wakeup without
doing the (relatively) expensive work of disabling and enabling the timer.
PiperOrigin-RevId: 272265822
'docker exec' was getting CAP_NET_RAW even when --net-raw=false
because it was not filtered out from when copying container's
capabilities.
PiperOrigin-RevId: 272260451
Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58
(v4.11). Previously, extra pages were always mapped RW. Now, those pages will
be executable if the segment specified PF_X. They still must be writeable.
PiperOrigin-RevId: 272256280
It isn't allowed to splice data from and into the same pipe.
But right now this check is broken, because we don't check that both ends are
pipes.
PiperOrigin-RevId: 272107022
Netstack always picks a random start point everytime PickEphemeralPort
is called. While this is required for UDP so that DNS requests go
out through a randomized set of ports it is not required for TCP. Infact
Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret
initialized at start of the application to get a random offset. But to
ensure it doesn't start from the same point on every scan it uses a static
hint that is incremented by 2 in every call to pick ephemeral ports.
The reason for 2 is Linux seems to split the port ranges where active connects
seem to use even ones while odd ones are used by listening sockets.
This CL implements a similar strategy where we use a hash + hint to generate
the offset to start the search for a free Ephemeral port.
This ensures that we cycle through the available port space in order for
repeated connects to the same destination and significantly reduces the
chance of picking a recently released port.
PiperOrigin-RevId: 272058370
The gofer's CachingInodeOperations implementation contains an optimization for
the common open-read-close pattern when we have a host FD. In this case, the
host kernel will update the timestamp for us to a reasonably close time, so we
don't need an extra RPC to the gofer.
However, when the app explicitly sets the timestamps (via futimes or similar)
then we actually DO need to update the timestamps, because the host kernel
won't do it for us.
To fix this, a new boolean `forceSetTimestamps` was added to
CachineInodeOperations.SetMaskedAttributes. It is only set by
gofer.InodeOperations.SetTimestamps.
PiperOrigin-RevId: 272048146
One would reasonably assume that a field named "regex" would expect
a regular expression. However, in this case, one would be wrong.
The "regex" field actually requires "FileSet" [1] syntax.
?\_(?)_/?
[1] http://ant.apache.org/manual/Types/fileset.html
PiperOrigin-RevId: 271917356
We don't want to upload packages from the presubmit jobs.
This will fix the error:
[11:01:34][ERROR] Cannot inject environment variables into
the build without allowed_env_vars regexes.
PiperOrigin-RevId: 271622996
BUILD:85:1: in _pkg_deb rule //runsc:runsc-debian: target
'//runsc:runsc-debian' depends on deprecated target
'@bazel_tools//tools/build_defs/pkg:make_deb': The internal version of
make_deb is deprecated. Please use the replacement for pkg_deb from
https://github.com/bazelbuild/rules_pkg/blob/master/pkg.
PiperOrigin-RevId: 271590386
Before https://golang.org/cl/173160 syscall.RawSyscall would zero out
the last three register arguments to the system call. That no longer happens.
For system calls that take more than three arguments, use RawSyscall6 to
ensure that we pass zero, not random data, for the additional arguments.
PiperOrigin-RevId: 271062527
Non-primary addresses are used for endpoints created to accept multicast and
broadcast packets, as well as "helper" endpoints (0.0.0.0) that allow sending
packets when no proper address has been assigned yet (e.g., for DHCP). These
addresses are not real addresses from a user point of view and should not be
part of the NICInfo() value. Also see b/127321246 for more info.
This switches NICInfo() to call a new NIC.PrimaryAddresses() function. To still
allow an option to get all addresses (mostly for testing) I added
Stack.GetAllAddresses() and NIC.AllAddresses().
In addition, the return value for GetMainNICAddress() was changed for the case
where the NIC has no primary address. Instead of returning an error here,
it now returns an empty AddressWithPrefix() value. The rational for this
change is that it is a valid case for a NIC to have no primary addresses.
Lastly, I refactored the code based on the new additions.
PiperOrigin-RevId: 270971764
How to reproduce:
$ echo "timeout 10 ls" > foo.sh
$ chmod +x foo.sh
$ ./foo.sh
(will hang here for 10 secs, and the output of ls does not show)
When "ls" process writes to stdout, it receives SIGTTOU signal, and
hangs there. Until "timeout" process timeouts, and kills "ls" process.
The expected result is: "ls" writes its output into tty, and terminates
immdedately, then "timeout" process receives SIGCHLD and terminates.
The reason for this failure is that we missed the check for TOSTOP (if
set, background processes will receive the SIGTTOU signal when they do
write).
We use drivers/tty/n_tty.c:n_tty_write() as a reference.
Fixes: #862
Reported-by: chris.zn <chris.zn@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Signed-off-by: chenglang.hy <chenglang.hy@antfin.com>
The test is checking the wrong poll_fd for POLLHUP. The only
reason it passed till now was because it was also checking
for POLLIN which was always true on the other fd from the
previous poll!
PiperOrigin-RevId: 270780401
Previously, the only safe way to use an fdbased endpoint was to leak the FD.
This change makes it possible to safely close the FD.
This is the first step towards having stoppable stacks.
Updates #837
PiperOrigin-RevId: 270346582
Previously, when we set hostname:
$ strace hostname abc
...
sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long)
...
According to man 2 sethostname:
"The len argument specifies the number of bytes in name. (Thus, name
does not require a terminating null byte.)"
We wrongly use the CopyStringIn() to check terminating zero byte in
the implementation of sethostname syscall.
To fix this, we use CopyInBytes() instead.
Fixes: #861
Reported-by: chenglang.hy <chenglang.hy@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>