Commit Graph

787 Commits

Author SHA1 Message Date
Ian Lewis 470997ca99 Allow for zero byte iovec with MSG_PEEK | MSG_TRUNC in recvmsg.
This allows for peeking at the length of the next message on a netlink socket
without pulling it off the socket's buffer/queue, allowing tools like 'ip' to
work.

This CL also fixes an issue where dump_done_errno was not included in the
NLMSG_DONE message's payload.

Issue #769

PiperOrigin-RevId: 274068637
2019-10-10 16:55:48 -07:00
Bhasker Hariharan c7e901f47a Fix bugs in fragment handling.
Strengthen the header.IPv4.IsValid check to correctly validate the
IHL and TotalLength fields. Also add a check to make sure that the
fragment offset plus the size of the fragment does not wrap around
when computing the end of the fragment.
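
The wrap-around check described above can be sketched as follows. This is a minimal illustration, not the netstack code: fragmentEnd is a hypothetical helper, and it assumes the offset and size are held as 16-bit values, as they are in the IPv4 header.

```go
package main

import "fmt"

// fragmentEnd is a hypothetical helper. It returns the offset one past the
// last byte of a fragment, and ok=false if first+size overflows 16 bits.
func fragmentEnd(first, size uint16) (end uint16, ok bool) {
	end = first + size // unsigned addition wraps modulo 1<<16 on overflow
	if end < first {
		// The sum wrapped around, so the fragment would claim to end
		// before it starts. Reject it.
		return 0, false
	}
	return end, true
}

func main() {
	fmt.Println(fragmentEnd(0, 1480))     // a normal first fragment
	fmt.Println(fragmentEnd(65528, 1480)) // offset near the 16-bit limit
}
```

The second call is rejected because 65528 + 1480 does not fit in 16 bits.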

PiperOrigin-RevId: 274049313
2019-10-10 15:14:55 -07:00
Adin Scannell f8b1859319 Fix signalfd polling.
The signalfd descriptors otherwise always show as available. This can lead
programs to spin, assuming they are looking to see what signals are pending.

Updates #139

PiperOrigin-RevId: 274017890
2019-10-10 12:51:22 -07:00
gVisor bot bf870c1a42 Internal change.
PiperOrigin-RevId: 273861936
2019-10-09 17:56:05 -07:00
gVisor bot 7a2d5b2fa7 Merge pull request #811 from lubinszARM:pr_testutil
PiperOrigin-RevId: 273781641
2019-10-09 12:00:53 -07:00
Ian Gudger 7c1587e340 Implement IP_TTL.
Also change the default TTL to 64 to match Linux.

PiperOrigin-RevId: 273430341
2019-10-07 19:29:51 -07:00
Kevin Krakauer 1de0cf3563 Remove unnecessary context parameter for new pipes.
PiperOrigin-RevId: 273421634
2019-10-07 18:16:14 -07:00
Kevin Krakauer 6a98237949 Rename epsocket to netstack.
PiperOrigin-RevId: 273365058
2019-10-07 13:57:59 -07:00
gVisor bot 8fce24d33a Merge pull request #753 from lubinszARM:pr_syscall_linux
PiperOrigin-RevId: 273364848
2019-10-07 13:52:19 -07:00
Nicolas Lacasse f24c3188b5 Add sanity check that overlayCreate is called with an overlay parent inode.
PiperOrigin-RevId: 272987037
2019-10-04 17:03:50 -07:00
Kevin Krakauer 7ef1c44a7f Change linux.FileMode from uint to uint16, and update VFS to use FileMode.
In Linux (include/linux/types.h), mode_t is an unsigned short.

PiperOrigin-RevId: 272956350
2019-10-04 14:20:32 -07:00
Andrei Vagin db218fdfcf Don't report partialResult errors from sendfile
The input file descriptor is always a regular file, so sendfile can't lose any
data if it is unable to write it to the output file descriptor.

Reported-by: syzbot+22d22330a35fa1c02155@syzkaller.appspotmail.com
PiperOrigin-RevId: 272730357
2019-10-03 13:38:30 -07:00
gVisor bot cde7711837 Merge pull request #865 from tanjianfeng:fix-829
PiperOrigin-RevId: 272522508
2019-10-02 14:51:04 -07:00
Andrei Vagin 2016cc283c fs/proc: report PID-s from a pid namespace of the proc mount
Right now, we can find more than one process with PID 1 in /proc.

$ for i in `seq 10`; do
> unshare -fp sleep 1000 &
> done

$ ls /proc
1  1  1  1  12  18  24  29  6            loadavg  net   sys          version
1  1  1  1  16  20  26  32  cpuinfo      meminfo  self  thread-self
1  1  1  1  17  21  28  36  filesystems  mounts   stat  uptime

PiperOrigin-RevId: 272506593
2019-10-02 13:29:42 -07:00
Andrei Vagin 9a875306db Merge branch 'master' into pr_syscall_linux
2019-10-02 13:00:07 -07:00
Michael Pratt 0d483985c5 Include AT_SECURE in the aux vector
gVisor does not currently implement the functionality that would result in
AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we
should do the same.
PiperOrigin-RevId: 272311488
2019-10-01 15:43:14 -07:00
Michael Pratt dd69b49ed1 Disable cpuClockTicker when app is idle
Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to
track their CPU usage. This improves latency in the syscall path by avoiding
expensive monotonic clock calls on every syscall entry/exit.

However, this timer fires every 10ms. Thus, when all tasks are idle (i.e.,
blocked or stopped), this forces a sentry wakeup every 10ms, when we may
otherwise be able to sleep until the next app-relevant event. These wakeups
cause the sentry to utilize approximately 2% CPU when the application is
otherwise idle.

Updates to clock are not strictly necessary when the app is idle, as there are
no readers of cpuClock. This commit reduces idle CPU by disabling the timer
when tasks are completely idle, and computing its effects at the next wakeup.

Rather than disabling the timer as soon as the app goes idle, we wait until the
next tick, which provides a window for short sleeps to sleep and wake up without
doing the (relatively) expensive work of disabling and enabling the timer.

PiperOrigin-RevId: 272265822
2019-10-01 12:21:01 -07:00
Michael Pratt 53cc72da90 Honor X bit on extra anon pages in PT_LOAD segments
Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58
(v4.11). Previously, extra pages were always mapped RW. Now, those pages will
be executable if the segment specified PF_X. They still must be writeable.
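
The protection rule above can be sketched as a small function. This is illustrative only: extraAnonProt is a hypothetical name, and the constants mirror the standard ELF PF_X and Linux PROT_* values.

```go
package main

import "fmt"

const (
	pfX = 0x1 // ELF program header flag PF_X (segment is executable)

	protRead  = 0x1 // PROT_READ
	protWrite = 0x2 // PROT_WRITE
	protExec  = 0x4 // PROT_EXEC
)

// extraAnonProt is a hypothetical helper. It returns the protections for the
// extra anonymous pages of a PT_LOAD segment: always readable and writable,
// and executable only if the segment specified PF_X.
func extraAnonProt(segmentFlags uint32) uint32 {
	prot := uint32(protRead | protWrite)
	if segmentFlags&pfX != 0 {
		prot |= protExec
	}
	return prot
}

func main() {
	fmt.Println(extraAnonProt(0x6)) // PF_R|PF_W segment
	fmt.Println(extraAnonProt(0x5)) // PF_R|PF_X segment
}
```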

PiperOrigin-RevId: 272256280
2019-10-01 11:30:36 -07:00
Andrei Vagin 7a234f736f splice: try another fallback option only if the previous one isn't supported
Reported-by: syzbot+bb5ed342be51d39b0cbb@syzkaller.appspotmail.com
PiperOrigin-RevId: 272110815
2019-09-30 18:23:42 -07:00
Andrei Vagin 29a1ba54ea splice: compare inode numbers only if both ends are pipes
It isn't allowed to splice data from and into the same pipe.

But right now this check is broken, because we don't check that both ends are
pipes.
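
A minimal sketch of the corrected check, under the assumption that each end exposes whether it is a pipe and its inode number. The file type and samePipe function are hypothetical stand-ins, not sentry types.

```go
package main

import "fmt"

// file is a hypothetical stand-in for one end of a splice.
type file struct {
	isPipe bool
	inode  uint64
}

// samePipe reports whether in and out refer to the same pipe. Inode numbers
// are compared only when both ends are pipes, since a non-pipe file may
// happen to share an inode number with a pipe.
func samePipe(in, out file) bool {
	return in.isPipe && out.isPipe && in.inode == out.inode
}

func main() {
	pipe := file{isPipe: true, inode: 7}
	regular := file{isPipe: false, inode: 7}
	fmt.Println(samePipe(pipe, pipe))    // same pipe: splice must be refused
	fmt.Println(samePipe(pipe, regular)) // equal inode numbers, but not both pipes
}
```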

PiperOrigin-RevId: 272107022
2019-09-30 17:57:14 -07:00
Adin Scannell 20841b98e1 Update FIXME bug with GitHub issue.
PiperOrigin-RevId: 272101930
2019-09-30 17:24:29 -07:00
Nicolas Lacasse 3ad17ff597 Force timestamps to update when set via InodeOperations.SetTimestamps.
The gofer's CachingInodeOperations implementation contains an optimization for
the common open-read-close pattern when we have a host FD.  In this case, the
host kernel will update the timestamp for us to a reasonably close time, so we
don't need an extra RPC to the gofer.

However, when the app explicitly sets the timestamps (via futimes or similar)
then we actually DO need to update the timestamps, because the host kernel
won't do it for us.

To fix this, a new boolean `forceSetTimestamps` was added to
CachingInodeOperations.SetMaskedAttributes. It is only set by
gofer.InodeOperations.SetTimestamps.

PiperOrigin-RevId: 272048146
2019-09-30 13:08:45 -07:00
Michael Pratt 981fc188f0 Only copy out remaining time on nanosleep success
It looks like the old code attempted to do this, but didn't realize that err !=
nil even in the happy case.

PiperOrigin-RevId: 272005887
2019-09-30 13:07:32 -07:00
gVisor bot eebc38be7a Merge pull request #882 from DarcySail:darcy_faster_CopyStringIn
PiperOrigin-RevId: 271675009
2019-09-27 17:27:13 -07:00
gVisor bot 8539abc0df Merge pull request #864 from tanjianfeng:fix-861
PiperOrigin-RevId: 271649711
2019-09-27 15:18:09 -07:00
gVisor bot abbee5615f Implement SO_BINDTODEVICE sockopt
PiperOrigin-RevId: 271644926
2019-09-27 14:14:04 -07:00
Kevin Krakauer 543492650d Make raw socket tests pass in environments with or without CAP_NET_RAW.
PiperOrigin-RevId: 271442321
2019-09-26 15:09:20 -07:00
gVisor bot 99c86b8dbd Merge pull request #863 from tanjianfeng:fix-862
PiperOrigin-RevId: 271168948
2019-09-25 11:36:06 -07:00
gVisor bot 76ff1947b6 gvisor: change syscall.RawSyscall to syscall.RawSyscall6 where required
Before https://golang.org/cl/173160 syscall.RawSyscall would zero out
the last three register arguments to the system call. That no longer happens.
For system calls that take more than three arguments, use RawSyscall6 to
ensure that we pass zero, not random data, for the additional arguments.

PiperOrigin-RevId: 271062527
2019-09-24 23:47:42 -07:00
Adin Scannell 502f8f238e Stub out readahead implementation.
Closes #261

PiperOrigin-RevId: 270973347
2019-09-24 13:29:46 -07:00
henry.tjf bc9de939fd tty: fix sending SIGTTOU on tty write
How to reproduce:
  $ echo "timeout 10 ls" > foo.sh
  $ chmod +x foo.sh
  $ ./foo.sh
  (will hang here for 10 secs, and the output of ls does not show)

When the "ls" process writes to stdout, it receives a SIGTTOU signal and
hangs until the "timeout" process times out and kills the "ls" process.

The expected result is: "ls" writes its output to the tty and terminates
immediately, then the "timeout" process receives SIGCHLD and terminates.

The reason for this failure is that we missed the check for TOSTOP (if
set, background processes will receive the SIGTTOU signal when they do
write).
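
The missing condition can be sketched as below. This is a simplified illustration, not the sentry code: shouldSendSIGTTOU is a hypothetical name, the TOSTOP value is the one used on Linux x86, and details such as ignored or blocked SIGTTOU are left out.

```go
package main

import "fmt"

// tostop is the TOSTOP local-mode flag (octal 0400 on Linux x86).
const tostop = 0x100

// shouldSendSIGTTOU is a hypothetical helper. A write from a background
// process group triggers SIGTTOU only when TOSTOP is set on the terminal.
func shouldSendSIGTTOU(lflag uint32, inForeground bool) bool {
	if inForeground {
		return false
	}
	return lflag&tostop != 0
}

func main() {
	fmt.Println(shouldSendSIGTTOU(0, false))      // background, TOSTOP clear
	fmt.Println(shouldSendSIGTTOU(tostop, false)) // background, TOSTOP set
	fmt.Println(shouldSendSIGTTOU(tostop, true))  // foreground
}
```

With TOSTOP clear, the background "ls" in the reproduction above would be allowed to write and exit.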

We use drivers/tty/n_tty.c:n_tty_write() as a reference.

Fixes: #862

Reported-by: chris.zn <chris.zn@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Signed-off-by: chenglang.hy <chenglang.hy@antfin.com>
2019-09-24 14:18:22 +00:00
Andrei Vagin 03ee55cc62 netstack: convert more socket options to {Set,Get}SockOptInt
PiperOrigin-RevId: 270763208
2019-09-23 14:39:14 -07:00
gVisor bot 4aeedd47bf internal BUILD file cleanup.
PiperOrigin-RevId: 270680704
2019-09-23 08:25:13 -07:00
Jamie Liu fb55c2bd0d Change vfs.Dirent.Off to NextOff.
"d_off is the distance from the start of the directory to the start of the next
linux_dirent." - getdents(2).

PiperOrigin-RevId: 270349685
2019-09-20 14:24:29 -07:00
Jianfeng Tan 223481e927 fix set hostname
Previously, when we set hostname:

$ strace hostname abc
...
sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long)
...

According to man 2 sethostname:

"The len argument specifies the number of bytes in name. (Thus, name
does not require a terminating null byte.)"

We wrongly used CopyStringIn(), which checks for a terminating zero byte, in
the implementation of the sethostname syscall.

To fix this, we use CopyInBytes() instead.
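
The difference between the two approaches can be sketched with plain byte slices standing in for user memory. copyString and copyBytes below are hypothetical simplifications and do not reproduce the signatures of CopyStringIn() or CopyInBytes().

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

var errNameTooLong = errors.New("name too long")

// copyString mimics a NUL-terminated read: it fails unless a zero byte is
// found within the first maxlen bytes.
func copyString(mem []byte, maxlen int) (string, error) {
	if maxlen > len(mem) {
		maxlen = len(mem)
	}
	i := bytes.IndexByte(mem[:maxlen], 0)
	if i < 0 {
		return "", errNameTooLong
	}
	return string(mem[:i]), nil
}

// copyBytes reads exactly n bytes and does not look for a terminator.
func copyBytes(mem []byte, n int) ([]byte, error) {
	if n < 0 || n > len(mem) {
		return nil, errors.New("bad address")
	}
	out := make([]byte, n)
	copy(out, mem[:n])
	return out, nil
}

func main() {
	mem := []byte("abc\x00") // what sethostname("abc", 3) points at
	_, err := copyString(mem, 3)
	fmt.Println(err) // no NUL within the first 3 bytes
	name, _ := copyBytes(mem, 3)
	fmt.Println(string(name))
}
```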

Fixes: #861

Reported-by: chenglang.hy <chenglang.hy@antfin.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-09-20 17:57:25 +00:00
Jianfeng Tan 329b6653ff Implement /proc/net/tcp6
Fixes: #829

Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Signed-off-by: Jielong Zhou <jielong.zjl@antfin.com>
2019-09-20 17:20:08 +00:00
Adin Scannell 75781ab3ef Remove defer from hot path and ensure Atomic is applied consistently.
PiperOrigin-RevId: 270114317
2019-09-19 13:39:32 -07:00
gVisor bot 1c0324d5a1 Merge pull request #876 from xiaobo55x:hostcpu
PiperOrigin-RevId: 270094324
2019-09-19 12:03:38 -07:00
Kevin Krakauer 0a8a75f3da Job control: controlling TTYs and foreground process groups.
Addresses a deadlock with the rolled back change:
b6a5b950d2
Creating a session from an orphaned process group was causing a lock to be
acquired twice by a single goroutine. This behavior is addressed, and a test
(OrphanRegression) has been added to pty.cc.

Implemented the following ioctls:
- TIOCSCTTY - set controlling TTY
- TIOCNOTTY - remove controlling tty, maybe signal some other processes
- TIOCGPGRP - get foreground process group. Also enables tcgetpgrp().
- TIOCSPGRP - set foreground process group. Also enables tcsetpgrp().

Next steps are to actually turn terminal-generated control characters (e.g. ^C)
into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when
appropriate.

PiperOrigin-RevId: 270088599
2019-09-19 11:36:47 -07:00
Hang Su d72c63664b Accelerate byte lookup in string with `bytealg/indexbyte`
`bytealg/indexbyte` will use AVX or SSE instruction set, if possible,
which could accelerate `CopyStringIn` function by 28%.

In the worst case (the CPU doesn't support SSE), `bytealg/indexbyte`
will degenerate to a byte-by-byte traversal. When dealing with
short strings, `bytealg/indexbyte` has the same performance level as
before.

Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
Signed-off-by: Hang Su <darcy.sh@antfin.com>
2019-09-19 22:16:52 +08:00
Haibo Xu cabe10e603 Enable pkg/sentry/hostcpu support on arm64.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I333872da9bdf56ddfa8ab2f034dfc1f36a7d3132
2019-09-18 23:51:42 +00:00
Adin Scannell c98e7f0d19 Signalfd support
Note that the exact semantics for these signalfds are slightly different from
Linux. These signalfds are bound to the process at creation time. Reads, polls,
etc. are all associated with signals directed at that task. In Linux, all
signalfd operations are associated with current, regardless of where the
signalfd originated.

In practice, this should not be an issue given how signalfds are used. In order
to fix this however, we will need to plumb the context through all the event
APIs. This gets complicated really quickly, because the waiter APIs are all
netstack-specific, and not generally exposed to the context.  Probably not
worthwhile fixing immediately.

PiperOrigin-RevId: 269901749
2019-09-18 15:16:42 -07:00
Bin Lu 38bc0b6b6a enable syscalls/linux to support arm64
Signed-off-by: Bin Lu <bin.lu@arm.com>
Change-Id: I45af8a54304f8bb0e248ab15f4e20b173ea9e430
2019-09-18 10:13:06 +00:00
Bin Lu 8e73e2cec5 enable kvm/testutil to support arm64
The Arm64 user-mode execution state consists of:
1. X0-X30
2. PC, SP, PSTATE
3. TPIDR_EL0, used for TLS
4. V0-V31: 32 128-bit registers for floating point and SIMD
5. FPSR

Currently, we first try to achieve goals 1 and 2.

This patch provides basic test utils for goals 1 and 2.

Signed-off-by: Bin Lu <bin.lu@arm.com>
2019-09-18 09:57:59 +00:00
Andrei Vagin 3b7119a7c9 platform/ptrace: log exit code for stub processes
PiperOrigin-RevId: 269631877
2019-09-17 12:45:22 -07:00
Andrei Vagin 239a07aabf gvisor: return ENOTDIR from the unlink syscall
ENOTDIR has to be returned when a component used as a directory in
pathname is not, in fact, a directory.

PiperOrigin-RevId: 269037893
2019-09-13 21:44:57 -07:00
Adin Scannell 7c6ab6a219 Implement splice methods for pipes and sockets.
This also allows the tee(2) implementation to be enabled, since dup can now be
properly supported via WriteTo.

Note that this change necessitated some minor restructuring of the
fs.FileOperations splice methods. If the *fs.File is passed through directly,
then only public API methods are accessible, which will deadlock immediately
since the locking is already done by fs.Splice. Instead, we pass through an
abstract io.Reader or io.Writer, which elide locks and use the underlying
fs.FileOperations directly.

PiperOrigin-RevId: 268805207
2019-09-12 17:43:27 -07:00
Michael Pratt df5d377521 Remove go_test from go_stateify and go_marshal
They are no-ops, so the standard rule works fine.

PiperOrigin-RevId: 268776264
2019-09-12 15:10:17 -07:00
Rahat Mahmood 3733b9b893 go_marshal: Implement automatic generation of ABI marshalling code.
This CL implements go_marshal, a code generation utility for
automatically serializing and deserializing ABI structs.

The go_marshal tool automatically generates implementations of the new
marshal interface. Unlike binary.Marshal/Unmarshal, the generated
interface implementations use no runtime reflection, and translate to
a single memcpy for most structs. See go_marshal/README.md for
details.
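
To give a feel for reflection-free marshalling, here is a hand-written sketch of the kind of code such a generator could emit. Everything in it is an assumption made for illustration: the timespec struct, the method names, and the field-by-field little-endian encoding (used here in place of the single memcpy mentioned above).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// timespec is a hypothetical ABI struct with a fixed 16-byte layout.
type timespec struct {
	Sec  int64
	Nsec int64
}

// SizeBytes returns the encoded size, known statically from the layout.
func (t *timespec) SizeBytes() int { return 16 }

// MarshalBytes writes the fields into dst without using reflection.
func (t *timespec) MarshalBytes(dst []byte) {
	binary.LittleEndian.PutUint64(dst[0:8], uint64(t.Sec))
	binary.LittleEndian.PutUint64(dst[8:16], uint64(t.Nsec))
}

// UnmarshalBytes reads the fields back from src.
func (t *timespec) UnmarshalBytes(src []byte) {
	t.Sec = int64(binary.LittleEndian.Uint64(src[0:8]))
	t.Nsec = int64(binary.LittleEndian.Uint64(src[8:16]))
}

func main() {
	in := timespec{Sec: 5, Nsec: 7}
	buf := make([]byte, in.SizeBytes())
	in.MarshalBytes(buf)

	var out timespec
	out.UnmarshalBytes(buf)
	fmt.Println(out == in) // the round trip preserves both fields
}
```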

PiperOrigin-RevId: 268065475
2019-09-09 13:36:39 -07:00
Nicolas Lacasse 7e94f171f4 Better strace logs for statx.
PiperOrigin-RevId: 267498537
2019-09-05 18:03:53 -07:00