Commit Graph

768 Commits

Author SHA1 Message Date
Fabricio Voznika 209a95a35a Propagate IP address prefix from host to netstack
Closes #4022

PiperOrigin-RevId: 343378647
2020-11-19 15:11:17 -08:00
Ayush Ranjan e5650d1240 [netstack] Move SO_KEEPALIVE and SO_ACCEPTCONN option to SocketOptions.
PiperOrigin-RevId: 343217712
2020-11-18 21:24:55 -08:00
Ayush Ranjan df37babd57 [netstack] Move SO_REUSEPORT and SO_REUSEADDR option to SocketOptions.
This change also introduces:
- `SocketOptionsHandler` interface which can be implemented by endpoints to
  handle endpoint specific behavior on SetSockOpt. This is analogous to what
  Linux does.
- `DefaultSocketOptionsHandler` which is a default implementation of the above.
  This is embedded in all endpoints so that we don't have to uselessly
  implement empty functions. Endpoints with specific behavior can override the
  embedded method by defining their own implementation.

PiperOrigin-RevId: 343158301
2020-11-18 14:36:41 -08:00
Ayush Ranjan 3e73c519a5 [netstack] Move SO_NO_CHECK option to SocketOptions.
PiperOrigin-RevId: 343146856
2020-11-18 13:42:27 -08:00
Ayush Ranjan fc342fb439 [netstack] Move SO_PASSCRED option to SocketOptions.
This change also makes the following fixes:
- Make SocketOptions use atomic operations instead of having to acquire/drop
  locks upon each get/set option.
- Make documentation more consistent.
- Remove tcpip.SocketOptions from socketOpsCommon because it already exists
  in transport.Endpoint.
- Refactors get/set socket options tests to be easily extendable.
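
As a rough sketch of the first point, a boolean option can be stored in a `uint32` and accessed with `sync/atomic` instead of a mutex. The struct and field names below are assumptions for illustration and are not taken from the real `tcpip.SocketOptions`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// socketOptions is a hypothetical stand-in for a socket options struct.
type socketOptions struct {
	// passCred is 1 when the option is enabled. It must only be accessed
	// through sync/atomic.
	passCred uint32
}

// SetPassCred stores the option without taking a lock.
func (so *socketOptions) SetPassCred(v bool) {
	var n uint32
	if v {
		n = 1
	}
	atomic.StoreUint32(&so.passCred, n)
}

// GetPassCred loads the option without taking a lock.
func (so *socketOptions) GetPassCred() bool {
	return atomic.LoadUint32(&so.passCred) != 0
}

func main() {
	var so socketOptions
	so.SetPassCred(true)
	fmt.Println(so.GetPassCred())
}
```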

PiperOrigin-RevId: 343103780
2020-11-18 10:19:33 -08:00
Bhasker Hariharan 05d2a26f7a Fix possible deadlock in UDP.Write().
In UDP endpoint.Write(), sendUDP is called with e.mu RLocked. But if this happens
to send a datagram over loopback which ends up generating an ICMP response of
say ErrNoPortReachable, the handling of the response in HandleControlPacket also
acquires e.mu using RLock. This is mostly fine unless there is a competing
caller trying to acquire e.mu in exclusive mode using Lock(). This will deadlock
as a caller waiting in Lock() blocks any new RLock() calls to ensure it can
actually acquire the Lock.

This is documented here https://golang.org/pkg/sync/#RWMutex.

This change releases the endpoint mutex before calling sendUDP to resolve the
possibility of the deadlock.
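
A minimal sketch of the fix pattern, assuming a simplified endpoint: state is snapshotted under the read lock, the lock is released, and only then is the send path (which may re-enter and take RLock again) invoked. The `endpoint` type, its fields, and the `send`/`write` helpers are hypothetical and only model the locking order.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a simplified, hypothetical model of a UDP endpoint.
type endpoint struct {
	mu      sync.RWMutex
	dstPort uint16
}

// handleControlPacket takes the read lock, as the ICMP error path does.
func (e *endpoint) handleControlPacket() uint16 {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.dstPort
}

// send models a loopback send that synchronously re-enters the endpoint.
func (e *endpoint) send(port uint16) bool {
	return e.handleControlPacket() == port
}

// write snapshots state under RLock and releases the lock before sending,
// so the re-entrant RLock cannot queue behind a pending Lock().
func (e *endpoint) write() bool {
	e.mu.RLock()
	port := e.dstPort
	e.mu.RUnlock()
	return e.send(port)
}

func main() {
	e := &endpoint{dstPort: 53}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			e.mu.Lock() // competing exclusive locker
			e.mu.Unlock()
		}
	}()
	ok := true
	for i := 0; i < 1000; i++ {
		ok = ok && e.write()
	}
	wg.Wait()
	fmt.Println(ok)
}
```

If `write` instead held the read lock across `send`, the nested RLock could block behind the goroutine waiting in Lock(), which in turn waits for the outer read lock to be released.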

Reported-by: syzbot+537989797548c66e8ee3@syzkaller.appspotmail.com
Reported-by: syzbot+eb0b73b4ab486f7673ba@syzkaller.appspotmail.com
PiperOrigin-RevId: 342894148
2020-11-17 10:36:29 -08:00
Bhasker Hariharan fb9a649f39 Fix SO_ERROR behavior for TCP in gVisor.
Fixes the behaviour of SO_ERROR for TCP sockets. In Linux it returns
sk->sk_err, and if sk->sk_err is 0 it returns sk->sk_soft_err. In gVisor TCP,
endpoint.HardError is the equivalent of sk->sk_err and endpoint.LastError
holds soft errors. This change brings gVisor into alignment with Linux such
that both hard and soft errors are cleared when retrieved using
getsockopt(.. SO_ERROR) on a socket.

Fixes #3812

PiperOrigin-RevId: 342868552
2020-11-17 08:33:03 -08:00
Jamie Liu 267560d159 Reset watchdog timer between sendfile() iterations.
As part of this, change Task.interrupted() to not drain Task.interruptChan, and
do so explicitly using new function Task.unsetInterrupted() instead.

PiperOrigin-RevId: 342768365
2020-11-16 18:55:24 -08:00
Jamie Liu d5e17d2dbc Disable save/restore in PartialBadBufferTest.SendMsgTCP.
PiperOrigin-RevId: 342314586
2020-11-13 12:24:53 -08:00
Mithun Iyer 8e6963491c Deflake tcp_socket test.
Increase the wait time for the thread to be blocked on read/write
syscall.

PiperOrigin-RevId: 342204627
2020-11-12 23:04:12 -08:00
Nayana Bidari 5bb64ce1b8 Refactor SOL_SOCKET options
Store all the socket level options in a struct and call {Get/Set}SockOpt on
this struct. This will avoid implementing socket level options on all
endpoints. This CL contains implementing one socket level option for tcp and
udp endpoints.

PiperOrigin-RevId: 342203981
2020-11-12 22:57:00 -08:00
Mithun Iyer 199fcd0fe5 Skip `EventHUp` notify in `FIN_WAIT2` on a socket close.
This Notify was added as part of cl/279106406; but notifying `EventHUp`
in `FIN_WAIT2` is incorrect, as we want to only notify later on
`TIME_WAIT` or a reset. However, we do need to notify any blocked
waiters of an activity on the endpoint with `EventIn`|`EventOut`.

PiperOrigin-RevId: 341490913
2020-11-09 14:54:57 -08:00
Andrei Vagin 2fcca60a7b net: connect to the ipv4 localhost returns ENETUNREACH if the address isn't set
cl/340002915 modified the code to return EADDRNOTAVAIL if connect
is called for a localhost address which isn't set.

But actually, Linux returns EADDRNOTAVAIL for ipv6 addresses and ENETUNREACH
for ipv4 addresses.

Updates #4735

PiperOrigin-RevId: 341479129
2020-11-09 13:57:51 -08:00
Jing Chen 3ac00fe9c3 Implement command GETNCNT for semctl.
PiperOrigin-RevId: 341154192
2020-11-06 18:38:13 -08:00
Nicolas Lacasse 53eeb06ef1 Fix infinite loop when splicing to pipes/eventfds.
Writes to pipes of size < PIPE_BUF are guaranteed to be atomic, so writes
larger than that will return EAGAIN if the pipe has capacity < PIPE_BUF.

Writes to eventfds will return EAGAIN if the write would cause the eventfd
value to go over the max.

In both such cases, calling Ready() on the FD will return true (because it is
possible to write), but specific kinds of writes will in fact return EAGAIN.

This CL fixes an infinite loop in splice and sendfile (VFS1 and VFS2) by
skipping the readiness check for the outfile in send, splice, and tee.

PiperOrigin-RevId: 341102260
2020-11-06 12:55:29 -08:00
Ghanan Gowripalan 955e09dfbd Do not send to the zero port
Port 0 is not meant to identify any remote port so attempting to send
a packet to it should return an error.
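
A check of this kind might look like the sketch below. The error value and function name are hypothetical; the real change returns an error from the netstack send path.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidPort is a hypothetical error for illustration.
var errInvalidPort = errors.New("cannot send to port 0")

// checkRemotePort rejects the zero port, which never identifies a remote port.
func checkRemotePort(port uint16) error {
	if port == 0 {
		return errInvalidPort
	}
	return nil
}

func main() {
	fmt.Println(checkRemotePort(0))
	fmt.Println(checkRemotePort(53))
}
```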

PiperOrigin-RevId: 341009528
2020-11-06 01:47:09 -08:00
Jamie Liu a00c5df98b Deflake semaphore_test.
- Disable saving in tests that wait for EINTR.

- Do not execute async-signal-unsafe code after fork() (see fork(2)'s manpage,
  "After a fork in a multithreaded program ...")

- Check for errors returned by semctl(GETZCNT).

PiperOrigin-RevId: 340901353
2020-11-05 12:07:12 -08:00
Jing Chen 1a3f417f4a Implement command GETZCNT for semctl.
PiperOrigin-RevId: 340389884
2020-11-02 23:58:45 -08:00
Andrei Vagin 9efaf67518 Clean up the code of setupTimeWaitClose
The active_closefd has to be shutdown only for write,
otherwise the second poll will always return immediately.

The second poll should not be called from a separate thread.

PiperOrigin-RevId: 340319071
2020-11-02 14:42:03 -08:00
Ian Lewis 5e606844df Fix returned error when deleting non-existent address
PiperOrigin-RevId: 340149214
2020-11-01 18:03:43 -08:00
Andrei Vagin df88f223bb net/tcpip: connect to unset loopback address has to return EADDRNOTAVAIL
In the docker container, the ipv6 loopback address is not set,
and connect("::1") has to return EADDRNOTAVAIL in this case.

Without this fix, it returns EHOSTUNREACH.

PiperOrigin-RevId: 340002915
2020-10-31 01:19:40 -07:00
Jamie Liu 9ad864628d Separate kernel.Task.AsCopyContext() into CopyContext() and OwnCopyContext().
kernel.copyContext{t} cannot be used outside of t's task goroutine, for three
reasons:

- t.CopyScratchBuffer() is task-goroutine-local.

- Calling t.MemoryManager() without running on t's task goroutine or locking
  t.mu violates t.MemoryManager()'s preconditions.

- kernel.copyContext passes t as context.Context to MM IO methods, which is
  illegal outside of t's task goroutine (cf. kernel.Task.Value()).

Fix this by splitting AsCopyContext() into CopyContext() (which takes an
explicit context.Context and is usable outside of the task goroutine) and
OwnCopyContext() (which uses t as context.Context, but is only usable by t's
task goroutine).

PiperOrigin-RevId: 339933809
2020-10-30 13:54:47 -07:00
gVisor bot 17e0a4adde Merge pull request #2849 from lubinszARM:pr_memory_barrier
PiperOrigin-RevId: 339504677
2020-10-28 11:45:01 -07:00
Bhasker Hariharan 24c33de748 Wake up any waiters on an ICMP error on UDP socket.
This change wakes up any waiters when we receive an ICMP port unreachable
control packet on a UDP socket, and sets waiter.EventErr in
the result returned by Readiness() when e.lastError is not nil.

The latter is required where an epoll()/poll() is done after the error
is already handled since we will never notify again in such cases.

PiperOrigin-RevId: 339370469
2020-10-27 18:13:46 -07:00
Lennart 1c2836da37 Implement /proc/[pid]/mem
This PR implements /proc/[pid]/mem for `pkg/sentry/fs` (refer to #2716) and `pkg/sentry/fsimpl`.

@majek

COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/4060 from lnsp:proc-pid-mem 2caf9021254646f441be618a9bb5528610e44d43
PiperOrigin-RevId: 339369629
2020-10-27 18:07:22 -07:00
Ian Lewis 59e2c9f16a Add basic address deletion to netlink
Updates #3921

PiperOrigin-RevId: 339195417
2020-10-27 00:18:10 -07:00
Jing Chen facb2fb9c3 Implement command IPC_STAT for semctl.
PiperOrigin-RevId: 339166854
2020-10-26 19:26:42 -07:00
Dean Deng 0bdcee38bd Fix SCM Rights S/R reference leak.
Control messages collected when peeking into a socket were being leaked.

PiperOrigin-RevId: 339114961
2020-10-26 14:15:55 -07:00
Jamie Liu bc814b01ab Avoid excessive save/restore cycles in socket_ipv4_udp_unbound tests.
PiperOrigin-RevId: 338805321
2020-10-24 00:23:52 -07:00
Jamie Liu 9f87400f08 Support VFS2 save/restore.
Inode number consistency checks are now skipped in save/restore tests for
reasons described in greatest detail in StatTest.StateDoesntChangeAfterRename.
They pass in VFS1 due to the bug described in new test case
SimpleStatTest.DifferentFilesHaveDifferentDeviceInodeNumberPairs.

Fixes #1663

PiperOrigin-RevId: 338776148
2020-10-23 17:48:33 -07:00
Zach Koopmans 634e14a094 Fix socket_ipv4_udp_unbound_loopback_test_linux
Handle "Resource temporarily unavailable" EAGAIN errors with a select
call before calling recvmsg.

Also rename similar helper call from "RecvMsgTimeout" to "RecvTimeout",
because it calls "recv".

PiperOrigin-RevId: 338761695
2020-10-23 16:13:46 -07:00
Nayana Bidari 39e9b3bb8a Support getsockopt for SO_ACCEPTCONN.
The SO_ACCEPTCONN option is used only on getsockopt(). When this option is
specified, getsockopt() indicates whether socket listening is enabled for
the socket. A value of zero indicates that socket listening is disabled;
non-zero that it is enabled.

PiperOrigin-RevId: 338703206
2020-10-23 10:48:24 -07:00
Bhasker Hariharan 5d909dd49c Decrement e.synRcvdCount once handshake is complete.
Earlier the count was dropped only after calling e.deliverAccepted. This led to
an issue where there were no connections in SYN-RCVD state for the listening
endpoint but e.synRcvdCount would not be zero because it was being reduced only
when handleSynSegment returned after deliverAccepted returned.

The issue shows up as the Nth SYN for a listen backlog of size N, the one that
would fill the backlog, occasionally being dropped. This happens when the new
SYN arrives after the previously completed endpoint has been delivered to the
accept queue but before synRcvdCount has been decremented, because the
goroutine running handleSynSegment has not yet completed.

PiperOrigin-RevId: 338690646
2020-10-23 09:43:09 -07:00
Dean Deng 9ca66ec598 Rewrite reference leak checker without finalizers.
Our current reference leak checker uses finalizers to verify whether an object
has reached zero references before it is garbage collected. There are multiple
problems with this mechanism, so a rewrite is in order.

With finalizers, there is no way to guarantee that a finalizer will run before
the program exits. When an unreachable object with a finalizer is garbage
collected, its finalizer will be added to a queue and run asynchronously. The
best we can do is run garbage collection upon sandbox exit to make sure that
all finalizers are enqueued.

Furthermore, if there is a chain of finalized objects, e.g. A points to B
points to C, garbage collection needs to run multiple times before all of the
finalizers are enqueued. The first GC run will register the finalizer for A but
not free it. It takes another GC run to free A, at which point B's finalizer
can be registered. As a result, we need to run GC as many times as the length
of the longest such chain to have a somewhat reliable leak checker.

Finally, a cyclical chain of structs pointing to one another will never be
garbage collected if a finalizer is set. This is a well-known issue with Go
finalizers (https://github.com/golang/go/issues/7358). Using leak checking on
filesystem objects that produce cycles will not work and even result in memory
leaks.

The new leak checker stores reference counted objects in a global map when
leak check is enabled and removes them once they are destroyed. At sandbox
exit, any remaining objects in the map are considered as leaked. This provides
a deterministic way of detecting leaks without relying on the complexities of
finalizers and garbage collection.
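
The map-based approach can be sketched as follows. This is a simplified illustration, not the actual gVisor checker: the `object` and `leakChecker` types and their method names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// object is a hypothetical reference-counted object.
type object struct{ name string }

// leakChecker tracks live objects in a map guarded by a mutex.
type leakChecker struct {
	mu   sync.Mutex
	live map[*object]struct{}
}

func newLeakChecker() *leakChecker {
	return &leakChecker{live: make(map[*object]struct{})}
}

// register records o as live; called when the object is created.
func (c *leakChecker) register(o *object) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.live[o] = struct{}{}
}

// unregister removes o; called when the object is destroyed. It panics if o
// was never registered, which catches objects with leak checking left off.
func (c *leakChecker) unregister(o *object) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.live[o]; !ok {
		panic("unregistering object that was never registered: " + o.name)
	}
	delete(c.live, o)
}

// leaked returns the sorted names of objects still in the map; intended to
// be called once at exit.
func (c *leakChecker) leaked() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	names := make([]string, 0, len(c.live))
	for o := range c.live {
		names = append(names, o.name)
	}
	sort.Strings(names)
	return names
}

func main() {
	c := newLeakChecker()
	a, b := &object{"a"}, &object{"b"}
	c.register(a)
	c.register(b)
	c.unregister(a)
	fmt.Println(c.leaked())
}
```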

This approach has several benefits over the former, including:
- Always detects leaks of objects that should be destroyed very close to
  sandbox exit. The old checker very rarely detected these leaks, because it
  relied on garbage collection to be run in a short window of time.
- Panics if we forgot to enable leak check on a ref-counted object (we will try
  to remove it from the map when it is destroyed, but it will never have been
  added).
- Can store extra logging information in the map values without adding to the
  size of the ref count struct itself. With the size of just an int64, the ref
  count object remains compact, meaning frequent operations like IncRef/DecRef
  are more cache-efficient.
- Can aggregate leak results in a single report after the sandbox exits.
  Instead of having warnings littered in the log, which were
  non-deterministically triggered by garbage collection, we can print all
  warning messages at once. Note that this could also be a limitation--the
  sandbox must exit properly for leaks to be detected.

Some basic benchmarking indicates that this change does not significantly
affect performance when leak checking is enabled, which is understandable
since registering/unregistering is only done once for each filesystem object.

Updates #1486.

PiperOrigin-RevId: 338685972
2020-10-23 09:17:02 -07:00
Jamie Liu cd86bd4931 Fix runsc tests on VFS2 overlay.
- Check the sticky bit in overlay.filesystem.UnlinkAt(). Fixes
  StickyTest.StickyBitPermDenied.

- When configuring a VFS2 overlay in runsc, copy the lower layer's root
  owner/group/mode to the upper layer's root (as in the VFS1 equivalent,
  boot.addOverlay()). This makes the overlay root owned by UID/GID 65534 with
  mode 0755 rather than owned by UID/GID 0 with mode 01777. Fixes
  CreateTest.CreateFailsOnUnpermittedDir, which assumes that the test cannot
  create files in /.

- MknodTest.UnimplementedTypesReturnError assumes that the creation of device
  special files is not supported. However, while the VFS2 gofer client still
  doesn't support device special files, VFS2 tmpfs does, and in the overlay
  test dimension mknod() targets a tmpfs upper layer. The test initially has
  all capabilities, including CAP_MKNOD, so its creation of these files
  succeeds. Constrain these tests to VFS1.

- Rename overlay.nonDirectoryFD to overlay.regularFileFD and only use it for
  regular files, using the original FD for pipes and device special files. This
  is more consistent with Linux (which gets the original inode_operations, and
  therefore file_operations, for these file types from ovl_fill_inode() =>
  init_special_inode()) and fixes remaining mknod and pipe tests.

- Read/write 1KB at a time in PipeTest.Streaming, rather than 4 bytes. This
  isn't strictly necessary, but it makes the test less obnoxiously slow on
  ptrace.

Fixes #4407

PiperOrigin-RevId: 337971042
2020-10-19 17:48:02 -07:00
Dean Deng 4ddb58f6ef Use POSIX interval timers in flock test.
ualarm(2) is obsolete. Move IntervalTimer into a test util, where it can be
used by flock tests.

These tests were flaky with TSAN, probably because it slowed the tests down
enough that the alarm was expiring before flock() was called. Use an interval
timer so that even if we miss the first alarm (or more), flock() is still
guaranteed to be interrupted.

PiperOrigin-RevId: 337578751
2020-10-16 14:32:49 -07:00
Andrei Vagin c002fc36f9 sockets: ignore io.EOF from view.ReadAt
Reported-by: syzbot+5466463b7604c2902875@syzkaller.appspotmail.com
PiperOrigin-RevId: 337451896
2020-10-15 23:15:48 -07:00
Bhasker Hariharan db36d948fa TCP Receive window advertisement fixes.
The fix in commit 028e045da9 was incorrect as
it can cause the right edge of the window to shrink when we announce
a zero window due to receive buffer being full as it's done before the check
for seeing if the window is being shrunk because of the selected window.

Further the window was calculated purely on available space but in cases where
we are getting full sized segments it makes more sense to use the actual bytes
being held. This CL changes to use the lower of the total available space vs
the available space in the maximal window we could advertise minus the actual
payload bytes being held.
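
One way to read the calculation above is sketched below. The function and parameter names are hypothetical, and clamping at zero is an assumption of this sketch rather than something the message states.

```go
package main

import "fmt"

// selectWindow returns the lower of the total available buffer space and the
// maximal advertisable window minus the payload bytes currently held.
// Negative results are clamped to zero (an assumption of this sketch).
func selectWindow(totalAvail, maxWindow, payloadHeld int) int {
	w := maxWindow - payloadHeld
	if totalAvail < w {
		w = totalAvail
	}
	if w < 0 {
		w = 0
	}
	return w
}

func main() {
	fmt.Println(selectWindow(4096, 65535, 1000)) // limited by available space
	fmt.Println(selectWindow(70000, 65535, 1000)) // limited by window minus held payload
}
```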

This change also cleans up the code so that the window selection logic is
not duplicated between getSendParams() and windowCrossedACKThresholdLocked.

PiperOrigin-RevId: 336404827
2020-10-09 19:02:03 -07:00
Andrei Vagin 33d6622172 test/syscall/iptables: don't use designated initializers
test/syscalls/linux/iptables.cc:130:3:
error: C99 designator 'name' outside aggregate initializer
  130 |   };
      |
PiperOrigin-RevId: 336331738
2020-10-09 11:30:52 -07:00
Jamie Liu 1336af78d5 Implement membarrier(2) commands other than *_SYNC_CORE.
Updates #267

PiperOrigin-RevId: 335713923
2020-10-06 13:55:16 -07:00
Dean Deng e0aaf40e39 Fix kcov enabling and disabling procedures.
- When the KCOV_ENABLE_TRACE ioctl is called with the trace kind KCOV_TRACE_PC,
  the kcov mode should be set to KCOV_*MODE*_TRACE_PC.
- When the owning task of kcov exits, the memory mapping should not be cleared
  so it can be used by other tasks.
- Add more tests (also tested on native Linux kcov).

PiperOrigin-RevId: 335202585
2020-10-03 09:26:25 -07:00
Kevin Krakauer 6f8d64f422 ip6tables: redirect support
Adds support for the IPv6-compatible redirect target. Redirection is a limited
form of DNAT, where the destination is always the localhost.

Updates #3549.

PiperOrigin-RevId: 334698344
2020-09-30 16:04:26 -07:00
Bin Lu 45221684f3 avoid the random memory barrier issue in mmap testing on Arm64
There is a new intermittent issue on some Arm64 machines.
The scenario can be summarized as follows:
Sometimes, the content at the func() pointer is still the 0 opcode.

The issue occurs very rarely and
has so far only been seen on some machines.

After inserting a simple memory barrier, the issue went away.

The code to directly use the memory barrier is as follows:
  memcpy(reinterpret_cast<void*>(addr), machine_code, sizeof(machine_code));
  isb();
  func = reinterpret_cast<uint32_t (*)(void)>(addr);

Signed-off-by: Bin Lu <bin.lu@arm.com>
2020-09-30 11:54:37 +08:00
Fabricio Voznika 4a428b13b2 Add /proc/[pid]/cwd
PiperOrigin-RevId: 334478850
2020-09-29 15:49:27 -07:00
Kevin Krakauer 7fbb45e8ed iptables: refactor to make targets extendable
Like matchers, targets should use a module-like register/lookup system. This
replaces the brittle switch statements we had before.

The only behavior change is supporting IPT_GET_REVISION_TARGET. This makes it
much easier to add IPv6 redirect in the next change.

Updates #3549.

PiperOrigin-RevId: 334469418
2020-09-29 15:02:25 -07:00
gVisor bot b6fb11a290 Migrates uses of deprecated map types to recommended types.
PiperOrigin-RevId: 334419854
2020-09-29 11:13:03 -07:00
Nayana Bidari 237b761f9a Fix lingering of TCP socket in the initial state.
When the socket is set with SO_LINGER and close()'d in the initial state, it
should not linger and should return immediately.

PiperOrigin-RevId: 334263149
2020-09-28 16:39:12 -07:00
Dean Deng a0e0ba690f Support inotify in overlayfs.
Fixes #1479, #317.

PiperOrigin-RevId: 334258052
2020-09-28 16:11:16 -07:00
Dean Deng 2a60f92291 Clean up kcov.
Previously, we did not check the kcov mode when performing task work. As a
result, disabling kcov did not do anything.

Also avoid expensive atomic RMW when consuming coverage data. We don't need the
swap if the value is already zero (which is most of the time), and it is ok if
there are slight inconsistencies due to a race between coverage data generation
(incrementing the value) and consumption (reading a nonzero value and writing
zero).

PiperOrigin-RevId: 334049207
2020-09-27 15:33:51 -07:00
Andrei Vagin 0a232a5e8c test/syscall/mknod: Don't use a hard-coded file name
PiperOrigin-RevId: 333461380
2020-09-24 00:48:35 -07:00