unshare actually takes a subset of clone flags, but has no unique flags,
so formatting as clone flags is close enough.
PiperOrigin-RevId: 225082774
Change-Id: I5b580f18607c7785f323e37809094115520a17c0
MSG_WAITALL requests that recv family calls do not perform short reads. It only
has an effect for SOCK_STREAM sockets, other types ignore it.
PiperOrigin-RevId: 224918540
Change-Id: Id97fbf972f1f7cbd4e08eec0138f8cbdf1c94fe7
arch_prctl already verified that the new FS_BASE was canonical, but
Task.Clone did not. Centralize these checks in the arch packages.
Failure to validate could cause an error in PTRACE_SET_REGS when we try
to switch to the app.
PiperOrigin-RevId: 224862398
Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
Currently sending a broadcast packet (for DHCP, e.g.) requires a "default
route" of the format "0.0.0.0/0 via 0.0.0.0 <intf>". There is no good reason
for this and on devices with several ports this creates a rather akward route
table with lots of such default routes (which defeats the purpose of a default
route).
PiperOrigin-RevId: 224378769
Change-Id: Icd7ec8a206eb08083cff9a837f6f9ab231c73a19
Unlike FlagSet, order doesn't matter here, so it can simply be a map.
PiperOrigin-RevId: 224377910
Change-Id: I15810c698a7f02d8614bf09b59583ab73cba0514
By Walking before checking that the directory is writable and
executable, MayDelete may return the Walk error (e.g., ENOENT) which
would normally be masked by a permission error (EACCES).
PiperOrigin-RevId: 224222453
Change-Id: I108a7f730e6bdaa7f277eaddb776267c00805475
If sys_prctl is called with PR_SET_MM without CAP_SYS_RESOURCE,
the syscall should return failure with errno set to EPERM.
See: http://man7.org/linux/man-pages/man2/prctl.2.html
PiperOrigin-RevId: 224182874
Change-Id: I630d1dd44af8b444dd16e8e58a0764a0cf1ad9a3
This removes code that should have never made it in in the first place, but did so due to incomplete testing. With the new tests the original code fails, the new code passes.
PiperOrigin-RevId: 224086966
Change-Id: I646fef76977f4528f3705f497b95fad6b3ec32bc
FileOperations.Write should return ErrWouldBlock to allow the upper
layer to loop and sendmsg should continue writing where it left off
on a partial write.
PiperOrigin-RevId: 224081631
Change-Id: Ic61f6943ea6b7abbd82e4279decea215347eac48
The signature of time.now has remained unchanged:
c2412a7681/src/time/time.go (L1072)
PiperOrigin-RevId: 224061160
Change-Id: Ic84bd6ee8fb9952cd9ab580bcb0892444ce7c2da
The number of symbolic links that are allowed to be followed
are for a full path and not just a chain of symbolic links.
PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
Bazel runs multiple test cases on the same thread. Some of the test
cases rely on the test thread starting with the default memory policy,
while other tests modify the test thread's memory policy. This
obviously breaks when the test framework doesn't run each test case on
a new thread.
Also fixing an incompatibility where set_mempolicy(2) was prevented
from specifying an empty nodemask, which is allowed for some modes.
PiperOrigin-RevId: 224038957
Change-Id: Ibf780766f2706ebc9b129dbc8cf1b85c2a275074
Untyped integer constants default to type int and the binary package will panic
if one tries to encode an int.
PiperOrigin-RevId: 223890001
Change-Id: Iccc3afd6d74bad24c35d764508e450fd317b76ec
Replaces the WaitGroup with a RWMutex. Calls to Async hold the mutex for
reading, while AsyncBarrier takes the lock for writing. This ensures that all
executing Async work finishes before AsyncBarrier returns.
Also pushes the Async() call from Inode.Release into
gofer/InodeOperations.Release(). This removes a recursive Async call which
should not have been allowed in the first place. The gofer Release call is the
slow one (since it may make RPCs to the gofer), so putting the Async call there
makes sense.
PiperOrigin-RevId: 223093067
Change-Id: I116da7b20fce5ebab8d99c2ab0f27db7c89d890e
With rpcinet if shutdown flags are not saved before making
the rpc a race is possible where blocked threads are woken
up before the flags have been persisted. This would mean
that threads can block indefinitely in a recvmsg after a
shutdown(SHUT_RD) has happened.
PiperOrigin-RevId: 223089783
Change-Id: If595e7add12aece54bcdf668ab64c570910d061a
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.
For older kernel, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).
PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock
in R mode and could deadlock if a writer comes in between.
PiperOrigin-RevId: 222313551
Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
Also update test utilities for probing vsyscall support and add a
metric to see if vsyscalls are actually used in sandboxes.
PiperOrigin-RevId: 221698834
Change-Id: I57870ecc33ea8c864bd7437833f21aa1e8117477
...to (remote, local), reflecting the (correct) names in the implementation of
DeliverNetworkPacket (see tcpip/stack/nic.go).
Also trim the names in DeliverNetworkPacket and elsewhere to avoid stuttering;
since the type is tcpip.LinkAddress, there's no need to include "LinkAddr" in
the parameter names.
Note that every callsite passes arguments in the order (src, dst).
PiperOrigin-RevId: 221514396
Change-Id: I3637454ad0d6e62a19e4dcbc2a16493798bd0f09
Previously, TCP_NODELAY was always enabled and we would lie about it being
configurable. TCP_NODELAY is now disabled by default (to match Linux) in the
socket layer so that non-gVisor users don't automatically start using this
questionable optimization.
PiperOrigin-RevId: 221368472
Change-Id: Ib0240f66d94455081f4e0ca94f09d9338b2c1356
sync_file_range - sync a file segment with disk
In Linux, sync_file_range() accepts three flags:
SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that
have already been submitted to the device driver for write-out
before performing any write.
SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range
which are not presently submitted write-out. Note that even
this may block if you attempt to write more than request queue
size.
SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing
any write.
In this implementation:
SYNC_FILE_RANGE_WAIT_BEFORE without SYNC_FILE_RANGE_WAIT_AFTER isn't
supported right now.
SYNC_FILE_RANGE_WRITE is skipped. It should initiate write-out of all
dirty pages, but it doesn't wait, so it should be safe to do nothing
while nobody uses SYNC_FILE_RANGE_WAIT_BEFORE.
SYNC_FILE_RANGE_WAIT_AFTER is equal to fdatasync(). In Linux,
sync_file_range() doesn't writes out the file's meta-data, but
fdatasync() does if a file size is changed.
PiperOrigin-RevId: 220730840
Change-Id: Iae5dfb23c2c916967d67cf1a1ad32f25eb3f6286
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise
a kernel version of 4.4.
PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
Increase timeout to prevent the entry from being
found when there is delay on the address resolution
goroutine that doesn't mark the request as failed.
PiperOrigin-RevId: 220504789
Change-Id: I7e44fd95d8624bd69962f862fbf5517a81395f2a
These files were added with the wrong name after all of the existing files
were corrected.
PiperOrigin-RevId: 220202068
Change-Id: Ia0d15233c1aa69330356a7cf16b5aa00d978e09c
Fluentd configuration uses 'log' for the log message
while containerd uses 'msg'. Since we can't have a single
JSON format for both, add another log format and make
debug log configurable.
PiperOrigin-RevId: 219729658
Change-Id: I2a6afc4034d893ab90bafc63b394c4fb62b2a7a0
Updated error messages so that it doesn't print full Go struct representations
when running a new container in a sandbox. For example, this occurs frequently
when commands are not found when doing a 'kubectl exec'.
PiperOrigin-RevId: 219729141
Change-Id: Ic3a7bc84cd7b2167f495d48a1da241d621d3ca09
Shm segments can be marked for lazy destruction via shmctl(IPC_RMID),
which destroys a segment once it is no longer attached to any
processes. We were unconditionally decrementing the segment refcount
on shmctl(IPC_RMID) which allowed a user to force a segment to be
destroyed by repeatedly calling shmctl(IPC_RMID), with outstanding
memory maps to the segment.
This is problematic because the memory released by a segment destroyed
this way can be reused by a different process while remaining
accessible by the process with outstanding maps to the segment.
PiperOrigin-RevId: 219713660
Change-Id: I443ab838322b4fb418ed87b2722c3413ead21845
https://github.com/containerd/containerd/blob/master/oci/spec.go#L206, the mode=755
didn't match the pattern modeRegexp = regexp.MustCompile("0[0-7][0-7][0-7]").
Closes#112
Signed-off-by: Juan <xionghuan.cn@gmail.com>
Change-Id: I469e0a68160a1278e34c9e1dbe4b7784c6f97e5a
PiperOrigin-RevId: 219672525
This reduces the number of floating point save/restore cycles required (since
we don't need to restore immediately following the switch, this always happens
in a known context) and allows the kernel hooks to capture state. This lets us
remove calls like "Current()".
PiperOrigin-RevId: 219552844
Change-Id: I7676fa2f6c18b9919718458aa888b832a7db8cab
This field was added in the intial implementation, before Route existed
to pass the local and remote addresses to the packet-writing path.
Today, the Route's members should be respected. A similar bug was
previously fixed in 214650822.
PiperOrigin-RevId: 219474095
Change-Id: Id2a8ee4421d2841c8d88ccb3c193c455086350ee
Use private futexes for performance and to align with other runtime uses.
PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
Actually parse flags from cpuinfo to avoid mistakenly matching
substrings in cpuinfo that happen to match a flags.
Some features were only exposed in recent versions of Linux. Don't
require them to appear in cpuinfo on old versions of Linux.
Move PREFETCHWT1 back to parse only features. It isn't actually exposed
in Linux yet. Move SDBG to shown features. It has been visible since
Linux 4.3.
PiperOrigin-RevId: 219381731
Change-Id: Ied7c0ee7c8a9879683e81933de56c9074b01108f
Extend the cpuid package to parse and emulate cpuid features that exist
only on AMD and not Intel. The least straightforward part of this is
that AMD duplicates several block 1 features in block 6. Thus we ignore
those features when parsing block 6 and add them when emulating.
PiperOrigin-RevId: 218935032
Change-Id: Id41bf1c24720b0d9b968e2c19ab5bc00a6d62bd4
Linux added these block 3 features to the end of /proc/cpuinfo in
dfb4a70f20c5b3880da56ee4c9484bdb4e8f1e65.
This also fixes that block 3 features were completely missing from
FeatureSet.FlagsString(false) because FlagsString only prints Linux
blocks regardless of the cpuinfo option.
PiperOrigin-RevId: 218913816
Change-Id: I2f9c38c7c9da4b247a140877d4aca782e80684bd
Previously this code used the tcpip error space. Since it is no longer part of
netstack, it can use the sentry's error space (except for a few cases where
there is still some shared code. This reduces the number of error space
conversions required for hot Unix socket operations.
PiperOrigin-RevId: 218541611
Change-Id: I3d13047006a8245b5dfda73364d37b8a453784bb
Pseudoterminal job control signals are meant to be received and handled by the
sandbox process, but if the ptrace stubs are running in the same process group,
they will receive the signals as well and inject then into the sentry kernel.
This can result in duplicate signals being delivered (often to the wrong
process), or a sentry panic if the ptrace stub is inactive.
This CL makes the ptrace stub run in a new session.
PiperOrigin-RevId: 218536851
Change-Id: Ie593c5687439bbfbf690ada3b2197ea71ed60a0e
Attempting to create a zero-len shm segment causes a panic since we
try to allocate a zero-len filemem region. The existing code had a
guard to disallow this, but the check didn't encode the fact that
requesting a private segment implies a segment creation regardless of
whether IPC_CREAT is explicitly specified.
PiperOrigin-RevId: 218405743
Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
The channels {cancel,resCh} have roughly the same lifetime and are used for
roughly the same purpose as an entry's waiters; we can unify the state
management of the two mechanisms, while also reducing unncessary mutex locking
and unlocking.
Made some cosmetic changes while I'm here.
PiperOrigin-RevId: 218343915
Change-Id: Ic69546a2b7b390162b2231f07f335dd6199472d7
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.
PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
This allows us to release messages in the queue when all users close.
PiperOrigin-RevId: 218033550
Change-Id: I2f6e87650fced87a3977e3b74c64775c7b885c1b
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.
PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
This reduces the number of goroutines and runtime timers when
ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set.
This also ensures that thread group CPU timers only advance if running
tasks are observed at the time the CPU clock advances, mostly
eliminating the possibility that a CPU timer expiration observes no
running tasks and falls back to the group leader.
PiperOrigin-RevId: 217603396
Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
This queue only has a single user, so there is no need for it to use an
interface. Merging it into the same package as its sole user allows us to avoid
a circular dependency.
This simplifies the code and should slightly improve performance.
PiperOrigin-RevId: 217595889
Change-Id: Iabbd5164240b935f79933618c61581bc8dcd2822
Now containers run with "docker run -it" support control characters like ^C and
^Z.
This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.
PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
The existing logic is backwards and writes iov_len == 0 for a full write.
PiperOrigin-RevId: 217560377
Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
* Integrate recvMsg and sendMsg functions into Recv and Send respectively as
they are no longer shared.
* Clean up partial read/write error handling code.
* Re-order code to make sense given that there is no longer a host.endpoint
type.
PiperOrigin-RevId: 217255072
Change-Id: Ib43fe9286452f813b8309d969be11f5fa40694cd
host.endpoint contained duplicated logic from the sockerpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.
PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
- Change Dirent.Busy => Dirent.isMountPoint. The function body is unchanged,
and it is no longer exported.
- fs.MayDelete now checks that the victim is not the process root. This aligns
with Linux's namei.c:may_delete().
- Fix "is-ancestor" checks to actually compare all ancestors, not just the
parents.
- Fix handling of paths that end in dots, which are handled differently in
Rename vs. Unlink.
PiperOrigin-RevId: 217239274
Change-Id: I7a0eb768e70a1b2915017ce54f7f95cbf8edf1fb
This should reduce use-after-free errors and accidental close via create or
remove. This change includes one functional fix as well: when closing via
remove, the closed field was not set and the finalizer was not freed, so the
file would have been clunked at some random point in the future.
PiperOrigin-RevId: 216750000
Change-Id: Ice3292c6feb953ae97abac308afbafd2d9410402
This is a defense-in-depth measure. If the sentry is compromised, this prevents
system call injection to the stubs. There is some complexity with respect to
ptrace and seccomp interactions, so this protection is not really available
for kernel versions < 4.8; this is detected dynamically.
Note that this also solves the vsyscall emulation issue by adding in
appropriate trapping for those system calls. It does mean that a compromised
sentry could theoretically inject these into the stub (ignoring the trap and
resume, thereby allowing execution), but they are harmless.
PiperOrigin-RevId: 216647581
Change-Id: Id06c232cbac1f9489b1803ec97f83097fcba8eb8
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.
When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.
To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.
PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
Also properly add padding after Procs in the linux.Sysinfo
structure. This will be implicitly padded to 64bits so we
need to do the same.
PiperOrigin-RevId: 216372907
Change-Id: I6eb6a27800da61d8f7b7b6e87bf0391a48fdb475
We accidentally set the wrong maximum. I've also added PATH_MAX and
NAME_MAX to the linux abi package.
PiperOrigin-RevId: 216221311
Change-Id: I44805fcf21508831809692184a0eba4cee469633
- Shared futex objects on shared mappings are represented by Mappable +
offset, analogous to Linux's use of inode + offset. Add type
futex.Key, and change the futex.Manager bucket API to use futex.Keys
instead of addresses.
- Extend the futex.Checker interface to be able to return Keys for
memory mappings. It returns Keys rather than just mappings because
whether the address or the target of the mapping is used in the Key
depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this
matters because using mapping target for a futex on a MAP_PRIVATE
mapping causes it to stop working across COW-breaking.
- futex.Manager.WaitComplete depends on atomic updates to
futex.Waiter.addr to determine when it has locked the right bucket,
which is much less straightforward for struct futex.Waiter.key. Switch
to an atomically-accessed futex.Waiter.bucket pointer.
- futex.Manager.Wake now needs to take a futex.Checker to resolve
addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit
path to perform a shared futex wakeup (Linux:
kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid,
FUTEX_WAKE, ...)). This is a problem because futexChecker is in the
syscalls/linux package. Move it to kernel.
PiperOrigin-RevId: 216207039
Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.
However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.
This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.
One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.
To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.
Example:
root@:/usr/local/apache2# sleep 100
^C
root@:/usr/local/apache2# sleep 100
^Z
[1]+ Stopped sleep 100
root@:/usr/local/apache2# fg
sleep 100
^C
root@:/usr/local/apache2#
PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
There was a race where we checked task.Parent() != nil, and then later called
task.Parent() again, assuming that it is not nil. If the task is exiting, the
parent may have been set to nil in between the two calls, causing a panic.
This CL changes the code to only call task.Parent() once.
PiperOrigin-RevId: 215274456
Change-Id: Ib5a537312c917773265ec72016014f7bc59a5f59
host.endpoint already has the check, but it is missing from
host.ConnectedEndpoint.
PiperOrigin-RevId: 214962762
Change-Id: I88bb13a5c5871775e4e7bf2608433df8a3d348e6
Previously, if address resolution for UDP or Ping sockets required sending
packets using Write in Transport layer, Resolve would return ErrWouldBlock
and Write would return ErrNoLinkAddress. Meanwhile startAddressResolution
would run in background. Further calls to Write using same address would also
return ErrNoLinkAddress until resolution has been completed successfully.
Since Write is not allowed to block and System Calls need to be
interruptible in System Call layer, the caller to Write is responsible for
blocking upon return of ErrWouldBlock.
Now, when startAddressResolution is called a notification channel for
the completion of the address resolution is returned.
The channel will traverse up to the calling function of Write as well as
ErrNoLinkAddress. Once address resolution is complete (success or not) the
channel is closed. The caller would call Write again to send packets and
check if address resolution was compeleted successfully or not.
Fixesgoogle/gvisor#5
Change-Id: Idafaf31982bee1915ca084da39ae7bd468cebd93
PiperOrigin-RevId: 214962200
We already forward TCSETS and TCSETSW. TCSETSF is roughly equivalent but
discards pending input.
The filters were relaxed to allow host ioctls with TCSETSF argument.
This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.
Before:
root@b8a0240fc836:/# passwd
Enter new UNIX password: 123
Retype new UNIX password: 123
passwd: password updated successfully
After:
root@ae6f5dabe402:/# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.
PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
The //go:linkname directive requires the presence of
assembly files in the package. Even an empty file will do.
There was an empty assembly file commit_arm64.s, but
that is limited to GOARCH=arm64. Renaming to empty.s will
remove the unnecessary build constraint and allow building
netstack for other architectures than amd64 and arm64.
Without this, building directly with go (not bazel)
for e.g., GOARCH=arm gives:
sleep/sleep_unsafe.go:88:6: missing function body
sleep/sleep_unsafe.go:91:6: missing function body
Change-Id: I29d1d13e1ff31506a174d4595b8cd57fa58bf52b
PiperOrigin-RevId: 214820299
Old code was returning ID of the thread that created
the child process. It should be returning the ID of
the parent process instead.
PiperOrigin-RevId: 214720910
Change-Id: I95715c535bcf468ecf1ae771cccd04a4cd345b36
There is a subtle bug that is the result of two changes made when upstreaming
ICMPv6 support from Fuchsia:
1) ipv6.endpoint.WritePacket writes the local address it was initialized with,
rather than the provided route's local address
2) ipv6.endpoint.handleICMP doesn't set its route's local address to the ICMP
target address before writing the response
The result is that the ICMP response erroneously uses the target ipv6 address
(rather than icmp) as its source address in the response. When trying to debug
this by fixing (2), we ran into problems with bad ipv6 checksums because (1)
didn't respect the local address of the route being passed to it.
This fixes both problems.
PiperOrigin-RevId: 214650822
Change-Id: Ib6148bf432e6428d760ef9da35faef8e4b610d69
...by increasing the allotted timeout and using direct comparison rather than
reflect.DeepEqual (which should be faster).
PiperOrigin-RevId: 214027024
Change-Id: I0a2690e65c7e14b4cc118c7312dbbf5267dc78bc
This allows a NetworkDispatcher to implement transparent bridging,
assuming all implementations of LinkEndpoint.WritePacket call eth.Encode
with header.EthernetFields.SrcAddr set to the passed
Route.LocalLinkAddress, if it is provided.
PiperOrigin-RevId: 213686651
Change-Id: I446a4ac070970202f0724ef796ff1056ae4dd72a
From RFC7323#Section-4
The [RFC6298] RTT estimator has weighting factors, alpha and beta, based on an
implicit assumption that at most one RTTM will be sampled per RTT. When
multiple RTTMs per RTT are available to update the RTT estimator, an
implementation SHOULD try to adhere to the spirit of the history specified in
[RFC6298]. An implementation suggestion is detailed in Appendix G.
From RFC7323#appendix-G
Appendix G. RTO Calculation Modification
Taking multiple RTT samples per window would shorten the history calculated
by the RTO mechanism in [RFC6298], and the below algorithm aims to maintain a
similar history as originally intended by [RFC6298].
It is roughly known how many samples a congestion window worth of data will
yield, not accounting for ACK compression, and ACK losses. Such events will
result in more history of the path being reflected in the final value for
RTO, and are uncritical. This modification will ensure that a similar amount
of time is taken into account for the RTO estimation, regardless of how many
samples are taken per window:
ExpectedSamples = ceiling(FlightSize / (SMSS * 2))
alpha' = alpha / ExpectedSamples
beta' = beta / ExpectedSamples
Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs".
Instead of using alpha and beta in the algorithm of [RFC6298], use alpha' and
beta' instead:
RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|
SRTT <- (1 - alpha') * SRTT + alpha' * R'
(for each sample R')
PiperOrigin-RevId: 213644795
Change-Id: I52278b703540408938a8edb8c38be97b37f4a10e
If we have an overlay file whose corresponding Dirent is frozen, then we should
not bother calling Readdir on the upper or lower files, since DirentReaddir
will calculate children based on the frozen Dirent tree.
A test was added that fails without this change.
PiperOrigin-RevId: 213531215
Change-Id: I4d6c98f1416541a476a34418f664ba58f936a81d
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
code.
- Processes not waited on will consume memory (like a zombie process)
PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
This was previously broken in 212917409, resulting in "missing function body"
compilation errors.
PiperOrigin-RevId: 213323695
Change-Id: I32a95b76a1c73fd731f223062ec022318b979bd4
runApp.execute -> Task.SendSignal -> sendSignalLocked -> sendSignalTimerLocked
-> pendingSignals.enqueue assumes that it owns the arch.SignalInfo returned
from platform.Context.Switch.
On the other hand, ptrace.context.Switch assumes that it owns the returned
SignalInfo and can safely reuse it on the next call to Switch. The KVM platform
always returns a unique SignalInfo.
This becomes a problem when the returned signal is not immediately delivered,
allowing a future signal in Switch to change the previous pending SignalInfo.
This is noticeable in #38 when external SIGINTs are delivered from the PTY
slave FD. Note that the ptrace stubs are in the same process group as the
sentry, so they are eligible to receive the PTY signals. This should probably
change, but is not the only possible cause of this bug.
Updates #38
Original change by newmanwang <wcs1011@gmail.com>, updated by Michael Pratt
<mpratt@google.com>.
Change-Id: I5383840272309df70a29f67b25e8221f933622cd
PiperOrigin-RevId: 213071072
Linux permits hard-linking if the target is owned by the user OR the target has
Read+Write permission.
PiperOrigin-RevId: 213024613
Change-Id: If642066317b568b99084edd33ee4e8822ec9cbb3
The old kernel version, such as 4.4, only support 255 vcpus.
While gvisor is ran on these kernels, it could panic because the
vcpu id and vcpu number beyond max_vcpus.
Use ioctl(vmfd, _KVM_CHECK_EXTENSION, _KVM_CAP_MAX_VCPUS) to get max
vcpus number dynamically.
Change-Id: I50dd859a11b1c2cea854a8e27d4bf11a411aa45c
PiperOrigin-RevId: 212929704
Netstack needs to be portable, so this seems to be preferable to using raw
system calls.
PiperOrigin-RevId: 212917409
Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c
The contract in ExecArgs says that a reference on ExecArgs.Root must be held
for the lifetime of the struct, but the caller is free to drop the ref after
that.
As a result, proc.Exec must take an additional ref on Root when it constructs
the CreateProcessArgs, since that holds a pointer to Root as well. That ref is
dropped in CreateProcess.
PiperOrigin-RevId: 212828348
Change-Id: I7f44a612f337ff51a02b873b8a845d3119408707
This is different from the existing -pid-file flag, which saves a host pid.
PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New). This requires that we have RW access to
the platform device when constructing the platform.
However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.
This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.
PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.
Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute. So we have to
do the path lookup much later than we previously were.
PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
This allows applications to verify they are running with gVisor. It
also helps debugging when running with a mix of container runtimes.
Closes#54
PiperOrigin-RevId: 212059457
Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec
Before destroying the Kernel, we disable signal forwarding,
relinquishing control to the Go runtime. External signals that arrive
after disabling forwarding but before the sandbox exits thus may use
runtime.raise (i.e., tkill(2)) and violate the syscall filters.
Adjust forwardSignals to handle signals received after disabling
forwarding the same way they are handled before starting forwarding.
i.e., by implementing the standard Go runtime behavior using tgkill(2)
instead of tkill(2).
This also makes the stop callback block until forwarding actually stops.
This isn't required to avoid tkill(2) but is a saner interface.
PiperOrigin-RevId: 211995946
Change-Id: I3585841644409260eec23435cf65681ad41f5f03
It was always returning the MountNamespace root, which may be different from
the process Root if the process is in a chroot environment.
PiperOrigin-RevId: 211862181
Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480
Now that it's possible to remove subnets, we must iterate over them with locks
held.
Also do the removal more efficiently while I'm here.
PiperOrigin-RevId: 211737416
Change-Id: I29025ec8b0c3ad11f22d4447e8ad473f1c785463
Makes it possible to avoid copying or allocating in cases where DeliverNetworkPacket (rx)
needs to turn around and call WritePacket (tx) with its VectorisedView.
Also removes the restriction on having VectorisedViews with multiple views in the write path.
PiperOrigin-RevId: 211728717
Change-Id: Ie03a65ecb4e28bd15ebdb9c69f05eced18fdfcff
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.
This is a requisite change for supporting a per-Inode list of Dirents.
PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
Task.creds can only be changed by the task's own set*id and execve
syscalls, and Task namespaces can only be changed by the task's own
unshare/setns syscalls.
PiperOrigin-RevId: 211156279
Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
From //pkg/sentry/context/context.go:
// - It is *not safe* to retain a Context passed to a function beyond the scope
// of that function call.
Passing a stored kernel.Task as a context.Context to
fs.FileOwnerFromContext violates this requirement.
PiperOrigin-RevId: 211143021
Change-Id: I4c5b02bd941407be4c9cfdbcbdfe5a26acaec037
This allows us to call kernel.FDMap.DecRef without holding mutexes
cleanly.
PiperOrigin-RevId: 211139657
Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
This CL does NDP link-address discovery for IPv6.
It includes several small changes necessary to get linux to talk to
this implementation. In particular, a hop limit of 255 is necessary
for ICMPv6.
PiperOrigin-RevId: 211103930
Change-Id: If25370ab84c6b1decfb15de917f3b0020f2c4e0e