Weak references save / restore involves multiple interface indirection
and cause material latency overhead when there are lots of dirents, each
containing a weak reference map. The nil entries in the map should also
be purged.
PiperOrigin-RevId: 210593727
Change-Id: Ied6f4c3c0726fcc53a24b983d9b3a79121b6b758
This is to troubleshoot problems with a hung process that is
not responding to 'runsc debug --stack' command.
PiperOrigin-RevId: 210483513
Change-Id: I4377b210b4e51bc8a281ad34fd94f3df13d9187d
When revalidating a Dirent, if the inode id is the same, then we don't need to
throw away the entire Dirent. We can just update the unstable attributes in
place.
If the inode id has changed, then the remote file has been deleted or moved,
and we have no choice but to throw away the dirent we have a look up another.
In this case, we may still end up losing a mounted dirent that is a child of
the revalidated dirent. However, that seems appropriate here because the entire
mount point has been pulled out from underneath us.
Because gVisor's overlay is at the Inode level rather than the Dirent level, we
must pass the parent Inode and name along with the Inode that is being
revalidated.
PiperOrigin-RevId: 210431270
Change-Id: I705caef9c68900234972d5aac4ae3a78c61c7d42
Implements the TIOCGWINSZ and TIOCSWINSZ ioctls, which allow processes to resize
the terminal. This allows, for example, sshd to properly set the window size for
ssh sessions.
PiperOrigin-RevId: 210392504
Change-Id: I0d4789154d6d22f02509b31d71392e13ee4a50ba
This CL adds terminal support for "docker exec". We previously only supported
consoles for the container process, but not exec processes.
The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.
Note that control-character signals are still not properly supported.
Tested with:
$ docker run --runtime=runsc -it alpine
In another terminial:
$ docker exec -it <containerid> /bin/sh
PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
Compared to previous compressio / hashio nesting, there is up to 100% speedup.
PiperOrigin-RevId: 210161269
Change-Id: I481aa9fe980bb817fe465fe34d32ea33fc8abf1c
As required by the contract in Dirent.flush().
Also inline Dirent.freeze() into Dirent.Freeze(), since it is only called from
there.
PiperOrigin-RevId: 209783626
Change-Id: Ie6de4533d93dd299ffa01dabfa257c9cc259b1f4
When an inode file state failed to load asynchronuously, we want to report
the error instead of potentially panicing in another async loading goroutine
incorrectly unblocked.
PiperOrigin-RevId: 209683977
Change-Id: I591cde97710bbe3cdc53717ee58f1d28bbda9261
A new optimization in Go 1.11 improves the efficiency of slice extension:
"The compiler now optimizes slice extension of the form append(s, make([]T, n)...)."
https://tip.golang.org/doc/go1.11#performance-compiler
Before:
BenchmarkMarshalUnmarshal-12 2000000 664 ns/op 0 B/op 0 allocs/op
BenchmarkReadWrite-12 500000 2395 ns/op 304 B/op 24 allocs/op
After:
BenchmarkMarshalUnmarshal-12 2000000 628 ns/op 0 B/op 0 allocs/op
BenchmarkReadWrite-12 500000 2411 ns/op 304 B/op 24 allocs/op
BenchmarkMarshalUnmarshal benchmarks the code in this package, BenchmarkReadWrite benchmarks the code in the standard library.
PiperOrigin-RevId: 209679979
Change-Id: I51c6302e53f60bf79f84576b1ead4d36658897cb
The previous use of non-blocking writes could result in corrupt PCAP files if a
partial write occurs. Using (*os.File).Write solves this problem by not
allowing partial writes. This change does not increase allocations (in one path
it actually reduces them), but does add additional copying.
PiperOrigin-RevId: 209652974
Change-Id: I4b1cf2eda4cfd7f237a4245aceb7391b3055a66c
Numpy needs these.
Also added the "present" directory, since the contents are the same as possible
and online.
PiperOrigin-RevId: 209451777
Change-Id: I2048de3f57bf1c57e9b5421d607ca89c2a173684
Some linux commands depend on /sys/devices/system/cpu/possible, such
as 'lscpu'.
Add 2 knobs for cpu:
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/online
Both the values are '0 - Kernel.ApplicationCores()-1'.
Change-Id: Iabd8a4e559cbb630ed249686b92c22b4e7120663
PiperOrigin-RevId: 209070163
When multiple containers run inside a sentry, each container has its own root
filesystem and set of mounts. Containers are also added after sentry boot rather
than all configured and known at boot time.
The fsgofer needs to be able to serve the root filesystem of each container.
Thus, it must be possible to add filesystems after the fsgofer has already
started.
This change:
* Creates a URPC endpoint within the gofer process that listens for requests to
serve new content.
* Enables the sentry, when starting a new container, to add the new container's
filesystem.
* Mounts those new filesystems at separate roots within the sentry.
PiperOrigin-RevId: 208903248
Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively. While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.
This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.
This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.
A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.
All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely served by the
root fs.
PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
Now, there's a waiter for each end (master and slave) of the TTY, and each
waiter.Entry is only enqueued in one of the waiters.
PiperOrigin-RevId: 208734483
Change-Id: I06996148f123075f8dd48cde5a553e2be74c6dce
stat()-ing /proc/PID/fd/FD incremented but didn't decrement the refcount for
FD. This behavior wasn't usually noticeable, but in the above case:
- ls would never decrement the refcount of the write end of the pipe to 0.
- This caused the write end of the pipe never to close.
- wc would then hang read()-ing from the pipe.
PiperOrigin-RevId: 208728817
Change-Id: I4fca1ba5ca24e4108915a1d30b41dc63da40604d
InodeOperations.Bind now returns a Dirent which will be cached in the Dirent
tree.
When an overlay is in-use, Bind cannot return the Dirent created by the upper
filesystem because the Dirent does not know about the overlay. Instead,
overlayBind must create a new overlay-aware Inode and Dirent and return that.
This is analagous to how Lookup and overlayLookup work.
PiperOrigin-RevId: 208670710
Change-Id: I6390affbcf94c38656b4b458e248739b4853da29
Previously, an overlay would panic if either the upper or lower fs required
revalidation for a given Dirent. Now, we allow revalidation from the upper
file, but not the lower.
If a cached overlay inode does need revalidation (because the upper needs
revalidation), then the entire overlay Inode will be discarded and a new
overlay Inode will be built with a fresh copy of the upper file.
As a side effect of this change, Revalidate must take an Inode instead of a
Dirent, since an overlay needs to revalidate individual Inodes.
PiperOrigin-RevId: 208293638
Change-Id: Ic8f8d1ffdc09114721745661a09522b54420c5f1
Currently the implementation matches the behavior of moving data
between two file descriptors. However, it does not implement this
through zero-copy movement. Thus, this code is a starting point
to build the more complex implementation.
PiperOrigin-RevId: 208284483
Change-Id: Ibde79520a3d50bc26aead7ad4f128d2be31db14e
The cache policy determines whether Lookup should return a negative dirent, or
just ENOENT. This CL fixes one spot where we returned a negative dirent without
first consulting the policy.
PiperOrigin-RevId: 208280230
Change-Id: I8f963bbdb45a95a74ad0ecc1eef47eff2092d3a4
Previously, processes which used file-system Unix Domain Sockets could not be
checkpoint-ed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.
PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
The message size check is legitimate: the size must be negotiated, which
relies on the fixed message limit up front. Sending a message larger than
that indicates that the connection is out of sync and is considered a
socket error (disconnect).
Similarly, sending a size that is too small indicates that the stream is
out-of-sync or invalid.
PiperOrigin-RevId: 207996551
Change-Id: Icd8b513d5307e9d5953dbb957ee70ceea111098d
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080
PiperOrigin-RevId: 207995083
Change-Id: I17adc2a04df48bfea711011a5df206326a1fb8ef
Because the Drop method may be called across vCPUs, it is necessary to protect
the PCID database with a mutex to prevent concurrent modification. The PCID is
assigned prior to entersyscall, so it's safe to block.
PiperOrigin-RevId: 207992864
Change-Id: I8b36d55106981f51e30dcf03e12886330bb79d67
Data race is:
Read:
(*connectionlessEndpoint).UnidirectionalConnect:
writeQueue: e.receiver.(*queueReceiver).readQueue,
Write:
(*connectionlessEndpoint).Close:
e.receiver = nil
The problem is that (*connectionlessEndpoint).UnidirectionalConnect assumed
that baseEndpoint.receiver is immutable which is explicitly not the case.
Fixing this required two changes:
1. Add synchronization around access of baseEndpoint.receiver in
(*connectionlessEndpoint).UnidirectionalConnect.
2. Check for baseEndpoint.receiver being nil in
(*connectionlessEndpoint).UnidirectionalConnect.
PiperOrigin-RevId: 207984402
Change-Id: Icddeeb43805e777fa3ef874329fa704891d14181
SACK is disabled by default and needs to be manually enabled. It not only
improves performance, but also fixes hangs downloading files from certain
websites.
PiperOrigin-RevId: 207906742
Change-Id: I4fb7277b67bfdf83ac8195f1b9c38265a0d51e8b
This CL adds a new cache-policy for gofer filesystems that uses the host page
cache, but causes dirents to be reloaded on each Walk, and does not cache
readdir results.
This policy is useful when the remote filesystem may change out from underneath
us, as any remote changes will be reflected on the next Walk.
Importantly, this cache policy is only consistent if we do not use gVisor's
internal page cache, since that page cache is tied to the Inode and may be
thrown away upon Revalidation.
This cache policy should only be used when the gofer supports donating host
FDs, since then gVisor will make use of the host kernel page cache, which will
be consistent for all open files in the gofer. In fact, a panic will be raised
if a file is opened without a donated FD.
PiperOrigin-RevId: 207752937
Change-Id: I233cb78b4695bbe00a4605ae64080a47629329b8
In other news, apparently proc.fdInfo is the last user of ramfs.File.
PiperOrigin-RevId: 207564572
Change-Id: I5a92515698cc89652b80bea9a32d309e14059869
Store the new assigned pcid in p.cache[pt].
Signed-off-by: ShiruRen <renshiru2000@gmail.com>
Change-Id: I4aee4e06559e429fb5e90cb9fe28b36139e3b4b6
PiperOrigin-RevId: 207563833
This CL implements CUBIC as described in https://tools.ietf.org/html/rfc8312.
PiperOrigin-RevId: 207353142
Change-Id: I329cbf3277f91127e99e488f07d906f6779c6603
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC.
PiperOrigin-RevId: 207101507
Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
When adding MultiDeviceKeys and their values into MultiDevice maps, make
sure the keys and values have not already been added. This ensures that
preexisting key/value pairs are not overridden.
PiperOrigin-RevId: 206942766
Change-Id: I9d85f38eb59ba59f0305e6614a52690608944981
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:
/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}
Actually, fdinfo has to contain file flags.
Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
We have been unnecessarily creating too many savable types implicitly.
PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
When copying-up files from a lower fs to an upper, we also copy the extended
attributes on the file. If there is a (nested) overlay inside the lower, some
of these extended attributes configure the lower overlay, and should not be
copied-up to the upper.
In particular, whiteout attributes in the lower fs overlay should not be
copied-up, since the upper fs may actually contain the file.
PiperOrigin-RevId: 206236010
Change-Id: Ia0454ac7b99d0e11383f732a529cb195ed364062
This CL also puts the congestion control logic behind an
interface so that we can easily swap it out for say CUBIC
in the future.
PiperOrigin-RevId: 205732848
Change-Id: I891cdfd17d4d126b658b5faa0c6bd6083187944b
The current revalidation logic is very simple and does not do much
introspection of the dirent being revalidated (other than looking at the type
of file).
Fancier revalidation logic is coming soon, and we need to be able to look at
the cached and uncached attributes of a given dirent, and we need a context to
perform some of these operations.
PiperOrigin-RevId: 205307351
Change-Id: If17ea1c631d8f9489c0e05a263e23d7a8a3bf159
In the general case with an overlay, all mmap calls must go through the
overlay, because in the event of a copy-up, the overlay needs to invalidate any
previously-created mappings.
If there if no lower file, however, there will never be a copy-up, so the
overlay can delegate directly to the upper file in that case.
This also allows us to correctly mmap /dev/zero when it is in an overlay. This
file has special semantics which the overlay does not know about. In
particular, it does not implement Mappable(), which (in the general case) the
overlay uses to detect if a file is mappable or not.
PiperOrigin-RevId: 205306743
Change-Id: I92331649aa648340ef6e65411c2b42c12fa69631
Docker expects containers to be created before they are restored.
However, gVisor restoring requires specificactions regarding the kernel
and the file system. These actions were originally in booting the sandbox.
Now setting up the file system is deferred until a call to a call to
runsc start. In the restore case, the kernel is destroyed and a new kernel
is created in the same process, as we need the same process for Docker.
These changes required careful execution of concurrent processes which
required the use of a channel.
Full docker integration still needs the ability to restore into the same
container.
PiperOrigin-RevId: 205161441
Change-Id: Ie1d2304ead7e06855319d5dc310678f701bd099f
Dirent.FullName takes the global renameMu, but can be called during Create,
which itself takes dirent.mu and dirent.dirMu, which is a lock-order violation:
Dirent.Create
d.dirMu.Lock
d.mu.Lock
Inode.Create
gofer.inodeOperations.Create
gofer.NewFile
Dirent.FullName
d.renameMu.RLock
We only use the FullName here for logging, and in this case we can get by with
logging only the BaseName.
A `BaseName` method was added to Dirent, which simply returns the name, taking
d.parent.mu as required.
In the Create pathway, we can't call d.BaseName() because taking d.parent.mu
after d.mu violates the lock order. But we already know the base name of the
file we just created, so that's OK.
In the Open/GetFile pathway, we are free to call d.BaseName() because the other
dirent locks are not held.
PiperOrigin-RevId: 205112278
Change-Id: Ib45c734081aecc9b225249a65fa8093eb4995f10
Per the doc, usage must be kept maximally merged. Beyond that, it is simply a
good idea to keep fragmentation in usage to a minimum.
The glibc malloc allocator allocates one page at a time, potentially causing
lots of fragmentation. However, those pages are likely to have the same number
of references, often making it possible to merge ranges.
PiperOrigin-RevId: 204960339
Change-Id: I03a050cf771c29a4f05b36eaf75b1a09c9465e14
If usageSet is heavily fragmented, findUnallocatedRange and findReclaimable
can spend excessive cycles linearly scanning the set for unallocated/free
pages.
Improve common cases by beginning the scan only at the first page that could
possibly contain an unallocated/free page. This metadata only guarantees that
there is no lower unallocated/free page, but a scan may still be required
(especially for multi-page allocations).
That said, this heuristic can still provide significant performance
improvements for certain applications.
PiperOrigin-RevId: 204841833
Change-Id: Ic41ad33bf9537ecd673a6f5852ab353bf63ea1e6
This method allows an eventfd inside the Sentry to be registered with with
the host kernel.
Update comment about memory mapping host fds via CachingInodeOperations.
PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
Otherwise required and optional can be empty or have negative length.
PiperOrigin-RevId: 204007079
Change-Id: I59e472a87a8caac11ffb9a914b8d79bf0cd70995
Multiple whitespace characters are allowed. This fixes Ubuntu's
/usr/sbin/invoke-rc.d, which has trailing whitespace after the
interpreter which we were treating as an arg.
PiperOrigin-RevId: 203802278
Change-Id: I0a6cdb0af4b139cf8abb22fa70351fe3697a5c6b
80bdf8a406 accidentally moved vdso into an
inner scope, never assigning the vdso variable passed to the Kernel and
thus skipping VDSO mappings.
Fix this and remove the ability for loadVDSO to skip VDSO mappings,
since tests that do so are gone.
PiperOrigin-RevId: 203169135
Change-Id: Ifd8cadcbaf82f959223c501edcc4d83d05327eba
Add option to redirect packet back to netstack if it's destined to itself.
This fixes the problem where connecting to the local NIC address would
not work, e.g.:
echo bar | nc -l -p 8080 &
echo foo | nc 192.168.0.2 8080
PiperOrigin-RevId: 203157739
Change-Id: I31c9f7c501e3f55007f25e1852c27893a16ac6c4
The path in execve(2), interpreter script, and ELF interpreter may all
be no more than a NUL-byte. Handle each of those cases.
PiperOrigin-RevId: 203155745
Change-Id: I1c8b1b387924b23b2cf942341dfc76c9003da959
Fun fact: in protocol version negotiation, our 9p version must be
written "9P2000.L". In the 'version' mount option, it must be
written "9p2000.L". Very consistent!
The mount command as given complains about an unknown protocol
version. Drop it entirely because Linux defaults to 9p2000.L
anyways.
PiperOrigin-RevId: 202971961
Change-Id: I5d46c83f03182476033db9c36870c68aeaf30f65
Updated how restoring occurs through boot.go with a separate Restore function.
This prevents a new process and new mounts from being created.
Added tests to ensure the container is restored.
Registered checkpoint and restore commands so they can be used.
Docker support for these commands is still limited.
Working on #80.
PiperOrigin-RevId: 202710950
Change-Id: I2b893ceaef6b9442b1ce3743bd112383cb92af0c
There is a subtle bug where during cleanup with unread data a FIN can
be converted to a RST, at that point the entire connection should be
aborted as we're not expecting any ACKs to the RST.
PiperOrigin-RevId: 202691271
Change-Id: Idae70800208ca26e07a379bc6b2b8090805d0a22
CheckIORange is analagous to Linux's access_ok() method, which is checked when
copying in IOVecs in both lib/iov_iter.c:import_single_range() and
lib/iov_iter.c:import_iovec() => fs/read_write.c:rw_copy_check_uvector().
gVisor copies in IOVecs via Task.SingleIOSequence() and Task.CopyInIovecs().
We were checking the address range bounds, but not whether the address is
valid. To conform with linux, we should also check that the address is valid.
For usual preadv/pwritev syscalls, the effect of this change is not noticeable,
since we find out that the address is invalid before the syscall completes.
For vectorized async-IO operations, however, this change is necessary because
Linux returns EFAULT when the operation is submitted, but before it executes.
Thus, we must validate the iovecs when copying them in.
PiperOrigin-RevId: 202370092
Change-Id: I8759a63ccf7e6b90d90d30f78ab8935a0fcf4936
If the child stubs are killed by any unmaskable signal (e.g. SIGKILL), then
the parent process will similarly be killed, resulting in the death of all
other stubs.
The effect of this is that if the OOM killer selects and kills a stub, the
effect is the same as though the OOM killer selected and killed the sentry.
PiperOrigin-RevId: 202219984
Change-Id: I0b638ce7e59e0a0f4d5cde12a7d05242673049d7
IsChrooted still has the opportunity to race with another thread
entering the FSContext into a chroot, but that is unchanged (and
fine, AFAIK).
PiperOrigin-RevId: 202029117
Change-Id: I38bce763b3a7715fa6ae98aa200a19d51a0235f1
The interfaces and their addresses are already available via
the stack Intefaces and InterfaceAddrs.
Also add some tests as we had no tests around SIOCGIFCONF. I also added the socket_netgofer lifecycle for IOCTL tests.
PiperOrigin-RevId: 201744863
Change-Id: Ie0a285a2a2f859fa0cafada13201d5941b95499a
The shutdown behavior where we return EAGAIN for sockets
which are non-blocking is only correct for packet based sockets.
SOCK_STREAM sockets should return EOF.
PiperOrigin-RevId: 201703055
Change-Id: I20b25ceca7286c37766936475855959706fc5397
SendExternalSignal is no longer called before CreateProcess, so it can
enforce this simplified precondition.
StartForwarding, and after Kernel.Start.
PiperOrigin-RevId: 201591170
Change-Id: Ib7022ef7895612d7d82a00942ab59fa433c4d6e9
SIGUSR2 was being masked out to be used as a way to dump sentry
stacks. This could cause compatibility problems in cases anyone
uses SIGUSR2 to communicate with the container init process.
PiperOrigin-RevId: 201575374
Change-Id: I312246e828f38ad059139bb45b8addc2ed055d74
FIOASYNC and friends are used to send signals when a file is ready for IO.
This may or may not be needed by Nginx. While Nginx does use it, it is unclear
if the code that uses it has any effect.
PiperOrigin-RevId: 201550828
Change-Id: I7ba05a7db4eb2dfffde11e9bd9a35b65b98d7f50
It prints sandbox stacks to the log to help debug stuckness. I expect
that many more options will be added in the future.
PiperOrigin-RevId: 201405931
Change-Id: I87e560800cd5a5a7b210dc25a5661363c8c3a16e
Almost all of the hundreds of pending signal queues are empty upon save.
PiperOrigin-RevId: 201380318
Change-Id: I40747072435299de681d646e0862efac0637e172
After shutdown(SHUT_RD) calls to recv /w MSG_DONTWAIT or with
O_NONBLOCK should result in a EAGAIN and not 0. Blocking sockets
should return 0 as they would have otherwise blocked indefinitely.
PiperOrigin-RevId: 201271123
Change-Id: If589b69c17fa5b9ff05bcf9e44024da9588c8876
Instead, CPUs will be created dynamically. We also allow a relatively
efficient mechanism for stealing and notifying when a vCPU becomes
available via unlock.
Since the number of vCPUs is no longer fixed at machine creation time,
we make the dirtySet packing more efficient. This has the pleasant side
effect of cutting out the unsafe address space code.
PiperOrigin-RevId: 201266691
Change-Id: I275c73525a4f38e3714b9ac0fd88731c26adfe66
Resume checks the status of the container and unpauses the kernel
if its status is paused. Otherwise nothing happens.
Tests were added to ensure that the process is in the correct state
after various commands.
PiperOrigin-RevId: 201251234
Change-Id: Ifd11b336c33b654fea6238738f864fcf2bf81e19
Correct a data race in rpcinet where a shutdown and recvmsg can
race around shutown flags.
PiperOrigin-RevId: 201238366
Change-Id: I5eb06df4a2b4eba331eeb5de19076213081d581f
The new policy is identical to FSCACHE (which caches everything in memory), but
it also flushes writes to the backing fs agent immediately.
All gofer cache policy decisions have been moved into the cachePolicy type.
Previously they were sprinkled around the codebase.
There are many different things that we cache (page cache, negative dirents,
dirent LRU, unstable attrs, readdir results....), and I don't think we should
have individual flags to control each of these. Instead, we should have a few
high-level cache policies that are consistent and useful to users. This
refactoring makes it easy to add more such policies.
PiperOrigin-RevId: 201206937
Change-Id: I6e225c382b2e5e1b0ad4ccf8ca229873f4cd389d
Because rpcinet will emulate a blocking socket backed by an rpc based
non-blocking socket. In the event of a shutdown(SHUT_RD) followed by a
read a non-blocking socket is allowed to return an EWOULDBLOCK however
since a blocking socket knows it cannot receive anymore data it would
block indefinitely and in this situation linux returns 0. We have to
track this on the rpcinet sentry side so we can emulate that behavior
because the remote side has no way to know if the socket is actually
blocking within the sentry.
PiperOrigin-RevId: 201201618
Change-Id: I4ac3a7b74b5dae471ab97c2e7d33b83f425aedac
Add support for control messages, but at this time the only
control message that the sentry will support here is SO_TIMESTAMP.
PiperOrigin-RevId: 200922230
Change-Id: I63a852d9305255625d9df1d989bd46a66e93c446
There are circumstances under which the redpill call will not generate
the appropriate action and notification. Replace this call with an
explicit notification, which is guaranteed to transition as well as
perform the futex wake.
PiperOrigin-RevId: 200726934
Change-Id: Ie19e008a6007692dd7335a31a8b59f0af6e54aaa
Boot loader tries to stat mount to determine whether it's a file or not. This
may file if the sandbox process doesn't have access to the file. Instead, add
overlay on top of file, which is better anyway since we don't want to propagate
changes to the host.
PiperOrigin-RevId: 200411261
Change-Id: I14222410e8bc00ed037b779a1883d503843ffebb
Rpcinet already inherits socket.ReceiveTimeout; however, it's
never set on setsockopt(2). The value is currently forwarded
as an RPC and ignored as all sockets will be non-blocking
on the RPC side.
PiperOrigin-RevId: 200299260
Change-Id: I6c610ea22c808ff6420c63759dccfaeab17959dd
This is the first iteration of checkpoint that actually saves to a file.
Tests for checkpoint are included.
Ran into an issue when private unix sockets are enabled. An error message
was added for this case and the mutex state was set.
PiperOrigin-RevId: 200269470
Change-Id: I28d29a9f92c44bf73dc4a4b12ae0509ee4070e93
In order to minimize the likelihood of exit during page table
modifications, make the full set of page table functions split-safe.
This is not strictly necessary (and you may still incur splits due to
allocations from the allocator pool) but should make retries a very rare
occurance.
PiperOrigin-RevId: 200146688
Change-Id: I8fa36aa16b807beda2f0b057be60038258e8d597
hostinet/socket.go: the Sentry doesn't spawn new processes, but it doesn't hurt to protect the socket from leaking.
unet/unet.go: should be setting closing on exec. The FD is explicitly donated to children when needed.
PiperOrigin-RevId: 200135682
Change-Id: Ia8a45ced1e00a19420c8611b12e7a8ee770f89cb
SOCK_STREAM has special behavior with respect to MSG_TRUNC. Specifically,
the data isn't actually copied back out to userspace when MSG_TRUNC is
provided on a SOCK_STREAM.
According to tcp(7): "Since version 2.4, Linux supports the use of
MSG_TRUNC in the flags argument of recv(2) (and recvmsg(2)). This flag
causes the received bytes of data to be discarded, rather than passed
back in a caller-supplied buffer."
PiperOrigin-RevId: 200134860
Change-Id: I70f17a5f60ffe7794c3f0cfafd131c069202e90d
Minor refactor. line_discipline.go was home to 2 large structs (lineDiscipline
and queue), and queue is now large enough IMO to get its own file.
Also moves queue locks into the queue struct, making locking simpler.
PiperOrigin-RevId: 200080301
Change-Id: Ia75a0e9b3d9ac8d7e5a0f0099a54e1f5b8bdea34
Walking off the bottom of the sigaltstack, for example with recursive faults,
results in forced signal delivery, not resetting the stack or pushing signal
stack to whatever happens to lie below the signal stack.
PiperOrigin-RevId: 199856085
Change-Id: I0004d2523f0df35d18714de2685b3eaa147837e0
MSG_TRUNC can cause recvmsg(2) to return a value larger than
the buffer size. In this situation it's an indication that the
buffer was completely filled and that the msg was truncated.
Previously in rpcinet we were returning the buffer size but we
should actually be returning the payload length as returned by
the syscall.
PiperOrigin-RevId: 199814221
Change-Id: If09aa364219c1bf193603896fcc0dc5c55e85d21
Adds support for echo to terminals. Echoing is just copying input back out to
the user, e.g. when I type "foo" into a terminal, I expect "foo" to be echoed
back to my terminal.
Also makes the transform function part of the queue, eliminating the need to
pass them around together and the possibility of using the wrong transform for a
queue.
PiperOrigin-RevId: 199655147
Change-Id: I37c490d4fc1ee91da20ae58ba1f884a5c14fd0d8
Because of the KVM shadow page table implementation, modifications made
to guest page tables from host mode may not be syncronized correctly,
resulting in undefined behavior. This is a KVM bug: page table pages
should also be tracked for host modifications and resynced appropriately
(e.g. the guest could "DMA" into a page table page in theory).
However, since we can't rely on this being fixed everywhere, workaround
the issue by forcing page table modifications to be in guest mode. This
will generally be the case anyways, but now if an exit occurs during
modifications, we will re-enter and perform the modifications again.
PiperOrigin-RevId: 199587895
Change-Id: I83c20b4cf2a9f9fa56f59f34939601dd34538fb0
Instead of associating a single PCID with each set of page tables (which
will reach the maximum quickly), allow a dynamic pool for each vCPU.
This is the same way that Linux operates. We also split management of
PCIDs out of the page tables themselves for simplicity.
PiperOrigin-RevId: 199585631
Change-Id: I42f3486ada3cb2a26f623c65ac279b473ae63201
In order to prevent possible garbage collection and reuse of page table
pages prior to invalidation, introduce a former allocator abstraction
that can ensure entries are held during a single traversal. This also
cleans up the abstraction and splits it out of the machine itself.
PiperOrigin-RevId: 199581636
Change-Id: I2257d5d7ffd9c36f9b7ecd42f769261baeaf115c
This change will add support for ioctls that have previously
been supported by netstack.
LINE_LENGTH_IGNORE
PiperOrigin-RevId: 199544114
Change-Id: I3769202c19502c3b7d05e06ea9552acfd9255893