Because of the KVM shadow page table implementation, modifications made
to guest page tables from host mode may not be syncronized correctly,
resulting in undefined behavior. This is a KVM bug: page table pages
should also be tracked for host modifications and resynced appropriately
(e.g. the guest could "DMA" into a page table page in theory).
However, since we can't rely on this being fixed everywhere, workaround
the issue by forcing page table modifications to be in guest mode. This
will generally be the case anyways, but now if an exit occurs during
modifications, we will re-enter and perform the modifications again.
PiperOrigin-RevId: 199587895
Change-Id: I83c20b4cf2a9f9fa56f59f34939601dd34538fb0
Instead of associating a single PCID with each set of page tables (which
will reach the maximum quickly), allow a dynamic pool for each vCPU.
This is the same way that Linux operates. We also split management of
PCIDs out of the page tables themselves for simplicity.
PiperOrigin-RevId: 199585631
Change-Id: I42f3486ada3cb2a26f623c65ac279b473ae63201
In order to prevent possible garbage collection and reuse of page table
pages prior to invalidation, introduce a former allocator abstraction
that can ensure entries are held during a single traversal. This also
cleans up the abstraction and splits it out of the machine itself.
PiperOrigin-RevId: 199581636
Change-Id: I2257d5d7ffd9c36f9b7ecd42f769261baeaf115c
Just a UI/usability addition. It's a lot easier to type "60" than
"60185c721d7e10c00489f1fa210ee0d35c594873d6376b457fb1815e4fdbfc2c".
PiperOrigin-RevId: 199547932
Change-Id: I19011b5061a88aba48a9ad7f8cf954a6782de854
This change will add support for ioctls that have previously
been supported by netstack.
LINE_LENGTH_IGNORE
PiperOrigin-RevId: 199544114
Change-Id: I3769202c19502c3b7d05e06ea9552acfd9255893
Checkpoint command is plumbed through container and sandbox.
Restore has also been added but it is only a stub. None of this
works yet. More changes to come.
PiperOrigin-RevId: 199510105
Change-Id: Ibd08d57f4737847eb25ca20b114518e487320185
This change will add support for /proc/sys/net and /proc/net which will
be managed and owned by rpcinet. This will allow these inodes to be forward
as rpcs.
PiperOrigin-RevId: 199370799
Change-Id: I2c876005d98fe55dd126145163bee5a645458ce4
So that when saving TCP endpoint in these states, there is no pending or
background activities.
Also lift tcp network save rejection error to tcpip package.
PiperOrigin-RevId: 199370748
Change-Id: Ief7b45c2a7338d12414cd7c23db95de6a9c22700
Refuse to mount paths with "." and ".." in the path to prevent
a compromised Sentry to mount "../../secrets". Only allow
Attach to be called once per mount point.
PiperOrigin-RevId: 199225929
Change-Id: I2a3eb7ea0b23f22eb8dde2e383e32563ec003bd5
Containerd will start deleting container and rootfs after container
is stopped. However, if gofer is still running, rootfs cleanup will
fail because of device busy.
This CL makes sure that gofer is not running when container state is
stopped.
Change from: lantaol@google.com
PiperOrigin-RevId: 199172668
Change-Id: I9d874eec3ecf74fd9c8edd7f62d9f998edef66fe
9P socket was being created without CLOEXEC and was being inherited
by the children. This would prevent the gofer from detecting that the
sandbox had exited, because the socket would not be closed.
PiperOrigin-RevId: 199168959
Change-Id: I3ee1a07cbe7331b0aeb1cf2b697e728ce24f85a7
Common code to setup and run sandbox is moved to testutil. Also, don't
link "boot" and "gofer" commands with test binary. Instead, use runsc
binary from the build. This not only make the test setup simpler, but
also resolves a dependency issue with sandbox_tests not depending on
container package.
PiperOrigin-RevId: 199164478
Change-Id: I27226286ca3f914d4d381358270dd7d70ee8372f
This is necessary to prevent races with invalidation. It is currently
possible that page tables are garbage collected while paging caches
refer to them. We must ensure that pages are held until caches can be
invalidated. This is not achieved by this goal alone, but moving locking
to outside the page tables themselves is a requisite.
PiperOrigin-RevId: 198920784
Change-Id: I66fffecd49cb14aa2e676a84a68cabfc0c8b3e9a
Previously, the vCPU FS was always correct because it relied on the
reset coming out of the switch. When that doesn't occur, for example,
using bluepill directly, the FS value can be incorrect leading to
strange corruption.
This change is necessary for a subsequent change that enforces guest
mode for page table modifications, and it may reduce test flakiness.
(The problematic path may occur in tests, but does not occur in the
actual platform.)
PiperOrigin-RevId: 198648137
Change-Id: I513910a973dd8666c9a1d18cf78990964d6a644d
This is a refactor of ring0 and ring0/pagetables that changes from
individual arguments to opts structures. This should involve no
functional changes, but sets the stage for subsequent changes.
PiperOrigin-RevId: 198627556
Change-Id: Id4460340f6a73f0c793cd879324398139cd58ae9
This addresses the first issue reported in #59. CRI-O expects runsc to
return success to delete when --force is used with a non-existing container.
PiperOrigin-RevId: 198487418
Change-Id: If7660e8fdab1eb29549d0a7a45ea82e20a1d4f4a
Today poll will not wake up on a ECONNREFUSED if no poll mask
is specified, which is equivalent to POLLHUP | POLLERR which are
implicitly added during the poll syscall.
PiperOrigin-RevId: 197967183
Change-Id: I668d0730c33701228913f2d0843b48491b642efb
These were causing non-blocking related errnos to be returned to
the sentry when they were created as blocking FDs internally.
PiperOrigin-RevId: 197962932
Change-Id: I3f843535ff87ebf4cb5827e9f3d26abfb79461b0
Container user might not have enough priviledge to walk directories and
mount filesystems. Instead, create superuser to perform these steps of
the configuration.
PiperOrigin-RevId: 197953667
Change-Id: I643650ab654e665408e2af1b8e2f2aa12d58d4fb
Today when we transmit a RST it's happening during the time-wait
flow. Because a FIN is allowed to advance the acceptable ACK window
we're incorrectly doing that for a RST.
PiperOrigin-RevId: 197637565
Change-Id: I080190b06bd0225326cd68c1fbf37bd3fdbd414e
Kernel before 2.6.16 return EINVAL, but later return ESPIPE for this case.
Also change type of "length" from Uint(uint32) to Int64.
Because C header uses type "size_t" (unsigned long) or "off_t" (long) for length.
And it makes more sense to check length < 0 with Int64 because Uint cannot be negative.
Change-Id: Ifd7fea2dcded7577a30760558d0d31f479f074c4
PiperOrigin-RevId: 197616743
Establishes a way of communicating interface flags between netstack and
epsocket. More flags can be added over time.
PiperOrigin-RevId: 197616669
Change-Id: I230448c5fb5b7d2e8d69b41a451eb4e1096a0e30
Especially in situations with small numbers of vCPUs, the existing
system resulted in excessive thrashing. Now, execution contexts
co-ordinate as smoothly as they can to share a small number of cores.
PiperOrigin-RevId: 197483323
Change-Id: I0afc0c5363ea9386994355baf3904bf5fe08c56c
In Linux, many UDS ioctls are passed through to the NIC driver. We do the same
here, passing ioctl calls to Unix sockets through to epsocket.
In Linux you can see this path at net/socket.c:sock_ioctl, which calls
sock_do_ioctl, which calls net/core/dev_ioctl.c:dev_ioctl.
SIOCGIFNAME is also added.
PiperOrigin-RevId: 197167508
Change-Id: I62c326a4792bd0a473e9c9108aafb6a6354f2b64
Capabilities for sysv sem operations were being checked against the
current task's user namespace. They should be checked against the user
namespace owning the ipc namespace for the sems instead, per
ipc/util.c:ipcperms().
PiperOrigin-RevId: 197063111
Change-Id: Iba29486b316f2e01ee331dda4e48a6ab7960d589
Previously, dual stack UDP sockets bound to an IPv4 address could not use
sendto to communicate with IPv4 addresses. Further, dual stack UDP sockets
bound to an IPv6 address could use sendto to communicate with IPv4 addresses.
Neither of these behaviors are consistent with Linux.
PiperOrigin-RevId: 197036024
Change-Id: Ic3713efc569f26196e35bb41e6ad63f23675fc90
This is another step towards multi-container support.
Previously, we delivered signals directly to the sandbox process (which then
forwarded the signal to PID 1 inside the sandbox). Similarly, we waited on a
container by waiting on the sandbox process itself. This approach will not work
when there are multiple containers inside the sandbox, and we need to
signal/wait on individual containers.
This CL adds two new messages, ContainerSignal and ContainerWait. These
messages include the id of the container to signal/wait. The controller inside
the sandbox receives these messages and signals/waits on the appropriate
process inside the sandbox.
The container id is plumbed into the sandbox, but it currently is not used. We
still end up signaling/waiting on PID 1 in all cases. Once we actually have
multiple containers inside the sandbox, we will need to keep some sort of map
of container id -> pid (or possibly pid namespace), and signal/kill the
appropriate process for the container.
PiperOrigin-RevId: 197028366
Change-Id: I07b4d5dc91ecd2affc1447e6b4bdd6b0b7360895
So that when saving TCP endpoint in these states, there is no pending or
background activities.
Also lift tcp network save rejection error to tcpip package.
PiperOrigin-RevId: 196886839
Change-Id: I0fe73750f2743ec7e62d139eb2cec758c5dd6698
This should fix the socket Dirent memory leak.
fs.NewFile takes a new reference. It should hold the *only* reference.
DecRef that socket Dirent.
Before the globalDirentMap was introduced, a mis-refcounted Dirent
would be garbage collected when all references to it were gone. For
socket Dirents, this meant that they would be garbage collected when
the associated fs.Files disappeared.
After the globalDirentMap, Dirents *must* be reference-counted
correctly to be garbage collected, as Dirents remove themselves
from the global map when their refcount goes to -1 (see Dirent.destroy).
That removes the last pointer to that Dirent.
PiperOrigin-RevId: 196878973
Change-Id: Ic7afcd1de97c7101ccb13be5fc31de0fb50963f0
When doing a BidirectionalConnect we don't need to continue holding
the ConnectingEndpoint's mutex when creating the NewConnectedEndpoint
as it was held during the Connect. Additionally, we're not holding
the baseEndpoint mutex while Unregistering an event.
PiperOrigin-RevId: 196875557
Change-Id: Ied4ceed89de883121c6cba81bc62aa3a8549b1e9