If the physical pages of a memory region are not mapped yet, the kernel
triggers KVM_EXIT_MMIO and we map the physical pages in bluepillHandler().
The instruction that triggered the fault is not re-executed. Instead, it
is emulated in the kernel, which can't emulate complex instructions
such as xsave and xrstor. We can work around this problem by touching
the memory with simple instructions.
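As a rough illustration of the workaround described above, the sketch below
touches a buffer with plain byte loads, the kind of simple access the kernel
can emulate. The helper name touchPages, the sink variable, and the 4096-byte
page size are assumptions made for this example and are not taken from the
gVisor sources.

```go
package main

import "fmt"

// pageSize is an assumed page size used only for this illustration.
const pageSize = 4096

// sink receives the loaded bytes. Storing into a package-level variable
// keeps the loads from being trivially discarded by the compiler.
var sink byte

// touchPages is a hypothetical helper. It loads one byte at every pageSize
// stride of buf, plus the last byte so that an unaligned tail page is also
// covered, and returns the number of loads it performed.
func touchPages(buf []byte) int {
	touched := 0
	for off := 0; off < len(buf); off += pageSize {
		sink ^= buf[off]
		touched++
	}
	if len(buf) > 0 {
		sink ^= buf[len(buf)-1]
		touched++
	}
	return touched
}

func main() {
	buf := make([]byte, 3*pageSize+1)
	fmt.Println("loads performed:", touchPages(buf))
}
```

The point of the sketch is only that each access is an ordinary load; it does
not model the KVM exit path itself.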
The syscall package has been deprecated in favor of golang.org/x/sys.
Note that syscall is still used in the following places:
- pkg/sentry/socket/hostinet/stack.go: some netlink-related functionality
is not yet available in golang.org/x/sys.
- syscall.Stat_t is still used in some places because os.FileInfo.Sys() still
returns it and not unix.Stat_t.
Updates #214
PiperOrigin-RevId: 360701387
Implement basic lazy save and restore for FPSIMD registers, which restores
FPSIMD state only on el0_fpsimd_acc and saves it in switch().
Signed-off-by: Robin Luk <lubin.lu@antgroup.com>
This allows the package to serve as a general purpose ring0 support package, as
opposed to being bound to specific sentry platforms.
Updates #5039
PiperOrigin-RevId: 355220044
To improve performance, revert some KPTI-related code (TCR.A1) and
mark the kernel page table as global.
Signed-off-by: Robin Luk <lubin.lu@antgroup.com>
If KVM reports no valid syndrome (data abort outside memslots), let
userspace inject an ext_dabt to bail out of this situation.
Signed-off-by: Robin Luk <lubin.lu@antgroup.com>
Use an SError injection to trigger SIGBUS when we receive EFAULT from the
run ioctl.
After applying this patch, mmap_test_runsc_kvm passes on
Arm64.
Signed-off-by: Bin Lu <bin.lu@arm.com>
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/4542 from lubinszARM:pr_kvm_mmap_1 f81bd42466d1d60a581e5fb34de18b78878c68c1
PiperOrigin-RevId: 340461239
Add support for setSystemTimeLegacy() by setting cntvoff.
With this PR, TestRdtsc and other KVM syscall test cases (nanosleep,
wait...) pass on Arm64.
TODO: Add precise synchronization to KVM for Arm64.
Reference PR: https://github.com/google/gvisor/pull/4397
Signed-off-by: Bin Lu <bin.lu@arm.com>
Consistent with the Linux approach, we raise SIGILL to handle
el0_undef.
After applying this patch, exec_binary_test_runsc_kvm passes on
Arm64.
Signed-off-by: Bin Lu <bin.lu@arm.com>
The required states may simply not be observed by the thread running bounce, so
track guest and user generations to ensure that at least one of the desired
state transitions happens.
Fixes #3532
PiperOrigin-RevId: 336908216
The correct value is 0xbbff440c0400, but the constant as defined
evaluates to 0x000000000000ffc0 because of an operator error
in _MT_EL1_INIT. In addition, both kernel and user space memory
attributes should be Normal memory, not DEVICE_nGnRE.
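One way such an operator error can arise in Go is that `<<` and `*` share the
same precedence and associate left to right, so `attr << idx * 8` parses as
`(attr << idx) * 8`. The sketch below assumes the usual arm64 MAIR_EL1
attribute encodings and index order; the constant names are illustrative and
are not the ones used in the gVisor sources. Under these assumptions the two
expressions reproduce both values quoted above.

```go
package main

import "fmt"

// Assumed MAIR_EL1 attribute indices and encodings (illustrative names).
const (
	idxDeviceNGnRnE = 0
	idxDeviceNGnRE  = 1
	idxDeviceGRE    = 2
	idxNormalNC     = 3
	idxNormal       = 4
	idxNormalWT     = 5

	attrDeviceNGnRnE = 0x00
	attrDeviceNGnRE  = 0x04
	attrDeviceGRE    = 0x0c
	attrNormalNC     = 0x44
	attrNormal       = 0xff
	attrNormalWT     = 0xbb
)

// wrongInit omits parentheses, so each term is (attr << idx) * 8.
const wrongInit uint64 = attrDeviceNGnRnE<<idxDeviceNGnRnE*8 |
	attrDeviceNGnRE<<idxDeviceNGnRE*8 |
	attrDeviceGRE<<idxDeviceGRE*8 |
	attrNormalNC<<idxNormalNC*8 |
	attrNormal<<idxNormal*8 |
	attrNormalWT<<idxNormalWT*8

// rightInit shifts each attribute by idx*8 bits, one byte per index.
const rightInit uint64 = attrDeviceNGnRnE<<(idxDeviceNGnRnE*8) |
	attrDeviceNGnRE<<(idxDeviceNGnRE*8) |
	attrDeviceGRE<<(idxDeviceGRE*8) |
	attrNormalNC<<(idxNormalNC*8) |
	attrNormal<<(idxNormal*8) |
	attrNormalWT<<(idxNormalWT*8)

func main() {
	fmt.Printf("without parentheses: %#x\n", wrongInit)
	fmt.Printf("with parentheses:    %#x\n", rightInit)
}
```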
Signed-off-by: Min Le <lemin.lm@antgroup.com>
Previously we assumed that interrupts are always disabled in kernel
space, but there is a case where the Go runtime switches to a goroutine
that was saved in host mode. On restore, the popf instruction is
used to restore flags, which means that all flags that the goroutine
had in host mode are restored in kernel mode. And in
host mode, interrupts are always enabled.
Long story short, we can't use the IF flag to determine whether a
task is running in user or kernel mode.
This patch reworks the code so that in userspace, the first bit of the
IOPL flag is always set. This doesn't give any new privileges to
a task because CPL in userspace is always 3. But then we can use this
flag to distinguish user and kernel modes. The IOPL flag is never set in
the kernel and host modes.
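The check described above can be sketched as follows. The bit positions are
the architectural x86 EFLAGS ones (IF is bit 9, IOPL occupies bits 12-13);
the constant and function names are hypothetical and chosen only for this
illustration.

```go
package main

import "fmt"

const (
	eflagsIF    = 1 << 9  // interrupt enable flag
	eflagsIOPL0 = 1 << 12 // low bit of the two-bit IOPL field
)

// isUserMode is a hypothetical helper: it treats a saved flags value as
// belonging to user mode only if the low IOPL bit is set, regardless of IF.
func isUserMode(eflags uint64) bool {
	return eflags&eflagsIOPL0 != 0
}

func main() {
	host := uint64(eflagsIF)               // host mode: IF set, IOPL clear
	kernel := uint64(eflagsIF)             // kernel mode after popf restored host flags
	user := uint64(eflagsIF | eflagsIOPL0) // user mode: low IOPL bit always set

	fmt.Println(isUserMode(host), isUserMode(kernel), isUserMode(user))
}
```

Note that the host and kernel values are identical in this sketch, which is
exactly why IF alone cannot tell the modes apart.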
Reported-by: syzbot+5036b325a8eb15c030cf@syzkaller.appspotmail.com
Reported-by: syzbot+034d580e89ad67b8dc75@syzkaller.appspotmail.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This immediately revealed an escape analysis violation (!), where
the sync.Map was being used in a context where escapes were not
allowed. This is a relatively minor fix and is included.
PiperOrigin-RevId: 328611237
Currently, gVisor has KPTI (Kernel Page-Table Isolation) between
gr0 and gr3, but the upper half of the userCR3 contains the
whole sentry kernel, which makes the kernel vulnerable to
gr3 applications through CPU bugs.
This patch implements full KPTI functionality for gVisor. It doesn't
map the whole kernel in the upper half. It maps only the text section
of the binary and the entry area required by the ISA. The entry area
contains the global IDT, the per-CPU GDT/TSS, etc. The entry area
packs all of these together and is less than 350k for 512 vCPUs.
The text section is normally nonsensitive. It is possible to
map only the entry functions (interrupt handlers, etc.),
but that requires some hacks.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
kernelEntry is split out of CPU and contains the minimal CPU-specific
arch state that can be mapped in the upper half of the address space.
This prepares for KPTI in gVisor.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
m.Get() guarantees that if any OS thread TID is in guest,
m.vCPUs[TID] points to the vCPU on which that OS thread is running.
So if m.Get() returns with the current context in guest,
its vCPU must be the same as the one Get() returns.
bluepill() therefore doesn't need to check whether the vCPU matches.
The check needs to access the %gs register, which will no longer point
to the vCPU once KPTI for gVisor is enabled. We could still
fetch the vCPU pointer from %gs later (when %gs points to kernelEntry),
but that needs ENTRY_CPU_SELF, which is generated by
ring0/offset_amd64.go. So we simply remove the check.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>