64 Commits

Author SHA1 Message Date
Ruidong Cao a2b794b30d FPE_INTOVF (integer overflow) should be 2, as in Linux.
Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I03f8ab25cf29257b31f145cf43304525a93f3300
PiperOrigin-RevId: 235763203
2019-02-26 11:48:49 -08:00
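For reference, the SIGFPE si_code values that Linux defines (commit a2b794b30d above corrects the second one) can be sketched as Go constants. The constant names here mirror the Linux header and are an assumption; the identifiers used in the repository may differ.

```go
package main

import "fmt"

// si_code values for SIGFPE as defined by the Linux UAPI headers.
// The names follow Linux; the repository may spell them differently.
const (
	FPE_INTDIV = 1 // integer divide by zero
	FPE_INTOVF = 2 // integer overflow
	FPE_FLTDIV = 3 // floating-point divide by zero
	FPE_FLTOVF = 4 // floating-point overflow
	FPE_FLTUND = 5 // floating-point underflow
	FPE_FLTRES = 6 // floating-point inexact result
	FPE_FLTINV = 7 // floating-point invalid operation
	FPE_FLTSUB = 8 // subscript out of range
)

func main() {
	fmt.Println("FPE_INTDIV =", FPE_INTDIV, "FPE_INTOVF =", FPE_INTOVF)
}
```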
Jamie Liu 0e84ae72e0 Improve safecopy sanity checks.
- Fix CopyIn/CopyOut/ZeroOut range checks.

- Include the faulting signal number in the panic message.

PiperOrigin-RevId: 233829501
Change-Id: I8959ead12d05dbd4cd63c2b908cddeb2a27eb513
2019-02-13 14:25:15 -08:00
Michael Pratt 2a0c69b19f Remove license comments
Nothing reads them and they can simply get stale.

Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD

PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-31 11:12:53 -08:00
Fabricio Voznika 03226cd950 Add BPFAction type with Stringer
PiperOrigin-RevId: 226018694
Change-Id: I98965e26fe565f37e98e5df5f997363ab273c91b
2018-12-18 10:28:28 -08:00
Haibo Xu 52fe3b87a4 Add safecopy support for arm64 platform.
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I565214581eeb44045169da7f44d45a489082ac3a
PiperOrigin-RevId: 224938170
2018-12-10 21:35:02 -08:00
Michael Pratt 99d5958693 Validate FS_BASE in Task.Clone
arch_prctl already verified that the new FS_BASE was canonical, but
Task.Clone did not. Centralize these checks in the arch packages.

Failure to validate could cause an error in PTRACE_SET_REGS when we try
to switch to the app.

PiperOrigin-RevId: 224862398
Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
2018-12-10 12:37:16 -08:00
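The canonical-address check that commit 99d5958693 centralizes can be illustrated with a minimal sketch. It assumes 48-bit virtual addresses (4-level paging), and isCanonical is a hypothetical helper name, not necessarily what the arch packages use.

```go
package main

import "fmt"

// isCanonical reports whether addr is a canonical x86-64 address,
// assuming 48-bit virtual addresses: bits 63..47 must all be equal.
// Shifting left by 16 and arithmetically shifting back sign-extends
// bit 47; a canonical address survives the round trip unchanged.
func isCanonical(addr uint64) bool {
	return uint64(int64(addr<<16)>>16) == addr
}

func main() {
	for _, a := range []uint64{
		0x00007fffffffffff, // top of the lower half
		0x0000800000000000, // first non-canonical address
		0xffff800000000000, // bottom of the upper half
	} {
		fmt.Printf("%#016x canonical=%v\n", a, isCanonical(a))
	}
}
```

A value such as FS_BASE would be rejected before being handed to the host (for example via PTRACE_SET_REGS) whenever this check fails.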
Michael Pratt 076f107643 Remove initRegs arg from clone
It is always the same as t.initRegs.

PiperOrigin-RevId: 224085550
Change-Id: I5cc4ddc3b481d4748c3c43f6f4bb50da1dbac694
2018-12-04 18:53:43 -08:00
Haibo Xu 9e0f132377 Add procid support for arm64 platform
Change-Id: I7c3db8dfdf95a125d7384c1d67c3300dbb99a47e
PiperOrigin-RevId: 223039923
2018-11-27 12:46:39 -08:00
Fabricio Voznika eaac94d91c Use RET_KILL_PROCESS if available in kernel
RET_KILL_THREAD doesn't work well for Go because it will
kill only the offending thread and leave the process hanging.
RET_TRAP can be masked out and it's not guaranteed to kill
the process. RET_KILL_PROCESS is available since 4.14.

For older kernels, continue to use RET_TRAP as this is the
best option (likely to kill process, easy to debug).

PiperOrigin-RevId: 222357867
Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20 22:56:51 -08:00
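The fallback described in commit eaac94d91c can be sketched as follows. The numeric values are the seccomp return actions from the Linux headers; defaultAction and its boolean parameter are hypothetical, and how availability is actually probed (for example with the seccomp SECCOMP_GET_ACTION_AVAIL operation) is left out as an assumption.

```go
package main

import "fmt"

// Seccomp filter return actions, as defined in linux/seccomp.h.
const (
	retKillThread  uint32 = 0x00000000 // SECCOMP_RET_KILL_THREAD
	retTrap        uint32 = 0x00030000 // SECCOMP_RET_TRAP
	retKillProcess uint32 = 0x80000000 // SECCOMP_RET_KILL_PROCESS (Linux 4.14+)
)

// defaultAction picks the action applied to disallowed system calls.
// killProcessAvailable is assumed to come from a runtime probe of the
// host kernel; the probe itself is not shown here.
func defaultAction(killProcessAvailable bool) uint32 {
	if killProcessAvailable {
		return retKillProcess
	}
	// Older kernels: RET_TRAP is likely to kill the process and is
	// easy to debug, unlike RET_KILL_THREAD, which can leave a Go
	// process hanging with one thread missing.
	return retTrap
}

func main() {
	fmt.Printf("new kernel: %#x\n", defaultAction(true))
	fmt.Printf("old kernel: %#x\n", defaultAction(false))
}
```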
Michael Pratt 03c1eb78b5 Reference upstream licenses
Include copyright notices and the referenced LICENSE file.

PiperOrigin-RevId: 222171321
Change-Id: I0cc0b167ca51b536d1087bf1c4742fdf1430bc2a
2018-11-20 14:05:16 -08:00
Adin Scannell fb613020c7 kvm: simplify floating point logic.
This reduces the number of floating point save/restore cycles required (since
we don't need to restore immediately following the switch, this always happens
in a known context) and allows the kernel hooks to capture state. This lets us
remove calls like "Current()".

PiperOrigin-RevId: 219552844
Change-Id: I7676fa2f6c18b9919718458aa888b832a7db8cab
2018-10-31 15:59:23 -07:00
Adin Scannell c4bbb54168 kvm: add detailed traces on vCPU errors.
This improves debuggability greatly.

PiperOrigin-RevId: 219551560
Change-Id: I2ecaffdd1c17b0d9f25911538ea6f693e2bc699f
2018-10-31 15:50:10 -07:00
Adin Scannell e9dbd5ab67 kvm: avoid siginfo allocations.
PiperOrigin-RevId: 219492587
Change-Id: I47f6fc0b74a4907ab0aff03d5f26453bdb983bb5
2018-10-31 10:08:06 -07:00
Adin Scannell 0091db9cbd kvm: use private futexes.
Use private futexes for performance and to align with other runtime uses.

PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
2018-10-30 22:46:42 -07:00
Adin Scannell e7191f058f Use TRAP to simplify vsyscall emulation.
PiperOrigin-RevId: 218592058
Change-Id: I373a2d813aa6cc362500dd5a894c0b214a1959d7
2018-10-24 15:52:44 -07:00
Nicolas Lacasse 4a1a2dead9 Run ptrace stubs in their own session and process group.
Pseudoterminal job control signals are meant to be received and handled by the
sandbox process, but if the ptrace stubs are running in the same process group,
they will receive the signals as well and inject them into the sentry kernel.

This can result in duplicate signals being delivered (often to the wrong
process), or a sentry panic if the ptrace stub is inactive.

This CL makes the ptrace stub run in a new session.

PiperOrigin-RevId: 218536851
Change-Id: Ie593c5687439bbfbf690ada3b2197ea71ed60a0e
2018-10-24 10:42:35 -07:00
Adin Scannell 75cd70ecc9 Track paths and provide a rename hook.
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.

PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
2018-10-23 00:20:15 -07:00
Ian Gudger 8fce67af24 Use correct company name in copyright header
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-19 16:35:11 -07:00
Adin Scannell 463e73d46d Add seccomp filter configuration to ptrace stubs.
This is a defense-in-depth measure. If the sentry is compromised, this prevents
system call injection to the stubs. There is some complexity with respect to
ptrace and seccomp interactions, so this protection is not really available
for kernel versions < 4.8; this is detected dynamically.

Note that this also solves the vsyscall emulation issue by adding in
appropriate trapping for those system calls. It does mean that a compromised
sentry could theoretically inject these into the stub (ignoring the trap and
resume, thereby allowing execution), but they are harmless.

PiperOrigin-RevId: 216647581
Change-Id: Id06c232cbac1f9489b1803ec97f83097fcba8eb8
2018-10-10 22:40:28 -07:00
Fabricio Voznika da20559137 Provide better message when memfd_create fails with ENOSYS
Updates #100

PiperOrigin-RevId: 213414821
Change-Id: I90c2e6c18c54a6afcd7ad6f409f670aa31577d37
2018-09-18 02:09:28 -07:00
newmanwang de5a590ee2 Avoid reuse of pending SignalInfo objects
runApp.execute -> Task.SendSignal -> sendSignalLocked -> sendSignalTimerLocked
-> pendingSignals.enqueue assumes that it owns the arch.SignalInfo returned
from platform.Context.Switch.

On the other hand, ptrace.context.Switch assumes that it owns the returned
SignalInfo and can safely reuse it on the next call to Switch. The KVM platform
always returns a unique SignalInfo.

This becomes a problem when the returned signal is not immediately delivered,
allowing a future signal in Switch to change the previous pending SignalInfo.

This is noticeable in #38 when external SIGINTs are delivered from the PTY
slave FD. Note that the ptrace stubs are in the same process group as the
sentry, so they are eligible to receive the PTY signals. This should probably
change, but is not the only possible cause of this bug.

Updates #38

Original change by newmanwang <wcs1011@gmail.com>, updated by Michael Pratt
<mpratt@google.com>.

Change-Id: I5383840272309df70a29f67b25e8221f933622cd
PiperOrigin-RevId: 213071072
2018-09-14 17:39:25 -07:00
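The aliasing problem described in commit de5a590ee2 can be reproduced in miniature. This is an illustrative sketch only: SignalInfo, reusingContext, and freshContext are simplified stand-ins, not the real arch.SignalInfo or platform.Context types.

```go
package main

import "fmt"

// SignalInfo is a simplified stand-in for arch.SignalInfo.
type SignalInfo struct {
	Signo int32
}

// reusingContext hands out a pointer to the same SignalInfo on every
// Switch, as the ptrace context is described as doing before the fix.
type reusingContext struct {
	info SignalInfo
}

func (c *reusingContext) Switch(signo int32) *SignalInfo {
	c.info.Signo = signo
	return &c.info
}

// freshContext returns a unique SignalInfo per Switch, which is safe
// for a caller that queues the result.
type freshContext struct{}

func (freshContext) Switch(signo int32) *SignalInfo {
	return &SignalInfo{Signo: signo}
}

func main() {
	// The caller queues the returned pointers, assuming it owns them.
	r := &reusingContext{}
	pending := []*SignalInfo{r.Switch(2), r.Switch(11)}
	fmt.Println("reused:", pending[0].Signo, pending[1].Signo)

	f := freshContext{}
	pending = []*SignalInfo{f.Switch(2), f.Switch(11)}
	fmt.Println("fresh: ", pending[0].Signo, pending[1].Signo)
}
```

With the reusing context, the first queued entry is silently overwritten by the second signal, which is the failure mode the commit message describes.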
Chenggang faa34a0738 platform/kvm: Get max vcpu number dynamically by ioctl
Older kernel versions, such as 4.4, support only 255 vCPUs.
When gVisor runs on these kernels, it can panic because the
vCPU ID and vCPU count exceed max_vcpus.
Use ioctl(vmfd, _KVM_CHECK_EXTENSION, _KVM_CAP_MAX_VCPUS) to get the
maximum number of vCPUs dynamically.

Change-Id: I50dd859a11b1c2cea854a8e27d4bf11a411aa45c
PiperOrigin-RevId: 212929704
2018-09-13 21:47:11 -07:00
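The query in commit faa34a0738 can be sketched as below. This is Linux-only and assumes the request and capability numbers from linux/kvm.h; maxVCPUs and its fallback parameter are hypothetical, and the example issues the ioctl on the /dev/kvm descriptor for simplicity where the commit uses the VM descriptor.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const (
	kvmCheckExtension = 0xae03 // KVM_CHECK_EXTENSION, _IO(0xAE, 0x03)
	kvmCapMaxVCPUs    = 66     // KVM_CAP_MAX_VCPUS
)

// maxVCPUs asks KVM for the maximum number of vCPUs. If the ioctl
// fails or reports that the capability is absent (0), the caller's
// fallback is returned instead.
func maxVCPUs(fd uintptr, fallback int) int {
	n, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd, kvmCheckExtension, kvmCapMaxVCPUs)
	if errno != 0 || n == 0 {
		return fallback
	}
	return int(n)
}

func main() {
	f, err := os.OpenFile("/dev/kvm", os.O_RDWR, 0)
	if err != nil {
		fmt.Println("KVM device not available:", err)
		return
	}
	defer f.Close()
	fmt.Println("max vCPUs:", maxVCPUs(f.Fd(), 255))
}
```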
Nicolas Lacasse 6cc9b311af platform: Pass device fd into platform constructor.
We were previously opening the platform device (i.e. /dev/kvm) inside the
platform constructor (i.e. kvm.New). This requires that we have RW access to
the platform device when constructing the platform.

However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.

This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.

PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
2018-09-11 13:09:46 -07:00
Jamie Liu a29c39aa62 Map committed chunks concurrently in FileMem.LoadFrom.
PiperOrigin-RevId: 212345401
Change-Id: Iac626ee87ba312df88ab1019ade6ecd62c04c75c
2018-09-10 15:23:44 -07:00
Michael Pratt 25a8e13a78 Bump to Go 1.11
The procid offset is unchanged.

PiperOrigin-RevId: 210551969
Change-Id: I33ba1ce56c2f5631b712417d870aa65ef24e6022
2018-08-28 09:22:41 -07:00
Adin Scannell a7a8d07d7d Add separate Recycle method for allocator.
This improves debugging for pagetable-related issues.

PiperOrigin-RevId: 209827795
Change-Id: I4cfa11664b0b52f26f6bc90a14c5bb106f01e038
2018-08-22 14:16:04 -07:00
Adin Scannell dbbe9ec915 Protect PCIDs with a mutex.
Because the Drop method may be called across vCPUs, it is necessary to protect
the PCID database with a mutex to prevent concurrent modification. The PCID is
assigned prior to entersyscall, so it's safe to block.

PiperOrigin-RevId: 207992864
Change-Id: I8b36d55106981f51e30dcf03e12886330bb79d67
2018-08-08 21:29:19 -07:00
ShiruRen 3ec074897f Fix a bug in PCIDs.Assign
Store the new assigned pcid in p.cache[pt].

Signed-off-by: ShiruRen <renshiru2000@gmail.com>

Change-Id: I4aee4e06559e429fb5e90cb9fe28b36139e3b4b6
PiperOrigin-RevId: 207563833
2018-08-06 10:11:56 -07:00
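Taken together, commits dbbe9ec915 and 3ec074897f describe a mutex-protected PCID cache in which Assign records the newly assigned PCID. A minimal sketch of that shape, with hypothetical field names and a simplified free list rather than the repository's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// PageTables is a placeholder for a set of page tables.
type PageTables struct {
	root uintptr
}

// PCIDs tracks which PCID is assigned to which page tables.
type PCIDs struct {
	mu    sync.Mutex // guards cache and avail; Drop may run on another vCPU
	cache map[*PageTables]uint16
	avail []uint16
}

// NewPCIDs returns a pool holding PCIDs start..start+size-1.
func NewPCIDs(start, size uint16) *PCIDs {
	p := &PCIDs{cache: make(map[*PageTables]uint16)}
	for i := uint16(0); i < size; i++ {
		p.avail = append(p.avail, start+i)
	}
	return p
}

// Assign returns the PCID for pt, allocating one if needed.
// The second result is false when the pool is exhausted.
func (p *PCIDs) Assign(pt *PageTables) (uint16, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if pcid, ok := p.cache[pt]; ok {
		return pcid, true
	}
	if len(p.avail) == 0 {
		return 0, false
	}
	pcid := p.avail[len(p.avail)-1]
	p.avail = p.avail[:len(p.avail)-1]
	p.cache[pt] = pcid // the store that the bug fix adds
	return pcid, true
}

// Drop releases the PCID held by pt, if any.
func (p *PCIDs) Drop(pt *PageTables) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if pcid, ok := p.cache[pt]; ok {
		delete(p.cache, pt)
		p.avail = append(p.avail, pcid)
	}
}

func main() {
	p := NewPCIDs(1, 2)
	a, b := &PageTables{root: 0x1000}, &PageTables{root: 0x2000}
	fmt.Println(p.Assign(a))
	fmt.Println(p.Assign(a)) // same PCID again, thanks to the cache store
	fmt.Println(p.Assign(b))
}
```

Without the cache store, every call to Assign for the same page tables would consume a fresh PCID and quickly exhaust the pool.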
Zhaozhong Ni 57d0fcbdbf Automated rollback of changelist 207037226
PiperOrigin-RevId: 207125440
Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02 10:42:48 -07:00
Michael Pratt 60add78980 Automated rollback of changelist 207007153
PiperOrigin-RevId: 207037226
Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01 19:57:32 -07:00
Zhaozhong Ni b9e1cf8404 stateify: convert all packages to use explicit mode.
PiperOrigin-RevId: 207007153
Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-08-01 15:43:24 -07:00
Zhaozhong Ni be7fcbc558 stateify: support explicit annotation mode; convert refs and stack packages.
We have been unnecessarily creating too many savable types implicitly.

PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
2018-07-27 10:17:21 -07:00
Fabricio Voznika d7a34790a0 Add KVM and overlay dimensions to container_test
PiperOrigin-RevId: 205714667
Change-Id: I317a2ca98ac3bdad97c4790fcc61b004757d99ef
2018-07-23 13:31:42 -07:00
Michael Pratt 733ebe7c09 Merge FileMem.usage in IncRef
Per the doc, usage must be kept maximally merged. Beyond that, it is simply a
good idea to keep fragmentation in usage to a minimum.

The glibc malloc allocator allocates one page at a time, potentially causing
lots of fragmentation. However, those pages are likely to have the same number
of references, often making it possible to merge ranges.

PiperOrigin-RevId: 204960339
Change-Id: I03a050cf771c29a4f05b36eaf75b1a09c9465e14
2018-07-17 13:03:59 -07:00
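The merging that commit 733ebe7c09 refers to can be illustrated on a plain sorted slice. This is a sketch under assumptions: the real usage set is a segment set, and segment and mergeAdjacent are hypothetical names.

```go
package main

import "fmt"

// segment describes a half-open range [start, end) of file offsets
// whose pages all share the same reference count.
type segment struct {
	start, end uint64
	refs       uint64
}

// mergeAdjacent coalesces neighbouring segments that touch and carry
// the same reference count. segs must be sorted and non-overlapping.
func mergeAdjacent(segs []segment) []segment {
	var out []segment
	for _, s := range segs {
		if n := len(out); n > 0 && out[n-1].end == s.start && out[n-1].refs == s.refs {
			out[n-1].end = s.end // extend the previous segment
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// Three single-page allocations; the first two share a refcount.
	segs := []segment{
		{0x0000, 0x1000, 1},
		{0x1000, 0x2000, 1},
		{0x2000, 0x3000, 2},
	}
	fmt.Println(mergeAdjacent(segs))
}
```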
Adin Scannell 29e00c943a Add CPUID faulting for ptrace and KVM.
PiperOrigin-RevId: 204858314
Change-Id: I8252bf8de3232a7a27af51076139b585e73276d4
2018-07-16 22:02:58 -07:00
Michael Pratt 14d06064d2 Start allocation and reclaim scans only where they may find a match
If usageSet is heavily fragmented, findUnallocatedRange and findReclaimable
can spend excessive cycles linearly scanning the set for unallocated/free
pages.

Improve common cases by beginning the scan only at the first page that could
possibly contain an unallocated/free page. This metadata only guarantees that
there is no lower unallocated/free page, but a scan may still be required
(especially for multi-page allocations).

That said, this heuristic can still provide significant performance
improvements for certain applications.

PiperOrigin-RevId: 204841833
Change-Id: Ic41ad33bf9537ecd673a6f5852ab353bf63ea1e6
2018-07-16 18:19:01 -07:00
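The heuristic in commit 14d06064d2 amounts to remembering a lower bound below which no free page exists and starting scans there. A simplified sketch over a boolean page map, with hypothetical names (pageSet, minFree) rather than the actual usageSet code:

```go
package main

import "fmt"

// pageSet tracks page usage plus a hint: no free page exists at an
// index below minFree. A scan from minFree may still be required.
type pageSet struct {
	used    []bool
	minFree int
}

// findFree returns the first index of n contiguous free pages,
// scanning from minFree, or -1 if there is no such run.
func (s *pageSet) findFree(n int) int {
	run := 0
	for i := s.minFree; i < len(s.used); i++ {
		if s.used[i] {
			run = 0
			continue
		}
		run++
		if run == n {
			return i - n + 1
		}
	}
	return -1
}

// alloc marks pages [start, start+n) used and advances the hint past
// any pages that are now known to be in use.
func (s *pageSet) alloc(start, n int) {
	for i := start; i < start+n; i++ {
		s.used[i] = true
	}
	for s.minFree < len(s.used) && s.used[s.minFree] {
		s.minFree++
	}
}

// free marks pages [start, start+n) free and lowers the hint if needed.
func (s *pageSet) free(start, n int) {
	for i := start; i < start+n; i++ {
		s.used[i] = false
	}
	if start < s.minFree {
		s.minFree = start
	}
}

func main() {
	s := &pageSet{used: make([]bool, 8)}
	s.alloc(0, 3)
	fmt.Println("hint:", s.minFree, "two-page run at:", s.findFree(2))
	s.free(1, 1)
	fmt.Println("hint:", s.minFree, "two-page run at:", s.findFree(2))
}
```

As the commit message notes, the hint only guarantees that nothing lower is free; a multi-page request may still have to scan past the hinted page.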
Jamie Liu ee0ef506d4 Add MemoryManager.Pin.
PiperOrigin-RevId: 204162313
Change-Id: Ib0593dde88ac33e222c12d0dca6733ef1f1035dc
2018-07-11 11:52:09 -07:00
Adin Scannell dc33d71f8c Change SIGCHLD to SIGKILL in ptrace stubs.
If the child stubs are killed by any unmaskable signal (e.g. SIGKILL), then
the parent process will similarly be killed, resulting in the death of all
other stubs.

The effect of this is that if the OOM killer selects and kills a stub, the
effect is the same as though the OOM killer selected and killed the sentry.

PiperOrigin-RevId: 202219984
Change-Id: I0b638ce7e59e0a0f4d5cde12a7d05242673049d7
2018-06-26 16:54:44 -07:00
Adin Scannell be76cad5bc Make KVM more scalable by removing CPU cap.
Instead, CPUs will be created dynamically. We also allow a relatively
efficient mechanism for stealing and notifying when a vCPU becomes
available via unlock.

Since the number of vCPUs is no longer fixed at machine creation time,
we make the dirtySet packing more efficient. This has the pleasant side
effect of cutting out the unsafe address space code.

PiperOrigin-RevId: 201266691
Change-Id: I275c73525a4f38e3714b9ac0fd88731c26adfe66
2018-06-19 17:00:30 -07:00
Adin Scannell b31ac4e1df Use notify explicitly on unlock path.
There are circumstances under which the redpill call will not generate
the appropriate action and notification. Replace this call with an
explicit notification, which is guaranteed to transition as well as
perform the futex wake.

PiperOrigin-RevId: 200726934
Change-Id: Ie19e008a6007692dd7335a31a8b59f0af6e54aaa
2018-06-15 09:30:08 -07:00
Adin Scannell 7b7b199ed0 Deflake kvm_test.
PiperOrigin-RevId: 200439846
Change-Id: I9970fe0716cb02f0f41b754891d55db7e0729f56
2018-06-13 13:05:33 -07:00
Jamie Liu 55b9058456 Log filemem state when panicking due to invalid refcount.
PiperOrigin-RevId: 200408305
Change-Id: I676ee49ec77697105723577928c7f82088cd378e
2018-06-13 10:03:54 -07:00
Adin Scannell 41f766893a Minor ring0 interface cleanup.
- Remove unused methods.
- Provide declaration for asm function.

PiperOrigin-RevId: 200146850
Change-Id: Ic455c96ffe0d2e78ef15f824eb65d7de705b054a
2018-06-11 18:17:15 -07:00
Adin Scannell 1397a413b4 Make page tables split-safe.
In order to minimize the likelihood of exit during page table
modifications, make the full set of page table functions split-safe.
This is not strictly necessary (and you may still incur splits due to
allocations from the allocator pool) but should make retries a very rare
occurrence.

PiperOrigin-RevId: 200146688
Change-Id: I8fa36aa16b807beda2f0b057be60038258e8d597
2018-06-11 18:15:14 -07:00
Adin Scannell 09b0a9c320 Handle all exception vectors.
PiperOrigin-RevId: 200144655
Change-Id: I5a753c74b75007b7714d6fe34aa0d2e845dc5c41
2018-06-11 17:57:19 -07:00
Adin Scannell c0ab059e7b Fix kernel flags handling and add missing vectors.
PiperOrigin-RevId: 199877174
Change-Id: I9d19ea301608c2b989df0a6123abb1e779427853
2018-06-08 17:51:50 -07:00
Adin Scannell d269845159 Ensure guest-mode for page table modifications.
Because of the KVM shadow page table implementation, modifications made
to guest page tables from host mode may not be synchronized correctly,
resulting in undefined behavior. This is a KVM bug: page table pages
should also be tracked for host modifications and resynced appropriately
(e.g. the guest could "DMA" into a page table page in theory).

However, since we can't rely on this being fixed everywhere, work around
the issue by forcing page table modifications to be in guest mode. This
will generally be the case anyways, but now if an exit occurs during
modifications, we will re-enter and perform the modifications again.

PiperOrigin-RevId: 199587895
Change-Id: I83c20b4cf2a9f9fa56f59f34939601dd34538fb0
2018-06-06 23:26:14 -07:00
Adin Scannell 3374849cb5 Split PCID implementation from page tables.
Instead of associating a single PCID with each set of page tables (which
will reach the maximum quickly), allow a dynamic pool for each vCPU.
This is the same way that Linux operates. We also split management of
PCIDs out of the page tables themselves for simplicity.

PiperOrigin-RevId: 199585631
Change-Id: I42f3486ada3cb2a26f623c65ac279b473ae63201
2018-06-06 22:52:55 -07:00
Adin Scannell 1b5062263b Add allocator abstraction for page tables.
In order to prevent possible garbage collection and reuse of page table
pages prior to invalidation, introduce a formal allocator abstraction
that can ensure entries are held during a single traversal. This also
cleans up the abstraction and splits it out of the machine itself.

PiperOrigin-RevId: 199581636
Change-Id: I2257d5d7ffd9c36f9b7ecd42f769261baeaf115c
2018-06-06 21:48:24 -07:00
Adin Scannell 659b10d1a6 Move page tables lock into the address space.
This is necessary to prevent races with invalidation. It is currently
possible that page tables are garbage collected while paging caches
refer to them. We must ensure that pages are held until caches can be
invalidated. This change alone does not achieve that goal, but moving locking
outside the page tables themselves is a prerequisite.

PiperOrigin-RevId: 198920784
Change-Id: I66fffecd49cb14aa2e676a84a68cabfc0c8b3e9a
2018-06-01 13:51:16 -07:00