gvisor

Commit Graph

Author	SHA1	Message	Date
Ian Gudger	167f2401c4	Merge host.endpoint into host.ConnectedEndpoint host.endpoint contained duplicated logic from the socketpair implementation and host.ConnectedEndpoint. Remove host.endpoint in favor of a host.ConnectedEndpoint wrapped in a socketpair end. PiperOrigin-RevId: 217240096 Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c	2018-10-15 17:48:11 -07:00
Nicolas Lacasse	ecd94ea7a6	Clean up Rename and Unlink checks for EBUSY. - Change Dirent.Busy => Dirent.isMountPoint. The function body is unchanged, and it is no longer exported. - fs.MayDelete now checks that the victim is not the process root. This aligns with Linux's namei.c:may_delete(). - Fix "is-ancestor" checks to actually compare all ancestors, not just the parents. - Fix handling of paths that end in dots, which are handled differently in Rename vs. Unlink. PiperOrigin-RevId: 217239274 Change-Id: I7a0eb768e70a1b2915017ce54f7f95cbf8edf1fb	2018-10-15 17:42:30 -07:00
Zhaozhong Ni	4ea69fce8d	sentry: save fs.Dirent deleted info. PiperOrigin-RevId: 217155458 Change-Id: Id3265b1ec784787039e2131c80254ac4937330c7	2018-10-15 09:31:32 -07:00
Kevin Krakauer	47d3862c33	runsc: Support retrieving MTU via netdevice ioctl. This enables ifconfig to display MTU. PiperOrigin-RevId: 216917021 Change-Id: Id513b23d9d76899bcb71b0b6a25036f41629a923	2018-10-12 13:58:32 -07:00
Zhaozhong Ni	0bfa03d61c	sentry: allow saving of unlinked files with open fds on virtual fs. PiperOrigin-RevId: 216733414 Change-Id: I33cd3eb818f0c39717d6656fcdfff6050b37ebb0	2018-10-11 11:41:44 -07:00
Adin Scannell	463e73d46d	Add seccomp filter configuration to ptrace stubs. This is a defense-in-depth measure. If the sentry is compromised, this prevents system call injection to the stubs. There is some complexity with respect to ptrace and seccomp interactions, so this protection is not really available for kernel versions < 4.8; this is detected dynamically. Note that this also solves the vsyscall emulation issue by adding in appropriate trapping for those system calls. It does mean that a compromised sentry could theoretically inject these into the stub (ignoring the trap and resume, thereby allowing execution), but they are harmless. PiperOrigin-RevId: 216647581 Change-Id: Id06c232cbac1f9489b1803ec97f83097fcba8eb8	2018-10-10 22:40:28 -07:00
Michael Pratt	ddb34b3690	Enforce message size limits and avoid host calls with too many iovecs Currently, in the face of FileMem fragmentation and a large sendmsg or recvmsg call, host sockets may pass > 1024 iovecs to the host, which will immediately cause the host to return EMSGSIZE. When we detect this case, use a single intermediate buffer to pass to the kernel, copying to/from the src/dst buffer. To avoid creating unbounded intermediate buffers, enforce message size checks and truncation w.r.t. the send buffer size. The same functionality is added to netstack unix sockets for feature parity. PiperOrigin-RevId: 216590198 Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7	2018-10-10 14:10:17 -07:00
Nicolas Lacasse	b78552d30e	When creating a new process group, add it to the session. PiperOrigin-RevId: 216554791 Change-Id: Ia6b7a2e6eaad80a81b2a8f2e3241e93ebc2bda35	2018-10-10 10:42:11 -07:00
Ian Gudger	c36d2ef373	Add new netstack metrics to the sentry PiperOrigin-RevId: 216431260 Change-Id: Ia6e5c8d506940148d10ff2884cf4440f470e5820	2018-10-09 15:12:44 -07:00
Brian Geffon	acf7a95189	Add memunit to sysinfo(2). Also properly add padding after Procs in the linux.Sysinfo structure. This will be implicitly padded to 64bits so we need to do the same. PiperOrigin-RevId: 216372907 Change-Id: I6eb6a27800da61d8f7b7b6e87bf0391a48fdb475	2018-10-09 09:52:14 -07:00
Michael Pratt	569c2b06c4	Statfs Namelen should be NAME_MAX not PATH_MAX We accidentally set the wrong maximum. I've also added PATH_MAX and NAME_MAX to the linux abi package. PiperOrigin-RevId: 216221311 Change-Id: I44805fcf21508831809692184a0eba4cee469633	2018-10-08 11:39:54 -07:00
Jamie Liu	e9e8be6613	Implement shared futexes. - Shared futex objects on shared mappings are represented by Mappable + offset, analogous to Linux's use of inode + offset. Add type futex.Key, and change the futex.Manager bucket API to use futex.Keys instead of addresses. - Extend the futex.Checker interface to be able to return Keys for memory mappings. It returns Keys rather than just mappings because whether the address or the target of the mapping is used in the Key depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this matters because using mapping target for a futex on a MAP_PRIVATE mapping causes it to stop working across COW-breaking. - futex.Manager.WaitComplete depends on atomic updates to futex.Waiter.addr to determine when it has locked the right bucket, which is much less straightforward for struct futex.Waiter.key. Switch to an atomically-accessed futex.Waiter.bucket pointer. - futex.Manager.Wake now needs to take a futex.Checker to resolve addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit path to perform a shared futex wakeup (Linux: kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid, FUTEX_WAKE, ...)). This is a problem because futexChecker is in the syscalls/linux package. Move it to kernel. PiperOrigin-RevId: 216207039 Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf	2018-10-08 10:20:38 -07:00
Ian Gudger	beac59b37a	Fix panic if FIOASYNC callback is registered and triggered without target PiperOrigin-RevId: 215674589 Change-Id: I4f8871b64c570dc6da448d2fe351cec8a406efeb	2018-10-03 20:22:31 -07:00
Nicolas Lacasse	213f6688a5	Implement TIOCSCTTY ioctl as a noop. PiperOrigin-RevId: 215658757 Change-Id: If63b33293f3e53a7f607ae72daa79e2b7ef6fcfd	2018-10-03 17:29:56 -07:00
Ian Gudger	4fef31f96c	Add S/R support for FIOASYNC PiperOrigin-RevId: 215655197 Change-Id: I668b1bc7c29daaf2999f8f759138bcbb09c4de6f	2018-10-03 17:03:09 -07:00
Nicolas Lacasse	f1c01ed886	runsc: Support job control signals in "exec -it". Terminal support in runsc relies on host tty file descriptors that are imported into the sandbox. Application tty ioctls are sent directly to the host fd. However, those host tty ioctls are associated in the host kernel with a host process (in this case runsc), and the host kernel intercepts job control characters like ^C and sends signals to the host process. Thus, typing ^C into a "runsc exec" shell will send a SIGINT to the runsc process. This change makes "runsc exec" handle all signals, and forward them into the sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is associated with a particular container process in the sandbox, the signal must be associated with the same container process. One big difficulty is that the signal should not necessarily be sent to the sandbox process started by "exec", but instead must be sent to the foreground process group for the tty. For example, we may exec "bash", and from bash call "sleep 100". A ^C at this point should SIGINT sleep, not bash. To handle this, tty files inside the sandbox must keep track of their foreground process group, which is set/get via ioctls. When a ContainerSignal urpc comes in, we look up the foreground process group via the tty file. Unfortunately, this means we have to expose and cache the tty file in the Loader. Note that "runsc exec" now handles signals properly, but "runsc run" does not. That will come in a later CL, as this one is complex enough already. Example: root@:/usr/local/apache2# sleep 100 ^C root@:/usr/local/apache2# sleep 100 ^Z [1]+ Stopped sleep 100 root@:/usr/local/apache2# fg sleep 100 ^C root@:/usr/local/apache2# PiperOrigin-RevId: 215334554 Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39	2018-10-01 22:06:56 -07:00
Michael Pratt	0400e54592	Add itimer types to linux package, strace PiperOrigin-RevId: 215278262 Change-Id: Icd10384c99802be6097be938196044386441e282	2018-10-01 14:16:53 -07:00
Nicolas Lacasse	07aa040842	Fix possible panic in control.Processes. There was a race where we checked task.Parent() != nil, and then later called task.Parent() again, assuming that it is not nil. If the task is exiting, the parent may have been set to nil in between the two calls, causing a panic. This CL changes the code to only call task.Parent() once. PiperOrigin-RevId: 215274456 Change-Id: Ib5a537312c917773265ec72016014f7bc59a5f59	2018-10-01 13:56:07 -07:00
Michael Pratt	3ff24b4f2c	Require AF_UNIX sockets from the gofer host.endpoint already has the check, but it is missing from host.ConnectedEndpoint. PiperOrigin-RevId: 214962762 Change-Id: I88bb13a5c5871775e4e7bf2608433df8a3d348e6	2018-09-28 11:03:11 -07:00
Sepehr Raissian	c17ea8c6e2	Block for link address resolution Previously, if address resolution for UDP or Ping sockets required sending packets using Write in Transport layer, Resolve would return ErrWouldBlock and Write would return ErrNoLinkAddress. Meanwhile startAddressResolution would run in background. Further calls to Write using same address would also return ErrNoLinkAddress until resolution has been completed successfully. Since Write is not allowed to block and System Calls need to be interruptible in System Call layer, the caller to Write is responsible for blocking upon return of ErrWouldBlock. Now, when startAddressResolution is called a notification channel for the completion of the address resolution is returned. The channel will traverse up to the calling function of Write as well as ErrNoLinkAddress. Once address resolution is complete (success or not) the channel is closed. The caller would call Write again to send packets and check if address resolution was completed successfully or not. Fixes google/gvisor#5 Change-Id: Idafaf31982bee1915ca084da39ae7bd468cebd93 PiperOrigin-RevId: 214962200	2018-09-28 11:00:16 -07:00
Nicolas Lacasse	b709d23987	Forward ioctl(TCSETSF) calls on host ttys to the host kernel. We already forward TCSETS and TCSETSW. TCSETSF is roughly equivalent but discards pending input. The filters were relaxed to allow host ioctls with TCSETSF argument. This fixes programs like "passwd" that prevent user input from being displayed on the terminal. Before: root@b8a0240fc836:/# passwd Enter new UNIX password: 123 Retype new UNIX password: 123 passwd: password updated successfully After: root@ae6f5dabe402:/# passwd Enter new UNIX password: Retype new UNIX password: passwd: password updated successfully PiperOrigin-RevId: 214869788 Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58	2018-09-27 18:17:38 -07:00
Fabricio Voznika	491faac03b	Implement 'runsc kill --all' In order to implement kill --all correctly, the Sentry needs to track all tasks that belong to a given container. This change introduces ContainerID to the task, that gets inherited by all children. 'kill --all' then iterates over all tasks comparing the ContainerID field to find all processes that need to be signalled. PiperOrigin-RevId: 214841768 Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6	2018-09-27 15:00:58 -07:00
Zhaozhong Ni	234f36b6f2	sentry: export cpuTime function. PiperOrigin-RevId: 214798278 Change-Id: Id59d1ceb35037cda0689d3a1c4844e96c6957615	2018-09-27 12:52:25 -07:00
Fabricio Voznika	fca9a390db	Return correct parent PID Old code was returning ID of the thread that created the child process. It should be returning the ID of the parent process instead. PiperOrigin-RevId: 214720910 Change-Id: I95715c535bcf468ecf1ae771cccd04a4cd345b36	2018-09-26 22:00:04 -07:00
Nicolas Lacasse	fd222d62ed	Short-circuit Readdir calls on overlay files when the dirent is frozen. If we have an overlay file whose corresponding Dirent is frozen, then we should not bother calling Readdir on the upper or lower files, since DirentReaddir will calculate children based on the frozen Dirent tree. A test was added that fails without this change. PiperOrigin-RevId: 213531215 Change-Id: I4d6c98f1416541a476a34418f664ba58f936a81d	2018-09-18 15:42:22 -07:00
Brian Geffon	ed08597d12	Allow for MSG_CTRUNC in input flags for recv. PiperOrigin-RevId: 213481363 Change-Id: I8150ea20cebeb207afe031ed146244de9209e745	2018-09-18 11:14:37 -07:00
Fabricio Voznika	da20559137	Provide better message when memfd_create fails with ENOSYS Updates #100 PiperOrigin-RevId: 213414821 Change-Id: I90c2e6c18c54a6afcd7ad6f409f670aa31577d37	2018-09-18 02:09:28 -07:00
Fabricio Voznika	5d9816be41	Remove memory usage static init panic() during init() can be hard to debug. Updates #100 PiperOrigin-RevId: 213391932 Change-Id: Ic103f1981c5b48f1e12da3b42e696e84ffac02a9	2018-09-17 21:34:37 -07:00
Kevin Krakauer	bb88c187c5	runsc: Enable waiting on exited processes. This makes `runsc wait` behave more like waitpid()/wait4() in that: - Once a process has run to completion, you can wait on it and get its exit code. - Processes not waited on will consume memory (like a zombie process) PiperOrigin-RevId: 213358916 Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558	2018-09-17 16:25:24 -07:00
Ian Gudger	ab6fa44588	Allow kernel.(*Task).Block to accept an extract only channel PiperOrigin-RevId: 213328293 Change-Id: I4164133e6f709ecdb89ffbb5f7df3324c273860a	2018-09-17 13:35:54 -07:00
Michael Pratt	d639c3d61b	Allow NULL data in mount(2) PiperOrigin-RevId: 213315267 Change-Id: I7562bcd81fb22e90aa9c7dd9eeb94803fcb8c5af	2018-09-17 12:16:29 -07:00
newmanwang	de5a590ee2	Avoid reuse of pending SignalInfo objects runApp.execute -> Task.SendSignal -> sendSignalLocked -> sendSignalTimerLocked -> pendingSignals.enqueue assumes that it owns the arch.SignalInfo returned from platform.Context.Switch. On the other hand, ptrace.context.Switch assumes that it owns the returned SignalInfo and can safely reuse it on the next call to Switch. The KVM platform always returns a unique SignalInfo. This becomes a problem when the returned signal is not immediately delivered, allowing a future signal in Switch to change the previous pending SignalInfo. This is noticeable in #38 when external SIGINTs are delivered from the PTY slave FD. Note that the ptrace stubs are in the same process group as the sentry, so they are eligible to receive the PTY signals. This should probably change, but is not the only possible cause of this bug. Updates #38 Original change by newmanwang <wcs1011@gmail.com>, updated by Michael Pratt <mpratt@google.com>. Change-Id: I5383840272309df70a29f67b25e8221f933622cd PiperOrigin-RevId: 213071072	2018-09-14 17:39:25 -07:00
Michael Pratt	3aa50f18a4	Reuse readlink parameter, add sockaddr max. PiperOrigin-RevId: 213058623 Change-Id: I522598c655d633b9330990951ff1c54d1023ec29	2018-09-14 16:00:02 -07:00
Nicolas Lacasse	b84bfa570d	Make gVisor hard link check match Linux's. Linux permits hard-linking if the target is owned by the user OR the target has Read+Write permission. PiperOrigin-RevId: 213024613 Change-Id: If642066317b568b99084edd33ee4e8822ec9cbb3	2018-09-14 12:29:46 -07:00
Jamie Liu	0380bcb3a4	Fix interaction between rt_sigtimedwait and ignored signals. PiperOrigin-RevId: 213011782 Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af	2018-09-14 11:10:50 -07:00
Chenggang	faa34a0738	platform/kvm: Get max vcpu number dynamically by ioctl Older kernel versions, such as 4.4, support only 255 vcpus. When gVisor is run on these kernels, it could panic because the vcpu id and vcpu number exceed max_vcpus. Use ioctl(vmfd, _KVM_CHECK_EXTENSION, _KVM_CAP_MAX_VCPUS) to get max vcpus number dynamically. Change-Id: I50dd859a11b1c2cea854a8e27d4bf11a411aa45c PiperOrigin-RevId: 212929704	2018-09-13 21:47:11 -07:00
Ian Gudger	29a7271f5d	Plumb monotonic time to netstack Netstack needs to be portable, so this seems to be preferable to using raw system calls. PiperOrigin-RevId: 212917409 Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c	2018-09-13 19:12:15 -07:00
Rahat Mahmood	adf8f33970	Extend memory usage events to report mapped memory usage. PiperOrigin-RevId: 212887555 Change-Id: I3545383ce903cbe9f00d9b5288d9ef9a049b9f4f	2018-09-13 15:16:47 -07:00
Michael Pratt	9c6b38e295	Format struct itimerspec PiperOrigin-RevId: 212874745 Change-Id: I0c3e8e6a9e8976631cee03bf0b8891b336ddb8c8	2018-09-13 14:07:47 -07:00
Nicolas Lacasse	e2d79480f5	initArgs must hold a reference on the Root if it is not nil. The contract in ExecArgs says that a reference on ExecArgs.Root must be held for the lifetime of the struct, but the caller is free to drop the ref after that. As a result, proc.Exec must take an additional ref on Root when it constructs the CreateProcessArgs, since that holds a pointer to Root as well. That ref is dropped in CreateProcess. PiperOrigin-RevId: 212828348 Change-Id: I7f44a612f337ff51a02b873b8a845d3119408707	2018-09-13 09:50:35 -07:00
Kevin Krakauer	2eff1fdd06	runsc: Add exec flag that specifies where to save the sandbox-internal pid. This is different from the existing -pid-file flag, which saves a host pid. PiperOrigin-RevId: 212713968 Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0	2018-09-12 15:23:35 -07:00
Nicolas Lacasse	6cc9b311af	platform: Pass device fd into platform constructor. We were previously opening the platform device (i.e. /dev/kvm) inside the platform constructor (i.e. kvm.New). This requires that we have RW access to the platform device when constructing the platform. However, now that the runsc sandbox process runs as user "nobody", it is not able to open the platform device. This CL changes the kvm constructor to take the platform device FD, rather than opening the device file itself. The device file is opened outside of the sandbox and passed to the sandbox process. PiperOrigin-RevId: 212505804 Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910	2018-09-11 13:09:46 -07:00
Jamie Liu	a29c39aa62	Map committed chunks concurrently in FileMem.LoadFrom. PiperOrigin-RevId: 212345401 Change-Id: Iac626ee87ba312df88ab1019ade6ecd62c04c75c	2018-09-10 15:23:44 -07:00
Fabricio Voznika	7e9e6745ca	Allow '/dev/zero' to be mapped with unaligned length PiperOrigin-RevId: 212321271 Change-Id: I79d71c2e6f4b8fcd3b9b923fe96c2256755f4c48	2018-09-10 13:24:55 -07:00
Michael Pratt	7045828a31	Update cleanup TODO PiperOrigin-RevId: 212068327 Change-Id: I3f360cdf7d6caa1c96fae68ae3a1caaf440f0cbe	2018-09-07 18:14:57 -07:00
Nicolas Lacasse	9751b800a6	runsc: Support multi-container exec. We must use a context.Context with a Root Dirent that corresponds to the container's chroot. Previously we were using the root context, which does not have a chroot. Getting the correct context required refactoring some of the path-lookup code. We can't lookup the path without a context.Context, which requires kernel.CreateProcArgs, which we only get inside control.Execute. So we have to do the path lookup much later than we previously were. PiperOrigin-RevId: 212064734 Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537	2018-09-07 17:39:54 -07:00
Fabricio Voznika	172860a059	Add 'Starting gVisor...' message to syslog This allows applications to verify they are running with gVisor. It also helps debugging when running with a mix of container runtimes. Closes #54 PiperOrigin-RevId: 212059457 Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec	2018-09-07 16:59:27 -07:00
Fabricio Voznika	f895cb4d8b	Use root abstract socket namespace for exec PiperOrigin-RevId: 211999211 Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d	2018-09-07 10:45:55 -07:00
Michael Pratt	169e2efc5a	Continue handling signals after disabling forwarding Before destroying the Kernel, we disable signal forwarding, relinquishing control to the Go runtime. External signals that arrive after disabling forwarding but before the sandbox exits thus may use runtime.raise (i.e., tkill(2)) and violate the syscall filters. Adjust forwardSignals to handle signals received after disabling forwarding the same way they are handled before starting forwarding. i.e., by implementing the standard Go runtime behavior using tgkill(2) instead of tkill(2). This also makes the stop callback block until forwarding actually stops. This isn't required to avoid tkill(2) but is a saner interface. PiperOrigin-RevId: 211995946 Change-Id: I3585841644409260eec23435cf65681ad41f5f03	2018-09-07 10:28:25 -07:00
Nicolas Lacasse	6516b5648b	createProcessArgs.RootFromContext should return process Root if it exists. It was always returning the MountNamespace root, which may be different from the process Root if the process is in a chroot environment. PiperOrigin-RevId: 211862181 Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480	2018-09-06 13:47:49 -07:00
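
Commit 07aa040842 in the list above describes a check-then-use race: task.Parent() was checked for nil and then called a second time, by which point an exiting task's parent could have been cleared. The following is a minimal sketch of that pattern and its fix, not the actual gVisor code. The Task type, its fields, and the parentID helper are hypothetical names introduced only for illustration, and the use of atomic.Pointer (Go 1.19+) is an assumption about how the concurrent clearing might be modeled.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Task is a hypothetical stand-in for a sentry task.
// It is NOT gVisor's kernel.Task.
type Task struct {
	id     int
	parent atomic.Pointer[Task] // may be cleared concurrently when the task exits
}

// Parent returns the current parent, or nil if the parent has been cleared.
func (t *Task) Parent() *Task { return t.parent.Load() }

// parentID calls Parent() exactly once and reuses the result, so the nil
// check and the dereference always see the same value. Calling Parent()
// twice (once to check, once to use) would leave a window in which the
// parent could become nil between the two calls.
func parentID(t *Task) int {
	p := t.Parent()
	if p == nil {
		return 0 // hypothetical convention: 0 means "no parent"
	}
	return p.id
}

func main() {
	parent := &Task{id: 1}
	child := &Task{id: 2}
	child.parent.Store(parent)
	fmt.Println(parentID(child)) // prints 1

	child.parent.Store(nil) // simulate the parent being cleared on exit
	fmt.Println(parentID(child)) // prints 0
}
```

Capturing the result in a local variable is the whole fix in this sketch: the value that was checked is the value that is used, regardless of what other goroutines do afterwards.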


276 Commits