RootFromContext can return a dirent with reference taken, or nil. We must call
DecRef if (and only if) a real dirent is returned.
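
A minimal sketch of the rule, using stand-in types: `dirent`, `rootFromContext`, and `useRoot` are hypothetical names for illustration and are not the sentry's real API.

```go
package main

import "fmt"

// dirent is a hypothetical stand-in for a reference-counted directory entry.
type dirent struct{ refs int }

// DecRef drops one reference.
func (d *dirent) DecRef() { d.refs-- }

// rootFromContext is a hypothetical lookup: it returns the root with an extra
// reference taken, or nil when there is no root.
func rootFromContext(root *dirent) *dirent {
	if root == nil {
		return nil
	}
	root.refs++
	return root
}

// useRoot drops the reference if, and only if, a real dirent came back.
func useRoot(root *dirent) bool {
	r := rootFromContext(root)
	if r == nil {
		return false // nothing was taken, so nothing to drop
	}
	defer r.DecRef()
	return true
}

func main() {
	d := &dirent{refs: 1}
	fmt.Println(useRoot(d), d.refs) // reference count is back to 1
	fmt.Println(useRoot(nil))
}
```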
PiperOrigin-RevId: 242965515
Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save. There is no reason not to save it along with the MountSource.
This did uncover an issue where not all MountSources were properly flushed
before Save. If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first. This CL also fixes that by flushing
all MountSources for all open FDs on Save.
PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.
We were creating these directories without write permissions, which meant that
they were not writable even when underlayed under a writable filesystem. They
should be writable.
PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.
PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
Dirent.exists() is called in Create to check whether a child with the given
name already exists.
Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu
while calling d.Inode.Lookup. During this existence check, a racing Rename()
can acquire d.mu and create a new child of the dirent with the same name.
(Note that the source and destination of the rename must be in the same
directory, otherwise renameMu will be taken preventing the race.) In this
case, d.exists() can return false, even though a child with the same name
actually does exist.
This CL changes d.exists() so that it does not release d.mu while walking, thus
preventing the race with Rename.
It also adds comments noting that lockForRename may not take renameMu if the
source and destination are in the same directory, as this is a bit surprising
(at least it was to me).
PiperOrigin-RevId: 241842579
Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the code point for
ÿ (U+00FF). However, peekRune also returned the length of the byte (1). When
calling utf8.EncodeRune, we only allocated 1 byte, but tried to write the 2-byte
character ÿ.
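
The size mismatch can be reproduced with the standard library alone. This is an illustrative sketch; `encodedLen` is a hypothetical helper, not the original peekRune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes utf8.EncodeRune needs for the rune
// obtained by widening a single raw byte.
func encodedLen(b byte) int {
	return utf8.RuneLen(rune(b))
}

func main() {
	raw := byte(0xff)
	buf := make([]byte, utf8.UTFMax) // large enough for any rune
	n := utf8.EncodeRune(buf, rune(raw))
	fmt.Printf("1 byte in, %d bytes out: % x\n", n, buf[:n])
}
```

Any byte at or above 0x80 widens to a rune that needs two bytes in UTF-8, so a 1-byte destination is too small.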
tl;dr: I apparently didn't understand runes when I wrote this.
PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
Also makes the safemem reading and writing inline, as it makes it easier to see
what locks are held.
PiperOrigin-RevId: 241775201
Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
Also remove comments in InodeOperations that required that implementation of
some Create* operations ensure that the name does not already exist, since
these checks are all centralized in the Dirent.
PiperOrigin-RevId: 241637335
Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
- Document fsutil.CachedFileObject.FD() requirements on access
permissions, and change gofer.inodeFileState.FD() to honor them.
Fixes #147.
- Combine gofer.inodeFileState.readonly and
gofer.inodeFileState.readthrough, and simplify handle caching logic.
- Inline gofer.cachePolicy.cacheHandles into
gofer.inodeFileState.setSharedHandles, because users with access to
gofer.inodeFileState don't necessarily have access to the fs.Inode
(predictably, this is a save/restore problem).
Before this CL:
$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@34d51017ed67:/# /root/repro/runsc-b147
mmap: 0x7f3c01e45000
Segmentation fault
After this CL:
$ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash
root@d3c3cb56bbf9:/# /root/repro/runsc-b147
mmap: 0x7f78987ec000
o
PiperOrigin-RevId: 240818413
Change-Id: I49e1d4a81a0cb9177832b0a9f31a10da722a896b
The start time is the number of clock ticks between the boot time and
application start time.
PiperOrigin-RevId: 240619475
Change-Id: Ic8bd7a73e36627ed563988864b0c551c052492a5
Memfds are simply anonymous tmpfs files with no associated
mounts. Also implementing file seals, which Linux only implements for
memfds at the moment.
PiperOrigin-RevId: 240450031
Change-Id: I31de78b950101ae8d7a13d0e93fe52d98ea06f2f
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate
(although it unsets AccessType.Write if the vma is private). This
somewhat simplifies handling of pmas, since it means only COW-break
needs to replace existing pmas. However, it also means that a MAP_SHARED
mapping of a file opened O_RDWR dirties the file, regardless of the
mapping's permissions and whether or not the mapping is ever actually
written to with I/O that ignores permissions (e.g.
ptrace(PTRACE_POKEDATA)).
To fix this:
- Change the pma-getting path to request only the permissions that are
required for the calling access.
- Change memmap.Mappable.Translate to take requested permissions, and
return allowed permissions. This preserves the existing behavior in the
common cases where the memmap.Mappable isn't
fsutil.CachingInodeOperations and doesn't care if the translated
platform.File pages are written to.
- Change the MM.getPMAsLocked path to support permission upgrading of
pmas outside of copy-on-write.
PiperOrigin-RevId: 240196979
Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
Also, changing queue.writeBuf from a buffer.Bytes to a [][]byte should reduce
copying and reallocating of slices.
PiperOrigin-RevId: 239713547
Change-Id: I6ee5ff19c3ee2662f1af5749cae7b73db0569e96
In the case of a rename replacing an existing destination inode, ramfs
Rename failed to first remove the replaced inode. This caused:
1. A leak of a reference to the inode (making it live indefinitely).
2. For directories, a leak of the replaced directory's .. link to the
parent. This would cause the parent's link count to incorrectly
increase.
(2) is much simpler to test than (1), so that's what I've done.
agentfs has a similar bug with link count only, so the Dirent layer
informs the Inode if this is a replacing rename.
Fixes #133
PiperOrigin-RevId: 239105698
Change-Id: I4450af2462d8ae3339def812287213d2cbeebde0
This is in preparation for improved page cache reclaim, which requires
greater integration between the page cache and page allocator.
PiperOrigin-RevId: 238444706
Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
p9.Twalk.handle() with a non-empty path also stats the walked-to path
anyway, so the preceding GetAttr is completely wasted.
PiperOrigin-RevId: 238440645
Change-Id: I7fbc7536f46b8157639d0d1f491e6aaa9ab688a3
- Redefine some memmap.Mappable, platform.File, and platform.Memory
semantics in terms of File reference counts (no functional change).
- Make AddressSpace.MapFile take a platform.File instead of a raw FD,
and replace platform.File.MapInto with platform.File.FD. This allows
kvm.AddressSpace.MapFile to always use platform.File.MapInternal instead
of maintaining its own (redundant) cache of file mappings in the sentry
address space.
PiperOrigin-RevId: 238044504
Change-Id: Ib73a11e4275c0da0126d0194aa6c6017a9cef64f
Current procfs has some bugs. After executing ls twice, many dirs come
out with same name like "1" or ".". Files like "cpuinfo" disappear.
Here the variable names is a slice with cap() > len(). Sorting after appending
to it does not allocate new space, so it modifies the original slice. Same for m.
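
The aliasing can be shown in isolation. In this sketch the helper names `sortedWithAlias` and `sortedWithCopy` are hypothetical; only the append-then-sort pattern comes from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedWithAlias appends to names and sorts the result. When names has spare
// capacity, append reuses its backing array, so the sort also reorders the
// elements the caller still sees.
func sortedWithAlias(names []string, extra string) []string {
	out := append(names, extra)
	sort.Strings(out)
	return out
}

// sortedWithCopy copies into a fresh slice first, leaving names untouched.
func sortedWithCopy(names []string, extra string) []string {
	out := make([]string, 0, len(names)+1)
	out = append(out, names...)
	out = append(out, extra)
	sort.Strings(out)
	return out
}

func main() {
	names := make([]string, 0, 4) // cap() > len()
	names = append(names, "cpuinfo", "meminfo")
	sortedWithAlias(names, "1")
	fmt.Println(names) // the caller's two entries have changed
}
```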
Signed-off-by: Ruidong Cao <crdfrank@gmail.com>
Change-Id: I83e5cd1c7968c6fe28c35ea4fee497488d4f9eef
PiperOrigin-RevId: 236222270
In addition to simplifying the implementation, this fixes two bugs:
- seqfile.NewSeqFile unconditionally creates an inode with mode 0444,
but {uid,gid}_map have mode 0644.
- idMapSeqFile.Write implements fs.FileOperations.Write ... but it
doesn't implement any other fs.FileOperations methods and is never
used as fs.FileOperations. idMapSeqFile.GetFile() =>
seqfile.SeqFile.GetFile() uses seqfile.seqFileOperations instead,
which rejects all writes.
PiperOrigin-RevId: 234638212
Change-Id: I4568f741ab07929273a009d7e468c8205a8541bc
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.
See drivers/tty/n_tty.c:n_tty_read()=>job_control().
If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.
See drivers/tty/tty_io.c:tty_check_change().
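
The two rules above can be written as a small decision table. This is a sketch under stated assumptions: the type and function names are hypothetical, and checking "ignored or blocked" before "orphaned" on the write path is an assumption about ordering that the text does not spell out.

```go
package main

import "fmt"

type action int

const (
	proceed    action = iota // let the syscall continue
	sendSignal               // deliver SIGTTIN (read) or SIGTTOU (write)
	failEIO                  // fail the syscall with EIO
)

// bgState describes a background process with respect to the relevant signal.
type bgState struct {
	blockedOrIgnored bool // the signal is blocked or ignored
	orphaned         bool // the process group is an orphan
}

// onBackgroundRead follows the read rule: EIO if the signal cannot be
// delivered usefully, otherwise SIGTTIN.
func onBackgroundRead(s bgState) action {
	if s.blockedOrIgnored || s.orphaned {
		return failEIO
	}
	return sendSignal
}

// onBackgroundWrite follows the write/termios/foreground-group rule.
// Assumption: the ignored/blocked check comes first.
func onBackgroundWrite(s bgState) action {
	if s.blockedOrIgnored {
		return proceed
	}
	if s.orphaned {
		return failEIO
	}
	return sendSignal
}

func main() {
	s := bgState{blockedOrIgnored: true}
	fmt.Println(onBackgroundRead(s) == failEIO, onBackgroundWrite(s) == proceed)
}
```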
PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
fs/gofer/inodeOperations.Release does some asynchronous work. Previously it
was calling fs.Async with an anonymous function, which caused the function to
be allocated on the heap. Because Release is relatively hot, this results in a
lot of small allocations and increased GC pressure, noticeable in perf profiles.
This CL adds a new function, AsyncWithContext, which is just like Async, but
passes a context to the async function. It avoids the need for an extra
anonymous function in fs/gofer/inodeOperations.Release. The Async function
itself still requires a single anonymous function.
PiperOrigin-RevId: 233141763
Change-Id: I1dce4a883a7be9a8a5b884db01e654655f16d19c
- Change proc to return envp on overwrite of argv with limitations from
upstream.
- Add unit tests
- Change layout of argv/envp on the stack so that end of argv is contiguous with
beginning of envp.
PiperOrigin-RevId: 232506107
Change-Id: I993880499ab2c1220f6dc456a922235c49304dec
Dirty should be set only when the attribute is changed in the cache
only. Instances where the change was also sent to the backing file
don't need to dirty the attribute.
Also remove the size update during WriteOut, as writing dirty pages would
naturally grow the file if needed.
RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 232068978
Change-Id: I00ba54693a2c7adc06efa9e030faf8f2e8e7f188
This change required making fsutil.HostMappable use
a backing file to ensure the correct FD would be used
for read/write operations.
RELNOTES: relnotes is needed for the parent CL.
PiperOrigin-RevId: 231836164
Change-Id: I8ae9639715529874ea7d80a65e2c711a5b4ce254
Nothing reads them and they can simply get stale.
Generated with:
$ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD
PiperOrigin-RevId: 231818945
Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
We were modifying InodeSimpleAttributes.Unstable.AccessTime without holding
the necessary lock. Luckily for us, InodeSimpleAttributes already has a
NotifyAccess method that will do the update while holding the lock.
In addition, we were holding dfo.dir.mu.Lock while setting AccessTime, which
is unnecessary, so that lock has been removed.
PiperOrigin-RevId: 231278447
Change-Id: I81ed6d3dbc0b18e3f90c1df5e5a9c06132761769
It never actually should have applied to environ (the relevant change in
Linux 4.2 is c2c0bb44620d "proc: fix PAGE_SIZE limit of
/proc/$PID/cmdline"), and we claim to be Linux 4.4 now anyway.
PiperOrigin-RevId: 231250661
Change-Id: I37f9c4280a533d1bcb3eebb7803373ac3c7b9f15
When file size changes outside the sandbox, page cache was not
refreshing file size which is required for cacheRemoteRevalidating.
In fact, cacheRemoteRevalidating should be skipping the cache
completely since it's not really benefiting from it. The
cache is already bypassed for unstable attributes (see
cachePolicy.cacheUAttrs). And although the cache is called to
map pages, they will always miss the cache and map directly from
the host.
Created a HostMappable struct that maps directly to the host and
use it for files with cacheRemoteRevalidating.
Closes #124
PiperOrigin-RevId: 230998440
Change-Id: Ic5f632eabe33b47241e05e98c95e9b2090ae08fc
Most of the entries are stubbed out at the moment, but even those were
only displayed if IPv6 support was enabled. The entries should be
displayed with IPv4-support only, and with only loopback devices.
PiperOrigin-RevId: 229946441
Change-Id: I18afaa3af386322787f91bf9d168ab66c01d5a4c
More helper structs have been added to the fsutil package to make it easier to
implement fs.InodeOperations and fs.FileOperations.
PiperOrigin-RevId: 229305982
Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
overlayFileOperations.Readdir was holding overlay.copyMu while calling
DirentReaddir, which then attempts to take the corresponding Dirent.mu,
causing a lock order violation. (See lock order documentation in
fs/copy_up.go.)
We only actually need to hold copyMu during readdirEntries(), so holding the
lock is moved in there, thus avoiding the lock order violation.
A new lock was added to protect overlayFileOperations.dirCache. We were
inadvertently relying on copyMu to protect this. There is no reason it should
not have its own lock.
PiperOrigin-RevId: 228542473
Change-Id: I03c3a368c8cbc0b5a79d50cc486fc94adaddc1c2
epoll_wait acquires EventPoll.listsMu (in EventPoll.ReadEvents) and
then calls Inotify.Readiness which tries to acquire Inotify.evMu.
getdents acquires Inotify.evMu (in Inotify.queueEvent) and then calls
readyCallback.Callback which tries to acquire EventPoll.listsMu.
The fix is to release Inotify.evMu before calling Queue.Notify. Queue
is thread-safe and doesn't require Inotify.evMu to be held.
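
The shape of the fix, sketched with stand-in types. The `inotify` struct and its `notify` field here are hypothetical simplifications (the callback plays the role of Queue.Notify), and the re-entrant callback stands in for the cross-goroutine lock inversion described above.

```go
package main

import (
	"fmt"
	"sync"
)

// inotify is a simplified stand-in for the real type.
type inotify struct {
	evMu   sync.Mutex
	events []string
	notify func() // stand-in for Queue.Notify; may call back into readiness
}

// queueEvent records the event under evMu, then releases the lock before
// notifying, so waiters are free to take evMu themselves.
func (i *inotify) queueEvent(ev string) {
	i.evMu.Lock()
	i.events = append(i.events, ev)
	i.evMu.Unlock()
	i.notify()
}

// readiness takes evMu, as a waiter's callback would.
func (i *inotify) readiness() int {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	return len(i.events)
}

func main() {
	in := &inotify{}
	seen := 0
	in.notify = func() { seen = in.readiness() }
	in.queueEvent("IN_CREATE") // would deadlock if evMu were still held
	fmt.Println(seen)
}
```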
Closes #121
PiperOrigin-RevId: 227066695
Change-Id: Id29364bb940d1727f33a5dff9a3c52f390c15761
The code that matches the event being published with event watchers
was wrongly matching all watchers in case any of the control event bits
were set.
Issue #121
PiperOrigin-RevId: 226521230
Change-Id: Ie2c42bc4366faaf59fbf80a74e9297499bd93f9e
We must wait for all lazy resources to be released before closing the rootFile.
PiperOrigin-RevId: 226419499
Change-Id: I1d4d961a92b3816e02690cf3eaf0a88944d730cc
The old overlayBoundEndpoint assumed that the lower is not an overlay. It
should check if the lower is an overlay and handle that case.
PiperOrigin-RevId: 225882303
Change-Id: I60660c587d91db2826e0719da0983ec8ad024cb8
Platform objects are not savable, storing references to them in
filesystem datastructures would cause save to fail if someone actually
passed in a Platform.
Current implementations work because everywhere a Platform is
expected, we currently pass in a Kernel object which embeds Platform
and thus satisfies the interface.
Eliminate this indirection and save pointers to Kernel directly.
PiperOrigin-RevId: 225288336
Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
MSG_WAITALL requests that recv family calls do not perform short reads. It only
has an effect for SOCK_STREAM sockets; other types ignore it.
PiperOrigin-RevId: 224918540
Change-Id: Id97fbf972f1f7cbd4e08eec0138f8cbdf1c94fe7
By Walking before checking that the directory is writable and
executable, MayDelete may return the Walk error (e.g., ENOENT) which
would normally be masked by a permission error (EACCES).
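
A sketch of the intended ordering, with hypothetical types and error values standing in for the real fs package: the permission check runs before the walk, so EACCES takes precedence over ENOENT.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values standing in for the syscall errors.
var (
	errAccess = errors.New("EACCES")
	errNoEnt  = errors.New("ENOENT")
)

// dir is a hypothetical directory with permission bits and child names.
type dir struct {
	writable, executable bool
	children             map[string]bool
}

// mayDelete checks directory permissions first and only then walks to the
// victim, so a walk error cannot leak past a permission failure.
func mayDelete(d *dir, name string) error {
	if !d.writable || !d.executable {
		return errAccess
	}
	if !d.children[name] {
		return errNoEnt
	}
	return nil
}

func main() {
	readOnly := &dir{executable: true, children: map[string]bool{}}
	fmt.Println(mayDelete(readOnly, "missing")) // EACCES, not ENOENT
}
```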
PiperOrigin-RevId: 224222453
Change-Id: I108a7f730e6bdaa7f277eaddb776267c00805475
The limit on the number of symbolic links that may be followed
applies to a full path, not just to a chain of symbolic links.
PiperOrigin-RevId: 224047321
Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
Replaces the WaitGroup with a RWMutex. Calls to Async hold the mutex for
reading, while AsyncBarrier takes the lock for writing. This ensures that all
executing Async work finishes before AsyncBarrier returns.
Also pushes the Async() call from Inode.Release into
gofer/InodeOperations.Release(). This removes a recursive Async call which
should not have been allowed in the first place. The gofer Release call is the
slow one (since it may make RPCs to the gofer), so putting the Async call there
makes sense.
PiperOrigin-RevId: 223093067
Change-Id: I116da7b20fce5ebab8d99c2ab0f27db7c89d890e
Create syscall stubs for missing syscalls up to Linux 4.4 and advertise
a kernel version of 4.4.
PiperOrigin-RevId: 220667680
Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
https://github.com/containerd/containerd/blob/master/oci/spec.go#L206, the mode=755
didn't match the pattern modeRegexp = regexp.MustCompile("0[0-7][0-7][0-7]").
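
The mismatch is easy to reproduce with the pattern quoted above; only the surrounding `main` is added for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// modeRegexp is the pattern quoted in the message: a literal 0 followed by
// three octal digits.
var modeRegexp = regexp.MustCompile("0[0-7][0-7][0-7]")

func main() {
	for _, opt := range []string{"mode=755", "mode=0755"} {
		fmt.Println(opt, modeRegexp.MatchString(opt))
	}
}
```

Without the leading zero there is no `0` for the pattern to anchor on, so `mode=755` is rejected while `mode=0755` matches.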
Closes #112
Signed-off-by: Juan <xionghuan.cn@gmail.com>
Change-Id: I469e0a68160a1278e34c9e1dbe4b7784c6f97e5a
PiperOrigin-RevId: 219672525
Previously this code used the tcpip error space. Since it is no longer part of
netstack, it can use the sentry's error space (except for a few cases where
there is still some shared code). This reduces the number of error space
conversions required for hot Unix socket operations.
PiperOrigin-RevId: 218541611
Change-Id: I3d13047006a8245b5dfda73364d37b8a453784bb
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.
PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.
PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
* Integrate recvMsg and sendMsg functions into Recv and Send respectively as
they are no longer shared.
* Clean up partial read/write error handling code.
* Re-order code to make sense given that there is no longer a host.endpoint
type.
PiperOrigin-RevId: 217255072
Change-Id: Ib43fe9286452f813b8309d969be11f5fa40694cd
host.endpoint contained duplicated logic from the socketpair implementation and
host.ConnectedEndpoint. Remove host.endpoint in favor of a
host.ConnectedEndpoint wrapped in a socketpair end.
PiperOrigin-RevId: 217240096
Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
- Change Dirent.Busy => Dirent.isMountPoint. The function body is unchanged,
and it is no longer exported.
- fs.MayDelete now checks that the victim is not the process root. This aligns
with Linux's namei.c:may_delete().
- Fix "is-ancestor" checks to actually compare all ancestors, not just the
parents.
- Fix handling of paths that end in dots, which are handled differently in
Rename vs. Unlink.
PiperOrigin-RevId: 217239274
Change-Id: I7a0eb768e70a1b2915017ce54f7f95cbf8edf1fb
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.
When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.
To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.
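
A rough sketch of the fallback, assuming a hypothetical helper `coalesce` and a caller-supplied size cap; the real code paths and buffer types differ.

```go
package main

import "fmt"

// maxIovs is the iovec limit mentioned above.
const maxIovs = 1024

// coalesce returns bufs unchanged when the count fits within the limit.
// Otherwise it copies the data into one intermediate buffer, truncated to
// maxLen bytes so the buffer stays bounded.
func coalesce(bufs [][]byte, maxLen int) [][]byte {
	if len(bufs) <= maxIovs {
		return bufs
	}
	total := 0
	for _, b := range bufs {
		total += len(b)
	}
	if total > maxLen {
		total = maxLen
	}
	out := make([]byte, 0, total)
	for _, b := range bufs {
		if room := total - len(out); len(b) > room {
			b = b[:room] // truncate at the size cap
		}
		out = append(out, b...)
	}
	return [][]byte{out}
}

func main() {
	frag := make([][]byte, 2000)
	for i := range frag {
		frag[i] = []byte{byte(i)}
	}
	merged := coalesce(frag, 1500)
	fmt.Println(len(merged), len(merged[0]))
}
```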
PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.
However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and sends signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.
This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.
One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.
To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
Note that "runsc exec" now handles signals properly, but "runsc run" does not.
That will come in a later CL, as this one is complex enough already.
Example:
root@:/usr/local/apache2# sleep 100
^C
root@:/usr/local/apache2# sleep 100
^Z
[1]+ Stopped sleep 100
root@:/usr/local/apache2# fg
sleep 100
^C
root@:/usr/local/apache2#
PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
host.endpoint already has the check, but it is missing from
host.ConnectedEndpoint.
PiperOrigin-RevId: 214962762
Change-Id: I88bb13a5c5871775e4e7bf2608433df8a3d348e6
We already forward TCSETS and TCSETSW. TCSETSF is roughly equivalent but
discards pending input.
The filters were relaxed to allow host ioctls with TCSETSF argument.
This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.
Before:
root@b8a0240fc836:/# passwd
Enter new UNIX password: 123
Retype new UNIX password: 123
passwd: password updated successfully
After:
root@ae6f5dabe402:/# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
If we have an overlay file whose corresponding Dirent is frozen, then we should
not bother calling Readdir on the upper or lower files, since DirentReaddir
will calculate children based on the frozen Dirent tree.
A test was added that fails without this change.
PiperOrigin-RevId: 213531215
Change-Id: I4d6c98f1416541a476a34418f664ba58f936a81d
Linux permits hard-linking if the target is owned by the user OR the target has
Read+Write permission.
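
As a sketch, the rule reduces to a single boolean expression. `mayLink` and its parameters are hypothetical; the real check works on credentials and inode attributes.

```go
package main

import "fmt"

// mayLink reports whether a hard link to the target is permitted: the caller
// owns the target, or the caller has both read and write permission on it.
func mayLink(callerUID, ownerUID uint32, canRead, canWrite bool) bool {
	return callerUID == ownerUID || (canRead && canWrite)
}

func main() {
	fmt.Println(mayLink(1000, 1000, false, false)) // owner
	fmt.Println(mayLink(1000, 0, true, false))     // read-only non-owner
}
```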
PiperOrigin-RevId: 213024613
Change-Id: If642066317b568b99084edd33ee4e8822ec9cbb3
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.
Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute. So we have to
do the path lookup much later than we previously were.
PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
Furthermore, allow for the specification of an ElementMapper. This allows a
single "Element" type to exist on multiple inline lists, and work without
having to embed the entry type.
This is a requisite change for supporting a per-Inode list of Dirents.
PiperOrigin-RevId: 211467497
Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
From //pkg/sentry/context/context.go:
// - It is *not safe* to retain a Context passed to a function beyond the scope
// of that function call.
Passing a stored kernel.Task as a context.Context to
fs.FileOwnerFromContext violates this requirement.
PiperOrigin-RevId: 211143021
Change-Id: I4c5b02bd941407be4c9cfdbcbdfe5a26acaec037
dirent.walk() takes renameMu, but is often called with renameMu already held,
which can lead to a deadlock.
Fix this by requiring renameMu to be held for reading when dirent.walk() is
called. This causes walks and existence checks to block while a rename
operation takes place, but that is what we were already trying to enforce by
taking renameMu in walk() anyways.
PiperOrigin-RevId: 210760780
Change-Id: Id61018e6e4adbeac53b9c1b3aa24ab77f75d8a54
dirent.go:Rename() walks to the file being replaced and defers
replaced.DecRef(). After the rename, the reference is dropped, triggering a
writeout and SetAttr call to the gofer. Because of lazyOpenForWrite, the gofer
opens the replaced file BY ITS OLD NAME and calls ftruncate on it.
This CL changes Remove to drop the reference on replaced (and thus trigger
writeout) before the actual rename call.
PiperOrigin-RevId: 210756097
Change-Id: I01ea09a5ee6c2e2d464560362f09943641638e0f
Saving and restoring weak references involves multiple interface indirections
and causes material latency overhead when there are lots of dirents, each
containing a weak reference map. The nil entries in the map should also
be purged.
PiperOrigin-RevId: 210593727
Change-Id: Ied6f4c3c0726fcc53a24b983d9b3a79121b6b758
When revalidating a Dirent, if the inode id is the same, then we don't need to
throw away the entire Dirent. We can just update the unstable attributes in
place.
If the inode id has changed, then the remote file has been deleted or moved,
and we have no choice but to throw away the dirent we have and look up another.
In this case, we may still end up losing a mounted dirent that is a child of
the revalidated dirent. However, that seems appropriate here because the entire
mount point has been pulled out from underneath us.
Because gVisor's overlay is at the Inode level rather than the Dirent level, we
must pass the parent Inode and name along with the Inode that is being
revalidated.
PiperOrigin-RevId: 210431270
Change-Id: I705caef9c68900234972d5aac4ae3a78c61c7d42
Implements the TIOCGWINSZ and TIOCSWINSZ ioctls, which allow processes to resize
the terminal. This allows, for example, sshd to properly set the window size for
ssh sessions.
PiperOrigin-RevId: 210392504
Change-Id: I0d4789154d6d22f02509b31d71392e13ee4a50ba
This CL adds terminal support for "docker exec". We previously only supported
consoles for the container process, but not exec processes.
The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctls for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.
Note that control-character signals are still not properly supported.
Tested with:
$ docker run --runtime=runsc -it alpine
In another terminal:
$ docker exec -it <containerid> /bin/sh
PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
As required by the contract in Dirent.flush().
Also inline Dirent.freeze() into Dirent.Freeze(), since it is only called from
there.
PiperOrigin-RevId: 209783626
Change-Id: Ie6de4533d93dd299ffa01dabfa257c9cc259b1f4
When an inode file state fails to load asynchronously, we want to report
the error instead of potentially panicking in another async loading goroutine
that was incorrectly unblocked.
PiperOrigin-RevId: 209683977
Change-Id: I591cde97710bbe3cdc53717ee58f1d28bbda9261
Numpy needs these.
Also added the "present" directory, since the contents are the same as possible
and online.
PiperOrigin-RevId: 209451777
Change-Id: I2048de3f57bf1c57e9b5421d607ca89c2a173684
Some linux commands depend on /sys/devices/system/cpu/possible, such
as 'lscpu'.
Add 2 knobs for cpu:
/sys/devices/system/cpu/possible
/sys/devices/system/cpu/online
Both the values are '0 - Kernel.ApplicationCores()-1'.
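
A sketch of how such a range string might be produced. `cpuRange` is a hypothetical helper, and the hyphen without surrounding spaces is an assumption based on the format Linux uses for these files.

```go
package main

import "fmt"

// cpuRange formats the CPU range for a sandbox with the given number of
// application cores, following the "0 to cores-1" rule above.
func cpuRange(cores int) string {
	return fmt.Sprintf("0-%d", cores-1)
}

func main() {
	fmt.Println(cpuRange(4))
}
```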
Change-Id: Iabd8a4e559cbb630ed249686b92c22b4e7120663
PiperOrigin-RevId: 209070163
Previously, gofer filesystems were configured with the default "fscache"
policy, which caches filesystem metadata and contents aggressively. While this
setting is best for performance, it means that changes from inside the sandbox
may not be immediately propagated outside the sandbox, and vice-versa.
This CL changes volumes and the root fs configuration to use a new
"remote-revalidate" cache policy which tries to retain as much caching as
possible while still making fs changes visible across the sandbox boundary.
This cache policy is enabled by default for the root filesystem. The default
value for the "--file-access" flag is still "proxy", but the behavior is
changed to use the new cache policy.
A new value for the "--file-access" flag is added, called "proxy-exclusive",
which turns on the previous aggressive caching behavior. As the name implies,
this flag should be used when the sandbox has "exclusive" access to the
filesystem.
All volume mounts are configured to use the new cache policy, since it is
safest and most likely to be correct. There is not currently a way to change
this behavior, but it's possible to add such a mechanism in the future. The
configurability is a smaller issue for volumes, since most of the expensive
application fs operations (walking + stating files) will likely be served by the
root fs.
PiperOrigin-RevId: 208735037
Change-Id: Ife048fab1948205f6665df8563434dbc6ca8cfc9
Now, there's a waiter for each end (master and slave) of the TTY, and each
waiter.Entry is only enqueued in one of the waiters.
PiperOrigin-RevId: 208734483
Change-Id: I06996148f123075f8dd48cde5a553e2be74c6dce
stat()-ing /proc/PID/fd/FD incremented but didn't decrement the refcount for
FD. This behavior wasn't usually noticeable, but in the above case:
- ls would never decrement the refcount of the write end of the pipe to 0.
- This caused the write end of the pipe never to close.
- wc would then hang read()-ing from the pipe.
PiperOrigin-RevId: 208728817
Change-Id: I4fca1ba5ca24e4108915a1d30b41dc63da40604d
InodeOperations.Bind now returns a Dirent which will be cached in the Dirent
tree.
When an overlay is in-use, Bind cannot return the Dirent created by the upper
filesystem because the Dirent does not know about the overlay. Instead,
overlayBind must create a new overlay-aware Inode and Dirent and return that.
This is analogous to how Lookup and overlayLookup work.
PiperOrigin-RevId: 208670710
Change-Id: I6390affbcf94c38656b4b458e248739b4853da29
Previously, an overlay would panic if either the upper or lower fs required
revalidation for a given Dirent. Now, we allow revalidation from the upper
file, but not the lower.
If a cached overlay inode does need revalidation (because the upper needs
revalidation), then the entire overlay Inode will be discarded and a new
overlay Inode will be built with a fresh copy of the upper file.
As a side effect of this change, Revalidate must take an Inode instead of a
Dirent, since an overlay needs to revalidate individual Inodes.
PiperOrigin-RevId: 208293638
Change-Id: Ic8f8d1ffdc09114721745661a09522b54420c5f1
The cache policy determines whether Lookup should return a negative dirent, or
just ENOENT. This CL fixes one spot where we returned a negative dirent without
first consulting the policy.
PiperOrigin-RevId: 208280230
Change-Id: I8f963bbdb45a95a74ad0ecc1eef47eff2092d3a4
Previously, processes which used file-system Unix Domain Sockets could not be
checkpoint-ed in runsc because the sockets were saved with their inode
numbers which do not necessarily remain the same upon restore. Now,
the sockets are also saved with their paths so that the new inodes
can be determined for the sockets based on these paths after restoring.
Tests for cases with UDS use are included. Test cleanup to come.
PiperOrigin-RevId: 208268781
Change-Id: Ieaa5d5d9a64914ca105cae199fd8492710b1d7ec
This CL adds a new cache-policy for gofer filesystems that uses the host page
cache, but causes dirents to be reloaded on each Walk, and does not cache
readdir results.
This policy is useful when the remote filesystem may change out from underneath
us, as any remote changes will be reflected on the next Walk.
Importantly, this cache policy is only consistent if we do not use gVisor's
internal page cache, since that page cache is tied to the Inode and may be
thrown away upon Revalidation.
This cache policy should only be used when the gofer supports donating host
FDs, since then gVisor will make use of the host kernel page cache, which will
be consistent for all open files in the gofer. In fact, a panic will be raised
if a file is opened without a donated FD.
PiperOrigin-RevId: 207752937
Change-Id: I233cb78b4695bbe00a4605ae64080a47629329b8
In other news, apparently proc.fdInfo is the last user of ramfs.File.
PiperOrigin-RevId: 207564572
Change-Id: I5a92515698cc89652b80bea9a32d309e14059869
Currently, there is an attempt to print FD flags, but
they are not decoded into a number, so we see something like this:
/criu # cat /proc/self/fdinfo/0
flags: {%!o(bool=000false)}
Actually, fdinfo has to contain file flags.
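As an illustration of the intended output, here is a minimal sketch of
formatting the flags line as a number. It assumes the Linux convention of
printing the flags in octal with a leading zero; formatFDInfo is a
hypothetical helper, not the function changed in this CL.

```go
package main

import "fmt"

// formatFDInfo renders the "flags" line of an fdinfo file.
// Hypothetical helper: flags is the numeric open(2) flag word of the FD.
func formatFDInfo(flags uint32) string {
	// Print the flag word in octal with a leading zero, as Linux does.
	return fmt.Sprintf("flags:\t0%o\n", flags)
}

func main() {
	const oRDWR = 0x2 // O_RDWR on Linux
	fmt.Print(formatFDInfo(oRDWR))
}
```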
Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97
PiperOrigin-RevId: 206794498
We have been unnecessarily creating too many savable types implicitly.
PiperOrigin-RevId: 206334201
Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
When copying-up files from a lower fs to an upper, we also copy the extended
attributes on the file. If there is a (nested) overlay inside the lower, some
of these extended attributes configure the lower overlay, and should not be
copied-up to the upper.
In particular, whiteout attributes in the lower fs overlay should not be
copied-up, since the upper fs may actually contain the file.
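A minimal sketch of the filtering described above. The "trusted.overlay."
prefix and the xattrsToCopyUp helper are assumptions made for illustration;
the real copy-up code works on Inodes rather than plain maps.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// overlayXattrPrefix is an assumed prefix for attributes that configure an
// overlay (including whiteouts). It is illustrative only.
const overlayXattrPrefix = "trusted.overlay."

// xattrsToCopyUp returns the extended attributes of a lower file that
// should be copied to the upper file. Attributes that configure a nested
// lower overlay are skipped.
func xattrsToCopyUp(lower map[string]string) map[string]string {
	out := make(map[string]string, len(lower))
	for name, value := range lower {
		if strings.HasPrefix(name, overlayXattrPrefix) {
			continue // belongs to the lower overlay, not to the file
		}
		out[name] = value
	}
	return out
}

func main() {
	lower := map[string]string{
		"user.comment":                 "hello",
		"trusted.overlay.whiteout.foo": "y",
	}
	copied := xattrsToCopyUp(lower)
	names := make([]string, 0, len(copied))
	for name := range copied {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```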
PiperOrigin-RevId: 206236010
Change-Id: Ia0454ac7b99d0e11383f732a529cb195ed364062
The current revalidation logic is very simple and does not do much
introspection of the dirent being revalidated (other than looking at the type
of file).
Fancier revalidation logic is coming soon, and we need to be able to look at
the cached and uncached attributes of a given dirent, and we need a context to
perform some of these operations.
PiperOrigin-RevId: 205307351
Change-Id: If17ea1c631d8f9489c0e05a263e23d7a8a3bf159
In the general case with an overlay, all mmap calls must go through the
overlay, because in the event of a copy-up, the overlay needs to invalidate any
previously-created mappings.
If there is no lower file, however, there will never be a copy-up, so the
overlay can delegate directly to the upper file in that case.
This also allows us to correctly mmap /dev/zero when it is in an overlay. This
file has special semantics which the overlay does not know about. In
particular, it does not implement Mappable(), which (in the general case) the
overlay uses to detect if a file is mappable or not.
PiperOrigin-RevId: 205306743
Change-Id: I92331649aa648340ef6e65411c2b42c12fa69631
Dirent.FullName takes the global renameMu, but can be called during Create,
which itself takes dirent.mu and dirent.dirMu, which is a lock-order violation:
Dirent.Create
  d.dirMu.Lock
  d.mu.Lock
  Inode.Create
    gofer.inodeOperations.Create
      gofer.NewFile
        Dirent.FullName
          d.renameMu.RLock
We only use the FullName here for logging, and in this case we can get by with
logging only the BaseName.
A `BaseName` method was added to Dirent, which simply returns the name, taking
d.parent.mu as required.
In the Create pathway, we can't call d.BaseName() because taking d.parent.mu
after d.mu violates the lock order. But we already know the base name of the
file we just created, so that's OK.
In the Open/GetFile pathway, we are free to call d.BaseName() because the other
dirent locks are not held.
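The locking rule can be sketched as follows. This is a pared-down
illustration, not the real Dirent: the dirent type, its fields, and the
create and baseName methods are simplified stand-ins, and the logging
callback is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// dirent is a simplified stand-in for Dirent with only the fields needed
// to show the locking.
type dirent struct {
	mu     sync.Mutex // protects the names of this dirent's children
	parent *dirent
	name   string
}

// baseName returns d's name, taking the parent's mutex because the parent
// protects the names of its children.
func (d *dirent) baseName() string {
	p := d.parent
	if p == nil {
		return d.name
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	return d.name
}

// create adds a child and logs it while d.mu is still held.
func (d *dirent) create(name string, logf func(string)) *dirent {
	d.mu.Lock()
	defer d.mu.Unlock()
	child := &dirent{parent: d, name: name}
	// child.baseName() would need d.mu, which is already held here.
	// The name was passed in, so log that instead.
	logf("created " + name)
	return child
}

func main() {
	root := &dirent{name: "/"}
	child := root.create("file.txt", func(s string) { fmt.Println(s) })
	// No dirent locks are held at this point, so baseName is safe to call.
	fmt.Println(child.baseName())
}
```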
PiperOrigin-RevId: 205112278
Change-Id: Ib45c734081aecc9b225249a65fa8093eb4995f10
Per the doc, usage must be kept maximally merged. Beyond that, it is simply a
good idea to keep fragmentation in usage to a minimum.
The glibc malloc allocator allocates one page at a time, potentially causing
lots of fragmentation. However, those pages are likely to have the same number
of references, often making it possible to merge ranges.
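A minimal sketch of the merging idea, assuming usage is tracked as
half-open byte ranges with a reference count. The usageRange type and
mergeUsage function are hypothetical; the real usage set is a different
data structure.

```go
package main

import (
	"fmt"
	"sort"
)

// usageRange describes the half-open range [start, end) and the number of
// references held on it. Hypothetical type for illustration.
type usageRange struct {
	start, end uint64
	refs       uint64
}

// mergeUsage returns the ranges sorted by start, with adjacent ranges that
// have the same reference count merged into one.
func mergeUsage(in []usageRange) []usageRange {
	sorted := append([]usageRange(nil), in...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })
	var out []usageRange
	for _, r := range sorted {
		if n := len(out); n > 0 && out[n-1].end == r.start && out[n-1].refs == r.refs {
			out[n-1].end = r.end // extend the previous range
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	const page = 4096
	// Three single-page allocations; the first two share a reference count.
	pages := []usageRange{
		{0, page, 1},
		{page, 2 * page, 1},
		{2 * page, 3 * page, 2},
	}
	for _, r := range mergeUsage(pages) {
		fmt.Printf("[%d, %d) refs=%d\n", r.start, r.end, r.refs)
	}
}
```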
PiperOrigin-RevId: 204960339
Change-Id: I03a050cf771c29a4f05b36eaf75b1a09c9465e14
This method allows an eventfd inside the Sentry to be registered with
the host kernel.
Update comment about memory mapping host fds via CachingInodeOperations.
PiperOrigin-RevId: 204784859
Change-Id: I55823321e2d84c17ae0f7efaabc6b55b852ae257
Otherwise required and optional can be empty or have negative length.
PiperOrigin-RevId: 204007079
Change-Id: I59e472a87a8caac11ffb9a914b8d79bf0cd70995
Updated how restoring occurs through boot.go with a separate Restore function.
This prevents a new process and new mounts from being created.
Added tests to ensure the container is restored.
Registered checkpoint and restore commands so they can be used.
Docker support for these commands is still limited.
Working on #80.
PiperOrigin-RevId: 202710950
Change-Id: I2b893ceaef6b9442b1ce3743bd112383cb92af0c
FIOASYNC and friends are used to send signals when a file is ready for IO.
This may or may not be needed by Nginx. While Nginx does use it, it is unclear
if the code that uses it has any effect.
PiperOrigin-RevId: 201550828
Change-Id: I7ba05a7db4eb2dfffde11e9bd9a35b65b98d7f50
The new policy is identical to FSCACHE (which caches everything in memory), but
it also flushes writes to the backing fs agent immediately.
All gofer cache policy decisions have been moved into the cachePolicy type.
Previously they were sprinkled around the codebase.
There are many different things that we cache (page cache, negative dirents,
dirent LRU, unstable attrs, readdir results, ...), and I don't think we should
have individual flags to control each of these. Instead, we should have a few
high-level cache policies that are consistent and useful to users. This
refactoring makes it easy to add more such policies.
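A rough sketch of what centralizing these decisions in one type can look
like. The constant and method names below are illustrative assumptions, as
is the treatment of the no-cache policy as write-through; the actual
cachePolicy type may differ.

```go
package main

import "fmt"

// cachePolicy is a high-level caching mode. Names are illustrative.
type cachePolicy int

const (
	cacheNone            cachePolicy = iota // cache nothing
	cacheAll                                // cache everything in memory
	cacheAllWritethrough                    // cache everything, flush writes immediately
)

// cacheInMemory reports whether file data and metadata are cached.
func (cp cachePolicy) cacheInMemory() bool {
	return cp == cacheAll || cp == cacheAllWritethrough
}

// writeThrough reports whether writes go to the backing fs immediately.
// With cacheNone nothing is cached, so writes are assumed to go straight
// through as well.
func (cp cachePolicy) writeThrough() bool {
	return cp == cacheNone || cp == cacheAllWritethrough
}

func main() {
	for _, cp := range []cachePolicy{cacheNone, cacheAll, cacheAllWritethrough} {
		fmt.Printf("policy=%d cacheInMemory=%t writeThrough=%t\n",
			cp, cp.cacheInMemory(), cp.writeThrough())
	}
}
```

Callers then ask the policy a question, such as cp.writeThrough(), instead
of comparing against individual policy values at each call site.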
PiperOrigin-RevId: 201206937
Change-Id: I6e225c382b2e5e1b0ad4ccf8ca229873f4cd389d
The boot loader tries to stat the mount to determine whether it's a file or
not. This may fail if the sandbox process doesn't have access to the file.
Instead, add an overlay on top of the file, which is better anyway since we
don't want to propagate changes to the host.
PiperOrigin-RevId: 200411261
Change-Id: I14222410e8bc00ed037b779a1883d503843ffebb
This is the first iteration of checkpoint that actually saves to a file.
Tests for checkpoint are included.
Ran into an issue when private unix sockets are enabled. An error message
was added for this case and the mutex state was set.
PiperOrigin-RevId: 200269470
Change-Id: I28d29a9f92c44bf73dc4a4b12ae0509ee4070e93
Minor refactor. line_discipline.go was home to 2 large structs (lineDiscipline
and queue), and queue is now large enough IMO to get its own file.
Also moves queue locks into the queue struct, making locking simpler.
PiperOrigin-RevId: 200080301
Change-Id: Ia75a0e9b3d9ac8d7e5a0f0099a54e1f5b8bdea34
Adds support for echo to terminals. Echoing is just copying input back out to
the user, e.g. when I type "foo" into a terminal, I expect "foo" to be echoed
back to my terminal.
Also makes the transform function part of the queue, eliminating the need to
pass them around together and the possibility of using the wrong transform for a
queue.
PiperOrigin-RevId: 199655147
Change-Id: I37c490d4fc1ee91da20ae58ba1f884a5c14fd0d8
This change adds support for /proc/sys/net and /proc/net, which will
be managed and owned by rpcinet. This will allow these inodes to be forwarded
as RPCs.
PiperOrigin-RevId: 199370799
Change-Id: I2c876005d98fe55dd126145163bee5a645458ce4
When doing a BidirectionalConnect we don't need to continue holding
the ConnectingEndpoint's mutex when creating the NewConnectedEndpoint
as it was held during the Connect. Additionally, we're not holding
the baseEndpoint mutex while Unregistering an event.
PiperOrigin-RevId: 196875557
Change-Id: Ied4ceed89de883121c6cba81bc62aa3a8549b1e9
Previously, inet.Stack was referenced in 2 structs in sentry/socket that can be
saved/restored. If an app is saved and restored on another machine, it may try
to use the old stack, which will have been replaced by a new stack on the new
machine.
PiperOrigin-RevId: 196733985
Change-Id: I6a8cfe73b5d7a90749734677dada635ab3389cb9
When a file is backed by a host FD, atime and mtime for the host file and the
cached attributes in the Sentry must be close together. In this case,
the call to update atime and mtime can be skipped. This is important when
the host filesystem is using an overlay, because updating atime and mtime
explicitly forces a copy-up for every file that is touched.
PiperOrigin-RevId: 196176413
Change-Id: I3933ea91637a071ba2ea9db9d8ac7cdba5dc0482
Otherwise, mounts that fail to be unmounted (EBUSY) will be removed
from the children list anyway.
At this point, this just affects /proc/pid/mounts and /proc/pid/mountinfo.
PiperOrigin-RevId: 195267588
Change-Id: I79114483d73b90f9a7d764a7d513b5b2f251182e