This package provides an implementation of the Linux virtual filesystem.

[TOC]

## Overview

- An `fs.Dirent` caches an `fs.Inode` in memory at a path in the VFS, giving
  the `fs.Inode` a relative position with respect to other `fs.Inode`s.

- If an `fs.Dirent` is referenced by two file descriptors, then those file
  descriptors are coherent with each other: they depend on the same
  `fs.Inode`.

- A mount point is an `fs.Dirent` for which `fs.Dirent.mounted` is true. It
  exposes the root of a mounted filesystem.

- The `fs.Inode` produced by a registered filesystem on mount(2) owns an
  `fs.MountedFilesystem` from which other `fs.Inode`s will be looked up. For a
  remote filesystem, the `fs.MountedFilesystem` owns the connection to that
  remote filesystem.

- In general:

```
fs.Inode <------------------------------
|                                      |
|                                      |
produced by                            |
exactly one                            |
|                             responsible for the
|                             virtual identity of
v                                      |
fs.MountedFilesystem -------------------
```

Glossary:

- VFS: virtual filesystem.

- inode: a virtual file object holding a cached view of a file on a backing
  filesystem (includes metadata and page caches).

- superblock: the virtual state of a mounted filesystem (e.g. the virtual
  inode number set).

- mount namespace: a view of the mounts under a root (during path traversal,
  the VFS makes visible/follows the mount point that is in the current task's
  mount namespace).

## Save and restore

An application's hard dependencies on filesystem state can be broken down into
two categories:

- The state necessary to execute a traversal on or view the *virtual*
  filesystem hierarchy, regardless of what files an application has open.

- The state necessary to represent open files.

The first is always necessary to save and restore. An application may never have
any open file descriptors, but across save and restore it should see a coherent
view of any mount namespace. NOTE(b/63601033): Currently only one "initial"
mount namespace is supported.

The second is so that system calls across save and restore are coherent with
each other (e.g. so that unintended re-reads or overwrites do not occur).

Specifically this state is:

- An `fs.MountManager` containing mount points.

- A `kernel.FDTable` containing pointers to open files.

Anything else managed by the VFS that can be easily loaded into memory from a
filesystem is synced back to those filesystems and is not saved. Examples are
pages in page caches used for optimizations (i.e. readahead and writeback), and
directory entries used to accelerate path lookups.

### Mount points

Saving and restoring a mount point means saving and restoring:

- The root of the mounted filesystem.

- Mount flags, which control how the VFS interacts with the mounted
  filesystem.

- Any relevant metadata about the mounted filesystem.

- All `fs.Inode`s referenced by the application that reside under the mount
  point.

`fs.MountedFilesystem` is metadata about a filesystem that is mounted. It is
referenced by every `fs.Inode` loaded into memory under the mount point
including the `fs.Inode` of the mount point itself. The `fs.MountedFilesystem`
maps file objects on the filesystem to a virtualized `fs.Inode` number and vice
versa.

To restore all `fs.Inode`s under a given mount point, each `fs.Inode` leverages
its dependency on an `fs.MountedFilesystem`. Since the `fs.MountedFilesystem`
knows how an `fs.Inode` maps to a file object on a backing filesystem, this
mapping can be trivially consulted by each `fs.Inode` when the `fs.Inode` is
restored.

In detail, a mount point is saved in two steps:

- First, after the kernel is paused but before state.Save, we walk all mount
  namespaces and install a mapping from `fs.Inode` numbers to file paths
  relative to the root of the mounted filesystem in each
  `fs.MountedFilesystem`. This is subsequently called the set of `fs.Inode`
  mappings.

- Second, during state.Save, each `fs.MountedFilesystem` decides whether to
  save the set of `fs.Inode` mappings. In-memory filesystems, like tmpfs, have
  no need to save a set of `fs.Inode` mappings, since the `fs.Inode`s can be
  entirely encoded in the state file. Each `fs.MountedFilesystem` also
  optionally saves the device name from when the filesystem was originally
  mounted. Each `fs.Inode` saves its virtual identifier and a reference to an
  `fs.MountedFilesystem`.

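The first save step can be pictured with a minimal sketch. The types below (`mountedFilesystem`, `inode`, `dirent`) and the helper `installMappings` are hypothetical, pared-down stand-ins invented for illustration, not the real `fs` types, and the sketch assumes a single mount with no nested mount points.

```go
package main

import (
	"fmt"
	"path"
)

// mountedFilesystem, inode, and dirent are hypothetical stand-ins for the
// real fs types; only the fields needed for this sketch are present.
type mountedFilesystem struct {
	// inodeMappings maps a virtual inode number to a file path relative to
	// the root of the mounted filesystem.
	inodeMappings map[uint64]string
}

type inode struct {
	number uint64
	mount  *mountedFilesystem
}

type dirent struct {
	name     string
	inode    *inode
	children []*dirent
}

// installMappings walks the dirent tree rooted at d and records the path of
// every inode relative to the mount root. rel is the path of d itself.
// Assumption: every dirent in the tree belongs to the same mount.
func installMappings(d *dirent, rel string) {
	d.inode.mount.inodeMappings[d.inode.number] = rel
	for _, c := range d.children {
		installMappings(c, path.Join(rel, c.name))
	}
}

// exampleMount builds a tiny tree: root -> etc -> hosts.
func exampleMount() (*dirent, *mountedFilesystem) {
	m := &mountedFilesystem{inodeMappings: make(map[uint64]string)}
	hosts := &dirent{name: "hosts", inode: &inode{number: 3, mount: m}}
	etc := &dirent{name: "etc", inode: &inode{number: 2, mount: m}, children: []*dirent{hosts}}
	root := &dirent{name: "/", inode: &inode{number: 1, mount: m}, children: []*dirent{etc}}
	return root, m
}

func main() {
	root, m := exampleMount()
	installMappings(root, ".")
	fmt.Println(m.inodeMappings[3]) // prints "etc/hosts"
}
```

Keeping the paths relative to the mount root is what lets the same mappings be reused when the backing filesystem is reattached at restore time.
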
A mount point is restored in two steps:

- First, before state.Load, all mount configurations are stored in a global
  `fs.RestoreEnvironment`. This tells us what mount points the user wants to
  restore and how to re-establish pointers to backing filesystems.

- Second, during state.Load, each `fs.MountedFilesystem` optionally searches
  for a mount in the `fs.RestoreEnvironment` that matches its saved device
  name. The `fs.MountedFilesystem` then re-establishes a pointer to the root
  of the mounted filesystem. For example, the mount specification provides the
  network connection for a mounted remote filesystem client to communicate
  with its remote file server. The `fs.MountedFilesystem` also trivially loads
  its set of `fs.Inode` mappings. When an `fs.Inode` is encountered, the
  `fs.Inode` loads its virtual identifier and its reference to an
  `fs.MountedFilesystem`. It uses the `fs.MountedFilesystem` to obtain the
  root of the mounted filesystem and the `fs.Inode` mappings to obtain the
  relative file path to its data. With these, the `fs.Inode` re-establishes a
  pointer to its file object.

A mount point can trivially restore its `fs.Inode`s in parallel since
`fs.Inode`s have a restore dependency on their `fs.MountedFilesystem` and not on
each other.

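The restore steps and the parallelism they allow can be sketched as follows. Everything here is a hypothetical simplification: `restoreEnvironment`, `mountedFilesystem`, `inode`, the device name `example-device`, and the use of a plain path string in place of a pointer to a file object are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
	"sync"
)

// restoreEnvironment is a hypothetical stand-in for fs.RestoreEnvironment:
// it maps a saved device name to the root of a backing filesystem.
type restoreEnvironment struct {
	mounts map[string]string
}

type mountedFilesystem struct {
	device        string
	root          string            // re-established during restore
	inodeMappings map[uint64]string // inode number -> path relative to root
}

type inode struct {
	number      uint64
	mount       *mountedFilesystem
	backingPath string // stands in for the pointer to a file object
}

// restore finds the mount that matches the saved device name.
func (m *mountedFilesystem) restore(env *restoreEnvironment) error {
	root, ok := env.mounts[m.device]
	if !ok {
		return fmt.Errorf("no mount matches device %q", m.device)
	}
	m.root = root
	return nil
}

// restore depends only on the inode's mountedFilesystem, never on other inodes.
func (i *inode) restore() error {
	rel, ok := i.mount.inodeMappings[i.number]
	if !ok {
		return fmt.Errorf("no mapping for inode %d", i.number)
	}
	i.backingPath = path.Join(i.mount.root, rel)
	return nil
}

// restoreAll restores the mount first, then all of its inodes concurrently.
func restoreAll(env *restoreEnvironment, m *mountedFilesystem, inodes []*inode) error {
	if err := m.restore(env); err != nil {
		return err
	}
	errs := make([]error, len(inodes))
	var wg sync.WaitGroup
	for idx, in := range inodes {
		wg.Add(1)
		go func(idx int, in *inode) {
			defer wg.Done()
			errs[idx] = in.restore() // each goroutine writes a distinct slot
		}(idx, in)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// exampleRestore wires up one mount with three inodes and restores them.
func exampleRestore() ([]*inode, error) {
	env := &restoreEnvironment{mounts: map[string]string{"example-device": "/srv/export"}}
	m := &mountedFilesystem{
		device:        "example-device",
		inodeMappings: map[uint64]string{1: ".", 2: "etc", 3: "etc/hosts"},
	}
	inodes := []*inode{{number: 1, mount: m}, {number: 2, mount: m}, {number: 3, mount: m}}
	if err := restoreAll(env, m, inodes); err != nil {
		return nil, err
	}
	return inodes, nil
}

func main() {
	inodes, err := exampleRestore()
	if err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	for _, in := range inodes {
		fmt.Println(in.number, in.backingPath)
	}
}
```

The goroutines only read the shared mappings and the restored root, which is why no locking between inodes is needed in this sketch.
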
### Open files

An `fs.File` references the following filesystem objects:

```
fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
```

The `fs.Inode` is restored using its `fs.MountedFilesystem`. The
[Mount points](#mount-points) section above describes how this happens in
detail. The `fs.Dirent` restores its pointer to an `fs.Inode`, pointers to
parent and children `fs.Dirent`s, and the basename of the file.

Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only
used internally).

It may use the `fs.Inode`, which it indirectly holds a reference on through the
`fs.Dirent`, to re-establish an open file handle on the backing filesystem (e.g.
to continue reading and writing).

## Overlay

The overlay implementation in the fs package takes Linux overlayfs as a frame of
reference but corrects for several POSIX consistency errors.

In Linux overlayfs, the `struct inode` used for reading and writing to the same
file may be different. This is because the `struct inode` is dissociated from
the process of copying up the file from the lower to the upper directory. Since
flock(2) and fcntl(2) locks, inotify(7) watches, page caches, and a file's
identity are all stored directly or indirectly off the `struct inode`, these
properties of the `struct inode` may be stale after the first modification. This
can lead to file locking bugs, missed inotify events, and inconsistent data in
shared memory mappings of files, to name a few problems.

The fs package maintains a single `fs.Inode` to represent a directory entry in
an overlay and defines operations on this `fs.Inode` which synchronize with the
copy up process. This achieves several things:

+ File locks, inotify watches, and the identity of the file need not be copied
  at all.

+ Memory mappings of files coordinate with the copy up process so that if a
  file in the lower directory is memory mapped, all references to it are
  invalidated, forcing the application to re-fault on memory mappings of the
  file under the upper directory.

The `fs.Inode` holds metadata about files in the upper and/or lower directories
via an `fs.overlayEntry`. The `fs.overlayEntry` implements the `fs.Mappable`
interface. It multiplexes between upper and lower directory memory mappings and
stores a copy of memory references so they can be transferred to the upper
directory `fs.Mappable` when the file is copied up.

The lower filesystem in an overlay may contain another (nested) overlay, but the
upper filesystem may not contain another overlay. In other words, nested
overlays form a tree structure that only allows branching in the lower
filesystem.

Caching decisions in the overlay are delegated to the upper filesystem, meaning
that the Keep and Revalidate methods on the overlay return the same values as
the upper filesystem. A small wrinkle is that the lower filesystem is not
allowed to return `true` from Revalidate, as the overlay cannot reload inodes
from the lower filesystem. A lower filesystem that does return `true` from
Revalidate will trigger a panic.

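A rough sketch of this delegation is shown below. The `cachePolicy` interface, its zero-argument method signatures, and the `fixedPolicy` and `overlayPolicy` types are hypothetical simplifications; the real Keep and Revalidate methods take arguments that are omitted here, and the exact point at which the real overlay checks the lower filesystem is an assumption.

```go
package main

import "fmt"

// cachePolicy is a hypothetical, argument-free reduction of the caching
// decisions a filesystem makes.
type cachePolicy interface {
	Keep() bool
	Revalidate() bool
}

// fixedPolicy always returns the configured answers.
type fixedPolicy struct{ keep, revalidate bool }

func (p fixedPolicy) Keep() bool       { return p.keep }
func (p fixedPolicy) Revalidate() bool { return p.revalidate }

// overlayPolicy delegates caching decisions to the upper filesystem.
type overlayPolicy struct{ upper, lower cachePolicy }

func (o overlayPolicy) Keep() bool { return o.upper.Keep() }

func (o overlayPolicy) Revalidate() bool {
	// The overlay cannot reload inodes from the lower filesystem, so a
	// lower filesystem that asks for revalidation is treated as fatal.
	if o.lower.Revalidate() {
		panic("overlay: lower filesystem may not request revalidation")
	}
	return o.upper.Revalidate()
}

func main() {
	o := overlayPolicy{
		upper: fixedPolicy{keep: true, revalidate: true},
		lower: fixedPolicy{},
	}
	fmt.Println(o.Keep(), o.Revalidate()) // prints "true true"
}
```
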
The `fs.Inode` also holds a reference to an `fs.MountedFilesystem` that
normalizes across the mounted filesystem state of the upper and lower
directories.

When a file is copied from the lower to the upper directory, attempts to
interact with the file block until the copy completes. All copying synchronizes
with rename(2).

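The blocking behavior can be illustrated with a reader/writer lock. This is a minimal sketch, not the real implementation: `overlayFile` and its fields are hypothetical, file contents are modeled as byte slices, synchronization with rename(2) is left out, and the write itself is performed under the exclusive lock only to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// overlayFile is a hypothetical stand-in for the overlay state behind a
// single inode. The inode identity stays the same before and after copy up.
type overlayFile struct {
	copyMu   sync.RWMutex
	copiedUp bool
	lower    []byte // contents in the lower directory (never modified)
	upper    []byte // contents in the upper directory once copied up
}

// Read takes copyMu for reading, so it blocks while a copy up is in progress.
func (f *overlayFile) Read() []byte {
	f.copyMu.RLock()
	defer f.copyMu.RUnlock()
	if f.copiedUp {
		return append([]byte(nil), f.upper...)
	}
	return append([]byte(nil), f.lower...)
}

// Append copies the file up on first modification, then appends to the upper
// copy. Holding copyMu exclusively excludes every concurrent Read.
func (f *overlayFile) Append(data []byte) {
	f.copyMu.Lock()
	defer f.copyMu.Unlock()
	if !f.copiedUp {
		f.upper = append([]byte(nil), f.lower...)
		f.copiedUp = true
	}
	f.upper = append(f.upper, data...)
}

func main() {
	f := &overlayFile{lower: []byte("abc")}
	fmt.Println(string(f.Read())) // served from the lower directory
	f.Append([]byte("d"))
	fmt.Println(string(f.Read())) // served from the upper directory
}
```
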
## Future Work

### Overlay

When a file is copied from a lower directory to an upper directory, several
locks are taken: the global `renameMu` and the `copyMu` of the `fs.Inode` being
copied. This blocks operations on the file, including fault handling of memory
mappings. Performance could be improved by copying files into a temporary
directory that resides on the same filesystem as the upper directory and doing
an atomic rename, holding locks only during the rename operation.

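The proposed approach might look roughly like the sketch below, written against the host filesystem with the Go standard library. The function `copyUpViaRename` and the helper `exampleCopyUp` are hypothetical, a single `sync.Locker` stands in for the locks named above, and file metadata (mode, ownership, timestamps) is not preserved.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// copyUpViaRename copies lowerPath into a temporary file that lives in the
// same directory as upperPath, then renames it into place. The lock is held
// only around the rename, not around the (potentially slow) data copy.
func copyUpViaRename(lowerPath, upperPath string, mu sync.Locker) (err error) {
	src, err := os.Open(lowerPath)
	if err != nil {
		return err
	}
	defer src.Close()

	// Same directory as the destination, so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(upperPath), ".copyup-*")
	if err != nil {
		return err
	}
	// Remove the temporary file if anything below fails.
	defer func() {
		if err != nil {
			os.Remove(tmp.Name())
		}
	}()

	if _, err = io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}

	mu.Lock()
	defer mu.Unlock()
	return os.Rename(tmp.Name(), upperPath)
}

// exampleCopyUp runs the copy in a scratch directory and returns the
// contents found at the upper path afterwards.
func exampleCopyUp() (string, error) {
	dir, err := os.MkdirTemp("", "overlay-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	lower := filepath.Join(dir, "lower.txt")
	upper := filepath.Join(dir, "upper.txt")
	if err := os.WriteFile(lower, []byte("lower contents"), 0o644); err != nil {
		return "", err
	}
	var mu sync.Mutex
	if err := copyUpViaRename(lower, upper, &mu); err != nil {
		return "", err
	}
	data, err := os.ReadFile(upper)
	return string(data), err
}

func main() {
	got, err := exampleCopyUp()
	if err != nil {
		fmt.Println("copy up failed:", err)
		return
	}
	fmt.Println(got)
}
```

Placing the temporary file next to the destination matters because a rename is only atomic within a single filesystem.
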
Additionally, files are copied up synchronously. For large files, this causes
noticeable latency. Performance could be improved by pipelining copies at
non-overlapping file offsets.

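One possible reading of this idea is to split the file into fixed-size chunks and let several workers copy disjoint chunks concurrently. The sketch below takes that interpretation; `copyChunks`, `memFile`, and `examplePipelinedCopy` are hypothetical names, and an in-memory buffer stands in for the upper file.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// memFile is a fixed-size in-memory destination. Concurrent writes are safe
// here only because callers write to non-overlapping ranges.
type memFile struct{ data []byte }

func (m *memFile) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off+int64(len(p)) > int64(len(m.data)) {
		return 0, errors.New("memFile: write out of range")
	}
	return copy(m.data[off:], p), nil
}

// copyChunks copies size bytes from src to dst in chunkSize pieces, with
// each worker handling chunks at non-overlapping offsets.
func copyChunks(dst io.WriterAt, src io.ReaderAt, size, chunkSize int64, workers int) error {
	if chunkSize <= 0 || workers <= 0 {
		return errors.New("copyChunks: chunkSize and workers must be positive")
	}
	var (
		mu       sync.Mutex
		firstErr error
		wg       sync.WaitGroup
	)
	record := func(err error) {
		mu.Lock()
		defer mu.Unlock()
		if firstErr == nil {
			firstErr = err
		}
	}

	offsets := make(chan int64)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, chunkSize)
			// Keep draining offsets even after an error so the producer
			// below never blocks forever.
			for off := range offsets {
				n := chunkSize
				if size-off < n {
					n = size - off
				}
				if m, err := src.ReadAt(buf[:n], off); int64(m) != n {
					record(fmt.Errorf("short read at offset %d: %v", off, err))
					continue
				}
				if _, err := dst.WriteAt(buf[:n], off); err != nil {
					record(err)
				}
			}
		}()
	}
	for off := int64(0); off < size; off += chunkSize {
		offsets <- off
	}
	close(offsets)
	wg.Wait()
	return firstErr
}

// examplePipelinedCopy copies data through copyChunks into a fresh buffer.
func examplePipelinedCopy(data []byte, chunkSize int64, workers int) ([]byte, error) {
	dst := &memFile{data: make([]byte, len(data))}
	err := copyChunks(dst, bytes.NewReader(data), int64(len(data)), chunkSize, workers)
	return dst.data, err
}

func main() {
	out, err := examplePipelinedCopy([]byte("hello, overlay"), 4, 3)
	if err != nil {
		fmt.Println("copy failed:", err)
		return
	}
	fmt.Println(string(out))
}
```
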