Immutable mounts are mounts that can't be changed in any aspect, if the
VFSRootContext that hold them is used by a process. This includes two
operations on a mount:
1. Trying to remove the mount from the mount table.
2. Trying to change the flags of the mount.
The condition of a VFSRootContext being held by a process or not is
crucial, as the intention is to allow removal of mounts that marked as
immutable if the VFSRootContext is not being used anymore (for example,
if the container that was created with such context stopped).
Marking mounts as immutable on the first VFS root context essentially
ensures they will never be modified because there will be a process
using that context (which is the "main" VFS root context in the system
runtime).
It should be noted that setting a mount as immutable can be done in
creation time of the mount by passing the MS_IMMUTABLE flag, or by doing
a remount with MS_IMMUTABLE flag.
Previously, the VFS layer would try to handle renames more-or-less by
itself, which really only worked for ext2, and even that was only due to
the replace_child kludge existing specifically for this purpose. This
never worked properly for FATFS, since the VFS layer effectively
depended on filesystems having some kind of reference-counting for
inodes, which is something that simply doesn't exist on any FAT variant
we support.
To resolve various issues with the existing scheme, this commit makes
filesystem implementations themselves responsible for the actual rename
operation, while keeping all the existing validation inside the VFS
layer. The only intended behavior change here is that rename operations
should actually properly work on FATFS.
This change has many improvements:
- We don't use `LockRefPtr` to hold instances of many base devices as
with the DeviceManagement class. Instead, we have a saner pattern of
holding them in a `NonnullRefPtr<T> const`, in a small-text footprint
class definition in the `Device.cpp` file.
- The awkwardness of using `::the()` each time we need to get references
to mostly-static objects (like the Event queue) in runtime is now gone
in the migration to using the `Device` class.
- Acquiring a device feel more obvious because we use now the Device
class for this method. The method name is improved as well.
Instead of using a raw `KBuffer` and letting each implementation to
populating the specific flags on its own, we change things so we only
let each FileSystem implementation to validate the flag and its value
but then store it in a HashMap which its key is the flag name and
the value is a special new class called `FileSystemSpecificOption`
which wraps around `AK::Variant<...>`.
This approach has multiple advantages over the previous:
- It allows runtime inspection of what the user has set on a `MountFile`
description for a specific filesystem.
- It ensures accidental overriding of filesystem specific option that
was already set is not possible
- It removes ugly casting of a `KBuffer` contents to a strongly-typed
values. Instead, a strongly-typed `AK::Variant` is used which ensures
we always get a value without doing any casting.
Please note that we have removed support for ASCII string-oriented flags
as there were no actual use cases, and supporting such type would make
`FileSystemSpecificOption` more complicated unnecessarily for now.
There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.
Let's reduce everything to be in a C++ namespace called with the
previous name "VirtualFileSystem" and keep a smaller textual-footprint
struct called "VirtualFileSystemDetails".
This change also cleans up old "friend class" statements that were no
longer needed, and move methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method of locking all filesystems during shutdown
is removed, as in that place there's no meaning to actually locking all
filesystems because of running in kernel mode entirely.
This new syscall will be used by the upcoming runc (run-container)
utility.
In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.
The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.
To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.
We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
Instead of putting everything in one hash map, let's distinguish between
the devices based on their type.
This change makes the devices semantically separated, and is considered
a preparation before we could expose a comprehensive list of allocations
per major numbers and their purpose.
To be able to do this, we add a new class called CustodyBase, which can
be resolved on-demand internally in the VirtualFileSystem resolving path
code.
When being resolved, CustodyBase will return a known custody if it was
constructed with such, if that's not the case it will provide the root
custody if the original path is absolute.
Lastly, if that's not the case as well, it will resolve the given dirfd
to provide a Custody object.
Similarly to DevPtsFS, this filesystem is about exposing loop device
nodes easily in /dev/loop, so userspace doesn't need to do anything in
order to use new devices immediately.
This device is a block device that allows a user to effectively treat an
Inode as a block device.
The static construction method is given an OpenFileDescription reference
but validates that:
- The description has a valid custody (so it's not some arbitrary file).
Failing this requirement will yield EINVAL.
- The description custody points to an Inode which is a regular file, as
we only support (seekable) regular files. Failing this requirement
will yield ENOTSUP.
LoopDevice can be used to mount a regular file on the filesystem like
other supported types of (physical) block devices.
Previously we could get a raw pointer to a Mount object which might be
invalid when actually dereferencing it.
To ensure this could not happen, we should just use a callback that will
be used immediately after finding the appropriate Mount entry, while
holding the mount table lock.
We don't really need this method anymore, because we could just try to
find the mount entry based on the given mount point host custody.
This also allows us to remove the is_vfs_root and root_inode_id methods
from the VirtualFileSystem class.
We could easily encounter a case where we do the following:
```
mkdir -p /tmp2
mount /dev/hda /tmp2
```
would produce a bug that doing `ls /tmp2/tmp2` will give the contents
on `/dev/hda` ext2 root directory and also on `/tmp2/tmp2/tmp2` and so
on.
To prevent this, we must compare the current custody against each mount
entry's custody to ensure their paths match.
This is not useful, as we have literally zero knowledge about where this
inode is actually located at with respect to the entire global path tree
so we could easily encounter a case where we do the following:
```
mkdir -p /tmp2
mount /dev/hda /tmp2
```
and when traversing the /tmp2 directory entries, we will see the root
inode of /dev/hda on "/tmp2/tmp2", even if it was not mounted.
Therefore, we should just plainly give the raw directory entries as they
are written "on the disk". Anything else that needs to exactly know if
there's an underlying mounted filesystem, can just use the stat syscall
instead.
This ensures that the host mount point custody path is not the same like
the new to-be-mounted custody.
A scenario that could happen before adding this check is:
```
mkdir -p /tmp2
mount /dev/hda /tmp2/
mount /dev/hda /tmp2/
mount /dev/hda /tmp2/ # this will fail here
```
and after adding this check, the following scenario is now this:
```
mkdir -p /tmp2
mount /dev/hda /tmp2/
mount /dev/hda /tmp2/ # this will fail here
mount /dev/hda /tmp2/ # this will fail here too
```
This is correct since unmount doesn't treat bind mounts specially. If we
don't do this, unmounting bind mounts will call
prepare_for_last_unmount() on the guest FS much too early, which will
most likely fail due to a busy file system.
This is a preparation before we can create a usable mechanism to use
filesystem-specific mount flags.
To keep some compatibility with userland code, LibC and LibCore mount
functions are kept being usable, but now instead of doing an "atomic"
syscall, they do multiple syscalls to perform the complete procedure of
mounting a filesystem.
The FileBackedFileSystem IntrusiveList in the VFS code is now changed to
be protected by a Mutex, because when we mount a new filesystem, we need
to check if a filesystem is already created for a given source_fd so we
do a scan for that OpenFileDescription in that list. If we fail to find
an already-created filesystem we create a new one and register it in the
list if we successfully mounted it. We use a Mutex because we might need
to initiate disk access during the filesystem creation, which will take
other mutexes in other parts of the kernel, therefore making it not
possible to take a spinlock while doing this.
This has KString, KBuffer, DoubleBuffer, KBufferBuilder, IOWindow,
UserOrKernelBuffer and ScopedCritical classes being moved to the
Kernel/Library subdirectory.
Also, move the panic and assertions handling code to that directory.
When deleting a directory, the rmdir syscall should fail if the path was
unveiled without the 'c' permission. This matches the same behavior that
OpenBSD enforces when doing this kind of operation.
When deleting a file, the unlink syscall should fail if the path was
unveiled without the 'w' permission, to ensure that userspace is aware
of the possibility of removing a file only when the path was unveiled as
writable.
When using the userdel utility, we now unveil that directory path with
the unveil 'c' permission so removal of an account home directory is
done properly.
"Wherever applicable" = most places, actually :^), especially for
networking and filesystem timestamps.
This includes changes to unzip, which uses DOSPackedTime, since that is
changed for the FAT file systems.
That's what this class really is; in fact that's what the first line of
the comment says it is.
This commit does not rename the main files, since those will contain
other time-related classes in a little bit.
We have a problem with the original utimensat syscall because when we
do call LibC futimens function, internally we provide an empty path,
and the Kernel get_syscall_path_argument method will detect this as an
invalid path.
This happens to spit an error for example in the touch utility, so if a
user is running "touch non_existing_file", it will create that file, but
the user will still see an error coming from LibC futimens function.
This new syscall gets an open file description and it provides the same
functionality as utimensat, on the specified open file description.
The new syscall will be used later by LibC to properly implement LibC
futimens function so the situation described with relation to the
"touch" utility could be fixed.
There was only one permanent storage location for these: as a member
in the Mount class.
That member is never modified after Mount initialization, so we don't
need to worry about races there.
This patch switches away from {Nonnull,}LockRefPtr to the non-locking
smart pointers throughout the kernel.
I've looked at the handful of places where these were being persisted
and I don't see any race situations.
Note that the process file descriptor table (Process::m_fds) was already
guarded via MutexProtected.
Before of this patch, we looked at the unveil data of the FinalizerTask,
which naturally doesn't have any unveil restrictions, therefore allowing
an unveil bypass for a process that enabled performance coredumps.
To ensure we always check the dumped process unveil data, an option to
pass a Process& has been added to a couple of methods in the class of
VirtualFileSystem.
This is considered somewhat an abstraction layer violation, because we
should always let userspace to decide on the root filesystem mount flags
because it allows the user to configure the mount table to preferences
that they desire.
Now that SystemServer is modified to re-mount the root mount with the
desired flags, we can just mount the root filesystem without assuming
special flags.
Resolves issue where a panic would occur if the file system failed to
initialize or mount, due to how the FileSystem was already added to
VFS's list. The newly-created FileSystem destructor would fail as a
result of the object still remaining in the IntrusiveList.
This step would ideally not have been necessary (increases amount of
refactoring and templates necessary, which in turn increases build
times), but it gives us a couple of nice properties:
- SpinlockProtected inside Singleton (a very common combination) can now
obtain any lock rank just via the template parameter. It was not
previously possible to do this with SingletonInstanceCreator magic.
- SpinlockProtected's lock rank is now mandatory; this is the majority
of cases and allows us to see where we're still missing proper ranks.
- The type already informs us what lock rank a lock has, which aids code
readability and (possibly, if gdb cooperates) lock mismatch debugging.
- The rank of a lock can no longer be dynamic, which is not something we
wanted in the first place (or made use of). Locks randomly changing
their rank sounds like a disaster waiting to happen.
- In some places, we might be able to statically check that locks are
taken in the right order (with the right lock rank checking
implementation) as rank information is fully statically known.
This refactoring even more exposes the fact that Mutex has no lock rank
capabilites, which is not fixed here.
We were already handling the rmdir("..") case by refusing to remove
directories that were not empty.
This patch removes a FIXME from January 2019 and adds a test. :^)
Dr. POSIX says that we should reject attempts to rmdir() the file named
"." so this patch does exactly that. We also add a test.
This solves a FIXME from January 2019. :^)
The fact that we used a Vector meant that even if creating a Mount
object succeeded, we were still at a risk that appending to the actual
mounts Vector could fail due to OOM condition. To guard against this,
the mount table is now an IntrusiveList, which always means that when
allocation of a Mount object succeeded, then inserting that object to
the list will succeed, which allows us to fail early in case of OOM
condition.