This commit reorganizes the BootInfo struct definition so it can be
shared for all architectures.
The existing free extern "C" boot info variables have been removed and
replaced with a global BootInfo struct, 'g_boot_info'.
On x86-64, the BootInfo is directly copied from the Prekernel-provided
struct.
On AArch64 and RISC-V, BootInfo is populated during pre_init.
This change has many improvements:
- We don't use `LockRefPtr` to hold instances of many base devices as
with the DeviceManagement class. Instead, we have a saner pattern of
holding them in a `NonnullRefPtr<T> const`, in a small-text footprint
class definition in the `Device.cpp` file.
- The awkwardness of using `::the()` each time we need to get references
to mostly-static objects (like the Event queue) in runtime is now gone
in the migration to using the `Device` class.
- Acquiring a device feel more obvious because we use now the Device
class for this method. The method name is improved as well.
Similarly to VFSRootContext and ScopedProcessList, this class intends
to form resource isolation as well.
We add this class as an infrastructure preparation of hostname contexts
which should allow processes to obtain different hostnames on the same
machine.
There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.
Let's reduce everything to be in a C++ namespace called with the
previous name "VirtualFileSystem" and keep a smaller textual-footprint
struct called "VirtualFileSystemDetails".
This change also cleans up old "friend class" statements that were no
longer needed, and move methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method of locking all filesystems during shutdown
is removed, as in that place there's no meaning to actually locking all
filesystems because of running in kernel mode entirely.
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should have attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.
The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) to be responsible for this type of isolation.
Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
simplify the idea of being jailed to a simple ProtectedValues boolean
flag. This means that we no longer check of matching jail pointers
anywhere in the Kernel code.
To set a process as jailed, a new prctl option was added to set a
Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
A process can either iterate on the global process list, or if it's
attached to a scoped process list, then only over that list.
This essentially replaces the need of checking the Jail pointer of a
process when iterating over process lists.
Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.
For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.
The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.
To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.
We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
To be able to do this, we add a new class called CustodyBase, which can
be resolved on-demand internally in the VirtualFileSystem resolving path
code.
When being resolved, CustodyBase will return a known custody if it was
constructed with such, if that's not the case it will provide the root
custody if the original path is absolute.
Lastly, if that's not the case as well, it will resolve the given dirfd
to provide a Custody object.
We have many places in the kernel code that we have boolean flags that
are only set once, and never reset again but are checked multiple times
before and after the time they're being set, which matches the purpose
of the SetOnce class.
While this clutters Process.cpp a tiny bit, I feel that it's worth it:
- 2x speed on the kcov_loop benchmark. Likely more during fuzzing.
- Overall code complexity is going down with this change.
- By reducing the code reachable from __sanitizer_cov_trace_pc code,
we can now instrument more code.
We should consider whether the selected Thread is within the same jail
or not.
Therefore let's make it clear to callers with jail semantics if a called
method checks if the desired Thread object is within the same jail.
As for Thread::for_each_* methods, currently nothing in the kernel
codebase needs iteration with consideration for jails, so the old
Thread::for_each* were simply renamed to include "ignoring_jails" suffix
in their names.
These syscalls are not necessary on their own, and they give the false
impression that a caller could set or get the thread name of any process
in the system, which is not true.
Therefore, move the functionality of these syscalls to be options in the
prctl syscall, which makes it abundantly clear that these operations
could only occur from a running thread in a process that sees other
threads in that process only.
Using the kernel stack is preferable, especially when the examined
strings should be limited to a reasonable length.
This is a small improvement, because if we don't actually move these
strings then we don't need to own heap allocations for them during the
syscall handler function scope.
In addition to that, some kernel strings are known to be limited, like
the hostname string, for these strings we also can use FixedStringBuffer
to store and copy to and from these buffers, without using any heap
allocations at all.
Instead, use the FixedCharBuffer class to ensure we always use a static
buffer storage for these names. This ensures that if a Process or a
Thread were created, there's a guarantee that setting a new name will
never fail, as only copying of strings should be done to that static
storage.
The limits which are set are 32 characters for processes' names and 64
characters for thread names - this is because threads' names could be
more verbose than processes' names.
Once we move to a more proper shutdown procedure, processes other than
the finalizer task must be able to perform cleanup and finalization
duties, not only because the finalizer task itself needs to be cleaned
up by someone. This global variable, mirroring the early boot flags,
allows a future shutdown process to perform cleanup on its own.
Note that while this *could* be considered a weakening in security, the
attack surface is minimal and the results are not dramatic. To exploit
this, an attacker would have to gain a Kernel write primitive to this
global variable (bypassing KASLR among other things) and then gain some
way of calling the relevant functions, all of this only to destroy some
other running process. The same effect can be achieved with LPE which
can often be gained with significantly simpler userspace exploits (e.g.
of setuid binaries).
This has KString, KBuffer, DoubleBuffer, KBufferBuilder, IOWindow,
UserOrKernelBuffer and ScopedCritical classes being moved to the
Kernel/Library subdirectory.
Also, move the panic and assertions handling code to that directory.